Testing is a fundamental part of the process of software development, but creating good tests is quite difficult. A good battery of tests for a given function ideally is able to explore all possible branches of that function. For example, let's say we want to test the following Python function:

def birthday(name, age):
    if age % 10 == 1:
        return "Happy {age}st birthday, {name}!"
    elif age % 10 == 2:
        return "Happy {age}nd birthday, {name}!"
    elif age % 10 == 3:
        return "Happy {age}rd birthday, {name}!"
    elif age == 21:
        return "Happy 21st, {name}! Have fun!"
    else:
        return "Happy {age}th birthday, {name}!"

To properly test this function, you want to test all its possible cases and compares the expected result to the actual result. For our example:

class BirthdayTests(unittest.TestCase):
    def by_case(self):
        one = birthday("Arnold", 31)
        self.assertEqual(one, "Happy 31st birthday, Arnold!")

        two = birthday("Mike", 22)
        self.assertEqual(two, "Happy 22nd birthday, Mike!")

        three = birthday("Savannah", 3)
        self.assertEqual(three, "Happy 3rd birthday, Savannah!")

        twentyone = birthday("August", 21)
        self.assertEqual(twentyone, "Happy 21st, August! Have fun!") # Error: not equal: left = "Happy 21st birthday, August!"

        other = birthday("Becky", 24)
        self.assertEqual(other, "Happy 24th birthday, Becky!")

Unit testing like this can reveal some bad assumptions about the data or control flow of a function. In our example, testing revealed the birthday function has conditions in the wrong order: having age set to 21 goes to the age % 10 == 1 case, instead of the special 21 case.

But, in some cases, unit testing is not enough. Let's say we're developing a new way to serialize and deserialize data. We want to prove that every serialized object will result in an identitical object when deserialized. In other words:

# take an object and convert it into our string representation
def serialize(s_object: Any) -> str:
    pass

# take a string in our representation and convert it into an object
def deserialize(d_object: str) -> Any:
    pass

our_object = [5, 6, 7]
self.assertEqual(our_object, deserialize(serialize(our_object)))

But testing for this property using unit tests would be exhausting, ineffiecient, and not truly correct.

For example, if the definition of serialization and deserialization is as follows:

def serialize(s_object: Any) -> str:
    return ",".join([str(s) for s in s_object])

def deserialize(d_object: str) -> Any:
    return [int(d) for d in d_object.split(",")]

that would work for the specific example we gave (our_object = [5, 6, 7]), but it wouldn't work for most other objects (our_object = ["space", {"a": 5}]). We can keep adding more and more complex tests to try to cover the whole space of what an object is, but testing using specific examples will never test for our property exhaustively.

So how should we test complex functions?

Property-Based Tests

What if, instead of giving specific examples to verify our functions against, we created a way to generate the kind of data we want to test for and then used that data in a function that verified our code's output was correct?

For our deserialization example, that might look something like:

# create an entirely random object using the given random seed
object_generator(rand) -> Any:
    pass

rand = random.seed(42)

for i in range(0, 1000):
    our_object = object_generator(rand)
    assertEqual(our_object, deserialize(serialize(our_object)))

In this case, we randomly generate 1000 different objects and then test our (de)serialization on each of them.

This is known as a property-based test. Property-based tests can be beneficial because they can automatically check a much larger search space than humans can write manual tests for, but it's often more difficult to understand how a PBT should be implemented and what a given specification means.

To get a better grasp of property-based testing, let's explore a library purpose-built for it: Hypothesis.

Hypothesis

Here's our (de)serialization example, done using Hypothesis:

from hypothesis import given

object_strategy = ...

@given(object_strategy)
def test_round_trip(d):
    assert d == deserialize(serialize(d))

Breaking this down, we have an annotation on the function (@given) telling Hypothesis what kind of data the function uses. Then, we have a "strategy" (called object_strategy) to generates the kind of data we want. Finally, the function itself is simply an assertion that our property holds. This example would essentially be desugared into our earlier explanation of PBTs.

Trying this will yield the following error:

+ Exception Group Traceback (most recent call last):
  |   File "/home/runner/workspace/main.py", line 25, in <module>
  |     test_round_trip()
  |   File "/home/runner/workspace/main.py", line 20, in test_round_trip
  |     def test_round_trip(d):
  |                    ^^^
  |   File "/home/runner/workspace/.pythonlibs/lib/python3.11/site-packages/hypothesis/core.py", line 1834, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 2 distinct failures. (2 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/runner/workspace/main.py", line 21, in test_round_trip
    |     assert d == deserialize(serialize(d))
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/runner/workspace/main.py", line 16, in deserialize
    |     return [int(d) for d in d_object.split(",")]
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/runner/workspace/main.py", line 16, in <listcomp>
    |     return [int(d) for d in d_object.split(",")]
    |             ^^^^^^
    | ValueError: invalid literal for int() with base 10: ''
    | Falsifying example: test_round_trip(
    |     d=[],  # or any other generated value
    | )
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/runner/workspace/main.py", line 21, in test_round_trip
    |     assert d == deserialize(serialize(d))
    |                             ^^^^^^^^^^^^
    |   File "/home/runner/workspace/main.py", line 12, in serialize
    |     return ",".join([str(s) for s in s_object])
    |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
    | TypeError: 'NoneType' object is not iterable
    | Falsifying example: test_round_trip(
    |     d=None,
    | )
    +------------------------------------

which is telling us that our example doesn't work for all our possible objects (since some objects aren't lists, and some of our objects cannot be turned into ints).

This is the essence of property-based tests, but how does Hypothesis actually work?

Custom Data Generation

In that previous example, the definition for our object strategy was omitted. How is such a strategy defined?

Hypothesis strategies cover all the basic python types (e.g. floats() for float types, text() for string types, etc.), but it also has more advanced strategies.

For example, you can union types:

@given(floats() | booleans())

which will test the function given a float or given a boolean.

You can also have lists:

@given(lists(floats()))

And you can even define strategies recursively:

recur_strat = recursive(
    floats() | booleans(),
    lambda children: lists(children)
)

@given(recur_strat)

which will give floats or booleans, or lists of floats booleans or lists.

Combining all of these together, we can create our object strategy like the following:

object_strategy = recursive(
    none() | booleans() | floats() | text(),
    lambda children: lists(children) | dictionaries(text(), children)
)

which can create a none type, a boolean, a float, a string or a list of any of our types or a dictionary from strings to any of our types.