Why Property Testing Finds Bugs Unit Testing Does Not (2021)

Unfortunately it ends before it gets to the good stuff. It has me interested that maybe PBT can find some bugs that unit testing wouldn't - however I'm not sure how to write a PBT that would catch those bugs. The obvious tests that drive PBT advocates to drink are not interesting - unit tests will catch all the errors and because there is no randomness they will catch the errors faster in general. However how do I write a property for non-trivial functions it not clear at all, and the article stops.

IanCal · 3 months ago

I used it for running a series of "execute this api call".

If it's valid for a user to do, you can make a list and have it do those in sequence.

I had this for a UI library. It could call the functions to add and create the library and then afterwards would move through it. It was for the BBC so on TVs and could move u/d/l/r - the logic was regardless of the UI if you moved right and the focus changed then moving left should bring you back to where you were (u/d the same, etc).

That's tricky, yet being able to write

FOR ANY ui a person can construct

FOR ANY path a user takes through it

WHEN a user presses R

AND the focus changes

THEN when the user presses L

THEN the user is on the item they were before

Was actually quite easy to write and yet insanely powerful.

The one that really convinced me on PBT was one in this library where it found the bug, and the bug had an explicit test for it and it was explicitly in the spec but the spec was inconsistent! The spec was broken, and nobody had noticed.

Another that drove out a lot of bugs was similar but was that regardless of how many ui changes we made and how many movements the user made, we always had something in focus.

Anyway, the big thing here I want to stress is a series of API calls and asserting something at the end or all the way through.

Side note - oh my this is so long ago, 15 years ago building a new PBT tool in actionscript

ngruhn · 3 months ago

If you have a pair of functions for encoding/decoding "something" you can do a round-trip and test that you get the original input back out, e.g.:

    JSON.parse(JSON.stringify(randomObject)) === randomObject

That often works. What also often works is generating the expected output and constructing the input from it. For example, a `stripPrefix` function that removes a known prefix from string, e.g. `stripPrefix("foo", "foobar") === "bar"`. Property test:

    stripPrefix(randomPrefix, randomPrefix + randomSuffix) === randomSuffix

Note, we "go backwards" and generated the expected output `randomSuffix` directly and then construct the input from it `randomPrefix + randomSuffix`.

Reference implementation based properties also work very often. For example, we've been developing a JavaScript rich text editor. That requires a bunch of utility functions on DOM trees that are analogous to standard string functions. For example, on a standard string you can get a char at an index with `"foo bar".charAt(3)` and on a rich text DOM tree we would need something like `treeCharAt(<U><B>foo</B> bar</U>, 3)`. The string functions can serve as a reference implementation for the more complex tree functions:

    treeCharAt(randomTree, randomIndex) === extractStringContent(randomTree).charAt(randomIndex)

The same can be done with all string functions like `slice`, `indexOf`, `trim`, ...

bpshaver · 3 months ago

Does the article he links to towards the end of the article address your concerns?

> Without complex input spaces, there's no explosion of edge cases, which minimizes the actual benefit of PBT. The real benefits come when you have complex input spaces. Unfortunately, you need to be good at PBT to write complex input strategies. I wrote a bit about it here...

Here's the link: https://www.hillelwayne.com/post/property-testing-complex-in...

bluGill · 3 months ago

It is a start, but I still feel like I'm not sure how I'd apply that to my own domain.

hwayne · 3 months ago

A couple of useful general approaches:

- "Metamorphic testing" is where analyze how code changes with changing inputs. For example, adding more filters to a query should return a strict subset of the results, or if a computer vision system recognizes a person, it should recognize the same person if you tilt the image.

- Creating a simplified model of the code, and then comparing the code implementation to the model, a la https://matklad.github.io/2024/07/05/properly-testing-concur... or https://johanneslink.net/model-based-testing

There's also this paper, which I haven't read yet but seems intriguing: https://andrewhead.info/assets/pdf/pbt-in-practice.pdf

senderista · 3 months ago

You can do "white-box" PBT by just asserting all the nontrivial invariants you can think of in your code, and then counting on the generator to force evaluation of those invariants on a representative sample of inputs.

drowsspa · 3 months ago

Yeah, for simple nearly mathematical functions it's quite clear how to do it. I find it hard to extend this to more business-focused inputs

zelos · 3 months ago

I remember using something similar a long time ago - basically fuzz testing, I suppose you could call it property based testing where the property is "whatever edits the user does, we should be able to save and reopen the document without crashing".

It found so many bugs: file corruption, crashes, memory leaks, pathological performance issues. The kind of issues that standard unit testing doesn't find.

perrygeo · 3 months ago

The base property you can test for on every function is "does this crash or return?"

chipsrafferty · 3 months ago

Indeed! The article says PBT is useful, but can't provide any examples of how :/

chriswarbo · 3 months ago

For code that's more "business logic" rather than "algorithmic", I find the following helpful:

- Despite the terrible tutorial examples, PBT isn't about running one function on an arbitrary input, then trying to think of assertions about the result. Instead, focus on ways that different parts of your production code fits together, what assumptions are being made at each point, etc.

- You don't need to plug random inputs directly into the code you're testing. There are usually very few things to say regarding truly arbitrary inputs, like `forAll(x) { foo(x) }`; but lots more to say about e.g. "inputs which don't contain Y" (so run the input through a filter first), or "inputs which don't overlap" (so remove any overlapping region first), and so on.

- Don't focus on the random inputs; the whole idea is that they're irrelevant to the statement you're asserting (it's meant to hold regardless of their value). Likewise, if your unit test contains some irrelevant details, use PBT to generate those parts instead.

- It's often useful in business-type software to think of a "sequence of actions" (which could be method calls, REST endpoints, DB queries, or whatever). For example, "any actions taken as User A will not affect the data for User B". Come up with a simple datatype to represent the actions you care about, write a function which "interprets" those actions (i.e. a `switch` to actually call the method, or trigger the endpoint, or submit to query, or whatever). Then we can write properties which take a list of actions as input. Remember, we don't need to run truly arbitrary lists: a property might filter certain things out of the list, prepend/append some particular actions, etc.

- Once we have some assertion, look for ways to generalise it; for example by looking for places to stick extra things which should be irrelevant.

As a simple example, say we have a function like `store(key, value)`; it's hard to say much about the result of that on its own, but we can instead say how it relates to other functions, like `lookup(key)`:

    forAll(key, value) {
      store(key, value);
      assertEqual(lookup(key), Some(value))
    }

Yet we don't really care about lookups happening immediately after stores, we want to make a more general statement about values being persisted:

    forAll(key, value, pre, suf) {
      runActions(pre)  # Storing shouldn't be affected by anything before it
      store(key, value)
      runActions(suf.filter(notIsStore(key)))  # Do anything except storing the same key
      assertEqual(lookup(key), Some(value))
    }

Jtsummers · 3 months ago

https://hypothesis.readthedocs.io/en/latest/stateful.html

That last test style you describe can be done with Hypothesis. I've had some good success testing both Python programs and programs written in other languages that could be driven from Python with it. Like a server using gRPC (or CORBA once) as an interface, driven by tests written in Python imitating client behavior.

Why are we even testing to begin with, and not using theorem provers like lean to prove without any doubt that our commutative function is indeed commutative?

bluGill · 3 months ago

Multiple problems.

Proving code is only as good as the requirements which are often garbage - the customer often doesn't know what they even want. Even if you put in effort, requirements as the proof needs are often very abstract from the customer requirements and so your program can be proved but still be wrong because it doesn't do what the customer really wanted. In any complex program is a reasonable to state that several requirements are wrong and thus even if your prove your code correct it will be wrong. Often the problem itself cannot even be formally defined - a spell checker cannot be proved correct because human languages are not formally defined, not that you can't prove one, just that whatever you prove will be wrong.

Many systems are very complex. You can (should!) prove simple algorithms, but put everything together and a proof is not something we can do at all. There are too many halting problems like things in large programs.

Tests solve some of the above problems: They can (do not confuse with what they do!) be a simple example of "yes when inputs are exactly x,y,z then I expect that result". A bunch of simple examples that make sense can often be close enough.

We do a lot more theorem proving than most people realize. Types which many languages have are a form of formal proof. They don't cover everything, but even in C++ they cover a lot of issues.

I think the best answer is a combination: prove the things we know how to prove, and test the rest and hope that between the two we have covered enough to prevent bugs.

pfdietz · 3 months ago

It's an example of Alfred Whitehead's quote:

“Civilization advances by extending the number of important operations which we can perform without thinking about them.”

We can do property based testing with less thinking than if we try to prove correctness. We are exploiting the ability of the computer to run millions or even billions of tests. It's the enormous power of today's computers that enables this to work.

The theorem prover approach would work only if it could be automatic: press a button, get the proof (after some acceptable delay). Otherwise, look at all the expensive manual effort you just signed up for.

You might think that as computers get faster, theorem proving becomes easier. But testing becomes easier also. It's not clear testing will ever lose this race.

nickpsecurity · 3 months ago

High-assurance systems always required a combo of methods since they can catch what others missed. Also, they have different cost-benefit ratios. A quick glance at the code or some tests catch many problems quickly while formal verification takes roughly forever in real-world, project time.

Here's a few reasons to use testing strategies:

1. Your developers might not be mathematicians.

2. Your system or its properties might be hard to specify mathematically. One can often design a test for such properties.

3. Your functions that are easy to model mathematically might also have side effects or environmental dependencies due to other requirements (eg performance, legacy).

4. Your specifications and code might get out of sync at some point. If it does, people will think the code has properties that it doesn't. That can poison the verification all the way up the proof chain.

5. Mathematical modeling or proof might take much, much longer to find the bug than a code review or testing. That is, it's a waste of money.

6. Your mathematical tools might have errors that cause a false claim of correctness. Diverse, assurance methods catch errors like this. Also, testing often uses the most, widely-used parts of a programming language. The constructs are highly likely to be compiled correctly vs esoteric methods or tools in formally-proven systems.

7. Automated testing that, in some way, searches through your execution paths can find problems your team never thought of. Fuzzing is the most common technique. However, there's many methods of automated, test generation.

Your best bet is to use code reviews, Design-by-Contract, static analyzers for common problems, contract/property-based generation of tests, fuzzing with contracts as runtime checks, and manual tests for anything hard to specify.

Don't waste time on formal verification at all unless it's a high-value asset that's worth it. If you do, first attempt it with tools like SPARK Ada and Frama-C. That have high automation. Also, if you fail on full correctness, you might still prove no runtime errors in certain categories.

kylereeve · 3 months ago

Assuming you're not being facetious, one of the best parts about PBT is it gets you a good percent of the value of formal proof with a lot less work. PBT at least lets you demonstrate that property is ~probably~ true, whereas traditional unit testing doesn't usually explicitly state properties.

jmull · 3 months ago

For purely mathematical properties, a purely mathematical technique is probably best.

But "is commutative" is just an example here (one of the topics of this post is how simplistic examples can mislead people as to the usefulness of a given verification technique).

The general point of software verification is to ensure the software "does what I want". But in a very large proportion of cases, people aren't clear on precisely what they want. They could not use a formal method because they could not write a formal specification. A nice thing about unit tests is that you can work through your expectations iteratively and incrementally, broadening and deepening your understanding of exactly what the software should do, capturing each insight along the way in a reusable way.

recroad · 3 months ago

Can you please explain more and maybe give some examples that resonate with people who don't have the understanding that you do?

moi2388 · 3 months ago

An example would be critical systems such as defence and Aerospace, where they use for example Ada Spark to formally proof certain bugs cannot occur

klysm · 3 months ago

I agree with the sentiment, but formal methods are tricky