I wrote some property-based tests for a parsing library. Eventually one of the tests for case-insensitive parsing failed... because it hit the letter "DŽ" which has upper, lower and title-case.
The funny thing is that the parsing library was correct and it was the test property that was wrong—but I still learned about an edge case I had never considered!
This has been a common pattern for "simpler" property-based tests I've written: I write a test, it fails right away, and it turns out that my code is fine, but my property was wrong. And this is almost always useful; if I write an incorrect property, it means that some assumption I had about my code is incorrect, so the exercise directly improves my conceptual model of whatever I'm doing.
Yup, and many have discovered this is an unexpected benefit of formal methods in general: writing a formal spec forces you to think precisely about requirements, assumptions, and edge cases, independently of whatever benefit formal verification may provide.
Indeed, I've had a property fail due to `Char.isWhitespace` disagreeing with a `\s` regex over whether Mongolian Vowel Separator counts as whitespace (it's not any more, but was prior to Unicode 6.3, according to https://unicode-explorer.com/c/180E )
It's been said (I think by Hughes?) that the causes of property failures tend to be spread equally between buggy code, buggy property and buggy generator.
Update: It’s actually been divided into multiple sections; if your code assumes a single section, it will break as of the addition of multibyte characters in Unicode v7.1
Is the root of the problem Unicode? Or is the root of the problem the complexity of the union of written human languages? To the extent that it's the latter, Unicode is just the messenger.
Unfortunately it ends before it gets to the good stuff. It has me interested that maybe PBT can find some bugs that unit testing wouldn't; however, I'm not sure how to write a PBT that would catch those bugs. The obvious tests that drive PBT advocates to drink are not interesting: unit tests will catch all the errors, and because there is no randomness they will catch the errors faster in general. But how to write a property for non-trivial functions is not clear at all, and the article stops.
I used it for running a series of "execute this api call".
If it's valid for a user to do, you can make a list and have it do those in sequence.
I had this for a UI library. It could call the functions to build up the UI and then afterwards would move through it. It was for the BBC, so it ran on TVs and could move u/d/l/r. The logic was: regardless of the UI, if you moved right and the focus changed, then moving left should bring you back to where you were (likewise for u/d, etc.).
That's tricky, yet being able to write
FOR ANY ui a person can construct
FOR ANY path a user takes through it
WHEN a user presses R
AND the focus changes
THEN when the user presses L
THEN the user is on the item they were before
Was actually quite easy to write and yet insanely powerful.
The one that really convinced me of PBT was in this library, where it found a bug that had an explicit test and was explicitly covered in the spec, but the spec was inconsistent! The spec was broken, and nobody had noticed.
Another that drove out a lot of bugs was similar but was that regardless of how many ui changes we made and how many movements the user made, we always had something in focus.
Anyway, the big thing here I want to stress is a series of API calls and asserting something at the end or all the way through.
Side note - oh my, this is so long ago: 15 years ago, building a new PBT tool in ActionScript.
That often works. What also often works is generating the expected output and constructing the input from it. For example, a `stripPrefix` function that removes a known prefix from a string, e.g. `stripPrefix("foo", "foobar") === "bar"`. Property test:
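A minimal sketch of that property, with a hypothetical `stripPrefix` implementation and a hand-rolled generator standing in for a real PBT library:

```javascript
// Hypothetical function under test: removes `prefix` from the start of `s`.
function stripPrefix(prefix, s) {
  return s.startsWith(prefix) ? s.slice(prefix.length) : s;
}

// Tiny stand-in for a PBT library's string generator.
function randomString() {
  let out = "";
  for (let n = Math.floor(Math.random() * 10); n > 0; n--) {
    out += String.fromCharCode(97 + Math.floor(Math.random() * 26));
  }
  return out;
}

// Property: stripping the prefix from prefix + suffix gives back the suffix.
for (let i = 0; i < 1000; i++) {
  const randomPrefix = randomString();
  const randomSuffix = randomString();
  const result = stripPrefix(randomPrefix, randomPrefix + randomSuffix);
  if (result !== randomSuffix) {
    throw new Error(`failed for "${randomPrefix}" + "${randomSuffix}"`);
  }
}
console.log("property held for 1000 cases");
```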
Note, we "go backwards": we generate the expected output `randomSuffix` directly and then construct the input from it, `randomPrefix + randomSuffix`.
Reference implementation based properties also work very often. For example, we've been developing a JavaScript rich text editor. That requires a bunch of utility functions on DOM trees that are analogous to standard string functions. For example, on a standard string you can get a char at an index with `"foo bar".charAt(3)` and on a rich text DOM tree we would need something like `treeCharAt(<U><B>foo</B> bar</U>, 3)`. The string functions can serve as a reference implementation for the more complex tree functions:
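A sketch of such a reference-implementation property, using a simplified tree representation (string leaves and `{ tag, children }` nodes); `treeCharAt` here is a toy stand-in, not the real editor code:

```javascript
// Flatten a rich-text tree to its plain text (the "reference" view).
function textOf(node) {
  return typeof node === "string" ? node : node.children.map(textOf).join("");
}

// Function under test: char at index i, computed by walking the tree.
function treeCharAt(node, i) {
  if (typeof node === "string") return i < node.length ? node[i] : "";
  for (const child of node.children) {
    const len = textOf(child).length;
    if (i < len) return treeCharAt(child, i);
    i -= len;
  }
  return "";
}

// Generator for random trees of bounded depth.
function randomTree(depth) {
  if (depth === 0 || Math.random() < 0.5) {
    let s = "";
    for (let n = Math.floor(Math.random() * 5); n > 0; n--) {
      s += String.fromCharCode(97 + Math.floor(Math.random() * 26));
    }
    return s; // leaf
  }
  const children = [];
  for (let n = 1 + Math.floor(Math.random() * 3); n > 0; n--) {
    children.push(randomTree(depth - 1));
  }
  return { tag: "B", children };
}

// Property: treeCharAt agrees with String.prototype.charAt on the flat text.
for (let t = 0; t < 200; t++) {
  const tree = randomTree(3);
  const text = textOf(tree);
  for (let i = 0; i <= text.length; i++) {
    if (treeCharAt(tree, i) !== text.charAt(i)) throw new Error("mismatch");
  }
}
console.log("treeCharAt agrees with charAt");
```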
Does the article he links to towards the end of the article address your concerns?
> Without complex input spaces, there's no explosion of edge cases, which minimizes the actual benefit of PBT. The real benefits come when you have complex input spaces. Unfortunately, you need to be good at PBT to write complex input strategies. I wrote a bit about it here...
- "Metamorphic testing" is where you analyze how the code's output changes as you change the inputs. For example, adding more filters to a query should return a strict subset of the results, or if a computer vision system recognizes a person, it should recognize the same person if you tilt the image.
You can do "white-box" PBT by just asserting all the nontrivial invariants you can think of in your code, and then counting on the generator to force evaluation of those invariants on a representative sample of inputs.
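The query-filter case of metamorphic testing above can be sketched with a toy `runQuery` (all names here are hypothetical):

```javascript
// Toy "query": keep rows satisfying every filter predicate.
function runQuery(rows, filters) {
  return rows.filter((row) => filters.every((f) => f(row)));
}

function randomRows(n) {
  return Array.from({ length: n }, () => ({
    age: Math.floor(Math.random() * 100),
    score: Math.floor(Math.random() * 100),
  }));
}

// Metamorphic property: adding a filter must yield a subset of the results.
for (let i = 0; i < 200; i++) {
  const rows = randomRows(20);
  const threshold = Math.floor(Math.random() * 100);
  const base = runQuery(rows, [(r) => r.age < 80]);
  const narrowed = runQuery(rows, [(r) => r.age < 80, (r) => r.score >= threshold]);
  for (const row of narrowed) {
    if (!base.includes(row)) throw new Error("extra filter added new results");
  }
}
console.log("narrowed query is always a subset");
```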
I remember using something similar a long time ago - basically fuzz testing, I suppose you could call it property based testing where the property is "whatever edits the user does, we should be able to save and reopen the document without crashing".
It found so many bugs: file corruption, crashes, memory leaks, pathological performance issues. The kind of issues that standard unit testing doesn't find.
For code that's more "business logic" rather than "algorithmic", I find the following helpful:
- Despite the terrible tutorial examples, PBT isn't about running one function on an arbitrary input, then trying to think of assertions about the result. Instead, focus on ways that different parts of your production code fit together, what assumptions are being made at each point, etc.
- You don't need to plug random inputs directly into the code you're testing. There are usually very few things to say regarding truly arbitrary inputs, like `forAll(x) { foo(x) }`; but lots more to say about e.g. "inputs which don't contain Y" (so run the input through a filter first), or "inputs which don't overlap" (so remove any overlapping region first), and so on.
- Don't focus on the random inputs; the whole idea is that they're irrelevant to the statement you're asserting (it's meant to hold regardless of their value). Likewise, if your unit test contains some irrelevant details, use PBT to generate those parts instead.
- It's often useful in business-type software to think of a "sequence of actions" (which could be method calls, REST endpoints, DB queries, or whatever). For example, "any actions taken as User A will not affect the data for User B". Come up with a simple datatype to represent the actions you care about, and write a function which "interprets" those actions (i.e. a `switch` to actually call the method, or trigger the endpoint, or submit the query, or whatever). Then we can write properties which take a list of actions as input. Remember, we don't need to run truly arbitrary lists: a property might filter certain things out of the list, prepend/append some particular actions, etc.
- Once we have some assertion, look for ways to generalise it; for example by looking for places to stick extra things which should be irrelevant.
As a simple example, say we have a function like `store(key, value)`; it's hard to say much about the result of that on its own, but we can instead say how it relates to other functions, like `lookup(key)`:
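In the same pseudocode style as the generalised version that follows, the immediate round-trip property might look like:

```
forAll(key, value) {
  store(key, value)
  assertEqual(lookup(key), Some(value))
}
```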
Yet we don't really care about lookups happening immediately after stores, we want to make a more general statement about values being persisted:
forAll(key, value, pre, suf) {
runActions(pre) # Storing shouldn't be affected by anything before it
store(key, value)
runActions(suf.filter(notIsStore(key))) # Do anything except storing the same key
assertEqual(lookup(key), Some(value))
}
That last test style you describe can be done with Hypothesis. I've had some good success testing both Python programs and programs written in other languages that could be driven from Python with it. Like a server using gRPC (or CORBA once) as an interface, driven by tests written in Python imitating client behavior.
I haven't used PBT much, but I did once get a lot of mileage out of it, which was when I was implementing a fairly gnarly edit-distance algorithm. In that case, I used PBT to check that the required properties of metric functions (d(x,x)=0, d(x,y)>0 for x≠y, d(x,y)=d(y,x), d(x,z) <= d(x,y)+d(y,z)) held for my implementation, which helped shake out a fair few stupid implementation mistakes.
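That kind of check is easy to sketch; here with plain Levenshtein distance standing in for the gnarly algorithm, and a tiny alphabet so random words collide often:

```javascript
// Levenshtein distance, as a stand-in metric for the real algorithm.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function check(cond, msg) { if (!cond) throw new Error(msg); }

// Two-letter alphabet keeps x === y frequent, exercising d(x,x) = 0.
function randomWord() {
  let s = "";
  for (let n = Math.floor(Math.random() * 6); n > 0; n--) {
    s += "ab"[Math.floor(Math.random() * 2)];
  }
  return s;
}

for (let i = 0; i < 500; i++) {
  const [x, y, z] = [randomWord(), randomWord(), randomWord()];
  check(levenshtein(x, x) === 0, "identity");
  check(x === y || levenshtein(x, y) > 0, "positivity");
  check(levenshtein(x, y) === levenshtein(y, x), "symmetry");
  check(levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z), "triangle");
}
console.log("metric axioms held on 500 random triples");
```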
> The majority of errors you find with testing are either issues with an entire "partition" of inputs or "boundary" inputs, like INT_MIN.
I don't find this to be true at all. Most bugs I find are business scenarios that I didn't consider or mismatches in API expectations etc. Rarely is a bug for me coming from not considering edge cases of min and maximum values for integers, floats etc.
I'll give you the good example I've been doing for the last two decades: testing a compiler.
The complexity here is the complete opposite of the simple toy examples. What are the edge cases of an optimizing compiler? How do you even approach them, if they're buried deep in a chain of transformations?
The properties are simple things like "the compiler shouldn't crash, the compiled code shouldn't crash, and code compiled with different optimization levels should do the same thing." This assumes the randomly generated code doesn't touch undefined behavior in the language spec.
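The same shape can be shown in miniature: a random expression "program", an interpreter as the reference semantics, and a constant-folding "optimizer" that must not change behavior (all toy code, nothing like a real compiler):

```javascript
// Random arithmetic-expression "programs".
function randomExpr(depth) {
  if (depth === 0 || Math.random() < 0.4) {
    return { op: "num", value: Math.floor(Math.random() * 10) };
  }
  const ops = ["+", "-", "*"];
  return {
    op: ops[Math.floor(Math.random() * ops.length)],
    left: randomExpr(depth - 1),
    right: randomExpr(depth - 1),
  };
}

// Reference semantics: a plain interpreter.
function evalExpr(e) {
  if (e.op === "num") return e.value;
  const l = evalExpr(e.left), r = evalExpr(e.right);
  return e.op === "+" ? l + r : e.op === "-" ? l - r : l * r;
}

// "Optimizer": fold subtrees whose children are both constants.
function fold(e) {
  if (e.op === "num") return e;
  const left = fold(e.left), right = fold(e.right);
  if (left.op === "num" && right.op === "num") {
    return { op: "num", value: evalExpr({ ...e, left, right }) };
  }
  return { ...e, left, right };
}

// Property: optimized and unoptimized programs do the same thing.
for (let i = 0; i < 1000; i++) {
  const e = randomExpr(4);
  if (evalExpr(e) !== evalExpr(fold(e))) {
    throw new Error("optimizer changed behavior");
  }
}
console.log("optimized and unoptimized agree on 1000 programs");
```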
Thanks, that's gotta be one of the best talks I've ever watched. A passionate speaker, talking about fascinating, useful tech, giving specific real world examples of utility, AND showing how to actually apply it for interesting "dirty" situations.
One can use property testing to automatically generate unit test inputs. The property becomes "does this do something new that I wanted to test for but wasn't?" (or, rather, the negation; inputs "pass" and can be ignored if they do nothing new.) This could be code coverage, or it could be killing code mutants.
I independently discovered PBT when I was a junior, and suggested we use it. My coworkers rejected it because tests should be predictable, and it's the programmer's job to pick the edge cases.
You do need to be able to reproduce a test failure, which can be done by printing the inputs or the random seed used in a way that makes it trivial to rerun it.
Nah. That's not what this is about. In any system worth testing, with property-based testing, the system will take so many steps before it encounters an error that even knowing what the steps were isn't going to be very helpful in reproducing the error.
I've been there. It takes many hours to try to guess where the system went wrong to produce the undesirable result, and then you still might not be sure if you are looking at the right place, and then there are always environment issues, you aren't sure of. So, you don't know who to blame.
Only very simple systems will reproduce pathological results 100% of the time given some initial conditions. The bane of complex systems is the timeouts. They are usually very hard to justify and are easy to blame for undesirable behavior.
I think it goes without saying. PBT shouldn't be truly random (e.g. seeded from a timestamp or a cryptographic source); it should be deterministically pseudorandom if it uses random values. You shouldn't be able to run it 10 times and get 9 passes and one failure. It's either 10 passes or 10 failures.
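A seedable generator makes that trivial; for example, a tiny deterministic PRNG (mulberry32, a well-known 32-bit generator) produces identical streams for identical seeds, so a failing run can be replayed exactly:

```javascript
// mulberry32: a small deterministic 32-bit PRNG. Same seed, same stream.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), seed | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two runs with the same seed must produce identical "random" inputs.
const runA = mulberry32(42);
const runB = mulberry32(42);
for (let i = 0; i < 1000; i++) {
  if (runA() !== runB()) throw new Error("non-deterministic!");
}
console.log("seed 42 replays identically");
```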
Tests should be optimized to find bugs. Tests that you have already run have a lower chance of doing that (they only find regressions); tests with novel inputs are preferred. And since writing tests manually is so expensive, this means automatic test input generation. How do you determine if such tests pass? Properties.
In practice, property based testing fails because the organization is not actually interested in delivering correct code. "This bug will never happen in practice so we won't fix it." "If we fix this, we may change some incorrect behavior some customer is depending on." And once that happens, PBT is useless, because it will keep finding that "don't fix" bug over and over.
Your last paragraph is using terms like "correct", "fix" and "bug" as if they're absolute, when they're actually relative to some sort of spec (whether formal or informal, written or vibes-based, etc.). If the organisation controls the spec, then it can be perfectly reasonable for them to update that spec to e.g. allow certain behaviours that previously would have been considered bugs.
In that case, we update the properties to reflect the new spec.
That sounds like fuzz testing, which is similar to PBT, but (a) usually checks a single property ("the program doesn't crash") and (b) sends data via the program/system's ordinary input channels (whereas PBT has "white box" access to internals, like unit tests do).
Why are we even testing to begin with, and not using theorem provers like Lean to prove without any doubt that our commutative function is indeed commutative?
Proving code is only as good as the requirements, which are often garbage: the customer often doesn't know what they even want. Even if you put in effort, requirements in the form a proof needs are often very abstract compared to the customer's requirements, so your program can be proved correct and still be wrong, because it doesn't do what the customer really wanted. In any complex program it is reasonable to assume that several requirements are wrong, and thus even if you prove your code correct, it will still be wrong. Often the problem itself cannot even be formally defined: a spell checker cannot be proved correct because human languages are not formally defined. It's not that you can't prove one, just that whatever you prove will be wrong.
Many systems are very complex. You can (and should!) prove simple algorithms, but put everything together and a proof is not something we can do at all. There are too many halting-problem-like things in large programs.
Tests solve some of the above problems: they can (do not confuse "can" with what they actually do!) be a simple example of "yes, when inputs are exactly x, y, z then I expect this result". A bunch of simple examples that make sense can often be close enough.
We do a lot more theorem proving than most people realize. Types which many languages have are a form of formal proof. They don't cover everything, but even in C++ they cover a lot of issues.
I think the best answer is a combination: prove the things we know how to prove, and test the rest and hope that between the two we have covered enough to prevent bugs.
“Civilization advances by extending the number of important operations which we can perform without thinking about them.” (Alfred North Whitehead)
We can do property based testing with less thinking than if we try to prove correctness. We are exploiting the ability of the computer to run millions or even billions of tests. It's the enormous power of today's computers that enables this to work.
The theorem prover approach would work only if it could be automatic: press a button, get the proof (after some acceptable delay). Otherwise, look at all the expensive manual effort you just signed up for.
You might think that as computers get faster, theorem proving becomes easier. But testing becomes easier also. It's not clear testing will ever lose this race.
High-assurance systems always required a combo of methods, since each can catch what the others missed. Also, they have different cost-benefit ratios. A quick glance at the code or some tests catches many problems quickly, while formal verification takes roughly forever in real-world project time.
Here are a few reasons to use testing strategies:
1. Your developers might not be mathematicians.
2. Your system or its properties might be hard to specify mathematically. One can often design a test for such properties.
3. Your functions that are easy to model mathematically might also have side effects or environmental dependencies due to other requirements (eg performance, legacy).
4. Your specifications and code might get out of sync at some point. If it does, people will think the code has properties that it doesn't. That can poison the verification all the way up the proof chain.
5. Mathematical modeling or proof might take much, much longer to find the bug than a code review or testing. That is, it's a waste of money.
6. Your mathematical tools might have errors that cause a false claim of correctness. Diverse assurance methods catch errors like this. Also, testing often uses the most widely-used parts of a programming language. Those constructs are highly likely to be compiled correctly, versus the esoteric methods or tools used in formally-proven systems.
7. Automated testing that, in some way, searches through your execution paths can find problems your team never thought of. Fuzzing is the most common technique. However, there are many methods of automated test generation.
Your best bet is to use code reviews, Design-by-Contract, static analyzers for common problems, contract/property-based generation of tests, fuzzing with contracts as runtime checks, and manual tests for anything hard to specify.
Don't waste time on formal verification at all unless it's a high-value asset that's worth it. If you do, first attempt it with tools like SPARK Ada and Frama-C, which have high automation. Also, if you fail to achieve full correctness, you might still prove the absence of certain categories of runtime errors.
Assuming you're not being facetious, one of the best parts about PBT is it gets you a good percentage of the value of formal proof with a lot less work. PBT at least lets you demonstrate that a property is *probably* true, whereas traditional unit testing doesn't usually explicitly state properties.
For purely mathematical properties, a purely mathematical technique is probably best.
But "is commutative" is just an example here (one of the topics of this post is how simplistic examples can mislead people as to the usefulness of a given verification technique).
The general point of software verification is to ensure the software "does what I want". But in a very large proportion of cases, people aren't clear on precisely what they want. They could not use a formal method because they could not write a formal specification. A nice thing about unit tests is that you can work through your expectations iteratively and incrementally, broadening and deepening your understanding of exactly what the software should do, capturing each insight along the way in a reusable way.
1) lightweight, because most of our test suites run on production infrastructure and can’t afford to run them constantly
2) "creative", to find bugs we hadn’t considered before
Probabilistic test scenarios allow us to increase the surface we're testing without needing to exhaustively test every scenario.
The same can be done with all string functions like `slice`, `indexOf`, `trim`, ...
Here's the link: https://www.hillelwayne.com/post/property-testing-complex-in...
- Creating a simplified model of the code, and then comparing the code implementation to the model, a la https://matklad.github.io/2024/07/05/properly-testing-concur... or https://johanneslink.net/model-based-testing
There's also this paper, which I haven't read yet but seems intriguing: https://andrewhead.info/assets/pdf/pbt-in-practice.pdf
Here's a recent example of a bug found by this approach. The Common Lisp code triggering the bug has been automatically minimized: https://bugs.launchpad.net/sbcl/+bug/2109837 with the bug fix https://sourceforge.net/p/sbcl/sbcl/ci/1abebf7addda1a43d6d24...
https://www.youtube.com/watch?v=zi0rHwfiX1Q
It seems to be almost totally forgotten, since the only link I could find is an excerpt from a PalmOS programming book:
https://www.oreilly.com/library/view/palm-programming-the/15...
https://developer.android.com/studio/test/other-testing-tool...