024 - holistic testing
There is a tendency for programmers to be exceedingly dogmatic about certain topics. One of those topics is testing. You can situate opinions on testing along a continuum that ranges from "tests get in your way and slow you down" to "every last line of code should have testing coverage at the unit, integration and acceptance levels".
At my previous company, we ran a critical business application for some of the largest law firms in the world. The consequences of merging a breaking change were extremely high. As a result, I had gravitated toward the latter end of this continuum, becoming a fierce advocate of including a test with every pull request and building out extensive continuous integration infrastructure that would test the application top to bottom every time it was deployed.
Now I'm starting fresh again and I've walked back my zealotry on testing considerably. I have no employees and no customers. I am the only person who can merge a breaking change and the consequence of doing so is only my own frustration. My codebase isn't that complicated and a large percentage of the code I write today will get deleted long before it is consumed by actual users.
The code I am writing is also different. Previously, I wrote dynamically typed code (python / javascript); now I am working in a typed language (zig). A lot of the little typos I make while writing now result in a compile error - where before they might not become apparent until the app was running in production and a specific code path was triggered.
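To illustrate with a contrived zig sketch (this isn't code from my project, just the shape of the thing) - misspell a field name and the build fails before the app can even run:

```zig
const std = @import("std");

const Cursor = struct {
    line: usize,
    column: usize,
};

pub fn main() void {
    var cursor = Cursor{ .line = 0, .column = 0 };
    // Uncommenting the typo below fails the build immediately:
    // cursor.colunm += 1; // error: no field named 'colunm' in struct 'Cursor'
    cursor.column += 1;
    std.debug.print("column: {d}\n", .{cursor.column});
}
```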
I was also using web frameworks previously - which meant most of the code I wrote was not triggered unless the application behaviours it related to were exercised. Compare this with my new project where I am writing a runtime from scratch. A large percentage of functions are run every time the app starts up - even if it's not displaying any content at all. I can lean into this further with assert statements that trigger even more crashes when things aren't quite going as expected. Joran Greef of TigerBeetle fame calls this upgrading a correctness bug to a liveness bug.
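A sketch of the pattern - the function here is a made-up stand-in, not real eno code, but because it runs on every startup, the assert converts a quiet correctness bug into an immediate crash:

```zig
const std = @import("std");
const assert = std.debug.assert;

// Hypothetical startup routine: it runs on every launch, so a violated
// invariant crashes the app right away rather than silently corrupting
// state in some rarely exercised code path later on.
fn initGlyphCache(capacity: usize) void {
    assert(capacity > 0); // correctness bug upgraded to a liveness bug
    // ... allocate and populate the cache ...
}

pub fn main() void {
    initGlyphCache(256);
}
```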
When I first started this project, I knew I should go light on the tests so I could experiment quickly, but I still reached for them occasionally. Now, the only time I write a test is when I know it will be the fastest way to crush a bug I am dealing with. Then I get immediate payback on the time invested in writing the test. Some of these tests end up isolating discrete behaviours and I can imagine them forming the beginnings of a comprehensive unit test suite. Others are mostly devised to quickly reproduce certain states and trigger the code paths that manifest a bug. These are likely disposable, so I always include an all-caps "BUGFIX" in the test name so future me knows not to think too hard about deleting them.
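One of those disposable tests looks roughly like this (the bug itself is invented for illustration):

```zig
const std = @import("std");

// Reproduce the exact state that used to crash, check the fix holds.
// The BUGFIX prefix tells future me this test is fair game for deletion.
test "BUGFIX: cursor landed past end of line after a join" {
    const line_length: usize = 10;
    var cursor_column: usize = 14; // the state that triggered the bug
    if (cursor_column > line_length) cursor_column = line_length; // the fix
    try std.testing.expectEqual(line_length, cursor_column);
}
```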
There are other ways in which I feel unit tests are suboptimal for parts of my codebase. I could write all kinds of unit tests in my font rendering code and never discover the bug where, any time an "f" followed by an "i" was rendered to the screen, the entire application would crash due to how my rendering code handled the resulting ligature (fi forms a ligature in many fonts). Perhaps an entire unit test suite could be replaced by a simple script that boots up my app with all manner of unicode characters on the screen and checks it doesn't crash.
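Something in this direction, where renderText is a placeholder for the real entry point:

```zig
const std = @import("std");

// Placeholder for the real rendering entry point.
fn renderText(text: []const u8) void {
    _ = text; // imagine shaping, ligature substitution and rasterizing here
}

test "renderer survives awkward codepoints" {
    const samples = [_][]const u8{
        "fi", "ffl", // ligature-forming sequences
        "e\u{0301}", // combining accent
        "\u{1F600}", // emoji outside the basic plane
        "\u{202E}reversed\u{202C}", // bidi control characters
    };
    for (samples) |sample| {
        renderText(sample); // any crash fails the whole run
    }
}
```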
A large part of my codebase is dedicated to managing the state of text documents. For "reasons", it's gotten far more complicated than I ever expected. I use unit tests the most here. There are many tests where I set up a state, do an operation and check I got the correct final state. These are great. They are also woefully insufficient. There was a bug where changing "quick fox" to "fast fox" wouldn't work if it had previously been "fast brown fox" and "brown " had been deleted. How would I ever know to write that unit test? I had to rely on Lincoln to find that one.
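The simple cases look roughly like this sketch, with std.ArrayList standing in for my actual document model:

```zig
const std = @import("std");

test "replacing a word produces the expected final state" {
    var doc = std.ArrayList(u8).init(std.testing.allocator);
    defer doc.deinit();

    try doc.appendSlice("quick fox"); // set up a state
    try doc.replaceRange(0, 5, "fast"); // do an operation
    try std.testing.expectEqualStrings("fast fox", doc.items); // check final state
}
```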
For those new to the blog, Lincoln is my deterministic simulation testing framework for eno. You give it a text file and it will reproduce it in eno a few small edits at a time. It makes mistakes along the way and goes back and corrects them, resulting in lots of overlapping deletes and inserts. I currently get Lincoln to write the entirety of The Sun Also Rises in eno as a basic sanity check on whether my code is working. I have a random seed that I know has worked in the past, which means this is kind of like running thousands of bizarrely specific unit tests.
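A heavily simplified sketch of the idea - this is not Lincoln's actual code, just the shape of seeded, replayable random edits:

```zig
const std = @import("std");

pub fn main() !void {
    // the fixed seed makes every run of the edit sequence identical,
    // so one known-good seed is replayable forever
    var prng = std.Random.DefaultPrng.init(0x5eed);
    const random = prng.random();

    const target = "The Sun Also Rises"; // stand-in for the whole book
    var typed = std.ArrayList(u8).init(std.heap.page_allocator);
    defer typed.deinit();

    for (target) |char| {
        // occasionally type a wrong character, then go back and delete it,
        // producing the overlapping inserts and deletes described above
        if (random.uintLessThan(u8, 10) == 0) {
            try typed.append('#');
            _ = typed.pop();
        }
        try typed.append(char);
    }
    std.debug.assert(std.mem.eql(u8, typed.items, target));
}
```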
One of my biggest lessons learned in testing from my previous company was the astounding ability of users to put content into your software that you never expected. This could be rare unicode codepoints or strange sequences of operations. It could also just be sheer volume of data. Frequently we would design an interface expecting 10-100 items and then find a user had put 1,500 in it. It's very useful to test that your code can handle lots of stuff without slowing down.
It turns out Lincoln is great for performance benchmarking too. I can use my random seed to replay thousands of operations and track performance counters to see how timings increase as more content is produced. While there is always a temptation to be fancy, I just track max and avg times for key functions and print the results to stdout after a successful Lincoln run. I copy and paste this output to the bottom of a text file in the root of the repo each time I run the benchmark - it ends up getting committed with the same hash as the code that produced those results, allowing me to roughly track how performance has evolved. This has been invaluable in sanity checking the performance of my data model and experimenting with optimizations.
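The counters really are as unfancy as that sounds - conceptually just this (the names are invented for illustration):

```zig
const std = @import("std");

// Global counters for one hot function; the real thing has a few of these.
var insert_calls: u64 = 0;
var insert_total_ns: u64 = 0;
var insert_max_ns: u64 = 0;

fn timedInsert() !void {
    var timer = try std.time.Timer.start();
    // ... the actual insert work would happen here ...
    const elapsed = timer.read();
    insert_calls += 1;
    insert_total_ns += elapsed;
    if (elapsed > insert_max_ns) insert_max_ns = elapsed;
}

pub fn main() !void {
    var i: usize = 0;
    while (i < 1000) : (i += 1) try timedInsert();
    // this line is what gets pasted into the benchmark file in the repo root
    std.debug.print("insert: max={d}ns avg={d}ns calls={d}\n", .{
        insert_max_ns, insert_total_ns / insert_calls, insert_calls,
    });
}
```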
I opened by saying I have soured a bit on writing tests. Several paragraphs later, I hope it's clear that testing is by no means an afterthought. It's essential to think about testing from day one of development - it just doesn't need to look a certain way. When your code is early and experimental, tests can slow you down and waste time. Unit testing might not add much value at all depending on what you are writing. If your application has large entry points for user generated content - building something to randomly generate that content might give you far more leverage. Finally, poor performance should be thought of as equivalent to your software crashing from a bug and your testing strategy should reflect this. Performance benchmarking should be in your mind from very early on. You cannot monkeypatch this once all your design decisions have already been made.
If you enjoyed this post, subscribe below to get notifications for the next one.
We also have an RSS feed.