Software Failure

2020-09-18

After a handful of years in the software industry, I have come to appreciate the importance of handling failure in software. For any kind of software that is expected to run in production and do so reliably, failure handling ends up being a core aspect of that software.

Unfortunately, when we first learn to program, we may not pay much attention to failure handling, and can end up developing bad practices in that respect. Books also tend to skip over those details, often presenting code that has no failure handling at all. And later, at university, there does not seem to be enough emphasis on failure handling either. But in any production system, failure handling will be a core aspect of the software and its design, and a substantial part of its implementation.

The choice of programming language already matters. I have always considered dynamically-typed languages, while insightful in many ways, to be overall inferior, and better left for scripts and small programs. A strong, static type system goes a long way toward preventing bugs and improving software quality, as we can see in programming languages like Haskell or, more recently, Rust. Software written in dynamically-typed languages often ends up with unit tests that are really just type tests. But unit tests should check the software’s logic, not something that a compiler can enforce for you. If we could easily get a compiler to prove the correctness of our software with no effort on our part, then we would do that too.
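To make the "type test" point concrete, here is a small sketch in Python. The function `apply_discount` and both tests are made up for illustration. The first test only checks the type of the result, which is something a static checker such as mypy could verify from the annotations alone. The second checks the actual logic, which is what a unit test is for.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after taking `percent` off, rounded down to whole cents."""
    return price_cents * (100 - percent) // 100


def test_returns_int() -> None:
    # A "type test": it passes for any implementation that returns an int,
    # even a wrong one. The return annotation already states this.
    assert isinstance(apply_discount(1000, 25), int)


def test_discount_is_applied() -> None:
    # A logic test: it fails if the arithmetic is wrong.
    assert apply_discount(1000, 25) == 750
    assert apply_discount(999, 10) == 899
    assert apply_discount(1000, 0) == 1000
```

An implementation that always returned 0 would pass the first test and fail the second.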

Failure handling manifests itself in API design too. A good API is easy to use, but also makes minimal assumptions, which translates to fewer failure cases, and therefore a lower potential for error. At university, this was taught to me in the form of function preconditions in an early programming course. You write a function and think about its preconditions, the conditions that must be satisfied for the function to produce its desired postconditions. Thinking about preconditions is all well and good, but what about minimizing them? The best function is the one that has no failure mode. Removing a failure mode may result in degraded performance or a more complex implementation, and in those cases we should consider just not handling (though still documenting) that failure case. But in many other cases, we can handle failure cases without degradation. This is true at the function level as much as it is at the high level of an API, where we are dealing with multiple modules, and perhaps classes if we are doing object-oriented design. There we need to think about how different things fit together and interact with each other, but the idea is fundamentally the same: to think of ways in which the design can fail, and then of ways to craft that design so as to reduce the possibility of failure.
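As a small illustration of removing a precondition, consider clamping an integer to a range. This is a sketch, and both function names are invented for the example. The first version requires the caller to pass the bounds in order and fails otherwise. The second accepts the bounds in either order, so that failure case no longer exists, at the cost of two extra comparisons.

```python
def clamp_strict(value: int, low: int, high: int) -> int:
    """Clamp `value` to [low, high]. Precondition: low <= high."""
    if low > high:
        # The caller violated the precondition; this is a failure mode
        # that every caller now has to know about.
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


def clamp(value: int, bound_a: int, bound_b: int) -> int:
    """Clamp `value` to the range spanned by the two bounds, in any order."""
    low, high = min(bound_a, bound_b), max(bound_a, bound_b)
    return max(low, min(value, high))
```

With the second version, `clamp(15, 10, 0)` and `clamp(15, 0, 10)` both return 10, and there is nothing for the caller to get wrong about argument order.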

And finally, at the lowest level, there is the implementation. Failure handling is often a substantial part of the implementation. You write a line of code to read data from a file, and right away you have to think about all the ways that can fail: the file does not exist; the file exists but the program has no permission to read it; the data is corrupted; the data is not corrupted, but there is additional data you don’t expect; etc. And we have not done anything productive yet. But those are all cases we need to handle if we want to ship software that is reliable. We may choose to ignore them, to pretend that things are simpler than they are, but it’s a false choice.
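The file-reading cases listed above can be sketched as follows. This is a minimal example under assumptions of my own: the file format (a single TCP port number in a text file), the `LoadError` type, and the `load_port` function are all hypothetical. Note how one line does the actual reading, and nearly everything else is failure handling.

```python
from pathlib import Path


class LoadError(Exception):
    """Hypothetical error type carrying a human-readable reason."""


def load_port(path: Path) -> int:
    """Read a port number from a file that should contain nothing else."""
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        raise LoadError(f"{path}: file does not exist") from None
    except PermissionError:
        raise LoadError(f"{path}: no permission to read file") from None
    except UnicodeDecodeError:
        raise LoadError(f"{path}: file is not valid UTF-8 text") from None
    except OSError as err:
        # Anything else the operating system reports, e.g. path is a directory.
        raise LoadError(f"{path}: cannot read file: {err}") from None

    fields = text.split()
    if not fields:
        raise LoadError(f"{path}: file is empty")
    if len(fields) > 1:
        # The data may be fine, but there is more of it than we expect.
        raise LoadError(f"{path}: unexpected additional data")

    token = fields[0]
    if not (token.isascii() and token.isdigit()) or len(token) > 5:
        raise LoadError(f"{path}: {token[:20]!r} is not a port number")

    port = int(token)
    if not 1 <= port <= 65535:
        raise LoadError(f"{path}: port {port} is out of range")
    return port
```

Wrapping every case in a single error type is one design choice among several. It lets the caller handle "could not load the port" in one place while still getting a specific message.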

Failure handling in software therefore exists at every scope, from your choice of language, to your design and implementation.

The quality and thoroughness of failure handling are often a distinguishing characteristic of the more senior engineers. You can tell right away who they are from a code review. The level of detail, thought, attentiveness and care that they put into their work is obvious to a careful reader. And these code reviews currently seem to me to be one of the best ways to learn failure handling.

I wish there were a greater effort put into failure handling everywhere in the software ecosystem, from schools to books to blog posts to slides and presentations. It is something that I think should be taught as early as possible, and all throughout one’s education as a software engineer. Rather than repeating the mistakes of those who came before us and learning the hard way, we should instill the discipline of failure handling, and more generally of producing quality software, early on in education.