Software Development and the False Promise of Science

October 13, 2019

You are, I will assume, the typical software developer. In arguments about anything from “will single payer healthcare be good for the economy?” to “can Myers-Briggs personality tests be a meaningful basis for business decisions?”, your habitual response is to cite a peer-reviewed, scientific study to bolster your argument.

That is, except when it comes to software. You have opinions – strong opinions – on questions such as “do microservices encourage modular code design?” and “should software projects stick to a ‘novelty budget’?” and “should composition be preferred to inheritance?”. But are your opinions backed by peer-reviewed analyses of hypotheses subjected to statistical tests of empirical data? Not really. Your view that software projects should stick to a ‘novelty budget’, for instance, is backed by your experience reading about this idea in some rando’s blog post and the argument seeming plausible in light of the recent bankruptcy of your friend’s web startup built on WebAssembly, CockroachDB, Elixir and Unikernels.

You’re not entirely happy with this. You’d like your opinions on software development to have a more objective grounding. So you’ve turned to books.

You were disappointed with Accelerate: The Science of Lean Software and DevOps. You agreed with most of its prescriptions. It made liberal use of descriptive statistics. Still, there wasn’t too much hypothesis testing or causal inference involved, with the exception of “continuous delivery improves organization performance,” which most of your peer group already accepts, anyway. And the methodology didn’t seem as conclusive as you had hoped. If the statistics had turned out the opposite way, would that really have shaken your conviction that CD is a good idea?

You were disappointed – angered, even – by Good to Great: Why Some Companies Make the Leap and Others Don’t (whose methodology is basically this).

You read The Leprechauns of Software Engineering by Laurent Bossavit, a debunking of the academic merit of several popular “scientific” claims about software engineering. You don’t know how you’ll ever be able to trust a citation ever again.

You watched the conference talk “What We Know We Don’t Know”, by Hillel Wayne, who, also disturbed by software’s apparent lack of scientific foundation, tracked down and read as many scholarly papers as he could. His conclusions are grim. Although “we just don’t know” the answers for “almost everything in software”, he does seem optimistic that empirical researchers will one day be able to rise to the challenge of building an “empirical software engineering”, and devotes most of the talk to summarizing results of some published papers. (Apparently, code review and getting enough sleep are good ideas. TDD probably doesn’t help.) This only deepens your dissatisfaction. You don’t live in this indefinite future when empirical researchers finally start figuring things out. You live in the here and now.

At last, truly lost, you found yourself here.

I’m here for two reasons. First, I want to put to rest any hope you may still hold for a scientific foundation for software development. Second, I want to reassure you that this is all right.

The Limits of Science

Why is the project to construct a science-based discipline of software development hopeless? For one, I don’t think researchers are really aiming to do this. They are quite content to continue telling insightful stories about their relatively narrow specialties – bits of our discipline, here and there. There is no grand plan to reform us. But even if there were, it would be bound to fail for methodological reasons. This is worth some discussion, so I’ll describe four of the “methodological reasons” I have in mind.

1. Controlled experiments are typically nothing like professional programming environments

At my day job, I’ve had the same teammates for months, used the same tools and practices for months, and have worked on the same applications for months. I work on a codebase that is nine years old, contains millions of lines of code, and has hundreds of people changing it every week. If you google “how long does it take a new developer to become productive?” common wisdom seems to say “months”. So far as I know, no researcher has ever gathered treatment and control groups of ten five-developer teams each, put them to work M-F, 9-5, for even a single month, in order to realistically simulate the conditions of a stable, familiar team and codebase. It’s easy to see why – the market wage for these hundred developers would be about a million dollars per month. That’s much too expensive for me to expect that there will soon be any meaningful number of experiments like this. As it stands, the modal experiment is probably a group of undergrads working on a tiny codebase they’ve never seen before, solving a toy problem for a couple of hours. The potential this has for generating persuasive arguments about the practices of professional programmers is low. If I were arguing with my colleagues, and one of them said “X worked well for us at Twitter” and I replied with “well X worked poorly for my CS 262 lab group” I’d be laughed out of the Zoom channel.

There’s a continuum between “professional software team” and “undergrad computer lab” of course. You might luck out and get masters students, for example, or study semester-long projects, instead of afternoon coding sessions. But as it stands, the “professional software team” end of the spectrum is largely unexplored by controlled experiments, and due to the economics of the matter, will remain so indefinitely.

2. The unpredictable dynamics of human decision-making obscure the effects of software practices in field data.

The alternative to controlled experiments is analyzing field data. You find teams working in industry, some of which have adopted the practice you are studying (the treatment group), some of which have not (the control group). Then you see whether the practice correlates with better outcomes. The challenge here is that “correlation does not prove causation”. This is less problematic in a controlled experiment, where any difference between the treatment group and the control group must be caused by the practice, random chance (which can be ruled out if your sample is large enough), or the researchers goofing up. This doesn’t hold for field data, because real-life software teams don’t adopt software practices in a random manner, independent from all other factors that might potentially affect outcomes.

Take the study in Accelerate. It found a correlation between continuous delivery (CD) and better outcomes. Does this prove that CD itself leads to better outcomes? It could easily be that CD only appears to improve outcomes because it is fashionable and helps organizations attract better developers, not because anything about CD is inherently better than its absence. Or maybe the ideology that leads a company to adopt CD also leads them to some other, unknown practice – and this unknown practice is the true cause of the better outcome. You can design your analysis to rule out alternate explanations such as these only if you are aware of and can measure them. But decisions to adopt software practices are human choices, driven by complicated social and psychological processes, which are difficult to understand, let alone measure.

This difficulty (the statistical term for this is “omitted variable bias”, a flavor of a broader class of problems called “endogeneity”) is endemic. Suppose you examine data from Github, and you find a lower defect rate in Typescript codebases than in Javascript codebases. Is this actually because type systems inherently help detect errors? Or maybe it’s because programmers who don’t have the leisure to experiment with Typescript are more likely to be in a time crunch and need to cut corners. Or perhaps programmers who choose Typescript simply value defect-free code more than those who don’t, and their values lead them to spend more time looking for defects. There are a million imaginable human reasons for this difference – none of which you have a realistic ability to analyze based on the information in a git log.
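If you'd like to see omitted variable bias happen, here is a minimal simulation sketch in Python. Everything in it is invented for illustration: the hidden `slack` variable (how much breathing room a team has), the coefficients, and the function names are my assumptions, not data from any real study. In this toy world Typescript has no effect on defects at all, yet the naive comparison still makes it look good, because teams with more slack both adopt Typescript more often and ship fewer defects.

```python
import random
from statistics import mean


def simulate_repos(n=20000, true_effect=0.0, seed=0):
    """Generate hypothetical (uses_typescript, defect_rate) pairs.

    `slack` is the omitted variable: it drives both adoption and defects,
    but it never appears in the returned data (just as it never appears
    in a git log).
    """
    rng = random.Random(seed)
    repos = []
    for _ in range(n):
        slack = rng.random()              # 0 = permanent time crunch, 1 = lots of room
        uses_ts = rng.random() < slack    # more slack -> more likely to adopt Typescript
        defects = 10.0 - 4.0 * slack + true_effect * uses_ts + rng.gauss(0, 1)
        repos.append((uses_ts, defects))
    return repos


def naive_difference(repos):
    """Mean defect rate of Typescript repos minus that of Javascript repos."""
    ts = [defects for uses_ts, defects in repos if uses_ts]
    js = [defects for uses_ts, defects in repos if not uses_ts]
    return mean(ts) - mean(js)


# true_effect is 0.0 here, so any gap is produced entirely by the hidden variable.
naive_gap = naive_difference(simulate_repos())
print(f"naive Typescript-vs-Javascript gap: {naive_gap:.2f}")
```

The printed gap comes out clearly negative even though `true_effect` is zero. A researcher who could measure `slack` could control for it, but that is exactly the kind of variable field data tends not to contain.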

There are clever things researchers can do to grapple with endogeneity. You can look for data with some component (an “instrumental variable”) that you can trust to be independent from other potentially confounding factors. For example, suppose Microsoft launched marketing campaigns for Typescript in different countries in different years, and you manage to find data showing these marketing campaigns correlate with a subsequent decrease in the defect rate across all Javascript/Typescript codebases in those countries. Since Typescript marketing campaigns probably don’t make people into better developers, or change their values, or give them more free time, or change anything that might affect their ability to produce defect-free code other than their willingness to adopt Typescript, the decrease in defects can safely be attributed to Typescript itself. (This isn’t quite an accurate portrayal of how the instrumental variable technique works, but it captures the essence of the argument.) So the endogeneity problem isn’t necessarily insurmountable. Still, data like this is not easy to find. There’s no reason to believe researchers will produce anything more than a few faint glimpses through the fog of endogeneity in the near future.
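For the curious, the marketing-campaign argument can be sketched as a simulation too. This is a toy under loud assumptions: the campaign is assigned by coin flip, it affects defects only by nudging teams toward Typescript, and all names and numbers (`slack`, `campaign`, the coefficients) are made up. The estimator used is the simple Wald ratio for a binary instrument, which is one textbook form of the instrumental variable technique, not necessarily what any particular paper does.

```python
import random
from statistics import mean


def simulate(n=200000, true_effect=-1.0, seed=1):
    """Generate hypothetical (campaign, uses_typescript, defect_rate) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        slack = rng.random()                 # hidden confounder, never recorded
        campaign = rng.random() < 0.5        # instrument: independent of slack
        p_adopt = 0.6 * slack + (0.3 if campaign else 0.0)
        uses_ts = rng.random() < p_adopt
        defects = 10.0 - 4.0 * slack + true_effect * uses_ts + rng.gauss(0, 1)
        rows.append((campaign, uses_ts, defects))
    return rows


def naive_estimate(rows):
    """Compare adopters with non-adopters directly (confounded by slack)."""
    ts = [d for _, u, d in rows if u]
    js = [d for _, u, d in rows if not u]
    return mean(ts) - mean(js)


def wald_estimate(rows):
    """Campaign's effect on defects divided by its effect on adoption."""
    exposed = [(u, d) for c, u, d in rows if c]
    unexposed = [(u, d) for c, u, d in rows if not c]
    defect_shift = mean(d for _, d in exposed) - mean(d for _, d in unexposed)
    adoption_shift = (mean(float(u) for u, _ in exposed)
                      - mean(float(u) for u, _ in unexposed))
    return defect_shift / adoption_shift


rows = simulate()
naive = naive_estimate(rows)
wald = wald_estimate(rows)
print(f"true effect: -1.00, naive: {naive:.2f}, instrumental variable: {wald:.2f}")
```

The naive comparison overstates Typescript's benefit, while the ratio lands near the true effect built into the simulation. It only works because the campaign really is independent of `slack` here. In real data you have to argue for that independence, and that is the hard part.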

3. The outcomes that can be measured aren’t always the outcomes that matter.

Most studies content themselves with measuring things like “defect rate” or “complexity” or “mean time to recovery” or self-reported “code quality”. Most software developers aren’t single-mindedly trying to minimize the number of defects, complexity, lead-time, or any single outcome. They’re solving a multi-faceted optimization problem. They could slow down and eliminate more defects if they wanted to and so forth, but they must trade this off because it’s also important to ship in a reasonable time frame. Indeed, I would say, the rate of defects or complexity in the code seems much more likely to reflect its authors’ constraints and values, rather than inherent qualities of the practices they employ. So in order to effectively inform practice, research needs to ask a slightly different, more sophisticated question – not e.g. “what is the effect software practice X has on ‘defect rate’”, but “what is the effect software practice X has on ‘defect rate per unit effort’”. While it might be feasible to ask this question in the controlled experiment setting, it is difficult or impossible to ask of field data.

4. Software practices and the conditions which modify them are varied, which limits the generality and authority of any tested hypothesis

Suppose, for instance, you’re trying to test “microservices encourage modular code design”. You test it across all the obvious dimensions: team size/experience, type system, read/write access pattern, consistency requirements, and in all these circumstances, your data shows a correlation between microservices and worse outcomes. A practitioner might object “well that’s no surprise, most developers get it wrong because they don’t understand the right way to do microservices, but if you do them right they are incredible – and I know how to do them right” and this claim could be completely consistent with reality! Perhaps developers who have read the right book, or have worked at a certain company and experienced a certain prototypical system, have a high chance of success with microservices, and developers who have not don’t. This is a completely plausible story. And this sort of objection severely limits the authority of any empirical study. What it comes down to is nuance. No two developers actually mean quite the same thing when they say “microservices encourage modular code design,” and if you fight with them, you will discover a rich, unique, and nuanced mental model of microservices and modularity. If a researcher had unlimited time and resources, they could explore and test every developer’s individual mental model. But they don’t, and so they are constrained to studying whatever notion seems most representative. This will produce insights, to be sure – but these insights only have authority in proportion to how homogenous the experiences and mental models of software developers are with respect to the practice under study. And as a rule, software development is not homogenous.

None of this is to say that there’s anything wrong with empirical research. The “threats to validity” section of any paper will show you the researchers are quite aware of their research’s limitations. This sort of research is still important, thought-provoking, and worthy of the attention of practitioners. But if you’re holding your breath for the day when empirical science will produce a comprehensive framework for software development – like it does for, say, medicine – you will die of hypoxia.

Science is not the only source of knowledge

Is this cause for despair? If science-based software development is off the table, what remains? Is it really true as Hillel suggests, that in the absence of science “we just don’t know” anything, and we are doomed to an era of “charisma-driven development” where the loudest opinion wins, and where superstition, ideology, and dogmatism reign supreme?

Of course not. Scientific knowledge is not the only kind of knowledge, and scientific arguments are not the only type of arguments. Disciplines like history and philosophy, for instance, seem to do rather well, despite seldom subjecting their hypotheses to statistical tests.

The reason for this is that history and philosophy – software development too, I will argue – study meanings that we can directly apprehend. If you want to judge a theory about the Protestant Reformation, you can read the writings of Martin Luther and see whether their meaning, as you interpret it, seems to match the theory. If you want to judge, say, David Hume’s epistemology, you can examine your mind’s internal patterns of thought and see if they are consistent with the concepts Hume introduces, or whether there are counterexamples. Whereas, no ray of light has ever written a treatise about wavelength, and you cannot adopt the mindset of a pancreas in hopes of building an introspective understanding of its secretions. Measurement and statistics are preferred in the natural sciences only because the subject matter can only be understood indirectly, through measurement and inference.

Software development, like history and philosophy, can be understood without measurement. If somebody claims “microservices encourage modular code design,” illustrating their point with descriptions of specific ways in which the code at their company was more modular after the introduction of microservices, and offering a theory as to why, you can hear their explanation, determine whether or not their example seems plausible, and whether their theory generalizes to situations you have faced, and weigh this against any of your own experiences or stories you have heard that seem to conflict with their explanation. This sort of judgement isn’t “scientific”, but it isn’t unreasonable. Your decision to accept or reject the argument might be mistaken – you might overlook some major inconsistency, or your judgement might be skewed by your own personal biases, or you might be fooled by some clever rhetorical trick. But all in all, your judgement will be based in part on the objective merit of the argument – not entirely the charisma of its advocates. And so we can improve our understanding of software through the act of hearing stories, reflecting upon theories, and comparing them with our own experience. Measurement and statistical tests are not necessary.

And here is a secret: in the natural sciences themselves, storytelling and bare conjecture are far more important modes of persuasion than data-based empirical argument, anyway. Data and statistics have a critical role, of course, but they don’t capture the imagination in the same way as the narrative of the subatomic adventures of a motley crew of quarks, bosons, and leptons. In fact, progress in science can sometimes depend on scientists to proceed “counterinductively”, and, driven by intuition, ideology, insanity, or other sentiments, deliberately accept a theory that is less consistent with the data. For more on this, you should read Against Method by the philosopher of science Paul Feyerabend. His favorite example of “counterinductive” science is Galileo, driven by ideology, advancing heliocentrism in an era when it was truly unjustified by the available evidence and the prevailing theories in the “auxiliary sciences” of optics and locomotion. He mentions other examples too, although these are less accessible the less familiar you are with physics.

So I don’t think software developers need to be too worried that empirical science seems to be rather limited in what it can say about what we do. Empirical science itself isn’t even always that empirical. We’ll get along fine just by telling stories, listening to the stories of our colleagues, keeping an open mind, and thinking critically about what we do.

A good example of the sort of argument I think is helpful is A Philosophy of Software Design. Ousterhout defines his terms clearly, accompanies his definitions and claims with illustrative examples, and tells an occasional story. You, the reader, are free to evaluate each claim based on whether it plausibly seems to capture the essence of what you have encountered in your experiences writing software. For my part, I didn’t find most of Ousterhout’s ideas to be persuasive, as some of my colleagues did, but that doesn’t mean they aren’t good arguments, and that doesn’t mean I didn’t learn anything from reading the book.

Another example is Code Simplicity by Max Kanat-Alexander. I personally did not enjoy this book. But it is possible to learn even from obnoxious presentations.

Science gone wrong

Science – or at least a mysticized version of it – can be a threat to this sort of inquiry. Lazy thinkers and ideologues don’t use science merely as a tool for critical thinking and reasoned argument, but as a substitute. Science appears to offer easy answers. Code review works. Continuous delivery works. TDD probably doesn’t. Why bother sifting through your experiences and piecing together your own narrative about these matters, when you can just read studies – outsource the reasoning to the researchers? You can probably get away with just reading the abstracts, even. No need to trouble yourself about the line of argument that led the researchers to their conclusions. Heaven forbid you read the “threats to validity” section – nobody takes that seriously. They’ve got p-values, confidence intervals and such. Those are essentially SLAs on the objective truth. It’s been peer-reviewed. QED. If there’s a corroborating study, double QED. If there’s been a corroborating meta-study or literature review, that proves it’s been proven. QED squared. It’s easy to deal with people that disagree with us too, when Science is on our side. No need to demonstrate the shortcomings and inconsistencies of their ideas. We can simply dismiss them as “anti-science” and compare them to anti-vaxxers. Why appeal to rational sensibilities when you can exploit tribal instincts instead?

This sentiment exists. I witnessed it play out among industry leaders in my Twitter feed, the day after I started drafting this post. It’s amplified by the fact that nuanced scientific analysis doesn’t make good clickbait, and by the fact that everything except the abstract of a research paper tends to live behind a paywall.

TL;DR

Software developers are domain experts. We know what we’re doing. We have rich internal narratives, and nuanced mental models of what it is we’re about, and we can learn to get better through the simple but difficult act of telling our stories, articulating our ideas and listening to each other, like philosophers or historians do. Empirical researchers have a different perspective and can say interesting things about our field, but it is important to consider these arguments on their (necessarily limited) merits, not idealizing them as an absolute authority. Down that road lies pettiness and lazy thinking.

Postscript

Definitely read Hillel Wayne’s thoughtful response to this, on lobste.rs.

He corrected me on a few points, notably

But as it stands, the “professional software team” end of the spectrum is largely unexplored by controlled experiments