Pre-AI Baseline: How Development and QA Worked Before Agents

Part of QA in the Age of AI-Accelerated Development.


Before we can reason about what has changed in QA and testing now that agents generate code, we need a clear picture of what the process looked like before.

The general process

  1. Decision: An agreement is made by humans on what needs to be built (idea, hypothesis, PRD, specification, whatever you call it). Decisions are always incomplete; humans fill the gaps using shared context, domain knowledge, and tacit understanding of the system. This shared understanding is a load-bearing quality mechanism.
  2. Implementation: From this agreement, a human designs and builds the implementation (architecture, design, code, other assets). Implementation is also a learning loop: while building, a developer asks the PM more questions, clarifies the PRD with the tester, notices gaps the spec didn’t anticipate, and learns the product better in the process. Some of this learning flows back into the decision (step 1); some stays as tacit knowledge the developer now carries.
  3. Testing: Testing is done to verify that the implementation matches the decision from step 1. Testing also produces learning: while testing, we might discover that the spec was ambiguous, that the user’s real workflow didn’t match the PRD, that two features interact unexpectedly, or that certain business risks weren’t addressed. This learning flows back to step 1 and updates the shared understanding. Testing thus has two functions: verdict (pass/fail) and learning (updating the team’s model of the system).

Testing was done in a few ways:

Business domain understanding

In all three steps above, humans carry business domain understanding, accumulating it for free, just by doing their jobs.

In decisions, domain knowledge shapes what gets specified and what gets left implicit. “We need to handle refunds” means very different things depending on whether you know the business processed 3 months of chargebacks last quarter.

In implementation, domain knowledge shapes design choices: what to optimize, where to be paranoid, where “good enough” is fine. A developer who’s been on the payment team for a year doesn’t write the same code as a contractor who just joined.

In testing, domain knowledge shapes risk-based prioritization. The tester who knows users constantly misuse the export feature tests it harder. This is invisible in the process: nobody writes “test this harder because of domain risk”; the human just does it.

Business domain understanding was mostly not an explicit activity, but rather a byproduct of humans being in the loop. It didn’t appear as a line item in anyone’s cost model and nobody budgeted for “domain knowledge accumulation”; it just happened “automatically”. Some companies did make parts of it explicit, e.g. by documenting risk registers, architecture decision records (ADRs), and component criticality classifications. However, these captured only a fraction of the domain knowledge that humans carried implicitly. The explicit artifacts were supplements to human judgment, not substitutes for it.

Why doing-the-work produces learning. The mechanism behind “free” domain knowledge accumulation is well-established in cognitive science. The generation effect (Slamecka & Graf, 1978; meta-analysed across 310 experiments from 126 articles by McCurdy et al., 2020) shows that actively producing information yields dramatically better retention than passively receiving it, and the advantage grows with time, roughly doubling at longer retention intervals. Bjork’s desirable difficulties framework (1994, 2020) formalises the principle: the cognitive effort of doing something yourself is the learning. The struggle is not a cost to be eliminated; it’s the process by which understanding forms.

Domain understanding also includes transferable professional heuristics; this is what Howden called the “oracle concept”. QA professionals don’t just know this project; they reason by analogy from other systems, languages, and domains. When testing a new language feature in Kotlin, a tester draws on how Java or Scala handle the same concept. When testing a new web service, they draw on how similar services work. This is not project-specific knowledge but a cross-domain professional intuition accumulated across a career.

The risk knowledge that actually drives testing decisions is always project-specific (see economics of testing). Risk knowledge as a whole sits in three layers, and only the last of them is local to the project:

  1. Strategic risk knowledge (how to assess risk, how to build a risk register, how to map testing to business exposure): transferable across projects. Available to experienced humans. This is where the economics of testing framework operates.
  2. Population-level risk weighting (payments are risky, settings pages aren’t): a useful starting point, but potentially miscalibrated for any specific project. An experienced tester who lived through a payment disaster at their previous company arrives at a new team overcautious about payments, while the real risk may be in the export module that nobody’s testing. Their “lived experience weighting” is just a bias at that point.
  3. Local risk calibration (in THIS project, the export module is where the real risk lives): only comes from engagement with this specific codebase, this team, this product. This is the implicit learning from doing the work.

Human business domain understanding enabled both development and testing. It worked implicitly, as a side effect of humans collaborating. It had real issues (the recurring industry debates about keeping documentation up to date are one symptom), but it worked better, and scaled better, in companies that leaned more heavily on collaborative practices and proactive QA. The more collaboration there was, the stronger the shared domain understanding, and the more structurally that understanding could be embedded in how the company produced quality.

Knowledge could still fade as people quit or transferred, but the rate was bounded: new joiners absorbed the same context implicitly through participation, so the system replenished what it lost. The next chapter is where that bounded fade becomes unbounded, and where we name it: comprehension debt (loss of the how) and intent debt (loss of the why).

Two types of companies

Some people talk about “QA cultures” or maturity models with multiple levels, but for simplicity I see two types of companies on the market. To understand how different QA approaches scale, and why agents make the difference matter more, we need to model QA costs at the team level.

Proactive QA companies

Humans design the processes within steps 1-3 so that each step is more likely to produce good-quality output, and so that the system stays in that regime. Practices include: Kaizen, zero bug policy, pair/mob programming, TDD, shift-left testing, continuous training, etc. Quality control measures (testing at different stages) are also present and act as inspection/verification (appraisal) activities, but the balance of effort is shifted towards prevention over appraisal.

Practices like pair programming and mob programming serve the function of shared understanding amplifiers. They keep mental models aligned across the team.

Domain knowledge is embedded in their prevention practices: risk-based testing, quality gates calibrated to component criticality, architecture decisions shaped by business risks; all of these are based on solid domain understanding.

The operational version of this is a risk-based investment model: identify business risks, prioritize by exposure (likelihood × impact), select testing approaches that produce credible evidence at minimal cost, and review periodically to rebalance. This is described in detail in the Economics of Testing research, which frames testing as an economic investment within the Cost of Quality (CoQ) framework. The key economic insight is that prevention costs less than appraisal, which costs less than failure; so proactive QA companies invest heavily in prevention and use appraisal (testing) deliberately where it produces the highest marginal risk reduction. Every step of this model depends on business domain understanding: what risks exist, what failure costs, what the product lifecycle demands, what tradeoffs are acceptable.
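As a minimal sketch of the first two steps (identify risks, prioritize by exposure), assuming a flat register in which each risk carries an estimated likelihood and impact: the `Risk` class, the `prioritize` function, and every number below are hypothetical and only illustrate the likelihood × impact arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: float  # estimated probability of failure in the review period, 0..1
    impact: float      # estimated cost if the failure happens (any consistent unit)

    @property
    def exposure(self) -> float:
        # Exposure as defined in the text: likelihood x impact.
        return self.likelihood * self.impact


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Order risks by exposure, highest first."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Hypothetical register; every figure is invented for illustration.
register = [
    Risk("payments: double charge", likelihood=0.02, impact=500_000),
    Risk("export: corrupted file", likelihood=0.30, impact=60_000),
    Risk("settings: wrong default theme", likelihood=0.40, impact=1_000),
]

ranked = prioritize(register)
for risk in ranked:
    print(f"{risk.name}: exposure {risk.exposure:,.0f}")
```

With these invented numbers the export risk outranks payments even though a payment failure has the larger impact. The arithmetic is trivial; the likelihood and impact estimates are where the domain understanding goes in.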

Scaling characteristics when adding n teams:

Intra-team costs (the “linear” part):

Cross-team costs (the “hidden superlinear” part):

Actual cost function: O(n + εn²) where:

The difference between proactive QA and reactive QC companies is not just the size of ε; it’s small ε and no rework multiplier vs. large ε compounded by the rework feedback loop (see reactive QC below).

Reactive QC companies

Little to no proactive QA. Quality is mostly controlled, not assured. Testing is often done “after” development and frequently struggles to keep up with delivery. Dev and tester mental models drift apart, and testing becomes “guess what the developer meant” rather than “verify what we all agreed on”.

Domain knowledge still exists in people’s heads, even if processes don’t leverage it formally. The individual tester still knows “this area is risky”, but they can’t act on it structurally.

Scaling when adding n teams is much harder:

Using the same framework as proactive QA: cost is also O(n + εn²), but ε is much larger because there are no architectural or process investments to suppress the quadratic term. Additionally, the rework loop adds a multiplicative factor 1/(1-r(n)) on top, making the actual cost closer to O((n + εn²) / (1-r(n))), which accelerates sharply as r approaches 1.
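To make the shape of these two cost functions concrete, here is a small numeric sketch. The text does not define r(n), so `linear_rework` is a hypothetical stand-in (rework share growing linearly with team count, capped below 1). The ε values are likewise invented, and costs are in arbitrary units.

```python
def team_cost(n: int, epsilon: float) -> float:
    """Linear intra-team cost plus quadratic cross-team cost: n + epsilon * n^2."""
    return n + epsilon * n**2


def cost_with_rework(n: int, epsilon: float, rework_rate: float) -> float:
    """Apply the rework multiplier 1 / (1 - r) to the base cost."""
    if not 0 <= rework_rate < 1:
        raise ValueError("rework_rate must be in [0, 1)")
    return team_cost(n, epsilon) / (1 - rework_rate)


def linear_rework(n: int, per_team: float = 0.04, cap: float = 0.95) -> float:
    """Hypothetical r(n): rework share grows with team count, capped below 1."""
    return min(per_team * n, cap)


# Assumed parameters: small epsilon and no rework (proactive)
# versus large epsilon plus the rework loop (reactive).
for n in (1, 5, 10, 20):
    proactive = team_cost(n, epsilon=0.01)
    reactive = cost_with_rework(n, epsilon=0.2, rework_rate=linear_rework(n))
    print(f"n={n:>2}  proactive={proactive:7.2f}  reactive={reactive:8.2f}")
```

Under these assumptions the proactive curve stays close to linear over the range shown, while the reactive curve is bent upward twice: once by the larger quadratic term and again by the shrinking denominator.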

Scaling walls don’t just appear in team cost structures; they also appear inside testing itself.

Historical evidence: the “testing-after” model was already hitting ceilings

Even before AI, as systems grew more complex, deterministic testing repeatedly hit combinatorial walls. Each time, the industry responded by inventing a new paradigm that accepted incompleteness. Each paradigm is a historical marker of a specific scaling wall being hit:

| Paradigm | Year | What hit the wall | The wall | Response |
|---|---|---|---|---|
| Fuzzing | 1988 | Input space | Can’t enumerate all inputs to a program | Sample randomly from input space, observe what breaks |
| Property-based testing | 2000 | Scenario space | Can’t enumerate all combinations of valid inputs and states | Specify invariants, let the framework generate cases automatically |
| Chaos engineering | 2011 | Interaction/failure space | Can’t enumerate all failure combinations in distributed systems | Inject random failures in production, observe emergent behavior |

This is the same principle behind risk-based E2E testing: it’s infeasible to test every user journey, so teams prioritize critical flows, accepting incompleteness in exchange for tractability.

The common move: when enumeration becomes intractable, shift from deterministic verification to probabilistic exploration. This is the Monte Carlo method applied to testing: when you can’t compute the integral analytically, you sample.
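A hand-rolled sketch of that move, using only the standard library: instead of enumerating every string, state one invariant (decoding an encoded string returns the original) and check it on randomly sampled inputs. The run-length codec and the `check_round_trip` helper are hypothetical examples, not taken from any framework. A real property-based testing library would add richer input generators and shrinking of failing cases.

```python
import random


def run_length_encode(text: str) -> list[tuple[str, int]]:
    """Collapse runs of equal characters into (char, count) pairs."""
    encoded: list[tuple[str, int]] = []
    for char in text:
        if encoded and encoded[-1][0] == char:
            encoded[-1] = (char, encoded[-1][1] + 1)
        else:
            encoded.append((char, 1))
    return encoded


def run_length_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into a string."""
    return "".join(char * count for char, count in pairs)


def check_round_trip(samples: int = 1_000, seed: int = 0) -> int:
    """Sample random inputs and assert the round-trip invariant on each one."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(samples):
        length = rng.randint(0, 30)
        # A tiny alphabet makes repeated runs likely, which is where bugs would hide.
        text = "".join(rng.choice("ab ") for _ in range(length))
        assert run_length_decode(run_length_encode(text)) == text, repr(text)
    return samples


checked = check_round_trip()
print(f"invariant held on {checked} sampled inputs")
```

A passing run is evidence about the sampled region only, not a proof over the whole input space. That is the incompleteness these paradigms accept. Note also that a human still had to choose the invariant and the input distribution.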

These paradigms did not replace traditional testing, but supplemented it. Companies still write unit tests AND do chaos testing. Each paradigm emerged as a new layer because the previous layers couldn’t reach certain kinds of defects. The testing stack keeps growing.

All three paradigms, despite accepting incompleteness in enumeration, still relied on human comprehension as their foundation:

This is the critical setup for the AI transition: the pre-AI testing world was already evolving toward probabilistic approaches because deterministic enumeration couldn’t scale.

But every new paradigm still assumed that humans understood the system deeply enough to design the tests, interpret the results, and act on the findings. AI-accelerated development threatens precisely this foundation.


Next: With agents