Pre-AI Baseline: How Development and QA Worked Before Agents

Part of QA in the Age of AI-Accelerated Development.

Before we can reason about what has changed in QA and testing now that agents generate code, we need a clear picture of what the process looked like before.

The general process

1. Decision: Humans agree on what needs to be built (idea, hypothesis, PRD, specification, whatever you call it). Decisions are always incomplete; humans fill the gaps using shared context, domain knowledge, and tacit understanding of the system. This shared understanding is a load-bearing quality mechanism.
2. Implementation: From this agreement, a human designs and builds the implementation (architecture, design, code, other assets). Implementation is also a learning loop: while building, a developer asks the PM more questions, clarifies the PRD with the tester, notices gaps the spec didn’t anticipate, and learns the product better in the process. Some of this learning flows back into the decision (step 1); some stays as tacit knowledge the developer now carries.
3. Testing: Testing verifies that the implementation matches the decision from step 1. Testing also produces learning: while testing, we might discover that the spec was ambiguous, that the user’s real workflow didn’t match the PRD, that two features interact unexpectedly, or that certain business risks weren’t assured. This learning flows back to step 1 and updates the shared understanding. Testing thus has two functions: verdict (pass/fail) and learning (updating the team’s model of the system).

Testing was done in a few ways:

Fully manually (a human interprets the decision and the implementation, “tries it”, and reports the result)
Manually and automatically (same as above, but humans also write automated tests)
Only automatically

Business domain understanding

In all three steps above, humans carry business domain understanding, accumulating it for free, just by doing their jobs.

In decisions, domain knowledge shapes what gets specified and what gets left implicit. “We need to handle refunds” means very different things depending on whether you know the business processed 3 months of chargebacks last quarter.

In implementation, domain knowledge shapes design choices: what to optimize, where to be paranoid, where “good enough” is fine. A developer who’s been on the payment team for a year doesn’t write the same code as a contractor who just joined.

In testing, domain knowledge shapes risk-based prioritization. The tester who knows users constantly misuse the export feature tests it harder. This is invisible in the process: nobody writes “test this harder because of domain risk”; the human just does it.

Business domain understanding was mostly not an explicit activity, but rather a byproduct of humans being in the loop. It didn’t appear as a line item in anyone’s cost model and nobody budgeted for “domain knowledge accumulation”; it just happened “automatically”. Some companies did make parts of it explicit, e.g. by documenting risk registers, architecture decision records (ADRs), and component criticality classifications. However, these captured only a fraction of the domain knowledge that humans carried implicitly. The explicit artifacts were supplements to human judgment, not substitutes for it.

Why doing-the-work produces learning. The mechanism behind “free” domain knowledge accumulation is well-established in cognitive science. The generation effect (Slamecka & Graf, 1978; meta-analysed across 310 experiments from 126 articles by McCurdy et al., 2020) shows that actively producing information yields dramatically better retention than passively receiving it, and the advantage grows with time, roughly doubling at longer retention intervals. Bjork’s desirable difficulties framework (1994, 2020) formalises the principle: the cognitive effort of doing something yourself is the learning. The struggle is not a cost to be eliminated; it’s the process by which understanding forms.

Domain understanding also includes transferable professional heuristics; this is what Howden called the “oracle concept”. QA professionals don’t just know this project; they reason by analogy from other systems, languages, and domains. When testing a new language feature in Kotlin, a tester draws on how Java or Scala handle the same concept. When testing a new web service, they draw on how similar services work. This is not project-specific knowledge but a cross-domain professional intuition accumulated across a career.

The risk knowledge that actually drives testing decisions is always project-specific (see economics of testing). It sits in three layers:

Strategic risk knowledge (how to assess risk, how to build a risk register, how to map testing to business exposure): transferable across projects. Available to experienced humans. This is where the economics of testing framework operates.
Population-level risk weighting (payments are risky, settings pages aren’t): a useful starting point, but potentially miscalibrated for any specific project. An experienced tester who lived through a payment disaster at their previous company arrives at a new team overcautious about payments, while the real risk may be in the export module that nobody’s testing. Their “lived experience weighting” is just a bias at that point.
Local risk calibration (in THIS project, the export module is where the real risk lives): only comes from engagement with this specific codebase, this team, this product. This is the implicit learning from doing the work.

Human business domain understanding enabled development and testing. It worked implicitly, as a side effect of humans collaborating. It had real issues (the recurring industry debates about keeping documentation up to date are one symptom), but it worked better, and scaled better, in companies that leaned more heavily on collaborative practices and proactive QA. The more collaboration there was, the stronger the shared domain understanding, and the more structurally that understanding could be embedded in how the company produced quality.

Knowledge could still fade as people quit or transferred, but the rate was bounded: new joiners absorbed the same context implicitly through participation, so the system replenished what it lost. The next chapter is where that bounded fade becomes unbounded, and where we name it: comprehension debt (loss of the how) and intent debt (loss of the why).

Two types of companies

Some people talk about “QA cultures” or maturity models with multiple levels, but for simplicity I distinguish two types of companies. To understand how different QA approaches scale, and why agents make the difference matter more, we need to model QA costs at the team level.

Proactive QA companies

Humans design the processes within steps 1-3 so that each step is more likely to produce something of good quality, and so that the system stays in this regime. Practices include: Kaizen, zero bug policy, pair/mob programming, TDD, shift-left testing, continuous training, etc. Quality control measures (testing at different stages) are also present and act as inspection/verification (appraisal) activities, but the balance of effort is shifted towards prevention over appraisal.

Practices like pair programming and mob programming serve the function of shared understanding amplifiers. They keep mental models aligned across the team.

Domain knowledge is embedded in their prevention practices: risk-based testing, quality gates calibrated to component criticality, architecture decisions shaped by business risks; all of these are based on solid domain understanding.

The operational version of this is a risk-based investment model: identify business risks, prioritize by exposure (likelihood × impact), select testing approaches that produce credible evidence at minimal cost, and review periodically to rebalance. This is described in detail in the Economics of Testing research, which frames testing as an economic investment within the Cost of Quality (CoQ) framework. The key economic insight is that prevention costs less than appraisal, which costs less than failure; so proactive QA companies invest heavily in prevention and use appraisal (testing) deliberately where it produces the highest marginal risk reduction. Every step of this model depends on business domain understanding: what risks exist, what failure costs, what the product lifecycle demands, what tradeoffs are acceptable.
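The first two steps of that model (identify risks, prioritize by exposure) can be sketched in a few lines. This is a minimal illustration, not the framework itself: the `Risk` class, the register entries, and every number below are hypothetical, and in practice the likelihood and impact estimates are exactly where business domain understanding enters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: float  # assumed probability of the failure in a review period (0..1)
    impact: float      # assumed cost if it happens, in arbitrary currency units

    @property
    def exposure(self) -> float:
        # Exposure as defined in the text: likelihood x impact.
        return self.likelihood * self.impact


def prioritize(risks):
    """Return the risks sorted by exposure, highest first."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Hypothetical register; the numbers are invented for illustration only.
register = [
    Risk("settings page layout glitch", likelihood=0.30, impact=1_000),
    Risk("payment double charge", likelihood=0.02, impact=500_000),
    Risk("export produces corrupt file", likelihood=0.10, impact=200_000),
]

for risk in prioritize(register):
    print(f"{risk.name}: exposure = {risk.exposure:,.0f}")
```

With these made-up inputs the export risk ranks above the payment risk, even though payments have the larger impact. The ranking is only as good as the estimates a human with local knowledge feeds into it.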

Scaling characteristics when adding n teams:

Intra-team costs (the “linear” part):

QA prevention practices (TDD, pairing, etc.) are per-team: each team bears its own QA cost. Adding a team adds a roughly constant marginal cost.
Automated tests are written by developers as part of development; they scale with the team, not as a separate coordination overhead.

Cross-team costs (the “hidden superlinear” part):

CI infrastructure pressure: tests run in a shared pipeline. CI run time grows with total test count. Flaky test probability compounds: if each of t tests has independent flake probability p, the probability of at least one flake per run is 1-(1-p)^t, which approaches 1 quickly as t grows. Flaky test investigation is a shared cost, not per-team.
Cross-team test conflicts: team A’s change breaks team B’s tests. Investigation cost grows with the number of team pairs, n(n-1)/2 = O(n²).
Shared test utilities/fixtures: maintenance becomes a coordination problem as more teams depend on them.
Architectural coherence: the system becomes harder to reason about as a whole. Coordination mechanisms (architecture reviews, platform teams, interface contracts) are needed.
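To make the flaky-test compounding concrete, here is a small calculation of 1-(1-p)^t. The per-test flake rate of 0.1% and the suite sizes are assumptions chosen for illustration, and the formula only holds if flakes are independent, as stated above.

```python
def p_any_flake(p: float, t: int) -> float:
    """Probability that at least one of t independent tests flakes in a run."""
    return 1.0 - (1.0 - p) ** t


# Assumed per-test flake probability of 0.1%, for illustration only.
for t in (100, 1_000, 10_000):
    print(f"{t:>6} tests -> P(at least one flake) = {p_any_flake(0.001, t):.4f}")
```

Under these assumptions a 100-test suite sees a flaky run roughly 10% of the time, a 1,000-test suite roughly 63% of the time, and a 10,000-test suite on almost every run.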

Actual cost function: O(n + εn²) where:

n = number of teams
ε = cross-team coordination coefficient
In proactive QA companies, architectural practices (clear module boundaries, APIs, contracts) exist specifically to keep ε small
ε is never zero: the “linearity” of proactive QA companies means they’ve invested in making the quadratic term’s coefficient small, not that the quadratic term is absent
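One way to see what a small ε buys is to read the cost function literally as cost(n) = c·n + c·ε·n², with c the per-team cost. That concrete form, the unit cost, and the two ε values below are assumptions for illustration; the text only gives the asymptotic shape.

```python
def team_cost(n: int, eps: float, unit_cost: float = 1.0):
    """Return (linear part, quadratic part) of the assumed cost c*n + c*eps*n^2."""
    linear = unit_cost * n
    quadratic = unit_cost * eps * n ** 2
    return linear, quadratic


def crossover_teams(eps: float) -> float:
    """Team count at which the quadratic part equals the linear part (n = 1/eps)."""
    return 1.0 / eps


# Hypothetical coefficients: a well-modularized system vs. a tightly coupled one.
for eps in (0.01, 0.2):
    print(f"eps={eps}: coordination cost matches per-team cost at n = {crossover_teams(eps):.0f}")
    for n in (5, 20, 100):
        linear, quadratic = team_cost(n, eps)
        print(f"  n={n:>3}: linear={linear:.0f}, quadratic={quadratic:.0f}")
```

In this reading, the quadratic term overtakes the linear one at n = 1/ε. With ε = 0.01 that happens at 100 teams, so the company looks linear across any realistic size; with ε = 0.2 it happens at 5 teams.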

The difference between proactive QA and reactive QC companies is not just the size of ε; it’s small ε and no rework multiplier vs. large ε compounded by the rework feedback loop (see reactive QC below).

Reactive QC companies

Little to no proactive QA. Quality is mostly controlled, not assured. Testing is often done “after” development and frequently struggles to cope with delivery. Dev and tester mental models drift apart; testing becomes “guess what the developer meant” rather than “verify what we all agreed on”.

Domain knowledge still exists in people’s heads, even if processes don’t leverage it formally. The individual tester still knows “this area is risky”, but they can’t act on it structurally.

Scaling when adding n teams is much harder:

Rework feedback loop (superlinear): defects found → rework → rework introduces new defects at rate r. Total defect resolution effort is multiplied by ~1/(1-r). As system complexity grows, r increases (fixes are more likely to break something else), so the multiplier itself grows. This is a positive feedback loop: scaling is superlinear and can blow up as r approaches 1. Formally: if base defect count scales as O(n) and each fix injects r new defects, total effort ≈ O(n / (1-r)). Since r is itself an increasing function of system complexity (which grows with n), the denominator shrinks as n grows, making the effective scaling superlinear.
Automated test maintenance (superlinear): writing new tests may be O(n), but maintenance grows with system interconnectedness. Each new feature can break existing tests due to changing assumptions, not bugs. The fragility grows with the number of cross-module dependencies, which scales as O(n²) in the worst case.
Manual testing (superlinear): combinatorial state explosion. If the system has k interacting features, the state space grows combinatorially. Each new team adds features that interact with existing ones; testing the interactions grows faster than the number of features.
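The 1/(1-r) multiplier in the rework loop is the sum of a geometric series: each round of fixes leaves behind r times as many new defects as it resolved. The sketch below checks the closed form against an explicit round-by-round sum. The defect counts and r values are hypothetical, and the model assumes r stays constant within a single run.

```python
def rework_multiplier(r: float) -> float:
    """Closed form 1/(1-r): total fixes per original defect, valid for 0 <= r < 1."""
    if not 0.0 <= r < 1.0:
        raise ValueError("r must be in [0, 1); at r >= 1 the loop never converges")
    return 1.0 / (1.0 - r)


def total_fixes(initial_defects: float, r: float, rounds: int = 1000) -> float:
    """Sum fixes round by round: each round fixes the current defects and injects r of them back."""
    total = 0.0
    current = float(initial_defects)
    for _ in range(rounds):
        total += current
        current *= r
    return total


# Hypothetical values: 100 initial defects at increasing injection rates.
for r in (0.1, 0.5, 0.9):
    print(f"r={r}: simulated={total_fixes(100, r):.1f}, closed form={100 * rework_multiplier(r):.1f}")
```

Moving r from 0.1 to 0.5 raises the multiplier from about 1.1 to 2, while moving it from 0.5 to 0.9 raises it to 10, which is the blow-up near r = 1 described above.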

Using the same framework as proactive QA: cost is also O(n + εn²), but ε is much larger because there are no architectural or process investments to suppress the quadratic term. Additionally, the rework loop adds a multiplicative factor 1/(1-r(n)) on top, making the actual cost closer to O((n + εn²) / (1-r(n))), which accelerates sharply as r approaches 1.
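A rough numeric comparison of the two regimes, using the formulas above. Everything concrete here is an assumption: the ε values, the linear shape of r(n), its starting point, slope, and cap are invented to show the shape of the curves, not measured from any company.

```python
def rework_rate(n: int, r0: float = 0.10, slope: float = 0.02, cap: float = 0.95) -> float:
    """Hypothetical r(n): the defect injection rate rises with team count, capped below 1."""
    return min(cap, r0 + slope * n)


def proactive_cost(n: int, eps: float = 0.01) -> float:
    """n + eps*n^2 with a small eps and no rework multiplier."""
    return n + eps * n ** 2


def reactive_cost(n: int, eps: float = 0.2) -> float:
    """(n + eps*n^2) / (1 - r(n)) with a larger eps and the rework loop on top."""
    return (n + eps * n ** 2) / (1.0 - rework_rate(n))


for n in (1, 5, 10, 20, 40):
    p, q = proactive_cost(n), reactive_cost(n)
    print(f"n={n:>2}: proactive={p:7.1f}  reactive={q:8.1f}  ratio={q / p:5.1f}")
```

Under these assumptions the gap is not a constant factor: the ratio between the two cost curves itself grows with n, because both the larger ε and the shrinking denominator work in the same direction.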

Scaling walls don’t just appear in team cost structures; they also appear inside testing itself.

Historical evidence: the “testing-after” model was already hitting ceilings

Even before AI, as systems grew more complex, deterministic testing repeatedly hit combinatorial walls. Each time, the industry responded by inventing a new paradigm that accepted incompleteness. Each paradigm is a historical marker of a specific scaling wall being hit:

Paradigm	Year	What hit the wall	The wall	Response
Fuzzing	1988	Input space	Can’t enumerate all inputs to a program	Sample randomly from input space, observe what breaks
Property-based testing	2000	Scenario space	Can’t enumerate all combinations of valid inputs and states	Specify invariants, let the framework generate cases automatically
Chaos engineering	2011	Interaction/failure space	Can’t enumerate all failure combinations in distributed systems	Inject random failures in production, observe emergent behavior

This is the same principle behind risk-based E2E testing: it’s infeasible to test every user journey, so teams prioritize critical flows, accepting incompleteness in exchange for tractability.

The common move: when enumeration becomes intractable, shift from deterministic verification to probabilistic exploration. This is the Monte Carlo method applied to testing: when you can’t compute the integral analytically, you sample.
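A minimal sketch of that move, using only the standard library. The function under test, `is_leap_buggy`, is a hypothetical example with a deliberately planted bug, and the trusted oracle is `calendar.isleap`. This is plain random sampling against an oracle, not a full fuzzer or property-based testing framework.

```python
import calendar
import random


def is_leap_buggy(year: int) -> bool:
    """Hypothetical implementation under test: forgets the divisible-by-400 rule."""
    return year % 4 == 0 and year % 100 != 0


def find_counterexample(check, oracle, lo, hi, samples, seed=0):
    """Sample inputs uniformly at random; return the first one where check and oracle disagree."""
    rng = random.Random(seed)  # fixed seed so a failing run can be replayed
    for _ in range(samples):
        x = rng.randint(lo, hi)
        if check(x) != oracle(x):
            return x
    return None  # no disagreement found; this is evidence, not proof, of correctness


# A billion candidate years is too many to enumerate in a test run, but about
# 1 input in 400 triggers the bug, so a few thousand samples are very likely enough.
bad_year = find_counterexample(is_leap_buggy, calendar.isleap, 1, 10**9, samples=20_000)
print("counterexample:", bad_year)
```

Sampling replaces enumeration, but two things are still supplied by a human: the oracle that defines “correct”, and the judgment about whether a reported counterexample matters.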

These paradigms did not replace traditional testing, but supplemented it. Companies still write unit tests AND do chaos testing. Each paradigm emerged as a new layer because the previous layers couldn’t reach certain kinds of defects. The testing stack keeps growing.

All three paradigms, despite accepting incompleteness in enumeration, still relied on human comprehension as their foundation:

Fuzzing requires humans to recognize which crashes matter and diagnose root causes
Property-based testing requires humans to specify the invariants (which requires understanding what “correct” means)
Chaos engineering requires humans to interpret system behavior under failure and design resilience patterns

This is the critical setup for the AI transition: the pre-AI testing world was already evolving toward probabilistic approaches because deterministic enumeration couldn’t scale.

But every new paradigm still assumed that humans understood the system deeply enough to design the tests, interpret the results, and act on the findings. AI-accelerated development threatens precisely this foundation.
