Research Question

Part of QA in the Age of AI-Accelerated Development. Pre-requisite: Lifecycle drift.

In the previous pages, both debts emerge at the generation point, and the pre-AI compensation layer turns into ratification rather than correction. The shape of any solution depends on the asymmetry between the two debts.

The two debts are not equally non-negotiable.

Intent is absolute. Without the “why”, you cannot define success, assess the product, or decide what to build next. Lose the intent and you are not building a product; you are just generating code. Intent debt manifests not when things break, but when decisions need to be made, and at that point, no amount of comprehension can compensate for not knowing why the system exists.

Comprehension is conditional. You need enough of it to supervise (catch when agents go wrong), intervene (debug when things break), and evolve (make architectural decisions about the system’s future). But you might not need line-by-line understanding of generated code, or the level of comprehension you would have if you wrote it yourself. Practical experience with AI-assisted pairing (where a domain expert and an engineer collaborate while delegating implementation to an agent) suggests that teams can operate with partial comprehension and full intent, provided the right collaboration structures are in place.

A worked example. Consider a payment refund flow.

Scenario A: full intent, degraded comprehension. The team knows the refund policy precisely. Refunds over $500 require manager approval. Fraud cases bypass the manager and route to legal. Partial refunds for subscription disputes prorate to the day. The implementation was generated by an agent six months ago, and no one on the current team has read the code in detail. A bug surfaces: a customer was refunded twice. The team re-runs the flow with logging, compares actual behaviour to the policy they understand, decides whether the fix is “patch this branch” or “redesign the deduplication”. They debug slowly, but they recover. The product still does the right thing for new customers in the meantime, because the intent is intact and the team can supervise.

Scenario B: full comprehension, fuzzy intent. Same codebase, two years later. The team that made the original decisions has rotated out. The A/B test records are gone. The compliance document that cited the $500 threshold has been superseded. The current team has read every line and understands the implementation perfectly. What they have lost is why some branches exist. Why does the fraud-case refund bypass the manager? Was that a compliance requirement, an old ops decision, or a hack from an A/B test that was never cleaned up? Why does the partial refund prorate to the day rather than the hour? Does anyone still need that, or was it for a customer who left two years ago? The team cannot answer “is this still doing the right thing?” without intent. They cannot decide whether to keep, refactor, or remove. They cannot evaluate whether new requirements (a regulatory change, a product redesign) are compatible with what is already there. The codebase becomes a museum of decisions whose context is gone.

Degraded comprehension is a cost of speed and risk. Debugging is slower, regressions more likely, but the system continues to function and decisions remain possible. Lost intent is a cost of direction. You do not know whether to keep, change, or rebuild. The first is recoverable. The second compounds.

When agents produce the code, how much comprehension the human team needs depends on two things: how reliable the agent is (how often it produces subtly wrong code the team must catch) and how well intent is preserved in artifacts the team can consult rather than re-derive from code. If intent is well-preserved and agents are highly reliable, the team needs less comprehension. If agents are unreliable or intent is fuzzy, the team needs more.

Current evidence (Bastani 2025, Shen & Tamkin 2026, SlopCodeBench 2026) places today’s agents outside the “highly reliable” regime: code looks right but is subtly wrong, quality degrades across iterations, humans overestimate agent output. A team today cannot rely on agent reliability to reduce their comprehension investment. Agent reliability is driven by LLM provider research rather than by process change; this research treats it as a variable rather than something it aims to improve.

The research question follows:

Is there a configuration of people, process, and AI agents that can:

prevent intent debt from accumulating during construction
keep comprehension sufficient for supervision, intervention, and evolution
produce human-authored anchors for the lifecycle artifacts that compensate for knowledge loss (docs, ADRs, tests, onboarding)
keep those anchors coherent as the system is maintained, extended, and teams rotate

This is a design space, not a fixed position. Each condition maps to a claim from the analysis: construction-time debt prevention (intent absolute, comprehension conditional), lifecycle artifact production (breaking the generative ratification loop), and lifecycle maintenance (surviving turnover and extension).

The only family of response that can satisfy these four conditions is prevention at construction time. Appraisal approaches (inspection of agent output, AI-generated tests of agent code) operate after the debt has already accumulated and cannot, by construction, prevent it. They remain useful as complementary layers but they cannot satisfy the four conditions above. The proposal advances one prevention-at-construction candidate, aiming to satisfy all four conditions at a cost the organisation can bear (per the economics of testing framework). It remains a hypothesis, not a settled answer. The evaluation assesses representative appraisal products against these same four conditions.

← Lifecycle drift

Next: Proposal →