References and Context

Annotated references for the AI-era testing research. Each entry includes a brief note on its relevance to the analysis and the evaluation.

Foundational testing theory

Howden, W. E. (1978). Theoretical and empirical studies of program testing. IEEE Transactions on Software Engineering, SE-4(4), 293–298. DOI: 10.1109/TSE.1978.231514. The paper that introduced the test oracle concept into the software testing literature: an oracle is whatever mechanism (a person, a specification, a redundant computation) can determine whether a test result is correct. Cited in analysis-pre-ai.md as the canonical source for the “oracle concept”, used in the broader QA-discourse sense: transferable professional heuristics that let a tester reason by analogy from other systems and domains and recognise correctness without project-specific instruction.

Cognitive science of learning by doing

Slamecka, N. J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4(6), 592–604. PsycNet record. The seminal study establishing the generation effect: across five experiments using cued and free recall, recognition, and confidence ratings, items that subjects generated themselves were remembered better than items they merely read. Cited in analysis-pre-ai.md as the foundational evidence that producing information yields better retention than passively receiving it: the mechanism underlying why pre-AI human-in-the-loop development produced “free” domain knowledge accumulation, and why delegating code generation to agents removes it.
McCurdy, M. P., Viechtbauer, W., Sklenar, A. M., Frankenstein, A. N., & Leshikar, E. D. (2020). Theories of the generation effect and the impact of generation constraint: A meta-analytic review. Psychonomic Bulletin & Review, 27(6), 1139–1165. DOI: 10.3758/s13423-020-01762-3; PubMed. Meta-analytic review across 126 articles (310 experiments, 1,653 effect estimates) examining seven prominent theoretical accounts of the generation effect. Confirms the effect is robust across measures of item and context memory, and that generation constraint moderates it (less constrained generation produces a stronger effect, particularly in cued and free recall). Cited in analysis-pre-ai.md for the size and stability of the evidence base behind the generation effect.
Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). MIT Press. The original formulation of the desirable difficulties framework: the conditions that produce the best performance during practice are not the conditions that produce the best learning. Difficulties that introduce variation, interleaving, spacing, or generation force deeper encoding and retrieval. Learners systematically misjudge their own learning because re-exposure feels easier than retrieval, and what feels easier is mistaken for learning.
Bjork, R. A., & Bjork, E. L. (2020). Desirable difficulties in theory and practice. Journal of Applied Research in Memory and Cognition, 9(4), 475–479. DOI: 10.1016/j.jarmac.2020.09.003. The 2020 update, summarising 25+ years of empirical work and clarifying which difficulties are desirable (those that engage retrieval and reconstructive processes) versus undesirable (those that overwhelm). Cited in analysis-pre-ai.md alongside the 1994 chapter as the formal grounding for “the cognitive effort of doing something yourself is the learning”.

Software economics

Boehm, B. W. (1981). Software Engineering Economics. Prentice-Hall. Reference page. The book that established the cost-of-change curve, drawn from data on Waterfall projects at TRW and IBM in the 1970s. The cost of fixing a defect grows exponentially as it progresses through requirements, design, code, testing, and production stages: up to 100x for large projects, around 4x for small ones. Cited in analysis-agents.md as the pre-AI baseline for how cost-of-defect scales by lifecycle stage, against which the AI-era reshape is described.

Testing under velocity pressure (pre-AI)

Sabourin, R., “Just-In-Time Testing”. Scribd. Frameworks for testing under time pressure: bug flow, test triage, test idea generation, and developer collaboration. Pre-AI work, but it documents the industry’s existing velocity mismatch problem. Relevant as prior art: its coping mechanisms (triage, prioritization) address throughput, not the comprehension gap that AI introduces.

Industry responses to AI-accelerated development

Harman, M. (2026), “The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It”. Meta Engineering Blog. Proposes JiTTests: LLM-generated tests created per code change, using mutation testing plus LLM assessors. The approach eliminates test maintenance and adapts to each change. Strong Direction 2 response: it makes testing-after automated and per-change. Addresses row 1 and partially rows 2, 5, 6; misses rows 3, 4, 7, 8 (feedback loops, shared understanding, business domain, intent preservation). See evaluation for full assessment.
Borg, M. et al. (2026), “Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability”. arXiv. Two-phase controlled experiment with 151 professional developers: in Phase 1, participants added features with or without GitHub Copilot; in Phase 2, new developers evolved that code without AI. The study found no significant differences in comprehensibility or maintainability. Our analysis predicts a different result for agentic AI than for autocomplete, for two reasons. (1) Mode of authorship. In Copilot-style autocomplete the human is still the author: they read each suggestion, accept or reject it, and integrate it into their mental model. The generation effect described in the analysis still operates because cognitive engagement is preserved. In agentic mode the agent produces blocks autonomously and the human reviews at a much coarser grain or not at all; the generative engagement is what gets removed. The comprehension gap in row 2 of our analysis is specifically about that loss, not about the presence of AI per se. (2) Statistical power. Phase 2 of the study had n=75, with power of about 0.17 for small effects, so even within autocomplete the study is underpowered for the moderate effect sizes our model predicts. Until a comparable study runs against agentic AI, the gap in evidence is real and the prediction stands as a prediction, not a finding.
Anthropic (2026), Claude Code Review. TechCrunch, docs. Multi-agent automated PR review: specialized agents for logic errors, boundary conditions, API misuse, auth flaws, and project conventions. Priced at $15–25 per review. Pure Direction 1: addresses row 1 (code volume) and partially row 5 (test independence); misses rows 2, 3, 4, 6, 7, 8 (comprehension, feedback loops, shared understanding, novel failure modes, business domain, intent preservation). Commercially significant: the company selling code-generating agents also sells the inspection tool, making the trust tax explicit. See evaluation for full assessment.
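
Note on the statistical-power point in the Borg et al. entry above. The following is a rough sketch, not a reproduction of the paper's own power analysis. It assumes a two-sided comparison of means between two equal-sized independent groups at alpha = 0.05, a normal approximation to the test statistic, and Cohen's conventional effect sizes (d = 0.2, 0.5, 0.8). The function names are hypothetical, and the paper's actual design and test may differ, so the numbers need not match its reported power.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(n_total: int, d: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison of means.

    Assumption: two equal-sized independent groups and a normal
    approximation; d is Cohen's d (standardised mean difference).
    """
    z = NormalDist()
    n_per_group = n_total / 2
    # Expected value of the test statistic when the true effect is d.
    ncp = d * sqrt(n_per_group / 2)
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


def n_total_for_power(d: float, target: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest even total sample size that reaches the target power."""
    n = 4
    while two_sample_power(n, d, alpha) < target:
        n += 2
    return n


if __name__ == "__main__":
    for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
        print(
            f"{label:6s} d={d}: power at n=75 ~ {two_sample_power(75, d):.2f}, "
            f"total n for 80% power ~ {n_total_for_power(d)}"
        )
```

Under these assumptions, a sample of 75 falls short of the conventional 0.8 power threshold for both small and medium effects, which is the sense in which a null result at this sample size is weak evidence of no effect.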

Empirical evidence on AI and developer learning

Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26), e2422633122. DOI: 10.1073/pnas.2422633122; open-access PMC. Field experiment with nearly 1,000 high school math students. Access to “GPT Base” (a standard GPT-4 chat interface) improved practice performance by 48%, but reduced unassisted assessment scores by 17% relative to controls who never had access. A “GPT Tutor” condition with teacher-designed prompting safeguards mitigated the harm. Cited in analysis-agents.md and analysis-research-question.md as direct empirical support for the comprehension-debt prediction: when access to AI is removed, learners who relied on it underperform learners who did the work themselves.
Shen, J. H., & Tamkin, A. (2026). How AI impacts skill formation. arXiv:2601.20245. arXiv; Anthropic blog summary. Randomised controlled trial with 52 software engineers learning Trio (a Python async library new to them). AI-assisted engineers completed the task in roughly the same time as controls but scored 50% on a follow-up comprehension quiz versus 67% for the hand-coding group, with the largest gaps in debugging and conceptual understanding. Engineers who used AI for conceptual questions maintained their comprehension; engineers who delegated to AI did not. Cited in analysis-agents.md, analysis-research-question.md, and proposal.md as the cleanest empirical evidence to date that the mode of AI use (delegation vs. dialogue) determines whether comprehension debt accrues.
Becker, J., Rush, N., Barnes, E., & Rein, D. (2025). Measuring the impact of early-2025 AI on experienced open-source developer productivity. METR. arXiv:2507.09089. METR blog; arXiv. Within-subject randomised study of 16 experienced developers on 246 real issues in their own large open-source repositories (averaging more than 22k stars and 1M lines of code). When allowed to use AI (primarily Cursor Pro with Claude 3.5/3.7 Sonnet), tasks took 19% longer than when not allowed, with the 95% confidence interval ranging from +2% to +39%. Developers afterwards estimated AI made them 20% faster, the opposite of the measured direction. Cited in analysis-agents.md as evidence for the perception/reality gap that lets comprehension debt accumulate invisibly.

Empirical evidence on AI feedback loops and degradation

Shumailov, I. et al. (2024), “AI models collapse when trained on recursively generated data”. Nature 631, 755–759 (DOI: 10.1038/s41586-024-07566-y); preprint as “The Curse of Recursion: Training on Generated Data Makes Models Forget”, arXiv:2305.17493. Demonstrates model collapse: training LLMs on data generated by previous LLMs causes “irreversible defects in the resulting models, in which tails of the original content distribution disappear”, degrading the model’s ability to generate diverse high-quality output. Cited in analysis: lifecycle drift as the most prominent instance of a broader pattern in which self-referential AI loops, lacking an external corrective signal, compound their own errors. Our generative ratification loop is the artifact-generation member of that pattern. The mechanisms differ (training-data distribution drift versus artifact-chain error propagation), but the meta-pattern is the same.
Orlanski, G. et al. (2026), “SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks”. arXiv:2603.24755. Language-agnostic benchmark: 20 problems, 93 checkpoints, 11 models. Tracks two trajectory-level signals as agents iteratively extend their own prior solutions: verbosity (redundant or duplicated code) and structural erosion (complexity mass concentrated in high-complexity functions). Findings: agent code degrades monotonically (erosion in 80% of trajectories, verbosity in 89.8%); compared with 48 open-source Python repositories, agent code is 2.2x more verbose; tracking 20 of those repositories over time shows human code stays flat while agent code deteriorates with each iteration; prompt interventions improve initial quality but do not halt degradation. Cited in analysis: lifecycle drift as empirical support for the longitudinal symptom of the generative ratification loop. The paper measures code-level consequences rather than the artifact-generation loop itself, so the evidence is consistent with the loop rather than a direct demonstration of it.