Technology Utilization Architecture

Optimal workflow architecture for an individual knowledge worker given the current AI / agent / automation toolset. Not "use AI" but the specific choreography — which tools for which cognitive operations, where human judgment is essential vs. bottleneck, and how to structure the feedback loops.

This is the topic running through the LLM Iterate pipeline. The question is not “does AI help”; that has been answered (15–55% productivity gains on well-bounded tasks, replicated across ~25 RCTs). The question is which workflow architecture maximizes output quality per unit of human attention, given that the binding constraint has shifted from production throughput to metacognitive load.

Stage 1 (lit review) maps three layers: a mature HCI / decision-science literature on appropriate reliance and complementarity; a classical foundation in cognitive systems engineering being re-imported (Bainbridge 1983, Klein et al. 2004, Hollnagel & Woods 2005); and a practitioner stack (Mollick, Karpathy, Anthropic, Cognition, Claude Code) that runs 12–18 months ahead of peer review. The headline finding across the empirical record is that workflow architecture predicts outcomes more reliably than which frontier model you use.

Stage 2 (topology) is the dependency graph. Three foundational assumptions (attention-as-binding-constraint, jagged frontier exists, verification cost is comparable to generation cost) carry most of the inferential weight. Six crux nodes are where collapse propagates farthest. The graph also encodes how practitioner frameworks and academic theory map onto the same underlying structure.

Stages 3–5 will formalize the workflow choreography into a parameterized model (capability-by-operation × verification-cost × autonomy level → routing decision), test it against the available task-level evidence, and ship a small interactive tool for individual workflow design.

The research landscape on optimal human-AI workflow design for individual knowledge workers — three layers (HCI/decision-science, classical cognitive systems engineering, practitioner stack), 25+ RCTs, the metacognitive bottleneck reframing, and the load-bearing assumptions any formalization must survive.

TLDR

The research landscape on optimal human-AI workflow design for individual knowledge workers spans three largely disconnected layers: a mature HCI/decision-science literature on reliance and complementarity, a classical foundation in cognitive systems engineering now being re-imported, and a practitioner literature that drives terminology 12–18 months ahead of peer review. The single most important empirical finding across ~25 RCTs (2023–2026) is that workflow architecture predicts outcomes more reliably than model capability — how you structure human-AI interaction (centaur vs. cyborg, independent-then-synthesize, guardrailed vs. unfettered) matters more than which frontier model you use. A corollary is that the binding constraint on AI-augmented knowledge work is not production throughput but the metacognitive bottleneck: planning what to delegate, verifying outputs, and maintaining calibration on where AI fails.

The empirical record shows 15–55% productivity gains on well-bounded tasks, robust skill-leveling effects for novices, but contested and sometimes negative results for experts on open-ended judgment work. A major unresolved puzzle is that these micro-RCT gains coexist with a precisely estimated zero effect on aggregate labor-market outcomes at two-year horizons (Humlum & Vestergaard 2025). The “ironies of automation” from 1983 human-factors research replay exactly in LLM contexts: AI that handles routine work degrades human capacity to catch the rare errors where human judgment is critical (Simkute, Tankelevitch et al. 2024/2025). Practitioner frameworks (Mollick’s centaur/cyborg typology, Karpathy’s autonomy slider, Anthropic’s agent-design patterns, Cognition’s context-engineering principles) are converging toward a common architecture but lack formal integration.

The largest theoretical gap is that no unified normative framework exists for the individual knowledge worker’s daily AI workflow choreography — when to consult, delegate, verify, or refuse AI on a per-task basis. The closest academic scaffolding is Parasuraman, Sheridan & Wickens’ (2000) function-allocation model, but it was designed for system engineers, not individual users. The practitioner world has de facto answers (autonomy sliders, AI sandwiches, compound engineering), and the academic world has converging threads (CoALA cognitive architectures, HAIJCS joint cognitive systems, Tankelevitch’s metacognitive demands framework), but no synthesis yet integrates them. That integration — a formal, empirically grounded model of human-AI cognitive partnership that maximizes output quality per unit of human attention — is the field’s open frontier and the target of the next phase of this project. Six load-bearing assumptions underlying this landscape are identified in Section 12, along with the specific evidence that would flip each one — these define the risk surface for any formalization attempt.

1. Formal Models of Human-AI Task Allocation

The deepest formal literature concerns “learning to defer” (L2D) and human-AI complementarity — algorithms that decide, per instance, whether the AI or the human should handle a task. The canonical formulations (Madras et al. 2018 NeurIPS; Mozannar & Sontag 2020 ICML; Mozannar et al. 2023 AISTATS, arXiv:2301.06197) established that optimal joint human-AI assignment is computationally hard and that naive heuristic approaches systematically underperform. Wilder, Horvitz & Kamar (2020, IJCAI) operationalized “learning to complement humans” by training models end-to-end against team accuracy.

Why this matters for workflow design: These are formal proofs that the intuitive approach — “use AI when it’s better, use humans when they’re better” — is not a well-specified decision rule. Optimal allocation requires modeling the joint performance surface, not comparing solo accuracies.
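A toy calculation can make the gap concrete. The sketch below uses invented per-instance probabilities of a correct answer (all numbers and names are hypothetical, not drawn from the cited papers) and compares "pick the better solo agent" with an idealized per-instance assignment. Real L2D methods have to learn the deferral rule from data rather than read it off known probabilities, so this is an upper-bound illustration only.

```python
# Hypothetical per-instance probabilities that each agent answers correctly.
# These numbers are invented for illustration only.
INSTANCES = [
    {"p_ai": 0.95, "p_human": 0.60},
    {"p_ai": 0.90, "p_human": 0.70},
    {"p_ai": 0.40, "p_human": 0.85},
    {"p_ai": 0.30, "p_human": 0.90},
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def solo_accuracy(agent_key):
    """Expected accuracy if one agent handles every instance."""
    return mean(inst[agent_key] for inst in INSTANCES)


def best_solo_accuracy():
    """The intuitive rule: compare solo accuracies and use the winner everywhere."""
    return max(solo_accuracy("p_ai"), solo_accuracy("p_human"))


def per_instance_accuracy():
    """Idealized deferral: each instance goes to whoever is likelier to be right."""
    return mean(max(inst["p_ai"], inst["p_human"]) for inst in INSTANCES)
```

With these made-up numbers the better solo agent (the human) reaches about 0.76, while per-instance assignment reaches 0.90. The difference is visible only when performance is modeled instance by instance.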

The most actionable synthesis is Hemmer et al.’s “Complementarity in Human-AI Collaboration” (2025, EJIS), which distinguishes “complementarity potential” (the mathematical possibility of exceeding either agent alone) from “complementary team performance” (actually achieving it). They identify information asymmetry and capability asymmetry as the two sources. The uncomfortable finding: complementary team performance is rarely empirically observed despite decades of theoretical promise. Amin et al.’s (2026) Bayesian framework adds a behavioral explanation: “correlation neglect,” where humans treat AI advice as independent evidence despite shared training data, can make AI advice anti-augmentative.

Vaccaro, Almaatouq & Malone’s (2024, Nature Human Behaviour) meta-analysis provides the closest thing to a quantitative allocation rule: human-AI combinations help most when (a) humans alone outperform AI, (b) the task is creation rather than decision-making, and (c) AI handles sub-tasks rather than the whole task.

2. Appropriate Reliance, Trust Calibration, and Verification Cost

Bansal et al. (2021, CHI) established the canonical finding: AI explanations increase acceptance regardless of correctness — they do not produce complementary performance. Buçinca, Malaya & Gajos (2021, arXiv:2102.09692) showed that “cognitive forcing functions” (commit to your own answer before seeing AI) reduce overreliance, but only for users high in Need for Cognition — creating intervention-generated inequality.

The major reframing came from Vasconcelos et al. (2023, arXiv:2212.06823) and Fok & Weld (2023, arXiv:2305.07722): overreliance is a rational cost-benefit choice, not a cognitive defect. People engage with verification only when it is cheap relative to the expected payoff. This produced a methodological pivot from “outcome-graded” to “strategy-graded” reliance metrics. Buçinca et al.’s (2024/2025, CHI) offline-RL approach learns adaptive per-instance policies for what kind of AI support to provide.
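The cost-benefit reading can be written as a one-line decision rule. The sketch below is an illustrative assumption, not a formula from the cited papers: the function name, the parameters, and the simple expected-value form are all hypothetical.

```python
def should_verify(p_error: float, cost_of_error: float,
                  verification_cost: float, catch_rate: float = 1.0) -> bool:
    """Verify only when the expected loss avoided exceeds the cost of checking.

    p_error            subjective probability that the AI output is wrong
    cost_of_error      loss incurred if a wrong output goes uncaught
    verification_cost  effort spent checking, in the same units as the loss
    catch_rate         probability that checking actually catches an error
    """
    expected_loss_avoided = p_error * catch_rate * cost_of_error
    return expected_loss_avoided > verification_cost


# Same error rate and stakes, different checking costs (numbers are invented).
expensive_check = should_verify(p_error=0.05, cost_of_error=100, verification_cost=10)
cheap_check = should_verify(p_error=0.05, cost_of_error=100, verification_cost=2)
```

Under this rule, skipping verification when checking is expensive is the expected-value-maximizing choice rather than a lapse, which is the sense in which overreliance can be rational.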

The practical design implication: minimize verification cost, not maximize explanation quality. Confidence indicators and linguistic uncertainty markers shift reliance more reliably than feature-importance explanations. Microsoft Research’s 2024 synthesis endorses this framing for generative AI.

A newly identified failure mode: sycophancy in feedback loops. Randazzo et al. (HBS WP 26-021, 2026) document that when professionals push back on incorrect AI output, the AI escalates persuasive justification rather than disclosing uncertainty, sometimes flipping correct human judgments to incorrect ones.

3. The Metacognitive Bottleneck and Ironies of Generative AI

This section covers what is arguably the most important reframing in the 2024–2026 literature. Horvitz’s (1999, CHI) twelve principles of mixed-initiative interfaces and Amershi et al.’s (2019, CHI) 18 guidelines for human-AI interaction remain the design base layer.

Tankelevitch, Sarkar, Sellen, Rintel et al. (CHI 2024 Best Paper, arXiv:2312.10893) introduced the metacognitive demands framework: GenAI reduces cognitive load on production but increases metacognitive load — planning goals, evaluating outputs, monitoring confidence, and deciding when to use AI at all. The optimization target shifts from throughput to metacognitive efficiency.

Simkute, Tankelevitch, Kewenig, Scott, Sellen & Rintel’s “Ironies of Generative AI” (2024/2025, IJHCI, arXiv:2402.11364) directly bridged Bainbridge’s 1983 “Ironies of Automation” to GenAI. They identify four GenAI-specific productivity losses that mirror classical automation ironies: (1) the shift from creative production to supervisory demands, (2) workflow disruptions breaking established rhythms, (3) frequent task interruptions from AI suggestions, and (4) a polarization effect where simple tasks become easier but complex ones become harder. Their proposed mitigations — continuous feedback, personalization, ecological interface design, clear task allocation — echo Bainbridge almost exactly, suggesting the field is rediscovering rather than advancing.

The CHI 2025 “Tools for Thought” workshop synthesis (Tankelevitch et al. 2025, arXiv:2508.21036) consolidates the MSR research program’s position: knowledge work is shifting from production to critical integration — decisions about when and how to use AI, how to frame tasks, and how to assess outputs. Sarkar’s “Friction-Induced AI” concept adds deliberate intervention points to improve verification short-term and prevent skill atrophy long-term.

Mozannar, Bansal, Fourney & Horvitz’s CUPS taxonomy (CHI 2024, arXiv:2210.14306) provides the empirical anatomy for coding specifically: programmers using Copilot spend large amounts of time verifying and thinking about AI suggestions. Verification time is the hidden tax, and it is substantial.

4. The Empirical Productivity Record (25+ RCTs, 2023–2026)

Stable findings. Generative AI yields 15–55% productivity gains on well-defined knowledge tasks. Time-savings are large and robustly replicated; quality effects are smaller and more variable. The headline studies:

Brynjolfsson, Li & Raymond (2023/2025, QJE): 5,172 customer-support agents. +15% average, +34% for novices, ~0% for top performers.
Noy & Zhang (2023, Science): 453 professional writers. 40% time reduction, 18% quality lift.
Peng et al. (2023): GitHub Copilot RCT, +55.8% task completion speed.
Cui et al. (2025, Management Science): 4,867 developers, +26% tasks/week. But a 2025 longitudinal case study found experienced developers gained less and sometimes slowed down (arXiv:2509.20353).
Dell’Acqua, Mollick et al. (2023/2025, HBS): 758 BCG consultants. +25% speed and +40% quality on inside-frontier tasks; 19-percentage-point quality drop on outside-frontier tasks. This study coined the “jagged technological frontier” concept.

Contested findings.

Does AI help experts? The skill-leveling pattern breaks down for open-ended judgment. Otis et al. (2024): 640 Kenyan entrepreneurs over 5 months, with high-baseline +15–20% and low-baseline –8–10%. METR (2025): 16 experienced open-source developers were 19% slower with AI in their own repos, despite predicting 24% speedup. Likely resolution: the bottleneck differs by task type, namely execution speed (where AI levels) vs. judgment/filtering (where AI amplifies those who already can filter).

Human+AI vs. AI alone. Goh et al. (2024, JAMA Network Open): GPT-4 alone outscored physicians + GPT-4 on diagnostic vignettes. But Everett et al. (2025): an “independent-then-synthesize” workflow eliminated the underperformance. Workflow architecture, not model capability, explains the discrepancy.
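The ordering that an “independent-then-synthesize” workflow implies can be sketched structurally. This is a minimal illustration under stated assumptions, not the protocol used in the cited study: the function names, the `Synthesis` record, and the exact-match agreement check are all hypothetical, and the `human`, `ai`, and `reconcile` callables stand in for whatever actually produces each answer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Synthesis:
    human_answer: str
    ai_answer: str
    agree: bool
    final: str


def independent_then_synthesize(
    case: str,
    human: Callable[[str], str],
    ai: Callable[[str], str],
    reconcile: Callable[[str, str, str], str],
) -> Synthesis:
    # Step 1: the human commits to an answer before any AI output is shown.
    human_answer = human(case)
    # Step 2: the AI answers the same case without seeing the human answer.
    ai_answer = ai(case)
    # Step 3: only now are the two compared; disagreement triggers an
    # explicit reconciliation step instead of silent deference to either side.
    agree = human_answer == ai_answer
    final = human_answer if agree else reconcile(case, human_answer, ai_answer)
    return Synthesis(human_answer, ai_answer, agree, final)
```

The design choice being illustrated is the ordering: neither answer can anchor the other, and disagreement becomes an explicit event to resolve.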

Long-term cognitive effects. Bastani et al. (PNAS 2025): AI boosted in-session math performance 48–127% but produced 17% worse unassisted performance afterward, unless AI was guardrailed to give hints rather than answers. Lee, Sarkar et al. (CHI 2025): among 319 knowledge workers, higher confidence in AI correlates with less critical thinking enacted.

The aggregate puzzle. Humlum & Vestergaard (2025, NBER 33777): 25,000 Danish workers across 11 exposed occupations, with a precisely estimated zero effect on earnings or hours at two-year horizons. This is the field’s largest unresolved tension: micro-RCT productivity does not translate to aggregate productivity. Possible mechanisms: task reorganization, weak wage pass-through, substitution effects, cross-task productivity bundling (Cowen 2026).

Methodological caveat: The Toner-Rodgers (2024) materials-discovery study (+44% novel materials) was publicly disavowed by MIT in May 2025 following data-integrity concerns. Widely cited but should not be treated as established fact.

5. Interaction Modes: Centaur, Cyborg, Self-Automator

Mollick’s three-mode taxonomy is now empirically grounded:

Centaurs maintain clean human/AI role separation, handing off discrete tasks based on frontier mapping. Cyborgs intertwine human and AI continuously at sub-task granularity. Randazzo, Lifshitz et al. (HBS WP 26-036, 2026) added the self-automator: full delegation with periodic oversight. Empirical distribution across 244 BCG consultants: ~60% cyborg, ~30% centaur, ~10% self-automator.

Schoenegger, Park, Karger & Tetlock’s superforecasting study (2024/2025, ACM TiiS, link) found both well-calibrated and deliberately overconfident GPT assistants improved forecasting accuracy 23–43% — suggesting much of the centaur gain comes from forced structured reasoning rather than AI advice quality. Combined with the historical chess record, this raises the question of whether the centaur advantage is a transient regime that disappears when AI exceeds humans on the full task, or a permanent feature of asymmetric cognitive strengths.

6. Automation Levels and Autonomy Frameworks

Parasuraman, Sheridan & Wickens’ (2000) four-function × ten-level model remains the cleanest formal scaffolding. Their four automation functions (information acquisition, information analysis, decision/action selection, action implementation) map directly onto modern LLM workflow stages (RAG/retrieval, synthesis/analysis, recommendation, tool use/code execution). Yet no one has formally re-operationalized this for LLMs.
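As a rough illustration of what a re-operationalization might start from, the sketch below encodes the four functions, the LLM stage each is paired with in the paragraph above, and a per-function automation level on a 1–10 scale. The class and variable names and the example levels are hypothetical assumptions, not part of the original model.

```python
from enum import Enum


class AutomationFunction(Enum):
    INFORMATION_ACQUISITION = "information acquisition"
    INFORMATION_ANALYSIS = "information analysis"
    DECISION_SELECTION = "decision/action selection"
    ACTION_IMPLEMENTATION = "action implementation"


# Pairing taken from the surrounding text; the wording of the stages is informal.
LLM_STAGE = {
    AutomationFunction.INFORMATION_ACQUISITION: "RAG / retrieval",
    AutomationFunction.INFORMATION_ANALYSIS: "synthesis / analysis",
    AutomationFunction.DECISION_SELECTION: "recommendation",
    AutomationFunction.ACTION_IMPLEMENTATION: "tool use / code execution",
}


def validate_profile(profile):
    """Check that every function has an automation level between 1 and 10."""
    missing = [f for f in AutomationFunction if f not in profile]
    if missing:
        raise ValueError(f"missing levels for: {[f.value for f in missing]}")
    for function, level in profile.items():
        if not 1 <= level <= 10:
            raise ValueError(f"{function.value}: level {level} is outside 1..10")
    return profile


# Hypothetical profile: heavy automation of retrieval, light automation of action.
example_profile = validate_profile({
    AutomationFunction.INFORMATION_ACQUISITION: 8,
    AutomationFunction.INFORMATION_ANALYSIS: 6,
    AutomationFunction.DECISION_SELECTION: 3,
    AutomationFunction.ACTION_IMPLEMENTATION: 2,
})
```

The point of a per-function profile is that one workflow can be highly automated at one stage and tightly human-controlled at another, rather than sitting at a single level.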

The 2023–2026 wave: Morris et al.’s “Levels of AGI” (DeepMind, 2023, arXiv:2311.02462) separates capability from autonomy across six levels. Feng, McDonald & Zhang’s “Levels of Autonomy for AI Agents” (2025, arXiv:2506.12469) defines five user-centered roles (Operator → Collaborator → Consultant → Approver → Observer) and is the most directly applicable to individual workflow design. Anthropic’s “Measuring AI Agent Autonomy in Practice” (2025/2026) surveys five competing frameworks empirically.

Shneiderman’s Human-Centered AI 2D framework explicitly rejects the “more automation = less control” assumption — high automation and high human control can coexist (cameras, GPS, modern IDEs). This is a crucial conceptual move for knowledge work, where the goal is high-autonomy AI with high human oversight, not a trade-off between them.

The classical critique tempering all level-talk: Dekker & Woods’ “MABA-MABA or Abracadabra?” (2002) — automation does not merely replace human work, it transforms it. The substitution myth is alive in current LLM discourse. Every sub-task offloaded to AI creates new monitoring, verification, and coordination work.

7. Cognitive Operation Taxonomies and Task-to-Tool Mapping

Bloom’s revised taxonomy (Remember → Understand → Apply → Analyze → Evaluate → Create × factual/conceptual/procedural/metacognitive) is the most-imported cognitive framework in 2024–2026 LLM research. Empirically, LLM capability decays sharply up the Bloom hierarchy: BloomAPR (Ma et al. 2025, arXiv:2509.25465) found ~81% success at Remember-level tasks, dropping to 43% at Apply and 13–41% at Analyze. Lee et al.’s CHI 2025 survey explicitly used Bloom’s levels to show GenAI shifts cognitive labor from lower-order production to higher-order verification, integration, and stewardship.

Cognitive Task Analysis (CTA) methods (Crandall, Klein & Hoffman 2006 Working Minds; Militello & Hutton’s ACTA) remain conspicuously underutilized. CTA is the canonical method for understanding what a knowledge worker actually does cognitively before allocating sub-tasks to AI, yet almost no production agent design uses it. Klein et al.’s macrocognition framework (sensemaking, problem detection, mental projection, coordination) is similarly absent despite obvious fit.

The cleanest bridge between classical cognitive architectures and LLM agents: Sumers, Yao, Narasimhan & Griffiths’ CoALA framework (2024, TMLR, arXiv:2309.02427), mapping LLM agents onto modular memory (working/episodic/semantic/procedural), structured action spaces, and decision cycles drawn from ACT-R and SOAR.

The practitioner world has a de facto task-routing approach that academia hasn’t formalized. Paterson’s (2026, link) empirical benchmark of 15 models across 38 real daily tasks concluded that “routing beats model selection” — the generating function is a dispatch table matching task type to tool, not a single best model. This echoes Power’s DSS taxonomy (model-/data-/knowledge-/document-/communication-driven systems) but is grounded in per-task empirical measurement rather than a priori categorization.
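A dispatch table of this kind is easy to sketch. The task names, tool names, and scores below are invented placeholders rather than results from the cited benchmark; the sketch only shows how a table built from per-task measurements compares with committing to a single best tool.

```python
# Hypothetical per-task quality scores for two tools (invented numbers).
SCORES = {
    "summarize": {"tool_a": 0.90, "tool_b": 0.70},
    "refactor_code": {"tool_a": 0.60, "tool_b": 0.85},
    "extract_table": {"tool_a": 0.50, "tool_b": 0.80},
}


def build_dispatch_table(scores):
    """Map each task type to the tool that scored highest on that task."""
    return {task: max(tools, key=tools.get) for task, tools in scores.items()}


def routed_mean(scores, table):
    """Average score when every task goes to its dispatched tool."""
    return sum(scores[task][tool] for task, tool in table.items()) / len(table)


def best_single_tool_mean(scores):
    """Average score of the best tool if one tool must handle every task."""
    tools = next(iter(scores.values())).keys()
    return max(sum(scores[task][tool] for task in scores) / len(scores)
               for tool in tools)


DISPATCH = build_dispatch_table(SCORES)
```

With these placeholder scores, the routed average is 0.85 against roughly 0.78 for the best single tool. The table has to be rebuilt whenever the measurements change, which is why the measurement step matters more than any particular entry.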

8. Practitioner Frameworks and Emerging Workflow Architectures

Practitioner literature is now driving the discipline. This section maps the most influential frameworks and where they converge.

Mollick’s centaur/cyborg/jagged frontier and his book Co-Intelligence (2024) function as the dominant practitioner vocabulary. His four rules (always invite AI, be the human in the loop, give it a persona, assume this is the worst AI you’ll ever use) are the closest to a widely adopted practitioner heuristic set.

Karpathy’s framework (Software 1.0/2.0/3.0, jagged intelligence, anterograde amnesia, the autonomy slider, generator-verifier loop) gives precise vocabulary for coding workflows. The autonomy slider, instantiated in Cursor’s Tab → Cmd+K → Agent Mode progression, is the clearest practitioner instantiation of what academic autonomy taxonomies describe abstractly: a per-action user control surface.

Anthropic’s agent-design patterns: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. Their multi-agent research system showed orchestrator-worker patterns outperformed single-agent Claude Opus by 90.2% at ~15× token cost. Their context-engineering guide formalizes compaction, just-in-time retrieval, structured memory, and subagent isolation.

Cognition’s context-engineering principles center on the read/write asymmetry: multi-agent works for read-heavy tasks (research) but breaks for write-heavy tasks (code) unless writes are serialized. This is now consensus across Anthropic, LangChain, and Cognition.
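The asymmetry can be shown with a small asyncio sketch: read-only work fans out concurrently, while edits to shared state go through a single writer one at a time. The function names and the stand-in "subagent" are hypothetical; no real agent framework is being modeled.

```python
import asyncio


async def research(topic: str) -> str:
    """Stand-in for a read-only subagent call; it touches no shared state."""
    await asyncio.sleep(0)
    return f"notes on {topic}"


def apply_writes_serially(document: str, edits) -> str:
    """Single writer: each edit sees the document left by the previous one."""
    for edit in edits:
        document = edit(document)
    return document


async def run(topics, edits, document: str = ""):
    # Reads are independent, so they can safely run in parallel.
    notes = await asyncio.gather(*(research(topic) for topic in topics))
    # Writes depend on each other's results, so they are applied in order.
    final_document = apply_writes_serially(document, edits)
    return list(notes), final_document
```

A call such as `asyncio.run(run(["topic one", "topic two"], [lambda d: d + "x"]))` returns the gathered notes and the final document. Parallelizing the write step instead would let two edits start from the same stale document, which is the failure the serialization rule guards against.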

Coding-agent workflow patterns have stabilized around three approaches: (1) continuous pairing (Cursor — taxes attention, preserves flow), (2) batch delegation (Devin — reduces presence, adds re-entry cost), and (3) spec-driven development (Harper Reed’s spec → plan → execute loops, Amazon Kiro, GitHub Spec Kit). Claude Code’s documented harness (CLAUDE.md memory, writer/reviewer, test-then-code) is the most comprehensive single-tool pattern.

Personal knowledge management + AI (Karpathy’s “LLM Wiki,” Obsidian + Claude Code patterns from Eric Ma and others) converges on: plain-text-as-substrate, persistent context files teaching personal taxonomy, reusable commands, inbox → process → integrate → review lifecycle. Every’s Dan Shipper articulates this as the “AI Sandwich” (humans frame and review; AI does the middle) and “Compound Engineering” (plan → work → review → compound).

Key practitioner insight lacking academic formalization: Cowen’s cross-task productivity bundling — per-task speedups don’t translate proportionally to aggregate productivity because related tasks are productivity-linked. This connects directly to the Humlum & Vestergaard aggregate-zero puzzle.

9. Adjacent Fields: Imported, Underutilized, and Ripe for Bridging

Joint Cognitive Systems / Cognitive Systems Engineering (Hollnagel & Woods 2005) reframes human + LLM as a single coupled system. Klein, Woods, Bradshaw, Hoffman & Feltovich’s “Ten Challenges for Making Automation a Team Player” (2004) has become the most-cited pre-LLM paper in 2024–2026 agent design. Its requirements (Basic Compact, mutual models, predictability, directability, observability, goal negotiation, attention management, common-ground repair) function as the checklist for what an agent teammate needs. Xu & Gao’s (2024, Interactions) HAIJCS framework is the cleanest bridge from CSE to LLM human-AI teaming.

Distributed cognition (Hutchins 1995) is being imported by Hutchins himself (Paris IAS 2024) and by Tao An’s “Cognitive Workspace” (2025, arXiv), which grounds LLM context management in Baddeley’s working memory model. The extended mind thesis (Clark & Chalmers 1998) has been explicitly extended to LLMs by Smart, Clowes & Clark in Synthese 2025. Tong’s 2026 survey (arXiv) synthesizes the full Licklider–Engelbart–Clark lineage through to modern human-AI symbiosis.

Engelbart’s H-LAM/T (1962) — Human using Language, Artifacts, Methodology, Training — is the most under-imported framework. It required co-evolution of all four components; current AI rollouts ship the artifact (the model) while methodology and training lag. Treating H-LAM/T as a literal rollout checklist would discipline most AI deployments.

Other underutilized resources: Power’s DSS taxonomy for classifying AI tools by purpose; Nonaka’s SECI cycle being extended to human-AI knowledge creation (Böhm & Durst 2025, Matsumoto et al.); Personal Information Management (Jones, Bergman & Whittaker) providing taxonomies for the “AI second brain” movement; Endsley’s situation awareness model extended to human-AI teams in her own 2023 paper; and Wickens’ Multiple Resource Theory, which would predict tool-stack attention overload (running Cursor + ChatGPT + a meeting simultaneously) but is absent from AI workflow research.

10. Key Researchers, Labs, and Thought Leaders

Microsoft Research has the deepest portfolio: Horvitz, Kamar, Amershi, Liao, Tankelevitch, Rintel, Sarkar, Bansal, Mozannar, Buçinca. The “Tools for Thought” research program and the associated CHI 2025 workshop are the most concentrated effort on AI-augmented knowledge work. Stanford HAI covers empirical reliance work. MIT CSAIL + Sloan/D³ drives both formal L2D theory (Sontag, Mozannar) and field experiments (Dell’Acqua, Lakhani). Harvard SEAS/D³ hosts Buçinca, Gajos. Wharton/HBS bridges practice and research (Mollick, Lifshitz-Assaf, Kellogg). CMU HCII, UW/AI2 (Weld, Fok), and Stanford Digital Economy Lab (Brynjolfsson) round out the empirical work.

Adjacent-field bridgers: Wei Xu (HAIJCS), Smart/Clowes/Clark (extended mind), Endsley (SA), Klein/Bradshaw/Hoffman/Feltovich (CSE/NDM), Tao An (cognitive workspace), Tong (augmentation→symbiosis).

Practitioner thought leaders with formalized frameworks: Mollick (Wharton), Karpathy (independent), Willison (independent; coined the canonical agent definition), Schluntz & Zhang (Anthropic), Yan & Cognition Labs, Chase (LangChain), swyx (Latent Space), Shipper & Klaassen (Every), Cowen (GMU), and the Microsoft New Future of Work Report team.

11. Open Questions, Contested Ground, and Unfilled Gaps

Stable consensus: Explanations alone don’t yield complementary performance. AI helps novices most on well-defined tasks. Cognitive forcing reduces overreliance with equity caveats. AI homogenizes outputs at the population level. Verification cost is the binding constraint. Workflow architecture predicts outcomes better than model choice.

Genuinely contested:

Whether AI helps experts. Skill-leveling (Brynjolfsson, Noy) vs. skill-amplifying (Otis high-baseline finding) vs. net-negative (METR). Likely resolution: the bottleneck differs — execution speed (AI levels skill) vs. judgment/filtering (AI amplifies whoever can already filter).
Micro-to-macro translation. 15–55% RCT gains coexisting with Humlum & Vestergaard’s aggregate zero. Possible explanations: task reorganization absorbs time savings, cross-task bundling (Cowen), weak wage pass-through, measurement artifacts.
Long-term cognitive effects. Bastani’s skill-atrophy evidence vs. Brynjolfsson’s accelerated learning curves. The guardrail design matters more than the binary of AI access vs. none.
Human+AI vs. AI alone in expert domains. Goh’s medical finding that AI alone wins vs. Everett’s workflow-architecture fix. The claim appears workflow-dependent, not capability-dependent.
Persistence of the centaur regime. Chess history + Schoenegger’s findings suggest centaur advantages may be transient as AI capability crosses task thresholds.

What hasn’t been formalized:

No normative framework for individual daily workflow choreography (when to consult, delegate, verify, refuse) — the Parasuraman (2000) equivalent for end users rather than system designers.
Context engineering remains a practitioner discipline without academic theory.
Multi-tool attention allocation lacks quantitative models despite Wickens’ MRT being directly applicable.
The interaction between agent autonomy level and metacognitive load is under-theorized.
Long-term skill formation under continuous AI use lacks longitudinal data (most studies ≤6 months).
Direct workflow-architecture comparison RCTs are rare; the field needs more studies structured like Everett 2025 and Bastani’s guardrailed vs. unguardrailed designs.
The feedback loop between task routing, skill development, and frontier migration over time (as you get better at using AI, the frontier shifts, changing optimal allocation) has no formal model.

12. Load-Bearing Assumptions and What Would Flip Them

Any formalization built on this landscape will inherit certain assumptions. Making them explicit now disciplines the next phase.

Crux 1: “Workflow architecture > model capability.” This is the document’s central claim. It’s supported by Dell’Acqua (inside vs. outside frontier), Everett (workflow fix restoring physician+AI performance), and the general pattern that the same model yields very different outcomes under different interaction designs. But this claim is load-bearing on a specific regime: one where capability differences between frontier models are small relative to design differences between workflows. If a model capability jump is large enough that even naive workflows dramatically outperform expert workflows on current models, this claim inverts. What would flip it: A capability discontinuity (not incremental improvement) that eliminates the jagged frontier for a broad task class. The evidence base comes from a narrow window (2023–2025) of similar-capability frontier models — the claim may not survive a regime change.

Crux 2: “The metacognitive bottleneck is the binding constraint.” This assumes production has been sufficiently automated that the bottleneck has shifted upward to planning, evaluation, and calibration. But for many knowledge workers, production is still the bottleneck — they lack the time, skill, or tool access to make AI-assisted production easy. The metacognitive framing may describe elite power users, not the median worker. What would flip it: Evidence that the majority of knowledge workers are production-constrained rather than metacognition-constrained, even with AI access. Lee et al.’s CHI 2025 survey (319 workers) partially supports the metacognitive framing, but the sample skews toward workers who already use AI regularly.

Crux 3: “The jagged frontier is mappable and relatively stable.” Design principle #1 says “map your personal jagged frontier task-by-task.” This assumes the frontier is stable enough to calibrate against. But if model capabilities shift every 3–6 months, the frontier migrates faster than a user can recalibrate. What would flip it: Evidence that frontier migration rate exceeds human calibration rate — that by the time you’ve learned where GPT-4 fails, GPT-5 has moved the boundary. The likely resolution is that the frontier has stable topological features (AI is reliably good at X-type tasks, reliably bad at Y-type) even as the boundary shifts, making the shape mappable even if the exact edge is volatile. This is an empirical question the field hasn’t tested.

Crux 4: “Verification cost is the binding constraint on appropriate reliance.” The rational-cost-benefit reframing (Vasconcelos, Fok & Weld) is load-bearing on approximate rationality: that people correctly estimate when verification is worth the effort. What would flip it: Evidence that people are systematically miscalibrated about AI error rates (which the sycophancy finding from Randazzo et al. directly suggests), in which case the binding constraint isn’t verification cost but verification calibration. The distinction matters for design: reducing cost helps a rational actor; improving calibration helps a miscalibrated one. These are different interventions.
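The difference between the two interventions shows up in a toy model where the user decides on a perceived error rate rather than the true one. Everything here is an illustrative assumption: the expected-value rule, the function name, and the numbers are hypothetical.

```python
def chooses_to_verify(perceived_p_error: float, cost_of_error: float,
                      verification_cost: float) -> bool:
    """The user acts on what they believe the error rate is, not what it is."""
    return perceived_p_error * cost_of_error > verification_cost


TRUE_P_ERROR = 0.20        # invented: how often the AI is actually wrong
PERCEIVED_P_ERROR = 0.02   # invented: what an overtrusting user believes
COST_OF_ERROR = 100

# Baseline: checking costs 10, and the overtrusting user skips it.
baseline = chooses_to_verify(PERCEIVED_P_ERROR, COST_OF_ERROR, 10)

# Intervention A: halve the verification cost. Behavior does not change,
# because the perceived expected loss (2) is still below the cost (5).
cheaper_checking = chooses_to_verify(PERCEIVED_P_ERROR, COST_OF_ERROR, 5)

# Intervention B: correct the belief and leave the cost at 10. The user now
# verifies, because the expected loss (20) exceeds the cost.
better_calibration = chooses_to_verify(TRUE_P_ERROR, COST_OF_ERROR, 10)
```

In this toy setting, cost reduction only changes behavior for a user whose beliefs are already roughly right, which is why the two levers need to be modeled separately.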

Crux 5: “The individual knowledge worker is the right unit of analysis.” The entire document frames optimization at the individual level. But if the Humlum & Vestergaard aggregate-zero puzzle is explained by organizational dynamics (task reallocation, managerial absorption of time savings, coordination costs), then individual workflow optimization is locally optimal but globally insufficient. The right unit might be the team or the value chain. What would flip it: Evidence that individually optimized AI workflows produce organizational friction (e.g., faster individual output creating review bottlenecks downstream, or AI-homogenized outputs reducing team diversity of thought). The Anderson et al. (2024) homogenization finding and the organizational-absorption explanation for aggregate-zero both point in this direction.

Crux 6: “The centaur/cyborg/self-automator taxonomy is durable.” It may instead be a transient artifact of current tool limitations. As tools evolve toward seamless human-AI blending (real-time co-editing, ambient AI, continuous context), the discrete modes may dissolve into a continuum. The taxonomy’s value for formalization depends on whether the modes capture something structurally real about cognitive coupling or merely describe current interface affordances. What would flip it: Evidence that as tool integration deepens, the behavioral distinction between centaur and cyborg disappears — users naturally slide between modes within a single task rather than choosing one.

Adversarial Challenge to the Project Framing

Strongest objection: “You’re trying to formalize a workflow architecture for a system where one of the components (AI capability) changes faster than any formal model can track. By the time you’ve mapped the frontier, built the model, and tested it, the frontier has moved. The practitioner literature is ahead precisely because it doesn’t try to formalize — it adapts via heuristics and rapid iteration. The academic aspiration to a ‘unified normative framework’ is a category error: this is an engineering problem requiring adaptive heuristics, not a science problem requiring formal models.”

Why this objection is partially right: The objection correctly identifies that any model parameterized on a specific capability profile (GPT-4 is good at X, bad at Y) will go stale within months. Fixed allocation rules are doomed. The practitioner instinct to stay adaptive is sound.

Why the strongest version of the project survives it: Even in rapidly changing systems, structural invariants exist that a formal model should capture. The metacognitive bottleneck doesn’t disappear when models improve — it shifts to new decisions. The verification cost trade-off doesn’t change shape when models get better — the threshold moves. The automation ironies are structural properties of any delegation relationship between a principal and an imperfect agent. What a formal model should capture is the generating function — the invariant structure that produces the right allocation given any capability profile — not a specific allocation for a specific model. The model should be parameterized by capability, not dependent on a fixed capability level. This is exactly the difference between Parasuraman’s (2000) framework (which has lasted 26 years despite massive automation changes) and any specific automation allocation table (which goes stale quickly). The target is a model that says “here is how to decide what to delegate” — not “here is what to delegate.”
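The lookup-table versus generating-function distinction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model Stage 3 will formalize: the `Task` fields, the capability scores, and the cost rule (always verify the AI draft, redo the task by hand when the draft fails) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    operation: str       # hypothetical label for the cognitive operation
    do_cost: float       # human minutes to do the task unaided (assumed)
    verify_cost: float   # human minutes to check an AI draft (assumed)

# Lookup-table style: an allocation frozen at one model snapshot.
STATIC_TABLE = {"summarize": "delegate", "analyze": "human"}

def route(task: Task, capability: dict[str, float]) -> str:
    """Allocation rule that takes the capability profile as an input.

    capability maps operation -> probability the AI draft is usable.
    Assumed cost of delegating: always pay verify_cost, and redo the
    task by hand with probability (1 - p).
    """
    p = capability.get(task.operation, 0.0)
    delegate_cost = task.verify_cost + (1.0 - p) * task.do_cost
    return "delegate" if delegate_cost < task.do_cost else "human"

analysis = Task(operation="analyze", do_cost=60.0, verify_cost=20.0)
old_profile = {"summarize": 0.9, "analyze": 0.3}   # made-up numbers
new_profile = {"summarize": 0.95, "analyze": 0.6}  # made-up numbers

print(STATIC_TABLE["analyze"])        # human, whatever the model can do
print(route(analysis, old_profile))   # human
print(route(analysis, new_profile))   # delegate
```

Under these assumptions the rule reduces to delegating when `verify_cost < p * do_cost`. When the capability profile changes, the same rule yields a new allocation, whereas the static table has to be rewritten by hand.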

13. Design Principles Supported by the Current Evidence

The empirical and theoretical record converges on a set of actionable principles for designing an individual knowledge worker’s AI workflow:

Map your personal jagged frontier task-by-task. Outside-frontier AI use is actively harmful, so the first design decision is calibrating which sub-tasks are inside and outside for you specifically. This frontier is personal (varies by expertise) and dynamic (shifts with practice and model updates).
Match interaction mode to task structure. Centaur (clean handoff) for tasks with verifiable checkpoints. Cyborg (interleaved) for creative or ill-structured work. “Independent-then-synthesize” for high-stakes expert judgment.
Minimize verification cost rather than maximizing AI capability. The binding constraint is verification, not generation. Design for structured outputs, confidence signals, and cheap-to-check formats.
Insert deliberate friction at decision points. Cognitive forcing functions (form your own view before seeing AI output) reduce overreliance. Sarkar’s “Friction-Induced AI” concept shows this can be built into tool design.
Treat context engineering as the central craft. Practitioner consensus: the binding constraint is not model intelligence but what context the model operates in. CLAUDE.md files, system prompts, persistent memory, and structured instructions are higher-leverage than model selection.
Route tasks, don’t pick a single best tool. Paterson’s empirical result (“routing beats model selection”) is the practitioner instantiation of learning-to-defer (L2D) theory. Build a personal dispatch table matching task types to tools.
Preserve skills with guardrails. Hint-only AI in learning contexts. AI-free zones for capabilities you need to maintain. Bastani’s guardrailed-AI design prevented skill atrophy while preserving performance gains.
Serialize writes in agentic systems. Cognition’s read/write asymmetry: multi-agent is powerful for research/analysis but breaks for code/document production unless writes are serialized.
Watch for the metacognitive bottleneck. The limiting resource in AI-augmented work is no longer effort but judgment and attention allocation. Tankelevitch’s framework suggests optimizing for metacognitive efficiency, not throughput.
Budget for automation ironies. Every sub-task delegated to AI creates new monitoring, verification, and coordination work. Simkute et al.’s four productivity-loss categories are predictable and designable-against.
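The "serialize writes" principle in the list above can be sketched as a fan-out of concurrent read-only workers followed by a single writer. This is a hedged illustration only: `research` and `build_report` are hypothetical names, and the sleep stands in for a model call.

```python
import asyncio

async def research(topic: str) -> str:
    """Stand-in for a read-only sub-agent; the sleep simulates latency."""
    await asyncio.sleep(0.01)
    return f"notes on {topic}"

async def build_report(topics: list[str]) -> list[str]:
    # Read phase: sub-agents run concurrently and never touch the report.
    findings = await asyncio.gather(*(research(t) for t in topics))

    # Write phase: a single writer applies results one at a time, in order.
    report: list[str] = []
    for finding in findings:
        report.append(finding)
    return report

report = asyncio.run(build_report(["pricing", "competitors", "churn"]))
print(report)
```

`asyncio.gather` returns results in the order the tasks were passed in, so the written sections stay in topic order even though the reads overlap in time.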


Dependency graph of the lit review. 66 nodes typed across nine classes (foundational assumptions / methods / empirical claims / logical necessities / generating mechanisms / synthesis / practitioner frameworks / open questions / distortions); seven cruxes (A1, A2, A3, A6, L1, L2, L3 — every weight-5 A node + every weight-5 L node) and four variant views (Vulnerability / Flow / Minimal / Capability-regime). The genuinely novel structural move for this topic: encoding the practitioner ↔ academic operationalization bridge that runs 12–18 months out of phase.

TLDR

The lit review documents the research landscape on optimal human-AI workflow design. This topology asks the sharper question: what depends on what? Strip the field down to its load-bearing structure and the picture is surprisingly clean. Four foundational assumptions sit upstream of most of the empirical and synthesis nodes — that human attention is the binding scarce resource (A1), that AI capability is heterogeneous across cognitive operations (the jagged frontier; A2), that verification cost is comparable to or cheaper than generation cost (A3), and that individual-level workflow optimization aggregates upward rather than being absorbed by organizational dynamics (A6). If any one of them flipped, large regions of the picture would have to be rebuilt. Three logical guardrails (L1 substitution myth is wrong; L2 optimal allocation needs the joint performance surface, not solo accuracies; L3 parameterize by capability) cannot be falsified at all — they can only be ignored, which is exactly how most overconfident AI-rollout discourse proceeds. Everything else is methodology, empirical claim, generating mechanism, synthesis, practitioner framework, open question, or distortion vector.

The genuinely novel structural move for this topic is the practitioner ↔ academic bridge that the new node type P and the new edge type op make explicit. The practitioner stack (Mollick, Karpathy, Anthropic, Cognition, Claude Code) and the academic literature address the same structural problems but with different methodologies and on different timescales — and the relationship is bidirectional, not one-directional. Three patterns recur on the cross-community bridge. (a) Practitioners ahead of academic measurement: Mollick proposed the centaur/cyborg typology in 2024; Randazzo 2026 (HBS) is the first peer-reviewed empirical measurement of the behavior distribution (60/30/10) the typology names. (b) Practitioners concretizing prior academic principles: Karpathy’s autonomy slider concretizes Shneiderman’s 2D framework as a per-action user control surface; the AI Sandwich and Compound Engineering loop (Shipper / Every) concretize Tankelevitch’s metacognitive-demands framework as a daily workflow practice. (c) Practitioner and academic work converging in parallel without direct lineage: Karpathy’s autonomy slider (per-action UX) and Feng 2025’s five-level academic taxonomy (per-task user roles) converged on the same discretized-levels-of-autonomy shape independently, at different granularities; Anthropic’s agent design patterns (chain / route / parallelize / orchestrator-workers / evaluator-optimizer) address the same structural problem (joint allocation under capability heterogeneity) that L2D theory formalizes, but emerged from engineering practice rather than as L2D operationalizations. Note that some P-edges in the graph are practitioner-internal rather than cross-community — e.g., P6 (spec-driven development) → S4 (context engineering as central craft) and P7 (CLAUDE.md / context files) → S4 are both practitioner concrete-technique → practitioner-coined synthesis edges, not bridge edges, and the (a)/(b)/(c) classification doesn’t apply to them. 
Reading the topology through this mixed bridge — rather than as a single integrated literature — disciplines the Stage-3 formalization: the model should encode invariants tracked across both communities (autonomy levels, verification gates, context structure, joint-allocation logic) parameterized by inputs the academic literature measures (capability gap, verification cost, skill-formation goals), without privileging either community as the source of authority.

The field’s weakest links are not where the popular discourse focuses. Mainstream debate contests “does AI help,” but at the well-bounded-task level the productivity record is robust (E1: 15–55% gains across ~25 RCTs). The actual fragile zones in 2026 are: (a) the aggregate-zero puzzle (E4 / O2 attack A6): Humlum & Vestergaard’s precise zero across 25,000 Danish workers sits in tension with every micro-RCT result and is direct evidence against the individual-aggregation assumption that the entire individual-level frame depends on; (b) whether the centaur regime persists (E18 + S5): Schoenegger’s finding that even a deliberately overconfident GPT assistant improves forecasting suggests much of the gain comes from forced structured reasoning, not advice quality, which raises the question of whether the centaur advantage survives a capability discontinuity; (c) whether the frontier is mappable faster than it migrates (O4), a calibration race condition the field hasn’t tested; (d) whether the binding reliance constraint is verification cost or verification calibration (O7), since the two call for different design interventions; (e) long-term cognitive effects of continuous AI use (O3): Bastani’s 17% unassisted-performance drop and the broader skill-atrophy literature point to a real risk, but the longitudinal data window is still under twelve months for most studies.

This topology is the input to model formalization (Stage 3). The cleanest target is a parameterized routing function: for each (task type, capability profile, verification cost, autonomy level) tuple, produce an allocation decision that maximizes expected output quality per unit of human attention. The four variant views below (Vulnerability / Flow / Minimal / Capability-regime) read the same graph through different lenses to discipline that formalization choice — the capability-regime variant in particular sorts every node into stale-on-jump / structurally invariant / regime-dependent, which directly tells the model formalization which terms must be parameters (the regime-dependent ones) and which can be invariants (the stable ones).

The graph

[Interactive graph: all 66 nodes and their dependencies. Legend: node types are Assumption, Method, Empirical, Logical, Mechanism, Synthesis, Practitioner, Open, and Distortion; crux nodes carry a halo; node size reflects load-bearing weight (1–5). Clicking a node shows its claim, status, and load-bearing weight; hovering an edge shows the relation type. The variant toggles read the same graph through different lenses.]

How to read this graph

Every node in the lit review collapses to one of nine types. Edges between them carry one of eight relations. Together they make the structure inspectable.

Node types

Code	Type	What it is
A	Foundational assumption	A claim the field cannot operate without; if false, large downstream regions collapse
M	Methodological prerequisite	A study design or measurement approach that must work for the empirical claims to be testable
E	Empirical claim	A specific measured finding with an effect size and replication status
L	Logical necessity	Follows from definitions or algebra; not empirically refutable
G	Generating mechanism	A causal process that explains a pattern (metacognitive load, verification trade-off, ironies of automation)
S	Synthesis claim	An integrative statement combining multiple lower-level claims
P	Practitioner framework	A typology, slider, or pattern published by the practitioner stack ahead of academic formalization
O	Open question	Genuinely undecided with current methods or evidence
D	Distortion vector	Where motivated reasoning concentrates (typed by direction)

Edge types

Code	Edge	Meaning
dep	depends-on	If target collapses, source collapses
imp	implies	Logical implication
sup	empirically-supports	Evidence relation
conf	confounds / inflates	Artifact relationship
mod	moderates	Changes magnitude
op	operationalizes	Practitioner framework concretizes an academic claim or vice versa
corr	corrects	Workflow architecture corrects naive allocation
attacks	attacks	Distortion vector targets a specific node

Weight scale (load-bearing weight, 1–5)

5 — crux node; collapse propagates across multiple sections of the lit review
4 — load-bearing within a section
3 — important but local
2 — corroborating
1 — decorative
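The typing scheme above maps directly onto a small data structure. The sketch below is one possible encoding, not the site's actual implementation: the `Node` and `Edge` classes and the `cruxes` helper are hypothetical names, and the weights are the ones listed for the A and L classes in the node catalog. The crux rule is the one this stage states: every weight-5 A node plus every weight-5 L node.

```python
from dataclasses import dataclass

NODE_TYPES = {"A", "M", "E", "L", "G", "S", "P", "O", "D"}
EDGE_TYPES = {"dep", "imp", "sup", "conf", "mod", "op", "corr", "attacks"}

@dataclass(frozen=True)
class Node:
    id: str        # e.g. "A1"; the leading letter is the type code
    weight: int    # load-bearing weight, 1-5

    def __post_init__(self) -> None:
        if self.id[0] not in NODE_TYPES:
            raise ValueError(f"unknown node type in {self.id!r}")
        if not 1 <= self.weight <= 5:
            raise ValueError(f"weight out of range for {self.id!r}")

    @property
    def kind(self) -> str:
        return self.id[0]

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str  # one of EDGE_TYPES

    def __post_init__(self) -> None:
        if self.relation not in EDGE_TYPES:
            raise ValueError(f"unknown relation {self.relation!r}")

def cruxes(nodes: list[Node]) -> list[str]:
    """Crux rule: weight-5 nodes of type A or L."""
    return [n.id for n in nodes if n.weight == 5 and n.kind in {"A", "L"}]

# Weights for the A and L classes as listed in the node catalog.
nodes = [Node("A1", 5), Node("A2", 5), Node("A3", 5), Node("A4", 4),
         Node("A5", 4), Node("A6", 5), Node("L1", 5), Node("L2", 5),
         Node("L3", 5), Node("L4", 4)]
bridge = Edge("P2", "L4", "op")  # autonomy slider operationalizes Shneiderman 2D

print(cruxes(nodes))  # ['A1', 'A2', 'A3', 'A6', 'L1', 'L2', 'L3']
```

Deriving the crux set from type and weight, rather than listing it by hand, keeps it consistent with the catalog if a weight is later revised.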

1. Node catalog

Each node carries: type code · weight · short claim · key citation · status. Status flags: ✓ (robust/replicated), ~ (partial/qualified), ? (contested/open), ✗ (refuted, kept as historical reference).

A — Foundational assumptions

ID	Wt	Claim	Status
A1	5	Human attention is the binding scarce resource — once production is automated, the bottleneck shifts upward to planning, evaluation, calibration. (Tankelevitch 2024)	✓
A2	5	AI capability is heterogeneous across cognitive operations (the jagged frontier). (Dell’Acqua 2023/2025)	✓
A3	5	Verification cost is comparable to or cheaper than generation cost — otherwise rational engagement collapses. (Vasconcelos 2023; Fok & Weld 2023)	~
A4	4	Knowledge work decomposes into sub-tasks that can be selectively delegated. (Vaccaro 2024 sub-task finding)	✓
A5	4	The frontier has stable topological features even as the boundary shifts — i.e., it is mappable.	?
A6	5	Individual-level workflow optimization aggregates upward — gains aren’t fully absorbed by organizational dynamics (review bottlenecks, managerial reabsorption, coordination costs). The load-bearing assumption behind the entire individual-level framing. Humlum-Vestergaard aggregate-zero is direct evidence against.	?

M — Methodological prerequisites

ID	Wt	Claim	Status
M1	5	Randomized controlled trials of AI-augmented work. (~25 published 2023–2026)	✓
M2	4	Strategy-graded reliance metrics (vs. outcome-graded). (Vasconcelos / Fok & Weld pivot)	✓
M3	4	Field-deployed measurement (real repos, real meetings). (METR 2025)	~
M4	3	Telemetry / log-based behavioral observation. (Mozannar CUPS 2024)	✓
M5	3	Cognitive Task Analysis (CTA, ACTA). (Crandall, Klein & Hoffman 2006)	~

E — Empirical claims

ID	Wt	Claim	Status
E1	5	15–55% productivity gains on well-bounded knowledge tasks. (Brynjolfsson 2023; Noy 2023; Peng 2023; Cui 2025)	✓
E2	5	Skill-leveling: novices gain most, top performers near-zero — on well-defined tasks. (Brynjolfsson +34%/0%)	✓
E3	5	Outside-frontier AI use causes a 19-pp quality drop. (Dell’Acqua BCG study)	✓
E4	5	Aggregate labor-market effects are zero at 2-year horizons. (Humlum & Vestergaard 2025, 25k workers)	✓
E5	4	Explanations alone don’t yield complementary performance — they increase acceptance regardless of correctness. (Bansal 2021)	✓
E6	4	Cognitive forcing reduces overreliance — but only for high-Need-for-Cognition users. (Buçinca 2021)	✓
E7	4	On naive workflows, AI alone outperforms human+AI in expert domains. (Goh 2024, JAMA Network Open)	✓
E8	5	“Independent-then-synthesize” workflow restores complementarity in the same domain. (Everett 2025)	✓
E9	4	Unguardrailed AI in learning produces 17% worse unassisted post-session performance. (Bastani PNAS 2025)	✓
E10	3	Behavior distribution: ~60% cyborg, ~30% centaur, ~10% self-automator. (Randazzo HBS 26-036)	✓
E11	4	LLM capability decays sharply up Bloom’s hierarchy: ~81% Remember → 13–41% Analyze. (Ma 2025)	✓
E12	4	Verification time is a substantial fraction of total interaction in coding contexts. (Mozannar CUPS 2024)	✓
E13	4	Sycophancy escalation: AI flips correct human judgments to incorrect on pushback. (Randazzo HBS 26-021)	✓
E14	4	Multi-agent orchestrator-worker outperforms single-agent by 90.2% at ~15× token cost. (Anthropic 2024/2025)	✓
E15	4	Routing > model selection — task-to-tool dispatch beats single-best-model. (Paterson 2026)	✓
E16	3	Higher AI confidence correlates with less critical thinking enacted. (Lee et al. CHI 2025)	~
E17	4	Read/write asymmetry: multi-agent works for read-heavy tasks; breaks on write-heavy unless writes are serialized. (Cognition / Anthropic / LangChain consensus)	✓
E18	3	Both calibrated AND deliberately overconfident GPT assistants improve human forecasting +23–43%. (Schoenegger 2024/2025)	~

L — Logical necessities

ID	Wt	Claim	Status
L1	5	Substitution myth is wrong — every offload creates new monitoring/verification/coordination work. (Dekker & Woods 2002; Bainbridge 1983)	✓
L2	5	Optimal allocation requires modeling the joint performance surface, not solo accuracies. (Madras / Mozannar L2D theory)	✓
L3	5	Allocation model must be parameterized BY capability, not depend on FIXED capability — generating function vs. lookup table.	✓
L4	4	High automation and high human control can coexist. (Shneiderman 2D)	✓

G — Generating mechanisms

ID	Wt	Claim	Status
G1	5	Metacognitive bottleneck — load shifts from production to planning, evaluation, calibration. (Tankelevitch 2024)	✓
G2	5	Ironies of automation — AI handling routine work degrades human capacity to catch rare critical errors. (Bainbridge 1983; Simkute 2024)	✓
G3	5	Verification-cost trade-off — engagement is rational only when cheap. (Vasconcelos 2023)	✓
G4	5	Jagged frontier mechanism — capability heterogeneous; the boundary is personal and dynamic.	✓
G5	4	Correlation neglect — humans treat AI advice as independent evidence despite shared training data. (Amin 2026)	~
G6	3	Cognitive forcing — committing to a view first breaks anchoring.	✓
G7	4	Skill atrophy — capacities not exercised decay.	✓
G8	4	Cross-task productivity bundling — speedups bottlenecked by linked tasks. (Cowen 2026)	~
G9	4	Generator-verifier asymmetry — production cheap, checking expensive. (Karpathy)	✓
G10	3	Multi-tool attention interference — Wickens’ Multiple Resource Theory predicts tool-stack overload (Cursor + ChatGPT + meeting incurs cost beyond the sum of per-tool costs). Mechanism well-established; absent from AI workflow research.	~

S — Synthesis claims

ID	Wt	Claim	Status
S1	5	Workflow architecture predicts outcomes more reliably than model capability.	✓
S2	5	The optimization target is metacognitive efficiency, not throughput.	✓
S3	4	Knowledge work shifting from production to critical integration. (CHI 2025)	✓
S4	4	Context engineering is the central craft. (Anthropic / Cognition consensus)	✓
S5	3	Centaur advantage may be a transient regime.	?

P — Practitioner frameworks

ID	Wt	Claim	Status
P1	5	Mollick centaur / cyborg / self-automator typology.	✓
P2	4	Karpathy autonomy slider.	✓
P3	4	Anthropic agent design patterns (chain / route / parallelize / orchestrator-workers / evaluator-optimizer).	✓
P4	4	Cognition read/write asymmetry as agent-system principle.	✓
P5	3	Compound Engineering / AI Sandwich (Shipper, Every).	~
P6	3	Spec-driven development (spec → plan → execute).	~
P7	3	Personal context files (CLAUDE.md / system-prompt patterns).	✓

O — Open questions

ID	Wt	Claim	Status
O1	5	Does AI help experts on open-ended judgment?	?
O2	5	Why are aggregate effects zero given the micro-RCT record?	?
O3	5	Long-term cognitive effects under continuous AI use.	?
O4	4	Frontier migration vs. calibration rate.	?
O5	4	Right unit of analysis — individual vs. team vs. value chain.	?
O6	3	Centaur taxonomy durable, or interface artifact?	?
O7	4	Verification cost vs. verification calibration as binding constraint.	?

D — Distortion vectors

ID	Wt	Claim	Targets
D1	4	AI-maximalist distortion: RCT gains read as aggregate revolution; ignores Humlum-Vestergaard zero.	E4, S1, O2
D2	4	Productivity-only distortion: counts speed gains, ignores skill atrophy and metacognitive load.	G7, S2, E9
D3	3	“Just use the best model” distortion: ignores routing finding (E15) and architecture-over-capability evidence.	E15, S1, S4
D4	3	Practitioner-only distortion: dismisses formalization as category error.	L3, S1

2. Edge catalog (key chains, not exhaustive)

Foundation → Method. A2 → M1 (RCTs reveal the jagged frontier); A3 → M2 (strategy-graded metrics measure verification cost); A1 → M3 (field deployment shows real attention budget).

Method → Empirical. M1 produces the productivity record (E1–E9). M2 → E5 (Bansal explanations don’t help once strategy-graded). M2 → E12 (CUPS is the strategy-graded measure of coding). M3 → E2 (METR shows experts gain less in real repos).

Mechanism → Empirical. G1 → E12 (metacognitive load shows up as verification time). G2 → E13 (sycophancy is the rare critical error ironies-of-automation predicts will get missed). G3 → E5 (rational verification trade-off explains why explanations fail). G4 → E3 (jagged frontier produces outside-frontier harm). G7 → E9 (skill atrophy → unassisted-performance drop). G8 → E4 (cross-task bundling explains aggregate zero). G10 → E10 (cyborgs interleaving multiple tools should incur higher MRT interference than centaurs handing off discrete sub-tasks — testable but unmeasured prediction).

Empirical → Synthesis. E1, E2, E3, E8 → S1 (workflow architecture > model capability — the integrated headline). E12, E16 → S2 (metacognitive efficiency target). E11 → S3. E14, E15, E17 → S4 (context engineering as central craft). E18 → S5.

Empirical → Open. E2, E4 → O2 (the aggregate puzzle). E2 → O1 (helps experts?). E9 → O3 (long-term cognitive). E1, E3 → O4 (frontier migration). E4 → O5 (right unit). E13 → O7 (calibration vs. cost).

Logical guards. L1 → G2 (substitution myth → ironies of automation). L2 → S1 (joint surface needed). L3 → S4 (parameterize-by-capability is what makes context engineering generalize). L4 → P2 (Shneiderman 2D legitimates the autonomy slider).

Practitioner ↔ Academic (the central conceptual move of this topology, encoded as op edges). The relationship type varies — see TLDR para 2 for the (a)/(b)/(c) classification of cross-community bridge edges. P1 → E10 (Mollick named the typology; Randazzo 2026 measured the behavior distribution it predicts — pattern (a), academia retrospectively measures). P2 → L4 (Karpathy’s autonomy slider concretizes Shneiderman’s 2D framework as a per-action user control surface — pattern (b), prior academic principle made tangible), with the reverse edge L4 → P2 (Shneiderman legitimates the slider design). P3 → L2 (Anthropic’s agent design patterns and Madras / Mozannar L2D theory address the same structural problem — joint allocation under capability heterogeneity — but emerged independently — pattern (c), parallel convergence without direct lineage). P4 → E17 (Cognition’s read/write principle IS the design-actionable form of the read/write asymmetry finding — pattern (a) inverted: practitioners stated and measured it together). P5 → S2 (compound engineering / AI Sandwich applies Tankelevitch’s metacognitive-demands framework at the workflow level — pattern (b)). P6 → S4 and P7 → S4 are practitioner-internal, not bridge edges: S4 (context engineering as central craft) is itself a practitioner-coined synthesis (Anthropic / Cognition consensus), and P6 (spec-driven development) and P7 (CLAUDE.md / personal context files) are concrete practitioner techniques that operationalize that practitioner synthesis. The (a)/(b)/(c) classification doesn’t apply because both endpoints sit in the practitioner community.

Foundation → Foundation (the project-frame edges). A6 → S1 (S1 is meaningful only if individual optimization aggregates). E4 → A6 (aggregate-zero is the direct attack on A6). O2 → A6 and O5 → A6 (the open questions whose resolution will close A6 either way).

Distortion attacks. D1 → S1, E4 (treats RCT gains as aggregate proof; ignores Humlum). D2 → G7, S2, E9 (counts speed; ignores atrophy). D3 → E15, S1, S4 (ignores routing). D4 → L3, S1 (denies formalization possibility).

3. High-stakes nodes — by structural role

Six categories of structural role, sorted by how their failure modes propagate. The single most useful conceptual move is keeping the cruxes (inputs) separate from the headline (an output) and from the reframers (mechanisms whose magnitude is open). Conflating these three under a single label of “important findings” produces most of the bad-faith debate around AI workflow design.

3a. Foundational cruxes — collapse rebuilds regions

These are the input assumptions the entire individual-level framing rests on. Falsification doesn’t change interpretation; it forces rebuilding. All four are weight-5 foundational-assumption nodes (the A class).

A1 (human attention is the binding scarce resource). If false, throughput-optimization wins and the entire metacognitive-bottleneck framing dissolves; design priorities flip back toward maximum AI delegation. The metacognitive-bottleneck mechanism G1 is the consequence of A1 + production-automation, not a separate axiom — which is why G1 isn’t itself a crux.
A2 (jagged frontier — AI capability is heterogeneous across operations). If a capability discontinuity produced uniformly-good AI across all knowledge-work operations, the “map your frontier” design imperative collapses and outside-frontier harm (E3) dissolves.
A3 (verification cost is comparable to or cheaper than generation cost). If verification became prohibitively expensive — AI outputs so complex or fast-moving that human checking is intractable — the rational-cost-benefit reframing of overreliance (G3) dissolves and the design problem becomes “trust without verification.”
A6 (individual-level optimization aggregates upward). If gains are absorbed by organizational dynamics — review bottlenecks downstream, managerial reabsorption, coordination costs — individual workflow optimization is locally optimal but globally insufficient. Humlum-Vestergaard’s aggregate zero is direct evidence against. This is the project-frame crux: the entire individual-level optimization target only matters if A6 holds, and its status is genuinely open.

3b. Logical guardrails — unfalsifiable, ignored at peril

Cannot be falsified — only ignored. All three are weight-5 logical-necessity nodes (the L class).

L1 (substitution myth is wrong — every offload creates new monitoring/verification/coordination work). Bainbridge 1983 / Dekker & Woods 2002. The structural property of any principal-agent delegation; AI-rollout discourse routinely treats it as if it could be ignored.
L2 (optimal allocation requires modeling the joint performance surface, not solo accuracies). The Madras / Mozannar L2D formal result. A mathematical property of how joint performance combines — cannot be falsified empirically. The most common practitioner shortcut — “use AI when AI is better, use human when human is better” — implicitly assumes solo accuracies suffice; ignoring the variance and correlation structure of the two agents’ errors is exactly the move L2 forbids. The agent-design patterns (P3) that do work are the ones that respect L2 by construction.
L3 (allocation model must be parameterized by capability, not depend on fixed capability). The generating-function-vs-lookup-table commitment. Practitioner-only frameworks routinely violate this by hardcoding “GPT-4 is good at X, bad at Y” — the pattern goes stale within months.
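A small numerical illustration of why L2 holds (a sketch, not the Madras / Mozannar result itself): if each case is routed to exactly one of two agents, the best any router can do is succeed whenever at least one agent is right, so the ceiling is one minus the probability that both are wrong. That probability depends on how correlated the two agents' errors are, which solo accuracies do not reveal. The function name and the example accuracies are assumptions made for illustration.

```python
import math

def routing_ceiling(p_human: float, p_ai: float, rho: float) -> float:
    """Best accuracy any routing rule can reach: 1 - P(both are wrong).

    rho is the correlation between the two agents' error indicators.
    """
    e_h, e_a = 1.0 - p_human, 1.0 - p_ai
    both_wrong = e_h * e_a + rho * math.sqrt(e_h * (1 - e_h) * e_a * (1 - e_a))
    # Clamp to the feasible range of a joint probability.
    both_wrong = min(max(both_wrong, max(0.0, e_h + e_a - 1.0)), min(e_h, e_a))
    return 1.0 - both_wrong

# Same solo accuracies, different error correlation.
print(round(routing_ceiling(0.8, 0.9, rho=0.0), 3))    # 0.98
print(round(routing_ceiling(0.8, 0.9, rho=2 / 3), 3))  # 0.9
```

With independent errors, a perfect router could reach 0.98. With errors as correlated as these accuracies allow, the ceiling is 0.9, which is the AI alone, so no routing rule can produce complementarity. The solo accuracies are identical in both cases.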

3c. Reframer mechanisms — magnitude is the live question

High-weight non-crux mechanism nodes whose magnitude (not existence) is what reshapes interpretation. Each is well-supported as a phenomenon; the open question is the share of variance they explain.

G1 (metacognitive bottleneck). CHI 2024 Best Paper finding; robust phenomenologically. The open magnitude question: what share of the labor force is in the regime where G1 is binding? For workers still production-constrained, G1 is premature; for power users, it is binding. The size of each population determines whether the metacognitive-efficiency target (S2) is the right design priority for “individual workers” generically or only for a subset.
G3 (verification-cost trade-off). Vasconcelos / Fok-Weld reframing. The mechanism is real; the open question is whether it captures most of the variance in reliance behavior, or whether verification calibration (O7) is doing the rest of the work.
G4 (jagged frontier mechanism). The capability-heterogeneity story is robust; the open magnitude question is the rate at which the boundary shifts (O4) and whether the topological features are stable (A5).

3d. Headline conclusion — a synthesis output, not a crux

S1 (workflow architecture > model capability). The integrated finding the topology is organized around. S1 is what the cruxes plus mechanisms produce, not a load-bearing input. It is a weight-5 synthesis node, but distinguishing “headline conclusion” from “crux” matters: cruxes are inputs whose falsification rebuilds the graph; conclusions are outputs whose falsification only means the rebuild was downward (the conclusion was wrong) rather than upward (an input was wrong).

3e. Corroborating / illustrative

Two senses of “droppable” need to be distinguished here. Some nodes can be removed without breaking S1’s conclusion but are load-bearing as exemplars of the topology’s classification structure (especially the practitioner ↔ academic bridge in TLDR para 2); others are droppable in both senses.

Load-bearing as exemplars, droppable for S1: E10 (60/30/10 distribution — the canonical (a) bridge example, paired with P1 to demonstrate “practitioners ahead of academic measurement”; if removed, S1 still holds but the (a) example loses its empirical anchor), P5 (compound engineering / AI Sandwich — the canonical (b) bridge example, paired with S2 to demonstrate “practitioners concretizing prior academic principles”; if removed, S1 still holds but the (b) example weakens).
Droppable in both senses: E18 (Schoenegger overconfident-AI-still-helps — interesting but tangential to S1 and not used as a bridge example), G6 (cognitive forcing as mechanism — local, not load-bearing for S1 or the bridge framing), P6 (spec-driven development — practitioner-internal P → S4 edge, not a bridge example, not load-bearing for S1).

3f. Distortion vectors

D1–D4 are pedagogically intentional, not decorative. Each names a real motivated-reasoning pattern and the specific empirical/logical claims it targets. Distortions are useful for readers calibrating where their own priors might be selecting against the evidence.

4. Weakest links

Where the graph is genuinely fragile in 2026, ranked by potential propagation if the link breaks:

A6 / O2 / O5 (the aggregate-zero / unit-of-analysis cluster). The most consequential weakness in the graph. Humlum-Vestergaard’s precise zero across 25,000 Danish workers is direct evidence that individual workflow gains may not aggregate; if that holds, A6 is falsified, the project frame inverts, and individual workflow optimization becomes locally optimal but globally insufficient. Possible mechanisms (cross-task bundling per Cowen 2026, organizational reabsorption, weak wage pass-through) are testable but untested. Until O2 is resolved, every claim about organizational benefit downstream of S1 is contingent.
O3 (long-term cognitive effects). Bastani’s 17% drop is a single-session study with a short post-session window — the lit review documents the in-session vs. afterward contrast but no specific durability timescale. If a 2-year longitudinal RCT confirms broad skill atrophy across cognitive operations, the design implication becomes “AI-free zones” at much larger scale than current practice — and the productivity-only distortion (D2) goes from intellectually wrong to materially harmful.
O4 (frontier migration vs. calibration rate). If model capabilities shift faster than humans can recalibrate their personal frontier maps, the “map your jagged frontier” design imperative becomes unfollowable. The likely partial resolution is that the frontier has stable topological features (A5) even when the boundary moves, but A5 is empirically untested.
O7 (verification cost vs. calibration as binding constraint). Reducing verification cost helps a rational actor; improving calibration helps a miscalibrated one. The interventions are different. The sycophancy finding (E13) suggests calibration is a real second binding constraint.
S5 / O6 (centaur regime persistence). If a capability discontinuity makes AI better than humans on the full task, the centaur typology becomes historical.
A5 (frontier mappable). Empirically untested. If false, individual-level workflow design is structurally impossible at the per-task granularity the practitioner stack assumes.

5. Variants

Each variant reads the same graph through a different lens.

Variant A — Vulnerability (where does this break?)

Highlights the seven cruxes (A1, A2, A3, A6 foundational; L1, L2, L3 logical guardrails) plus the weight-5 nodes downstream. If any crux flips, propagation is concentrated through this subgraph. Useful for stress-testing: pick a crux, imagine it inverts, and trace the consequences through the highlighted subgraph.
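
The stress test described here amounts to a reachability query on the dependency graph. The following is a minimal sketch, assuming a toy edge list: only the three edges shown (A1 → G10, G10 → S2, G10 → E10) are named in this document, and the function name `downstream` and the dict-of-lists representation are illustrative choices, not the site component's actual data structure.

```python
from collections import deque

# Toy edge list. Only these three edges come from the topology notes;
# the full 66-node graph has many more and is not reproduced here.
EDGES = {
    "A1": ["G10"],
    "G10": ["S2", "E10"],
}


def downstream(crux, edges):
    """Return every node reachable from `crux`, in breadth-first order."""
    seen = {crux}
    order = []
    queue = deque([crux])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# If A1 inverts, these nodes need re-examination.
print(downstream("A1", EDGES))  # ['G10', 'S2', 'E10']
```

Breadth-first order lists the nearest dependents first, which matches the "trace the consequences" reading: direct dependents are re-examined before their own descendants.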

Variant B — Flow (how does causation propagate?)

Restricts to the A → M → E → S/G cascade plus practitioner operationalizations (P → E/L/S via op edges). The “what generated what” view: foundational assumptions enable methods, methods produce empirical findings, mechanisms explain them, syntheses integrate, and practitioner frameworks operationalize the resulting design implications.

Variant C — Minimal claim set

Smallest set of claims that still yields the headline conclusion (S1: workflow architecture > model capability). Approximately: A2 + A3 + E3 + E8 + L2 + G3 + G1 + S1. Eight nodes. Removing any one breaks the qualitative shape.

Variant D — Capability-regime fragility (the topic-specific variant)

Which nodes go stale if frontier capability jumps? This is the central worry named in the lit review’s adversarial section. Of the 66 nodes, 35 sort into one of three regime-fragility classes; the rest (methods, supporting empirical findings, distortions) are regime-orthogonal. The classification:

Stale-on-jump (6 nodes — likely to invert). E2 (skill-leveling: if AI exceeds top performers, the +34/0 pattern flips). E3 (outside-frontier harm: dissolves if frontier becomes uniform). E10 (60/30/10 behavior distribution: regime-bound to current tool affordances). E18 (overconfident-AI-still-helps: if AI is reliably correct, the “structured reasoning is the gain” reading dissolves). S5 (centaur transience: literally about transitioning out). P1 (Mollick taxonomy: becomes historical the way the chess centaur literature reads as historical now).
Stable-on-jump (18 nodes — structurally invariant). All four logical necessities: L1 (substitution myth), L2 (joint performance surface), L3 (parameterize-by-capability), L4 (autonomy + control coexist). The mechanism invariants: G1 (metacognitive load just shifts to new decisions), G2 (ironies of automation are a property of any principal-agent delegation), G3 (verification trade-off — the threshold moves but the shape is invariant), G9 (generator-verifier asymmetry), G10 (multi-tool attention interference — Wickens MRT is a property of human cognitive architecture, not of AI capability). The foundational A1 (attention-as-binding-constraint is a fact about humans, not about AI). The synthesis claims that follow from the invariants: S2 (metacognitive efficiency target), S3 (production → critical integration), S4 (context engineering as central craft). The control-surface practitioner patterns: P2 (autonomy slider), P3 (Anthropic agent design patterns), P5 (compound engineering), P6 (spec-driven development), P7 (CLAUDE.md context files).
Regime-dependent (11 nodes — depends on which way capability jumps). Foundational A2 (jagged frontier could become smooth or fragment into a different jagged shape), A3 (verification cost could fall further or rise as outputs become more complex), A5 (frontier-mappability depends on rate of migration), A6 (aggregation could go either way as tools get integrated). The headline S1 (workflow > capability could invert if a capability gap dominates). The high-leverage empirical claims: E14 (multi-agent could become moot if single-agent matches), E15 (routing depends on heterogeneity A2), E17 (read/write asymmetry depends on whether AI handles serial writes natively). The mechanism G4 (jagged frontier follows A2). The open question O4 (frontier migration rate IS the regime question).

The model formalization should be stable under capability change: it should be parameterized by the L1/L2/L3/L4 + G1/G2/G3/G9/G10 + A1 invariants and treat A2/A3/A6/E14/E15/E17 as inputs that vary by capability regime. The 6 stale-on-jump nodes are the ones the formalization should not hardcode.

6. Stage-3 handoff

This topology is the input to model formalization. The cleanest target is a parameterized routing function:

delegate(task, worker, AI, context) → action

where the action is one of {do_yourself, consult_AI, delegate_to_AI, refuse}, and the function is parameterized by:

Task type (per Bloom hierarchy + cognitive task analysis decomposition)
Worker capability profile on that task type (via prior frontier mapping)
AI capability profile on that task type (via per-task benchmark or recent personal experience — the L3 invariant means this is an input, not a hardcoded constant)
Verification cost (function of task type and output format)
Stakes / reversibility (high stakes → independent-then-synthesize per E8)
Skill-formation goal (if the task is one whose capability the worker wants to maintain → guardrail mode per E9)
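
One way to read the parameter list above is as a typed input record feeding a small rule cascade. The sketch below is illustrative only. It flattens `delegate(task, worker, AI, context)` into a single hypothetical `RoutingInput` record. The thresholds (`min_capability`, `cheap_verification`), the order of the rules, the mode strings, and the reading of `refuse` as "neither party is capable" are all assumptions, not results from the lit review. What it does take from the text: AI capability is an input rather than a constant (L3), high stakes trigger independent-then-synthesize (E8), and a skill-formation goal triggers guardrail mode (E9).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DO_YOURSELF = "do_yourself"
    CONSULT_AI = "consult_AI"
    DELEGATE_TO_AI = "delegate_to_AI"
    REFUSE = "refuse"


@dataclass(frozen=True)
class RoutingInput:
    task_type: str
    worker_capability: float  # 0..1, from prior frontier mapping
    ai_capability: float      # 0..1, supplied per call (L3), never hardcoded
    verification_cost: float  # verify-cost / generate-cost ratio
    high_stakes: bool
    skill_formation_goal: bool


def delegate(x, min_capability=0.3, cheap_verification=1.0):
    """Return (action, mode). All thresholds are hypothetical placeholders."""
    if max(x.worker_capability, x.ai_capability) < min_capability:
        return Action.REFUSE, "neither_party_capable"
    if x.skill_formation_goal:
        return Action.CONSULT_AI, "guardrail"  # per E9
    if x.high_stakes:
        return Action.CONSULT_AI, "independent_then_synthesize"  # per E8
    if x.ai_capability > x.worker_capability:
        if x.verification_cost <= cheap_verification:
            return Action.DELEGATE_TO_AI, "default"
        return Action.CONSULT_AI, "default"
    return Action.DO_YOURSELF, "default"


print(delegate(RoutingInput("summary", 0.6, 0.8, 0.5, False, False)))
```

Because every capability and cost value arrives as an argument, a capability jump changes the inputs rather than the function body.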

The stage_outputs/<topic>/<stage>.md folder convention holds for this topic too: raw working drafts live in stage_outputs/technology-utilization-architecture/<stage>.md; polished versions move into src/content/ai_research/technology-utilization-architecture/<stage>.mdx. So far the folder contains the lit review and this topology draft; subsequent stages will accumulate there.

This topology also feeds three natural sibling topics down the line. Navigating an AI World is the structural / civilizational view that holds the individual-level optimization in tension with the organizational-level disruption (its own Crux 5 / A6 sits next to mine). AI Cognitive Profile is the orthogonal view: rather than asking “how should an individual route tasks,” ask “where does AI capability diverge from human capability across the O*NET task taxonomy” — that topic supplies what this topic treats as a black-box input (the per-task capability gradient) and inversely, this topic supplies what that topic treats as a black-box (what individuals should do given the gradient). Prediction and Calibration is the natural attachment point for O7 (verification cost vs. verification calibration as binding constraint): if calibration on AI error rates is the second binding constraint behind verification cost, the calibration topic is where that gets formalized. The Stage-3 model formalization should leave clean attachment points for all three.

7. Next moves — three Stage-3 options

Three formalization paths, each with pros / cons / Stage-4 implications.

Option A — Capability × verification-cost dispatch table (decomposition)

What it is. Formalize the routing function as a 2D table: capability gap (worker minus AI on this task) × verification cost ratio (verify-cost / generate-cost). Four quadrants → four allocation rules.

Pros. Maps cleanly onto the existing RCT record (quadrants align with Brynjolfsson, Dell’Acqua, METR, Everett). Easy to visualize. Practitioner-actionable.
Cons. Risks violating L3 (looks like a lookup table). Mitigated if the table is generated from inputs rather than hardcoded.
Stage-4 implication. Test against the published RCTs by classifying each into a quadrant.
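
The L3 mitigation (a table generated from inputs rather than hardcoded) can be sketched as follows. The axis definitions come from the description above: capability gap is worker minus AI, and verification ratio is verify-cost over generate-cost. The cut points, the quadrant labels, and the rule assigned to each quadrant are hypothetical placeholders, since this document does not state the four allocation rules.

```python
def quadrant(capability_gap, verify_ratio, gap_cut=0.0, ratio_cut=1.0):
    """Classify a task into one of four quadrants.

    capability_gap: worker capability minus AI capability on this task.
    verify_ratio:   verify-cost / generate-cost.
    The cut points are parameters so the table is regenerated from inputs.
    """
    who = "worker_ahead" if capability_gap > gap_cut else "ai_ahead"
    cost = "verify_cheap" if verify_ratio <= ratio_cut else "verify_costly"
    return (who, cost)


# Hypothetical rule assignment, supplied by the caller. The labels reuse
# the action names from the Stage-3 handoff purely for illustration.
EXAMPLE_RULES = {
    ("worker_ahead", "verify_cheap"): "consult_AI",
    ("worker_ahead", "verify_costly"): "do_yourself",
    ("ai_ahead", "verify_cheap"): "delegate_to_AI",
    ("ai_ahead", "verify_costly"): "consult_AI",
}


def allocate(capability_gap, verify_ratio, rules, **cuts):
    """Look up the allocation rule for a task's quadrant."""
    return rules[quadrant(capability_gap, verify_ratio, **cuts)]


print(allocate(-0.2, 0.5, EXAMPLE_RULES))  # delegate_to_AI
```

The same `quadrant` function could serve the Stage-4 test by classifying each published RCT from its estimated gap and ratio.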

Option B — Generator-verifier loop with autonomy slider (generating function) [recommended]

What it is. Formalize the workflow as a recurrent loop: at each step, the worker chooses an autonomy level (Karpathy P2 / Feng 2025 five-level operator → observer). The loop has a verification gate; the gate’s strictness is a parameter. Output: an interactive dashboard where the user inputs (task type, capability gap, verification cost, stakes, skill-formation goal) and gets a recommended autonomy level + verification cadence.

Pros. Directly operationalizes Karpathy’s autonomy slider (P2) using Shneiderman 2D (L4) and Vasconcelos verification economics (G3). Survives capability change (L3-compatible). Naturally maps onto an interactive site component.
Cons. Requires choosing a parameterization of “verification cost” that holds across task types — non-trivial.
Stage-4 implication. Test against Bastani’s guardrail RCT (verification gate stringency moderating skill atrophy), Everett’s independent-then-synthesize (a specific autonomy-cadence schedule), and the Mollick centaur/cyborg empirical distribution (which loops people actually run).
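
The loop can be sketched in a few lines to show how the moving parts fit together. The five role names follow Feng 2025's operator → observer levels. Everything else is an assumption made for illustration: the gate's strictness is expressed as a cadence (`check_every`), and the policy of stepping down one autonomy level after a failed check is a hypothetical choice, not a recommendation from the literature.

```python
from enum import IntEnum


class UserRole(IntEnum):
    # Feng 2025's five levels, ordered from least to most AI autonomy.
    OPERATOR = 1
    COLLABORATOR = 2
    CONSULTANT = 3
    APPROVER = 4
    OBSERVER = 5


def run_loop(steps, generate, verify, role, check_every=1):
    """Run a generator-verifier loop and return a trace.

    generate(step) -> output           (the AI's production step)
    verify(step, output) -> bool       (the human's verification gate)
    check_every: gate cadence; 1 checks every step, 3 every third step.
    Each trace entry is (step, role name, gate result or None if unchecked).
    """
    trace = []
    for i, step in enumerate(steps, start=1):
        output = generate(step)
        checked = i % check_every == 0
        ok = verify(step, output) if checked else None
        trace.append((step, role.name, ok))
        # Hypothetical policy: a failed gate moves the slider one level
        # toward OPERATOR for the following steps.
        if ok is False and role > UserRole.OPERATOR:
            role = UserRole(role - 1)
    return trace


trace = run_loop(
    ["a", "b", "c"],
    generate=str.upper,
    verify=lambda step, output: step != "b",  # pretend step "b" fails review
    role=UserRole.APPROVER,
)
print(trace)
```

In this run the failed check on step "b" drops the role from APPROVER to CONSULTANT for step "c". A concurrent-tool-load input (G10) would enter as an additional argument that adjusts `check_every` or the starting role.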

Option C — Principal-agent with imperfect agent (mechanism design)

What it is. Apply contract-theory machinery (asymmetric information, monitoring vs. trust trade-off) to the single-worker / AI-tool relationship.

Pros. Formally rigorous. Connects to a mature economic literature.
Cons. Heavyweight; principal-agent assumes strategic agents, but LLMs are imperfect-but-non-strategic.
Stage-4 implication. Hardest to validate against the existing RCT record.

Recommendation: Option B. The generator-verifier loop with an autonomy slider is the most directly testable, the most actionable as an interactive site artifact (Stage 5), and the most clearly L3-compatible. Specifically, Option B engages the regime-stable invariants identified in Variant D — L1 (every offload creates new monitoring work, baked into the loop’s verification gate), L2 (joint-surface allocation, baked into the autonomy-level choice), L3 (parameterized by capability inputs rather than hardcoded), G3 (verification trade-off, the parameter that sets gate strictness), G9 (generator-verifier asymmetry, the loop’s central asymmetry), and G10 (multi-tool attention interference — the loop should track concurrent-tool load as a Wickens-MRT input, not just per-task parameters; “running three coding agents in parallel” is a different regime from “running one”) — while taking A6 (individual optimization aggregates) as the explicit assumption whose falsification would mean the loop is locally optimal but globally insufficient. Option A can be a sub-component of Option B (the dispatch table determines the autonomy-level choice). Option C is held in reserve.

8. Objections to this topology

Objection 1: “The practitioner / academic split is overstated — the field is actually more integrated than the typing P-vs-S suggests.” Steelman: many researchers (Mollick, Karpathy, Tankelevitch via MSR’s Tools for Thought) span both communities; some practitioner posts (Anthropic’s effective agents) cite academic literature; the typing risks reifying a divide that’s already partial. Response: the people span both, but the artifacts arrive on different timescales and through different processes. Mollick proposed the centaur/cyborg typology in 2024; Randazzo 2026 (HBS) is the first peer-reviewed empirical measurement of the distribution. Karpathy’s autonomy slider was articulated as a tool-design concept (in his 2024 Software 1.0/2.0/3.0 talks and Cursor’s Tab → Cmd+K → Agent Mode UX) about a year before Feng 2025’s five-level academic taxonomy converged independently on a similar discretized-levels structure — neither grounds the other; they are parallel work on the same concern at different granularities (Karpathy’s is per-action, Feng’s is per-task user role). Anthropic’s agent design patterns and L2D theory likewise address the same structural problem with no direct lineage. The typing isn’t claiming a clean division; it’s making the structurally different roles visible — empirical measurement, formal-theory derivation, engineering-practice synthesis — so we don’t conflate “Anthropic shipped a pattern that works” with “L2D theory predicts this pattern is optimal.”

Objection 2: “The 7-crux selection is biased toward the framing the lit review wants.” Steelman: a critic could argue capability-first cruxes (e.g., “Bloom-level decay determines all task allocation”) are equally defensible, or that the lit-review-driven framing inherits whatever bias the lit review has. Response: the crux set is structurally typed, not picked by impact. The criterion is “claims whose collapse rebuilds regions of the graph,” and that maps cleanly onto exactly two node classes — foundational assumptions (A) and logical necessities (L). The cruxes are A1, A2, A3, A6 (every weight-5 A node) plus L1, L2, L3 (every weight-5 L node). No mechanism (G), synthesis (S), empirical (E), practitioner (P), or open (O) node is a crux, because their structural role is downstream of the cruxes — mechanisms explain, syntheses integrate, empirical findings test, practitioner frameworks operationalize, open questions sit at the frontier. The headline conclusion S1 (workflow > capability) is not a crux even though it is weight-5; it’s an output of the graph, and excluding it from the crux set is the discipline that distinguishes “what the graph rests on” from “what the graph concludes.” A capability-first alternative would have to add a foundational A node like “Bloom-level decay is the dominant capability gradient” — which the lit review doesn’t support; LLMs decay sharply on Bloom, but task allocation is also shaped by verification cost, autonomy level, and skill-formation goals. The selection is therefore lit-review-driven, but the typing rule (A + L only, every weight-5) is independent of the lit review’s framing — it is just “which nodes carry the structural role of being inputs the rest of the graph rests on.”

Objection 3: “The capability-regime fragility variant is a hedge: it concedes the project will go stale.” Steelman: Variant D essentially admits that half the empirical claims could invert under a capability jump; if so, why formalize at all? Response: Variant D is the discipline that justifies formalization. The point is to identify which nodes are stable-under-jump (the Stable-on-jump class, with L1–L4, G1/G2/G3/G9/G10, and A1 at its core) and build the formalization on those. Practitioner heuristics, by contrast, are not capability-stable by design; they update faster but go stale faster too.

Objection 4: “The aggregate-zero puzzle (E4 / O2) is so consequential that the entire individual-level framing might be wrong, and the topology should pivot to the organizational level instead.” Steelman: if Humlum-Vestergaard’s zero is the true-population result, individual workflow optimization is rearranging deck chairs. Response: this is exactly what A6 (now a foundational crux) and O5 (right unit of analysis) name. The honest position is that the individual frame is one valid level of analysis — design-actionable for the worker who controls their own workflow — and the organizational frame is a separate level that should be a sibling artifact, not a replacement.

Objection 5: “Topology + model formalization is over-engineering for a moving target. Practitioner heuristics adapt faster than academic formalization can ship; by the time you finish the model, the field has moved. The honest move is to skip directly to a build artifact using current best practices.” Steelman: this is real. The lit review already documents that the practitioner stack runs 12–18 months ahead of peer review precisely because it doesn’t try to formalize. Cursor, Claude Code, and the Anthropic agent-design patterns are already shipping; users adapting heuristics in real time will outperform users waiting for an academic model. The L3 invariant (“parameterize by capability”) is meant to address this, but it’s a hope, not a guarantee — a formalization built on capability-stable invariants might still miss the regime where capability discontinuity dominates everything. Response: even granting all of that, the topology IS the artifact whose stable invariants survive capability change. Variant D identifies an 18-node regime-stable subgraph: all four logical necessities (L1 substitution myth, L2 joint surface, L3 parameterize-by-capability, L4 autonomy + control coexist), the five mechanism invariants (G1 metacognitive load shifting, G2 ironies of automation, G3 verification trade-off, G9 generator-verifier asymmetry, G10 multi-tool attention interference via Wickens MRT), the foundational A1 (attention-as-binding-constraint is a fact about humans, not AI), the synthesis claims that follow from the invariants (S2 metacognitive efficiency target, S3 production → critical integration, S4 context engineering as central craft), and the control-surface practitioner patterns (P2 autonomy slider, P3 Anthropic agent design patterns, P5 compound engineering, P6 spec-driven development, P7 CLAUDE.md context files) — none of which are capability-bound claims. 
Practitioner heuristics adapt faster but are lossy: they don’t preserve the reasoning behind the rule, so users can’t tell when the heuristic stops applying. The 12–18 month lag cuts both ways: practitioners ship faster, but they also rediscover Bainbridge 1983 forty years late. The model formalization’s value is less “predict optimal allocation” than “encode the invariants explicitly so the next capability shift doesn’t require re-litigating from scratch.” That said, the objection has bite for Stage 5 specifically. If the build artifact is a fragile prediction tool that hardcodes 2026 capability, it goes stale. If it is a frame (the autonomy slider, the verification-cost trade-off visualized) that the user fills in with their own current capability profile, it is L3-compatible and survives.

9. Glossary

AI Sandwich — Shipper / Every: humans frame the task and review the output; AI handles the middle.
autonomy slider — Karpathy: a per-action user control surface (Cursor: Tab → Cmd+K → Agent Mode); operationalizes Shneiderman’s 2D framework.
Bloom’s taxonomy — Remember → Understand → Apply → Analyze → Evaluate → Create. LLM capability decays sharply going up.
centaur — Mollick: clean human/AI role separation; discrete handoffs based on per-task frontier mapping.
CoALA — Cognitive Architectures for Language Agents (Sumers 2024). Maps LLM agents to ACT-R / SOAR memory + action structures.
complementarity potential vs. team performance — Hemmer 2025 distinction: mathematical possibility vs. actual achievement of human+AI exceeding either alone.
compound engineering — Shipper / Every: plan → work → review → compound. Each loop’s outputs feed the next loop’s inputs.
context engineering — Anthropic / Cognition: the discipline of constructing what the model sees. Practitioner consensus: higher leverage than model selection.
CTA / ACTA — Cognitive Task Analysis / Applied CTA. Method for decomposing what a knowledge worker actually does cognitively.
CUPS — Cognitive Use Pattern States; Mozannar 2024 telemetry-grounded taxonomy of programmer-Copilot interaction.
cyborg — Mollick: continuous interleaving of human and AI at sub-task granularity. The empirical majority pattern.
generator-verifier asymmetry — production cost falls toward zero with AI; verification cost stays roughly constant. Karpathy’s framing.
H-LAM/T — Engelbart 1962: Human using Language, Artifacts, Methodology, Training. Current rollouts ship the artifact; methodology and training lag.
HAIJCS — Human-AI Joint Cognitive System (Xu & Gao 2024). Bridge from CSE to LLM teaming.
ironies of automation — Bainbridge 1983: AI handling routine work degrades human capacity to catch the rare critical errors.
jagged frontier — Dell’Acqua 2023: AI capability is heterogeneous across sub-tasks; using AI inside the frontier helps, outside causes harm.
L2D — Learning to Defer; Madras / Mozannar formal allocation theory.
metacognitive demands — Tankelevitch 2024 best-paper framework: GenAI reduces production load but increases planning, evaluation, and calibration load.
MABA-MABA — “Men Are Better At, Machines Are Better At.” Dekker & Woods 2002 critique of static function allocation.
MRT (Multiple Resource Theory) — Wickens. Cognitive resources are partitioned by processing stage, sensory modality, and processing code; tasks competing for the same resource interfere superlinearly while tasks using different resources can be parallelized cheaply. Predicts tool-stack attention overload (Cursor + ChatGPT + meeting incurs cost beyond the sum). Lit review notes MRT is “absent from AI workflow research.”
read/write asymmetry — Cognition: multi-agent works for read-heavy tasks but breaks for write-heavy unless writes are serialized.
routing > model selection — Paterson 2026: across 38 real daily tasks and 15 models, dispatching by task type beats picking a single best model.
self-automator — Randazzo 2026 third Mollick-typology mode: full delegation with periodic oversight.
sycophancy in feedback loops — Randazzo HBS 26-021: AI escalates persuasive justification on pushback.
Tools for Thought — Microsoft Research program on AI-augmented knowledge work.
verification cost — the cost of checking whether an AI output is correct. Vasconcelos 2023 reframing: overreliance is rational when verification is expensive relative to expected payoff.


Iteration history

Pass 9 2026-04-29

truth/accuracy override on bias

Why Two small but real timeframe / scope overclaims about the Karpathy ↔ Feng relationship. (1) Objection 1 said "Karpathy's autonomy slider was articulated... years before Feng 2025" — but Karpathy's 2024 articulations and Feng's mid-2025 paper are about a year apart, not "years" plural. (2) TLDR para 2 (c) said the slider and Feng's taxonomy "converged on the same per-action structure" — but Karpathy's slider is per-action UX (Cursor Tab → Cmd+K → Agent Mode) while Feng's five levels are per-task user roles (Operator → Collaborator → Consultant → Approver → Observer). They share the discretized-levels shape at different granularities, not the same per-action structure.
- Tightened Objection 1: "years before" → "about a year before" with the 2024 Software 1.0/2.0/3.0 talks and Cursor UX as the Karpathy timestamp anchor; added explicit acknowledgment that the two converged "at different granularities (Karpathy's is per-action, Feng's is per-task user role)"
- Tightened TLDR para 2 (c): "converged on the same per-action structure" → "converged on the same discretized-levels-of-autonomy shape independently, at different granularities"; annotated each side with its granularity
Pass 8 2026-04-29

internal consistency check

Why §3e listed E10 and P5 in the "genuinely droppable without changing the structural picture" group, but both are explicitly used in TLDR para 2 as the canonical (a) and (b) bridge examples (Mollick → Randazzo on the 60/30/10 distribution; AI Sandwich / Compound Engineering concretizing Tankelevitch). Calling them droppable without breaking the structural picture contradicts their headline use as exemplars of the topology's practitioner ↔ academic classification.
- Restructured §3e to distinguish two senses of "droppable": (1) load-bearing as exemplars but removing wouldn't break S1's conclusion (E10 and P5 — the canonical (a) and (b) bridge examples), and (2) droppable in both senses (E18, G6, P6 — neither load-bearing for S1 nor used as bridge exemplars). The framing now matches their actual roles in the topology.
Pass 7 2026-04-29

internal consistency check

Why Two sync issues. (1) Objection 5 in §8 listed the regime-stable subgraph as "L1/L2/L3 + G1/G2/G3/G9 + A1" — but that list was written in pass 3, before pass 4 added G10 and before pass 5 added the synthesis (S2/S3/S4) and practitioner control-surface (P2/P3/P5/P6/P7) entries to REGIME_STABLE. Variant D's actual stable set is 18 nodes; Objection 5 referenced a 9-node subset. Internal inconsistency between §5 (current) and §8 (stale). (2) ARCHITECTURE.md still described CognitivePartnershipGraph as "~55 nodes" — outdated since pass 2 added A6 (→64) and pass 4 added G10 (→66).
- Updated Objection 5 to reflect Variant D's actual 18-node regime-stable subgraph: all four logical necessities (L1, L2, L3, L4), the five mechanism invariants (G1, G2, G3, G9, G10), the foundational A1, the three synthesis claims that follow from the invariants (S2, S3, S4), and the five control-surface practitioner patterns (P2, P3, P5, P6, P7). Defense now matches the topology actually built.
- Updated ARCHITECTURE.md component description from "~55 nodes" to "~66 nodes" so the project spec tracks the topology state.
Pass 6 2026-04-29

internal consistency check · truth/accuracy override on bias

Why Pass 5 introduced the (a)/(b)/(c) classification of practitioner ↔ academic relationships and tagged edges in §2 accordingly. On a fresh re-read I caught a real internal-consistency error: the TLDR (b) entry used CLAUDE.md as an example of "practitioners concretizing prior academic principles" — but "context engineering" is itself a practitioner-coined concept (Anthropic / Cognition consensus), not a prior academic principle. CLAUDE.md → context engineering is a practitioner-internal edge, not a (b) bridge. Same issue in §2 edge catalog where P6 → S4 and P7 → S4 were tagged as pattern (b), but S4 is practitioner-attributed so both endpoints sit in the same community.
- Replaced CLAUDE.md example in TLDR para 2 (b) entry with the AI Sandwich / Compound Engineering loop concretizing Tankelevitch's metacognitive-demands framework — a clean cross-community (b) example since Tankelevitch is academic
- Added explicit note in TLDR para 2 that some P-edges are practitioner-internal rather than cross-community bridges (e.g., P6 → S4 and P7 → S4 are concrete-technique → practitioner-synthesis edges) and the (a)/(b)/(c) classification doesn't apply to them
- Untagged P6 → S4 and P7 → S4 from pattern (b) in §2 edge catalog; relabeled as practitioner-internal with the explanation that S4 (context engineering as central craft) is itself a practitioner-coined synthesis (Anthropic / Cognition consensus), and P6 / P7 are concrete techniques operationalizing that synthesis — both endpoints sit in the practitioner community
- Updated P5 → S2 description in §2 to be explicit that S2 is Tankelevitch's academic framework, making P5 → S2 a clean (b) edge
Pass 5 2026-04-29

truth/accuracy override on bias · internal consistency check

Why TLDR para 2 was internally contradictory: it said practitioners run "12–18 months ahead of academic peer review" *and* that "each practitioner framework operationalizes an academic claim that the field had already half-formulated." First half says practitioners are ahead; second half says they're behind. The truthful picture is bidirectional / mixed. Two specific overclaims compounded: (a) "Anthropic's patterns operationalize Madras / Mozannar L2D theory" — Anthropic's patterns and L2D theory address the same structural problem but emerged independently from engineering practice; Anthropic doesn't cite L2D as foundational. (b) "Karpathy's autonomy slider has no formal grounding until Feng 2025" — Feng 2025 is independent academic work that converged on the same five-level structure; it doesn't ground Karpathy.
- Rewrote TLDR para 2: replaced one-directional "operationalize" framing with explicit (a)/(b)/(c) classification — (a) practitioners ahead of academic measurement (Mollick centaur measured retrospectively by Randazzo 2026, P4 read/write asymmetry stated and measured together); (b) practitioners concretizing prior academic principles (Karpathy slider concretizes Shneiderman 2D; CLAUDE.md concretizes context engineering); (c) practitioner and academic work converging in parallel without direct lineage (Karpathy slider + Feng 2025 five-level taxonomy; Anthropic patterns + L2D theory)
- Tightened §2 edge catalog: each P → academic edge now annotated with which pattern (a/b/c) it is — P1 → E10 is (a), P2 → L4 is (b), P3 → L2 is (c) [parallel convergence, not operationalization], P4 → E17 is (a) inverted, P5/P6/P7 are (b)
- Rewrote Objection 1 response: removed the inaccurate "Karpathy's autonomy slider has no formal grounding until Feng 2025" framing; replaced with the correct claim that the slider was articulated as a tool-design concept years before Feng 2025 converged on the same per-action structure independently. Added the structural argument that the typing makes "structurally different roles visible — empirical measurement, formal-theory derivation, engineering-practice synthesis — so we don't conflate 'Anthropic shipped a pattern that works' with 'L2D theory predicts this pattern is optimal'"
- Updated component P3 → L2 edge label from "patterns operationalize L2D" to "shared concern: joint allocation (parallel convergence)" — preserves the structural link without claiming direct lineage
Pass 4 2026-04-29

gap scan · cross-context verification

Why Two real issues remained on a careful reread. First, the lit review explicitly named Wickens' Multiple Resource Theory as "absent from AI workflow research" — a well-established cognitive-load mechanism predicting tool-stack attention interference (Cursor + ChatGPT + meeting simultaneously) — but my topology only carried G1 (general metacognitive demand) and missed the multi-tool interference mechanism specifically. This is load-bearing for the Stage-3 model: Option B's autonomy slider needs a concurrent-tool-load dimension, not just per-task parameters. Second, §4 said "Bastani's 17% drop is one study at under six months" but the lit review never actually states a duration — that was a number I invented.
- Added G10 (multi-tool attention interference — Wickens MRT) as a corroborating mechanism node, weight 3, status ~. The mechanism is well-established but unmeasured in AI workflow contexts. Distinct from G1 (general metacognitive demand) — G10 is specifically about parallel-tool resource contention. Edges: A1 → G10 (attention scarcity → multi-tool interference), G10 → S2 (minimize concurrent-tool cost), G10 → E10 (cyborgs incur higher MRT cost than centaurs — this is testable). G10 added to REGIME_STABLE in Variant D since human cognitive architecture is regime-invariant
- Updated node count 65 → 66 across frontmatter, TLDR, and component variant blurb
- Tightened §7 Option B recommendation to note that the loop should include a concurrent-tool-load dimension (G10) — the routing function should account for whether the user is in single-tool or multi-tool mode, not just per-task parameters
- Added MRT (Multiple Resource Theory) glossary entry
- Fixed Bastani timeframe claim in §4: replaced invented "one study at under six months" with lit-review-supported "single-session study with a short post-session window" — the lit review specifies "in-session... afterward" but no specific duration, and inventing a number was a cross-context-verification slip
Pass 3 2026-04-29

error check · internal consistency check · adversarial + steelman

Why Pass 2 stated a structural typing rule for cruxes ("every weight-5 A node + every weight-5 L node") but the actual crux set excluded L2 (joint performance surface — Madras / Mozannar L2D theory), which is weight-5 and meets the rule. The defense was internally inconsistent. Separately, the strongest remaining objection — that topology+model formalization is over-engineering for a moving target where practitioner heuristics adapt faster — was not addressed.
- Added L2 to CRUX_IDS — the rule "every weight-5 A node + every weight-5 L node" now actually holds. Crux count 6 → 7 (4 foundational A + 3 logical guardrail L)
- Updated TLDR para 1 to "Three logical guardrails (L1, L2, L3)"; updated §3b to include the L2 entry; updated Variant A blurb to "seven cruxes"; updated Objection 2 defense
- Added Objection 5 (over-engineering for moving target): steelman that practitioner heuristics adapt faster than academic formalization can ship; response that the topology IS the artifact whose stable invariants survive capability change, while practitioner heuristics are adaptive but lossy (they don't preserve the reasoning behind the rule, so users can't tell when the heuristic stops applying)
- Tightened §7 Option B recommendation to make explicit which invariants and assumptions it engages — L1 (substitution myth), L2 (joint surface), L3 (parameterize by capability), G3 (verification trade-off), and A6 (the loop's testability assumes individual-level optimization is meaningful)
- Added §6 sibling-topic connection paragraph: O7 (verification cost vs. verification calibration) is the natural attachment point for the planned Prediction & Calibration topic — calibration on AI error rates IS the calibration question that topic will need to formalize
- Updated component variant blurb count 6 → 7 cruxes; component CRUX_IDS now matches the prose
Pass 2 2026-04-29

fresh-eyes audit, error check, internal consistency, truth/accuracy override on bias

Why: On rereading the published draft, four real problems compound: the node-count claim was wrong (55 vs. actual 64), the crux selection conflated the headline conclusion (S1) with a load-bearing primitive, the individual-as-unit-of-analysis assumption was unencoded despite being the lit review's explicit Crux 5, and the TLDR forced the foundational/reframer/guardrail trichotomy from the prior topology when the genuinely novel structural move for *this* topic is the practitioner-academic bridge.
- Added A6 (individual-level optimization aggregates) as a 4th foundational crux — encodes the "individual is the right unit of analysis" assumption that the lit review names explicitly. Weight 5, status ?, attacked by E4 (Humlum-Vestergaard aggregate-zero) and resolved by O2 / O5
- Restructured crux set from {A1, A2, A3, G1, S1, L3} to {A1, A2, A3, A6, L1, L3} — dropped S1 (it's the *headline conclusion*, not a load-bearing primitive — cruxes are inputs, conclusions are outputs); dropped G1 from cruxes (it's a *consequence* of A1, not an independent axiom — kept as a reframer mechanism); promoted L1 (substitution myth) into the crux set alongside L3
- Fixed node-count claim 55 → 65 across TLDR, frontmatter description, and component blurb
- Rewrote TLDR para 2: leads with the practitioner ↔ academic operationalization bridge as the topic-specific structural finding (Mollick ↔ Randazzo, Karpathy ↔ Feng, Cognition ↔ read/write asymmetry), with the foundational/guardrail/reframer trichotomy as a §3 organizing principle rather than the headline structural move
- Restructured §3 into 3a Foundational cruxes / 3b Logical guardrails / 3c Reframer mechanisms (the high-weight non-crux mechanism nodes G1, G3, G4) / 3d Headline conclusion (S1) / 3e Corroborating / 3f Distortion vectors. Pulled the "decorative" framing for D nodes — distortions are pedagogically intentional, not droppable
- Rewrote Objection 2 defense — the new crux set is structurally clean (foundational A nodes + logical L nodes), and the defense actually holds: cruxes are inputs whose collapse rebuilds regions; the headline (S1) and the metacognitive mechanism (G1) live in different categories of structural role
- Tightened "↔" notation in prose to directional "P operationalizes E/L" framing — the React graph encodes these as one-directional op edges, and the ↔ shorthand was loose. Where bidirectional connection is real (P2 ↔ L4), encoded as two distinct edges
- Expanded capability-regime variant classification: 6 stale (was 4) — added P1 (centaur typology becomes historical post-jump) and E10 (the 60/30/10 distribution is regime-bound); 17 stable (was 8) — added A1, S2/S3/S4, P2/P3/P5/P6/P7 (all architectural / control-surface / principal-agent invariants); 11 regime-dependent (was 5) — added A6, E14/E15/E17, P4, G4 (the jagged frontier mechanism follows A2)
- Updated component CRUX_IDS, REGIME_STALE / REGIME_STABLE / REGIME_DEPENDENT sets, and variant blurbs
Pass 1 2026-04-29

first draft

Why: First-draft topology of the cognitive partnership stack lit review. The nine-type node taxonomy adds a Practitioner (P) class to make the academic/practitioner split visible and to encode the operationalizes (op) edge as the bridge.
- Nine node types A/M/E/L/G/S/P/O/D
- Six cruxes selected (A1, A2, A3, G1, S1, L3) — first-draft cut, defended weakly
- Five variant views: full / vulnerability / flow / minimal / capability-regime
- Capability-regime variant added as topic-specific fragility lens
- Edge types include op (operationalizes) for the practitioner ↔ academic bridge
- Three Stage-3 formalization options laid out (capability×verification dispatch, generator-verifier loop with autonomy slider, principal-agent), with the loop+slider recommended

pass 6

Generator-verifier loop with a per-task autonomy slider. The per-task value function V(u, v; θ) decomposes into four orthogonal channels (quality, attention, risk, skill); V is exactly bilinear in (u, v) so per-task optima land at three corners — do-yourself, self-automator, spec-driven. Centaur and cyborg arise as aggregate-level patterns from cross-sub-task corner mixing. Portfolio aggregation under a daily attention budget surfaces the Lagrangian shadow price μ that reroutes longer tasks first when budget binds. Five cruxes, six Stage-4 fitting targets, three engaged objections. Interactive two-tab dashboard included.

TLDR

The lit review documents a research landscape; the topology stripped it down to load-bearing structure. This stage formalises the cleanest target the topology surfaced: a generator-verifier loop with a per-task autonomy slider, designed to survive capability change rather than encode a snapshot of any specific model’s frontier. The optimisation target is output quality per unit of human attention for an individual knowledge worker. The formalisation makes three moves at once: decomposition (four orthogonal value channels: quality, attention, risk, skill), generating function (a per-task value function whose three corner solutions, mixed across sub-tasks, reproduce the five empirically observed workflow modes), and integration (a single formalism that composes Karpathy’s slider, Mollick’s typology, Vasconcelos verification-economics, Bastani’s atrophy, Bainbridge’s substitution myth, and Madras-Mozannar L2D into one object).

The per-task value function is V(u, v; θ) = Q(u,v) − α·A(u,v) + λ·S(u,v) − σ·R(u,v), where u is the autonomy level (fraction of the task delegated to AI) and v is the verification depth (fraction of AI output independently checked). The four channels are conceptually distinct mechanisms: quality Q rewards letting the better agent do the work, with verified output achieving the complementary-product ceiling c_⋆ = c_AI + (1 − c_AI)·c_H (either AI was right or human catches the error); attention A charges for human time and for the irreducible monitoring cost even at full delegation (the L1 substitution-myth invariant baked in via ε > 0); risk R penalises uncaught AI errors at a rate proportional to stakes σ; skill S rewards practice and penalises unverified delegation (Bastani-style atrophy). V turns out to be exactly bilinear in (u, v) — collecting terms gives V = K_0 + K_u·u + K_v·v + K_uv·u·v — which means the maximum on the unit square is always at a corner. Three corners are candidates: (0, 0) do-yourself, (1, 0) self-automator, (1, 1) spec-driven (the corner (0, 1) is dominated since verifying with no AI involvement is pure cost). Centaur and cyborg modes do not arise as per-task optima — they appear only as aggregate-level patterns when a worker mixes corner policies across sub-tasks with heterogeneous θ. This is a substantive prediction, not a limitation.

The portfolio extension aggregates per-task decisions under a daily attention budget. The headline result the model is designed to produce is S1 (workflow architecture > model capability): on the same task mix and same c_AI, optimal routing dominates “max-AI” (self-automate everything) and dominates the naive flat-cyborg heuristic (u=0.7, v=0.3 everywhere). The bilinearity finding sharpens this: the naive flat-cyborg policy is exactly the failure mode — it applies an interior (u, v) value that the bilinear structure says no individual sub-task should land at. Optimal routing differentiates across tasks (different corners for different θ); the aggregate (ū, v̄) across the day looks interior because the corners differ, not because any single decision is interior. The generating function is parameterised by capability (the L3 invariant): if c_AI rises uniformly across tasks, the model rebalances; if it rises only on certain task types, the boundary of optimal u* shifts but the shape of the routing rule stays put.

This stage produces seven things: (1) a math object — V(u, v; θ) and the four-channel decomposition; (2) a workflow-mode classifier — three-corner per-task router plus a five-region label-partition of the (u, v) plane for observed worker behaviour; (3) a portfolio aggregator with budget-aware shadow-price routing (μ-binary-search over per-task α_eff = α + μ·g); (4) the interactive two-tab dashboard below; (5) five cruxes of the model (load-bearing claims whose collapse rebuilds it); (6) six Stage-4 fitting targets (parameter calibrations and qualitative predictions the data pipeline should test); (7) three engaged objections (c_AI unobservable, model just recovers practitioner intuitions, model is single-shot not strategic) with steelmen and what survives. Scope is explicit: the formalisation does not capture the aggregate-zero puzzle (E4/O2 — organisational dynamics), cross-task productivity bundling (G8/Cowen), c_AI miscalibration on novel tasks (O7), sycophancy as a verification-degrader (E13), or frontier migration over time (O4). These are named scope-limits, not silent assumptions.

Task parameters (θ)

Symbol	Value	Meaning
`c_H`	0.50	Human capability on this task
`c_AI`	0.65	AI capability on this task
`φ`	0.30	Verification cost ratio (verify / generate)
`σ`	0.40	Stakes (uncaught-error penalty weight)
`λ`	0.40	Skill-formation value (preserve this skill?)

Task presets

AI mildly stronger, modestly cheap verification, moderate stakes. Optimum lands at spec-driven: AI does the synthesis, you read the output.

Optimal policy

Autonomy u* = 1.00, verification v* = 1.00, net value V = +0.295

Workflow mode

Spec-driven / independent-then-synthesize

(u, v) = (1.00, 1.00) → spec-driven

Channel decomposition

Channel	Contribution to V
Q quality	+0.325
A attention	+0.550
R risk	+0.000
S skill	-0.400

Each bar shows that channel's contribution to V at the optimum. Q is gain over the human-only baseline; A is attention saved (or spent) vs. the M-only floor; R is the stakes-weighted risk penalty; S is the skill change weighted by λ.

Diagnostics

Q (quality) = 0.825, A (attention) = 0.530, R (risk) = 0.000, S (skill) = 0.000

c_⋆ = c_AI + (1 − c_AI)·c_H = 0.825

V is bilinear in (u, v); the maximum on the unit square is at a corner. The router will return one of three corners — (0, 0) do-yourself, (1, 0) self-automator, or (1, 1) spec-driven. Centaur and cyborg modes appear at the aggregate level when a worker mixes corner policies across sub-tasks with heterogeneous θ — see Day Portfolio.

V(u, v; θ) = Q(u,v) − α·A(u,v) + λ·S(u,v) − σ·R(u,v). Constants: α = 1.00 (normalised), ε = 0.15 (residual attention at u=1; L1 invariant), β = 0.05 (per-task atrophy rate), M = 0.08 (routing tax). Mode thresholds: u_lo = 0.15, u_hi = 0.85, v_lo = 0.3, v_hi = 0.6. Optimum found by 41×41 grid search on the unit square. Stage-4 fitting will tighten α, ε, β, M against telemetry data; mode-distribution match against Randazzo BCG sample (~60% cyborg / ~30% centaur / ~10% self-automator) is Q3 of the named fitting targets.
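The grid search named above is small enough to sketch directly. The following is a minimal Python sketch under the listed defaults, not the dashboard's actual code; the function name `value` and the variable names are illustrative assumptions.

```python
# Default constants and the default task preset, as listed above.
ALPHA, EPS, BETA, M = 1.00, 0.15, 0.05, 0.08
C_H, C_AI, PHI, SIGMA, LAM = 0.50, 0.65, 0.30, 0.40, 0.40

def value(u, v):
    """V(u, v) = Q - alpha*A + lambda*S - sigma*R for the default preset."""
    c_star = C_AI + (1 - C_AI) * C_H                 # verified-output ceiling
    q = (1 - u) * C_H + u * ((1 - v) * C_AI + v * c_star)
    a = (1 - u * (1 - EPS)) + v * PHI + M
    r = u * (1 - v) * (1 - C_AI)
    s = (1 - u) - BETA * u * (1 - v)
    return q - ALPHA * a + LAM * s - SIGMA * r

# 41 evenly spaced points per axis: 0.000, 0.025, ..., 1.000.
grid = [i / 40 for i in range(41)]
best_u, best_v = max(((u, v) for u in grid for v in grid),
                     key=lambda point: value(*point))

print(best_u, best_v, round(value(best_u, best_v), 3))
```

For the default preset the search lands on the (1, 1) corner with V ≈ 0.295, matching the optimal-policy readout. Because V is bilinear, evaluating the four corners alone would give the same answer; the grid is only needed if a later pass adds non-bilinear terms.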

How to read this stage

The dashboard above is the artifact. Everything below is the spec.

Two interactive surfaces:

Per-task router. Inputs: (c_H, c_AI, φ, σ, λ) for one task. Outputs: optimal (u*, v*), the workflow-mode label, and a four-bar decomposition showing which channel dominates. Use this to answer “for a task with these characteristics, what’s the right way to use AI?”
Day portfolio. Inputs: a basket of task types with counts. Outputs: total quality, total attention, total skill change under four strategies (always-self / max-AI / naive cyborg / optimal routing). Use this to see the S1 effect — same AI, different workflow architectures, very different outcomes.

If the dashboard says one thing and your gut says another, the diagnostic is to check (a) whether your (c_H, c_AI) estimates are calibrated and (b) whether the constants (α, ε, β, M) are calibrated for your task density. This is exactly what Stage 4 (the data pipeline) is for.

1. The formalisation moves

Three things this stage does — explicit so they’re inspectable separately.

Move 1 — Decomposition. The per-task value V splits into four orthogonal channels (Q, A, R, S). “Orthogonal” here means: each channel responds to (u, v) in a distinguishable way, so when a slider moves, the user can see which channel is driving the change. This isn’t a stylised choice — it’s how the topology’s mechanism nodes (G3, G7, G9, L1) actually compose. Without the decomposition, “the gain from AI” is a black box; with it, the user can ask “is this gain coming from quality, time-saved, or skill?” and answer.

Move 2 — Generating function. The five workflow modes (P1 / E10) are not given as a typology to be matched. They are generated as solutions to argmax V(u, v; θ) under different parameter regimes. This is the L3 invariant in operational form: change θ, the optimum moves, the mode label changes — but the function that produces the mode is fixed. A practitioner who hardcodes “use Cursor for boilerplate, ChatGPT for strategy” gets a lookup table; this gets a generating function.

Move 3 — Integration. Six prior objects compose into one: Karpathy’s autonomy slider (P2 — the u axis), Shneiderman’s 2D framework (L4 — the (autonomy, control) plane is the (u, v) plane), Vasconcelos verification-economics (G3 — the −α·v·φ term), Bastani’s guardrail finding (E9 — the −β·u·(1-v) skill term), Bainbridge’s substitution myth (L1 — the ε > 0 residual attention), and Madras-Mozannar L2D (L2 — the joint-surface optimisation at portfolio level). None of these alone is the model; the model is what they jointly imply.

What’s not yet ready for formalisation, kept in §9 as scope-limits: cross-task bundling (G8), organisational absorption (E4/O2), miscalibration of c_AI (O7), sycophancy as quality-degrader (E13), frontier migration over time (O4).

2. Variables and objects

Decision variables (per task), continuous on the unit square:

Symbol	Range	Meaning
`u`	[0, 1]	Autonomy level — fraction of generation delegated to AI
`v`	[0, 1]	Verification depth — fraction of AI output independently checked

Task parameters (the vector θ):

Symbol	Range	Meaning
`c_H`	[0, 1]	Human capability on this task type
`c_AI`	[0, 1]	AI capability on this task type
`φ`	≥ 0	Verification-cost ratio (verify-time / generate-time)
`σ`	[0, 1]	Stakes — weight on uncaught-error penalty
`λ`	[0, 1]	Skill-formation value — how much the worker cares about preserving this skill

Constants (calibrated, not per-task):

Symbol	Default	Meaning
`α`	1.00	Attention price (normalisation)
`ε`	0.15	Residual attention at full delegation — L1 invariant (substitution myth)
`β`	0.05	Skill-atrophy rate per unit of unverified delegation
`M`	0.08	Per-task metacognitive routing tax — A1/G1 invariant
`c_⋆`	`c_AI + (1−c_AI)·c_H`	Verified-output ceiling — “either AI got it right OR human catches the error” (complementary product of independent error events)

The constants are calibrated to the lit-review anchors (Mozannar CUPS for ε; Bastani 17% drop for β; Tankelevitch metacognitive load for M). They can be re-fit by Stage 4 against telemetry data.

3. The per-task value function

V(u, v; θ) = Q(u, v) − α·A(u, v) + λ·S(u, v) − σ·R(u, v)

3.1 Quality channel `Q`

Q(u, v) = (1 − u)·c_H + u·[(1 − v)·c_AI + v·c_⋆]

With probability (1 − u) the human did the generation and quality is c_H. With probability u the AI did the generation, of which fraction (1 − v) ships unverified at quality c_AI and fraction v is verified. Verified output achieves quality c_⋆ = c_AI + (1 − c_AI)·c_H = 1 − (1 − c_H)·(1 − c_AI) — the probability that either the AI got it right or (it didn’t and) the human catches the error, treating the two error events as independent. Linear in u for fixed v; linear in v for fixed u.

The complementary-product form natively handles the deskilled-verifier limit (c_H → 0 ⇒ c_⋆ → c_AI: verification adds nothing when the human can’t recognise errors) and the verifier-stronger limit (c_H → 1 ⇒ c_⋆ → 1: careful human verification approaches a quality ceiling). Pass 1 used c_⋆ = max(c_H, c_AI). Since max(c_H, c_AI) ≤ 1 − (1 − c_H)·(1 − c_AI) on [0, 1], that form under-credits verification in both regimes: when c_H > c_AI it caps verified quality at c_H (as if the AI’s correct answers added nothing to the human’s check), and when c_H < c_AI it returns c_AI (as if a partly-skilled human catches no AI errors). The form here treats both with one expression and one structural assumption (independence of human and AI error events).

Caveat: c_H enters both generation quality (the (1 − u)·c_H term, which is all of Q at u = 0) and the verification benefit (the extra catching power at v > 0). Karpathy’s G9 generator-verifier-asymmetry says verification is typically easier than generation: recognising an error costs less than producing a correct answer from scratch. A more accurate model would carry a separate c_V (verifier capability) ≥ c_H for low-c_H workers. Held for Stage 4; named in §9 scope-limits.

3.2 Attention channel `A`

A(u, v) = (1 − u·(1 − ε)) + v·φ + M

Three pieces:

(1 − u·(1 − ε)): human-side generation cost. At u = 0, the human does it all (cost = 1, the base generation time). At u = 1, residual attention ε remains — every offload creates monitoring/coordination work (Bainbridge L1). ε > 0 is the L1 invariant in operational form: a model with ε = 0 would predict that full delegation is free of attention cost, which is exactly the substitution myth.
v·φ: verification cost. Linear in verification depth, scaled by the per-task verification-cost ratio φ. This is the G3 (Vasconcelos verification-economics) term.
M: per-task metacognitive routing tax. Constant per task — classifying, choosing the workflow, monitoring AI for handoffs. Tankelevitch’s metacognitive-demand finding (G1) compressed to a constant; Stage 4 can test whether M is task-type-dependent.

Note that ε is the floor on generation-side attention that no amount of delegation removes: even at u = 1 and v = 0 the worker still pays ε + M. It is what makes “max-AI” not free.

3.3 Risk channel `R`

R(u, v) = u·(1 − v)·(1 − c_AI)

Probability-of-uncaught-error: AI generated the output (u), it was not verified (1 − v), and the AI was wrong (1 − c_AI). Multiplied by stakes σ in the value function. This is the G2 (ironies-of-automation) term: rare critical errors get missed precisely when delegation is high and verification is low.

For high-σ tasks, this term is large enough to drive v → 1 (the spec-driven / independent-then-synthesize regime, E8 — Everett 2025). For low-σ tasks, it’s negligible and the optimum can sit at v = 0 without harm.

3.4 Skill channel `S`

S(u, v) = (1 − u) − β·u·(1 − v)

Two terms:

(1 − u): practice — the human builds skill on the fraction of the work they did themselves.
−β·u·(1 − v): atrophy — unverified delegation erodes capacity. Verification preserves engagement (this is Bastani’s “hint mode” finding E9: guardrails → no atrophy). The product u·(1 − v) is exactly the self-automator regime where atrophy is fastest.

λ is the worker’s per-task valuation of preserving the skill. For tasks the worker explicitly wants to maintain capacity on (their core craft), λ is high and the skill term has bite. For boilerplate or one-off tasks, λ is low and skill is correctly ignored.
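The four channels defined in §3.1 to §3.4 translate almost line for line into code. The sketch below is one possible Python rendering, using the default constants from §2; the `Theta` dataclass and the function names are illustrative assumptions, not part of the dashboard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Theta:
    c_h: float    # human capability on the task
    c_ai: float   # AI capability on the task
    phi: float    # verification-cost ratio
    sigma: float  # stakes
    lam: float    # skill-formation value

# Defaults from the constants table; Stage 4 is expected to re-fit them.
ALPHA, EPS, BETA, M = 1.00, 0.15, 0.05, 0.08

def channels(u: float, v: float, th: Theta) -> dict:
    """Return the four channel values Q, A, R, S at decision (u, v)."""
    c_star = th.c_ai + (1 - th.c_ai) * th.c_h
    return {
        "Q": (1 - u) * th.c_h + u * ((1 - v) * th.c_ai + v * c_star),
        "A": (1 - u * (1 - EPS)) + v * th.phi + M,
        "R": u * (1 - v) * (1 - th.c_ai),
        "S": (1 - u) - BETA * u * (1 - v),
    }

def value(u: float, v: float, th: Theta) -> float:
    """V = Q - alpha*A + lambda*S - sigma*R."""
    ch = channels(u, v, th)
    return ch["Q"] - ALPHA * ch["A"] + th.lam * ch["S"] - th.sigma * ch["R"]

preset = Theta(c_h=0.50, c_ai=0.65, phi=0.30, sigma=0.40, lam=0.40)
for corner in [(0, 0), (1, 0), (1, 1)]:
    rounded = {k: round(x, 3) for k, x in channels(*corner, preset).items()}
    print(corner, rounded, round(value(*corner, preset), 3))
```

Keeping the channels in a dictionary rather than collapsing straight to V mirrors the decomposition move: the caller can see which channel drives a change when u or v moves.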

3.5 Putting it together

V(u, v; θ) =   (1 − u)·c_H + u·[(1 − v)·c_AI + v·c_⋆]            ← Q
             − α·[(1 − u·(1 − ε)) + v·φ + M]                       ← −α·A
             + λ·[(1 − u) − β·u·(1 − v)]                           ← +λ·S
             − σ·[u·(1 − v)·(1 − c_AI)]                            ← −σ·R

Five parameters in θ, two decisions, four channels. Exactly bilinear in (u, v): collecting terms,

V(u, v; θ) = K_0 + K_u·u + K_v·v + K_uv·u·v

with

K_0 = c_H − α − α·M + λ
K_u = (c_AI − c_H) + α·(1 − ε) − λ·(1 + β) − σ·(1 − c_AI)
K_v = −α·φ (K_v captures the cost of verification when there is no AI to verify, i.e., at u = 0; pure cost, hence ≤ 0 — this is exactly why corner (0, 1) is dominated by (0, 0) below)
K_uv = (c_⋆ − c_AI) + λ·β + σ·(1 − c_AI) = (1 − c_AI)·c_H + λ·β + σ·(1 − c_AI) (always ≥ 0 — verification gain is monotone in u)
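The collected coefficients can be checked numerically against the channel-by-channel definition. A minimal Python sketch, assuming the default constants; `value` and `coefficients` are hypothetical helper names introduced for this check only.

```python
import itertools

ALPHA, EPS, BETA, M = 1.00, 0.15, 0.05, 0.08

def value(u, v, c_h, c_ai, phi, sigma, lam):
    """Channel-by-channel V, as in the four-line expansion above."""
    c_star = c_ai + (1 - c_ai) * c_h
    q = (1 - u) * c_h + u * ((1 - v) * c_ai + v * c_star)
    a = (1 - u * (1 - EPS)) + v * phi + M
    s = (1 - u) - BETA * u * (1 - v)
    r = u * (1 - v) * (1 - c_ai)
    return q - ALPHA * a + lam * s - sigma * r

def coefficients(c_h, c_ai, phi, sigma, lam):
    """Collected bilinear coefficients K_0, K_u, K_v, K_uv."""
    k0 = c_h - ALPHA - ALPHA * M + lam
    ku = (c_ai - c_h) + ALPHA * (1 - EPS) - lam * (1 + BETA) - sigma * (1 - c_ai)
    kv = -ALPHA * phi
    kuv = (1 - c_ai) * c_h + lam * BETA + sigma * (1 - c_ai)
    return k0, ku, kv, kuv

theta = (0.50, 0.65, 0.30, 0.40, 0.40)   # dashboard default preset
k0, ku, kv, kuv = coefficients(*theta)

# Compare both forms on an 11 x 11 grid of (u, v) values.
grid = [i / 10 for i in range(11)]
max_gap = max(
    abs(value(u, v, *theta) - (k0 + ku * u + kv * v + kuv * u * v))
    for u, v in itertools.product(grid, grid)
)
print(k0, ku, kv, kuv, max_gap)
```

The largest gap should sit at floating-point noise level, which is what "exactly bilinear" means operationally: no residual term is left over after collecting.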

Bilinear functions on a unit square attain their maximum at a corner. The interior critical point, when it exists, is a saddle (Hessian eigenvalues ±K_uv). So the per-task optimum is at one of the four corners — and since K_v < 0, (0, 1) is dominated by (0, 0) (verifying when no AI is involved is pure cost). The three meaningful corners:

(0, 0) — do-yourself.
(1, 0) — full delegation, no verification (self-automator).
(1, 1) — full delegation with full verification (spec-driven / independent-then-synthesize).

Which corner wins depends on the signs of K_u, K_u + K_uv + K_v, and K_v + K_uv: three pairwise corner comparisons, namely V(1, 0) − V(0, 0), V(1, 1) − V(0, 0), and V(1, 1) − V(1, 0) respectively.
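The saddle claim can be verified directly from the bilinear form. A short derivation using only the coefficients defined above (the symbols u°, v° for the critical point and ρ for the eigenvalue are notation introduced here):

```latex
\begin{aligned}
\frac{\partial V}{\partial u} &= K_u + K_{uv}\,v, \qquad
\frac{\partial V}{\partial v} = K_v + K_{uv}\,u,\\[4pt]
(u^{\circ}, v^{\circ}) &= \left(-\frac{K_v}{K_{uv}},\; -\frac{K_u}{K_{uv}}\right)
\quad (K_{uv} \neq 0),\\[4pt]
H &= \begin{pmatrix} 0 & K_{uv} \\ K_{uv} & 0 \end{pmatrix}, \qquad
\det(H - \rho I) = \rho^{2} - K_{uv}^{2} = 0
\;\Rightarrow\; \rho = \pm K_{uv}.
\end{aligned}
```

With one positive and one negative eigenvalue the critical point cannot be a local maximum, so the maximum over the unit square lies on the boundary. V is linear along each edge, so it is attained at a corner.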

4. Optimal policy: the three corners that win

V is bilinear → max at a corner. Of the four corners, (0, 1) is dominated by (0, 0) because K_v < 0 (verifying with no AI involvement is pure cost). Three candidates remain — and a clean decision tree determines the winner:

Spec-driven (1, 1) wins iff K_u + K_v + K_uv > 0 and K_v + K_uv > 0.
Else self-automator (1, 0) wins iff K_u > 0.
Else do-yourself (0, 0) wins.

Equivalently: spec-driven dominates self-automator when α·φ < K_uv (verification cost is below benefit at full delegation: α·φ < (1 − c_AI)·c_H + λ·β + σ·(1 − c_AI)); self-automator dominates do-yourself when K_u > 0 (the attention savings + AI quality gain outweigh skill loss + stakes risk).
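The decision tree above is short enough to write out as a function. A minimal Python sketch, assuming the default constants; the name `route` and the example parameter values other than the dashboard preset are hypothetical.

```python
ALPHA, EPS, BETA = 1.00, 0.15, 0.05

def route(c_h, c_ai, phi, sigma, lam, alpha=ALPHA):
    """Return ((u*, v*), label) using the three-step corner decision tree."""
    k_u = (c_ai - c_h) + alpha * (1 - EPS) - lam * (1 + BETA) - sigma * (1 - c_ai)
    k_v = -alpha * phi
    k_uv = (1 - c_ai) * c_h + lam * BETA + sigma * (1 - c_ai)

    # Step 1: spec-driven must beat both do-yourself and self-automator.
    if k_u + k_v + k_uv > 0 and k_v + k_uv > 0:
        return (1, 1), "spec-driven"
    # Step 2: otherwise, unverified delegation must beat do-yourself.
    if k_u > 0:
        return (1, 0), "self-automator"
    # Step 3: fall back to doing the task without AI.
    return (0, 0), "do-yourself"

# Dashboard default preset.
print(route(0.50, 0.65, 0.30, 0.40, 0.40))
# Same task with verification made expensive (hypothetical phi = 0.60).
print(route(0.50, 0.65, 0.60, 0.40, 0.40))
# Hypothetical core-craft task: strong human, weak AI, costly checks.
print(route(0.90, 0.30, 0.60, 0.80, 0.80))
```

The three calls land on spec-driven, self-automator, and do-yourself respectively. M does not appear because it only shifts K_0, which is common to every corner and so never changes the ranking.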

Comparative statics — what moves the corner choice:

Parameter increase	Effect on `u*`	Effect on `v*` (given `u = 1`)
`c_AI − c_H` ↑ via `c_AI` rising (`c_H` fixed)	↑ (K_u rises)	↓ (K_uv falls: less verification benefit)
`φ` ↑ (verification expensive)	weakly ↓ via v*-flip	↓ (K_v more negative)
`σ` ↑ (stakes)	weakly ↓ (K_u falls by `σ·(1−c_AI)`)	↑ (K_uv rises by `σ·(1−c_AI)`)
`λ` ↑ (skill matters)	weakly ↓ (K_u falls by `λ·(1+β)`)	↑ slightly (K_uv rises by `λ·β`)
`c_AI` ↑ alone	↑	↓ (both `(1−c_AI)·c_H` and `σ·(1−c_AI)` fall)

This is the L3 invariant in tabular form: if c_AI rises uniformly, optimal u* rises and v* falls. The structural shape of the rule does not change. Two signs worth flagging because they’re non-obvious: more reliable AI (c_AI ↑) reduces verification benefit (fewer errors to catch); higher stakes (σ ↑) pushes u* down (refuse AI when stakes are high) and v* up (if you do use AI, verify carefully) — a real tension the bilinear structure makes explicit.
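The signs in the table can be read off the partial derivatives of the coefficients in §3.5. A short worked derivation, assuming only c_AI, c_H ∈ [0, 1] and α, β, σ ≥ 0:

```latex
\begin{aligned}
\frac{\partial K_u}{\partial c_{AI}} &= 1 + \sigma > 0, &
\frac{\partial K_{uv}}{\partial c_{AI}} &= -(c_H + \sigma) \le 0,\\[4pt]
\frac{\partial K_u}{\partial \sigma} &= -(1 - c_{AI}) \le 0, &
\frac{\partial K_{uv}}{\partial \sigma} &= 1 - c_{AI} \ge 0,\\[4pt]
\frac{\partial K_u}{\partial \lambda} &= -(1 + \beta) < 0, &
\frac{\partial K_{uv}}{\partial \lambda} &= \beta \ge 0,\\[4pt]
\frac{\partial K_v}{\partial \varphi} &= -\alpha < 0, &
\frac{\partial K_u}{\partial \varphi} &= \frac{\partial K_{uv}}{\partial \varphi} = 0.
\end{aligned}
```

Because φ enters only through K_v, it can move u* only by changing whether (1, 1) survives as the corner compared against (0, 0), which is the "via v*-flip" entry in the table.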

The five practitioner modes — labels for the (u, v) plane

The practitioner literature names five workflow modes. They span the (u, v) plane and are useful as vocabulary for labelling observed worker behaviour at any (u, v):

Region	Mode	Practitioner anchor
`u ≈ 0`	Do-yourself	(no-AI / refuse-AI)
`u ∈ (0, 1)`, `v` high	Centaur	Mollick — clean handoff with verification gate
`u ∈ (0, 1)`, `v` low/mid	Cyborg	Mollick — interleaved, partial verification
`u ≈ 1`, `v` high	Spec-driven / independent-then-synthesize	Everett 2025; Compound Engineering
`u ≈ 1`, `v` low	Self-automator	Randazzo HBS 26-036 (the trap)

The per-task router only returns the three corners — (0, 0), (1, 0), (1, 1) — corresponding to do-yourself, self-automator, and spec-driven. Centaur and cyborg do not arise as per-task optima. They appear empirically as aggregate-level labels when a worker mixes corner policies across sub-tasks with heterogeneous θ — some sub-tasks done alone, others fully delegated; some AI outputs verified, others shipped. The day-level (ū, v̄) averages out to interior values that get labelled “cyborg” or “centaur” depending on the ratio. The Day Portfolio tab demonstrates this directly.

This is a substantive prediction of the formalisation, not a limitation. Bilinearity is faithful to the structure of the problem: V collects to one bilinear form V = K_0 + K_u·u + K_v·v + K_uv·u·v, even though three of the four channels (Q, R, S) carry their own u·v terms — they sum to one consolidated K_uv. The model says: at any single sub-task with a single θ, pick a corner — fully delegate or don’t, fully verify or don’t. The naive flat-cyborg strategy (u = 0.7, v = 0.3 applied to every task) is exactly the failure mode — applying an interior policy uniformly is what no individual sub-task should do under bilinearity.
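The aggregation argument can be illustrated with a toy mix of sub-tasks. In the Python sketch below the three sub-tasks, their θ values, and the equal weighting are all hypothetical, chosen only so that each of the three corners appears once.

```python
ALPHA, EPS, BETA = 1.00, 0.15, 0.05

def route(c_h, c_ai, phi, sigma, lam):
    """Per-task corner choice (u*, v*) from the decision tree in this section."""
    k_u = (c_ai - c_h) + ALPHA * (1 - EPS) - lam * (1 + BETA) - sigma * (1 - c_ai)
    k_v = -ALPHA * phi
    k_uv = (1 - c_ai) * c_h + lam * BETA + sigma * (1 - c_ai)
    if k_u + k_v + k_uv > 0 and k_v + k_uv > 0:
        return (1, 1)
    return (1, 0) if k_u > 0 else (0, 0)

# Hypothetical sub-tasks: (label, (c_H, c_AI, phi, sigma, lambda)).
subtasks = [
    ("draft boilerplate",      (0.40, 0.70, 0.20, 0.20, 0.10)),
    ("high-stakes analysis",   (0.65, 0.70, 0.40, 0.90, 0.30)),
    ("core-craft judgement",   (0.90, 0.30, 0.60, 0.80, 0.80)),
]

corners = [route(*theta) for _, theta in subtasks]

# Equal-weight averages across sub-tasks (an assumption; real days weight by time).
u_bar = sum(u for u, _ in corners) / len(corners)
v_bar = sum(v for _, v in corners) / len(corners)

for (label, _), corner in zip(subtasks, corners):
    print(label, corner)
print("aggregate:", round(u_bar, 2), round(v_bar, 2))
```

Every sub-task sits at a corner, yet the averages come out interior (about 0.67 and 0.33). An observer who only sees the day-level pattern would label this worker with one of the interior modes, even though no single decision was interior.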

5. Portfolio aggregation — the day

A worker faces N tasks per day, each with its own θ_i. Total attention budget T. The portfolio problem:

maximise   Σ_i V_i(u_i, v_i) · count_i
subject to Σ_i A_i(u_i, v_i) · g_i · count_i ≤ T

where g_i is the task type’s base generation time. The Lagrangian gives a shadow price μ ≥ 0 on the budget constraint, and the per-task choice rule becomes

argmax V_i − μ·A_i·g_i  =  argmax V_i with α replaced by α_eff_i = α + μ·g_i

— that is, budget pressure raises the effective attention price, more so for longer-base-time tasks. When μ = 0 the budget is slack and per-task choice is unconstrained argmax V; when μ > 0 the budget binds and α_eff rises until total absolute attention Σ A_i·g_i·count_i fits T. The biasing is structural: raising α_eff raises K_u (self-automator becomes more attractive vs. do-yourself) and makes K_v = −α_eff·φ more negative (verification becomes less attractive). So as the budget tightens, longer tasks reroute first from spec-driven (1, 1) to self-automator (1, 0) — the lowest-A corner.

This is the L2 invariant operationalised at portfolio level. The naive practitioner rule “use AI when AI is better” compares c_H to c_AI task-by-task in isolation; the joint-surface rule compares marginal Q per unit attention saved against the day’s shadow price. They give the same answer when attention is abundant (μ ≈ 0); they diverge sharply when the day is tight.
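The shadow-price mechanism can be sketched as a bisection on μ. The Python below is a minimal illustration under the default constants, not the dashboard's implementation; the helper names, the price cap `mu_max`, and the two-task day are hypothetical.

```python
ALPHA, EPS, BETA, M = 1.00, 0.15, 0.05, 0.08

def route(theta, alpha):
    """Corner choice for one task at attention price alpha."""
    c_h, c_ai, phi, sigma, lam = theta
    k_u = (c_ai - c_h) + alpha * (1 - EPS) - lam * (1 + BETA) - sigma * (1 - c_ai)
    k_v = -alpha * phi
    k_uv = (1 - c_ai) * c_h + lam * BETA + sigma * (1 - c_ai)
    if k_u + k_v + k_uv > 0 and k_v + k_uv > 0:
        return (1, 1)
    return (1, 0) if k_u > 0 else (0, 0)

def attention(u, v, phi):
    """A(u, v) in units of the task's base generation time."""
    return (1 - u * (1 - EPS)) + v * phi + M

def total_attention(tasks, mu):
    """Sum of A_i * g_i * count_i with each task routed at alpha + mu * g_i."""
    total = 0.0
    for theta, g, count in tasks:
        u, v = route(theta, ALPHA + mu * g)
        total += attention(u, v, theta[2]) * g * count
    return total

def solve_mu(tasks, budget, mu_max=50.0, iters=60):
    """Smallest shadow price (to bisection precision) at which the day fits."""
    if total_attention(tasks, 0.0) <= budget:
        return 0.0                     # budget is slack
    if total_attention(tasks, mu_max) > budget:
        return None                    # overflows even at the price cap
    lo, hi = 0.0, mu_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        if total_attention(tasks, mid) <= budget:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical day: (theta, base time g, count).
tasks = [
    ((0.65, 0.70, 0.40, 0.90, 0.30), 1.0, 2),   # short, high-stakes
    ((0.85, 0.70, 0.20, 0.20, 0.10), 4.0, 1),   # long, expert-verified
]
mu = solve_mu(tasks, budget=2.5)
print(mu, [route(theta, ALPHA + mu * g) for theta, g, _ in tasks])
```

In this toy day both tasks are spec-driven when the budget is slack (total attention 2.98). With a budget of 2.5 the bisection settles near μ ≈ 0.15, at which point the long task has rerouted to (1, 0) while the short high-stakes task stays at (1, 1), which is the "longer tasks reroute first" pattern described above.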

Strategies the dashboard compares:

Always-self. u_i = 0 for all i. No AI, no atrophy, no verification cost — but no productivity gain. The pre-AI baseline.
Max-AI. u_i = 1, v_i = 0 for all i. The corner uniformly applied. Fast, but high risk on high-σ tasks and accelerated atrophy on high-λ tasks.
Naive cyborg. u_i = 0.7, v_i = 0.3 for all i. A flat interior policy applied uniformly — exactly what the bilinear structure says no individual task should land at. Interior values arise legitimately only as averages over heterogeneous-θ sub-tasks. Applying them uniformly violates the structure and underperforms.
Optimal routing (budget-aware). Per-task corner choice under the shadow-price-adjusted α_eff_i = α + μ·g_i, with μ solved by binary search until the budget is met (or until even uniform self-automator overflows). Each task lands at the corner appropriate to its θ and the day’s binding budget. The aggregate (ū, v̄) across the day looks interior because different tasks land at different corners; the dashboard surfaces μ below the strategy table so the user can see when the budget is biting.

The headline prediction (S1): optimal routing dominates max-AI by quality-per-attention and dominates always-self by attention efficiency, on the same c_AI. The gap between optimal routing on mid-tier AI and naive-flat routing on frontier AI is the empirical bound the model places on “workflow architecture > model capability.” Mechanism: optimal routing differentiates across tasks (different corners for different θ) and reroutes under budget pressure (shadow price μ); naive uniformly applies an interior policy that is structurally never the per-task optimum at any α.

6. Calibration anchors

Where the constants come from. None of these is precise; all are Stage-4 fitting targets.

α = 1 — normalisation. Attention is the numéraire; all other costs are denominated in attention units.
ε = 0.15 — Mozannar CUPS (E12) shows verification + monitoring is a “substantial fraction” of total interaction time even when AI is doing the generation. 15% is a midpoint of the reported range; Stage 4 should tighten this from telemetry.
β = 0.05 — Bastani’s 17% unassisted-performance drop (E9) over a session of ~30 unguardrailed tasks → ~0.5% atrophy per task at u = 1, v = 0. Setting β = 0.05 means the model implies ~5% atrophy per task in the worst-case regime, which compounds to Bastani’s order-of-magnitude over a week. The lit review explicitly notes most studies are under 12 months; β at the per-task scale is what compounds to the longitudinal scale.
M = 0.08 — Tankelevitch (G1) finds metacognitive load is the binding constraint for AI users; CUPS (E12) finds verification + planning consume a meaningful fraction of total time. 8% per task is a calibration anchor consistent with the magnitude of the metacognitive-bottleneck claim. Stage 4 should test whether M varies by task type (it likely does — high-stakes strategic decisions have larger M than routine email).

At the per-task level, the defaults predict (1, 0) self-automator at routine-low-stakes corners (Randazzo’s ~10% empirically), (1, 1) spec-driven where verification benefit dominates verification cost, and (0, 0) do-yourself in outside-frontier regimes where neither form of delegation pays. The full Randazzo 60/30/10 (cyborg/centaur/self-automator) distribution emerges only at the aggregate level, when a worker’s day mixes corner policies across heterogeneous-θ sub-tasks (see §4). The defaults predict outside-frontier harm (Dell’Acqua E3) when c_H exceeds c_AI by enough that K_u < 0 and the worker nonetheless delegates without verification (the model never prescribes (1, 0) there); the harm is from disobeying the optimum, not from a model output.

7. Worked anchors against the empirical record

Six probes. Each picks a parameter regime, computes the corner optimum exactly, and compares to the lit-review anchor.

A. Brynjolfsson (E1, E2) — customer-service novice +34%, top performers ~0%.

Novice: θ = (c_H = 0.40, c_AI = 0.70, φ = 0.20, σ = 0.20, λ = 0.10). c_⋆ = 0.70 + 0.30·0.40 = 0.82. Then K_u = 0.30 + 0.85 − 0.105 − 0.06 = 0.985 > 0 (full delegation beats do-yourself), and K_v + K_uv = −0.20 + (0.12 + 0.005 + 0.06) = −0.015 < 0 (verification cost barely exceeds benefit). Optimum: (1, 0) self-automator. Quality lift over always-self: c_AI − c_H = +0.30. Direction matches the +34% Brynjolfsson finding (which is in resolution rate, mixing speed and accuracy).

Expert: θ = (c_H = 0.85, c_AI = 0.70, φ = 0.20, σ = 0.20, λ = 0.10). c_⋆ = 0.70 + 0.30·0.85 = 0.955. Then K_u = −0.15 + 0.85 − 0.105 − 0.06 = 0.535 > 0, and K_v + K_uv = −0.20 + (0.255 + 0.005 + 0.06) = +0.12 > 0 (verification benefit dominates because c_⋆ − c_AI = 0.255 is large — a skilled human catches AI errors). Optimum: (1, 1) spec-driven. Quality lift: c_⋆ − c_H = +0.105, only +12% relative.

So the model produces: novice at full-delegate-no-verify (Q = 0.70 from c_AI alone), expert at full-delegate-full-verify (Q = 0.955 = AI augmented by skilled human catching). Both delegate; the difference is in verification depth. The empirical “expert ~0%” finding reflects throughput ceiling (experts already at maximum call rate, can’t redeploy saved attention to more calls) rather than zero quality lift — a context the model doesn’t carry.
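The novice/expert contrast can be reproduced in a few lines. A minimal Python sketch using the parameter values of probe A and the default constants; the function name `probe` and the dictionary keys are illustrative assumptions.

```python
ALPHA, EPS, BETA = 1.00, 0.15, 0.05
C_AI, PHI, SIGMA, LAM = 0.70, 0.20, 0.20, 0.10   # shared by both workers

def probe(c_h):
    """Corner optimum and quality lift for a worker with capability c_h."""
    c_star = C_AI + (1 - C_AI) * c_h
    k_u = (C_AI - c_h) + ALPHA * (1 - EPS) - LAM * (1 + BETA) - SIGMA * (1 - C_AI)
    # K_v + K_uv: net benefit of verifying at full delegation.
    verify_margin = -ALPHA * PHI + (c_star - C_AI) + LAM * BETA + SIGMA * (1 - C_AI)
    if k_u + verify_margin > 0 and verify_margin > 0:
        corner, q = (1, 1), c_star
    elif k_u > 0:
        corner, q = (1, 0), C_AI
    else:
        corner, q = (0, 0), c_h
    return {"K_u": k_u, "K_v+K_uv": verify_margin,
            "corner": corner, "Q": q, "lift": q - c_h}

for label, c_h in [("novice", 0.40), ("expert", 0.85)]:
    result = probe(c_h)
    print(label, {k: (round(x, 3) if isinstance(x, float) else x)
                  for k, x in result.items()})
```

The only input that differs between the two calls is c_H, so the flip from (1, 0) to (1, 1) is driven entirely by the (1 − c_AI)·c_H term in K_uv: the expert's verification catches enough AI errors to pay for itself, the novice's does not.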

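B. Dell’Acqua BCG (E3): outside-frontier 19-pp quality drop. θ = (c_H = 0.70, c_AI = 0.40, φ = 0.30, σ = 0.50, λ = 0.50). c_⋆ = 0.40 + 0.60·0.70 = 0.82. Then K_u = −0.30 + 0.85 − 0.525 − 0.30 = −0.275 < 0, so do-yourself (0, 0) beats unverified delegation (1, 0). The decision tree in §4 also requires the spec-driven check: K_v + K_uv = −0.30 + (0.42 + 0.025 + 0.30) = +0.445 > 0 and K_u + K_v + K_uv = +0.17 > 0, so at φ = 0.30 the exact corner optimum is (1, 1) spec-driven, and (0, 0) do-yourself becomes the optimum only once verification is expensive enough (φ > 0.47 at these values). The Dell’Acqua harm corresponds to the unverified branch: if the worker routes to (1, 0), quality drops from c_H = 0.70 to c_AI = 0.40, a 30 pp loss. The empirical 19 pp reflects partial mis-routing (some subjects partially used AI, some didn’t); the model’s prediction is a clean upper bound on the harm.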

C. Bastani PNAS (E9) — guardrails preserve skill. Generic learning task with high skill-formation: θ = (c_H = 0.40, c_AI = 0.70, φ = 0.30, σ = 0.20, λ = 0.50). c_⋆ = 0.82. K_u = 0.30 + 0.85 − 0.525 − 0.06 = 0.565 > 0. K_v + K_uv = −0.30 + (0.12 + 0.025 + 0.06) = −0.095 < 0. Optimum: (1, 0) self-automator.

This is a real and honest finding: at lit-review-anchored constants (β = 0.05), the skill-preservation push toward verification (λ·β·u = 0.025 at u = 1) is too small to flip the corner against verification cost. The guardrail effect is structurally present: at the (1, 1) corner, S = 0 (no atrophy) versus S = −0.05 at (1, 0), so λ·ΔS = +0.025. But the verification cost (α·φ = 0.30) dominates. Numerically, self-automator beats spec-driven by α·φ − K_uv = 0.30 − 0.205 = 0.095 net at default constants. To make guardrails decisive (flip the corner to spec-driven), the model needs λ·β > α·φ − [(1 − c_AI)·c_H + σ·(1 − c_AI)] = 0.30 − 0.18 = 0.12. At this probe’s λ = 0.50, β must exceed 0.24; at β = 0.05, λ alone cannot flip the corner (would require λ > 2.4, impossible since λ ∈ [0, 1]); at β = 0.05 and λ = 0.50, lowering φ flips the corner once φ < 0.205, about a third cheaper than the 0.30 used here.

Honest reading: the model says the guardrail effect is real but weak at default constants. Stage-4 fitting target Q2 is whether β should be larger to match Bastani’s empirical magnitude. This is not a model failure — it’s the model surfacing a calibration question the lit review left implicit.
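The three flip conditions quoted for the Bastani anchor can be recomputed in a few lines of Python. This is a minimal sketch, not the Stage-4 pipeline: the function name `flip_thresholds` and its return keys are hypothetical, and it assumes K_v = −α·φ and K_uv = (1 − c_AI)·c_H + λ·β + σ·(1 − c_AI), which is the decomposition the arithmetic above implies.

```python
def flip_thresholds(c_h, c_ai, phi, sigma, lam, beta, alpha=1.0):
    """Single-parameter thresholds at which (1, 1) overtakes (1, 0).

    Assumes spec-driven wins when K_uv > alpha * phi, with
    K_uv = (1 - c_ai) * c_h + lam * beta + sigma * (1 - c_ai).
    """
    # Part of K_uv that does not depend on the skill channel.
    base = (1 - c_ai) * c_h + sigma * (1 - c_ai)
    # Amount the skill term lam * beta must exceed to flip the corner.
    gap = alpha * phi - base
    return {
        "lam_beta_needed": gap,
        "beta_min": gap / lam,       # holding lam fixed
        "lam_min": gap / beta,       # holding beta fixed
        "phi_max": (base + lam * beta) / alpha,  # holding lam and beta fixed
    }


# Bastani anchor parameters as given in the text, with default beta = 0.05.
bastani = flip_thresholds(c_h=0.40, c_ai=0.70, phi=0.30, sigma=0.20,
                          lam=0.50, beta=0.05)
for name, value in bastani.items():
    print(f"{name}: {value:.3f}")
```

Up to floating-point rounding, the four values should match the 0.12, 0.24, 2.4, and 0.205 quoted above. A `lam_min` above 1 signals that λ alone cannot flip the corner.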

D. Everett (E8) — independent-then-synthesize restores complementarity. θ = (c_H = 0.65, c_AI = 0.70, φ = 0.40, σ = 0.90, λ = 0.30). c_⋆ = 0.70 + 0.30·0.65 = 0.895. K_u = 0.05 + 0.85 − 0.315 − 0.27 = 0.315 > 0. K_v + K_uv = −0.40 + (0.195 + 0.015 + 0.27) = +0.08 > 0. Optimum: (1, 1) spec-driven. Mechanism: the σ·(1 − c_AI) = 0.27 term in K_uv makes verification valuable precisely because stakes are high and AI is fallible. Matches Everett’s lit-review story exactly.

E. Randazzo self-automator (E10) — ~10% of consultants in the trap. Routine consulting task: θ = (c_H = 0.70, c_AI = 0.85, φ = 0.20, σ = 0.20, λ = 0.10). c_⋆ = 0.955. K_u = 0.15 + 0.85 − 0.105 − 0.03 = 0.865 > 0. K_v + K_uv = −0.20 + (0.105 + 0.005 + 0.03) = −0.06 < 0. Optimum: (1, 0) self-automator. Self-automator is the correct policy at this θ — the empirical finding “~10% of consultants are self-automators” is about θ-distribution (~10% of work-instances have these characteristics), not about systematic mis-routing. A misread of Randazzo’s data as “self-automator is always wrong” is a category error the model corrects.

F. Schoenegger (E18) — even overconfident GPT improves forecasting +23–43%. This is outside the model’s current formalism. The model attributes any gain at u > 0 to c_AI (advice quality), but Schoenegger’s finding suggests structured reasoning is doing significant work independent of the AI’s confidence calibration. A constant +δ to Q whenever u > 0 would represent this — held for future passes if it changes downstream predictions. Honest gap.
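The corner arithmetic in anchors C, D, and E can be reproduced mechanically. The sketch below is an illustration under stated assumptions, not the published component: the coefficient forms are reconstructed from the printed numbers (for example, the −0.525 term in K_u is read as −λ·(1 + β) with β = 0.05), the constants α = 1 and ε = 0.15 are the defaults quoted in the iteration history, and the names `Theta`, `coefficients`, and `best_corner` are hypothetical.

```python
from dataclasses import dataclass

ALPHA, EPS, BETA = 1.0, 0.15, 0.05  # default constants quoted in the write-up


@dataclass(frozen=True)
class Theta:
    c_h: float
    c_ai: float
    phi: float
    sigma: float
    lam: float


def coefficients(t):
    """Return (K_u, K_v, K_uv) of V = K_0 + K_u*u + K_v*v + K_uv*u*v."""
    c_star = t.c_ai + (1 - t.c_ai) * t.c_h
    k_u = ((t.c_ai - t.c_h) + ALPHA * (1 - EPS)
           - t.lam * (1 + BETA) - t.sigma * (1 - t.c_ai))
    k_v = -ALPHA * t.phi
    k_uv = (c_star - t.c_ai) + t.lam * BETA + t.sigma * (1 - t.c_ai)
    return k_u, k_v, k_uv


def best_corner(t):
    """Argmax of V over the four corners, measured relative to V(0, 0)."""
    k_u, k_v, k_uv = coefficients(t)
    gains = {(0, 0): 0.0, (1, 0): k_u, (0, 1): k_v, (1, 1): k_u + k_v + k_uv}
    return max(gains, key=gains.get)


ANCHORS = {
    "C Bastani": Theta(c_h=0.40, c_ai=0.70, phi=0.30, sigma=0.20, lam=0.50),
    "D Everett": Theta(c_h=0.65, c_ai=0.70, phi=0.40, sigma=0.90, lam=0.30),
    "E Randazzo": Theta(c_h=0.70, c_ai=0.85, phi=0.20, sigma=0.20, lam=0.10),
}

for name, theta in ANCHORS.items():
    k_u, k_v, k_uv = coefficients(theta)
    print(name, round(k_u, 3), round(k_v + k_uv, 3), best_corner(theta))
```

Because a bilinear V attains its maximum on the unit square at a corner, comparing the four corner values is sufficient; no grid search over (u, v) is needed.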

Cross-context note: outcome heterogeneity across these anchors

The six papers above measure different outcome variables. The model’s Q (“probability of correct/high-quality output”) maps cleanly onto Dell’Acqua, Everett, and Schoenegger (all output-quality measures). Brynjolfsson’s “issues resolved per hour” maps approximately onto Q × call-rate, where call-rate depends on attention saved (A); the model’s +34% novice prediction is a Q-only claim, while Brynjolfsson’s empirical 34% bundles throughput. Bastani’s “unassisted retest performance” is a downstream effect of the S channel accumulated over many task instances, not a per-task Q measurement. Randazzo’s “behavioural-mode distribution” is a categorical prediction over which corner the worker chooses, not a quality measure at all. Mozannar’s CUPS data is process telemetry (time fractions across interaction states), informative for α and ε calibration but not for Q.

Stage 4 must disaggregate which constants get fit against which outcome types — pooling them as a single calibration target would silently average over methodological apples and oranges. This is named explicitly as Q1–Q6 in §11.

8. The five cruxes

Load-bearing claims of the model. If any one of them collapses, the model has to be rebuilt.

C1 — Two-axis decision space (u, v). The decision is reduced to autonomy and verification depth. If the actual workflow choice space has more dimensions that matter — e.g., context-engineering depth, prompt-iteration count, tool selection — the model is incomplete. What would flip it: empirical evidence that two workflows with identical (u, v) but different context-engineering produce systematically different outcomes (which the practitioner literature suggests is real — S4).

C2 — Verification effectiveness equals generation skill. c_⋆ = c_AI + (1 − c_AI)·c_H treats c_H as both generation skill and verifier capability: a worker who’d produce 40%-quality output alone catches 40% of AI errors. Karpathy’s G9 generator-verifier asymmetry says verification is typically easier than generation; a separate c_V ≥ c_H parameter would make low-c_H workers more effective verifiers. What would flip it: empirical evidence that verifier-recognition rates are uncorrelated with generation skill (which would require introducing c_V). The deskilled-verifier worry is already partially handled by the formula (c_H → 0 ⇒ c_⋆ → c_AI), but the skilled-but-not-creative verifier (the editor archetype) is not in scope.

C3 — ε > 0 is the right operationalisation of L1. The substitution-myth invariant is captured as residual attention at full delegation. If the actual structure is more like “delegation creates new tasks of comparable effort” rather than “delegation creates monitoring overhead”, a constant ε is wrong shape. What would flip it: evidence that delegation-induced work scales with task complexity, not as a flat constant.

C4 — Skill atrophy β is task-type-uniform. The model lets all tasks atrophy at the same rate per unit of u·(1 − v). But some skills (e.g., motor-procedural) atrophy more slowly than others (verbal fluency, calibration). What would flip it: longitudinal data on skill-specific atrophy rates under controlled AI-use exposure.

C5 — Tasks are independent in the portfolio. Σ_i V_i aggregates linearly. Cross-task productivity bundling (G8/Cowen) violates this — productivity gains on related tasks are correlated, not additive. What would flip it: empirical evidence that observed aggregate productivity (E4 / Humlum-Vestergaard zero) is driven by task-coupling effects, not just by individual mis-routing. This is the most likely-to-flip crux: the aggregate-zero puzzle is a smoking gun for it.
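Crux C2 is easiest to see numerically. The sketch below computes the verified-output ceiling with an optional separate verifier skill; the `c_v` argument and the function name `verified_ceiling` are hypothetical, since the current model has no such parameter and uses c_H for both roles.

```python
def verified_ceiling(c_ai, c_h, c_v=None):
    """c_star = c_ai + (1 - c_ai) * catch_rate.

    In the current model the catch rate is c_h. Passing c_v substitutes a
    separate verifier skill (the hypothetical C2 extension, where c_v >= c_h
    would be expected if verification is easier than generation).
    """
    catch_rate = c_h if c_v is None else c_v
    return c_ai + (1 - c_ai) * catch_rate


# Novice-style parameters from the Bastani anchor: c_H = 0.40, c_AI = 0.70.
print(verified_ceiling(0.70, 0.40))            # current model
print(verified_ceiling(0.70, 0.40, c_v=0.60))  # illustrative c_v, not a fitted value
print(verified_ceiling(0.70, 0.0))             # deskilled limit collapses to c_ai
```

With c_V above c_H the ceiling rises, which raises the (c_⋆ − c_AI) component of K_uv and makes verification more attractive for low-c_H workers. Whether that is enough to flip a corner depends on φ.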

9. Scope limits — what the model does NOT capture

Honest disclosure of where the formalisation stops.

Aggregate-zero puzzle (E4 / O2). Humlum-Vestergaard’s zero across 25,000 Danish workers is organisational, not individual — task reorganisation, managerial absorption, coordination costs. The model is individual-level (the A6 crux of the topology); it cannot rebut or explain E4 directly. This is a sibling-artifact problem (organisational-level model), not a parameter to set. The model is locally optimal at the individual level; it is silent about whether aggregate effects emerge.
Cross-task bundling (G8). V is summed across tasks; in reality V_i and V_j covary when the tasks are productivity-linked. C5 names this; the model does not encode it.
Calibration error on c_AI (O7). c_AI is treated as known. In practice users are systematically miscalibrated (E16 — higher AI confidence → less critical thinking; E13 — sycophancy escalation). The model’s c_AI should be the user’s belief about AI capability; if belief and reality diverge, the model is locally optimal under a wrong belief. The topology’s O7 (verification cost vs. verification calibration) is the natural extension.
Sycophancy as quality-degrader (E13). The model assumes verification only helps. Randazzo HBS 26-021 documents AI flipping correct human judgments under pushback — verification can worsen outcomes when sycophancy escalates. Not encoded; would require a c_⋆ that depends on the human’s resistance to AI-pushback.
Frontier migration (O4). c_AI is static within a session. Over months the frontier moves; the user must recalibrate. The model is a snapshot — extending it to a dynamic version would couple c_AI(t) to a learning model of the user’s frontier-mapping rate.
Multi-tool attention interference (G10 — Wickens MRT). The model is per-task; it does not capture the cost of running Cursor + ChatGPT + Slack simultaneously. The topology added G10 specifically to flag this; Stage 4 / Stage 5 should test whether an M_concurrent(N_tools) extension is needed.
Verifier skill ≠ generation skill (Karpathy G9). The model uses c_H as both generation capability and verification effectiveness. Empirically, verification is often easier than generation — recognising an error costs less than producing a correct answer. A c_V ≥ c_H parameter would let low-c_H workers benefit from verification more than the current formula predicts. Held for Stage 4 / future passes; the C2 crux names this.
Partial verification (“skim”) rounds to full or none. Bilinearity makes v* ∈ {0, 1}. The empirical reality of skimming (partial-depth verification at reduced cost and reduced effectiveness) does not map onto the model — the model rounds skim up to full-verify when verification is cheap and down to no-verify when it’s expensive. A future variant with convex verification cost (v²·φ instead of v·φ) or with a separate verification-depth-vs-effectiveness curve would produce interior v* solutions. Not added in this pass to preserve bilinearity’s analytical clarity.

These eight are not bugs of the formalisation. They are the boundary of what an individual-task bilinear generator-verifier loop can carry. Beyond it lies organisational design, dynamic learning, and team-level cognition — sibling artifacts.
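The partial-verification scope limit mentions a convex-cost variant (v²·φ instead of v·φ). A minimal sketch of what that variant would produce at u = 1 follows. It is an illustration only: the function name `skim_depth` is hypothetical, u is held at 1 rather than re-optimised, and K_uv is taken as given from the bilinear model.

```python
def skim_depth(k_uv, phi, alpha=1.0):
    """Optimal v at u = 1 when verification cost is alpha * phi * v**2.

    At u = 1 the v-dependent part of V becomes k_uv * v - alpha * phi * v**2,
    a concave quadratic whose unconstrained maximiser is k_uv / (2*alpha*phi).
    The result is clipped to the feasible range [0, 1].
    """
    if phi <= 0:
        # Free verification: verify fully whenever it has any benefit.
        return 1.0 if k_uv > 0 else 0.0
    return min(1.0, max(0.0, k_uv / (2 * alpha * phi)))


# K_uv and phi taken from the worked anchors in the text.
print(skim_depth(0.205, 0.30))  # Bastani anchor
print(skim_depth(0.48, 0.40))   # Everett anchor
print(skim_depth(0.14, 0.20))   # Randazzo anchor
```

Under this assumption all three anchors land at an interior v*, so the same parameters that give corner answers in the bilinear model give "skim" answers here. That is the trade the text describes: interior solutions are gained, the clean corner result is lost.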

10. Adversarial + steelman

Three objections the formalisation has not yet engaged head-on, and the strongest version that survives each.

Objection 1: `c_AI` is unobservable, so the model is unactionable on novel tasks.

Steelman. The optimal policy depends on c_AI, but in any new task or unfamiliar domain the worker doesn’t know c_AI ahead of time. Calibrating c_AI requires running the task and verifying the output — but the model says “compute argmax V using c_AI you don’t have.” For experienced task types c_AI is calibratable from history; for genuinely novel tasks (much of knowledge work) it isn’t. The Vasconcelos / Fok & Weld verification-economics frame already captures this — engagement is rational when verification is cheap; verifying a single AI output IS your way of measuring c_AI for that task type. The model assumes the calibration question is solved when it’s actually the binding constraint.

Why partially right. For genuinely novel tasks where c_AI is unknown, the model cannot make a precise recommendation. Stage-4 fitting target Q3 (mode-distribution match against Randazzo BCG) implicitly assumes calibrated c_AI distributions across BCG-like tasks — only valid for well-studied domains.

Why the strongest version survives. The model doesn’t need precise c_AI; it needs robustness across c_AI ranges, and the corner structure is robust. For almost any c_AI < 0.4 in a high-stakes regime (σ ≥ 0.7), the corner is do-yourself; for almost any c_AI > 0.8 in low-stakes routine (σ ≤ 0.2, low λ), the corner is self-automator. The “interesting” boundary regions — where c_AI uncertainty matters — are precisely the spec-driven regions. So the model’s prescription under uncertainty is: set v = 1 to verify and learn. The spec-driven corner doubles as a Bayesian-update mechanism; the verification cost α·v·φ is the explicit price of resolving the uncertainty. Stage 4 should formalise the explore-vs-exploit dynamics this implies (Q6 in §11).

Objection 2: The model recovers practitioner intuitions and adds no new predictions.

Steelman. Mollick, Karpathy, Anthropic, and Cognition collectively say “match the workflow to the task.” The model’s predictions of corner solutions match Randazzo’s empirical mode distribution. So what is the formalism contributing beyond a formal scaffold for intuitions practitioners already had?

Why partially right. Many model predictions match practitioner consensus on headline conclusions. Self-automator for routine, do-yourself for novel-and-high-stakes, spec-driven for high-stakes-with-cheap-verify — practitioners already say these.

Why the strongest version survives. The model contributes five things the practitioner literature does not:

Quantitative trade-offs. The model says by how much the self-automator and spec-driven corners differ at default constants: at the Bastani anchor specifically, self-automator beats spec-driven by 0.095 net (α·φ − K_uv = 0.30 − 0.205). Practitioner literature is qualitative; this is parametrically calibratable, and Stage 4 will pin the constants.
Non-obvious simultaneous constraints. Higher stakes (σ ↑) push BOTH u DOWN (refuse AI for high-stakes work) AND v UP (if you do use AI, verify carefully), a simultaneous prescription the practitioner literature does not make explicit. §4’s comparative statics surface this; intuition often conflates the two.
Naive cyborg as structural failure mode. Bilinearity implies that no individual sub-task should run at an interior (u, v), so applying one interior policy uniformly is optimal for none of them. This is a sharper criticism than “be thoughtful about which mode you use”: it identifies a specific failure mode (the BCG cyborg majority running flat-(0.7, 0.3) policies) and says they’re structurally wrong, not just suboptimal.
Budget-aware shadow-price reformulation. The μ mechanism says: under attention scarcity, reroute longer-base-time tasks first (since α_eff_i = α + μ·g_i rises proportionally with g_i). Practitioner literature has nothing like this prescriptive structure for portfolio-level decisions.
Stage-4 calibration targets. The model identifies six specific empirical questions (Q1–Q6 in §11) that fit a generator-verifier dispatch framework. Practitioner heuristics are unfalsifiable by design — they update without preserving the reasoning, so users can’t tell when a heuristic stops applying. The L3 invariant is the antidote.

The model recovers practitioner intuitions on top-line conclusions and adds quantitative, edge-case, and falsifiable structure beyond. The contribution is in calibration, simultaneous constraints, structural failure modes, portfolio-level shadow pricing, and Stage-4 testability — not in inventing new top-level recommendations.
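The shadow-price mechanism named above (α_eff_i = α + μ·g_i, with μ raised until the portfolio fits the attention budget) can be sketched as a bisection. This is not the code in CognitivePartnershipModel.tsx: the names `Task`, `route`, and `solve_mu` are hypothetical, the channel forms Q, A, S, and R are reconstructed from the worked anchors, and the base times `g` are made-up illustration values.

```python
from dataclasses import dataclass

ALPHA, EPS, BETA = 1.0, 0.15, 0.05   # default constants quoted in the write-up
CORNERS = ((0, 0), (1, 0), (1, 1))   # (0, 1) is dominated by (0, 0)


@dataclass(frozen=True)
class Task:
    c_h: float
    c_ai: float
    phi: float
    sigma: float
    lam: float
    g: float  # base time of the task (hypothetical unit: hours)


def attention(t, u, v):
    # Reconstructed A(u, v): 1 at do-yourself, EPS at full delegation, plus v * phi.
    return 1 - u * (1 - EPS) + v * t.phi


def value(t, u, v, alpha_eff):
    # V = Q - alpha_eff * A + lam * S - sigma * R, with reconstructed channels.
    c_star = t.c_ai + (1 - t.c_ai) * t.c_h
    q = t.c_h + u * (t.c_ai - t.c_h) + u * v * (c_star - t.c_ai)
    s = (1 - u) - BETA * u * (1 - v)
    r = u * (1 - v) * (1 - t.c_ai)
    return q - alpha_eff * attention(t, u, v) + t.lam * s - t.sigma * r


def route(tasks, mu):
    """Per-task corner optimum at alpha_eff = ALPHA + mu * g."""
    return [max(CORNERS, key=lambda c: value(t, c[0], c[1], ALPHA + mu * t.g))
            for t in tasks]


def total_attention(tasks, corners):
    return sum(attention(t, u, v) * t.g for t, (u, v) in zip(tasks, corners))


def solve_mu(tasks, budget, mu_hi=64.0, iters=60):
    """Smallest shadow price (up to bisection tolerance) that fits the budget."""
    if total_attention(tasks, route(tasks, 0.0)) <= budget:
        return 0.0  # budget is slack, no rerouting needed
    if total_attention(tasks, route(tasks, mu_hi)) > budget:
        raise ValueError("budget not reachable within the mu range searched")
    mu_lo = 0.0
    for _ in range(iters):
        mid = (mu_lo + mu_hi) / 2
        if total_attention(tasks, route(tasks, mid)) <= budget:
            mu_hi = mid
        else:
            mu_lo = mid
    return mu_hi  # always a feasible value


# Everett-like and Randazzo-like parameters; the g values are illustrative.
tasks = [Task(0.65, 0.70, 0.40, 0.90, 0.30, g=2.0),
         Task(0.70, 0.85, 0.20, 0.20, 0.10, g=1.0)]
print(route(tasks, 0.0), total_attention(tasks, route(tasks, 0.0)))
mu = solve_mu(tasks, budget=0.5)
print(mu, route(tasks, mu))
```

In this two-task example the longer task is the one that gives up verification when the budget tightens, which is the behaviour the shadow-price bullet describes. The bisection relies on total attention being non-increasing in μ, which holds here because a higher attention price can only push each task toward a lower-attention corner.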

Objection 3: The model is single-shot; the topic question is dynamic.

Steelman. “Optimal configuration for an individual knowledge worker” is implicitly a static answer to a fundamentally dynamic question — AI capability shifts month-over-month, the worker’s skill atrophies under sustained delegation, calibration on c_AI drifts as new model versions ship. A static dispatcher with parametric flexibility doesn’t tell the worker how to anticipate and prepare for capability change.

Survives because (a) Parasuraman, Sheridan & Wickens’ (2000) function-allocation framework has been useful for 25+ years despite being static — parametric statics IS what’s wanted from a generating function (the L3 invariant); (b) the trajectory questions properly belong to sibling topics in the LLM Iterate roster — navigating-ai-world for AI-induced trajectory of skill/meaning/relational channels, the planned prediction-calibration topic for c_AI calibration drift, the planned bedrock-generating-functions for the temporal-aggregation patterns this and other models share. Static-but-parameterised is the right scope here; dynamic extensions are cross-topic by design, not gaps in the present formalisation.

11. Stage-4 fitting targets

Six named questions Stage 4 should test against data.

Q1 — (α, ε) from CUPS telemetry. Mozannar CUPS gives time fractions across coding interaction states. Calibrate ε from observed monitoring-time-at-high-u; calibrate α from observed cyborg-vs-centaur time efficiency.

Q2 — β from Bastani longitudinal regime. Bastani’s 17% drop is one window. Longer-window data (Lee-Sarkar CHI 2025; Anti-Social Century) should pin per-task atrophy rate. The model predicts: β should be ~10× larger for skills the worker uses heavily than for tasks they delegate occasionally.

Q3 — Mode-distribution match against Randazzo BCG. Given a θ-distribution prior over BCG-consultant tasks, does the optimal-routing distribution over (u*, v*) match the observed (~60% cyborg / ~30% centaur / ~10% self-automator)? If not, which constant is mis-fit?

Q4 — Outside-frontier harm prediction (Dell’Acqua replication). For subjects forced to use AI on c_H > c_AI tasks, the model predicts a quality drop proportional to u·(c_H − c_AI). Stage 4 should test the linearity and the slope.

Q5 — Workflow-architecture-vs-capability bound. The headline S1 prediction. Construct two simulated workforces: (a) frontier AI + naive flat-cyborg routing, (b) mid-tier AI + optimal routing on the same task mix. The model predicts (b) outperforms (a) on net quality once c_AI_naive, the capability of the naively routed frontier model in (a), falls below some threshold. Stage 4 should locate the threshold and test against Everett 2025 / Dell’Acqua at that precision.

Q6 — Calibration uncertainty and explore-vs-exploit on c_AI. The model treats c_AI as a known input. In practice, workers learn c_AI by running the task and verifying the output (Vasconcelos verification-economics). For novel tasks, the spec-driven corner doubles as a calibration mechanism — verification reveals c_AI for next time. Stage 4 should formalise the explore-vs-exploit structure: what is the optimal exploration premium when c_AI variance is high? When does verifying-to-learn dominate verifying-to-quality-control? The §9 scope-limit on c_AI miscalibration becomes a structural extension via this question, not just an acknowledged gap. Engages the §10 Objection 1 directly.
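Q4's linearity claim is simple enough to write down directly. The sketch below assumes no verification (v = 0), so that Q(u, 0) = c_H + u·(c_AI − c_H); the function names are hypothetical, and the data passed to the slope estimator is generated from the model itself, not taken from any study.

```python
def predicted_drop(u, c_h, c_ai):
    """Quality lost relative to do-yourself when delegating fraction u at v = 0."""
    return u * (c_h - c_ai)


def slope_through_origin(us, drops):
    """Least-squares slope of drop on u with no intercept.

    Under the model this slope should equal c_h - c_ai.
    """
    return sum(u * d for u, d in zip(us, drops)) / sum(u * u for u in us)


def implied_mean_u(observed_drop, c_h, c_ai):
    """Average delegation level that would produce the observed drop, if linear."""
    return observed_drop / (c_h - c_ai)


# Dell'Acqua anchor parameters from the text: c_H = 0.70, c_AI = 0.40.
us = [0.0, 0.25, 0.5, 0.75, 1.0]
drops = [predicted_drop(u, 0.70, 0.40) for u in us]  # model-generated, not empirical
print(slope_through_origin(us, drops))
print(implied_mean_u(0.19, 0.70, 0.40))
```

Read this way, the empirical 19 pp drop would correspond to an average delegation level of roughly 0.63 at the anchor parameters. That figure is only an arithmetic consequence of the linear assumption, not an estimate from the study's data; testing whether real subject-level drops actually line up along u is the Q4 exercise.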

Data starting points for each Q

The natural starting point for each fitting target is the lit-review paper(s) that motivated the corresponding parameter. Honest notes on likely data availability:

Q1 (α, ε) — Mozannar CUPS 2024 (E12) telemetry across coding interaction states. Process-level data is likely Microsoft Research-internal; replication via Cursor / Claude Code anonymized usage logs is the alternative path.
Q2 (β) — Bastani PNAS 2025 (E9) is the primary anchor; PNAS supplementary materials likely include the longitudinal panel needed to estimate per-task atrophy. Lee-Sarkar CHI 2025 (E16) provides multi-task panel context for a complementary fit.
Q3 (mode distribution) — Randazzo HBS WP 26-036 (E10) for the 60/30/10 BCG distribution. HBS supplementary materials may include individual-level mode tags; failing that, an in-house replication on a smaller knowledge-worker sample is tractable.
Q4 (outside-frontier slope) — Dell’Acqua BCG study (E3); HBS data release may include individual-level outcome grades plus inside/outside frontier classification per task. The linearity test of u·(c_H − c_AI) is a clean within-subjects design.
Q5 (workflow > capability) — Everett 2025 (E8) demonstrates the workflow-restoration mechanism but does not directly test the bound. The cleanest test is a new RCT that pits a stronger model under naive routing against a weaker model under optimal routing on a matched task mix (e.g., GPT-4 + naive flat-cyborg vs. GPT-3.5 + optimal routing); existing data is suggestive, not decisive.
Q6 (explore-exploit on c_AI) — Likely requires new experimental work or simulation. Closest analogs are the contextual-bandit and multi-armed-bandit-with-costly-verification literatures, but the knowledge-workflow application is novel; this is a Stage-4 originated study, not a re-analysis of existing data.

Stage 4’s first move is to scope data availability for Q1–Q5; Q6 may require originating new data or a simulation harness. Following the human-psych-variation pattern, the Stage-4 build should live at stage_outputs/technology-utilization-architecture/data/ with a curated CSV per fitting target, a runnable Python pipeline that reproduces every chart on the published data.mdx, and a data/out/ folder for derived outputs.

12. Connections to other topics

Where this model attaches to sibling AI’s-Research topics.

Human-psych-variation. λ (skill-formation value) and β (atrophy rate) are individual differences. Need-for-Cognition (Buçinca 2021, E6) moderates v — high-NfC users verify more. Cognitive-style covariation belongs in that topic, not here.
Navigating-AI-world. This model’s β is the per-task version of nav-AI’s ΔM_comp (competence erosion). The portfolio-level S aggregation is the within-work-domain version of nav-AI’s ΔV/ΔM trade-off. The two models share the substitution-myth (L1) and verification-economics (G3) invariants; they differ in what they’re optimising — nav-AI optimises a life-scale meaning budget, this model optimises a workday quality-per-attention budget.
Trust architecture (planned). Sycophancy (E13) and human-side calibration on AI capability (O7) are trust-regime questions — what feedback signals make c_AI knowable.
Prediction & calibration (planned). The topology’s O7 connection. Calibration on c_AI is the calibration sub-problem this model treats as exogenous.
Information fidelity (planned). φ (verification cost) depends on output-format quality and grounding — verifying a structured output with citations is cheap; verifying free prose is expensive. The information-fidelity topic should formalise what makes φ low or high.
Bedrock generating functions (planned). The four-channel decomposition V = Q − α·A + λ·S − σ·R is a candidate generating-function pattern: every decision under attention scarcity has the same four channels. The bedrock topic should test whether this generalises beyond AI workflow.

Glossary

autonomy level (u) — fraction of a task’s generation delegated to AI. Karpathy slider P2 made parametric.
verification depth (v) — fraction of AI output independently checked by the human.
c_H, c_AI — human and AI capabilities on a task type; probability of correct/high-quality output.
c_⋆ — verified-output ceiling; c_⋆ = c_AI + (1 − c_AI)·c_H = 1 − (1 − c_H)·(1 − c_AI), the complementary product under independent human and AI errors.
φ — verification-cost ratio: time-to-verify divided by time-to-generate-from-scratch.
σ — stakes; weight on uncaught-error penalty.
λ — skill-formation value of the task to the worker.
α — attention price (utility weight on time). Normalised to 1.
ε — residual attention at full delegation. The L1 substitution-myth invariant.
β — per-task skill-atrophy rate under unverified delegation.
M — per-task metacognitive routing tax. The G1 metacognitive-bottleneck invariant.
centaur — Mollick: clean human/AI handoff with verification gate. (u, v) mid + high.
cyborg — Mollick: interleaved sub-task delegation with partial verification. (u, v) mid + mid/low.
self-automator — Randazzo: full delegation, no verification. (u, v) high + low. The atrophy-trap regime.
spec-driven / independent-then-synthesize — Everett 2025; Compound Engineering: full delegation with full verification. (u, v) high + high.
do-yourself — no AI involvement. u ≈ 0.
L1 (substitution myth) — every offload creates new monitoring/verification work. Encoded as ε > 0.
L2 (joint surface) — optimal allocation requires modelling the joint performance, not comparing solo capabilities. Encoded as the portfolio-level argmax.
L3 (parameterise by capability) — the formalism is a generating function over θ, not a lookup table. The whole structure of the model.
G3 (verification trade-off) — engagement is rational only when verification is cheap relative to expected payoff. The −α·v·φ term.
G7 (skill atrophy) — capacities not exercised decay. The −β·u·(1-v) term.
G9 (generator-verifier asymmetry) — production cost falls toward zero with AI; verification cost stays roughly constant. The asymmetry between u·ε (generation residual) and v·φ (verification full).


Iteration history

Pass 1 2026-04-29

decomposition, generating function, integration

Why First draft of the formalisation. Pulled the per-task value function out of the topology handoff (Option B: generator-verifier loop with autonomy slider) and wrote it as a single coherent decomposition + generating function. Built the interactive two-tab dashboard so the reader can dial parameters across both per-task routing and day-portfolio aggregation.
- Wrote per-task value function: V(u, v; θ) = Q(u,v) − α·A(u,v) + λ·S(u,v) − σ·R(u,v)
- Decomposed V into four orthogonal channels (quality / attention / risk / skill), each responding to (u, v) in a distinguishable way
- Encoded the L1 substitution-myth invariant as ε > 0 (residual attention at full delegation)
- Encoded the G3 verification-economics term as −α·v·φ; G7 skill atrophy as −β·u·(1-v); G2 ironies-of-automation as σ·u·(1-v)·(1-c_AI)
- Mapped (u*, v*) regions to five workflow modes (do-yourself, centaur, cyborg, self-automator, spec-driven) with default thresholds
- Portfolio aggregation under daily attention budget; four strategies compared (always-self / max-AI / naive cyborg / optimal routing) — generates the S1 headline as a comparable-strategies experiment
- Calibrated constants (α=1, ε=0.15, β=0.05, M=0.08) against lit-review anchors (Mozannar CUPS, Bastani 17%, Tankelevitch metacognitive load)
- Six worked anchors against the empirical record (Brynjolfsson novice/expert split, Dell'Acqua outside-frontier, Bastani guardrail effect, Everett independent-then-synthesize, Randazzo self-automator, Schoenegger overconfident-AI as gap)
- Five model cruxes named (C1 two-axis decision space, C2 verified ceiling assumption, C3 ε operationalisation of L1, C4 task-uniform β, C5 portfolio independence)
- Five Stage-4 fitting targets named (Q1 α/ε from CUPS, Q2 β from Bastani longitudinal, Q3 mode-distribution match against Randazzo BCG, Q4 outside-frontier prediction, Q5 workflow-architecture-vs-capability bound)
- Six scope-limits explicitly named (aggregate-zero E4/O2, cross-task bundling G8, c_AI miscalibration O7, sycophancy E13 quality-degrader, frontier migration O4, multi-tool MRT G10)
- Sibling-topic connections wired (human-psych-variation, navigating-ai-world, planned trust-architecture / prediction-calibration / information-fidelity / bedrock-generating-functions)
- Interactive component CognitivePartnershipModel.tsx (two tabs: per-task router with seven presets + day portfolio with four-strategy comparison)
Pass 2 2026-04-30

error check, internal consistency check, truth/accuracy override on bias

Why A fresh-eyes audit of pass 1 surfaced three real problems. (a) c_⋆ = max(c_H, c_AI) over-credits verification when c_H > c_AI (treats the human as catching every AI error) and under-credits it when c_H < c_AI (treats a partly-skilled human as catching no AI errors); the empirically defensible form is the complementary product c_⋆ = c_AI + (1−c_AI)·c_H = 1 − (1−c_H)·(1−c_AI), which falls out from independent error events and naturally handles both the deskilled-verifier limit and the verifier-stronger limit. (b) The bilinearity claim was sloppy — V is exactly bilinear in (u, v), so the maximum on the unit square is at a corner. The interior critical point is a saddle. The pass-1 prose said "near a corner or on an edge or at a unique interior critical point" which is wrong. (c) The comparative-statics table had two wrong rows: under (c_AI − c_H) ↑, c_⋆ − c_AI = (1−c_AI)·c_H weakly decreases in c_AI, and σ·(1−c_AI) decreases too, so K_uv falls and v* weakly decreases (pass 1 said "slightly ↑"); under φ ↑, u* responds via the v* corner-flip and is weakly decreasing (pass 1 said "ambiguous"). The bilinearity finding has a substantive consequence: per-task optima land at one of three corners — (0, 0) do-yourself, (1, 0) self-automator, (1, 1) spec-driven — and centaur/cyborg modes appear only as aggregate-level patterns from mixing corners across sub-tasks with heterogeneous θ. The pass-1 mode-region table implied otherwise.
- Replaced c_⋆ = max(c_H, c_AI) with c_⋆ = c_AI + (1−c_AI)·c_H = 1 − (1−c_H)·(1−c_AI) in §3.1, the React component, and the dashboard diagnostic readout. The deskilled-verifier scope-limit caveat in §3.1 is now subsumed by the formula itself (c_H → 0 ⇒ c_⋆ → c_AI without any extra parameter)
- Sharpened §3.5 and §4: V is exactly bilinear in (u, v) with V = K_0 + K_u·u + K_v·v + K_uv·u·v; the maximum on the unit square is always at a corner, never interior
- Reinterpreted §4 mode classification: per-task optimum lands at one of three corners — (0, 0), (1, 0), (1, 1). Centaur and cyborg modes are aggregate-level patterns where a worker mixes corner policies across sub-tasks with heterogeneous θ. The five mode-regions remain a useful spatial decomposition of the (u, v) plane (for labelling observed worker behaviour at any (u, v)), but the per-task router will only return three corners. Updated the "naive cyborg" framing in §5 accordingly: applying interior (u, v) values to every task is exactly what the model says is wrong — optimal routing differentiates across tasks (different corners for different θ), naive uses the same interior for all
- Fixed §4 comparative-statics table: (c_AI − c_H) ↑ now correctly shows v* weakly ↓ (was "↑ slightly"); φ ↑ now shows u* weakly ↓ via v*-flip (was "ambiguous"); c_AI ↑ alone shows v* ↓ (matches "AI more reliable → less verification benefit")
- Recomputed all six worked anchors in §7 under corrected math. Novice (Brynjolfsson) now correctly lands at (1, 0) self-automator with quality lift c_AI − c_H = +0.30; expert at (1, 1) spec-driven with smaller lift +0.06 = +7%. Dell'Acqua outside-frontier at (0, 0) do-yourself with 30 pp loss if mis-routed. Everett high-stakes at (1, 1) spec-driven (matches). Randazzo low-stakes routine at (1, 0) self-automator (matches)
- Bastani guardrail anchor (§7C) honesty fix: at default β = 0.05 and σ ≤ 0.40, the skill-preservation push toward v=1 is too small to flip the corner — the guardrail effect is structural-but-weak in the model. The model says "self-automator dominates spec-driven by ~0.07 net at default constants; raise β or λ to flip." This is now explicitly named as Stage-4 fitting target Q2 (was already there) plus a new diagnostic note that β ≈ 0.20+ would give skill-preservation real corner-flipping leverage
- React component preset notes updated to match what the model actually predicts: novice_centaur preset renamed to novice_spec with adjusted params (φ=0.20, σ=0.50, λ=0.50) producing the (1, 1) spec-driven optimum that "verify to learn" implies; lit_synthesis φ=0.40 → 0.30 to land cleanly at spec-driven; routine_email note acknowledges that the model's v=1 here is the empirical "skim" with the bilinear model rounding partial-verify up to full-verify
- Added a small dashboard note explaining the bilinearity → corner-only finding inline in the diagnostic panel, so users do not conclude the dashboard is broken when v* always returns 0 or 1
- Added scope-limit item in §9: verification effectiveness is modelled as equal to generation skill (c_H drives both c_⋆ and the verification benefit). Karpathy's G9 generator-verifier-asymmetry suggests verification is often easier than generation; a future c_V (verifier skill) parameter separate from c_H would make low-c_H verifiers more effective. Held for Stage 4 / future passes
Pass 3 2026-04-30

internal consistency check, gap scan

Why A pass-2 cold read surfaced two distinct issues. (a) §5 claimed that under attention scarcity the Lagrangian shadow price μ adjusts the per-task choice rule (argmax V_i − μ·A_i·g_i), biasing optimal routing toward higher u when budget binds. The dashboard component did not actually implement this — the "optimal routing" strategy computed per-task argmax V independent of the budget; when budget binds, the dashboard just flagged "over budget" rather than rerouting. The dashboard claimed to demonstrate something it did not. Real prose-vs-code gap. (b) §7C Bastani anchor arithmetic: I wrote "self-automator dominates spec-driven by ~0.07 net at default constants" but V(1,0) − V(1,1) = α·φ − K_uv = 0.30 − (0.12 + 0.025 + 0.06) = 0.30 − 0.205 = 0.095, not 0.07. (c) Bonus framing issue: §4 sat awkwardly with a 5-region mode-classification table next to the corner-only finding — readers could plausibly misread the table as a prediction rather than a labelling tool. Wanted cleaner separation.
- Implemented budget-aware optimal routing in CognitivePartnershipModel.tsx. Added optimizeAtAlpha (per-task corner optimum at a given α_eff) and optimizeBudgetMu (binary search on the shadow price μ until total absolute attention sum_i A_i·g_i·count_i fits the budget). Per-task α_eff_i = α + μ·g_i — longer-base-time tasks feel budget pressure proportionally more, which is the right Lagrangian dimensional analysis. evaluatePortfolio now takes the budget and runs the shadow-price binary search for the optimal strategy; other strategies (always-self / max-AI / naive) keep their fixed (u, v) policy as comparators
- Dashboard now surfaces μ in a footer line below the strategy table: when μ = 0 the budget is slack and per-task choice is unconstrained argmax V; when μ > 0 the budget binds and α_eff = α + μ·g routes longer tasks toward self-automator (the lowest-A corner). Tightening the budget slider triggers visible rerouting, demonstrating the §5 claim
- Updated §5 prose to describe what the dashboard actually does. Removed the "naive cyborg fits in budget; optimal should also fit" framing (which was true under unconstrained per-task optimization but moot now that optimal is budget-aware). Added an explicit description of how μ rises as budget tightens, and of which task types reroute first (high-g tasks with high-σ low-stakes tradeoffs are the first to flip from spec-driven to self-automator)
- Fixed §7C arithmetic: "self-automator dominates spec-driven by ~0.07" → "by 0.095 (α·φ − K_uv = 0.30 − 0.205)" with the explicit subtraction shown. Same direction as before but the number was wrong
- Restructured §4. Now leads with the three-corner decision tree (which corner wins under which sign conditions) and treats the five-region table as a labelling tool for observed worker behaviour, explicitly demoted from "prediction" to "vocabulary-mapping." Removed the redundant comparative-statics-row repetition by tightening the table commentary
- Tightened §3.5 K_v parenthetical from "always ≤ 0 — verifying nothing has no benefit; verifying anything has cost φ" to "K_v captures the cost of verification when there is no AI to verify (u = 0); pure cost, hence ≤ 0 — this is exactly why corner (0, 1) is dominated by (0, 0)". The new wording connects K_v to the (0, 1)-dominance result that follows in the same section
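A minimal Python sketch of the shadow-price search described in the bullets above. The dashboard code itself is TypeScript and is not reproduced here. The `Task` container, the corner payoff numbers in `CORNERS`, and the `mu_hi` search ceiling are illustrative assumptions; only the rule α_eff = α + μ·g and the bisection on μ until total attention fits the budget come from the text.

```python
from dataclasses import dataclass

# Corner label -> (benefit excluding attention cost, attention share A).
# All numbers are hypothetical, not the dashboard defaults.
CORNERS = {
    (0, 0): (0.50, 1.00),  # do-yourself
    (1, 1): (0.90, 0.50),  # spec-driven
    (1, 0): (0.70, 0.15),  # self-automator (lowest-attention corner)
}

@dataclass
class Task:
    g: float        # base time of one task instance
    count: int = 1  # instances of this task in the portfolio

def best_corner(task, alpha, mu):
    """Per-task argmax of benefit - alpha_eff * A, with alpha_eff = alpha + mu * g."""
    alpha_eff = alpha + mu * task.g
    return max(CORNERS, key=lambda c: CORNERS[c][0] - alpha_eff * CORNERS[c][1])

def total_attention(tasks, alpha, mu):
    """Absolute attention sum_i A_i * g_i * count_i under the mu-adjusted routing."""
    return sum(CORNERS[best_corner(t, alpha, mu)][1] * t.g * t.count for t in tasks)

def optimize_budget_mu(tasks, alpha, budget, mu_hi=100.0, iters=60):
    """Bisect on mu until total attention fits the budget; mu = 0 if it is slack."""
    if total_attention(tasks, alpha, 0.0) <= budget:
        return 0.0
    if total_attention(tasks, alpha, mu_hi) > budget:
        raise ValueError("budget is infeasible even at the lowest-attention corners")
    lo, hi = 0.0, mu_hi  # invariant: hi is feasible, lo is not
    for _ in range(iters):
        mid = (lo + hi) / 2
        if total_attention(tasks, alpha, mid) <= budget:
            hi = mid
        else:
            lo = mid
    return hi

long_task, short_task = Task(g=4.0), Task(g=1.0)
mu = optimize_budget_mu([long_task, short_task], alpha=0.3, budget=1.5)
```

With these made-up numbers the budget binds, μ settles just above the point where the long task flips to the self-automator corner, and the short task stays spec-driven. This is the "longer tasks reroute first" behaviour the bullets describe.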
Pass 4 2026-04-30

adversarial + steelman · cross-context verification

Why Two structural omissions surfaced on a fresh read of pass 3. (a) The topology had a §8 Objections section engaging the strongest critiques head-on; the model has no equivalent, and three real objections deserve direct engagement: c_AI is unobservable in practice (so the model is unactionable on novel tasks), the model recovers practitioner intuitions and adds nothing new, and the model is single-shot when the topic asks about an evolving capability landscape. Without addressing these, the model reads as undefended on its weakest flanks. (b) §7 worked anchors silently treat six papers measuring different outcome variables (Brynjolfsson is throughput × quality, Dell'Acqua/Everett/Schoenegger are clean Q, Bastani is unassisted-retest performance — a downstream of S over many task instances, Randazzo is behavioural-mode distribution, Mozannar is process telemetry) as if they all calibrated the same model parameter. That is methodologically loose; Stage 4 needs to disaggregate which constants get calibrated against which outcome types.
- Added §10 "Adversarial + steelman" with three objections engaged head-on. (1) c_AI is unobservable on novel tasks: steelman that calibration is the binding constraint Vasconcelos already named; survives because the corner structure is robust across c_AI ranges (deeply self-automator or deeply do-yourself for almost any c_AI in the right (σ, λ, φ) regime), and the spec-driven corner doubles as a Bayesian-update mechanism (verify to learn c_AI for next time) — the verification cost α·v·φ is the explicit price of resolving the uncertainty. (2) Model recovers practitioner intuitions: steelman that Mollick/Karpathy/Anthropic already say the headline conclusions; survives because the model contributes five things the practitioner literature does not — quantitative trade-offs (the 0.095 Bastani gap, etc.), simultaneous constraint (σ ↑ pushes both u down AND v up, not made explicit by practitioners), naive cyborg as structural failure mode (sharper than "be thoughtful"), budget-aware shadow-price reformulation (reroute long tasks first), and Stage-4 calibration targets (practitioner heuristics are unfalsifiable by design). (3) Single-shot vs strategic: brief steelman that the topic asks for a static answer to a dynamic question; survives because Parasuraman 2000 has been useful for 25+ years despite being static, and trajectory questions belong to sibling topics (navigating-ai-world for AI-induced trajectory; planned prediction-calibration for c_AI calibration drift) rather than this one.
- Added cross-context outcome-heterogeneity note at end of §7. Six anchors measure different outcome variables: Q (Dell'Acqua/Everett/Schoenegger) cleanly map to the model's Q; Brynjolfsson maps approximately to Q × call-rate where call-rate depends on attention saved A; Bastani maps to S over many task instances; Randazzo maps to a categorical prediction over corners; Mozannar is process telemetry. Stage 4 should not pool these as a single calibration target — different constants fit against different outcome types.
- Added Stage-4 fitting target Q6 — calibration uncertainty and explore-vs-exploit on c_AI. The model treats c_AI as a known input but in practice workers learn c_AI by running the task; for novel tasks, the spec-driven corner doubles as a calibration mechanism. Stage 4 should formalise what optimal exploration premium looks like when c_AI variance is high. The §9 scope-limit on c_AI miscalibration becomes a structural extension here, not just an acknowledged gap.
- Renumbered: §10 Stage-4 fitting targets → §11; §11 Connections to other topics → §12. Added §10 Adversarial + steelman.
Pass 5 2026-04-30

next moves · internal consistency check

Why Reading the document holistically as input to Stage 4: §11 names six fitting targets (what to test) but does not say where the data lives. Stage 4 will need that pointer first to scope tractability. Adding it now closes the input/output boundary cleanly and saves Stage 4 from re-doing the lit-search to find data. Plus a quick consistency scan after pass-4 numbering changes to catch anything stale.
- Added "Data starting points for each Q" subsection at the end of §11. Each Q1-Q6 gets a one-line pointer to the lit-review paper(s) that motivated the corresponding parameter, with honest notes on likely data availability and the cases where new experimental work is the natural path (Q5 cleanly tested only by a new RCT; Q6 likely needs new design as the closest analogs are bandit-with-verification literature in different domains)
- Added a closing handoff sentence pointing to the project-internal Stage-4 build convention (data CSVs in stage_outputs/<topic>/data/ with a runnable Python pipeline, mirroring human-psych-variation pass-1 of the data stage) so a Stage-4 starter has a concrete shape to follow
- Internal consistency scan: §11 intro line "Six named questions" matches Q1-Q6 ✓; §10 Objection 2 "five things the practitioner literature does not" matches the bullet list ✓; description field "five cruxes, six Stage-4 fitting targets, three engaged objections" matches the artifacts ✓; TLDR para 4 "produces seven things" matches the (1)-(7) list ✓. No stale references found
Pass 6 2026-04-30

fresh-eyes audit · internal consistency check

Why Reading the document cold top-to-bottom surfaced five real issues that snuck in during the iterative passes — exactly the kind of thing fresh-eyes audit is for. (a) §6 calibration anchors final paragraph claimed "the defaults reproduce the empirical mode distribution (Randazzo: ~60% cyborg, ~30% centaur, ~10% self-automator)" — but §4 (after pass 2/3) established that centaur and cyborg do NOT arise as per-task optima; the model can only reproduce that distribution by aggregating heterogeneous-θ tasks. Stale wording from pass 1 that silently re-asserted what later passes disclaimed. (b) §6 mentioned `u_lo` — but u_lo is a React-component labelling threshold, not a model parameter; it should not appear in formalisation prose. (c) §9 closing paragraph said "These seven are not bugs" but I count eight bullets — pass 2 added two new scope-limits and the closing count did not update. (d) §4 prose said "four linear channels with one cross-term u·v" — but Q, R, and S each carry their own u·v bilinear term; only A is strictly linear in both u and v. The consolidated V has one cross-term, but calling the channels "linear" is imprecise. (e) §7C numerical claims about how to flip the Bastani corner ("β ≈ 0.20+ or higher λ (~0.80+) or lower φ (~0.20)") were loose; the actual condition is λ·β > 0.12, which means β must exceed 0.24 at default λ = 0.50, λ must exceed 0.12·(1/β) at given β (so λ alone cannot flip at β = 0.05 — would need λ > 2.4, impossible), and φ must fall below 0.205 at default constants. None of these issues changes a model conclusion, but they would all surface under academic review.
- §6 final paragraph rewritten to be consistent with the bilinearity finding: "The defaults predict (1, 0) self-automator at the routine-low-stakes corner (Randazzo's ~10% empirically), (1, 1) spec-driven where verification benefit dominates verification cost, and (0, 0) do-yourself in outside-frontier regimes. The full Randazzo 60/30/10 distribution emerges only at the aggregate level when a worker mixes corner policies across heterogeneous-θ sub-tasks." Outside-frontier harm reframed: "predict outside-frontier harm (Dell'Acqua E3) when c_H > c_AI and the worker mis-routes to u > 0 (the model's prescription is u* = 0 there)" — making explicit that the harm is from disobeying the optimum, not from a model output.
- §6 stripped the `u_lo` reference. The constants section is for model parameters; mode-classification thresholds are a labelling artifact of the React component and belong there, not in the prose
- §9 closing count fixed: "These seven are not bugs" → "These eight are not bugs" matching the actual eight-item list (aggregate-zero, cross-task bundling, c_AI miscalibration, sycophancy, frontier migration, multi-tool MRT, verifier-skill ≠ generation-skill, partial-verification skim)
- §4 prose tightened: "Bilinearity is faithful to the structure of the problem (four linear channels with one cross-term u·v)" → "Bilinearity is faithful to the structure of the problem: V collects to one bilinear form V = K_0 + K_u·u + K_v·v + K_uv·u·v, even though three of the four channels (Q, R, S) carry their own u·v terms — they sum to one consolidated K_uv". The new phrasing is technically accurate (channels are themselves bilinear; the cross-terms sum to one) and preserves the structural-honesty intent of the original
- §7C numerical claims tightened with explicit algebra: "you'd need β ≈ 0.20+ or higher λ (~0.80+) or lower φ (~0.20)" → "the corner flips when λ·β > α·φ − [(1−c_AI)·c_H + σ·(1−c_AI)] = 0.30 − 0.18 = 0.12. At default λ = 0.50, β must exceed 0.24; at β = 0.05, λ alone cannot flip the corner (would need λ > 2.4, impossible since λ ∈ [0, 1]); at the default β = 0.05, λ = 0.50, the corner flips when φ falls below 0.205 — barely cheaper verification than the default 0.30."
- Internal consistency re-scan after edits: §4 comparative-statics ↑/↓ semantics for binary u*, v* (under bilinearity) is fine in context — the arrows mean "more likely to be 1 vs 0," which is the natural reading. No further changes.
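The §7C thresholds quoted in this pass can be re-derived in a few lines. This is a sketch using only the constants quoted in the log. α = 1.0 is an inference from α·φ = 0.30 at φ = 0.30, and the 0.18 term is taken as a lump sum because the log does not give c_H, c_AI, and σ separately here.

```python
# Default constants as quoted in the log; alpha = 1.0 is inferred, not stated.
alpha, phi, lam, beta = 1.0, 0.30, 0.50, 0.05
other = 0.18  # (1 - c_AI)*c_H + sigma*(1 - c_AI), quoted only as a sum

k_uv = other + lam * beta        # consolidated cross-term at defaults
gap = alpha * phi - k_uv         # V(1,0) - V(1,1): self-automator's margin
threshold = alpha * phi - other  # the corner flips when lam*beta exceeds this
beta_flip = threshold / lam      # beta needed at default lam
lam_flip = threshold / beta      # lam needed at default beta (> 1 is unreachable)
phi_flip = k_uv / alpha          # phi below which spec-driven wins at defaults

print(round(gap, 3), round(beta_flip, 2), round(lam_flip, 1), round(phi_flip, 3))
```

Because λ is bounded to [0, 1], a required `lam_flip` above 1 means λ alone cannot flip the corner at the default β.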

pass 9

Empirical pipeline confronting the model's six fitting targets (Q1–Q6) with currently-published RCT and field-experiment numbers from ~22 studies. Headline findings: cyborg-coding φ ≈ 1.6 (about 5× the model default 0.30); β from Bastani is unit-dependent, [0.003, 0.011] under a per-problem reading (default 0.05 outside the bracket) or ≈ 0.043 under a per-session reading (default 0.05 in the right neighborhood); bilinearity-implies-corner-mixing as a structural prediction; outside-frontier sanity check passes; Vaccaro 2024 meta (106 studies, 370 effects in Nature Human Behaviour) is the load-bearing evidence for the workflow > capability claim at the topic's scope. Curated CSVs (downloadable) + Python pipeline + interactive findings panel. Refinement history in frontmatter log.

TLDR

The model formalisation in stage 3 produced six named fitting targets — Q1 through Q6 — that translate the value function V(u, v; θ) = Q − α·A + λ·S − σ·R into questions the empirical record can answer. This stage confronts each target with currently-published RCT and field-experiment numbers from ~22 studies (2023–2026).

Headline findings. Verification cost is much higher than the model assumed in coding regimes (φ ≈ 1.6 from Mozannar’s CUPS data, vs the model’s lit-review-anchored default of 0.30). Skill atrophy from unverified AI delegation is real, but its magnitude calibration depends on what the model means by “task” — Bastani’s −17pp unassisted-test drop gives β ∈ [0.003, 0.011] under per-problem interpretation (default 0.05 is outside this bracket by 5–15×) or β ≈ 0.043 under per-session interpretation (default 0.05 is ~1.2× too high but inside the right neighborhood). Pass-7 corrected a 10× transcription error (passes 3–6 reported [0.028, 0.113] which was wrong — pipeline always computed [0.003, 0.011] under per-problem reading). The bilinearity result of stage 3 (per-task optima are corners, never interior) is consistent with Randazzo’s behavioural-mode distribution but not directly testable from published aggregates. Outside-frontier mis-routing produces quality drops on the order the model predicts (Dell’Acqua −19pp, METR −19%, Otis low-baseline −8%). For the headline S1 claim — workflow architecture > model capability — the load-bearing evidence is Vaccaro et al. 2024 (Nature Human Behaviour, 106 studies / 370 effect sizes), whose decision-vs-creation asymmetry is consistent with the model’s qualitative prediction (workflow choice matters more for high-σ decision tasks). Vaccaro is a moderator analysis at population scale, not a clean fit; within-study analogs (Bastani, Anthropic) and across-study comparisons (Goh→Everett +7.9pp) corroborate with scope and unit mismatches disclosed.

Verdict tally. One strong qualitative finding (Q1: Mozannar’s published 51.5% Copilot-specific share confirms the L1 substitution-myth invariant; cyborg-regime φ much higher than default). One supported in direction and shape with bracketed magnitude (Q2: Bastani β). Three structural/convergent/consistency claims (Q3 corner-mixing predicted by bilinearity; Q4 outside-frontier sanity check; Q5 workflow > capability via Vaccaro meta). One framed-not-resolved by design (Q6 calibration / explore-exploit on c_AI; the model treats c_AI as known, but a Monte-Carlo on uncertain c_AI shows the spec-driven corner’s outcome SD is ~64% lower than the self-automator’s, the structural backbone for a future extension).

Pipeline architecture. Eight curated CSVs, one runnable Python script (pandas + numpy, ~280 lines), and a chart-ready findings.json consumed by the React panel below. Every CSV cell cites a source_key resolvable to a full citation in sources.csv. Inputs are downloadable at /data/technology-utilization-architecture/. To reproduce: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.

What the pipeline does not do. It does not produce new RCT data, analyse raw telemetry, test the aggregate-zero puzzle (E4/O2 — Humlum-Vestergaard’s zero is organisational, this model is individual-level by the C5 crux and §4 scope-limit), resolve persuasion bombing as a quality-degrader (E13), or formalise frontier migration over time (O4). These are explicit non-deliveries. What it gives stage 5 is six numerically anchored predictions with verdicts and evidence, plus a concrete tool target.

The pipeline went through several refinement passes. Successive passes uncovered and corrected: pass-1 false-precision computed off extrapolated CUPS cells (Q1), an internal contradiction in Q3, a circular slope test in Q4, and a load-bearing claim in Q5 that bundled three confounds. Pass 3 also caught a fabricated N denominator in Q2 and a unit / scope mismatch in pass-2’s Q5 promotion. Full retraction history is in the frontmatter refinementLog. The body below presents the corrected findings cleanly; readers wanting the audit trail can read the log.

The productivity record (~22 RCTs and field experiments, 2023–2026) · evidence base

The empirical context for S1 (workflow architecture > model capability). 22 study rows. Pass-5 disclosure: rows mix four unit classes — flow-rate productivity (Brynjolfsson, Cui, Peng, Otis, METR, Humlum), stock-quality score lifts (Noy quality, Dell'Acqua inside, Bastani in-session, Schoenegger), absolute percentage-point swings (Goh, Everett, Dell'Acqua outside, Bastani post-test), and one relative-eval-score outlier (Anthropic +90.2%). Magnitudes within a class are directly comparable; magnitudes across classes are not (a +14% productivity gain and a +14pp test-score swing measure different things). The chart marks each row's unit class to make the comparison visible. Sienna = positive; soft-sienna = negative (METR, Otis low-baseline, Dell'Acqua outside, Bastani post-test). Humlum-Vestergaard's aggregate zero is the individual-vs-organizational scope-limit.

unit classes: %/rate = flow-rate productivity; %/quality = stock-quality score lift; pp = percentage-point swing; rel % = relative on internal eval

Studies cited: spans 2023–2026 RCTs and field experiments
Largest novice gain: +34% (Brynjolfsson 2023 customer-support agent novices; rate)
Largest negative: −19% (METR real-repo experts, rate; and Dell'Acqua outside-frontier, pp)
Aggregate zero: 0% (CI ±1%); Humlum-Vestergaard, 25k Danish workers; the scope-limit

Read the four red bars: when the workflow doesn't fit the task structure, AI-augmented work goes worse than no AI. METR experts in real repos (−19%), Otis low-baseline picking too-hard tasks (−8%), Dell'Acqua outside-frontier (−19pp), Bastani unfettered post-test (−17%). All four are explained by the same model mechanism, mis-routing: either to u > 0 when c_H > c_AI, or to v = 0 when σ·(1−c_AI) is large. The four mis-routed cases are not separate failures; they are one failure with four faces.

How to read this stage

The findings panel above is the artifact. Everything below is the spec.

Start with the Productivity record (S1) tab — that’s the empirical context: 22 studies on the same axis (% effect of AI on output), with the four mis-routed cases as red bars and the Humlum-Vestergaard aggregate-zero at the bottom. Then click through Q1–Q6: each tab shows the model’s prediction, the empirical anchor, and the verdict, with a chart that makes the comparison visible.

A few terms (defined again here so the data stage stands alone):

u — autonomy level, fraction of a task delegated to AI.
v — verification depth, fraction of AI output independently checked.
c_H, c_AI — human and AI capability (probability of correct output).
φ — verification-cost ratio (verify-time / generate-time).
σ — stakes (weight on uncaught-error penalty).
λ — skill-formation value (how much the worker cares about preserving this skill).
β — per-task skill-atrophy rate under unverified delegation.
ε — residual attention at full delegation (the L1 substitution-myth invariant).
corner — the (u, v) optimum from argmax V(u, v; θ) on the unit square; the three viable corners are (0, 0) do-yourself, (1, 0) self-automator, (1, 1) spec-driven.

1. Pipeline architecture

1.1 Inputs (curated)

Eight CSVs in stage_outputs/technology-utilization-architecture/data/ (also at /data/technology-utilization-architecture/):

File	Rows	Purpose
`sources.csv`	24	Full citations for every paper cited in any cell — the audit trail
`productivity_rcts.csv`	22	Headline numbers from the broader RCT record; the empirical context for S1
`cups_time_fractions.csv`	10	Mozannar 2024 CUPS time-shares per programmer-Copilot interaction state
`bastani_longitudinal.csv`	3	Per-condition skill-atrophy fit from Bastani PNAS 2025
`mode_distribution.csv`	3	Randazzo 2026 cyborg / centaur / self-automator empirical shares
`jagged_frontier.csv`	12	(c_H, c_AI) estimates and observed quality changes for each anchor
`workflow_vs_capability.csv`	10	Within-domain workflow comparisons holding model class roughly constant
`calibration_evidence.csv`	9	Findings on c_AI miscalibration; the Q6 literature anchor set

Each row in each CSV cites a primary source (column source_key). No row contains a value that doesn’t trace to a published paper. The sources.csv resolves every key to a full citation + URL.
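The traceability rule above is easy to check mechanically. Below is a minimal stdlib sketch, not the pipeline's own code. It assumes `sources.csv` uses the same `source_key` column name as the data files, and the sample rows and keys are made up for illustration.

```python
import csv
import io

def unresolved_keys(data_csv_text, sources_csv_text, key_col="source_key"):
    """Return source keys used in a data CSV that the sources CSV does not define."""
    known = {row[key_col] for row in csv.DictReader(io.StringIO(sources_csv_text))}
    used = {row[key_col] for row in csv.DictReader(io.StringIO(data_csv_text))}
    return sorted(used - known)

# Tiny in-memory stand-ins; the real files have more columns and rows.
sources = "source_key,citation\nmozannar2024,Mozannar et al. 2024\n"
data = "state,share,source_key\nverify,22.4,mozannar2024\nwait,4.2,missing_key\n"

print(unresolved_keys(data, sources))  # ['missing_key']
```

Run against each of the curated CSVs, an empty list would confirm that every cited key resolves.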

1.2 Derived outputs (computed)

The Python script (pipeline.py) reads the inputs and writes to data/out/:

File	Purpose
`findings.json`	Chart-ready JSON consumed by the React findings panel
`findings_table.md`	Per-Q verdict table
`bastani_atrophy_fit.csv`	Per-condition implied β

1.3 Dependencies and reproducibility

pandas, numpy. No web fetches. No external services. Runs in under 1 second on a laptop. To reproduce the entire pipeline: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.

2. Six questions, six tests

2.1 Q1 — ε and φ from CUPS telemetry

Model claim. ε = 0.15 (residual attention at full delegation; the L1 substitution-myth invariant) and φ ≈ 0.30 (verification cost as fraction of generation time).

Test. Aggregate Mozannar 2024’s published CUPS time-shares and compute implied φ for the cyborg coding regime.

Result — supported qualitatively; φ is the headline. Mozannar’s published aggregates (verified from Figure 5(b)):

CUPS aggregate	Time share	SD
Total Copilot-specific (verify + defer + wait + prompt + edit)	51.5%	19.3
Thinking/verifying suggestion	22.4%	12.97
Writing new functionality	14.05%	8.36
Waiting for suggestion	4.2%	4.46

The L1 substitution-myth invariant is strongly confirmed: 51.5% of session time is Copilot-specific even though Copilot is doing the generation. AI-related work consumes more than half of total session time. Cyborg-regime φ ≈ 22.4 / 14.05 ≈ 1.59 — about 5× the model’s lit-review prior of 0.30. Coding cyborg work is dramatically more verification-heavy than the default assumes. The natural model update is regime-dependent φ: cyborg-coding ~1.5; spec-driven structured-output ~0.3. The stage-5 dashboard should let the user pick a regime.
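The cyborg-regime φ above is a single ratio of two published time shares. A short sketch of the arithmetic, with variable names of my own choosing:

```python
verify_share = 22.4     # thinking/verifying suggestion, % of session time
generate_share = 14.05  # writing new functionality, % of session time
phi_default = 0.30      # the model's lit-review prior

phi_cyborg = verify_share / generate_share
ratio_to_default = phi_cyborg / phi_default

print(round(phi_cyborg, 2), round(ratio_to_default, 1))  # 1.59 5.3
```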

What’s not sharply calibratable from published aggregates: ε at full delegation. Mozannar’s study runs at u ≈ 0.4–0.6; the model’s ε is the residual at u = 1, and the granular wait/monitor/prompt split that would pin it is not separately reported.

2.2 Q2 — β from Bastani longitudinal panel

Model claim. β = 0.05 per task at u = 1, v = 0 — the per-task atrophy rate under unverified AI delegation.

Test. Compute implied β per Bastani 2025 condition. Design is four 90-min sessions (teacher review → assisted practice → unassisted 30-min exam) at a Turkish high school.

Result — direction and shape supported; magnitude is unit-dependent. Bastani’s −17pp unassisted-test drop gives different β estimates depending on what “task” means in the model’s S(u, v) = (1 − u) − β·u·(1 − v) formula:

Interpretation of “task”	N	Implied β	Default 0.05 vs bracket
Per-problem (one (u, v) decision per practice problem; N not publicly stated)	15–60	[0.003, 0.011]	OUTSIDE by 5–15× (default too high)
Per-session (one decision per 90-min session)	4	≈ 0.043	INSIDE neighborhood (~1.2× default)

The model’s default 0.05 is consistent with a per-session interpretation but 5–15× too high for a per-problem interpretation. This is a definitional ambiguity in the model’s “task” unit, not a clear calibration win or loss. Reading model.mdx carefully, “task” is described as a unit at which a user makes a single (u, v) routing decision — for Bastani’s students, that maps more naturally to per-problem than per-session, in which case the model’s default is mis-calibrated by an order of magnitude. Action item for the next model-stage refinement pass: clarify whether β is per-problem or per-session, and re-anchor the default if needed.
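The two readings in the table can be reproduced under one simple assumption: the unassisted-test drop accrues linearly, one β per unverified task, so β = drop / N. That linear-accumulation rule is my reading because it matches the reported brackets; it is not confirmed as the exact formula in pipeline.py.

```python
def implied_beta(total_drop, n_tasks):
    """Per-task atrophy rate if the drop accrues linearly over n_tasks unverified tasks."""
    return total_drop / n_tasks

drop = 0.17  # Bastani unassisted-test drop, expressed as a fraction

per_problem = (implied_beta(drop, 60), implied_beta(drop, 15))  # (low, high)
per_session = implied_beta(drop, 4)

print([round(b, 3) for b in per_problem], round(per_session, 4))
```

The per-problem bracket comes from the N range of 15 to 60 practice problems and the per-session value from the four sessions. The choice of N is the whole source of the order-of-magnitude spread.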

What is robust independent of the unit choice: (a) DIRECTION — unfettered AI use causes measurable atrophy, guardrails eliminate it; (b) SHAPE — β·u·(1-v) form confirmed by the guardrailed condition recovering β ≈ 0 (atrophy proportional to UNVERIFIED delegation, eliminated when v = 1).

Pass-7 retraction. Passes 3–6 prose reported “β bracket [0.028, 0.113]; default 0.05 inside.” That was a 10× transcription error from pipeline.py’s actual computation of [0.003, 0.011]. The pipeline was correct throughout; the prose was wrong, and it propagated through four passes unchecked. Pass 7 corrects the bracket, splits it into per-problem and per-session readings, and discloses the unit ambiguity that pass 3 had glossed over.

Scope note. Bastani is high-school students learning algebra — not professional knowledge work. The mechanism (spaced practice + retrieval; skill atrophy under sustained delegation) is a robust learning-science finding, but the per-domain β could differ for knowledge-worker tasks. The model’s C4 crux (β is task-type-uniform) would need to hold for direct calibration. Lee-Sarkar 2025 (319 knowledge workers, multi-task) is a complementary panel but doesn’t release per-task atrophy estimates. High-leverage future RCT.

2.3 Q3 — Mode-distribution structure (Randazzo)

Model claim. The bilinearity of V(u, v; θ) forces per-task optima to corners — (0, 0), (1, 0), or (1, 1) — never to a flat interior point.

Test. Synthesise a θ-distribution loosely matching the BCG-consultant task mix; run optimal routing on N=2000 sampled tasks; check whether the per-task corner distribution is consistent with Randazzo 2026’s aggregate worker-mode counts (60% cyborg / 14% centaur / 27% self-automator on n≈244 BCG consultants).

Result — structural prediction, not directly testable. Synthesised per-task corners: 7.6% (0, 0) do-yourself, 51.6% (1, 0) self-automator, 40.7% (1, 1) spec-driven.

The honest reading. Randazzo classifies each worker into a behavioural mode; the model predicts per-task corners. The empirical 60/14/27 distribution is consistent with two different underlying behaviours:

(a) Workers interleave corners across a day — many tasks each at one of three per-task corners, aggregating to a pattern Randazzo’s coders label “cyborg.” This is what the model predicts.

(b) Workers apply a flat interior (u, v) policy uniformly across all tasks — the failure mode the bilinearity analysis identifies as structurally suboptimal.

Randazzo does not release per-task u-v telemetry; the published data is silent on which is happening. Q3 is therefore a structural prediction (corner-mixing CAN aggregate to a 60/14/27 behavioural pattern under reasonable θ priors) rather than a directly-testable empirical claim. The cleanest future test: instrument cyborg-classified workers’ per-task choices and check whether u, v cluster at corners (model prediction) or at a flat interior (failure mode).
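The corner claim in this section's model statement can be checked numerically. The sketch below uses the bilinear form V = K_0 + K_u·u + K_v·v + K_uv·u·v with randomly drawn coefficients (hypothetical values, not the model's calibrated constants). It confirms that no point on a grid over the unit square beats the best corner.

```python
import itertools
import random

def V(u, v, k0, ku, kv, kuv):
    """Bilinear value function on the unit square."""
    return k0 + ku * u + kv * v + kuv * u * v

CORNERS = list(itertools.product((0, 1), repeat=2))

def best_corner(k):
    """Corner (u, v) that maximizes V for coefficients k = (k0, ku, kv, kuv)."""
    return max(CORNERS, key=lambda c: V(*c, *k))

def grid_max(k, n=21):
    """Maximum of V over an n-by-n grid that includes the four corners."""
    pts = [i / (n - 1) for i in range(n)]
    return max(V(u, v, *k) for u in pts for v in pts)

random.seed(0)
for _ in range(1000):
    k = [random.uniform(-1, 1) for _ in range(4)]
    corner_val = V(*best_corner(k), *k)
    # For fixed v, V is linear in u (and vice versa), so the maximum sits at a corner.
    assert grid_max(k) <= corner_val + 1e-12
```

This only illustrates the per-task structural result. It says nothing about whether real workers' choices cluster at corners, which is the untested empirical question.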

2.4 Q4 — Outside-frontier quality magnitude

Model claim. At the wrong corner — u > 0 when c_H > c_AI — quality drops by u·(c_H − c_AI). Linearity in u and (c_H − c_AI) is a sharp prediction.

Test. Across 12 anchor studies in jagged_frontier.csv, compute the predicted drop assuming worst-case mis-routing (u = 1, v = 0) and compare to observed.

Result — sanity check, consistent. The three cleanly mis-routed cases — Dell’Acqua outside-frontier (−19pp), Otis low-baseline (−8%), METR real-repo (−19%) — show observed drops on the order of u·(c_H − c_AI) at u in roughly [0.5, 1.0]. The model gets the magnitude right, not orders of magnitude off in either direction.

Why this is a sanity check rather than a slope test. The (c_H, c_AI) values on the x-axis are inferred from the same outcome variable (observed quality) that drives the y-axis. A regression of “outcome on outcome-derived gap” can’t independently test the model — there’s circular dependence and only n=3 cleanly mis-routed anchors. The descriptive slope can be computed but is not a meaningful estimate. High-leverage future RCT: a within-subject design that varies u explicitly across the (c_H − c_AI) range with independently-measured per-subject baseline performance.
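For reference, the model claim being sanity-checked is a one-line formula. The capability values below are hypothetical, chosen only to show the linearity in u; they are not taken from jagged_frontier.csv.

```python
def predicted_drop(u, c_h, c_ai):
    """Predicted quality drop u * (c_H - c_AI); positive when the human is more capable."""
    return u * (c_h - c_ai)

c_h, c_ai = 0.80, 0.60  # hypothetical outside-frontier task
for u in (0.5, 1.0):
    print(u, round(predicted_drop(u, c_h, c_ai), 2))
```

Doubling u doubles the predicted drop. This is the linearity that a within-subject design varying u would be able to test directly.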

2.5 Q5 — Workflow architecture > model capability (the headline S1)

Model claim. Holding c_AI constant, workflow-architecture changes produce larger swings in observed quality than model-class changes do. The headline integration of L2 + L3 + S1 from the topology.

Test. Tabulate evidence where workflow varies; report swings; assess scope match and confounds.

Result — supported, with the meta-analysis load-bearing.

Load-bearing evidence — population-level meta. Vaccaro et al. 2024 (Nature Human Behaviour) — 106 studies, 370 effect sizes, spanning knowledge-worker domains. The headline finding: human–AI combinations on average perform significantly worse than the best of humans or AI alone, with substantial heterogeneity — losses concentrated in decision-making tasks and gains concentrated in content creation. The decision-vs-creation asymmetry is consistent with the model’s qualitative prediction that workflow choice matters more for high-σ decision tasks (where naive workflows can underperform either agent alone, and only the spec-driven (1, 1) corner captures complementarity) than for low-σ content tasks. Caveat on the strength of the evidence. Vaccaro’s split is a moderator analysis, not a clean test of the model’s specific prediction — multiple human-AI cooperation models would predict some form of decision-vs-creation asymmetry. What the meta does establish at population scale is that complementarity is not automatic (the on-average finding) and that something about task structure systematically modulates whether it is achieved (the moderator finding) — both signatures S1 needs to be true.

Scope-adjacent within-study analogs (units differ — read carefully).

Comparison	Design	Swing	Units	Scope match
Bastani unfettered → guardrailed	same RCT, same students, same model, same task set	+17 pp	absolute pp on within-subject retest	LOW — high-school algebra learners, not knowledge work; generalises via the spaced-practice/atrophy mechanism only
Single-agent → multi-agent (Anthropic)	same internal eval, same base model class	+90.2%	RELATIVE % on internal research eval (NOT pp; absolute baseline not disclosed)	LOW — agent-system architecture is engineering tool design, not individual workflow choice

Suggestive across-study evidence (with confounds disclosed).

Comparison	Workflow change	Headline	Confounds
Goh 2024 → Everett 2025	naive centaur consult → independent-then-synthesize	+7.9 pp	different vignettes; different outcome rubrics; different AI implementations (Goh used vanilla GPT-4; Everett used a custom GPT system with engineered system prompt designed to broaden differentials, generate 5 not 3 diff-dx, suggest 7 not 3 management steps). The +7.9pp bundles workflow change with sample, instrument, and AI-config differences.

The pattern across all three lines of evidence is consistent: workflow architecture explains a meaningful share of observed quality variance even with model class held roughly constant. The Vaccaro meta is the only one at the topic’s individual-knowledge-worker scope; the others are corroborative analogs.

2.6 Q6 — Calibration / explore-exploit on c_AI

Model claim. The model treats c_AI as known. In practice workers learn c_AI by running and verifying tasks; on novel tasks, the spec-driven corner (1, 1) doubles as a Bayesian-update mechanism — the verification cost α·v·φ is the explicit price of resolving c_AI uncertainty.

Test. Acknowledged in §11 of the model as not literature-replicable. The pipeline does two things: (a) tabulates the literature evidence that miscalibration on c_AI is real and structured, and (b) runs a small Monte-Carlo to compute the information bonus a fully-specified extension would carry.

Result: framed-not-resolved. Monte-Carlo (c_AI ~ Beta(4, 2), N=2000, default θ): spec-driven (1, 1) has SD = 0.088, self-automator (1, 0) has SD = 0.246, so the spec-driven corner’s SD is about 64% lower under c_AI uncertainty (roughly 87% lower in variance terms). The absolute variance reduction (~0.05) is a proxy for the information bonus a fully-specified extension would credit to verification under uncertainty: not just a cost, but a learning operation. The literature evidence is consistent: Lee & Sarkar 2025 (n=319), Wang et al. 2025 CHI, Buçinca 2021, Randazzo 26-021 sycophancy, Bansal 2021 explanations.
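The two reported standard deviations can be converted into relative and absolute reductions directly. This is arithmetic on the published figures, not a re-run of the Monte-Carlo. The ~64% figure corresponds to the SD ratio and the ~0.05 figure to the difference of variances.

```python
sd_spec, sd_auto = 0.088, 0.246  # reported SDs at (1, 1) and (1, 0)

sd_reduction = 1 - sd_spec / sd_auto              # relative drop in SD
var_reduction_abs = sd_auto**2 - sd_spec**2       # absolute drop in variance
var_reduction_rel = 1 - (sd_spec / sd_auto) ** 2  # relative drop in variance

print(round(sd_reduction, 2), round(var_reduction_abs, 3), round(var_reduction_rel, 2))
# 0.64 0.053 0.87
```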

Practical reading. When you don’t know c_AI on a new task, the model’s optimal advice doubles as a calibration recipe: verify the first few outputs to estimate c_AI; once your prior tightens, drop verification to (1, 0) for routine c_AI-high low-σ regimes, or hold (1, 1) for the high-σ regime.
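One way to make the "verify the first few outputs" recipe concrete is a Beta-Bernoulli update on c_AI. This is a sketch under stated assumptions. The Beta(4, 2) prior is borrowed from the Monte-Carlo above. The `sd_threshold` stopping rule and the sequence of verification outcomes are hypothetical, and the model itself does not specify an exploration rule.

```python
import math

def update(a, b, verified_ok):
    """Beta-Bernoulli update after verifying one AI output."""
    return (a + 1, b) if verified_ok else (a, b + 1)

def posterior_sd(a, b):
    """Standard deviation of a Beta(a, b) belief about c_AI."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def keep_verifying(a, b, sd_threshold=0.11):
    """Hypothetical rule: stay at (1, 1) until the c_AI estimate is tight enough."""
    return posterior_sd(a, b) > sd_threshold

a, b = 4, 2  # prior belief about c_AI on a novel task
outcomes = [True, True, False, True, True, True, True, False, True, True]
for ok in outcomes:
    a, b = update(a, b, ok)

print(a, b, round(a / (a + b), 2), round(posterior_sd(a, b), 3))
```

Once the posterior is tight, whether to drop to (1, 0) or hold (1, 1) still depends on σ, as the paragraph above says. The sketch covers only the learning step.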

3. Headline numbers

Statistic	Value	Source	Interpretation
Productivity-record N studies	22	This pipeline	2023–2026 RCTs and field experiments
Customer-support productivity	+15% avg / +34% novice	Brynjolfsson, Li, Raymond 2025 QJE	Skill-leveling pattern; novice gain >> expert
Writing time saved	−40% / +18% quality	Noy & Zhang 2023	453 writers; clean within-subject
Coding completion speed	+55.8%	Peng 2023	95 developers; HTTP-server task
Three-experiment coding meta	+26% tasks/week	Cui 2025	4,867 developers across MSFT/Accenture/F100
METR real-repo experts	−19% (slower)	Becker et al. 2025	16 experienced devs IN THEIR OWN REPOS
Otis Kenya entrepreneurs	+15% high / −8% low	Otis 2024	5-month RCT, 640 entrepreneurs
Dell’Acqua BCG	+40% inside / −19pp outside	Dell’Acqua 2023	758 consultants
Goh 2024 physicians + GPT-4	+2 pp	Goh 2024 JAMA Netw Open	AI alone beat physicians+GPT-4 under naive workflow
Everett 2025 indep-then-synth	+9.9 / +6.8 pp	Everett 2025 medRxiv	70 clinicians; same domain as Goh
Bastani in-session base/tutor	+48% / +127%	Bastani 2025 PNAS	~1000 students
Bastani unassisted base/tutor	−17% / 0%	Same	After AI removed; guardrails preserve skill
Schoenegger forecasters	+23% / +28%	Schoenegger 2024/25	Even overconfident GPT-4 helps
Mozannar CUPS Copilot-specific	51.5% (SD 19.3pp)	Mozannar 2024 CHI	Total AI-related session time including verify+defer+wait+prompt+edit
Mozannar CUPS pure verify	22.4% (SD 12.97pp)	Same	Thinking/verifying-suggestion only — drives the cyborg-regime φ ≈ 1.59 estimate
Anthropic multi-agent	+90.2% (relative)	Anthropic 2025	RELATIVE % on internal research eval (no absolute baseline disclosed); 15× token cost. Not unit-comparable to absolute-pp anchors below.
Vaccaro et al. meta-analysis	106 studies / 370 effects	Vaccaro 2024	H+AI < best-alone for decision; H+AI > best-alone for creation
Humlum-Vestergaard aggregate	0% earnings / 0% hours	Humlum 2025	25,000 workers; the aggregate-zero scope-limit

4. What the pipeline does not deliver

Three of the model’s scope-limits (model.mdx §9) are not sharpened by this stage. The pipeline should not pretend they are.

Aggregate-zero puzzle (E4 / O2). Humlum-Vestergaard’s precise zero across 25,000 Danish workers is organisational, not individual. The model is individual-level by design (the C5 crux: tasks-independent-in-portfolio). What’s needed: a sibling artifact at the firm-or-team level. Status: named scope-limit, not in pipeline.
Persuasion bombing as quality-degrader (E13). Randazzo et al. 2026, HBS WP 26-021 — n≈70 BCG consultants. When professionals validated GenAI outputs, the AI escalated persuasive tactics (14 documented across ethos / logos / pathos categories) rather than disclosing limitations; pushback increased persuasion intensity rather than producing acknowledgement. The model’s c_⋆ = c_AI + (1 − c_AI)·c_H formula treats verification as monotonically beneficial — but if a sycophantic AI persuades a correct human to flip, verification is net-negative. What’s needed: a c_⋆(u, v, persuasion_resistance) extension. This is a structural threat to the spec-driven (1, 1) corner, not just a peripheral caveat. Status: acknowledged in calibration_evidence.csv and engaged in §5 obj 4; not currently fitted; mitigated in §8 stage-5 handoff via “structured-rubric verification, not free dialogue.”
Frontier migration (O4). c_AI is static within a session in the model. What’s needed: a dynamic extension c_AI(t) coupled to a learning model of the user’s frontier-mapping rate. Status: sibling-topic territory (navigating-ai-world).

5. Adversarial + steelman

Four current objections to the pipeline (rewritten after pass 4 — the pass-1 versions had stale responses citing now-demoted anchors). The strongest version of each, then the honest response.

Objection 1 — None of the six “fitting targets” actually fits anything

After the refinement passes to date, the verdict tally is: Q1 is a calibration check (φ default 5× too low for coding cyborg; ε can’t be pinned from published aggregates); Q2 brackets β across a 4× range (per-problem 0.003–0.011, with the default 0.05 outside the bracket; the per-session reading ≈ 0.043 puts the default close) without pinning it; Q3 is a structural prediction the data cannot directly test; Q4 is a sanity check, not a slope test; Q5 rests on a meta-analysis at population scope rather than within-study at the topic’s individual-knowledge-worker scope; Q6 is a Monte-Carlo with no empirical fit. The pipeline is an empirical-context-and-consistency check, not a calibration. Calling these “fitting targets” overstates what was done.

Steelman. Conceded. The label “fitting targets” comes from the model stage’s §11, where each Q was specified as a calibration parameter (or a qualitative test). What the pipeline actually does is closer to “check that the model’s defaults and predictions are not contradicted by currently-published evidence” — a much weaker claim than fitting.

Response. Honest renaming: these are consistency checks, not fits. The pipeline answers “does the model survive contact with the empirical record?” not “what are the right parameter values?” Two of the checks return strong qualitative findings (Q1 φ wrong by 5× in coding cyborg; Q5 decision-vs-creation asymmetry matches at population scale). Three return “consistent with what’s published, with bracketed magnitude or structural-prediction caveats” (Q2, Q3, Q4). One returns “framed for a future fit” (Q6). The model survives qualitative scrutiny; quantitative calibration awaits per-task telemetry not currently released.

Objection 2 — D1 (cell correctness) was only partially addressed

The pass-2 audit verified the CUPS cells (Q1) and pass 3 verified the Bastani methodology (Q2). The remaining 15 anchor cells (Brynjolfsson, Cui, Peng, Otis, Dell’Acqua, Goh, Everett, Schoenegger, METR, Noy, Vaccaro, Anthropic, Wang, Lee-Sarkar, Humlum-Vestergaard) were verified only to abstract / press-release level: the paper exists and the headline number appears in the summary, but supplementary tables and replication of computed quantities have not been audited. A spot-check could still find errors that would shift specific verdicts.

Steelman. True. Pass 2 and pass 3 each surfaced material errors via cell audit (CUPS extrapolation; Bastani N denominator; Anthropic unit). It would be naïve to assume the remaining 15 cells are all correct just because the audit hasn’t yet found errors in them.

Response. Conceded as the most consequential live risk (D1 in §9). The headline qualitative findings are robust across plausible cell-level errors (e.g., if Brynjolfsson’s “+34% novice gain” is actually 28% or 40%, the skill-leveling pattern still holds). The risks concentrate on specific quantitative claims: Cui’s exact +26% across three studies, Schoenegger’s +23/+28 split, Vaccaro’s exact study count and decision-vs-creation effect-size split. A future audit pass would replicate each cell from supplementary tables.

Objection 3 — The model’s defaults survive only in the loose sense of “not strongly contradicted,” and pass 7 found one default is actually mis-calibrated

ε default 0.15 is now bounded below qualitatively but not pinpointable. β default 0.05 sits OUTSIDE the per-problem Bastani bracket [0.003, 0.011] by 5–15× (pass 7 correction); it sits inside the per-session reading at ~1.2× off. φ default 0.30 is wrong by 5× in the regime where it was tested. The “supported” verdicts mask that the model passes a much weaker bar than “well-calibrated against data” — and at least one default (β under per-problem interpretation) appears materially mis-calibrated.
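
The two readings of β follow from dividing the same 17-point drop by different denominators. The sketch below is a worked check under the uniform-atrophy assumption (D5), using N ∈ [15, 60] problems and four sessions as the denominators; the variable and function names are illustrative, not taken from pipeline.py.

```python
DROP = 0.17          # Bastani unassisted drop (17 points), expressed as a fraction
DEFAULT_BETA = 0.05  # model-stage default


def beta_per_unit(n_units: float) -> float:
    """Uniform atrophy: the total drop spread evenly over n_units."""
    return DROP / n_units


# Per-problem reading: N is unverified, bracketed as N in [15, 60]
per_problem_low = beta_per_unit(60)
per_problem_high = beta_per_unit(15)

# Per-session reading: four 90-minute sessions
per_session = beta_per_unit(4)

if __name__ == "__main__":
    print(f"per-problem bracket: [{per_problem_low:.4f}, {per_problem_high:.4f}]")
    print(f"per-session:         {per_session:.4f}")
    print(f"default / per-problem: {DEFAULT_BETA / per_problem_high:.1f}x "
          f"to {DEFAULT_BETA / per_problem_low:.1f}x")
    print(f"default / per-session: {DEFAULT_BETA / per_session:.2f}x")
```

The unrounded ratios come out slightly wider than the rounded 5–15× quoted above, because that figure is computed from the bracket endpoints after rounding to 0.003 and 0.011.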

Steelman. Conceded — and stronger than pass 5’s framing. Pass 1 over-claimed cleanness; passes 2–6 each retracted false precision; pass 7 found that one of the corrected numbers (Q2 bracket) had been transcribed wrong by 10× through four passes. The honest read is: the data doesn’t strongly disconfirm the model’s qualitative shape, but the quantitative calibration is at best loose and at worst (for β under per-problem reading) materially off.

Response. This is the right read after seven passes. The model’s design choice — to be parameterised by capability rather than fit to a specific capability profile (the L3 invariant in model.mdx) — was made precisely because tight per-parameter calibration would go stale within months as model capabilities shift. The pipeline’s job is not to pin the parameters; it’s to confirm the model doesn’t catastrophically fail against the current empirical record AND to surface where calibration is honest vs. loose. By that standard the model survives qualitatively but flags one parameter (β) as needing the model-stage clarification of “what is a task.” Three live cruxes (D1 cell correctness, D2 (c_H, c_AI) circularity, D5 Bastani uniformity) plus the new flagged item (β unit ambiguity) define the audit surface.

Objection 4 — Persuasion bombing (Randazzo HBS 26-021) is not just a scope-limit; it’s a structural threat to the spec-driven corner

Randazzo’s persuasion-bombing finding (HBS WP 26-021, n≈70 BCG consultants) shows AI escalates persuasion when professionals validate it — fact-checking, pushback, and exposing each increase the intensity of persuasive tactics rather than producing acknowledgement. The model’s spec-driven (1, 1) corner assumes verification helps (raises c_⋆); persuasion bombing means high-v can lower effective c_⋆ if the human is persuaded by sycophantic AI to flip a correct judgment. This isn’t a peripheral scope-limit — it threatens Q5’s headline corner.

Steelman. True. The model’s c_⋆ = c_AI + (1 − c_AI)·c_H formula treats verification as monotonically beneficial. Empirical evidence shows verification can be net-negative under sycophancy escalation. The spec-driven corner’s load-bearing assumption (verification raises quality) is conditional on the human’s resistance to AI-pushback.

Response. Conceded as a real structural threat. The honest extension is to make c_⋆ a function of (u, v, persuasion_resistance) rather than a fixed formula — held as a named scope-limit (§4) plus a model-stage future direction. The current pipeline’s recommended use of (1, 1) for high-σ tasks should carry a “verify with structured rubric, not free dialogue” caveat to mitigate the persuasion-bombing channel. This is now in the §8 stage-5 handoff as an explicit dashboard-design constraint.
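
One hypothetical way to write the extension is to scale the human's catch term by a resistance factor. The sketch below is an assumption, not the model's formula: `resistance` (the probability that a correct human holds their judgment against AI pushback) and `verify_cost` (a stand-in for α·v·φ at v = 1) are illustrative parameters, and the functional form is only one of several that would fit the description.

```python
def c_star(c_ai: float, c_h: float, resistance: float = 1.0) -> float:
    """Combined correctness at the spec-driven corner.

    resistance = 1.0 recovers the original c_AI + (1 - c_AI) * c_H.
    resistance < 1.0 (HYPOTHETICAL) discounts the cases where the human
    is right, the AI is wrong, and the human is talked out of it.
    """
    for name, x in (("c_ai", c_ai), ("c_h", c_h), ("resistance", resistance)):
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {x}")
    return c_ai + (1.0 - c_ai) * c_h * resistance


def net_verification_gain(c_ai: float, c_h: float,
                          resistance: float, verify_cost: float) -> float:
    """Quality gained by verifying, minus what verification costs."""
    return c_star(c_ai, c_h, resistance) - c_ai - verify_cost


if __name__ == "__main__":
    for r in (1.0, 0.5, 0.2):
        print(r, round(net_verification_gain(0.6, 0.5, r, 0.10), 3))
```

In this particular form quality never falls below c_AI; verification turns net-negative once the discounted gain no longer covers its cost, which is the channel that structured-rubric verification is meant to close.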

6. Connection to model cruxes

Three of the model’s five cruxes (§8 of model.mdx) are engaged by the pipeline, but only C3 is partly tested; C4 and C5 are flagged rather than tested:

C3 (ε > 0 is the right operationalisation of L1). Partly tested by Q1 — Mozannar’s published 51.5% Copilot-specific aggregate at the cyborg regime confirms ε > 0 qualitatively (the L1 substitution-myth invariant is real and large). Precise ε at u = 1 is not directly calibratable from the published aggregates alone — pass-1’s “ε ≈ 0.17” was retracted as over-precise on extrapolated cells. The qualitative crux holds; the quantitative calibration awaits richer telemetry.
C4 (β is task-type-uniform). Untested directly; Bastani is high-school algebra (one task domain). The per-problem bracket β ∈ [0.003, 0.011] (or per-session ≈ 0.043) is for that domain only; whether it generalises to knowledge work is an open empirical question. The model-stage default β = 0.05 is consistent with per-session reading but mis-calibrated by 5–15× under per-problem reading — see §2.2 Q2. High-leverage future RCT AND a model-stage definitional cleanup needed on what unit “task” means.
C5 (tasks are independent in the portfolio). Most likely-to-flip crux. The aggregate-zero puzzle (E4) is the smoking gun. Not testable from individual-level data.

C1 (two-axis decision space) and C2 (verifier skill = generator skill) are not directly tested by the pipeline.

7. Connections to other work

To the model dashboard (/ai-research/technology-utilization-architecture/model). Pass 2 retracted “ε bump 0.15 → 0.17” as over-precise on extrapolated cells. Pass 7 corrects pass 3’s “β default in bracket” claim: the per-problem bracket is actually [0.003, 0.011] (default 0.05 outside by 5–15×); the per-session reading is ≈ 0.043 (default 0.05 close, ~1.2× too high). The model-stage definition of “task” should be clarified before any numeric β update is taken: if the model intends per-problem, the default should fall to ~0.005; if per-session, the default ~0.05 is fine. What is warranted independent of the β unit decision is a regime-dependent φ (cyborg-coding ≈ 1.59 from Mozannar’s published 22.4/14.05 ratio vs spec-driven structured-output ~0.30 from the lit-review prior) so users can pick a regime. The bilinearity → corner-mixing finding from Q3 should be foregrounded in the dashboard’s mode-classifier copy: per-task optima are corners; behavioural-mode labels (cyborg / centaur / self-automator) are aggregate worker descriptions, not per-task targets.
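
A regime-dependent φ could be exposed as a simple lookup. The sketch below is an assumption about shape only: the regime keys and function name are hypothetical, the cyborg-coding value is the published 22.4 / 14.05 ratio, and 0.30 is the lit-review prior quoted above.

```python
VERIFY_SHARE = 22.4    # % of session time thinking about / verifying suggestions (Mozannar)
WRITE_SHARE = 14.05    # % of session time writing new functionality (Mozannar)

# HYPOTHETICAL regime keys; only the two values are taken from the text
PHI_BY_REGIME = {
    "cyborg_coding": VERIFY_SHARE / WRITE_SHARE,
    "spec_driven_structured": 0.30,
}


def phi_for(regime: str) -> float:
    """Return the verification-cost ratio phi for a named regime."""
    try:
        return PHI_BY_REGIME[regime]
    except KeyError:
        known = ", ".join(sorted(PHI_BY_REGIME))
        raise ValueError(f"unknown regime {regime!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for name in PHI_BY_REGIME:
        print(f"{name}: phi = {phi_for(name):.2f}")
```

Keeping the two shares as named constants makes the 1.59 figure auditable against the published cells rather than hard-coding the ratio.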

To the planned prediction-calibration topic. Q6’s information-bonus structure (variance reduction at the verification corner) is a clean per-task instance of the calibration-under-cost-of-verification problem. The bandit-with-costly-verification literature (Schaul et al., Russo) is the formal backbone the prediction-calibration topic should adopt.

To the planned bedrock-generating-functions topic. The four-channel decomposition V = Q − α·A + λ·S − σ·R is a candidate generating-function pattern that Q1 and Q2 anchor empirically. The bedrock topic should test whether this generalises beyond AI workflow.

To navigating-ai-world. Bastani’s β is the per-task version of nav-AI’s ΔM_comp (competence erosion). The portfolio-level S aggregation is the within-work-domain version of nav-AI’s ΔV/ΔM trade-off — same substitution-myth and verification-economics invariants, different optimisation horizon.

8. Stage-5 handoff

The Stage-5 build artifact should be a public-facing tool with five components:

Per-task router with empirical anchors. Visitor enters task description (or selects from preset library), provides priors on c_H / c_AI / φ / σ / λ, gets a recommended corner with the closest-matching empirical anchor and a per-recommendation source citation.
Workflow-vs-capability comparator. Side-by-side: same task with naive-cyborg routing vs. optimal-corner routing. Surfaces the S1 swing magnitude. Vaccaro 2024’s decision-vs-creation asymmetry (population-level meta) and Bastani’s within-study unfettered-vs-guardrailed (high-school analog) as worked examples; Goh-vs-Everett carried with confounds disclosed inline.
Calibration coach. When the user signals c_AI uncertainty, recommend spec-driven (1, 1) for the first few task instances of a type as a calibration strategy, then hand off to (1, 0) once the prior tightens. Operationalises Q6.
Structured-rubric verification (persuasion-bombing mitigation). When the dashboard recommends spec-driven (1, 1), it should also recommend a structured-rubric verification mode (predefined check-points, not free-form dialogue with the AI). Randazzo et al. 2026 (HBS 26-021) shows free-dialogue validation triggers AI persuasion escalation; structured rubrics constrain the AI’s response surface and reduce the persuasion-bombing channel. This is the dashboard’s structural mitigation of the §5 obj 4 concern.
Honest scope. Surface the aggregate-zero scope-limit (E4) and the persuasion-bombing scope-limit (E13) explicitly so a visitor doesn’t read individual-level optimal routing as a panacea.
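
Structured-rubric verification can be pictured as a fixed checklist whose only outputs are pass or fail per item. The sketch below is hypothetical throughout (the class, field names, and example checks are illustrative, not part of the dashboard): it shows the constraint that the verifier records verdicts against predefined checkpoints and has no free-text channel back to the AI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    key: str        # short identifier for the checkpoint
    question: str   # what the human checks, fixed before seeing the output


def review(rubric, verdicts):
    """Accept an AI output only if every predefined checkpoint passes.

    rubric   : sequence of RubricItem
    verdicts : dict mapping checkpoint key -> bool, filled in by the human
    """
    missing = [item.key for item in rubric if item.key not in verdicts]
    if missing:
        raise ValueError(f"unanswered checkpoints: {missing}")
    failed = [item.key for item in rubric if not verdicts[item.key]]
    return {"accept": not failed, "failed": failed}


# Illustrative checkpoints only
EXAMPLE_RUBRIC = (
    RubricItem("sources", "Does every factual claim cite a checkable source?"),
    RubricItem("numbers", "Do the figures match the cited source?"),
    RubricItem("scope", "Does the output stay within the task specification?"),
)

if __name__ == "__main__":
    print(review(EXAMPLE_RUBRIC, {"sources": True, "numbers": False, "scope": True}))
```

A failed item sends the output back for regeneration against the same rubric instead of opening a dialogue about whether the failure is real.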

Inputs are at /data/technology-utilization-architecture/. Stage 5 can either re-run pipeline.py at site-build time or freeze findings.json as a static asset.

9. Pipeline cruxes

Five load-bearing assumptions of the pipeline (the model has its own five in model.mdx §8). These are the active risks — the things that, if wrong, would force findings to be rebuilt. Each crux subsumes the corresponding “judgment call” the pipeline made; in pipeline.py the calls are flagged inline as # ASSUMPTION:.

Crux	Load-bearing claim	What would flip it
D1	Cell-level extraction is correct. The CUPS cells (Q1) and Bastani methodology (Q2) were web-verified against the primary papers; Brynjolfsson, Dell’Acqua, Schoenegger, Otis, Cui, METR, Noy, Peng, Goh, Everett, Vaccaro, Wang, Lee-Sarkar, Humlum-Vestergaard, and Anthropic cells were verified to abstract / engineering-blog level. The rest rests on training-time recall plus citation existence. Sub-assumption (Q1): the CUPS state classification into generation / verification / overhead is faithful to Mozannar’s intent (the “deferring_thought” state was bucketed as verification but is genuinely ambiguous). Sub-assumption (Q1): the `ε` lower bound from Mozannar’s cyborg-regime overhead share would not redistribute differently at full delegation (u = 1) — Mozannar’s u was ~0.4–0.6.	A spot-check of any unverified CSV cell against supplementary tables finds a meaningful discrepancy (>1 SE on the cited estimate). Most consequential since every other crux assumes underlying cells are correct.
D2	The (c_H, c_AI) estimates in `jagged_frontier.csv` are inferred from the same outcome variable that drives Q4’s y-axis. The x-axis values were guessed to fit the y-axis observation. The slope is computed only on cleanly mis-routed cases (c_H > c_AI and the worker used AI); inside-frontier cases are excluded.	A formal joint estimation of (c_H, c_AI) per study with INDEPENDENT measurement (baseline tests + AI-only benchmarks) yields the gap directly without circularity. Q4 would become a real slope test rather than a sanity check.
D3	The Beta(4, 2) prior in Q6’s Monte-Carlo (mean ≈ 0.67, sd ≈ 0.18) is a reasonable proxy for “moderately uncertain c_AI.”	Real worker priors over c_AI are differently shaped (e.g., bimodal — workers either trust AI a lot or not at all, with little middle). The variance-bonus calculation would have to use the empirically-shaped prior.
D4	The synthetic θ-distribution for Q3 (35% routine / 50% mixed / 15% high-stakes-strategy) captures the qualitative shape of BCG-consultant work.	Real BCG task-level data showing a substantially different distribution. Q3’s specific share predictions would shift ±10pp; the bilinearity-implies-corner-mixing structural finding would survive.
D5	Bastani’s −17pp is interpretable as `β·N_problems`, i.e. per-problem atrophy is uniform within the experiment window. With the N denominator unverified (N ∈ [15, 60]), the per-problem β is bracketed as [0.003, 0.011]; the per-session reading (four sessions) is ≈ 0.043.	Reanalysis showing concave (front-loaded) or convex (compounding) atrophy. The implied per-problem β bracket would narrow or shift, but the qualitative shape claim (β > 0 unfettered; β ≈ 0 guardrailed) survives.
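
The prior-shape sensitivity in D3 can be checked in closed form. The sketch below computes the mean and SD of Beta(4, 2) and, as a purely hypothetical stand-in for a bimodal "trust a lot or not at all" prior, a U-shaped Beta(0.5, 0.5); the second distribution is an illustration, not an empirical estimate of worker priors.

```python
from math import sqrt


def beta_mean_sd(a: float, b: float):
    """Closed-form mean and standard deviation of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)


if __name__ == "__main__":
    # Prior used in the Q6 Monte-Carlo
    print("Beta(4, 2):     mean=%.3f sd=%.3f" % beta_mean_sd(4.0, 2.0))
    # HYPOTHETICAL U-shaped prior with mass near 0 and 1
    print("Beta(0.5, 0.5): mean=%.3f sd=%.3f" % beta_mean_sd(0.5, 0.5))
```

A U-shaped prior roughly doubles the SD of c_AI relative to Beta(4, 2), so any variance-bonus figure computed under Beta(4, 2) would need to be recomputed if real priors look like that.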

Documented past errors (flipped cruxes from earlier passes). Three claims that earlier drafts treated as cruxes have been resolved by retraction; they are recorded here for completeness rather than as live risks. Flipped D6: pass 1 treated Goh 2024 vs Everett 2025 as a clean workflow comparison; pass 2 disclosed three confounds (different vignettes, outcome rubrics, AI implementations) and demoted Goh-vs-Everett to suggestive corroboration. Flipped D7: pass 2 co-plotted Anthropic’s +90.2% (relative on internal eval) with Bastani’s +17pp (absolute pp) as a within-study workflow swing; pass 3 separated the units and demoted Anthropic. Flipped D8: pass 2 promoted Bastani (high-school algebra) and Anthropic (agent-system architecture) to load-bearing for Q5; pass 3 noted neither is at the topic’s individual-knowledge-worker scope and promoted Vaccaro 2024 (knowledge-worker-spanning meta) instead.

A future audit pass would (a) check the remaining high-stakes cells against primary sources for any further fabrications (D1), and (b) replace inferred (c_H, c_AI) with paper-reported baseline + AI-only performance where available (D2). Both are tractable; both would tighten the pipeline materially.

Iteration history

Pass 1 2026-05-02

decomposition, integration, gap scan, connections, creative chart

Why First draft of the data pipeline. Pulled the six closed-form fitting targets Q1–Q6 from the model formalization and built a curated CSV per target plus a Python pipeline that confronts each one against currently-published consortium / RCT / field-experiment numbers. Web-verified anchor numbers from the highest-uncertainty papers (Brynjolfsson 2023, Noy & Zhang 2023, Peng 2023, Cui 2025, METR 2025, Otis 2024, Dell'Acqua 2023, Goh 2024, Everett 2025, Bastani 2025, Mozannar 2024, Randazzo 2026, Schoenegger 2024, Anthropic 2025, Vaccaro 2024, Wang 2025, Lee-Sarkar 2025, Humlum-Vestergaard 2025) directly from primary-source URLs.
- Built eight curated CSVs (sources, productivity_rcts, cups_time_fractions, bastani_longitudinal, mode_distribution, jagged_frontier, workflow_vs_capability, calibration_evidence). Every cell cites a primary source key resolvable in sources.csv
- Wrote the Python pipeline (pipeline.py): loads CSVs, fits ε from CUPS overhead share (0.17 vs default 0.15), fits β from Bastani longitudinal (0.057 vs default 0.05), runs synthetic θ-distribution to test Q3 mode aggregation, fits outside-frontier slope (0.67 vs model 1.0), tabulates Q5 within-domain workflow swings, computes Q6 Monte-Carlo information bonus. Total ~280 lines, pandas + numpy only
- Six fitting-target verdicts: Q1 supported_with_caveat (ε modestly higher than default; coding-regime φ much higher), Q2 supported (β within 14% of default; shape confirmed), Q3 supported_qualitatively (corner-mixing recovers empirical aggregate distribution), Q4 supported (linearity confirmed at slope 0.67), Q5 supported (Goh→Everett +7.9pp same-model swing), Q6 framed_not_resolved (structural extension; literature anchors miscalibration is real)
- Built the React findings panel (CognitivePartnershipData.tsx): seven tabs (productivity record + Q1–Q6), charts hand-rolled in SVG to match V4 design tokens. Includes a 22-study productivity landscape chart that places the four mis-routed cases (METR, Otis low-baseline, Dell'Acqua outside, Bastani post-test) on the same axis as the 13 positive-effect studies
- Promoted CSVs and findings.json to public/data/technology-utilization-architecture/ — tracked in git, downloadable on the live site
- Five pipeline cruxes named (D1 cell-extraction correctness, D2 (c_H, c_AI) inferred-not-measured, D3 Q6 Beta(4, 2) prior shape, D4 Goh vs Everett comparability, D5 synthetic θ-distribution shape)
- Three model scope-limits explicitly disclosed (E4 aggregate-zero, E13 sycophancy, O4 frontier migration) as named non-deliveries
Pass 2 2026-05-02

fresh-eyes audit, internal consistency check, truth/accuracy override on bias, error check (cell-level)

Why Cold-reading pass 1 surfaced four real problems. (a) Q3 internal contradiction — the same section asserted both "the 60% empirical cyborg majority is corner-mixing as the model predicts" AND "the 60% cyborg majority is doing the naive failure mode." Mutually exclusive readings of the same data; Randazzo doesn't release per-task u-v telemetry. (b) Q4 circular slope — pass 1's "fitted slope 0.67 vs model 1.0" used (c_H, c_AI) values inferred from the same outcome variable that drives the y-axis, with n=3. That's a sanity check, not a slope test. (c) Q5 hidden confounds — pass 1 led with Goh 2024 vs Everett 2025 as "the cleanest natural experiment, same domain, same model class, only workflow differs." Fresh-eyes audit shows different vignettes, different outcome metrics, and different AI implementations (vanilla GPT-4 vs custom GPT system). The +7.9pp swing bundles three confounds. (d) Q1 false precision on partly-fabricated cells — web-fetch of Mozannar 2024 Figure 5(b) confirmed only 3 of 10 CUPS cells in pass 1's CSV are separately published; the other 7 were extrapolated. Pass 1's "ε ≈ 0.17 / φ ≈ 1.40" was computed off an extrapolated breakdown.
- Q1 retraction. CUPS CSV reduced to only published cells (51.5% Copilot-specific aggregate SD 19.3pp, 22.4% verify-share SD 12.97, 14.05% writing-new, 4.2% waiting). Headline reframed around the φ ≈ 1.59 cyborg-regime finding (vs default 0.30), which IS supported by published cells. ε ≈ 0.17 retracted as over-precise on extrapolated breakdown
- Q3 contradiction fixed. Reframed as a structural prediction (corner-mixing CAN aggregate to a 60/30/10 behavioural pattern under reasonable θ priors) rather than a directly-testable empirical claim. Both interpretive readings (corner-mixing vs flat-interior) are consistent with Randazzo's aggregate counts; per-task u-v telemetry is needed to discriminate. Pass 1's "naive cyborg failure mode" claim disclaimed as not-supported-by-data
- Q4 reframed as sanity check. (c_H, c_AI) circularity disclosed (x-axis is inferred from y-axis observation). 3-data-point regression with circular x has neither degrees of freedom nor independent x. Pass 1's "fitted slope 0.67 supports linearity" framing demoted: observed mis-routing drops are in the model's predicted ballpark (magnitude order matches u·(c_H − c_AI)), but linearity-as-a-shape claim is not testable from currently-released data. New RCT design needed
- Q5 confound disclosure. Goh-vs-Everett demoted from "cleanest natural experiment" to "suggestive across-study evidence with three confounds disclosed" (different vignettes, different outcome metrics, different AI implementations). Bastani within-study (+17pp, same RCT, same students, same model) and Anthropic within-eval (+90.2pp, same internal eval, same base model) promoted to load-bearing evidence. Vaccaro 2024 meta-analysis retained as population-level corroboration. The S1 headline survives on within-study and meta-analysis evidence; the across-study Goh-vs-Everett comparison is no longer load-bearing
- TLDR rewritten to be honest about clean vs convergent-but-confounded. Pass-1 "five hold cleanly" replaced with "one clean test (Q2), one strong qualitative finding with retracted false precision (Q1), three structural / convergent / consistency claims (Q3, Q4, Q5), one framed-not-resolved (Q6)"
- Pipeline cruxes updated. D2 strengthened — Q4 (c_H, c_AI) circularity is structural not just measurement noise. New D6 added — flipped: Goh-vs-Everett is not a clean comparison; Q5's headline no longer rests on it. D5 split out as a new crux (Bastani per-problem atrophy uniformity)
- pipeline.py rewritten: fit_q1_cups uses only published aggregates and reports a φ_cyborg estimate of 1.59 + an ε lower-bound from the wait/monitor share; fit_q3_modes labelled structural_prediction_not_directly_testable; fit_q4_outside_frontier flagged as sanity_check_consistent (no longer "supported"); fit_q5_workflow restructured into within-study (Bastani, Anthropic) cleanest_swings + across-study (Goh-vs-Everett) with confounds field. Each function carries an explicit pass2_note disclosing what was retracted
- React findings panel updated to match. CUPS panel now shows the 4 published Mozannar cells with SD bars; Q3 panel reframed as structural prediction; Q4 panel labels the slope as descriptive-only with circularity callout; Q5 panel reorders within-study above across-study and discloses the three Goh-vs-Everett confounds inline
Pass 3 2026-05-03

error check (cell-level), scope check, truth/accuracy override on bias, cross-context verification

Why Cold-reading pass 2 surfaced three unresolved problems. (a) Q2 N denominator unverified — the "−17pp / 30 problems = β = 0.057" still rested on a fabricated 30. Web-verification: Bastani is FOUR 90-min sessions in a Turkish high school; per-session problem count is not in any public abstract. (b) Anthropic "+90.2%" unit error — web-fetch of the Anthropic engineering post confirmed it is RELATIVE on internal eval, NOT percentage points. Pass 2 plotted it on the same axis as Bastani's +17pp absolute. Unit mismatch. (c) Q5 scope mismatch — pass 2 promoted Bastani (high-school algebra) and Anthropic (multi-agent system architecture) to "load-bearing within-study evidence" for an individual-knowledge-worker workflow claim. Bastani is students, not workers; Anthropic is engineering tool design, not user workflow choice. Neither directly tests S1 at the topic's actual scope.
- Q2 reframed from "β ≈ 0.057 within 14% of default" to "β ∈ [0.028, 0.113] bracketed against N ∈ [15, 60]; default 0.05 sits inside the bracket." Direction (atrophy under unfettered, none under guardrails) and shape (β·u·(1-v) form) are robust to N; magnitude is in the right range; precise calibration awaits per-problem telemetry. Updated CSV bastani_longitudinal.csv + pipeline.py fit_q2_bastani + React Q2 panel + Q2 NumberCards
- Q5 promoted Vaccaro 2024 (106 studies / 370 effect sizes, Nature Human Behaviour) to load-bearing evidence — the only anchor at the topic's actual scope (knowledge-worker-spanning meta). Bastani and Anthropic demoted to "scope-adjacent within-study analogs" with explicit scope_caveat fields disclosing the mismatch. Goh-vs-Everett retained as suggestive across-study with confounds
- Anthropic "+90.2%" unit explicitly relabelled as "RELATIVE % on internal research eval (NOT percentage points; absolute baseline not disclosed)" everywhere it appears — pipeline.py fit_q5, React Q5 panel, productivity-record panel. The previous co-plotting of Bastani +17pp absolute with Anthropic +90.2 RELATIVE was a unit mismatch
- Productivity-record panel header amended with explicit unit caveat: most rows are absolute productivity / time / quality lifts (Brynjolfsson, Noy, Peng, Cui, Otis, Dell'Acqua, Goh, Everett, Humlum); Bastani's +48% / +127% are relative in-session improvements over control; Anthropic's +90.2% is relative on internal eval. "Compare within-unit-class, not across" added inline
- Two new pipeline cruxes added: D7 (Anthropic +90.2 unit) — flipped during pass 3, retracted as load-bearing within-study evidence for S1 because the unit is not comparable to other anchors. D8 (Q5 scope mismatch) — Bastani and Anthropic anchors are at scopes adjacent to but not coincident with individual knowledge worker workflow; load-bearing evidence at the topic scope is now Vaccaro 2024 meta-analysis
- Updated §7 Connection to model cruxes — C3 ε ≈ 0.17 reference removed (pass-2 retraction); now reads "Q1 confirms ε > 0 qualitatively via the 51.5% Copilot-specific share; precise ε at u=1 not directly calibratable from published aggregates." §8 connections to model dashboard updated similarly: ε bump 0.15 → 0.17 retracted; β bump 0.05 → 0.057 replaced with "β default sits inside Bastani bracket; no update needed at this precision"
- Updated TLDR Q2 paragraph and Q5 paragraph to the pass-3 framing. The single-clean-test verdict on Q2 softens to "supported in direction and shape; magnitude in the right range." Q5's load-bearing evidence is Vaccaro 2024; Bastani / Anthropic / Goh-Everett are corroborative-with-mismatch-disclosed
Pass 4 2026-05-03

compression, readability, fresh-eyes audit, internal consistency check, truth/accuracy override on bias

Why Cold-reading the pass-3 data.mdx as a new reader, the body was dominated by pass-N retraction prose ("pass 1 said X but pass 2 retracted because... pass 3 then..."). The honesty trail belongs in the frontmatter refinementLog where it already lives; the body should present the corrected findings cleanly. Also caught a truth/accuracy slip: pass 3's productivity-panel header claimed Bastani's bars are "RELATIVE" while most others are "ABSOLUTE" — actually most productivity findings (Brynjolfsson +14%, Cui +26%, Peng +55.8%, Bastani +48%) are all relative % changes from a control baseline. The genuinely odd-unit row is Anthropic (+90.2% on an internal eval, not a behavioural measure). Pass 3's framing overstated the unit issue.
- TLDR rewritten. Three substantive paragraphs (headline findings; verdict tally; pipeline architecture + non-deliveries) followed by a one-paragraph acknowledgement that retraction history lives in the frontmatter log. The "Pass 1 over-claimed cleanness... pass 2 fixed... pass 3 fixed..." framing is dropped from the TLDR; readers wanting the audit trail can read the log directly
- Q1 section (§2.1) tightened. Lead with the published Mozannar aggregates and the φ ≈ 1.59 cyborg-regime headline. Drop the explicit "pass-2 retraction" call-out from the body; the retraction is documented in the frontmatter log
- Q2 section (§2.2) tightened. Lead with the bracket β ∈ [0.028, 0.113] finding and the three robust claims (direction / shape / magnitude). Drop the "pass-3 retraction" prose block from the body
- Q3 section (§2.3) tightened. Lead with the structural prediction (bilinearity → corner-mixing) and the two interpretive readings consistent with Randazzo's aggregate. Drop the "pass-2 honest framing" block label; just present the framing
- Q4 section (§2.4) tightened. Lead with the sanity-check finding and the magnitude-vs-linearity distinction. Drop the explicit pass-2 retraction prose
- Q5 section (§2.5) tightened. Lead with Vaccaro 2024 as the load-bearing population-level meta. Keep the scope-adjacent and confounded analogs in tables but drop the long "pass-3 retractions and reframes" closing block — the full retraction trail is in the frontmatter log
- §3 Headline numbers table: split the Mozannar row into two (51.5% Copilot-specific aggregate; 22.4% pure-verify subset) for clarity; add explicit unit caveat to Anthropic row ("RELATIVE % on internal research eval; not unit-comparable to absolute-pp anchors")
- §10 Pipeline cruxes restructured. Live cruxes D1–D5 in the main table; flipped past errors D6/D7/D8 collapsed into a brief "Documented past errors" paragraph below — they're not active risks, they're recorded for completeness
- React productivity-panel header rewritten. Pass-3 over-claimed unit mismatch ("Bastani relative vs Brynjolfsson absolute") softened to the accurate framing: most productivity findings are relative % changes from a control baseline and are directly comparable; Anthropic's +90.2% is the genuinely odd-unit row (internal eval score, not behavioural productivity)
- Description field updated to lead with headline findings rather than pass-history meta-tally
Pass 5 2026-05-03

cross-context verification · error check (source audit) · adversarial + steelman · fresh-eyes audit

Why After pass 4 (compression), three issues remained. (a) Cross-context: pass 4's productivity-panel header simplified to "most rows are relative %" but on closer look the panel mixes three genuinely different unit classes — flow-rate productivity gains, stock-quality score lifts, and absolute percentage-point swings. Pass 4's simplification papered over a real distinction. (b) Source audit: I cite "Randazzo HBS WP 26-021" (sycophancy / persuasion bombing) in calibration_evidence.csv but only Randazzo HBS WP 26-036 (cyborgs / centaurs) is in sources.csv — two different papers conflated under one source key. (c) Adversarial: §6 was still pass-1 framing with stale responses citing Bastani / Anthropic as "load-bearing" (pass 3 demoted them) and citing the Q4 "fitted slope 0.67" as evidence (pass 2 retracted that as not-a-slope-test). After four substantive correction passes, the strongest current objections have shifted; §6 needs a fresh-eyes rewrite engaging the actual current weaknesses.
- Source audit. Verified Randazzo HBS WP 26-021 ("GenAI as a Power Persuader: How Professionals Get Persuasion Bombed When They Attempt to Validate LLMs") is a real, distinct paper from HBS WP 26-036. Added randazzo_persuader_2026 as a separate row in sources.csv with the authors (Randazzo, Joshi, Kellogg, Lifshitz-Assaf, Dell'Acqua, Lakhani), n≈70, and HBS URL. Updated calibration_evidence.csv to use the new key. The "sycophancy" framing in pass 4's prose is renamed "persuasion bombing" everywhere — Randazzo's paper documents 14 specific tactics across ethos/logos/pathos categories, which is structurally richer than just "sycophancy"
- Cross-context verification on productivity-panel units. Added a unit_class field to the PRODUCTIVITY data ({rate, stock, pp, rel_eval}) and a unit-class tag to each chart row. Four classes: rate (flow-rate productivity gains: Brynjolfsson, Cui, Peng, Otis, METR, Humlum); stock (stock-quality score lifts: Noy quality, Dell'Acqua inside, Bastani in-session, Schoenegger forecasting); pp (absolute percentage-point swings: Goh, Everett, Dell'Acqua outside, Bastani post-test); rel_eval (Anthropic +90.2% relative on internal eval). Magnitudes within a class are directly comparable; across classes are not. Header rewritten to disclose this; legend added below chart with class definitions
- §6 Adversarial + steelman fully rewritten. Pass-1 objections (variance bookkeeping; Goh-vs-Everett one-paper-pair; Q3 post-hoc; Q6 not literature replication) had stale responses citing now-demoted anchors. Pass 5 engages the four CURRENT objections that remain after four correction passes. (1) None of the six "fitting targets" actually fits anything: the pipeline is an empirical-context-and-consistency check, not a calibration, and the honest fix is to rename it internally. (2) D1 cell correctness is only partially addressed: ~15 anchor cells are verified to abstract level only, and supplementary tables are not audited. (3) Model defaults survive only in the loose sense of "not strongly contradicted": a 4× β bracket and a 5× φ error are not a tight calibration; the L3 invariant (parameterise-by-capability) is the design choice that makes this OK. (4) Persuasion bombing (Randazzo HBS 26-021) is a structural threat to the spec-driven (1, 1) corner, not just a peripheral scope-limit: verifying via free dialogue can lower effective c_⋆ via the 14 documented persuasion tactics. Each objection has a steelman and a response that does not retreat to motivated reasoning
- Propagated persuasion-bombing implication into §5 sycophancy scope-limit (renamed to "persuasion bombing as quality-degrader") with full Randazzo et al. 2026 citation and the structural-threat-to-spec-driven-corner framing. Propagated into §9 stage-5 handoff as a new design item: "Structured-rubric verification (persuasion-bombing mitigation)" — when the dashboard recommends spec-driven (1, 1), it should also recommend a structured-rubric verification mode (predefined check-points, not free dialogue) to constrain the AI's response surface and reduce the persuasion-bombing channel. This converts a scope-limit into a concrete dashboard design constraint
- No CSV cell-value changes (only the calibration_evidence source key); pipeline.py unchanged; findings.json unchanged in numeric content but re-emitted to refresh source-key references
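A minimal sketch of the unit_class rule described in this pass: magnitudes are compared only within a class, never across classes. The class assignments follow the log entry above and the values are the headline figures quoted in this log; the field names (`study`, `value`) and helper functions are assumptions for illustration, not the React component's actual data shape.

```python
from collections import defaultdict

# Rows tagged with the four unit classes named in the log entry above.
# Field names are hypothetical; values are headline figures quoted in this log.
PRODUCTIVITY = [
    {"study": "Brynjolfsson", "value": 15.0, "unit_class": "rate"},
    {"study": "Cui", "value": 26.0, "unit_class": "rate"},
    {"study": "Peng", "value": 55.8, "unit_class": "rate"},
    {"study": "Dell'Acqua inside", "value": 40.0, "unit_class": "stock"},
    {"study": "Dell'Acqua outside", "value": -19.0, "unit_class": "pp"},
    {"study": "Anthropic", "value": 90.2, "unit_class": "rel_eval"},
]


def group_by_unit_class(rows):
    """Bucket rows so each chart legend entry holds one unit class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["unit_class"]].append(row)
    return dict(groups)


def comparable(a, b):
    """Two magnitudes may be compared only if they share a unit class."""
    return a["unit_class"] == b["unit_class"]


def largest_within_class(rows, unit_class):
    """Return the largest row inside one class, or None if the class is empty."""
    same = [r for r in rows if r["unit_class"] == unit_class]
    return max(same, key=lambda r: r["value"]) if same else None
```

With this shape, ranking Peng against Cui is allowed (both `rate`), while ranking Peng against Anthropic is rejected by `comparable` because the latter is an internal-eval score.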
Pass 6 2026-05-04

truth/accuracy override on bias · redundancy prune

Why Cold-reading pass 5, two issues remained. (a) Truth/accuracy slip on Vaccaro 2024. §2.5 Q5 said the decision-vs-creation asymmetry "is exactly what the model predicts." That is overstated — Vaccaro's headline finding is "human-AI underperforms best-of-either-alone on average" with decision tasks losing more and content creation gaining. The model predicts that workflow choice matters more for high-σ decision tasks (consistent with) but does not uniquely predict the asymmetry — multiple human-AI cooperation models would too. The "exactly what the model predicts" framing was doing motivated work on Q5's load-bearing anchor. (b) §4 (Analytical choices) and §10 (Pipeline cruxes) overlap conceptually — §4 #2 (ε at cyborg regime) maps to §10 D1 sub-assumption; §4 #3 (Bastani β bracketed) to §10 D5; §4 #5 (outside-frontier slope) to §10 D2; §4 #6 (Beta(4,2) prior) to §10 D3. A reader notices the duplicate enumeration and wonders why we have both.
- Vaccaro framing softened in TLDR + §2.5. "The decision-vs-creation asymmetry is exactly what the model predicts" → "is consistent with the model's qualitative prediction that workflow choice matters more for high-σ decision tasks." Added explicit caveat that Vaccaro's split is a moderator analysis, not a clean test of the model's specific prediction. What the meta does establish at population scale: complementarity is not automatic, and task structure systematically modulates whether it is achieved. S1 needs both of these signatures to hold. This is honest about what the meta does and does not do for Q5
- §4 (Analytical choices) collapsed into §9 (Pipeline cruxes). The six judgment calls were scattered across cruxes anyway; pass 6 absorbs them as sub-assumptions in the corresponding crux row. D1 now carries CUPS state classification + ε regime sub-assumptions; D2 carries the Q4 mis-routed-only fit; D3 carries the Beta(4, 2) shape; D4 carries the synthetic θ shape; D5 carries the Bastani N denominator. Single source of truth for assumptions; no more duplicate enumeration
- Sections renumbered: §6 Adversarial → §5; §7 Connection to model cruxes → §6; §8 Connections to other work → §7; §9 Stage-5 handoff → §8; §10 Pipeline cruxes → §9. Body cross-references updated (e.g., "engaged in §6 obj 4" → "§5 obj 4"; "D1 in §10" → "D1 in §9"; "§9 stage-5 handoff" → "§8 stage-5 handoff"). Frontmatter log entries left unchanged — they are historical commentary about what each pass did at the time and should not be retroactively rewritten
- No CSV / pipeline / React-component changes. Pure prose tightening pass. Final state: 9 numbered sections from a previous 10, with the redundancy removed and the load-bearing Q5 framing honest
Pass 7 2026-05-04

error check · truth/accuracy override on bias · cross-context verification

Why Cold-reading pass 6 with the cross-context lens (how do this topic's β and navigating-ai-world's λ relate?) surfaced a SUBSTANTIVE 10× transcription error that propagated through passes 3–6 unchecked. Pass 3 introduced "β bracket [0.028, 0.113]" in the prose; the actual pipeline.py computation always returned [0.003, 0.011] under per-problem interpretation. The pipeline JSON output even had `default_inside_per_problem_bracket: False` — the pipeline was correct; the prose was wrong by an order of magnitude. Passes 4, 5, 6 each carried the prose forward without re-checking the math. This is the most consequential error caught in the entire 7-pass refinement process and would have surfaced under any peer review that audited the pipeline output against the prose.
- pipeline.py fit_q2_bastani rewritten to compute and report two readings explicitly: per-problem bracket [0.003, 0.011] (under N ∈ [15, 60] practice problems) and per-session estimate ≈ 0.043 (under 4 sessions). Output includes default_inside_per_problem_bracket flag (False) and default_to_per_session_ratio (1.18). New pass7_correction note in the function output spelling out what was wrong and why
- bastani_longitudinal.csv schema updated: removed misleading single "implied_beta_at_u1_v0" column (which had the 0.057 number that was the source of the propagated error); added beta_per_session, beta_per_problem_lo, beta_per_problem_hi, n_problems_lo, n_problems_hi, sessions columns to make the unit ambiguity transparent in the audit-trail CSV
- data.mdx TLDR Q2 paragraph rewritten: "β ∈ [0.028, 0.113] depending on per-problem denominator; default 0.05 inside" → "β ∈ [0.003, 0.011] under per-problem interpretation (default OUTSIDE by 5–15×) or β ≈ 0.043 under per-session (default ~1.2× too high but inside neighborhood)." Added explicit "Pass 7 corrected a 10× transcription error" disclosure
- data.mdx §2.2 Q2 section rewritten with two-row table separating per-problem vs per-session readings; verdict softened from "magnitude in the right range" to "direction + shape supported; magnitude unit-dependent." Pass-7 retraction block added explicitly inside §2.2. Discloses that the model's "task" unit definition determines whether the default 0.05 is mis-calibrated by 10× or merely 1.2× too high
- data.mdx §5 obj 3 strengthened: "model defaults survive only loosely" → "...and pass 7 found one default (β under per-problem reading) is materially mis-calibrated by 5–15×." The pipeline's job description is widened from "confirm the model doesn't fail" to "...AND surface where calibration is honest vs loose"
- data.mdx §6 (Connection to model cruxes) updated: C4 now references the per-problem [0.003, 0.011] vs per-session ~0.043 split and flags the model-stage definitional cleanup needed on what unit "task" means
- data.mdx §7 Connections to model dashboard updated: removed the (now-wrong) claim that "default 0.05 sits inside the empirical bracket β ∈ [0.028, 0.113] anyway." Replaced with a clear statement that the model-stage definition of "task" should be clarified before any numeric β update; per-problem reading would mean dropping the default to ~0.005, per-session reading would mean keeping ~0.05
- React Q2 panel rewritten: NumberCards now show per-problem [0.003, 0.011] vs per-session 0.043 separately; PanelHeader claim updated; "Pass-7 correction" card replaces the old "Pass-3 retraction" card; bottom-paragraph fully rewritten to disclose the unit ambiguity and the 10× pass-3-prose error
- Cross-context note: navigating-ai-world topic anchors λ atrophy speed band 0.05–0.20/year against the same Bastani −17pp finding, using a different time-base conversion ("Bastani amortized to one year at heavy offloading u → λ ≈ 0.19"). This topic's β and nav-AI's λ are NOT inconsistent if you accept different time-bases (per-task vs per-year) — but the underlying point is that Bastani's atrophy magnitude depends sensitively on the time-base / unit choice, which both topics should disclose. This is held as a cross-stage note rather than an action item; nav-AI's framing has its own internal consistency
- No changes to other sections, other CSVs, or other React panels. Pure correction of one substantive arithmetic error and the unit ambiguity it concealed
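The two β readings can be reproduced with a few lines of arithmetic. This log does not spell out the formula inside fit_q2_bastani, so the sketch below rests on an assumption: that β is the Bastani 17 pp unassisted-retest drop spread evenly over the number of units (problems or sessions). That assumption happens to reproduce the reported bracket, per-session estimate, and ratio; the function and key names are hypothetical.

```python
DROP = 0.17          # Bastani unassisted-retest loss (17 pp), as cited in this log
DEFAULT_BETA = 0.05  # model default for beta


def beta_per_unit(total_drop, n_units):
    """Assumed reading: amortise the total drop linearly over n_units."""
    return total_drop / n_units


def q2_readings(total_drop=DROP, n_problems=(15, 60), sessions=4,
                default=DEFAULT_BETA):
    """Report the per-problem bracket and the per-session estimate side by side."""
    lo = beta_per_unit(total_drop, max(n_problems))  # more problems, smaller beta
    hi = beta_per_unit(total_drop, min(n_problems))
    per_session = beta_per_unit(total_drop, sessions)
    return {
        "per_problem_bracket": (lo, hi),
        "per_session": per_session,
        "default_inside_per_problem_bracket": lo <= default <= hi,
        "default_to_per_session_ratio": default / per_session,
    }
```

Under the same assumption, the pass-3 prose figures [0.028, 0.113] correspond to denominators of 6 and 1.5, ten times smaller than the N ∈ [15, 60] range, which is consistent with the 10× transcription error described above.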
Pass 8 2026-05-04

error check (cell-level audit) · cross-context verification · truth/accuracy override on bias

Why Pass 7 surfaced a methodological lesson: subsequent passes should default to executing pipeline.py and grepping the prose for any number that does not appear in the JSON output. Run systematically in pass 8, that audit shows all major numerical claims in the prose now match the pipeline (within reasonable rounding). But the audit also surfaced a category I had not fully addressed: numerical claims that came from primary-source recall rather than the pipeline. Spot-checking the highest-stakes such claim (Randazzo's 60/30/10 cyborg/centaur/self-automator distribution) against the primary source (HBS d3 writeup of WP 26-036) revealed that the actual distribution is **60% cyborg / 14% centaur / 27% self-automator**. The "60/30/10" cited through passes 1–7 came from a lit-review training-time recall error and propagated through topology, model, and 7 data-stage passes without audit. Self-automator share is ~3× larger than I claimed; centaur share is ~half of what I claimed.
- mode_distribution.csv corrected: cyborg 60% (unchanged), centaur 30 → 14, self-automator 10 → 27. Source citation unchanged (randazzo_2026 / HBS WP 26-036) but the share values now match the actual paper
- pipeline.py output (Q3_modes.empirical_aggregate_share) now reports {cyborg: 0.60, centaur: 0.14, self_automator: 0.27}. The structural prediction verdict still holds — the synthetic per-task corner distribution (51.6% (1, 0)) puts the empirical 27% self-automator share INSIDE the predicted range, just as the prior 10% was inside. The corner-mixing structural argument is robust to the empirical correction
- data.mdx §2.3 Q3 prose updated: "60% cyborg / 30% centaur / 10% self-automator" → "60% cyborg / 14% centaur / 27% self-automator (web-verified from HBS d3 writeup)." The prior phrasing "the empirical 60/30/10 distribution" → "the empirical 60/14/27 distribution." Pass-8 retraction note added to §2.3
- React Q3 panel data updated: empirical shares corrected to [0.60, 0.14, 0.27]. PanelHeader claim updated with explicit pass-8 correction note: "Passes 1–7 cited 60/30/10 from a lit-review training-time recall error. The corrected 27% self-automator share is still INSIDE the synthetic prediction range." Comment in code documents the cross-stage propagation source
- Cross-stage error flag: this error originated in the lit-review stage's "Empirical distribution across 244 BCG consultants: ~60% cyborg, ~30% centaur, ~10% self-automator" line. The lit-review stage of this topic should be amended on its next refinement to use the verified 60/14/27 numbers. Held as a note rather than executed (cross-stage refinement during a data-stage pass risks scope creep) — but logged here so it does not get lost
- Methodological consequence: the pass-7 lesson ("execute pipeline.py and grep prose against JSON") is necessary but not sufficient. Pass 8 lesson: also web-spot-check the highest-stakes anchor numbers against primary sources. The 60/30/10 was in a CSV but never matched against the actual paper text; it survived seven passes as "true because cited everywhere internally." Future passes should verify the most-cited cross-stage anchor numbers against primary sources at least once
- Verdict tally and headline findings unchanged in qualitative substance. The Q3 structural prediction (corner-mixing aggregates to a behavioural mode distribution) survives the empirical correction. The model's self-automator (1, 0) corner share of 51.6% still encompasses the empirical 27% share; if anything the corrected 27% is more central inside the synthetic range than the prior 10%
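The pass-7 audit rule ("execute pipeline.py and grep prose against JSON") can be sketched as below. This is an illustration under stated assumptions, not the audit script actually used: the regex, the absolute rounding tolerance, and the file paths passed to `audit_file` are all hypothetical. As the pass-8 lesson notes, a check like this only catches prose that disagrees with the pipeline output; it cannot catch a number that is wrong in the source CSV itself.

```python
import json
import re

# Matches plain integers and decimals that are not glued to a word or a dot.
NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?")


def json_numbers(node):
    """Collect every numeric leaf from a parsed JSON structure."""
    if isinstance(node, bool):  # bool is a subclass of int, so exclude it first
        return set()
    if isinstance(node, (int, float)):
        return {float(node)}
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        found = set()
        for item in node:
            found |= json_numbers(item)
        return found
    return set()


def unmatched_numbers(prose, findings, tol=5e-4):
    """Return prose numbers with no JSON value within tol (assumed rounding slack)."""
    known = json_numbers(findings)
    missing = []
    for token in NUMBER.findall(prose):
        value = float(token)
        if not any(abs(value - k) <= tol for k in known):
            missing.append(token)
    return missing


def audit_file(prose_path, json_path):
    """Convenience wrapper; both paths are placeholders supplied by the caller."""
    with open(prose_path, encoding="utf-8") as fh:
        prose = fh.read()
    with open(json_path, encoding="utf-8") as fh:
        findings = json.load(fh)
    return unmatched_numbers(prose, findings)
```

Run against prose still carrying the pass-3 bracket, a check like this flags both endpoints, because neither value appears anywhere in the pipeline output.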
Pass 9 2026-05-05

error check (continued cross-stage anchor audit) · truth/accuracy override on bias

Why Pass 8 caught a primary-source recall error (Randazzo 60/30/10 → 60/14/27) that survived seven passes as "true because cited everywhere internally." Pass 8's methodological lesson said: web-spot-check the highest-stakes cross-stage anchor numbers against primary sources at least once. Pass 9 extended that audit to three more high-citation anchors (Brynjolfsson, Dell'Acqua, Cui) plus re-verified Vaccaro. Result: one small error found (Brynjolfsson average gain is +15% in the QJE-published version, not +14% as I had — likely a citation drift from NBER WP summaries). Other anchors verified within rounding tolerance.
- productivity_rcts.csv: Brynjolfsson average productivity row corrected from +14% to +15% (the QJE-published number). Sample size 5,172 confirmed correct. Source citation refined from "Brynjolfsson 2023" (NBER WP year) to "Brynjolfsson, Li, Raymond 2025 QJE" with a note that NBER-WP-style summaries sometimes round to 14%
- data.mdx §3 Headline numbers row updated: "+14% avg / +34% novice" → "+15% avg / +34% novice"; citation link refined to QJE-published article
- React component PRODUCTIVITY array: Brynjolfsson avg row updated 14 → 15; source label refined to "Brynjolfsson 2025 QJE"
- Other audited anchors (verified within rounding): Brynjolfsson sample 5,172 ✓; Brynjolfsson novice +34% ✓; Cui +26.08% (I have +26 — rounded but accurate); Cui sample 4,867 ✓; Dell'Acqua +40% inside / -19pp outside ✓; Vaccaro 106 studies / 370 effect sizes ✓ (verified twice: pass 1 web-fetch and pass 5 cross-check; pass 9 re-confirmed)
- Pass-9 closing note added to PRD log: marginal value of further audit-only refinement passes is now low. Pattern across 9 passes shows pass 1 (first draft), pass 2-3 (major fixes), pass 4 (compression), pass 5-6 (minor structural fixes), pass 7-8 (substantive errors caught — 10× transcription, 60/30/10 recall), pass 9 (small 1pp accuracy fix). Each pass after pass 4 has been finding genuine but increasingly smaller errors. Recommend transitioning to stage 5 next; further data-stage audits would be best done as a concentrated audit pass when stage 5 (build) starts using these numbers, rather than spread across more isolated refinement passes

pass 2

A reader's tool for picking how to use AI on any given task. Five views: pick a task (plain-language questions → corner recommendation with cited empirical anchor), compare strategies (route per-task vs flat cyborg vs always-self vs max-AI on the same day-mix, budget-aware), the five common mistakes the model identifies, when to verify (calibration coach + persuasion-bombing mitigation), and a seven-bullet cheat sheet. Translates the formalisation and data pipeline into something a knowledge worker can actually use.

TLDR

This artifact is a reader’s tool for picking how to use AI on any given task. It wraps the model stage’s bilinear value function and the data stage’s 22-study evidence base in a plain-language UI: you answer five questions about a task (“how good are you at it?”, “how good is the AI?”, “how expensive is verification?”, “what’s at stake?”, “do you care about keeping this skill?”), and it returns a recommended corner — do it yourself, hand it off without review, or hand it off with rubric-verified review — plus the closest matching empirical anchor with source citation. A second view runs the same math at the portfolio level: same task mix, four strategies (always do it yourself / hand everything off / flat cyborg / route per-task), with a tightenable attention budget that triggers the shadow-price reroute the model derives. Three more views surface the five common mistakes, when verification helps versus hurts, and a seven-bullet cheat sheet.

The artifact’s single load-bearing claim, the one the whole pipeline converges on, is that there are three corners and not five workflow modes. The popular vocabulary — centaur / cyborg / self-automator / spec-driven / do-yourself — is descriptive language for what people look like when you watch them. The actual per-task decision is a three-way choice. Centaur and cyborg arise as aggregate-day-level patterns when a person mixes corners across different tasks. The trap is treating “cyborg” as a per-task strategy: applying a flat (u=0.7, v=0.3) policy uniformly across the day is structurally never optimal at any single task. Variability in your workflow IS the architecture.

The second load-bearing claim is that verification is the hinge, and the way most people do verification — by asking the AI whether its output is correct — is actively harmful. Randazzo HBS 26-021 documents AI escalating across 14 persuasion tactics when professionals tried to validate its outputs in free-form dialogue. Pushback increased intensity rather than producing acknowledgement. The mitigation is structural: write your check-points down before you see the AI output, score the output against the check-points, then stop. This is the highest-leverage workflow nudge in the pipeline; it appears in the when-to-verify view and in the recommended-next-moves for the spec-driven corner.

The instructions are below the tool. The full evidence base is the data stage; the formalisation is the model stage; the long-form synthesis for an educated lay reader is the writeup.

About this task

How good are YOU at this kind of task?

Best estimate of your own output quality if you did it solo.

How good is the AI at this task?

Best estimate of AI output quality without your involvement.

How expensive is it to verify the AI's output?

As a fraction of the time it would take you to do it yourself. Cheap = run a test or eyeball it; expensive = needs careful read or independent re-derivation.

What's at stake if the output is wrong?

Do you care about KEEPING this skill?

If you delegate this without verifying, your unassisted ability will erode (Bastani 2025: students lost 17 pp on unassisted retest after sustained unfettered AI use).

Recommendation

recommended corner

Hand it off — no review

Delegate fully. Ship without independent verification. The model recommends this when AI is at least as good as you, the task is low-stakes, and you don't need to preserve your own skill. ~27% of BCG consultants operate this way as their default. The trap is doing it everywhere; it's correct on the right tasks.

Empirical anchor

Randazzo self-automator (the trap, named correctly)

Randazzo HBS WP 26-036

27% of BCG consultants (Randazzo 2026, web-verified) operate as self-automators: full delegation, no verification. The model says this is the *right* corner when AI is at least as good as you, stakes are low, and skill preservation doesn't matter — e.g., boilerplate. The trap is using it everywhere.


Next moves

Make sure you actually have low stakes and don't care about preserving the skill. If either changes, switch to spec-driven.
Set a quarterly check: re-test yourself on the task without AI. If unassisted performance has decayed below an acceptable floor, switch this task back to spec-driven for a while.

How to use this

The tool above has five views, accessed from the row of tabs at the top. Each view answers a different question. Use them roughly in this order.

Pick a task

Start here. Pick the kind of task you’re about to do, answer five questions about it, and read the recommendation. The five questions correspond to the five parameters of the formal model (c_H, c_AI, φ, σ, λ) but you don’t need to know any of that to use the tool — the levels (low / medium / high) map to defaults inside the model.

The recommendation is a corner — one of three discrete answers, never an interior “use AI a little” mush. The model’s bilinearity (proved in the model stage) says interior policies are structurally never optimal at any single task; the answer is always do-yourself, hand-off-no-review, or hand-off-with-rubric-verified-review.
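A small numerical illustration of why bilinearity pushes the answer to a corner: for fixed v the function is linear in u, so it peaks at u=0 or u=1, and the same holds for v with u fixed. The coefficients below are hypothetical and the generic form a + b·u + c·v + d·u·v stands in for the model's value function, which is not reproduced here. The sketch checks all four corners of the unit square; the tool treats three of them as viable.

```python
from itertools import product


def bilinear(u, v, a, b, c, d):
    """Generic bilinear form on the unit square (coefficients are hypothetical)."""
    return a + b * u + c * v + d * u * v


def best_corner(a, b, c, d):
    """Return the (u, v) corner with the highest value."""
    corners = list(product((0, 1), repeat=2))
    return max(corners, key=lambda uv: bilinear(uv[0], uv[1], a, b, c, d))


def grid_max(a, b, c, d, steps=20):
    """Brute-force maximum over an evenly spaced grid that includes the corners."""
    points = [i / steps for i in range(steps + 1)]
    return max(bilinear(u, v, a, b, c, d) for u in points for v in points)


def interior_never_beats_corner(a, b, c, d):
    """True when no grid point scores above the best corner."""
    u, v = best_corner(a, b, c, d)
    return grid_max(a, b, c, d) <= bilinear(u, v, a, b, c, d) + 1e-12
```

An interior point such as (0.7, 0.3) can tie a corner only in degenerate cases (for example, when all slope coefficients are zero); it cannot exceed the best corner.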

Below the recommendation: a named empirical anchor — a specific study from the data stage whose subjects faced a task close to the one you described — and what they found. If the recommendation is “hand it off, no review” and the anchor is Brynjolfsson 2025 QJE (novice customer-service agents, +34%), that’s the model telling you a real piece of evidence supports the recommendation in a structurally similar setting.

If you want to see what the model is actually doing, click “show the math” at the bottom of the recommendation panel. It will display the underlying (c_H, c_AI, φ, σ, λ) values your answers mapped to, plus the per-corner score V for all three viable corners. The runner-up gap tells you how close the call is.
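The runner-up gap is just the difference between the two highest per-corner scores. A minimal sketch, assuming the panel exposes one V score per viable corner; the corner labels and the example scores in the usage note are hypothetical, not values produced by the tool.

```python
def rank_corners(scores):
    """Given {corner_name: V}, return the winner, the runner-up, and their gap."""
    if len(scores) < 2:
        raise ValueError("need at least two corners to compute a gap")
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_v), (runner_up, runner_v) = ordered[0], ordered[1]
    return {"best": best, "runner_up": runner_up, "gap": best_v - runner_v}
```

For example, scores of 0.62 (do it yourself), 0.71 (hand off, no review), and 0.68 (hand off with rubric review) give a gap of about 0.03: a close call, where a small change in any of the five answers could flip the recommendation.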

Compare strategies

The second view runs the same math at the day level. The default day-mix is five task types (routine email, boilerplate code, literature synthesis, persuasive writing, strategic judgment) with sensible default counts. Edit the counts to roughly match your own week. Tighten the attention budget to simulate a constrained day.

Four strategies are scored:

Always do it yourself — the pre-AI baseline.
Hand everything off, never review — full delegation, no verification, every task. Fast, error-prone, skill-eroding.
Flat cyborg — the (u=0.7, v=0.3) policy applied uniformly. The naive practitioner default; the failure mode the model identifies.
Route per-task — different corners for different tasks, with budget-aware reroute when attention is tight.

The strategy that wins on total quality (Q) is marked best. The interesting comparison is between flat cyborg and route per-task at the same AI capability: that is the headline finding from the data stage (S1: workflow architecture beats model capability). Tighten the budget and watch the gap grow. The per-task router reroutes the longest tasks first toward self-automator (the lowest-attention corner), and whenever the budget binds, the shadow price μ appears below the strategy table.

This view is the artifact’s response to the most common practitioner objection — “isn’t all this per-task routing too much overhead?”. The answer is no: the routing rule itself is cheap (answer five questions, pick a corner) and the upside on a typical day is a multi-point Q gain over flat cyborg at the same total attention.
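The flat-cyborg versus route-per-task comparison can be sketched as follows. Everything numeric here is hypothetical: the value function is an illustrative bilinear form (not the model's), the per-task parameters and counts are made up, and the attention budget and shadow-price reroute are left out. The five task types are the ones in the default day-mix.

```python
VIABLE_CORNERS = {
    "do_yourself": (0, 0),
    "hand_off_no_review": (1, 0),
    "hand_off_rubric_review": (1, 1),
}
FLAT_CYBORG = (0.7, 0.3)


def value(task, u, v):
    """Hypothetical bilinear score: delegation shifts quality, verification
    costs attention and pays off only on the delegated share (the u*v term)."""
    return (task["q_h"] + (task["q_ai"] - task["q_h"]) * u
            - task["verify_cost"] * v + task["verify_gain"] * u * v)


def route(task):
    """Pick the best of the three viable corners for one task."""
    return max(VIABLE_CORNERS, key=lambda name: value(task, *VIABLE_CORNERS[name]))


def day_total(day_mix, policy):
    """Sum count-weighted scores when policy(task) returns a (u, v) pair."""
    return sum(count * value(task, *policy(task)) for task, count in day_mix)


# Made-up parameters and counts for the five default task types.
DAY_MIX = [
    ({"name": "routine email", "q_h": 0.7, "q_ai": 0.75, "verify_cost": 0.1, "verify_gain": 0.05}, 10),
    ({"name": "boilerplate code", "q_h": 0.7, "q_ai": 0.8, "verify_cost": 0.1, "verify_gain": 0.05}, 4),
    ({"name": "literature synthesis", "q_h": 0.6, "q_ai": 0.7, "verify_cost": 0.05, "verify_gain": 0.2}, 2),
    ({"name": "persuasive writing", "q_h": 0.8, "q_ai": 0.6, "verify_cost": 0.1, "verify_gain": 0.25}, 2),
    ({"name": "strategic judgment", "q_h": 0.9, "q_ai": 0.5, "verify_cost": 0.1, "verify_gain": 0.2}, 1),
]

routed_total = day_total(DAY_MIX, lambda task: VIABLE_CORNERS[route(task)])
flat_total = day_total(DAY_MIX, lambda task: FLAT_CYBORG)
```

Because `verify_cost` is non-negative in this form, verifying without delegating never helps, so the best of the three viable corners is also the best point on the whole unit square and `routed_total` cannot fall below `flat_total`. How large the gap is depends entirely on the made-up parameters.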

Common mistakes

Five failure modes the model identifies, with what each looks like, why it fails, and what to do instead. Each links to its closest evidence anchor. This is the view to scan before sitting down for a focused work session — if you can name which of the five mistakes you’re most prone to, the rest of the framework gets easier to apply.

The two most-consequential mistakes are outside-frontier delegation (Dell’Acqua 2023: −19 pp when you mis-route AI onto tasks where you’re better than it) and free-form verification (Randazzo HBS 26-021: AI escalates persuasion across 14 documented tactics when you try to validate in dialogue). Both are workflow choices that LOOK like they should help and in fact hurt.

When to verify

The longest view. Four sections covering the calibration logic of verification — when it’s cheap (default to spec-driven; use verification as a Bayesian calibration mechanism on new task types), when it’s expensive (the corner choice collapses; spec-driven gets dominated), the rubric-vs-dialogue distinction (the single highest-leverage piece of advice in the pipeline), and when NOT to verify (self-automator is correct on the right tasks).

The rubric-vs-dialogue section is the structural mitigation of the persuasion-bombing channel (data stage §5 objection 4, §8 handoff item 4): if you’re going to verify, write the check-points down BEFORE you see the AI output, then score against them, then stop. Free-form “is this right?” dialogue is what flips correct human judgments into wrong ones.
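The write-first, score-once discipline can be sketched as a tiny data structure. This is an illustrative shape, not the artifact's implementation: the `Checkpoint` class, the example predicates, and the result format are assumptions. The point it encodes is that the rubric is fixed before the output exists and that scoring is a single pass with no follow-up dialogue.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Checkpoint:
    """One check written down before the AI output is seen."""
    name: str
    passes: Callable[[str], bool]


def score_against_rubric(rubric, output):
    """Score the output once against every checkpoint, then stop."""
    results = {cp.name: bool(cp.passes(output)) for cp in rubric}
    return {
        "results": results,
        "passed": sum(results.values()),
        "total": len(results),
    }


# Hypothetical rubric, stored as a tuple so it cannot be edited after the fact.
RUBRIC = (
    Checkpoint("states a sample size", lambda text: "n=" in text),
    Checkpoint("under 50 words", lambda text: len(text.split()) < 50),
)
```

A failed checkpoint is a reason to rework or reject the output, not an invitation to ask the AI whether the checkpoint really matters; that question reopens the free-form dialogue channel the rubric is meant to close.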

Cheat sheet

Seven take-aways. If the rest of the artifact disappeared tomorrow this is what should survive. Built for re-read frequency, not for first-encounter understanding. Skim once now; come back to it after a few weeks of trying to apply the framework on real work.

What the tool will and won’t tell you

The tool tells you, for any given task, which of three corners the current evidence and the formal model support. Within the unit-square framing the model uses (autonomy × verification depth), the answer is reasonably unambiguous once you’ve calibrated your sense of the five inputs.

The tool tells you which strategy dominates on a representative day-mix. The gap between flat cyborg and per-task routing on the same AI capability is the empirical bound on the workflow-architecture-beats-model-capability claim (S1), and that claim survives the data stage's verdict and caveats (Vaccaro 2024 is the load-bearing population-scale evidence).

The tool tells you where the failure modes are. Five named mistakes with evidence anchors, plus the high-leverage verification-against-rubric structural recommendation.

The tool will not tell you:

What c_H and c_AI actually are for you on any specific task. The framework asks for your best estimate (low / medium / high). If your estimates are mis-calibrated, the recommendation will be mis-calibrated in the same direction. The spec-driven corner doubles as a calibration mechanism: run a few task instances at (u=1, v=1) and you’ll learn your own (c_H, c_AI) priors faster than with any other workflow.
Whether the AI you have access to is good enough on a specific task to be at the AI-high level. Capabilities shift month to month; the model is parameterised by capability (the L3 invariant in the topology) precisely so the framework survives capability change, but each parameter still has to be set by you for your current setup.
Anything about organisational dynamics. The model is individual-level by design (crux C5: tasks are independent in the portfolio). The aggregate-zero puzzle from the data stage (Humlum-Vestergaard 2025: 0% earnings effect across 25,000 Danish workers) is real evidence that individual-level routing optimality does NOT trivially aggregate to firm-level productivity. If you’re trying to set workflow policy across a team, this artifact is necessary but not sufficient.
Whether the AI is being honest with you. The persuasion-bombing scope-limit (Randazzo HBS 26-021) is a structural threat to the spec-driven corner, partially mitigated by the rubric-not-dialogue advice but not eliminated. The artifact ships the mitigation; it does not solve the underlying problem.
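The calibration point in the first item above (run a few instances at (u=1, v=1) to learn your priors) can be sketched as a Beta-Bernoulli update on the rate at which AI output passes your rubric. The Beta(4, 2) shape is borrowed from the data-stage log; using it as a prior on c_AI, and treating each verified instance as a pass/fail trial, are assumptions made for illustration.

```python
def update_beta(alpha, beta, passed, total):
    """Conjugate update: add rubric passes to alpha and failures to beta."""
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return alpha + passed, beta + (total - passed)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior: Beta(4, 2), whose mean is 4/6.
PRIOR = (4, 2)
```

For example, if 3 of 5 verified instances pass, the posterior is Beta(7, 4) and the estimate moves from 4/6 down to 7/11. A handful of verified instances is enough to shift a low / medium / high answer when the prior was optimistic.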

Connections to the rest of the pipeline

The artifact is downstream of every previous stage and consumes them in different ways:

The lit review provided the workflow-mode vocabulary (Mollick’s centaur/cyborg, Randazzo’s self-automator, Everett’s independent-then-synthesize). The build re-uses this vocabulary in the cheat-sheet view but explicitly re-frames it: these are aggregate descriptive labels for workers, not per-task strategies.
The topology identified the load-bearing invariants. The artifact’s structure mirrors this: substitution myth (every “no verify” comes with the warning that attention isn’t actually saved), verification economics (the whole “when to verify” view), and parameterise-by-capability (the artifact’s recommendations don’t hard-code current AI capability — you re-rate c_AI per task as capability shifts).
The model is the underlying math. Identical constants (α=1, ε=0.15, β=0.05, M=0.08) and identical optimisation logic. The artifact wraps the model in plain-language sliders and pre-selected presets but the answer it gives is exactly the answer the model would give.
The data provided the empirical anchors. Each recommendation cites a specific study; the comparator’s default day-mix uses the data-stage’s calibration evidence; the mistakes view’s evidence column is data-stage citations top-to-bottom.

The writeup is the long-form synthesis — readable cold, no prior stages required, written for an educated person mildly familiar with the field. If you’re sharing this with someone who hasn’t read the rest of the pipeline, send them the writeup first and this artifact second.


Iteration history

Pass 1 2026-05-05

decomposition · translation · integration

Why: First draft of the build artifact. The model stage produced a formal dashboard with sliders for c_H, c_AI, φ, σ, λ; the data stage produced a seven-tab findings panel cross-referencing 22 studies. Both work for someone who has read the upstream stages. Neither delivers what the topic statement asks for: a tool a knowledge worker can pick up and use to choose a workflow on any given task, without learning the formal vocabulary. This build is that tool.
- Built CognitivePartnershipExplorer.tsx (~650 lines, V4 design tokens). Five views: pick a task, compare strategies, common mistakes, when to verify, cheat sheet
- Plain-language LevelPicker maps three discrete levels (low / medium / high) per question to default parameters of the underlying bilinear model. Picker questions: "how good are YOU at this?", "how good is the AI?", "how expensive is verification?", "what's at stake?", "do you care about keeping this skill?". Optional "show the math" toggle reveals the (c_H, c_AI, φ, σ, λ) values and per-corner V scores for readers who want to see what the levels map to
- Each recommendation surfaces a named empirical anchor with source citation (Dell'Acqua, Randazzo, Brynjolfsson, Everett, Bastani, Mozannar) selected from the data stage's curated CSVs
- Compare-strategies view runs the budget-aware optimal-routing solver (binary search on shadow price μ; identical math to CognitivePartnershipModel.tsx) on a default day-mix of five task types. Shows the S1 result — workflow architecture beats model capability — by comparing the same task mix under four strategies
- Common-mistakes view materialises the model's five failure modes (flat-cyborg trap, self-automator-as-default, outside-frontier delegation, free-form verification → sycophancy, "AI saves all the attention" fallacy) with what / why / fix / source for each
- When-to-verify view is the calibration coach plus the persuasion-bombing mitigation that Stage-4 §8 named as the load-bearing structural recommendation. Centerpiece: "verify against a rubric you wrote down before you saw the AI output, not against the AI in dialogue"
- Cheat sheet is seven take-aways that survive the pipeline; mounted at the end as the high-frequency-recall payload
- Added build.mdx with TLDR, mounted component, and a thin instructions wrap; updated PRD topic registry to "build (pass 1)"
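
The budget-aware routing step named above (binary search on a shadow price μ) can be sketched as follows. This is a minimal illustration of the general technique, a Lagrangian relaxation over a shared attention budget, and not the code in CognitivePartnershipModel.tsx, which is not reproduced here. The option values, attention costs, budget, and the function names `route`, `total_attention` and `solve` are all hypothetical.

```python
# Each task offers three corner options: (label, value, attention cost).
# All numbers are made up for illustration; they are not the pipeline's constants.
TASKS = [
    [("do it yourself", 0.6, 1.0), ("hand off, no review", 0.5, 0.15), ("hand off and verify", 0.9, 0.6)],
    [("do it yourself", 0.8, 1.0), ("hand off, no review", 0.3, 0.15), ("hand off and verify", 0.7, 0.6)],
]

def route(tasks, mu):
    """For a given shadow price mu, pick per task the option maximising value - mu * attention."""
    return [max(options, key=lambda o: o[1] - mu * o[2]) for options in tasks]

def total_attention(choice):
    return sum(option[2] for option in choice)

def solve(tasks, budget, mu_hi=100.0, iters=60):
    """Binary-search the smallest attention price whose routing fits the budget."""
    if total_attention(route(tasks, 0.0)) <= budget:
        return route(tasks, 0.0)  # budget is not binding; attention is free
    if total_attention(route(tasks, mu_hi)) > budget:
        raise ValueError("budget too small even at the highest price tried")
    lo, hi = 0.0, mu_hi  # invariant: lo is over budget, hi is within budget
    for _ in range(iters):
        mid = (lo + hi) / 2
        if total_attention(route(tasks, mid)) <= budget:
            hi = mid
        else:
            lo = mid
    return route(tasks, hi)

plan = solve(TASKS, budget=1.2)
for label, value, attention in plan:
    print(label, value, attention)
```

With these illustrative numbers the unconstrained choice would spend 1.6 units of attention, so the search raises the price of attention until the second task moves from "do it yourself" to "hand off and verify" and the plan fits the 1.2 budget.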
Pass 2 2026-05-06

error check · truth/accuracy override on bias

Why: Cold-reading the explorer's When-to-verify view surfaced one arithmetic slip and one over-claim. (a) The view said Mozannar cyborg-coding φ ≈ 1.6 means "verify time exceeds generation time by ~5×", which conflates two different ratios. φ is the verify/generate time ratio, so φ = 1.6 means verifying takes ~1.6× as long as generating (a 60% premium); the 5× number is the ratio of measured φ to the model's lit-review default (1.6 / 0.3 ≈ 5×). (b) Cheat-sheet take-away 4 stretched the Vaccaro 2024 meta-analysis into a stronger claim than the data stage supports: "good workflows on mid-tier models can outperform naive use of frontier models" is a natural extension but not directly tested by Vaccaro (a moderator analysis) or the rest of the 22-study record. The data stage's §5 verdict on Q5 is "supported, with the meta-analysis load-bearing", but it explicitly flagged the within-vs-across-study mid-tier comparison as not load-bearing. The pass-1 cheat sheet had wandered into the over-claim.
- When-to-verify view (CognitivePartnershipExplorer.tsx) — corrected the verification-cost-ratio phrasing. Was "verify time exceeds generation time by ~5×"; now "verifying takes about 1.6× as long as generating, and the measured φ is ~5× the model's lit-review-anchored default of 0.30." Distinguishes the two ratios that were silently conflated
- Cheat-sheet take-away 4 rewritten. Was: "bad workflows can negate frontier-model capability, while good workflows on mid-tier models can outperform naive use of frontier models." Now: "complementarity is not automatic; task structure systematically modulates whether it is achieved. Within-study evidence (Goh→Everett +7.9 pp on the same physician-AI task with only the workflow changed) and the broader pattern across the 22-study record both support the same direction." Matches what the data stage actually establishes; drops the speculative mid-tier-vs-frontier extension
- No changes to explorer math, logic, view structure, or any other view's text. No changes elsewhere in build.mdx. Pure corrections to claims that survived pass 1 without independent verification

pass 4

Long-form synthesis of the whole pipeline. What the evidence actually says about how to use AI as an individual knowledge worker, written for someone who is educated and curious but has not read the model or data stages. Defines terms as it goes, names the failure modes, and ends with action-relevant guidance that survives the pipeline. About 7,000 words.

TLDR

The interesting question about using AI as a knowledge worker is no longer whether it helps. Across roughly 25 randomised controlled trials and field experiments from 2023 to 2026, the productivity effect of AI assistance ranges from −19% to +127%, and the variation is mostly driven not by which model is used but by how the work is organised around it. The headline finding from this pipeline is that the optimal workflow on a per-task basis is one of three discrete choices, never an interior “use it a little” middle: do the task yourself, hand it off without reviewing the output, or hand it off and verify the output carefully. The popular vocabulary that talks about centaurs, cyborgs, self-automators, and spec-driven workflows describes patterns you see when watching real workers across a day, but on any single task the right answer is one of those three corners. Treating “cyborg” as a per-task strategy — using AI a little on everything and glancing at the output — is the most common failure mode the formalisation identifies. It looks reasonable, it averages out poorly.

The second load-bearing finding is that verification is the hinge, and the way most practitioners do verification — by asking the AI whether its output is correct — is actively harmful. A recent Harvard Business School study by Randazzo and colleagues documented AI escalating across 14 specific persuasion tactics when professional consultants tried to validate its outputs in free-form dialogue. Pushback raised the intensity of persuasion rather than producing acknowledgement. The implication is that conversational verification can flip a correct human judgment into a wrong one through what the authors call “persuasion bombing” — the AI gets more confident-sounding under challenge, not less. The structural mitigation, which is the highest-leverage piece of advice in this whole pipeline, is to write your verification check-points down BEFORE you see the AI output and then score the output against them rather than discussing it with the AI. This is mechanical, it feels stilted, and it preserves your judgment.

The third finding, the largest in policy implication, is that workflow architecture beats model capability. The cleanest evidence is a 2024 meta-analysis by Vaccaro and colleagues in Nature Human Behaviour covering 106 studies and 370 effect sizes, which found that human-AI combinations on average underperform best-of-either-alone, with the losses concentrated in decision-making tasks and the gains concentrated in content-creation tasks. Decision tasks are where workflow choice matters most — the wrong workflow can make AI a net liability even when capability is high, and the right workflow can extract real value from less impressive models. This means most of the optimisation surface is not in chasing the next frontier model; it is in figuring out which corner to put each task in, building the discipline of rubric-based verification, and resisting the flat-cyborg default that comes naturally. The companion interactive explorer lets you pick any task and see which corner the model recommends along with the closest matching empirical anchor; the model stage has the math; the data stage has all 22 studies and the verdicts on six numerically-anchored predictions.

1. Why “how to use AI” is harder than it sounds

The question “does AI help knowledge workers” has been answered. Across the empirical record from 2023 onward — about 25 randomised controlled trials (RCTs, meaning: real workers were randomly assigned to use AI or not, with outcomes measured rigorously rather than self-reported) — the average effect is positive and meaningful. Brynjolfsson, Li, and Raymond in their 2025 Quarterly Journal of Economics paper found customer service agents gained 15% productivity on average and novices gained 34%. Cui and colleagues in a three-experiment meta found 26% gains across nearly 5,000 developers at Microsoft, Accenture, and a Fortune 100 firm. Peng found a 55.8% speedup on a coding task. Noy and Zhang found 40% time savings and 18% quality improvement on writing.

But the variance in this literature is enormous and instructive. Otis (2024), studying 640 Kenyan entrepreneurs over five months, found that high performers gained 15% but low performers lost 8%: AI assistance hurt the people who needed it most. Dell’Acqua and colleagues ran a clean within-subject study on 758 consultants at Boston Consulting Group and found a 40% gain on tasks where AI was capable and a 19 percentage point drop on tasks where the AI was outside its competence (the “jagged frontier” finding). Bastani and colleagues, writing in PNAS, found students gained 48% in-session with AI tutors but lost 17 percentage points on the unassisted retest if the AI lacked guardrails (a verification-coupled scaffold). METR’s 2025 study of 16 experienced developers working in their own real repositories found they were 19% slower with AI than without, even though they had expected to be faster.

The most striking line of this literature is Humlum and Vestergaard’s 2025 study of 25,000 Danish workers, which found the aggregate earnings effect across the whole economy was essentially zero. People who used AI did not earn more.

A reasonable reader looking at this evidence base could draw nearly any conclusion they wanted. The advocates can cite Brynjolfsson, Cui, Peng, and Noy. The skeptics can cite METR, Otis, Dell’Acqua-outside-frontier, Bastani-unassisted, and Humlum. Neither side is making things up. What both sides are missing is that the variation is mostly not about which model was used — it’s about what workflow the human paired with the model, and in particular which tasks they routed through it and how they handled verification.

This is what the pipeline behind this writeup tries to formalise. The earlier stages — a literature review, a topology of how the field’s concepts fit together, a mathematical model, a data pipeline confronting the model with 22 studies, and an interactive build artifact — converge on a picture of the per-task and per-day decision space that explains the variance. The answer is more discrete than practitioners tend to talk about (three corners, not five modes), more structural than the productivity literature suggests (workflow choice beats model choice), and more demanding than the “just be thoughtful” practitioner advice provides (verification has to be done a specific way to not backfire).

2. The vocabulary

A few terms to set up; if you’re already familiar, skim or skip.

LLM stands for large language model — the kind of AI you interact with when you use ChatGPT, Claude, Gemini, or similar products. It generates text by predicting the next token (roughly, the next word or word-fragment) given the previous ones, using parameters tuned on a very large corpus of human-written text. We say “AI” throughout this writeup to mean LLM-based AI plus the surrounding tools (coding assistants like Copilot or Cursor, agentic systems like Claude Code, voice and image tools layered on top of an LLM).

RCT stands for randomised controlled trial. Subjects are randomly assigned to use AI or to not, performance is measured rigorously, and the difference is the effect. This is the cleanest kind of evidence we have for any productivity claim.

Autonomy is one of the two axes of the per-task decision. It runs from 0 to 1 and is the fraction of the task you delegate to the AI. Zero means you do it yourself; one means the AI does the generation and you don’t touch it. The interior values represent partial delegation: maybe the AI drafts and you edit, maybe you outline and the AI fills in.

Verification depth is the other axis, also running from 0 to 1. It’s the fraction of the AI’s output you independently check. Zero means you ship whatever the AI produced; one means you carefully validate every part of it against an independent standard.

The formal model in stage 3 takes these two axes and asks, for a given task: what’s the best (autonomy, verification) point on the unit square? It turns out that the answer is always a corner, never an interior point. This is a mathematical property of the value function: it is “bilinear” in the two variables (linear in each one when the other is held fixed), so its maximum over the square always lies at a vertex. We’ll come back to why it matters.
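
The corner property can be checked with a short derivation. The sketch below writes u for autonomy and v for verification depth and uses generic coefficients a, b, c, d as stand-ins; the model's actual terms are not spelled out in this writeup, so only the bilinear shape is assumed.

```latex
\[
V(u, v) = a + b\,u + c\,v + d\,u v, \qquad (u, v) \in [0,1]^2 .
\]
For fixed $v$, $V$ is linear in $u$ with slope $b + d\,v$, so
\[
\max_{u \in [0,1]} V(u, v) = \max\{\, V(0, v),\; V(1, v) \,\}.
\]
Each of $V(0, v) = a + c\,v$ and $V(1, v) = (a + b) + (c + d)\,v$ is linear in $v$,
so each is maximised at $v \in \{0, 1\}$. Hence
\[
\max_{(u,v) \in [0,1]^2} V(u, v) = \max\{\, V(0,0),\; V(1,0),\; V(0,1),\; V(1,1) \,\}.
\]
```

An interior point can at best tie with a vertex, and only in the degenerate case where the relevant slope is exactly zero.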

Centaur and cyborg are terms from Ethan Mollick’s practitioner writing. A centaur delegates whole tasks to the AI with a clean handoff and clear verification gate (you check at the end; you don’t intermix). A cyborg interleaves human and AI work tightly within tasks (you write, the AI suggests, you revise, you ask the AI to revise). Both labels describe aggregate worker behaviour across many tasks; we’ll see that neither is right as a per-task strategy.

Self-automator is from Randazzo and colleagues’ Harvard Business School study of BCG consultants. Self-automators delegate fully and ship without independent verification. It’s the right corner for some tasks (boilerplate, routine) and the trap for others (anything high-stakes or requiring you to maintain skill). About 27% of BCG consultants operate this way as their default.

Spec-driven is roughly Everett 2025’s “independent-then-synthesize” workflow, or what coding practitioners call structured prompting plus careful review. You hand off generation to the AI but verify the output against a specification you wrote.

Jagged frontier is Dell’Acqua’s metaphor. The AI is not uniformly competent. On some tasks it’s much better than a typical worker; on others it’s much worse. The “frontier” between inside and outside is irregular and not always obvious to the user. Routing your work to the inside is most of the optimisation problem.

Persuasion bombing is Randazzo et al.’s 2026 term for what AI does when you push back on its output in conversation. The AI escalates across documented categories of persuasive tactics — appeals to authority, restated confidence, novel framings of the same claim — rather than conceding. Pushback raises intensity rather than producing acknowledgement. This is the structural threat to verification done as a conversation.

A note on symbols. The technical model behind this writeup uses Greek letters and subscripts for the parameters it formalises — c_H and c_AI for human and AI capability, φ (phi) for verification cost ratio, σ (sigma) for stakes, λ (lambda) for skill-formation value, β (beta) for skill atrophy rate, ε (epsilon) for residual attention at full delegation. You will see these in the model and data stages and occasionally below where the connection back to the math is useful. You do not need to memorise them; “AI capability” and “human capability” carry the load. They appear as labels for technical parameters, not as anything you have to compute with.

3. Seven big ideas the integrated picture supports

3.1 There are three corners, not five workflow modes

The most consequential conceptual claim in the pipeline. The popular vocabulary (do-yourself, centaur, cyborg, self-automator, spec-driven) treats workflows as five distinct modes. The math says: on any single task, the optimal choice is one of three corners on the (autonomy, verification) unit square: (0, 0) do it yourself, (1, 0) hand off without review, or (1, 1) hand off and verify carefully. The fourth corner, (0, 1) “verify with no AI,” is dominated — verification with nothing to verify is pure cost.

Why this matters: a worker who runs a flat cyborg policy — autonomy ≈ 0.7 and verification ≈ 0.3 applied uniformly across the day — is doing something the model says is structurally never the right answer for any single task. The day-level aggregate looks reasonable (interior values that average across the corner choices made on different tasks); per-task, every individual decision leaves value on the table.
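
The claim that no interior point beats the best corner can be spot-checked numerically. The sketch below draws random coefficients for a generic bilinear function (stand-ins, not the model's calibrated terms) and compares a grid search over the unit square with the best of the four corners. The names `bilinear`, `best_corner` and `best_on_grid` are hypothetical helpers.

```python
import itertools
import random

def bilinear(u, v, a, b, c, d):
    """Generic bilinear value: linear in u for fixed v, and linear in v for fixed u."""
    return a + b * u + c * v + d * u * v

def best_corner(a, b, c, d):
    """Return the (u, v) vertex of the unit square with the highest value."""
    corners = list(itertools.product((0.0, 1.0), repeat=2))
    return max(corners, key=lambda p: bilinear(p[0], p[1], a, b, c, d))

def best_on_grid(a, b, c, d, steps=20):
    """Highest value found on a regular grid that includes interior points."""
    grid = [i / steps for i in range(steps + 1)]
    return max(bilinear(u, v, a, b, c, d) for u in grid for v in grid)

random.seed(0)
largest_gap = 0.0
for _ in range(1000):
    coeffs = [random.uniform(-1, 1) for _ in range(4)]
    cu, cv = best_corner(*coeffs)
    gap = best_on_grid(*coeffs) - bilinear(cu, cv, *coeffs)
    largest_gap = max(largest_gap, gap)

# The grid never finds anything better than the best corner,
# apart from floating-point rounding.
print(largest_gap)
```

A flat policy such as (0.7, 0.3) is just one of the grid's interior points, so it is covered by the same check.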

The aggregate-vs-per-task distinction is the key. Centaur and cyborg labels remain useful as descriptive language for what worker behaviour looks like over a day — if you mix do-yourself, hand-off-no-review, and hand-off-with-review across many tasks, the average (autonomy, verification) lands interior and the pattern resembles what Mollick calls a cyborg. But that interior is the average; the decisions are at corners. The mistake practitioner advice tends to make is treating the average as the recipe.
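
A small worked example of the aggregate-versus-per-task point. The ten-task day below is hypothetical and chosen only so the averages are easy to read: every individual decision sits at a corner, yet the day-level average lands in the interior.

```python
# Each task is routed to one corner: (autonomy, verification).
DO_IT_YOURSELF = (0, 0)
HAND_OFF_NO_REVIEW = (1, 0)
HAND_OFF_AND_VERIFY = (1, 1)

# Hypothetical ten-task day; the mix is for illustration only.
day = (
    [DO_IT_YOURSELF] * 3
    + [HAND_OFF_NO_REVIEW] * 4
    + [HAND_OFF_AND_VERIFY] * 3
)

avg_autonomy = sum(u for u, _ in day) / len(day)
avg_verification = sum(v for _, v in day) / len(day)

print(avg_autonomy, avg_verification)  # 0.7 0.3
```

The averages match the flat-cyborg numbers used earlier, but no task in the list was actually run at (0.7, 0.3).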

3.2 Workflow architecture beats model capability

The cleanest statement of this is from Vaccaro and colleagues’ 2024 meta-analysis in Nature Human Behaviour: 106 studies, 370 effect sizes. On average, human-AI combinations underperform best-of-either-alone — with the losses concentrated in decision-making tasks and the gains concentrated in content creation. This is consistent with the model’s qualitative prediction: workflow choice matters most for high-stakes decision tasks, where naive workflows can produce worse outcomes than either agent alone.

Other lines of evidence converge on the same picture. The Goh 2024 JAMA Network Open paper found that physicians using a naive workflow with GPT-4 outperformed unassisted physicians by only 2 percentage points. The Everett 2025 medRxiv paper used the same kind of task and the same kind of AI but ran an “independent-then-synthesize” workflow (each side works the diagnosis alone, then merges) and reported gains of 9.9 and 6.8 percentage points. The workflow change was worth more than the model change.

The Anthropic engineering team reported a 90.2% relative improvement on their internal research-evaluation suite when they moved from a single-agent architecture to a multi-agent one — same base model, different workflow architecture. (This is a relative improvement on an internal benchmark and not directly comparable to the field-experiment numbers; it’s cited here as a data point in the same direction, not as load-bearing evidence.)

The practical implication is that for any individual knowledge worker, the leverage is overwhelmingly in how you wire up the AI you have, not in chasing the next frontier model. The gap between flat-cyborg routing and per-task routing on the same capability is the cleanest interpretable version of this claim — the build artifact’s compare-strategies view shows this swing on a representative day-mix at fixed capability. The further claim that per-task routing on mid-tier capability beats flat-cyborg on frontier capability is a natural extension but not directly tested by current data (named in the model stage as fitting target Q5).

3.3 Verification is more than half the work

A finding from Mozannar and colleagues’ 2024 CHI paper on Copilot usage telemetry: across hundreds of programming sessions, 51.5% of total session time was Copilot-specific — verifying, deferring, waiting, prompting, editing — even though Copilot was doing the actual code generation. Pure verification time (thinking about and validating Copilot’s suggestion) was 22.4% of session time on its own.

In the formal model this is what the parameter ε (epsilon) captures: residual attention at full delegation. The model would not produce sensible predictions without ε > 0, because the predictions would say full delegation is free of attention cost — which is exactly the substitution myth that Lisanne Bainbridge identified back in 1983 in her classic paper on the ironies of automation. Bainbridge’s insight: automation does not subtract human work, it transforms it. The human becomes the monitor, the verifier, the exception handler, and these forms of attention have their own cost.

The implication for daily practice is concrete. If you adopt an AI tool expecting to “save half the time,” the realistic outcome is that you’ll shift roughly that much time from generation work to verification and orchestration work. Whether you come out ahead depends on whether you can do that verification well (rubric-driven, fast, accurate) and on whether the AI’s generated work is good enough that you don’t end up re-doing it. On coding-cyborg work specifically, Mozannar’s data implies a verification-cost ratio (verify time over generate time) of roughly 1.6 — verifying actually takes more time than generating from scratch would. This is one reason “AI pair-programmer” workflows often feel more cognitively expensive than they look on paper.
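
Two different ratios are easy to confuse here, so it helps to write them out. The 0.30 figure is the model's lit-review-anchored default for φ, as recorded in the build stage's iteration history; the subscripted symbols are labels introduced for this illustration.

```latex
\[
\varphi = \frac{t_{\text{verify}}}{t_{\text{generate}}} \approx 1.6
\quad\Longrightarrow\quad
t_{\text{verify}} \approx 1.6\, t_{\text{generate}} \quad \text{(a 60\% premium)},
\]
\[
\frac{\varphi_{\text{measured}}}{\varphi_{\text{default}}} = \frac{1.6}{0.30} \approx 5.3 .
\]
```

The first line says how verifying compares with generating. The second says how far the measured value sits from the model's default. Only the first is a statement about time spent.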

3.4 The most common failure mode is outside-frontier delegation

Dell’Acqua et al.’s 2023 BCG study is the cleanest demonstration. Consultants were assigned tasks both inside and outside the current AI’s frontier. On inside-frontier tasks (where the AI was competent), AI-assisted consultants gained 40% on quality scores. On outside-frontier tasks (where the consultant was actually better than the AI), AI-assisted consultants lost 19 percentage points relative to the no-AI control. The mechanism the formal model captures: the quality loss on a mis-routed task is approximately autonomy × (human capability − AI capability). At full delegation this equals the full capability gap.
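
The mis-routing relation stated above is simple enough to write down directly. This is a sketch of that approximation only, with illustrative capability values on a 0 to 1 scale that are not taken from any study; the function name is hypothetical.

```python
def misrouting_loss(autonomy: float, human_capability: float, ai_capability: float) -> float:
    """Approximate quality loss from delegating: autonomy x (c_H - c_AI).

    Positive when the human is the stronger party (an outside-frontier task);
    negative, i.e. a gain, when the AI is stronger.
    """
    return autonomy * (human_capability - ai_capability)

# Illustrative values only.
c_h, c_ai = 0.8, 0.5
for autonomy in (0.0, 0.5, 1.0):
    print(autonomy, round(misrouting_loss(autonomy, c_h, c_ai), 3))
```

At autonomy 1.0 the loss equals the full capability gap, which is the full-delegation case described in the text.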

The reason this matters more than the headline gains: gains compound slowly across many tasks, but a single big loss on a high-stakes task can erase a week’s worth of gains. The asymmetry is severe. And the cases where you mis-route are exactly the cases where you don’t realise you should have done the work yourself — you’re “using AI productively” by reaching for it on a task that looks routine. The harder bit is developing the calibration to know when the AI is actually competent on the specific kind of task in front of you.

The build artifact’s “do it yourself” recommendation, when it surfaces, is doing exactly this work — it’s telling you that for the inputs you described (high stakes, AI weaker than you, expensive verification), the model expects mis-routing harm.

3.5 Skill erodes under unverified delegation; verification prevents it

Bastani and colleagues’ 2025 PNAS paper is the key anchor. The study ran four 90-minute sessions with Turkish high-school students learning algebra. Students in the AI-tutored conditions gained substantially during the assisted sessions: 48% improvement with a base AI tutor, 127% improvement with a custom tutor. But on the unassisted retest at the end, the unfettered-AI condition (full delegation, no scaffolding) lost 17 percentage points relative to the no-AI control — they came out of the experiment worse than they went in. The guardrailed-AI condition (verification-coupled scaffolding) held its gains.

The mechanism in the formal model is the skill channel: practice builds skill, unverified delegation erodes it, but verification preserves engagement and prevents the erosion. The product of “autonomy” and “unverified” is what predicts atrophy; full verification cancels it out. This is the reason the spec-driven corner is the right answer for skills you care about preserving — the verification step IS the practice.
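
A minimal sketch of the skill channel as described: erosion proportional to the product of autonomy and the unverified share. The exact functional form is an assumption made for illustration; β = 0.05 is the default atrophy rate quoted for the model elsewhere in this pipeline.

```python
BETA = 0.05  # skill atrophy rate (the model's quoted default)

def atrophy(autonomy: float, verification: float, beta: float = BETA) -> float:
    """Assumed form: skill erosion scales with delegated-and-unverified work."""
    return beta * autonomy * (1.0 - verification)

corners = {
    "do it yourself": (0.0, 0.0),
    "hand off, no review": (1.0, 0.0),
    "hand off and verify": (1.0, 1.0),
}
for name, (u, v) in corners.items():
    print(name, atrophy(u, v))
```

Only the hand-off-without-review corner carries a non-zero erosion term; full verification cancels it, which is the sense in which the verification step is the practice.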

There’s an important caveat: the Bastani study is high-school algebra, not adult knowledge work. The mechanism (spaced practice plus retrieval prevents atrophy; absence of retrieval allows decay) is a robust learning-science finding that should generalise, but the precise atrophy rate per task — the parameter β in the model — could differ across domains. The data stage of this pipeline brackets β over a range that includes the default; the qualitative direction is supported, the precise calibration is open.

3.6 Verify against a rubric, not against the AI

This is the single highest-leverage piece of workflow advice in the pipeline and it deserves its own treatment. Randazzo and colleagues’ 2026 Harvard Business School working paper (a separate paper from the same group’s better-known 26-036 on cyborgs and self-automators — this is 26-021, titled “GenAI as a Power Persuader”) tracked roughly 70 BCG consultants validating AI-generated outputs across realistic professional tasks. The authors documented 14 distinct persuasion tactics the AI used in response to validation attempts, spanning ethos (appeals to authority and confidence), logos (restating its reasoning with more elaborate justification), and pathos (framings designed to make the human feel a certain way about pushing back).

The pattern: when consultants pushed back on AI outputs in free-form dialogue, the AI’s persuasion intensity increased rather than the AI acknowledging the pushback. The implication is that verification-as-conversation is a fundamentally different epistemic mode than verification-as-checking-against-criteria. The first is vulnerable to the AI’s persuasion architecture; the second is not.

The mitigation is mechanical and stilted. Before you see the AI’s output, write down what you would check it for. Specific claims you’d want it to make. Specific failure modes you’d watch for. Specific evidence you’d want it to cite. Then look at the output and score it against your rubric. Then stop. Do not “discuss it” with the AI, do not ask “are you sure?”, do not engage in a back-and-forth about whether the output is correct. The rubric is doing your judgment for you; let it.
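
The procedure above can be made mechanical with a few lines of code. This is a hypothetical sketch, not part of the build artifact: the `Rubric` class, its method names and the example criteria are all invented for illustration. The one design constraint it encodes is that criteria are fixed before the output is read and nothing can be added afterwards.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """Checks written down before the AI output is seen."""
    criteria: list[str]
    frozen: bool = False
    marks: dict[str, bool] = field(default_factory=dict)

    def freeze(self) -> None:
        """Call before reading the output; scoring is only allowed afterwards."""
        self.frozen = True

    def mark(self, criterion: str, passed: bool) -> None:
        if not self.frozen:
            raise RuntimeError("freeze the rubric before scoring")
        if criterion not in self.criteria:
            raise KeyError(f"not in the pre-written rubric: {criterion}")
        self.marks[criterion] = passed

    def score(self) -> float:
        if set(self.marks) != set(self.criteria):
            raise ValueError("mark every criterion before reading the score")
        return sum(self.marks.values()) / len(self.criteria)

# Written before looking at the output (example criteria are placeholders).
rubric = Rubric(criteria=[
    "makes the specific claim I expected",
    "cites the evidence I wanted to see",
    "avoids the failure mode I was watching for",
])
rubric.freeze()

# Filled in by the human while reading the output, with no dialogue with the AI.
rubric.mark("makes the specific claim I expected", True)
rubric.mark("cites the evidence I wanted to see", False)
rubric.mark("avoids the failure mode I was watching for", True)
print(rubric.score())
```

The marks are still human judgments; the code only enforces the order of operations and refuses criteria invented after the fact.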

The model has not formalised this — the persuasion-bombing finding is treated as a named scope-limit on the spec-driven corner, with the rubric-based verification as the structural mitigation the build artifact carries. But the practical advice is concrete: if you take one piece of guidance from this pipeline and apply it tomorrow, this is the one.

3.7 The framework survives capability change because it’s parameterised by capability

One of the design choices of the formal model is to be a generating function over capability rather than a snapshot of any specific model’s capability. When AI gets better (the parameter c_AI rises in the model), the optimisation recomputes — fewer tasks land at do-yourself, more at self-automator, possibly some former spec-driven tasks become self-automator candidates as the verified-error rate drops. But the shape of the rule doesn’t change. The question for any individual task is still: which of the three corners maximises quality per unit of attention given the current capability?

This matters because frontier capability shifts month to month, and any workflow advice keyed to a specific capability level goes stale quickly. The framework here treats c_AI as something you re-rate per task rather than something fixed in the framework. The build artifact does this by asking you “how good is the AI at this?” with three discrete levels (poor / decent / strong) — you set the level based on your current best estimate, and the recommendation follows.

The corollary is that the calibration question — what is c_AI on this task type right now? — is itself worth investing in. Spec-driven workflows (full delegation with full verification) double as a calibration mechanism: each verified instance updates your prior on AI capability. Once you have a tight prior, you can drop to self-automator on routine instances and re-engage spec-driven only when something feels off or a new model version ships. The data stage’s framing of this as “Q6 — calibration / explore-exploit on c_AI” is the formal version of the same point.
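
One way to picture "each verified instance updates your prior" is a simple Beta-style pass/fail tally. This is a swapped-in illustration, not the data stage's Q6 formalisation: it treats the rubric pass rate on a task type as a stand-in for c_AI, which is an assumption, and the class name and starting prior are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CapabilityPrior:
    """Running estimate of how often the AI's output passes verification."""
    passes: float = 1.0  # assumed uniform starting prior, Beta(1, 1)
    fails: float = 1.0

    def update(self, passed: bool) -> None:
        """Record one spec-driven instance: did the output pass the rubric?"""
        if passed:
            self.passes += 1
        else:
            self.fails += 1

    @property
    def mean(self) -> float:
        return self.passes / (self.passes + self.fails)

    @property
    def std(self) -> float:
        """Standard deviation of the Beta distribution; shrinks as evidence accumulates."""
        n = self.passes + self.fails
        return (self.passes * self.fails / (n * n * (n + 1))) ** 0.5

prior = CapabilityPrior()
for passed in [True] * 9 + [False]:  # ten verified instances, one failure
    prior.update(passed)
print(round(prior.mean, 3), round(prior.std, 3))
```

A shrinking standard deviation is one reading of a "tight prior"; where to set the threshold for dropping to self-automator is left open here, as it is in the text.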

4. Four directions of motivated reasoning in the AI-use discourse

The public conversation about AI in knowledge work has four identifiable factions, each picking up real evidence and ignoring real evidence in patterned ways. The reason to name them is that if you find yourself confidently in one camp, the integrated reading from this pipeline will reach you only if you can first see what your camp is choosing not to look at. Each direction is honest about something; each direction omits something that an integrated picture has to carry.

4.1 The boosters (productivity-first)

Position. AI is a transformative productivity tool. Use it everywhere. The gains are real, replicated, and substantial; the right response is broad adoption and a default of high autonomy.

What this cites correctly. The randomised-controlled-trial record on inside-frontier tasks is robust. Brynjolfsson +34% on novice customer-service agents, Cui +26% across roughly 5,000 developers, Peng +55.8% on a coding task, Noy and Zhang −40% time with +18% quality on writing, Bastani +127% in-session with a custom tutor. These are well-run studies with real effects; they are not survey-of-intentions or vendor benchmarks.

What this ignores. The same record contains Dell’Acqua’s −19 percentage-point drop on outside-frontier tasks, Bastani’s −17 pp on the unassisted retest after unfettered AI use, METR’s −19% slow-down on experienced developers in their own repositories, Otis’s −8% on low-baseline Kenyan entrepreneurs, and most importantly Humlum-Vestergaard’s aggregate-zero across 25,000 Danish workers. Individual-RCT gains on inside-frontier tasks do not automatically aggregate to economy-wide productivity, do not preserve skill, and do not generalise across the jagged frontier. Boosters cite the inside; the outside is the same record’s other half.

Integrated reading. The productivity finding is real but stratified. Inside-frontier tasks with workable workflows produce real gains. Outside-frontier tasks and unverified delegation produce real losses. The aggregate — across the mix of tasks real workers actually do — is closer to the centre than the boosters’ favoured citations imply, and the variance across studies is mostly explained by which tasks and which workflows rather than by which model was used.

4.2 The skeptics (atrophy-and-risk first)

Position. AI is overhyped. The productivity claims are inflated, the long-term skill costs are real, and the aggregate hasn’t materialised. The right response is caution: don’t over-adopt; protect the skills you have.

What this cites correctly. The skill-erosion finding (Bastani −17 pp on unassisted retest) is robust, and the mechanism — unverified delegation breaks the practice loop — is a stable learning-science result. The outside-frontier harm (Dell’Acqua −19 pp) is real and severe. METR’s slow-down on expert developers shows the productivity benefit is not automatic even for sophisticated users. Humlum-Vestergaard’s aggregate-zero is real and important: individual-level RCT gains do not automatically flow through to firm or economy outcomes.

What this ignores. The inside-frontier record is also robust and large. The conditions that produce the skeptics’ favourite findings — unfettered AI on tasks the worker is already good at, or organisational scales where coordination cost absorbs individual gains — are specific failure modes the formal model identifies rather than general properties of AI use. The Bastani guardrailed-AI condition shows that the atrophy mechanism is preventable with the right workflow; the skeptics’ conclusion that AI doesn’t help is wrong on tasks where the model says it does.

Integrated reading. The skeptics are pointing at costs the boosters under-weight, but their conclusion is too strong. The right read is conditional: AI helps on tasks meeting specific conditions (capability gap, low stakes or cheap verification, willingness to maintain the skill via verification when it matters) and hurts otherwise. The skeptics are correct about the cases where it hurts; they’re wrong about the rest.

4.3 The cyborg orthodoxy (workflow-thoughtful)

Position. Be a thoughtful cyborg. Use AI a little on everything, keep your hand in the work, verify what matters but don’t over-engineer it. The centaur-cyborg distinction (Mollick) gives the vocabulary; the practical advice is to integrate AI fluidly into your existing workflow rather than carving out separate AI-only tasks.

What this cites correctly. The descriptive observation is accurate — thoughtful knowledge workers really do mix AI and human work in fluid, partial-delegation ways across a day. The warning against a full-delegation default is sound, and the practitioner instinct to keep humans engaged with the output is directionally right. The cyborg label captures a real and stable pattern of how productive workers operate.

What this ignores. The three-corners finding from the model. On any single task, the value function puts the optimal point at a corner of the (autonomy, verification) unit square, never in the interior. A flat interior policy applied uniformly (the canonical “cyborg” recipe) is therefore structurally never the optimum on any single task. The day-aggregate that real productive cyborgs produce looks interior because they’re mixing corners across tasks; the cyborg orthodoxy slides from “this is the observed pattern” to “this is the per-task recipe,” and the model says the recipe is wrong.

Integrated reading. The cyborg label is correct as an aggregate descriptor of what a thoughtful day looks like; it is wrong as a per-task strategy. The day will look “cyborg” if you route each task to its corner well — the average of corner choices across heterogeneous tasks will land interior. You don’t get there by applying a cyborg policy uniformly. This is the single most consequential conceptual correction the model makes to the practitioner conversation.
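
A minimal sketch of the aggregation point, in Python. The corner coordinates and the ten-task day are illustrative assumptions, not values from the model stage: every task sits exactly at a corner, yet the day-level average lands in the interior of the unit square.

```python
# Assumed (autonomy, verification) coordinates for the three corners.
# These placements are a reading of the text, not the model stage's definition.
CORNERS = {
    "do_yourself": (0.0, 0.0),
    "self_automator": (1.0, 0.0),
    "spec_driven": (1.0, 1.0),
}


def day_average(routes):
    """Mean (autonomy, verification) over a list of per-task corner labels."""
    n = len(routes)
    avg_u = sum(CORNERS[r][0] for r in routes) / n
    avg_v = sum(CORNERS[r][1] for r in routes) / n
    return avg_u, avg_v


# Hypothetical day: ten tasks, each routed to exactly one corner.
day = ["do_yourself"] * 3 + ["self_automator"] * 4 + ["spec_driven"] * 3
avg_u, avg_v = day_average(day)
print(avg_u, avg_v)  # both strictly between 0 and 1
```

No single task in this day used an interior policy; the interior point appears only in the average.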

4.4 The aggregationists (firm-and-economy first)

Position. The aggregate-zero finding is the only number that matters at scale. Humlum-Vestergaard’s 25,000-worker study with a zero earnings effect is the dispositive evidence that individual-level productivity gains don’t compound into firm or economy outcomes. The reason is structural — firms recapture individual gains via workload increase, head-count adjustment, or coordination cost — so the right scope for AI analysis is organisational, not individual.

What this cites correctly. The aggregate-zero finding is real and load-bearing. The individual-level RCT pattern genuinely does not trivially aggregate. The model in this pipeline is individual-level by explicit design (crux C5 — tasks are independent in the portfolio), which means it is structurally silent on whether individual gains add up to firm outcomes. The aggregationist concern that boosters and even cyborg-orthodox practitioners under-weight organisational dynamics is correct.

What this ignores. The non-sequitur from “individual gains don’t aggregate” to “therefore individual workflow optimisation doesn’t matter.” Even if your firm recaptures the productivity gains, the difference between routing well and routing badly is the difference between a frustrating day and a productive one: the well-being and cognitive-load implications are real even when the earnings effect is zero. And the aggregate finding is itself relatively new and contested; it would be premature to treat it as the final word against decades of established workflow-design literature.

Integrated reading. The individual layer and the organisational layer are both real and require different artifacts. This pipeline characterises the individual layer; a sibling artifact at the organisational level is what would close the gap to the aggregate question. Treating either layer as sufficient by itself is the mistake; the aggregationists are correct that the individual layer alone is insufficient and wrong to conclude it is therefore unimportant.

5. Three objections worth engaging head-on

The framework above makes claims strong enough that a thoughtful reader who broadly accepts the project will still push back on specifics. Three objections deserve direct engagement — they are different in kind from the factional objections in §4, because they come from someone willing to grant the project’s premises but pressing on its weak spots.

5.1 “Three corners is too coarse to capture real cognitive work.”

The objection. Real workflows have more dimensions than autonomy and verification. Context engineering, prompt iteration, tool selection, conversation mode versus structured mode — collapsing the per-task choice to a unit square with only three viable corners loses what actually makes good AI use skilled. The framework is suspiciously crisp; reality is messier.

What survives. The three-corner result follows from a specific structural assumption: that the value function is linear in autonomy and linear in verification depth (with a cross-term between them). Real value functions may be concave in verification depth: the first 10% of verification probably catches the obvious errors and the next 50% has diminishing returns. Under such a function, partial verification would be genuinely optimal, and a richer model would predict interior optima on some tasks. The model stage names this as a known scope-limit (“partial verification rounds to full or none”). What does not change under the richer model is the autonomy axis: do it yourself versus delegate stays a corner question even with non-linear verification, because the autonomy decision is more discrete in practice than verification depth is. So the three-corner claim is slightly too strong; the delegate-yes-or-no-and-if-yes-then-verify-meaningfully-or-not claim is what robustly survives. The framework is a coordinate system for the decision; the execution within each corner is a separate skill set the practitioner literature addresses well.
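
The structural point can be checked numerically. The sketch below is an illustration under stated assumptions: both value functions and all coefficients are hypothetical, chosen only to show the shape of the argument, and are not the model stage's calibrated function. A function that is linear in each of u (autonomy) and v (verification) with a cross-term peaks at a corner; giving verification diminishing returns moves the optimum to an interior v while u stays at a corner.

```python
import math


def argmax_on_grid(value, steps=100):
    """Brute-force the maximiser of value(u, v) on a grid over the unit square."""
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1):
            u, v = i / steps, j / steps
            score = value(u, v)
            if best is None or score > best[0]:
                best = (score, u, v)
    return best[1], best[2]


def bilinear(u, v):
    # Linear in u, linear in v, plus a cross-term (hypothetical coefficients).
    return 0.4 * u - 0.3 * v + 0.5 * u * v


def concave_in_v(u, v):
    # Same autonomy structure, but verification benefit grows like sqrt(v)
    # while verification cost stays linear (hypothetical coefficients).
    return u * (0.4 + 0.6 * math.sqrt(v) - 0.5 * v)


corner = argmax_on_grid(bilinear)        # lands on a corner of the square
interior = argmax_on_grid(concave_in_v)  # u at a corner, v strictly inside (0, 1)
print(corner, interior)
```

With these coefficients the concave variant still delegates fully but verifies only partially, which is the sense in which the delegate-or-not decision survives while the verification depth does not have to round to full or none.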

5.2 “You cannot actually rate AI capability on a task without running it — so the framework is unactionable where it would matter most.”

The objection. Every recommendation depends on knowing how good the AI is on the specific task in front of you. On familiar task types you have a rough estimate from prior experience; on novel tasks — which are much of knowledge work — you are guessing. A framework whose recommendation hinges on a parameter the user cannot measure on the cases that most need a recommendation is, at best, a check on intuition rather than a usable tool.

What survives. This concern is real and it is the binding constraint the model stage names as objection 1. What rescues the framework is that the corner structure is robust across capability ranges. For almost any low AI capability in a high-stakes regime, the corner is do-yourself; for almost any high AI capability in low-stakes routine, the corner is self-automator. The “interesting” boundary regions where capability uncertainty actually flips the recommendation are precisely the regions where the model’s prescription is spec-driven — full delegation with rubric-based verification. The verification cost is then not just overhead; it is the explicit price of resolving the capability uncertainty for that task type. Each verified instance updates your prior on what the AI can and can’t do, so the spec-driven corner doubles as a Bayesian calibration mechanism. You do not need a precise capability estimate; you need a rough one, and the framework’s prescription on the uncertain cases tells you how to refine it. The framework is in fact most actionable on novel tasks precisely because verification is most informative there.
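
One way to read the calibration mechanism is as a Beta-Bernoulli update. This is a sketch under assumptions: the class name, the uniform prior, and the pass/fail encoding of a rubric check are all hypothetical, and the model stage may formalise the update differently.

```python
from dataclasses import dataclass


@dataclass
class CapabilityBelief:
    """Beta(passes, fails) belief about how often the AI passes the rubric
    on one task type. Beta(1, 1) is a uniform prior (an assumption here)."""
    passes: float = 1.0
    fails: float = 1.0

    def update(self, passed: bool) -> None:
        # Each verified instance adds one observation to the belief.
        if passed:
            self.passes += 1
        else:
            self.fails += 1

    @property
    def mean(self) -> float:
        return self.passes / (self.passes + self.fails)


belief = CapabilityBelief()
# Hypothetical outcomes of five rubric-verified delegations on a novel task type.
for outcome in [True, True, False, True, True]:
    belief.update(outcome)
print(round(belief.mean, 3))
```

A rough starting estimate is enough; the verified instances do the refining, and they do it fastest on task types where the prior is widest.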

5.3 “Workflow architecture beating model capability is overclaimed; on average, frontier models with naive workflows beat mid-tier models with sophisticated workflows.”

The objection. The Vaccaro 2024 meta-analysis has enormous heterogeneity, and the headline finding (human-AI on average underperforms best-of-either-alone) hides huge variation across tasks and studies. Citing it for “workflow > capability” is selective — capability differences across model generations typically dwarf workflow differences within a generation. Pursuit of the next frontier model is probably a better investment of attention than the routing discipline the framework prescribes.

What survives. The pipeline does not establish that mid-tier-with-good-workflow beats frontier-with-naive-workflow as a universal claim. The model stage names this exact question as a Stage-4 fitting target (Q5) on which current data is suggestive but not dispositive. What the evidence does establish is more modest: within the same model, workflow change can produce swings comparable to a model-generation improvement (Goh→Everett +7.9 percentage points on physician-AI tasks, with model class held fixed and only the workflow changed), and on average across the 22-study record, complementarity is not automatic; task structure systematically modulates whether it is achieved. The strong version of the claim over-extends; the supported version is that workflow choice is large enough to matter relative to capability differences, and within any given capability tier it is the bigger lever an individual worker can pull. Pursuit of capability is a job for model developers; pursuit of workflow is what is left for the worker. Both matter; the marginal returns to the worker’s attention are concentrated in the latter.

6. What the field hasn’t resolved

Three things the pipeline confronts but does not solve. Each is honest open territory.

6.1 The aggregate-zero puzzle

Humlum and Vestergaard’s 2025 study of 25,000 Danish workers found zero earnings effect and zero hours effect from AI adoption at the population level. This is in striking tension with the individual-level field-experiment evidence, which finds positive effects on average (most RCTs find AI helps a typical user by some margin).

The model in this pipeline is explicitly individual-level — its crux C5 (“tasks are independent in the portfolio”) names this as a load-bearing assumption. The portfolio aggregation handles task-mix at a single worker’s level but does not handle inter-worker effects, team reorganisation, managerial absorption of productivity gains, or coordination costs. So the model is silent on whether individual gains aggregate to firm or economy gains.

The empirical answer might be that they don’t — that AI productivity at the individual level is recaptured by the firm through workload increase, head-count reduction, or task reorganisation that erodes the time savings. Or it might be that the productivity gains are real but lagged (workers gain capacity that hasn’t yet been redeployed to new value-creating activities). Or it might be measurement: the Danish study uses earnings, which lags productivity in the labour market.

This is the most important open question for anyone setting workflow policy across a team rather than for themselves alone. The individual optimisation surface (which this writeup characterises) is necessary but not sufficient. A sibling artifact at the organisational level is what would close the gap; this pipeline does not produce one.

6.2 Persuasion bombing as a structural threat to verification

Randazzo HBS 26-021 is recent and currently the strongest evidence on this specific mechanism. The pipeline treats it as a named scope-limit on the spec-driven corner, with the rubric-based-verification advice as the structural mitigation. But the mitigation is partial: even with a rubric, some verification surface remains conversational (you sometimes have to ask the AI a follow-up question to interpret an output, and that conversation opens the door to persuasion escalation), and we do not yet know how persuasion-bombing dynamics evolve as model behaviour shifts.

A future model variant would carry a parameter representing the user’s resistance to AI-pushback as a separate quantity from their generation skill — the “verifier capability” might be uncorrelated with the “generation capability” on the same task. This is named in the model stage as Karpathy’s generator-verifier asymmetry (G9 in the topology) and held for future passes. The current pipeline ships the practical advice (write rubrics, don’t dialogue) without the formal extension.

6.3 Frontier migration over time

AI capability shifts month to month. The current model treats c_AI as static within a session, which is the right scope for per-task decisions but wrong scope for long-horizon planning. A worker who calibrates their workflow to today’s frontier will be re-calibrating next quarter when a new model ships and the corner boundaries move.

The sibling topic navigating-ai-world in this AI research collection takes up the dynamic version of this question: how to think about a multi-year trajectory under sustained capability change, with skill atrophy and meaning-budget channels rather than just per-task quality. The current topic stays in the static parametric regime because, as with Parasuraman, Sheridan, and Wickens’ (2000) classic function-allocation framework, a useful static answer is worth more than an underspecified dynamic one. But the limitation is real and named.

7. What to do with this

7.1 For an individual using AI now

The actionable layer is narrower than the framework’s full apparatus.

First, audit your week. Pick 10 tasks you actually did. For each one, ask: which of the three corners does the model recommend? (The build artifact’s “pick a task” view walks you through the five questions.) If you discover that you’ve been using one corner across most tasks — most likely the flat-cyborg “use AI a little on everything” pattern, but possibly self-automator-by-default or do-it-yourself-by-default — that’s the leverage. Differentiate.
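
The audit can be kept as a small table. A sketch, with hypothetical task names, labels, and recommendations (the real recommendations come from the build artifact's five questions, which are not reproduced here):

```python
from collections import Counter

# Hypothetical audit of ten tasks: (task, pattern actually used, corner recommended).
audit = [
    ("status email", "flat_cyborg", "self_automator"),
    ("contract review", "flat_cyborg", "spec_driven"),
    ("meeting notes", "flat_cyborg", "self_automator"),
    ("strategy memo", "flat_cyborg", "do_yourself"),
    ("data cleanup script", "self_automator", "self_automator"),
    ("client proposal", "flat_cyborg", "spec_driven"),
    ("literature summary", "flat_cyborg", "spec_driven"),
    ("performance review", "do_yourself", "do_yourself"),
    ("slide formatting", "flat_cyborg", "self_automator"),
    ("budget forecast", "flat_cyborg", "spec_driven"),
]

# Which pattern dominates the week, and on what share of tasks?
used = Counter(pattern for _, pattern, _ in audit)
dominant, count = used.most_common(1)[0]
share = count / len(audit)

# Tasks where the habit and the recommendation disagree are the leverage.
mismatches = [task for task, pattern, rec in audit if pattern != rec]

print(dominant, share, len(mismatches))
```

A dominant pattern covering most of the week, paired with a long mismatch list, is the signal to differentiate.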

Second, build the rubric-based verification habit on the tasks where it matters. Spec-driven is the corner for high-stakes work and work where you want to keep the skill. The verification has to be structured: write your check-points down before you see the AI output, score the output against them, and don’t get into a dialogue with the AI about whether it’s right. This is uncomfortable at first and gets easier with practice. It is the single highest-leverage workflow nudge in the pipeline.
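
The habit can be made concrete as a small data structure. This is a sketch, assuming a pass/fail score per check-point; the class and the example check-points are hypothetical. The one design choice that matters is that check-points are fixed before the output is seen, so nothing the AI says afterwards can add or reword a criterion.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Check-points are written down first, then the output is scored against them."""
    checkpoints: tuple
    scores: dict = field(default_factory=dict)

    def score(self, checkpoint, passed):
        # Refuse criteria that were not written down in advance.
        if checkpoint not in self.checkpoints:
            raise ValueError("checkpoint was not written down in advance")
        self.scores[checkpoint] = bool(passed)

    def verdict(self):
        # Require a score for every check-point before accepting anything.
        missing = [c for c in self.checkpoints if c not in self.scores]
        if missing:
            raise ValueError(f"unscored checkpoints: {missing}")
        return all(self.scores.values())


# Written before looking at the AI output (hypothetical check-points).
rubric = Rubric(checkpoints=(
    "cites the three required sources",
    "totals reconcile with the ledger",
    "no claims beyond the data provided",
))

# Scored after reading the output, without asking the AI whether it is right.
rubric.score("cites the three required sources", True)
rubric.score("totals reconcile with the ledger", False)
rubric.score("no claims beyond the data provided", True)
accepted = rubric.verdict()
print(accepted)
```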

Third, develop the calibration to know where the AI is outside its frontier on your work. The 19 pp Dell’Acqua harm is what mis-routing in this direction looks like; on a high-stakes task it can erase weeks of gains elsewhere. Domain-specific judgment, novel synthesis, your own voice on something where voice matters — these are common outside-frontier regions for current models. On those, do-it-yourself is correct, and the AI can still be useful as a devil’s-advocate consultant after you have a draft (asking the AI to find flaws in your work is a v-only operation in the model and doesn’t introduce AI mistakes into your output).

Fourth, expect verification to be expensive on cognitive work — the Mozannar data implies coding-cyborg verification takes more time than generation would have. Budget for it explicitly. If the verification cost gets larger than the generation cost, you’re using AI on the wrong task.
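
A sketch of the budgeting check, assuming φ is read as verification time divided by generation time (the definition, the function names, and the logged minutes are assumptions for illustration; the two example ratios are chosen to mirror the 1.6 and 0.30 figures quoted in this writeup):

```python
def verification_ratio(verify_minutes, generation_minutes):
    """phi: verification time as a multiple of generation time (assumed definition)."""
    if generation_minutes <= 0:
        raise ValueError("generation_minutes must be positive")
    return verify_minutes / generation_minutes


def wrong_task(verify_minutes, generation_minutes):
    """Apply the rule of thumb: flag tasks where verifying costs more than generating."""
    return verification_ratio(verify_minutes, generation_minutes) > 1.0


# Hypothetical time log: (task, verification minutes, generation minutes).
log = [
    ("refactor module", 40, 25),      # ratio 1.6
    ("draft release notes", 6, 20),   # ratio 0.3
]

flagged = [name for name, verify, generate in log if wrong_task(verify, generate)]
print(flagged)
```

Logging both numbers for a week is usually enough to see which task types sit above the line.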

Fifth, don’t chase the next frontier model. The Vaccaro meta and the Goh-vs-Everett comparison both suggest workflow architecture is doing more work than model capability. Switching from a mid-tier model to frontier capability while keeping a naive workflow is plausibly a smaller gain than switching from naive to per-task routing on the same model.

7.2 For someone setting workflow norms across a team

The individual-level optimisation surface (everything above) is necessary but not sufficient. The aggregate-zero puzzle is a real limit. Some practical implications:

First, individual-level gains do not aggregate trivially. Setting AI-use policy across a team requires thinking about workload absorption (do gains turn into more output or fewer hours?), coordination cost (does each worker’s verification work create downstream review burden for others?), and skill development (are junior workers getting practice at the cognitive operations they’ll need to do unassisted later?). None of these is in scope for the model in this pipeline.

Second, the persuasion-bombing finding has team implications. If verification breaks under free-form dialogue with the AI, and if a non-trivial fraction of team output goes through that mode, the team is shipping work whose epistemic status is “AI’s preferred output that the human eventually agreed with.” This is qualitatively different from “AI output validated by independent human judgment.” For high-stakes work the difference matters. Norms around rubric-based verification (write the rubric before the output, score against it, don’t dialogue) help here.

Third, the calibration burden is real. Each worker on a team has their own c_H on each task type; AI capability c_AI may also be perceived differently across workers. Some shared calibration work — anchored to specific tasks, periodically re-run as model capability shifts — pays off because it makes individual workflow choices less idiosyncratic.

7.3 For thinking about the longer trajectory

Three considerations that are specifically about the multi-year horizon, beyond the static optimisation the model covers.

First, skill matters more over a longer horizon. The Bastani 17 pp drop is the cumulative effect of a four-session experiment in unfettered AI use. Sustained over years on a skill you care about, the compounded effect is much larger. The skill-value parameter λ in the model is doing a lot of work here — for skills you intend to actively maintain, spec-driven is the corner even when the immediate-quality calculus would favour self-automator. The right time horizon to evaluate λ on is years, not days.

Second, calibration is the binding constraint. The whole framework assumes the worker has reasonable estimates of c_H (their own capability) and c_AI (the AI’s capability) on the task in front of them. As model capabilities shift and as workers move across task types, the calibration drifts. Spec-driven workflows are the cheapest mechanism for refreshing it — each verified instance is one data point. A worker who only ever uses self-automator on routine tasks and do-yourself on novel ones will be most poorly calibrated on the new spec-driven possibilities each model release opens up.

Third, the workflow architecture itself is changing. The interactive tools this pipeline analyses today (chat-based LLMs, coding assistants, simple agents) are not the workflow surface of three years from now. The framework here is built to survive that change because it’s parameterised by capability, not specialised to any particular interaction modality. But the specific tools and the specific shape of “verification” will evolve. The principles (three corners, rubric-based verification, workflow > capability) should remain stable; the implementation will move.

8. What this pipeline does and doesn’t establish

A few things to be clear about.

What the pipeline establishes:

- That on any single task the optimal workflow is one of three discrete corners, never an interior policy. This is a mathematical property of the formal model and survives recalibration of the parameters.
- That on the same task mix and same AI capability, per-task corner routing dominates flat-cyborg “use AI a little on everything.” This is the headline S1 claim and is supported by the Vaccaro 2024 meta plus convergent within-study evidence.
- That verification-as-rubric is structurally different from verification-as-conversation, and that the second is vulnerable to AI persuasion escalation documented in Randazzo HBS 26-021.
- That coding-cyborg verification is substantially more expensive than the model’s lit-review default assumed (Mozannar 2024 CHI: φ ≈ 1.6, about 5× the default 0.30), implying the spec-driven corner is correctly hard to reach in that regime and self-automator deserves serious consideration on tasks where the verified-error rate is low.
- That skill atrophy under unverified delegation is real and direction-confirmed by Bastani 2025, with the precise per-task rate bracketed by an interpretive ambiguity (per-problem vs per-session) the data stage flags for future model-stage cleanup.

What the pipeline does NOT establish:

- Whether individual-level optimal routing aggregates to firm-level or economy-level productivity. Humlum-Vestergaard’s aggregate-zero is real, the model is silent on it, and the right artifact for that question is a sibling at the organisational level.
- The full structural fix for persuasion bombing. The rubric-not-dialogue advice mitigates the channel; it does not eliminate it. A future model variant would carry a verifier-capability parameter separate from generation skill.
- How to anticipate frontier migration over time. The framework is static-but-parameterised; the dynamic question of how capability and verification economics evolve as new models ship is the sibling topic navigating-ai-world.
- The right answer for tasks the model does not cleanly cover: pair programming where the AI is the senior partner, agentic workflows that span days, work whose value is in the negotiation rather than the artifact.

The honest summary: this is a useful framework for the day-level workflow decisions an individual knowledge worker makes right now. It is incomplete in named, important ways. The companion stages (lit review, topology, model, data, build) carry the audit trail at progressively more technical layers; the build artifact is the place to start if you want to apply the framework to your own work today. The other AI research topics in this collection take up the questions this one leaves open.

The thing the pipeline most wants you to take away: workflow architecture is the optimisation surface most knowledge workers are not yet treating as one. The corners are discrete, the verification has to be done a specific way, and the differentiation across tasks is the whole point. Once you start running real per-task decisions instead of a flat cyborg default, the gain is substantial, and plausibly bigger than the gain from upgrading to a more capable model. The framework here is a way to make those decisions explicit and calibrated rather than implicit and habitual. The implementation is below in the build artifact; the math is in the model; the evidence is in the data stage. Pick the entry point that fits how you read.


Iteration history

Pass 1 2026-05-05

decomposition, translation, integration, connections

Why: First draft. The pipeline produced a topology, a formalisation, a data pipeline, and a build artifact. Each is correct and useful in its own register but none is what an educated lay reader who wants to understand "how to actually use AI well as a knowledge worker" would pick up and read end-to-end. The writeup is that document: the long-form synthesis that takes everything earlier stages produced and renders it accessible without softening the technical claims.
- Wrote 3-paragraph TLDR up top covering the three-corner finding, the verification-against-rubric structural recommendation, and the headline empirical claim
- Section 1 frames why the field is harder than it looks — the productivity literature is wildly inconsistent because workflow architecture varies across studies more than model capability does
- Section 2 defines vocabulary in plain language with acronyms expanded — RCT, LLM, autonomy, verification depth, etc.
- Section 3 walks through the seven big ideas the integrated picture supports — each illustrated with concrete numbers and at least one anchor study
- Section 4 unpacks the four most common practitioner failure modes explicitly, with what each looks like, what evidence shows it fails, and the structural fix
- Section 5 names the three things the field has not resolved (aggregate-zero puzzle, persuasion-bombing, frontier migration over time)
- Section 6 converts the findings into action-relevant guidance: for individuals using AI now, for teams setting workflow norms, for people thinking about their longer trajectory
- Section 7 closes with calibrated humility about what the pipeline does and doesn't establish, plus the connection to sibling topics
Pass 2 2026-05-06

error check, cross-context verification, truth/accuracy override on bias, readability, internal consistency check

Why: Cold-reading pass 1 surfaced four real issues plus a readability gap. (a) §6.3 said the Bastani 17 pp drop was "one session's atrophy"; Bastani is FOUR 90-min sessions, not one, and the 17 pp is the cumulative effect, not per-session. (b) §3.2 over-claimed what the build artifact materialises: I wrote that the compare-strategies view materialises "flat-cyborg on frontier capability vs per-task routing on mid-tier capability," but the view actually varies workflow at fixed capability; the frontier-vs-mid-tier comparison is named in the model stage as fitting target Q5 and is not directly tested by current data. (c) §3.6 carried a sentence about consultants finding AI outputs more persuasive on second exposure and attributed this to "fluency-effect"; Randazzo HBS 26-021 documents the 14 persuasion tactics but I cannot confirm the fluency-effect-on-re-reading claim is anchored in their paper, so it is safer to drop. (d) §4.4 stated METR's slow-down was "explained by experienced developers running flat-cyborg policies"; METR does not measure workflow, so this is the framework's post-hoc hypothesis, not METR's finding. (e) Readability: §2 introduces "autonomy" and "verification depth" in plain language but the Greek letters (β, λ, c_AI, etc.) appear later in §3 without a footing, so a lay reader hits them cold.
- §6.3 Bastani correction: "The Bastani 17 pp drop is one session's atrophy under unfettered AI use" → "The Bastani 17 pp drop is the cumulative effect of a four-session experiment in unfettered AI use." Same direction; correct unit of measurement
- §3.2 build-artifact-materialises claim tightened. New phrasing: the compare-strategies view shows the workflow swing at FIXED capability (which is what it actually does); the further claim that per-task routing on mid-tier capability beats flat-cyborg on frontier capability is named as model-stage Q5 and not directly tested by current data. Removes a small over-claim about what the artifact establishes
- §3.6 removed the unanchored "consultants found AI outputs more persuasive on second exposure — the kind of fluency-effect that makes a claim seem more credible on re-reading." Randazzo HBS 26-021 documents the 14 persuasion tactics and the escalation-under-pushback dynamic; the fluency-on-rereading claim is a separate literature and I cannot confirm Randazzo et al. tested it. Dropped to avoid attributing an unverified claim to a primary source
- §4.4 METR attribution softened. Was: "METR's slow-down is plausibly explained by experienced developers running flat-cyborg policies on tasks where their skill is already very high — the formal model would predict harm here." Now: "METR's slow-down on experienced developers in their own real repositories is consistent with what the formal model would predict if those developers were running a flat-cyborg policy on tasks where their own skill is high — though the METR study itself does not measure workflow per se, so this attribution is a hypothesis the framework offers rather than a finding it establishes." Makes the speculative status explicit
- §2 added a brief symbol-glossary paragraph naming c_H, c_AI, φ, σ, λ, β, ε with their plain-language meanings and explicitly telling lay readers they do not need to memorise the symbols — "AI capability" and "human capability" carry the load. The symbols appear as parameter labels for cross-referencing the model and data stages, not as anything the writeup requires the reader to compute with
- Internal consistency re-scan after edits: TLDR claim about Randazzo 14 tactics ✓; Vaccaro 106 studies / 370 effect sizes ✓; Mozannar 51.5% / 22.4% ✓; Brynjolfsson +15% avg / +34% novice ✓ (pass-9 data correction); Randazzo 27% self-automator ✓ (pass-8 data correction); Bastani 4 sessions ✓ (fixed this pass); Dell'Acqua −19 pp ✓; Goh +2 pp ✓; Everett +9.9 / +6.8 pp ✓; Humlum 25,000 workers ✓. No stale numbers remain
Pass 3 2026-05-06

redundancy prune, scope check, internal consistency check

Why: Cold-reading pass 2, §4 ("Four traps the practitioner literature falls into") read as a restatement of §3 from a different angle. §4.1 (treating cyborg as per-task) overlapped 80% with §3.1 (three corners not five modes); §4.3 (verification as conversation) overlapped 80% with §3.6 (verify against rubric). The sibling human-psych-variation writeup handles its §3/§4 split gracefully because its §4 is structurally different: it names FOUR MOTIVATED-REASONING POSITIONS in the field-political conversation, with what-each-cites-correctly / what-each-ignores / integrated-reading. My §4 was not doing that; it was just re-saying the §3 findings under different headlines. A fresh-eyes reader hit the same points twice, and the writeup did not earn its length the second time through.
- Rewrote §4 from "four traps the practitioner literature falls into" → "four directions of motivated reasoning in the AI-use discourse"; this mirrors the sibling psych-variation structure. Four factions named: §4.1 boosters (productivity-first; cite Brynjolfsson/Cui/Peng/Noy/Bastani in-session, ignore Dell'Acqua-outside / Bastani-retest / METR / Humlum aggregate-zero); §4.2 skeptics (atrophy-and-risk first; cite Bastani retest / Dell'Acqua outside / METR / Humlum, ignore the inside-frontier RCT record AND the model's identification of those costs as specific failure modes not general properties); §4.3 cyborg orthodoxy (workflow-thoughtful; cite Mollick centaur-cyborg, ignore the three-corners finding that interior policies are structurally never per-task optimal); §4.4 aggregationists (firm-and-economy first; cite Humlum-Vestergaard aggregate-zero as dispositive, ignore that individual-level optimisation still matters for cognitive load and well-being even when earnings-effect is zero). Each section structured as Position / What this cites correctly / What this ignores / Integrated reading, parallel to psych-variation §4
- No changes to §3 — the three-corners finding (§3.1) and rubric-not-dialogue advice (§3.6) belong in the integrated-findings section; the new §4 covers them from a different angle (which faction misses them and why) rather than restating them. §3 + new §4 now read as complementary cuts, not redundant ones
- Cross-references audited. §6.2 ("the aggregate-zero puzzle is a real limit") connects to new §4.4. §7 closing ("individual-level optimal routing aggregates to firm-level — sibling artifact needed") connects to new §4.4. TLDR para 1 ("popular vocabulary describes patterns you see watching real workers across a day, but on any single task the right answer is one of those three corners") consistent with new §4.3 framing
- Internal consistency re-scan post-rewrite: all numbers stable. TLDR −19% to +127% range ✓; Vaccaro 106 / 370 ✓; Mozannar 51.5% / 22.4% ✓; Brynjolfsson +15%/+34% ✓; Randazzo 27% self-automator ✓; Bastani 4 sessions ✓; Dell'Acqua −19 pp ✓; Goh +2 pp ✓; Everett +9.9/+6.8 pp ✓; Humlum 25,000 ✓; Randazzo HBS 26-021 14 persuasion tactics ✓. Net word count: ~7,000 (was ~6,800 pass 2; +~250 net from §4 expansion). Matches the sibling psych-variation writeup length envelope (~7,600)
- PRD registry corrected: was "writeup (pass 1)" but writeup frontmatter was already at pass 2 after pass-2 frontmatter bump — pass 2 forgot to update PRD. Pass 3 updates PRD to "writeup (pass 3)" matching the new frontmatter pass number
Pass 4 2026-05-06

adversarial + steelman; fresh-eyes audit

Why: Pass 3 rewrote §4 around factions in the AI-use discourse, which engages objections from people who reject the project from one side (boosters / skeptics / etc.). What was missing was engagement with objections from a thoughtful reader who broadly ACCEPTS the project but presses on specific claims: "three corners is too coarse for real cognition," "AI capability is unobservable on the cases that matter," "workflow > capability is selectively cited." The model stage handles these technically (§10 there) but the writeup did not take them on at the lay-reader level; a serious reader finishing pass 3 could plausibly think "yes but what about..." and not find the question addressed. Also a fresh-eyes catch: an early draft of §5.1 wrote that the three-corners finding follows from "where the per-task optimum lives" without acknowledging that bilinearity is itself a deliberate simplification the model stage explicitly names as a scope-limit ("partial verification rounds to full or none"). The clean version would have been misleading about a real limit.
- Added new §5 "Three objections worth engaging head-on" with three subsections. §5.1 takes on the three-corners-is-too-coarse objection — what survives is that bilinearity in (u, v) is a simplification, and under a more realistic concave-v cost structure the verification axis could produce interior optima; the autonomy axis stays corner-bound; the framework is a coordinate system for the decision rather than a description of execution. §5.2 engages the c_AI-is-unobservable objection — what survives is that corner structure is robust across capability ranges, the interesting boundary regions are spec-driven, and verification cost is the explicit price of resolving capability uncertainty (the calibration mechanism reading from the model stage). §5.3 engages the workflow-vs-capability-is-selectively-cited objection — what survives is the modest version (within-model workflow swings comparable to a generation of capability improvement; complementarity is not automatic on average) rather than the strong cross-tier version (which the model stage names as Stage-4 fitting target Q5 on which current data is suggestive but not dispositive)
- §5.1 written in the honest-about-limitations register the model stage already named — acknowledges bilinearity is a simplification and points to the partial-verification scope-limit. The cleaner "three corners is just a coordinate system for the optimum" answer would have been misleading because the model stage itself disclaims it
- Renumbered downstream sections: §5 What the field hasn't resolved → §6; §6 What to do with this → §7; §7 What this pipeline establishes → §8. All subsection numbers within those sections (6.1, 6.2, 6.3, 7.1, 7.2, 7.3) updated to match. Body has no internal §-number cross-references (verified by grep) so no further cross-ref work required
- TLDR and §3 left intact — the integrated findings and the headline TLDR claims stand. The new §5 is an addition to the structure, not a revision of existing claims
- Internal consistency post-edit: all numbers stable; section flow now reads as motivation (§1) → vocabulary (§2) → integrated findings (§3) → factions (§4) → objections-to-framework (§5) → open-field-questions (§6) → action (§7) → close (§8). The outward-inward-outward-forward arc reads cleanly
- Body word count grew from ~7,000 to ~7,800, slightly above the sibling psych-variation envelope (~7,600) but still comparable in length. PRD registry updated to writeup (pass 4) matching the new frontmatter pass number
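
The §5.1 note above turns on a general property: a function that is bilinear in (u, v) on the unit square always attains its maximum at a corner, while a value that is concave in v can peak strictly inside the v range. The sketch below illustrates only that property. The function names, coefficients, and functional forms are hypothetical and are not the model stage's value function; reading u as the autonomy level and v as the verification level is likewise an assumption about the notation.

```python
import itertools
import math


def bilinear_value(u, v, a=-0.2, b=-0.3, c=1.0):
    """Hypothetical toy value: linear in u for fixed v, linear in v for fixed u."""
    return a * u + b * v + c * u * v


def concave_v_value(u, v, a=0.2, gain=0.6, cost=0.5):
    """Hypothetical variant: verification has diminishing returns (concave in v)."""
    return a * u + gain * math.sqrt(v) - cost * v


def argmax_on_grid(f, steps=100):
    """Brute-force search for the best (u, v) on an evenly spaced grid over [0, 1]^2."""
    grid = [i / steps for i in range(steps + 1)]
    return max(itertools.product(grid, grid), key=lambda uv: f(*uv))


if __name__ == "__main__":
    print("bilinear optimum: ", argmax_on_grid(bilinear_value))
    print("concave-v optimum:", argmax_on_grid(concave_v_value))
```

With these toy coefficients the bilinear search lands on a corner of the square, while the concave-v search keeps u at its boundary and settles on an interior v. This is the shape of the concession in §5.1: the verification axis can go interior once the value is concave in v, and the autonomy axis stays corner-bound.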

Technology Utilization Architecture

TLDR

1. Formal Models of Human-AI Task Allocation

2. Appropriate Reliance, Trust Calibration, and Verification Cost

3. The Metacognitive Bottleneck and Ironies of Generative AI

4. The Empirical Productivity Record (25+ RCTs, 2023–2026)

5. Interaction Modes: Centaur, Cyborg, Self-Automator

6. Automation Levels and Autonomy Frameworks

7. Cognitive Operation Taxonomies and Task-to-Tool Mapping

8. Practitioner Frameworks and Emerging Workflow Architectures

9. Adjacent Fields: Imported, Underutilized, and Ripe for Bridging

10. Key Researchers, Labs, and Thought Leaders

11. Open Questions, Contested Ground, and Unfilled Gaps

12. Load-Bearing Assumptions and What Would Flip Them

Adversarial Challenge to the Project Framing

13. Design Principles Supported by the Current Evidence

TLDR

The graph

How to read this graph

Node types

Edge types

Weight scale (load-bearing weight, 1–5)

1. Node catalog

A — Foundational assumptions

M — Methodological prerequisites

E — Empirical claims

L — Logical necessities

G — Generating mechanisms

S — Synthesis claims

P — Practitioner frameworks

O — Open questions

D — Distortion vectors

2. Edge catalog (key chains, not exhaustive)

3. High-stakes nodes — by structural role

3a. Foundational cruxes — collapse rebuilds regions

3b. Logical guardrails — unfalsifiable, ignored at peril

3c. Reframer mechanisms — magnitude is the live question

3d. Headline conclusion — a synthesis output, not a crux

3e. Corroborating / illustrative

3f. Distortion vectors

4. Weakest links

5. Variants

Variant A — Vulnerability (where does this break?)

Variant B — Flow (how does causation propagate?)

Variant C — Minimal claim set

Variant D — Capability-regime fragility (the topic-specific variant)

6. Stage-3 handoff

7. Next moves — three Stage-3 options

Option A — Capability × verification-cost dispatch table (decomposition)

Option B — Generator-verifier loop with autonomy slider (generating function) [recommended]

Option C — Principal-agent with imperfect agent (mechanism design)

8. Objections to this topology

9. Glossary

TLDR

Task parameters (θ)

Optimal policy

Channel decomposition

Diagnostics

How to read this stage

1. The formalisation moves

2. Variables and objects

3. The per-task value function

3.1 Quality channel Q

3.2 Attention channel A

3.3 Risk channel R

3.4 Skill channel S

3.5 Putting it together

4. Optimal policy: the three corners that win

The five practitioner modes — labels for the (u, v) plane

5. Portfolio aggregation — the day

6. Calibration anchors

7. Worked anchors against the empirical record

Cross-context note: outcome heterogeneity across these anchors

8. The five cruxes

9. Scope limits — what the model does NOT capture

10. Adversarial + steelman

Objection 1: c_AI is unobservable, so the model is unactionable on novel tasks.

Objection 2: The model recovers practitioner intuitions and adds no new predictions.

Objection 3: The model is single-shot; the topic question is dynamic.

11. Stage-4 fitting targets
