Technology Utilization Architecture
Optimal workflow architecture for an individual knowledge worker given the current AI / agent / automation toolset. Not "use AI" but the specific choreography — which tools for which cognitive operations, where human judgment is essential vs. bottleneck, and how to structure the feedback loops.
The topic running through the LLM Iterate pipeline. The question is not “does AI help” — that has been answered (15–55% productivity gains on well-bounded tasks, replicated across ~25 RCTs). The question is which workflow architecture maximizes output quality per unit of human attention, given that the binding constraint has shifted from production throughput to metacognitive load.
Stage 1 (lit review) maps three layers: a mature HCI / decision-science literature on appropriate reliance and complementarity; a classical foundation in cognitive systems engineering being re-imported (Bainbridge 1983, Klein et al. 2004, Hollnagel & Woods 2005); and a practitioner stack (Mollick, Karpathy, Anthropic, Cognition, Claude Code) that runs 12–18 months ahead of peer review. The headline finding across the empirical record is that workflow architecture predicts outcomes more reliably than which frontier model you use.
Stage 2 (topology) is the dependency graph — three foundational assumptions (attention-as-binding-constraint, jagged frontier exists, verification cost is comparable to generation cost) carry most of the inferential weight. Six crux nodes are where collapse propagates farthest. The graph also encodes how practitioner frameworks and academic theory map onto the same underlying structure.
Stages 3–5 will formalize the workflow choreography into a parameterized model (capability-by-operation × verification-cost × autonomy level → routing decision), test it against the available task-level evidence, and ship a small interactive tool for individual workflow design.
The research landscape on optimal human-AI workflow design for individual knowledge workers — three layers (HCI/decision-science, classical cognitive systems engineering, practitioner stack), 25+ RCTs, the metacognitive bottleneck reframing, and the load-bearing assumptions any formalization must survive.
TLDR
The research landscape on optimal human-AI workflow design for individual knowledge workers spans three largely disconnected layers: a mature HCI/decision-science literature on reliance and complementarity, a classical foundation in cognitive systems engineering now being re-imported, and a practitioner literature that drives terminology 12–18 months ahead of peer review. The single most important empirical finding across ~25 RCTs (2023–2026) is that workflow architecture predicts outcomes more reliably than model capability — how you structure human-AI interaction (centaur vs. cyborg, independent-then-synthesize, guardrailed vs. unfettered) matters more than which frontier model you use. A corollary is that the binding constraint on AI-augmented knowledge work is not production throughput but the metacognitive bottleneck: planning what to delegate, verifying outputs, and maintaining calibration on where AI fails.
The empirical record shows 15–55% productivity gains on well-bounded tasks, robust skill-leveling effects for novices, but contested and sometimes negative results for experts on open-ended judgment work. A major unresolved puzzle is that these micro-RCT gains produce precisely zero impact on aggregate labor-market outcomes at two-year horizons (Humlum & Vestergaard 2025). The “ironies of automation” from 1983 human-factors research replay exactly in LLM contexts: AI that handles routine work degrades human capacity to catch the rare errors where human judgment is critical (Simkute, Tankelevitch et al. 2024/2025). Practitioner frameworks — Mollick’s centaur/cyborg typology, Karpathy’s autonomy slider, Anthropic’s agent-design patterns, Cognition’s context-engineering principles — are converging toward a common architecture but lack formal integration.
The largest theoretical gap is that no unified normative framework exists for the individual knowledge worker’s daily AI workflow choreography — when to consult, delegate, verify, or refuse AI on a per-task basis. The closest academic scaffolding is Parasuraman, Sheridan & Wickens’ (2000) function-allocation model, but it was designed for system engineers, not individual users. The practitioner world has de facto answers (autonomy sliders, AI sandwiches, compound engineering), and the academic world has converging threads (CoALA cognitive architectures, HAIJCS joint cognitive systems, Tankelevitch’s metacognitive demands framework), but no synthesis yet integrates them. That integration — a formal, empirically grounded model of human-AI cognitive partnership that maximizes output quality per unit of human attention — is the field’s open frontier and the target of the next phase of this project. Six load-bearing assumptions underlying this landscape are identified in Section 12, along with the specific evidence that would flip each one — these define the risk surface for any formalization attempt.
1. Formal Models of Human-AI Task Allocation
The deepest formal literature concerns “learning to defer” (L2D) and human-AI complementarity — algorithms that decide, per instance, whether the AI or the human should handle a task. The canonical formulations (Madras et al. 2018 NeurIPS; Mozannar & Sontag 2020 ICML; Mozannar et al. 2023 AISTATS, arXiv:2301.06197) established that optimal joint human-AI assignment is computationally hard and that naive heuristic approaches systematically underperform. Wilder, Horvitz & Kamar (2020, IJCAI) operationalized “learning to complement humans” by training models end-to-end against team accuracy.
Why this matters for workflow design: These are formal proofs that the intuitive approach — “use AI when it’s better, use humans when they’re better” — is not a well-specified decision rule. Optimal allocation requires modeling the joint performance surface, not comparing solo accuracies.
The most actionable synthesis is Hemmer et al.’s “Complementarity in Human-AI Collaboration” (2025, EJIS, link), which distinguishes “complementarity potential” (the mathematical possibility of exceeding either agent alone) from “complementary team performance” (actually achieving it). They identify information asymmetry and capability asymmetry as the two sources. The uncomfortable finding: complementary team performance is rarely empirically observed despite decades of theoretical promise. Amin et al.’s (2026) Bayesian framework adds a behavioral explanation: “correlation neglect,” where humans treat AI advice as independent evidence despite shared training data, can make AI advice anti-augmentative.
Vaccaro, Almaatouq & Malone’s (2024, Nature Human Behaviour) meta-analysis provides the closest thing to a quantitative allocation rule: human-AI combinations help most when (a) humans alone outperform AI, (b) the task is creation rather than decision-making, and (c) AI handles sub-tasks rather than the whole task.
2. Appropriate Reliance, Trust Calibration, and Verification Cost
Bansal et al. (2021, CHI) established the canonical finding: AI explanations increase acceptance regardless of correctness — they do not produce complementary performance. Buçinca, Malaya & Gajos (2021, arXiv:2102.09692) showed that “cognitive forcing functions” (commit to your own answer before seeing AI) reduce overreliance, but only for users high in Need for Cognition — creating intervention-generated inequality.
The major reframing came from Vasconcelos et al. (2023, arXiv:2212.06823) and Fok & Weld (2023, arXiv:2305.07722): overreliance is a rational cost-benefit choice, not a cognitive defect. People engage with verification only when it is cheap relative to the expected payoff. This produced a methodological pivot from “outcome-graded” to “strategy-graded” reliance metrics. Buçinca et al.’s (2024/2025, CHI) offline-RL approach learns adaptive per-instance policies for what kind of AI support to provide.
The practical design implication: minimize verification cost, not maximize explanation quality. Confidence indicators and linguistic uncertainty markers shift reliance more reliably than feature-importance explanations. Microsoft Research’s 2024 synthesis (PDF) endorses this framing for generative AI.
A newly identified failure mode: sycophancy in feedback loops. Randazzo et al. (HBS WP 26-021, 2026) document that when professionals push back on incorrect AI output, the AI escalates persuasive justification rather than disclosing uncertainty, sometimes flipping correct human judgments to incorrect ones.
3. The Metacognitive Bottleneck and Ironies of Generative AI
This section covers what is arguably the most important reframing in the 2024–2026 literature. Horvitz’s (1999, CHI, link) twelve principles of mixed-initiative interfaces and Amershi et al.’s (2019, CHI) 18 guidelines for human-AI interaction remain the design base layer.
Tankelevitch, Sarkar, Sellen, Rintel et al. (CHI 2024 Best Paper, arXiv:2312.10893) introduced the metacognitive demands framework: GenAI reduces cognitive load on production but increases metacognitive load — planning goals, evaluating outputs, monitoring confidence, and deciding when to use AI at all. The optimization target shifts from throughput to metacognitive efficiency.
Simkute, Tankelevitch, Kewenig, Scott, Sellen & Rintel’s “Ironies of Generative AI” (2024/2025, IJHCI, arXiv:2402.11364) directly bridged Bainbridge’s 1983 “Ironies of Automation” to GenAI. They identify four GenAI-specific productivity losses that mirror classical automation ironies: (1) the shift from creative production to supervisory demands, (2) workflow disruptions breaking established rhythms, (3) frequent task interruptions from AI suggestions, and (4) a polarization effect where simple tasks become easier but complex ones become harder. Their proposed mitigations — continuous feedback, personalization, ecological interface design, clear task allocation — echo Bainbridge almost exactly, suggesting the field is rediscovering rather than advancing.
The CHI 2025 “Tools for Thought” workshop synthesis (Tankelevitch et al. 2025, arXiv:2508.21036) consolidates the MSR research program’s position: knowledge work is shifting from production to critical integration — decisions about when and how to use AI, how to frame tasks, and how to assess outputs. Sarkar’s “Friction-Induced AI” concept adds deliberate intervention points to improve verification short-term and prevent skill atrophy long-term.
Mozannar, Bansal, Fourney & Horvitz’s CUPS taxonomy (CHI 2024, arXiv:2210.14306) provides the empirical anatomy for coding specifically: programmers using Copilot spend large amounts of time verifying and thinking about AI suggestions. Verification time is the hidden tax, and it is substantial.
4. The Empirical Productivity Record (25+ RCTs, 2023–2026)
Stable findings. Generative AI yields 15–55% productivity gains on well-defined knowledge tasks. Time-savings are large and robustly replicated; quality effects are smaller and more variable. The headline studies:
- Brynjolfsson, Li & Raymond (2023/2025, QJE, link): 5,172 customer-support agents. +15% average, +34% for novices, ~0% for top performers.
- Noy & Zhang (2023, Science, link): 453 professional writers. 40% time reduction, 18% quality lift.
- Peng et al. (2023): GitHub Copilot RCT, +55.8% task completion speed.
- Cui et al. (2025, Management Science): 4,867 developers, +26% tasks/week. But a 2025 longitudinal case study found experienced developers gained less and sometimes slowed down (arXiv:2509.20353).
- Dell’Acqua, Mollick et al. (2023/2025, HBS, link): 758 BCG consultants. +25% speed and +40% quality on inside-frontier tasks; 19-percentage-point quality drop on outside-frontier tasks. This study coined the “jagged technological frontier” concept.
Contested findings.
Does AI help experts? The skill-leveling pattern breaks down for open-ended judgment. Otis et al. (2024): 640 Kenyan entrepreneurs over 5 months — high-baseline +15–20%, low-baseline –8–10%. METR (2025, link): 16 experienced open-source developers were 19% slower with AI in their own repos, despite predicting 24% speedup. Likely resolution: the bottleneck differs by task type — execution speed (where AI levels) vs. judgment/filtering (where AI amplifies those who already can filter).
Human+AI vs. AI alone. Goh et al. (2024, JAMA Network Open): GPT-4 alone outscored physicians + GPT-4 on diagnostic vignettes. But Everett et al. (2025, link): an “independent-then-synthesize” workflow eliminated the underperformance. Workflow architecture, not model capability, explains the discrepancy.
Long-term cognitive effects. Bastani et al. (PNAS 2025, link): AI boosted in-session math performance 48–127% but produced 17% worse unassisted performance afterward — unless AI was guardrailed to give hints rather than answers. Lee, Sarkar et al. (CHI 2025, link): 319 knowledge workers — higher AI confidence correlates with less critical thinking enacted.
The aggregate puzzle. Humlum & Vestergaard (2025, NBER 33777, link): 25,000 Danish workers across 11 exposed occupations, precise zero impact on earnings or hours at two-year horizons. This is the field’s largest unresolved tension: micro-RCT productivity does not translate to aggregate productivity. Possible mechanisms: task reorganization, weak wage pass-through, substitution effects, cross-task productivity bundling (Cowen 2026, link).
Methodological caveat: The Toner-Rodgers (2024) materials-discovery study (+44% novel materials) was publicly disavowed by MIT in May 2025 following data-integrity concerns. Widely cited but should not be treated as established fact.
5. Interaction Modes: Centaur, Cyborg, Self-Automator
Mollick’s three-mode taxonomy is now empirically grounded:
Centaurs maintain clean human/AI role separation, handing off discrete tasks based on frontier mapping. Cyborgs intertwine human and AI continuously at sub-task granularity. Randazzo, Lifshitz et al. (HBS WP 26-036, 2026) added the self-automator: full delegation with periodic oversight. Empirical distribution across 244 BCG consultants: ~60% cyborg, ~30% centaur, ~10% self-automator.
Schoenegger, Park, Karger & Tetlock’s superforecasting study (2024/2025, ACM TiiS, link) found both well-calibrated and deliberately overconfident GPT assistants improved forecasting accuracy 23–43% — suggesting much of the centaur gain comes from forced structured reasoning rather than AI advice quality. Combined with the historical chess record, this raises the question of whether the centaur advantage is a transient regime that disappears when AI exceeds humans on the full task, or a permanent feature of asymmetric cognitive strengths.
6. Automation Levels and Autonomy Frameworks
Parasuraman, Sheridan & Wickens’ (2000, link) four-function × ten-level model remains the cleanest formal scaffolding. Their four automation functions — information acquisition, information analysis, decision/action selection, action implementation — map directly onto modern LLM workflow stages (RAG/retrieval, synthesis/analysis, recommendation, tool use/code execution). Yet no one has formally re-operationalized this for LLMs.
The 2023–2026 wave: Morris et al.’s “Levels of AGI” (DeepMind, 2023, arXiv:2311.02462) separates capability from autonomy across six levels. Feng, McDonald & Zhang’s “Levels of Autonomy for AI Agents” (2025, arXiv:2506.12469) defines five user-centered roles (Operator → Collaborator → Consultant → Approver → Observer) and is the most directly applicable to individual workflow design. Anthropic’s “Measuring AI Agent Autonomy in Practice” (2025/2026, link) surveys five competing frameworks empirically.
Shneiderman’s Human-Centered AI 2D framework explicitly rejects the “more automation = less control” assumption — high automation and high human control can coexist (cameras, GPS, modern IDEs). This is a crucial conceptual move for knowledge work, where the goal is high-autonomy AI with high human oversight, not a trade-off between them.
The classical critique tempering all level-talk: Dekker & Woods’ “MABA-MABA or Abracadabra?” (2002) — automation does not merely replace human work, it transforms it. The substitution myth is alive in current LLM discourse. Every sub-task offloaded to AI creates new monitoring, verification, and coordination work.
7. Cognitive Operation Taxonomies and Task-to-Tool Mapping
Bloom’s revised taxonomy (Remember → Understand → Apply → Analyze → Evaluate → Create × factual/conceptual/procedural/metacognitive) is the most-imported cognitive framework in 2024–2026 LLM research. Empirically, LLM capability decays sharply up the Bloom hierarchy: BloomAPR (Ma et al. 2025, arXiv:2509.25465) found ~81% success at Remember-level tasks, dropping to 43% at Apply and 13–41% at Analyze. Lee et al.’s CHI 2025 survey explicitly used Bloom’s levels to show GenAI shifts cognitive labor from lower-order production to higher-order verification, integration, and stewardship.
Cognitive Task Analysis (CTA) methods (Crandall, Klein & Hoffman 2006 Working Minds; Militello & Hutton’s ACTA, PubMed) remain conspicuously underutilized. CTA is the canonical method for understanding what a knowledge worker actually does cognitively before allocating sub-tasks to AI — yet almost no production agent design uses it. Klein et al.’s macrocognition framework (sensemaking, problem detection, mental projection, coordination, PMC) is similarly absent despite obvious fit.
The cleanest bridge between classical cognitive architectures and LLM agents: Sumers, Yao, Narasimhan & Griffiths’ CoALA framework (2024, TMLR, arXiv:2309.02427), mapping LLM agents onto modular memory (working/episodic/semantic/procedural), structured action spaces, and decision cycles drawn from ACT-R and SOAR.
The practitioner world has a de facto task-routing approach that academia hasn’t formalized. Paterson’s (2026, link) empirical benchmark of 15 models across 38 real daily tasks concluded that “routing beats model selection” — the generating function is a dispatch table matching task type to tool, not a single best model. This echoes Power’s DSS taxonomy (model-/data-/knowledge-/document-/communication-driven systems) but is grounded in per-task empirical measurement rather than a priori categorization.
8. Practitioner Frameworks and Emerging Workflow Architectures
Practitioner literature is now driving the discipline. This section maps the most influential frameworks and where they converge.
Mollick’s centaur/cyborg/jagged frontier (link) and his book Co-Intelligence (2024) function as the dominant practitioner vocabulary. His four rules — always invite AI, be the human in the loop, give it a persona, assume this is the worst AI you’ll ever use — are the closest to a widely adopted practitioner heuristic set.
Karpathy’s framework (Software 1.0/2.0/3.0, jagged intelligence, anterograde amnesia, the autonomy slider, generator-verifier loop, link) gives precise vocabulary for coding workflows. The autonomy slider — instantiated in Cursor’s Tab → Cmd+K → Agent Mode progression — is the clearest practitioner instantiation of what academic autonomy taxonomies describe abstractly: a per-action user control surface.
Anthropic’s agent-design patterns (link): prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. Their multi-agent research system (link) showed orchestrator-worker patterns outperformed single-agent Claude Opus by 90.2% at ~15× token cost. Their context-engineering guide (link) formalizes compaction, just-in-time retrieval, structured memory, and subagent isolation.
Cognition’s context-engineering principles (link): the read/write asymmetry — multi-agent works for read-heavy tasks (research) but breaks for write-heavy tasks (code) unless writes are serialized. This is now consensus across Anthropic, LangChain, and Cognition.
Coding-agent workflow patterns have stabilized around three approaches: (1) continuous pairing (Cursor — taxes attention, preserves flow), (2) batch delegation (Devin — reduces presence, adds re-entry cost), and (3) spec-driven development (Harper Reed’s spec → plan → execute loops, Amazon Kiro, GitHub Spec Kit). Claude Code’s documented harness (CLAUDE.md memory, writer/reviewer, test-then-code) is the most comprehensive single-tool pattern.
Personal knowledge management + AI (Karpathy’s “LLM Wiki,” Obsidian + Claude Code patterns from Eric Ma and others) converges on: plain-text-as-substrate, persistent context files teaching personal taxonomy, reusable commands, inbox → process → integrate → review lifecycle. Every’s Dan Shipper articulates this as the “AI Sandwich” (humans frame and review; AI does the middle) and “Compound Engineering” (plan → work → review → compound).
Key practitioner insight lacking academic formalization: Cowen’s cross-task productivity bundling — per-task speedups don’t translate proportionally to aggregate productivity because related tasks are productivity-linked. This connects directly to the Humlum & Vestergaard aggregate-zero puzzle.
9. Adjacent Fields: Imported, Underutilized, and Ripe for Bridging
Joint Cognitive Systems / Cognitive Systems Engineering (Hollnagel & Woods 2005) reframes human + LLM as a single coupled system. Klein, Woods, Bradshaw, Hoffman & Feltovich’s “Ten Challenges for Making Automation a Team Player” (2004, PDF) has become the most-cited pre-LLM paper in 2024–2026 agent design. Its requirements — Basic Compact, mutual models, predictability, directability, observability, goal negotiation, attention management, common-ground repair — function as the checklist for what an agent teammate needs. Xu & Gao’s (2024, Interactions) HAIJCS framework is the cleanest bridge from CSE to LLM human-AI teaming.
Distributed cognition (Hutchins 1995) is being imported by Hutchins himself (Paris IAS 2024) and by Tao An’s “Cognitive Workspace” (2025, arXiv), which grounds LLM context management in Baddeley’s working memory model. The extended mind thesis (Clark & Chalmers 1998) has been explicitly extended to LLMs by Smart, Clowes & Clark in Synthese 2025 (link). Tong’s 2026 survey (arXiv) synthesizes the full Licklider–Engelbart–Clark lineage through to modern human-AI symbiosis.
Engelbart’s H-LAM/T (1962) — Human using Language, Artifacts, Methodology, Training — is the most under-imported framework. It required co-evolution of all four components; current AI rollouts ship the artifact (the model) while methodology and training lag. Treating H-LAM/T as a literal rollout checklist would discipline most AI deployments.
Other underutilized resources: Power’s DSS taxonomy for classifying AI tools by purpose; Nonaka’s SECI cycle being extended to human-AI knowledge creation (Böhm & Durst 2025, Matsumoto et al.); Personal Information Management (Jones, Bergman & Whittaker) providing taxonomies for the “AI second brain” movement; Endsley’s situation awareness model extended to human-AI teams in her own 2023 paper; and Wickens’ Multiple Resource Theory, which would predict tool-stack attention overload (running Cursor + ChatGPT + a meeting simultaneously) but is absent from AI workflow research.
10. Key Researchers, Labs, and Thought Leaders
Microsoft Research has the deepest portfolio: Horvitz, Kamar, Amershi, Liao, Tankelevitch, Rintel, Sarkar, Bansal, Mozannar, Buçinca. The “Tools for Thought” research program (link) and the associated CHI 2025 workshop (link) are the most concentrated effort on AI-augmented knowledge work. Stanford HAI covers empirical reliance work. MIT CSAIL + Sloan/D³ drives both formal L2D theory (Sontag, Mozannar) and field experiments (Dell’Acqua, Lakhani). Harvard SEAS/D³ hosts Buçinca, Gajos. Wharton/HBS bridges practice and research (Mollick, Lifshitz-Assaf, Kellogg). CMU HCII, UW/AI2 (Weld, Fok), and Stanford Digital Economy Lab (Brynjolfsson) round out the empirical work.
Adjacent-field bridgers: Wei Xu (HAIJCS), Smart/Clowes/Clark (extended mind), Endsley (SA), Klein/Bradshaw/Hoffman/Feltovich (CSE/NDM), Tao An (cognitive workspace), Tong (augmentation→symbiosis).
Practitioner thought leaders with formalized frameworks: Mollick (Wharton), Karpathy (independent), Willison (independent — coined the canonical agent definition, link), Schluntz & Zhang (Anthropic), Yan & Cognition Labs, Chase (LangChain), swyx (Latent Space, link), Shipper & Klaassen (Every, link), Cowen (GMU, link), and the Microsoft New Future of Work Report team.
11. Open Questions, Contested Ground, and Unfilled Gaps
Stable consensus: Explanations alone don’t yield complementary performance. AI helps novices most on well-defined tasks. Cognitive forcing reduces overreliance with equity caveats. AI homogenizes outputs at the population level. Verification cost is the binding constraint. Workflow architecture predicts outcomes better than model choice.
Genuinely contested:
- Whether AI helps experts. Skill-leveling (Brynjolfsson, Noy) vs. skill-amplifying (Otis high-baseline finding) vs. net-negative (METR). Likely resolution: the bottleneck differs — execution speed (AI levels skill) vs. judgment/filtering (AI amplifies whoever can already filter).
- Micro-to-macro translation. 15–55% RCT gains coexisting with Humlum & Vestergaard’s aggregate zero. Possible explanations: task reorganization absorbs time savings, cross-task bundling (Cowen), weak wage pass-through, measurement artifacts.
- Long-term cognitive effects. Bastani’s skill-atrophy evidence vs. Brynjolfsson’s accelerated learning curves. The guardrail design matters more than the binary of AI access vs. none.
- Human+AI vs. AI alone in expert domains. Goh’s medical finding that AI alone wins vs. Everett’s workflow-architecture fix. The claim appears workflow-dependent, not capability-dependent.
- Persistence of the centaur regime. Chess history + Schoenegger’s findings suggest centaur advantages may be transient as AI capability crosses task thresholds.
What hasn’t been formalized:
- No normative framework for individual daily workflow choreography (when to consult, delegate, verify, refuse) — the Parasuraman (2000) equivalent for end users rather than system designers.
- Context engineering remains a practitioner discipline without academic theory.
- Multi-tool attention allocation lacks quantitative models despite Wickens’ MRT being directly applicable.
- The interaction between agent autonomy level and metacognitive load is under-theorized.
- Long-term skill formation under continuous AI use lacks longitudinal data (most studies ≤6 months).
- Direct workflow-architecture comparison RCTs are rare; the field needs more studies structured like Everett 2025 and Bastani’s guardrailed vs. unguardrailed designs.
- The feedback loop between task routing, skill development, and frontier migration over time (as you get better at using AI, the frontier shifts, changing optimal allocation) has no formal model.
12. Load-Bearing Assumptions and What Would Flip Them
Any formalization built on this landscape will inherit certain assumptions. Making them explicit now disciplines the next phase.
Crux 1: “Workflow architecture > model capability.” This is the document’s central claim. It’s supported by Dell’Acqua (inside vs. outside frontier), Everett (workflow fix restoring physician+AI performance), and the general pattern that the same model yields very different outcomes under different interaction designs. But this claim is load-bearing on a specific regime: one where capability differences between frontier models are small relative to design differences between workflows. If a model capability jump is large enough that even naive workflows dramatically outperform expert workflows on current models, this claim inverts. What would flip it: A capability discontinuity (not incremental improvement) that eliminates the jagged frontier for a broad task class. The evidence base comes from a narrow window (2023–2025) of similar-capability frontier models — the claim may not survive a regime change.
Crux 2: “The metacognitive bottleneck is the binding constraint.” This assumes production has been sufficiently automated that the bottleneck has shifted upward to planning, evaluation, and calibration. But for many knowledge workers, production is still the bottleneck — they lack the time, skill, or tool access to make AI-assisted production easy. The metacognitive framing may describe elite power users, not the median worker. What would flip it: Evidence that the majority of knowledge workers are production-constrained rather than metacognition-constrained, even with AI access. Lee et al.’s CHI 2025 survey (319 workers) partially supports the metacognitive framing, but the sample skews toward workers who already use AI regularly.
Crux 3: “The jagged frontier is mappable and relatively stable.” Design principle #1 says “map your personal jagged frontier task-by-task.” This assumes the frontier is stable enough to calibrate against. But if model capabilities shift every 3–6 months, the frontier migrates faster than a user can recalibrate. What would flip it: Evidence that frontier migration rate exceeds human calibration rate — that by the time you’ve learned where GPT-4 fails, GPT-5 has moved the boundary. The likely resolution is that the frontier has stable topological features (AI is reliably good at X-type tasks, reliably bad at Y-type) even as the boundary shifts, making the shape mappable even if the exact edge is volatile. This is an empirical question the field hasn’t tested.
Crux 4: “Verification cost is the binding constraint on appropriate reliance.” The rational-cost-benefit reframing (Vasconcelos, Fok & Weld) is load-bearing on approximate rationality — that people correctly estimate when verification is worth the effort. But if people are systematically miscalibrated about AI error rates (which the sycophancy finding from Randazzo et al. directly suggests), then the binding constraint isn’t verification cost but verification calibration. The distinction matters for design: reducing cost helps a rational actor; improving calibration helps a miscalibrated one. Both interventions are different.
Crux 5: “The individual knowledge worker is the right unit of analysis.” The entire document frames optimization at the individual level. But if the Humlum & Vestergaard aggregate-zero puzzle is explained by organizational dynamics (task reallocation, managerial absorption of time savings, coordination costs), then individual workflow optimization is locally optimal but globally insufficient. The right unit might be the team or the value chain. What would flip it: Evidence that individually optimized AI workflows produce organizational friction (e.g., faster individual output creating review bottlenecks downstream, or AI-homogenized outputs reducing team diversity of thought). The Anderson et al. (2024) homogenization finding and the organizational-absorption explanation for aggregate-zero both point in this direction.
Crux 6: “The centaur/cyborg/self-automator taxonomy is durable.” It may instead be a transient artifact of current tool limitations. As tools evolve toward seamless human-AI blending (real-time co-editing, ambient AI, continuous context), the discrete modes may dissolve into a continuum. The taxonomy’s value for formalization depends on whether the modes capture something structurally real about cognitive coupling or merely describe current interface affordances. What would flip it: Evidence that as tool integration deepens, the behavioral distinction between centaur and cyborg disappears — users naturally slide between modes within a single task rather than choosing one.
Adversarial Challenge to the Project Framing
Strongest objection: “You’re trying to formalize a workflow architecture for a system where one of the components (AI capability) changes faster than any formal model can track. By the time you’ve mapped the frontier, built the model, and tested it, the frontier has moved. The practitioner literature is ahead precisely because it doesn’t try to formalize — it adapts via heuristics and rapid iteration. The academic aspiration to a ‘unified normative framework’ is a category error: this is an engineering problem requiring adaptive heuristics, not a science problem requiring formal models.”
Why this objection is partially right: The objection correctly identifies that any model parameterized on a specific capability profile (GPT-4 is good at X, bad at Y) will go stale within months. Fixed allocation rules are doomed. The practitioner instinct to stay adaptive is sound.
Why the strongest version of the project survives it: Even in rapidly changing systems, structural invariants exist that a formal model should capture. The metacognitive bottleneck doesn’t disappear when models improve — it shifts to new decisions. The verification cost trade-off doesn’t change shape when models get better — the threshold moves. The automation ironies are structural properties of any delegation relationship between a principal and an imperfect agent. What a formal model should capture is the generating function — the invariant structure that produces the right allocation given any capability profile — not a specific allocation for a specific model. The model should be parameterized by capability, not dependent on a fixed capability level. This is exactly the difference between Parasuraman’s (2000) framework (which has lasted 26 years despite massive automation changes) and any specific automation allocation table (which goes stale quickly). The target is a model that says “here is how to decide what to delegate” — not “here is what to delegate.”
13. Design Principles Supported by the Current Evidence
The empirical and theoretical record converges on a set of actionable principles for designing an individual knowledge worker’s AI workflow:
-
Map your personal jagged frontier task-by-task. Outside-frontier AI use is actively harmful, so the first design decision is calibrating which sub-tasks are inside and outside for you specifically. This frontier is personal (varies by expertise) and dynamic (shifts with practice and model updates).
-
Match interaction mode to task structure. Centaur (clean handoff) for tasks with verifiable checkpoints. Cyborg (interleaved) for creative or ill-structured work. “Independent-then-synthesize” for high-stakes expert judgment.
-
Minimize verification cost, not maximize AI capability. The binding constraint is verification, not generation. Design for structured outputs, confidence signals, and cheap-to-check formats.
-
Insert deliberate friction at decision points. Cognitive forcing functions (form your own view before seeing AI output) reduce overreliance. Sarkar’s “Friction-Induced AI” concept shows this can be built into tool design.
-
Treat context engineering as the central craft. Practitioner consensus: the binding constraint is not model intelligence but what context the model operates in. CLAUDE.md files, system prompts, persistent memory, and structured instructions are higher-leverage than model selection.
-
Route tasks, don’t pick a single best tool. Paterson’s empirical result (“routing beats model selection”) is the practitioner instantiation of L2D theory. Build a personal dispatch table matching task types to tools.
-
Preserve skills with guardrails. Hint-only AI in learning contexts. AI-free zones for capabilities you need to maintain. Bastani’s guardrailed-AI design prevented skill atrophy while preserving performance gains.
-
Serialize writes in agentic systems. Cognition’s read/write asymmetry: multi-agent is powerful for research/analysis but breaks for code/document production unless writes are serialized.
-
Watch for the metacognitive bottleneck. The limiting resource in AI-augmented work is no longer effort but judgment and attention allocation. Tankelevitch’s framework suggests optimizing for metacognitive efficiency, not throughput.
-
Budget for automation ironies. Every sub-task delegated to AI creates new monitoring, verification, and coordination work. Simkute et al.’s four productivity-loss categories are predictable and designable-against.
Dependency graph of the lit review. 66 nodes typed across nine classes (foundational assumptions / methods / empirical claims / logical necessities / generating mechanisms / synthesis / practitioner frameworks / open questions / distortions); seven cruxes (A1, A2, A3, A6, L1, L2, L3 — every weight-5 A node + every weight-5 L node) and four variant views (Vulnerability / Flow / Minimal / Capability-regime). The genuinely novel structural move for this topic: encoding the practitioner ↔ academic operationalization bridge that runs 12–18 months out of phase.
TLDR
The lit review documents the research landscape on optimal human-AI workflow design. This topology asks the sharper question: what depends on what? Strip the field down to its load-bearing structure and the picture is surprisingly clean. Four foundational assumptions sit upstream of most of the empirical and synthesis nodes — that human attention is the binding scarce resource (A1), that AI capability is heterogeneous across cognitive operations (the jagged frontier; A2), that verification cost is comparable to or cheaper than generation cost (A3), and that individual-level workflow optimization aggregates upward rather than being absorbed by organizational dynamics (A6). If any one of them flipped, large regions of the picture would have to be rebuilt. Three logical guardrails (L1 substitution myth is wrong; L2 optimal allocation needs the joint performance surface, not solo accuracies; L3 parameterize by capability) cannot be falsified at all — they can only be ignored, which is exactly how most overconfident AI-rollout discourse proceeds. Everything else is methodology, empirical claim, generating mechanism, synthesis, practitioner framework, open question, or distortion vector.
The genuinely novel structural move for this topic is the practitioner ↔ academic bridge that the new node type P and the new edge type op make explicit. The practitioner stack (Mollick, Karpathy, Anthropic, Cognition, Claude Code) and the academic literature address the same structural problems but with different methodologies and on different timescales — and the relationship is bidirectional, not one-directional. Three patterns recur on the cross-community bridge. (a) Practitioners ahead of academic measurement: Mollick proposed the centaur/cyborg typology in 2024; Randazzo 2026 (HBS) is the first peer-reviewed empirical measurement of the behavior distribution (60/30/10) the typology names. (b) Practitioners concretizing prior academic principles: Karpathy’s autonomy slider concretizes Shneiderman’s 2D framework as a per-action user control surface; the AI Sandwich and Compound Engineering loop (Shipper / Every) concretize Tankelevitch’s metacognitive-demands framework as a daily workflow practice. (c) Practitioner and academic work converging in parallel without direct lineage: Karpathy’s autonomy slider (per-action UX) and Feng 2025’s five-level academic taxonomy (per-task user roles) converged on the same discretized-levels-of-autonomy shape independently, at different granularities; Anthropic’s agent design patterns (chain / route / parallelize / orchestrator-workers / evaluator-optimizer) address the same structural problem (joint allocation under capability heterogeneity) that L2D theory formalizes, but emerged from engineering practice rather than as L2D operationalizations. Note that some P-edges in the graph are practitioner-internal rather than cross-community — e.g., P6 (spec-driven development) → S4 (context engineering as central craft) and P7 (CLAUDE.md / context files) → S4 are both practitioner concrete-technique → practitioner-coined synthesis edges, not bridge edges, and the (a)/(b)/(c) classification doesn’t apply to them. Reading the topology through this mixed bridge — rather than as a single integrated literature — disciplines the Stage-3 formalization: the model should encode invariants tracked across both communities (autonomy levels, verification gates, context structure, joint-allocation logic) parameterized by inputs the academic literature measures (capability gap, verification cost, skill-formation goals), without privileging either community as the source of authority.
The field’s weakest links are not where the popular discourse focuses. Mainstream debate contests “does AI help” — but at the well-bounded-task level the productivity record is robust (E1: 15–55% gains across ~25 RCTs). The actual fragile zones in 2026 are: (a) the aggregate-zero puzzle (E4 / O2 attacks A6) — Humlum & Vestergaard’s precise zero across 25,000 Danish workers tensions every micro-RCT result and is direct evidence against the individual-aggregation assumption that the entire individual-level frame depends on; (b) whether the centaur regime persists (E18 + S5) — Schoenegger’s finding that even deliberately overconfident GPT improves forecasting suggests much of the gain comes from forced structured reasoning, not advice quality, raising the question of whether the centaur advantage survives a capability discontinuity; (c) whether the frontier is mappable faster than it migrates (O4) — a calibration race-condition the field hasn’t tested; (d) whether the binding reliance constraint is verification cost or verification calibration (O7) — the design implication is different; (e) long-term cognitive effects of continuous AI use (O3) — Bastani’s 17% unassisted-performance drop and the broader skill-atrophy literature point to a real risk but the longitudinal data window is still under twelve months for most studies.
This topology is the input to model formalization (Stage 3). The cleanest target is a parameterized routing function: for each (task type, capability profile, verification cost, autonomy level) tuple, produce an allocation decision that maximizes expected output quality per unit of human attention. The four variant views below (Vulnerability / Flow / Minimal / Capability-regime) read the same graph through different lenses to discipline that formalization choice — the capability-regime variant in particular sorts every node into stale-on-jump / structurally invariant / regime-dependent, which directly tells the model formalization which terms must be parameters (the regime-dependent ones) and which can be invariants (the stable ones).
The graph
All 66 nodes and their dependencies. Click a node for detail; drag to rearrange.· drag empty space to pan · scroll to zoom
Click a node for its claim and load-bearing weight; hover an edge for the relation type; drag to rearrange. The variant toggles read the same graph through different lenses.
How to read this graph
Every node in the lit review collapses to one of nine types. Edges between them carry one of seven relations. Together they make the structure inspectable.
Node types
| Code | Type | What it is |
|---|---|---|
| A | Foundational assumption | A claim the field cannot operate without; if false, large downstream regions collapse |
| M | Methodological prerequisite | A study design or measurement approach that must work for the empirical claims to be testable |
| E | Empirical claim | A specific measured finding with an effect size and replication status |
| L | Logical necessity | Follows from definitions or algebra; not empirically refutable |
| G | Generating mechanism | A causal process that explains a pattern (metacognitive load, verification trade-off, ironies of automation) |
| S | Synthesis claim | An integrative statement combining multiple lower-level claims |
| P | Practitioner framework | A typology, slider, or pattern published by the practitioner stack ahead of academic formalization |
| O | Open question | Genuinely undecided with current methods or evidence |
| D | Distortion vector | Where motivated reasoning concentrates (typed by direction) |
Edge types
| Code | Edge | Meaning |
|---|---|---|
| dep | depends-on | If target collapses, source collapses |
| imp | implies | Logical implication |
| sup | empirically-supports | Evidence relation |
| conf | confounds / inflates | Artifact relationship |
| mod | moderates | Changes magnitude |
| op | operationalizes | Practitioner framework concretizes an academic claim or vice versa |
| corr | corrects | Workflow architecture corrects naive allocation |
| attacks | attacks | Distortion vector targets a specific node |
Weight scale (load-bearing weight, 1–5)
- 5 — crux node; collapse propagates across multiple sections of the lit review
- 4 — load-bearing within a section
- 3 — important but local
- 2 — corroborating
- 1 — decorative
1. Node catalog
Each node carries: type code · weight · short claim · key citation · status. Status flags: ✓ (robust/replicated), ~ (partial/qualified), ? (contested/open), ✗ (refuted, kept as historical reference).
A — Foundational assumptions
| ID | Wt | Claim | Status |
|---|---|---|---|
| A1 | 5 | Human attention is the binding scarce resource — once production is automated, the bottleneck shifts upward to planning, evaluation, calibration. (Tankelevitch 2024) | ✓ |
| A2 | 5 | AI capability is heterogeneous across cognitive operations (the jagged frontier). (Dell’Acqua 2023/2025) | ✓ |
| A3 | 5 | Verification cost is comparable to or cheaper than generation cost — otherwise rational engagement collapses. (Vasconcelos 2023; Fok & Weld 2023) | ~ |
| A4 | 4 | Knowledge work decomposes into sub-tasks that can be selectively delegated. (Vaccaro 2024 sub-task finding) | ✓ |
| A5 | 4 | The frontier has stable topological features even as the boundary shifts — i.e., it is mappable. | ? |
| A6 | 5 | Individual-level workflow optimization aggregates upward — gains aren’t fully absorbed by organizational dynamics (review bottlenecks, managerial reabsorption, coordination costs). The load-bearing assumption behind the entire individual-level framing. Humlum-Vestergaard aggregate-zero is direct evidence against. | ? |
M — Methodological prerequisites
| ID | Wt | Claim | Status |
|---|---|---|---|
| M1 | 5 | Randomized controlled trials of AI-augmented work. (~25 published 2023–2026) | ✓ |
| M2 | 4 | Strategy-graded reliance metrics (vs. outcome-graded). (Vasconcelos / Fok & Weld pivot) | ✓ |
| M3 | 4 | Field-deployed measurement (real repos, real meetings). (METR 2025) | ~ |
| M4 | 3 | Telemetry / log-based behavioral observation. (Mozannar CUPS 2024) | ✓ |
| M5 | 3 | Cognitive Task Analysis (CTA, ACTA). (Crandall, Klein & Hoffman 2006) | ~ |
E — Empirical claims
| ID | Wt | Claim | Status |
|---|---|---|---|
| E1 | 5 | 15–55% productivity gains on well-bounded knowledge tasks. (Brynjolfsson 2023; Noy 2023; Peng 2023; Cui 2025) | ✓ |
| E2 | 5 | Skill-leveling: novices gain most, top performers near-zero — on well-defined tasks. (Brynjolfsson +34%/0%) | ✓ |
| E3 | 5 | Outside-frontier AI use causes a 19-pp quality drop. (Dell’Acqua BCG study) | ✓ |
| E4 | 5 | Aggregate labor-market effects are zero at 2-year horizons. (Humlum & Vestergaard 2025, 25k workers) | ✓ |
| E5 | 4 | Explanations alone don’t yield complementary performance — they increase acceptance regardless of correctness. (Bansal 2021) | ✓ |
| E6 | 4 | Cognitive forcing reduces overreliance — but only for high-Need-for-Cognition users. (Buçinca 2021) | ✓ |
| E7 | 4 | On naive workflows, AI alone outperforms human+AI in expert domains. (Goh 2024 JAMA NO) | ✓ |
| E8 | 5 | ”Independent-then-synthesize” workflow restores complementarity in the same domain. (Everett 2025) | ✓ |
| E9 | 4 | Unguardrailed AI in learning produces 17% worse unassisted post-session performance. (Bastani PNAS 2025) | ✓ |
| E10 | 3 | Behavior distribution: ~60% cyborg, ~30% centaur, ~10% self-automator. (Randazzo HBS 26-036) | ✓ |
| E11 | 4 | LLM capability decays sharply up Bloom’s hierarchy: ~81% Remember → 13–41% Analyze. (Ma 2025) | ✓ |
| E12 | 4 | Verification time is a substantial fraction of total interaction in coding contexts. (Mozannar CUPS 2024) | ✓ |
| E13 | 4 | Sycophancy escalation: AI flips correct human judgments to incorrect on pushback. (Randazzo HBS 26-021) | ✓ |
| E14 | 4 | Multi-agent orchestrator-worker outperforms single-agent +90.2% at ~15× token cost. (Anthropic 2024/2025) | ✓ |
| E15 | 4 | Routing > model selection — task-to-tool dispatch beats single-best-model. (Paterson 2026) | ✓ |
| E16 | 3 | Higher AI confidence correlates with less critical thinking enacted. (Lee et al. CHI 2025) | ~ |
| E17 | 4 | Read/write asymmetry: multi-agent works for read-heavy tasks; breaks on write-heavy unless writes are serialized. (Cognition / Anthropic / LangChain consensus) | ✓ |
| E18 | 3 | Both calibrated AND deliberately overconfident GPT assistants improve human forecasting +23–43%. (Schoenegger 2024/2025) | ~ |
L — Logical necessities
| ID | Wt | Claim | Status |
|---|---|---|---|
| L1 | 5 | Substitution myth is wrong — every offload creates new monitoring/verification/coordination work. (Dekker & Woods 2002; Bainbridge 1983) | ✓ |
| L2 | 5 | Optimal allocation requires modeling the joint performance surface, not solo accuracies. (Madras / Mozannar L2D theory) | ✓ |
| L3 | 5 | Allocation model must be parameterized BY capability, not depend on FIXED capability — generating function vs. lookup table. | ✓ |
| L4 | 4 | High automation and high human control can coexist. (Shneiderman 2D) | ✓ |
G — Generating mechanisms
| ID | Wt | Claim | Status |
|---|---|---|---|
| G1 | 5 | Metacognitive bottleneck — load shifts from production to planning, evaluation, calibration. (Tankelevitch 2024) | ✓ |
| G2 | 5 | Ironies of automation — AI handling routine work degrades human capacity to catch rare critical errors. (Bainbridge 1983; Simkute 2024) | ✓ |
| G3 | 5 | Verification-cost trade-off — engagement is rational only when cheap. (Vasconcelos 2023) | ✓ |
| G4 | 5 | Jagged frontier mechanism — capability heterogeneous; the boundary is personal and dynamic. | ✓ |
| G5 | 4 | Correlation neglect — humans treat AI advice as independent evidence despite shared training data. (Amin 2026) | ~ |
| G6 | 3 | Cognitive forcing — committing to a view first breaks anchoring. | ✓ |
| G7 | 4 | Skill atrophy — capacities not exercised decay. | ✓ |
| G8 | 4 | Cross-task productivity bundling — speedups bottlenecked by linked tasks. (Cowen 2026) | ~ |
| G9 | 4 | Generator-verifier asymmetry — production cheap, checking expensive. (Karpathy) | ✓ |
| G10 | 3 | Multi-tool attention interference — Wickens’ Multiple Resource Theory predicts tool-stack overload (Cursor + ChatGPT + meeting incurs cost beyond the sum of per-tool costs). Mechanism well-established; absent from AI workflow research. | ~ |
S — Synthesis claims
| ID | Wt | Claim | Status |
|---|---|---|---|
| S1 | 5 | Workflow architecture predicts outcomes more reliably than model capability. | ✓ |
| S2 | 5 | The optimization target is metacognitive efficiency, not throughput. | ✓ |
| S3 | 4 | Knowledge work shifting from production to critical integration. (CHI 2025) | ✓ |
| S4 | 4 | Context engineering is the central craft. (Anthropic / Cognition consensus) | ✓ |
| S5 | 3 | Centaur advantage may be a transient regime. | ? |
P — Practitioner frameworks
| ID | Wt | Claim | Status |
|---|---|---|---|
| P1 | 5 | Mollick centaur / cyborg / self-automator typology. | ✓ |
| P2 | 4 | Karpathy autonomy slider. | ✓ |
| P3 | 4 | Anthropic agent design patterns (chain / route / parallelize / orchestrator-workers / evaluator-optimizer). | ✓ |
| P4 | 4 | Cognition read/write asymmetry as agent-system principle. | ✓ |
| P5 | 3 | Compound Engineering / AI Sandwich (Shipper, Every). | ~ |
| P6 | 3 | Spec-driven development (spec → plan → execute). | ~ |
| P7 | 3 | Personal context files (CLAUDE.md / system-prompt patterns). | ✓ |
O — Open questions
| ID | Wt | Claim | Status |
|---|---|---|---|
| O1 | 5 | Does AI help experts on open-ended judgment? | ? |
| O2 | 5 | Why are aggregate effects zero given the micro-RCT record? | ? |
| O3 | 5 | Long-term cognitive effects under continuous AI use. | ? |
| O4 | 4 | Frontier migration vs. calibration rate. | ? |
| O5 | 4 | Right unit of analysis — individual vs. team vs. value chain. | ? |
| O6 | 3 | Centaur taxonomy durable, or interface artifact? | ? |
| O7 | 4 | Verification cost vs. verification calibration as binding constraint. | ? |
D — Distortion vectors
| ID | Wt | Claim | Targets |
|---|---|---|---|
| D1 | 4 | AI-maximalist distortion — RCT gains read as aggregate revolution; ignores Humlum-Vestergaard zero. | E4, S1, O2 |
| D2 | 4 | Productivity-only distortion — counts speed gains, ignores skill atrophy and metacognitive load. | G7, S2, E9 |
| D3 | 3 | ”Just use the best model” distortion — ignores routing finding (E15) and architecture-over-capability evidence. | E15, S1, S4 |
| D4 | 3 | Practitioner-only distortion — dismisses formalization as category error. | L3, S1 |
2. Edge catalog (key chains, not exhaustive)
Foundation → Method. A2 → M1 (RCTs reveal the jagged frontier); A3 → M2 (strategy-graded metrics measure verification cost); A1 → M3 (field deployment shows real attention budget).
Method → Empirical. M1 produces the productivity record (E1–E9). M2 → E5 (Bansal explanations don’t help once strategy-graded). M2 → E12 (CUPS is the strategy-graded measure of coding). M3 → E2 (METR shows experts gain less in real repos).
Mechanism → Empirical. G1 → E12 (metacognitive load shows up as verification time). G2 → E13 (sycophancy is the rare critical error ironies-of-automation predicts will get missed). G3 → E5 (rational verification trade-off explains why explanations fail). G4 → E3 (jagged frontier produces outside-frontier harm). G7 → E9 (skill atrophy → unassisted-performance drop). G8 → E4 (cross-task bundling explains aggregate zero). G10 → E10 (cyborgs interleaving multiple tools should incur higher MRT interference than centaurs handing off discrete sub-tasks — testable but unmeasured prediction).
Empirical → Synthesis. E1, E2, E3, E8 → S1 (workflow architecture > model capability — the integrated headline). E12, E16 → S2 (metacognitive efficiency target). E11 → S3. E14, E15, E17 → S4 (context engineering as central craft). E18 → S5.
Empirical → Open. E2, E4 → O2 (the aggregate puzzle). E2 → O1 (helps experts?). E9 → O3 (long-term cognitive). E1, E3 → O4 (frontier migration). E4 → O5 (right unit). E13 → O7 (calibration vs. cost).
Logical guards. L1 → G2 (substitution myth → ironies of automation). L2 → S1 (joint surface needed). L3 → S4 (parameterize-by-capability is what makes context engineering generalize). L4 → P2 (Shneiderman 2D legitimates the autonomy slider).
Practitioner ↔ Academic (the central conceptual move of this topology, encoded as op edges). The relationship type varies — see TLDR para 2 for the (a)/(b)/(c) classification of cross-community bridge edges. P1 → E10 (Mollick named the typology; Randazzo 2026 measured the behavior distribution it predicts — pattern (a), academia retrospectively measures). P2 → L4 (Karpathy’s autonomy slider concretizes Shneiderman’s 2D framework as a per-action user control surface — pattern (b), prior academic principle made tangible), with the reverse edge L4 → P2 (Shneiderman legitimates the slider design). P3 → L2 (Anthropic’s agent design patterns and Madras / Mozannar L2D theory address the same structural problem — joint allocation under capability heterogeneity — but emerged independently — pattern (c), parallel convergence without direct lineage). P4 → E17 (Cognition’s read/write principle IS the design-actionable form of the read/write asymmetry finding — pattern (a) inverted: practitioners stated and measured it together). P5 → S2 (compound engineering / AI Sandwich applies Tankelevitch’s metacognitive-demands framework at the workflow level — pattern (b)). P6 → S4 and P7 → S4 are practitioner-internal, not bridge edges: S4 (context engineering as central craft) is itself a practitioner-coined synthesis (Anthropic / Cognition consensus), and P6 (spec-driven development) and P7 (CLAUDE.md / personal context files) are concrete practitioner techniques that operationalize that practitioner synthesis. The (a)/(b)/(c) classification doesn’t apply because both endpoints sit in the practitioner community.
Foundation → Foundation (the project-frame edges). A6 → S1 (S1 is meaningful only if individual optimization aggregates). E4 → A6 (aggregate-zero is the direct attack on A6). O2 → A6 and O5 → A6 (the open questions whose resolution will close A6 either way).
Distortion attacks. D1 → S1, E4 (treats RCT gains as aggregate proof; ignores Humlum). D2 → G7, S2, E9 (counts speed; ignores atrophy). D3 → E15, S1, S4 (ignores routing). D4 → L3, S1 (denies formalization possibility).
3. High-stakes nodes — by structural role
Six categories of structural role, sorted by how their failure modes propagate. The single most useful conceptual move is keeping the cruxes (inputs) separate from the headline (an output) and from the reframers (mechanisms whose magnitude is open). Conflating these three under a single label of “important findings” produces most of the bad-faith debate around AI workflow design.
3a. Foundational cruxes — collapse rebuilds regions
These are the input assumptions the entire individual-level framing rests on. Falsification doesn’t change interpretation; it forces rebuilding. All four are weight-5 foundational-assumption nodes (the A class).
- A1 (human attention is the binding scarce resource). If false, throughput-optimization wins and the entire metacognitive-bottleneck framing dissolves; design priorities flip back toward maximum AI delegation. The metacognitive-bottleneck mechanism G1 is the consequence of A1 + production-automation, not a separate axiom — which is why G1 isn’t itself a crux.
- A2 (jagged frontier — AI capability is heterogeneous across operations). If a capability discontinuity produced uniformly-good AI across all knowledge-work operations, the “map your frontier” design imperative collapses and outside-frontier harm (E3) dissolves.
- A3 (verification cost is comparable to or cheaper than generation cost). If verification became prohibitively expensive — AI outputs so complex or fast-moving that human checking is intractable — the rational-cost-benefit reframing of overreliance (G3) dissolves and the design problem becomes “trust without verification.”
- A6 (individual-level optimization aggregates upward). If gains are absorbed by organizational dynamics — review bottlenecks downstream, managerial reabsorption, coordination costs — individual workflow optimization is locally optimal but globally insufficient. Humlum-Vestergaard’s aggregate zero is direct evidence against. This is the project-frame crux: the entire individual-level optimization target only matters if A6 holds, and its status is genuinely open.
3b. Logical guardrails — unfalsifiable, ignored at peril
Cannot be falsified — only ignored. All three are weight-5 logical-necessity nodes (the L class).
- L1 (substitution myth is wrong — every offload creates new monitoring/verification/coordination work). Bainbridge 1983 / Dekker & Woods 2002. The structural property of any principal-agent delegation; AI-rollout discourse routinely treats it as if it could be ignored.
- L2 (optimal allocation requires modeling the joint performance surface, not solo accuracies). The Madras / Mozannar L2D formal result. A mathematical property of how joint performance combines — cannot be falsified empirically. The most common practitioner shortcut — “use AI when AI is better, use human when human is better” — implicitly assumes solo accuracies suffice; ignoring the variance and correlation structure of the two agents’ errors is exactly the move L2 forbids. The agent-design patterns (P3) that do work are the ones that respect L2 by construction.
- L3 (allocation model must be parameterized by capability, not depend on fixed capability). The generating-function-vs-lookup-table commitment. Practitioner-only frameworks routinely violate this by hardcoding “GPT-4 is good at X, bad at Y” — the pattern goes stale within months.
3c. Reframer mechanisms — magnitude is the live question
High-weight non-crux mechanism nodes whose magnitude (not existence) is what reshapes interpretation. Each is well-supported as a phenomenon; the open question is the share of variance they explain.
- G1 (metacognitive bottleneck). CHI 2024 Best Paper finding; robust phenomenologically. The open magnitude question: what share of the labor force is in the regime where G1 is binding? For workers still production-constrained, G1 is premature; for power users, it is binding. The size of each population determines whether the metacognitive-efficiency target (S2) is the right design priority for “individual workers” generically or only for a subset.
- G3 (verification-cost trade-off). Vasconcelos / Fok-Weld reframing. The mechanism is real; the open question is whether it captures most of the variance in reliance behavior, or whether verification calibration (O7) is doing the rest of the work.
- G4 (jagged frontier mechanism). The capability-heterogeneity story is robust; the open magnitude question is the rate at which the boundary shifts (O4) and whether the topological features are stable (A5).
3d. Headline conclusion — a synthesis output, not a crux
- S1 (workflow architecture > model capability). The integrated finding the topology is organized around. S1 is what the cruxes plus mechanisms produce, not a load-bearing input. It is a weight-5 synthesis node, but distinguishing “headline conclusion” from “crux” matters: cruxes are inputs whose falsification rebuilds the graph; conclusions are outputs whose falsification only means the rebuild was downward (the conclusion was wrong) rather than upward (an input was wrong).
3e. Corroborating / illustrative
Two senses of “droppable” need to be distinguished here. Some nodes can be removed without breaking S1’s conclusion but are load-bearing as exemplars of the topology’s classification structure (especially the practitioner ↔ academic bridge in TLDR para 2); others are droppable in both senses.
- Load-bearing as exemplars, droppable for S1: E10 (60/30/10 distribution — the canonical (a) bridge example, paired with P1 to demonstrate “practitioners ahead of academic measurement”; if removed, S1 still holds but the (a) example loses its empirical anchor), P5 (compound engineering / AI Sandwich — the canonical (b) bridge example, paired with S2 to demonstrate “practitioners concretizing prior academic principles”; if removed, S1 still holds but the (b) example weakens).
- Droppable in both senses: E18 (Schoenegger overconfident-AI-still-helps — interesting but tangential to S1 and not used as a bridge example), G6 (cognitive forcing as mechanism — local, not load-bearing for S1 or the bridge framing), P6 (spec-driven development — practitioner-internal P → S4 edge, not a bridge example, not load-bearing for S1).
3f. Distortion vectors
D1–D4 are pedagogically intentional, not decorative. Each names a real motivated-reasoning pattern and the specific empirical/logical claims it targets. Distortions are useful for readers calibrating where their own priors might be selecting against the evidence.
4. Weakest links
Where the graph is genuinely fragile in 2026, ranked by potential propagation if the link breaks:
- A6 / O2 / O5 (the aggregate-zero / unit-of-analysis cluster). The most consequential weakness in the graph. Humlum-Vestergaard’s precise zero across 25,000 Danish workers is direct evidence that individual workflow gains may not aggregate; if true, A6 falsifies, the project-frame inverts, and individual workflow optimization becomes locally optimal but globally insufficient. Possible mechanisms (cross-task bundling per Cowen 2026, organizational reabsorption, weak wage pass-through) are testable but untested. Until O2 is resolved, every claim about organizational benefit downstream of S1 is contingent.
- O3 (long-term cognitive effects). Bastani’s 17% drop is a single-session study with a short post-session window — the lit review documents the in-session vs. afterward contrast but no specific durability timescale. If a 2-year longitudinal RCT confirms broad skill atrophy across cognitive operations, the design implication becomes “AI-free zones” at much larger scale than current practice — and the productivity-only distortion (D2) goes from intellectually wrong to materially harmful.
- O4 (frontier migration vs. calibration rate). If model capabilities shift faster than humans can recalibrate their personal frontier maps, the “map your jagged frontier” design imperative becomes unfollowable. The likely partial resolution is that the frontier has stable topological features (A5) even when the boundary moves, but A5 is empirically untested.
- O7 (verification cost vs. calibration as binding constraint). Reducing verification cost helps a rational actor; improving calibration helps a miscalibrated one. The interventions are different. The sycophancy finding (E13) suggests calibration is a real second binding constraint.
- S5 / O6 (centaur regime persistence). If a capability discontinuity makes AI better than humans on the full task, the centaur typology becomes historical.
- A5 (frontier mappable). Empirically untested. If false, individual-level workflow design is structurally impossible at the per-task granularity the practitioner stack assumes.
5. Variants
Each variant reads the same graph through a different lens.
Variant A — Vulnerability (where does this break?)
Highlights the seven cruxes (A1, A2, A3, A6 foundational; L1, L2, L3 logical guardrails) plus the weight-5 nodes downstream. If any crux flips, propagation is concentrated through this subgraph. Useful for stress-testing: pick a crux, imagine it inverts, and trace the consequences through the highlighted subgraph.
Variant B — Flow (how does causation propagate?)
Restricts to the A → M → E → S/G cascade plus practitioner operationalizations (P → E/L/S via op edges). The “what generated what” view: foundational assumptions enable methods, methods produce empirical findings, mechanisms explain them, syntheses integrate, and practitioner frameworks operationalize the resulting design implications.
Variant C — Minimal claim set
Smallest set of claims that still yields the headline conclusion (S1: workflow architecture > model capability). Approximately: A2 + A3 + E3 + E8 + L2 + G3 + G1 + S1. Eight nodes. Removing any one breaks the qualitative shape.
Variant D — Capability-regime fragility (the topic-specific variant)
Which nodes go stale if frontier capability jumps? The central worry the lit review’s adversarial section names. Of the 66 nodes, 35 sort into one of three regime-fragility classes; the rest (methods, supporting empirical findings, distortions) are regime-orthogonal. The classification:
-
Stale-on-jump (6 nodes — likely to invert). E2 (skill-leveling: if AI exceeds top performers, the +34/0 pattern flips). E3 (outside-frontier harm: dissolves if frontier becomes uniform). E10 (60/30/10 behavior distribution: regime-bound to current tool affordances). E18 (overconfident-AI-still-helps: if AI is reliably correct, the “structured reasoning is the gain” reading dissolves). S5 (centaur transience: literally about transitioning out). P1 (Mollick taxonomy: becomes historical the way the chess centaur literature reads as historical now).
-
Stable-on-jump (18 nodes — structurally invariant). All four logical necessities: L1 (substitution myth), L2 (joint performance surface), L3 (parameterize-by-capability), L4 (autonomy + control coexist). The mechanism invariants: G1 (metacognitive load just shifts to new decisions), G2 (ironies of automation are a property of any principal-agent delegation), G3 (verification trade-off — the threshold moves but the shape is invariant), G9 (generator-verifier asymmetry), G10 (multi-tool attention interference — Wickens MRT is a property of human cognitive architecture, not of AI capability). The foundational A1 (attention-as-binding-constraint is a fact about humans, not about AI). The synthesis claims that follow from the invariants: S2 (metacognitive efficiency target), S3 (production → critical integration), S4 (context engineering as central craft). The control-surface practitioner patterns: P2 (autonomy slider), P3 (Anthropic agent design patterns), P5 (compound engineering), P6 (spec-driven development), P7 (CLAUDE.md context files).
-
Regime-dependent (11 nodes — depends on which way capability jumps). Foundational A2 (jagged frontier could become smooth or fragment into a different jagged shape), A3 (verification cost could fall further or rise as outputs become more complex), A5 (frontier-mappability depends on rate of migration), A6 (aggregation could go either way as tools get integrated). The headline S1 (workflow > capability could invert if a capability gap dominates). The high-leverage empirical claims: E14 (multi-agent could become moot if single-agent matches), E15 (routing depends on heterogeneity A2), E17 (read/write asymmetry depends on whether AI handles serial writes natively). The mechanism G4 (jagged frontier follows A2). The open question O4 (frontier migration rate IS the regime question).
The model formalization should be stable under capability change — meaning it should be parameterized by the L1/L2/L3/L4 + G1/G2/G3/G9 + A1 invariants and treat A2/A3/A6/E14/E15/E17 as inputs that vary by capability regime. The 6 stale nodes are the ones the formalization should not hardcode.
6. Stage-3 handoff
This topology is the input to model formalization. The cleanest target is a parameterized routing function:
delegate(task, worker, AI, context) → action
where the action is one of {do_yourself, consult_AI, delegate_to_AI, refuse}, and the function is parameterized by:
- Task type (per Bloom hierarchy + cognitive task analysis decomposition)
- Worker capability profile on that task type (via prior frontier mapping)
- AI capability profile on that task type (via per-task benchmark or recent personal experience — the L3 invariant means this is an input, not a hardcoded constant)
- Verification cost (function of task type and output format)
- Stakes / reversibility (high stakes → independent-then-synthesize per E8)
- Skill-formation goal (if the task is one whose capability the worker wants to maintain → guardrail mode per E9)
The stage_outputs/<topic>/<stage>.md folder convention holds for this topic too: raw working drafts live in stage_outputs/technology-utilization-architecture/<stage>.md; polished versions move into src/content/ai_research/technology-utilization-architecture/<stage>.mdx. So far the folder contains the lit review and this topology draft; subsequent stages will accumulate there.
This topology also feeds three natural sibling topics down the line. Navigating an AI World is the structural / civilizational view that holds the individual-level optimization in tension with the organizational-level disruption (its own Crux 5 / A6 sits next to mine). AI Cognitive Profile is the orthogonal view: rather than asking “how should an individual route tasks,” ask “where does AI capability diverge from human capability across the O*NET task taxonomy” — that topic supplies what this topic treats as a black-box input (the per-task capability gradient) and inversely, this topic supplies what that topic treats as a black-box (what individuals should do given the gradient). Prediction and Calibration is the natural attachment point for O7 (verification cost vs. verification calibration as binding constraint): if calibration on AI error rates is the second binding constraint behind verification cost, the calibration topic is where that gets formalized. The Stage-3 model formalization should leave clean attachment points for all three.
7. Next moves — three Stage-3 options
Three formalization paths, each with pros / cons / Stage-4 implications.
Option A — Capability × verification-cost dispatch table (decomposition)
What it is. Formalize the routing function as a 2D table: capability gap (worker minus AI on this task) × verification cost ratio (verify-cost / generate-cost). Four quadrants → four allocation rules.
- Pros. Maps cleanly onto the existing RCT record (quadrants align with Brynjolfsson, Dell’Acqua, METR, Everett). Easy to visualize. Practitioner-actionable.
- Cons. Risks violating L3 (looks like a lookup table). Mitigated if the table is generated from inputs rather than hardcoded.
- Stage-4 implication. Test against the published RCTs by classifying each into a quadrant.
Option B — Generator-verifier loop with autonomy slider (generating function) [recommended]
What it is. Formalize the workflow as a recurrent loop: at each step, the worker chooses an autonomy level (Karpathy P2 / Feng 2025 five-level operator → observer). The loop has a verification gate; the gate’s strictness is a parameter. Output: an interactive dashboard where the user inputs (task type, capability gap, verification cost, stakes, skill-formation goal) and gets a recommended autonomy level + verification cadence.
- Pros. Directly operationalizes Karpathy’s autonomy slider (P2) using Shneiderman 2D (L4) and Vasconcelos verification economics (G3). Survives capability change (L3-compatible). Naturally maps onto an interactive site component.
- Cons. Requires choosing a parameterization of “verification cost” that holds across task types — non-trivial.
- Stage-4 implication. Test against Bastani’s guardrail RCT (verification gate stringency moderating skill atrophy), Everett’s independent-then-synthesize (a specific autonomy-cadence schedule), and the Mollick centaur/cyborg empirical distribution (which loops people actually run).
Option C — Principal-agent with imperfect agent (mechanism design)
What it is. Apply contract-theory machinery (asymmetric information, monitoring vs. trust trade-off) to the single-worker / AI-tool relationship.
- Pros. Formally rigorous. Connects to a mature economic literature.
- Cons. Heavyweight; principal-agent assumes strategic agents, but LLMs are imperfect-but-non-strategic.
- Stage-4 implication. Hardest to validate against the existing RCT record.
Recommendation: Option B. The generator-verifier loop with an autonomy slider is the most directly testable, the most actionable as an interactive site artifact (Stage 5), and the most clearly L3-compatible. Specifically, Option B engages the regime-stable invariants identified in Variant D — L1 (every offload creates new monitoring work, baked into the loop’s verification gate), L2 (joint-surface allocation, baked into the autonomy-level choice), L3 (parameterized by capability inputs rather than hardcoded), G3 (verification trade-off, the parameter that sets gate strictness), G9 (generator-verifier asymmetry, the loop’s central asymmetry), and G10 (multi-tool attention interference — the loop should track concurrent-tool load as a Wickens-MRT input, not just per-task parameters; “running three coding agents in parallel” is a different regime from “running one”) — while taking A6 (individual optimization aggregates) as the explicit assumption whose falsification would mean the loop is locally optimal but globally insufficient. Option A can be a sub-component of Option B (the dispatch table determines the autonomy-level choice). Option C is held in reserve.
8. Objections to this topology
Objection 1: “The practitioner / academic split is overstated — the field is actually more integrated than the typing P-vs-S suggests.” Steelman: many researchers (Mollick, Karpathy, Tankelevitch via MSR’s Tools for Thought) span both communities; some practitioner posts (Anthropic’s effective agents) cite academic literature; the typing risks reifying a divide that’s already partial. Response: the people span both, but the artifacts arrive on different timescales and through different processes. Mollick proposed the centaur/cyborg typology in 2024; Randazzo 2026 (HBS) is the first peer-reviewed empirical measurement of the distribution. Karpathy’s autonomy slider was articulated as a tool-design concept (in his 2024 Software 1.0/2.0/3.0 talks and Cursor’s Tab → Cmd+K → Agent Mode UX) about a year before Feng 2025’s five-level academic taxonomy converged independently on a similar discretized-levels structure — neither grounds the other; they are parallel work on the same concern at different granularities (Karpathy’s is per-action, Feng’s is per-task user role). Anthropic’s agent design patterns and L2D theory likewise address the same structural problem with no direct lineage. The typing isn’t claiming a clean division; it’s making the structurally different roles visible — empirical measurement, formal-theory derivation, engineering-practice synthesis — so we don’t conflate “Anthropic shipped a pattern that works” with “L2D theory predicts this pattern is optimal.”
Objection 2: “The 7-crux selection is biased toward the framing the lit review wants.” Steelman: a critic could argue capability-first cruxes (e.g., “Bloom-level decay determines all task allocation”) are equally defensible, or that the lit-review-driven framing inherits whatever bias the lit review has. Response: the crux set is structurally typed, not picked by impact. The criterion is “claims whose collapse rebuilds regions of the graph,” and that maps cleanly onto exactly two node classes — foundational assumptions (A) and logical necessities (L). The cruxes are A1, A2, A3, A6 (every weight-5 A node) plus L1, L2, L3 (every weight-5 L node). No mechanism (G), synthesis (S), empirical (E), practitioner (P), or open (O) node is a crux, because their structural role is downstream of the cruxes — mechanisms explain, syntheses integrate, empirical findings test, practitioner frameworks operationalize, open questions sit at the frontier. The headline conclusion S1 (workflow > capability) is not a crux even though it is weight-5; it’s an output of the graph, and excluding it from the crux set is the discipline that distinguishes “what the graph rests on” from “what the graph concludes.” A capability-first alternative would have to add a foundational A node like “Bloom-level decay is the dominant capability gradient” — which the lit review doesn’t support; LLMs decay sharply on Bloom, but task allocation is also shaped by verification cost, autonomy level, and skill-formation goals. The selection is therefore lit-review-driven, but the typing rule (A + L only, every weight-5) is independent of the lit review’s framing — it is just “which nodes carry the structural role of being inputs the rest of the graph rests on.”
Objection 3: “The capability-regime fragility variant is a hedge — it concedes the project will go stale.” Steelman: Variant D essentially admits that half the empirical claims could invert under a capability jump; if so, why formalize at all? Response: Variant D is the discipline that justifies formalization. The point is to identify which nodes are stable-under-jump (the L1-L3-G1-G2-G3 invariants) and build the formalization on those. Practitioner heuristics, by contrast, are not capability-stable by design — they update faster but go stale faster too.
Objection 4: “The aggregate-zero puzzle (E4 / O2) is so consequential that the entire individual-level framing might be wrong, and the topology should pivot to the organizational level instead.” Steelman: if Humlum-Vestergaard’s zero is the true-population result, individual workflow optimization is rearranging deck chairs. Response: this is exactly what A6 (now a foundational crux) and O5 (right unit of analysis) name. The honest position is that the individual frame is one valid level of analysis — design-actionable for the worker who controls their own workflow — and the organizational frame is a separate level that should be a sibling artifact, not a replacement.
Objection 5: “Topology + model formalization is over-engineering for a moving target. Practitioner heuristics adapt faster than academic formalization can ship; by the time you finish the model, the field has moved. The honest move is to skip directly to a build artifact using current best practices.” Steelman: this is real. The lit review already documents that the practitioner stack runs 12–18 months ahead of peer review precisely because it doesn’t try to formalize. Cursor, Claude Code, and the Anthropic agent-design patterns are already shipping; users adapting heuristics in real time will outperform users waiting for an academic model. The L3 invariant (“parameterize by capability”) is meant to address this, but it’s a hope, not a guarantee — a formalization built on capability-stable invariants might still miss the regime where capability discontinuity dominates everything. Response: even granting all of that, the topology IS the artifact whose stable invariants survive capability change. Variant D identifies an 18-node regime-stable subgraph: all four logical necessities (L1 substitution myth, L2 joint surface, L3 parameterize-by-capability, L4 autonomy + control coexist), the five mechanism invariants (G1 metacognitive load shifting, G2 ironies of automation, G3 verification trade-off, G9 generator-verifier asymmetry, G10 multi-tool attention interference via Wickens MRT), the foundational A1 (attention-as-binding-constraint is a fact about humans, not AI), the synthesis claims that follow from the invariants (S2 metacognitive efficiency target, S3 production → critical integration, S4 context engineering as central craft), and the control-surface practitioner patterns (P2 autonomy slider, P3 Anthropic agent design patterns, P5 compound engineering, P6 spec-driven development, P7 CLAUDE.md context files) — none of which are capability-bound claims. Practitioner heuristics adapt faster but lossy: they don’t preserve the reasoning behind the rule, so users can’t tell when the heuristic stops applying. The 12–18 month lag goes both ways — practitioners ship faster, but they also rediscover Bainbridge 1983 forty years late. The model formalization’s value is less “predict optimal allocation” than “encode the invariants explicitly so the next capability shift doesn’t require re-litigating from scratch.” That said: the objection has bite for Stage 5 specifically. If the build artifact is a fragile prediction tool that hardcodes 2026 capability, it goes stale. If it is a frame (the autonomy slider, the verification-cost trade-off visualized) that the user fills in with their own current capability profile, it is L3-compatible and survives.
9. Glossary
- AI Sandwich — Shipper / Every: humans frame the task and review the output; AI handles the middle.
- autonomy slider — Karpathy: a per-action user control surface (Cursor: Tab → Cmd+K → Agent Mode); operationalizes Shneiderman’s 2D framework.
- Bloom’s taxonomy — Remember → Understand → Apply → Analyze → Evaluate → Create. LLM capability decays sharply going up.
- centaur — Mollick: clean human/AI role separation; discrete handoffs based on per-task frontier mapping.
- CoALA — Cognitive Architectures for Language Agents (Sumers 2024). Maps LLM agents to ACT-R / SOAR memory + action structures.
- complementarity potential vs. team performance — Hemmer 2025 distinction: mathematical possibility vs. actual achievement of human+AI exceeding either alone.
- compound engineering — Shipper / Every: plan → work → review → compound. Each loop’s outputs feed the next loop’s inputs.
- context engineering — Anthropic / Cognition: the discipline of constructing what the model sees. Practitioner consensus: higher leverage than model selection.
- CTA / ACTA — Cognitive Task Analysis / Applied CTA. Method for decomposing what a knowledge worker actually does cognitively.
- CUPS — Cognitive Use Pattern States; Mozannar 2024 telemetry-grounded taxonomy of programmer-Copilot interaction.
- cyborg — Mollick: continuous interleaving of human and AI at sub-task granularity. The empirical majority pattern.
- generator-verifier asymmetry — production cost falls toward zero with AI; verification cost stays roughly constant. Karpathy’s framing.
- H-LAM/T — Engelbart 1962: Human using Language, Artifacts, Methodology, Training. Current rollouts ship the artifact; methodology and training lag.
- HAIJCS — Human-AI Joint Cognitive System (Xu & Gao 2024). Bridge from CSE to LLM teaming.
- ironies of automation — Bainbridge 1983: AI handling routine work degrades human capacity to catch the rare critical errors.
- jagged frontier — Dell’Acqua 2023: AI capability is heterogeneous across sub-tasks; using AI inside the frontier helps, outside causes harm.
- L2D — Learning to Defer; Madras / Mozannar formal allocation theory.
- metacognitive demands — Tankelevitch 2024 best-paper framework: GenAI reduces production load but increases planning, evaluation, and calibration load.
- MABA-MABA — “Men Are Better At, Machines Are Better At.” Dekker & Woods 2002 critique of static function allocation.
- MRT (Multiple Resource Theory) — Wickens. Cognitive resources are partitioned by processing stage, sensory modality, and processing code; tasks competing for the same resource interfere superlinearly while tasks using different resources can be parallelized cheaply. Predicts tool-stack attention overload (Cursor + ChatGPT + meeting incurs cost beyond the sum). Lit review notes MRT is “absent from AI workflow research.”
- read/write asymmetry — Cognition: multi-agent works for read-heavy tasks but breaks for write-heavy unless writes are serialized.
- routing > model selection — Paterson 2026: across 38 real daily tasks and 15 models, dispatching by task type beats picking a single best model.
- self-automator — Randazzo 2026 third Mollick-typology mode: full delegation with periodic oversight.
- sycophancy in feedback loops — Randazzo HBS 26-021: AI escalates persuasive justification on pushback.
- Tools for Thought — Microsoft Research program on AI-augmented knowledge work.
- verification cost — the cost of checking whether an AI output is correct. Vasconcelos 2023 reframing: overreliance is rational when verification is expensive relative to expected payoff.
Generator-verifier loop with a per-task autonomy slider. The per-task value function V(u, v; θ) decomposes into four orthogonal channels (quality, attention, risk, skill); V is exactly bilinear in (u, v) so per-task optima land at three corners — do-yourself, self-automator, spec-driven. Centaur and cyborg arise as aggregate-level patterns from cross-sub-task corner mixing. Portfolio aggregation under a daily attention budget surfaces the Lagrangian shadow price μ that reroutes longer tasks first when budget binds. Five cruxes, six Stage-4 fitting targets, three engaged objections. Interactive two-tab dashboard included.
TLDR
The lit review documents a research landscape; the topology stripped it down to load-bearing structure. This stage formalises the cleanest target the topology surfaced: a generator-verifier loop with a per-task autonomy slider, designed to survive capability change rather than encode a snapshot of any specific model’s frontier. The optimisation target is output quality per unit of human attention for an individual knowledge worker. The formalisation makes three moves at once — decomposition (four orthogonal value channels: quality, attention, risk, skill), generating function (a per-task value function whose corner solutions reproduce the five empirically observed workflow modes), and integration (a single formalism that composes Karpathy’s slider, Mollick’s typology, Vasconcelos verification-economics, Bastani’s atrophy, Bainbridge’s substitution myth, and Madras-Mozannar L2D into one object).
The per-task value function is V(u, v; θ) = Q(u,v) − α·A(u,v) + λ·S(u,v) − σ·R(u,v), where u is the autonomy level (fraction of the task delegated to AI) and v is the verification depth (fraction of AI output independently checked). The four channels are conceptually distinct mechanisms: quality Q rewards letting the better agent do the work, with verified output achieving the complementary-product ceiling c_⋆ = c_AI + (1 − c_AI)·c_H (either AI was right or human catches the error); attention A charges for human time and for the irreducible monitoring cost even at full delegation (the L1 substitution-myth invariant baked in via ε > 0); risk R penalises uncaught AI errors at a rate proportional to stakes σ; skill S rewards practice and penalises unverified delegation (Bastani-style atrophy). V turns out to be exactly bilinear in (u, v) — collecting terms gives V = K_0 + K_u·u + K_v·v + K_uv·u·v — which means the maximum on the unit square is always at a corner. Three corners are candidates: (0, 0) do-yourself, (1, 0) self-automator, (1, 1) spec-driven (the corner (0, 1) is dominated since verifying with no AI involvement is pure cost). Centaur and cyborg modes do not arise as per-task optima — they appear only as aggregate-level patterns when a worker mixes corner policies across sub-tasks with heterogeneous θ. This is a substantive prediction, not a limitation.
The portfolio extension aggregates per-task decisions under a daily attention budget. The headline result the model is designed to produce is S1 (workflow architecture > model capability): on the same task mix and same c_AI, optimal routing dominates “max-AI” (self-automate everything) and dominates the naive flat-cyborg heuristic (u=0.7, v=0.3 everywhere). The bilinearity finding sharpens this: the naive flat-cyborg policy is exactly the failure mode — it applies an interior (u, v) value that the bilinear structure says no individual sub-task should land at. Optimal routing differentiates across tasks (different corners for different θ); the aggregate (ū, v̄) across the day looks interior because the corners differ, not because any single decision is interior. The generating function is parameterised by capability (the L3 invariant): if c_AI rises uniformly across tasks, the model rebalances; if it rises only on certain task types, the boundary of optimal u* shifts but the shape of the routing rule stays put.
This stage produces seven things: (1) a math object — V(u, v; θ) and the four-channel decomposition; (2) a workflow-mode classifier — three-corner per-task router plus a five-region label-partition of the (u, v) plane for observed worker behaviour; (3) a portfolio aggregator with budget-aware shadow-price routing (μ-binary-search over per-task α_eff = α + μ·g); (4) the interactive two-tab dashboard below; (5) five cruxes of the model (load-bearing claims whose collapse rebuilds it); (6) six Stage-4 fitting targets (parameter calibrations and qualitative predictions the data pipeline should test); (7) three engaged objections (c_AI unobservable, model just recovers practitioner intuitions, model is single-shot not strategic) with steelmen and what survives. Scope is explicit: the formalisation does not capture the aggregate-zero puzzle (E4/O2 — organisational dynamics), cross-task productivity bundling (G8/Cowen), c_AI miscalibration on novel tasks (O7), sycophancy as a verification-degrader (E13), or frontier migration over time (O4). These are named scope-limits, not silent assumptions.
Task parameters (θ)
AI mildly stronger, modestly cheap verification, moderate stakes. Optimum lands at spec-driven — AI does the synthesis, you read the output.
Optimal policy
Channel decomposition
Each bar shows that channel's contribution to V at the optimum. Q is gain over the human-only baseline; A is attention saved (or spent) vs. the M-only floor; R is the stakes-weighted risk penalty; S is the skill change weighted by λ.
Diagnostics
V(u, v; θ) = Q(u,v) − α·A(u,v) + λ·S(u,v) − σ·R(u,v). Constants: α = 1.00 (normalised), ε = 0.15 (residual attention at u=1; L1 invariant), β = 0.05 (per-task atrophy rate), M = 0.08 (routing tax). Mode thresholds: u_lo = 0.15, u_hi = 0.85, v_lo = 0.3, v_hi = 0.6. Optimum found by 41×41 grid search on the unit square. Stage-4 fitting will tighten α, ε, β, M against telemetry data; mode-distribution match against Randazzo BCG sample (~60% cyborg / ~30% centaur / ~10% self-automator) is Q3 of the named fitting targets.
How to read this stage
The dashboard above is the artifact. Everything below is the spec.
Two interactive surfaces:
-
Per-task router. Inputs:
(c_H, c_AI, φ, σ, λ)for one task. Outputs: optimal(u*, v*), the workflow-mode label, and a four-bar decomposition showing which channel dominates. Use this to answer “for a task with these characteristics, what’s the right way to use AI?” -
Day portfolio. Inputs: a basket of task types with counts. Outputs: total quality, total attention, total skill change under four strategies (always-self / max-AI / naive cyborg / optimal routing). Use this to see the S1 effect — same AI, different workflow architectures, very different outcomes.
If the dashboard says one thing and your gut says another, the diagnostic is to check (a) whether your (c_H, c_AI) estimates are calibrated and (b) whether the constants (α, ε, β, M) are calibrated for your task density. This is exactly what Stage 4 (the data pipeline) is for.
1. The formalisation moves
Three things this stage does — explicit so they’re inspectable separately.
Move 1 — Decomposition. The per-task value V splits into four orthogonal channels (Q, A, R, S). “Orthogonal” here means: each channel responds to (u, v) in a distinguishable way, so when a slider moves, the user can see which channel is driving the change. This isn’t a stylised choice — it’s how the topology’s mechanism nodes (G3, G7, G9, L1) actually compose. Without the decomposition, “the gain from AI” is a black box; with it, the user can ask “is this gain coming from quality, time-saved, or skill?” and answer.
Move 2 — Generating function. The five workflow modes (P1 / E10) are not given as a typology to be matched. They are generated as solutions to argmax V(u, v; θ) under different parameter regimes. This is the L3 invariant in operational form: change θ, the optimum moves, the mode label changes — but the function that produces the mode is fixed. A practitioner who hardcodes “use Cursor for boilerplate, ChatGPT for strategy” gets a lookup table; this gets a generating function.
Move 3 — Integration. Six prior objects compose into one: Karpathy’s autonomy slider (P2 — the u axis), Shneiderman’s 2D framework (L4 — the (autonomy, control) plane is the (u, v) plane), Vasconcelos verification-economics (G3 — the −α·v·φ term), Bastani’s guardrail finding (E9 — the −β·u·(1-v) skill term), Bainbridge’s substitution myth (L1 — the ε > 0 residual attention), and Madras-Mozannar L2D (L2 — the joint-surface optimisation at portfolio level). None of these alone is the model; the model is what they jointly imply.
What’s not yet ready for formalisation, kept in §9 as scope-limits: cross-task bundling (G8), organisational absorption (E4/O2), miscalibration of c_AI (O7), sycophancy as quality-degrader (E13), frontier migration over time (O4).
2. Variables and objects
Decision variables (per task), continuous on the unit square:
| Symbol | Range | Meaning |
|---|---|---|
u | [0, 1] | Autonomy level — fraction of generation delegated to AI |
v | [0, 1] | Verification depth — fraction of AI output independently checked |
Task parameters (the vector θ):
| Symbol | Range | Meaning |
|---|---|---|
c_H | [0, 1] | Human capability on this task type |
c_AI | [0, 1] | AI capability on this task type |
φ | ≥ 0 | Verification-cost ratio (verify-time / generate-time) |
σ | [0, 1] | Stakes — weight on uncaught-error penalty |
λ | [0, 1] | Skill-formation value — how much the worker cares about preserving this skill |
Constants (calibrated, not per-task):
| Symbol | Default | Meaning |
|---|---|---|
α | 1.00 | Attention price (normalisation) |
ε | 0.15 | Residual attention at full delegation — L1 invariant (substitution myth) |
β | 0.05 | Skill-atrophy rate per unit of unverified delegation |
M | 0.08 | Per-task metacognitive routing tax — A1/G1 invariant |
c_⋆ | c_AI + (1−c_AI)·c_H | Verified-output ceiling — “either AI got it right OR human catches the error” (complementary product of independent error events) |
The constants are calibrated to the lit-review anchors (Mozannar CUPS for ε; Bastani 17% drop for β; Tankelevitch metacognitive load for M). They can be re-fit by Stage 4 against telemetry data.
3. The per-task value function
V(u, v; θ) = Q(u, v) − α·A(u, v) + λ·S(u, v) − σ·R(u, v)
3.1 Quality channel Q
Q(u, v) = (1 − u)·c_H + u·[(1 − v)·c_AI + v·c_⋆]
With probability (1 − u) the human did the generation and quality is c_H. With probability u the AI did the generation, of which fraction (1 − v) ships unverified at quality c_AI and fraction v is verified. Verified output achieves quality c_⋆ = c_AI + (1 − c_AI)·c_H = 1 − (1 − c_H)·(1 − c_AI) — the probability that either the AI got it right or (it didn’t and) the human catches the error, treating the two error events as independent. Linear in u for fixed v; linear in v for fixed u.
The complementary-product form natively handles the deskilled-verifier limit (c_H → 0 ⇒ c_⋆ → c_AI — verification adds nothing when the human can’t recognise errors) and the verifier-stronger limit (c_H → 1 ⇒ c_⋆ → 1 — careful human verification approaches a quality ceiling). Pass 1 used c_⋆ = max(c_H, c_AI), which over-credits verification when c_H > c_AI (as if the human catches every AI error) and under-credits it when c_H < c_AI (as if a partly-skilled human catches no AI errors). The form here treats both with one expression and one structural assumption (independence of human and AI error events).
Caveat: c_H enters both the generation cost (when u = 0) and the verification benefit (the extra catching power at v > 0). Karpathy’s G9 generator-verifier-asymmetry says verification is typically easier than generation — recognising an error costs less than producing a correct answer from scratch. A more accurate model would carry a separate c_V (verifier capability) ≥ c_H for low-c_H workers. Held for Stage 4; named in §9 scope-limits.
3.2 Attention channel A
A(u, v) = (1 − u·(1 − ε)) + v·φ + M
Three pieces:
(1 − u·(1 − ε)): human-side generation cost. Atu = 0, the human does it all (cost = 1, the base generation time). Atu = 1, residual attentionεremains — every offload creates monitoring/coordination work (Bainbridge L1).ε > 0is the L1 invariant in operational form: a model withε = 0would predict that full delegation is free of attention cost, which is exactly the substitution myth.v·φ: verification cost. Linear in verification depth, scaled by the per-task verification-cost ratioφ. This is the G3 (Vasconcelos verification-economics) term.M: per-task metacognitive routing tax. Constant per task — classifying, choosing the workflow, monitoring AI for handoffs. Tankelevitch’s metacognitive-demand finding (G1) compressed to a constant; Stage 4 can test whetherMis task-type-dependent.
Note that ε is the only term that decouples attention from the (u, v) decision. It is what makes “max-AI” not free.
3.3 Risk channel R
R(u, v) = u·(1 − v)·(1 − c_AI)
Probability-of-uncaught-error: AI generated the output (u), it was not verified (1 − v), and the AI was wrong (1 − c_AI). Multiplied by stakes σ in the value function. This is the G2 (ironies-of-automation) term: rare critical errors get missed precisely when delegation is high and verification is low.
For high-σ tasks, this term is large enough to drive v → 1 (the spec-driven / independent-then-synthesize regime, E8 — Everett 2025). For low-σ tasks, it’s negligible and the optimum can sit at v = 0 without harm.
3.4 Skill channel S
S(u, v) = (1 − u) − β·u·(1 − v)
Two terms:
(1 − u): practice — the human builds skill on the fraction of the work they did themselves.−β·u·(1 − v): atrophy — unverified delegation erodes capacity. Verification preserves engagement (this is Bastani’s “hint mode” finding E9: guardrails → no atrophy). The productu·(1 − v)is exactly the self-automator regime where atrophy is fastest.
λ is the worker’s per-task valuation of preserving the skill. For tasks the worker explicitly wants to maintain capacity on (their core craft), λ is high and the skill term has bite. For boilerplate or one-off tasks, λ is low and skill is correctly ignored.
3.5 Putting it together
V(u, v; θ) = (1 − u)·c_H + u·[(1 − v)·c_AI + v·c_⋆] ← Q
− α·[(1 − u·(1 − ε)) + v·φ + M] ← −α·A
+ λ·[(1 − u) − β·u·(1 − v)] ← +λ·S
− σ·[u·(1 − v)·(1 − c_AI)] ← −σ·R
Five parameters in θ, two decisions, four channels. Exactly bilinear in (u, v): collecting terms,
V(u, v; θ) = K_0 + K_u·u + K_v·v + K_uv·u·v
with
K_0 = c_H − α − α·M + λK_u = (c_AI − c_H) + α·(1 − ε) − λ·(1 + β) − σ·(1 − c_AI)K_v = −α·φ(K_v captures the cost of verification when there is no AI to verify, i.e., atu = 0; pure cost, hence ≤ 0 — this is exactly why corner(0, 1)is dominated by(0, 0)below)K_uv = (c_⋆ − c_AI) + λ·β + σ·(1 − c_AI) = (1 − c_AI)·c_H + λ·β + σ·(1 − c_AI)(always ≥ 0 — verification gain is monotone inu)
Bilinear functions on a unit square attain their maximum at a corner. The interior critical point, when it exists, is a saddle (Hessian eigenvalues ±K_uv). So the per-task optimum is at one of the four corners — and since K_v < 0, (0, 1) is dominated by (0, 0) (verifying when no AI is involved is pure cost). The three meaningful corners:
(0, 0)— do-yourself.(1, 0)— full delegation, no verification (self-automator).(1, 1)— full delegation with full verification (spec-driven / independent-then-synthesize).
Which corner wins depends on the signs of K_u, K_u + K_uv + K_v, and K_v + K_uv — three linear comparisons over θ.
4. Optimal policy: the three corners that win
V is bilinear → max at a corner. Of the four corners, (0, 1) is dominated by (0, 0) because K_v < 0 (verifying with no AI involvement is pure cost). Three candidates remain — and a clean decision tree determines the winner:
- Spec-driven
(1, 1)wins iffK_u + K_v + K_uv > 0andK_v + K_uv > 0. - Else self-automator
(1, 0)wins iffK_u > 0. - Else do-yourself
(0, 0)wins.
Equivalently: spec-driven dominates self-automator when α·φ < K_uv (verification cost is below benefit at full delegation: α·φ < (1 − c_AI)·c_H + λ·β + σ·(1 − c_AI)); self-automator dominates do-yourself when K_u > 0 (the attention savings + AI quality gain outweigh skill loss + stakes risk).
Comparative statics — what moves the corner choice:
| Parameter increase | Effect on u* | Effect on v* (given u* = 1) |
|---|---|---|
c_AI − c_H ↑ via c_AI rising (c_H fixed) | ↑ (K_u rises) | ↓ (K_uv falls: less verification benefit) |
φ ↑ (verification expensive) | weakly ↓ via v*-flip | ↓ (K_v more negative) |
σ ↑ (stakes) | weakly ↓ (K_u falls by σ·(1−c_AI)) | ↑ (K_uv rises by σ·(1−c_AI)) |
λ ↑ (skill matters) | weakly ↓ (K_u falls by λ·(1+β)) | ↑ slightly (K_uv rises by λ·β) |
c_AI ↑ alone | ↑ | ↓ (both (1−c_AI)·c_H and σ·(1−c_AI) fall) |
This is the L3 invariant in tabular form: if c_AI rises uniformly, optimal u* rises and v* falls. The structural shape of the rule does not change. Two signs worth flagging because they’re non-obvious: more reliable AI (c_AI ↑) reduces verification benefit (fewer errors to catch); higher stakes (σ ↑) pushes u* down (refuse AI when stakes are high) and v* up (if you do use AI, verify carefully) — a real tension the bilinear structure makes explicit.
The five practitioner modes — labels for the (u, v) plane
The practitioner literature names five workflow modes. They span the (u, v) plane and are useful as vocabulary for labelling observed worker behaviour at any (u, v):
| Region | Mode | Practitioner anchor |
|---|---|---|
u ≈ 0 | Do-yourself | (no-AI / refuse-AI) |
u ∈ (0, 1), v high | Centaur | Mollick — clean handoff with verification gate |
u ∈ (0, 1), v low/mid | Cyborg | Mollick — interleaved, partial verification |
u ≈ 1, v high | Spec-driven / independent-then-synthesize | Everett 2025; Compound Engineering |
u ≈ 1, v low | Self-automator | Randazzo HBS 26-036 (the trap) |
The per-task router only returns the three corners — (0, 0), (1, 0), (1, 1) — corresponding to do-yourself, self-automator, and spec-driven. Centaur and cyborg do not arise as per-task optima. They appear empirically as aggregate-level labels when a worker mixes corner policies across sub-tasks with heterogeneous θ — some sub-tasks done alone, others fully delegated; some AI outputs verified, others shipped. The day-level (ū, v̄) averages out to interior values that get labelled “cyborg” or “centaur” depending on the ratio. The Day Portfolio tab demonstrates this directly.
This is a substantive prediction of the formalisation, not a limitation. Bilinearity is faithful to the structure of the problem: V collects to one bilinear form V = K_0 + K_u·u + K_v·v + K_uv·u·v, even though three of the four channels (Q, R, S) carry their own u·v terms — they sum to one consolidated K_uv. The model says: at any single sub-task with a single θ, pick a corner — fully delegate or don’t, fully verify or don’t. The naive flat-cyborg strategy (u = 0.7, v = 0.3 applied to every task) is exactly the failure mode — applying an interior policy uniformly is what no individual sub-task should do under bilinearity.
5. Portfolio aggregation — the day
A worker faces N tasks per day, each with its own θ_i. Total attention budget T. The portfolio problem:
maximise Σ_i Q_i(u_i, v_i) · count_i
subject to Σ_i A_i(u_i, v_i) · g_i · count_i ≤ T
where g_i is the task type’s base generation time. The Lagrangian gives a shadow price μ ≥ 0 on the budget constraint, and the per-task choice rule becomes
argmax V_i − μ·A_i·g_i = argmax V_i with α replaced by α_eff_i = α + μ·g_i
— that is, budget pressure raises the effective attention price, more so for longer-base-time tasks. When μ = 0 the budget is slack and per-task choice is unconstrained argmax V; when μ > 0 the budget binds and α_eff rises until total absolute attention Σ A_i·g_i·count_i fits T. The biasing is structural: raising α_eff raises K_u (self-automator becomes more attractive vs. do-yourself) and makes K_v = −α_eff·φ more negative (verification becomes less attractive). So as the budget tightens, longer tasks reroute first from spec-driven (1, 1) to self-automator (1, 0) — the lowest-A corner.
This is the L2 invariant operationalised at portfolio level. The naive practitioner rule “use AI when AI is better” compares c_H to c_AI task-by-task in isolation; the joint-surface rule compares marginal Q per unit attention saved against the day’s shadow price. They give the same answer when attention is abundant (μ ≈ 0); they diverge sharply when the day is tight.
Strategies the dashboard compares:
- Always-self.
u_i = 0for all i. No AI, no atrophy, no verification cost — but no productivity gain. The pre-AI baseline. - Max-AI.
u_i = 1, v_i = 0for all i. The corner uniformly applied. Fast, but high risk on high-σ tasks and accelerated atrophy on high-λ tasks. - Naive cyborg.
u_i = 0.7, v_i = 0.3for all i. A flat interior policy applied uniformly — exactly what the bilinear structure says no individual task should land at. Interior values arise legitimately only as averages over heterogeneous-θ sub-tasks. Applying them uniformly violates the structure and underperforms. - Optimal routing (budget-aware). Per-task corner choice under the shadow-price-adjusted
α_eff_i = α + μ·g_i, withμsolved by binary search until the budget is met (or until even uniform self-automator overflows). Each task lands at the corner appropriate to its θ and the day’s binding budget. The aggregate(ū, v̄)across the day looks interior because different tasks land at different corners; the dashboard surfacesμbelow the strategy table so the user can see when the budget is biting.
The headline prediction (S1): optimal routing dominates max-AI by quality-per-attention and dominates always-self by attention efficiency, on the same c_AI. The gap between optimal routing on mid-tier AI and naive-flat routing on frontier AI is the empirical bound the model places on “workflow architecture > model capability.” Mechanism: optimal routing differentiates across tasks (different corners for different θ) and reroutes under budget pressure (shadow price μ); naive uniformly applies an interior policy that is structurally never the per-task optimum at any α.
6. Calibration anchors
Where the constants come from. None of these is precise; all are Stage-4 fitting targets.
α = 1— normalisation. Attention is the numéraire; all other costs are denominated in attention units.ε = 0.15— Mozannar CUPS (E12) shows verification + monitoring is a “substantial fraction” of total interaction time even when AI is doing the generation. 15% is a midpoint of the reported range; Stage 4 should tighten this from telemetry.β = 0.05— Bastani’s 17% unassisted-performance drop (E9) over a session of ~30 unguardrailed tasks → ~0.5% atrophy per task atu = 1, v = 0. Settingβ = 0.05means the model implies ~5% atrophy per task in the worst-case regime, which compounds to Bastani’s order-of-magnitude over a week. The lit review explicitly notes most studies are under 12 months; β at the per-task scale is what compounds to the longitudinal scale.M = 0.08— Tankelevitch (G1) finds metacognitive load is the binding constraint for AI users; CUPS (E12) finds verification + planning consume a meaningful fraction of total time. 8% per task is a calibration anchor consistent with the magnitude of the metacognitive-bottleneck claim. Stage 4 should test whetherMvaries by task type (it likely does — high-stakes strategic decisions have larger M than routine email).
At the per-task level, the defaults predict (1, 0) self-automator at routine-low-stakes corners (Randazzo’s ~10% empirically), (1, 1) spec-driven where verification benefit dominates verification cost, and (0, 0) do-yourself in outside-frontier regimes. The full Randazzo 60/30/10 (cyborg/centaur/self-automator) distribution emerges only at the aggregate level, when a worker’s day mixes corner policies across heterogeneous-θ sub-tasks — see §4. The defaults predict outside-frontier harm (Dell’Acqua E3) when c_H > c_AI and the worker mis-routes to u > 0 (the model’s prescription is u* = 0 there); the harm is from disobeying the optimum, not from a model output.
7. Worked anchors against the empirical record
Six probes. Each picks a parameter regime, computes the corner optimum exactly, and compares to the lit-review anchor.
A. Brynjolfsson (E1, E2) — customer-service novice +34%, top performers ~0%.
Novice: θ = (c_H = 0.40, c_AI = 0.70, φ = 0.20, σ = 0.20, λ = 0.10). c_⋆ = 0.70 + 0.30·0.40 = 0.82. Then K_u = 0.30 + 0.85 − 0.105 − 0.06 = 0.985 > 0 (full delegation beats do-yourself), and K_v + K_uv = −0.20 + (0.12 + 0.005 + 0.06) = −0.015 < 0 (verification cost barely exceeds benefit). Optimum: (1, 0) self-automator. Quality lift over always-self: c_AI − c_H = +0.30. Direction matches the +34% Brynjolfsson finding (which is in resolution rate, mixing speed and accuracy).
Expert: θ = (c_H = 0.85, c_AI = 0.70, φ = 0.20, σ = 0.20, λ = 0.10). c_⋆ = 0.70 + 0.30·0.85 = 0.955. Then K_u = −0.15 + 0.85 − 0.105 − 0.06 = 0.535 > 0, and K_v + K_uv = −0.20 + (0.255 + 0.005 + 0.06) = +0.12 > 0 (verification benefit dominates because c_⋆ − c_AI = 0.255 is large — a skilled human catches AI errors). Optimum: (1, 1) spec-driven. Quality lift: c_⋆ − c_H = +0.105, only +12% relative.
So the model produces: novice at full-delegate-no-verify (Q = 0.70 from c_AI alone), expert at full-delegate-full-verify (Q = 0.955 = AI augmented by skilled human catching). Both delegate; the difference is in verification depth. The empirical “expert ~0%” finding reflects throughput ceiling (experts already at maximum call rate, can’t redeploy saved attention to more calls) rather than zero quality lift — a context the model doesn’t carry.
B. Dell’Acqua BCG (E3) — outside-frontier 19-pp quality drop. θ = (c_H = 0.70, c_AI = 0.40, φ = 0.30, σ = 0.50, λ = 0.50). c_⋆ = 0.40 + 0.60·0.70 = 0.82. Then K_u = −0.30 + 0.85 − 0.525 − 0.30 = −0.275 < 0. Optimum: (0, 0) do-yourself. If the worker mis-routes to (1, 0), quality drops from c_H = 0.70 to c_AI = 0.40 — a 30 pp loss. The empirical 19 pp reflects partial mis-routing (some subjects partially used AI, some didn’t); the model’s prediction is a clean upper bound on the harm.
C. Bastani PNAS (E9) — guardrails preserve skill. Generic learning task with high skill-formation: θ = (c_H = 0.40, c_AI = 0.70, φ = 0.30, σ = 0.20, λ = 0.50). c_⋆ = 0.82. K_u = 0.30 + 0.85 − 0.525 − 0.06 = 0.565 > 0. K_v + K_uv = −0.30 + (0.12 + 0.025 + 0.06) = −0.095 < 0. Optimum: (1, 0) self-automator.
This is a real and honest finding: at lit-review-anchored constants (β = 0.05), the skill-preservation push toward verification (λ·β·u = 0.025 at u = 1) is too small to flip the corner against verification cost. The guardrail effect is structurally present — at the (1, 1) corner, S = 0 (no atrophy) versus S = −0.05 at (1, 0), so λ·ΔS = +0.025 — but the verification cost (α·φ = 0.30) dominates. Numerically, self-automator beats spec-driven by α·φ − K_uv = 0.30 − 0.205 = 0.095 net at default constants. To make guardrails decisive (flip the corner to spec-driven), the model needs λ·β > α·φ − [(1 − c_AI)·c_H + σ·(1 − c_AI)] = 0.30 − 0.18 = 0.12. At default λ = 0.50, β must exceed 0.24; at β = 0.05, λ alone cannot flip the corner (would require λ > 2.4, impossible since λ ∈ [0, 1]); at default β = 0.05 and λ = 0.50, lowering φ flips the corner once φ < 0.205 — barely cheaper verification than the default 0.30.
Honest reading: the model says the guardrail effect is real but weak at default constants. Stage-4 fitting target Q2 is whether β should be larger to match Bastani’s empirical magnitude. This is not a model failure — it’s the model surfacing a calibration question the lit review left implicit.
D. Everett (E8) — independent-then-synthesize restores complementarity. θ = (c_H = 0.65, c_AI = 0.70, φ = 0.40, σ = 0.90, λ = 0.30). c_⋆ = 0.70 + 0.30·0.65 = 0.895. K_u = 0.05 + 0.85 − 0.315 − 0.27 = 0.315 > 0. K_v + K_uv = −0.40 + (0.195 + 0.015 + 0.27) = +0.08 > 0. Optimum: (1, 1) spec-driven. Mechanism: the σ·(1 − c_AI) = 0.27 term in K_uv makes verification valuable precisely because stakes are high and AI is fallible. Matches Everett’s lit-review story exactly.
E. Randazzo self-automator (E10) — ~10% of consultants in the trap. Routine consulting task: θ = (c_H = 0.70, c_AI = 0.85, φ = 0.20, σ = 0.20, λ = 0.10). c_⋆ = 0.955. K_u = 0.15 + 0.85 − 0.105 − 0.03 = 0.865 > 0. K_v + K_uv = −0.20 + (0.105 + 0.005 + 0.03) = −0.06 < 0. Optimum: (1, 0) self-automator. Self-automator is the correct policy at this θ — the empirical finding “~10% of consultants are self-automators” is about θ-distribution (~10% of work-instances have these characteristics), not about systematic mis-routing. A misread of Randazzo’s data as “self-automator is always wrong” is a category error the model corrects.
F. Schoenegger (E18) — even overconfident GPT improves forecasting +23–43%. This is outside the model’s current formalism. The model attributes any gain at u > 0 to c_AI (advice quality), but Schoenegger’s finding suggests structured reasoning is doing significant work independent of the AI’s confidence calibration. A constant +δ to Q whenever u > 0 would represent this — held for future passes if it changes downstream predictions. Honest gap.
Cross-context note: outcome heterogeneity across these anchors
The six papers above measure different outcome variables. The model’s Q (“probability of correct/high-quality output”) maps cleanly onto Dell’Acqua, Everett, and Schoenegger (all output-quality measures). Brynjolfsson’s “issues resolved per hour” maps approximately onto Q × call-rate, where call-rate depends on attention saved (A); the model’s +34% novice prediction is a Q-only claim, while Brynjolfsson’s empirical 34% bundles throughput. Bastani’s “unassisted retest performance” is a downstream effect of the S channel accumulated over many task instances, not a per-task Q measurement. Randazzo’s “behavioural-mode distribution” is a categorical prediction over which corner the worker chooses, not a quality measure at all. Mozannar’s CUPS data is process telemetry (time fractions across interaction states), informative for α and ε calibration but not for Q.
Stage 4 must disaggregate which constants get fit against which outcome types — pooling them as a single calibration target would silently average over methodological apples and oranges. This is named explicitly as Q1–Q6 in §11.
8. The five cruxes
Load-bearing claims of the model. Collapse rebuilds it.
C1 — Two-axis decision space (u, v). The decision is reduced to autonomy and verification depth. If the actual workflow choice space has more dimensions that matter — e.g., context-engineering depth, prompt-iteration count, tool selection — the model is incomplete. What would flip it: empirical evidence that two workflows with identical (u, v) but different context-engineering produce systematically different outcomes (which the practitioner literature suggests is real — S4).
C2 — Verification effectiveness equals generation skill. c_⋆ = c_AI + (1 − c_AI)·c_H treats c_H as both generation skill and verifier capability — a worker who’d produce 40%-quality output alone catches 40% of AI errors. Karpathy’s G9 generator-verifier-asymmetry says verification is typically easier than generation; a separate c_V ≥ c_H parameter would make low-c_H workers more effective verifiers. What would flip it: empirical evidence that verifier-recognition rates are uncorrelated with generation skill (would require introducing c_V). Deskilled-verifier worry is partially handled by the formula already (c_H → 0 ⇒ c_⋆ → c_AI), but the mechanism of skilled-but-not-creative verifier (the editor archetype) is not in scope.
C3 — ε > 0 is the right operationalisation of L1. The substitution-myth invariant is captured as residual attention at full delegation. If the actual structure is more like “delegation creates new tasks of comparable effort” rather than “delegation creates monitoring overhead”, a constant ε is wrong shape. What would flip it: evidence that delegation-induced work scales with task complexity, not as a flat constant.
C4 — Skill atrophy β is task-type-uniform. All tasks atrophy at the same rate per unit of u·(1-v). Some skills (e.g., motor-procedural) atrophy slower than others (verbal-fluency, calibration). What would flip it: longitudinal data on skill-specific atrophy rates under controlled AI-use exposure.
C5 — Tasks are independent in the portfolio. Σ_i V_i aggregates linearly. Cross-task productivity bundling (G8/Cowen) violates this — productivity gains on related tasks are correlated, not additive. What would flip it: empirical evidence that observed aggregate productivity (E4 / Humlum-Vestergaard zero) is driven by task-coupling effects, not just by individual mis-routing. This is the most likely-to-flip crux: the aggregate-zero puzzle is a smoking gun for it.
9. Scope limits — what the model does NOT capture
Honest disclosure of where the formalisation stops.
- Aggregate-zero puzzle (E4 / O2). Humlum-Vestergaard’s zero across 25,000 Danish workers is organisational, not individual — task reorganisation, managerial absorption, coordination costs. The model is individual-level (the A6 crux of the topology); it cannot rebut or explain E4 directly. This is a sibling-artifact problem (organisational-level model), not a parameter to set. The model is locally optimal at the individual level; it is silent about whether aggregate effects emerge.
- Cross-task bundling (G8).
Vis summed across tasks; in realityV_iandV_jcovary when the tasks are productivity-linked. C5 names this; the model does not encode it. - Calibration error on
c_AI(O7).c_AIis treated as known. In practice users are systematically miscalibrated (E16 — higher AI confidence → less critical thinking; E13 — sycophancy escalation). The model’sc_AIshould be the user’s belief about AI capability; if belief and reality diverge, the model is locally optimal under a wrong belief. The topology’s O7 (verification cost vs. verification calibration) is the natural extension. - Sycophancy as quality-degrader (E13). The model assumes verification only helps. Randazzo HBS 26-021 documents AI flipping correct human judgments under pushback — verification can worsen outcomes when sycophancy escalates. Not encoded; would require a
c_⋆that depends on the human’s resistance to AI-pushback. - Frontier migration (O4).
c_AIis static within a session. Over months the frontier moves; the user must recalibrate. The model is a snapshot — extending it to a dynamic version would couplec_AI(t)to a learning model of the user’s frontier-mapping rate. - Multi-tool attention interference (G10 — Wickens MRT). The model is per-task; it does not capture the cost of running Cursor + ChatGPT + Slack simultaneously. The topology added G10 specifically to flag this; Stage 4 / Stage 5 should test whether a
M_concurrent(N_tools)extension is needed. - Verifier skill ≠ generation skill (Karpathy G9). The model uses
c_Has both generation capability and verification effectiveness. Empirically, verification is often easier than generation — recognising an error costs less than producing a correct answer. Ac_V ≥ c_Hparameter would let low-c_Hworkers benefit from verification more than the current formula predicts. Held for Stage 4 / future passes; the C2 crux names this. - Partial verification (“skim”) rounds to full or none. Bilinearity makes
v* ∈ {0, 1}. The empirical reality of skimming (partial-depth verification at reduced cost and reduced effectiveness) does not map onto the model — the model rounds skim up to full-verify when verification is cheap and down to no-verify when it’s expensive. A future variant with convex verification cost (v²·φinstead ofv·φ) or with a separate verification-depth-vs-effectiveness curve would produce interiorv*solutions. Not added in this pass to preserve bilinearity’s analytical clarity.
These eight are not bugs of the formalisation. They are the boundary of what an individual-task bilinear generator-verifier loop can carry. Beyond it lies organisational design, dynamic learning, and team-level cognition — sibling artifacts.
10. Adversarial + steelman
Three objections the formalisation has not yet engaged head-on, and the strongest version that survives each.
Objection 1: c_AI is unobservable, so the model is unactionable on novel tasks.
Steelman. The optimal policy depends on c_AI, but in any new task or unfamiliar domain the worker doesn’t know c_AI ahead of time. Calibrating c_AI requires running the task and verifying the output — but the model says “compute argmax V using c_AI you don’t have.” For experienced task types c_AI is calibratable from history; for genuinely novel tasks (much of knowledge work) it isn’t. The Vasconcelos / Fok & Weld verification-economics frame already captures this — engagement is rational when verification is cheap; verifying a single AI output IS your way of measuring c_AI for that task type. The model assumes the calibration question is solved when it’s actually the binding constraint.
Why partially right. For genuinely novel tasks where c_AI is unknown, the model cannot make a precise recommendation. Stage-4 fitting target Q3 (mode-distribution match against Randazzo BCG) implicitly assumes calibrated c_AI distributions across BCG-like tasks — only valid for well-studied domains.
Why the strongest version survives. The model doesn’t need precise c_AI; it needs robustness across c_AI ranges, and the corner structure is robust. For almost any c_AI < 0.4 in a high-stakes regime (σ ≥ 0.7), the corner is do-yourself; for almost any c_AI > 0.8 in low-stakes routine (σ ≤ 0.2, low λ), the corner is self-automator. The “interesting” boundary regions — where c_AI uncertainty matters — are precisely the spec-driven regions. So the model’s prescription under uncertainty is: set v = 1 to verify and learn. The spec-driven corner doubles as a Bayesian-update mechanism; the verification cost α·v·φ is the explicit price of resolving the uncertainty. Stage 4 should formalise the explore-vs-exploit dynamics this implies (Q6 in §11).
Objection 2: The model recovers practitioner intuitions and adds no new predictions.
Steelman. Mollick, Karpathy, Anthropic, and Cognition collectively say “match the workflow to the task.” The model’s predictions of corner solutions match Randazzo’s empirical mode distribution. So what is the formalism contributing beyond a formal scaffold for intuitions practitioners already had?
Why partially right. Many model predictions match practitioner consensus on headline conclusions. Self-automator for routine, do-yourself for novel-and-high-stakes, spec-driven for high-stakes-with-cheap-verify — practitioners already say these.
Why the strongest version survives. The model contributes five things the practitioner literature does not:
- Quantitative trade-offs. The model says how much worse self-automator is than spec-driven at default constants — at the Bastani anchor specifically,
0.095net (α·φ − K_uv = 0.30 − 0.205). Practitioner literature is qualitative; this is parametrically calibratable, and Stage 4 will pin the constants. - Non-obvious simultaneous constraints. Higher stakes (
σ ↑) push BOTHuDOWN (refuse AI for high-stakes work) ANDvUP (if you do use AI, verify carefully) — a simultaneous prescription the practitioner literature does not make explicit. §4’s comparative-statics surfaces this; intuition often conflates the two. - Naive cyborg as structural failure mode. Bilinearity says applying interior
(u, v)uniformly is what no individual sub-task should do. This is a sharper criticism than “be thoughtful about which mode you use” — it identifies a specific failure mode (the BCG cyborg majority running flat-(0.7, 0.3)policies) and says they’re structurally wrong, not just suboptimal. - Budget-aware shadow-price reformulation. The
μmechanism says: under attention scarcity, reroute longer-base-time tasks first (sinceα_eff_i = α + μ·g_irises proportionally withg_i). Practitioner literature has nothing like this prescriptive structure for portfolio-level decisions. - Stage-4 calibration targets. The model identifies six specific empirical questions (Q1–Q6 in §11) that fit a generator-verifier dispatch framework. Practitioner heuristics are unfalsifiable by design — they update without preserving the reasoning, so users can’t tell when a heuristic stops applying. The L3 invariant is the antidote.
The model recovers practitioner intuitions on top-line conclusions and adds quantitative, edge-case, and falsifiable structure beyond. The contribution is in calibration, simultaneous constraints, structural failure modes, portfolio-level shadow pricing, and Stage-4 testability — not in inventing new top-level recommendations.
Objection 3: The model is single-shot; the topic question is dynamic.
Steelman. “Optimal configuration for an individual knowledge worker” is implicitly a static answer to a fundamentally dynamic question — AI capability shifts month-over-month, the worker’s skill atrophies under sustained delegation, calibration on c_AI drifts as new model versions ship. A static dispatcher with parametric flexibility doesn’t tell the worker how to anticipate and prepare for capability change.
Survives because (a) Parasuraman, Sheridan & Wickens’ (2000) function-allocation framework has been useful for 25+ years despite being static — parametric statics IS what’s wanted from a generating function (the L3 invariant); (b) the trajectory questions properly belong to sibling topics in the LLM Iterate roster — navigating-ai-world for AI-induced trajectory of skill/meaning/relational channels, the planned prediction-calibration topic for c_AI calibration drift, the planned bedrock-generating-functions for the temporal-aggregation patterns this and other models share. Static-but-parameterised is the right scope here; dynamic extensions are cross-topic by design, not gaps in the present formalisation.
11. Stage-4 fitting targets
Six named questions Stage 4 should test against data.
Q1 — (α, ε) from CUPS telemetry. Mozannar CUPS gives time fractions across coding interaction states. Calibrate ε from observed monitoring-time-at-high-u; calibrate α from observed cyborg-vs-centaur time efficiency.
Q2 — β from Bastani longitudinal regime. Bastani’s 17% drop is one window. Longer-window data (Lee-Sarkar CHI 2025; Anti-Social Century) should pin per-task atrophy rate. The model predicts: β should be ~10× larger for skills the worker uses heavily than for tasks they delegate occasionally.
Q3 — Mode-distribution match against Randazzo BCG. Given a θ-distribution prior over BCG-consultant tasks, does the optimal-routing distribution over (u*, v*) match the observed (~60% cyborg / ~30% centaur / ~10% self-automator)? If not, which constant is mis-fit?
Q4 — Outside-frontier harm prediction (Dell’Acqua replication). For subjects forced to use AI on c_H > c_AI tasks, the model predicts a quality drop proportional to u·(c_H − c_AI). Stage 4 should test the linearity and the slope.
Q5 — Workflow-architecture-vs-capability bound. The headline S1 prediction. Construct two simulated workforces: (a) frontier AI + naive flat-cyborg routing, (b) mid-tier AI + optimal routing on the same task mix. The model predicts (b) outperforms (a) on net quality once c_AI_naive falls below some threshold. Stage 4 should locate the threshold and test against Everett 2025 / Dell’Acqua at that precision.
Q6 — Calibration uncertainty and explore-vs-exploit on c_AI. The model treats c_AI as a known input. In practice, workers learn c_AI by running the task and verifying the output (Vasconcelos verification-economics). For novel tasks, the spec-driven corner doubles as a calibration mechanism — verification reveals c_AI for next time. Stage 4 should formalise the explore-vs-exploit structure: what is the optimal exploration premium when c_AI variance is high? When does verifying-to-learn dominate verifying-to-quality-control? The §9 scope-limit on c_AI miscalibration becomes a structural extension via this question, not just an acknowledged gap. Engages the §10 Objection 1 directly.
Data starting points for each Q
The natural starting point for each fitting target is the lit-review paper(s) that motivated the corresponding parameter. Honest notes on likely data availability:
- Q1 (
α,ε) — Mozannar CUPS 2024 (E12) telemetry across coding interaction states. Process-level data is likely Microsoft Research-internal; replication via Cursor / Claude Code anonymized usage logs is the alternative path. - Q2 (
β) — Bastani PNAS 2025 (E9) is the primary anchor; PNAS supplementary materials likely include the longitudinal panel needed to estimate per-task atrophy. Lee-Sarkar CHI 2025 (E16) provides multi-task panel context for a complementary fit. - Q3 (mode distribution) — Randazzo HBS WP 26-036 (E10) for the 60/30/10 BCG distribution. HBS supplementary materials may include individual-level mode tags; failing that, an in-house replication on a smaller knowledge-worker sample is tractable.
- Q4 (outside-frontier slope) — Dell’Acqua BCG study (E3); HBS data release may include individual-level outcome grades plus inside/outside frontier classification per task. The linearity test of
u·(c_H − c_AI)is a clean within-subjects design. - Q5 (workflow > capability) — Everett 2025 (E8) demonstrates the workflow-restoration mechanism but does not directly test the bound. The cleanest test is a new RCT comparing routing strategies on the same
c_AI(e.g., GPT-4 + naive flat-cyborg vs. GPT-3.5 + optimal routing on a matched task mix); existing data is suggestive, not decisive. - Q6 (explore-exploit on
c_AI) — Likely requires new experimental work or simulation. Closest analogs are the contextual-bandit and multi-armed-bandit-with-costly-verification literatures, but the knowledge-workflow application is novel; this is a Stage-4 originated study, not a re-analysis of existing data.
Stage 4’s first move is to scope data availability for Q1-Q5; Q6 may require originating new data or a simulation harness. Following the human-psych-variation pattern, the Stage-4 build should live at stage_outputs/technology-utilization-architecture/data/ with a curated CSV per fitting target, a runnable Python pipeline that reproduces every chart on the published data.mdx, and a data/out/ folder for derived outputs.
12. Connections to other topics
Where this model attaches to sibling AI’s-Research topics.
- Human-psych-variation.
λ(skill-formation value) andβ(atrophy rate) are individual differences. Need-for-Cognition (Buçinca 2021, E6) moderatesv— high-NfC users verify more. Cognitive-style covariation belongs in that topic, not here. - Navigating-AI-world. This model’s
βis the per-task version of nav-AI’sΔM_comp(competence erosion). The portfolio-levelSaggregation is the within-work-domain version of nav-AI’sΔV/ΔMtrade-off. The two models share the substitution-myth (L1) and verification-economics (G3) invariants; they differ in what they’re optimising — nav-AI optimises a life-scale meaning budget, this model optimises a workday quality-per-attention budget. - Trust architecture (planned). Sycophancy (E13) and human-side calibration on AI capability (O7) are trust-regime questions — what feedback signals make
c_AIknowable. - Prediction & calibration (planned). The topology’s O7 connection. Calibration on
c_AIis the calibration sub-problem this model treats as exogenous. - Information fidelity (planned).
φ(verification cost) depends on output-format quality and grounding — verifying a structured output with citations is cheap; verifying free prose is expensive. The information-fidelity topic should formalise what makesφlow or high. - Bedrock generating functions (planned). The four-channel decomposition
V = Q − α·A + λ·S − σ·Ris a candidate generating-function pattern: every decision under attention scarcity has the same four channels. The bedrock topic should test whether this generalises beyond AI workflow.
Glossary
- autonomy level (
u) — fraction of a task’s generation delegated to AI. Karpathy slider P2 made parametric. - verification depth (
v) — fraction of AI output independently checked by the human. c_H,c_AI— human and AI capabilities on a task type; probability of correct/high-quality output.c_⋆— verified-output ceiling;max(c_H, c_AI)under the assumption the human can verify.φ— verification-cost ratio: time-to-verify divided by time-to-generate-from-scratch.σ— stakes; weight on uncaught-error penalty.λ— skill-formation value of the task to the worker.α— attention price (utility weight on time). Normalised to 1.ε— residual attention at full delegation. The L1 substitution-myth invariant.β— per-task skill-atrophy rate under unverified delegation.M— per-task metacognitive routing tax. The G1 metacognitive-bottleneck invariant.- centaur — Mollick: clean human/AI handoff with verification gate.
(u, v)mid + high. - cyborg — Mollick: interleaved sub-task delegation with partial verification.
(u, v)mid + mid/low. - self-automator — Randazzo: full delegation, no verification.
(u, v)high + low. The atrophy-trap regime. - spec-driven / independent-then-synthesize — Everett 2025; Compound Engineering: full delegation with full verification.
(u, v)high + high. - do-yourself — no AI involvement.
u ≈ 0. - L1 (substitution myth) — every offload creates new monitoring/verification work. Encoded as
ε > 0. - L2 (joint surface) — optimal allocation requires modelling the joint performance, not comparing solo capabilities. Encoded as the portfolio-level
argmax. - L3 (parameterise by capability) — the formalism is a generating function over θ, not a lookup table. The whole structure of the model.
- G3 (verification trade-off) — engagement is rational only when verification is cheap relative to expected payoff. The
−α·v·φterm. - G7 (skill atrophy) — capacities not exercised decay. The
−β·u·(1-v)term. - G9 (generator-verifier asymmetry) — production cost falls toward zero with AI; verification cost stays roughly constant. The asymmetry between
u·ε(generation residual) andv·φ(verification full).
Empirical pipeline confronting the model's six fitting targets (Q1–Q6) with currently-published RCT and field-experiment numbers from ~22 studies. Headline findings: cyborg-coding φ ≈ 1.6 (5× the model default 0.30); β ∈ [0.028, 0.113] from Bastani with default 0.05 inside the bracket; bilinearity-implies-corner-mixing as a structural prediction; outside-frontier sanity check passes; Vaccaro 2024 meta (106 studies, 370 effects in Nature Human Behaviour) is the load-bearing evidence for the workflow > capability claim at the topic's scope. Curated CSVs (downloadable) + Python pipeline + interactive findings panel. Refinement history in frontmatter log.
TLDR
The model formalisation in stage 3 produced six named fitting targets — Q1 through Q6 — that translate the value function V(u, v; θ) = Q − α·A + λ·S − σ·R into questions the empirical record can answer. This stage confronts each target with currently-published RCT and field-experiment numbers from ~22 studies (2023–2026).
Headline findings. Verification cost is much higher than the model assumed in coding regimes (φ ≈ 1.6 from Mozannar’s CUPS data, vs the model’s lit-review-anchored default of 0.30). Skill atrophy from unverified AI delegation is real, but its magnitude calibration depends on what the model means by “task” — Bastani’s −17pp unassisted-test drop gives β ∈ [0.003, 0.011] under per-problem interpretation (default 0.05 is outside this bracket by 5–15×) or β ≈ 0.043 under per-session interpretation (default 0.05 is ~1.2× too high but inside the right neighborhood). Pass-7 corrected a 10× transcription error (passes 3–6 reported [0.028, 0.113] which was wrong — pipeline always computed [0.003, 0.011] under per-problem reading). The bilinearity result of stage 3 (per-task optima are corners, never interior) is consistent with Randazzo’s behavioural-mode distribution but not directly testable from published aggregates. Outside-frontier mis-routing produces quality drops on the order the model predicts (Dell’Acqua −19pp, METR −19%, Otis low-baseline −8%). For the headline S1 claim — workflow architecture > model capability — the load-bearing evidence is Vaccaro et al. 2024 (Nature Human Behaviour, 106 studies / 370 effect sizes), whose decision-vs-creation asymmetry is consistent with the model’s qualitative prediction (workflow choice matters more for high-σ decision tasks). Vaccaro is a moderator analysis at population scale, not a clean fit; within-study analogs (Bastani, Anthropic) and across-study comparisons (Goh→Everett +7.9pp) corroborate with scope and unit mismatches disclosed.
Verdict tally. One strong qualitative finding (Q1 — Mozannar’s published 51.5% Copilot-specific share confirms the L1 substitution-myth invariant; cyborg-regime φ much higher than default). One supported in direction and shape with bracketed magnitude (Q2 — Bastani β). Three structural/convergent/consistency claims (Q3 corner-mixing predicted by bilinearity; Q4 outside-frontier sanity check; Q5 workflow > capability via Vaccaro meta). One framed-not-resolved by design (Q6 calibration / explore-exploit on c_AI; the model treats c_AI as known, but a Monte-Carlo on uncertain c_AI shows spec-driven absorbs ~64% more variance than self-automator — the structural backbone for a future extension).
Pipeline architecture. Eight curated CSVs, one runnable Python script (pandas + numpy, ~280 lines), and a chart-ready findings.json consumed by the React panel below. Every CSV cell cites a source_key resolvable to a full citation in sources.csv. Inputs are downloadable at /data/technology-utilization-architecture/. To reproduce: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.
What the pipeline does not do. It does not produce new RCT data, analyse raw telemetry, test the aggregate-zero puzzle (E4/O2 — Humlum-Vestergaard’s zero is organisational, this model is individual-level by the C5 crux and §4 scope-limit), resolve persuasion bombing as a quality-degrader (E13), or formalise frontier migration over time (O4). These are explicit non-deliveries. What it gives stage 5 is six numerically anchored predictions with verdicts and evidence, plus a concrete tool target.
The pipeline went through three refinement passes. Successive passes uncovered and corrected: pass-1 false-precision computed off extrapolated CUPS cells (Q1), an internal contradiction in Q3, a circular slope test in Q4, and a load-bearing claim in Q5 that bundled three confounds. Pass 3 also caught a fabricated N denominator in Q2 and a unit / scope mismatch in pass-2’s Q5 promotion. Full retraction history is in the frontmatter refinementLog. The body below presents the corrected findings cleanly; readers wanting the audit trail can read the log.
The productivity record (~22 RCTs and field experiments, 2023–2026)evidence base
The empirical context for S1 (workflow architecture > model capability). 22 study rows. Pass-5 disclosure: rows mix four unit classes — flow-rate productivity (Brynjolfsson, Cui, Peng, Otis, METR, Humlum), stock-quality score lifts (Noy quality, Dell'Acqua inside, Bastani in-session, Schoenegger), absolute percentage-point swings (Goh, Everett, Dell'Acqua outside, Bastani post-test), and one relative-eval-score outlier (Anthropic +90.2%). Magnitudes within a class are directly comparable; magnitudes across classes are not (a +14% productivity gain and a +14pp test-score swing measure different things). The chart marks each row's unit class to make the comparison visible. Sienna = positive; soft-sienna = negative (METR, Otis low-baseline, Dell'Acqua outside, Bastani post-test). Humlum-Vestergaard's aggregate zero is the individual-vs-organizational scope-limit.
How to read this stage
The findings panel above is the artifact. Everything below is the spec.
Start with the Productivity record (S1) tab — that’s the empirical context: 22 studies on the same axis (% effect of AI on output), with the four mis-routed cases as red bars and the Humlum-Vestergaard aggregate-zero at the bottom. Then click through Q1–Q6: each tab shows the model’s prediction, the empirical anchor, and the verdict, with a chart that makes the comparison visible.
A few terms (defined again here so the data stage stands alone):
- u — autonomy level, fraction of a task delegated to AI.
- v — verification depth, fraction of AI output independently checked.
- c_H, c_AI — human and AI capability (probability of correct output).
- φ — verification-cost ratio (verify-time / generate-time).
- σ — stakes (weight on uncaught-error penalty).
- λ — skill-formation value (how much the worker cares about preserving this skill).
- β — per-task skill-atrophy rate under unverified delegation.
- ε — residual attention at full delegation (the L1 substitution-myth invariant).
- corner — the (u, v) optimum from
argmax V(u, v; θ)on the unit square; the three viable corners are (0, 0) do-yourself, (1, 0) self-automator, (1, 1) spec-driven.
1. Pipeline architecture
1.1 Inputs (curated)
Eight CSVs in stage_outputs/technology-utilization-architecture/data/ (also at /data/technology-utilization-architecture/):
| File | Rows | Purpose |
|---|---|---|
sources.csv | 24 | Full citations for every paper cited in any cell — the audit trail |
productivity_rcts.csv | 22 | Headline numbers from the broader RCT record; the empirical context for S1 |
cups_time_fractions.csv | 10 | Mozannar 2024 CUPS time-shares per programmer-Copilot interaction state |
bastani_longitudinal.csv | 3 | Per-condition skill-atrophy fit from Bastani PNAS 2025 |
mode_distribution.csv | 3 | Randazzo 2026 cyborg / centaur / self-automator empirical shares |
jagged_frontier.csv | 12 | (c_H, c_AI) estimates and observed quality changes for each anchor |
workflow_vs_capability.csv | 10 | Within-domain workflow comparisons holding model class roughly constant |
calibration_evidence.csv | 9 | Findings on c_AI miscalibration; the Q6 literature anchor set |
Each row in each CSV cites a primary source (column source_key). No row contains a value that doesn’t trace to a published paper. The sources.csv resolves every key to a full citation + URL.
1.2 Derived outputs (computed)
The Python script (pipeline.py) reads the inputs and writes to data/out/:
| File | Purpose |
|---|---|
findings.json | Chart-ready JSON consumed by the React findings panel |
findings_table.md | Per-Q verdict table |
bastani_atrophy_fit.csv | Per-condition implied β |
1.3 Dependencies and reproducibility
pandas, numpy. No web fetches. No external services. Runs in under 1 second on a laptop. To reproduce the entire pipeline: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.
2. Six questions, six tests
2.1 Q1 — ε and φ from CUPS telemetry
Model claim. ε = 0.15 (residual attention at full delegation; the L1 substitution-myth invariant) and φ ≈ 0.30 (verification cost as fraction of generation time).
Test. Aggregate Mozannar 2024’s published CUPS time-shares and compute implied φ for the cyborg coding regime.
Result — supported qualitatively; φ is the headline. Mozannar’s published aggregates (verified from Figure 5(b)):
| CUPS aggregate | Time share | SD |
|---|---|---|
| Total Copilot-specific (verify + defer + wait + prompt + edit) | 51.5% | 19.3 |
| Thinking/verifying suggestion | 22.4% | 12.97 |
| Writing new functionality | 14.05% | 8.36 |
| Waiting for suggestion | 4.2% | 4.46 |
The L1 substitution-myth invariant is strongly confirmed: 51.5% of session time is Copilot-specific even though Copilot is doing the generation. AI-related work consumes more than half of total session time. Cyborg-regime φ ≈ 22.4 / 14.05 ≈ 1.59 — about 5× the model’s lit-review prior of 0.30. Coding cyborg work is dramatically more verification-heavy than the default assumes. The natural model update is regime-dependent φ: cyborg-coding ~1.5; spec-driven structured-output ~0.3. The stage-5 dashboard should let the user pick a regime.
What’s not sharply calibratable from published aggregates: ε at full delegation. Mozannar’s study runs at u ≈ 0.4–0.6; the model’s ε is the residual at u = 1, and the granular wait/monitor/prompt split that would pin it is not separately reported.
2.2 Q2 — β from Bastani longitudinal panel
Model claim. β = 0.05 per task at u = 1, v = 0 — the per-task atrophy rate under unverified AI delegation.
Test. Compute implied β per Bastani 2025 condition. Design is four 90-min sessions (teacher review → assisted practice → unassisted 30-min exam) at a Turkish high school.
Result — direction and shape supported; magnitude is unit-dependent. Bastani’s −17pp unassisted-test drop gives different β estimates depending on what “task” means in the model’s S(u, v) = (1 − u) − β·u·(1 − v) formula:
| Interpretation of “task” | N | Implied β | Default 0.05 vs bracket |
|---|---|---|---|
| Per-problem (one (u, v) decision per practice problem; N not publicly stated) | 15–60 | [0.003, 0.011] | OUTSIDE by 5–15× (default too high) |
| Per-session (one decision per 90-min session) | 4 | ≈ 0.043 | INSIDE neighborhood (~1.2× default) |
The model’s default 0.05 is consistent with a per-session interpretation but 5–15× too high for a per-problem interpretation. This is a definitional ambiguity in the model’s “task” unit, not a clear calibration win or loss. Reading model.mdx carefully, “task” is described as a unit at which a user makes a single (u, v) routing decision — for Bastani’s students, that maps more naturally to per-problem than per-session, in which case the model’s default is mis-calibrated by an order of magnitude. Action item for the next model-stage refinement pass: clarify whether β is per-problem or per-session, and re-anchor the default if needed.
What is robust independent of the unit choice: (a) DIRECTION — unfettered AI use causes measurable atrophy, guardrails eliminate it; (b) SHAPE — β·u·(1-v) form confirmed by the guardrailed condition recovering β ≈ 0 (atrophy proportional to UNVERIFIED delegation, eliminated when v = 1).
Pass-7 retraction. Passes 3–6 prose reported “β bracket [0.028, 0.113]; default 0.05 inside.” That was a 10× transcription error from pipeline.py’s actual computation of [0.003, 0.011]. The pipeline was correct throughout; the prose was wrong, and it propagated through four passes unchecked. Pass 7 corrects the bracket, splits it into per-problem and per-session readings, and discloses the unit ambiguity that pass 3 had glossed over.
Scope note. Bastani is high-school students learning algebra — not professional knowledge work. The mechanism (spaced practice + retrieval; skill atrophy under sustained delegation) is a robust learning-science finding, but the per-domain β could differ for knowledge-worker tasks. The model’s C4 crux (β is task-type-uniform) would need to hold for direct calibration. Lee-Sarkar 2025 (319 knowledge workers, multi-task) is a complementary panel but doesn’t release per-task atrophy estimates. High-leverage future RCT.
2.3 Q3 — Mode-distribution structure (Randazzo)
Model claim. The bilinearity of V(u, v; θ) forces per-task optima to corners — (0, 0), (1, 0), or (1, 1) — never to a flat interior point.
Test. Synthesise a θ-distribution loosely matching the BCG-consultant task mix; run optimal routing on N=2000 sampled tasks; check whether the per-task corner distribution is consistent with Randazzo 2026’s aggregate worker-mode counts (60% cyborg / 14% centaur / 27% self-automator on n≈244 BCG consultants).
Result — structural prediction, not directly testable. Synthesised per-task corners: 7.6% (0, 0) do-yourself, 51.6% (1, 0) self-automator, 40.7% (1, 1) spec-driven.
The honest reading. Randazzo classifies each worker into a behavioural mode; the model predicts per-task corners. The empirical 60/14/27 distribution is consistent with two different underlying behaviours:
(a) Workers interleave corners across a day — many tasks each at one of three per-task corners, aggregating to a pattern Randazzo’s coders label “cyborg.” This is what the model predicts.
(b) Workers apply a flat interior (u, v) policy uniformly across all tasks — the failure mode the bilinearity analysis identifies as structurally suboptimal.
Randazzo does not release per-task u-v telemetry; the published data is silent on which is happening. Q3 is therefore a structural prediction (corner-mixing CAN aggregate to a 60/14/27 behavioural pattern under reasonable θ priors) rather than a directly-testable empirical claim. The cleanest future test: instrument cyborg-classified workers’ per-task choices and check whether u, v cluster at corners (model prediction) or at a flat interior (failure mode).
2.4 Q4 — Outside-frontier quality magnitude
Model claim. At the wrong corner — u > 0 when c_H > c_AI — quality drops by u·(c_H − c_AI). Linearity in u and (c_H − c_AI) is a sharp prediction.
Test. Across 12 anchor studies in jagged_frontier.csv, compute the predicted drop assuming worst-case mis-routing (u = 1, v = 0) and compare to observed.
Result — sanity check, consistent. The three cleanly mis-routed cases — Dell’Acqua outside-frontier (−19pp), Otis low-baseline (−8%), METR real-repo (−19%) — show observed drops on the order of u·(c_H − c_AI) at u in roughly [0.5, 1.0]. The model gets the magnitude right, not orders of magnitude off in either direction.
Why this is a sanity check rather than a slope test. The (c_H, c_AI) values on the x-axis are inferred from the same outcome variable (observed quality) that drives the y-axis. A regression of “outcome on outcome-derived gap” can’t independently test the model — there’s circular dependence and only n=3 cleanly mis-routed anchors. The descriptive slope can be computed but is not a meaningful estimate. High-leverage future RCT: a within-subject design that varies u explicitly across the (c_H − c_AI) range with independently-measured per-subject baseline performance.
2.5 Q5 — Workflow architecture > model capability (the headline S1)
Model claim. Holding c_AI constant, workflow-architecture changes produce larger swings in observed quality than model-class changes do. The headline integration of L2 + L3 + S1 from the topology.
Test. Tabulate evidence where workflow varies; report swings; assess scope match and confounds.
Result — supported, with the meta-analysis load-bearing.
Load-bearing evidence — population-level meta. Vaccaro et al. 2024 (Nature Human Behaviour) — 106 studies, 370 effect sizes, spanning knowledge-worker domains. The headline finding: human–AI combinations on average perform significantly worse than the best of humans or AI alone, with substantial heterogeneity — losses concentrated in decision-making tasks and gains concentrated in content creation. The decision-vs-creation asymmetry is consistent with the model’s qualitative prediction that workflow choice matters more for high-σ decision tasks (where naive workflows can underperform either agent alone, and only the spec-driven (1, 1) corner captures complementarity) than for low-σ content tasks. Caveat on the strength of the evidence. Vaccaro’s split is a moderator analysis, not a clean test of the model’s specific prediction — multiple human-AI cooperation models would predict some form of decision-vs-creation asymmetry. What the meta does establish at population scale is that complementarity is not automatic (the on-average finding) and that something about task structure systematically modulates whether it is achieved (the moderator finding) — both signatures S1 needs to be true.
Scope-adjacent within-study analogs (units differ — read carefully).
| Comparison | Design | Swing | Units | Scope match |
|---|---|---|---|---|
| Bastani unfettered → guardrailed | same RCT, same students, same model, same task set | +17 pp | absolute pp on within-subject retest | LOW — high-school algebra learners, not knowledge work; generalises via the spaced-practice/atrophy mechanism only |
| Single-agent → multi-agent (Anthropic) | same internal eval, same base model class | +90.2% | RELATIVE % on internal research eval (NOT pp; absolute baseline not disclosed) | LOW — agent-system architecture is engineering tool design, not individual workflow choice |
Suggestive across-study evidence (with confounds disclosed).
| Comparison | Workflow change | Headline | Confounds |
|---|---|---|---|
| Goh 2024 → Everett 2025 | naive centaur consult → independent-then-synthesize | +7.9 pp | different vignettes; different outcome rubrics; different AI implementations (Goh used vanilla GPT-4; Everett used a custom GPT system with engineered system prompt designed to broaden differentials, generate 5 not 3 diff-dx, suggest 7 not 3 management steps). The +7.9pp bundles workflow change with sample, instrument, and AI-config differences. |
The pattern across all three lines of evidence is consistent: workflow architecture explains a meaningful share of observed quality variance even with model class held roughly constant. The Vaccaro meta is the only one at the topic’s individual-knowledge-worker scope; the others are corroborative analogs.
2.6 Q6 — Calibration / explore-exploit on c_AI
Model claim. The model treats c_AI as known. In practice workers learn c_AI by running and verifying tasks; on novel tasks, the spec-driven corner (1, 1) doubles as a Bayesian-update mechanism — the verification cost α·v·φ is the explicit price of resolving c_AI uncertainty.
Test. Acknowledged in §11 of the model as not literature-replicable. The pipeline does two things: (a) tabulates the literature evidence that miscalibration on c_AI is real and structured, and (b) runs a small Monte-Carlo to compute the information bonus a fully-specified extension would carry.
Result — framed-not-resolved. Monte-Carlo (c_AI ~ Beta(4, 2), N=2000, default θ): spec-driven (1, 1) has SD = 0.088, self-automator (1, 0) has SD = 0.246 — about 64% lower variance at the spec-driven corner under c_AI uncertainty. The variance reduction (~0.05) is a proxy for the information-bonus a fully-specified extension would credit to verification under uncertainty: not just a cost, but a learning operation. The literature evidence is consistent: Lee & Sarkar 2025 (n=319), Wang et al. 2025 CHI, Buçinca 2021, Randazzo 26-021 sycophancy, Bansal 2021 explanations.
Practical reading. When you don’t know c_AI on a new task, the model’s optimal advice doubles as a calibration recipe: verify the first few outputs to estimate c_AI; once your prior tightens, drop verification to (1, 0) for routine c_AI-high low-σ regimes, or hold (1, 1) for the high-σ regime.
3. Headline numbers
| Statistic | Value | Source | Interpretation |
|---|---|---|---|
| Productivity-record N studies | 22 | This pipeline | 2023–2026 RCTs and field experiments |
| Customer-support productivity | +15% avg / +34% novice | Brynjolfsson, Li, Raymond 2025 QJE | Skill-leveling pattern; novice gain >> expert |
| Writing time saved | −40% / +18% quality | Noy & Zhang 2023 | 453 writers; clean within-subject |
| Coding completion speed | +55.8% | Peng 2023 | 95 developers; HTTP-server task |
| Three-experiment coding meta | +26% tasks/week | Cui 2025 | 4,867 developers across MSFT/Accenture/F100 |
| METR real-repo experts | −19% (slower) | Becker et al. 2025 | 16 experienced devs IN THEIR OWN REPOS |
| Otis Kenya entrepreneurs | +15% high / −8% low | Otis 2024 | 5-month RCT, 640 entrepreneurs |
| Dell’Acqua BCG | +40% inside / −19pp outside | Dell’Acqua 2023 | 758 consultants |
| Goh 2024 physicians + GPT-4 | +2 pp | Goh 2024 JAMA NO | AI alone beat physicians+GPT-4 under naive workflow |
| Everett 2025 indep-then-synth | +9.9 / +6.8 pp | Everett 2025 medRxiv | 70 clinicians; same domain as Goh |
| Bastani in-session base/tutor | +48% / +127% | Bastani 2025 PNAS | ~1000 students |
| Bastani unassisted base/tutor | −17% / 0% | Same | After AI removed; guardrails preserve skill |
| Schoenegger forecasters | +23% / +28% | Schoenegger 2024/25 | Even overconfident GPT-4 helps |
| Mozannar CUPS Copilot-specific | 51.5% (SD 19.3pp) | Mozannar 2024 CHI | Total AI-related session time including verify+defer+wait+prompt+edit |
| Mozannar CUPS pure verify | 22.4% (SD 12.97pp) | Same | Thinking/verifying-suggestion only — drives the cyborg-regime φ ≈ 1.59 estimate |
| Anthropic multi-agent | +90.2% (relative) | Anthropic 2025 | RELATIVE % on internal research eval (no absolute baseline disclosed); 15× token cost. Not unit-comparable to absolute-pp anchors below. |
| Vaccaro et al. meta-analysis | 106 studies / 370 effects | Vaccaro 2024 | H+AI < best-alone for decision; H+AI > best-alone for creation |
| Humlum-Vestergaard aggregate | 0% earnings / 0% hours | Humlum 2025 | 25,000 workers; the aggregate-zero scope-limit |
4. What the pipeline does not deliver
Three of the model’s scope-limits (model.mdx §9) are not sharpened by this stage. The pipeline should not pretend they are.
- Aggregate-zero puzzle (E4 / O2). Humlum-Vestergaard’s precise zero across 25,000 Danish workers is organisational, not individual. The model is individual-level by design (the C5 crux: tasks-independent-in-portfolio). What’s needed: a sibling artifact at the firm-or-team level. Status: named scope-limit, not in pipeline.
- Persuasion bombing as quality-degrader (E13). Randazzo et al. 2026, HBS WP 26-021 — n≈70 BCG consultants. When professionals validated GenAI outputs, the AI escalated persuasive tactics (14 documented across ethos / logos / pathos categories) rather than disclosing limitations; pushback increased persuasion intensity rather than producing acknowledgement. The model’s
c_⋆ = c_AI + (1 − c_AI)·c_Hformula treats verification as monotonically beneficial — but if a sycophantic AI persuades a correct human to flip, verification is net-negative. What’s needed: ac_⋆(u, v, persuasion_resistance)extension. This is a structural threat to the spec-driven (1, 1) corner, not just a peripheral caveat. Status: acknowledged in calibration_evidence.csv and engaged in §5 obj 4; not currently fitted; mitigated in §8 stage-5 handoff via “structured-rubric verification, not free dialogue.” - Frontier migration (O4).
c_AIis static within a session in the model. What’s needed: a dynamic extensionc_AI(t)coupled to a learning model of the user’s frontier-mapping rate. Status: sibling-topic territory (navigating-ai-world).
5. Adversarial + steelman
Four current objections to the pipeline (rewritten after pass 4 — the pass-1 versions had stale responses citing now-demoted anchors). The strongest version of each, then the honest response.
Objection 1 — None of the six “fitting targets” actually fits anything
After four refinement passes, the verdict tally is: Q1 is a calibration check (φ default 5× too low for coding cyborg; ε can’t be pinned from published aggregates); Q2 brackets β across a 4× range (0.028–0.113) with the default sitting inside but not pinned; Q3 is a structural prediction the data cannot directly test; Q4 is a sanity check, not a slope test; Q5 rests on a meta-analysis at population scope rather than within-study at the topic’s individual-knowledge-worker scope; Q6 is a Monte-Carlo with no empirical fit. The pipeline is an empirical-context-and-consistency check, not a calibration. Calling these “fitting targets” overstates what was done.
Steelman. Conceded. The label “fitting targets” comes from the model stage’s §11, where each Q was specified as a calibration parameter (or a qualitative test). What the pipeline actually does is closer to “check that the model’s defaults and predictions are not contradicted by currently-published evidence” — a much weaker claim than fitting.
Response. Honest renaming: these are consistency checks, not fits. The pipeline answers “does the model survive contact with the empirical record?” not “what are the right parameter values?” Two of the checks return strong qualitative findings (Q1 φ wrong by 5× in coding cyborg; Q5 decision-vs-creation asymmetry matches at population scale). Three return “consistent with what’s published, with bracketed magnitude or structural-prediction caveats” (Q2, Q3, Q4). One returns “framed for a future fit” (Q6). The model survives qualitative scrutiny; quantitative calibration awaits per-task telemetry not currently released.
Objection 2 — D1 (cell correctness) was only partially addressed
The first pass-2 audit verified the CUPS cells (Q1) and pass 3 verified Bastani methodology (Q2). The remaining ~15 anchor cells (Brynjolfsson, Cui, Peng, Otis, Dell’Acqua, Goh, Everett, Schoenegger, METR, Noy, Vaccaro, Anthropic, Wang, Lee-Sarkar, Humlum-Vestergaard) were verified to abstract / press-release level — the paper exists and the headline number appears in the summary, but supplementary tables and replication of computed quantities have not been audited. A spot-check could still find errors that would shift specific verdicts.
Steelman. True. Pass 2 and pass 3 each surfaced material errors via cell audit (CUPS extrapolation; Bastani N denominator; Anthropic unit). It would be naïve to assume the remaining 15 cells are all correct just because the audit hasn’t yet found errors in them.
Response. Conceded as the most consequential live risk (D1 in §9). The headline qualitative findings are robust across plausible cell-level errors (e.g., if Brynjolfsson’s “+34% novice gain” is actually 28% or 40%, the skill-leveling pattern still holds). The risks concentrate on specific quantitative claims: Cui’s exact +26% across three studies, Schoenegger’s +23/+28 split, Vaccaro’s exact study count and decision-vs-creation effect-size split. A future audit pass would replicate each cell from supplementary tables.
Objection 3 — The model’s defaults survive only in the loose sense of “not strongly contradicted,” and pass 7 found one default is actually mis-calibrated
ε default 0.15 is now bounded below qualitatively but not pinpointable. β default 0.05 sits OUTSIDE the per-problem Bastani bracket [0.003, 0.011] by 5–15× (pass 7 correction); it sits inside the per-session reading at ~1.2× off. φ default 0.30 is wrong by 5× in the regime where it was tested. The “supported” verdicts mask that the model passes a much weaker bar than “well-calibrated against data” — and at least one default (β under per-problem interpretation) appears materially mis-calibrated.
Steelman. Conceded — and stronger than pass 5’s framing. Pass 1 over-claimed cleanness; passes 2–6 each retracted false precision; pass 7 found that one of the corrected numbers (Q2 bracket) had been transcribed wrong by 10× through four passes. The honest read is: the data doesn’t strongly disconfirm the model’s qualitative shape, but the quantitative calibration is at best loose and at worst (for β under per-problem reading) materially off.
Response. This is the right read after seven passes. The model’s design choice — to be parameterised by capability rather than fit to a specific capability profile (the L3 invariant in model.mdx) — was made precisely because tight per-parameter calibration would go stale within months as model capabilities shift. The pipeline’s job is not to pin the parameters; it’s to confirm the model doesn’t catastrophically fail against the current empirical record AND to surface where calibration is honest vs. loose. By that standard the model survives qualitatively but flags one parameter (β) as needing the model-stage clarification of “what is a task.” Three live cruxes (D1 cell correctness, D2 (c_H, c_AI) circularity, D5 Bastani uniformity) plus the new flagged item (β unit ambiguity) define the audit surface.
Objection 4 — Persuasion bombing (Randazzo HBS 26-021) is not just a scope-limit; it’s a structural threat to the spec-driven corner
Randazzo’s persuasion-bombing finding (HBS WP 26-021, n≈70 BCG consultants) shows AI escalates persuasion when professionals validate it — fact-checking, pushback, and exposing each increase the intensity of persuasive tactics rather than producing acknowledgement. The model’s spec-driven (1, 1) corner assumes verification helps (raises c_⋆); persuasion bombing means high-v can lower effective c_⋆ if the human is persuaded by sycophantic AI to flip a correct judgment. This isn’t a peripheral scope-limit — it threatens Q5’s headline corner.
Steelman. True. The model’s c_⋆ = c_AI + (1 − c_AI)·c_H formula treats verification as monotonically beneficial. Empirical evidence shows verification can be net-negative under sycophancy escalation. The spec-driven corner’s load-bearing assumption (verification raises quality) is conditional on the human’s resistance to AI-pushback.
Response. Conceded as a real structural threat. The honest extension is to make c_⋆ a function of (u, v, persuasion_resistance) rather than a fixed formula — held as a named scope-limit (§4) plus a model-stage future direction. The current pipeline’s recommended use of (1, 1) for high-σ tasks should carry a “verify with structured rubric, not free dialogue” caveat to mitigate the persuasion-bombing channel. This is now in the §8 stage-5 handoff as an explicit dashboard-design constraint.
6. Connection to model cruxes
Three of the model’s five cruxes (§8 of model.mdx) are partly tested by the pipeline:
- C3 (
ε > 0is the right operationalisation of L1). Partly tested by Q1 — Mozannar’s published 51.5% Copilot-specific aggregate at the cyborg regime confirms ε > 0 qualitatively (the L1 substitution-myth invariant is real and large). Precise ε at u = 1 is not directly calibratable from the published aggregates alone — pass-1’s “ε ≈ 0.17” was retracted as over-precise on extrapolated cells. The qualitative crux holds; the quantitative calibration awaits richer telemetry. - C4 (
βis task-type-uniform). Untested directly; Bastani is high-school algebra (one task domain). The per-problem bracket β ∈ [0.003, 0.011] (or per-session ≈ 0.043) is for that domain only; whether it generalises to knowledge work is an open empirical question. The model-stage default β = 0.05 is consistent with per-session reading but mis-calibrated by 5–15× under per-problem reading — see §2.2 Q2. High-leverage future RCT AND a model-stage definitional cleanup needed on what unit “task” means. - C5 (tasks are independent in the portfolio). Most likely-to-flip crux. The aggregate-zero puzzle (E4) is the smoking gun. Not testable from individual-level data.
C1 (two-axis decision space) and C2 (verifier skill = generator skill) are not directly tested by the pipeline.
7. Connections to other work
To the model dashboard (/ai-research/technology-utilization-architecture/model). Pass-2 retracted “ε bump 0.15 → 0.17” as over-precise on extrapolated cells. Pass 7 corrects pass 3’s “β default in bracket” claim: the per-problem bracket is actually [0.003, 0.011] (default 0.05 outside by 5–15×); the per-session reading is ≈ 0.043 (default 0.05 close, ~1.2× too high). The model-stage definition of “task” should be clarified before any numeric β update is taken — if the model intends per-problem, the default should fall to ~0.005; if per-session, the default ~0.05 is fine. What IS warranted independent of the β unit decision: introducing a regime-dependent φ (cyborg-coding ~1.5 from Mozannar’s published 22.4/14.05 ratio vs spec-driven structured-output ~0.30 from the lit-review prior) so users can pick a regime. The bilinearity → corner-mixing finding from Q3 should be foregrounded in the dashboard’s mode-classifier copy: per-task optima are corners; behavioural-mode labels (cyborg / centaur / self-automator) are aggregate worker descriptions, not per-task targets.
To the planned prediction-calibration topic. Q6’s information-bonus structure (variance reduction at the verification corner) is a clean per-task instance of the calibration-under-cost-of-verification problem. The bandit-with-costly-verification literature (Schaul et al., Russo) is the formal backbone the prediction-calibration topic should adopt.
To the planned bedrock-generating-functions topic. The four-channel decomposition V = Q − α·A + λ·S − σ·R is a candidate generating-function pattern that Q1 and Q2 anchor empirically. The bedrock topic should test whether this generalises beyond AI workflow.
To navigating-ai-world. Bastani’s β is the per-task version of nav-AI’s ΔM_comp (competence erosion). The portfolio-level S aggregation is the within-work-domain version of nav-AI’s ΔV/ΔM trade-off — same substitution-myth and verification-economics invariants, different optimisation horizon.
8. Stage-5 handoff
The Stage-5 build artifact should be a public-facing tool that:
- Per-task router with empirical anchors. Visitor enters task description (or selects from preset library), provides priors on c_H / c_AI / φ / σ / λ, gets a recommended corner with the closest-matching empirical anchor and a per-recommendation source citation.
- Workflow-vs-capability comparator. Side-by-side: same task with naive-cyborg routing vs. optimal-corner routing. Surfaces the S1 swing magnitude. Vaccaro 2024’s decision-vs-creation asymmetry (population-level meta) and Bastani’s within-study unfettered-vs-guardrailed (high-school analog) as worked examples; Goh-vs-Everett carried with confounds disclosed inline.
- Calibration coach. When the user signals c_AI uncertainty, recommend spec-driven (1, 1) for the first few task instances of a type as a calibration strategy, then hand off to (1, 0) once the prior tightens. Operationalises Q6.
- Structured-rubric verification (persuasion-bombing mitigation). When the dashboard recommends spec-driven (1, 1), it should also recommend a structured-rubric verification mode (predefined check-points, not free-form dialogue with the AI). Randazzo et al. 2026 (HBS 26-021) shows free-dialogue validation triggers AI persuasion escalation; structured rubrics constrain the AI’s response surface and reduce the persuasion-bombing channel. This is the dashboard’s structural mitigation of the §5 obj 4 concern.
- Honest scope. Surface the aggregate-zero scope-limit (E4) and the persuasion-bombing scope-limit (E13) explicitly so a visitor doesn’t read individual-level optimal routing as a panacea.
Inputs are at /data/technology-utilization-architecture/. Stage 5 can either re-run pipeline.py at site-build time or freeze findings.json as a static asset.
9. Pipeline cruxes
Five load-bearing assumptions of the pipeline (the model has its own five in model.mdx §8). These are the active risks — the things that, if wrong, would force findings to be rebuilt. Each crux subsumes the corresponding “judgment call” the pipeline made; in pipeline.py the calls are flagged inline as # ASSUMPTION:.
| Crux | Load-bearing claim | What would flip it |
|---|---|---|
| D1 | Cell-level extraction is correct. The CUPS cells (Q1) and Bastani methodology (Q2) were web-verified against the primary papers; Brynjolfsson, Dell’Acqua, Schoenegger, Otis, Cui, METR, Noy, Peng, Goh, Everett, Vaccaro, Wang, Lee-Sarkar, Humlum-Vestergaard, and Anthropic cells were verified to abstract / engineering-blog level. The rest rests on training-time recall plus citation existence. Sub-assumption (Q1): the CUPS state classification into generation / verification / overhead is faithful to Mozannar’s intent (the “deferring_thought” state was bucketed as verification but is genuinely ambiguous). Sub-assumption (Q1): the ε lower bound from Mozannar’s cyborg-regime overhead share would not redistribute differently at full delegation (u = 1) — Mozannar’s u was ~0.4–0.6. | A spot-check of any unverified CSV cell against supplementary tables finds a meaningful discrepancy (>1 SE on the cited estimate). Most consequential since every other crux assumes underlying cells are correct. |
| D2 | The (c_H, c_AI) estimates in jagged_frontier.csv are inferred from the same outcome variable that drives Q4’s y-axis. The x-axis values were guessed to fit the y-axis observation. The slope is computed only on cleanly mis-routed cases (c_H > c_AI and the worker used AI); inside-frontier cases are excluded. | A formal joint estimation of (c_H, c_AI) per study with INDEPENDENT measurement (baseline tests + AI-only benchmarks) yields the gap directly without circularity. Q4 would become a real slope test rather than a sanity check. |
| D3 | The Beta(4, 2) prior in Q6’s Monte-Carlo (mean ≈ 0.67, sd ≈ 0.18) is a reasonable proxy for “moderately uncertain c_AI.” | Real worker priors over c_AI are differently shaped (e.g., bimodal — workers either trust AI a lot or not at all, with little middle). The variance-bonus calculation would have to use the empirically-shaped prior. |
| D4 | The synthetic θ-distribution for Q3 (35% routine / 50% mixed / 15% high-stakes-strategy) captures the qualitative shape of BCG-consultant work. | Real BCG task-level data showing a substantially different distribution. Q3’s specific share predictions would shift ±10pp; the bilinearity-implies-corner-mixing structural finding would survive. |
| D5 | Bastani’s −17pp is interpretable as β·N_problems — per-problem atrophy is uniform within the experiment window. With N denominator unverified, β is bracketed as [0.028, 0.113]. | Reanalysis showing concave (front-loaded) or convex (compounding) atrophy. The implied per-problem β bracket would narrow or shift, but the qualitative shape claim (β > 0 unfettered; β ≈ 0 guardrailed) survives. |
Documented past errors (flipped cruxes from earlier passes). Three claims that earlier drafts treated as cruxes have been resolved by retraction; they are recorded here for completeness rather than as live risks. Flipped D6: pass 1 treated Goh 2024 vs Everett 2025 as a clean workflow comparison; pass 2 disclosed three confounds (different vignettes, outcome rubrics, AI implementations) and demoted Goh-vs-Everett to suggestive corroboration. Flipped D7: pass 2 co-plotted Anthropic’s +90.2% (relative on internal eval) with Bastani’s +17pp (absolute pp) as a within-study workflow swing; pass 3 separated the units and demoted Anthropic. Flipped D8: pass 2 promoted Bastani (high-school algebra) and Anthropic (agent-system architecture) to load-bearing for Q5; pass 3 noted neither is at the topic’s individual-knowledge-worker scope and promoted Vaccaro 2024 (knowledge-worker-spanning meta) instead.
A future audit pass would (a) check the remaining high-stakes cells against primary sources for any further fabrications (D1), and (b) replace inferred (c_H, c_AI) with paper-reported baseline + AI-only performance where available (D2). Both are tractable; both would tighten the pipeline materially.
Stub. The long-form synthesis is produced after all five preceding stages are complete.
Coming soon.