## Topology
*topic: technology-utilization-architecture · stage: topology · pass 9 · in-progress*

Dependency graph of the lit review. 66 nodes typed across nine classes (foundational assumptions / methods / empirical claims / logical necessities / generating mechanisms / synthesis / practitioner frameworks / open questions / distortions); seven cruxes (A1, A2, A3, A6, L1, L2, L3 — every weight-5 A node + every weight-5 L node) and four variant views (Vulnerability / Flow / Minimal / Capability-regime). The genuinely novel structural move for this topic: encoding the practitioner ↔ academic operationalization bridge that runs 12–18 months out of phase.

## TLDR

The lit review documents the research landscape on optimal human-AI workflow design. This topology asks the sharper question: **what depends on what?** Strip the field down to its load-bearing structure and the picture is surprisingly clean. Four foundational assumptions sit upstream of most of the empirical and synthesis nodes — that **human attention is the binding scarce resource** (A1), that **AI capability is heterogeneous across cognitive operations** (the jagged frontier; A2), that **verification cost is comparable to or cheaper than generation cost** (A3), and that **individual-level workflow optimization aggregates upward rather than being absorbed by organizational dynamics** (A6). If any one of them flipped, large regions of the picture would have to be rebuilt. Three logical guardrails (L1 substitution myth is wrong; L2 optimal allocation needs the joint performance surface, not solo accuracies; L3 parameterize by capability) cannot be falsified at all — they can only be ignored, which is exactly how most overconfident AI-rollout discourse proceeds. Everything else is methodology, empirical claim, generating mechanism, synthesis, practitioner framework, open question, or distortion vector.

The genuinely novel structural move for this topic is the **practitioner ↔ academic bridge** that the new node type **P** and the new edge type **op** make explicit. The practitioner stack (Mollick, Karpathy, Anthropic, Cognition, Claude Code) and the academic literature address the same structural problems but with different methodologies and on different timescales — and the relationship is bidirectional, not one-directional. Three patterns recur on the cross-community bridge. **(a) Practitioners ahead of academic measurement**: Mollick proposed the centaur/cyborg typology in 2024; Randazzo 2026 (HBS) is the first peer-reviewed empirical measurement of the behavior distribution (60/30/10) the typology names. **(b) Practitioners concretizing prior academic principles**: Karpathy's autonomy slider concretizes Shneiderman's 2D framework as a per-action user control surface; the AI Sandwich and Compound Engineering loop (Shipper / Every) concretize Tankelevitch's metacognitive-demands framework as a daily workflow practice. **(c) Practitioner and academic work converging in parallel without direct lineage**: Karpathy's autonomy slider (per-action UX) and Feng 2025's five-level academic taxonomy (per-task user roles) converged on the same discretized-levels-of-autonomy shape independently, at different granularities; Anthropic's agent design patterns (chain / route / parallelize / orchestrator-workers / evaluator-optimizer) address the same structural problem (joint allocation under capability heterogeneity) that L2D theory formalizes, but emerged from engineering practice rather than as L2D operationalizations. Note that some P-edges in the graph are *practitioner-internal* rather than cross-community — e.g., P6 (spec-driven development) → S4 (context engineering as central craft) and P7 (CLAUDE.md / context files) → S4 are both practitioner concrete-technique → practitioner-coined synthesis edges, not bridge edges, and the (a)/(b)/(c) classification doesn't apply to them. Reading the topology through this mixed bridge — rather than as a single integrated literature — disciplines the Stage-3 formalization: the model should encode invariants tracked across both communities (autonomy levels, verification gates, context structure, joint-allocation logic) parameterized by inputs the academic literature measures (capability gap, verification cost, skill-formation goals), without privileging either community as the source of authority.

The field's **weakest links** are not where the popular discourse focuses. Mainstream debate contests "does AI help" — but at the well-bounded-task level the productivity record is robust (E1: 15–55% gains across ~25 RCTs). The actual fragile zones in 2026 are: (a) **the aggregate-zero puzzle** (E4 / O2 attacks A6) — Humlum & Vestergaard's precise zero across 25,000 Danish workers tensions every micro-RCT result and is direct evidence against the individual-aggregation assumption that the entire individual-level frame depends on; (b) **whether the centaur regime persists** (E18 + S5) — Schoenegger's finding that even deliberately overconfident GPT improves forecasting suggests much of the gain comes from forced structured reasoning, not advice quality, raising the question of whether the centaur advantage survives a capability discontinuity; (c) **whether the frontier is mappable faster than it migrates** (O4) — a calibration race-condition the field hasn't tested; (d) **whether the binding reliance constraint is verification cost or verification calibration** (O7) — the design implication is different; (e) **long-term cognitive effects of continuous AI use** (O3) — Bastani's 17% unassisted-performance drop and the broader skill-atrophy literature point to a real risk but the longitudinal data window is still under twelve months for most studies.

This topology is the input to model formalization (Stage 3). The cleanest target is a **parameterized routing function**: for each (task type, capability profile, verification cost, autonomy level) tuple, produce an allocation decision that maximizes expected output quality per unit of human attention. The four variant views below (Vulnerability / Flow / Minimal / Capability-regime) read the same graph through different lenses to discipline that formalization choice — the capability-regime variant in particular sorts every node into stale-on-jump / structurally invariant / regime-dependent, which directly tells the model formalization which terms must be parameters (the regime-dependent ones) and which can be invariants (the stable ones).

## The graph

<CognitivePartnershipGraph client:load />

*Click a node for its claim and load-bearing weight; hover an edge for the relation type; drag to rearrange. The variant toggles read the same graph through different lenses.*

---

## How to read this graph

Every node in the lit review collapses to one of nine types. Edges between them carry one of seven relations. Together they make the structure inspectable.

### Node types

| Code | Type | What it is |
|---|---|---|
| **A** | Foundational assumption | A claim the field cannot operate without; if false, large downstream regions collapse |
| **M** | Methodological prerequisite | A study design or measurement approach that must work for the empirical claims to be testable |
| **E** | Empirical claim | A specific measured finding with an effect size and replication status |
| **L** | Logical necessity | Follows from definitions or algebra; not empirically refutable |
| **G** | Generating mechanism | A causal process that explains a pattern (metacognitive load, verification trade-off, ironies of automation) |
| **S** | Synthesis claim | An integrative statement combining multiple lower-level claims |
| **P** | Practitioner framework | A typology, slider, or pattern published by the practitioner stack ahead of academic formalization |
| **O** | Open question | Genuinely undecided with current methods or evidence |
| **D** | Distortion vector | Where motivated reasoning concentrates (typed by direction) |

### Edge types

| Code | Edge | Meaning |
|---|---|---|
| **dep** | depends-on | If target collapses, source collapses |
| **imp** | implies | Logical implication |
| **sup** | empirically-supports | Evidence relation |
| **conf** | confounds / inflates | Artifact relationship |
| **mod** | moderates | Changes magnitude |
| **op** | operationalizes | Practitioner framework concretizes an academic claim or vice versa |
| **corr** | corrects | Workflow architecture corrects naive allocation |
| **attacks** | attacks | Distortion vector targets a specific node |

### Weight scale (load-bearing weight, 1–5)

- **5** — crux node; collapse propagates across multiple sections of the lit review
- **4** — load-bearing within a section
- **3** — important but local
- **2** — corroborating
- **1** — decorative

---

## 1. Node catalog

Each node carries: type code · weight · short claim · key citation · status. Status flags: ✓ (robust/replicated), ~ (partial/qualified), ? (contested/open), ✗ (refuted, kept as historical reference).

### A — Foundational assumptions

| ID | Wt | Claim | Status |
|---|---|---|---|
| A1 | 5 | Human attention is the binding scarce resource — once production is automated, the bottleneck shifts upward to planning, evaluation, calibration. (Tankelevitch 2024) | ✓ |
| A2 | 5 | AI capability is heterogeneous across cognitive operations (the jagged frontier). (Dell'Acqua 2023/2025) | ✓ |
| A3 | 5 | Verification cost is comparable to or cheaper than generation cost — otherwise rational engagement collapses. (Vasconcelos 2023; Fok & Weld 2023) | ~ |
| A4 | 4 | Knowledge work decomposes into sub-tasks that can be selectively delegated. (Vaccaro 2024 sub-task finding) | ✓ |
| A5 | 4 | The frontier has stable topological features even as the boundary shifts — i.e., it is mappable. | ? |
| A6 | 5 | Individual-level workflow optimization aggregates upward — gains aren't fully absorbed by organizational dynamics (review bottlenecks, managerial reabsorption, coordination costs). The load-bearing assumption behind the entire individual-level framing. Humlum-Vestergaard aggregate-zero is direct evidence against. | ? |

### M — Methodological prerequisites

| ID | Wt | Claim | Status |
|---|---|---|---|
| M1 | 5 | Randomized controlled trials of AI-augmented work. (~25 published 2023–2026) | ✓ |
| M2 | 4 | Strategy-graded reliance metrics (vs. outcome-graded). (Vasconcelos / Fok & Weld pivot) | ✓ |
| M3 | 4 | Field-deployed measurement (real repos, real meetings). (METR 2025) | ~ |
| M4 | 3 | Telemetry / log-based behavioral observation. (Mozannar CUPS 2024) | ✓ |
| M5 | 3 | Cognitive Task Analysis (CTA, ACTA). (Crandall, Klein & Hoffman 2006) | ~ |

### E — Empirical claims

| ID | Wt | Claim | Status |
|---|---|---|---|
| E1 | 5 | 15–55% productivity gains on well-bounded knowledge tasks. (Brynjolfsson 2023; Noy 2023; Peng 2023; Cui 2025) | ✓ |
| E2 | 5 | Skill-leveling: novices gain most, top performers near-zero — on well-defined tasks. (Brynjolfsson +34%/0%) | ✓ |
| E3 | 5 | Outside-frontier AI use causes a 19-pp quality drop. (Dell'Acqua BCG study) | ✓ |
| E4 | 5 | Aggregate labor-market effects are zero at 2-year horizons. (Humlum & Vestergaard 2025, 25k workers) | ✓ |
| E5 | 4 | Explanations alone don't yield complementary performance — they increase acceptance regardless of correctness. (Bansal 2021) | ✓ |
| E6 | 4 | Cognitive forcing reduces overreliance — but only for high-Need-for-Cognition users. (Buçinca 2021) | ✓ |
| E7 | 4 | On naive workflows, AI alone outperforms human+AI in expert domains. (Goh 2024 JAMA NO) | ✓ |
| E8 | 5 | "Independent-then-synthesize" workflow restores complementarity in the same domain. (Everett 2025) | ✓ |
| E9 | 4 | Unguardrailed AI in learning produces 17% worse unassisted post-session performance. (Bastani PNAS 2025) | ✓ |
| E10 | 3 | Behavior distribution: ~60% cyborg, ~30% centaur, ~10% self-automator. (Randazzo HBS 26-036) | ✓ |
| E11 | 4 | LLM capability decays sharply up Bloom's hierarchy: ~81% Remember → 13–41% Analyze. (Ma 2025) | ✓ |
| E12 | 4 | Verification time is a substantial fraction of total interaction in coding contexts. (Mozannar CUPS 2024) | ✓ |
| E13 | 4 | Sycophancy escalation: AI flips correct human judgments to incorrect on pushback. (Randazzo HBS 26-021) | ✓ |
| E14 | 4 | Multi-agent orchestrator-worker outperforms single-agent +90.2% at ~15× token cost. (Anthropic 2024/2025) | ✓ |
| E15 | 4 | Routing > model selection — task-to-tool dispatch beats single-best-model. (Paterson 2026) | ✓ |
| E16 | 3 | Higher AI confidence correlates with less critical thinking enacted. (Lee et al. CHI 2025) | ~ |
| E17 | 4 | Read/write asymmetry: multi-agent works for read-heavy tasks; breaks on write-heavy unless writes are serialized. (Cognition / Anthropic / LangChain consensus) | ✓ |
| E18 | 3 | Both calibrated AND deliberately overconfident GPT assistants improve human forecasting +23–43%. (Schoenegger 2024/2025) | ~ |

### L — Logical necessities

| ID | Wt | Claim | Status |
|---|---|---|---|
| L1 | 5 | Substitution myth is wrong — every offload creates new monitoring/verification/coordination work. (Dekker & Woods 2002; Bainbridge 1983) | ✓ |
| L2 | 5 | Optimal allocation requires modeling the *joint* performance surface, not solo accuracies. (Madras / Mozannar L2D theory) | ✓ |
| L3 | 5 | Allocation model must be parameterized BY capability, not depend on FIXED capability — generating function vs. lookup table. | ✓ |
| L4 | 4 | High automation and high human control can coexist. (Shneiderman 2D) | ✓ |

### G — Generating mechanisms

| ID | Wt | Claim | Status |
|---|---|---|---|
| G1 | 5 | Metacognitive bottleneck — load shifts from production to planning, evaluation, calibration. (Tankelevitch 2024) | ✓ |
| G2 | 5 | Ironies of automation — AI handling routine work degrades human capacity to catch rare critical errors. (Bainbridge 1983; Simkute 2024) | ✓ |
| G3 | 5 | Verification-cost trade-off — engagement is rational only when cheap. (Vasconcelos 2023) | ✓ |
| G4 | 5 | Jagged frontier mechanism — capability heterogeneous; the boundary is personal and dynamic. | ✓ |
| G5 | 4 | Correlation neglect — humans treat AI advice as independent evidence despite shared training data. (Amin 2026) | ~ |
| G6 | 3 | Cognitive forcing — committing to a view first breaks anchoring. | ✓ |
| G7 | 4 | Skill atrophy — capacities not exercised decay. | ✓ |
| G8 | 4 | Cross-task productivity bundling — speedups bottlenecked by linked tasks. (Cowen 2026) | ~ |
| G9 | 4 | Generator-verifier asymmetry — production cheap, checking expensive. (Karpathy) | ✓ |
| G10 | 3 | Multi-tool attention interference — Wickens' Multiple Resource Theory predicts tool-stack overload (Cursor + ChatGPT + meeting incurs cost beyond the sum of per-tool costs). Mechanism well-established; absent from AI workflow research. | ~ |

### S — Synthesis claims

| ID | Wt | Claim | Status |
|---|---|---|---|
| S1 | 5 | **Workflow architecture predicts outcomes more reliably than model capability.** | ✓ |
| S2 | 5 | The optimization target is metacognitive efficiency, not throughput. | ✓ |
| S3 | 4 | Knowledge work shifting from production to critical integration. (CHI 2025) | ✓ |
| S4 | 4 | Context engineering is the central craft. (Anthropic / Cognition consensus) | ✓ |
| S5 | 3 | Centaur advantage may be a transient regime. | ? |

### P — Practitioner frameworks

| ID | Wt | Claim | Status |
|---|---|---|---|
| P1 | 5 | Mollick centaur / cyborg / self-automator typology. | ✓ |
| P2 | 4 | Karpathy autonomy slider. | ✓ |
| P3 | 4 | Anthropic agent design patterns (chain / route / parallelize / orchestrator-workers / evaluator-optimizer). | ✓ |
| P4 | 4 | Cognition read/write asymmetry as agent-system principle. | ✓ |
| P5 | 3 | Compound Engineering / AI Sandwich (Shipper, Every). | ~ |
| P6 | 3 | Spec-driven development (spec → plan → execute). | ~ |
| P7 | 3 | Personal context files (CLAUDE.md / system-prompt patterns). | ✓ |

### O — Open questions

| ID | Wt | Claim | Status |
|---|---|---|---|
| O1 | 5 | Does AI help experts on open-ended judgment? | ? |
| O2 | 5 | Why are aggregate effects zero given the micro-RCT record? | ? |
| O3 | 5 | Long-term cognitive effects under continuous AI use. | ? |
| O4 | 4 | Frontier migration vs. calibration rate. | ? |
| O5 | 4 | Right unit of analysis — individual vs. team vs. value chain. | ? |
| O6 | 3 | Centaur taxonomy durable, or interface artifact? | ? |
| O7 | 4 | Verification cost vs. verification calibration as binding constraint. | ? |

### D — Distortion vectors

| ID | Wt | Claim | Targets |
|---|---|---|---|
| D1 | 4 | AI-maximalist distortion — RCT gains read as aggregate revolution; ignores Humlum-Vestergaard zero. | E4, S1, O2 |
| D2 | 4 | Productivity-only distortion — counts speed gains, ignores skill atrophy and metacognitive load. | G7, S2, E9 |
| D3 | 3 | "Just use the best model" distortion — ignores routing finding (E15) and architecture-over-capability evidence. | E15, S1, S4 |
| D4 | 3 | Practitioner-only distortion — dismisses formalization as category error. | L3, S1 |

---

## 2. Edge catalog (key chains, not exhaustive)

**Foundation → Method.** A2 → M1 (RCTs reveal the jagged frontier); A3 → M2 (strategy-graded metrics measure verification cost); A1 → M3 (field deployment shows real attention budget).

**Method → Empirical.** M1 produces the productivity record (E1–E9). M2 → E5 (Bansal explanations don't help once strategy-graded). M2 → E12 (CUPS is the strategy-graded measure of coding). M3 → E2 (METR shows experts gain less in real repos).

**Mechanism → Empirical.** G1 → E12 (metacognitive load shows up as verification time). G2 → E13 (sycophancy is the rare critical error ironies-of-automation predicts will get missed). G3 → E5 (rational verification trade-off explains why explanations fail). G4 → E3 (jagged frontier produces outside-frontier harm). G7 → E9 (skill atrophy → unassisted-performance drop). G8 → E4 (cross-task bundling explains aggregate zero). G10 → E10 (cyborgs interleaving multiple tools should incur higher MRT interference than centaurs handing off discrete sub-tasks — testable but unmeasured prediction).

**Empirical → Synthesis.** E1, E2, E3, E8 → S1 (workflow architecture > model capability — the integrated headline). E12, E16 → S2 (metacognitive efficiency target). E11 → S3. E14, E15, E17 → S4 (context engineering as central craft). E18 → S5.

**Empirical → Open.** E2, E4 → O2 (the aggregate puzzle). E2 → O1 (helps experts?). E9 → O3 (long-term cognitive). E1, E3 → O4 (frontier migration). E4 → O5 (right unit). E13 → O7 (calibration vs. cost).

**Logical guards.** L1 → G2 (substitution myth → ironies of automation). L2 → S1 (joint surface needed). L3 → S4 (parameterize-by-capability is what makes context engineering generalize). L4 → P2 (Shneiderman 2D legitimates the autonomy slider).

**Practitioner ↔ Academic (the central conceptual move of this topology, encoded as op edges).** The relationship type varies — see TLDR para 2 for the (a)/(b)/(c) classification of cross-community bridge edges. P1 → E10 (Mollick named the typology; Randazzo 2026 measured the behavior distribution it predicts — pattern (a), academia retrospectively measures). P2 → L4 (Karpathy's autonomy slider concretizes Shneiderman's 2D framework as a per-action user control surface — pattern (b), prior academic principle made tangible), with the reverse edge L4 → P2 (Shneiderman legitimates the slider design). P3 → L2 (Anthropic's agent design patterns and Madras / Mozannar L2D theory address the same structural problem — joint allocation under capability heterogeneity — but emerged independently — pattern (c), parallel convergence without direct lineage). P4 → E17 (Cognition's read/write principle IS the design-actionable form of the read/write asymmetry finding — pattern (a) inverted: practitioners stated and measured it together). P5 → S2 (compound engineering / AI Sandwich applies Tankelevitch's metacognitive-demands framework at the workflow level — pattern (b)). **P6 → S4 and P7 → S4 are practitioner-internal, not bridge edges**: S4 (context engineering as central craft) is itself a practitioner-coined synthesis (Anthropic / Cognition consensus), and P6 (spec-driven development) and P7 (CLAUDE.md / personal context files) are concrete practitioner techniques that operationalize that practitioner synthesis. The (a)/(b)/(c) classification doesn't apply because both endpoints sit in the practitioner community.

**Foundation → Foundation (the project-frame edges).** A6 → S1 (S1 is meaningful only if individual optimization aggregates). E4 → A6 (aggregate-zero is the direct attack on A6). O2 → A6 and O5 → A6 (the open questions whose resolution will close A6 either way).

**Distortion attacks.** D1 → S1, E4 (treats RCT gains as aggregate proof; ignores Humlum). D2 → G7, S2, E9 (counts speed; ignores atrophy). D3 → E15, S1, S4 (ignores routing). D4 → L3, S1 (denies formalization possibility).

---

## 3. High-stakes nodes — by structural role

Six categories of structural role, sorted by how their failure modes propagate. The single most useful conceptual move is keeping the cruxes (inputs) separate from the headline (an output) and from the reframers (mechanisms whose magnitude is open). Conflating these three under a single label of "important findings" produces most of the bad-faith debate around AI workflow design.

### 3a. Foundational cruxes — collapse rebuilds regions

These are the **input assumptions** the entire individual-level framing rests on. Falsification doesn't change interpretation; it forces rebuilding. All four are weight-5 foundational-assumption nodes (the A class).

- **A1** (human attention is the binding scarce resource). If false, throughput-optimization wins and the entire metacognitive-bottleneck framing dissolves; design priorities flip back toward maximum AI delegation. The metacognitive-bottleneck mechanism G1 is the *consequence* of A1 + production-automation, not a separate axiom — which is why G1 isn't itself a crux.
- **A2** (jagged frontier — AI capability is heterogeneous across operations). If a capability discontinuity produced uniformly-good AI across all knowledge-work operations, the "map your frontier" design imperative collapses and outside-frontier harm (E3) dissolves.
- **A3** (verification cost is comparable to or cheaper than generation cost). If verification became prohibitively expensive — AI outputs so complex or fast-moving that human checking is intractable — the rational-cost-benefit reframing of overreliance (G3) dissolves and the design problem becomes "trust without verification."
- **A6** (individual-level optimization aggregates upward). If gains are absorbed by organizational dynamics — review bottlenecks downstream, managerial reabsorption, coordination costs — individual workflow optimization is locally optimal but globally insufficient. Humlum-Vestergaard's aggregate zero is direct evidence against. **This is the project-frame crux**: the entire individual-level optimization target only matters if A6 holds, and its status is genuinely open.

### 3b. Logical guardrails — unfalsifiable, ignored at peril

Cannot be falsified — only ignored. All three are weight-5 logical-necessity nodes (the L class).

- **L1** (substitution myth is wrong — every offload creates new monitoring/verification/coordination work). Bainbridge 1983 / Dekker & Woods 2002. The structural property of any principal-agent delegation; AI-rollout discourse routinely treats it as if it could be ignored.
- **L2** (optimal allocation requires modeling the *joint* performance surface, not solo accuracies). The Madras / Mozannar L2D formal result. A mathematical property of how joint performance combines — cannot be falsified empirically. The most common practitioner shortcut — "use AI when AI is better, use human when human is better" — implicitly assumes solo accuracies suffice; ignoring the variance and correlation structure of the two agents' errors is exactly the move L2 forbids. The agent-design patterns (P3) that *do* work are the ones that respect L2 by construction.
- **L3** (allocation model must be parameterized by capability, not depend on fixed capability). The generating-function-vs-lookup-table commitment. Practitioner-only frameworks routinely violate this by hardcoding "GPT-4 is good at X, bad at Y" — the pattern goes stale within months.

### 3c. Reframer mechanisms — magnitude is the live question

High-weight non-crux mechanism nodes whose *magnitude* (not existence) is what reshapes interpretation. Each is well-supported as a phenomenon; the open question is the share of variance they explain.

- **G1** (metacognitive bottleneck). CHI 2024 Best Paper finding; robust phenomenologically. The open magnitude question: what share of the labor force is in the regime where G1 is binding? For workers still production-constrained, G1 is premature; for power users, it is binding. The size of each population determines whether the metacognitive-efficiency target (S2) is the right design priority for "individual workers" generically or only for a subset.
- **G3** (verification-cost trade-off). Vasconcelos / Fok-Weld reframing. The mechanism is real; the open question is whether it captures most of the variance in reliance behavior, or whether verification *calibration* (O7) is doing the rest of the work.
- **G4** (jagged frontier mechanism). The capability-heterogeneity story is robust; the open magnitude question is the rate at which the boundary shifts (O4) and whether the topological features are stable (A5).

### 3d. Headline conclusion — a synthesis output, not a crux

- **S1** (workflow architecture > model capability). The integrated finding the topology is organized around. **S1 is what the cruxes plus mechanisms produce, not a load-bearing input.** It is a weight-5 synthesis node, but distinguishing "headline conclusion" from "crux" matters: cruxes are inputs whose falsification rebuilds the graph; conclusions are outputs whose falsification only means the rebuild was downward (the conclusion was wrong) rather than upward (an input was wrong).

### 3e. Corroborating / illustrative

Two senses of "droppable" need to be distinguished here. Some nodes can be removed without breaking S1's *conclusion* but are load-bearing as *exemplars* of the topology's classification structure (especially the practitioner ↔ academic bridge in TLDR para 2); others are droppable in both senses.

- **Load-bearing as exemplars, droppable for S1**: E10 (60/30/10 distribution — the canonical (a) bridge example, paired with P1 to demonstrate "practitioners ahead of academic measurement"; if removed, S1 still holds but the (a) example loses its empirical anchor), P5 (compound engineering / AI Sandwich — the canonical (b) bridge example, paired with S2 to demonstrate "practitioners concretizing prior academic principles"; if removed, S1 still holds but the (b) example weakens).
- **Droppable in both senses**: E18 (Schoenegger overconfident-AI-still-helps — interesting but tangential to S1 and not used as a bridge example), G6 (cognitive forcing as mechanism — local, not load-bearing for S1 or the bridge framing), P6 (spec-driven development — practitioner-internal P → S4 edge, not a bridge example, not load-bearing for S1).

### 3f. Distortion vectors

D1–D4 are pedagogically intentional, not decorative. Each names a real motivated-reasoning pattern and the specific empirical/logical claims it targets. Distortions are useful for readers calibrating where their own priors might be selecting against the evidence.

---

## 4. Weakest links

Where the graph is genuinely fragile in 2026, ranked by potential propagation if the link breaks:

1. **A6 / O2 / O5 (the aggregate-zero / unit-of-analysis cluster).** The most consequential weakness in the graph. Humlum-Vestergaard's precise zero across 25,000 Danish workers is direct evidence that individual workflow gains may not aggregate; if true, A6 falsifies, the project-frame inverts, and individual workflow optimization becomes locally optimal but globally insufficient. Possible mechanisms (cross-task bundling per Cowen 2026, organizational reabsorption, weak wage pass-through) are testable but untested. Until O2 is resolved, every claim about *organizational* benefit downstream of S1 is contingent.
2. **O3 (long-term cognitive effects).** Bastani's 17% drop is a single-session study with a short post-session window — the lit review documents the in-session vs. afterward contrast but no specific durability timescale. If a 2-year longitudinal RCT confirms broad skill atrophy across cognitive operations, the design implication becomes "AI-free zones" at much larger scale than current practice — and the productivity-only distortion (D2) goes from intellectually wrong to materially harmful.
3. **O4 (frontier migration vs. calibration rate).** If model capabilities shift faster than humans can recalibrate their personal frontier maps, the "map your jagged frontier" design imperative becomes unfollowable. The likely partial resolution is that the frontier has stable topological features (A5) even when the boundary moves, but A5 is empirically untested.
4. **O7 (verification cost vs. calibration as binding constraint).** Reducing verification cost helps a rational actor; improving calibration helps a miscalibrated one. The interventions are different. The sycophancy finding (E13) suggests calibration is a real second binding constraint.
5. **S5 / O6 (centaur regime persistence).** If a capability discontinuity makes AI better than humans on the full task, the centaur typology becomes historical.
6. **A5 (frontier mappable).** Empirically untested. If false, individual-level workflow design is structurally impossible at the per-task granularity the practitioner stack assumes.

---

## 5. Variants

Each variant reads the same graph through a different lens.

### Variant A — Vulnerability (where does this break?)

Highlights the seven cruxes (A1, A2, A3, A6 foundational; L1, L2, L3 logical guardrails) plus the weight-5 nodes downstream. If any crux flips, propagation is concentrated through this subgraph. Useful for stress-testing: pick a crux, imagine it inverts, and trace the consequences through the highlighted subgraph.

### Variant B — Flow (how does causation propagate?)

Restricts to the A → M → E → S/G cascade plus practitioner operationalizations (P → E/L/S via op edges). The "what generated what" view: foundational assumptions enable methods, methods produce empirical findings, mechanisms explain them, syntheses integrate, and practitioner frameworks operationalize the resulting design implications.

### Variant C — Minimal claim set

Smallest set of claims that still yields the headline conclusion (S1: workflow architecture > model capability). Approximately: A2 + A3 + E3 + E8 + L2 + G3 + G1 + S1. Eight nodes. Removing any one breaks the qualitative shape.

### Variant D — Capability-regime fragility (the topic-specific variant)

Which nodes go stale if frontier capability jumps? The central worry the lit review's adversarial section names. Of the 66 nodes, 35 sort into one of three regime-fragility classes; the rest (methods, supporting empirical findings, distortions) are regime-orthogonal. The classification:

- **Stale-on-jump (6 nodes — likely to invert).** E2 (skill-leveling: if AI exceeds top performers, the +34/0 pattern flips). E3 (outside-frontier harm: dissolves if frontier becomes uniform). E10 (60/30/10 behavior distribution: regime-bound to current tool affordances). E18 (overconfident-AI-still-helps: if AI is reliably correct, the "structured reasoning is the gain" reading dissolves). S5 (centaur transience: literally about transitioning out). P1 (Mollick taxonomy: becomes historical the way the chess centaur literature reads as historical now).

- **Stable-on-jump (18 nodes — structurally invariant).** All four logical necessities: L1 (substitution myth), L2 (joint performance surface), L3 (parameterize-by-capability), L4 (autonomy + control coexist). The mechanism invariants: G1 (metacognitive load just shifts to new decisions), G2 (ironies of automation are a property of any principal-agent delegation), G3 (verification trade-off — the threshold moves but the shape is invariant), G9 (generator-verifier asymmetry), G10 (multi-tool attention interference — Wickens MRT is a property of human cognitive architecture, not of AI capability). The foundational A1 (attention-as-binding-constraint is a fact about humans, not about AI). The synthesis claims that follow from the invariants: S2 (metacognitive efficiency target), S3 (production → critical integration), S4 (context engineering as central craft). The control-surface practitioner patterns: P2 (autonomy slider), P3 (Anthropic agent design patterns), P5 (compound engineering), P6 (spec-driven development), P7 (CLAUDE.md context files).

- **Regime-dependent (11 nodes — depends on which way capability jumps).** Foundational A2 (jagged frontier could become smooth or fragment into a different jagged shape), A3 (verification cost could fall further or rise as outputs become more complex), A5 (frontier-mappability depends on rate of migration), A6 (aggregation could go either way as tools get integrated). The headline S1 (workflow > capability could invert if a capability gap dominates). The high-leverage empirical claims: E14 (multi-agent could become moot if single-agent matches), E15 (routing depends on heterogeneity A2), E17 (read/write asymmetry depends on whether AI handles serial writes natively). The mechanism G4 (jagged frontier follows A2). The open question O4 (frontier migration rate IS the regime question).

The model formalization should be stable under capability change — meaning it should be parameterized by the L1/L2/L3/L4 + G1/G2/G3/G9 + A1 invariants and treat A2/A3/A6/E14/E15/E17 as inputs that vary by capability regime. The 6 stale nodes are the ones the formalization should *not* hardcode.

---

## 6. Stage-3 handoff

This topology is the input to model formalization. The cleanest target is a **parameterized routing function**:

```
delegate(task, worker, AI, context) → action
```

where the action is one of `{do_yourself, consult_AI, delegate_to_AI, refuse}`, and the function is parameterized by:

- **Task type** (per Bloom hierarchy + cognitive task analysis decomposition)
- **Worker capability profile** on that task type (via prior frontier mapping)
- **AI capability profile** on that task type (via per-task benchmark or recent personal experience — the L3 invariant means this is an input, not a hardcoded constant)
- **Verification cost** (function of task type and output format)
- **Stakes / reversibility** (high stakes → independent-then-synthesize per E8)
- **Skill-formation goal** (if the task is one whose capability the worker wants to maintain → guardrail mode per E9)

The `stage_outputs/<topic>/<stage>.md` folder convention holds for this topic too: raw working drafts live in `stage_outputs/technology-utilization-architecture/<stage>.md`; polished versions move into `src/content/ai_research/technology-utilization-architecture/<stage>.mdx`. So far the folder contains the lit review and this topology draft; subsequent stages will accumulate there.

This topology also feeds three natural sibling topics down the line. **Navigating an AI World** is the structural / civilizational view that holds the individual-level optimization in tension with the organizational-level disruption (its own Crux 5 / A6 sits next to mine). **AI Cognitive Profile** is the orthogonal view: rather than asking "how should an individual route tasks," ask "where does AI capability diverge from human capability across the O*NET task taxonomy" — that topic supplies what this topic treats as a black-box input (the per-task capability gradient) and inversely, this topic supplies what that topic treats as a black-box (what individuals should *do* given the gradient). **Prediction and Calibration** is the natural attachment point for O7 (verification cost vs. verification calibration as binding constraint): if calibration on AI error rates is the second binding constraint behind verification cost, the calibration topic is where that gets formalized. The Stage-3 model formalization should leave clean attachment points for all three.

---

## 7. Next moves — three Stage-3 options

Three formalization paths, each with pros / cons / Stage-4 implications.

### Option A — Capability × verification-cost dispatch table (decomposition)

**What it is.** Formalize the routing function as a 2D table: capability gap (worker minus AI on this task) × verification cost ratio (verify-cost / generate-cost). Four quadrants → four allocation rules.

- **Pros.** Maps cleanly onto the existing RCT record (quadrants align with Brynjolfsson, Dell'Acqua, METR, Everett). Easy to visualize. Practitioner-actionable.
- **Cons.** Risks violating L3 (looks like a lookup table). Mitigated if the table is generated from inputs rather than hardcoded.
- **Stage-4 implication.** Test against the published RCTs by classifying each into a quadrant.

### Option B — Generator-verifier loop with autonomy slider (generating function) [recommended]

**What it is.** Formalize the workflow as a recurrent loop: at each step, the worker chooses an autonomy level (Karpathy P2 / Feng 2025 five-level operator → observer). The loop has a verification gate; the gate's strictness is a parameter. Output: an interactive dashboard where the user inputs (task type, capability gap, verification cost, stakes, skill-formation goal) and gets a recommended autonomy level + verification cadence.

- **Pros.** Directly operationalizes Karpathy's autonomy slider (P2) using Shneiderman 2D (L4) and Vasconcelos verification economics (G3). Survives capability change (L3-compatible). Naturally maps onto an interactive site component.
- **Cons.** Requires choosing a parameterization of "verification cost" that holds across task types — non-trivial.
- **Stage-4 implication.** Test against Bastani's guardrail RCT (verification gate stringency moderating skill atrophy), Everett's independent-then-synthesize (a specific autonomy-cadence schedule), and the Mollick centaur/cyborg empirical distribution (which loops people actually run).

### Option C — Principal-agent with imperfect agent (mechanism design)

**What it is.** Apply contract-theory machinery (asymmetric information, monitoring vs. trust trade-off) to the single-worker / AI-tool relationship.

- **Pros.** Formally rigorous. Connects to a mature economic literature.
- **Cons.** Heavyweight; principal-agent assumes strategic agents, but LLMs are imperfect-but-non-strategic.
- **Stage-4 implication.** Hardest to validate against the existing RCT record.

**Recommendation: Option B.** The generator-verifier loop with an autonomy slider is the most directly testable, the most actionable as an interactive site artifact (Stage 5), and the most clearly L3-compatible. Specifically, Option B engages the regime-stable invariants identified in Variant D — L1 (every offload creates new monitoring work, baked into the loop's verification gate), L2 (joint-surface allocation, baked into the autonomy-level choice), L3 (parameterized by capability inputs rather than hardcoded), G3 (verification trade-off, the parameter that sets gate strictness), G9 (generator-verifier asymmetry, the loop's central asymmetry), and G10 (multi-tool attention interference — the loop should track concurrent-tool load as a Wickens-MRT input, not just per-task parameters; "running three coding agents in parallel" is a different regime from "running one") — while taking A6 (individual optimization aggregates) as the explicit assumption whose falsification would mean the loop is locally optimal but globally insufficient. Option A can be a sub-component of Option B (the dispatch table determines the autonomy-level choice). Option C is held in reserve.

---

## 8. Objections to this topology

**Objection 1: "The practitioner / academic split is overstated — the field is actually more integrated than the typing P-vs-S suggests."** Steelman: many researchers (Mollick, Karpathy, Tankelevitch via MSR's Tools for Thought) span both communities; some practitioner posts (Anthropic's effective agents) cite academic literature; the typing risks reifying a divide that's already partial. Response: the *people* span both, but the *artifacts* arrive on different timescales and through different processes. Mollick proposed the centaur/cyborg typology in 2024; Randazzo 2026 (HBS) is the first peer-reviewed empirical measurement of the distribution. Karpathy's autonomy slider was articulated as a tool-design concept (in his 2024 Software 1.0/2.0/3.0 talks and Cursor's Tab → Cmd+K → Agent Mode UX) about a year before Feng 2025's five-level academic taxonomy converged independently on a similar discretized-levels structure — neither grounds the other; they are parallel work on the same concern at different granularities (Karpathy's is per-action, Feng's is per-task user role). Anthropic's agent design patterns and L2D theory likewise address the same structural problem with no direct lineage. The typing isn't claiming a clean division; it's making the *structurally different roles* visible — empirical measurement, formal-theory derivation, engineering-practice synthesis — so we don't conflate "Anthropic shipped a pattern that works" with "L2D theory predicts this pattern is optimal."

**Objection 2: "The 7-crux selection is biased toward the framing the lit review wants."** Steelman: a critic could argue capability-first cruxes (e.g., "Bloom-level decay determines all task allocation") are equally defensible, or that the lit-review-driven framing inherits whatever bias the lit review has. Response: the crux set is **structurally typed**, not picked by impact. The criterion is "claims whose collapse rebuilds regions of the graph," and that maps cleanly onto exactly two node classes — foundational assumptions (A) and logical necessities (L). The cruxes are A1, A2, A3, A6 (every weight-5 A node) plus L1, L2, L3 (every weight-5 L node). No mechanism (G), synthesis (S), empirical (E), practitioner (P), or open (O) node is a crux, because their structural role is downstream of the cruxes — mechanisms explain, syntheses integrate, empirical findings test, practitioner frameworks operationalize, open questions sit at the frontier. The headline conclusion S1 (workflow > capability) is *not* a crux even though it is weight-5; it's an output of the graph, and excluding it from the crux set is the discipline that distinguishes "what the graph rests on" from "what the graph concludes." A capability-first alternative would have to add a foundational A node like "Bloom-level decay is the dominant capability gradient" — which the lit review doesn't support; LLMs decay sharply on Bloom, but task allocation is also shaped by verification cost, autonomy level, and skill-formation goals. The selection is therefore lit-review-driven, but the *typing rule* (A + L only, every weight-5) is independent of the lit review's framing — it is just "which nodes carry the structural role of being inputs the rest of the graph rests on."

**Objection 3: "The capability-regime fragility variant is a hedge — it concedes the project will go stale."** Steelman: Variant D essentially admits that half the empirical claims could invert under a capability jump; if so, why formalize at all? Response: Variant D is the discipline that *justifies* formalization. The point is to identify which nodes are stable-under-jump (the L1-L3-G1-G2-G3 invariants) and build the formalization on those. Practitioner heuristics, by contrast, are not capability-stable by design — they update faster but go stale faster too.

**Objection 4: "The aggregate-zero puzzle (E4 / O2) is so consequential that the entire individual-level framing might be wrong, and the topology should pivot to the organizational level instead."** Steelman: if Humlum-Vestergaard's zero is the true-population result, individual workflow optimization is rearranging deck chairs. Response: this is exactly what A6 (now a foundational crux) and O5 (right unit of analysis) name. The honest position is that the individual frame is *one valid level* of analysis — design-actionable for the worker who controls their own workflow — and the organizational frame is a *separate* level that should be a sibling artifact, not a replacement.

**Objection 5: "Topology + model formalization is over-engineering for a moving target. Practitioner heuristics adapt faster than academic formalization can ship; by the time you finish the model, the field has moved. The honest move is to skip directly to a build artifact using current best practices."** Steelman: this is real. The lit review already documents that the practitioner stack runs 12–18 months ahead of peer review precisely because it doesn't try to formalize. Cursor, Claude Code, and the Anthropic agent-design patterns are already shipping; users adapting heuristics in real time will outperform users waiting for an academic model. The L3 invariant ("parameterize by capability") is meant to address this, but it's a hope, not a guarantee — a formalization built on capability-stable invariants might *still* miss the regime where capability discontinuity dominates everything. Response: even granting all of that, the topology IS the artifact whose stable invariants survive capability change. Variant D identifies an 18-node regime-stable subgraph: all four logical necessities (L1 substitution myth, L2 joint surface, L3 parameterize-by-capability, L4 autonomy + control coexist), the five mechanism invariants (G1 metacognitive load shifting, G2 ironies of automation, G3 verification trade-off, G9 generator-verifier asymmetry, G10 multi-tool attention interference via Wickens MRT), the foundational A1 (attention-as-binding-constraint is a fact about humans, not AI), the synthesis claims that follow from the invariants (S2 metacognitive efficiency target, S3 production → critical integration, S4 context engineering as central craft), and the control-surface practitioner patterns (P2 autonomy slider, P3 Anthropic agent design patterns, P5 compound engineering, P6 spec-driven development, P7 CLAUDE.md context files) — none of which are capability-bound claims. Practitioner heuristics adapt faster but lossy: they don't preserve the reasoning behind the rule, so users can't tell when the heuristic stops applying. The 12–18 month lag goes both ways — practitioners ship faster, but they also rediscover Bainbridge 1983 forty years late. The model formalization's value is less "predict optimal allocation" than "encode the invariants explicitly so the next capability shift doesn't require re-litigating from scratch." That said: the objection has bite for Stage 5 specifically. If the build artifact is a fragile prediction tool that hardcodes 2026 capability, it goes stale. If it is a *frame* (the autonomy slider, the verification-cost trade-off visualized) that the user fills in with their own current capability profile, it is L3-compatible and survives.

---

## 9. Glossary

- **AI Sandwich** — Shipper / Every: humans frame the task and review the output; AI handles the middle.
- **autonomy slider** — Karpathy: a per-action user control surface (Cursor: Tab → Cmd+K → Agent Mode); operationalizes Shneiderman's 2D framework.
- **Bloom's taxonomy** — Remember → Understand → Apply → Analyze → Evaluate → Create. LLM capability decays sharply going up.
- **centaur** — Mollick: clean human/AI role separation; discrete handoffs based on per-task frontier mapping.
- **CoALA** — Cognitive Architectures for Language Agents (Sumers 2024). Maps LLM agents to ACT-R / SOAR memory + action structures.
- **complementarity potential vs. team performance** — Hemmer 2025 distinction: mathematical possibility vs. actual achievement of human+AI exceeding either alone.
- **compound engineering** — Shipper / Every: plan → work → review → compound. Each loop's outputs feed the next loop's inputs.
- **context engineering** — Anthropic / Cognition: the discipline of constructing what the model sees. Practitioner consensus: higher leverage than model selection.
- **CTA / ACTA** — Cognitive Task Analysis / Applied CTA. Method for decomposing what a knowledge worker actually does cognitively.
- **CUPS** — Cognitive Use Pattern States; Mozannar 2024 telemetry-grounded taxonomy of programmer-Copilot interaction.
- **cyborg** — Mollick: continuous interleaving of human and AI at sub-task granularity. The empirical majority pattern.
- **generator-verifier asymmetry** — production cost falls toward zero with AI; verification cost stays roughly constant. Karpathy's framing.
- **H-LAM/T** — Engelbart 1962: Human using Language, Artifacts, Methodology, Training. Current rollouts ship the artifact; methodology and training lag.
- **HAIJCS** — Human-AI Joint Cognitive System (Xu & Gao 2024). Bridge from CSE to LLM teaming.
- **ironies of automation** — Bainbridge 1983: AI handling routine work degrades human capacity to catch the rare critical errors.
- **jagged frontier** — Dell'Acqua 2023: AI capability is heterogeneous across sub-tasks; using AI inside the frontier helps, outside causes harm.
- **L2D** — Learning to Defer; Madras / Mozannar formal allocation theory.
- **metacognitive demands** — Tankelevitch 2024 best-paper framework: GenAI reduces production load but increases planning, evaluation, and calibration load.
- **MABA-MABA** — "Men Are Better At, Machines Are Better At." Dekker & Woods 2002 critique of static function allocation.
- **MRT (Multiple Resource Theory)** — Wickens. Cognitive resources are partitioned by processing stage, sensory modality, and processing code; tasks competing for the same resource interfere superlinearly while tasks using different resources can be parallelized cheaply. Predicts tool-stack attention overload (Cursor + ChatGPT + meeting incurs cost beyond the sum). Lit review notes MRT is "absent from AI workflow research."
- **read/write asymmetry** — Cognition: multi-agent works for read-heavy tasks but breaks for write-heavy unless writes are serialized.
- **routing > model selection** — Paterson 2026: across 38 real daily tasks and 15 models, dispatching by task type beats picking a single best model.
- **self-automator** — Randazzo 2026 third Mollick-typology mode: full delegation with periodic oversight.
- **sycophancy in feedback loops** — Randazzo HBS 26-021: AI escalates persuasive justification on pushback.
- **Tools for Thought** — Microsoft Research program on AI-augmented knowledge work.
- **verification cost** — the cost of checking whether an AI output is correct. Vasconcelos 2023 reframing: overreliance is rational when verification is expensive relative to expected payoff.