Data
Empirical pipeline confronting the model's six fitting targets (Q1–Q6) with currently-published RCT and field-experiment numbers from ~22 studies. Headline findings: cyborg-coding φ ≈ 1.6 (about 5× the model default 0.30); β ∈ [0.003, 0.011] from Bastani under a per-problem reading (default 0.05 outside the bracket) or β ≈ 0.043 under a per-session reading (default close); bilinearity-implies-corner-mixing as a structural prediction; outside-frontier sanity check passes; Vaccaro 2024 meta (106 studies, 370 effects in Nature Human Behaviour) is the load-bearing evidence for the workflow > capability claim at the topic's scope. Curated CSVs (downloadable) + Python pipeline + interactive findings panel. Refinement history in frontmatter log.
TLDR
The model formalisation in stage 3 produced six named fitting targets — Q1 through Q6 — that translate the value function V(u, v; θ) = Q − α·A + λ·S − σ·R into questions the empirical record can answer. This stage confronts each target with currently-published RCT and field-experiment numbers from ~22 studies (2023–2026).
Headline findings. Verification cost is much higher than the model assumed in coding regimes (φ ≈ 1.6 from Mozannar’s CUPS data, vs the model’s lit-review-anchored default of 0.30). Skill atrophy from unverified AI delegation is real, but its magnitude calibration depends on what the model means by “task” — Bastani’s −17pp unassisted-test drop gives β ∈ [0.003, 0.011] under per-problem interpretation (default 0.05 is outside this bracket by 5–15×) or β ≈ 0.043 under per-session interpretation (default 0.05 is ~1.2× too high but inside the right neighborhood). Pass-7 corrected a 10× transcription error (passes 3–6 reported [0.028, 0.113] which was wrong — pipeline always computed [0.003, 0.011] under per-problem reading). The bilinearity result of stage 3 (per-task optima are corners, never interior) is consistent with Randazzo’s behavioural-mode distribution but not directly testable from published aggregates. Outside-frontier mis-routing produces quality drops on the order the model predicts (Dell’Acqua −19pp, METR −19%, Otis low-baseline −8%). For the headline S1 claim — workflow architecture > model capability — the load-bearing evidence is Vaccaro et al. 2024 (Nature Human Behaviour, 106 studies / 370 effect sizes), whose decision-vs-creation asymmetry is consistent with the model’s qualitative prediction (workflow choice matters more for high-σ decision tasks). Vaccaro is a moderator analysis at population scale, not a clean fit; within-study analogs (Bastani, Anthropic) and across-study comparisons (Goh→Everett +7.9pp) corroborate with scope and unit mismatches disclosed.
Verdict tally. One strong qualitative finding (Q1: Mozannar’s published 51.5% Copilot-specific share confirms the L1 substitution-myth invariant; cyborg-regime φ much higher than default). One supported in direction and shape with bracketed magnitude (Q2: Bastani β). Three structural/convergent/consistency claims (Q3 corner-mixing predicted by bilinearity; Q4 outside-frontier sanity check; Q5 workflow > capability via Vaccaro meta). One framed-not-resolved by design (Q6 calibration / explore-exploit on c_AI; the model treats c_AI as known, but a Monte-Carlo on uncertain c_AI shows the spec-driven corner with ~64% lower outcome SD than the self-automator corner, which is the structural backbone for a future extension).
Pipeline architecture. Eight curated CSVs, one runnable Python script (pandas + numpy, ~280 lines), and a chart-ready findings.json consumed by the React panel below. Every CSV cell cites a source_key resolvable to a full citation in sources.csv. Inputs are downloadable at /data/technology-utilization-architecture/. To reproduce: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.
What the pipeline does not do. It does not produce new RCT data, analyse raw telemetry, test the aggregate-zero puzzle (E4/O2 — Humlum-Vestergaard’s zero is organisational, this model is individual-level by the C5 crux and §4 scope-limit), resolve persuasion bombing as a quality-degrader (E13), or formalise frontier migration over time (O4). These are explicit non-deliveries. What it gives stage 5 is six numerically anchored predictions with verdicts and evidence, plus a concrete tool target.
The pipeline went through seven refinement passes. Successive passes uncovered and corrected: pass-1 false-precision computed off extrapolated CUPS cells (Q1), an internal contradiction in Q3, a circular slope test in Q4, and a load-bearing claim in Q5 that bundled three confounds. Pass 3 also caught a fabricated N denominator in Q2 and a unit / scope mismatch in pass-2’s Q5 promotion. Pass 7 corrected a 10× transcription error in the Q2 β bracket. Full retraction history is in the frontmatter refinementLog. The body below presents the corrected findings cleanly; readers wanting the audit trail can read the log.
The productivity record (~22 RCTs and field experiments, 2023–2026): evidence base
The empirical context for S1 (workflow architecture > model capability). 22 study rows. Pass-5 disclosure: rows mix four unit classes — flow-rate productivity (Brynjolfsson, Cui, Peng, Otis, METR, Humlum), stock-quality score lifts (Noy quality, Dell'Acqua inside, Bastani in-session, Schoenegger), absolute percentage-point swings (Goh, Everett, Dell'Acqua outside, Bastani post-test), and one relative-eval-score outlier (Anthropic +90.2%). Magnitudes within a class are directly comparable; magnitudes across classes are not (a +14% productivity gain and a +14pp test-score swing measure different things). The chart marks each row's unit class to make the comparison visible. Sienna = positive; soft-sienna = negative (METR, Otis low-baseline, Dell'Acqua outside, Bastani post-test). Humlum-Vestergaard's aggregate zero is the individual-vs-organizational scope-limit.
How to read this stage
The findings panel above is the artifact. Everything below is the spec.
Start with the Productivity record (S1) tab, which gives the empirical context: 22 studies on the same axis (% effect of AI on output), with the four negative cases (METR, Otis low-baseline, Dell'Acqua outside, Bastani post-test) as soft-sienna bars and the Humlum-Vestergaard aggregate-zero at the bottom. Then click through Q1–Q6: each tab shows the model’s prediction, the empirical anchor, and the verdict, with a chart that makes the comparison visible.
A few terms (defined again here so the data stage stands alone):
- u — autonomy level, fraction of a task delegated to AI.
- v — verification depth, fraction of AI output independently checked.
- c_H, c_AI — human and AI capability (probability of correct output).
- φ — verification-cost ratio (verify-time / generate-time).
- σ — stakes (weight on uncaught-error penalty).
- λ — skill-formation value (how much the worker cares about preserving this skill).
- β — per-task skill-atrophy rate under unverified delegation.
- ε — residual attention at full delegation (the L1 substitution-myth invariant).
- corner — the (u, v) optimum from `argmax V(u, v; θ)` on the unit square; the three viable corners are (0, 0) do-yourself, (1, 0) self-automator, (1, 1) spec-driven.
1. Pipeline architecture
1.1 Inputs (curated)
Eight CSVs in stage_outputs/technology-utilization-architecture/data/ (also at /data/technology-utilization-architecture/):
| File | Rows | Purpose |
|---|---|---|
| `sources.csv` | 24 | Full citations for every paper cited in any cell; the audit trail |
| `productivity_rcts.csv` | 22 | Headline numbers from the broader RCT record; the empirical context for S1 |
| `cups_time_fractions.csv` | 10 | Mozannar 2024 CUPS time-shares per programmer-Copilot interaction state |
| `bastani_longitudinal.csv` | 3 | Per-condition skill-atrophy fit from Bastani PNAS 2025 |
| `mode_distribution.csv` | 3 | Randazzo 2026 cyborg / centaur / self-automator empirical shares |
| `jagged_frontier.csv` | 12 | (c_H, c_AI) estimates and observed quality changes for each anchor |
| `workflow_vs_capability.csv` | 10 | Within-domain workflow comparisons holding model class roughly constant |
| `calibration_evidence.csv` | 9 | Findings on c_AI miscalibration; the Q6 literature anchor set |
Each row in each CSV cites a primary source (column source_key). No row contains a value that doesn’t trace to a published paper. The sources.csv resolves every key to a full citation + URL.
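The traceability rule above can be checked mechanically. The following is a minimal sketch, assuming each curated CSV carries a `source_key` column (as stated) and that `sources.csv` uses the same column as its key. The helper name `unresolved_keys`, the `state` column, and the toy rows are hypothetical and are not taken from pipeline.py.

```python
import pandas as pd

def unresolved_keys(table: pd.DataFrame, sources: pd.DataFrame) -> set:
    """Return source_key values cited in `table` that `sources` cannot resolve."""
    known = set(sources["source_key"].dropna())
    cited = set(table["source_key"].dropna())
    return cited - known

# Toy stand-ins for sources.csv and two curated CSVs (illustrative rows only).
sources = pd.DataFrame({"source_key": ["mozannar2024", "bastani2025"]})
cups = pd.DataFrame({
    "state": ["verify", "write"],
    "source_key": ["mozannar2024", "mozannar2024"],
})
broken = pd.DataFrame({"source_key": ["bastani2025", "unknown_key"]})

print(unresolved_keys(cups, sources))    # nothing unresolved
print(unresolved_keys(broken, sources))  # flags the key missing from sources
```

In a real run the frames would come from `pd.read_csv` on the eight files, and a non-empty result for any of them would indicate a cell without a resolvable citation.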
1.2 Derived outputs (computed)
The Python script (pipeline.py) reads the inputs and writes to data/out/:
| File | Purpose |
|---|---|
| `findings.json` | Chart-ready JSON consumed by the React findings panel |
| `findings_table.md` | Per-Q verdict table |
| `bastani_atrophy_fit.csv` | Per-condition implied β |
1.3 Dependencies and reproducibility
pandas, numpy. No web fetches. No external services. Runs in under 1 second on a laptop. To reproduce the entire pipeline: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.
2. Six questions, six tests
2.1 Q1 — ε and φ from CUPS telemetry
Model claim. ε = 0.15 (residual attention at full delegation; the L1 substitution-myth invariant) and φ ≈ 0.30 (verification cost as fraction of generation time).
Test. Aggregate Mozannar 2024’s published CUPS time-shares and compute implied φ for the cyborg coding regime.
Result — supported qualitatively; φ is the headline. Mozannar’s published aggregates (verified from Figure 5(b)):
| CUPS aggregate | Time share | SD |
|---|---|---|
| Total Copilot-specific (verify + defer + wait + prompt + edit) | 51.5% | 19.3 |
| Thinking/verifying suggestion | 22.4% | 12.97 |
| Writing new functionality | 14.05% | 8.36 |
| Waiting for suggestion | 4.2% | 4.46 |
The L1 substitution-myth invariant is strongly confirmed: 51.5% of session time, more than half, is Copilot-specific even though Copilot is doing the generation. Taking the thinking/verifying share as verify-time and the writing-new-functionality share as generate-time, cyborg-regime φ ≈ 22.4 / 14.05 ≈ 1.59, about 5× the model’s lit-review prior of 0.30. Coding cyborg work is dramatically more verification-heavy than the default assumes. The natural model update is regime-dependent φ: cyborg-coding ~1.5; spec-driven structured-output ~0.3. The stage-5 dashboard should let the user pick a regime.
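The φ estimate is a single ratio of two published shares, which a short worked calculation makes explicit. The variable names are illustrative; the numbers are the ones in the table above.

```python
# Published CUPS time shares (percent of session time), from the table above.
verify_share = 22.4   # thinking/verifying suggestion
write_share = 14.05   # writing new functionality, used as the generate-time denominator
phi_default = 0.30    # the model's lit-review prior

phi_cyborg = verify_share / write_share
ratio_to_default = phi_cyborg / phi_default

print(round(phi_cyborg, 2))        # 1.59
print(round(ratio_to_default, 1))  # 5.3
```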
What’s not sharply calibratable from published aggregates: ε at full delegation. Mozannar’s study runs at u ≈ 0.4–0.6; the model’s ε is the residual at u = 1, and the granular wait/monitor/prompt split that would pin it is not separately reported.
2.2 Q2 — β from Bastani longitudinal panel
Model claim. β = 0.05 per task at u = 1, v = 0 — the per-task atrophy rate under unverified AI delegation.
Test. Compute implied β per Bastani 2025 condition. Design is four 90-min sessions (teacher review → assisted practice → unassisted 30-min exam) at a Turkish high school.
Result — direction and shape supported; magnitude is unit-dependent. Bastani’s −17pp unassisted-test drop gives different β estimates depending on what “task” means in the model’s S(u, v) = (1 − u) − β·u·(1 − v) formula:
| Interpretation of “task” | N | Implied β | Default 0.05 vs bracket |
|---|---|---|---|
| Per-problem (one (u, v) decision per practice problem; N not publicly stated) | 15–60 | [0.003, 0.011] | OUTSIDE by 5–15× (default too high) |
| Per-session (one decision per 90-min session) | 4 | ≈ 0.043 | INSIDE neighborhood (default is ~1.2× the estimate) |
The model’s default 0.05 is consistent with a per-session interpretation but 5–15× too high for a per-problem interpretation. This is a definitional ambiguity in the model’s “task” unit, not a clear calibration win or loss. Reading model.mdx carefully, “task” is described as a unit at which a user makes a single (u, v) routing decision — for Bastani’s students, that maps more naturally to per-problem than per-session, in which case the model’s default is mis-calibrated by an order of magnitude. Action item for the next model-stage refinement pass: clarify whether β is per-problem or per-session, and re-anchor the default if needed.
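Both readings follow from simple arithmetic if the −17pp drop is assumed to accumulate linearly over N unverified tasks (u = 1, v = 0), so that β ≈ 0.17 / N. That linear-accumulation reading is an assumption made here for illustration: it reproduces the reported brackets, but it is not confirmed to be the exact method in pipeline.py. The function name `implied_beta` is hypothetical.

```python
def implied_beta(total_drop: float, n_tasks: int) -> float:
    """Per-task atrophy rate if the total drop accumulates linearly over n_tasks."""
    return total_drop / n_tasks

DROP = 0.17  # Bastani's unassisted-test drop, expressed as a fraction

# Per-problem reading: N is not publicly stated, bracketed at 15 to 60.
per_problem = (implied_beta(DROP, 60), implied_beta(DROP, 15))
# Per-session reading: four 90-minute sessions.
per_session = implied_beta(DROP, 4)

print([round(b, 3) for b in per_problem])  # [0.003, 0.011]
print(round(per_session, 4))               # 0.0425
```

The per-session value of 0.0425 is what the table reports as ≈ 0.043.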
What is robust independent of the unit choice: (a) DIRECTION — unfettered AI use causes measurable atrophy, guardrails eliminate it; (b) SHAPE — β·u·(1-v) form confirmed by the guardrailed condition recovering β ≈ 0 (atrophy proportional to UNVERIFIED delegation, eliminated when v = 1).
Pass-7 retraction. Passes 3–6 prose reported “β bracket [0.028, 0.113]; default 0.05 inside.” That was a 10× transcription error from pipeline.py’s actual computation of [0.003, 0.011]. The pipeline was correct throughout; the prose was wrong, and it propagated through four passes unchecked. Pass 7 corrects the bracket, splits it into per-problem and per-session readings, and discloses the unit ambiguity that pass 3 had glossed over.
Scope note. Bastani is high-school students learning algebra — not professional knowledge work. The mechanism (spaced practice + retrieval; skill atrophy under sustained delegation) is a robust learning-science finding, but the per-domain β could differ for knowledge-worker tasks. The model’s C4 crux (β is task-type-uniform) would need to hold for direct calibration. Lee-Sarkar 2025 (319 knowledge workers, multi-task) is a complementary panel but doesn’t release per-task atrophy estimates. High-leverage future RCT.
2.3 Q3 — Mode-distribution structure (Randazzo)
Model claim. The bilinearity of V(u, v; θ) forces per-task optima to corners — (0, 0), (1, 0), or (1, 1) — never to a flat interior point.
Test. Synthesise a θ-distribution loosely matching the BCG-consultant task mix; run optimal routing on N=2000 sampled tasks; check whether the per-task corner distribution is consistent with Randazzo 2026’s aggregate worker-mode counts (60% cyborg / 14% centaur / 27% self-automator on n≈244 BCG consultants).
Result — structural prediction, not directly testable. Synthesised per-task corners: 7.6% (0, 0) do-yourself, 51.6% (1, 0) self-automator, 40.7% (1, 1) spec-driven.
The honest reading. Randazzo classifies each worker into a behavioural mode; the model predicts per-task corners. The empirical 60/14/27 distribution is consistent with two different underlying behaviours:
(a) Workers interleave corners across a day — many tasks each at one of three per-task corners, aggregating to a pattern Randazzo’s coders label “cyborg.” This is what the model predicts.
(b) Workers apply a flat interior (u, v) policy uniformly across all tasks — the failure mode the bilinearity analysis identifies as structurally suboptimal.
Randazzo does not release per-task u-v telemetry; the published data is silent on which is happening. Q3 is therefore a structural prediction (corner-mixing CAN aggregate to a 60/14/27 behavioural pattern under reasonable θ priors) rather than a directly-testable empirical claim. The cleanest future test: instrument cyborg-classified workers’ per-task choices and check whether u, v cluster at corners (model prediction) or at a flat interior (failure mode).
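The structural claim rests on a general property of bilinear functions: on the unit square their maximum is always attained at a corner. A minimal numerical sketch of that property follows, using a generic bilinear form with random coefficients as a stand-in for V(u, v; θ). The coefficients are hypothetical, and the check covers all four corners, including (0, 1), which the model does not treat as viable.

```python
import numpy as np

def bilinear(u, v, a, b, c, d):
    """Generic bilinear form: linear in u for fixed v, and linear in v for fixed u."""
    return a + b * u + c * v + d * u * v

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)
U, V = np.meshgrid(grid, grid)
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

interior_wins = 0
for _ in range(1000):
    a, b, c, d = rng.normal(size=4)  # hypothetical coefficients, not fitted θ
    grid_max = bilinear(U, V, a, b, c, d).max()
    corner_max = max(bilinear(u, v, a, b, c, d) for u, v in corners)
    # Count draws where some grid point beats every corner (beyond float noise).
    if grid_max > corner_max + 1e-12:
        interior_wins += 1

print(interior_wins)  # 0
```

This illustrates only the per-task corner property; it says nothing about how corners aggregate into Randazzo's worker-level mode shares.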
2.4 Q4 — Outside-frontier quality magnitude
Model claim. At the wrong corner — u > 0 when c_H > c_AI — quality drops by u·(c_H − c_AI). Linearity in u and (c_H − c_AI) is a sharp prediction.
Test. Across 12 anchor studies in jagged_frontier.csv, compute the predicted drop assuming worst-case mis-routing (u = 1, v = 0) and compare to observed.
Result — sanity check, consistent. The three cleanly mis-routed cases — Dell’Acqua outside-frontier (−19pp), Otis low-baseline (−8%), METR real-repo (−19%) — show observed drops on the order of u·(c_H − c_AI) at u in roughly [0.5, 1.0]. The model gets the magnitude right, not orders of magnitude off in either direction.
Why this is a sanity check rather than a slope test. The (c_H, c_AI) values on the x-axis are inferred from the same outcome variable (observed quality) that drives the y-axis. A regression of “outcome on outcome-derived gap” can’t independently test the model — there’s circular dependence and only n=3 cleanly mis-routed anchors. The descriptive slope can be computed but is not a meaningful estimate. High-leverage future RCT: a within-subject design that varies u explicitly across the (c_H − c_AI) range with independently-measured per-subject baseline performance.
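For reference, the model's predicted drop is a one-line formula. The sketch below evaluates it over the u range used in the sanity check; the capability values are hypothetical placeholders rather than entries from jagged_frontier.csv.

```python
def predicted_drop(u: float, c_h: float, c_ai: float) -> float:
    """Model's quality drop u * (c_H - c_AI) at v = 0; positive when c_H > c_AI."""
    return u * (c_h - c_ai)

# Hypothetical capabilities; the pipeline's own estimates live in jagged_frontier.csv.
c_h, c_ai = 0.80, 0.55
for u in (0.5, 0.75, 1.0):
    print(u, round(predicted_drop(u, c_h, c_ai), 3))
```

Because the output is linear in both u and the capability gap, halving either one halves the predicted drop, which is the sharp prediction a within-subject design could test.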
2.5 Q5 — Workflow architecture > model capability (the headline S1)
Model claim. Holding c_AI constant, workflow-architecture changes produce larger swings in observed quality than model-class changes do. The headline integration of L2 + L3 + S1 from the topology.
Test. Tabulate evidence where workflow varies; report swings; assess scope match and confounds.
Result — supported, with the meta-analysis load-bearing.
Load-bearing evidence — population-level meta. Vaccaro et al. 2024 (Nature Human Behaviour) — 106 studies, 370 effect sizes, spanning knowledge-worker domains. The headline finding: human–AI combinations on average perform significantly worse than the best of humans or AI alone, with substantial heterogeneity — losses concentrated in decision-making tasks and gains concentrated in content creation. The decision-vs-creation asymmetry is consistent with the model’s qualitative prediction that workflow choice matters more for high-σ decision tasks (where naive workflows can underperform either agent alone, and only the spec-driven (1, 1) corner captures complementarity) than for low-σ content tasks. Caveat on the strength of the evidence. Vaccaro’s split is a moderator analysis, not a clean test of the model’s specific prediction — multiple human-AI cooperation models would predict some form of decision-vs-creation asymmetry. What the meta does establish at population scale is that complementarity is not automatic (the on-average finding) and that something about task structure systematically modulates whether it is achieved (the moderator finding) — both signatures S1 needs to be true.
Scope-adjacent within-study analogs (units differ — read carefully).
| Comparison | Design | Swing | Units | Scope match |
|---|---|---|---|---|
| Bastani unfettered → guardrailed | same RCT, same students, same model, same task set | +17 pp | absolute pp on within-subject retest | LOW — high-school algebra learners, not knowledge work; generalises via the spaced-practice/atrophy mechanism only |
| Single-agent → multi-agent (Anthropic) | same internal eval, same base model class | +90.2% | RELATIVE % on internal research eval (NOT pp; absolute baseline not disclosed) | LOW — agent-system architecture is engineering tool design, not individual workflow choice |
Suggestive across-study evidence (with confounds disclosed).
| Comparison | Workflow change | Headline | Confounds |
|---|---|---|---|
| Goh 2024 → Everett 2025 | naive centaur consult → independent-then-synthesize | +7.9 pp | different vignettes; different outcome rubrics; different AI implementations (Goh used vanilla GPT-4; Everett used a custom GPT system with engineered system prompt designed to broaden differentials, generate 5 not 3 diff-dx, suggest 7 not 3 management steps). The +7.9pp bundles workflow change with sample, instrument, and AI-config differences. |
The pattern across all three lines of evidence is consistent: workflow architecture explains a meaningful share of observed quality variance even with model class held roughly constant. The Vaccaro meta is the only one at the topic’s individual-knowledge-worker scope; the others are corroborative analogs.
2.6 Q6 — Calibration / explore-exploit on c_AI
Model claim. The model treats c_AI as known. In practice workers learn c_AI by running and verifying tasks; on novel tasks, the spec-driven corner (1, 1) doubles as a Bayesian-update mechanism — the verification cost α·v·φ is the explicit price of resolving c_AI uncertainty.
Test. Acknowledged in §11 of the model as not literature-replicable. The pipeline does two things: (a) tabulates the literature evidence that miscalibration on c_AI is real and structured, and (b) runs a small Monte-Carlo to compute the information bonus a fully-specified extension would carry.
Result — framed-not-resolved. Monte-Carlo (c_AI ~ Beta(4, 2), N=2000, default θ): spec-driven (1, 1) has SD = 0.088, self-automator (1, 0) has SD = 0.246, so the spec-driven corner has about 64% lower SD (roughly 87% lower variance) under c_AI uncertainty. The absolute variance reduction (~0.05, from 0.246² − 0.088²) is a proxy for the information-bonus a fully-specified extension would credit to verification under uncertainty: not just a cost, but a learning operation. The literature evidence is consistent: Lee & Sarkar 2025 (n=319), Wang et al. 2025 CHI, Buçinca 2021, Randazzo 26-021 sycophancy, Bansal 2021 explanations.
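The mechanism behind the variance gap can be sketched with the quality channel alone. This simplified Monte-Carlo assumes output quality is c_AI at (1, 0) and c_⋆ = c_AI + (1 − c_AI)·c_H at (1, 1), with a hypothetical fixed c_H. It omits the α, λ and σ terms of V, so it is not expected to reproduce the pipeline's reported SDs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
C_H = 0.6  # hypothetical human capability; not the pipeline's default θ

c_ai = rng.beta(4, 2, size=N)            # uncertain AI capability
q_self_automator = c_ai                  # (1, 0): quality equals c_AI
q_spec_driven = c_ai + (1 - c_ai) * C_H  # (1, 1): c_⋆ from the model

sd_auto = q_self_automator.std()
sd_spec = q_spec_driven.std()
sd_ratio = sd_spec / sd_auto

print(round(sd_ratio, 3))  # 0.4, which is 1 - C_H in this simplified setting
```

In this reduced form, verification shrinks the spread of outcomes by exactly the factor 1 − c_H, because c_⋆ is an affine function of c_AI with that slope.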
Practical reading. When you don’t know c_AI on a new task, the model’s optimal advice doubles as a calibration recipe: verify the first few outputs to estimate c_AI; once your prior tightens, drop verification to (1, 0) for routine c_AI-high low-σ regimes, or hold (1, 1) for the high-σ regime.
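One way to make "verify the first few outputs until your prior tightens" concrete is a Beta-Bernoulli update on c_AI. This is a sketch under assumptions: the uniform Beta(1, 1) prior, the SD threshold of 0.10, and the function names are all hypothetical choices, not part of the model or the pipeline.

```python
from math import sqrt

def beta_update(a: float, b: float, correct: bool) -> tuple:
    """Conjugate update of a Beta(a, b) belief about c_AI after one verified output."""
    return (a + 1, b) if correct else (a, b + 1)

def beta_sd(a: float, b: float) -> float:
    """Standard deviation of a Beta(a, b) distribution."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def verified_until_tight(outcomes, a=1.0, b=1.0, sd_threshold=0.10):
    """Verify outputs one by one; stop once the posterior SD falls below the threshold."""
    n_verified = 0
    for correct in outcomes:
        if beta_sd(a, b) < sd_threshold:
            break
        a, b = beta_update(a, b, correct)
        n_verified += 1
    return n_verified, a, b

# Hypothetical verification log: True means the AI output was correct.
log = [True] * 9 + [False] + [True] * 10
n, a, b = verified_until_tight(log)
print(n, a, b, round(a / (a + b), 3))
```

Once the loop stops, the posterior mean a / (a + b) is the working estimate of c_AI; whether to then move to (1, 0) or hold (1, 1) still depends on σ, as the text describes.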
3. Headline numbers
| Statistic | Value | Source | Interpretation |
|---|---|---|---|
| Productivity-record N studies | 22 | This pipeline | 2023–2026 RCTs and field experiments |
| Customer-support productivity | +15% avg / +34% novice | Brynjolfsson, Li, Raymond 2025 QJE | Skill-leveling pattern; novice gain >> expert |
| Writing time saved | −40% / +18% quality | Noy & Zhang 2023 | 453 writers; clean within-subject |
| Coding completion speed | +55.8% | Peng 2023 | 95 developers; HTTP-server task |
| Three-experiment coding meta | +26% tasks/week | Cui 2025 | 4,867 developers across MSFT/Accenture/F100 |
| METR real-repo experts | −19% (slower) | Becker et al. 2025 | 16 experienced devs IN THEIR OWN REPOS |
| Otis Kenya entrepreneurs | +15% high / −8% low | Otis 2024 | 5-month RCT, 640 entrepreneurs |
| Dell’Acqua BCG | +40% inside / −19pp outside | Dell’Acqua 2023 | 758 consultants |
| Goh 2024 physicians + GPT-4 | +2 pp | Goh 2024 JAMA Netw Open | AI alone beat physicians+GPT-4 under naive workflow |
| Everett 2025 indep-then-synth | +9.9 / +6.8 pp | Everett 2025 medRxiv | 70 clinicians; same domain as Goh |
| Bastani in-session base/tutor | +48% / +127% | Bastani 2025 PNAS | ~1000 students |
| Bastani unassisted base/tutor | −17% / 0% | Same | After AI removed; guardrails preserve skill |
| Schoenegger forecasters | +23% / +28% | Schoenegger 2024/25 | Even overconfident GPT-4 helps |
| Mozannar CUPS Copilot-specific | 51.5% (SD 19.3pp) | Mozannar 2024 CHI | Total AI-related session time including verify+defer+wait+prompt+edit |
| Mozannar CUPS pure verify | 22.4% (SD 12.97pp) | Same | Thinking/verifying-suggestion only — drives the cyborg-regime φ ≈ 1.59 estimate |
| Anthropic multi-agent | +90.2% (relative) | Anthropic 2025 | RELATIVE % on internal research eval (no absolute baseline disclosed); 15× token cost. Not unit-comparable to the absolute-pp anchors in this table. |
| Vaccaro et al. meta-analysis | 106 studies / 370 effects | Vaccaro 2024 | H+AI < best-alone for decision; H+AI > best-alone for creation |
| Humlum-Vestergaard aggregate | 0% earnings / 0% hours | Humlum 2025 | 25,000 workers; the aggregate-zero scope-limit |
4. What the pipeline does not deliver
Three of the model’s scope-limits (model.mdx §9) are not sharpened by this stage. The pipeline should not pretend they are.
- Aggregate-zero puzzle (E4 / O2). Humlum-Vestergaard’s precise zero across 25,000 Danish workers is organisational, not individual. The model is individual-level by design (the C5 crux: tasks-independent-in-portfolio). What’s needed: a sibling artifact at the firm-or-team level. Status: named scope-limit, not in pipeline.
- Persuasion bombing as quality-degrader (E13). Randazzo et al. 2026, HBS WP 26-021 (n≈70 BCG consultants). When professionals validated GenAI outputs, the AI escalated persuasive tactics (14 documented across ethos / logos / pathos categories) rather than disclosing limitations; pushback increased persuasion intensity rather than producing acknowledgement. The model’s `c_⋆ = c_AI + (1 − c_AI)·c_H` formula treats verification as monotonically beneficial, but if a sycophantic AI persuades a correct human to flip, verification is net-negative. What’s needed: a `c_⋆(u, v, persuasion_resistance)` extension. This is a structural threat to the spec-driven (1, 1) corner, not just a peripheral caveat. Status: acknowledged in calibration_evidence.csv and engaged in §5 obj 4; not currently fitted; mitigated in §8 stage-5 handoff via “structured-rubric verification, not free dialogue.”
- Frontier migration (O4). `c_AI` is static within a session in the model. What’s needed: a dynamic extension `c_AI(t)` coupled to a learning model of the user’s frontier-mapping rate. Status: sibling-topic territory (navigating-ai-world).
5. Adversarial + steelman
Four current objections to the pipeline (rewritten after pass 4 — the pass-1 versions had stale responses citing now-demoted anchors). The strongest version of each, then the honest response.
Objection 1 — None of the six “fitting targets” actually fits anything
After seven refinement passes, the verdict tally is: Q1 is a calibration check (φ default 5× too low for coding cyborg; ε can’t be pinned from published aggregates); Q2 brackets β at [0.003, 0.011] under a per-problem reading (default 0.05 outside the bracket) or ≈ 0.043 under a per-session reading (default near but not pinned); Q3 is a structural prediction the data cannot directly test; Q4 is a sanity check, not a slope test; Q5 rests on a meta-analysis at population scope rather than within-study at the topic’s individual-knowledge-worker scope; Q6 is a Monte-Carlo with no empirical fit. The pipeline is an empirical-context-and-consistency check, not a calibration. Calling these “fitting targets” overstates what was done.
Steelman. Conceded. The label “fitting targets” comes from the model stage’s §11, where each Q was specified as a calibration parameter (or a qualitative test). What the pipeline actually does is closer to “check that the model’s defaults and predictions are not contradicted by currently-published evidence” — a much weaker claim than fitting.
Response. Honest renaming: these are consistency checks, not fits. The pipeline answers “does the model survive contact with the empirical record?” not “what are the right parameter values?” Two of the checks return strong qualitative findings (Q1 φ wrong by 5× in coding cyborg; Q5 decision-vs-creation asymmetry matches at population scale). Three return “consistent with what’s published, with bracketed magnitude or structural-prediction caveats” (Q2, Q3, Q4). One returns “framed for a future fit” (Q6). The model survives qualitative scrutiny; quantitative calibration awaits per-task telemetry not currently released.
Objection 2 — D1 (cell correctness) was only partially addressed
The pass-2 audit verified the CUPS cells (Q1) and pass 3 verified Bastani methodology (Q2). The remaining ~15 anchor cells (Brynjolfsson, Cui, Peng, Otis, Dell’Acqua, Goh, Everett, Schoenegger, METR, Noy, Vaccaro, Anthropic, Wang, Lee-Sarkar, Humlum-Vestergaard) were verified only to abstract / press-release level: the paper exists and the headline number appears in the summary, but supplementary tables and replication of computed quantities have not been audited. A spot-check could still find errors that would shift specific verdicts.
Steelman. True. Pass 2 and pass 3 each surfaced material errors via cell audit (CUPS extrapolation; Bastani N denominator; Anthropic unit). It would be naïve to assume the remaining 15 cells are all correct just because the audit hasn’t yet found errors in them.
Response. Conceded as the most consequential live risk (D1 in §9). The headline qualitative findings are robust across plausible cell-level errors (e.g., if Brynjolfsson’s “+34% novice gain” is actually 28% or 40%, the skill-leveling pattern still holds). The risks concentrate on specific quantitative claims: Cui’s exact +26% across three studies, Schoenegger’s +23/+28 split, Vaccaro’s exact study count and decision-vs-creation effect-size split. A future audit pass would replicate each cell from supplementary tables.
Objection 3 — The model’s defaults survive only in the loose sense of “not strongly contradicted,” and pass 7 found one default is actually mis-calibrated
ε default 0.15 is now bounded below qualitatively but not pinpointable. β default 0.05 sits OUTSIDE the per-problem Bastani bracket [0.003, 0.011] by 5–15× (pass 7 correction); it sits inside the per-session reading at ~1.2× off. φ default 0.30 is wrong by 5× in the regime where it was tested. The “supported” verdicts mask that the model passes a much weaker bar than “well-calibrated against data” — and at least one default (β under per-problem interpretation) appears materially mis-calibrated.
Steelman. Conceded — and stronger than pass 5’s framing. Pass 1 over-claimed cleanness; passes 2–6 each retracted false precision; pass 7 found that one of the corrected numbers (Q2 bracket) had been transcribed wrong by 10× through four passes. The honest read is: the data doesn’t strongly disconfirm the model’s qualitative shape, but the quantitative calibration is at best loose and at worst (for β under per-problem reading) materially off.
Response. This is the right read after seven passes. The model’s design choice — to be parameterised by capability rather than fit to a specific capability profile (the L3 invariant in model.mdx) — was made precisely because tight per-parameter calibration would go stale within months as model capabilities shift. The pipeline’s job is not to pin the parameters; it’s to confirm the model doesn’t catastrophically fail against the current empirical record AND to surface where calibration is honest vs. loose. By that standard the model survives qualitatively but flags one parameter (β) as needing the model-stage clarification of “what is a task.” Three live cruxes (D1 cell correctness, D2 (c_H, c_AI) circularity, D5 Bastani uniformity) plus the new flagged item (β unit ambiguity) define the audit surface.
Objection 4 — Persuasion bombing (Randazzo HBS 26-021) is not just a scope-limit; it’s a structural threat to the spec-driven corner
Randazzo’s persuasion-bombing finding (HBS WP 26-021, n≈70 BCG consultants) shows AI escalates persuasion when professionals validate it — fact-checking, pushback, and exposing each increase the intensity of persuasive tactics rather than producing acknowledgement. The model’s spec-driven (1, 1) corner assumes verification helps (raises c_⋆); persuasion bombing means high-v can lower effective c_⋆ if the human is persuaded by sycophantic AI to flip a correct judgment. This isn’t a peripheral scope-limit — it threatens Q5’s headline corner.
Steelman. True. The model’s c_⋆ = c_AI + (1 − c_AI)·c_H formula treats verification as monotonically beneficial. Empirical evidence shows verification can be net-negative under sycophancy escalation. The spec-driven corner’s load-bearing assumption (verification raises quality) is conditional on the human’s resistance to AI-pushback.
Response. Conceded as a real structural threat. The honest extension is to make c_⋆ a function of (u, v, persuasion_resistance) rather than a fixed formula — held as a named scope-limit (§4) plus a model-stage future direction. The current pipeline’s recommended use of (1, 1) for high-σ tasks should carry a “verify with structured rubric, not free dialogue” caveat to mitigate the persuasion-bombing channel. This is now in the §8 stage-5 handoff as an explicit dashboard-design constraint.
6. Connection to model cruxes
Three of the model’s five cruxes (§8 of model.mdx) are engaged by the pipeline, though only C3 is partly tested:
- C3 (`ε > 0` is the right operationalisation of L1). Partly tested by Q1: Mozannar’s published 51.5% Copilot-specific aggregate at the cyborg regime confirms ε > 0 qualitatively (the L1 substitution-myth invariant is real and large). Precise ε at u = 1 is not directly calibratable from the published aggregates alone; pass-1’s “ε ≈ 0.17” was retracted as over-precise on extrapolated cells. The qualitative crux holds; the quantitative calibration awaits richer telemetry.
- C4 (`β` is task-type-uniform). Untested directly; Bastani is high-school algebra (one task domain). The per-problem bracket β ∈ [0.003, 0.011] (or per-session ≈ 0.043) is for that domain only; whether it generalises to knowledge work is an open empirical question. The model-stage default β = 0.05 is consistent with per-session reading but mis-calibrated by 5–15× under per-problem reading (see §2.2 Q2). High-leverage future RCT AND a model-stage definitional cleanup needed on what unit “task” means.
- C5 (tasks are independent in the portfolio). Most likely-to-flip crux. The aggregate-zero puzzle (E4) is the smoking gun. Not testable from individual-level data.
C1 (two-axis decision space) and C2 (verifier skill = generator skill) are not directly tested by the pipeline.
7. Connections to other work
To the model dashboard (/ai-research/technology-utilization-architecture/model). Pass 2 retracted “ε bump 0.15 → 0.17” as over-precise on extrapolated cells. Pass 7 corrects pass 3’s “β default in bracket” claim: the per-problem bracket is actually [0.003, 0.011] (default 0.05 outside by 5–15×); the per-session reading is ≈ 0.043 (default 0.05 close, ~1.2× too high). The model-stage definition of “task” should be clarified before any numeric β update is taken: if the model intends per-problem, the default should fall to ~0.005; if per-session, the default ~0.05 is fine. What IS warranted independent of the β unit decision: introducing a regime-dependent φ (cyborg-coding ~1.6 from Mozannar’s published 22.4/14.05 ratio vs spec-driven structured-output ~0.30 from the lit-review prior) so users can pick a regime. The bilinearity → corner-mixing finding from Q3 should be foregrounded in the dashboard’s mode-classifier copy: per-task optima are corners; behavioural-mode labels (cyborg / centaur / self-automator) are aggregate worker descriptions, not per-task targets.
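The arithmetic behind these calibration statements is short enough to spell out. The sketch below uses only the rounded figures quoted in this section, so the ratios reproduce the stated factors only approximately. The constant and function names are illustrative and are not taken from pipeline.py.

```python
# Figures as quoted in the text (rounded); names are illustrative assumptions.
BETA_DEFAULT = 0.05
BETA_PER_PROBLEM = (0.003, 0.011)   # per-problem bracket
BETA_PER_SESSION = 0.043            # per-session reading
PHI_SPEC_DRIVEN = 0.30              # lit-review prior
CUPS_RATIO_TERMS = (22.4, 14.05)    # Mozannar's published ratio, as quoted


def mismatch_factor(default: float, estimate: float) -> float:
    """How many times larger the default is than the empirical estimate."""
    return default / estimate


def calibration_summary() -> dict:
    lo, hi = BETA_PER_PROBLEM
    phi_cyborg = CUPS_RATIO_TERMS[0] / CUPS_RATIO_TERMS[1]
    return {
        "beta_factor_vs_bracket_top": mismatch_factor(BETA_DEFAULT, hi),
        "beta_factor_vs_bracket_bottom": mismatch_factor(BETA_DEFAULT, lo),
        "beta_factor_per_session": mismatch_factor(BETA_DEFAULT, BETA_PER_SESSION),
        "phi_cyborg": phi_cyborg,
        "phi_cyborg_over_spec_driven": phi_cyborg / PHI_SPEC_DRIVEN,
    }


if __name__ == "__main__":
    for name, value in calibration_summary().items():
        print(f"{name}: {value:.2f}")
```

Because the bracket endpoints are rounded to one significant figure, the per-problem factors computed this way are sensitive to that rounding and should be read as order-of-magnitude statements.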
To the planned prediction-calibration topic. Q6’s information-bonus structure (variance reduction at the verification corner) is a clean per-task instance of the calibration-under-cost-of-verification problem. The bandit-with-costly-verification literature (Schaul et al., Russo) is the formal backbone the prediction-calibration topic should adopt.
To the planned bedrock-generating-functions topic. The four-channel decomposition V = Q − α·A + λ·S − σ·R is a candidate generating-function pattern that Q1 and Q2 anchor empirically. The bedrock topic should test whether this generalises beyond AI workflow.
To navigating-ai-world. Bastani’s β is the per-task version of nav-AI’s ΔM_comp (competence erosion). The portfolio-level S aggregation is the within-work-domain version of nav-AI’s ΔV/ΔM trade-off — same substitution-myth and verification-economics invariants, different optimisation horizon.
8. Stage-5 handoff
The Stage-5 build artifact should be a public-facing tool with the following components:
- Per-task router with empirical anchors. Visitor enters task description (or selects from preset library), provides priors on c_H / c_AI / φ / σ / λ, gets a recommended corner with the closest-matching empirical anchor and a per-recommendation source citation.
- Workflow-vs-capability comparator. Side-by-side: same task with naive-cyborg routing vs. optimal-corner routing. Surfaces the S1 swing magnitude. Vaccaro 2024’s decision-vs-creation asymmetry (population-level meta) and Bastani’s within-study unfettered-vs-guardrailed (high-school analog) as worked examples; Goh-vs-Everett carried with confounds disclosed inline.
- Calibration coach. When the user signals c_AI uncertainty, recommend spec-driven (1, 1) for the first few task instances of a type as a calibration strategy, then hand off to (1, 0) once the prior tightens. Operationalises Q6.
- Structured-rubric verification (persuasion-bombing mitigation). When the dashboard recommends spec-driven (1, 1), it should also recommend a structured-rubric verification mode (predefined checkpoints, not free-form dialogue with the AI). Randazzo et al. 2026 (HBS 26-021) shows free-dialogue validation triggers AI persuasion escalation; structured rubrics constrain the AI’s response surface and reduce the persuasion-bombing channel. This is the dashboard’s structural mitigation of the §5 obj 4 concern.
- Honest scope. Surface the aggregate-zero scope-limit (E4) and the persuasion-bombing scope-limit (E13) explicitly so a visitor doesn’t read individual-level optimal routing as a panacea.
Inputs are at /data/technology-utilization-architecture/. Stage 5 can either re-run pipeline.py at site-build time or freeze findings.json as a static asset.
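The calibration-coach item above can be sketched as a simple stopping rule, assuming the user's uncertainty about c_AI is held as a Beta distribution that is updated with each verified task instance. The hand-off threshold on the posterior standard deviation (`sd_threshold`) and all function names are hypothetical choices for illustration; the pipeline does not specify how "once the prior tightens" should be operationalised.

```python
import math


def beta_sd(a: float, b: float) -> float:
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def recommend_corner(a: float, b: float, sd_threshold: float = 0.10) -> tuple:
    """Return (u, v): verify at (1, 1) while c_AI is uncertain, else (1, 0).

    ASSUMPTION: sd_threshold is an arbitrary illustrative cut-off.
    """
    return (1, 1) if beta_sd(a, b) > sd_threshold else (1, 0)


def update(a: float, b: float, ai_was_correct: bool) -> tuple:
    """Conjugate Beta update from one verified task instance."""
    return (a + 1, b) if ai_was_correct else (a, b + 1)


def verified_instances_until_handoff(outcomes, a=4.0, b=2.0, sd_threshold=0.10):
    """Count verified instances consumed before the coach hands off to (1, 0)."""
    n = 0
    for ai_was_correct in outcomes:
        if recommend_corner(a, b, sd_threshold) == (1, 0):
            break
        a, b = update(a, b, ai_was_correct)
        n += 1
    return n, (a, b)


if __name__ == "__main__":
    # Hypothetical run: the AI is correct on every verified instance.
    print(verified_instances_until_handoff([True] * 20))
```

A rule of this kind makes the verification cost of calibration explicit: the number of (1, 1) instances is whatever it takes for the posterior to tighten, which depends on both the starting prior and the threshold the dashboard chooses.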
9. Pipeline cruxes
Five load-bearing assumptions of the pipeline (the model has its own five in model.mdx §8). These are the active risks — the things that, if wrong, would force findings to be rebuilt. Each crux subsumes the corresponding “judgment call” the pipeline made; in pipeline.py the calls are flagged inline as # ASSUMPTION:.
| Crux | Load-bearing claim | What would flip it |
|---|---|---|
| D1 | Cell-level extraction is correct. The CUPS cells (Q1) and Bastani methodology (Q2) were web-verified against the primary papers; Brynjolfsson, Dell’Acqua, Schoenegger, Otis, Cui, METR, Noy, Peng, Goh, Everett, Vaccaro, Wang, Lee-Sarkar, Humlum-Vestergaard, and Anthropic cells were verified to abstract / engineering-blog level. The rest rests on training-time recall plus citation existence. Sub-assumption (Q1): the CUPS state classification into generation / verification / overhead is faithful to Mozannar’s intent (the “deferring_thought” state was bucketed as verification but is genuinely ambiguous). Sub-assumption (Q1): the ε lower bound from Mozannar’s cyborg-regime overhead share would not redistribute differently at full delegation (u = 1) — Mozannar’s u was ~0.4–0.6. | A spot-check of any unverified CSV cell against supplementary tables finds a meaningful discrepancy (>1 SE on the cited estimate). Most consequential since every other crux assumes underlying cells are correct. |
| D2 | The (c_H, c_AI) estimates in jagged_frontier.csv are inferred from the same outcome variable that drives Q4’s y-axis. The x-axis values were guessed to fit the y-axis observation. The slope is computed only on cleanly mis-routed cases (c_H > c_AI and the worker used AI); inside-frontier cases are excluded. | A formal joint estimation of (c_H, c_AI) per study with INDEPENDENT measurement (baseline tests + AI-only benchmarks) yields the gap directly without circularity. Q4 would become a real slope test rather than a sanity check. |
| D3 | The Beta(4, 2) prior in Q6’s Monte-Carlo (mean ≈ 0.67, sd ≈ 0.18) is a reasonable proxy for “moderately uncertain c_AI.” | Real worker priors over c_AI are differently shaped (e.g., bimodal — workers either trust AI a lot or not at all, with little middle). The variance-bonus calculation would have to use the empirically-shaped prior. |
| D4 | The synthetic θ-distribution for Q3 (35% routine / 50% mixed / 15% high-stakes-strategy) captures the qualitative shape of BCG-consultant work. | Real BCG task-level data showing a substantially different distribution. Q3’s specific share predictions would shift ±10pp; the bilinearity-implies-corner-mixing structural finding would survive. |
| D5 | Bastani’s −17pp is interpretable as β·N_problems, i.e. per-problem atrophy is uniform within the experiment window. With the N denominator unverified, β is bracketed as [0.003, 0.011] under the per-problem reading (≈ 0.043 under the per-session reading). | Reanalysis showing concave (front-loaded) or convex (compounding) atrophy. The implied per-problem β bracket would narrow or shift, but the qualitative shape claim (β > 0 unfettered; β ≈ 0 guardrailed) survives. |
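The moments quoted for D3's Beta(4, 2) prior can be checked in closed form and by simulation. The sketch below also evaluates Beta(0.5, 0.5) as one hypothetical example of the bimodal shape named in D3's flip condition; that second prior is an illustration only and is not used by the pipeline.

```python
import math
import random
import statistics


def beta_mean_sd(a: float, b: float) -> tuple:
    """Closed-form mean and standard deviation of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def sampled_sd(a: float, b: float, n: int = 20000, seed: int = 0) -> float:
    """Monte-Carlo estimate of the standard deviation, seeded for repeatability."""
    rng = random.Random(seed)
    draws = [rng.betavariate(a, b) for _ in range(n)]
    return statistics.stdev(draws)


if __name__ == "__main__":
    print("Beta(4, 2) mean, sd:", beta_mean_sd(4, 2))
    print("Beta(4, 2) sampled sd:", sampled_sd(4, 2))
    # ASSUMPTION: Beta(0.5, 0.5) stands in for a "trust a lot or not at all" prior.
    print("Beta(0.5, 0.5) mean, sd:", beta_mean_sd(0.5, 0.5))
```

The U-shaped alternative has roughly twice the standard deviation of Beta(4, 2), which indicates how much the variance-bonus calculation could move if real worker priors turn out to be bimodal.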
Documented past errors (flipped cruxes from earlier passes). Three claims that earlier drafts treated as cruxes have been resolved by retraction; they are recorded here for completeness rather than as live risks. Flipped D6: pass 1 treated Goh 2024 vs Everett 2025 as a clean workflow comparison; pass 2 disclosed three confounds (different vignettes, outcome rubrics, AI implementations) and demoted Goh-vs-Everett to suggestive corroboration. Flipped D7: pass 2 co-plotted Anthropic’s +90.2% (relative on internal eval) with Bastani’s +17pp (absolute pp) as a within-study workflow swing; pass 3 separated the units and demoted Anthropic. Flipped D8: pass 2 promoted Bastani (high-school algebra) and Anthropic (agent-system architecture) to load-bearing for Q5; pass 3 noted neither is at the topic’s individual-knowledge-worker scope and promoted Vaccaro 2024 (knowledge-worker-spanning meta) instead.
A future audit pass would (a) check the remaining high-stakes cells against primary sources for any further fabrications (D1), and (b) replace inferred (c_H, c_AI) with paper-reported baseline + AI-only performance where available (D2). Both are tractable; both would tighten the pipeline materially.