Data pass 9

Data

Empirical pipeline confronting the model's six fitting targets (Q1–Q6) with currently-published RCT and field-experiment numbers from ~22 studies. Headline findings: cyborg-coding φ ≈ 1.6 (5× the model default 0.30); β ∈ [0.003, 0.011] per-problem or ≈ 0.043 per-session from Bastani, with default 0.05 outside the per-problem bracket but close to the per-session value; bilinearity-implies-corner-mixing as a structural prediction; outside-frontier sanity check passes; Vaccaro 2024 meta (106 studies, 370 effects in Nature Human Behaviour) is the load-bearing evidence for the workflow > capability claim at the topic's scope. Curated CSVs (downloadable) + Python pipeline + interactive findings panel. Refinement history in frontmatter log.

TLDR

The model formalisation in stage 3 produced six named fitting targets — Q1 through Q6 — that translate the value function V(u, v; θ) = Q − α·A + λ·S − σ·R into questions the empirical record can answer. This stage confronts each target with currently-published RCT and field-experiment numbers from ~22 studies (2023–2026).

Headline findings. Verification cost is much higher than the model assumed in coding regimes (φ ≈ 1.6 from Mozannar’s CUPS data, vs the model’s lit-review-anchored default of 0.30). Skill atrophy from unverified AI delegation is real, but its magnitude calibration depends on what the model means by “task” — Bastani’s −17pp unassisted-test drop gives β ∈ [0.003, 0.011] under per-problem interpretation (default 0.05 is outside this bracket by 5–15×) or β ≈ 0.043 under per-session interpretation (default 0.05 is ~1.2× too high but inside the right neighborhood). Pass-7 corrected a 10× transcription error (passes 3–6 reported [0.028, 0.113] which was wrong — pipeline always computed [0.003, 0.011] under per-problem reading). The bilinearity result of stage 3 (per-task optima are corners, never interior) is consistent with Randazzo’s behavioural-mode distribution but not directly testable from published aggregates. Outside-frontier mis-routing produces quality drops on the order the model predicts (Dell’Acqua −19pp, METR −19%, Otis low-baseline −8%). For the headline S1 claim — workflow architecture > model capability — the load-bearing evidence is Vaccaro et al. 2024 (Nature Human Behaviour, 106 studies / 370 effect sizes), whose decision-vs-creation asymmetry is consistent with the model’s qualitative prediction (workflow choice matters more for high-σ decision tasks). Vaccaro is a moderator analysis at population scale, not a clean fit; within-study analogs (Bastani, Anthropic) and across-study comparisons (Goh→Everett +7.9pp) corroborate with scope and unit mismatches disclosed.

Verdict tally. One strong qualitative finding (Q1: Mozannar’s published 51.5% Copilot-specific share confirms the L1 substitution-myth invariant; cyborg-regime φ is much higher than the default). One supported in direction and shape with bracketed magnitude (Q2: Bastani β). Three structural/convergent/consistency claims (Q3 corner-mixing predicted by bilinearity; Q4 outside-frontier sanity check; Q5 workflow > capability via the Vaccaro meta). One framed-not-resolved by design (Q6 calibration / explore-exploit on c_AI: the model treats c_AI as known, but a Monte-Carlo on uncertain c_AI shows the spec-driven corner's outcome SD is ~64% lower than the self-automator's, which is the structural backbone for a future extension).

Pipeline architecture. Eight curated CSVs, one runnable Python script (pandas + numpy, ~280 lines), and a chart-ready findings.json consumed by the React panel below. Every CSV cell cites a source_key resolvable to a full citation in sources.csv. Inputs are downloadable at /data/technology-utilization-architecture/. To reproduce: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.

What the pipeline does not do. It does not produce new RCT data, analyse raw telemetry, test the aggregate-zero puzzle (E4/O2 — Humlum-Vestergaard’s zero is organisational, this model is individual-level by the C5 crux and §4 scope-limit), resolve persuasion bombing as a quality-degrader (E13), or formalise frontier migration over time (O4). These are explicit non-deliveries. What it gives stage 5 is six numerically anchored predictions with verdicts and evidence, plus a concrete tool target.

The pipeline went through multiple refinement passes; the retractions cited in this document run through pass 7. Successive passes uncovered and corrected: pass-1 false precision computed off extrapolated CUPS cells (Q1), an internal contradiction in Q3, a circular slope test in Q4, and a load-bearing claim in Q5 that bundled three confounds. Pass 3 also caught a fabricated N denominator in Q2 and a unit / scope mismatch in pass-2’s Q5 promotion. Pass 7 corrected a 10× transcription error in the Q2 β bracket. Full retraction history is in the frontmatter refinementLog. The body below presents the corrected findings cleanly; readers wanting the audit trail can read the log.

The productivity record (~22 RCTs and field experiments, 2023–2026): evidence base

The empirical context for S1 (workflow architecture > model capability). 22 study rows. Pass-5 disclosure: rows mix four unit classes — flow-rate productivity (Brynjolfsson, Cui, Peng, Otis, METR, Humlum), stock-quality score lifts (Noy quality, Dell'Acqua inside, Bastani in-session, Schoenegger), absolute percentage-point swings (Goh, Everett, Dell'Acqua outside, Bastani post-test), and one relative-eval-score outlier (Anthropic +90.2%). Magnitudes within a class are directly comparable; magnitudes across classes are not (a +14% productivity gain and a +14pp test-score swing measure different things). The chart marks each row's unit class to make the comparison visible. Sienna = positive; soft-sienna = negative (METR, Otis low-baseline, Dell'Acqua outside, Bastani post-test). Humlum-Vestergaard's aggregate zero is the individual-vs-organizational scope-limit.

Unit classes: %/rate = flow-rate productivity; %/quality = stock-quality score lift; pp = percentage-point swing; rel % = relative on internal eval.

Studies cited: 22. Spans 2023–2026 RCTs and field experiments.

Largest novice gain: +34%. Brynjolfsson 2023 customer-support agent novices (rate).

Largest negative: −19%. METR real-repo experts (rate) and Dell'Acqua outside-frontier (pp).

Aggregate zero: 0% (CI ±1%). Humlum-Vestergaard 25k Danish workers; the scope-limit.

Read the four red bars: when the workflow doesn't fit the task structure, AI-augmented work goes worse than no AI. METR experts in real repos (−19%), Otis low-baseline picking too-hard tasks (−8%), Dell'Acqua outside-frontier (−19pp), Bastani unfettered post-test (−17pp). All four are explained by the same model mechanism: mis-routing to u > 0 when c_H > c_AI, or mis-routing to v = 0 when σ·(1 − c_AI) is large. The four mis-routed cases are not separate failures; they are one failure with four faces.

How to read this stage

The findings panel above is the artifact. Everything below is the spec.

Start with the Productivity record (S1) tab — that’s the empirical context: 22 studies on the same axis (% effect of AI on output), with the four mis-routed cases as red bars and the Humlum-Vestergaard aggregate-zero at the bottom. Then click through Q1–Q6: each tab shows the model’s prediction, the empirical anchor, and the verdict, with a chart that makes the comparison visible.

A few terms (defined again here so the data stage stands alone):

u — autonomy level, fraction of a task delegated to AI.
v — verification depth, fraction of AI output independently checked.
c_H, c_AI — human and AI capability (probability of correct output).
φ — verification-cost ratio (verify-time / generate-time).
σ — stakes (weight on uncaught-error penalty).
λ — skill-formation value (how much the worker cares about preserving this skill).
β — per-task skill-atrophy rate under unverified delegation.
ε — residual attention at full delegation (the L1 substitution-myth invariant).
corner — the (u, v) optimum from argmax V(u, v; θ) on the unit square; the three viable corners are (0, 0) do-yourself, (1, 0) self-automator, (1, 1) spec-driven.

1. Pipeline architecture

1.1 Inputs (curated)

Eight CSVs in stage_outputs/technology-utilization-architecture/data/ (also at /data/technology-utilization-architecture/):

File	Rows	Purpose
`sources.csv`	24	Full citations for every paper cited in any cell — the audit trail
`productivity_rcts.csv`	22	Headline numbers from the broader RCT record; the empirical context for S1
`cups_time_fractions.csv`	10	Mozannar 2024 CUPS time-shares per programmer-Copilot interaction state
`bastani_longitudinal.csv`	3	Per-condition skill-atrophy fit from Bastani PNAS 2025
`mode_distribution.csv`	3	Randazzo 2026 cyborg / centaur / self-automator empirical shares
`jagged_frontier.csv`	12	(c_H, c_AI) estimates and observed quality changes for each anchor
`workflow_vs_capability.csv`	10	Within-domain workflow comparisons holding model class roughly constant
`calibration_evidence.csv`	9	Findings on c_AI miscalibration; the Q6 literature anchor set

Each row in each CSV cites a primary source (column source_key). No row contains a value that doesn’t trace to a published paper. The sources.csv resolves every key to a full citation + URL.
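The traceability rule above (every row's source_key resolves in sources.csv) is easy to check mechanically. The following is a minimal sketch, not the pipeline's actual code: only the column name source_key comes from the text, while the other column names, the helper unresolved_keys, and every cell value are hypothetical stand-ins held in memory so the example runs without the real files.

```python
import csv
import io

# Hypothetical miniature stand-ins for sources.csv and one data CSV.
# Only the `source_key` column name is taken from the write-up; all other
# column names and all values here are illustrative.
SOURCES_CSV = """source_key,citation
example_a,Example citation A
example_b,Example citation B
"""

DATA_CSV = """study,effect,source_key
row one,0.10,example_a
row two,-0.05,example_b
row three,0.02,example_c
"""


def unresolved_keys(data_text: str, sources_text: str) -> list[str]:
    """Return source_key values used in the data that sources.csv cannot resolve."""
    known = {row["source_key"] for row in csv.DictReader(io.StringIO(sources_text))}
    missing: list[str] = []
    for row in csv.DictReader(io.StringIO(data_text)):
        key = row["source_key"]
        if key not in known and key not in missing:
            missing.append(key)
    return missing


missing = unresolved_keys(DATA_CSV, SOURCES_CSV)
print(missing)  # the deliberately unresolvable key is reported
```

Run against the real files, an empty list would be the machine-checkable form of "no row contains a value that doesn't trace to a published paper"; it says nothing about whether the cited number itself is transcribed correctly.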

1.2 Derived outputs (computed)

The Python script (pipeline.py) reads the inputs and writes to data/out/:

File	Purpose
`findings.json`	Chart-ready JSON consumed by the React findings panel
`findings_table.md`	Per-Q verdict table
`bastani_atrophy_fit.csv`	Per-condition implied β

1.3 Dependencies and reproducibility

pandas, numpy. No web fetches. No external services. Runs in under 1 second on a laptop. To reproduce the entire pipeline: cd stage_outputs/technology-utilization-architecture/data && python pipeline.py.

2. Six questions, six tests

2.1 Q1 — ε and φ from CUPS telemetry

Model claim. ε = 0.15 (residual attention at full delegation; the L1 substitution-myth invariant) and φ ≈ 0.30 (verification cost as fraction of generation time).

Test. Aggregate Mozannar 2024’s published CUPS time-shares and compute implied φ for the cyborg coding regime.

Result — supported qualitatively; φ is the headline. Mozannar’s published aggregates (verified from Figure 5(b)):

CUPS aggregate	Time share	SD
Total Copilot-specific (verify + defer + wait + prompt + edit)	51.5%	19.3
Thinking/verifying suggestion	22.4%	12.97
Writing new functionality	14.05%	8.36
Waiting for suggestion	4.2%	4.46

The L1 substitution-myth invariant is strongly confirmed: 51.5% of session time is Copilot-specific even though Copilot is doing the generation. AI-related work consumes more than half of total session time. Cyborg-regime φ ≈ 22.4 / 14.05 ≈ 1.59 — about 5× the model’s lit-review prior of 0.30. Coding cyborg work is dramatically more verification-heavy than the default assumes. The natural model update is regime-dependent φ: cyborg-coding ~1.5; spec-driven structured-output ~0.3. The stage-5 dashboard should let the user pick a regime.
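The φ estimate is a single ratio of two published time shares, so it can be reproduced in a few lines. This sketch uses only the figures quoted in the table above and follows the text in treating "Writing new functionality" as the generate-time denominator; the variable names are illustrative.

```python
# Time shares (percent of session time) as quoted in the CUPS table above.
verify_share = 22.4    # "Thinking/verifying suggestion"
write_share = 14.05    # "Writing new functionality", used as the generate-time proxy
phi_default = 0.30     # the model's lit-review prior for phi

# phi is defined in this write-up as verify-time / generate-time.
phi_cyborg = verify_share / write_share
ratio_to_default = phi_cyborg / phi_default

print(f"phi (cyborg coding) = {phi_cyborg:.2f}")
print(f"ratio to default    = {ratio_to_default:.1f}x")
```

The ratio inherits the large standard deviations reported for both shares, so it is better read as a regime-level order of magnitude than as a point estimate.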

What’s not sharply calibratable from published aggregates: ε at full delegation. Mozannar’s study runs at u ≈ 0.4–0.6; the model’s ε is the residual at u = 1, and the granular wait/monitor/prompt split that would pin it is not separately reported.

2.2 Q2 — β from Bastani longitudinal panel

Model claim. β = 0.05 per task at u = 1, v = 0 — the per-task atrophy rate under unverified AI delegation.

Test. Compute implied β per Bastani 2025 condition. Design is four 90-min sessions (teacher review → assisted practice → unassisted 30-min exam) at a Turkish high school.

Result — direction and shape supported; magnitude is unit-dependent. Bastani’s −17pp unassisted-test drop gives different β estimates depending on what “task” means in the model’s S(u, v) = (1 − u) − β·u·(1 − v) formula:

Interpretation of “task”	N	Implied β	Default 0.05 vs bracket
Per-problem (one (u, v) decision per practice problem; N not publicly stated)	15–60	[0.003, 0.011]	OUTSIDE by 5–15× (default too high)
Per-session (one decision per 90-min session)	4	≈ 0.043	INSIDE neighborhood (~1.2× default)

The model’s default 0.05 is consistent with a per-session interpretation but 5–15× too high for a per-problem interpretation. This is a definitional ambiguity in the model’s “task” unit, not a clear calibration win or loss. Reading model.mdx carefully, “task” is described as a unit at which a user makes a single (u, v) routing decision — for Bastani’s students, that maps more naturally to per-problem than per-session, in which case the model’s default is mis-calibrated by an order of magnitude. Action item for the next model-stage refinement pass: clarify whether β is per-problem or per-session, and re-anchor the default if needed.
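The two brackets in the table can be reproduced with one division each. The sketch below assumes that atrophy accumulates linearly, so that each task at (u, v) removes β·u·(1 − v) of skill and N tasks remove N times that; this is an assumption that happens to reproduce the quoted values, not a confirmed description of pipeline.py. The function name implied_beta is hypothetical.

```python
def implied_beta(total_drop: float, n_tasks: int, u: float = 1.0, v: float = 0.0) -> float:
    """Per-task atrophy rate if each task removes beta*u*(1-v) and losses add linearly."""
    return total_drop / (n_tasks * u * (1.0 - v))


DROP = 0.17  # Bastani's 17pp unassisted-test drop, expressed as a fraction

# Per-problem reading: N is not publicly stated, so the text brackets it at 15 to 60.
per_problem = sorted(implied_beta(DROP, n) for n in (15, 60))

# Per-session reading: four 90-minute sessions.
per_session = implied_beta(DROP, 4)

# The stale prose bracket was ten times the computed one.
stale_bracket = [10 * b for b in per_problem]

print([round(b, 3) for b in per_problem])
print(round(per_session, 4))
print([round(b, 3) for b in stale_bracket])
```

Because the only free quantity is N, the whole disagreement between the two readings is a disagreement about what counts as one task.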

What is robust independent of the unit choice: (a) DIRECTION — unfettered AI use causes measurable atrophy, guardrails eliminate it; (b) SHAPE — β·u·(1-v) form confirmed by the guardrailed condition recovering β ≈ 0 (atrophy proportional to UNVERIFIED delegation, eliminated when v = 1).

Pass-7 retraction. Passes 3–6 prose reported “β bracket [0.028, 0.113]; default 0.05 inside.” That was a 10× transcription error from pipeline.py’s actual computation of [0.003, 0.011]. The pipeline was correct throughout; the prose was wrong, and it propagated through four passes unchecked. Pass 7 corrects the bracket, splits it into per-problem and per-session readings, and discloses the unit ambiguity that pass 3 had glossed over.

Scope note. Bastani studies high-school students learning algebra, not professional knowledge work. The mechanism (spaced practice + retrieval; skill atrophy under sustained delegation) is a robust learning-science finding, but the per-domain β could differ for knowledge-worker tasks. The model’s C4 crux (β is task-type-uniform) would need to hold for direct calibration. Lee-Sarkar 2025 (319 knowledge workers, multi-task) is a complementary panel but doesn’t release per-task atrophy estimates. A knowledge-worker panel that does release them would be a high-leverage future RCT.

2.3 Q3 — Mode-distribution structure (Randazzo)

Model claim. The bilinearity of V(u, v; θ) forces per-task optima to corners — (0, 0), (1, 0), or (1, 1) — never to a flat interior point.

Test. Synthesise a θ-distribution loosely matching the BCG-consultant task mix; run optimal routing on N=2000 sampled tasks; check whether the per-task corner distribution is consistent with Randazzo 2026’s aggregate worker-mode counts (60% cyborg / 14% centaur / 27% self-automator on n≈244 BCG consultants).

Result — structural prediction, not directly testable. Synthesised per-task corners: 7.6% (0, 0) do-yourself, 51.6% (1, 0) self-automator, 40.7% (1, 1) spec-driven.

The honest reading. Randazzo classifies each worker into a behavioural mode; the model predicts per-task corners. The empirical 60/14/27 distribution is consistent with two different underlying behaviours:

(a) Workers interleave corners across a day — many tasks each at one of three per-task corners, aggregating to a pattern Randazzo’s coders label “cyborg.” This is what the model predicts.

(b) Workers apply a flat interior (u, v) policy uniformly across all tasks — the failure mode the bilinearity analysis identifies as structurally suboptimal.

Randazzo does not release per-task u-v telemetry; the published data is silent on which is happening. Q3 is therefore a structural prediction (corner-mixing CAN aggregate to a 60/14/27 behavioural pattern under reasonable θ priors) rather than a directly-testable empirical claim. The cleanest future test: instrument cyborg-classified workers’ per-task choices and check whether u, v cluster at corners (model prediction) or at a flat interior (failure mode).
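The corner claim rests on a general property of bilinear functions: a function of the form a + b·u + c·v + d·u·v is linear in u for fixed v and linear in v for fixed u, so its maximum over the unit square is always attained at a vertex. The sketch below checks this numerically for random coefficients. The coefficients are arbitrary and are not the model's θ; the check also covers all four vertices, whereas the model treats only three as viable.

```python
import itertools
import random


def bilinear(u: float, v: float, a: float, b: float, c: float, d: float) -> float:
    """Generic bilinear surface on the unit square (not the model's calibrated V)."""
    return a + b * u + c * v + d * u * v


rng = random.Random(0)
grid = [i / 20 for i in range(21)]                        # includes 0.0 and 1.0
corners = list(itertools.product((0.0, 1.0), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

corner_always_optimal = True
for _ in range(1000):
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    best_on_grid = max(bilinear(u, v, *coeffs) for u in grid for v in grid)
    best_on_corners = max(bilinear(u, v, *coeffs) for u, v in corners)
    # No interior or edge point should beat the best corner (small float tolerance).
    if best_on_grid > best_on_corners + 1e-12:
        corner_always_optimal = False

print(corner_always_optimal)
```

This only illustrates why a flat interior policy cannot be a strict per-task optimum under bilinearity; it does not address whether real workers behave that way, which is the untestable part.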

2.4 Q4 — Outside-frontier quality magnitude

Model claim. At the wrong corner — u > 0 when c_H > c_AI — quality drops by u·(c_H − c_AI). Linearity in u and (c_H − c_AI) is a sharp prediction.

Test. Across 12 anchor studies in jagged_frontier.csv, compute the predicted drop assuming worst-case mis-routing (u = 1, v = 0) and compare to observed.

Result — sanity check, consistent. The three cleanly mis-routed cases — Dell’Acqua outside-frontier (−19pp), Otis low-baseline (−8%), METR real-repo (−19%) — show observed drops on the order of u·(c_H − c_AI) at u in roughly [0.5, 1.0]. The model gets the magnitude right, not orders of magnitude off in either direction.

Why this is a sanity check rather than a slope test. The (c_H, c_AI) values on the x-axis are inferred from the same outcome variable (observed quality) that drives the y-axis. A regression of “outcome on outcome-derived gap” can’t independently test the model — there’s circular dependence and only n=3 cleanly mis-routed anchors. The descriptive slope can be computed but is not a meaningful estimate. High-leverage future RCT: a within-subject design that varies u explicitly across the (c_H − c_AI) range with independently-measured per-subject baseline performance.
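The circularity is easy to see by backing the capability gap out of the observed drops. The sketch below inverts the predicted drop u·(c_H − c_AI) for the three anchors over the u range quoted above. Treating the percent and percentage-point figures alike as plain fractions is a simplification made here for illustration, and implied_gap is a hypothetical helper.

```python
def implied_gap(observed_drop: float, u: float) -> float:
    """Capability gap (c_H - c_AI) that makes u * gap equal the observed drop."""
    return observed_drop / u


# Observed drops as quoted in the text, written as positive fractions.
observed = {
    "Dell'Acqua outside-frontier": 0.19,
    "METR real-repo": 0.19,
    "Otis low-baseline": 0.08,
}

U_RANGE = (1.0, 0.5)  # full delegation down to the lower end of the quoted range

gaps = {name: tuple(implied_gap(drop, u) for u in U_RANGE) for name, drop in observed.items()}
for name, (lo, hi) in gaps.items():
    print(f"{name}: implied c_H - c_AI between {lo:.2f} and {hi:.2f}")

# Plugging an implied gap back into u * gap returns the observed drop by construction.
round_trip = 0.5 * gaps["Otis low-baseline"][1]
```

The round trip at the end recovers the input exactly, which is why a regression on gaps derived this way cannot count as an independent test.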

2.5 Q5 — Workflow architecture > model capability (the headline S1)

Model claim. Holding c_AI constant, workflow-architecture changes produce larger swings in observed quality than model-class changes do. The headline integration of L2 + L3 + S1 from the topology.

Test. Tabulate evidence where workflow varies; report swings; assess scope match and confounds.

Result — supported, with the meta-analysis load-bearing.

Load-bearing evidence — population-level meta. Vaccaro et al. 2024 (Nature Human Behaviour) — 106 studies, 370 effect sizes, spanning knowledge-worker domains. The headline finding: human–AI combinations on average perform significantly worse than the best of humans or AI alone, with substantial heterogeneity — losses concentrated in decision-making tasks and gains concentrated in content creation. The decision-vs-creation asymmetry is consistent with the model’s qualitative prediction that workflow choice matters more for high-σ decision tasks (where naive workflows can underperform either agent alone, and only the spec-driven (1, 1) corner captures complementarity) than for low-σ content tasks. Caveat on the strength of the evidence. Vaccaro’s split is a moderator analysis, not a clean test of the model’s specific prediction — multiple human-AI cooperation models would predict some form of decision-vs-creation asymmetry. What the meta does establish at population scale is that complementarity is not automatic (the on-average finding) and that something about task structure systematically modulates whether it is achieved (the moderator finding) — both signatures S1 needs to be true.

Scope-adjacent within-study analogs (units differ — read carefully).

Comparison	Design	Swing	Units	Scope match
Bastani unfettered → guardrailed	same RCT, same students, same model, same task set	+17 pp	absolute pp on within-subject retest	LOW — high-school algebra learners, not knowledge work; generalises via the spaced-practice/atrophy mechanism only
Single-agent → multi-agent (Anthropic)	same internal eval, same base model class	+90.2%	RELATIVE % on internal research eval (NOT pp; absolute baseline not disclosed)	LOW — agent-system architecture is engineering tool design, not individual workflow choice

Suggestive across-study evidence (with confounds disclosed).

Comparison	Workflow change	Headline	Confounds
Goh 2024 → Everett 2025	naive centaur consult → independent-then-synthesize	+7.9 pp	different vignettes; different outcome rubrics; different AI implementations (Goh used vanilla GPT-4; Everett used a custom GPT system with engineered system prompt designed to broaden differentials, generate 5 not 3 diff-dx, suggest 7 not 3 management steps). The +7.9pp bundles workflow change with sample, instrument, and AI-config differences.

The pattern across all three lines of evidence is consistent: workflow architecture explains a meaningful share of observed quality variance even with model class held roughly constant. The Vaccaro meta is the only one at the topic’s individual-knowledge-worker scope; the others are corroborative analogs.

2.6 Q6 — Calibration / explore-exploit on c_AI

Model claim. The model treats c_AI as known. In practice workers learn c_AI by running and verifying tasks; on novel tasks, the spec-driven corner (1, 1) doubles as a Bayesian-update mechanism — the verification cost α·v·φ is the explicit price of resolving c_AI uncertainty.

Test. Acknowledged in §11 of the model as not literature-replicable. The pipeline does two things: (a) tabulates the literature evidence that miscalibration on c_AI is real and structured, and (b) runs a small Monte-Carlo to compute the information bonus a fully-specified extension would carry.

Result: framed-not-resolved. Monte-Carlo (c_AI ~ Beta(4, 2), N=2000, default θ): spec-driven (1, 1) has SD = 0.088, self-automator (1, 0) has SD = 0.246, so the spread of outcomes at the spec-driven corner is about 64% lower in SD terms (about 87% lower in variance terms) under c_AI uncertainty. The variance reduction (~0.05 in absolute terms) is a proxy for the information bonus a fully-specified extension would credit to verification under uncertainty: not just a cost, but a learning operation. The literature evidence is consistent: Lee & Sarkar 2025 (n=319), Wang et al. 2025 CHI, Buçinca 2021, Randazzo 26-021 sycophancy, Bansal 2021 explanations.
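A sketch of how such a Monte-Carlo can be set up, together with the arithmetic on the two reported SDs. The payoff form in outcome, the values C_H and SIGMA, and the seed are all assumptions made for illustration; the model's default θ is not reproduced here, so the simulated SDs are not expected to match 0.088 and 0.246. The verified-capability formula c_AI + (1 − c_AI)·c_H is the one the write-up quotes for c_⋆. Verification cost is omitted because a constant cost at a corner does not change the spread.

```python
import random
import statistics

# Arithmetic on the SDs reported in the text.
SD_SPEC, SD_AUTO = 0.088, 0.246
reported_sd_reduction = 1.0 - SD_SPEC / SD_AUTO            # about 0.64
reported_var_reduction = 1.0 - (SD_SPEC / SD_AUTO) ** 2    # about 0.87
reported_var_gap = SD_AUTO ** 2 - SD_SPEC ** 2             # about 0.05


def outcome(c_eff: float, sigma: float) -> float:
    """Assumed payoff: quality c_eff minus a stakes-weighted uncaught-error rate."""
    return c_eff - sigma * (1.0 - c_eff)


rng = random.Random(42)
C_H = 0.6      # hypothetical human capability
SIGMA = 0.5    # hypothetical stakes weight
N = 2000

draws = [rng.betavariate(4, 2) for _ in range(N)]                    # uncertain c_AI
self_automator = [outcome(c, SIGMA) for c in draws]                  # corner (1, 0)
spec_driven = [outcome(c + (1.0 - c) * C_H, SIGMA) for c in draws]   # corner (1, 1)

sd_auto = statistics.pstdev(self_automator)
sd_spec = statistics.pstdev(spec_driven)
sd_reduction = 1.0 - sd_spec / sd_auto

print(f"reported: SD -{reported_sd_reduction:.0%}, variance -{reported_var_reduction:.0%}")
print(f"sketch:   SD -{sd_reduction:.0%}")
```

Under this assumed linear payoff the SD reduction equals C_H exactly, since verification rescales the c_AI spread by (1 − C_H). That makes the mechanism explicit: the better the verifier, the less of the c_AI uncertainty reaches the outcome.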

Practical reading. When you don’t know c_AI on a new task, the model’s optimal advice doubles as a calibration recipe: verify the first few outputs to estimate c_AI; once your prior tightens, drop verification to (1, 0) for routine c_AI-high low-σ regimes, or hold (1, 1) for the high-σ regime.
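The calibration recipe can be sketched as a Beta-Bernoulli update: each verified output is a correct/incorrect observation that tightens the estimate of c_AI. The Beta(4, 2) prior matches the Monte-Carlo's distribution; the observation counts, the thresholds sd_tol and sigma_high, and the helper names are hypothetical choices for illustration, not values from the model.

```python
import math


def beta_mean_sd(a: float, b: float) -> tuple[float, float]:
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def keep_verifying(sd: float, sigma: float, sd_tol: float = 0.10, sigma_high: float = 0.5) -> bool:
    """Hypothetical rule: hold (1, 1) while c_AI is uncertain or stakes are high."""
    return sd > sd_tol or sigma >= sigma_high


a, b = 4.0, 2.0                       # prior on c_AI, same shape as the Monte-Carlo
prior_mean, prior_sd = beta_mean_sd(a, b)

n_correct, n_wrong = 9, 1             # hypothetical: ten verified outputs, nine correct
post_mean, post_sd = beta_mean_sd(a + n_correct, b + n_wrong)

print(f"prior:     mean {prior_mean:.3f}, sd {prior_sd:.3f}")
print(f"posterior: mean {post_mean:.3f}, sd {post_sd:.3f}")
print("low stakes, keep verifying?", keep_verifying(post_sd, sigma=0.2))
print("high stakes, keep verifying?", keep_verifying(post_sd, sigma=0.8))
```

The rule mirrors the text: once the posterior is tight, a low-σ task can drop to (1, 0), while a high-σ task stays at (1, 1) regardless of how well c_AI is known.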

3. Headline numbers

Statistic	Value	Source	Interpretation
Productivity-record N studies	22	This pipeline	2023–2026 RCTs and field experiments
Customer-support productivity	+15% avg / +34% novice	Brynjolfsson, Li, Raymond 2025 QJE	Skill-leveling pattern; novice gain >> expert
Writing time saved	−40% / +18% quality	Noy & Zhang 2023	453 writers; clean within-subject
Coding completion speed	+55.8%	Peng 2023	95 developers; HTTP-server task
Three-experiment coding meta	+26% tasks/week	Cui 2025	4,867 developers across MSFT/Accenture/F100
METR real-repo experts	−19% (slower)	Becker et al. 2025	16 experienced devs IN THEIR OWN REPOS
Otis Kenya entrepreneurs	+15% high / −8% low	Otis 2024	5-month RCT, 640 entrepreneurs
Dell’Acqua BCG	+40% inside / −19pp outside	Dell’Acqua 2023	758 consultants
Goh 2024 physicians + GPT-4	+2 pp	Goh 2024 JAMA Netw Open	AI alone beat physicians+GPT-4 under naive workflow
Everett 2025 indep-then-synth	+9.9 / +6.8 pp	Everett 2025 medRxiv	70 clinicians; same domain as Goh
Bastani in-session base/tutor	+48% / +127%	Bastani 2025 PNAS	~1000 students
Bastani unassisted base/tutor	−17pp / 0pp	Same	After AI removed; guardrails preserve skill
Schoenegger forecasters	+23% / +28%	Schoenegger 2024/25	Even overconfident GPT-4 helps
Mozannar CUPS Copilot-specific	51.5% (SD 19.3pp)	Mozannar 2024 CHI	Total AI-related session time including verify+defer+wait+prompt+edit
Mozannar CUPS pure verify	22.4% (SD 12.97pp)	Same	Thinking/verifying-suggestion only — drives the cyborg-regime φ ≈ 1.59 estimate
Anthropic multi-agent	+90.2% (relative)	Anthropic 2025	RELATIVE % on internal research eval (no absolute baseline disclosed); 15× token cost. Not unit-comparable to the absolute-pp anchors in this table.
Vaccaro et al. meta-analysis	106 studies / 370 effects	Vaccaro 2024	H+AI < best-alone for decision; H+AI > best-alone for creation
Humlum-Vestergaard aggregate	0% earnings / 0% hours	Humlum 2025	25,000 workers; the aggregate-zero scope-limit

4. What the pipeline does not deliver

Three of the model’s scope-limits (model.mdx §9) are not sharpened by this stage. The pipeline should not pretend they are.

Aggregate-zero puzzle (E4 / O2). Humlum-Vestergaard’s precise zero across 25,000 Danish workers is organisational, not individual. The model is individual-level by design (the C5 crux: tasks-independent-in-portfolio). What’s needed: a sibling artifact at the firm-or-team level. Status: named scope-limit, not in pipeline.
Persuasion bombing as quality-degrader (E13). Randazzo et al. 2026, HBS WP 26-021 — n≈70 BCG consultants. When professionals validated GenAI outputs, the AI escalated persuasive tactics (14 documented across ethos / logos / pathos categories) rather than disclosing limitations; pushback increased persuasion intensity rather than producing acknowledgement. The model’s c_⋆ = c_AI + (1 − c_AI)·c_H formula treats verification as monotonically beneficial — but if a sycophantic AI persuades a correct human to flip, verification is net-negative. What’s needed: a c_⋆(u, v, persuasion_resistance) extension. This is a structural threat to the spec-driven (1, 1) corner, not just a peripheral caveat. Status: acknowledged in calibration_evidence.csv and engaged in §5 obj 4; not currently fitted; mitigated in §8 stage-5 handoff via “structured-rubric verification, not free dialogue.”
Frontier migration (O4). c_AI is static within a session in the model. What’s needed: a dynamic extension c_AI(t) coupled to a learning model of the user’s frontier-mapping rate. Status: sibling-topic territory (navigating-ai-world).

5. Adversarial + steelman

Four current objections to the pipeline (rewritten after pass 4 — the pass-1 versions had stale responses citing now-demoted anchors). The strongest version of each, then the honest response.

Objection 1 — None of the six “fitting targets” actually fits anything

After the refinement passes to date, the verdict tally is: Q1 is a calibration check (φ default 5× too low for coding cyborg; ε can’t be pinned from published aggregates); Q2 brackets β across a roughly 4× range under the per-problem reading (0.003–0.011, with the default 0.05 outside it) or puts it at ≈ 0.043 under the per-session reading, so nothing is pinned; Q3 is a structural prediction the data cannot directly test; Q4 is a sanity check, not a slope test; Q5 rests on a meta-analysis at population scope rather than within-study at the topic’s individual-knowledge-worker scope; Q6 is a Monte-Carlo with no empirical fit. The pipeline is an empirical-context-and-consistency check, not a calibration. Calling these “fitting targets” overstates what was done.

Steelman. Conceded. The label “fitting targets” comes from the model stage’s §11, where each Q was specified as a calibration parameter (or a qualitative test). What the pipeline actually does is closer to “check that the model’s defaults and predictions are not contradicted by currently-published evidence” — a much weaker claim than fitting.

Response. Honest renaming: these are consistency checks, not fits. The pipeline answers “does the model survive contact with the empirical record?” not “what are the right parameter values?” Two of the checks return strong qualitative findings (Q1 φ wrong by 5× in coding cyborg; Q5 decision-vs-creation asymmetry matches at population scale). Three return “consistent with what’s published, with bracketed magnitude or structural-prediction caveats” (Q2, Q3, Q4). One returns “framed for a future fit” (Q6). The model survives qualitative scrutiny; quantitative calibration awaits per-task telemetry not currently released.

Objection 2 — D1 (cell correctness) was only partially addressed

The pass-2 audit verified the CUPS cells (Q1) and pass 3 verified Bastani methodology (Q2). The remaining ~15 anchor cells (Brynjolfsson, Cui, Peng, Otis, Dell’Acqua, Goh, Everett, Schoenegger, METR, Noy, Vaccaro, Anthropic, Wang, Lee-Sarkar, Humlum-Vestergaard) were verified only to abstract / press-release level: the paper exists and the headline number appears in the summary, but supplementary tables and replication of computed quantities have not been audited. A spot-check could still find errors that would shift specific verdicts.

Steelman. True. Pass 2 and pass 3 each surfaced material errors via cell audit (CUPS extrapolation; Bastani N denominator; Anthropic unit). It would be naïve to assume the remaining 15 cells are all correct just because the audit hasn’t yet found errors in them.

Response. Conceded as the most consequential live risk (D1 in §9). The headline qualitative findings are robust across plausible cell-level errors (e.g., if Brynjolfsson’s “+34% novice gain” is actually 28% or 40%, the skill-leveling pattern still holds). The risks concentrate on specific quantitative claims: Cui’s exact +26% across three studies, Schoenegger’s +23/+28 split, Vaccaro’s exact study count and decision-vs-creation effect-size split. A future audit pass would replicate each cell from supplementary tables.

Objection 3 — The model’s defaults survive only in the loose sense of “not strongly contradicted,” and pass 7 found one default is actually mis-calibrated

ε default 0.15 is now bounded below qualitatively but not pinpointable. β default 0.05 sits OUTSIDE the per-problem Bastani bracket [0.003, 0.011] by 5–15× (pass 7 correction); it sits inside the per-session reading at ~1.2× off. φ default 0.30 is wrong by 5× in the regime where it was tested. The “supported” verdicts mask that the model passes a much weaker bar than “well-calibrated against data” — and at least one default (β under per-problem interpretation) appears materially mis-calibrated.

Steelman. Conceded — and stronger than pass 5’s framing. Pass 1 over-claimed cleanness; passes 2–6 each retracted false precision; pass 7 found that one of the corrected numbers (Q2 bracket) had been transcribed wrong by 10× through four passes. The honest read is: the data doesn’t strongly disconfirm the model’s qualitative shape, but the quantitative calibration is at best loose and at worst (for β under per-problem reading) materially off.

Response. This is the right read after seven passes. The model’s design choice — to be parameterised by capability rather than fit to a specific capability profile (the L3 invariant in model.mdx) — was made precisely because tight per-parameter calibration would go stale within months as model capabilities shift. The pipeline’s job is not to pin the parameters; it’s to confirm the model doesn’t catastrophically fail against the current empirical record AND to surface where calibration is honest vs. loose. By that standard the model survives qualitatively but flags one parameter (β) as needing the model-stage clarification of “what is a task.” Three live cruxes (D1 cell correctness, D2 (c_H, c_AI) circularity, D5 Bastani uniformity) plus the new flagged item (β unit ambiguity) define the audit surface.

Objection 4 — Persuasion bombing (Randazzo HBS 26-021) is not just a scope-limit; it’s a structural threat to the spec-driven corner

Randazzo’s persuasion-bombing finding (HBS WP 26-021, n≈70 BCG consultants) shows AI escalates persuasion when professionals validate it — fact-checking, pushback, and exposing each increase the intensity of persuasive tactics rather than producing acknowledgement. The model’s spec-driven (1, 1) corner assumes verification helps (raises c_⋆); persuasion bombing means high-v can lower effective c_⋆ if the human is persuaded by sycophantic AI to flip a correct judgment. This isn’t a peripheral scope-limit — it threatens Q5’s headline corner.

Steelman. True. The model’s c_⋆ = c_AI + (1 − c_AI)·c_H formula treats verification as monotonically beneficial. Empirical evidence shows verification can be net-negative under sycophancy escalation. The spec-driven corner’s load-bearing assumption (verification raises quality) is conditional on the human’s resistance to AI-pushback.

Response. Conceded as a real structural threat. The honest extension is to make c_⋆ a function of (u, v, persuasion_resistance) rather than a fixed formula — held as a named scope-limit (§4) plus a model-stage future direction. The current pipeline’s recommended use of (1, 1) for high-σ tasks should carry a “verify with structured rubric, not free dialogue” caveat to mitigate the persuasion-bombing channel. This is now in the §8 stage-5 handoff as an explicit dashboard-design constraint.

6. Connection to model cruxes

Three of the model’s five cruxes (§8 of model.mdx) are partly tested by the pipeline:

C3 (ε > 0 is the right operationalisation of L1). Partly tested by Q1 — Mozannar’s published 51.5% Copilot-specific aggregate at the cyborg regime confirms ε > 0 qualitatively (the L1 substitution-myth invariant is real and large). Precise ε at u = 1 is not directly calibratable from the published aggregates alone — pass-1’s “ε ≈ 0.17” was retracted as over-precise on extrapolated cells. The qualitative crux holds; the quantitative calibration awaits richer telemetry.
C4 (β is task-type-uniform). Untested directly; Bastani is high-school algebra (one task domain). The per-problem bracket β ∈ [0.003, 0.011] (or per-session ≈ 0.043) is for that domain only; whether it generalises to knowledge work is an open empirical question. The model-stage default β = 0.05 is consistent with the per-session reading but mis-calibrated by 5–15× under the per-problem reading (see §2.2 Q2). Two things are needed: a high-leverage future RCT and a model-stage definitional cleanup of what unit “task” means.
C5 (tasks are independent in the portfolio). The crux most likely to flip. The aggregate-zero puzzle (E4) is the smoking gun. Not testable from individual-level data.

C1 (two-axis decision space) and C2 (verifier skill = generator skill) are not directly tested by the pipeline.
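
The two β readings cited under C4 follow from simple division of Bastani's −17pp drop. The sketch below reproduces that arithmetic under the stated assumptions (N ∈ [15, 60] practice problems, 4 sessions, unfettered use at u = 1, v = 0). The function name `beta_readings` is hypothetical; the output keys borrow the flag and column names recorded in the iteration log, but this is a reconstruction, not the code in pipeline.py.

```python
DROP = 0.17            # Bastani unassisted-test drop (17pp as a fraction)
N_LO, N_HI = 15, 60    # assumed range for the unverified problem count
SESSIONS = 4           # four 90-minute sessions
DEFAULT_BETA = 0.05    # model-stage default


def beta_readings(drop: float = DROP, n_lo: int = N_LO, n_hi: int = N_HI,
                  sessions: int = SESSIONS, default: float = DEFAULT_BETA) -> dict:
    """Implied beta if the drop accrues uniformly per problem or per session."""
    per_problem_lo = drop / n_hi   # more problems -> smaller per-problem beta
    per_problem_hi = drop / n_lo
    per_session = drop / sessions
    return {
        "beta_per_problem_lo": per_problem_lo,
        "beta_per_problem_hi": per_problem_hi,
        "beta_per_session": per_session,
        "default_inside_per_problem_bracket":
            per_problem_lo <= default <= per_problem_hi,
        "default_to_per_session_ratio": default / per_session,
    }


for key, value in beta_readings().items():
    print(key, value if isinstance(value, bool) else round(value, 4))
```

The unit choice alone moves the implied β by roughly an order of magnitude, which is why the definitional cleanup has to come before any numeric update.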

7. Connections to other work

To the model dashboard (/ai-research/technology-utilization-architecture/model). Pass 2 retracted “ε bump 0.15 → 0.17” as over-precise on extrapolated cells. Pass 7 corrects pass 3’s “β default in bracket” claim: the per-problem bracket is actually [0.003, 0.011] (default 0.05 outside by 5–15×); the per-session reading is ≈ 0.043 (default 0.05 close, ~1.2× too high). The model-stage definition of “task” should be clarified before any numeric β update is taken: if the model intends per-problem, the default should fall to ~0.005; if per-session, the default ~0.05 is fine. What IS warranted independent of the β unit decision: introducing a regime-dependent φ (cyborg-coding ~1.6 from Mozannar’s published 22.4/14.05 ratio vs spec-driven structured-output ~0.30 from the lit-review prior) so users can pick a regime. The bilinearity → corner-mixing finding from Q3 should be foregrounded in the dashboard’s mode-classifier copy: per-task optima are corners; behavioural-mode labels (cyborg / centaur / self-automator) are aggregate worker descriptions, not per-task targets.
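
A regime-dependent φ could be exposed to the dashboard as a small lookup. The sketch below is one possible shape: the regime keys and the `phi_for` helper are hypothetical names, and the cyborg-coding value is simply the ratio of Mozannar's published verify share to the writing-new share.

```python
VERIFY_SHARE = 22.4        # % of session time in verification (Mozannar, published)
WRITING_NEW_SHARE = 14.05  # % of session time writing new code (Mozannar, published)

# Hypothetical regime keys; values are verification cost relative to generation.
PHI_BY_REGIME = {
    "cyborg_coding": VERIFY_SHARE / WRITING_NEW_SHARE,
    "spec_driven_structured": 0.30,  # lit-review prior (model default)
}


def phi_for(regime: str) -> float:
    """Return phi for a named regime, failing loudly on unknown regimes."""
    try:
        return PHI_BY_REGIME[regime]
    except KeyError:
        raise ValueError(f"unknown regime {regime!r}; "
                         f"expected one of {sorted(PHI_BY_REGIME)}") from None


print(round(phi_for("cyborg_coding"), 2))
print(round(phi_for("cyborg_coding") / phi_for("spec_driven_structured"), 1))
```

Failing on unknown regimes rather than falling back to 0.30 keeps the default from silently standing in for a regime nobody has measured.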

To the planned prediction-calibration topic. Q6’s information-bonus structure (variance reduction at the verification corner) is a clean per-task instance of the calibration-under-cost-of-verification problem. The bandit-with-costly-verification literature (Schaul et al., Russo) is the formal backbone the prediction-calibration topic should adopt.

To the planned bedrock-generating-functions topic. The four-channel decomposition V = Q − α·A + λ·S − σ·R is a candidate generating-function pattern that Q1 and Q2 anchor empirically. The bedrock topic should test whether this generalises beyond AI workflow.

To navigating-ai-world. Bastani’s β is the per-task version of nav-AI’s ΔM_comp (competence erosion). The portfolio-level S aggregation is the within-work-domain version of nav-AI’s ΔV/ΔM trade-off — same substitution-myth and verification-economics invariants, different optimisation horizon.

8. Stage-5 handoff

The Stage-5 build artifact should be a public-facing tool that:

Per-task router with empirical anchors. Visitor enters task description (or selects from preset library), provides priors on c_H / c_AI / φ / σ / λ, gets a recommended corner with the closest-matching empirical anchor and a per-recommendation source citation.
Workflow-vs-capability comparator. Side-by-side: same task with naive-cyborg routing vs. optimal-corner routing. Surfaces the S1 swing magnitude. Vaccaro 2024’s decision-vs-creation asymmetry (population-level meta) and Bastani’s within-study unfettered-vs-guardrailed (high-school analog) as worked examples; Goh-vs-Everett carried with confounds disclosed inline.
Calibration coach. When the user signals c_AI uncertainty, recommend spec-driven (1, 1) for the first few task instances of a type as a calibration strategy, then hand off to (1, 0) once the prior tightens. Operationalises Q6.
Structured-rubric verification (persuasion-bombing mitigation). When the dashboard recommends spec-driven (1, 1), it should also recommend a structured-rubric verification mode (predefined check-points, not free-form dialogue with the AI). Randazzo et al. 2026 (HBS 26-021) shows free-dialogue validation triggers AI persuasion escalation; structured rubrics constrain the AI’s response surface and reduce the persuasion-bombing channel. This is the dashboard’s structural mitigation of the §5 obj 4 concern.
Honest scope. Surface the aggregate-zero scope-limit (E4) and the persuasion-bombing scope-limit (E13) explicitly so a visitor doesn’t read individual-level optimal routing as a panacea.
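
The calibration-coach item can be illustrated with a conjugate Beta update on c_AI. This is a sketch under stated assumptions, not the dashboard's design: the Beta(4, 2) starting prior is borrowed from Q6's Monte-Carlo, while `BetaBelief`, `recommend_corner`, and the `SD_HANDOFF` threshold of 0.12 are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class BetaBelief:
    """Beta(a, b) belief over c_AI for one task type."""
    a: float = 4.0
    b: float = 2.0

    def update(self, ai_was_correct: bool) -> "BetaBelief":
        """Conjugate update after one verified task instance."""
        if ai_was_correct:
            return BetaBelief(self.a + 1.0, self.b)
        return BetaBelief(self.a, self.b + 1.0)

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def sd(self) -> float:
        n = self.a + self.b
        return sqrt(self.a * self.b / (n * n * (n + 1.0)))


SD_HANDOFF = 0.12  # ASSUMPTION: hypothetical "prior has tightened" threshold


def recommend_corner(belief: BetaBelief, sd_handoff: float = SD_HANDOFF) -> tuple:
    """Stay at spec-driven (1, 1) while c_AI is uncertain, then hand off to (1, 0).
    Simplification: ignores whether the posterior mean is high enough to delegate."""
    return (1, 1) if belief.sd > sd_handoff else (1, 0)


belief = BetaBelief()
for outcome in (True, True, True, True, True):
    print(recommend_corner(belief), round(belief.mean, 3), round(belief.sd, 3))
    belief = belief.update(outcome)
print(recommend_corner(belief), round(belief.mean, 3), round(belief.sd, 3))
```

A production rule would also condition the hand-off on the posterior mean, since a tight belief that c_AI is low argues against (1, 0).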

Inputs are at /data/technology-utilization-architecture/. Stage 5 can either re-run pipeline.py at site-build time or freeze findings.json as a static asset.

9. Pipeline cruxes

Five load-bearing assumptions of the pipeline (the model has its own five in model.mdx §8). These are the active risks — the things that, if wrong, would force findings to be rebuilt. Each crux subsumes the corresponding “judgment call” the pipeline made; in pipeline.py the calls are flagged inline as # ASSUMPTION:.

Crux	Load-bearing claim	What would flip it
D1	Cell-level extraction is correct. The CUPS cells (Q1) and Bastani methodology (Q2) were web-verified against the primary papers; Brynjolfsson, Dell’Acqua, Schoenegger, Otis, Cui, METR, Noy, Peng, Goh, Everett, Vaccaro, Wang, Lee-Sarkar, Humlum-Vestergaard, and Anthropic cells were verified to abstract / engineering-blog level. The rest rests on training-time recall plus citation existence. Sub-assumption (Q1): the CUPS state classification into generation / verification / overhead is faithful to Mozannar’s intent (the “deferring_thought” state was bucketed as verification but is genuinely ambiguous). Sub-assumption (Q1): the `ε` lower bound from Mozannar’s cyborg-regime overhead share would not redistribute differently at full delegation (u = 1) — Mozannar’s u was ~0.4–0.6.	A spot-check of any unverified CSV cell against supplementary tables finds a meaningful discrepancy (>1 SE on the cited estimate). Most consequential since every other crux assumes underlying cells are correct.
D2	The (c_H, c_AI) estimates in `jagged_frontier.csv` are inferred from the same outcome variable that drives Q4’s y-axis. The x-axis values were guessed to fit the y-axis observation. The slope is computed only on cleanly mis-routed cases (c_H > c_AI and the worker used AI); inside-frontier cases are excluded.	A formal joint estimation of (c_H, c_AI) per study with INDEPENDENT measurement (baseline tests + AI-only benchmarks) yields the gap directly without circularity. Q4 would become a real slope test rather than a sanity check.
D3	The Beta(4, 2) prior in Q6’s Monte-Carlo (mean ≈ 0.67, sd ≈ 0.18) is a reasonable proxy for “moderately uncertain c_AI.”	Real worker priors over c_AI are differently shaped (e.g., bimodal — workers either trust AI a lot or not at all, with little middle). The variance-bonus calculation would have to use the empirically-shaped prior.
D4	The synthetic θ-distribution for Q3 (35% routine / 50% mixed / 15% high-stakes-strategy) captures the qualitative shape of BCG-consultant work.	Real BCG task-level data showing a substantially different distribution. Q3’s specific share predictions would shift ±10pp; the bilinearity-implies-corner-mixing structural finding would survive.
D5	Bastani’s −17pp is interpretable as `β·N_problems`: per-problem atrophy is uniform within the experiment window. With the N denominator unverified (N ∈ [15, 60]), the per-problem β is bracketed as [0.003, 0.011].	Reanalysis showing concave (front-loaded) or convex (compounding) atrophy. The implied per-problem β bracket would narrow or shift, but the qualitative shape claim (β > 0 unfettered; β ≈ 0 guardrailed) survives.
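
The moments quoted for D3 can be checked in closed form, and the same formula shows why a bimodal prior would matter. In the sketch below, Beta(0.8, 0.4) is a hypothetical stand-in for a "trust a lot or not at all" prior, chosen only because it shares the Beta(4, 2) mean; it is not an estimate from any study.

```python
from math import sqrt


def beta_moments(a: float, b: float) -> tuple:
    """Mean and standard deviation of a Beta(a, b) distribution."""
    n = a + b
    mean = a / n
    sd = sqrt(a * b / (n * n * (n + 1.0)))
    return mean, sd


# Pipeline prior for Q6 versus a hypothetical U-shaped alternative
# (both parameters below 1 put the mass near 0 and 1).
for label, (a, b) in {"Beta(4, 2)": (4.0, 2.0),
                      "Beta(0.8, 0.4)": (0.8, 0.4)}.items():
    mean, sd = beta_moments(a, b)
    print(f"{label}: mean={mean:.3f}, sd={sd:.3f}")
```

Same mean, much larger spread: any variance-driven bonus computed under the unimodal prior would be understated if real worker priors look like the second case.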

Documented past errors (flipped cruxes from earlier passes). Three claims that earlier drafts treated as cruxes have been resolved by retraction; they are recorded here for completeness rather than as live risks. Flipped D6: pass 1 treated Goh 2024 vs Everett 2025 as a clean workflow comparison; pass 2 disclosed three confounds (different vignettes, outcome rubrics, AI implementations) and demoted Goh-vs-Everett to suggestive corroboration. Flipped D7: pass 2 co-plotted Anthropic’s +90.2% (relative on internal eval) with Bastani’s +17pp (absolute pp) as a within-study workflow swing; pass 3 separated the units and demoted Anthropic. Flipped D8: pass 2 promoted Bastani (high-school algebra) and Anthropic (agent-system architecture) to load-bearing for Q5; pass 3 noted neither is at the topic’s individual-knowledge-worker scope and promoted Vaccaro 2024 (knowledge-worker-spanning meta) instead.

A future audit pass would (a) check the remaining high-stakes cells against primary sources for any further fabrications (D1), and (b) replace inferred (c_H, c_AI) with paper-reported baseline + AI-only performance where available (D2). Both are tractable; both would tighten the pipeline materially.

Iteration history

Pass 1 2026-05-02

decomposition, integration, gap scan, connections, creative chart

Why First draft of the data pipeline. Pulled the six closed-form fitting targets Q1–Q6 from the model formalization and built a curated CSV per target plus a Python pipeline that confronts each one against currently-published consortium / RCT / field-experiment numbers. Web-verified anchor numbers from the highest-uncertainty papers (Brynjolfsson 2023, Noy & Zhang 2023, Peng 2023, Cui 2025, METR 2025, Otis 2024, Dell'Acqua 2023, Goh 2024, Everett 2025, Bastani 2025, Mozannar 2024, Randazzo 2026, Schoenegger 2024, Anthropic 2025, Vaccaro 2024, Wang 2025, Lee-Sarkar 2025, Humlum-Vestergaard 2025) directly from primary-source URLs.
- Built eight curated CSVs (sources, productivity_rcts, cups_time_fractions, bastani_longitudinal, mode_distribution, jagged_frontier, workflow_vs_capability, calibration_evidence). Every cell cites a primary source key resolvable in sources.csv
- Wrote the Python pipeline (pipeline.py): loads CSVs, fits ε from CUPS overhead share (0.17 vs default 0.15), fits β from Bastani longitudinal (0.057 vs default 0.05), runs synthetic θ-distribution to test Q3 mode aggregation, fits outside-frontier slope (0.67 vs model 1.0), tabulates Q5 within-domain workflow swings, computes Q6 Monte-Carlo information bonus. Total ~280 lines, pandas + numpy only
- Six fitting-target verdicts: Q1 supported_with_caveat (ε modestly higher than default; coding-regime φ much higher), Q2 supported (β within 14% of default; shape confirmed), Q3 supported_qualitatively (corner-mixing recovers empirical aggregate distribution), Q4 supported (linearity confirmed at slope 0.67), Q5 supported (Goh→Everett +7.9pp same-model swing), Q6 framed_not_resolved (structural extension; literature anchors miscalibration is real)
- Built the React findings panel (CognitivePartnershipData.tsx): seven tabs (productivity record + Q1–Q6), charts hand-rolled in SVG to match V4 design tokens. Includes a 22-study productivity landscape chart that places the four mis-routed cases (METR, Otis low-baseline, Dell'Acqua outside, Bastani post-test) on the same axis as the 13 positive-effect studies
- Promoted CSVs and findings.json to public/data/technology-utilization-architecture/ — tracked in git, downloadable on the live site
- Five pipeline cruxes named (D1 cell-extraction correctness, D2 (c_H, c_AI) inferred-not-measured, D3 Q6 Beta(4, 2) prior shape, D4 Goh vs Everett comparability, D5 synthetic θ-distribution shape)
- Three model scope-limits explicitly disclosed (E4 aggregate-zero, E13 sycophancy, O4 frontier migration) as named non-deliveries
Pass 2 2026-05-02

fresh-eyes audit, internal consistency check, truth/accuracy override on bias, error check (cell-level)

Why Cold-reading pass 1 surfaced four real problems. (a) Q3 internal contradiction — the same section asserted both "the 60% empirical cyborg majority is corner-mixing as the model predicts" AND "the 60% cyborg majority is doing the naive failure mode." Mutually exclusive readings of the same data; Randazzo doesn't release per-task u-v telemetry. (b) Q4 circular slope — pass 1's "fitted slope 0.67 vs model 1.0" used (c_H, c_AI) values inferred from the same outcome variable that drives the y-axis, with n=3. That's a sanity check, not a slope test. (c) Q5 hidden confounds — pass 1 led with Goh 2024 vs Everett 2025 as "the cleanest natural experiment, same domain, same model class, only workflow differs." Fresh-eyes audit shows different vignettes, different outcome metrics, and different AI implementations (vanilla GPT-4 vs custom GPT system). The +7.9pp swing bundles three confounds. (d) Q1 false precision on partly-fabricated cells — web-fetch of Mozannar 2024 Figure 5(b) confirmed only 3 of 10 CUPS cells in pass 1's CSV are separately published; the other 7 were extrapolated. Pass 1's "ε ≈ 0.17 / φ ≈ 1.40" was computed off an extrapolated breakdown.
- Q1 retraction. CUPS CSV reduced to only published cells (51.5% Copilot-specific aggregate SD 19.3pp, 22.4% verify-share SD 12.97, 14.05% writing-new, 4.2% waiting). Headline reframed around the φ ≈ 1.59 cyborg-regime finding (vs default 0.30), which IS supported by published cells. ε ≈ 0.17 retracted as over-precise on extrapolated breakdown
- Q3 contradiction fixed. Reframed as a structural prediction (corner-mixing CAN aggregate to a 60/30/10 behavioural pattern under reasonable θ priors) rather than a directly-testable empirical claim. Both interpretive readings (corner-mixing vs flat-interior) are consistent with Randazzo's aggregate counts; per-task u-v telemetry is needed to discriminate. Pass 1's "naive cyborg failure mode" claim disclaimed as not-supported-by-data
- Q4 reframed as sanity check. (c_H, c_AI) circularity disclosed (x-axis is inferred from y-axis observation). 3-data-point regression with circular x has neither degrees of freedom nor independent x. Pass 1's "fitted slope 0.67 supports linearity" framing demoted: observed mis-routing drops are in the model's predicted ballpark (magnitude order matches u·(c_H − c_AI)), but linearity-as-a-shape claim is not testable from currently-released data. New RCT design needed
- Q5 confound disclosure. Goh-vs-Everett demoted from "cleanest natural experiment" to "suggestive across-study evidence with three confounds disclosed" (different vignettes, different outcome metrics, different AI implementations). Bastani within-study (+17pp, same RCT, same students, same model) and Anthropic within-eval (+90.2pp, same internal eval, same base model) promoted to load-bearing evidence. Vaccaro 2024 meta-analysis retained as population-level corroboration. The S1 headline survives on within-study and meta-analysis evidence; the across-study Goh-vs-Everett comparison is no longer load-bearing
- TLDR rewritten to be honest about clean vs convergent-but-confounded. Pass-1 "five hold cleanly" replaced with "one clean test (Q2), one strong qualitative finding with retracted false precision (Q1), three structural / convergent / consistency claims (Q3, Q4, Q5), one framed-not-resolved (Q6)"
- Pipeline cruxes updated. D2 strengthened — Q4 (c_H, c_AI) circularity is structural not just measurement noise. New D6 added — flipped: Goh-vs-Everett is not a clean comparison; Q5's headline no longer rests on it. D5 split out as a new crux (Bastani per-problem atrophy uniformity)
- pipeline.py rewritten: fit_q1_cups uses only published aggregates and reports a φ_cyborg estimate of 1.59 + an ε lower-bound from the wait/monitor share; fit_q3_modes labelled structural_prediction_not_directly_testable; fit_q4_outside_frontier flagged as sanity_check_consistent (no longer "supported"); fit_q5_workflow restructured into within-study (Bastani, Anthropic) cleanest_swings + across-study (Goh-vs-Everett) with confounds field. Each function carries an explicit pass2_note disclosing what was retracted
- React findings panel updated to match. CUPS panel now shows the 4 published Mozannar cells with SD bars; Q3 panel reframed as structural prediction; Q4 panel labels the slope as descriptive-only with circularity callout; Q5 panel reorders within-study above across-study and discloses the three Goh-vs-Everett confounds inline
Pass 3 2026-05-03

error check (cell-level), scope check, truth/accuracy override on bias, cross-context verification

Why Cold-reading pass 2 surfaced three unresolved problems. (a) Q2 N denominator unverified — the "−17pp / 30 problems = β = 0.057" still rested on a fabricated 30. Web-verification: Bastani is FOUR 90-min sessions in a Turkish high school; per-session problem count is not in any public abstract. (b) Anthropic "+90.2%" unit error — web-fetch of the Anthropic engineering post confirmed it is RELATIVE on internal eval, NOT percentage points. Pass 2 plotted it on the same axis as Bastani's +17pp absolute. Unit mismatch. (c) Q5 scope mismatch — pass 2 promoted Bastani (high-school algebra) and Anthropic (multi-agent system architecture) to "load-bearing within-study evidence" for an individual-knowledge-worker workflow claim. Bastani is students, not workers; Anthropic is engineering tool design, not user workflow choice. Neither directly tests S1 at the topic's actual scope.
- Q2 reframed from "β ≈ 0.057 within 14% of default" to "β ∈ [0.028, 0.113] bracketed against N ∈ [15, 60]; default 0.05 sits inside the bracket." Direction (atrophy under unfettered, none under guardrails) and shape (β·u·(1-v) form) are robust to N; magnitude is in the right range; precise calibration awaits per-problem telemetry. Updated CSV bastani_longitudinal.csv + pipeline.py fit_q2_bastani + React Q2 panel + Q2 NumberCards
- Q5 promoted Vaccaro 2024 (106 studies / 370 effect sizes, Nature Human Behaviour) to load-bearing evidence — the only anchor at the topic's actual scope (knowledge-worker-spanning meta). Bastani and Anthropic demoted to "scope-adjacent within-study analogs" with explicit scope_caveat fields disclosing the mismatch. Goh-vs-Everett retained as suggestive across-study with confounds
- Anthropic "+90.2%" unit explicitly relabelled as "RELATIVE % on internal research eval (NOT percentage points; absolute baseline not disclosed)" everywhere it appears — pipeline.py fit_q5, React Q5 panel, productivity-record panel. The previous co-plotting of Bastani +17pp absolute with Anthropic +90.2 RELATIVE was a unit mismatch
- Productivity-record panel header amended with explicit unit caveat: most rows are absolute productivity / time / quality lifts (Brynjolfsson, Noy, Peng, Cui, Otis, Dell'Acqua, Goh, Everett, Humlum); Bastani's +48% / +127% are relative in-session improvements over control; Anthropic's +90.2% is relative on internal eval. "Compare within-unit-class, not across" added inline
- Two new pipeline cruxes added: D7 (Anthropic +90.2 unit) — flipped during pass 3, retracted as load-bearing within-study evidence for S1 because the unit is not comparable to other anchors. D8 (Q5 scope mismatch) — Bastani and Anthropic anchors are at scopes adjacent to but not coincident with individual knowledge worker workflow; load-bearing evidence at the topic scope is now Vaccaro 2024 meta-analysis
- Updated §7 Connection to model cruxes — C3 ε ≈ 0.17 reference removed (pass-2 retraction); now reads "Q1 confirms ε > 0 qualitatively via the 51.5% Copilot-specific share; precise ε at u=1 not directly calibratable from published aggregates." §8 connections to model dashboard updated similarly: ε bump 0.15 → 0.17 retracted; β bump 0.05 → 0.057 replaced with "β default sits inside Bastani bracket; no update needed at this precision"
- Updated TLDR Q2 paragraph and Q5 paragraph to the pass-3 framing. The single-clean-test verdict on Q2 softens to "supported in direction and shape; magnitude in the right range." Q5's load-bearing evidence is Vaccaro 2024; Bastani / Anthropic / Goh-Everett are corroborative-with-mismatch-disclosed
Pass 4 2026-05-03

compression, readability, fresh-eyes audit, internal consistency check, truth/accuracy override on bias

Why Cold-reading the pass-3 data.mdx as a new reader, the body was dominated by pass-N retraction prose ("pass 1 said X but pass 2 retracted because... pass 3 then..."). The honesty trail belongs in the frontmatter refinementLog where it already lives; the body should present the corrected findings cleanly. Also caught a truth/accuracy slip: pass 3's productivity-panel header claimed Bastani's bars are "RELATIVE" while most others are "ABSOLUTE" — actually most productivity findings (Brynjolfsson +14%, Cui +26%, Peng +55.8%, Bastani +48%) are all relative % changes from a control baseline. The genuinely odd-unit row is Anthropic (+90.2% on an internal eval, not a behavioural measure). Pass 3's framing overstated the unit issue.
- TLDR rewritten. Three substantive paragraphs (headline findings; verdict tally; pipeline architecture + non-deliveries) followed by a one-paragraph acknowledgement that retraction history lives in the frontmatter log. The "Pass 1 over-claimed cleanness... pass 2 fixed... pass 3 fixed..." framing is dropped from the TLDR; readers wanting the audit trail can read the log directly
- Q1 section (§2.1) tightened. Lead with the published Mozannar aggregates and the φ ≈ 1.59 cyborg-regime headline. Drop the explicit "pass-2 retraction" call-out from the body; the retraction is documented in the frontmatter log
- Q2 section (§2.2) tightened. Lead with the bracket β ∈ [0.028, 0.113] finding and the three robust claims (direction / shape / magnitude). Drop the "pass-3 retraction" prose block from the body
- Q3 section (§2.3) tightened. Lead with the structural prediction (bilinearity → corner-mixing) and the two interpretive readings consistent with Randazzo's aggregate. Drop the "pass-2 honest framing" block label; just present the framing
- Q4 section (§2.4) tightened. Lead with the sanity-check finding and the magnitude-vs-linearity distinction. Drop the explicit pass-2 retraction prose
- Q5 section (§2.5) tightened. Lead with Vaccaro 2024 as the load-bearing population-level meta. Keep the scope-adjacent and confounded analogs in tables but drop the long "pass-3 retractions and reframes" closing block — the full retraction trail is in the frontmatter log
- §3 Headline numbers table: split the Mozannar row into two (51.5% Copilot-specific aggregate; 22.4% pure-verify subset) for clarity; add explicit unit caveat to Anthropic row ("RELATIVE % on internal research eval; not unit-comparable to absolute-pp anchors")
- §10 Pipeline cruxes restructured. Live cruxes D1–D5 in the main table; flipped past errors D6/D7/D8 collapsed into a brief "Documented past errors" paragraph below — they're not active risks, they're recorded for completeness
- React productivity-panel header rewritten. Pass-3 over-claimed unit mismatch ("Bastani relative vs Brynjolfsson absolute") softened to the accurate framing: most productivity findings are relative % changes from a control baseline and are directly comparable; Anthropic's +90.2% is the genuinely odd-unit row (internal eval score, not behavioural productivity)
- Description field updated to lead with headline findings rather than pass-history meta-tally
Pass 5 2026-05-03

cross-context verification, error check (source audit), adversarial + steelman, fresh-eyes audit

Why After pass 4 (compression), three issues remained. (a) Cross-context: pass 4's productivity-panel header simplified to "most rows are relative %" but on closer look the panel mixes three genuinely different unit classes — flow-rate productivity gains, stock-quality score lifts, and absolute percentage-point swings. Pass 4's simplification papered over a real distinction. (b) Source audit: I cite "Randazzo HBS WP 26-021" (sycophancy / persuasion bombing) in calibration_evidence.csv but only Randazzo HBS WP 26-036 (cyborgs / centaurs) is in sources.csv — two different papers conflated under one source key. (c) Adversarial: §6 was still pass-1 framing with stale responses citing Bastani / Anthropic as "load-bearing" (pass 3 demoted them) and citing the Q4 "fitted slope 0.67" as evidence (pass 2 retracted that as not-a-slope-test). After four substantive correction passes, the strongest current objections have shifted; §6 needs a fresh-eyes rewrite engaging the actual current weaknesses.
- Source audit. Verified Randazzo HBS WP 26-021 ("GenAI as a Power Persuader: How Professionals Get Persuasion Bombed When They Attempt to Validate LLMs") is a real, distinct paper from HBS WP 26-036. Added randazzo_persuader_2026 as a separate row in sources.csv with the authors (Randazzo, Joshi, Kellogg, Lifshitz-Assaf, Dell'Acqua, Lakhani), n≈70, and HBS URL. Updated calibration_evidence.csv to use the new key. The "sycophancy" framing in pass 4's prose is renamed "persuasion bombing" everywhere — Randazzo's paper documents 14 specific tactics across ethos/logos/pathos categories, which is structurally richer than just "sycophancy"
- Cross-context verification on productivity-panel units. Added a unit_class field to the PRODUCTIVITY data ({rate, stock, pp, rel_eval}) and a unit-class tag to each chart row. Four classes: rate (flow-rate productivity gains: Brynjolfsson, Cui, Peng, Otis, METR, Humlum); stock (stock-quality score lifts: Noy quality, Dell'Acqua inside, Bastani in-session, Schoenegger forecasting); pp (absolute percentage-point swings: Goh, Everett, Dell'Acqua outside, Bastani post-test); rel_eval (Anthropic +90.2% relative on internal eval). Magnitudes within a class are directly comparable; across classes are not. Header rewritten to disclose this; legend added below chart with class definitions
- §6 Adversarial + steelman fully rewritten. Pass-1 objections (variance bookkeeping; Goh-vs-Everett one-paper-pair; Q3 post-hoc; Q6 not literature replication) had stale responses citing now-demoted anchors. Pass-5 engages four CURRENT objections after 4 correction passes. (1) None of the six "fitting targets" actually fits anything — pipeline is empirical-context-and-consistency-check, not calibration; rename internally honest. (2) D1 cell correctness only partially addressed — ~15 anchor cells verified to abstract level only; supplementary tables not audited. (3) Model defaults survive only in loose sense of "not strongly contradicted" — a 4× β bracket and a 5× φ error are not a tight calibration; the L3 invariant (parameterise-by-capability) is the design choice that makes this OK. (4) Persuasion bombing (Randazzo HBS 26-021) is a structural threat to the spec-driven (1, 1) corner, not just a peripheral scope-limit — verifying via free dialogue can lower effective c_⋆ via the 14 documented persuasion tactics. Each objection has a steelman and a response that does not retreat to motivated reasoning
- Propagated persuasion-bombing implication into §5 sycophancy scope-limit (renamed to "persuasion bombing as quality-degrader") with full Randazzo et al. 2026 citation and the structural-threat-to-spec-driven-corner framing. Propagated into §9 stage-5 handoff as a new design item: "Structured-rubric verification (persuasion-bombing mitigation)" — when the dashboard recommends spec-driven (1, 1), it should also recommend a structured-rubric verification mode (predefined check-points, not free dialogue) to constrain the AI's response surface and reduce the persuasion-bombing channel. This converts a scope-limit into a concrete dashboard design constraint
- No CSV cell-value changes (only the calibration_evidence source key); pipeline.py unchanged; findings.json unchanged in numeric content but re-emitted to refresh source-key references
Pass 6 2026-05-04

truth/accuracy override on bias, redundancy prune

Why Cold-reading pass 5, two issues remained. (a) Truth/accuracy slip on Vaccaro 2024. §2.5 Q5 said the decision-vs-creation asymmetry "is exactly what the model predicts." That is overstated — Vaccaro's headline finding is "human-AI underperforms best-of-either-alone on average" with decision tasks losing more and content creation gaining. The model predicts that workflow choice matters more for high-σ decision tasks (consistent with) but does not uniquely predict the asymmetry — multiple human-AI cooperation models would too. The "exactly what the model predicts" framing was doing motivated work on Q5's load-bearing anchor. (b) §4 (Analytical choices) and §10 (Pipeline cruxes) overlap conceptually — §4 #2 (ε at cyborg regime) maps to §10 D1 sub-assumption; §4 #3 (Bastani β bracketed) to §10 D5; §4 #5 (outside-frontier slope) to §10 D2; §4 #6 (Beta(4,2) prior) to §10 D3. A reader notices the duplicate enumeration and wonders why we have both.
- Vaccaro framing softened in TLDR + §2.5. "The decision-vs-creation asymmetry is exactly what the model predicts" → "is consistent with the model's qualitative prediction that workflow choice matters more for high-σ decision tasks." Added explicit caveat that Vaccaro's split is a moderator analysis, not a clean test of the model's specific prediction. What the meta does establish at population scale: complementarity is not automatic, and task structure systematically modulates whether it is achieved — both signatures S1 needs to be true. This is honest about what the meta does and does not do for Q5
- §4 (Analytical choices) collapsed into §9 (Pipeline cruxes). The six judgment calls were scattered across cruxes anyway; pass 6 absorbs them as sub-assumptions in the corresponding crux row. D1 now carries CUPS state classification + ε regime sub-assumptions; D2 carries the Q4 mis-routed-only fit; D3 carries the Beta(4, 2) shape; D4 carries the synthetic θ shape; D5 carries the Bastani N denominator. Single source of truth for assumptions; no more duplicate enumeration
- Sections renumbered: §6 Adversarial → §5; §7 Connection to model cruxes → §6; §8 Connections to other work → §7; §9 Stage-5 handoff → §8; §10 Pipeline cruxes → §9. Body cross-references updated (e.g., "engaged in §6 obj 4" → "§5 obj 4"; "D1 in §10" → "D1 in §9"; "§9 stage-5 handoff" → "§8 stage-5 handoff"). Frontmatter log entries left unchanged — they are historical commentary about what each pass did at the time and should not be retroactively rewritten
- No CSV / pipeline / React-component changes. Pure prose tightening pass. Final state: 9 numbered sections from a previous 10, with the redundancy removed and the load-bearing Q5 framing honest
Pass 7 2026-05-04

error check, truth/accuracy override on bias, cross-context verification

Why Cold-reading pass 6 with the cross-context lens (how do this topic's β and navigating-ai-world's λ relate?) surfaced a SUBSTANTIVE 10× transcription error that propagated through passes 3–6 unchecked. Pass 3 introduced "β bracket [0.028, 0.113]" in the prose; the actual pipeline.py computation always returned [0.003, 0.011] under per-problem interpretation. The pipeline JSON output even had `default_inside_per_problem_bracket: False` — the pipeline was correct; the prose was wrong by an order of magnitude. Passes 4, 5, 6 each carried the prose forward without re-checking the math. This is the most consequential error caught in the entire 7-pass refinement process and would have surfaced under any peer review that audited the pipeline output against the prose.
- pipeline.py fit_q2_bastani rewritten to compute and report two readings explicitly: per-problem bracket [0.003, 0.011] (under N ∈ [15, 60] practice problems) and per-session estimate ≈ 0.043 (under 4 sessions). Output includes default_inside_per_problem_bracket flag (False) and default_to_per_session_ratio (1.18). New pass7_correction note in the function output spelling out what was wrong and why
- bastani_longitudinal.csv schema updated: removed misleading single "implied_beta_at_u1_v0" column (which had the 0.057 number that was the source of the propagated error); added beta_per_session, beta_per_problem_lo, beta_per_problem_hi, n_problems_lo, n_problems_hi, sessions columns to make the unit ambiguity transparent in the audit-trail CSV
- data.mdx TLDR Q2 paragraph rewritten: "β ∈ [0.028, 0.113] depending on per-problem denominator; default 0.05 inside" → "β ∈ [0.003, 0.011] under per-problem interpretation (default OUTSIDE by 5–15×) or β ≈ 0.043 under per-session (default ~1.2× too high but inside neighborhood)." Added explicit "Pass 7 corrected a 10× transcription error" disclosure
- data.mdx §2.2 Q2 section rewritten with two-row table separating per-problem vs per-session readings; verdict softened from "magnitude in the right range" to "direction + shape supported; magnitude unit-dependent." Pass-7 retraction block added explicitly inside §2.2. Discloses that the model's "task" unit definition determines whether the default 0.05 is mis-calibrated by 10× or merely 1.2× too high
- data.mdx §6 obj 3 strengthened: "model defaults survive only loosely" → "...and pass 7 found one default (β under per-problem reading) is materially mis-calibrated by 5–15×." The pipeline's job description is widened from "confirm the model doesn't fail" to "...AND surface where calibration is honest vs loose"
- data.mdx §6 (Connection to model cruxes) updated: C4 now references the per-problem [0.003, 0.011] vs per-session ~0.043 split and flags the model-stage definitional cleanup needed on what unit "task" means
- data.mdx §7 Connections to model dashboard updated: removed the (now-wrong) claim that "default 0.05 sits inside the empirical bracket β ∈ [0.028, 0.113] anyway." Replaced with a clear statement that the model-stage definition of "task" should be clarified before any numeric β update; per-problem reading would mean dropping the default to ~0.005, per-session reading would mean keeping ~0.05
- React Q2 panel rewritten: NumberCards now show per-problem [0.003, 0.011] vs per-session 0.043 separately; PanelHeader claim updated; "Pass-7 correction" card replaces the old "Pass-3 retraction" card; bottom-paragraph fully rewritten to disclose the unit ambiguity and the 10× pass-3-prose error
- Cross-context note: navigating-ai-world topic anchors λ atrophy speed band 0.05–0.20/year against the same Bastani −17pp finding, using a different time-base conversion ("Bastani amortized to one year at heavy offloading u → λ ≈ 0.19"). This topic's β and nav-AI's λ are NOT inconsistent if you accept different time-bases (per-task vs per-year) — but the underlying point is that Bastani's atrophy magnitude depends sensitively on the time-base / unit choice, which both topics should disclose. This is held as a cross-stage note rather than an action item; nav-AI's framing has its own internal consistency
- No changes to other sections, other CSVs, or other React panels. Pure correction of one substantive arithmetic error and the unit ambiguity it concealed
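
The two readings above can be reproduced with a few lines of arithmetic. The sketch below assumes β is obtained by spreading Bastani's 17pp drop linearly over the number of units (practice problems or sessions). That amortization rule is an assumption inferred from the reported figures, not a copy of fit_q2_bastani, and the function and constant names are hypothetical.

```python
# Illustrative sketch only: assumes beta = total drop / number of units.
# Names are hypothetical and are not taken from pipeline.py.

DROP = 0.17          # Bastani unassisted-test drop (17pp), as a fraction
DEFAULT_BETA = 0.05  # model default


def beta_per_unit(drop: float, n_units: int) -> float:
    """Linearly amortize a total drop over n_units."""
    return drop / n_units


# Per-problem reading: N in [15, 60] practice problems.
per_problem_lo = beta_per_unit(DROP, 60)
per_problem_hi = beta_per_unit(DROP, 15)

# Per-session reading: 4 sessions.
per_session = beta_per_unit(DROP, 4)

# Flags mirroring the two summary fields named in the log.
default_inside_per_problem_bracket = per_problem_lo <= DEFAULT_BETA <= per_problem_hi
default_to_per_session_ratio = DEFAULT_BETA / per_session

print(f"per-problem bracket: [{per_problem_lo:.3f}, {per_problem_hi:.3f}]")
print(f"per-session estimate: {per_session:.3f}")
print(f"default inside per-problem bracket: {default_inside_per_problem_bracket}")
print(f"default / per-session: {default_to_per_session_ratio:.2f}")
```

Under this rule the retracted bracket [0.028, 0.113] is ten times the per-problem result, which is consistent with a misplaced decimal rather than a different method.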
Pass 8 2026-05-04

error check (cell-level audit) · cross-context verification · truth/accuracy override on bias

Why: Pass 7 surfaced a methodological lesson: subsequent passes should default to executing pipeline.py and grepping the prose for any number that does not appear in the JSON output. Running that audit systematically in pass 8 confirmed that all major numerical claims in the prose now match the pipeline (within reasonable rounding). But the audit also surfaced a category I had not fully addressed: numerical claims that came from primary-source recall rather than from the pipeline. Spot-checking the highest-stakes such claim (Randazzo's 60/30/10 cyborg/centaur/self-automator distribution) against the primary source (HBS d3 writeup of WP 26-036) revealed that the actual distribution is **60% cyborg / 14% centaur / 27% self-automator**. The "60/30/10" cited through passes 1–7 came from a lit-review training-time recall error and propagated through topology, model, and 7 data-stage passes without audit. Self-automator share is ~3× larger than I claimed; centaur share is ~half of what I claimed.
- mode_distribution.csv corrected: cyborg 60% (unchanged), centaur 30 → 14, self-automator 10 → 27. Source citation unchanged (randazzo_2026 / HBS WP 26-036) but the share values now match the actual paper
- pipeline.py output (Q3_modes.empirical_aggregate_share) now reports {cyborg: 0.60, centaur: 0.14, self_automator: 0.27}. The structural prediction verdict still holds: the synthetic per-task corner distribution (51.6% at the (1, 0) corner) puts the empirical 27% self-automator share INSIDE the predicted range, just as the prior 10% was inside. The corner-mixing structural argument is robust to the empirical correction
- data.mdx §2.3 Q3 prose updated: "60% cyborg / 30% centaur / 10% self-automator" → "60% cyborg / 14% centaur / 27% self-automator (web-verified from HBS d3 writeup)." The prior phrasing "the empirical 60/30/10 distribution" → "the empirical 60/14/27 distribution." Pass-8 retraction note added to §2.3
- React Q3 panel data updated: empirical shares corrected to [0.60, 0.14, 0.27]. PanelHeader claim updated with explicit pass-8 correction note: "Passes 1–7 cited 60/30/10 from a lit-review training-time recall error. The corrected 27% self-automator share is still INSIDE the synthetic prediction range." Comment in code documents the cross-stage propagation source
- Cross-stage error flag: this error originated in the lit-review stage's "Empirical distribution across 244 BCG consultants: ~60% cyborg, ~30% centaur, ~10% self-automator" line. The lit-review stage of this topic should be amended on its next refinement to use the verified 60/14/27 numbers. Held as a note rather than executed (cross-stage refinement during a data-stage pass risks scope creep) — but logged here so it does not get lost
- Methodological consequence: the pass-7 lesson ("execute pipeline.py and grep prose against JSON") is necessary but not sufficient. Pass 8 lesson: also web-spot-check the highest-stakes anchor numbers against primary sources. The 60/30/10 was in a CSV but never matched against the actual paper text; it survived seven passes as "true because cited everywhere internally." Future passes should verify the most-cited cross-stage anchor numbers against primary sources at least once
- Verdict tally and headline findings unchanged in qualitative substance. The Q3 structural prediction (corner-mixing aggregates to a behavioural mode distribution) survives the empirical correction. The model's self-automator (1, 0) corner share of 51.6% still encompasses the empirical 27% share; if anything the corrected 27% is more central inside the synthetic range than the prior 10%
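
The pass-7 audit rule (run the pipeline, then flag any prose number absent from its JSON output) can be sketched as below. This is a minimal illustration, not the audit actually run: the regex, the tolerance, the helper names, and the toy JSON (including the "Q2" and "default_beta" keys) are all assumptions, with values rounded to match this log.

```python
import json
import re

# Hypothetical helper: flags numbers that appear in prose but not in the
# pipeline's JSON output. Regex and tolerance are illustrative assumptions.
NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?")


def collect_numbers(node) -> list[float]:
    """Recursively gather every numeric leaf from parsed JSON."""
    if isinstance(node, bool):  # bool is a subclass of int; skip flags
        return []
    if isinstance(node, (int, float)):
        return [float(node)]
    if isinstance(node, dict):
        return [x for v in node.values() for x in collect_numbers(v)]
    if isinstance(node, list):
        return [x for v in node for x in collect_numbers(v)]
    return []


def unmatched_numbers(prose: str, pipeline_json: str, tol: float = 5e-4) -> list[float]:
    """Return prose numbers with no JSON value within tol."""
    known = collect_numbers(json.loads(pipeline_json))
    found = [float(m) for m in NUMBER_RE.findall(prose)]
    return [x for x in found if not any(abs(x - k) <= tol for k in known)]


# Toy JSON standing in for the real pipeline output.
pipeline_json = json.dumps({
    "Q2": {
        "beta_per_problem_lo": 0.003,
        "beta_per_problem_hi": 0.011,
        "beta_per_session": 0.043,
        "default_beta": 0.05,
        "default_inside_per_problem_bracket": False,
    }
})

stale_prose = "beta in [0.028, 0.113]; default 0.05 inside"
fixed_prose = "beta in [0.003, 0.011] per problem, 0.043 per session; default 0.05"

print(unmatched_numbers(stale_prose, pipeline_json))
print(unmatched_numbers(fixed_prose, pipeline_json))
```

A check like this only covers numbers the pipeline emits. Recalled anchors such as the 60/30/10 split would pass it whenever the CSV carries the same wrong value, which is why pass 8 adds a primary-source spot-check on top.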
Pass 9 2026-05-05

error check (continued cross-stage anchor audit) · truth/accuracy override on bias

Why: Pass 8 caught a primary-source recall error (Randazzo 60/30/10 → 60/14/27) that survived seven passes as "true because cited everywhere internally." Pass 8's methodological lesson said: web-spot-check the highest-stakes cross-stage anchor numbers against primary sources at least once. Pass 9 extended that audit to three more high-citation anchors (Brynjolfsson, Dell'Acqua, Cui) and re-verified Vaccaro. Result: one small error found (Brynjolfsson average gain is +15% in the QJE-published version, not +14% as I had; likely citation drift from NBER WP summaries). Other anchors verified within rounding tolerance.
- productivity_rcts.csv: Brynjolfsson average productivity row corrected from +14% to +15% (the QJE-published number). Sample size 5,172 confirmed correct. Source citation refined from "Brynjolfsson 2023" (NBER WP year) to "Brynjolfsson, Li, Raymond 2025 QJE" with a note that NBER-WP-style summaries sometimes round to 14%
- data.mdx §3 Headline numbers row updated: "+14% avg / +34% novice" → "+15% avg / +34% novice"; citation link refined to QJE-published article
- React component PRODUCTIVITY array: Brynjolfsson avg row updated 14 → 15; source label refined to "Brynjolfsson 2025 QJE"
- Other audited anchors (verified within rounding): Brynjolfsson sample 5,172 ✓; Brynjolfsson novice +34% ✓; Cui +26.08% (I have +26: rounded but accurate); Cui sample 4,867 ✓; Dell'Acqua +40% inside / −19pp outside ✓; Vaccaro 106 studies / 370 effect sizes ✓ (verified twice: pass 1 web-fetch and pass 5 cross-check; pass 9 re-confirmed)
- Pass-9 closing note added to PRD log: marginal value of further audit-only refinement passes is now low. Pattern across 9 passes: pass 1 (first draft), passes 2–3 (major fixes), pass 4 (compression), passes 5–6 (minor structural fixes), passes 7–8 (substantive errors caught: 10× transcription, 60/30/10 recall), pass 9 (small 1pp accuracy fix). Every pass after pass 4 found genuine errors, but the yield dropped sharply in pass 9. Recommend transitioning to stage 5 next; further data-stage audits would be best done as one concentrated audit pass when stage 5 (build) starts using these numbers, rather than spread across more isolated refinement passes
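
One way to make "verified within rounding tolerance" explicit is sketched below. The rule (a cited figure passes if it equals the primary-source value rounded to the cited precision) and the helper name are assumptions; the rows only restate figures already given in this log.

```python
# Hypothetical anchor-audit sketch. The rounding rule is an assumption,
# not the procedure used in pass 9.


def matches_after_rounding(cited: float, source: float, decimals: int = 0) -> bool:
    """True if the source value rounds to the cited value."""
    return round(source, decimals) == round(cited, decimals)


anchors = [
    # (label, cited value, primary-source value)
    ("Brynjolfsson avg gain % (before pass 9)", 14, 15),
    ("Brynjolfsson avg gain % (corrected)", 15, 15),
    ("Cui gain %", 26, 26.08),
    ("Cui sample", 4867, 4867),
    ("Brynjolfsson sample", 5172, 5172),
]

for label, cited, source in anchors:
    status = "ok" if matches_after_rounding(cited, source) else "MISMATCH"
    print(f"{label}: cited={cited} source={source} -> {status}")
```

Under this rule the old +14% row is the only mismatch, while Cui's +26 passes as a rounding of +26.08.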