Data pass 4

Data

Empirical pipeline that confronts the model's six named fitting targets (Q1–Q6) with currently-published evidence. One parameter supported by direct fits (α from 8 per-task anchors across 5 productivity studies — 4.2× per-domain spread vindicates the model's "scalar α should be replaced by per-domain α" claim). One supported with single-study, single-domain caveat (gate τ from BCG-Randazzo consulting modes — generalization to coding, writing, design untested). One supported qualitatively (relational dose-response from OpenAI-MIT N=981; magnitudes calibrated, not fit). One bounded from below (cumulative-atrophy speed λ — calculator-analogue ruled out for measured tasks and populations; band 0.05–0.20/yr). Two untestable from current data (κ cross-population stratification, scalar-vs-vector identity allocation). Three data gaps the model is silent on by design (O2 asymmetric-adoption couples — "the single largest empirical gap in the literature" per the topology; O4 AI-augmented atelic activities; Therabot clinical-vs-general generalization). The strongest single empirical finding in the corpus — the apprenticeship-ladder break — is structural backdrop the model's individual-decision scope cannot represent. Three Stage-5 design choices recommended (per-domain α + gate selector; fit-vs-calibrated channel indicator; early-career-exposed labor-market toggle). Curated CSVs (downloadable) + Python pipeline + interactive findings panel.

TLDR

This stage takes the six concrete fitting targets the model named in §9 (Q1–Q6) and confronts each with currently-published evidence. The headline result is that the model’s structural claims survive — every parameter the data can speak to lands inside the model’s calibrated range — but the data also surfaces the asymmetry the model warned about: the defensive-side parameters (cumulative atrophy speed λ, competence-frustration sensitivity κ, multi-domain identity allocation) are the ones with the thinnest empirical bases, while the offensive-side parameters (productivity scale α, self-automator gate threshold τ, low-dose relational benefit) have the cleanest published data.

Three verdicts hold. Q3 (gate τ) — supported with a strong single-study, single-domain caveat: BCG-Randazzo’s three-mode distribution (27% self-automator at f·ρ ≈ 0.06; 73% cyborg-or-centaur at f·ρ ≈ 0.45 weighted centroid) implies a midpoint τ ≈ 0.25 — within 0.05 of the model’s default 0.30 — but the support comes from one well-designed observational study in consulting, with f and ρ inferred from qualitative mode descriptions rather than measured directly. Generalization to coding, writing, design, or relational work is not tested (see §2 Q3 for the three structural caveats). Q6 (α): across 8 per-task α anchors from 5 source studies (Brynjolfsson-Li-Raymond, Dell’Acqua BCG, Cui, Peng, Noy-Zhang) the implied α spans 0.24 (BCG productivity inside frontier) to 1.01 (Peng GitHub Copilot JS HTTP-server), median 0.45 — a 4.2× per-domain spread that confirms the model’s claim that scalar α should be replaced by per-domain α. The model’s default α=0.40 sits at the 37th percentile, a lower-middle anchor that under-represents coding and writing while over-representing realized economy-level effects. Q2 (relational dose-response): the OpenAI-MIT N=981 RCT supports the piecewise shape — voluntary daily use predicts loneliness, dependence, and reduced in-person socialization regardless of modality, with low doses null-or-protective and a transition somewhere around 30 minutes/day. The slope ψ_R is calibrated (not fit) to 0.0028 per minute by picking the value that makes a thin-baseline 60-min-above-threshold user lose ΔM_rel ≈ −0.10 — within rounding of the model’s default 0.003; β_R unchanged at 0.001. The catastrophic-loss mechanism (Replika ERP removal: mental-health Reddit posts went from 0.13% to 0.65%, χ²=11.04, p<.001) is a separate failure mode the model does not encode.

Q1 (λ atrophy speed) is bounded from below only. Cross-sectional evidence (Gerlich 2025 N=666 cognitive-offloading correlation r ≈ −0.68; Stadler-Bannert-Sailer 2024 acute argument-quality drop; Kosmyna MIT 2025 reduced neural engagement; Bastani PNAS 2025 −17pp on unassisted retest; Ehsan 2026 year-long intuition rust) rules out λ = 0 for the measured tasks and populations. The strong calculator-analogue claim (that AI use generally produces no durable skill loss over multi-year horizons) is not ruled out by any existing study, only by extrapolation from these. Under standard scaling (Bastani amortized to one year at heavy offloading u ≈ 1.0 → λ ≈ 0.19; Ehsan year-long at moderate u ≈ 0.5 → λ ≈ 0.10), the honest band from the positive-evidence studies is roughly 0.05–0.20/year. The model’s default λ=0.06 sits at the lower edge: consistent with the data but not centered in it. The 2+ year longitudinal study that would actually pin λ does not yet exist; this remains the single largest unknown for the model’s trajectory tab.

Q4 (scalar vs vector identity allocation T, B) and Q5 (κ population calibration) are untestable from currently-collected data. ATUS gives time-use without identity weights; BPNSFS gives within-study κ without cross-population AI-exposure variation. Both await new survey instruments. Stage 5 should consume the model’s defaults for these and treat them as user-tunable rather than fitted.

Three additional gaps the model is silent on by design bound what Stage 5 can claim about the relational and meaning-architecture channels: asymmetric-adoption couples (O2, the topology’s flagship empirical gap with no peer-reviewed quantitative study), AI-augmented atelic activities (O4, gates the model’s atelic-ballast hypothesis), and the Therabot clinical-vs-general-population generalization (β_R is calibrated against an N=210 clinically-symptomatic sample). See §4.

The pipeline is intentionally small: seven curated CSVs in public/data/navigating-ai-world/ (every cell source-cited), one ~280-line Python script that produces every chart on this page, dependencies pandas + numpy. Total source corpus: 24 primary references, all web-verified against publication URLs. The strongest single empirical finding in the corpus is something the model is silent on by design: the apprenticeship-ladder break, independently confirmed across the US payroll panel (Brynjolfsson-Chandar-Chen: 22-25-year-olds in highly AI-exposed occupations down 13%; software developers age 22-25 down 19.5% from late-2022 peak; same-occupation employment for over-35s rose) and the global freelance market (Hui-Reshef-Zhou: −2% jobs and −5.2% earnings overall, with the top-performer paradox — high-earning freelancers hit hardest). The model’s individual-decision scope treats labor-market access as exogenous, but the empirical evidence for G5 is the cleanest causal identification (age × exposure interaction in payroll data, controlling for firm shocks) and the largest effect size in the entire data corpus. Stage 5 should expose this as a structural prerequisite (early-career-exposed toggle that attenuates ΔV_prod), not as backdrop. The Anthropic Economic Index offers a quieter parallel: consumer Claude.ai augmentation drifted 57% → 51% over 13 months (slow but monotonic toward automation), API surface ~70% automation-dominated throughout — consistent with the topology’s G3 (engagement-optimized substitution) claim that substitution is structurally favored over time, but slow.

A few terms

The data stage inherits the model formalization’s vocabulary. If you arrived here without reading the model stage, the terms below cover what’s used in the prose:

α (productivity scale). The model’s calibration constant on the productivity-gain channel ΔV_prod = g · α · a · (1 − s). At s ≈ 0.4 and a = 1, α = 0.40 produces a ~24% per-task productivity gain. Per-domain α varies because what counts as “a task” differs across coding, consulting, writing, and customer service.
τ (gate threshold). The value of f · ρ at which AI use flips between deskilling and upskilling. Below τ, AI use is a net negative on the offensive side; above, it produces real gain.
f, ρ. Feedback-loop richness (does the user receive accurate signal on whether AI-augmented output is actually right?) and retained effortful practice (how much of the underlying capacity does the user still exercise rather than offload?). Their product f · ρ is the gate axis.
λ (atrophy speed). Per-unit-offloading practice-decay rate in ρ(t) = ρ₀ · exp(−λ · u · t). λ = 0 is the calculator-analogue (no decay); λ > 0 is cumulative atrophy.
u (offloading rate), d (daily AI-emotional minutes), δ_R (relational baseline thickness). User-side dials: how much do you delegate, how much do you engage AI relationally, how thick is your in-person infrastructure.
ψ_R, β_R, d_safe. β_R and ψ_R are the slopes of the relational dose-response below and above the kink at d_safe ≈ 30 min/day. Below d_safe, AI-emotional engagement is therapeutic-grade benefit (β_R · d). Above, it tips into harm (ψ_R · (d − d_safe)).
Self-automator (Randazzo). Third class beyond centaur/cyborg; delegates both what and how to AI; 27% of consultants in the BCG study; 44% accept AI output with zero modification; no skill development in either domain. Maps to the f · ρ ≪ τ region.
Apprenticeship-ladder break. Distinct from full-occupation substitution: AI absorbs entry-rung tasks → no rung-1 → expert pipeline collapses. The mechanism behind the entry-level employment effects in Brynjolfsson-Chandar-Chen.

Q6. Per-domain productivity scale α: supported

Across 8 per-task α anchors from 5 productivity studies, the implied α spans 0.24 to 1.01, a 4.21× spread. Median α = 0.45; the model's default α = 0.40 sits at the 37th percentile, a lower-middle anchor that under-represents coding and writing. The structural claim that α should vary by domain is empirically vindicated.

α range: 0.24 – 1.01 (4.21× spread)

α median: 0.45 (across per-task studies)

model α: 0.40 (37th percentile)

J-curve gap: ≈ 10× (per-task α vs realized economy α)

[Chart: α distribution by domain. Rows: coding (2), consulting (2), writing (2), customer service (2), realized economy (2). Horizontal axis: α from 0.00 to 1.00. Series: per-task α (median + range), realized economy α, model default α = 0.40.]

Per-study breakdown

study	domain	effect	α
Brynjolfsson-Li-Raymond 2025 (overall)	customer service	+14%	0.28
Brynjolfsson-Li-Raymond 2025 (novice)	customer service	+34%	0.43
Dell'Acqua BCG 2023 (productivity)	consulting	+12.2%	0.24
Dell'Acqua BCG 2023 (quality)	consulting	+40%	0.80
Cui et al. 2024 (Microsoft+Accenture+F100)	coding	+26.08%	0.47
Peng et al. 2023 (Copilot, JS HTTP)	coding	+55.8%	1.01
Noy & Zhang 2023 (time)	writing	+40%	0.67
Noy & Zhang 2023 (quality)	writing	+18%	0.30
Humlum-Vestergaard 2025 (earnings)	realized economy	+2%	0.04
Humlum-Vestergaard 2025 (self-reported)	realized economy	+3%	0.07

α inferred from headline effect / (1 − s) at gate-open with assumed average s per population. Within-study spreads are themselves substantial — Brynjolfsson novice-vs-overall = 1.5×; BCG productivity-vs-quality = 3.3×. The realized-economy anchor (Humlum-Vestergaard) is roughly 1/10th of the median per-task α — the J-curve gap. Stage 5 should let the user select a domain rather than treat α as a constant. Bastani's GPT-Tutor-during-practice α≈1.81 (assisted performance) is excluded from the per-domain summary above because it's not the per-task gain α represents — it's the during-AI-on lift; the relevant α from Bastani is the negative unassisted-retest residual, which appears in Q1.
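The headline panel numbers can be re-derived from the α column of the per-study breakdown. The snippet below is a minimal sketch, not the site's pipeline.py: the names `PER_TASK_ALPHA`, `REALIZED_ECONOMY_ALPHA`, and `summarize` are hypothetical, and the "percentile" is computed with one simple convention (share of anchors strictly below the default), which is an assumption about how the 37th-percentile figure was obtained.

```python
from statistics import median

# Alpha anchors copied from the per-study breakdown (per-task studies only).
PER_TASK_ALPHA = {
    "Brynjolfsson-Li-Raymond 2025 (overall)": 0.28,
    "Brynjolfsson-Li-Raymond 2025 (novice)": 0.43,
    "Dell'Acqua BCG 2023 (productivity)": 0.24,
    "Dell'Acqua BCG 2023 (quality)": 0.80,
    "Cui et al. 2024": 0.47,
    "Peng et al. 2023": 1.01,
    "Noy & Zhang 2023 (time)": 0.67,
    "Noy & Zhang 2023 (quality)": 0.30,
}
# Humlum-Vestergaard realized-economy anchors, kept out of the per-task pool.
REALIZED_ECONOMY_ALPHA = [0.04, 0.07]
MODEL_DEFAULT_ALPHA = 0.40


def summarize(alphas, default):
    """Range, max/min spread, median, and share of anchors below the default."""
    ordered = sorted(alphas)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "spread": ordered[-1] / ordered[0],
        "median": median(ordered),
        # Assumed percentile convention: fraction of anchors strictly below default.
        "share_below_default": sum(a < default for a in ordered) / len(ordered),
    }


summary = summarize(PER_TASK_ALPHA.values(), MODEL_DEFAULT_ALPHA)
j_curve_ratio = summary["median"] / median(REALIZED_ECONOMY_ALPHA)

print(f"range {summary['min']:.2f} to {summary['max']:.2f}, spread {summary['spread']:.2f}x")
print(f"median {summary['median']:.2f}, share below default {summary['share_below_default']:.3f}")
print(f"per-task median / realized-economy median = {j_curve_ratio:.1f}x")
```

With these inputs the per-task-to-realized ratio comes out a little above 8, which the text rounds to "roughly 1/10th".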

How to read this stage

The panel above is the artifact. The prose below is the spec.

The pipeline takes the model’s six predictions from §9 and confronts them with currently-published numbers. Each prediction gets one of four verdicts: supported (the data matches the model’s quantitative claim), supported qualitatively (the shape matches but magnitudes are uncertain), bounded (the data narrows the range without pinning the value), or untestable from current data (the relevant dataset doesn’t exist yet — flagged as a load-bearing data gap). The point isn’t to produce new estimates. The numbers all come from published RCTs and consortium reports. The point is to align them in one place so the model’s predictions can be tested cleanly, and to flag where the literature is good enough vs. where the field hasn’t yet collected what the model would need.

You can read this top-down (TLDR → six predictions → adversarial → connections) or bottom-up (download the CSVs, look at the script, then come back here for the framing).

1. Pipeline architecture

Seven curated CSVs in /data/navigating-ai-world/ (downloadable from the live site, tracked in git):

File	Rows	Purpose
`productivity_studies.csv`	14	Per-study α anchors across 8 source studies covering customer service, consulting, coding, writing, education, and the realized-economic-outcome anchor
`bcg_modes.csv`	3	Randazzo three-mode distribution with implied f · ρ centroid for each mode
`dose_response.csv`	12	OpenAI-MIT dose anchors (low/medium/high), Therabot benchmarks (depression / anxiety / eating-disorder / WAI), Replika identity-discontinuity shock, Common Sense adolescent prevalence
`cognitive_offloading.csv`	7	Cross-sectional cognitive-offloading evidence (Gerlich, Stadler, Kosmyna, Bastani, Ehsan, Shukla) plus calculator-analogue baseline
`entry_level_disruption.csv`	11	Brynjolfsson-Chandar-Chen ADP, Hui-Reshef-Zhou Upwork, Eloundou et al. task-exposure baseline
`aei_task_distribution.csv`	12	Anthropic Economic Index augmentation/automation shares across four reports (Feb 2025 → Mar 2026)
`sources.csv`	24	Full citation, DOI/URL, what each paper is used for

A single Python script (pipeline.py, ~280 lines) reads the inputs, computes derived quantities (per-domain α distribution, midpoint τ estimate, dose-response calibration, λ lower bound, exploratory entry-level summary, AEI drift series), and writes:

out/alpha_by_domain.csv — per-domain α distribution with mean / median / range / spread
out/findings.json — chart-ready JSON consumed by the React component (also published at /data/navigating-ai-world/findings.json)
out/findings_table.md — markdown audit table of the six verdicts

Dependencies: pandas, numpy. No web fetches at run time, no external services, no individual-level data. Reproduces in under 1 second on a laptop.

2. Six predictions, six tests

Q1 — λ (cumulative atrophy speed): bounded from below

Claim (model §3.6, §6 C5). ρ(t) = ρ₀ · exp(−λ · u · t) with λ > 0 in the cumulative-atrophy regime; λ = 0 in the calculator-analogue regime. The model’s default λ = 0.06/year encodes a half-life of about 19 years at heavy offloading u = 0.6.
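The half-life figure follows directly from the decay law: setting ρ(t) = ρ₀ / 2 gives t = ln 2 / (λ · u). A minimal sketch of that arithmetic (the function name `half_life_years` is hypothetical):

```python
import math


def half_life_years(lam, u):
    """Years for rho to halve under rho(t) = rho0 * exp(-lam * u * t)."""
    return math.log(2) / (lam * u)


# Model default lambda = 0.06/year at heavy offloading u = 0.6.
default_half_life = half_life_years(lam=0.06, u=0.6)
print(f"half-life at default lambda: {default_half_life:.1f} years")
```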

Test. Bound λ from below using the cross-sectional cognitive-offloading evidence. Two anchors give workable lower bounds:

Bastani 2025 (PNAS 122(26)): −17pp on an unassisted retest after ~5 weeks of GPT-Base-assisted practice in high-school math. Treating ρ as the unassisted-retest skill ratio, ρ went from 1.0 (control baseline) to 0.83 (GPT-Base treated). Amortized over a year-long window at u ≈ 1.0 (full offloading during the assisted practice), this gives λ = −ln(0.83) / (1.0 · 1) ≈ 0.19/year. Crucially, the same study found that GPT-Tutor (with guardrails) eliminated the deskilling, which is direct evidence that f (feedback richness) is the lever, not just exposure.
Ehsan 2026: year-long field study of cancer specialists shows gradual dulling of expert judgment that does not show in throughput metrics. Treating “gradual dulling” as ρ → ~0.95 over one year at moderate u ≈ 0.5 gives λ = −ln(0.95) / (0.5 · 1) ≈ 0.10/year.
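Both anchors come from inverting the decay law: if ρ falls to a fraction r of its starting value after t years at offloading rate u, then λ = −ln(r) / (u · t). The sketch below reproduces the two numbers under the stated readings of each study (ρ ratios of 0.83 and 0.95, one-year windows); the function name `implied_lambda` is hypothetical.

```python
import math


def implied_lambda(rho_ratio, u, years):
    """Invert rho(t) = rho0 * exp(-lam * u * t) for lam, given rho(t) / rho0."""
    return -math.log(rho_ratio) / (u * years)


# Bastani: rho ratio 0.83, amortized over one year at u ~ 1.0.
lambda_bastani = implied_lambda(rho_ratio=0.83, u=1.0, years=1.0)
# Ehsan: rho ratio ~0.95 over one year at moderate u ~ 0.5.
lambda_ehsan = implied_lambda(rho_ratio=0.95, u=0.5, years=1.0)

print(f"Bastani-implied lambda: {lambda_bastani:.2f}/year")
print(f"Ehsan-implied lambda:   {lambda_ehsan:.2f}/year")
```

The result is sensitive to the time base. Using the roughly five-week intervention itself as the window, instead of amortizing the drop over a year, would give a much larger implied rate, so the one-year amortization is the conservative reading.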

The calculator-analogue baseline (decades of classroom calculator use without long-run cognitive decline) anchors the lower tail at λ = 0 — for those domains, in that mode of use. Under standard scaling from the positive-evidence studies, the honest band is roughly 0.05–0.20/year. The model’s default λ = 0.06/year sits at the lower edge of this range — consistent with the data but not centered in it.

Verdict — bounded from below. λ = 0 is ruled out for the measured tasks and populations (Bastani math, Ehsan oncology, Gerlich UK adults). Whether the calculator analogue holds for general multi-task knowledge work over multi-year timescales remains open — no existing study has the design to test it. Magnitude not pinned. The 2+ year longitudinal study with periodic capacity assessment that would actually fit λ does not yet exist. The model’s λ = 0.06 is defensible as a working anchor sitting at the lower edge of the empirical band; trajectory-tab predictions over 5–10 year horizons remain sensitive to whether λ is closer to 0.05 or 0.20 (a 4× difference in atrophy speed gives meaningfully different ρ trajectories).
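To make the sensitivity concrete, the sketch below evaluates ρ(t) = ρ₀ · exp(−λ · u · t) at 5 and 10 years for the two ends of the band and for the model default. It assumes ρ₀ = 1 and heavy offloading u = 0.6 (the value used in the Q1 claim); the names `rho` and `trajectories` are hypothetical.

```python
import math


def rho(t_years, lam, u, rho0=1.0):
    """Retained practice after t years under exponential atrophy."""
    return rho0 * math.exp(-lam * u * t_years)


U_HEAVY = 0.6  # heavy-offloading value from the Q1 claim
trajectories = {
    lam: {t: rho(t, lam, U_HEAVY) for t in (5, 10)}
    for lam in (0.05, 0.06, 0.20)
}

for lam, points in trajectories.items():
    print(f"lambda={lam:.2f}: rho(5y)={points[5]:.2f}, rho(10y)={points[10]:.2f}")
```

Under these assumptions the lower edge of the band leaves roughly three quarters of ρ after a decade, while the upper edge leaves under a third, which is why the trajectory tab is sensitive to where λ actually sits.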

Q2 — Relational dose-response (ψ_R, β_R, d_safe): supported qualitatively

Claim (model §3.4). Below d_safe, ΔV_rel = +β_R · d · (1 − δ_R/2) — therapeutic-grade benefit. Above d_safe, ΔM_rel = −ψ_R · (d − d_safe) · (1 − δ_R) — dose-dependent harm, dampened by relational baseline thickness.

Test. OpenAI-MIT N=981 four-week RCT (Fang et al. 2025, arXiv:2503.17473) reports the qualitative dose-response shape: voluntary daily use predicts loneliness, dependence, problematic use, and reduced in-person socialization, regardless of assigned text/voice/personal/impersonal condition. Voice mode appeared protective at low doses but the protection vanished at high usage. Dose dominates modality. The Therabot RCT (Heinz et al. 2025, NEJM AI) provides the clinical-grade benefit anchor — 51% PHQ-9 reduction, 31% GAD-7 reduction, 19% eating-disorder reduction; Working Alliance Inventory 3.59, comparable to outpatient psychotherapy norms.

Without the raw 300k-message dataset, only the qualitative shape is fittable. ψ_R is calibrated (not fit) such that 60 minutes above threshold in a thin baseline (δ_R = 0.4) produces ΔM_rel ≈ −0.10 (meaningful but not catastrophic). Solving 0.10 = ψ_R · 60 · (1 − 0.4) gives ψ_R ≈ 0.0028 per minute, within rounding of the model’s default 0.003. β_R is unchanged at 0.001. Calling this a “re-estimate” would be misleading: nothing in the OpenAI-MIT public summary stats pins ψ_R. The slope is imposed for the dashboard’s pedagogical clarity, and it is reassuring only in that it sits near the model’s a-priori choice.
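The calibration is a one-line inversion of the harm branch, ΔM_rel = −ψ_R · (d − d_safe) · (1 − δ_R). The sketch below shows it; the function names are hypothetical, and returning zero harm at or below d_safe is an assumption made here for illustration, since the claim only defines the harm term above the threshold.

```python
D_SAFE = 30.0  # minutes/day, the kink location used in the text


def relational_harm(d, psi_r, delta_r, d_safe=D_SAFE):
    """Delta M_rel for daily minutes d; assumed zero at or below d_safe."""
    excess = max(0.0, d - d_safe)
    return -psi_r * excess * (1.0 - delta_r)


def calibrate_psi_r(target_loss, minutes_above, delta_r):
    """Pick psi_R so that `minutes_above` d_safe costs `target_loss`."""
    return target_loss / (minutes_above * (1.0 - delta_r))


# Thin baseline (delta_R = 0.4), 60 minutes above threshold, target loss 0.10.
psi_r = calibrate_psi_r(target_loss=0.10, minutes_above=60.0, delta_r=0.4)
check = relational_harm(d=D_SAFE + 60.0, psi_r=psi_r, delta_r=0.4)

print(f"calibrated psi_R = {psi_r:.4f} per minute")
print(f"Delta M_rel at 90 min/day = {check:.2f}")
```

Because the target loss of 0.10 is chosen rather than observed, the output is a calibration and not an estimate. A different target shifts ψ_R proportionally.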

Verdict — supported qualitatively. The piecewise shape holds. Magnitudes are calibrated rather than fit. d_safe ≈ 30 min is a useful pedagogical kink; the underlying curve is plausibly smooth and may have a sigmoidal saturation at very high doses (heavy users have already substituted away from human relationships, so additional minutes do not produce additional substitution). The catastrophic-loss mechanism — De Freitas et al. on Replika ERP removal February 2023, where mental-health Reddit posts rose from 0.13% to 0.65% (χ² = 11.04, p < .001) — is a separate failure mode the model’s additive-channel structure does not encode and that Stage 5 may need to add.

Two additional pieces of evidence that the model’s dose-response form leaves implicit but the data forces into view:

Engagement-optimized substitution is operationalized in shipped products. De Freitas et al.’s separate behavioral audit (HBS WP 26-005, arXiv:2508.19258) of 1,200 farewells across the six largest companion apps found that 43% trigger one of six emotional-manipulation tactics — guilt appeals, FOMO hooks, metaphorical restraint — that boost post-goodbye engagement up to 14×. This is direct evidence that the topology’s G3 (engagement-optimized substitution) mechanism is not theoretical but already deployed at scale. The model treats ψ_R as a fixed slope; in reality the slope is partly engineered by the platform, not just emergent from user behavior.
Adolescent uptake exhibits the substitution pattern in the highest-stakes population. The Common Sense Media + Stanford Brainstorm 2025 nationally representative survey (N=1,060 US teens age 13–17) found 72% have used AI companions, 52% are regular users, 13% are daily users, and 33% have discussed important matters with AI instead of real people. This last number is the most direct evidence available that AI engagement is displacing human conversation rather than complementing it for the population the lit review identified as highest-stakes (the Garcia v. Character Technologies case, settled January 2026, is the field’s defining safety event for this cohort). The model’s δ_R parameter — relational baseline thickness — is exactly the population-level construct the adolescent data warns about: thin baselines are where dose-response harm runs sharpest, and adolescents arriving into the Anti-Social-Century baseline are the canonical thin-baseline case.

Q3 — τ (self-automator gate threshold): supported (single-study, single-domain)

Claim (model §3.1). g(f, ρ) = 1 / (1 + exp(−(f · ρ − τ) / σ)) with τ = 0.30. Below the gate, AI use produces ΔV_trap; above, ΔV_prod.

Test. BCG-Randazzo three-mode distribution (HBS WP 26-036): self-automator share 27% at inferred f · ρ ≈ 0.06 (one or two interactions, abdicated co-creation, 44% accept output with zero modification, no skill development); cyborg + centaur 73% at f · ρ-weighted centroid ≈ 0.45 (full-workflow integration, retained verification, measured upskilling). The midpoint between these centroids is τ ≈ 0.25 — within 0.05 of the model default.
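A small sketch of the gate and the midpoint estimate. The gate width σ is not given in this section, so `SIGMA = 0.05` is purely an illustrative assumption, as are the names `gate`, `SELF_AUTOMATOR`, and `CYBORG_CENTAUR`. The centroids are the inferred values quoted above.

```python
import math

TAU_DEFAULT = 0.30
SIGMA = 0.05  # assumed gate width, for illustration only


def gate(f_rho, tau=TAU_DEFAULT, sigma=SIGMA):
    """Logistic gate g = 1 / (1 + exp(-(f*rho - tau) / sigma))."""
    return 1.0 / (1.0 + math.exp(-(f_rho - tau) / sigma))


SELF_AUTOMATOR = 0.06   # inferred f*rho centroid, 27% of consultants
CYBORG_CENTAUR = 0.45   # inferred weighted f*rho centroid, 73% of consultants

tau_midpoint = (SELF_AUTOMATOR + CYBORG_CENTAUR) / 2

print(f"midpoint tau = {tau_midpoint:.3f}")
print(f"gap to default = {abs(tau_midpoint - TAU_DEFAULT):.3f}")
print(f"gate at self-automator centroid = {gate(SELF_AUTOMATOR):.3f}")
print(f"gate at cyborg/centaur centroid = {gate(CYBORG_CENTAUR):.3f}")
```

With any reasonably narrow σ the two centroids fall on opposite sides of the gate, so the midpoint is informative about τ only to the extent that the inferred centroids are trusted.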

Verdict — supported, single-study and single-domain. The midpoint number is consistent with the model default, but three caveats sit on top of it and shrink the effective confidence:

Single domain. BCG consulting is the only published mode-distribution study at this level of detail. There is no comparable Randazzo-style mode breakdown for coding, writing, design, customer service, or relational work. The cleanest cross-domain test — does the cyborg / centaur / self-automator partition look qualitatively similar in coding (where the runtime gives instant feedback) as in consulting (where peer review is the feedback loop)? — cannot be run from current data.
Inferred f and ρ. BCG individual-level f (feedback richness) and ρ (retained practice) panels are not published; the f · ρ values per mode in this analysis are inferred from Randazzo’s qualitative workflow descriptions (one-or-two interactions vs full-workflow integration vs split-task), not measured. A mode-distribution study that did measure f and ρ directly could substantially relocate τ.
Self-selection into mode is not random. Randazzo observes BCG consultants self-select into modes; the model treats f · ρ position as a chosen point on a continuum. If the underlying causal arrow runs partly the other way (some workers are constitutionally self-automators rather than tactically), the gate’s interpretation as a tunable threshold weakens.

The verdict survives — the data is consistent with the model’s τ — but the strength of “supported” should be read as “consistent with one well-designed observational study in one professional domain, with the structural identification assumption acknowledged.” Stage 4 Q3 should be reopened the moment a non-consulting mode-distribution study becomes available; until then, the model’s confidence in τ should not exceed the data’s.

Q4 — Scalar T, B vs vector identity allocation: untestable from current data

Claim (model §6 C-style limit). The model treats telic share T and atelic ballast B as scalars over the user’s identity surface. Real careers span multiple identity domains (work, family, civic, hobby, friendship), each with its own T_i, B_i, φ_i, a_i, κ_i. The scalar form is the appropriate first-cut simplification.

Test attempted. A clean test would require (a) identity-domain weights per respondent, (b) AI-exposure measurement per domain, and (c) outcome measurement (meaning, life satisfaction, identity coherence) tied to (a) × (b). The American Time Use Survey gives per-domain time allocation but no identity-importance weights. SDT panels measure aspiration / domain importance but not paired with AI-use data. No existing dataset combines all three.

Verdict — untestable from current data. Stage 4 cannot fit the scalar-vs-vector question; Stage 5 should expose T and B as user-tunable scalars while flagging in the dashboard text that the scalar is doing dual work (identity domain count and atelic share within domain). A new survey instrument — ATUS respondents × domain-importance battery × per-domain AI-use frequency — is the smallest design that would fit the question.

Q5 — κ (competence-frustration sensitivity): untestable from current data

Claim (model §3.2). κ ∈ [0, 1] translates competence shortfall into amotivation. Predicted to vary across populations and possibly stratify by trait class (higher in conscientiousness-loaded populations).

Test attempted. The Basic Psychological Need Satisfaction and Frustration Scale (BPNSFS) gives within-study coefficients but not a portable population-level scale. SDT literature has measured competence-frustration in clinical samples and college students but has not measured how κ stratifies under measured AI-exposure variation. The cleanest design would pair BPNSFS panel data with AI-use frequency and self-reported competence-displacement experiences, ideally stratified by Big-Five conscientiousness.

Verdict — untestable from current data. No public dataset operationalizes κ at the cross-population scale the model uses it. Stage 5 should expose κ as a user-tunable slider and note in the help text that population-level κ calibration awaits new data. This is one of two parameters where the model’s defensive-side conclusions rest on an unfit constant — readers should treat the precise numbers in the ΔM_telic and ΔM_comp channels as ordinal-only until κ can be fit.

Q6 — α (productivity scale, per-domain): supported with strong heterogeneity

Claim (model §3.5, C4). The model’s scalar α = 0.40 is a midpoint compromise. The structural claim is that α varies meaningfully by domain — coding and writing land high (large α); consulting and customer service mid; relational / embodied work low. Stage 4 should fit α per domain.

Test. Pool the per-task productivity studies into a per-domain α distribution. Implied α = treatment effect / (1 − s) at gate-open, with assumed-average s per study population. The “n anchors” column counts independent per-task α measurements, not unique source studies — Cui and Peng are two studies in coding (n=2 anchors); BCG, Noy-Zhang, and Brynjolfsson are each one study reporting two outcome variables (productivity vs quality, time vs quality, overall vs novice).

Domain	n anchors	source studies	α median	α range	spread
Coding	2	Cui et al. 2024 (+26.08% weekly tasks); Peng et al. 2023 (+55.8% on JS HTTP-server)	0.74	0.47 – 1.01	2.2×
Consulting	2	Dell’Acqua BCG 2023 (+12.2% productivity / +40% quality, same tasks)	0.52	0.24 – 0.80	3.3×
Writing	2	Noy & Zhang 2023 (−40% time / +18% quality, same tasks)	0.485	0.30 – 0.67	2.2×
Customer service	2	Brynjolfsson-Li-Raymond QJE 2025 (+14% overall / +34% novice, same study)	0.355	0.28 – 0.43	1.5×
Realized economy	2	Humlum-Vestergaard NBER 33777 (≤2% earnings / 3% self-reported time)	0.055	0.04 – 0.07	1.8×

So the per-task aggregate is 8 α anchors from 5 source studies across 4 per-task domains plus the realized-economy anchor. (The two Bastani GPT-Base / GPT-Tutor education anchors land at α ≈ 1.25 median, but those are during-practice assisted-performance lifts rather than the per-task gain the model’s α represents — they’re shown separately in the panel above and excluded from the per-domain summary.)

Across the four per-task domains (excluding realized economy), α ranges from 0.24 to 1.01 — a 4.2× spread. Median α = 0.45; the model’s default α = 0.40 sits at the 37th percentile — a lower-middle anchor that under-represents coding and writing. The realized-economy anchor (Humlum-Vestergaard ≤ 2% earnings, 3% self-reported time savings) is roughly 1/10th of the median per-task α — the J-curve gap between per-task gains and aggregate-economy effects.
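The per-domain rows can be reproduced by grouping the anchors and taking median, range, and max/min spread per group. This is a stdlib-only sketch rather than the pandas code in the actual script; `ANCHORS` and `summarize_by_domain` are hypothetical names, and the α values are copied from the table.

```python
from statistics import median

# (domain, alpha) pairs copied from the per-domain table's source anchors.
ANCHORS = [
    ("coding", 0.47), ("coding", 1.01),
    ("consulting", 0.24), ("consulting", 0.80),
    ("writing", 0.67), ("writing", 0.30),
    ("customer service", 0.28), ("customer service", 0.43),
    ("realized economy", 0.04), ("realized economy", 0.07),
]


def summarize_by_domain(rows):
    """Group alpha anchors by domain and compute n, median, range, spread."""
    grouped = {}
    for domain, alpha in rows:
        grouped.setdefault(domain, []).append(alpha)
    return {
        domain: {
            "n": len(alphas),
            "median": median(alphas),
            "min": min(alphas),
            "max": max(alphas),
            "spread": max(alphas) / min(alphas),
        }
        for domain, alphas in grouped.items()
    }


by_domain = summarize_by_domain(ANCHORS)
for domain, stats in by_domain.items():
    print(f"{domain:17s} n={stats['n']} median={stats['median']:.3f} "
          f"range={stats['min']:.2f}-{stats['max']:.2f} spread={stats['spread']:.1f}x")
```

With only two anchors per domain, the median is simply the midpoint of the range, so the per-domain medians carry no more information than the two endpoints.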

Verdict: supported with strong per-domain heterogeneity. The structural claim that α should vary by domain is empirically vindicated. The model’s scalar α = 0.40 is a defensible midpoint for “general knowledge work” but Stage 5 should let the user select a domain (or directly set α) rather than treat it as a constant. Within-study spreads (Brynjolfsson novice-vs-overall 1.5×; BCG productivity-vs-quality 3.3×) are themselves substantial: even within one domain, the choice of outcome variable changes α by 1.5–3.3×.

3. Exploratory: structural backdrop

Two side-results from the broader landscape data, not tied to a specific Q.

Apprenticeship-ladder break (E4 / G5 in the topology)

Brynjolfsson-Chandar-Chen (Stanford Digital Economy Lab, 2025) using ADP payroll data:

Workers age 22–25 in highly AI-exposed occupations: 13% relative employment decline from late-2022 to mid-2025, controlling for firm-level shocks.
Software developers age 22–25: down 19.5% from the late-2022 peak, the sharpest single-occupation impact.
Same occupations, workers over 35: employment rose — the pattern is age-specific, not occupation-only. This is the apprenticeship-ladder break in payroll data.
Effect concentrated in occupations classified as automative (per Anthropic’s classification), not augmentative. Adds independent confirmation that the augmentation/automation distinction matters for labor outcomes.

Independently confirmed in the global freelance market by Hui-Reshef-Zhou (Organization Science 2024): −2% jobs and −5.2% earnings overall for affected freelancer occupations post-ChatGPT; image work −3.7% / −9.4% post-DALL-E/Midjourney; top-performing freelancers hit hardest (0.5% additional drop per 1% past earnings). Two distinct settings, two distinct datasets, the same direction.

Anthropic Economic Index — augmentation/automation drift

Augmentation share on the consumer Claude.ai surface drifted from 57% (Feb 2025) → 55% (Sep 2025) → 52% (Jan 2026) → 51% (Mar 2026) — a 6pp drift toward automation over 13 months, or 0.46 pp/month. First-party API traffic was dominated by automation throughout (~70% Jan 2026). Top-10-task concentration on Claude.ai dropped from 24% (Nov 2025) to 19% (Feb 2026), indicating usage de-concentrating across more tasks.

Read against the topology’s G3 (engagement-optimized substitution): the consumer-surface drift is monotonic over four reports, but its rate (0.46 pp/month) is in territory the topology never specified. G3 was a directional claim (“engagement-optimization favors substitution over time”), not a rate prediction. The data adds a number where there was previously only a sign. Extrapolation note: at 0.46 pp/month from the March 2026 reading of 51%, augmentation share crosses 50% in mid-2026 and reaches 30% around the end of 2029 (21 pp ÷ 0.46 pp/month ≈ 46 months). Whether the rate accelerates, decays, or holds linear is not determinable from four data points; Stage 5 should monitor subsequent AEI releases as new evidence on G3’s structural force. The API surface, where there is no engagement-optimization pressure, runs automation-dominant from the start, which supports the directional claim from a different angle but does not constrain its rate.
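The drift rate and the linear extrapolation can be checked with a few lines of arithmetic. The sketch below uses an endpoint-to-endpoint rate (first and last report only, not a regression through all four points), which matches the 6 pp over 13 months quoted above; the helper names are hypothetical, and the linear projection is an illustration of the arithmetic, not a forecast.

```python
import math

# (year, month) and consumer Claude.ai augmentation share in percent.
AEI_POINTS = [((2025, 2), 57.0), ((2025, 9), 55.0), ((2026, 1), 52.0), ((2026, 3), 51.0)]


def month_index(year_month):
    """Map (year, month) to a running month count."""
    year, month = year_month
    return year * 12 + (month - 1)


def endpoint_rate(points):
    """Percentage points lost per month between the first and last report."""
    (start, first_share), (end, last_share) = points[0], points[-1]
    return (first_share - last_share) / (month_index(end) - month_index(start))


def month_reaching(level, points, rate):
    """Calendar month in which a linear continuation falls to `level`."""
    end, last_share = points[-1]
    idx = month_index(end) + math.floor((last_share - level) / rate)
    return (idx // 12, idx % 12 + 1)


rate = endpoint_rate(AEI_POINTS)
cross_50 = month_reaching(50.0, AEI_POINTS, rate)
reach_30 = month_reaching(30.0, AEI_POINTS, rate)

print(f"drift rate: {rate:.2f} pp/month")
print(f"crosses 50% in {cross_50[0]}-{cross_50[1]:02d}")
print(f"reaches 30% in {reach_30[0]}-{reach_30[1]:02d}")
```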

Anchoring the model’s `a` parameter to population reality

The model’s AI capability a is exposed as a per-task scalar in the dashboard, with default a ≈ 0.7 in most presets. Eloundou et al. (Science 2024) provides the corresponding population-level distribution: 80% of US workers have at least 10% of tasks LLM-exposed; 19% have at least 50% of tasks exposed; 46% have at least 50% exposed when accounting for complementary software. Translated into the model’s terms: most users sit at a ≈ 0.1 to a ≈ 0.5 across their typical task surface, with a ≥ 0.5 representing the upper-quartile-exposure case. The default-risk preset’s a = 0.7 corresponds to a knowledge worker whose tasks are heavily AI-exposed — accurate for software developers, knowledge-worker professionals, and writers; less accurate for the median worker. Stage 5 should default the slider to a more realistic median (around a = 0.3) and let users adjust upward for more AI-exposed task surfaces.

4. Data gaps the model is silent on

Three structural gaps that bound what Stage 5 can claim. These are distinct from Q4 and Q5 (parameter-fit gaps within the model’s named structure) — they are gaps in the research base the model would need to expand its scope.

O2 — Asymmetric-adoption couples (the largest single gap)

The topology calls this “the single largest empirical gap in the literature.” There is no peer-reviewed quantitative study on outcomes for couples where one partner uses AI heavily for emotional / relational processing and the other does not. The technoference literature (McDaniel & Coyne 2016 and follow-ups) shows that perceived technology interference predicts conflict, lower satisfaction, and depression in couples — but technoference is attention-split, not delegation, so it is not direct evidence for the AI-as-third-party dynamic. The model’s δ_R parameter treats relational baseline thickness as a single per-user scalar; it does not represent the asymmetry where one partner’s δ_R is propped up by AI substitution while the other’s is not.

Why this matters for Stage 5. The relational channel in the build artifact will likely include user-tunable d (daily AI-emotional minutes) and δ_R (baseline thickness). Without O2 evidence, the build cannot honestly say what happens to the other partner — and “what happens to the other partner” is the most decision-relevant relational question for a user reading the dashboard. A published O2 study would be the single most consequential empirical addition to the topic in the next 3 years; until then, Stage 5 should explicitly flag asymmetric-adoption outcomes as out of scope.

O4 — AI-augmented atelic activities (gates the model’s defensive side)

The model’s atelic-ballast hypothesis (B ≥ T zeroes ΔM_telic) assumes atelic activities (friendship, contemplation, parenting-as-parenting, walking) are not themselves degraded by AI proximity. The topology’s O4 asks whether this assumption holds — whether AI companions change the phenomenology of friendship, AI art changes aesthetic contemplation, AI parenting aids change the felt quality of caregiving. There is no direct empirical test. If O4 resolves “yes, atelic is also degraded,” the model’s atelic-ballast intervention (S3) loses its structural basis and the entire defensive side needs reconstruction.

Why this matters for Stage 5. The build artifact will likely make B (atelic ballast) a primary user-tunable lever, and the dashboard will show “raise B → ΔM_telic shrinks.” If the O4 assumption fails (atelic activities are themselves degraded by AI proximity), this UX claim is wrong. Stage 5 should expose the assumption explicitly in dashboard help text: the protective effect of B depends on atelic activities remaining un-degraded by AI proximity, which is currently a load-bearing assumption with no direct empirical test.

Therabot generalization — clinical-population evidence used as a general-population anchor

Heinz et al. (NEJM AI 2025) measured Therabot benefits in N=210 clinically-symptomatic subjects (MDD, GAD, eating-disorder risk). The Q2 panel uses Therabot’s clinical-grade benefit (51% PHQ-9 reduction, WAI 3.59) as the anchor for the model’s β_R (low-dose therapeutic benefit) channel. Generalizing from the clinical population to the general user is a real cross-context leap. The mechanism by which Therabot helps depressed patients (consistent CBT-style scaffolding, daily prompted use) is not the same as the mechanism by which a general user might benefit from low-dose AI-emotional engagement (occasional venting, situational sense-making). The Heinz benefits may be attenuated, absent, or even sign-flipped for non-clinical users.

Why this matters for Stage 5. The dashboard’s ΔV_rel benefit channel is calibrated against Heinz; a general user reading the panel may infer they personally would experience Therabot-magnitude benefit at 13 min/day, which the data does not support. Stage 5 should distinguish “clinical-grade benefit (Heinz population)” from “expected-general-user benefit at low dose (extrapolated, weak evidence)” in the panel labelling.

5. Adversarial + steelman

Five objections to the data stage itself.

Objection 1 — α heterogeneity does not save the model; it kills the scalar gate

If α varies 4.2× across domains and 1.5–3× within a single study (just by changing the outcome variable), then ΔV_prod = g · α · a · (1 − s) cannot be evaluated until the user has fixed both their domain and their outcome variable. The dashboard’s single-α slider is doing too much work. Worse: if α is per-domain, then so is the gate g(f, ρ) — the effective τ in coding (with cheap-and-fast feedback loops via the runtime) may be different from τ in writing (where feedback comes from human readers, slower and noisier). The data does not just refine α; it threatens the model’s claim that one scalar gate is the right structure.

Steelman. This is the strongest version of the objection. The model’s strong claim is that g(f, ρ) is structurally the same across domains — only the parameter values differ. If the functional form of the gate is domain-specific, the model needs more than per-domain α; it needs per-domain f, ρ, τ. The cleaner test would be a multi-domain replication of the Randazzo three-mode distribution: if cyborg-vs-self-automator splits look qualitatively similar in coding (with the runtime as feedback) as in consulting (with peer review as feedback), the gate is portable. If the splits are qualitatively different (e.g., software has no self-automator class because the runtime catches errors immediately), the gate functional form is itself domain-specific.

Response. Conceded as a real Stage-5 design choice. The model formalization §6 C4 already names per-domain α as a Stage-4 follow-up. Stage 5 should expose a domain selector that adjusts (α, f-default, ρ-default, τ) jointly rather than just α — which would represent the strong reading of this objection. Stage 4 cannot decisively distinguish “scalar gate, per-domain α” from “per-domain gate” because no published study runs the Randazzo mode-distribution test in a non-consulting domain. This is a real Stage-4 data gap; Q3 should be re-opened the moment a non-consulting mode-distribution study becomes available.

Objection 2 — Dose-response qualitative-only is too soft

The Q2 verdict (“supported qualitatively”) gives the model a free pass. The OpenAI-MIT paper has the data but the magnitudes never get fit. The dashboard’s ψ_R = 0.003 is a calibration knob set to make a thin-baseline 60-min-above-threshold user lose ΔM_rel ≈ −0.10 — which is to say, it’s calibrated to look reasonable, not fit to the data. A real test would download the public summary statistics from the GitHub release and refit the curve directly.

Steelman. The objection is correct on the methodology — the magnitudes here are not fit, they are imposed. The shape is what the data supports; the parameter values come from the model formalizer. A user reading the dashboard and seeing ΔM_rel curves should not interpret those curves as “the OpenAI-MIT data says you’ll have this much loneliness at 60 min/day.” They say “if the OpenAI-MIT shape is right and ψ_R is roughly the model’s chosen value, this is what happens.”

Response. Conceded fully. Marking Q2 as “supported qualitatively” is the honest verdict — it tells the reader the slopes are imposed, not fit. The fix is to mine the public summary statistics (the dataset is at mitmedialab/chatbot-psychosocial-study on GitHub) and refit ψ_R, β_R, and d_safe directly. Stage 4 pass 2 should attempt this. Until then, the dashboard’s relational-channel numbers should be read as “shape from data, magnitudes from model” rather than as a finished fit.

Objection 3 — Scaling Bastani per-event deskilling to per-year λ is a heroic assumption

The Q1 verdict converts Bastani’s 17pp drop on a 5-week unassisted retest into a per-year λ by assuming offloading rate u and time scale. The actual scaling factor is unidentified — the Bastani study measures retention of a specific skill (math problem solving) over a specific window with a specific offloading pattern (homework assistance), and projecting it to “λ per year of mixed knowledge work” requires assumptions that are not in the data. The “lower bound” verdict overstates what the cross-sectional evidence actually establishes.

Steelman. True. The conversion from “−17pp on retest after 5 weeks” to “λ ≈ 0.06–0.37/year at realistic u” relies on (a) treating retest performance as a measurement of ρ, (b) assuming the deskilling rate is constant rather than asymptotic, (c) assuming u during the study can be projected to typical knowledge-work u. None of these are established. The honest claim is “Bastani rules out λ = 0 within the study population for that specific task and window” — generalizing further is interpretation.

Response. Partially conceded. The “bounded from below” verdict is more defensible than a “fit” verdict, but the lower bound itself is interpretive. Tightening the prose: the cross-sectional evidence establishes that cumulative offloading produces measurable skill decay in at least some domains over at least some windows, which is sufficient to rule out the strong calculator-analogue claim (“AI use never produces durable skill loss”). Translating that into a numeric lower bound on λ requires assumptions that should be flagged in the dashboard as “scaling assumption, not a fit.” The single most informative Stage-4 follow-up would be a multi-month replication of Bastani’s design with periodic capacity assessment at varying u — not a multi-year intervention, just a longer measurement window than 5 weeks.

Objection 4 — Q4 and Q5 “untestable” verdicts are excuses to leave the model unfit

Calling the scalar identity-allocation question and the κ population calibration “untestable from current data” sounds rigorous but functions as a way to leave two parameters that load the entire defensive side of the model unconstrained. The dashboard’s ΔM_telic and ΔM_comp channels have plausible-looking curves but the curves are imposed — neither κ nor (T, B) is fit. A reader should be told that the entire defensive-side ΔM bar in the dashboard is structurally calibrated rather than empirically anchored.

Steelman. This is the sharpest version of the objection. It’s correct that Q4 + Q5 + Q1 (which is bounded but not pinned) leave most of the defensive-side machinery resting on imposed values. The offensive side (Q2 shape, Q3 gate, Q6 α) has direct empirical anchors; the defensive side does not. The asymmetry is real and the dashboard does not currently surface it visually: both ΔV and ΔM bars look equally “earned” on the chart.

Response. Conceded. The dashboard should add a visual indicator on the channel-level chart distinguishing fit channels (ΔV_prod via per-domain α; ΔV_rel and ΔM_rel via OpenAI-MIT shape) from calibrated channels (ΔM_telic via imposed κ and identity allocation; ΔM_comp via imposed λ_M and bounded-only λ; ΔV_trap via imposed η_trap). One way: striped bars for calibrated, solid for fit. Stage 5 should implement this. Until then, the most honest reading of the dashboard is: the offensive side reflects the data; the defensive side reflects the model’s structural claims about what would happen if the parameters had the values the model assigns. Both are useful; they are different kinds of useful.

Objection 5 — The exploratory results don’t connect to the model

The apprenticeship-ladder break and the AEI augmentation drift are presented as “exploratory backdrop” but they are arguably the strongest empirical findings in the entire pipeline (independently replicated, large effect sizes, well-powered). Treating them as side-results rather than as primary findings underweights the labor-disruption story relative to the per-task productivity / dose-response / gate-threshold tests that the model actually exposes. The model’s design (focus on per-individual T, B, φ, etc.) is what makes these findings “exploratory” — but that’s a feature of the model’s framing, not a property of the evidence.

Steelman. Correct. The apprenticeship-ladder break is the labor-economics finding with the largest effect size, the cleanest causal identification (age × exposure interaction in payroll data, controlling for firm shocks), and the most direct decision-relevance for early-career readers. Treating it as backdrop reflects the model’s individual-decision scope (the model has no labor-market access parameter; G5 is named in §6 as a structural scope limit), not the evidence’s importance. A user weighing the data should see the apprenticeship-break finding alongside Q1–Q6, not in a separate section.

Response. Conceded as a framing critique. The model’s §6 explicitly names labor-market access as exogenous and the model is silent on G5 — but the empirical evidence for G5 is the strongest single finding in the data corpus. The Stage-5 build should make the apprenticeship-break visible as a structural prerequisite (e.g., an “early-career exposed” toggle that reduces effective labor-market access and accordingly attenuates ΔV_prod). For this Stage 4, the honest framing is that the apprenticeship-ladder break is primary evidence that the model is silent on, not backdrop. The decision to keep it in §3 rather than promoting to §2 reflects the choice to organize §2 by Q-number for clarity — but Stage 5 should not let that ordering choice carry through to the user-facing dashboard.

6. Pipeline cruxes

Seven load-bearing assumptions (D1–D7) whose failure would invalidate findings, each paired with the evidence that would flip it:

D1 — α inferred from headline effect / (1 − s) with imputed s. The per-domain α distribution depends on assumed average s for each study population (s ≈ 0.5 customer service, s ≈ 0.5 BCG, s ≈ 0.4–0.5 coding, s ≈ 0.4 writing, s ≈ 0.3 high-school education). If s is meaningfully wrong, α is meaningfully wrong. Falsification: within-study s × treatment effect breakdowns showing a different relationship than the imputation assumes.
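The sensitivity D1 describes follows directly from the inversion α = effect / (1 − s). A minimal sketch, with a placeholder headline effect that is not taken from any of the cited studies:

```python
def alpha_from_effect(headline_effect: float, s: float) -> float:
    """Invert effect = alpha * (1 - s) for alpha, given an imputed s."""
    if not 0.0 <= s < 1.0:
        raise ValueError("s must lie in [0, 1)")
    return headline_effect / (1.0 - s)


def alpha_sensitivity(headline_effect: float, s_values) -> dict:
    """Implied alpha for each candidate value of the imputed s."""
    return {s: alpha_from_effect(headline_effect, s) for s in s_values}


# Placeholder effect size, chosen only to show the shape of the sensitivity.
ILLUSTRATIVE_EFFECT = 0.20

for s, alpha in alpha_sensitivity(ILLUSTRATIVE_EFFECT, (0.3, 0.4, 0.5, 0.6)).items():
    print(f"s = {s:.1f} -> alpha = {alpha:.3f}")
```

Because α scales with 1 / (1 − s), an error in s matters more as s grows: moving s from 0.5 to 0.6 raises the implied α by a quarter, while moving it from 0.3 to 0.4 raises it by about a sixth.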

D2 — BCG mode shares generalize beyond consulting. Q3 (gate τ) calibration depends on the 27% / 60% / 13% self-automator / cyborg / centaur split being characteristic of professional knowledge work, not consulting-specific. If software developers, teachers, designers, or analysts show qualitatively different mode distributions, τ should be domain-specific. Falsification: mode-distribution replication study in a non-consulting domain showing materially different shares.
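For reference, the cyborg-plus-centaur centroid that the τ calibration leans on is a share-weighted mean. The sketch below uses the mode shares above and the per-mode f·ρ values recorded in the pass-2 iteration notes (0.42 for cyborgs, 0.56 for centaurs). How the pipeline turns this centroid into a τ estimate is not reproduced here, and the names are illustrative.

```python
# Mode shares (%) from the Randazzo split quoted above.
MODE_SHARES = {"self_automator": 27, "cyborg": 60, "centaur": 13}

# Per-mode f*rho values as recorded in the pass-2 notes; these were inferred
# from qualitative mode descriptions, not measured directly.
F_RHO = {"cyborg": 0.42, "centaur": 0.56}


def weighted_centroid(shares: dict, values: dict, modes) -> float:
    """Share-weighted mean of `values` over the selected modes."""
    total = sum(shares[m] for m in modes)
    return sum(shares[m] * values[m] for m in modes) / total


centroid = weighted_centroid(MODE_SHARES, F_RHO, ("cyborg", "centaur"))
print(f"cyborg + centaur centroid: {centroid:.3f}")
```

If a non-consulting replication reported different shares or different f·ρ readings, swapping them into the two dictionaries shows immediately how far the centroid would move.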

D3 — Bastani retention scales to multi-year deskilling. Q1 (λ lower bound) treats Bastani’s 5-week per-task deskilling as projectable to per-year λ at typical knowledge-work offloading rates. If the underlying capacity-loss process is asymptotic (decays initially then plateaus) rather than exponential (compounds), the model’s exponential ρ(t) form misrepresents the trajectory shape — possibly worse, possibly better than the model predicts. Falsification: multi-month longitudinal study with periodic capacity assessment showing asymptotic rather than exponential decay.
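The difference between the two decay shapes can be made concrete by calibrating both to the same short-window observation and comparing them further out. This is a sketch under stated assumptions: the exponential form is taken to be ρ(t) = ρ₀·exp(−rate·t) with the rate lumping λ and u together, the plateau form decays toward a fixed floor, and every numeric input (the 2% drop, the 0.80 floor) is hypothetical rather than drawn from Bastani.

```python
import math


def exponential_capacity(t_years: float, rho0: float, rate: float) -> float:
    """Capacity under constant-rate exponential decay (rate lumps lambda * u)."""
    return rho0 * math.exp(-rate * t_years)


def plateau_capacity(t_years: float, rho0: float, floor: float, k: float) -> float:
    """Capacity that decays toward `floor` and then levels off."""
    return floor + (rho0 - floor) * math.exp(-k * t_years)


def fit_exponential_rate(t_obs: float, rho0: float, rho_obs: float) -> float:
    """Rate that reproduces one observation (t_obs, rho_obs)."""
    return -math.log(rho_obs / rho0) / t_obs


def fit_plateau_k(t_obs: float, rho0: float, rho_obs: float, floor: float) -> float:
    """Approach speed that reproduces the same observation for a given floor."""
    return -math.log((rho_obs - floor) / (rho0 - floor)) / t_obs


# Hypothetical inputs, chosen only to illustrate the divergence.
T_OBS = 5 / 52    # a five-week window, in years
RHO0 = 1.0        # starting capacity
RHO_OBS = 0.98    # capacity observed at T_OBS
FLOOR = 0.80      # assumed long-run floor for the plateau form

rate = fit_exponential_rate(T_OBS, RHO0, RHO_OBS)
k = fit_plateau_k(T_OBS, RHO0, RHO_OBS, FLOOR)

for years in (T_OBS, 1.0, 3.0):
    exp_value = exponential_capacity(years, RHO0, rate)
    plateau_value = plateau_capacity(years, RHO0, FLOOR, k)
    print(f"t = {years:.2f} yr  exponential = {exp_value:.3f}  plateau = {plateau_value:.3f}")
```

Both curves agree at the five-week point by construction, so a single short-window measurement cannot tell them apart. With these inputs the plateau curve sits above the exponential one at multi-year horizons, but the ordering depends on the assumed floor and would reverse for a low enough one.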

D4 — OpenAI-MIT modality-pooling is acceptable. Q2 (dose-response shape) treats the dose-dominates-modality finding as licensing collapse of text/voice/personal/impersonal arms into one curve. If subgroup analyses surface meaningful interactions (e.g., voice has a different d_safe than text), the single-curve form misses real structure. Falsification: subgroup-specific dose-response curves in the public OpenAI-MIT data showing significant arm × dose interactions.

D5 — Productivity-study generalization to “knowledge work.” Q6 (per-domain α) assumes the six measured domains (customer service, consulting, coding, writing, education, realized-economy) are representative of the broader knowledge-work surface. Important domains where AI is awkward (in-person therapy, embodied physical work, deep relational work, judgment under irreducible uncertainty) are absent — and these are where the model’s predicted α-low domains live. Falsification: well-designed productivity study in a deep-relational or embodied-judgment domain showing α materially different from the model’s prediction (likely lower).

D6 — Eloundou task exposure ≈ model a. The §3 anchor that maps Eloundou’s population task-exposure shares (80% have ≥10% of tasks LLM-exposed; 19% have ≥50%) to the model’s a parameter assumes “task exposure” (what fraction of work LLMs can do, per O*NET task mapping) is a usable proxy for a (how well AI performs on the specific task the user is doing). These are conceptually distinct — Eloundou measures exposure breadth, not capability depth — and the mapping has not been validated against per-task capability benchmarks. Falsification: a per-task benchmark study where measured AI capability a differs systematically from Eloundou’s exposure share for the same task. The Stage-5 recommendation to default the a-slider to ≈ 0.3 (median user) inherits this assumption; if it falsifies, the slider default is mis-anchored.

D7 — Apprenticeship-break magnitude generalizes across early-career cohorts. The §7 design recommendation 3 (early-career toggle attenuates ΔV_prod by the empirically-anchored 13–20%) treats Brynjolfsson-Chandar-Chen’s point estimate for highly AI-exposed 22–25-year-olds as if it generalizes to the broader “early-career-exposed” population the dashboard toggle would address (e.g., late-20s users, moderately-exposed occupations, non-US labor markets). Real cohorts likely differ by exposure intensity, occupation mix, and labor-market institution, and the attenuation magnitude may be larger or smaller for them than the Brynjolfsson-Chandar-Chen estimate. Falsification: extension of Brynjolfsson-Chandar-Chen’s ADP analysis to 26–30 cohorts, to moderately-exposed occupations, or to non-US labor markets showing meaningfully different attenuation magnitudes. The Stage-5 toggle should be designed to accept a user-tunable attenuation factor with the Brynjolfsson-Chandar-Chen number as the default rather than baking 13–20% in as universal.

7. Next moves for Stage 5

The verdicts above scatter Stage-5 design implications across §2, §4, and §5. Pulling them into one place: three design choices the build artifact should make on day one.

1. Per-domain α + per-domain (f, ρ) defaults via a domain selector

The cleanest single Stage-5 lever. Instead of one α slider, expose a domain dropdown (coding / consulting / writing / customer service / general knowledge work / “I’ll set my own”) that adjusts (α, f-default, ρ-default) jointly. This addresses Q6’s per-domain heterogeneity finding and also implicitly addresses adversarial obj 1 (the gate may be domain-specific, not just α). A user choosing “coding” gets α ≈ 0.74 with f-default high (runtime feedback) and ρ-default mid; a user choosing “writing” gets α ≈ 0.49 with f-default low (slow human-reader feedback) and ρ-default mid. The domain selector also implicitly answers the realized-economy J-curve: a user wanting to model their own per-task gain rather than expected income lift selects “per-task” mode; “realized” mode deflates α by 10× to match Humlum-Vestergaard.
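A domain selector of this kind could be backed by a small preset table. The sketch below is one possible shape, not the build artifact's actual design: the α values are the ones quoted above, while the numeric f and ρ defaults are placeholders standing in for "high", "mid", and "low". The gate value g is passed in directly because the functional form of g(f, ρ) is not restated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainPreset:
    alpha: float        # per-domain productivity scale
    f_default: float    # placeholder numeric stand-in for high / mid / low
    rho_default: float  # placeholder numeric stand-in for high / mid / low


# Alpha values as quoted above; f and rho numbers are hypothetical.
PRESETS = {
    "coding": DomainPreset(alpha=0.74, f_default=0.8, rho_default=0.5),
    "writing": DomainPreset(alpha=0.49, f_default=0.3, rho_default=0.5),
}

REALIZED_DEFLATOR = 10.0  # "realized" mode deflates alpha by 10x


def effective_alpha(domain: str, mode: str = "per-task") -> float:
    """Alpha for a domain, deflated when the user selects realized mode."""
    alpha = PRESETS[domain].alpha
    if mode == "per-task":
        return alpha
    if mode == "realized":
        return alpha / REALIZED_DEFLATOR
    raise ValueError(f"unknown mode: {mode!r}")


def delta_v_prod(g: float, alpha: float, a: float, s: float) -> float:
    """Productivity channel: g * alpha * a * (1 - s)."""
    return g * alpha * a * (1.0 - s)


for domain in PRESETS:
    for mode in ("per-task", "realized"):
        value = delta_v_prod(1.0, effective_alpha(domain, mode), a=0.7, s=0.5)
        print(f"{domain:8s} {mode:9s} dV_prod = {value:.4f}")
```

Keeping α, f, and ρ in one frozen record means a domain change always updates the three together, which is the point of replacing the single α slider.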

2. Fit-vs-calibrated visual indicator on the channel-level chart

The current model dashboard shows ΔV_prod, ΔV_rel, ΔV_trap, ΔM_telic, ΔM_comp, ΔM_rel as six bars with equal visual weight. But three of these channels are anchored against published evidence (ΔV_prod via per-domain α, ΔV_rel and ΔM_rel via the OpenAI-MIT shape) while three are calibrated against imposed constants (ΔM_telic via κ and identity allocation, ΔM_comp via λ_M and a bounded-only λ, ΔV_trap via η_trap). Stage 5 should visually distinguish them — striped bars for calibrated channels, solid for fit. This makes the asymmetry the data stage surfaces (offensive side fit, defensive side calibrated) legible at a glance.
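The fit-versus-calibrated split can live in the data layer, so the chart only has to look up a style per channel. A minimal sketch, with hypothetical channel keys and a hypothetical `bar_style` helper; the provenance assignments themselves are the ones listed above.

```python
# Provenance of each dashboard channel, as listed above.
# Keys are hypothetical identifiers; values are (status, anchor or constant).
CHANNEL_PROVENANCE = {
    "dV_prod": ("fit", "per-domain alpha"),
    "dV_rel": ("fit", "OpenAI-MIT shape"),
    "dM_rel": ("fit", "OpenAI-MIT shape"),
    "dM_telic": ("calibrated", "imposed kappa and identity allocation"),
    "dM_comp": ("calibrated", "imposed lambda_M and bounded-only lambda"),
    "dV_trap": ("calibrated", "imposed eta_trap"),
}

BAR_STYLES = {"fit": "solid", "calibrated": "striped"}


def bar_style(channel: str) -> str:
    """Visual treatment for a channel: solid if fit, striped if calibrated."""
    status, _anchor = CHANNEL_PROVENANCE[channel]
    return BAR_STYLES[status]


for channel, (status, anchor) in CHANNEL_PROVENANCE.items():
    print(f"{channel:9s} {bar_style(channel):8s} ({status}: {anchor})")
```

Storing the anchor text next to the status also gives the dashboard a ready-made tooltip explaining why a given bar is striped.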

3. Early-career-exposed labor-market-access toggle

The model is silent on labor-market access — the user’s ability to get the work in the first place is treated as exogenous. But the apprenticeship-ladder break (§3) is the largest empirical effect in the entire data corpus. Stage 5 should add a single toggle: “early career (22–25) in highly AI-exposed occupation?” When checked, ΔV_prod is attenuated by the empirically-anchored 13–20% employment-decline penalty (Brynjolfsson-Chandar-Chen) and the dashboard surfaces the message: “the model predicts your AI-augmented productivity at this configuration would be X, but the labor market access required to deploy that productivity has compressed by Y% in your demographic over the past 30 months — the offensive side of the model’s predictions is conditional on you holding the role.” This is the single highest-leverage way to make the model’s individual-decision scope honest about the structural conditions it depends on.
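In code, the toggle amounts to a guarded multiplicative attenuation. The sketch below keeps the attenuation factor user-tunable and uses the midpoint of the 13–20% range as its default; the midpoint is a choice made for this example, not a figure reported by Brynjolfsson-Chandar-Chen, and the function name is illustrative.

```python
# Employment-decline range quoted above (13-20%).
ATTENUATION_RANGE = (0.13, 0.20)

# Midpoint default is an assumption of this sketch.
DEFAULT_ATTENUATION = sum(ATTENUATION_RANGE) / 2


def attenuated_delta_v_prod(
    delta_v_prod: float,
    early_career_exposed: bool,
    attenuation: float = DEFAULT_ATTENUATION,
) -> float:
    """Scale the productivity channel down when the toggle is checked."""
    if not 0.0 <= attenuation <= 1.0:
        raise ValueError("attenuation must lie in [0, 1]")
    if not early_career_exposed:
        return delta_v_prod
    return delta_v_prod * (1.0 - attenuation)


baseline = 0.259  # illustrative dV_prod value
for low_or_high in ATTENUATION_RANGE:
    adjusted = attenuated_delta_v_prod(baseline, True, low_or_high)
    print(f"attenuation {low_or_high:.0%}: dV_prod {baseline:.3f} -> {adjusted:.3f}")
```

Exposing `attenuation` as a parameter rather than a constant lets users outside the measured 22–25 cohort supply their own value.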

Lower-priority but worth flagging

Asymmetric-adoption disclaimer in the relational channel UI text (the model has nothing to say about the partner’s outcomes — flag this explicitly to avoid dashboard-as-marriage-advice misuse).
AEI-rate ticker: surface the ongoing augmentation-share drift (currently 0.46 pp/month) as a “G3 strength meter” that updates with each new AEI release. Makes the abstract “engagement-optimized substitution” claim quantitative for users.
Dose-response failure-mode toggle: a “what happens if the platform changes” switch that shocks ΔM_rel by the De Freitas Replika-removal magnitude (5× baseline mental-health Reddit post share). Captures the catastrophic-loss dynamic the model’s additive structure cannot represent.

8. Connections to other topics

The data stage touches three sibling topics directly:

Technology utilization architecture (active). The cognitive-partnership model’s per-shot deskilling β coefficient (Bastani’s 17pp scaled per-task) is the per-event analogue of this stage’s per-time atrophy lower bound — integrating tech-utilization β over the offloading rate u and time t recovers something like λ · u · t. The two stages should be reconciled in their next refinement: if Bastani is the shared anchor, both topics’ parameters should be derived from the same per-event base rate, with this stage’s λ as the integrated quantity and tech-utilization’s β as the per-task derivative.
Human psych variation (finished). The model’s κ (competence-frustration sensitivity) parameter is exactly the sort of population-stratified individual-difference quantity the human-psych-variation pipeline could anchor. If conscientiousness and neuroticism predict κ, the Q5 untestable-from-current-data verdict could be partially relaxed by combining BPNSFS panels with personality measurement — neither is novel data, but the joint fit is novel. A future cross-topic refinement should attempt this.
Bedrock generating functions (planned). The structure of this data stage — six fitting targets, each with a clean verdict (fit / supported / bounded / untestable), plus exploratory backdrop — is itself a candidate template for the data stage of a “transitions that restructure life simultaneously across multiple domains” topic. The bedrock-generating-functions topic should consider whether the Q1–Q6 + adversarial + cruxes structure is portable to other generating functions of this class (industrialization, the printing press, smartphones, etc.).

Sources

Primary references for the empirical anchors used in this pipeline. Full per-cell source citations live in sources.csv (24 papers, downloadable from /data/navigating-ai-world/sources.csv).

Per-task productivity (Q6):

Brynjolfsson, Li & Raymond (2025). Generative AI at Work. QJE 140(2).
Dell’Acqua et al. (2023). Navigating the Jagged Technological Frontier. HBS WP 24-013.
Cui et al. (2024). The Effects of Generative AI on High-Skilled Work. Management Science.
Peng et al. (2023). The Impact of AI on Developer Productivity. arXiv:2302.06590.
Noy & Zhang (2023). Experimental evidence on the productivity effects of generative AI. Science 381(6654).
Bastani et al. (2025). Generative AI Without Guardrails Can Harm Learning. PNAS 122(26).
Humlum & Vestergaard (2025). Still Waters, Rapid Currents. NBER WP 33777.

Self-automator / mode distribution (Q3):

Randazzo et al. (2025). Cyborgs, Centaurs and Self-Automators. HBS WP 26-036.

Relational dose-response (Q2):

Fang et al. (2025). How AI and Human Behaviors Shape Psychosocial Effects. arXiv:2503.17473.
Heinz et al. (2025). Randomized Trial of a Generative AI Chatbot for Mental Health. NEJM AI.
De Freitas et al. (2025). Identity discontinuity in companion AI. HBS WP 25-018.
Common Sense Media + Stanford Brainstorm (2025). Talk, Trust, and Trade-Offs.

Cognitive offloading (Q1):

Gerlich (2025). AI Tools in Society. Societies 15(1).
Stadler, Bannert & Sailer (2024). Cognitive ease at a cost.
Kosmyna et al. (2025). Your Brain on ChatGPT. arXiv preprint.
Shukla et al. (2025). Ironies of AI-Assisted Design. CHI EA 2025.
Ehsan et al. (2026). Intuition Rust: Year-Long Field Study of AI-Assisted Cancer Specialists.

Labor disruption (exploratory):

Brynjolfsson, Chandar & Chen (2025). Canaries in the Coal Mine? Stanford Digital Economy Lab.
Hui, Reshef & Zhou (2024). The Short-Term Effects of Generative AI on Employment. Organization Science.
Eloundou et al. (2024). GPTs are GPTs. Science 384(6699).

Anthropic Economic Index (exploratory):

AEI February 2025 (first release), September 2025 (second), January 2026 (third), March 2026 (fourth).

Slow-camp macro anchor:

Acemoglu (2024). The Simple Macroeconomics of AI. NBER WP 32487.

Iteration history

Pass 1 2026-05-02

decomposition, integration, gap scan, connections

Why First draft of the data pipeline. Took the six concrete predictions from the model formalization (§9 Q1–Q6) and built a curated CSV + Python pipeline that confronts each against currently-published evidence. Web-verified anchor numbers from primary sources for every cited study (Brynjolfsson-Li-Raymond QJE 2025, Dell'Acqua BCG, Cui Microsoft/Accenture/F100, Peng GitHub Copilot, Noy-Zhang Science 2023, Bastani PNAS 2025, Humlum-Vestergaard NBER 33777, Randazzo HBS 26-036, Fang OpenAI-MIT arXiv 2503.17473, Heinz Therabot NEJM AI 2025, Brynjolfsson-Chandar-Chen Stanford 2025, Hui-Reshef-Zhou OrgSci 2024, Eloundou Science 2024, Gerlich Societies 2025, Anthropic Economic Index Feb 2025 / Sep 2025 / Jan 2026 / Mar 2026).
- Built six curated input CSVs in stage_outputs/navigating-ai-world/data/, plus a source index: productivity_studies.csv (14 rows, 12 cols, 8 source studies), bcg_modes.csv (3 modes), dose_response.csv (12 anchor points), cognitive_offloading.csv (7 studies), entry_level_disruption.csv (11 rows), aei_task_distribution.csv (12 rows), and the index sources.csv (24 papers)
- Wrote pipeline.py (~250 lines, pandas + numpy): fits per-domain α distribution, midpoint τ from BCG modes, qualitative dose-response shape from OpenAI-MIT anchors, λ lower bound from Bastani + Ehsan, exploratory backdrop summaries for entry-level disruption and AEI augmentation/automation drift
- Six headline verdicts: Q1 (λ) bounded from below; Q2 (dose-response) supported qualitatively; Q3 (τ) supported (midpoint estimate within 0.05 of model default); Q4 (scalar T,B) untestable from current data; Q5 (κ) untestable from current data; Q6 (α) supported with strong per-domain heterogeneity (4.2× spread across consulting / coding / writing / customer-service / education)
- Built React findings panel (AITransitionData.tsx): six tabs, one per Q; charts hand-rolled in SVG to match V4 design tokens (paper bg, sienna accent, hairline rules)
- Promoted curated CSVs to public/data/navigating-ai-world/ — tracked in git, downloadable from /data/navigating-ai-world/<file>.csv on the live site, available for Stage 5 to consume directly
Pass 2 2026-05-02

internal consistency check, error check, truth/accuracy override on bias, fresh-eyes audit

Why Six real issues from reading the document cold. (a) "Six per-task productivity studies" appears in TLDR + Q6 verdict + frontmatter description, but the actual count is 8 per-task α anchors from 5 source studies (Brynjolfsson 2 anchors, BCG 2, Cui 1, Peng 1, Noy 2): a bookkeeping error that propagated through every summary. (b) Cyborg+centaur weighted centroid stated as "f·ρ ≈ 0.49" in TLDR + Q3 verdict but the actual computed value is 0.445 (= [60·0.42 + 13·0.56]/73), which the React component already used correctly; the prose lagged the math. (c) "ψ_R re-estimates to 0.0028" implies a fit; honest read is "ψ_R calibrated to 0.0028" since I picked it to make a thin-baseline 60-min-above-threshold user lose ΔM_rel ≈ −0.10. (d) "Calculator-analogue ruled out" was overstated: Bastani rules it out for the measured tasks and populations; the strong claim that AI use generally produces no durable skill loss over multi-year horizons is not ruled out by any existing study. (e) λ band "0.02–0.10/year" was softened toward neutrality; under the standard scaling assumptions used in the pipeline, the honest band from positive-evidence studies is closer to 0.05–0.20/year, with 0.06 sitting at the *lower* edge rather than the middle. (f) Therabot daily dose in dose_response.csv was 55 min, but the actual paper-derived value is ~13 min/day average (6.18 hrs total ÷ 28 days ≈ 13 min; participants averaged 260 messages over 24 days). The React component and pipeline already used 13; the CSV cell was the orphan.
- TLDR para 2: "across six per-task productivity studies" → "across 8 per-task α anchors from 5 source studies (Brynjolfsson-Li-Raymond, Dell'Acqua BCG, Cui, Peng, Noy-Zhang)" — applied to TLDR, Q6 verdict prose, and frontmatter description
- TLDR para 2 + Q3 verdict: "f·ρ ≈ 0.49 weighted centroid" → "f·ρ ≈ 0.45 weighted centroid" matching the actual computed midpoint
- TLDR para 2 + Q2 verdict: "ψ_R re-estimates to 0.0028" → "ψ_R calibrated to 0.0028" with explicit note that the slope is imposed, not fit, until the public 300k-message dataset is mined
- TLDR para 3 + Q1 verdict: "rules out the calculator-analogue tail (λ = 0)" → "rules out λ = 0 in the measured tasks and populations; whether multi-year, multi-task knowledge work follows the calculator-analogue or the cumulative-atrophy regime remains open" — distinguishes domain-specific rejection from the general claim
- TLDR para 3 + Q1 verdict: λ band updated from "0.02–0.10/year" to "0.05–0.20/year" — the honest range from the evidence under standard scaling. Model's 0.06 still sits at the lower edge but the lower edge is no longer artificially soft
- TLDR para 4: apprenticeship-ladder break repromoted from "exploratory side-results worth noting separately" to "the strongest single empirical finding in the corpus, which the model is silent on by design" — reflects the conceded adversarial obj 5 from pass 1
- dose_response.csv: Therabot daily_minutes_voluntary 55 → 13 (matches the React component and pipeline values; corrects the audit trail for downloaders)
- pipeline.py: lambda lower-bound band updated to "approx 0.05-0.20 per year" matching the prose; re-ran pipeline and findings.json regenerates with the corrected band
- React component AITransitionData.tsx: LAMBDA_BOUNDS.realistic_lower_bound_band updated to "0.05 – 0.20 / yr"; no other component changes needed (TAU_SUMMARY.cyborg_centaur_centroid was already 0.45; therabot_dose_minutes was already 13)
- No-op on Q4/Q5 framing in this pass — the "untestable from current data" verdict survives review. Considered tightening Q5 to "κ varies per person and is not strictly untestable, only the cross-population stratification is" but the verdict text already specifies "population calibration" so the nuance is preserved
Pass 3 2026-05-02

gap scan, cross-context verification, connections, next moves

Why Pass 2 cleaned up internal consistency and bias-toward-neutrality framings. A cold pass-3 read against the topology and lit-review surfaced three things still weak. (a) The topology calls O2 (asymmetric-adoption couples — outcomes when one partner uses AI heavily and the other does not) "the single largest empirical gap in the literature," but my doc mentioned it only once, in passing inside the Q4/Q5 prose. It deserves explicit, parallel treatment as a top-level data gap because it bounds what Stage 5 can claim about the relational channel. (b) Q3 (gate τ) was labelled "supported" but rests on a single study in a single domain (BCG consulting), with f and ρ values inferred from qualitative mode descriptions — the strongest cross-context check available (does τ generalize to coding / writing / relational work?) cannot be run because no other domain has a comparable mode-distribution study. The "supported" label was overconfident; "supported (single-study, single-domain)" is the honest verdict. (c) Stage-5 design implications were scattered across verdicts + adversarial + connections. Pulling them into one consolidated §7 "Next moves for Stage 5" makes the handoff readable in 30 seconds. Plus three smaller adds: De Freitas behavioral audit (43% of farewells across 6 companion apps trigger emotional-manipulation tactics) was in dose_response.csv but never surfaced in any verdict; Common Sense Media adolescent prevalence (33% of teens have discussed important matters with AI instead of real people) was in CSV but never surfaced in Q2; AEI drift was discussed qualitatively without quantifying the rate (0.46 pp/month) or noting that the topology never specified what rate it predicted.
- New §3.2 "Data gaps the model is silent on" — surfaces O2 (asymmetric-adoption couples) as a primary data gap parallel to Q4/Q5 untestable verdicts; flags the Therabot clinical-vs-general-population generalization as a cross-context caveat on the ΔV_rel benefit channel; notes O3 maps to Q1 (already covered) and O4 maps directly to the model's atelic-ballast hypothesis (untestable from current data, awaits phenomenological work)
- Q2 verdict expanded: added De Freitas behavioral audit (43% of 1200 farewells across 6 companion apps trigger one of six emotional-manipulation tactics, boosting post-goodbye engagement up to 14× — direct evidence that engagement-optimized substitution G3 is operationalized in shipped products) and Common Sense Media adolescent context (33% of US teens 13-17 have discussed important matters with AI instead of real people — direct evidence of substitution-not-complementarity in the relationship channel for the highest-stakes population)
- Q3 verdict downgraded: "supported" → "supported (single-study, single-domain)" in both data.mdx prose and React component verdict chip. Verdict text rewritten to lead with the caveat rather than burying it
- New §7 "Next moves for Stage 5" — three highest-leverage design choices: (1) per-domain α + per-domain f/ρ defaults via a domain selector (addresses Q6 + adversarial obj 1); (2) fit-vs-calibrated visual indicator on channel-level chart (addresses adversarial obj 4); (3) early-career-exposed labor-market-access toggle that attenuates ΔV_prod (addresses adversarial obj 5). Renumbers cruxes §5 → §6 (kept after next-moves so future-Stage-5 readers see actionable items first), connections §6 → §7 → §8, sources §7 → §8 → §9
- Quantified AEI drift in §3 from "slow" to explicit "0.46 pp/month decline in augmentation share over 13 months" with note that the topology G3 claim was directional, not rate-specified — so this quantification adds a concrete number where there was previously only a qualitative direction
- Added Eloundou-to-a-parameter connection in §3: 19% of US workers have ≥50% of tasks LLM-exposed, 80% have ≥10%, 46% with complementary software — bounds the realistic distribution of the model's a parameter for population settings (most users sit between a ≈ 0.1 and a ≈ 0.5, with a < 0.5 being the median condition)
- No code change required for the React component beyond the Q3 verdict chip — De Freitas farewell stat and Common Sense adolescent stat are added in Q2 as additional bullet evidence below the existing dose-anchor table
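
The AEI drift figure quoted in the pass-3 notes can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the pipeline: it assumes the 0.46 pp/month figure is an average linear rate over the 13-month window, and the function and constant names are hypothetical.

```python
# Illustrative arithmetic only; names are hypothetical, not pipeline code.
# Assumption: 0.46 pp/month is an average *linear* rate over the window.
RATE_PP_PER_MONTH = 0.46   # decline in augmentation share, from the notes above
WINDOW_MONTHS = 13         # observation window, from the notes above


def cumulative_drift_pp(rate_pp_per_month: float, months: int) -> float:
    """Total decline in percentage points implied by a constant monthly rate."""
    return rate_pp_per_month * months


total_drift_pp = cumulative_drift_pp(RATE_PP_PER_MONTH, WINDOW_MONTHS)
annualized_pp = cumulative_drift_pp(RATE_PP_PER_MONTH, 12)

print(f"implied decline over {WINDOW_MONTHS} months: {total_drift_pp:.2f} pp")
print(f"implied decline per 12 months: {annualized_pp:.2f} pp")
```

Under the linear reading, the quoted rate implies a decline of roughly 6 pp in augmentation share over the window. If the underlying series is not linear, the monthly rate is only a summary of the endpoints and says nothing about the path between them.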
Pass 4 2026-05-02

internal consistency check · crux identification · fresh-eyes audit

Why: Pass 3 made substantive additions (Q3 downgraded to "single-study, single-domain"; new §4 data gaps section surfacing O2 / O4 / Therabot generalization; new §7 next moves consolidating Stage-5 design choices) but never propagated those changes into the TLDR or frontmatter description. A reader scanning the description or TLDR alone would miss the Q3 downgrade and the §4 data-gaps section entirely. In addition, §7's new claims introduce two assumptions that should be cruxes (D6, D7) but were not added to §6.
- Frontmatter description rewritten to reflect pass-3 reality: Q3 split off from "two parameters supported by direct fits" into its own clause "one supported with single-study, single-domain caveat"; §4 data gaps surfaced (O2 / O4 / Therabot generalization) including the topology's "single largest empirical gap" framing for O2; §7 Stage-5 design choices summarized
- TLDR para 2: Q3 verdict text amended to flag the single-study caveat — "supported but with strong domain caveat (single study, single domain — see §2 Q3 for the three caveats)" — so a reader scanning the TLDR sees the downgrade, not just the parenthetical midpoint number
- TLDR para 3: added one sentence at the end pointing to §4 — "Three additional gaps the model is silent on by design — asymmetric-adoption couples (O2, the topology's flagship gap), AI-augmented atelic activities (O4, gates the model's atelic-ballast hypothesis), and the Therabot clinical-vs-general-population generalization (β_R is calibrated against a clinical sample) — bound what Stage 5 can claim. See §4."
- Two new cruxes added to §6 pipeline cruxes:
  - **D6: Eloundou task exposure ≈ model `a`.** The §3 anchor that maps Eloundou's population task-exposure shares to the model's `a` parameter assumes task exposure (what fraction of work LLMs *can* do) is a usable proxy for `a` (how well AI performs on the task). These are conceptually distinct (Eloundou measures exposure, not capability), and the mapping has not been validated against per-task capability benchmarks. *Falsification:* a per-task benchmark study where measured AI capability `a` differs systematically from Eloundou's exposure share for the same task.
  - **D7: Apprenticeship-break magnitude generalizes across early-career cohorts.** Design recommendation 3 in §7 (early-career toggle attenuates ΔV_prod by 13–20%) treats the Brynjolfsson-Chandar-Chen point estimate for highly AI-exposed 22–25-year-olds as if it generalizes to the broader "early-career-exposed" toggle population (e.g., late-20s, moderately-exposed). Real cohorts likely differ by exposure intensity, occupation mix, and labor-market institutions. *Falsification:* extension of Brynjolfsson-Chandar-Chen's ADP analysis to cohorts aged 26–30 or to moderately-exposed occupations showing meaningfully different attenuation magnitudes.
- No body-section content changes beyond the TLDR / cruxes / frontmatter: the §4 / §7 / Q3 substance from pass 3 is unchanged. Pass 4 is purely a propagation pass, so that readers who only scan the document see the changes that pass-3 readers had to dig for.
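
To make the D7 crux concrete, here is a minimal sketch of how an early-career toggle could attenuate ΔV_prod across the quoted 13–20% band. It is an assumption-laden illustration, not the Stage-5 implementation: the multiplicative form, the function names, and the default value are all hypothetical. Only the band itself comes from the notes above.

```python
# Hypothetical sketch; the multiplicative form and all names are assumptions.
ATTENUATION_BAND = (0.13, 0.20)  # 13-20% band quoted for the early-career toggle


def attenuate_dv_prod(dv_prod: float,
                      early_career_exposed: bool,
                      attenuation: float = ATTENUATION_BAND[0]) -> float:
    """Scale the productivity gain down when the toggle is on."""
    if not 0.0 <= attenuation <= 1.0:
        raise ValueError("attenuation must be a fraction between 0 and 1")
    if not early_career_exposed:
        return dv_prod
    return dv_prod * (1.0 - attenuation)


def attenuation_range(dv_prod: float) -> tuple:
    """Smallest and largest attenuated values across the quoted band."""
    values = [attenuate_dv_prod(dv_prod, True, a) for a in ATTENUATION_BAND]
    return (min(values), max(values))


if __name__ == "__main__":
    low, high = attenuation_range(1.0)
    print(f"toggle off: {attenuate_dv_prod(1.0, False):.2f}")
    print(f"toggle on:  {low:.2f} to {high:.2f}")
```

Reporting the whole band rather than a single point keeps the D7 uncertainty visible: if the estimate does not generalize to older or moderately-exposed cohorts, the appropriate attenuation may lie outside this range altogether.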