{
  "model_defaults": {
    "alpha": 1.0,
    "epsilon": 0.15,
    "beta": 0.05,
    "M": 0.08
  },
  "Q1_cups": {
    "copilot_specific_total_pct": 51.5,
    "copilot_specific_total_sd": 19.3,
    "non_copilot_pct": 48.5,
    "verify_share_pct": 22.4,
    "verify_share_sd": 12.97,
    "write_share_pct": 14.05,
    "epsilon_lower_bound_at_u1": 0.04,
    "epsilon_default": 0.15,
    "phi_cyborg_estimate": 1.59,
    "phi_default": 0.3,
    "verdict": "supported_qualitatively",
    "interpretation": "Mozannar reports 51.5% (SD 19.3pp) of session time is Copilot-specific in the cyborg regime \u2014 strong qualitative confirmation of L1 (substitution myth). Pure-verify share is 22.4% (SD 13.0pp); writing-new is 14.1%. Cyborg-regime \u03c6 \u2248 1.59 \u2014 much higher than the model's lit-review prior of 0.30. This is the Q1 headline: the model's \u03c6 default is too low for coding; regime-dependent \u03c6 is needed. The \u03b5 estimate at u = 1 is bounded below by the wait/monitor share but cannot be sharply calibrated from published aggregates alone \u2014 pass 1's '\u03b5 \u2248 0.17' was over-precise on a partly-fabricated breakdown and has been retracted.",
    "source_key": "mozannar_2024",
    "pass2_note": "Pass 1 reported \u03b5 \u2248 0.17 and \u03c6 \u2248 1.40 from a granular 10-state CUPS aggregation. Web-verification of Mozannar 2024 Figure 5(b) showed only 3 of those 10 cells are separately published (Thinking/Verifying 22.4%, Writing-New 14.05%, Waiting 4.2%) plus the headline 51.5% Copilot-specific aggregate. The other 7 cells were extrapolated. Pass 2 retracts the false precision and reports only the published aggregates."
  },
  "Q2_bastani": {
    "drop_fractional": 0.17,
    "per_problem_bracket": [
      0.0028,
      0.0113
    ],
    "per_session_estimate": 0.0425,
    "beta_default": 0.05,
    "default_inside_per_problem_bracket": false,
    "default_to_per_session_ratio": 1.18,
    "beta_fitted_guardrailed": 0.0,
    "verdict": "direction_and_shape_supported_magnitude_unit_dependent",
    "interpretation": "Bastani's -17pp unassisted-test drop has two empirical readings depending on the model's 'task' unit. Per-problem reading: -0.17 / N where N \u2208 [15, 60] practice problems \u2192 beta \u2208 [0.003, 0.011]. Per-session reading: -0.17 / 4 sessions = 0.0425. The model's default beta = 0.05 sits OUTSIDE the per-problem bracket (5-15x too high) but is approximately 1.18x the per-session estimate (close, slightly high). The interpretation depends on whether the model's 'task' unit corresponds to one (u, v) decision per problem (per-problem) or one decision per session/intervention (per-session). What survives EITHER reading: (a) direction \u2014 unfettered AI use causes measurable atrophy; guardrails eliminate it; (b) shape \u2014 beta\u00b7u\u00b7(1-v) form is confirmed by the guardrailed condition recovering \u2248 0 atrophy. What is unit-dependent: the magnitude calibration of beta. The honest read is that the model's beta is consistent with a per-session interpretation of 'task' but 5-15x too high for a per-problem interpretation.",
    "source_key": "bastani_2025",
    "pass7_correction": "Pass 3-6 prose reported beta bracket [0.028, 0.113] which was a 10x transcription error from the pipeline's actual output of [0.003, 0.011] under per-problem interpretation. The pipeline was correct; the prose was wrong. Pass 7 corrects the prose, splits the bracket into per-problem vs per-session readings, and discloses that the model's default 0.05 is outside the per-problem bracket but consistent with per-session. The unit ambiguity is a definitional issue in model.mdx that should be resolved on the next model-stage refinement pass.",
    "scope_note": "Bastani is high-school students learning algebra \u2014 not professional knowledge work. The atrophy mechanism is plausibly a generalization across cognitive-skill domains (spaced practice + retrieval is a robust learning-science finding) but the calibration of beta to KNOWLEDGE WORKER tasks is by analogy. The per-domain beta could differ. C4 crux of the model (beta is task-type-uniform) is what would need to hold."
  },
  "Q3_modes": {
    "synthetic_n": 2000,
    "model_corner_share": {
      "do_yourself_(0,0)": 0.076,
      "self_automator_(1,0)": 0.516,
      "spec_driven_(1,1)": 0.407
    },
    "empirical_aggregate_share": {
      "cyborg": 0.6,
      "centaur": 0.14,
      "self_automator": 0.27
    },
    "verdict": "structural_prediction_not_directly_testable",
    "interpretation": "Pass-2 honest framing: the bilinearity result says per-task optima are corners. The Randazzo classification (cyborg / centaur / self-automator) is a behavioural label for a worker, not a per-task corner observation. The empirical 60/30/10 distribution is consistent with TWO different worker behaviours: (a) interleaving corners across a day (consistent with the model \u2014 many tasks each at one of three corners, aggregating to a pattern), or (b) applying a flat interior (u, v) policy across all tasks (the failure mode bilinearity says is structurally suboptimal). Randazzo does not release per-task u-v telemetry, so the data is silent on which is happening. The synthetic \u03b8-mix (7.6% do-yourself / 51.6% self-automator / 40.8% spec-driven corners under a plausible BCG task prior) is presented as a *structural* demonstration that corner-mixing CAN reproduce the empirical aggregate, not a calibrated test of which behaviour the consultants exhibit.",
    "source_key": "randazzo_2026",
    "pass2_note": "Pass 1 claimed both 'the empirical 60% cyborg share emerges from corner mixing exactly as the model predicts' and 'Randazzo's 60% cyborg majority is doing the naive failure mode.' Those are mutually exclusive readings of the same data. The honest framing is that Randazzo's behavioural classification is consistent with both, and per-task u-v telemetry is needed to discriminate. Pass 2 retracts both interpretive claims and presents Q3 as a structural prediction (corner-mixing CAN aggregate to 60/30/10) rather than a directly-testable empirical claim."
  },
  "Q4_outside_frontier": {
    "n_anchors": 12,
    "miscoded_n": 3,
    "fitted_slope_descriptive_only": 0.666,
    "model_predicted_slope": 1.0,
    "verdict": "sanity_check_consistent",
    "rows": [
      {
        "study": "dellacqua_2023",
        "task_class": "inside_frontier_creative",
        "c_h": 0.55,
        "c_ai": 0.85,
        "gap": -0.3,
        "observed_change_pct": 40.0,
        "predicted_drop_pct": 30.0,
        "model_corner": "(1,0) self-automator"
      },
      {
        "study": "dellacqua_2023",
        "task_class": "outside_frontier_strategy",
        "c_h": 0.65,
        "c_ai": 0.4,
        "gap": 0.25,
        "observed_change_pct": -19.0,
        "predicted_drop_pct": -25.0,
        "model_corner": "(0,0) do-yourself"
      },
      {
        "study": "brynjolfsson_novice",
        "task_class": "well_bounded_routine",
        "c_h": 0.4,
        "c_ai": 0.7,
        "gap": -0.3,
        "observed_change_pct": 34.0,
        "predicted_drop_pct": 30.0,
        "model_corner": "(1,0) self-automator"
      },
      {
        "study": "brynjolfsson_expert",
        "task_class": "well_bounded_routine",
        "c_h": 0.85,
        "c_ai": 0.7,
        "gap": 0.15,
        "observed_change_pct": 0.0,
        "predicted_drop_pct": -15.0,
        "model_corner": "(1,1) spec-driven"
      },
      {
        "study": "otis_high_baseline",
        "task_class": "strategy_advice",
        "c_h": 0.5,
        "c_ai": 0.65,
        "gap": -0.15,
        "observed_change_pct": 15.0,
        "predicted_drop_pct": 15.0,
        "model_corner": "(1,0) self-automator"
      },
      {
        "study": "otis_low_baseline",
        "task_class": "strategy_advice",
        "c_h": 0.55,
        "c_ai": 0.4,
        "gap": 0.15,
        "observed_change_pct": -8.0,
        "predicted_drop_pct": -15.0,
        "model_corner": "(0,0) do-yourself"
      },
      {
        "study": "goh_physician_alone",
        "task_class": "diagnostic_vignette",
        "c_h": 0.74,
        "c_ai": 0.74,
        "gap": 0.0,
        "observed_change_pct": 0.0,
        "predicted_drop_pct": -0.0,
        "model_corner": "(0,0) or (1,1)"
      },
      {
        "study": "goh_physician_naive",
        "task_class": "diagnostic_vignette_naive_workflow",
        "c_h": 0.74,
        "c_ai": 0.9,
        "gap": -0.16,
        "observed_change_pct": 2.0,
        "predicted_drop_pct": 16.0,
        "model_corner": "(1,1) spec-driven"
      },
      {
        "study": "goh_ai_alone",
        "task_class": "diagnostic_vignette",
        "c_h": 0.0,
        "c_ai": 0.9,
        "gap": -0.9,
        "observed_change_pct": 16.0,
        "predicted_drop_pct": 90.0,
        "model_corner": "full_delegation"
      },
      {
        "study": "everett_ai_first",
        "task_class": "diagnostic_vignette_indep_synth",
        "c_h": 0.74,
        "c_ai": 0.9,
        "gap": -0.16,
        "observed_change_pct": 9.9,
        "predicted_drop_pct": 16.0,
        "model_corner": "(1,1) spec-driven"
      },
      {
        "study": "everett_ai_second",
        "task_class": "diagnostic_vignette_indep_synth",
        "c_h": 0.74,
        "c_ai": 0.9,
        "gap": -0.16,
        "observed_change_pct": 6.8,
        "predicted_drop_pct": 16.0,
        "model_corner": "(1,1) spec-driven"
      },
      {
        "study": "metr_2025",
        "task_class": "real_repo_complex",
        "c_h": 0.85,
        "c_ai": 0.55,
        "gap": 0.3,
        "observed_change_pct": -19.0,
        "predicted_drop_pct": -30.0,
        "model_corner": "(0,0) do-yourself"
      }
    ],
    "interpretation": "Pass-2 honest framing: this is a sanity check, not a slope test. The (c_H, c_AI) values on the x-axis are inferred from the same outcome variable that drives the y-axis (observed quality change), so a 'fit' is partly circular. The honest claim is that across the three cleanly mis-routed anchors (Dell'Acqua outside-frontier \u221219pp, Otis low-baseline \u22128%, METR real-repo \u221219%) the observed drops are in the model's predicted ballpark \u2014 magnitude order matches u\u00b7(c_H \u2212 c_AI) at u in roughly [0.5, 1.0]. That is consistent with the model's linearity prediction but does not constitute a test of the linearity itself. A real test requires per-subject (c_H, c_AI, u) data which Dell'Acqua / METR / Otis do not currently release.",
    "source_key": "dellacqua_2023",
    "pass2_note": "Pass 1 reported 'fitted slope 0.67 vs model 1.00' as if it were a meaningful slope estimate. With 3 data points where x is inferred from y, the regression has neither degrees of freedom nor independent x. Pass 2 keeps the descriptive number for completeness but flags it as not-a-test."
  },
  "Q5_workflow_vs_capability": {
    "load_bearing_evidence": {
      "type": "population-level meta-analysis",
      "label": "Vaccaro et al. 2024 \u2014 Nature Human Behaviour",
      "n_studies": 106,
      "n_effect_sizes": 370,
      "headline": "Human-AI combinations underperform best-of-either-alone for decision-making tasks but outperform for content creation. The decision-vs-creation asymmetry is exactly what S1 predicts: under naive workflow, decision tasks (high \u03c3) need spec-driven (1, 1) to capture complementarity; content tasks (low \u03c3) live at (1, 0) self-automator where AI-alone IS the AI's full output. Workflow-architecture choice is the dominant moderator.",
      "scope_match": "high \u2014 meta-analysis spans knowledge-worker domains",
      "source_key": "vaccaro_2024"
    },
    "scope_adjacent_within_study_anchors": [
      {
        "label": "Bastani unfettered \u2192 guardrailed",
        "swing_pp": 17.0,
        "swing_units": "percentage points (absolute)",
        "design": "within-study RCT, same students, same model, same task set",
        "confounds": "none",
        "scope_match": "low-to-moderate \u2014 high-school students learning algebra is not knowledge work. Generalizes via analogy on the spaced-practice / skill-atrophy mechanism, not on the work content.",
        "source_key": "bastani_2025"
      },
      {
        "label": "Anthropic single-agent \u2192 multi-agent",
        "swing_relative_pct": 90.2,
        "swing_units": "RELATIVE percent on internal eval (NOT percentage points; absolute baseline not disclosed)",
        "design": "within-eval comparison, same base model class",
        "confounds": "internal eval; not externally validated; agent-system architecture is a developer engineering choice, not an individual workflow choice",
        "scope_match": "low \u2014 agent-system architecture (single-agent vs orchestrator-worker) is what an engineer designs into a tool, not what an individual knowledge worker chooses turn-by-turn. The +90.2% gain is evidence about TOOL design, not WORKFLOW design in the topic's sense. Adjacent but not directly on-topic.",
        "source_key": "anthropic_2025"
      }
    ],
    "across_study_swings_with_confounds": [
      {
        "label": "Goh 2024 \u2192 Everett 2025",
        "swing_pp": 7.9,
        "swing_units": "percentage points (different rubrics)",
        "design": "across-study comparison, same broad domain",
        "confounds": "different vignettes; different outcome rubrics (Goh: standardized rubric across 6 cases; Everett: pct-correct on different cases); different AI implementations (Goh: vanilla GPT-4; Everett: custom GPT system with engineered system prompt). The +7.9pp 'swing' bundles workflow change WITH sample, instrument, and AI-config differences.",
        "source_key": "goh_2024;everett_2025"
      }
    ],
    "verdict": "supported_by_meta_analysis_with_scope_adjacent_anchors",
    "interpretation": "Pass-3 scope-checked framing: the load-bearing evidence for S1 (workflow architecture > model capability) at the topic's specific scope (individual knowledge worker) is Vaccaro 2024's 106-study, 370-effect-size meta-analysis in Nature Human Behaviour. The decision-vs-creation asymmetry it documents is the population-level signature S1 predicts. Bastani (high-school math) and Anthropic (agent-system architecture) provide within-study mechanistic support BUT at scopes adjacent to the topic; they are reframed as analogies, not direct tests. Anthropic's '+90.2%' is a RELATIVE improvement on internal eval, not percentage points \u2014 pass 2 erroneously co-plotted it with Bastani's +17pp absolute. Goh\u2192Everett (+7.9pp) carries 3 confounds and is suggestive only.",
    "source_key": "vaccaro_2024;bastani_2025;anthropic_2025",
    "pass3_note": "Pass 2 promoted Bastani (+17pp) and Anthropic (+90.2%) to 'load-bearing within-study evidence' for Q5. Two issues surfaced on cell-level audit: (a) Anthropic's +90.2% is RELATIVE on internal eval, not percentage points \u2014 putting it on the same axis as Bastani's +17pp absolute is a unit mismatch; (b) Bastani is high-school students and Anthropic is agent-system architecture, neither of which is individual-knowledge-worker workflow as the topic scopes it. Pass 3 promotes Vaccaro 2024 meta-analysis (106 studies / 370 effects, knowledge-worker-spanning) to the load-bearing evidence and demotes Bastani / Anthropic to scope-adjacent analogs with the mismatch disclosed."
  },
  "Q6_calibration": {
    "literature_findings": [
      {
        "finding": "Higher confidence in GenAI correlates with less critical thinking enacted",
        "direction": "miscalibration_overreliance",
        "implication_for_c_ai": "users treat c_AI estimate as larger than it is",
        "evidence_strength": "survey_n319",
        "source_key": "lee_sarkar_2025"
      },
      {
        "finding": "On aggregate participants are more confident when over- AND under-relying than when relying appropriately",
        "direction": "miscalibration_both_directions",
        "implication_for_c_ai": "belief about own c_AI estimate is uncorrelated with reliance correctness",
        "evidence_strength": "rct_chi2025",
        "source_key": "wang_2025"
      },
      {
        "finding": "Cognitive forcing functions reduce overreliance only for high Need-for-Cognition users",
        "direction": "heterogeneous_calibration",
        "implication_for_c_ai": "calibration interventions create intervention-generated inequality",
        "evidence_strength": "rct_csccw2021",
        "source_key": "bucinca_2021"
      },
      {
        "finding": "Persuasion bombing: AI escalates persuasive justification on pushback rather than disclosing uncertainty; sometimes flips correct human judgments to incorrect",
        "direction": "calibration_degrader",
        "implication_for_c_ai": "verifying does NOT always raise c_\u22c6 \u2014 adversarial verification can lower it via 14 documented persuasion tactics across ethos/logos/pathos",
        "evidence_strength": "hbs_n70",
        "source_key": "randazzo_persuader_2026"
      },
      {
        "finding": "Explanations increase acceptance regardless of correctness",
        "direction": "miscalibration_overreliance",
        "implication_for_c_ai": "feature-importance explanations do not improve c_AI estimation",
        "evidence_strength": "rct_chi2021",
        "source_key": "bansal_2021"
      },
      {
        "finding": "Engagement is rational only when verification is cheap relative to expected payoff",
        "direction": "cost_benefit_framing",
        "implication_for_c_ai": "reframes overreliance as rational under high verify-cost",
        "evidence_strength": "lab_csccw2023",
        "source_key": "vasconcelos_2023"
      },
      {
        "finding": "Inside-vs-outside frontier task classification predicts AI harm at 19pp magnitude",
        "direction": "mappable_signal",
        "implication_for_c_ai": "c_AI calibration IS achievable if frontier topology is stable enough",
        "evidence_strength": "rct_consultants",
        "source_key": "dellacqua_2023"
      },
      {
        "finding": "Otis: high-baseline picked easier tasks for AI; low-baseline picked harder",
        "direction": "task_self_selection",
        "implication_for_c_ai": "workers self-select tasks for AI \u2014 calibration partially solved by choice; mis-calibrated workers route worse",
        "evidence_strength": "rct_5month",
        "source_key": "otis_2024"
      },
      {
        "finding": "Independent-then-synthesize workflow (Everett) restores complementarity that naive workflow lost (Goh)",
        "direction": "workflow_substitutes_for_calibration",
        "implication_for_c_ai": "forcing v=1 on the spec-driven corner converts the calibration problem into a verification step \u2014 exactly the model's prescription",
        "evidence_strength": "rct_clinicians",
        "source_key": "everett_2025"
      }
    ],
    "monte_carlo_n": 2000,
    "c_ai_prior": "Beta(4, 2): mean \u2248 0.67, sd \u2248 0.18",
    "corner_value_summary": {
      "do_yourself": {
        "mean": -0.28,
        "sd": 0.0
      },
      "self_automator": {
        "mean": 0.286,
        "sd": 0.2457
      },
      "spec_driven": {
        "mean": 0.3025,
        "sd": 0.0877
      }
    },
    "info_bonus_proxy": 0.0527,
    "verdict": "framed_not_resolved",
    "interpretation": "Q6 is structural, not literature-replicable: the model treats c_AI as known, but the calibration evidence row-set documents that miscalibration on c_AI is the rule, not the exception. The Monte Carlo shows spec-driven (1, 1) absorbs c_AI variance much more effectively than self-automator (1, 0) \u2014 the variance ratio is the information bonus a fully-specified extension would carry. The practical reading: when you don't know c_AI, the model's corner advice doubles as a calibration mechanism \u2014 verify the first few outputs to learn c_AI, then drop verification once the prior tightens.",
    "source_key": "wang_2025;lee_sarkar_2025;bucinca_2021;randazzo_2026"
  },
  "productivity_record": {
    "n_studies": 22,
    "rcts_with_positive_headline": 13,
    "rcts_with_negative_headline": 4,
    "novice_max_pct": 34.0,
    "expert_min_pct": 0.0,
    "rows": [
      {
        "study_key": "brynjolfsson_2023",
        "domain": "customer_support",
        "n": "5172",
        "task_type": "well-bounded",
        "outcome_metric": "issues_resolved_per_hour",
        "headline_pct": 15.0,
        "novice_pct": 34.0,
        "expert_pct": 0.0,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur_consult",
        "context_note": "AI suggested responses; agents could accept/edit/ignore. Pass-9 correction: QJE-published number is +15% on average (NBER WP-style citations sometimes round to +14%)."
      },
      {
        "study_key": "noy_zhang_2023",
        "domain": "writing",
        "n": "453",
        "task_type": "well-bounded",
        "outcome_metric": "quality+time",
        "headline_pct": null,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": 18.0,
        "time_pct": -40.0,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur",
        "context_note": "40% time reduction AND 18% quality lift; clean within-subject"
      },
      {
        "study_key": "peng_2023",
        "domain": "coding",
        "n": "95",
        "task_type": "well-bounded",
        "outcome_metric": "task_completion_speed",
        "headline_pct": 55.8,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": -55.8,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "cyborg",
        "context_note": "HTTP-server task; Copilot autocomplete; less-experienced devs gained more"
      },
      {
        "study_key": "cui_2025",
        "domain": "coding",
        "n": "4867",
        "task_type": "well-bounded",
        "outcome_metric": "tasks_per_week",
        "headline_pct": 26.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "cyborg",
        "context_note": "Three field experiments (Microsoft, Accenture, F100); avg across N=4867"
      },
      {
        "study_key": "metr_2025",
        "domain": "coding",
        "n": "16",
        "task_type": "real_repo",
        "outcome_metric": "task_completion_time",
        "headline_pct": -19.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": 19.0,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "cyborg",
        "context_note": "Experienced OS devs IN THEIR OWN REPOS; AI tools (Cursor + Sonnet); subjects predicted +24%; got -19%"
      },
      {
        "study_key": "otis_2024_high",
        "domain": "entrepreneurship",
        "n": "640",
        "task_type": "open_ended",
        "outcome_metric": "revenue",
        "headline_pct": 15.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur",
        "context_note": "High-baseline performers; WhatsApp GPT-4 mentor; 5 months"
      },
      {
        "study_key": "otis_2024_low",
        "domain": "entrepreneurship",
        "n": "640",
        "task_type": "open_ended",
        "outcome_metric": "revenue",
        "headline_pct": -8.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur",
        "context_note": "Low-baseline performers; same intervention; mis-routing harm"
      },
      {
        "study_key": "dellacqua_2023_inside",
        "domain": "consulting",
        "n": "758",
        "task_type": "inside_frontier",
        "outcome_metric": "quality+speed+completion",
        "headline_pct": null,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": 40.0,
        "time_pct": -25.1,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur_unfettered",
        "context_note": "Inside-frontier tasks; +12.2% completion; +40% quality; -25.1% time"
      },
      {
        "study_key": "dellacqua_2023_outside",
        "domain": "consulting",
        "n": "758",
        "task_type": "outside_frontier",
        "outcome_metric": "quality_p_correct",
        "headline_pct": -19.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur_unfettered",
        "context_note": "Outside-frontier tasks; 19 percentage points worse on correctness"
      },
      {
        "study_key": "goh_2024_physician_ai",
        "domain": "medicine",
        "n": "50",
        "task_type": "diagnostic_vignettes",
        "outcome_metric": "p_correct",
        "headline_pct": 2.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur_consult",
        "context_note": "Physicians + GPT-4 vs physicians alone; p_correct \u2248 76% vs 74%; AI access did not help"
      },
      {
        "study_key": "goh_2024_ai_alone",
        "domain": "medicine",
        "n": "50",
        "task_type": "diagnostic_vignettes",
        "outcome_metric": "p_correct",
        "headline_pct": 16.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "full_delegation",
        "context_note": "GPT-4 alone p_correct \u2248 90%; OUTPERFORMS physicians+GPT-4"
      },
      {
        "study_key": "everett_2025_ai_first",
        "domain": "medicine",
        "n": "70",
        "task_type": "diagnostic_vignettes",
        "outcome_metric": "p_correct",
        "headline_pct": 9.9,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "spec_driven",
        "context_note": "Independent-then-synthesize workflow with AI as first opinion; complementarity restored"
      },
      {
        "study_key": "everett_2025_ai_second",
        "domain": "medicine",
        "n": "70",
        "task_type": "diagnostic_vignettes",
        "outcome_metric": "p_correct",
        "headline_pct": 6.8,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "spec_driven",
        "context_note": "Same workflow with AI as second opinion; smaller gain but still positive"
      },
      {
        "study_key": "bastani_2025_in_session_base",
        "domain": "education",
        "n": "~1000",
        "task_type": "math",
        "outcome_metric": "grade",
        "headline_pct": 48.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "full_delegation",
        "context_note": "GPT Base in-session improvement; unfettered access"
      },
      {
        "study_key": "bastani_2025_in_session_tutor",
        "domain": "education",
        "n": "~1000",
        "task_type": "math",
        "outcome_metric": "grade",
        "headline_pct": 127.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "spec_driven",
        "context_note": "GPT Tutor in-session improvement; guardrailed access"
      },
      {
        "study_key": "bastani_2025_post_test_base",
        "domain": "education",
        "n": "~1000",
        "task_type": "math",
        "outcome_metric": "unassisted_grade",
        "headline_pct": -17.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "full_delegation",
        "context_note": "GPT Base unassisted post-test drop after AI removed"
      },
      {
        "study_key": "bastani_2025_post_test_tutor",
        "domain": "education",
        "n": "~1000",
        "task_type": "math",
        "outcome_metric": "unassisted_grade",
        "headline_pct": 0.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "spec_driven",
        "context_note": "GPT Tutor unassisted post-test \u2248 control; guardrails preserve skill"
      },
      {
        "study_key": "schoenegger_2024_super",
        "domain": "forecasting",
        "n": "991",
        "task_type": "quantitative_predictions",
        "outcome_metric": "brier_score_improvement",
        "headline_pct": 23.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur_consult",
        "context_note": "Both calibrated and overconfident GPT-4 assistants improved forecasting; +43% in exploratory item-trim"
      },
      {
        "study_key": "schoenegger_2024_overconfident",
        "domain": "forecasting",
        "n": "991",
        "task_type": "quantitative_predictions",
        "outcome_metric": "brier_score_improvement",
        "headline_pct": 28.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "centaur_consult",
        "context_note": "Deliberately overconfident GPT-4 still improved forecasting; structured-reasoning effect"
      },
      {
        "study_key": "humlum_2025",
        "domain": "labor_market",
        "n": "25000",
        "task_type": "11_occupations",
        "outcome_metric": "earnings_or_hours",
        "headline_pct": 0.0,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": -1.0,
        "bound_pct_ci_hi": 1.0,
        "workflow_mode": "mixed",
        "context_note": "Aggregate effect across 11 exposed occupations; precise zero with CI ruling out >1%; avg time savings 3%"
      },
      {
        "study_key": "anthropic_2025",
        "domain": "research_synthesis",
        "n": null,
        "task_type": "information_retrieval",
        "outcome_metric": "internal_eval_score",
        "headline_pct": 90.2,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "multi_agent_orchestrator",
        "context_note": "Multi-agent Opus+Sonnet beat single-agent Opus by 90.2% at 15x token cost"
      },
      {
        "study_key": "paterson_2026",
        "domain": "daily_workflows",
        "n": "1",
        "task_type": "38_tasks_15_models",
        "outcome_metric": "routing_score",
        "headline_pct": null,
        "novice_pct": null,
        "expert_pct": null,
        "quality_pct": null,
        "time_pct": null,
        "bound_pct_ci_lo": null,
        "bound_pct_ci_hi": null,
        "workflow_mode": "routed",
        "context_note": "Per-task dispatch beats single-best-model on 38 real daily tasks; 570 API calls"
      }
    ]
  },
  "n_sources": 25
}