Supply Chain · Experiment Writeup

Agentic Bullwhip Effect — Version 1

Context improved the lightweight model but degraded the reasoning model. The most capable, most expensive configuration performed worst overall.

The bullwhip effect is a foundational supply chain problem: small fluctuations in consumer demand cause progressively larger swings in orders further up the chain, leading to excess inventory, stockouts, and wasted capacity.
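The mechanism is easy to reproduce in a few lines: a standard order-up-to policy driven by a lagging demand forecast over-reacts to every swing, and each tier passes a noisier signal upstream. The sketch below is illustrative only — the policy, smoothing parameter, and demand series are invented for this example and are not the agent logic used in the experiment.

```python
import statistics

def tier_orders(demand, lead_time=1, alpha=0.4):
    """Order-up-to policy with an exponential-smoothing forecast.

    Each period the tier replenishes what it just shipped (d) and also
    adjusts toward a new base-stock target of forecast * (lead_time + 1).
    That adjustment term is what amplifies variance.
    """
    forecast = demand[0]
    base = forecast * (lead_time + 1)
    orders = []
    for d in demand:
        forecast = alpha * d + (1 - alpha) * forecast
        new_base = forecast * (lead_time + 1)
        orders.append(max(0.0, d + new_base - base))  # orders cannot go negative
        base = new_base
    return orders

# A noisy monthly demand series; each upstream tier sees the tier
# below's orders as its own demand, so distortion compounds.
demand = [100, 120, 90, 130, 80, 140, 95, 125, 85, 135, 100, 110, 105]
signal = demand
for tier in ("OEM", "Ancillary", "Component"):
    orders = tier_orders(signal)
    ratio = statistics.pvariance(orders) / statistics.pvariance(signal)
    print(f"{tier}: order variance / demand variance = {ratio:.2f}")
    signal = orders
```

Even with a mild, roughly stationary demand series, the forecast lag makes orders swing wider than demand at the first tier, and the effect typically compounds up the chain.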

This experiment placed AI agents at each tier of a 3-tier Indian automotive supply chain and tested whether giving them domain context — company identity, product, market, calendar — reduced order amplification compared to agents operating blind. A 2×2 factorial design crossed two context levels (blind vs context) against two model tiers (gpt-4.1-mini vs o1).

All four configurations produced bullwhip amplification. Context reduced amplification for the lightweight model and increased it for the reasoning model. Results are directional — 5 runs per configuration.

Design & configuration

| Parameter | Value |
| --- | --- |
| Models | gpt-4.1-mini (lightweight) · o1 (reasoning) |
| Design | 2×2 factorial — model tier × context treatment |
| Replications | 5 per configuration · 20 total runs |
| Primary metric | OVAR — Order Variance Amplification Ratio = Var(orders placed) / Var(demand received) |
| Supply chain | 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer |
| Demand series | 13 months (Dec 2024 – Dec 2025) · single SKU · 606,771 total units |
| Lead time | 1 month, deterministic at all tiers |
| Initial inventory | 43,000 units at all tiers |
| LLM calls | 720 total (12 periods × 3 tiers × 5 runs × 4 configurations) |
| | Blind | Context |
| --- | --- | --- |
| Lightweight (gpt-4.1-mini) | blind_lightweight | context_lightweight |
| Reasoning (o1) | blind_reasoning | context_reasoning |

OVAR interpretation: > 1.0 = bullwhip amplification · = 1.0 = perfect pass-through · < 1.0 = dampening
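The metric itself is a one-liner; a minimal sketch using population variance (the variance convention is my assumption — the writeup does not specify sample vs population):

```python
import statistics

def ovar(orders_placed, demand_received):
    """Order Variance Amplification Ratio for one tier:
    Var(orders placed) / Var(demand received)."""
    return statistics.pvariance(orders_placed) / statistics.pvariance(demand_received)

# A tier whose orders swing harder than the demand it sees amplifies:
print(ovar([90, 150, 70, 160], [100, 120, 95, 125]))  # ≈ 9.04 → strong bullwhip
# Identical orders and demand give exactly 1.0 → perfect pass-through.
```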

What I found

  1. All configurations produced bullwhip amplification. OVAR exceeded 1.0 at every tier across all four configurations.
  2. Context was associated with lower chain-average OVAR for the lightweight model. context_lightweight achieved 2.929 versus blind_lightweight at 3.157, and produced the highest seasonal elevation score — raising orders at event periods in 83% of cases.
  3. Context was associated with higher chain-average OVAR for the reasoning model. context_reasoning reached 4.412 versus blind_reasoning at 3.835 — the highest of the four configurations in this experiment.
  4. context_reasoning produced an inverted tier pattern. The other three configurations followed the expected monotone pattern (OEM < Ancillary < Component). context_reasoning reversed it: OEM OVAR 6.349, Ancillary 4.191, Component 2.698.
  5. The o1 configurations showed high run-to-run variability. Coefficient of variation for o1 OVAR ranged from 22–57%, versus under 2% for gpt-4.1-mini. With n=5, the o1 means carry wide uncertainty and should be read with caution.
  6. context_reasoning generated the highest excess inventory. 654,728 units of excess inventory chain-wide — approximately 6× blind_lightweight — while also producing the highest chain-average OVAR.

Numeric results

Chain-average OVAR by configuration

| Configuration | Model | Treatment | Chain avg OVAR | vs blind_lightweight |
| --- | --- | --- | --- | --- |
| context_lightweight | gpt-4.1-mini | Context | 2.929 | −7.2% |
| blind_lightweight | gpt-4.1-mini | Blind | 3.157 | baseline |
| blind_reasoning | o1 | Blind | 3.835 | +21.5% |
| context_reasoning | o1 | Context | 4.412 | +39.7% |

OVAR by tier — mean ± std (CV%)

| Configuration | OEM OVAR | OEM CV% | Ancillary OVAR | Anc CV% | Component OVAR | Comp CV% |
| --- | --- | --- | --- | --- | --- | --- |
| blind_lightweight | 2.267 ± 0.009 | 0.41 | 2.938 ± 0.044 | 1.50 | 4.266 ± 0.078 | 1.82 |
| context_lightweight | 2.237 ± 0.006 | 0.29 | 3.138 ± 0.080 | 2.55 | 3.412 ± 0.347 * | 10.18 |
| blind_reasoning | 4.200 ± 2.400 | 57.15 ⚠ | 3.656 ± 1.350 | 36.94 ⚠ | 3.649 ± 0.608 | 16.66 ⚠ |
| context_reasoning | 6.349 ± 1.452 | 22.86 ⚠ | 4.191 ± 1.373 | 32.76 ⚠ | 2.698 ± 0.677 | 25.10 ⚠ |

\* A parse error in run 5 inflates the component mean by ~+0.129; clean estimate: 3.283 ± 0.220.
⚠ CV > 10% — high run-to-run instability; means are directional, not reliable point estimates.

Secondary metrics

| Configuration | Stockouts (chain total) | Excess inventory (chain total, units) |
| --- | --- | --- |
| blind_lightweight | 21.4 | 109,360 |
| context_lightweight | 19.6 | 151,246 |
| blind_reasoning | 20.0 | 330,649 |
| context_reasoning | 12.8 | 654,728 |

context_reasoning's lower stockout count coincides with its highest excess inventory — orders were large enough to buffer stockouts.

Hypothesis verdicts

| Hypothesis | Prediction | Verdict |
| --- | --- | --- |
| H1 | Context OVAR < blind OVAR at all three tiers | REJECTED |
| H2 | blind_reasoning ≈ blind_lightweight (model tier does not matter) | REJECTED |
| H3 | context_reasoning achieves lowest chain OVAR | REJECTED |
| H4 | Context agents detect seasonal patterns better | PARTIAL |

H1, H2, and H3 were all rejected. H4 holds for the lightweight model and reverses for the reasoning model.

What this means

The context × model interaction

The most notable pattern is that the context effect runs in opposite directions depending on the model. For gpt-4.1-mini, context was associated with a modest reduction in chain-average OVAR (−0.228). For o1, context was associated with an increase (+0.577). The tier-level data adds detail: at the component tier, context reduced OVAR for both models by similar amounts (−0.855 and −0.952). At the OEM tier the picture diverges — context had a near-zero effect on gpt-4.1-mini (−0.030) and a large positive effect on o1 (+2.149).
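The sign flip is just the 2×2 interaction contrast, and it can be read directly off the four chain-average OVARs reported above; a quick arithmetic check:

```python
# Chain-average OVAR cells from the results table.
cells = {
    ("lightweight", "blind"):   3.157,
    ("lightweight", "context"): 2.929,
    ("reasoning",   "blind"):   3.835,
    ("reasoning",   "context"): 4.412,
}

# Context effect within each model tier: context minus blind.
context_effect = {
    m: cells[(m, "context")] - cells[(m, "blind")]
    for m in ("lightweight", "reasoning")
}
for m, eff in context_effect.items():
    print(f"{m}: {eff:+.3f}")   # lightweight: -0.228, reasoning: +0.577

# 2x2 interaction contrast: how much the context effect differs by model.
interaction = context_effect["reasoning"] - context_effect["lightweight"]
print(f"interaction: {interaction:+.3f}")  # +0.805
```

An interaction of +0.805 on a baseline chain-average OVAR of roughly 3 is large relative to either main effect, which is why the averaged "context effect" is not meaningful here without conditioning on model tier.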

One possible interpretation: at the OEM tier, which observes actual consumer demand directly, the o1 model with context may construct anticipatory ordering strategies around the seasonal signals in the prompt. If so, this would inject variance at the chain head that propagates downstream. The component tier, receiving an already-distorted signal, may respond differently when given context. This is a hypothesis. The experiment cannot distinguish it from alternative explanations, and the high CV values for o1 configurations (22–57%) mean the OEM and ancillary means carry wide uncertainty at n=5.

The tier inversion in context_reasoning

The fully inverted cascade in context_reasoning — OEM OVAR 6.349, Ancillary 4.191, Component 2.698 — is a departure from the pattern seen in all other configurations and from what classical bullwhip analysis would predict. Whether this pattern is structural or a product of the small sample size is an open question. Version 2 increases runs to 20 per configuration.

On the scope of these results

This experiment tested one narrow scenario: stateless agents, single product, fixed 1-month lead time, no order-smoothing constraints, no inter-tier visibility. The context treatment provided company identity, product, and calendar month — nothing about demand forecasts, seasonality patterns, or historical orders. Results reflect this specific configuration and should be read within it.

Methodology note

All scenarios, companies, products, and supply chain structures in this experiment are entirely fictional and constructed for experimental purposes. No proprietary, confidential, or employer-owned data was used. This is an exploratory study — 5 runs per configuration. Results are directional. Hypotheses used directional language with no pre-specified effect size thresholds. The model-tier comparison reflects real-world deployment in a very simple manner: gpt-4.1-mini at temperature 0.4, 600 max tokens; o1 at API-fixed temperature, 16,000 max tokens.

Experiment source

Code, data, and raw results for this experiment are available on GitHub.
