Abstract
The bullwhip effect is a foundational supply chain problem: small fluctuations in consumer demand cause progressively larger swings in orders further up the chain, leading to excess inventory, stockouts, and wasted capacity.
This experiment placed AI agents at each tier of a 3-tier Indian automotive supply chain and tested whether giving them domain context — company identity, product, market, calendar — reduced order amplification compared to agents operating blind. A 2×2 factorial design crossed two context levels (blind vs context) against two model tiers (gpt-4.1-mini vs o1).
All four configurations produced bullwhip amplification. Context reduced amplification for the lightweight model and increased it for the reasoning model. Results are directional — 5 runs per configuration.
Experiment setup
Design & configuration
| Models | gpt-4.1-mini (lightweight) · o1 (reasoning) |
| Design | 2×2 factorial — model tier × context treatment |
| Replications | 5 per configuration · 20 total runs |
| Primary metric | OVAR — Order Variance Amplification Ratio = Var(orders placed) / Var(demand received) |
| Supply chain | 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer |
| Demand series | 13 months (Dec 2024 – Dec 2025) · single SKU · 606,771 total units |
| Lead time | 1 month deterministic at all tiers |
| Initial inventory | 43,000 units at all tiers |
| LLM calls | 720 total (12 periods × 3 tiers × 5 runs × 4 configurations) |
OVAR interpretation: > 1.0 = bullwhip amplification · = 1.0 = perfect pass-through · < 1.0 = dampening
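As a concrete illustration, OVAR for a single tier can be computed directly from the two series. A minimal sketch in Python — the demand and order numbers below are hypothetical, not experiment data:

```python
import numpy as np

def ovar(orders_placed, demand_received):
    """Order Variance Amplification Ratio: Var(orders placed) / Var(demand received).
    > 1.0 = bullwhip amplification, = 1.0 = perfect pass-through, < 1.0 = dampening."""
    return np.var(orders_placed, ddof=1) / np.var(demand_received, ddof=1)

# Toy series: an agent that overreacts to period-over-period demand changes
# amplifies variance. Numbers are illustrative only.
demand = np.array([100.0, 110, 95, 120, 105, 90, 115])
orders = demand + 2.0 * np.diff(demand, prepend=demand[0])  # overreaction heuristic

print(ovar(orders, demand))   # well above 1.0 → amplification
print(ovar(demand, demand))   # exact pass-through → 1.0
```

The chain-average figures in the tables below are the mean of this per-tier ratio across the three tiers.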
Key findings
What I found
- All configurations produced bullwhip amplification. OVAR exceeded 1.0 at every tier across all four configurations.
- Context was associated with lower chain-average OVAR for the lightweight model. context_lightweight achieved 2.929 versus blind_lightweight at 3.157, and produced the highest seasonal elevation score — raising orders at event periods in 83% of cases.
- Context was associated with higher chain-average OVAR for the reasoning model. context_reasoning reached 4.412 versus blind_reasoning at 3.835 — the highest of the four configurations in this experiment.
- context_reasoning produced an inverted tier pattern. The other three configurations followed the expected monotone pattern (OEM < Ancillary < Component). context_reasoning reversed it: OEM OVAR 6.349, Ancillary 4.191, Component 2.698.
- The o1 configurations showed high run-to-run variability. Coefficient of variation for o1 OVAR ranged from 22% to 57%, versus under 3% for gpt-4.1-mini at all but one tier (the context_lightweight component-tier outlier traces to a single parse error). With n=5, the o1 means carry wide uncertainty and should be read with caution.
- context_reasoning generated the highest excess inventory. 654,728 units chain-wide — approximately 6× blind_lightweight — while also producing the highest chain-average OVAR.
Results
Numeric results
Chain-average OVAR by configuration
| Configuration | Model | Treatment | Chain Avg OVAR | vs blind_lightweight |
|---|---|---|---|---|
| context_lightweight | GPT-4.1-MINI | Context | 2.929 | −7.2% |
| blind_lightweight | GPT-4.1-MINI | Blind | 3.157 | baseline |
| blind_reasoning | O1 | Blind | 3.835 | +21.5% |
| context_reasoning | O1 | Context | 4.412 | +39.7% |
OVAR by tier — mean ± std (CV%)
| Configuration | OEM OVAR | OEM CV% | Ancillary OVAR | Anc CV% | Component OVAR | Comp CV% |
|---|---|---|---|---|---|---|
| blind_lightweight | 2.267 ± 0.009 | 0.41 | 2.938 ± 0.044 | 1.50 | 4.266 ± 0.078 | 1.82 |
| context_lightweight | 2.237 ± 0.006 | 0.29 | 3.138 ± 0.080 | 2.55 | 3.412 ± 0.347 * | 10.18 |
| blind_reasoning | 4.200 ± 2.400 | 57.15 ⚠ | 3.656 ± 1.350 | 36.94 ⚠ | 3.649 ± 0.608 | 16.66 ⚠ |
| context_reasoning | 6.349 ± 1.452 | 22.86 ⚠ | 4.191 ± 1.373 | 32.76 ⚠ | 2.698 ± 0.677 | 25.10 ⚠ |
* Parse error in run 5 inflates component mean by ~+0.129. Clean estimate: 3.283 ± 0.220. ⚠ CV > 10% — high run-to-run instability; means are directional, not reliable point estimates.
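The CV flag in the table follows directly from per-run spread. A minimal sketch of the flagging rule — the run values below are hypothetical, chosen only to mimic the two spread regimes seen above:

```python
import statistics

def cv_percent(runs):
    """Coefficient of variation across replicate runs: 100 * sample std / mean."""
    return 100.0 * statistics.stdev(runs) / statistics.mean(runs)

# Illustrative per-run OVAR values (not experiment data):
stable_runs   = [2.26, 2.27, 2.26, 2.28, 2.27]  # gpt-4.1-mini-like tight spread
unstable_runs = [2.1, 7.9, 3.4, 6.0, 1.8]       # o1-like wide spread

for label, runs in [("stable", stable_runs), ("unstable", unstable_runs)]:
    cv = cv_percent(runs)
    flag = "⚠ directional only" if cv > 10 else "reliable point estimate"
    print(f"{label}: CV = {cv:.1f}% → {flag}")
```

With n=5 the CV itself is noisy, which is one more reason the flagged means are treated as directional rather than precise.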
Secondary metrics
| Configuration | Stockouts (chain total) | Excess inventory (chain total) |
|---|---|---|
| blind_lightweight | 21.4 | 109,360 |
| context_lightweight | 19.6 | 151,246 |
| blind_reasoning | 20.0 | 330,649 |
| context_reasoning | 12.8 | 654,728 |
context_reasoning's lower stockout count coincides with the highest excess inventory of any configuration: orders were large enough to buffer against stockouts, at the cost of substantial overstock.
Hypothesis verdicts
| Hypothesis | Prediction | Verdict |
|---|---|---|
| H1 | Context OVAR < Blind OVAR at all three tiers | REJECTED |
| H2 | Blind_reasoning ≈ Blind_lightweight (model does not matter) | REJECTED |
| H3 | context_reasoning achieves lowest chain OVAR | REJECTED |
| H4 | Context agents detect seasonal patterns better | PARTIAL |
H1–H3 were rejected outright; H4 held for the lightweight model but reversed for the reasoning model.
Discussion
What this means
The context × model interaction
The most notable pattern is that the context effect runs in opposite directions depending on the model. For gpt-4.1-mini, context was associated with a modest reduction in chain-average OVAR (−0.228). For o1, context was associated with an increase (+0.577). The tier-level data adds detail: at the component tier, context reduced OVAR for both models by similar amounts (−0.855 and −0.952). At the OEM tier the picture diverges — context had a near-zero effect on gpt-4.1-mini (−0.030) and a large positive effect on o1 (+2.149).
One possible interpretation: at the OEM tier, which observes actual consumer demand directly, the o1 model with context may construct anticipatory ordering strategies around the seasonal signals in the prompt. If so, this would inject variance at the chain head that propagates downstream. The component tier, receiving an already-distorted signal, may respond differently when given context. This is a hypothesis. The experiment cannot distinguish it from alternative explanations, and the high CV values for o1 configurations (22–57%) mean the OEM and ancillary means carry wide uncertainty at n=5.
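This propagation mechanism can be illustrated with a toy serial chain. The sketch below is not the experiment's agent logic — it replaces the LLM agents with a simple "demand plus anticipation" ordering heuristic, and all parameters are hypothetical — but it shows how extra variance injected at the chain head inflates amplification measured downstream:

```python
import random
import statistics

def simulate_tier(incoming_demand, anticipation):
    """Ordering heuristic: order = observed demand + anticipation * recent change.
    Higher anticipation models an agent that front-loads orders around expected shifts."""
    orders, prev = [], incoming_demand[0]
    for d in incoming_demand:
        orders.append(max(0.0, d + anticipation * (d - prev)))
        prev = d
    return orders

random.seed(7)
consumer = [100 + random.gauss(0, 5) for _ in range(24)]  # noisy consumer demand

# Same downstream tiers; only the chain-head (OEM) behavior differs.
for label, oem_anticipation in [("calm OEM", 0.5), ("anticipatory OEM", 3.0)]:
    signal = consumer
    for a in [oem_anticipation, 0.5, 0.5]:  # OEM → Ancillary → Component
        signal = simulate_tier(signal, a)
    amp = statistics.variance(signal) / statistics.variance(consumer)
    print(f"{label}: chain-end amplification ≈ {amp:.1f}×")
```

Under this toy model, the anticipatory chain head produces markedly higher chain-end amplification even though the downstream tiers are identical — consistent with, though of course not evidence for, the interpretation above.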
The tier inversion in context_reasoning
The fully inverted cascade in context_reasoning — OEM OVAR 6.349, Ancillary 4.191, Component 2.698 — is a departure from the pattern seen in all other configurations and from what classical bullwhip analysis would predict. Whether this pattern is structural or a product of the small sample size is an open question. Version 2 increases runs to 20 per configuration.
On the scope of these results
This experiment tested one narrow scenario: stateless agents, single product, fixed 1-month lead time, no order-smoothing constraints, no inter-tier visibility. The context treatment provided company identity, product, and calendar month — nothing about demand forecasts, seasonality patterns, or historical orders. Results reflect this specific configuration and should be read within it.
Methodology note
All scenarios, companies, products, and supply chain structures in this experiment are entirely fictional and constructed for experimental purposes. No proprietary, confidential, or employer-owned data was used. This is an exploratory study — 5 runs per configuration. Results are directional. Hypotheses used directional language with no pre-specified effect size thresholds. The model-tier comparison reflects real-world deployment in a very simple manner: gpt-4.1-mini at temperature 0.4, 600 max tokens; o1 at API-fixed temperature, 16,000 max tokens.
Experiment source
Code, data, and raw results for this experiment are available on GitHub.