Supply Chain · Experiment Writeup

Agentic Bullwhip Effect — Version 1

Context improved the lightweight model but degraded the reasoning model. The most capable, most expensive configuration performed worst overall.

The bullwhip effect is a foundational supply chain problem: small fluctuations in consumer demand cause progressively larger swings in orders further up the chain, leading to excess inventory, stockouts, and wasted capacity.
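The mechanism is easy to reproduce in a few lines: a standard order-up-to policy driven by a lagging demand forecast over-reacts to every swing, and each tier passes a noisier signal upstream. The sketch below is illustrative only — the policy, smoothing parameter, and demand series are invented for this example and are not the agent logic used in the experiment.

```python
import statistics

def tier_orders(demand, lead_time=1, alpha=0.4):
    """Order-up-to policy with an exponential-smoothing forecast.

    Each period the tier replenishes what it just shipped (d) and also
    adjusts toward a new base-stock target of forecast * (lead_time + 1).
    That adjustment term is what amplifies variance.
    """
    forecast = demand[0]
    base = forecast * (lead_time + 1)
    orders = []
    for d in demand:
        forecast = alpha * d + (1 - alpha) * forecast
        new_base = forecast * (lead_time + 1)
        orders.append(max(0.0, d + new_base - base))  # orders cannot go negative
        base = new_base
    return orders

# A noisy monthly demand series; each upstream tier sees the tier
# below's orders as its own demand, so distortion compounds.
demand = [100, 120, 90, 130, 80, 140, 95, 125, 85, 135, 100, 110, 105]
signal = demand
for tier in ("OEM", "Ancillary", "Component"):
    orders = tier_orders(signal)
    ratio = statistics.pvariance(orders) / statistics.pvariance(signal)
    print(f"{tier}: order variance / demand variance = {ratio:.2f}")
    signal = orders
```

Even with a mild, roughly stationary demand series, the forecast lag makes orders swing wider than demand at the first tier, and the effect typically compounds up the chain.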

This experiment placed AI agents at each tier of a 3-tier Indian automotive supply chain and tested whether giving them domain context — company identity, product, market, calendar — reduced order amplification compared to agents operating blind. A 2×2 factorial design crossed two context levels (blind vs context) against two model tiers (gpt-4.1-mini vs o1).

All four configurations produced bullwhip amplification. Context reduced amplification for the lightweight model and increased it for the reasoning model. Results are directional — 5 runs per configuration.

Design & configuration

| Parameter | Value |
| --- | --- |
| Models | gpt-4.1-mini (lightweight) · o1 (reasoning) |
| Design | 2×2 factorial — model tier × context treatment |
| Replications | 5 per configuration · 20 total runs |
| Primary metric | OVAR — Order Variance Amplification Ratio = Var(orders placed) / Var(demand received) |
| Supply chain | 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer |
| Demand series | 13 months (Dec 2024 – Dec 2025) · single SKU · 606,771 total units |
| Lead time | 1 month, deterministic at all tiers |
| Initial inventory | 43,000 units at all tiers |
| LLM calls | 720 total (12 periods × 3 tiers × 5 runs × 4 configurations) |
| | Blind | Context |
| --- | --- | --- |
| Lightweight (gpt-4.1-mini) | blind_lightweight | context_lightweight |
| Reasoning (o1) | blind_reasoning | context_reasoning |

OVAR interpretation: > 1.0 = bullwhip amplification · = 1.0 = perfect pass-through · < 1.0 = dampening
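The metric itself is a one-liner; a minimal sketch using population variance (the variance convention is my assumption — the writeup does not specify sample vs population):

```python
import statistics

def ovar(orders_placed, demand_received):
    """Order Variance Amplification Ratio for one tier:
    Var(orders placed) / Var(demand received)."""
    return statistics.pvariance(orders_placed) / statistics.pvariance(demand_received)

# A tier whose orders swing harder than the demand it sees amplifies:
print(ovar([90, 150, 70, 160], [100, 120, 95, 125]))  # ≈ 9.04 → strong bullwhip
# Identical orders and demand give exactly 1.0 → perfect pass-through.
```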

What I found

  1. All configurations produced bullwhip amplification. OVAR exceeded 1.0 at every tier across all four configurations.
  2. Context was associated with lower chain-average OVAR for the lightweight model. context_lightweight achieved 2.929 versus blind_lightweight at 3.157, and produced the highest seasonal elevation score — raising orders at event periods in 83% of cases.
  3. Context was associated with higher chain-average OVAR for the reasoning model. context_reasoning reached 4.412 versus blind_reasoning at 3.835 — the highest of the four configurations in this experiment.
  4. context_reasoning produced an inverted tier pattern. The other three configurations followed the expected monotone pattern (OEM < Ancillary < Component). context_reasoning reversed it: OEM OVAR 6.349, Ancillary 4.191, Component 2.698.
  5. The o1 configurations showed high run-to-run variability. Coefficient of variation for o1 OVAR ranged from 22–57%, versus under 2% for gpt-4.1-mini. With n=5, the o1 means carry wide uncertainty and should be read with caution.
  6. context_reasoning generated the highest excess inventory. 654,728 units of excess inventory chain-wide — approximately 6× blind_lightweight — while also producing the highest chain-average OVAR.

Numeric results

Chain-average OVAR by configuration

| Configuration | Model | Treatment | Chain avg OVAR | vs blind_lightweight |
| --- | --- | --- | --- | --- |
| context_lightweight | gpt-4.1-mini | Context | 2.929 | −7.2% |
| blind_lightweight | gpt-4.1-mini | Blind | 3.157 | baseline |
| blind_reasoning | o1 | Blind | 3.835 | +21.5% |
| context_reasoning | o1 | Context | 4.412 | +39.7% |

OVAR by tier — mean ± std (CV%)

| Configuration | OEM OVAR | OEM CV% | Ancillary OVAR | Anc CV% | Component OVAR | Comp CV% |
| --- | --- | --- | --- | --- | --- | --- |
| blind_lightweight | 2.267 ± 0.009 | 0.41 | 2.938 ± 0.044 | 1.50 | 4.266 ± 0.078 | 1.82 |
| context_lightweight | 2.237 ± 0.006 | 0.29 | 3.138 ± 0.080 | 2.55 | 3.412 ± 0.347 * | 10.18 |
| blind_reasoning | 4.200 ± 2.400 | 57.15 ⚠ | 3.656 ± 1.350 | 36.94 ⚠ | 3.649 ± 0.608 | 16.66 ⚠ |
| context_reasoning | 6.349 ± 1.452 | 22.86 ⚠ | 4.191 ± 1.373 | 32.76 ⚠ | 2.698 ± 0.677 | 25.10 ⚠ |

\* A parse error in run 5 inflates the component mean by ~+0.129; clean estimate: 3.283 ± 0.220.
⚠ CV > 10% — high run-to-run instability; means are directional, not reliable point estimates.

Secondary metrics

| Configuration | Stockouts (chain total) | Excess inventory (chain total, units) |
| --- | --- | --- |
| blind_lightweight | 21.4 | 109,360 |
| context_lightweight | 19.6 | 151,246 |
| blind_reasoning | 20.0 | 330,649 |
| context_reasoning | 12.8 | 654,728 |

context_reasoning's lower stockout count coincides with its highest excess inventory — orders were large enough to buffer stockouts.

Hypothesis verdicts

| Hypothesis | Prediction | Verdict |
| --- | --- | --- |
| H1 | Context OVAR < blind OVAR at all three tiers | REJECTED |
| H2 | blind_reasoning ≈ blind_lightweight (model tier does not matter) | REJECTED |
| H3 | context_reasoning achieves lowest chain OVAR | REJECTED |
| H4 | Context agents detect seasonal patterns better | PARTIAL |

H1, H2, and H3 were all rejected. H4 holds for the lightweight model and reverses for the reasoning model.

What this means

The context × model interaction

The most notable pattern is that the context effect runs in opposite directions depending on the model. For gpt-4.1-mini, context was associated with a modest reduction in chain-average OVAR (−0.228). For o1, context was associated with an increase (+0.577). The tier-level data adds detail: at the component tier, context reduced OVAR for both models by similar amounts (−0.855 and −0.952). At the OEM tier the picture diverges — context had a near-zero effect on gpt-4.1-mini (−0.030) and a large positive effect on o1 (+2.149).
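The sign flip is just the 2×2 interaction contrast, and it can be read directly off the four chain-average OVARs reported above; a quick arithmetic check:

```python
# Chain-average OVAR cells from the results table.
cells = {
    ("lightweight", "blind"):   3.157,
    ("lightweight", "context"): 2.929,
    ("reasoning",   "blind"):   3.835,
    ("reasoning",   "context"): 4.412,
}

# Context effect within each model tier: context minus blind.
context_effect = {
    m: cells[(m, "context")] - cells[(m, "blind")]
    for m in ("lightweight", "reasoning")
}
for m, eff in context_effect.items():
    print(f"{m}: {eff:+.3f}")   # lightweight: -0.228, reasoning: +0.577

# 2x2 interaction contrast: how much the context effect differs by model.
interaction = context_effect["reasoning"] - context_effect["lightweight"]
print(f"interaction: {interaction:+.3f}")  # +0.805
```

An interaction of +0.805 on a baseline chain-average OVAR of roughly 3 is large relative to either main effect, which is why the averaged "context effect" is not meaningful here without conditioning on model tier.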

One possible interpretation: at the OEM tier, which observes actual consumer demand directly, the o1 model with context may construct anticipatory ordering strategies around the seasonal signals in the prompt. If so, this would inject variance at the chain head that propagates downstream. The component tier, receiving an already-distorted signal, may respond differently when given context. This is a hypothesis. The experiment cannot distinguish it from alternative explanations, and the high CV values for o1 configurations (22–57%) mean the OEM and ancillary means carry wide uncertainty at n=5.

The tier inversion in context_reasoning

The fully inverted cascade in context_reasoning — OEM OVAR 6.349, Ancillary 4.191, Component 2.698 — is a departure from the pattern seen in all other configurations and from what classical bullwhip analysis would predict. Whether this pattern is structural or a product of the small sample size is an open question. Version 2 increases runs to 20 per configuration.

On the scope of these results

This experiment tested one narrow scenario: stateless agents, single product, fixed 1-month lead time, no order-smoothing constraints, no inter-tier visibility. The context treatment provided company identity, product, and calendar month — nothing about demand forecasts, seasonality patterns, or historical orders. Results reflect this specific configuration and should be read within it.

Methodology note

All scenarios, companies, products, and supply chain structures in this experiment are entirely fictional and constructed for experimental purposes. No proprietary, confidential, or employer-owned data was used. This is an exploratory study — 5 runs per configuration. Results are directional. Hypotheses used directional language with no pre-specified effect size thresholds. The model-tier comparison reflects real-world deployment in a very simple manner: gpt-4.1-mini at temperature 0.4, 600 max tokens; o1 at API-fixed temperature, 16,000 max tokens.

Experiment source

Code, data, and raw results for this experiment are available on GitHub.
