TL;DR
Every heuristic outperformed every LLM configuration on both order variance and stockouts. Exponential smoothing beat the best LLM by 8x on both metrics simultaneously. All seven hypotheses were rejected.
Overview
What this experiment explored
Agentic Bullwhip Effect Version 2 asks a harder question than Version 1, the first experiment in this series: not which AI configuration performs best, but whether any LLM configuration outperforms a simple rule-based heuristic at all. Four models spanning lightweight and reasoning tiers, frontier and local, were tested against three deterministic heuristic baselines, with 20 independent runs per condition.
Every heuristic outperformed every LLM on both order variance and stockouts simultaneously. This is not a tradeoff result: the LLMs lost on both axes at once.
Experiment Setup
Design & configuration
| Models | gpt-4.1-mini (frontier lightweight) · o4-mini (frontier reasoning) · phi4:14b (local lightweight) · gpt-oss:120b (local reasoning) |
| Design | 2×2 factorial (model tier × context treatment) across two backends: frontier (Azure) and local (Ollama), 8 backend-specific model-condition cells in total |
| Replications | 20 per LLM configuration · 1 per heuristic (deterministic) |
| Primary metrics | OVAR (Order Variance Amplification Ratio) = Var(orders) / Var(demand). Stockout count. Both always reported together. MPRD threshold: |ΔOVAR| ≥ 0.5 required for a practically meaningful claim. |
| Heuristic baselines | Exponential smoothing (α=0.30) · Naive passthrough · Order-up-to with fixed safety stock |
| Supply chain | 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer |
| Demand series | 25 months (Jan 2025 to Jan 2027) · single SKU · two full Indian festive cycles |
| Lead time | 1 month deterministic at all tiers |
| Initial inventory | 43,609 units (mean + 1.65σ, ~95% service level) |
| LLM calls | 11,520 total (4 conditions × 20 runs × 24 periods × 3 tiers × 2 backends) |
| Agent design | Stateless, no memory between periods. Deliberate: most real agentic deployments are stateless. |
| Blind condition | Numbers only. No tier persona, no calendar month. |
| Context condition | Tier persona + calendar month, with the same numeric state variables in both conditions. |
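The three baselines are standard textbook rules. A minimal sketch of each ordering policy (function names and exact parameterisation are illustrative; the repository is authoritative):

```python
def exp_smoothing_order(history, alpha=0.30):
    """Order the exponentially smoothed demand forecast (alpha = 0.30)."""
    forecast = history[0]
    for demand in history[1:]:
        forecast = alpha * demand + (1 - alpha) * forecast
    return forecast

def naive_passthrough_order(history):
    """Order exactly the last observed demand (OVAR = 1 by construction)."""
    return history[-1]

def order_up_to_order(history, on_hand, safety_stock):
    """Order up to last demand plus a fixed safety stock."""
    target = history[-1] + safety_stock
    return max(0, target - on_hand)
```

All three are deterministic given the same demand series, which is why each baseline needs only one run.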
Key findings
What I found
- Every heuristic outperformed every LLM on both OVAR and stockouts simultaneously. Not a tradeoff. LLMs were strictly dominated on both primary metrics in every configuration tested.
- The gap is not marginal. Exponential smoothing: chain OVAR 0.54, 5 stockouts. Best LLM (local phi4:14b, blind): OVAR 4.33, 41 stockouts. 8x worse on both dimensions at once.
- Context had opposite effects by model and backend. For frontier gpt-4.1-mini, adding business context reduced OVAR marginally (4.70 to 4.47, delta 0.23, below the MPRD threshold). For local phi4:14b, the same context was dramatically worse: chain OVAR jumped from 4.33 to 6.35, with the Ancillary tier hitting 10.82 ± 8.14 across 20 runs. The standard deviation of 8.14 indicates instability, not a consistent directional effect.
- Reasoning models showed no ordering advantage. The 120B-scale gpt-oss:120b produced results indistinguishable from gpt-4.1-mini in blind conditions. o4-mini generated over 1 million reasoning tokens and produced no measurable improvement on either metric. All seven hypotheses were rejected.
Results
Numeric results
Heuristic baselines
| Heuristic | Chain OVAR | Stockouts (of 75 possible) |
|---|---|---|
| Exponential smoothing | 0.54 | 5 |
| Naive passthrough | 1.00 | 3 |
| Order-up-to | 1.71 | 14 |
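OVAR is a ratio of sample variances, which is why naive passthrough scores exactly 1.00 by construction. A sketch of the metric:

```python
import statistics

def ovar(orders, demand):
    """Order Variance Amplification Ratio: Var(orders) / Var(demand).
    Below 1 the policy dampens demand variance; above 1 it amplifies
    it (the bullwhip effect)."""
    return statistics.variance(orders) / statistics.variance(demand)
```

A policy that doubles every deviation from mean demand quadruples the variance and scores OVAR = 4, which is roughly where most LLM configurations in this experiment landed.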
Chain-average OVAR by LLM configuration
| Condition | Backend | Chain OVAR (mean ± std) | Stockouts (mean ± std) |
|---|---|---|---|
| exp_smoothing | HEURISTIC | 0.54 | 5 |
| naive_passthrough | HEURISTIC | 1.00 | 3 |
| order_up_to | HEURISTIC | 1.71 | 14 |
| L-Blind | FRONTIER | 4.70 ± 0.14 | 40.5 ± 0.83 |
| L-Context | FRONTIER | 4.47 ± 0.07 | 39.0 ± 0.83 |
| L-Blind | LOCAL | 4.33 ± 0.00 | 41.0 ± 0.00 |
| L-Context | LOCAL | 6.35 ± 2.53 | 37.2 ± 3.11 |
| R-Blind | FRONTIER | 4.72 ± 1.12 | 42.9 ± 3.85 |
| R-Context | FRONTIER | 4.52 ± 0.08 | 40.1 ± 0.85 |
| R-Blind | LOCAL | 4.52 ± 0.00 | 40.0 ± 0.00 |
| R-Context | LOCAL | 4.52 ± 0.05 | 39.6 ± 0.76 |
L = Lightweight (gpt-4.1-mini / phi4:14b) · R = Reasoning (o4-mini / gpt-oss:120b)
OVAR by tier
| Condition | Backend | OEM | Ancillary | Component |
|---|---|---|---|---|
| exp_smoothing | HEURISTIC | 0.41 | 0.65 | 0.58 |
| L-Blind | FRONTIER | 4.21 | 6.64 | 3.25 |
| L-Context | FRONTIER | 4.12 | 6.01 | 3.30 |
| L-Blind | LOCAL | 3.71 | 5.89 | 3.40 |
| L-Context | LOCAL | 4.62 | 10.82 | 3.61 |
| R-Blind | FRONTIER | 5.94 | 5.18 | 3.05 |
| R-Context | FRONTIER | 4.13 | 5.99 | 3.45 |
| R-Blind | LOCAL | 4.13 | 5.98 | 3.45 |
| R-Context | LOCAL | 4.13 | 6.01 | 3.43 |
Hypothesis verdicts
| Hypothesis | Prediction | Actual | Verdict |
|---|---|---|---|
| H1 | At least one LLM achieves lower OVAR than exp smoothing (0.54) with ≤5 stockouts | Best LLM: OVAR 4.33, 41 stockouts | REJECTED |
| H2 | context_lightweight OVAR < blind_lightweight by ≥0.5 | Δ = 0.23, below MPRD | REJECTED |
| H3 | context_reasoning OVAR < blind_reasoning by ≥0.5 | Δ = 0.20, below MPRD | REJECTED |
| H4 | blind_reasoning OVAR < blind_lightweight by ≥0.5 | Δ = −0.02, opposite direction | REJECTED |
| H5 | context_reasoning OVAR < context_lightweight by ≥0.5 | Δ = −0.05, opposite direction | REJECTED |
| H6 | Context benefit larger for reasoning tier than lightweight | −0.03, opposite direction | REJECTED |
| H7 | Local context_lightweight within ±0.5 of frontier context_lightweight | Δ = 1.88, well outside equivalence bounds | REJECTED |
Discussion
Why the gap is structural
The bullwhip failure is structural. Each agent sees only the current period, with no memory of what it ordered previously; without that causal chain there is no self-correction mechanism. A stateless agent that over-ordered last period arrives at the next period without knowing it did. Combine that with the fact that LLMs generate plausible text rather than numerically calibrated outputs, and the result is an agent that picks a number that sounds reasonable rather than one that dampens variance.
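The contrast with exponential smoothing is instructive: its entire state is one number, the running forecast, and that one number is exactly the memory the stateless agents lack. A toy illustration (assumed demand series, not the experiment's data):

```python
import statistics

def smoothed_orders(demand, alpha=0.30):
    # The forecast persists across periods, so a one-period spike is
    # absorbed partially and then decays back -- the self-correction
    # a stateless per-period agent has no way to reproduce.
    forecast = demand[0]
    orders = []
    for d in demand:
        forecast = alpha * d + (1 - alpha) * forecast
        orders.append(forecast)
    return orders

demand = [100, 100, 180, 100, 100, 100]  # single demand spike
orders = smoothed_orders(demand)
# The order stream peaks well below 180 and decays smoothly,
# so Var(orders) < Var(demand), i.e. OVAR < 1.
```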
The phi4:14b context result is worth noting separately. The standard deviation of 8.14 on Ancillary-tier OVAR across 20 runs indicates that the business context prompt did not consistently shape ordering behaviour; it introduced variance. In some runs it may have triggered aggressive anticipatory ordering, in others conservative responses. The blind model failed consistently and identically across all 20 runs. Consistent failure can be diagnosed and compensated for. Intermittent instability, where the same model with the same prompt produces order-of-magnitude different outcomes across runs, is harder to anticipate and mitigate in a real deployment.
Industry Implications
What this means for practitioners
Do not replace your ordering formula with an LLM. The formula was built for this task and will do it better. The result is not close: exponential smoothing, a method from the 1950s, produced orders eight times less variable than the best LLM configuration, with fewer stockouts at the same time.
Where LLMs might add value is earlier in the process: reading demand signals, spotting something unusual in the data, providing context to a planner. That is a different task from executing the order quantity decision, and it was not tested here. But using an LLM to inform a formula, rather than replace it, is a more plausible role than the one tested in this experiment.
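One way to keep the formula in charge while still using the model is to bound whatever the LLM contributes. A hypothetical sketch, not something tested in this experiment (the signal interface and the ±10% cap are my assumptions):

```python
def informed_order(history, llm_signal=0.0, alpha=0.30, max_adjust=0.10):
    """Hybrid sketch: exponential smoothing sets the quantity; an upstream
    LLM 'demand signal' in [-1, 1] may nudge it by at most +/-10%.
    The formula bounds the damage a bad signal can do."""
    forecast = history[0]
    for d in history[1:]:
        forecast = alpha * d + (1 - alpha) * forecast
    nudge = max(-1.0, min(1.0, llm_signal)) * max_adjust
    return forecast * (1.0 + nudge)
```

With a signal of zero this degrades gracefully to plain exponential smoothing, and an out-of-range signal is clamped rather than trusted.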
For practitioners, the phi4:14b instability is the sharper warning. When given business context, that model did not fail in a predictable way: it produced reasonable results in most runs and extreme results in a few, with nothing to distinguish the two from the outside. A model that fails consistently is easy to manage: you remove it and move on. A model that is mostly fine, and occasionally an order of magnitude off, is much harder to catch.
Full code and results on GitHub
Full code, data, and raw results are available on GitHub.
Methodology note
All scenarios, companies, products, and supply chain structures are entirely fictional. The experiment was intentionally narrow: single product, fixed lead times, stateless agents, no unstructured context. Results should not be generalised to supply chain management broadly. The correct scope: LLM agents do not outperform simple blind heuristics in a stylised single-product replenishment task with fixed lead times and no unstructured context.