TL;DR
Every heuristic outperformed every LLM configuration on both order variance and stockouts. Exponential smoothing beat the best LLM by 8x on both metrics simultaneously. All seven hypotheses were rejected.
Overview
What this experiment explored
Agentic Bullwhip Effect Version 2 asks a harder question than Version 1, the first experiment in this series: not which AI configuration performs best, but whether any LLM configuration outperforms a simple rule-based heuristic at all. Four models spanning lightweight and reasoning tiers, frontier and local, were tested against three deterministic heuristic baselines, with 20 independent runs per condition.
Every heuristic outperformed every LLM on both order variance and stockouts simultaneously. This is not a tradeoff result: the LLMs lost on both axes at once.
Experiment Setup
Design & configuration
| Models | gpt-4.1-mini (frontier lightweight) · o4-mini (frontier reasoning) · phi4:14b (local lightweight) · gpt-oss:120b (local reasoning) |
| Design | 2×2 factorial (model tier × context treatment) across two backends: frontier (Azure) and local (Ollama), 8 backend-specific model-condition cells in total |
| Replications | 20 per LLM configuration · 1 per heuristic (deterministic) |
| Primary metrics | OVAR (Order Variance Amplification Ratio) = Var(orders) / Var(demand). Stockout count. Both always reported together. MPRD threshold: |ΔOVAR| ≥ 0.5 required for a practically meaningful claim. |
| Heuristic baselines | Exponential smoothing (α=0.30) · Naive passthrough · Order-up-to with fixed safety stock |
| Supply chain | 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer |
| Demand series | 25 months (Jan 2025 to Jan 2027) · single SKU · two full Indian festive cycles |
| Lead time | 1 month deterministic at all tiers |
| Initial inventory | 43,609 units (mean + 1.65σ, ~95% service level) |
| LLM calls | 11,520 total (4 conditions × 20 runs × 24 periods × 3 tiers × 2 backends) |
| Agent design | Stateless, no memory between periods. Deliberate: most real agentic deployments are stateless. |
| Blind condition | Numbers only. No tier persona, no calendar month. |
| Context condition | Tier persona + calendar month, with the same numeric state variables in both conditions. |
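The three baselines are standard textbook rules. A minimal sketch of each ordering policy (function names and exact parameterisation are illustrative; the repository is authoritative):

```python
def exp_smoothing_order(history, alpha=0.30):
    """Order the exponentially smoothed demand forecast (alpha = 0.30)."""
    forecast = history[0]
    for demand in history[1:]:
        forecast = alpha * demand + (1 - alpha) * forecast
    return forecast

def naive_passthrough_order(history):
    """Order exactly the last observed demand (OVAR = 1 by construction)."""
    return history[-1]

def order_up_to_order(history, on_hand, safety_stock):
    """Order up to last demand plus a fixed safety stock."""
    target = history[-1] + safety_stock
    return max(0, target - on_hand)
```

All three are deterministic given the same demand series, which is why each baseline needs only one run.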
Key findings
What I found
- Every heuristic outperformed every LLM on both OVAR and stockouts simultaneously. Not a tradeoff. LLMs were strictly dominated on both primary metrics in every configuration tested.
- The gap is not marginal. Exponential smoothing: chain OVAR 0.54, 5 stockouts. Best LLM (local phi4:14b, blind): OVAR 4.33, 41 stockouts. 8x worse on both dimensions at once.
- Context had opposite effects by model and backend. For frontier gpt-4.1-mini, adding business context reduced OVAR marginally (4.70 to 4.47, delta 0.23, below the MPRD threshold). For local phi4:14b, the same context was dramatically worse: chain OVAR jumped from 4.33 to 6.35, with the Ancillary tier hitting 10.82 ± 8.14 across 20 runs. The standard deviation of 8.14 indicates instability, not a consistent directional effect.
- Reasoning models showed no ordering advantage. The 120B-scale gpt-oss:120b produced results indistinguishable from gpt-4.1-mini in blind conditions. o4-mini generated over 1 million reasoning tokens and produced no measurable improvement on either metric. All seven hypotheses were rejected.
Results
Numeric results
Heuristic baselines
| Heuristic | Chain OVAR | Stockouts (of 75 possible) |
|---|---|---|
| Exponential smoothing | 0.54 | 5 |
| Naive passthrough | 1.00 | 3 |
| Order-up-to | 1.71 | 14 |
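OVAR is a ratio of sample variances, which is why naive passthrough scores exactly 1.00 by construction. A sketch of the metric:

```python
import statistics

def ovar(orders, demand):
    """Order Variance Amplification Ratio: Var(orders) / Var(demand).
    Below 1 the policy dampens demand variance; above 1 it amplifies
    it (the bullwhip effect)."""
    return statistics.variance(orders) / statistics.variance(demand)
```

A policy that doubles every deviation from mean demand quadruples the variance and scores OVAR = 4, which is roughly where most LLM configurations in this experiment landed.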
Chain-average OVAR by LLM configuration
| Condition | Backend | Chain OVAR (mean ± std) | Stockouts (mean ± std) |
|---|---|---|---|
| exp_smoothing | HEURISTIC | 0.54 | 5 |
| naive_passthrough | HEURISTIC | 1.00 | 3 |
| order_up_to | HEURISTIC | 1.71 | 14 |
| L-Blind | FRONTIER | 4.70 ± 0.14 | 40.5 ± 0.83 |
| L-Context | FRONTIER | 4.47 ± 0.07 | 39.0 ± 0.83 |
| L-Blind | LOCAL | 4.33 ± 0.00 | 41.0 ± 0.00 |
| L-Context | LOCAL | 6.35 ± 2.53 | 37.2 ± 3.11 |
| R-Blind | FRONTIER | 4.72 ± 1.12 | 42.9 ± 3.85 |
| R-Context | FRONTIER | 4.52 ± 0.08 | 40.1 ± 0.85 |
| R-Blind | LOCAL | 4.52 ± 0.00 | 40.0 ± 0.00 |
| R-Context | LOCAL | 4.52 ± 0.05 | 39.6 ± 0.76 |
L = Lightweight (gpt-4.1-mini / phi4:14b) · R = Reasoning (o4-mini / gpt-oss:120b)
OVAR by tier
| Condition | Backend | OEM | Ancillary | Component |
|---|---|---|---|---|
| exp_smoothing | HEURISTIC | 0.41 | 0.65 | 0.58 |
| L-Blind | FRONTIER | 4.21 | 6.64 | 3.25 |
| L-Context | FRONTIER | 4.12 | 6.01 | 3.30 |
| L-Blind | LOCAL | 3.71 | 5.89 | 3.40 |
| L-Context | LOCAL | 4.62 | 10.82 | 3.61 |
| R-Blind | FRONTIER | 5.94 | 5.18 | 3.05 |
| R-Context | FRONTIER | 4.13 | 5.99 | 3.45 |
| R-Blind | LOCAL | 4.13 | 5.98 | 3.45 |
| R-Context | LOCAL | 4.13 | 6.01 | 3.43 |
Hypothesis verdicts
| Hypothesis | Prediction | Actual | Verdict |
|---|---|---|---|
| H1 | At least one LLM achieves lower OVAR than exp smoothing (0.54) with ≤5 stockouts | Best LLM: OVAR 4.33, 41 stockouts | REJECTED |
| H2 | context_lightweight OVAR < blind_lightweight by ≥0.5 | Δ = 0.23, below MPRD | REJECTED |
| H3 | context_reasoning OVAR < blind_reasoning by ≥0.5 | Δ = 0.20, below MPRD | REJECTED |
| H4 | blind_reasoning OVAR < blind_lightweight by ≥0.5 | Δ = −0.02, opposite direction | REJECTED |
| H5 | context_reasoning OVAR < context_lightweight by ≥0.5 | Δ = −0.05, opposite direction | REJECTED |
| H6 | Context benefit larger for reasoning tier than lightweight | −0.03, opposite direction | REJECTED |
| H7 | Local context_lightweight within ±0.5 of frontier context_lightweight | Δ = 1.88, well outside equivalence bounds | REJECTED |
Discussion
Why the gap is structural
The bullwhip failure is structural. Each agent sees only the current period, with no memory of what it ordered previously; without that causal chain there is no self-correction mechanism. A stateless agent that over-ordered last period arrives at the next period without knowing it did. Combine that with the fact that LLMs generate plausible text rather than numerically calibrated outputs, and the result is an agent that picks a number that sounds reasonable rather than one that dampens variance.
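The contrast with exponential smoothing is instructive: its entire state is one number, the running forecast, and that one number is exactly the memory the stateless agents lack. A toy illustration (assumed demand series, not the experiment's data):

```python
import statistics

def smoothed_orders(demand, alpha=0.30):
    # The forecast persists across periods, so a one-period spike is
    # absorbed partially and then decays back -- the self-correction
    # a stateless per-period agent has no way to reproduce.
    forecast = demand[0]
    orders = []
    for d in demand:
        forecast = alpha * d + (1 - alpha) * forecast
        orders.append(forecast)
    return orders

demand = [100, 100, 180, 100, 100, 100]  # single demand spike
orders = smoothed_orders(demand)
# The order stream peaks well below 180 and decays smoothly,
# so Var(orders) < Var(demand), i.e. OVAR < 1.
```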
The phi4:14b context result is worth noting separately. The standard deviation of 8.14 on Ancillary-tier OVAR across 20 runs indicates that the business context prompt did not consistently shape ordering behaviour; it introduced variance. In some runs it may have triggered aggressive anticipatory ordering, in others conservative responses. The blind model failed consistently and identically across all 20 runs. Consistent failure can be diagnosed and compensated for. Intermittent instability, where the same model with the same prompt produces order-of-magnitude different outcomes across runs, is harder to anticipate and mitigate in a real deployment.
Industry Implications
What this means for practitioners
Do not replace your ordering formula with an LLM. The formula was built for this task and will do it better. The result is not close: exponential smoothing, a method from the 1950s, produced orders eight times less variable than the best LLM configuration, with fewer stockouts at the same time.
Where LLMs might add value is earlier in the process: reading demand signals, spotting something unusual in the data, providing context to a planner. That is a different task from executing the order quantity decision, and it was not tested here. But using an LLM to inform a formula, rather than replace it, is a more plausible role than the one tested in this experiment.
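One way to keep the formula in charge while still using the model is to bound whatever the LLM contributes. A hypothetical sketch, not something tested in this experiment (the signal interface and the ±10% cap are my assumptions):

```python
def informed_order(history, llm_signal=0.0, alpha=0.30, max_adjust=0.10):
    """Hybrid sketch: exponential smoothing sets the quantity; an upstream
    LLM 'demand signal' in [-1, 1] may nudge it by at most +/-10%.
    The formula bounds the damage a bad signal can do."""
    forecast = history[0]
    for d in history[1:]:
        forecast = alpha * d + (1 - alpha) * forecast
    nudge = max(-1.0, min(1.0, llm_signal)) * max_adjust
    return forecast * (1.0 + nudge)
```

With a signal of zero this degrades gracefully to plain exponential smoothing, and an out-of-range signal is clamped rather than trusted.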
For practitioners, the phi4:14b instability is the sharper warning. When given business context, that model did not fail in a predictable way: it produced reasonable results in most runs and extreme results in a few, with nothing to distinguish the two from the outside. A model that fails consistently is easy to manage: you remove it and move on. A model that is mostly fine, and occasionally an order of magnitude off, is much harder to catch.
Full code and results on GitHub
Full code, data, and raw results are available on GitHub.
Methodology note
All scenarios, companies, products, and supply chain structures are entirely fictional. The experiment was intentionally narrow: single product, fixed lead times, stateless agents, no unstructured context. Results should not be generalised to supply chain management broadly. The correct scope: LLM agents do not outperform simple blind heuristics in a stylised single-product replenishment task with fixed lead times and no unstructured context.