How a $2.3M Billing Bug Hid in Plain Sight for 11 Months*
A postmortem on the discount-stacking incident that nobody could reproduce — until we stopped guessing and started asking.
The logs said everything was fine. The database said we were hemorrhaging money.
The Incident
On March 14, 2025 at 2:47 AM EST, our fraud detection pipeline flagged a batch of transactions with negative totals. Not zero — negative. We were paying customers to buy things.
The first on-call engineer triaged it as a data corruption issue. Restarted the service. Went back to sleep. The next morning, the numbers were clean. No negative totals. No anomalies in the logs. Nothing in the error tracker.
It happened again two weeks later. And then again three days after that.
What We Knew
Our checkout flow was straightforward. Cart totals were calculated server-side. Promotional discounts were applied via a lookup table — coupon code in, percentage out, capped at 20%. The code had been reviewed by four engineers. It had 94% test coverage. The logic was simple enough to fit on an index card.
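On that index card, the logic would have looked something like this. This is a sketch, not our actual code; the table contents and names (`PROMO_TABLE`, `apply_promo`) are illustrative:

```python
# Illustrative sketch of the server-side discount logic described above.
PROMO_TABLE = {
    "SPRING15": 0.15,
    "WELCOME10": 0.10,
    "VIP40": 0.40,  # still capped at 20% below
}
MAX_DISCOUNT = 0.20  # the 20% cap

def apply_promo(total_cents: int, code: str) -> int:
    """Apply a promo code once: coupon code in, percentage out, capped."""
    pct = min(PROMO_TABLE.get(code, 0.0), MAX_DISCOUNT)
    return total_cents - round(total_cents * pct)
```

Called once per order, this cannot produce a discount above 20%, let alone a negative total. That is exactly why the code review and the test suite found nothing.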
We knew the following:
- Negative totals only appeared on orders placed between 1:00 AM and 4:00 AM EST
- They always involved promotional codes
- The affected orders had applied the same promo code twice
- The discount exceeded our 20% cap, sometimes reaching 40–50%
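In hindsight, those four symptoms reduce to a single scan we could have run over the order data at any time. A sketch, with illustrative field names:

```python
def flag_suspect_orders(orders):
    """Flag orders matching the symptoms above: the same promo applied
    more than once, or an effective discount beyond the 20% cap."""
    suspects = []
    for o in orders:
        codes = o["applied_promos"]
        discount = 1 - o["total"] / o["list_total"] if o["list_total"] else 0
        if len(codes) != len(set(codes)) or discount > 0.20:
            suspects.append(o["order_id"])
    return suspects
```

The catch, of course, is that you only write this scan once you already suspect duplicate promo applications, and nothing in the code suggested that was possible.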
What We Didn't Know
We couldn't reproduce it. Not locally, not in staging, not with load testing. We threw everything at it: added logging around the promo application, set up error boundaries, put breakpoints in the discount pipeline, created synthetic transactions with the exact same parameters. Nothing. The code behaved correctly every single time.
For five months, a rotating cast of engineers spent an average of 6 hours per week on this bug. We called it "the phantom." Someone printed out the checkout code and pinned it to a corkboard. We had a Slack channel with 847 messages. We hired an external consultant.
The Root Cause
The issue was a race condition. Not in the checkout code — in the API gateway's retry logic.
During late-night traffic troughs, our load balancer's keepalive timeout overlapped with the payment processor's processing window. About 1 in 9,000 requests would time out at the gateway level but succeed at the processor level. The gateway would retry. The retry would apply the promo code again. The discount stacked.
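The failure mode is easy to demonstrate once you know it exists. A minimal simulation, with illustrative names standing in for the gateway and the processor:

```python
class Order:
    def __init__(self, list_total):
        self.list_total = list_total
        self.total = list_total
        self.promos_applied = 0

def apply_promo_and_settle(order, pct, slow=False):
    """The processor's side: apply the promo and settle. Not idempotent --
    every call subtracts the discount again."""
    order.total -= order.list_total * pct
    order.promos_applied += 1
    return not slow  # a slow response looks like a timeout upstream

def gateway_submit(order, pct, timed_out_attempts=0):
    """The gateway's side: retry each timed-out attempt, unaware that
    the 'failed' request actually succeeded at the processor."""
    for _ in range(timed_out_attempts):
        apply_promo_and_settle(order, pct, slow=True)
    apply_promo_and_settle(order, pct)
```

One timed-out-but-successful attempt turns a capped 20% discount into 40%; enough retries and the total goes negative.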
The retry logic was in infrastructure we hadn't written. The timeout values were defaults we'd never changed. The interaction only manifested under specific traffic patterns that only occurred during low-volume hours. And the logs — every single one — showed exactly one promo application per order, because each individual request logged correctly. You had to correlate two requests that didn't know about each other.
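The correlation itself is trivial once you know to do it. A sketch, assuming each log entry carries an order ID and an event name (the field names are illustrative):

```python
from collections import Counter

def find_stacked_promos(log_entries):
    """Each entry logs exactly one promo application and looks correct in
    isolation. Only grouping by order ID reveals the duplicates."""
    per_order = Counter(e["order_id"] for e in log_entries
                        if e["event"] == "promo_applied")
    return [oid for oid, n in per_order.items() if n > 1]
```

Five lines. The hard part was never the query; it was knowing that two independent, individually correct requests were the thing to look for.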
What We Wish We'd Had
Here's the thing that still bothers me: the running application knew. At the moment the bug was active, the live state of the system contained every piece of information we needed. The order total was negative. The discount was above the cap. The promo array had two entries instead of one. The retry counter exceeded the limit.
All of it was right there, in memory, at runtime.
We couldn't see it because we were reading logs after the fact. We were studying the blueprint when we needed to walk through the building. We were reading the patient's chart when we needed an MRI.
The debugging workflow we used — reproduce, instrument, observe, correlate — assumes you know what to look for before the bug happens. When the bug is an emergent property of systems interacting at runtime, that workflow doesn't just fail slowly. It fails completely.
The Cost
$2.3 million in incorrect discounts over 11 months. 312 engineering hours on investigation. One external consulting engagement. Zero useful information from traditional debugging.
The fix was a single configuration change: a 200ms increase to a gateway timeout. Four characters in a YAML file.
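For illustration only, the change had roughly this shape. The key names and values here are hypothetical; this is not the actual file:

```yaml
# Hypothetical gateway config; real keys and values differ.
upstream:
  payment_processor:
    request_timeout: 2600ms   # was 2400ms, a default nobody had ever changed
```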
The Lesson
The runtime visibility gap isn't a theoretical concept. It's the $2.3 million space between what your code says should happen and what your running system is actually doing.
If we could have asked our running system "show me orders where the discount exceeds the configured maximum" — not searched logs, not added print statements, not reproduced the conditions — just asked — we would have found this in minutes. Not months.
That's not a tooling preference. That's a fundamentally different relationship with running software.
Marcus Chen is CTO at Lattice Financial, where he now asks his production systems questions instead of reading their diaries. The phantom has not returned.
*This is a fictional case study created to illustrate the runtime visibility gap. Real stories coming soon.
Ready to close the gap?
Start querying your running software in plain English.