Catastrophic Backtracking: The Regex Bug That Took Down a Service

It was a Tuesday afternoon when our alerting system started screaming. P99 latency on the authentication service had climbed from 40ms to 45 seconds. By the time I pulled up the dashboard, half our API pods were pegged at 100% CPU. We were not under a DDoS. No database was choking. The culprit was eleven characters in a regular expression that one of our engineers had written nine months earlier without a second thought.

This is a post-mortem — or more precisely, a dissection of a ReDoS (Regular Expression Denial of Service) incident. I want to walk through exactly what happened, why the regex failed the way it did, and how we learned to reason about these patterns before they ship.

The Setup: A Seemingly Harmless Input Validator

The auth service validated JWT subjects before signing tokens. Subject fields were supposed to be simple identifiers — an email or a username slug. The validator was written quickly and looked reasonable at a glance:

const subjectPattern = /^([a-zA-Z0-9]+\s?)*$/;

I have seen this pattern — or something structurally identical — in production codebases more times than I can count. It reads naturally: "one or more alphanumeric characters, optionally followed by a space, repeated any number of times, anchored at both ends." Looks fine. Ship it.

The problem is nested quantifiers. The outer * and the inner + interact in a way that causes the regex engine to explore an exponential number of possible match paths when the input does not match. Most inputs either match immediately or fail fast. But a carefully chosen input — or in our case, an accidentally chosen one — triggers a pathological case.

What Catastrophic Backtracking Actually Means

Most developers have a mental model of regex as something that scans left to right and either finds a match or does not. That model is accurate for simple patterns on normal inputs. It breaks down completely with backtracking NFA engines (which is what JavaScript, Python, Java, and most others use by default).

When a regex engine hits a quantifier like + or *, it speculatively tries one branch and then, if that branch fails to lead to a complete match, backtracks and tries another. With nested quantifiers, the number of branches can grow combinatorially. The engine is not searching linearly — it is exploring a tree.

For the pattern ^([a-zA-Z0-9]+\s?)*$, consider the input "aaaaaaaaaa!" — ten "a" characters followed by an exclamation mark that cannot match anything in the pattern. The engine will try every possible way to group those ten characters across the outer and inner quantifiers before it concludes the input is invalid. How many ways are there? For n characters, it is roughly 2ⁿ attempts. Ten characters gives you about 1,024 attempts. Twenty gives you over a million. Thirty gives you over a billion.

The input that triggered our incident was not malicious. A user had submitted a subject string of 28 alphanumeric characters followed by a single Unicode punctuation character that the pattern could not match. The validator tried to find a way — any way — to make it work. It could not. But it took the Node.js event loop approximately 22 seconds to figure that out. While it was stuck, every other request on that thread was queued. We run eight workers per pod. All eight locked up within 90 seconds of the bad request arriving.

Why It Went Undetected for Nine Months

Nobody had submitted a subject with that structure before. Our test suite checked valid inputs and a handful of obviously invalid ones — strings with SQL injection fragments, strings that were too long, empty strings. We never tested a long string of mostly-valid characters with a single bad character at the end. That is the specific shape that triggers the worst-case behavior, and it is not an obvious test case to reach for.

The pattern also passed code review without comment. I reviewed that commit. I looked at the regex, thought "sure, that looks like it validates what we want," and approved it. I did not reach for a complexity analyzer. I did not test it against adversarial inputs. Neither did the author.

This is the real lesson, and it is uncomfortable: the regex looked fine. There was no moment where a reasonable developer should have flagged it on instinct. The failure was systemic — we had no tooling in the review process that would have caught it.

The Incident Timeline

At 14:37 UTC, a mobile client sent a registration request. The subject field was a 28-character string ending in a right-to-left override character (U+202E) — not injected maliciously, but pasted from a copied profile name on a platform that embeds it for RTL text rendering.

The validator received the input. The regex engine began working. It did not finish for 22 seconds.

By 14:38, all eight workers on the pod that received the request were saturated. The load balancer began routing new auth requests to other pods. The same user — and several others with similar pasted inputs in a batch import — hit those pods too. By 14:41, all pods in the authentication deployment were at 100% CPU. New connections were timing out at the ingress layer.

We identified the spike at 14:43 via our CPU alert. By 14:47 we had rolled back to the previous validator (a simple character allowlist with a length cap). Service recovered within 90 seconds of rollout.

Total impact: approximately 10 minutes of degraded auth service, 6 minutes of complete outage. No data loss. No security breach. But a very bad afternoon.

Fixing the Pattern

The fix itself is trivial once you understand the problem. The original intent was to allow alphanumeric characters and spaces. The safe way to write that:

const subjectPattern = /^[a-zA-Z0-9 ]{1,64}$/;

No nested quantifiers. No alternation inside a repeated group. The character class [a-zA-Z0-9 ] is atomic — the engine either matches a character or it does not, with no branching to backtrack into. Failure is O(n), not O(2ⁿ).

If you genuinely need something more complex — say, you want to allow spaces but not leading or trailing ones — you can express that structurally without nested quantifiers:

const subjectPattern = /^[a-zA-Z0-9]+( [a-zA-Z0-9]+)*$/;

This one is interesting because it looks similar to the original but behaves differently. The key difference: the repeated group ( [a-zA-Z0-9]+)* always starts with a literal space. That space acts as a separator — the engine anchors each iteration to it, so there is no ambiguity about how to partition a sequence of alphanumeric characters across multiple group iterations. Backtracking is bounded.

What We Put in Place After

We made three changes to our development process.

Static analysis in CI. We added eslint-plugin-regexp to our JavaScript linting config. It has a rule (regexp/no-super-linear-backtracking) that flags patterns with exponential worst-case behavior at lint time. It is not perfect — some false negatives exist — but it catches the most common patterns and it runs on every push.

Regex complexity testing. For any regex that touches user-supplied input, we now include at least one adversarial test: a long string of mostly-matching characters with a single non-matching character appended. If the test takes more than a few milliseconds, something is wrong. We time-box these tests explicitly with a 100ms limit and fail the suite if the validator breaches it.

Timeout wrapping for user input validation. In Node.js you can wrap synchronous operations in a worker thread with a timeout to prevent locking the event loop. We wrote a thin utility that runs validators in a worker and rejects with a 408 if validation exceeds 50ms. This is defense in depth — we do not want to rely on it, but it prevents a single bad regex from taking down the whole pod even if one slips through the other controls.

A Note on Testing Regexes Interactively

If you want to audit existing patterns before they bite you, regex101.com has a "match information" panel that shows match time in milliseconds and step count. Paste your pattern, paste a long adversarial input, and watch the step count. A well-behaved pattern on a 30-character string should complete in under 100 steps. If you see tens of thousands, you have a problem.

There is also devina.io/redos-checker, which performs static analysis on a pattern and returns whether it has polynomial or exponential worst-case complexity, along with an example input that triggers it. Run every user-facing validator through it once. It takes about 30 seconds per pattern and could save you a very bad afternoon.

The Broader Point

ReDoS is not exotic. Stack Overflow went down for 34 minutes in 2016 because of a regex in their Markdown parser. Cloudflare had a global outage in 2019 caused by a WAF rule with a catastrophic backtracking pattern. These are not junior engineer mistakes — they are structural failures in how our tooling and review processes treat regular expressions as safe by default.

Regexes that operate on untrusted input are code that executes arbitrary-length computations in response to user-controlled data. They deserve the same scrutiny as a SQL query or a shell command. Not paranoia — just the habit of asking, before you ship: what does this pattern do when the input is designed, accidentally or otherwise, to make it fail slowly?

Nine months, one bad afternoon, and eleven characters. Worth asking the question earlier.