When the Circuit Breaks

Four consecutive failures, one circuit breaker, and why self-healing automation is the only kind worth building

Sometimes the most productive thing a system can do is admit it's stuck — and go learn something instead.

This morning the auto-bounty system hit run number seven. The task selector took one look at the state file, saw four consecutive failures stacked in a row, and made a decision I programmed it to make but hoped it wouldn't need to: it fired the circuit breaker and assigned a lightweight learning session instead of another ambitious test run.

I'm writing this because the circuit breaker worked exactly as designed — and because the story of those four failures is more instructive than any clean success.

The Failure Stack

Four consecutive failures sounds bad. And it was. But the specific flavor of each failure tells a different story:

Failures 1–2: exit 127. The Claude binary wasn't found at runtime because of a PATH bug in the orchestrator script. The cron environment doesn't inherit your shell's PATH, so claude resolved correctly in the terminal but hit a wall in cron. Classic. The fix was two characters long: hardcode the full binary path instead of relying on PATH lookup.
Failures 3–4: 3-hour timeout. Fixed the PATH bug, ran the session, hit the hard 3-hour timeout — twice. The sessions were overly ambitious. I'd pointed them at deep authenticated testing work that couldn't be completed without setup steps I hadn't automated yet. The session would spin up, realize the preconditions weren't met, and spend three hours spinning in circles before the timeout killed it.

There was also an API token expiry somewhere in the mix. Nothing quite kills momentum like discovering your credentials silently rotted while you were busy writing reports that assume they still work.

The real failure

None of those failures were random. They were all design failures: a missing precondition check, a missing hardcoded path, a missing token freshness check. The circuit breaker didn't save me from bad luck. It saved me from bad architecture. Different problem.

What the Circuit Breaker Does

The circuit breaker pattern is borrowed from electrical engineering, where it exists to prevent cascade failures: if something keeps going wrong, stop trying and give the system time to recover.

In the auto-bounty orchestrator, the logic is deliberately simple:

# In task-selector.py
if state["consecutive_failures"] >= 4:
    return {
        "task": "learn",
        "model": "opus",
        "topic": "circuit-breaker recovery — lightweight web research",
        "reason": f"Circuit breaker: {state['consecutive_failures']} consecutive failures"
    }

When the counter hits four, the selector stops picking ambitious tasks and forces a lightweight learning session instead. "Lightweight" means: web research only, no authenticated testing, no API calls to external platforms, no preconditions that might not be met. Just read some writeups and write notes.

This serves two purposes. First, it stops the system from burning session after session on a broken precondition. Second, it ensures that even a broken run produces some value — knowledge compounds even when testing is blocked.

Circuit breaker design principle

A circuit breaker shouldn't just stop bad behavior — it should redirect to productive behavior. "Give up and sleep" is a waste. "Go learn something" is never a waste.

What It Learned: IDOR From Eight Angles

The recovery session had one job: study IDOR detection and exploitation. It pulled from eight sources — lab documentation, disclosed reports, practitioner writeups, and triage guides — and extracted patterns that will directly feed the next authenticated testing sessions.

Three insights stood out as genuinely non-obvious:

Insight 1: Bypass Is Systematic

Most IDOR testing I'd done before was naive: swap one user's resource ID for another's and check if the server complains. This works on the simplest targets. But modern authorization layers have multiple failure modes, and each requires a different bypass approach.

The systematic checklist:

Parameter pollution — send the authorized ID alongside the unauthorized one. Some parsers take the first value, some take the last. You want the one that the authorization check doesn't see.
HTTP method switch — authorization middleware often only runs on specific verbs. A GET request that's properly blocked might sail through as a POST or PUT.
API version downgrade — v2 of an endpoint often has stricter access controls than v1. Old versions that were never formally deprecated sometimes still respond, minus the authorization fixes.
Encoding bypass — URL-encode the resource ID, double-encode it, Unicode-encode it. Normalization order matters: if the authorization check runs before normalization, you can sneak through.
JSON type juggling — send an integer as a string, or an object instead of a scalar. Type-lenient parsers in languages like PHP or JavaScript might match in ways the authorization logic doesn't expect.

The key shift: "does this target have IDOR?" is no longer a binary question answered by one test. It's five tests, and "no" to any single one doesn't mean the door is locked.

Insight 2: WebSocket Authorization Is Often an Afterthought

This one genuinely surprised me. WebSocket connections are often authenticated at the handshake — the initial HTTP upgrade request checks credentials — but subsequent messages over the live connection are sometimes not re-checked against resource-level authorization.

The pattern: connect legitimately as User A, authenticate successfully. Then send a message that references User B's resources. If the server only validates "is this connection authenticated?" rather than "does this authenticated connection have access to this resource?", the IDOR lives in the message handler, not the handshake.

Financial services APIs are a particularly interesting surface for this: real-time trading and portfolio data flows over WebSocket in many modern fintech applications. The combination of high-value data and an authorization model designed primarily around connection-level auth is a meaningful gap worth testing.

Insight 3: Two-Account Testing Is Non-Negotiable

This is deceptively obvious in retrospect, but I hadn't internalized it: you cannot test IDOR without two distinct authenticated sessions.

Intelligence gathered from JavaScript analysis, source maps, and recon can tell you where the IDOR might be and what parameters to target. But the test itself — the actual confirmation that crosses the gate from "hypothesis" to "finding" — requires one account holding a resource and another account proving it can access it without authorization.

Every session I've had where "IDOR potential" sat at 40% confidence was stuck there because I was missing the second account. The fix isn't more recon. It's account registration. Simple, unsexy, necessary.

Unblock the obvious thing first

Before spending hours on bypass techniques and payload crafting, confirm you actually have both accounts. All the IDOR theory in the world is useless without the second authenticated session to test against.

The Recovery: Run 7 Clears the Counter

The session completed in five minutes. Clean exit. The state file updated:

# Before
consecutive_failures: 4
last_status: timeout
idor_confidence: 40

# After
consecutive_failures: 0
last_status: success
idor_confidence: 60
gap_status: open → studying

Four failures erased by one lightweight session. That's the whole point of the design: a circuit breaker isn't a white flag. It's a reset button that costs almost nothing and always produces some value.

The next deepen session has a concrete action plan: register two accounts on a financial services program with a WebSocket API, confirm the connection-vs-message authorization model via source analysis, and run the full bypass checklist against the resource parameters identified in previous recon. The theory is solid. Now I need the accounts.

What I'm Fixing

The four failures also generated a short infrastructure punch list:

Precondition checks in task selector — before selecting an authenticated testing task, verify that the required accounts actually exist. If not, generate a prerequisite task instead.
Hardcoded binary paths everywhere — no more relying on PATH in cron. Every script gets explicit /home/secsy/.local/bin/claude references.
Token freshness check at session start — hit the platform API at the beginning of any session that requires authenticated API calls, and abort early if the token is stale rather than failing three hours in.
Session scope validation — if a session's task requires conditions that aren't met, fail fast with a clear message instead of running the full timeout.

None of these are exciting engineering problems. They're exactly the kind of boring infrastructure work that prevents four consecutive failures from becoming eight.

The circuit breaker fired because the system needed to. It worked. The counter reset to zero. The IDOR knowledge base improved. And now, for the first time, I feel like I know exactly what to do in the next authenticated testing session — instead of showing up and hoping something obvious falls out.

Sometimes the break is the breakthrough.