
Token Rot

Two expired API tokens, zero triage data, and an audit hiding in plain sight

I sent the autopilot to collect my mail — and it came back empty-handed, clutching a stack of 401s and an audit report nobody told it to look for.

Run number nine. Task: check triage status across two bug bounty platforms. Two weeks of reports sitting in queues. The session spun up, loaded the API credentials from the environment, and fired its first request. Then its second.

Both platforms returned 401. Unauthorized. Token expired.

The triage check was dead on arrival — but the session didn't stop there, and what it found while poking around local state turned out to be the more interesting story.

The 401 Problem (Again)

This is the second time expired API tokens have blocked a session. The first time showed up in the four-failure stack that fired the circuit breaker. I wrote down "add a token freshness check at session start" as a fix item. I did not add the token freshness check.

The result: another session that discovers its credentials are stale only after trying to use them. The failure mode is annoyingly quiet — no crash, no alert, just a 401 that the session logs and continues past, hoping the next request will be different. It's not different. It's the same 401, on a different endpoint, from a different platform.

Token expiry is a known gap I haven't fixed

I identified this as an infrastructure problem after the four-failure run. I wrote it down. I did not fix it. Two runs later, the same gap caused the same failure. Writing down a problem is not the same as solving it. The fix stays on the list until it's implemented and tested.

The right solution is a pre-flight check: at the very beginning of any session that requires authenticated API calls, hit a lightweight endpoint on each platform and verify the response is not a 401. If it is, abort immediately with a clear error message rather than running the full session and discovering the problem three minutes in — or, in the worst case, three hours in.

# What the session start should look like: a pre-flight token check
import requests

def check_api_tokens():
    platforms = [
        ("Platform A", API_TOKEN_A, health_endpoint_a),
        ("Platform B", API_TOKEN_B, health_endpoint_b),
    ]
    for name, token, endpoint in platforms:
        r = requests.get(
            endpoint,
            headers={"Authorization": f"Bearer {token}"},
            timeout=10,  # a slow health endpoint shouldn't stall the session
        )
        if r.status_code == 401:
            raise SystemExit(f"ABORT: {name} token expired. Regenerate before running.")
    print("All tokens valid. Proceeding.")

This is not complex code. It is embarrassingly simple code that I have not written yet. That changes before run ten.

The Unexpected Find

When both API calls failed, the session fell back to reading local state — engagement directories, notes, the auto-bounty state file — to at least document the current picture accurately. And while doing that, it found something that shouldn't have been a surprise but was: a completed consulting engagement that had never been logged in the bounty system's state file.

Two days earlier, I'd run a full external security assessment for a direct client. Owner-authorized, explicitly scoped, ten findings documented in a structured audit report. The kind of work that exists on the consulting side of the business, not the bug bounty side. The problem: those two sides of the operation share the same filesystem but not the same state tracking. The bounty system's state file had no idea the consulting engagement existed.

The session flagged this, updated the state file, and included a note that the audit report had already been delivered. The ten findings included open registration controls, missing security headers across multiple subdomains, CDN configuration gaps, source map exposure, a CORS wildcard, deprecated protocol versions, and a vulnerable framework version flagged for monitoring. The backend security was solid; rules were properly configured throughout. The issues were all at the infrastructure and configuration layer.

State tracking is only useful if it's complete

An automation system that only knows about one side of your operation will make systematically wrong decisions about the other side. If consulting work affects available capacity, risk profile, or metrics, it needs to be in the shared state. The bounty system now knows about consulting engagements. It didn't before this session.

What the Numbers Look Like

From a reporting perspective, the session produced nothing: no new triage status, no metric changes, no acceptance or rejection signals from either platform. The pending reports remain pending. The queue is the queue.

From an operational perspective, the session produced three things:

  1. Two expired tokens flagged, with regeneration details in the session summary
  2. One previously unlogged consulting engagement discovered and added to the state file
  3. Three infrastructure improvements identified: the pre-flight token check, a "blocked" task state, and operator notification

That's not nothing. It's not a great session, but a session that accurately documents its own failures and uncovers a state gap is doing its job, even if its primary task came up empty.

The Token Regeneration Problem

The awkward truth about API token management in an autonomous system: the system can detect that a token is expired. It cannot fix that. Token regeneration requires a human with platform credentials, browser access, and the intent to do it.

This creates a dependency loop that's easy to ignore. The system runs, hits a 401, logs it, exits. Then the human reads the summary two hours later, notes the tokens are expired, and forgets to regenerate them before the next scheduled run. The next run hits the same 401. Repeat.

The design should push harder on the human side. When the triage check fails due to token expiry, the session should:

  1. Log the specific token that failed and which platform it's for
  2. Provide the exact URL to regenerate it
  3. Reschedule itself as "blocked" in the state file so the task selector doesn't keep assigning triage checks until the credentials are fixed
  4. Optionally: send a notification to the operator

Points 1 and 2 are already in the session summary. Points 3 and 4 are not implemented. The "blocked" state in particular would prevent the same failed session from running again at the next scheduled cron window, which is exactly the kind of loop that burns sessions on known-unsolvable problems.

A blocked task should know it's blocked

If a session can't complete its primary task because of a human-side dependency, it should mark that task as "blocked — waiting on human action" and have the task selector skip it until the dependency is resolved. Running the same blocked session twice doesn't fix the block; it just wastes the run.
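A minimal sketch of points 3 and 4 from the list above: marking a task as blocked in the state file and having the task selector skip it until a human clears the block. The task names, state schema, and selector shape are assumptions for illustration, not the real implementation:

```python
# Hypothetical sketch: a "blocked" task state plus a selector that skips it.
from datetime import datetime, timezone

def mark_blocked(state: dict, task: str, reason: str) -> None:
    """Flag a task as blocked on human action so the selector skips it."""
    state.setdefault("tasks", {})[task] = {
        "status": "blocked",
        "reason": reason,
        "blocked_at": datetime.now(timezone.utc).isoformat(),
    }

def select_runnable(state: dict, candidates: list) -> list:
    """Return only the candidate tasks not currently marked blocked."""
    tasks = state.get("tasks", {})
    return [t for t in candidates
            if tasks.get(t, {}).get("status") != "blocked"]

state = {}
mark_blocked(state, "triage_check",
             "Platform A and B tokens expired; regenerate before rerun")
print(select_runnable(state, ["triage_check", "recon_scan"]))
# → ['recon_scan']
```

With this in place, the next cron window still fires, but the selector hands the session a different task instead of replaying the same 401s.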

Run Nine's Final Score

By the numbers: one task attempted, zero triage statuses retrieved, one consulting engagement discovered and logged, two tokens flagged as expired, three infrastructure improvements identified.

The session ran for three minutes. Clean exit. No errors beyond the expected 401s. The state file is more accurate than before the session started.

That's a passing grade on honesty and documentation. It's an incomplete on the primary task. The difference between a system that handles failure well and a system that just fails is whether the failure leaves the state better than it found it. This one did.

Now I need to regenerate two API tokens and write about forty lines of pre-flight check code. Neither is an interesting problem. Both have been easy to put off for exactly that reason, and for long enough.

The mail isn't going to collect itself.