What We Actually Built
We set out to find bugs. Somewhere along the way, we built something else entirely.
The plan was simple. Point an AI at bug bounty programs. Find vulnerabilities. Write reports. Collect bounties. Scale up. The kind of plan that sounds clean on a whiteboard and survives approximately one encounter with reality.
Eight weeks and 57 autonomous runs later, the bounty count is zero. The revenue line reads $0. And the thing that actually got built is not a pen testing service. It's an autonomous security research system with a compounding knowledge base, a self-improving methodology, and — if we're being honest about it — more infrastructure than most security teams build in a year.
None of this was the plan. All of it turned out to be the point.
The Inventory
Here is what exists at run 57. Not what was designed upfront — what actually got built, piece by piece, in response to things going wrong:
27 executable skills that encode a complete security research methodology — from initial reconnaissance through report submission — in a form that an AI system can execute autonomously. Each skill has guardrails, evidence requirements, quality gates, and failure modes documented from experience.
12 codified attack techniques, each with a severity rating, applicability rules (what kind of target is this technique likely to work on), step-by-step test procedures, mutation variants (3–5 ways to try the same core idea), and a replay log showing which targets it's been tested on and what happened.
A campaign system that takes a proven technique and systematically queues it for testing across every target that matches its applicability rules. Five campaigns active. Cross-program coordination without manual tracking.
A program taxonomy that fingerprints targets by tech stack, authentication model, API surface area, bounty range, and calculated technique match score. Twelve programs profiled. The system knows, before it starts testing, which of its 12 techniques are most likely to work on which target.
A 7-gate validation framework that every finding must pass before a report gets written. Seven questions: Is this a duplicate? In scope? Correctly classified? Supported by evidence at the claimed tier? Does the kill chain complete end-to-end? Would a skeptical triager accept this? Does the attack path make sense to a defender? Seventeen findings failed these gates in the early weeks. All seventeen deserved to fail.
An autonomous orchestrator that runs twice daily, selects its own tasks based on engagement state and strategic priorities, manages VPS resources, implements circuit breakers for consecutive failures, and updates its own knowledge base with results.
A triage feedback loop that captures every triager response — acceptance, rejection, request for more info, objection — analyzes the pattern, and feeds it back into evidence standards, severity calibration, and report formatting.
45+ published technical articles documenting every phase of this process in real time. Not after the fact. Not sanitized. Written during the work, including the failures.
Each item started as a reaction to a specific problem. Together, they're an operating system.
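To make the shapes concrete: here is a minimal sketch of what a technique entry and a taxonomy match score could look like. Every name in it is hypothetical — an illustration of the structures described above, not the system's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical shapes -- illustrative names, not the real schema.

@dataclass
class Technique:
    name: str
    severity: str                      # e.g. "high"
    applicability: set[str]            # target traits this technique needs
    steps: list[str]                   # step-by-step test procedure
    mutations: list[str]               # 3-5 variants of the same core idea
    replay_log: list[dict] = field(default_factory=list)  # target -> outcome

@dataclass
class Program:
    name: str
    traits: set[str]                   # tech stack, auth model, API surface...

def match_score(technique: Technique, program: Program) -> float:
    """Fraction of the technique's applicability rules the target satisfies."""
    if not technique.applicability:
        return 0.0
    hits = technique.applicability & program.traits
    return len(hits) / len(technique.applicability)

oauth_chain = Technique(
    name="oauth-redirect-abuse",
    severity="high",
    applicability={"oauth", "web-app", "third-party-login"},
    steps=["map redirect_uri handling", "test open-redirect chaining"],
    mutations=["wildcard subdomain", "path traversal in redirect_uri", "encoded scheme"],
)

trading_platform = Program(name="example-trading", traits={"oauth", "web-app", "rest-api"})

print(match_score(oauth_chain, trading_platform))  # 2 of 3 rules match
```

The point of the structure is that the match score can be computed before any testing starts — which is exactly what lets the taxonomy rank techniques per target.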
How Systems Build Themselves
The technique registry wasn't designed. It was demanded.
The first high-severity finding — an OAuth abuse chain on a trading platform, critical severity, end-to-end proof of concept — turned out to be a duplicate. Another researcher had found the same pattern 11 days earlier. Eleven days. The gap between discovering a technique and being first to report it was smaller than the gap between discovering it and submitting it.
The question became: how do you take a proven technique and test it on every applicable target faster than you discovered it the first time? You can't move faster by trying harder. You move faster by codifying. Write down the exact steps, the bypass mutations, the evidence requirements, the target profile where it applies. Then point the codified version at the next matching target. The registry exists because one duplicate cost enough to build prevention for the next one.
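The replay step itself is small once the technique is codified. A sketch, with hypothetical names: take one technique, match it against every profiled program, and queue each mutation for each matching target.

```python
# Hypothetical sketch of campaign replay: queue a codified technique's
# mutations against every program whose traits satisfy its applicability rules.

def build_campaign(technique: dict, programs: list[dict], threshold: float = 1.0) -> list[dict]:
    """Return (target, mutation) work items for every matching program."""
    queue = []
    rules = technique["applicability"]
    for program in programs:
        score = len(rules & program["traits"]) / len(rules)
        if score >= threshold:
            for mutation in technique["mutations"]:
                queue.append({"target": program["name"], "mutation": mutation})
    return queue

technique = {
    "name": "oauth-redirect-abuse",
    "applicability": {"oauth", "web-app"},
    "mutations": ["wildcard subdomain", "encoded scheme", "path traversal"],
}
programs = [
    {"name": "alpha", "traits": {"oauth", "web-app", "rest-api"}},
    {"name": "bravo", "traits": {"saml", "web-app"}},   # lacks oauth -> skipped
]

queue = build_campaign(technique, programs)
print(len(queue))  # alpha matches fully: 3 queued tests
```

Discovery took days the first time; the queued replays run in whatever time the orchestrator allocates. That asymmetry is the whole argument for codifying.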
The validation framework wasn't designed either. It was born from a 35% acceptance rate — two-thirds of early reports rejected by triagers, marked out of scope, or downgraded in severity. The framework exists to ask seven questions before writing a single line of report. Not seven aspirational questions. Seven questions derived from actual rejections: "Is this really in scope?" came from 4 out-of-scope rejections. "Does the evidence meet the tier?" came from a report marked SPAM because the triager thought it was AI-generated noise. "Would a skeptical triager accept this?" came from a triager who misread a valid finding as a known-issue information disclosure.
Every gate has a name. Every name has a rejection behind it.
Rejection Is Specification

Every triager objection, every "not applicable" response, every "insufficient evidence" is a requirement being communicated in the harshest possible format. The 7-gate framework isn't a quality initiative — it's a compiled collection of every way reports have been rejected, turned into pre-submission checks. The system that rejects its own work before triagers can is the system that stops wasting everyone's time.
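Compiled checks are easy to picture as code. A hypothetical sketch of the seven gates as named predicates over a finding — the gate names and field names here are illustrative, but the shape is the point: a report gets written only when the failure list comes back empty.

```python
# Hypothetical pre-submission checklist mirroring the seven gates.
# Each gate is a named predicate over a finding dict.

GATES = [
    ("not_duplicate",    lambda f: not f["known_duplicate"]),
    ("in_scope",         lambda f: f["in_scope"]),
    ("classified",       lambda f: f["classification"] is not None),
    ("evidence_at_tier", lambda f: f["evidence_tier"] >= f["claimed_tier"]),
    ("kill_chain",       lambda f: f["kill_chain_complete"]),
    ("triager_test",     lambda f: f["skeptic_reviewed"]),
    ("defender_test",    lambda f: f["attack_path_plausible"]),
]

def validate(finding: dict) -> list[str]:
    """Return the names of gates the finding fails; empty means write the report."""
    return [name for name, check in GATES if not check(finding)]

finding = {
    "known_duplicate": False,
    "in_scope": True,
    "classification": "IDOR",
    "evidence_tier": 2,
    "claimed_tier": 3,        # claims more than the evidence supports
    "kill_chain_complete": True,
    "skeptic_reviewed": True,
    "attack_path_plausible": True,
}

print(validate(finding))  # ['evidence_at_tier'] -> fix before writing anything
```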
From Testing to Protocol
Pen testing is a service. Someone pays you, you test their application, you write a report, you get paid. It scales with hours worked: twice the clients means twice the hours. There's nothing wrong with it. Most of the security industry runs this way.
What this system does is structurally different. It doesn't just test — it learns, codifies, and replays. Every engagement makes the next one faster. Every failure gets captured in a pattern that prevents the same failure on the next target. Every triager rejection gets logged, analyzed, and converted into a calibration update that changes how the next report is written.
The retrospective loop runs after every engagement. It feeds findings into the technique registry. It feeds rejections into the evidence standards. It feeds program behavior into the taxonomy. It feeds timing data into the orchestrator's task selection. None of these updates require manual intervention. The system runs, encounters friction, documents what caused the friction, and adjusts.
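The feedback step is mechanically simple. A hypothetical sketch of how a triager response might nudge calibration state that the next report reads from — the response categories and calibration fields here are illustrative, not the system's actual ones.

```python
# Hypothetical sketch: fold each triager response into calibration state
# that future reports consult. Field names are illustrative.

calibration = {
    "min_evidence_tier": 2,   # evidence standard applied before submission
    "severity_bias": 0.0,     # positive means we tend to over-claim severity
}

def apply_triage_response(response: str) -> None:
    """Adjust calibration based on one triager response."""
    if response == "insufficient_evidence":
        calibration["min_evidence_tier"] += 1
    elif response == "severity_downgraded":
        calibration["severity_bias"] += 0.1
    elif response == "accepted":
        # acceptances slowly relax the severity bias back toward zero
        calibration["severity_bias"] = max(0.0, calibration["severity_bias"] - 0.05)

for response in ["severity_downgraded", "insufficient_evidence", "accepted"]:
    apply_triage_response(response)

print(calibration)  # evidence bar raised to 3; severity bias nudged up, then relaxed
```

None of the individual updates is clever. The compounding comes from running the loop after every engagement without exception.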
That's not a service. It's a protocol. A protocol that produces security findings as a byproduct of self-improvement.
The distinction matters because the world is about to change.
What's Coming
This week, a frontier AI model was announced that found thousands of zero-day vulnerabilities autonomously — including bugs in every major operating system and browser, some of which had survived decades of security auditing and fuzzing. The model doesn't just find vulnerabilities. It chains them. Heap sprays, kernel address space bypasses, return-oriented programming across network packets. Bugs that the best human researchers and the best automated tools missed for 17, 16, 27 years.
That model is gated today. Restricted to a consortium of large technology companies and critical infrastructure operators. It won't be gated forever.
When models that find and exploit vulnerabilities at that scale become broadly available, the value of "point tools at target, find bugs" drops toward zero. Why hire a pen tester when a model can scan your entire codebase in hours and produce working exploits for every finding?
What doesn't drop: the ability to contextualize. To prioritize. To understand which of 3,000 findings actually matters for this specific business with this specific threat model. To validate that a theoretical exploit works in a real production environment with real defenses. To translate a CVE into a remediation plan that fits a specific engineering team's capacity. To learn from one engagement and apply that learning to the next, and the next, and the one after that.
The judgment layer. The learning layer. The layer that compounds.
That layer is what 57 runs built. Not a list of vulnerabilities — a machine for producing, validating, contextualizing, and learning from vulnerabilities. The technique registry doesn't just hold 12 attack patterns. It holds the knowledge of which patterns work on which types of targets, which mutations bypass which defenses, and which evidence triagers actually accept. A scanning model can find a bug. The registry can tell you whether the bug matters, whether it's been seen before, and whether the remediation team will take it seriously.
Scanning finds bugs. Research understands them.
The frontier model found thousands of vulnerabilities. Twelve founding partners and forty additional organizations are processing those findings now. Every organization not in that consortium still needs to know: which of these bugs affect my stack? In what order should I patch? What's exploitable in my specific configuration? That's not scanning. That's research. And research compounds in a way that scanning never does.
What This Means
The system at run 57 was built to find bugs and collect bounties. It found bugs. It hasn't collected bounties yet. But what it built along the way — the technique registry, the campaign system, the validation framework, the learning loop, the program taxonomy, the autonomous orchestrator — is infrastructure that outlasts any individual finding.
A P1 finding is worth a one-time payout. The technique registry that produced the P1 — and can replay it across every similar target — is worth every future finding it generates. If the choice is between one more hour of testing and one more hour of systematizing what testing has taught you, systematize. You can only test one target at a time. A system can test them all.
We set out to do AI-powered pen testing. What we actually built is autonomous security research infrastructure — a system that discovers, validates, learns, and compounds. The pen testing is still happening inside it. But the thing that makes it valuable isn't the testing. It's the protocol the testing runs on.
The plan was to find bugs. The result was to build a machine that finds bugs, learns from finding them, and gets better at finding the next ones. Somewhere between the plan and the result, the product changed. The bugs are evidence. The machine is the product.
Run 58 will find bugs or it won't. The machine will learn either way. That's the whole point.