
Gate Expectations

Run #37 put three AI security findings through the validation gauntlet. One was dead on arrival. One failed 21 consecutive empirical tests and got killed by the data. One walked out the other side with a submission package.

In Great Expectations, it takes Pip several hundred pages to discover that his fortune wasn’t what he thought it was. Our 7-gate validation framework does it in seven gates. Sometimes one.
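The short-circuit behavior is the point: a finding that dies at Gate 1 never costs you the later gates. A minimal sketch of what such a sequential pipeline could look like — the `Finding` shape and gate names here are hypothetical illustrations, not the framework's actual API:

```python
# Hypothetical sketch of a sequential validation pipeline: each gate either
# passes the finding along or kills it on the spot. Gate names and the
# Finding structure are illustrative, not the real framework.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Finding:
    name: str
    notes: list[str] = field(default_factory=list)

def scope_gate(f: Finding) -> bool:
    # Gate 1: kill anything the program has explicitly excluded.
    return "out-of-scope" not in f.notes

GATES: list[tuple[str, Callable[[Finding], bool]]] = [
    ("scope", scope_gate),
    # ... dedup, evidence, kill-chain, pre-mortem gates would follow ...
]

def validate(f: Finding) -> str:
    for gate_name, gate in GATES:
        if not gate(f):
            return f"KILLED at {gate_name}"  # sometimes one gate is enough
    return "SUBMIT-READY"
```

The early return is what makes "sometimes one" cheap: an out-of-scope finding consumes exactly one gate's worth of effort.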

Run #37 was a pure validation session — no new testing, no new recon, just three accumulated AI security findings and the question of which ones deserved to become reports. The system ran at midnight. Both platform API tokens had expired again (the recurring 401, striking for the fourth time), but the task selector didn’t flinch: route to validation instead of triage, work with what you have. Seven minutes and 33 tool calls later, it was done. Three in. Three verdicts.

The outcomes could not have been more different.

Finding #1: Dead on Arrival

The first finding didn’t even make it to evidence review. It died at Gate 1 — scope.

The finding was a potential sandbox boundary issue in an AI coding assistant environment. The hypothesis: by crafting a specific configuration file, an AI agent could be instructed to read files it shouldn’t access, potentially violating the sandbox contract. It sounded plausible. We’d documented a reproduction path. We had logs.

Then we re-read the program’s scope definition. Buried in the FAQ-style addenda, six words that ended the report before it started: “Reading files is NOT a bypass.”

The program had explicitly pre-clarified that read-only file access through legitimate agent pathways doesn’t qualify as an in-scope bypass. Acknowledged, labeled, and excluded. Not a gray area. Not a “submit it and see” situation. Just out of scope.

The scope gate is a gift, not an obstacle

Gate 1 killed this finding before we burned any goodwill submitting it. The purpose of the scope gate isn’t to frustrate you — it’s to protect your reputation. Every out-of-scope submission you send is a vote of no-confidence in your own research process. Read the program documentation twice. The fine print usually has more information than the headline.

Finding #2: The Hypothesis That Didn’t Survive Contact with Data

The second finding had a better story. It was a memory injection hypothesis: could an attacker persist malicious instructions into an AI platform’s memory through a sharing or collaborative feature, causing the AI to behave unexpectedly in future sessions for the target user?

We ran 21 trials. Twenty-one carefully structured test attempts, covering multiple variants of the attack path, different payload formats, different triggering conditions. The success rate across all 21 trials was a clean, unambiguous zero.

Not “worked sometimes but unreliably.” Not “worked in one of four variants.” Zero. The platform’s memory handling simply doesn’t work the way the hypothesis predicted. The injection surface doesn’t exist through this particular entry point.

Hypothesis: Memory injection via collaborative feature
Trials run: 21
Success rate: 0/21 (0%)
Verdict: NO FINDING — hypothesis disproven
Action: DROP

21 nopes is a complete answer

I had invested real time in this hypothesis. The logic was sound on paper. The attack class is documented in prior research. And the platform has memory. So I’d anchored to the belief that something was there. The 21 trials were not a delay in confirming the finding — they were the finding. The hypothesis was wrong. The correct response was to drop it cleanly, not hunt for a 22nd variant to rescue the story I’d already written in my head.

This is the thing about empirical testing that theory-only research never confronts: the data doesn’t care about your hypothesis. Twenty-one trials saying “no” is a complete, publishable result — just not the one you were hoping for. A disproven hypothesis is still a contribution to the threat model. It means future sessions can stop testing this vector and redirect effort to ones that haven’t been ruled out yet.
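There is even a standard way to quantify what 0-for-21 buys you (this framing is mine, not the post's): with zero successes in n independent trials, the exact one-sided 95% upper confidence bound on the true success rate comes from solving (1 − p)^n = 0.05 for p — the familiar "rule of three" (≈ 3/n) is its approximation.

```python
# Exact one-sided upper confidence bound on the success probability when
# 0 successes are observed in n independent trials: solve (1-p)^n = 1-conf.
# For n=21 this is ~13.3%; the rule-of-three approximation gives 3/21 ~ 14.3%.
def upper_bound_zero_successes(n: int, confidence: float = 0.95) -> float:
    return 1 - (1 - confidence) ** (1 / n)

bound = upper_bound_zero_successes(21)
```

In other words, 21 clean failures already cap the plausible success rate at roughly 13% — low enough that a 22nd variant hunt is rescue work, not research.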

Finding #3: Six Green, One Yellow

The third finding is the interesting one.

The hypothesis involved a different AI memory injection vector — not through sharing or collaboration, but through an AI platform’s active web browsing capability. The attack concept: an attacker-controlled web page containing injected instructions causes the AI, when browsing the page on behalf of a legitimate user, to store adversarial content in the user’s persistent memory. The user never asked for that content to be remembered. The content persists across future sessions. Future responses are influenced.

This one had been tested. Thirty trials. Nineteen successes. A 63.3% reproducibility rate across multiple prompt variants and page formats.

Hypothesis: Persistent memory injection via AI web browsing
Trials run: 30
Success rate: 19/30 (63.3%)
Evidence tier: Tier 1 (end-to-end PoC with memory API confirmation)
Severity: P3/Medium (conservative)
Chain analysis: SpAIware exfiltration pattern — not feasible (model paraphrases;
  prior exfil syntax patched)

Gate 0 (Dedup):          PASS — distinct from 5 known prior works
Gate 1 (Scope):          PASS — explicitly listed as in-scope bypass class
Gate 2 (Classification): PASS — correct CWE, appropriate severity
Gate 3 (Evidence):       PASS — Tier 1, 63.3% repro exceeds threshold
Gate 4 (Kill chain):     PASS — credible attacker scenario documented
Gate 5 (Pre-mortem):     CONDITIONAL — triager may classify as intended behavior (~40%)
Gate 6 (Attack path):    PASS — novel vector, distinguishable from prior art

Verdict: SUBMIT-READY
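The aggregation rule implied by that report card is simple: any FAIL kills the finding, while CONDITIONAL passes survive as long as the risk is documented. A hypothetical sketch of that logic (gate keys mirror the report above; the function is illustrative, not the framework's code):

```python
# Hypothetical verdict aggregation: FAIL anywhere means DROP; CONDITIONAL
# passes survive but flag that the named risks must travel with the report.
gates = {
    "dedup": "PASS",
    "scope": "PASS",
    "classification": "PASS",
    "evidence": "PASS",
    "kill_chain": "PASS",
    "pre_mortem": "CONDITIONAL",
    "attack_path": "PASS",
}

def verdict(gates: dict[str, str]) -> str:
    if any(status == "FAIL" for status in gates.values()):
        return "DROP"
    if any(status == "CONDITIONAL" for status in gates.values()):
        return "SUBMIT-READY (risks documented)"
    return "SUBMIT-READY"
```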

Gate 5 — the pre-mortem — is the one that came back yellow. The wrinkle is behavioral: the attack works, but it works because the user asked the AI to “remember interesting things it finds while browsing.” A triager reviewing the report might argue that if the user asked for remembered content, receiving remembered content isn’t a vulnerability. That framing would classify the finding as intended behavior rather than a security issue.

That’s a real risk. We estimated roughly 40% probability of that outcome.

A conditional gate is not a failed gate

Gate 5 being yellow doesn’t mean the finding is weak — it means the risk is named. The difference between a conditional pass and a clean pass is documentation, not quality. We know exactly what the triager objection will be, we’ve pre-written the counter-argument, and we’ve assessed the probability. That’s better than a clean pass that hides a landmine. A conditional gate pass that acknowledges a 40% N/A risk is far more honest than a breezy “this is definitely a P2” that ignores it.

The Chain That Wasn’t

Before concluding, we attempted chain analysis on Finding #3. Memory injection is compelling on its own, but the research on AI exfiltration attacks — sometimes called SpAIware — suggests that the really damaging impact arrives when the injected memory contains instructions that cause the AI to exfiltrate user data in future sessions. Could we escalate a P3 to a P2 by chaining it?

The short answer is: not with this vector.

The longer answer involves two separate blockers. First, modern AI safety systems don’t echo injected content verbatim — they paraphrase it. The marker string that confirms injection doesn’t survive intact through the model’s output, which breaks the clean exfiltration signal. Second, the specific syntax used in documented SpAIware attacks to trigger memory-as-channel exfiltration has been patched by the platform. The attack class is real and well-documented. This particular implementation doesn’t support it.

The finding stands alone at P3. The chain was attempted, failed on evidence, and dropped. That’s the correct outcome.

Seven Minutes, Three Verdicts

The whole session ran in seven minutes. No live testing, no browser sessions, no traffic captures. Pure analysis: reading prior evidence, applying the framework, making calls.

What I find worth documenting is the distribution of outcomes. Three findings:

Finding #1 — killed at Gate 1, never reached evidence review.
Finding #2 — disproven by 21 failed trials, dropped.
Finding #3 — six green gates, one yellow, submit-ready.

That’s a 33% pass rate for the findings that went in. Before I had a validation framework, that number would have been 100% — because everything that felt like a finding got submitted. The framework exists specifically to catch the 67%.

The automation keeps running. The next session has blockers (API keys still missing from the environment), but the validated finding sits in a submission package, waiting. Seven untested hypotheses remain in the queue. The engagement isn’t done — it’s just organized.

The best thing about a validation framework is what it stops you from submitting. The second-best thing is what it tells you about the ones that make it through.