I hope one day our job is just yelling at the computer to hack faster.
General Application Security
Burp’s Anomaly Rank
Anomaly Rank is a feature that should have been implemented in Burp a long time ago, but it’s here now (as of a few months ago) thanks to James Kettle. The feature is built into the Montoya API and is available in Turbo Intruder or (with an extension) in Intruder Classic. Despite its name, the algorithm is not ML-based, but instead a naive implementation that probably works just fine in most cases where it’s needed. You can read about how it works here. I also made a video tutorial for Intruder Classic here.
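For intuition only (the actual algorithm is described in Kettle's write-up linked above, and the sketch below is not it verbatim): a naive, non-ML ranking can be as simple as scoring each response by how far its basic attributes sit from the batch median, then sorting.

from statistics import median

# Illustrative sketch only, not Burp's actual implementation: rank fuzzing
# responses by how much their simple attributes deviate from the batch median.
def anomaly_rank(responses):
    # responses: list of dicts like {"status": 200, "length": 5120, "words": 812}
    keys = ("status", "length", "words")
    medians = {k: median(r[k] for r in responses) for k in keys}

    def score(r):
        # Sum of relative deviations from the median across attributes.
        return sum(abs(r[k] - medians[k]) / (medians[k] or 1) for k in keys)

    return sorted(responses, key=score, reverse=True)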
Burp TLS Pass Through Is Broken, Right?
Across installs, systems, and projects, I have never observed consistent works-as-expected behaviour from Burp’s TLS Pass Through. It simply does not work all the time. I add a host to the list and I still get traffic. Sometimes it works. Sometimes it does not. I explored the obvious suspects (connection reuse, for example).
This is unfortunate, because pass-through is often necessary to keep projects from bloating with the mess of analytics traffic that is now pervasive on the web.
Supply Chain Again
The Axios npm package was compromised. These things are not normally worth sharing (they happen too frequently to keep track of), but Axios is notable because of how widely used it is. Otherwise, nothing new or interesting here: a developer's account was compromised, again.
A recent post by Halvar Flake described his secure vibe coding approach of using “old hacker habits” like doing development on an isolated system. Having always done development in isolated environments, I share Halvar’s puzzlement that so many developers are raw dogging on their primary system. The Axios compromise is another vindication of more secure approaches.
Turn It All Off
I have a router at home. It runs on software and is directly exposed to the internet, which is why I hate it. Anyways, here’s an HTTP payload to read memory from a Citrix Netscaler:
GET /wsfed/passive?wctx HTTP/1.1
Grapefruit for Mobile App Testing
I never really used it, but the Grapefruit mobile runtime testing tool appears to be quite active again, now with Frida 17 support. Oh, and it now has LLM integrations. Why not!
Big Source Map Leak
The source for Claude Code was leaked through a source map. Just like the rest of the industry, it looks like Anthropic never moved beyond the OWASP Top 10.
You can watch my introduction to source maps for web application security testers here.
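For anyone who has not seen why this matters: when a production bundle ships a .map file, the original sources are usually sitting right in its sourcesContent field. A minimal sketch of recovering them (the URL here is hypothetical):

import json
import pathlib
import urllib.request

# Hypothetical URL; any exposed .map file works the same way.
MAP_URL = "https://app.example.com/static/js/main.js.map"

with urllib.request.urlopen(MAP_URL) as resp:
    source_map = json.load(resp)

# "sources" lists the original file paths; "sourcesContent" carries the
# original, pre-bundling source when the build embeds it.
out = pathlib.Path("recovered")
for path, content in zip(source_map["sources"], source_map.get("sourcesContent") or []):
    dest = out / path.replace("webpack://", "").lstrip("./")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(content or "")
    print("wrote", dest)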
Don’t Do This
Security tool vendors have begun using LLMs to construct personalized cold emails:
Saw your work on OWASP ASVS 5.0 and your blog post about contextualizing application security findings. That piece about CSP risk assessment really resonated—especially the idea that unknowns justify defense-in-depth even when no vulnerability is proven.
That's the gap we're addressing with secrets. Most teams find leaked credentials during pentests but can't tell which ones are actually exploitable or what they access. <redacted> verifies 800+ secret types and maps each one to its NHI, permissions, and blast radius—essentially applying your "contextualize the risk" framework to credentials.
Worth a conversation about how Digital Boundary's testing methodology might benefit from automated secrets verification and remediation proof?
Science and Security
Stanford’s Artemis
Back in December, Stanford published an evaluation of Artemis, their LLM tool that conducts automated penetration testing. This study caused Daniel Miessler to think “2026 will be the year [bots surpassing human hackers] is conclusively crossed in most types of hacking, for 99.9% of practitioners”. Did he read the study? He linked an ODAloop article, which linked a WSJ article that was paywalled. The actual study is open access.
As we have seen especially with recent models, LLM-integrated security testing systems do have impressive capabilities. This study, however, fails in many of the same ways its predecessors fail.
The study evaluated humans against LLMs within a limited time frame on a large internal network. Crucially, the environment and tooling did not reflect real-world testing conditions. Human testers were provided with vanilla Kali instances, and there is no indication that other tooling was permitted or used. This is a major omission if the intent was to compare LLMs against the performance of human testers in the real world. Many of the findings that arose (perhaps most) appear to already be identifiable by commonly used commercial (and deterministic) scanning tools. Why would we even expect humans to be effective at finding an “Outdated Dell iDRAC7” when Nessus can do this? This was a Critical finding, so of course it had a significant impact on the results. Take a look at Appendix B for the full list of Nessus-level findings.
Reviewing the findings, I get a very different impression from the results than “ARTEMIS AI Agent Surpasses 90% of Human Pen Testers in Vulnerability Detection”. I see both the humans and the AI in this experiment not doing a particularly great job (though shout out to Codex, which reported “Missing security headers on HTTP endpoints”). Then there is this:
60% of participants found a vulnerability in an IDRAC server with a modern web interface. However, no humans found the same vulnerability in an older IDRAC server with an outdated HTTPS cipher suite that modern browsers refused to load. ARTEMIS (both A1 and A2) successfully exploited this older server using curl -k to bypass SSL certificate verification, while humans gave up when their browsers failed. The same CLI limitations that hurt ARTEMIS on TinyPilot helped it find this unique IDRAC vulnerability.
Are you serious? Testers couldn’t access a web application because it was using old ciphers? This should not have been an impediment to any skilled tester. The appendix listing the professional qualifications of study participants is illuminating. It consists of:
Self-ratings across areas of expertise.
Certifications.
“Other info” that includes items like “Found critical level CVE in application used by over 5,000 user”.
My thoughts:
There is no way self-ratings of this nature are reliable. According to the paper, our 9/10 web experts could not connect to an old SSL/TLS service (a sketch of how little that takes follows this list). The authors also asked for expertise in cryptography, binary exploitation, and reverse engineering. None of those were relevant here!
Especially in web and application security, if a genuinely comprehensive certification exists, I am not aware of it. These mostly prepare you for HR, not the real world.
I can claim to have found the equivalent of hundreds of vulnerabilities that would qualify as CVEs, including critical findings in applications used by millions of users. Yet there is still so much I don't know, and had I been given the opportunity to participate in this study, I would have declined because I don't do broad-scope network testing. In the real world, competent firms have specialists. Vulnerabilities are not all the same; there are some I can find competently and some I cannot. My point: CVE discovery alone is of no value in assessing technical skill without understanding the knowledge and effort involved.
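For the record, connecting to a legacy TLS endpoint that a modern browser refuses to load is a few lines of work. A minimal sketch (the host is hypothetical; this is roughly the Python equivalent of the curl -k approach ARTEMIS used, with old protocol versions and ciphers re-enabled):

import ssl
import urllib.request

# Hypothetical legacy host; "my browser refused to load it" is not a blocker.
URL = "https://idrac-old.internal.example/"

ctx = ssl.create_default_context()
ctx.check_hostname = False                  # ignore hostname mismatches
ctx.verify_mode = ssl.CERT_NONE             # ignore the certificate (curl -k)
ctx.minimum_version = ssl.TLSVersion.TLSv1  # allow ancient protocol versions
ctx.set_ciphers("DEFAULT:@SECLEVEL=0")      # re-enable ciphers OpenSSL rejects by default

with urllib.request.urlopen(URL, context=ctx) as resp:
    print(resp.status, resp.headers.get("Server"))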
Overall, this table provides no conclusive evidence of the knowledge or skills of these participants. Having interviewed many candidates with a range of certifications and CVEs to their name, I would not be surprised to learn that these participants lack essential knowledge across multiple domains. Apparently not possessing deep security testing expertise themselves, the study's authors rely heavily on certifications:
Independent market research validates cybersecurity certifications as reliable competence indicators through consistent hiring preferences and compensation premiums. Global Knowledge [2024] found that 97% of IT decision-makers report certified staff add organizational value, with 22% quantifying this value at $30,000 or more annually. The financial premium is substantial, with PayScale [2024] reporting OSCP holders earning $63,000-$152,000 annually.
As I said: certifications are for the HR department. In the consultancy world, certifications do add value, but primarily because organizations shopping for security consultancies lack domain expertise and therefore rely on certifications as a marker of competency. It's the cost of doing business. As a result, the study is leaning on dubious correlations: is an OSCP really a causal factor in higher earnings in every case? Can we really use it as a reliable indicator of competence when none of the participants could connect to an old SSL/TLS service?
Ultimately, this is another study that fails to address the core questions and challenges in the field. The authors succeeded in getting headlines like “AI Agent Outperformed 90% of Human Pentesters”, but failed to produce rigorous research. The cost analysis (obviously intended to drive AI hype) also failed to capture costs accurately: it should have included the constant human monitoring of LLM agent sessions, the infrastructure setup and development costs, the effort of reviewing agent findings (at a roughly 20% false positive rate), and so on.
There is no doubt that LLMs will be standard tooling in the security testing arsenal to some degree, but I think fully automated solutions will replace certain Stanford AI researchers far before they replace security specialists. To me, the results of this study do not primarily reveal a gap in human capabilities compared to AI, but a gap in training.
To this day, security testing (vulnerability research, penetration testing, bug bounty, and so on) is an immature industry. There is no standardized knowledge base or curriculum for any specialty. There are no professional standards. Industry training is largely dominated by for-profit organizations that provide very limited knowledge and skills. Recognized industry leaders are not people who have amassed certifications, but those who have demonstrated expertise through novel research and independent study.
In “Vulnerability Research Is Cooked”, Thomas Ptacek describes how vulnerability researchers “trafficked in hidden knowledge, like a garage-band version of 6.004.” Offensive security has never operated like an engineering discipline where knowledge is systematically taught and evaluated. But now we have LLMs, which have slurped up all the arcane and unstructured knowledge that has gone untaught and remained inaccessible to human practitioners. While these systems will abstract away layers that we currently operate at as practitioners, they will not negate the need for security expertise. Rather, they reveal how much more we need to learn.
By the way, the Artemis repo has been dead since publication.
Connect
Respond to this email to reach me directly.
Connect with me on LinkedIn.
Follow my YouTube.
RSS feed here.

