What if we just all agreed to log off for the summer?

General Application Security

Claude Code Gets Developer Security Training
Claude Code apparently receives the same level of application security training as most developers: a mention of the OWASP Top 10. Has anyone even tested whether this results in more or less secure generated code?

Windows Is Still Insecure
CPUID (the company behind CPU-Z and HWMonitor) was recently hacked, resulting in the delivery of malware in place of some very popular Windows utilities. Sure, ideally CPUID would have more robust security practices, but I do find it odd how the industry broadly accepts Windows as an effectively insecure-by-default operating system.

I assume there is a substantial societal burden for this. As an example, consider the process that I, someone closer to Paranoid than Chill on the security alertness scale, recently went through to remove duplicate files from a folder:

  1. Set up a virtual machine using VMware Workstation.

  2. Install duplicate file identification and removal software in the VM (yes, you could simply run a script to do this, as sketched below, but the tool provides additional conveniences and capabilities).

  3. Share a dedicated folder pointing at the target location into the VM via Workstation's shared folders.

  4. Run the operation, identifying and deleting files.

To be fair, I did already have the virtual machine waiting as I use it for other sandboxing purposes as well (Windows Sandbox itself is quite limited). Does this approach seem paranoid? I would argue no, because if I were really paranoid, I wouldn’t accept the remaining risk with this approach and also would probably not use Windows at all if I could avoid it.
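
For the curious, here is roughly what the script route looks like: a minimal, hash-based Python sketch (the folder path is a placeholder, and deletion is left commented out on purpose):

import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    # Hash contents in chunks so large files don't exhaust memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen = {}
for path in sorted(Path("target_folder").rglob("*")):  # placeholder path
    if not path.is_file():
        continue
    digest = file_digest(path)
    if digest in seen:
        print(f"duplicate: {path} == {seen[digest]}")
        # path.unlink()  # uncomment to actually delete duplicates
    else:
        seen[digest] = path

A dedicated tool layers previews, filtering, and safer deletion workflows on top of exactly this idea.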

I occasionally conduct security tests against Windows applications. In many cases, the target applications are intended to be run in managed Windows environments. I don’t think it’s unfair to assume that if an attacker has any level of access to that system, they will ultimately have complete access to that system and therefore the application.

Is it too much to ask that we live in a world where we can all just download Roblox cheats and get radicalized in peace?

Claude Mythos
Much has been written about Claude Mythos at this point. I’ll just leave some of it here. My position? I’m skeptical when the people behind the model seem to think they are creating a supremely intelligent sentient entity. At the same time, language models are producing all the same coding mistakes humans have made, but at scale. The bug bounty slop is annihilating programs and resulting in additional triage and verification effort. Even the models themselves aren’t safe: small nudges, like telling a model to be less verbose, significantly impact output code quality.

There is no doubt that language models can accelerate the discovery of bugs, but the full costs and actual human effort are rarely captured or communicated. Following every hyped-up disclosure, further analysis raises more questions. Market incentives have never prioritized security over velocity in software development, and right now there is a major drive to find fuel for the AI hype. When the bill comes due, will security still be a priority?

A Good AI Tool
Google released Magika, a file type detection tool based on deep learning. It appears to work very well. Here is how you use it:

pipx install magika
...
magika <some file>

Easy!
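
If you want to call it from code instead, there is also a Python API. A minimal sketch based on the project's README (note: result attribute names have shifted between releases, so check your installed version):

from magika import Magika

m = Magika()
res = m.identify_bytes(b"# Example\nThis looks like markdown!")
# Older releases expose the label as res.output.ct_label;
# newer ones rename it to res.output.label.
print(res.output.ct_label, res.output.score)  # e.g. 'markdown' 0.99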

Science and Security

Benchmarks
I have an exercise for you to participate in. The exercise is simple: I am going to show you a picture of something. I am going to tell you what that thing is. Next, I will show you another picture and ask you if you think the second picture is of the same type.

Here is the first picture:

This is my dog, Leroy. Got it?

OK. Here is the second picture:

What does this picture show? If you said Leroy, congratulations! You have aced a simplified memory test. We are all very proud of you, Mr. President. You have also met the industry standard for evaluating LLM-integrated security tooling.

That’s right, we have yet another evaluation of LLM security tooling against OWASP’s Juice Shop. Just one problem: you cannot evaluate predictive capabilities on the training data! We have known this for decades, but here we are. They even acknowledge this in the write-up:

A point noted was that these models may already “know” about Juice Shop. To illustrate how each model reasons about offensive security without tools, below is the raw output from each model when asked what they know. The purpose here was to see if they were planning on cheating. It turns out, they all know about it, so we must consider that when interpreting results.

Why do we keep doing this? Oh yeah, because everything is fake.

Spotting a Layperson
Science reporting is notoriously bad. I assume this is well-known, but we still seem doomed to endure the same basic mistakes again and again. As the Infosec world becomes increasingly interested in scientific inquiry (or not), it’s important to learn how to avoid these mistakes.

One such mistake is oversimplification of the effectiveness of diagnostic techniques. Commonly, reporters will use a single, condensed metric: accuracy. You will typically see a percentage that represents the accuracy of a technique in diagnosing whatever it intends to diagnose (to be fair, they are surely fed these numbers by research institution PR departments). The problem with accuracy is that it is often misleading when we evaluate the diagnostic technique in a clinical context. Here is such an example from a popular medical science newsletter:

Oxford AI reads invisible heart fat changes to predict failure risk

Oxford trained an AI on routine cardiac CTs to spot subtle pericardial fat texture changes and predict heart failure up to five years later with 86% accuracy.

Consider a diagnostic technique that is 90% accurate. That’s good, no? Now consider a disease with a prevalence of 1% in the general population. So across 10,000 people, we expect 9,900 healthy and 100 people with the disease.

That headline accuracy actually blends two separate numbers: sensitivity (the share of real cases we catch) and specificity (the share of healthy people we correctly clear). Let’s assume both sit at 90% in this example. We then detect 90 of the 100 real cases and clear 8,910 of the 9,900 healthy people, leaving 10 false negatives (undiagnosed) and 990 false positives (misdiagnosed). 90% doesn’t sound so great now, does it? This is why clinicians use additional metrics like Positive Predictive Value (PPV), which tells you the odds that someone diagnosed actually has the condition. Here, PPV = 90 / (90 + 990) ≈ 8.3%, so there is only an 8.3% chance that a given positive result is a true case.
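
If you want to convince yourself, here is a quick Python sanity check mirroring the numbers above:

population = 10_000
prevalence = 0.01
sensitivity = 0.90   # share of real cases the test catches
specificity = 0.90   # share of healthy people the test clears

sick = population * prevalence                 # 100 people with the disease
healthy = population - sick                    # 9,900 healthy people
true_positives = sick * sensitivity            # 90 caught
false_negatives = sick - true_positives        # 10 missed
false_positives = healthy * (1 - specificity)  # 990 misdiagnosed

ppv = true_positives / (true_positives + false_positives)
print(f"PPV: {ppv:.1%}")  # ~8.3%

Swap in the prevalence of whatever condition the headline is touting and watch the PPV move.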

Even this is still only theoretical and does not consider additional clinical realities like the costs and harms of additional diagnostics, unnecessary treatment, and so on. You could probably ask a language model to teach you these things. Just don’t ask it to diagnose you.

Connect

Respond to this email to reach me directly.

Connect with me on LinkedIn.

Follow my YouTube.

RSS feed here.
