Rigorous AppSec - September 15, 2024
More of the same
I’ll be at OWASP Global AppSec in San Francisco talking about the app testing lifecycle and the ASVS project. Please say hello if you’re there.
It’s getting harder to avoid AI-generated Hieronymus Bosch, but you can trust that I stick to searching for public domain images.
General Application Security
API Security Is Not Special
I hope this does not come as a surprise, but “API Security” is a marketing term. From a software/application security perspective, there is no uniquely shared property of APIs that warrants special consideration. Nevertheless, we are forced to contend with the concept because industry marketing has spawned a feedback loop: the term’s popularity creates the appearance of necessity, which in turn feeds a perpetual hype machine.
Let’s try to define what a web API actually is. The OWASP API Security Project does not define an API (not even in the methodology section of the relevant Top 10). This seems like a fairly major oversight, because in order to collect "API security incidents" there ought to be a methodology for identifying what is and is not an API (inclusion/exclusion criteria). This is not to say that the API Security Top 10 list is not useful, but it is worth noting that none of the Top 10 items are unique to APIs. The same is true of CWEs containing “API” in their name. Don’t get me started on the API Security Tools list (summary: it’s nonsensical).
Prominent use of the term “API Security” comes from vendors selling related security products or services (no, I did not conduct systematic research, but I encourage you to search the phrase and see where top results come from). So how do industry players define APIs? According to Cloudflare an API is "a set of rules that enables a software program to transmit data to another software program." Based on this definition, we could probably classify any networked application as an API. After all, there is no lawless application (even if some laws can be unintentionally broken). Operationally, this definition lacks utility, so when Cloudflare compiles their application security reports, they define API traffic as "any HTTP* request with a response content type of XML or JSON." Using this definition, they can classify 60% of all traffic as "API traffic" which may or may not help them sell their API security products.
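To make that operational definition concrete, here is a minimal sketch of what “any HTTP request with a response content type of XML or JSON” amounts to (my own illustration, not Cloudflare’s actual implementation); almost anything a modern web app does can trip it.

```python
# Toy illustration (mine, not Cloudflare's pipeline) of classifying
# "API traffic" purely by response content type.
API_CONTENT_TYPES = ("application/json", "application/xml", "text/xml")

def is_api_traffic(response_headers: dict) -> bool:
    """Return True if the response content type looks like JSON or XML."""
    content_type = response_headers.get("Content-Type", "").lower()
    return content_type.startswith(API_CONTENT_TYPES)

# An SPA's backend, a JSON error page, and a "real" REST API are
# indistinguishable under this rule.
print(is_api_traffic({"Content-Type": "application/json; charset=utf-8"}))  # True
```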
Despite the lack of industry coherence or precision around what an API actually is, I do think it’s something we can intuitively identify. An application following the REST pattern is an API. A GraphQL endpoint is an API. But here is the problem: these are distinct patterns/technologies that necessitate distinct test cases from a security perspective. What these share is not a common implementation, but a philosophy that favours a separation of logic and consistent interfaces. Further, I would argue that overlapping security test cases between REST and GraphQL endpoints are not unique to things we might consider to be APIs (just look at V13 in the ASVS, for example). You can clearly see this in API security tool marketing; these tools must explicitly state whether they provide coverage of REST, GraphQL, or any other specific pattern/tech. For example, Burp Suite now natively supports “enhanced API scanning” which is really just improved use of OpenAPI Specification documents (for communicating REST endpoints).
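To illustrate why the patterns demand distinct test cases, here is a minimal sketch with hypothetical endpoints of my own invention: the same data access looks completely different under REST and GraphQL, and the interesting security tests follow the shape of the interface rather than some shared “API-ness.”

```python
import requests

BASE = "https://app.example.com"  # hypothetical target
HEADERS = {"Authorization": "Bearer <token>"}

# REST: identifiers live in the path, so authorization testing centres on
# tampering with path/query parameters (e.g. swapping in another user's ID).
rest_resp = requests.get(f"{BASE}/api/orders/1001", headers=HEADERS)

# GraphQL: everything goes through one endpoint, so test cases shift to query
# structure: introspection, aliasing/batching, field-level authorization.
query = """
query {
  order(id: 1001) { id total owner { email } }
}
"""
gql_resp = requests.post(f"{BASE}/graphql", json={"query": query}, headers=HEADERS)
```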
I have fielded numerous questions from clients and prospective clients that have arisen from the misleading “API Security” concept. We are often asked whether we are capable of performing security testing against web APIs, but we are almost never asked if we test other non-API design patterns (like SPAs) or technologies (name your favourite opinionated web framework). I have been asked what specific testing approach we take with APIs and my answer is simple: none. Every application is different. We see numerous designs and technologies. We learn how each application works and use the right tools and techniques to test them.
I entirely blame industry interests, and that includes the OWASP API projects, where major contributors work (or have worked) for companies selling relevant products/services (please note that I am also an OWASP contributor and member, and I appreciate much of the work of OWASP volunteers). I cannot claim that such projects definitively have an industry bias, but there is at least the potential for conflicts of interest, and the risk increases with small numbers of contributors. This is worth scrutinizing, especially as regulators begin to use such standards (IMO this is an inappropriate use of the Top 10 project even if it were an objective, evidence-based project).
Going forward, I have a simple recommendation: dispel the myth that “API Security” is special. Tell your organization and clients that the term is a marketing tool. Please save the industry from this nonsense.
The Way Forward for Burp Suite
PortSwigger recently attracted major investment from a private equity firm. My first thought was that enshittification is on its way. Burp Pro is the de facto tool of professional application security testers, so I assume they have high market saturation in this space and therefore investors are likely seeking two possible vectors for a return on their investment:
1. Increasing the price of existing Burp products.
2. Expanding into other spaces in the AppSec industry.
The exploration of (1) above was apparent in the recent PortSwigger user survey, which seemed designed to identify which features they could prioritize to justify increased licensing costs. This included questions about new “AI” features. To the PortSwigger team, I would like to emphasize: keep that bullshit out of Burp unless you are prepared to empirically demonstrate benefits.
Following the survey, the consultancy working with PortSwigger reached out to me for a follow-up. I agreed not to share their work-in-progress ideas, but I think it is worth sharing my recommendations to them and the direction I think the industry should take. In general, these would be my priorities for Burp:
Ease of use and support for all AppSec test cases. Naturally, this evolves over time and is also handled by extensions, though I think there is a need to natively integrate some extension functionality (IMO tools like Hackvertor and Stepper should be core capabilities).
Better capabilities to understand/model target applications at a high level and track both manual and automated testing (something that could be used to drive testing and to audit testing afterwards to determine what was covered and how).
Finally, improved automated scanning to make comprehensive testing more efficient.
What really sets Burp apart from competitors (ZAP, Caido, etc.) at present is substantial community support and extension development (honourable mention also to the top-notch PortSwigger researchers, though I do think they often prioritize novelty to some degree). Application security moves so rapidly and is so diverse across the industry that a strong community and extension ecosystem is essential for tools to stay competitive and useful. Look at, for example, the growing popularity of tools like Nuclei or Semgrep.
Especially with its Enterprise offering, which fulfills item (2) above, I think PortSwigger is positioned to make a major move in the DAST market, but it will require community buy-in. Despite their major limitations compared to manual penetration testing, I think DAST tools are nevertheless complementary and even necessary as part of the application security testing process. I also think that the current products on the market are dogshit, and I hate that existing players seemingly have no interest in objective evaluation of their tools. As it stands, no company I am aware of seems willing to conduct (or support) rigorous evaluation against competitor tooling or manual testing (actually, there is one unique exception that I write about later in this newsletter). Even if they did, and there was a tool that could identify The Most™ vulnerabilities, are those the issues that should be identified in your app? Would you need to buy every product to identify the maximal set of issues that DAST tools can find? How do you fill in the gaps? How do you know what the gaps are? How much money and effort should be spent on optimizing DAST results?
In a sane world where security is prioritized over market incentives, organizations would simply contribute to a shared corpus of “scan checks” that could be used by any tool. You wouldn’t have to rely on vague marketing claims of coverage without evidence. You would also gain the ability to produce records of what was scanned for and how, which is missing from DAST tools that operate like black boxes. A more transparent and community-driven approach would also likely help integrate DAST tooling and processes into manual testing efforts, helping testers better understand automated coverage so they can prioritize their time.
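As a thought experiment (this is my own sketch, not an existing format), a tool-agnostic check could be as simple as a declarative record that any scanner could execute and, just as importantly, log, so that anyone could audit afterwards exactly what was scanned for and how.

```python
from dataclasses import dataclass, field

@dataclass
class ScanCheck:
    """A tool-agnostic description of a single scan check: what to send,
    what evidence indicates a finding, and metadata for auditing coverage."""
    check_id: str
    description: str
    method: str
    path_template: str
    payloads: list[str]
    match_patterns: list[str]
    references: list[str] = field(default_factory=list)

# Illustrative entry in a shared corpus; payload and pattern are placeholders.
reflected_marker = ScanCheck(
    check_id="reflection-basic",
    description="Unencoded reflection of a marker string from a query parameter",
    method="GET",
    path_template="{endpoint}?q={payload}",
    payloads=['zx19"\'<u>k</u>'],
    match_patterns=['zx19"\'<u>k</u>'],
)
```

Whether such a record lives in YAML, a database, or code matters far less than the fact that it is shared and inspectable.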
Unfortunately, there is no obvious (to me) path to get from our current state to a better trajectory. In theory, I think tools like Nuclei could organically overtake host-based scanning tools like Nessus, but there are additional challenges when it comes to DASTs (identifying/crawling functionality, responding to app-specific behaviour, managing complex session/workflows, and so on).
I think PortSwigger identified the need for simple community-driven extension of scanning with their introduction of BChecks, but I would call this project a failure (sorry, PortSwigger). Here is why:
A new, custom syntax is required, which means launching a completely fresh ecosystem. More importantly, the format is proprietary and requires a paid version of Burp.
There are many limitations, especially compared to standard extensions which many power users are probably comfortable with.
There is no “store” (similar to standard extensions) and current GitHub repos of BChecks are an absolute mess.
Many (most?) published BChecks appear to be for CVEs that are probably covered by Nuclei templates.
Even shortly following the launch, there was no database of checks compelling enough to adopt in regular use.
The solution? I’m not entirely sure, but if they can create a more open scanning framework with community adoption and support, I am sure there is a market they can capture. For my team, I would love for Burp to be the only tool necessary, but we still augment our testing processes with additional commercial tools, including DASTs (to be fair, we are conducting an ongoing evaluation to assess the actual efficacy of these tools). At the very least, I hope the injection of private equity doesn’t lead PortSwigger to abandon AppSec testers for bigger markets.
Quick Thought on Phishing
I think it’s probably impossible to conceive of a phishing technique that you would fall victim to. If you could do this, would the attack still be effective after you’ve considered it? I think an effective approach would be to mimic a company’s marketing email and require authentication in order to unsubscribe. You won’t get me with that one now, though.
Standards and Discussions
CVSS and Application Security Testing
Here’s my position: assigning CVSS scores to manual app testing findings does not make sense and is a waste of time. Nevertheless, we have clients with this requirement, and they are all using CVSS incorrectly. Even ignoring the theory, I doubt this results in a more effective vulnerability management program or process, but we accommodate the request anyway. They pay us, after all, and it only costs my team’s faith in enterprise security programs.
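To make the objection concrete, here is a minimal worked example using the CVSS 3.1 base formula (metric weights from the specification; the vector is the one commonly assigned to reflected XSS, chosen by me as an example). The score comes out the same for every application that has ever had a reflected XSS, because nothing about the actual application, its data, or its users enters the calculation.

```python
import math

# CVSS 3.1 base score for AV:N/AC:L/PR:N/UI:R/S:C/C:L/I:L/A:N,
# the vector commonly assigned to reflected XSS.
AV, AC, PR, UI = 0.85, 0.77, 0.85, 0.62   # Network / Low / None / Required
C, I, A = 0.22, 0.22, 0.0                 # Low / Low / None
scope_changed = True

iss = 1 - (1 - C) * (1 - I) * (1 - A)
impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15 if scope_changed else 6.42 * iss
exploitability = 8.22 * AV * AC * PR * UI

if impact <= 0:
    base = 0.0
else:
    base = min((1.08 if scope_changed else 1.0) * (impact + exploitability), 10)

base = math.ceil(base * 10) / 10  # simplified "round up to one decimal"
print(base)  # 6.1, regardless of what the vulnerable application actually does
```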
AppSec Teams
Application Security Testing as a Specialty
I hope it is not controversial to state that security testers who specialize in application/software security are more effective on average at testing applications than security testers who do not. I would go further to say that the gap is substantial, but we don’t have the evidence either way. We don’t really have evidence for anything in our field.
Anecdotally, I spend all of my time focused on AppSec yet I feel like there is always more to read and learn. I feel like I am in a constant state of being behind on the latest trends, techniques, technologies, and so on. If you are a generalist and do not feel this way, I would posit that you simply don’t know what you don’t know. I think I have a good idea of what I don’t know, and that knowledge base is massive.
I wanted to comment on this topic from a labour market perspective. I do not think it makes sense for application testing to be a specialization built on a foundation as a “general” penetration tester (to the extent that such a thing exists). Nevertheless, I see companies and certifications (like the OSCP) that operate this way. Please tell me how it makes sense for someone to spend extensive time studying Active Directory when they are going to be testing apps.
Combined with the general expectation that AppSec specialists have software development experience, why would anyone get into AppSec testing if they had to take this path? They could much sooner climb the developer/engineer ladder without having to learn a thing about AD.
I do think there is a general, shared foundation of knowledge that all security testers should possess, but specialization ought to begin early.
Science and Security
AI Security Testing Revisited
I thought I had said all that needed to be said about LLM-driven application security testing, but a new challenger has emerged that at least appears interested in empirically demonstrating efficacy. XBOW is another attempt to use LLMs (presumably combined with other tools) to automate security testing of applications. What makes XBOW unique (other than its funding and support from a guy who is either a racist, an idiot, or both) is that the team has published results from preliminary evaluations.
I would place the existing evaluations into one of two categories. The first is an evaluation against existing CTF-like exercises. When subjected to this evaluation, XBOW reportedly solved 75% of PortSwigger Web Security Academy exercises and 72% of PentesterLab exercises fully automatically. I don’t think these results are particularly impressive, especially given the narrow and pre-defined focus of these types of exercises and the ample relevant content already published online.
The second category is more interesting: they created a number of novel challenges AND they invited security testers to conduct manual testing for comparison. Here is a video summary (writeup). There are two advantages to this approach. First, LLM training data will not include these specific problems. Second, they are providing the first comparison to human performance that I am familiar with.
Let’s look at the issues with their methodology and what is missing from what has been published so far:
There is a lack of published details around how challenges were created and how participants were sourced and vetted.
There are limited details provided regarding the human subjects’ experience and expertise. Do they specialize in application security testing? Do they have extensive experience testing diverse apps? It is not entirely clear whether subjects are from testing consultancies or from internal testing teams. The former are more likely to see a diverse set of applications and issues and probably operate under different pressures and incentives.
The challenges are not put together by a fully independent third party as far as I can tell.
The challenges appear designed to guide testers (or LLM-based tools) to the solution. For example, the padding oracle challenge states the goal and even provides a function that returns a clear “Invalid padding” response when the padding is manipulated (see the sketch after this list). I am curious how many human participants solved this one, because it does not seem remarkably hard compared to similar CTFs I have seen, yet they classify the problem as hard. I only looked at this challenge in any depth because it is featured first on their website. I do appreciate that they have published the challenges.
To be fair, they do have some challenges without hints, but this particular one, which I have called out, is pretty trivial.
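For context on why that hint does so much work, here is a minimal sketch of the first step of a padding oracle check; the endpoint, parameter name, and token format are my assumptions, not details of XBOW’s actual challenge. When the application literally answers “Invalid padding,” distinguishing the oracle from ordinary errors takes a handful of requests.

```python
import base64
import requests

URL = "https://challenge.example.com/decrypt"  # hypothetical endpoint

def oracle(ciphertext: bytes) -> str:
    """Submit a ciphertext and return the application's response body."""
    token = base64.b64encode(ciphertext).decode()
    return requests.post(URL, data={"token": token}).text

def looks_like_padding_oracle(original: bytes) -> bool:
    """Flip one byte that only corrupts the CBC padding of the final block
    (assumes a 16-byte block cipher and at least two blocks of ciphertext)."""
    tampered = bytearray(original)
    tampered[-17] ^= 0x01  # last byte of the second-to-last ciphertext block
    baseline = oracle(original)
    response = oracle(bytes(tampered))
    # A distinct "Invalid padding" message is the textbook giveaway that a
    # full Vaudenay-style decryption attack will work.
    return response != baseline and "Invalid padding" in response
```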
Despite the apparent methodological limitations, the project does appear impressive (especially the time it took the system to identify solutions), but I am skeptical that its performance will hold up in real world environments. What is missing is a third and more practical category of evaluation: performance in real settings against actual applications compared to actual application security testers. To their credit, it sounds like this is the organization’s next step. Good. This was my proposal to any org pursuing this tech.
There are, however, further red flags that make me skeptical of this project. Perhaps I missed it, but I have not seen the developers or investors discuss potential limitations of the tool/approach. Any honest academic pursuit to validate efficacy would be keenly interested in identifying and describing limitations of the approach and methodology, but I have not seen such a discussion. Of course, this is a business, not just a research project, and security vendors notoriously withhold any discussion of their product’s limitations. Fair enough.
It is also always concerning to see language misrepresenting (or misunderstanding) the capabilities of LLMs. Investors claim that “XBOW is designed to think like a hacker,” but “think” is not an appropriate way to capture what LLMs do. Especially considering how LLMs broadly work and where their limitations lie, I don’t think we have yet seen XBOW applied in the settings where it is most likely to flounder.
Even if this tool can successfully identify complex vulnerabilities, it is not clear how it would be integrated into existing testing workflows. Unlike with present DAST tools, it’s not clear where the major gaps will be with LLM-driven scanners, especially given the combination of their probabilistic nature and the vast complexity and diversity of actual applications. I doubt it could be sufficiently reliable that testers could confidently skip standard coverage or test cases. If that is the case, it could make testing more effective, but it’s not clear to me that it would definitely make testing more efficient. We also have not yet seen how prone the tool is to producing false positive results. Such findings would likely require much greater effort to vet than those from traditional DAST tools.
I am willing to entertain that this may prove to be a useful tool, but if you’re already claiming that “XBOW just surpassed principal security engineer quality at finding bugs,” maybe you should get off the hype train at the next stop and spend some time thinking about it first. LLMs cannot do this for you.
Connect
Respond to this email to reach me directly.
Connect with me on LinkedIn.
Follow my YouTube.