Dev.to LLM · 2h ago | Research & Papers

Can LLMs Detect Real Vulnerabilities in Real Code?

The article discusses the N-Day-Bench benchmark, which evaluates whether large language models (LLMs) can identify real vulnerabilities in production codebases, not just synthetic ones. The results show LLMs can detect some common issues like hardcoded credentials, but struggle with more complex vulnerabilities like business logic flaws.

Why it matters

This research highlights the limitations of using LLMs for security auditing, despite their potential as a first line of defense against obvious issues.

Key Points

1. N-Day-Bench tests LLMs' ability to find known vulnerabilities (with CVEs) in real codebases
2. LLMs perform reasonably well on classic issues like SQL injection and hardcoded credentials
3. LLMs consistently fail to detect vulnerabilities in business logic, race conditions, and cross-component interactions
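
To make the contrast in these points concrete, here is a small illustrative sketch. It is not taken from the benchmark: the regex, the sample snippets, and the function names are all assumptions chosen for illustration. It shows why a hardcoded credential is a "local" problem that even a simple pattern can spot, while a business logic flaw contains no suspicious token at all.

```python
import re

# Hypothetical pattern for a hardcoded secret: the whole problem is
# visible on a single line, so a shallow check is enough to find it.
SECRET_PATTERN = re.compile(
    r"""(password|api_key|secret)\s*=\s*["'][^"']+["']""",
    re.IGNORECASE,
)


def has_hardcoded_secret(source: str) -> bool:
    """Return True if the source contains an obvious hardcoded secret."""
    return bool(SECRET_PATTERN.search(source))


# Illustrative sample 1: a classic, locally visible issue.
LOCAL_FLAW = 'DB_PASSWORD = "hunter2"\n'

# Illustrative sample 2: a business logic flaw. The refund is never
# checked against what the customer actually paid, but no single line
# looks dangerous; spotting it requires knowing the intended rules.
LOGIC_FLAW = '''
def apply_refund(order, amount):
    order.balance += amount
'''

print(has_hardcoded_secret(LOCAL_FLAW))
print(has_hardcoded_secret(LOGIC_FLAW))
```

The second sample passes the pattern check even though it is flawed, which mirrors the kind of vulnerability the summary says LLMs tend to miss.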

Details

N-Day-Bench is a benchmark published in 2025 that evaluates whether LLMs can identify real vulnerabilities in production codebases, not just synthetic ones. The methodology gives the model the relevant context (the affected files, not the entire repo) and asks it to identify the vulnerability without any hints about the CVE.

The best models correctly identify 20-35% of the vulnerabilities when queried directly. That may seem low, but the article argues it is not far from what an average developer achieves in manual code review.

The real issue is the gap between "generation mode" and "audit mode". When generating code, LLMs prioritize functionality over security; when explicitly asked to audit, they can identify issues they did not flag during generation. This suggests the problem lies less with the tool than with the process of using it. The benchmark also shows that LLMs struggle with more complex vulnerabilities such as business logic flaws, race conditions, and authorization issues that span multiple components.
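
The methodology and the generation/audit split described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: `BenchCase`, `build_audit_prompt`, `detection_rate`, and `generate_then_audit` are hypothetical names, the substring-based scoring is a simplification, and `model` stands for any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchCase:
    """One hypothetical benchmark case (names are assumptions)."""
    case_id: str
    files: dict[str, str]      # path -> source text, affected files only
    expected_category: str     # ground-truth label, never shown to the model


def build_audit_prompt(case: BenchCase) -> str:
    """Build an audit-mode prompt with the code but no hint about the known flaw."""
    parts = [
        "You are auditing code for security vulnerabilities.",
        "Identify the most likely vulnerability and name its category.",
    ]
    for path, source in sorted(case.files.items()):
        parts.append(f"--- {path} ---\n{source}")
    return "\n\n".join(parts)


def detection_rate(cases: list[BenchCase], answers: dict[str, str]) -> float:
    """Fraction of cases whose answer mentions the expected category.

    Substring matching is a deliberate simplification for this sketch.
    """
    if not cases:
        return 0.0
    hits = sum(
        1 for c in cases
        if c.expected_category.lower() in answers.get(c.case_id, "").lower()
    )
    return hits / len(cases)


def generate_then_audit(task: str, model: Callable[[str], str]) -> tuple[str, str]:
    """Two-pass process: generate code, then audit it in a separate call."""
    code = model(f"Write code for this task:\n{task}")
    review = model(f"Audit the following code for security vulnerabilities:\n{code}")
    return code, review
```

The point of `generate_then_audit` is the process change the summary hints at: the audit is a separate, explicit request rather than something expected to happen implicitly during generation.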
