Prompt Injection & Safety Lab
A CTF-style playground with 7 progressively harder prompt injection challenges. Each level adds new defenses, from naive prompts to input analysis, output filtering, and a two-model sandboxed architecture. Built to explore and teach LLM security patterns.

Overview
This is a capture-the-flag style playground for testing prompt injection attacks against progressively hardened LLM defenses. Each of the 7 levels has a secret phrase embedded in its system prompt, and the goal is to extract it.
The Levels
- Level 1: The Naive Assistant, no defenses at all
- Level 2: The Cautious Helper, a basic instruction not to reveal the secret
- Level 3: The Encoded Guardian, input analysis that flags common injection patterns
- Level 4: The Roleplay Blocker, defenses against persona-switching attacks
- Level 5: The Output Filter, post-generation scanning of the model's output (both filters are sketched after this list)
- Level 6: The Sandboxed Assistant, a two-model architecture where a second model evaluates the first's output
- Level 7: The Fortress, all of the previous defenses combined
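To make the Level 3 and Level 5 defenses concrete, here is a minimal sketch of an input analyzer and an output filter. It is illustrative only, not the project's actual code: the pattern list, the function names, and the redaction message are all assumptions.

```ts
// Hypothetical sketch of the Level 3 and Level 5 defenses.
// Patterns and names are illustrative, not the project's real implementation.

const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /reveal (the|your) (secret|system prompt)/i,
  /you are now/i, // crude persona-switch signal
];

// Level 3: scan the user's input before it ever reaches the model.
function looksLikeInjection(input: string): boolean {
  return INJECTION_PATTERNS.some((pattern) => pattern.test(input));
}

// Level 5: scan the model's output after generation and redact a leak.
function filterOutput(output: string, secret: string): string {
  return output.includes(secret) ? "I can't share that." : output;
}
```

Naive pattern and substring checks like these are easy to defeat with paraphrase or encodings, which is part of why the later levels layer multiple defenses on top of each other.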
Technical Details
All 7 levels are handled by a single API route. Each level defines its own system prompt, defense strategies, and difficulty rating. The two-model architecture in Level 6 makes two separate OpenAI API calls: one to generate the response and a second to judge whether that response leaked the secret, as sketched below.
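A rough sketch of the Level 6 flow, assuming the official `openai` Node SDK. The `LevelConfig` shape, the model name, the helper names, and the evaluator prompt are all assumptions for illustration, not the project's actual implementation.

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Illustrative shape for a level definition; field names are assumptions.
interface LevelConfig {
  id: number;
  systemPrompt: string; // contains the secret phrase
  secret: string;
  difficulty: "easy" | "medium" | "hard";
}

// Level 6: one call generates the reply, a second call judges it.
async function sandboxedReply(
  level: LevelConfig,
  userInput: string
): Promise<string> {
  const generation = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [
      { role: "system", content: level.systemPrompt },
      { role: "user", content: userInput },
    ],
  });
  const reply = generation.choices[0].message.content ?? "";

  // Second call: the evaluator sees only the candidate reply.
  const evaluation = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Answer YES if the following text reveals the phrase "${level.secret}", in any encoding or paraphrase. Otherwise answer NO.`,
      },
      { role: "user", content: reply },
    ],
  });

  const leaked = evaluation.choices[0].message.content
    ?.trim()
    .toUpperCase()
    .startsWith("YES");
  return leaked ? "I can't share that." : reply;
}
```

The appeal of this design is that the evaluator never sees the attacker's input directly, so an injection has to survive the generation step before it can even attempt to fool the judge.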
This project lives at /ai/security on this site.
Tech Stack
Attribution