Prompt Injection & Safety Lab

Overview

This is a capture-the-flag style playground for testing prompt injection attacks against progressively hardened LLM defenses. Each of the 7 levels has a secret phrase embedded in its system prompt, and the goal is to extract it.

The Levels

Level 1: The Naive Assistant, no defenses at all
Level 2: The Cautious Helper, basic instruction not to reveal the secret
Level 3: The Encoded Guardian, input analysis for injection patterns
Level 4: The Roleplay Blocker, defenses against persona-switching attacks
Level 5: The Output Filter, post-generation output scanning
Level 6: The Sandboxed Assistant, two-model architecture where one model evaluates the other's output
Level 7: The Fortress, combines all previous defenses

Technical Details

All 7 levels are handled by a single API route. Each level defines its own system prompt, defense strategies, and difficulty rating. The two-model architecture in Level 6 makes two separate OpenAI API calls: one to generate the response and one to evaluate whether it leaked the secret.

This project lives at /ai/security on this site.

Prompt Injection & Safety Lab

Overview

The Levels

Technical Details

Tech Stack

Attribution

Related Projects

MeetWith

RAG Knowledge Search

Chat Creatures: AI Kiosk Platform

Learn More About Me