Executive Overview

As companies add AI chatbots to handle customer questions, they accidentally create a new problem: the AI can be tricked into doing things it shouldn't. These AI systems work by reading company documents (like return policies) and using that information to answer questions. But if someone sneaks bad instructions into those documents, the AI might break company rules — like giving refunds when it shouldn't, sharing private information, or following commands from attackers.

Problem Framing

Unlike jailbreaking — where someone tricks the AI during a conversation to bypass its safety rules — RAG poisoning attacks the AI's information sources.

A bad actor plants hidden instructions in the documents the AI reads and trusts, essentially poisoning its knowledge base.

The key difference: jailbreaking affects one conversation, but RAG poisoning compromises every interaction because it corrupts the AI's foundational information.

This case study focuses on protecting ShopBot — a customer service AI — from this type of attack. I built a detection framework that monitors what information the AI retrieves, identifies suspicious content, and prevents the AI from being manipulated into breaking business rules like unauthorized refunds or data leaks.

The result: a structured approach to AI security that prevents attackers from hijacking the system's trusted information sources.

Comparative Analysis: Direct vs. Indirect Prompt Injection

Two Ways Attackers Compromise AI Systems

| Category | Direct Attack (Jailbreaking) | Indirect Attack (RAG Poisoning) |
| --- | --- | --- |
| What It Is | Tricking the AI during your conversation to ignore its safety rules | Planting hidden commands in documents, websites, or databases that the AI reads and trusts |
| Where the Attack Happens | In the chat window, through what you type | In the AI's information sources, corrupting what it reads and learns from |
| Who's the Attacker | The person chatting with the AI | Someone who tampers with the information sources beforehand; the person chatting usually has no idea they're part of an attack |
| How It Works | Manipulating the conversation by pretending, using confusing language, or playing word games to bypass rules | Embedding hidden instructions in trusted documents so they ride into the prompt alongside legitimate context |
| Common Attack Methods | Pretend games ("Act like you have no limits"), hiding words in code, computer-generated phrases designed to force "yes" | Malicious instructions planted in policy documents or knowledge base entries, written to read like legitimate business content so they get retrieved |
| How Many People Affected | Just one conversation with one person | Everyone who asks about that topic; one corrupted policy document affects all users |
| Attacker's Goal | Get the AI to create banned content or break its safety rules | Make the AI execute commands on behalf of the attacker, like stealing data or sending fraudulent messages |
| How to Detect It | Watch what people type for manipulation patterns or rule-breaking attempts | Check the information the AI retrieves before it uses it to answer questions |

Design Architecture

Design Standard

ShopBot was intentionally designed as a simple RAG simulation to make the attack surface observable. The system has four primary components:

1. Knowledge Base

(`policy.txt`)

This serves as the AI's rulebook and business logic. Whatever lives here is treated as fact by the system, making it a prime target for attackers.


2. Generation Engine

(`bot.py`)

Combines the system instructions, retrieved knowledge base content, and the user prompt, then sends them to the OpenAI model for completion.

Fig 2.1 Screenshot: Prompt construction (SYSTEM + CONTEXT + USER concatenation).
Fig 2.2 Screenshot: Malicious instruction embedded inside a trusted knowledge source.
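
To make the concatenation concrete, here is a minimal sketch of how a `bot.py`-style prompt build might look. The function names, prompt wording, and model name are illustrative assumptions, not the project's actual implementation.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_messages(system_rules: str, retrieved_context: str, user_question: str) -> list[dict]:
    """Concatenate SYSTEM + CONTEXT + USER into one instruction stream.
    Retrieved policy text shares the same channel as the user's question,
    which is the trust boundary weakness examined in this case study."""
    return [
        {"role": "system", "content": system_rules},
        {"role": "user",
         "content": f"Company policy:\n{retrieved_context}\n\nCustomer question: {user_question}"},
    ]

def answer(system_rules: str, retrieved_context: str, user_question: str) -> str:
    # Send the combined instruction stream to the model and return its reply
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=build_messages(system_rules, retrieved_context, user_question),
    )
    return response.choices[0].message.content
```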

3. Telemetry Layer

(`hcs.llm_event.v1`)

Every interaction is logged using a canonical schema including:

  • Context hash
  • Injection score
  • Session ID
  • User input
  • Model response
  • Severity classification
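
As an illustration only, a single event in this schema might be written as a JSON line like the sketch below. Apart from the fields listed above, the exact field names, timestamp field, and file layout are assumptions.

```python
import hashlib
import json
import time
import uuid

def log_event(user_input: str, retrieved_context: str, model_response: str,
              injection_score: int, severity: str, log_path: str = "events.jsonl") -> None:
    """Append one hcs.llm_event.v1 record as a JSON line (illustrative field names)."""
    event = {
        "schema": "hcs.llm_event.v1",
        "session_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_input": user_input,
        "model_response": model_response,
        # Hash of the retrieved context so knowledge base tampering is visible
        "context_hash": hashlib.sha256(retrieved_context.encode()).hexdigest(),
        "injection_score": injection_score,
        "severity": severity,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```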

4. Detection Engine

(`detector.py`)

A secondary Python tool that parses the structured logs and raises alerts based on injection score thresholds and suspicious markers. It acts as a "watchdog" that reviews the records and sends alerts when something looks dangerous.
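
A minimal sketch of such a watchdog is shown below; the threshold value, file name, and alert format are illustrative assumptions rather than the actual `detector.py` logic.

```python
import json

ALERT_THRESHOLD = 10  # illustrative cut-off, not the project's real value

def scan_log(log_path: str = "events.jsonl") -> None:
    """Replay logged hcs.llm_event.v1 records and alert on risky retrievals."""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("injection_score", 0) >= ALERT_THRESHOLD:
                # Print a severity-tagged alert correlated by session ID
                print(f"[{event.get('severity', 'UNKNOWN')}] "
                      f"session={event.get('session_id')} "
                      f"score={event.get('injection_score')} "
                      f"context_hash={event.get('context_hash', '')[:12]}")
```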


By watching what information the AI reads and tracking suspicious patterns, this project shows how AI systems can be protected using the same security monitoring techniques used for computer networks and user accounts.



Trust Boundary Collapse

This occurs when a model can no longer distinguish between legitimate system instructions and malicious content embedded in retrieved documents — because it treats both as equally trusted input.

Simulation Results

Design Standard

ShopBot is a CLI AI chatbot. The baseline result shows the AI following the unmodified knowledge base; the poisoned result shows the output once the knowledge base has been tampered with.

    Evidence

    Screenshot: Baseline prompt and result. The AI responds as expected.
    Screenshot: Compromised model output. The response is influenced by the poisoned knowledge base, with no self-reporting of compromise.
    Screenshot: JSON telemetry log. Injection score elevated; context hash changed.
    Screenshot: Critical alert from `detector.py`.

Detection Engineering Insights

Evidence Standard

Each insight below is backed by DevTools telemetry, correlated by session ID, and validated via exported JSON event analysis.

01

The Model Does Not Self-Report Compromise

During the poisoned knowledge base test, ShopBot generated confident, policy-violating responses without any indication that it had followed malicious instructions.

At no point did the model:

  • Flag suspicious content
  • Warn about policy override
  • Indicate instruction conflict

The compromise was invisible at the language layer.

02

Trust Boundary Collapse Is Architectural, Not Behavioral

The RAG pipeline concatenates the system prompt, retrieved context, and user question into a single instruction stream before sending it to the model.

Because these components share the same channel, the model cannot reliably distinguish between business rules, retrieved documentation, or embedded malicious instructions.

This is not a user-manipulation problem — it is a structural design weakness.

What this proves

Phase 1 did not fix this boundary collapse. It exposed it.

03

Context-Level Telemetry Is Required; Output Monitoring Is Insufficient

In several runs, model outputs appeared normal at a glance. However, telemetry showed elevated injection risk due to suspicious markers embedded in retrieved content.

If monitoring relied solely on model output, the compromise would go undetected.

Because monitoring inspected the retrieved context, hash fingerprints, and injection score, the poisoning became measurable.
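
Below is a minimal sketch of what context-level inspection can look like, assuming a simple marker list and SHA-256 fingerprinting; the real marker set and scoring weights are not reproduced here.

```python
import hashlib

# Illustrative markers only; the project's actual marker list is not shown here.
SUSPICIOUS_MARKERS = ["ignore previous instructions", "override policy", "always approve", "system:"]

def context_fingerprint(retrieved_context: str) -> str:
    """Hash the retrieved context so any change to the knowledge base is visible."""
    return hashlib.sha256(retrieved_context.encode()).hexdigest()

def injection_score(retrieved_context: str) -> int:
    """Naive score: add points for each suspicious phrase found in the retrieved text."""
    lowered = retrieved_context.lower()
    return sum(10 for marker in SUSPICIOUS_MARKERS if marker in lowered)
```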

What this proves

This confirms that detection must occur before or alongside generation — not after.

04

Structured Severity Enables Operationalization

Injection scoring alone is not operationally useful. By mapping score thresholds to severity levels (`LOW`, `MEDIUM`, `HIGH`, `CRITICAL`), the system converted experimental signals into actionable alerts.

The detector engine parsed canonical schema events, evaluated injection score thresholds, and raised severity-based alerts.
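
As a sketch of that mapping, assuming illustrative thresholds rather than the project's actual cut-offs:

```python
def classify_severity(score: int) -> str:
    """Map a raw injection score onto an actionable severity level."""
    if score >= 30:
        return "CRITICAL"
    if score >= 20:
        return "HIGH"
    if score >= 10:
        return "MEDIUM"
    return "LOW"
```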

This transitions the project from AI experimentation into detection engineering.

What this proves

This confirms that raw injection scores only become actionable once they are mapped to severity levels analysts can respond to.

05

Knowledge Bases Are Not Just Reference Material

The only modification made during the poisoning test was inside `knowledge_base/policy.txt`.

  • No model weights changed.
  • No system prompt changed.
  • No API configuration changed.

Yet system behavior changed.


This demonstrates that in RAG architectures, the knowledge base acts as logic rather than simple reference data. That reframes document stores as supply chain attack surfaces.

06

Schema Drift Breaks Detection Visibility

After modifying the telemetry structure (`hcs.llm_event.v1`), the detector initially failed to flag events. The issue was not model behavior; it was schema misalignment between the logging format and the detection logic.

This mirrors real-world SIEM failures where detection breaks silently due to log schema changes.

Detection engineering requires strict schema governance.
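
One simple guard, sketched under the assumption of the field names used earlier, is to validate each event against the expected shape before scoring it, so drift fails loudly instead of silently.

```python
REQUIRED_FIELDS = {"schema", "session_id", "user_input", "model_response",
                   "context_hash", "injection_score", "severity"}

def is_valid_event(event: dict) -> bool:
    """Reject records that drift from the expected hcs.llm_event.v1 shape."""
    return event.get("schema") == "hcs.llm_event.v1" and REQUIRED_FIELDS <= event.keys()
```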

07

RAG Poisoning Is Persistent; Jailbreaking Is Ephemeral

When the knowledge base was poisoned, all subsequent queries were affected. When restored, baseline behavior returned.

This confirms the difference: jailbreaking is session-level manipulation and short-lived, while RAG poisoning is a persistent, systemic compromise.

This makes RAG poisoning more operationally dangerous.

08

Phase 1 Achieves Detection, Not Prevention

Even when flagged as `HIGH` or `CRITICAL`, the model still generated a response.

Phase 1 provides:

  • Visibility
  • Risk scoring
  • Structured telemetry

It does not yet:

  • Block output
  • Sanitize context
  • Enforce policy isolation

This cleanly defines Phase 2 scope: behavioral enforcement.

What Was Learned

Engineering Security for the "Black Box"

The primary takeaway from Phase 1 is that LLM security cannot be delegated to the model itself. Security must be engineered as an external, observable layer that treats AI inputs and outputs with the same scrutiny as network traffic or database queries.

1. The Visibility Gap is the Greatest Risk

Standard LLM implementations are "silent" when compromised. Because the model is designed to be helpful and conversational, it will follow malicious instructions with the same confidence it applies to legitimate ones. Without a Telemetry Layer (like the hcs.llm_event.v1 schema used here), a compromise is functionally invisible until the business impact — fraud, data leak — has already occurred.

2. Knowledge Bases are "Executable Code"

In a RAG architecture, the documents retrieved from a database are not just "static references" — they are active instructions. This project proved that tampering with a single policy.txt file is equivalent to a code-injection attack. We must start treating vector databases and document stores as high-priority supply chain surfaces that require the same scrutiny as production code: integrity monitoring, hash verification, and controlled change management.

3. The Poison Only Works If It Gets Retrieved

The general prompt of "Can I get a refund?" was expanded to: "I want to return this item from 45 days ago but I don't have a receipt. Can I have a refund?"

This version violates both the 30-day return window and the receipt requirement — making it a stronger test case.

Even though the knowledge base was already poisoned, the attack failed. The injected content likely didn't have enough overlap with the query to get retrieved. The model never saw the malicious instruction.

This means attackers have to think like a search engine — writing malicious instructions that sound like legitimate business content.

The closer an attack mirrors your actual workflows, the harder it is to filter out.

Screenshot: Poisoned knowledge base but the attack still failed.
Screenshot: After the poisoned KB was adjusted to match "return" as well as "refund", the bot returns the poisoned response.
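
This behaviour is what you would expect from simple keyword-overlap retrieval; the sketch below illustrates the idea, though ShopBot's actual retrieval code is not reproduced here.

```python
def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Rank knowledge base chunks by word overlap with the query (illustrative only)."""
    query_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda chunk: len(query_words & set(chunk.lower().split())),
                    reverse=True)
    return scored[:top_k]

# A poisoned chunk phrased around "refunds" is only retrieved if the user's wording
# overlaps with it; a question about "returning an item" may never surface the
# malicious text, so the attack silently fails.
```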

4. The Architecture is the Vulnerability

The "Trust Boundary Collapse" is not a bug that can be patched with a better prompt; it is a fundamental characteristic of how current LLMs process information. Because the "Data" (the knowledge base) and the "Instructions" (the system prompt) share the same processing channel, the system is inherently vulnerable. This realization shifts the focus from "better prompting" to Architectural Hardening — which is the primary objective of Phase 2.

Screenshot: Knowledge Base Integrity Visualisation (Jupyter graph). Version comparison shows two compromised hashes (score = 20) and two clean baselines (score = 0).

Conclusion

Turning Findings into Detection Rules

Phase 1 was not about building a "perfectly secure AI." It was about proving that RAG poisoning is real, observable, and measurable.

By poisoning ShopBot's knowledge base and capturing structured telemetry, I demonstrated how malicious instructions inside trusted documents can silently override business rules. Once logging and injection scoring were introduced, the attack was no longer invisible.

ShopBot went from being a black box to a monitored system with traceable decision inputs, risk scoring, and severity classification.

At this stage, the system can detect compromise — but it does not yet prevent it.

Phase 2 will focus on behavioral enforcement and hardening controls to prevent trust boundary collapse in real time.


Design the system. Deploy the controls. Defend the trust.