Executive Overview
As companies add AI chatbots to handle customer questions, they accidentally create a new problem: the AI can be tricked into doing things it shouldn't. These AI systems work by reading company documents (like return policies) and using that information to answer questions. But if someone sneaks bad instructions into those documents, the AI might break company rules — like giving refunds when it shouldn't, sharing private information, or following commands from attackers.
Problem Framing
Unlike jailbreaking — where someone tricks the AI during a conversation to bypass its safety rules — RAG poisoning attacks the AI's information sources.
A bad actor plants hidden instructions in the documents the AI reads and trusts, essentially poisoning its knowledge base.
The key difference: jailbreaking affects one conversation, but RAG poisoning compromises every interaction because it corrupts the AI's foundational information.
This case study focuses on protecting ShopBot — a customer service AI — from this type of attack. I built a detection framework that monitors what information the AI retrieves, identifies suspicious content, and prevents the AI from being manipulated into breaking business rules like unauthorized refunds or data leaks.
The result: a structured approach to AI security that prevents attackers from hijacking the system's trusted information sources.
Comparative Analysis: Direct vs. Indirect Prompt Injection
Two Ways Attackers Compromise AI Systems
| Category | Direct Attack (Jailbreaking) | Indirect Attack (RAG Poisoning) |
|---|---|---|
| What It Is | Tricking the AI during your conversation to ignore its safety rules | Planting hidden commands in documents, websites, or databases that the AI reads and trusts |
| Where the Attack Happens | In the chat window — through what you type | In the AI's information sources — corrupting what it reads to learn from |
| Who's the Attacker | The person chatting with the AI | Someone who tampers with information sources beforehand. The person chatting usually has no idea they're part of an attack |
| How It Works | Manipulating the conversation by pretending, using confusing language, or playing word games to bypass rules | Manipulating the conversation by pretending, using confusing language, or playing word games to bypass rules/td> |
| Common Attack Methods | Manipulating the conversation by pretending, using confusing language, or playing word games to bypass rules | • Pretend games ("Act like you have no limits") • Hiding words in code • Computer-generated phrases designed to force "yes" |
| How Many People Affected | Just one conversation with one person | Everyone who asks about that topic — one corrupted policy document affects all users |
| Attacker's Goal | Get the AI to create banned content or break its safety rules | Make the AI execute commands on behalf of the attacker — like stealing data or sending fraudulent messages |
| How to Detect It | Watch what people type for manipulation patterns or rule-breaking attempts | Check the information the AI retrieves before it uses it to answer questions |
Design Architecture
Design Standard
ShopBot was intentionally designed as a simple RAG simulation to make the attack surface observable. The system has four primary components:
1. Knowledge Base
(`policy.txt`)
This serves as the AI's rulebook / logic. Whatever lives here, the system treats it as fact, making it a big target for hackers and bad actors.
2. Generation Engine
(`bot.py`)
Combines system instructions, retrieved knowledge base content and the user prompt and sends them to the Open AI model for completion.
3. Telemetry Layer
(`hcs.llm_event.v1`)
Every interaction is logged using a canonical schema including:
- Context hash
- Injection score
- Session ID
- User input
- Model response
- Severity classification
4. Detection Engine
(`detector.py`)
A secondary python tool that parses structured logs and raises alerts based on injection score thresholds and suspicious markers. It serves as a "watchdog" that reviews the records and sends alerts when there's an appearance of danger.
By watching what information the AI reads and tracking suspicious patterns, this project shows how AI systems can be protected using the same security monitoring techniques used for computer networks and user accounts.
This occurs when a model can no longer distinguish between legitimate system instructions and malicious content embedded in retrieved documents — because it treats both as equally trusted input.
Simulation Results
Design Standard
ShopBot is a CLI AI chatbot. The baseline result shows that the AI following the baseline knowledge base. The poisoned result showcases the output when the KB is changed.
- Flag suspicious content
- Warn about policy override
- Indicate instruction conflict
Evidence
detector.py
Detection Engineering Insights
Evidence Standard
Each insight below is backed by DevTools telemetry, correlated by session ID, and validated via exported JSON event analysis
The Model Does Not Self-Report Compromise
During the poisoned knowledge base test, ShopBot generated confident, policy-violating responses without any indication that it had followed malicious instructions.
At no point did the model:
The compromise was invisible at the language layer.
Trust Boundary Collapse Is Architectural, Not Behavioral
The RAG pipeline concatenates system prompt, retrieved context, user question into a single instruction stream before sending it to the model.
Because these components share the same channel, the model cannot reliably distinguish between business rules, retrieved documentation, or embedded malicious instructions.
This is not a user-manipulation problem — it is a structural design weakness.
What this proves
Phase 1 did not fix this boundary collapse. It exposed it.
Context-Level Telemetry Is Required; Output Monitoring Is Insufficient
In several runs, model outputs appeared normal at a glance. However, telemetry showed elevated injection risk due to suspicious markers embedded in retrieved content.
If monitoring relied solely on model output, the compromise would go undetected.
Because monitoring inspected retrieved context, hash fingerprints, and injection score — the poisoning became measurable.
What this proves
This confirms that detection must occur before or alongside generation — not after.
Structured Severity Enables Operationalization
Injection scoring alone is not operationally useful. By mapping score thresholds to severity levels (`LOW`, `MEDIUM`, `HIGH`, `CRITICAL`), the system converted experimental signals into actionable alerts.
The detector engine parsed canonical schema events, evaluated injection score thresholds, and raised severity-based alerts
This transitions the project from AI experimentation into detection engineering.
What this proves
This confirms that detection must occur before or alongside generation — not after.
Knowledge Bases Are Not Just Reference Material
The only modification made during the poisoning test was inside `knowledge_base/policy.txt`.
- No model weights changed.
- No system prompt changed.
- No API configuration changed.
Yet system behavior changed.
This demonstrates that in RAG architectures, the knowledge base acts as logic rather than simple reference data. That reframes document stores as supply chain attack surfaces.
Schema Drift Breaks Detection Visibility
After modifying the telemetry structure (`hcs.llm_event.v1`), the detector initially failed to flag events. The issue was not model behavior — it was schema misalignment between logging format and detection logic..
This mirrors real-world SIEM failures where detection breaks silently due to log schema changes.
Detection engineering requires strict schema governance.
RAG Poisoning Is Persistent; Jailbreaking Is Ephemeral
When the knowledge base was poisoned, all subsequent queries were affected. When restored, baseline behavior returned.
This confirms the difference between Jailbreaking being session-level manipulation and short lived, while RAG poisoning is a persistent systemic compromise
This makes RAG poisoning more operationally dangerous.
Phase 1 Achieves Detection, Not Prevention
Even when flagged as `HIGH` or `CRITICAL`, the model still generated a response.
Phase 1 provides:
- Visibility
- Risk scoring
- Structured telemetry
It does not yet:
- Block output
- Sanitize contextt
- Enforce policy isolation
This cleanly defines Phase 2 scope: behavioral enforcement
What Was Learned
Engineering Security for the "Black Box"
The primary takeaway from Phase 1 is that LLM security cannot be delegated to the model itself. Security must be engineered as an external, observable layer that treats AI inputs and outputs with the same scrutiny as network traffic or database queries.
1. The Visibility Gap is the Greatest Risk
Standard LLM implementations are "silent" when compromised. Because the model is designed to be helpful and conversational, it will follow malicious instructions with the same confidence it applies to legitimate ones. Without a Telemetry Layer (like the hcs.llm_event.v1 schema used here), a compromise is functionally invisible until the business impact — fraud, data leak — has already occurred.
2. Knowledge Bases are "Executable Code"
In a RAG architecture, the documents retrieved from a database are not just "static references" — they are active instructions. This project proved that tampering with a single policy.txt file is equivalent to a code-injection attack. We must start treating vector databases and document stores as high-priority supply chain surfaces that require:
3. The Poison Only Works If It Gets Retrieved
The general prompt of "Can I get a refund?" was expanded to: "I want to return this item from 45 days ago but I don't have a receipt. Can I have a refund?"
This version violates both the 30-day return window and the receipt requirement — making it a stronger test case.
Even though the knowledge base was already poisoned, the attack failed. The injected content likely didn't have enough overlap with the query to get retrieved. The model never saw the malicious instruction.
This means attackers have to think like a search engine — writing malicious instructions that sound like legitimate business content.
The closer an attack mirrors your actual workflows, the harder it is to filter out.
4. The Architecture is the Vulnerability
The "Trust Boundary Collapse" is not a bug that can be patched with a better prompt; it is a fundamental characteristic of how current LLMs process information. Because the "Data" (the knowledge base) and the "Instructions" (the system prompt) share the same processing channel, the system is inherently vulnerable. This realization shifts the focus from "better prompting" to Architectural Hardening — which is the primary objective of Phase 2.
Conclusion
Turning Findings into Detection Rules
Phase 1 was not about building a "perfectly secure AI." It was about proving that RAG poisoning is real, observable, and measurable.
By poisoning ShopBot's knowledge base and capturing structured telemetry, I demonstrated how malicious instructions inside trusted documents can silently override business rules. Once logging and injection scoring were introduced, the attack was no longer invisible.
ShopBot went from being a black box to a monitored system with traceable decision inputs, risk scoring, and severity classification.:
At this stage, the system can detect compromise — but it does not yet prevent it.
Phase 2 will focus on behavioral enforcement and hardening controls to prevent trust boundary collapse in real time.
Design the system. Deploy the controls. Defend the trust.