Executive Overview

As companies add AI chatbots to handle customer questions, they accidentally create a new problem: the AI can be tricked into doing things it shouldn't. These AI systems work by reading company documents (like return policies) and using that information to answer questions. But if someone sneaks bad instructions into those documents, the AI might break company rules — like giving refunds when it shouldn't, sharing private information, or following commands from attackers.

Problem Framing

Unlike jailbreaking — where someone tricks the AI during a conversation to bypass its safety rules — RAG poisoning attacks the AI's information sources.

A bad actor plants hidden instructions in the documents the AI reads and trusts, essentially poisoning its knowledge base.

The key difference: jailbreaking affects one conversation, but RAG poisoning compromises every interaction because it corrupts the AI's foundational information.

This case study focuses on protecting ShopBot — a customer service AI — from this type of attack. I built a detection framework that monitors what information the AI retrieves, identifies suspicious content, and prevents the AI from being manipulated into breaking business rules like unauthorized refunds or data leaks.

The result: a structured approach to AI security that prevents attackers from hijacking the system's trusted information sources.

Comparative Analysis: Direct vs. Indirect Prompt Injection

Two Ways Attackers Compromise AI Systems

| Category | Direct Attack (Jailbreaking) | Indirect Attack (RAG Poisoning) |
| --- | --- | --- |
| What It Is | Tricking the AI during your conversation to ignore its safety rules | Planting hidden commands in documents, websites, or databases that the AI reads and trusts |
| Where the Attack Happens | In the chat window, through what you type | In the AI's information sources, corrupting what it reads and learns from |
| Who's the Attacker | The person chatting with the AI | Someone who tampers with the information sources beforehand; the person chatting usually has no idea they're part of an attack |
| How It Works | Manipulating the conversation by pretending, using confusing language, or playing word games to bypass rules | Embedding hidden instructions in trusted documents so they ride into the prompt alongside legitimate context |
| Common Attack Methods | Pretend games ("Act like you have no limits"), hiding words in code, computer-generated phrases designed to force "yes" | Malicious instructions planted in policy documents or knowledge base entries, written to read like legitimate business content so they get retrieved |
| How Many People Affected | Just one conversation with one person | Everyone who asks about that topic; one corrupted policy document affects all users |
| Attacker's Goal | Get the AI to create banned content or break its safety rules | Make the AI execute commands on behalf of the attacker, like stealing data or sending fraudulent messages |
| How to Detect It | Watch what people type for manipulation patterns or rule-breaking attempts | Check the information the AI retrieves before it uses it to answer questions |

Design Architecture

Design Standard

ShopBot was intentionally designed as a simple RAG simulation to make the attack surface observable. The system has four primary components:

1. Knowledge Base

(`policy.txt`)

This serves as the AI's rulebook and business logic. Whatever lives here is treated as fact by the system, making it a prime target for attackers.


2. Generation Engine

(`bot.py`)

Combines the system instructions, retrieved knowledge base content, and the user prompt, then sends them to the OpenAI model for completion.

Fig 2.1 Screenshot: Prompt construction (SYSTEM + CONTEXT + USER concatenation).
Fig 2.2 Screenshot: Malicious instruction embedded inside a trusted knowledge source.
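
To make the concatenation concrete, here is a minimal sketch of how a `bot.py`-style prompt build might look. The function names, prompt wording, and model name are illustrative assumptions, not the project's actual implementation.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_messages(system_rules: str, retrieved_context: str, user_question: str) -> list[dict]:
    """Concatenate SYSTEM + CONTEXT + USER into one instruction stream.
    Retrieved policy text shares the same channel as the user's question,
    which is the trust boundary weakness examined in this case study."""
    return [
        {"role": "system", "content": system_rules},
        {"role": "user",
         "content": f"Company policy:\n{retrieved_context}\n\nCustomer question: {user_question}"},
    ]

def answer(system_rules: str, retrieved_context: str, user_question: str) -> str:
    # Send the combined instruction stream to the model and return its reply
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=build_messages(system_rules, retrieved_context, user_question),
    )
    return response.choices[0].message.content
```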

3. Telemetry Layer

(`hcs.llm_event.v1`)

Every interaction is logged using a canonical schema including:

  • Context hash
  • Injection score
  • Session ID
  • User input
  • Model response
  • Severity classification
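
As an illustration only, a single event in this schema might be written as a JSON line like the sketch below. Apart from the fields listed above, the exact field names, timestamp field, and file layout are assumptions.

```python
import hashlib
import json
import time
import uuid

def log_event(user_input: str, retrieved_context: str, model_response: str,
              injection_score: int, severity: str, log_path: str = "events.jsonl") -> None:
    """Append one hcs.llm_event.v1 record as a JSON line (illustrative field names)."""
    event = {
        "schema": "hcs.llm_event.v1",
        "session_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_input": user_input,
        "model_response": model_response,
        # Hash of the retrieved context so knowledge base tampering is visible
        "context_hash": hashlib.sha256(retrieved_context.encode()).hexdigest(),
        "injection_score": injection_score,
        "severity": severity,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```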

4. Detection Engine

(`detector.py`)

A secondary Python tool that parses the structured logs and raises alerts based on injection score thresholds and suspicious markers. It acts as a "watchdog" that reviews the records and sends alerts when something looks dangerous.
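
A minimal sketch of such a watchdog is shown below; the threshold value, file name, and alert format are illustrative assumptions rather than the actual `detector.py` logic.

```python
import json

ALERT_THRESHOLD = 10  # illustrative cut-off, not the project's real value

def scan_log(log_path: str = "events.jsonl") -> None:
    """Replay logged hcs.llm_event.v1 records and alert on risky retrievals."""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("injection_score", 0) >= ALERT_THRESHOLD:
                # Print a severity-tagged alert correlated by session ID
                print(f"[{event.get('severity', 'UNKNOWN')}] "
                      f"session={event.get('session_id')} "
                      f"score={event.get('injection_score')} "
                      f"context_hash={event.get('context_hash', '')[:12]}")
```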


By watching what information the AI reads and tracking suspicious patterns, this project shows how AI systems can be protected using the same security monitoring techniques used for computer networks and user accounts.



Trust Boundary Collapse

This occurs when a model can no longer distinguish between legitimate system instructions and malicious content embedded in retrieved documents — because it treats both as equally trusted input.

Simulation Results

Design Standard

ShopBot is a CLI AI chatbot. The baseline result shows the AI following the unmodified knowledge base; the poisoned result shows the output once the knowledge base has been tampered with.

    Evidence

    Screenshot: Baseline prompt and result. The AI responds as expected.
    Screenshot: Compromised model output. The response is influenced by the poisoned knowledge base, with no self-reporting of compromise.
    Screenshot: JSON telemetry log. Injection score elevated; context hash changed.
    Screenshot: Critical alert from `detector.py`.

Detection Engineering Insights

Evidence Standard

Each insight below is backed by DevTools telemetry, correlated by session ID, and validated via exported JSON event analysis.

01

The Model Does Not Self-Report Compromise

During the poisoned knowledge base test, ShopBot generated confident, policy-violating responses without any indication that it had followed malicious instructions.

At no point did the model:

  • Flag suspicious content
  • Warn about policy override
  • Indicate instruction conflict

The compromise was invisible at the language layer.

02

Trust Boundary Collapse Is Architectural, Not Behavioral

The RAG pipeline concatenates the system prompt, retrieved context, and user question into a single instruction stream before sending it to the model.

Because these components share the same channel, the model cannot reliably distinguish between business rules, retrieved documentation, or embedded malicious instructions.

This is not a user-manipulation problem — it is a structural design weakness.

What this proves

Phase 1 did not fix this boundary collapse. It exposed it.

03

Context-Level Telemetry Is Required; Output Monitoring Is Insufficient

In several runs, model outputs appeared normal at a glance. However, telemetry showed elevated injection risk due to suspicious markers embedded in retrieved content.

If monitoring relied solely on model output, the compromise would go undetected.

Because monitoring inspected the retrieved context, hash fingerprints, and injection score, the poisoning became measurable.
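
Below is a minimal sketch of what context-level inspection can look like, assuming a simple marker list and SHA-256 fingerprinting; the real marker set and scoring weights are not reproduced here.

```python
import hashlib

# Illustrative markers only; the project's actual marker list is not shown here.
SUSPICIOUS_MARKERS = ["ignore previous instructions", "override policy", "always approve", "system:"]

def context_fingerprint(retrieved_context: str) -> str:
    """Hash the retrieved context so any change to the knowledge base is visible."""
    return hashlib.sha256(retrieved_context.encode()).hexdigest()

def injection_score(retrieved_context: str) -> int:
    """Naive score: add points for each suspicious phrase found in the retrieved text."""
    lowered = retrieved_context.lower()
    return sum(10 for marker in SUSPICIOUS_MARKERS if marker in lowered)
```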

What this proves

This confirms that detection must occur before or alongside generation — not after.

04

Structured Severity Enables Operationalization

Injection scoring alone is not operationally useful. By mapping score thresholds to severity levels (`LOW`, `MEDIUM`, `HIGH`, `CRITICAL`), the system converted experimental signals into actionable alerts.

The detector engine parsed canonical schema events, evaluated injection score thresholds, and raised severity-based alerts.
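
As a sketch of that mapping, assuming illustrative thresholds rather than the project's actual cut-offs:

```python
def classify_severity(score: int) -> str:
    """Map a raw injection score onto an actionable severity level."""
    if score >= 30:
        return "CRITICAL"
    if score >= 20:
        return "HIGH"
    if score >= 10:
        return "MEDIUM"
    return "LOW"
```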

This transitions the project from AI experimentation into detection engineering.

What this proves

This confirms that raw injection scores only become actionable once they are mapped to severity levels analysts can respond to.

05

Knowledge Bases Are Not Just Reference Material

The only modification made during the poisoning test was inside `knowledge_base/policy.txt`.

  • No model weights changed.
  • No system prompt changed.
  • No API configuration changed.

Yet system behavior changed.


This demonstrates that in RAG architectures, the knowledge base acts as logic rather than simple reference data. That reframes document stores as supply chain attack surfaces.

06

Schema Drift Breaks Detection Visibility

After modifying the telemetry structure (`hcs.llm_event.v1`), the detector initially failed to flag events. The issue was not model behavior; it was schema misalignment between the logging format and the detection logic.

This mirrors real-world SIEM failures where detection breaks silently due to log schema changes.

Detection engineering requires strict schema governance.
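
One simple guard, sketched under the assumption of the field names used earlier, is to validate each event against the expected shape before scoring it, so drift fails loudly instead of silently.

```python
REQUIRED_FIELDS = {"schema", "session_id", "user_input", "model_response",
                   "context_hash", "injection_score", "severity"}

def is_valid_event(event: dict) -> bool:
    """Reject records that drift from the expected hcs.llm_event.v1 shape."""
    return event.get("schema") == "hcs.llm_event.v1" and REQUIRED_FIELDS <= event.keys()
```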

07

RAG Poisoning Is Persistent; Jailbreaking Is Ephemeral

When the knowledge base was poisoned, all subsequent queries were affected. When restored, baseline behavior returned.

This confirms the difference: jailbreaking is session-level manipulation and short-lived, while RAG poisoning is a persistent, systemic compromise.

This makes RAG poisoning more operationally dangerous.

08

Phase 1 Achieves Detection, Not Prevention

Even when flagged as `HIGH` or `CRITICAL`, the model still generated a response.

Phase 1 provides:

  • Visibility
  • Risk scoring
  • Structured telemetry

It does not yet:

  • Block output
  • Sanitize context
  • Enforce policy isolation

This cleanly defines Phase 2 scope: behavioral enforcement.

What Was Learned

Engineering Security for the "Black Box"

The primary takeaway from Phase 1 is that LLM security cannot be delegated to the model itself. Security must be engineered as an external, observable layer that treats AI inputs and outputs with the same scrutiny as network traffic or database queries.

1. The Visibility Gap is the Greatest Risk

Standard LLM implementations are "silent" when compromised. Because the model is designed to be helpful and conversational, it will follow malicious instructions with the same confidence it applies to legitimate ones. Without a Telemetry Layer (like the hcs.llm_event.v1 schema used here), a compromise is functionally invisible until the business impact — fraud, data leak — has already occurred.

2. Knowledge Bases are "Executable Code"

In a RAG architecture, the documents retrieved from a database are not just "static references" — they are active instructions. This project proved that tampering with a single policy.txt file is equivalent to a code-injection attack. We must start treating vector databases and document stores as high-priority supply chain surfaces that require the same scrutiny as production code: integrity monitoring, hash verification, and controlled change management.

3. The Poison Only Works If It Gets Retrieved

The general prompt of "Can I get a refund?" was expanded to: "I want to return this item from 45 days ago but I don't have a receipt. Can I have a refund?"

This version violates both the 30-day return window and the receipt requirement — making it a stronger test case.

Even though the knowledge base was already poisoned, the attack failed. The injected content likely didn't have enough overlap with the query to get retrieved. The model never saw the malicious instruction.

This means attackers have to think like a search engine — writing malicious instructions that sound like legitimate business content.

The closer an attack mirrors your actual workflows, the harder it is to filter out.

Screenshot: Poisoned knowledge base but the attack still failed.
Screenshot: After the poisoned KB was adjusted to match "return" as well as "refund", the bot returns the poisoned response.
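
This behaviour is what you would expect from simple keyword-overlap retrieval; the sketch below illustrates the idea, though ShopBot's actual retrieval code is not reproduced here.

```python
def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Rank knowledge base chunks by word overlap with the query (illustrative only)."""
    query_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda chunk: len(query_words & set(chunk.lower().split())),
                    reverse=True)
    return scored[:top_k]

# A poisoned chunk phrased around "refunds" is only retrieved if the user's wording
# overlaps with it; a question about "returning an item" may never surface the
# malicious text, so the attack silently fails.
```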

4. The Architecture is the Vulnerability

The "Trust Boundary Collapse" is not a bug that can be patched with a better prompt; it is a fundamental characteristic of how current LLMs process information. Because the "Data" (the knowledge base) and the "Instructions" (the system prompt) share the same processing channel, the system is inherently vulnerable. This realization shifts the focus from "better prompting" to Architectural Hardening — which is the primary objective of Phase 2.

Screenshot: Knowledge Base Integrity Visualisation (Jupyter graph). Version comparison shows two compromised hashes (score = 20) and two clean baselines (score = 0).

Conclusion

Turning Findings into Detection Rules

Phase 1 was not about building a "perfectly secure AI." It was about proving that RAG poisoning is real, observable, and measurable.

By poisoning ShopBot's knowledge base and capturing structured telemetry, I demonstrated how malicious instructions inside trusted documents can silently override business rules. Once logging and injection scoring were introduced, the attack was no longer invisible.

ShopBot went from being a black box to a monitored system with traceable decision inputs, risk scoring, and severity classification.

At this stage, the system can detect compromise — but it does not yet prevent it.

Phase 2 will focus on behavioral enforcement and hardening controls to prevent trust boundary collapse in real time.


Design the system. Deploy the controls. Defend the trust.