How to Build a Secure Retrieval-Augmented Generation (RAG) System Without Data Leaks

A secure retrieval-augmented generation (RAG) system is an architecture that grounds LLM outputs in private enterprise data while enforcing strict access control lists (ACLs), data encryption, and PII redaction across the entire ingestion and retrieval pipeline.

Most standard RAG tutorials assume a single-user environment with public API access. In a real company, this setup is a liability. If you build a naive system, you risk exposing proprietary contracts, customer invoices, or employee records.

Under frameworks like India's Digital Personal Data Protection Act (DPDPA) 2023, failing to protect this personal data can cost your business up to INR 250 Crore (approximately USD 30 million) in statutory penalties. To remain compliant with GDPR and the DPDPA, you must secure three distinct boundaries: ingestion, vector storage, and the LLM context window.

What Usually Goes Wrong: The Three Common Leak Vectors in RAG

When engineering teams deploy a naive private RAG setup, they usually introduce three critical leak vectors:

External API Exposure: Sending raw, unredacted corporate documents to third-party embedding and LLM APIs. By default, OpenAI does not use data sent to its API to train models, but it may retain inputs and outputs for up to 30 days for abuse monitoring unless your organization has been approved for Zero Data Retention (ZDR).
The ACL Bypass: A standard vector database indexes documents globally. If your database does not check permissions, a low-privilege user can query the system and retrieve high-privilege data, like HR files or financial forecasts, via semantic search. This is a textbook case of "Sensitive Information Disclosure" in the OWASP Top 10 for LLM Applications (LLM06 in the 2023 list, LLM02 in the 2025 list).
LLM Context Window Poisoning: Retrieving contextually relevant but restricted documents and feeding them into a shared LLM session. Without per-user isolation, that content can persist in conversation history or response caches and surface in answers to other users. If the LLM is a third-party service, the content also leaves your network and may be retained under the provider's terms.

Diagram showing data leak vectors in naive RAG versus secure RAG architecture

The 'ACL Sync' Nightmare: Why Naive Semantic Search Bypasses Permissions

Source systems like SharePoint, Google Drive, and Slack use complex, nested group permissions. A user's access depends on active directory groups, folder inheritance, and direct shares.

A vector database does not natively understand these directory structures. It only performs mathematical similarity matching between vector embeddings. If a user asks, "What are the Q4 redundancy plans?", the database matches the query vector with the document vectors. Without metadata-level permission mapping, the system will gladly return chunks from restricted HR documents to a junior engineer.
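The bypass is easy to reproduce with a toy in-memory index. The sketch below is illustrative only: the vectors, chunk IDs, and group names are invented, and a real vector database would use an approximate nearest-neighbour index rather than a full sort. It shows that ranking by similarity alone surfaces the restricted chunk, while the same ranking over permission-filtered candidates does not.

```typescript
// Toy in-memory index; vectors, IDs, and group names are made up for illustration.
interface Chunk {
  id: string;
  vector: number[];
  allowedGroups: string[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Ranks chunks by similarity. When userGroups is provided, chunks the caller
// cannot read are dropped before ranking; when it is omitted, nothing is checked.
function topMatch(queryVector: number[], chunks: Chunk[], userGroups?: string[]): Chunk | undefined {
  const candidates = userGroups
    ? chunks.filter(c => c.allowedGroups.some(g => userGroups.includes(g)))
    : chunks;
  return candidates
    .slice()
    .sort((x, y) => cosineSimilarity(queryVector, y.vector) - cosineSimilarity(queryVector, x.vector))[0];
}

const toyIndex: Chunk[] = [
  { id: "hr-redundancy-plan", vector: [0.9, 0.1, 0.0], allowedGroups: ["hr-leadership"] },
  { id: "eng-onboarding", vector: [0.1, 0.9, 0.0], allowedGroups: ["engineering"] },
];

// Stands in for the embedded question about redundancy plans.
const questionVector = [1, 0, 0];

const naiveResult = topMatch(questionVector, toyIndex);                     // no permission check
const filteredResult = topMatch(questionVector, toyIndex, ["engineering"]); // junior engineer's groups
```

Here `naiveResult` is the HR chunk, because similarity is the only criterion, whereas `filteredResult` can only ever be a chunk the engineer is allowed to read.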

The Secure RAG Architecture: A System Pattern That Works

To protect your enterprise search systems, you must process data through a zero-trust pipeline before it ever reaches a database or an LLM.

A secure architecture isolates each component within your private network. Instead of relying on external services, you deploy a deterministic ingestion gateway, run a self-hosted embedding model, use a vector database with strict metadata filtering, and connect to a private LLM instance. This pattern ensures that data is redacted before processing, stored with cryptographic access controls, and queried only by authorized users.

Flowchart showing the secure RAG data path from raw document to private LLM

Step-by-Step Implementation of a Secure RAG Pipeline

Step 1: Deterministic Redaction over LLM-based 'Cleaning'

Do not use an LLM to clean your data. LLMs are slow, expensive, and non-deterministic. They routinely miss edge-case personally identifiable information (PII) and can introduce hallucinated data of their own.

Instead, deploy a rule-driven PII detection engine like Microsoft Presidio at your ingestion gateway. Presidio combines regex patterns, checksum validation, and Named Entity Recognition (NER) models, so its output is repeatable for a given input and configuration. Before any text is chunked or embedded, run it through the engine to replace names, social security numbers, and internal IP addresses with placeholders like [REDACTED_IP].
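To make the idea of deterministic placeholder substitution concrete, here is a minimal pattern-based sketch. It is not Presidio and does not call Presidio's API: the rule set, the `redact` function, and the [REDACTED_SSN] placeholder are assumptions made for illustration, and two regexes are nowhere near the coverage of a real recognizer set (names, for instance, need an NER model).

```typescript
// Hypothetical rule set for illustration only; a production gateway would use
// a full recognizer library rather than these two patterns.
interface RedactionRule {
  placeholder: string;
  pattern: RegExp;
}

const RULES: RedactionRule[] = [
  // US social security numbers in the common 3-2-4 layout.
  { placeholder: "[REDACTED_SSN]", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  // RFC 1918 private IPv4 ranges: 10.x.x.x, 172.16-31.x.x, 192.168.x.x.
  {
    placeholder: "[REDACTED_IP]",
    pattern: /\b(?:10\.\d{1,3}|172\.(?:1[6-9]|2\d|3[01])|192\.168)\.\d{1,3}\.\d{1,3}\b/g,
  },
];

// Applies every rule in order. The same input always yields the same output,
// which is the property that makes the redaction step auditable.
function redact(text: string): string {
  return RULES.reduce((acc, rule) => acc.replace(rule.pattern, rule.placeholder), text);
}

const redactedSample = redact("SSN 123-45-6789 on host 10.0.12.7");
```

Because the rules are plain data, they can be versioned and reviewed alongside the ingestion code, and a failed match is reproducible in a unit test.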

Step 2: Hosting Local Embeddings to Keep Vector Generation In-House

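To eliminate outbound network calls for vectorization, host open-weight embedding models inside your private cloud. You can deploy models like BAAI/bge-large-en-v1.5 inside your AWS ECS, EKS, or Google Kubernetes Engine (GKE) clusters. Proprietary models such as Cohere Embed v3 are not open source; if you need one, consume it through a managed offering that runs inside your cloud account boundary, such as Amazon Bedrock, rather than a public API.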

Using self-hosted runtimes like Hugging Face Text Embeddings Inference (TEI) or Triton Inference Server keeps your raw text chunks entirely within your private VPC boundary.

Here is an example of how to configure an ingestion worker to use a local TEI endpoint instead of an external API:

embedder.ts

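// Calls a self-hosted Text Embeddings Inference (TEI) server over plain HTTP.
// The hostname is a placeholder for your own in-VPC service.
// Requires a runtime with a global fetch (for example Node.js 18 or later).
const TEI_EMBED_URL = "http://tei-service.local-vpc.internal/embed";
 
export async function generateChunkVector(text: string): Promise<number[]> {
  const response = await fetch(TEI_EMBED_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ inputs: text }),
  });
  if (!response.ok) {
    throw new Error(`TEI embedding request failed with status ${response.status}`);
  }
  // TEI's /embed route returns one embedding per input, i.e. number[][].
  const [vector] = (await response.json()) as number[][];
  return vector;
}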

Step 3: Building the ACL Sync and Metadata Filtering Mechanism

To make your vector database respect your company's directory structure, you must sync source permissions directly to your vector metadata.

During document ingestion, extract the permission arrays, such as allowed user IDs, group IDs, or tenant IDs, from systems like Okta or Active Directory. Store this access list as an array of strings within the metadata payload of each vector chunk.
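A minimal sketch of that payload-building step might look like the following. The `SourcePermissions` shape, the `buildChunkPayload` helper, and the `allowed_users` and `source_doc_id` fields are hypothetical; only `tenant_id` and `allowed_groups` correspond to field names used elsewhere in this guide. How you actually resolve permissions from Okta or Active Directory is outside the scope of the sketch.

```typescript
// Hypothetical shape of the permissions resolved from the source system.
interface SourcePermissions {
  tenantId: string;
  groupIds: string[];
  userIds: string[];
}

// Metadata stored next to each vector. Field names other than tenant_id and
// allowed_groups are assumptions made for this example.
interface ChunkPayload {
  tenant_id: string;
  allowed_groups: string[];
  allowed_users: string[];
  source_doc_id: string;
  text: string;
}

// Copies the document-level ACL onto a chunk. Every chunk of a document gets
// the same access list, de-duplicated and sorted so payloads are stable.
function buildChunkPayload(sourceDocId: string, text: string, perms: SourcePermissions): ChunkPayload {
  return {
    tenant_id: perms.tenantId,
    allowed_groups: Array.from(new Set(perms.groupIds)).sort(),
    allowed_users: Array.from(new Set(perms.userIds)).sort(),
    source_doc_id: sourceDocId,
    text,
  };
}

const examplePayload = buildChunkPayload("doc-42", "[REDACTED_NAME] signed the contract.", {
  tenantId: "acme",
  groupIds: ["hr", "finance", "hr"],
  userIds: ["u-17"],
});
```

Keeping the source document ID in the payload also makes it possible to update or delete every chunk of a document when its permissions change.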

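When a user queries your enterprise search, intercept the request. Retrieve the user's authenticated group memberships from their session token, and append a metadata filter that matches any of those groups to the vector search query. The syntax depends on the store: Pinecone uses an $in operator, Qdrant uses a match-any condition, and PostgreSQL with the pgvector extension uses an ordinary SQL WHERE clause. In each case the restriction is applied at the database level, before results are returned to your application.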

Here is how to run a secure query using metadata filtering in Qdrant:

secure_query.ts

import { QdrantClient } from "@qdrant/js-client-rest";
 
const client = new QdrantClient({ url: "http://qdrant.local-vpc.internal" });
 
interface SecureQueryParams {
  vector: number[];
  userGroups: string[];
  tenantId: string;
}
 
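export async function searchSecureVectors({ vector, userGroups, tenantId }: SecureQueryParams) {
  // Fail closed: a caller with no groups must never reach an unfiltered search.
  if (userGroups.length === 0) {
    return [];
  }
  return await client.search("enterprise_documents", {
    vector,
    limit: 10,
    filter: {
      must: [
        // Restrict results to the caller's tenant.
        { key: "tenant_id", match: { value: tenantId } },
        // Keep only chunks whose allowed_groups contains at least one of the caller's groups.
        { key: "allowed_groups", match: { any: userGroups } }
      ]
    }
  });
}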

Operational Risks: Latency, Cost, and Maintenance Tradeoffs

Building a secure private RAG system introduces real operational tradeoffs. Security is never free; it requires a commitment to infrastructure maintenance and compute budgets.

First, self-hosting embedding models and LLMs means managing GPU instances. You will face cold-start latencies if your services scale down to zero, and you must manage the patching and scaling of your inference nodes.

Second, your vector database will require significantly more memory (RAM). Storing large arrays of ACLs and user groups inside vector metadata payloads makes indexing more complex. As your document library grows into millions of chunks, your database memory footprint will expand to keep these metadata indexes in RAM for fast filtering.

Third, real-time permission syncing is fragile. If a manager revokes a user's access to a sensitive folder in Google Drive, that change must propagate to your vector database immediately. A delay of even a few minutes leaves a window of stale permissions in which the user can still retrieve the restricted data via semantic search.
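One possible mitigation, offered here as an assumption rather than a prescribed design, is a second check at query time against a small revocation list that is updated as soon as a revocation event arrives, before the slower re-indexing job runs. In the sketch below, the `Hit` shape, the `dropRevokedHits` helper, and the "<sourceDocId>:<groupId>" key format are all hypothetical.

```typescript
// Hypothetical shape of a search hit after retrieval from the vector store.
interface Hit {
  chunkId: string;
  sourceDocId: string;
  allowedGroups: string[];
}

// revokedGrants holds keys of the form "<sourceDocId>:<groupId>". It is assumed
// to be written immediately when a revocation event is received, so it can be
// ahead of the metadata stored in the vector database.
function dropRevokedHits(hits: Hit[], userGroups: string[], revokedGrants: Set<string>): Hit[] {
  return hits.filter(hit =>
    // Keep a hit only if the user still holds at least one non-revoked grant on it.
    hit.allowedGroups.some(
      group => userGroups.includes(group) && !revokedGrants.has(`${hit.sourceDocId}:${group}`)
    )
  );
}

const staleHits: Hit[] = [
  { chunkId: "c1", sourceDocId: "doc-hr", allowedGroups: ["managers"] },
  { chunkId: "c2", sourceDocId: "doc-wiki", allowedGroups: ["managers", "everyone"] },
];

// The managers group has just lost access to doc-hr, but the index is not yet updated.
const pendingRevocations = new Set<string>(["doc-hr:managers"]);
const safeHits = dropRevokedHits(staleHits, ["managers"], pendingRevocations);
```

This does not remove the need for prompt re-indexing; it only narrows the gap between a revocation and its enforcement.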

SECURITY WARNING

Always design your system to fail closed. If a document's permissions are missing or corrupted during ingestion, the system must default to excluding it from all search queries.
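A fail-closed check at ingestion could be sketched as follows. The `IngestCandidate` shape and the `partitionForIndexing` helper are hypothetical names; the point is simply that anything without a well-formed ACL is set aside for review instead of being written to the index.

```typescript
// Hypothetical shape of a chunk arriving at the indexing stage. allowedGroups
// is typed as unknown because upstream extraction may have corrupted it.
interface IngestCandidate {
  chunkId: string;
  tenantId?: string | null;
  allowedGroups?: unknown;
}

// A chunk is indexable only if it has a non-empty tenant and a non-empty list
// of non-empty group strings. Anything else is treated as missing permissions.
function hasValidAcl(candidate: IngestCandidate): boolean {
  const groups = candidate.allowedGroups;
  return (
    typeof candidate.tenantId === "string" &&
    candidate.tenantId.length > 0 &&
    Array.isArray(groups) &&
    groups.length > 0 &&
    groups.every(g => typeof g === "string" && g.length > 0)
  );
}

// Splits candidates into those safe to index and those to quarantine.
function partitionForIndexing(candidates: IngestCandidate[]) {
  const indexable: IngestCandidate[] = [];
  const quarantined: IngestCandidate[] = [];
  for (const candidate of candidates) {
    (hasValidAcl(candidate) ? indexable : quarantined).push(candidate);
  }
  return { indexable, quarantined };
}

const ingestResult = partitionForIndexing([
  { chunkId: "a", tenantId: "t1", allowedGroups: ["hr"] },
  { chunkId: "b", tenantId: "t1" },                        // permissions missing
  { chunkId: "c", tenantId: "t1", allowedGroups: "hr" },   // corrupted: not an array
  { chunkId: "d", tenantId: "", allowedGroups: ["hr"] },   // empty tenant
]);
```

Quarantined chunks never receive a vector entry, so there is nothing for a search query to match until their permissions are repaired.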

A safe RAG system is more than embeddings. It needs document permissions, source citations, data residency, prompt-injection controls, and audit logs that explain exactly where every answer came from. If an auditor asks why a user received a specific answer, you must be able to trace the response back to the exact chunk, the vector query, the metadata filter applied, and the original document source.
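As a rough illustration of what such a trace could contain, the sketch below builds one audit record per answer and serializes it as a JSON line. All type and field names are assumptions chosen for this example; adapt them to whatever logging pipeline you already run, and consider whether the raw query text itself needs redaction before it is stored.

```typescript
// Reference to one retrieved chunk and the document it came from.
interface RetrievedChunkRef {
  chunkId: string;
  sourceDocId: string;
  score: number;
}

// Hypothetical audit record: enough to explain which chunks, filter, and
// source documents produced a given answer.
interface AuditRecord {
  timestamp: string;
  userId: string;
  queryText: string;
  appliedFilter: unknown; // the exact metadata filter sent to the vector store
  retrieved: RetrievedChunkRef[];
  answerId: string;
}

// The clock is injectable so records are reproducible in tests.
function buildAuditRecord(fields: Omit<AuditRecord, "timestamp">, now: Date = new Date()): AuditRecord {
  return { timestamp: now.toISOString(), ...fields };
}

// One JSON object per line is easy to append to a file or ship to a log store.
function toAuditLine(record: AuditRecord): string {
  return JSON.stringify(record);
}

const auditLine = toAuditLine(
  buildAuditRecord(
    {
      userId: "u-17",
      queryText: "What is our travel policy?",
      appliedFilter: { tenant_id: "acme", allowed_groups: ["engineering"] },
      retrieved: [{ chunkId: "c9", sourceDocId: "doc-travel", score: 0.87 }],
      answerId: "ans-001",
    },
    new Date("2024-01-01T00:00:00.000Z")
  )
);
```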

Where to Start: Securing Your Enterprise Data Path

Trying to bolt security onto a public RAG system after it is already built is a recipe for data leaks. Security must be baked into the ingestion and retrieval architecture from day one. If you allow unredacted, unmapped data into your vector database, cleaning it up after the fact is incredibly difficult.

Your immediate next step is to audit your target data sources, such as SharePoint, Google Drive, and internal databases, and map out how their permissions are structured. Before writing any code, document every group, role, and tenant boundary that your search system must respect.

From there, set up a local proof-of-concept on your machine. Install Microsoft Presidio to handle deterministic redaction, and run a local instance of a vector database like pgvector or Qdrant. Write a basic ingestion script that extracts a document, redacts its PII, attaches a mock list of allowed user groups to its metadata, and queries it using a metadata filter. Once you prove that your ACL sync logic works locally, you can confidently begin deploying your secure pipeline to your private cloud infrastructure.