Okira Labs / AI Engineering

Building an enterprise semantic search system is more than just running a vector database and calling an API. Many engineering teams start with a simple proof-of-concept, only to watch it break when faced with real-world corporate data. Your employees do not search your internal databases the way they search the public web. They use highly specific terminology, internal codes, and acronyms. They expect your internal search to locate precise answers while strictly respecting their access permissions. If your systems fail to protect document-level security, or if they violate strict regulatory frameworks, your search tool becomes a liability.

What is Enterprise Semantic Search?

Enterprise semantic search is a retrieval system that understands the intent and contextual meaning of user queries across fragmented corporate data silos, rather than relying solely on literal keyword matching.

In a corporate environment, legacy keyword search built on ranking functions like BM25 often fails because different teams use different vocabularies for identical concepts. A product manager might search for "customer churn," while a database engineer searches for "account termination logs." Keyword search cannot bridge this gap because it matches literal terms, not meaning.

To build a reliable knowledge management platform, you must adopt a hybrid retrieval system. This architecture combines dense vector embeddings with sparse lexical search. Dense embeddings map queries and documents into a continuous vector space where conceptually similar items sit close to each other. Sparse retrieval (BM25) handles exact matches, such as product serial numbers, legal case codes, or customer identifiers.

In the BEIR benchmark, zero-shot dense retrieval models frequently underperformed BM25 on specialized, out-of-domain datasets, in some cases by 10% to 20% NDCG@10. Corporate corpora tend to be out-of-domain in the same way. Combining both methods helps you keep exact-match precision while gaining conceptual understanding.
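One way to combine the two result lists is reciprocal rank fusion (RRF). The article does not name a fusion method, so RRF is an assumption here, as are the constant k = 60 and the document IDs. The sketch merges a dense ranking and a BM25 ranking into a single ordering.

```typescript
// Reciprocal rank fusion (RRF): merge ranked ID lists from several retrievers.
// k = 60 is a commonly used default, assumed here rather than taken from the article.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      // Ranks are 1-based, so the top hit of a list contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(docId, (scores.get(docId) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Hypothetical result lists: the dense retriever surfaces paraphrases,
// the BM25 retriever surfaces the exact serial-number match.
const denseHits = ["doc-churn-report", "doc-retention-plan", "doc-sn-4471"];
const sparseHits = ["doc-sn-4471", "doc-churn-report"];
const fused = reciprocalRankFusion([denseHits, sparseHits]);
```

Documents that appear in both lists rise to the top, while a document found by only one retriever is still kept further down.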

The Naive RAG Trap: What Goes Wrong in Production

Most projects start with a naive Retrieval-Augmented Generation (RAG) design. Developers write a script to parse PDFs, split them into uniform chunks, generate embeddings, and dump them into a vector database. This works during a demo with five documents.

In production, this approach falls apart. When you scale to millions of documents across shared drives, emails, and databases, you hit three major problems: authorization leakage, compliance violations, and model migration overhead.

Semantic search helps unify documents, tickets, policies, and research, but it earns trust only if it preserves permissions and returns citations. If an employee searches for "salary adjustments," the engine must never return chunks of a restricted HR spreadsheet unless that employee has explicit access.

The ACL Filtering Nightmare: Why Pre- and Post-Filtering Fail

Securing search results requires enforcing document-level Access Control Lists (ACLs). There are two naive ways to do this, and both fail.

First, post-filtering retrieves the top-k most similar vectors and then discards any results the user is not authorized to see. This leads to "search starvation." If the top 20 matches all belong to a restricted directory, the system filters them all out. The user sees an empty results page, even though authorized documents exist further down the ranking.

Second, pre-filtering restricts the search space before running the vector query. This sounds logical, but in many engines it forces the database to bypass the Hierarchical Navigable Small World (HNSW) index. The query falls back to an exact, brute-force distance comparison against every remaining document, which is slow and expensive when the authorized set is large.

The solution is single-stage filtering. In this model, metadata constraints are evaluated during the vector graph traversal. The search algorithm only traverses nodes that match the user's security tokens, maintaining both speed and accuracy.
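The difference in outcomes can be shown with a toy corpus. In this sketch a plain loop stands in for the HNSW traversal, so it illustrates only where the ACL check happens, not how a real graph index is walked. The chunk shape, scores, and group names are all hypothetical.

```typescript
interface AclChunk {
  id: string;
  score: number;           // similarity to the query, precomputed for this toy example
  allowedGroups: string[]; // ACL stored in the chunk's metadata
}

function isAuthorized(chunk: AclChunk, userGroups: string[]): boolean {
  return chunk.allowedGroups.some((g) => userGroups.indexOf(g) !== -1);
}

// Post-filtering: take the top-k by similarity, then drop what the user may not see.
function postFilterSearch(chunks: AclChunk[], userGroups: string[], k: number): AclChunk[] {
  const topK = chunks.slice().sort((a, b) => b.score - a.score).slice(0, k);
  return topK.filter((c) => isAuthorized(c, userGroups));
}

// Filter-aware search: the ACL check runs on each visited candidate,
// so unauthorized chunks never occupy one of the k result slots.
function filterAwareSearch(chunks: AclChunk[], userGroups: string[], k: number): AclChunk[] {
  const best: AclChunk[] = [];
  for (const candidate of chunks) { // stands in for visiting graph nodes
    if (!isAuthorized(candidate, userGroups)) continue;
    best.push(candidate);
    best.sort((a, b) => b.score - a.score);
    if (best.length > k) best.pop(); // keep only the k best authorized chunks
  }
  return best;
}

const corpus: AclChunk[] = [
  { id: "hr-1", score: 0.95, allowedGroups: ["hr"] },
  { id: "hr-2", score: 0.93, allowedGroups: ["hr"] },
  { id: "hr-3", score: 0.91, allowedGroups: ["hr"] },
  { id: "wiki-1", score: 0.8, allowedGroups: ["all-staff"] },
  { id: "wiki-2", score: 0.75, allowedGroups: ["all-staff"] },
];

const starved = postFilterSearch(corpus, ["all-staff"], 3); // empty: all three slots went to HR chunks
const served = filterAwareSearch(corpus, ["all-staff"], 3); // wiki-1 and wiki-2
```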

Compliance frameworks add a second requirement: deleted data must actually be gone. Section 12 of India's Digital Personal Data Protection Act (DPDPA) 2023 and Article 17 of Europe's GDPR both give individuals a right to erasure of their personal data. Under the DPDPA, failure to implement reasonable security safeguards to prevent personal data breaches can result in penalties of up to INR 250 crore (approximately USD 30 million).

Vector databases do not naturally map individual vector chunks back to a specific user identity or source document. When you delete a source file from S3 or SharePoint, the raw file is gone, but orphaned, highly sensitive vector chunks remain inside your vector index. Because these chunks carry raw text fragments in their metadata, they can still be retrieved during searches, creating a serious compliance risk.

The Re-indexing Downtime Trap

Embedding models are not interchangeable. A vector generated by a Cohere model cannot be compared to a vector generated by an OpenAI or Hugging Face model. They exist in entirely different vector spaces.
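A cheap safeguard is to store the producing model's identifier next to every vector and refuse to compare across models. The sketch below assumes a hypothetical `TaggedEmbedding` shape and made-up model IDs; it is one possible guard, not a feature of any particular vector database.

```typescript
interface TaggedEmbedding {
  modelId: string; // hypothetical identifier of the model that produced the vector
  values: number[];
}

function cosineSimilarity(a: TaggedEmbedding, b: TaggedEmbedding): number {
  // Vectors from different models live in unrelated spaces, so a score would be meaningless.
  if (a.modelId !== b.modelId) {
    throw new Error(`Cannot compare embeddings from ${a.modelId} and ${b.modelId}`);
  }
  if (a.values.length !== b.values.length) {
    throw new Error("Dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.values.forEach((x, i) => {
    const y = b.values[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero for empty or all-zero vectors
  return dot / Math.sqrt(normA * normB);
}

const storedChunk: TaggedEmbedding = { modelId: "model-a-v1", values: [0.6, 0.8] };
const sameModelQuery: TaggedEmbedding = { modelId: "model-a-v1", values: [0.6, 0.8] };
const otherModelQuery: TaggedEmbedding = { modelId: "model-b-v2", values: [0.6, 0.8] };
```

Here `cosineSimilarity(storedChunk, sameModelQuery)` returns a score, while `cosineSimilarity(storedChunk, otherModelQuery)` throws even though the dimensions happen to match.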

When you decide to upgrade your embedding model to improve accuracy, you cannot simply swap the API key. You must re-index 100% of your enterprise document corpus. For millions of documents, this process can take days and requires significant compute power. If you do not plan for this, your search service will face extended downtime.

The Production-Grade Architecture: Hybrid Retrieval with Single-Stage Filtering

To build an enterprise-ready search system, you must decouple ingestion, indexing, and query execution. The architecture must handle identity propagation, parallel indexing, and multi-stage retrieval.

At ingestion, documents are parsed, and their metadata (including ACLs and document ownership) is extracted. The text is chunked and sent to two parallel engines: a vector database for dense embeddings and a search engine (like Elasticsearch or OpenSearch) for sparse BM25 indexing.
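A minimal sketch of that ingestion step follows. It assumes fixed-size character chunking purely for brevity, and the `SourceDocument`, `ChunkRecord`, and `IndexWriter` names are hypothetical. In-memory writers stand in for the vector database and the BM25 engine.

```typescript
interface SourceDocument {
  id: string;
  ownerId: string;
  allowedGroups: string[];
  text: string;
}

interface ChunkRecord {
  chunkId: string;
  documentId: string;
  chunkIndex: number;
  ownerId: string;
  allowedGroups: string[];
  text: string;
}

// Stand-in for a vector database client or a lexical search client.
interface IndexWriter {
  write(records: ChunkRecord[]): void;
}

class InMemoryWriter implements IndexWriter {
  readonly stored: ChunkRecord[] = [];
  write(records: ChunkRecord[]): void {
    records.forEach((r) => this.stored.push(r));
  }
}

// Split the text into fixed-size chunks and copy ownership and ACLs onto each one.
function chunkDocument(doc: SourceDocument, maxChars: number): ChunkRecord[] {
  const records: ChunkRecord[] = [];
  for (let start = 0, index = 0; start < doc.text.length; start += maxChars, index++) {
    records.push({
      chunkId: `${doc.id}#${index}`,
      documentId: doc.id,
      chunkIndex: index,
      ownerId: doc.ownerId,
      allowedGroups: doc.allowedGroups.slice(), // each chunk carries its own copy of the ACL
      text: doc.text.slice(start, start + maxChars),
    });
  }
  return records;
}

// Send the same chunk records to every index (dense and sparse).
function ingest(doc: SourceDocument, maxChars: number, writers: IndexWriter[]): ChunkRecord[] {
  const records = chunkDocument(doc, maxChars);
  writers.forEach((w) => w.write(records));
  return records;
}
```

Because both indexes receive identical chunk IDs and ACL metadata, query-time filters and later deletions can address the same records in each engine.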

At query time, the system fetches the user's security groups from your identity provider, such as Okta or Active Directory. These group IDs are injected directly into the vector query as metadata filters. Once the vector and lexical engines return their filtered candidate sets, a cross-encoder model reranks the combined results to deliver the most relevant, authorized documents to the user.
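The query path can be sketched as a small orchestration function. Everything behind the `SearchDeps` interface is a hypothetical stand-in: a real identity provider, vector store, search engine, and cross-encoder would each sit behind an asynchronous client. The calls are kept synchronous here only to keep the example short.

```typescript
interface Candidate {
  chunkId: string;
  text: string;
}

// Hypothetical dependencies; real implementations would be async network clients.
interface SearchDeps {
  fetchUserGroups(userId: string): string[];
  vectorSearch(query: string, groups: string[], limit: number): Candidate[];
  lexicalSearch(query: string, groups: string[], limit: number): Candidate[];
  rerankScore(query: string, candidate: Candidate): number; // stand-in for a cross-encoder
}

function search(deps: SearchDeps, userId: string, query: string, limit: number): Candidate[] {
  // 1. Resolve the caller's security groups once per request.
  const groups = deps.fetchUserGroups(userId);

  // 2. Both engines receive the same group filter, so neither returns unauthorized chunks.
  const dense = deps.vectorSearch(query, groups, limit);
  const sparse = deps.lexicalSearch(query, groups, limit);

  // 3. Union the candidate sets, keeping one entry per chunk ID.
  const byId = new Map<string, Candidate>();
  dense.concat(sparse).forEach((c) => {
    if (!byId.has(c.chunkId)) byId.set(c.chunkId, c);
  });

  // 4. Rerank the combined set and return the best results.
  return Array.from(byId.values())
    .map((candidate) => ({ candidate, score: deps.rerankScore(query, candidate) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((scored) => scored.candidate);
}
```

The group lookup happens before either engine is called, which keeps the authorization decision out of the reranking stage entirely.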

Implementation Blueprint: Solving the Three Core Engineering Challenges

Translating this architecture into code requires solving the operational challenges of security, compliance, and model updates. Use these three blueprints to structure your implementation.

Step 1: Implementing Single-Stage ACL Filtering

To implement single-stage filtering, you must attach user and group ACLs directly to the metadata payload of every vector chunk during ingestion. When a user queries the system, you retrieve their active security tokens from your identity provider and pass them as query parameters.

Engines with native filtered search evaluate these metadata constraints while traversing the HNSW graph. The shape of the query is easiest to show in SQL, using PostgreSQL with the pgvector extension:

query_with_acl.sql

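-- Rank chunks by cosine distance (<=>), keeping only those the user may read.
SELECT id, chunk_text, 1 - (embedding <=> :query_embedding) AS similarity  -- distance converted to similarity
FROM document_chunks
WHERE authorized_groups && :user_groups  -- array overlap: at least one shared group
ORDER BY embedding <=> :query_embedding  -- ascending distance, so the closest chunks come first
LIMIT 10;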

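This query uses the PostgreSQL array overlap operator (&&) to ensure the document chunk's authorized_groups array shares at least one value with the user's active user_groups array. Two caveats apply to pgvector specifically. With an approximate index such as HNSW, pgvector applies the WHERE clause after the index scan, so a selective ACL filter can return fewer than 10 rows. pgvector 0.8.0 and later offer iterative index scans (for example, SET hnsw.iterative_scan = strict_order) that keep scanning until enough rows pass the filter. The HNSW index also covers only the vector column, so index authorized_groups separately with GIN, which lets the planner choose an exact search when the filter matches few rows.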
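From application code, the same query can be built with positional parameters. This sketch assumes node-postgres-style placeholders ($1, $2, $3) and a text[] type for authorized_groups, neither of which the article specifies. It only constructs the query object and does not open a database connection.

```typescript
// Shape accepted by node-postgres style clients: SQL text plus positional values.
interface AclQuery {
  text: string;
  values: unknown[];
}

// pgvector accepts vectors in the text form "[0.1,0.2,...]".
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

function buildAclQuery(queryEmbedding: number[], userGroups: string[], limit = 10): AclQuery {
  // An empty userGroups array overlaps with nothing, so the query fails closed.
  const text = `
    SELECT id, chunk_text, 1 - (embedding <=> $1::vector) AS similarity
    FROM document_chunks
    WHERE authorized_groups && $2::text[]
    ORDER BY embedding <=> $1::vector
    LIMIT $3`;
  return { text, values: [toVectorLiteral(queryEmbedding), userGroups, limit] };
}

// Hypothetical inputs: a tiny embedding and two group IDs from the identity provider.
const aclQuery = buildAclQuery([0.1, 0.2], ["grp-eng", "grp-all-staff"], 5);
```

Passing the groups as a bound parameter, never as concatenated SQL, keeps group names supplied by the identity provider from altering the statement.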

Step 2: Building a Bidirectional Mapping Table for Vector Erasure

To comply with the Right to Erasure under DPDPA 2023, you must be able to completely delete all vector chunks associated with a specific document or user. Do not rely on scanning the vector database metadata for deletions.

Instead, maintain a relational database table that acts as a master mapping registry. This table records the exact relationships between source documents, users, and vector IDs.

mapping_schema.ts

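// One row per vector chunk, linking it back to its source document and user.
interface VectorMapping {
  mapping_id: string;          // primary key of this mapping row
  source_document_id: string;  // document the chunk was extracted from
  user_id: string;             // user associated with the chunk, for user-level erasure
  vector_db_id: string;        // ID of the chunk's vector in the vector database
  chunk_index: number;         // position of the chunk within the source document
  created_at: Date;            // when the mapping row was written
}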

When a deletion request arrives:

1. Query the mapping table using the source_document_id or user_id.
2. Retrieve the list of corresponding vector_db_id values.
3. Issue a transactional batch delete command to your vector database using those IDs.
4. Delete the corresponding records from your BM25 lexical index.
5. Remove the entries from the mapping table.

This database-first approach ensures that no orphaned vector chunks remain in your index after a source document is deleted.
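The deletion workflow can be sketched against in-memory stand-ins. The `ErasureStores` shape is hypothetical, and the sketch assumes the lexical index is keyed by the same IDs as the vector index, which the article does not state.

```typescript
interface MappingRow {
  mapping_id: string;
  source_document_id: string;
  user_id: string;
  vector_db_id: string;
}

// In-memory stand-ins for the mapping table, the vector database, and the BM25 index.
interface ErasureStores {
  mappings: MappingRow[];
  vectorIndex: Map<string, number[]>; // vector_db_id -> embedding
  lexicalIndex: Map<string, string>;  // vector_db_id -> chunk text (shared key is an assumption)
}

function eraseDocument(stores: ErasureStores, sourceDocumentId: string): string[] {
  // Steps 1 and 2: look up every vector ID registered for the document.
  const ids = stores.mappings
    .filter((m) => m.source_document_id === sourceDocumentId)
    .map((m) => m.vector_db_id);

  // Steps 3 and 4: delete from both search indexes by ID.
  ids.forEach((id) => {
    stores.vectorIndex.delete(id);
    stores.lexicalIndex.delete(id);
  });

  // Step 5: remove the mapping rows last, so a failed index delete can still be retried.
  stores.mappings = stores.mappings.filter((m) => m.source_document_id !== sourceDocumentId);
  return ids;
}
```

Removing the mapping rows only after the index deletes succeed matters: if the registry were cleared first and an index delete then failed, the orphaned vectors could no longer be located.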

Step 3: Executing Zero-Downtime Model Upgrades with Dual-Index Architecture

To upgrade your embedding model without taking your search service offline, you must use a dual-index architecture. This pattern uses index aliases to switch traffic instantly once background indexing is complete.

DUAL-INDEX PATTERN

Never point your application directly to a physical index name. Always route queries and writes through logical aliases to allow background swaps.

The migration process follows four steps:

1. Create a new target index (Index B) configured for the new embedding model's dimensions.
2. Update your ingestion pipeline to write all new and updated documents to both the active index (Index A) and the target index (Index B) simultaneously.
3. Run a background worker to read historical documents from your database, generate new embeddings using the target model, and write them to Index B.
4. Once Index B is fully synchronized, update your API gateway's index alias to point to Index B. You can then safely decommission Index A.
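The four steps can be modeled with a small in-memory registry. `IndexRegistry`, `dualWrite`, and `backfill` are hypothetical names for this sketch. Engines such as Elasticsearch and OpenSearch provide index aliases natively, so in practice the registry would be the engine's own alias API.

```typescript
type VectorIndex = Map<string, number[]>;
type Embedder = (text: string) => number[];

class IndexRegistry {
  private indexes = new Map<string, VectorIndex>();
  private aliases = new Map<string, string>();

  createIndex(name: string): void {
    this.indexes.set(name, new Map<string, number[]>());
  }

  getIndex(name: string): VectorIndex {
    const index = this.indexes.get(name);
    if (!index) throw new Error(`Unknown index: ${name}`);
    return index;
  }

  // Repointing an alias is a single assignment, so readers switch over at once.
  pointAlias(alias: string, indexName: string): void {
    this.getIndex(indexName); // throws if the target does not exist
    this.aliases.set(alias, indexName);
  }

  resolve(alias: string): VectorIndex {
    const name = this.aliases.get(alias);
    if (!name) throw new Error(`Unknown alias: ${alias}`);
    return this.getIndex(name);
  }
}

// Step 2: every new or updated document is embedded once per index and written to both.
function dualWrite(
  registry: IndexRegistry,
  docId: string,
  text: string,
  targets: { indexName: string; embed: Embedder }[]
): void {
  targets.forEach((t) => registry.getIndex(t.indexName).set(docId, t.embed(text)));
}

// Step 3: re-embed historical documents, skipping those the dual-write path already covered.
function backfill(registry: IndexRegistry, corpus: Map<string, string>, indexName: string, embed: Embedder): void {
  const index = registry.getIndex(indexName);
  corpus.forEach((text, docId) => {
    if (!index.has(docId)) index.set(docId, embed(text));
  });
}
```

Step 4 is then a single call such as `registry.pointAlias("search", "index-b")`, made only after the backfill has finished.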

Evaluating System Risks and Performance Trade-offs

Every architectural decision has a cost. While single-stage filtering, bidirectional mapping, and dual-index migrations solve security and compliance problems, they introduce operational trade-offs.

First, complex metadata filtering adds latency. When you inject deep, nested ACL arrays into your vector queries, the database must evaluate these conditions at every step of the graph traversal. If a user belongs to hundreds of security groups, search latency will rise. You must monitor your p99 query times and optimize your metadata indexes to prevent slowdowns.

Second, running a dual-index architecture during a model migration temporarily doubles your storage and compute costs. You are generating twice as many embeddings and storing two complete copies of your vector dataset. Ensure your infrastructure budget can accommodate these spikes during migration windows.

Finally, real-time index synchronization is difficult to maintain. If a document's ACL changes in your source system, that change must propagate to your search index immediately. A delay in synchronization can cause a temporary security window where a user can view a document they no longer have permission to access.

To mitigate these risks, you must build thorough validation checks. Before swapping your index alias during a model upgrade, run automated tests to verify that the document counts match and that a fixed set of known queries still returns the documents you expect.
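A pre-swap check along those lines might look like the following sketch. The `GoldenQuery` shape, the top-k cutoff, and the injected `searchNewIndex` function are assumptions made for illustration.

```typescript
interface GoldenQuery {
  query: string;
  expectedIds: string[]; // documents that must appear in the top k for this query
}

interface SwapReport {
  ok: boolean;
  failures: string[];
}

// Hypothetical hook that runs a query against the new index and returns ranked document IDs.
type SearchFn = (query: string, k: number) => string[];

function validateBeforeSwap(
  oldCount: number,
  newCount: number,
  golden: GoldenQuery[],
  searchNewIndex: SearchFn,
  k: number
): SwapReport {
  const failures: string[] = [];

  // Check 1: the new index holds as many documents as the active one.
  if (oldCount !== newCount) {
    failures.push(`count mismatch: old=${oldCount}, new=${newCount}`);
  }

  // Check 2: every golden query still surfaces its expected documents.
  golden.forEach((g) => {
    const results = searchNewIndex(g.query, k);
    const missing = g.expectedIds.filter((id) => results.indexOf(id) === -1);
    if (missing.length > 0) {
      failures.push(`"${g.query}" is missing ${missing.join(", ")} in top ${k}`);
    }
  });

  return { ok: failures.length === 0, failures };
}
```

Only repoint the alias when `ok` is true. The `failures` list gives the migration worker something concrete to log otherwise.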

Where to start

If you are tasked with building or upgrading an enterprise search system, do not start by choosing a vector database or an embedding model. Start by conducting a thorough audit of your source data's existing authorization schemas and compliance requirements.

Map out where your data lives, who owns it, and how permissions are managed in your source systems. Once you understand your security landscape, establish a standardized metadata schema that captures document IDs, owners, creation dates, and ACLs uniformly across all data connectors. Selecting a database engine that natively supports single-stage filtering and transactional, ID-based deletions will then allow you to build an ingestion pipeline that remains secure and compliant as your data scales.
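As a starting point, the standardized schema can be a single type that every connector maps into. The two connector payloads below are invented for illustration and do not mirror any real API. The point is only that IDs, owners, dates, and ACLs come out in one uniform shape.

```typescript
// Uniform metadata attached to every document regardless of where it came from.
interface DocumentMetadata {
  documentId: string;
  source: string;
  ownerId: string;
  createdAt: string;       // ISO 8601 timestamp
  allowedGroups: string[]; // lowercased, deduplicated, sorted
}

// Hypothetical connector payloads with differing field names and formats.
interface DriveFileRecord {
  fileId: string;
  owner: string;
  createdTime: string; // ISO 8601 string
  readers: string[];
}

interface WikiPageRecord {
  pageId: number;
  author: string;
  createdEpochMs: number;
  viewGroups: string[];
}

function normalizeGroups(groups: string[]): string[] {
  const unique = new Set(groups.map((g) => g.trim().toLowerCase()));
  return Array.from(unique).sort();
}

function fromDriveFile(file: DriveFileRecord): DocumentMetadata {
  return {
    documentId: `drive:${file.fileId}`,
    source: "drive",
    ownerId: file.owner,
    createdAt: new Date(file.createdTime).toISOString(),
    allowedGroups: normalizeGroups(file.readers),
  };
}

function fromWikiPage(page: WikiPageRecord): DocumentMetadata {
  return {
    documentId: `wiki:${page.pageId}`,
    source: "wiki",
    ownerId: page.author,
    createdAt: new Date(page.createdEpochMs).toISOString(),
    allowedGroups: normalizeGroups(page.viewGroups),
  };
}
```

Prefixing the document ID with its source keeps IDs unique across connectors, which the mapping registry and ID-based deletions both depend on.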