Scalable Data Pipelines and Retrieval Systems for Retrieval-Augmented Generation
CIO Bulletin
2024-11-14
Retrieval-Augmented Generation (RAG) has become an established pattern for AI-powered applications, effectively addressing the shortcomings of large language models that lack sufficient contextual understanding or domain-specific knowledge.
The architecture centers on maintaining an external data store from which the application fetches information "similar" to a user's prompt. This retrieved knowledge is appended to the original question and passed to the LLM to generate a grounded response.
RAG applications can vary significantly, but most follow two major workflows:
File ingestion pipeline: A series of functions that convert incoming files into vector representations and store them in a specialized vector database. This step enhances AI response quality by giving the system access to more detailed, domain-specific information.
Query response pipeline: A series of functions that accept human-readable prompts, generate embeddings, perform similarity search against the vector database, and assemble a request to the LLM API to generate and stream a contextually augmented response.
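As a rough sketch, the query response pipeline might look like the following, assuming an OpenAI embedding and chat model and a Pinecone index purely for illustration; the pattern itself is vendor-agnostic, and names such as rag-index are placeholders.

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                             # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key="...").Index("rag-index")   # placeholder index name

def answer(prompt: str) -> str:
    # 1. Generate an embedding for the user prompt.
    embedding = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=prompt
    ).data[0].embedding

    # 2. Run a similarity search against the vector database.
    matches = index.query(vector=embedding, top_k=5, include_metadata=True).matches
    context = "\n\n".join(m.metadata["text"] for m in matches)

    # 3. Assemble the augmented request and let the LLM generate the response.
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content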
In a typical RAG application, both workflows are coordinated by an orchestrator, a workflow engine that schedules the individual tasks as directed acyclic graphs (DAGs) and tracks their state.
However, the orchestrator’s scalability can become a bottleneck depending on the chosen technology. A fundamental principle is to let orchestrators orchestrate — not compute.
Heavy computations should be offloaded to specialized microservices to keep the orchestrator lightweight, responsive, and capable of high task throughput.
Running long or CPU-intensive tasks within the orchestrator can saturate the scheduler and workers, starving critical scheduling processes and degrading system performance. While executing heavy workloads internally can slightly reduce network latency, the architectural tradeoff generally favors offloading to external services.
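As a minimal illustration of this principle, an orchestrator task can delegate the expensive step to a dedicated service over HTTP; the service URL and payload shape below are hypothetical placeholders.

import requests

EMBEDDING_SERVICE_URL = "http://embedding-service:8080/embed"  # hypothetical internal service

def vectorize_chunk_task(chunk_text: str) -> list[float]:
    # The orchestrator only issues the request and records the result;
    # the CPU/GPU-heavy embedding work runs inside the external service.
    resp = requests.post(EMBEDDING_SERVICE_URL, json={"text": chunk_text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]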
When an event occurs, it triggers a DAG to process and vectorize an incoming file.
Typical stages of the DAG include extracting text from the file, splitting it into chunks, generating embeddings, and upserting the resulting vectors and metadata into the vector database.
As with any system that breaks work into smaller tasks, defining what constitutes success or failure is critical. Notifications, idempotency, and rollbacks should be accounted for in advance.
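One simple safeguard for idempotency, sketched below under the assumption of a Pinecone-style upsert API, is to derive vector IDs deterministically from the file and chunk position, so that re-running a failed task overwrites records instead of duplicating them.

def vector_id(file_id: str, page_no: int, chunk_no: int) -> str:
    # Deterministic ID: the same chunk always maps to the same record.
    return f"{file_id}_{page_no}_{chunk_no}"

def upsert_chunk(index, file_id: str, page_no: int, chunk_no: int,
                 values: list[float], metadata: dict) -> None:
    # "index" is a vector database handle (e.g., a Pinecone index);
    # upserting with a stable ID makes retries safe.
    index.upsert(vectors=[{
        "id": vector_id(file_id, page_no, chunk_no),
        "values": values,
        "metadata": metadata,
    }])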
Vectorization is a critical stage in RAG workflows, deserving special attention.
A vectorization function processes a page of textual content, generating an embedding vector along with associated metadata and enhancement properties.
The metadata must be carefully structured to support access control, retrieval quality, and downstream filtering. Below is an example of a well-formed vector payload:
{
  "id": "file_id_page_chunk_id",
  "values": [0.123, -0.456, 0.789, ...],
  "metadata": {
    "file_id": "abcd1234",
    "project_id": "proj5678",
    "organization_id": "org9012",
    "category": "private_funds",
    "chunk_no": 5,
    "length": 230,
    "text": "This is the extracted chunk from the document...",
    "created_at": "2025-04-26T19:45:00Z",
    "updated_at": "2025-04-26T19:50:00Z",
    "embedding_model": "text-embedding-ada-002",
    "version": "v1"
  }
}
Access control: Fields like project_id and organization_id allow the system to enforce security policies and restrict data retrieval to authorized users.
Content granularity: chunk_no points to the specific section of a page associated with the stored text, enabling finer-grained search results.
Freshness and prioritization: Timestamps such as created_at and updated_at enable prioritizing more recent documents during retrieval.
Another important metadata property is embedding_model. A RAG system may interact with multiple LLMs that rely on different embedding models, each generating vectors with distinct formats and semantic spaces.
Embeddings created by different models should not be mixed, as this can degrade retrieval accuracy. Storing the embedding_model alongside each vector enables proper versioning, filtering, and maintenance of embedding integrity over time.
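In practice, these metadata fields translate directly into query-time filters. The snippet below is a sketch using Pinecone-style filter syntax; the field values mirror the example payload above and are not prescriptive.

from pinecone import Pinecone

index = Pinecone(api_key="...").Index("rag-index")  # placeholder index name

def filtered_search(prompt_embedding: list[float], organization_id: str, project_id: str):
    # Enforce tenancy and embedding-model consistency at query time.
    return index.query(
        vector=prompt_embedding,
        top_k=5,
        include_metadata=True,
        filter={
            "organization_id": {"$eq": organization_id},            # access control
            "project_id": {"$eq": project_id},                      # access control
            "embedding_model": {"$eq": "text-embedding-ada-002"},   # never mix embedding spaces
        },
    )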
When a user submits a prompt, the server instance should first fetch the relevant context from a centralized Redis database — for example, retrieving chat history using identifiers such as organization_id, user_id, or chat_id.
A chat history typically consists of user prompts and AI responses. This historical data, combined with the current user prompt, is used to generate embeddings, which are then sent in a request to an LLM service.
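A minimal sketch of the history lookup is shown below; the Redis key naming scheme and message format are assumptions for illustration, not a standard.

import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def load_history(organization_id: str, user_id: str, chat_id: str, limit: int = 20) -> list[dict]:
    # Hypothetical key scheme; each list entry is a JSON-encoded
    # {"role": ..., "content": ...} message.
    key = f"chat:{organization_id}:{user_id}:{chat_id}"
    return [json.loads(m) for m in r.lrange(key, -limit, -1)]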
There may also be an intent identification step that determines which context should be retrieved and which model should be used.
The ability for the server to dynamically switch between different workflows (e.g., conversational chat vs. document analysis) is valuable but requires additional AI capabilities.
Integrating a lightweight, low-cost model for intent identification can make this routing process smoother and more efficient.
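A routing step of this kind might look like the sketch below, which asks a small, inexpensive model to classify the prompt before the main workflow runs; the label set and model choice are illustrative assumptions.

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def identify_intent(prompt: str) -> str:
    # Classify the request with a lightweight model before routing it
    # to the conversational or document-analysis workflow.
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the user request as exactly one of: "
                        "conversational_chat, document_analysis. Reply with the label only."},
            {"role": "user", "content": prompt},
        ],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in {"conversational_chat", "document_analysis"} else "conversational_chat"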
To produce structured outputs like tables, lists, or other formatted elements, the LLM must be explicitly instructed to format its response using markup languages such as HTML or Markdown during generation.
Importantly, in vector databases, it is best practice to store pure, semantically meaningful text without embedded HTML tags or visual formatting.
Embedding models are optimized to process clean natural language, and introducing markup into stored content would degrade embedding quality and retrieval accuracy.
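Put together, this means cleaning markup out of text before it is embedded and requesting formatting only at generation time. The sketch below assumes an HTML source document and uses BeautifulSoup for illustration.

from bs4 import BeautifulSoup

def clean_for_embedding(raw_html: str) -> str:
    # Strip tags so the embedding model sees plain natural language.
    return BeautifulSoup(raw_html, "html.parser").get_text(separator=" ", strip=True)

# Formatting is requested from the LLM at generation time, not stored in vectors.
FORMATTING_INSTRUCTION = (
    "Format your answer in Markdown; use tables or bulleted lists where appropriate."
)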
Building scalable infrastructure for retrieval-augmented applications demands more than connecting a vector database to an LLM. It requires thoughtful design across ingestion, vectorization, metadata structuring, orchestration, and real-time retrieval. As AI adoption accelerates, organizations that invest early in robust retrieval foundations will not only improve system performance, but also gain a strategic advantage in delivering trustworthy, efficient AI-powered experiences.