RAG: What It Is and How It Works

By Ricardo Gutierrez · · 19 min read

In this article

  1. How it works step by step
  2. Embeddings explained simply
  3. Chunking strategies
  4. Vector store options
  5. RAG pipeline step by step
  6. When to use RAG
  7. RAG vs fine-tuning: how to decide
  8. Basic architecture
  9. Tools for implementing RAG
  10. Evaluation metrics
  11. Common mistakes and how to avoid them
  12. Next step
  13. FAQ
Team experience: I implemented RAG with Qdrant + Qwen3.5-27B embeddings for our GRC platform's memory module. I reserve fine-tuning for cyber classification (29K ChatML training pairs). For 90% of cases, advanced prompting + RAG is sufficient and much cheaper.

Think of the difference between a student answering from memory (normal LLM) and one who can check their notes before answering (LLM with RAG). The second is more accurate, more up-to-date and can cite sources.

RAG in one sentence

RAG = search for the right information + generate a response with that information as context. It's the most used technique for making LLMs work with your private data without needing to retrain them.

Quick summary

RAG combines semantic search with text generation. Your documents are converted to embeddings, stored in a vector store, and when you ask something, the system retrieves the relevant fragments and passes them to the LLM as context. Cheaper and more flexible than fine-tuning for 80% of cases.

How it works step by step

The RAG flow has three phases:

1. Indexing (once): your documents are split into fragments (chunks), converted to numerical vectors (embeddings) and stored in a vector database.

2. Retrieval (each query): when the user asks a question, that question is converted to a vector and the most similar fragments are searched in the database.

3. Generation (each query): the retrieved fragments are passed as context to the LLM along with the original question. The model generates a response based on that information.

# Flujo simplificado de RAG
1. INDEXACIÓN (offline)
 Documentos → Chunking → Embeddings → Vector DB

2. CONSULTA (runtime)
 Pregunta usuario → Embedding → Búsqueda similar → Top-K chunks

3. GENERACIÓN (runtime)
 System prompt + Chunks relevantes + Pregunta → LLM → Respuesta

Embeddings explained simply

Embeddings are the heart of RAG. Imagine each text sentence becomes a point in a space with many dimensions (768, 1024, or up to 3072 dimensions). Sentences with similar meaning end up as nearby points.

Intuitive example:

This allows searching by meaning instead of exact words. If you ask "pet resting", it will find the fragment about the cat sleeping even though they share no words.

Popular embedding models in 2026:

Key consideration: the embedding model must be the same for indexing and searching. If you index with text-embedding-3-small, you must search with the same model. Changing models requires re-indexing all documents.

Chunking strategies

How you split your documents into fragments is the most important decision in a RAG pipeline (and the most underrated). Bad chunking ruins everything else.

Fixed-size chunking

Splits every X characters with overlap. Simple but crude. Can cut sentences in half. Useful as baseline.

# Ejemplo: chunks de 500 caracteres con 100 de overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100
)

Recursive chunking (most used)

Tries to split by paragraphs, then by sentences, then by characters. Respects text structure. It's the LangChain default and works well for 80% of documents.

Semantic chunking

Uses embeddings to detect topic changes and splits there. More computationally expensive, but produces thematically coherent chunks. Ideal for long documents with multiple topics.

Structure-based chunking

Splits by headers (H1, H2, H3), Markdown sections, or document structure. Ideal for technical documentation, wikis and manuals that are already well-structured.

Practical rule: start with 500-1000 token chunks with 10-20% overlap. If retrieval is poor, experiment with smaller sizes (200-500) for dense documents or larger (1000-2000) for narrative documents.

Vector store options

The vector database stores your embeddings and enables similarity search. The main options in 2026:

To get started (free/simple):

For production (scalable):

Our recommendation: ChromaDB for prototypes, pgvector if you already use PostgreSQL/Supabase, Qdrant for self-hosted production.

RAG pipeline step by step

A complete production RAG pipeline has more components than the basic flow. Here's the realistic version:

# Pipeline RAG completo (producción)

1. INGESTA
   Documentos → Parsing (PDF, DOCX, HTML) → Limpieza de texto
   → Metadata extraction (fecha, autor, categoría)

2. PROCESAMIENTO
   Texto limpio → Chunking (estrategia elegida)
   → Embedding (batch processing)
   → Almacenamiento en vector DB + metadata

3. CONSULTA
   Pregunta → Query rewriting (opcional, mejora resultados)
   → Embedding de la pregunta
   → Búsqueda vectorial (top-K, típicamente K=4-8)
   → Re-ranking (opcional, reordena por relevancia)
   → Filtrado por metadata (fecha, categoría, permisos)

4. GENERACIÓN
   System prompt + Contexto recuperado + Pregunta
   → LLM → Respuesta + Citations

5. POST-PROCESAMIENTO
   Verificación de hallucinations → Formateo
   → Logging para evaluación

Query rewriting: before searching, you rephrase the question for better results. Example: "how long did it take?" gets rewritten as "delivery timelines of the previously mentioned project". A small LLM can do this for you.

Re-ranking: vector search results aren't always in optimal order. A re-ranking model (like Cohere Rerank or BGE Reranker) reorders results by actual relevance to the question. Significantly improves quality.

When to use RAG

Use RAG when: you need answers based on specific documents (manuals, contracts, knowledge base), data changes frequently (news, prices, inventory), you need to cite sources, or you want to reduce model hallucinations.

Don't use RAG when: the task is purely creative (writing fiction), the needed knowledge is general and stable (grammar, basic math), or you have very few documents (less than 10 pages, better to pass them directly in the prompt).

RAG vs fine-tuning: how to decide

The most frequent question. To understand when to choose each, read our guide on fine-tuning: when you need it and when you don't. Here's the decision broken down:

Choose RAG when:

Choose fine-tuning when:

The hybrid option (best of both): Fine-tune for base style and domain, RAG for specific and updated data. It's the architecture we use in production for our GRC module: the model understands cybersecurity jargon (fine-tuning), and retrieves client-specific regulations (RAG).

Basic architecture

The components of a RAG system are:

Document loader: reads PDFs, Word, HTML, Markdown, CSVs. Tools: LangChain loaders, Unstructured, LlamaIndex.

Chunker: splits documents into 200-1000 token fragments. Strategies: by paragraphs, by overlap, semantic.

Embedding model: converts text to vectors. Models: text-embedding-3-small (OpenAI), embed-v3 (Cohere), BGE (open source). To understand embeddings, read about vector databases.

Vector database: stores and searches vectors. Options: Qdrant, Pinecone, Weaviate, ChromaDB, pgvector.

LLM: generates the final response. Any LLM works: GPT-4o, Claude Sonnet, Llama, Qwen. If you don't know which to choose, check which LLM to choose.

Tools for implementing RAG

No-code: ChatGPT with file attachments (implicit RAG), Claude Projects (upload documents), Google NotebookLM (automatic RAG on your sources).

Low-code: n8n with vector store nodes, Flowise (drag & drop RAG), Dify (visual RAG platform).

With code: LangChain (Python/JS, most popular), LlamaIndex (RAG-specialized), Haystack (pipeline-based). To start with code, check the LangChain tutorial.

Evaluation metrics

A RAG without metrics is a RAG you don't know works. The key metrics:

Retrieval metrics (the searcher):

Generation metrics (the response):

Recommended evaluation framework: RAGAS (ragas.io). An open source framework that automatically calculates faithfulness, answer relevancy, context precision and context recall. Integrates with LangChain and LlamaIndex.

# Evaluación con RAGAS (ejemplo)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

result = evaluate(
    dataset,  # preguntas + respuestas + contextos
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)  # Scores de 0 a 1 por métrica

Common mistakes and how to avoid them

Chunks too small: lose context. A paragraph cut in half makes no sense. Solution: use 10-20% overlap and recursive chunking that respects paragraph boundaries.

Chunks too large: dilute relevance. The model receives lots of text but little useful content. Solution: if your chunks exceed 1000 tokens, they're probably too large for most queries.

Not evaluating retrieval: if retrieved chunks aren't relevant, the response will never be good. Measure search precision before blaming the LLM. Solution: log retrieved chunks and manually review them for your first 50 queries.

Ignoring pre-processing: scanned PDFs, complex tables, images with text. If the document doesn't parse well, RAG fails. Solution: use Unstructured.io or OCR services before chunking.

Not using metadata: retrieving chunks without filtering by date, author or category mixes obsolete or irrelevant information. Solution: always store metadata alongside the embedding and filter when appropriate.

K too high: retrieving 20 chunks saturates the LLM context and dilutes the signal. Solution: start with K=4, increase to K=6-8 only if you need more coverage. More than 10 rarely improves results.

Next step

If you've never implemented RAG, start with Claude Projects or NotebookLM (zero code). Upload 5-10 documents and try asking questions. Once you understand the concept, move to LangChain for custom implementations.

At IAcademy we cover RAG in depth: from concept to code implementation.

FAQ

Does RAG work well with documents in Spanish?

Yes, as long as you use a multilingual embedding model. text-embedding-3-small (OpenAI), embed-v3 (Cohere) and BGE-M3 support Spanish with good quality. Avoid models trained only in English.

How many documents can a RAG system handle?

There's no practical limit. ChromaDB handles hundreds of thousands of chunks. Qdrant and Pinecone scale to millions. Search latency stays in milliseconds even with millions of vectors (approximate search, HNSW).

Can RAG work with images or only text?

Multimodal RAG is possible with embedding models that support images (CLIP, Jina CLIP). You can index images, diagrams and screenshots. You can also use OCR to extract text from images before indexing.

Master RAG and other advanced techniques

The first 3 IAcademy modules are free. Advanced modules cover RAG, agents and fine-tuning.

Start free