Think of the difference between a student answering from memory (normal LLM) and one who can check their notes before answering (LLM with RAG). The second is more accurate, more up-to-date and can cite sources.
RAG in one sentence
RAG = search for the right information + generate a response with that information as context. It's the most used technique for making LLMs work with your private data without needing to retrain them.
Quick summary
RAG combines semantic search with text generation. Your documents are converted to embeddings and stored in a vector store; when you ask something, the system retrieves the relevant fragments and passes them to the LLM as context. For about 80% of cases it is cheaper and more flexible than fine-tuning.
How it works step by step
The RAG flow has three phases:
1. Indexing (once): your documents are split into fragments (chunks), converted to numerical vectors (embeddings) and stored in a vector database.
2. Retrieval (each query): when the user asks a question, that question is converted to a vector and the database is searched for the most similar fragments.
3. Generation (each query): the retrieved fragments are passed as context to the LLM along with the original question. The model generates a response based on that information.
# Simplified RAG flow
1. INDEXING (offline)
Documents → Chunking → Embeddings → Vector DB
2. QUERY (runtime)
User question → Embedding → Similarity search → Top-K chunks
3. GENERATION (runtime)
System prompt + Relevant chunks + Question → LLM → Response
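The three phases can be sketched end to end in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the three chunks are invented sample data, and the final prompt is only built, not sent to an LLM.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing (offline): embed each chunk once and keep vector and text together.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping to Spain takes 3 to 5 business days.",
    "Support is available by email from Monday to Friday.",
]
index = [(embed(chunk), chunk) for chunk in chunks]

# 2. Retrieval (each query): embed the question and rank chunks by similarity.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# 3. Generation (each query): build the prompt that would be sent to the LLM.
def build_prompt(question: str, context: list[str]) -> str:
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

question = "How many days does shipping take?"
prompt = build_prompt(question, retrieve(question))
```

A real system swaps `embed` for an embedding model and `index` for a vector database, but the shape of the flow stays the same.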
Embeddings explained simply
Embeddings are the heart of RAG. Imagine each text sentence becomes a point in a space with many dimensions (768, 1024, or up to 3072 dimensions). Sentences with similar meaning end up as nearby points.
Intuitive example:
- "The cat sleeps on the sofa" and "The feline rests on the couch" will be very close in vector space (same meaning).
- "The cat sleeps on the sofa" and "The economy grew 3% this quarter" will be far apart (different meanings).
This allows searching by meaning instead of exact words. If you ask "pet resting", it will find the fragment about the cat sleeping even though they share no words.
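"Nearby" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; a real model would produce hundreds or thousands of dimensions, and the numbers here are assumptions, not real model output.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical "embeddings", invented for this example.
vectors = {
    "The cat sleeps on the sofa": [0.90, 0.80, 0.10],
    "The feline rests on the couch": [0.85, 0.75, 0.15],
    "The economy grew 3% this quarter": [0.10, 0.05, 0.95],
}

cat = vectors["The cat sleeps on the sofa"]
feline = vectors["The feline rests on the couch"]
economy = vectors["The economy grew 3% this quarter"]

close = cosine_similarity(cat, feline)   # similar meaning: close to 1
far = cosine_similarity(cat, economy)    # unrelated meaning: much lower
```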
Popular embedding models in 2026:
- text-embedding-3-small (OpenAI): 1536 dimensions, good quality, low cost (0.02 USD per 1M tokens). The most widely used for its quality/price balance.
- text-embedding-3-large (OpenAI): 3072 dimensions, highest cloud quality. For when you need the best possible precision.
- embed-v3 (Cohere): Natively multilingual, excellent for Spanish. Free up to 100 calls/minute on the trial plan.
- BGE-M3 (BAAI, open source): Runs locally. Multilingual. Completely free if you have a GPU. The best sovereign option.
- nomic-embed-text (open source): Lightweight, runs on CPU. Perfect for rapid prototyping with Ollama.
Key consideration: the embedding model must be the same for indexing and searching. If you index with text-embedding-3-small, you must search with the same model. Changing models requires re-indexing all documents.
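One cheap way to enforce this is to store the model name with the index and check it on every query. A minimal sketch, assuming a hypothetical `VectorIndex` wrapper of your own (not a class from any library):

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Hypothetical index wrapper that remembers which model built it."""
    embedding_model: str
    vectors: list = field(default_factory=list)

    def check_query_model(self, query_model: str) -> None:
        """Raise if the query would be embedded with a different model."""
        if query_model != self.embedding_model:
            raise ValueError(
                f"Index was built with {self.embedding_model!r} but the query uses "
                f"{query_model!r}; re-index the documents before switching models."
            )

index = VectorIndex(embedding_model="text-embedding-3-small")
index.check_query_model("text-embedding-3-small")  # same model: passes silently
```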
Chunking strategies
How you split your documents into fragments is the most important decision in a RAG pipeline (and the most underrated). Bad chunking ruins everything else.
Fixed-size chunking
Splits every X characters with overlap. Simple but crude: it can cut sentences in half. Useful as a baseline.
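# Example: 500-character chunks with 100 characters of overlap
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100  # both measured in characters by default
)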
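To see what fixed-size chunking does without any library, here is a minimal pure-Python sketch. The function name `fixed_size_chunks` is hypothetical, and sizes are in characters, not tokens.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Cut text every (chunk_size - overlap) characters, ignoring sentence boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks

sample = "".join(str(i % 10) for i in range(1200))
pieces = fixed_size_chunks(sample, chunk_size=500, overlap=100)
```

Each chunk starts 400 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.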
Recursive chunking (most used)
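Tries to split on the largest natural boundary first (paragraphs), then falls back to smaller ones (lines, then words) and only cuts by characters as a last resort. Respects text structure. It's the splitter LangChain recommends for generic text and works well for 80% of documents.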
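The recursive idea can be sketched as follows. This is a simplified illustration, not LangChain's implementation: the separator list (paragraphs, sentences, words) is an assumption chosen for this example, there is no overlap, and the separator is dropped wherever a chunk boundary falls.

```python
def recursive_split(text: str, chunk_size: int = 200,
                    separators: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones when a piece is too big."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

doc = "Para one sentence one. Para one sentence two.\n\nPara two is short."
parts = recursive_split(doc, chunk_size=40)
```

The long first paragraph is split at the sentence boundary, while the short second paragraph stays whole.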
Semantic chunking
Uses embeddings to detect topic changes and splits there. More computationally expensive, but produces thematically coherent chunks. Ideal for long documents with multiple topics.
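A minimal sketch of the idea: compare each sentence with the previous one and start a new chunk when similarity drops below a threshold. The bag-of-words `embed` function and the 0.2 threshold are assumptions for illustration; a real pipeline would use an embedding model and tune the threshold on its own documents.

```python
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < threshold:
            groups.append([current])      # topic change detected
        else:
            groups[-1].append(current)    # same topic, extend the chunk
    return [" ".join(group) for group in groups]

sentences = [
    "The cat sleeps on the sofa.",
    "The cat likes the warm sofa.",
    "Inflation rose again this quarter.",
    "Analysts expect inflation to fall next quarter.",
]
topics = semantic_chunks(sentences)
```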
Structure-based chunking
Splits by headers (H1, H2, H3), Markdown sections, or document structure. Ideal for technical documentation, wikis and manuals that are already well-structured.
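For Markdown sources, splitting by headers can be sketched with a regular expression. The function name `split_by_headers` is hypothetical, and this simplified version would also treat a `#` line inside a fenced code block as a header, which a real parser should avoid.

```python
import re

HEADER = re.compile(r"^(#{1,3})\s+(.*)$")  # matches H1, H2 and H3 only

def split_by_headers(markdown: str) -> list[dict]:
    """Return one section per H1-H3 header, with its heading, level and body text."""
    sections = []
    heading, level, lines = None, 0, []

    def flush() -> None:
        text = "\n".join(lines).strip()
        if heading is not None or text:
            sections.append({"heading": heading, "level": level, "text": text})

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section before opening a new one
            heading, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections

doc = "# Install\nRun the installer.\n\n## Linux\nUse the package manager.\n\n## Windows\nUse the MSI."
sections = split_by_headers(doc)
```

Keeping the heading next to each chunk is useful later as metadata for filtering and for citations.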
Practical rule: start with 500-1000 token chunks with 10-20% overlap. If retrieval is poor, experiment with smaller sizes (200-500) for dense documents or larger (1000-2000) for narrative documents.
Vector store options
The vector database stores your embeddings and enables similarity search. The main options in 2026:
To get started (free/simple):
- ChromaDB: Embedded vector database. Installs with pip, runs in memory or on disk. Perfect for prototyping and small projects (fewer than 100K documents). No server, no configuration.
- pgvector (PostgreSQL): PostgreSQL extension for vectors. If you already use PostgreSQL (Supabase, RDS), adding vectors is trivial. Excellent for avoiding another database in your stack.
- SQLite-VSS: Vectors in SQLite. Ultra-lightweight, ideal for desktop applications or prototypes.
For production (scalable):
- Qdrant: Self-hosted or cloud. Fast, advanced filtering, clear REST API. Our choice for production. Free cloud plan with 1 GB of storage.
- Pinecone: Serverless, scales automatically. Generous free plan (100K vectors). The easiest option for production without managing infrastructure.
- Weaviate: Hybrid search (vector + keyword). Good if you need to combine semantic search with traditional search. Cloud and self-hosted.
- Milvus: The most scalable. For millions of vectors. More complex to operate. Used by large companies.
Our recommendation: ChromaDB for prototypes, pgvector if you already use PostgreSQL/Supabase, Qdrant for self-hosted production.
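Whatever product you choose, the core contract is the same: add vectors with text and metadata, then search by similarity with optional filters. The toy class below shows that contract with brute-force search; `InMemoryVectorStore` is a hypothetical name, and real vector databases replace the linear scan with approximate indexes such as HNSW.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

class InMemoryVectorStore:
    """Toy stand-in for a vector database: brute-force cosine search."""

    def __init__(self) -> None:
        self._rows = []  # each row is (vector, text, metadata)

    def add(self, vector: list[float], text: str, metadata: dict = None) -> None:
        self._rows.append((vector, text, metadata or {}))

    def search(self, query: list[float], k: int = 4, where: dict = None) -> list[tuple]:
        """Return the k most similar (score, text) pairs whose metadata matches `where`."""
        where = where or {}
        scored = [
            (cosine(query, vector), text)
            for vector, text, metadata in self._rows
            if all(metadata.get(key) == value for key, value in where.items())
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = InMemoryVectorStore()
store.add([1.0, 0.0], "Pricing table (2024)", {"year": 2024})
store.add([0.9, 0.1], "Pricing table (2026)", {"year": 2026})
store.add([0.0, 1.0], "Holiday policy", {"year": 2026})

top = store.search([1.0, 0.0], k=1)
top_2026 = store.search([1.0, 0.0], k=1, where={"year": 2026})
```

Without the filter the outdated 2024 chunk wins on similarity alone; with the metadata filter the current one is returned.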
RAG pipeline step by step
A complete production RAG pipeline has more components than the basic flow. Here's the realistic version:
# Complete RAG pipeline (production)
1. INGESTION
Documents → Parsing (PDF, DOCX, HTML) → Text cleaning
→ Metadata extraction (date, author, category)
2. PROCESSING
Clean text → Chunking (chosen strategy)
→ Embedding (batch processing)
→ Storage in vector DB + metadata
3. QUERY
Question → Query rewriting (optional, improves results)
→ Question embedding
→ Vector search (top-K, typically K=4-8)
→ Re-ranking (optional, reorders by relevance)
→ Metadata filtering (date, category, permissions)
4. GENERATION
System prompt + Retrieved context + Question
→ LLM → Response + Citations
5. POST-PROCESSING
Hallucination check → Formatting
→ Logging for evaluation
Query rewriting: before searching, you rephrase the question for better results. Example: "how long did it take?" gets rewritten as "delivery timelines of the previously mentioned project". A small LLM can do this for you.
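A sketch of how query rewriting might be wired in. Everything here is an assumption for illustration: the prompt wording, the function names, and `fake_llm`, which stands in for a call to whatever small model you use.

```python
from typing import Callable

REWRITE_TEMPLATE = (
    "Rewrite the user's last question as a standalone search query.\n"
    "Use the conversation to resolve pronouns and vague references.\n\n"
    "Conversation:\n{history}\n\n"
    "Last question: {question}\n"
    "Standalone query:"
)

def build_rewrite_prompt(history: list[str], question: str) -> str:
    """Fill the template with the conversation so far and the new question."""
    return REWRITE_TEMPLATE.format(history="\n".join(history), question=question)

def rewrite_query(history: list[str], question: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM for a standalone query; fall back to the original question."""
    rewritten = llm(build_rewrite_prompt(history, question)).strip()
    return rewritten or question

def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns a canned answer."""
    return "delivery timelines of the previously mentioned project"

history = ["User: Tell me about the website redesign project.", "Assistant: It launched in March."]
search_query = rewrite_query(history, "how long did it take?", fake_llm)
```

The rewritten query, not the raw question, is what gets embedded and sent to the vector search.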
Re-ranking: vector search results aren't always in optimal order. A re-ranking model (like Cohere Rerank or BGE Reranker) reorders results by actual relevance to the question. Significantly improves quality.
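Re-ranking is a second pass over the candidates with a more precise scorer. In the sketch below, `term_overlap` is a deliberately crude stand-in for a real re-ranking model, and the candidate passages are invented; in practice the scorer would be a call to a re-ranker.

```python
import re
from typing import Callable

def term_overlap(question: str, passage: str) -> float:
    """Crude relevance score: fraction of question words that appear in the passage."""
    q_terms = set(re.findall(r"[a-z0-9]+", question.lower()))
    p_terms = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0

def rerank(question: str, candidates: list[str],
           scorer: Callable[[str, str], float] = term_overlap, top_n: int = 3) -> list[str]:
    """Reorder vector-search candidates by the scorer and keep the best top_n."""
    return sorted(candidates, key=lambda c: scorer(question, c), reverse=True)[:top_n]

# Candidates in the order a vector search might have returned them.
candidates = [
    "Our offices are closed on public holidays.",
    "The warranty covers battery defects for two years.",
    "Battery life depends on screen brightness.",
]
best = rerank("How long is the battery warranty?", candidates, top_n=2)
```

A common pattern is to retrieve more candidates than you need (for example 20), re-rank them, and pass only the top few to the LLM.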
When to use RAG
Use RAG when: you need answers based on specific documents (manuals, contracts, knowledge base), data changes frequently (news, prices, inventory), you need to cite sources, or you want to reduce model hallucinations.
Don't use RAG when: the task is purely creative (writing fiction), the needed knowledge is general and stable (grammar, basic math), or you have very few documents (with fewer than 10 pages, it is better to pass them directly in the prompt).
RAG vs fine-tuning: how to decide
The most frequent question. To understand when to choose each, read our guide on fine-tuning: when you need it and when you don't. Here's the decision broken down:
Choose RAG when:
- Data changes frequently (weekly or more often)
- You need answers with citable sources
- You have a large knowledge base (100+ documents)
- You want results fast (implementation in hours, not weeks)
- Budget is limited (you don't want to pay for training GPUs)
- You need multi-tenancy (each user sees only their documents)
Choose fine-tuning when:
- You need a very specific response style or format
- The domain is so specialized that the base model doesn't understand the jargon
- You want to reduce latency (no retrieval step)
- You have high-quality training data (thousands of question-answer pairs)
- The knowledge is stable and doesn't change often
The hybrid option (best of both): Fine-tune for base style and domain, RAG for specific and updated data. It's the architecture we use in production for our GRC module: the model understands cybersecurity jargon (fine-tuning), and retrieves client-specific regulations (RAG).
Basic architecture
The components of a RAG system are:
Document loader: reads PDFs, Word, HTML, Markdown, CSVs. Tools: LangChain loaders, Unstructured, LlamaIndex.
Chunker: splits documents into 200-1000 token fragments. Strategies: by paragraphs, by overlap, semantic.
Embedding model: converts text to vectors. Models: text-embedding-3-small (OpenAI), embed-v3 (Cohere), BGE (open source). To understand embeddings, read about vector databases.
Vector database: stores and searches vectors. Options: Qdrant, Pinecone, Weaviate, ChromaDB, pgvector.
LLM: generates the final response. Any LLM works: GPT-4o, Claude Sonnet, Llama, Qwen. If you don't know which to choose, check which LLM to choose.
Tools for implementing RAG
No-code: ChatGPT with file attachments (implicit RAG), Claude Projects (upload documents), Google NotebookLM (automatic RAG on your sources).
Low-code: n8n with vector store nodes, Flowise (drag & drop RAG), Dify (visual RAG platform).
With code: LangChain (Python/JS, most popular), LlamaIndex (RAG-specialized), Haystack (pipeline-based). To start with code, check the LangChain tutorial.
Evaluation metrics
A RAG system without metrics is one you can't tell is working. The key metrics:
Retrieval metrics (the searcher):
- Recall@K: Of all the relevant documents, how many appear in your top-K results. If there are 3 relevant chunks and 2 of them appear in your top 4, recall@4 is 2/3 ≈ 67%.
- Precision@K: Of the K retrieved documents, how many are actually relevant. If you retrieve 4 chunks and 2 are relevant, precision@4 is 50%.
- MRR (Mean Reciprocal Rank): The position at which the first relevant result appears, scored as 1/rank and averaged over all queries. The higher up it appears, the better.
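These three metrics are simple enough to compute by hand. A minimal sketch, assuming you have the IDs of the retrieved chunks in ranked order and a hand-labeled set of relevant IDs per query (the chunk IDs below are made up):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant chunks that appear in the top k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k if k else 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result, over all queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1 / rank
                break  # only the first relevant result counts
    return total / len(runs) if runs else 0.0

# 3 relevant chunks exist; 2 of them show up in the top 4.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

recall = recall_at_k(retrieved, relevant, k=4)        # 2 of 3 relevant found
precision = precision_at_k(retrieved, relevant, k=4)  # 2 of 4 results relevant
mrr = mean_reciprocal_rank([(retrieved, relevant)])   # first hit at rank 2
```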
Generation metrics (the response):
- Faithfulness: Is the response based on the retrieved context? Or does it invent information? A RAG with low faithfulness has hallucinations.
- Relevance: Does the response actually answer the user's question?
- Completeness: Does the response cover all available information in the retrieved chunks?
Recommended evaluation framework: RAGAS (ragas.io). An open source framework that automatically calculates faithfulness, answer relevancy, context precision and context recall. Integrates with LangChain and LlamaIndex.
# Evaluation with RAGAS (example)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# `dataset` must be built beforehand: questions + answers + retrieved contexts
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # Scores from 0 to 1 per metric
Common mistakes and how to avoid them
Chunks too small: lose context. A paragraph cut in half makes no sense. Solution: use 10-20% overlap and recursive chunking that respects paragraph boundaries.
Chunks too large: dilute relevance. The model receives lots of text but little useful content. Solution: if your chunks exceed 1000 tokens, they're probably too large for most queries.
Not evaluating retrieval: if retrieved chunks aren't relevant, the response will never be good. Measure search precision before blaming the LLM. Solution: log retrieved chunks and manually review them for your first 50 queries.
Ignoring pre-processing: scanned PDFs, complex tables, images with text. If the document doesn't parse well, RAG fails. Solution: use Unstructured.io or OCR services before chunking.
Not using metadata: retrieving chunks without filtering by date, author or category mixes obsolete or irrelevant information. Solution: always store metadata alongside the embedding and filter when appropriate.
K too high: retrieving 20 chunks saturates the LLM context and dilutes the signal. Solution: start with K=4, increase to K=6-8 only if you need more coverage. More than 10 rarely improves results.
Next step
If you've never implemented RAG, start with Claude Projects or NotebookLM (zero code). Upload 5-10 documents and try asking questions. Once you understand the concept, move to LangChain for custom implementations.
At IAcademy we cover RAG in depth: from concept to code implementation.
FAQ
Does RAG work well with documents in Spanish?
Yes, as long as you use a multilingual embedding model. text-embedding-3-small (OpenAI), embed-v3 (Cohere) and BGE-M3 support Spanish with good quality. Avoid models trained only in English.
How many documents can a RAG system handle?
There's no practical limit. ChromaDB handles hundreds of thousands of chunks. Qdrant and Pinecone scale to millions. Search latency stays in milliseconds even with millions of vectors (approximate search, HNSW).
Can RAG work with images or only text?
Multimodal RAG is possible with embedding models that support images (CLIP, Jina CLIP). You can index images, diagrams and screenshots. You can also use OCR to extract text from images before indexing.