RAG - Beyond Basics

Rainer Stropek | time cockpit

Rainer Stropek

Passionate software developer for 30+ years
- time cockpit
Microsoft MVP, Regional Director
Trainer, Teacher, Mentor
💕 community

https://rainerstropek.me

Data Source

Unstructured, semi-structured, structured

Vector DB

Embedding vectors,
backlinks, metadata

Bot

Takes question,
performs retrieval,
builds prompt,
sends to LLM,
returns answer to user

LLM

Receives prompt with embedded content from retrieval, generates answer

User
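
The flow between the boxes above can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm` is a placeholder for an actual LLM call, and the two index entries are made-up sample data.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Vector DB": each entry holds a backlink (doc_id), the text, and its vector.
INDEX = [
    {"doc_id": "contract-17", "text": "The support contract with Contoso was signed by Jane Doe."},
    {"doc_id": "minutes-03", "text": "Meeting minutes: budget planning for next year."},
]
for entry in INDEX:
    entry["vector"] = embed(entry["text"])

def retrieve(question: str, k: int = 1) -> list[dict]:
    """Return the k entries most similar to the question."""
    q = embed(question)
    return sorted(INDEX, key=lambda e: cosine(q, e["vector"]), reverse=True)[:k]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Embed the retrieved content in the prompt, keeping the backlinks visible."""
    context = "\n".join(f"[{c['doc_id']}] {c['text']}" for c in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

def fake_llm(prompt: str) -> str:
    """Placeholder for the LLM call (hypothetical; replace with a real client)."""
    return f"(LLM answer based on {len(prompt)} prompt characters)"

def answer(question: str) -> str:
    """The bot: take question, retrieve, build prompt, send to LLM, return answer."""
    return fake_llm(build_prompt(question, retrieve(question)))
```

Calling `answer("Who signed the support contract?")` retrieves the contract chunk, places it in the prompt, and returns whatever the LLM placeholder produces.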

Make or Buy?

Office Suites
- E.g. Microsoft Copilot
Built-in RAG by LLM providers
- E.g. OpenAI File Search Tool 🔗
Platforms
- E.g. LlamaIndex 🔗
How to choose?
- If core differentiator ➡️ Make
- If commodity functionality ➡️ Buy
- If need specific customization ➡️ Make or hybrid

Document Processing

Garbage in, garbage out 😅

General Considerations

Cleanup!
How to deal with document versions?
Governance hurdles
Define use cases, involve end users
Know your documents
Set realistic expectations
Consider permission requirements from day one
Special case: Extract structured data from documents

Parsing

Source format might not be text
- E.g. images, PDF, docx, xlsx, etc.
- Typically converted to Markdown
Many options
- Using LLMs
- Use specialized online services
  E.g. LlamaParse, Azure Document Intelligence, Mistral OCR
- Use libraries/packages
  E.g. MarkItDown (MS), docling
- Use ETL platforms specialized in AI (e.g. unstructured.io)
How to choose?
- Special requirements (features, security, governance, license form)?
- Know your documents (content, quality, structure, etc.)
- Test with real-world documents
- Involve your end users

Splitting

Different document types require different chunking strategies
Examples:
- Legal contracts
- Technical documents
- Conversational transcripts
- Code
Text splitters
- Simple: Length-based (tokens, characters)
- Usually: Document-aware splitters; examples:
  - LangChain Text Splitters 🔗
  - LlamaIndex Markdown Parser
- Advanced: Semantic splitting
- See also 5 Levels Of Text Splitting
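
A document-aware splitter can be approximated without any library. The sketch below is an assumption-laden illustration (it is not the LangChain or LlamaIndex implementation): it splits Markdown at headings and falls back to plain length-based splitting when a section exceeds `max_chars`.

```python
import re

def split_markdown(md: str, max_chars: int = 200) -> list[dict]:
    """Split at Markdown headings; cut overlong sections by character length."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered section, sliced into pieces of at most max_chars.
        text = "\n".join(buffer).strip()
        for start in range(0, len(text), max_chars):
            chunks.append({"heading": heading, "text": text[start:start + max_chars]})
        buffer.clear()

    for line in md.splitlines():
        if re.match(r"#{1,6} ", line):
            flush()                      # close the previous section
            heading = line.lstrip("#").strip()
        else:
            buffer.append(line)
    flush()                              # close the last section
    return chunks
```

Each chunk keeps the heading it belongs to, which is already a simple form of context preservation. Real splitters usually cut at sentence or paragraph boundaries and add overlap instead of slicing at fixed character offsets.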

Keeping Context

Challenge
- Chunks might lose important context about parent document
Possible Solutions
- Prepend document metadata to each chunk
- Use a context window that includes neighboring chunks
- Attach summaries to chunks
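
Two of these solutions fit in one small function. The following sketch assumes the chunks of a document are available as an ordered list; the field names `embed_text` and `prompt_text` are made up for illustration.

```python
def contextualize(chunks: list[str], title: str, window: int = 1) -> list[dict]:
    """Prepend document metadata and attach neighboring chunks to each chunk."""
    result = []
    for i, chunk in enumerate(chunks):
        lo = max(0, i - window)
        hi = min(len(chunks), i + window + 1)
        result.append({
            # Text that gets embedded: chunk plus parent document metadata.
            "embed_text": f"Document: {title}\n{chunk}",
            # Text that goes into the prompt: chunk plus its neighbors.
            "prompt_text": "\n".join(chunks[lo:hi]),
        })
    return result
```

Embedding the small chunk but handing the larger window to the LLM keeps retrieval precise while still giving the model enough surrounding text.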

Multi-Modal Content

Challenge
- Documents with embedded images, charts, tables
Specialized libraries/services available

Vectors

Embedding Model Selection

Challenge
- Different models have different strengths
- Change is difficult as re-indexing is expensive
How to choose?
- Start with general-purpose (e.g. OpenAI text-embedding models)
- SentenceTransformers 🔗
- Massive Text Embeddings Benchmark (MTEB) leaderboard 🔗
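
Leaderboards only go so far; a small labeled set of real questions makes candidate models comparable on your own documents. The sketch below measures a hit rate at k. The `keyword_retrieve` function and the sample documents are hypothetical stand-ins; in practice you would plug in one retriever per embedding model under test.

```python
from typing import Callable

def hit_rate_at_k(retrieve: Callable[[str, int], list[str]],
                  labeled: list[tuple[str, str]], k: int = 3) -> float:
    """Share of questions whose expected document appears in the top-k results."""
    hits = sum(expected in retrieve(question, k) for question, expected in labeled)
    return hits / len(labeled)

# Made-up sample corpus and a trivial keyword retriever used as a stand-in.
DOCS = {
    "vacation-policy": "employees receive 25 vacation days per year",
    "expense-policy": "travel expenses are reimbursed within 30 days",
}

def keyword_retrieve(question: str, k: int) -> list[str]:
    """Rank documents by the number of words shared with the question."""
    words = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(words & set(DOCS[d].split())), reverse=True)
    return ranked[:k]

LABELED = [
    ("how many vacation days do employees get", "vacation-policy"),
    ("when are travel expenses reimbursed", "expense-policy"),
]
```

Running `hit_rate_at_k(keyword_retrieve, LABELED, k=1)` gives one number per retriever, which makes the choice of model an experiment rather than a guess before committing to an expensive indexing run.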

Choosing a Vector DB

Examples (open and closed source)
- Specialized:
  Qdrant, Azure AI Search, Pinecone, Weaviate, Milvus
- Unified data storage:
  PostgreSQL (e.g. pgvector), SQL, MongoDB, Cosmos DB, etc.
Location
- On-premises/cloud
- Cloud only
How to choose?
- Specific requirements (features, scalability, auth, etc.)?
  Look for specialized solutions
- Already using a DB? It probably already supports vector search
- What is available in your private/public cloud?

Beyond Vectors

Content of Vector DB
- Embedding vectors
- Backlinks to original documents/fragments (e.g. PKs)
- Metadata (keyword search, faceted search, security filter)
Metadata for document-level permissions
- E.g. Security Filter Pattern in Azure AI Search 🔗
Challenge: Use permissions from source system
- Use standard component
  E.g. Permissions-aware content retrieval with SharePoint and LlamaCloud 🔗
- Implement custom solutions
- Don't do it 😅

Queries

LLMs cannot magically answer everything!
- Work on real-world use cases
- Involve your end users
Detail queries
- Look for a specific answer in a single document
- Examples
  - Search answer in knowledge base
  - Who signed contract X?
  - Who was present at meeting Y?
Aggregations
- Count, get aggregated data
- Examples
  - How many contracts do we have with vendor X?
  - What is the total worth of all our support contracts?
- Pre-compute statistics, use structured extraction
- Consider agent-based retrieval patterns
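
Aggregations are a poor fit for similarity search, because the answer is spread over all documents. One way to pre-compute them, sketched here under the assumption that a structured-extraction step has already produced one record per contract at indexing time (the table layout and sample rows are made up):

```python
import sqlite3

# Records produced by a (hypothetical) structured-extraction step during indexing.
EXTRACTED = [
    ("c-1", "Contoso", "support", 12000.0),
    ("c-2", "Contoso", "hosting", 5000.0),
    ("c-3", "Fabrikam", "support", 8000.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (doc_id TEXT PRIMARY KEY, vendor TEXT, kind TEXT, value REAL)"
)
conn.executemany("INSERT INTO contracts VALUES (?, ?, ?, ?)", EXTRACTED)

def contracts_with(vendor: str) -> int:
    """How many contracts do we have with vendor X?"""
    row = conn.execute("SELECT COUNT(*) FROM contracts WHERE vendor = ?", (vendor,)).fetchone()
    return row[0]

def total_support_value() -> float:
    """What is the total worth of all our support contracts?"""
    row = conn.execute(
        "SELECT COALESCE(SUM(value), 0) FROM contracts WHERE kind = 'support'"
    ).fetchone()
    return row[0]
```

The LLM then only has to pick the right query and phrase the result; the counting itself is done by the database, which is exact and cheap.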

Enhance Retrieval Results

Rewrite queries with LLM; examples:
- Query expansion with synonyms/related terms
- Multi-query generation for different interpretations
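HyDE (Hypothetical Document Embeddings)
- Let the LLM "imagine" a hypothetical answer document for the user's query
- Embed that hypothetical document and use its vector for the search
- Related index-side variant: generate the queries users might type for each chunk and embed them with the chunk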
Dense retrieval with re-ranking
- Dense retrieval: Fast retrieval based on embedding vectors
- Re-ranking: Detailed analysis of the retrieved documents, keep only the few best-fitting docs
Graph-based document retrieval
- Store relationships between documents
- Explore "neighbors" of retrieved docs via graph
Consider existing platforms like LlamaIndex for complex queries
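
When multi-query generation produces several result lists, they have to be merged into one ranking. The slides do not name a merging method; the sketch below uses reciprocal rank fusion (RRF), a common choice, with the conventional constant `k = 60`. The document ids are placeholders.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over all lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# One ranked result list per rewritten query (made-up ids).
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

Documents that rank well across several query variants rise to the top. The fused list can then be handed to a re-ranker, which only needs to inspect the first few entries.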

Document-Level Authorization

Different users have access to different documents
Make or buy?
- E.g. Oso 🔗
Metadata filtering at retrieval time
- Row-level security in vector database
- Import permissions from source systems?
Separate indexes per access level
- Suitable if only a few access levels
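
Metadata filtering at retrieval time can be sketched as a pre-filter on group membership. This is an illustration only: the `allowed_groups` field and the sample chunks are assumptions, and a real vector DB would apply the filter inside the query rather than in application code.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    text: str
    # Groups allowed to read the source document (copied from the source system).
    allowed_groups: set[str] = field(default_factory=set)

def authorized(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]

CHUNKS = [
    Chunk("handbook", "Office opening hours and house rules.", {"everyone"}),
    Chunk("salaries", "Salary bands for all job levels.", {"hr"}),
]
```

The filter has to run before similarity ranking and before anything reaches the prompt. Filtering the LLM's answer afterwards is too late, because restricted content has already been exposed to the model.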

Agentic RAG

Retrieval as Function Tool

Retrieval step implemented as a function tool
- Replaces embedding documents in prompts
- Agent decides about tool use autonomously
Different tools for different use cases
- Aggregation vs. detail queries
- Larger scope: Specialized agents
Make tools widely available via MCP (Model Context Protocol) servers
- Potentially no need for custom UI
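
The idea of exposing retrieval as tools can be sketched as a small registry plus a dispatcher. The tool names, the sample data, and the shape of `tool_call` are assumptions; the parameter descriptions loosely follow the JSON Schema style used by common function-calling APIs, and the exact format depends on the LLM provider.

```python
import json

# Made-up data standing in for the vector DB content.
CHUNKS = [
    {"doc_id": "contract-17", "text": "Support contract with Contoso, signed by Jane Doe."},
    {"doc_id": "contract-18", "text": "Hosting contract with Fabrikam."},
]

def search_documents(query: str) -> list[dict]:
    """Detail queries: return chunks sharing at least one word with the query."""
    words = set(query.lower().split())
    def tokens(text: str) -> set[str]:
        return set(text.lower().replace(",", " ").replace(".", " ").split())
    return [c for c in CHUNKS if words & tokens(c["text"])]

def count_contracts(vendor: str) -> int:
    """Aggregation queries: count contracts that mention the vendor."""
    return sum(vendor.lower() in c["text"].lower() for c in CHUNKS)

TOOLS = {
    "search_documents": {
        "function": search_documents,
        "description": "Find passages that answer a question about a single document.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
    "count_contracts": {
        "function": count_contracts,
        "description": "Count contracts with a given vendor.",
        "parameters": {"type": "object",
                       "properties": {"vendor": {"type": "string"}},
                       "required": ["vendor"]},
    },
}

def dispatch(tool_call: dict) -> str:
    """Run the tool the model asked for; return JSON for the next LLM turn."""
    tool = TOOLS[tool_call["name"]]
    arguments = json.loads(tool_call["arguments"])
    return json.dumps(tool["function"](**arguments))
```

The descriptions are what the agent reads when deciding which tool to call, so separating detail search from aggregation in the tool list directly steers that decision.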

Q&A
