RAG - Beyond Basics

Rainer Stropek | time cockpit

Rainer Stropek

  • Passionate software developer for 30+ years
    • time cockpit
       
  • Microsoft MVP, Regional Director
     
  • Trainer, Teacher, Mentor
     
  • 💕 community

Data Source

Unstructured, semi-structured, structured

Vector DB

Embedding vectors,
backlinks, metadata

Bot

Takes question,
performs retrieval,
builds prompt,
sends to LLM,
answers the user

LLM

Receives prompt with embedded content from retrieval, generates answer

User
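
The flow above (question, retrieval, prompt, LLM, answer) can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm` is a placeholder for a real LLM call, and the chunk texts and source names are made up.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Vector DB": embedding vector, backlink to the source, and the chunk text.
CHUNKS = [
    {"source": "handbook.md#vacation", "text": "Employees get 25 vacation days per year"},
    {"source": "handbook.md#travel", "text": "Travel expenses are reimbursed within 30 days"},
]
INDEX = [{**c, "vector": embed(c["text"])} for c in CHUNKS]

def retrieve(question: str, k: int = 1) -> list[dict]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(INDEX, key=lambda e: cosine(q, e["vector"]), reverse=True)[:k]

def build_prompt(question: str, hits: list[dict]) -> str:
    """Embed the retrieved content, including backlinks, in the prompt."""
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; it only reports the prompt size."""
    return f"(answer generated from a {len(prompt)}-character prompt)"

def answer(question: str) -> str:
    return fake_llm(build_prompt(question, retrieve(question)))
```

Swapping `embed` and `fake_llm` for real services leaves the structure of the bot unchanged.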

Make or Buy?

  • Office Suites
    • E.g. Microsoft Copilot
  • Built-in RAG by LLM providers
    • E.g. OpenAI File Search Tool 🔗
  • Platforms
    • E.g. LlamaIndex 🔗
  • How to choose?
    • If core differentiator ➡️ Make
    • If commodity functionality ➡️ Buy
    • If need specific customization ➡️ Make or hybrid

Document Processing

Garbage in, garbage out 😅

General Considerations

  • Cleanup!
  • How to deal with document versions?
  • Governance hurdles
  • Define use cases, involve end users
  • Know your documents
  • Set realistic expectations
  • Consider permission requirements from day one
  • Special case: Extract structured data from documents

Parsing

  • Source format might not be text
    • E.g. images, PDF, docx, xlsx, etc.
    • Typically converted to Markdown
  • Many options
    • Use LLMs
    • Use specialized online services
      E.g. LlamaParse, Azure Document Intelligence, Mistral OCR
    • Use libraries/packages
      E.g. MarkItDown (MS), docling
    • Use ETL platforms specialized in AI (e.g. unstructured.io)
  • How to choose?
    • Special requirements (features, security, governance, license form)?
    • Know your documents (content, quality, structure, etc.)
    • Test with real-world documents
    • Involve your end users

Splitting

  • Different document types require different chunking strategies
    Examples:
    • Legal contracts
    • Technical documents
    • Conversational transcripts
    • Code
  • Text splitters
    • Simple: Length-based (tokens, characters)
    • Usually: Document-aware splitters; examples:
      • LangChain Text Splitters 🔗
      • LlamaIndex Markdown Parser
    • Advanced: Semantic splitting
    • See also 5 Levels Of Text Splitting

Keeping Context

  • Challenge
    • Chunks might lose important context about parent document
  • Possible Solutions
    • Prepend document metadata to each chunk
    • Use a context window that includes neighboring chunks
    • Attach summaries to chunks
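
Two of the solutions above are easy to sketch: prepending document metadata to each chunk, and returning neighboring chunks together with a hit. The function names and the metadata fields (`title`, `version`) are illustrative assumptions.

```python
def contextualize(chunks: list[str], title: str, version: str) -> list[str]:
    """Prepend document metadata to every chunk before embedding it."""
    return [f"Document: {title} (version {version})\n{c}" for c in chunks]

def with_neighbors(chunks: list[str], hit: int, window: int = 1) -> str:
    """Return the hit chunk together with `window` neighbors on each side."""
    start = max(0, hit - window)
    return "\n".join(chunks[start:hit + window + 1])
```

The first function changes what gets embedded; the second only changes what is put into the prompt after retrieval.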

Multi-Modal Content

  • Challenge
    • Documents with embedded images, charts, tables
  • Specialized libraries/services available

Vectors

Embedding Model Selection

  • Challenge
    • Different models have different strengths
    • Switching models later is difficult because re-indexing is expensive
  • How to choose?
    • Start with general-purpose (e.g. OpenAI text-embedding models)
    • SentenceTransformers 🔗
    • Massive Text Embeddings Benchmark (MTEB) leaderboard 🔗
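
Public leaderboards are a starting point; a small labeled test set built from your own documents can complement them. The sketch below computes recall@k for any embedding function. `recall_at_k` and the two toy embedders are assumptions for illustration; in practice `embed` would wrap a real model.

```python
import math
from typing import Callable

Vector = list[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_at_k(embed: Callable[[str], Vector], docs: dict[str, str],
                labeled_queries: list[tuple[str, str]], k: int = 1) -> float:
    """Share of queries whose expected document is among the top k results."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, expected_id in labeled_queries:
        q = embed(query)
        ranked = sorted(doc_vectors, key=lambda d: cosine(q, doc_vectors[d]), reverse=True)
        hits += expected_id in ranked[:k]
    return hits / len(labeled_queries)

# Two toy "models" to compare (stand-ins for real embedding models).
VOCAB = ["invoice", "contract", "vacation", "travel"]

def keyword_embed(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def constant_embed(text: str) -> Vector:
    return [1.0]  # deliberately useless: every text looks the same
```

Running the same labeled queries against each candidate model gives a comparable number before committing to an expensive full index.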

Choosing a Vector DB

  • Examples (open and closed source)
    • Specialized:
      Qdrant, Azure AI Search, Pinecone, Weaviate, Milvus
    • Unified data storage:
      PostgreSQL (e.g. pgvector), SQL, MongoDB, Cosmos DB, etc.
  • Location
    • On-premises/cloud
    • Cloud only
  • How to choose?
    • Specific requirements (features, scalability, auth, etc.)?
      Look for specialized solutions
    • Already using a DB? It probably already supports vector search
    • What is available in your private/public cloud?

Beyond Vectors

  • Content of Vector DB
    • Embedding vectors
    • Backlinks to original documents/fragments (e.g. PKs)
    • Metadata (keyword search, faceted search, security filter)
  • Metadata for document-level permissions
    • E.g. Security Filter Pattern in Azure AI Search 🔗
  • Challenge: Use permissions from source system
    • Use standard component
      E.g. Permissions-aware content retrieval with SharePoint and LlamaCloud 🔗
    • Implement custom solutions
    • Don't do it 😅
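
As a sketch of what a single index entry might hold, the example below stores the vector, a backlink, and metadata, and applies a metadata filter before ranking. `IndexEntry`, `search`, and the field names are assumptions for illustration; real vector databases expose this through their own filter syntax.

```python
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    vector: list[float]
    source_id: str                      # backlink: key of the original document/fragment
    metadata: dict = field(default_factory=dict)

def search(entries: list[IndexEntry], query_vector: list[float],
           filters: dict, k: int = 3) -> list[IndexEntry]:
    """Filter on metadata first, then rank the remaining entries by dot product."""
    def matches(e: IndexEntry) -> bool:
        return all(e.metadata.get(key) == value for key, value in filters.items())

    def score(e: IndexEntry) -> float:
        return sum(q * v for q, v in zip(query_vector, e.vector))

    candidates = [e for e in entries if matches(e)]
    return sorted(candidates, key=score, reverse=True)[:k]
```

The same filter mechanism can carry facets (document type, year) and permission information.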

Queries

  • LLMs cannot magically answer everything!
    • Work on real-world use cases
    • Involve your end users
  • Detail queries
    • Look for a specific answer in a single document
    • Examples
      • Search answer in knowledge base
      • Who signed contract X?
      • Who was present at meeting Y?
  • Aggregations
    • Count, get aggregated data
    • Examples
      • How many contracts do we have with vendor X?
      • What is the total worth of all our support contracts?
    • Pre-compute statistics, use structured extraction
    • Consider agent-based retrieval patterns
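
Aggregation questions are usually better answered from structured data than from similarity search. The sketch below assumes contract facts were already extracted into a table during document processing; the table layout, vendor names, and values are invented for the example.

```python
import sqlite3

# Structured data extracted from documents at indexing time (sample rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (doc_id TEXT PRIMARY KEY, vendor TEXT, kind TEXT, value_eur REAL)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [
        ("c-001", "Contoso", "support", 12000.0),
        ("c-002", "Contoso", "license", 5000.0),
        ("c-003", "Fabrikam", "support", 8000.0),
    ],
)

def count_contracts(vendor: str) -> int:
    """How many contracts do we have with a given vendor?"""
    row = conn.execute("SELECT COUNT(*) FROM contracts WHERE vendor = ?", (vendor,)).fetchone()
    return row[0]

def total_value(kind: str) -> float:
    """What is the total worth of all contracts of a given kind?"""
    row = conn.execute(
        "SELECT COALESCE(SUM(value_eur), 0) FROM contracts WHERE kind = ?", (kind,)
    ).fetchone()
    return row[0]
```

The `doc_id` column keeps the backlink to the source document, so an answer can still cite where each number came from.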

Enhance Retrieval Results

  • Rewrite queries with LLM; examples:
    • Query expansion with synonyms/related terms
    • Multi-query generation for different interpretations
  • HyDE (Hypothetical Document Embeddings)
    • Let the LLM "imagine" a hypothetical document that answers the query
    • Embed this hypothetical document and use its vector for retrieval
  • Dense retrieval with re-ranking
    • Dense retrieval: Fast retrieval based on embedding vectors
    • Re-ranking: Detailed analysis of documents, find few best fitting docs
  • Graph-based document retrieval
    • Store relationships between documents
    • Explore "neighbors" of retrieved docs via graph
  • Consider existing platforms like LlamaIndex for complex queries
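
The two-stage idea behind retrieval with re-ranking can be sketched as follows. Both scoring functions are toy stand-ins: `fast_score` plays the role of vector similarity, `careful_score` plays the role of a slower re-ranking model. All names are assumptions for this example.

```python
def tokens(text: str) -> list[str]:
    return text.lower().split()

def fast_score(query: str, doc: str) -> int:
    """Stage 1 stand-in for vector similarity: number of shared words."""
    return len(set(tokens(query)) & set(tokens(doc)))

def careful_score(query: str, doc: str) -> int:
    """Stage 2 stand-in for a re-ranking model: shared word pairs in the same order."""
    q, d = tokens(query), tokens(doc)
    return len(set(zip(q, q[1:])) & set(zip(d, d[1:])))

def retrieve_and_rerank(query: str, docs: list[str],
                        candidates: int = 3, final: int = 1) -> list[str]:
    # Stage 1: cheap scoring over all documents, keep a shortlist.
    shortlist = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:candidates]
    # Stage 2: expensive scoring only on the shortlist.
    return sorted(shortlist, key=lambda d: careful_score(query, d), reverse=True)[:final]
```

The expensive scorer only ever sees `candidates` documents, which is what keeps the approach affordable on large collections.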

Document-Level Authorization

  • Different users have access to different documents
  • Make or buy?
  • Metadata filtering at retrieval time
    • Row-level security in vector database
    • Import permissions from source systems?
  • Separate indexes per access level
    • Suitable if only a few access levels
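
Metadata filtering at retrieval time can be sketched as a group check that runs before ranking, so documents a user may not see never reach the prompt. The `allowed_groups` field and the function names are assumptions; how group memberships are imported from source systems is left out here.

```python
from typing import Callable

def is_visible(entry: dict, user_groups: set[str]) -> bool:
    """A document is visible if the user shares at least one group with it."""
    return bool(user_groups & set(entry["allowed_groups"]))

def secure_search(index: list[dict], score: Callable[[dict], float],
                  user_groups: set[str], k: int = 3) -> list[dict]:
    # Security trimming happens BEFORE ranking and prompt building.
    visible = [e for e in index if is_visible(e, user_groups)]
    return sorted(visible, key=score, reverse=True)[:k]
```

Filtering after the LLM has seen the content would be too late, because the answer could already leak restricted information.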

Agentic RAG

Retrieval as Function Tool

  • Retrieval step implemented as a function tool
    • Replaces embedding documents in prompts
    • Agent decides about tool use autonomously
  • Different tools for different use cases
    • Aggregation vs. detail queries
    • Larger scope: Specialized agents
  • Make tools widely available via MCP (Model Context Protocol) servers
    • Potentially no need for custom UI
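
A rough sketch of retrieval as function tools: each tool has a description and a JSON-schema-style parameter definition that would be handed to the LLM, plus a handler that runs when the agent decides to call it. The tool names, the sample data, and `handle_tool_call` are assumptions for illustration and do not follow any specific vendor API.

```python
import json

# Sample data standing in for the vector DB and the structured store.
CORPUS = {"c-001": "Support contract with Contoso", "m-007": "Minutes of meeting Y"}
CONTRACT_VENDORS = ["Contoso", "Contoso", "Fabrikam"]

def search_documents(query: str) -> list[str]:
    """Detail queries: placeholder for real vector retrieval."""
    return [doc_id for doc_id, text in CORPUS.items() if query.lower() in text.lower()]

def count_contracts(vendor: str) -> int:
    """Aggregation queries: answered from structured data."""
    return CONTRACT_VENDORS.count(vendor)

TOOLS = {
    "search_documents": {
        "description": "Find documents that answer a specific question.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
        "handler": search_documents,
    },
    "count_contracts": {
        "description": "Count contracts with a given vendor.",
        "parameters": {"type": "object", "properties": {"vendor": {"type": "string"}},
                       "required": ["vendor"]},
        "handler": count_contracts,
    },
}

def handle_tool_call(name: str, arguments_json: str) -> str:
    """Execute a tool call requested by the model and return a JSON result."""
    tool = TOOLS.get(name)
    if tool is None:
        return json.dumps({"error": f"unknown tool {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"result": tool["handler"](**args)})
```

The agent loop itself (sending tool definitions to the LLM and feeding results back) depends on the provider and is omitted here.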

Q&A
