OwusuBlessing f0f4f0d376 first commit
2025-07-25 21:33:37 +01:00

ML Engineer Assessment: DocuChat - RAG Q&A System

Scenario

You are tasked with building DocuChat, an Enterprise Knowledge Assistant for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.

Document Sources Provided

Option 1: Data Science Documentation Collection

  • Sources:
    • Python Data Science Handbook (full text in Jupyter Notebooks)
    • Scikit-learn documentation PDFs
    • NumPy, Matplotlib and Pandas tutorials
  • Formats: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
  • Domain: Data science, machine learning, Python libraries

Option 2: API Documentation Collection

  • Sources:
    • REST API documentation from major services
    • OpenAPI specifications and examples
    • Developer guides and integration tutorials
  • Formats: JSON, Markdown, HTML converted to text
  • Domain: Software development, APIs, integrations

Test Questions Provided

  • File: data/test_questions.json
  • Size: 50 carefully crafted Q&A pairs
  • Example:
    {
      "question": "How do you handle missing data in pandas?",
      "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
      "source_sections": ["pandas-missing-data", "data-cleaning"]
    }
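
A loader for this file could be sketched as follows. This is a minimal sketch that assumes data/test_questions.json holds a JSON array of records shaped like the example above; the names QAExample and parse_test_questions are hypothetical and not part of the assessment.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_FIELDS = {"question", "expected_answer", "source_sections"}


@dataclass(frozen=True)
class QAExample:
    """One evaluation record (hypothetical name, fields taken from the example)."""
    question: str
    expected_answer: str
    source_sections: List[str]


def parse_test_questions(raw: str) -> List[QAExample]:
    """Parse a JSON array of Q&A records and reject incomplete entries."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of question records")
    examples = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
        examples.append(
            QAExample(rec["question"], rec["expected_answer"], list(rec["source_sections"]))
        )
    return examples
```

Failing fast on malformed records keeps schema problems from surfacing later as misleading evaluation scores.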
    

🎯 Your Mission (2-4 Days)

Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.


🔧 Technical Requirements

Core Implementation (Must Have)

  1. Document Processing Pipeline

    • Multi-format document parsing (PDF, MD, TXT)
    • Intelligent text chunking with overlap handling
    • Metadata extraction (document type, section, timestamps)
    • Handle tables, code blocks, and structured content
  2. Vector Database & Retrieval

    • Choose and implement vector database (Pinecone, Weaviate, Chroma, or FAISS)
    • Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
    • Hybrid search (semantic + keyword/BM25)
    • Retrieval optimization and re-ranking (where necessary)
  3. LLM Integration & Generation

    • LLM API integration (OpenAI, Anthropic, Groq, or local models)
    • Context-aware prompt engineering
    • Source citation and attribution
    • Answer quality validation and filtering
  4. RAG Orchestration

    • End-to-end query processing pipeline
    • Context window management for long documents
    • Multi-step reasoning for complex questions
    • Confidence scoring and uncertainty handling
  5. Evaluation & Metrics

    • Human-evaluation framework for answer quality
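
One common way to combine semantic and keyword (BM25) results for the hybrid search requirement is reciprocal rank fusion (RRF). The brief does not prescribe a fusion method, so the sketch below is only one option; the function name and the constant k = 60 are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) to an item's score, with rank
    starting at 1, so items ranked highly by several retrievers rise.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["a", "b", "c"]  # e.g. from the vector database
keyword_hits = ["b", "d", "a"]   # e.g. from a BM25 index
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

RRF uses only rank positions, so it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.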

Advanced Features (Nice to Have)

  • Query classification and routing
  • Conversational memory and follow-up handling
  • Real-time document ingestion pipeline
  • A/B testing framework for different retrieval strategies
  • Cost optimization and caching strategies
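
For the caching item, a minimal sketch of an in-memory embedding cache is shown below. It assumes embeddings are deterministic for a given text; EmbeddingCache and the embed_fn parameter are hypothetical names, and a real deployment would likely need persistence and eviction.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by content hash so repeated text is embedded once."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn  # any function mapping text to a vector
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

The hit and miss counters make it easy to report how many embedding calls the cache saved.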

📋 Deliverables

1. Code Structure (Clean & Modular)

2. Documentation & Notebooks

  • README.md: Architecture overview, setup instructions, API usage
  • Jupyter Notebooks:
    • Document analysis and chunking strategy exploration
    • Embedding model comparison and retrieval experiments
    • RAG pipeline evaluation and optimization insights
    • Performance analysis with different LLM configurations (if multiple LLM providers are compared)
  • API Documentation: FastAPI auto-generated docs with examples
  • System Architecture: Diagram showing component interactions

3. Executable Pipelines

Document Ingestion Pipeline:

python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
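
The command-line interface above could be parsed with argparse along these lines. The flag names come from the command itself; the defaults, help texts, and the set of accepted --vector_db values (taken from the vector databases listed in this brief) are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Ingest documents into a vector database.")
    parser.add_argument("--docs_path", required=True, help="Directory of source documents")
    parser.add_argument(
        "--vector_db",
        default="chroma",  # assumed default
        choices=["chroma", "faiss", "pinecone", "weaviate"],
        help="Vector database backend",
    )
    parser.add_argument(
        "--embedding_model",
        default="all-MiniLM-L6-v2",  # assumed default
        help="Embedding model name",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # Passing argv explicitly keeps the parser testable without touching sys.argv.
    return build_parser().parse_args(argv)
```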

Query Pipeline:

python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo

Evaluation Pipeline:

python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
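
One automatic metric the evaluation pipeline might report alongside human ratings is a retrieval hit rate computed from the source_sections field of the test questions. The sketch below assumes each retrieved chunk carries a section identifier comparable to those in source_sections; the function name is hypothetical.

```python
from typing import Sequence


def retrieval_hit_rate(
    expected: Sequence[Sequence[str]],
    retrieved: Sequence[Sequence[str]],
) -> float:
    """Fraction of questions for which at least one expected section was retrieved."""
    if len(expected) != len(retrieved):
        raise ValueError("expected and retrieved must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for exp, got in zip(expected, retrieved) if set(exp) & set(got))
    return hits / len(expected)


# Two questions: the first retrieves an expected section, the second does not.
rate = retrieval_hit_rate(
    expected=[["pandas-missing-data", "data-cleaning"], ["numpy-broadcasting"]],
    retrieved=[["data-cleaning", "intro"], ["matplotlib-axes"]],
)
```

A hit rate measures retrieval only; answer correctness still needs the human-evaluation score or a comparison against expected_answer.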

4. REST API (Required)

# API endpoints for production usage
POST /ingest - Upload and process new documents
POST /query - Ask questions and get answers with sources  
GET /documents - List indexed documents
POST /evaluate - Run evaluation on test questions
GET /health - System health check
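
The payloads for POST /query could be modeled as below. This sketch uses only standard-library dataclasses; with FastAPI, the same shapes would typically be declared as Pydantic models. The field names (question, top_k, answer, sources) and the upper bound of 50 on top_k are assumptions, since the brief does not define a schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class QueryRequest:
    question: str
    top_k: int = 5

    def __post_init__(self) -> None:
        # Validate early so bad input is rejected before retrieval runs.
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if not 1 <= self.top_k <= 50:
            raise ValueError("top_k must be between 1 and 50")


@dataclass
class QueryResponse:
    answer: str
    sources: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)
```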

🎯 Evaluation Criteria

Technical Skills

  • RAG Architecture: Proper component design and integration
  • Vector DB Implementation: Efficient storage, indexing, and retrieval
  • LLM Integration: Effective prompt engineering and API usage
  • Code Quality: Clean, modular, well-documented, testable code
  • Performance: Response time optimization and resource efficiency

System Design

  • Scalability: Architecture that can handle growing document collections
  • Configurability: Easy to swap embedding models, LLMs, vector DBs
  • Error Handling: Robust handling of failures and edge cases
  • API Design: Well-designed REST endpoints with proper validation
  • Production Readiness: Monitoring, logging, health checks
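
The configurability criterion is often met with a small interface plus a registry, so a component can be selected by name from configuration. The following is a minimal sketch for embedding models; Embedder, register_embedder, create_embedder, and the toy LengthStubEmbedder are all hypothetical names.

```python
from typing import Callable, Dict, List, Protocol


class Embedder(Protocol):
    def embed(self, texts: List[str]) -> List[List[float]]: ...


_EMBEDDERS: Dict[str, Callable[[], Embedder]] = {}


def register_embedder(name: str) -> Callable:
    """Class decorator that registers an embedder factory under a name."""
    def decorator(factory):
        _EMBEDDERS[name] = factory
        return factory
    return decorator


def create_embedder(name: str) -> Embedder:
    try:
        return _EMBEDDERS[name]()
    except KeyError:
        raise ValueError(
            f"unknown embedder {name!r}; available: {sorted(_EMBEDDERS)}"
        ) from None


@register_embedder("length-stub")
class LengthStubEmbedder:
    """Toy embedder for tests: one dimension holding the text length."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]
```

The same pattern can be repeated for LLM clients and vector stores, with the chosen names read from a config file or CLI flags.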

Problem Solving

  • Chunking Strategy: Intelligent document segmentation approach
  • Retrieval Optimization: Hybrid search, re-ranking, context management
  • Answer Quality: Handling of complex questions, citations, uncertainty
  • Evaluation Design: Comprehensive metrics and testing framework

Communication

  • Documentation: Clear system explanation and usage examples

💡 Bonus Points

  • Advanced Retrieval: Query expansion, hypothetical document embeddings, query routing, etc.

🛠️ Suggested Tech Stack

Core Components:

  • Vector DB: Chroma, Pinecone, Weaviate, or FAISS
  • Embeddings: OpenAI, Sentence-Transformers, or Cohere
  • LLM: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
  • Document Processing: LangChain, LlamaIndex, or custom parsers
  • API: FastAPI or Flask
  • Testing: pytest, httpx

Optional:

  • Frontend: Streamlit, Gradio, or render directly with FastAPI

🔧 Pipeline Requirements

Ingestion Pipeline (pipelines/ingestion_pipeline.py)

Must include:

# Key components for document ingestion:
- Multi-format document parsing (PDF, MD, TXT)
- Intelligent chunking with overlap handling  
- Metadata extraction and enrichment
- Embedding generation and batch processing
- Vector database indexing and storage
- Progress tracking and error recovery
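
As a baseline for chunking with overlap, a sliding window over whitespace-separated words is sketched below. This is an assumption-laden starting point, not the required strategy: it ignores headings, tables, and code blocks, and the function name and default sizes are hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows where consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g", chunk_size=4, overlap=2)
```

A structure-aware splitter would instead cut at section or code-block boundaries first and fall back to a window like this only inside oversized sections.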

Query Pipeline (pipelines/query_pipeline.py)

Must include:

# Key components for question answering:
- Query preprocessing and classification
- Semantic and hybrid retrieval
- Context ranking and selection
- LLM prompt construction
- Answer generation with citations
- Response post-processing and validation
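
Prompt construction with citations can be as simple as numbering the retrieved passages and asking the model to cite those numbers. The sketch below assumes each retrieved chunk is available as a (source_id, text) pair; the function name and the instruction wording are illustrative only.

```python
from typing import List, Tuple


def build_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Build a grounded prompt from (source_id, text) pairs, numbered for citation."""
    context = "\n\n".join(
        f"[{i}] (source: {source_id})\n{text}"
        for i, (source_id, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context passages below. "
        "Cite passages as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How do you handle missing data in pandas?",
    [("pandas-missing-data", "Use dropna(), fillna(), or interpolate().")],
)
```

Because the passage numbers map back to source IDs, post-processing can turn each [n] in the answer into a source citation and flag answers that cite nothing.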

Success Indicators

  • RAG System Works: End-to-end question answering with proper citations
  • All Pipelines Execute: Ingestion, query, and evaluation pipelines run successfully
  • High Answer Quality: Relevant, accurate responses to test questions (>80% human evaluation score)
  • Production Ready: Robust error handling, logging, and API design
  • Clear Architecture: Well-designed, modular system with proper separation of concerns
  • Comprehensive Evaluation: Multiple metrics and thorough performance analysis

🚨 Technical Challenges to Address

  • Context Window Limits: How to handle long documents and conversations
  • Retrieval Quality: Balancing precision vs recall in document chunks
  • Answer Attribution: Proper source citation and confidence scoring
  • Cost Optimization: Efficient use of LLM APIs and embedding generation
  • Latency Optimization: Fast response times for interactive usage
  • Content Diversity: Handling different document types and structures
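
For the context window challenge, one simple approach is to pack the highest-ranked chunks greedily into a fixed token budget. The sketch below approximates tokens by whitespace-separated words, which is an assumption; a real pipeline would plug the tokenizer of the chosen LLM into count_tokens. The function name is hypothetical.

```python
from typing import Callable, List


def pack_context(
    ranked_chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Select chunks in rank order without exceeding the token budget."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # skip chunks that do not fit; a smaller later one still might
        selected.append(chunk)
        used += cost
    return selected


packed = pack_context(["a b c", "d e f g", "h"], max_tokens=4)
```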