# ML Engineer Assessment: **DocuChat** - RAG Q&A System

## Scenario

You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should let employees ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.

## Document Sources Provided

### **Option 1: Python Data Science Documentation (Recommended)**

- **Sources**:
  - Python Data Science Handbook (full text in Jupyter notebooks)
  - Scikit-learn documentation PDFs
  - NumPy, Matplotlib, and Pandas tutorials
- **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
- **Domain**: Data science, machine learning, Python libraries

### **Option 2: API Documentation Collection**

- **Sources**:
  - REST API documentation from major services
  - OpenAPI specifications and examples
  - Developer guides and integration tutorials
- **Formats**: JSON, Markdown, HTML converted to text
- **Domain**: Software development, APIs, integrations

### **Test Questions Provided**

- **File**: `data/test_questions.json`
- **Size**: 50 carefully crafted Q&A pairs
- **Examples**:

```json
{
  "question": "How do you handle missing data in pandas?",
  "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
  "source_sections": ["pandas-missing-data", "data-cleaning"]
}
```

## 🎯 Your Mission (2-4 Days)

Build a production-ready RAG (Retrieval Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.

---

## 🔧 Technical Requirements

### Core Implementation (Must Have)

1. **Document Processing Pipeline**
   - Multi-format document parsing (PDF, MD, TXT)
   - Intelligent text chunking with overlap handling
   - Metadata extraction (document type, section, timestamps)
   - Handling of tables, code blocks, and structured content

2. **Vector Database & Retrieval**
   - Choose and implement a vector database (Pinecone, Weaviate, Chroma, or FAISS)
   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
   - Hybrid search (semantic + keyword/BM25)
   - Retrieval optimization and re-ranking (where necessary)

3. **LLM Integration & Generation**
   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
   - Context-aware prompt engineering
   - Source citation and attribution
   - Answer quality validation and filtering

4. **RAG Orchestration**
   - End-to-end query processing pipeline
   - Context window management for long documents
   - Multi-step reasoning for complex questions
   - Confidence scoring and uncertainty handling

5. **Evaluation & Metrics**
   - Human-evaluation framework for answer quality

### Advanced Features (Nice to Have)

- Query classification and routing
- Conversational memory and follow-up handling
- Real-time document ingestion pipeline
- A/B testing framework for different retrieval strategies
- Cost optimization and caching strategies

## 📋 Deliverables

### 1. Code Structure (Clean & Modular)

### 2. Documentation & Notebooks

- **README.md**: Architecture overview, setup instructions, API usage
- **Jupyter Notebooks**:
  - Document analysis and chunking strategy exploration
  - Embedding model comparison and retrieval experiments
  - RAG pipeline evaluation and optimization insights
  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
- **API Documentation**: FastAPI auto-generated docs with examples
- **System Architecture**: Diagram showing component interactions

### 3. Executable Pipelines

**Document Ingestion Pipeline:**

```bash
python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
```

**Query Pipeline:**

```bash
python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
```

**Evaluation Pipeline:**

```bash
python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
```

### 4. REST API (Required)

```text
API endpoints for production usage

POST /ingest     Upload and process new documents
POST /query      Ask questions and get answers with sources
GET  /documents  List indexed documents
POST /evaluate   Run evaluation on test questions
GET  /health     System health check
```

---

## 🎯 Evaluation Criteria

### Technical Skills

- **RAG Architecture**: Proper component design and integration
- **Vector DB Implementation**: Efficient storage, indexing, and retrieval
- **LLM Integration**: Effective prompt engineering and API usage
- **Code Quality**: Clean, modular, well-documented, testable code
- **Performance**: Response time optimization and resource efficiency

### System Design

- **Scalability**: Architecture that can handle growing document collections
- **Configurability**: Easy to swap embedding models, LLMs, and vector DBs
- **Error Handling**: Robust handling of failures and edge cases
- **API Design**: Well-designed REST endpoints with proper validation
- **Production Readiness**: Monitoring, logging, health checks

### Problem Solving

- **Chunking Strategy**: Intelligent document segmentation approach
- **Retrieval Optimization**: Hybrid search, re-ranking, context management
- **Answer Quality**: Handling of complex questions, citations, uncertainty
- **Evaluation Design**: Comprehensive metrics and testing framework

### Communication

- **Documentation**: Clear system explanation and usage examples

---

## 💡 Bonus Points

- **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.
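
Hybrid search, listed under the retrieval requirements, needs some way to merge a semantic ranking with a keyword/BM25 ranking. One common option is reciprocal rank fusion (RRF), which uses only rank positions and so avoids normalizing scores across retrievers. The sketch below is a minimal illustration under stated assumptions, not a required approach: the function name, the chunk IDs, and the constant `k=60` are all illustrative choices that do not come from this brief.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs returned by the two retrievers.
semantic_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # chunks found by both retrievers rise to the top
```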

---

## 🛠️ Suggested Tech Stack

**Core Components:**

- **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
- **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
- **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
- **Document Processing**: LangChain, LlamaIndex, or custom parsers
- **API**: FastAPI, Flask
- **Testing**: pytest, httpx

**Optional:**

- **Frontend**: Streamlit, Gradio, or render directly with FastAPI

---

## 🔧 Pipeline Requirements

### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)

**Must include:**

```python
# Key components for document ingestion:
# - Multi-format document parsing (PDF, MD, TXT)
# - Intelligent chunking with overlap handling
# - Metadata extraction and enrichment
# - Embedding generation and batch processing
# - Vector database indexing and storage
# - Progress tracking and error recovery
```

### Query Pipeline (`pipelines/query_pipeline.py`)

**Must include:**

```python
# Key components for question answering:
# - Query preprocessing and classification
# - Semantic and hybrid retrieval
# - Context ranking and selection
# - LLM prompt construction
# - Answer generation with citations
# - Response post-processing and validation
```

## ⚡ Success Indicators

- **RAG System Works**: End-to-end question answering with proper citations
- **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
- **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
- **Production Ready**: Robust error handling, logging, and API design
- **Clear Architecture**: Well-designed, modular system with proper separation of concerns
- **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis

---

## 🚨 Technical Challenges to Address

- **Context Window Limits**: How to handle long documents and conversations
- **Retrieval Quality**: Balancing precision vs. recall in document chunks
- **Answer Attribution**: Proper source citation and confidence scoring
- **Cost Optimization**: Efficient use of LLM APIs and embedding generation
- **Latency Optimization**: Fast response times for interactive usage
- **Content Diversity**: Handling different document types and structures
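
As one illustration of the context window challenge listed above, retrieved chunks can be packed greedily into a fixed token budget before the prompt is built. This is a rough sketch under stated assumptions rather than a prescribed design: `pack_context`, the `(score, text)` chunk shape, and the whitespace-based token count are hypothetical stand-ins, and a real system would count tokens with the tokenizer of its chosen LLM.

```python
def pack_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily select the highest-scoring chunks that fit a token budget.

    `chunks` is a list of (score, text) pairs. `count_tokens` defaults to a
    whitespace word count, only a rough stand-in for a real tokenizer.
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # this chunk would overflow; a smaller one may still fit
        selected.append(text)
        used += cost
    return selected, used


# Hypothetical retrieval results: (relevance score, chunk text).
retrieved = [
    (0.91, "fillna replaces missing values with a given value"),
    (0.85, "dropna removes rows or columns that contain missing values"),
    (0.40, "interpolate estimates missing values"),
]

context, tokens_used = pack_context(retrieved, max_tokens=12)
print(context, tokens_used)
```

Skipping an oversized chunk instead of stopping at it trades strict relevance order for better use of the budget; stopping at the first overflow is the stricter alternative.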