first commit

2025-07-25 21:33:37 +01:00
commit f0f4f0d376
1 changed files with 213 additions and 0 deletions
@@ -0,0 +1,213 @@
 # ML Engineer Assessment: **DocuChat** - RAG Q&A System
 ## Scenario
 You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.
 ## Document Sources Provided
 ### **Option 1: Python Data Science Documentation (Recommended)**
 - **Sources**: 
  - Python Data Science Handbook (full text in Jupyter Notebooks)
  - Scikit-learn documentation PDFs
  - NumPy, Matplotlib and Pandas tutorials
 - **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
 - **Domain**: Data science, machine learning, Python libraries
 ### **Option 2: API Documentation Collection**
 - **Sources**: 
  - REST API documentation from major services
  - OpenAPI specifications and examples
  - Developer guides and integration tutorials
 - **Formats**: JSON, Markdown, HTML converted to text
 - **Domain**: Software development, APIs, integrations
 ### **Test Questions Provided**
 - **File**: `data/test_questions.json`
 - **Size**: 50 carefully crafted Q&A pairs
 - **Examples**:
  ```json
  {
    "question": "How do you handle missing data in pandas?",
    "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
    "source_sections": ["pandas-missing-data", "data-cleaning"]
  }
  ```
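A loader for the test set might look like the sketch below. It assumes the file holds a JSON array of records shaped like the example above (the brief shows only a single record, so the array wrapper is an assumption), and the names `QAPair` and `parse_test_questions` are hypothetical.

```python
import json
from dataclasses import dataclass

REQUIRED_KEYS = ("question", "expected_answer", "source_sections")


@dataclass(frozen=True)
class QAPair:
    """One evaluation record (hypothetical container, not part of the brief)."""
    question: str
    expected_answer: str
    source_sections: tuple[str, ...]


def parse_test_questions(raw: str) -> list[QAPair]:
    """Parse Q&A records from JSON text, failing loudly on missing keys."""
    records = json.loads(raw)
    if isinstance(records, dict):  # tolerate a file holding a single record
        records = [records]
    pairs = []
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
        pairs.append(
            QAPair(
                question=record["question"],
                expected_answer=record["expected_answer"],
                source_sections=tuple(record["source_sections"]),
            )
        )
    return pairs
```

In a real pipeline the text would come from `Path("data/test_questions.json").read_text(encoding="utf-8")`; validating up front keeps a malformed record from surfacing halfway through an evaluation run.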
 ## 🎯 Your Mission (2-4 Days)
 Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.
 ---
 ## 🔧 Technical Requirements
 ### Core Implementation (Must Have)
 1. **Document Processing Pipeline**
   - Multi-format document parsing (PDF, MD, TXT)
   - Intelligent text chunking with overlap handling
   - Metadata extraction (document type, section, timestamps)
   - Handle tables, code blocks, and structured content
 2. **Vector Database & Retrieval**
   - Choose and implement vector database (Pinecone, Weaviate, Chroma, or FAISS)
   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
   - Hybrid search (semantic + keyword/BM25)
   - Retrieval optimization and re-ranking (where necessary)
 3. **LLM Integration & Generation**
   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
   - Context-aware prompt engineering
   - Source citation and attribution
   - Answer quality validation and filtering
 4. **RAG Orchestration**
   - End-to-end query processing pipeline
   - Context window management for long documents
   - Multi-step reasoning for complex questions
   - Confidence scoring and uncertainty handling
 5. **Evaluation & Metrics**
   - Human-evaluation framework for answer quality
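
Requirement 2 asks for hybrid search but does not say how the semantic and keyword result lists should be combined. One common option, shown here as an assumption and not as the expected solution, is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever. The chunk IDs and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each chunk earns 1 / (k + rank) from every list it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


semantic_hits = ["c3", "c1", "c7"]  # hypothetical vector-search ranking
keyword_hits = ["c1", "c9", "c3"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print([chunk_id for chunk_id, _ in fused])  # ['c1', 'c3', 'c9', 'c7']
```

Because RRF ignores raw scores, it avoids having to normalise cosine similarities against BM25 scores, which live on different scales.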
 ### Advanced Features (Nice to Have)
 - Query classification and routing
 - Conversational memory and follow-up handling  
 - Real-time document ingestion pipeline
 - A/B testing framework for different retrieval strategies
 - Cost optimization and caching strategies
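
For the caching item, a minimal in-memory sketch is shown below. It assumes embeddings are deterministic for a given model and text, so a hash of both can serve as the cache key. `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` only stands in for a real embedding call.

```python
import hashlib
from typing import Callable, Sequence


class EmbeddingCache:
    """In-memory cache keyed by a hash of the model name and the text."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]], model_name: str):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: dict[str, Sequence[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model name so switching models never reuses stale vectors.
        payload = f"{self._model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, text: str) -> Sequence[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[key] = vector
        return vector


def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding call; NOT a meaningful embedding.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed, model_name="all-MiniLM-L6-v2")
cache.embed("hello world")
cache.embed("hello world")  # served from the cache, no second embedding call
```

A production version would likely persist the store (for example on disk or in Redis) and bound its size; both are left out here.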
 ## 📋 Deliverables
 ### 1. Code Structure (Clean & Modular)
 ### 2. Documentation & Notebooks
 - **README.md**: Architecture overview, setup instructions, API usage
 - **Jupyter Notebooks**: 
  - Document analysis and chunking strategy exploration
  - Embedding model comparison and retrieval experiments
  - RAG pipeline evaluation and optimization insights
  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
 - **API Documentation**: FastAPI auto-generated docs with examples
 - **System Architecture**: Diagram showing component interactions
 ### 3. Executable Pipelines
 **Document Ingestion Pipeline:**
 ```bash
 python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
 ```
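The flags in the ingestion command above could be parsed with `argparse` roughly as follows. Only the three flag names come from the brief; the defaults, the `choices` list, and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser mirroring the ingestion command's flags."""
    parser = argparse.ArgumentParser(description="Ingest documents into a vector database.")
    parser.add_argument("--docs_path", required=True, help="Directory containing source documents")
    parser.add_argument(
        "--vector_db",
        default="chroma",  # assumed default
        choices=["chroma", "pinecone", "weaviate", "faiss"],
        help="Vector database backend",
    )
    parser.add_argument(
        "--embedding_model",
        default="all-MiniLM-L6-v2",  # assumed default
        help="Embedding model name",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--docs_path", "data/documents/", "--vector_db", "chroma"])
print(args.docs_path, args.vector_db, args.embedding_model)
```

Restricting `--vector_db` with `choices` makes an unsupported backend fail at parse time instead of deep inside the pipeline.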
 **Query Pipeline:**
 ```bash
 python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
 ```
 **Evaluation Pipeline:**
 ```bash
 python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
 ```
 ### 4. REST API (Required)
 ```text
 # API endpoints for production usage
 POST /ingest     Upload and process new documents
 POST /query      Ask questions and get answers with sources
 GET  /documents  List indexed documents
 POST /evaluate   Run evaluation on test questions
 GET  /health     System health check
 ```
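The brief does not define request or response bodies for these endpoints. As a framework-agnostic sketch, the `/query` payloads might be modelled like this; every field name, the `top_k` bounds, and the class names are hypothetical, and in FastAPI the same shapes would normally be expressed as Pydantic models.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class QueryRequest:
    """Hypothetical body for POST /query."""
    query: str
    top_k: int = 5

    def validate(self) -> None:
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if not 1 <= self.top_k <= 50:  # upper bound is an arbitrary assumption
            raise ValueError("top_k must be between 1 and 50")


@dataclass
class SourceCitation:
    """One cited chunk returned alongside the answer."""
    document: str
    section: str
    score: float


@dataclass
class QueryResponse:
    """Hypothetical body returned by POST /query."""
    answer: str
    sources: list[SourceCitation] = field(default_factory=list)
    confidence: float = 0.0


response = QueryResponse(
    answer="Use dropna(), fillna(), or interpolate().",
    sources=[SourceCitation("pandas-tutorial.md", "pandas-missing-data", 0.87)],
    confidence=0.9,
)
payload = asdict(response)  # nested dataclasses become plain dicts, ready for JSON
```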
 ---
 ## 🎯 Evaluation Criteria
 ### Technical Skills
 - **RAG Architecture**: Proper component design and integration
 - **Vector DB Implementation**: Efficient storage, indexing, and retrieval  
 - **LLM Integration**: Effective prompt engineering and API usage
 - **Code Quality**: Clean, modular, well-documented, testable code
 - **Performance**: Response time optimization and resource efficiency
 ### System Design
 - **Scalability**: Architecture that can handle growing document collections
 - **Configurability**: Easy to swap embedding models, LLMs, vector DBs
 - **Error Handling**: Robust handling of failures and edge cases
 - **API Design**: Well-designed REST endpoints with proper validation
 - **Production Readiness**: Monitoring, logging, health checks
 ### Problem Solving
 - **Chunking Strategy**: Intelligent document segmentation approach
 - **Retrieval Optimization**: Hybrid search, re-ranking, context management
 - **Answer Quality**: Handling of complex questions, citations, uncertainty
 - **Evaluation Design**: Comprehensive metrics and testing framework
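
One way to quantify retrieval quality is precision and recall at `k`, computed against the `source_sections` listed for each test question. The brief does not name specific metrics, so this is a sketch of one reasonable choice; the section IDs in the example are illustrative.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Return (precision@k, recall@k) for one query.

    retrieved: section IDs in ranked order, best first.
    relevant:  section IDs listed as ground truth for the question.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["pandas-missing-data", "numpy-arrays", "data-cleaning", "plot-styles"]
relevant = {"pandas-missing-data", "data-cleaning"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
# Two of the top three results are relevant, and both relevant sections were found.
```

Averaging both values over all test questions for several values of `k` gives a simple picture of the precision/recall trade-off as more chunks are passed to the LLM.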
 ### Communication
 - **Documentation**: Clear system explanation and usage examples
 ---
 ## 💡 Bonus Points
 - **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.
 ---
 ## 🛠️ Suggested Tech Stack
 **Core Components:**
 - **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
 - **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
 - **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
 - **Document Processing**: LangChain, LlamaIndex, or custom parsers
 - **API**: FastAPI, Flask
 - **Testing**: pytest, httpx
 **Optional:**
 - **Frontend**: Streamlit, Gradio, or render directly with FastAPI
 ---
 ## 🔧 Pipeline Requirements
 ### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)
 **Must include:**
 ```python
 # Key components for document ingestion:
 # - Multi-format document parsing (PDF, MD, TXT)
 # - Intelligent chunking with overlap handling
 # - Metadata extraction and enrichment
 # - Embedding generation and batch processing
 # - Vector database indexing and storage
 # - Progress tracking and error recovery
 ```
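As a baseline for the chunking component, a fixed-size word window with overlap can be sketched as below. The sizes are arbitrary assumptions and `chunk_words` is a hypothetical name. Splitting on whitespace ignores headings, tables, and code blocks, so a real solution would likely chunk along document structure first.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```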
 ### Query Pipeline (`pipelines/query_pipeline.py`) 
 **Must include:**
 ```python
 # Key components for question answering:
 # - Query preprocessing and classification
 # - Semantic and hybrid retrieval
 # - Context ranking and selection
 # - LLM prompt construction
 # - Answer generation with citations
 # - Response post-processing and validation
 ```
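For the prompt-construction and citation steps, one possible approach is to number the retrieved chunks and instruct the model to cite those numbers. The wording of the instructions, the `RetrievedChunk` type, and the `build_prompt` function are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    """A retrieved passage plus the identifier used when citing it."""
    source: str
    text: str


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Assemble a grounded prompt whose context entries are numbered for citation."""
    context = "\n\n".join(
        f"[{number}] (source: {chunk.source})\n{chunk.text}"
        for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources as [n] after each claim. "
        "If the context is insufficient, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How do you handle missing data in pandas?",
    [RetrievedChunk("pandas-missing-data", "Use dropna() to remove rows with missing values.")],
)
```

Because the citation markers map back to chunk positions, post-processing can check that every `[n]` in the answer refers to a chunk that was actually supplied.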
 ## ⚡ Success Indicators
 - **RAG System Works**: End-to-end question answering with proper citations
 - **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
 - **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
 - **Production Ready**: Robust error handling, logging, and API design
 - **Clear Architecture**: Well-designed, modular system with proper separation of concerns
 - **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis
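
The brief does not define how the ">80% human evaluation score" is computed. One plausible reading, assumed here, is that reviewers rate each answer on a 1-5 scale and the overall score is the mean rating expressed as a percentage of the maximum, averaged per question. The function name and the ratings below are illustrative only.

```python
from statistics import mean


def human_eval_score(ratings: dict[str, list[int]], max_rating: int = 5) -> float:
    """Return the overall score as a percentage in [0, 100].

    ratings maps a question ID to the ratings given by each reviewer.
    Questions are weighted equally regardless of how many reviewers rated them.
    """
    per_question = [mean(values) / max_rating for values in ratings.values() if values]
    if not per_question:
        raise ValueError("no ratings provided")
    return 100.0 * mean(per_question)


illustrative_ratings = {"q1": [5, 4], "q2": [4, 4], "q3": [3, 5]}  # made-up ratings
score = human_eval_score(illustrative_ratings)
passes_threshold = score > 80.0
```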
 ---
 ## 🚨 Technical Challenges to Address
 - **Context Window Limits**: How to handle long documents and conversations
 - **Retrieval Quality**: Balancing precision vs recall in document chunks
 - **Answer Attribution**: Proper source citation and confidence scoring
 - **Cost Optimization**: Efficient use of LLM APIs and embedding generation  
 - **Latency Optimization**: Fast response times for interactive usage
 - **Content Diversity**: Handling different document types and structures
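
For the context-window challenge, one simple strategy is to pack the highest-scoring chunks greedily until a token budget is exhausted. This is a sketch under stated assumptions: `pack_context` is a hypothetical helper, and the default whitespace word count only approximates tokens, so a real system would pass in the tokenizer of the chosen LLM.

```python
from typing import Callable


def pack_context(
    chunks: list[tuple[str, float]],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Select chunks by descending score without exceeding max_tokens.

    chunks holds (text, relevance_score) pairs. A chunk that does not fit is
    skipped, so a smaller, lower-ranked chunk may still be included.
    """
    selected = []
    used = 0
    for text, _score in sorted(chunks, key=lambda chunk: chunk[1], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue
        selected.append(text)
        used += cost
    return selected


candidates = [("a b c", 0.9), ("d e f g h", 0.8), ("i j", 0.7)]
print(pack_context(candidates, max_tokens=5))  # ['a b c', 'i j']
```

Skipping oversized chunks instead of stopping at the first one trades strict rank order for better use of the budget; stopping early is the safer choice when chunk order carries meaning.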