first commit

2025-07-25 21:33:37 +01:00
commit f0f4f0d376
1 changed files with 213 additions and 0 deletions
@@ -0,0 +1,213 @@
+# ML Engineer Assessment: **DocuChat** - RAG Q&A System
+## Scenario
+You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.
+
+## Document Sources Provided
+
+### **Option 1: Python Data Science Documentation (Recommended)**
+- **Sources**: 
+  - Python Data Science Handbook (full text in Jupyter Notebooks)
+  - Scikit-learn documentation PDFs
+  - NumPy, Matplotlib and Pandas tutorials
+- **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
+- **Domain**: Data science, machine learning, Python libraries
+
+### **Option 2: API Documentation Collection**
+- **Sources**: 
+  - REST API documentation from major services
+  - OpenAPI specifications and examples
+  - Developer guides and integration tutorials
+- **Formats**: JSON, Markdown, HTML converted to text
+- **Domain**: Software development, APIs, integrations
+
+### **Test Questions Provided**
+- **File**: `data/test_questions.json`
+- **Size**: 50 carefully crafted Q&A pairs
+- **Example**:
+  ```json
+  {
+    "question": "How do you handle missing data in pandas?",
+    "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
+    "source_sections": ["pandas-missing-data", "data-cleaning"]
+  }
+  ```
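A minimal sketch of how records in this shape might be checked before an evaluation run. It assumes `data/test_questions.json` holds a top-level JSON array of such objects, which the brief does not state; `validate_questions` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration.

```python
import json

# Field names come from the example record; the expected types are an assumption.
REQUIRED_FIELDS = {
    "question": str,
    "expected_answer": str,
    "source_sections": list,
}


def validate_questions(records):
    """Return a list of human-readable problems; an empty list means all records look valid."""
    problems = []
    for i, record in enumerate(records):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                problems.append(f"record {i}: missing '{field}'")
            elif not isinstance(record[field], expected_type):
                problems.append(
                    f"record {i}: '{field}' should be {expected_type.__name__}"
                )
    return problems


# Inline sample mirroring the example record (assumed top-level array).
sample = json.loads("""
[
  {
    "question": "How do you handle missing data in pandas?",
    "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
    "source_sections": ["pandas-missing-data", "data-cleaning"]
  },
  {"question": "Incomplete record"}
]
""")

for problem in validate_questions(sample):
    print(problem)
```

Running a check like this before the evaluation pipeline makes malformed records fail early rather than partway through a scoring run.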
+
+## 🎯 Your Mission (2-4 Days)
+Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.
+
+---
+
+## 🔧 Technical Requirements
+
+### Core Implementation (Must Have)
+1. **Document Processing Pipeline**
+   - Multi-format document parsing (PDF, MD, TXT)
+   - Intelligent text chunking with overlap handling
+   - Metadata extraction (document type, section, timestamps)
+   - Handle tables, code blocks, and structured content
+
+2. **Vector Database & Retrieval**
+   - Choose and implement a vector database (Pinecone, Weaviate, Chroma, or FAISS)
+   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
+   - Hybrid search (semantic + keyword/BM25)
+   - Retrieval optimization and re-ranking (where necessary)
+
+3. **LLM Integration & Generation**
+   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
+   - Context-aware prompt engineering
+   - Source citation and attribution
+   - Answer quality validation and filtering
+
+4. **RAG Orchestration**
+   - End-to-end query processing pipeline
+   - Context window management for long documents
+   - Multi-step reasoning for complex questions
+   - Confidence scoring and uncertainty handling
+
+5. **Evaluation & Metrics**
+   - Human-evaluation framework for answer quality
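The "chunking with overlap" requirement can be illustrated with a small sketch. This is a deliberately simple word-window splitter, not the approach the brief prescribes: it assumes whitespace-separated words are an acceptable unit, and the name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(demo, chunk_size=4, overlap=2):
    print(chunk)
```

A splitter like this ignores tables and code blocks, which the requirements above ask to be handled; a structure-aware splitter would need to treat those as atomic units.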
+
+### Advanced Features (Nice to Have)
+- Query classification and routing
+- Conversational memory and follow-up handling  
+- Real-time document ingestion pipeline
+- A/B testing framework for different retrieval strategies
+- Cost optimization and caching strategies
+
+## 📋 Deliverables
+
+### 1. Code Structure (Clean & Modular)
+
+
+### 2. Documentation & Notebooks
+- **README.md**: Architecture overview, setup instructions, API usage
+- **Jupyter Notebooks**: 
+  - Document analysis and chunking strategy exploration
+  - Embedding model comparison and retrieval experiments
+  - RAG pipeline evaluation and optimization insights
+  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
+- **API Documentation**: FastAPI auto-generated docs with examples
+- **System Architecture**: Diagram showing component interactions
+
+### 3. Executable Pipelines
+**Document Ingestion Pipeline:**
+```bash
+python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
+```
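One way the command-line interface for this invocation could be declared, sketched with the standard-library `argparse`. The flag names come from the command above; the defaults, the lowercase `choices` values, and the `build_parser` helper are assumptions for illustration.

```python
import argparse


def build_parser():
    """Declare the ingestion flags used in the example command."""
    parser = argparse.ArgumentParser(
        description="Ingest documents into a vector database."
    )
    parser.add_argument(
        "--docs_path", required=True,
        help="Directory containing the source documents",
    )
    parser.add_argument(
        "--vector_db", default="chroma",
        choices=["chroma", "faiss", "pinecone", "weaviate"],  # assumed identifiers
        help="Vector database backend",
    )
    parser.add_argument(
        "--embedding_model", default="all-MiniLM-L6-v2",
        help="Name of the embedding model",
    )
    return parser


# An explicit argument list is passed here for demonstration;
# a real script would call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["--docs_path", "data/documents/"])
print(f"Ingesting {args.docs_path} into {args.vector_db} with {args.embedding_model}")
```

Restricting `--vector_db` with `choices` means an unsupported backend is rejected at parse time instead of failing midway through ingestion.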
+
+**Query Pipeline:**
+```bash
+python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
+```
+
+**Evaluation Pipeline:**
+```bash
+python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
+```
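The brief requires only a human-evaluation framework, but the `source_sections` field in the test file also permits a simple automated retrieval check. The sketch below is one possible metric, not something the brief specifies: `retrieval_hit_rate` is a hypothetical helper, and it assumes the retriever returns section identifiers comparable to those in the test file.

```python
def retrieval_hit_rate(questions, retrieved_sections):
    """Fraction of questions for which at least one expected section was retrieved.

    questions: list of dicts with a "source_sections" list (schema of the test file).
    retrieved_sections: per-question lists of section ids returned by the retriever
        (hypothetical output format).
    """
    if len(questions) != len(retrieved_sections):
        raise ValueError("questions and retrieved_sections must have equal length")
    if not questions:
        return 0.0
    hits = sum(
        1
        for question, retrieved in zip(questions, retrieved_sections)
        if set(question["source_sections"]) & set(retrieved)
    )
    return hits / len(questions)


questions = [
    {"source_sections": ["pandas-missing-data", "data-cleaning"]},
    {"source_sections": ["numpy-broadcasting"]},
]
retrieved = [["data-cleaning", "pandas-groupby"], ["matplotlib-axes"]]
print(retrieval_hit_rate(questions, retrieved))
```

A metric of this kind measures retrieval only; it says nothing about whether the generated answer is correct, which is where human scoring remains necessary.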
+
+### 4. REST API (Required)
+```text
+# API endpoints for production usage
+POST /ingest - Upload and process new documents
+POST /query - Ask questions and get answers with sources  
+GET /documents - List indexed documents
+POST /evaluate - Run evaluation on test questions
+GET /health - System health check
+```
+
+---
+
+## 🎯 Evaluation Criteria
+
+### Technical Skills
+- **RAG Architecture**: Proper component design and integration
+- **Vector DB Implementation**: Efficient storage, indexing, and retrieval  
+- **LLM Integration**: Effective prompt engineering and API usage
+- **Code Quality**: Clean, modular, well-documented, testable code
+- **Performance**: Response time optimization and resource efficiency
+
+### System Design
+- **Scalability**: Architecture that can handle growing document collections
+- **Configurability**: Easy to swap embedding models, LLMs, vector DBs
+- **Error Handling**: Robust handling of failures and edge cases
+- **API Design**: Well-designed REST endpoints with proper validation
+- **Production Readiness**: Monitoring, logging, health checks
+
+### Problem Solving
+- **Chunking Strategy**: Intelligent document segmentation approach
+- **Retrieval Optimization**: Hybrid search, re-ranking, context management
+- **Answer Quality**: Handling of complex questions, citations, uncertainty
+- **Evaluation Design**: Comprehensive metrics and testing framework
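Hybrid search needs some way to merge the semantic ranking with the keyword/BM25 ranking. One common option is reciprocal rank fusion (RRF), sketched below; the brief does not mandate it, and the function name, the constant `k=60` (a conventional default), and the chunk ids are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["c1", "c2", "c3"]  # hypothetical ids from the vector search
keyword = ["c3", "c1", "c4"]   # hypothetical ids from the BM25 search
print(reciprocal_rank_fusion([semantic, keyword]))
```

Because RRF uses ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.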
+
+### Communication
+- **Documentation**: Clear system explanation and usage examples
+
+
+---
+
+## 💡 Bonus Points
+
+- **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.
+
+
+---
+
+## 🛠️ Suggested Tech Stack
+**Core Components:**
+- **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
+- **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
+- **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
+- **Document Processing**: LangChain, LlamaIndex, or custom parsers
+- **API**: FastAPI, Flask
+- **Testing**: pytest, httpx
+
+**Optional:**
+- **Frontend**: Streamlit, Gradio, or rendering directly with FastAPI
+
+---
+
+## 🔧 Pipeline Requirements
+
+### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)
+**Must include:**
+```text
+# Key components for document ingestion:
+- Multi-format document parsing (PDF, MD, TXT)
+- Intelligent chunking with overlap handling  
+- Metadata extraction and enrichment
+- Embedding generation and batch processing
+- Vector database indexing and storage
+- Progress tracking and error recovery
+```
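The "batch processing" and "progress tracking and error recovery" items can be sketched together. This is a minimal illustration under stated assumptions: `embed_batch` is a hypothetical callable standing in for whichever embedding client is chosen, and the helper names and default batch size are not from the brief.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ingest_in_batches(chunks, embed_batch, batch_size=32):
    """Embed chunks batch by batch, recording failed batches instead of aborting.

    embed_batch: callable taking a list of texts and returning one vector per text
        (hypothetical interface).
    Returns (embedded, failed): (text, vector) pairs and (batch, error message) pairs.
    """
    embedded, failed = [], []
    batches = list(batched(chunks, batch_size))
    for number, batch in enumerate(batches, start=1):
        try:
            vectors = embed_batch(batch)
        except Exception as exc:  # keep going so one bad batch does not stop the run
            failed.append((batch, str(exc)))
        else:
            embedded.extend(zip(batch, vectors))
        print(f"batch {number}/{len(batches)} processed")
    return embedded, failed


def fake_embed(batch):
    """Stand-in embedder for the demo: fails on any batch containing 'bad'."""
    if "bad" in batch:
        raise RuntimeError("boom")
    return [[float(len(text))] for text in batch]


embedded, failed = ingest_in_batches(["a", "bb", "bad", "cc"], fake_embed, batch_size=2)
```

Returning the failed batches lets a later run retry only those, rather than re-embedding the whole collection.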
+
+### Query Pipeline (`pipelines/query_pipeline.py`) 
+**Must include:**
+```text
+# Key components for question answering:
+- Query preprocessing and classification
+- Semantic and hybrid retrieval
+- Context ranking and selection
+- LLM prompt construction
+- Answer generation with citations
+- Response post-processing and validation
+```
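The "LLM prompt construction" and "answer generation with citations" steps might look like the sketch below. It is only an illustration: the chunk shape (`text` and `source` keys), the prompt wording, and the character budget (a crude stand-in for token counting) are all assumptions, and `build_prompt` is a hypothetical name.

```python
def build_prompt(question, chunks, max_context_chars=2000):
    """Assemble a prompt whose context lists numbered sources the model can cite.

    chunks: list of dicts with "text" and "source" keys, best-ranked first
        (assumed shape of the retriever output).
    """
    context_lines = []
    used = 0
    for chunk in chunks:
        entry = f"[{len(context_lines) + 1}] ({chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the context budget
        context_lines.append(entry)
        used += len(entry)

    context = "\n".join(context_lines) if context_lines else "(no context retrieved)"
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How do you handle missing data in pandas?",
    [
        {"text": "Use dropna() or fillna().", "source": "pandas-missing-data"},
        {"text": "x" * 5000, "source": "data-cleaning"},  # too long for the budget
    ],
)
print(prompt)
```

Numbering the sources in the prompt gives the post-processing step something concrete to validate: any `[n]` in the answer can be mapped back to a retrieved chunk, and an answer with no valid citation can be flagged.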
+
+
+## ⚡ Success Indicators
+- **RAG System Works**: End-to-end question answering with proper citations
+- **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
+- **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
+- **Production Ready**: Robust error handling, logging, and API design
+- **Clear Architecture**: Well-designed, modular system with proper separation of concerns
+- **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis
+
+---
+
+## 🚨 Technical Challenges to Address
+- **Context Window Limits**: How to handle long documents and conversations
+- **Retrieval Quality**: Balancing precision vs recall in document chunks
+- **Answer Attribution**: Proper source citation and confidence scoring
+- **Cost Optimization**: Efficient use of LLM APIs and embedding generation  
+- **Latency Optimization**: Fast response times for interactive usage
+- **Content Diversity**: Handling different document types and structures