# ML Engineer Assessment: **DocuChat** - RAG Q&A System
## Scenario
You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.

## Document Sources Provided

### **Option 1: Python Data Science Documentation (Recommended)**
- **Sources**: 
  - Python Data Science Handbook (full text in Jupyter Notebooks)
  - Scikit-learn documentation PDFs
  - NumPy, Matplotlib and Pandas tutorials
- **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
- **Domain**: Data science, machine learning, Python libraries

### **Option 2: API Documentation Collection**
- **Sources**: 
  - REST API documentation from major services
  - OpenAPI specifications and examples
  - Developer guides and integration tutorials
- **Formats**: JSON, Markdown, HTML converted to text
- **Domain**: Software development, APIs, integrations

### **Test Questions Provided**
- **File**: `data/test_questions.json`
- **Size**: 50 carefully crafted Q&A pairs
- **Examples**:
  ```json
  {
    "question": "How do you handle missing data in pandas?",
    "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
    "source_sections": ["pandas-missing-data", "data-cleaning"]
  }
  ```
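
Before running any evaluation, it can help to check that every record has the three fields shown above. The following is a minimal sketch, assuming the file is a JSON array of such objects (the source shows only a single object, so the array layout is an assumption); `validate_questions` and `load_questions` are hypothetical helper names.

```python
import json

# Field names come from the example record; the expected types are assumptions.
REQUIRED_FIELDS = {"question": str, "expected_answer": str, "source_sections": list}


def validate_questions(records):
    """Return (index, problem) tuples; an empty list means every record is well-formed."""
    problems = []
    for i, record in enumerate(records):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                problems.append((i, f"missing field: {field}"))
            elif not isinstance(record[field], expected_type):
                problems.append((i, f"wrong type for field: {field}"))
    return problems


def load_questions(path="data/test_questions.json"):
    """Load the test questions and fail early if any record is malformed."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    problems = validate_questions(records)
    if problems:
        raise ValueError(f"invalid test questions: {problems}")
    return records


# In-memory sample mirroring the record shown above.
sample = [
    {
        "question": "How do you handle missing data in pandas?",
        "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
        "source_sections": ["pandas-missing-data", "data-cleaning"],
    }
]
print(validate_questions(sample))
```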

## 🎯 Your Mission (2-4 Days)
Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.

---

## 🔧 Technical Requirements

### Core Implementation (Must Have)
1. **Document Processing Pipeline**
   - Multi-format document parsing (PDF, MD, TXT)
   - Intelligent text chunking with overlap handling
   - Metadata extraction (document type, section, timestamps)
   - Handle tables, code blocks, and structured content

2. **Vector Database & Retrieval**
   - Choose and implement a vector database (Pinecone, Weaviate, Chroma, or FAISS)
   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
   - Hybrid search (semantic + keyword/BM25)
   - Retrieval optimization and re-ranking (where necessary)

3. **LLM Integration & Generation**
   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
   - Context-aware prompt engineering
   - Source citation and attribution
   - Answer quality validation and filtering

4. **RAG Orchestration**
   - End-to-end query processing pipeline
   - Context window management for long documents
   - Multi-step reasoning for complex questions
   - Confidence scoring and uncertainty handling

5. **Evaluation & Metrics**
   - Human-evaluation framework for answer quality
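
The hybrid search requirement in item 2 needs some way to merge a semantic ranking with a keyword (BM25) ranking. One common option is reciprocal rank fusion (RRF), sketched below. This is an illustration and not a prescribed method: the function name, the chunk IDs, and the constant `k=60` (a frequently used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single list, best first.

    Each chunk receives 1 / (k + rank) from every ranking it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a vector search and a BM25 search.
semantic = ["c3", "c1", "c7"]
keyword = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Because RRF uses only rank positions, it avoids having to normalise cosine similarities against BM25 scores, which live on different scales.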

### Advanced Features (Nice to Have)
- Query classification and routing
- Conversational memory and follow-up handling  
- Real-time document ingestion pipeline
- A/B testing framework for different retrieval strategies
- Cost optimization and caching strategies

## 📋 Deliverables

### 1. Code Structure (Clean & Modular)


### 2. Documentation & Notebooks
- **README.md**: Architecture overview, setup instructions, API usage
- **Jupyter Notebooks**: 
  - Document analysis and chunking strategy exploration
  - Embedding model comparison and retrieval experiments
  - RAG pipeline evaluation and optimization insights
  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
- **API Documentation**: FastAPI auto-generated docs with examples
- **System Architecture**: Diagram showing component interactions

### 3. Executable Pipelines
**Document Ingestion Pipeline:**
```bash
python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
```
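
The flags in the command above could be wired up with `argparse` roughly as follows. This is a sketch under assumptions: the defaults, the `choices` list (taken from the vector databases named in this document), and the function names are illustrative, and the body of `main` is a placeholder.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the ingestion command."""
    parser = argparse.ArgumentParser(description="Ingest documents into a vector database.")
    parser.add_argument("--docs_path", required=True, help="Directory containing source documents")
    parser.add_argument(
        "--vector_db",
        default="chroma",
        choices=["chroma", "faiss", "pinecone", "weaviate"],
        help="Vector database backend (default is an assumption)",
    )
    parser.add_argument("--embedding_model", default="all-MiniLM-L6-v2", help="Embedding model name")
    return parser


def main(argv=None):
    """Parse arguments; passing argv explicitly makes the entry point testable."""
    args = build_parser().parse_args(argv)
    # Placeholder: a real pipeline would parse, chunk, embed, and index here.
    print(f"Ingesting {args.docs_path} into {args.vector_db} using {args.embedding_model}")
    return args
```

In the actual script, `main()` would be called from an `if __name__ == "__main__":` guard so that it reads `sys.argv`.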

**Query Pipeline:**
```bash
python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
```

**Evaluation Pipeline:**
```bash
python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
```
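
Alongside human evaluation, the `source_sections` field in the test questions allows a simple automatic retrieval check. The sketch below computes a hit rate, assuming each retrieved chunk carries a section identifier in its metadata that is comparable to the entries in `source_sections` (this is an assumption about your ingestion metadata, and `retrieval_hit_rate` is a hypothetical name).

```python
def retrieval_hit_rate(questions, retrieved_sections):
    """Fraction of questions for which at least one expected section was retrieved.

    questions: list of dicts with a "source_sections" list.
    retrieved_sections: one list of retrieved section IDs per question, same order.
    """
    if not questions:
        return 0.0
    hits = 0
    for question, retrieved in zip(questions, retrieved_sections):
        if set(question["source_sections"]) & set(retrieved):
            hits += 1
    return hits / len(questions)


# Hypothetical example: the first question is a hit, the second is a miss.
questions = [
    {"source_sections": ["pandas-missing-data", "data-cleaning"]},
    {"source_sections": ["numpy-broadcasting"]},
]
retrieved = [
    ["pandas-missing-data", "pandas-indexing"],
    ["matplotlib-subplots"],
]
print(retrieval_hit_rate(questions, retrieved))
```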

### 4. REST API (Required)
```text
# API endpoints for production usage
POST /ingest     - Upload and process new documents
POST /query      - Ask questions and get answers with sources
GET  /documents  - List indexed documents
POST /evaluate   - Run evaluation on test questions
GET  /health     - System health check
```

---

## 🎯 Evaluation Criteria

### Technical Skills
- **RAG Architecture**: Proper component design and integration
- **Vector DB Implementation**: Efficient storage, indexing, and retrieval  
- **LLM Integration**: Effective prompt engineering and API usage
- **Code Quality**: Clean, modular, well-documented, testable code
- **Performance**: Response time optimization and resource efficiency

### System Design
- **Scalability**: Architecture that can handle growing document collections
- **Configurability**: Easy to swap embedding models, LLMs, vector DBs
- **Error Handling**: Robust handling of failures and edge cases
- **API Design**: Well-designed REST endpoints with proper validation
- **Production Readiness**: Monitoring, logging, health checks

### Problem Solving
- **Chunking Strategy**: Intelligent document segmentation approach
- **Retrieval Optimization**: Hybrid search, re-ranking, context management
- **Answer Quality**: Handling of complex questions, citations, uncertainty
- **Evaluation Design**: Comprehensive metrics and testing framework

### Communication
- **Documentation**: Clear system explanation and usage examples


---

## 💡 Bonus Points

- **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.


---

## 🛠️ Suggested Tech Stack
**Core Components:**
- **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
- **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
- **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
- **Document Processing**: LangChain, LlamaIndex, or custom parsers
- **API**: FastAPI, Flask
- **Testing**: pytest, httpx

**Optional:**
- **Frontend**: Streamlit, Gradio, or render directly with FastAPI

---

## 🔧 Pipeline Requirements

### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)
**Must include:**
Key components for document ingestion:
- Multi-format document parsing (PDF, MD, TXT)
- Intelligent chunking with overlap handling
- Metadata extraction and enrichment
- Embedding generation and batch processing
- Vector database indexing and storage
- Progress tracking and error recovery
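
As a starting point for chunking with overlap, here is a minimal word-based sketch. It is deliberately naive: it ignores sentence boundaries, tables, and code blocks, which a real "intelligent" chunker would need to respect. The function name and the default sizes are assumptions.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words, sharing overlap words between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some words twice.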

### Query Pipeline (`pipelines/query_pipeline.py`) 
**Must include:**
Key components for question answering:
- Query preprocessing and classification
- Semantic and hybrid retrieval
- Context ranking and selection
- LLM prompt construction
- Answer generation with citations
- Response post-processing and validation
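
Prompt construction, citations, and context-window management interact: the prompt must number its sources so the model can cite them, and it must stop adding chunks before the context overflows. The sketch below illustrates one way to do this, assuming chunks arrive as dicts with hypothetical `source` and `text` keys, ranked best first, and using a character budget as a rough stand-in for a token budget.

```python
def build_prompt(question, chunks, max_context_chars=6000):
    """Build a citation-friendly prompt from ranked chunks within a size budget."""
    context_parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] ({chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_context_chars:
            break  # budget exhausted; lower-ranked chunks are dropped
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks; the second one is too long for the small budget.
chunks = [
    {"source": "pandas-missing-data", "text": "Use dropna() or fillna()."},
    {"source": "data-cleaning", "text": "x" * 100},
]
prompt = build_prompt("How do you handle missing data in pandas?", chunks, max_context_chars=80)
print(prompt)
```

A production version would count tokens with the tokenizer of the chosen LLM and not characters, since the limits are expressed in tokens.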


## ⚡ Success Indicators
- **RAG System Works**: End-to-end question answering with proper citations
- **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
- **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
- **Production Ready**: Robust error handling, logging, and API design
- **Clear Architecture**: Well-designed, modular system with proper separation of concerns
- **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis

---

## 🚨 Technical Challenges to Address
- **Context Window Limits**: How to handle long documents and conversations
- **Retrieval Quality**: Balancing precision vs recall in document chunks
- **Answer Attribution**: Proper source citation and confidence scoring
- **Cost Optimization**: Efficient use of LLM APIs and embedding generation  
- **Latency Optimization**: Fast response times for interactive usage
- **Content Diversity**: Handling different document types and structures
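
For the cost optimization challenge, one inexpensive measure is to avoid embedding the same text twice. The sketch below wraps an embedding function in an in-memory cache keyed by a content hash. `EmbeddingCache` is a hypothetical class, and `fake_embed` is a stand-in for a real embedding call, so the vectors it returns are meaningless.

```python
import hashlib


class EmbeddingCache:
    """Memoise an embedding function by the SHA-256 hash of the input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of times the underlying function was called

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text):
    # Stand-in for a real (billable) embedding call; returns a toy 2-d vector.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
cache.embed("hello world")
cache.embed("hello world")  # served from the cache, no second call
print(cache.misses)
```

The same idea extends to a persistent store (for example a file or key-value database) so that re-running ingestion on unchanged documents costs nothing.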