# ML Engineer Assessment: **DocuChat** - RAG Q&A System
## Scenario
You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.
## Document Sources Provided
### **Option 1: Python Data Science Documentation (Recommended)**
- **Sources**:
  - Python Data Science Handbook (full text in Jupyter Notebooks)
  - Scikit-learn documentation PDFs
  - NumPy, Matplotlib, and Pandas tutorials
- **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
- **Domain**: Data science, machine learning, Python libraries
### **Option 2: API Documentation Collection**
- **Sources**:
  - REST API documentation from major services
  - OpenAPI specifications and examples
  - Developer guides and integration tutorials
- **Formats**: JSON, Markdown, HTML converted to text
- **Domain**: Software development, APIs, integrations
### **Test Questions Provided**
- **File**: `data/test_questions.json`
- **Size**: 50 carefully crafted Q&A pairs
- **Examples**:
```json
{
  "question": "How do you handle missing data in pandas?",
  "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
  "source_sections": ["pandas-missing-data", "data-cleaning"]
}
```
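As a minimal sketch, the test questions could be loaded and checked for the three fields shown above before any evaluation runs. The assumption here is that `data/test_questions.json` holds a JSON array of such objects; the helper name `parse_test_questions` is hypothetical.

```python
import json

# Fields taken from the example entry above.
REQUIRED_KEYS = {"question", "expected_answer", "source_sections"}


def parse_test_questions(raw: str) -> list[dict]:
    """Parse test questions from a JSON string and validate their fields."""
    data = json.loads(raw)
    # Assumption: the file is a JSON array; a single object is also accepted.
    items = data if isinstance(data, list) else [data]
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
    return items


sample = """[{
  "question": "How do you handle missing data in pandas?",
  "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
  "source_sections": ["pandas-missing-data", "data-cleaning"]
}]"""
questions = parse_test_questions(sample)
```

In a real pipeline the string would come from reading the file, for example `Path("data/test_questions.json").read_text()`.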
## 🎯 Your Mission (2-4 Days)
Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.

---
## 🔧 Technical Requirements
### Core Implementation (Must Have)
1. **Document Processing Pipeline**
   - Multi-format document parsing (PDF, MD, TXT)
   - Intelligent text chunking with overlap handling
   - Metadata extraction (document type, section, timestamps)
   - Handle tables, code blocks, and structured content
2. **Vector Database & Retrieval**
   - Choose and implement a vector database (Pinecone, Weaviate, Chroma, or FAISS)
   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
   - Hybrid search (semantic + keyword/BM25)
   - Retrieval optimization and re-ranking (where necessary)
3. **LLM Integration & Generation**
   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
   - Context-aware prompt engineering
   - Source citation and attribution
   - Answer quality validation and filtering
4. **RAG Orchestration**
   - End-to-end query processing pipeline
   - Context window management for long documents
   - Multi-step reasoning for complex questions
   - Confidence scoring and uncertainty handling
5. **Evaluation & Metrics**
   - Human-evaluation framework for answer quality
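The "text chunking with overlap handling" item under the Document Processing Pipeline can be illustrated with a small sketch. This version splits on whitespace and counts words, which is an assumption made for simplicity; a real implementation might count model tokens or respect section boundaries instead. The function name and default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


example = chunk_text("0 1 2 3 4 5 6 7 8 9", chunk_size=4, overlap=1)
# example == ["0 1 2 3", "3 4 5 6", "6 7 8 9"]
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of some duplicated storage.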
### Advanced Features (Nice to Have)
- Query classification and routing
- Conversational memory and follow-up handling
- Real-time document ingestion pipeline
- A/B testing framework for different retrieval strategies
- Cost optimization and caching strategies
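For the caching item above, one possible approach is to memoise embeddings by a hash of the chunk text so that unchanged content is never re-embedded. This is a sketch only: `EmbeddingCache` and `embed_fn` are hypothetical names, and `embed_fn` stands in for whichever embedding provider is chosen.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 hash of the input text."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn  # hypothetical provider call
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the provider was actually called

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


# Stand-in embedding function used only for demonstration.
cache = EmbeddingCache(lambda text: [float(len(text))])
first = cache.get("hello")
second = cache.get("hello")  # served from the cache, no second provider call
```

A persistent store (for example a file or key-value database) could replace the dictionary if the cache needs to survive restarts.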
## 📋 Deliverables
### 1. Code Structure (Clean & Modular)
### 2. Documentation & Notebooks
- **README.md**: Architecture overview, setup instructions, API usage
- **Jupyter Notebooks**:
  - Document analysis and chunking strategy exploration
  - Embedding model comparison and retrieval experiments
  - RAG pipeline evaluation and optimization insights
  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
- **API Documentation**: FastAPI auto-generated docs with examples
- **System Architecture**: Diagram showing component interactions
### 3. Executable Pipelines
**Document Ingestion Pipeline:**
```bash
python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
```
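The flags in the command above could be parsed with `argparse` along these lines. The flag names come from the command itself; the default values, the `choices` list (taken from the vector databases named in this brief), and the `build_parser` helper are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser for the ingestion pipeline (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Ingest documents into a vector database.")
    parser.add_argument("--docs_path", required=True,
                        help="Directory containing the source documents")
    parser.add_argument("--vector_db", default="chroma",
                        choices=["chroma", "pinecone", "weaviate", "faiss"],
                        help="Vector database backend (default is an assumption)")
    parser.add_argument("--embedding_model", default="all-MiniLM-L6-v2",
                        help="Embedding model name (default is an assumption)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--docs_path", "data/documents/", "--vector_db", "chroma"])
```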
**Query Pipeline:**
```bash
python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
```
**Evaluation Pipeline:**
```bash
python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
```
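One automatic metric the evaluation pipeline might report alongside human evaluation is a retrieval hit rate against the `source_sections` field of the test questions. The sketch below assumes each retrieved chunk carries a section identifier comparable to those values; that mapping and the function name `section_hit_rate` are hypothetical.

```python
def section_hit_rate(examples: list[dict], retrieved: list[list[str]]) -> float:
    """Fraction of questions where at least one expected section was retrieved."""
    if len(examples) != len(retrieved):
        raise ValueError("examples and retrieved must have the same length")
    if not examples:
        return 0.0
    hits = sum(
        1
        for example, sections in zip(examples, retrieved)
        if set(example["source_sections"]) & set(sections)
    )
    return hits / len(examples)


examples = [
    {"source_sections": ["pandas-missing-data", "data-cleaning"]},
    {"source_sections": ["numpy-broadcasting"]},  # made-up section id for the demo
]
retrieved = [["data-cleaning", "pandas-indexing"], ["matplotlib-basics"]]
rate = section_hit_rate(examples, retrieved)
```

This measures retrieval only; answer correctness against `expected_answer` still needs a separate check.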
### 4. REST API (Required)
```text
# API endpoints for production usage
POST /ingest     - Upload and process new documents
POST /query      - Ask questions and get answers with sources
GET  /documents  - List indexed documents
POST /evaluate   - Run evaluation on test questions
GET  /health     - System health check
```
---
## 🎯 Evaluation Criteria
### Technical Skills
- **RAG Architecture**: Proper component design and integration
- **Vector DB Implementation**: Efficient storage, indexing, and retrieval
- **LLM Integration**: Effective prompt engineering and API usage
- **Code Quality**: Clean, modular, well-documented, testable code
- **Performance**: Response time optimization and resource efficiency
### System Design
- **Scalability**: Architecture that can handle growing document collections
- **Configurability**: Easy to swap embedding models, LLMs, vector DBs
- **Error Handling**: Robust handling of failures and edge cases
- **API Design**: Well-designed REST endpoints with proper validation
- **Production Readiness**: Monitoring, logging, health checks
### Problem Solving
- **Chunking Strategy**: Intelligent document segmentation approach
- **Retrieval Optimization**: Hybrid search, re-ranking, context management
- **Answer Quality**: Handling of complex questions, citations, uncertainty
- **Evaluation Design**: Comprehensive metrics and testing framework
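For the hybrid search mentioned under Retrieval Optimization, one common way to merge a semantic ranking with a keyword/BM25 ranking is Reciprocal Rank Fusion (RRF). It is shown here as one option, not as a required approach; the constant `k = 60` is a conventional default and the function name is hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["a", "b", "c"]  # ids ranked by embedding similarity
keyword = ["b", "d", "a"]   # ids ranked by keyword/BM25 score
fused = reciprocal_rank_fusion([semantic, keyword])
# fused == ["b", "a", "d", "c"]
```

RRF uses only ranks, so it avoids having to normalise similarity scores and BM25 scores onto a common scale.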
### Communication
- **Documentation**: Clear system explanation and usage examples
---
## 💡 Bonus Points
- **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.
---
## 🛠️ Suggested Tech Stack
**Core Components:**
- **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
- **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
- **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
- **Document Processing**: LangChain, LlamaIndex, or custom parsers
- **API**: FastAPI, Flask
- **Testing**: pytest, httpx
**Optional:**
- **Frontend**: Streamlit, Gradio, or rendered directly with FastAPI
---
## 🔧 Pipeline Requirements
### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)
**Must include:**
```python
# Key components for document ingestion:
# - Multi-format document parsing (PDF, MD, TXT)
# - Intelligent chunking with overlap handling
# - Metadata extraction and enrichment
# - Embedding generation and batch processing
# - Vector database indexing and storage
# - Progress tracking and error recovery
```
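The batch processing and error recovery items can be sketched together: embed chunks in fixed-size batches and record which chunk indices failed so they can be retried later. `embed_batch` is a hypothetical stand-in for the chosen embedding provider, and the batch size is an assumption.

```python
from typing import Callable, Sequence


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 32,
) -> tuple[dict[int, list[float]], list[int]]:
    """Embed chunks in batches; return vectors by chunk index plus failed indices."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: dict[int, list[float]] = {}
    failed: list[int] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        try:
            results = embed_batch(batch)
        except Exception:
            # Keep going; the failed indices can be retried in a later pass.
            failed.extend(range(start, start + len(batch)))
            continue
        for offset, vector in enumerate(results):
            vectors[start + offset] = vector
    return vectors, failed


def fake_embed_batch(batch: Sequence[str]) -> list[list[float]]:
    """Demo provider that fails whenever a batch contains the text 'bad'."""
    if "bad" in batch:
        raise RuntimeError("simulated provider error")
    return [[float(len(text))] for text in batch]


vectors, failed = embed_all(["a", "b", "bad", "d", "e"], fake_embed_batch, batch_size=2)
```

Catching the broad `Exception` is a simplification; production code would likely catch provider-specific errors and log them.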
### Query Pipeline (`pipelines/query_pipeline.py`)
**Must include:**
```python
# Key components for question answering:
# - Query preprocessing and classification
# - Semantic and hybrid retrieval
# - Context ranking and selection
# - LLM prompt construction
# - Answer generation with citations
# - Response post-processing and validation
```
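As an illustration of the prompt construction and citation steps, the retrieved chunks can be numbered in the prompt so the model can cite them as `[n]`. The chunk fields `source` and `text`, the prompt wording, and the `build_prompt` name are all assumptions made for this sketch.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt whose context passages are numbered for [n]-style citations."""
    context_lines = [
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return (
        "Answer the question using only the numbered context passages below. "
        "Cite passages as [n] after each claim. "
        "If the context is insufficient, say you do not know.\n\n"
        "Context:\n" + "\n".join(context_lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How do you handle missing data in pandas?",
    [{"source": "pandas-missing-data", "text": "Use dropna() or fillna()."}],
)
```

Because the passage numbers map back to known sources, the post-processing step can turn each `[n]` in the answer into a full citation.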
## ⚡ Success Indicators
- **RAG System Works**: End-to-end question answering with proper citations
- **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
- **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
- **Production Ready**: Robust error handling, logging, and API design
- **Clear Architecture**: Well-designed, modular system with proper separation of concerns
- **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis
---
## 🚨 Technical Challenges to Address
- **Context Window Limits**: How to handle long documents and conversations
- **Retrieval Quality**: Balancing precision vs recall in document chunks
- **Answer Attribution**: Proper source citation and confidence scoring
- **Cost Optimization**: Efficient use of LLM APIs and embedding generation
- **Latency Optimization**: Fast response times for interactive usage
- **Content Diversity**: Handling different document types and structures
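
For the context window challenge listed above, one simple strategy is to keep the highest-scoring chunks that fit within a token budget. The sketch below estimates tokens by whitespace splitting, which is a rough assumption; a real system would use the tokenizer of the chosen LLM. The function name is hypothetical.

```python
def select_context(chunks: list[tuple[float, str]], max_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring chunks whose total size fits max_tokens."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda chunk: chunk[0], reverse=True):
        cost = len(text.split())  # crude whitespace-based token estimate
        if used + cost <= max_tokens:
            selected.append(text)
            used += cost
    return selected


scored_chunks = [(0.9, "a b c"), (0.8, "d e f g"), (0.5, "h")]
context = select_context(scored_chunks, max_tokens=4)
# context == ["a b c", "h"]
```

The greedy pass skips a chunk that does not fit and still considers smaller, lower-scoring ones, which uses the budget more fully than stopping at the first overflow.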