# ML Engineer Assessment: **DocuChat** - RAG Q&A System

## Scenario

You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.

## Document Sources Provided

### **Option 1: Python Data Science Documentation (Recommended)**

- **Sources**:
  - Python Data Science Handbook (full text in Jupyter Notebooks)
  - Scikit-learn documentation PDFs
  - NumPy, Matplotlib, and Pandas tutorials
- **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
- **Domain**: Data science, machine learning, Python libraries

### **Option 2: API Documentation Collection**

- **Sources**:
  - REST API documentation from major services
  - OpenAPI specifications and examples
  - Developer guides and integration tutorials
- **Formats**: JSON, Markdown, HTML converted to text
- **Domain**: Software development, APIs, integrations

### **Test Questions Provided**

- **File**: `data/test_questions.json`
- **Size**: 50 carefully crafted Q&A pairs
- **Examples**:

```json
{
  "question": "How do you handle missing data in pandas?",
  "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
  "source_sections": ["pandas-missing-data", "data-cleaning"]
}
```
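A minimal sketch of loading and validating records shaped like the example above. It assumes the file holds either a single object or a JSON list of such objects (the brief only shows one record), and the names `QAPair` and `parse_test_questions` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

# Keys taken from the example record in the brief.
REQUIRED_KEYS = ("question", "expected_answer", "source_sections")


@dataclass(frozen=True)
class QAPair:
    """One evaluation record (hypothetical container name)."""
    question: str
    expected_answer: str
    source_sections: List[str]


def parse_test_questions(raw: str) -> List[QAPair]:
    """Parse a JSON string holding one record or a list of records."""
    data = json.loads(raw)
    # Assumption: a top-level list of records; a single object is also accepted.
    records = data if isinstance(data, list) else [data]
    parsed = []
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
        parsed.append(
            QAPair(
                question=record["question"],
                expected_answer=record["expected_answer"],
                source_sections=list(record["source_sections"]),
            )
        )
    return parsed
```

To read the real file, something like `parse_test_questions(Path("data/test_questions.json").read_text(encoding="utf-8"))` would work, provided the file matches the assumed layout.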
## 🎯 Your Mission (2-4 Days)

Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.

---

## 🔧 Technical Requirements

### Core Implementation (Must Have)

1. **Document Processing Pipeline**
   - Multi-format document parsing (PDF, MD, TXT)
   - Intelligent text chunking with overlap handling
   - Metadata extraction (document type, section, timestamps)
   - Handling of tables, code blocks, and structured content

2. **Vector Database & Retrieval**
   - Choose and implement a vector database (Pinecone, Weaviate, Chroma, or FAISS)
   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
   - Hybrid search (semantic + keyword/BM25)
   - Retrieval optimization and re-ranking (where necessary)

3. **LLM Integration & Generation**
   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
   - Context-aware prompt engineering
   - Source citation and attribution
   - Answer quality validation and filtering

4. **RAG Orchestration**
   - End-to-end query processing pipeline
   - Context window management for long documents
   - Multi-step reasoning for complex questions
   - Confidence scoring and uncertainty handling

5. **Evaluation & Metrics**
   - Human-evaluation framework for answer quality

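The hybrid search requirement needs some rule for merging the semantic and keyword (BM25) result lists. The brief does not prescribe one; the sketch below assumes reciprocal rank fusion, a common choice that only needs each retriever's ranked list of chunk IDs. The function name and the chunk IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks
    ranked highly by several retrievers rise to the top. k = 60 is a
    commonly used default, not a requirement of the brief.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


semantic_hits = ["c1", "c2", "c3"]  # hypothetical vector-search ranking
keyword_hits = ["c3", "c1", "c4"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Because the fusion works on ranks instead of raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.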
### Advanced Features (Nice to Have)

- Query classification and routing
- Conversational memory and follow-up handling
- Real-time document ingestion pipeline
- A/B testing framework for different retrieval strategies
- Cost optimization and caching strategies

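As one illustration of the caching item, the sketch below wraps an answer function with an in-memory cache keyed on the normalized query, so repeated questions do not trigger a second LLM call. `AnswerCache` and `answer_fn` are hypothetical names, and a production system would likely want expiry and invalidation when documents are re-ingested.

```python
import hashlib
from typing import Callable, Dict


class AnswerCache:
    """In-memory cache keyed by a hash of the normalized query text."""

    def __init__(self, answer_fn: Callable[[str], str]) -> None:
        self._answer_fn = answer_fn  # the expensive RAG call being wrapped
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivially different
        # spellings of the same question share one cache entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._answer_fn(query)
        self._store[key] = result
        return result
```

Exact-match caching only helps with repeated questions; matching paraphrases would need a semantic cache built on embeddings, which is outside this sketch.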
## 📋 Deliverables

### 1. Code Structure (Clean & Modular)

### 2. Documentation & Notebooks

- **README.md**: Architecture overview, setup instructions, API usage
- **Jupyter Notebooks**:
  - Document analysis and chunking strategy exploration
  - Embedding model comparison and retrieval experiments
  - RAG pipeline evaluation and optimization insights
  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
- **API Documentation**: FastAPI auto-generated docs with examples
- **System Architecture**: Diagram showing component interactions

### 3. Executable Pipelines

**Document Ingestion Pipeline:**

```bash
python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
```

**Query Pipeline:**

```bash
python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
```

**Evaluation Pipeline:**

```bash
python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
```
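The commands above share one flag style. A sketch of how the ingestion entry point might parse its flags with `argparse` follows; the flag names come from the ingestion command, while the defaults, the `choices` list, and the help strings are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser for the ingestion pipeline."""
    parser = argparse.ArgumentParser(
        description="Ingest documents into a vector database."
    )
    parser.add_argument(
        "--docs_path", required=True,
        help="Directory containing the source documents",
    )
    parser.add_argument(
        "--vector_db", default="chroma",  # assumed default
        choices=["chroma", "pinecone", "weaviate", "faiss"],
        help="Vector database backend",
    )
    parser.add_argument(
        "--embedding_model", default="all-MiniLM-L6-v2",  # assumed default
        help="Embedding model name",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv; passing None makes argparse read sys.argv."""
    return build_parser().parse_args(argv)
```

Accepting an explicit `argv` list keeps the parser easy to exercise from pytest without touching `sys.argv`.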
### 4. REST API (Required)

```text
# API endpoints for production usage
POST /ingest - Upload and process new documents
POST /query - Ask questions and get answers with sources
GET /documents - List indexed documents
POST /evaluate - Run evaluation on test questions
GET /health - System health check
```
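The brief lists the endpoints but not their payloads. The sketch below shows one possible shape for the `POST /query` request and response, written with standard-library dataclasses so it stays framework-agnostic; in a FastAPI service these would typically become Pydantic models. All field names here are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class QueryRequest:
    """Assumed body of POST /query."""
    query: str
    top_k: int = 5

    def __post_init__(self) -> None:
        # Basic validation, standing in for what a schema library would do.
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")


@dataclass
class SourceCitation:
    """One cited source (hypothetical fields)."""
    document: str
    section: str


@dataclass
class QueryResponse:
    """Assumed response of POST /query: an answer plus its sources."""
    answer: str
    sources: List[SourceCitation] = field(default_factory=list)


response = QueryResponse(
    answer="Use dropna(), fillna(), or interpolate().",
    sources=[SourceCitation(document="pandas-tutorial", section="pandas-missing-data")],
)
payload = asdict(response)  # plain dict, ready for JSON serialization
```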

---

## 🎯 Evaluation Criteria

### Technical Skills

- **RAG Architecture**: Proper component design and integration
- **Vector DB Implementation**: Efficient storage, indexing, and retrieval
- **LLM Integration**: Effective prompt engineering and API usage
- **Code Quality**: Clean, modular, well-documented, testable code
- **Performance**: Response time optimization and resource efficiency

### System Design

- **Scalability**: Architecture that can handle growing document collections
- **Configurability**: Easy to swap embedding models, LLMs, vector DBs
- **Error Handling**: Robust handling of failures and edge cases
- **API Design**: Well-designed REST endpoints with proper validation
- **Production Readiness**: Monitoring, logging, health checks

### Problem Solving

- **Chunking Strategy**: Intelligent document segmentation approach
- **Retrieval Optimization**: Hybrid search, re-ranking, context management
- **Answer Quality**: Handling of complex questions, citations, uncertainty
- **Evaluation Design**: Comprehensive metrics and testing framework

### Communication

- **Documentation**: Clear system explanation and usage examples

---

## 💡 Bonus Points

- **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.

---

## 🛠️ Suggested Tech Stack

**Core Components:**

- **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
- **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
- **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
- **Document Processing**: LangChain, LlamaIndex, or custom parsers
- **API**: FastAPI, Flask
- **Testing**: pytest, httpx

**Optional:**

- **Frontend**: Streamlit, Gradio, or render directly with FastAPI

---

## 🔧 Pipeline Requirements

### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)

**Must include:**

```python
# Key components for document ingestion:
# - Multi-format document parsing (PDF, MD, TXT)
# - Intelligent chunking with overlap handling
# - Metadata extraction and enrichment
# - Embedding generation and batch processing
# - Vector database indexing and storage
# - Progress tracking and error recovery
```
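To make "chunking with overlap" concrete, here is a minimal sketch of a fixed-size sliding window over words. It is only a baseline: the function name, the word-based splitting, and the default sizes are assumptions, and an "intelligent" chunker would also respect headings, code blocks, and tables.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words, so a sentence that
    straddles a boundary still appears whole in at least one chunk
    when it is shorter than the overlap.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end, to avoid emitting
        # a trailing chunk that only repeats the overlap.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
sample_chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

With ten words, a window of four, and an overlap of one, the window advances three words at a time, so each chunk starts with the last word of the previous one.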

### Query Pipeline (`pipelines/query_pipeline.py`)

**Must include:**

```python
# Key components for question answering:
# - Query preprocessing and classification
# - Semantic and hybrid retrieval
# - Context ranking and selection
# - LLM prompt construction
# - Answer generation with citations
# - Response post-processing and validation
```
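One way the prompt construction and citation steps might fit together is to number each retrieved chunk and ask the model to cite by number. The sketch below assumes that approach; the prompt wording, `RetrievedChunk`, and `build_prompt` are illustrative, not part of the brief.

```python
from typing import List, NamedTuple


class RetrievedChunk(NamedTuple):
    """A retrieved passage with the section it came from (hypothetical type)."""
    source: str
    text: str


def build_prompt(question: str, chunks: List[RetrievedChunk]) -> str:
    """Build a grounded prompt whose passages are numbered for citation."""
    context_lines = [
        f"[{number}] ({chunk.source}) {chunk.text}"
        for number, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered context passages below.\n"
        "Cite passages by their number, for example [1]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How do you handle missing data in pandas?",
    [RetrievedChunk("pandas-missing-data", "Use dropna() or fillna().")],
)
```

Because the numbers map back to `chunks` by position, post-processing can turn each `[n]` in the model's answer into the corresponding source section.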
## ⚡ Success Indicators

- **RAG System Works**: End-to-end question answering with proper citations
- **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
- **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
- **Production Ready**: Robust error handling, logging, and API design
- **Clear Architecture**: Well-designed, modular system with proper separation of concerns
- **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis

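The brief does not define how the human evaluation score is computed. As one possible reading, the sketch below assumes each answer is rated by one or more reviewers on a 1-to-5 scale and reports the mean rating as a percentage, which can then be compared against the 80% target. The scale, the function name, and the sample ratings are all assumptions.

```python
from statistics import mean
from typing import Dict, List


def human_eval_score(ratings: Dict[str, List[int]], max_rating: int = 5) -> float:
    """Return the mean per-question rating as a percentage of max_rating.

    `ratings` maps a question ID to the scores reviewers gave its answer.
    Averaging per question first keeps heavily reviewed questions from
    dominating the overall score.
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    per_question = [mean(scores) for scores in ratings.values()]
    return 100.0 * mean(per_question) / max_rating


# Hypothetical ratings for two questions, two reviewers each.
score = human_eval_score({"q1": [5, 4], "q2": [4, 4]})
meets_target = score > 80.0
```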

---

## 🚨 Technical Challenges to Address

- **Context Window Limits**: How to handle long documents and conversations
- **Retrieval Quality**: Balancing precision vs recall in document chunks
- **Answer Attribution**: Proper source citation and confidence scoring
- **Cost Optimization**: Efficient use of LLM APIs and embedding generation
- **Latency Optimization**: Fast response times for interactive usage
- **Content Diversity**: Handling different document types and structures