# ML Engineer Assessment: **DocuChat** - RAG Q&A System

## Scenario

You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.

## Document Sources Provided

### **Option 1: Python Data Science Documentation (Recommended)**
- **Sources**:
  - Python Data Science Handbook (full text in Jupyter Notebooks)
  - Scikit-learn documentation PDFs
  - NumPy, Matplotlib, and Pandas tutorials
- **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown, etc.
- **Domain**: Data science, machine learning, Python libraries

### **Option 2: API Documentation Collection**
- **Sources**:
  - REST API documentation from major services
  - OpenAPI specifications and examples
  - Developer guides and integration tutorials
- **Formats**: JSON, Markdown, HTML converted to text
- **Domain**: Software development, APIs, integrations

### **Test Questions Provided**
- **File**: `data/test_questions.json`
- **Size**: 50 carefully crafted Q&A pairs
- **Examples**:
  ```json
  {
    "question": "How do you handle missing data in pandas?",
    "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
    "source_sections": ["pandas-missing-data", "data-cleaning"]
  }
  ```
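
Before running any evaluation, it may help to check that each record in the test file has the three fields shown in the example above. The following is a minimal sketch; the `QAPair` class and `parse_test_questions` function are hypothetical names, and the assumption that the file holds either one record or a list of records is not confirmed by this document.

```python
import json
from dataclasses import dataclass

# Field names taken from the example record above.
REQUIRED_KEYS = ("question", "expected_answer", "source_sections")


@dataclass(frozen=True)
class QAPair:
    """One test question (hypothetical container, not part of the brief)."""
    question: str
    expected_answer: str
    source_sections: tuple


def parse_test_questions(raw: str) -> list:
    """Parse a JSON string holding one record or a list of records."""
    data = json.loads(raw)
    # Assumption: the file may contain a single object or an array of objects.
    records = data if isinstance(data, list) else [data]
    parsed = []
    for i, rec in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
        parsed.append(
            QAPair(
                question=rec["question"],
                expected_answer=rec["expected_answer"],
                source_sections=tuple(rec["source_sections"]),
            )
        )
    return parsed
```

In practice the string would come from reading `data/test_questions.json`; failing fast on a malformed record keeps schema problems from surfacing later as misleading evaluation scores.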

## 🎯 Your Mission (2-4 Days)
Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.

---

## 🔧 Technical Requirements

### Core Implementation (Must Have)
1. **Document Processing Pipeline**
   - Multi-format document parsing (PDF, MD, TXT)
   - Intelligent text chunking with overlap handling
   - Metadata extraction (document type, section, timestamps)
   - Handling of tables, code blocks, and structured content

2. **Vector Database & Retrieval**
   - Choose and implement a vector database (Pinecone, Weaviate, Chroma, or FAISS)
   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
   - Hybrid search (semantic + keyword/BM25)
   - Retrieval optimization and re-ranking (where necessary)

3. **LLM Integration & Generation**
   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
   - Context-aware prompt engineering
   - Source citation and attribution
   - Answer quality validation and filtering

4. **RAG Orchestration**
   - End-to-end query processing pipeline
   - Context window management for long documents
   - Multi-step reasoning for complex questions
   - Confidence scoring and uncertainty handling

5. **Evaluation & Metrics**
   - Human-evaluation framework for answer quality
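
Requirement 2 asks for hybrid search (semantic + keyword/BM25) but leaves the fusion method open. One common option is reciprocal rank fusion (RRF), which combines ranked lists using only rank positions, so the two score scales never need to be normalized against each other. The sketch below assumes each retriever already returns chunk IDs ordered best-first; the function name, the constant `k=60`, and the chunk IDs are illustrative choices, not something this brief prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of chunk IDs into one ranking.

    Each chunk receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Larger k flattens the gap between ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical outputs of a vector search and a BM25 search.
semantic_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the behaviour hybrid search is usually after. A weighted sum of normalized scores is an alternative when the score magnitudes themselves carry useful information.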

### Advanced Features (Nice to Have)
- Query classification and routing
- Conversational memory and follow-up handling
- Real-time document ingestion pipeline
- A/B testing framework for different retrieval strategies
- Cost optimization and caching strategies

## 📋 Deliverables

### 1. Code Structure (Clean & Modular)

### 2. Documentation & Notebooks
- **README.md**: Architecture overview, setup instructions, API usage
- **Jupyter Notebooks**:
  - Document analysis and chunking strategy exploration
  - Embedding model comparison and retrieval experiments
  - RAG pipeline evaluation and optimization insights
  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
- **API Documentation**: FastAPI auto-generated docs with examples
- **System Architecture**: Diagram showing component interactions

### 3. Executable Pipelines
**Document Ingestion Pipeline:**
```bash
python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
```
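
The flags in the command above could be wired up with `argparse` from the standard library. This is only a sketch of the command-line surface: the flag names come from the command shown, while the defaults, the `choices` list, and the `build_parser` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the ingestion command above."""
    parser = argparse.ArgumentParser(
        description="Ingest documents into a vector database."
    )
    parser.add_argument(
        "--docs_path", required=True,
        help="Directory containing the source documents",
    )
    # Assumption: the accepted values follow the vector DBs named in this brief.
    parser.add_argument(
        "--vector_db", default="chroma",
        choices=["chroma", "faiss", "pinecone", "weaviate"],
        help="Vector database backend",
    )
    # Assumption: the default simply repeats the model from the example command.
    parser.add_argument(
        "--embedding_model", default="all-MiniLM-L6-v2",
        help="Embedding model name",
    )
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(
    ["--docs_path", "data/documents/", "--vector_db", "chroma"]
)
```

Restricting `--vector_db` with `choices` rejects typos at startup instead of partway through a long ingestion run.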

**Query Pipeline:**
```bash
python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
```

**Evaluation Pipeline:**
```bash
python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
```
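
Alongside human evaluation of answers, the `source_sections` field in the test questions allows a simple automatic check of retrieval quality. The sketch below computes a hit rate: the share of questions for which at least one expected section appears among the top retrieved results. The function name and the `retrieve` callable are hypothetical, and it assumes the retriever returns section identifiers comparable to those in the test file.

```python
def retrieval_hit_rate(test_cases, retrieve, top_k=5):
    """Fraction of questions with at least one expected section in the top_k.

    test_cases: list of dicts with "question" and "source_sections" keys.
    retrieve:   callable mapping a question string to a best-first list
                of section IDs (a stand-in for the real retriever).
    """
    if not test_cases:
        return 0.0
    hits = 0
    for case in test_cases:
        retrieved = set(retrieve(case["question"])[:top_k])
        if retrieved & set(case["source_sections"]):
            hits += 1
    return hits / len(test_cases)


def stub_retriever(question):
    """Stand-in retriever that always returns the same two sections."""
    return ["pandas-missing-data", "data-cleaning"]


cases = [
    {"question": "How do you handle missing data in pandas?",
     "source_sections": ["pandas-missing-data", "data-cleaning"]},
    {"question": "A question about another topic",
     "source_sections": ["some-other-section"]},
]
rate = retrieval_hit_rate(cases, stub_retriever)
```

A hit rate says nothing about answer wording, so it complements rather than replaces the human evaluation score; it is mainly useful for comparing chunking or retrieval settings quickly.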

### 4. REST API (Required)
```python
# API endpoints for production usage
# POST /ingest     - Upload and process new documents
# POST /query      - Ask questions and get answers with sources
# GET  /documents  - List indexed documents
# POST /evaluate   - Run evaluation on test questions
# GET  /health     - System health check
```

---

## 🎯 Evaluation Criteria

### Technical Skills
- **RAG Architecture**: Proper component design and integration
- **Vector DB Implementation**: Efficient storage, indexing, and retrieval
- **LLM Integration**: Effective prompt engineering and API usage
- **Code Quality**: Clean, modular, well-documented, testable code
- **Performance**: Response time optimization and resource efficiency

### System Design
- **Scalability**: Architecture that can handle growing document collections
- **Configurability**: Easy to swap embedding models, LLMs, vector DBs
- **Error Handling**: Robust handling of failures and edge cases
- **API Design**: Well-designed REST endpoints with proper validation
- **Production Readiness**: Monitoring, logging, health checks

### Problem Solving
- **Chunking Strategy**: Intelligent document segmentation approach
- **Retrieval Optimization**: Hybrid search, re-ranking, context management
- **Answer Quality**: Handling of complex questions, citations, uncertainty
- **Evaluation Design**: Comprehensive metrics and testing framework

### Communication
- **Documentation**: Clear system explanation and usage examples

---

## 💡 Bonus Points

- **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.

---

## 🛠️ Suggested Tech Stack
**Core Components:**
- **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
- **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
- **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
- **Document Processing**: LangChain, LlamaIndex, or custom parsers
- **API**: FastAPI, Flask
- **Testing**: pytest, httpx

**Optional:**
- **Frontend**: Streamlit, Gradio, or pages rendered directly with FastAPI

---

## 🔧 Pipeline Requirements

### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)
**Must include:**
```python
# Key components for document ingestion:
# - Multi-format document parsing (PDF, MD, TXT)
# - Intelligent chunking with overlap handling
# - Metadata extraction and enrichment
# - Embedding generation and batch processing
# - Vector database indexing and storage
# - Progress tracking and error recovery
```
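
As a baseline for the "chunking with overlap" component, a sliding window over words is probably the simplest approach. The sketch below is one possible starting point rather than the expected solution; the function name and the default sizes are assumptions, and a stronger submission would likely split on headings, paragraphs, or code-block boundaries before falling back to a fixed window.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of chunk_size that share overlap words.

    Consecutive chunks start (chunk_size - overlap) words apart, so the
    last `overlap` words of one chunk reappear at the start of the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(words):
            break
    return chunks


sample_text = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_words(sample_text, chunk_size=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing and embedding some text twice.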

### Query Pipeline (`pipelines/query_pipeline.py`)
**Must include:**
```python
# Key components for question answering:
# - Query preprocessing and classification
# - Semantic and hybrid retrieval
# - Context ranking and selection
# - LLM prompt construction
# - Answer generation with citations
# - Response post-processing and validation
```
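
The "context ranking and selection" and "LLM prompt construction" components can be illustrated together: pick the best-ranked chunks that fit a context budget, number them, and ask the model to cite those numbers. The sketch below is an assumption-laden example, not a prescribed design: it uses word count as a rough stand-in for tokens, and the function name, prompt wording, and budget are all hypothetical.

```python
def build_prompt(question, ranked_chunks, max_context_words=300):
    """Select chunks within a word budget and build a citation-ready prompt.

    ranked_chunks: list of (source_id, text) tuples, best-first.
    Returns the prompt string and the list of source IDs actually used.
    """
    selected = []
    used = 0
    for source_id, text in ranked_chunks:
        n_words = len(text.split())  # crude proxy for a real token count
        if used + n_words > max_context_words:
            continue  # skip a chunk that does not fit; a later one still might
        selected.append((source_id, text))
        used += n_words
    context = "\n\n".join(
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(selected, start=1)
    )
    prompt = (
        "Answer the question using only the numbered context passages. "
        "Cite passages as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return prompt, [source_id for source_id, _ in selected]


example_chunks = [
    ("pandas-missing-data", "Use dropna to remove rows with missing values."),
    ("long-section", "filler " * 50),
    ("data-cleaning", "fillna replaces missing values with a given value."),
]
example_prompt, example_sources = build_prompt(
    "How do you handle missing data in pandas?", example_chunks,
    max_context_words=20,
)
```

Returning the selected source IDs alongside the prompt lets the response layer map each `[n]` in the model's answer back to a document section. A production version would count tokens with the tokenizer of the chosen model instead of splitting on whitespace.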

## ⚡ Success Indicators
- **RAG System Works**: End-to-end question answering with proper citations
- **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
- **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
- **Production Ready**: Robust error handling, logging, and API design
- **Clear Architecture**: Well-designed, modular system with proper separation of concerns
- **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis

---

## 🚨 Technical Challenges to Address
- **Context Window Limits**: How to handle long documents and conversations
- **Retrieval Quality**: Balancing precision vs recall in document chunks
- **Answer Attribution**: Proper source citation and confidence scoring
- **Cost Optimization**: Efficient use of LLM APIs and embedding generation
- **Latency Optimization**: Fast response times for interactive usage
- **Content Diversity**: Handling different document types and structures