From f0f4f0d376616e8c3f06e8da0c18a5bdbbe78a87 Mon Sep 17 00:00:00 2001
From: OwusuBlessing
Date: Fri, 25 Jul 2025 21:33:37 +0100
Subject: [PATCH] first commit

---
 README.md | 213 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..f8adde6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,213 @@
+# ML Engineer Assessment: **DocuChat** - RAG Q&A System
+## Scenario
+You are tasked with building **DocuChat**, an **Enterprise Knowledge Assistant** for a software company. The system should allow employees to ask questions about internal documentation (APIs, policies, technical specs) and get accurate, contextual answers with source citations.
+
+## Document Sources Provided
+
+### **Option 1: Python Data Science Documentation (Recommended)**
+- **Sources**:
+  - Python Data Science Handbook (full text in Jupyter Notebooks)
+  - Scikit-learn documentation PDFs
+  - NumPy, Matplotlib and Pandas tutorials
+  -
+- **Formats**: Jupyter notebooks (.ipynb), PDF, Markdown etc.
+- **Domain**: Data science, machine learning, Python libraries
+
+### **Option 2: API Documentation Collection**
+- **Sources**:
+  - REST API documentation from major services
+  - OpenAPI specifications and examples
+  - Developer guides and integration tutorials
+- **Formats**: JSON, Markdown, HTML converted to text
+- **Domain**: Software development, APIs, integrations
+
+### **Test Questions Provided**
+- **File**: `data/test_questions.json`
+- **Size**: 50 carefully crafted Q&A pairs
+- **Examples**:
+  ```json
+  {
+    "question": "How do you handle missing data in pandas?",
+    "expected_answer": "Use methods like dropna(), fillna(), or interpolate()",
+    "source_sections": ["pandas-missing-data", "data-cleaning"]
+  }
+  ```
+
+## 🎯 Your Mission (2-4 Days)
+Build a production-ready RAG (Retrieval-Augmented Generation) system that demonstrates advanced ML engineering skills in LLM applications, vector databases, and information retrieval.
+
+---
+
+## 🔧 Technical Requirements
+
+### Core Implementation (Must Have)
+1. **Document Processing Pipeline**
+   - Multi-format document parsing (PDF, MD, TXT)
+   - Intelligent text chunking with overlap handling
+   - Metadata extraction (document type, section, timestamps)
+   - Handle tables, code blocks, and structured content
+
+2. **Vector Database & Retrieval**
+   - Choose and implement a vector database (Pinecone, Weaviate, Chroma, or FAISS)
+   - Semantic embedding generation (OpenAI, Sentence-Transformers, or Cohere)
+   - Hybrid search (semantic + keyword/BM25)
+   - Retrieval optimization and re-ranking (where necessary)
+
+3. **LLM Integration & Generation**
+   - LLM API integration (OpenAI, Anthropic, Groq, or local models)
+   - Context-aware prompt engineering
+   - Source citation and attribution
+   - Answer quality validation and filtering
+
+4. **RAG Orchestration**
+   - End-to-end query processing pipeline
+   - Context window management for long documents
+   - Multi-step reasoning for complex questions
+   - Confidence scoring and uncertainty handling
+
+5. **Evaluation & Metrics**
+   - Human-evaluation framework for answer quality
+
+### Advanced Features (Nice to Have)
+- Query classification and routing
+- Conversational memory and follow-up handling
+- Real-time document ingestion pipeline
+- A/B testing framework for different retrieval strategies
+- Cost optimization and caching strategies
+
+## 📋 Deliverables
+
+### 1. Code Structure (Clean & Modular)
+
+
+### 2. Documentation & Notebooks
+- **README.md**: Architecture overview, setup instructions, API usage
+- **Jupyter Notebooks**:
+  - Document analysis and chunking strategy exploration
+  - Embedding model comparison and retrieval experiments
+  - RAG pipeline evaluation and optimization insights
+  - Performance analysis with different LLM configurations (if multiple LLM providers are compared)
+- **API Documentation**: FastAPI auto-generated docs with examples
+- **System Architecture**: Diagram showing component interactions
+
+### 3. Executable Pipelines
+**Document Ingestion Pipeline:**
+```bash
+python pipelines/ingestion_pipeline.py --docs_path data/documents/ --vector_db chroma --embedding_model all-MiniLM-L6-v2
+```
+
+**Query Pipeline:**
+```bash
+python pipelines/query_pipeline.py --query "How do I authenticate with the GraphQL API?" --top_k 5 --llm_model gpt-3.5-turbo
+```
+
+**Evaluation Pipeline:**
+```bash
+python pipelines/evaluation_pipeline.py --test_file data/test_questions.json --output_dir results/evaluation/
+```
+
+### 4. REST API (Required)
+```text
+# API endpoints for production usage
+POST /ingest     - Upload and process new documents
+POST /query      - Ask questions and get answers with sources
+GET  /documents  - List indexed documents
+POST /evaluate   - Run evaluation on test questions
+GET  /health     - System health check
+```
+
+---
+
+## 🎯 Evaluation Criteria
+
+### Technical Skills
+- **RAG Architecture**: Proper component design and integration
+- **Vector DB Implementation**: Efficient storage, indexing, and retrieval
+- **LLM Integration**: Effective prompt engineering and API usage
+- **Code Quality**: Clean, modular, well-documented, testable code
+- **Performance**: Response time optimization and resource efficiency
+
+### System Design
+- **Scalability**: Architecture that can handle growing document collections
+- **Configurability**: Easy to swap embedding models, LLMs, vector DBs
+- **Error Handling**: Robust handling of failures and edge cases
+- **API Design**: Well-designed REST endpoints with proper validation
+- **Production Readiness**: Monitoring, logging, health checks
+
+### Problem Solving
+- **Chunking Strategy**: Intelligent document segmentation approach
+- **Retrieval Optimization**: Hybrid search, re-ranking, context management
+- **Answer Quality**: Handling of complex questions, citations, uncertainty
+- **Evaluation Design**: Comprehensive metrics and testing framework
+
+### Communication
+- **Documentation**: Clear system explanation and usage examples
+
+
+---
+
+## 💡 Bonus Points
+
+- **Advanced Retrieval**: Query expansion, hypothetical document embeddings, query routing, etc.
+
+
+---
+
+## 🛠️ Suggested Tech Stack
+**Core Components:**
+- **Vector DB**: Chroma, Pinecone, Weaviate, or FAISS
+- **Embeddings**: OpenAI, Sentence-Transformers, or Cohere
+- **LLM**: OpenAI GPT, Anthropic Claude, Groq, or local models (Llama 2, Mistral)
+- **Document Processing**: LangChain, LlamaIndex, or custom parsers
+- **API**: FastAPI or Flask
+- **Testing**: pytest, httpx
+
+**Optional:**
+- **Frontend**: Streamlit, Gradio, or pages rendered directly with FastAPI
+
+---
+
+## 🔧 Pipeline Requirements
+
+### Ingestion Pipeline (`pipelines/ingestion_pipeline.py`)
+**Must include:**
+```text
+# Key components for document ingestion:
+- Multi-format document parsing (PDF, MD, TXT)
+- Intelligent chunking with overlap handling
+- Metadata extraction and enrichment
+- Embedding generation and batch processing
+- Vector database indexing and storage
+- Progress tracking and error recovery
+```
+
+### Query Pipeline (`pipelines/query_pipeline.py`)
+**Must include:**
+```text
+# Key components for question answering:
+- Query preprocessing and classification
+- Semantic and hybrid retrieval
+- Context ranking and selection
+- LLM prompt construction
+- Answer generation with citations
+- Response post-processing and validation
+```
+
+
+## ⚡ Success Indicators
+- **RAG System Works**: End-to-end question answering with proper citations
+- **All Pipelines Execute**: Ingestion, query, and evaluation pipelines run successfully
+- **High Answer Quality**: Relevant, accurate responses to test questions (>80% human evaluation score)
+- **Production Ready**: Robust error handling, logging, and API design
+- **Clear Architecture**: Well-designed, modular system with proper separation of concerns
+- **Comprehensive Evaluation**: Multiple metrics and thorough performance analysis
+
+---
+
+## 🚨 Technical Challenges to Address
+- **Context Window Limits**: How to handle long documents and conversations
+- **Retrieval Quality**: Balancing precision vs. recall in document chunks
+- **Answer Attribution**: Proper source citation and confidence scoring
+- **Cost Optimization**: Efficient use of LLM APIs and embedding generation
+- **Latency Optimization**: Fast response times for interactive usage
+- **Content Diversity**: Handling different document types and structures
\ No newline at end of file