2025-07-17 22:20:25 +01:00
# Mini SpecsComply Pro (SCP)
## Overview
Mini SpecsComply Pro (SCP) is a lightweight document compliance and validation tool that analyzes technical documents and verifies them against predefined standards and project-specific requirements. It uses Cohere models for embedding and result ranking and a Groq-hosted LLM for reasoning.
## Features
- **Document Analysis:** Automated analysis of technical documents for compliance verification
- **AI-Powered Processing:**
  - Groq LLM for deep reasoning and compliance analysis
  - Cohere for document embedding and result ranking
- **Advanced Standards Matching:**
  - Weighted matching algorithm to identify relevant standards
  - Section-based analysis for contextual understanding
  - Technical term recognition and keyword extraction
  - Relevance scoring system for accurate standard selection
- **Custom Standards Support:**
  - Upload and manage your own compliance standards
  - JSON-based standard definitions with flexible structure
- **Vector Database Support:**
  - Pinecone (default)
  - Weaviate (alternative)
- **RESTful API:** Built with FastAPI for easy integration
- **Real-time Processing:** Async support for efficient document handling
- **Structured Reports:** Detailed compliance feedback and recommendations with applied standards tracking
## Prerequisites
- Python 3.8 or higher
- pip or poetry for package management
- API keys for:
  - Groq
  - Cohere
  - Pinecone (if using Pinecone), or a Weaviate URL (if using Weaviate)
## Installation
1. Clone the repository:
```bash
git clone http://23.29.118.76:3000/task/ds_scp_task_solution.git
cd ds_scp_task_solution
```
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Create a `.env` file in the project root:
```env
# Required API Keys
GROQ_API_KEY=your_groq_api_key
COHERE_API_KEY=your_cohere_api_key
# Vector Database (Choose one)
# For Pinecone:
VECTOR_DB=pinecone
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_ENVIRONMENT=your_pinecone_environment  # e.g. us-east-1
PINECONE_INDEX_NAME=specscomply-documents
# Or for Weaviate:
# VECTOR_DB=weaviate
# WEAVIATE_URL=your_weaviate_url
# WEAVIATE_API_KEY=your_weaviate_api_key
# Optional Settings
APP_NAME="Mini SpecsComply Pro"
APP_VERSION="0.1.0"
DEBUG=False
```
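Before launching, it can help to confirm that the keys the chosen backend needs are actually set. The following is a minimal sketch, not the check that `launch.py` performs: the function name `missing_settings` is hypothetical, and treating `WEAVIATE_API_KEY` as optional is an assumption.

```python
import os

# Variable names come from the .env example above. Which ones are strictly
# required per backend is an assumption of this sketch.
ALWAYS_REQUIRED = ["GROQ_API_KEY", "COHERE_API_KEY"]
BACKEND_REQUIRED = {
    "pinecone": ["PINECONE_API_KEY", "PINECONE_ENVIRONMENT", "PINECONE_INDEX_NAME"],
    "weaviate": ["WEAVIATE_URL"],
}


def missing_settings(env):
    """Return the names of required settings that are unset or empty."""
    # Pinecone is documented as the default backend.
    backend = env.get("VECTOR_DB", "pinecone").strip().lower()
    if backend not in BACKEND_REQUIRED:
        raise ValueError(f"Unsupported VECTOR_DB value: {backend!r}")
    required = ALWAYS_REQUIRED + BACKEND_REQUIRED[backend]
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    print("Missing settings:", ", ".join(missing) if missing else "none")
```

The function takes any mapping, so it can be exercised with a plain dictionary as well as `os.environ`.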
## Running the Application
### Quick Start
```bash
python launch.py
```
This checks your environment setup and starts the application. Once it is running, open `http://localhost:8000` in your browser.
The API documentation is available at `http://localhost:8000/docs`.
## API Endpoints
- `POST /api/documents/upload` - Upload a document for analysis
- `GET /api/documents/{document_id}` - Get document status and results
- `POST /api/documents/{document_id}/resubmit` - Resubmit a document for re-analysis
- `GET /api/documents/{document_id}/analysis` - Get detailed compliance analysis
- `GET /api/standards` - List all available standards
- `POST /api/standards/upload` - Upload a custom standard definition
- `GET /api/standards/{standard_id}` - Get details of a specific standard
- `GET /api/health` - Health check endpoint
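The read-only endpoints can be called from Python with the standard library alone. This is a sketch under assumptions: the helper names `endpoint` and `get_json` are hypothetical, the base URL is the one from Quick Start, and the responses are assumed to be JSON (the response schema is not documented here).

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # address from the Quick Start section


def endpoint(path_template, **params):
    """Fill path parameters such as {document_id} and prepend the base URL."""
    quoted = {key: quote(str(value), safe="") for key, value in params.items()}
    return BASE_URL + path_template.format(**quoted)


def get_json(url, timeout=10):
    """Send a GET request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `get_json(endpoint("/api/health"))` queries the health check, and `get_json(endpoint("/api/documents/{document_id}/analysis", document_id="..."))` fetches an analysis for a document ID returned by the upload endpoint.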
## Configuration
The application can be configured through environment variables or the `.env` file. Key configuration options:
- `DEBUG`: Enable debug mode (default: False)
- `VECTOR_DB`: Choose vector database backend ("pinecone" or "weaviate")
- `EMBEDDING_MODEL`: Cohere embedding model (default: "embed-english-v3.0")
- `RERANKER_MODEL`: Cohere reranker model (default: "rerank-english-v2.0")
- `REASONING_MODEL`: Groq model (default: "llama-3.3-70b-versatile")
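The options and defaults listed above can be read into one typed object. The sketch below uses only the standard library; the `Settings` class and `load_settings` function are hypothetical names, and the project's own configuration code in `app/core/` may be structured differently.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    """Interpret common truthy strings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    debug: bool
    vector_db: str
    embedding_model: str
    reranker_model: str
    reasoning_model: str


def load_settings(env=None):
    """Build Settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        debug=_as_bool(env.get("DEBUG", "False")),
        vector_db=env.get("VECTOR_DB", "pinecone").lower(),
        embedding_model=env.get("EMBEDDING_MODEL", "embed-english-v3.0"),
        reranker_model=env.get("RERANKER_MODEL", "rerank-english-v2.0"),
        reasoning_model=env.get("REASONING_MODEL", "llama-3.3-70b-versatile"),
    )
```

Passing a dictionary instead of relying on `os.environ` keeps the loader easy to test.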
## Development
### Project Structure
```
mini-specscomply-pro/
├── app/
│   ├── api/          # API routes and endpoints
│   ├── core/         # Core configuration and models
│   └── services/     # Business logic services
├── Data/             # Sample data and documents
├── requirements.txt  # Project dependencies
├── run.py            # Application runner
├── launch.py         # Setup and launch script
├── .env              # Environment variables
├── .gitignore        # Git ignore file
└── README.md         # Project documentation
```
## Advanced Standards Matching
Mini SpecsComply Pro matches documents with relevant standards using a weighted relevance-scoring algorithm:
1. **Document Analysis**
   - Extracts sections and headings from the document
   - Identifies key technical terms and phrases
   - Recognizes standard references (e.g., "ISO-9001", "IEEE 829")
2. **Relevance Scoring**
   - Calculates weighted scores based on multiple factors:
     - Direct standard name matches (highest weight)
     - Keyword matches between document and standard
     - Section-specific matches (e.g., in References or Requirements sections)
     - Technical term matches
     - Requirement-specific matches
3. **Standard Selection**
   - Selects the most relevant standards based on score threshold
   - Applies these standards during compliance analysis
   - Displays applied standards in the compliance report
This approach ensures that the most appropriate standards are applied to each document, improving the accuracy and relevance of compliance analysis.
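The scoring and selection steps can be illustrated with a small sketch. It is not the project's matching code: the weights, the threshold, the `keywords` field, and the function names are all hypothetical, and only three of the listed factors (name, section, and keyword matches) are modeled.

```python
import re

# Hypothetical weights. The only ordering taken from the description above is
# that a direct standard-name match carries the highest weight.
WEIGHTS = {"name": 5.0, "section": 2.0, "keyword": 1.0}
PRIORITY_SECTIONS = {"references", "requirements"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_standard(sections, standard):
    """sections: {heading: body text}; standard: {"name": str, "keywords": [str]}."""
    name = standard["name"].lower()
    keywords = {keyword.lower() for keyword in standard.get("keywords", [])}
    score = 0.0
    for heading, body in sections.items():
        text = body.lower()
        if name in text:
            score += WEIGHTS["name"]
            # A name match inside References or Requirements earns a bonus.
            if heading.lower() in PRIORITY_SECTIONS:
                score += WEIGHTS["section"]
        score += WEIGHTS["keyword"] * len(keywords & tokenize(text))
    return score


def select_standards(sections, standards, threshold=3.0):
    """Return names of standards at or above the threshold, best score first."""
    scored = [(score_standard(sections, s), s["name"]) for s in standards]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]
```

For a document whose References section cites "ISO-9001:2015", a standard named `ISO-9001` collects the name weight and the section bonus, while a standard that is never mentioned stays below the threshold and is not selected.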
## Document and Standard Formats
### Compliance Documents
For best results, structure your compliance documents with clear sections and headings. The system performs better with well-organized documents that include:
1. **Clear Headings**: Use markdown-style headings (e.g., `# Section Title`) to organize content
2. **Introduction Section**: Provide context and purpose of the document
3. **Scope Section**: Define what the document covers and doesn't cover
4. **Requirements Sections**: Clearly state requirements using terms like "shall", "must", "should"
5. **References Section**: List relevant standards, specifications, or other documents
6. **Technical Details**: Include specific technical information relevant to compliance
Example document structure:
```markdown
# System Compliance Specification
## Introduction
This document specifies the compliance requirements for the XYZ system.
## Scope
This specification applies to all components of the XYZ system.
## Requirements
### Functional Requirements
1. The system shall process user input within 500ms.
2. The system must maintain data integrity during power failures.
### Security Requirements
1. All data transmissions shall be encrypted using AES-256.
2. User authentication must comply with NIST guidelines.
## References
- ISO-9001:2015 Quality Management Systems
- IEEE-829 Software Test Documentation
```
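To see why this structure helps, here is a rough sketch of how such a document could be split into sections, scanned for requirement statements, and searched for standard references. The regular expressions and function names are assumptions made for illustration and are not taken from the project's parser.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
MODAL = re.compile(r"\b(shall|must|should)\b", re.IGNORECASE)
# Hypothetical pattern covering references such as "ISO-9001" or "IEEE 829".
STANDARD_REF = re.compile(r"\b(ISO|IEEE|NIST)[- ]?(\d+)\b")


def split_sections(markdown_text):
    """Map each heading title to the non-empty lines that follow it."""
    sections, current = {}, None
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(2)
            sections[current] = []  # a repeated title overwrites the earlier one
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def requirement_lines(lines):
    """Keep the lines that contain a requirement keyword."""
    return [line for line in lines if MODAL.search(line)]


def standard_references(text):
    """Return normalized references such as 'ISO-9001', sorted and de-duplicated."""
    return sorted({f"{org}-{num}" for org, num in STANDARD_REF.findall(text)})
```

Applied to the example document above, `split_sections` yields one entry per heading, and `standard_references` finds `IEEE-829` and `ISO-9001` in the References section.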
### Custom Standard Definitions
Custom standards are defined in JSON format with the following structure:
```json
{
"name": "ISO-9001",
"description": "Quality Management System standard",
"requirements": [
{
"id": "ISO-9001-4.1",
"description": "The organization shall determine external and internal issues relevant to its purpose and strategic direction.",
"severity": "major"
},
{
"id": "ISO-9001-4.2",
"description": "The organization shall monitor and review information about these external and internal issues.",
"severity": "minor"
}
]
}
```
You can also define multiple standards in a single file:
```json
{
"standards": [
{
"name": "ISO-9001",
"description": "Quality Management System standard",
"requirements": [...]
},
{
"name": "IEEE-829",
"description": "Software Test Documentation standard",
"requirements": [...]
}
]
}
```
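Both layouts can be normalized to a list of standards before upload. The sketch below is a hypothetical pre-upload check: `parse_standards` is not a project function, and treating `name`, `description`, and `requirements` as mandatory is an assumption based on the examples above.

```python
import json

# Severity values documented for requirements in this README.
ALLOWED_SEVERITIES = {"critical", "major", "minor", "info"}


def parse_standards(raw_json):
    """Return a list of standard dicts from either supported file layout."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("Top-level JSON value must be an object")
    # Multi-standard files wrap the definitions in a "standards" array.
    standards = data["standards"] if "standards" in data else [data]
    for standard in standards:
        for field in ("name", "description", "requirements"):
            if field not in standard:
                raise ValueError(f"Standard is missing required field {field!r}")
        for requirement in standard["requirements"]:
            severity = requirement.get("severity")
            if severity not in ALLOWED_SEVERITIES:
                raise ValueError(
                    f"{requirement.get('id')}: invalid severity {severity!r}"
                )
    return standards
```

Running this on a file before calling `POST /api/standards/upload` surfaces malformed definitions early; the server may apply its own, different validation.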
Requirement severity levels:
- `critical`: Severe non-compliance that must be addressed immediately
- `major`: Significant issue that should be addressed soon
- `minor`: Less significant issue that should be addressed when convenient
- `info`: Informational note or suggestion
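When post-processing results, the four levels can be treated as an ordered scale. This is an illustrative sketch only: the shape of a finding (a dict with a `severity` key) and both function names are assumptions, not the report format the API returns.

```python
# Levels listed from most to least urgent, as described above.
SEVERITY_ORDER = ["critical", "major", "minor", "info"]
SEVERITY_RANK = {level: rank for rank, level in enumerate(SEVERITY_ORDER)}


def sort_by_severity(findings):
    """Return findings ordered from most to least urgent (stable sort)."""
    return sorted(findings, key=lambda finding: SEVERITY_RANK[finding["severity"]])


def worst_severity(findings):
    """Return the most urgent severity present, or None for an empty list."""
    if not findings:
        return None
    return min((f["severity"] for f in findings), key=SEVERITY_RANK.__getitem__)
```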
## Troubleshooting
Common issues and solutions:
1. **Missing API Keys**
   - Ensure all required API keys are set in your `.env` file
   - Check the API key format and validity
2. **Vector Database Connection**
   - Verify the vector database configuration
   - Ensure the selected database service is running and accessible
3. **Model Errors**
   - Check API quotas and limits
   - Verify model names in configuration
4. **Standards Not Being Applied**
   - Verify that standards have been uploaded correctly
   - Check the logs for standards matching information
   - Ensure document content includes relevant terminology for matching