# Mini SpecsComply Pro (SCP)

## Overview

Mini SpecsComply Pro (SCP) is a lightweight document compliance and validation tool that analyzes and verifies technical documents against predefined standards and project-specific requirements. It uses AI models for embedding, reasoning, and ranking to process documents quickly and accurately.

## Features

- **Document Analysis:** Automated analysis of technical documents for compliance verification
- **AI-Powered Processing:**
  - GROQ LLM for deep reasoning and compliance analysis
  - Cohere for document embedding and result ranking
- **Advanced Standards Matching:**
  - Sophisticated matching algorithm to identify relevant standards
  - Section-based analysis for contextual understanding
  - Technical term recognition and keyword extraction
  - Relevance scoring system for accurate standard selection
- **Custom Standards Support:**
  - Upload and manage your own compliance standards
  - JSON-based standard definitions with flexible structure
- **Vector Database Support:**
  - Pinecone (default)
  - Weaviate (alternative)
- **RESTful API:** Built with FastAPI for easy integration
- **Real-time Processing:** Async support for efficient document handling
- **Structured Reports:** Detailed compliance feedback and recommendations with applied standards tracking

## Prerequisites

- Python 3.8 or higher
- pip or poetry for package management
- API keys for:
  - GROQ
  - Cohere
  - Pinecone (if using Pinecone) or Weaviate URL (if using Weaviate)

## Installation

1. Clone the repository:
   ```bash
   git clone http://23.29.118.76:3000/task/ds_scp_task_solution.git
   cd ds_scp_task_solution
   ```

2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file in the project root:
   ```env
   # Required API Keys
   GROQ_API_KEY=your_groq_api_key
   COHERE_API_KEY=your_cohere_api_key

   # Vector Database (Choose one)
   # For Pinecone:
   VECTOR_DB=pinecone
   PINECONE_API_KEY=your_pinecone_api_key
   PINECONE_ENVIRONMENT=your_pinecone_environment  # e.g. us-east-1
   # Index names may contain only lowercase alphanumeric characters and hyphens
   PINECONE_INDEX_NAME=specscomply-documents

   # Or for Weaviate:
   # VECTOR_DB=weaviate
   # WEAVIATE_URL=your_weaviate_url
   # WEAVIATE_API_KEY=your_weaviate_api_key

   # Optional Settings
   APP_NAME="Mini SpecsComply Pro"
   APP_VERSION="0.1.0"
   DEBUG=False
   ```
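
Pinecone index names may contain only lowercase alphanumeric characters and hyphens, which is why the example uses `specscomply-documents` rather than an underscore-separated name. The sketch below shows one way to check or normalize a candidate name before writing it to `.env`. The helper names are hypothetical and not part of this project, and the rule that a name must start and end with an alphanumeric character is an assumption.

```python
import re

# Lowercase letters, digits, and hyphens only.
# Assumption: the name must also start and end with a letter or digit.
_INDEX_NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_valid_index_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return _INDEX_NAME_RE.fullmatch(name) is not None


def normalize_index_name(name: str) -> str:
    """Lowercase the name and replace runs of disallowed characters with a hyphen."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not cleaned:
        raise ValueError(f"cannot derive an index name from {name!r}")
    return cleaned
```

For example, `normalize_index_name("specscomply_documents")` yields `specscomply-documents`, which passes `is_valid_index_name`.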

## Running the Application

### Quick Start

```bash
python launch.py
```

This checks your environment setup and starts the application. Open `http://localhost:8000` in your browser.

The API will be available at:
- API Documentation: `http://localhost:8000/docs`

## API Endpoints

- `POST /api/documents/upload` - Upload a document for analysis
- `GET /api/documents/{document_id}` - Get document status and results
- `POST /api/documents/{document_id}/resubmit` - Resubmit a document for re-analysis
- `GET /api/documents/{document_id}/analysis` - Get detailed compliance analysis
- `GET /api/standards` - List all available standards
- `POST /api/standards/upload` - Upload a custom standard definition
- `GET /api/standards/{standard_id}` - Get details of a specific standard
- `GET /api/health` - Health check endpoint
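
A minimal client sketch for the read-only endpoints above, using only the Python standard library. The response schemas are not documented here, so the sketch assumes each `GET` endpoint returns JSON and does not interpret the payload. The helper names are hypothetical. The upload endpoints are left out because their request format is not specified in this README.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # default address from the Quick Start section


def document_url(document_id: str, action: str = "", base_url: str = BASE_URL) -> str:
    """Build the URL of a document resource, optionally with a sub-path such as 'analysis'."""
    url = f"{base_url}/api/documents/{quote(document_id, safe='')}"
    return f"{url}/{action}" if action else url


def get_json(url: str, timeout: float = 10.0):
    """Send a GET request and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example calls, to be run while the application is up:
#   get_json(f"{BASE_URL}/api/health")
#   get_json(document_url("your-document-id", "analysis"))
```

The document ID is percent-encoded so that an ID containing `/` or spaces cannot change the request path.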

## Configuration

The application can be configured through environment variables or the `.env` file. Key configuration options:

- `DEBUG`: Enable debug mode (default: False)
- `VECTOR_DB`: Choose vector database backend ("pinecone" or "weaviate")
- `EMBEDDING_MODEL`: Cohere embedding model (default: "embed-english-v3.0")
- `RERANKER_MODEL`: Cohere reranker model (default: "rerank-english-v2.0")
- `REASONING_MODEL`: GROQ model (default: "llama-3.3-70b-versatile")
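
To illustrate how these options and their defaults fit together, here is a small sketch that reads them from a mapping such as `os.environ`. It is not the project's actual configuration code: the `Settings` class and `load_settings` function are hypothetical, and the accepted truthy spellings for `DEBUG` are an assumption. The defaults are the ones listed above, with Pinecone as the default backend.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    debug: bool
    vector_db: str
    embedding_model: str
    reranker_model: str
    reasoning_model: str


def load_settings(env=None) -> Settings:
    """Read configuration from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    vector_db = env.get("VECTOR_DB", "pinecone").strip().lower()
    if vector_db not in ("pinecone", "weaviate"):
        raise ValueError(f"VECTOR_DB must be 'pinecone' or 'weaviate', got {vector_db!r}")
    return Settings(
        # Assumption: "1", "true", and "yes" (any case) enable debug mode.
        debug=env.get("DEBUG", "False").strip().lower() in ("1", "true", "yes"),
        vector_db=vector_db,
        embedding_model=env.get("EMBEDDING_MODEL", "embed-english-v3.0"),
        reranker_model=env.get("RERANKER_MODEL", "rerank-english-v2.0"),
        reasoning_model=env.get("REASONING_MODEL", "llama-3.3-70b-versatile"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests, for example `load_settings({"VECTOR_DB": "weaviate"})`.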

## Development

### Project Structure

```
mini-specscomply-pro/
├── app/
│   ├── api/                # API routes and endpoints
│   ├── core/               # Core configuration and models
│   └── services/           # Business logic services
├── Data/                   # Sample data and documents
├── requirements.txt        # Project dependencies
├── run.py                  # Application runner
├── launch.py               # Setup and launch script
├── .env                    # Environment variables
├── .gitignore              # Git ignore file
└── README.md               # Project documentation
```

## Advanced Standards Matching

Mini SpecsComply Pro uses a sophisticated algorithm to match documents with relevant standards:

1. **Document Analysis**
   - Extracts sections and headings from the document
   - Identifies key technical terms and phrases
   - Recognizes standard references (e.g., "ISO-9001", "IEEE 829")

2. **Relevance Scoring**
   - Calculates weighted scores based on multiple factors:
     - Direct standard name matches (highest weight)
     - Keyword matches between document and standard
     - Section-specific matches (e.g., in References or Requirements sections)
     - Technical term matches
     - Requirement-specific matches

3. **Standard Selection**
   - Selects the most relevant standards based on score threshold
   - Applies these standards during compliance analysis
   - Displays applied standards in the compliance report

This approach ensures that the most appropriate standards are applied to each document, improving the accuracy and relevance of compliance analysis.
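
The scoring and selection steps above can be sketched as a weighted sum followed by a threshold filter. This is an illustration only, not the project's matching code: the weights, the threshold, and all names are hypothetical. Only the ordering (a direct name match carries the highest weight, with extra credit in References or Requirements sections) is taken from the description above.

```python
import re
from dataclasses import dataclass, field

# Hypothetical weights; only their relative order reflects the README.
WEIGHTS = {"name": 5.0, "section": 2.0, "keyword": 1.0}
PRIORITY_SECTIONS = {"references", "requirements"}


@dataclass
class Standard:
    name: str
    keywords: set = field(default_factory=set)  # expected in lowercase


def score_standard(sections: dict, standard: Standard) -> float:
    """Score one standard against a document given as {heading: body text}."""
    score = 0.0
    name = standard.name.lower()
    for heading, body in sections.items():
        text = body.lower()
        if name in text:
            score += WEIGHTS["name"]
            # A name match inside a priority section earns a bonus.
            if heading.lower() in PRIORITY_SECTIONS:
                score += WEIGHTS["section"]
        words = set(re.findall(r"[a-z0-9-]+", text))
        score += WEIGHTS["keyword"] * len(words & standard.keywords)
    return score


def select_standards(sections: dict, standards: list, threshold: float = 3.0) -> list:
    """Return names of standards scoring at or above the threshold, best first."""
    scored = [(score_standard(sections, s), s.name) for s in standards]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]
```

With a document whose References section mentions "ISO-9001:2015", a standard named `ISO-9001` collects the name weight and the section bonus, while an unrelated standard scores zero and is filtered out.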

## Document and Standard Formats

### Compliance Documents

For best results, structure your compliance documents with clear sections and headings. The system performs better with well-organized documents that include:

1. **Clear Headings**: Use markdown-style headings (e.g., `# Section Title`) to organize content
2. **Introduction Section**: Provide context and purpose of the document
3. **Scope Section**: Define what the document covers and doesn't cover
4. **Requirements Sections**: Clearly state requirements using terms like "shall", "must", "should"
5. **References Section**: List relevant standards, specifications, or other documents
6. **Technical Details**: Include specific technical information relevant to compliance

Example document structure:

```markdown
# System Compliance Specification

## Introduction
This document specifies the compliance requirements for the XYZ system.

## Scope
This specification applies to all components of the XYZ system.

## Requirements
### Functional Requirements
1. The system shall process user input within 500ms.
2. The system must maintain data integrity during power failures.

### Security Requirements
1. All data transmissions shall be encrypted using AES-256.
2. User authentication must comply with NIST guidelines.

## References
- ISO-9001:2015 Quality Management Systems
- IEEE-829 Software Test Documentation
```

### Custom Standard Definitions

Custom standards are defined in JSON format with the following structure:

```json
{
  "name": "ISO-9001",
  "description": "Quality Management System standard",
  "requirements": [
    {
      "id": "ISO-9001-4.1",
      "description": "The organization shall determine external and internal issues relevant to its purpose and strategic direction.",
      "severity": "major"
    },
    {
      "id": "ISO-9001-4.2",
      "description": "The organization shall monitor and review information about these external and internal issues.",
      "severity": "minor"
    }
  ]
}
```

You can also define multiple standards in a single file:

```json
{
  "standards": [
    {
      "name": "ISO-9001",
      "description": "Quality Management System standard",
      "requirements": [...]
    },
    {
      "name": "IEEE-829",
      "description": "Software Test Documentation standard",
      "requirements": [...]
    }
  ]
}
```

Requirement severity levels:
- `critical`: Major non-compliance that must be addressed immediately
- `major`: Significant issue that should be addressed soon
- `minor`: Less significant issue that should be addressed when convenient
- `info`: Informational note or suggestion
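
Before uploading a custom standard, it can help to check the file locally. The sketch below accepts both layouts shown above (a single standard, or a `standards` list) and reports problems as messages. It is a hypothetical helper, not the validation the API performs, and treating every field in the examples as required is an assumption. It also expects `requirements` to be a list of objects.

```python
import json

SEVERITIES = {"critical", "major", "minor", "info"}


def validate_standard(standard: dict) -> list:
    """Return a list of problems found in one standard definition."""
    errors = []
    # Assumption: all three top-level fields shown in the example are required.
    for key in ("name", "description", "requirements"):
        if key not in standard:
            errors.append(f"missing field: {key}")
    for index, req in enumerate(standard.get("requirements", [])):
        for key in ("id", "description", "severity"):
            if key not in req:
                errors.append(f"requirement {index}: missing field: {key}")
        severity = req.get("severity")
        if severity is not None and severity not in SEVERITIES:
            errors.append(f"requirement {index}: unknown severity: {severity}")
    return errors


def validate_file_content(text: str) -> list:
    """Validate a JSON document holding one standard or a 'standards' list."""
    data = json.loads(text)
    standards = data["standards"] if "standards" in data else [data]
    errors = []
    for standard in standards:
        label = standard.get("name", "<unnamed>")
        errors.extend(f"{label}: {error}" for error in validate_standard(standard))
    return errors
```

An empty result means the file matches the structure described in this section. Any message names the standard and the requirement index that needs attention.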

## Troubleshooting

Common issues and solutions:

1. **Missing API Keys**
   - Ensure all required API keys are set in your `.env` file
   - Check the API key format and validity

2. **Vector Database Connection**
   - Verify the vector database configuration
   - Ensure the selected database service is running and accessible

3. **Model Errors**
   - Check API quotas and limits
   - Verify model names in configuration

4. **Standards Not Being Applied**
   - Verify that standards have been uploaded correctly
   - Check the logs for standards matching information
   - Ensure document content includes relevant terminology for matching
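
For the first issue, a quick way to find out which variables are missing is a short check like the one below. This is a standalone sketch and not the check that `launch.py` performs. The function name is hypothetical, and the required-variable lists are assumptions based on the `.env` example and the Prerequisites section, with `WEAVIATE_API_KEY` treated as optional.

```python
import os

ALWAYS_REQUIRED = ("GROQ_API_KEY", "COHERE_API_KEY")
# Assumption: which variables each backend needs, inferred from the .env example.
BACKEND_REQUIRED = {
    "pinecone": ("PINECONE_API_KEY", "PINECONE_ENVIRONMENT", "PINECONE_INDEX_NAME"),
    "weaviate": ("WEAVIATE_URL",),
}


def missing_variables(env=None) -> list:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    backend = env.get("VECTOR_DB", "pinecone").strip().lower()
    required = ALWAYS_REQUIRED + BACKEND_REQUIRED.get(backend, ())
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_variables()` after loading the `.env` file into the process environment returns an empty list when everything needed for the selected backend is present.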