Initial commit

2025-08-04 14:50:33 +01:00
commit 40b28a7ee3
30 changed files with 3410 additions and 0 deletions
@@ -0,0 +1,155 @@
+# Semantic Search Engine POC
+
+A proof-of-concept semantic search engine for archival documents. It demonstrates entity-aware semantic search across mixed file formats such as PDF, XML, DOCX, and plain text.
+
+## Project Overview
+
+This POC addresses the requirements for a future full-scale semantic search system capable of:
+
+- **Entity-centric search** across persons, places, events, buildings, and organizations
+- **Multi-modal document processing** (PDFs, XML, text, images, audio, video)
+- **Semantic similarity search** using modern embedding techniques
+- **Relationship discovery** between entities across documents
+- **Access control** for public vs. restricted documents
+- **Scalable architecture** for production deployment
+
+## Architecture
+
+```
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│   Document      │    │   Entity        │    │   Vector        │
+│   Processor     │───▶│   Extractor     │───▶│   Store         │
+└─────────────────┘    └─────────────────┘    └─────────────────┘
+         │                       │                       │
+         ▼                       ▼                       ▼
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│   Text          │    │   Named Entity  │    │   Embeddings    │
+│   Extraction    │    │   Recognition   │    │   (ChromaDB)    │
+└─────────────────┘    └─────────────────┘    └─────────────────┘
+                                │
+                                ▼
+                      ┌─────────────────┐
+                      │   Search        │
+                      │   Service       │
+                      └─────────────────┘
+```
+
+## Getting Started
+
+### Prerequisites
+
+- Python 3.8+
+- pip
+- Git
+
+### Installation
+
+1. **Clone the repository**
+```bash
+git clone <repository-url>
+cd semantic_search_poc
+```
+
+2. **Create virtual environment**
+```bash
+python -m venv venv
+source venv/bin/activate  # On Windows: venv\Scripts\activate
+```
+
+3. **Install dependencies**
+```bash
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
+```
+
+4. **Initialize the environment**
+```bash
+python scripts/setup_data.py
+```
+
+5. **Run the POC**
+```bash
+python -m src.main
+```
+
+### Expected Output
+
+The POC will demonstrate:
+- Document processing and indexing
+- Semantic search across sample documents
+- Entity extraction and relationship discovery
+- Performance metrics and statistics
+
+## Features
+
+### Document Processing
+- **PDF text extraction** using PyPDF2
+- **XML parsing** for finding aids
+- **DOCX support** for modern documents
+- **Metadata extraction** (title, author, creation date, keywords)
+- **Language support**: currently optimized for English (the setup installs spaCy's `en_core_web_sm` model)
+
+### Entity Recognition
+- **Named Entity Recognition** using spaCy
+- **Custom entity types**: Person, Place, Event, Organization, Building, Date
+- **Relationship extraction** between entities
+- **Confidence scoring** for entity matches
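+
+As a rough illustration of how spaCy's built-in labels could be mapped onto the custom entity types listed above, here is a minimal sketch. The `SPACY_TO_CUSTOM` table and the `map_entities` helper are hypothetical names, not taken from `entity_extractor.py`; the spaCy labels (`PERSON`, `GPE`, `LOC`, `FAC`, `ORG`, `EVENT`, `DATE`) are the standard ones emitted by `en_core_web_sm`.
+
+```python
+from typing import Dict, Iterable, List, Tuple
+
+# Hypothetical mapping from spaCy labels to the entity types named in this README.
+SPACY_TO_CUSTOM: Dict[str, str] = {
+    "PERSON": "Person",
+    "GPE": "Place",
+    "LOC": "Place",
+    "EVENT": "Event",
+    "ORG": "Organization",
+    "FAC": "Building",
+    "DATE": "Date",
+}
+
+
+def map_entities(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
+    """Keep only entities whose spaCy label has a custom type, and rename the label."""
+    return [
+        (text, SPACY_TO_CUSTOM[label])
+        for text, label in pairs
+        if label in SPACY_TO_CUSTOM
+    ]
+
+
+if __name__ == "__main__":
+    sample = [("Ada Lovelace", "PERSON"), ("London", "GPE"), ("42", "CARDINAL")]
+    print(map_entities(sample))
+```
+
+With spaCy loaded, the input pairs would typically come from `[(ent.text, ent.label_) for ent in nlp(text).ents]`.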
+
+### Semantic Search
+- **Vector embeddings** using Sentence-BERT (`all-MiniLM-L6-v2`)
+- **Similarity search** with configurable thresholds
+- **Hybrid search** combining semantic and keyword matching
+- **Entity-filtered search** results
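+
+One way hybrid search with a threshold might work is sketched below. This is an assumption-laden illustration, not the logic in `search_service.py`: the `keyword_score` and `hybrid_rank` helpers, the `alpha` weight, and the choice to apply the threshold to the blended score are all hypothetical. Semantic scores are passed in precomputed so the sketch needs no embedding model.
+
+```python
+from typing import Dict, List, Tuple
+
+SIMILARITY_THRESHOLD = 0.3  # same value as the sample configuration
+
+
+def keyword_score(query: str, text: str) -> float:
+    """Fraction of distinct query terms that appear in the text (case-insensitive)."""
+    terms = set(query.lower().split())
+    if not terms:
+        return 0.0
+    words = set(text.lower().split())
+    return len(terms & words) / len(terms)
+
+
+def hybrid_rank(
+    query: str,
+    docs: Dict[str, str],
+    semantic_scores: Dict[str, float],
+    alpha: float = 0.7,  # assumed weight on the semantic score
+) -> List[Tuple[str, float]]:
+    """Blend semantic and keyword scores, drop weak matches, sort best first."""
+    ranked = []
+    for doc_id, text in docs.items():
+        semantic = semantic_scores.get(doc_id, 0.0)
+        score = alpha * semantic + (1 - alpha) * keyword_score(query, text)
+        if score >= SIMILARITY_THRESHOLD:
+            ranked.append((doc_id, score))
+    return sorted(ranked, key=lambda item: item[1], reverse=True)
+```
+
+A linear blend keeps the two signals easy to tune independently; setting `alpha=1.0` reduces this to pure semantic ranking.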
+
+### Vector Storage
+- **ChromaDB integration** for persistent vector storage
+- **Scalable indexing** for large document collections
+- **Metadata filtering** and search optimization
+
+## Configuration
+
+Key settings in `config/settings.py`:
+
+```python
+# Embedding Model
+EMBEDDING_MODEL = "all-MiniLM-L6-v2"
+EMBEDDING_DIMENSION = 384
+
+# Search Parameters
+MAX_SEARCH_RESULTS = 50
+SIMILARITY_THRESHOLD = 0.3
+
+# File Processing
+MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB
+ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]
+```
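+
+To show how the file-processing settings might be applied, the sketch below checks a candidate file against `MAX_FILE_SIZE` and `ALLOWED_EXTENSIONS` before ingestion. The `is_processable` helper is a hypothetical name used for illustration; the constants are copied here so the snippet runs on its own rather than importing `config/settings.py`.
+
+```python
+from pathlib import Path
+
+MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB, as in the settings above
+ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]
+
+
+def is_processable(path: Path) -> bool:
+    """Return True if the file exists, has an allowed extension, and is small enough."""
+    if not path.is_file():
+        return False
+    # Compare extensions case-insensitively so "REPORT.PDF" is accepted.
+    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
+        return False
+    return path.stat().st_size <= MAX_FILE_SIZE
+```
+
+For example, `is_processable(Path("data/raw/report.pdf"))` would return `False` if that file is missing or exceeds 50MB.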
+
+## Project Structure
+
+```
+semantic_search_poc/
+├── README.md
+├── requirements.txt
+├── .env.example
+├── config/
+│   └── settings.py              # Configuration settings
+├── src/
+│   ├── main.py                  # Main application entry point
+│   ├── models/
+│   │   ├── document.py          # Document data models
+│   │   └── search_result.py     # Search result models
+│   ├── services/
+│   │   ├── document_processor.py # Document processing pipeline
+│   │   ├── embedding_service.py  # Embedding generation
+│   │   ├── entity_extractor.py   # Named entity recognition
+│   │   ├── search_service.py     # Main search functionality
+│   │   └── vector_store.py       # Vector database operations
+│   └── utils/
+│       ├── file_handlers.py      # File processing utilities
+│       └── logger.py             # Logging configuration
+├── data/
+│   ├── raw/                     # Input documents
+│   ├── processed/               # Processed document metadata
+│   └── embeddings/              # Vector embeddings storage
+├── tests/                       # Unit tests
+├── notebooks/                   # Jupyter notebooks for analysis
+└── scripts/                     # Utility scripts
+```