# Semantic Search Engine POC

A proof-of-concept semantic search engine for archival documents. It demonstrates how semantic search can work across several file types, including PDF, XML, DOCX, and plain text.

## Project Overview

This POC addresses the requirements for a future full-scale semantic search system capable of:

- **Entity-centric search** across persons, places, events, buildings, and organizations
- **Multi-modal document processing** (PDFs, XML, text, images, audio, video)
- **Semantic similarity search** using modern embedding techniques
- **Relationship discovery** between entities across documents
- **Access control** for public vs. restricted documents
- **Scalable architecture** for production deployment

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Document      │    │   Entity        │    │   Vector        │
│   Processor     │───▶│   Extractor     │───▶│   Store         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Text          │    │   Named Entity  │    │   Embeddings    │
│   Extraction    │    │   Recognition   │    │   (ChromaDB)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                      ┌─────────────────┐
                      │   Search        │
                      │   Service       │
                      └─────────────────┘
```
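The data flow in the diagram can be sketched as a single indexing function. This is a minimal illustration, not the code in `src/services/`: the `IndexedDocument` class, the `index_document` function, and the four stage arguments are hypothetical names chosen for the example, and a plain dictionary stands in for the vector store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class IndexedDocument:
    """Hypothetical record produced by one pass through the pipeline."""
    doc_id: str
    text: str
    entities: List[Tuple[str, str]] = field(default_factory=list)
    embedding: List[float] = field(default_factory=list)


def index_document(
    doc_id: str,
    raw: bytes,
    extract_text: Callable[[bytes], str],
    extract_entities: Callable[[str], List[Tuple[str, str]]],
    embed: Callable[[str], List[float]],
    store: Dict[str, IndexedDocument],
) -> IndexedDocument:
    """Run the stages in diagram order and save the result in `store`."""
    text = extract_text(raw)              # Document Processor -> Text Extraction
    entities = extract_entities(text)     # Entity Extractor -> NER
    embedding = embed(text)               # Embeddings
    record = IndexedDocument(doc_id, text, entities, embedding)
    store[doc_id] = record                # Vector Store (a dict in this sketch)
    return record
```

Passing the stages in as callables keeps each one replaceable, which is convenient for testing a stage with a stub before wiring in the real service.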

## Getting Started

### Prerequisites

- Python 3.8+
- pip
- Git

### Installation

1. **Clone the repository**
```bash
git clone <repository-url>
cd maryam-ocr
```

2. **Create virtual environment**
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. **Install dependencies**
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

4. **Initialize the environment**
```bash
python scripts/setup_data.py
```

5. **Run the POC**
```bash
python -m src.main
```

### Expected Output

The POC will demonstrate:
- Document processing and indexing
- Semantic search across sample documents
- Entity extraction and relationship discovery
- Performance metrics and statistics

## Features

### Document Processing
- **PDF text extraction** using PyPDF2
- **XML parsing** for finding aids
- **DOCX support** for modern documents
- **Metadata extraction** (title, author, creation date, keywords)
- **Multi-language support** (currently optimized for English)
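As an illustration of the XML step, the sketch below pulls a title and the full text out of a finding aid using only the standard library. It assumes an EAD-style document with a `unittitle` element and no XML namespaces; the function name `extract_finding_aid` is hypothetical, and real finding aids may need namespace handling that is not shown here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def extract_finding_aid(xml_text: str) -> Dict[str, Optional[str]]:
    """Return the first <unittitle> and the whitespace-normalized text."""
    root = ET.fromstring(xml_text)

    # Assumption: the title lives in an un-namespaced <unittitle> element.
    title_el = root.find(".//unittitle")
    title = None
    if title_el is not None and title_el.text:
        title = title_el.text.strip()

    # Join every text node, then collapse runs of whitespace.
    body = " ".join(" ".join(root.itertext()).split())
    return {"title": title, "text": body}
```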

### Entity Recognition
- **Named Entity Recognition** using spaCy
- **Custom entity types**: Person, Place, Event, Organization, Building, Date
- **Relationship extraction** between entities
- **Confidence scoring** for entity matches
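One way to arrive at the custom entity types is to map spaCy's built-in labels onto them. The mapping below is an assumption made for illustration (the project may map labels differently), and the helper names are hypothetical. `extract_entities` needs spaCy and the `en_core_web_sm` model installed; `map_entities` works on plain `(text, label)` pairs.

```python
from typing import Iterable, List, Tuple

# Assumed mapping from spaCy labels to the custom types listed above.
SPACY_TO_CUSTOM = {
    "PERSON": "Person",
    "GPE": "Place",
    "LOC": "Place",
    "EVENT": "Event",
    "ORG": "Organization",
    "FAC": "Building",
    "DATE": "Date",
}


def map_entities(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep entities whose label has a custom type; drop the rest."""
    return [
        (text, SPACY_TO_CUSTOM[label])
        for text, label in pairs
        if label in SPACY_TO_CUSTOM
    ]


def extract_entities(text: str, model: str = "en_core_web_sm") -> List[Tuple[str, str]]:
    """Run spaCy NER on `text` and map the results to custom types."""
    import spacy  # imported lazily so map_entities works without spaCy

    nlp = spacy.load(model)
    doc = nlp(text)
    return map_entities((ent.text, ent.label_) for ent in doc.ents)
```

Loading the model on every call is slow; a longer-lived service would typically load it once and reuse it.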

### Semantic Search
- **Vector embeddings** using Sentence-BERT (`all-MiniLM-L6-v2`)
- **Similarity search** with configurable thresholds
- **Hybrid search** combining semantic and keyword matching
- **Entity-filtered search** results
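The threshold and hybrid ideas can be shown with a small pure-Python sketch. It is not the project's `search_service.py`: the function names, the `alpha` weight, and the simple term-overlap keyword score are assumptions for illustration, and the embeddings are passed in as plain lists instead of being computed by Sentence-BERT.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the text (case-insensitive)."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & set(text.lower().split())) / len(query_terms)


def hybrid_search(
    query: str,
    query_vec: Sequence[float],
    docs: Dict[str, Tuple[str, Sequence[float]]],
    threshold: float = 0.3,   # plays the role of SIMILARITY_THRESHOLD
    alpha: float = 0.7,       # assumed weight of the semantic score
    top_k: int = 50,          # plays the role of MAX_SEARCH_RESULTS
) -> List[Tuple[str, float]]:
    """Rank documents by a weighted mix of semantic and keyword scores."""
    results = []
    for doc_id, (text, vec) in docs.items():
        semantic = cosine_similarity(query_vec, vec)
        if semantic < threshold:
            continue  # below the similarity threshold, not a candidate
        score = alpha * semantic + (1 - alpha) * keyword_score(query, text)
        results.append((doc_id, score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top_k]
```

Applying the threshold to the semantic score before blending means a document cannot be returned on keyword overlap alone; whether that is desirable depends on the collection.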

### Vector Storage
- **ChromaDB integration** for persistent vector storage
- **Scalable indexing** for large document collections
- **Metadata filtering** and search optimization

## Configuration

Key settings in `config/settings.py`:

```python
# Embedding Model
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
EMBEDDING_DIMENSION = 384

# Search Parameters
MAX_SEARCH_RESULTS = 50
SIMILARITY_THRESHOLD = 0.3

# File Processing
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MiB (52,428,800 bytes)
ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]
```
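A short sketch of how the file-processing settings might be applied before a document enters the pipeline. The two constants are copied from the settings above so the example runs on its own; the `is_supported` helper is a hypothetical name and is not taken from the project.

```python
from pathlib import Path

# Values copied from config/settings.py so this example is self-contained.
MAX_FILE_SIZE = 50 * 1024 * 1024
ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]


def is_supported(path: str, size_bytes: int) -> bool:
    """Accept a file only if its extension and size are within the limits."""
    extension_ok = Path(path).suffix.lower() in ALLOWED_EXTENSIONS
    size_ok = 0 < size_bytes <= MAX_FILE_SIZE
    return extension_ok and size_ok
```

Lowercasing the suffix lets files such as `REPORT.PDF` pass the extension check; empty files are rejected because there is nothing to index.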

## Project Structure

```
semantic_search_poc/
├── README.md
├── requirements.txt
├── .env.example
├── config/
│   └── settings.py              # Configuration settings
├── src/
│   ├── main.py                  # Main application entry point
│   ├── models/
│   │   ├── document.py          # Document data models
│   │   └── search_result.py     # Search result models
│   ├── services/
│   │   ├── document_processor.py # Document processing pipeline
│   │   ├── embedding_service.py  # Embedding generation
│   │   ├── entity_extractor.py   # Named entity recognition
│   │   ├── search_service.py     # Main search functionality
│   │   └── vector_store.py       # Vector database operations
│   └── utils/
│       ├── file_handlers.py      # File processing utilities
│       └── logger.py             # Logging configuration
├── data/
│   ├── raw/                     # Input documents
│   ├── processed/               # Processed document metadata
│   └── embeddings/              # Vector embeddings storage
├── tests/                       # Unit tests
├── notebooks/                   # Jupyter notebooks for analysis
└── scripts/                     # Utility scripts
```