A proof-of-concept (POC) semantic search engine for archival documents. It demonstrates how entity-aware, embedding-based search can work across several file types, including PDF, XML, DOCX, and plain text.

Project Overview

This POC addresses the requirements for a future full-scale semantic search system capable of:

Entity-centric search across persons, places, events, buildings, and organizations
Multi-modal document processing (PDFs, XML, text, images, audio, video)
Semantic similarity search using modern embedding techniques
Relationship discovery between entities across documents
Access control for public vs. restricted documents
Scalable architecture for production deployment

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Document      │    │   Entity        │    │   Vector        │
│   Processor     │───▶│   Extractor     │───▶│   Store         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Text          │    │   Named Entity  │    │   Embeddings    │
│   Extraction    │    │   Recognition   │    │   (ChromaDB)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                      ┌─────────────────┐
                      │   Search        │
                      │   Service       │
                      └─────────────────┘
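
The flow in the diagram can be sketched as one indexing function that chains the stages. This is a minimal illustration, not the code in src/: index_document, ProcessedDocument, and the toy_* helpers are hypothetical names, and a plain dict stands in for the vector store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ProcessedDocument:
    """Hypothetical container for what the pipeline knows about one document."""
    doc_id: str
    text: str
    entities: List[str] = field(default_factory=list)


def index_document(
    doc_id: str,
    raw: bytes,
    extract_text: Callable[[bytes], str],
    extract_entities: Callable[[str], List[str]],
    embed: Callable[[str], List[float]],
    store: Dict[str, Tuple[List[float], ProcessedDocument]],
) -> ProcessedDocument:
    """Run one document through text extraction, entity extraction, and storage."""
    text = extract_text(raw)                       # Document Processor stage
    doc = ProcessedDocument(doc_id, text, extract_entities(text))  # Entity Extractor stage
    store[doc_id] = (embed(text), doc)             # Vector Store stage (dict stand-in)
    return doc


# Toy stand-ins so the sketch runs without PyPDF2, spaCy, or ChromaDB.
def toy_extract_text(raw: bytes) -> str:
    return raw.decode("utf-8")


def toy_extract_entities(text: str) -> List[str]:
    # Treats every capitalized word as an "entity"; real NER is far more selective.
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def toy_embed(text: str) -> List[float]:
    # One-dimensional placeholder vector, not a real embedding.
    return [float(len(text))]
```

Passing the stages in as callables keeps each one replaceable, which mirrors the separation of services shown in the diagram.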

Prerequisites

Python 3.8+
pip
Git

Installation

Clone the repository

git clone <repository-url>
cd semantic_search_poc

Create virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Initialize the environment

python scripts/setup_data.py

Run the POC

python -m src.main

Expected Output

The POC will demonstrate:

Document processing and indexing
Semantic search across sample documents
Entity extraction and relationship discovery
Performance metrics and statistics

Features

Document Processing

PDF text extraction using PyPDF2
XML parsing for finding aids
DOCX support for modern documents
Metadata extraction (title, author, creation date, keywords)
Multi-language support (currently optimized for English; the installation steps download only the en_core_web_sm spaCy model)
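
As an illustration of the XML parsing and metadata extraction items above, here is a minimal sketch using the standard library's xml.etree.ElementTree. The element names (findingaid, item, unittitle) and the parse_finding_aid function are assumptions made for the example; real finding aids, such as EAD files, use a richer schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified finding aid used only for this example.
SAMPLE = """<findingaid>
  <title>Harbor Commission Records</title>
  <author>City Archive</author>
  <item id="1"><unittitle>Minutes, 1901</unittitle></item>
  <item id="2"><unittitle>Correspondence, 1902</unittitle></item>
</findingaid>"""


def parse_finding_aid(xml_text: str) -> dict:
    """Extract top-level metadata and the list of described items."""
    root = ET.fromstring(xml_text)
    return {
        # findtext returns the default when the element is missing.
        "title": root.findtext("title", default=""),
        "author": root.findtext("author", default=""),
        "items": [
            {"id": item.get("id"), "title": item.findtext("unittitle", default="")}
            for item in root.iter("item")
        ],
    }


metadata = parse_finding_aid(SAMPLE)
```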

Entity Recognition

Named Entity Recognition using spaCy
Custom entity types: Person, Place, Event, Organization, Building, Date
Relationship extraction between entities
Confidence scoring for entity matches
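
One simple way to approximate relationship extraction is to count how often two entities appear in the same sentence. The sketch below assumes that heuristic; it is not necessarily what entity_extractor.py does, and cooccurrence_relations is a hypothetical name. The input is the list of entity names already found in each sentence, for example by spaCy.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_relations(
    sentence_entities: Iterable[List[str]],
) -> "Counter[Tuple[str, str]]":
    """Count unordered entity pairs that share a sentence."""
    counts: Counter = Counter()
    for entities in sentence_entities:
        # Deduplicate and sort so each pair has one canonical (a, b) key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


relations = cooccurrence_relations([
    ["Ada Lovelace", "London"],
    ["London", "Ada Lovelace", "Charles Babbage"],
    ["Charles Babbage"],
])
```

Pair counts like these could serve as a rough confidence signal: a pair seen in many sentences is more likely to reflect a real relationship than a pair seen once.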

Semantic Search

Vector embeddings using Sentence-BERT (all-MiniLM-L6-v2)
Similarity search with configurable thresholds
Hybrid search combining semantic and keyword matching
Entity-filtered search results
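
The threshold and hybrid scoring ideas above can be sketched in pure Python. This is an illustration under stated assumptions: the alpha weighting, the keyword_score overlap measure, and the choice to apply the threshold to the semantic score alone are all hypothetical, and the two-dimensional vectors stand in for real 384-dimensional embeddings.

```python
import math
from typing import Dict, List, Tuple

SIMILARITY_THRESHOLD = 0.3
MAX_SEARCH_RESULTS = 50


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear as whole words in the text."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    return len(terms & set(text.lower().split())) / len(terms)


def hybrid_search(
    query: str,
    query_vec: List[float],
    docs: Dict[str, Tuple[str, List[float]]],
    alpha: float = 0.7,  # assumed weight of the semantic score
    threshold: float = SIMILARITY_THRESHOLD,
    limit: int = MAX_SEARCH_RESULTS,
) -> List[Tuple[str, float]]:
    """Rank documents by a weighted mix of semantic and keyword scores."""
    results = []
    for doc_id, (text, vec) in docs.items():
        semantic = cosine(query_vec, vec)
        if semantic < threshold:
            continue  # drop documents below the similarity threshold
        score = alpha * semantic + (1 - alpha) * keyword_score(query, text)
        results.append((doc_id, score))
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:limit]


docs = {
    "a": ("harbor commission minutes", [1.0, 0.0]),
    "b": ("family letters", [0.0, 1.0]),
    "c": ("harbor photographs", [0.8, 0.6]),
}
hits = hybrid_search("harbor minutes", [1.0, 0.0], docs)
```

Here document "b" is orthogonal to the query vector and is filtered out by the threshold, while "a" ranks above "c" because it matches both query terms.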

Vector Storage

ChromaDB integration for persistent vector storage
Scalable indexing for large document collections
Metadata filtering and search optimization

Configuration

Key settings in config/settings.py:

# Embedding Model
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
EMBEDDING_DIMENSION = 384

# Search Parameters
MAX_SEARCH_RESULTS = 50
SIMILARITY_THRESHOLD = 0.3

# File Processing
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB
ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]
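
As an example of how the file processing settings might be applied before ingestion, the sketch below checks a file's extension and size. The is_ingestible helper is hypothetical, and the constants are repeated here only so the snippet is self-contained.

```python
from pathlib import Path

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB
ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]


def is_ingestible(path: str, size_bytes: int) -> bool:
    """Accept non-empty files with an allowed extension and a size within the limit."""
    suffix = Path(path).suffix.lower()  # case-insensitive, so ".PDF" is accepted
    return suffix in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_FILE_SIZE
```

In practice the size would come from something like Path(path).stat().st_size; it is passed in here so the check can be tried without touching the filesystem.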

Project Structure

semantic_search_poc/
├── README.md
├── requirements.txt
├── .env.example
├── config/
│   └── settings.py              # Configuration settings
├── src/
│   ├── main.py                  # Main application entry point
│   ├── models/
│   │   ├── document.py          # Document data models
│   │   └── search_result.py     # Search result models
│   ├── services/
│   │   ├── document_processor.py # Document processing pipeline
│   │   ├── embedding_service.py  # Embedding generation
│   │   ├── entity_extractor.py   # Named entity recognition
│   │   ├── search_service.py     # Main search functionality
│   │   └── vector_store.py       # Vector database operations
│   └── utils/
│       ├── file_handlers.py      # File processing utilities
│       └── logger.py             # Logging configuration
├── data/
│   ├── raw/                     # Input documents
│   ├── processed/               # Processed document metadata
│   └── embeddings/              # Vector embeddings storage
├── tests/                       # Unit tests
├── notebooks/                   # Jupyter notebooks for analysis
└── scripts/                     # Utility scripts