Semantic Search Engine POC
A proof-of-concept semantic search engine for archival documents. It demonstrates how semantic search can work across different file types, including PDFs and XML files.
Project Overview
This POC addresses the requirements for a future full-scale semantic search system capable of:
- Entity-centric search across persons, places, events, buildings, and organizations
- Multi-modal document processing (PDFs, XML, text, images, audio, video)
- Semantic similarity search using modern embedding techniques
- Relationship discovery between entities across documents
- Access control for public vs. restricted documents
- Scalable architecture for production deployment
Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Document      │    │   Entity        │    │   Vector        │
│   Processor     │───▶│   Extractor     │───▶│   Store         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Text          │    │   Named Entity  │    │   Embeddings    │
│   Extraction    │    │   Recognition   │    │   (ChromaDB)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
│
▼
┌─────────────────┐
│ Search │
│ Service │
└─────────────────┘
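The data flow in the diagram can be sketched as a chain of small functions. This is a minimal illustration, not the project's code: every name here (`Document`, `extract_text`, `extract_entities`, `embed`, `index_document`) is hypothetical, and the entity extractor and embedder are trivial stand-ins for spaCy and Sentence-BERT.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    raw: str
    text: str = ""
    entities: list = field(default_factory=list)
    embedding: list = field(default_factory=list)


def extract_text(doc):
    # Stand-in for the document processor: normalise whitespace only.
    doc.text = " ".join(doc.raw.split())
    return doc


def extract_entities(doc):
    # Stand-in for NER: treat capitalised words as candidate entities.
    doc.entities = [w.strip(".,") for w in doc.text.split() if w[:1].isupper()]
    return doc


def embed(doc):
    # Stand-in for an embedding model: a 2-dimensional toy vector.
    doc.embedding = [float(len(doc.text)), float(len(doc.entities))]
    return doc


def index_document(doc, store):
    # Run the three stages in the order shown in the diagram, then store.
    for stage in (extract_text, extract_entities, embed):
        doc = stage(doc)
    store[doc.doc_id] = doc
    return doc


store = {}
indexed = index_document(Document("d1", "Letter  from Ada Lovelace to   London."), store)
```

Keeping each stage as a function that takes and returns a document makes it straightforward to swap a stand-in for a real implementation without touching the others.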
Prerequisites
- Python 3.8+
- pip
- Git
Installation
- Clone the repository
git clone <repository-url>
cd semantic_search_poc
- Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm
- Initialize the environment
python scripts/setup_data.py
- Run the POC
python -m src.main
Expected Output
The POC will demonstrate:
- Document processing and indexing
- Semantic search across sample documents
- Entity extraction and relationship discovery
- Performance metrics and statistics
Features
Document Processing
- PDF text extraction using PyPDF2
- XML parsing for finding aids
- DOCX support for modern documents
- Metadata extraction (title, author, creation date, keywords)
- Multi-language support (currently optimized for English)
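As an illustration of the XML parsing and metadata extraction listed above, here is a minimal sketch using only the Python standard library. The element names (`titleproper`, `author`, `unitdate`, `p`) and the sample record are assumptions loosely modelled on EAD-style finding aids; the schema the project actually expects is not stated here, and `parse_finding_aid` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; real finding aids will be larger and namespaced.
SAMPLE_XML = """<findingaid>
  <titleproper>Harbour Board Records</titleproper>
  <author>City Archive</author>
  <unitdate>1890-1920</unitdate>
  <p>Minutes and correspondence of the Harbour Board.</p>
  <p>Includes plans of the old customs house.</p>
</findingaid>"""


def parse_finding_aid(xml_text):
    """Return metadata and body text from a small XML finding aid."""
    root = ET.fromstring(xml_text)

    def first_text(tag):
        # findtext returns None when the element is absent.
        value = root.findtext(f".//{tag}")
        return value.strip() if value else None

    paragraphs = [p.text.strip() for p in root.iter("p") if p.text]
    return {
        "title": first_text("titleproper"),
        "author": first_text("author"),
        "creation_date": first_text("unitdate"),
        "text": "\n".join(paragraphs),
    }


record = parse_finding_aid(SAMPLE_XML)
```

Missing elements come back as `None` rather than raising, which suits archival data where metadata is often incomplete.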
Entity Recognition
- Named Entity Recognition using spaCy
- Custom entity types: Person, Place, Event, Organization, Building, Date
- Relationship extraction between entities
- Confidence scoring for entity matches
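One way the custom entity types could be derived from spaCy output is a label mapping, sketched below. The mapping itself is an assumption (for example, treating spaCy's `FAC` label as Building), and `map_entities` and `extract_entities` are hypothetical helpers rather than the project's API. Confidence scoring is not shown, since spaCy's default NER pipeline does not expose per-entity scores directly.

```python
# Assumed mapping from spaCy's entity labels to the custom types listed above.
SPACY_TO_CUSTOM = {
    "PERSON": "Person",
    "GPE": "Place",
    "LOC": "Place",
    "EVENT": "Event",
    "ORG": "Organization",
    "FAC": "Building",
    "DATE": "Date",
}


def map_entities(labelled_spans):
    """Convert (text, spacy_label) pairs to (text, custom_type), dropping others."""
    return [
        (text, SPACY_TO_CUSTOM[label])
        for text, label in labelled_spans
        if label in SPACY_TO_CUSTOM
    ]


def extract_entities(text, model_name="en_core_web_sm"):
    """Run spaCy NER and map the results; requires spaCy and the model."""
    import spacy  # imported lazily so the mapping works without spaCy installed

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return map_entities((ent.text, ent.label_) for ent in doc.ents)


mapped = map_entities(
    [("Ada Lovelace", "PERSON"), ("London", "GPE"), ("42", "CARDINAL")]
)
```

Labels without a counterpart in the custom scheme, such as `CARDINAL`, are dropped rather than forced into a category.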
Semantic Search
- Vector embeddings using Sentence-BERT (all-MiniLM-L6-v2)
- Similarity search with configurable thresholds
- Hybrid search combining semantic and keyword matching
- Entity-filtered search results
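The threshold and hybrid-scoring ideas above can be sketched with plain Python and toy vectors. This is an assumption about how the pieces might fit together, not the project's implementation: the weighting parameter `alpha`, the term-overlap keyword score, and the function names are all hypothetical. In practice the vectors would be 384-dimensional embeddings from the model rather than hand-written pairs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query terms that appear in the text (case-insensitive)."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_search(query, query_vec, docs, threshold=0.3, alpha=0.7):
    """Rank docs by alpha * semantic + (1 - alpha) * keyword score.

    docs is a list of (doc_id, text, vector). Documents whose semantic
    similarity falls below the threshold are dropped before ranking.
    """
    results = []
    for doc_id, text, vec in docs:
        semantic = cosine_similarity(query_vec, vec)
        if semantic < threshold:
            continue
        score = alpha * semantic + (1 - alpha) * keyword_score(query, text)
        results.append((doc_id, round(score, 3)))
    return sorted(results, key=lambda r: r[1], reverse=True)


docs = [
    ("a", "harbour board minutes", [1.0, 0.0]),
    ("b", "customs house plans", [0.6, 0.8]),
    ("c", "unrelated ledger", [0.0, 1.0]),
]
ranked = hybrid_search("harbour minutes", [1.0, 0.0], docs)
```

Here document `c` is filtered out by the threshold, while `a` ranks above `b` because it matches the query both semantically and by keyword.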
Vector Storage
- ChromaDB integration for persistent vector storage
- Scalable indexing for large document collections
- Metadata filtering and search optimization
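A minimal sketch of persistent storage with metadata filtering in ChromaDB might look like the following. It assumes a recent `chromadb` release that provides `PersistentClient`; the collection name `documents`, the `access` metadata key, and both helper functions are hypothetical and chosen only to illustrate the public vs. restricted distinction mentioned in the overview.

```python
def access_filter(user_can_see_restricted):
    """Build a metadata filter; None means no restriction is applied."""
    return None if user_can_see_restricted else {"access": "public"}


def index_and_query(path, ids, texts, vectors, metadatas, query_vector, where=None):
    """Store vectors in a persistent ChromaDB collection and run one query."""
    import chromadb  # imported lazily; requires the chromadb package

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name="documents")
    # upsert lets the same ids be re-indexed on a later run.
    collection.upsert(
        ids=ids, documents=texts, embeddings=vectors, metadatas=metadatas
    )
    return collection.query(
        query_embeddings=[query_vector],
        n_results=min(5, len(ids)),
        where=where,
    )
```

A caller without access to restricted material would pass `where=access_filter(False)`, so that only records whose metadata has `access` set to `public` are returned.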
Configuration
Key settings in config/settings.py:
# Embedding Model
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
EMBEDDING_DIMENSION = 384
# Search Parameters
MAX_SEARCH_RESULTS = 50
SIMILARITY_THRESHOLD = 0.3
# File Processing
MAX_FILE_SIZE = 50 * 1024 * 1024 # 50MB
ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]
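To show how the file-processing settings might be applied, here is a small sketch of an upload check. The two constants repeat the values above; `check_file` is a hypothetical helper, not a function known to exist in the project.

```python
from pathlib import Path

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB, as in the settings above
ALLOWED_EXTENSIONS = [".pdf", ".txt", ".docx", ".xml"]


def check_file(name, size_bytes):
    """Return (ok, reason) for a candidate file name and size."""
    suffix = Path(name).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        return False, f"unsupported extension: {suffix or '(none)'}"
    if size_bytes > MAX_FILE_SIZE:
        return False, "file exceeds MAX_FILE_SIZE"
    return True, "ok"
```

Lower-casing the suffix means `report.PDF` is accepted alongside `report.pdf`.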
Project Structure
semantic_search_poc/
├── README.md
├── requirements.txt
├── .env.example
├── config/
│ └── settings.py # Configuration settings
├── src/
│ ├── main.py # Main application entry point
│ ├── models/
│ │ ├── document.py # Document data models
│ │ └── search_result.py # Search result models
│ ├── services/
│ │ ├── document_processor.py # Document processing pipeline
│ │ ├── embedding_service.py # Embedding generation
│ │ ├── entity_extractor.py # Named entity recognition
│ │ ├── search_service.py # Main search functionality
│ │ └── vector_store.py # Vector database operations
│ └── utils/
│ ├── file_handlers.py # File processing utilities
│ └── logger.py # Logging configuration
├── data/
│ ├── raw/ # Input documents
│ ├── processed/ # Processed document metadata
│ └── embeddings/ # Vector embeddings storage
├── tests/ # Unit tests
├── notebooks/ # Jupyter notebooks for analysis
└── scripts/ # Utility scripts