From cebb4c2fe097dd3bb9c94654db790d52eb37e968 Mon Sep 17 00:00:00 2001
From: OwusuBlessing
Date: Tue, 29 Jul 2025 01:52:11 +0100
Subject: [PATCH] made first commit

---
 README.md | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 187 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..060a183
--- /dev/null
+++ b/README.md
@@ -0,0 +1,187 @@
+\
+
+# VoiceIQ – Intelligent Audio Command Understanding System
+
+VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.
+
+---
+
+## Project Goals
+
+* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
+* ✅ Classify spoken commands into **structured intents**
+* ✅ Serve predictions via a clean API or UI
+* ✅ Ensure modularity and production readiness
+
+---
+
+## Core Features
+
+| Feature | Description |
+| ---------------------- | -------------------------------------------------------------------------- |
+| Speech-to-text | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
+| Intent classifier | Classifies transcribed text into `action`, `object`, `location` |
+| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices |
+| CLI pipelines | One-command training, inference, and evaluation |
+| API + UI | FastAPI for RESTful endpoints; Streamlit demo included |
+| Notebooks | EDA, ASR error analysis, intent confusion reports |
+
+---
+
+## Project Structure
+
+
+
+---
+
+## Dataset: Fluent Speech Commands
+
+* 23,132 single-sentence voice commands (1–2 seconds)
+* Labels: `action`, `object`, `location`
+* Examples:
+
+  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
+  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
+
+---
+
+## Setup Instructions
+
+### 1. Create Environment
+
+```bash
+git clone https://github.com/your-org/voiceiq.git
+cd voiceiq
+python -m venv venv
+source venv/bin/activate  # or venv\Scripts\activate on Windows
+pip install -r requirements.txt
+```
+
+### 2. Download Dataset
+
+```bash
+# Download and unzip Fluent Speech Commands
+# Or use torchaudio.datasets.FluentSpeechCommands if available
+```
+
+---
+
+## How to Run Pipelines
+
+### 1. Train Pipeline
+
+```bash
+python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
+```
+
+### 2. Inference Pipeline
+
+```bash
+# Predict from a WAV file
+python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
+```
+
+### 3. Evaluation Pipeline
+
+```bash
+python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
+```
+
+---
+
+## API & UI Demos
+
+### FastAPI Server
+
+```bash
+uvicorn src.api.server:app --reload
+```
+
+* `POST /predict`: Upload audio and get the predicted intent
+* `GET /health`: System health check
+
+### Streamlit Demo
+
+```bash
+python run_demo.py
+```
+
+* Upload a `.wav` file or record live
+* View transcript, structured intent, and confidence scores
+
+---
+
+## Metrics Tracked
+
+| Metric | Description |
+| ----------------------- | ------------------------------------------------------ |
+| **WER** | Word Error Rate of the ASR transcripts |
+| **Intent Accuracy** | Exact-match accuracy over the full `(action, object, location)` triplet |
+| **F1 Scores** | Macro, micro, and per-label F1 |
+| **Confusion Matrix** | Action/Object/Location classification errors |
+
+---
+
+## Included Notebooks
+
+| Notebook | Purpose |
+| -------------------------------- | ----------------------------------------------- |
+| `01_audio_exploration.ipynb` | Visualize waveforms, mel spectrograms |
+| `02_asr_error_analysis.ipynb` | Compare Whisper vs Wav2Vec2 transcriptions |
+| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
+| `04_results_analysis.ipynb` | Plot confusion matrices and F1 breakdowns |
+
+---
+
+## Models Used
+
+* **Whisper Base (ASR)** – Robust transcription of short commands
+* **DistilBERT / BERT** – Text classification of transcripts (or any text classifier of your choice)
+* Optionally: Fine-tune **Whisper** for joint ASR+intent learning
+
+---
+
+## Design Decisions
+
+* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
+* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
+* **Separation of Concerns**: Clean division between preprocessing, training, and inference
+
+---
+
+## Potential Extensions
+
+* ✅ Real-time streaming inference
+* ✅ Speaker identification and voice embeddings
+* ✅ End-to-end fine-tuning of Whisper for direct audio → intent
+* ✅ Multilingual support via Whisper large models
+* ✅ Deployable microservice with Docker
+
+---
+
+## Example Usage
+
+### Voice Input:
+
+> "Turn on the fan in the bedroom"
+
+### Output:
+
+```json
+{
+  "transcript": "turn on the fan in the bedroom",
+  "intent": {
+    "action": "activate",
+    "object": "fan",
+    "location": "bedroom"
+  },
+  "confidence": {
+    "action": 0.98,
+    "object": 0.95,
+    "location": 0.93
+  }
+}
+```
+
+
+
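
The README's "Metrics Tracked" table names WER and full-triplet intent accuracy, but this patch contains no evaluation code. The following is a minimal sketch of how those two metrics could be computed, not the repository's implementation. It assumes transcripts are compared after lowercasing and whitespace tokenization, and that each intent is a dict shaped like the `intent` object in the example output. The names `SLOTS`, `word_error_rate`, and `intent_accuracy` are illustrative and do not come from the patch.

```python
from typing import Dict, List, Sequence

# Slot names taken from the README's (action, object, location) triplet.
SLOTS = ("action", "object", "location")


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: lowercase + whitespace tokenization; real evaluations often
    apply heavier text normalization before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Before row i is processed, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def intent_accuracy(
    predicted: Sequence[Dict[str, str]],
    expected: Sequence[Dict[str, str]],
) -> float:
    """Fraction of examples where all three slots match exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one example is required")
    hits = sum(
        all(p.get(slot) == e.get(slot) for slot in SLOTS)
        for p, e in zip(predicted, expected)
    )
    return hits / len(expected)


if __name__ == "__main__":
    print(word_error_rate("turn on the fan", "turn off the fan"))  # 0.25
    gold: List[Dict[str, str]] = [
        {"action": "activate", "object": "fan", "location": "bedroom"},
        {"action": "deactivate", "object": "lights", "location": "kitchen"},
    ]
    pred: List[Dict[str, str]] = [
        {"action": "activate", "object": "fan", "location": "bedroom"},
        {"action": "deactivate", "object": "lights", "location": "bedroom"},
    ]
    print(intent_accuracy(pred, gold))  # 0.5
```

Scoring the triplet as an exact match means one wrong slot fails the whole example, which is why the README lists per-label F1 separately: it shows which of the three slots is responsible for the misses.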