made first commit

2025-07-29 01:52:11 +01:00
commit cebb4c2fe0
1 changed files with 187 additions and 0 deletions
@@ -0,0 +1,187 @@
+
+# VoiceIQ – Intelligent Audio Command Understanding System
+
+VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.
+
+---
+
+## Project Goals
+
+* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
+* ✅ Classify spoken commands into **structured intents**
+* ✅ Serve predictions via a clean API or UI
+* ✅ Ensure modularity and production readiness
+
+---
+
+## Core Features
+
+| Feature             | Description                                                                |
+| ------------------- | -------------------------------------------------------------------------- |
+| Speech-to-text      | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
+| Intent classifier   | Classifies transcribed text into `action`, `object`, `location`            |
+| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices            |
+| CLI pipelines       | One-command training, inference, and evaluation                            |
+| API + UI            | FastAPI for RESTful endpoints; Streamlit demo included                     |
+| Notebooks           | EDA, ASR error analysis, intent confusion reports                          |
+
+---
+
+## Project Structure
+
+---
+
+## Dataset: Fluent Speech Commands
+
+* 23,132 single-sentence training utterances of 1–2 seconds each (30,043 in total across the train, validation, and test splits)
+* Labels: `action`, `object`, `location`
+* Examples:
+
+  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
+  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
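
The mapping from a spoken command to its three labels can be read from a label file along these lines. This is a minimal sketch using only the standard library: the column names, the `wavs/...` paths, and the helper `load_labels` are assumptions for illustration, not the dataset's or the repository's confirmed layout.

```python
import csv
import io

# Inline stand-in for a label file; column names and paths are assumed.
SAMPLE_CSV = """path,transcription,action,object,location
wavs/0001.wav,Turn on the lights in the kitchen,activate,lights,kitchen
wavs/0002.wav,Switch off the fan in the bedroom,deactivate,fan,bedroom
"""


def load_labels(csv_text: str):
    """Return (transcription, (action, object, location)) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["transcription"], (row["action"], row["object"], row["location"]))
        for row in reader
    ]


samples = load_labels(SAMPLE_CSV)
for text, triplet in samples:
    print(f"{text!r} -> {triplet}")
```

Keeping the three labels together as one tuple makes it straightforward to score the full triplet later, rather than each slot in isolation.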
+
+---
+
+## Setup Instructions
+
+### 1. Create Environment
+
+```bash
+git clone https://github.com/your-org/voiceiq.git
+cd voiceiq
+python -m venv venv
+source venv/bin/activate  # or venv\Scripts\activate on Windows
+pip install -r requirements.txt
+```
+
+### 2. Download Dataset
+
+```bash
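+# Download and unzip the Fluent Speech Commands dataset manually.
+# If torchaudio is installed, torchaudio.datasets.FluentSpeechCommands can
+# load the extracted folder; it does not download the data for you.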
+```
+
+---
+
+## How to Run Pipelines
+
+### 1. Train Pipeline
+
+```bash
+python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
+```
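
The two flags in the training command could be parsed with `argparse` roughly as follows. This is a sketch rather than the contents of `train_pipeline.py`: only the flag names come from the command above, while the helper name `parse_train_args`, the default experiment name, and the help strings are assumptions.

```python
import argparse
from pathlib import Path


def parse_train_args(argv=None):
    """Parse the flags shown in the training command (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Train the intent classifier")
    # Flag names mirror the command above; defaults and help text are assumed.
    parser.add_argument("--config", type=Path, required=True,
                        help="Path to a YAML training config")
    parser.add_argument("--experiment_name", default="default_run",
                        help="Label used to tag this training run")
    return parser.parse_args(argv)


# Passing argv explicitly keeps the example independent of the real command line.
args = parse_train_args([
    "--config", "configs/train_config.yaml",
    "--experiment_name", "whisper+bert_baseline",
])
print(args.config, args.experiment_name)
```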
+
+### 2. Inference Pipeline
+
+```bash
+# Predict from a WAV file
+python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
+```
+
+### 3. Evaluation Pipeline
+
+```bash
+python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
+```
+
+---
+
+## API & UI Demos
+
+### FastAPI Server
+
+```bash
+uvicorn src.api.server:app --reload
+```
+
+* `POST /predict` — Upload audio and get predicted intent
+* `GET /health` — System health check
+
+### Streamlit Demo
+
+```bash
+python run_demo.py
+```
+
+* Upload `.wav` or record live
+* View transcript, structured intent, and confidence scores
+
+---
+
+## Metrics Tracked
+
+| Metric               | Description                                            |
+| -------------------- | ------------------------------------------------------ |
+| **WER**              | Word Error Rate from ASR                               |
+| **Intent Acc**       | Accuracy for full `(action, object, location)` triplet |
+| **F1 Scores**        | Macro, micro, and per-label F1                         |
+| **Confusion Matrix** | Action/Object/Location classification errors           |
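
Two of the metrics in the table can be computed with a few lines of plain Python. The sketch below shows WER as word-level edit distance divided by the reference length, and intent accuracy as exact match on the whole triplet. The function names are assumptions, and the evaluation pipeline may well rely on a library instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(gold, predicted) -> float:
    """Fraction of samples whose (action, object, location) all match."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


wer = word_error_rate("turn on the lights in the kitchen",
                      "turn on the light in kitchen")
acc = triplet_accuracy(
    [("activate", "lights", "kitchen"), ("deactivate", "lights", "bedroom")],
    [("activate", "lights", "kitchen"), ("deactivate", "lamp", "bedroom")],
)
print(f"WER={wer:.3f} intent_acc={acc:.2f}")
```

Exact-match accuracy is deliberately strict: a prediction with two correct slots and one wrong slot counts as an error, which is why per-slot F1 and confusion matrices are tracked alongside it.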
+
+---
+
+## Included Notebooks
+
+| Notebook                         | Purpose                                         |
+| -------------------------------- | ----------------------------------------------- |
+| `01_audio_exploration.ipynb`     | Visualize waveforms, mel spectrograms           |
+| `02_asr_error_analysis.ipynb`    | Compare Whisper vs Wav2Vec2 transcriptions      |
+| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
+| `04_results_analysis.ipynb`      | Plot confusion matrices and F1 breakdowns       |
+
+---
+
+## Models Used
+
+* **Whisper Base (ASR)** – Robust transcription of short commands
+* **DistilBERT / BERT** – Text classification of transcripts (or any text classifier of your choice)
+* Optionally: Fine-tune **Whisper** for joint ASR+intent learning
+
+---
+
+## Design Decisions
+
+* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
+* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
+* **Separation of Concerns**: Clean division between preprocessing, training, and inference
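
One way to get swappable components is to have the pipeline depend on small interfaces rather than concrete models. The sketch below illustrates that idea with `typing.Protocol`. All class names are hypothetical, and the two stand-in components are toys that never touch audio; real implementations would wrap Whisper and a BERT-style classifier.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Intent:
    action: str
    object: str
    location: str


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    def classify(self, transcript: str) -> Intent: ...


class SpeechToIntentPipeline:
    """Chains any transcriber with any classifier that satisfies the protocols."""

    def __init__(self, asr: Transcriber, classifier: IntentClassifier) -> None:
        self.asr = asr
        self.classifier = classifier

    def predict(self, audio_path: str) -> Intent:
        return self.classifier.classify(self.asr.transcribe(audio_path))


class FixedTranscriber:
    """Toy ASR stand-in that ignores the audio and returns a fixed transcript."""

    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio_path: str) -> str:
        return self.text


class KeywordClassifier:
    """Toy classifier stand-in based on keyword lookup."""

    def classify(self, transcript: str) -> Intent:
        words = transcript.lower().split()
        action = "deactivate" if "off" in words else "activate"
        obj = next((w for w in words if w in {"lights", "fan"}), "none")
        location = next((w for w in words if w in {"kitchen", "bedroom"}), "none")
        return Intent(action, obj, location)


pipeline = SpeechToIntentPipeline(
    FixedTranscriber("Turn on the fan in the bedroom"), KeywordClassifier()
)
result = pipeline.predict("data/test_audio/command.wav")
print(result)
```

Because `SpeechToIntentPipeline` only calls `transcribe` and `classify`, replacing either component does not require changes to the pipeline itself.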
+
+---
+
+## Potential Extensions
+
+* Real-time streaming inference
+* Speaker identification and voice embeddings
+* End-to-end fine-tuning of Whisper for direct audio → intent
+* Multilingual support via Whisper large models
+* Deployable microservice with Docker
+
+---
+
+## Example Usage
+
+### Voice Input:
+
+> "Turn on the fan in the bedroom"
+
+### Output:
+
+```json
+{
+  "transcript": "turn on the fan in the bedroom",
+  "intent": {
+    "action": "activate",
+    "object": "fan",
+    "location": "bedroom"
+  },
+  "confidence": {
+    "action": 0.98,
+    "object": 0.95,
+    "location": 0.93
+  }
+}
+```
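
A client consuming this response might parse it and decide whether to act based on the per-slot confidences. The following is a sketch under assumptions: the `0.9` threshold and the helper `accept_prediction` are illustrative choices, and only the JSON shape is taken from the output above.

```python
import json

# Same shape as the example output above.
RESPONSE = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""


def accept_prediction(payload: dict, threshold: float = 0.9) -> bool:
    """Accept only if every intent slot meets the confidence threshold."""
    confidence = payload["confidence"]
    return all(confidence[slot] >= threshold for slot in payload["intent"])


payload = json.loads(RESPONSE)
# The least confident slot is the natural candidate for a follow-up question.
weakest_slot = min(payload["confidence"], key=payload["confidence"].get)
print(weakest_slot, accept_prediction(payload))
```

Gating on the weakest slot rather than an average avoids acting on a command where one component, such as the location, is uncertain.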
+
+
+