# VoiceIntent – Intelligent Audio Command Understanding System

VoiceIntent is a modular speech-to-intent classification system: it takes raw voice input, transcribes it with a pre-trained ASR model, and classifies the transcript into structured intent components (`action`, `object`, `location`). It is designed for smart assistants, hands-free interfaces, and voice-based automation systems.

---

## Project Goals

* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
* ✅ Classify spoken commands into **structured intents**
* ✅ Serve predictions via a clean API or UI
* ✅ Ensure modularity and production readiness

---

## Core Features

| Feature             | Description                                                                |
| ------------------- | -------------------------------------------------------------------------- |
| Speech-to-text      | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
| Intent classifier   | Classifies transcribed text into `action`, `object`, `location`            |
| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices            |
| CLI pipelines       | One-command training, inference, and evaluation                            |
| API + UI            | FastAPI for RESTful endpoints; Streamlit demo included                     |
| Notebooks           | EDA, ASR error analysis, intent confusion reports                          |

---

## Project Structure


---

## Dataset: Fluent Speech Commands

* 30,043 single-sentence voice commands in total, 23,132 of them in the training split (1–2 seconds each)
* Labels: `action`, `object`, `location`
* Examples:

  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
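
As a rough sketch of how these three labels might be handled in code, the snippet below reads label rows into a small dataclass. It assumes a CSV with `path`, `transcription`, `action`, `object`, and `location` columns. The column names and the `Intent` and `load_intents` names are illustrative assumptions, so adjust them to match your copy of the data.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Intent:
    """One labeled command: the three slots described above."""
    action: str
    object: str
    location: str


def load_intents(lines: Iterable[str]) -> List[Intent]:
    """Parse CSV lines (header included) into Intent records.

    Assumes columns named action, object, and location exist.
    """
    reader = csv.DictReader(lines)
    return [
        Intent(row["action"], row["object"], row["location"])
        for row in reader
    ]


# Tiny in-memory sample standing in for a real label file.
sample = io.StringIO(
    "path,transcription,action,object,location\n"
    "a.wav,Turn on the lights in the kitchen,activate,lights,kitchen\n"
    "b.wav,Switch off the fan in the bedroom,deactivate,fan,bedroom\n"
)
intents = load_intents(sample)
print(intents[0])
```

For a real file, pass an open file handle (for example `open(csv_path, newline="")`) instead of the in-memory sample.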

---

## Setup Instructions

### 1. Create Environment

```bash
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

### 2. Download Dataset

```bash
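# Download and unzip Fluent Speech Commands manually.
# torchaudio.datasets.FluentSpeechCommands (if your torchaudio version includes it)
# can then load the extracted folder; it does not download the data itself.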
```

---

## How to Run Pipelines

### 1. Train Pipeline

```bash
python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
```
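
For orientation, here is a minimal sketch of how a script could accept the two flags shown above. It is an assumption about the interface, not the actual contents of `train_pipeline.py`; the `parse_args` helper and the default value are hypothetical.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two flags used in the training command (sketch only)."""
    parser = argparse.ArgumentParser(description="Train the intent pipeline.")
    parser.add_argument("--config", required=True,
                        help="Path to a YAML training config.")
    parser.add_argument("--experiment_name", default="default",
                        help="Label used to group artifacts from this run.")
    return parser.parse_args(argv)


# Passing a list mirrors the command line without touching sys.argv.
args = parse_args([
    "--config", "configs/train_config.yaml",
    "--experiment_name", "whisper+bert_baseline",
])
print(args.config, args.experiment_name)
```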

### 2. Inference Pipeline

```bash
# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
```

### 3. Evaluation Pipeline

```bash
python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
```

---

## API & UI Demos

### FastAPI Server

```bash
uvicorn src.api.server:app --reload
```

* `POST /predict` — Upload audio and get predicted intent
* `GET /health` — System health check

### Streamlit Demo

```bash
python run_demo.py
```

* Upload `.wav` or record live
* View transcript, structured intent, and confidence scores

---

## Metrics Tracked

| Metric               | Description                                            |
| -------------------- | ------------------------------------------------------ |
| **WER**              | Word Error Rate from ASR                               |
| **Intent Acc**       | Accuracy for full `(action, object, location)` triplet |
| **F1 Scores**        | Macro, micro, and per-label F1                         |
| **Confusion Matrix** | Action/Object/Location classification errors           |
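
To make the first two metrics concrete, the following is a minimal pure-Python sketch of WER (word-level edit distance divided by reference length) and full-triplet accuracy. The function names are illustrative and not taken from this repository; the evaluation pipeline may compute these differently, for example with a dedicated library.

```python
from typing import Sequence, Tuple

Triplet = Tuple[str, str, str]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(gold: Sequence[Triplet], pred: Sequence[Triplet]) -> float:
    """Fraction of examples where action, object, and location all match."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


wer = word_error_rate("turn on the fan", "turn the fan")
acc = triplet_accuracy(
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")],
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")],
)
print(wer, acc)
```

Note that triplet accuracy is strict: a prediction with two correct slots and one wrong slot counts as a miss, so it will usually be lower than any single-slot accuracy.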

---

## Included Notebooks

| Notebook                         | Purpose                                         |
| -------------------------------- | ----------------------------------------------- |
| `01_audio_exploration.ipynb`     | Visualize waveforms, mel spectrograms           |
| `02_asr_error_analysis.ipynb`    | Compare Whisper vs Wav2Vec2 transcriptions      |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb`      | Plot confusion matrices and F1 breakdowns       |

---

## Models Used

* **Whisper Base (ASR)** – Robust transcription of short commands
* **DistilBERT / BERT** – Text classification of transcripts (or any other text classifier of your choice)
* Optionally: Fine-tune **Whisper** for joint ASR+intent learning

---

## Design Decisions

* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
* **Separation of Concerns**: Clean division between preprocessing, training, and inference
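
One way these decisions could fit together is sketched below: components satisfy small interfaces, and a config dictionary (standing in for a parsed YAML file) selects which implementation to build. Every name here (`Transcriber`, `IntentClassifier`, the registries, `build_pipeline`, and the stand-in components) is hypothetical and only illustrates the idea; it is not the project's actual code.

```python
from typing import Callable, Dict, Protocol


class Transcriber(Protocol):
    """Anything that turns an audio file into text."""

    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    """Anything that maps text to the three intent slots."""

    def classify(self, text: str) -> Dict[str, str]: ...


class FixedTranscriber:
    """Stand-in ASR that always returns the same transcript."""

    def transcribe(self, audio_path: str) -> str:
        return "turn on the fan in the bedroom"


class KeywordClassifier:
    """Toy classifier based on keyword lookup, for illustration only."""

    def classify(self, text: str) -> Dict[str, str]:
        words = text.lower().split()
        return {
            "action": "deactivate" if "off" in words else "activate",
            "object": next((w for w in words if w in {"fan", "lights"}), "none"),
            "location": next((w for w in words if w in {"kitchen", "bedroom"}), "none"),
        }


# Registries map names that could appear in a config file to factories.
ASR_REGISTRY: Dict[str, Callable[[], Transcriber]] = {"fixed": FixedTranscriber}
NLP_REGISTRY: Dict[str, Callable[[], IntentClassifier]] = {"keyword": KeywordClassifier}


def build_pipeline(config: Dict[str, str]) -> Callable[[str], Dict[str, object]]:
    """Look up both components by name and chain them together."""
    asr = ASR_REGISTRY[config["asr"]]()
    nlp = NLP_REGISTRY[config["classifier"]]()

    def run(audio_path: str) -> Dict[str, object]:
        transcript = asr.transcribe(audio_path)
        return {"transcript": transcript, "intent": nlp.classify(transcript)}

    return run


pipeline = build_pipeline({"asr": "fixed", "classifier": "keyword"})
result = pipeline("data/test_audio/command.wav")
print(result)
```

Swapping in a different ASR model then means registering one more factory and changing a single config value, with no change to the pipeline code.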

---

## Potential Extensions

* Real-time streaming inference
* Speaker identification and voice embeddings
* End-to-end fine-tuning of Whisper for direct audio → intent
* Multilingual support via Whisper large models
* Deployable microservice with Docker

---

## Example Usage

### Voice Input:

> "Turn on the fan in the bedroom"

### Output:

```json
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
```
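
A response of this shape could be assembled from per-slot class probabilities roughly as follows. This sketch assumes each confidence value is the probability of the top-scoring label for that slot; the source does not state how confidence is computed, and `build_response` is a hypothetical helper.

```python
import json
from typing import Dict


def build_response(transcript: str,
                   slot_probs: Dict[str, Dict[str, float]]) -> Dict[str, object]:
    """Pick the most likely label per slot and report its probability."""
    intent: Dict[str, str] = {}
    confidence: Dict[str, float] = {}
    for slot, probs in slot_probs.items():
        label = max(probs, key=probs.get)  # highest-probability label
        intent[slot] = label
        confidence[slot] = round(probs[label], 2)
    return {"transcript": transcript, "intent": intent, "confidence": confidence}


# Made-up probabilities chosen to reproduce the example output above.
response = build_response(
    "turn on the fan in the bedroom",
    {
        "action": {"activate": 0.98, "deactivate": 0.02},
        "object": {"fan": 0.95, "lights": 0.05},
        "location": {"bedroom": 0.93, "kitchen": 0.07},
    },
)
print(json.dumps(response, indent=2))
```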