# VoiceIQ – Intelligent Audio Command Understanding System

VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.

---

## Project Goals

* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
* ✅ Classify spoken commands into **structured intents**
* ✅ Serve predictions via a clean API or UI
* ✅ Ensure modularity and production readiness

---

## Core Features

| Feature                | Description                                                                |
| ---------------------- | -------------------------------------------------------------------------- |
| Speech-to-text         | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
| Intent classifier      | Classifies transcribed text into `action`, `object`, `location`            |
| Evaluation pipeline    | Tracks WER, accuracy, precision, recall, and confusion matrices            |
| CLI pipelines          | One-command training, inference, and evaluation                            |
| API + UI               | FastAPI for RESTful endpoints; Streamlit demo included                     |
| Notebooks              | EDA, ASR error analysis, intent confusion reports                          |

---

## Project Structure

---

## Dataset: Fluent Speech Commands

* 23,132 single-sentence voice commands in the training split (1–2 seconds each)
* Labels: `action`, `object`, `location`
* Examples:

  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`

---

## Setup Instructions

### 1. Create Environment

```bash
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

### 2. Download Dataset

```bash
# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available
```
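Once the archive is unpacked, the label files can be read with the standard library alone. The sketch below is illustrative rather than part of this repository: `LabeledCommand` and `load_labels` are hypothetical names, and the column names (`path`, `transcription`, `action`, `object`, `location`) are an assumption to verify against your copy of the CSV files. An in-memory sample stands in for a real file, so the block runs without the dataset.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class LabeledCommand(NamedTuple):
    """One labeled utterance (hypothetical container for this sketch)."""
    path: str
    transcription: str
    action: str
    object: str
    location: str


def load_labels(csv_lines: Iterable[str]) -> List[LabeledCommand]:
    """Read label rows from an open CSV file or any iterable of lines.

    Assumes a header row containing the five column names used below.
    """
    reader = csv.DictReader(csv_lines)
    return [
        LabeledCommand(
            path=row["path"],
            transcription=row["transcription"],
            action=row["action"],
            object=row["object"],
            location=row["location"],
        )
        for row in reader
    ]


# Made-up rows that mirror the README examples; not real dataset entries.
sample = io.StringIO(
    "path,transcription,action,object,location\n"
    "wavs/a.wav,Turn on the lights in the kitchen,activate,lights,kitchen\n"
    "wavs/b.wav,Switch off the fan in the bedroom,deactivate,fan,bedroom\n"
)
commands = load_labels(sample)
print(commands[0].action, commands[0].object, commands[0].location)
```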

---

## How to Run Pipelines

### 1. Train Pipeline

```bash
python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
```
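For readers wiring up their own entry point, the two flags in the command above could be parsed as in the following sketch. It is an assumption about how such a script might look, not the actual contents of `train_pipeline.py`; `build_parser` and the default experiment name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags shown in the training command."""
    parser = argparse.ArgumentParser(description="Train the ASR + intent pipeline.")
    parser.add_argument("--config", required=True, help="Path to a YAML training config")
    parser.add_argument(
        "--experiment_name",
        default="default",  # hypothetical fallback label
        help="Label used to group artifacts from this run",
    )
    return parser


# Parse the same arguments as the shell command above, without touching sys.argv.
args = build_parser().parse_args(
    ["--config", "configs/train_config.yaml", "--experiment_name", "whisper+bert_baseline"]
)
print(args.config, args.experiment_name)
```

Passing an explicit argument list to `parse_args` keeps the sketch runnable in a notebook or test, where `sys.argv` belongs to another program.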

### 2. Inference Pipeline

```bash
# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
```
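Conceptually, inference chains two stages: audio is transcribed, then the transcript is classified into intent slots. The sketch below shows one way to keep those stages swappable by passing them in as plain callables. Everything in it is hypothetical (`SpeechToIntentPipeline`, `fake_transcribe`, `keyword_classify`); the stand-in functions replace real Whisper and BERT calls so the block runs without model weights.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Transcriber = Callable[[str], str]            # audio path -> transcript
Classifier = Callable[[str], Dict[str, str]]  # transcript -> intent slots


@dataclass
class SpeechToIntentPipeline:
    """Chain an ASR stage and an intent stage; either can be replaced."""
    transcribe: Transcriber
    classify: Classifier

    def predict(self, audio_path: str) -> dict:
        transcript = self.transcribe(audio_path)
        return {"transcript": transcript, "intent": self.classify(transcript)}


def fake_transcribe(audio_path: str) -> str:
    # Stand-in for an ASR model: ignores the file and returns a fixed transcript.
    return "turn on the fan in the bedroom"


def keyword_classify(transcript: str) -> Dict[str, str]:
    # Stand-in for a trained classifier: only the action depends on the text.
    action = "deactivate" if " off " in f" {transcript} " else "activate"
    return {"action": action, "object": "fan", "location": "bedroom"}


pipeline = SpeechToIntentPipeline(fake_transcribe, keyword_classify)
result = pipeline.predict("data/test_audio/command.wav")
print(result)
```

Swapping in a different ASR model or classifier then means passing a different callable, with no change to `predict`.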

### 3. Evaluation Pipeline

```bash
python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
```
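Two of the reported metrics are simple enough to sketch directly: word error rate (word-level edit distance divided by the reference length) and exact-match accuracy over the full `(action, object, location)` triplet. This is a minimal reference sketch, not necessarily how `evaluation_pipeline.py` computes them; the function names are hypothetical, and lowercasing plus whitespace splitting is an assumed normalization.

```python
from typing import Sequence, Tuple

Triplet = Tuple[str, str, str]  # (action, object, location)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(gold: Sequence[Triplet], predicted: Sequence[Triplet]) -> float:
    """Fraction of examples where all three slots match exactly."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# One substitution (lights -> light) and one deletion (the) over 7 reference words.
wer = word_error_rate("turn on the lights in the kitchen", "turn on the light in kitchen")

gold = [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")]
pred = [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")]
acc = triplet_accuracy(gold, pred)  # second triplet has the wrong location
print(round(wer, 3), acc)
```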

---

## API & UI Demos

### FastAPI Server

```bash
uvicorn src.api.server:app --reload
```

* `POST /predict`: Upload audio and get the predicted intent
* `GET /health`: System health check

### Streamlit Demo

```bash
python run_demo.py
```

* Upload `.wav` or record live
* View transcript, structured intent, and confidence scores

---

## Metrics Tracked

| Metric                  | Description                                            |
| ----------------------- | ------------------------------------------------------ |
| **WER**                 | Word Error Rate from ASR                               |
| **Intent Acc**          | Accuracy for full `(action, object, location)` triplet |
| **F1 Scores**           | Macro, micro, and per-label F1                         |
| **Confusion Matrix**    | Action/Object/Location classification errors           |

---

## Included Notebooks

| Notebook                         | Purpose                                         |
| -------------------------------- | ----------------------------------------------- |
| `01_audio_exploration.ipynb`     | Visualize waveforms, mel spectrograms           |
| `02_asr_error_analysis.ipynb`    | Compare Whisper vs Wav2Vec2 transcriptions      |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb`      | Plot confusion matrices and F1 breakdowns       |

---

## Models Used

* **Whisper Base (ASR)** – Robust transcription of short commands
* **DistilBERT / BERT** – Text classification of transcripts (or any other text classifier of your choice)
* Optionally: Fine-tune **Whisper** for joint ASR+intent learning

---

## Design Decisions

* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
* **Separation of Concerns**: Clean division between preprocessing, training, and inference

---

## Potential Extensions

* ✅ Real-time streaming inference
* ✅ Speaker identification and voice embeddings
* ✅ End-to-end fine-tuning of Whisper for direct audio → intent
* ✅ Multilingual support via Whisper large models
* ✅ Deployable microservice with Docker

---

## Example Usage

### Voice Input:

> "Turn on the fan in the bedroom"

### Output:

```json
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
```
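A caller will often want to act only on slots the model is reasonably sure about. The sketch below parses the output shown above and keeps the slots whose confidence meets a threshold. The helper name `accepted_slots` and the threshold value are hypothetical choices for illustration, not part of the project's API.

```python
import json

# The example output from this section, as it might arrive over the wire.
response_text = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""


def accepted_slots(response: dict, threshold: float) -> dict:
    """Return only the intent slots whose confidence is at least `threshold`."""
    return {
        slot: value
        for slot, value in response["intent"].items()
        if response["confidence"][slot] >= threshold
    }


response = json.loads(response_text)
confident = accepted_slots(response, threshold=0.94)  # threshold chosen arbitrarily
print(confident)
```

With a threshold of 0.94, `location` (0.93) is dropped, so a caller could ask the user to confirm the room before acting.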