2025-07-29 01:52:11 +01:00
# VoiceIQ: Intelligent Audio Command Understanding System
VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.

---
## Project Goals
* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
* ✅ Classify spoken commands into **structured intents**
* ✅ Serve predictions via a clean API or UI
* ✅ Ensure modularity and production readiness
---
## Core Features
| Feature | Description |
| ---------------------- | -------------------------------------------------------------------------- |
| Speech-to-text | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
| Intent classifier | Classifies transcribed text into `action`, `object`, `location` |
| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices |
| CLI pipelines | One-command training, inference, and evaluation |
| API + UI | FastAPI for RESTful endpoints; Streamlit demo included |
| Notebooks | EDA, ASR error analysis, intent confusion reports |
---
## Project Structure
---
## Dataset: Fluent Speech Commands
* 23,132 single-sentence voice commands in the training split (each about 1-2 seconds long)
* Labels: `action`, `object`, `location`
* Examples:
  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
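
The three-part label scheme maps naturally onto a small record type. The following is a minimal sketch, assuming a hypothetical `Intent` class; the class name and its `as_tuple` helper are illustrative and not taken from the repository.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intent:
    """One structured command label (hypothetical helper, not project code)."""

    action: str
    object: str
    location: str

    def as_tuple(self) -> Tuple[str, str, str]:
        """Return the label as an (action, object, location) triplet."""
        return (self.action, self.object, self.location)


# The two example commands from the list above, expressed as records.
kitchen_lights = Intent(action="activate", object="lights", location="kitchen")
bedroom_fan = Intent(action="deactivate", object="fan", location="bedroom")

print(asdict(kitchen_lights))
print(bedroom_fan.as_tuple())
```

Freezing the dataclass makes each label hashable, so triplets can be used directly as dictionary keys or set members when counting classes.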
---
## Setup Instructions
### 1. Create Environment
```bash
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```
### 2. Download Dataset
```bash
# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available
```
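
Once the archive is unpacked, the labels can be read with the standard library alone. The sketch below assumes the split files are CSVs with `path`, `transcription`, `action`, `object`, and `location` columns; check the header of your copy, since this layout is an assumption about the dataset release rather than something this README specifies. The `load_commands` helper and the in-memory sample row are illustrative.

```python
import csv
import io


def load_commands(csv_file):
    """Yield (wav_path, transcription, (action, object, location)) per row.

    Column names are assumed; adjust them to match your copy of the dataset.
    """
    for row in csv.DictReader(csv_file):
        labels = (row["action"], row["object"], row["location"])
        yield row["path"], row["transcription"], labels


# Tiny in-memory stand-in for a real split file, so the sketch runs
# without the dataset on disk. The row itself is made up for illustration.
sample = io.StringIO(
    "path,speakerId,transcription,action,object,location\n"
    "wavs/speakers/demo/0001.wav,demo,"
    "Turn on the lights in the kitchen,activate,lights,kitchen\n"
)

rows = list(load_commands(sample))
print(rows[0])
```

For a real file, pass an open handle instead of the `StringIO` object, for example `open(csv_path, newline="")`.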
---
## How to Run Pipelines
### 1. Train Pipeline
```bash
python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
```
### 2. Inference Pipeline
```bash
# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
```
### 3. Evaluation Pipeline
```bash
python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
```
---
## API & UI Demos
### FastAPI Server
```bash
uvicorn src.api.server:app --reload
```
* `POST /predict` — Upload audio and get predicted intent
* `GET /health` — System health check
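
A client for the `/predict` endpoint can be written with the standard library. This is a sketch under stated assumptions: the form field name `file` and the `audio/wav` content type are guesses about how the server expects the upload, and the URL assumes uvicorn's default address of `127.0.0.1:8000`. Both helper functions are hypothetical.

```python
import json
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict(wav_path: str, url: str = "http://127.0.0.1:8000/predict") -> dict:
    """POST a WAV file to the server and return the decoded JSON response."""
    with open(wav_path, "rb") as fh:
        # "file" is an assumed field name; match it to the server's signature.
        body, content_type = build_multipart("file", "command.wav", fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `predict("data/test_audio/command.wav")` would return the parsed response as a dictionary.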
### Streamlit Demo
```bash
python run_demo.py
```
* Upload `.wav` or record live
* View transcript, structured intent, and confidence scores
---
## Metrics Tracked
| Metric | Description |
| ----------------------- | ------------------------------------------------------ |
| **WER** | Word Error Rate from ASR |
| **Intent Acc** | Accuracy for full `(action, object, location)` triplet |
| **F1 Scores** | Macro, micro, and per-label F1 |
| **Confusion Matrix** | Action/Object/Location classification errors |
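
To make the first two metrics concrete, here is a minimal standard-library sketch. WER is computed as the word-level edit distance divided by the number of reference words, and intent accuracy counts a prediction as correct only when all three slots match. The function names are illustrative and the project's evaluation code may normalise text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programme for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def intent_accuracy(gold, predicted) -> float:
    """Fraction of samples whose full (action, object, location) triplet matches."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


wer = word_error_rate("turn on the fan in the bedroom", "turn on the van in bedroom")
acc = intent_accuracy(
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")],
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")],
)
print(f"WER={wer:.3f}, intent accuracy={acc:.2f}")
```

In the example, one substitution and one missing word against seven reference words give a WER of 2/7, and one fully correct triplet out of two gives an accuracy of 0.5.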
---
## Included Notebooks
| Notebook | Purpose |
| -------------------------------- | ----------------------------------------------- |
| `01_audio_exploration.ipynb` | Visualize waveforms, mel spectrograms |
| `02_asr_error_analysis.ipynb` | Compare Whisper vs Wav2Vec2 transcriptions |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb` | Plot confusion matrices and F1 breakdowns |
---
## Models Used
* **Whisper Base (ASR)**: Robust transcription of short commands
* **DistilBERT / BERT** (or any text classifier of your choice): Classifies transcripts into intents
* Optionally: Fine-tune **Whisper** for joint ASR+intent learning
---
## Design Decisions
* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
* **Separation of Concerns**: Clean division between preprocessing, training, and inference
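
One way to realise swappable components is to have the pipeline depend on small interfaces rather than concrete models. The sketch below is an assumption about how that could look, not the repository's actual code: `Transcriber`, `IntentClassifier`, `SpeechToIntentPipeline`, and the two toy stand-ins are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    def classify(self, text: str) -> Dict[str, str]: ...


@dataclass
class SpeechToIntentPipeline:
    """Chains any transcriber with any classifier."""

    asr: Transcriber
    classifier: IntentClassifier

    def run(self, audio_path: str) -> Dict[str, object]:
        transcript = self.asr.transcribe(audio_path)
        return {"transcript": transcript, "intent": self.classifier.classify(transcript)}


class FixedTranscriber:
    """Toy stand-in for an ASR model: always returns the same text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio_path: str) -> str:
        return self.text


class KeywordClassifier:
    """Toy rule-based stand-in for a trained text classifier."""

    def classify(self, text: str) -> Dict[str, str]:
        words = text.lower().split()
        return {
            "action": "deactivate" if "off" in words else "activate",
            "object": next((w for w in words if w in {"lights", "fan"}), "none"),
            "location": next((w for w in words if w in {"kitchen", "bedroom"}), "none"),
        }


pipeline = SpeechToIntentPipeline(
    asr=FixedTranscriber("turn on the fan in the bedroom"),
    classifier=KeywordClassifier(),
)
result = pipeline.run("data/test_audio/command.wav")
print(result)
```

Because the pipeline only relies on the `transcribe` and `classify` methods, a Whisper wrapper or a BERT-based classifier could replace the stand-ins without changing `SpeechToIntentPipeline`.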
---
## Potential Extensions
* ✅ Real-time streaming inference
* ✅ Speaker identification and voice embeddings
* ✅ End-to-end fine-tuning of Whisper for direct audio → intent
* ✅ Multilingual support via Whisper large models
* ✅ Deployable microservice with Docker
---
## Example Usage
### Voice Input:
> "Turn on the fan in the bedroom"
### Output:
```json
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
```
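
A consumer of this response will often want to act only when every slot is predicted confidently. The following sketch parses the example output and applies a per-slot threshold; the `accept_intent` helper and the 0.9 default are illustrative assumptions, not part of the project.

```python
import json

# The example response shown above, as it would arrive over the wire.
response_text = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""

SLOTS = ("action", "object", "location")


def accept_intent(response: dict, threshold: float = 0.9):
    """Return the (action, object, location) triplet if every slot's
    confidence meets the threshold; otherwise return None."""
    confidence = response["confidence"]
    if min(confidence[slot] for slot in SLOTS) < threshold:
        return None
    return tuple(response["intent"][slot] for slot in SLOTS)


response = json.loads(response_text)
print(accept_intent(response))        # lowest confidence is 0.93, so accepted
print(accept_intent(response, 0.95))  # location falls below 0.95, so rejected
```

Gating on the weakest slot is a conservative choice; an application could instead ask the user to confirm only the low-confidence slot.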