# VoiceIQ: Intelligent Audio Command Understanding System
VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.
---
## Project Goals
* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
* ✅ Classify spoken commands into **structured intents**
* ✅ Serve predictions via a clean API or UI
* ✅ Ensure modularity and production readiness
---
## Core Features
| Feature | Description |
| ---------------------- | -------------------------------------------------------------------------- |
| Speech-to-text | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
| Intent classifier | Classifies transcribed text into `action`, `object`, `location` |
| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices |
| CLI pipelines | One-command training, inference, and evaluation |
| API + UI | FastAPI for RESTful endpoints; Streamlit demo included |
| Notebooks | EDA, ASR error analysis, intent confusion reports |
---
## Project Structure
---
## Dataset: Fluent Speech Commands
* 23,132 single-sentence voice commands (1-2 seconds each)
* Labels: `action`, `object`, `location`
* Examples:
  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
---
## Setup Instructions
### 1. Create Environment
```bash
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```
### 2. Download Dataset
```bash
# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available
```
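
Once the dataset is unpacked, each split is typically described by a CSV file that pairs an audio path with its labels. The sketch below shows one way to read such a file using only the standard library. The column names (`path`, `transcription`, `action`, `object`, `location`) and the `Utterance` class are assumptions made for illustration; check them against the header of the files you actually downloaded.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    path: str           # relative path to the WAV file
    transcription: str  # reference transcript
    action: str
    obj: str            # the "object" label (renamed to avoid shadowing the builtin)
    location: str


def read_utterances(lines: Iterable[str]) -> List[Utterance]:
    """Parse rows of a split CSV (assumed column names) into Utterance records."""
    reader = csv.DictReader(lines)
    return [
        Utterance(
            path=row["path"],
            transcription=row["transcription"],
            action=row["action"],
            obj=row["object"],
            location=row["location"],
        )
        for row in reader
    ]


# Tiny in-memory sample standing in for a real split file (illustrative rows only).
SAMPLE_CSV = """path,speakerId,transcription,action,object,location
wavs/a.wav,spk1,Turn on the lights in the kitchen,activate,lights,kitchen
wavs/b.wav,spk2,Switch off the fan in the bedroom,deactivate,fan,bedroom
"""

utterances = read_utterances(io.StringIO(SAMPLE_CSV))
print(utterances[0])
```

To read a real file, pass an open file handle instead of the `StringIO` object, for example `open(csv_path, newline="")`.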
---
## How to Run Pipelines
### 1. Train Pipeline
```bash
python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
```
### 2. Inference Pipeline
```bash
# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
```
### 3. Evaluation Pipeline
```bash
python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
```
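
For orientation, here is a minimal sketch of how an entry point could parse the inference flags shown above. It is not the project's actual `inference_pipeline.py`; the `.wav` extension check and the function names `build_parser` and `parse_args` are assumptions added for illustration.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the two flags used by the inference command."""
    parser = argparse.ArgumentParser(description="Predict intent from a WAV file")
    parser.add_argument("--model_path", type=Path, required=True,
                        help="Path to a trained checkpoint, e.g. models/best.pt")
    parser.add_argument("--audio_file", type=Path, required=True,
                        help="Path to the WAV file to classify")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse flags and reject non-WAV inputs early (hypothetical validation)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.audio_file.suffix.lower() != ".wav":
        parser.error(f"expected a .wav file, got {args.audio_file}")
    return args


# Passing an explicit list makes the parser easy to exercise without a shell.
args = parse_args([
    "--model_path", "models/best.pt",
    "--audio_file", "data/test_audio/command.wav",
])
print(args.model_path, args.audio_file)
```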
---
## API & UI Demos
### FastAPI Server
```bash
uvicorn src.api.server:app --reload
```
* `POST /predict` — Upload audio and get predicted intent
* `GET /health` — System health check
### Streamlit Demo
```bash
python run_demo.py
```
* Upload `.wav` or record live
* View transcript, structured intent, and confidence scores
---
## Metrics Tracked
| Metric | Description |
| ----------------------- | ------------------------------------------------------ |
| **WER** | Word Error Rate from ASR |
| **Intent Acc** | Accuracy for full `(action, object, location)` triplet |
| **F1 Scores** | Macro, micro, and per-label F1 |
| **Confusion Matrix** | Action/Object/Location classification errors |
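
To make the first two metrics concrete, the sketch below computes WER as a word-level edit distance divided by the reference length, and intent accuracy as the share of examples where the whole triplet matches. It is a standard-library illustration, not the project's evaluation code; the function names and the lowercase-and-split normalization are assumptions.

```python
from typing import Sequence, Tuple

Intent = Tuple[str, str, str]  # (action, object, location)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(y_true: Sequence[Intent], y_pred: Sequence[Intent]) -> float:
    """Fraction of examples where action, object, and location are all correct."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# One substitution (fan -> van) and one dropped word over 7 reference words.
wer = word_error_rate("turn on the fan in the bedroom", "turn on the van in bedroom")

truth = [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")]
preds = [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")]
acc = triplet_accuracy(truth, preds)
print(f"WER={wer:.3f}  intent accuracy={acc:.2f}")
```

Note that a single wrong slot makes the whole triplet count as incorrect, which is why intent accuracy is usually lower than the per-label scores.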
---
## Included Notebooks
| Notebook | Purpose |
| -------------------------------- | ----------------------------------------------- |
| `01_audio_exploration.ipynb` | Visualize waveforms, mel spectrograms |
| `02_asr_error_analysis.ipynb` | Compare Whisper vs Wav2Vec2 transcriptions |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb` | Plot confusion matrices and F1 breakdowns |
---
## Models Used
* **Whisper Base (ASR)**: Robust transcription of short commands
* **DistilBERT / BERT**: Text classification of transcripts (or any text classifier of your choice)
* Optionally: Fine-tune **Whisper** for joint ASR+intent learning
---
## Design Decisions
* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
* **Separation of Concerns**: Clean division between preprocessing, training, and inference
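
One way to picture the swappable-component idea is with small interfaces that the pipeline depends on, so an ASR model or classifier can be replaced without touching the rest. The sketch below is illustrative only: `Transcriber`, `IntentClassifier`, `SpeechToIntentPipeline`, and the two stand-in implementations are hypothetical names, not classes from this repository.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    def classify(self, transcript: str) -> Dict[str, str]: ...


@dataclass
class SpeechToIntentPipeline:
    """Chains any transcriber with any classifier that satisfy the protocols."""
    asr: Transcriber
    nlp: IntentClassifier

    def predict(self, audio_path: str) -> Dict[str, object]:
        transcript = self.asr.transcribe(audio_path)
        return {"transcript": transcript, "intent": self.nlp.classify(transcript)}


class FixedTranscriber:
    """Stand-in ASR returning a canned transcript; a real one would wrap Whisper."""

    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio_path: str) -> str:
        return self.text


class KeywordClassifier:
    """Toy keyword rules standing in for a DistilBERT/BERT classifier."""

    def classify(self, transcript: str) -> Dict[str, str]:
        words = transcript.lower().split()
        return {
            "action": "deactivate" if "off" in words else "activate",
            "object": next((w for w in ("lights", "fan") if w in words), "none"),
            "location": next((w for w in ("kitchen", "bedroom") if w in words), "none"),
        }


pipeline = SpeechToIntentPipeline(
    asr=FixedTranscriber("turn on the fan in the bedroom"),
    nlp=KeywordClassifier(),
)
result = pipeline.predict("data/test_audio/command.wav")
print(result)
```

Because the pipeline only relies on the `transcribe` and `classify` methods, swapping in a different model means writing one small adapter class.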
---
## Potential Extensions
* ✅ Real-time streaming inference
* ✅ Speaker identification and voice embeddings
* ✅ End-to-end fine-tuning of Whisper for direct audio → intent
* ✅ Multilingual support via Whisper large models
* ✅ Deployable microservice with Docker
---
## Example Usage
### Voice Input:
> "Turn on the fan in the bedroom"
### Output:
```json
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
```
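
A downstream consumer might parse this payload and act only when every slot is confident enough. The sketch below shows that idea with the standard library; the `accepted_intent` helper and the `0.9` default threshold are assumptions for illustration and are not part of the project.

```python
import json
from typing import Dict, Optional

# The example response from above, as it might arrive over the wire.
RESPONSE = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""


def accepted_intent(payload: str, min_confidence: float = 0.9) -> Optional[Dict[str, str]]:
    """Return the intent only if every slot meets the (hypothetical) threshold."""
    data = json.loads(payload)
    intent, confidence = data["intent"], data["confidence"]
    if all(confidence[slot] >= min_confidence for slot in intent):
        return intent
    return None


print(accepted_intent(RESPONSE))        # all three slots are at least 0.9
print(accepted_intent(RESPONSE, 0.95))  # location (0.93) falls below 0.95
```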