VoiceIQ – Intelligent Audio Command Understanding System
VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (action, object, location). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.
Project Goals
- ✅ Build an end-to-end speech pipeline using ASR + NLP
- ✅ Classify spoken commands into structured intents
- ✅ Serve predictions via a clean API or UI
- ✅ Ensure modularity and production readiness
Core Features
| Feature | Description |
|---|---|
| Speech-to-text | Uses OpenAI Whisper for transcription |
| Intent classifier | Classifies transcribed text into action, object, location |
| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices |
| CLI pipelines | One-command training, inference, and evaluation |
| API + UI | FastAPI for RESTful endpoints; Streamlit demo included |
| Notebooks | EDA, ASR error analysis, intent confusion reports |
Project Structure
Dataset: Fluent Speech Commands
- 23,132 single-sentence voice commands (1–2 seconds)
- Labels: `action`, `object`, `location`
- Examples:
  - “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  - “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
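The label scheme above maps onto a small record type. The following is a minimal sketch, assuming a CSV with `transcription`, `action`, `object`, and `location` columns. The column names, the `Intent` type, the `load_intents` helper, and the sample rows are illustrative assumptions, so check the header of the dataset's own CSV files before relying on them.

```python
import csv
import io
from typing import NamedTuple


class Intent(NamedTuple):
    """Structured intent: one value per label slot."""
    action: str
    object: str
    location: str


# Hypothetical in-memory sample; the real CSV layout may differ.
SAMPLE_CSV = """path,transcription,action,object,location
wavs/a.wav,Turn on the lights in the kitchen,activate,lights,kitchen
wavs/b.wav,Switch off the fan in the bedroom,deactivate,fan,bedroom
"""


def load_intents(handle):
    """Return (transcription, Intent) pairs from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return [
        (row["transcription"],
         Intent(row["action"], row["object"], row["location"]))
        for row in reader
    ]


examples = load_intents(io.StringIO(SAMPLE_CSV))
for text, intent in examples:
    print(text, "->", intent)
```

Keeping the three slots in one immutable tuple makes it easy to compare a full prediction against a full label with a single equality check.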
Setup Instructions
1. Create Environment
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
2. Download Dataset
# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available
How to Run Pipelines
1. Train Pipeline
python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
2. Inference Pipeline
# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
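For readers wiring up their own entry point, here is a minimal sketch of how a script could accept the two flags shown in the command above. Only the flag names come from this README. The `build_parser` helper and the help strings are assumptions, and the actual `inference_pipeline.py` may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the --model_path and --audio_file flags."""
    parser = argparse.ArgumentParser(description="Predict intent from a WAV file")
    parser.add_argument("--model_path", required=True,
                        help="Path to a trained model checkpoint")
    parser.add_argument("--audio_file", required=True,
                        help="Path to the WAV file to classify")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--model_path", "models/best.pt",
     "--audio_file", "data/test_audio/command.wav"]
)
print(args.model_path, args.audio_file)
```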
3. Evaluation Pipeline
python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
API & UI Demos
FastAPI Server
uvicorn src.api.server:app --reload
- `POST /predict`: Upload audio and get the predicted intent
- `GET /health`: System health check
Streamlit Demo
python run_demo.py
- Upload `.wav` or record live
- View transcript, structured intent, and confidence scores
Metrics Tracked
| Metric | Description |
|---|---|
| WER | Word Error Rate from ASR |
| Intent Acc | Accuracy for full (action, object, location) triplet |
| F1 Scores | Macro, micro, and per-label F1 |
| Confusion Matrix | Action/Object/Location classification errors |
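Two of the metrics above are simple enough to sketch in plain Python. This is an illustrative version, not necessarily how the evaluation pipeline computes them: WER is taken as the word-level edit distance divided by the reference length, and intent accuracy counts a sample as correct only when all three slots match. The function names are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(y_true, y_pred) -> float:
    """Fraction of samples where action, object, and location all match."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    matches = sum(1 for t, p in zip(y_true, y_pred) if tuple(t) == tuple(p))
    return matches / len(y_true)


wer = word_error_rate("turn on the fan in the bedroom",
                      "turn on the van in bedroom")
acc = triplet_accuracy(
    [("activate", "fan", "bedroom"), ("deactivate", "lights", "kitchen")],
    [("activate", "fan", "bedroom"), ("deactivate", "lights", "bedroom")],
)
print(f"WER={wer:.3f}, triplet accuracy={acc:.2f}")
```

Triplet accuracy is stricter than per-slot accuracy: in the usage above, the second prediction gets two of three slots right but still counts as a miss.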
Included Notebooks
| Notebook | Purpose |
|---|---|
| `01_audio_exploration.ipynb` | Visualize waveforms, mel spectrograms |
| `02_asr_error_analysis.ipynb` | Compare Whisper vs Wav2Vec2 transcriptions |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb` | Plot confusion matrices and F1 breakdowns |
Models Used
- Whisper Base (ASR) – Robust transcription of short commands
- DistilBERT / BERT (or any text classifier of your choice) – Text classification of transcripts
- Optionally: Fine-tune Whisper for joint ASR+intent learning
Design Decisions
- Pipeline Modularity: All components (ASR, NLP, evaluation) are swappable
- Config-Driven: Use YAML configs for training, ASR models, and evaluation
- Separation of Concerns: Clean division between preprocessing, training, and inference
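One way to get swappable components is to define small interfaces and have the pipeline depend only on those. The sketch below illustrates the idea with stand-in components. `Transcriber`, `IntentClassifier`, `StubTranscriber`, `KeywordClassifier`, and `VoicePipeline` are hypothetical names, not the project's actual classes, and the stubs only imitate what an ASR model and a trained classifier would return.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    def classify(self, text: str) -> Dict[str, str]: ...


class StubTranscriber:
    """Stand-in for an ASR model; always returns a fixed transcript."""

    def transcribe(self, audio_path: str) -> str:
        return "turn on the fan in the bedroom"


class KeywordClassifier:
    """Toy rule-based classifier standing in for a trained model."""

    ACTIONS = {"turn on": "activate", "switch off": "deactivate"}
    OBJECTS = ("lights", "fan")
    LOCATIONS = ("kitchen", "bedroom")

    def classify(self, text: str) -> Dict[str, str]:
        text = text.lower()
        action = next((label for phrase, label in self.ACTIONS.items()
                       if phrase in text), "none")
        obj = next((word for word in self.OBJECTS if word in text), "none")
        location = next((word for word in self.LOCATIONS if word in text), "none")
        return {"action": action, "object": obj, "location": location}


@dataclass
class VoicePipeline:
    """Chains any Transcriber with any IntentClassifier."""
    asr: Transcriber
    nlu: IntentClassifier

    def predict(self, audio_path: str) -> dict:
        transcript = self.asr.transcribe(audio_path)
        return {"transcript": transcript, "intent": self.nlu.classify(transcript)}


pipeline = VoicePipeline(asr=StubTranscriber(), nlu=KeywordClassifier())
result = pipeline.predict("data/test_audio/command.wav")
print(result)
```

Because `VoicePipeline` only relies on the `transcribe` and `classify` methods, replacing the stubs with real models would not require changes to the pipeline class itself.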
Potential Extensions
- ✅ Real-time streaming inference
- ✅ Speaker identification and voice embeddings
- ✅ End-to-end fine-tuning of Whisper for direct audio → intent
- ✅ Multilingual support via Whisper large models
- ✅ Deployable microservice with Docker
Example Usage
Voice Input:
"Turn on the fan in the bedroom"
Output:
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
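A downstream consumer may want to act on a prediction only when every slot is confident enough. The following is a minimal sketch of that check against the output shown above. The `accept_intent` helper and the threshold values are illustrative assumptions, not part of the project.

```python
import json

# The example output from this section, as a JSON string.
RESPONSE = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""

SLOTS = ("action", "object", "location")


def accept_intent(payload: dict, threshold: float):
    """Return the intent if every slot meets the threshold, else None."""
    confidence = payload["confidence"]
    if all(confidence[slot] >= threshold for slot in SLOTS):
        return payload["intent"]
    return None


payload = json.loads(RESPONSE)
accepted = accept_intent(payload, threshold=0.9)   # hypothetical threshold
rejected = accept_intent(payload, threshold=0.95)  # location (0.93) falls short
print(accepted, rejected)
```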