2025-07-29 01:52:11 +01:00


VoiceIQ: Intelligent Audio Command Understanding System

VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (action, object, location). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.


Project Goals

  • Build an end-to-end speech pipeline using ASR + NLP
  • Classify spoken commands into structured intents
  • Serve predictions via a clean API or UI
  • Ensure modularity and production readiness

Core Features

| Feature | Description |
| --- | --- |
| Speech-to-text | Uses OpenAI Whisper for transcription |
| Intent classifier | Classifies transcribed text into action, object, location |
| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices |
| CLI pipelines | One-command training, inference, and evaluation |
| API + UI | FastAPI for RESTful endpoints; Streamlit demo included |
| Notebooks | EDA, ASR error analysis, intent confusion reports |

Project Structure


Dataset: Fluent Speech Commands

  • 23,132 single-sentence voice commands (roughly 1 to 2 seconds each)

  • Labels: action, object, location

  • Examples:

    • “Turn on the lights in the kitchen” → activate, lights, kitchen
    • “Switch off the fan in the bedroom” → deactivate, fan, bedroom
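
The label scheme above maps naturally onto a small record type. The following is a minimal sketch, assuming the dataset's CSV files expose `transcription`, `action`, `object`, and `location` columns. The `Intent` class, the `load_intents` helper, and the sample rows are illustrative and are not part of the project code.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """One labelled command: the three slots the classifier predicts."""
    action: str
    object: str
    location: str


def load_intents(csv_text: str) -> list[tuple[str, Intent]]:
    """Parse CSV text into (transcription, Intent) pairs.

    The column names are an assumption about the dataset's CSV layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["transcription"], Intent(row["action"], row["object"], row["location"]))
        for row in reader
    ]


# Hypothetical rows mirroring the two examples above.
SAMPLE_CSV = """transcription,action,object,location
Turn on the lights in the kitchen,activate,lights,kitchen
Switch off the fan in the bedroom,deactivate,fan,bedroom
"""

pairs = load_intents(SAMPLE_CSV)
for text, intent in pairs:
    print(text, "->", intent)
```

Freezing the dataclass makes each `Intent` hashable and comparable, which is convenient later when counting exact-match triplets.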

Setup Instructions

1. Create Environment

git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt

2. Download Dataset

# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available

How to Run Pipelines

1. Train Pipeline

python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"

2. Inference Pipeline

# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav

3. Evaluation Pipeline

python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
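
The commands above share a flag-based interface. As a rough sketch of how such an entry point might parse its arguments (the actual scripts may differ), the parser below mirrors only the two flags shown for the training command. `build_train_parser` and the default experiment name are assumptions made for illustration.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags shown for the training command."""
    parser = argparse.ArgumentParser(description="Train the intent classifier")
    parser.add_argument("--config", required=True,
                        help="Path to a YAML training config")
    # The default value here is hypothetical; the real script may require the flag.
    parser.add_argument("--experiment_name", default="default",
                        help="Label used to identify this run")
    return parser


# Parse the same arguments as the training command above.
args = build_train_parser().parse_args(
    ["--config", "configs/train_config.yaml",
     "--experiment_name", "whisper+bert_baseline"]
)
print(args.config, args.experiment_name)
```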

API & UI Demos

FastAPI Server

uvicorn src.api.server:app --reload

  • POST /predict: Upload audio and get predicted intent
  • GET /health: System health check

Streamlit Demo

python run_demo.py

  • Upload .wav or record live
  • View transcript, structured intent, and confidence scores

Metrics Tracked

| Metric | Description |
| --- | --- |
| WER | Word Error Rate from ASR |
| Intent Acc | Accuracy for full (action, object, location) triplet |
| F1 Scores | Macro, micro, and per-label F1 |
| Confusion Matrix | Action/Object/Location classification errors |
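
To make the metric definitions concrete, here is a dependency-free sketch of three of them: WER as word-level edit distance divided by reference length, exact-match accuracy over full triplets, and macro F1 for a single slot. The function names are illustrative, and the project's evaluation pipeline may compute these differently (for example, via a metrics library).

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # (initially empty) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(gold: list[tuple], pred: list[tuple]) -> float:
    """Fraction of examples where action, object, and location all match."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 scores for one slot."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    labels = set(gold) | set(pred)
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is non-zero for any
    # label that appears in gold or pred.
    scores = [2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels]
    return sum(scores) / len(scores)


wer = word_error_rate("turn on the lights", "turn on lights")
acc = triplet_accuracy(
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")],
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")],
)
print(wer, acc)
```

Triplet accuracy is deliberately strict: a prediction with the right action and object but the wrong location counts as fully wrong, which is why per-slot F1 is worth tracking alongside it.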

Included Notebooks

| Notebook | Purpose |
| --- | --- |
| 01_audio_exploration.ipynb | Visualize waveforms, mel spectrograms |
| 02_asr_error_analysis.ipynb | Compare Whisper vs Wav2Vec2 transcriptions |
| 03_intent_classification.ipynb | Hyperparameter tuning, misclassification review |
| 04_results_analysis.ipynb | Plot confusion matrices and F1 breakdowns |

Models Used

  • Whisper Base (ASR): Robust transcription of short commands
  • DistilBERT / BERT (or any text classifier of your choice): Text classification of transcripts
  • Optionally: Fine-tune Whisper for joint ASR+intent learning

Design Decisions

  • Pipeline Modularity: All components (ASR, NLP, evaluation) are swappable
  • Config-Driven: Use YAML configs for training, ASR models, and evaluation
  • Separation of Concerns: Clean division between preprocessing, training, and inference
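
One way to read "swappable components" is as structural interfaces that any ASR or NLP backend can satisfy. The sketch below illustrates the idea with `typing.Protocol`. Every name in it (`Transcriber`, `IntentClassifier`, `SpeechToIntentPipeline`, and the two stub classes) is hypothetical and does not reflect the project's actual module layout.

```python
from dataclasses import dataclass
from typing import Protocol


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    def classify(self, text: str) -> dict[str, str]: ...


@dataclass
class SpeechToIntentPipeline:
    """Composes any transcriber with any classifier."""
    transcriber: Transcriber
    classifier: IntentClassifier

    def run(self, audio_path: str) -> dict:
        # Normalize the transcript before classification.
        transcript = self.transcriber.transcribe(audio_path).lower().strip()
        return {"transcript": transcript,
                "intent": self.classifier.classify(transcript)}


class FixedTranscriber:
    """Stub standing in for a real ASR model; always returns the same text."""
    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio_path: str) -> str:
        return self.text


class KeywordClassifier:
    """Toy rule-based classifier, used only to exercise the pipeline."""
    def classify(self, text: str) -> dict[str, str]:
        words = text.split()
        return {
            "action": "deactivate" if "off" in words else "activate",
            "object": "fan" if "fan" in words else "lights",
            "location": words[-1] if words else "none",
        }


pipeline = SpeechToIntentPipeline(
    FixedTranscriber("Turn on the fan in the bedroom"), KeywordClassifier()
)
result = pipeline.run("command.wav")
print(result)
```

Because the pipeline depends only on the two protocols, replacing the stubs with a Whisper wrapper or a BERT-based classifier would not require changes to `SpeechToIntentPipeline` itself.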

Potential Extensions

  • Real-time streaming inference
  • Speaker identification and voice embeddings
  • End-to-end fine-tuning of Whisper for direct audio → intent
  • Multilingual support via Whisper large models
  • Deployable microservice with Docker

Example Usage

Voice Input:

"Turn on the fan in the bedroom"

Output:

{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
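
The README does not specify how the confidence values are derived. A common choice, assumed here, is to report the highest class probability for each slot. The sketch below assembles a response of the same shape from per-slot probability distributions; `build_prediction` and the probability values are hypothetical and were chosen only to reproduce the example above.

```python
import json


def build_prediction(transcript: str,
                     slot_probs: dict[str, dict[str, float]]) -> dict:
    """Pick the most probable label per slot and report its probability."""
    intent, confidence = {}, {}
    for slot, probs in slot_probs.items():
        label = max(probs, key=probs.get)
        intent[slot] = label
        confidence[slot] = round(probs[label], 2)
    return {"transcript": transcript, "intent": intent, "confidence": confidence}


# Hypothetical classifier outputs (one probability distribution per slot).
slot_probs = {
    "action": {"activate": 0.98, "deactivate": 0.02},
    "object": {"fan": 0.95, "lights": 0.05},
    "location": {"bedroom": 0.93, "kitchen": 0.07},
}

prediction = build_prediction("turn on the fan in the bedroom", slot_probs)
print(json.dumps(prediction, indent=2))
```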