# VoiceIntent – Intelligent Audio Command Understanding System

VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It is designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.

---

## Project Goals

* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
* ✅ Classify spoken commands into **structured intents**
* ✅ Serve predictions via a clean API or UI
* ✅ Ensure modularity and production readiness

---

## Core Features

| Feature             | Description                                                                 |
| ------------------- | --------------------------------------------------------------------------- |
| Speech-to-text      | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription  |
| Intent classifier   | Classifies transcribed text into `action`, `object`, `location`             |
| Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices             |
| CLI pipelines       | One-command training, inference, and evaluation                             |
| API + UI            | FastAPI for RESTful endpoints; Streamlit demo included                      |
| Notebooks           | EDA, ASR error analysis, intent confusion reports                           |

---

## Project Structure

---

## Dataset: Fluent Speech Commands

* 23,132 single-sentence voice commands (1–2 seconds)
* Labels: `action`, `object`, `location`
* Examples:
  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`

---

## Setup Instructions

### 1. Create Environment

```bash
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

### 2. Download Dataset

```bash
# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available
```

---

## How to Run Pipelines

### 1. Train Pipeline

```bash
python pipelines/train_pipeline.py \
  --config configs/train_config.yaml \
  --experiment_name "whisper+bert_baseline"
```

### 2. Inference Pipeline

```bash
# Predict from a WAV file
python pipelines/inference_pipeline.py \
  --model_path models/best.pt \
  --audio_file data/test_audio/command.wav
```

### 3. Evaluation Pipeline

```bash
python pipelines/evaluation_pipeline.py \
  --model_path models/best.pt \
  --test_data data/processed/test.csv
```

---

## API & UI Demos

### FastAPI Server

```bash
uvicorn src.api.server:app --reload
```

* `POST /predict`: Upload audio and get the predicted intent
* `GET /health`: System health check

### Streamlit Demo

```bash
python run_demo.py
```

* Upload a `.wav` file or record live
* View the transcript, structured intent, and confidence scores

---

## Metrics Tracked

| Metric               | Description                                            |
| -------------------- | ------------------------------------------------------ |
| **WER**              | Word Error Rate from ASR                               |
| **Intent Acc**       | Accuracy for full `(action, object, location)` triplet |
| **F1 Scores**        | Macro, micro, and per-label F1                         |
| **Confusion Matrix** | Action/Object/Location classification errors           |

---

## Included Notebooks

| Notebook                         | Purpose                                         |
| -------------------------------- | ----------------------------------------------- |
| `01_audio_exploration.ipynb`     | Visualize waveforms, mel spectrograms           |
| `02_asr_error_analysis.ipynb`    | Compare Whisper vs Wav2Vec2 transcriptions      |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb`      | Plot confusion matrices and F1 breakdowns       |

---

## Models Used

* **Whisper Base (ASR)** – Robust transcription of short commands
* **DistilBERT / BERT** – Text classification of transcripts (or any text classifier of your choice)
* Optionally: Fine-tune **Whisper** for joint ASR+intent learning

---

## Design Decisions

* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
* **Separation of Concerns**: Clean division between preprocessing, training, and inference

---

## Potential Extensions

* ✅ Real-time streaming inference
* ✅ Speaker identification and voice embeddings
* ✅ End-to-end fine-tuning of Whisper for direct audio → intent
* ✅ Multilingual support via Whisper large models
* ✅ Deployable microservice with Docker

---

## Example Usage

### Voice Input:

> "Turn on the fan in the bedroom"

### Output:

```json
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
```
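
### Sketch: Word Error Rate

The **WER** metric listed under Metrics Tracked is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The snippet below is a minimal, dependency-free sketch of that definition, not the project's evaluation code. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "turn on the fan in the bedroom"
    hyp = "turn on the van in bedroom"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.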
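
### Sketch: Intent Accuracy

**Intent Acc** in Metrics Tracked counts a prediction as correct only when the full `(action, object, location)` triplet matches. The following sketch shows one way to compute that exact-match score alongside per-slot accuracy. The names `Intent` and `intent_accuracy` are hypothetical and are not taken from the repository.

```python
from typing import Dict, List, Tuple

# Hypothetical alias: one intent is an (action, object, location) triplet.
Intent = Tuple[str, str, str]

SLOTS = ("action", "object", "location")


def intent_accuracy(y_true: List[Intent], y_pred: List[Intent]) -> Dict[str, float]:
    """Return exact-match triplet accuracy plus accuracy for each slot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    pairs = list(zip(y_true, y_pred))

    # A triplet counts as correct only if all three slots match.
    scores = {"triplet": sum(t == p for t, p in pairs) / n}

    # Per-slot accuracy shows which component causes triplet errors.
    for idx, slot in enumerate(SLOTS):
        scores[slot] = sum(t[idx] == p[idx] for t, p in pairs) / n
    return scores


if __name__ == "__main__":
    truth = [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")]
    preds = [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")]
    print(intent_accuracy(truth, preds))
```

Triplet accuracy can never exceed the lowest per-slot accuracy, so reporting both makes it easier to see whether errors concentrate in one slot.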