# VoiceIQ – Intelligent Audio Command Understanding System

VoiceIQ is a modular speech-to-intent classification system. It takes raw voice input, transcribes it with a pre-trained ASR model, and classifies the user's intent into three structured slots: `action`, `object`, and `location`. It is designed for smart assistants, hands-free interfaces, and voice-based automation systems.

---

## Project Goals

* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
* ✅ Classify spoken commands into **structured intents**
* ✅ Serve predictions via a clean API or UI
* ✅ Ensure modularity and production readiness

---

## Core Features

| Feature                | Description                                                                |
| ---------------------- | -------------------------------------------------------------------------- |
| Speech-to-text         | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
| Intent classifier      | Classifies transcribed text into `action`, `object`, `location`            |
| Evaluation pipeline    | Tracks WER, accuracy, precision, recall, and confusion matrices            |
| CLI pipelines          | One-command training, inference, and evaluation                            |
| API + UI               | FastAPI for RESTful endpoints; Streamlit demo included                     |
| Notebooks              | EDA, ASR error analysis, intent confusion reports                          |

---

## Project Structure

---

## Dataset: Fluent Speech Commands

* 23,132 single-sentence voice commands (1–2 seconds) in the training split; the full corpus has 30,043 utterances across the train, validation, and test splits
* Labels: `action`, `object`, `location`
* Examples:
  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`

---

## Setup Instructions

### 1. Create Environment

```bash
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

### 2. Download Dataset

```bash
# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available
```
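
Once the archive is unpacked, each split is described by a CSV file. The sketch below reads such a file with the standard library and collects the label triplets. It assumes the column names `path`, `speakerId`, `transcription`, `action`, `object`, and `location` used by the public Fluent Speech Commands release; the helper name `load_split` and the inline sample rows are hypothetical and stand in for a real split file.

```python
import csv
import io
from typing import List, NamedTuple


class Utterance(NamedTuple):
    """One labelled command from a split file."""
    path: str
    transcription: str
    action: str
    object: str
    location: str


def load_split(csv_file) -> List[Utterance]:
    """Read one split CSV (an open text file) into a list of utterances."""
    reader = csv.DictReader(csv_file)
    return [
        Utterance(
            path=row["path"],
            transcription=row["transcription"],
            action=row["action"],
            object=row["object"],
            location=row["location"],
        )
        for row in reader
    ]


# Hypothetical two-row sample; column names are an assumption about the release.
SAMPLE = """path,speakerId,transcription,action,object,location
wavs/a.wav,spk1,Turn on the lights in the kitchen,activate,lights,kitchen
wavs/b.wav,spk2,Switch off the fan in the bedroom,deactivate,fan,bedroom
"""

utterances = load_split(io.StringIO(SAMPLE))
print(utterances[0].action, utterances[0].object, utterances[0].location)
```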

---

## How to Run Pipelines

### 1. Train Pipeline

```bash
python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
```
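
The internals of the training script are not shown here, but a command of this shape is commonly parsed with `argparse`. The following is a minimal sketch of that parsing step only. The two flag names come from the command above; the `parse_args` helper, the default value, and the help strings are assumptions.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the two flags used by the training command (sketch only)."""
    parser = argparse.ArgumentParser(description="Train the speech-to-intent pipeline.")
    parser.add_argument("--config", type=Path, required=True,
                        help="Path to a YAML training config.")
    parser.add_argument("--experiment_name", default="default",
                        help="Label used to tell runs apart (assumed default).")
    return parser.parse_args(argv)


# Passing an explicit list mirrors the shell command without touching sys.argv.
args = parse_args([
    "--config", "configs/train_config.yaml",
    "--experiment_name", "whisper+bert_baseline",
])
print(args.config, args.experiment_name)
```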

### 2. Inference Pipeline

```bash
# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
```

### 3. Evaluation Pipeline

```bash
python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
```

---

## API & UI Demos

### FastAPI Server

```bash
uvicorn src.api.server:app --reload
```

* `POST /predict`: Upload audio and get the predicted intent
* `GET /health`: System health check
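
With the server running, `/predict` can be called from any HTTP client. Below is a standard-library sketch that builds a `multipart/form-data` upload by hand and posts it. The form field name `file`, the base URL `http://127.0.0.1:8000` (uvicorn's default bind address), and both helper names are assumptions; check `src/api/server.py` for the names the endpoint actually expects.

```python
import json
import urllib.request
import uuid
from typing import Optional, Tuple


def build_multipart(field: str, filename: str, payload: bytes,
                    boundary: Optional[str] = None) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict(wav_path: str, base_url: str = "http://127.0.0.1:8000") -> dict:
    """POST a WAV file to /predict and return the decoded JSON response."""
    with open(wav_path, "rb") as fh:
        # "file" is an assumed form field name, not confirmed by the project.
        body, content_type = build_multipart("file", "command.wav", fh.read())
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the body needs no running server, so it can be inspected offline.
body, content_type = build_multipart("file", "command.wav", b"RIFF", boundary="demo")
print(content_type)
```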

### Streamlit Demo

```bash
python run_demo.py
```

* Upload `.wav` or record live
* View transcript, structured intent, and confidence scores

---

## Metrics Tracked

| Metric                  | Description                                            |
| ----------------------- | ------------------------------------------------------ |
| **WER**                 | Word Error Rate from ASR                               |
| **Intent Acc**          | Accuracy for full `(action, object, location)` triplet |
| **F1 Scores**           | Macro, micro, and per-label F1                         |
| **Confusion Matrix**    | Action/Object/Location classification errors           |
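
Two of these metrics are easy to state precisely. WER is the word-level edit distance between the reference and the ASR hypothesis divided by the number of reference words, that is (S + D + I) / N for substitutions, deletions, and insertions. Intent accuracy counts a prediction as correct only when all three slots match. The sketch below is a plain-Python illustration of both definitions, not the project's evaluation code; the function names are hypothetical.

```python
from typing import Sequence, Tuple

Triplet = Tuple[str, str, str]  # (action, object, location)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def intent_accuracy(gold: Sequence[Triplet], predicted: Sequence[Triplet]) -> float:
    """Fraction of utterances whose full triplet is predicted exactly."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# One substitution (fan -> van) and one deletion (the) over 7 reference words.
wer = word_error_rate("turn on the fan in the bedroom", "turn on the van in bedroom")

gold = [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")]
pred = [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")]
acc = intent_accuracy(gold, pred)
print(round(wer, 3), acc)
```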

---

## Included Notebooks

| Notebook                         | Purpose                                         |
| -------------------------------- | ----------------------------------------------- |
| `01_audio_exploration.ipynb`     | Visualize waveforms, mel spectrograms           |
| `02_asr_error_analysis.ipynb`    | Compare Whisper vs Wav2Vec2 transcriptions      |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb`      | Plot confusion matrices and F1 breakdowns       |

---

## Models Used

* **Whisper Base (ASR)** – Robust transcription of short commands
* **DistilBERT / BERT** – Text classification of transcripts; any other text classifier can be substituted
* Optionally: Fine-tune **Whisper** for joint ASR+intent learning

---

## Design Decisions

* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
* **Separation of Concerns**: Clean division between preprocessing, training, and inference
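
One common way to make components swappable is to have the pipeline depend on small interfaces rather than concrete models. The sketch below illustrates that idea with `typing.Protocol`. Every name in it (`Transcriber`, `IntentClassifier`, `Pipeline`, and the two toy stand-ins) is hypothetical and chosen for illustration; a real setup would plug a Whisper wrapper and a BERT-style classifier into the same two slots.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Intent:
    action: str
    object: str
    location: str


class Transcriber(Protocol):
    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    def classify(self, transcript: str) -> Intent: ...


@dataclass
class Pipeline:
    """Composes any transcriber with any classifier."""
    asr: Transcriber
    nlp: IntentClassifier

    def run(self, audio_path: str) -> Intent:
        return self.nlp.classify(self.asr.transcribe(audio_path))


class FixedTranscriber:
    """Stand-in ASR that returns a canned transcript; no audio is read."""

    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio_path: str) -> str:
        return self.text


class KeywordClassifier:
    """Toy rule-based classifier used only to exercise the interface."""

    def classify(self, transcript: str) -> Intent:
        words = transcript.lower().split()
        action = "deactivate" if "off" in words else "activate"
        obj = next((w for w in words if w in {"lights", "fan"}), "none")
        location = next((w for w in words if w in {"kitchen", "bedroom"}), "none")
        return Intent(action, obj, location)


pipeline = Pipeline(
    asr=FixedTranscriber("turn on the fan in the bedroom"),
    nlp=KeywordClassifier(),
)
print(pipeline.run("data/test_audio/command.wav"))
```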

---

## Potential Extensions

* Real-time streaming inference
* Speaker identification and voice embeddings
* End-to-end fine-tuning of Whisper for direct audio → intent
* Multilingual support via Whisper large models
* Deployable microservice with Docker

---

## Example Usage

### Voice Input:

> "Turn on the fan in the bedroom"

### Output:

```json
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
```
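
A downstream consumer will usually want to act on a prediction only when every slot is confident enough. The sketch below parses the response shown above and lists the slots that fall under a threshold. The threshold value and the helper name `low_confidence_slots` are illustrative assumptions, not part of the project.

```python
import json
from typing import List

# Response body copied from the example output above.
RESPONSE = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""


def low_confidence_slots(prediction: dict, threshold: float) -> List[str]:
    """Return the slot names whose confidence is below the threshold."""
    return [
        slot
        for slot, score in prediction["confidence"].items()
        if score < threshold
    ]


prediction = json.loads(RESPONSE)
intent = prediction["intent"]
print(intent["action"], intent["object"], intent["location"])

# With an assumed threshold of 0.94, only the location slot is flagged.
flagged = low_confidence_slots(prediction, threshold=0.94)
print(flagged)
```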