# VoiceIQ – Intelligent Audio Command Understanding System

VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.

---

## Project Goals

* ✅ Build an end-to-end speech pipeline using **ASR + NLP**
* ✅ Classify spoken commands into **structured intents**
* ✅ Serve predictions via a clean API or UI
* ✅ Ensure modularity and production readiness

---

## Core Features

| Feature                | Description                                                                |
| ---------------------- | -------------------------------------------------------------------------- |
| Speech-to-text         | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
| Intent classifier      | Classifies transcribed text into `action`, `object`, `location`            |
| Evaluation pipeline    | Tracks WER, accuracy, precision, recall, and confusion matrices            |
| CLI pipelines          | One-command training, inference, and evaluation                            |
| API + UI               | FastAPI for RESTful endpoints; Streamlit demo included                     |
| Notebooks              | EDA, ASR error analysis, intent confusion reports                          |

---

## Project Structure

---

## Dataset: Fluent Speech Commands

* 23,132 single-sentence voice commands in the training split (the full corpus has 30,043 utterances), each roughly 1–2 seconds long
* Labels: `action`, `object`, `location`
* Examples:
  * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
  * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
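
The label triplet above could be modelled as a small record type. The sketch below is illustrative only: `Intent` and `EXAMPLES` are hypothetical names that do not come from the project's code, and the two entries simply restate the example commands listed above.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    """One structured label: what to do, to what, and where (hypothetical type)."""
    action: str
    object: str
    location: str

    def as_tuple(self) -> tuple:
        """Return the label as an (action, object, location) tuple."""
        return (self.action, self.object, self.location)


# The two example commands from the list above, keyed by lower-cased transcript.
EXAMPLES = {
    "turn on the lights in the kitchen": Intent("activate", "lights", "kitchen"),
    "switch off the fan in the bedroom": Intent("deactivate", "fan", "bedroom"),
}

# Look up one command and read its structured label.
intent = EXAMPLES["turn on the lights in the kitchen"]
```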

---

## Setup Instructions

### 1. Create Environment

```bash
git clone https://github.com/your-org/voiceiq.git
cd voiceiq
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

### 2. Download Dataset

```bash
# Download and unzip Fluent Speech Commands
# Or use torchaudio.datasets.FluentSpeechCommands if available
```
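
Once the archive is unpacked, its label files can be read with the standard library. This is a minimal sketch rather than the project's loader. It assumes each split ships as a CSV with `path`, `transcription`, `action`, `object`, and `location` columns, so check the header of your copy first. `read_commands`, `intent_counts`, and `SAMPLE` are hypothetical names, and the sample rows are made up from the examples above.

```python
import csv
import io
from collections import Counter
from typing import IO, Iterator

# Tiny in-memory stand-in for a split file; column names are an assumption.
SAMPLE = (
    ",path,speakerId,transcription,action,object,location\n"
    "0,wavs/a.wav,spk1,Turn on the lights in the kitchen,activate,lights,kitchen\n"
    "1,wavs/b.wav,spk2,Switch off the fan in the bedroom,deactivate,fan,bedroom\n"
)


def read_commands(handle: IO[str]) -> Iterator[dict]:
    """Yield one record per utterance with its audio path, text, and intent triplet."""
    for row in csv.DictReader(handle):
        yield {
            "path": row["path"],
            "transcription": row["transcription"],
            "intent": (row["action"], row["object"], row["location"]),
        }


def intent_counts(rows: list) -> Counter:
    """Count how often each (action, object, location) triplet occurs."""
    return Counter(r["intent"] for r in rows)


# With a real file, pass open("<split>.csv", newline="") instead of StringIO.
rows = list(read_commands(io.StringIO(SAMPLE)))
counts = intent_counts(rows)
```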

---

## How to Run Pipelines

### 1. Train Pipeline

```bash
python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
```

### 2. Inference Pipeline

```bash
# Predict from a WAV file
python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
```

### 3. Evaluation Pipeline

```bash
python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
```

---

## API & UI Demos

### FastAPI Server

```bash
uvicorn src.api.server:app --reload
```

* `POST /predict` — Upload audio and get predicted intent
* `GET /health` — System health check
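
A client for the `POST /predict` endpoint might look like the sketch below, which uses only the standard library. Several details are assumptions rather than facts about this project: the upload is sent as `multipart/form-data` under a form field named `file`, the server listens on uvicorn's default `http://127.0.0.1:8000`, and the response body is JSON. `build_multipart` and `predict` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "audio/wav") -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict(audio_path: str, base_url: str = "http://127.0.0.1:8000") -> dict:
    """Upload a WAV file to /predict and return the decoded JSON response."""
    with open(audio_path, "rb") as fh:
        # "file" is an assumed form field name; match it to the server's handler.
        body, ctype = build_multipart("file", os.path.basename(audio_path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the server running, `predict("data/test_audio/command.wav")` would return the parsed response, provided the endpoint accepts the field name assumed here.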

### Streamlit Demo

```bash
python run_demo.py
```

* Upload `.wav` or record live
* View transcript, structured intent, and confidence scores

---

## Metrics Tracked

| Metric                  | Description                                            |
| ----------------------- | ------------------------------------------------------ |
| **WER**                 | Word Error Rate from ASR                               |
| **Intent Acc**          | Accuracy for full `(action, object, location)` triplet |
| **F1 Scores**           | Macro, micro, and per-label F1                         |
| **Confusion Matrix**    | Action/Object/Location classification errors           |
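
To make the first two rows concrete, the sketch below computes WER as word-level edit distance divided by the reference length, and intent accuracy as the share of exactly matching triplets. It is not necessarily how the evaluation pipeline computes them. `word_error_rate` and `triplet_accuracy` are hypothetical names, and the inputs are assumed to be already lower-cased and stripped of punctuation.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(gold: List[Triplet], pred: List[Triplet]) -> float:
    """Fraction of items whose full (action, object, location) triplet matches."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# One substituted word out of four reference words.
wer = word_error_rate("turn on the lights", "turn on the light")

# The second prediction gets the location wrong, so only one of two triplets matches.
acc = triplet_accuracy(
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")],
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")],
)
```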

---

## Included Notebooks

| Notebook                         | Purpose                                         |
| -------------------------------- | ----------------------------------------------- |
| `01_audio_exploration.ipynb`     | Visualize waveforms, mel spectrograms           |
| `02_asr_error_analysis.ipynb`    | Compare Whisper vs Wav2Vec2 transcriptions      |
| `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
| `04_results_analysis.ipynb`      | Plot confusion matrices and F1 breakdowns       |

---

## Models Used

* **Whisper Base (ASR)** – Robust transcription of short commands
* **DistilBERT / BERT** – Text classification of transcripts (or any text classifier of your choice)
* Optionally: Fine-tune **Whisper** for joint ASR+intent learning

---

## Design Decisions

* **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
* **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
* **Separation of Concerns**: Clean division between preprocessing, training, and inference
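
One way to get the swappable components described above is to have the pipeline depend only on small interfaces. The sketch below is an illustration under that assumption, not the project's actual class layout: every name in it is hypothetical, and the two stand-in components are toys that a real setup would replace with Whisper and a DistilBERT/BERT classifier.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


@dataclass
class Intent:
    action: str
    object: str
    location: str


class Transcriber(Protocol):
    """Anything that turns an audio file into text (e.g. a Whisper wrapper)."""
    def transcribe(self, audio_path: str) -> str: ...


class IntentClassifier(Protocol):
    """Anything that turns a transcript into a structured intent."""
    def classify(self, transcript: str) -> Intent: ...


@dataclass
class VoicePipeline:
    asr: Transcriber
    nlp: IntentClassifier

    def run(self, audio_path: str) -> Tuple[str, Intent]:
        transcript = self.asr.transcribe(audio_path)
        return transcript, self.nlp.classify(transcript)


class CannedTranscriber:
    """Toy ASR stand-in that ignores the audio and returns a fixed transcript."""
    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio_path: str) -> str:
        return self.text


class KeywordClassifier:
    """Toy rule-based stand-in for a trained text classifier."""
    OBJECTS = ("lights", "fan")
    LOCATIONS = ("kitchen", "bedroom")

    def classify(self, transcript: str) -> Intent:
        words = transcript.lower().split()
        action = "deactivate" if "off" in words else "activate"
        obj = next((w for w in words if w in self.OBJECTS), "none")
        loc = next((w for w in words if w in self.LOCATIONS), "none")
        return Intent(action, obj, loc)


# Either component can be replaced without touching VoicePipeline itself.
pipeline = VoicePipeline(asr=CannedTranscriber("turn on the fan in the bedroom"),
                         nlp=KeywordClassifier())
transcript, intent = pipeline.run("command.wav")
```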

---

## Potential Extensions

* ✅ Real-time streaming inference
* ✅ Speaker identification and voice embeddings
* ✅ End-to-end fine-tuning of Whisper for direct audio → intent
* ✅ Multilingual support via Whisper large models
* ✅ Deployable microservice with Docker

---

## Example Usage

### Voice Input:

> "Turn on the fan in the bedroom"

### Output:

```json
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {
    "action": "activate",
    "object": "fan",
    "location": "bedroom"
  },
  "confidence": {
    "action": 0.98,
    "object": 0.95,
    "location": 0.93
  }
}
```
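
A caller may want to act on a prediction only when every slot is confident enough. The sketch below parses the output shown above and applies such a gate. The `accept_intent` helper and the `0.9` default threshold are illustrative assumptions, not part of the project.

```python
import json
from typing import Optional

# The example output from above, as it might arrive over the API.
RESPONSE = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""


def accept_intent(payload: dict, threshold: float = 0.9) -> Optional[dict]:
    """Return the intent only if every slot's confidence reaches the threshold."""
    if min(payload["confidence"].values()) < threshold:
        return None
    return payload["intent"]


payload = json.loads(RESPONSE)
intent = accept_intent(payload)                    # lowest score 0.93 passes 0.9
strict = accept_intent(payload, threshold=0.95)    # 0.93 fails, so this is None
```

Gating on the weakest slot is a deliberately conservative choice: a confident `action` is of little use if the `location` is a guess.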