made first commit

2025-07-29 01:52:11 +01:00
commit cebb4c2fe0
1 changed files with 187 additions and 0 deletions
@@ -0,0 +1,187 @@
 # VoiceIQ – Intelligent Audio Command Understanding System
 VoiceIQ is a modular speech-to-intent classification system that processes raw voice input, transcribes it using a pre-trained ASR model, and classifies user intent into structured components (`action`, `object`, `location`). It's designed for use in smart assistants, hands-free interfaces, and voice-based automation systems.
 ---
 ## Project Goals
 * ✅ Build an end-to-end speech pipeline using **ASR + NLP**
 * ✅ Classify spoken commands into **structured intents**
 * ✅ Serve predictions via a clean API or UI
 * ✅ Ensure modularity and production readiness
 ---
 ## Core Features
 | Feature             | Description                                                                |
 | ------------------- | -------------------------------------------------------------------------- |
 | Speech-to-text      | Uses [OpenAI Whisper](https://github.com/openai/whisper) for transcription |
 | Intent classifier   | Classifies transcribed text into `action`, `object`, `location`            |
 | Evaluation pipeline | Tracks WER, accuracy, precision, recall, and confusion matrices            |
 | CLI pipelines       | One-command training, inference, and evaluation                            |
 | API + UI            | FastAPI for RESTful endpoints; Streamlit demo included                     |
 | Notebooks           | EDA, ASR error analysis, intent confusion reports                          |
 ---
 ## Project Structure
 ---
 ## Dataset: Fluent Speech Commands
 * 30,043 single-sentence voice commands (1–2 seconds) from 97 speakers; 23,132 of them form the training split
 * Labels: `action`, `object`, `location`
 * Examples:
   * “Turn on the lights in the kitchen” → `activate`, `lights`, `kitchen`
   * “Switch off the fan in the bedroom” → `deactivate`, `fan`, `bedroom`
 ---
 ## Setup Instructions
 ### 1. Create Environment
 ```bash
 git clone https://github.com/your-org/voiceiq.git
 cd voiceiq
 python -m venv venv
 source venv/bin/activate  # or venv\Scripts\activate on Windows
 pip install -r requirements.txt
 ```
 ### 2. Download Dataset
 ```bash
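 # Download the Fluent Speech Commands archive and unzip it locally.
 # torchaudio.datasets.FluentSpeechCommands (if your torchaudio version provides it)
 # can read the extracted folder; it does not download the data itself.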
 ```
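
If you work from the extracted archive directly, the labels can be read from the split CSV files with the standard library. The following is a minimal sketch, assuming the columns are named `path`, `transcription`, `action`, `object`, and `location` (check the header of your copy). `Utterance` and `load_split` are illustrative names, not part of this repository.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass(frozen=True)
class Utterance:
    """One labelled command: audio path, reference text, and the intent triplet."""
    path: str
    transcription: str
    action: str
    object: str
    location: str


def load_split(handle: TextIO) -> List[Utterance]:
    """Read one split file; the column names are an assumption about the CSV layout."""
    reader = csv.DictReader(handle)
    return [
        Utterance(
            path=row["path"],
            transcription=row["transcription"],
            action=row["action"],
            object=row["object"],
            location=row["location"],
        )
        for row in reader
    ]


# Tiny in-memory sample standing in for a real split file (assumed layout).
SAMPLE = """path,transcription,action,object,location
wavs/a.wav,Turn on the lights in the kitchen,activate,lights,kitchen
wavs/b.wav,Switch off the fan in the bedroom,deactivate,fan,bedroom
"""

utterances = load_split(io.StringIO(SAMPLE))
for utt in utterances:
    print(utt.path, (utt.action, utt.object, utt.location))
```

For a real file, pass `open(csv_path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.
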
 ---
 ## How to Run Pipelines
 ### 1. Train Pipeline
 ```bash
 python pipelines/train_pipeline.py --config configs/train_config.yaml --experiment_name "whisper+bert_baseline"
 ```
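
The command above passes two flags. As a rough sketch of how such an entry point might parse them, assuming nothing beyond the `--config` and `--experiment_name` flags shown; the default value and help texts here are illustrative and may differ from `pipelines/train_pipeline.py`.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser for the two flags used in the training command (sketch only)."""
    parser = argparse.ArgumentParser(description="Train the intent classifier.")
    parser.add_argument("--config", required=True,
                        help="Path to the YAML training config.")
    # Treating the experiment name as optional is an assumption.
    parser.add_argument("--experiment_name", default="default",
                        help="Label used to identify this training run.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


args = parse_args([
    "--config", "configs/train_config.yaml",
    "--experiment_name", "whisper+bert_baseline",
])
print(args.config, args.experiment_name)
```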
 ### 2. Inference Pipeline
 ```bash
 # Predict from a WAV file
 python pipelines/inference_pipeline.py --model_path models/best.pt --audio_file data/test_audio/command.wav
 ```
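
Conceptually, inference chains two stages: audio to transcript, then transcript to intent. The sketch below injects both stages as plain callables so either can be swapped. `run_inference`, `fake_transcriber`, and `keyword_classifier` are hypothetical stand-ins, not the code in `pipelines/inference_pipeline.py`.

```python
from typing import Callable, Dict

Transcriber = Callable[[str], str]            # audio path -> raw transcript
Classifier = Callable[[str], Dict[str, str]]  # transcript -> intent slots


def normalize(text: str) -> str:
    """Lowercase, trim edge punctuation, and collapse repeated whitespace."""
    return " ".join(text.lower().strip(" .!?").split())


def run_inference(audio_path: str, transcribe: Transcriber,
                  classify: Classifier) -> Dict[str, object]:
    """Run ASR, normalize the transcript, then classify it."""
    transcript = normalize(transcribe(audio_path))
    return {"transcript": transcript, "intent": classify(transcript)}


def fake_transcriber(audio_path: str) -> str:
    """Stub ASR that ignores the file; a real model would decode the audio."""
    return "Turn on the fan in the bedroom."


def keyword_classifier(transcript: str) -> Dict[str, str]:
    """Toy rule-based classifier used only to exercise the pipeline."""
    words = transcript.split()
    action = "deactivate" if "off" in words else "activate"
    obj = next((w for w in words if w in {"lights", "fan"}), "none")
    location = next((w for w in words if w in {"kitchen", "bedroom"}), "none")
    return {"action": action, "object": obj, "location": location}


result = run_inference("data/test_audio/command.wav",
                       fake_transcriber, keyword_classifier)
print(result)
```

With the `openai-whisper` package installed, `lambda path: whisper.load_model("base").transcribe(path)["text"]` could replace the stub transcriber; in practice the model would be loaded once and reused.
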
 ### 3. Evaluation Pipeline
 ```bash
 python pipelines/evaluation_pipeline.py --model_path models/best.pt --test_data data/processed/test.csv
 ```
 ---
 ## API & UI Demos
 ### FastAPI Server
 ```bash
 uvicorn src.api.server:app --reload
 ```
 * `POST /predict` — Upload audio and get predicted intent
 * `GET /health` — System health check
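
A client could call `POST /predict` with a multipart upload. The sketch below only builds the request using the standard library. The form field name `file` is an assumption about the endpoint's signature, and `build_predict_request` is an illustrative helper; `127.0.0.1:8000` is uvicorn's default bind address.

```python
import urllib.request
import uuid


def build_predict_request(base_url: str, wav_bytes: bytes,
                          filename: str = "command.wav") -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one WAV file."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        # The field name "file" is assumed; match it to the server's parameter.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=head + wav_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes; a real call would read them from a .wav file.
request = build_predict_request("http://127.0.0.1:8000", b"RIFF-placeholder")
print(request.get_method(), request.full_url)
```

To send it against a running server, pass the request to `urllib.request.urlopen(request)` and decode the JSON body of the response.
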
 ### Streamlit Demo
 ```bash
 python run_demo.py
 ```
 * Upload `.wav` or record live
 * View transcript, structured intent, and confidence scores
 ---
 ## Metrics Tracked
 | Metric               | Description                                            |
 | -------------------- | ------------------------------------------------------ |
 | **WER**              | Word Error Rate from ASR                               |
 | **Intent Accuracy**  | Accuracy for full `(action, object, location)` triplet |
 | **F1 Scores**        | Macro, micro, and per-label F1                         |
 | **Confusion Matrix** | Action/Object/Location classification errors           |
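
Two of these metrics are easy to state precisely. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and triplet accuracy counts a prediction as correct only when all three slots match. The following is a minimal standard-library sketch; the function names are illustrative and the evaluation pipeline may compute these differently.

```python
from typing import Sequence, Tuple

Triplet = Tuple[str, str, str]  # (action, object, location)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def triplet_accuracy(gold: Sequence[Triplet], pred: Sequence[Triplet]) -> float:
    """Fraction of examples whose full triplet is predicted exactly."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


wer = word_error_rate("turn on the fan in the bedroom",
                      "turn on the van in bedroom")
acc = triplet_accuracy(
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "bedroom")],
    [("activate", "lights", "kitchen"), ("deactivate", "fan", "kitchen")],
)
print(round(wer, 3), acc)
```

In the example, one substitution and one deletion against seven reference words give a WER of 2/7, and one exact match out of two gives an accuracy of 0.5.
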
 ---
 ## Included Notebooks
 | Notebook                         | Purpose                                         |
 | -------------------------------- | ----------------------------------------------- |
 | `01_audio_exploration.ipynb`     | Visualize waveforms, mel spectrograms           |
 | `02_asr_error_analysis.ipynb`    | Compare Whisper vs Wav2Vec2 transcriptions      |
 | `03_intent_classification.ipynb` | Hyperparameter tuning, misclassification review |
 | `04_results_analysis.ipynb`      | Plot confusion matrices and F1 breakdowns       |
 ---
 ## Models Used
 * **Whisper Base (ASR)** – Robust transcription of short commands
 * **DistilBERT / BERT** – Text classification of transcripts; any other text classifier of your choice can be substituted
 * Optionally: Fine-tune **Whisper** for joint ASR+intent learning
 ---
 ## Design Decisions
 * **Pipeline Modularity**: All components (ASR, NLP, evaluation) are swappable
 * **Config-Driven**: Use YAML configs for training, ASR models, and evaluation
 * **Separation of Concerns**: Clean division between preprocessing, training, and inference
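
One common way to combine swappable components with config-driven selection is a small registry keyed by name. The sketch below is an assumption about how this could look, not the repository's actual mechanism. `ASR_REGISTRY`, `register_asr`, `build_asr`, and the `asr.name` config key are all hypothetical, and a plain dict stands in for a parsed YAML file.

```python
from typing import Callable, Dict

AsrBackend = Callable[[str], str]  # audio path -> transcript

ASR_REGISTRY: Dict[str, AsrBackend] = {}


def register_asr(name: str) -> Callable[[AsrBackend], AsrBackend]:
    """Decorator that makes a backend selectable by name from a config."""
    def decorator(fn: AsrBackend) -> AsrBackend:
        ASR_REGISTRY[name] = fn
        return fn
    return decorator


@register_asr("echo")
def echo_asr(audio_path: str) -> str:
    """Dummy backend for tests; returns a string derived from the path."""
    return f"transcript of {audio_path}"


@register_asr("silent")
def silent_asr(audio_path: str) -> str:
    """Dummy backend that always returns an empty transcript."""
    return ""


def build_asr(config: dict) -> AsrBackend:
    """Look up the backend named under the (assumed) `asr.name` key."""
    name = config["asr"]["name"]
    if name not in ASR_REGISTRY:
        raise ValueError(f"unknown ASR backend {name!r}; known: {sorted(ASR_REGISTRY)}")
    return ASR_REGISTRY[name]


config = {"asr": {"name": "echo"}}  # stands in for a loaded YAML config
asr = build_asr(config)
print(asr("command.wav"))
```

Swapping the ASR model then means changing one config value rather than editing pipeline code; the same pattern would apply to classifiers and evaluators.
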
 ---
 ## Potential Extensions
 * Real-time streaming inference
 * Speaker identification and voice embeddings
 * End-to-end fine-tuning of Whisper for direct audio → intent
 * Multilingual support via Whisper large models
 * Deployable microservice with Docker
 ---
 ## Example Usage
 ### Voice Input:
 > "Turn on the fan in the bedroom"
 ### Output:
 ```json
 {
   "transcript": "turn on the fan in the bedroom",
   "intent": {
     "action": "activate",
     "object": "fan",
     "location": "bedroom"
   },
   "confidence": {
     "action": 0.98,
     "object": 0.95,
     "location": 0.93
   }
 }
 ```
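
A consumer of this output will usually want to parse it and decide whether the prediction is trustworthy enough to act on. The following is a small sketch, assuming the response has exactly the shape shown above; `Prediction`, `parse_prediction`, and the 0.9 default threshold are illustrative choices, not part of the project.

```python
import json
from dataclasses import dataclass
from typing import Dict

SLOTS = ("action", "object", "location")


@dataclass(frozen=True)
class Prediction:
    transcript: str
    intent: Dict[str, str]
    confidence: Dict[str, float]

    def is_confident(self, threshold: float = 0.9) -> bool:
        """True only if every slot's confidence reaches the threshold."""
        return all(self.confidence[slot] >= threshold for slot in SLOTS)


def parse_prediction(payload: str) -> Prediction:
    """Parse a JSON response shaped like the example output."""
    data = json.loads(payload)
    return Prediction(
        transcript=data["transcript"],
        intent=data["intent"],
        confidence=data["confidence"],
    )


RESPONSE = """
{
  "transcript": "turn on the fan in the bedroom",
  "intent": {"action": "activate", "object": "fan", "location": "bedroom"},
  "confidence": {"action": 0.98, "object": 0.95, "location": 0.93}
}
"""

prediction = parse_prediction(RESPONSE)
print(prediction.intent, prediction.is_confident())
```

Requiring every slot to clear the threshold is a conservative choice: with the values above the prediction passes at 0.9 but would be rejected at 0.95 because of the `location` score.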