initial setup

2025-08-06 22:45:37 +01:00
commit fef3f5ae35
42 changed files with 7147 additions and 0 deletions
@@ -0,0 +1,582 @@
+# Fine-Tune Task: NLP Pipeline Framework
+
+A framework for fine-tuning NLP models, driven by organized YAML configurations. It is laid out for four task types (classification, completion, styling, matching); only classification is implemented so far.
+
+## 🎯 Supported Tasks
+
+The framework is organized around four NLP task types, each with its own configuration directory:
+
+- **Classification**: Text classification, sentiment analysis, topic classification
+- **Completion**: Text generation, code completion, story generation
+- **Styling**: Style transfer, tone classification, writing style adaptation
+- **Matching**: Semantic matching, entity matching, similarity scoring
+
+### Current Implementation Status
+
+- ✅ **Classification**: Fully implemented with emotion classification example
+- 🔄 **Completion**: Planned for future updates
+- 🔄 **Styling**: Planned for future updates
+- 🔄 **Matching**: Planned for future updates
+
+**Note**: Only the classification task is currently supported. The other tasks (completion, styling, matching) are planned for future updates.
+
+## 🏗️ Project Structure
+
+```
+fine-tune-task/
+├── configs/                    # YAML configuration files
+│   ├── classification/         # ✅ Implemented
+│   │   ├── emotion.yaml       # Emotion classification
+│   │   └── custom.yaml        # Custom dataset
+│   ├── completion/             # 🔄 Planned for future updates
+│   ├── styling/               # 🔄 Planned for future updates
+│   └── matching/              # 🔄 Planned for future updates
+├── data/                       # Data directories
+│   ├── raw/                    # Raw input data
+│   │   ├── classification/     # ✅ Implemented
+│   │   ├── completion/         # 🔄 Planned for future updates
+│   │   ├── styling/           # 🔄 Planned for future updates
+│   │   └── matching/          # 🔄 Planned for future updates
+│   └── processed/              # Processed data
+│       ├── classification/     # ✅ Implemented
+│       ├── completion/         # 🔄 Planned for future updates
+│       ├── styling/           # 🔄 Planned for future updates
+│       └── matching/          # 🔄 Planned for future updates
+├── pipelines/                  # Core pipeline scripts
+│   ├── classification/         # ✅ Implemented
+│   │   ├── data_processor.py  # Data processing
+│   │   ├── train.py          # Training
+│   │   └── inference.py      # Inference
+│   ├── completion/            # 🔄 Framework ready
+│   ├── styling/              # 🔄 Framework ready
+│   └── matching/             # 🔄 Framework ready
+├── scripts/                    # User-friendly scripts
+│   ├── classification/         # ✅ Implemented
+│   │   ├── data_processor.py  # Data processing script
+│   │   ├── trainer.py        # Training script
+│   │   └── inference.py      # Inference script
+│   ├── completion/            # 🔄 Framework ready
+│   ├── styling/              # 🔄 Framework ready
+│   └── matching/             # 🔄 Framework ready
+├── results/                    # Model outputs
+│   ├── classification/         # ✅ Implemented
+│   ├── completion/            # 🔄 Ready
+│   ├── styling/              # 🔄 Ready
+│   └── matching/             # 🔄 Ready
+└── utils/                      # Shared utility modules
+```
+
+## 🚀 Quick Start (Classification Task)
+
+### 1. Setup Environment
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Set Python path
+export PYTHONPATH=.
+```
+
+### 2. Data Processing
+
+```bash
+# Process emotion dataset
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+
+# Process with custom parameters
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000
+
+# Check output location
+ls -la ./data/processed/classification/emotion/classification/
+```
+
+**Expected Output:**
+```
+✅ Data processing completed successfully!
+  Data source: huggingface
+  Dataset: dair-ai/emotion
+  Total samples: 2999
+  Unique labels: 6
+  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
+  Output directory: ./data/processed/classification/emotion
+```
+
+### 3. Model Training
+
+```bash
+# Train using processed data
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+
+# Train with custom parameters
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32
+
+# Check model output
+ls -la ./results/classification/emotion_model/
+```
+
+**Expected Output:**
+```
+✅ Training completed successfully!
+  Model: bert-base-uncased
+  Data directory: ./data/processed/classification/emotion
+  Training for 3 epochs with batch size 16
+  Model saved to: ./results/classification/emotion_model
+```
+
+### 4. Model Inference
+
+```bash
+# Run inference
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
+
+# File-based inference
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
+```
+
+**Expected Output:**
+```
+✅ Inference completed successfully!
+  Loading model from: ./results/classification/emotion_model
+  Predicted label: joy
+  Confidence: 0.8542
+  Top 3 predictions:
+    - joy: 0.8542
+    - love: 0.1234
+    - surprise: 0.0224
+```
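+
+The confidence and top-3 values above are consistent with a softmax over the classifier's output logits. The following is a minimal sketch of that post-processing, not the framework's actual code: the `top_k_predictions` helper and the logit values are hypothetical, and the label order is assumed to follow the `dair-ai/emotion` dataset.
+
+```python
+import math
+
+# Label order assumed to match the dair-ai/emotion dataset (ids 0 to 5).
+LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]
+
+
+def top_k_predictions(logits, labels=LABELS, k=3):
+    """Convert raw logits to probabilities and return the k most likely labels."""
+    # Subtract the largest logit before exponentiating for numerical stability.
+    peak = max(logits)
+    exps = [math.exp(x - peak) for x in logits]
+    total = sum(exps)
+    probs = [e / total for e in exps]
+    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
+    return ranked[:k]
+
+
+# Hypothetical logits for the input "I love this product!"
+for label, prob in top_k_predictions([-1.2, 4.1, 2.2, -0.8, -1.5, 0.5]):
+    print(f"{label}: {prob:.4f}")
+```
+
+The first entry of the returned list corresponds to the predicted label, and its probability to the reported confidence.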
+
+## 🔧 Adding New Tasks
+
+To add a new task (e.g., completion, styling, matching), follow these steps:
+
+### Step 1: Create Task Directory Structure
+
+```bash
+# Create task directories
+mkdir -p configs/completion
+mkdir -p data/raw/completion data/processed/completion
+mkdir -p pipelines/completion
+mkdir -p scripts/completion
+mkdir -p results/completion
+mkdir -p tasks/completion
+mkdir -p models/completion
+```
+
+### Step 2: Create Task Configuration
+
+```bash
+# Create YAML configuration for new task
+cat > configs/completion/text_generation.yaml << 'EOF'
+# Text Generation Task Configuration
+task:
+  name: "completion"
+  type: "text_generation"
+
+# Data Processing Configuration
+data:
+  source: "huggingface"
+  dataset_name: "your-dataset-name"
+  output_dir: "./data/processed/completion/text_generation"
+  max_samples: 1000
+  # ... other data parameters
+
+# Model Configuration
+model:
+  name: "gpt2"  # Different model for completion
+  max_length: 1024
+  # ... model parameters
+
+# Training Configuration
+training:
+  num_epochs: 3
+  batch_size: 8  # Smaller batch for generation
+  learning_rate: 5e-5
+  data_dir: "./data/processed/completion/text_generation"
+  output_dir: "./results/completion/text_generation_model"
+
+# Inference Configuration
+inference:
+  model_path: "./results/completion/text_generation_model"
+  device: "auto"
+  batch_size: 1  # Generation is typically one at a time
+  max_length: 100
+  temperature: 0.7
+EOF
+```
+
+### Step 3: Create Pipeline Scripts
+
+Copy and modify the classification pipeline scripts:
+
+```bash
+# Copy classification scripts as templates
+cp pipelines/classification/data_processor.py pipelines/completion/
+cp pipelines/classification/train.py pipelines/completion/
+cp pipelines/classification/inference.py pipelines/completion/
+
+# Copy task scripts
+cp scripts/classification/data_processor.py scripts/completion/
+cp scripts/classification/trainer.py scripts/completion/
+cp scripts/classification/inference.py scripts/completion/
+```
+
+### Step 4: Modify Pipeline Code
+
+Update the pipeline scripts for your specific task:
+
+1. **Data Processor** (`pipelines/completion/data_processor.py`):
+   - Update data loading logic for completion datasets
+   - Modify preprocessing for text generation
+   - Adjust output format for completion tasks
+
+2. **Trainer** (`pipelines/completion/train.py`):
+   - Change model type to generation models (GPT, T5, etc.)
+   - Update training loop for text generation
+   - Modify evaluation metrics
+
+3. **Inference** (`pipelines/completion/inference.py`):
+   - Update inference for text generation
+   - Add generation parameters (temperature, top-k, etc.)
+   - Modify output format
+
+### Step 5: Update Task Scripts
+
+Modify the task scripts to use your new pipeline:
+
+```python
+# scripts/completion/data_processor.py
+def run_with_yaml_config(config_path: str, **cli_overrides):
+    cmd = [
+        "python", "pipelines/completion/data_processor.py",  # Updated path
+        "--config", config_path
+    ]
+    # ... rest of the function
+```
+
+### Step 6: Create Task-Specific Models
+
+```bash
+# Create model directory
+mkdir -p models/completion
+
+# Add task-specific model classes
+cat > models/completion/text_generator.py << 'EOF'
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+class TextGenerator:
+    def __init__(self, model_name):
+        self.model = AutoModelForCausalLM.from_pretrained(model_name)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+    
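+    def generate(self, prompt, max_length=100, temperature=0.7):
+        # Minimal sketch: tokenize the prompt, sample a continuation, decode it
+        inputs = self.tokenizer(prompt, return_tensors="pt")
+        output_ids = self.model.generate(
+            **inputs,
+            max_length=max_length,
+            do_sample=True,  # sampling must be on for temperature to take effect
+            temperature=temperature,
+            pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 defines no pad token
+        )
+        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)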
+EOF
+```
+
+### Step 7: Test Your New Task
+
+```bash
+# Test data processing
+python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml
+
+# Test training
+python scripts/completion/trainer.py --config configs/completion/text_generation.yaml
+
+# Test inference
+python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
+```
+
+## 📋 YAML Configuration Guide
+
+### Configuration Structure
+
+Each YAML file is organized into clear sections:
+
+```yaml
+# Task Configuration
+task:
+  name: "classification"  # or "completion", "styling", "matching"
+  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"
+
+# Data Processing Configuration
+data:
+  source: "huggingface"                    # "huggingface" or "custom"
+  dataset_name: "dair-ai/emotion"         # HuggingFace dataset name
+  output_dir: "./data/processed/classification/emotion"
+  max_samples: 1000                        # Limit dataset size
+  # ... other data parameters
+
+# Model Configuration
+model:
+  name: "bert-base-uncased"                # Model from HuggingFace Hub
+  max_length: 512                          # Sequence length
+  num_labels: 6                            # Number of classes
+
+# Training Configuration
+training:
+  num_epochs: 3                            # Training epochs
+  batch_size: 16                           # Batch size
+  learning_rate: 2e-5                      # Learning rate
+  data_dir: "./data/processed/classification/emotion"
+  output_dir: "./results/classification/emotion_model"
+
+# Inference Configuration
+inference:
+  model_path: "./results/classification/emotion_model"
+  device: "auto"                           # "auto", "cuda", "cpu"
+  batch_size: 32                           # Inference batch size
+  return_top_k: 3                          # Top K predictions
+```
+
+### Available Configuration Files
+
+- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
+- `configs/classification/custom.yaml` - Custom dataset processing
+
+## 🔧 Usage Examples
+
+### Data Processing Examples
+
+```bash
+# 1. Use YAML config only
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+
+# 2. Override YAML values
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500
+
+# 3. Use CLI only (backward compatibility)
+python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion
+
+# 4. Run examples
+python scripts/classification/data_processor.py examples
+```
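+
+The second example above relies on CLI flags taking precedence over values from the YAML file. One way such a merge could work is sketched below with plain dictionaries. The `apply_overrides` helper and the flag-to-key mapping are assumptions made for illustration, not the framework's actual implementation.
+
+```python
+import copy
+
+# Hypothetical mapping from CLI flag names to (section, key) in the YAML config.
+FLAG_TO_KEY = {
+    "max_samples": ("data", "max_samples"),
+    "num_epochs": ("training", "num_epochs"),
+    "batch_size": ("training", "batch_size"),
+}
+
+
+def apply_overrides(config, cli_overrides):
+    """Return a copy of config in which CLI values that were set replace YAML values."""
+    merged = copy.deepcopy(config)
+    for flag, value in cli_overrides.items():
+        if value is None:  # flag was not passed on the command line
+            continue
+        section, key = FLAG_TO_KEY[flag]
+        merged.setdefault(section, {})[key] = value
+    return merged
+
+
+yaml_config = {"data": {"source": "huggingface", "max_samples": 1000}}
+print(apply_overrides(yaml_config, {"max_samples": 500, "num_epochs": None}))
+```
+
+Working on a deep copy keeps the loaded YAML values untouched, so the same base config can be reused across runs with different overrides.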
+
+### Training Examples
+
+```bash
+# 1. Use YAML config only
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+
+# 2. Override YAML values
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5
+
+# 3. Use CLI only
+python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3
+
+# 4. Run examples
+python scripts/classification/trainer.py examples
+```
+
+### Inference Examples
+
+```bash
+# 1. Single text prediction
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
+
+# 2. File-based prediction
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
+
+# 3. Interactive mode
+python scripts/classification/inference.py --config configs/classification/emotion.yaml
+
+# 4. Run examples
+python scripts/classification/inference.py examples
+```
+
+## 🐛 Troubleshooting Common Errors
+
+### 1. ModuleNotFoundError: No module named 'utils'
+
+**Error:**
+```
+ModuleNotFoundError: No module named 'utils'
+```
+
+**Solution:**
+```bash
+# Set Python path before running scripts
+export PYTHONPATH=.
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+```
+
+### 2. Model Path Not Found
+
+**Error:**
+```
+❌ Model path not found: ./results/classification/emotion_model
+```
+
+**Solution:**
+```bash
+# Train the model first
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+
+# Then run inference
+python scripts/classification/inference.py --config configs/classification/emotion.yaml
+```
+
+### 3. Data Directory Not Found
+
+**Error:**
+```
+❌ Data directory not found: ./data/processed/classification/emotion
+```
+
+**Solution:**
+```bash
+# Process data first
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+
+# Then train
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+```
+
+### 4. YAML Configuration Errors
+
+**Error:**
+```
+data_processor.py: error: --data-source is required (either in YAML config or CLI)
+```
+
+**Solution:**
+Check the structure of your YAML file. The data source must be set as `source` under the `data` section, not as `data_source`:
+```yaml
+data:
+  source: "huggingface"  # Not data_source
+  dataset_name: "dair-ai/emotion"
+```
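+
+A key-name mistake like this can be caught before any script runs. The sketch below assumes the YAML file has already been loaded into a dictionary; `find_config_problems` and its rules are hypothetical and only cover the keys shown above.
+
+```python
+def find_config_problems(config):
+    """Return a list of human-readable problems found in the data section."""
+    data = config.get("data")
+    if not isinstance(data, dict):
+        return ["missing top-level 'data' section"]
+
+    problems = []
+    if "data_source" in data:
+        problems.append("use 'source' instead of 'data_source' under 'data'")
+    if "source" not in data:
+        problems.append("'data.source' is required")
+    elif data["source"] == "huggingface" and "dataset_name" not in data:
+        problems.append("'data.dataset_name' is required for the huggingface source")
+    return problems
+
+
+# A config that uses the wrong key name triggers two messages.
+for problem in find_config_problems({"data": {"data_source": "huggingface"}}):
+    print(problem)
+```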
+
+### 5. HuggingFace Download Issues
+
+**Error:**
+```
+KeyboardInterrupt during model download
+```
+
+**Solution:**
+```bash
+# Use a smaller dataset for testing
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100
+
+# Or set a local Hugging Face cache directory so completed downloads are reused
+export HF_HOME=./cache
+```
+
+### 6. CUDA/GPU Issues
+
+**Error:**
+```
+RuntimeError: CUDA out of memory
+```
+
+**Solution:**
+```bash
+# Reduce batch size
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8
+
+# Or use CPU
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
+```
+
+## 📊 Monitoring and Logs
+
+### Check Processing Status
+
+```bash
+# Check data processing output
+ls -la ./data/processed/classification/emotion/classification/
+
+# Check training output
+ls -la ./results/classification/emotion_model/
+
+# Check logs
+tail -f logs/training.log
+```
+
+### Expected File Structure After Processing
+
+```
+./data/processed/classification/emotion/classification/
+├── train.jsonl       # Training data
+├── validation.jsonl   # Validation data
+└── test.jsonl        # Test data
+
+./results/classification/emotion_model/
+├── config.json       # Model configuration
+├── pytorch_model.bin # Model weights
+├── tokenizer.json    # Tokenizer
+└── label_info.json   # Label mappings
+```
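+
+The layout above can be verified before moving on to the next stage. This is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper, and the directory and file names are the ones listed above.
+
+```python
+from pathlib import Path
+
+# Expected outputs, copied from the file structure listed above.
+EXPECTED = {
+    "./data/processed/classification/emotion/classification": [
+        "train.jsonl", "validation.jsonl", "test.jsonl",
+    ],
+    "./results/classification/emotion_model": [
+        "config.json", "pytorch_model.bin", "tokenizer.json", "label_info.json",
+    ],
+}
+
+
+def missing_files(expected=EXPECTED, root="."):
+    """Return the expected files that do not exist under root."""
+    root = Path(root)
+    return [
+        str(Path(directory) / name)
+        for directory, names in expected.items()
+        for name in names
+        if not (root / directory / name).is_file()
+    ]
+
+
+if __name__ == "__main__":
+    for path in missing_files():
+        print(f"missing: {path}")
+```
+
+An empty result means both the data processing and training steps left the files that the later steps expect.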
+
+## 🔄 Workflow Summary
+
+1. **Setup**: Install dependencies and set PYTHONPATH
+2. **Data Processing**: Process raw data into organized splits
+3. **Training**: Train model using processed data
+4. **Inference**: Use trained model for predictions
+5. **Monitoring**: Check logs and outputs for errors
+
+## 📝 Creating Custom Configurations
+
+### For New Datasets
+
+1. Copy existing config:
+```bash
+cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
+```
+
+2. Modify parameters:
+```yaml
+data:
+  source: "huggingface"
+  dataset_name: "your-dataset-name"
+  output_dir: "./data/processed/classification/my_dataset"
+  # ... other parameters
+
+training:
+  data_dir: "./data/processed/classification/my_dataset"
+  output_dir: "./results/classification/my_dataset_model"
+```
+
+3. Run the pipeline, starting with data processing:
+```bash
+python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
+```
+
+### For Custom Data
+
+1. Use custom config:
+```yaml
+data:
+  source: "custom"
+  data_path: "./data/raw/my_data.jsonl"
+  output_dir: "./data/processed/classification/my_custom_dataset"
+```
+
+2. Run processing:
+```bash
+python scripts/classification/data_processor.py --config configs/classification/custom.yaml
+```
+
+## 🎯 Best Practices
+
+1. **Always check output directories** before running the next step
+2. **Use small datasets for testing** before full runs
+3. **Monitor logs** for errors and warnings
+4. **Back up configurations** before major changes
+5. **Use version control** for YAML files
+6. **Test with CLI overrides** for quick experiments
+
+## 📞 Support
+
+For issues and questions:
+1. Check the troubleshooting section above
+2. Review logs in the output directories
+3. Verify YAML configuration structure
+4. Test with smaller datasets first
+
+---
+
+**Happy fine-tuning! 🚀**