updated style mimicking fine tuning

2025-08-13 23:50:20 +00:00
parent 8847035d12
commit 1b46270afa
83 changed files with 2537260 additions and 378 deletions
@@ -0,0 +1,763 @@
+# Fine-Tune Task: NLP Pipeline Framework
+
+A framework for fine-tuning NLP models, driven by organized YAML configurations. Classification and styling are implemented; completion and matching are planned.
+
+## Supported Tasks
+
+The framework is organized around four task types, each with its own configurations (see the implementation status below):
+
+- **Classification**: Text classification, sentiment analysis, topic classification
+- **Completion**: Text generation, code completion, story generation
+- **Styling**: Style transfer, tone classification, writing style adaptation
+- **Matching**: Semantic matching, entity matching, similarity scoring
+
+### Current Implementation Status
+
+- **Classification**: ✅ Fully implemented with emotion classification example
+- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
+- **Completion**: Planned for future updates
+- **Matching**: Planned for future updates
+
+**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
+
+## Project Structure
+
+```
+fine-tune-task/
+├── configs/                    # YAML configuration files
+│   ├── classification/         # ✅ Implemented
+│   │   ├── emotion.yaml       # Emotion classification
+│   │   └── custom.yaml        # Custom dataset
+│   ├── styling/               # ✅ Implemented
+│   │   └── formal.yaml        # Formal style transfer
+│   ├── completion/             # Planned for future updates
+│   └── matching/              # Planned for future updates
+├── data/                       # Data directories
+│   ├── raw/                    # Raw input data
+│   │   ├── classification/     # ✅ Implemented
+│   │   ├── styling/           # ✅ Implemented
+│   │   ├── completion/         # Planned for future updates
+│   │   └── matching/          # Planned for future updates
+│   └── processed/              # Processed data
+│       ├── classification/     # ✅ Implemented
+│       ├── styling/           # ✅ Implemented
+│       ├── completion/         # Planned for future updates
+│       └── matching/          # Planned for future updates
+├── pipelines/                  # Core pipeline scripts
+│   ├── classification/         # ✅ Implemented
+│   │   ├── data_processor.py  # Data processing
+│   │   ├── train.py          # Training
+│   │   └── inference.py      # Inference
+│   ├── styling/               # ✅ Implemented
+│   │   ├── data_processor.py  # Style data processing
+│   │   ├── train.py          # LoRA fine-tuning
+│   │   └── inference.py      # Style transfer inference
+│   ├── completion/            # Planned for future updates
+│   └── matching/             # Planned for future updates
+├── scripts/                    # User-friendly scripts
+│   ├── classification/         # ✅ Implemented
+│   │   ├── data_processor.py  # Data processing script
+│   │   ├── trainer.py        # Training script
+│   │   └── inference.py      # Inference script
+│   ├── styling/               # ✅ Implemented
+│   │   ├── data_processor.py  # Style data processing script
+│   │   ├── train.py          # Training script
+│   │   └── inference.py      # Inference script
+│   ├── completion/            # Planned for future updates
+│   └── matching/             # Planned for future updates
+├── results/                    # Model outputs
+│   ├── classification/         # ✅ Implemented
+│   ├── styling/              # ✅ Implemented
+│   ├── completion/            # Planned for future updates
+│   └── matching/             # Planned for future updates
+└── utils/                      # Shared utility modules
+```
+
+## Quick Start (Classification Task)
+
+### 1. Setup Environment
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Set Python path
+export PYTHONPATH=.
+```
+
+### 2. Data Processing
+
+```bash
+# Process emotion dataset
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+
+# Process with custom parameters
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000
+
+# Check output location
+ls -la ./data/processed/classification/emotion/classification/
+```
+
+**Expected Output:**
+```
+Data processing completed successfully!
+  Data source: huggingface
+  Dataset: dair-ai/emotion
+  Total samples: 2999
+  Unique labels: 6
+  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
+  Output directory: ./data/processed/classification/emotion
+```
+
+### 3. Model Training
+
+```bash
+# Train using processed data
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+
+# Train with custom parameters
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32
+
+# Check model output
+ls -la ./results/classification/emotion_model/
+```
+
+**Expected Output:**
+```
+Training completed successfully!
+  Model: bert-base-uncased
+  Data directory: ./data/processed/classification/emotion
+  Training for 3 epochs with batch size 16
+  Model saved to: ./results/classification/emotion_model
+```
+
+### 4. Model Inference
+
+```bash
+# Run inference
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
+
+# File-based inference
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
+```
+
+**Expected Output:**
+```
+Inference completed successfully!
+  Loading model from: ./results/classification/emotion_model
+  Predicted label: joy
+  Confidence: 0.8542
+  Top 3 predictions:
+    - joy: 0.8542
+    - love: 0.1234
+    - surprise: 0.0224
+```
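+
+The confidence values above are presumably softmax probabilities over the six emotion labels. As a rough sketch under that assumption (the `top_k_predictions` helper and the logits are hypothetical, not taken from the pipeline code), the top-k ranking could be computed like this:
+
+```python
+import math
+
+# Label order assumed to follow the dair-ai/emotion dataset card.
+LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]
+
+
+def top_k_predictions(logits, labels=LABELS, k=3):
+    """Convert raw logits to softmax probabilities and return the k most likely labels."""
+    peak = max(logits)  # subtract the max for numerical stability
+    exps = [math.exp(x - peak) for x in logits]
+    total = sum(exps)
+    probs = [e / total for e in exps]
+    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
+    return ranked[:k]
+
+
+# Hypothetical logits for "I love this product!"
+for label, prob in top_k_predictions([-1.0, 4.0, 2.0, -2.0, -1.5, 0.5]):
+    print(f"{label}: {prob:.4f}")
+```
+
+The `return_top_k` setting in the inference section of the YAML config would map to `k` in this sketch.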
+
+## Quick Start (Styling Task)
+
+### 1. Setup Environment
+
+```bash
+# Install dependencies (including unsloth for styling)
+pip install -r requirements.txt
+
+# Set Python path
+export PYTHONPATH=.
+```
+
+### 2. Data Processing
+
+```bash
+# Process style transfer dataset
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml
+
+# Create HuggingFace dataset
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
+
+# Check output location
+ls -la ./data/processed/styling/formal/
+```
+
+**Expected Output:**
+```
+Styling data processing completed successfully!
+  Data source: custom
+  Data file: ./data/raw/styling/sample_formal.jsonl
+  Total samples: 5
+  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
+  Output directory: ./data/processed/styling/formal
+  Style instruction: Rewrite the following text in a formal style
+```
+
+### 3. Model Training
+
+```bash
+# Train using processed data (automatically loads from YAML output_dir)
+python scripts/styling/train.py example
+
+# Custom training
+python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
+
+# Check model output
+ls -la ./models/styling/
+```
+
+**Expected Output:**
+```
+Training completed successfully!
+  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
+  Dataset: Loaded from ./data/processed/styling/formal
+  Training for 3 epochs with batch size 4
+  Model saved to: ./models/styling
+```
+
+### 4. Model Inference
+
+```bash
+# Single text style transfer
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
+
+# Batch processing
+python scripts/styling/inference.py batch
+
+# Interactive mode
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml
+```
+
+**Expected Output:**
+```
+Inference completed successfully!
+  Input: Hey, what's up?
+  Output: Hello, how are you doing?
+  Style: Formal
+```
+
+## Adding New Tasks
+
+To add a new task (e.g., completion or matching), follow the steps below. The styling task, which is already implemented, serves as a reference.
+
+### Example: Styling Task (Already Implemented)
+
+The styling task demonstrates a complete implementation:
+
+1. **Task Directory Structure** ✅
+```bash
+configs/styling/           # YAML configurations
+data/raw/styling/         # Raw style transfer data
+data/processed/styling/   # Processed data
+pipelines/styling/        # Core pipeline scripts
+scripts/styling/          # User-friendly scripts
+models/styling/           # Trained models
+```
+
+2. **Pipeline Components** ✅
+- **Data Processor**: Handles style transfer datasets with instruction/input/output format
+- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
+- **Inference**: Style transfer with streaming and batch processing
+
+3. **Key Features** ✅
+- Automatic EOS token handling: `text + tokenizer.eos_token`
+- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
+- YAML integration: Uses `data.output_dir` for automatic dataset loading
+- HuggingFace dataset export and loading
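+
+The EOS handling and batched mapping listed above can be illustrated with a small sketch. The template wording and the `make_formatting_func` factory are assumptions made for illustration; only the `text + tokenizer.eos_token` pattern and the `formatting_prompts_func` name come from this README.
+
+```python
+# Alpaca-style template; the exact wording used by the pipeline is an assumption.
+ALPACA_TEMPLATE = (
+    "### Instruction:\n{instruction}\n\n"
+    "### Input:\n{input}\n\n"
+    "### Response:\n{output}"
+)
+
+
+def make_formatting_func(eos_token):
+    """Return a batched formatter that appends the EOS token to every prompt."""
+
+    def formatting_prompts_func(examples):
+        texts = []
+        for instruction, inp, out in zip(
+            examples["instruction"], examples["input"], examples["output"]
+        ):
+            text = ALPACA_TEMPLATE.format(instruction=instruction, input=inp, output=out)
+            texts.append(text + eos_token)  # mark where each training example ends
+        return {"text": texts}
+
+    return formatting_prompts_func
+
+
+# With a HuggingFace dataset this would be applied as:
+#   dataset = dataset.map(make_formatting_func(tokenizer.eos_token), batched=True)
+batch = {
+    "instruction": ["Rewrite the following text in a formal style"],
+    "input": ["Hey, what's up?"],
+    "output": ["Hello, how are you doing?"],
+}
+formatted = make_formatting_func("</s>")(batch)  # "</s>" is a placeholder EOS token
+print(formatted["text"][0])
+```
+
+Because the function receives a dict of lists when `batched=True`, it returns a dict with one `text` entry per input row.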
+
+### For Other Tasks (completion, matching)
+
+1. **Create Task Directory Structure**
+```bash
+# Create task directories
+mkdir -p configs/completion
+mkdir -p data/raw/completion data/processed/completion
+mkdir -p pipelines/completion
+mkdir -p scripts/completion
+mkdir -p results/completion
+mkdir -p tasks/completion
+mkdir -p models/completion
+```
+
+2. **Create Task Configuration**
+
+```bash
+# Create YAML configuration for new task
+cat > configs/completion/text_generation.yaml << 'EOF'
+# Text Generation Task Configuration
+task:
+  name: "completion"
+  type: "text_generation"
+
+# Data Processing Configuration
+data:
+  source: "huggingface"
+  dataset_name: "your-dataset-name"
+  output_dir: "./data/processed/completion/text_generation"
+  max_samples: 1000
+  # ... other data parameters
+
+# Model Configuration
+model:
+  name: "gpt2"  # Different model for completion
+  max_length: 1024
+  # ... model parameters
+
+# Training Configuration
+training:
+  num_epochs: 3
+  batch_size: 8  # Smaller batch for generation
+  learning_rate: 5e-5
+  data_dir: "./data/processed/completion/text_generation"
+  output_dir: "./results/completion/text_generation_model"
+
+# Inference Configuration
+inference:
+  model_path: "./results/completion/text_generation_model"
+  device: "auto"
+  batch_size: 1  # Generation is typically one at a time
+  max_length: 100
+  temperature: 0.7
+EOF
+```
+
+3. **Create Pipeline Scripts**
+
+Copy and modify the classification pipeline scripts:
+
+```bash
+# Copy classification scripts as templates
+cp pipelines/classification/data_processor.py pipelines/completion/
+cp pipelines/classification/train.py pipelines/completion/
+cp pipelines/classification/inference.py pipelines/completion/
+
+# Copy task scripts
+cp scripts/classification/data_processor.py scripts/completion/
+cp scripts/classification/trainer.py scripts/completion/
+cp scripts/classification/inference.py scripts/completion/
+```
+
+4. **Modify Pipeline Code**
+
+Update the pipeline scripts for your specific task:
+
+1. **Data Processor** (`pipelines/completion/data_processor.py`):
+   - Update data loading logic for completion datasets
+   - Modify preprocessing for text generation
+   - Adjust output format for completion tasks
+
+2. **Trainer** (`pipelines/completion/train.py`):
+   - Change model type to generation models (GPT, T5, etc.)
+   - Update training loop for text generation
+   - Modify evaluation metrics
+
+3. **Inference** (`pipelines/completion/inference.py`):
+   - Update inference for text generation
+   - Add generation parameters (temperature, top-k, etc.)
+   - Modify output format
+
+5. **Update Task Scripts**
+
+Modify the task scripts to use your new pipeline:
+
+```python
+# scripts/completion/data_processor.py
+def run_with_yaml_config(config_path: str, **cli_overrides):
+    cmd = [
+        "python", "pipelines/completion/data_processor.py",  # Updated path
+        "--config", config_path
+    ]
+    # ... rest of the function
+```
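+
+One way the elided part of such a wrapper might turn keyword overrides into command-line flags is sketched below. The `build_command` helper is hypothetical, and it uses `sys.executable` instead of a bare `"python"` so the child process runs under the same interpreter; the real script may differ.
+
+```python
+import sys
+
+
+def build_command(pipeline_script, config_path, **cli_overrides):
+    """Build the argument list for a pipeline script, appending CLI overrides."""
+    cmd = [sys.executable, pipeline_script, "--config", config_path]
+    for key, value in cli_overrides.items():
+        if value is None:
+            continue  # unset overrides fall back to the YAML value
+        flag = "--" + key.replace("_", "-")  # max_samples -> --max-samples
+        if isinstance(value, bool):
+            if value:
+                cmd.append(flag)  # boolean switches carry no value
+        else:
+            cmd.extend([flag, str(value)])
+    return cmd
+
+
+cmd = build_command(
+    "pipelines/completion/data_processor.py",
+    "configs/completion/text_generation.yaml",
+    max_samples=500,
+)
+print(" ".join(cmd))
+# To execute the pipeline: subprocess.run(cmd, check=True)
+```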
+
+6. **Create Task-Specific Models**
+
+```bash
+# Create model directory
+mkdir -p models/completion
+
+# Add task-specific model classes
+cat > models/completion/text_generator.py << 'EOF'
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+class TextGenerator:
+    def __init__(self, model_name):
+        self.model = AutoModelForCausalLM.from_pretrained(model_name)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+    
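+    def generate(self, prompt, max_length=100, temperature=0.7):
+        # Tokenize the prompt, sample a continuation, and decode it back to text
+        inputs = self.tokenizer(prompt, return_tensors="pt")
+        output_ids = self.model.generate(
+            **inputs,
+            max_length=max_length,
+            temperature=temperature,
+            do_sample=True,  # temperature only takes effect when sampling
+            pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 has no pad token
+        )
+        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)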
+EOF
+```
+
+7. **Test Your New Task**
+
+```bash
+# Test data processing
+python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml
+
+# Test training
+python scripts/completion/trainer.py --config configs/completion/text_generation.yaml
+
+# Test inference
+python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
+```
+
+## YAML Configuration Guide
+
+### Configuration Structure
+
+Each YAML file is organized into clear sections:
+
+```yaml
+# Task Configuration
+task:
+  name: "classification"  # or "completion", "styling", "matching"
+  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"
+
+# Data Processing Configuration
+data:
+  source: "huggingface"                    # "huggingface" or "custom"
+  dataset_name: "dair-ai/emotion"         # HuggingFace dataset name
+  output_dir: "./data/processed/classification/emotion"
+  max_samples: 1000                        # Limit dataset size
+  # ... other data parameters
+
+# Model Configuration
+model:
+  name: "bert-base-uncased"                # Model from HuggingFace Hub
+  max_length: 512                          # Sequence length
+  num_labels: 6                            # Number of classes
+
+# Training Configuration
+training:
+  num_epochs: 3                            # Training epochs
+  batch_size: 16                           # Batch size
+  learning_rate: 2e-5                      # Learning rate
+  data_dir: "./data/processed/classification/emotion"
+  output_dir: "./results/classification/emotion_model"
+
+# Inference Configuration
+inference:
+  model_path: "./results/classification/emotion_model"
+  device: "auto"                           # "auto", "cuda", "cpu"
+  batch_size: 32                           # Inference batch size
+  return_top_k: 3                          # Top K predictions
+```
+
+### Styling Configuration Example
+
+```yaml
+# Styling Task Configuration
+task:
+  name: "styling"
+  type: "style_transfer"
+
+# Data Processing Configuration
+data:
+  source: "custom"
+  data_path: "./data/raw/styling/sample_formal.jsonl"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite the following text in a formal style"
+  output_dir: "./data/processed/styling/formal"
+  output_format: "alpaca"
+
+# Model Configuration
+model:
+  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
+  training_max_seq_length: 2048
+  training_load_in_4bit: true
+
+# Training Configuration
+training:
+  num_epochs: 3
+  batch_size: 2
+  learning_rate: 2e-4
+  weight_decay: 0.01
+
+# Inference Configuration
+inference:
+  batch_size: 1
+  max_new_tokens: 128
+  temperature: 0.8
+```
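+
+To make the `input_field`, `output_field`, and `instruction` settings concrete, here is a minimal sketch of how one raw JSONL record could be mapped to an instruction/input/output example. The `to_alpaca` helper and the sample record are hypothetical; the actual data processor may do this differently.
+
+```python
+import json
+
+
+def to_alpaca(record, data_cfg):
+    """Map one raw record to an instruction/input/output example."""
+    return {
+        "instruction": data_cfg["instruction"],
+        "input": record[data_cfg["input_field"]],
+        "output": record[data_cfg["output_field"]],
+    }
+
+
+# Mirrors the relevant keys of the `data` section in the styling config.
+data_cfg = {
+    "input_field": "text",
+    "output_field": "styled_text",
+    "instruction": "Rewrite the following text in a formal style",
+}
+
+# One hypothetical line from a raw JSONL file.
+raw_line = '{"text": "Hey, what\'s up?", "styled_text": "Hello, how are you doing?"}'
+example = to_alpaca(json.loads(raw_line), data_cfg)
+print(json.dumps(example, indent=2))
+```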
+
+### Available Configuration Files
+
+- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
+- `configs/classification/custom.yaml` - Custom dataset processing
+- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning
+
+## Usage Examples
+
+### Data Processing Examples
+
+```bash
+# 1. Use YAML config only
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+
+# 2. Override YAML values
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500
+
+# 3. Use CLI only (backward compatibility)
+python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion
+
+# 4. Run examples
+python scripts/classification/data_processor.py examples
+```
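+
+The override behaviour in example 2 implies a precedence order in which CLI values win over YAML values. A minimal sketch of that merge, assuming dotted override keys and a hypothetical `apply_overrides` helper (neither is confirmed by the scripts themselves):
+
+```python
+import copy
+
+
+def apply_overrides(config, overrides):
+    """Return a copy of config with non-None overrides applied.
+
+    Override keys use dotted paths such as "data.max_samples".
+    """
+    merged = copy.deepcopy(config)
+    for dotted_key, value in overrides.items():
+        if value is None:
+            continue  # flag not given on the command line: keep the YAML value
+        section = merged
+        *parents, leaf = dotted_key.split(".")
+        for name in parents:
+            section = section.setdefault(name, {})
+        section[leaf] = value
+    return merged
+
+
+yaml_config = {"data": {"source": "huggingface", "max_samples": 1000}}
+cli_overrides = {"data.max_samples": 500, "data.dataset_name": None}
+effective = apply_overrides(yaml_config, cli_overrides)
+print(effective)
+```
+
+Working on a deep copy keeps the loaded YAML untouched, which makes it easier to log both the file values and the effective values.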
+
+### Training Examples
+
+```bash
+# 1. Use YAML config only
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+
+# 2. Override YAML values
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5
+
+# 3. Use CLI only
+python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3
+
+# 4. Run examples
+python scripts/classification/trainer.py examples
+```
+
+### Inference Examples
+
+```bash
+# 1. Single text prediction
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
+
+# 2. File-based prediction
+python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
+
+# 3. Interactive mode
+python scripts/classification/inference.py --config configs/classification/emotion.yaml
+
+# 4. Run examples
+python scripts/classification/inference.py examples
+```
+
+### Styling Examples
+
+```bash
+# 1. Data Processing
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
+
+# 2. Training
+python scripts/styling/train.py example
+python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
+
+# 3. Inference
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
+python scripts/styling/inference.py batch
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml
+
+# 4. Run examples
+python scripts/styling/data_processor.py examples
+python scripts/styling/train.py features
+python scripts/styling/inference.py features
+```
+
+## Troubleshooting Common Errors
+
+### 1. ModuleNotFoundError: No module named 'utils'
+
+**Error:**
+```
+ModuleNotFoundError: No module named 'utils'
+```
+
+**Solution:**
+```bash
+# Set Python path before running scripts
+export PYTHONPATH=.
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+```
+
+### 2. Model Path Not Found
+
+**Error:**
+```
+Model path not found: ./results/classification/emotion_model
+```
+
+**Solution:**
+```bash
+# Train the model first
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+
+# Then run inference
+python scripts/classification/inference.py --config configs/classification/emotion.yaml
+```
+
+### 3. Data Directory Not Found
+
+**Error:**
+```
+Data directory not found: ./data/processed/classification/emotion
+```
+
+**Solution:**
+```bash
+# Process data first
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+
+# Then train
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml
+```
+
+### 4. YAML Configuration Errors
+
+**Error:**
+```
+data_processor.py: error: --data-source is required (either in YAML config or CLI)
+```
+
+**Solution:**
+Check your YAML file structure. It should have:
+```yaml
+data:
+  source: "huggingface"  # Not data_source
+  dataset_name: "dair-ai/emotion"
+```
+
+### 5. HuggingFace Download Issues
+
+**Error:**
+```
+KeyboardInterrupt during model download
+```
+
+**Solution:**
+```bash
+# Use smaller dataset for testing
+python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100
+
+# Or point HF_HOME at a local cache directory so completed downloads are reused
+export HF_HOME=./cache
+```
+
+### 6. CUDA/GPU Issues
+
+**Error:**
+```
+RuntimeError: CUDA out of memory
+```
+
+**Solution:**
+```bash
+# Reduce batch size
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8
+
+# Or use CPU
+python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
+```
+
+## Monitoring and Logs
+
+### Check Processing Status
+
+```bash
+# Check data processing output
+ls -la ./data/processed/classification/emotion/classification/
+
+# Check training output
+ls -la ./results/classification/emotion_model/
+
+# Check logs
+tail -f logs/training.log
+```
+
+### Expected File Structure After Processing
+
+```
+./data/processed/classification/emotion/classification/
+├── train.jsonl       # Training data
+├── validation.jsonl   # Validation data
+└── test.jsonl        # Test data
+
+./results/classification/emotion_model/
+├── config.json       # Model configuration
+├── pytorch_model.bin # Model weights (or model.safetensors with newer transformers)
+├── tokenizer.json    # Tokenizer
+└── label_info.json   # Label mappings
+```
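+
+A quick way to confirm that the split files listed above exist before training is sketched below. The `summarize_splits` helper is hypothetical and only checks the three JSONL files; it is not part of the repository.
+
+```python
+from pathlib import Path
+
+EXPECTED_SPLITS = ("train.jsonl", "validation.jsonl", "test.jsonl")
+
+
+def summarize_splits(data_dir, expected=EXPECTED_SPLITS):
+    """Return {filename: line count}, with None for files that are missing."""
+    summary = {}
+    for name in expected:
+        path = Path(data_dir) / name
+        if path.is_file():
+            with path.open(encoding="utf-8") as handle:
+                # Count non-empty lines; each one should be a JSON record.
+                summary[name] = sum(1 for line in handle if line.strip())
+        else:
+            summary[name] = None
+    return summary
+
+
+if __name__ == "__main__":
+    report = summarize_splits("./data/processed/classification/emotion/classification")
+    for name, count in report.items():
+        print(f"{name}: {'MISSING' if count is None else count}")
+```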
+
+## Workflow Summary
+
+### Classification Task
+1. **Setup**: Install dependencies and set PYTHONPATH
+2. **Data Processing**: Process raw data into organized splits
+3. **Training**: Train model using processed data
+4. **Inference**: Use trained model for predictions
+5. **Monitoring**: Check logs and outputs for errors
+
+### Styling Task
+1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
+2. **Data Processing**: Process style transfer data with instruction/input/output format
+3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
+4. **Inference**: Style transfer with streaming and batch processing
+5. **Monitoring**: Check training logs and model outputs
+
+## Creating Custom Configurations
+
+### For New Datasets
+
+1. Copy existing config:
+```bash
+cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
+```
+
+2. Modify parameters:
+```yaml
+data:
+  source: "huggingface"
+  dataset_name: "your-dataset-name"
+  output_dir: "./data/processed/classification/my_dataset"
+  # ... other parameters
+
+training:
+  data_dir: "./data/processed/classification/my_dataset"
+  output_dir: "./results/classification/my_dataset_model"
+```
+
+3. Run pipeline:
+```bash
+python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
+```
+
+### For Custom Data
+
+1. Set the `data` section of `configs/classification/custom.yaml`:
+```yaml
+data:
+  source: "custom"
+  data_path: "./data/raw/my_data.jsonl"
+  output_dir: "./data/processed/classification/my_custom_dataset"
+```
+
+2. Run processing:
+```bash
+python scripts/classification/data_processor.py --config configs/classification/custom.yaml
+```
+
+## Best Practices
+
+1. **Always check output directories** before running the next step
+2. **Use small datasets for testing** before full runs
+3. **Monitor logs** for errors and warnings
+4. **Backup configurations** before major changes
+5. **Use version control** for YAML files
+6. **Test with CLI overrides** for quick experiments
+
+## Support
+
+For issues and questions:
+1. Check the troubleshooting section above
+2. Review logs in the output directories
+3. Verify YAML configuration structure
+4. Test with smaller datasets first
+
+---
+
+**Happy fine-tuning!**