updated styling pipeline

2025-08-13 21:30:45 +01:00
parent 710d074b47
commit 8847035d12
1 changed file with 203 additions and 22 deletions
@@ -13,55 +13,62 @@ This framework supports multiple NLP tasks with organized configurations:

 ### Current Implementation Status

-- **Classification**: Fully implemented with emotion classification example
+- **Classification**: ✅ Fully implemented with emotion classification example
+- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
 - **Completion**: Planned for future updates
-- **Styling**: Planned for future updates
 - **Matching**: Planned for future updates

-**Note**: Currently only classification task is supported. Other tasks (completion, styling, matching) are planned for future updates.
+**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.

 ## Project Structure

 ```
 fine-tune-task/
 ├── configs/                    # YAML configuration files
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── emotion.yaml       # Emotion classification
 │   │   └── custom.yaml        # Custom dataset
+│   ├── styling/               # ✅ Implemented
+│   │   └── formal.yaml        # Formal style transfer
 │   ├── completion/             # Planned for future updates
-│   ├── styling/               # Planned for future updates
 │   └── matching/              # Planned for future updates
 ├── data/                       # Data directories
 │   ├── raw/                    # Raw input data
-│   │   ├── classification/     # Implemented
+│   │   ├── classification/     # ✅ Implemented
+│   │   ├── styling/           # ✅ Implemented
 │   │   ├── completion/         # Planned for future updates
-│   │   ├── styling/           # Planned for future updates
 │   │   └── matching/          # Planned for future updates
 │   └── processed/              # Processed data
-│       ├── classification/     # Implemented
+│       ├── classification/     # ✅ Implemented
+│       ├── styling/           # ✅ Implemented
 │       ├── completion/         # Planned for future updates
-│       ├── styling/           # Planned for future updates
 │       └── matching/          # Planned for future updates
 ├── pipelines/                  # Core pipeline scripts
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── data_processor.py  # Data processing
 │   │   ├── train.py          # Training
 │   │   └── inference.py      # Inference
+│   ├── styling/               # ✅ Implemented
+│   │   ├── data_processor.py  # Style data processing
+│   │   ├── train.py          # LoRA fine-tuning
+│   │   └── inference.py      # Style transfer inference
 │   ├── completion/            # Planned for future updates
-│   ├── styling/              # Planned for future updates
 │   └── matching/             # Planned for future updates
 ├── scripts/                    # User-friendly scripts
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── data_processor.py  # Data processing script
 │   │   ├── trainer.py        # Training script
 │   │   └── inference.py      # Inference script
+│   ├── styling/               # ✅ Implemented
+│   │   ├── data_processor.py  # Style data processing script
+│   │   ├── train.py          # Training script
+│   │   └── inference.py      # Inference script
 │   ├── completion/            # Planned for future updates
-│   ├── styling/              # Planned for future updates
 │   └── matching/             # Planned for future updates
 ├── results/                    # Model outputs
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
+│   ├── styling/              # ✅ Implemented
 │   ├── completion/            # Planned for future updates
-│   ├── styling/              # Planned for future updates
 │   └── matching/             # Planned for future updates
 └── utils/                      # Shared utility modules
 ```
@@ -146,12 +153,117 @@ Inference completed successfully!
    - surprise: 0.0224
 ```

+## Quick Start (Styling Task)
+
+### 1. Setup Environment
+
+```bash
+# Install dependencies (including unsloth for styling)
+pip install -r requirements.txt
+
+# Set Python path
+export PYTHONPATH=.
+```
+
+### 2. Data Processing
+
+```bash
+# Process style transfer dataset
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml
+
+# Create HuggingFace dataset
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
+
+# Check output location
+ls -la ./data/processed/styling/formal/
+```
+
+**Expected Output:**
+```
+Styling data processing completed successfully!
+  Data source: custom
+  Data file: ./data/raw/styling/sample_formal.jsonl
+  Total samples: 5
+  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
+  Output directory: ./data/processed/styling/formal
+  Style instruction: Rewrite the following text in a formal style
+```
+
+### 3. Model Training
+
+```bash
+# Train using processed data (automatically loads from YAML output_dir)
+python scripts/styling/train.py example
+
+# Custom training
+python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
+
+# Check model output
+ls -la ./models/styling/
+```
+
+**Expected Output:**
+```
+Training completed successfully!
+  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
+  Dataset: Loaded from ./data/processed/styling/formal
+  Training for 3 epochs with batch size 4
+  Model saved to: ./models/styling
+```
+
+### 4. Model Inference
+
+```bash
+# Single text style transfer
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
+
+# Batch processing
+python scripts/styling/inference.py batch
+
+# Interactive mode
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml
+```
+
+**Expected Output:**
+```
+Inference completed successfully!
+  Input: Hey, what's up?
+  Output: Hello, how are you doing?
+  Style: Formal
+```
+
 ## Adding New Tasks

 To add a new task (e.g., completion, styling, matching), follow these steps:

-### Step 1: Create Task Directory Structure
+### Example: Styling Task (Already Implemented)

+The styling task demonstrates a complete implementation:
+
+1. **Task Directory Structure** ✅
+```bash
+configs/styling/           # YAML configurations
+data/raw/styling/         # Raw style transfer data
+data/processed/styling/   # Processed data
+pipelines/styling/        # Core pipeline scripts
+scripts/styling/          # User-friendly scripts
+models/styling/           # Trained models
+```
+
+2. **Pipeline Components** ✅
+- **Data Processor**: Handles style transfer datasets with instruction/input/output format
+- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
+- **Inference**: Style transfer with streaming and batch processing
+
+3. **Key Features** ✅
+- Automatic EOS token handling: `text + tokenizer.eos_token`
+- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
+- YAML integration: Uses `data.output_dir` for automatic dataset loading
+- HuggingFace dataset export and loading
+
+### For Other Tasks (completion, matching)
+
+1. **Create Task Directory Structure**
 ```bash
 # Create task directories
 mkdir -p configs/completion
@@ -163,7 +275,7 @@ mkdir -p tasks/completion
 mkdir -p models/completion
 ```

-### Step 2: Create Task Configuration
+2. **Create Task Configuration**

 ```bash
 # Create YAML configuration for new task
@@ -205,7 +317,7 @@ inference:
 EOF
 ```

-### Step 3: Create Pipeline Scripts
+3. **Create Pipeline Scripts**

 Copy and modify the classification pipeline scripts:

@@ -221,7 +333,7 @@ cp scripts/classification/trainer.py scripts/completion/
 cp scripts/classification/inference.py scripts/completion/
 ```

-### Step 4: Modify Pipeline Code
+4. **Modify Pipeline Code**

 Update the pipeline scripts for your specific task:

@@ -240,7 +352,7 @@ Update the pipeline scripts for your specific task:
   - Add generation parameters (temperature, top-k, etc.)
   - Modify output format

-### Step 5: Update Task Scripts
+5. **Update Task Scripts**

 Modify the task scripts to use your new pipeline:

@@ -254,7 +366,7 @@ def run_with_yaml_config(config_path: str, **cli_overrides):
    # ... rest of the function
 ```

-### Step 6: Create Task-Specific Models
+6. **Create Task-Specific Models**

 ```bash
 # Create model directory
@@ -275,7 +387,7 @@ class TextGenerator:
 EOF
 ```

-### Step 7: Test Your New Task
+7. **Test Your New Task**

 ```bash
 # Test data processing
@@ -330,10 +442,49 @@ inference:
  return_top_k: 3                          # Top K predictions
 ```

+### Styling Configuration Example
+
+```yaml
+# Styling Task Configuration
+task:
+  name: "styling"
+  type: "style_transfer"
+
+# Data Processing Configuration
+data:
+  source: "custom"
+  data_path: "./data/raw/styling/sample_formal.jsonl"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite the following text in a formal style"
+  output_dir: "./data/processed/styling/formal"
+  output_format: "alpaca"
+
+# Model Configuration
+model:
+  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
+  training_max_seq_length: 2048
+  training_load_in_4bit: true
+
+# Training Configuration
+training:
+  num_epochs: 3
+  batch_size: 2
+  learning_rate: 2e-4
+  weight_decay: 0.01
+
+# Inference Configuration
+inference:
+  batch_size: 1
+  max_new_tokens: 128
+  temperature: 0.8
+```
+
 ### Available Configuration Files

 - `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
 - `configs/classification/custom.yaml` - Custom dataset processing
+- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning

 ## Usage Examples

@@ -385,6 +536,28 @@ python scripts/classification/inference.py --config configs/classification/emoti
 python scripts/classification/inference.py examples
 ```

+### Styling Examples
+
+```bash
+# 1. Data Processing
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
+
+# 2. Training
+python scripts/styling/train.py example
+python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
+
+# 3. Inference
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
+python scripts/styling/inference.py batch
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml
+
+# 4. Run examples
+python scripts/styling/data_processor.py examples
+python scripts/styling/train.py features
+python scripts/styling/inference.py features
+```
+
 ## Troubleshooting Common Errors

 ### 1. ModuleNotFoundError: No module named 'utils'
@@ -512,12 +685,20 @@ tail -f logs/training.log

 ## Workflow Summary

+### Classification Task
 1. **Setup**: Install dependencies and set PYTHONPATH
 2. **Data Processing**: Process raw data into organized splits
 3. **Training**: Train model using processed data
 4. **Inference**: Use trained model for predictions
 5. **Monitoring**: Check logs and outputs for errors

+### Styling Task
+1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
+2. **Data Processing**: Process style transfer data with instruction/input/output format
+3. **Training**: Fine-tune the model with LoRA using Unsloth
+4. **Inference**: Run style transfer in streaming or batch mode
+5. **Monitoring**: Check training logs and model outputs
+
 ## Creating Custom Configurations

 ### For New Datasets