updated styling pipeline

2025-08-13 21:30:45 +01:00
parent 710d074b47
commit 8847035d12
1 changed file with 203 additions and 22 deletions
@@ -13,55 +13,62 @@ This framework supports multiple NLP tasks with organized configurations:
 ### Current Implementation Status
-- **Classification**: Fully implemented with emotion classification example
+- **Classification**: ✅ Fully implemented with emotion classification example
 - **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
 - **Completion**: Planned for future updates
-- **Styling**: Planned for future updates
 - **Matching**: Planned for future updates
-**Note**: Currently only classification task is supported. Other tasks (completion, styling, matching) are planned for future updates.
+**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
 ## Project Structure
 ```
 fine-tune-task/
 ├── configs/                    # YAML configuration files
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── emotion.yaml       # Emotion classification
 │   │   └── custom.yaml        # Custom dataset
 │   ├── styling/               # ✅ Implemented
 │   │   └── formal.yaml        # Formal style transfer
 │   ├── completion/             # Planned for future updates
-│   ├── styling/               # Planned for future updates
 │   └── matching/              # Planned for future updates
 ├── data/                       # Data directories
 │   ├── raw/                    # Raw input data
-│   │   ├── classification/     # Implemented
+│   │   ├── classification/     # ✅ Implemented
 │   │   ├── styling/           # ✅ Implemented
 │   │   ├── completion/         # Planned for future updates
-│   │   ├── styling/           # Planned for future updates
 │   │   └── matching/          # Planned for future updates
 │   └── processed/              # Processed data
-│       ├── classification/     # Implemented
+│       ├── classification/     # ✅ Implemented
 │       ├── styling/           # ✅ Implemented
 │       ├── completion/         # Planned for future updates
-│       ├── styling/           # Planned for future updates
 │       └── matching/          # Planned for future updates
 ├── pipelines/                  # Core pipeline scripts
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── data_processor.py  # Data processing
 │   │   ├── train.py          # Training
 │   │   └── inference.py      # Inference
 │   ├── styling/               # ✅ Implemented
 │   │   ├── data_processor.py  # Style data processing
 │   │   ├── train.py          # LoRA fine-tuning
 │   │   └── inference.py      # Style transfer inference
 │   ├── completion/            # Planned for future updates
-│   ├── styling/              # Planned for future updates
 │   └── matching/             # Planned for future updates
 ├── scripts/                    # User-friendly scripts
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── data_processor.py  # Data processing script
 │   │   ├── trainer.py        # Training script
 │   │   └── inference.py      # Inference script
 │   ├── styling/               # ✅ Implemented
 │   │   ├── data_processor.py  # Style data processing script
 │   │   ├── train.py          # Training script
 │   │   └── inference.py      # Inference script
 │   ├── completion/            # Planned for future updates
-│   ├── styling/              # Planned for future updates
 │   └── matching/             # Planned for future updates
 ├── results/                    # Model outputs
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   ├── styling/              # ✅ Implemented
 │   ├── completion/            # Planned for future updates
-│   ├── styling/              # Planned for future updates
 │   └── matching/             # Planned for future updates
 └── utils/                      # Shared utility modules
 ```
@@ -146,12 +153,117 @@ Inference completed successfully!
    - surprise: 0.0224
 ```
 ## Quick Start (Styling Task)
 ### 1. Setup Environment
 ```bash
 # Install dependencies (including unsloth for styling)
 pip install -r requirements.txt
 # Set Python path
 export PYTHONPATH=.
 ```
 ### 2. Data Processing
 ```bash
 # Process style transfer dataset
 python scripts/styling/data_processor.py --config configs/styling/formal.yaml
 # Create HuggingFace dataset
 python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
 # Check output location
 ls -la ./data/processed/styling/formal/
 ```
 **Expected Output:**
 ```
 Styling data processing completed successfully!
  Data source: custom
  Data file: ./data/raw/styling/sample_formal.jsonl
  Total samples: 5
  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
  Output directory: ./data/processed/styling/formal
  Style instruction: Rewrite the following text in a formal style
 ```
 ### 3. Model Training
 ```bash
 # Train using processed data (automatically loads from YAML output_dir)
 python scripts/styling/train.py example
 # Custom training
 python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
 # Check model output
 ls -la ./models/styling/
 ```
 **Expected Output:**
 ```
 Training completed successfully!
  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
  Dataset: Loaded from ./data/processed/styling/formal
  Training for 3 epochs with batch size 4
  Model saved to: ./models/styling
 ```
 ### 4. Model Inference
 ```bash
 # Single text style transfer
 python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
 # Batch processing
 python scripts/styling/inference.py batch
 # Interactive mode
 python scripts/styling/inference.py infer --config configs/styling/formal.yaml
 ```
 **Expected Output:**
 ```
 Inference completed successfully!
  Input: Hey, what's up?
  Output: Hello, how are you doing?
  Style: Formal
 ```
 ## Adding New Tasks
 To add a new task (e.g., completion, matching), follow these steps:
-### Step 1: Create Task Directory Structure
+### Example: Styling Task (Already Implemented)
 The styling task demonstrates a complete implementation:
 1. **Task Directory Structure** ✅
 ```bash
 configs/styling/           # YAML configurations
 data/raw/styling/         # Raw style transfer data
 data/processed/styling/   # Processed data
 pipelines/styling/        # Core pipeline scripts
 scripts/styling/          # User-friendly scripts
 models/styling/           # Trained models
 ```
 2. **Pipeline Components** ✅
 - **Data Processor**: Handles style transfer datasets with instruction/input/output format
 - **Trainer**: LoRA fine-tuning using Unsloth for efficiency
 - **Inference**: Style transfer with streaming and batch processing
 3. **Key Features** ✅
 - Automatic EOS token handling: `text + tokenizer.eos_token`
 - Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
 - YAML integration: Uses `data.output_dir` for automatic dataset loading
 - HuggingFace dataset export and loading
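
The EOS handling and batched mapping listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: the Alpaca-style template text, the default `</s>` token, and the body of `formatting_prompts_func` are all hypothetical. In the real pipeline the token would come from `tokenizer.eos_token`.

```python
# Hypothetical Alpaca-style template; the repository's actual prompt may differ.
ALPACA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def formatting_prompts_func(examples, eos_token="</s>"):
    """Format a batch (a dict of column lists) into prompts that end in EOS."""
    texts = []
    for instruction, inp, out in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        text = ALPACA_TEMPLATE.format(instruction=instruction, input=inp, output=out)
        # The EOS token marks where the target response ends.
        texts.append(text + eos_token)
    return {"text": texts}


# A one-row batch in the instruction/input/output format.
batch = {
    "instruction": ["Rewrite the following text in a formal style"],
    "input": ["Hey, what's up?"],
    "output": ["Hello, how are you doing?"],
}
formatted = formatting_prompts_func(batch)
print(formatted["text"][0])
```

With a Hugging Face `datasets.Dataset`, `dataset.map(formatting_prompts_func, batched=True)` would call a function of this shape once per batch, passing the batch as a dict of lists.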
 ### For Other Tasks (completion, matching)
 1. **Create Task Directory Structure**
 ```bash
 # Create task directories
 mkdir -p configs/completion
@@ -163,7 +275,7 @@ mkdir -p tasks/completion
 mkdir -p models/completion
 ```
-### Step 2: Create Task Configuration
+2. **Create Task Configuration**
 ```bash
 # Create YAML configuration for new task
@@ -205,7 +317,7 @@ inference:
 EOF
 ```
-### Step 3: Create Pipeline Scripts
+3. **Create Pipeline Scripts**
 Copy and modify the classification pipeline scripts:
@@ -221,7 +333,7 @@ cp scripts/classification/trainer.py scripts/completion/
 cp scripts/classification/inference.py scripts/completion/
 ```
-### Step 4: Modify Pipeline Code
+4. **Modify Pipeline Code**
 Update the pipeline scripts for your specific task:
@@ -240,7 +352,7 @@ Update the pipeline scripts for your specific task:
   - Add generation parameters (temperature, top-k, etc.)
   - Modify output format
-### Step 5: Update Task Scripts
+5. **Update Task Scripts**
 Modify the task scripts to use your new pipeline:
@@ -254,7 +366,7 @@ def run_with_yaml_config(config_path: str, **cli_overrides):
    # ... rest of the function
 ```
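
One way the `cli_overrides` passed to a function like `run_with_yaml_config` might be merged into the loaded YAML values is sketched below. The helper name `apply_cli_overrides` and the option-to-key mapping are assumptions made for illustration; the repository's actual merge logic is not shown in this diff.

```python
import copy

# Hypothetical mapping from CLI option names to (section, key) in the YAML config.
CLI_TO_CONFIG = {
    "epochs": ("training", "num_epochs"),
    "batch_size": ("training", "batch_size"),
    "learning_rate": ("training", "learning_rate"),
}


def apply_cli_overrides(config, **cli_overrides):
    """Return a copy of config with every non-None CLI value applied."""
    merged = copy.deepcopy(config)  # leave the loaded YAML dict untouched
    for name, value in cli_overrides.items():
        if value is None:  # option not given on the command line
            continue
        if name not in CLI_TO_CONFIG:
            raise KeyError(f"Unknown override: {name}")
        section, key = CLI_TO_CONFIG[name]
        merged.setdefault(section, {})[key] = value
    return merged


# Values as they would look after loading the `training` section of a YAML file.
yaml_config = {"training": {"num_epochs": 3, "batch_size": 2, "learning_rate": 2e-4}}
effective = apply_cli_overrides(yaml_config, epochs=2, batch_size=4, learning_rate=None)
print(effective["training"])
```

Treating `None` as "not provided" keeps the YAML file as the source of defaults, so only options that were explicitly passed change the run.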
-### Step 6: Create Task-Specific Models
+6. **Create Task-Specific Models**
 ```bash
 # Create model directory
@@ -275,7 +387,7 @@ class TextGenerator:
 EOF
 ```
-### Step 7: Test Your New Task
+7. **Test Your New Task**
 ```bash
 # Test data processing
@@ -330,10 +442,49 @@ inference:
  return_top_k: 3                          # Top K predictions
 ```
 ### Styling Configuration Example
 ```yaml
 # Styling Task Configuration
 task:
  name: "styling"
  type: "style_transfer"
 # Data Processing Configuration
 data:
  source: "custom"
  data_path: "./data/raw/styling/sample_formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"
  output_dir: "./data/processed/styling/formal"
  output_format: "alpaca"
 # Model Configuration
 model:
  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  training_max_seq_length: 2048
  training_load_in_4bit: true
 # Training Configuration
 training:
  num_epochs: 3
  batch_size: 2
  learning_rate: 2e-4
  weight_decay: 0.01
 # Inference Configuration
 inference:
  batch_size: 1
  max_new_tokens: 128
  temperature: 0.8
 ```
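
To make the `data` section concrete, the sketch below shows how `input_field`, `output_field`, and `instruction` could map one raw JSONL record to an Alpaca-style record. The helper `to_alpaca` and the sample record are hypothetical; the actual data processor may differ.

```python
import json


def to_alpaca(record, data_cfg):
    """Map one raw record to instruction/input/output using the `data` section."""
    return {
        "instruction": data_cfg["instruction"],
        "input": record[data_cfg["input_field"]],
        "output": record[data_cfg["output_field"]],
    }


# The relevant keys from the `data` section of the configuration above.
data_cfg = {
    "input_field": "text",
    "output_field": "styled_text",
    "instruction": "Rewrite the following text in a formal style",
}

# One JSONL line with illustrative content (not taken from the real sample file).
raw_line = '{"text": "gonna be late, sorry", "styled_text": "I apologize, but I will be arriving late."}'

alpaca_record = to_alpaca(json.loads(raw_line), data_cfg)
print(json.dumps(alpaca_record, indent=2))
```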
 ### Available Configuration Files
 - `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
 - `configs/classification/custom.yaml` - Custom dataset processing
 - `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning
 ## Usage Examples
@@ -385,6 +536,28 @@ python scripts/classification/inference.py --config configs/classification/emoti
 python scripts/classification/inference.py examples
 ```
 ### Styling Examples
 ```bash
 # 1. Data Processing
 python scripts/styling/data_processor.py --config configs/styling/formal.yaml
 python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
 # 2. Training
 python scripts/styling/train.py example
 python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
 # 3. Inference
 python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
 python scripts/styling/inference.py batch
 python scripts/styling/inference.py infer --config configs/styling/formal.yaml
 # 4. Run examples
 python scripts/styling/data_processor.py examples
 python scripts/styling/train.py features
 python scripts/styling/inference.py features
 ```
 ## Troubleshooting Common Errors
 ### 1. ModuleNotFoundError: No module named 'utils'
@@ -512,12 +685,20 @@ tail -f logs/training.log
 ## Workflow Summary
 ### Classification Task
 1. **Setup**: Install dependencies and set PYTHONPATH
 2. **Data Processing**: Process raw data into organized splits
 3. **Training**: Train model using processed data
 4. **Inference**: Use trained model for predictions
 5. **Monitoring**: Check logs and outputs for errors
 ### Styling Task
 1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
 2. **Data Processing**: Process style transfer data with instruction/input/output format
 3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
 4. **Inference**: Style transfer with streaming and batch processing
 5. **Monitoring**: Check training logs and model outputs
 ## Creating Custom Configurations
 ### For New Datasets