From 8847035d127bcbd6c536036df831dd92a5b68218 Mon Sep 17 00:00:00 2001
From: OwusuBlessing
Date: Wed, 13 Aug 2025 21:30:45 +0100
Subject: [PATCH] updated styling pipeline

---
 README.md | 225 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 203 insertions(+), 22 deletions(-)

diff --git a/README.md b/README.md
index f387422..f1ba946 100644
--- a/README.md
+++ b/README.md
@@ -13,55 +13,62 @@ This framework supports multiple NLP tasks with organized configurations:

 ### Current Implementation Status

-- **Classification**: Fully implemented with emotion classification example
+- **Classification**: ✅ Fully implemented with emotion classification example
+- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
 - **Completion**: Planned for future updates
-- **Styling**: Planned for future updates
 - **Matching**: Planned for future updates

-**Note**: Currently only classification task is supported. Other tasks (completion, styling, matching) are planned for future updates.
+**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.

 ## Project Structure

 ```
 fine-tune-task/
 ├── configs/                    # YAML configuration files
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── emotion.yaml        # Emotion classification
 │   │   └── custom.yaml         # Custom dataset
+│   ├── styling/                # ✅ Implemented
+│   │   └── formal.yaml         # Formal style transfer
 │   ├── completion/             # Planned for future updates
-│   ├── styling/                # Planned for future updates
 │   └── matching/               # Planned for future updates
 ├── data/                       # Data directories
 │   ├── raw/                    # Raw input data
-│   │   ├── classification/     # Implemented
+│   │   ├── classification/     # ✅ Implemented
+│   │   ├── styling/            # ✅ Implemented
 │   │   ├── completion/         # Planned for future updates
-│   │   ├── styling/            # Planned for future updates
 │   │   └── matching/           # Planned for future updates
 │   └── processed/              # Processed data
-│       ├── classification/     # Implemented
+│       ├── classification/     # ✅ Implemented
+│       ├── styling/            # ✅ Implemented
 │       ├── completion/         # Planned for future updates
-│       ├── styling/            # Planned for future updates
 │       └── matching/           # Planned for future updates
 ├── pipelines/                  # Core pipeline scripts
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── data_processor.py   # Data processing
 │   │   ├── train.py            # Training
 │   │   └── inference.py        # Inference
+│   ├── styling/                # ✅ Implemented
+│   │   ├── data_processor.py   # Style data processing
+│   │   ├── train.py            # LoRA fine-tuning
+│   │   └── inference.py        # Style transfer inference
 │   ├── completion/             # Planned for future updates
-│   ├── styling/                # Planned for future updates
 │   └── matching/               # Planned for future updates
 ├── scripts/                    # User-friendly scripts
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
 │   │   ├── data_processor.py   # Data processing script
 │   │   ├── trainer.py          # Training script
 │   │   └── inference.py        # Inference script
+│   ├── styling/                # ✅ Implemented
+│   │   ├── data_processor.py   # Style data processing script
+│   │   ├── train.py            # Training script
+│   │   └── inference.py        # Inference script
 │   ├── completion/             # Planned for future updates
-│   ├── styling/                # Planned for future updates
 │   └── matching/               # Planned for future updates
 ├── results/                    # Model outputs
-│   ├── classification/         # Implemented
+│   ├── classification/         # ✅ Implemented
+│   ├── styling/                # ✅ Implemented
 │   ├── completion/             # Planned for future updates
-│   ├── styling/                # Planned for future updates
 │   └── matching/               # Planned for future updates
 └── utils/                      # Shared utility modules
 ```
@@ -146,12 +153,117 @@ Inference completed successfully!
   - surprise: 0.0224
 ```

+## Quick Start (Styling Task)
+
+### 1. Setup Environment
+
+```bash
+# Install dependencies (including unsloth for styling)
+pip install -r requirements.txt
+
+# Set Python path
+export PYTHONPATH=.
+```
+
+### 2. Data Processing
+
+```bash
+# Process style transfer dataset
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml
+
+# Create HuggingFace dataset
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
+
+# Check output location
+ls -la ./data/processed/styling/formal/
+```
+
+**Expected Output:**
+```
+Styling data processing completed successfully!
+  Data source: custom
+  Data file: ./data/raw/styling/sample_formal.jsonl
+  Total samples: 5
+  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
+  Output directory: ./data/processed/styling/formal
+  Style instruction: Rewrite the following text in a formal style
+```
+
+### 3. Model Training
+
+```bash
+# Train using processed data (automatically loads from YAML output_dir)
+python scripts/styling/train.py example
+
+# Custom training
+python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
+
+# Check model output
+ls -la ./models/styling/
+```
+
+**Expected Output:**
+```
+Training completed successfully!
+  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
+  Dataset: Loaded from ./data/processed/styling/formal
+  Training for 3 epochs with batch size 4
+  Model saved to: ./models/styling
+```
+
+### 4. Model Inference
+
+```bash
+# Single text style transfer
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
+
+# Batch processing
+python scripts/styling/inference.py batch
+
+# Interactive mode
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml
+```
+
+**Expected Output:**
+```
+Inference completed successfully!
+  Input: Hey, what's up?
+  Output: Hello, how are you doing?
+  Style: Formal
+```
+
 ## Adding New Tasks

 To add a new task (e.g., completion, styling, matching), follow these steps:

-### Step 1: Create Task Directory Structure
+### Example: Styling Task (Already Implemented)
+The styling task demonstrates a complete implementation:
+
+1. **Task Directory Structure** ✅
+```bash
+configs/styling/           # YAML configurations
+data/raw/styling/          # Raw style transfer data
+data/processed/styling/    # Processed data
+pipelines/styling/         # Core pipeline scripts
+scripts/styling/           # User-friendly scripts
+models/styling/            # Trained models
+```
+
+2. **Pipeline Components** ✅
+- **Data Processor**: Handles style transfer datasets with instruction/input/output format
+- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
+- **Inference**: Style transfer with streaming and batch processing
+
+3. **Key Features** ✅
+- Automatic EOS token handling: `text + tokenizer.eos_token`
+- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
+- YAML integration: Uses `data.output_dir` for automatic dataset loading
+- HuggingFace dataset export and loading
+
+### For Other Tasks (completion, matching)
+
+1. **Create Task Directory Structure**

 ```bash
 # Create task directories
 mkdir -p configs/completion
@@ -163,7 +275,7 @@ mkdir -p tasks/completion
 mkdir -p models/completion
 ```

-### Step 2: Create Task Configuration
+2. **Create Task Configuration**

 ```bash
 # Create YAML configuration for new task
@@ -205,7 +317,7 @@ inference:
 EOF
 ```

-### Step 3: Create Pipeline Scripts
+3. **Create Pipeline Scripts**

 Copy and modify the classification pipeline scripts:

@@ -221,7 +333,7 @@ cp scripts/classification/trainer.py scripts/completion/
 cp scripts/classification/inference.py scripts/completion/
 ```

-### Step 4: Modify Pipeline Code
+4. **Modify Pipeline Code**

 Update the pipeline scripts for your specific task:

@@ -240,7 +352,7 @@ Update the pipeline scripts for your specific task:
 - Add generation parameters (temperature, top-k, etc.)
 - Modify output format

-### Step 5: Update Task Scripts
+5. **Update Task Scripts**

 Modify the task scripts to use your new pipeline:

@@ -254,7 +366,7 @@ def run_with_yaml_config(config_path: str, **cli_overrides):
     # ... rest of the function
 ```

-### Step 6: Create Task-Specific Models
+6. **Create Task-Specific Models**

 ```bash
 # Create model directory
@@ -275,7 +387,7 @@ class TextGenerator:
 EOF
 ```

-### Step 7: Test Your New Task
+7. **Test Your New Task**

 ```bash
 # Test data processing
@@ -330,10 +442,49 @@ inference:
   return_top_k: 3 # Top K predictions
 ```

+### Styling Configuration Example
+
+```yaml
+# Styling Task Configuration
+task:
+  name: "styling"
+  type: "style_transfer"
+
+# Data Processing Configuration
+data:
+  source: "custom"
+  data_path: "./data/raw/styling/sample_formal.jsonl"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite the following text in a formal style"
+  output_dir: "./data/processed/styling/formal"
+  output_format: "alpaca"
+
+# Model Configuration
+model:
+  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
+  training_max_seq_length: 2048
+  training_load_in_4bit: true
+
+# Training Configuration
+training:
+  num_epochs: 3
+  batch_size: 2
+  learning_rate: 2e-4
+  weight_decay: 0.01
+
+# Inference Configuration
+inference:
+  batch_size: 1
+  max_new_tokens: 128
+  temperature: 0.8
+```
+
 ### Available Configuration Files

 - `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
 - `configs/classification/custom.yaml` - Custom dataset processing
+- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning

 ## Usage Examples

@@ -385,6 +536,28 @@ python scripts/classification/inference.py --config configs/classification/emoti
 python scripts/classification/inference.py examples
 ```

+### Styling Examples
+
+```bash
+# 1. Data Processing
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml
+python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
+
+# 2. Training
+python scripts/styling/train.py example
+python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
+
+# 3. Inference
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
+python scripts/styling/inference.py batch
+python scripts/styling/inference.py infer --config configs/styling/formal.yaml
+
+# 4. Run examples
+python scripts/styling/data_processor.py examples
+python scripts/styling/train.py features
+python scripts/styling/inference.py features
+```
+
 ## Troubleshooting Common Errors

 ### 1. ModuleNotFoundError: No module named 'utils'
@@ -512,12 +685,20 @@ tail -f logs/training.log

 ## Workflow Summary

+### Classification Task
 1. **Setup**: Install dependencies and set PYTHONPATH
 2. **Data Processing**: Process raw data into organized splits
 3. **Training**: Train model using processed data
 4. **Inference**: Use trained model for predictions
 5. **Monitoring**: Check logs and outputs for errors

+### Styling Task
+1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
+2. **Data Processing**: Process style transfer data with instruction/input/output format
+3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
+4. **Inference**: Style transfer with streaming and batch processing
+5. **Monitoring**: Check training logs and model outputs
+
 ## Creating Custom Configurations

 ### For New Datasets