updated styling pipeline

This commit is contained in:
OwusuBlessing
2025-08-13 21:30:45 +01:00
parent 710d074b47
commit 8847035d12
+203 -22
View File
@@ -13,55 +13,62 @@ This framework supports multiple NLP tasks with organized configurations:
### Current Implementation Status ### Current Implementation Status
- **Classification**: Fully implemented with emotion classification example - **Classification**: Fully implemented with emotion classification example
- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
- **Completion**: Planned for future updates - **Completion**: Planned for future updates
- **Styling**: Planned for future updates
- **Matching**: Planned for future updates - **Matching**: Planned for future updates
**Note**: Currently only classification task is supported. Other tasks (completion, styling, matching) are planned for future updates. **Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
## Project Structure ## Project Structure
``` ```
fine-tune-task/ fine-tune-task/
├── configs/ # YAML configuration files ├── configs/ # YAML configuration files
│ ├── classification/ # Implemented │ ├── classification/ # Implemented
│ │ ├── emotion.yaml # Emotion classification │ │ ├── emotion.yaml # Emotion classification
│ │ └── custom.yaml # Custom dataset │ │ └── custom.yaml # Custom dataset
│ ├── styling/ # ✅ Implemented
│ │ └── formal.yaml # Formal style transfer
│ ├── completion/ # Planned for future updates │ ├── completion/ # Planned for future updates
│ ├── styling/ # Planned for future updates
│ └── matching/ # Planned for future updates │ └── matching/ # Planned for future updates
├── data/ # Data directories ├── data/ # Data directories
│ ├── raw/ # Raw input data │ ├── raw/ # Raw input data
│ │ ├── classification/ # Implemented │ │ ├── classification/ # Implemented
│ │ ├── styling/ # ✅ Implemented
│ │ ├── completion/ # Planned for future updates │ │ ├── completion/ # Planned for future updates
│ │ ├── styling/ # Planned for future updates
│ │ └── matching/ # Planned for future updates │ │ └── matching/ # Planned for future updates
│ └── processed/ # Processed data │ └── processed/ # Processed data
│ ├── classification/ # Implemented │ ├── classification/ # Implemented
│ ├── styling/ # ✅ Implemented
│ ├── completion/ # Planned for future updates │ ├── completion/ # Planned for future updates
│ ├── styling/ # Planned for future updates
│ └── matching/ # Planned for future updates │ └── matching/ # Planned for future updates
├── pipelines/ # Core pipeline scripts ├── pipelines/ # Core pipeline scripts
│ ├── classification/ # Implemented │ ├── classification/ # Implemented
│ │ ├── data_processor.py # Data processing │ │ ├── data_processor.py # Data processing
│ │ ├── train.py # Training │ │ ├── train.py # Training
│ │ └── inference.py # Inference │ │ └── inference.py # Inference
│ ├── styling/ # ✅ Implemented
│ │ ├── data_processor.py # Style data processing
│ │ ├── train.py # LoRA fine-tuning
│ │ └── inference.py # Style transfer inference
│ ├── completion/ # Planned for future updates │ ├── completion/ # Planned for future updates
│ ├── styling/ # Planned for future updates
│ └── matching/ # Planned for future updates │ └── matching/ # Planned for future updates
├── scripts/ # User-friendly scripts ├── scripts/ # User-friendly scripts
│ ├── classification/ # Implemented │ ├── classification/ # Implemented
│ │ ├── data_processor.py # Data processing script │ │ ├── data_processor.py # Data processing script
│ │ ├── trainer.py # Training script │ │ ├── trainer.py # Training script
│ │ └── inference.py # Inference script │ │ └── inference.py # Inference script
│ ├── styling/ # ✅ Implemented
│ │ ├── data_processor.py # Style data processing script
│ │ ├── train.py # Training script
│ │ └── inference.py # Inference script
│ ├── completion/ # Planned for future updates │ ├── completion/ # Planned for future updates
│ ├── styling/ # Planned for future updates
│ └── matching/ # Planned for future updates │ └── matching/ # Planned for future updates
├── results/ # Model outputs ├── results/ # Model outputs
│ ├── classification/ # Implemented │ ├── classification/ # Implemented
│ ├── styling/ # ✅ Implemented
│ ├── completion/ # Planned for future updates │ ├── completion/ # Planned for future updates
│ ├── styling/ # Planned for future updates
│ └── matching/ # Planned for future updates │ └── matching/ # Planned for future updates
└── utils/ # Shared utility modules └── utils/ # Shared utility modules
``` ```
@@ -146,12 +153,117 @@ Inference completed successfully!
- surprise: 0.0224 - surprise: 0.0224
``` ```
## Quick Start (Styling Task)
### 1. Setup Environment
```bash
# Install dependencies (including unsloth for styling)
pip install -r requirements.txt
# Set Python path
export PYTHONPATH=.
```
### 2. Data Processing
```bash
# Process style transfer dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
# Create HuggingFace dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
# Check output location
ls -la ./data/processed/styling/formal/
```
**Expected Output:**
```
Styling data processing completed successfully!
Data source: custom
Data file: ./data/raw/styling/sample_formal.jsonl
Total samples: 5
Split sizes: {'train': 3, 'validation': 1, 'test': 1}
Output directory: ./data/processed/styling/formal
Style instruction: Rewrite the following text in a formal style
```
### 3. Model Training
```bash
# Train using processed data (automatically loads from YAML output_dir)
python scripts/styling/train.py example
# Custom training
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
# Check model output
ls -la ./models/styling/
```
**Expected Output:**
```
Training completed successfully!
Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
Dataset: Loaded from ./data/processed/styling/formal
Training for 3 epochs with batch size 4
Model saved to: ./models/styling
```
### 4. Model Inference
```bash
# Single text style transfer
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
# Batch processing
python scripts/styling/inference.py batch
# Interactive mode
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
```
**Expected Output:**
```
Inference completed successfully!
Input: Hey, what's up?
Output: Hello, how are you doing?
Style: Formal
```
## Adding New Tasks ## Adding New Tasks
To add a new task (e.g., completion, styling, matching), follow these steps: To add a new task (e.g., completion, styling, matching), follow these steps:
### Step 1: Create Task Directory Structure ### Example: Styling Task (Already Implemented)
The styling task demonstrates a complete implementation:
1. **Task Directory Structure**
```bash
configs/styling/ # YAML configurations
data/raw/styling/ # Raw style transfer data
data/processed/styling/ # Processed data
pipelines/styling/ # Core pipeline scripts
scripts/styling/ # User-friendly scripts
models/styling/ # Trained models
```
2. **Pipeline Components**
- **Data Processor**: Handles style transfer datasets with instruction/input/output format
- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
- **Inference**: Style transfer with streaming and batch processing
3. **Key Features**
- Automatic EOS token handling: `text + tokenizer.eos_token`
- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
- YAML integration: Uses `data.output_dir` for automatic dataset loading
- HuggingFace dataset export and loading
### For Other Tasks (completion, matching)
1. **Create Task Directory Structure**
```bash ```bash
# Create task directories # Create task directories
mkdir -p configs/completion mkdir -p configs/completion
@@ -163,7 +275,7 @@ mkdir -p tasks/completion
mkdir -p models/completion mkdir -p models/completion
``` ```
### Step 2: Create Task Configuration 2. **Create Task Configuration**
```bash ```bash
# Create YAML configuration for new task # Create YAML configuration for new task
@@ -205,7 +317,7 @@ inference:
EOF EOF
``` ```
### Step 3: Create Pipeline Scripts 3. **Create Pipeline Scripts**
Copy and modify the classification pipeline scripts: Copy and modify the classification pipeline scripts:
@@ -221,7 +333,7 @@ cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/ cp scripts/classification/inference.py scripts/completion/
``` ```
### Step 4: Modify Pipeline Code 4. **Modify Pipeline Code**
Update the pipeline scripts for your specific task: Update the pipeline scripts for your specific task:
@@ -240,7 +352,7 @@ Update the pipeline scripts for your specific task:
- Add generation parameters (temperature, top-k, etc.) - Add generation parameters (temperature, top-k, etc.)
- Modify output format - Modify output format
### Step 5: Update Task Scripts 5. **Update Task Scripts**
Modify the task scripts to use your new pipeline: Modify the task scripts to use your new pipeline:
@@ -254,7 +366,7 @@ def run_with_yaml_config(config_path: str, **cli_overrides):
# ... rest of the function # ... rest of the function
``` ```
### Step 6: Create Task-Specific Models 6. **Create Task-Specific Models**
```bash ```bash
# Create model directory # Create model directory
@@ -275,7 +387,7 @@ class TextGenerator:
EOF EOF
``` ```
### Step 7: Test Your New Task 7. **Test Your New Task**
```bash ```bash
# Test data processing # Test data processing
@@ -330,10 +442,49 @@ inference:
return_top_k: 3 # Top K predictions return_top_k: 3 # Top K predictions
``` ```
### Styling Configuration Example
```yaml
# Styling Task Configuration
task:
name: "styling"
type: "style_transfer"
# Data Processing Configuration
data:
source: "custom"
data_path: "./data/raw/styling/sample_formal.jsonl"
input_field: "text"
output_field: "styled_text"
instruction: "Rewrite the following text in a formal style"
output_dir: "./data/processed/styling/formal"
output_format: "alpaca"
# Model Configuration
model:
training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
training_max_seq_length: 2048
training_load_in_4bit: true
# Training Configuration
training:
num_epochs: 3
batch_size: 2
learning_rate: 2e-4
weight_decay: 0.01
# Inference Configuration
inference:
batch_size: 1
max_new_tokens: 128
temperature: 0.8
```
### Available Configuration Files ### Available Configuration Files
- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset - `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
- `configs/classification/custom.yaml` - Custom dataset processing - `configs/classification/custom.yaml` - Custom dataset processing
- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning
## Usage Examples ## Usage Examples
@@ -385,6 +536,28 @@ python scripts/classification/inference.py --config configs/classification/emoti
python scripts/classification/inference.py examples python scripts/classification/inference.py examples
``` ```
### Styling Examples
```bash
# 1. Data Processing
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
# 2. Training
python scripts/styling/train.py example
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
# 3. Inference
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
python scripts/styling/inference.py batch
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
# 4. Run examples
python scripts/styling/data_processor.py examples
python scripts/styling/train.py features
python scripts/styling/inference.py features
```
## Troubleshooting Common Errors ## Troubleshooting Common Errors
### 1. ModuleNotFoundError: No module named 'utils' ### 1. ModuleNotFoundError: No module named 'utils'
@@ -512,12 +685,20 @@ tail -f logs/training.log
## Workflow Summary ## Workflow Summary
### Classification Task
1. **Setup**: Install dependencies and set PYTHONPATH 1. **Setup**: Install dependencies and set PYTHONPATH
2. **Data Processing**: Process raw data into organized splits 2. **Data Processing**: Process raw data into organized splits
3. **Training**: Train model using processed data 3. **Training**: Train model using processed data
4. **Inference**: Use trained model for predictions 4. **Inference**: Use trained model for predictions
5. **Monitoring**: Check logs and outputs for errors 5. **Monitoring**: Check logs and outputs for errors
### Styling Task
1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
2. **Data Processing**: Process style transfer data with instruction/input/output format
3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
4. **Inference**: Style transfer with streaming and batch processing
5. **Monitoring**: Check training logs and model outputs
## Creating Custom Configurations ## Creating Custom Configurations
### For New Datasets ### For New Datasets