updated styling pipeline
This commit is contained in:
@@ -13,55 +13,62 @@ This framework supports multiple NLP tasks with organized configurations:
|
||||
|
||||
### Current Implementation Status
|
||||
|
||||
- **Classification**: Fully implemented with emotion classification example
|
||||
- **Classification**: ✅ Fully implemented with emotion classification example
|
||||
- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
|
||||
- **Completion**: Planned for future updates
|
||||
- **Styling**: Planned for future updates
|
||||
- **Matching**: Planned for future updates
|
||||
|
||||
**Note**: Currently only classification task is supported. Other tasks (completion, styling, matching) are planned for future updates.
|
||||
**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
fine-tune-task/
|
||||
├── configs/ # YAML configuration files
|
||||
│ ├── classification/ # Implemented
|
||||
│ ├── classification/ # ✅ Implemented
|
||||
│ │ ├── emotion.yaml # Emotion classification
|
||||
│ │ └── custom.yaml # Custom dataset
|
||||
│ ├── styling/ # ✅ Implemented
|
||||
│ │ └── formal.yaml # Formal style transfer
|
||||
│ ├── completion/ # Planned for future updates
|
||||
│ ├── styling/ # Planned for future updates
|
||||
│ └── matching/ # Planned for future updates
|
||||
├── data/ # Data directories
|
||||
│ ├── raw/ # Raw input data
|
||||
│ │ ├── classification/ # Implemented
|
||||
│ │ ├── classification/ # ✅ Implemented
|
||||
│ │ ├── styling/ # ✅ Implemented
|
||||
│ │ ├── completion/ # Planned for future updates
|
||||
│ │ ├── styling/ # Planned for future updates
|
||||
│ │ └── matching/ # Planned for future updates
|
||||
│ └── processed/ # Processed data
|
||||
│ ├── classification/ # Implemented
|
||||
│ ├── classification/ # ✅ Implemented
|
||||
│ ├── styling/ # ✅ Implemented
|
||||
│ ├── completion/ # Planned for future updates
|
||||
│ ├── styling/ # Planned for future updates
|
||||
│ └── matching/ # Planned for future updates
|
||||
├── pipelines/ # Core pipeline scripts
|
||||
│ ├── classification/ # Implemented
|
||||
│ ├── classification/ # ✅ Implemented
|
||||
│ │ ├── data_processor.py # Data processing
|
||||
│ │ ├── train.py # Training
|
||||
│ │ └── inference.py # Inference
|
||||
│ ├── styling/ # ✅ Implemented
|
||||
│ │ ├── data_processor.py # Style data processing
|
||||
│ │ ├── train.py # LoRA fine-tuning
|
||||
│ │ └── inference.py # Style transfer inference
|
||||
│ ├── completion/ # Planned for future updates
|
||||
│ ├── styling/ # Planned for future updates
|
||||
│ └── matching/ # Planned for future updates
|
||||
├── scripts/ # User-friendly scripts
|
||||
│ ├── classification/ # Implemented
|
||||
│ ├── classification/ # ✅ Implemented
|
||||
│ │ ├── data_processor.py # Data processing script
|
||||
│ │ ├── trainer.py # Training script
|
||||
│ │ └── inference.py # Inference script
|
||||
│ ├── styling/ # ✅ Implemented
|
||||
│ │ ├── data_processor.py # Style data processing script
|
||||
│ │ ├── train.py # Training script
|
||||
│ │ └── inference.py # Inference script
|
||||
│ ├── completion/ # Planned for future updates
|
||||
│ ├── styling/ # Planned for future updates
|
||||
│ └── matching/ # Planned for future updates
|
||||
├── results/ # Model outputs
|
||||
│ ├── classification/ # Implemented
|
||||
│ ├── classification/ # ✅ Implemented
|
||||
│ ├── styling/ # ✅ Implemented
|
||||
│ ├── completion/ # Planned for future updates
|
||||
│ ├── styling/ # Planned for future updates
|
||||
│ └── matching/ # Planned for future updates
|
||||
└── utils/ # Shared utility modules
|
||||
```
|
||||
@@ -146,12 +153,117 @@ Inference completed successfully!
|
||||
- surprise: 0.0224
|
||||
```
|
||||
|
||||
## Quick Start (Styling Task)
|
||||
|
||||
### 1. Setup Environment
|
||||
|
||||
```bash
|
||||
# Install dependencies (including unsloth for styling)
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Set Python path
|
||||
export PYTHONPATH=.
|
||||
```
|
||||
|
||||
### 2. Data Processing
|
||||
|
||||
```bash
|
||||
# Process style transfer dataset
|
||||
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
|
||||
|
||||
# Create HuggingFace dataset
|
||||
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
|
||||
|
||||
# Check output location
|
||||
ls -la ./data/processed/styling/formal/
|
||||
```
|
||||
|
||||
**Expected Output:**
|
||||
```
|
||||
Styling data processing completed successfully!
|
||||
Data source: custom
|
||||
Data file: ./data/raw/styling/sample_formal.jsonl
|
||||
Total samples: 5
|
||||
Split sizes: {'train': 3, 'validation': 1, 'test': 1}
|
||||
Output directory: ./data/processed/styling/formal
|
||||
Style instruction: Rewrite the following text in a formal style
|
||||
```
|
||||
|
||||
### 3. Model Training
|
||||
|
||||
```bash
|
||||
# Train using processed data (automatically loads from YAML output_dir)
|
||||
python scripts/styling/train.py example
|
||||
|
||||
# Custom training
|
||||
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
|
||||
|
||||
# Check model output
|
||||
ls -la ./models/styling/
|
||||
```
|
||||
|
||||
**Expected Output:**
|
||||
```
|
||||
Training completed successfully!
|
||||
Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
|
||||
Dataset: Loaded from ./data/processed/styling/formal
|
||||
Training for 3 epochs with batch size 4
|
||||
Model saved to: ./models/styling
|
||||
```
|
||||
|
||||
### 4. Model Inference
|
||||
|
||||
```bash
|
||||
# Single text style transfer
|
||||
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
|
||||
|
||||
# Batch processing
|
||||
python scripts/styling/inference.py batch
|
||||
|
||||
# Interactive mode
|
||||
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
|
||||
```
|
||||
|
||||
**Expected Output:**
|
||||
```
|
||||
Inference completed successfully!
|
||||
Input: Hey, what's up?
|
||||
Output: Hello, how are you doing?
|
||||
Style: Formal
|
||||
```
|
||||
|
||||
## Adding New Tasks
|
||||
|
||||
To add a new task (e.g., completion, styling, matching), follow these steps:
|
||||
|
||||
### Step 1: Create Task Directory Structure
|
||||
### Example: Styling Task (Already Implemented)
|
||||
|
||||
The styling task demonstrates a complete implementation:
|
||||
|
||||
1. **Task Directory Structure** ✅
|
||||
```bash
|
||||
configs/styling/ # YAML configurations
|
||||
data/raw/styling/ # Raw style transfer data
|
||||
data/processed/styling/ # Processed data
|
||||
pipelines/styling/ # Core pipeline scripts
|
||||
scripts/styling/ # User-friendly scripts
|
||||
models/styling/ # Trained models
|
||||
```
|
||||
|
||||
2. **Pipeline Components** ✅
|
||||
- **Data Processor**: Handles style transfer datasets with instruction/input/output format
|
||||
- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
|
||||
- **Inference**: Style transfer with streaming and batch processing
|
||||
|
||||
3. **Key Features** ✅
|
||||
- Automatic EOS token handling: `text + tokenizer.eos_token`
|
||||
- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
|
||||
- YAML integration: Uses `data.output_dir` for automatic dataset loading
|
||||
- HuggingFace dataset export and loading
|
||||
|
||||
### For Other Tasks (completion, matching)
|
||||
|
||||
1. **Create Task Directory Structure**
|
||||
```bash
|
||||
# Create task directories
|
||||
mkdir -p configs/completion
|
||||
@@ -163,7 +275,7 @@ mkdir -p tasks/completion
|
||||
mkdir -p models/completion
|
||||
```
|
||||
|
||||
### Step 2: Create Task Configuration
|
||||
2. **Create Task Configuration**
|
||||
|
||||
```bash
|
||||
# Create YAML configuration for new task
|
||||
@@ -205,7 +317,7 @@ inference:
|
||||
EOF
|
||||
```
|
||||
|
||||
### Step 3: Create Pipeline Scripts
|
||||
3. **Create Pipeline Scripts**
|
||||
|
||||
Copy and modify the classification pipeline scripts:
|
||||
|
||||
@@ -221,7 +333,7 @@ cp scripts/classification/trainer.py scripts/completion/
|
||||
cp scripts/classification/inference.py scripts/completion/
|
||||
```
|
||||
|
||||
### Step 4: Modify Pipeline Code
|
||||
4. **Modify Pipeline Code**
|
||||
|
||||
Update the pipeline scripts for your specific task:
|
||||
|
||||
@@ -240,7 +352,7 @@ Update the pipeline scripts for your specific task:
|
||||
- Add generation parameters (temperature, top-k, etc.)
|
||||
- Modify output format
|
||||
|
||||
### Step 5: Update Task Scripts
|
||||
5. **Update Task Scripts**
|
||||
|
||||
Modify the task scripts to use your new pipeline:
|
||||
|
||||
@@ -254,7 +366,7 @@ def run_with_yaml_config(config_path: str, **cli_overrides):
|
||||
# ... rest of the function
|
||||
```
|
||||
|
||||
### Step 6: Create Task-Specific Models
|
||||
6. **Create Task-Specific Models**
|
||||
|
||||
```bash
|
||||
# Create model directory
|
||||
@@ -275,7 +387,7 @@ class TextGenerator:
|
||||
EOF
|
||||
```
|
||||
|
||||
### Step 7: Test Your New Task
|
||||
7. **Test Your New Task**
|
||||
|
||||
```bash
|
||||
# Test data processing
|
||||
@@ -330,10 +442,49 @@ inference:
|
||||
return_top_k: 3 # Top K predictions
|
||||
```
|
||||
|
||||
### Styling Configuration Example
|
||||
|
||||
```yaml
|
||||
# Styling Task Configuration
|
||||
task:
|
||||
name: "styling"
|
||||
type: "style_transfer"
|
||||
|
||||
# Data Processing Configuration
|
||||
data:
|
||||
source: "custom"
|
||||
data_path: "./data/raw/styling/sample_formal.jsonl"
|
||||
input_field: "text"
|
||||
output_field: "styled_text"
|
||||
instruction: "Rewrite the following text in a formal style"
|
||||
output_dir: "./data/processed/styling/formal"
|
||||
output_format: "alpaca"
|
||||
|
||||
# Model Configuration
|
||||
model:
|
||||
training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
|
||||
training_max_seq_length: 2048
|
||||
training_load_in_4bit: true
|
||||
|
||||
# Training Configuration
|
||||
training:
|
||||
num_epochs: 3
|
||||
batch_size: 2
|
||||
learning_rate: 2e-4
|
||||
weight_decay: 0.01
|
||||
|
||||
# Inference Configuration
|
||||
inference:
|
||||
batch_size: 1
|
||||
max_new_tokens: 128
|
||||
temperature: 0.8
|
||||
```
|
||||
|
||||
### Available Configuration Files
|
||||
|
||||
- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
|
||||
- `configs/classification/custom.yaml` - Custom dataset processing
|
||||
- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning
|
||||
|
||||
## Usage Examples
|
||||
|
||||
@@ -385,6 +536,28 @@ python scripts/classification/inference.py --config configs/classification/emoti
|
||||
python scripts/classification/inference.py examples
|
||||
```
|
||||
|
||||
### Styling Examples
|
||||
|
||||
```bash
|
||||
# 1. Data Processing
|
||||
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
|
||||
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
|
||||
|
||||
# 2. Training
|
||||
python scripts/styling/train.py example
|
||||
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
|
||||
|
||||
# 3. Inference
|
||||
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
|
||||
python scripts/styling/inference.py batch
|
||||
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
|
||||
|
||||
# 4. Run examples
|
||||
python scripts/styling/data_processor.py examples
|
||||
python scripts/styling/train.py features
|
||||
python scripts/styling/inference.py features
|
||||
```
|
||||
|
||||
## Troubleshooting Common Errors
|
||||
|
||||
### 1. ModuleNotFoundError: No module named 'utils'
|
||||
@@ -512,12 +685,20 @@ tail -f logs/training.log
|
||||
|
||||
## Workflow Summary
|
||||
|
||||
### Classification Task
|
||||
1. **Setup**: Install dependencies and set PYTHONPATH
|
||||
2. **Data Processing**: Process raw data into organized splits
|
||||
3. **Training**: Train model using processed data
|
||||
4. **Inference**: Use trained model for predictions
|
||||
5. **Monitoring**: Check logs and outputs for errors
|
||||
|
||||
### Styling Task
|
||||
1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
|
||||
2. **Data Processing**: Process style transfer data with instruction/input/output format
|
||||
3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
|
||||
4. **Inference**: Style transfer with streaming and batch processing
|
||||
5. **Monitoring**: Check training logs and model outputs
|
||||
|
||||
## Creating Custom Configurations
|
||||
|
||||
### For New Datasets
|
||||
|
||||
Reference in New Issue
Block a user