764 lines
22 KiB
Markdown
764 lines
22 KiB
Markdown
# Fine-Tune Task: NLP Pipeline Framework
|
|
|
|
A comprehensive framework for fine-tuning NLP models with organized YAML configurations, supporting multiple tasks (classification, completion, styling, matching).
|
|
|
|
## Supported Tasks
|
|
|
|
This framework supports multiple NLP tasks with organized configurations:
|
|
|
|
- **Classification**: Text classification, sentiment analysis, topic classification
|
|
- **Completion**: Text generation, code completion, story generation
|
|
- **Styling**: Style transfer, tone classification, writing style adaptation
|
|
- **Matching**: Semantic matching, entity matching, similarity scoring
|
|
|
|
### Current Implementation Status
|
|
|
|
- **Classification**: ✅ Fully implemented with emotion classification example
|
|
- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
|
|
- **Completion**: Planned for future updates
|
|
- **Matching**: Planned for future updates
|
|
|
|
**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
|
|
|
|
## Project Structure
|
|
|
|
```
|
|
fine-tune-task/
|
|
├── configs/ # YAML configuration files
|
|
│ ├── classification/ # ✅ Implemented
|
|
│ │ ├── emotion.yaml # Emotion classification
|
|
│ │ └── custom.yaml # Custom dataset
|
|
│ ├── styling/ # ✅ Implemented
|
|
│ │ └── formal.yaml # Formal style transfer
|
|
│ ├── completion/ # Planned for future updates
|
|
│ └── matching/ # Planned for future updates
|
|
├── data/ # Data directories
|
|
│ ├── raw/ # Raw input data
|
|
│ │ ├── classification/ # ✅ Implemented
|
|
│ │ ├── styling/ # ✅ Implemented
|
|
│ │ ├── completion/ # Planned for future updates
|
|
│ │ └── matching/ # Planned for future updates
|
|
│ └── processed/ # Processed data
|
|
│ ├── classification/ # ✅ Implemented
|
|
│ ├── styling/ # ✅ Implemented
|
|
│ ├── completion/ # Planned for future updates
|
|
│ └── matching/ # Planned for future updates
|
|
├── pipelines/ # Core pipeline scripts
|
|
│ ├── classification/ # ✅ Implemented
|
|
│ │ ├── data_processor.py # Data processing
|
|
│ │ ├── train.py # Training
|
|
│ │ └── inference.py # Inference
|
|
│ ├── styling/ # ✅ Implemented
|
|
│ │ ├── data_processor.py # Style data processing
|
|
│ │ ├── train.py # LoRA fine-tuning
|
|
│ │ └── inference.py # Style transfer inference
|
|
│ ├── completion/ # Planned for future updates
|
|
│ └── matching/ # Planned for future updates
|
|
├── scripts/ # User-friendly scripts
|
|
│ ├── classification/ # ✅ Implemented
|
|
│ │ ├── data_processor.py # Data processing script
|
|
│ │ ├── trainer.py # Training script
|
|
│ │ └── inference.py # Inference script
|
|
│ ├── styling/ # ✅ Implemented
|
|
│ │ ├── data_processor.py # Style data processing script
|
|
│ │ ├── train.py # Training script
|
|
│ │ └── inference.py # Inference script
|
|
│ ├── completion/ # Planned for future updates
|
|
│ └── matching/ # Planned for future updates
|
|
├── results/ # Model outputs
|
|
│ ├── classification/ # ✅ Implemented
|
|
│ ├── styling/ # ✅ Implemented
|
|
│ ├── completion/ # Planned for future updates
|
|
│ └── matching/ # Planned for future updates
|
|
└── utils/ # Shared utility modules
|
|
```
|
|
|
|
## Quick Start (Classification Task)
|
|
|
|
### 1. Setup Environment
|
|
|
|
```bash
|
|
# Install dependencies
|
|
pip install -r requirements.txt
|
|
|
|
# Set Python path
|
|
export PYTHONPATH=.
|
|
```
|
|
|
|
### 2. Data Processing
|
|
|
|
```bash
|
|
# Process emotion dataset
|
|
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
|
|
|
|
# Process with custom parameters
|
|
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000
|
|
|
|
# Check output location
|
|
ls -la ./data/processed/classification/emotion/classification/
|
|
```
|
|
|
|
**Expected Output:**
|
|
```
|
|
Data processing completed successfully!
|
|
Data source: huggingface
|
|
Dataset: dair-ai/emotion
|
|
Total samples: 2999
|
|
Unique labels: 6
|
|
Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
|
|
Output directory: ./data/processed/classification/emotion
|
|
```
|
|
|
|
### 3. Model Training
|
|
|
|
```bash
|
|
# Train using processed data
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
|
|
|
|
# Train with custom parameters
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32
|
|
|
|
# Check model output
|
|
ls -la ./results/classification/emotion_model/
|
|
```
|
|
|
|
**Expected Output:**
|
|
```
|
|
Training completed successfully!
|
|
Model: bert-base-uncased
|
|
Data directory: ./data/processed/classification/emotion
|
|
Training for 3 epochs with batch size 16
|
|
Model saved to: ./results/classification/emotion_model
|
|
```
|
|
|
|
### 4. Model Inference
|
|
|
|
```bash
|
|
# Run inference
|
|
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
|
|
|
|
# File-based inference
|
|
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
|
|
```
|
|
|
|
**Expected Output:**
|
|
```
|
|
Inference completed successfully!
|
|
Loading model from: ./results/classification/emotion_model
|
|
Predicted label: joy
|
|
Confidence: 0.8542
|
|
Top 3 predictions:
|
|
- joy: 0.8542
|
|
- love: 0.1234
|
|
- surprise: 0.0224
|
|
```
|
|
|
|
## Quick Start (Styling Task)
|
|
|
|
### 1. Setup Environment
|
|
|
|
```bash
|
|
# Install dependencies (including unsloth for styling)
|
|
pip install -r requirements.txt
|
|
|
|
# Set Python path
|
|
export PYTHONPATH=.
|
|
```
|
|
|
|
### 2. Data Processing
|
|
|
|
```bash
|
|
# Process style transfer dataset
|
|
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
|
|
|
|
# Create HuggingFace dataset
|
|
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
|
|
|
|
# Check output location
|
|
ls -la ./data/processed/styling/formal/
|
|
```
|
|
|
|
**Expected Output:**
|
|
```
|
|
Styling data processing completed successfully!
|
|
Data source: custom
|
|
Data file: ./data/raw/styling/sample_formal.jsonl
|
|
Total samples: 5
|
|
Split sizes: {'train': 3, 'validation': 1, 'test': 1}
|
|
Output directory: ./data/processed/styling/formal
|
|
Style instruction: Rewrite the following text in a formal style
|
|
```
|
|
|
|
### 3. Model Training
|
|
|
|
```bash
|
|
# Train using processed data (automatically loads from YAML output_dir)
|
|
python scripts/styling/train.py example
|
|
|
|
# Custom training
|
|
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
|
|
|
|
# Check model output
|
|
ls -la ./models/styling/
|
|
```
|
|
|
|
**Expected Output:**
|
|
```
|
|
Training completed successfully!
|
|
Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
|
|
Dataset: Loaded from ./data/processed/styling/formal
|
|
Training for 3 epochs with batch size 4
|
|
Model saved to: ./models/styling
|
|
```
|
|
|
|
### 4. Model Inference
|
|
|
|
```bash
|
|
# Single text style transfer
|
|
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
|
|
|
|
# Batch processing
|
|
python scripts/styling/inference.py batch
|
|
|
|
# Interactive mode
|
|
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
|
|
```
|
|
|
|
**Expected Output:**
|
|
```
|
|
Inference completed successfully!
|
|
Input: Hey, what's up?
|
|
Output: Hello, how are you doing?
|
|
Style: Formal
|
|
```
|
|
|
|
## Adding New Tasks
|
|
|
|
To add a new task (e.g., completion, styling, matching), follow these steps:
|
|
|
|
### Example: Styling Task (Already Implemented)
|
|
|
|
The styling task demonstrates a complete implementation:
|
|
|
|
1. **Task Directory Structure** ✅
|
|
```bash
|
|
configs/styling/ # YAML configurations
|
|
data/raw/styling/ # Raw style transfer data
|
|
data/processed/styling/ # Processed data
|
|
pipelines/styling/ # Core pipeline scripts
|
|
scripts/styling/ # User-friendly scripts
|
|
models/styling/ # Trained models
|
|
```
|
|
|
|
2. **Pipeline Components** ✅
|
|
- **Data Processor**: Handles style transfer datasets with instruction/input/output format
|
|
- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
|
|
- **Inference**: Style transfer with streaming and batch processing
|
|
|
|
3. **Key Features** ✅
|
|
- Automatic EOS token handling: `text + tokenizer.eos_token`
|
|
- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
|
|
- YAML integration: Uses `data.output_dir` for automatic dataset loading
|
|
- HuggingFace dataset export and loading
|
|
|
|
### For Other Tasks (completion, matching)
|
|
|
|
1. **Create Task Directory Structure**
|
|
```bash
|
|
# Create task directories
|
|
mkdir -p configs/completion
|
|
mkdir -p data/raw/completion data/processed/completion
|
|
mkdir -p pipelines/completion
|
|
mkdir -p scripts/completion
|
|
mkdir -p results/completion
|
|
mkdir -p tasks/completion
|
|
mkdir -p models/completion
|
|
```
|
|
|
|
2. **Create Task Configuration**
|
|
|
|
```bash
|
|
# Create YAML configuration for new task
|
|
cat > configs/completion/text_generation.yaml << 'EOF'
|
|
# Text Generation Task Configuration
|
|
task:
|
|
name: "completion"
|
|
type: "text_generation"
|
|
|
|
# Data Processing Configuration
|
|
data:
|
|
source: "huggingface"
|
|
dataset_name: "your-dataset-name"
|
|
output_dir: "./data/processed/completion/text_generation"
|
|
max_samples: 1000
|
|
# ... other data parameters
|
|
|
|
# Model Configuration
|
|
model:
|
|
name: "gpt2" # Different model for completion
|
|
max_length: 1024
|
|
# ... model parameters
|
|
|
|
# Training Configuration
|
|
training:
|
|
num_epochs: 3
|
|
batch_size: 8 # Smaller batch for generation
|
|
learning_rate: 5e-5
|
|
data_dir: "./data/processed/completion/text_generation"
|
|
output_dir: "./results/completion/text_generation_model"
|
|
|
|
# Inference Configuration
|
|
inference:
|
|
model_path: "./results/completion/text_generation_model"
|
|
device: "auto"
|
|
batch_size: 1 # Generation is typically one at a time
|
|
max_length: 100
|
|
temperature: 0.7
|
|
EOF
|
|
```
|
|
|
|
3. **Create Pipeline Scripts**
|
|
|
|
Copy and modify the classification pipeline scripts:
|
|
|
|
```bash
|
|
# Copy classification scripts as templates
|
|
cp pipelines/classification/data_processor.py pipelines/completion/
|
|
cp pipelines/classification/train.py pipelines/completion/
|
|
cp pipelines/classification/inference.py pipelines/completion/
|
|
|
|
# Copy task scripts
|
|
cp scripts/classification/data_processor.py scripts/completion/
|
|
cp scripts/classification/trainer.py scripts/completion/
|
|
cp scripts/classification/inference.py scripts/completion/
|
|
```
|
|
|
|
4. **Modify Pipeline Code**
|
|
|
|
Update the pipeline scripts for your specific task:
|
|
|
|
1. **Data Processor** (`pipelines/completion/data_processor.py`):
|
|
- Update data loading logic for completion datasets
|
|
- Modify preprocessing for text generation
|
|
- Adjust output format for completion tasks
|
|
|
|
2. **Trainer** (`pipelines/completion/train.py`):
|
|
- Change model type to generation models (GPT, T5, etc.)
|
|
- Update training loop for text generation
|
|
- Modify evaluation metrics
|
|
|
|
3. **Inference** (`pipelines/completion/inference.py`):
|
|
- Update inference for text generation
|
|
- Add generation parameters (temperature, top-k, etc.)
|
|
- Modify output format
|
|
|
|
5. **Update Task Scripts**
|
|
|
|
Modify the task scripts to use your new pipeline:
|
|
|
|
```python
|
|
# scripts/completion/data_processor.py
|
|
def run_with_yaml_config(config_path: str, **cli_overrides):
|
|
cmd = [
|
|
"python", "pipelines/completion/data_processor.py", # Updated path
|
|
"--config", config_path
|
|
]
|
|
# ... rest of the function
|
|
```
|
|
|
|
6. **Create Task-Specific Models**
|
|
|
|
```bash
|
|
# Create model directory
|
|
mkdir -p models/completion
|
|
|
|
# Add task-specific model classes
|
|
cat > models/completion/text_generator.py << 'EOF'
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
|
class TextGenerator:
|
|
def __init__(self, model_name):
|
|
self.model = AutoModelForCausalLM.from_pretrained(model_name)
|
|
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
|
|
|
|
def generate(self, prompt, max_length=100, temperature=0.7):
|
|
# Implementation for text generation
|
|
pass
|
|
EOF
|
|
```
|
|
|
|
7. **Test Your New Task**
|
|
|
|
```bash
|
|
# Test data processing
|
|
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml
|
|
|
|
# Test training
|
|
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml
|
|
|
|
# Test inference
|
|
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
|
|
```
|
|
|
|
## YAML Configuration Guide
|
|
|
|
### Configuration Structure
|
|
|
|
Each YAML file is organized into clear sections:
|
|
|
|
```yaml
|
|
# Task Configuration
|
|
task:
|
|
name: "classification" # or "completion", "styling", "matching"
|
|
type: "sequence_classification" # or "text_generation", "style_transfer", "semantic_matching"
|
|
|
|
# Data Processing Configuration
|
|
data:
|
|
source: "huggingface" # "huggingface" or "custom"
|
|
dataset_name: "dair-ai/emotion" # HuggingFace dataset name
|
|
output_dir: "./data/processed/classification/emotion"
|
|
max_samples: 1000 # Limit dataset size
|
|
# ... other data parameters
|
|
|
|
# Model Configuration
|
|
model:
|
|
name: "bert-base-uncased" # Model from HuggingFace Hub
|
|
max_length: 512 # Sequence length
|
|
num_labels: 6 # Number of classes
|
|
|
|
# Training Configuration
|
|
training:
|
|
num_epochs: 3 # Training epochs
|
|
batch_size: 16 # Batch size
|
|
learning_rate: 2e-5 # Learning rate
|
|
data_dir: "./data/processed/classification/emotion"
|
|
output_dir: "./results/classification/emotion_model"
|
|
|
|
# Inference Configuration
|
|
inference:
|
|
model_path: "./results/classification/emotion_model"
|
|
device: "auto" # "auto", "cuda", "cpu"
|
|
batch_size: 32 # Inference batch size
|
|
return_top_k: 3 # Top K predictions
|
|
```
|
|
|
|
### Styling Configuration Example
|
|
|
|
```yaml
|
|
# Styling Task Configuration
|
|
task:
|
|
name: "styling"
|
|
type: "style_transfer"
|
|
|
|
# Data Processing Configuration
|
|
data:
|
|
source: "custom"
|
|
data_path: "./data/raw/styling/sample_formal.jsonl"
|
|
input_field: "text"
|
|
output_field: "styled_text"
|
|
instruction: "Rewrite the following text in a formal style"
|
|
output_dir: "./data/processed/styling/formal"
|
|
output_format: "alpaca"
|
|
|
|
# Model Configuration
|
|
model:
|
|
training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
|
|
training_max_seq_length: 2048
|
|
training_load_in_4bit: true
|
|
|
|
# Training Configuration
|
|
training:
|
|
num_epochs: 3
|
|
batch_size: 2
|
|
learning_rate: 2e-4
|
|
weight_decay: 0.01
|
|
|
|
# Inference Configuration
|
|
inference:
|
|
batch_size: 1
|
|
max_new_tokens: 128
|
|
temperature: 0.8
|
|
```
|
|
|
|
### Available Configuration Files
|
|
|
|
- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
|
|
- `configs/classification/custom.yaml` - Custom dataset processing
|
|
- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning
|
|
|
|
## Usage Examples
|
|
|
|
### Data Processing Examples
|
|
|
|
```bash
|
|
# 1. Use YAML config only
|
|
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
|
|
|
|
# 2. Override YAML values
|
|
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500
|
|
|
|
# 3. Use CLI only (backward compatibility)
|
|
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion
|
|
|
|
# 4. Run examples
|
|
python scripts/classification/data_processor.py examples
|
|
```
|
|
|
|
### Training Examples
|
|
|
|
```bash
|
|
# 1. Use YAML config only
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
|
|
|
|
# 2. Override YAML values
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5
|
|
|
|
# 3. Use CLI only
|
|
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3
|
|
|
|
# 4. Run examples
|
|
python scripts/classification/trainer.py examples
|
|
```
|
|
|
|
### Inference Examples
|
|
|
|
```bash
|
|
# 1. Single text prediction
|
|
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
|
|
|
|
# 2. File-based prediction
|
|
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
|
|
|
|
# 3. Interactive mode
|
|
python scripts/classification/inference.py --config configs/classification/emotion.yaml
|
|
|
|
# 4. Run examples
|
|
python scripts/classification/inference.py examples
|
|
```
|
|
|
|
### Styling Examples
|
|
|
|
```bash
|
|
# 1. Data Processing
|
|
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
|
|
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
|
|
|
|
# 2. Training
|
|
python scripts/styling/train.py example
|
|
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
|
|
|
|
# 3. Inference
|
|
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
|
|
python scripts/styling/inference.py batch
|
|
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
|
|
|
|
# 4. Run examples
|
|
python scripts/styling/data_processor.py examples
|
|
python scripts/styling/train.py features
|
|
python scripts/styling/inference.py features
|
|
```
|
|
|
|
## Troubleshooting Common Errors
|
|
|
|
### 1. ModuleNotFoundError: No module named 'utils'
|
|
|
|
**Error:**
|
|
```
|
|
ModuleNotFoundError: No module named 'utils'
|
|
```
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Set Python path before running scripts
|
|
export PYTHONPATH=.
|
|
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
|
|
```
|
|
|
|
### 2. Model Path Not Found
|
|
|
|
**Error:**
|
|
```
|
|
Model path not found: ./results/classification/emotion_model
|
|
```
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Train the model first
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
|
|
|
|
# Then run inference
|
|
python scripts/classification/inference.py --config configs/classification/emotion.yaml
|
|
```
|
|
|
|
### 3. Data Directory Not Found
|
|
|
|
**Error:**
|
|
```
|
|
Data directory not found: ./data/processed/classification/emotion
|
|
```
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Process data first
|
|
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
|
|
|
|
# Then train
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
|
|
```
|
|
|
|
### 4. YAML Configuration Errors
|
|
|
|
**Error:**
|
|
```
|
|
data_processor.py: error: --data-source is required (either in YAML config or CLI)
|
|
```
|
|
|
|
**Solution:**
|
|
Check your YAML file structure. It should have:
|
|
```yaml
|
|
data:
|
|
source: "huggingface" # Not data_source
|
|
dataset_name: "dair-ai/emotion"
|
|
```
|
|
|
|
### 5. HuggingFace Download Issues
|
|
|
|
**Error:**
|
|
```
|
|
KeyboardInterrupt during model download
|
|
```
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Use smaller dataset for testing
|
|
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100
|
|
|
|
# Or use cached models
|
|
export HF_HOME=./cache
|
|
```
|
|
|
|
### 6. CUDA/GPU Issues
|
|
|
|
**Error:**
|
|
```
|
|
RuntimeError: CUDA out of memory
|
|
```
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Reduce batch size
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8
|
|
|
|
# Or use CPU
|
|
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
|
|
```
|
|
|
|
## Monitoring and Logs
|
|
|
|
### Check Processing Status
|
|
|
|
```bash
|
|
# Check data processing output
|
|
ls -la ./data/processed/classification/emotion/classification/
|
|
|
|
# Check training output
|
|
ls -la ./results/classification/emotion_model/
|
|
|
|
# Check logs
|
|
tail -f logs/training.log
|
|
```
|
|
|
|
### Expected File Structure After Processing
|
|
|
|
```
|
|
./data/processed/classification/emotion/classification/
|
|
├── train.jsonl # Training data
|
|
├── validation.jsonl # Validation data
|
|
└── test.jsonl # Test data
|
|
|
|
./results/classification/emotion_model/
|
|
├── config.json # Model configuration
|
|
├── pytorch_model.bin # Model weights
|
|
├── tokenizer.json # Tokenizer
|
|
└── label_info.json # Label mappings
|
|
```
|
|
|
|
## Workflow Summary
|
|
|
|
### Classification Task
|
|
1. **Setup**: Install dependencies and set PYTHONPATH
|
|
2. **Data Processing**: Process raw data into organized splits
|
|
3. **Training**: Train model using processed data
|
|
4. **Inference**: Use trained model for predictions
|
|
5. **Monitoring**: Check logs and outputs for errors
|
|
|
|
### Styling Task
|
|
1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
|
|
2. **Data Processing**: Process style transfer data with instruction/input/output format
|
|
3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
|
|
4. **Inference**: Style transfer with streaming and batch processing
|
|
5. **Monitoring**: Check training logs and model outputs
|
|
|
|
## Creating Custom Configurations
|
|
|
|
### For New Datasets
|
|
|
|
1. Copy existing config:
|
|
```bash
|
|
cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
|
|
```
|
|
|
|
2. Modify parameters:
|
|
```yaml
|
|
data:
|
|
source: "huggingface"
|
|
dataset_name: "your-dataset-name"
|
|
output_dir: "./data/processed/classification/my_dataset"
|
|
# ... other parameters
|
|
|
|
training:
|
|
data_dir: "./data/processed/classification/my_dataset"
|
|
output_dir: "./results/classification/my_dataset_model"
|
|
```
|
|
|
|
3. Run pipeline:
|
|
```bash
|
|
python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
|
|
```
|
|
|
|
### For Custom Data
|
|
|
|
1. Use custom config:
|
|
```yaml
|
|
data:
|
|
source: "custom"
|
|
data_path: "./data/raw/my_data.jsonl"
|
|
output_dir: "./data/processed/classification/my_custom_dataset"
|
|
```
|
|
|
|
2. Run processing:
|
|
```bash
|
|
python scripts/classification/data_processor.py --config configs/classification/custom.yaml
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
1. **Always check output directories** before running next step
|
|
2. **Use small datasets for testing** before full runs
|
|
3. **Monitor logs** for errors and warnings
|
|
4. **Backup configurations** before major changes
|
|
5. **Use version control** for YAML files
|
|
6. **Test with CLI overrides** for quick experiments
|
|
|
|
## Support
|
|
|
|
For issues and questions:
|
|
1. Check the troubleshooting section above
|
|
2. Review logs in the output directories
|
|
3. Verify YAML configuration structure
|
|
4. Test with smaller datasets first
|
|
|
|
---
|
|
|
|
**Happy fine-tuning!**
|