Fine-Tune Task: NLP Pipeline Framework
A comprehensive framework for fine-tuning NLP models with organized YAML configurations, supporting multiple tasks (classification, completion, styling, matching).
Supported Tasks
The framework is organized around four task types (see Current Implementation Status for which are available today):
- Classification: Text classification, sentiment analysis, topic classification
- Completion: Text generation, code completion, story generation
- Styling: Style transfer, tone classification, writing style adaptation
- Matching: Semantic matching, entity matching, similarity scoring
Current Implementation Status
- Classification: ✅ Fully implemented with emotion classification example
- Styling: ✅ Fully implemented with style transfer and LoRA fine-tuning
- Completion: Planned for future updates
- Matching: Planned for future updates
Note: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
Project Structure
fine-tune-task/
├── configs/ # YAML configuration files
│ ├── classification/ # ✅ Implemented
│ │ ├── emotion.yaml # Emotion classification
│ │ └── custom.yaml # Custom dataset
│ ├── styling/ # ✅ Implemented
│ │ └── formal.yaml # Formal style transfer
│ ├── completion/ # Planned for future updates
│ └── matching/ # Planned for future updates
├── data/ # Data directories
│ ├── raw/ # Raw input data
│ │ ├── classification/ # ✅ Implemented
│ │ ├── styling/ # ✅ Implemented
│ │ ├── completion/ # Planned for future updates
│ │ └── matching/ # Planned for future updates
│ └── processed/ # Processed data
│ ├── classification/ # ✅ Implemented
│ ├── styling/ # ✅ Implemented
│ ├── completion/ # Planned for future updates
│ └── matching/ # Planned for future updates
├── pipelines/ # Core pipeline scripts
│ ├── classification/ # ✅ Implemented
│ │ ├── data_processor.py # Data processing
│ │ ├── train.py # Training
│ │ └── inference.py # Inference
│ ├── styling/ # ✅ Implemented
│ │ ├── data_processor.py # Style data processing
│ │ ├── train.py # LoRA fine-tuning
│ │ └── inference.py # Style transfer inference
│ ├── completion/ # Planned for future updates
│ └── matching/ # Planned for future updates
├── scripts/ # User-friendly scripts
│ ├── classification/ # ✅ Implemented
│ │ ├── data_processor.py # Data processing script
│ │ ├── trainer.py # Training script
│ │ └── inference.py # Inference script
│ ├── styling/ # ✅ Implemented
│ │ ├── data_processor.py # Style data processing script
│ │ ├── train.py # Training script
│ │ └── inference.py # Inference script
│ ├── completion/ # Planned for future updates
│ └── matching/ # Planned for future updates
├── results/ # Model outputs
│ ├── classification/ # ✅ Implemented
│ ├── styling/ # ✅ Implemented
│ ├── completion/ # Planned for future updates
│ └── matching/ # Planned for future updates
└── utils/ # Shared utility modules
Quick Start (Classification Task)
1. Setup Environment
# Install dependencies
pip install -r requirements.txt
# Set Python path
export PYTHONPATH=.
2. Data Processing
# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000
# Check output location
ls -la ./data/processed/classification/emotion/classification/
Expected Output:
Data processing completed successfully!
Data source: huggingface
Dataset: dair-ai/emotion
Total samples: 2999
Unique labels: 6
Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
Output directory: ./data/processed/classification/emotion
3. Model Training
# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32
# Check model output
ls -la ./results/classification/emotion_model/
Expected Output:
Training completed successfully!
Model: bert-base-uncased
Data directory: ./data/processed/classification/emotion
Training for 3 epochs with batch size 16
Model saved to: ./results/classification/emotion_model
4. Model Inference
# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
Expected Output:
Inference completed successfully!
Loading model from: ./results/classification/emotion_model
Predicted label: joy
Confidence: 0.8542
Top 3 predictions:
- joy: 0.8542
- love: 0.1234
- surprise: 0.0224
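The confidence values above read like softmax probabilities ranked from highest to lowest. As a minimal sketch of how such a ranking could be derived from raw model logits: the function name `top_k_predictions` and the label order are assumptions for illustration, not taken from the framework, so check the order against the `label_info.json` written by training.

```python
import math

# Assumed label order for dair-ai/emotion; verify against label_info.json.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def top_k_predictions(logits, labels=LABELS, k=3):
    """Convert raw logits to softmax probabilities and return the k best."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up logits, only to show the output shape.
    for label, prob in top_k_predictions([0.1, 4.2, 2.3, -1.0, -0.5, 0.6]):
        print(f"- {label}: {prob:.4f}")
```

The `k` argument plays the role of `return_top_k` in the inference section of the YAML config.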
Quick Start (Styling Task)
1. Setup Environment
# Install dependencies (including unsloth for styling)
pip install -r requirements.txt
# Set Python path
export PYTHONPATH=.
2. Data Processing
# Process style transfer dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
# Create HuggingFace dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
# Check output location
ls -la ./data/processed/styling/formal/
Expected Output:
Styling data processing completed successfully!
Data source: custom
Data file: ./data/raw/styling/sample_formal.jsonl
Total samples: 5
Split sizes: {'train': 3, 'validation': 1, 'test': 1}
Output directory: ./data/processed/styling/formal
Style instruction: Rewrite the following text in a formal style
3. Model Training
# Train using processed data (automatically loads from YAML output_dir)
python scripts/styling/train.py example
# Custom training
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
# Check model output
ls -la ./models/styling/
Expected Output:
Training completed successfully!
Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
Dataset: Loaded from ./data/processed/styling/formal
Training for 3 epochs with batch size 4
Model saved to: ./models/styling
4. Model Inference
# Single text style transfer
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
# Batch processing
python scripts/styling/inference.py batch
# Interactive mode
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
Expected Output:
Inference completed successfully!
Input: Hey, what's up?
Output: Hello, how are you doing?
Style: Formal
Adding New Tasks
To add a new task (e.g., completion or matching), use the existing styling task as a reference and follow the steps below:
Example: Styling Task (Already Implemented)
The styling task demonstrates a complete implementation:
- Task Directory Structure ✅
configs/styling/ # YAML configurations
data/raw/styling/ # Raw style transfer data
data/processed/styling/ # Processed data
pipelines/styling/ # Core pipeline scripts
scripts/styling/ # User-friendly scripts
models/styling/ # Trained models
- Pipeline Components ✅
- Data Processor: Handles style transfer datasets with instruction/input/output format
- Trainer: LoRA fine-tuning using Unsloth for efficiency
- Inference: Style transfer with streaming and batch processing
- Key Features ✅
- Automatic EOS token handling: text + tokenizer.eos_token
- Dataset mapping: dataset.map(formatting_prompts_func, batched=True)
- YAML integration: Uses data.output_dir for automatic dataset loading
- HuggingFace dataset export and loading
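The EOS handling and batched mapping listed above can be sketched as follows. This is an illustration under assumptions, not the framework's actual code: the prompt wording in `ALPACA_PROMPT` and the helper `make_formatting_func` are hypothetical, and the column names `instruction`, `input`, and `output` follow the format described for the data processor.

```python
# Alpaca-style template; the exact wording used by the framework is an assumption.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)


def make_formatting_func(eos_token):
    """Build a batched formatter suitable for dataset.map(..., batched=True)."""

    def formatting_prompts_func(examples):
        # With batched=True, each column arrives as a list of values.
        texts = []
        for instruction, text_in, text_out in zip(
            examples["instruction"], examples["input"], examples["output"]
        ):
            prompt = ALPACA_PROMPT.format(instruction, text_in, text_out)
            # Append EOS so the model learns where a response ends.
            texts.append(prompt + eos_token)
        return {"text": texts}

    return formatting_prompts_func


if __name__ == "__main__":
    batch = {
        "instruction": ["Rewrite the following text in a formal style"],
        "input": ["Hey, what's up?"],
        "output": ["Hello, how are you doing?"],
    }
    print(make_formatting_func("</s>")(batch)["text"][0])
```

With a real tokenizer and a Hugging Face dataset, the call would look like `dataset.map(make_formatting_func(tokenizer.eos_token), batched=True)`.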
For Other Tasks (completion, matching)
- Create Task Directory Structure
# Create task directories
mkdir -p configs/completion
mkdir -p data/raw/completion data/processed/completion
mkdir -p pipelines/completion
mkdir -p scripts/completion
mkdir -p results/completion
mkdir -p tasks/completion
mkdir -p models/completion
- Create Task Configuration
# Create YAML configuration for new task
cat > configs/completion/text_generation.yaml << 'EOF'
# Text Generation Task Configuration
task:
  name: "completion"
  type: "text_generation"
# Data Processing Configuration
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/completion/text_generation"
  max_samples: 1000
  # ... other data parameters
# Model Configuration
model:
  name: "gpt2" # Different model for completion
  max_length: 1024
  # ... model parameters
# Training Configuration
training:
  num_epochs: 3
  batch_size: 8 # Smaller batch for generation
  learning_rate: 5e-5
  data_dir: "./data/processed/completion/text_generation"
  output_dir: "./results/completion/text_generation_model"
# Inference Configuration
inference:
  model_path: "./results/completion/text_generation_model"
  device: "auto"
  batch_size: 1 # Generation is typically one at a time
  max_length: 100
  temperature: 0.7
EOF
- Create Pipeline Scripts
Copy and modify the classification pipeline scripts:
# Copy classification scripts as templates
cp pipelines/classification/data_processor.py pipelines/completion/
cp pipelines/classification/train.py pipelines/completion/
cp pipelines/classification/inference.py pipelines/completion/
# Copy task scripts
cp scripts/classification/data_processor.py scripts/completion/
cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/
- Modify Pipeline Code
Update the pipeline scripts for your specific task:
- Data Processor (pipelines/completion/data_processor.py):
  - Update data loading logic for completion datasets
  - Modify preprocessing for text generation
  - Adjust output format for completion tasks
- Trainer (pipelines/completion/train.py):
  - Change model type to generation models (GPT, T5, etc.)
  - Update training loop for text generation
  - Modify evaluation metrics
- Inference (pipelines/completion/inference.py):
  - Update inference for text generation
  - Add generation parameters (temperature, top-k, etc.)
  - Modify output format
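To make the generation parameters mentioned above concrete, here is a small pure-Python sketch of what temperature and top-k do when choosing the next token. The function `sample_next_token` is hypothetical and only illustrates the idea; in a real pipeline you would normally pass `temperature` and `top_k` (with `do_sample=True`) to the Hugging Face `generate` method instead of sampling by hand.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Pick a token index from logits using temperature and top-k sampling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Keep only the top_k highest-scoring token indices.
    k = max(1, min(top_k, len(logits)))
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index in proportion to the softmax weights.
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 3.0, 1.0, -2.0]
    print(sample_next_token(toy_logits, temperature=0.7, top_k=2, rng=random.Random(0)))
```

Setting `top_k=1` reduces this to greedy decoding, which is a convenient way to get repeatable output while debugging a new inference script.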
- Update Task Scripts
Modify the task scripts to use your new pipeline:
# scripts/completion/data_processor.py
def run_with_yaml_config(config_path: str, **cli_overrides):
    cmd = [
        "python", "pipelines/completion/data_processor.py",  # Updated path
        "--config", config_path
    ]
    # ... rest of the function
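The elided part of the wrapper is not shown in this README. As a hedged sketch of one way the CLI overrides could be turned into command-line flags, assuming that keyword names map to flags by replacing underscores with hyphens (as in `--max-samples`) and that boolean options are plain switches; `build_pipeline_command` is a hypothetical helper name:

```python
import sys


def build_pipeline_command(script_path, config_path, **cli_overrides):
    """Assemble the argv list for a pipeline script, appending CLI overrides."""
    # sys.executable keeps the child process on the same interpreter.
    cmd = [sys.executable, script_path, "--config", config_path]
    for name, value in cli_overrides.items():
        if value is None:
            continue  # Unset override: let the YAML value stand.
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            # Assumes boolean options are store_true style switches.
            if value:
                cmd.append(flag)
        else:
            cmd.extend([flag, str(value)])
    return cmd


if __name__ == "__main__":
    print(build_pipeline_command(
        "pipelines/completion/data_processor.py",
        "configs/completion/text_generation.yaml",
        max_samples=500,
    ))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` so that a failing pipeline step raises instead of being silently ignored.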
- Create Task-Specific Models
# Create model directory
mkdir -p models/completion
# Add task-specific model classes
cat > models/completion/text_generator.py << 'EOF'
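from transformers import AutoModelForCausalLM, AutoTokenizer

class TextGenerator:
    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def generate(self, prompt, max_length=100, temperature=0.7):
        # Minimal sampling-based sketch; adapt the generation arguments to your model
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(
            **inputs, max_length=max_length, do_sample=True,
            temperature=temperature, pad_token_id=self.tokenizer.eos_token_id,
        )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)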
EOF
- Test Your New Task
# Test data processing
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml
# Test training
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml
# Test inference
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
YAML Configuration Guide
Configuration Structure
Each YAML file is organized into clear sections:
# Task Configuration
task:
  name: "classification" # or "completion", "styling", "matching"
  type: "sequence_classification" # or "text_generation", "style_transfer", "semantic_matching"
# Data Processing Configuration
data:
  source: "huggingface" # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion" # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000 # Limit dataset size
  # ... other data parameters
# Model Configuration
model:
  name: "bert-base-uncased" # Model from HuggingFace Hub
  max_length: 512 # Sequence length
  num_labels: 6 # Number of classes
# Training Configuration
training:
  num_epochs: 3 # Training epochs
  batch_size: 16 # Batch size
  learning_rate: 2e-5 # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"
# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto" # "auto", "cuda", "cpu"
  batch_size: 32 # Inference batch size
  return_top_k: 3 # Top K predictions
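A quick structural check can catch a misnamed section or key before a long run starts. The sketch below is an assumption about what is worth checking, not the framework's own validation: the `REQUIRED_KEYS` table and both function names are hypothetical, and `load_config` relies on PyYAML being installed.

```python
# Hypothetical minimum: the five sections shown above plus a few core keys.
REQUIRED_KEYS = {
    "task": ["name", "type"],
    "data": ["source", "output_dir"],
    "model": [],
    "training": [],
    "inference": [],
}


def validate_config(config):
    """Return a list of human-readable problems; empty means nothing was found."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in block:
                problems.append(f"missing key: {section}.{key}")
    # Catch the common mistake of writing data_source instead of source.
    data = config.get("data")
    if isinstance(data, dict) and "data_source" in data:
        problems.append("use data.source, not data.data_source")
    return problems


def load_config(path):
    """Load a YAML file into a dict; requires PyYAML (imported lazily)."""
    import yaml

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}
```

For example, `validate_config(load_config("configs/classification/emotion.yaml"))` returns an empty list when every expected section and key is present.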
Styling Configuration Example
# Styling Task Configuration
task:
  name: "styling"
  type: "style_transfer"
# Data Processing Configuration
data:
  source: "custom"
  data_path: "./data/raw/styling/sample_formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"
  output_dir: "./data/processed/styling/formal"
  output_format: "alpaca"
# Model Configuration
model:
  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  training_max_seq_length: 2048
  training_load_in_4bit: true
# Training Configuration
training:
  num_epochs: 3
  batch_size: 2
  learning_rate: 2e-4
  weight_decay: 0.01
# Inference Configuration
inference:
  batch_size: 1
  max_new_tokens: 128
  temperature: 0.8
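The `input_field`, `output_field`, and `instruction` settings describe how a raw JSONL record is mapped onto the instruction/input/output layout. A minimal sketch of that mapping, assuming each raw line is a JSON object holding the two configured fields; both helper names are hypothetical and the framework's data processor may do more than this:

```python
import json


def to_alpaca_record(raw, instruction, input_field="text", output_field="styled_text"):
    """Map one raw record onto instruction/input/output keys."""
    return {
        "instruction": instruction,
        "input": raw[input_field],
        "output": raw[output_field],
    }


def convert_lines(lines, instruction, **fields):
    """Convert an iterable of JSONL lines, skipping blank ones."""
    return [
        to_alpaca_record(json.loads(line), instruction, **fields)
        for line in lines
        if line.strip()
    ]


if __name__ == "__main__":
    sample = ['{"text": "Hey, what\'s up?", "styled_text": "Hello, how are you doing?"}']
    print(convert_lines(sample, "Rewrite the following text in a formal style"))
```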
Available Configuration Files
- configs/classification/emotion.yaml - Emotion classification with HuggingFace dataset
- configs/classification/custom.yaml - Custom dataset processing
- configs/styling/formal.yaml - Formal style transfer with LoRA fine-tuning
Usage Examples
Data Processing Examples
# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500
# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion
# 4. Run examples
python scripts/classification/data_processor.py examples
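The examples above imply a precedence order: values given on the command line win over values in the YAML file. One way that merge could work is sketched below. The `OVERRIDE_TARGETS` mapping from option names to YAML locations is hypothetical, as is the function name; the actual scripts may resolve overrides differently.

```python
import copy

# Hypothetical mapping from CLI option names to (section, key) in the YAML.
OVERRIDE_TARGETS = {
    "max_samples": ("data", "max_samples"),
    "num_epochs": ("training", "num_epochs"),
    "batch_size": ("training", "batch_size"),
}


def apply_overrides(config, overrides):
    """Return a copy of config with non-None CLI overrides applied on top."""
    merged = copy.deepcopy(config)  # Leave the loaded YAML dict untouched.
    for name, value in overrides.items():
        if value is None or name not in OVERRIDE_TARGETS:
            continue  # Option not supplied, or not a known override.
        section, key = OVERRIDE_TARGETS[name]
        merged.setdefault(section, {})[key] = value
    return merged


if __name__ == "__main__":
    base = {"data": {"source": "huggingface", "max_samples": 1000}}
    print(apply_overrides(base, {"max_samples": 500, "num_epochs": None}))
```

Treating `None` as "not supplied" matches how `argparse` reports options that have no default and were omitted on the command line.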
Training Examples
# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5
# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3
# 4. Run examples
python scripts/classification/trainer.py examples
Inference Examples
# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml
# 4. Run examples
python scripts/classification/inference.py examples
Styling Examples
# 1. Data Processing
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
# 2. Training
python scripts/styling/train.py example
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2
# 3. Inference
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
python scripts/styling/inference.py batch
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
# 4. Run examples
python scripts/styling/data_processor.py examples
python scripts/styling/train.py features
python scripts/styling/inference.py features
Troubleshooting Common Errors
1. ModuleNotFoundError: No module named 'utils'
Error:
ModuleNotFoundError: No module named 'utils'
Solution:
# Set Python path before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
2. Model Path Not Found
Error:
Model path not found: ./results/classification/emotion_model
Solution:
# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml
3. Data Directory Not Found
Error:
Data directory not found: ./data/processed/classification/emotion
Solution:
# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
4. YAML Configuration Errors
Error:
data_processor.py: error: --data-source is required (either in YAML config or CLI)
Solution: Check your YAML file structure. It should have:
data:
  source: "huggingface" # Not data_source
  dataset_name: "dair-ai/emotion"
5. HuggingFace Download Issues
Error:
KeyboardInterrupt during model download
Solution:
# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100
# Or set a local Hugging Face cache directory so completed downloads are reused
export HF_HOME=./cache
6. CUDA/GPU Issues
Error:
RuntimeError: CUDA out of memory
Solution:
# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8
# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
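The `device` setting accepts "auto", "cuda", or "cpu". A sketch of how "auto" might be resolved, with a clear error when CUDA is requested but unavailable; `resolve_device` is a hypothetical helper, and the `cuda_available` argument exists only so the logic can be exercised without PyTorch installed.

```python
def resolve_device(requested="auto", cuda_available=None):
    """Resolve "auto", "cuda", or "cpu" to a concrete device string."""
    if requested not in ("auto", "cuda", "cpu"):
        raise ValueError(f"unknown device: {requested!r}")
    if cuda_available is None:
        try:
            import torch  # Optional: only needed to probe for a GPU.

            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda" and not cuda_available:
        raise RuntimeError("CUDA requested but not available; try --device cpu")
    return requested


if __name__ == "__main__":
    print(resolve_device("auto"))
```

Falling back to CPU avoids out-of-memory errors on the GPU at the cost of much slower training, so reducing the batch size is usually worth trying first.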
Monitoring and Logs
Check Processing Status
# Check data processing output
ls -la ./data/processed/classification/emotion/classification/
# Check training output
ls -la ./results/classification/emotion_model/
# Check logs
tail -f logs/training.log
Expected File Structure After Processing
./data/processed/classification/emotion/classification/
├── train.jsonl # Training data
├── validation.jsonl # Validation data
└── test.jsonl # Test data
./results/classification/emotion_model/
├── config.json # Model configuration
├── pytorch_model.bin # Model weights
├── tokenizer.json # Tokenizer
└── label_info.json # Label mappings
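Listing the directory confirms the files exist, but not that they contain data. The sketch below checks for the three split files shown above and counts the records in each; the function name is hypothetical and it assumes one JSON object per non-blank line.

```python
import tempfile
from pathlib import Path

EXPECTED_SPLITS = ("train.jsonl", "validation.jsonl", "test.jsonl")


def count_split_records(data_dir, splits=EXPECTED_SPLITS):
    """Return {filename: record_count}; raise if a split file is missing."""
    counts = {}
    for name in splits:
        path = Path(data_dir) / name
        if not path.is_file():
            raise FileNotFoundError(f"missing split file: {path}")
        with path.open(encoding="utf-8") as handle:
            # Count non-blank lines; each should hold one JSON record.
            counts[name] = sum(1 for line in handle if line.strip())
    return counts


if __name__ == "__main__":
    # Self-contained demo on a throwaway directory with tiny fake splits.
    with tempfile.TemporaryDirectory() as tmp:
        for split in EXPECTED_SPLITS:
            (Path(tmp) / split).write_text('{"text": "hi", "label": 1}\n', encoding="utf-8")
        print(count_split_records(tmp))
```

Pointing it at `./data/processed/classification/emotion/classification/` after data processing should give counts that match the split sizes printed by the processor.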
Workflow Summary
Classification Task
- Setup: Install dependencies and set PYTHONPATH
- Data Processing: Process raw data into organized splits
- Training: Train model using processed data
- Inference: Use trained model for predictions
- Monitoring: Check logs and outputs for errors
Styling Task
- Setup: Install dependencies (including unsloth) and set PYTHONPATH
- Data Processing: Process style transfer data with instruction/input/output format
- Training: LoRA fine-tuning using Unsloth for efficient style transfer
- Inference: Style transfer with streaming and batch processing
- Monitoring: Check training logs and model outputs
Creating Custom Configurations
For New Datasets
- Copy existing config:
cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
- Modify parameters:
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/classification/my_dataset"
  # ... other parameters
training:
  data_dir: "./data/processed/classification/my_dataset"
  output_dir: "./results/classification/my_dataset_model"
- Run pipeline:
python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
For Custom Data
- Use custom config:
data:
  source: "custom"
  data_path: "./data/raw/my_data.jsonl"
  output_dir: "./data/processed/classification/my_custom_dataset"
- Run processing:
python scripts/classification/data_processor.py --config configs/classification/custom.yaml
Best Practices
- Always check the output directories before running the next step
- Use small datasets for testing before full runs
- Monitor logs for errors and warnings
- Backup configurations before major changes
- Use version control for YAML files
- Test with CLI overrides for quick experiments
Support
For issues and questions:
- Check the troubleshooting section above
- Review logs in the output directories
- Verify YAML configuration structure
- Test with smaller datasets first
Happy fine-tuning!