Fine-Tuning Task Framework
A comprehensive framework for fine-tuning Large Language Models (LLMs) across multiple task types including classification, completion, styling, and matching.
Table of Contents
- Overview
- Architecture
- Task Types
- Quick Start
- Configuration Guide
- Scripts & Commands
- Complete Workflows
- API Reference
- Troubleshooting
- Contributing
Overview
This framework provides a unified approach to fine-tuning LLMs for various NLP tasks. It's designed to be:
- Task-Agnostic: Same pipeline structure for different task types
- Configuration-Driven: YAML-based configuration for all parameters
- Developer-Friendly: Clear scripts and comprehensive logging
- Production-Ready: Built-in validation, error handling, and optimization
Architecture
The framework follows a modular pipeline architecture:
Raw Data  →  Data Processing  →  Model Training  →  Inference/Evaluation
   ↓               ↓                   ↓                    ↓
JSONL/CSV     HuggingFace           Trained             Ready for
  Files        Datasets             Models             Production
Core Components
- Data Processors: Convert raw data to training-ready formats
- Training Pipelines: Task-specific training with optimization
- Inference Engines: Production-ready text generation/classification
- Configuration Management: YAML-based parameter control
- Utility Scripts: Command-line interfaces for all operations
Task Types
1. Classification Task
Purpose: Text classification, sentiment analysis, topic categorization
Data Format:
{"text": "I love this product!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
Output: Classification probabilities and predicted labels
Use Cases: Sentiment analysis, spam detection, content moderation
2. Completion Task
Purpose: Text generation, story completion, code generation
Data Format:
{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}
Output: Generated text continuations
Use Cases: Creative writing, code completion, content generation
3. Styling Task
Purpose: Style transfer, tone modification, writing style adaptation
Data Format:
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
Output: Text rewritten in target style
Use Cases: Formalization, casualization, domain adaptation
4. Matching Task
Purpose: Semantic similarity, question-answer matching, paraphrase detection
Data Format:
{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}
Output: Similarity scores or binary classifications
Use Cases: Search relevance, duplicate detection, semantic matching
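The four data formats above differ only in which fields each record must carry, so a quick pre-flight check can catch malformed files before processing. The sketch below is illustrative and not part of the framework: `REQUIRED_FIELDS` and `find_invalid_records` are hypothetical names, and the field lists are copied from the examples in this section.

```python
import json
from typing import List, Tuple

# Required fields per task type, taken from the data formats listed above.
REQUIRED_FIELDS = {
    "classification": ("text", "label"),
    "completion": ("prompt", "completion"),
    "styling": ("text", "styled_text"),
    "matching": ("text1", "text2", "label"),
}


def find_invalid_records(jsonl_text: str, task_type: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for records that do not fit the task."""
    required = REQUIRED_FIELDS[task_type]
    problems: List[Tuple[int, str]] = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "record is not a JSON object"))
            continue
        missing = [name for name in required if name not in record]
        if missing:
            problems.append((line_number, "missing fields: " + ", ".join(missing)))
    return problems
```

An empty result means every non-blank line parsed as a JSON object with the expected fields; it says nothing about the quality of the values themselves.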
Quick Start
Prerequisites
# Install dependencies
pip install -r requirements.txt
# Verify installation
python -c "import torch, transformers, datasets; print('✅ All packages installed')"
Basic Workflow
# 1. Process data
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
# 2. Train model
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
# 3. Run inference
python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml
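Because the three steps share one path pattern, they can be generated from a task type and a config name. The helper below is a hypothetical convenience (`build_pipeline_commands` is not a framework function) that only assembles the command lines shown above; running them is left to the caller.

```python
import shlex
from typing import List


def build_pipeline_commands(task_type: str, config_name: str) -> List[List[str]]:
    """Assemble the process/train/infer commands for one task and config."""
    config_path = f"configs/{task_type}/{config_name}.yaml"
    scripts_dir = f"scripts/{task_type}"
    return [
        ["python", f"{scripts_dir}/data_processor.py", "--config", config_path],
        ["python", f"{scripts_dir}/train.py", "train", "--config", config_path],
        ["python", f"{scripts_dir}/inference.py", "infer", "--config", config_path],
    ]


if __name__ == "__main__":
    for command in build_pipeline_commands("classification", "sentiment"):
        # Print each stage; use subprocess.run(command, check=True) to execute it.
        print(shlex.join(command))
```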
Configuration Guide
YAML Structure
All configurations follow this hierarchical structure:
# Task Configuration
task:
  name: "task_type"                 # classification, completion, styling, matching
  type: "specific_type"             # e.g., "sentiment_analysis", "style_transfer"
# Data Configuration
data:
  source: "custom"                  # "custom" or "huggingface"
  data_path: "./data/raw/..."       # Path to raw data
  input_field: "text"               # Field name for input
  output_field: "label"             # Field name for output
  instruction: "Task instruction"   # For instruction-following tasks
# Model Configuration
model:
  name: "model_name"                # HuggingFace model identifier
  max_seq_length: 2048              # Maximum sequence length
  dtype: null                       # Data type (auto-detected)
  load_in_4bit: true                # 4-bit quantization
# Training Configuration
training:
  num_epochs: 3                     # Training epochs
  batch_size: 4                     # Batch size
  learning_rate: 2e-4               # Learning rate
  warmup_steps: 5                   # Warmup steps
  max_steps: 60                     # Maximum training steps
# Inference Configuration
inference:
  batch_size: 32                    # Inference batch size
  max_new_tokens: 128               # Max tokens to generate
  temperature: 0.8                  # Sampling temperature
Configuration Parameters
Data Processing Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `source` | string | "custom" | Data source type |
| `data_path` | string | required | Path to raw data file |
| `input_field` | string | "text" | Input field name |
| `output_field` | string | "label" | Output field name |
| `instruction` | string | task-specific | Task instruction |
| `data_format` | string | "jsonl" | Data file format |
| `max_length` | int | 256 | Maximum text length |
| `min_length` | int | 10 | Minimum text length |
| `clean_text` | boolean | true | Enable text cleaning |
| `lowercase` | boolean | false | Convert to lowercase |
Model Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | HuggingFace model name |
| `max_seq_length` | int | 2048 | Maximum sequence length |
| `dtype` | string | null | Data type (auto-detected) |
| `load_in_4bit` | boolean | true | Enable 4-bit quantization |
| `token` | string | null | HuggingFace access token |
Training Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_epochs` | int | 1 | Number of training epochs |
| `batch_size` | int | 2 | Training batch size |
| `learning_rate` | float | 2e-4 | Learning rate |
| `weight_decay` | float | 0.01 | Weight decay |
| `warmup_steps` | int | 5 | Warmup steps |
| `max_steps` | int | 60 | Maximum training steps |
| `gradient_accumulation_steps` | int | 4 | Gradient accumulation |
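To make the defaults concrete, here is a minimal sketch of how values omitted from the `training` section could fall back to the table above. It assumes the YAML file has already been parsed into a dictionary; `TRAINING_DEFAULTS` and `resolve_training_config` are hypothetical names, not the framework's own loader.

```python
from typing import Any, Dict

# Defaults copied from the Training Parameters table.
TRAINING_DEFAULTS: Dict[str, Any] = {
    "num_epochs": 1,
    "batch_size": 2,
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "warmup_steps": 5,
    "max_steps": 60,
    "gradient_accumulation_steps": 4,
}


def resolve_training_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay the user's `training` section on top of the documented defaults."""
    resolved = dict(TRAINING_DEFAULTS)
    # `or {}` covers both a missing section and an empty `training:` key (None).
    resolved.update(config.get("training") or {})
    return resolved
```

For example, a config that sets only `num_epochs: 3` would resolve to three epochs with every other value taken from the table.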
LoRA Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `lora_r` | int | 16 | LoRA rank |
| `lora_alpha` | int | 16 | LoRA alpha |
| `lora_dropout` | float | 0 | LoRA dropout |
| `target_modules` | list | ["q_proj", "k_proj", "v_proj", "o_proj"] | Target modules for LoRA |
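As a rough guide to what `lora_r` and `lora_alpha` control: in the standard LoRA formulation, each adapted linear layer gains two small matrices of shapes `r × d_in` and `d_out × r`, and their product is scaled by `lora_alpha / lora_r`. The sketch below only does this arithmetic. The function names are hypothetical, and the 4096 hidden size in the usage note is an illustrative assumption, not a property of any particular model configured here.

```python
def lora_scaling(lora_r: int, lora_alpha: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters added to one linear layer of shape (d_out, d_in)."""
    # Matrix A is (lora_r x d_in) and matrix B is (d_out x lora_r).
    return lora_r * (d_in + d_out)
```

With the defaults above (`lora_r: 16`, `lora_alpha: 16`) the scaling factor is 1.0, and a hypothetical 4096 × 4096 projection would gain 16 × (4096 + 4096) = 131,072 trainable parameters.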
Environment Variables
# HuggingFace token for gated models
export HF_TOKEN="hf_..."
# CUDA device selection
export CUDA_VISIBLE_DEVICES="0"
# Logging level
export LOG_LEVEL="INFO"
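A script might read these variables along the following lines. This is a sketch under assumptions: `read_environment` is a hypothetical helper, and treating an unset `LOG_LEVEL` as `INFO` mirrors the default shown for `--log-level` rather than confirmed framework behaviour.

```python
import logging
import os
from typing import Any, Dict, Mapping


def read_environment(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect the environment variables listed above, with fallbacks."""
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown LOG_LEVEL: {level_name!r}")
    return {
        "hf_token": env.get("HF_TOKEN"),  # None when no token is set
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        "log_level": level,
    }
```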
Scripts & Commands
Data Processing Scripts
Basic Usage
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
Advanced Options
python scripts/[task_type]/data_processor.py \
--config configs/[task_type]/[config].yaml \
--max-samples 1000 \
--log-level DEBUG \
--create-hf-dataset \
--hf-dataset-path ./datasets/[task_name]
Command Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `--config` | string | required | YAML configuration file |
| `--max-samples` | int | all | Maximum samples to process |
| `--log-level` | string | "INFO" | Logging level |
| `--create-hf-dataset` | flag | false | Create HuggingFace dataset |
| `--hf-dataset-path` | string | auto | HuggingFace dataset path |
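For readers adding a new script, the argument table above maps naturally onto `argparse`. The parser below is a sketch that reproduces the documented flags and defaults; it is not copied from `data_processor.py`, and using `None` to stand for "all" and "auto" is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the data processing arguments documented above."""
    parser = argparse.ArgumentParser(description="Process raw data for fine-tuning")
    parser.add_argument("--config", required=True, help="YAML configuration file")
    # None is used here to mean "process all samples" (assumption).
    parser.add_argument("--max-samples", type=int, default=None,
                        help="Maximum samples to process")
    parser.add_argument("--log-level", default="INFO", help="Logging level")
    parser.add_argument("--create-hf-dataset", action="store_true",
                        help="Create HuggingFace dataset")
    # None is used here to mean "choose the path automatically" (assumption).
    parser.add_argument("--hf-dataset-path", default=None,
                        help="HuggingFace dataset path")
    return parser
```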
Training Scripts
Basic Usage
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
Advanced Options
python scripts/[task_type]/train.py train \
--config configs/[task_type]/[config].yaml \
--epochs 5 \
--batch-size 8 \
--learning-rate 1e-4 \
--max-steps 100
Command Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `--config` | string | required | YAML configuration file |
| `--epochs` | int | YAML value | Override training epochs |
| `--batch-size` | int | YAML value | Override batch size |
| `--learning-rate` | float | YAML value | Override learning rate |
| `--max-steps` | int | YAML value | Override max steps |
| `--output-dir` | string | YAML value | Override output directory |
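The "YAML value" defaults mean a flag only takes effect when it is actually passed. One way to express that precedence is sketched below; `apply_cli_overrides` and `CLI_TO_YAML` are hypothetical names, and mapping `--output-dir` to the `model_output_dir` key is an assumption based on the styling example config.

```python
from typing import Any, Dict, Optional

# argparse attribute name -> key in the YAML `training` section.
# The output_dir -> model_output_dir pairing is an assumption.
CLI_TO_YAML = {
    "epochs": "num_epochs",
    "batch_size": "batch_size",
    "learning_rate": "learning_rate",
    "max_steps": "max_steps",
    "output_dir": "model_output_dir",
}


def apply_cli_overrides(training: Dict[str, Any],
                        cli_values: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Return a copy of the training config with any provided CLI values applied."""
    merged = dict(training)
    for cli_name, yaml_key in CLI_TO_YAML.items():
        value = cli_values.get(cli_name)
        if value is not None:  # flags left unset keep the YAML value
            merged[yaml_key] = value
    return merged
```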
Inference Scripts
Basic Usage
python scripts/[task_type]/inference.py infer \
--config configs/[task_type]/[config].yaml \
--input-text "Your input text here"
Advanced Options
python scripts/[task_type]/inference.py infer \
--config configs/[task_type]/[config].yaml \
--input-text "Your input text here" \
--max-tokens 256 \
--temperature 0.7 \
--stream
Command Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `--config` | string | required | YAML configuration file |
| `--input-text` | string | required | Text to process |
| `--max-tokens` | int | 128 | Maximum tokens to generate |
| `--temperature` | float | 0.8 | Sampling temperature |
| `--stream` | flag | false | Enable streaming generation |
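To illustrate what `--temperature` does in general: sampling temperature divides the model's logits before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The function below is a standalone sketch of that idea in plain Python; it is not the framework's generation code.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]
```

Comparing `softmax_with_temperature([2.0, 1.0, 0.0], 0.5)` with the same call at `1.0` shows the top token's probability rising as the temperature falls.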
Batch Processing
# Process multiple inputs from file
python scripts/[task_type]/inference.py batch \
--config configs/[task_type]/[config].yaml \
--input-file input.txt \
--output-file output.txt
Interactive Mode
# Enter interactive mode for testing
python scripts/[task_type]/inference.py interactive \
--config configs/[task_type]/[config].yaml
Complete Workflows
Classification Task Workflow
1. Data Preparation
# data/raw/classification/sentiment.jsonl
{"text": "I love this movie!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
{"text": "It's okay", "label": "neutral"}
2. Configuration
# configs/classification/sentiment.yaml
task:
  name: "classification"
  type: "sentiment_analysis"
data:
  source: "custom"
  data_path: "./data/raw/classification/sentiment.jsonl"
  input_field: "text"
  output_field: "label"
  instruction: "Classify the sentiment of the following text"
model:
  name: "microsoft/DialoGPT-medium"
  max_seq_length: 512
training:
  num_epochs: 3
  batch_size: 8
  learning_rate: 3e-5
3. Execute Pipeline
# Process data
python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml
# Train model
python scripts/classification/train.py train --config configs/classification/sentiment.yaml
# Run inference
python scripts/classification/inference.py infer \
--config configs/classification/sentiment.yaml \
--input-text "This product exceeded my expectations!"
Styling Task Workflow
1. Data Preparation
# data/raw/styling/formal.jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
{"text": "This is cool", "styled_text": "This is quite impressive"}
2. Configuration
# configs/styling/formal.yaml
task:
  name: "styling"
  type: "style_transfer"
data:
  source: "custom"
  data_path: "./data/raw/styling/formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"
model:
  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  max_seq_length: 2048
training:
  num_epochs: 3
  batch_size: 4
  learning_rate: 2e-4
  model_output_dir: "./models/styling"
3. Execute Pipeline
# Process data
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
# Train model
python scripts/styling/train.py train --config configs/styling/formal.yaml
# Run inference
python scripts/styling/inference.py infer \
--config configs/styling/formal.yaml \
--instruction "Rewrite in formal style" \
--input-text "Hey there! What's up?"
Completion Task Workflow
1. Data Preparation
# data/raw/completion/story.jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."}
{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."}
2. Configuration
# configs/completion/story.yaml
task:
  name: "completion"
  type: "story_generation"
data:
  source: "custom"
  data_path: "./data/raw/completion/story.jsonl"
  input_field: "prompt"
  output_field: "completion"
model:
  name: "gpt2-medium"
  max_seq_length: 1024
training:
  num_epochs: 2
  batch_size: 16
  learning_rate: 5e-5
3. Execute Pipeline
# Process data
python scripts/completion/data_processor.py --config configs/completion/story.yaml
# Train model
python scripts/completion/train.py train --config configs/completion/story.yaml
# Run inference
python scripts/completion/inference.py infer \
--config configs/completion/story.yaml \
--input-text "The wizard cast a spell"
API Reference
Data Processing Classes
BaseDataProcessor
from typing import Any, Dict, List, Tuple

# Signature stubs only; method bodies are omitted.
class BaseDataProcessor:
    def __init__(self, config: Dict[str, Any]): ...
    def load_and_preprocess(self) -> Tuple[Dict, Dict]: ...
    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]: ...
    def save_data(self, data: Dict, output_path: str): ...
ClassificationDataProcessor
class ClassificationDataProcessor(BaseDataProcessor):
    def convert_to_classification_format(self, data: Dict) -> Dict: ...
    def create_label_mapping(self, labels: List[str]) -> Dict[str, int]: ...
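The signature of `create_label_mapping` suggests a mapping from label strings to integer class ids. A minimal standalone sketch of one reasonable behaviour follows; assigning ids in sorted order is an assumption made so the mapping is reproducible across runs, and the framework's actual ordering may differ.

```python
from typing import Dict, List


def create_label_mapping(labels: List[str]) -> Dict[str, int]:
    """Map each distinct label to an integer id, in sorted label order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}
```

Applied to the sentiment example, `["positive", "negative", "neutral", "positive"]` yields `{"negative": 0, "neutral": 1, "positive": 2}`.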
StylingDataProcessor
class StylingDataProcessor(BaseDataProcessor):
    def convert_to_alpaca_format(self, data: Dict) -> Dict: ...
    def format_for_training(self, data: Dict) -> Dict: ...
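For context on `convert_to_alpaca_format`: the Alpaca convention stores each example as `instruction`, `input`, and `output` fields and renders them into a fixed prompt template. The sketch below shows that convention for a single styling record. The helper names are hypothetical, and whether the framework uses this exact template text is an assumption.

```python
from typing import Dict

# The widely used Alpaca prompt template for examples that include an input.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def to_alpaca_record(record: Dict[str, str], instruction: str,
                     input_field: str = "text",
                     output_field: str = "styled_text") -> Dict[str, str]:
    """Rename a raw record's fields to the instruction/input/output layout."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


def render_alpaca_prompt(alpaca_record: Dict[str, str]) -> str:
    """Fill the template with one Alpaca-style record."""
    return ALPACA_TEMPLATE.format(**alpaca_record)
```

The `input_field` and `output_field` arguments correspond to the keys of the same name in the YAML `data` section.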
Training Classes
BaseTrainer
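from typing import Any, Dict
from datasets import Dataset

# Signature stubs only; method bodies are omitted.
class BaseTrainer:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def setup_training(self, dataset: Dataset): ...
    def train(self, dataset_path: str) -> Dict: ...
    def save_model(self): ...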
ClassificationTrainer
class ClassificationTrainer(BaseTrainer):
    def setup_classification_head(self): ...
    def compute_metrics(self, eval_pred) -> Dict: ...
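A `compute_metrics` hook of this shape typically receives a `(logits, labels)` pair and returns a dictionary of named scores. The sketch below computes accuracy using plain Python lists so that it runs without extra dependencies. A real implementation would more likely operate on NumPy arrays, and the choice of accuracy as the only metric is an assumption.

```python
from typing import Dict, List, Sequence, Tuple


def compute_metrics(eval_pred: Tuple[Sequence[Sequence[float]], Sequence[int]]) -> Dict[str, float]:
    """Compute accuracy from per-class logits and integer labels."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions: List[int] = [
        max(range(len(row)), key=row.__getitem__) for row in logits
    ]
    correct = sum(1 for predicted, actual in zip(predictions, labels) if predicted == actual)
    return {"accuracy": correct / len(labels) if len(labels) else 0.0}
```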
StylingTrainer
class StylingTrainer(BaseTrainer):
    def setup_lora(self): ...
    def format_dataset(self, dataset: Dataset) -> Dataset: ...
Inference Classes
BaseInference
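from typing import Any, Dict
import torch

# Signature stubs only; method bodies are omitted.
class BaseInference:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def preprocess_input(self, input_text: str) -> torch.Tensor: ...
    def postprocess_output(self, output: torch.Tensor) -> str: ...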
ClassificationInference
class ClassificationInference(BaseInference):
    def classify(self, text: str) -> Dict[str, float]: ...
    def batch_classify(self, texts: List[str]) -> List[Dict]: ...
StylingInference
class StylingInference(BaseInference):
    def style_transfer(self, text: str, instruction: str) -> str: ...
    def generate_text(self, instruction: str, input_text: str) -> str: ...
Troubleshooting
Common Issues
1. Model Loading Errors
Error: FileNotFoundError: ./models/[task_name]/*.json
Solution:
- Verify model was trained successfully
- Check `model_output_dir` in YAML config
- Ensure model files exist in specified directory
2. Memory Issues
Error: CUDA out of memory
Solution:
- Reduce `batch_size` in YAML config
- Enable `load_in_4bit: true`
- Use gradient accumulation
- Reduce `max_seq_length`
3. Data Format Errors
Error: KeyError: 'input_field'
Solution:
- Verify field names in JSONL/CSV files
- Check `input_field` and `output_field` in YAML
- Ensure data format matches expected structure
4. Training Convergence Issues
Symptoms: Loss not decreasing, poor model performance
Solution:
- Adjust learning rate (try 1e-5 to 5e-4)
- Increase training epochs
- Check data quality and quantity
- Verify label distribution (for classification)
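Checking the label distribution can be done in a few lines before training. The helpers below are a hypothetical sketch; what ratio counts as "too imbalanced" depends on the task and is left to the reader.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels: List[str]) -> float:
    """Ratio of the most frequent label's count to the least frequent one's."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

A ratio close to 1 indicates balanced classes; a large ratio suggests that the rare classes may need more data or reweighting.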
Debug Mode
Enable detailed logging:
export LOG_LEVEL="DEBUG"
python scripts/[task_type]/[script].py --config configs/[task_type]/[config].yaml --log-level DEBUG
Performance Optimization
Memory Optimization
model:
  load_in_4bit: true               # 4-bit quantization
  dtype: "float16"                 # Use float16 if supported
training:
  gradient_accumulation_steps: 4   # Effective batch size = batch_size * gradient_accumulation_steps
  max_grad_norm: 1.0               # Gradient clipping
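The effective batch size mentioned in the comment above is simple arithmetic, worked through below. The function names are hypothetical, the `num_devices` factor is an addition for multi-GPU runs, and `examples_seen` assumes that `max_steps` counts optimizer updates rather than micro-batches.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return batch_size * gradient_accumulation_steps * num_devices


def examples_seen(max_steps: int, batch_size: int,
                  gradient_accumulation_steps: int, num_devices: int = 1) -> int:
    """Total examples consumed after max_steps optimizer updates (assumption above)."""
    return max_steps * effective_batch_size(
        batch_size, gradient_accumulation_steps, num_devices
    )
```

With the documented defaults (`batch_size: 2`, `gradient_accumulation_steps: 4`, `max_steps: 60`) on one device, each update covers 8 examples and a full run consumes 480.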
Speed Optimization
training:
  dataloader_num_workers: 4   # Parallel data loading
  fp16: true                  # Mixed precision training
  bf16: false                 # Disable bfloat16 if not supported
Contributing
Adding New Task Types
- Create task directory structure:
pipelines/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py
scripts/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py
configs/[new_task]/
└── example.yaml
- Implement base classes:
  - Extend `BaseDataProcessor`
  - Extend `BaseTrainer`
  - Extend `BaseInference`
- Add configuration templates:
  - Define task-specific parameters
  - Document all configuration options
- Update documentation:
  - Add task description to README
  - Include usage examples
  - Document configuration parameters
Code Style
- Follow PEP 8 guidelines
- Use type hints for all functions
- Include comprehensive docstrings
- Add unit tests for new functionality
Testing
# Run all tests
python -m pytest tests/
# Run specific task tests
python -m pytest tests/[task_type]/
# Run with coverage
python -m pytest --cov=pipelines tests/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Wiki
Happy fine-tuning! 🚀