2025-08-13 23:59:28 +00:00

Fine-Tuning Task Framework

A comprehensive framework for fine-tuning Large Language Models (LLMs) across multiple task types including classification, completion, styling, and matching.

Overview
Architecture
Task Types
Quick Start
Configuration Guide
Scripts & Commands
Complete Workflows
API Reference
Troubleshooting
Contributing

Overview

This framework provides a unified approach to fine-tuning LLMs for various NLP tasks. It's designed to be:

Task-Agnostic: Same pipeline structure for different task types
Configuration-Driven: YAML-based configuration for all parameters
Developer-Friendly: Clear scripts and comprehensive logging
Production-Ready: Built-in validation, error handling, and optimization

Architecture

The framework follows a modular pipeline architecture:

Raw Data → Data Processing → Model Training → Inference/Evaluation
    ↓              ↓              ↓              ↓
  JSONL/CSV    HuggingFace    Trained      Ready for
  Files        Datasets       Models       Production

Core Components

Data Processors: Convert raw data to training-ready formats
Training Pipelines: Task-specific training with optimization
Inference Engines: Production-ready text generation/classification
Configuration Management: YAML-based parameter control
Utility Scripts: Command-line interfaces for all operations

Task Types

1. Classification Task

Purpose: Text classification, sentiment analysis, topic categorization

Data Format:

{"text": "I love this product!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}

Output: Classification probabilities and predicted labels

Use Cases: Sentiment analysis, spam detection, content moderation
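As a rough illustration of the data format above, the sketch below parses JSONL records and derives an integer id for each label. The helper names `parse_jsonl` and `build_label_mapping` are hypothetical and not part of the framework, and assigning ids in sorted label order is an assumption.

```python
import json
from typing import Dict, List


def parse_jsonl(raw: str) -> List[Dict[str, str]]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def build_label_mapping(records: List[Dict[str, str]]) -> Dict[str, int]:
    """Give each distinct label an integer id (sorted order is an assumption)."""
    labels = sorted({record["label"] for record in records})
    return {label: index for index, label in enumerate(labels)}


raw = (
    '{"text": "I love this product!", "label": "positive"}\n'
    '{"text": "This is terrible", "label": "negative"}\n'
)
records = parse_jsonl(raw)
label_to_id = build_label_mapping(records)
print(label_to_id)  # {'negative': 0, 'positive': 1}
```

A fixed ordering such as the sorted one used here keeps label ids stable between training and inference runs.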

2. Completion Task

Purpose: Text generation, story completion, code generation

Data Format:

{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}

Output: Generated text continuations

Use Cases: Creative writing, code completion, content generation

3. Styling Task

Purpose: Style transfer, tone modification, writing style adaptation

Data Format:

{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}

Output: Text rewritten in target style

Use Cases: Formalization, casualization, domain adaptation
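Styling records are typically turned into instruction-style prompts before training. The sketch below uses the widely known Alpaca prompt template; whether the framework uses this exact wording is an assumption, and `to_alpaca_prompt` is a hypothetical helper name.

```python
from typing import Dict

# The commonly used Alpaca template (with an input section).
# Assumption: the framework's own template may differ.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)


def to_alpaca_prompt(record: Dict[str, str], instruction: str) -> str:
    """Fill the template from one styling record."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction,
        input=record["text"],
        response=record["styled_text"],
    )


example = {"text": "I'm gonna go", "styled_text": "I will be going"}
prompt = to_alpaca_prompt(example, "Rewrite the following text in a formal style")
print(prompt)
```

At inference time the same template would be filled with an empty response so the model generates the styled text.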

4. Matching Task

Purpose: Semantic similarity, question-answer matching, paraphrase detection

Data Format:

{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}

Output: Similarity scores or binary classifications

Use Cases: Search relevance, duplicate detection, semantic matching
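This section does not say how similarity scores are produced. One common approach, shown here purely as an illustration, is cosine similarity between two embedding vectors, thresholded into the `similar` / `different` labels used above. The vectors and the 0.5 threshold are made-up values, and both function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def to_label(score: float, threshold: float = 0.5) -> str:
    """Turn a similarity score into a binary label (threshold is illustrative)."""
    return "similar" if score >= threshold else "different"


# Toy embeddings, not real model output.
emb_question = [1.0, 0.0, 1.0]
emb_answer = [1.0, 0.0, 0.5]
emb_unrelated = [0.0, 1.0, 0.0]

print(to_label(cosine_similarity(emb_question, emb_answer)))     # similar
print(to_label(cosine_similarity(emb_question, emb_unrelated)))  # different
```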

Quick Start

Prerequisites

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch, transformers, datasets; print('✅ All packages installed')"

Basic Workflow

# 1. Process data
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml

# 2. Train model
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml

# 3. Run inference
python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml

Configuration Guide

YAML Structure

All configurations follow this hierarchical structure:

# Task Configuration
task:
  name: "task_type"                    # classification, completion, styling, matching
  type: "specific_type"                # e.g., "sentiment_analysis", "style_transfer"

# Data Configuration
data:
  source: "custom"                     # "custom" or "huggingface"
  data_path: "./data/raw/..."          # Path to raw data
  input_field: "text"                  # Field name for input
  output_field: "label"                # Field name for output
  instruction: "Task instruction"      # For instruction-following tasks

# Model Configuration
model:
  name: "model_name"                   # HuggingFace model identifier
  max_seq_length: 2048                 # Maximum sequence length
  dtype: null                          # Data type (auto-detected)
  load_in_4bit: true                   # 4-bit quantization

# Training Configuration
training:
  num_epochs: 3                        # Training epochs
  batch_size: 4                        # Batch size
  learning_rate: 2e-4                  # Learning rate
  warmup_steps: 5                      # Warmup steps
  max_steps: 60                        # Maximum training steps

# Inference Configuration
inference:
  batch_size: 32                       # Inference batch size
  max_new_tokens: 128                  # Max tokens to generate
  temperature: 0.8                     # Sampling temperature

Configuration Parameters

Data Processing Parameters

Parameter	Type	Default	Description
`source`	string	"custom"	Data source type
`data_path`	string	required	Path to raw data file
`input_field`	string	"text"	Input field name
`output_field`	string	"label"	Output field name
`instruction`	string	task-specific	Task instruction
`data_format`	string	"jsonl"	Data file format
`max_length`	int	256	Maximum text length
`min_length`	int	10	Minimum text length
`clean_text`	boolean	true	Enable text cleaning
`lowercase`	boolean	false	Convert to lowercase
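To make the interaction of these parameters concrete, here is a minimal sketch of a per-sample preprocessing step. It rests on several assumptions the table does not confirm: lengths are measured in characters, "cleaning" means whitespace normalization, too-short samples are dropped, and too-long samples are truncated. `preprocess_text` is a hypothetical name.

```python
import re
from typing import Optional


def preprocess_text(
    text: str,
    *,
    clean_text: bool = True,
    lowercase: bool = False,
    min_length: int = 10,
    max_length: int = 256,
) -> Optional[str]:
    """Return the processed text, or None if the sample should be dropped."""
    if clean_text:
        # Assumed meaning of "cleaning": collapse whitespace runs and trim.
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if len(text) < min_length:
        return None  # assumed: too-short samples are discarded
    return text[:max_length]  # assumed: too-long samples are truncated


kept = preprocess_text("  This   product exceeded my expectations!  ")
dropped = preprocess_text("Too short")
print(kept)     # This product exceeded my expectations!
print(dropped)  # None
```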

Model Parameters

Parameter	Type	Default	Description
`name`	string	required	HuggingFace model name
`max_seq_length`	int	2048	Maximum sequence length
`dtype`	string	null	Data type (auto-detected)
`load_in_4bit`	boolean	true	Enable 4-bit quantization
`token`	string	null	HuggingFace access token

Training Parameters

Parameter	Type	Default	Description
`num_epochs`	int	1	Number of training epochs
`batch_size`	int	2	Training batch size
`learning_rate`	float	2e-4	Learning rate
`weight_decay`	float	0.01	Weight decay
`warmup_steps`	int	5	Warmup steps
`max_steps`	int	60	Maximum training steps
`gradient_accumulation_steps`	int	4	Gradient accumulation
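The defaults above interact: with gradient accumulation, each optimizer update sees `batch_size * gradient_accumulation_steps` examples. The sketch below works through the arithmetic for the default values, assuming a single device and assuming `max_steps` counts optimizer updates (as it does in the HuggingFace `Trainer`; whether this framework follows that convention is not stated).

```python
def effective_batch_size(
    batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1
) -> int:
    """Examples contributing to one optimizer update."""
    return batch_size * gradient_accumulation_steps * num_devices


# Default values from the table above.
per_update = effective_batch_size(batch_size=2, gradient_accumulation_steps=4)
examples_seen = per_update * 60  # max_steps = 60

print(per_update, examples_seen)  # 8 480
```

Under these assumptions a default run processes 480 examples, so on larger datasets `max_steps` may end training well before `num_epochs` is reached.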

LoRA Parameters

Parameter	Type	Default	Description
`lora_r`	int	16	LoRA rank
`lora_alpha`	int	16	LoRA alpha
`lora_dropout`	float	0	LoRA dropout
`target_modules`	list	["q_proj", "k_proj", "v_proj", "o_proj"]	Target modules for LoRA
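For intuition about what these values cost, recall that LoRA adds two low-rank matrices to each adapted weight: `A` with shape `(r, d_in)` and `B` with shape `(d_out, r)`, with the update scaled by `lora_alpha / lora_r`. The sketch below counts the added parameters for the default settings. The 4096 x 4096 projection shape is a hypothetical example, not a property of any specific model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * d_in + d_out * rank


lora_r, lora_alpha = 16, 16
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Hypothetical layer shape: every target projection is 4096 x 4096.
hidden_size = 4096
per_module = lora_param_count(hidden_size, hidden_size, lora_r)
per_layer = per_module * len(target_modules)
scaling = lora_alpha / lora_r  # factor applied to the low-rank update

print(per_module, per_layer, scaling)  # 131072 524288 1.0
```

Doubling `lora_r` doubles the adapter size; with `lora_alpha` left unchanged it also halves the scaling factor.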

Environment Variables

# HuggingFace token for gated models
export HF_TOKEN="hf_..."

# CUDA device selection
export CUDA_VISIBLE_DEVICES="0"

# Logging level
export LOG_LEVEL="INFO"
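How the scripts consume these variables is not shown here. A minimal sketch of reading them from Python, with `INFO` as the fallback log level, might look like the following. `read_environment` is a hypothetical helper, not a framework function.

```python
import logging
import os
from typing import Dict, Optional


def read_environment() -> Dict[str, Optional[str]]:
    """Collect the three variables above; unset variables come back as None."""
    return {
        "hf_token": os.environ.get("HF_TOKEN"),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO").upper(),
    }


settings = read_environment()

# Fall back to INFO if LOG_LEVEL is not a standard logging level name.
level = getattr(logging, settings["log_level"], None)
logging.basicConfig(level=level if isinstance(level, int) else logging.INFO)
logging.getLogger(__name__).info("CUDA devices: %s", settings["cuda_visible_devices"])
```

Note that `CUDA_VISIBLE_DEVICES` only takes effect if it is set before CUDA is initialized, so it is best exported in the shell as shown above.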

Scripts & Commands

Data Processing Scripts

Basic Usage

python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml

Advanced Options

python scripts/[task_type]/data_processor.py \
  --config configs/[task_type]/[config].yaml \
  --max-samples 1000 \
  --log-level DEBUG \
  --create-hf-dataset \
  --hf-dataset-path ./datasets/[task_name]

Command Line Arguments

Argument	Type	Default	Description
`--config`	string	required	YAML configuration file
`--max-samples`	int	all	Maximum samples to process
`--log-level`	string	"INFO"	Logging level
`--create-hf-dataset`	flag	false	Create HuggingFace dataset
`--hf-dataset-path`	string	auto	HuggingFace dataset path

Training Scripts

Basic Usage

python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml

Advanced Options

python scripts/[task_type]/train.py train \
  --config configs/[task_type]/[config].yaml \
  --epochs 5 \
  --batch-size 8 \
  --learning-rate 1e-4 \
  --max-steps 100

Command Line Arguments

Argument	Type	Default	Description
`--config`	string	required	YAML configuration file
`--epochs`	int	YAML value	Override training epochs
`--batch-size`	int	YAML value	Override batch size
`--learning-rate`	float	YAML value	Override learning rate
`--max-steps`	int	YAML value	Override max steps
`--output-dir`	string	YAML value	Override output directory
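The "YAML value" defaults mean that a flag only takes effect when it is passed. One way such override logic could be written with `argparse` is sketched below. The mapping from flag names to YAML keys (for example `--epochs` to `num_epochs`, `--output-dir` to `model_output_dir`) is an assumption, and the function names are hypothetical.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the `train` subcommand and flags in the table above."""
    parser = argparse.ArgumentParser(prog="train.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    train = subparsers.add_parser("train")
    train.add_argument("--config", required=True)
    train.add_argument("--epochs", type=int)
    train.add_argument("--batch-size", type=int)
    train.add_argument("--learning-rate", type=float)
    train.add_argument("--max-steps", type=int)
    train.add_argument("--output-dir")
    return parser


# Assumed mapping from argparse attribute names to keys in the training section.
OVERRIDES = {
    "epochs": "num_epochs",
    "batch_size": "batch_size",
    "learning_rate": "learning_rate",
    "max_steps": "max_steps",
    "output_dir": "model_output_dir",
}


def apply_overrides(training: Dict[str, Any], args: argparse.Namespace) -> Dict[str, Any]:
    """Return a copy of the YAML training section with CLI values applied."""
    merged = dict(training)
    for arg_name, key in OVERRIDES.items():
        value = getattr(args, arg_name)
        if value is not None:  # flag omitted: keep the YAML value
            merged[key] = value
    return merged


yaml_training = {"num_epochs": 3, "batch_size": 4, "learning_rate": 2e-4, "max_steps": 60}
args = build_parser().parse_args(
    ["train", "--config", "configs/styling/formal.yaml", "--epochs", "5", "--learning-rate", "1e-4"]
)
merged = apply_overrides(yaml_training, args)
print(merged)
```

Here `num_epochs` and `learning_rate` come from the command line, while `batch_size` and `max_steps` keep their YAML values.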

Inference Scripts

Basic Usage

python scripts/[task_type]/inference.py infer \
  --config configs/[task_type]/[config].yaml \
  --input-text "Your input text here"

Advanced Options

python scripts/[task_type]/inference.py infer \
  --config configs/[task_type]/[config].yaml \
  --input-text "Your input text here" \
  --max-tokens 256 \
  --temperature 0.7 \
  --stream

Command Line Arguments

Argument	Type	Default	Description
`--config`	string	required	YAML configuration file
`--input-text`	string	required	Text to process
`--max-tokens`	int	128	Maximum tokens to generate
`--temperature`	float	0.8	Sampling temperature
`--stream`	flag	false	Enable streaming generation
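To show what `--temperature` controls, the sketch below applies temperature scaling to a toy set of logits before the softmax. Values below 1.0 sharpen the distribution and values above 1.0 flatten it. The logits are illustrative, and the function is a standalone example rather than the framework's sampling code.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy next-token scores
for temperature in (0.7, 0.8, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With the default of 0.8 the most likely token receives more probability mass than at 1.0, so generations are somewhat more conservative.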

Batch Processing

# Process multiple inputs from file
python scripts/[task_type]/inference.py batch \
  --config configs/[task_type]/[config].yaml \
  --input-file input.txt \
  --output-file output.txt

Interactive Mode

# Enter interactive mode for testing
python scripts/[task_type]/inference.py interactive \
  --config configs/[task_type]/[config].yaml

Complete Workflows

Classification Task Workflow

1. Data Preparation

# data/raw/classification/sentiment.jsonl
{"text": "I love this movie!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
{"text": "It's okay", "label": "neutral"}

2. Configuration

# configs/classification/sentiment.yaml
task:
  name: "classification"
  type: "sentiment_analysis"

data:
  source: "custom"
  data_path: "./data/raw/classification/sentiment.jsonl"
  input_field: "text"
  output_field: "label"
  instruction: "Classify the sentiment of the following text"

model:
  name: "microsoft/DialoGPT-medium"
  max_seq_length: 512

training:
  num_epochs: 3
  batch_size: 8
  learning_rate: 3e-5

3. Execute Pipeline

# Process data
python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml

# Train model
python scripts/classification/train.py train --config configs/classification/sentiment.yaml

# Run inference
python scripts/classification/inference.py infer \
  --config configs/classification/sentiment.yaml \
  --input-text "This product exceeded my expectations!"

Styling Task Workflow

1. Data Preparation

# data/raw/styling/formal.jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
{"text": "This is cool", "styled_text": "This is quite impressive"}

2. Configuration

# configs/styling/formal.yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data/raw/styling/formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"

model:
  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  num_epochs: 3
  batch_size: 4
  learning_rate: 2e-4
  model_output_dir: "./models/styling"

3. Execute Pipeline

# Process data
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Train model
python scripts/styling/train.py train --config configs/styling/formal.yaml

# Run inference
python scripts/styling/inference.py infer \
  --config configs/styling/formal.yaml \
  --instruction "Rewrite in formal style" \
  --input-text "Hey there! What's up?"

Completion Task Workflow

1. Data Preparation

# data/raw/completion/story.jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."}
{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."}

2. Configuration

# configs/completion/story.yaml
task:
  name: "completion"
  type: "story_generation"

data:
  source: "custom"
  data_path: "./data/raw/completion/story.jsonl"
  input_field: "prompt"
  output_field: "completion"

model:
  name: "gpt2-medium"
  max_seq_length: 1024

training:
  num_epochs: 2
  batch_size: 16
  learning_rate: 5e-5

3. Execute Pipeline

# Process data
python scripts/completion/data_processor.py --config configs/completion/story.yaml

# Train model
python scripts/completion/train.py train --config configs/completion/story.yaml

# Run inference
python scripts/completion/inference.py infer \
  --config configs/completion/story.yaml \
  --input-text "The wizard cast a spell"

API Reference

Data Processing Classes

BaseDataProcessor

# Signature stubs; method bodies are omitted.
from typing import Any, Dict, List, Tuple

class BaseDataProcessor:
    def __init__(self, config: Dict[str, Any]): ...
    def load_and_preprocess(self) -> Tuple[Dict, Dict]: ...
    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]: ...
    def save_data(self, data: Dict, output_path: str): ...

ClassificationDataProcessor

class ClassificationDataProcessor(BaseDataProcessor):
    def convert_to_classification_format(self, data: Dict) -> Dict: ...
    def create_label_mapping(self, labels: List[str]) -> Dict[str, int]: ...

StylingDataProcessor

class StylingDataProcessor(BaseDataProcessor):
    def convert_to_alpaca_format(self, data: Dict) -> Dict: ...
    def format_for_training(self, data: Dict) -> Dict: ...

Training Classes

BaseTrainer

# Signature stubs; method bodies are omitted.
from typing import Any, Dict
from datasets import Dataset

class BaseTrainer:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def setup_training(self, dataset: Dataset): ...
    def train(self, dataset_path: str) -> Dict: ...
    def save_model(self): ...

ClassificationTrainer

class ClassificationTrainer(BaseTrainer):
    def setup_classification_head(self): ...
    def compute_metrics(self, eval_pred) -> Dict: ...

StylingTrainer

class StylingTrainer(BaseTrainer):
    def setup_lora(self): ...
    def format_dataset(self, dataset: Dataset) -> Dataset: ...

Inference Classes

BaseInference

# Signature stubs; method bodies are omitted.
from typing import Any, Dict, List
import torch

class BaseInference:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def preprocess_input(self, input_text: str) -> torch.Tensor: ...
    def postprocess_output(self, output: torch.Tensor) -> str: ...

ClassificationInference

class ClassificationInference(BaseInference):
    def classify(self, text: str) -> Dict[str, float]: ...
    def batch_classify(self, texts: List[str]) -> List[Dict]: ...

StylingInference

class StylingInference(BaseInference):
    def style_transfer(self, text: str, instruction: str) -> str: ...
    def generate_text(self, instruction: str, input_text: str) -> str: ...

Troubleshooting

Common Issues

1. Model Loading Errors

Error: FileNotFoundError: ./models/[task_name]/*.json

Solution:

Verify model was trained successfully
Check model_output_dir in YAML config
Ensure model files exist in specified directory

2. Memory Issues

Error: CUDA out of memory

Solution:

Reduce batch_size in YAML config
Enable load_in_4bit: true
Use gradient accumulation
Reduce max_seq_length

3. Data Format Errors

Error: KeyError: 'input_field'

Solution:

Verify field names in JSONL/CSV files
Check input_field and output_field in YAML
Ensure data format matches expected structure

4. Training Convergence Issues

Symptoms: Loss not decreasing, poor model performance

Solution:

Adjust learning rate (try 1e-5 to 5e-4)
Increase training epochs
Check data quality and quantity
Verify label distribution (for classification)
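For the last check, a quick way to inspect label balance in a classification dataset is to count labels directly. This is a standalone sketch using only the standard library; `label_distribution` is a hypothetical helper and the records are toy data.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of records carrying each label."""
    counts = Counter(record["label"] for record in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


records = [
    {"text": "I love this movie!", "label": "positive"},
    {"text": "Great acting", "label": "positive"},
    {"text": "Would watch again", "label": "positive"},
    {"text": "This is terrible", "label": "negative"},
]
distribution = label_distribution(records)
print(distribution)  # {'positive': 0.75, 'negative': 0.25}
```

A heavily skewed distribution can let a model reach low loss by predicting the majority class, which often shows up as poor performance on the minority labels.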

Debug Mode

Enable detailed logging:

export LOG_LEVEL="DEBUG"
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml --log-level DEBUG

Performance Optimization

Memory Optimization

model:
  load_in_4bit: true          # 4-bit quantization
  dtype: "float16"            # Use float16 if supported

training:
  gradient_accumulation_steps: 4  # Effective batch size = batch_size * gradient_accumulation_steps
  max_grad_norm: 1.0         # Gradient clipping

Speed Optimization

training:
  dataloader_num_workers: 4   # Parallel data loading
  fp16: true                  # Mixed precision training
  bf16: false                 # Disable bfloat16 if not supported

Contributing

Adding New Task Types

Create task directory structure:

pipelines/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py

scripts/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py

configs/[new_task]/
└── example.yaml

Implement base classes:

Extend BaseDataProcessor
Extend BaseTrainer
Extend BaseInference

Add configuration templates:

Define task-specific parameters
Document all configuration options

Update documentation:

Add task description to README
Include usage examples
Document configuration parameters
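As a rough sketch of the "Implement base classes" step, the example below subclasses a stand-in `BaseDataProcessor` for a hypothetical `summarization` task. The stand-in only mirrors the signatures listed in the API Reference; the real base class, the `{"records": [...]}` data layout, and the `SummarizationDataProcessor` name are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple


class BaseDataProcessor:
    """Stand-in mirroring the documented signatures; not the real class."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]:
        raise NotImplementedError


class SummarizationDataProcessor(BaseDataProcessor):
    """Hypothetical processor for a new 'summarization' task."""

    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]:
        # Field names come from the data section of the YAML config.
        input_field = self.config["data"]["input_field"]
        output_field = self.config["data"]["output_field"]
        errors = [
            f"record {index}: missing '{field}'"
            for index, record in enumerate(data["records"])  # assumed layout
            for field in (input_field, output_field)
            if field not in record
        ]
        return (not errors, errors)


processor = SummarizationDataProcessor(
    {"data": {"input_field": "document", "output_field": "summary"}}
)
ok, errors = processor.validate_data(
    {"records": [{"document": "a", "summary": "b"}, {"document": "c"}]}
)
print(ok, errors)  # False ["record 1: missing 'summary'"]
```

The trainer and inference classes for a new task would follow the same pattern, extending `BaseTrainer` and `BaseInference`.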

Code Style

Follow PEP 8 guidelines
Use type hints for all functions
Include comprehensive docstrings
Add unit tests for new functionality

Testing

# Run all tests
python -m pytest tests/

# Run specific task tests
python -m pytest tests/[task_type]/

# Run with coverage
python -m pytest --cov=pipelines tests/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Wiki

Happy fine-tuning! 🚀
