Fine-Tuning Task Framework

A framework for fine-tuning Large Language Models (LLMs) across four task types: classification, completion, styling, and matching.

Overview

This framework provides a unified approach to fine-tuning LLMs for various NLP tasks. It's designed to be:

  • Task-Agnostic: Same pipeline structure for different task types
  • Configuration-Driven: YAML-based configuration for all parameters
  • Developer-Friendly: Clear scripts and comprehensive logging
  • Production-Ready: Built-in validation, error handling, and optimization

Architecture

The framework follows a modular pipeline architecture:

Raw Data → Data Processing → Model Training → Inference/Evaluation
    ↓              ↓              ↓              ↓
  JSONL/CSV    HuggingFace    Trained      Ready for
  Files        Datasets       Models       Production

Core Components

  1. Data Processors: Convert raw data to training-ready formats
  2. Training Pipelines: Task-specific training with optimization
  3. Inference Engines: Production-ready text generation/classification
  4. Configuration Management: YAML-based parameter control
  5. Utility Scripts: Command-line interfaces for all operations

Task Types

1. Classification Task

Purpose: Text classification, sentiment analysis, topic categorization

Data Format:

{"text": "I love this product!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}

Output: Classification probabilities and predicted labels

Use Cases: Sentiment analysis, spam detection, content moderation

2. Completion Task

Purpose: Text generation, story completion, code generation

Data Format:

{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}

Output: Generated text continuations

Use Cases: Creative writing, code completion, content generation

3. Styling Task

Purpose: Style transfer, tone modification, writing style adaptation

Data Format:

{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}

Output: Text rewritten in target style

Use Cases: Formalization, casualization, domain adaptation

4. Matching Task

Purpose: Semantic similarity, question-answer matching, paraphrase detection

Data Format:

{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}

Output: Similarity scores or binary classifications

Use Cases: Search relevance, duplicate detection, semantic matching

Quick Start

Prerequisites

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch, transformers, datasets; print('✅ All packages installed')"

Basic Workflow

# 1. Process data
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml

# 2. Train model
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml

# 3. Run inference
python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml

Configuration Guide

YAML Structure

All configurations follow this hierarchical structure:

# Task Configuration
task:
  name: "task_type"                    # classification, completion, styling, matching
  type: "specific_type"                # e.g., "sentiment_analysis", "style_transfer"

# Data Configuration
data:
  source: "custom"                     # "custom" or "huggingface"
  data_path: "./data/raw/..."          # Path to raw data
  input_field: "text"                  # Field name for input
  output_field: "label"                # Field name for output
  instruction: "Task instruction"      # For instruction-following tasks

# Model Configuration
model:
  name: "model_name"                   # HuggingFace model identifier
  max_seq_length: 2048                 # Maximum sequence length
  dtype: null                          # Data type (auto-detected)
  load_in_4bit: true                   # 4-bit quantization

# Training Configuration
training:
  num_epochs: 3                        # Training epochs
  batch_size: 4                        # Batch size
  learning_rate: 2e-4                  # Learning rate
  warmup_steps: 5                      # Warmup steps
  max_steps: 60                        # Maximum training steps

# Inference Configuration
inference:
  batch_size: 32                       # Inference batch size
  max_new_tokens: 128                  # Max tokens to generate
  temperature: 0.8                     # Sampling temperature

Configuration Parameters

Data Processing Parameters

Parameter      Type     Default        Description
source         string   "custom"       Data source type
data_path      string   required       Path to raw data file
input_field    string   "text"         Input field name
output_field   string   "label"        Output field name
instruction    string   task-specific  Task instruction
data_format    string   "jsonl"        Data file format
max_length     int      256            Maximum text length
min_length     int      10             Minimum text length
clean_text     boolean  true           Enable text cleaning
lowercase      boolean  false          Convert to lowercase

Model Parameters

Parameter       Type     Default   Description
name            string   required  HuggingFace model name
max_seq_length  int      2048      Maximum sequence length
dtype           string   null      Data type (auto-detected)
load_in_4bit    boolean  true      Enable 4-bit quantization
token           string   null      HuggingFace access token

Training Parameters

Parameter                    Type   Default  Description
num_epochs                   int    1        Number of training epochs
batch_size                   int    2        Training batch size
learning_rate                float  2e-4     Learning rate
weight_decay                 float  0.01     Weight decay
warmup_steps                 int    5        Warmup steps
max_steps                    int    60       Maximum training steps
gradient_accumulation_steps  int    4        Gradient accumulation
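
The defaults in the tables above could be layered under a user's configuration with a shallow, per-section merge. The following is a minimal sketch under stated assumptions, not the framework's actual loader: `DEFAULTS`, `REQUIRED`, `FLOAT_KEYS`, and `apply_defaults` are hypothetical names, and `user_config` stands in for the dictionary produced by parsing the YAML file. The float coercion is there because, if the file is parsed with PyYAML, exponent notation without a decimal point (such as `2e-4`) is read as a string rather than a number.

```python
from copy import deepcopy
from typing import Any, Dict

# Defaults transcribed from the parameter tables (DEFAULTS is a hypothetical name).
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "data": {
        "source": "custom", "input_field": "text", "output_field": "label",
        "data_format": "jsonl", "max_length": 256, "min_length": 10,
        "clean_text": True, "lowercase": False,
    },
    "model": {"max_seq_length": 2048, "dtype": None, "load_in_4bit": True, "token": None},
    "training": {
        "num_epochs": 1, "batch_size": 2, "learning_rate": 2e-4, "weight_decay": 0.01,
        "warmup_steps": 5, "max_steps": 60, "gradient_accumulation_steps": 4,
    },
}

# Parameters the tables mark as "required".
REQUIRED = {"data": ["data_path"], "model": ["name"]}

# Training keys that must be floats even if the YAML parser returned a string.
FLOAT_KEYS = {"learning_rate", "weight_decay"}


def apply_defaults(user_config: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Overlay user settings on the documented defaults, section by section."""
    merged = deepcopy(DEFAULTS)
    for section, values in user_config.items():
        merged.setdefault(section, {}).update(values or {})
    for key in FLOAT_KEYS:
        if key in merged["training"]:
            merged["training"][key] = float(merged["training"][key])
    missing = [
        f"{section}.{key}"
        for section, keys in REQUIRED.items()
        for key in keys
        if key not in merged[section]
    ]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return merged
```

Calling `apply_defaults` with only `model.name`, `data.data_path`, and a few training overrides returns a complete configuration; omitting either required value raises `ValueError`.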

LoRA Parameters

Parameter       Type   Default                                   Description
lora_r          int    16                                        LoRA rank
lora_alpha      int    16                                        LoRA alpha
lora_dropout    float  0                                         LoRA dropout
target_modules  list   ["q_proj", "k_proj", "v_proj", "o_proj"]  Target modules for LoRA
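
To give a feel for what these values mean: in standard LoRA, each targeted weight matrix of shape d_out x d_in gains two low-rank factors with r * (d_in + d_out) trainable parameters in total, and the learned update is scaled by lora_alpha / lora_r. The sketch below works through that arithmetic. The 4096 x 4096 projection size is an illustrative assumption, not a property of any particular model (real attention projections are often not square), and the function names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, lora_r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


# Documented defaults: lora_r = 16, lora_alpha = 16.
per_module = lora_param_count(d_in=4096, d_out=4096, r=16)  # assumed square projection
per_layer = 4 * per_module  # q_proj, k_proj, v_proj, o_proj
scaling = lora_scaling(16, 16)
```

With the default rank and alpha the scaling factor is 1.0, so raising `lora_r` without raising `lora_alpha` shrinks the effective update.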

Environment Variables

# HuggingFace token for gated models
export HF_TOKEN="hf_..."

# CUDA device selection
export CUDA_VISIBLE_DEVICES="0"

# Logging level
export LOG_LEVEL="INFO"

Scripts & Commands

Data Processing Scripts

Basic Usage

python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml

Advanced Options

python scripts/[task_type]/data_processor.py \
  --config configs/[task_type]/[config].yaml \
  --max-samples 1000 \
  --log-level DEBUG \
  --create-hf-dataset \
  --hf-dataset-path ./datasets/[task_name]

Command Line Arguments

Argument             Type    Default  Description
--config             string  required YAML configuration file
--max-samples        int     all      Maximum samples to process
--log-level          string  "INFO"   Logging level
--create-hf-dataset  flag    false    Create HuggingFace dataset
--hf-dataset-path    string  auto     HuggingFace dataset path

Training Scripts

Basic Usage

python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml

Advanced Options

python scripts/[task_type]/train.py train \
  --config configs/[task_type]/[config].yaml \
  --epochs 5 \
  --batch-size 8 \
  --learning-rate 1e-4 \
  --max-steps 100

Command Line Arguments

Argument         Type    Default     Description
--config         string  required    YAML configuration file
--epochs         int     YAML value  Override training epochs
--batch-size     int     YAML value  Override batch size
--learning-rate  float   YAML value  Override learning rate
--max-steps      int     YAML value  Override max steps
--output-dir     string  YAML value  Override output directory
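
The "YAML value" default means a flag only takes effect when it is passed. One way to get that behaviour with `argparse` is to default every override to `None` and copy only the non-`None` values over the `training` section. This is a sketch of that pattern, not the framework's own parser; `build_parser`, `OVERRIDES`, and `apply_overrides` are hypothetical names, and the mapping from flags to YAML keys is inferred from the configuration examples in this README.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    """Parser with a `train` subcommand mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="train.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    train = subparsers.add_parser("train")
    train.add_argument("--config", required=True)
    # None means "not given on the command line, keep the YAML value".
    train.add_argument("--epochs", type=int, default=None)
    train.add_argument("--batch-size", type=int, default=None)
    train.add_argument("--learning-rate", type=float, default=None)
    train.add_argument("--max-steps", type=int, default=None)
    train.add_argument("--output-dir", default=None)
    return parser


# argparse attribute -> key under `training:` in the YAML file (assumed mapping).
OVERRIDES = {
    "epochs": "num_epochs",
    "batch_size": "batch_size",
    "learning_rate": "learning_rate",
    "max_steps": "max_steps",
    "output_dir": "model_output_dir",
}


def apply_overrides(training: Dict[str, Any], args: argparse.Namespace) -> Dict[str, Any]:
    """Return a copy of the training section with CLI overrides applied."""
    merged = dict(training)
    for attr, key in OVERRIDES.items():
        value = getattr(args, attr)
        if value is not None:
            merged[key] = value
    return merged
```

For example, parsing `train --config c.yaml --epochs 5` and applying it to `{"num_epochs": 3, "batch_size": 4}` changes only `num_epochs`.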

Inference Scripts

Basic Usage

python scripts/[task_type]/inference.py infer \
  --config configs/[task_type]/[config].yaml \
  --input-text "Your input text here"

Advanced Options

python scripts/[task_type]/inference.py infer \
  --config configs/[task_type]/[config].yaml \
  --input-text "Your input text here" \
  --max-tokens 256 \
  --temperature 0.7 \
  --stream

Command Line Arguments

Argument       Type    Default   Description
--config       string  required  YAML configuration file
--input-text   string  required  Text to process
--max-tokens   int     128       Maximum tokens to generate
--temperature  float   0.8       Sampling temperature
--stream       flag    false     Enable streaming generation

Batch Processing

# Process multiple inputs from file
python scripts/[task_type]/inference.py batch \
  --config configs/[task_type]/[config].yaml \
  --input-file input.txt \
  --output-file output.txt
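
The README does not specify the layout of `input.txt`. Assuming one input per line and one result per output line, a batch loop might look like the sketch below. `run_batch` and the `predict` callable are hypothetical; `predict` stands in for whatever the task's inference class exposes, and the default of 32 matches `inference.batch_size` in the example configuration.

```python
from pathlib import Path
from typing import Callable, List


def run_batch(
    input_file: str,
    output_file: str,
    predict: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> int:
    """Read non-empty lines, run them through `predict` in chunks, write results."""
    text = Path(input_file).read_text(encoding="utf-8")
    inputs = [line.strip() for line in text.splitlines() if line.strip()]
    with open(output_file, "w", encoding="utf-8") as out:
        for start in range(0, len(inputs), batch_size):
            batch = inputs[start:start + batch_size]
            for result in predict(batch):
                out.write(result + "\n")
    return len(inputs)
```

Chunking keeps memory use bounded by `batch_size` rather than by the size of the input file.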

Interactive Mode

# Enter interactive mode for testing
python scripts/[task_type]/inference.py interactive \
  --config configs/[task_type]/[config].yaml

Complete Workflows

Classification Task Workflow

1. Data Preparation

# data/raw/classification/sentiment.jsonl
{"text": "I love this movie!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
{"text": "It's okay", "label": "neutral"}

2. Configuration

# configs/classification/sentiment.yaml
task:
  name: "classification"
  type: "sentiment_analysis"

data:
  source: "custom"
  data_path: "./data/raw/classification/sentiment.jsonl"
  input_field: "text"
  output_field: "label"
  instruction: "Classify the sentiment of the following text"

model:
  name: "microsoft/DialoGPT-medium"
  max_seq_length: 512

training:
  num_epochs: 3
  batch_size: 8
  learning_rate: 3e-5

3. Execute Pipeline

# Process data
python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml

# Train model
python scripts/classification/train.py train --config configs/classification/sentiment.yaml

# Run inference
python scripts/classification/inference.py infer \
  --config configs/classification/sentiment.yaml \
  --input-text "This product exceeded my expectations!"

Styling Task Workflow

1. Data Preparation

# data/raw/styling/formal.jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
{"text": "This is cool", "styled_text": "This is quite impressive"}

2. Configuration

# configs/styling/formal.yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data/raw/styling/formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"

model:
  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  num_epochs: 3
  batch_size: 4
  learning_rate: 2e-4
  model_output_dir: "./models/styling"

3. Execute Pipeline

# Process data
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Train model
python scripts/styling/train.py train --config configs/styling/formal.yaml

# Run inference
python scripts/styling/inference.py infer \
  --config configs/styling/formal.yaml \
  --instruction "Rewrite in formal style" \
  --input-text "Hey there! What's up?"

Completion Task Workflow

1. Data Preparation

# data/raw/completion/story.jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."}
{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."}

2. Configuration

# configs/completion/story.yaml
task:
  name: "completion"
  type: "story_generation"

data:
  source: "custom"
  data_path: "./data/raw/completion/story.jsonl"
  input_field: "prompt"
  output_field: "completion"

model:
  name: "gpt2-medium"
  max_seq_length: 1024

training:
  num_epochs: 2
  batch_size: 16
  learning_rate: 5e-5

3. Execute Pipeline

# Process data
python scripts/completion/data_processor.py --config configs/completion/story.yaml

# Train model
python scripts/completion/train.py train --config configs/completion/story.yaml

# Run inference
python scripts/completion/inference.py infer \
  --config configs/completion/story.yaml \
  --input-text "The wizard cast a spell"

API Reference

Data Processing Classes

BaseDataProcessor

class BaseDataProcessor:
    def __init__(self, config: Dict[str, Any])
    def load_and_preprocess(self) -> Tuple[Dict, Dict]
    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]
    def save_data(self, data: Dict, output_path: str)

ClassificationDataProcessor

class ClassificationDataProcessor(BaseDataProcessor):
    def convert_to_classification_format(self, data: Dict) -> Dict
    def create_label_mapping(self, labels: List[str]) -> Dict[str, int]
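
As an illustration of what `create_label_mapping` might return, here is a standalone sketch that assigns integer ids to the distinct labels in sorted order. The sorted ordering is an assumption made so that the mapping is reproducible across runs; the framework's own ordering is not documented here. `invert_mapping` is a hypothetical helper for turning predicted ids back into label strings.

```python
from typing import Dict, List


def create_label_mapping(labels: List[str]) -> Dict[str, int]:
    """Map each distinct label to an integer id, in sorted label order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def invert_mapping(mapping: Dict[str, int]) -> Dict[int, str]:
    """Reverse lookup, used to decode predicted class ids."""
    return {idx: label for label, idx in mapping.items()}
```

For the sentiment data shown earlier (`positive`, `negative`, `neutral`) this yields `{"negative": 0, "neutral": 1, "positive": 2}`.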

StylingDataProcessor

class StylingDataProcessor(BaseDataProcessor):
    def convert_to_alpaca_format(self, data: Dict) -> Dict
    def format_for_training(self, data: Dict) -> Dict
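
For readers unfamiliar with the Alpaca format, the sketch below shows how a styling record and the configured instruction could be turned into an instruction/input/output triple and then into a single training string. The template is the widely used Stanford Alpaca prompt for examples that have an input; whether this framework uses exactly that wording is an assumption, and `to_alpaca_record` and `format_prompt` are hypothetical names.

```python
from typing import Dict

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def to_alpaca_record(
    record: Dict[str, str],
    instruction: str,
    input_field: str = "text",
    output_field: str = "styled_text",
) -> Dict[str, str]:
    """Rename the configured fields to the Alpaca keys."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


def format_prompt(example: Dict[str, str], eos_token: str = "") -> str:
    """Render one Alpaca example as a training string, ending with the EOS token."""
    return ALPACA_TEMPLATE.format(**example) + eos_token
```

Appending the tokenizer's end-of-sequence token during training is a common way to teach the model where a response stops; at inference time the same template is typically rendered with an empty output.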

Training Classes

BaseTrainer

class BaseTrainer:
    def __init__(self, config: Dict[str, Any])
    def load_model_and_tokenizer(self)
    def setup_training(self, dataset: Dataset)
    def train(self, dataset_path: str) -> Dict
    def save_model(self)

ClassificationTrainer

class ClassificationTrainer(BaseTrainer):
    def setup_classification_head(self)
    def compute_metrics(self, eval_pred) -> Dict

StylingTrainer

class StylingTrainer(BaseTrainer):
    def setup_lora(self)
    def format_dataset(self, dataset: Dataset) -> Dataset

Inference Classes

BaseInference

class BaseInference:
    def __init__(self, config: Dict[str, Any])
    def load_model_and_tokenizer(self)
    def preprocess_input(self, input_text: str) -> torch.Tensor
    def postprocess_output(self, output: torch.Tensor) -> str

ClassificationInference

class ClassificationInference(BaseInference):
    def classify(self, text: str) -> Dict[str, float]
    def batch_classify(self, texts: List[str]) -> List[Dict]

StylingInference

class StylingInference(BaseInference):
    def style_transfer(self, text: str, instruction: str) -> str
    def generate_text(self, instruction: str, input_text: str) -> str

Troubleshooting

Common Issues

1. Model Loading Errors

Error: FileNotFoundError: ./models/[task_name]/*.json

Solution:

  • Verify model was trained successfully
  • Check model_output_dir in YAML config
  • Ensure model files exist in specified directory

2. Memory Issues

Error: CUDA out of memory

Solution:

  • Reduce batch_size in YAML config
  • Enable load_in_4bit: true
  • Use gradient accumulation
  • Reduce max_seq_length

3. Data Format Errors

Error: KeyError: 'input_field'

Solution:

  • Verify field names in JSONL/CSV files
  • Check input_field and output_field in YAML
  • Ensure data format matches expected structure
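
A quick pre-flight check can locate the offending records before the processor runs. The sketch below reports every JSONL line that is not valid JSON or lacks one of the configured fields; `validate_jsonl_lines` is a hypothetical helper, not part of the framework.

```python
import json
from typing import Iterable, List, Tuple


def validate_jsonl_lines(
    lines: Iterable[str],
    input_field: str = "text",
    output_field: str = "label",
) -> Tuple[bool, List[str]]:
    """Return (is_valid, errors) for an iterable of JSONL lines."""
    errors: List[str] = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        for field in (input_field, output_field):
            if field not in record:
                errors.append(f"line {lineno}: missing field '{field}'")
    return (not errors, errors)
```

To check a file, pass the open file handle: `validate_jsonl_lines(open(path, encoding="utf-8"), "text", "label")`.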

4. Training Convergence Issues

Symptoms: Loss not decreasing, poor model performance

Solution:

  • Adjust learning rate (try 1e-5 to 5e-4)
  • Increase training epochs
  • Check data quality and quantity
  • Verify label distribution (for classification)
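
For the last point, a label distribution check takes only a few lines. This sketch computes each label's share of the data and the ratio between the most and least frequent class; both function names are hypothetical, and what counts as too imbalanced depends on the task.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Fraction of examples per label, most frequent first."""
    if not labels:
        raise ValueError("labels is empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


def imbalance_ratio(labels: List[str]) -> float:
    """Count of the most common label divided by the count of the rarest."""
    if not labels:
        raise ValueError("labels is empty")
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

A ratio near 1.0 indicates balanced classes; a large ratio suggests the model may be learning to predict the majority class.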

Debug Mode

Enable detailed logging:

export LOG_LEVEL="DEBUG"
python scripts/[task_type]/[script].py --log-level DEBUG

Performance Optimization

Memory Optimization

model:
  load_in_4bit: true          # 4-bit quantization
  dtype: "float16"            # Use float16 if supported

training:
  gradient_accumulation_steps: 4  # Effective batch size = batch_size * gradient_accumulation_steps
  max_grad_norm: 1.0         # Gradient clipping

Speed Optimization

training:
  dataloader_num_workers: 4   # Parallel data loading
  fp16: true                  # Mixed precision training
  bf16: false                 # Disable bfloat16 if not supported

Contributing

Adding New Task Types

  1. Create task directory structure:

pipelines/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py

scripts/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py

configs/[new_task]/
└── example.yaml

  2. Implement base classes:
     • Extend BaseDataProcessor
     • Extend BaseTrainer
     • Extend BaseInference
  3. Add configuration templates:
     • Define task-specific parameters
     • Document all configuration options
  4. Update documentation:
     • Add task description to README
     • Include usage examples
     • Document configuration parameters
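
As a rough picture of the "extend the base classes" step, the sketch below subclasses a stand-in `BaseDataProcessor` and overrides `validate_data`. Everything here is an assumption made for illustration: the stand-in only copies the documented method signature so that the example runs on its own, `NewTaskDataProcessor` is a hypothetical name, and the shape of `data` (field names mapped to equal-length lists) is a guess, since the README does not define it.

```python
from typing import Any, Dict, List, Tuple


class BaseDataProcessor:
    """Stand-in with the documented signature so this sketch runs alone."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config

    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]:
        raise NotImplementedError


class NewTaskDataProcessor(BaseDataProcessor):
    """Hypothetical processor for a task with one input and one output field."""

    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]:
        # Assumes `data` maps field names to equal-length lists of values.
        data_cfg = self.config.get("data", {})
        fields = [
            data_cfg.get("input_field", "text"),
            data_cfg.get("output_field", "label"),
        ]
        errors = [f"missing field '{name}'" for name in fields if name not in data]
        if not errors:
            lengths = {len(data[name]) for name in fields}
            if len(lengths) > 1:
                errors.append("input and output fields have different lengths")
        return (not errors, errors)
```

In the real package the subclass would import `BaseDataProcessor` from the framework instead of redefining it, and would live in `pipelines/[new_task]/data_processor.py`.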

Code Style

  • Follow PEP 8 guidelines
  • Use type hints for all functions
  • Include comprehensive docstrings
  • Add unit tests for new functionality

Testing

# Run all tests
python -m pytest tests/

# Run specific task tests
python -m pytest tests/[task_type]/

# Run with coverage
python -m pytest --cov=pipelines tests/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support


Happy fine-tuning! 🚀
