# Fine-Tuning Task Framework

A comprehensive framework for fine-tuning Large Language Models (LLMs) across multiple task types, including classification, completion, styling, and matching.

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Task Types](#task-types)
- [Quick Start](#quick-start)
- [Configuration Guide](#configuration-guide)
- [Scripts & Commands](#scripts--commands)
- [Complete Workflows](#complete-workflows)
- [API Reference](#api-reference)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)

## Overview

This framework provides a unified approach to fine-tuning LLMs for various NLP tasks. It is designed to be:

- **Task-Agnostic**: Same pipeline structure for different task types
- **Configuration-Driven**: YAML-based configuration for all parameters
- **Developer-Friendly**: Clear scripts and comprehensive logging
- **Production-Ready**: Built-in validation, error handling, and optimization

## Architecture

The framework follows a **modular pipeline architecture**:

```
Raw Data → Data Processing → Model Training → Inference/Evaluation
    ↓              ↓                ↓                  ↓
JSONL/CSV     HuggingFace        Trained           Ready for
  Files        Datasets          Models           Production
```

### Core Components

1. **Data Processors**: Convert raw data to training-ready formats
2. **Training Pipelines**: Task-specific training with optimization
3. **Inference Engines**: Production-ready text generation/classification
4. **Configuration Management**: YAML-based parameter control
5. **Utility Scripts**: Command-line interfaces for all operations

## Task Types

### 1. Classification Task

**Purpose**: Text classification, sentiment analysis, topic categorization

**Data Format**:

```jsonl
{"text": "I love this product!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
```

**Output**: Classification probabilities and predicted labels

**Use Cases**: Sentiment analysis, spam detection, content moderation

### 2. Completion Task

**Purpose**: Text generation, story completion, code generation

**Data Format**:

```jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}
```

**Output**: Generated text continuations

**Use Cases**: Creative writing, code completion, content generation

### 3. Styling Task

**Purpose**: Style transfer, tone modification, writing style adaptation

**Data Format**:

```jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
```

**Output**: Text rewritten in the target style

**Use Cases**: Formalization, casualization, domain adaptation

### 4. Matching Task

**Purpose**: Semantic similarity, question-answer matching, paraphrase detection

**Data Format**:

```jsonl
{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}
```

**Output**: Similarity scores or binary classifications

**Use Cases**: Search relevance, duplicate detection, semantic matching
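
The four data formats above differ only in which fields each record must carry, so a small pre-flight check can catch malformed files before processing. The following is a minimal sketch, not part of the framework: `REQUIRED_FIELDS` and `validate_jsonl_lines` are hypothetical names, and the field sets are taken from the example records above.

```python
import json
from typing import Dict, Iterable, List, Set

# Required fields per task type, taken from the example records above.
REQUIRED_FIELDS: Dict[str, Set[str]] = {
    "classification": {"text", "label"},
    "completion": {"prompt", "completion"},
    "styling": {"text", "styled_text"},
    "matching": {"text1", "text2", "label"},
}


def validate_jsonl_lines(lines: Iterable[str], task_type: str) -> List[str]:
    """Return a list of problems; an empty list means every record looks valid."""
    required = REQUIRED_FIELDS[task_type]
    problems: List[str] = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_number}: expected a JSON object")
            continue
        missing = sorted(required - set(record))
        if missing:
            problems.append(f"line {line_number}: missing fields {missing}")
    return problems


sample = [
    '{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}',
    '{"text1": "Weather today", "label": "different"}',
]
print(validate_jsonl_lines(sample, "matching"))
# ["line 2: missing fields ['text2']"]
```

In practice the lines would come from an open file handle, which is already an iterable of strings.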

## Quick Start

### Prerequisites

```bash
# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch, transformers, datasets; print('✅ All packages installed')"
```

### Basic Workflow

```bash
# 1. Process data
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml

# 2. Train model
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml

# 3. Run inference
python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml
```

## Configuration Guide

### YAML Structure

All configurations follow this hierarchical structure:

```yaml
# Task Configuration
task:
  name: "task_type"                 # classification, completion, styling, matching
  type: "specific_type"             # e.g., "sentiment_analysis", "style_transfer"

# Data Configuration
data:
  source: "custom"                  # "custom" or "huggingface"
  data_path: "./data/raw/..."       # Path to raw data
  input_field: "text"               # Field name for input
  output_field: "label"             # Field name for output
  instruction: "Task instruction"   # For instruction-following tasks

# Model Configuration
model:
  name: "model_name"                # HuggingFace model identifier
  max_seq_length: 2048              # Maximum sequence length
  dtype: null                       # Data type (auto-detected)
  load_in_4bit: true                # 4-bit quantization

# Training Configuration
training:
  num_epochs: 3                     # Training epochs
  batch_size: 4                     # Batch size
  learning_rate: 2e-4               # Learning rate
  warmup_steps: 5                   # Warmup steps
  max_steps: 60                     # Maximum training steps

# Inference Configuration
inference:
  batch_size: 32                    # Inference batch size
  max_new_tokens: 128               # Max tokens to generate
  temperature: 0.8                  # Sampling temperature
```

### Configuration Parameters
#### Data Processing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `source` | string | "custom" | Data source type |
| `data_path` | string | required | Path to raw data file |
| `input_field` | string | "text" | Input field name |
| `output_field` | string | "label" | Output field name |
| `instruction` | string | task-specific | Task instruction |
| `data_format` | string | "jsonl" | Data file format |
| `max_length` | int | 256 | Maximum text length |
| `min_length` | int | 10 | Minimum text length |
| `clean_text` | boolean | true | Enable text cleaning |
| `lowercase` | boolean | false | Convert to lowercase |
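
To make the interaction of these parameters concrete, here is a minimal sketch of a per-sample preprocessing step. It rests on assumptions the table does not confirm: lengths are counted in characters, "cleaning" means collapsing whitespace, short samples are dropped, and long samples are truncated. `preprocess_text` is a hypothetical helper, not the framework's implementation.

```python
import re
from typing import Optional


def preprocess_text(
    text: str,
    min_length: int = 10,
    max_length: int = 256,
    clean_text: bool = True,
    lowercase: bool = False,
) -> Optional[str]:
    """Return the processed text, or None if the sample should be dropped."""
    if clean_text:
        # Assumed meaning of "cleaning": collapse whitespace runs and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if len(text) < min_length:
        return None  # assumed policy: drop samples that are too short
    return text[:max_length]  # assumed policy: truncate samples that are too long


print(preprocess_text("  Great   product,\n would buy again!  "))  # Great product, would buy again!
print(preprocess_text("Too short"))  # None
```
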
#### Model Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `name` | string | required | HuggingFace model name |
| `max_seq_length` | int | 2048 | Maximum sequence length |
| `dtype` | string | null | Data type (auto-detected) |
| `load_in_4bit` | boolean | true | Enable 4-bit quantization |
| `token` | string | null | HuggingFace access token |
#### Training Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_epochs` | int | 1 | Number of training epochs |
| `batch_size` | int | 2 | Training batch size |
| `learning_rate` | float | 2e-4 | Learning rate |
| `weight_decay` | float | 0.01 | Weight decay |
| `warmup_steps` | int | 5 | Warmup steps |
| `max_steps` | int | 60 | Maximum training steps |
| `gradient_accumulation_steps` | int | 4 | Gradient accumulation |
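
The batch-related defaults combine multiplicatively, which is easy to misjudge when tuning. The sketch below works through the arithmetic with the defaults from the table, assuming `max_steps` counts optimizer steps (as it does in the Hugging Face `Trainer`, where a positive `max_steps` also takes precedence over the epoch count). Whether this framework behaves the same way is an assumption, and both function names are hypothetical.

```python
def effective_batch_size(
    batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1
) -> int:
    """Number of samples contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


def samples_seen(max_steps: int, batch_size: int, gradient_accumulation_steps: int) -> int:
    """Total samples consumed after `max_steps` optimizer steps on one device."""
    return max_steps * effective_batch_size(batch_size, gradient_accumulation_steps)


print(effective_batch_size(2, 4))  # 8
print(samples_seen(60, 2, 4))      # 480
```

With the defaults, a run therefore touches at most 480 samples, so on a larger dataset `max_steps` rather than `num_epochs` would bound the training length under this assumption.
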
#### LoRA Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `lora_r` | int | 16 | LoRA rank |
| `lora_alpha` | int | 16 | LoRA alpha |
| `lora_dropout` | float | 0 | LoRA dropout |
| `target_modules` | list | ["q_proj", "k_proj", "v_proj", "o_proj"] | Target modules for LoRA |
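
As a rough guide to what these values imply, the sketch below computes the standard LoRA scaling factor (`lora_alpha / lora_r`) and the number of trainable parameters one adapter adds to a single weight matrix. The 4096 x 4096 projection size is a hypothetical example, not a property of any model named in this document.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_in: int, d_out: int, lora_r: int) -> int:
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return lora_r * (d_in + d_out)


print(lora_scaling(16, 16))                   # 1.0
print(lora_trainable_params(4096, 4096, 16))  # 131072 for one hypothetical 4096 x 4096 projection
```
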
### Environment Variables
```bash
# HuggingFace token for gated models
export HF_TOKEN="hf_..."

# CUDA device selection
export CUDA_VISIBLE_DEVICES="0"

# Logging level
export LOG_LEVEL="INFO"
```
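
On the Python side, these variables can be read with sensible fallbacks. The following is a minimal sketch under stated assumptions: `read_env_settings` and `LOG_LEVELS` are hypothetical names, and falling back to `INFO` for a missing or unknown `LOG_LEVEL` is an assumed policy rather than documented framework behaviour.

```python
import logging
from typing import Any, Dict, Mapping

# Level names accepted in LOG_LEVEL (assumed set).
LOG_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def read_env_settings(env: Mapping[str, str]) -> Dict[str, Any]:
    """Collect the three variables above, using None or INFO when unset."""
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    return {
        "hf_token": env.get("HF_TOKEN"),
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        "log_level": LOG_LEVELS.get(level_name, logging.INFO),
    }


settings = read_env_settings({"LOG_LEVEL": "debug", "CUDA_VISIBLE_DEVICES": "0"})
logging.basicConfig(level=settings["log_level"])
```

Passing `os.environ` instead of the literal dictionary reads the real process environment.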

## Scripts & Commands

### Data Processing Scripts

#### Basic Usage

```bash
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
```

#### Advanced Options

```bash
python scripts/[task_type]/data_processor.py \
    --config configs/[task_type]/[config].yaml \
    --max-samples 1000 \
    --log-level DEBUG \
    --create-hf-dataset \
    --hf-dataset-path ./datasets/[task_name]
```

#### Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--max-samples` | int | all | Maximum samples to process |
| `--log-level` | string | "INFO" | Logging level |
| `--create-hf-dataset` | flag | false | Create HuggingFace dataset |
| `--hf-dataset-path` | string | auto | HuggingFace dataset path |

### Training Scripts

#### Basic Usage

```bash
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
```

#### Advanced Options

```bash
python scripts/[task_type]/train.py train \
    --config configs/[task_type]/[config].yaml \
    --epochs 5 \
    --batch-size 8 \
    --learning-rate 1e-4 \
    --max-steps 100
```

#### Command Line Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--epochs` | int | YAML value | Override training epochs |
| `--batch-size` | int | YAML value | Override batch size |
| `--learning-rate` | float | YAML value | Override learning rate |
| `--max-steps` | int | YAML value | Override max steps |
| `--output-dir` | string | YAML value | Override output directory |
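
The "YAML value" defaults mean a flag only takes effect when it is actually passed. One way to get that behaviour is sketched below with `argparse`. This is an illustration, not the framework's code: `build_parser`, `OVERRIDE_KEYS`, and `apply_overrides` are hypothetical names, the `train` subcommand and `--config` are omitted for brevity, and the mapping from `--epochs` to `num_epochs` is inferred from the tables above.

```python
import argparse
from typing import Any, Dict, List


def build_parser() -> argparse.ArgumentParser:
    # default=None lets us tell "flag not given" apart from an explicit value.
    parser = argparse.ArgumentParser(description="Training overrides (illustrative)")
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    parser.add_argument("--max-steps", type=int, default=None)
    return parser


# argparse destination name -> key in the YAML `training` section (assumed mapping)
OVERRIDE_KEYS = {
    "epochs": "num_epochs",
    "batch_size": "batch_size",
    "learning_rate": "learning_rate",
    "max_steps": "max_steps",
}


def apply_overrides(training_config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Return a copy of the YAML training section with CLI values applied on top."""
    args = vars(build_parser().parse_args(argv))
    merged = dict(training_config)
    for cli_name, yaml_key in OVERRIDE_KEYS.items():
        if args[cli_name] is not None:
            merged[yaml_key] = args[cli_name]
    return merged


yaml_training = {"num_epochs": 3, "batch_size": 4, "learning_rate": 2e-4, "max_steps": 60}
print(apply_overrides(yaml_training, ["--epochs", "5", "--learning-rate", "1e-4"]))
# {'num_epochs': 5, 'batch_size': 4, 'learning_rate': 0.0001, 'max_steps': 60}
```
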
### Inference Scripts
#### Basic Usage

```bash
python scripts/[task_type]/inference.py infer \
    --config configs/[task_type]/[config].yaml \
    --input-text "Your input text here"
```

#### Advanced Options

```bash
python scripts/[task_type]/inference.py infer \
    --config configs/[task_type]/[config].yaml \
    --input-text "Your input text here" \
    --max-tokens 256 \
    --temperature 0.7 \
    --stream
```

#### Command Line Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--input-text` | string | required | Text to process |
| `--max-tokens` | int | 128 | Maximum tokens to generate |
| `--temperature` | float | 0.8 | Sampling temperature |
| `--stream` | flag | false | Enable streaming generation |
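
To illustrate what `--temperature` controls, the sketch below applies temperature scaling to a softmax over raw scores: values below 1 sharpen the distribution toward the top-scoring token, values above 1 flatten it. This is the general technique, written in plain Python for clarity; `softmax_with_temperature` is a hypothetical helper and not the framework's sampling code.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 0.8) -> List[float]:
    """Convert raw scores to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)
print(sharp[0] > flat[0])  # True: a lower temperature concentrates probability on the top score
```
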
### Batch Processing

```bash
# Process multiple inputs from file
python scripts/[task_type]/inference.py batch \
    --config configs/[task_type]/[config].yaml \
    --input-file input.txt \
    --output-file output.txt
```
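
Batch mode pairs naturally with the `inference.batch_size` setting from the configuration. As a minimal sketch, assuming `input.txt` holds one input per line (the file layout is not specified above), inputs could be grouped like this; `batched` is a hypothetical helper.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, group


lines = [f"input {index}" for index in range(70)]
print([len(group) for group in batched(lines, batch_size=32)])  # [32, 32, 6]
```
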

### Interactive Mode

```bash
# Enter interactive mode for testing
python scripts/[task_type]/inference.py interactive \
    --config configs/[task_type]/[config].yaml
```

## Complete Workflows
### Classification Task Workflow

#### 1. Data Preparation

```jsonl
# data/raw/classification/sentiment.jsonl
{"text": "I love this movie!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
{"text": "It's okay", "label": "neutral"}
```

#### 2. Configuration

```yaml
# configs/classification/sentiment.yaml
task:
  name: "classification"
  type: "sentiment_analysis"

data:
  source: "custom"
  data_path: "./data/raw/classification/sentiment.jsonl"
  input_field: "text"
  output_field: "label"
  instruction: "Classify the sentiment of the following text"

model:
  name: "microsoft/DialoGPT-medium"
  max_seq_length: 512

training:
  num_epochs: 3
  batch_size: 8
  learning_rate: 3e-5
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml

# Train model
python scripts/classification/train.py train --config configs/classification/sentiment.yaml

# Run inference
python scripts/classification/inference.py infer \
    --config configs/classification/sentiment.yaml \
    --input-text "This product exceeded my expectations!"
```

### Styling Task Workflow
#### 1. Data Preparation

```jsonl
# data/raw/styling/formal.jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
{"text": "This is cool", "styled_text": "This is quite impressive"}
```

#### 2. Configuration

```yaml
# configs/styling/formal.yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data/raw/styling/formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"

model:
  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  num_epochs: 3
  batch_size: 4
  learning_rate: 2e-4
  model_output_dir: "./models/styling"
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Train model
python scripts/styling/train.py train --config configs/styling/formal.yaml

# Run inference
python scripts/styling/inference.py infer \
    --config configs/styling/formal.yaml \
    --instruction "Rewrite in formal style" \
    --input-text "Hey there! What's up?"
```

### Completion Task Workflow

#### 1. Data Preparation

```jsonl
# data/raw/completion/story.jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."}
{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."}
```

#### 2. Configuration

```yaml
# configs/completion/story.yaml
task:
  name: "completion"
  type: "story_generation"

data:
  source: "custom"
  data_path: "./data/raw/completion/story.jsonl"
  input_field: "prompt"
  output_field: "completion"

model:
  name: "gpt2-medium"
  max_seq_length: 1024

training:
  num_epochs: 2
  batch_size: 16
  learning_rate: 5e-5
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/completion/data_processor.py --config configs/completion/story.yaml

# Train model
python scripts/completion/train.py train --config configs/completion/story.yaml

# Run inference
python scripts/completion/inference.py infer \
    --config configs/completion/story.yaml \
    --input-text "The wizard cast a spell"
```

## API Reference

### Data Processing Classes

#### BaseDataProcessor

```python
from typing import Any, Dict, List, Tuple

class BaseDataProcessor:
    def __init__(self, config: Dict[str, Any]): ...
    def load_and_preprocess(self) -> Tuple[Dict, Dict]: ...
    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]: ...
    def save_data(self, data: Dict, output_path: str): ...
```

#### ClassificationDataProcessor

```python
class ClassificationDataProcessor(BaseDataProcessor):
    def convert_to_classification_format(self, data: Dict) -> Dict: ...
    def create_label_mapping(self, labels: List[str]) -> Dict[str, int]: ...
```
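
The signature of `create_label_mapping` suggests a mapping from label strings to integer class ids. One plausible behaviour is sketched below as a standalone function. Sorting the unique labels so the ids are deterministic across runs is an assumption here, not something the API listing confirms.

```python
from typing import Dict, List


def create_label_mapping(labels: List[str]) -> Dict[str, int]:
    """Assign consecutive ids to the unique labels, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


print(create_label_mapping(["positive", "negative", "neutral", "positive"]))
# {'negative': 0, 'neutral': 1, 'positive': 2}
```
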

#### StylingDataProcessor

```python
class StylingDataProcessor(BaseDataProcessor):
    def convert_to_alpaca_format(self, data: Dict) -> Dict: ...
    def format_for_training(self, data: Dict) -> Dict: ...
```
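
For readers unfamiliar with the Alpaca format, it represents each example as an `instruction`, an `input`, and an `output`, which are then rendered into a single prompt string. The sketch below shows that conversion for a styling record. `to_alpaca_record` and `format_prompt` are hypothetical helpers, and the template is the widely used Alpaca prompt, which may differ from the one this framework uses.

```python
from typing import Dict

# The commonly used Alpaca prompt template (with an input section).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def to_alpaca_record(
    record: Dict[str, str],
    instruction: str,
    input_field: str = "text",
    output_field: str = "styled_text",
) -> Dict[str, str]:
    """Map a raw styling record onto the three Alpaca fields."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


def format_prompt(alpaca_record: Dict[str, str]) -> str:
    """Render an Alpaca record into one training string."""
    return ALPACA_TEMPLATE.format(**alpaca_record)


raw = {"text": "Hey there!", "styled_text": "Hello, how are you?"}
alpaca = to_alpaca_record(raw, "Rewrite the following text in a formal style")
print(format_prompt(alpaca))
```
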

### Training Classes

#### BaseTrainer

```python
from typing import Any, Dict
from datasets import Dataset

class BaseTrainer:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def setup_training(self, dataset: Dataset): ...
    def train(self, dataset_path: str) -> Dict: ...
    def save_model(self): ...
```

#### ClassificationTrainer

```python
class ClassificationTrainer(BaseTrainer):
    def setup_classification_head(self): ...
    def compute_metrics(self, eval_pred) -> Dict: ...
```
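
A `compute_metrics` hook typically receives the model's logits together with the true label ids and returns a dictionary of scores. The sketch below computes accuracy in plain Python, assuming `eval_pred` unpacks into `(logits, labels)`. In practice these are usually NumPy arrays, and the framework's actual metric set is not documented here.

```python
from typing import Dict, Sequence, Tuple


def compute_metrics(
    eval_pred: Tuple[Sequence[Sequence[float]], Sequence[int]]
) -> Dict[str, float]:
    """Accuracy of the arg-max prediction for each row of logits."""
    logits, labels = eval_pred
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


example_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [2.0, 0.5, 0.1]]
example_labels = [1, 0, 2, 1]
print(compute_metrics((example_logits, example_labels)))  # {'accuracy': 0.75}
```
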

#### StylingTrainer

```python
class StylingTrainer(BaseTrainer):
    def setup_lora(self): ...
    def format_dataset(self, dataset: Dataset) -> Dataset: ...
```

### Inference Classes

#### BaseInference

```python
from typing import Any, Dict, List
import torch

class BaseInference:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def preprocess_input(self, input_text: str) -> torch.Tensor: ...
    def postprocess_output(self, output: torch.Tensor) -> str: ...
```

#### ClassificationInference

```python
class ClassificationInference(BaseInference):
    def classify(self, text: str) -> Dict[str, float]: ...
    def batch_classify(self, texts: List[str]) -> List[Dict]: ...
```

#### StylingInference

```python
class StylingInference(BaseInference):
    def style_transfer(self, text: str, instruction: str) -> str: ...
    def generate_text(self, instruction: str, input_text: str) -> str: ...
```

## Troubleshooting

### Common Issues

#### 1. Model Loading Errors

**Error**: `FileNotFoundError: ./models/[task_name]/*.json`

**Solution**:

- Verify that the model was trained successfully
- Check `model_output_dir` in the YAML config
- Ensure the model files exist in the specified directory

#### 2. Memory Issues

**Error**: `CUDA out of memory`

**Solution**:

- Reduce `batch_size` in the YAML config
- Enable `load_in_4bit: true`
- Use gradient accumulation
- Reduce `max_seq_length`

#### 3. Data Format Errors

**Error**: `KeyError: 'input_field'`

**Solution**:

- Verify field names in the JSONL/CSV files
- Check `input_field` and `output_field` in the YAML config
- Ensure the data format matches the expected structure

#### 4. Training Convergence Issues

**Symptoms**: Loss not decreasing, poor model performance

**Solution**:

- Adjust the learning rate (try 1e-5 to 5e-4)
- Increase the number of training epochs
- Check data quality and quantity
- Verify the label distribution (for classification)

### Debug Mode

Enable detailed logging:

```bash
export LOG_LEVEL="DEBUG"
python scripts/[task_type]/[script].py --log-level DEBUG
```

### Performance Optimization
#### Memory Optimization

```yaml
model:
  load_in_4bit: true               # 4-bit quantization
  dtype: "float16"                 # Use float16 if supported

training:
  gradient_accumulation_steps: 4   # Effective batch size = batch_size * gradient_accumulation_steps
  max_grad_norm: 1.0               # Gradient clipping
```

#### Speed Optimization

```yaml
training:
  dataloader_num_workers: 4        # Parallel data loading
  fp16: true                       # Mixed precision training
  bf16: false                      # Disable bfloat16 if not supported
```

## Contributing

### Adding New Task Types

1. **Create task directory structure**:

   ```
   pipelines/[new_task]/
   ├── __init__.py
   ├── data_processor.py
   ├── train.py
   └── inference.py
   scripts/[new_task]/
   ├── __init__.py
   ├── data_processor.py
   ├── train.py
   └── inference.py
   configs/[new_task]/
   └── example.yaml
   ```

2. **Implement base classes**:
   - Extend `BaseDataProcessor`
   - Extend `BaseTrainer`
   - Extend `BaseInference`
3. **Add configuration templates**:
   - Define task-specific parameters
   - Document all configuration options
4. **Update documentation**:
   - Add task description to README
   - Include usage examples
   - Document configuration parameters

### Code Style
- Follow PEP 8 guidelines
- Use type hints for all functions
- Include comprehensive docstrings
- Add unit tests for new functionality
### Testing

```bash
# Run all tests
python -m pytest tests/

# Run specific task tests
python -m pytest tests/[task_type]/

# Run with coverage (requires the pytest-cov plugin)
python -m pytest --cov=pipelines tests/
```

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Support

- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
- **Discussions**: [GitHub Discussions](https://github.com/your-repo/discussions)
- **Documentation**: [Wiki](https://github.com/your-repo/wiki)

---

**Happy fine-tuning! 🚀**