
# Fine-Tuning Task Framework

A framework for fine-tuning Large Language Models (LLMs) on four task types: classification, completion, styling, and matching.

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Task Types](#task-types)
- [Quick Start](#quick-start)
- [Configuration Guide](#configuration-guide)
- [Scripts & Commands](#scripts--commands)
- [Complete Workflows](#complete-workflows)
- [API Reference](#api-reference)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)
- [Support](#support)

## Overview

This framework provides a unified approach to fine-tuning LLMs for various NLP tasks. It's designed to be:

- **Task-Agnostic**: Same pipeline structure for different task types
- **Configuration-Driven**: YAML-based configuration for all parameters
- **Developer-Friendly**: Clear scripts and comprehensive logging
- **Production-Ready**: Built-in validation, error handling, and optimization

## Architecture

The framework follows a **modular pipeline architecture**:

```
Raw Data → Data Processing → Model Training → Inference/Evaluation
    ↓              ↓              ↓              ↓
  JSONL/CSV    HuggingFace    Trained      Ready for
  Files        Datasets       Models       Production
```

### Core Components

1. **Data Processors**: Convert raw data to training-ready formats
2. **Training Pipelines**: Task-specific training with optimization
3. **Inference Engines**: Production-ready text generation/classification
4. **Configuration Management**: YAML-based parameter control
5. **Utility Scripts**: Command-line interfaces for all operations

## Task Types

### 1. Classification Task

**Purpose**: Text classification, sentiment analysis, topic categorization

**Data Format**:
```jsonl
{"text": "I love this product!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
```

**Output**: Classification probabilities and predicted labels

**Use Cases**: Sentiment analysis, spam detection, content moderation

### 2. Completion Task

**Purpose**: Text generation, story completion, code generation

**Data Format**:
```jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}
```

**Output**: Generated text continuations

**Use Cases**: Creative writing, code completion, content generation

### 3. Styling Task

**Purpose**: Style transfer, tone modification, writing style adaptation

**Data Format**:
```jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
```

**Output**: Text rewritten in target style

**Use Cases**: Formalization, casualization, domain adaptation

### 4. Matching Task

**Purpose**: Semantic similarity, question-answer matching, paraphrase detection

**Data Format**:
```jsonl
{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}
```

**Output**: Similarity scores or binary classifications

**Use Cases**: Search relevance, duplicate detection, semantic matching
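
The four data formats above share one requirement: every line must be a JSON object containing the fields for its task. The following is a minimal sketch of a pre-flight check for that requirement. `REQUIRED_FIELDS` and `find_invalid_lines` are hypothetical names used for illustration and are not part of the framework's API.

```python
import json
from typing import Dict, List

# Required fields per task type, taken from the data formats shown above.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "classification": ["text", "label"],
    "completion": ["prompt", "completion"],
    "styling": ["text", "styled_text"],
    "matching": ["text1", "text2", "label"],
}


def find_invalid_lines(jsonl_text: str, task: str) -> List[str]:
    """Return one message per line that is not a JSON object or lacks a field."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS[task] if name not in record]
        if missing:
            problems.append(f"line {number}: missing {missing}")
    return problems


sample = '{"text": "I love this product!", "label": "positive"}\n{"text": "No label here"}'
print(find_invalid_lines(sample, "classification"))
```

Running a check like this before data processing can surface field-name mismatches early, before they appear as a `KeyError` during training.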

## Quick Start

### Prerequisites

```bash
# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch, transformers, datasets; print('✅ All packages installed')"
```

### Basic Workflow

```bash
# 1. Process data
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml

# 2. Train model
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml

# 3. Run inference
python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml
```

## Configuration Guide

### YAML Structure

All configurations follow this hierarchical structure:

```yaml
# Task Configuration
task:
  name: "task_type"                    # classification, completion, styling, matching
  type: "specific_type"                # e.g., "sentiment_analysis", "style_transfer"

# Data Configuration
data:
  source: "custom"                     # "custom" or "huggingface"
  data_path: "./data/raw/..."          # Path to raw data
  input_field: "text"                  # Field name for input
  output_field: "label"                # Field name for output
  instruction: "Task instruction"      # For instruction-following tasks

# Model Configuration
model:
  name: "model_name"                   # HuggingFace model identifier
  max_seq_length: 2048                 # Maximum sequence length
  dtype: null                          # Data type (auto-detected)
  load_in_4bit: true                   # 4-bit quantization

# Training Configuration
training:
  num_epochs: 3                        # Training epochs
  batch_size: 4                        # Batch size
  learning_rate: 2e-4                  # Learning rate
  warmup_steps: 5                      # Warmup steps
  max_steps: 60                        # Maximum training steps

# Inference Configuration
inference:
  batch_size: 32                       # Inference batch size
  max_new_tokens: 128                  # Max tokens to generate
  temperature: 0.8                     # Sampling temperature
```
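
Keys left out of a config presumably fall back to the defaults listed in the parameter tables. How the framework does this is not documented here, so the following is only a sketch of one common approach: a recursive merge of a user config over a defaults dictionary. It assumes the YAML file has already been parsed into a plain `dict`; `DEFAULTS` and `merge_config` are hypothetical names.

```python
from typing import Any, Dict

# A subset of the defaults listed in the parameter tables of this README.
DEFAULTS: Dict[str, Any] = {
    "model": {"max_seq_length": 2048, "dtype": None, "load_in_4bit": True},
    "training": {"num_epochs": 1, "batch_size": 2, "learning_rate": 2e-4},
}


def merge_config(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `overrides` win, section by section."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge nested section
        else:
            merged[key] = value  # scalar or new key: take the override
    return merged


# Stand-in for a parsed YAML file that sets only a few keys.
user_config = {
    "model": {"name": "gpt2-medium", "max_seq_length": 1024},
    "training": {"num_epochs": 2},
}
config = merge_config(DEFAULTS, user_config)
print(config["model"])
print(config["training"])
```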

### Configuration Parameters

#### Data Processing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `source` | string | "custom" | Data source type |
| `data_path` | string | required | Path to raw data file |
| `input_field` | string | "text" | Input field name |
| `output_field` | string | "label" | Output field name |
| `instruction` | string | task-specific | Task instruction |
| `data_format` | string | "jsonl" | Data file format |
| `max_length` | int | 256 | Maximum text length |
| `min_length` | int | 10 | Minimum text length |
| `clean_text` | boolean | true | Enable text cleaning |
| `lowercase` | boolean | false | Convert to lowercase |

#### Model Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `name` | string | required | HuggingFace model name |
| `max_seq_length` | int | 2048 | Maximum sequence length |
| `dtype` | string | null | Data type (auto-detected) |
| `load_in_4bit` | boolean | true | Enable 4-bit quantization |
| `token` | string | null | HuggingFace access token |

#### Training Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_epochs` | int | 1 | Number of training epochs |
| `batch_size` | int | 2 | Training batch size |
| `learning_rate` | float | 2e-4 | Learning rate |
| `weight_decay` | float | 0.01 | Weight decay |
| `warmup_steps` | int | 5 | Warmup steps |
| `max_steps` | int | 60 | Maximum training steps |
| `gradient_accumulation_steps` | int | 4 | Gradient accumulation |
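
`batch_size`, `gradient_accumulation_steps`, and `max_steps` interact: the first two determine how many samples contribute to each optimizer update, and the third caps the number of updates. The arithmetic can be sketched as below. It assumes that one "step" means one optimizer update, which is how Hugging Face `Trainer` counts `max_steps`; whether this framework counts the same way is not confirmed here. The function names are hypothetical.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples that contribute to a single optimizer update."""
    return batch_size * gradient_accumulation_steps * num_devices


def samples_seen(batch_size: int, gradient_accumulation_steps: int,
                 max_steps: int) -> int:
    """Total samples consumed on one device after `max_steps` updates."""
    return effective_batch_size(batch_size, gradient_accumulation_steps) * max_steps


# Defaults from the table: batch_size=2, gradient_accumulation_steps=4, max_steps=60
print(effective_batch_size(2, 4))  # 8
print(samples_seen(2, 4, 60))      # 480
```

With the defaults, a run therefore touches at most 480 samples, so on larger datasets `max_steps` rather than `num_epochs` is likely to decide when training stops.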

#### LoRA Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `lora_r` | int | 16 | LoRA rank |
| `lora_alpha` | int | 16 | LoRA alpha |
| `lora_dropout` | float | 0 | LoRA dropout |
| `target_modules` | list | ["q_proj", "k_proj", "v_proj", "o_proj"] | Target modules for LoRA |
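
To give a feel for what `lora_r` and `lora_alpha` control: in standard LoRA, each targeted linear layer gains two small matrices, `A` (`r × in_features`) and `B` (`out_features × r`), and their product is scaled by `alpha / r`. The sketch below works through the numbers for a hypothetical 4096 × 4096 projection layer; the layer size is an assumption chosen for illustration, not a property of any specific model.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / lora_r


def lora_param_count(in_features: int, out_features: int, lora_r: int) -> int:
    """Trainable parameters added to one linear layer.

    LoRA adds two matrices: A (r x in_features) and B (out_features x r).
    """
    return lora_r * (in_features + out_features)


# Hypothetical 4096 x 4096 projection with the defaults from the table.
added = lora_param_count(4096, 4096, lora_r=16)
full = 4096 * 4096
print(lora_scaling(16, 16))          # 1.0
print(added, f"{added / full:.2%}")  # 131072 0.78%
```

Doubling `lora_r` doubles the added parameters; keeping `lora_alpha` equal to `lora_r`, as in the defaults, keeps the scaling factor at 1.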

### Environment Variables

```bash
# HuggingFace token for gated models
export HF_TOKEN="hf_..."

# CUDA device selection
export CUDA_VISIBLE_DEVICES="0"

# Logging level
export LOG_LEVEL="INFO"
```
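
How the scripts consume these variables is not shown in this README. As a rough sketch, a script could read them as follows, falling back to `INFO` when `LOG_LEVEL` is unset or unrecognized. `read_environment` is a hypothetical helper, not a framework function.

```python
import logging
import os
from typing import Any, Dict, Mapping

# Level names accepted for LOG_LEVEL.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def read_environment(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect the settings this README documents from environment variables."""
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    return {
        "hf_token": env.get("HF_TOKEN"),  # None when no token is set
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        "log_level": LEVELS.get(level_name, logging.INFO),
    }


settings = read_environment()
logging.basicConfig(level=settings["log_level"])
logging.getLogger(__name__).info("Token configured: %s", settings["hf_token"] is not None)
```

Note that `CUDA_VISIBLE_DEVICES` only takes effect if it is set before CUDA is initialized, so it is best exported in the shell rather than assigned inside a running script.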

## Scripts & Commands

### Data Processing Scripts

#### Basic Usage

```bash
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
```

#### Advanced Options

```bash
python scripts/[task_type]/data_processor.py \
  --config configs/[task_type]/[config].yaml \
  --max-samples 1000 \
  --log-level DEBUG \
  --create-hf-dataset \
  --hf-dataset-path ./datasets/[task_name]
```

#### Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--max-samples` | int | all | Maximum samples to process |
| `--log-level` | string | "INFO" | Logging level |
| `--create-hf-dataset` | flag | false | Create HuggingFace dataset |
| `--hf-dataset-path` | string | auto | HuggingFace dataset path |

### Training Scripts

#### Basic Usage

```bash
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
```

#### Advanced Options

```bash
python scripts/[task_type]/train.py train \
  --config configs/[task_type]/[config].yaml \
  --epochs 5 \
  --batch-size 8 \
  --learning-rate 1e-4 \
  --max-steps 100
```

#### Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--epochs` | int | YAML value | Override training epochs |
| `--batch-size` | int | YAML value | Override batch size |
| `--learning-rate` | float | YAML value | Override learning rate |
| `--max-steps` | int | YAML value | Override max steps |
| `--output-dir` | string | YAML value | Override output directory |
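
The "YAML value" defaults in the table suggest that a flag overrides the config only when it is passed explicitly. One way to get that behavior with `argparse` is sketched below. This is not the framework's actual parser: `build_parser`, `apply_overrides`, and the mapping of `--output-dir` to `model_output_dir` are assumptions made for illustration.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    """Parser with a `train` subcommand mirroring the table above."""
    parser = argparse.ArgumentParser(prog="train.py")
    commands = parser.add_subparsers(dest="command", required=True)
    train = commands.add_parser("train")
    train.add_argument("--config", required=True)
    # No defaults here: None means "not passed, keep the YAML value".
    train.add_argument("--epochs", type=int)
    train.add_argument("--batch-size", type=int)
    train.add_argument("--learning-rate", type=float)
    train.add_argument("--max-steps", type=int)
    train.add_argument("--output-dir")
    return parser


# argparse attribute -> key in the YAML `training` section (assumed mapping).
OVERRIDES = {
    "epochs": "num_epochs",
    "batch_size": "batch_size",
    "learning_rate": "learning_rate",
    "max_steps": "max_steps",
    "output_dir": "model_output_dir",
}


def apply_overrides(training: Dict[str, Any], args: argparse.Namespace) -> Dict[str, Any]:
    """Return a copy of the training section with explicit flags applied."""
    merged = dict(training)
    for attribute, key in OVERRIDES.items():
        value = getattr(args, attribute)
        if value is not None:
            merged[key] = value
    return merged


args = build_parser().parse_args(
    ["train", "--config", "configs/classification/sentiment.yaml", "--epochs", "5"]
)
print(apply_overrides({"num_epochs": 3, "batch_size": 8}, args))
# {'num_epochs': 5, 'batch_size': 8}
```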

### Inference Scripts

#### Basic Usage

```bash
python scripts/[task_type]/inference.py infer \
  --config configs/[task_type]/[config].yaml \
  --input-text "Your input text here"
```

#### Advanced Options

```bash
python scripts/[task_type]/inference.py infer \
  --config configs/[task_type]/[config].yaml \
  --input-text "Your input text here" \
  --max-tokens 256 \
  --temperature 0.7 \
  --stream
```

#### Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--input-text` | string | required | Text to process |
| `--max-tokens` | int | 128 | Maximum tokens to generate |
| `--temperature` | float | 0.8 | Sampling temperature |
| `--stream` | flag | false | Enable streaming generation |
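
The effect of `--temperature` is easiest to see on a toy example. In temperature sampling, logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The sketch below uses made-up logits for three candidate tokens; it illustrates the general technique and is not code from this framework.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which tends to give more deterministic output; higher temperatures give more varied output.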

### Batch Processing

```bash
# Process multiple inputs from file
python scripts/[task_type]/inference.py batch \
  --config configs/[task_type]/[config].yaml \
  --input-file input.txt \
  --output-file output.txt
```
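
The layout of `input.txt` is not specified in this README. Assuming one input per line, batch mode could be approximated by the sketch below, which groups non-empty lines into batches and writes one result per line. `batched`, `run_batch`, and the `predict` callable are hypothetical stand-ins for the framework's inference call.

```python
import io
from typing import Callable, Iterable, Iterator, List, TextIO


def batched(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group non-empty, stripped lines into lists of at most `batch_size`."""
    batch: List[str] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def run_batch(source: TextIO, sink: TextIO,
              predict: Callable[[List[str]], List[str]], batch_size: int = 32) -> int:
    """Write one prediction per input line; return the number written."""
    written = 0
    for batch in batched(source, batch_size):
        for result in predict(batch):
            sink.write(result + "\n")
            written += 1
    return written


# In-memory stand-ins for input.txt and output.txt; upper-casing replaces the model.
source = io.StringIO("Hey there!\n\nI'm gonna go\n")
sink = io.StringIO()
count = run_batch(source, sink, lambda texts: [t.upper() for t in texts], batch_size=2)
print(count, repr(sink.getvalue()))
```

The default of 32 mirrors the `inference.batch_size` value in the YAML structure shown earlier.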

### Interactive Mode

```bash
# Enter interactive mode for testing
python scripts/[task_type]/inference.py interactive \
  --config configs/[task_type]/[config].yaml
```

## Complete Workflows

### Classification Task Workflow

#### 1. Data Preparation

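Save the following records as `data/raw/classification/sentiment.jsonl` (JSONL has no comment syntax, so the path is given here rather than inside the file):

```jsonl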
{"text": "I love this movie!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
{"text": "It's okay", "label": "neutral"}
```

#### 2. Configuration

```yaml
# configs/classification/sentiment.yaml
task:
  name: "classification"
  type: "sentiment_analysis"

data:
  source: "custom"
  data_path: "./data/raw/classification/sentiment.jsonl"
  input_field: "text"
  output_field: "label"
  instruction: "Classify the sentiment of the following text"

model:
  name: "microsoft/DialoGPT-medium"
  max_seq_length: 512

training:
  num_epochs: 3
  batch_size: 8
  learning_rate: 3e-5
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml

# Train model
python scripts/classification/train.py train --config configs/classification/sentiment.yaml

# Run inference
python scripts/classification/inference.py infer \
  --config configs/classification/sentiment.yaml \
  --input-text "This product exceeded my expectations!"
```

### Styling Task Workflow

#### 1. Data Preparation

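Save the following records as `data/raw/styling/formal.jsonl` (JSONL has no comment syntax, so the path is given here rather than inside the file):

```jsonl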
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
{"text": "This is cool", "styled_text": "This is quite impressive"}
```

#### 2. Configuration

```yaml
# configs/styling/formal.yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data/raw/styling/formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"

model:
  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  num_epochs: 3
  batch_size: 4
  learning_rate: 2e-4
  model_output_dir: "./models/styling"
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Train model
python scripts/styling/train.py train --config configs/styling/formal.yaml

# Run inference
python scripts/styling/inference.py infer \
  --config configs/styling/formal.yaml \
  --instruction "Rewrite in formal style" \
  --input-text "Hey there! What's up?"
```

### Completion Task Workflow

#### 1. Data Preparation

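Save the following records as `data/raw/completion/story.jsonl` (JSONL has no comment syntax, so the path is given here rather than inside the file):

```jsonl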
{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."}
{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."}
```

#### 2. Configuration

```yaml
# configs/completion/story.yaml
task:
  name: "completion"
  type: "story_generation"

data:
  source: "custom"
  data_path: "./data/raw/completion/story.jsonl"
  input_field: "prompt"
  output_field: "completion"

model:
  name: "gpt2-medium"
  max_seq_length: 1024

training:
  num_epochs: 2
  batch_size: 16
  learning_rate: 5e-5
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/completion/data_processor.py --config configs/completion/story.yaml

# Train model
python scripts/completion/train.py train --config configs/completion/story.yaml

# Run inference
python scripts/completion/inference.py infer \
  --config configs/completion/story.yaml \
  --input-text "The wizard cast a spell"
```

## API Reference

### Data Processing Classes

#### BaseDataProcessor

```python
from typing import Any, Dict, List, Tuple

# Signatures only; method bodies are omitted.
class BaseDataProcessor:
    def __init__(self, config: Dict[str, Any]): ...
    def load_and_preprocess(self) -> Tuple[Dict, Dict]: ...
    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]: ...
    def save_data(self, data: Dict, output_path: str): ...
```

#### ClassificationDataProcessor

```python
class ClassificationDataProcessor(BaseDataProcessor):
    def convert_to_classification_format(self, data: Dict) -> Dict: ...
    def create_label_mapping(self, labels: List[str]) -> Dict[str, int]: ...
```
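
The signature of `create_label_mapping` implies a mapping from label strings to integer class IDs. A minimal standalone sketch of such a mapping is shown below. Sorting the labels so that IDs are reproducible across runs is an assumption; the actual method may order labels differently.

```python
from typing import Dict, List


def create_label_mapping(labels: List[str]) -> Dict[str, int]:
    """Assign a stable integer ID to each distinct label (sorted alphabetically)."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


labels = ["positive", "negative", "neutral", "positive"]
mapping = create_label_mapping(labels)
print(mapping)  # {'negative': 0, 'neutral': 1, 'positive': 2}

# The inverse mapping turns predicted IDs back into label strings.
id_to_label = {index: label for label, index in mapping.items()}
print(id_to_label[2])  # positive
```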

#### StylingDataProcessor

```python
class StylingDataProcessor(BaseDataProcessor):
    def convert_to_alpaca_format(self, data: Dict) -> Dict: ...
    def format_for_training(self, data: Dict) -> Dict: ...
```
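
The name `convert_to_alpaca_format` points to the Stanford Alpaca record layout (`instruction`, `input`, `output`). The sketch below assumes that layout and the standard Alpaca prompt template with an input field. `to_alpaca` and `render_prompt` are hypothetical helpers; the framework's own conversion may differ in detail.

```python
from typing import Dict

# Standard Alpaca prompt template (variant with an input field).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def to_alpaca(record: Dict[str, str], instruction: str,
              input_field: str = "text", output_field: str = "styled_text") -> Dict[str, str]:
    """Map a raw styling record onto Alpaca's three fields."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


def render_prompt(example: Dict[str, str]) -> str:
    """Fill the template to produce one training string."""
    return ALPACA_TEMPLATE.format(**example)


raw = {"text": "Hey there!", "styled_text": "Hello, how are you?"}
example = to_alpaca(raw, "Rewrite the following text in a formal style")
print(render_prompt(example))
```

The `input_field` and `output_field` arguments correspond to the keys of the same names in the YAML `data` section.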

### Training Classes

#### BaseTrainer

```python
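from typing import Any, Dict
from datasets import Dataset

# Signatures only; method bodies are omitted.
class BaseTrainer:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def setup_training(self, dataset: Dataset): ...
    def train(self, dataset_path: str) -> Dict: ...
    def save_model(self): ...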
```

#### ClassificationTrainer

```python
class ClassificationTrainer(BaseTrainer):
    def setup_classification_head(self): ...
    def compute_metrics(self, eval_pred) -> Dict: ...
```
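
A `compute_metrics(eval_pred)` hook conventionally receives a `(logits, labels)` pair and returns a dictionary of metric names to values. The sketch below computes accuracy that way using plain Python lists so it runs without extra dependencies. It is a standalone illustration under that assumption, not the framework's implementation, which may report additional metrics.

```python
from typing import Dict, Sequence, Tuple


def compute_metrics(eval_pred: Tuple[Sequence[Sequence[float]], Sequence[int]]) -> Dict[str, float]:
    """Accuracy from per-class logits and integer labels."""
    logits, labels = eval_pred
    if len(labels) == 0:
        return {"accuracy": 0.0}
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == gold) for pred, gold in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Three examples, two classes; the third prediction (class 1) is wrong.
logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
labels = [1, 0, 0]
print(compute_metrics((logits, labels)))
```

For imbalanced label distributions, accuracy alone can be misleading, so per-class precision and recall are worth adding.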

#### StylingTrainer

```python
class StylingTrainer(BaseTrainer):
    def setup_lora(self): ...
    def format_dataset(self, dataset: Dataset) -> Dataset: ...
```

### Inference Classes

#### BaseInference

```python
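from typing import Any, Dict
import torch

# Signatures only; method bodies are omitted.
class BaseInference:
    def __init__(self, config: Dict[str, Any]): ...
    def load_model_and_tokenizer(self): ...
    def preprocess_input(self, input_text: str) -> torch.Tensor: ...
    def postprocess_output(self, output: torch.Tensor) -> str: ...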
```

#### ClassificationInference

```python
class ClassificationInference(BaseInference):
    def classify(self, text: str) -> Dict[str, float]: ...
    def batch_classify(self, texts: List[str]) -> List[Dict]: ...
```

#### StylingInference

```python
class StylingInference(BaseInference):
    def style_transfer(self, text: str, instruction: str) -> str: ...
    def generate_text(self, instruction: str, input_text: str) -> str: ...
```

## Troubleshooting

### Common Issues

#### 1. Model Loading Errors

**Error**: `FileNotFoundError: ./models/[task_name]/*.json`

**Solution**:
- Verify model was trained successfully
- Check `model_output_dir` in YAML config
- Ensure model files exist in specified directory

#### 2. Memory Issues

**Error**: `CUDA out of memory`

**Solution**:
- Reduce `batch_size` in YAML config
- Enable `load_in_4bit: true`
- Use gradient accumulation
- Reduce `max_seq_length`
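
To see why `load_in_4bit` helps, it is useful to estimate the memory taken by the model weights alone. The sketch below multiplies the parameter count by the bytes per parameter for each precision. The 8-billion-parameter figure is a round hypothetical number, and the result is a lower bound: activations, gradients, optimizer state, and quantization overhead all come on top.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "4bit": 0.5,
}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 8_000_000_000  # hypothetical model with about 8 billion parameters
for dtype in ("float32", "float16", "4bit"):
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB in `float16` to about 4 GB in 4-bit, which is often the difference between fitting on a single consumer GPU and not.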

#### 3. Data Format Errors

**Error**: `KeyError: 'input_field'`

**Solution**:
- Verify field names in JSONL/CSV files
- Check `input_field` and `output_field` in YAML
- Ensure data format matches expected structure

#### 4. Training Convergence Issues

**Symptoms**: Loss not decreasing, poor model performance

**Solution**:
- Adjust learning rate (try 1e-5 to 5e-4)
- Increase training epochs
- Check data quality and quantity
- Verify label distribution (for classification)

### Debug Mode

Enable detailed logging:

```bash
export LOG_LEVEL="DEBUG"
python scripts/[task_type]/[script].py --config configs/[task_type]/[config].yaml --log-level DEBUG
```

### Performance Optimization

#### Memory Optimization

```yaml
model:
  load_in_4bit: true          # 4-bit quantization
  dtype: "float16"            # Use float16 if supported

training:
  gradient_accumulation_steps: 4  # Effective batch size = batch_size * gradient_accumulation_steps
  max_grad_norm: 1.0         # Gradient clipping
```

#### Speed Optimization

```yaml
training:
  dataloader_num_workers: 4   # Parallel data loading
  fp16: true                  # Mixed precision training
  bf16: false                 # Disable bfloat16 if not supported
```

## Contributing

### Adding New Task Types

1. **Create task directory structure**:
```
pipelines/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py

scripts/[new_task]/
├── __init__.py
├── data_processor.py
├── train.py
└── inference.py

configs/[new_task]/
└── example.yaml
```

2. **Implement base classes**:
- Extend `BaseDataProcessor`
- Extend `BaseTrainer`
- Extend `BaseInference`

3. **Add configuration templates**:
- Define task-specific parameters
- Document all configuration options

4. **Update documentation**:
- Add task description to README
- Include usage examples
- Document configuration parameters

### Code Style

- Follow PEP 8 guidelines
- Use type hints for all functions
- Include comprehensive docstrings
- Add unit tests for new functionality

### Testing

```bash
# Run all tests
python -m pytest tests/

# Run specific task tests
python -m pytest tests/[task_type]/

# Run with coverage
python -m pytest --cov=pipelines tests/
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support

- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
- **Discussions**: [GitHub Discussions](https://github.com/your-repo/discussions)
- **Documentation**: [Wiki](https://github.com/your-repo/wiki)

---

**Happy fine-tuning! 🚀**