# Fine-Tune Task: NLP Pipeline Framework

A framework for fine-tuning NLP models, driven by per-task YAML configurations. Classification and styling are implemented; completion and matching are planned.

## Supported Tasks

The framework is organized around four task families, each with its own configuration directory:

- **Classification**: Text classification, sentiment analysis, topic classification
- **Completion**: Text generation, code completion, story generation
- **Styling**: Style transfer, tone classification, writing style adaptation
- **Matching**: Semantic matching, entity matching, similarity scoring

### Current Implementation Status

- **Classification**: ✅ Fully implemented with emotion classification example
- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
- **Completion**: Planned for future updates
- **Matching**: Planned for future updates

**Note**: Only classification and styling can be run today. The completion and matching entries in the project structure are placeholders for future updates.

## Project Structure

```
fine-tune-task/
├── configs/                    # YAML configuration files
│   ├── classification/         # ✅ Implemented
│   │   ├── emotion.yaml       # Emotion classification
│   │   └── custom.yaml        # Custom dataset
│   ├── styling/               # ✅ Implemented
│   │   └── formal.yaml        # Formal style transfer
│   ├── completion/             # Planned for future updates
│   └── matching/              # Planned for future updates
├── data/                       # Data directories
│   ├── raw/                    # Raw input data
│   │   ├── classification/     # ✅ Implemented
│   │   ├── styling/           # ✅ Implemented
│   │   ├── completion/         # Planned for future updates
│   │   └── matching/          # Planned for future updates
│   └── processed/              # Processed data
│       ├── classification/     # ✅ Implemented
│       ├── styling/           # ✅ Implemented
│       ├── completion/         # Planned for future updates
│       └── matching/          # Planned for future updates
├── pipelines/                  # Core pipeline scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py  # Data processing
│   │   ├── train.py          # Training
│   │   └── inference.py      # Inference
│   ├── styling/               # ✅ Implemented
│   │   ├── data_processor.py  # Style data processing
│   │   ├── train.py          # LoRA fine-tuning
│   │   └── inference.py      # Style transfer inference
│   ├── completion/            # Planned for future updates
│   └── matching/             # Planned for future updates
├── scripts/                    # User-friendly scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py  # Data processing script
│   │   ├── trainer.py        # Training script
│   │   └── inference.py      # Inference script
│   ├── styling/               # ✅ Implemented
│   │   ├── data_processor.py  # Style data processing script
│   │   ├── train.py          # Training script
│   │   └── inference.py      # Inference script
│   ├── completion/            # Planned for future updates
│   └── matching/             # Planned for future updates
├── results/                    # Model outputs
│   ├── classification/         # ✅ Implemented
│   ├── styling/              # ✅ Implemented
│   ├── completion/            # Planned for future updates
│   └── matching/             # Planned for future updates
└── utils/                      # Shared utility modules
```

## Quick Start (Classification Task)

### 1. Setup Environment

```bash
# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.
```

### 2. Data Processing

```bash
# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000

# Check output location
ls -la ./data/processed/classification/emotion/classification/
```

**Expected Output:**
```
Data processing completed successfully!
  Data source: huggingface
  Dataset: dair-ai/emotion
  Total samples: 2999
  Unique labels: 6
  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
  Output directory: ./data/processed/classification/emotion
```
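To confirm that the files on disk match the reported split sizes, a small check like the one below can help. It is a sketch under assumptions: each split is taken to be a `<split>.jsonl` file with one JSON object per line, and `label` is an assumed field name that may differ from what the processor actually writes.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_split(path):
    """Return (record_count, label_counts) for one JSONL split file."""
    labels = Counter()
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            labels[record.get("label")] += 1  # "label" is an assumed field name
            count += 1
    return count, labels


def summarize_dir(split_dir, splits=("train", "validation", "test")):
    """Summarize every split file that exists in split_dir."""
    summary = {}
    for split in splits:
        path = Path(split_dir) / f"{split}.jsonl"
        if path.exists():
            summary[split] = summarize_split(path)
    return summary
```

Point `summarize_dir` at the directory listed by the `ls` command above and compare the counts with the `Split sizes` line.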

### 3. Model Training

```bash
# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32

# Check model output
ls -la ./results/classification/emotion_model/
```

**Expected Output:**
```
Training completed successfully!
  Model: bert-base-uncased
  Data directory: ./data/processed/classification/emotion
  Training for 3 epochs with batch size 16
  Model saved to: ./results/classification/emotion_model
```

### 4. Model Inference

```bash
# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
```

**Expected Output:**
```
Inference completed successfully!
  Loading model from: ./results/classification/emotion_model
  Predicted label: joy
  Confidence: 0.8542
  Top 3 predictions:
    - joy: 0.8542
    - love: 0.1234
    - surprise: 0.0224
```
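The confidence and top-3 values above are the kind of numbers a softmax over the classifier's logits produces. The sketch below shows that calculation in plain Python. It is illustrative rather than the pipeline's actual code: the helper names are hypothetical, and the label order is assumed to follow the `dair-ai/emotion` dataset.

```python
import math

# Label order of the dair-ai/emotion dataset (assumed to match the trained model).
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_predictions(logits, labels=EMOTION_LABELS, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The first pair returned is the predicted label with its confidence, and `k` plays the role of `return_top_k` in the inference configuration.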

## Quick Start (Styling Task)

### 1. Setup Environment

```bash
# Install dependencies (including unsloth for styling)
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.
```

### 2. Data Processing

```bash
# Process style transfer dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Create HuggingFace dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset

# Check output location
ls -la ./data/processed/styling/formal/
```

**Expected Output:**
```
Styling data processing completed successfully!
  Data source: custom
  Data file: ./data/raw/styling/sample_formal.jsonl
  Total samples: 5
  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
  Output directory: ./data/processed/styling/formal
  Style instruction: Rewrite the following text in a formal style
```
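For intuition about where the split sizes come from, here is a minimal shuffle-and-cut sketch. The ratios and seed are assumptions chosen so that five samples split 3/1/1 as in the output above; the actual processor may use different ratios or a different strategy.

```python
import random


def split_records(records, train_ratio=0.6, val_ratio=0.2, seed=42):
    """Shuffle records deterministically and cut them into three splits."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to test
    }
```

Fixing the seed keeps the splits reproducible across runs, which matters when comparing models trained on the same data.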

### 3. Model Training

```bash
# Train using processed data (automatically loads from YAML output_dir)
python scripts/styling/train.py example

# Custom training
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4

# Check model output
ls -la ./models/styling/
```

**Expected Output:**
```
Training completed successfully!
  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
  Dataset: Loaded from ./data/processed/styling/formal
  Training for 3 epochs with batch size 4
  Model saved to: ./models/styling
```

### 4. Model Inference

```bash
# Single text style transfer
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"

# Batch processing
python scripts/styling/inference.py batch

# Interactive mode
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
```

**Expected Output:**
```
Inference completed successfully!
  Input: Hey, what's up?
  Output: Hello, how are you doing?
  Style: Formal
```

## Adding New Tasks

To add a new task (e.g., completion or matching), follow the steps below. The styling task, which is already implemented, serves as a reference.

### Example: Styling Task (Already Implemented)

The styling task demonstrates a complete implementation:

1. **Task Directory Structure** ✅
```bash
configs/styling/           # YAML configurations
data/raw/styling/         # Raw style transfer data
data/processed/styling/   # Processed data
pipelines/styling/        # Core pipeline scripts
scripts/styling/          # User-friendly scripts
models/styling/           # Trained models
```

2. **Pipeline Components** ✅
- **Data Processor**: Handles style transfer datasets with instruction/input/output format
- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
- **Inference**: Style transfer with streaming and batch processing

3. **Key Features** ✅
- Automatic EOS token handling: `text + tokenizer.eos_token`
- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
- YAML integration: Uses `data.output_dir` for automatic dataset loading
- HuggingFace dataset export and loading
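The EOS handling and batched mapping listed above can be sketched as follows. The template wording and the `make_formatting_func` helper are assumptions for illustration; only the `formatting_prompts_func` name, the `text + tokenizer.eos_token` pattern, and the instruction/input/output fields come from this README.

```python
# Alpaca-style template; the exact wording used by the styling pipeline is an assumption.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)


def make_formatting_func(eos_token):
    """Build a batched formatter that appends eos_token to every example."""

    def formatting_prompts_func(examples):
        texts = []
        for instruction, source, target in zip(
            examples["instruction"], examples["input"], examples["output"]
        ):
            # Appending EOS lets the model learn where a response ends.
            texts.append(ALPACA_PROMPT.format(instruction, source, target) + eos_token)
        return {"text": texts}

    return formatting_prompts_func
```

With a loaded tokenizer, this would be applied as `dataset.map(make_formatting_func(tokenizer.eos_token), batched=True)`. Because `batched=True` passes columns as lists, the function iterates over them in parallel with `zip`.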

### For Other Tasks (completion, matching)

1. **Create Task Directory Structure**
```bash
# Create task directories
mkdir -p configs/completion
mkdir -p data/raw/completion data/processed/completion
mkdir -p pipelines/completion
mkdir -p scripts/completion
mkdir -p results/completion
mkdir -p tasks/completion
mkdir -p models/completion
```

2. **Create Task Configuration**

```bash
# Create YAML configuration for new task
cat > configs/completion/text_generation.yaml << 'EOF'
# Text Generation Task Configuration
task:
  name: "completion"
  type: "text_generation"

# Data Processing Configuration
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/completion/text_generation"
  max_samples: 1000
  # ... other data parameters

# Model Configuration
model:
  name: "gpt2"  # Different model for completion
  max_length: 1024
  # ... model parameters

# Training Configuration
training:
  num_epochs: 3
  batch_size: 8  # Smaller batch for generation
  learning_rate: 5e-5
  data_dir: "./data/processed/completion/text_generation"
  output_dir: "./results/completion/text_generation_model"

# Inference Configuration
inference:
  model_path: "./results/completion/text_generation_model"
  device: "auto"
  batch_size: 1  # Generation is typically one at a time
  max_length: 100
  temperature: 0.7
EOF
```

3. **Create Pipeline Scripts**

Copy and modify the classification pipeline scripts:

```bash
# Copy classification scripts as templates
cp pipelines/classification/data_processor.py pipelines/completion/
cp pipelines/classification/train.py pipelines/completion/
cp pipelines/classification/inference.py pipelines/completion/

# Copy task scripts
cp scripts/classification/data_processor.py scripts/completion/
cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/
```

4. **Modify Pipeline Code**

Update the pipeline scripts for your specific task:

1. **Data Processor** (`pipelines/completion/data_processor.py`):
   - Update data loading logic for completion datasets
   - Modify preprocessing for text generation
   - Adjust output format for completion tasks

2. **Trainer** (`pipelines/completion/train.py`):
   - Change model type to generation models (GPT, T5, etc.)
   - Update training loop for text generation
   - Modify evaluation metrics

3. **Inference** (`pipelines/completion/inference.py`):
   - Update inference for text generation
   - Add generation parameters (temperature, top-k, etc.)
   - Modify output format
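To make the generation parameters concrete, the sketch below samples one token index from a list of logits using temperature scaling and top-k filtering. It is a plain-Python illustration of the idea, not the code a generation library runs internally, and `sample_next_token` is a hypothetical helper name.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Sample a token index from logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Keep only the top_k highest-scoring token indices.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[: max(1, top_k)]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Setting `top_k=1` reduces this to greedy decoding, which is a convenient way to get deterministic output while debugging the inference script.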

5. **Update Task Scripts**

Modify the task scripts to use your new pipeline:

```python
# scripts/completion/data_processor.py
def run_with_yaml_config(config_path: str, **cli_overrides):
    cmd = [
        "python", "pipelines/completion/data_processor.py",  # Updated path
        "--config", config_path
    ]
    # ... rest of the function
```
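One way the elided part of such a wrapper might forward overrides is shown below. This is a sketch, not the repository's implementation: `build_pipeline_command` is a hypothetical helper, and the mapping from keyword names to flags (`max_samples` to `--max-samples`) is an assumed convention.

```python
import sys


def build_pipeline_command(pipeline_script, config_path, **cli_overrides):
    """Build the argv list for a pipeline script, appending CLI overrides."""
    cmd = [sys.executable, pipeline_script, "--config", config_path]
    for key, value in cli_overrides.items():
        if value is None:
            continue  # unset overrides leave the YAML value in effect
        # Assumed naming convention: max_samples -> --max-samples
        cmd.extend([f"--{key.replace('_', '-')}", str(value)])
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Using `sys.executable` instead of a bare `"python"` keeps the child process in the same virtual environment as the wrapper.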

6. **Create Task-Specific Models**

```bash
# Create model directory
mkdir -p models/completion

# Add task-specific model classes
cat > models/completion/text_generator.py << 'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

class TextGenerator:
    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

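    def generate(self, prompt, max_length=100, temperature=0.7):
        # Minimal sampling-based generation; one possible implementation
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(
            **inputs,
            max_length=max_length,
            temperature=temperature,
            do_sample=True,  # temperature only takes effect when sampling
            pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 has no pad token
        )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)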
EOF
```

7. **Test Your New Task**

```bash
# Test data processing
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml

# Test training
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml

# Test inference
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
```

## YAML Configuration Guide

### Configuration Structure

Each YAML file has five top-level sections: `task`, `data`, `model`, `training`, and `inference`.

```yaml
# Task Configuration
task:
  name: "classification"  # or "completion", "styling", "matching"
  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"

# Data Processing Configuration
data:
  source: "huggingface"                    # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion"         # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000                        # Limit dataset size
  # ... other data parameters

# Model Configuration
model:
  name: "bert-base-uncased"                # Model from HuggingFace Hub
  max_length: 512                          # Sequence length
  num_labels: 6                            # Number of classes

# Training Configuration
training:
  num_epochs: 3                            # Training epochs
  batch_size: 16                           # Batch size
  learning_rate: 2e-5                      # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"

# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto"                           # "auto", "cuda", "cpu"
  batch_size: 32                           # Inference batch size
  return_top_k: 3                          # Top K predictions
```
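The paths in this configuration chain together: `training.data_dir` repeats `data.output_dir`, and `inference.model_path` repeats `training.output_dir`. A small validator can catch a mismatch before a long run. The sketch below is a hypothetical helper, not part of the framework. It works on an already parsed dictionary (for example the result of PyYAML's `yaml.safe_load`) and follows the classification layout shown above.

```python
REQUIRED_SECTIONS = ("task", "data", "model", "training", "inference")


def validate_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = [
        f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in config
    ]
    data = config.get("data", {})
    training = config.get("training", {})
    inference = config.get("inference", {})
    if "source" not in data:
        problems.append("data.source is not set")
    if data.get("output_dir") != training.get("data_dir"):
        problems.append("training.data_dir does not match data.output_dir")
    if training.get("output_dir") != inference.get("model_path"):
        problems.append("inference.model_path does not match training.output_dir")
    return problems
```

An empty list means the checks passed. Running this before each step is cheap compared with discovering a wrong path after training.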

### Styling Configuration Example

```yaml
# Styling Task Configuration
task:
  name: "styling"
  type: "style_transfer"

# Data Processing Configuration
data:
  source: "custom"
  data_path: "./data/raw/styling/sample_formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"
  output_dir: "./data/processed/styling/formal"
  output_format: "alpaca"

# Model Configuration
model:
  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  training_max_seq_length: 2048
  training_load_in_4bit: true

# Training Configuration
training:
  num_epochs: 3
  batch_size: 2
  learning_rate: 2e-4
  weight_decay: 0.01

# Inference Configuration
inference:
  batch_size: 1
  max_new_tokens: 128
  temperature: 0.8
```
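The `input_field`, `output_field`, and `instruction` settings describe how a raw record becomes an instruction/input/output example. A minimal sketch of that mapping follows. The function names are hypothetical, and the Alpaca key names (`instruction`, `input`, `output`) are assumed to be what `output_format: "alpaca"` produces.

```python
import json


def to_alpaca_record(raw, instruction, input_field="text", output_field="styled_text"):
    """Map one raw record to the instruction/input/output layout."""
    return {
        "instruction": instruction,
        "input": raw[input_field],
        "output": raw[output_field],
    }


def convert_jsonl_lines(lines, instruction, **field_names):
    """Convert an iterable of JSONL strings, skipping blank lines."""
    records = []
    for line in lines:
        if line.strip():
            records.append(to_alpaca_record(json.loads(line), instruction, **field_names))
    return records
```

A raw line such as `{"text": "Hey, what's up?", "styled_text": "Hello, how are you doing?"}` would become one record whose `instruction` is the configured style instruction.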

### Available Configuration Files

- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
- `configs/classification/custom.yaml` - Custom dataset processing
- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning

## Usage Examples

### Data Processing Examples

```bash
# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500

# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion

# 4. Run examples
python scripts/classification/data_processor.py examples
```
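The override behavior in examples 2 and 3 amounts to "a CLI flag wins when it is passed, otherwise the YAML value stands". A sketch of that precedence rule using `argparse` is shown below. The flag-to-key mapping is an assumption based on the flags used in this README, and `apply_overrides` is a hypothetical helper.

```python
import argparse


def apply_overrides(config, argv):
    """Return a copy of config["data"] with explicitly passed CLI flags applied."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag omitted" apart from a real value.
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--dataset-name", default=None)
    parser.add_argument("--data-source", dest="source", default=None)
    args = parser.parse_args(argv)

    merged = dict(config.get("data", {}))
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value  # CLI wins over YAML
    return merged
```

Passing an empty config with `--data-source` and `--dataset-name` reproduces the CLI-only case, since every value then comes from the flags.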

### Training Examples

```bash
# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5

# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3

# 4. Run examples
python scripts/classification/trainer.py examples
```

### Inference Examples

```bash
# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml

# 4. Run examples
python scripts/classification/inference.py examples
```
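For file-based prediction, the general shape is "read one text per line, write one JSON object per line". The sketch below illustrates that loop with a pluggable `predict` callable. The input layout, the output field names, and the `predict_file` helper are all assumptions, so check the real `predictions.jsonl` before relying on them.

```python
import json


def predict_file(input_path, output_path, predict):
    """Run predict(text) on each non-empty line and write one JSON object per line."""
    written = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip blank lines in the input file
            label, confidence = predict(text)
            record = {"text": text, "label": label, "confidence": confidence}
            dst.write(json.dumps(record) + "\n")
            written += 1
    return written
```

Keeping the model call behind a callable makes the file handling testable with a stub before a trained model exists.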

### Styling Examples

```bash
# 1. Data Processing
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset

# 2. Training
python scripts/styling/train.py example
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2

# 3. Inference
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
python scripts/styling/inference.py batch
python scripts/styling/inference.py infer --config configs/styling/formal.yaml

# 4. Run examples
python scripts/styling/data_processor.py examples
python scripts/styling/train.py features
python scripts/styling/inference.py features
```

## Troubleshooting Common Errors

### 1. ModuleNotFoundError: No module named 'utils'

**Error:**
```
ModuleNotFoundError: No module named 'utils'
```

**Solution:**
```bash
# Set Python path before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
```

### 2. Model Path Not Found

**Error:**
```
Model path not found: ./results/classification/emotion_model
```

**Solution:**
```bash
# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml
```

### 3. Data Directory Not Found

**Error:**
```
Data directory not found: ./data/processed/classification/emotion
```

**Solution:**
```bash
# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
```

### 4. YAML Configuration Errors

**Error:**
```
data_processor.py: error: --data-source is required (either in YAML config or CLI)
```

**Solution:**
Check the YAML structure. The data source belongs under `data` as `source`, not as `data_source`:
```yaml
data:
  source: "huggingface"  # Not data_source
  dataset_name: "dair-ai/emotion"
```

### 5. HuggingFace Download Issues

**Error:**
```
KeyboardInterrupt during model download
```

**Solution:**
```bash
# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100

# Or keep downloads in a local cache directory so later runs can reuse them
export HF_HOME=./cache
```

### 6. CUDA/GPU Issues

**Error:**
```
RuntimeError: CUDA out of memory
```

**Solution:**
```bash
# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8

# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
```
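The `device: "auto"` setting in the inference configuration suggests a fallback of the following kind: use CUDA when it is available, otherwise the CPU. The sketch below shows one way to resolve it. `resolve_device` is a hypothetical helper, and the pipeline's real logic may differ.

```python
def resolve_device(requested="auto"):
    """Map the config's device setting to a concrete device name."""
    if requested != "auto":
        return requested  # explicit "cuda" or "cpu" is used as given
    try:
        import torch  # optional here; if absent, fall back to CPU
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Forcing `"cpu"` is slower but sidesteps GPU memory limits entirely, which makes it useful for checking that a configuration runs end to end.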

## Monitoring and Logs

### Check Processing Status

```bash
# Check data processing output
ls -la ./data/processed/classification/emotion/classification/

# Check training output
ls -la ./results/classification/emotion_model/

# Check logs
tail -f logs/training.log
```

### Expected File Structure After Processing

```
./data/processed/classification/emotion/classification/
├── train.jsonl       # Training data
├── validation.jsonl   # Validation data
└── test.jsonl        # Test data

./results/classification/emotion_model/
├── config.json       # Model configuration
├── pytorch_model.bin # Model weights
├── tokenizer.json    # Tokenizer
└── label_info.json   # Label mappings
```
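The listing above can be turned into an automated check. The sketch below reports which expected files are missing from a directory. The helper names are hypothetical, and accepting `model.safetensors` as an alternative weights file is an assumption, since the weights file name can depend on the installed `transformers` version.

```python
from pathlib import Path

EXPECTED_DATA_FILES = ("train.jsonl", "validation.jsonl", "test.jsonl")
EXPECTED_MODEL_FILES = ("config.json", "tokenizer.json", "label_info.json")
# Weights may appear under either name depending on the library version (assumption).
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def missing_files(directory, expected):
    """Return the expected file names that are absent from directory."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


def has_weights(model_dir):
    """True if at least one known weights file exists in model_dir."""
    return any((Path(model_dir) / name).is_file() for name in WEIGHT_FILES)
```

An empty list from `missing_files` together with `has_weights(...)` returning `True` indicates the step finished writing its outputs.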

## Workflow Summary

### Classification Task
1. **Setup**: Install dependencies and set PYTHONPATH
2. **Data Processing**: Process raw data into organized splits
3. **Training**: Train model using processed data
4. **Inference**: Use trained model for predictions
5. **Monitoring**: Check logs and outputs for errors

### Styling Task
1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
2. **Data Processing**: Process style transfer data with instruction/input/output format
3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
4. **Inference**: Style transfer with streaming and batch processing
5. **Monitoring**: Check training logs and model outputs

## Creating Custom Configurations

### For New Datasets

1. Copy an existing config:
```bash
cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
```

2. Modify parameters:
```yaml
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/classification/my_dataset"
  # ... other parameters

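training:
  data_dir: "./data/processed/classification/my_dataset"   # Must match data.output_dir
  output_dir: "./results/classification/my_dataset_model"

inference:
  model_path: "./results/classification/my_dataset_model"  # Must match training.output_dir
```

3. Run the pipeline:
```bash
python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
python scripts/classification/trainer.py --config configs/classification/my_dataset.yaml
python scripts/classification/inference.py --config configs/classification/my_dataset.yaml --input-text "Sample text"
```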

### For Custom Data

1. Set the custom data fields in `configs/classification/custom.yaml`:
```yaml
data:
  source: "custom"
  data_path: "./data/raw/my_data.jsonl"
  output_dir: "./data/processed/classification/my_custom_dataset"
```

2. Run processing:
```bash
python scripts/classification/data_processor.py --config configs/classification/custom.yaml
```

## Best Practices

1. **Check output directories** before running the next step
2. **Use small datasets for testing** before full runs
3. **Monitor logs** for errors and warnings
4. **Back up configurations** before major changes
5. **Use version control** for YAML files
6. **Test with CLI overrides** for quick experiments

## Support

For issues and questions:
1. Check the troubleshooting section above
2. Review logs in the output directories
3. Verify YAML configuration structure
4. Test with smaller datasets first

---

**Happy fine-tuning!**