A framework for fine-tuning NLP models, driven by organized YAML configurations and designed for multiple tasks (classification, completion, styling, matching).

Supported Tasks

The framework is organized around four task families, each with its own configurations:

Classification: Text classification, sentiment analysis, topic classification
Completion: Text generation, code completion, story generation
Styling: Style transfer, tone classification, writing style adaptation
Matching: Semantic matching, entity matching, similarity scoring

Current Implementation Status

Classification: Fully implemented, with an emotion classification example
Completion: Planned for future updates
Styling: Planned for future updates
Matching: Planned for future updates

Note: Only the classification task is currently supported. The other tasks (completion, styling, matching) are planned for future updates.

Project Structure

fine-tune-task/
├── configs/                    # YAML configuration files
│   ├── classification/         # Implemented
│   │   ├── emotion.yaml       # Emotion classification
│   │   └── custom.yaml        # Custom dataset
│   ├── completion/             # Planned for future updates
│   ├── styling/               # Planned for future updates
│   └── matching/              # Planned for future updates
├── data/                       # Data directories
│   ├── raw/                    # Raw input data
│   │   ├── classification/     # Implemented
│   │   ├── completion/         # Planned for future updates
│   │   ├── styling/           # Planned for future updates
│   │   └── matching/          # Planned for future updates
│   └── processed/              # Processed data
│       ├── classification/     # Implemented
│       ├── completion/         # Planned for future updates
│       ├── styling/           # Planned for future updates
│       └── matching/          # Planned for future updates
├── pipelines/                  # Core pipeline scripts
│   ├── classification/         # Implemented
│   │   ├── data_processor.py  # Data processing
│   │   ├── train.py          # Training
│   │   └── inference.py      # Inference
│   ├── completion/            # Planned for future updates
│   ├── styling/              # Planned for future updates
│   └── matching/             # Planned for future updates
├── scripts/                    # User-friendly scripts
│   ├── classification/         # Implemented
│   │   ├── data_processor.py  # Data processing script
│   │   ├── trainer.py        # Training script
│   │   └── inference.py      # Inference script
│   ├── completion/            # Planned for future updates
│   ├── styling/              # Planned for future updates
│   └── matching/             # Planned for future updates
├── results/                    # Model outputs
│   ├── classification/         # Implemented
│   ├── completion/            # Planned for future updates
│   ├── styling/              # Planned for future updates
│   └── matching/             # Planned for future updates
└── utils/                      # Shared utility modules

Quick Start (Classification Task)

1. Setup Environment

# Install dependencies
pip install -r requirements.txt

# Set Python path (run this from the repository root)
export PYTHONPATH=.

2. Data Processing

# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000

# Check output location
ls -la ./data/processed/classification/emotion/classification/

Expected Output:

Data processing completed successfully!
  Data source: huggingface
  Dataset: dair-ai/emotion
  Total samples: 2999
  Unique labels: 6
  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
  Output directory: ./data/processed/classification/emotion
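
To confirm that the reported split sizes match what is on disk, the JSONL files can be counted directly. The following is a minimal sketch, assuming each split is stored as one JSON object per line in `train.jsonl`, `validation.jsonl`, and `test.jsonl` (the file names listed under "Expected File Structure After Processing"). `count_split_sizes` is a hypothetical helper, not part of the framework.

```python
import json
from pathlib import Path

# Split file names as listed in this README; treated here as an assumption.
SPLITS = ("train", "validation", "test")


def count_split_sizes(split_dir):
    """Return {split: number of JSON lines} for every split file that exists."""
    sizes = {}
    for split in SPLITS:
        path = Path(split_dir) / f"{split}.jsonl"
        if not path.exists():
            continue  # a missing split is simply left out of the result
        count = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # ignore blank lines
                json.loads(line)  # raises ValueError on a malformed line
                count += 1
        sizes[split] = count
    return sizes
```

For the emotion example this could be called as `count_split_sizes("./data/processed/classification/emotion/classification")` and compared with the "Split sizes" line printed by the processor.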

3. Model Training

# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32

# Check model output
ls -la ./results/classification/emotion_model/

Expected Output:

Training completed successfully!
  Model: bert-base-uncased
  Data directory: ./data/processed/classification/emotion
  Training for 3 epochs with batch size 16
  Model saved to: ./results/classification/emotion_model

4. Model Inference

# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

Expected Output:

Inference completed successfully!
  Loading model from: ./results/classification/emotion_model
  Predicted label: joy
  Confidence: 0.8542
  Top 3 predictions:
    - joy: 0.8542
    - love: 0.1234
    - surprise: 0.0224
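
The confidence values above read like class probabilities. As a rough illustration of how such a top-k list could be derived from raw model logits, here is a minimal sketch that applies a softmax and sorts the result. The label order is that of the `dair-ai/emotion` dataset; `top_k_predictions` is a hypothetical helper, and the framework's actual inference code may differ.

```python
import math

# Label order of the dair-ai/emotion dataset (list index = class id).
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def top_k_predictions(logits, labels=EMOTION_LABELS, k=3):
    """Softmax the logits and return the k most probable (label, prob) pairs."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The `k` argument plays the role of `return_top_k` in the inference section of the YAML configuration.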

Adding New Tasks

To add a new task (e.g., completion, styling, matching), follow these steps:

Step 1: Create Task Directory Structure

# Create task directories
mkdir -p configs/completion
mkdir -p data/raw/completion data/processed/completion
mkdir -p pipelines/completion
mkdir -p scripts/completion
mkdir -p results/completion
mkdir -p tasks/completion
mkdir -p models/completion

Step 2: Create Task Configuration

# Create YAML configuration for new task
cat > configs/completion/text_generation.yaml << 'EOF'
# Text Generation Task Configuration
task:
  name: "completion"
  type: "text_generation"

# Data Processing Configuration
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/completion/text_generation"
  max_samples: 1000
  # ... other data parameters

# Model Configuration
model:
  name: "gpt2"  # Different model for completion
  max_length: 1024
  # ... model parameters

# Training Configuration
training:
  num_epochs: 3
  batch_size: 8  # Smaller batch for generation
  learning_rate: 5e-5
  data_dir: "./data/processed/completion/text_generation"
  output_dir: "./results/completion/text_generation_model"

# Inference Configuration
inference:
  model_path: "./results/completion/text_generation_model"
  device: "auto"
  batch_size: 1  # Generation typically runs one prompt at a time
  max_length: 100
  temperature: 0.7
EOF

Step 3: Create Pipeline Scripts

Copy and modify the classification pipeline scripts:

# Copy classification scripts as templates
cp pipelines/classification/data_processor.py pipelines/completion/
cp pipelines/classification/train.py pipelines/completion/
cp pipelines/classification/inference.py pipelines/completion/

# Copy task scripts
cp scripts/classification/data_processor.py scripts/completion/
cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/

Step 4: Modify Pipeline Code

Update the pipeline scripts for your specific task:

Data Processor (pipelines/completion/data_processor.py):
- Update data loading logic for completion datasets
- Modify preprocessing for text generation
- Adjust output format for completion tasks

Trainer (pipelines/completion/train.py):
- Change model type to generation models (GPT, T5, etc.)
- Update training loop for text generation
- Modify evaluation metrics

Inference (pipelines/completion/inference.py):
- Update inference for text generation
- Add generation parameters (temperature, top-k, etc.)
- Modify output format
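
The generation parameters mentioned for the inference step control how the next token is drawn. The sketch below shows temperature scaling combined with top-k filtering in plain Python, with token scores passed in as a dictionary. `sample_next_token` is a hypothetical helper for illustration only; in a real pipeline the model library's own generation routine would handle this.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Sample one token from {token: logit} using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    rng = rng or random.Random()
    # Keep only the top_k highest-scoring tokens.
    candidates = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # A lower temperature sharpens the distribution; a higher one flattens it.
    scaled = [score / temperature for _, score in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    tokens = [token for token, _ in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With `top_k=1` the function always returns the highest-scoring token, which corresponds to greedy decoding.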

Step 5: Update Task Scripts

Modify the task scripts to use your new pipeline:

# scripts/completion/data_processor.py
def run_with_yaml_config(config_path: str, **cli_overrides):
    cmd = [
        "python", "pipelines/completion/data_processor.py",  # Updated path
        "--config", config_path
    ]
    # ... rest of the function
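
The elided part of the function presumably forwards `cli_overrides` to the pipeline script. One possible way to do that is sketched below, assuming that keyword names map to flags by replacing underscores with hyphens (`max_samples` becomes `--max-samples`, matching the flags used elsewhere in this README). `build_command` is a hypothetical helper, not the framework's own code.

```python
import sys


def build_command(pipeline_script, config_path, **cli_overrides):
    """Build the argv list for a pipeline script plus optional overrides."""
    # sys.executable keeps the child process on the same interpreter.
    cmd = [sys.executable, pipeline_script, "--config", config_path]
    for key, value in cli_overrides.items():
        if value is None:
            continue  # unset override: the YAML value applies
        cmd += [f"--{key.replace('_', '-')}", str(value)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which raises an error if the pipeline script exits with a non-zero status.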

Step 6: Create Task-Specific Models

# Create model directory (already present if Step 1 was followed)
mkdir -p models/completion

# Add task-specific model classes
cat > models/completion/text_generator.py << 'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

class TextGenerator:
    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
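    def generate(self, prompt, max_length=100, temperature=0.7):
        # Tokenize the prompt and sample a continuation from the model
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,  # sampling is required for temperature to apply
            temperature=temperature,
            pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 has no pad token
        )
        # Decode the first (and only) generated sequence back to text
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)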
EOF

Step 7: Test Your New Task

# Test data processing
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml

# Test training
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml

# Test inference
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"

YAML Configuration Guide

Configuration Structure

Each YAML file is organized into clear sections:

# Task Configuration
task:
  name: "classification"  # or "completion", "styling", "matching"
  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"

# Data Processing Configuration
data:
  source: "huggingface"                    # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion"         # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000                        # Limit dataset size
  # ... other data parameters

# Model Configuration
model:
  name: "bert-base-uncased"                # Model from HuggingFace Hub
  max_length: 512                          # Sequence length
  num_labels: 6                            # Number of classes

# Training Configuration
training:
  num_epochs: 3                            # Training epochs
  batch_size: 16                           # Batch size
  learning_rate: 2e-5                      # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"

# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto"                           # "auto", "cuda", "cpu"
  batch_size: 32                           # Inference batch size
  return_top_k: 3                          # Top K predictions
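
Because the sections reference each other (the trainer reads what the data processor wrote, and inference loads what the trainer saved), a quick consistency check can catch path mismatches before a long run. The sketch below works on an already parsed configuration, for example the dictionary returned by PyYAML's `yaml.safe_load`. The set of required keys and the `validate_config` helper are assumptions made for illustration, not the framework's own validation.

```python
# Keys this sketch treats as required; an assumption based on the layout above.
REQUIRED_KEYS = {
    "task": ("name", "type"),
    "data": ("source", "output_dir"),
    "model": ("name",),
    "training": ("data_dir", "output_dir"),
    "inference": ("model_path",),
}


def _section(config, name):
    """Return the named section if it is a mapping, else an empty dict."""
    value = config.get(name)
    return value if isinstance(value, dict) else {}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        if not isinstance(config.get(section), dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in config[section]:
                problems.append(f"missing key: {section}.{key}")
    data = _section(config, "data")
    training = _section(config, "training")
    inference = _section(config, "inference")
    # The trainer reads what the processor wrote, so these paths should agree.
    if "output_dir" in data and "data_dir" in training:
        if data["output_dir"] != training["data_dir"]:
            problems.append("training.data_dir differs from data.output_dir")
    # Inference loads what the trainer saved.
    if "output_dir" in training and "model_path" in inference:
        if training["output_dir"] != inference["model_path"]:
            problems.append("inference.model_path differs from training.output_dir")
    return problems
```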

Available Configuration Files

configs/classification/emotion.yaml - Emotion classification with HuggingFace dataset
configs/classification/custom.yaml - Custom dataset processing

Usage Examples

Data Processing Examples

# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500

# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion

# 4. Run examples
python scripts/classification/data_processor.py examples
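
Examples 1 to 3 imply a simple precedence rule: a value given on the command line replaces the one from the YAML file, and anything not given falls back to the YAML. A minimal sketch of that merge follows. The mapping from option names to YAML keys and the `apply_overrides` helper are hypothetical, chosen to match the flags shown in this README.

```python
import copy

# Hypothetical mapping from CLI option names to (section, key) in the YAML.
OVERRIDE_TARGETS = {
    "max_samples": ("data", "max_samples"),
    "num_epochs": ("training", "num_epochs"),
    "batch_size": ("training", "batch_size"),
}


def apply_overrides(config, **overrides):
    """Return a copy of config in which non-None overrides replace YAML values."""
    merged = copy.deepcopy(config)  # leave the loaded YAML untouched
    for name, value in overrides.items():
        if value is None:
            continue  # flag not given on the command line
        if name not in OVERRIDE_TARGETS:
            raise KeyError(f"unknown override: {name}")
        section, key = OVERRIDE_TARGETS[name]
        merged.setdefault(section, {})[key] = value
    return merged
```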

Training Examples

# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5

# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3

# 4. Run examples
python scripts/classification/trainer.py examples

Inference Examples

# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml

# 4. Run examples
python scripts/classification/inference.py examples

Troubleshooting Common Errors

1. ModuleNotFoundError: No module named 'utils'

Error:

ModuleNotFoundError: No module named 'utils'

Solution:

# Set Python path before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

2. Model Path Not Found

Error:

Model path not found: ./results/classification/emotion_model

Solution:

# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml

3. Data Directory Not Found

Error:

Data directory not found: ./data/processed/classification/emotion

Solution:

# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

4. YAML Configuration Errors

Error:

data_processor.py: error: --data-source is required (either in YAML config or CLI)

Solution: Check your YAML file structure. It should have:

data:
  source: "huggingface"  # Not data_source
  dataset_name: "dair-ai/emotion"
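
The error message suggests that the processor looks for the data source on the command line first and then under `data.source` in the YAML. The sketch below reproduces that lookup and adds a hint when the misnamed key `data_source` is present. `resolve_data_source` is a hypothetical helper; the real lookup logic may be structured differently.

```python
def resolve_data_source(config, cli_value=None):
    """Return the data source from the CLI value or the YAML data.source key."""
    if cli_value:
        return cli_value  # an explicit CLI value wins over the YAML
    data = config.get("data") or {}
    if "source" in data:
        return data["source"]
    hint = ""
    if "data_source" in data:
        hint = " (found 'data_source'; the expected key is 'source')"
    raise ValueError(
        "--data-source is required (either in YAML config or CLI)" + hint
    )
```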

5. HuggingFace Download Issues

Error:

KeyboardInterrupt during model download

Solution:

# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100

# Or point the Hugging Face cache at a local directory so completed downloads are reused
export HF_HOME=./cache

6. CUDA/GPU Issues

Error:

RuntimeError: CUDA out of memory

Solution:

# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8

# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
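
The configuration accepts `"auto"`, `"cuda"`, or `"cpu"` as the device. How `"auto"` is resolved inside the framework is not shown here; the following is a minimal sketch of one common approach, assuming PyTorch is the backend. `resolve_device` is a hypothetical helper.

```python
def resolve_device(requested="auto"):
    """Map "auto" to "cuda" when a GPU is usable, otherwise to "cpu"."""
    if requested != "auto":
        return requested  # an explicit choice is passed through unchanged
    try:
        import torch  # optional: without PyTorch the sketch falls back to CPU
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Passing `--device cpu` corresponds to calling `resolve_device("cpu")`, which skips the GPU check entirely.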

Monitoring and Logs

Check Processing Status

# Check data processing output
ls -la ./data/processed/classification/emotion/classification/

# Check training output
ls -la ./results/classification/emotion_model/

# Check logs
tail -f logs/training.log

Expected File Structure After Processing

./data/processed/classification/emotion/classification/
├── train.jsonl       # Training data
├── validation.jsonl   # Validation data
└── test.jsonl        # Test data

./results/classification/emotion_model/
├── config.json       # Model configuration
├── pytorch_model.bin # Model weights (model.safetensors in newer transformers releases)
├── tokenizer.json    # Tokenizer
└── label_info.json   # Label mappings
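
Most of the "not found" errors in the troubleshooting section come from running a stage before the previous one has produced its files. A small check against the listing above can make that explicit. This is a sketch under the assumption that the file names are exactly those shown; the weights file is left out of the check because its name can vary between library versions. `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Files each stage is expected to leave behind, taken from the listing above.
EXPECTED_FILES = {
    "data": ("train.jsonl", "validation.jsonl", "test.jsonl"),
    "model": ("config.json", "tokenizer.json", "label_info.json"),
}


def missing_files(directory, kind):
    """Return the expected file names that are absent from directory."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES[kind] if not (root / name).is_file()]
```

An empty list from `missing_files("./results/classification/emotion_model", "model")` would indicate that inference has what it needs to start.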

Workflow Summary

Setup: Install dependencies and set PYTHONPATH
Data Processing: Process raw data into organized splits
Training: Train model using processed data
Inference: Use trained model for predictions
Monitoring: Check logs and outputs for errors

Creating Custom Configurations

For New Datasets

Copy existing config:

cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml

Modify parameters:

data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/classification/my_dataset"
  # ... other parameters

training:
  data_dir: "./data/processed/classification/my_dataset"
  output_dir: "./results/classification/my_dataset_model"

Run pipeline:

python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml

For Custom Data

Use custom config:

data:
  source: "custom"
  data_path: "./data/raw/my_data.jsonl"
  output_dir: "./data/processed/classification/my_custom_dataset"

Run processing:

python scripts/classification/data_processor.py --config configs/classification/custom.yaml
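
For custom data, the processor has to turn a single raw file into train, validation, and test splits. How the framework does this is not described here; the sketch below shows one straightforward approach, a seeded shuffle followed by fixed-fraction cuts. The 80/10/10 fractions, the record fields in the usage example, and `split_records` itself are assumptions for illustration.

```python
import random


def split_records(records, train_frac=0.8, validation_frac=0.1, seed=42):
    """Shuffle records deterministically and cut them into three splits."""
    if not 0 < train_frac + validation_frac < 1:
        raise ValueError("train_frac + validation_frac must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * validation_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains
    }
```

A call such as `split_records([{"text": "great", "label": "joy"}, ...])` returns a dictionary keyed by split name; fixing the seed keeps the splits reproducible across runs.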

Best Practices

Always check output directories before running the next step
Use small datasets for testing before full runs
Monitor logs for errors and warnings
Back up configurations before major changes
Use version control for YAML files
Test with CLI overrides for quick experiments

Support

For issues and questions:

Check the troubleshooting section above
Review logs in the output directories
Verify YAML configuration structure
Test with smaller datasets first

Happy fine-tuning!