OwusuBlessing fd54d4be39 updated readme
2025-08-06 22:49:29 +01:00

Fine-Tune Task: NLP Pipeline Framework

A framework for fine-tuning NLP models, driven by organized YAML configurations. It is structured for multiple tasks (classification, completion, styling, matching); classification is the only one implemented so far.

Supported Tasks

The framework is organized around four task types, each with its own configurations:

  • Classification: Text classification, sentiment analysis, topic classification
  • Completion: Text generation, code completion, story generation
  • Styling: Style transfer, tone classification, writing style adaptation
  • Matching: Semantic matching, entity matching, similarity scoring

Current Implementation Status

  • Classification: Fully implemented with emotion classification example
  • Completion: Planned for future updates
  • Styling: Planned for future updates
  • Matching: Planned for future updates

Note: Only the classification task is currently supported. The other tasks (completion, styling, matching) are planned for future updates.

Project Structure

fine-tune-task/
├── configs/                    # YAML configuration files
│   ├── classification/         # Implemented
│   │   ├── emotion.yaml       # Emotion classification
│   │   └── custom.yaml        # Custom dataset
│   ├── completion/             # Planned for future updates
│   ├── styling/               # Planned for future updates
│   └── matching/              # Planned for future updates
├── data/                       # Data directories
│   ├── raw/                    # Raw input data
│   │   ├── classification/     # Implemented
│   │   ├── completion/         # Planned for future updates
│   │   ├── styling/           # Planned for future updates
│   │   └── matching/          # Planned for future updates
│   └── processed/              # Processed data
│       ├── classification/     # Implemented
│       ├── completion/         # Planned for future updates
│       ├── styling/           # Planned for future updates
│       └── matching/          # Planned for future updates
├── pipelines/                  # Core pipeline scripts
│   ├── classification/         # Implemented
│   │   ├── data_processor.py  # Data processing
│   │   ├── train.py          # Training
│   │   └── inference.py      # Inference
│   ├── completion/            # Planned for future updates
│   ├── styling/              # Planned for future updates
│   └── matching/             # Planned for future updates
├── scripts/                    # User-friendly scripts
│   ├── classification/         # Implemented
│   │   ├── data_processor.py  # Data processing script
│   │   ├── trainer.py        # Training script
│   │   └── inference.py      # Inference script
│   ├── completion/            # Planned for future updates
│   ├── styling/              # Planned for future updates
│   └── matching/             # Planned for future updates
├── results/                    # Model outputs
│   ├── classification/         # Implemented
│   ├── completion/            # Planned for future updates
│   ├── styling/              # Planned for future updates
│   └── matching/             # Planned for future updates
└── utils/                      # Shared utility modules

Quick Start (Classification Task)

1. Setup Environment

# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.

2. Data Processing

# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000

# Check output location
ls -la ./data/processed/classification/emotion/classification/

Expected Output:

Data processing completed successfully!
  Data source: huggingface
  Dataset: dair-ai/emotion
  Total samples: 2999
  Unique labels: 6
  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
  Output directory: ./data/processed/classification/emotion

3. Model Training

# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32

# Check model output
ls -la ./results/classification/emotion_model/

Expected Output:

Training completed successfully!
  Model: bert-base-uncased
  Data directory: ./data/processed/classification/emotion
  Training for 3 epochs with batch size 16
  Model saved to: ./results/classification/emotion_model

4. Model Inference

# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

Expected Output:

Inference completed successfully!
  Loading model from: ./results/classification/emotion_model
  Predicted label: joy
  Confidence: 0.8542
  Top 3 predictions:
    - joy: 0.8542
    - love: 0.1234
    - surprise: 0.0224
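
The confidence values above are consistent with a softmax over the classifier's output scores followed by a top-k selection (`return_top_k` in the config). The following is a minimal plain-Python sketch of that step, not the repository's code. The logits are made up for illustration, and the label order is an assumption based on the `dair-ai/emotion` dataset; the `label_info.json` saved with the model is the authoritative mapping.

```python
import math

# Assumed label order for dair-ai/emotion; check label_info.json for the
# mapping your trained model actually uses.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    example_logits = [-1.2, 3.1, 1.2, -0.8, -1.5, -0.4]  # illustrative values only
    for label, prob in top_k(example_logits, LABELS, k=3):
        print(f"- {label}: {prob:.4f}")
```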

Adding New Tasks

To add a new task (e.g., completion, styling, matching), follow these steps:

Step 1: Create Task Directory Structure

# Create task directories
mkdir -p configs/completion
mkdir -p data/raw/completion data/processed/completion
mkdir -p pipelines/completion
mkdir -p scripts/completion
mkdir -p results/completion
mkdir -p tasks/completion
mkdir -p models/completion
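
The same skeleton can be created from Python, which may be convenient when adding several tasks at once. This is a sketch that mirrors the `mkdir` commands above; `scaffold_task` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Directory templates taken from the mkdir commands above.
TASK_DIR_TEMPLATES = [
    "configs/{task}",
    "data/raw/{task}",
    "data/processed/{task}",
    "pipelines/{task}",
    "scripts/{task}",
    "results/{task}",
    "tasks/{task}",
    "models/{task}",
]


def scaffold_task(task, root="."):
    """Create the per-task directories under root and return their paths."""
    created = []
    for template in TASK_DIR_TEMPLATES:
        path = Path(root) / template.format(task=task)
        path.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        created.append(path)
    return created
```

Calling `scaffold_task("completion")` from the repository root would create the same directories as the shell commands; existing directories are left untouched.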

Step 2: Create Task Configuration

# Create YAML configuration for new task
cat > configs/completion/text_generation.yaml << 'EOF'
# Text Generation Task Configuration
task:
  name: "completion"
  type: "text_generation"

# Data Processing Configuration
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/completion/text_generation"
  max_samples: 1000
  # ... other data parameters

# Model Configuration
model:
  name: "gpt2"  # Different model for completion
  max_length: 1024
  # ... model parameters

# Training Configuration
training:
  num_epochs: 3
  batch_size: 8  # Smaller batch for generation
  learning_rate: 5e-5
  data_dir: "./data/processed/completion/text_generation"
  output_dir: "./results/completion/text_generation_model"

# Inference Configuration
inference:
  model_path: "./results/completion/text_generation_model"
  device: "auto"
  batch_size: 1  # Generation is typically one at a time
  max_length: 100
  temperature: 0.7
EOF

Step 3: Create Pipeline Scripts

Copy and modify the classification pipeline scripts:

# Copy classification scripts as templates
cp pipelines/classification/data_processor.py pipelines/completion/
cp pipelines/classification/train.py pipelines/completion/
cp pipelines/classification/inference.py pipelines/completion/

# Copy task scripts
cp scripts/classification/data_processor.py scripts/completion/
cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/

Step 4: Modify Pipeline Code

Update the pipeline scripts for your specific task:

  1. Data Processor (pipelines/completion/data_processor.py):

    • Update data loading logic for completion datasets
    • Modify preprocessing for text generation
    • Adjust output format for completion tasks
  2. Trainer (pipelines/completion/train.py):

    • Change model type to generation models (GPT, T5, etc.)
    • Update training loop for text generation
    • Modify evaluation metrics
  3. Inference (pipelines/completion/inference.py):

    • Update inference for text generation
    • Add generation parameters (temperature, top-k, etc.)
    • Modify output format
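
To make the generation parameters in the last item concrete, the plain-Python sketch below shows what temperature and top-k do to a next-token distribution. It illustrates the technique on a toy list of logits and is not the repository's inference code; in practice the model library typically applies these settings inside its own generation routine.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=None, rng=random):
    """Sample an index from logits after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Keep only the top_k highest-scoring candidates (all of them if top_k is None).
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]

    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With `top_k=1` the function always returns the highest-scoring index, which corresponds to greedy decoding.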

Step 5: Update Task Scripts

Modify the task scripts to use your new pipeline:

# scripts/completion/data_processor.py
def run_with_yaml_config(config_path: str, **cli_overrides):
    cmd = [
        "python", "pipelines/completion/data_processor.py",  # Updated path
        "--config", config_path
    ]
    # ... rest of the function
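
The elided part of the function presumably appends the CLI overrides and launches the pipeline script. One possible way to do that is sketched below. `build_command` and the underscore-to-hyphen flag mapping are assumptions inferred from flags such as `--max-samples` used elsewhere in this README; they are not the repository's actual implementation.

```python
import subprocess
import sys


def build_command(pipeline_script, config_path, **cli_overrides):
    """Build the argument list for a pipeline script (hypothetical helper)."""
    cmd = [sys.executable, pipeline_script, "--config", config_path]
    for name, value in cli_overrides.items():
        if value is None:
            continue  # unset overrides fall back to the YAML value
        # max_samples=500 becomes "--max-samples 500"
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


def run_with_yaml_config(config_path, **cli_overrides):
    """Run the completion data processor and return its exit code."""
    cmd = build_command(
        "pipelines/completion/data_processor.py", config_path, **cli_overrides
    )
    return subprocess.run(cmd, check=False).returncode
```

Keeping command construction separate from execution makes the mapping easy to test without launching a subprocess.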

Step 6: Create Task-Specific Models

# Create model directory
mkdir -p models/completion

# Add task-specific model classes
cat > models/completion/text_generator.py << 'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

class TextGenerator:
    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
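    def generate(self, prompt, max_length=100, temperature=0.7):
        # Tokenize the prompt, sample a continuation, and decode it back to text
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(
            **inputs, max_length=max_length, do_sample=True, temperature=temperature
        )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)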
EOF

Step 7: Test Your New Task

# Test data processing
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml

# Test training
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml

# Test inference
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"

YAML Configuration Guide

Configuration Structure

Each YAML file is organized into clear sections:

# Task Configuration
task:
  name: "classification"  # or "completion", "styling", "matching"
  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"

# Data Processing Configuration
data:
  source: "huggingface"                    # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion"         # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000                        # Limit dataset size
  # ... other data parameters

# Model Configuration
model:
  name: "bert-base-uncased"                # Model from HuggingFace Hub
  max_length: 512                          # Sequence length
  num_labels: 6                            # Number of classes

# Training Configuration
training:
  num_epochs: 3                            # Training epochs
  batch_size: 16                           # Batch size
  learning_rate: 2e-5                      # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"

# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto"                           # "auto", "cuda", "cpu"
  batch_size: 32                           # Inference batch size
  return_top_k: 3                          # Top K predictions
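
Note that some paths appear in more than one section: in the example, `training.data_dir` repeats `data.output_dir`, and `inference.model_path` repeats `training.output_dir`. A small check like the one below can catch a config where these drift apart. It is a sketch that operates on an already-parsed config dictionary (for example, the result of `yaml.safe_load` from PyYAML); `check_paths` is a hypothetical helper, and whether the scripts themselves require these values to match is an assumption.

```python
def check_paths(config):
    """Return a list of human-readable problems with cross-section paths."""
    problems = []
    pairs = [
        (("data", "output_dir"), ("training", "data_dir")),
        (("training", "output_dir"), ("inference", "model_path")),
    ]
    for (sec_a, key_a), (sec_b, key_b) in pairs:
        value_a = config.get(sec_a, {}).get(key_a)
        value_b = config.get(sec_b, {}).get(key_b)
        if value_a is None or value_b is None:
            problems.append(f"missing {sec_a}.{key_a} or {sec_b}.{key_b}")
        elif value_a != value_b:
            problems.append(
                f"{sec_a}.{key_a} ({value_a}) != {sec_b}.{key_b} ({value_b})"
            )
    return problems


if __name__ == "__main__":
    config = {
        "data": {"output_dir": "./data/processed/classification/emotion"},
        "training": {
            "data_dir": "./data/processed/classification/emotion",
            "output_dir": "./results/classification/emotion_model",
        },
        "inference": {"model_path": "./results/classification/emotion_model"},
    }
    print(check_paths(config) or "paths are consistent")
```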

Available Configuration Files

  • configs/classification/emotion.yaml - Emotion classification with HuggingFace dataset
  • configs/classification/custom.yaml - Custom dataset processing

Usage Examples

Data Processing Examples

# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500

# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion

# 4. Run examples
python scripts/classification/data_processor.py examples
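
The examples above imply a precedence order: a value given on the command line wins over the YAML value, which in turn wins over any built-in default. How the scripts implement this is not shown in this README; the sketch below is one common way to do it, with `merge_overrides` as a hypothetical helper operating on a parsed `data` section.

```python
def merge_overrides(yaml_section, cli_args, defaults=None):
    """Merge settings with precedence: CLI > YAML > defaults."""
    merged = dict(defaults or {})
    merged.update(yaml_section)
    # Only CLI values that were actually supplied (not None) override YAML.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


if __name__ == "__main__":
    yaml_data = {
        "source": "huggingface",
        "dataset_name": "dair-ai/emotion",
        "max_samples": 1000,
    }
    # As if only --max-samples 500 was passed on the command line.
    cli = {"max_samples": 500, "dataset_name": None}
    print(merge_overrides(yaml_data, cli))
```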

Training Examples

# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5

# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3

# 4. Run examples
python scripts/classification/trainer.py examples

Inference Examples

# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml

# 4. Run examples
python scripts/classification/inference.py examples
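
For the file-based mode, the following sketch shows one plausible shape of the input and output files. The formats are assumptions: one text per line in `input.txt`, and one JSON object per line in `predictions.jsonl` with hypothetical `text`, `label`, and `confidence` fields. `predict` stands in for whatever function wraps the trained model.

```python
import json


def predict_file(input_path, output_path, predict):
    """Read one text per line, write one JSON object per line; return the count."""
    count = 0
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip blank lines
            label, confidence = predict(text)
            record = {"text": text, "label": label, "confidence": confidence}
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Check the actual `predictions.jsonl` produced by the script before relying on these field names.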

Troubleshooting Common Errors

1. ModuleNotFoundError: No module named 'utils'

Error:

ModuleNotFoundError: No module named 'utils'

Solution:

# Set Python path before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

2. Model Path Not Found

Error:

Model path not found: ./results/classification/emotion_model

Solution:

# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml

3. Data Directory Not Found

Error:

Data directory not found: ./data/processed/classification/emotion

Solution:

# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

4. YAML Configuration Errors

Error:

data_processor.py: error: --data-source is required (either in YAML config or CLI)

Solution: Check your YAML file structure. It should have:

data:
  source: "huggingface"  # Not data_source
  dataset_name: "dair-ai/emotion"

5. HuggingFace Download Issues

Error:

KeyboardInterrupt during model download

Solution:

# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100

# Or set a persistent Hugging Face cache directory so downloaded files are reused across runs
export HF_HOME=./cache

6. CUDA/GPU Issues

Error:

RuntimeError: CUDA out of memory

Solution:

# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8

# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
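
Reducing the batch size by hand can also be automated. The sketch below retries a training call with a halved batch size whenever it fails with an out-of-memory error. `train_fn` is a hypothetical callable that takes a batch size; nothing here is part of the repository's trainer.

```python
def train_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving the batch size after each OOM error.

    Returns a (result, batch_size_used) tuple.
    """
    while True:
        try:
            return train_fn(batch_size), batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not mask it
            if batch_size // 2 < min_batch_size:
                raise  # already at the minimum batch size; give up
            batch_size //= 2
            print(f"Out of memory, retrying with batch size {batch_size}")
```

In a real PyTorch run it is usually also worth releasing cached GPU memory between attempts, for example with `torch.cuda.empty_cache()`.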

Monitoring and Logs

Check Processing Status

# Check data processing output
ls -la ./data/processed/classification/emotion/classification/

# Check training output
ls -la ./results/classification/emotion_model/

# Check logs
tail -f logs/training.log

Expected File Structure After Processing

./data/processed/classification/emotion/classification/
├── train.jsonl       # Training data
├── validation.jsonl   # Validation data
└── test.jsonl        # Test data

./results/classification/emotion_model/
├── config.json       # Model configuration
├── pytorch_model.bin # Model weights
├── tokenizer.json    # Tokenizer
└── label_info.json   # Label mappings
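
The listings above can be turned into a quick check that reports which expected files are missing. This is a sketch using the file names shown in this README; `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the listings above.
EXPECTED_DATA_FILES = ["train.jsonl", "validation.jsonl", "test.jsonl"]
EXPECTED_MODEL_FILES = [
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "label_info.json",
]


def missing_files(directory, expected):
    """Return the expected file names that are absent from directory."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]
```

For example, `missing_files("./results/classification/emotion_model", EXPECTED_MODEL_FILES)` returns an empty list when training output is complete. Depending on the installed library version, the weights may be saved under a different name such as `model.safetensors`, so a missing `pytorch_model.bin` is not necessarily an error.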

Workflow Summary

  1. Setup: Install dependencies and set PYTHONPATH
  2. Data Processing: Process raw data into organized splits
  3. Training: Train model using processed data
  4. Inference: Use trained model for predictions
  5. Monitoring: Check logs and outputs for errors
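
Because each stage depends on the output of the previous one, the next stage to run can be inferred from which output directories exist. The sketch below encodes that ordering; `next_step` is a hypothetical helper, and the directory arguments correspond to `training.data_dir` and `inference.model_path` in the config.

```python
from pathlib import Path


def next_step(data_dir, model_dir):
    """Return which pipeline stage to run next, based on which outputs exist."""
    if not Path(data_dir).is_dir():
        return "data_processing"  # no processed data yet
    if not Path(model_dir).is_dir():
        return "training"  # data is ready, but no trained model
    return "inference"
```

This mirrors the "Data Directory Not Found" and "Model Path Not Found" cases in the troubleshooting section.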

Creating Custom Configurations

For New Datasets

  1. Copy existing config:
cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
  2. Modify parameters:
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/classification/my_dataset"
  # ... other parameters

training:
  data_dir: "./data/processed/classification/my_dataset"
  output_dir: "./results/classification/my_dataset_model"

inference:
  model_path: "./results/classification/my_dataset_model"
  3. Run pipeline:
python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml

For Custom Data

  1. Use custom config:
data:
  source: "custom"
  data_path: "./data/raw/my_data.jsonl"
  output_dir: "./data/processed/classification/my_custom_dataset"
  2. Run processing:
python scripts/classification/data_processor.py --config configs/classification/custom.yaml
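
Before processing a custom file, it can help to confirm that every line is valid JSON with the fields the pipeline needs. The sketch below does that for a JSONL file. The field names `text` and `label` are assumptions, since this README does not specify the custom data schema; adjust `required_fields` to match what `custom.yaml` expects.

```python
import json


def validate_jsonl(path, required_fields=("text", "label")):
    """Return (line_number, message) pairs for lines that fail basic checks."""
    errors = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                errors.append((number, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(record, dict):
                errors.append((number, "expected a JSON object"))
                continue
            missing = [f for f in required_fields if f not in record]
            if missing:
                errors.append((number, "missing fields: " + ", ".join(missing)))
    return errors
```

An empty list from `validate_jsonl("./data/raw/my_data.jsonl")` means every non-blank line passed these basic checks.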

Best Practices

  1. Always check output directories before running the next step
  2. Use small datasets for testing before full runs
  3. Monitor logs for errors and warnings
  4. Back up configurations before major changes
  5. Use version control for YAML files
  6. Test with CLI overrides for quick experiments

Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review logs in the output directories
  3. Verify YAML configuration structure
  4. Test with smaller datasets first

Happy fine-tuning!
