OwusuBlessing fef3f5ae35 initial setupt

2025-08-06 22:45:37 +01:00

Fine-Tune Task: NLP Pipeline Framework

A framework for fine-tuning NLP models through organized YAML configurations. It is structured around four task types (classification, completion, styling, matching); classification is the one implemented so far.

🎯 Supported Tasks

The framework is organized around four NLP task types, each with its own configurations:

Classification: Text classification, sentiment analysis, topic classification
Completion: Text generation, code completion, story generation
Styling: Style transfer, tone classification, writing style adaptation
Matching: Semantic matching, entity matching, similarity scoring

Current Implementation Status

✅ Classification: Fully implemented with emotion classification example
🔄 Completion: Planned for future updates
🔄 Styling: Planned for future updates
🔄 Matching: Planned for future updates

Note: Only the classification task is currently supported. The other tasks (completion, styling, matching) are planned for future updates.

🏗️ Project Structure

fine-tune-task/
├── configs/                    # YAML configuration files
│   ├── classification/         # ✅ Implemented
│   │   ├── emotion.yaml       # Emotion classification
│   │   └── custom.yaml        # Custom dataset
│   ├── completion/             # 🔄 Planned for future updates
│   ├── styling/               # 🔄 Planned for future updates
│   └── matching/              # 🔄 Planned for future updates
├── data/                       # Data directories
│   ├── raw/                    # Raw input data
│   │   ├── classification/     # ✅ Implemented
│   │   ├── completion/         # 🔄 Planned for future updates
│   │   ├── styling/           # 🔄 Planned for future updates
│   │   └── matching/          # 🔄 Planned for future updates
│   └── processed/              # Processed data
│       ├── classification/     # ✅ Implemented
│       ├── completion/         # 🔄 Planned for future updates
│       ├── styling/           # 🔄 Planned for future updates
│       └── matching/          # 🔄 Planned for future updates
├── pipelines/                  # Core pipeline scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py  # Data processing
│   │   ├── train.py          # Training
│   │   └── inference.py      # Inference
│   ├── completion/            # 🔄 Framework ready
│   ├── styling/              # 🔄 Framework ready
│   └── matching/             # 🔄 Framework ready
├── scripts/                    # User-friendly scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py  # Data processing script
│   │   ├── trainer.py        # Training script
│   │   └── inference.py      # Inference script
│   ├── completion/            # 🔄 Framework ready
│   ├── styling/              # 🔄 Framework ready
│   └── matching/             # 🔄 Framework ready
├── results/                    # Model outputs
│   ├── classification/         # ✅ Implemented
│   ├── completion/            # 🔄 Ready
│   ├── styling/              # 🔄 Ready
│   └── matching/             # 🔄 Ready
└── utils/                      # Shared utility modules

🚀 Quick Start (Classification Task)

1. Setup Environment

# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.

2. Data Processing

# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000

# Check output location
ls -la ./data/processed/classification/emotion/classification/

Expected Output:

✅ Data processing completed successfully!
  Data source: huggingface
  Dataset: dair-ai/emotion
  Total samples: 2999
  Unique labels: 6
  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
  Output directory: ./data/processed/classification/emotion
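
To double-check the reported split sizes, a small script along these lines could count the records in each split. This is a minimal sketch that assumes each split is a JSONL file with one JSON object per line and a `label` field; the field name and the `summarize_split` helper are illustrative and not part of the framework.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_split(path):
    """Return (record_count, label_counts) for one JSONL split file."""
    labels = Counter()
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            # "label" is an assumed field name; adjust it to your schema
            labels[record.get("label")] += 1
    return count, labels


if __name__ == "__main__":
    # Directory taken from the `ls` command above
    split_dir = Path("./data/processed/classification/emotion/classification")
    for name in ("train", "validation", "test"):
        split_path = split_dir / f"{name}.jsonl"
        if split_path.exists():
            total, by_label = summarize_split(split_path)
            print(f"{name}: {total} records, {len(by_label)} labels")
        else:
            print(f"{name}: missing ({split_path})")
```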

3. Model Training

# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32

# Check model output
ls -la ./results/classification/emotion_model/

Expected Output:

✅ Training completed successfully!
  Model: bert-base-uncased
  Data directory: ./data/processed/classification/emotion
  Training for 3 epochs with batch size 16
  Model saved to: ./results/classification/emotion_model

4. Model Inference

# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

Expected Output:

✅ Inference completed successfully!
  Loading model from: ./results/classification/emotion_model
  Predicted label: joy
  Confidence: 0.8542
  Top 3 predictions:
    - joy: 0.8542
    - love: 0.1234
    - surprise: 0.0224
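
The confidence values above behave like softmax probabilities over the model's output logits. The sketch below shows one way such a top-k list could be derived. The logits, the label order, and the `top_k_predictions` helper are invented for illustration and are not taken from the framework's inference code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_predictions(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Assumed label order for a six-class emotion model; logits are made up
    labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
    logits = [-1.2, 4.1, 2.2, -0.8, -1.5, 0.5]
    for label, prob in top_k_predictions(logits, labels):
        print(f"- {label}: {prob:.4f}")
```

Sorting after the softmax keeps the reported confidence and the ranking consistent, which is why the top prediction's probability matches the headline confidence.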

🔧 Adding New Tasks

To add a new task (e.g., completion, styling, matching), follow these steps:

Step 1: Create Task Directory Structure

# Create task directories
mkdir -p configs/completion
mkdir -p data/raw/completion data/processed/completion
mkdir -p pipelines/completion
mkdir -p scripts/completion
mkdir -p results/completion
mkdir -p tasks/completion
mkdir -p models/completion

Step 2: Create Task Configuration

# Create YAML configuration for new task
cat > configs/completion/text_generation.yaml << 'EOF'
# Text Generation Task Configuration
task:
  name: "completion"
  type: "text_generation"

# Data Processing Configuration
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/completion/text_generation"
  max_samples: 1000
  # ... other data parameters

# Model Configuration
model:
  name: "gpt2"  # Different model for completion
  max_length: 1024
  # ... model parameters

# Training Configuration
training:
  num_epochs: 3
  batch_size: 8  # Smaller batch for generation
  learning_rate: 5e-5
  data_dir: "./data/processed/completion/text_generation"
  output_dir: "./results/completion/text_generation_model"

# Inference Configuration
inference:
  model_path: "./results/completion/text_generation_model"
  device: "auto"
  batch_size: 1  # Generation is typically one at a time
  max_length: 100
  temperature: 0.7
EOF

Step 3: Create Pipeline Scripts

Copy and modify the classification pipeline scripts:

# Copy classification scripts as templates
cp pipelines/classification/data_processor.py pipelines/completion/
cp pipelines/classification/train.py pipelines/completion/
cp pipelines/classification/inference.py pipelines/completion/

# Copy task scripts
cp scripts/classification/data_processor.py scripts/completion/
cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/

Step 4: Modify Pipeline Code

Update the pipeline scripts for your specific task:

Data Processor (pipelines/completion/data_processor.py):
- Update data loading logic for completion datasets
- Modify preprocessing for text generation
- Adjust output format for completion tasks
Trainer (pipelines/completion/train.py):
- Change model type to generation models (GPT, T5, etc.)
- Update training loop for text generation
- Modify evaluation metrics
Inference (pipelines/completion/inference.py):
- Update inference for text generation
- Add generation parameters (temperature, top-k, etc.)
- Modify output format
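
To make the generation parameters mentioned above concrete, here is a library-free sketch of temperature scaling combined with top-k filtering over the logits of a single decoding step. A real pipeline would normally delegate this to the model library; `sample_next_token` is a hypothetical name used only for this example.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Sample one token index using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    rng = rng or random.Random()
    # Keep only the k highest-scoring candidate indices
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k]
    # Lower temperatures sharpen the distribution, higher ones flatten it
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs sample the same way
    logits = [2.0, 1.0, 0.1, -1.0]
    print([sample_next_token(logits, top_k=2, rng=rng) for _ in range(5)])
```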

Step 5: Update Task Scripts

Modify the task scripts to use your new pipeline:

# scripts/completion/data_processor.py
def run_with_yaml_config(config_path: str, **cli_overrides):
    cmd = [
        "python", "pipelines/completion/data_processor.py",  # Updated path
        "--config", config_path
    ]
    # ... rest of the function
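
The elided part of such a wrapper usually has to turn keyword overrides into command-line flags. One way to do that is sketched below, assuming each override maps to a `--kebab-case` flag such as `--max-samples`. `build_command` is a hypothetical helper, not the framework's actual code.

```python
import sys


def build_command(pipeline_script, config_path, **cli_overrides):
    """Build the argument list for one pipeline script invocation."""
    cmd = [sys.executable, pipeline_script, "--config", config_path]
    for key, value in cli_overrides.items():
        if value is None:
            continue  # unset overrides fall back to the YAML value
        flag = "--" + key.replace("_", "-")  # max_samples -> --max-samples
        cmd.extend([flag, str(value)])
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "pipelines/completion/data_processor.py",
        "configs/completion/text_generation.yaml",
        max_samples=500,
    )
    print(" ".join(cmd))
    # To execute it: subprocess.run(cmd, check=True)
```

Using `sys.executable` rather than a bare `"python"` keeps the child process in the same interpreter and virtual environment as the caller.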

Step 6: Create Task-Specific Models

# Create model directory
mkdir -p models/completion

# Add task-specific model classes
cat > models/completion/text_generator.py << 'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

class TextGenerator:
    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
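    def generate(self, prompt, max_length=100, temperature=0.7):
        # Tokenize the prompt and sample a continuation from the model
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,  # sampling must be on for temperature to take effect
            temperature=temperature,
            pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 defines no pad token
        )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)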
EOF

Step 7: Test Your New Task

# Test data processing
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml

# Test training
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml

# Test inference
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"

📋 YAML Configuration Guide

Configuration Structure

Each YAML file is organized into clear sections:

# Task Configuration
task:
  name: "classification"  # or "completion", "styling", "matching"
  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"

# Data Processing Configuration
data:
  source: "huggingface"                    # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion"         # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000                        # Limit dataset size
  # ... other data parameters

# Model Configuration
model:
  name: "bert-base-uncased"                # Model from HuggingFace Hub
  max_length: 512                          # Sequence length
  num_labels: 6                            # Number of classes

# Training Configuration
training:
  num_epochs: 3                            # Training epochs
  batch_size: 16                           # Batch size
  learning_rate: 2e-5                      # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"

# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto"                           # "auto", "cuda", "cpu"
  batch_size: 32                           # Inference batch size
  return_top_k: 3                          # Top K predictions
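
Because each stage reads its own section, a quick structural check before a long run can save time. The sketch below validates a config that has already been parsed into a plain dict (for example with a YAML loader). The set of required keys is inferred from the example above and is an assumption, as is the `find_config_problems` name.

```python
# Required keys per section, inferred from the example config (assumption)
REQUIRED_KEYS = {
    "task": ["name", "type"],
    "data": ["source", "output_dir"],
    "model": ["name"],
    "training": ["data_dir", "output_dir"],
    "inference": ["model_path"],
}


def find_config_problems(config):
    """Return a list of human-readable problems; empty means the layout is fine."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        body = config.get(section)
        if not isinstance(body, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in body:
                problems.append(f"missing key: {section}.{key}")
    return problems


if __name__ == "__main__":
    config = {
        "task": {"name": "classification", "type": "sequence_classification"},
        "data": {"source": "huggingface", "dataset_name": "dair-ai/emotion"},
        "model": {"name": "bert-base-uncased"},
        "training": {"data_dir": "./data/processed/classification/emotion"},
    }
    for problem in find_config_problems(config):
        print(problem)
```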

Available Configuration Files

configs/classification/emotion.yaml - Emotion classification with HuggingFace dataset
configs/classification/custom.yaml - Custom dataset processing

🔧 Usage Examples

Data Processing Examples

# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500

# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion

# 4. Run examples
python scripts/classification/data_processor.py examples
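
Examples 1 to 3 imply a precedence order: a CLI flag wins over the YAML value. A minimal sketch of that resolution follows. The built-in defaults layer and the `resolve_settings` helper are assumptions added for illustration.

```python
def resolve_settings(defaults, yaml_values, cli_values):
    """Merge settings so that CLI values override YAML, which overrides defaults."""
    resolved = dict(defaults)
    for source in (yaml_values, cli_values):  # later sources win
        for key, value in source.items():
            if value is not None:  # None means "not provided"
                resolved[key] = value
    return resolved


if __name__ == "__main__":
    defaults = {"max_samples": None, "source": None}  # hypothetical defaults
    yaml_values = {"source": "huggingface", "max_samples": 1000}
    cli_values = {"max_samples": 500, "source": None}
    print(resolve_settings(defaults, yaml_values, cli_values))
```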

Training Examples

# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5

# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3

# 4. Run examples
python scripts/classification/trainer.py examples

Inference Examples

# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml

# 4. Run examples
python scripts/classification/inference.py examples

🐛 Troubleshooting Common Errors

1. ModuleNotFoundError: No module named 'utils'

Error:

ModuleNotFoundError: No module named 'utils'

Solution:

# Set Python path before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
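
When exporting `PYTHONPATH` is inconvenient (in a notebook, for example), the project root can be added to `sys.path` from Python instead. The sketch below assumes the project root is the nearest parent directory that contains `utils/`; `add_project_root` is an illustrative helper.

```python
import sys
from pathlib import Path


def add_project_root(start, marker="utils"):
    """Find the nearest directory at or above `start` containing `marker` and put it on sys.path."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).is_dir():
            root = str(candidate)
            if root not in sys.path:
                sys.path.insert(0, root)
            return candidate
    raise FileNotFoundError(f"no '{marker}' directory found at or above {start}")


if __name__ == "__main__":
    try:
        print("project root:", add_project_root(Path.cwd()))
    except FileNotFoundError as error:
        print(error)
```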

2. Model Path Not Found

Error:

❌ Model path not found: ./results/classification/emotion_model

Solution:

# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml

3. Data Directory Not Found

Error:

❌ Data directory not found: ./data/processed/classification/emotion

Solution:

# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

4. YAML Configuration Errors

Error:

data_processor.py: error: --data-source is required (either in YAML config or CLI)

Solution: Check your YAML file structure. It should have:

data:
  source: "huggingface"  # Not data_source
  dataset_name: "dair-ai/emotion"
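
The error message suggests that the script looks for the data source on the command line and in the YAML file. A sketch of such a lookup, which also flags the `data_source` misnaming, is shown below. The function name and the first error message are illustrative and not the framework's code.

```python
def resolve_data_source(config, cli_source=None):
    """Pick the data source from the CLI value or from data.source in the config."""
    if cli_source:
        return cli_source
    data = config.get("data") or {}
    if "source" in data:
        return data["source"]
    if "data_source" in data or "data_source" in config:
        # Common mistake: the key is named after the CLI flag
        raise ValueError("found 'data_source'; use 'source' under the 'data' section")
    raise ValueError("--data-source is required (either in YAML config or CLI)")


if __name__ == "__main__":
    print(resolve_data_source({"data": {"source": "huggingface"}}))
    try:
        resolve_data_source({"data": {"data_source": "huggingface"}})
    except ValueError as error:
        print("error:", error)
```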

5. HuggingFace Download Issues

Error:

KeyboardInterrupt during model download

Solution:

# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100

# Or keep downloads in a local cache directory so already-downloaded files are reused
export HF_HOME=./cache

6. CUDA/GPU Issues

Error:

RuntimeError: CUDA out of memory

Solution:

# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8

# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
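
Batch-size reduction can also be automated. The sketch below retries a training callable with a halved batch size whenever it raises an out-of-memory `RuntimeError`. `run_training` stands in for whatever function launches training, and `train_with_backoff` is a hypothetical name.

```python
def train_with_backoff(run_training, batch_size, min_batch_size=1):
    """Call run_training(batch_size), halving the batch size after each OOM error."""
    min_batch_size = max(1, min_batch_size)  # guard against looping at zero
    while batch_size >= min_batch_size:
        try:
            return batch_size, run_training(batch_size)
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise  # unrelated failure: do not mask it
            print(f"Out of memory at batch size {batch_size}, halving")
            batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


if __name__ == "__main__":
    def fake_training(batch_size):
        # Stand-in that pretends anything above 8 does not fit in memory
        if batch_size > 8:
            raise RuntimeError("CUDA out of memory")
        return "done"

    print(train_with_backoff(fake_training, 32))
```

With PyTorch, freeing cached GPU memory between attempts (for example with `torch.cuda.empty_cache()`) generally makes the retry more likely to succeed.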

📊 Monitoring and Logs

Check Processing Status

# Check data processing output
ls -la ./data/processed/classification/emotion/classification/

# Check training output
ls -la ./results/classification/emotion_model/

# Check logs
tail -f logs/training.log

Expected File Structure After Processing

./data/processed/classification/emotion/classification/
├── train.jsonl       # Training data
├── validation.jsonl   # Validation data
└── test.jsonl        # Test data

./results/classification/emotion_model/
├── config.json       # Model configuration
├── pytorch_model.bin # Model weights (may be model.safetensors, depending on the transformers version)
├── tokenizer.json    # Tokenizer
└── label_info.json   # Label mappings
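
A quick way to confirm both layouts is to compare them against a list of expected files. The file names below are copied from the trees above; `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Expected files per directory, copied from the listings above
EXPECTED = {
    "./data/processed/classification/emotion/classification": [
        "train.jsonl",
        "validation.jsonl",
        "test.jsonl",
    ],
    "./results/classification/emotion_model": [
        "config.json",
        "pytorch_model.bin",
        "tokenizer.json",
        "label_info.json",
    ],
}


def missing_files(directory, names):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    for directory, names in EXPECTED.items():
        missing = missing_files(directory, names)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{directory}: {status}")
```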

🔄 Workflow Summary

Setup: Install dependencies and set PYTHONPATH
Data Processing: Process raw data into organized splits
Training: Train model using processed data
Inference: Use trained model for predictions
Monitoring: Check logs and outputs for errors
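
The pipeline stages can be chained from one small driver script. The sketch below builds the same commands as the Quick Start and, when not in dry-run mode, stops at the first failing stage. The `run_pipeline` name and the dry-run switch are assumptions for illustration.

```python
import subprocess
import sys

# (script, extra arguments) for each stage, in execution order
STAGES = [
    ("scripts/classification/data_processor.py", []),
    ("scripts/classification/trainer.py", []),
    ("scripts/classification/inference.py", ["--input-text", "I love this product!"]),
]


def stage_commands(config_path):
    """Return one argument list per stage for the given YAML config."""
    return [
        [sys.executable, script, "--config", config_path, *extra]
        for script, extra in STAGES
    ]


def run_pipeline(config_path, dry_run=True):
    """Print each stage command and, unless dry_run is set, execute it."""
    for cmd in stage_commands(config_path):
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, so later stages are skipped
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("configs/classification/emotion.yaml", dry_run=True)
```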

📝 Creating Custom Configurations

For New Datasets

Copy existing config:

cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml

Modify parameters:

data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/classification/my_dataset"
  # ... other parameters

training:
  data_dir: "./data/processed/classification/my_dataset"
  output_dir: "./results/classification/my_dataset_model"

Run pipeline:

python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml

For Custom Data

Use custom config:

data:
  source: "custom"
  data_path: "./data/raw/my_data.jsonl"
  output_dir: "./data/processed/classification/my_custom_dataset"

Run processing:

python scripts/classification/data_processor.py --config configs/classification/custom.yaml
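
The custom config points at a JSONL file. The sketch below shows one way to produce such a file from in-memory records. The `text` and `label` field names are assumptions, so check what `data_processor.py` expects before relying on them; `write_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line and return the number of records written."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with target.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


if __name__ == "__main__":
    # Field names are assumed; the path matches `data_path` in the config above
    records = [
        {"text": "I love this product!", "label": "joy"},
        {"text": "This is so frustrating.", "label": "anger"},
    ]
    count = write_jsonl(records, "./data/raw/my_data.jsonl")
    print(f"wrote {count} records")
```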

🎯 Best Practices

Always check output directories before running the next step
Use small datasets for testing before full runs
Monitor logs for errors and warnings
Backup configurations before major changes
Use version control for YAML files
Test with CLI overrides for quick experiments

📞 Support

For issues and questions:

Check the troubleshooting section above
Review logs in the output directories
Verify YAML configuration structure
Test with smaller datasets first

Happy fine-tuning! 🚀
