OwusuBlessing 8847035d12 updated styling pipeline

2025-08-13 21:30:45 +01:00

Fine-Tune Task: NLP Pipeline Framework

A framework for fine-tuning NLP models through per-task YAML configurations. It targets four task types (classification, completion, styling, matching), of which classification and styling are currently implemented.

Supported Tasks

Each task type has its own configurations, pipelines, and scripts:

Classification: Text classification, sentiment analysis, topic classification
Completion: Text generation, code completion, story generation
Styling: Style transfer, tone classification, writing style adaptation
Matching: Semantic matching, entity matching, similarity scoring

Current Implementation Status

Classification: ✅ Fully implemented with emotion classification example
Styling: ✅ Fully implemented with style transfer and LoRA fine-tuning
Completion: Planned for future updates
Matching: Planned for future updates

Note: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.

Project Structure

fine-tune-task/
├── configs/                    # YAML configuration files
│   ├── classification/         # ✅ Implemented
│   │   ├── emotion.yaml       # Emotion classification
│   │   └── custom.yaml        # Custom dataset
│   ├── styling/               # ✅ Implemented
│   │   └── formal.yaml        # Formal style transfer
│   ├── completion/             # Planned for future updates
│   └── matching/              # Planned for future updates
├── data/                       # Data directories
│   ├── raw/                    # Raw input data
│   │   ├── classification/     # ✅ Implemented
│   │   ├── styling/           # ✅ Implemented
│   │   ├── completion/         # Planned for future updates
│   │   └── matching/          # Planned for future updates
│   └── processed/              # Processed data
│       ├── classification/     # ✅ Implemented
│       ├── styling/           # ✅ Implemented
│       ├── completion/         # Planned for future updates
│       └── matching/          # Planned for future updates
├── pipelines/                  # Core pipeline scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py  # Data processing
│   │   ├── train.py          # Training
│   │   └── inference.py      # Inference
│   ├── styling/               # ✅ Implemented
│   │   ├── data_processor.py  # Style data processing
│   │   ├── train.py          # LoRA fine-tuning
│   │   └── inference.py      # Style transfer inference
│   ├── completion/            # Planned for future updates
│   └── matching/             # Planned for future updates
├── scripts/                    # User-friendly scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py  # Data processing script
│   │   ├── trainer.py        # Training script
│   │   └── inference.py      # Inference script
│   ├── styling/               # ✅ Implemented
│   │   ├── data_processor.py  # Style data processing script
│   │   ├── train.py          # Training script
│   │   └── inference.py      # Inference script
│   ├── completion/            # Planned for future updates
│   └── matching/             # Planned for future updates
├── results/                    # Model outputs
│   ├── classification/         # ✅ Implemented
│   ├── styling/              # ✅ Implemented
│   ├── completion/            # Planned for future updates
│   └── matching/             # Planned for future updates
└── utils/                      # Shared utility modules

Quick Start (Classification Task)

1. Setup Environment

# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.

2. Data Processing

# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000

# Check output location
ls -la ./data/processed/classification/emotion/classification/

Expected Output:

Data processing completed successfully!
  Data source: huggingface
  Dataset: dair-ai/emotion
  Total samples: 2999
  Unique labels: 6
  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
  Output directory: ./data/processed/classification/emotion

3. Model Training

# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32

# Check model output
ls -la ./results/classification/emotion_model/

Expected Output:

Training completed successfully!
  Model: bert-base-uncased
  Data directory: ./data/processed/classification/emotion
  Training for 3 epochs with batch size 16
  Model saved to: ./results/classification/emotion_model

4. Model Inference

# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

Expected Output:

Inference completed successfully!
  Loading model from: ./results/classification/emotion_model
  Predicted label: joy
  Confidence: 0.8542
  Top 3 predictions:
    - joy: 0.8542
    - love: 0.1234
    - surprise: 0.0224
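
The confidence values above look like softmax probabilities over the six emotion labels. As a minimal sketch of how a top-k list of this kind can be derived from raw model scores (the function name `top_k_predictions` and the logits are hypothetical, not taken from the framework's inference script):

```python
import math

def top_k_predictions(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank labels by probability, highest first, and keep the top k.
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Label order of dair-ai/emotion; the logits are made up for illustration.
labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
logits = [-1.0, 4.0, 2.0, -2.0, -1.5, 0.5]
for label, prob in top_k_predictions(logits, labels):
    print(f"  - {label}: {prob:.4f}")
```

The `return_top_k` setting in the inference configuration presumably plays the role of `k` here.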

Quick Start (Styling Task)

1. Setup Environment

# Install dependencies (including unsloth for styling)
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.

2. Data Processing

# Process style transfer dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Create HuggingFace dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset

# Check output location
ls -la ./data/processed/styling/formal/

Expected Output:

Styling data processing completed successfully!
  Data source: custom
  Data file: ./data/raw/styling/sample_formal.jsonl
  Total samples: 5
  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
  Output directory: ./data/processed/styling/formal
  Style instruction: Rewrite the following text in a formal style
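
The split sizes above (3/1/1 out of 5 samples) are consistent with a 60/20/20 split. The processor's actual logic is not shown in this README, so the following is only a sketch of one way to produce such splits deterministically; `split_records`, the percentages, and the seed are assumptions.

```python
import random

def split_records(records, percents=(60, 20, 20), seed=42):
    """Shuffle records with a fixed seed and cut them into three splits."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * percents[0] // 100
    n_val = len(shuffled) * percents[1] // 100
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        # The test split takes whatever remains, so no record is dropped.
        "test": shuffled[n_train + n_val:],
    }

splits = split_records(range(5))
print({name: len(rows) for name, rows in splits.items()})
```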

3. Model Training

# Train using processed data (automatically loads from YAML output_dir)
python scripts/styling/train.py example

# Custom training
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4

# Check model output
ls -la ./models/styling/

Expected Output:

Training completed successfully!
  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
  Dataset: Loaded from ./data/processed/styling/formal
  Training for 3 epochs with batch size 4
  Model saved to: ./models/styling

4. Model Inference

# Single text style transfer
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"

# Batch processing
python scripts/styling/inference.py batch

# Interactive mode
python scripts/styling/inference.py infer --config configs/styling/formal.yaml

Expected Output:

Inference completed successfully!
  Input: Hey, what's up?
  Output: Hello, how are you doing?
  Style: Formal

Adding New Tasks

To add a new task (e.g., completion or matching), follow the steps below. The styling task is already implemented and serves as a reference.

Example: Styling Task (Already Implemented)

The styling task demonstrates a complete implementation:

Task Directory Structure ✅

configs/styling/           # YAML configurations
data/raw/styling/         # Raw style transfer data
data/processed/styling/   # Processed data
pipelines/styling/        # Core pipeline scripts
scripts/styling/          # User-friendly scripts
models/styling/           # Trained models

Pipeline Components ✅

Data Processor: Handles style transfer datasets with instruction/input/output format
Trainer: LoRA fine-tuning using Unsloth for efficiency
Inference: Style transfer with streaming and batch processing

Key Features ✅

Automatic EOS token handling: text + tokenizer.eos_token
Dataset mapping: dataset.map(formatting_prompts_func, batched=True)
YAML integration: Uses data.output_dir for automatic dataset loading
HuggingFace dataset export and loading
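
The EOS handling and dataset mapping listed above can be sketched roughly as follows. This is an illustration, not the framework's code: the prompt template is a common Alpaca-style layout chosen as an assumption, `make_formatting_func` is a hypothetical helper, and `"<EOS>"` stands in for `tokenizer.eos_token`.

```python
# Assumed Alpaca-style template; the real pipeline may use different wording.
ALPACA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def make_formatting_func(eos_token):
    """Build a batched formatting function that appends eos_token to each prompt."""
    def formatting_prompts_func(examples):
        # With batched=True, every value in `examples` is a list with one entry per row.
        texts = []
        for instruction, source, target in zip(
            examples["instruction"], examples["input"], examples["output"]
        ):
            text = ALPACA_TEMPLATE.format(
                instruction=instruction, input=source, output=target
            )
            # Without the EOS token the model never learns where a response ends.
            texts.append(text + eos_token)
        return {"text": texts}
    return formatting_prompts_func

batch = {
    "instruction": ["Rewrite the following text in a formal style"],
    "input": ["Hey, what's up?"],
    "output": ["Hello, how are you doing?"],
}
formatted = make_formatting_func("<EOS>")(batch)
print(formatted["text"][0])
```

With a real tokenizer and a HuggingFace dataset, the call would look like `dataset.map(make_formatting_func(tokenizer.eos_token), batched=True)`.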

For Other Tasks (completion, matching)

Create Task Directory Structure

# Create task directories
mkdir -p configs/completion
mkdir -p data/raw/completion data/processed/completion
mkdir -p pipelines/completion
mkdir -p scripts/completion
mkdir -p results/completion
mkdir -p tasks/completion
mkdir -p models/completion

Create Task Configuration

# Create YAML configuration for new task
cat > configs/completion/text_generation.yaml << 'EOF'
# Text Generation Task Configuration
task:
  name: "completion"
  type: "text_generation"

# Data Processing Configuration
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/completion/text_generation"
  max_samples: 1000
  # ... other data parameters

# Model Configuration
model:
  name: "gpt2"  # Different model for completion
  max_length: 1024
  # ... model parameters

# Training Configuration
training:
  num_epochs: 3
  batch_size: 8  # Smaller batch for generation
  learning_rate: 5e-5
  data_dir: "./data/processed/completion/text_generation"
  output_dir: "./results/completion/text_generation_model"

# Inference Configuration
inference:
  model_path: "./results/completion/text_generation_model"
  device: "auto"
  batch_size: 1  # Generation is typically one at a time
  max_length: 100
  temperature: 0.7
EOF

Create Pipeline Scripts

Copy and modify the classification pipeline scripts:

# Copy classification scripts as templates
cp pipelines/classification/data_processor.py pipelines/completion/
cp pipelines/classification/train.py pipelines/completion/
cp pipelines/classification/inference.py pipelines/completion/

# Copy task scripts
cp scripts/classification/data_processor.py scripts/completion/
cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/

Modify Pipeline Code

Update the pipeline scripts for your specific task:

Data Processor (pipelines/completion/data_processor.py):
- Update data loading logic for completion datasets
- Modify preprocessing for text generation
- Adjust output format for completion tasks

Trainer (pipelines/completion/train.py):
- Change model type to generation models (GPT, T5, etc.)
- Update training loop for text generation
- Modify evaluation metrics

Inference (pipelines/completion/inference.py):
- Update inference for text generation
- Add generation parameters (temperature, top-k, etc.)
- Modify output format

Update Task Scripts

Modify the task scripts to use your new pipeline:

# scripts/completion/data_processor.py
def run_with_yaml_config(config_path: str, **cli_overrides):
    cmd = [
        "python", "pipelines/completion/data_processor.py",  # Updated path
        "--config", config_path
    ]
    # ... rest of the function
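
The elided remainder of that function is not shown here. As a sketch of what such a wrapper typically does next, the snippet below turns keyword overrides into CLI flags; `build_command` and the flag-naming rule (underscores become hyphens, as in `--max-samples`) are assumptions based on the commands shown in this README.

```python
import sys

def build_command(config_path, **cli_overrides):
    """Assemble the pipeline command line, appending any CLI overrides."""
    cmd = [
        sys.executable, "pipelines/completion/data_processor.py",
        "--config", config_path,
    ]
    for name, value in cli_overrides.items():
        if value is None:
            continue  # Unset overrides fall back to the YAML value.
        # max_samples -> --max-samples
        cmd.extend(["--" + name.replace("_", "-"), str(value)])
    return cmd

cmd = build_command(
    "configs/completion/text_generation.yaml", max_samples=100, dataset_name=None
)
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```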

Create Task-Specific Models

# Create model directory
mkdir -p models/completion

# Add task-specific model classes
cat > models/completion/text_generator.py << 'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer

class TextGenerator:
    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    def generate(self, prompt, max_length=100, temperature=0.7):
        # Implementation for text generation
        pass
EOF
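
The `generate` stub above leaves the decoding step open. Independent of any model library, the effect of the `temperature` parameter (and of top-k filtering, mentioned earlier) can be sketched on plain logits. `sample_next_token` is a hypothetical helper for illustration only.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=None, rng=random):
    """Sample a token index after temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Candidate token indices, best score first.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]  # Keep only the k highest-scoring tokens.
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [logits[i] / temperature for i in indices]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]

token = sample_next_token([0.1, 3.0, 1.0], temperature=0.7, top_k=2, rng=random.Random(0))
print(token)
```

With `top_k=1` the function always returns the highest-scoring token, which corresponds to greedy decoding.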

Test Your New Task

# Test data processing
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml

# Test training
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml

# Test inference
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"

YAML Configuration Guide

Configuration Structure

Each YAML file is organized into clear sections:

# Task Configuration
task:
  name: "classification"  # or "completion", "styling", "matching"
  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"

# Data Processing Configuration
data:
  source: "huggingface"                    # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion"         # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000                        # Limit dataset size
  # ... other data parameters

# Model Configuration
model:
  name: "bert-base-uncased"                # Model from HuggingFace Hub
  max_length: 512                          # Sequence length
  num_labels: 6                            # Number of classes

# Training Configuration
training:
  num_epochs: 3                            # Training epochs
  batch_size: 16                           # Batch size
  learning_rate: 2e-5                      # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"

# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto"                           # "auto", "cuda", "cpu"
  batch_size: 32                           # Inference batch size
  return_top_k: 3                          # Top K predictions
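
The usage examples in this README let CLI flags override YAML values. A minimal sketch of that precedence rule on an already-parsed config is shown below. The mapping in `OVERRIDE_TARGETS` and the name `apply_overrides` are assumptions; the config dict stands in for the result of parsing the YAML file (for example with PyYAML's `yaml.safe_load`).

```python
import copy

# Hypothetical mapping from CLI option names to (section, key) in the config.
OVERRIDE_TARGETS = {
    "max_samples": ("data", "max_samples"),
    "num_epochs": ("training", "num_epochs"),
    "batch_size": ("training", "batch_size"),
}

def apply_overrides(config, overrides):
    """Return a copy of config in which non-None overrides replace YAML values."""
    merged = copy.deepcopy(config)  # Leave the caller's config untouched.
    for name, value in overrides.items():
        if value is None:
            continue  # Option not given on the command line.
        section, key = OVERRIDE_TARGETS[name]
        merged.setdefault(section, {})[key] = value
    return merged

config = {
    "data": {"source": "huggingface", "max_samples": 1000},
    "training": {"num_epochs": 3, "batch_size": 16},
}
merged = apply_overrides(config, {"num_epochs": 5, "batch_size": None})
print(merged["training"])
```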

Styling Configuration Example

# Styling Task Configuration
task:
  name: "styling"
  type: "style_transfer"

# Data Processing Configuration
data:
  source: "custom"
  data_path: "./data/raw/styling/sample_formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"
  output_dir: "./data/processed/styling/formal"
  output_format: "alpaca"

# Model Configuration
model:
  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  training_max_seq_length: 2048
  training_load_in_4bit: true

# Training Configuration
training:
  num_epochs: 3
  batch_size: 2
  learning_rate: 2e-4
  weight_decay: 0.01

# Inference Configuration
inference:
  batch_size: 1
  max_new_tokens: 128
  temperature: 0.8

Available Configuration Files

configs/classification/emotion.yaml - Emotion classification with HuggingFace dataset
configs/classification/custom.yaml - Custom dataset processing
configs/styling/formal.yaml - Formal style transfer with LoRA fine-tuning

Usage Examples

Data Processing Examples

# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500

# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion

# 4. Run examples
python scripts/classification/data_processor.py examples

Training Examples

# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5

# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3

# 4. Run examples
python scripts/classification/trainer.py examples

Inference Examples

# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml

# 4. Run examples
python scripts/classification/inference.py examples

Styling Examples

# 1. Data Processing
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset

# 2. Training
python scripts/styling/train.py example
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2

# 3. Inference
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
python scripts/styling/inference.py batch
python scripts/styling/inference.py infer --config configs/styling/formal.yaml

# 4. Run examples
python scripts/styling/data_processor.py examples
python scripts/styling/train.py features
python scripts/styling/inference.py features

Troubleshooting Common Errors

1. ModuleNotFoundError: No module named 'utils'

Error:

ModuleNotFoundError: No module named 'utils'

Solution:

# Set Python path before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

2. Model Path Not Found

Error:

Model path not found: ./results/classification/emotion_model

Solution:

# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml

3. Data Directory Not Found

Error:

Data directory not found: ./data/processed/classification/emotion

Solution:

# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

4. YAML Configuration Errors

Error:

data_processor.py: error: --data-source is required (either in YAML config or CLI)

Solution: Check your YAML file structure. It should have:

data:
  source: "huggingface"  # Not data_source
  dataset_name: "dair-ai/emotion"
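
A quick pre-flight check can catch this kind of mistake before the pipeline runs. The sketch below is not part of the framework: `check_data_section` is a hypothetical helper, and the rule that `dataset_name` or `data_path` must accompany each source is inferred from the example configurations.

```python
def check_data_section(config):
    """Return a list of human-readable problems found in the data section."""
    data = config.get("data")
    if not isinstance(data, dict):
        return ["missing top-level 'data' section"]
    problems = []
    if "source" not in data:
        # The most common slip: using the CLI-style name as the YAML key.
        hint = " (found 'data_source'; rename it to 'source')" if "data_source" in data else ""
        problems.append("data.source is required" + hint)
    elif data["source"] not in ("huggingface", "custom"):
        problems.append(f"data.source must be 'huggingface' or 'custom', got {data['source']!r}")
    elif data["source"] == "huggingface" and "dataset_name" not in data:
        problems.append("data.dataset_name is required for source 'huggingface'")
    elif data["source"] == "custom" and "data_path" not in data:
        problems.append("data.data_path is required for source 'custom'")
    return problems

print(check_data_section({"data": {"data_source": "huggingface"}}))
```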

5. HuggingFace Download Issues

Error:

KeyboardInterrupt during model download

Solution:

# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100

# Or point the HuggingFace cache at a local directory so completed downloads are reused
export HF_HOME=./cache

6. CUDA/GPU Issues

Error:

RuntimeError: CUDA out of memory

Solution:

# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8

# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu

Monitoring and Logs

Check Processing Status

# Check data processing output
ls -la ./data/processed/classification/emotion/classification/

# Check training output
ls -la ./results/classification/emotion_model/

# Check logs
tail -f logs/training.log

Expected File Structure After Processing

./data/processed/classification/emotion/classification/
├── train.jsonl       # Training data
├── validation.jsonl   # Validation data
└── test.jsonl        # Test data

./results/classification/emotion_model/
├── config.json       # Model configuration
├── pytorch_model.bin # Model weights
├── tokenizer.json    # Tokenizer
└── label_info.json   # Label mappings
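
To check a directory against the layout above without reading the listing by eye, a small helper along these lines can be used. It is a sketch with a hypothetical name (`missing_files`); the file names come from the tree above.

```python
from pathlib import Path

EXPECTED_SPLITS = ("train.jsonl", "validation.jsonl", "test.jsonl")

def missing_files(directory, expected=EXPECTED_SPLITS):
    """Return the expected file names that are absent from directory."""
    base = Path(directory)
    # is_file() is False for missing paths, so a missing directory reports everything.
    return [name for name in expected if not (base / name).is_file()]

missing = missing_files("./data/processed/classification/emotion/classification")
print("All split files present" if not missing else f"Missing: {missing}")
```

The same function works for the model directory by passing `expected=("config.json", "tokenizer.json", "label_info.json")`.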

Workflow Summary

Classification Task

Setup: Install dependencies and set PYTHONPATH
Data Processing: Process raw data into organized splits
Training: Train model using processed data
Inference: Use trained model for predictions
Monitoring: Check logs and outputs for errors

Styling Task

Setup: Install dependencies (including unsloth) and set PYTHONPATH
Data Processing: Process style transfer data with instruction/input/output format
Training: LoRA fine-tuning using Unsloth for efficient style transfer
Inference: Style transfer with streaming and batch processing
Monitoring: Check training logs and model outputs

Creating Custom Configurations

For New Datasets

Copy existing config:

cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml

Modify parameters:

data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/classification/my_dataset"
  # ... other parameters

training:
  data_dir: "./data/processed/classification/my_dataset"
  output_dir: "./results/classification/my_dataset_model"

Run pipeline:

python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml

For Custom Data

Use custom config:

data:
  source: "custom"
  data_path: "./data/raw/my_data.jsonl"
  output_dir: "./data/processed/classification/my_custom_dataset"

Run processing:

python scripts/classification/data_processor.py --config configs/classification/custom.yaml

Best Practices

Check the output directory of each step before running the next one
Test with a small dataset (e.g., --max-samples 100) before a full run
Monitor logs for errors and warnings
Back up configurations before major changes
Keep YAML files under version control
Use CLI overrides for quick experiments

Support

For issues and questions:

Check the troubleshooting section above
Review logs in the output directories
Verify YAML configuration structure
Test with smaller datasets first

Happy fine-tuning!
