From d7441f40891e55a1f5a7cd7e1b7e550a399122b5 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Wed, 13 Aug 2025 23:59:28 +0000
Subject: [PATCH] upated readme

---
 .ipynb_checkpoints/README-checkpoint.md    | 1305 ++++++++++----------
 .ipynb_checkpoints/untitled-checkpoint.txt |    0
 README.md                                  | 1305 ++++++++++----------
 untitled.txt                               |    0
 4 files changed, 1268 insertions(+), 1342 deletions(-)
 create mode 100644 .ipynb_checkpoints/untitled-checkpoint.txt
 create mode 100644 untitled.txt

diff --git a/.ipynb_checkpoints/README-checkpoint.md b/.ipynb_checkpoints/README-checkpoint.md
index f1ba946..517945e 100644
--- a/.ipynb_checkpoints/README-checkpoint.md
+++ b/.ipynb_checkpoints/README-checkpoint.md
@@ -1,763 +1,726 @@
-# Fine-Tune Task: NLP Pipeline Framework
+# Fine-Tuning Task Framework
 
-A comprehensive framework for fine-tuning NLP models with organized YAML configurations, supporting multiple tasks (classification, completion, styling, matching).
+A comprehensive framework for fine-tuning Large Language Models (LLMs) across multiple task types including classification, completion, styling, and matching.
 
-## Supported Tasks
+## Table of Contents
 
-This framework supports multiple NLP tasks with organized configurations:
+- [Overview](#overview)
+- [Architecture](#architecture)
+- [Task Types](#task-types)
+- [Quick Start](#quick-start)
+- [Configuration Guide](#configuration-guide)
+- [Scripts & Commands](#scripts--commands)
+- [Complete Workflows](#complete-workflows)
+- [API Reference](#api-reference)
+- [Troubleshooting](#troubleshooting)
+- [Contributing](#contributing)
 
-- **Classification**: Text classification, sentiment analysis, topic classification
-- **Completion**: Text generation, code completion, story generation
-- **Styling**: Style transfer, tone classification, writing style adaptation
-- **Matching**: Semantic matching, entity matching, similarity scoring
+## Overview
 
-### Current Implementation Status
+This framework provides a unified approach to fine-tuning LLMs for various NLP tasks. It's designed to be:
 
-- **Classification**: ✅ Fully implemented with emotion classification example
-- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
-- **Completion**: Planned for future updates
-- **Matching**: Planned for future updates
+- **Task-Agnostic**: Same pipeline structure for different task types
+- **Configuration-Driven**: YAML-based configuration for all parameters
+- **Developer-Friendly**: Clear scripts and comprehensive logging
+- **Production-Ready**: Built-in validation, error handling, and optimization
 
-**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
+## Architecture
 
-## Project Structure
+The framework follows a **modular pipeline architecture**:
 
 ```
-fine-tune-task/
-├── configs/                      # YAML configuration files
-│   ├── classification/           # ✅ Implemented
-│   │   ├── emotion.yaml          # Emotion classification
-│   │   └── custom.yaml           # Custom dataset
-│   ├── styling/                  # ✅ Implemented
-│   │   └── formal.yaml           # Formal style transfer
-│   ├── completion/               # Planned for future updates
-│   └── matching/                 # Planned for future updates
-├── data/                         # Data directories
-│   ├── raw/                      # Raw input data
-│   │   ├── classification/       # ✅ Implemented
-│   │   ├── styling/              # ✅ Implemented
-│   │   ├── completion/           # Planned for future updates
-│   │   └── matching/             # Planned for future updates
-│   └── processed/                # Processed data
-│       ├── classification/       # ✅ Implemented
-│       ├── styling/              # ✅ Implemented
-│       ├── completion/           # Planned for future updates
-│       └── matching/             # Planned for future updates
-├── pipelines/                    # Core pipeline scripts
-│   ├── classification/           # ✅ Implemented
-│   │   ├── data_processor.py     # Data processing
-│   │   ├── train.py              # Training
-│   │   └── inference.py          # Inference
-│   ├── styling/                  # ✅ Implemented
-│   │   ├── data_processor.py     # Style data processing
-│   │   ├── train.py              # LoRA fine-tuning
-│   │   └── inference.py          # Style transfer inference
-│   ├── completion/               # Planned for future updates
-│   └── matching/                 # Planned for future updates
-├── scripts/                      # User-friendly scripts
-│   ├── classification/           # ✅ Implemented
-│   │   ├── data_processor.py     # Data processing script
-│   │   ├── trainer.py            # Training script
-│   │   └── inference.py          # Inference script
-│   ├── styling/                  # ✅ Implemented
-│   │   ├── data_processor.py     # Style data processing script
-│   │   ├── train.py              # Training script
-│   │   └── inference.py          # Inference script
-│   ├── completion/               # Planned for future updates
-│   └── matching/                 # Planned for future updates
-├── results/                      # Model outputs
-│   ├── classification/           # ✅ Implemented
-│   ├── styling/                  # ✅ Implemented
-│   ├── completion/               # Planned for future updates
-│   └── matching/                 # Planned for future updates
-└── utils/                        # Shared utility modules
+Raw Data → Data Processing → Model Training → Inference/Evaluation
+    ↓              ↓                ↓                   ↓
+ JSONL/CSV    HuggingFace        Trained            Ready for
+   Files       Datasets          Models            Production
 ```
 
-## Quick Start (Classification Task)
+### Core Components
 
-### 1. Setup Environment
+1. **Data Processors**: Convert raw data to training-ready formats
+2. **Training Pipelines**: Task-specific training with optimization
+3. **Inference Engines**: Production-ready text generation/classification
+4. **Configuration Management**: YAML-based parameter control
+5. **Utility Scripts**: Command-line interfaces for all operations
+
+## Task Types
+
+### 1. Classification Task
+
+**Purpose**: Text classification, sentiment analysis, topic categorization
+
+**Data Format**:
+```jsonl
+{"text": "I love this product!", "label": "positive"}
+{"text": "This is terrible", "label": "negative"}
+```
+
+**Output**: Classification probabilities and predicted labels
+
+**Use Cases**: Sentiment analysis, spam detection, content moderation
+
+### 2. Completion Task
+
+**Purpose**: Text generation, story completion, code generation
+
+**Data Format**:
+```jsonl
+{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
+{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}
+```
+
+**Output**: Generated text continuations
+
+**Use Cases**: Creative writing, code completion, content generation
+
+### 3. Styling Task
+
+**Purpose**: Style transfer, tone modification, writing style adaptation
+
+**Data Format**:
+```jsonl
+{"text": "Hey there!", "styled_text": "Hello, how are you?"}
+{"text": "I'm gonna go", "styled_text": "I will be going"}
+```
+
+**Output**: Text rewritten in target style
+
+**Use Cases**: Formalization, casualization, domain adaptation
+
+### 4. Matching Task
+
+**Purpose**: Semantic similarity, question-answer matching, paraphrase detection
+
+**Data Format**:
+```jsonl
+{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
+{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}
+```
+
+**Output**: Similarity scores or binary classifications
+
+**Use Cases**: Search relevance, duplicate detection, semantic matching
+
+## Quick Start
+
+### Prerequisites
 
 ```bash
 # Install dependencies
 pip install -r requirements.txt
 
-# Set Python path
-export PYTHONPATH=.
+# Verify installation
+python -c "import torch, transformers, datasets; print('✅ All packages installed')"
 ```
 
-### 2. Data Processing
+### Basic Workflow
 
 ```bash
-# Process emotion dataset
-python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+# 1. Process data
+python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
 
-# Process with custom parameters
-python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000
+# 2. Train model
+python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
 
-# Check output location
-ls -la ./data/processed/classification/emotion/classification/
+# 3. Run inference
+python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml
 ```
 
-**Expected Output:**
-```
-Data processing completed successfully!
-  Data source: huggingface
-  Dataset: dair-ai/emotion
-  Total samples: 2999
-  Unique labels: 6
-  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
-  Output directory: ./data/processed/classification/emotion
-```
+## Configuration Guide
 
-### 3. Model Training
+### YAML Structure
 
-```bash
-# Train using processed data
-python scripts/classification/trainer.py --config configs/classification/emotion.yaml
-
-# Train with custom parameters
-python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32
-
-# Check model output
-ls -la ./results/classification/emotion_model/
-```
-
-**Expected Output:**
-```
-Training completed successfully!
-  Model: bert-base-uncased
-  Data directory: ./data/processed/classification/emotion
-  Training for 3 epochs with batch size 16
-  Model saved to: ./results/classification/emotion_model
-```
-
-### 4. Model Inference
-
-```bash
-# Run inference
-python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
-
-# File-based inference
-python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
-```
-
-**Expected Output:**
-```
-Inference completed successfully!
-  Loading model from: ./results/classification/emotion_model
-  Predicted label: joy
-  Confidence: 0.8542
-  Top 3 predictions:
-    - joy: 0.8542
-    - love: 0.1234
-    - surprise: 0.0224
-```
-
-## Quick Start (Styling Task)
-
-### 1. Setup Environment
-
-```bash
-# Install dependencies (including unsloth for styling)
-pip install -r requirements.txt
-
-# Set Python path
-export PYTHONPATH=.
-```
-
-### 2. Data Processing
-
-```bash
-# Process style transfer dataset
-python scripts/styling/data_processor.py --config configs/styling/formal.yaml
-
-# Create HuggingFace dataset
-python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
-
-# Check output location
-ls -la ./data/processed/styling/formal/
-```
-
-**Expected Output:**
-```
-Styling data processing completed successfully!
-  Data source: custom
-  Data file: ./data/raw/styling/sample_formal.jsonl
-  Total samples: 5
-  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
-  Output directory: ./data/processed/styling/formal
-  Style instruction: Rewrite the following text in a formal style
-```
-
-### 3. Model Training
-
-```bash
-# Train using processed data (automatically loads from YAML output_dir)
-python scripts/styling/train.py example
-
-# Custom training
-python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
-
-# Check model output
-ls -la ./models/styling/
-```
-
-**Expected Output:**
-```
-Training completed successfully!
-  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
-  Dataset: Loaded from ./data/processed/styling/formal
-  Training for 3 epochs with batch size 4
-  Model saved to: ./models/styling
-```
-
-### 4. Model Inference
-
-```bash
-# Single text style transfer
-python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
-
-# Batch processing
-python scripts/styling/inference.py batch
-
-# Interactive mode
-python scripts/styling/inference.py infer --config configs/styling/formal.yaml
-```
-
-**Expected Output:**
-```
-Inference completed successfully!
-  Input: Hey, what's up?
-  Output: Hello, how are you doing?
-  Style: Formal
-```
-
-## Adding New Tasks
-
-To add a new task (e.g., completion, styling, matching), follow these steps:
-
-### Example: Styling Task (Already Implemented)
-
-The styling task demonstrates a complete implementation:
-
-1. **Task Directory Structure** ✅
-```bash
-configs/styling/              # YAML configurations
-data/raw/styling/             # Raw style transfer data
-data/processed/styling/       # Processed data
-pipelines/styling/            # Core pipeline scripts
-scripts/styling/              # User-friendly scripts
-models/styling/               # Trained models
-```
-
-2. **Pipeline Components** ✅
-- **Data Processor**: Handles style transfer datasets with instruction/input/output format
-- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
-- **Inference**: Style transfer with streaming and batch processing
-
-3. **Key Features** ✅
-- Automatic EOS token handling: `text + tokenizer.eos_token`
-- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
-- YAML integration: Uses `data.output_dir` for automatic dataset loading
-- HuggingFace dataset export and loading
-
-### For Other Tasks (completion, matching)
-
-1. **Create Task Directory Structure**
-```bash
-# Create task directories
-mkdir -p configs/completion
-mkdir -p data/raw/completion data/processed/completion
-mkdir -p pipelines/completion
-mkdir -p scripts/completion
-mkdir -p results/completion
-mkdir -p tasks/completion
-mkdir -p models/completion
-```
-
-2. **Create Task Configuration**
-
-```bash
-# Create YAML configuration for new task
-cat > configs/completion/text_generation.yaml << 'EOF'
-# Text Generation Task Configuration
-task:
-  name: "completion"
-  type: "text_generation"
-
-# Data Processing Configuration
-data:
-  source: "huggingface"
-  dataset_name: "your-dataset-name"
-  output_dir: "./data/processed/completion/text_generation"
-  max_samples: 1000
-  # ... other data parameters
-
-# Model Configuration
-model:
-  name: "gpt2"  # Different model for completion
-  max_length: 1024
-  # ... model parameters
-
-# Training Configuration
-training:
-  num_epochs: 3
-  batch_size: 8  # Smaller batch for generation
-  learning_rate: 5e-5
-  data_dir: "./data/processed/completion/text_generation"
-  output_dir: "./results/completion/text_generation_model"
-
-# Inference Configuration
-inference:
-  model_path: "./results/completion/text_generation_model"
-  device: "auto"
-  batch_size: 1  # Generation is typically one at a time
-  max_length: 100
-  temperature: 0.7
-EOF
-```
-
-3. **Create Pipeline Scripts**
-
-Copy and modify the classification pipeline scripts:
-
-```bash
-# Copy classification scripts as templates
-cp pipelines/classification/data_processor.py pipelines/completion/
-cp pipelines/classification/train.py pipelines/completion/
-cp pipelines/classification/inference.py pipelines/completion/
-
-# Copy task scripts
-cp scripts/classification/data_processor.py scripts/completion/
-cp scripts/classification/trainer.py scripts/completion/
-cp scripts/classification/inference.py scripts/completion/
-```
-
-4. **Modify Pipeline Code**
-
-Update the pipeline scripts for your specific task:
-
-1. **Data Processor** (`pipelines/completion/data_processor.py`):
-   - Update data loading logic for completion datasets
-   - Modify preprocessing for text generation
-   - Adjust output format for completion tasks
-
-2. **Trainer** (`pipelines/completion/train.py`):
-   - Change model type to generation models (GPT, T5, etc.)
-   - Update training loop for text generation
-   - Modify evaluation metrics
-
-3. **Inference** (`pipelines/completion/inference.py`):
-   - Update inference for text generation
-   - Add generation parameters (temperature, top-k, etc.)
-   - Modify output format
-
-5. **Update Task Scripts**
-
-Modify the task scripts to use your new pipeline:
-
-```python
-# scripts/completion/data_processor.py
-def run_with_yaml_config(config_path: str, **cli_overrides):
-    cmd = [
-        "python", "pipelines/completion/data_processor.py",  # Updated path
-        "--config", config_path
-    ]
-    # ... rest of the function
-```
-
-6. **Create Task-Specific Models**
-
-```bash
-# Create model directory
-mkdir -p models/completion
-
-# Add task-specific model classes
-cat > models/completion/text_generator.py << 'EOF'
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-class TextGenerator:
-    def __init__(self, model_name):
-        self.model = AutoModelForCausalLM.from_pretrained(model_name)
-        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-    def generate(self, prompt, max_length=100, temperature=0.7):
-        # Implementation for text generation
-        pass
-EOF
-```
-
-7. **Test Your New Task**
-
-```bash
-# Test data processing
-python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml
-
-# Test training
-python scripts/completion/trainer.py --config configs/completion/text_generation.yaml
-
-# Test inference
-python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
-```
-
-## YAML Configuration Guide
-
-### Configuration Structure
-
-Each YAML file is organized into clear sections:
+All configurations follow this hierarchical structure:
 
 ```yaml
 # Task Configuration
 task:
-  name: "classification"  # or "completion", "styling", "matching"
-  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"
+  name: "task_type"                    # classification, completion, styling, matching
+  type: "specific_type"                # e.g., "sentiment_analysis", "style_transfer"
 
-# Data Processing Configuration
+# Data Configuration
 data:
-  source: "huggingface"  # "huggingface" or "custom"
-  dataset_name: "dair-ai/emotion"  # HuggingFace dataset name
-  output_dir: "./data/processed/classification/emotion"
-  max_samples: 1000  # Limit dataset size
-  # ... other data parameters
+  source: "custom"                     # "custom" or "huggingface"
+  data_path: "./data/raw/..."          # Path to raw data
+  input_field: "text"                  # Field name for input
+  output_field: "label"                # Field name for output
+  instruction: "Task instruction"      # For instruction-following tasks
 
 # Model Configuration
 model:
-  name: "bert-base-uncased"  # Model from HuggingFace Hub
-  max_length: 512  # Sequence length
-  num_labels: 6  # Number of classes
+  name: "model_name"                   # HuggingFace model identifier
+  max_seq_length: 2048                 # Maximum sequence length
+  dtype: null                          # Data type (auto-detected)
+  load_in_4bit: true                   # 4-bit quantization
 
 # Training Configuration
 training:
-  num_epochs: 3  # Training epochs
-  batch_size: 16  # Batch size
-  learning_rate: 2e-5  # Learning rate
-  data_dir: "./data/processed/classification/emotion"
-  output_dir: "./results/classification/emotion_model"
+  num_epochs: 3                        # Training epochs
+  batch_size: 4                        # Batch size
+  learning_rate: 2e-4                  # Learning rate
+  warmup_steps: 5                      # Warmup steps
+  max_steps: 60                        # Maximum training steps
 
 # Inference Configuration
 inference:
-  model_path: "./results/classification/emotion_model"
-  device: "auto"  # "auto", "cuda", "cpu"
-  batch_size: 32  # Inference batch size
-  return_top_k: 3  # Top K predictions
+  batch_size: 32                       # Inference batch size
+  max_new_tokens: 128                  # Max tokens to generate
+  temperature: 0.8                     # Sampling temperature
 ```
 
-### Styling Configuration Example
+### Configuration Parameters
+
+#### Data Processing Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `source` | string | "custom" | Data source type |
+| `data_path` | string | required | Path to raw data file |
+| `input_field` | string | "text" | Input field name |
+| `output_field` | string | "label" | Output field name |
+| `instruction` | string | task-specific | Task instruction |
+| `data_format` | string | "jsonl" | Data file format |
+| `max_length` | int | 256 | Maximum text length |
+| `min_length` | int | 10 | Minimum text length |
+| `clean_text` | boolean | true | Enable text cleaning |
+| `lowercase` | boolean | false | Convert to lowercase |
+
+#### Model Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `name` | string | required | HuggingFace model name |
+| `max_seq_length` | int | 2048 | Maximum sequence length |
+| `dtype` | string | null | Data type (auto-detected) |
+| `load_in_4bit` | boolean | true | Enable 4-bit quantization |
+| `token` | string | null | HuggingFace access token |
+
+#### Training Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `num_epochs` | int | 1 | Number of training epochs |
+| `batch_size` | int | 2 | Training batch size |
+| `learning_rate` | float | 2e-4 | Learning rate |
+| `weight_decay` | float | 0.01 | Weight decay |
+| `warmup_steps` | int | 5 | Warmup steps |
+| `max_steps` | int | 60 | Maximum training steps |
+| `gradient_accumulation_steps` | int | 4 | Gradient accumulation |
+
+#### LoRA Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `lora_r` | int | 16 | LoRA rank |
+| `lora_alpha` | int | 16 | LoRA alpha |
+| `lora_dropout` | float | 0 | LoRA dropout |
+| `target_modules` | list | ["q_proj", "k_proj", "v_proj", "o_proj"] | Target modules for LoRA |
+
+### Environment Variables
+
+```bash
+# HuggingFace token for gated models
+export HF_TOKEN="hf_..."
+ +# CUDA device selection +export CUDA_VISIBLE_DEVICES="0" + +# Logging level +export LOG_LEVEL="INFO" +``` + +## Scripts & Commands + +### Data Processing Scripts + +#### Basic Usage + +```bash +python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml +``` + +#### Advanced Options + +```bash +python scripts/[task_type]/data_processor.py \ + --config configs/[task_type]/[config].yaml \ + --max-samples 1000 \ + --log-level DEBUG \ + --create-hf-dataset \ + --hf-dataset-path ./datasets/[task_name] +``` + +#### Command Line Arguments + +| Argument | Type | Default | Description | +|----------|------|---------|-------------| +| `--config` | string | required | YAML configuration file | +| `--max-samples` | int | all | Maximum samples to process | +| `--log-level` | string | "INFO" | Logging level | +| `--create-hf-dataset` | flag | false | Create HuggingFace dataset | +| `--hf-dataset-path` | string | auto | HuggingFace dataset path | + +### Training Scripts + +#### Basic Usage + +```bash +python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml +``` + +#### Advanced Options + +```bash +python scripts/[task_type]/train.py train \ + --config configs/[task_type]/[config].yaml \ + --epochs 5 \ + --batch-size 8 \ + --learning-rate 1e-4 \ + --max-steps 100 +``` + +#### Command Line Arguments + +| Argument | Type | Default | Description | +|----------|------|---------|-------------| +| `--config` | string | required | YAML configuration file | +| `--epochs` | int | YAML value | Override training epochs | +| `--batch-size` | int | YAML value | Override batch size | +| `--learning-rate` | float | YAML value | Override learning rate | +| `--max-steps` | int | YAML value | Override max steps | +| `--output-dir` | string | YAML value | Override output directory | + +### Inference Scripts + +#### Basic Usage + +```bash +python scripts/[task_type]/inference.py infer \ + --config configs/[task_type]/[config].yaml \ + 
--input-text "Your input text here" +``` + +#### Advanced Options + +```bash +python scripts/[task_type]/inference.py infer \ + --config configs/[task_type]/[config].yaml \ + --input-text "Your input text here" \ + --max-tokens 256 \ + --temperature 0.7 \ + --stream +``` + +#### Command Line Arguments + +| Argument | Type | Default | Description | +|----------|------|---------|-------------| +| `--config` | string | required | YAML configuration file | +| `--input-text` | string | required | Text to process | +| `--max-tokens` | int | 128 | Maximum tokens to generate | +| `--temperature` | float | 0.8 | Sampling temperature | +| `--stream` | flag | false | Enable streaming generation | + +### Batch Processing + +```bash +# Process multiple inputs from file +python scripts/[task_type]/inference.py batch \ + --config configs/[task_type]/[config].yaml \ + --input-file input.txt \ + --output-file output.txt +``` + +### Interactive Mode + +```bash +# Enter interactive mode for testing +python scripts/[task_type]/inference.py interactive \ + --config configs/[task_type]/[config].yaml +``` + +## Complete Workflows + +### Classification Task Workflow + +#### 1. Data Preparation + +```jsonl +# data/raw/classification/sentiment.jsonl +{"text": "I love this movie!", "label": "positive"} +{"text": "This is terrible", "label": "negative"} +{"text": "It's okay", "label": "neutral"} +``` + +#### 2. Configuration ```yaml -# Styling Task Configuration +# configs/classification/sentiment.yaml +task: + name: "classification" + type: "sentiment_analysis" + +data: + source: "custom" + data_path: "./data/raw/classification/sentiment.jsonl" + input_field: "text" + output_field: "label" + instruction: "Classify the sentiment of the following text" + +model: + name: "microsoft/DialoGPT-medium" + max_seq_length: 512 + +training: + num_epochs: 3 + batch_size: 8 + learning_rate: 3e-5 +``` + +#### 3. 
Execute Pipeline + +```bash +# Process data +python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml + +# Train model +python scripts/classification/train.py train --config configs/classification/sentiment.yaml + +# Run inference +python scripts/classification/inference.py infer \ + --config configs/classification/sentiment.yaml \ + --input-text "This product exceeded my expectations!" +``` + +### Styling Task Workflow + +#### 1. Data Preparation + +```jsonl +# data/raw/styling/formal.jsonl +{"text": "Hey there!", "styled_text": "Hello, how are you?"} +{"text": "I'm gonna go", "styled_text": "I will be going"} +{"text": "This is cool", "styled_text": "This is quite impressive"} +``` + +#### 2. Configuration + +```yaml +# configs/styling/formal.yaml task: name: "styling" type: "style_transfer" -# Data Processing Configuration data: source: "custom" - data_path: "./data/raw/styling/sample_formal.jsonl" + data_path: "./data/raw/styling/formal.jsonl" input_field: "text" output_field: "styled_text" instruction: "Rewrite the following text in a formal style" - output_dir: "./data/processed/styling/formal" - output_format: "alpaca" -# Model Configuration model: - training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" - training_max_seq_length: 2048 - training_load_in_4bit: true + name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" + max_seq_length: 2048 -# Training Configuration training: num_epochs: 3 - batch_size: 2 + batch_size: 4 learning_rate: 2e-4 - weight_decay: 0.01 - -# Inference Configuration -inference: - batch_size: 1 - max_new_tokens: 128 - temperature: 0.8 + model_output_dir: "./models/styling" ``` -### Available Configuration Files - -- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset -- `configs/classification/custom.yaml` - Custom dataset processing -- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning - -## Usage Examples - -### Data Processing 
Examples +#### 3. Execute Pipeline ```bash -# 1. Use YAML config only -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml - -# 2. Override YAML values -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500 - -# 3. Use CLI only (backward compatibility) -python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion - -# 4. Run examples -python scripts/classification/data_processor.py examples -``` - -### Training Examples - -```bash -# 1. Use YAML config only -python scripts/classification/trainer.py --config configs/classification/emotion.yaml - -# 2. Override YAML values -python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 - -# 3. Use CLI only -python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3 - -# 4. Run examples -python scripts/classification/trainer.py examples -``` - -### Inference Examples - -```bash -# 1. Single text prediction -python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!" - -# 2. File-based prediction -python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl - -# 3. Interactive mode -python scripts/classification/inference.py --config configs/classification/emotion.yaml - -# 4. Run examples -python scripts/classification/inference.py examples -``` - -### Styling Examples - -```bash -# 1. Data Processing +# Process data python scripts/styling/data_processor.py --config configs/styling/formal.yaml -python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset -# 2. 
Training -python scripts/styling/train.py example -python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2 +# Train model +python scripts/styling/train.py train --config configs/styling/formal.yaml -# 3. Inference -python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?" -python scripts/styling/inference.py batch -python scripts/styling/inference.py infer --config configs/styling/formal.yaml - -# 4. Run examples -python scripts/styling/data_processor.py examples -python scripts/styling/train.py features -python scripts/styling/inference.py features +# Run inference +python scripts/styling/inference.py infer \ + --config configs/styling/formal.yaml \ + --instruction "Rewrite in formal style" \ + --input-text "Hey there! What's up?" ``` -## Troubleshooting Common Errors +### Completion Task Workflow -### 1. ModuleNotFoundError: No module named 'utils' +#### 1. Data Preparation -**Error:** -``` -ModuleNotFoundError: No module named 'utils' +```jsonl +# data/raw/completion/story.jsonl +{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."} +{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."} ``` -**Solution:** -```bash -# Set Python path before running scripts -export PYTHONPATH=. -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml -``` +#### 2. Configuration -### 2. Model Path Not Found - -**Error:** -``` -Model path not found: ./results/classification/emotion_model -``` - -**Solution:** -```bash -# Train the model first -python scripts/classification/trainer.py --config configs/classification/emotion.yaml - -# Then run inference -python scripts/classification/inference.py --config configs/classification/emotion.yaml -``` - -### 3. 
Data Directory Not Found - -**Error:** -``` -Data directory not found: ./data/processed/classification/emotion -``` - -**Solution:** -```bash -# Process data first -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml - -# Then train -python scripts/classification/trainer.py --config configs/classification/emotion.yaml -``` - -### 4. YAML Configuration Errors - -**Error:** -``` -data_processor.py: error: --data-source is required (either in YAML config or CLI) -``` - -**Solution:** -Check your YAML file structure. It should have: ```yaml -data: - source: "huggingface" # Not data_source - dataset_name: "dair-ai/emotion" -``` +# configs/completion/story.yaml +task: + name: "completion" + type: "story_generation" -### 5. HuggingFace Download Issues - -**Error:** -``` -KeyboardInterrupt during model download -``` - -**Solution:** -```bash -# Use smaller dataset for testing -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100 - -# Or use cached models -export HF_HOME=./cache -``` - -### 6. 
CUDA/GPU Issues - -**Error:** -``` -RuntimeError: CUDA out of memory -``` - -**Solution:** -```bash -# Reduce batch size -python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8 - -# Or use CPU -python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu -``` - -## Monitoring and Logs - -### Check Processing Status - -```bash -# Check data processing output -ls -la ./data/processed/classification/emotion/classification/ - -# Check training output -ls -la ./results/classification/emotion_model/ - -# Check logs -tail -f logs/training.log -``` - -### Expected File Structure After Processing - -``` -./data/processed/classification/emotion/classification/ -├── train.jsonl # Training data -├── validation.jsonl # Validation data -└── test.jsonl # Test data - -./results/classification/emotion_model/ -├── config.json # Model configuration -├── pytorch_model.bin # Model weights -├── tokenizer.json # Tokenizer -└── label_info.json # Label mappings -``` - -## Workflow Summary - -### Classification Task -1. **Setup**: Install dependencies and set PYTHONPATH -2. **Data Processing**: Process raw data into organized splits -3. **Training**: Train model using processed data -4. **Inference**: Use trained model for predictions -5. **Monitoring**: Check logs and outputs for errors - -### Styling Task -1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH -2. **Data Processing**: Process style transfer data with instruction/input/output format -3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer -4. **Inference**: Style transfer with streaming and batch processing -5. **Monitoring**: Check training logs and model outputs - -## Creating Custom Configurations - -### For New Datasets - -1. Copy existing config: -```bash -cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml -``` - -2. 
Modify parameters:
-```yaml
-data:
-  source: "huggingface"
-  dataset_name: "your-dataset-name"
-  output_dir: "./data/processed/classification/my_dataset"
-  # ... other parameters
-
-training:
-  data_dir: "./data/processed/classification/my_dataset"
-  output_dir: "./results/classification/my_dataset_model"
-```
-
-3. Run pipeline:
-```bash
-python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
-```
-
-### For Custom Data
-
-1. Use custom config:
-```yaml
 data:
   source: "custom"
-  data_path: "./data/raw/my_data.jsonl"
-  output_dir: "./data/processed/classification/my_custom_dataset"
+  data_path: "./data/raw/completion/story.jsonl"
+  input_field: "prompt"
+  output_field: "completion"
+
+model:
+  name: "gpt2-medium"
+  max_seq_length: 1024
+
+training:
+  num_epochs: 2
+  batch_size: 16
+  learning_rate: 5e-5
 ```
-2. Run processing:
+#### 3. Execute Pipeline
+
 ```bash
-python scripts/classification/data_processor.py --config configs/classification/custom.yaml
+# Process data
+python scripts/completion/data_processor.py --config configs/completion/story.yaml
+
+# Train model
+python scripts/completion/train.py train --config configs/completion/story.yaml
+
+# Run inference
+python scripts/completion/inference.py infer \
+  --config configs/completion/story.yaml \
+  --input-text "The wizard cast a spell"
 ```
-## Best Practices
+## API Reference
-1. **Always check output directories** before running next step
-2. **Use small datasets for testing** before full runs
-3. **Monitor logs** for errors and warnings
-4. **Backup configurations** before major changes
-5. **Use version control** for YAML files
-6.
**Test with CLI overrides** for quick experiments
+### Data Processing Classes
+
+#### BaseDataProcessor
+
+```python
+class BaseDataProcessor:
+    def __init__(self, config: Dict[str, Any])
+    def load_and_preprocess(self) -> Tuple[Dict, Dict]
+    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]
+    def save_data(self, data: Dict, output_path: str)
+```
+
+#### ClassificationDataProcessor
+
+```python
+class ClassificationDataProcessor(BaseDataProcessor):
+    def convert_to_classification_format(self, data: Dict) -> Dict
+    def create_label_mapping(self, labels: List[str]) -> Dict[str, int]
+```
+
+#### StylingDataProcessor
+
+```python
+class StylingDataProcessor(BaseDataProcessor):
+    def convert_to_alpaca_format(self, data: Dict) -> Dict
+    def format_for_training(self, data: Dict) -> Dict
+```
+
+### Training Classes
+
+#### BaseTrainer
+
+```python
+class BaseTrainer:
+    def __init__(self, config: Dict[str, Any])
+    def load_model_and_tokenizer(self)
+    def setup_training(self, dataset: Dataset)
+    def train(self, dataset_path: str) -> Dict
+    def save_model(self)
+```
+
+#### ClassificationTrainer
+
+```python
+class ClassificationTrainer(BaseTrainer):
+    def setup_classification_head(self)
+    def compute_metrics(self, eval_pred) -> Dict
+```
+
+#### StylingTrainer
+
+```python
+class StylingTrainer(BaseTrainer):
+    def setup_lora(self)
+    def format_dataset(self, dataset: Dataset) -> Dataset
+```
+
+### Inference Classes
+
+#### BaseInference
+
+```python
+class BaseInference:
+    def __init__(self, config: Dict[str, Any])
+    def load_model_and_tokenizer(self)
+    def preprocess_input(self, input_text: str) -> torch.Tensor
+    def postprocess_output(self, output: torch.Tensor) -> str
+```
+
+#### ClassificationInference
+
+```python
+class ClassificationInference(BaseInference):
+    def classify(self, text: str) -> Dict[str, float]
+    def batch_classify(self, texts: List[str]) -> List[Dict]
+```
+
+#### StylingInference
+
+```python
+class StylingInference(BaseInference):
+
def style_transfer(self, text: str, instruction: str) -> str
+    def generate_text(self, instruction: str, input_text: str) -> str
+```
+
+## Troubleshooting
+
+### Common Issues
+
+#### 1. Model Loading Errors
+
+**Error**: `FileNotFoundError: ./models/[task_name]/*.json`
+
+**Solution**:
+- Verify model was trained successfully
+- Check `model_output_dir` in YAML config
+- Ensure model files exist in specified directory
+
+#### 2. Memory Issues
+
+**Error**: `CUDA out of memory`
+
+**Solution**:
+- Reduce `batch_size` in YAML config
+- Enable `load_in_4bit: true`
+- Use gradient accumulation
+- Reduce `max_seq_length`
+
+#### 3. Data Format Errors
+
+**Error**: `KeyError: 'input_field'`
+
+**Solution**:
+- Verify field names in JSONL/CSV files
+- Check `input_field` and `output_field` in YAML
+- Ensure data format matches expected structure
+
+#### 4. Training Convergence Issues
+
+**Symptoms**: Loss not decreasing, poor model performance
+
+**Solution**:
+- Adjust learning rate (try 1e-5 to 5e-4)
+- Increase training epochs
+- Check data quality and quantity
+- Verify label distribution (for classification)
+
+### Debug Mode
+
+Enable detailed logging:
+
+```bash
+export LOG_LEVEL="DEBUG"
+python scripts/[task_type]/[script].py --log-level DEBUG
+```
+
+### Performance Optimization
+
+#### Memory Optimization
+
+```yaml
+model:
+  load_in_4bit: true  # 4-bit quantization
+  dtype: "float16"  # Use float16 if supported
+
+training:
+  gradient_accumulation_steps: 4  # Effective batch size = batch_size * steps
+  max_grad_norm: 1.0  # Gradient clipping
+```
+
+#### Speed Optimization
+
+```yaml
+training:
+  dataloader_num_workers: 4  # Parallel data loading
+  fp16: true  # Mixed precision training
+  bf16: false  # Disable bfloat16 if not supported
+```
+
+## Contributing
+
+### Adding New Task Types
+
+1.
**Create task directory structure**:
+```
+pipelines/[new_task]/
+├── __init__.py
+├── data_processor.py
+├── train.py
+└── inference.py
+
+scripts/[new_task]/
+├── __init__.py
+├── data_processor.py
+├── train.py
+└── inference.py
+
+configs/[new_task]/
+└── example.yaml
+```
+
+2. **Implement base classes**:
+- Extend `BaseDataProcessor`
+- Extend `BaseTrainer`
+- Extend `BaseInference`
+
+3. **Add configuration templates**:
+- Define task-specific parameters
+- Document all configuration options
+
+4. **Update documentation**:
+- Add task description to README
+- Include usage examples
+- Document configuration parameters
+
+### Code Style
+
+- Follow PEP 8 guidelines
+- Use type hints for all functions
+- Include comprehensive docstrings
+- Add unit tests for new functionality
+
+### Testing
+
+```bash
+# Run all tests
+python -m pytest tests/
+
+# Run specific task tests
+python -m pytest tests/[task_type]/
+
+# Run with coverage
+python -m pytest --cov=pipelines tests/
+```
+
+## License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 ## Support
-For issues and questions:
-1. Check the troubleshooting section above
-2. Review logs in the output directories
-3. Verify YAML configuration structure
-4. Test with smaller datasets first
+- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
+- **Discussions**: [GitHub Discussions](https://github.com/your-repo/discussions)
+- **Documentation**: [Wiki](https://github.com/your-repo/wiki)
 ---
-**Happy fine-tuning!**
+**Happy fine-tuning!
🚀**
diff --git a/.ipynb_checkpoints/untitled-checkpoint.txt b/.ipynb_checkpoints/untitled-checkpoint.txt
new file mode 100644
index 0000000..e69de29
diff --git a/README.md b/README.md
index f1ba946..517945e 100644
--- a/README.md
+++ b/README.md
@@ -1,763 +1,726 @@
-# Fine-Tune Task: NLP Pipeline Framework
+# Fine-Tuning Task Framework
-A comprehensive framework for fine-tuning NLP models with organized YAML configurations, supporting multiple tasks (classification, completion, styling, matching).
+A comprehensive framework for fine-tuning Large Language Models (LLMs) across multiple task types including classification, completion, styling, and matching.
-## Supported Tasks
+## Table of Contents
-This framework supports multiple NLP tasks with organized configurations:
+- [Overview](#overview)
+- [Architecture](#architecture)
+- [Task Types](#task-types)
+- [Quick Start](#quick-start)
+- [Configuration Guide](#configuration-guide)
+- [Scripts & Commands](#scripts--commands)
+- [Complete Workflows](#complete-workflows)
+- [API Reference](#api-reference)
+- [Troubleshooting](#troubleshooting)
+- [Contributing](#contributing)
-- **Classification**: Text classification, sentiment analysis, topic classification
-- **Completion**: Text generation, code completion, story generation
-- **Styling**: Style transfer, tone classification, writing style adaptation
-- **Matching**: Semantic matching, entity matching, similarity scoring
+## Overview
-### Current Implementation Status
+This framework provides a unified approach to fine-tuning LLMs for various NLP tasks.
It's designed to be:
-- **Classification**: ✅ Fully implemented with emotion classification example
-- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
-- **Completion**: Planned for future updates
-- **Matching**: Planned for future updates
+- **Task-Agnostic**: Same pipeline structure for different task types
+- **Configuration-Driven**: YAML-based configuration for all parameters
+- **Developer-Friendly**: Clear scripts and comprehensive logging
+- **Production-Ready**: Built-in validation, error handling, and optimization
-**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.
+## Architecture
-## Project Structure
+The framework follows a **modular pipeline architecture**:
 ```
-fine-tune-task/
-├── configs/  # YAML configuration files
-│   ├── classification/  # ✅ Implemented
-│   │   ├── emotion.yaml  # Emotion classification
-│   │   └── custom.yaml  # Custom dataset
-│   ├── styling/  # ✅ Implemented
-│   │   └── formal.yaml  # Formal style transfer
-│   ├── completion/  # Planned for future updates
-│   └── matching/  # Planned for future updates
-├── data/  # Data directories
-│   ├── raw/  # Raw input data
-│   │   ├── classification/  # ✅ Implemented
-│   │   ├── styling/  # ✅ Implemented
-│   │   ├── completion/  # Planned for future updates
-│   │   └── matching/  # Planned for future updates
-│   └── processed/  # Processed data
-│       ├── classification/  # ✅ Implemented
-│       ├── styling/  # ✅ Implemented
-│       ├── completion/  # Planned for future updates
-│       └── matching/  # Planned for future updates
-├── pipelines/  # Core pipeline scripts
-│   ├── classification/  # ✅ Implemented
-│   │   ├── data_processor.py  # Data processing
-│   │   ├── train.py  # Training
-│   │   └── inference.py  # Inference
-│   ├── styling/  # ✅ Implemented
-│   │   ├── data_processor.py  # Style data processing
-│   │   ├── train.py  # LoRA fine-tuning
-│   │   └── inference.py  # Style transfer inference
-│   ├── completion/  # Planned for future updates
-│   └── matching/  #
Planned for future updates
-├── scripts/  # User-friendly scripts
-│   ├── classification/  # ✅ Implemented
-│   │   ├── data_processor.py  # Data processing script
-│   │   ├── trainer.py  # Training script
-│   │   └── inference.py  # Inference script
-│   ├── styling/  # ✅ Implemented
-│   │   ├── data_processor.py  # Style data processing script
-│   │   ├── train.py  # Training script
-│   │   └── inference.py  # Inference script
-│   ├── completion/  # Planned for future updates
-│   └── matching/  # Planned for future updates
-├── results/  # Model outputs
-│   ├── classification/  # ✅ Implemented
-│   ├── styling/  # ✅ Implemented
-│   ├── completion/  # Planned for future updates
-│   └── matching/  # Planned for future updates
-└── utils/  # Shared utility modules
+Raw Data → Data Processing → Model Training → Inference/Evaluation
+    ↓            ↓                 ↓                  ↓
+ JSONL/CSV   HuggingFace        Trained           Ready for
+  Files       Datasets          Models            Production
 ```
-## Quick Start (Classification Task)
+### Core Components
-### 1. Setup Environment
+1. **Data Processors**: Convert raw data to training-ready formats
+2. **Training Pipelines**: Task-specific training with optimization
+3. **Inference Engines**: Production-ready text generation/classification
+4. **Configuration Management**: YAML-based parameter control
+5. **Utility Scripts**: Command-line interfaces for all operations
+
+## Task Types
+
+### 1. Classification Task
+
+**Purpose**: Text classification, sentiment analysis, topic categorization
+
+**Data Format**:
+```jsonl
+{"text": "I love this product!", "label": "positive"}
+{"text": "This is terrible", "label": "negative"}
+```
+
+**Output**: Classification probabilities and predicted labels
+
+**Use Cases**: Sentiment analysis, spam detection, content moderation
+
+### 2.
Completion Task
+
+**Purpose**: Text generation, story completion, code generation
+
+**Data Format**:
+```jsonl
+{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
+{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}
+```
+
+**Output**: Generated text continuations
+
+**Use Cases**: Creative writing, code completion, content generation
+
+### 3. Styling Task
+
+**Purpose**: Style transfer, tone modification, writing style adaptation
+
+**Data Format**:
+```jsonl
+{"text": "Hey there!", "styled_text": "Hello, how are you?"}
+{"text": "I'm gonna go", "styled_text": "I will be going"}
+```
+
+**Output**: Text rewritten in target style
+
+**Use Cases**: Formalization, casualization, domain adaptation
+
+### 4. Matching Task
+
+**Purpose**: Semantic similarity, question-answer matching, paraphrase detection
+
+**Data Format**:
+```jsonl
+{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
+{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}
+```
+
+**Output**: Similarity scores or binary classifications
+
+**Use Cases**: Search relevance, duplicate detection, semantic matching
+
+## Quick Start
+
+### Prerequisites
 ```bash
 # Install dependencies
 pip install -r requirements.txt
-# Set Python path
-export PYTHONPATH=.
+# Verify installation
+python -c "import torch, transformers, datasets; print('✅ All packages installed')"
 ```
-### 2. Data Processing
+### Basic Workflow
 ```bash
-# Process emotion dataset
-python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
+# 1. Process data
+python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
-# Process with custom parameters
-python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000
+# 2.
Train model
+python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
-# Check output location
-ls -la ./data/processed/classification/emotion/classification/
+# 3. Run inference
+python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml
 ```
-**Expected Output:**
-```
-Data processing completed successfully!
-  Data source: huggingface
-  Dataset: dair-ai/emotion
-  Total samples: 2999
-  Unique labels: 6
-  Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
-  Output directory: ./data/processed/classification/emotion
-```
+## Configuration Guide
-### 3. Model Training
+### YAML Structure
-```bash
-# Train using processed data
-python scripts/classification/trainer.py --config configs/classification/emotion.yaml
-
-# Train with custom parameters
-python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32
-
-# Check model output
-ls -la ./results/classification/emotion_model/
-```
-
-**Expected Output:**
-```
-Training completed successfully!
-  Model: bert-base-uncased
-  Data directory: ./data/processed/classification/emotion
-  Training for 3 epochs with batch size 16
-  Model saved to: ./results/classification/emotion_model
-```
-
-### 4. Model Inference
-
-```bash
-# Run inference
-python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"
-
-# File-based inference
-python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
-```
-
-**Expected Output:**
-```
-Inference completed successfully!
-  Loading model from: ./results/classification/emotion_model
-  Predicted label: joy
-  Confidence: 0.8542
-  Top 3 predictions:
-    - joy: 0.8542
-    - love: 0.1234
-    - surprise: 0.0224
-```
-
-## Quick Start (Styling Task)
-
-### 1.
Setup Environment
-
-```bash
-# Install dependencies (including unsloth for styling)
-pip install -r requirements.txt
-
-# Set Python path
-export PYTHONPATH=.
-```
-
-### 2. Data Processing
-
-```bash
-# Process style transfer dataset
-python scripts/styling/data_processor.py --config configs/styling/formal.yaml
-
-# Create HuggingFace dataset
-python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset
-
-# Check output location
-ls -la ./data/processed/styling/formal/
-```
-
-**Expected Output:**
-```
-Styling data processing completed successfully!
-  Data source: custom
-  Data file: ./data/raw/styling/sample_formal.jsonl
-  Total samples: 5
-  Split sizes: {'train': 3, 'validation': 1, 'test': 1}
-  Output directory: ./data/processed/styling/formal
-  Style instruction: Rewrite the following text in a formal style
-```
-
-### 3. Model Training
-
-```bash
-# Train using processed data (automatically loads from YAML output_dir)
-python scripts/styling/train.py example
-
-# Custom training
-python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4
-
-# Check model output
-ls -la ./models/styling/
-```
-
-**Expected Output:**
-```
-Training completed successfully!
-  Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
-  Dataset: Loaded from ./data/processed/styling/formal
-  Training for 3 epochs with batch size 4
-  Model saved to: ./models/styling
-```
-
-### 4. Model Inference
-
-```bash
-# Single text style transfer
-python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
-
-# Batch processing
-python scripts/styling/inference.py batch
-
-# Interactive mode
-python scripts/styling/inference.py infer --config configs/styling/formal.yaml
-```
-
-**Expected Output:**
-```
-Inference completed successfully!
-  Input: Hey, what's up?
-  Output: Hello, how are you doing?
-  Style: Formal
-```
-
-## Adding New Tasks
-
-To add a new task (e.g., completion, styling, matching), follow these steps:
-
-### Example: Styling Task (Already Implemented)
-
-The styling task demonstrates a complete implementation:
-
-1. **Task Directory Structure** ✅
-```bash
-configs/styling/  # YAML configurations
-data/raw/styling/  # Raw style transfer data
-data/processed/styling/  # Processed data
-pipelines/styling/  # Core pipeline scripts
-scripts/styling/  # User-friendly scripts
-models/styling/  # Trained models
-```
-
-2. **Pipeline Components** ✅
-- **Data Processor**: Handles style transfer datasets with instruction/input/output format
-- **Trainer**: LoRA fine-tuning using Unsloth for efficiency
-- **Inference**: Style transfer with streaming and batch processing
-
-3. **Key Features** ✅
-- Automatic EOS token handling: `text + tokenizer.eos_token`
-- Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
-- YAML integration: Uses `data.output_dir` for automatic dataset loading
-- HuggingFace dataset export and loading
-
-### For Other Tasks (completion, matching)
-
-1. **Create Task Directory Structure**
-```bash
-# Create task directories
-mkdir -p configs/completion
-mkdir -p data/raw/completion data/processed/completion
-mkdir -p pipelines/completion
-mkdir -p scripts/completion
-mkdir -p results/completion
-mkdir -p tasks/completion
-mkdir -p models/completion
-```
-
-2. **Create Task Configuration**
-
-```bash
-# Create YAML configuration for new task
-cat > configs/completion/text_generation.yaml << 'EOF'
-# Text Generation Task Configuration
-task:
-  name: "completion"
-  type: "text_generation"
-
-# Data Processing Configuration
-data:
-  source: "huggingface"
-  dataset_name: "your-dataset-name"
-  output_dir: "./data/processed/completion/text_generation"
-  max_samples: 1000
-  # ... other data parameters
-
-# Model Configuration
-model:
-  name: "gpt2"  # Different model for completion
-  max_length: 1024
-  # ...
model parameters
-
-# Training Configuration
-training:
-  num_epochs: 3
-  batch_size: 8  # Smaller batch for generation
-  learning_rate: 5e-5
-  data_dir: "./data/processed/completion/text_generation"
-  output_dir: "./results/completion/text_generation_model"
-
-# Inference Configuration
-inference:
-  model_path: "./results/completion/text_generation_model"
-  device: "auto"
-  batch_size: 1  # Generation is typically one at a time
-  max_length: 100
-  temperature: 0.7
-EOF
-```
-
-3. **Create Pipeline Scripts**
-
-Copy and modify the classification pipeline scripts:
-
-```bash
-# Copy classification scripts as templates
-cp pipelines/classification/data_processor.py pipelines/completion/
-cp pipelines/classification/train.py pipelines/completion/
-cp pipelines/classification/inference.py pipelines/completion/
-
-# Copy task scripts
-cp scripts/classification/data_processor.py scripts/completion/
-cp scripts/classification/trainer.py scripts/completion/
-cp scripts/classification/inference.py scripts/completion/
-```
-
-4. **Modify Pipeline Code**
-
-Update the pipeline scripts for your specific task:
-
-1. **Data Processor** (`pipelines/completion/data_processor.py`):
-   - Update data loading logic for completion datasets
-   - Modify preprocessing for text generation
-   - Adjust output format for completion tasks
-
-2. **Trainer** (`pipelines/completion/train.py`):
-   - Change model type to generation models (GPT, T5, etc.)
-   - Update training loop for text generation
-   - Modify evaluation metrics
-
-3. **Inference** (`pipelines/completion/inference.py`):
-   - Update inference for text generation
-   - Add generation parameters (temperature, top-k, etc.)
-   - Modify output format
-
-5.
**Update Task Scripts**
-
-Modify the task scripts to use your new pipeline:
-
-```python
-# scripts/completion/data_processor.py
-def run_with_yaml_config(config_path: str, **cli_overrides):
-    cmd = [
-        "python", "pipelines/completion/data_processor.py",  # Updated path
-        "--config", config_path
-    ]
-    # ... rest of the function
-```
-
-6. **Create Task-Specific Models**
-
-```bash
-# Create model directory
-mkdir -p models/completion
-
-# Add task-specific model classes
-cat > models/completion/text_generator.py << 'EOF'
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-class TextGenerator:
-    def __init__(self, model_name):
-        self.model = AutoModelForCausalLM.from_pretrained(model_name)
-        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-    def generate(self, prompt, max_length=100, temperature=0.7):
-        # Implementation for text generation
-        pass
-EOF
-```
-
-7. **Test Your New Task**
-
-```bash
-# Test data processing
-python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml
-
-# Test training
-python scripts/completion/trainer.py --config configs/completion/text_generation.yaml
-
-# Test inference
-python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
-```
-
-## YAML Configuration Guide
-
-### Configuration Structure
-
-Each YAML file is organized into clear sections:
+All configurations follow this hierarchical structure:
 ```yaml
 # Task Configuration
 task:
-  name: "classification"  # or "completion", "styling", "matching"
-  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"
+  name: "task_type"  # classification, completion, styling, matching
+  type: "specific_type"  # e.g., "sentiment_analysis", "style_transfer"
-# Data Processing Configuration
+# Data Configuration
 data:
-  source: "huggingface"  # "huggingface" or "custom"
-  dataset_name: "dair-ai/emotion"  # HuggingFace dataset name
-  output_dir:
"./data/processed/classification/emotion"
-  max_samples: 1000  # Limit dataset size
-  # ... other data parameters
+  source: "custom"  # "custom" or "huggingface"
+  data_path: "./data/raw/..."  # Path to raw data
+  input_field: "text"  # Field name for input
+  output_field: "label"  # Field name for output
+  instruction: "Task instruction"  # For instruction-following tasks
 # Model Configuration
 model:
-  name: "bert-base-uncased"  # Model from HuggingFace Hub
-  max_length: 512  # Sequence length
-  num_labels: 6  # Number of classes
+  name: "model_name"  # HuggingFace model identifier
+  max_seq_length: 2048  # Maximum sequence length
+  dtype: null  # Data type (auto-detected)
+  load_in_4bit: true  # 4-bit quantization
 # Training Configuration
 training:
-  num_epochs: 3  # Training epochs
-  batch_size: 16  # Batch size
-  learning_rate: 2e-5  # Learning rate
-  data_dir: "./data/processed/classification/emotion"
-  output_dir: "./results/classification/emotion_model"
+  num_epochs: 3  # Training epochs
+  batch_size: 4  # Batch size
+  learning_rate: 2e-4  # Learning rate
+  warmup_steps: 5  # Warmup steps
+  max_steps: 60  # Maximum training steps
 # Inference Configuration
 inference:
-  model_path: "./results/classification/emotion_model"
-  device: "auto"  # "auto", "cuda", "cpu"
-  batch_size: 32  # Inference batch size
-  return_top_k: 3  # Top K predictions
+  batch_size: 32  # Inference batch size
+  max_new_tokens: 128  # Max tokens to generate
+  temperature: 0.8  # Sampling temperature
 ```
-### Styling Configuration Example
+### Configuration Parameters
+
+#### Data Processing Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `source` | string | "custom" | Data source type |
+| `data_path` | string | required | Path to raw data file |
+| `input_field` | string | "text" | Input field name |
+| `output_field` | string | "label" | Output field name |
+| `instruction` | string | task-specific | Task instruction |
+| `data_format` | string | "jsonl"
| Data file format |
+| `max_length` | int | 256 | Maximum text length |
+| `min_length` | int | 10 | Minimum text length |
+| `clean_text` | boolean | true | Enable text cleaning |
+| `lowercase` | boolean | false | Convert to lowercase |
+
+#### Model Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `name` | string | required | HuggingFace model name |
+| `max_seq_length` | int | 2048 | Maximum sequence length |
+| `dtype` | string | null | Data type (auto-detected) |
+| `load_in_4bit` | boolean | true | Enable 4-bit quantization |
+| `token` | string | null | HuggingFace access token |
+
+#### Training Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `num_epochs` | int | 1 | Number of training epochs |
+| `batch_size` | int | 2 | Training batch size |
+| `learning_rate` | float | 2e-4 | Learning rate |
+| `weight_decay` | float | 0.01 | Weight decay |
+| `warmup_steps` | int | 5 | Warmup steps |
+| `max_steps` | int | 60 | Maximum training steps |
+| `gradient_accumulation_steps` | int | 4 | Gradient accumulation |
+
+#### LoRA Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `lora_r` | int | 16 | LoRA rank |
+| `lora_alpha` | int | 16 | LoRA alpha |
+| `lora_dropout` | float | 0 | LoRA dropout |
+| `target_modules` | list | ["q_proj", "k_proj", "v_proj", "o_proj"] | Target modules for LoRA |
+
+### Environment Variables
+
+```bash
+# HuggingFace token for gated models
+export HF_TOKEN="hf_..."
+
+# CUDA device selection
+export CUDA_VISIBLE_DEVICES="0"
+
+# Logging level
+export LOG_LEVEL="INFO"
+```
+
+## Scripts & Commands
+
+### Data Processing Scripts
+
+#### Basic Usage
+
+```bash
+python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
+```
+
+#### Advanced Options
+
+```bash
+python scripts/[task_type]/data_processor.py \
+  --config configs/[task_type]/[config].yaml \
+  --max-samples 1000 \
+  --log-level DEBUG \
+  --create-hf-dataset \
+  --hf-dataset-path ./datasets/[task_name]
+```
+
+#### Command Line Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--config` | string | required | YAML configuration file |
+| `--max-samples` | int | all | Maximum samples to process |
+| `--log-level` | string | "INFO" | Logging level |
+| `--create-hf-dataset` | flag | false | Create HuggingFace dataset |
+| `--hf-dataset-path` | string | auto | HuggingFace dataset path |
+
+### Training Scripts
+
+#### Basic Usage
+
+```bash
+python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
+```
+
+#### Advanced Options
+
+```bash
+python scripts/[task_type]/train.py train \
+  --config configs/[task_type]/[config].yaml \
+  --epochs 5 \
+  --batch-size 8 \
+  --learning-rate 1e-4 \
+  --max-steps 100
+```
+
+#### Command Line Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--config` | string | required | YAML configuration file |
+| `--epochs` | int | YAML value | Override training epochs |
+| `--batch-size` | int | YAML value | Override batch size |
+| `--learning-rate` | float | YAML value | Override learning rate |
+| `--max-steps` | int | YAML value | Override max steps |
+| `--output-dir` | string | YAML value | Override output directory |
+
+### Inference Scripts
+
+#### Basic Usage
+
+```bash
+python scripts/[task_type]/inference.py infer \
+  --config configs/[task_type]/[config].yaml \
+
--input-text "Your input text here"
+```
+
+#### Advanced Options
+
+```bash
+python scripts/[task_type]/inference.py infer \
+  --config configs/[task_type]/[config].yaml \
+  --input-text "Your input text here" \
+  --max-tokens 256 \
+  --temperature 0.7 \
+  --stream
+```
+
+#### Command Line Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--config` | string | required | YAML configuration file |
+| `--input-text` | string | required | Text to process |
+| `--max-tokens` | int | 128 | Maximum tokens to generate |
+| `--temperature` | float | 0.8 | Sampling temperature |
+| `--stream` | flag | false | Enable streaming generation |
+
+### Batch Processing
+
+```bash
+# Process multiple inputs from file
+python scripts/[task_type]/inference.py batch \
+  --config configs/[task_type]/[config].yaml \
+  --input-file input.txt \
+  --output-file output.txt
+```
+
+### Interactive Mode
+
+```bash
+# Enter interactive mode for testing
+python scripts/[task_type]/inference.py interactive \
+  --config configs/[task_type]/[config].yaml
+```
+
+## Complete Workflows
+
+### Classification Task Workflow
+
+#### 1. Data Preparation
+
+```jsonl
+# data/raw/classification/sentiment.jsonl
+{"text": "I love this movie!", "label": "positive"}
+{"text": "This is terrible", "label": "negative"}
+{"text": "It's okay", "label": "neutral"}
+```
+
+#### 2. Configuration
 ```yaml
-# Styling Task Configuration
+# configs/classification/sentiment.yaml
+task:
+  name: "classification"
+  type: "sentiment_analysis"
+
+data:
+  source: "custom"
+  data_path: "./data/raw/classification/sentiment.jsonl"
+  input_field: "text"
+  output_field: "label"
+  instruction: "Classify the sentiment of the following text"
+
+model:
+  name: "microsoft/DialoGPT-medium"
+  max_seq_length: 512
+
+training:
+  num_epochs: 3
+  batch_size: 8
+  learning_rate: 3e-5
+```
+
+#### 3.
Execute Pipeline
+
+```bash
+# Process data
+python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml
+
+# Train model
+python scripts/classification/train.py train --config configs/classification/sentiment.yaml
+
+# Run inference
+python scripts/classification/inference.py infer \
+  --config configs/classification/sentiment.yaml \
+  --input-text "This product exceeded my expectations!"
+```
+
+### Styling Task Workflow
+
+#### 1. Data Preparation
+
+```jsonl
+# data/raw/styling/formal.jsonl
+{"text": "Hey there!", "styled_text": "Hello, how are you?"}
+{"text": "I'm gonna go", "styled_text": "I will be going"}
+{"text": "This is cool", "styled_text": "This is quite impressive"}
+```
+
+#### 2. Configuration
+
+```yaml
+# configs/styling/formal.yaml
 task:
   name: "styling"
   type: "style_transfer"
-# Data Processing Configuration
 data:
   source: "custom"
-  data_path: "./data/raw/styling/sample_formal.jsonl"
+  data_path: "./data/raw/styling/formal.jsonl"
   input_field: "text"
   output_field: "styled_text"
   instruction: "Rewrite the following text in a formal style"
-  output_dir: "./data/processed/styling/formal"
-  output_format: "alpaca"
-# Model Configuration
 model:
-  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
-  training_max_seq_length: 2048
-  training_load_in_4bit: true
+  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
+  max_seq_length: 2048
-# Training Configuration
 training:
   num_epochs: 3
-  batch_size: 2
+  batch_size: 4
   learning_rate: 2e-4
-  weight_decay: 0.01
-
-# Inference Configuration
-inference:
-  batch_size: 1
-  max_new_tokens: 128
-  temperature: 0.8
+  model_output_dir: "./models/styling"
 ```
-### Available Configuration Files
-
-- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
-- `configs/classification/custom.yaml` - Custom dataset processing
-- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning
-
-## Usage Examples
-
-### Data Processing
Examples +#### 3. Execute Pipeline ```bash -# 1. Use YAML config only -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml - -# 2. Override YAML values -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500 - -# 3. Use CLI only (backward compatibility) -python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion - -# 4. Run examples -python scripts/classification/data_processor.py examples -``` - -### Training Examples - -```bash -# 1. Use YAML config only -python scripts/classification/trainer.py --config configs/classification/emotion.yaml - -# 2. Override YAML values -python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 - -# 3. Use CLI only -python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3 - -# 4. Run examples -python scripts/classification/trainer.py examples -``` - -### Inference Examples - -```bash -# 1. Single text prediction -python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!" - -# 2. File-based prediction -python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl - -# 3. Interactive mode -python scripts/classification/inference.py --config configs/classification/emotion.yaml - -# 4. Run examples -python scripts/classification/inference.py examples -``` - -### Styling Examples - -```bash -# 1. Data Processing +# Process data python scripts/styling/data_processor.py --config configs/styling/formal.yaml -python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset -# 2. 
Training -python scripts/styling/train.py example -python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2 +# Train model +python scripts/styling/train.py train --config configs/styling/formal.yaml -# 3. Inference -python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?" -python scripts/styling/inference.py batch -python scripts/styling/inference.py infer --config configs/styling/formal.yaml - -# 4. Run examples -python scripts/styling/data_processor.py examples -python scripts/styling/train.py features -python scripts/styling/inference.py features +# Run inference +python scripts/styling/inference.py infer \ + --config configs/styling/formal.yaml \ + --instruction "Rewrite in formal style" \ + --input-text "Hey there! What's up?" ``` -## Troubleshooting Common Errors +### Completion Task Workflow -### 1. ModuleNotFoundError: No module named 'utils' +#### 1. Data Preparation -**Error:** -``` -ModuleNotFoundError: No module named 'utils' +```jsonl +# data/raw/completion/story.jsonl +{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."} +{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."} ``` -**Solution:** -```bash -# Set Python path before running scripts -export PYTHONPATH=. -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml -``` +#### 2. Configuration -### 2. Model Path Not Found - -**Error:** -``` -Model path not found: ./results/classification/emotion_model -``` - -**Solution:** -```bash -# Train the model first -python scripts/classification/trainer.py --config configs/classification/emotion.yaml - -# Then run inference -python scripts/classification/inference.py --config configs/classification/emotion.yaml -``` - -### 3. 
Data Directory Not Found - -**Error:** -``` -Data directory not found: ./data/processed/classification/emotion -``` - -**Solution:** -```bash -# Process data first -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml - -# Then train -python scripts/classification/trainer.py --config configs/classification/emotion.yaml -``` - -### 4. YAML Configuration Errors - -**Error:** -``` -data_processor.py: error: --data-source is required (either in YAML config or CLI) -``` - -**Solution:** -Check your YAML file structure. It should have: ```yaml -data: - source: "huggingface" # Not data_source - dataset_name: "dair-ai/emotion" -``` +# configs/completion/story.yaml +task: + name: "completion" + type: "story_generation" -### 5. HuggingFace Download Issues - -**Error:** -``` -KeyboardInterrupt during model download -``` - -**Solution:** -```bash -# Use smaller dataset for testing -python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100 - -# Or use cached models -export HF_HOME=./cache -``` - -### 6. 
CUDA/GPU Issues - -**Error:** -``` -RuntimeError: CUDA out of memory -``` - -**Solution:** -```bash -# Reduce batch size -python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8 - -# Or use CPU -python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu -``` - -## Monitoring and Logs - -### Check Processing Status - -```bash -# Check data processing output -ls -la ./data/processed/classification/emotion/classification/ - -# Check training output -ls -la ./results/classification/emotion_model/ - -# Check logs -tail -f logs/training.log -``` - -### Expected File Structure After Processing - -``` -./data/processed/classification/emotion/classification/ -├── train.jsonl # Training data -├── validation.jsonl # Validation data -└── test.jsonl # Test data - -./results/classification/emotion_model/ -├── config.json # Model configuration -├── pytorch_model.bin # Model weights -├── tokenizer.json # Tokenizer -└── label_info.json # Label mappings -``` - -## Workflow Summary - -### Classification Task -1. **Setup**: Install dependencies and set PYTHONPATH -2. **Data Processing**: Process raw data into organized splits -3. **Training**: Train model using processed data -4. **Inference**: Use trained model for predictions -5. **Monitoring**: Check logs and outputs for errors - -### Styling Task -1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH -2. **Data Processing**: Process style transfer data with instruction/input/output format -3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer -4. **Inference**: Style transfer with streaming and batch processing -5. **Monitoring**: Check training logs and model outputs - -## Creating Custom Configurations - -### For New Datasets - -1. Copy existing config: -```bash -cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml -``` - -2. 
Modify parameters: -```yaml -data: - source: "huggingface" - dataset_name: "your-dataset-name" - output_dir: "./data/processed/classification/my_dataset" - # ... other parameters - -training: - data_dir: "./data/processed/classification/my_dataset" - output_dir: "./results/classification/my_dataset_model" -``` - -3. Run pipeline: -```bash -python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml -``` - -### For Custom Data - -1. Use custom config: -```yaml data: source: "custom" - data_path: "./data/raw/my_data.jsonl" - output_dir: "./data/processed/classification/my_custom_dataset" + data_path: "./data/raw/completion/story.jsonl" + input_field: "prompt" + output_field: "completion" + +model: + name: "gpt2-medium" + max_seq_length: 1024 + +training: + num_epochs: 2 + batch_size: 16 + learning_rate: 5e-5 ``` -2. Run processing: +#### 3. Execute Pipeline + ```bash -python scripts/classification/data_processor.py --config configs/classification/custom.yaml +# Process data +python scripts/completion/data_processor.py --config configs/completion/story.yaml + +# Train model +python scripts/completion/train.py train --config configs/completion/story.yaml + +# Run inference +python scripts/completion/inference.py infer \ + --config configs/completion/story.yaml \ + --input-text "The wizard cast a spell" ``` -## Best Practices +## API Reference -1. **Always check output directories** before running next step -2. **Use small datasets for testing** before full runs -3. **Monitor logs** for errors and warnings -4. **Backup configurations** before major changes -5. **Use version control** for YAML files -6. 
**Test with CLI overrides** for quick experiments +### Data Processing Classes + +#### BaseDataProcessor + +```python +class BaseDataProcessor: + def __init__(self, config: Dict[str, Any]) + def load_and_preprocess(self) -> Tuple[Dict, Dict] + def validate_data(self, data: Dict) -> Tuple[bool, List[str]] + def save_data(self, data: Dict, output_path: str) +``` + +#### ClassificationDataProcessor + +```python +class ClassificationDataProcessor(BaseDataProcessor): + def convert_to_classification_format(self, data: Dict) -> Dict + def create_label_mapping(self, labels: List[str]) -> Dict[str, int] +``` + +#### StylingDataProcessor + +```python +class StylingDataProcessor(BaseDataProcessor): + def convert_to_alpaca_format(self, data: Dict) -> Dict + def format_for_training(self, data: Dict) -> Dict +``` + +### Training Classes + +#### BaseTrainer + +```python +class BaseTrainer: + def __init__(self, config: Dict[str, Any]) + def load_model_and_tokenizer(self) + def setup_training(self, dataset: Dataset) + def train(self, dataset_path: str) -> Dict + def save_model(self) +``` + +#### ClassificationTrainer + +```python +class ClassificationTrainer(BaseTrainer): + def setup_classification_head(self) + def compute_metrics(self, eval_pred) -> Dict +``` + +#### StylingTrainer + +```python +class StylingTrainer(BaseTrainer): + def setup_lora(self) + def format_dataset(self, dataset: Dataset) -> Dataset +``` + +### Inference Classes + +#### BaseInference + +```python +class BaseInference: + def __init__(self, config: Dict[str, Any]) + def load_model_and_tokenizer(self) + def preprocess_input(self, input_text: str) -> torch.Tensor + def postprocess_output(self, output: torch.Tensor) -> str +``` + +#### ClassificationInference + +```python +class ClassificationInference(BaseInference): + def classify(self, text: str) -> Dict[str, float] + def batch_classify(self, texts: List[str]) -> List[Dict] +``` + +#### StylingInference + +```python +class StylingInference(BaseInference): + 
def style_transfer(self, text: str, instruction: str) -> str + def generate_text(self, instruction: str, input_text: str) -> str +``` + +## Troubleshooting + +### Common Issues + +#### 1. Model Loading Errors + +**Error**: `FileNotFoundError: ./models/[task_name]/*.json` + +**Solution**: +- Verify model was trained successfully +- Check `model_output_dir` in YAML config +- Ensure model files exist in specified directory + +#### 2. Memory Issues + +**Error**: `CUDA out of memory` + +**Solution**: +- Reduce `batch_size` in YAML config +- Enable `load_in_4bit: true` +- Use gradient accumulation +- Reduce `max_seq_length` + +#### 3. Data Format Errors + +**Error**: `KeyError: 'input_field'` + +**Solution**: +- Verify field names in JSONL/CSV files +- Check `input_field` and `output_field` in YAML +- Ensure data format matches expected structure + +#### 4. Training Convergence Issues + +**Symptoms**: Loss not decreasing, poor model performance + +**Solution**: +- Adjust learning rate (try 1e-5 to 5e-4) +- Increase training epochs +- Check data quality and quantity +- Verify label distribution (for classification) + +### Debug Mode + +Enable detailed logging: + +```bash +export LOG_LEVEL="DEBUG" +python scripts/[task_type]/[script].py --log-level DEBUG +``` + +### Performance Optimization + +#### Memory Optimization + +```yaml +model: + load_in_4bit: true # 4-bit quantization + dtype: "float16" # Use float16 if supported + +training: + gradient_accumulation_steps: 4 # Effective batch size = batch_size * steps + max_grad_norm: 1.0 # Gradient clipping +``` + +#### Speed Optimization + +```yaml +training: + dataloader_num_workers: 4 # Parallel data loading + fp16: true # Mixed precision training + bf16: false # Disable bfloat16 if not supported +``` + +## Contributing + +### Adding New Task Types + +1. 
**Create task directory structure**: +``` +pipelines/[new_task]/ +├── __init__.py +├── data_processor.py +├── train.py +└── inference.py + +scripts/[new_task]/ +├── __init__.py +├── data_processor.py +├── train.py +└── inference.py + +configs/[new_task]/ +└── example.yaml +``` + +2. **Implement task classes by extending the base classes**: +- Extend `BaseDataProcessor` +- Extend `BaseTrainer` +- Extend `BaseInference` + +3. **Add configuration templates**: +- Define task-specific parameters +- Document all configuration options + +4. **Update documentation**: +- Add task description to README +- Include usage examples +- Document configuration parameters + +### Code Style + +- Follow PEP 8 guidelines +- Use type hints for all functions +- Include comprehensive docstrings +- Add unit tests for new functionality + +### Testing + +```bash +# Run all tests +python -m pytest tests/ + +# Run specific task tests +python -m pytest tests/[task_type]/ + +# Run with coverage +python -m pytest --cov=pipelines tests/ +``` + +## License + +This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. ## Support -For issues and questions: -1. Check the troubleshooting section above -2. Review logs in the output directories -3. Verify YAML configuration structure -4. Test with smaller datasets first +- **Issues**: [GitHub Issues](https://github.com/your-repo/issues) +- **Discussions**: [GitHub Discussions](https://github.com/your-repo/discussions) +- **Documentation**: [Wiki](https://github.com/your-repo/wiki) --- -**Happy fine-tuning!** +**Happy fine-tuning! 🚀** diff --git a/untitled.txt b/untitled.txt new file mode 100644 index 0000000..e69de29