# Fine-Tuning Task Framework

A comprehensive framework for fine-tuning Large Language Models (LLMs) across multiple task types including classification, completion, styling, and matching.

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Task Types](#task-types)
- [Quick Start](#quick-start)
- [Configuration Guide](#configuration-guide)
- [Scripts & Commands](#scripts--commands)
- [Complete Workflows](#complete-workflows)
- [API Reference](#api-reference)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)

## Overview

This framework provides a unified approach to fine-tuning LLMs for various NLP tasks. It's designed to be:

- **Task-Agnostic**: Same pipeline structure for different task types
- **Configuration-Driven**: YAML-based configuration for all parameters
- **Developer-Friendly**: Clear scripts and comprehensive logging
- **Production-Ready**: Built-in validation, error handling, and optimization

## Architecture

The framework follows a **modular pipeline architecture**:

```
Raw Data  →  Data Processing  →  Model Training  →  Inference/Evaluation
    ↓              ↓                   ↓                    ↓
JSONL/CSV     HuggingFace           Trained             Ready for
Files         Datasets              Models              Production
```

### Core Components

1. **Data Processors**: Convert raw data to training-ready formats
2. **Training Pipelines**: Task-specific training with optimization
3. **Inference Engines**: Production-ready text generation/classification
4. **Configuration Management**: YAML-based parameter control
5. **Utility Scripts**: Command-line interfaces for all operations

## Task Types

### 1. Classification Task

**Purpose**: Text classification, sentiment analysis, topic categorization

**Data Format**:

```jsonl
{"text": "I love this product!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
```

**Output**: Classification probabilities and predicted labels

**Use Cases**: Sentiment analysis, spam detection, content moderation

### 2. Completion Task

**Purpose**: Text generation, story completion, code generation

**Data Format**:

```jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight..."}
{"prompt": "def calculate_sum", "completion": "(numbers): return sum(numbers)"}
```

**Output**: Generated text continuations

**Use Cases**: Creative writing, code completion, content generation

### 3. Styling Task

**Purpose**: Style transfer, tone modification, writing style adaptation

**Data Format**:

```jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
```

**Output**: Text rewritten in target style

**Use Cases**: Formalization, casualization, domain adaptation

### 4. Matching Task

**Purpose**: Semantic similarity, question-answer matching, paraphrase detection

**Data Format**:

```jsonl
{"text1": "What is AI?", "text2": "Artificial Intelligence", "label": "similar"}
{"text1": "Weather today", "text2": "Cooking recipes", "label": "different"}
```

**Output**: Similarity scores or binary classifications

**Use Cases**: Search relevance, duplicate detection, semantic matching

## Quick Start

### Prerequisites

```bash
# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch, transformers, datasets; print('✅ All packages installed')"
```

### Basic Workflow

```bash
# 1. Process data
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml

# 2. Train model
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml

# 3. Run inference
python scripts/[task_type]/inference.py infer --config configs/[task_type]/[config].yaml
```

## Configuration Guide

### YAML Structure

All configurations follow this hierarchical structure:

```yaml
# Task Configuration
task:
  name: "task_type"                   # classification, completion, styling, matching
  type: "specific_type"               # e.g., "sentiment_analysis", "style_transfer"

# Data Configuration
data:
  source: "custom"                    # "custom" or "huggingface"
  data_path: "./data/raw/..."         # Path to raw data
  input_field: "text"                 # Field name for input
  output_field: "label"               # Field name for output
  instruction: "Task instruction"     # For instruction-following tasks

# Model Configuration
model:
  name: "model_name"                  # HuggingFace model identifier
  max_seq_length: 2048                # Maximum sequence length
  dtype: null                         # Data type (auto-detected)
  load_in_4bit: true                  # 4-bit quantization

# Training Configuration
training:
  num_epochs: 3                       # Training epochs
  batch_size: 4                       # Batch size
  learning_rate: 2e-4                 # Learning rate
  warmup_steps: 5                     # Warmup steps
  max_steps: 60                       # Maximum training steps

# Inference Configuration
inference:
  batch_size: 32                      # Inference batch size
  max_new_tokens: 128                 # Max tokens to generate
  temperature: 0.8                    # Sampling temperature
```

### Configuration Parameters

#### Data Processing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `source` | string | "custom" | Data source type |
| `data_path` | string | required | Path to raw data file |
| `input_field` | string | "text" | Input field name |
| `output_field` | string | "label" | Output field name |
| `instruction` | string | task-specific | Task instruction |
| `data_format` | string | "jsonl" | Data file format |
| `max_length` | int | 256 | Maximum text length |
| `min_length` | int | 10 | Minimum text length |
| `clean_text` | boolean | true | Enable text cleaning |
| `lowercase` | boolean | false | Convert to lowercase |

#### Model Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `name` | string | required | HuggingFace model name |
| `max_seq_length` | int | 2048 | Maximum sequence length |
| `dtype` | string | null | Data type (auto-detected) |
| `load_in_4bit` | boolean | true | Enable 4-bit quantization |
| `token` | string | null | HuggingFace access token |

#### Training Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_epochs` | int | 1 | Number of training epochs |
| `batch_size` | int | 2 | Training batch size |
| `learning_rate` | float | 2e-4 | Learning rate |
| `weight_decay` | float | 0.01 | Weight decay |
| `warmup_steps` | int | 5 | Warmup steps |
| `max_steps` | int | 60 | Maximum training steps |
| `gradient_accumulation_steps` | int | 4 | Gradient accumulation |

#### LoRA Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `lora_r` | int | 16 | LoRA rank |
| `lora_alpha` | int | 16 | LoRA alpha |
| `lora_dropout` | float | 0 | LoRA dropout |
| `target_modules` | list | ["q_proj", "k_proj", "v_proj", "o_proj"] | Target modules for LoRA |

### Environment Variables

```bash
# HuggingFace token for gated models
export HF_TOKEN="hf_..."

# CUDA device selection
export CUDA_VISIBLE_DEVICES="0"

# Logging level
export LOG_LEVEL="INFO"
```

## Scripts & Commands

### Data Processing Scripts

#### Basic Usage

```bash
python scripts/[task_type]/data_processor.py --config configs/[task_type]/[config].yaml
```

#### Advanced Options

```bash
python scripts/[task_type]/data_processor.py \
    --config configs/[task_type]/[config].yaml \
    --max-samples 1000 \
    --log-level DEBUG \
    --create-hf-dataset \
    --hf-dataset-path ./datasets/[task_name]
```

#### Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--max-samples` | int | all | Maximum samples to process |
| `--log-level` | string | "INFO" | Logging level |
| `--create-hf-dataset` | flag | false | Create HuggingFace dataset |
| `--hf-dataset-path` | string | auto | HuggingFace dataset path |

### Training Scripts

#### Basic Usage

```bash
python scripts/[task_type]/train.py train --config configs/[task_type]/[config].yaml
```

#### Advanced Options

```bash
python scripts/[task_type]/train.py train \
    --config configs/[task_type]/[config].yaml \
    --epochs 5 \
    --batch-size 8 \
    --learning-rate 1e-4 \
    --max-steps 100
```

#### Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--epochs` | int | YAML value | Override training epochs |
| `--batch-size` | int | YAML value | Override batch size |
| `--learning-rate` | float | YAML value | Override learning rate |
| `--max-steps` | int | YAML value | Override max steps |
| `--output-dir` | string | YAML value | Override output directory |

### Inference Scripts

#### Basic Usage

```bash
python scripts/[task_type]/inference.py infer \
    --config configs/[task_type]/[config].yaml \
    --input-text "Your input text here"
```

#### Advanced Options

```bash
python scripts/[task_type]/inference.py infer \
    --config configs/[task_type]/[config].yaml \
    --input-text "Your input text here" \
    --max-tokens 256 \
    --temperature 0.7 \
    --stream
```

#### Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--config` | string | required | YAML configuration file |
| `--input-text` | string | required | Text to process |
| `--max-tokens` | int | 128 | Maximum tokens to generate |
| `--temperature` | float | 0.8 | Sampling temperature |
| `--stream` | flag | false | Enable streaming generation |

### Batch Processing

```bash
# Process multiple inputs from file
python scripts/[task_type]/inference.py batch \
    --config configs/[task_type]/[config].yaml \
    --input-file input.txt \
    --output-file output.txt
```

### Interactive Mode

```bash
# Enter interactive mode for testing
python scripts/[task_type]/inference.py interactive \
    --config configs/[task_type]/[config].yaml
```

## Complete Workflows

### Classification Task Workflow

#### 1. Data Preparation

```jsonl
# data/raw/classification/sentiment.jsonl
{"text": "I love this movie!", "label": "positive"}
{"text": "This is terrible", "label": "negative"}
{"text": "It's okay", "label": "neutral"}
```

#### 2. Configuration

```yaml
# configs/classification/sentiment.yaml
task:
  name: "classification"
  type: "sentiment_analysis"

data:
  source: "custom"
  data_path: "./data/raw/classification/sentiment.jsonl"
  input_field: "text"
  output_field: "label"
  instruction: "Classify the sentiment of the following text"

model:
  name: "microsoft/DialoGPT-medium"
  max_seq_length: 512

training:
  num_epochs: 3
  batch_size: 8
  learning_rate: 3e-5
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/classification/data_processor.py --config configs/classification/sentiment.yaml

# Train model
python scripts/classification/train.py train --config configs/classification/sentiment.yaml

# Run inference
python scripts/classification/inference.py infer \
    --config configs/classification/sentiment.yaml \
    --input-text "This product exceeded my expectations!"
```

### Styling Task Workflow

#### 1. Data Preparation

```jsonl
# data/raw/styling/formal.jsonl
{"text": "Hey there!", "styled_text": "Hello, how are you?"}
{"text": "I'm gonna go", "styled_text": "I will be going"}
{"text": "This is cool", "styled_text": "This is quite impressive"}
```

#### 2. Configuration

```yaml
# configs/styling/formal.yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data/raw/styling/formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"

model:
  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  max_seq_length: 2048

training:
  num_epochs: 3
  batch_size: 4
  learning_rate: 2e-4
  model_output_dir: "./models/styling"
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Train model
python scripts/styling/train.py train --config configs/styling/formal.yaml

# Run inference
python scripts/styling/inference.py infer \
    --config configs/styling/formal.yaml \
    --instruction "Rewrite in formal style" \
    --input-text "Hey there! What's up?"
```

### Completion Task Workflow

#### 1. Data Preparation

```jsonl
# data/raw/completion/story.jsonl
{"prompt": "Once upon a time", "completion": "there was a brave knight who lived in a castle..."}
{"prompt": "The dragon roared", "completion": "and the ground shook beneath its massive feet..."}
```

#### 2. Configuration

```yaml
# configs/completion/story.yaml
task:
  name: "completion"
  type: "story_generation"

data:
  source: "custom"
  data_path: "./data/raw/completion/story.jsonl"
  input_field: "prompt"
  output_field: "completion"

model:
  name: "gpt2-medium"
  max_seq_length: 1024

training:
  num_epochs: 2
  batch_size: 16
  learning_rate: 5e-5
```

#### 3. Execute Pipeline

```bash
# Process data
python scripts/completion/data_processor.py --config configs/completion/story.yaml

# Train model
python scripts/completion/train.py train --config configs/completion/story.yaml

# Run inference
python scripts/completion/inference.py infer \
    --config configs/completion/story.yaml \
    --input-text "The wizard cast a spell"
```

## API Reference

### Data Processing Classes

#### BaseDataProcessor

```python
class BaseDataProcessor:
    def __init__(self, config: Dict[str, Any])
    def load_and_preprocess(self) -> Tuple[Dict, Dict]
    def validate_data(self, data: Dict) -> Tuple[bool, List[str]]
    def save_data(self, data: Dict, output_path: str)
```

#### ClassificationDataProcessor

```python
class ClassificationDataProcessor(BaseDataProcessor):
    def convert_to_classification_format(self, data: Dict) -> Dict
    def create_label_mapping(self, labels: List[str]) -> Dict[str, int]
```

#### StylingDataProcessor

```python
class StylingDataProcessor(BaseDataProcessor):
    def convert_to_alpaca_format(self, data: Dict) -> Dict
    def format_for_training(self, data: Dict) -> Dict
```

### Training Classes

#### BaseTrainer

```python
class BaseTrainer:
    def __init__(self, config: Dict[str, Any])
    def load_model_and_tokenizer(self)
    def setup_training(self, dataset: Dataset)
    def train(self, dataset_path: str) -> Dict
    def save_model(self)
```

#### ClassificationTrainer

```python
class ClassificationTrainer(BaseTrainer):
    def setup_classification_head(self)
    def compute_metrics(self, eval_pred) -> Dict
```

#### StylingTrainer

```python
class StylingTrainer(BaseTrainer):
    def setup_lora(self)
    def format_dataset(self, dataset: Dataset) -> Dataset
```

### Inference Classes

#### BaseInference

```python
class BaseInference:
    def __init__(self, config: Dict[str, Any])
    def load_model_and_tokenizer(self)
    def preprocess_input(self, input_text: str) -> torch.Tensor
    def postprocess_output(self, output: torch.Tensor) -> str
```

#### ClassificationInference

```python
class ClassificationInference(BaseInference):
    def classify(self, text: str) -> Dict[str, float]
    def batch_classify(self, texts: List[str]) -> List[Dict]
```

#### StylingInference

```python
class StylingInference(BaseInference):
    def style_transfer(self, text: str, instruction: str) -> str
    def generate_text(self, instruction: str, input_text: str) -> str
```

## Troubleshooting

### Common Issues

#### 1. Model Loading Errors

**Error**: `FileNotFoundError: ./models/[task_name]/*.json`

**Solution**:

- Verify model was trained successfully
- Check `model_output_dir` in YAML config
- Ensure model files exist in specified directory

#### 2. Memory Issues

**Error**: `CUDA out of memory`

**Solution**:

- Reduce `batch_size` in YAML config
- Enable `load_in_4bit: true`
- Use gradient accumulation
- Reduce `max_seq_length`

#### 3. Data Format Errors

**Error**: `KeyError: 'input_field'`

**Solution**:

- Verify field names in JSONL/CSV files
- Check `input_field` and `output_field` in YAML
- Ensure data format matches expected structure

#### 4. Training Convergence Issues

**Symptoms**: Loss not decreasing, poor model performance

**Solution**:

- Adjust learning rate (try 1e-5 to 5e-4)
- Increase training epochs
- Check data quality and quantity
- Verify label distribution (for classification)

### Debug Mode

Enable detailed logging:

```bash
export LOG_LEVEL="DEBUG"
python scripts/[task_type]/[script].py --log-level DEBUG
```

### Performance Optimization

#### Memory Optimization

```yaml
model:
  load_in_4bit: true                  # 4-bit quantization
  dtype: "float16"                    # Use float16 if supported

training:
  gradient_accumulation_steps: 4      # Effective batch size = batch_size * steps
  max_grad_norm: 1.0                  # Gradient clipping
```

#### Speed Optimization

```yaml
training:
  dataloader_num_workers: 4           # Parallel data loading
  fp16: true                          # Mixed precision training
  bf16: false                         # Disable bfloat16 if not supported
```

## Contributing

### Adding New Task Types

1. **Create task directory structure**:

   ```
   pipelines/[new_task]/
   ├── __init__.py
   ├── data_processor.py
   ├── train.py
   └── inference.py

   scripts/[new_task]/
   ├── __init__.py
   ├── data_processor.py
   ├── train.py
   └── inference.py

   configs/[new_task]/
   └── example.yaml
   ```

2. **Implement base classes**:
   - Extend `BaseDataProcessor`
   - Extend `BaseTrainer`
   - Extend `BaseInference`

3. **Add configuration templates**:
   - Define task-specific parameters
   - Document all configuration options

4. **Update documentation**:
   - Add task description to README
   - Include usage examples
   - Document configuration parameters

### Code Style

- Follow PEP 8 guidelines
- Use type hints for all functions
- Include comprehensive docstrings
- Add unit tests for new functionality

### Testing

```bash
# Run all tests
python -m pytest tests/

# Run specific task tests
python -m pytest tests/[task_type]/

# Run with coverage
python -m pytest --cov=pipelines tests/
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Support

- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
- **Discussions**: [GitHub Discussions](https://github.com/your-repo/discussions)
- **Documentation**: [Wiki](https://github.com/your-repo/wiki)

---

**Happy fine-tuning! 🚀**