# Fine-Tune Task: NLP Pipeline Framework

A comprehensive framework for fine-tuning NLP models with organized YAML configurations, supporting multiple tasks (classification, completion, styling, matching).

## Supported Tasks

This framework supports multiple NLP tasks with organized configurations:

- **Classification**: Text classification, sentiment analysis, topic classification
- **Completion**: Text generation, code completion, story generation
- **Styling**: Style transfer, tone classification, writing style adaptation
- **Matching**: Semantic matching, entity matching, similarity scoring

### Current Implementation Status

- **Classification**: ✅ Fully implemented with emotion classification example
- **Styling**: ✅ Fully implemented with style transfer and LoRA fine-tuning
- **Completion**: Planned for future updates
- **Matching**: Planned for future updates

**Note**: Classification and styling tasks are fully supported. Other tasks (completion, matching) are planned for future updates.

## Project Structure

```
fine-tune-task/
├── configs/                    # YAML configuration files
│   ├── classification/         # ✅ Implemented
│   │   ├── emotion.yaml        # Emotion classification
│   │   └── custom.yaml         # Custom dataset
│   ├── styling/                # ✅ Implemented
│   │   └── formal.yaml         # Formal style transfer
│   ├── completion/             # Planned for future updates
│   └── matching/               # Planned for future updates
├── data/                       # Data directories
│   ├── raw/                    # Raw input data
│   │   ├── classification/     # ✅ Implemented
│   │   ├── styling/            # ✅ Implemented
│   │   ├── completion/         # Planned for future updates
│   │   └── matching/           # Planned for future updates
│   └── processed/              # Processed data
│       ├── classification/     # ✅ Implemented
│       ├── styling/            # ✅ Implemented
│       ├── completion/         # Planned for future updates
│       └── matching/           # Planned for future updates
├── pipelines/                  # Core pipeline scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py   # Data processing
│   │   ├── train.py            # Training
│   │   └── inference.py        # Inference
│   ├── styling/                # ✅ Implemented
│   │   ├── data_processor.py   # Style data processing
│   │   ├── train.py            # LoRA fine-tuning
│   │   └── inference.py        # Style transfer inference
│   ├── completion/             # Planned for future updates
│   └── matching/               # Planned for future updates
├── scripts/                    # User-friendly scripts
│   ├── classification/         # ✅ Implemented
│   │   ├── data_processor.py   # Data processing script
│   │   ├── trainer.py          # Training script
│   │   └── inference.py        # Inference script
│   ├── styling/                # ✅ Implemented
│   │   ├── data_processor.py   # Style data processing script
│   │   ├── train.py            # Training script
│   │   └── inference.py        # Inference script
│   ├── completion/             # Planned for future updates
│   └── matching/               # Planned for future updates
├── results/                    # Model outputs
│   ├── classification/         # ✅ Implemented
│   ├── styling/                # ✅ Implemented
│   ├── completion/             # Planned for future updates
│   └── matching/               # Planned for future updates
└── utils/                      # Shared utility modules
```

## Quick Start (Classification Task)

### 1. Setup Environment

```bash
# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.
```

### 2. Data Processing

```bash
# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000

# Check output location
ls -la ./data/processed/classification/emotion/classification/
```

**Expected Output:**

```
Data processing completed successfully!
Data source: huggingface
Dataset: dair-ai/emotion
Total samples: 2999
Unique labels: 6
Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
Output directory: ./data/processed/classification/emotion
```

### 3. Model Training

```bash
# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32

# Check model output
ls -la ./results/classification/emotion_model/
```

**Expected Output:**

```
Training completed successfully!
Model: bert-base-uncased
Data directory: ./data/processed/classification/emotion
Training for 3 epochs with batch size 16
Model saved to: ./results/classification/emotion_model
```

### 4. Model Inference

```bash
# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
```

**Expected Output:**

```
Inference completed successfully!
Loading model from: ./results/classification/emotion_model
Predicted label: joy
Confidence: 0.8542
Top 3 predictions:
  - joy: 0.8542
  - love: 0.1234
  - surprise: 0.0224
```

## Quick Start (Styling Task)

### 1. Setup Environment

```bash
# Install dependencies (including unsloth for styling)
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.
```

### 2. Data Processing

```bash
# Process style transfer dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml

# Create HuggingFace dataset
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset

# Check output location
ls -la ./data/processed/styling/formal/
```

**Expected Output:**

```
Styling data processing completed successfully!
Data source: custom
Data file: ./data/raw/styling/sample_formal.jsonl
Total samples: 5
Split sizes: {'train': 3, 'validation': 1, 'test': 1}
Output directory: ./data/processed/styling/formal
Style instruction: Rewrite the following text in a formal style
```

### 3. Model Training

```bash
# Train using processed data (automatically loads from YAML output_dir)
python scripts/styling/train.py example

# Custom training
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 3 --batch-size 4

# Check model output
ls -la ./models/styling/
```

**Expected Output:**

```
Training completed successfully!
Model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
Dataset: Loaded from ./data/processed/styling/formal
Training for 3 epochs with batch size 4
Model saved to: ./models/styling
```

### 4. Model Inference

```bash
# Single text style transfer
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"

# Batch processing
python scripts/styling/inference.py batch

# Interactive mode
python scripts/styling/inference.py infer --config configs/styling/formal.yaml
```

**Expected Output:**

```
Inference completed successfully!
Input: Hey, what's up?
Output: Hello, how are you doing?
Style: Formal
```

## Adding New Tasks

To add a new task (e.g., completion or matching), follow the steps below. The styling task, which is already implemented, serves as a reference.

### Example: Styling Task (Already Implemented)

The styling task demonstrates a complete implementation:

1. **Task Directory Structure** ✅

   ```bash
   configs/styling/          # YAML configurations
   data/raw/styling/         # Raw style transfer data
   data/processed/styling/   # Processed data
   pipelines/styling/        # Core pipeline scripts
   scripts/styling/          # User-friendly scripts
   models/styling/           # Trained models
   ```

2. **Pipeline Components** ✅
   - **Data Processor**: Handles style transfer datasets with instruction/input/output format
   - **Trainer**: LoRA fine-tuning using Unsloth for efficiency
   - **Inference**: Style transfer with streaming and batch processing

3. **Key Features** ✅
   - Automatic EOS token handling: `text + tokenizer.eos_token`
   - Dataset mapping: `dataset.map(formatting_prompts_func, batched=True)`
   - YAML integration: Uses `data.output_dir` for automatic dataset loading
   - HuggingFace dataset export and loading

### For Other Tasks (completion, matching)

1. **Create Task Directory Structure**

   ```bash
   # Create task directories
   mkdir -p configs/completion
   mkdir -p data/raw/completion data/processed/completion
   mkdir -p pipelines/completion
   mkdir -p scripts/completion
   mkdir -p results/completion
   mkdir -p tasks/completion
   mkdir -p models/completion
   ```

2. **Create Task Configuration**

   ```bash
   # Create YAML configuration for new task
   cat > configs/completion/text_generation.yaml << 'EOF'
   # Text Generation Task Configuration
   task:
     name: "completion"
     type: "text_generation"

   # Data Processing Configuration
   data:
     source: "huggingface"
     dataset_name: "your-dataset-name"
     output_dir: "./data/processed/completion/text_generation"
     max_samples: 1000
     # ... other data parameters

   # Model Configuration
   model:
     name: "gpt2"  # Different model for completion
     max_length: 1024
     # ... model parameters

   # Training Configuration
   training:
     num_epochs: 3
     batch_size: 8  # Smaller batch for generation
     learning_rate: 5e-5
     data_dir: "./data/processed/completion/text_generation"
     output_dir: "./results/completion/text_generation_model"

   # Inference Configuration
   inference:
     model_path: "./results/completion/text_generation_model"
     device: "auto"
     batch_size: 1  # Generation is typically one at a time
     max_length: 100
     temperature: 0.7
   EOF
   ```

3. **Create Pipeline Scripts**

   Copy and modify the classification pipeline scripts:

   ```bash
   # Copy classification scripts as templates
   cp pipelines/classification/data_processor.py pipelines/completion/
   cp pipelines/classification/train.py pipelines/completion/
   cp pipelines/classification/inference.py pipelines/completion/

   # Copy task scripts
   cp scripts/classification/data_processor.py scripts/completion/
   cp scripts/classification/trainer.py scripts/completion/
   cp scripts/classification/inference.py scripts/completion/
   ```

4. **Modify Pipeline Code**

   Update the pipeline scripts for your specific task:

   1. **Data Processor** (`pipelines/completion/data_processor.py`):
      - Update data loading logic for completion datasets
      - Modify preprocessing for text generation
      - Adjust output format for completion tasks
   2. **Trainer** (`pipelines/completion/train.py`):
      - Change model type to generation models (GPT, T5, etc.)
      - Update training loop for text generation
      - Modify evaluation metrics
   3. **Inference** (`pipelines/completion/inference.py`):
      - Update inference for text generation
      - Add generation parameters (temperature, top-k, etc.)
      - Modify output format

5. **Update Task Scripts**

   Modify the task scripts to use your new pipeline:

   ```python
   # scripts/completion/data_processor.py
   def run_with_yaml_config(config_path: str, **cli_overrides):
       cmd = [
           "python", "pipelines/completion/data_processor.py",  # Updated path
           "--config", config_path
       ]
       # ... rest of the function
   ```

6. **Create Task-Specific Models**

   ```bash
   # Create model directory
   mkdir -p models/completion

   # Add task-specific model classes
   cat > models/completion/text_generator.py << 'EOF'
   from transformers import AutoModelForCausalLM, AutoTokenizer


   class TextGenerator:
       def __init__(self, model_name):
           self.model = AutoModelForCausalLM.from_pretrained(model_name)
           self.tokenizer = AutoTokenizer.from_pretrained(model_name)

       def generate(self, prompt, max_length=100, temperature=0.7):
           # Minimal sketch: sample a continuation and decode it back to text.
           # max_length counts prompt tokens plus generated tokens.
           inputs = self.tokenizer(prompt, return_tensors="pt")
           output_ids = self.model.generate(
               **inputs,
               max_length=max_length,
               do_sample=True,
               temperature=temperature,
               pad_token_id=self.tokenizer.eos_token_id,
           )
           return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
   EOF
   ```

7. **Test Your New Task**

   ```bash
   # Test data processing
   python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml

   # Test training
   python scripts/completion/trainer.py --config configs/completion/text_generation.yaml

   # Test inference
   python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
   ```

## YAML Configuration Guide

### Configuration Structure

Each YAML file is organized into clear sections:

```yaml
# Task Configuration
task:
  name: "classification"           # or "completion", "styling", "matching"
  type: "sequence_classification"  # or "text_generation", "style_transfer", "semantic_matching"

# Data Processing Configuration
data:
  source: "huggingface"            # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion"  # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000                # Limit dataset size
  # ... other data parameters

# Model Configuration
model:
  name: "bert-base-uncased"        # Model from HuggingFace Hub
  max_length: 512                  # Sequence length
  num_labels: 6                    # Number of classes

# Training Configuration
training:
  num_epochs: 3                    # Training epochs
  batch_size: 16                   # Batch size
  learning_rate: 2e-5              # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"

# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto"                   # "auto", "cuda", "cpu"
  batch_size: 32                   # Inference batch size
  return_top_k: 3                  # Top K predictions
```

### Styling Configuration Example

```yaml
# Styling Task Configuration
task:
  name: "styling"
  type: "style_transfer"

# Data Processing Configuration
data:
  source: "custom"
  data_path: "./data/raw/styling/sample_formal.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite the following text in a formal style"
  output_dir: "./data/processed/styling/formal"
  output_format: "alpaca"

# Model Configuration
model:
  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
  training_max_seq_length: 2048
  training_load_in_4bit: true

# Training Configuration
training:
  num_epochs: 3
  batch_size: 2
  learning_rate: 2e-4
  weight_decay: 0.01

# Inference Configuration
inference:
  batch_size: 1
  max_new_tokens: 128
  temperature: 0.8
```

### Available Configuration Files

- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
- `configs/classification/custom.yaml` - Custom dataset processing
- `configs/styling/formal.yaml` - Formal style transfer with LoRA fine-tuning

## Usage Examples

### Data Processing Examples

```bash
# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500

# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion

# 4. Run examples
python scripts/classification/data_processor.py examples
```

### Training Examples

```bash
# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5

# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3

# 4. Run examples
python scripts/classification/trainer.py examples
```

### Inference Examples

```bash
# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml

# 4. Run examples
python scripts/classification/inference.py examples
```

### Styling Examples

```bash
# 1. Data Processing
python scripts/styling/data_processor.py --config configs/styling/formal.yaml
python scripts/styling/data_processor.py --config configs/styling/formal.yaml --create-hf-dataset

# 2. Training
python scripts/styling/train.py example
python scripts/styling/train.py train --config configs/styling/formal.yaml --epochs 2

# 3. Inference
python scripts/styling/inference.py infer --config configs/styling/formal.yaml --text "Hey, what's up?"
python scripts/styling/inference.py batch
python scripts/styling/inference.py infer --config configs/styling/formal.yaml

# 4. Run examples
python scripts/styling/data_processor.py examples
python scripts/styling/train.py features
python scripts/styling/inference.py features
```

## Troubleshooting Common Errors

### 1. ModuleNotFoundError: No module named 'utils'

**Error:**

```
ModuleNotFoundError: No module named 'utils'
```

**Solution:**

```bash
# Set Python path before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
```

### 2. Model Path Not Found

**Error:**

```
Model path not found: ./results/classification/emotion_model
```

**Solution:**

```bash
# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml
```

### 3. Data Directory Not Found

**Error:**

```
Data directory not found: ./data/processed/classification/emotion
```

**Solution:**

```bash
# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
```

### 4. YAML Configuration Errors

**Error:**

```
data_processor.py: error: --data-source is required (either in YAML config or CLI)
```

**Solution:** Check your YAML file structure. The data source key is `source` under the `data` section, not `data_source`:

```yaml
data:
  source: "huggingface"  # Not data_source
  dataset_name: "dair-ai/emotion"
```

### 5. HuggingFace Download Issues

**Error:**

```
KeyboardInterrupt during model download
```

**Solution:**

```bash
# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100

# Or keep downloads in a local cache directory so they are reused across runs
export HF_HOME=./cache
```

### 6. CUDA/GPU Issues

**Error:**

```
RuntimeError: CUDA out of memory
```

**Solution:**

```bash
# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8

# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
```

## Monitoring and Logs

### Check Processing Status

```bash
# Check data processing output
ls -la ./data/processed/classification/emotion/classification/

# Check training output
ls -la ./results/classification/emotion_model/

# Check logs
tail -f logs/training.log
```

### Expected File Structure After Processing

```
./data/processed/classification/emotion/classification/
├── train.jsonl          # Training data
├── validation.jsonl     # Validation data
└── test.jsonl           # Test data

./results/classification/emotion_model/
├── config.json          # Model configuration
├── pytorch_model.bin    # Model weights
├── tokenizer.json       # Tokenizer
└── label_info.json      # Label mappings
```

## Workflow Summary

### Classification Task

1. **Setup**: Install dependencies and set PYTHONPATH
2. **Data Processing**: Process raw data into organized splits
3. **Training**: Train model using processed data
4. **Inference**: Use trained model for predictions
5. **Monitoring**: Check logs and outputs for errors

### Styling Task

1. **Setup**: Install dependencies (including unsloth) and set PYTHONPATH
2. **Data Processing**: Process style transfer data with instruction/input/output format
3. **Training**: LoRA fine-tuning using Unsloth for efficient style transfer
4. **Inference**: Style transfer with streaming and batch processing
5. **Monitoring**: Check training logs and model outputs

## Creating Custom Configurations

### For New Datasets

1. Copy existing config:

   ```bash
   cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
   ```

2. Modify parameters:

   ```yaml
   data:
     source: "huggingface"
     dataset_name: "your-dataset-name"
     output_dir: "./data/processed/classification/my_dataset"
     # ... other parameters

   training:
     data_dir: "./data/processed/classification/my_dataset"
     output_dir: "./results/classification/my_dataset_model"
   ```

3. Run pipeline:

   ```bash
   python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
   ```

### For Custom Data

1. Use custom config:

   ```yaml
   data:
     source: "custom"
     data_path: "./data/raw/my_data.jsonl"
     output_dir: "./data/processed/classification/my_custom_dataset"
   ```

2. Run processing:

   ```bash
   python scripts/classification/data_processor.py --config configs/classification/custom.yaml
   ```

## Best Practices

1. **Always check output directories** before running the next step
2. **Use small datasets for testing** before full runs
3. **Monitor logs** for errors and warnings
4. **Back up configurations** before major changes
5. **Use version control** for YAML files
6. **Test with CLI overrides** for quick experiments

## Support

For issues and questions:

1. Check the troubleshooting section above
2. Review logs in the output directories
3. Verify YAML configuration structure
4. Test with smaller datasets first

---

**Happy fine-tuning!**
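
## Appendix: Checking a Config Before a Run

The troubleshooting and support sections both recommend verifying the YAML structure. The sketch below shows one way such a check might look. It is not part of the framework: the function name `validate_config` is hypothetical, and the set of sections treated as required is an assumption drawn from the classification example in this README (the styling config, for instance, uses different keys under `model`). It works on an already-parsed dictionary; in practice the file would first be loaded with something like PyYAML's `yaml.safe_load`.

```python
from typing import Any, Dict, List

# Section names follow the YAML examples in this README. Whether the real
# scripts require all of them is an assumption made for this sketch.
REQUIRED_SECTIONS = ("task", "data", "model", "training", "inference")


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems: List[str] = []

    # Every top-level section should be present and be a mapping.
    for section in REQUIRED_SECTIONS:
        if not isinstance(config.get(section), dict):
            problems.append(f"missing or non-mapping section: {section}")

    data = config.get("data")
    if isinstance(data, dict):
        # The troubleshooting section notes that the key is `source`, not `data_source`.
        if "data_source" in data:
            problems.append("data.data_source found; the key should be data.source")

        source = data.get("source")
        if source not in ("huggingface", "custom"):
            problems.append(f"data.source must be 'huggingface' or 'custom', got {source!r}")
        elif source == "huggingface" and "dataset_name" not in data:
            problems.append("data.dataset_name is expected when data.source is 'huggingface'")
        elif source == "custom" and "data_path" not in data:
            problems.append("data.data_path is expected when data.source is 'custom'")

    return problems


# A config shaped like configs/classification/emotion.yaml (values abbreviated).
good = {
    "task": {"name": "classification"},
    "data": {"source": "huggingface", "dataset_name": "dair-ai/emotion"},
    "model": {"name": "bert-base-uncased"},
    "training": {"num_epochs": 3},
    "inference": {"device": "auto"},
}

# The same config with the misnamed key from troubleshooting item 4.
bad = dict(good, data={"data_source": "huggingface"})

print(validate_config(good))  # -> []
for problem in validate_config(bad):
    print("-", problem)
```

Returning a list of messages rather than raising on the first problem lets a single pass report everything wrong with a file, which is convenient when a config has been copied and edited by hand.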