# Fine-Tune Task: NLP Pipeline Framework

A framework for fine-tuning NLP models with organized YAML configurations. It is laid out for four task types (classification, completion, styling, matching); classification is the only one implemented so far.

## Supported Tasks

The framework organizes its configurations around four NLP task types:

- **Classification**: Text classification, sentiment analysis, topic classification
- **Completion**: Text generation, code completion, story generation
- **Styling**: Style transfer, tone classification, writing style adaptation
- **Matching**: Semantic matching, entity matching, similarity scoring

### Current Implementation Status

- **Classification**: Fully implemented with emotion classification example
- **Completion**: Planned for future updates
- **Styling**: Planned for future updates
- **Matching**: Planned for future updates

**Note**: Currently, only the classification task is supported. The other tasks (completion, styling, matching) are planned for future updates.

## Project Structure

```
fine-tune-task/
├── configs/                     # YAML configuration files
│   ├── classification/          # Implemented
│   │   ├── emotion.yaml         # Emotion classification
│   │   └── custom.yaml          # Custom dataset
│   ├── completion/              # Planned for future updates
│   ├── styling/                 # Planned for future updates
│   └── matching/                # Planned for future updates
├── data/                        # Data directories
│   ├── raw/                     # Raw input data
│   │   ├── classification/      # Implemented
│   │   ├── completion/          # Planned for future updates
│   │   ├── styling/             # Planned for future updates
│   │   └── matching/            # Planned for future updates
│   └── processed/               # Processed data
│       ├── classification/      # Implemented
│       ├── completion/          # Planned for future updates
│       ├── styling/             # Planned for future updates
│       └── matching/            # Planned for future updates
├── pipelines/                   # Core pipeline scripts
│   ├── classification/          # Implemented
│   │   ├── data_processor.py    # Data processing
│   │   ├── train.py             # Training
│   │   └── inference.py         # Inference
│   ├── completion/              # Planned for future updates
│   ├── styling/                 # Planned for future updates
│   └── matching/                # Planned for future updates
├── scripts/                     # User-friendly scripts
│   ├── classification/          # Implemented
│   │   ├── data_processor.py    # Data processing script
│   │   ├── trainer.py           # Training script
│   │   └── inference.py         # Inference script
│   ├── completion/              # Planned for future updates
│   ├── styling/                 # Planned for future updates
│   └── matching/                # Planned for future updates
├── results/                     # Model outputs
│   ├── classification/          # Implemented
│   ├── completion/              # Planned for future updates
│   ├── styling/                 # Planned for future updates
│   └── matching/                # Planned for future updates
└── utils/                       # Shared utility modules
```

## Quick Start (Classification Task)

### 1. Setup Environment

```bash
# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.
```

### 2. Data Processing

```bash
# Process emotion dataset
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Process with custom parameters
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 1000

# Check output location
ls -la ./data/processed/classification/emotion/classification/
```

**Expected Output:**

```
Data processing completed successfully!
Data source: huggingface
Dataset: dair-ai/emotion
Total samples: 2999
Unique labels: 6
Split sizes: {'train': 1000, 'validation': 999, 'test': 1000}
Output directory: ./data/processed/classification/emotion
```

### 3. Model Training

```bash
# Train using processed data
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Train with custom parameters
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5 --batch-size 32

# Check model output
ls -la ./results/classification/emotion_model/
```

**Expected Output:**

```
Training completed successfully!
Model: bert-base-uncased
Data directory: ./data/processed/classification/emotion
Training for 3 epochs with batch size 16
Model saved to: ./results/classification/emotion_model
```

### 4. Model Inference

```bash
# Run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# File-based inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl
```

**Expected Output:**

```
Inference completed successfully!
Loading model from: ./results/classification/emotion_model
Predicted label: joy
Confidence: 0.8542
Top 3 predictions:
  - joy: 0.8542
  - love: 0.1234
  - surprise: 0.0224
```

## Adding New Tasks

To add a new task (e.g., completion, styling, matching), follow these steps:

### Step 1: Create Task Directory Structure

```bash
# Create task directories
mkdir -p configs/completion
mkdir -p data/raw/completion data/processed/completion
mkdir -p pipelines/completion
mkdir -p scripts/completion
mkdir -p results/completion
mkdir -p tasks/completion
mkdir -p models/completion
```

### Step 2: Create Task Configuration

```bash
# Create YAML configuration for new task
cat > configs/completion/text_generation.yaml << 'EOF'
# Text Generation Task Configuration
task:
  name: "completion"
  type: "text_generation"

# Data Processing Configuration
data:
  source: "huggingface"
  dataset_name: "your-dataset-name"
  output_dir: "./data/processed/completion/text_generation"
  max_samples: 1000
  # ... other data parameters

# Model Configuration
model:
  name: "gpt2"  # Different model for completion
  max_length: 1024
  # ... model parameters

# Training Configuration
training:
  num_epochs: 3
  batch_size: 8  # Smaller batch for generation
  learning_rate: 5e-5
  data_dir: "./data/processed/completion/text_generation"
  output_dir: "./results/completion/text_generation_model"

# Inference Configuration
inference:
  model_path: "./results/completion/text_generation_model"
  device: "auto"
  batch_size: 1  # Generation is typically one at a time
  max_length: 100
  temperature: 0.7
EOF
```

### Step 3: Create Pipeline Scripts

Copy the classification pipeline scripts and use them as templates:

```bash
# Copy classification scripts as templates
cp pipelines/classification/data_processor.py pipelines/completion/
cp pipelines/classification/train.py pipelines/completion/
cp pipelines/classification/inference.py pipelines/completion/

# Copy task scripts
cp scripts/classification/data_processor.py scripts/completion/
cp scripts/classification/trainer.py scripts/completion/
cp scripts/classification/inference.py scripts/completion/
```

### Step 4: Modify Pipeline Code

Update the pipeline scripts for your specific task:

1. **Data Processor** (`pipelines/completion/data_processor.py`):
   - Update data loading logic for completion datasets
   - Modify preprocessing for text generation
   - Adjust output format for completion tasks

2. **Trainer** (`pipelines/completion/train.py`):
   - Change model type to generation models (GPT, T5, etc.)
   - Update training loop for text generation
   - Modify evaluation metrics

3. **Inference** (`pipelines/completion/inference.py`):
   - Update inference for text generation
   - Add generation parameters (temperature, top-k, etc.)
   - Modify output format

### Step 5: Update Task Scripts

Modify the task scripts so that they call your new pipeline:

```python
# scripts/completion/data_processor.py
def run_with_yaml_config(config_path: str, **cli_overrides):
    cmd = [
        "python", "pipelines/completion/data_processor.py",  # Updated path
        "--config", config_path
    ]
    # ... rest of the function
```

### Step 6: Create Task-Specific Models

```bash
# Create model directory
mkdir -p models/completion

# Add task-specific model classes
cat > models/completion/text_generator.py << 'EOF'
from transformers import AutoModelForCausalLM, AutoTokenizer


class TextGenerator:
    def __init__(self, model_name):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def generate(self, prompt, max_length=100, temperature=0.7):
        # Minimal sampling-based sketch; adapt the generation settings to your task
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(
            **inputs,
            max_length=max_length,
            do_sample=True,  # temperature only takes effect when sampling
            temperature=temperature,
            pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 has no pad token
        )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
EOF
```

### Step 7: Test Your New Task

```bash
# Test data processing
python scripts/completion/data_processor.py --config configs/completion/text_generation.yaml

# Test training
python scripts/completion/trainer.py --config configs/completion/text_generation.yaml

# Test inference
python scripts/completion/inference.py --config configs/completion/text_generation.yaml --input-text "Once upon a time"
```

## YAML Configuration Guide

### Configuration Structure

Each YAML file is organized into five sections: task, data, model, training, and inference.

```yaml
# Task Configuration
task:
  name: "classification"            # or "completion", "styling", "matching"
  type: "sequence_classification"   # or "text_generation", "style_transfer", "semantic_matching"

# Data Processing Configuration
data:
  source: "huggingface"             # "huggingface" or "custom"
  dataset_name: "dair-ai/emotion"   # HuggingFace dataset name
  output_dir: "./data/processed/classification/emotion"
  max_samples: 1000                 # Limit dataset size
  # ... other data parameters

# Model Configuration
model:
  name: "bert-base-uncased"         # Model from HuggingFace Hub
  max_length: 512                   # Sequence length
  num_labels: 6                     # Number of classes

# Training Configuration
training:
  num_epochs: 3                     # Training epochs
  batch_size: 16                    # Batch size
  learning_rate: 2e-5               # Learning rate
  data_dir: "./data/processed/classification/emotion"
  output_dir: "./results/classification/emotion_model"

# Inference Configuration
inference:
  model_path: "./results/classification/emotion_model"
  device: "auto"                    # "auto", "cuda", "cpu"
  batch_size: 32                    # Inference batch size
  return_top_k: 3                   # Top K predictions
```

### Available Configuration Files

- `configs/classification/emotion.yaml` - Emotion classification with HuggingFace dataset
- `configs/classification/custom.yaml` - Custom dataset processing

## Usage Examples

### Data Processing Examples

```bash
# 1. Use YAML config only
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 500

# 3. Use CLI only (backward compatibility)
python scripts/classification/data_processor.py --data-source huggingface --dataset-name dair-ai/emotion

# 4. Run examples
python scripts/classification/data_processor.py examples
```

### Training Examples

```bash
# 1. Use YAML config only
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# 2. Override YAML values
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --num-epochs 5

# 3. Use CLI only
python scripts/classification/trainer.py --model-name bert-base-uncased --num-epochs 3

# 4. Run examples
python scripts/classification/trainer.py examples
```

### Inference Examples

```bash
# 1. Single text prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-text "I love this product!"

# 2. File-based prediction
python scripts/classification/inference.py --config configs/classification/emotion.yaml --input-file input.txt --output-file predictions.jsonl

# 3. Interactive mode
python scripts/classification/inference.py --config configs/classification/emotion.yaml

# 4. Run examples
python scripts/classification/inference.py examples
```

## Troubleshooting Common Errors

### 1. ModuleNotFoundError: No module named 'utils'

**Error:**

```
ModuleNotFoundError: No module named 'utils'
```

**Solution:**

```bash
# Set Python path (from the repository root) before running scripts
export PYTHONPATH=.
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml
```

### 2. Model Path Not Found

**Error:**

```
Model path not found: ./results/classification/emotion_model
```

**Solution:**

```bash
# Train the model first
python scripts/classification/trainer.py --config configs/classification/emotion.yaml

# Then run inference
python scripts/classification/inference.py --config configs/classification/emotion.yaml
```

### 3. Data Directory Not Found

**Error:**

```
Data directory not found: ./data/processed/classification/emotion
```

**Solution:**

```bash
# Process data first
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml

# Then train
python scripts/classification/trainer.py --config configs/classification/emotion.yaml
```

### 4. YAML Configuration Errors

**Error:**

```
data_processor.py: error: --data-source is required (either in YAML config or CLI)
```

**Solution:** Check your YAML file structure. The key under `data` must be `source`, not `data_source`:

```yaml
data:
  source: "huggingface"  # Not data_source
  dataset_name: "dair-ai/emotion"
```

### 5. HuggingFace Download Issues

**Error:**

```
KeyboardInterrupt during model download
```

**Solution:**

```bash
# Use smaller dataset for testing
python scripts/classification/data_processor.py --config configs/classification/emotion.yaml --max-samples 100

# Or keep downloads in a local cache directory so they can be reused
export HF_HOME=./cache
```

### 6. CUDA/GPU Issues

**Error:**

```
RuntimeError: CUDA out of memory
```

**Solution:**

```bash
# Reduce batch size
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --batch-size 8

# Or use CPU
python scripts/classification/trainer.py --config configs/classification/emotion.yaml --device cpu
```

## Monitoring and Logs

### Check Processing Status

```bash
# Check data processing output
ls -la ./data/processed/classification/emotion/classification/

# Check training output
ls -la ./results/classification/emotion_model/

# Check logs
tail -f logs/training.log
```

### Expected File Structure After Processing

```
./data/processed/classification/emotion/classification/
├── train.jsonl          # Training data
├── validation.jsonl     # Validation data
└── test.jsonl           # Test data

./results/classification/emotion_model/
├── config.json          # Model configuration
├── pytorch_model.bin    # Model weights
├── tokenizer.json       # Tokenizer
└── label_info.json      # Label mappings
```

## Workflow Summary

1. **Setup**: Install dependencies and set PYTHONPATH
2. **Data Processing**: Process raw data into organized splits
3. **Training**: Train model using processed data
4. **Inference**: Use trained model for predictions
5. **Monitoring**: Check logs and outputs for errors

## Creating Custom Configurations

### For New Datasets

1. Copy existing config:

   ```bash
   cp configs/classification/emotion.yaml configs/classification/my_dataset.yaml
   ```

2. Modify parameters:

   ```yaml
   data:
     source: "huggingface"
     dataset_name: "your-dataset-name"
     output_dir: "./data/processed/classification/my_dataset"
     # ... other parameters

   training:
     data_dir: "./data/processed/classification/my_dataset"
     output_dir: "./results/classification/my_dataset_model"
   ```

3. Run pipeline:

   ```bash
   python scripts/classification/data_processor.py --config configs/classification/my_dataset.yaml
   ```

### For Custom Data

1. Use custom config:

   ```yaml
   data:
     source: "custom"
     data_path: "./data/raw/my_data.jsonl"
     output_dir: "./data/processed/classification/my_custom_dataset"
   ```

2. Run processing:

   ```bash
   python scripts/classification/data_processor.py --config configs/classification/custom.yaml
   ```

## Best Practices

1. **Always check output directories** before running the next step
2. **Use small datasets for testing** before full runs
3. **Monitor logs** for errors and warnings
4. **Back up configurations** before major changes
5. **Use version control** for YAML files
6. **Test with CLI overrides** for quick experiments

## Support

For issues and questions:

1. Check the troubleshooting section above
2. Review logs in the output directories
3. Verify YAML configuration structure
4. Test with smaller datasets first

---

**Happy fine-tuning!**
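
## Appendix: How CLI Overrides Combine with YAML

The usage examples show CLI flags such as `--max-samples` and `--num-epochs` taking precedence over values in the YAML file, and the troubleshooting section shows an error when `data.source` is missing from both. The sketch below illustrates that precedence rule under stated assumptions; it is not the repository's actual implementation. The names `FLAG_TO_KEY`, `apply_cli_overrides`, and `require_data_source` are hypothetical, and the YAML file is represented as an already-parsed dictionary so the example runs with the standard library alone.

```python
from copy import deepcopy

# Hypothetical mapping from CLI flag names to (section, key) paths in the YAML config.
FLAG_TO_KEY = {
    "max_samples": ("data", "max_samples"),
    "data_source": ("data", "source"),
    "dataset_name": ("data", "dataset_name"),
    "num_epochs": ("training", "num_epochs"),
    "batch_size": ("training", "batch_size"),
}


def apply_cli_overrides(config: dict, cli_args: dict) -> dict:
    """Return a copy of config in which every CLI value that was set replaces the YAML value."""
    merged = deepcopy(config)  # leave the loaded YAML config untouched
    for flag, value in cli_args.items():
        if value is None:  # flag was not given on the command line
            continue
        if flag not in FLAG_TO_KEY:
            raise KeyError(f"unknown override: {flag}")
        section, key = FLAG_TO_KEY[flag]
        merged.setdefault(section, {})[key] = value
    return merged


def require_data_source(config: dict) -> None:
    """Fail when neither the YAML file nor the CLI supplied data.source."""
    if not config.get("data", {}).get("source"):
        raise ValueError("--data-source is required (either in YAML config or CLI)")


# A parsed config resembling configs/classification/emotion.yaml
yaml_config = {
    "data": {"source": "huggingface", "dataset_name": "dair-ai/emotion", "max_samples": 1000},
    "training": {"num_epochs": 3, "batch_size": 16},
}

# Equivalent of passing --max-samples 500 and leaving --num-epochs unset
merged = apply_cli_overrides(yaml_config, {"max_samples": 500, "num_epochs": None})
require_data_source(merged)
print(merged["data"]["max_samples"], merged["training"]["num_epochs"])  # prints: 500 3
```

Copying the config before merging keeps the values loaded from YAML available for logging or comparison, and treating `None` as "flag not set" matches how `argparse` reports options that have no default.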