# Quick Reference Card

## Essential Parameters (Most Common)

### Data Source & Location

```yaml
data:
  source: "huggingface|custom"      # REQUIRED: Data source type
  dataset_name: "dataset/name"      # REQUIRED for huggingface
  data_path: "./path/to/file"       # REQUIRED for custom
  data_format: "jsonl|csv|json"     # REQUIRED for custom
```

### Field Mapping

```yaml
data:
  input_field: "text"                # REQUIRED: Input text field
  label_field: "label"               # REQUIRED for classification
  output_field: "styled_text"        # REQUIRED for styling
  instruction: "Style instruction"   # REQUIRED for styling
```

### Basic Processing

```yaml
data:
  max_samples: 1000                  # Limit total samples
  train_split: 0.8                   # Training ratio (0.0-1.0)
  validation_split: 0.1              # Validation ratio (0.0-1.0)
  test_split: 0.1                    # Test ratio (0.0-1.0)
  output_dir: "./output/path"        # Output directory
```

### Text Preprocessing

```yaml
data:
  clean_text: true                   # Clean/normalize text
  lowercase: true                    # Convert to lowercase
  min_length: 10                     # Minimum text length
  max_length: 512                    # Maximum text length
```

### Model & Training

```yaml
model:
  name: "bert-base-uncased"          # Model name
  max_length: 512                    # Max sequence length

training:
  num_epochs: 3                      # Training epochs
  batch_size: 16                     # Batch size
  learning_rate: 2e-5                # Learning rate
```

## Common Configurations by Task

### Classification

```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "dair-ai/emotion"
  input_field: "text"
  label_field: "label"
  output_format: "classification"
```

### Styling

```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite in formal style"
  output_format: "alpaca"
```

### Text Generation

```yaml
task:
  name: "completion"
  type: "text_generation"

data:
  source: "custom"
  data_path: "./prompts.jsonl"
  input_field: "prompt"
  output_field: "completion"
  output_format: "instruction"
```

## Quick Start Templates

### 1. HuggingFace Dataset

```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "your/dataset"
  input_field: "text"
  label_field: "label"
  max_samples: 1000
  output_dir: "./output"
```

### 2. Custom JSONL File

```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./your_data.jsonl"
  data_format: "jsonl"
  input_field: "source"
  output_field: "target"
  instruction: "Your style instruction"
  output_dir: "./output"
```

### 3. CSV File

```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "custom"
  data_path: "./your_data.csv"
  data_format: "csv"
  input_field: "text"
  label_field: "label"
  delimiter: ","
  output_dir: "./output"
```

## Parameter Ranges & Recommendations

### Split Ratios

- **Total must be ≤ 1.0**
- **Common**: train=0.8, val=0.1, test=0.1
- **Small datasets**: train=0.7, val=0.15, test=0.15

### Learning Rates

- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3
- **Start with**: 2e-5

### Batch Sizes

- **GPU Memory**: 8, 16, 32, 64
- **CPU**: 4, 8, 16
- **Start with**: 16

### Text Lengths

- **BERT**: 512 (max)
- **GPT-2**: 1024 (max)
- **T5**: 512 (max)
- **Start with**: 256

## Common Issues & Fixes

| Issue | Cause | Fix |
|-------|-------|-----|
| "File not found" | Wrong path | Check `data_path` and `output_dir` |
| "Memory error" | Batch too large | Reduce `batch_size` |
| "Split error" | Ratios > 1.0 | Ensure splits sum to ≤ 1.0 |
| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
| "Slow processing" | Text too long | Reduce `max_length` |

## Environment Variables

```bash
# Set cache directory
export HF_HOME="./cache"

# Set output directory
export OUTPUT_DIR="./results"

# Set log level
export LOG_LEVEL="INFO"
```
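
The split-ratio rule listed under Parameter Ranges & Recommendations (each ratio between 0.0 and 1.0, total at most 1.0) can be checked before a run. The following is a minimal Python sketch, not part of the tool itself: `validate_splits` is a hypothetical helper, and the small floating-point tolerance is an assumption added so that ratios such as 0.7 + 0.15 + 0.15 are not rejected because of rounding.

```python
def validate_splits(train_split, validation_split, test_split, tol=1e-9):
    """Return the unused fraction of the data, or raise ValueError.

    Hypothetical helper: it mirrors the train_split / validation_split /
    test_split keys of the `data` section, but is not the tool's own API.
    """
    splits = {
        "train_split": train_split,
        "validation_split": validation_split,
        "test_split": test_split,
    }

    # Each ratio must lie in the documented 0.0-1.0 range.
    for name, value in splits.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    # The ratios together must not exceed 1.0 (tolerance is an assumption).
    total = sum(splits.values())
    if total > 1.0 + tol:
        raise ValueError(f"split ratios sum to {total:.3f}, which exceeds 1.0")

    # Anything left over is simply not assigned to a split.
    return max(0.0, 1.0 - total)


if __name__ == "__main__":
    unused = validate_splits(0.8, 0.1, 0.1)
    print(f"unused fraction: {unused:.2f}")
```

A configuration such as train=0.8, val=0.2, test=0.1 would raise `ValueError` here, which corresponds to the "Split error" row in the Common Issues table.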