# Configuration Files Documentation

This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.

## Configuration Structure

All configuration files follow a consistent structure organized into these main sections:

### 1. Task Configuration

```yaml
task:
  name: "task_type"        # Task type: classification, completion, styling, matching
  type: "specific_type"    # Specific model/task type
```

**Available Task Types:**
- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
- **completion**: Text generation and completion tasks
- **styling**: Style transfer and text transformation tasks
- **matching**: Semantic matching and similarity tasks

### 2. Data Processing Configuration

```yaml
data:
  # Data Source
  source: "huggingface|custom"     # Where to get data from

  # Data Location
  dataset_name: "dataset/name"     # HuggingFace dataset name (for huggingface source)
  data_path: "./path/to/file"      # Path to custom data file (for custom source)
  data_format: "jsonl|csv|json"    # File format for custom data

  # Field Mapping
  input_field: "text"              # Field containing input text
  output_field: "styled_text"      # Field containing output (for styling)
  label_field: "label"             # Field containing labels (for classification)
  id_field: "id"                   # Optional ID field for tracking

  # Processing Parameters
  max_samples: 1000                # Maximum samples to process
  train_split: 0.8                 # Training split ratio
  validation_split: 0.1            # Validation split ratio
  test_split: 0.1                  # Test split ratio

  # Text Preprocessing
  clean_text: true                 # Clean and normalize text
  remove_special_chars: false      # Remove special characters
  lowercase: true                  # Convert to lowercase
  min_length: 10                   # Minimum text length
  max_length: 1000                 # Maximum text length

  # Output Configuration
  output_format: "format_type"     # Output format
  output_dir: "./output/path"      # Output directory
```

**Data Source Types:**
- **huggingface**: Use datasets from HuggingFace Hub
- **custom**: Use local files (JSONL, CSV, JSON)

**Output Formats:**
- **classification**: Raw classification format
- **instruction**: Instruction-following format
- **conversation**: Conversational format
- **qa**: Question-answer format
- **styling**: Raw styling format
- **alpaca**: Alpaca instruction format

### 3. Model Configuration

```yaml
model:
  name: "model_name"    # Model from HuggingFace Hub
  max_length: 512       # Maximum sequence length
  num_labels: 6         # Number of labels (for classification)
```

**Recommended Models by Task:**
- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
- **Styling**: `t5-base`, `gpt2-medium`
- **Completion**: `gpt2-medium`, `gpt2-large`
- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`

### 4. Training Configuration

```yaml
training:
  num_epochs: 3                  # Number of training epochs
  batch_size: 16                 # Training batch size
  learning_rate: 2e-5            # Learning rate
  weight_decay: 0.01             # Weight decay
  lr_scheduler_type: "linear"    # Learning rate scheduler
  warmup_ratio: 0.1              # Warmup ratio
  data_dir: "./data/path"        # Training data directory
  output_dir: "./model/output"   # Model output directory
```

**Learning Rate Guidelines:**
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3

**Scheduler Types:**
- **linear**: Linear decay
- **cosine**: Cosine annealing
- **polynomial**: Polynomial decay

### 5. Inference Configuration

```yaml
inference:
  model_path: "./model/path"    # Path to saved model
  device: "auto"                # Device to use
  batch_size: 32                # Inference batch size
  return_probabilities: true    # Return probabilities
  return_top_k: 3               # Return top K predictions
  max_new_tokens: 128           # Max tokens to generate
  temperature: 0.8              # Sampling temperature
```

**Device Options:**
- **auto**: Automatically detect best device
- **cuda**: Use GPU if available
- **cpu**: Force CPU usage

**Temperature Guidelines:**
- **0.0**: Deterministic (always same output)
- **0.7-0.9**: Balanced creativity
- **1.0+**: More random/creative

## Task-Specific Parameters

### Classification Tasks

```yaml
data:
  label_encoding: "auto|numeric|string"    # How to encode labels
  multilabel: false                        # Multi-label vs single-label
  label_separator: ","                     # Separator for multi-label
```

### Styling Tasks

```yaml
data:
  instruction: "Style instruction text"    # The style instruction
```

### Completion Tasks

```yaml
data:
  prompt_template: "template"    # Prompt template
  completion_length: 100         # Target completion length
```

## Advanced Configuration

### HuggingFace Specific

```yaml
data:
  hf_split: "train"            # Dataset split to use
  hf_cache_dir: "./cache"      # Cache directory
  test_split_from: "train"     # Source for test split
  val_split_from: "train"      # Source for validation split
```

### Custom Data Specific

```yaml
data:
  encoding: "utf-8"    # File encoding
  delimiter: ","       # CSV delimiter
```

## Usage Examples

### Basic Usage

```bash
# Use YAML configuration
python scripts/task_type/data_processor.py --config configs/task_type/config.yaml

# Override specific parameters
python scripts/task_type/data_processor.py \
  --config configs/task_type/config.yaml \
  --max-samples 1000 \
  --learning-rate 3e-5
```

### Creating Custom Configurations

1. Copy an existing config file
2. Modify parameters for your specific use case
3. Update paths and model names
4. Test with a small dataset first

## Best Practices

1. **Start with Defaults**: Use default values and adjust based on results
2. **Validate Paths**: Ensure all file paths are correct and accessible
3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
4. **Test Incrementally**: Test with small datasets before full processing
5. **Version Control**: Keep configurations in version control for reproducibility

## Troubleshooting

### Common Issues:

- **File Not Found**: Check `data_path` and `output_dir` paths
- **Memory Errors**: Reduce `batch_size` or `max_length`
- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
- **Split Errors**: Ensure split ratios sum to ≤ 1.0

### Getting Help:

- Check the script help: `python script.py --help`
- Review the pipeline logs for detailed error messages
- Verify YAML syntax and parameter values
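
Several of the checks listed under Troubleshooting can be automated before a run starts. The following is a minimal sketch, not part of the pipeline scripts: `check_config` is a hypothetical helper, and it assumes the configuration has already been loaded into a plain Python dictionary whose keys match the `data` section described in this document.

```python
def check_config(config):
    """Return a list of human-readable problems found in a loaded config dict.

    Hypothetical helper for illustration; the key names follow the `data`
    section documented above. Values are assumed to be numeric where the
    documentation shows numbers.
    """
    problems = []
    data = config.get("data", {})

    # Each split ratio must lie in [0, 1]; missing ratios are treated as 0.
    split_keys = ("train_split", "validation_split", "test_split")
    splits = {key: data.get(key, 0.0) for key in split_keys}
    for name, value in splits.items():
        if not 0.0 <= value <= 1.0:
            problems.append(f"data.{name}={value} is outside [0, 1]")

    # Together the ratios must not exceed 1.0 (small tolerance for float error).
    total = sum(splits.values())
    if total > 1.0 + 1e-9:
        problems.append(f"split ratios sum to {total:.3f}, which exceeds 1.0")

    # The text length bounds must be ordered when both are present.
    min_len = data.get("min_length")
    max_len = data.get("max_length")
    if min_len is not None and max_len is not None and min_len > max_len:
        problems.append(f"data.min_length={min_len} exceeds data.max_length={max_len}")

    # Each data source needs its matching location field.
    source = data.get("source")
    if source == "custom" and not data.get("data_path"):
        problems.append("data.source is 'custom' but data.data_path is missing")
    if source == "huggingface" and not data.get("dataset_name"):
        problems.append("data.source is 'huggingface' but data.dataset_name is missing")

    return problems


if __name__ == "__main__":
    example = {
        "data": {
            "source": "custom",
            "train_split": 0.8,
            "validation_split": 0.2,
            "test_split": 0.1,
        }
    }
    for problem in check_config(example):
        print("config problem:", problem)
```

If PyYAML is installed, the dictionary can be obtained with `yaml.safe_load` on the opened config file. Returning a list of messages rather than raising on the first failure is a design choice that lets every problem be reported in one pass.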