Configuration Files Documentation
This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.
Configuration Structure
All configuration files follow a consistent structure organized into these main sections:
1. Task Configuration
task:
  name: "task_type" # Task type: classification, completion, styling, matching
  type: "specific_type" # Specific model/task type
Available Task Types:
- classification: Text classification tasks (emotion, sentiment, topic, etc.)
- completion: Text generation and completion tasks
- styling: Style transfer and text transformation tasks
- matching: Semantic matching and similarity tasks
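A minimal sketch of how a script might check task.name against the four task types above. It assumes the YAML file has already been loaded into a plain dict; validate_task and VALID_TASKS are hypothetical names, not part of the pipeline.

```python
# Hypothetical helper: the names below are illustrative, not the pipeline's API.
VALID_TASKS = {"classification", "completion", "styling", "matching"}


def validate_task(config: dict) -> str:
    """Return the task name if it is one of the documented task types."""
    task = config.get("task") or {}
    name = task.get("name")
    if name not in VALID_TASKS:
        raise ValueError(
            f"Unknown task name {name!r}; expected one of {sorted(VALID_TASKS)}"
        )
    return name


if __name__ == "__main__":
    print(validate_task({"task": {"name": "styling", "type": "specific_type"}}))
```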
2. Data Processing Configuration
data:
  # Data Source
  source: "huggingface|custom" # Where to get data from
  # Data Location
  dataset_name: "dataset/name" # HuggingFace dataset name (for huggingface source)
  data_path: "./path/to/file" # Path to custom data file (for custom source)
  data_format: "jsonl|csv|json" # File format for custom data
  # Field Mapping
  input_field: "text" # Field containing input text
  output_field: "styled_text" # Field containing output (for styling)
  label_field: "label" # Field containing labels (for classification)
  id_field: "id" # Optional ID field for tracking
  # Processing Parameters
  max_samples: 1000 # Maximum samples to process
  train_split: 0.8 # Training split ratio
  validation_split: 0.1 # Validation split ratio
  test_split: 0.1 # Test split ratio
  # Text Preprocessing
  clean_text: true # Clean and normalize text
  remove_special_chars: false # Remove special characters
  lowercase: true # Convert to lowercase
  min_length: 10 # Minimum text length
  max_length: 1000 # Maximum text length
  # Output Configuration
  output_format: "format_type" # Output format
  output_dir: "./output/path" # Output directory
Data Source Types:
- huggingface: Use datasets from HuggingFace Hub
- custom: Use local files (JSONL, CSV, JSON)
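The two source types need different keys: dataset_name for huggingface, data_path and data_format for custom. A rough sketch of that check follows, assuming the data section is available as a dict; check_data_source is a hypothetical helper and the real scripts may validate differently.

```python
# Hypothetical helper: reports which keys are missing for the chosen source.
CUSTOM_FORMATS = {"jsonl", "csv", "json"}


def check_data_source(data_cfg: dict) -> list[str]:
    """Return a list of human-readable problems (empty if the section looks fine)."""
    problems = []
    source = data_cfg.get("source")
    if source == "huggingface":
        if not data_cfg.get("dataset_name"):
            problems.append("dataset_name is required when source is 'huggingface'")
    elif source == "custom":
        if not data_cfg.get("data_path"):
            problems.append("data_path is required when source is 'custom'")
        if data_cfg.get("data_format") not in CUSTOM_FORMATS:
            problems.append("data_format must be one of jsonl, csv, json")
    else:
        problems.append(f"source must be 'huggingface' or 'custom', got {source!r}")
    return problems


if __name__ == "__main__":
    print(check_data_source({"source": "custom", "data_path": "./path/to/file"}))
```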
Output Formats:
- classification: Raw classification format
- instruction: Instruction-following format
- conversation: Conversational format
- qa: Question-answer format
- styling: Raw styling format
- alpaca: Alpaca instruction format
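As an illustration of one output format, Alpaca-style records conventionally carry instruction, input, and output keys. The sketch below assumes a styling task where the instruction comes from data.instruction and the text fields come from input_field and output_field; to_alpaca is a hypothetical name and the pipeline's actual mapping may differ.

```python
# Hypothetical converter: maps one raw record to an Alpaca-style dict.
def to_alpaca(record: dict, data_cfg: dict) -> dict:
    """Build an instruction/input/output record from the configured fields."""
    input_field = data_cfg.get("input_field", "text")
    output_field = data_cfg.get("output_field", "styled_text")
    return {
        "instruction": data_cfg.get("instruction", ""),
        "input": record[input_field],
        "output": record[output_field],
    }


if __name__ == "__main__":
    cfg = {"instruction": "Style instruction text"}
    row = {"text": "plain sentence", "styled_text": "styled sentence"}
    print(to_alpaca(row, cfg))
```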
3. Model Configuration
model:
  name: "model_name" # Model from HuggingFace Hub
  max_length: 512 # Maximum sequence length
  num_labels: 6 # Number of labels (for classification)
Recommended Models by Task:
- Classification: bert-base-uncased, distilbert-base-uncased
- Styling: t5-base, gpt2-medium
- Completion: gpt2-medium, gpt2-large
- Matching: sentence-transformers/all-MiniLM-L6-v2
4. Training Configuration
training:
  num_epochs: 3 # Number of training epochs
  batch_size: 16 # Training batch size
  learning_rate: 2e-5 # Learning rate
  weight_decay: 0.01 # Weight decay
  lr_scheduler_type: "linear" # Learning rate scheduler
  warmup_ratio: 0.1 # Warmup ratio
  data_dir: "./data/path" # Training data directory
  output_dir: "./model/output" # Model output directory
Learning Rate Guidelines:
- Fine-tuning: 1e-5 to 5e-5
- Training from scratch: 1e-4 to 1e-3
Scheduler Types:
- linear: Linear decay
- cosine: Cosine annealing
- polynomial: Polynomial decay
5. Inference Configuration
inference:
  model_path: "./model/path" # Path to saved model
  device: "auto" # Device to use
  batch_size: 32 # Inference batch size
  return_probabilities: true # Return probabilities
  return_top_k: 3 # Return top K predictions
  max_new_tokens: 128 # Max tokens to generate
  temperature: 0.8 # Sampling temperature
Device Options:
- auto: Automatically detect best device
- cuda: Use GPU if available
- cpu: Force CPU usage
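One plausible way to resolve the device option is sketched below. It assumes PyTorch is the backend and falls back to CPU when PyTorch or a GPU is unavailable; resolve_device is a hypothetical helper, and the inference script may handle a missing GPU differently (for example by raising an error).

```python
# Hypothetical resolver for the inference.device setting.
def resolve_device(requested: str = "auto") -> str:
    """Map "auto", "cuda", or "cpu" to the device string that will be used."""
    if requested not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"Unsupported device option: {requested!r}")
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # optional dependency; assumed backend
        has_cuda = torch.cuda.is_available()
    except ImportError:
        has_cuda = False
    # Both "auto" and "cuda" use the GPU only if one is actually available.
    return "cuda" if has_cuda else "cpu"


if __name__ == "__main__":
    print(resolve_device("auto"))
```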
Temperature Guidelines:
- 0.0: Deterministic (always same output)
- 0.7-0.9: Balanced creativity
- 1.0+: More random/creative
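The effect of temperature can be seen by scaling logits before a softmax. The sketch below is a generic illustration, not the inference script's code; softmax_with_temperature is a hypothetical name, and treating 0.0 as "pick the highest-scoring token" is an assumption about how the deterministic case is handled.

```python
import math


# Hypothetical illustration of how temperature reshapes token probabilities.
def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Return probabilities for the given logits at the given temperature."""
    if temperature <= 0:
        # Treat 0.0 as greedy decoding: all probability on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    for t in (0.0, 0.8, 1.5):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, which is why outputs become more repeatable; higher temperatures flatten the distribution.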
Task-Specific Parameters
Classification Tasks
data:
  label_encoding: "auto|numeric|string" # How to encode labels
  multilabel: false # Multi-label vs single-label
  label_separator: "," # Separator for multi-label
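A small sketch of how multilabel and label_separator might interact when reading a raw label string. parse_labels is a hypothetical helper; the actual processor may also apply label_encoding at this point.

```python
# Hypothetical helper: splits a raw label value according to the config.
def parse_labels(raw: str, multilabel: bool = False,
                 label_separator: str = ",") -> list[str]:
    """Return the labels for one sample as a list of strings."""
    if not multilabel:
        # Single-label: the whole value is one label, separator is ignored.
        return [raw.strip()]
    # Multi-label: split on the separator and drop empty pieces.
    return [part.strip() for part in raw.split(label_separator) if part.strip()]


if __name__ == "__main__":
    print(parse_labels("joy, surprise", multilabel=True))
    print(parse_labels("joy, surprise", multilabel=False))
```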
Styling Tasks
data:
  instruction: "Style instruction text" # The style instruction
Completion Tasks
data:
  prompt_template: "template" # Prompt template
  completion_length: 100 # Target completion length
Advanced Configuration
HuggingFace Specific
data:
  hf_split: "train" # Dataset split to use
  hf_cache_dir: "./cache" # Cache directory
  test_split_from: "train" # Source for test split
  val_split_from: "train" # Source for validation split
Custom Data Specific
data:
  encoding: "utf-8" # File encoding
  delimiter: "," # CSV delimiter
Usage Examples
Basic Usage
# Use YAML configuration
python scripts/task_type/data_processor.py --config configs/task_type/config.yaml
# Override specific parameters
python scripts/task_type/data_processor.py \
--config configs/task_type/config.yaml \
--max-samples 1000 \
--learning-rate 3e-5
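The override flags above could be merged into a loaded config roughly as follows. This is a sketch under assumptions: that --max-samples maps to data.max_samples and --learning-rate to training.learning_rate, and that the YAML has already been parsed into a dict. build_parser and apply_overrides are hypothetical names.

```python
import argparse


# Hypothetical CLI wiring: only the three flags shown in this document.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Config override sketch")
    parser.add_argument("--config", required=True)
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    return parser


def apply_overrides(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of config with any CLI overrides applied."""
    merged = {section: dict(values or {}) for section, values in config.items()}
    if args.max_samples is not None:
        # Assumed mapping: --max-samples -> data.max_samples
        merged.setdefault("data", {})["max_samples"] = args.max_samples
    if args.learning_rate is not None:
        # Assumed mapping: --learning-rate -> training.learning_rate
        merged.setdefault("training", {})["learning_rate"] = args.learning_rate
    return merged


if __name__ == "__main__":
    base = {"data": {"max_samples": 500}, "training": {"learning_rate": 2e-5}}
    cli = build_parser().parse_args(
        ["--config", "configs/task_type/config.yaml", "--learning-rate", "3e-5"]
    )
    print(apply_overrides(base, cli))
```

Values left off the command line stay at whatever the YAML file specifies, so the file remains the single source of defaults.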
Creating Custom Configurations
- Copy an existing config file
- Modify parameters for your specific use case
- Update paths and model names
- Test with a small dataset first
Best Practices
- Start with Defaults: Use default values and adjust based on results
- Validate Paths: Ensure all file paths are correct and accessible
- Monitor Resources: Adjust batch sizes based on available GPU memory
- Test Incrementally: Test with small datasets before full processing
- Version Control: Keep configurations in version control for reproducibility
Troubleshooting
Common Issues:
- File Not Found: Check data_path and output_dir paths
- Memory Errors: Reduce batch_size or max_length
- Poor Performance: Adjust learning_rate or num_epochs
- Split Errors: Ensure split ratios sum to ≤ 1.0
Getting Help:
- Check the script help: python script.py --help
- Review the pipeline logs for detailed error messages
- Verify YAML syntax and parameter values