# Configuration Files Documentation

This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.

## Configuration Structure

All configuration files follow a consistent structure organized into these main sections:

### 1. Task Configuration
```yaml
task:
  name: "task_type"      # Task type: classification, completion, styling, matching
  type: "specific_type"  # Specific model/task type
```

**Available Task Types:**
- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
- **completion**: Text generation and completion tasks
- **styling**: Style transfer and text transformation tasks
- **matching**: Semantic matching and similarity tasks

### 2. Data Processing Configuration
```yaml
data:
  # Data Source
  source: "huggingface|custom"    # Where to get data from

  # Data Location
  dataset_name: "dataset/name"    # HuggingFace dataset name (for huggingface source)
  data_path: "./path/to/file"     # Path to custom data file (for custom source)
  data_format: "jsonl|csv|json"   # File format for custom data

  # Field Mapping
  input_field: "text"             # Field containing input text
  output_field: "styled_text"     # Field containing output (for styling)
  label_field: "label"            # Field containing labels (for classification)
  id_field: "id"                  # Optional ID field for tracking

  # Processing Parameters
  max_samples: 1000               # Maximum samples to process
  train_split: 0.8                # Training split ratio
  validation_split: 0.1           # Validation split ratio
  test_split: 0.1                 # Test split ratio

  # Text Preprocessing
  clean_text: true                # Clean and normalize text
  remove_special_chars: false     # Remove special characters
  lowercase: true                 # Convert to lowercase
  min_length: 10                  # Minimum text length
  max_length: 1000                # Maximum text length

  # Output Configuration
  output_format: "format_type"    # Output format
  output_dir: "./output/path"     # Output directory
```
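
As a rough illustration of how the text preprocessing keys above might combine, the sketch below applies `clean_text`, `remove_special_chars`, `lowercase`, `min_length`, and `max_length` to a list of strings. The helper name `preprocess_texts` is hypothetical, and three behaviors are assumptions rather than confirmed pipeline behavior: that cleaning means collapsing whitespace, that lengths are counted in characters, and that out-of-range samples are dropped rather than truncated.

```python
import re
from typing import Iterable, List


def preprocess_texts(
    texts: Iterable[str],
    clean_text: bool = True,
    remove_special_chars: bool = False,
    lowercase: bool = True,
    min_length: int = 10,
    max_length: int = 1000,
) -> List[str]:
    """Hypothetical helper mirroring the data.* preprocessing keys."""
    kept = []
    for text in texts:
        if clean_text:
            # Assumption: "cleaning" means trimming and collapsing whitespace.
            text = re.sub(r"\s+", " ", text).strip()
        if remove_special_chars:
            # Assumption: keep only word characters and whitespace.
            text = re.sub(r"[^\w\s]", "", text)
        if lowercase:
            text = text.lower()
        # Assumption: length is measured in characters and
        # samples outside [min_length, max_length] are dropped.
        if min_length <= len(text) <= max_length:
            kept.append(text)
    return kept
```

Running a handful of representative samples through a helper like this before full processing makes it easier to see how many records the length filters would discard.
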

**Data Source Types:**
- **huggingface**: Use datasets from HuggingFace Hub
- **custom**: Use local files (JSONL, CSV, JSON)

**Output Formats:**
- **classification**: Raw classification format
- **instruction**: Instruction-following format
- **conversation**: Conversational format
- **qa**: Question-answer format
- **styling**: Raw styling format
- **alpaca**: Alpaca instruction format
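
To make the `alpaca` option more concrete, here is a minimal sketch that maps one styling record onto the `instruction` / `input` / `output` keys used by the Stanford Alpaca dataset. The function name `to_alpaca` is hypothetical, and whether the pipeline emits exactly these keys is an assumption; the field defaults are taken from the `input_field` and `output_field` values shown above.

```python
from typing import Dict


def to_alpaca(
    record: Dict[str, str],
    instruction: str,
    input_field: str = "text",
    output_field: str = "styled_text",
) -> Dict[str, str]:
    """Hypothetical converter: one source record -> one Alpaca-style record."""
    return {
        "instruction": instruction,       # e.g. the data.instruction value
        "input": record[input_field],     # source text
        "output": record[output_field],   # target text
    }


example = to_alpaca(
    {"text": "hey, what's up", "styled_text": "Good afternoon. How are you?"},
    instruction="Rewrite the text in a formal tone.",
)
```
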

### 3. Model Configuration
```yaml
model:
  name: "model_name"  # Model from HuggingFace Hub
  max_length: 512     # Maximum sequence length
  num_labels: 6       # Number of labels (for classification)
```

**Recommended Models by Task:**
- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
- **Styling**: `t5-base`, `gpt2-medium`
- **Completion**: `gpt2-medium`, `gpt2-large`
- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`

### 4. Training Configuration
```yaml
training:
  num_epochs: 3                  # Number of training epochs
  batch_size: 16                 # Training batch size
  learning_rate: 2e-5            # Learning rate
  weight_decay: 0.01             # Weight decay
  lr_scheduler_type: "linear"    # Learning rate scheduler
  warmup_ratio: 0.1              # Warmup ratio
  data_dir: "./data/path"        # Training data directory
  output_dir: "./model/output"   # Model output directory
```

**Learning Rate Guidelines:**
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3

**Scheduler Types:**
- **linear**: Linear decay
- **cosine**: Cosine annealing
- **polynomial**: Polynomial decay
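
The interaction between `learning_rate`, `warmup_ratio`, and the `linear` scheduler can be sketched as follows. This is the common "linear warmup, then linear decay to zero" definition written in plain Python; the function name `linear_schedule_lr` is hypothetical, and the pipeline's actual scheduler may differ in details such as rounding of the warmup step count.

```python
def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 2e-5,
    warmup_ratio: float = 0.1,
) -> float:
    """Hypothetical helper: learning rate at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)
```

With `warmup_ratio: 0.1`, the first 10% of steps ramp up to the configured `learning_rate`, halfway through warmup the rate is half the base value, and the rate reaches zero at the final step.
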

### 5. Inference Configuration
```yaml
inference:
  model_path: "./model/path"   # Path to saved model
  device: "auto"               # Device to use
  batch_size: 32               # Inference batch size
  return_probabilities: true   # Return probabilities
  return_top_k: 3              # Return top K predictions
  max_new_tokens: 128          # Max tokens to generate
  temperature: 0.8             # Sampling temperature
```

**Device Options:**
- **auto**: Automatically detect best device
- **cuda**: Use GPU if available
- **cpu**: Force CPU usage
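
One plausible way to resolve the `auto` option is shown below. It assumes the pipeline runs on PyTorch, which this document does not state, and falls back to `cpu` when PyTorch is not installed. The function name `resolve_device` is hypothetical, and other accelerators are deliberately left out of this sketch.

```python
def resolve_device(device: str = "auto") -> str:
    """Hypothetical helper: turn the inference.device value into a concrete device."""
    if device != "auto":
        # Explicit choices ("cuda", "cpu") are passed through unchanged.
        return device
    try:
        import torch  # Assumption: PyTorch is the backend; treated as optional here.
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```
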

**Temperature Guidelines:**
- **0.0**: Deterministic (always the same output)
- **0.7-0.9**: Balanced creativity
- **1.0+**: More random/creative
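
The effect of `temperature` can be seen by rescaling a small set of logits before the softmax. The sketch below is a generic illustration, not the pipeline's decoding code; the function name is hypothetical, and treating `0.0` as greedy selection is an assumption (some libraries reject a temperature of zero when sampling is enabled).

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Hypothetical helper: token probabilities after temperature scaling."""
    if temperature <= 0.0:
        # Assumption: 0.0 means greedy decoding, all mass on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.8)  # more mass on the top token
flat = softmax_with_temperature(logits, 1.5)   # mass spread more evenly
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution and make less likely tokens more probable.
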

## Task-Specific Parameters

### Classification Tasks
```yaml
data:
  label_encoding: "auto|numeric|string"  # How to encode labels
  multilabel: false                      # Multi-label vs single-label
  label_separator: ","                   # Separator for multi-label
```
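
As a small example of how `multilabel` and `label_separator` might be applied to a raw label field, the sketch below splits a delimited string into a list of labels. The helper name `parse_labels` is hypothetical, and stripping surrounding whitespace and skipping empty entries are assumptions about the desired behavior.

```python
from typing import List, Union


def parse_labels(
    raw: str,
    multilabel: bool = False,
    label_separator: str = ",",
) -> Union[str, List[str]]:
    """Hypothetical helper: parse the label_field value of one record."""
    if not multilabel:
        # Single-label: the whole field is one label.
        return raw.strip()
    # Multi-label: split on the separator and drop empty pieces.
    return [part.strip() for part in raw.split(label_separator) if part.strip()]
```
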

### Styling Tasks
```yaml
data:
  instruction: "Style instruction text"  # The style instruction
```

### Completion Tasks
```yaml
data:
  prompt_template: "template"  # Prompt template
  completion_length: 100       # Target completion length
```

## Advanced Configuration

### HuggingFace Specific
```yaml
data:
  hf_split: "train"          # Dataset split to use
  hf_cache_dir: "./cache"    # Cache directory
  test_split_from: "train"   # Source for test split
  val_split_from: "train"    # Source for validation split
```

### Custom Data Specific
```yaml
data:
  encoding: "utf-8"  # File encoding
  delimiter: ","     # CSV delimiter
```
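
The sketch below shows how `data_path`, `data_format`, `encoding`, and `delimiter` could drive a loader for custom files, using only the Python standard library. The function name `load_custom_records` is hypothetical, and it assumes that JSON files contain a top-level list of records and that CSV files have a header row; the pipeline's real loader may behave differently.

```python
import csv
import json
from typing import Any, Dict, List


def load_custom_records(
    data_path: str,
    data_format: str,
    encoding: str = "utf-8",
    delimiter: str = ",",
) -> List[Dict[str, Any]]:
    """Hypothetical loader for the 'custom' data source."""
    with open(data_path, "r", encoding=encoding, newline="") as handle:
        if data_format == "jsonl":
            # One JSON object per non-empty line.
            return [json.loads(line) for line in handle if line.strip()]
        if data_format == "json":
            # Assumption: the file holds a top-level list of records.
            return json.load(handle)
        if data_format == "csv":
            # Assumption: the first row is a header naming the fields.
            return list(csv.DictReader(handle, delimiter=delimiter))
    raise ValueError(f"Unsupported data_format: {data_format!r}")
```
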

## Usage Examples

### Basic Usage
```bash
# Use YAML configuration
python scripts/task_type/data_processor.py --config configs/task_type/config.yaml

# Override specific parameters
python scripts/task_type/data_processor.py \
  --config configs/task_type/config.yaml \
  --max-samples 1000 \
  --learning-rate 3e-5
```
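
For readers wondering how command-line overrides could be layered on top of a loaded configuration, here is a minimal sketch using `argparse`. Only the three flags shown above are handled. The function name `apply_cli_overrides` and the mapping of `--max-samples` to `data.max_samples` and `--learning-rate` to `training.learning_rate` are assumptions, and the configuration is passed in as an already-parsed dictionary.

```python
import argparse
from typing import Any, Dict, List, Optional


def apply_cli_overrides(
    config: Dict[str, Dict[str, Any]],
    argv: Optional[List[str]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: return a copy of config with CLI values applied."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config")
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    args = parser.parse_args(argv)

    # Copy each section so the caller's dictionary is left untouched.
    merged = {section: dict(values) for section, values in config.items()}
    if args.max_samples is not None:
        merged.setdefault("data", {})["max_samples"] = args.max_samples
    if args.learning_rate is not None:
        merged.setdefault("training", {})["learning_rate"] = args.learning_rate
    return merged
```

Flags that are not passed leave the YAML values in place, so the file remains the single source of defaults.
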

### Creating Custom Configurations
1. Copy an existing config file
2. Modify parameters for your specific use case
3. Update paths and model names
4. Test with a small dataset first

## Best Practices

1. **Start with Defaults**: Use default values and adjust based on results
2. **Validate Paths**: Ensure all file paths are correct and accessible
3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
4. **Test Incrementally**: Test with small datasets before full processing
5. **Version Control**: Keep configurations in version control for reproducibility

## Troubleshooting

### Common Issues:
- **File Not Found**: Check `data_path` and `output_dir` paths
- **Memory Errors**: Reduce `batch_size` or `max_length`
- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
- **Split Errors**: Ensure split ratios sum to ≤ 1.0
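
The split-ratio rule can be checked before a run with a few lines of Python. This is a sketch under the assumption that the three ratios come from `train_split`, `validation_split`, and `test_split` in the `data` section; the function name `validate_splits` and the small floating-point tolerance are illustrative choices.

```python
def validate_splits(
    train_split: float,
    validation_split: float,
    test_split: float,
    tolerance: float = 1e-9,
) -> None:
    """Hypothetical check: raise ValueError if the split ratios are invalid."""
    splits = {
        "train_split": train_split,
        "validation_split": validation_split,
        "test_split": test_split,
    }
    for name, value in splits.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    total = sum(splits.values())
    # Allow a tiny tolerance for floating-point rounding (e.g. 0.8 + 0.1 + 0.1).
    if total > 1.0 + tolerance:
        raise ValueError(f"Split ratios sum to {total}, which exceeds 1.0")
```
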

### Getting Help:
- Check the script help: `python script.py --help`
- Review the pipeline logs for detailed error messages
- Verify YAML syntax and parameter values