# Quick Reference Card
## Essential Parameters (Most Common)
### Data Source & Location
```yaml
data:
  source: "huggingface|custom"    # REQUIRED: Data source type
  dataset_name: "dataset/name"    # REQUIRED for huggingface
  data_path: "./path/to/file"     # REQUIRED for custom
  data_format: "jsonl|csv|json"   # REQUIRED for custom
```
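The conditional requirements above can be checked before a run. The following is a minimal sketch, assuming the `data` section has already been parsed into a Python dict; the helper name `missing_data_keys` is hypothetical and not part of the tool.

```python
from typing import Dict, List


def missing_data_keys(data: Dict[str, object]) -> List[str]:
    """Return the required `data` keys that are absent or empty.

    Hypothetical helper: it only mirrors the REQUIRED comments in the
    reference card, not the tool's real validation logic.
    """
    source = data.get("source")
    if source == "huggingface":
        required = ["dataset_name"]
    elif source == "custom":
        required = ["data_path", "data_format"]
    else:
        raise ValueError(f"source must be 'huggingface' or 'custom', got {source!r}")
    # A key counts as missing when it is absent or set to an empty value.
    return [key for key in required if not data.get(key)]
```

For example, `missing_data_keys({"source": "custom", "data_path": "./data.jsonl"})` reports that `data_format` still needs to be set.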
### Field Mapping
```yaml
data:
  input_field: "text"                # REQUIRED: Input text field
  label_field: "label"               # REQUIRED for classification
  output_field: "styled_text"        # REQUIRED for styling
  instruction: "Style instruction"   # REQUIRED for styling
```
### Basic Processing
```yaml
data:
  max_samples: 1000             # Limit total samples
  train_split: 0.8              # Training ratio (0.0-1.0)
  validation_split: 0.1         # Validation ratio (0.0-1.0)
  test_split: 0.1               # Test ratio (0.0-1.0)
  output_dir: "./output/path"   # Output directory
```
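To make the interaction between `max_samples` and the split ratios concrete, here is a rough sketch of how they could translate into sample counts. It assumes the cap is applied before splitting and that fractional counts are rounded down; the tool's actual ordering and rounding may differ, and `split_sizes` is a hypothetical name.

```python
import math
from typing import Optional, Tuple


def split_sizes(
    n_samples: int,
    train: float = 0.8,
    validation: float = 0.1,
    test: float = 0.1,
    max_samples: Optional[int] = None,
) -> Tuple[int, ...]:
    """Return (train, validation, test) sample counts.

    Assumptions: max_samples caps the dataset first, and each count is
    rounded down so the three counts never exceed the capped total.
    """
    if max_samples is not None:
        n_samples = min(n_samples, max_samples)
    eps = 1e-9  # guards against float products landing just below an integer
    return tuple(math.floor(n_samples * ratio + eps) for ratio in (train, validation, test))
```

With the values shown above, a 5,000-row dataset capped at `max_samples: 1000` would yield 800 training, 100 validation, and 100 test samples under these assumptions.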
### Text Preprocessing
```yaml
data:
  clean_text: true   # Clean/normalize text
  lowercase: true    # Convert to lowercase
  min_length: 10     # Minimum text length
  max_length: 512    # Maximum text length
```
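The card does not spell out what each preprocessing option does, so the sketch below is only one plausible reading. It assumes that cleaning means collapsing whitespace, that lengths are counted in characters, that texts shorter than `min_length` are dropped, and that longer texts are truncated to `max_length`. The function name `preprocess` is hypothetical.

```python
import re
from typing import Optional


def preprocess(
    text: str,
    clean_text: bool = True,
    lowercase: bool = True,
    min_length: int = 10,
    max_length: int = 512,
) -> Optional[str]:
    """Apply the four options in order; return None when the text is dropped."""
    if clean_text:
        # Assumed meaning of "clean": collapse whitespace runs and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if len(text) < min_length:
        return None  # assumed: too-short samples are filtered out
    return text[:max_length]  # assumed: too-long samples are truncated
```

If the tool measures length in tokens rather than characters, the length checks would need to run on tokenizer output instead.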
### Model & Training
```yaml
model:
  name: "bert-base-uncased"   # Model name
  max_length: 512             # Max sequence length
training:
  num_epochs: 3               # Training epochs
  batch_size: 16              # Batch size
  learning_rate: 2e-5         # Learning rate
```
## Common Configurations by Task
### Classification
```yaml
task:
  name: "classification"
  type: "sequence_classification"
data:
  source: "huggingface"
  dataset_name: "dair-ai/emotion"
  input_field: "text"
  label_field: "label"
  output_format: "classification"
```
### Styling
```yaml
task:
  name: "styling"
  type: "style_transfer"
data:
  source: "custom"
  data_path: "./data.jsonl"
  data_format: "jsonl"   # REQUIRED for custom sources
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite in formal style"
  output_format: "alpaca"
```
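As an illustration of what this configuration might produce, the sketch below maps one source record to an Alpaca-style record. It assumes `output_format: "alpaca"` refers to the common `instruction`/`input`/`output` layout; the exact keys the tool emits are not documented here, and `to_alpaca` is a hypothetical helper.

```python
from typing import Dict


def to_alpaca(
    record: Dict[str, str],
    input_field: str = "text",
    output_field: str = "styled_text",
    instruction: str = "Rewrite in formal style",
) -> Dict[str, str]:
    """Build an Alpaca-style record from one row of the source data.

    Assumed layout: {"instruction", "input", "output"}.
    """
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


row = {"text": "gonna be late, sry", "styled_text": "I apologize; I will be late."}
print(to_alpaca(row))
```

A missing `input_field` or `output_field` raises `KeyError` in this sketch, which surfaces field-mapping mistakes early.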
### Text Generation
```yaml
task:
  name: "completion"
  type: "text_generation"
data:
  source: "custom"
  data_path: "./prompts.jsonl"
  data_format: "jsonl"   # REQUIRED for custom sources
  input_field: "prompt"
  output_field: "completion"
  output_format: "instruction"
```
## Quick Start Templates
### 1. HuggingFace Dataset
```yaml
task:
  name: "classification"
  type: "sequence_classification"
data:
  source: "huggingface"
  dataset_name: "your/dataset"
  input_field: "text"
  label_field: "label"
  max_samples: 1000
  output_dir: "./output"
```
### 2. Custom JSONL File
```yaml
task:
  name: "styling"
  type: "style_transfer"
data:
  source: "custom"
  data_path: "./your_data.jsonl"
  data_format: "jsonl"
  input_field: "source"
  output_field: "target"
  instruction: "Your style instruction"
  output_dir: "./output"
```
### 3. CSV File
```yaml
task:
  name: "classification"
  type: "sequence_classification"
data:
  source: "custom"
  data_path: "./your_data.csv"
  data_format: "csv"
  input_field: "text"
  label_field: "label"
  delimiter: ","
  output_dir: "./output"
```
## Parameter Ranges & Recommendations
### Split Ratios
- **Total must be ≤ 1.0**
- **Common**: train=0.8, val=0.1, test=0.1
- **Small datasets**: train=0.7, val=0.15, test=0.15
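
The split rules above can be checked with a few lines of Python. This is a sketch rather than the tool's own validation; `validate_splits` is a hypothetical name, and the tolerance for floating-point rounding is an assumption.

```python
import math


def validate_splits(train: float, validation: float, test: float) -> float:
    """Check the three ratios and return the fraction of data left unused."""
    ratios = {"train_split": train, "validation_split": validation, "test_split": test}
    for name, value in ratios.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    total = sum(ratios.values())
    # Tolerate rounding noise, e.g. 0.7 + 0.15 + 0.15 may not be exactly 1.0.
    if total > 1.0 and not math.isclose(total, 1.0):
        raise ValueError(f"split ratios sum to {total:.3f}, which exceeds 1.0")
    return max(0.0, 1.0 - total)
```

A total below 1.0 is allowed; the return value shows how much of the dataset would go unused.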
### Learning Rates
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3
- **Start with**: 2e-5
### Batch Sizes
- **GPU**: 8, 16, 32, or 64, depending on available memory
- **CPU**: 4, 8, 16
- **Start with**: 16
### Text Lengths (in tokens)
- **BERT**: 512 (max)
- **GPT-2**: 1024 (max)
- **T5**: 512 (pre-training length; longer inputs are possible but use more memory)
- **Start with**: 256
## Common Issues & Fixes
| Issue | Cause | Fix |
|-------|-------|-----|
| "File not found" | Wrong path | Check `data_path` and `output_dir` |
| "Memory error" | Batch too large | Reduce `batch_size` |
| "Split error" | Ratios sum to > 1.0 | Ensure splits sum to ≤ 1.0 |
| "Poor performance" | Wrong learning rate | Try a value in the 1e-5 to 5e-5 range |
| "Slow processing" | Text too long | Reduce `max_length` |
## Environment Variables
```bash
# Set cache directory
export HF_HOME="./cache"
# Set output directory
export OUTPUT_DIR="./results"
# Set log level
export LOG_LEVEL="INFO"
```
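`HF_HOME` is read by the Hugging Face libraries themselves. How `OUTPUT_DIR` and `LOG_LEVEL` are consumed is not shown in this card, so the sketch below is only an assumption of how a script could pick them up. The function name `load_env_settings` and the fallback values (copied from the exports above) are illustrative.

```python
import logging
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def load_env_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Read the three variables, falling back to the values shown above."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        raise ValueError(f"unknown LOG_LEVEL: {level_name!r}")
    return {
        "cache_dir": Path(env.get("HF_HOME", "./cache")),       # assumed fallback
        "output_dir": Path(env.get("OUTPUT_DIR", "./results")),  # assumed fallback
        "log_level": level,
    }


settings = load_env_settings()
logging.basicConfig(level=settings["log_level"])
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.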