# Quick Reference Card

## Essential Parameters (Most Common)

### Data Source & Location

```yaml
data:
  source: "huggingface|custom"      # REQUIRED: data source type (choose one value)
  dataset_name: "dataset/name"      # REQUIRED for huggingface
  data_path: "./path/to/file"       # REQUIRED for custom
  data_format: "jsonl|csv|json"     # REQUIRED for custom (choose one value)
```
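
The required-field rules above can be checked before a run. The following is a minimal sketch, assuming the `data` section has already been loaded into a plain dict; `validate_data_source` is a hypothetical helper for illustration, not part of any tool's API.

```python
def validate_data_source(data: dict) -> list[str]:
    """Return a list of problems found in a `data` config section."""
    problems = []
    source = data.get("source")
    if source == "huggingface":
        # Hub datasets are addressed by name only.
        if not data.get("dataset_name"):
            problems.append("dataset_name is required when source is 'huggingface'")
    elif source == "custom":
        # Local files need both a path and a format.
        for key in ("data_path", "data_format"):
            if not data.get(key):
                problems.append(f"{key} is required when source is 'custom'")
        fmt = data.get("data_format")
        if fmt and fmt not in ("jsonl", "csv", "json"):
            problems.append(f"unsupported data_format: {fmt!r}")
    else:
        problems.append("source must be 'huggingface' or 'custom'")
    return problems
```

An empty list means the section satisfies the rules listed in the block above; anything else can be shown to the user before any data is loaded.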

### Field Mapping

```yaml
data:
  input_field: "text"               # REQUIRED: Input text field
  label_field: "label"              # REQUIRED for classification
  output_field: "styled_text"       # REQUIRED for styling
  instruction: "Style instruction"  # REQUIRED for styling
```

### Basic Processing

```yaml
data:
  max_samples: 1000             # Limit total samples
  train_split: 0.8              # Training ratio (0.0-1.0)
  validation_split: 0.1         # Validation ratio (0.0-1.0)
  test_split: 0.1               # Test ratio (0.0-1.0)
  output_dir: "./output/path"   # Output directory
```

### Text Preprocessing

```yaml
data:
  clean_text: true    # Clean/normalize text
  lowercase: true     # Convert to lowercase
  min_length: 10      # Minimum text length
  max_length: 512     # Maximum text length
```
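
To make the effect of these four options concrete, here is a rough sketch of one plausible interpretation. Everything about the behavior is an assumption, since the card does not specify it: `clean_text` is taken to mean whitespace normalization, lengths are counted in characters, texts shorter than `min_length` are dropped, and longer texts are truncated at `max_length`. The `preprocess` function is hypothetical.

```python
import re


def preprocess(text, *, clean_text=True, lowercase=True,
               min_length=10, max_length=512):
    """Apply the preprocessing options to one text; return None if it is dropped."""
    if clean_text:
        # Assumption: "clean" means collapsing runs of whitespace and trimming.
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if len(text) < min_length:
        # Assumption: too-short texts are filtered out, counted in characters.
        return None
    # Assumption: too-long texts are truncated rather than dropped.
    return text[:max_length]
```

If the actual tool counts tokens instead of characters, or filters long texts instead of truncating them, the same structure applies with the length check swapped out.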

### Model & Training

```yaml
model:
  name: "bert-base-uncased"   # Model name
  max_length: 512             # Max sequence length

training:
  num_epochs: 3               # Training epochs
  batch_size: 16              # Batch size
  learning_rate: 2.0e-5       # Learning rate (decimal point keeps YAML 1.1 loaders from reading a string)
```

## Common Configurations by Task

### Classification

```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "dair-ai/emotion"
  input_field: "text"
  label_field: "label"
  output_format: "classification"
```

### Styling

```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data.jsonl"
  data_format: "jsonl"          # REQUIRED for custom sources
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite in formal style"
  output_format: "alpaca"
```
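
For orientation, the Alpaca layout is a record with `instruction`, `input`, and `output` keys. The sketch below shows how the field mapping in the styling config could translate one source row into that layout; `to_alpaca` is a hypothetical helper, and the exact records the tool writes are an assumption.

```python
import json


def to_alpaca(record: dict, data_cfg: dict) -> dict:
    """Build one Alpaca-style record from a source row and the `data` config."""
    return {
        "instruction": data_cfg["instruction"],         # same instruction for every row
        "input": record[data_cfg["input_field"]],       # original text
        "output": record[data_cfg["output_field"]],     # styled target text
    }


data_cfg = {
    "input_field": "text",
    "output_field": "styled_text",
    "instruction": "Rewrite in formal style",
}
row = {"text": "gonna be late, sorry", "styled_text": "I apologize; I will be late."}
print(json.dumps(to_alpaca(row, data_cfg), indent=2))
```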

### Text Generation

```yaml
task:
  name: "completion"
  type: "text_generation"

data:
  source: "custom"
  data_path: "./prompts.jsonl"
  data_format: "jsonl"          # REQUIRED for custom sources
  input_field: "prompt"
  output_field: "completion"
  output_format: "instruction"
```

## Quick Start Templates

### 1. HuggingFace Dataset

```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "your/dataset"
  input_field: "text"
  label_field: "label"
  max_samples: 1000
  output_dir: "./output"
```

### 2. Custom JSONL File

```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./your_data.jsonl"
  data_format: "jsonl"
  input_field: "source"
  output_field: "target"
  instruction: "Your style instruction"
  output_dir: "./output"
```
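
A JSONL file for this template holds one JSON object per line, each containing the `source` and `target` keys named in the config. The following sketch, using only the standard library, reads such a file and reports the first line that lacks a mapped field. `read_jsonl` is a hypothetical helper, and an in-memory stream stands in for `./your_data.jsonl`.

```python
import io
import json


def read_jsonl(stream, input_field, output_field):
    """Return (input, output) pairs from a JSONL stream, skipping blank lines."""
    pairs = []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        missing = [f for f in (input_field, output_field) if f not in row]
        if missing:
            raise KeyError(f"line {line_no}: missing field(s) {missing}")
        pairs.append((row[input_field], row[output_field]))
    return pairs


# Stand-in for open("./your_data.jsonl", encoding="utf-8")
sample = io.StringIO(
    '{"source": "hey there", "target": "Good afternoon."}\n'
    "\n"
    '{"source": "gimme that", "target": "Please pass that to me."}\n'
)
pairs = read_jsonl(sample, "source", "target")
```

Running a check like this before training catches a wrong `input_field` or `output_field` name early.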

### 3. CSV File

```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "custom"
  data_path: "./your_data.csv"
  data_format: "csv"
  input_field: "text"
  label_field: "label"
  delimiter: ","
  output_dir: "./output"
```
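
The CSV template assumes a header row whose column names match `input_field` and `label_field`, with cells separated by `delimiter`. As a sketch of what that implies, the standard-library `csv` module can verify the header and extract the two columns. `read_csv_rows` is a hypothetical helper, and the in-memory stream stands in for `./your_data.csv`.

```python
import csv
import io


def read_csv_rows(stream, input_field, label_field, delimiter=","):
    """Return (text, label) pairs from a CSV stream that has a header row."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    # fieldnames is populated from the first row of the file.
    missing = {input_field, label_field} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV header is missing column(s): {sorted(missing)}")
    return [(row[input_field], row[label_field]) for row in reader]


# Stand-in for open("./your_data.csv", newline="", encoding="utf-8")
sample_csv = io.StringIO('text,label\nI love it,joy\n"Well, fine",neutral\n')
rows = read_csv_rows(sample_csv, "text", "label", delimiter=",")
```

Quoted cells may contain the delimiter itself, which is why a CSV parser is preferable to splitting lines by hand.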

## Parameter Ranges & Recommendations

### Split Ratios

- **Total must be ≤ 1.0**
- **Common**: train=0.8, val=0.1, test=0.1
- **Small datasets**: train=0.7, val=0.15, test=0.15
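
The split rules can be checked with a few lines of arithmetic. This sketch validates the three ratios and converts them into sample counts; `split_sizes` is a hypothetical helper, and rounding each split down is an assumption about how leftover samples are handled.

```python
import math


def split_sizes(n_samples: int, train: float, validation: float, test: float) -> dict:
    """Validate split ratios and return the number of samples in each split."""
    ratios = {"train": train, "validation": validation, "test": test}
    for name, ratio in ratios.items():
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"{name} ratio must be between 0.0 and 1.0, got {ratio}")
    total = sum(ratios.values())
    if total > 1.0 + 1e-9:  # small tolerance for floating-point rounding
        raise ValueError(f"split ratios sum to {total:.3f}, which exceeds 1.0")
    # Round down so the splits never use more samples than are available.
    return {name: math.floor(n_samples * ratio + 1e-9) for name, ratio in ratios.items()}


sizes = split_sizes(1000, train=0.8, validation=0.1, test=0.1)
```

With `max_samples: 1000` and the common 0.8/0.1/0.1 ratios, this yields 800 training, 100 validation, and 100 test samples.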

### Learning Rates

- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3
- **Start with**: 2e-5

### Batch Sizes

- **GPU** (as memory allows): 8, 16, 32, 64
- **CPU**: 4, 8, 16
- **Start with**: 16

### Text Lengths

- **BERT**: 512 tokens (max)
- **GPT-2**: 1024 tokens (max)
- **T5**: 512 tokens (typical; set by pretraining rather than a hard limit)
- **Start with**: 256

## Common Issues & Fixes

| Issue | Cause | Fix |
|-------|-------|-----|
| "File not found" | Wrong path | Check `data_path` and `output_dir` |
| "Memory error" | Batch too large | Reduce `batch_size` |
| "Split error" | Ratios sum to > 1.0 | Ensure splits sum to ≤ 1.0 |
| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
| "Slow processing" | Text too long | Reduce `max_length` |

## Environment Variables

```bash
# Set cache directory
export HF_HOME="./cache"

# Set output directory
export OUTPUT_DIR="./results"

# Set log level
export LOG_LEVEL="INFO"
```
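
On the Python side, variables like these are typically read from `os.environ` with fallbacks. The sketch below is illustrative only: `load_env_settings` is a hypothetical helper, the `./output` fallback for `OUTPUT_DIR` is an assumption, and how the actual tool consumes `OUTPUT_DIR` and `LOG_LEVEL` is not specified here. `HF_HOME` is the Hugging Face cache root, which defaults to `~/.cache/huggingface` when unset.

```python
import logging
import os
from pathlib import Path


def load_env_settings(env=None):
    """Read cache, output, and logging settings from environment variables."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        raise ValueError(f"unknown LOG_LEVEL: {level_name!r}")
    return {
        "cache_dir": Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser(),
        "output_dir": Path(env.get("OUTPUT_DIR", "./output")),  # assumed fallback
        "log_level": level,
    }


settings = load_env_settings({"HF_HOME": "./cache", "OUTPUT_DIR": "./results", "LOG_LEVEL": "INFO"})
logging.basicConfig(level=settings["log_level"])
```

Passing a dict instead of relying on the real environment keeps the function easy to test.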