added style mimicking pipelines
@@ -0,0 +1,191 @@
# Quick Reference Card

## Essential Parameters (Most Common)

### Data Source & Location
```yaml
data:
  source: "huggingface|custom"     # REQUIRED: Data source type
  dataset_name: "dataset/name"     # REQUIRED for huggingface
  data_path: "./path/to/file"      # REQUIRED for custom
  data_format: "jsonl|csv|json"    # REQUIRED for custom
```
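
The REQUIRED rules above can be checked before a run starts. The following is a minimal sketch, not the project's own validator: `missing_data_keys` is a hypothetical helper, and it assumes the `data` section has already been loaded into a plain dict.

```python
def missing_data_keys(data_cfg: dict) -> list[str]:
    """Return the required `data` keys that are absent for the chosen source."""
    # Required keys per source, taken from the REQUIRED comments in the block above.
    required = {
        "huggingface": ["dataset_name"],
        "custom": ["data_path", "data_format"],
    }
    source = data_cfg.get("source")
    if source not in required:
        # An unknown or missing source is reported as a problem with `source` itself.
        return ["source"]
    return [key for key in required[source] if not data_cfg.get(key)]


print(missing_data_keys({"source": "custom", "data_path": "./data.jsonl"}))
# prints ['data_format']
```

An empty list means the section has everything the chosen source needs.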

### Field Mapping
```yaml
data:
  input_field: "text"                # REQUIRED: Input text field
  label_field: "label"               # REQUIRED for classification
  output_field: "styled_text"        # REQUIRED for styling
  instruction: "Style instruction"   # REQUIRED for styling
```

### Basic Processing
```yaml
data:
  max_samples: 1000             # Limit total samples
  train_split: 0.8              # Training ratio (0.0-1.0)
  validation_split: 0.1         # Validation ratio (0.0-1.0)
  test_split: 0.1               # Test ratio (0.0-1.0)
  output_dir: "./output/path"   # Output directory
```
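
To see what these ratios mean in practice, the arithmetic can be sketched as below. `split_counts` is a hypothetical helper, and rounding down is an assumption; the tool itself may round or shuffle differently.

```python
def split_counts(n_samples: int, train: float, val: float, test: float) -> dict:
    """Turn split ratios into per-split sample counts (rounded down)."""
    eps = 1e-9  # guards against products that land just below a whole number
    return {
        "train": int(n_samples * train + eps),
        "validation": int(n_samples * val + eps),
        "test": int(n_samples * test + eps),
    }


print(split_counts(1000, 0.8, 0.1, 0.1))
# prints {'train': 800, 'validation': 100, 'test': 100}
```

With `max_samples: 1000` and the ratios shown, that works out to 800 training, 100 validation, and 100 test samples.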

### Text Preprocessing
```yaml
data:
  clean_text: true    # Clean/normalize text
  lowercase: true     # Convert to lowercase
  min_length: 10      # Minimum text length
  max_length: 512     # Maximum text length
```
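
One plausible reading of these four options is sketched below. The exact semantics are assumptions: lengths are counted in characters, "clean" means whitespace normalization, short samples are dropped, and long samples are truncated. `preprocess` is a hypothetical name.

```python
import re
from typing import Optional


def preprocess(text: str, clean_text: bool = True, lowercase: bool = True,
               min_length: int = 10, max_length: int = 512) -> Optional[str]:
    """Apply the options to one string; return None if the result is too short."""
    if clean_text:
        # Assumed meaning of "clean": collapse whitespace runs and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if len(text) < min_length:
        return None  # assumed: samples below min_length are dropped
    return text[:max_length]  # assumed: samples above max_length are truncated


print(preprocess("  Hello   WORLD, this is a Test.  "))
# prints hello world, this is a test.
```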

### Model & Training
```yaml
model:
  name: "bert-base-uncased"   # Model name
  max_length: 512             # Max sequence length

training:
  num_epochs: 3               # Training epochs
  batch_size: 16              # Batch size
  learning_rate: 2e-5         # Learning rate (PyYAML reads 2e-5 as a string; 2.0e-5 parses as a float)
```
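
As a rough guide to how long a run takes, the number of optimizer steps follows from the training-set size and the two `training` values. This sketch assumes one step per batch, no gradient accumulation, and that a partial final batch still counts; `training_steps` is a hypothetical helper.

```python
import math


def training_steps(n_train: int, batch_size: int = 16, num_epochs: int = 3) -> int:
    """Total optimizer steps for a run, counting a partial final batch as a step."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * num_epochs


# 800 samples / 16 per batch = 50 steps per epoch, times 3 epochs
print(training_steps(800))  # prints 150
```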

## Common Configurations by Task

### Classification
```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "dair-ai/emotion"
  input_field: "text"
  label_field: "label"
  output_format: "classification"
```

### Styling
```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data.jsonl"
  data_format: "jsonl"          # REQUIRED for custom sources
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite in formal style"
  output_format: "alpaca"
```
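
Alpaca-style data stores each example under `instruction`, `input`, and `output` keys. The sketch below shows how the field mapping above would line up with those keys; `to_alpaca` is a hypothetical helper and the sample row is made up for illustration, so the tool's actual output may carry extra fields.

```python
import json


def to_alpaca(record: dict, input_field: str, output_field: str, instruction: str) -> dict:
    """Map one source record onto the instruction/input/output keys of Alpaca-style data."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


# Made-up row using the field names from the config above.
row = {"text": "gonna be late, sorry", "styled_text": "I apologize; I will be late."}
print(json.dumps(to_alpaca(row, "text", "styled_text", "Rewrite in formal style"), indent=2))
```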

### Text Generation
```yaml
task:
  name: "completion"
  type: "text_generation"

data:
  source: "custom"
  data_path: "./prompts.jsonl"
  data_format: "jsonl"          # REQUIRED for custom sources
  input_field: "prompt"
  output_field: "completion"
  output_format: "instruction"
```

## Quick Start Templates

### 1. HuggingFace Dataset
```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "your/dataset"
  input_field: "text"
  label_field: "label"
  max_samples: 1000
  output_dir: "./output"
```

### 2. Custom JSONL File
```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./your_data.jsonl"
  data_format: "jsonl"
  input_field: "source"
  output_field: "target"
  instruction: "Your style instruction"
  output_dir: "./output"
```
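
Before pointing the template at a file, it can help to confirm that every line carries the two mapped fields. This is a minimal sketch using only the standard library; `read_jsonl` is a hypothetical helper and the two sample records are invented.

```python
import io
import json


def read_jsonl(handle, input_field: str, output_field: str):
    """Yield (input, output) pairs from a JSONL stream, skipping blank lines."""
    for line_no, line in enumerate(handle, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = [f for f in (input_field, output_field) if f not in record]
        if missing:
            raise KeyError(f"line {line_no}: missing field(s) {missing}")
        yield record[input_field], record[output_field]


# In-memory stand-in for ./your_data.jsonl
sample = io.StringIO(
    '{"source": "hey", "target": "Hello."}\n'
    "\n"
    '{"source": "thx", "target": "Thank you."}\n'
)
pairs = list(read_jsonl(sample, "source", "target"))
print(pairs)  # prints [('hey', 'Hello.'), ('thx', 'Thank you.')]
```

For a real file, pass `open("./your_data.jsonl", encoding="utf-8")` in place of the `StringIO` object.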

### 3. CSV File
```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "custom"
  data_path: "./your_data.csv"
  data_format: "csv"
  input_field: "text"
  label_field: "label"
  delimiter: ","
  output_dir: "./output"
```
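
The `input_field`, `label_field`, and `delimiter` settings correspond to column names and the separator of a CSV file with a header row. The sketch below illustrates that mapping with Python's `csv` module; `read_csv_rows` is a hypothetical helper and the sample rows are invented.

```python
import csv
import io


def read_csv_rows(handle, input_field: str, label_field: str, delimiter: str = ","):
    """Return (text, label) tuples from a CSV stream that has a header row."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [(row[input_field], row[label_field]) for row in reader]


# In-memory stand-in for ./your_data.csv; the quoted cell contains the delimiter.
sample = io.StringIO('text,label\n"I love this, really",joy\nSo tired,sadness\n')
rows = read_csv_rows(sample, "text", "label")
print(rows)  # prints [('I love this, really', 'joy'), ('So tired', 'sadness')]
```

Text cells that contain the delimiter need to be quoted in the file, as in the first sample row.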

## Parameter Ranges & Recommendations

### Split Ratios
- **Total must be ≤ 1.0**
- **Common**: train=0.8, val=0.1, test=0.1
- **Small datasets**: train=0.7, val=0.15, test=0.15
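
The sum rule above is easy to check in code, with one caveat: floating-point addition can overshoot 1.0 by a tiny amount, so a small tolerance is useful. `splits_are_valid` is a hypothetical helper, and the tolerance value is an assumption.

```python
def splits_are_valid(train: float, val: float, test: float, tol: float = 1e-9) -> bool:
    """True when each ratio lies in [0, 1] and the total does not exceed 1.0."""
    ratios = (train, val, test)
    in_range = all(0.0 <= r <= 1.0 for r in ratios)
    return in_range and sum(ratios) <= 1.0 + tol


print(splits_are_valid(0.8, 0.1, 0.1))  # prints True
print(splits_are_valid(0.8, 0.2, 0.1))  # prints False
```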

### Learning Rates
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3
- **Start with**: 2e-5

### Batch Sizes
- **GPU (as memory allows)**: 8, 16, 32, 64
- **CPU**: 4, 8, 16
- **Start with**: 16

### Text Lengths
- **BERT**: 512 (max)
- **GPT-2**: 1024 (max)
- **T5**: 512 (default; longer inputs are possible but use more memory)
- **Start with**: 256
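
A simple way to apply these numbers is to cap whatever `max_length` is requested at the value listed for the model family. The sketch below treats the listed values as caps; the lookup table and `effective_max_length` are hypothetical, not part of the tool.

```python
# Values copied from the list above; the short family names are illustrative keys.
MODEL_MAX_LENGTH = {"bert": 512, "gpt2": 1024, "t5": 512}


def effective_max_length(model_family: str, requested: int = 256) -> int:
    """Cap the requested max_length at the value listed for the model family."""
    return min(requested, MODEL_MAX_LENGTH[model_family])


print(effective_max_length("bert", 1024))  # prints 512
print(effective_max_length("gpt2"))        # prints 256
```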

## Common Issues & Fixes

| Issue | Cause | Fix |
|-------|-------|-----|
| "File not found" | Wrong path | Check `data_path` and `output_dir` |
| "Memory error" | Batch too large | Reduce `batch_size` |
| "Split error" | Ratios > 1.0 | Ensure splits sum to ≤ 1.0 |
| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
| "Slow processing" | Text too long | Reduce `max_length` |

## Environment Variables
```bash
# Set cache directory
export HF_HOME="./cache"

# Set output directory
export OUTPUT_DIR="./results"

# Set log level
export LOG_LEVEL="INFO"
```
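
On the Python side, these variables can be read with fallbacks. The sketch below is an assumption about how a script might consume them: `settings_from_env` is a hypothetical helper, and the fallback values simply mirror the exports above. `HF_HOME` is also read directly by the Hugging Face libraries.

```python
import logging
import os


def settings_from_env(env=os.environ) -> dict:
    """Collect cache, output, and logging settings from environment variables."""
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    if not isinstance(level, int):
        level = logging.INFO  # fall back if the name matched a non-level attribute
    return {
        "cache_dir": env.get("HF_HOME", "./cache"),
        "output_dir": env.get("OUTPUT_DIR", "./results"),
        "log_level": level,
    }


print(settings_from_env({"LOG_LEVEL": "debug"}))
# prints {'cache_dir': './cache', 'output_dir': './results', 'log_level': 10}
```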