
# Quick Reference Card

## Essential Parameters (Most Common)

### Data Source & Location
```yaml
data:
  source: "huggingface|custom"             # REQUIRED: Data source type
  dataset_name: "dataset/name"             # REQUIRED for huggingface
  data_path: "./path/to/file"              # REQUIRED for custom
  data_format: "jsonl|csv|json"            # REQUIRED for custom
```
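
The REQUIRED rules in the comments above can be checked before a run. The following is a minimal sketch, assuming the YAML has already been loaded into a plain dict; `validate_data_source` is a hypothetical helper for illustration, not a function the pipeline is known to provide.

```python
def validate_data_source(data_cfg: dict) -> list:
    """Return a list of problems found in the `data` section (empty if none)."""
    errors = []
    source = data_cfg.get("source")
    if source not in ("huggingface", "custom"):
        errors.append(f"source must be 'huggingface' or 'custom', got {source!r}")
    elif source == "huggingface":
        # dataset_name is only required for Hugging Face sources
        if not data_cfg.get("dataset_name"):
            errors.append("dataset_name is required when source is 'huggingface'")
    else:
        # custom sources need both a path and a format
        for key in ("data_path", "data_format"):
            if not data_cfg.get(key):
                errors.append(f"{key} is required when source is 'custom'")
        fmt = data_cfg.get("data_format")
        if fmt and fmt not in ("jsonl", "csv", "json"):
            errors.append(f"data_format must be jsonl, csv, or json, got {fmt!r}")
    return errors
```

An empty list means the section satisfies the rules listed above; anything else can be printed and the run aborted.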

### Field Mapping
```yaml
data:
  input_field: "text"                      # REQUIRED: Input text field
  label_field: "label"                     # REQUIRED for classification
  output_field: "styled_text"              # REQUIRED for styling
  instruction: "Style instruction"          # REQUIRED for styling
```

### Basic Processing
```yaml
data:
  max_samples: 1000                        # Limit total samples
  train_split: 0.8                         # Training ratio (0.0-1.0)
  validation_split: 0.1                    # Validation ratio (0.0-1.0)
  test_split: 0.1                          # Test ratio (0.0-1.0)
  output_dir: "./output/path"              # Output directory
```

### Text Preprocessing
```yaml
data:
  clean_text: true                         # Clean/normalize text
  lowercase: true                          # Convert to lowercase
  min_length: 10                           # Minimum text length
  max_length: 512                          # Maximum text length
```
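
To make the interaction of these four options concrete, here is a rough sketch of how they might combine. It rests on assumptions the card does not confirm: lengths are counted in characters, `clean_text` means collapsing runs of whitespace, texts shorter than `min_length` are dropped, and longer texts are truncated to `max_length`. The `preprocess` function is hypothetical.

```python
import re

def preprocess(text, clean_text=True, lowercase=True, min_length=10, max_length=512):
    """Return the processed text, or None if it is shorter than min_length."""
    if clean_text:
        # Assumed meaning of "clean": collapse whitespace runs and trim the ends
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if len(text) < min_length:
        return None  # assumed behavior: drop samples that are too short
    return text[:max_length]  # assumed behavior: truncate samples that are too long
```

Note that `lowercase: true` is usually redundant with uncased models such as `bert-base-uncased`, whose tokenizer already lowercases input.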

### Model & Training
```yaml
model:
  name: "bert-base-uncased"                # Model name
  max_length: 512                          # Max sequence length

training:
  num_epochs: 3                            # Training epochs
  batch_size: 16                           # Batch size
  learning_rate: 2e-5                      # Learning rate
```

## Common Configurations by Task

### Classification
```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "dair-ai/emotion"
  input_field: "text"
  label_field: "label"
  output_format: "classification"
```

### Styling
```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite in formal style"
  output_format: "alpaca"
```
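
With `output_format: "alpaca"`, each example would presumably be written as an instruction/input/output record, the layout used by the Stanford Alpaca dataset. The sketch below shows that mapping for the field names in the config above; `to_alpaca` and the sample row are illustrative assumptions, and the pipeline's actual output may differ.

```python
import json

def to_alpaca(record, input_field, output_field, instruction):
    """Map one raw record onto the Alpaca instruction/input/output layout."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }

# Made-up row, shaped like one line of ./data.jsonl
row = {"text": "gonna be late, sorry", "styled_text": "I apologize, but I will be arriving late."}
example = to_alpaca(row, "text", "styled_text", "Rewrite in formal style")
print(json.dumps(example))
```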

### Text Generation
```yaml
task:
  name: "completion"
  type: "text_generation"

data:
  source: "custom"
  data_path: "./prompts.jsonl"
  input_field: "prompt"
  output_field: "completion"
  output_format: "instruction"
```

## Quick Start Templates

### 1. Hugging Face Dataset
```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "your/dataset"
  input_field: "text"
  label_field: "label"
  max_samples: 1000
  output_dir: "./output"
```

### 2. Custom JSONL File
```yaml
task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./your_data.jsonl"
  data_format: "jsonl"
  input_field: "source"
  output_field: "target"
  instruction: "Your style instruction"
  output_dir: "./output"
```

### 3. CSV File
```yaml
task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "custom"
  data_path: "./your_data.csv"
  data_format: "csv"
  input_field: "text"
  label_field: "label"
  delimiter: ","
  output_dir: "./output"
```
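
For a quick check that `input_field`, `label_field`, and `delimiter` match the file, the CSV can be read with the standard library before running the pipeline. This is a sketch only; `read_csv_examples` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io

def read_csv_examples(handle, input_field="text", label_field="label", delimiter=","):
    """Read (text, label) pairs from an open CSV file that has a header row."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    # A KeyError here means a configured field name is not a column in the file
    return [(row[input_field], row[label_field]) for row in reader]

# In-memory stand-in for ./your_data.csv
sample = io.StringIO("text,label\nI love this,joy\nThis is awful,anger\n")
pairs = read_csv_examples(sample)
print(pairs)
```

For a real file, pass `open("./your_data.csv", newline="", encoding="utf-8")` as the handle.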

## Parameter Ranges & Recommendations

### Split Ratios
- **Total**: `train_split` + `validation_split` + `test_split` must be ≤ 1.0
- **Common**: train=0.8, val=0.1, test=0.1
- **Small datasets**: train=0.7, val=0.15, test=0.15
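
The ratio rules above translate into a small check. This is a minimal sketch, assuming split sizes are obtained by rounding down; `split_sizes` is a hypothetical helper, not the pipeline's own code.

```python
def split_sizes(n_samples, train=0.8, validation=0.1, test=0.1):
    """Validate the split ratios and return the number of samples per split."""
    ratios = {"train": train, "validation": validation, "test": test}
    for name, ratio in ratios.items():
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"{name} ratio must be in [0.0, 1.0], got {ratio}")
    total = sum(ratios.values())
    # Small tolerance so float rounding in the sum does not trigger a false error
    if total > 1.0 + 1e-9:
        raise ValueError(f"split ratios sum to {total:.3f}, which exceeds 1.0")
    # Same tolerance guards against products that land just below a whole number
    return {name: int(n_samples * ratio + 1e-9) for name, ratio in ratios.items()}

print(split_sizes(1000))
```

When the ratios sum to less than 1.0, the remaining samples are simply left unassigned in this sketch.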

### Learning Rates
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3
- **Start with**: 2e-5

### Batch Sizes
- **GPU** (as memory allows): 8, 16, 32, 64
- **CPU**: 4, 8, 16
- **Start with**: 16

### Text Lengths
- **BERT**: 512 (max)
- **GPT-2**: 1024 (max)
- **T5**: 512 (default; not a hard limit)
- **Start with**: 256

## Common Issues & Fixes

| Issue | Cause | Fix |
|-------|-------|-----|
| "File not found" | Wrong path | Check `data_path` and `output_dir` |
| "Memory error" | Batch too large | Reduce `batch_size` |
| "Split error" | Ratios sum to > 1.0 | Ensure splits sum to ≤ 1.0 |
| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
| "Slow processing" | Text too long | Reduce `max_length` |

## Environment Variables
```bash
# Set cache directory
export HF_HOME="./cache"

# Set output directory
export OUTPUT_DIR="./results"

# Set log level
export LOG_LEVEL="INFO"
```
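
On the Python side, these variables can be read with fallbacks. `HF_HOME` is honored by the Hugging Face libraries themselves and defaults to `~/.cache/huggingface`; whether the pipeline reads `OUTPUT_DIR` and `LOG_LEVEL`, and what defaults it uses, is an assumption in this sketch, as is the `read_env_settings` helper.

```python
import os

def read_env_settings(environ=os.environ):
    """Collect the three settings above, falling back to defaults when unset."""
    return {
        # Documented Hugging Face default cache location
        "cache_dir": environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface")),
        # Assumed default, matching the templates in this card
        "output_dir": environ.get("OUTPUT_DIR", "./output"),
        # Assumed default; upper-cased so "info" and "INFO" behave the same
        "log_level": environ.get("LOG_LEVEL", "INFO").upper(),
    }

print(read_env_settings())
```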