added style mimicking pipelines

2025-08-13 21:17:01 +01:00
parent fd54d4be39
commit 710d074b47
31 changed files with 3816 additions and 46 deletions
@@ -0,0 +1,191 @@
+# Quick Reference Card
+
+## Essential Parameters (Most Common)
+
+### Data Source & Location
+```yaml
+data:
+  source: "huggingface|custom"             # REQUIRED: Data source type
+  dataset_name: "dataset/name"             # REQUIRED for huggingface
+  data_path: "./path/to/file"              # REQUIRED for custom
+  data_format: "jsonl|csv|json"            # REQUIRED for custom
+```
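The conditional REQUIRED markers above can be checked before any data is loaded. The following is a minimal sketch, assuming the `data` section has already been parsed into a plain dict; `check_data_source` is a hypothetical helper name, not part of this project.

```python
# Hypothetical helper: validates only the data-source keys listed above.
ALLOWED_FORMATS = ("jsonl", "csv", "json")


def check_data_source(data):
    """Return a list of problems with the data-source keys (empty if none)."""
    source = data.get("source")
    if source == "huggingface":
        required = ["dataset_name"]
    elif source == "custom":
        required = ["data_path", "data_format"]
    else:
        return [f"source must be 'huggingface' or 'custom', got {source!r}"]

    problems = []
    for key in required:
        if not data.get(key):
            problems.append(f"{key} is required when source is {source!r}")

    # Only check the format value when one was actually supplied.
    fmt = data.get("data_format")
    if source == "custom" and fmt and fmt not in ALLOWED_FORMATS:
        problems.append(f"data_format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    return problems
```

Returning a list instead of raising on the first error lets a caller report every missing key in one pass.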
+
+### Field Mapping
+```yaml
+data:
+  input_field: "text"                      # REQUIRED: Input text field
+  label_field: "label"                     # REQUIRED for classification
+  output_field: "styled_text"              # REQUIRED for styling
+  instruction: "Style instruction"         # REQUIRED for styling
+```
+
+### Basic Processing
+```yaml
+data:
+  max_samples: 1000                        # Limit total samples
+  train_split: 0.8                         # Training ratio (0.0-1.0)
+  validation_split: 0.1                    # Validation ratio (0.0-1.0)
+  test_split: 0.1                          # Test ratio (0.0-1.0)
+  output_dir: "./output/path"              # Output directory
+```
+
+### Text Preprocessing
+```yaml
+data:
+  clean_text: true                         # Clean/normalize text
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 512                          # Maximum text length
+```
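How these four options might combine is sketched below. Several points are assumptions rather than documented behavior: `clean_text` is taken to mean collapsing whitespace, lengths are measured in characters, texts shorter than `min_length` are dropped, and longer texts are truncated to `max_length`. The function name `preprocess` is hypothetical.

```python
import re


def preprocess(texts, clean_text=True, lowercase=True, min_length=10, max_length=512):
    """Apply the text preprocessing options to a list of strings."""
    kept = []
    for text in texts:
        if clean_text:
            # Assumption: "cleaning" means collapsing runs of whitespace.
            text = re.sub(r"\s+", " ", text).strip()
        if lowercase:
            text = text.lower()
        if len(text) < min_length:
            # Assumption: too-short texts are dropped, not padded.
            continue
        # Assumption: too-long texts are truncated by character count.
        kept.append(text[:max_length])
    return kept
```

Cleaning runs before the length filter so that whitespace padding does not let a near-empty string pass the `min_length` check.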
+
+### Model & Training
+```yaml
+model:
+  name: "bert-base-uncased"                # Model name
+  max_length: 512                          # Max sequence length
+
+training:
+  num_epochs: 3                            # Training epochs
+  batch_size: 16                           # Batch size
+  learning_rate: 2e-5                      # Learning rate
+```
+
+## Common Configurations by Task
+
+### Classification
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "dair-ai/emotion"
+  input_field: "text"
+  label_field: "label"
+  output_format: "classification"
+```
+
+### Styling
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./data.jsonl"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite in formal style"
+  output_format: "alpaca"
+```
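For orientation, an Alpaca-style record carries three keys: `instruction`, `input`, and `output`. A minimal sketch of how the field mapping above could produce one follows, assuming the pipeline copies `input_field` and `output_field` straight across; `to_alpaca` is a hypothetical name and the real conversion may differ.

```python
def to_alpaca(record, input_field="text", output_field="styled_text",
              instruction="Rewrite in formal style"):
    """Map one source record to an Alpaca-style instruction record."""
    return {
        "instruction": instruction,      # same instruction for every record
        "input": record[input_field],    # original text
        "output": record[output_field],  # styled target text
    }


example = to_alpaca({"text": "hey, got a sec?", "styled_text": "Do you have a moment?"})
```

A `KeyError` here would indicate that `input_field` or `output_field` does not match the column names in the data file.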
+
+### Text Generation
+```yaml
+task:
+  name: "completion"
+  type: "text_generation"
+
+data:
+  source: "custom"
+  data_path: "./prompts.jsonl"
+  input_field: "prompt"
+  output_field: "completion"
+  output_format: "instruction"
+```
+
+## Quick Start Templates
+
+### 1. HuggingFace Dataset
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "your/dataset"
+  input_field: "text"
+  label_field: "label"
+  max_samples: 1000
+  output_dir: "./output"
+```
+
+### 2. Custom JSONL File
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./your_data.jsonl"
+  data_format: "jsonl"
+  input_field: "source"
+  output_field: "target"
+  instruction: "Your style instruction"
+  output_dir: "./output"
+```
+
+### 3. CSV File
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "custom"
+  data_path: "./your_data.csv"
+  data_format: "csv"
+  input_field: "text"
+  label_field: "label"
+  delimiter: ","
+  output_dir: "./output"
+```
+
+## Parameter Ranges & Recommendations
+
+### Split Ratios
+- **Total must be ≤ 1.0**: `train_split` + `validation_split` + `test_split`
+- **Common**: train=0.8, validation=0.1, test=0.1
+- **Small datasets**: train=0.7, validation=0.15, test=0.15
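The split constraint can be turned into concrete sample counts as sketched below. This assumes each count is floored and that any remainder is simply left unassigned; `split_sizes` is a hypothetical helper, and the pipeline's own rounding may differ.

```python
def split_sizes(n_samples, train_split=0.8, validation_split=0.1, test_split=0.1):
    """Return (train, validation, test) sample counts for the given ratios."""
    ratios = (train_split, validation_split, test_split)
    if any(r < 0.0 or r > 1.0 for r in ratios):
        raise ValueError("each split ratio must be within 0.0-1.0")
    # Small tolerance so that 0.7 + 0.15 + 0.15 is not rejected by float error.
    if sum(ratios) > 1.0 + 1e-9:
        raise ValueError("split ratios must sum to at most 1.0")
    # Floor each count (with a tiny epsilon against float error) so the
    # three parts never add up to more than n_samples.
    return tuple(int(n_samples * r + 1e-9) for r in ratios)
```

With `max_samples: 1000` and the common ratios, this yields 800 training, 100 validation, and 100 test samples.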
+
+### Learning Rates
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+- **Start with**: 2e-5
+
+### Batch Sizes
+- **GPU** (depends on available memory): 8, 16, 32, 64
+- **CPU**: 4, 8, 16
+- **Start with**: 16
+
+### Text Lengths
+- **BERT**: 512 (max)
+- **GPT-2**: 1024 (max)
+- **T5**: 512 (pretraining length; relative position embeddings allow longer inputs at higher memory cost)
+- **Start with**: 256
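One way to apply these limits is to cap a requested `max_length` at the figure listed for the model family. The sketch below assumes the family can be recognized from the start of the model name, which is a simplification; `KNOWN_LIMITS` and `resolve_max_length` are hypothetical names.

```python
# Limits copied from the list above, keyed by a model-name prefix (assumption).
KNOWN_LIMITS = {"bert": 512, "gpt2": 1024, "t5": 512}


def resolve_max_length(model_name, requested=256):
    """Cap the requested max_length at the listed limit for the model family."""
    name = model_name.lower()
    for prefix, limit in KNOWN_LIMITS.items():
        if name.startswith(prefix):
            return min(requested, limit)
    # Unknown family: keep the requested value unchanged.
    return requested
```

For `bert-base-uncased`, a requested length of 1024 would be reduced to 512, while the suggested starting value of 256 passes through unchanged.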
+
+## Common Issues & Fixes
+
+| Issue | Cause | Fix |
+|-------|-------|-----|
+| "File not found" | Wrong path | Check `data_path` and `output_dir` |
+| "Memory error" | Batch too large | Reduce `batch_size` |
+| "Split error" | Ratios sum to > 1.0 | Ensure splits sum to ≤ 1.0 |
+| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
+| "Slow processing" | Text too long | Reduce `max_length` |
+
+## Environment Variables
+```bash
+# Set Hugging Face home directory (the cache lives here)
+export HF_HOME="./cache"
+
+# Set output directory
+export OUTPUT_DIR="./results"
+
+# Set log level
+export LOG_LEVEL="INFO"
+```
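On the Python side, these variables could be read as sketched below. `HF_HOME` is consumed by the Hugging Face libraries themselves; treating `OUTPUT_DIR` and `LOG_LEVEL` as project-specific settings, and the fallback values used here, are assumptions. `read_env_settings` is a hypothetical name.

```python
import logging
import os

# Level names accepted for LOG_LEVEL (assumption: standard logging names).
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def read_env_settings(env=None):
    """Collect the three settings from the environment, with fallbacks."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    return {
        "hf_home": env.get("HF_HOME"),                   # None if unset
        "output_dir": env.get("OUTPUT_DIR", "./output"),  # assumed default
        "log_level": LEVELS.get(level_name, logging.INFO),
    }
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real process environment.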