owusu/DS-LLM-TEMPLATE-FINETUNING

OwusuBlessing 710d074b47 added style mimicking piepelines

2025-08-13 21:17:01 +01:00

Quick Reference Card

Essential Parameters (Most Common)

Data Source & Location

data:
  source: "huggingface|custom"             # REQUIRED: Data source type
  dataset_name: "dataset/name"             # REQUIRED for huggingface
  data_path: "./path/to/file"              # REQUIRED for custom
  data_format: "jsonl|csv|json"            # REQUIRED for custom
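The REQUIRED markers above imply a simple rule: `dataset_name` is needed for `huggingface`, while `data_path` and `data_format` are needed for `custom`. A minimal sketch of that check follows. The function name `check_data_source` is hypothetical and not part of the template; it assumes the `data:` section has already been parsed into a plain dict.

```python
def check_data_source(data: dict) -> list:
    """Return a list of problems with the data-source keys (empty if none)."""
    problems = []
    source = data.get("source")
    if source == "huggingface":
        if not data.get("dataset_name"):
            problems.append("dataset_name is required when source is 'huggingface'")
    elif source == "custom":
        for key in ("data_path", "data_format"):
            if not data.get(key):
                problems.append(f"{key} is required when source is 'custom'")
        fmt = data.get("data_format")
        if fmt and fmt not in ("jsonl", "csv", "json"):
            problems.append(f"unsupported data_format: {fmt!r}")
    else:
        problems.append("source must be 'huggingface' or 'custom'")
    return problems


print(check_data_source({"source": "custom", "data_path": "./data.jsonl"}))
```

Returning a list instead of raising on the first error lets a caller report every missing key at once.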

Field Mapping

data:
  input_field: "text"                      # REQUIRED: Input text field
  label_field: "label"                     # REQUIRED for classification
  output_field: "styled_text"              # REQUIRED for styling
  instruction: "Style instruction"         # REQUIRED for styling

Basic Processing

data:
  max_samples: 1000                        # Limit total samples
  train_split: 0.8                         # Training ratio (0.0-1.0)
  validation_split: 0.1                    # Validation ratio (0.0-1.0)
  test_split: 0.1                          # Test ratio (0.0-1.0)
  output_dir: "./output/path"              # Output directory

Text Preprocessing

data:
  clean_text: true                         # Clean/normalize text
  lowercase: true                          # Convert to lowercase
  min_length: 10                           # Minimum text length
  max_length: 512                          # Maximum text length
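To make the four preprocessing options concrete, here is a rough sketch of how they might combine. It rests on assumptions the card does not state: lengths are counted in characters, `clean_text` means collapsing whitespace, texts shorter than `min_length` are dropped, and longer texts are truncated to `max_length`. The function `preprocess` is hypothetical, and the template's real behavior may differ.

```python
import re


def preprocess(texts, clean_text=True, lowercase=True, min_length=10, max_length=512):
    """Apply the preprocessing options to a list of strings (illustrative only)."""
    kept = []
    for text in texts:
        if clean_text:
            # Assumption: "cleaning" collapses runs of whitespace and trims the ends.
            text = re.sub(r"\s+", " ", text).strip()
        if lowercase:
            text = text.lower()
        if len(text) < min_length:
            continue  # assumption: too-short texts are dropped
        kept.append(text[:max_length])  # assumption: too-long texts are truncated
    return kept


print(preprocess(["  Hello   World, this is FINE  ", "short"]))
```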

Model & Training

model:
  name: "bert-base-uncased"                # Model name
  max_length: 512                          # Max sequence length

training:
  num_epochs: 3                            # Training epochs
  batch_size: 16                           # Batch size
  learning_rate: 2e-5                      # Learning rate

Common Configurations by Task

Classification

task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "dair-ai/emotion"
  input_field: "text"
  label_field: "label"
  output_format: "classification"

Styling

task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./data.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite in formal style"
  output_format: "alpaca"
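For orientation, the widely used Alpaca record layout has three keys: `instruction`, `input`, and `output`. The sketch below shows how the styling fields above could map onto it. It assumes the template's `alpaca` output follows that common layout, which the card does not confirm, and `to_alpaca` is a hypothetical helper name.

```python
import json


def to_alpaca(row, input_field="text", output_field="styled_text",
              instruction="Rewrite in formal style"):
    """Map one source row onto the common Alpaca instruction/input/output keys."""
    return {
        "instruction": instruction,
        "input": row[input_field],
        "output": row[output_field],
    }


row = {"text": "hey, got a sec?", "styled_text": "Hello, do you have a moment?"}
print(json.dumps(to_alpaca(row)))
```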

Text Generation

task:
  name: "completion"
  type: "text_generation"

data:
  source: "custom"
  data_path: "./prompts.jsonl"
  input_field: "prompt"
  output_field: "completion"
  output_format: "instruction"

Quick Start Templates

1. HuggingFace Dataset

task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "huggingface"
  dataset_name: "your/dataset"
  input_field: "text"
  label_field: "label"
  max_samples: 1000
  output_dir: "./output"

2. Custom JSONL File

task:
  name: "styling"
  type: "style_transfer"

data:
  source: "custom"
  data_path: "./your_data.jsonl"
  data_format: "jsonl"
  input_field: "source"
  output_field: "target"
  instruction: "Your style instruction"
  output_dir: "./output"

3. CSV File

task:
  name: "classification"
  type: "sequence_classification"

data:
  source: "custom"
  data_path: "./your_data.csv"
  data_format: "csv"
  input_field: "text"
  label_field: "label"
  delimiter: ","
  output_dir: "./output"
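Templates 2 and 3 differ mainly in `data_format` and, for CSV, `delimiter`. The following sketch shows how a loader could honor those keys using only the standard library. `load_rows` and `map_fields` are hypothetical names, and the `json` branch assumes the file holds a top-level list of objects.

```python
import csv
import json


def load_rows(data_path, data_format, delimiter=","):
    """Load a custom data file into a list of dicts, one per example."""
    if data_format not in ("csv", "jsonl", "json"):
        raise ValueError(f"unsupported data_format: {data_format!r}")
    with open(data_path, newline="", encoding="utf-8") as f:
        if data_format == "csv":
            # The header row supplies the field names used by input_field etc.
            return list(csv.DictReader(f, delimiter=delimiter))
        if data_format == "jsonl":
            # One JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)  # assumption: a top-level list of objects


def map_fields(rows, input_field, target_field):
    """Pick the configured input and label/output fields from each row."""
    return [(row[input_field], row[target_field]) for row in rows]
```

A missing column surfaces as a `KeyError` in `map_fields`, which usually points to a mismatch between `input_field` or `label_field` and the file's actual header.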

Parameter Ranges & Recommendations

Split Ratios

Total must be ≤ 1.0
Common: train=0.8, val=0.1, test=0.1
Small datasets: train=0.7, val=0.15, test=0.15
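The rule that the ratios must total at most 1.0 can be checked before any data is touched. This is a minimal sketch, not the template's own code: `split_counts` is a hypothetical helper, and flooring each count (so leftover samples stay unused) is an assumption.

```python
import math


def split_counts(n_samples, train_split=0.8, validation_split=0.1, test_split=0.1):
    """Validate the split ratios and return (train, validation, test) sample counts."""
    ratios = {
        "train_split": train_split,
        "validation_split": validation_split,
        "test_split": test_split,
    }
    for name, ratio in ratios.items():
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {ratio}")
    total = sum(ratios.values())
    # Small tolerance so that 0.8 + 0.1 + 0.1 is not rejected by float rounding.
    if total > 1.0 + 1e-9:
        raise ValueError(f"split ratios sum to {total}, which is greater than 1.0")
    # Floor each count; the epsilon guards against products like 29.999999999999996.
    return tuple(math.floor(n_samples * r + 1e-9) for r in ratios.values())


print(split_counts(1000))
```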

Learning Rates

Fine-tuning: 1e-5 to 5e-5
Training from scratch: 1e-4 to 1e-3
Start with: 2e-5

Batch Sizes

GPU: 8, 16, 32, 64 (as memory allows)
CPU: 4, 8, 16
Start with: 16

Text Lengths

BERT: 512 tokens (max)
GPT-2: 1024 tokens (max)
T5: 512 tokens (max)
Start with: 256 tokens
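One way to apply these limits is to clamp the configured `max_length` to the cap for the model family. The sketch below uses only the three values listed above. Matching on a substring of the model name is a simplification chosen for illustration, and `effective_max_length` is a hypothetical helper, not part of the template.

```python
# Limits copied from the list above; other model families are left unclamped.
MODEL_LIMITS = {"bert": 512, "gpt2": 1024, "t5": 512}


def effective_max_length(model_name, requested):
    """Clamp a requested max_length to the known limit for the model family."""
    key = model_name.lower().replace("-", "")
    for family, limit in MODEL_LIMITS.items():
        if family in key:
            return min(requested, limit)
    return requested  # unknown family: keep the requested value


print(effective_max_length("bert-base-uncased", 1024))
```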

Common Issues & Fixes

| Issue | Cause | Fix |
|-------|-------|-----|
| "File not found" | Wrong path | Check `data_path` and `output_dir` |
| "Memory error" | Batch too large | Reduce `batch_size` |
| "Split error" | Split ratios sum to more than 1.0 | Ensure splits sum to ≤ 1.0 |
| "Poor performance" | Wrong learning rate | Try the 1e-5 to 5e-5 range |
| "Slow processing" | Text too long | Reduce `max_length` |

Environment Variables

# Set cache directory
export HF_HOME="./cache"

# Set output directory
export OUTPUT_DIR="./results"

# Set log level
export LOG_LEVEL="INFO"
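On the Python side, these variables can be read with fallbacks when they are unset. `HF_HOME` is the cache location recognized by Hugging Face libraries. How the template itself consumes `OUTPUT_DIR` and `LOG_LEVEL` is not shown here, so the defaults and the `read_env` helper below are assumptions.

```python
import logging
import os

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def read_env(environ=None):
    """Read the three variables from a mapping (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    return {
        # Assumed fallback: the usual Hugging Face cache location.
        "cache_dir": env.get("HF_HOME", os.path.expanduser("~/.cache/huggingface")),
        # Assumed fallback: the value used in the export example.
        "output_dir": env.get("OUTPUT_DIR", "./results"),
        # Unknown level names fall back to INFO.
        "log_level": LEVELS.get(level_name, logging.INFO),
    }


print(read_env({"HF_HOME": "./cache", "LOG_LEVEL": "debug"}))
```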
