Configuration Files Documentation

This directory contains the YAML configuration files for each supported machine learning task. Every file follows the section layout described below, and each parameter is documented with an inline comment.

Configuration Structure

All configuration files share the same five top-level sections:

1. Task Configuration

task:
  name: "task_type"                        # Task type: classification, completion, styling, matching
  type: "specific_type"                    # Specific model/task type

Available Task Types:

  • classification: Text classification tasks (emotion, sentiment, topic, etc.)
  • completion: Text generation and completion tasks
  • styling: Style transfer and text transformation tasks
  • matching: Semantic matching and similarity tasks

2. Data Processing Configuration

data:
  # Data Source
  source: "huggingface|custom"             # Where to get data from
  
  # Data Location
  dataset_name: "dataset/name"             # HuggingFace dataset name (for huggingface source)
  data_path: "./path/to/file"              # Path to custom data file (for custom source)
  data_format: "jsonl|csv|json"            # File format for custom data
  
  # Field Mapping
  input_field: "text"                      # Field containing input text
  output_field: "styled_text"              # Field containing output (for styling)
  label_field: "label"                     # Field containing labels (for classification)
  id_field: "id"                           # Optional ID field for tracking
  
  # Processing Parameters
  max_samples: 1000                        # Maximum samples to process
  train_split: 0.8                         # Training split ratio
  validation_split: 0.1                    # Validation split ratio
  test_split: 0.1                          # Test split ratio
  
  # Text Preprocessing
  clean_text: true                         # Clean and normalize text
  remove_special_chars: false              # Remove special characters
  lowercase: true                          # Convert to lowercase
  min_length: 10                           # Minimum text length
  max_length: 1000                         # Maximum text length
  
  # Output Configuration
  output_format: "format_type"             # Output format
  output_dir: "./output/path"              # Output directory
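
The text preprocessing options above can be read as a small filter applied to each input string. The sketch below is one possible interpretation, not the pipeline's actual code: the function name `preprocess` is hypothetical, and it assumes that `clean_text` means collapsing whitespace, that `remove_special_chars` means dropping punctuation, and that `min_length` and `max_length` are counted in characters.

```python
import re

def preprocess(text, cfg):
    """Apply the preprocessing flags; return None when the text is filtered out."""
    if cfg.get("clean_text", False):
        # Assumed meaning of clean_text: collapse whitespace runs and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if cfg.get("remove_special_chars", False):
        # Assumed meaning: keep only word characters and whitespace.
        text = re.sub(r"[^\w\s]", "", text)
    if cfg.get("lowercase", False):
        text = text.lower()
    # Assumption: min_length and max_length are measured in characters.
    if not cfg.get("min_length", 0) <= len(text) <= cfg.get("max_length", float("inf")):
        return None
    return text

data_cfg = {"clean_text": True, "remove_special_chars": False,
            "lowercase": True, "min_length": 10, "max_length": 1000}
print(preprocess("  Hello,   WORLD of configs!  ", data_cfg))
```

Texts shorter than `min_length` or longer than `max_length` come back as `None` in this sketch, so the caller can drop them before splitting.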

Data Source Types:

  • huggingface: Use datasets from HuggingFace Hub
  • custom: Use local files (JSONL, CSV, JSON)
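
Because each source type relies on different keys, a quick consistency check before running the processor can catch missing values early. The helper below is a minimal sketch under the assumption that `huggingface` requires `dataset_name` and `custom` requires `data_path` and `data_format`; the name `check_data_source` is hypothetical and not part of the repository.

```python
def check_data_source(data_cfg):
    """Return a list of problems with the data-source keys (empty if none)."""
    source = data_cfg.get("source")
    if source == "huggingface":
        required = ["dataset_name"]
    elif source == "custom":
        required = ["data_path", "data_format"]
    else:
        return [f"unknown source: {source!r}"]
    problems = [f"missing key for {source} source: {key}"
                for key in required if not data_cfg.get(key)]
    # Only the three file formats listed in this document are accepted here.
    if source == "custom" and data_cfg.get("data_format") not in (None, "jsonl", "csv", "json"):
        problems.append(f"unsupported data_format: {data_cfg['data_format']!r}")
    return problems

print(check_data_source({"source": "custom", "data_path": "./path/to/file"}))
```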

Output Formats:

  • classification: Raw classification format
  • instruction: Instruction-following format
  • conversation: Conversational format
  • qa: Question-answer format
  • styling: Raw styling format
  • alpaca: Alpaca instruction format

3. Model Configuration

model:
  name: "model_name"                       # Model from HuggingFace Hub
  max_length: 512                          # Maximum sequence length
  num_labels: 6                            # Number of labels (for classification)

Recommended Models by Task:

  • Classification: bert-base-uncased, distilbert-base-uncased
  • Styling: t5-base, gpt2-medium
  • Completion: gpt2-medium, gpt2-large
  • Matching: sentence-transformers/all-MiniLM-L6-v2

4. Training Configuration

training:
  num_epochs: 3                            # Number of training epochs
  batch_size: 16                           # Training batch size
  learning_rate: 2e-5                      # Learning rate
  weight_decay: 0.01                       # Weight decay
  lr_scheduler_type: "linear"              # Learning rate scheduler
  warmup_ratio: 0.1                        # Warmup ratio
  data_dir: "./data/path"                  # Training data directory
  output_dir: "./model/output"             # Model output directory

Learning Rate Guidelines:

  • Fine-tuning: 1e-5 to 5e-5
  • Training from scratch: 1e-4 to 1e-3

Scheduler Types:

  • linear: Linear decay
  • cosine: Cosine annealing
  • polynomial: Polynomial decay

5. Inference Configuration

inference:
  model_path: "./model/path"               # Path to saved model
  device: "auto"                           # Device to use
  batch_size: 32                           # Inference batch size
  return_probabilities: true               # Return probabilities
  return_top_k: 3                          # Return top K predictions
  max_new_tokens: 128                      # Max tokens to generate
  temperature: 0.8                         # Sampling temperature

Device Options:

  • auto: Automatically detect best device
  • cuda: Use GPU if available
  • cpu: Force CPU usage
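
The device options can be summarized as a small resolution rule. The sketch below is an assumption about how such a rule might look, including the fallback to CPU when `cuda` is requested but no GPU is present; `resolve_device` is a hypothetical name. In a PyTorch setup, the `cuda_available` flag would typically come from `torch.cuda.is_available()`.

```python
def resolve_device(requested, cuda_available):
    """Turn the configured device option into a concrete device name."""
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda":
        # Assumed fallback, following "Use GPU if available".
        return "cuda" if cuda_available else "cpu"
    if requested == "cpu":
        return "cpu"
    raise ValueError(f"unknown device option: {requested!r}")

print(resolve_device("auto", cuda_available=False))
```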

Temperature Guidelines:

  • 0.0: Deterministic (the most likely token is always chosen, so the output is always the same)
  • 0.7-0.9: Balanced creativity
  • 1.0 and above: More random and creative
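
The effect of temperature is easiest to see on a toy distribution: the logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The sketch below is a generic illustration rather than the pipeline's sampling code; treating a temperature of 0 as "pick the largest logit" is an assumption made here to avoid division by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy selection.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.0, 0.8, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```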

Task-Specific Parameters

Classification Tasks

data:
  label_encoding: "auto|numeric|string"    # How to encode labels
  multilabel: false                        # Multi-label vs single-label
  label_separator: ","                     # Separator for multi-label

Styling Tasks

data:
  instruction: "Style instruction text"    # The style instruction

Completion Tasks

data:
  prompt_template: "template"              # Prompt template
  completion_length: 100                   # Target completion length
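
The document does not specify the placeholder syntax of `prompt_template`. As a purely illustrative sketch, the snippet below assumes Python `str.format` placeholders named after the input field (for example `{text}`); the function `build_prompt` is hypothetical.

```python
def build_prompt(record, prompt_template, input_field="text"):
    """Fill the template with the record's input text (assumes a {text}-style placeholder)."""
    return prompt_template.format(**{input_field: record[input_field]})

print(build_prompt({"text": "Once upon a time"}, "Continue the story: {text}"))
```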

Advanced Configuration

HuggingFace Specific

data:
  hf_split: "train"                        # Dataset split to use
  hf_cache_dir: "./cache"                  # Cache directory
  test_split_from: "train"                 # Source for test split
  val_split_from: "train"                  # Source for validation split

Custom Data Specific

data:
  encoding: "utf-8"                        # File encoding
  delimiter: ","                           # CSV delimiter

Usage Examples

Basic Usage

# Use YAML configuration
python scripts/task_type/data_processor.py --config configs/task_type/config.yaml

# Override specific parameters
python scripts/task_type/data_processor.py \
  --config configs/task_type/config.yaml \
  --max-samples 1000 \
  --learning-rate 3e-5
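
Command-line overrides such as the ones above are usually merged on top of the values loaded from the YAML file. The sketch below shows one way to do that with `argparse`; it is not the scripts' actual implementation. The function `apply_overrides` is hypothetical, and mapping `--max-samples` to `data.max_samples` and `--learning-rate` to `training.learning_rate` is an assumption based on the parameter names in this document.

```python
import argparse

def apply_overrides(config, argv):
    """Return a copy of config with any command-line overrides applied."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    args = parser.parse_args(argv)

    # Copy each section so the loaded config is left untouched.
    merged = {section: dict(values) for section, values in config.items()}
    if args.max_samples is not None:
        merged["data"]["max_samples"] = args.max_samples
    if args.learning_rate is not None:
        merged["training"]["learning_rate"] = args.learning_rate
    return merged

config = {"data": {"max_samples": 1000}, "training": {"learning_rate": 2e-5}}
print(apply_overrides(config, ["--learning-rate", "3e-5"]))
```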

Creating Custom Configurations

  1. Copy an existing config file
  2. Modify parameters for your specific use case
  3. Update paths and model names
  4. Test with a small dataset first

Best Practices

  1. Start with Defaults: Use default values and adjust based on results
  2. Validate Paths: Ensure all file paths are correct and accessible
  3. Monitor Resources: Adjust batch sizes based on available GPU memory
  4. Test Incrementally: Test with small datasets before full processing
  5. Version Control: Keep configurations in version control for reproducibility

Troubleshooting

Common Issues:

  • File Not Found: Check data_path and output_dir paths
  • Memory Errors: Reduce batch_size or max_length
  • Poor Performance: Adjust learning_rate or num_epochs
  • Split Errors: Ensure train_split, validation_split, and test_split sum to at most 1.0
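
Split errors in particular are easy to catch before a run. The sketch below validates the three ratios and shows how many samples each split would receive; `split_counts` is a hypothetical helper, and rounding each count to the nearest integer is an assumption about how the pipeline sizes its splits.

```python
def split_counts(n_samples, train_split=0.8, validation_split=0.1, test_split=0.1):
    """Validate the split ratios and return (train, validation, test) sample counts."""
    total = train_split + validation_split + test_split
    if total > 1.0 + 1e-9:  # small tolerance for floating-point rounding
        raise ValueError(f"split ratios sum to {total:.3f}, which exceeds 1.0")
    return (round(n_samples * train_split),
            round(n_samples * validation_split),
            round(n_samples * test_split))

print(split_counts(1000))
```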

Getting Help:

  • Check a script's built-in help, for example: python scripts/task_type/data_processor.py --help
  • Review the pipeline logs for detailed error messages
  • Verify YAML syntax and parameter values