configs/README.md

# Configuration Files Documentation

This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.

## Configuration Structure

All configuration files follow a consistent structure organized into these main sections:

### 1. Task Configuration
```yaml
task:
  name: "task_type"                        # Task type: classification, completion, styling, matching
  type: "specific_type"                    # Specific model/task type
```

**Available Task Types:**
- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
- **completion**: Text generation and completion tasks
- **styling**: Style transfer and text transformation tasks
- **matching**: Semantic matching and similarity tasks

### 2. Data Processing Configuration
```yaml
data:
  # Data Source
  source: "huggingface|custom"             # Where to get data from
  
  # Data Location
  dataset_name: "dataset/name"             # HuggingFace dataset name (for huggingface source)
  data_path: "./path/to/file"              # Path to custom data file (for custom source)
  data_format: "jsonl|csv|json"            # File format for custom data
  
  # Field Mapping
  input_field: "text"                      # Field containing input text
  output_field: "styled_text"              # Field containing output (for styling)
  label_field: "label"                     # Field containing labels (for classification)
  id_field: "id"                           # Optional ID field for tracking
  
  # Processing Parameters
  max_samples: 1000                        # Maximum samples to process
  train_split: 0.8                         # Training split ratio
  validation_split: 0.1                    # Validation split ratio
  test_split: 0.1                          # Test split ratio
  
  # Text Preprocessing
  clean_text: true                         # Clean and normalize text
  remove_special_chars: false              # Remove special characters
  lowercase: true                          # Convert to lowercase
  min_length: 10                           # Minimum text length
  max_length: 1000                         # Maximum text length
  
  # Output Configuration
  output_format: "format_type"             # Output format
  output_dir: "./output/path"              # Output directory
```
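
The text preprocessing flags can be read as a small filter applied to each sample. The sketch below is one plausible interpretation, not the pipeline's actual code: `preprocess_text` is a hypothetical helper, "clean" is assumed to mean whitespace normalization, and `min_length`/`max_length` are assumed to count characters rather than tokens.

```python
import re
from typing import Optional


def preprocess_text(text: str, data_cfg: dict) -> Optional[str]:
    """Apply the preprocessing flags from the `data` section to one text.

    Returns the processed text, or None when it falls outside the length
    bounds. Hypothetical helper; the real pipeline may behave differently.
    """
    if data_cfg.get("clean_text", True):
        # Assumed meaning of "clean": collapse whitespace runs and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if data_cfg.get("remove_special_chars", False):
        # Assumed meaning: drop everything except word characters and whitespace.
        text = re.sub(r"[^\w\s]", "", text)
    if data_cfg.get("lowercase", True):
        text = text.lower()

    # Assumption: lengths are measured in characters.
    min_length = data_cfg.get("min_length", 0)
    max_length = data_cfg.get("max_length", float("inf"))
    if not (min_length <= len(text) <= max_length):
        return None
    return text
```

Samples for which the function returns `None` would simply be skipped before splitting.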

**Data Source Types:**
- **huggingface**: Use datasets from HuggingFace Hub
- **custom**: Use local files (JSONL, CSV, JSON)
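
Because each source type needs different location fields, a quick check before processing can catch incomplete configs early. This is a minimal sketch under the assumption that `huggingface` requires `dataset_name` and `custom` requires both `data_path` and `data_format`; `check_data_source` is a hypothetical name, not part of the scripts.

```python
def check_data_source(data_cfg: dict) -> None:
    """Raise ValueError if the `data` section lacks the fields its source needs."""
    source = data_cfg.get("source")
    if source == "huggingface":
        required = ["dataset_name"]
    elif source == "custom":
        required = ["data_path", "data_format"]
    else:
        raise ValueError(f"Unknown data source: {source!r}")

    missing = [key for key in required if not data_cfg.get(key)]
    if missing:
        raise ValueError(f"source={source!r} requires: {', '.join(missing)}")

    # The formats listed above for custom files.
    if source == "custom" and data_cfg["data_format"] not in {"jsonl", "csv", "json"}:
        raise ValueError(f"Unsupported data_format: {data_cfg['data_format']!r}")
```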

**Output Formats:**
- **classification**: Raw classification format
- **instruction**: Instruction-following format
- **conversation**: Conversational format
- **qa**: Question-answer format
- **styling**: Raw styling format
- **alpaca**: Alpaca instruction format
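
As an illustration of the `alpaca` format, the sketch below turns one styling sample into a record with the `instruction`, `input`, and `output` keys used by Alpaca-style datasets. How the pipeline actually maps fields is not documented here, so the mapping (the `instruction` parameter plus `input_field` and `output_field`) and the helper name `to_alpaca` are assumptions.

```python
import json


def to_alpaca(record: dict, data_cfg: dict) -> dict:
    """Map one raw sample to an Alpaca-style record (assumed field mapping)."""
    return {
        "instruction": data_cfg["instruction"],
        "input": record[data_cfg.get("input_field", "text")],
        "output": record[data_cfg.get("output_field", "styled_text")],
    }


if __name__ == "__main__":
    data_cfg = {
        "instruction": "Rewrite the text in a formal tone.",
        "input_field": "text",
        "output_field": "styled_text",
    }
    sample = {"text": "hey, what's up?", "styled_text": "Hello, how are you?"}
    # One JSON object per line matches the JSONL layout mentioned above.
    print(json.dumps(to_alpaca(sample, data_cfg)))
```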

### 3. Model Configuration
```yaml
model:
  name: "model_name"                       # Model from HuggingFace Hub
  max_length: 512                          # Maximum sequence length
  num_labels: 6                            # Number of labels (for classification)
```

**Recommended Models by Task:**
- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
- **Styling**: `t5-base`, `gpt2-medium`
- **Completion**: `gpt2-medium`, `gpt2-large`
- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`

### 4. Training Configuration
```yaml
training:
  num_epochs: 3                            # Number of training epochs
  batch_size: 16                           # Training batch size
  learning_rate: 2e-5                      # Learning rate
  weight_decay: 0.01                       # Weight decay
  lr_scheduler_type: "linear"              # Learning rate scheduler
  warmup_ratio: 0.1                        # Warmup ratio
  data_dir: "./data/path"                  # Training data directory
  output_dir: "./model/output"             # Model output directory
```

**Learning Rate Guidelines:**
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3

**Scheduler Types:**
- **linear**: Linear decay
- **cosine**: Cosine annealing
- **polynomial**: Polynomial decay
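
To make `warmup_ratio` and `lr_scheduler_type` concrete, here is a minimal sketch of how the learning rate could evolve over training: a linear ramp from 0 during warmup, then linear or cosine decay to 0. This mirrors the common warmup-plus-decay pattern; it is not taken from the training script, `lr_at_step` is a hypothetical helper, and the polynomial scheduler is left out for brevity.

```python
import math


def lr_at_step(step: int, total_steps: int, training_cfg: dict) -> float:
    """Learning rate at `step` under linear warmup followed by decay."""
    # float() guards against the value arriving as a string (see note below).
    base_lr = float(training_cfg.get("learning_rate", 2e-5))
    warmup_steps = int(total_steps * training_cfg.get("warmup_ratio", 0.1))
    scheduler = training_cfg.get("lr_scheduler_type", "linear")

    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the base learning rate.
        return base_lr * step / max(1, warmup_steps)

    # Fraction of the post-warmup phase that has elapsed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if scheduler == "linear":
        return base_lr * max(0.0, 1.0 - progress)
    if scheduler == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"Scheduler not covered by this sketch: {scheduler!r}")
```

If the configs are loaded with PyYAML, note that a value written as `2e-5` (no decimal point) is read as a string rather than a float, which is why the sketch coerces it with `float(...)`; writing `2.0e-5` in the YAML avoids the issue.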

### 5. Inference Configuration
```yaml
inference:
  model_path: "./model/path"               # Path to saved model
  device: "auto"                           # Device to use
  batch_size: 32                           # Inference batch size
  return_probabilities: true               # Return probabilities
  return_top_k: 3                          # Return top K predictions
  max_new_tokens: 128                      # Max tokens to generate
  temperature: 0.8                         # Sampling temperature
```

**Device Options:**
- **auto**: Automatically detect best device
- **cuda**: Use GPU if available
- **cpu**: Force CPU usage

**Temperature Guidelines:**
- **0.0**: Deterministic (greedy decoding, so the same input always gives the same output)
- **0.7-0.9**: Balanced creativity
- **1.0+**: More random/creative
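
The effect of `temperature` is easiest to see on a toy distribution: the logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The sketch below is a standalone illustration with made-up logits, not the inference script's code; a temperature of 0 is treated here as the greedy limit.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Return token probabilities after scaling the logits by 1/temperature."""
    if temperature <= 0:
        # Limit case: all probability mass goes to the highest logit (greedy).
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
    for t in (0.0, 0.5, 0.8, 1.5):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])
```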

## Task-Specific Parameters

### Classification Tasks
```yaml
data:
  label_encoding: "auto|numeric|string"    # How to encode labels
  multilabel: false                        # Multi-label vs single-label
  label_separator: ","                     # Separator for multi-label
```
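
The following sketch shows how `multilabel` and `label_separator` might interact: a single-label value becomes a class index, while a multi-label value is split on the separator and turned into a multi-hot vector. `encode_labels` and the label list are hypothetical; the real encoder (and the behavior of `label_encoding`) may differ.

```python
def encode_labels(raw: str, label_names: list, multilabel: bool = False,
                  label_separator: str = ","):
    """Encode a raw label string as a class index or a multi-hot vector."""
    index = {name: i for i, name in enumerate(label_names)}
    if not multilabel:
        # Single-label: one class index per sample.
        return index[raw.strip()]

    # Multi-label: one slot per known label, set to 1 where the label occurs.
    vector = [0] * len(label_names)
    for part in raw.split(label_separator):
        part = part.strip()
        if part:
            vector[index[part]] = 1
    return vector


if __name__ == "__main__":
    labels = ["anger", "joy", "sadness"]  # illustrative label set
    print(encode_labels("joy", labels))
    print(encode_labels("joy, anger", labels, multilabel=True))
```

With this layout, `num_labels` in the model section would equal `len(label_names)`.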

### Styling Tasks
```yaml
data:
  instruction: "Style instruction text"    # The style instruction
```

### Completion Tasks
```yaml
data:
  prompt_template: "template"              # Prompt template
  completion_length: 100                   # Target completion length
```
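
The template syntax is not specified here. Assuming `prompt_template` uses Python `str.format` placeholders with a `{text}` slot (an assumption, as is the helper name `build_prompt`), a prompt could be built like this:

```python
def build_prompt(record: dict, data_cfg: dict) -> str:
    """Fill the prompt template with the sample's input text."""
    # Assumption: the template contains a `{text}` placeholder.
    template = data_cfg.get("prompt_template", "{text}")
    input_field = data_cfg.get("input_field", "text")
    return template.format(text=record[input_field])


if __name__ == "__main__":
    cfg = {"prompt_template": "Continue the story: {text}", "input_field": "text"}
    print(build_prompt({"text": "Once upon a time"}, cfg))
```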

## Advanced Configuration

### HuggingFace Specific
```yaml
data:
  hf_split: "train"                        # Dataset split to use
  hf_cache_dir: "./cache"                  # Cache directory
  test_split_from: "train"                 # Source for test split
  val_split_from: "train"                  # Source for validation split
```

### Custom Data Specific
```yaml
data:
  encoding: "utf-8"                        # File encoding
  delimiter: ","                           # CSV delimiter
```
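
For CSV sources, `encoding` and `delimiter` correspond directly to arguments of Python's standard `open` and `csv.DictReader`. The sketch below shows that mapping; `read_csv_records` is a hypothetical helper and may not match how the data processor reads files.

```python
import csv
from typing import Iterator


def read_csv_records(path: str, encoding: str = "utf-8",
                     delimiter: str = ",") -> Iterator[dict]:
    """Yield one dict per CSV row, keyed by the header line."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as handle:
        yield from csv.DictReader(handle, delimiter=delimiter)
```

A semicolon-separated file would then be read with `read_csv_records(path, delimiter=";")`, and the resulting keys are what `input_field`, `label_field`, and `id_field` refer to.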

## Usage Examples

### Basic Usage
```bash
# Use YAML configuration
python scripts/task_type/data_processor.py --config configs/task_type/config.yaml

# Override specific parameters
python scripts/task_type/data_processor.py \
  --config configs/task_type/config.yaml \
  --max-samples 1000 \
  --learning-rate 3e-5
```
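
Command-line overrides like the ones above are typically merged on top of the values loaded from the YAML file. The sketch below shows one way to do that with `argparse`. Only `--config`, `--max-samples`, and `--learning-rate` come from the example; which config section each flag maps to (`data.max_samples`, `training.learning_rate`) is an assumption, and the loaded config is represented by a plain dict.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Config override sketch")
    parser.add_argument("--config", required=True, help="Path to the YAML config")
    # default=None lets us tell "not given" apart from an explicit value.
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    return parser.parse_args(argv)


def apply_overrides(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of `config` with any CLI overrides applied."""
    merged = {section: dict(values) for section, values in config.items()}
    if args.max_samples is not None:
        merged.setdefault("data", {})["max_samples"] = args.max_samples
    if args.learning_rate is not None:
        merged.setdefault("training", {})["learning_rate"] = args.learning_rate
    return merged


if __name__ == "__main__":
    # Stand-in for the dict produced by loading the YAML file.
    loaded = {"data": {"max_samples": 1000}, "training": {"learning_rate": 2e-5}}
    cli = parse_args(["--config", "configs/task_type/config.yaml",
                      "--learning-rate", "3e-5"])
    print(apply_overrides(loaded, cli))
```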

### Creating Custom Configurations
1. Copy an existing config file
2. Modify parameters for your specific use case
3. Update paths and model names
4. Test with a small dataset first

## Best Practices

1. **Start with Defaults**: Use default values and adjust based on results
2. **Validate Paths**: Ensure all file paths are correct and accessible
3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
4. **Test Incrementally**: Test with small datasets before full processing
5. **Version Control**: Keep configurations in version control for reproducibility

## Troubleshooting

### Common Issues:
- **File Not Found**: Check `data_path` and `output_dir` paths
- **Memory Errors**: Reduce `batch_size` or `max_length`
- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
- **Split Errors**: Ensure `train_split`, `validation_split`, and `test_split` sum to ≤ 1.0
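
The split check from the last item can be run before any data is processed. This is a minimal sketch, assuming split sizes are obtained by flooring `n * ratio`; `split_sizes` is a hypothetical helper. The comparison uses a tolerance because ratios such as 0.8, 0.1, and 0.1 are not exactly representable in floating point.

```python
import math


def split_sizes(n_samples: int, train_split: float, validation_split: float,
                test_split: float) -> tuple:
    """Validate the split ratios and return (train, validation, test) sizes."""
    ratios = (train_split, validation_split, test_split)
    if any(r < 0 for r in ratios):
        raise ValueError("Split ratios must be non-negative")
    total = sum(ratios)
    # Allow for floating-point rounding when the ratios should sum to 1.0.
    if total > 1.0 and not math.isclose(total, 1.0):
        raise ValueError(f"Split ratios sum to {total:.3f}, which exceeds 1.0")
    # Assumption: sizes are floored; any leftover samples stay unused.
    return tuple(int(n_samples * r) for r in ratios)
```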

### Getting Help:
- Check the script help: `python script.py --help`
- Review the pipeline logs for detailed error messages
- Verify YAML syntax and parameter values

added style mimicking pipelines 2025-08-13 21:17:01 +01:00