added style mimicking pipelines

2025-08-13 21:17:01 +01:00
parent fd54d4be39
commit 710d074b47
31 changed files with 3816 additions and 46 deletions
@@ -0,0 +1,191 @@
+# Quick Reference Card
+
+## Essential Parameters (Most Common)
+
+### Data Source & Location
+```yaml
+data:
+  source: "huggingface|custom"             # REQUIRED: Data source type
+  dataset_name: "dataset/name"             # REQUIRED for huggingface
+  data_path: "./path/to/file"              # REQUIRED for custom
+  data_format: "jsonl|csv|json"            # REQUIRED for custom
+```
+
+### Field Mapping
+```yaml
+data:
+  input_field: "text"                      # REQUIRED: Input text field
+  label_field: "label"                     # REQUIRED for classification
+  output_field: "styled_text"              # REQUIRED for styling
+  instruction: "Style instruction"          # REQUIRED for styling
+```
+
+### Basic Processing
+```yaml
+data:
+  max_samples: 1000                        # Limit total samples
+  train_split: 0.8                         # Training ratio (0.0-1.0)
+  validation_split: 0.1                    # Validation ratio (0.0-1.0)
+  test_split: 0.1                          # Test ratio (0.0-1.0)
+  output_dir: "./output/path"              # Output directory
+```
+
+### Text Preprocessing
+```yaml
+data:
+  clean_text: true                         # Clean/normalize text
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 512                          # Maximum text length
+```
+
+### Model & Training
+```yaml
+model:
+  name: "bert-base-uncased"                # Model name
+  max_length: 512                          # Max sequence length
+
+training:
+  num_epochs: 3                            # Training epochs
+  batch_size: 16                           # Batch size
+  learning_rate: 2e-5                      # Learning rate
+```
+
+## Common Configurations by Task
+
+### Classification
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "dair-ai/emotion"
+  input_field: "text"
+  label_field: "label"
+  output_format: "classification"
+```
+
+### Styling
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./data.jsonl"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite in formal style"
+  output_format: "alpaca"
+```
+
+### Text Generation
+```yaml
+task:
+  name: "completion"
+  type: "text_generation"
+
+data:
+  source: "custom"
+  data_path: "./prompts.jsonl"
+  input_field: "prompt"
+  output_field: "completion"
+  output_format: "instruction"
+```
+
+## Quick Start Templates
+
+### 1. HuggingFace Dataset
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "your/dataset"
+  input_field: "text"
+  label_field: "label"
+  max_samples: 1000
+  output_dir: "./output"
+```
+
+### 2. Custom JSONL File
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./your_data.jsonl"
+  data_format: "jsonl"
+  input_field: "source"
+  output_field: "target"
+  instruction: "Your style instruction"
+  output_dir: "./output"
+```
+
+### 3. CSV File
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "custom"
+  data_path: "./your_data.csv"
+  data_format: "csv"
+  input_field: "text"
+  label_field: "label"
+  delimiter: ","
+  output_dir: "./output"
+```
+
+## Parameter Ranges & Recommendations
+
+### Split Ratios
+- **Total must be ≤ 1.0**
+- **Common**: train=0.8, val=0.1, test=0.1
+- **Small datasets**: train=0.7, val=0.15, test=0.15
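+
+A minimal sketch of the small-dataset split above, using only the `data` keys from the Basic Processing section. The ratios add up to 0.7 + 0.15 + 0.15 = 1.0, which satisfies the ≤ 1.0 rule; the `max_samples` value is purely illustrative.
+
+```yaml
+data:
+  max_samples: 500                         # Illustrative value for a small dataset
+  train_split: 0.7                         # 70% for training
+  validation_split: 0.15                   # 15% for validation
+  test_split: 0.15                         # 15% for testing (total = 1.0)
+```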
+
+### Learning Rates
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+- **Start with**: 2e-5
+
+### Batch Sizes
+- **GPU**: 8, 16, 32, or 64 (limited by available GPU memory)
+- **CPU**: 4, 8, 16
+- **Start with**: 16
+
+### Text Lengths
+- **BERT**: 512 tokens (max)
+- **GPT-2**: 1024 tokens (max)
+- **T5**: 512 tokens (default tokenizer limit)
+- **Start with**: 256
+
+## Common Issues & Fixes
+
+| Issue | Cause | Fix |
+|-------|-------|-----|
+| "File not found" | Wrong path | Check `data_path` and `output_dir` |
+| "Memory error" | Batch too large | Reduce `batch_size` |
+| "Split error" | Ratios sum to > 1.0 | Ensure splits sum to ≤ 1.0 |
+| "Poor performance" | Wrong learning rate | Try a value between 1e-5 and 5e-5 |
+| "Slow processing" | Text too long | Reduce `max_length` |
+
+## Environment Variables
+```bash
+# Set cache directory
+export HF_HOME="./cache"
+
+# Set output directory
+export OUTPUT_DIR="./results"
+
+# Set log level
+export LOG_LEVEL="INFO"
+```
@@ -0,0 +1,207 @@
+# Configuration Files Documentation
+
+This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.
+
+## Configuration Structure
+
+All configuration files follow a consistent structure organized into these main sections:
+
+### 1. Task Configuration
+```yaml
+task:
+  name: "task_type"                        # Task type: classification, completion, styling, matching
+  type: "specific_type"                    # Specific model/task type
+```
+
+**Available Task Types:**
+- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
+- **completion**: Text generation and completion tasks
+- **styling**: Style transfer and text transformation tasks
+- **matching**: Semantic matching and similarity tasks
+
+### 2. Data Processing Configuration
+```yaml
+data:
+  # Data Source
+  source: "huggingface|custom"             # Where to get data from
+  
+  # Data Location
+  dataset_name: "dataset/name"             # HuggingFace dataset name (for huggingface source)
+  data_path: "./path/to/file"              # Path to custom data file (for custom source)
+  data_format: "jsonl|csv|json"            # File format for custom data
+  
+  # Field Mapping
+  input_field: "text"                      # Field containing input text
+  output_field: "styled_text"              # Field containing output (for styling)
+  label_field: "label"                     # Field containing labels (for classification)
+  id_field: "id"                           # Optional ID field for tracking
+  
+  # Processing Parameters
+  max_samples: 1000                        # Maximum samples to process
+  train_split: 0.8                         # Training split ratio
+  validation_split: 0.1                    # Validation split ratio
+  test_split: 0.1                          # Test split ratio
+  
+  # Text Preprocessing
+  clean_text: true                         # Clean and normalize text
+  remove_special_chars: false              # Remove special characters
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 1000                         # Maximum text length
+  
+  # Output Configuration
+  output_format: "format_type"             # Output format
+  output_dir: "./output/path"              # Output directory
+```
+
+**Data Source Types:**
+- **huggingface**: Use datasets from HuggingFace Hub
+- **custom**: Use local files (JSONL, CSV, JSON)
+
+**Output Formats:**
+- **classification**: Raw classification format
+- **instruction**: Instruction-following format
+- **conversation**: Conversational format
+- **qa**: Question-answer format
+- **styling**: Raw styling format
+- **alpaca**: Alpaca instruction format
+
+### 3. Model Configuration
+```yaml
+model:
+  name: "model_name"                       # Model from HuggingFace Hub
+  max_length: 512                          # Maximum sequence length
+  num_labels: 6                            # Number of labels (for classification)
+```
+
+**Recommended Models by Task:**
+- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
+- **Styling**: `t5-base`, `gpt2-medium`
+- **Completion**: `gpt2-medium`, `gpt2-large`
+- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`
+
+### 4. Training Configuration
+```yaml
+training:
+  num_epochs: 3                            # Number of training epochs
+  batch_size: 16                           # Training batch size
+  learning_rate: 2e-5                      # Learning rate
+  weight_decay: 0.01                       # Weight decay
+  lr_scheduler_type: "linear"              # Learning rate scheduler
+  warmup_ratio: 0.1                        # Warmup ratio
+  data_dir: "./data/path"                  # Training data directory
+  output_dir: "./model/output"             # Model output directory
+```
+
+**Learning Rate Guidelines:**
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+
+**Scheduler Types:**
+- **linear**: Linear decay
+- **cosine**: Cosine annealing
+- **polynomial**: Polynomial decay
+
+### 5. Inference Configuration
+```yaml
+inference:
+  model_path: "./model/path"               # Path to saved model
+  device: "auto"                           # Device to use
+  batch_size: 32                           # Inference batch size
+  return_probabilities: true                # Return probabilities
+  return_top_k: 3                          # Return top K predictions
+  max_new_tokens: 128                      # Max tokens to generate
+  temperature: 0.8                         # Sampling temperature
+```
+
+**Device Options:**
+- **auto**: Automatically detect best device
+- **cuda**: Use GPU if available
+- **cpu**: Force CPU usage
+
+**Temperature Guidelines:**
+- **0.0**: Deterministic (always same output)
+- **0.7-0.9**: Balanced creativity
+- **1.0+**: More random/creative
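+
+For example, a reproducible generation run might pin the temperature to the deterministic end of the range. This is a sketch using only keys from the inference block above; whether `0.0` is accepted as-is depends on the inference script, which is not shown here.
+
+```yaml
+inference:
+  max_new_tokens: 128                      # Max tokens to generate
+  temperature: 0.0                         # Deterministic end of the range above
+```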
+
+## Task-Specific Parameters
+
+### Classification Tasks
+```yaml
+data:
+  label_encoding: "auto|numeric|string"    # How to encode labels
+  multilabel: false                        # Multi-label vs single-label
+  label_separator: ","                     # Separator for multi-label
+```
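+
+As a sketch, a multi-label dataset might combine these keys as follows. The `tags` field name and the `|` separator are hypothetical, and the sketch assumes each record stores its labels as a single separator-joined string, which this document does not confirm.
+
+```yaml
+data:
+  label_field: "tags"                      # Hypothetical field, e.g. "joy|surprise"
+  label_encoding: "auto"                   # One of "auto", "numeric", "string"
+  multilabel: true                         # More than one label per text
+  label_separator: "|"                     # Hypothetical separator (the example above uses ",")
+```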
+
+### Styling Tasks
+```yaml
+data:
+  instruction: "Style instruction text"    # The style instruction
+```
+
+### Completion Tasks
+```yaml
+data:
+  prompt_template: "template"               # Prompt template
+  completion_length: 100                   # Target completion length
+```
+
+## Advanced Configuration
+
+### HuggingFace Specific
+```yaml
+data:
+  hf_split: "train"                        # Dataset split to use
+  hf_cache_dir: "./cache"                  # Cache directory
+  test_split_from: "train"                 # Source for test split
+  val_split_from: "train"                  # Source for validation split
+```
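+
+For a Hub dataset that already ships its own validation and test splits, the split-source keys can point at those instead of carving everything out of `train`. This is a sketch, assuming the `use_val_if_available` and `use_test_if_available` values listed in the classification config comments behave as their names suggest.
+
+```yaml
+data:
+  source: "huggingface"
+  dataset_name: "dair-ai/emotion"          # Ships train, validation, and test splits
+  hf_split: "train"                        # Base split
+  val_split_from: "use_val_if_available"   # Assumed: prefer the dataset's own validation split
+  test_split_from: "use_test_if_available" # Assumed: prefer the dataset's own test split
+```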
+
+### Custom Data Specific
+```yaml
+data:
+  encoding: "utf-8"                        # File encoding
+  delimiter: ","                           # CSV delimiter
+```
+
+## Usage Examples
+
+### Basic Usage
+```bash
+# Use YAML configuration
+python scripts/task_type/data_processor.py --config configs/task_type/config.yaml
+
+# Override specific parameters
+python scripts/task_type/data_processor.py \
+  --config configs/task_type/config.yaml \
+  --max-samples 1000 \
+  --learning-rate 3e-5
+```
+
+### Creating Custom Configurations
+1. Copy an existing config file
+2. Modify parameters for your specific use case
+3. Update paths and model names
+4. Test with a small dataset first
+
+## Best Practices
+
+1. **Start with Defaults**: Use default values and adjust based on results
+2. **Validate Paths**: Ensure all file paths are correct and accessible
+3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
+4. **Test Incrementally**: Test with small datasets before full processing
+5. **Version Control**: Keep configurations in version control for reproducibility
+
+## Troubleshooting
+
+### Common Issues:
+- **File Not Found**: Check `data_path` and `output_dir` paths
+- **Memory Errors**: Reduce `batch_size` or `max_length`
+- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
+- **Split Errors**: Ensure split ratios sum to ≤ 1.0
+
+### Getting Help:
+- Check the script help: `python script.py --help`
+- Review the pipeline logs for detailed error messages
+- Verify YAML syntax and parameter values
@@ -1,6 +1,6 @@
 # Comprehensive Classification Configuration
 # This file defines all parameters for emotion classification using the dair-ai/emotion dataset
-# Organized by level: data processing, model, training, and inference
+# Organized by level: task, data processing, model, training, and inference

 # Task Configuration
 task:
@@ -15,9 +15,9 @@ data:
  data_format: "jsonl"                     # Data format: "jsonl", "csv", "json" (for custom data)
  
  # Field Mapping
-  input_field: "text"                      # Field name containing input text
-  label_field: "label"                     # Field name containing labels
-  id_field: null                           # Optional ID field name
+  input_field: "text"                      # Field name containing input text to be classified
+  label_field: "label"                     # Field name containing classification labels
+  id_field: null                           # Optional ID field name for tracking individual samples
  
  # Processing Parameters
  max_samples: 1000                        # Maximum samples to process (null for all samples)
@@ -26,54 +26,54 @@ data:
  test_split: 0.1                          # Test split ratio (0.0 to 1.0)
  
  # Text Preprocessing
-  clean_text: true                         # Clean and normalize text
-  remove_special_chars: false              # Remove special characters from text
-  lowercase: true                          # Convert text to lowercase
+  clean_text: true                         # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  remove_special_chars: false              # Remove special characters from text (keep for emotion analysis)
+  lowercase: true                          # Convert text to lowercase (standard for BERT models)
  min_length: 10                           # Minimum text length (filter out shorter texts)
  max_length: 1000                         # Maximum text length (truncate longer texts)
  
  # Label Processing
  label_encoding: "auto"                   # Label encoding: "auto", "numeric", "string"
-  multilabel: false                        # Enable multilabel classification
-  label_separator: ","                     # Separator for multilabel datasets
+  multilabel: false                        # Enable multilabel classification (false for single emotion per text)
+  label_separator: ","                     # Separator for multilabel datasets (comma-separated labels)
  
  # Output Configuration
  output_format: "classification"          # Output format: "classification", "instruction", "conversation", "qa"
-  output_dir: "./data/processed/classification/emotion"  # Specific output directory for this dataset
+  output_dir: "./data/processed/classification/emotion"  # Output directory for processed data and splits
  
  # HuggingFace Specific
-  hf_split: "train"                        # HuggingFace dataset split to use
-  hf_cache_dir: null                       # HuggingFace cache directory (null for default)
+  hf_split: "train"                        # HuggingFace dataset split to use as base
+  hf_cache_dir: null                       # HuggingFace cache directory (null for default ~/.cache/huggingface)
  
  # Split Configuration (Advanced)
  test_split_from: "train"                 # Source for test split: "train", "use_test_if_available", "use_val_if_available"
  val_split_from: "train"                  # Source for validation split: "train", "use_val_if_available"
  
  # Custom Data Specific
-  encoding: "utf-8"                        # File encoding for custom data
-  delimiter: ","                           # Delimiter for CSV files
+  encoding: "utf-8"                        # File encoding for custom data files
+  delimiter: ","                           # Delimiter for CSV files (comma for standard CSV)

 # Model Configuration
 model:
-  name: "bert-base-uncased"                # Model name from HuggingFace Hub
-  max_length: 512                          # Maximum sequence length for tokenization
-  num_labels: 6                            # Number of classification labels
+  name: "bert-base-uncased"                # Model name from HuggingFace Hub (good for text classification)
+  max_length: 512                          # Maximum sequence length for tokenization (BERT limit)
+  num_labels: 6                            # Number of classification labels (emotion categories)

 # Training Configuration
 training:
-  num_epochs: 3                            # Number of training epochs
-  batch_size: 16                           # Training batch size
-  learning_rate: 2e-5                      # Learning rate (typical range: 1e-5 to 5e-5)
-  weight_decay: 0.01                       # Weight decay for optimizer (typical range: 0.01 to 0.1)
+  num_epochs: 3                            # Number of training epochs (adjust based on dataset size)
+  batch_size: 16                           # Training batch size (adjust based on GPU memory)
+  learning_rate: 2e-5                      # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01                       # Weight decay for optimizer (prevents overfitting)
  lr_scheduler_type: "linear"              # Scheduler type: "linear", "cosine", "polynomial"
  warmup_ratio: 0.1                        # Warmup ratio for scheduler (0.0 to 1.0)
  data_dir: "./data/processed/classification/emotion"  # Directory containing train/validation/test JSONL files
-  output_dir: "./results/classification/emotion_model"  # Output directory for saved model
+  output_dir: "./results/classification/emotion_model"  # Output directory for saved model and checkpoints

 # Inference Configuration
 inference:
  model_path: "./results/classification/emotion_model"  # Path to saved model directory
-  device: "auto"                           # Device: "auto", "cuda", "cpu"
-  batch_size: 32                           # Batch size for inference
-  return_probabilities: true                # Return all class probabilities
-  return_top_k: 3                          # Return top K predictions
+  device: "auto"                           # Device: "auto", "cuda", "cpu" (auto detects best available)
+  batch_size: 32                           # Batch size for inference (can be larger than training)
+  return_probabilities: true                # Return all class probabilities (not just top prediction)
+  return_top_k: 3                          # Return top K predictions (useful for confidence analysis)
@@ -1,29 +1,69 @@
+# Comprehensive Styling Configuration
+# This file defines all parameters for formal style transfer tasks
+# Organized by level: task, data processing, model, training, and inference
+
+# Task Configuration
 task:
-  name: "styling"
-  type: "style_transfer"
+  name: "styling"                          # Task type: classification, completion, styling, matching
+  type: "style_transfer"                   # Model type: style_transfer, text_generation, etc.

+# Data Processing Configuration
 data:
-  source: "custom"
-  input_field: "text"
-  style_field: "style"
-  max_length: 256
-  train_split: 0.8
-  validation_split: 0.1
-  test_split: 0.1
+  source: "custom"                          # Data source: "huggingface" or "custom"
+  data_path: "./data/raw/styling/sample_formal.jsonl"  # Path to custom data file (required for custom source)
+  dataset_name: null                        # HuggingFace dataset name (required for huggingface source)
+  
+  # Field Mapping
+  input_field: "text"                       # Field name containing source text to be styled
+  output_field: "styled_text"               # Field name containing the styled/transformed text
+  
+  # Style Instruction
+  instruction: "Rewrite the following text in a formal style"  # The style instruction that guides the transformation
+  
+  # Data Format & Processing
+  data_format: "jsonl"                      # Data format: "jsonl", "csv", "json" (for custom data)
+  max_length: 256                           # Maximum text length (truncate longer texts)
+  min_length: 10                            # Minimum text length (filter out shorter texts)
+  
+  # Text Preprocessing
+  clean_text: true                          # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  lowercase: false                          # Convert text to lowercase (false for formal style to preserve case)
+  
+  # Data Splitting
+  train_split: 0.8                          # Training split ratio (0.0 to 1.0)
+  validation_split: 0.1                     # Validation split ratio (0.0 to 1.0)
+  test_split: 0.1                           # Test split ratio (0.0 to 1.0)
+  
+  # Output Configuration
+  output_format: "alpaca"                   # Output format: "styling" (raw), "alpaca" (instruction format)
+  output_dir: "./data/processed/styling/formal"  # Output directory for processed data and HuggingFace datasets

+# Model Configuration
 model:
-  name: "t5-base"
-  max_length: 256
+  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Model name from HuggingFace Hub
+  max_length: 2048                          # Maximum sequence length for tokenization
+  max_seq_length: 2048                      # Maximum sequence length for training (RoPE scaling supported)
+  dtype: null                               # Data type: null for auto detection, float16 for Tesla T4/V100, bfloat16 for Ampere+
+  load_in_4bit: true                        # Use 4bit quantization to reduce memory usage
+  token: null                               # HuggingFace token for gated models (e.g., "hf_...")
+  
+  # Training Model Parameters
+  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Model to use for training
+  training_max_seq_length: 2048             # Max sequence length for training
+  training_dtype: null                      # Data type for training
+  training_load_in_4bit: true               # 4bit quantization for training

+# Training Configuration
 training:
-  num_epochs: 3
-  batch_size: 16
-  learning_rate: 3e-5
-  weight_decay: 0.01
-  warmup_ratio: 0.1
-  lr_scheduler_type: "linear"
+  num_epochs: 3                             # Number of training epochs
+  batch_size: 16                            # Training batch size (adjust based on GPU memory)
+  learning_rate: 3e-5                       # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01                        # Weight decay for optimizer (prevents overfitting)
+  warmup_ratio: 0.1                         # Warmup ratio for scheduler (0.0 to 1.0)
+  lr_scheduler_type: "linear"               # Scheduler type: "linear", "cosine", "polynomial"

+# Inference Configuration
 inference:
-  batch_size: 32
-  max_new_tokens: 128
-  temperature: 0.8
+  batch_size: 32                            # Batch size for inference (can be larger than training)
+  max_new_tokens: 128                       # Maximum new tokens to generate during inference
+  temperature: 0.8                          # Sampling temperature (0.0 = deterministic, 1.0 = random)