added style mimicking pipelines

OwusuBlessing
2025-08-13 21:17:01 +01:00
parent fd54d4be39
commit 710d074b47
31 changed files with 3816 additions and 46 deletions
+191
@@ -0,0 +1,191 @@
# Quick Reference Card
## Essential Parameters (Most Common)
### Data Source & Location
```yaml
data:
  source: "huggingface|custom" # REQUIRED: Data source type
  dataset_name: "dataset/name" # REQUIRED for huggingface
  data_path: "./path/to/file" # REQUIRED for custom
  data_format: "jsonl|csv|json" # REQUIRED for custom
```
### Field Mapping
```yaml
data:
  input_field: "text" # REQUIRED: Input text field
  label_field: "label" # REQUIRED for classification
  output_field: "styled_text" # REQUIRED for styling
  instruction: "Style instruction" # REQUIRED for styling
```
### Basic Processing
```yaml
data:
  max_samples: 1000 # Limit total samples
  train_split: 0.8 # Training ratio (0.0-1.0)
  validation_split: 0.1 # Validation ratio (0.0-1.0)
  test_split: 0.1 # Test ratio (0.0-1.0)
  output_dir: "./output/path" # Output directory
```
### Text Preprocessing
```yaml
data:
  clean_text: true # Clean/normalize text
  lowercase: true # Convert to lowercase
  min_length: 10 # Minimum text length
  max_length: 512 # Maximum text length
```
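The preprocessing flags above can be read as a small filter-and-truncate step. The following is a minimal sketch, not the pipeline's actual implementation: the function name `preprocess_text` is hypothetical, lengths are assumed to be measured in characters, and "clean" is assumed to mean whitespace normalization only.

```python
import re
from typing import Optional


def preprocess_text(
    text: str,
    clean_text: bool = True,
    lowercase: bool = True,
    min_length: int = 10,
    max_length: int = 512,
) -> Optional[str]:
    """Return the processed text, or None when it is shorter than min_length."""
    if clean_text:
        # Assumed meaning of "clean": collapse whitespace runs and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if len(text) < min_length:
        return None  # sample is filtered out
    return text[:max_length]  # longer texts are truncated


if __name__ == "__main__":
    print(preprocess_text("  Hello   WORLD, again  "))
    print(preprocess_text("short"))
```

Returning `None` for short texts lets the caller drop the sample with a simple `if` check instead of raising.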
### Model & Training
```yaml
model:
  name: "bert-base-uncased" # Model name
  max_length: 512 # Max sequence length
training:
  num_epochs: 3 # Training epochs
  batch_size: 16 # Batch size
  learning_rate: 2e-5 # Learning rate
```
## Common Configurations by Task
### Classification
```yaml
task:
  name: "classification"
  type: "sequence_classification"
data:
  source: "huggingface"
  dataset_name: "dair-ai/emotion"
  input_field: "text"
  label_field: "label"
  output_format: "classification"
```
### Styling
```yaml
task:
  name: "styling"
  type: "style_transfer"
data:
  source: "custom"
  data_path: "./data.jsonl"
  input_field: "text"
  output_field: "styled_text"
  instruction: "Rewrite in formal style"
  output_format: "alpaca"
```
### Text Generation
```yaml
task:
  name: "completion"
  type: "text_generation"
data:
  source: "custom"
  data_path: "./prompts.jsonl"
  input_field: "prompt"
  output_field: "completion"
  output_format: "instruction"
```
## Quick Start Templates
### 1. HuggingFace Dataset
```yaml
task:
  name: "classification"
  type: "sequence_classification"
data:
  source: "huggingface"
  dataset_name: "your/dataset"
  input_field: "text"
  label_field: "label"
  max_samples: 1000
  output_dir: "./output"
```
### 2. Custom JSONL File
```yaml
task:
  name: "styling"
  type: "style_transfer"
data:
  source: "custom"
  data_path: "./your_data.jsonl"
  data_format: "jsonl"
  input_field: "source"
  output_field: "target"
  instruction: "Your style instruction"
  output_dir: "./output"
```
### 3. CSV File
```yaml
task:
  name: "classification"
  type: "sequence_classification"
data:
  source: "custom"
  data_path: "./your_data.csv"
  data_format: "csv"
  input_field: "text"
  label_field: "label"
  delimiter: ","
  output_dir: "./output"
```
## Parameter Ranges & Recommendations
### Split Ratios
- **Total must be ≤ 1.0**
- **Common**: train=0.8, val=0.1, test=0.1
- **Small datasets**: train=0.7, val=0.15, test=0.15
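The split rule above is easy to check before any data is touched. This is a hedged sketch: `split_sizes` is a hypothetical helper, and flooring each count is an assumption about how partial samples would be handled.

```python
def split_sizes(n_samples, train_split=0.8, validation_split=0.1, test_split=0.1):
    """Validate the three ratios and return (train, validation, test) sample counts."""
    ratios = (train_split, validation_split, test_split)
    if any(r < 0.0 or r > 1.0 for r in ratios):
        raise ValueError("each split ratio must be between 0.0 and 1.0")
    # Small tolerance so that float sums such as 0.7 + 0.15 + 0.15 are accepted.
    if sum(ratios) > 1.0 + 1e-9:
        raise ValueError(f"split ratios sum to {sum(ratios):.3f}, which exceeds 1.0")
    # Floor each count (with a tiny epsilon against float error) so the
    # three parts can never add up to more than n_samples.
    return tuple(int(n_samples * r + 1e-9) for r in ratios)


if __name__ == "__main__":
    print(split_sizes(1000))
    print(split_sizes(200, 0.7, 0.15, 0.15))
```

When the ratios sum to less than 1.0, the leftover samples are simply unused in this sketch.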
### Learning Rates
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3
- **Start with**: 2e-5
### Batch Sizes
- **GPU**: 8, 16, 32, or 64, depending on available memory
- **CPU**: 4, 8, 16
- **Start with**: 16
### Text Lengths
- **BERT**: 512 (max)
- **GPT-2**: 1024 (max)
- **T5**: 512 (default; not a hard architectural limit)
- **Start with**: 256
## Common Issues & Fixes
| Issue | Cause | Fix |
|-------|-------|-----|
| "File not found" | Wrong path | Check `data_path` and `output_dir` |
| "Memory error" | Batch too large | Reduce `batch_size` |
| "Split error" | Ratios > 1.0 | Ensure splits sum to ≤ 1.0 |
| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
| "Slow processing" | Text too long | Reduce `max_length` |
## Environment Variables
```bash
# Set cache directory
export HF_HOME="./cache"
# Set output directory
export OUTPUT_DIR="./results"
# Set log level
export LOG_LEVEL="INFO"
```
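How the scripts consume these variables is not shown here, so the following is only a sketch of one plausible pattern. The helper name `read_environment` and the fallback values (`"./output"`, `INFO`) are assumptions; `HF_HOME` is left as `None` when unset so the Hugging Face default location applies.

```python
import logging
import os


def read_environment(env=None):
    """Collect the three variables into a dict, applying assumed fallbacks."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to INFO
    return {
        "hf_home": env.get("HF_HOME"),  # None means "use the library default"
        "output_dir": env.get("OUTPUT_DIR", "./output"),  # assumed fallback
        "log_level": level,
    }


if __name__ == "__main__":
    settings = read_environment()
    logging.basicConfig(level=settings["log_level"])
    logging.info("writing results to %s", settings["output_dir"])
```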
+207
@@ -0,0 +1,207 @@
# Configuration Files Documentation
This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.
## Configuration Structure
All configuration files follow a consistent structure organized into these main sections:
### 1. Task Configuration
```yaml
task:
  name: "task_type" # Task type: classification, completion, styling, matching
  type: "specific_type" # Specific model/task type
```
**Available Task Types:**
- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
- **completion**: Text generation and completion tasks
- **styling**: Style transfer and text transformation tasks
- **matching**: Semantic matching and similarity tasks
### 2. Data Processing Configuration
```yaml
data:
  # Data Source
  source: "huggingface|custom" # Where to get data from
  # Data Location
  dataset_name: "dataset/name" # HuggingFace dataset name (for huggingface source)
  data_path: "./path/to/file" # Path to custom data file (for custom source)
  data_format: "jsonl|csv|json" # File format for custom data
  # Field Mapping
  input_field: "text" # Field containing input text
  output_field: "styled_text" # Field containing output (for styling)
  label_field: "label" # Field containing labels (for classification)
  id_field: "id" # Optional ID field for tracking
  # Processing Parameters
  max_samples: 1000 # Maximum samples to process
  train_split: 0.8 # Training split ratio
  validation_split: 0.1 # Validation split ratio
  test_split: 0.1 # Test split ratio
  # Text Preprocessing
  clean_text: true # Clean and normalize text
  remove_special_chars: false # Remove special characters
  lowercase: true # Convert to lowercase
  min_length: 10 # Minimum text length
  max_length: 1000 # Maximum text length
  # Output Configuration
  output_format: "format_type" # Output format
  output_dir: "./output/path" # Output directory
```
**Data Source Types:**
- **huggingface**: Use datasets from HuggingFace Hub
- **custom**: Use local files (JSONL, CSV, JSON)
**Output Formats:**
- **classification**: Raw classification format
- **instruction**: Instruction-following format
- **conversation**: Conversational format
- **qa**: Question-answer format
- **styling**: Raw styling format
- **alpaca**: Alpaca instruction format
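As an illustration of the `alpaca` output format, the sketch below maps one styling record onto the `instruction` / `input` / `output` keys used by the public Alpaca dataset. Whether this pipeline writes exactly these keys is an assumption, and `to_alpaca` is a hypothetical helper name.

```python
import json


def to_alpaca(record, instruction, input_field="text", output_field="styled_text"):
    """Map one raw record to an Alpaca-style instruction example."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


if __name__ == "__main__":
    raw = {"text": "hey, can u send the file?", "styled_text": "Could you please send the file?"}
    example = to_alpaca(raw, "Rewrite in formal style")
    print(json.dumps(example))  # one JSONL line
```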
### 3. Model Configuration
```yaml
model:
  name: "model_name" # Model from HuggingFace Hub
  max_length: 512 # Maximum sequence length
  num_labels: 6 # Number of labels (for classification)
```
**Recommended Models by Task:**
- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
- **Styling**: `t5-base`, `gpt2-medium`
- **Completion**: `gpt2-medium`, `gpt2-large`
- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`
### 4. Training Configuration
```yaml
training:
  num_epochs: 3 # Number of training epochs
  batch_size: 16 # Training batch size
  learning_rate: 2e-5 # Learning rate
  weight_decay: 0.01 # Weight decay
  lr_scheduler_type: "linear" # Learning rate scheduler
  warmup_ratio: 0.1 # Warmup ratio
  data_dir: "./data/path" # Training data directory
  output_dir: "./model/output" # Model output directory
```
**Learning Rate Guidelines:**
- **Fine-tuning**: 1e-5 to 5e-5
- **Training from scratch**: 1e-4 to 1e-3
**Scheduler Types:**
- **linear**: Linear decay
- **cosine**: Cosine annealing
- **polynomial**: Polynomial decay
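To make the interaction between `lr_scheduler_type: "linear"` and `warmup_ratio` concrete, here is a plain-Python sketch of a linear schedule with warmup. It mirrors the usual warmup-then-linear-decay shape and is not taken from this repository's training code; `linear_lr` is a hypothetical name.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup from 0, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 50, 100, 550, 1000):
        print(s, linear_lr(s, total_steps=1000))
```

With `warmup_ratio: 0.1` and 1000 total steps, the peak rate is reached at step 100 and the rate returns to zero at the final step.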
### 5. Inference Configuration
```yaml
inference:
  model_path: "./model/path" # Path to saved model
  device: "auto" # Device to use
  batch_size: 32 # Inference batch size
  return_probabilities: true # Return probabilities
  return_top_k: 3 # Return top K predictions
  max_new_tokens: 128 # Max tokens to generate
  temperature: 0.8 # Sampling temperature
```
**Device Options:**
- **auto**: Automatically detect best device
- **cuda**: Use GPU if available
- **cpu**: Force CPU usage
**Temperature Guidelines:**
- **0.0**: Deterministic (always same output)
- **0.7-0.9**: Balanced creativity
- **1.0+**: More random/creative
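The temperature guidelines follow from how temperature rescales logits before the softmax. The sketch below is a generic illustration in plain Python, not this pipeline's inference code; treating `0.0` as greedy (argmax) selection is the common convention and is assumed here.

```python
import math


def softmax_with_temperature(logits, temperature=0.8):
    """Turn logits into sampling probabilities at the given temperature."""
    if temperature <= 0.0:
        # Temperature 0 is treated as greedy decoding: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    for t in (0.0, 0.5, 1.0, 1.5):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, while higher ones flatten the distribution.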
## Task-Specific Parameters
### Classification Tasks
```yaml
data:
  label_encoding: "auto|numeric|string" # How to encode labels
  multilabel: false # Multi-label vs single-label
  label_separator: "," # Separator for multi-label
```
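A small sketch of how `multilabel` and `label_separator` might be applied to a raw label string, followed by a string-to-id mapping. Both helper names are hypothetical, and assigning ids in sorted order is an assumption rather than documented behavior.

```python
def parse_labels(raw, multilabel=False, label_separator=","):
    """Return the list of labels encoded in one raw label value."""
    if not multilabel:
        return [raw.strip()]  # single-label: the whole value is one label
    parts = (part.strip() for part in raw.split(label_separator))
    return [part for part in parts if part]


def build_label_map(all_labels):
    """Assign a stable integer id to every distinct label (sorted order, assumed)."""
    return {label: idx for idx, label in enumerate(sorted(set(all_labels)))}


if __name__ == "__main__":
    rows = ["joy, surprise", "sadness", "joy"]
    parsed = [parse_labels(r, multilabel=True) for r in rows]
    label_map = build_label_map(label for labels in parsed for label in labels)
    print(parsed)
    print(label_map)
```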
### Styling Tasks
```yaml
data:
  instruction: "Style instruction text" # The style instruction
```
### Completion Tasks
```yaml
data:
  prompt_template: "template" # Prompt template
  completion_length: 100 # Target completion length
```
## Advanced Configuration
### HuggingFace Specific
```yaml
data:
  hf_split: "train" # Dataset split to use
  hf_cache_dir: "./cache" # Cache directory
  test_split_from: "train" # Source for test split
  val_split_from: "train" # Source for validation split
```
### Custom Data Specific
```yaml
data:
  encoding: "utf-8" # File encoding
  delimiter: "," # CSV delimiter
```
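For custom sources, `data_format`, `encoding`, and `delimiter` together determine how a file is read. The following standard-library sketch shows one way to honor them; `load_records` is a hypothetical helper, not the repository's loader.

```python
import csv
import json


def load_records(data_path, data_format="jsonl", encoding="utf-8", delimiter=","):
    """Read a custom data file into a list of dict records."""
    if data_format not in {"jsonl", "json", "csv"}:
        raise ValueError(f"unsupported data_format: {data_format!r}")
    with open(data_path, "r", encoding=encoding, newline="") as handle:
        if data_format == "jsonl":
            # One JSON object per line; blank lines are skipped.
            return [json.loads(line) for line in handle if line.strip()]
        if data_format == "json":
            return json.load(handle)
        # CSV: the first row is treated as the header; all values stay strings.
        return list(csv.DictReader(handle, delimiter=delimiter))
```

Note that CSV values are returned as strings, so numeric labels would need an explicit conversion afterwards.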
## Usage Examples
### Basic Usage
```bash
# Use YAML configuration
python scripts/task_type/data_processor.py --config configs/task_type/config.yaml
# Override specific parameters
python scripts/task_type/data_processor.py \
  --config configs/task_type/config.yaml \
  --max-samples 1000 \
  --learning-rate 3e-5
```
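The override behavior shown above could be implemented by letting command-line flags replace values loaded from the YAML file. This is a sketch under assumptions: the flag names come from the example, but the mapping of `--max-samples` to `data.max_samples` and `--learning-rate` to `training.learning_rate` is inferred, and the config is represented as an already-parsed dict.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Data processor (illustrative)")
    parser.add_argument("--config", required=True, help="Path to the YAML config")
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    return parser


def apply_overrides(config, args):
    """Return a copy of config in which CLI flags that were given win."""
    merged = {section: dict(values) for section, values in config.items()}
    if args.max_samples is not None:
        merged.setdefault("data", {})["max_samples"] = args.max_samples
    if args.learning_rate is not None:
        merged.setdefault("training", {})["learning_rate"] = args.learning_rate
    return merged


if __name__ == "__main__":
    cli = build_parser().parse_args()
    # Stand-in for the parsed YAML file named by cli.config.
    loaded = {"data": {"max_samples": 50}, "training": {"learning_rate": 2e-5}}
    print(apply_overrides(loaded, cli))
```

Using `None` as the default for each flag makes it possible to tell "not given" apart from an explicit value.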
### Creating Custom Configurations
1. Copy an existing config file
2. Modify parameters for your specific use case
3. Update paths and model names
4. Test with a small dataset first
## Best Practices
1. **Start with Defaults**: Use default values and adjust based on results
2. **Validate Paths**: Ensure all file paths are correct and accessible
3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
4. **Test Incrementally**: Test with small datasets before full processing
5. **Version Control**: Keep configurations in version control for reproducibility
## Troubleshooting
### Common Issues:
- **File Not Found**: Check `data_path` and `output_dir` paths
- **Memory Errors**: Reduce `batch_size` or `max_length`
- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
- **Split Errors**: Ensure split ratios sum to ≤ 1.0
### Getting Help:
- Check the script help: `python script.py --help`
- Review the pipeline logs for detailed error messages
- Verify YAML syntax and parameter values
+26 -26
@@ -1,6 +1,6 @@
 # Comprehensive Classification Configuration
 # This file defines all parameters for emotion classification using the dair-ai/emotion dataset
-# Organized by level: data processing, model, training, and inference
+# Organized by level: task, data processing, model, training, and inference
 # Task Configuration
 task:
@@ -15,9 +15,9 @@ data:
   data_format: "jsonl" # Data format: "jsonl", "csv", "json" (for custom data)
   # Field Mapping
-  input_field: "text" # Field name containing input text
-  label_field: "label" # Field name containing labels
-  id_field: null # Optional ID field name
+  input_field: "text" # Field name containing input text to be classified
+  label_field: "label" # Field name containing classification labels
+  id_field: null # Optional ID field name for tracking individual samples
   # Processing Parameters
   max_samples: 1000 # Maximum samples to process (null for all samples)
@@ -26,54 +26,54 @@ data:
   test_split: 0.1 # Test split ratio (0.0 to 1.0)
   # Text Preprocessing
-  clean_text: true # Clean and normalize text
-  remove_special_chars: false # Remove special characters from text
-  lowercase: true # Convert text to lowercase
+  clean_text: true # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  remove_special_chars: false # Remove special characters from text (keep for emotion analysis)
+  lowercase: true # Convert text to lowercase (standard for BERT models)
   min_length: 10 # Minimum text length (filter out shorter texts)
   max_length: 1000 # Maximum text length (truncate longer texts)
   # Label Processing
   label_encoding: "auto" # Label encoding: "auto", "numeric", "string"
-  multilabel: false # Enable multilabel classification
-  label_separator: "," # Separator for multilabel datasets
+  multilabel: false # Enable multilabel classification (false for single emotion per text)
+  label_separator: "," # Separator for multilabel datasets (comma-separated labels)
   # Output Configuration
   output_format: "classification" # Output format: "classification", "instruction", "conversation", "qa"
-  output_dir: "./data/processed/classification/emotion" # Specific output directory for this dataset
+  output_dir: "./data/processed/classification/emotion" # Output directory for processed data and splits
   # HuggingFace Specific
-  hf_split: "train" # HuggingFace dataset split to use
-  hf_cache_dir: null # HuggingFace cache directory (null for default)
+  hf_split: "train" # HuggingFace dataset split to use as base
+  hf_cache_dir: null # HuggingFace cache directory (null for default ~/.cache/huggingface)
   # Split Configuration (Advanced)
   test_split_from: "train" # Source for test split: "train", "use_test_if_available", "use_val_if_available"
   val_split_from: "train" # Source for validation split: "train", "use_val_if_available"
   # Custom Data Specific
-  encoding: "utf-8" # File encoding for custom data
-  delimiter: "," # Delimiter for CSV files
+  encoding: "utf-8" # File encoding for custom data files
+  delimiter: "," # Delimiter for CSV files (comma for standard CSV)
 # Model Configuration
 model:
-  name: "bert-base-uncased" # Model name from HuggingFace Hub
-  max_length: 512 # Maximum sequence length for tokenization
-  num_labels: 6 # Number of classification labels
+  name: "bert-base-uncased" # Model name from HuggingFace Hub (good for text classification)
+  max_length: 512 # Maximum sequence length for tokenization (BERT limit)
+  num_labels: 6 # Number of classification labels (emotion categories)
 # Training Configuration
 training:
-  num_epochs: 3 # Number of training epochs
-  batch_size: 16 # Training batch size
-  learning_rate: 2e-5 # Learning rate (typical range: 1e-5 to 5e-5)
-  weight_decay: 0.01 # Weight decay for optimizer (typical range: 0.01 to 0.1)
+  num_epochs: 3 # Number of training epochs (adjust based on dataset size)
+  batch_size: 16 # Training batch size (adjust based on GPU memory)
+  learning_rate: 2e-5 # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01 # Weight decay for optimizer (prevents overfitting)
   lr_scheduler_type: "linear" # Scheduler type: "linear", "cosine", "polynomial"
   warmup_ratio: 0.1 # Warmup ratio for scheduler (0.0 to 1.0)
   data_dir: "./data/processed/classification/emotion" # Directory containing train/validation/test JSONL files
-  output_dir: "./results/classification/emotion_model" # Output directory for saved model
+  output_dir: "./results/classification/emotion_model" # Output directory for saved model and checkpoints
 # Inference Configuration
 inference:
   model_path: "./results/classification/emotion_model" # Path to saved model directory
-  device: "auto" # Device: "auto", "cuda", "cpu"
-  batch_size: 32 # Batch size for inference
-  return_probabilities: true # Return all class probabilities
-  return_top_k: 3 # Return top K predictions
+  device: "auto" # Device: "auto", "cuda", "cpu" (auto detects best available)
+  batch_size: 32 # Batch size for inference (can be larger than training)
+  return_probabilities: true # Return all class probabilities (not just top prediction)
+  return_top_k: 3 # Return top K predictions (useful for confidence analysis)
+60 -20
@@ -1,29 +1,69 @@
+# Comprehensive Styling Configuration
+# This file defines all parameters for formal style transfer tasks
+# Organized by level: task, data processing, model, training, and inference
+# Task Configuration
 task:
-  name: "styling"
-  type: "style_transfer"
+  name: "styling" # Task type: classification, completion, styling, matching
+  type: "style_transfer" # Model type: style_transfer, text_generation, etc.
+# Data Processing Configuration
 data:
-  source: "custom"
-  input_field: "text"
-  style_field: "style"
-  max_length: 256
-  train_split: 0.8
-  validation_split: 0.1
-  test_split: 0.1
+  source: "custom" # Data source: "huggingface" or "custom"
+  data_path: "./data/raw/styling/sample_formal.jsonl" # Path to custom data file (required for custom source)
+  dataset_name: null # HuggingFace dataset name (required for huggingface source)
+  # Field Mapping
+  input_field: "text" # Field name containing source text to be styled
+  output_field: "styled_text" # Field name containing the styled/transformed text
+  # Style Instruction
+  instruction: "Rewrite the following text in a formal style" # The style instruction that guides the transformation
+  # Data Format & Processing
+  data_format: "jsonl" # Data format: "jsonl", "csv", "json" (for custom data)
+  max_length: 256 # Maximum text length (truncate longer texts)
+  min_length: 10 # Minimum text length (filter out shorter texts)
+  # Text Preprocessing
+  clean_text: true # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  lowercase: false # Convert text to lowercase (false for formal style to preserve case)
+  # Data Splitting
+  train_split: 0.8 # Training split ratio (0.0 to 1.0)
+  validation_split: 0.1 # Validation split ratio (0.0 to 1.0)
+  test_split: 0.1 # Test split ratio (0.0 to 1.0)
+  # Output Configuration
+  output_format: "alpaca" # Output format: "styling" (raw), "alpaca" (instruction format)
+  output_dir: "./data/processed/styling/formal" # Output directory for processed data and HuggingFace datasets
+# Model Configuration
 model:
-  name: "t5-base"
-  max_length: 256
+  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" # Model name from HuggingFace Hub
+  max_length: 2048 # Maximum sequence length for tokenization
+  max_seq_length: 2048 # Maximum sequence length for training (RoPE scaling supported)
+  dtype: null # Data type: null for auto detection, float16 for Tesla T4/V100, bfloat16 for Ampere+
+  load_in_4bit: true # Use 4bit quantization to reduce memory usage
+  token: null # HuggingFace token for gated models (e.g., "hf_...")
+  # Training Model Parameters
+  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit" # Model to use for training
+  training_max_seq_length: 2048 # Max sequence length for training
+  training_dtype: null # Data type for training
+  training_load_in_4bit: true # 4bit quantization for training
+# Training Configuration
 training:
-  num_epochs: 3
-  batch_size: 16
-  learning_rate: 3e-5
-  weight_decay: 0.01
-  warmup_ratio: 0.1
-  lr_scheduler_type: "linear"
+  num_epochs: 3 # Number of training epochs
+  batch_size: 16 # Training batch size (adjust based on GPU memory)
+  learning_rate: 3e-5 # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01 # Weight decay for optimizer (prevents overfitting)
+  warmup_ratio: 0.1 # Warmup ratio for scheduler (0.0 to 1.0)
+  lr_scheduler_type: "linear" # Scheduler type: "linear", "cosine", "polynomial"
+# Inference Configuration
 inference:
-  batch_size: 32
-  max_new_tokens: 128
-  temperature: 0.8
+  batch_size: 32 # Batch size for inference (can be larger than training)
+  max_new_tokens: 128 # Maximum new tokens to generate during inference
+  temperature: 0.8 # Sampling temperature (0.0 = deterministic, 1.0 = random)