added style mimicking pipelines

2025-08-13 21:17:01 +01:00
parent fd54d4be39
commit 710d074b47
31 changed files with 3816 additions and 46 deletions
@@ -0,0 +1,191 @@
+# Quick Reference Card
+
+## Essential Parameters (Most Common)
+
+### Data Source & Location
+```yaml
+data:
+  source: "huggingface|custom"             # REQUIRED: Data source type
+  dataset_name: "dataset/name"             # REQUIRED for huggingface
+  data_path: "./path/to/file"              # REQUIRED for custom
+  data_format: "jsonl|csv|json"            # REQUIRED for custom
+```
+
+### Field Mapping
+```yaml
+data:
+  input_field: "text"                      # REQUIRED: Input text field
+  label_field: "label"                     # REQUIRED for classification
+  output_field: "styled_text"              # REQUIRED for styling
+  instruction: "Style instruction"          # REQUIRED for styling
+```
+
+### Basic Processing
+```yaml
+data:
+  max_samples: 1000                        # Limit total samples
+  train_split: 0.8                         # Training ratio (0.0-1.0)
+  validation_split: 0.1                    # Validation ratio (0.0-1.0)
+  test_split: 0.1                          # Test ratio (0.0-1.0)
+  output_dir: "./output/path"              # Output directory
+```
+
+### Text Preprocessing
+```yaml
+data:
+  clean_text: true                         # Clean/normalize text
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 512                          # Maximum text length
+```
+
+### Model & Training
+```yaml
+model:
+  name: "bert-base-uncased"                # Model name
+  max_length: 512                          # Max sequence length
+
+training:
+  num_epochs: 3                            # Training epochs
+  batch_size: 16                           # Batch size
+  learning_rate: 2e-5                      # Learning rate
+```
+
+## Common Configurations by Task
+
+### Classification
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "dair-ai/emotion"
+  input_field: "text"
+  label_field: "label"
+  output_format: "classification"
+```
+
+### Styling
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./data.jsonl"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite in formal style"
+  output_format: "alpaca"
+```
+
+### Text Generation
+```yaml
+task:
+  name: "completion"
+  type: "text_generation"
+
+data:
+  source: "custom"
+  data_path: "./prompts.jsonl"
+  input_field: "prompt"
+  output_field: "completion"
+  output_format: "instruction"
+```
+
+## Quick Start Templates
+
+### 1. HuggingFace Dataset
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "your/dataset"
+  input_field: "text"
+  label_field: "label"
+  max_samples: 1000
+  output_dir: "./output"
+```
+
+### 2. Custom JSONL File
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./your_data.jsonl"
+  data_format: "jsonl"
+  input_field: "source"
+  output_field: "target"
+  instruction: "Your style instruction"
+  output_dir: "./output"
+```
+
+### 3. CSV File
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "custom"
+  data_path: "./your_data.csv"
+  data_format: "csv"
+  input_field: "text"
+  label_field: "label"
+  delimiter: ","
+  output_dir: "./output"
+```
+
+## Parameter Ranges & Recommendations
+
+### Split Ratios
+- **Total must be ≤ 1.0**
+- **Common**: train=0.8, val=0.1, test=0.1
+- **Small datasets**: train=0.7, val=0.15, test=0.15
+
+### Learning Rates
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+- **Start with**: 2e-5
+
+### Batch Sizes
+- **GPU**: 8, 16, 32, 64 (limited by available GPU memory)
+- **CPU**: 4, 8, 16
+- **Start with**: 16
+
+### Text Lengths
+- **BERT**: 512 (max)
+- **GPT-2**: 1024 (max)
+- **T5**: 512 (typical; T5 uses relative position embeddings, so this is not a hard limit)
+- **Start with**: 256
+
+## Common Issues & Fixes
+
+| Issue | Cause | Fix |
+|-------|-------|-----|
+| "File not found" | Wrong path | Check `data_path` and `output_dir` |
+| "Memory error" | Batch too large | Reduce `batch_size` |
+| "Split error" | Ratios > 1.0 | Ensure splits sum to ≤ 1.0 |
+| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
+| "Slow processing" | Text too long | Reduce `max_length` |
+
+## Environment Variables
+```bash
+# Set cache directory
+export HF_HOME="./cache"
+
+# Set output directory
+export OUTPUT_DIR="./results"
+
+# Set log level
+export LOG_LEVEL="INFO"
+```
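The split-ratio rule in the Quick Reference Card (each ratio between 0.0 and 1.0, total ≤ 1.0) could be checked before any processing starts. The following is a minimal sketch under that assumption, not the pipeline's actual validation; the function name `split_error` is hypothetical.

```python
import math
from typing import Optional


def split_error(train: float, validation: float, test: float) -> Optional[str]:
    """Return an error message if the split ratios are invalid, else None."""
    ratios = {
        "train_split": train,
        "validation_split": validation,
        "test_split": test,
    }
    for name, value in ratios.items():
        if not 0.0 <= value <= 1.0:
            return f"{name} must be between 0.0 and 1.0, got {value}"

    total = sum(ratios.values())
    # Tolerate floating-point rounding when the ratios are meant to sum to 1.0
    if total > 1.0 and not math.isclose(total, 1.0, abs_tol=1e-9):
        return f"split ratios sum to {total:.3f}, which exceeds 1.0"
    return None


common = split_error(0.8, 0.1, 0.1)        # the "Common" recommendation
small = split_error(0.7, 0.15, 0.15)       # the "Small datasets" recommendation
too_large = split_error(0.8, 0.2, 0.1)     # sums to 1.1
```

Returning a message instead of raising keeps the check easy to call from a config loader that wants to report several problems at once.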
@@ -0,0 +1,207 @@
+# Configuration Files Documentation
+
+This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.
+
+## Configuration Structure
+
+All configuration files follow a consistent structure organized into these main sections:
+
+### 1. Task Configuration
+```yaml
+task:
+  name: "task_type"                        # Task type: classification, completion, styling, matching
+  type: "specific_type"                    # Specific model/task type
+```
+
+**Available Task Types:**
+- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
+- **completion**: Text generation and completion tasks
+- **styling**: Style transfer and text transformation tasks
+- **matching**: Semantic matching and similarity tasks
+
+### 2. Data Processing Configuration
+```yaml
+data:
+  # Data Source
+  source: "huggingface|custom"             # Where to get data from
+  
+  # Data Location
+  dataset_name: "dataset/name"             # HuggingFace dataset name (for huggingface source)
+  data_path: "./path/to/file"              # Path to custom data file (for custom source)
+  data_format: "jsonl|csv|json"            # File format for custom data
+  
+  # Field Mapping
+  input_field: "text"                      # Field containing input text
+  output_field: "styled_text"              # Field containing output (for styling)
+  label_field: "label"                     # Field containing labels (for classification)
+  id_field: "id"                           # Optional ID field for tracking
+  
+  # Processing Parameters
+  max_samples: 1000                        # Maximum samples to process
+  train_split: 0.8                         # Training split ratio
+  validation_split: 0.1                    # Validation split ratio
+  test_split: 0.1                          # Test split ratio
+  
+  # Text Preprocessing
+  clean_text: true                         # Clean and normalize text
+  remove_special_chars: false              # Remove special characters
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 1000                         # Maximum text length
+  
+  # Output Configuration
+  output_format: "format_type"             # Output format
+  output_dir: "./output/path"              # Output directory
+```
+
+**Data Source Types:**
+- **huggingface**: Use datasets from HuggingFace Hub
+- **custom**: Use local files (JSONL, CSV, JSON)
+
+**Output Formats:**
+- **classification**: Raw classification format
+- **instruction**: Instruction-following format
+- **conversation**: Conversational format
+- **qa**: Question-answer format
+- **styling**: Raw styling format
+- **alpaca**: Alpaca instruction format
+
+### 3. Model Configuration
+```yaml
+model:
+  name: "model_name"                       # Model from HuggingFace Hub
+  max_length: 512                          # Maximum sequence length
+  num_labels: 6                            # Number of labels (for classification)
+```
+
+**Recommended Models by Task:**
+- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
+- **Styling**: `t5-base`, `gpt2-medium`
+- **Completion**: `gpt2-medium`, `gpt2-large`
+- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`
+
+### 4. Training Configuration
+```yaml
+training:
+  num_epochs: 3                            # Number of training epochs
+  batch_size: 16                           # Training batch size
+  learning_rate: 2e-5                      # Learning rate
+  weight_decay: 0.01                       # Weight decay
+  lr_scheduler_type: "linear"              # Learning rate scheduler
+  warmup_ratio: 0.1                        # Warmup ratio
+  data_dir: "./data/path"                  # Training data directory
+  output_dir: "./model/output"             # Model output directory
+```
+
+**Learning Rate Guidelines:**
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+
+**Scheduler Types:**
+- **linear**: Linear decay
+- **cosine**: Cosine annealing
+- **polynomial**: Polynomial decay
+
+### 5. Inference Configuration
+```yaml
+inference:
+  model_path: "./model/path"               # Path to saved model
+  device: "auto"                           # Device to use
+  batch_size: 32                           # Inference batch size
+  return_probabilities: true                # Return probabilities
+  return_top_k: 3                          # Return top K predictions
+  max_new_tokens: 128                      # Max tokens to generate
+  temperature: 0.8                         # Sampling temperature
+```
+
+**Device Options:**
+- **auto**: Automatically detect best device
+- **cuda**: Use GPU if available
+- **cpu**: Force CPU usage
+
+**Temperature Guidelines:**
+- **0.0**: Deterministic (always same output)
+- **0.7-0.9**: Balanced creativity
+- **1.0+**: More random/creative
+
+## Task-Specific Parameters
+
+### Classification Tasks
+```yaml
+data:
+  label_encoding: "auto|numeric|string"    # How to encode labels
+  multilabel: false                        # Multi-label vs single-label
+  label_separator: ","                     # Separator for multi-label
+```
+
+### Styling Tasks
+```yaml
+data:
+  instruction: "Style instruction text"    # The style instruction
+```
+
+### Completion Tasks
+```yaml
+data:
+  prompt_template: "template"               # Prompt template
+  completion_length: 100                   # Target completion length
+```
+
+## Advanced Configuration
+
+### HuggingFace Specific
+```yaml
+data:
+  hf_split: "train"                        # Dataset split to use
+  hf_cache_dir: "./cache"                  # Cache directory
+  test_split_from: "train"                 # Source for test split
+  val_split_from: "train"                  # Source for validation split
+```
+
+### Custom Data Specific
+```yaml
+data:
+  encoding: "utf-8"                        # File encoding
+  delimiter: ","                           # CSV delimiter
+```
+
+## Usage Examples
+
+### Basic Usage
+```bash
+# Use YAML configuration
+python scripts/task_type/data_processor.py --config configs/task_type/config.yaml
+
+# Override specific parameters
+python scripts/task_type/data_processor.py \
+  --config configs/task_type/config.yaml \
+  --max-samples 1000 \
+  --learning-rate 3e-5
+```
+
+### Creating Custom Configurations
+1. Copy an existing config file
+2. Modify parameters for your specific use case
+3. Update paths and model names
+4. Test with a small dataset first
+
+## Best Practices
+
+1. **Start with Defaults**: Use default values and adjust based on results
+2. **Validate Paths**: Ensure all file paths are correct and accessible
+3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
+4. **Test Incrementally**: Test with small datasets before full processing
+5. **Version Control**: Keep configurations in version control for reproducibility
+
+## Troubleshooting
+
+### Common Issues:
+- **File Not Found**: Check `data_path` and `output_dir` paths
+- **Memory Errors**: Reduce `batch_size` or `max_length`
+- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
+- **Split Errors**: Ensure split ratios sum to ≤ 1.0
+
+### Getting Help:
+- Check the script help: `python script.py --help`
+- Review the pipeline logs for detailed error messages
+- Verify YAML syntax and parameter values
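The "Override specific parameters" usage above implies that command-line flags take precedence over values from the YAML file. One way that merge could work is sketched below. This is an assumption about behavior, not the scripts' actual code: the function name `apply_overrides` is hypothetical, and the mapping of `--max-samples` to the `data` section and `--learning-rate` to the `training` section is inferred from where those keys appear in the documented config structure.

```python
import argparse
from typing import Any, Dict, List, Optional


def apply_overrides(
    config: Dict[str, Dict[str, Any]], argv: Optional[List[str]] = None
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of `config` with CLI flags applied on top (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    args = parser.parse_args(argv)

    # Copy each section so the loaded config is not mutated
    merged = {section: dict(values) for section, values in config.items()}

    # Only flags that were actually passed override the file values
    if args.max_samples is not None:
        merged.setdefault("data", {})["max_samples"] = args.max_samples
    if args.learning_rate is not None:
        merged.setdefault("training", {})["learning_rate"] = args.learning_rate
    return merged


base = {"data": {"max_samples": 500}, "training": {"learning_rate": 2e-5}}
merged = apply_overrides(base, ["--max-samples", "1000", "--learning-rate", "3e-5"])
unchanged = apply_overrides(base, [])
```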
@@ -1,6 +1,6 @@
 # Comprehensive Classification Configuration
 # This file defines all parameters for emotion classification using the dair-ai/emotion dataset
-# Organized by level: data processing, model, training, and inference
+# Organized by level: task, data processing, model, training, and inference

 # Task Configuration
 task:
@@ -15,9 +15,9 @@ data:
  data_format: "jsonl"                     # Data format: "jsonl", "csv", "json" (for custom data)
  
  # Field Mapping
-  input_field: "text"                      # Field name containing input text
-  label_field: "label"                     # Field name containing labels
-  id_field: null                           # Optional ID field name
+  input_field: "text"                      # Field name containing input text to be classified
+  label_field: "label"                     # Field name containing classification labels
+  id_field: null                           # Optional ID field name for tracking individual samples
  
  # Processing Parameters
  max_samples: 1000                        # Maximum samples to process (null for all samples)
@@ -26,54 +26,54 @@ data:
  test_split: 0.1                          # Test split ratio (0.0 to 1.0)
  
  # Text Preprocessing
-  clean_text: true                         # Clean and normalize text
-  remove_special_chars: false              # Remove special characters from text
-  lowercase: true                          # Convert text to lowercase
+  clean_text: true                         # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  remove_special_chars: false              # Remove special characters from text (keep for emotion analysis)
+  lowercase: true                          # Convert text to lowercase (standard for BERT models)
  min_length: 10                           # Minimum text length (filter out shorter texts)
  max_length: 1000                         # Maximum text length (truncate longer texts)
  
  # Label Processing
  label_encoding: "auto"                   # Label encoding: "auto", "numeric", "string"
-  multilabel: false                        # Enable multilabel classification
-  label_separator: ","                     # Separator for multilabel datasets
+  multilabel: false                        # Enable multilabel classification (false for single emotion per text)
+  label_separator: ","                     # Separator for multilabel datasets (comma-separated labels)
  
  # Output Configuration
  output_format: "classification"          # Output format: "classification", "instruction", "conversation", "qa"
-  output_dir: "./data/processed/classification/emotion"  # Specific output directory for this dataset
+  output_dir: "./data/processed/classification/emotion"  # Output directory for processed data and splits
  
  # HuggingFace Specific
-  hf_split: "train"                        # HuggingFace dataset split to use
-  hf_cache_dir: null                       # HuggingFace cache directory (null for default)
+  hf_split: "train"                        # HuggingFace dataset split to use as base
+  hf_cache_dir: null                       # HuggingFace cache directory (null for default ~/.cache/huggingface)
  
  # Split Configuration (Advanced)
  test_split_from: "train"                 # Source for test split: "train", "use_test_if_available", "use_val_if_available"
  val_split_from: "train"                  # Source for validation split: "train", "use_val_if_available"
  
  # Custom Data Specific
-  encoding: "utf-8"                        # File encoding for custom data
-  delimiter: ","                           # Delimiter for CSV files
+  encoding: "utf-8"                        # File encoding for custom data files
+  delimiter: ","                           # Delimiter for CSV files (comma for standard CSV)

 # Model Configuration
 model:
-  name: "bert-base-uncased"                # Model name from HuggingFace Hub
-  max_length: 512                          # Maximum sequence length for tokenization
-  num_labels: 6                            # Number of classification labels
+  name: "bert-base-uncased"                # Model name from HuggingFace Hub (good for text classification)
+  max_length: 512                          # Maximum sequence length for tokenization (BERT limit)
+  num_labels: 6                            # Number of classification labels (emotion categories)

 # Training Configuration
 training:
-  num_epochs: 3                            # Number of training epochs
-  batch_size: 16                           # Training batch size
-  learning_rate: 2e-5                      # Learning rate (typical range: 1e-5 to 5e-5)
-  weight_decay: 0.01                       # Weight decay for optimizer (typical range: 0.01 to 0.1)
+  num_epochs: 3                            # Number of training epochs (adjust based on dataset size)
+  batch_size: 16                           # Training batch size (adjust based on GPU memory)
+  learning_rate: 2e-5                      # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01                       # Weight decay for optimizer (prevents overfitting)
  lr_scheduler_type: "linear"              # Scheduler type: "linear", "cosine", "polynomial"
  warmup_ratio: 0.1                        # Warmup ratio for scheduler (0.0 to 1.0)
  data_dir: "./data/processed/classification/emotion"  # Directory containing train/validation/test JSONL files
-  output_dir: "./results/classification/emotion_model"  # Output directory for saved model
+  output_dir: "./results/classification/emotion_model"  # Output directory for saved model and checkpoints

 # Inference Configuration
 inference:
  model_path: "./results/classification/emotion_model"  # Path to saved model directory
-  device: "auto"                           # Device: "auto", "cuda", "cpu"
-  batch_size: 32                           # Batch size for inference
-  return_probabilities: true                # Return all class probabilities
-  return_top_k: 3                          # Return top K predictions
+  device: "auto"                           # Device: "auto", "cuda", "cpu" (auto detects best available)
+  batch_size: 32                           # Batch size for inference (can be larger than training)
+  return_probabilities: true                # Return all class probabilities (not just top prediction)
+  return_top_k: 3                          # Return top K predictions (useful for confidence analysis)
@@ -1,29 +1,69 @@
+# Comprehensive Styling Configuration
+# This file defines all parameters for formal style transfer tasks
+# Organized by level: task, data processing, model, training, and inference
+
+# Task Configuration
 task:
-  name: "styling"
-  type: "style_transfer"
+  name: "styling"                          # Task type: classification, completion, styling, matching
+  type: "style_transfer"                   # Model type: style_transfer, text_generation, etc.

+# Data Processing Configuration
 data:
-  source: "custom"
-  input_field: "text"
-  style_field: "style"
-  max_length: 256
-  train_split: 0.8
-  validation_split: 0.1
-  test_split: 0.1
+  source: "custom"                          # Data source: "huggingface" or "custom"
+  data_path: "./data/raw/styling/sample_formal.jsonl"  # Path to custom data file (required for custom source)
+  dataset_name: null                        # HuggingFace dataset name (required for huggingface source)
  
+  # Field Mapping
+  input_field: "text"                       # Field name containing source text to be styled
+  output_field: "styled_text"               # Field name containing the styled/transformed text
+  
+  # Style Instruction
+  instruction: "Rewrite the following text in a formal style"  # The style instruction that guides the transformation
+  
+  # Data Format & Processing
+  data_format: "jsonl"                      # Data format: "jsonl", "csv", "json" (for custom data)
+  max_length: 256                           # Maximum text length (truncate longer texts)
+  min_length: 10                            # Minimum text length (filter out shorter texts)
+  
+  # Text Preprocessing
+  clean_text: true                          # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  lowercase: false                          # Convert text to lowercase (false for formal style to preserve case)
+  
+  # Data Splitting
+  train_split: 0.8                          # Training split ratio (0.0 to 1.0)
+  validation_split: 0.1                     # Validation split ratio (0.0 to 1.0)
+  test_split: 0.1                           # Test split ratio (0.0 to 1.0)
+  
+  # Output Configuration
+  output_format: "alpaca"                   # Output format: "styling" (raw), "alpaca" (instruction format)
+  output_dir: "./data/processed/styling/formal"  # Output directory for processed data and HuggingFace datasets
+
+# Model Configuration
 model:
-  name: "t5-base"
-  max_length: 256
+  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Model name from HuggingFace Hub
+  max_length: 2048                          # Maximum sequence length for tokenization
+  max_seq_length: 2048                      # Maximum sequence length for training (RoPE scaling supported)
+  dtype: null                               # Data type: null for auto detection, float16 for Tesla T4/V100, bfloat16 for Ampere+
+  load_in_4bit: true                        # Use 4bit quantization to reduce memory usage
+  token: null                               # HuggingFace token for gated models (e.g., "hf_...")
  
+  # Training Model Parameters
+  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Model to use for training
+  training_max_seq_length: 2048             # Max sequence length for training
+  training_dtype: null                      # Data type for training
+  training_load_in_4bit: true               # 4bit quantization for training
+
+# Training Configuration
 training:
-  num_epochs: 3
-  batch_size: 16
-  learning_rate: 3e-5
-  weight_decay: 0.01
-  warmup_ratio: 0.1
-  lr_scheduler_type: "linear"
+  num_epochs: 3                             # Number of training epochs
+  batch_size: 16                            # Training batch size (adjust based on GPU memory)
+  learning_rate: 3e-5                       # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01                        # Weight decay for optimizer (prevents overfitting)
+  warmup_ratio: 0.1                         # Warmup ratio for scheduler (0.0 to 1.0)
+  lr_scheduler_type: "linear"               # Scheduler type: "linear", "cosine", "polynomial"

+# Inference Configuration
 inference:
-  batch_size: 32
-  max_new_tokens: 128
-  temperature: 0.8
+  batch_size: 32                            # Batch size for inference (can be larger than training)
+  max_new_tokens: 128                       # Maximum new tokens to generate during inference
+  temperature: 0.8                          # Sampling temperature (lower = more deterministic, higher = more random)
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "This is really cool stuff!", "output": "This is quite impressive material."}
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
@@ -0,0 +1,24 @@
+{
+  "citation": "",
+  "description": "",
+  "features": {
+    "instruction": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "input": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "output": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "text": {
+      "dtype": "string",
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": ""
+}
@@ -0,0 +1,13 @@
+{
+  "_data_files": [
+    {
+      "filename": "data-00000-of-00001.arrow"
+    }
+  ],
+  "_fingerprint": "4e028847697e7b16",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_output_all_columns": false,
+  "_split": null
+}
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "That's totally awesome!", "output": "That is quite remarkable!"}
@@ -0,0 +1,3 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
+{"instruction": "Rewrite the following text in a formal style", "input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
+{"instruction": "Rewrite the following text in a formal style", "input": "What's the deal with this?", "output": "What is the situation regarding this matter?"}
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "This is really cool stuff!", "output": "This is quite impressive material."}
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "That's totally awesome!", "output": "That is quite remarkable!"}
@@ -0,0 +1,3 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
+{"instruction": "Rewrite the following text in a formal style", "input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
+{"instruction": "Rewrite the following text in a formal style", "input": "What's the deal with this?", "output": "What is the situation regarding this matter?"}
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "This is really cool stuff!", "output": "This is quite impressive material."}
@@ -0,0 +1,5 @@
+{"text": "Hey, what's up? How are you doing today?", "styled_text": "Hello, how are you doing today?"}
+{"text": "This is really cool stuff!", "styled_text": "This is quite impressive material."}
+{"text": "I'm gonna go to the store later.", "styled_text": "I will go to the store later."}
+{"text": "What's the deal with this?", "styled_text": "What is the situation regarding this matter?"}
+{"text": "That's totally awesome!", "styled_text": "That is quite remarkable!"}
@@ -0,0 +1,3 @@
+{"input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
+{"input": "This is really cool stuff!", "output": "This is quite impressive material."}
+{"input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
@@ -0,0 +1,5 @@
+{"text": "Hello world", "styled_text": "Greetings, world."}
+{"styled_text": "This is a formal greeting."}
+{"text": "How are you?", "styled_text": "How are you doing?"}
+{"text": null, "styled_text": "Empty input example."}
+{"styled_text": "Another example with no input."}
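The file above mixes complete records with records whose `text` field is missing or `null`. How the pipeline treats such records is not shown in this part of the diff. One plausible approach, sketched here as an assumption, is to separate usable records from the rest before conversion; the function name `split_valid` is hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def split_valid(
    lines: Iterable[str],
    input_field: str = "text",
    output_field: str = "styled_text",
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split JSONL lines into (valid, skipped) records (hypothetical helper).

    A record counts as valid only if both fields are present and are strings.
    """
    valid, skipped = [], []
    for line in lines:
        record = json.loads(line)
        has_input = isinstance(record.get(input_field), str)
        has_output = isinstance(record.get(output_field), str)
        (valid if has_input and has_output else skipped).append(record)
    return valid, skipped


fixture_lines = [
    '{"text": "Hello world", "styled_text": "Greetings, world."}',
    '{"styled_text": "This is a formal greeting."}',
    '{"text": "How are you?", "styled_text": "How are you doing?"}',
    '{"text": null, "styled_text": "Empty input example."}',
    '{"styled_text": "Another example with no input."}',
]
valid, skipped = split_valid(fixture_lines)
```

Keeping the skipped records, rather than silently dropping them, makes it possible to log how many samples were filtered and why.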
@@ -0,0 +1,346 @@
+#!/usr/bin/env python3
+"""
+Styling Inference Pipeline using Trained Models
+Supports style transfer inference with streaming and batch processing
+"""
+
+import os
+import sys
+import json
+import logging
+import argparse
+from pathlib import Path
+from typing import Dict, Any, Optional, List, Union
+import yaml
+
+# Add the project root to the path
+sys.path.append(str(Path(__file__).parent.parent.parent))
+
+from utils.config.config_manager import ConfigManager
+from utils.logging.logging import setup_logging
+
+# Inference imports
+import torch
+from datasets import load_from_disk, Dataset
+from unsloth import FastLanguageModel
+from transformers import TextStreamer
+
+logger = logging.getLogger(__name__)
+
+class StylingInference:
+    """Styling task inference using trained models"""
+    
+    def __init__(self, config: Dict[str, Any]):
+        self.config = config
+        self.model = None
+        self.tokenizer = None
+        
+        # Set device
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        logger.info(f"Using device: {self.device}")
+        
+        # Model parameters
+        self.model_path = config.get('model_path')
+        self.max_seq_length = config.get('max_seq_length', 2048)
+        self.dtype = config.get('dtype', None)
+        self.load_in_4bit = config.get('load_in_4bit', True)
+        self.hf_token = config.get('hf_token', None)
+        
+        # Inference parameters
+        self.batch_size = config.get('batch_size', 1)
+        self.max_new_tokens = config.get('max_new_tokens', 128)
+        self.temperature = config.get('temperature', 0.8)
+        self.top_p = config.get('top_p', 0.9)
+        self.do_sample = config.get('do_sample', True)
+        
+        # Alpaca prompt template
+        self.alpaca_prompt = config.get('alpaca_prompt', """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that follows the instruction
+
+### Instruction:
+{}
+
+### Input:
+{}
+
+### Response:
+{}""")
+        
+        # Style instruction
+        self.style_instruction = config.get('style_instruction', 'Rewrite the following text in a formal style')
+    
+    def load_model_and_tokenizer(self):
+        """Load the trained model and tokenizer"""
+        logger.info("Loading model and tokenizer...")
+        
+        try:
+            if self.model_path and Path(self.model_path).exists():
+                # Load local trained model
+                logger.info(f"Loading local model from: {self.model_path}")
+                self.model, self.tokenizer = FastLanguageModel.from_pretrained(
+                    model_name=self.model_path,
+                    max_seq_length=self.max_seq_length,
+                    dtype=self.dtype,
+                    load_in_4bit=self.load_in_4bit,
+                    token=self.hf_token
+                )
+            else:
+                # Load base model from HuggingFace Hub
+                logger.info(f"Loading base model: {self.config.get('base_model_name', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit')}")
+                self.model, self.tokenizer = FastLanguageModel.from_pretrained(
+                    model_name=self.config.get('base_model_name', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit'),
+                    max_seq_length=self.max_seq_length,
+                    dtype=self.dtype,
+                    load_in_4bit=self.load_in_4bit,
+                    token=self.hf_token
+                )
+            
+            # Prepare for inference
+            FastLanguageModel.for_inference(self.model)
+            
+            logger.info("✅ Model loaded successfully")
+            logger.info(f"✅ Tokenizer loaded with vocab size: {self.tokenizer.vocab_size}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error loading model: {e}")
+            raise
+    
+    def format_prompt(self, instruction: str, input_text: str, output: str = "") -> str:
+        """Format the prompt using Alpaca template"""
+        return self.alpaca_prompt.format(instruction, input_text, output)
+    
+    def generate_text(self, prompt: str, max_new_tokens: Optional[int] = None) -> str:
+        """Generate text from a single prompt"""
+        try:
+            # Tokenize input
+            inputs = self.tokenizer([prompt], return_tensors="pt").to(self.device)
+            
+            # Set generation parameters
+            gen_kwargs = {
+                "max_new_tokens": max_new_tokens or self.max_new_tokens,
+                "temperature": self.temperature,
+                "top_p": self.top_p,
+                "do_sample": self.do_sample,
+                "use_cache": True,
+                "pad_token_id": self.tokenizer.eos_token_id
+            }
+            
+            # Generate
+            with torch.no_grad():
+                outputs = self.model.generate(**inputs, **gen_kwargs)
+            
+            # Decode only the newly generated tokens. Slicing the decoded string by
+            # len(prompt) is fragile: the decoded text need not reproduce the prompt
+            # character-for-character.
+            prompt_length = inputs["input_ids"].shape[1]
+            new_tokens = outputs[0][prompt_length:]
+            generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
+            
+            return generated_text
+            
+        except Exception as e:
+            logger.error(f"❌ Error generating text: {e}")
+            return ""
+    
+    def style_transfer(self, input_text: str, instruction: Optional[str] = None, streaming: bool = False) -> str:
+        """Perform style transfer on input text"""
+        if instruction is None:
+            instruction = self.style_instruction
+        
+        # Format prompt
+        prompt = self.format_prompt(instruction, input_text, "")
+        
+        logger.info(f"Style transfer prompt: {prompt}")
+        
+        if streaming:
+            logger.info("Generating with streaming...")
+            self.generate_text_streaming(prompt)
+            return ""
+        else:
+            logger.info("Generating text...")
+            result = self.generate_text(prompt)
+            logger.info(f"Generated result: {result}")
+            return result
+    
+    def generate_text_streaming(self, prompt: str, max_new_tokens: Optional[int] = None):
+        """Generate text with streaming output"""
+        try:
+            # Tokenize input
+            inputs = self.tokenizer([prompt], return_tensors="pt").to(self.device)
+            
+            # Setup text streamer
+            text_streamer = TextStreamer(self.tokenizer)
+            
+            # Set generation parameters
+            gen_kwargs = {
+                "max_new_tokens": max_new_tokens or self.max_new_tokens,
+                "temperature": self.temperature,
+                "top_p": self.top_p,
+                "do_sample": self.do_sample,
+                "use_cache": True,
+                "pad_token_id": self.tokenizer.eos_token_id
+            }
+            
+            # Generate with streaming
+            with torch.no_grad():
+                _ = self.model.generate(**inputs, streamer=text_streamer, **gen_kwargs)
+                
+        except Exception as e:
+            logger.error(f"❌ Error in streaming generation: {e}")
+    
+    def batch_style_transfer(self, input_texts: List[str], instruction: Optional[str] = None) -> List[str]:
+        """Perform style transfer on multiple input texts"""
+        results = []
+        
+        for i, input_text in enumerate(input_texts):
+            logger.info(f"Processing text {i+1}/{len(input_texts)}")
+            result = self.style_transfer(input_text, instruction)
+            results.append(result)
+        
+        return results
+
+def load_inference_config(config_path: str) -> Dict[str, Any]:
+    """Load inference configuration from YAML file"""
+    try:
+        with open(config_path, 'r', encoding='utf-8') as f:
+            config = yaml.safe_load(f)
+        
+        # Extract inference configuration
+        inference_config = {}
+        
+        # Model configuration
+        if 'model' in config:
+            model_data = config['model']
+            inference_config.update({
+                'base_model_name': model_data.get('training_model', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit'),
+                'max_seq_length': model_data.get('training_max_seq_length', 2048),
+                'dtype': model_data.get('training_dtype'),
+                'load_in_4bit': model_data.get('training_load_in_4bit', True),
+                'hf_token': model_data.get('training_token')
+            })
+        
+        # Inference configuration
+        if 'inference' in config:
+            inference_data = config['inference']
+            inference_config.update({
+                'batch_size': inference_data.get('batch_size', 1),
+                'max_new_tokens': inference_data.get('max_new_tokens', 128),
+                'temperature': inference_data.get('temperature', 0.8)
+            })
+        
+        # Style configuration
+        if 'data' in config:
+            data_config = config['data']
+            inference_config.update({
+                'style_instruction': data_config.get('instruction', 'Rewrite the following text in a formal style')
+            })
+        
+        return inference_config
+        
+    except Exception as e:
+        logger.error(f"Error loading inference config: {e}")
+        raise
+
+def main():
+    """Main inference function"""
+    parser = argparse.ArgumentParser(description="Styling Inference Pipeline")
+    
+    # Configuration
+    parser.add_argument("--config", type=str, required=True, help="Path to YAML configuration file")
+    parser.add_argument("--model-path", type=str, help="Path to trained model (optional, uses base model if not provided)")
+    
+    # Inference modes
+    parser.add_argument("--text", type=str, help="Single text to style transfer")
+    parser.add_argument("--input-file", type=str, help="File containing texts to process (one per line)")
+    
+    # Generation parameters
+    parser.add_argument("--max-tokens", type=int, help="Maximum new tokens to generate")
+    parser.add_argument("--temperature", type=float, help="Sampling temperature")
+    parser.add_argument("--streaming", action="store_true", help="Enable streaming generation")
+    parser.add_argument("--instruction", type=str, help="Custom style instruction")
+    
+    # Output
+    parser.add_argument("--output-file", type=str, help="Output file for results")
+    
+    args = parser.parse_args()
+    
+    # Setup logging
+    setup_logging()
+    
+    try:
+        # Load configuration
+        logger.info(f"Loading configuration from: {args.config}")
+        inference_config = load_inference_config(args.config)
+        
+        # Override with CLI arguments
+        if args.model_path:
+            inference_config['model_path'] = args.model_path
+        if args.max_tokens is not None:
+            inference_config['max_new_tokens'] = args.max_tokens
+        if args.temperature is not None:
+            inference_config['temperature'] = args.temperature
+        if args.instruction:
+            inference_config['style_instruction'] = args.instruction
+        
+        logger.info("Inference configuration:")
+        for key, value in inference_config.items():
+            logger.info(f"  {key}: {value}")
+        
+        # Initialize inference
+        inferencer = StylingInference(inference_config)
+        
+        # Load model
+        inferencer.load_model_and_tokenizer()
+        
+        # Run inference based on mode
+        if args.text:
+            # Single text inference
+            logger.info("Running single text inference...")
+            result = inferencer.style_transfer(args.text, args.instruction, args.streaming)
+            if not args.streaming:
+                print(f"\nGenerated text: {result}")
+        
+        elif args.input_file:
+            # Batch file inference
+            logger.info("Running batch file inference...")
+            with open(args.input_file, 'r', encoding='utf-8') as f:
+                input_texts = [line.strip() for line in f if line.strip()]
+            
+            results = inferencer.batch_style_transfer(input_texts, args.instruction)
+            
+            # Save results
+            output_file = args.output_file or f"{Path(args.input_file).stem}_styled.txt"
+            with open(output_file, 'w', encoding='utf-8') as f:
+                for input_text, result in zip(input_texts, results):
+                    f.write(f"Input: {input_text}\n")
+                    f.write(f"Output: {result}\n")
+                    f.write("-" * 50 + "\n")
+            
+            logger.info(f"✅ Results saved to: {output_file}")
+        
+        else:
+            # Interactive mode
+            logger.info("Entering interactive mode. Type 'quit' to exit.")
+            while True:
+                try:
+                    user_input = input("\nEnter text to style (or 'quit'): ").strip()
+                    if user_input.lower() == 'quit':
+                        break
+                    
+                    if user_input:
+                        result = inferencer.style_transfer(user_input, args.instruction, args.streaming)
+                        if not args.streaming:
+                            print(f"\nStyled text: {result}")
+                
+                except KeyboardInterrupt:
+                    break
+                except Exception as e:
+                    logger.error(f"Error processing input: {e}")
+        
+        logger.info("🎉 Inference completed successfully!")
+        
+    except Exception as e:
+        logger.error(f"❌ Inference failed: {e}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,446 @@
+#!/usr/bin/env python3
+"""
+Styling Training Pipeline using Unsloth and SFTTrainer
+Supports style transfer tasks with LoRA fine-tuning
+"""
+
+import os
+import sys
+import json
+import logging
+import argparse
+from pathlib import Path
+from typing import Dict, Any, Optional
+import yaml
+
+# Add the project root to the path
+sys.path.append(str(Path(__file__).parent.parent.parent))
+
+from utils.config.config_manager import ConfigManager
+#from utils.logging.logging import setup_logging
+
+# Training imports
+import torch
+from datasets import load_from_disk, Dataset
+from unsloth import FastLanguageModel, is_bfloat16_supported
+from trl import SFTTrainer
+from transformers import TrainingArguments
+
+logger = logging.getLogger(__name__)
+
+class StylingTrainer:
+    """Styling task trainer using Unsloth and SFTTrainer"""
+    
+    def __init__(self, config: Dict[str, Any]):
+        self.config = config
+        self.model = None
+        self.tokenizer = None
+        self.trainer = None
+        
+        # Set device
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        logger.info(f"Using device: {self.device}")
+        
+        # Training parameters
+        self.max_seq_length = config.get('max_seq_length', 2048)
+        self.dtype = config.get('dtype', None)
+        self.load_in_4bit = config.get('load_in_4bit', True)
+        self.hf_token = config.get('hf_token', None)
+        
+        # LoRA parameters
+        self.lora_r = config.get('lora_r', 16)
+        self.lora_alpha = config.get('lora_alpha', 16)
+        self.lora_dropout = config.get('lora_dropout', 0)
+        self.target_modules = config.get('target_modules', [
+            "q_proj", "k_proj", "v_proj", "o_proj",
+            "gate_proj", "up_proj", "down_proj"
+        ])
+        
+        # Training arguments
+        self.batch_size = config.get('batch_size', 2)
+        self.gradient_accumulation_steps = config.get('gradient_accumulation_steps', 4)
+        self.learning_rate = config.get('learning_rate', 2e-4)
+        self.num_epochs = config.get('num_epochs', 1)
+        self.max_steps = config.get('max_steps', None)
+        self.warmup_steps = config.get('warmup_steps', 5)
+        self.weight_decay = config.get('weight_decay', 0.01)
+        self.seed = config.get('seed', 3407)
+        
+        # Output paths
+        self.output_dir = config.get('output_dir', './outputs')
+        self.model_output_dir = config.get('model_output_dir', './models/styling')
+        
+    def load_model_and_tokenizer(self):
+        """Load the pre-trained model and tokenizer"""
+        logger.info("Loading model and tokenizer...")
+        
+        try:
+            self.model, self.tokenizer = FastLanguageModel.from_pretrained(
+                model_name=self.config['model_name'],
+                max_seq_length=self.max_seq_length,
+                dtype=self.dtype,
+                load_in_4bit=self.load_in_4bit,
+                token=self.hf_token
+            )
+            
+            logger.info(f"✅ Model loaded: {self.config['model_name']}")
+            logger.info(f"✅ Tokenizer loaded with vocab size: {self.tokenizer.vocab_size}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error loading model: {e}")
+            raise
+    
+    def setup_lora(self):
+        """Setup LoRA for efficient fine-tuning"""
+        logger.info("Setting up LoRA configuration...")
+        
+        try:
+            self.model = FastLanguageModel.get_peft_model(
+                self.model,
+                r=self.lora_r,
+                target_modules=self.target_modules,
+                lora_alpha=self.lora_alpha,
+                lora_dropout=self.lora_dropout,
+                bias="none",
+                use_gradient_checkpointing="unsloth",
+                random_state=self.seed,
+                use_rslora=False,
+                loftq_config=None
+            )
+            
+            logger.info(f"✅ LoRA configured with r={self.lora_r}, alpha={self.lora_alpha}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error setting up LoRA: {e}")
+            raise
+    
+    def load_dataset(self, dataset_path: str) -> Dataset:
+        """Load the training dataset"""
+        logger.info(f"Loading dataset from: {dataset_path}")
+        
+        try:
+            if Path(dataset_path).exists():
+                # Check if it's a HuggingFace dataset directory
+                if (Path(dataset_path) / "dataset_info.json").exists():
+                    # Load from HuggingFace dataset directory
+                    dataset = load_from_disk(dataset_path)
+                    logger.info(f"Loaded HuggingFace dataset from disk: {len(dataset)} samples")
+                else:
+                    # Load from processed data files (JSONL format)
+                    logger.info("Loading from processed data files...")
+                    # Dataset and json are already imported at module level; re-importing
+                    # them here would shadow those names as function locals.
+                    
+                    all_data = []
+                    data_dir = Path(dataset_path)
+                    
+                    # Look for train.jsonl, validation.jsonl, test.jsonl
+                    for split_file in ["train.jsonl", "validation.jsonl", "test.jsonl"]:
+                        file_path = data_dir / split_file
+                        if file_path.exists():
+                            logger.info(f"Loading {split_file}...")
+                            with open(file_path, 'r', encoding='utf-8') as f:
+                                for line in f:
+                                    if line.strip():
+                                        data = json.loads(line)
+                                        all_data.append(data)
+                    
+                    if not all_data:
+                        raise ValueError(f"No data found in {dataset_path}")
+                    
+                    # Create HuggingFace dataset
+                    dataset = Dataset.from_list(all_data)
+                    logger.info(f"Created HuggingFace dataset from {len(all_data)} samples")
+            else:
+                from datasets import load_dataset  # Hub loader; not a Dataset method
+                logger.info(f"Attempting to load from HuggingFace Hub: {dataset_path}")
+                dataset = load_dataset(dataset_path, split="train")
+                logger.info(f"Loaded from HuggingFace Hub: {len(dataset)} samples")
+            
+            logger.info(f"Dataset loaded: {len(dataset)} samples")
+            logger.info(f"Dataset features: {dataset.features}")
+            
+            # Verify required fields exist
+            required_fields = ["instruction", "input", "output"]
+            missing_fields = [field for field in required_fields if field not in dataset.features]
+            if missing_fields:
+                raise ValueError(f"Missing required fields in dataset: {missing_fields}")
+            
+            return dataset
+            
+        except Exception as e:
+            logger.error(f"Error loading dataset: {e}")
+            raise
+    
+    def setup_trainer(self, train_dataset: Dataset):
+        """Setup the SFTTrainer"""
+        logger.info("Setting up SFTTrainer...")
+        
+        try:
+            # First, map the dataset to create the text field with EOS token
+            def formatting_prompts_func(examples):
+                instructions = examples["instruction"]
+                inputs = examples["input"]
+                outputs = examples["output"]
+                texts = []
+                
+                for instruction, input_text, output in zip(instructions, inputs, outputs):
+                    # Must add EOS_TOKEN, otherwise your generation will go on forever!
+                    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that follows the instruction
+
+### Instruction:
+{}
+
+### Input:
+{}
+
+### Response:
+{}"""
+                    text = alpaca_prompt.format(instruction, input_text, output) + self.tokenizer.eos_token
+                    texts.append(text)
+                
+                return {"text": texts}
+            
+            # Apply the formatting function to create the text field
+            logger.info("Mapping dataset to create text field with EOS token...")
+            formatted_dataset = train_dataset.map(formatting_prompts_func, batched=True, remove_columns=train_dataset.column_names)
+            
+            logger.info(f"Dataset mapped successfully. New features: {formatted_dataset.features}")
+            logger.info(f"Sample text field: {formatted_dataset[0]['text'][:100]}...")
+            
+            # Training arguments
+            training_args = TrainingArguments(
+                per_device_train_batch_size=self.batch_size,
+                gradient_accumulation_steps=self.gradient_accumulation_steps,
+                warmup_steps=self.warmup_steps,
+                num_train_epochs=self.num_epochs,
+                max_steps=self.max_steps if self.max_steps is not None else -1,  # -1 defers to num_train_epochs
+                learning_rate=self.learning_rate,
+                fp16=not is_bfloat16_supported(),
+                bf16=is_bfloat16_supported(),
+                logging_steps=1,
+                optim="adamw_8bit",
+                weight_decay=self.weight_decay,
+                lr_scheduler_type="linear",
+                seed=self.seed,
+                output_dir=self.output_dir,
+                report_to="none",  # Disable wandb for now
+                save_strategy="epoch",
+                save_total_limit=2,
+                evaluation_strategy="no",  # No validation for now
+                load_best_model_at_end=False,
+                remove_unused_columns=False,
+                dataloader_pin_memory=False,
+            )
+            
+            # Create trainer with the formatted dataset
+            self.trainer = SFTTrainer(
+                model=self.model,
+                tokenizer=self.tokenizer,
+                train_dataset=formatted_dataset,  # Use the formatted dataset
+                dataset_text_field="text",  # The field we just created
+                max_seq_length=self.max_seq_length,
+                dataset_num_proc=2,
+                packing=False,  # Setting this to True can make training up to 5x faster for short sequences
+                args=training_args
+            )
+            
+            logger.info("SFTTrainer configured successfully")
+            
+        except Exception as e:
+            logger.error(f"Error setting up trainer: {e}")
+            raise
+    
+    def train(self, dataset_path: str):
+        """Run the training process"""
+        logger.info("🚀 Starting training process...")
+        
+        try:
+            # Load model and tokenizer
+            self.load_model_and_tokenizer()
+            
+            # Setup LoRA
+            self.setup_lora()
+            
+            # Load dataset
+            train_dataset = self.load_dataset(dataset_path)
+            
+            # Setup trainer
+            self.setup_trainer(train_dataset)
+            
+            # Start training
+            logger.info("Starting training...")
+            trainer_stats = self.trainer.train()
+            
+            logger.info("✅ Training completed successfully!")
+            logger.info(f"Training stats: {trainer_stats}")
+            
+            # Save the model
+            self.save_model()
+            
+            return trainer_stats
+            
+        except Exception as e:
+            logger.error(f"❌ Training failed: {e}")
+            raise
+    
+    def save_model(self):
+        """Save the trained model"""
+        logger.info("Saving trained model...")
+        
+        try:
+            # Create output directory
+            Path(self.model_output_dir).mkdir(parents=True, exist_ok=True)
+            
+            # Save model and tokenizer
+            self.model.save_pretrained(self.model_output_dir)
+            self.tokenizer.save_pretrained(self.model_output_dir)
+            
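+            # Save training config, omitting the HF token so it is not written to disk
+            config_path = Path(self.model_output_dir) / "training_config.json"
+            with open(config_path, 'w', encoding='utf-8') as f:
+                json.dump({k: v for k, v in self.config.items() if k != 'hf_token'}, f, indent=2)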
+            
+            logger.info(f"✅ Model saved to: {self.model_output_dir}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error saving model: {e}")
+            raise
+    
+    def prepare_for_inference(self):
+        """Prepare model for inference"""
+        logger.info("Preparing model for inference...")
+        
+        try:
+            FastLanguageModel.for_inference(self.model)
+            logger.info("✅ Model prepared for inference")
+            
+        except Exception as e:
+            logger.error(f"❌ Error preparing for inference: {e}")
+            raise
+
+def load_training_config(config_path: str) -> Dict[str, Any]:
+    """Load training configuration from YAML file"""
+    try:
+        with open(config_path, 'r', encoding='utf-8') as f:
+            config = yaml.safe_load(f)
+        
+        # Extract training configuration
+        training_config = {}
+        
+        # Model configuration
+        if 'model' in config:
+            model_data = config['model']
+            training_config.update({
+                'model_name': model_data.get('training_model', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit'),
+                'max_seq_length': model_data.get('training_max_seq_length', 2048),
+                'dtype': model_data.get('training_dtype'),
+                'load_in_4bit': model_data.get('training_load_in_4bit', True),
+                'hf_token': model_data.get('training_token')
+            })
+        
+        # Training configuration
+        if 'training' in config:
+            training_data = config['training']
+            training_config.update({
+                'num_epochs': training_data.get('num_epochs', 3),
+                'batch_size': training_data.get('batch_size', 2),
+                'learning_rate': training_data.get('learning_rate', 2e-4),
+                'weight_decay': training_data.get('weight_decay', 0.01),
+                'warmup_ratio': training_data.get('warmup_ratio', 0.1),
+                'lr_scheduler_type': training_data.get('lr_scheduler_type', 'linear')
+            })
+        
+        # Data configuration - use output_dir from data section
+        if 'data' in config:
+            data_config = config['data']
+            output_dir = data_config.get('output_dir', './data/processed/styling')
+            training_config.update({
+                'data_output_dir': output_dir,
+                'dataset_path': output_dir,  # Default dataset path is the output_dir
+                'style_instruction': data_config.get('instruction', 'Rewrite the following text in a formal style')
+            })
+        
+        # LoRA configuration
+        training_config.update({
+            'lora_r': 16,
+            'lora_alpha': 16,
+            'lora_dropout': 0,
+            'target_modules': [
+                "q_proj", "k_proj", "v_proj", "o_proj",
+                "gate_proj", "up_proj", "down_proj"
+            ],
+            'gradient_accumulation_steps': 4,
+            'max_steps': None,
+            'warmup_steps': 5,
+            'seed': 3407,
+            'output_dir': './outputs',
+            'model_output_dir': './models/styling'
+        })
+        
+        return training_config
+        
+    except Exception as e:
+        logger.error(f"Error loading training config: {e}")
+        raise
+
+def main():
+    """Main training function"""
+    parser = argparse.ArgumentParser(description="Styling Training Pipeline")
+    
+    # Configuration
+    parser.add_argument("--config", type=str, required=True, help="Path to YAML configuration file")
+    parser.add_argument("--dataset", type=str, help="Path to training dataset (HF dataset path or local path)")
+    parser.add_argument("--output-dir", type=str, help="Output directory for model")
+    parser.add_argument("--epochs", type=int, help="Number of training epochs")
+    parser.add_argument("--batch-size", type=int, help="Training batch size")
+    parser.add_argument("--learning-rate", type=float, help="Learning rate")
+    parser.add_argument("--max-steps", type=int, help="Maximum training steps")
+    
+    args = parser.parse_args()
+    
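+    # Setup logging (setup_logging is not imported, so configure the root logger directly)
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")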
+    
+    try:
+        # Load configuration
+        logger.info(f"Loading configuration from: {args.config}")
+        training_config = load_training_config(args.config)
+        
+        # Override with CLI arguments
+        if args.output_dir:
+            training_config['model_output_dir'] = args.output_dir
+        if args.epochs:
+            training_config['num_epochs'] = args.epochs
+        if args.batch_size:
+            training_config['batch_size'] = args.batch_size
+        if args.learning_rate:
+            training_config['learning_rate'] = args.learning_rate
+        if args.max_steps:
+            training_config['max_steps'] = args.max_steps
+        
+        # Determine dataset path: CLI argument takes precedence, then YAML config
+        dataset_path = args.dataset or training_config.get('dataset_path')
+        if not dataset_path:
+            logger.error("No dataset path provided. Use --dataset or ensure output_dir is set in YAML config.")
+            sys.exit(1)
+        
+        logger.info("Training configuration:")
+        for key, value in training_config.items():
+            logger.info(f"  {key}: {value}")
+        logger.info(f"  Dataset path: {dataset_path}")
+        
+        # Initialize trainer
+        trainer = StylingTrainer(training_config)
+        
+        # Start training
+        trainer.train(dataset_path)
+        
+        logger.info("Training completed successfully!")
+        
+    except Exception as e:
+        logger.error(f"Training failed: {e}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,45 @@
+"""
+Styling Scripts Package
+Provides command-line interfaces for styling data processing, training, and inference
+"""
+
+from .data_processor import (
+    run_with_yaml_config,
+    run_styling_examples,
+    create_sample_styling_data,
+    create_custom_styling_config,
+    show_styling_features
+)
+
+from .train import (
+    run_training_with_config,
+    create_training_example,
+    show_training_features
+)
+
+from .inference import (
+    run_inference_with_config,
+    create_inference_example,
+    run_batch_inference_example,
+    show_inference_features
+)
+
+__all__ = [
+    # Data processing
+    'run_with_yaml_config',
+    'run_styling_examples',
+    'create_sample_styling_data',
+    'create_custom_styling_config',
+    'show_styling_features',
+    
+    # Training
+    'run_training_with_config',
+    'create_training_example',
+    'show_training_features',
+    
+    # Inference
+    'run_inference_with_config',
+    'create_inference_example',
+    'run_batch_inference_example',
+    'show_inference_features'
+]
@@ -0,0 +1,302 @@
+#!/usr/bin/env python3
+"""
+Styling data processor script that uses YAML configurations.
+This provides a flexible and maintainable approach for style transfer tasks.
+"""
+
+import sys
+import os
+import subprocess
+import argparse
+from pathlib import Path
+
+def run_with_yaml_config(config_path: str, **cli_overrides):
+    """Run styling data processor with YAML configuration"""
+    print(f"=== Running Styling Data Processor with YAML config: {config_path} ===")
+    
+    cmd = [
+        sys.executable, "pipelines/styling/data_processor.py",  # reuse the current interpreter
+        "--config", config_path
+    ]
+    
+    # Add CLI overrides
+    for key, value in cli_overrides.items():
+        if value is not None:
+            cmd.extend([f"--{key.replace('_', '-')}", str(value)])
+    
+    print(f"Running command: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("✅ Styling data processing completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Error running styling data processor: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def run_styling_examples():
+    """Run styling examples with YAML configs"""
+    
+    # Example 1: Formal style transfer
+    print("=== Example 1: Formal Style Transfer ===")
+    success = run_with_yaml_config(
+        "configs/styling/formal.yaml",
+        max_samples=1000,  # Override YAML value
+        output_format="alpaca"
+    )
+    
+    if success:
+        print("✅ Formal style transfer completed!")
+    
+    # Example 2: Custom styling dataset (if available)
+    print("\n=== Example 2: Custom Styling Dataset ===")
+    if os.path.exists("data/raw/styling/custom_dataset.jsonl"):
+        success = run_with_yaml_config(
+            "configs/styling/formal.yaml",  # Use formal config as base
+            data_source="custom",
+            data_path="data/raw/styling/custom_dataset.jsonl",
+            instruction="Rewrite the following text in a casual, friendly style",
+            output_dir="./data/processed/styling/casual"
+        )
+        if success:
+            print("✅ Custom styling dataset processing completed!")
+    else:
+        print("⚠️  Custom styling dataset not found, skipping...")
+        print("   You can create one with the 'create-sample-data' option")
+
+def create_sample_styling_data():
+    """Create sample styling dataset for testing"""
+    sample_data = [
+        {
+            "text": "Hey, what's up? How are you doing today?",
+            "styled_text": "Hello, how are you doing today?"
+        },
+        {
+            "text": "This is really cool stuff!",
+            "styled_text": "This is quite impressive material."
+        },
+        {
+            "text": "I'm gonna go to the store later.",
+            "styled_text": "I will go to the store later."
+        },
+        {
+            "text": "What's the deal with this?",
+            "styled_text": "What is the situation regarding this matter?"
+        },
+        {
+            "text": "That's totally awesome!",
+            "styled_text": "That is quite remarkable!"
+        }
+    ]
+    
+    # Create directory structure
+    data_dir = Path("data/raw/styling")
+    data_dir.mkdir(parents=True, exist_ok=True)
+    
+    # Save sample data
+    import json
+    sample_file = data_dir / "sample_formal.jsonl"
+    with open(sample_file, 'w', encoding='utf-8') as f:
+        for item in sample_data:
+            f.write(json.dumps(item, ensure_ascii=False) + '\n')
+    
+    print(f"✅ Created sample styling dataset: {sample_file}")
+    print(f"   Contains {len(sample_data)} examples")
+    print("   Format: text → styled_text")
+    print("   Ready to use with configs/styling/formal.yaml")
+
+def create_custom_styling_config():
+    """Create a custom styling configuration file"""
+    custom_config = """task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite the following text in a professional business style"
+  data_format: "jsonl"
+  max_length: 512
+  min_length: 10
+  clean_text: true
+  lowercase: false
+  train_split: 0.8
+  validation_split: 0.1
+  test_split: 0.1
+  output_format: "alpaca"
+  output_dir: "./data/processed/styling/professional"
+
+model:
+  name: "t5-base"
+  max_length: 512
+
+training:
+  num_epochs: 3
+  batch_size: 16
+  learning_rate: 3e-5
+  weight_decay: 0.01
+  warmup_ratio: 0.1
+  lr_scheduler_type: "linear"
+
+inference:
+  batch_size: 32
+  max_new_tokens: 128
+  temperature: 0.8
+"""
+    
+    config_path = "configs/styling/professional.yaml"
+    os.makedirs(os.path.dirname(config_path), exist_ok=True)
+    
+    with open(config_path, 'w', encoding='utf-8') as f:
+        f.write(custom_config)
+    
+    print(f"✅ Created custom styling config: {config_path}")
+    print("   This config is set up for professional business style transfer")
+
+def handle_direct_args():
+    """Handle direct command-line arguments by passing them to the styling pipeline"""
+    parser = argparse.ArgumentParser(description="Styling Data Processor")
+    
+    # Add all the same arguments as the styling pipeline
+    parser.add_argument("--config", type=str, help="Path to YAML configuration file")
+    parser.add_argument("--data-source", choices=["huggingface", "custom"], help="Data source")
+    parser.add_argument("--dataset-name", type=str, help="HuggingFace dataset name")
+    parser.add_argument("--data-path", type=str, help="Path to custom data file")
+    parser.add_argument("--data-format", choices=["jsonl", "csv", "json"], help="Data format")
+    parser.add_argument("--input-field", type=str, help="Input field name")
+    parser.add_argument("--output-field", type=str, help="Output field name")
+    parser.add_argument("--instruction", type=str, help="Style instruction")
+    parser.add_argument("--max-samples", type=int, help="Maximum samples to process")
+    parser.add_argument("--train-split", type=float, help="Training split ratio")
+    parser.add_argument("--validation-split", type=float, help="Validation split ratio")
+    parser.add_argument("--test-split", type=float, help="Test split ratio")
+    parser.add_argument("--clean-text", action="store_true", help="Clean and normalize text")
+    parser.add_argument("--remove-special-chars", action="store_true", help="Remove special characters")
+    parser.add_argument("--lowercase", action="store_true", help="Convert text to lowercase")
+    parser.add_argument("--min-length", type=int, help="Minimum text length")
+    parser.add_argument("--max-length", type=int, help="Maximum text length")
+    parser.add_argument("--output-format", choices=["styling", "alpaca"], help="Output format")
+    parser.add_argument("--output-dir", type=str, help="Output directory")
+    
+    # HuggingFace dataset options
+    parser.add_argument("--create-hf-dataset", action="store_true", help="Create HuggingFace dataset")
+    parser.add_argument("--hf-dataset-path", type=str, help="Path to save HuggingFace dataset")
+    
+    # Logging
+    parser.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO", help="Logging level")
+    
+    args = parser.parse_args()
+    
+    # Build command to call the styling pipeline
+    cmd = [sys.executable, "pipelines/styling/data_processor.py"]  # same interpreter as this script
+    
+    # Add all arguments that were provided
+    for arg_name, arg_value in vars(args).items():
+        if arg_value is not None:
+            if isinstance(arg_value, bool):
+                if arg_value:  # Only add flag if True
+                    cmd.append(f"--{arg_name.replace('_', '-')}")
+            else:
+                cmd.extend([f"--{arg_name.replace('_', '-')}", str(arg_value)])
+    
+    print(f"Running: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("✅ Styling data processing completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Error running styling data processor: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def show_styling_features():
+    """Show the features of the styling data processor"""
+    print("=== Styling Data Processor Features ===")
+    print()
+    print("1. **Style Transfer Tasks**:")
+    print("   - Formal vs. Informal style")
+    print("   - Professional vs. Casual tone")
+    print("   - Academic vs. Conversational")
+    print("   - Any custom style instruction")
+    print()
+    print("2. **Data Formats Supported**:")
+    print("   - HuggingFace datasets")
+    print("   - Custom JSONL/CSV/JSON files")
+    print("   - Automatic train/validation/test splits")
+    print()
+    print("3. **Output Formats**:")
+    print("   - Raw styling format (input/output)")
+    print("   - Alpaca format (instruction/input/output)")
+    print("   - HuggingFace dataset format")
+    print()
+    print("4. **Advanced Features**:")
+    print("   - Configurable field mapping")
+    print("   - Text preprocessing options")
+    print("   - Automatic dataset saving/loading")
+    print("   - YAML configuration support")
+    print()
+    print("=== Usage Examples ===")
+    print()
+    print("1. Use YAML config only:")
+    print("   python scripts/styling/data_processor.py --config configs/styling/formal.yaml")
+    print()
+    print("2. Override YAML values:")
+    print("   python scripts/styling/data_processor.py --config configs/styling/formal.yaml --max-samples 500")
+    print()
+    print("3. Create sample data:")
+    print("   python scripts/styling/data_processor.py create-sample-data")
+    print()
+    print("4. Create custom config:")
+    print("   python scripts/styling/data_processor.py create-config")
+
+def main():
+    """Main function"""
+    if len(sys.argv) > 1:
+        # Check if it's a subcommand
+        if sys.argv[1] in ["examples", "create-sample-data", "create-config", "features"]:
+            # Handle subcommands
+            if sys.argv[1] == "examples":
+                run_styling_examples()
+            elif sys.argv[1] == "create-sample-data":
+                create_sample_styling_data()
+            elif sys.argv[1] == "create-config":
+                create_custom_styling_config()
+            elif sys.argv[1] == "features":
+                show_styling_features()
+        else:
+            # Handle direct arguments (pass through to pipeline)
+            sys.exit(0 if handle_direct_args() else 1)  # propagate pipeline failure as exit code
+    else:
+        print("Styling Data Processor")
+        print("=====================")
+        print()
+        print("This script runs the styling data processor for style transfer tasks.")
+        print("It supports both YAML configurations and command-line overrides.")
+        print()
+        print("Usage:")
+        print("  python scripts/styling/data_processor.py examples           # Run examples")
+        print("  python scripts/styling/data_processor.py create-sample-data # Create sample dataset")
+        print("  python scripts/styling/data_processor.py create-config      # Create custom config")
+        print("  python scripts/styling/data_processor.py features           # Show features")
+        print()
+        print("Direct pipeline usage:")
+        print("  python scripts/styling/data_processor.py --config configs/styling/formal.yaml")
+        print("  python scripts/styling/data_processor.py --data-source custom --data-path ./data.jsonl")
+        print()
+        print("Key Features:")
+        print("  ✅ Style transfer with custom instructions")
+        print("  ✅ Multiple data source support")
+        print("  ✅ YAML configuration files")
+        print("  ✅ CLI argument overrides")
+        print("  ✅ Automatic data splitting")
+        print("  ✅ HuggingFace dataset export")
+
+if __name__ == "__main__":
+    main()
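The forwarding loop in `handle_direct_args` turns the parsed namespace back into command-line flags and prints the command with `' '.join(cmd)`, which drops the quoting around values that contain spaces (such as an `--instruction` sentence). A minimal sketch of the same idea as a standalone function follows. The helper name `namespace_to_argv` and the three demo arguments are assumptions made for illustration, not part of the commit; `shlex.join` is used only to make the printed command copy-pasteable.

```python
import argparse
import shlex


def namespace_to_argv(args: argparse.Namespace) -> list[str]:
    """Convert parsed arguments back into CLI flags (hypothetical helper)."""
    argv: list[str] = []
    for name, value in vars(args).items():
        flag = f"--{name.replace('_', '-')}"
        if value is None or value is False:
            continue  # option not supplied, or store_true flag left unset
        if value is True:
            argv.append(flag)  # bare flag, no value
        else:
            argv.extend([flag, str(value)])
    return argv


# Small demo parser; the real script defines many more options.
parser = argparse.ArgumentParser()
parser.add_argument("--instruction", type=str)
parser.add_argument("--max-samples", type=int)
parser.add_argument("--lowercase", action="store_true")
args = parser.parse_args(["--instruction", "Rewrite formally", "--max-samples", "5"])

argv = namespace_to_argv(args)
print(shlex.join(argv))  # quoted form that is safe to paste into a shell
```

Keeping the conversion in one function makes it easy to unit-test separately from the `subprocess.run` call.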
@@ -0,0 +1,223 @@
+#!/usr/bin/env python3
+"""
+Styling Inference Script
+Provides a command-line interface to run the styling inference pipeline
+"""
+
+import sys
+import os
+import subprocess
+import argparse
+from pathlib import Path
+
+def run_inference_with_config(config_path: str, **cli_overrides):
+    """Run the styling inference pipeline with YAML configuration"""
+    print(f"🚀 Starting styling inference with config: {config_path}")
+    print()
+    
+    # Build command
+    cmd = [sys.executable, "pipelines/styling/inference.py", "--config", config_path]
+    
+    # Add CLI overrides
+    for key, value in cli_overrides.items():
+        if value is not None:
+            if key == "model_path":
+                cmd.extend(["--model-path", str(value)])
+            elif key == "text":
+                cmd.extend(["--text", str(value)])
+            elif key == "input_file":
+                cmd.extend(["--input-file", str(value)])
+            elif key == "max_tokens":
+                cmd.extend(["--max-tokens", str(value)])
+            elif key == "temperature":
+                cmd.extend(["--temperature", str(value)])
+            elif key == "instruction":
+                cmd.extend(["--instruction", str(value)])
+            elif key == "output_file":
+                cmd.extend(["--output-file", str(value)])
+            elif key == "streaming" and value:  # store_true default is False, which is not None
+                cmd.append("--streaming")
+    
+    print(f"Running: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("✅ Inference completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Inference failed: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def show_inference_features():
+    """Show the features of the styling inference pipeline"""
+    print("=== Styling Inference Pipeline Features ===")
+    print()
+    print("1. **Model Support**:")
+    print("   - Trained LoRA models")
+    print("   - Base models from HuggingFace Hub")
+    print("   - Automatic model loading and preparation")
+    print()
+    print("2. **Inference Modes**:")
+    print("   - Single text inference")
+    print("   - Batch file processing")
+    print("   - Interactive mode")
+    print("   - Streaming generation")
+    print()
+    print("3. **Generation Control**:")
+    print("   - Configurable temperature and top-p")
+    print("   - Adjustable max tokens")
+    print("   - Custom style instructions")
+    print()
+    print("4. **Output Options**:")
+    print("   - Console output")
+    print("   - File output")
+    print("   - Streaming real-time generation")
+
+def create_inference_example():
+    """Create an inference example using the formal style configuration"""
+    print("=== Inference Example: Formal Style Transfer ===")
+    print()
+    
+    # Check if we have the required files
+    config_path = "configs/styling/formal.yaml"
+    
+    if not Path(config_path).exists():
+        print(f"❌ Configuration file not found: {config_path}")
+        print("   Create it first, e.g. by adapting the output of 'data_processor.py create-config'")
+        return False
+    
+    print("✅ Found configuration file!")
+    print(f"   Config: {config_path}")
+    print()
+    
+    # Example text
+    example_text = "Hey, what's up? I'm gonna go grab some food later."
+    
+    print(f"📝 Example text: {example_text}")
+    print()
+    
+    # Run inference
+    success = run_inference_with_config(
+        config_path=config_path,
+        text=example_text,
+        instruction="Rewrite the following text in a formal style"
+    )
+    
+    if success:
+        print("🎉 Inference example completed!")
+    
+    return success
+
+def create_test_file():
+    """Create a test file with sample texts for batch inference"""
+    test_file = "test_texts.txt"
+    
+    test_texts = [
+        "Hey, what's up? How are you doing today?",
+        "I'm gonna go to the store later to get some stuff.",
+        "This is pretty cool, right?",
+        "Can you help me out with this?",
+        "Thanks a lot for your help!"
+    ]
+    
+    with open(test_file, 'w', encoding='utf-8') as f:
+        for text in test_texts:
+            f.write(text + '\n')
+    
+    print(f"✅ Created test file: {test_file}")
+    print(f"   Contains {len(test_texts)} sample texts")
+    return test_file
+
+def run_batch_inference_example():
+    """Run a batch inference example"""
+    print("=== Batch Inference Example ===")
+    print()
+    
+    # Create test file
+    test_file = create_test_file()
+    
+    # Check configuration
+    config_path = "configs/styling/formal.yaml"
+    if not Path(config_path).exists():
+        print(f"❌ Configuration file not found: {config_path}")
+        return False
+    
+    print("✅ Running batch inference...")
+    print()
+    
+    # Run batch inference
+    success = run_inference_with_config(
+        config_path=config_path,
+        input_file=test_file,
+        output_file="styled_results.txt",
+        instruction="Rewrite the following text in a formal style"
+    )
+    
+    if success:
+        print("🎉 Batch inference completed!")
+        print("   Results saved to: styled_results.txt")
+    
+    return success
+
+def main():
+    """Main function"""
+    parser = argparse.ArgumentParser(description="Styling Inference Script")
+    
+    # Subcommands
+    parser.add_argument("command", choices=["infer", "example", "batch", "features"], 
+                       help="Command to run")
+    
+    # Inference arguments
+    parser.add_argument("--config", type=str, help="Path to YAML configuration file")
+    parser.add_argument("--model-path", type=str, help="Path to trained model")
+    parser.add_argument("--text", type=str, help="Single text to style transfer")
+    parser.add_argument("--input-file", type=str, help="File containing texts to process")
+    parser.add_argument("--max-tokens", type=int, help="Maximum new tokens to generate")
+    parser.add_argument("--temperature", type=float, help="Sampling temperature")
+    parser.add_argument("--instruction", type=str, help="Custom style instruction")
+    parser.add_argument("--output-file", type=str, help="Output file for results")
+    parser.add_argument("--streaming", action="store_true", help="Enable streaming generation")
+    
+    args = parser.parse_args()
+    
+    if args.command == "features":
+        show_inference_features()
+    
+    elif args.command == "example":
+        sys.exit(0 if create_inference_example() else 1)
+    
+    elif args.command == "batch":
+        sys.exit(0 if run_batch_inference_example() else 1)
+    
+    elif args.command == "infer":
+        if not args.config:
+            print("❌ --config is required for inference")
+            print("Usage: python scripts/styling/inference.py infer --config config.yaml [options]")
+            sys.exit(1)
+        
+        # Check if we have input
+        if not args.text and not args.input_file:
+            print("❌ Either --text or --input-file is required")
+            print("Usage: python scripts/styling/inference.py infer --config config.yaml --text 'your text'")
+            sys.exit(1)
+        
+        success = run_inference_with_config(
+            config_path=args.config,
+            model_path=args.model_path,
+            text=args.text,
+            input_file=args.input_file,
+            max_tokens=args.max_tokens,
+            temperature=args.temperature,
+            instruction=args.instruction,
+            output_file=args.output_file,
+            streaming=args.streaming
+        )
+        
+        if not success:
+            sys.exit(1)
+
+if __name__ == "__main__":
+    main()
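The override handling in `run_inference_with_config` repeats one `elif` branch per option, and boolean options need a truthiness check rather than a `None` check because `store_true` flags default to `False`. One way to express this is a lookup table, sketched below. The names `VALUE_FLAGS`, `BOOLEAN_FLAGS`, and `build_override_args` are hypothetical; the flag strings are copied from the script above.

```python
# Keyword name -> CLI flag for options that take a value.
VALUE_FLAGS = {
    "model_path": "--model-path",
    "text": "--text",
    "input_file": "--input-file",
    "max_tokens": "--max-tokens",
    "temperature": "--temperature",
    "instruction": "--instruction",
    "output_file": "--output-file",
}

# Keyword name -> CLI flag for on/off switches.
BOOLEAN_FLAGS = {"streaming": "--streaming"}


def build_override_args(**overrides) -> list[str]:
    """Translate keyword overrides into a flat list of CLI arguments."""
    args: list[str] = []
    for key, value in overrides.items():
        if key in BOOLEAN_FLAGS:
            if value:  # emit the switch only when it is truthy
                args.append(BOOLEAN_FLAGS[key])
        elif key in VALUE_FLAGS and value is not None:
            args.extend([VALUE_FLAGS[key], str(value)])
    return args


print(build_override_args(text="hi", temperature=0.7, streaming=False, model_path=None))
```

Unknown keys are silently ignored here, which matches the behavior of the original `elif` chain.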
@@ -0,0 +1,168 @@
+#!/usr/bin/env python3
+"""
+Styling Training Script
+Provides a command-line interface to run the styling training pipeline
+"""
+
+import sys
+import os
+import subprocess
+import argparse
+from pathlib import Path
+
+def run_training_with_config(config_path: str, dataset_path: str = None, **cli_overrides):
+    """Run the styling training pipeline with YAML configuration"""
+    print(f"Starting styling training with config: {config_path}")
+    if dataset_path:
+        print(f"Training dataset: {dataset_path}")
+    else:
+        print("Training dataset: Will use output_dir from YAML config")
+    print()
+    
+    # Build command
+    cmd = [sys.executable, "pipelines/styling/train.py", "--config", config_path]
+    
+    # Add dataset path if provided
+    if dataset_path:
+        cmd.extend(["--dataset", dataset_path])
+    
+    # Add CLI overrides
+    for key, value in cli_overrides.items():
+        if value is not None:
+            if key == "output_dir":
+                cmd.extend(["--output-dir", str(value)])
+            elif key == "epochs":
+                cmd.extend(["--epochs", str(value)])
+            elif key == "batch_size":
+                cmd.extend(["--batch-size", str(value)])
+            elif key == "learning_rate":
+                cmd.extend(["--learning-rate", str(value)])
+            elif key == "max_steps":
+                cmd.extend(["--max-steps", str(value)])
+    
+    print(f"Running: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("Training completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"Training failed: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def show_training_features():
+    """Show the features of the styling training pipeline"""
+    print("=== Styling Training Pipeline Features ===")
+    print()
+    print("1. **Model Support**:")
+    print("   - Unsloth optimized models (4x faster)")
+    print("   - LoRA fine-tuning for efficiency")
+    print("   - Support for Llama-3.1, Mistral, Phi-3, Gemma")
+    print()
+    print("2. **Training Features**:")
+    print("   - SFTTrainer with instruction tuning")
+    print("   - Automatic mixed precision (FP16/BF16)")
+    print("   - Gradient checkpointing for memory efficiency")
+    print("   - Configurable LoRA parameters")
+    print()
+    print("3. **Configuration**:")
+    print("   - YAML configuration files")
+    print("   - CLI argument overrides")
+    print("   - Automatic device detection")
+    print()
+    print("4. **Output**:")
+    print("   - Saved LoRA models")
+    print("   - Training logs and checkpoints")
+    print("   - Ready for inference")
+
+def create_training_example():
+    """Create a training example using the formal style configuration"""
+    print("=== Training Example: Formal Style Transfer ===")
+    print()
+    
+    # Check if we have the required files
+    config_path = "configs/styling/formal.yaml"
+    
+    if not Path(config_path).exists():
+        print(f"Configuration file not found: {config_path}")
+        print("   Create it first, e.g. by adapting the output of 'data_processor.py create-config'")
+        return False
+    
+    print("Found required files!")
+    print(f"   Config: {config_path}")
+    print("   Dataset: Will use output_dir from YAML config")
+    print("   The training pipeline will automatically:")
+    print("   - Load data from the output_dir specified in YAML")
+    print("   - Convert JSONL files to HuggingFace dataset format")
+    print("   - Apply formatting with EOS tokens")
+    print("   - Train the model using SFTTrainer")
+    print()
+    
+    # Run training without explicit dataset path - will use YAML config
+    success = run_training_with_config(
+        config_path=config_path,
+        dataset_path=None,  # Use output_dir from YAML config
+        epochs=1,
+        batch_size=2,
+        learning_rate=2e-4
+    )
+    
+    if success:
+        print("Training example completed!")
+        print("   Model saved to: ./models/styling")
+        print("   Ready for inference!")
+    
+    return success
+
+def main():
+    """Main function"""
+    parser = argparse.ArgumentParser(description="Styling Training Script")
+    
+    # Subcommands
+    parser.add_argument("command", choices=["train", "example", "features"], 
+                       help="Command to run")
+    
+    # Training arguments
+    parser.add_argument("--config", type=str, help="Path to YAML configuration file")
+    parser.add_argument("--dataset", type=str, help="Path to training dataset")
+    parser.add_argument("--output-dir", type=str, help="Output directory for model")
+    parser.add_argument("--epochs", type=int, help="Number of training epochs")
+    parser.add_argument("--batch-size", type=int, help="Training batch size")
+    parser.add_argument("--learning-rate", type=float, help="Learning rate")
+    parser.add_argument("--max-steps", type=int, help="Maximum training steps")
+    
+    args = parser.parse_args()
+    
+    if args.command == "features":
+        show_training_features()
+    
+    elif args.command == "example":
+        sys.exit(0 if create_training_example() else 1)
+    
+    elif args.command == "train":
+        if not args.config:
+            print("❌ --config is required for training")
+            print("Usage: python scripts/styling/train.py train --config config.yaml")
+            sys.exit(1)
+        
+        # Without --dataset, pass None so the pipeline falls back to output_dir from the config
+        dataset_path = args.dataset or None
+        
+        success = run_training_with_config(
+            config_path=args.config,
+            dataset_path=dataset_path,
+            output_dir=args.output_dir,
+            epochs=args.epochs,
+            batch_size=args.batch_size,
+            learning_rate=args.learning_rate,
+            max_steps=args.max_steps
+        )
+        
+        if not success:
+            sys.exit(1)
+
+if __name__ == "__main__":
+    main()
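Because `run_training_with_config` calls `subprocess.run` with `capture_output=True`, nothing the training process prints is shown until it exits, which can be inconvenient for long runs. A possible alternative, sketched under the assumption that merged stdout/stderr is acceptable, is to echo lines as they arrive. The function name `run_and_stream` is hypothetical and the demo command is a stand-in for the real pipeline invocation.

```python
import subprocess
import sys


def run_and_stream(cmd: list[str]) -> int:
    """Run cmd, echoing each output line as it arrives; return the exit code."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering
        text=True,
        bufsize=1,  # line-buffered on the reading side of the pipe
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
    # Leaving the with-block waits for the process, so returncode is set.
    return proc.returncode


exit_code = run_and_stream([sys.executable, "-c", "print('step 1'); print('step 2')"])
print(f"exit code: {exit_code}")
```

The child process may still buffer its own output; for a Python child, running it with `-u` is one way to reduce that delay.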
@@ -0,0 +1,251 @@
+#!/usr/bin/env python3
+"""
+Test script for the styling data processor
+"""
+
+import sys
+import os
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+from pipelines.styling.data_processor import StylingDataPipeline, create_custom_config, create_huggingface_config
+
+def test_styling_pipeline():
+    """Test the styling data processor with custom data"""
+    
+    print("Testing Styling Data Processor")
+    print("=" * 50)
+    
+    # Initialize the pipeline
+    pipeline = StylingDataPipeline()
+    
+    # Example 1: Load configuration from YAML
+    print("\n1. Loading configuration from YAML...")
+    try:
+        yaml_config = pipeline.load_config_from_yaml("./configs/styling/formal.yaml")
+        print("   ✅ YAML config loaded successfully!")
+        print(f"   Output directory: {yaml_config.output_dir}")
+        print(f"   Instruction: {yaml_config.instruction}")
+        print(f"   Input field: {yaml_config.input_field}")
+        print(f"   Output field: {yaml_config.output_field}")
+    except Exception as e:
+        print(f"   ❌ Error loading YAML config: {e}")
+        yaml_config = None
+    
+    # Example 2: Create custom dataset configuration
+    print("\n2. Creating custom dataset configuration...")
+    custom_config = create_custom_config(
+        data_path="./data/raw/styling/formal_dataset.jsonl",
+        data_format="jsonl",
+        input_field="text",
+        output_field="styled_text",
+        instruction="Rewrite the following text in a formal style",
+        max_samples=1000,
+        min_length=10,
+        max_length=256,
+        clean_text=True,
+        lowercase=False,
+        output_format="alpaca"
+    )
+    
+    print(f"   Input field: {custom_config.input_field} (maps to 'input')")
+    print(f"   Output field: {custom_config.output_field} (maps to 'output')")
+    print(f"   Instruction: {custom_config.instruction}")
+    print(f"   Max samples: {custom_config.max_samples}")
+    
+    # Example 3: Test with sample data (if available)
+    print("\n3. Testing pipeline with sample data...")
+    
+    # Create a sample dataset for testing
+    sample_data = [
+        {
+            "input": "Hey, what's up? How are you doing today?",
+            "output": "Hello, how are you doing today?"
+        },
+        {
+            "input": "This is really cool stuff!",
+            "output": "This is quite impressive material."
+        },
+        {
+            "input": "I'm gonna go to the store later.",
+            "output": "I will go to the store later."
+        }
+    ]
+    
+    # Save sample data to test file
+    import json
+    test_file = "./data/raw/styling/test_formal.jsonl"
+    os.makedirs(os.path.dirname(test_file), exist_ok=True)
+    
+    with open(test_file, 'w', encoding='utf-8') as f:
+        for item in sample_data:
+            f.write(json.dumps(item, ensure_ascii=False) + '\n')
+    
+    print(f"   Created test file: {test_file}")
+    
+    # Test the pipeline with the sample data
+    try:
+        test_config = create_custom_config(
+            data_path=test_file,
+            data_format="jsonl",
+            input_field="input",
+            output_field="output",
+            instruction="Rewrite the following text in a formal style",
+            max_samples=10,
+            output_format="alpaca"
+        )
+        
+        print("   Running pipeline...")
+        result = pipeline.run_pipeline(test_config, output_format="alpaca", save_splits=True, create_hf_dataset=True, save_hf_dataset=True)
+        
+        print("   ✅ Pipeline completed successfully!")
+        print(f"   Total samples: {result['analysis']['overall']['total_samples']}")
+        print(f"   Split sizes: {result['analysis']['overall']['split_sizes']}")
+        print(f"   Output directory: {result['output_dir']}")
+        
+        # Show HuggingFace dataset info if created
+        if 'hf_dataset' in result:
+            hf_dataset = result['hf_dataset']
+            print(f"   HuggingFace dataset created with {len(hf_dataset)} entries")
+            print(f"   Dataset features: {hf_dataset.features}")
+            
+            # Show save path if saved to disk
+            if 'hf_dataset_path' in result:
+                print(f"   Dataset saved to: {result['hf_dataset_path']}")
+            
+            # Show formatted example
+            if len(hf_dataset) > 0:
+                print("   Example formatted text:")
+                print(f"   {hf_dataset[0]['text'][:200]}...")
+        
+        # Show sample processed data
+        print("\n   Sample processed data:")
+        for split_name, split_data in result['data'].items():
+            if split_data:
+                print(f"   {split_name} split:")
+                for i, item in enumerate(split_data[:2]):  # Show first 2 items
+                    print(f"     Item {i+1}:")
+                    print(f"       Instruction: {item['instruction']}")
+                    print(f"       Input: {item['input'][:50]}...")
+                    print(f"       Output: {item['output'][:50]}...")
+                break
+        
+    except Exception as e:
+        print(f"   ❌ Error running pipeline: {e}")
+    
+    print("\n" + "=" * 50)
+    print("Test completed!")
+
+def test_hf_dataset_save_load():
+    """Test HuggingFace dataset save and load functionality"""
+    
+    print("\nTesting HuggingFace Dataset Save/Load")
+    print("=" * 50)
+    
+    from pipelines.styling.data_processor import save_hf_dataset_to_disk, load_hf_dataset_from_disk
+    
+    # Create a sample dataset for testing
+    sample_data = [
+        {
+            "instruction": "Rewrite in formal style",
+            "input": "Hey, what's up?",
+            "output": "Hello, how are you?"
+        },
+        {
+            "instruction": "Rewrite in formal style", 
+            "input": "This is really cool!",
+            "output": "This is quite impressive."
+        }
+    ]
+    
+    # Test configuration
+    config = create_custom_config(
+        data_path="dummy",
+        instruction="Rewrite in formal style"
+    )
+    
+    # Convert to HuggingFace dataset
+    pipeline = StylingDataPipeline()
+    hf_dataset = pipeline.convert_to_hf_dataset(sample_data, config)
+    
+    print(f"Created HuggingFace dataset with {len(hf_dataset)} entries")
+    
+    # Test saving to disk
+    save_path = "./data/processed/styling/test_hf_dataset"
+    print(f"\nSaving dataset to: {save_path}")
+    
+    success = save_hf_dataset_to_disk(hf_dataset, save_path)
+    if success:
+        print("✅ Dataset saved successfully!")
+        
+        # Test loading from disk
+        print(f"\nLoading dataset from: {save_path}")
+        loaded_dataset = load_hf_dataset_from_disk(save_path)
+        
+        if loaded_dataset is not None:
+            print("✅ Dataset loaded successfully!")
+            print(f"Loaded dataset has {len(loaded_dataset)} entries")
+            print(f"Features: {loaded_dataset.features}")
+            
+            # Show sample data
+            print("\nSample loaded data:")
+            for i in range(len(loaded_dataset)):
+                print(f"  Entry {i+1}: {loaded_dataset[i]['text'][:100]}...")
+        else:
+            print("❌ Failed to load dataset")
+    else:
+        print("❌ Failed to save dataset")
+    
+    return hf_dataset
+
+def test_hf_dataset_conversion():
+    """Test the HuggingFace dataset conversion"""
+    
+    print("\nTesting HuggingFace Dataset Conversion")
+    print("=" * 50)
+    
+    pipeline = StylingDataPipeline()
+    
+    # Sample data with instruction field
+    sample_data = [
+        {
+            "instruction": "Rewrite in formal style",
+            "input": "Hey, what's up?",
+            "output": "Hello, how are you?"
+        },
+        {
+            "instruction": "Rewrite in formal style", 
+            "input": "This is really cool!",
+            "output": "This is quite impressive."
+        }
+    ]
+    
+    # Test configuration
+    config = create_custom_config(
+        data_path="dummy",
+        instruction="Rewrite in formal style"
+    )
+    
+    # Convert to HuggingFace dataset
+    hf_dataset = pipeline.convert_to_hf_dataset(sample_data, config)
+    
+    print(f"HuggingFace dataset created with {len(hf_dataset)} entries")
+    print(f"Dataset features: {hf_dataset.features}")
+    
+    # Show formatted examples
+    print("\nFormatted examples:")
+    for i in range(len(hf_dataset)):
+        print(f"  Example {i+1}:")
+        print(f"    {hf_dataset[i]['text'][:150]}...")
+        print()
+    
+    # Test the dataset can be used for training
+    print("Dataset ready for training!")
+    print(f"Number of training examples: {len(hf_dataset)}")
+    
+    return hf_dataset
+
+
+if __name__ == "__main__":
+    test_styling_pipeline()
+    # test_hf_dataset_save_load()
+    # test_hf_dataset_conversion()
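The test script describes `input_field` and `output_field` as mapping to the `input` and `output` keys of the Alpaca format. The sketch below illustrates that mapping for a single JSONL record. It is not the pipeline's actual implementation, and `to_alpaca` is a hypothetical name; the sample record and instruction are taken from the scripts above.

```python
import json


def to_alpaca(record: dict, input_field: str, output_field: str, instruction: str) -> dict:
    """Map one raw record onto instruction/input/output keys (illustrative only)."""
    return {
        "instruction": instruction,
        "input": record[input_field],
        "output": record[output_field],
    }


# One line as written by create_sample_styling_data().
raw_line = '{"text": "This is really cool stuff!", "styled_text": "This is quite impressive material."}'

alpaca = to_alpaca(
    json.loads(raw_line),
    input_field="text",
    output_field="styled_text",
    instruction="Rewrite the following text in a formal style",
)
print(json.dumps(alpaca, ensure_ascii=False))
```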
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "This is really cool stuff!", "output": "This is quite impressive material."}