diff --git a/configs/QUICK_REFERENCE.md b/configs/QUICK_REFERENCE.md
new file mode 100644
index 0000000..f198c42
--- /dev/null
+++ b/configs/QUICK_REFERENCE.md
@@ -0,0 +1,191 @@
+# Quick Reference Card
+
+## Essential Parameters (Most Common)
+
+### Data Source & Location
+```yaml
+data:
+  source: "huggingface|custom"             # REQUIRED: Data source type
+  dataset_name: "dataset/name"             # REQUIRED for huggingface
+  data_path: "./path/to/file"              # REQUIRED for custom
+  data_format: "jsonl|csv|json"            # REQUIRED for custom
+```
+
+### Field Mapping
+```yaml
+data:
+  input_field: "text"                      # REQUIRED: Input text field
+  label_field: "label"                     # REQUIRED for classification
+  output_field: "styled_text"              # REQUIRED for styling
+  instruction: "Style instruction"          # REQUIRED for styling
+```
+
+### Basic Processing
+```yaml
+data:
+  max_samples: 1000                        # Limit total samples
+  train_split: 0.8                         # Training ratio (0.0-1.0)
+  validation_split: 0.1                    # Validation ratio (0.0-1.0)
+  test_split: 0.1                          # Test ratio (0.0-1.0)
+  output_dir: "./output/path"              # Output directory
+```
+
+### Text Preprocessing
+```yaml
+data:
+  clean_text: true                         # Clean/normalize text
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 512                          # Maximum text length
+```
+
+### Model & Training
+```yaml
+model:
+  name: "bert-base-uncased"                # Model name
+  max_length: 512                          # Max sequence length
+
+training:
+  num_epochs: 3                            # Training epochs
+  batch_size: 16                           # Batch size
+  learning_rate: 2e-5                      # Learning rate
+```
+
+## Common Configurations by Task
+
+### Classification
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "dair-ai/emotion"
+  input_field: "text"
+  label_field: "label"
+  output_format: "classification"
+```
+
+### Styling
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./data.jsonl"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite in formal style"
+  output_format: "alpaca"
+```
+
+### Text Generation
+```yaml
+task:
+  name: "completion"
+  type: "text_generation"
+
+data:
+  source: "custom"
+  data_path: "./prompts.jsonl"
+  input_field: "prompt"
+  output_field: "completion"
+  output_format: "instruction"
+```
+
+## Quick Start Templates
+
+### 1. HuggingFace Dataset
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "huggingface"
+  dataset_name: "your/dataset"
+  input_field: "text"
+  label_field: "label"
+  max_samples: 1000
+  output_dir: "./output"
+```
+
+### 2. Custom JSONL File
+```yaml
+task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  data_path: "./your_data.jsonl"
+  data_format: "jsonl"
+  input_field: "source"
+  output_field: "target"
+  instruction: "Your style instruction"
+  output_dir: "./output"
+```
+
+### 3. CSV File
+```yaml
+task:
+  name: "classification"
+  type: "sequence_classification"
+
+data:
+  source: "custom"
+  data_path: "./your_data.csv"
+  data_format: "csv"
+  input_field: "text"
+  label_field: "label"
+  delimiter: ","
+  output_dir: "./output"
+```
+
+## Parameter Ranges & Recommendations
+
+### Split Ratios
+- **Total must be ≤ 1.0**
+- **Common**: train=0.8, val=0.1, test=0.1
+- **Small datasets**: train=0.7, val=0.15, test=0.15
+
+### Learning Rates
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+- **Start with**: 2e-5
+
+### Batch Sizes
+- **GPU Memory**: 8, 16, 32, 64
+- **CPU**: 4, 8, 16
+- **Start with**: 16
+
+### Text Lengths
+- **BERT**: 512 (max)
+- **GPT-2**: 1024 (max)
+- **T5**: 512 (max)
+- **Start with**: 256
+
+## Common Issues & Fixes
+
+| Issue | Cause | Fix |
+|-------|-------|-----|
+| "File not found" | Wrong path | Check `data_path` and `output_dir` |
+| "Memory error" | Batch too large | Reduce `batch_size` |
+| "Split error" | Ratios > 1.0 | Ensure splits sum to ≤ 1.0 |
+| "Poor performance" | Wrong learning rate | Try 1e-5 to 5e-5 range |
+| "Slow processing" | Text too long | Reduce `max_length` |
+
+## Environment Variables
+```bash
+# Set cache directory
+export HF_HOME="./cache"
+
+# Set output directory
+export OUTPUT_DIR="./results"
+
+# Set log level
+export LOG_LEVEL="INFO"
+```
diff --git a/configs/README.md b/configs/README.md
new file mode 100644
index 0000000..6bf0d43
--- /dev/null
+++ b/configs/README.md
@@ -0,0 +1,207 @@
+# Configuration Files Documentation
+
+This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.
+
+## Configuration Structure
+
+All configuration files follow a consistent structure organized into these main sections:
+
+### 1. Task Configuration
+```yaml
+task:
+  name: "task_type"                        # Task type: classification, completion, styling, matching
+  type: "specific_type"                    # Specific model/task type
+```
+
+**Available Task Types:**
+- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
+- **completion**: Text generation and completion tasks
+- **styling**: Style transfer and text transformation tasks
+- **matching**: Semantic matching and similarity tasks
+
+### 2. Data Processing Configuration
+```yaml
+data:
+  # Data Source
+  source: "huggingface|custom"             # Where to get data from
+  
+  # Data Location
+  dataset_name: "dataset/name"             # HuggingFace dataset name (for huggingface source)
+  data_path: "./path/to/file"              # Path to custom data file (for custom source)
+  data_format: "jsonl|csv|json"            # File format for custom data
+  
+  # Field Mapping
+  input_field: "text"                      # Field containing input text
+  output_field: "styled_text"              # Field containing output (for styling)
+  label_field: "label"                     # Field containing labels (for classification)
+  id_field: "id"                           # Optional ID field for tracking
+  
+  # Processing Parameters
+  max_samples: 1000                        # Maximum samples to process
+  train_split: 0.8                         # Training split ratio
+  validation_split: 0.1                    # Validation split ratio
+  test_split: 0.1                          # Test split ratio
+  
+  # Text Preprocessing
+  clean_text: true                         # Clean and normalize text
+  remove_special_chars: false              # Remove special characters
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 1000                         # Maximum text length
+  
+  # Output Configuration
+  output_format: "format_type"             # Output format
+  output_dir: "./output/path"              # Output directory
+```
+
+**Data Source Types:**
+- **huggingface**: Use datasets from HuggingFace Hub
+- **custom**: Use local files (JSONL, CSV, JSON)
+
+**Output Formats:**
+- **classification**: Raw classification format
+- **instruction**: Instruction-following format
+- **conversation**: Conversational format
+- **qa**: Question-answer format
+- **styling**: Raw styling format
+- **alpaca**: Alpaca instruction format
+
+### 3. Model Configuration
+```yaml
+model:
+  name: "model_name"                       # Model from HuggingFace Hub
+  max_length: 512                          # Maximum sequence length
+  num_labels: 6                            # Number of labels (for classification)
+```
+
+**Recommended Models by Task:**
+- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
+- **Styling**: `t5-base`, `gpt2-medium`
+- **Completion**: `gpt2-medium`, `gpt2-large`
+- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`
+
+### 4. Training Configuration
+```yaml
+training:
+  num_epochs: 3                            # Number of training epochs
+  batch_size: 16                           # Training batch size
+  learning_rate: 2e-5                      # Learning rate
+  weight_decay: 0.01                       # Weight decay
+  lr_scheduler_type: "linear"              # Learning rate scheduler
+  warmup_ratio: 0.1                        # Warmup ratio
+  data_dir: "./data/path"                  # Training data directory
+  output_dir: "./model/output"             # Model output directory
+```
+
+**Learning Rate Guidelines:**
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+
+**Scheduler Types:**
+- **linear**: Linear decay
+- **cosine**: Cosine annealing
+- **polynomial**: Polynomial decay
+
+### 5. Inference Configuration
+```yaml
+inference:
+  model_path: "./model/path"               # Path to saved model
+  device: "auto"                           # Device to use
+  batch_size: 32                           # Inference batch size
+  return_probabilities: true                # Return probabilities
+  return_top_k: 3                          # Return top K predictions
+  max_new_tokens: 128                      # Max tokens to generate
+  temperature: 0.8                         # Sampling temperature
+```
+
+**Device Options:**
+- **auto**: Automatically detect best device
+- **cuda**: Use GPU if available
+- **cpu**: Force CPU usage
+
+**Temperature Guidelines:**
+- **0.0**: Deterministic (always same output)
+- **0.7-0.9**: Balanced creativity
+- **1.0+**: More random/creative
+
+## Task-Specific Parameters
+
+### Classification Tasks
+```yaml
+data:
+  label_encoding: "auto|numeric|string"    # How to encode labels
+  multilabel: false                        # Multi-label vs single-label
+  label_separator: ","                     # Separator for multi-label
+```
+
+### Styling Tasks
+```yaml
+data:
+  instruction: "Style instruction text"    # The style instruction
+```
+
+### Completion Tasks
+```yaml
+data:
+  prompt_template: "template"               # Prompt template
+  completion_length: 100                   # Target completion length
+```
+
+## Advanced Configuration
+
+### HuggingFace Specific
+```yaml
+data:
+  hf_split: "train"                        # Dataset split to use
+  hf_cache_dir: "./cache"                  # Cache directory
+  test_split_from: "train"                 # Source for test split
+  val_split_from: "train"                  # Source for validation split
+```
+
+### Custom Data Specific
+```yaml
+data:
+  encoding: "utf-8"                        # File encoding
+  delimiter: ","                           # CSV delimiter
+```
+
+## Usage Examples
+
+### Basic Usage
+```bash
+# Use YAML configuration
+python scripts/task_type/data_processor.py --config configs/task_type/config.yaml
+
+# Override specific parameters
+python scripts/task_type/data_processor.py \
+  --config configs/task_type/config.yaml \
+  --max-samples 1000 \
+  --learning-rate 3e-5
+```
+
+### Creating Custom Configurations
+1. Copy an existing config file
+2. Modify parameters for your specific use case
+3. Update paths and model names
+4. Test with a small dataset first
+
+## Best Practices
+
+1. **Start with Defaults**: Use default values and adjust based on results
+2. **Validate Paths**: Ensure all file paths are correct and accessible
+3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
+4. **Test Incrementally**: Test with small datasets before full processing
+5. **Version Control**: Keep configurations in version control for reproducibility
+
+## Troubleshooting
+
+### Common Issues:
+- **File Not Found**: Check `data_path` and `output_dir` paths
+- **Memory Errors**: Reduce `batch_size` or `max_length`
+- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
+- **Split Errors**: Ensure split ratios sum to ≤ 1.0
+
+### Getting Help:
+- Check the script help: `python script.py --help`
+- Review the pipeline logs for detailed error messages
+- Verify YAML syntax and parameter values
diff --git a/configs/classification/emotion.yaml b/configs/classification/emotion.yaml
index dd6958e..2827292 100644
--- a/configs/classification/emotion.yaml
+++ b/configs/classification/emotion.yaml
@@ -1,6 +1,6 @@
 # Comprehensive Classification Configuration
 # This file defines all parameters for emotion classification using the dair-ai/emotion dataset
-# Organized by level: data processing, model, training, and inference
+# Organized by level: task, data processing, model, training, and inference
 
 # Task Configuration
 task:
@@ -15,9 +15,9 @@ data:
   data_format: "jsonl"                     # Data format: "jsonl", "csv", "json" (for custom data)
   
   # Field Mapping
-  input_field: "text"                      # Field name containing input text
-  label_field: "label"                     # Field name containing labels
-  id_field: null                           # Optional ID field name
+  input_field: "text"                      # Field name containing input text to be classified
+  label_field: "label"                     # Field name containing classification labels
+  id_field: null                           # Optional ID field name for tracking individual samples
   
   # Processing Parameters
   max_samples: 1000                        # Maximum samples to process (null for all samples)
@@ -26,54 +26,54 @@ data:
   test_split: 0.1                          # Test split ratio (0.0 to 1.0)
   
   # Text Preprocessing
-  clean_text: true                         # Clean and normalize text
-  remove_special_chars: false              # Remove special characters from text
-  lowercase: true                          # Convert text to lowercase
+  clean_text: true                         # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  remove_special_chars: false              # Remove special characters from text (keep for emotion analysis)
+  lowercase: true                          # Convert text to lowercase (standard for BERT models)
   min_length: 10                           # Minimum text length (filter out shorter texts)
   max_length: 1000                         # Maximum text length (truncate longer texts)
   
   # Label Processing
   label_encoding: "auto"                   # Label encoding: "auto", "numeric", "string"
-  multilabel: false                        # Enable multilabel classification
-  label_separator: ","                     # Separator for multilabel datasets
+  multilabel: false                        # Enable multilabel classification (false for single emotion per text)
+  label_separator: ","                     # Separator for multilabel datasets (comma-separated labels)
   
   # Output Configuration
   output_format: "classification"          # Output format: "classification", "instruction", "conversation", "qa"
-  output_dir: "./data/processed/classification/emotion"  # Specific output directory for this dataset
+  output_dir: "./data/processed/classification/emotion"  # Output directory for processed data and splits
   
   # HuggingFace Specific
-  hf_split: "train"                        # HuggingFace dataset split to use
-  hf_cache_dir: null                       # HuggingFace cache directory (null for default)
+  hf_split: "train"                        # HuggingFace dataset split to use as base
+  hf_cache_dir: null                       # HuggingFace cache directory (null for default ~/.cache/huggingface)
   
   # Split Configuration (Advanced)
   test_split_from: "train"                 # Source for test split: "train", "use_test_if_available", "use_val_if_available"
   val_split_from: "train"                  # Source for validation split: "train", "use_val_if_available"
   
   # Custom Data Specific
-  encoding: "utf-8"                        # File encoding for custom data
-  delimiter: ","                           # Delimiter for CSV files
+  encoding: "utf-8"                        # File encoding for custom data files
+  delimiter: ","                           # Delimiter for CSV files (comma for standard CSV)
 
 # Model Configuration
 model:
-  name: "bert-base-uncased"                # Model name from HuggingFace Hub
-  max_length: 512                          # Maximum sequence length for tokenization
-  num_labels: 6                            # Number of classification labels
+  name: "bert-base-uncased"                # Model name from HuggingFace Hub (good for text classification)
+  max_length: 512                          # Maximum sequence length for tokenization (BERT limit)
+  num_labels: 6                            # Number of classification labels (emotion categories)
 
 # Training Configuration
 training:
-  num_epochs: 3                            # Number of training epochs
-  batch_size: 16                           # Training batch size
-  learning_rate: 2e-5                      # Learning rate (typical range: 1e-5 to 5e-5)
-  weight_decay: 0.01                       # Weight decay for optimizer (typical range: 0.01 to 0.1)
+  num_epochs: 3                            # Number of training epochs (adjust based on dataset size)
+  batch_size: 16                           # Training batch size (adjust based on GPU memory)
+  learning_rate: 2e-5                      # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01                       # Weight decay for optimizer (prevents overfitting)
   lr_scheduler_type: "linear"              # Scheduler type: "linear", "cosine", "polynomial"
   warmup_ratio: 0.1                        # Warmup ratio for scheduler (0.0 to 1.0)
   data_dir: "./data/processed/classification/emotion"  # Directory containing train/validation/test JSONL files
-  output_dir: "./results/classification/emotion_model"  # Output directory for saved model
+  output_dir: "./results/classification/emotion_model"  # Output directory for saved model and checkpoints
 
 # Inference Configuration
 inference:
   model_path: "./results/classification/emotion_model"  # Path to saved model directory
-  device: "auto"                           # Device: "auto", "cuda", "cpu"
-  batch_size: 32                           # Batch size for inference
-  return_probabilities: true                # Return all class probabilities
-  return_top_k: 3                          # Return top K predictions
+  device: "auto"                           # Device: "auto", "cuda", "cpu" (auto detects best available)
+  batch_size: 32                           # Batch size for inference (can be larger than training)
+  return_probabilities: true                # Return all class probabilities (not just top prediction)
+  return_top_k: 3                          # Return top K predictions (useful for confidence analysis)
diff --git a/configs/styling/formal.yaml b/configs/styling/formal.yaml
index fb79712..d13d2be 100644
--- a/configs/styling/formal.yaml
+++ b/configs/styling/formal.yaml
@@ -1,29 +1,69 @@
+# Comprehensive Styling Configuration
+# This file defines all parameters for formal style transfer tasks
+# Organized by level: task, data processing, model, training, and inference
+
+# Task Configuration
 task:
-  name: "styling"
-  type: "style_transfer"
+  name: "styling"                          # Task type: classification, completion, styling, matching
+  type: "style_transfer"                   # Model type: style_transfer, text_generation, etc.
 
+# Data Processing Configuration
 data:
-  source: "custom"
-  input_field: "text"
-  style_field: "style"
-  max_length: 256
-  train_split: 0.8
-  validation_split: 0.1
-  test_split: 0.1
+  source: "custom"                          # Data source: "huggingface" or "custom"
+  data_path: "./data/raw/styling/sample_formal.jsonl"  # Path to custom data file (required for custom source)
+  dataset_name: null                        # HuggingFace dataset name (required for huggingface source)
+  
+  # Field Mapping
+  input_field: "text"                       # Field name containing source text to be styled
+  output_field: "styled_text"               # Field name containing the styled/transformed text
+  
+  # Style Instruction
+  instruction: "Rewrite the following text in a formal style"  # The style instruction that guides the transformation
+  
+  # Data Format & Processing
+  data_format: "jsonl"                      # Data format: "jsonl", "csv", "json" (for custom data)
+  max_length: 256                           # Maximum text length (truncate longer texts)
+  min_length: 10                            # Minimum text length (filter out shorter texts)
+  
+  # Text Preprocessing
+  clean_text: true                          # Clean and normalize text (remove extra spaces, normalize quotes, etc.)
+  lowercase: false                          # Convert text to lowercase (false for formal style to preserve case)
+  
+  # Data Splitting
+  train_split: 0.8                          # Training split ratio (0.0 to 1.0)
+  validation_split: 0.1                     # Validation split ratio (0.0 to 1.0)
+  test_split: 0.1                           # Test split ratio (0.0 to 1.0)
+  
+  # Output Configuration
+  output_format: "alpaca"                   # Output format: "styling" (raw), "alpaca" (instruction format)
+  output_dir: "./data/processed/styling/formal"  # Output directory for processed data and HuggingFace datasets
 
+# Model Configuration
 model:
-  name: "t5-base"
-  max_length: 256
+  name: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Model name from HuggingFace Hub
+  max_length: 2048                          # Maximum sequence length for tokenization
+  max_seq_length: 2048                      # Maximum sequence length for training (RoPE scaling supported)
+  dtype: null                               # Data type: null for auto detection, float16 for Tesla T4/V100, bfloat16 for Ampere+
+  load_in_4bit: true                        # Use 4bit quantization to reduce memory usage
+  token: null                               # HuggingFace token for gated models (e.g., "hf_...")
+  
+  # Training Model Parameters
+  training_model: "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"  # Model to use for training
+  training_max_seq_length: 2048             # Max sequence length for training
+  training_dtype: null                      # Data type for training
+  training_load_in_4bit: true               # 4bit quantization for training
 
+# Training Configuration
 training:
-  num_epochs: 3
-  batch_size: 16
-  learning_rate: 3e-5
-  weight_decay: 0.01
-  warmup_ratio: 0.1
-  lr_scheduler_type: "linear"
+  num_epochs: 3                             # Number of training epochs
+  batch_size: 16                            # Training batch size (adjust based on GPU memory)
+  learning_rate: 3e-5                       # Learning rate (typical range: 1e-5 to 5e-5 for fine-tuning)
+  weight_decay: 0.01                        # Weight decay for optimizer (prevents overfitting)
+  warmup_ratio: 0.1                         # Warmup ratio for scheduler (0.0 to 1.0)
+  lr_scheduler_type: "linear"               # Scheduler type: "linear", "cosine", "polynomial"
 
+# Inference Configuration
 inference:
-  batch_size: 32
-  max_new_tokens: 128
-  temperature: 0.8
+  batch_size: 32                            # Batch size for inference (can be larger than training)
+  max_new_tokens: 128                       # Maximum new tokens to generate during inference
+  temperature: 0.8                          # Sampling temperature (0.0 = deterministic, 1.0 = random)
diff --git a/data/alpaca/test.jsonl b/data/alpaca/test.jsonl
new file mode 100644
index 0000000..659cab5
--- /dev/null
+++ b/data/alpaca/test.jsonl
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "This is really cool stuff!", "output": "This is quite impressive material."}
diff --git a/data/alpaca/train.jsonl b/data/alpaca/train.jsonl
new file mode 100644
index 0000000..2af6ff3
--- /dev/null
+++ b/data/alpaca/train.jsonl
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
diff --git a/data/alpaca/validation.jsonl b/data/alpaca/validation.jsonl
new file mode 100644
index 0000000..4be4e50
--- /dev/null
+++ b/data/alpaca/validation.jsonl
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
diff --git a/data/hf_dataset/data-00000-of-00001.arrow b/data/hf_dataset/data-00000-of-00001.arrow
new file mode 100644
index 0000000..52b04b7
Binary files /dev/null and b/data/hf_dataset/data-00000-of-00001.arrow differ
diff --git a/data/hf_dataset/dataset_info.json b/data/hf_dataset/dataset_info.json
new file mode 100644
index 0000000..6bff0b3
--- /dev/null
+++ b/data/hf_dataset/dataset_info.json
@@ -0,0 +1,24 @@
+{
+  "citation": "",
+  "description": "",
+  "features": {
+    "instruction": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "input": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "output": {
+      "dtype": "string",
+      "_type": "Value"
+    },
+    "text": {
+      "dtype": "string",
+      "_type": "Value"
+    }
+  },
+  "homepage": "",
+  "license": ""
+}
\ No newline at end of file
diff --git a/data/hf_dataset/state.json b/data/hf_dataset/state.json
new file mode 100644
index 0000000..711aac0
--- /dev/null
+++ b/data/hf_dataset/state.json
@@ -0,0 +1,13 @@
+{
+  "_data_files": [
+    {
+      "filename": "data-00000-of-00001.arrow"
+    }
+  ],
+  "_fingerprint": "4e028847697e7b16",
+  "_format_columns": null,
+  "_format_kwargs": {},
+  "_format_type": null,
+  "_output_all_columns": false,
+  "_split": null
+}
\ No newline at end of file
diff --git a/data/processed/styling/formal/alpaca/test.jsonl b/data/processed/styling/formal/alpaca/test.jsonl
new file mode 100644
index 0000000..ee6ece1
--- /dev/null
+++ b/data/processed/styling/formal/alpaca/test.jsonl
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "That's totally awesome!", "output": "That is quite remarkable!"}
diff --git a/data/processed/styling/formal/alpaca/train.jsonl b/data/processed/styling/formal/alpaca/train.jsonl
new file mode 100644
index 0000000..93f0ddf
--- /dev/null
+++ b/data/processed/styling/formal/alpaca/train.jsonl
@@ -0,0 +1,3 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
+{"instruction": "Rewrite the following text in a formal style", "input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
+{"instruction": "Rewrite the following text in a formal style", "input": "What's the deal with this?", "output": "What is the situation regarding this matter?"}
diff --git a/data/processed/styling/formal/alpaca/validation.jsonl b/data/processed/styling/formal/alpaca/validation.jsonl
new file mode 100644
index 0000000..659cab5
--- /dev/null
+++ b/data/processed/styling/formal/alpaca/validation.jsonl
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "This is really cool stuff!", "output": "This is quite impressive material."}
diff --git a/data/processed/styling/formal/test.jsonl b/data/processed/styling/formal/test.jsonl
new file mode 100644
index 0000000..ee6ece1
--- /dev/null
+++ b/data/processed/styling/formal/test.jsonl
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "That's totally awesome!", "output": "That is quite remarkable!"}
diff --git a/data/processed/styling/formal/train.jsonl b/data/processed/styling/formal/train.jsonl
new file mode 100644
index 0000000..93f0ddf
--- /dev/null
+++ b/data/processed/styling/formal/train.jsonl
@@ -0,0 +1,3 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
+{"instruction": "Rewrite the following text in a formal style", "input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
+{"instruction": "Rewrite the following text in a formal style", "input": "What's the deal with this?", "output": "What is the situation regarding this matter?"}
diff --git a/data/processed/styling/formal/validation.jsonl b/data/processed/styling/formal/validation.jsonl
new file mode 100644
index 0000000..659cab5
--- /dev/null
+++ b/data/processed/styling/formal/validation.jsonl
@@ -0,0 +1 @@
+{"instruction": "Rewrite the following text in a formal style", "input": "This is really cool stuff!", "output": "This is quite impressive material."}
diff --git a/data/raw/styling/sample_formal.jsonl b/data/raw/styling/sample_formal.jsonl
new file mode 100644
index 0000000..0a2d5a2
--- /dev/null
+++ b/data/raw/styling/sample_formal.jsonl
@@ -0,0 +1,5 @@
+{"text": "Hey, what's up? How are you doing today?", "styled_text": "Hello, how are you doing today?"}
+{"text": "This is really cool stuff!", "styled_text": "This is quite impressive material."}
+{"text": "I'm gonna go to the store later.", "styled_text": "I will go to the store later."}
+{"text": "What's the deal with this?", "styled_text": "What is the situation regarding this matter?"}
+{"text": "That's totally awesome!", "styled_text": "That is quite remarkable!"}
diff --git a/data/raw/styling/test_formal.jsonl b/data/raw/styling/test_formal.jsonl
new file mode 100644
index 0000000..7d6d9fb
--- /dev/null
+++ b/data/raw/styling/test_formal.jsonl
@@ -0,0 +1,3 @@
+{"input": "Hey, what's up? How are you doing today?", "output": "Hello, how are you doing today?"}
+{"input": "This is really cool stuff!", "output": "This is quite impressive material."}
+{"input": "I'm gonna go to the store later.", "output": "I will go to the store later."}
diff --git a/data/raw/styling/test_missing_fields.jsonl b/data/raw/styling/test_missing_fields.jsonl
new file mode 100644
index 0000000..2005649
--- /dev/null
+++ b/data/raw/styling/test_missing_fields.jsonl
@@ -0,0 +1,5 @@
+{"text": "Hello world", "styled_text": "Greetings, world."}
+{"styled_text": "This is a formal greeting."}
+{"text": "How are you?", "styled_text": "How are you doing?"}
+{"text": null, "styled_text": "Empty input example."}
+{"styled_text": "Another example with no input."}
diff --git a/pipelines/__pycache__/__init__.cpython-311.pyc b/pipelines/__pycache__/__init__.cpython-311.pyc
new file mode 100644
index 0000000..11c8488
Binary files /dev/null and b/pipelines/__pycache__/__init__.cpython-311.pyc differ
diff --git a/pipelines/styling/__pycache__/__init__.cpython-311.pyc b/pipelines/styling/__pycache__/__init__.cpython-311.pyc
new file mode 100644
index 0000000..cd21bb2
Binary files /dev/null and b/pipelines/styling/__pycache__/__init__.cpython-311.pyc differ
diff --git a/pipelines/styling/__pycache__/data_processor.cpython-311.pyc b/pipelines/styling/__pycache__/data_processor.cpython-311.pyc
new file mode 100644
index 0000000..5e118b1
Binary files /dev/null and b/pipelines/styling/__pycache__/data_processor.cpython-311.pyc differ
diff --git a/pipelines/styling/data_processor.py b/pipelines/styling/data_processor.py
new file mode 100644
index 0000000..cacb5b6
--- /dev/null
+++ b/pipelines/styling/data_processor.py
@@ -0,0 +1,1488 @@
+import json
+import pandas as pd
+import numpy as np
+from pathlib import Path
+from typing import Dict, List, Optional, Union, Any, Tuple
+from datasets import Dataset, load_dataset
+import os
+from dataclasses import dataclass
+from abc import ABC, abstractmethod
+import logging
+from sklearn.model_selection import train_test_split
+import re
+import argparse
+import sys
+import yaml
+
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)
+
+@dataclass
+class StylingConfig:
+    """Configuration for styling tasks"""
+    # Data source configuration
+    data_source: str = "huggingface"  # "huggingface" or "custom"
+    dataset_name: Optional[str] = None  # For Hugging Face datasets
+    data_path: Optional[str] = None  # For custom datasets
+    data_format: str = "jsonl"  # jsonl, csv, json
+
+    # Field mapping - User configures which fields map to input/output
+    input_field: str = "text"  # Field in dataset containing source text (e.g., "text", "source", etc.)
+    output_field: str = "styled_text"  # Field in dataset containing styled text (e.g., "styled_text", "target", etc.)
+    instruction: str = "Rewrite the following text in a formal style"  # Style instruction from YAML
+
+    # Data processing
+    max_samples: Optional[int] = None
+    train_split: float = 0.8
+    validation_split: float = 0.1
+    test_split: float = 0.1
+
+    # Text preprocessing
+    clean_text: bool = True
+    remove_special_chars: bool = False
+    lowercase: bool = False  # Keep original case for styling
+    min_length: int = 10
+    max_length: int = 1000
+
+    # Output configuration
+    output_format: str = "styling"  # instruction, conversation, qa
+    output_dir: str = "./data"
+
+    # Hugging Face specific
+    hf_split: str = "train"
+    hf_cache_dir: Optional[str] = None
+
+    # Split configuration
+    test_split_from: str = "train"
+    val_split_from: str = "train"
+
+    # Custom data specific
+    encoding: str = "utf-8"
+    delimiter: str = ","  # For CSV files
+
+    # Alpaca prompt configuration
+    alpaca_prompt: str = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that follows the instruction
+
+### Instruction:
+{}
+
+### Input:
+{}
+
+### Response:
+{}"""
+
+    eos_token: str = "<|eot_id|>"  # Use <|eot_id|> as EOS token
+
+class DataValidator:
+    """Validates styling data quality and format"""
+
+    @staticmethod
+    def validate_styling_data(data: Dict[str, List[Dict]], config: StylingConfig, is_processed: bool = False) -> Tuple[bool, List[str]]:
+        """Validate styling dataset splits"""
+        errors = []
+        
+        # Check if we have the expected splits
+        expected_splits = ["train", "validation", "test"]
+        for split in expected_splits:
+            if split not in data:
+                errors.append(f"Missing '{split}' split")
+            elif split == "train" and not data[split]:
+                errors.append(f"Train split cannot be empty")
+            # Allow validation and test splits to be empty for small datasets
+        
+        if errors:
+            return False, errors
+
+        total_samples = sum(len(split_data) for split_data in data.values())
+        logger.info(f"Validating {total_samples} total samples across all splits...")
+
+        # Determine field names based on whether data is processed or not
+        input_field = "input" if is_processed else config.input_field
+        output_field = "output" if is_processed else config.output_field
+
+        # Validate each split
+        for split_name, split_data in data.items():
+            if not split_data:
+                logger.info(f"Skipping validation for empty {split_name} split")
+                continue
+                
+            logger.info(f"Validating {split_name} split with {len(split_data)} samples...")
+            
+            # Check required fields
+            missing_input_count = 0
+            missing_output_count = 0
+
+            for i, item in enumerate(split_data):
+                if input_field not in item:
+                    errors.append(f"Missing input field '{input_field}' in {split_name} split, item {i}")
+                    missing_input_count += 1
+                if output_field not in item:
+                    errors.append(f"Missing output field '{output_field}' in {split_name} split, item {i}")
+                    missing_output_count += 1
+
+            logger.info(f"{split_name} - Items missing input field: {missing_input_count}")
+            logger.info(f"{split_name} - Items missing output field: {missing_output_count}")
+
+            # Check data types
+            type_errors = 0
+            for i, item in enumerate(split_data):
+                if not isinstance(item.get(input_field, ""), str):
+                    errors.append(f"Input field '{input_field}' must be string in {split_name} split, item {i}")
+                    type_errors += 1
+                if not isinstance(item.get(output_field, ""), str):
+                    errors.append(f"Output field '{output_field}' must be string in {split_name} split, item {i}")
+                    type_errors += 1
+
+            logger.info(f"{split_name} - Type errors: {type_errors}")
+
+            # Check for empty inputs/outputs
+            empty_inputs = sum(1 for item in split_data if not item.get(input_field, "").strip())
+            empty_outputs = sum(1 for item in split_data if not item.get(output_field, "").strip())
+            
+            if empty_inputs > 0:
+                errors.append(f"Found {empty_inputs} items with empty input text in {split_name} split")
+            if empty_outputs > 0:
+                errors.append(f"Found {empty_outputs} items with empty output text in {split_name} split")
+
+            logger.info(f"{split_name} - Empty inputs: {empty_inputs}")
+            logger.info(f"{split_name} - Empty outputs: {empty_outputs}")
+
+            # Show sample of processed data for debugging
+            if split_data:
+                logger.info(f"Sample processed items from {split_name}:")
+                for i in range(min(3, len(split_data))):
+                    item = split_data[i]
+                    logger.info(f"  Item {i}: input='{item.get(input_field, '')[:50]}...', output='{item.get(output_field, '')[:50]}...'")
+
+        return len(errors) == 0, errors
+
+    @staticmethod
+    def analyze_dataset(data: Dict[str, List[Dict]], config: StylingConfig, is_processed: bool = False) -> Dict[str, Any]:
+        """Analyze dataset characteristics across all splits"""
+        analysis = {
+            "splits": {},
+            "overall": {
+                "total_samples": 0,
+                "split_sizes": {}
+            }
+        }
+
+        # Determine field names based on whether data is processed or not
+        input_field = "input" if is_processed else config.input_field
+        output_field = "output" if is_processed else config.output_field
+
+        # Analyze each split
+        for split_name, split_data in data.items():
+            if not split_data:
+                # Handle empty splits
+                split_analysis = {
+                    "total_samples": 0,
+                    "text_length_stats": {},
+                    "missing_values": {}
+                }
+                analysis["splits"][split_name] = split_analysis
+                analysis["overall"]["split_sizes"][split_name] = 0
+                continue
+                
+            split_analysis = {
+                "total_samples": len(split_data),
+                "text_length_stats": {},
+                "missing_values": {}
+            }
+
+            # Text length statistics for both input and output
+            for field_name, field in [("input", input_field), ("output", output_field)]:
+                text_lengths = [len(item.get(field, "")) for item in split_data]
+                if text_lengths:
+                    split_analysis["text_length_stats"][field_name] = {
+                        "min": min(text_lengths),
+                        "max": max(text_lengths),
+                        "mean": np.mean(text_lengths),
+                        "median": np.median(text_lengths)
+                    }
+
+            # Missing values
+            for field in [input_field, output_field]:
+                missing_count = sum(1 for item in split_data if not item.get(field))
+                split_analysis["missing_values"][field] = missing_count
+
+            analysis["splits"][split_name] = split_analysis
+            analysis["overall"]["total_samples"] += len(split_data)
+            analysis["overall"]["split_sizes"][split_name] = len(split_data)
+
+        return analysis
+
+class BaseDataLoader(ABC):
+    """Abstract base class for data loaders"""
+
+    @abstractmethod
+    def load(self, config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Load data and return dictionary with train/val/test splits"""
+        pass
+
+    @abstractmethod
+    def preprocess(self, data: Dict[str, List[Dict]], config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Apply preprocessing steps to all splits"""
+        pass
+
+
+class HuggingFaceDataLoader(BaseDataLoader):
+    """Load datasets from Hugging Face Hub"""
+
+    def load(self, config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Load dataset from Hugging Face Hub with flexible split handling"""
+        if not config.dataset_name:
+            raise ValueError("Dataset name is required for Hugging Face datasets")
+
+        logger.info(f"Loading Hugging Face dataset: {config.dataset_name}")
+
+        try:
+            # First, let's check what splits are available in the dataset
+            dataset = load_dataset(
+                config.dataset_name,
+                cache_dir=config.hf_cache_dir
+            )
+
+            # Log available splits
+            available_splits = list(dataset.keys())
+            logger.info(f"Available splits in dataset: {available_splits}")
+
+            # Initialize split data
+            splits_data = {
+                "train": [],
+                "validation": [],
+                "test": []
+            }
+
+            # Handle train split
+            if "train" in available_splits:
+                train_dataset = dataset["train"]
+                logger.info(f"Using 'train' split with {len(train_dataset)} samples")
+                splits_data["train"] = list(train_dataset)
+            else:
+                logger.error("No 'train' split found in dataset!")
+                logger.error(f"Available splits: {available_splits}")
+                raise ValueError(f"Dataset {config.dataset_name} does not have a 'train' split")
+
+            # Handle validation split
+            if config.val_split_from == "use_val_if_available" and "validation" in available_splits:
+                val_dataset = dataset["validation"]
+                logger.info(f"Using 'validation' split with {len(val_dataset)} samples")
+                splits_data["validation"] = list(val_dataset)
+            elif config.val_split_from == "use_val_if_available" and "val" in available_splits:
+                val_dataset = dataset["val"]
+                logger.info(f"Using 'val' split with {len(val_dataset)} samples")
+                splits_data["validation"] = list(val_dataset)
+            elif config.val_split_from == "use_val_if_available":
+                logger.warning("No validation split found in dataset. Will create from train split.")
+                logger.info(f"Available splits: {available_splits}")
+                logger.info(f"Will use {config.validation_split * 100}% of train data for validation")
+            else:
+                logger.info(f"Will create validation split from train data ({config.validation_split * 100}%)")
+
+            # Handle test split
+            if config.test_split_from == "use_test_if_available" and "test" in available_splits:
+                test_dataset = dataset["test"]
+                logger.info(f"Using 'test' split with {len(test_dataset)} samples")
+                splits_data["test"] = list(test_dataset)
+            elif config.test_split_from == "use_val_if_available" and "validation" in available_splits:
+                test_dataset = dataset["validation"]
+                logger.info(f"Using 'validation' split as test with {len(test_dataset)} samples")
+                splits_data["test"] = list(test_dataset)
+            elif config.test_split_from == "use_val_if_available" and "val" in available_splits:
+                test_dataset = dataset["val"]
+                logger.info(f"Using 'val' split as test with {len(test_dataset)} samples")
+                splits_data["test"] = list(test_dataset)
+            elif config.test_split_from == "use_test_if_available":
+                logger.warning("No test split found in dataset. Will create from train split.")
+                logger.info(f"Available splits: {available_splits}")
+                logger.info(f"Will use {config.test_split * 100}% of train data for test")
+            else:
+                logger.info(f"Will create test split from train data ({config.test_split * 100}%)")
+
+            # If we need to create splits from train data
+            if not splits_data["validation"] or not splits_data["test"]:
+                train_data = splits_data["train"]
+                
+                # Handle very small datasets
+                if len(train_data) < 3:
+                    logger.warning(f"Dataset has only {len(train_data)} samples. Using all data for training.")
+                    splits_data["train"] = train_data
+                    splits_data["validation"] = []
+                    splits_data["test"] = []
+                else:
+                    # Calculate remaining percentages for train
+                    total_train_percentage = config.train_split + config.validation_split + config.test_split
+                    if total_train_percentage != 1.0:
+                        logger.warning(f"Split percentages don't sum to 1.0 (got {total_train_percentage}). Normalizing...")
+                        # Normalize percentages
+                        config.train_split = config.train_split / total_train_percentage
+                        config.validation_split = config.validation_split / total_train_percentage
+                        config.test_split = config.test_split / total_train_percentage
+
+                    # Create splits from train data
+                    if not splits_data["validation"] and not splits_data["test"]:
+                        # Split train into train, val, test
+                        train_size = int(len(train_data) * config.train_split)
+                        val_size = int(len(train_data) * config.validation_split)
+
+                        # Handle small datasets
+                        if len(train_data) < 10:
+                            # For small datasets, use more conservative splits
+                            config.train_split = 0.6
+                            config.validation_split = 0.2
+                            config.test_split = 0.2
+                            logger.info(f"Small dataset detected. Adjusted split ratios to: train={config.train_split}, val={config.validation_split}, test={config.test_split}")
+                        
+                        # Ensure minimum sizes
+                        min_val_size = max(1, int(len(train_data) * 0.1))
+                        min_test_size = max(1, int(len(train_data) * 0.1))
+                        
+                        val_size = max(min_val_size, int(len(train_data) * config.validation_split))
+                        test_size = max(min_test_size, int(len(train_data) * config.test_split))
+                        train_size = len(train_data) - val_size - test_size
+                        
+                        # Ensure train has at least 1 sample
+                        if train_size < 1:
+                            if val_size > 1:
+                                val_size -= 1
+                                train_size += 1
+                            elif test_size > 1:
+                                test_size -= 1
+                                train_size += 1
+                            logger.info(f"Adjusted split sizes: train={train_size}, val={val_size}, test={test_size}")
+
+                        # First split: train + (val+test)
+                        new_train, temp_data = train_test_split(
+                            train_data,
+                            test_size=val_size + test_size,
+                            random_state=42
+                        )
+
+                        # Second split: val + test
+                        new_val, new_test = train_test_split(
+                            temp_data,
+                            test_size=test_size / (val_size + test_size) if (val_size + test_size) > 0 else 0,
+                            random_state=42
+                        )
+                        
+                        splits_data["train"] = new_train
+                        splits_data["validation"] = new_val
+                        splits_data["test"] = new_test
+
+                    elif not splits_data["validation"]:
+                        # Only need to create val from train
+                        val_size = max(1, int(len(train_data) * config.validation_split))
+                        new_train, new_val = train_test_split(
+                            train_data,
+                            test_size=val_size,
+                            random_state=42
+                        )
+                        splits_data["train"] = new_train
+                        splits_data["validation"] = new_val
+
+                    elif not splits_data["test"]:
+                        # Only need to create test from train
+                        test_size = max(1, int(len(train_data) * config.test_split))
+                        new_train, new_test = train_test_split(
+                            train_data,
+                            test_size=test_size,
+                            random_state=42
+                        )
+                        splits_data["train"] = new_train
+                        splits_data["test"] = new_test
+
+            logger.info(f"Final split sizes:")
+            logger.info(f"  Train: {len(splits_data['train'])} samples")
+            logger.info(f"  Validation: {len(splits_data['validation'])} samples")
+            logger.info(f"  Test: {len(splits_data['test'])} samples")
+
+            # Ensure all splits exist (even if empty) for the pipeline
+            if "validation" not in splits_data:
+                splits_data["validation"] = []
+            if "test" not in splits_data:
+                splits_data["test"] = []
+
+            # Apply max_samples limit to each split if specified
+            if config.max_samples:
+                for split_name in splits_data:
+                    if splits_data[split_name]:
+                        original_size = len(splits_data[split_name])
+                        splits_data[split_name] = splits_data[split_name][:config.max_samples]
+                        logger.info(f"Limited {split_name} split from {original_size} to {len(splits_data[split_name])} samples")
+
+            # Log dataset info for debugging
+            for split_name, split_data in splits_data.items():
+                if split_data:
+                    logger.info(f"Sample data item from {split_name}: {split_data[0]}")
+                    logger.info(f"Available fields in {split_name} split: {list(split_data[0].keys())}")
+
+                    # Check if the required fields exist
+                    if config.input_field not in split_data[0]:
+                        logger.warning(f"Input field '{config.input_field}' not found in {split_name}. Available fields: {list(split_data[0].keys())}")
+                        # Suggest alternative fields
+                        text_fields = [f for f in split_data[0].keys() if any(keyword in f.lower() for keyword in ['text', 'sentence', 'content', 'input', 'comment', 'message'])]
+                        if text_fields:
+                            logger.info(f"Suggested text fields for {split_name}: {text_fields}")
+                    if config.output_field not in split_data[0]:
+                        logger.warning(f"Output field '{config.output_field}' not found in {split_name}. Available fields: {list(split_data[0].keys())}")
+                        # Suggest alternative fields
+                        output_fields = [f for f in split_data[0].keys() if any(keyword in f.lower() for keyword in ['output', 'response', 'result', 'target', 'styled'])]
+                        if output_fields:
+                            logger.info(f"Suggested output fields for {split_name}: {output_fields}")
+
+            logger.info(f"Successfully loaded dataset {config.dataset_name}")
+            return splits_data
+
+        except Exception as e:
+            logger.error(f"Error loading dataset {config.dataset_name}: {e}")
+            raise
+
+    def preprocess(self, data: Dict[str, List[Dict]], config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Apply preprocessing steps to all splits separately"""
+        processed_splits = {}
+
+        logger.info(f"=== PREPROCESSING DATA ===")
+        
+        for split_name, split_data in data.items():
+            logger.info(f"Processing {split_name} split with {len(split_data)} items...")
+            
+            # Log field availability for debugging
+            if split_data:
+                available_fields = set(split_data[0].keys())
+                logger.info(f"Available fields in {split_name}: {available_fields}")
+                logger.info(f"Looking for input field: '{config.input_field}', output field: '{config.output_field}'")
+
+                if config.input_field not in available_fields:
+                    logger.error(f"Input field '{config.input_field}' not found in {split_name}. Available fields: {available_fields}")
+                if config.output_field not in available_fields:
+                    logger.error(f"Output field '{config.output_field}' not found in {split_name}. Available fields: {available_fields}")
+
+            # Count items with missing fields
+            missing_input = sum(1 for item in split_data if config.input_field not in item or not item.get(config.input_field))
+            missing_output = sum(1 for item in split_data if config.output_field not in item or not item.get(config.output_field))
+
+            logger.info(f"{split_name} - Items missing input field: {missing_input}")
+            logger.info(f"{split_name} - Items missing output field: {missing_output}")
+
+            # Show sample of raw data before preprocessing
+            logger.info(f"=== SAMPLE RAW DATA FROM {split_name.upper()} BEFORE PREPROCESSING ===")
+            for i in range(min(3, len(split_data))):
+                item = split_data[i]
+                logger.info(f"Raw item {i} from {split_name}:")
+                for key, value in item.items():
+                    if isinstance(value, str) and len(value) > 100:
+                        logger.info(f"  {key}: '{value[:100]}...'")
+                    else:
+                        logger.info(f"  {key}: {value}")
+
+            # Process each item in the split
+            processed_data = []
+            processed_count = 0
+            skipped_count = 0
+
+            # Reset debug counter for each split
+            self._debug_count = 0
+
+            for i, item in enumerate(split_data):
+                processed_item = self._preprocess_item(item, config)
+                if processed_item is not None:
+                    processed_data.append(processed_item)
+                    processed_count += 1
+                else:
+                    skipped_count += 1
+                    if skipped_count <= 3:  # Log first few skipped items
+                        logger.info(f"Skipped item {i} from {split_name}: {item}")
+
+            processed_splits[split_name] = processed_data
+            logger.info(f"{split_name} - Preprocessed {processed_count} samples, skipped {skipped_count} samples")
+
+            # Show sample of processed data
+            if processed_data:
+                logger.info(f"=== SAMPLE PROCESSED DATA FROM {split_name.upper()} ===")
+                for i in range(min(3, len(processed_data))):
+                    logger.info(f"Processed item {i} from {split_name}: {processed_data[i]}")
+
+        return processed_splits
+
+    def _preprocess_item(self, item: Dict, config: StylingConfig) -> Optional[Dict]:
+        """Preprocess a single item"""
+        # Extract input and output using configurable field names
+        input_text = item.get(config.input_field, "")
+        output_text = item.get(config.output_field, "")
+
+        # Log what we're extracting (for first few items)
+        if hasattr(self, '_debug_count'):
+            self._debug_count += 1
+        else:
+            self._debug_count = 1
+
+        if self._debug_count <= 3:
+            logger.debug(f"Processing item {self._debug_count}:")
+            logger.debug(f"  Looking for input field '{config.input_field}': {input_text}")
+            logger.debug(f"  Looking for output field '{config.output_field}': {output_text}")
+
+        # Handle None values
+        if input_text is None:
+            input_text = ""
+        if output_text is None:
+            output_text = ""
+
+        # Convert to string if needed
+        input_text = str(input_text)
+        output_text = str(output_text)
+
+        if self._debug_count <= 3:
+            logger.debug(f"  After conversion - input: '{input_text[:50]}...', output: '{output_text[:50]}...'")
+
+        # Clean text if requested
+        if config.clean_text:
+            original_input = input_text
+            original_output = output_text
+            input_text = self._clean_text(input_text, config)
+            output_text = self._clean_text(output_text, config)
+            if self._debug_count <= 3:
+                logger.debug(f"  After cleaning - input: '{original_input[:50]}...' -> '{input_text[:50]}...'")
+                logger.debug(f"  After cleaning - output: '{original_output[:50]}...' -> '{output_text[:50]}...'")
+
+        # Check length constraints
+        if len(input_text) < config.min_length or len(input_text) > config.max_length:
+            if self._debug_count <= 3:
+                logger.debug(f"  Skipping - input length {len(input_text)} not in range [{config.min_length}, {config.max_length}]")
+            return None
+
+        if len(output_text) < config.min_length or len(output_text) > config.max_length:
+            if self._debug_count <= 3:
+                logger.debug(f"  Skipping - output length {len(output_text)} not in range [{config.min_length}, {config.max_length}]")
+            return None
+
+        # Create processed item - Always use "input" and "output" for internal processing
+        processed_item = {
+            "input": input_text,
+            "output": output_text
+        }
+
+        if self._debug_count <= 3:
+            logger.debug(f"  Final processed item: {processed_item}")
+
+        return processed_item
+
+    def _clean_text(self, text: str, config: StylingConfig) -> str:
+        """Clean and normalize text"""
+        if not isinstance(text, str):
+            return ""
+
+        # Remove extra whitespace
+        text = re.sub(r'\s+', ' ', text).strip()
+
+        # Convert to lowercase if requested
+        if config.lowercase:
+            text = text.lower()
+
+        # Remove special characters if requested
+        if config.remove_special_chars:
+            text = re.sub(r'[^\w\s]', '', text)
+
+        return text
+
+
+class CustomDataLoader(BaseDataLoader):
+    """Load custom datasets from local files"""
+    
+    def load(self, config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Load custom dataset from local file and create splits"""
+        if not config.data_path:
+            raise ValueError("Data path is required for custom datasets")
+        
+        file_path = Path(config.data_path)
+        
+        if not file_path.exists():
+            raise FileNotFoundError(f"Data file not found: {file_path}")
+        
+        logger.info(f"Loading custom dataset: {file_path}")
+        
+        if config.data_format == "jsonl":
+            raw_data = self._load_jsonl(file_path, config)
+        elif config.data_format == "csv":
+            raw_data = self._load_csv(file_path, config)
+        elif config.data_format == "json":
+            raw_data = self._load_json(file_path, config)
+        else:
+            raise ValueError(f"Unsupported format: {config.data_format}")
+        
+        if config.max_samples:
+            raw_data = raw_data[:config.max_samples]
+        
+        logger.info(f"Loaded {len(raw_data)} samples from {file_path}")
+        
+        # Create splits from the raw data
+        splits_data = self._create_splits(raw_data, config)
+        
+        return splits_data
+    
+    def _create_splits(self, data: List[Dict], config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Create train/validation/test splits from raw data"""
+        logger.info(f"Creating splits from {len(data)} samples...")
+        
+        # Handle very small datasets
+        if len(data) < 3:
+            logger.warning(f"Dataset has only {len(data)} samples. Using all data for training.")
+            return {
+                "train": data,
+                "validation": [],
+                "test": []
+            }
+        
+        # Calculate split sizes with minimum guarantees
+        total_samples = len(data)
+        
+        # Ensure minimum sizes for each split
+        min_val_size = max(1, int(total_samples * 0.1))  # At least 1 sample for validation
+        min_test_size = max(1, int(total_samples * 0.1))  # At least 1 sample for test
+        
+        # Adjust split ratios if dataset is too small
+        if total_samples < 10:
+            # For small datasets, use more conservative splits
+            config.train_split = 0.6
+            config.validation_split = 0.2
+            config.test_split = 0.2
+            logger.info(f"Small dataset detected. Adjusted split ratios to: train={config.train_split}, val={config.validation_split}, test={config.test_split}")
+        
+        # Calculate actual split sizes
+        val_size = max(min_val_size, int(total_samples * config.validation_split))
+        test_size = max(min_test_size, int(total_samples * config.test_split))
+        train_size = total_samples - val_size - test_size
+        
+        # Ensure train split has at least 1 sample
+        if train_size < 1:
+            # Adjust validation and test to ensure train has at least 1 sample
+            if val_size > 1:
+                val_size -= 1
+                train_size += 1
+            elif test_size > 1:
+                test_size -= 1
+                train_size += 1
+            logger.info(f"Adjusted split sizes to ensure train has at least 1 sample: train={train_size}, val={val_size}, test={test_size}")
+        
+        logger.info(f"Split sizes: train={train_size}, validation={val_size}, test={test_size}")
+        
+        # Create splits
+        if val_size == 0 and test_size == 0:
+            # All data goes to train
+            splits_data = {
+                "train": data,
+                "validation": [],
+                "test": []
+            }
+        elif val_size == 0:
+            # Split between train and test
+            train_data, test_data = train_test_split(data, test_size=test_size, random_state=42)
+            splits_data = {
+                "train": train_data,
+                "validation": [],
+                "test": test_data
+            }
+        elif test_size == 0:
+            # Split between train and validation
+            train_data, val_data = train_test_split(data, test_size=val_size, random_state=42)
+            splits_data = {
+                "train": train_data,
+                "validation": val_data,
+                "test": []
+            }
+        else:
+            # Full three-way split
+            # First split: train + (val+test)
+            train_data, temp_data = train_test_split(
+                data,
+                test_size=val_size + test_size,
+                random_state=42
+            )
+            
+            # Second split: val + test
+            val_data, test_data = train_test_split(
+                temp_data,
+                test_size=test_size,
+                random_state=42
+            )
+            
+            splits_data = {
+                "train": train_data,
+                "validation": val_data,
+                "test": test_data
+            }
+        
+        logger.info(f"Created splits:")
+        logger.info(f"  Train: {len(splits_data['train'])} samples")
+        logger.info(f"  Validation: {len(splits_data['validation'])} samples")
+        logger.info(f"  Test: {len(splits_data['test'])} samples")
+        
+        return splits_data
+    
+    def _load_jsonl(self, file_path: Path, config: StylingConfig) -> List[Dict]:
+        """Load JSONL file"""
+        data = []
+        with open(file_path, 'r', encoding=config.encoding) as f:
+            for line_num, line in enumerate(f, 1):
+                if line.strip():
+                    try:
+                        data.append(json.loads(line))
+                    except json.JSONDecodeError as e:
+                        logger.warning(f"Invalid JSON at line {line_num}: {e}")
+        return data
+    
+    def _load_csv(self, file_path: Path, config: StylingConfig) -> List[Dict]:
+        """Load CSV file"""
+        df = pd.read_csv(file_path, encoding=config.encoding, delimiter=config.delimiter)
+        return df.to_dict('records')
+    
+    def _load_json(self, file_path: Path, config: StylingConfig) -> List[Dict]:
+        """Load JSON file"""
+        with open(file_path, 'r', encoding=config.encoding) as f:
+            data = json.load(f)
+        
+        if isinstance(data, list):
+            return data
+        elif isinstance(data, dict) and "data" in data:
+            return data["data"]
+        else:
+            return [data]
+    
+    def preprocess(self, data: Dict[str, List[Dict]], config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Apply preprocessing steps to all splits separately"""
+        processed_splits = {}
+        
+        logger.info(f"=== PREPROCESSING CUSTOM DATA ===")
+        
+        for split_name, split_data in data.items():
+            logger.info(f"Processing {split_name} split with {len(split_data)} items...")
+            
+            processed_data = []
+            processed_count = 0
+            skipped_count = 0
+            
+            # Reset debug counter for each split
+            self._debug_count = 0
+            
+            for i, item in enumerate(split_data):
+                processed_item = self._preprocess_item(item, config)
+                if processed_item is not None:
+                    processed_data.append(processed_item)
+                    processed_count += 1
+                else:
+                    skipped_count += 1
+                    if skipped_count <= 3:  # Log first few skipped items
+                        logger.info(f"Skipped item {i} from {split_name}: {item}")
+            
+            processed_splits[split_name] = processed_data
+            logger.info(f"{split_name} - Preprocessed {processed_count} samples, skipped {skipped_count} samples")
+        
+        return processed_splits
+    
+    def _preprocess_item(self, item: Dict, config: StylingConfig) -> Optional[Dict]:
+        """Preprocess a single item"""
+        # Extract input and output using configurable field names
+        input_text = item.get(config.input_field, "")
+        output_text = item.get(config.output_field, "")
+        
+        # Handle None values
+        if input_text is None:
+            input_text = ""
+        if output_text is None:
+            output_text = ""
+        
+        # Convert to string if needed
+        input_text = str(input_text)
+        output_text = str(output_text)
+        
+        # Clean text if requested
+        if config.clean_text:
+            input_text = self._clean_text(input_text, config)
+            output_text = self._clean_text(output_text, config)
+        
+        # Check length constraints
+        if len(input_text) < config.min_length or len(input_text) > config.max_length:
+            return None
+        
+        if len(output_text) < config.min_length or len(output_text) > config.max_length:
+            return None
+        
+        # Create processed item - Always use "input" and "output" for internal processing
+        processed_item = {
+            "input": input_text,
+            "output": output_text
+        }
+        
+        return processed_item
+    
+    def _clean_text(self, text: str, config: StylingConfig) -> str:
+        """Clean and normalize text"""
+        if not isinstance(text, str):
+            return ""
+        
+        # Remove extra whitespace
+        text = re.sub(r'\s+', ' ', text).strip()
+        
+        # Convert to lowercase if requested
+        if config.lowercase:
+            text = text.lower()
+        
+        # Remove special characters if requested
+        if config.remove_special_chars:
+            text = re.sub(r'[^\w\s]', '', text)
+        
+        return text
+
+
+class StylingDataPipeline:
+    """Main styling pipeline"""
+    
+    def __init__(self):
+        self.validator = DataValidator()
+        self.hf_loader = HuggingFaceDataLoader()
+        self.custom_loader = CustomDataLoader()
+    
+    def create_config(
+        self,
+        data_source: str,
+        dataset_name: Optional[str] = None,
+        data_path: Optional[str] = None,
+        input_field: str = "input",
+        output_field: str = "output",
+        instruction: str = "Rewrite the following text in a formal style",
+        **kwargs
+    ) -> StylingConfig:
+        """Create styling configuration"""
+        return StylingConfig(
+            data_source=data_source,
+            dataset_name=dataset_name,
+            data_path=data_path,
+            input_field=input_field,
+            output_field=output_field,
+            instruction=instruction,
+            **kwargs
+        )
+
+    def load_config_from_yaml(self, yaml_path: str) -> StylingConfig:
+        """Load configuration from YAML file"""
+        try:
+            config_dict = load_yaml_config(yaml_path)
+            
+            # Create configuration object from YAML data
+            config = StylingConfig(
+                data_source=config_dict.get('data_source', 'custom'),
+                dataset_name=config_dict.get('dataset_name'),
+                data_path=config_dict.get('data_path'),
+                data_format=config_dict.get('data_format', 'jsonl'),
+                input_field=config_dict.get('input_field', 'text'),
+                output_field=config_dict.get('output_field', 'styled_text'),
+                instruction=config_dict.get('instruction', 'Rewrite the following text in a formal style'),
+                max_samples=config_dict.get('max_samples'),
+                train_split=config_dict.get('train_split', 0.8),
+                validation_split=config_dict.get('validation_split', 0.1),
+                test_split=config_dict.get('test_split', 0.1),
+                clean_text=config_dict.get('clean_text', True),
+                remove_special_chars=config_dict.get('remove_special_chars', False),
+                lowercase=config_dict.get('lowercase', False),
+                min_length=config_dict.get('min_length', 10),
+                max_length=config_dict.get('max_length', 1000),
+                output_format=config_dict.get('output_format', 'styling'),
+                output_dir=config_dict.get('output_dir', './data'),
+                hf_split=config_dict.get('hf_split', 'train'),
+                hf_cache_dir=config_dict.get('hf_cache_dir'),
+                test_split_from=config_dict.get('test_split_from', 'train'),
+                val_split_from=config_dict.get('val_split_from', 'train'),
+                encoding=config_dict.get('encoding', 'utf-8'),
+                delimiter=config_dict.get('delimiter', ',')
+            )
+            
+            logger.info(f"Configuration loaded from YAML: {yaml_path}")
+            logger.info(f"Output directory: {config.output_dir}")
+            logger.info(f"Instruction: {config.instruction}")
+            
+            return config
+            
+        except Exception as e:
+            logger.error(f"Error loading configuration from YAML {yaml_path}: {e}")
+            raise
+    
+    def load_and_preprocess(self, config: StylingConfig) -> Tuple[Dict[str, List[Dict]], Dict[str, Any]]:
+        """Load and preprocess data"""
+        
+        # Load data
+        if config.data_source == "huggingface":
+            raw_splits = self.hf_loader.load(config)
+            processed_splits = self.hf_loader.preprocess(raw_splits, config)
+        elif config.data_source == "custom":
+            raw_splits = self.custom_loader.load(config)
+            processed_splits = self.custom_loader.preprocess(raw_splits, config)
+        else:
+            raise ValueError(f"Unsupported data source: {config.data_source}")
+        
+        # Validate processed data
+        is_valid, errors = self.validator.validate_styling_data(processed_splits, config, is_processed=True)
+        if not is_valid:
+            logger.error("Data validation failed:")
+            for error in errors:
+                logger.error(f"  - {error}")
+            raise ValueError("Data validation failed")
+        
+        # Analyze dataset
+        analysis = self.validator.analyze_dataset(processed_splits, config, is_processed=True)
+        
+        return processed_splits, analysis
+    
+    def convert_to_alpaca_format(self, data: Dict[str, List[Dict]], config: StylingConfig) -> Dict[str, List[Dict]]:
+        """Convert styling data to Alpaca format with instruction"""
+        alpaca_splits = {}
+        
+        for split_name, split_data in data.items():
+            alpaca_data = []
+            for item in split_data:
+                # Ensure input and output fields exist, default to empty string if missing
+                input_text = item.get("input", "")
+                output_text = item.get("output", "")
+                
+                # Handle None values
+                if input_text is None:
+                    input_text = ""
+                if output_text is None:
+                    output_text = ""
+                
+                # Convert to string if needed
+                input_text = str(input_text)
+                output_text = str(output_text)
+                
+                alpaca_data.append({
+                    "instruction": config.instruction,
+                    "input": input_text,
+                    "output": output_text
+                })
+            alpaca_splits[split_name] = alpaca_data
+        
+        return alpaca_splits
+
+    def format_for_training(self, data: Dict[str, List[Dict]], config: StylingConfig) -> Dict[str, List[str]]:
+        """Format entries for training using Alpaca prompt format"""
+        formatted_splits = {}
+        
+        for split_name, split_data in data.items():
+            formatted_texts = []
+            for item in split_data:
+                # Ensure input and output fields exist, default to empty string if missing
+                input_text = item.get("input", "")
+                output_text = item.get("output", "")
+                
+                # Handle None values
+                if input_text is None:
+                    input_text = ""
+                if output_text is None:
+                    output_text = ""
+                
+                # Convert to string if needed
+                input_text = str(input_text)
+                output_text = str(output_text)
+                
+                text = config.alpaca_prompt.format(
+                    config.instruction,
+                    input_text,
+                    output_text
+                ) + config.eos_token
+                formatted_texts.append(text)
+            formatted_splits[split_name] = formatted_texts
+        
+        return formatted_splits
+
+    def convert_to_hf_dataset(self, dataset_entries: List[Dict], config: StylingConfig):
+        """Convert dataset entries to HuggingFace dataset format with text formatting"""
+        from datasets import Dataset
+
+        # Create HuggingFace dataset from list of dictionaries
+        hf_dataset = Dataset.from_list(dataset_entries)
+
+        # Apply formatting function to generate the text field
+        def formatting_prompts_func(examples):
+            instructions = examples["instruction"]
+            inputs = examples["input"]
+            outputs = examples["output"]
+            texts = []
+
+            for instruction, input_text, output in zip(instructions, inputs, outputs):
+                # Handle None values and ensure strings
+                if input_text is None:
+                    input_text = ""
+                if output is None:
+                    output = ""
+                
+                # Convert to string if needed
+                input_text = str(input_text)
+                output = str(output)
+                
+                # Use the config's EOS token and alpaca prompt
+                text = config.alpaca_prompt.format(instruction, input_text, output) + config.eos_token
+                texts.append(text)
+
+            return {"text": texts}
+
+        # Apply the formatting function
+        formatted_dataset = hf_dataset.map(formatting_prompts_func, batched=True)
+
+        return formatted_dataset
+
+    def save_hf_dataset_to_disk(self, hf_dataset, save_path: str):
+        """Save HuggingFace dataset to disk"""
+        try:
+            hf_dataset.save_to_disk(save_path)
+            logger.info(f"HuggingFace dataset saved to disk at: {save_path}")
+            return True
+        except Exception as e:
+            logger.error(f"Error saving HuggingFace dataset to disk: {e}")
+            return False
+
+    def load_hf_dataset_from_disk(self, load_path: str):
+        """Load HuggingFace dataset from disk"""
+        try:
+            from datasets import load_from_disk
+            hf_dataset = load_from_disk(load_path)
+            logger.info(f"HuggingFace dataset loaded from disk: {load_path}")
+            logger.info(f"Dataset has {len(hf_dataset)} entries")
+            logger.info(f"Dataset features: {hf_dataset.features}")
+            return hf_dataset
+        except Exception as e:
+            logger.error(f"Error loading HuggingFace dataset from disk: {e}")
+            return None
+
+    def save_data(self, data: Dict[str, List[Dict]], output_dir: str, format: str = "jsonl"):
+        """Save processed data splits to files"""
+        output_path = Path(output_dir)
+        output_path.mkdir(parents=True, exist_ok=True)
+        
+        for split_name, split_data in data.items():
+            if format == "jsonl":
+                output_file = output_path / f"{split_name}.jsonl"
+                with open(output_file, 'w', encoding='utf-8') as f:
+                    for item in split_data:
+                        f.write(json.dumps(item, ensure_ascii=False) + '\n')
+            elif format == "json":
+                output_file = output_path / f"{split_name}.json"
+                with open(output_file, 'w', encoding='utf-8') as f:
+                    json.dump(split_data, f, ensure_ascii=False, indent=2)
+            elif format == "csv":
+                output_file = output_path / f"{split_name}.csv"
+                df = pd.DataFrame(split_data)
+                df.to_csv(output_file, index=False)
+            
+            logger.info(f"Saved {len(split_data)} samples to {output_file}")
+    
+    def run_pipeline(
+        self,
+        config: StylingConfig,
+        output_format: str = "styling",
+        save_splits: bool = True,
+        create_hf_dataset: bool = False,
+        save_hf_dataset: bool = False,
+        hf_dataset_path: str = None
+    ) -> Dict[str, Any]:
+        """Run complete styling pipeline"""
+        
+        logger.info("Starting styling pipeline...")
+        
+        # Load and preprocess data
+        processed_splits, analysis = self.load_and_preprocess(config)
+        
+        # Convert to desired output format
+        if output_format == "alpaca":
+            formatted_splits = self.convert_to_alpaca_format(processed_splits, config)
+        else:
+            formatted_splits = processed_splits
+        
+        # Save data if requested
+        if save_splits:
+            # Save directly in the output directory, not in a subdirectory
+            output_dir = Path(config.output_dir)
+            self.save_data(formatted_splits, str(output_dir))
+        
+        # Convert to HuggingFace dataset if requested
+        hf_dataset = None
+        hf_dataset_save_path = None
+        if create_hf_dataset:
+            # Flatten all splits into one list for HF dataset
+            all_entries = []
+            for split_name, split_data in formatted_splits.items():
+                for item in split_data:
+                    # Ensure we have the instruction field
+                    if "instruction" not in item:
+                        item["instruction"] = config.instruction
+                    all_entries.append(item)
+            
+            hf_dataset = self.convert_to_hf_dataset(all_entries, config)
+            logger.info(f"HuggingFace dataset created with {len(hf_dataset)} entries")
+            logger.info(f"Dataset features: {hf_dataset.features}")
+            
+            # Save HuggingFace dataset to disk if requested
+            if save_hf_dataset:
+                if hf_dataset_path is None:
+                    # Generate default path using the YAML output_dir
+                    hf_dataset_path = str(Path(config.output_dir) / "hf_dataset")
+                
+                success = self.save_hf_dataset_to_disk(hf_dataset, hf_dataset_path)
+                if success:
+                    hf_dataset_save_path = hf_dataset_path
+                    logger.info(f"HuggingFace dataset saved to: {hf_dataset_save_path}")
+                else:
+                    logger.warning("Failed to save HuggingFace dataset to disk")
+        
+        # Create result summary
+        result = {
+            "config": config,
+            "analysis": analysis,
+            "splits": {
+                split_name: len(split_data) for split_name, split_data in formatted_splits.items()
+            },
+            "output_format": output_format,
+            "output_dir": config.output_dir,
+            "data": formatted_splits,  # Include the actual processed data
+            "instruction": config.instruction
+        }
+        
+        # Add HuggingFace dataset info to result if created
+        if hf_dataset is not None:
+            result["hf_dataset"] = hf_dataset
+            if hf_dataset_save_path:
+                result["hf_dataset_path"] = hf_dataset_save_path
+        
+        logger.info("Styling pipeline completed successfully!")
+        return result
+
+# Helper functions
+def create_huggingface_config(dataset_name: str, input_field: str = "text", output_field: str = "output", instruction: str = "Rewrite the following text in a formal style", **kwargs) -> StylingConfig:
+    """Helper function to create a HuggingFace configuration"""
+    return StylingConfig(
+        data_source="huggingface",
+        dataset_name=dataset_name,
+        input_field=input_field,
+        output_field=output_field,
+        instruction=instruction,
+        **kwargs
+    )
+
+
+def create_custom_config(data_path: str, data_format: str = "jsonl", input_field: str = "text", output_field: str = "styled_text", instruction: str = "Rewrite the following text in a formal style", **kwargs) -> StylingConfig:
+    """Helper function to create a custom data configuration"""
+    return StylingConfig(
+        data_source="custom",
+        data_path=data_path,
+        data_format=data_format,
+        input_field=input_field,
+        output_field=output_field,
+        instruction=instruction,
+        **kwargs
+    )
+
+
+def save_hf_dataset_to_disk(hf_dataset, save_path: str) -> bool:
+    """Utility function to save HuggingFace dataset to disk"""
+    try:
+        hf_dataset.save_to_disk(save_path)
+        print(f"HuggingFace dataset saved to disk at: {save_path}")
+        return True
+    except Exception as e:
+        print(f"Error saving HuggingFace dataset to disk: {e}")
+        return False
+
+
+def load_hf_dataset_from_disk(load_path: str):
+    """Utility function to load HuggingFace dataset from disk"""
+    try:
+        from datasets import load_from_disk
+        hf_dataset = load_from_disk(load_path)
+        print(f"HuggingFace dataset loaded from disk: {load_path}")
+        print(f"Dataset has {len(hf_dataset)} entries")
+        print(f"Dataset features: {hf_dataset.features}")
+        return hf_dataset
+    except Exception as e:
+        print(f"Error loading HuggingFace dataset from disk: {e}")
+        return None
+
+
+def load_yaml_config(config_path: str) -> Dict[str, Any]:
+    """Load and parse YAML configuration file with proper structure handling"""
+    try:
+        with open(config_path, 'r', encoding='utf-8') as f:
+            yaml_data = yaml.safe_load(f)
+        
+        # Extract configuration from YAML structure
+        config_dict = {}
+        
+        # Handle task section
+        if 'task' in yaml_data:
+            task_data = yaml_data['task']
+            config_dict.update({
+                'task_name': task_data.get('name'),
+                'task_type': task_data.get('type')
+            })
+        
+        # Handle data section
+        if 'data' in yaml_data:
+            data_config = yaml_data['data']
+            config_dict.update({
+                'data_source': data_config.get('source'),
+                'dataset_name': data_config.get('dataset_name'),
+                'data_path': data_config.get('data_path'),
+                'data_format': data_config.get('data_format'),
+                'input_field': data_config.get('input_field'),
+                'output_field': data_config.get('output_field'),
+                'instruction': data_config.get('instruction'),
+                'max_samples': data_config.get('max_samples'),
+                'train_split': data_config.get('train_split'),
+                'validation_split': data_config.get('validation_split'),
+                'test_split': data_config.get('test_split'),
+                'clean_text': data_config.get('clean_text'),
+                'lowercase': data_config.get('lowercase'),
+                'min_length': data_config.get('min_length'),
+                'max_length': data_config.get('max_length'),
+                'output_format': data_config.get('output_format'),
+                'output_dir': data_config.get('output_dir'),
+                'encoding': data_config.get('encoding'),
+                'delimiter': data_config.get('delimiter')
+            })
+        
+        # Handle model section
+        if 'model' in yaml_data:
+            model_data = yaml_data['model']
+            config_dict.update({
+                'model_name': model_data.get('name'),
+                'model_max_length': model_data.get('max_length')
+            })
+        
+        # Handle training section
+        if 'training' in yaml_data:
+            training_data = yaml_data['training']
+            config_dict.update({
+                'num_epochs': training_data.get('num_epochs'),
+                'batch_size': training_data.get('batch_size'),
+                'learning_rate': training_data.get('learning_rate'),
+                'weight_decay': training_data.get('weight_decay'),
+                'warmup_ratio': training_data.get('warmup_ratio'),
+                'lr_scheduler_type': training_data.get('lr_scheduler_type')
+            })
+        
+        # Handle inference section
+        if 'inference' in yaml_data:
+            inference_data = yaml_data['inference']
+            config_dict.update({
+                'inference_batch_size': inference_data.get('batch_size'),
+                'max_new_tokens': inference_data.get('max_new_tokens'),
+                'temperature': inference_data.get('temperature')
+            })
+        
+        logger.info(f"Successfully parsed YAML configuration from: {config_path}")
+        logger.info(f"Extracted {len(config_dict)} configuration parameters")
+        
+        return config_dict
+        
+    except Exception as e:
+        logger.error(f"Error loading YAML config from {config_path}: {e}")
+        raise
+
+
+def main():
+    """Main function with YAML configuration support"""
+    
+    parser = argparse.ArgumentParser(description="Styling Data Processing Pipeline")
+    
+    # YAML configuration
+    parser.add_argument("--config", type=str, help="Path to YAML configuration file")
+    
+    # Data source arguments
+    parser.add_argument("--data-source", choices=["huggingface", "custom"], help="Data source")
+    parser.add_argument("--dataset-name", type=str, help="HuggingFace dataset name")
+    parser.add_argument("--data-path", type=str, help="Path to custom data file")
+    parser.add_argument("--data-format", choices=["jsonl", "csv", "json"], help="Data format")
+    
+    # Field mapping
+    parser.add_argument("--input-field", type=str, help="Input field name")
+    parser.add_argument("--output-field", type=str, help="Output field name")
+    parser.add_argument("--instruction", type=str, help="Style instruction")
+    
+    # Data processing
+    parser.add_argument("--max-samples", type=int, help="Maximum samples to process")
+    parser.add_argument("--train-split", type=float, help="Training split ratio")
+    parser.add_argument("--validation-split", type=float, help="Validation split ratio")
+    parser.add_argument("--test-split", type=float, help="Test split ratio")
+    
+    # Text preprocessing
+    parser.add_argument("--clean-text", action="store_true", help="Clean and normalize text")
+    parser.add_argument("--remove-special-chars", action="store_true", help="Remove special characters")
+    parser.add_argument("--lowercase", action="store_true", help="Convert text to lowercase")
+    parser.add_argument("--min-length", type=int, help="Minimum text length")
+    parser.add_argument("--max-length", type=int, help="Maximum text length")
+    
+    # Output configuration
+    parser.add_argument("--output-format", choices=["styling", "alpaca"], help="Output format")
+    parser.add_argument("--output-dir", type=str, help="Output directory")
+    
+    # HuggingFace dataset options
+    parser.add_argument("--create-hf-dataset", action="store_true", help="Create HuggingFace dataset")
+    parser.add_argument("--hf-dataset-path", type=str, help="Path to save HuggingFace dataset")
+    
+    # Logging
+    parser.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO", help="Logging level")
+    
+    args = parser.parse_args()
+    
+    # Set up logging
+    logging.basicConfig(
+        level=getattr(logging, args.log_level),
+        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+    )
+    
+    # Load configuration
+    config_dict = {}
+    
+    # Load YAML config if provided
+    if args.config:
+        try:
+            config_dict = load_yaml_config(args.config)
+        except Exception as e:
+            logger.error(f"Error loading YAML config: {e}")
+            sys.exit(1)
+    
+    # Override YAML config with CLI arguments
+    cli_overrides = {}
+    if args.data_source:
+        cli_overrides['data_source'] = args.data_source
+    if args.dataset_name:
+        cli_overrides['dataset_name'] = args.dataset_name
+    if args.data_path:
+        cli_overrides['data_path'] = args.data_path
+    if args.data_format:
+        cli_overrides['data_format'] = args.data_format
+    if args.input_field:
+        cli_overrides['input_field'] = args.input_field
+    if args.output_field:
+        cli_overrides['output_field'] = args.output_field
+    if args.instruction:
+        cli_overrides['instruction'] = args.instruction
+    if args.max_samples:
+        cli_overrides['max_samples'] = args.max_samples
+    if args.train_split:
+        cli_overrides['train_split'] = args.train_split
+    if args.validation_split:
+        cli_overrides['validation_split'] = args.validation_split
+    if args.test_split:
+        cli_overrides['test_split'] = args.test_split
+    if args.clean_text:
+        cli_overrides['clean_text'] = True
+    if args.remove_special_chars:
+        cli_overrides['remove_special_chars'] = True
+    if args.lowercase:
+        cli_overrides['lowercase'] = True
+    if args.min_length:
+        cli_overrides['min_length'] = args.min_length
+    if args.max_length:
+        cli_overrides['max_length'] = args.max_length
+    if args.output_format:
+        cli_overrides['output_format'] = args.output_format
+    if args.output_dir:
+        cli_overrides['output_dir'] = args.output_dir
+    
+    # HuggingFace dataset options
+    if args.create_hf_dataset:
+        cli_overrides['create_hf_dataset'] = True
+    if args.hf_dataset_path:
+        cli_overrides['hf_dataset_path'] = args.hf_dataset_path
+    
+    # Logging
+    if args.log_level:
+        cli_overrides['log_level'] = args.log_level
+    
+    # Merge configurations
+    for key, value in cli_overrides.items():
+        if key in config_dict:
+            logger.info(f"Overriding YAML config '{key}' with CLI value: {value}")
+        config_dict[key] = value
+    
+    # Validate required arguments
+    if not config_dict.get('data_source'):
+        parser.error("--data-source is required (either in YAML config or CLI)")
+    
+    if config_dict.get('data_source') == "huggingface" and not config_dict.get('dataset_name'):
+        parser.error("--dataset-name is required for HuggingFace datasets")
+    
+    if config_dict.get('data_source') == "custom" and not config_dict.get('data_path'):
+        parser.error("--data-path is required for custom datasets")
+    
+    # Create configuration object - properly handle YAML structure
+    config = StylingConfig(
+        data_source=config_dict.get('data_source', 'huggingface'),
+        dataset_name=config_dict.get('dataset_name'),
+        data_path=config_dict.get('data_path'),
+        data_format=config_dict.get('data_format', 'jsonl'),
+        input_field=config_dict.get('input_field', 'text'),
+        output_field=config_dict.get('output_field', 'styled_text'),
+        instruction=config_dict.get('instruction', 'Rewrite the following text in a formal style'),
+        max_samples=config_dict.get('max_samples'),
+        train_split=config_dict.get('train_split', 0.8),
+        validation_split=config_dict.get('validation_split', 0.1),
+        test_split=config_dict.get('test_split', 0.1),
+        clean_text=config_dict.get('clean_text', True),
+        remove_special_chars=config_dict.get('remove_special_chars', False),
+        lowercase=config_dict.get('lowercase', False),
+        min_length=config_dict.get('min_length', 10),
+        max_length=config_dict.get('max_length', 1000),
+        output_format=config_dict.get('output_format', 'styling'),
+        output_dir=config_dict.get('output_dir', './data'),
+        hf_split=config_dict.get('hf_split', 'train'),
+        hf_cache_dir=config_dict.get('hf_cache_dir'),
+        test_split_from=config_dict.get('test_split_from', 'train'),
+        val_split_from=config_dict.get('val_split_from', 'train'),
+        encoding=config_dict.get('encoding', 'utf-8'),
+        delimiter=config_dict.get('delimiter', ',')
+    )
+    
+    # Initialize pipeline
+    pipeline = StylingDataPipeline()
+    
+    try:
+        print(f"Starting styling pipeline with {config.data_source} data source...")
+        if args.config:
+            print(f"Using YAML configuration: {args.config}")
+        print(f"Style instruction: {config.instruction}")
+        print()
+        
+        # Check if we should create HuggingFace dataset
+        create_hf_dataset = cli_overrides.get('create_hf_dataset', False)
+        hf_dataset_path = cli_overrides.get('hf_dataset_path')
+        
+        # If creating HF dataset, also save it by default
+        save_hf_dataset = create_hf_dataset
+        
+        result = pipeline.run_pipeline(
+            config, 
+            config.output_format, 
+            save_splits=True, 
+            create_hf_dataset=create_hf_dataset,
+            save_hf_dataset=save_hf_dataset,
+            hf_dataset_path=hf_dataset_path
+        )
+        
+        print(f"✅ Pipeline completed successfully!")
+        print(f"  Data source: {config.data_source}")
+        if config.data_source == "huggingface":
+            print(f"  Dataset: {config.dataset_name}")
+        else:
+            print(f"  Data file: {config.data_path}")
+        print(f"  Total samples: {result['analysis']['overall']['total_samples']}")
+        print(f"  Split sizes: {result['analysis']['overall']['split_sizes']}")
+        print(f"  Output directory: {config.output_dir}")
+        print(f"  Style instruction: {config.instruction}")
+        
+    except Exception as e:
+        print(f"❌ Error running pipeline: {e}")
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/pipelines/styling/inference.py b/pipelines/styling/inference.py
new file mode 100644
index 0000000..9dccafb
--- /dev/null
+++ b/pipelines/styling/inference.py
@@ -0,0 +1,346 @@
+#!/usr/bin/env python3
+"""
+Styling Inference Pipeline using Trained Models
+Supports style transfer inference with streaming and batch processing
+"""
+
+import os
+import sys
+import json
+import logging
+import argparse
+from pathlib import Path
+from typing import Dict, Any, Optional, List, Union
+import yaml
+
+# Add the project root to the path
+sys.path.append(str(Path(__file__).parent.parent.parent))
+
+from utils.config.config_manager import ConfigManager
+from utils.logging.logging import setup_logging
+
+# Inference imports
+import torch
+from datasets import load_from_disk, Dataset
+from unsloth import FastLanguageModel
+from transformers import TextStreamer
+
+logger = logging.getLogger(__name__)
+
+class StylingInference:
+    """Styling task inference using trained models"""
+    
+    def __init__(self, config: Dict[str, Any]):
+        self.config = config
+        self.model = None
+        self.tokenizer = None
+        
+        # Set device
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        logger.info(f"Using device: {self.device}")
+        
+        # Model parameters
+        self.model_path = config.get('model_path')
+        self.max_seq_length = config.get('max_seq_length', 2048)
+        self.dtype = config.get('dtype', None)
+        self.load_in_4bit = config.get('load_in_4bit', True)
+        self.hf_token = config.get('hf_token', None)
+        
+        # Inference parameters
+        self.batch_size = config.get('batch_size', 1)
+        self.max_new_tokens = config.get('max_new_tokens', 128)
+        self.temperature = config.get('temperature', 0.8)
+        self.top_p = config.get('top_p', 0.9)
+        self.do_sample = config.get('do_sample', True)
+        
+        # Alpaca prompt template
+        self.alpaca_prompt = config.get('alpaca_prompt', """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that follows the instruction
+
+### Instruction:
+{}
+
+### Input:
+{}
+
+### Response:
+{}""")
+        
+        # Style instruction
+        self.style_instruction = config.get('style_instruction', 'Rewrite the following text in a formal style')
+    
+    def load_model_and_tokenizer(self):
+        """Load the trained model and tokenizer"""
+        logger.info("Loading model and tokenizer...")
+        
+        try:
+            if self.model_path and Path(self.model_path).exists():
+                # Load local trained model
+                logger.info(f"Loading local model from: {self.model_path}")
+                self.model, self.tokenizer = FastLanguageModel.from_pretrained(
+                    model_name=self.model_path,
+                    max_seq_length=self.max_seq_length,
+                    dtype=self.dtype,
+                    load_in_4bit=self.load_in_4bit,
+                    token=self.hf_token
+                )
+            else:
+                # Load base model from HuggingFace Hub
+                logger.info(f"Loading base model: {self.config.get('base_model_name', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit')}")
+                self.model, self.tokenizer = FastLanguageModel.from_pretrained(
+                    model_name=self.config.get('base_model_name', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit'),
+                    max_seq_length=self.max_seq_length,
+                    dtype=self.dtype,
+                    load_in_4bit=self.load_in_4bit,
+                    token=self.hf_token
+                )
+            
+            # Prepare for inference
+            FastLanguageModel.for_inference(self.model)
+            
+            logger.info(f"✅ Model loaded successfully")
+            logger.info(f"✅ Tokenizer loaded with vocab size: {self.tokenizer.vocab_size}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error loading model: {e}")
+            raise
+    
+    def format_prompt(self, instruction: str, input_text: str, output: str = "") -> str:
+        """Format the prompt using Alpaca template"""
+        return self.alpaca_prompt.format(instruction, input_text, output)
+    
+    def generate_text(self, prompt: str, max_new_tokens: Optional[int] = None) -> str:
+        """Generate text from a single prompt"""
+        try:
+            # Tokenize input
+            inputs = self.tokenizer([prompt], return_tensors="pt").to(self.device)
+            
+            # Set generation parameters
+            gen_kwargs = {
+                "max_new_tokens": max_new_tokens or self.max_new_tokens,
+                "temperature": self.temperature,
+                "top_p": self.top_p,
+                "do_sample": self.do_sample,
+                "use_cache": True,
+                "pad_token_id": self.tokenizer.eos_token_id
+            }
+            
+            # Generate
+            with torch.no_grad():
+                outputs = self.model.generate(**inputs, **gen_kwargs)
+            
+            # Decode
+            generated_text = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
+            
+            # Extract only the generated part (remove input prompt)
+            if prompt in generated_text:
+                generated_text = generated_text[len(prompt):].strip()
+            
+            return generated_text
+            
+        except Exception as e:
+            logger.error(f"❌ Error generating text: {e}")
+            return ""
+    
+    def style_transfer(self, input_text: str, instruction: Optional[str] = None, streaming: bool = False) -> str:
+        """Perform style transfer on input text"""
+        if instruction is None:
+            instruction = self.style_instruction
+        
+        # Format prompt
+        prompt = self.format_prompt(instruction, input_text, "")
+        
+        logger.info(f"Style transfer prompt: {prompt}")
+        
+        if streaming:
+            logger.info("Generating with streaming...")
+            self.generate_text_streaming(prompt)
+            return ""
+        else:
+            logger.info("Generating text...")
+            result = self.generate_text(prompt)
+            logger.info(f"Generated result: {result}")
+            return result
+    
+    def generate_text_streaming(self, prompt: str, max_new_tokens: Optional[int] = None):
+        """Generate text with streaming output"""
+        try:
+            # Tokenize input
+            inputs = self.tokenizer([prompt], return_tensors="pt").to(self.device)
+            
+            # Setup text streamer
+            text_streamer = TextStreamer(self.tokenizer)
+            
+            # Set generation parameters
+            gen_kwargs = {
+                "max_new_tokens": max_new_tokens or self.max_new_tokens,
+                "temperature": self.temperature,
+                "top_p": self.top_p,
+                "do_sample": self.do_sample,
+                "use_cache": True,
+                "pad_token_id": self.tokenizer.eos_token_id
+            }
+            
+            # Generate with streaming
+            with torch.no_grad():
+                _ = self.model.generate(**inputs, streamer=text_streamer, **gen_kwargs)
+                
+        except Exception as e:
+            logger.error(f"❌ Error in streaming generation: {e}")
+    
+    def batch_style_transfer(self, input_texts: List[str], instruction: Optional[str] = None) -> List[str]:
+        """Perform style transfer on multiple input texts"""
+        results = []
+        
+        for i, input_text in enumerate(input_texts):
+            logger.info(f"Processing text {i+1}/{len(input_texts)}")
+            result = self.style_transfer(input_text, instruction)
+            results.append(result)
+        
+        return results
+
+def load_inference_config(config_path: str) -> Dict[str, Any]:
+    """Load inference configuration from YAML file"""
+    try:
+        with open(config_path, 'r', encoding='utf-8') as f:
+            config = yaml.safe_load(f)
+        
+        # Extract inference configuration
+        inference_config = {}
+        
+        # Model configuration
+        if 'model' in config:
+            model_data = config['model']
+            inference_config.update({
+                'base_model_name': model_data.get('training_model', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit'),
+                'max_seq_length': model_data.get('training_max_seq_length', 2048),
+                'dtype': model_data.get('training_dtype'),
+                'load_in_4bit': model_data.get('training_load_in_4bit', True),
+                'hf_token': model_data.get('training_token')
+            })
+        
+        # Inference configuration
+        if 'inference' in config:
+            inference_data = config['inference']
+            inference_config.update({
+                'batch_size': inference_data.get('batch_size', 1),
+                'max_new_tokens': inference_data.get('max_new_tokens', 128),
+                'temperature': inference_data.get('temperature', 0.8)
+            })
+        
+        # Style configuration
+        if 'data' in config:
+            data_config = config['data']
+            inference_config.update({
+                'style_instruction': data_config.get('instruction', 'Rewrite the following text in a formal style')
+            })
+        
+        return inference_config
+        
+    except Exception as e:
+        logger.error(f"Error loading inference config: {e}")
+        raise
+
+def main():
+    """Main inference function"""
+    parser = argparse.ArgumentParser(description="Styling Inference Pipeline")
+    
+    # Configuration
+    parser.add_argument("--config", type=str, required=True, help="Path to YAML configuration file")
+    parser.add_argument("--model-path", type=str, help="Path to trained model (optional, uses base model if not provided)")
+    
+    # Inference modes
+    parser.add_argument("--text", type=str, help="Single text to style transfer")
+    parser.add_argument("--input-file", type=str, help="File containing texts to process (one per line)")
+    
+    # Generation parameters
+    parser.add_argument("--max-tokens", type=int, help="Maximum new tokens to generate")
+    parser.add_argument("--temperature", type=float, help="Sampling temperature")
+    parser.add_argument("--streaming", action="store_true", help="Enable streaming generation")
+    parser.add_argument("--instruction", type=str, help="Custom style instruction")
+    
+    # Output
+    parser.add_argument("--output-file", type=str, help="Output file for results")
+    
+    args = parser.parse_args()
+    
+    # Setup logging
+    setup_logging()
+    
+    try:
+        # Load configuration
+        logger.info(f"Loading configuration from: {args.config}")
+        inference_config = load_inference_config(args.config)
+        
+        # Override with CLI arguments
+        if args.model_path:
+            inference_config['model_path'] = args.model_path
+        if args.max_tokens:
+            inference_config['max_new_tokens'] = args.max_tokens
+        if args.temperature:
+            inference_config['temperature'] = args.temperature
+        if args.instruction:
+            inference_config['style_instruction'] = args.instruction
+        
+        logger.info("Inference configuration:")
+        for key, value in inference_config.items():
+            logger.info(f"  {key}: {value}")
+        
+        # Initialize inference
+        inferencer = StylingInference(inference_config)
+        
+        # Load model
+        inferencer.load_model_and_tokenizer()
+        
+        # Run inference based on mode
+        if args.text:
+            # Single text inference
+            logger.info("Running single text inference...")
+            result = inferencer.style_transfer(args.text, args.instruction, args.streaming)
+            if not args.streaming:
+                print(f"\nGenerated text: {result}")
+        
+        elif args.input_file:
+            # Batch file inference
+            logger.info("Running batch file inference...")
+            with open(args.input_file, 'r', encoding='utf-8') as f:
+                input_texts = [line.strip() for line in f if line.strip()]
+            
+            results = inferencer.batch_style_transfer(input_texts, args.instruction)
+            
+            # Save results
+            output_file = args.output_file or f"{Path(args.input_file).stem}_styled.txt"
+            with open(output_file, 'w', encoding='utf-8') as f:
+                for input_text, result in zip(input_texts, results):
+                    f.write(f"Input: {input_text}\n")
+                    f.write(f"Output: {result}\n")
+                    f.write("-" * 50 + "\n")
+            
+            logger.info(f"✅ Results saved to: {output_file}")
+        
+        else:
+            # Interactive mode
+            logger.info("Entering interactive mode. Type 'quit' to exit.")
+            while True:
+                try:
+                    user_input = input("\nEnter text to style (or 'quit'): ").strip()
+                    if user_input.lower() == 'quit':
+                        break
+                    
+                    if user_input:
+                        result = inferencer.style_transfer(user_input, args.instruction, args.streaming)
+                        if not args.streaming:
+                            print(f"\nStyled text: {result}")
+                
+                except KeyboardInterrupt:
+                    break
+                except Exception as e:
+                    logger.error(f"Error processing input: {e}")
+        
+        logger.info("🎉 Inference completed successfully!")
+        
+    except Exception as e:
+        logger.error(f"❌ Inference failed: {e}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
diff --git a/pipelines/styling/train.py b/pipelines/styling/train.py
new file mode 100644
index 0000000..2afaaf8
--- /dev/null
+++ b/pipelines/styling/train.py
@@ -0,0 +1,446 @@
+#!/usr/bin/env python3
+"""
+Styling Training Pipeline using Unsloth and SFTTrainer
+Supports style transfer tasks with LoRA fine-tuning
+"""
+
+import os
+import sys
+import json
+import logging
+import argparse
+from pathlib import Path
+from typing import Dict, Any, Optional
+import yaml
+
+# Add the project root to the path
+sys.path.append(str(Path(__file__).parent.parent.parent))
+
+from utils.config.config_manager import ConfigManager
+#from utils.logging.logging import setup_logging
+
+# Training imports
+import torch
+from datasets import load_from_disk, Dataset
+from unsloth import FastLanguageModel, is_bfloat16_supported
+from trl import SFTTrainer
+from transformers import TrainingArguments
+
+logger = logging.getLogger(__name__)
+
+class StylingTrainer:
+    """Styling task trainer using Unsloth and SFTTrainer"""
+    
+    def __init__(self, config: Dict[str, Any]):
+        self.config = config
+        self.model = None
+        self.tokenizer = None
+        self.trainer = None
+        
+        # Set device
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        logger.info(f"Using device: {self.device}")
+        
+        # Training parameters
+        self.max_seq_length = config.get('max_seq_length', 2048)
+        self.dtype = config.get('dtype', None)
+        self.load_in_4bit = config.get('load_in_4bit', True)
+        self.hf_token = config.get('hf_token', None)
+        
+        # LoRA parameters
+        self.lora_r = config.get('lora_r', 16)
+        self.lora_alpha = config.get('lora_alpha', 16)
+        self.lora_dropout = config.get('lora_dropout', 0)
+        self.target_modules = config.get('target_modules', [
+            "q_proj", "k_proj", "v_proj", "o_proj",
+            "gate_proj", "up_proj", "down_proj"
+        ])
+        
+        # Training arguments
+        self.batch_size = config.get('batch_size', 2)
+        self.gradient_accumulation_steps = config.get('gradient_accumulation_steps', 4)
+        self.learning_rate = config.get('learning_rate', 2e-4)
+        self.num_epochs = config.get('num_epochs', 1)
+        self.max_steps = config.get('max_steps', None)
+        self.warmup_steps = config.get('warmup_steps', 5)
+        self.weight_decay = config.get('weight_decay', 0.01)
+        self.seed = config.get('seed', 3407)
+        
+        # Output paths
+        self.output_dir = config.get('output_dir', './outputs')
+        self.model_output_dir = config.get('model_output_dir', './models/styling')
+        
+    def load_model_and_tokenizer(self):
+        """Load the pre-trained model and tokenizer"""
+        logger.info("Loading model and tokenizer...")
+        
+        try:
+            self.model, self.tokenizer = FastLanguageModel.from_pretrained(
+                model_name=self.config['model_name'],
+                max_seq_length=self.max_seq_length,
+                dtype=self.dtype,
+                load_in_4bit=self.load_in_4bit,
+                token=self.hf_token
+            )
+            
+            logger.info(f"✅ Model loaded: {self.config['model_name']}")
+            logger.info(f"✅ Tokenizer loaded with vocab size: {self.tokenizer.vocab_size}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error loading model: {e}")
+            raise
+    
+    def setup_lora(self):
+        """Setup LoRA for efficient fine-tuning"""
+        logger.info("Setting up LoRA configuration...")
+        
+        try:
+            self.model = FastLanguageModel.get_peft_model(
+                self.model,
+                r=self.lora_r,
+                target_modules=self.target_modules,
+                lora_alpha=self.lora_alpha,
+                lora_dropout=self.lora_dropout,
+                bias="none",
+                use_gradient_checkpointing="unsloth",
+                random_state=self.seed,
+                use_rslora=False,
+                loftq_config=None
+            )
+            
+            logger.info(f"✅ LoRA configured with r={self.lora_r}, alpha={self.lora_alpha}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error setting up LoRA: {e}")
+            raise
+    
+    def load_dataset(self, dataset_path: str) -> Dataset:
+        """Load the training dataset"""
+        logger.info(f"Loading dataset from: {dataset_path}")
+        
+        try:
+            if Path(dataset_path).exists():
+                # Check if it's a HuggingFace dataset directory
+                if (Path(dataset_path) / "dataset_info.json").exists():
+                    # Load from HuggingFace dataset directory
+                    dataset = load_from_disk(dataset_path)
+                    logger.info(f"Loaded HuggingFace dataset from disk: {len(dataset)} samples")
+                else:
+                    # Load from processed data files (JSONL format)
+                    logger.info("Loading from processed data files...")
+                    from datasets import Dataset
+                    import json
+                    
+                    all_data = []
+                    data_dir = Path(dataset_path)
+                    
+                    # Look for train.jsonl, validation.jsonl, test.jsonl
+                    for split_file in ["train.jsonl", "validation.jsonl", "test.jsonl"]:
+                        file_path = data_dir / split_file
+                        if file_path.exists():
+                            logger.info(f"Loading {split_file}...")
+                            with open(file_path, 'r', encoding='utf-8') as f:
+                                for line in f:
+                                    if line.strip():
+                                        data = json.loads(line)
+                                        all_data.append(data)
+                    
+                    if not all_data:
+                        raise ValueError(f"No data found in {dataset_path}")
+                    
+                    # Create HuggingFace dataset
+                    dataset = Dataset.from_list(all_data)
+                    logger.info(f"Created HuggingFace dataset from {len(all_data)} samples")
+            else:
+                # Try loading from HuggingFace Hub
+                logger.info(f"Attempting to load from HuggingFace Hub: {dataset_path}")
+                dataset = Dataset.load_dataset(dataset_path, split="train")
+                logger.info(f"Loaded from HuggingFace Hub: {len(dataset)} samples")
+            
+            logger.info(f"Dataset loaded: {len(dataset)} samples")
+            logger.info(f"Dataset features: {dataset.features}")
+            
+            # Verify required fields exist
+            required_fields = ["instruction", "input", "output"]
+            missing_fields = [field for field in required_fields if field not in dataset.features]
+            if missing_fields:
+                raise ValueError(f"Missing required fields in dataset: {missing_fields}")
+            
+            return dataset
+            
+        except Exception as e:
+            logger.error(f"Error loading dataset: {e}")
+            raise
+    
+    def setup_trainer(self, train_dataset: Dataset):
+        """Setup the SFTTrainer"""
+        logger.info("Setting up SFTTrainer...")
+        
+        try:
+            # First, map the dataset to create the text field with EOS token
+            def formatting_prompts_func(examples):
+                instructions = examples["instruction"]
+                inputs = examples["input"]
+                outputs = examples["output"]
+                texts = []
+                
+                for instruction, input_text, output in zip(instructions, inputs, outputs):
+                    # Must add EOS_TOKEN, otherwise your generation will go on forever!
+                    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that follows the instruction
+
+### Instruction:
+{}
+
+### Input:
+{}
+
+### Response:
+{}"""
+                    text = alpaca_prompt.format(instruction, input_text, output) + self.tokenizer.eos_token
+                    texts.append(text)
+                
+                return {"text": texts}
+            
+            # Apply the formatting function to create the text field
+            logger.info("Mapping dataset to create text field with EOS token...")
+            formatted_dataset = train_dataset.map(formatting_prompts_func, batched=True, remove_columns=train_dataset.column_names)
+            
+            logger.info(f"Dataset mapped successfully. New features: {formatted_dataset.features}")
+            logger.info(f"Sample text field: {formatted_dataset[0]['text'][:100]}...")
+            
+            # Training arguments
+            training_args = TrainingArguments(
+                per_device_train_batch_size=self.batch_size,
+                gradient_accumulation_steps=self.gradient_accumulation_steps,
+                warmup_steps=self.warmup_steps,
+                num_train_epochs=self.num_epochs,
+                max_steps=self.max_steps,
+                learning_rate=self.learning_rate,
+                fp16=not is_bfloat16_supported(),
+                bf16=is_bfloat16_supported(),
+                logging_steps=1,
+                optim="adamw_8bit",
+                weight_decay=self.weight_decay,
+                lr_scheduler_type="linear",
+                seed=self.seed,
+                output_dir=self.output_dir,
+                report_to="none",  # Disable wandb for now
+                save_strategy="epoch",
+                save_total_limit=2,
+                evaluation_strategy="no",  # No validation for now
+                load_best_model_at_end=False,
+                remove_unused_columns=False,
+                dataloader_pin_memory=False,
+            )
+            
+            # Create trainer with the formatted dataset
+            self.trainer = SFTTrainer(
+                model=self.model,
+                tokenizer=self.tokenizer,
+                train_dataset=formatted_dataset,  # Use the formatted dataset
+                dataset_text_field="text",  # The field we just created
+                max_seq_length=self.max_seq_length,
+                dataset_num_proc=2,
+                packing=False,  # Can make training 5x faster for short sequences
+                args=training_args
+            )
+            
+            logger.info("SFTTrainer configured successfully")
+            
+        except Exception as e:
+            logger.error(f"Error setting up trainer: {e}")
+            raise
+    
+    def train(self, dataset_path: str):
+        """Run the training process"""
+        logger.info("🚀 Starting training process...")
+        
+        try:
+            # Load model and tokenizer
+            self.load_model_and_tokenizer()
+            
+            # Setup LoRA
+            self.setup_lora()
+            
+            # Load dataset
+            train_dataset = self.load_dataset(dataset_path)
+            
+            # Setup trainer
+            self.setup_trainer(train_dataset)
+            
+            # Start training
+            logger.info("Starting training...")
+            trainer_stats = self.trainer.train()
+            
+            logger.info("✅ Training completed successfully!")
+            logger.info(f"Training stats: {trainer_stats}")
+            
+            # Save the model
+            self.save_model()
+            
+            return trainer_stats
+            
+        except Exception as e:
+            logger.error(f"❌ Training failed: {e}")
+            raise
+    
+    def save_model(self):
+        """Save the trained model"""
+        logger.info("Saving trained model...")
+        
+        try:
+            # Create output directory
+            Path(self.model_output_dir).mkdir(parents=True, exist_ok=True)
+            
+            # Save model and tokenizer
+            self.model.save_pretrained(self.model_output_dir)
+            self.tokenizer.save_pretrained(self.model_output_dir)
+            
+            # Save training config
+            config_path = Path(self.model_output_dir) / "training_config.json"
+            with open(config_path, 'w') as f:
+                json.dump(self.config, f, indent=2)
+            
+            logger.info(f"✅ Model saved to: {self.model_output_dir}")
+            
+        except Exception as e:
+            logger.error(f"❌ Error saving model: {e}")
+            raise
+    
+    def prepare_for_inference(self):
+        """Prepare model for inference"""
+        logger.info("Preparing model for inference...")
+        
+        try:
+            FastLanguageModel.for_inference(self.model)
+            logger.info("✅ Model prepared for inference")
+            
+        except Exception as e:
+            logger.error(f"❌ Error preparing for inference: {e}")
+            raise
+
+def load_training_config(config_path: str) -> Dict[str, Any]:
+    """Load training configuration from YAML file"""
+    try:
+        with open(config_path, 'r', encoding='utf-8') as f:
+            config = yaml.safe_load(f)
+        
+        # Extract training configuration
+        training_config = {}
+        
+        # Model configuration
+        if 'model' in config:
+            model_data = config['model']
+            training_config.update({
+                'model_name': model_data.get('training_model', 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit'),
+                'max_seq_length': model_data.get('training_max_seq_length', 2048),
+                'dtype': model_data.get('training_dtype'),
+                'load_in_4bit': model_data.get('training_load_in_4bit', True),
+                'hf_token': model_data.get('training_token')
+            })
+        
+        # Training configuration
+        if 'training' in config:
+            training_data = config['training']
+            training_config.update({
+                'num_epochs': training_data.get('num_epochs', 3),
+                'batch_size': training_data.get('batch_size', 2),
+                'learning_rate': training_data.get('learning_rate', 2e-4),
+                'weight_decay': training_data.get('weight_decay', 0.01),
+                'warmup_ratio': training_data.get('warmup_ratio', 0.1),
+                'lr_scheduler_type': training_data.get('lr_scheduler_type', 'linear')
+            })
+        
+        # Data configuration - use output_dir from data section
+        if 'data' in config:
+            data_config = config['data']
+            output_dir = data_config.get('output_dir', './data/processed/styling')
+            training_config.update({
+                'data_output_dir': output_dir,
+                'dataset_path': output_dir,  # Default dataset path is the output_dir
+                'style_instruction': data_config.get('instruction', 'Rewrite the following text in a formal style')
+            })
+        
+        # LoRA configuration
+        training_config.update({
+            'lora_r': 16,
+            'lora_alpha': 16,
+            'lora_dropout': 0,
+            'target_modules': [
+                "q_proj", "k_proj", "v_proj", "o_proj",
+                "gate_proj", "up_proj", "down_proj"
+            ],
+            'gradient_accumulation_steps': 4,
+            'max_steps': None,
+            'warmup_steps': 5,
+            'seed': 3407,
+            'output_dir': './outputs',
+            'model_output_dir': './models/styling'
+        })
+        
+        return training_config
+        
+    except Exception as e:
+        logger.error(f"Error loading training config: {e}")
+        raise
+
+def main():
+    """Main training function"""
+    parser = argparse.ArgumentParser(description="Styling Training Pipeline")
+    
+    # Configuration
+    parser.add_argument("--config", type=str, required=True, help="Path to YAML configuration file")
+    parser.add_argument("--dataset", type=str, help="Path to training dataset (HF dataset path or local path)")
+    parser.add_argument("--output-dir", type=str, help="Output directory for model")
+    parser.add_argument("--epochs", type=int, help="Number of training epochs")
+    parser.add_argument("--batch-size", type=int, help="Training batch size")
+    parser.add_argument("--learning-rate", type=float, help="Learning rate")
+    parser.add_argument("--max-steps", type=int, help="Maximum training steps")
+    
+    args = parser.parse_args()
+    
+    # Setup logging
+    # setup_logging()  # Commented out as per user's change
+    
+    try:
+        # Load configuration
+        logger.info(f"Loading configuration from: {args.config}")
+        training_config = load_training_config(args.config)
+        
+        # Override with CLI arguments
+        if args.output_dir:
+            training_config['model_output_dir'] = args.output_dir
+        if args.epochs:
+            training_config['num_epochs'] = args.epochs
+        if args.batch_size:
+            training_config['batch_size'] = args.batch_size
+        if args.learning_rate:
+            training_config['learning_rate'] = args.learning_rate
+        if args.max_steps:
+            training_config['max_steps'] = args.max_steps
+        
+        # Determine dataset path: CLI argument takes precedence, then YAML config
+        dataset_path = args.dataset or training_config.get('dataset_path')
+        if not dataset_path:
+            logger.error("No dataset path provided. Use --dataset or ensure output_dir is set in YAML config.")
+            sys.exit(1)
+        
+        logger.info("Training configuration:")
+        for key, value in training_config.items():
+            logger.info(f"  {key}: {value}")
+        logger.info(f"  Dataset path: {dataset_path}")
+        
+        # Initialize trainer
+        trainer = StylingTrainer(training_config)
+        
+        # Start training
+        trainer.train(dataset_path)
+        
+        logger.info("Training completed successfully!")
+        
+    except Exception as e:
+        logger.error(f"Training failed: {e}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/styling/__init__.py b/scripts/styling/__init__.py
new file mode 100644
index 0000000..5e53d8b
--- /dev/null
+++ b/scripts/styling/__init__.py
@@ -0,0 +1,45 @@
+"""
+Styling Scripts Package
+Provides command-line interfaces for styling data processing, training, and inference
+"""
+
+from .data_processor import (
+    run_with_yaml_config,
+    run_styling_examples,
+    create_sample_styling_data,
+    create_custom_styling_config,
+    show_styling_features
+)
+
+from .train import (
+    run_training_with_config,
+    create_training_example,
+    show_training_features
+)
+
+from .inference import (
+    run_inference_with_config,
+    create_inference_example,
+    run_batch_inference_example,
+    show_inference_features
+)
+
+__all__ = [
+    # Data processing
+    'run_with_yaml_config',
+    'run_styling_examples',
+    'create_sample_styling_data',
+    'create_custom_styling_config',
+    'show_styling_features',
+    
+    # Training
+    'run_training_with_config',
+    'create_training_example',
+    'show_training_features',
+    
+    # Inference
+    'run_inference_with_config',
+    'create_inference_example',
+    'run_batch_inference_example',
+    'show_inference_features'
+]
diff --git a/scripts/styling/data_processor.py b/scripts/styling/data_processor.py
new file mode 100644
index 0000000..fb18c63
--- /dev/null
+++ b/scripts/styling/data_processor.py
@@ -0,0 +1,302 @@
+#!/usr/bin/env python3
+"""
+Styling data processor script that uses YAML configurations.
+This provides a flexible and maintainable approach for style transfer tasks.
+"""
+
+import sys
+import os
+import subprocess
+import argparse
+from pathlib import Path
+
+def run_with_yaml_config(config_path: str, **cli_overrides):
+    """Run styling data processor with YAML configuration"""
+    print(f"=== Running Styling Data Processor with YAML config: {config_path} ===")
+    
+    cmd = [
+        "python", "pipelines/styling/data_processor.py",
+        "--config", config_path
+    ]
+    
+    # Add CLI overrides
+    for key, value in cli_overrides.items():
+        if value is not None:
+            cmd.extend([f"--{key.replace('_', '-')}", str(value)])
+    
+    print(f"Running command: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("✅ Styling data processing completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Error running styling data processor: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def run_styling_examples():
+    """Run styling examples with YAML configs"""
+    
+    # Example 1: Formal style transfer
+    print("=== Example 1: Formal Style Transfer ===")
+    success = run_with_yaml_config(
+        "configs/styling/formal.yaml",
+        max_samples=1000,  # Override YAML value
+        output_format="alpaca"
+    )
+    
+    if success:
+        print("✅ Formal style transfer completed!")
+    
+    # Example 2: Custom styling dataset (if available)
+    print("\n=== Example 2: Custom Styling Dataset ===")
+    if os.path.exists("data/raw/styling/custom_dataset.jsonl"):
+        success = run_with_yaml_config(
+            "configs/styling/formal.yaml",  # Use formal config as base
+            data_source="custom",
+            data_path="data/raw/styling/custom_dataset.jsonl",
+            instruction="Rewrite the following text in a casual, friendly style",
+            output_dir="./data/processed/styling/casual"
+        )
+        if success:
+            print("✅ Custom styling dataset processing completed!")
+    else:
+        print("⚠️  Custom styling dataset not found, skipping...")
+        print("   You can create one with the 'create-sample-data' option")
+
+def create_sample_styling_data():
+    """Create sample styling dataset for testing"""
+    sample_data = [
+        {
+            "text": "Hey, what's up? How are you doing today?",
+            "styled_text": "Hello, how are you doing today?"
+        },
+        {
+            "text": "This is really cool stuff!",
+            "styled_text": "This is quite impressive material."
+        },
+        {
+            "text": "I'm gonna go to the store later.",
+            "styled_text": "I will go to the store later."
+        },
+        {
+            "text": "What's the deal with this?",
+            "styled_text": "What is the situation regarding this matter?"
+        },
+        {
+            "text": "That's totally awesome!",
+            "styled_text": "That is quite remarkable!"
+        }
+    ]
+    
+    # Create directory structure
+    data_dir = Path("data/raw/styling")
+    data_dir.mkdir(parents=True, exist_ok=True)
+    
+    # Save sample data
+    import json
+    sample_file = data_dir / "sample_formal.jsonl"
+    with open(sample_file, 'w', encoding='utf-8') as f:
+        for item in sample_data:
+            f.write(json.dumps(item, ensure_ascii=False) + '\n')
+    
+    print(f"✅ Created sample styling dataset: {sample_file}")
+    print(f"   Contains {len(sample_data)} examples")
+    print(f"   Format: text → styled_text")
+    print(f"   Ready to use with configs/styling/formal.yaml")
+
+def create_custom_styling_config():
+    """Create a custom styling configuration file"""
+    custom_config = """task:
+  name: "styling"
+  type: "style_transfer"
+
+data:
+  source: "custom"
+  input_field: "text"
+  output_field: "styled_text"
+  instruction: "Rewrite the following text in a professional business style"
+  data_format: "jsonl"
+  max_length: 512
+  min_length: 10
+  clean_text: true
+  lowercase: false
+  train_split: 0.8
+  validation_split: 0.1
+  test_split: 0.1
+  output_format: "alpaca"
+  output_dir: "./data/processed/styling/professional"
+
+model:
+  name: "t5-base"
+  max_length: 512
+
+training:
+  num_epochs: 3
+  batch_size: 16
+  learning_rate: 3e-5
+  weight_decay: 0.01
+  warmup_ratio: 0.1
+  lr_scheduler_type: "linear"
+
+inference:
+  batch_size: 32
+  max_new_tokens: 128
+  temperature: 0.8
+"""
+    
+    config_path = "configs/styling/professional.yaml"
+    os.makedirs(os.path.dirname(config_path), exist_ok=True)
+    
+    with open(config_path, 'w') as f:
+        f.write(custom_config)
+    
+    print(f"✅ Created custom styling config: {config_path}")
+    print("   This config is set up for professional business style transfer")
+
+def handle_direct_args():
+    """Handle direct command-line arguments by passing them to the styling pipeline"""
+    parser = argparse.ArgumentParser(description="Styling Data Processor")
+    
+    # Add all the same arguments as the styling pipeline
+    parser.add_argument("--config", type=str, help="Path to YAML configuration file")
+    parser.add_argument("--data-source", choices=["huggingface", "custom"], help="Data source")
+    parser.add_argument("--dataset-name", type=str, help="HuggingFace dataset name")
+    parser.add_argument("--data-path", type=str, help="Path to custom data file")
+    parser.add_argument("--data-format", choices=["jsonl", "csv", "json"], help="Data format")
+    parser.add_argument("--input-field", type=str, help="Input field name")
+    parser.add_argument("--output-field", type=str, help="Output field name")
+    parser.add_argument("--instruction", type=str, help="Style instruction")
+    parser.add_argument("--max-samples", type=int, help="Maximum samples to process")
+    parser.add_argument("--train-split", type=float, help="Training split ratio")
+    parser.add_argument("--validation-split", type=float, help="Validation split ratio")
+    parser.add_argument("--test-split", type=float, help="Test split ratio")
+    parser.add_argument("--clean-text", action="store_true", help="Clean and normalize text")
+    parser.add_argument("--remove-special-chars", action="store_true", help="Remove special characters")
+    parser.add_argument("--lowercase", action="store_true", help="Convert text to lowercase")
+    parser.add_argument("--min-length", type=int, help="Minimum text length")
+    parser.add_argument("--max-length", type=int, help="Maximum text length")
+    parser.add_argument("--output-format", choices=["styling", "alpaca"], help="Output format")
+    parser.add_argument("--output-dir", type=str, help="Output directory")
+    
+    # HuggingFace dataset options
+    parser.add_argument("--create-hf-dataset", action="store_true", help="Create HuggingFace dataset")
+    parser.add_argument("--hf-dataset-path", type=str, help="Path to save HuggingFace dataset")
+    
+    # Logging
+    parser.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="INFO", help="Logging level")
+    
+    args = parser.parse_args()
+    
+    # Build command to call the styling pipeline
+    cmd = ["python", "pipelines/styling/data_processor.py"]
+    
+    # Add all arguments that were provided
+    for arg_name, arg_value in vars(args).items():
+        if arg_value is not None:
+            if isinstance(arg_value, bool):
+                if arg_value:  # Only add flag if True
+                    cmd.append(f"--{arg_name.replace('_', '-')}")
+            else:
+                cmd.extend([f"--{arg_name.replace('_', '-')}", str(arg_value)])
+    
+    print(f"Running: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("✅ Styling data processing completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Error running styling data processor: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def show_styling_features():
+    """Show the features of the styling data processor"""
+    print("=== Styling Data Processor Features ===")
+    print()
+    print("1. **Style Transfer Tasks**:")
+    print("   - Formal vs. Informal style")
+    print("   - Professional vs. Casual tone")
+    print("   - Academic vs. Conversational")
+    print("   - Any custom style instruction")
+    print()
+    print("2. **Data Formats Supported**:")
+    print("   - HuggingFace datasets")
+    print("   - Custom JSONL/CSV/JSON files")
+    print("   - Automatic train/validation/test splits")
+    print()
+    print("3. **Output Formats**:")
+    print("   - Raw styling format (input/output)")
+    print("   - Alpaca format (instruction/input/output)")
+    print("   - HuggingFace dataset format")
+    print()
+    print("4. **Advanced Features**:")
+    print("   - Configurable field mapping")
+    print("   - Text preprocessing options")
+    print("   - Automatic dataset saving/loading")
+    print("   - YAML configuration support")
+    print()
+    print("=== Usage Examples ===")
+    print()
+    print("1. Use YAML config only:")
+    print("   python scripts/styling/data_processor.py --config configs/styling/formal.yaml")
+    print()
+    print("2. Override YAML values:")
+    print("   python scripts/styling/data_processor.py --config configs/styling/formal.yaml --max-samples 500")
+    print()
+    print("3. Create sample data:")
+    print("   python scripts/styling/data_processor.py create-sample-data")
+    print()
+    print("4. Create custom config:")
+    print("   python scripts/styling/data_processor.py create-config")
+
+def main():
+    """Main function"""
+    if len(sys.argv) > 1:
+        # Check if it's a subcommand
+        if sys.argv[1] in ["examples", "create-sample-data", "create-config", "features"]:
+            # Handle subcommands
+            if sys.argv[1] == "examples":
+                run_styling_examples()
+            elif sys.argv[1] == "create-sample-data":
+                create_sample_styling_data()
+            elif sys.argv[1] == "create-config":
+                create_custom_styling_config()
+            elif sys.argv[1] == "features":
+                show_styling_features()
+        else:
+            # Handle direct arguments (pass through to pipeline)
+            handle_direct_args()
+    else:
+        print("Styling Data Processor")
+        print("=====================")
+        print()
+        print("This script runs the styling data processor for style transfer tasks.")
+        print("It supports both YAML configurations and command-line overrides.")
+        print()
+        print("Usage:")
+        print("  python scripts/styling/data_processor.py examples           # Run examples")
+        print("  python scripts/styling/data_processor.py create-sample-data # Create sample dataset")
+        print("  python scripts/styling/data_processor.py create-config      # Create custom config")
+        print("  python scripts/styling/data_processor.py features           # Show features")
+        print()
+        print("Direct pipeline usage:")
+        print("  python scripts/styling/data_processor.py --config configs/styling/formal.yaml")
+        print("  python scripts/styling/data_processor.py --data-source custom --data-path ./data.jsonl")
+        print()
+        print("Key Features:")
+        print("  ✅ Style transfer with custom instructions")
+        print("  ✅ Multiple data source support")
+        print("  ✅ YAML configuration files")
+        print("  ✅ CLI argument overrides")
+        print("  ✅ Automatic data splitting")
+        print("  ✅ HuggingFace dataset export")
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/styling/inference.py b/scripts/styling/inference.py
new file mode 100644
index 0000000..08beb8f
--- /dev/null
+++ b/scripts/styling/inference.py
@@ -0,0 +1,223 @@
+#!/usr/bin/env python3
+"""
+Styling Inference Script
+Provides a command-line interface to run the styling inference pipeline
+"""
+
+import sys
+import os
+import subprocess
+import argparse
+from pathlib import Path
+
+def run_inference_with_config(config_path: str, **cli_overrides):
+    """Run the styling inference pipeline with YAML configuration"""
+    print(f"🚀 Starting styling inference with config: {config_path}")
+    print()
+    
+    # Build command
+    cmd = ["python", "pipelines/styling/inference.py", "--config", config_path]
+    
+    # Add CLI overrides
+    for key, value in cli_overrides.items():
+        if value is not None:
+            if key == "model_path":
+                cmd.extend(["--model-path", str(value)])
+            elif key == "text":
+                cmd.extend(["--text", str(value)])
+            elif key == "input_file":
+                cmd.extend(["--input-file", str(value)])
+            elif key == "max_tokens":
+                cmd.extend(["--max-tokens", str(value)])
+            elif key == "temperature":
+                cmd.extend(["--temperature", str(value)])
+            elif key == "instruction":
+                cmd.extend(["--instruction", str(value)])
+            elif key == "output_file":
+                cmd.extend(["--output-file", str(value)])
+            elif key == "streaming":
+                cmd.append("--streaming")
+    
+    print(f"Running: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("✅ Inference completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"❌ Inference failed: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def show_inference_features():
+    """Show the features of the styling inference pipeline"""
+    print("=== Styling Inference Pipeline Features ===")
+    print()
+    print("1. **Model Support**:")
+    print("   - Trained LoRA models")
+    print("   - Base models from HuggingFace Hub")
+    print("   - Automatic model loading and preparation")
+    print()
+    print("2. **Inference Modes**:")
+    print("   - Single text inference")
+    print("   - Batch file processing")
+    print("   - Interactive mode")
+    print("   - Streaming generation")
+    print()
+    print("3. **Generation Control**:")
+    print("   - Configurable temperature and top-p")
+    print("   - Adjustable max tokens")
+    print("   - Custom style instructions")
+    print()
+    print("4. **Output Options**:")
+    print("   - Console output")
+    print("   - File output")
+    print("   - Streaming real-time generation")
+
+def create_inference_example():
+    """Create an inference example using the formal style configuration"""
+    print("=== Inference Example: Formal Style Transfer ===")
+    print()
+    
+    # Check if we have the required files
+    config_path = "configs/styling/formal.yaml"
+    
+    if not Path(config_path).exists():
+        print(f"❌ Configuration file not found: {config_path}")
+        print("   Please run the data processor first to create the configuration")
+        return False
+    
+    print("✅ Found configuration file!")
+    print(f"   Config: {config_path}")
+    print()
+    
+    # Example text
+    example_text = "Hey, what's up? I'm gonna go grab some food later."
+    
+    print(f"📝 Example text: {example_text}")
+    print()
+    
+    # Run inference
+    success = run_inference_with_config(
+        config_path=config_path,
+        text=example_text,
+        instruction="Rewrite the following text in a formal style"
+    )
+    
+    if success:
+        print("🎉 Inference example completed!")
+    
+    return success
+
+def create_test_file():
+    """Create a test file with sample texts for batch inference"""
+    test_file = "test_texts.txt"
+    
+    test_texts = [
+        "Hey, what's up? How are you doing today?",
+        "I'm gonna go to the store later to get some stuff.",
+        "This is pretty cool, right?",
+        "Can you help me out with this?",
+        "Thanks a lot for your help!"
+    ]
+    
+    with open(test_file, 'w', encoding='utf-8') as f:
+        for text in test_texts:
+            f.write(text + '\n')
+    
+    print(f"✅ Created test file: {test_file}")
+    print(f"   Contains {len(test_texts)} sample texts")
+    return test_file
+
+def run_batch_inference_example():
+    """Run a batch inference example"""
+    print("=== Batch Inference Example ===")
+    print()
+    
+    # Create test file
+    test_file = create_test_file()
+    
+    # Check configuration
+    config_path = "configs/styling/formal.yaml"
+    if not Path(config_path).exists():
+        print(f"❌ Configuration file not found: {config_path}")
+        return False
+    
+    print("✅ Running batch inference...")
+    print()
+    
+    # Run batch inference
+    success = run_inference_with_config(
+        config_path=config_path,
+        input_file=test_file,
+        output_file="styled_results.txt",
+        instruction="Rewrite the following text in a formal style"
+    )
+    
+    if success:
+        print("🎉 Batch inference completed!")
+        print("   Results saved to: styled_results.txt")
+    
+    return success
+
+def main():
+    """Main function"""
+    parser = argparse.ArgumentParser(description="Styling Inference Script")
+    
+    # Subcommands
+    parser.add_argument("command", choices=["infer", "example", "batch", "features"], 
+                       help="Command to run")
+    
+    # Inference arguments
+    parser.add_argument("--config", type=str, help="Path to YAML configuration file")
+    parser.add_argument("--model-path", type=str, help="Path to trained model")
+    parser.add_argument("--text", type=str, help="Single text to style transfer")
+    parser.add_argument("--input-file", type=str, help="File containing texts to process")
+    parser.add_argument("--max-tokens", type=int, help="Maximum new tokens to generate")
+    parser.add_argument("--temperature", type=float, help="Sampling temperature")
+    parser.add_argument("--instruction", type=str, help="Custom style instruction")
+    parser.add_argument("--output-file", type=str, help="Output file for results")
+    parser.add_argument("--streaming", action="store_true", help="Enable streaming generation")
+    
+    args = parser.parse_args()
+    
+    if args.command == "features":
+        show_inference_features()
+    
+    elif args.command == "example":
+        create_inference_example()
+    
+    elif args.command == "batch":
+        run_batch_inference_example()
+    
+    elif args.command == "infer":
+        if not args.config:
+            print("❌ --config is required for inference")
+            print("Usage: python scripts/styling/inference.py infer --config config.yaml [options]")
+            sys.exit(1)
+        
+        # Check if we have input
+        if not args.text and not args.input_file:
+            print("❌ Either --text or --input-file is required")
+            print("Usage: python scripts/styling/inference.py infer --config config.yaml --text 'your text'")
+            sys.exit(1)
+        
+        success = run_inference_with_config(
+            config_path=args.config,
+            model_path=args.model_path,
+            text=args.text,
+            input_file=args.input_file,
+            max_tokens=args.max_tokens,
+            temperature=args.temperature,
+            instruction=args.instruction,
+            output_file=args.output_file,
+            streaming=args.streaming
+        )
+        
+        if not success:
+            sys.exit(1)
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/styling/train.py b/scripts/styling/train.py
new file mode 100644
index 0000000..7742320
--- /dev/null
+++ b/scripts/styling/train.py
@@ -0,0 +1,168 @@
+#!/usr/bin/env python3
+"""
+Styling Training Script
+Provides a command-line interface to run the styling training pipeline
+"""
+
+import sys
+import os
+import subprocess
+import argparse
+from pathlib import Path
+
+def run_training_with_config(config_path: str, dataset_path: str = None, **cli_overrides):
+    """Run the styling training pipeline with YAML configuration"""
+    print(f"Starting styling training with config: {config_path}")
+    if dataset_path:
+        print(f"Training dataset: {dataset_path}")
+    else:
+        print("Training dataset: Will use output_dir from YAML config")
+    print()
+    
+    # Build command
+    cmd = ["python", "pipelines/styling/train.py", "--config", config_path]
+    
+    # Add dataset path if provided
+    if dataset_path:
+        cmd.extend(["--dataset", dataset_path])
+    
+    # Add CLI overrides
+    for key, value in cli_overrides.items():
+        if value is not None:
+            if key == "output_dir":
+                cmd.extend(["--output-dir", str(value)])
+            elif key == "epochs":
+                cmd.extend(["--epochs", str(value)])
+            elif key == "batch_size":
+                cmd.extend(["--batch-size", str(value)])
+            elif key == "learning_rate":
+                cmd.extend(["--learning-rate", str(value)])
+            elif key == "max_steps":
+                cmd.extend(["--max-steps", str(value)])
+    
+    print(f"Running: {' '.join(cmd)}")
+    print()
+    
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+        print("Training completed successfully!")
+        print(result.stdout)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"Training failed: {e}")
+        print(f"Error output: {e.stderr}")
+        return False
+
+def show_training_features():
+    """Show the features of the styling training pipeline"""
+    print("=== Styling Training Pipeline Features ===")
+    print()
+    print("1. **Model Support**:")
+    print("   - Unsloth optimized models (4x faster)")
+    print("   - LoRA fine-tuning for efficiency")
+    print("   - Support for Llama-3.1, Mistral, Phi-3, Gemma")
+    print()
+    print("2. **Training Features**:")
+    print("   - SFTTrainer with instruction tuning")
+    print("   - Automatic mixed precision (FP16/BF16)")
+    print("   - Gradient checkpointing for memory efficiency")
+    print("   - Configurable LoRA parameters")
+    print()
+    print("3. **Configuration**:")
+    print("   - YAML configuration files")
+    print("   - CLI argument overrides")
+    print("   - Automatic device detection")
+    print()
+    print("4. **Output**:")
+    print("   - Saved LoRA models")
+    print("   - Training logs and checkpoints")
+    print("   - Ready for inference")
+
+def create_training_example():
+    """Create a training example using the formal style configuration"""
+    print("=== Training Example: Formal Style Transfer ===")
+    print()
+    
+    # Check if we have the required files
+    config_path = "configs/styling/formal.yaml"
+    
+    if not Path(config_path).exists():
+        print(f"Configuration file not found: {config_path}")
+        print("   Please run the data processor first to create the configuration")
+        return False
+    
+    print("Found required files!")
+    print(f"   Config: {config_path}")
+    print("   Dataset: Will use output_dir from YAML config")
+    print("   The training pipeline will automatically:")
+    print("   - Load data from the output_dir specified in YAML")
+    print("   - Convert JSONL files to HuggingFace dataset format")
+    print("   - Apply formatting with EOS tokens")
+    print("   - Train the model using SFTTrainer")
+    print()
+    
+    # Run training without explicit dataset path - will use YAML config
+    success = run_training_with_config(
+        config_path=config_path,
+        dataset_path=None,  # Use output_dir from YAML config
+        epochs=1,
+        batch_size=2,
+        learning_rate=2e-4
+    )
+    
+    if success:
+        print("Training example completed!")
+        print("   Model saved to: ./models/styling")
+        print("   Ready for inference!")
+    
+    return success
+
+def main():
+    """Main function"""
+    parser = argparse.ArgumentParser(description="Styling Training Script")
+    
+    # Subcommands
+    parser.add_argument("command", choices=["train", "example", "features"], 
+                       help="Command to run")
+    
+    # Training arguments
+    parser.add_argument("--config", type=str, help="Path to YAML configuration file")
+    parser.add_argument("--dataset", type=str, help="Path to training dataset")
+    parser.add_argument("--output-dir", type=str, help="Output directory for model")
+    parser.add_argument("--epochs", type=int, help="Number of training epochs")
+    parser.add_argument("--batch-size", type=int, help="Training batch size")
+    parser.add_argument("--learning-rate", type=float, help="Learning rate")
+    parser.add_argument("--max-steps", type=int, help="Maximum training steps")
+    
+    args = parser.parse_args()
+    
+    if args.command == "features":
+        show_training_features()
+    
+    elif args.command == "example":
+        create_training_example()
+    
+    elif args.command == "train":
+        if not args.config:
+            print("❌ --config is required for training")
+            print("Usage: python scripts/styling/train.py train --config config.yaml")
+            sys.exit(1)
+        
+        # If dataset is not provided, try to use output_dir from config
+        dataset_path = args.dataset if args.dataset else None
+        
+        success = run_training_with_config(
+            config_path=args.config,
+            dataset_path=dataset_path,
+            output_dir=args.output_dir,
+            epochs=args.epochs,
+            batch_size=args.batch_size,
+            learning_rate=args.learning_rate,
+            max_steps=args.max_steps
+        )
+        
+        if not success:
+            sys.exit(1)
+
+if __name__ == "__main__":
+    main()
diff --git a/test.py b/test.py
new file mode 100644
index 0000000..4743b82
--- /dev/null
+++ b/test.py
@@ -0,0 +1,251 @@
+#!/usr/bin/env python3
+"""
+Test script for the styling data processor
+"""
+
+import sys
+import os
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+from pipelines.styling.data_processor import StylingDataPipeline, create_custom_config, create_huggingface_config
+
+def test_styling_pipeline():
+    """Test the styling data processor with custom data"""
+    
+    print("Testing Styling Data Processor")
+    print("=" * 50)
+    
+    # Initialize the pipeline
+    pipeline = StylingDataPipeline()
+    
+    # Example 1: Load configuration from YAML
+    print("\n1. Loading configuration from YAML...")
+    try:
+        yaml_config = pipeline.load_config_from_yaml("./configs/styling/formal.yaml")
+        print(f"   ✅ YAML config loaded successfully!")
+        print(f"   Output directory: {yaml_config.output_dir}")
+        print(f"   Instruction: {yaml_config.instruction}")
+        print(f"   Input field: {yaml_config.input_field}")
+        print(f"   Output field: {yaml_config.output_field}")
+    except Exception as e:
+        print(f"   ❌ Error loading YAML config: {e}")
+        yaml_config = None
+    
+    # Example 2: Create custom dataset configuration
+    print("\n2. Creating custom dataset configuration...")
+    custom_config = create_custom_config(
+        data_path="./data/raw/styling/formal_dataset.jsonl",
+        data_format="jsonl",
+        input_field="text",
+        output_field="styled_text",
+        instruction="Rewrite the following text in a formal style",
+        max_samples=1000,
+        min_length=10,
+        max_length=256,
+        clean_text=True,
+        lowercase=False,
+        output_format="alpaca"
+    )
+    
+    print(f"   Input field: {custom_config.input_field} (maps to 'input')")
+    print(f"   Output field: {custom_config.output_field} (maps to 'output')")
+    print(f"   Instruction: {custom_config.instruction}")
+    print(f"   Max samples: {custom_config.max_samples}")
+    
+    # Example 3: Test with sample data (if available)
+    print("\n3. Testing pipeline with sample data...")
+    
+    # Create a sample dataset for testing
+    sample_data = [
+        {
+            "input": "Hey, what's up? How are you doing today?",
+            "output": "Hello, how are you doing today?"
+        },
+        {
+            "input": "This is really cool stuff!",
+            "output": "This is quite impressive material."
+        },
+        {
+            "input": "I'm gonna go to the store later.",
+            "output": "I will go to the store later."
+        }
+    ]
+    
+    # Save sample data to test file
+    import json
+    test_file = "./data/raw/styling/test_formal.jsonl"
+    os.makedirs(os.path.dirname(test_file), exist_ok=True)
+    
+    with open(test_file, 'w', encoding='utf-8') as f:
+        for item in sample_data:
+            f.write(json.dumps(item, ensure_ascii=False) + '\n')
+    
+    print(f"   Created test file: {test_file}")
+    
+    # Test the pipeline with the sample data
+    try:
+        test_config = create_custom_config(
+            data_path=test_file,
+            data_format="jsonl",
+            input_field="input",
+            output_field="output",
+            instruction="Rewrite the following text in a formal style",
+            max_samples=10,
+            output_format="alpaca"
+        )
+        
+        print("   Running pipeline...")
+        result = pipeline.run_pipeline(test_config, output_format="alpaca", save_splits=True, create_hf_dataset=True, save_hf_dataset=True)
+        
+        print("   ✅ Pipeline completed successfully!")
+        print(f"   Total samples: {result['analysis']['overall']['total_samples']}")
+        print(f"   Split sizes: {result['analysis']['overall']['split_sizes']}")
+        print(f"   Output directory: {result['output_dir']}")
+        
+        # Show HuggingFace dataset info if created
+        if 'hf_dataset' in result:
+            hf_dataset = result['hf_dataset']
+            print(f"   HuggingFace dataset created with {len(hf_dataset)} entries")
+            print(f"   Dataset features: {hf_dataset.features}")
+            
+            # Show save path if saved to disk
+            if 'hf_dataset_path' in result:
+                print(f"   Dataset saved to: {result['hf_dataset_path']}")
+            
+            # Show formatted example
+            if len(hf_dataset) > 0:
+                print(f"   Example formatted text:")
+                print(f"   {hf_dataset[0]['text'][:200]}...")
+        
+        # Show sample processed data
+        print("\n   Sample processed data:")
+        for split_name, split_data in result['data'].items():
+            if split_data:
+                print(f"   {split_name} split:")
+                for i, item in enumerate(split_data[:2]):  # Show first 2 items
+                    print(f"     Item {i+1}:")
+                    print(f"       Instruction: {item['instruction']}")
+                    print(f"       Input: {item['input'][:50]}...")
+                    print(f"       Output: {item['output'][:50]}...")
+                break
+        
+    except Exception as e:
+        print(f"   ❌ Error running pipeline: {e}")
+    
+    print("\n" + "=" * 50)
+    print("Test completed!")
+
+def test_hf_dataset_save_load():
+    """Test HuggingFace dataset save and load functionality"""
+    
+    print("\nTesting HuggingFace Dataset Save/Load")
+    print("=" * 50)
+    
+    from pipelines.styling.data_processor import save_hf_dataset_to_disk, load_hf_dataset_from_disk
+    
+    # Create a sample dataset for testing
+    sample_data = [
+        {
+            "instruction": "Rewrite in formal style",
+            "input": "Hey, what's up?",
+            "output": "Hello, how are you?"
+        },
+        {
+            "instruction": "Rewrite in formal style", 
+            "input": "This is really cool!",
+            "output": "This is quite impressive."
+        }
+    ]
+    
+    # Test configuration
+    config = create_custom_config(
+        data_path="dummy",
+        instruction="Rewrite in formal style"
+    )
+    
+    # Convert to HuggingFace dataset
+    pipeline = StylingDataPipeline()
+    hf_dataset = pipeline.convert_to_hf_dataset(sample_data, config)
+    
+    print(f"Created HuggingFace dataset with {len(hf_dataset)} entries")
+    
+    # Test saving to disk
+    save_path = "./data/processed/styling/test_hf_dataset"
+    print(f"\nSaving dataset to: {save_path}")
+    
+    success = save_hf_dataset_to_disk(hf_dataset, save_path)
+    if success:
+        print("✅ Dataset saved successfully!")
+        
+        # Test loading from disk
+        print(f"\nLoading dataset from: {save_path}")
+        loaded_dataset = load_hf_dataset_from_disk(save_path)
+        
+        if loaded_dataset is not None:
+            print("✅ Dataset loaded successfully!")
+            print(f"Loaded dataset has {len(loaded_dataset)} entries")
+            print(f"Features: {loaded_dataset.features}")
+            
+            # Show sample data
+            print("\nSample loaded data:")
+            for i in range(len(loaded_dataset)):
+                print(f"  Entry {i+1}: {loaded_dataset[i]['text'][:100]}...")
+        else:
+            print("❌ Failed to load dataset")
+    else:
+        print("❌ Failed to save dataset")
+    
+    return hf_dataset
+
+def test_hf_dataset_conversion():
+    """Test the HuggingFace dataset conversion"""
+    
+    print("\nTesting HuggingFace Dataset Conversion")
+    print("=" * 50)
+    
+    pipeline = StylingDataPipeline()
+    
+    # Sample data with instruction field
+    sample_data = [
+        {
+            "instruction": "Rewrite in formal style",
+            "input": "Hey, what's up?",
+            "output": "Hello, how are you?"
+        },
+        {
+            "instruction": "Rewrite in formal style", 
+            "input": "This is really cool!",
+            "output": "This is quite impressive."
+        }
+    ]
+    
+    # Test configuration
+    config = create_custom_config(
+        data_path="dummy",
+        instruction="Rewrite in formal style"
+    )
+    
+    # Convert to HuggingFace dataset
+    hf_dataset = pipeline.convert_to_hf_dataset(sample_data, config)
+    
+    print(f"HuggingFace dataset created with {len(hf_dataset)} entries")
+    print(f"Dataset features: {hf_dataset.features}")
+    
+    # Show formatted examples
+    print("\nFormatted examples:")
+    for i in range(len(hf_dataset)):
+        print(f"  Example {i+1}:")
+        print(f"    {hf_dataset[i]['text'][:150]}...")
+        print()
+    
+    # Test the dataset can be used for training
+    print("Dataset ready for training!")
+    print(f"Number of training examples: {len(hf_dataset)}")
+    
+    return hf_dataset
+
+
+if __name__ == "__main__":
+    test_styling_pipeline()
+    # test_hf_dataset_save_load()
+    # test_hf_dataset_conversion()
diff --git a/test.readme b/test.readme
new file mode 100644
index 0000000..e69de29