added style mimicking pipelines

2025-08-13 21:17:01 +01:00
parent fd54d4be39
commit 710d074b47
31 changed files with 3816 additions and 46 deletions
@@ -0,0 +1,207 @@
+# Configuration Files Documentation
+
+This directory contains YAML configuration files for different machine learning tasks. Each configuration file is organized into logical sections and includes comprehensive documentation for all parameters.
+
+## Configuration Structure
+
+All configuration files follow a consistent structure organized into these main sections:
+
+### 1. Task Configuration
+```yaml
+task:
+  name: "task_type"                        # Task type: classification, completion, styling, matching
+  type: "specific_type"                    # Specific model/task type
+```
+
+**Available Task Types:**
+- **classification**: Text classification tasks (emotion, sentiment, topic, etc.)
+- **completion**: Text generation and completion tasks
+- **styling**: Style transfer and text transformation tasks
+- **matching**: Semantic matching and similarity tasks
+
+### 2. Data Processing Configuration
+```yaml
+data:
+  # Data Source
+  source: "huggingface|custom"             # Where to get data from
+  
+  # Data Location
+  dataset_name: "dataset/name"             # HuggingFace dataset name (for huggingface source)
+  data_path: "./path/to/file"              # Path to custom data file (for custom source)
+  data_format: "jsonl|csv|json"            # File format for custom data
+  
+  # Field Mapping
+  input_field: "text"                      # Field containing input text
+  output_field: "styled_text"              # Field containing output (for styling)
+  label_field: "label"                     # Field containing labels (for classification)
+  id_field: "id"                           # Optional ID field for tracking
+  
+  # Processing Parameters
+  max_samples: 1000                        # Maximum samples to process
+  train_split: 0.8                         # Training split ratio
+  validation_split: 0.1                    # Validation split ratio
+  test_split: 0.1                          # Test split ratio
+  
+  # Text Preprocessing
+  clean_text: true                         # Clean and normalize text
+  remove_special_chars: false              # Remove special characters
+  lowercase: true                          # Convert to lowercase
+  min_length: 10                           # Minimum text length
+  max_length: 1000                         # Maximum text length
+  
+  # Output Configuration
+  output_format: "format_type"             # Output format
+  output_dir: "./output/path"              # Output directory
+```
+
+**Data Source Types:**
+- **huggingface**: Use datasets from HuggingFace Hub
+- **custom**: Use local files (JSONL, CSV, JSON)
+
+**Output Formats:**
+- **classification**: Raw classification format
+- **instruction**: Instruction-following format
+- **conversation**: Conversational format
+- **qa**: Question-answer format
+- **styling**: Raw styling format
+- **alpaca**: Alpaca instruction format
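As a rough illustration of how the text preprocessing options above might be applied to one record, here is a minimal Python sketch. The helper name `preprocess_text` is hypothetical. The exact cleaning rules (collapsing whitespace, stripping punctuation, measuring length in characters, and dropping rather than truncating over-long text) are assumptions, not the pipeline's confirmed behavior.

```python
import re
from typing import Optional


def preprocess_text(text: str, cfg: dict) -> Optional[str]:
    """Apply the preprocessing flags from the `data` section to one text.

    Hypothetical helper: returns the processed text, or None when the text
    falls outside the configured length bounds.
    """
    if cfg.get("clean_text", True):
        # Assumed meaning of "clean": collapse whitespace runs and trim.
        text = re.sub(r"\s+", " ", text).strip()
    if cfg.get("remove_special_chars", False):
        # Assumed meaning: keep only word characters and whitespace.
        text = re.sub(r"[^\w\s]", "", text)
    if cfg.get("lowercase", True):
        text = text.lower()
    # Assumption: lengths are measured in characters, not tokens.
    if len(text) < cfg.get("min_length", 0):
        return None
    if len(text) > cfg.get("max_length", float("inf")):
        return None
    return text
```

With the default values shown earlier, a five-character input such as `"short"` would be dropped because it is below `min_length: 10`.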
+
+### 3. Model Configuration
+```yaml
+model:
+  name: "model_name"                       # Model from HuggingFace Hub
+  max_length: 512                          # Maximum sequence length
+  num_labels: 6                            # Number of labels (for classification)
+```
+
+**Recommended Models by Task:**
+- **Classification**: `bert-base-uncased`, `distilbert-base-uncased`
+- **Styling**: `t5-base`, `gpt2-medium`
+- **Completion**: `gpt2-medium`, `gpt2-large`
+- **Matching**: `sentence-transformers/all-MiniLM-L6-v2`
+
+### 4. Training Configuration
+```yaml
+training:
+  num_epochs: 3                            # Number of training epochs
+  batch_size: 16                           # Training batch size
+  learning_rate: 2e-5                      # Learning rate
+  weight_decay: 0.01                       # Weight decay
+  lr_scheduler_type: "linear"              # Learning rate scheduler
+  warmup_ratio: 0.1                        # Warmup ratio
+  data_dir: "./data/path"                  # Training data directory
+  output_dir: "./model/output"             # Model output directory
+```
+
+**Learning Rate Guidelines:**
+- **Fine-tuning**: 1e-5 to 5e-5
+- **Training from scratch**: 1e-4 to 1e-3
+
+**Scheduler Types:**
+- **linear**: Linear decay
+- **cosine**: Cosine annealing
+- **polynomial**: Polynomial decay
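To make the interaction between `learning_rate`, `warmup_ratio`, and the `linear` scheduler concrete, the sketch below computes the learning rate at a given step. It assumes the common convention of a linear ramp from 0 during warmup followed by a linear decay to 0. The function name `linear_lr` is hypothetical, and the scheduler actually used by the training scripts may behave differently.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5,
              warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions, with 1000 total steps and `warmup_ratio: 0.1`, the rate peaks at `learning_rate` on step 100 and reaches zero on the final step.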
+
+### 5. Inference Configuration
+```yaml
+inference:
+  model_path: "./model/path"               # Path to saved model
+  device: "auto"                           # Device to use
+  batch_size: 32                           # Inference batch size
+  return_probabilities: true               # Return probabilities
+  return_top_k: 3                          # Return top K predictions
+  max_new_tokens: 128                      # Max tokens to generate
+  temperature: 0.8                         # Sampling temperature
+```
+
+**Device Options:**
+- **auto**: Automatically detect best device
+- **cuda**: Use GPU if available
+- **cpu**: Force CPU usage
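One plausible way the `auto` option could be resolved is sketched below. It assumes PyTorch is the backend, which this document does not state, and it falls back to CPU when PyTorch is not installed. The function name `resolve_device` is hypothetical.

```python
def resolve_device(device: str = "auto") -> str:
    """Map the `device` setting to a concrete device name."""
    if device != "auto":
        # Explicit choices ("cuda" or "cpu") are passed through unchanged.
        return device
    try:
        import torch  # assumed backend; optional in this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```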
+
+**Temperature Guidelines:**
+- **0.0**: Deterministic (always the same output for the same input)
+- **0.7-0.9**: Balanced creativity
+- **1.0+**: More random/creative
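The effect of `temperature` can be seen in the standard temperature-scaled softmax, sketched below in plain Python. Treating `0.0` as greedy selection is an assumption about how the inference scripts handle that value, and `softmax_with_temperature` is a hypothetical name used only for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0.0:
        # Assumption: temperature 0.0 means greedy decoding, so all
        # probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make sampling more varied.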
+
+## Task-Specific Parameters
+
+### Classification Tasks
+```yaml
+data:
+  label_encoding: "auto|numeric|string"    # How to encode labels
+  multilabel: false                        # Multi-label vs single-label
+  label_separator: ","                     # Separator for multi-label
+```
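As an illustration of how `multilabel` and `label_separator` might work together, the sketch below splits a raw label string into a list of labels. The helper `parse_labels` is hypothetical, and trimming whitespace and skipping empty pieces are assumptions.

```python
from typing import List, Union


def parse_labels(raw: str, multilabel: bool = False,
                 label_separator: str = ",") -> Union[str, List[str]]:
    """Split a raw label string when `multilabel` is enabled."""
    if not multilabel:
        return raw.strip()
    # Drop empty pieces so a doubled or trailing separator is tolerated.
    return [part.strip() for part in raw.split(label_separator) if part.strip()]
```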
+
+### Styling Tasks
+```yaml
+data:
+  instruction: "Style instruction text"    # The style instruction
+```
+
+### Completion Tasks
+```yaml
+data:
+  prompt_template: "template"              # Prompt template
+  completion_length: 100                   # Target completion length
+```
+
+## Advanced Configuration
+
+### HuggingFace Specific
+```yaml
+data:
+  hf_split: "train"                        # Dataset split to use
+  hf_cache_dir: "./cache"                  # Cache directory
+  test_split_from: "train"                 # Source for test split
+  val_split_from: "train"                  # Source for validation split
+```
+
+### Custom Data Specific
+```yaml
+data:
+  encoding: "utf-8"                        # File encoding
+  delimiter: ","                           # CSV delimiter
+```
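The following sketch shows one way a loader could honor `data_format`, `encoding`, and `delimiter` for custom data, using only the Python standard library. The function name `load_custom_data` is hypothetical. It also assumes that CSV files have a header row and that JSON files hold a top-level list of records.

```python
import csv
import json
from typing import Dict, List


def load_custom_data(data_path: str, data_format: str,
                     encoding: str = "utf-8", delimiter: str = ",") -> List[Dict]:
    """Read records from a local JSONL, CSV, or JSON file."""
    with open(data_path, encoding=encoding, newline="") as handle:
        if data_format == "jsonl":
            # One JSON object per non-empty line.
            return [json.loads(line) for line in handle if line.strip()]
        if data_format == "csv":
            # The first row is assumed to hold the column names.
            return list(csv.DictReader(handle, delimiter=delimiter))
        if data_format == "json":
            # Assumption: the file holds a top-level list of records.
            return json.load(handle)
    raise ValueError(f"Unsupported data_format: {data_format!r}")
```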
+
+## Usage Examples
+
+### Basic Usage
+```bash
+# Use YAML configuration
+python scripts/task_type/data_processor.py --config configs/task_type/config.yaml
+
+# Override specific parameters
+python scripts/task_type/data_processor.py \
+  --config configs/task_type/config.yaml \
+  --max-samples 1000 \
+  --learning-rate 3e-5
+```
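A script could layer command-line overrides on top of the loaded YAML roughly as sketched below. The mapping of `--max-samples` to `data.max_samples` and of `--learning-rate` to `training.learning_rate` is an assumption, and `apply_overrides` is a hypothetical helper. The sketch takes an already parsed config dictionary so that it depends only on the standard library.

```python
import argparse
import copy
from typing import List, Optional


def apply_overrides(config: dict, argv: Optional[List[str]] = None) -> dict:
    """Return a copy of `config` with command-line overrides applied."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config")
    parser.add_argument("--max-samples", type=int)
    parser.add_argument("--learning-rate", type=float)
    args = parser.parse_args(argv)

    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    # Assumed mapping from flags to config sections.
    if args.max_samples is not None:
        merged.setdefault("data", {})["max_samples"] = args.max_samples
    if args.learning_rate is not None:
        merged.setdefault("training", {})["learning_rate"] = args.learning_rate
    return merged
```

Flags that are not passed leave the values from the YAML file unchanged.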
+
+### Creating Custom Configurations
+1. Copy an existing config file
+2. Modify parameters for your specific use case
+3. Update paths and model names
+4. Test with a small dataset first
+
+## Best Practices
+
+1. **Start with Defaults**: Use default values and adjust based on results
+2. **Validate Paths**: Ensure all file paths are correct and accessible
+3. **Monitor Resources**: Adjust batch sizes based on available GPU memory
+4. **Test Incrementally**: Test with small datasets before full processing
+5. **Version Control**: Keep configurations in version control for reproducibility
+
+## Troubleshooting
+
+### Common Issues:
+- **File Not Found**: Check `data_path` and `output_dir` paths
+- **Memory Errors**: Reduce `batch_size` or `max_length`
+- **Poor Performance**: Adjust `learning_rate` or `num_epochs`
+- **Split Errors**: Ensure `train_split`, `validation_split`, and `test_split` sum to ≤ 1.0
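The split check from the list above can be run before processing starts. The sketch below is one way to do it. The function name `validate_splits` and the small floating-point tolerance are assumptions, not part of the existing scripts.

```python
def validate_splits(train_split: float, validation_split: float,
                    test_split: float, tol: float = 1e-9) -> None:
    """Raise ValueError if a ratio is out of range or the sum exceeds 1.0."""
    ratios = {
        "train_split": train_split,
        "validation_split": validation_split,
        "test_split": test_split,
    }
    for name, value in ratios.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    total = sum(ratios.values())
    # The tolerance absorbs rounding error, e.g. 0.8 + 0.1 + 0.1 in floats.
    if total > 1.0 + tol:
        raise ValueError(f"Split ratios sum to {total}, which exceeds 1.0")
```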
+
+### Getting Help:
+- Check the script help: `python script.py --help`
+- Review the pipeline logs for detailed error messages
+- Verify YAML syntax and parameter values