Adriana James Marketing Assistant AI

This project fine-tunes a language model to generate marketing content in the voice and style of Adriana James, based on her book content, past campaigns, and style guidelines.

Project Structure

  • generate_dataset.py: Script to generate fine-tuning datasets from book content, past campaigns, and style guidelines
  • finetune_model.py: Script to fine-tune the model using the generated datasets
  • data/: Directory containing source data
    • book.pdf: Adriana James' book content
    • past_campaigns/: Directory containing past marketing campaigns
    • style_guidelines/: Directory containing brand style guidelines
  • datasets/: Directory containing generated fine-tuning datasets
  • adriana_model/: Directory containing the fine-tuned model

Setup

  1. Install the required dependencies:

    pip install -r requirements.txt
    
  2. Generate the fine-tuning datasets:

    python generate_dataset.py
    

    This will create the following datasets in the datasets/ directory:

    • stage1_book_content.json: Dataset for fine-tuning on book content
    • stage2_marketing_content.json: Dataset for fine-tuning on marketing content
    • stage3_style_alignment.json: Dataset for fine-tuning on style alignment
    • combined_dataset.json: Combined dataset for all stages
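As a quick sanity check after generation, the sketch below loads each of the files listed above and reports its top-level JSON type and number of entries. It assumes only that the files are valid JSON. The record schema is not documented here, so the helpers (`summarize_dataset` and `summarize_all`, both hypothetical names) do not inspect individual fields.

```python
import json
from pathlib import Path

# File names taken from the list above; the directory is assumed to be datasets/.
DATASET_FILES = [
    "stage1_book_content.json",
    "stage2_marketing_content.json",
    "stage3_style_alignment.json",
    "combined_dataset.json",
]


def summarize_dataset(path):
    """Return (top-level JSON type name, number of entries) for one dataset file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # len() covers both a list of examples and a dict keyed by name or split.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


def summarize_all(dataset_dir="datasets"):
    """Summarize every expected dataset file; missing files map to None."""
    summary = {}
    for name in DATASET_FILES:
        path = Path(dataset_dir) / name
        summary[name] = summarize_dataset(path) if path.exists() else None
    return summary


for name, info in summarize_all().items():
    print(f"{name}: {'missing' if info is None else info}")
```

A file reported as missing, or one with zero entries, usually means the corresponding source data under data/ was not found during generation.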

Fine-tuning the Model

The fine-tuning process follows a progressive approach with three stages:

  1. Stage 1: Fine-tune on book content to establish Adriana James' core voice
  2. Stage 2: Fine-tune on marketing content to adapt to marketing formats
  3. Stage 3: Fine-tune on style alignment to ensure style consistency
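One way to picture the progressive schedule is as an ordered plan in which each stage trains on its own dataset and starts from the checkpoint saved by the previous stage. The sketch below is illustrative only: `STAGE_DATASETS`, `plan_stages`, and the per-stage checkpoint directories (`adriana_model/stage1`, and so on) are assumptions, not the actual internals of finetune_model.py. It also assumes that running a single stage starts from the base model.

```python
# Stage -> dataset file, following the file names listed under Setup.
STAGE_DATASETS = {
    "1": "stage1_book_content.json",
    "2": "stage2_marketing_content.json",
    "3": "stage3_style_alignment.json",
}


def plan_stages(stage="all", base_model="mistralai/Mistral-7B-v0.1",
                output_dir="adriana_model", dataset_dir="datasets"):
    """Return a list of (stage, start_checkpoint, dataset_path, save_dir) steps."""
    stages = sorted(STAGE_DATASETS) if stage == "all" else [stage]
    steps = []
    start = base_model
    for s in stages:
        if s not in STAGE_DATASETS:
            raise ValueError(f"unknown stage: {s!r}")
        save_dir = f"{output_dir}/stage{s}"  # hypothetical checkpoint layout
        steps.append((s, start, f"{dataset_dir}/{STAGE_DATASETS[s]}", save_dir))
        # The next stage continues from what this stage saved.
        start = save_dir
    return steps


for step in plan_stages("all"):
    print(step)
```

The point of chaining checkpoints this way is that later stages refine, rather than replace, what earlier stages learned.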

Running the Fine-tuning Script

To run the complete progressive fine-tuning process:

python finetune_model.py --stage all

To run a specific stage:

python finetune_model.py --stage 1  # Fine-tune on book content only
python finetune_model.py --stage 2  # Fine-tune on marketing content only
python finetune_model.py --stage 3  # Fine-tune on style alignment only

Command-line Arguments

  • --model_name: Base model to fine-tune (default: "mistralai/Mistral-7B-v0.1")
  • --output_dir: Directory to save the fine-tuned model (default: "adriana_model")
  • --stage: Fine-tuning stage (choices: "1", "2", "3", "all", default: "all")
  • --num_epochs: Number of epochs for each stage (default: 3)
  • --seed: Random seed for reproducibility (default: 42)
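For reference, the arguments above could be declared with `argparse` roughly as follows. This is a sketch that mirrors the documented names, choices, and defaults; the real parser in finetune_model.py may differ in details such as help text or extra options, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the arguments documented above."""
    parser = argparse.ArgumentParser(description="Progressive fine-tuning (sketch)")
    parser.add_argument("--model_name", type=str,
                        default="mistralai/Mistral-7B-v0.1",
                        help="Base model to fine-tune")
    parser.add_argument("--output_dir", type=str, default="adriana_model",
                        help="Directory to save the fine-tuned model")
    parser.add_argument("--stage", type=str, choices=["1", "2", "3", "all"],
                        default="all", help="Fine-tuning stage to run")
    parser.add_argument("--num_epochs", type=int, default=3,
                        help="Number of epochs for each stage")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")
    return parser


# Parse an explicit argument list instead of sys.argv to show the defaults.
args = build_parser().parse_args(["--stage", "2"])
print(vars(args))
```

Note that `--stage` values are strings, so `"2"` rather than `2` is what the script would compare against.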

Model Selection

The default base model is Mistral-7B-v0.1, which offers a good balance between output quality and resource requirements. If you have more GPU memory available, larger models may give better results, for example:

  • meta-llama/Llama-2-13b-hf (requires access)
  • tiiuae/falcon-40b (larger model with good performance)
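  • google/flan-t5-xxl (good for instruction following; note that this is an encoder-decoder model, so it loads with AutoModelForSeq2SeqLM rather than AutoModelForCausalLM)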

To use a different model, specify it with the --model_name argument:

python finetune_model.py --model_name tiiuae/falcon-40b

Hardware Requirements

Fine-tuning large language models requires significant computational resources:

  • Minimum: 16 GB of GPU memory (for 7B parameter models)
  • Recommended: 24 GB or more of GPU memory (for 13B+ parameter models)
  • Optimal: Multiple GPUs or a single high-end GPU with 40 GB or more of memory

For models larger than 7B parameters, you may need to use techniques like:

  • 8-bit quantization (already enabled in the script)
  • Gradient checkpointing
  • LoRA or QLoRA fine-tuning
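To see why quantization matters for the figures above, a rough weights-only estimate is the parameter count multiplied by the bytes used per parameter. The sketch below covers model weights only. Gradients, optimizer state, and activations come on top and depend on batch size, sequence length, and training method, so treat the numbers as a lower bound. The function name and the precision table are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision="fp16"):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Nominal parameter counts for the model sizes mentioned in this README.
for name, params in [("7B", 7e9), ("13B", 13e9), ("40B", 40e9)]:
    row = ", ".join(
        f"{p}: {weight_memory_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```

Under these assumptions, a nominal 7B model needs about 14 GB for fp16 weights but about 7 GB in 8-bit, which leaves headroom on a 16 GB GPU. A 40B model still needs about 40 GB in 8-bit, which is where multiple GPUs or LoRA/QLoRA become relevant.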

Using the Fine-tuned Model

After fine-tuning, the model will be saved in the adriana_model/final directory. You can load and use it with the Transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model
model_path = "adriana_model/final"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Generate content
prompt = "Write a marketing email for a professional development workshop."
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens limits only the generated text; max_length would also count the prompt tokens
outputs = model.generate(**inputs, max_new_tokens=200, num_return_sequences=1)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

License

This project is licensed under the MIT License - see the LICENSE file for details.