# Adriana James Marketing Assistant AI

This project fine-tunes a language model to generate marketing content in the voice and style of Adriana James, based on her book content, past campaigns, and style guidelines.

## Project Structure

- `generate_dataset.py`: Script to generate fine-tuning datasets from book content, past campaigns, and style guidelines
- `finetune_model.py`: Script to fine-tune the model using the generated datasets
- `data/`: Directory containing source data
  - `book.pdf`: Adriana James' book content
  - `past_campaigns/`: Directory containing past marketing campaigns
  - `style_guidelines/`: Directory containing brand style guidelines
- `datasets/`: Directory containing generated fine-tuning datasets
- `adriana_model/`: Directory containing the fine-tuned model

## Setup

1. Install the required dependencies:
   ```
   pip install -r requirements.txt
   ```

2. Generate the fine-tuning datasets:
   ```
   python generate_dataset.py
   ```
   This will create the following datasets in the `datasets/` directory:
   - `stage1_book_content.json`: Dataset for fine-tuning on book content
   - `stage2_marketing_content.json`: Dataset for fine-tuning on marketing content
   - `stage3_style_alignment.json`: Dataset for fine-tuning on style alignment
   - `combined_dataset.json`: Combined dataset for all stages
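
To sanity-check the generated files before training, a short inspection script can help. This is a minimal sketch that assumes each file holds a JSON array of objects; the README does not specify the record schema, so the field names it prints are whatever `generate_dataset.py` actually wrote. The helper name `summarize_records` is hypothetical.

```python
import json
from pathlib import Path


def summarize_records(records):
    """Return (record_count, sorted field names of the first record)."""
    # Assumption: a dataset is a JSON array of objects (not confirmed by the README).
    if not isinstance(records, list):
        raise ValueError(f"expected a JSON array, got {type(records).__name__}")
    first = records[0] if records else {}
    fields = sorted(first) if isinstance(first, dict) else []
    return len(records), fields


if __name__ == "__main__":
    # Prints nothing if the datasets/ directory does not exist yet.
    for path in sorted(Path("datasets").glob("*.json")):
        with open(path, encoding="utf-8") as f:
            count, fields = summarize_records(json.load(f))
        print(f"{path.name}: {count} records, fields: {fields}")
```

An empty or unexpectedly small file at this point usually means the source material in `data/` was not found or not parsed.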

## Fine-tuning the Model

The fine-tuning process follows a progressive approach with three stages:

1. **Stage 1**: Fine-tune on book content to establish Adriana James' core voice
2. **Stage 2**: Fine-tune on marketing content to adapt to marketing formats
3. **Stage 3**: Fine-tune on style alignment to ensure style consistency
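
The idea behind progressive fine-tuning is that each stage starts from the weights produced by the previous one. The sketch below only plans that chain and does no training. The function `plan_stages` and the per-stage directory names (`stage1`, `stage2`, `stage3`) are hypothetical and may not match what `finetune_model.py` does internally. In particular, this sketch starts a single-stage run from the base model, whereas the real script may resume from an earlier checkpoint.

```python
# Dataset paths as listed in the Setup section.
STAGE_DATASETS = {
    "1": "datasets/stage1_book_content.json",
    "2": "datasets/stage2_marketing_content.json",
    "3": "datasets/stage3_style_alignment.json",
}


def plan_stages(stage, base_model, output_dir):
    """Return a list of (stage, dataset, start_model, save_dir) tuples."""
    selected = ["1", "2", "3"] if stage == "all" else [stage]
    plan = []
    start_model = base_model
    for s in selected:
        save_dir = f"{output_dir}/stage{s}"  # hypothetical checkpoint layout
        plan.append((s, STAGE_DATASETS[s], start_model, save_dir))
        start_model = save_dir  # the next stage continues from this checkpoint
    return plan


if __name__ == "__main__":
    for step in plan_stages("all", "mistralai/Mistral-7B-v0.1", "adriana_model"):
        print(step)
```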

### Running the Fine-tuning Script

To run the complete progressive fine-tuning process:

```
python finetune_model.py --stage all
```

To run a specific stage:

```
python finetune_model.py --stage 1  # Fine-tune on book content only
python finetune_model.py --stage 2  # Fine-tune on marketing content only
python finetune_model.py --stage 3  # Fine-tune on style alignment only
```

### Command-line Arguments

- `--model_name`: Base model to fine-tune (default: "mistralai/Mistral-7B-v0.1")
- `--output_dir`: Directory to save the fine-tuned model (default: "adriana_model")
- `--stage`: Fine-tuning stage (choices: "1", "2", "3", "all", default: "all")
- `--num_epochs`: Number of epochs for each stage (default: 3)
- `--seed`: Random seed for reproducibility (default: 42)
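
For reference, the arguments above map onto a standard `argparse` parser along these lines. This is a sketch reconstructed from the list, not a copy of the parser in `finetune_model.py`; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Progressive fine-tuning (sketch)")
    parser.add_argument("--model_name", default="mistralai/Mistral-7B-v0.1",
                        help="Base model to fine-tune")
    parser.add_argument("--output_dir", default="adriana_model",
                        help="Directory to save the fine-tuned model")
    parser.add_argument("--stage", choices=["1", "2", "3", "all"], default="all",
                        help="Fine-tuning stage to run")
    parser.add_argument("--num_epochs", type=int, default=3,
                        help="Number of epochs for each stage")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the example does not depend on sys.argv.
    args = build_parser().parse_args(["--stage", "2", "--num_epochs", "5"])
    print(vars(args))
```

Several options can be combined in one call, for example `python finetune_model.py --stage 2 --num_epochs 5 --seed 7`.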

## Model Selection

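The default base model is Mistral-7B-v0.1, which offers a good balance between output quality and resource requirements. Larger models may give better results at a higher hardware cost:

- `meta-llama/Llama-2-13b-hf` (13B parameters; gated, so you must request access on Hugging Face first)
- `tiiuae/falcon-40b` (40B parameters; needs far more GPU memory than the default)
- `google/flan-t5-xxl` (11B parameters; good for instruction following, but it is an encoder-decoder model that loads with `AutoModelForSeq2SeqLM` rather than `AutoModelForCausalLM`, so the script may need changes)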

To use a different model, specify it with the `--model_name` argument:

```
python finetune_model.py --model_name tiiuae/falcon-40b
```

## Hardware Requirements

Fine-tuning large language models requires significant computational resources:

- **Minimum**: 16 GB of GPU memory (for 7B-parameter models, using the script's 8-bit quantization)
- **Recommended**: 24 GB or more of GPU memory (for 13B+ parameter models)
- **Optimal**: Multiple GPUs, or a single high-end GPU with 40 GB or more of memory

For models larger than 7B parameters, you may need to use techniques like:
- 8-bit quantization (already enabled in the script)
- Gradient checkpointing
- LoRA or QLoRA fine-tuning
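
A rough way to see why quantization matters is to estimate the memory taken by the model weights alone. The sketch below multiplies a nominal parameter count by the bytes per parameter for each precision. It deliberately ignores gradients, optimizer state, and activations, which add substantially to the total during training, so treat the figures as a lower bound. The function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal sizes; actual parameter counts differ slightly per model.
    for name, params in [("7B", 7e9), ("13B", 13e9), ("40B", 40e9)]:
        row = ", ".join(
            f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name} -> {row}")
```

By this estimate, the weights of a 7B model take about 14 GB in fp16 and about 7 GB in int8. That is why 8-bit loading leaves headroom on a 16 GB card, while a 40B model exceeds a single 24 GB GPU even at 8 bits.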

## Using the Fine-tuned Model

After fine-tuning, the model will be saved in the `adriana_model/final` directory. You can load and use it with the Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model
model_path = "adriana_model/final"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Generate content
prompt = "Write a marketing email for a professional development workshop."
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens limits only the generated text; max_length would also count the prompt tokens
outputs = model.generate(**inputs, max_new_tokens=200)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.