unsloth_compiled_cache/__pycache__/UnslothDDPOTrainer.cpython-310.pyc

o
õ×°h”›ã@s¼dZddlmZddlZddlmZddlmZddlmZm	Z	m
Z
mZmZm
Z
mZmZddlmZmZmZmZmZmZm
Z
mZmZmZmZmZmZmZmZmZmZmZmZm Z m!Z!mZm"Z"ddlZddlTddl#m$Z$m%Z%dd	l&m'Z'ddlZddl(Z)dd
l*m+Z+ddlmZddl,m-Z-m.Z/dd
dd
d
dœZ0ej1dde0d�dd„ƒZ2e$Gdd„deƒƒZ3	Gdd„deƒZ4Gdd„de4ƒZ5	e6edƒrÜddl7Z7Gdd„de7j8ƒZ9	e :e9dƒ¡dSdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)ÚAcceleratorrrÚ
DDPOConfigÚDDPOStableDiffusionPipelineÚDDPOTrainerrÚPathÚPerPromptStatTrackerÚProjectConfigurationÚPyTorchModelHubMixinrÚdefaultdictÚfuturesÚgenerate_model_cardÚget_comet_experiment_urlÚis_wandb_availableÚloggerÚosÚset_seedÚtextwrapÚtorchÚwarnings)Ú*)Ú	dataclassÚfield)ÚVersion)Únullcontext)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚmax_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ	fullgraphÚoptionsc
Cs¾tj| d|jd¡ddd�}tj| d¡ddd�}g}t||ƒD](\}}| tj¡}tj|d| d¡d� 	d¡}tj
|dd�}||}	| |	¡q!	t |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)r/Úindex©r/é)
rÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ	unsqueezeÚsqueezeÚ	logsumexpÚappendÚconcat)
Úlogitsr0Úchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚchunk_logitsÚchunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©rHúR/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothDDPOTrainer.pyÚchunked_selective_log_softmax"s
rJcs¬eZdZUdZedddid�Zeeed<edddid�Z	ee
ed	<	
				
																																			d"‡fd d!„	Z‡ZS)#ÚUnslothDDPOConfigaÎ
    
    Configuration class for the [`DDPOTrainer`].

    Using [`~transformers.HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        exp_name (`str`, *optional*, defaults to `os.path.basename(sys.argv[0])[: -len(".py")]`):
            Name of this experiment (by default is the file name without the extension name).
        run_name (`str`, *optional*, defaults to `""`):
            Name of this run.
        seed (`int`, *optional*, defaults to `0`):
            Random seed.
        log_with (`Literal["wandb", "tensorboard"]` or `None`, *optional*, defaults to `None`):
            Log with either 'wandb' or 'tensorboard', check
            https://huggingface.co/docs/accelerate/usage_guides/tracking for more details.
        tracker_kwargs (`Dict`, *optional*, defaults to `{}`):
            Keyword arguments for the tracker (e.g. wandb_project).
        accelerator_kwargs (`Dict`, *optional*, defaults to `{}`):
            Keyword arguments for the accelerator.
        project_kwargs (`Dict`, *optional*, defaults to `{}`):
            Keyword arguments for the accelerator project config (e.g. `logging_dir`).
        tracker_project_name (`str`, *optional*, defaults to `"trl"`):
            Name of project to use for tracking.
        logdir (`str`, *optional*, defaults to `"logs"`):
            Top-level logging directory for checkpoint saving.
        num_epochs (`int`, *optional*, defaults to `100`):
            Number of epochs to train.
        save_freq (`int`, *optional*, defaults to `1`):
            Number of epochs between saving model checkpoints.
        num_checkpoint_limit (`int`, *optional*, defaults to `5`):
            Number of checkpoints to keep before overwriting old ones.
        mixed_precision (`str`, *optional*, defaults to `"fp16"`):
            Mixed precision training.
        allow_tf32 (`bool`, *optional*, defaults to `True`):
            Allow `tf32` on Ampere GPUs.
        resume_from (`str`, *optional*, defaults to `""`):
            Resume training from a checkpoint.
        sample_num_steps (`int`, *optional*, defaults to `50`):
            Number of sampler inference steps.
        sample_eta (`float`, *optional*, defaults to `1.0`):
            Eta parameter for the DDIM sampler.
        sample_guidance_scale (`float`, *optional*, defaults to `5.0`):
            Classifier-free guidance weight.
        sample_batch_size (`int`, *optional*, defaults to `1`):
            Batch size (per GPU) to use for sampling.
        sample_num_batches_per_epoch (`int`, *optional*, defaults to `2`):
            Number of batches to sample per epoch.
        train_batch_size (`int`, *optional*, defaults to `1`):
            Batch size (per GPU) to use for training.
        train_use_8bit_adam (`bool`, *optional*, defaults to `False`):
            Use 8bit Adam optimizer from bitsandbytes.
        train_learning_rate (`float`, *optional*, defaults to `3e-4`):
            Learning rate.
        train_adam_beta1 (`float`, *optional*, defaults to `0.9`):
            Adam beta1.
        train_adam_beta2 (`float`, *optional*, defaults to `0.999`):
            Adam beta2.
        train_adam_weight_decay (`float`, *optional*, defaults to `1e-4`):
            Adam weight decay.
        train_adam_epsilon (`float`, *optional*, defaults to `1e-8`):
            Adam epsilon.
        train_gradient_accumulation_steps (`int`, *optional*, defaults to `1`):
            Number of gradient accumulation steps.
        train_max_grad_norm (`float`, *optional*, defaults to `1.0`):
            Maximum gradient norm for gradient clipping.
        train_num_inner_epochs (`int`, *optional*, defaults to `1`):
            Number of inner epochs per outer epoch.
        train_cfg (`bool`, *optional*, defaults to `True`):
            Whether to use classifier-free guidance during training.
        train_adv_clip_max (`float`, *optional*, defaults to `5.0`):
            Clip advantages to the range `[-train_adv_clip_max, train_adv_clip_max]`.
        train_clip_range (`float`, *optional*, defaults to `1e-4`):
            PPO clip range.
        train_timestep_fraction (`float`, *optional*, defaults to `1.0`):
            Fraction of timesteps to train on.
        per_prompt_stat_tracking (`bool`, *optional*, defaults to `False`):
            Whether to track statistics for each prompt separately.
        per_prompt_stat_tracking_buffer_size (`int`, *optional*, defaults to `16`):
            Number of reward values to store in the buffer for each prompt.
        per_prompt_stat_tracking_min_count (`int`, *optional*, defaults to `16`):
            Minimum number of reward values to store in the buffer.
        async_reward_computation (`bool`, *optional*, defaults to `False`):
            Whether to compute rewards asynchronously.
        max_workers (`int`, *optional*, defaults to `2`):
            Maximum number of workers to use for async reward computation.
        negative_prompts (`str`, *optional*, defaults to `""`):
            Comma-separated list of prompts to use as negative examples.
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the final model checkpoint to the Hub.
    
    NÚhelpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsr,z8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksÚ	inferenceÚéO
ÚtrlÚlogsédr2éÚfp16Té2çð?ç@éFç-Cëâ6
?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç{®Gáz„?ç:Œ0âŽyE>ç-Cëâ6?éc)*stƒjd'id|“d|“d|“d|“d|“d|“d|“d|“d	|	“d
|
“d|“d|“d
|
“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d|“d | “d!|!“d"|"“d#|#“d$|$“d%|%“d&|&“|)¤Ž|'|_|(|_dS)(NÚexp_nameÚrun_nameÚseedÚlog_withÚtracker_project_nameÚlogdirÚ
num_epochsÚ	save_freqÚnum_checkpoint_limitÚmixed_precisionÚ
allow_tf32Úresume_fromÚsample_num_stepsÚ
sample_etaÚsample_guidance_scaleÚsample_batch_sizeÚsample_num_batches_per_epochÚtrain_batch_sizeÚtrain_use_8bit_adamÚtrain_learning_rateÚtrain_adam_beta1Útrain_adam_beta2Útrain_adam_weight_decayÚtrain_adam_epsilonÚ!train_gradient_accumulation_stepsÚtrain_max_grad_normÚtrain_num_inner_epochsÚ	train_cfgÚtrain_adv_clip_maxÚtrain_clip_rangeÚtrain_timestep_fractionÚper_prompt_stat_trackingÚ$per_prompt_stat_tracking_buffer_sizeÚ"per_prompt_stat_tracking_min_countÚasync_reward_computationÚmax_workersÚnegative_promptsÚpush_to_hubrH)ÚsuperÚ__init__rOrP)*Úselfrdrerfrgrhrirjrkrlrmrnrorprqrrrsrtrurvrwrxryrzr{r|r}r~rr€r�r‚rƒr„r…r†r‡rˆr‰rOrPÚkwargs©Ú	__class__rHrIr‹œsž.ÿþýüûúùø	÷
öõô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'
zUnslothDDPOConfig.__init__)(rQrRrSNrTrUrVr2rWrXTrRrYrZr[r2r\r2Fr]r^r_r`rar\rZr2Tr[rbrZFrcrcFr\rRFNr,)
Ú__name__Ú
__module__Ú__qualname__Ú__doc__r!rOrrÚ__annotations__rPÚintr‹Ú
__classcell__rHrHrŽrIrK3sf
^þþ×rKcsReZdZdZddgZ	d3dedeeje	e
e	egejfdege	e
effded	e
eeeegeff
d
d„Zd4d
d„Zdedefdd„Zdd„Zdejdedejfdd„Zdd„Zdd„Zdd„Zd d!„Zd"d#„Zd$e	ee
ffd%d&„Zd3d'e
efd(d)„Zd*d+„Z‡fd,d-„Z			d5d.e
e
d/e
e
d0ee
e e
dffd1d2„Z!‡Z"S)6Ú_UnslothDDPOTrainerrRrTÚddpoNÚconfigÚreward_functionÚprompt_functionÚsd_pipelineÚimage_samples_hookc	Csøt dt¡|durt d¡||_||_||_||_td"i|jj¤Ž}|jj	r}t
j t
j 
|jj	¡¡|j_	dt
j |jj	¡vr}ttdd„t
 |jj	¡ƒƒ}t|ƒdkr]td|jj	›�ƒ‚tdd	„|Dƒƒ}t
j |jj	d|d
›�¡|j_	|d
d|_t|jj|jjƒ|_td"|jj|jj||jj|jdœ|jj¤Ž|_ | !¡\}	}
|	s¬t|
ƒ‚|jduoµ|jd
k}|j j"rÒ|j j#|jj$|sÉt%| &¡d�n| &¡|jj'd�t( )d|›�¡t*|jj+dd�||_,|j,j-d|j j.dddd�|j jdkrýt/j0}n|j jdk�rt/j1}nt/j2}|j,j3j4|j j5|d�|j,j6j4|j j5|d�|j,j7j4|j j5|d�|j, 8¡}
|j  9|j:¡|j  ;|j<¡|jj=�rJdt/j>j?j@_=| AtB|
tƒ�sV|
 C¡n|
¡|_D|j, 6|j,jE|jjFdu�rjdgn|jjFddd|j,jEjGd�jH 4|j j5¡¡d|_I|jJ�r�tK|jL|jMƒ|_N|j,jO�p•|j jO|_OtP|j,dƒ�r»|j,jQ�r»|j  R|
|jD¡\}|_Dttdd„| C¡ƒƒ|_Sn|j  R|
|jD¡\|_S|_D|jjT�rÔtUjV|jWd�|_X|j	�r÷t( )d |j	›�¡|j  Y|j	¡t|j	 Zd!¡d
ƒd|_[dSd|_[dS)#Nz@DDPOTrainer is deprecated and will be removed in version 0.23.0.z8No image_samples_hook provided; no images will be loggedÚcheckpoint_cSsd|vS)NržrH©ÚxrHrHrIÚ<lambda>sz._UnslothDDPOTrainer.__init__.<locals>.<lambda>rzNo checkpoints found in cSsg|]}t| d¡dƒ‘qS)Ú_r,)r•Úsplit)Ú.0r rHrHrIÚ
<listcomp>óz0_UnslothDDPOTrainer.__init__.<locals>.<listcomp>r,r2)rgrmÚproject_configÚgradient_accumulation_stepsÚtensorboard)Úddpo_trainer_config)r™Úinit_kwargsÚ
T)Údevice_specificFÚTimestep)ÚpositionÚdisableÚleaveÚdescÚ
dynamic_ncolsrXÚbf16)ÚdtyperRÚptÚ
max_length©Úreturn_tensorsÚpaddingÚ
truncationr·Úuse_loracSs|jS©N)Ú
requires_grad)ÚprHrHrIr¡s)r‡zResuming from r¢rH)\rÚwarnÚDeprecationWarningÚ	prompt_fnÚ	reward_fnr™Úimage_samples_callbackrÚproject_kwargsrorÚpathÚnormpathÚ
expanduserÚbasenameÚlistÚfilterÚlistdirÚlenÚ
ValueErrorÚsortedÚjoinÚ	iterationr•rpr‚Únum_train_timestepsrrgrmr|Úaccelerator_kwargsÚacceleratorÚ
_config_checkÚis_main_processÚ
init_trackersrhÚdictÚto_dictÚtracker_kwargsrÚinforrfrœÚset_progress_bar_configÚis_local_main_processrÚfloat16Úbfloat16r8Úvaer7ÚdeviceÚtext_encoderÚunetÚget_trainable_layersÚregister_save_state_pre_hookÚ_save_model_hookÚregister_load_state_pre_hookÚ_load_model_hookrnÚbackendsÚcudaÚmatmulÚ_setup_optimizerÚ
isinstanceÚ
parametersÚ	optimizerÚ	tokenizerrˆÚmodel_max_lengthÚ	input_idsÚneg_prompt_embedrƒrr„r…Ústat_trackerÚautocastÚhasattrr¼ÚprepareÚtrainable_layersr†rÚThreadPoolExecutorr‡ÚexecutorÚ
load_stater£Úfirst_epoch)rŒr™ršr›rœr�Úaccelerator_project_configÚcheckpointsÚcheckpoint_numbersÚis_okayÚmessageÚis_using_tensorboardÚinference_dtyperørãrHrHrIr‹ûsÌþ
þÿþùøýû


ÿûùø
þ

z_UnslothDDPOTrainer.__init__Fc	s~|s'g}|D]\}}}ˆ |||¡\}}| tj|ˆjjd�|f¡qt|ŽSˆj ‡fdd„|¡}‡fdd„|Dƒ}t|ŽS)N©rács
ˆj|ŽSr½)rÃrŸ©rŒrHrIr¡™ó
z5_UnslothDDPOTrainer.compute_rewards.<locals>.<lambda>cs.g|]\}}tj| ¡ˆjjd�| ¡f‘qS©r)rÚ	as_tensorÚresultrÔrá)r¤ÚrewardÚreward_metadatarrHrIr¥šsÿÿz7_UnslothDDPOTrainer.compute_rewards.<locals>.<listcomp>)	rÃr=rrrÔrárúÚmapr6)	rŒÚprompt_image_pairsÚis_asyncÚrewardsÚimagesÚpromptsÚprompt_metadatar
rrHrrIÚcompute_rewards�sþÿ
ú
þz#_UnslothDDPOTrainer.compute_rewardsÚepochÚglobal_stepcsšˆjˆjjˆjjd�\‰}‡fdd„ˆd ¡Dƒ‰ˆj|ˆjjd�\}}t|ƒD]\}}| ||||g¡q)ˆj	durIˆ 	||ˆj
jd¡t 
|¡}ˆj
 |¡ ¡ ¡}ˆj
j||| ¡| ¡dœ|d�ˆjjrŠˆj
 ˆd	¡ ¡ ¡}ˆjjj|d
d�}	ˆj |	|¡}
n|| ¡| ¡d}
t |
¡ ˆj
jd
¡ˆj
j ˆj
j¡ˆd<ˆd	=ˆdj \}‰t!ˆjj"ƒD]v}tj#|ˆj
jd�‰‡fdd„ˆ $¡Dƒ‰t %‡‡fdd„t!|ƒDƒ¡}
dD]}ˆ|tj&|ˆj
jd�dd…df|
fˆ|<qãˆ ¡‰ˆ '¡}‡fdd„|Dƒ}t(|Ž}‡fdd„|Dƒ}ˆjj) *¡ˆ +||||¡}ˆj
j,�s2t-dƒ‚q¼|dk�rK|ˆjj.dk�rKˆj
j/�rKˆj
 0¡|S)a
        Perform a single step of training.

        Args:
            epoch (int): The current epoch.
            global_step (int): The current global step.

        Side Effects:
            - Model weights are updated
            - Logs the statistics to the accelerator trackers.
            - If `self.image_samples_callback` is not None, it will be called with the prompt_image_pairs, global_step,
              and the accelerator tracker.

        Returns:
            global_step (int): The updated global step.

        )Ú
iterationsÚ
batch_sizecs&i|]‰ˆt ‡fdd„ˆDƒ¡“qS)csg|]}|ˆ‘qSrHrH)r¤Ús©ÚkrHrIr¥¹óz7_UnslothDDPOTrainer.step.<locals>.<dictcomp>.<listcomp>)rÚcat)r¤)ÚsamplesrrIÚ
<dictcomp>¹s&z,_UnslothDDPOTrainer.step.<locals>.<dictcomp>r)rN)r
rÚreward_meanÚ
reward_std©ÚstepÚ
prompt_idsT)Úskip_special_tokensrar,Ú
advantagesÚ	timestepsrcsi|]	\}}||ˆ“qSrHrH©r¤rÚv)ÚpermrHrIrçócsg|]}tjˆˆjjd�‘qSr)rÚrandpermrÔrá©r¤r¢)Ú
num_timestepsrŒrHrIr¥ìr¦z,_UnslothDDPOTrainer.step.<locals>.<listcomp>)r&ÚlatentsÚnext_latentsÚ	log_probscs.g|]}|jdˆjjg|jdd…¢RŽ‘qS)r,r2N)r4r™rur5)r¤r(rrHrIr¥øs.csg|]	}ttˆ|ƒƒ‘qSrH)rØr6)r¤Ú
row_values)Ú
original_keysrHrIr¥ýr*zsOptimization step should have been performed by this point. Please check calculated gradient accumulation settings.)1Ú_generate_samplesr™rtrsÚkeysrr†Ú	enumerateÚextendrÄrÔÚtrackersrrr9ÚcpuÚnumpyÚlogÚmeanÚstdrƒrœrðÚbatch_decoderôÚupdaterr4Ú
num_processesÚ
process_indexr7rár5Úranger~r+ÚitemsÚstackÚarangeÚvaluesr6rãÚtrainÚ_train_batched_samplesÚsync_gradientsrÎrkrÖÚ
save_state)rŒrrÚprompt_image_datarÚrewards_metadataÚiÚ
image_datar#rr%Útotal_batch_sizeÚinner_epochÚpermsÚkeyÚoriginal_valuesÚreshaped_valuesÚtransposed_valuesÚsamples_batchedrH)r-r2r)rrŒrIr"¡sz
þ
ÿ

üù
ÿ
ýÿÿ
ÿ
ÿÿ&
z_UnslothDDPOTrainer.stepcCs(| ¡�L|jjr0|j t |gd¡t |gd¡|¡j}| d¡\}}	||jj	|	|}n	|j |||¡j}|jj
||||jj|d�}
|
j}Wdƒn1sSwYt 
||jj|jj¡}t ||¡}| ||jj|¡}
dt ||d¡}t t |d¡|jjk ¡¡}|
||fS)a‚
        Calculate the loss for a batch of an unpacked sample

        Args:
            latents (torch.Tensor):
                The latents sampled from the diffusion model, shape: [batch_size, num_channels_latents, height, width]
            timesteps (torch.Tensor):
                The timesteps sampled from the diffusion model, shape: [batch_size]
            next_latents (torch.Tensor):
                The next latents sampled from the diffusion model, shape: [batch_size, num_channels_latents, height,
                width]
            log_probs (torch.Tensor):
                The log probabilities of the latents, shape: [batch_size]
            advantages (torch.Tensor):
                The advantages of the latents, shape: [batch_size]
            embeds (torch.Tensor):
                The embeddings of the prompts, shape: [2*batch_size or batch_size, ...] Note: the "or" is because if
                train_cfg is True, the expectation is that negative prompts are concatenated to the embeds

        Returns:
            loss (torch.Tensor), approx_kl (torch.Tensor), clipfrac (torch.Tensor) (all of these are of shape (1,))
        r\)ÚetaÚprev_sampleNgà?rZ)rõr™rrœrãrrÚsampler3rrÚscheduler_steprqr0Úclampr€ÚexpÚlossr�r;ÚabsÚfloat)rŒr.r&r/r0r%ÚembedsÚ
noise_predÚnoise_pred_uncondÚnoise_pred_textÚscheduler_step_outputÚlog_probÚratior\Ú	approx_klÚclipfracrHrHrIÚcalculate_losssN
ýüÿýüûåý 
z"_UnslothDDPOTrainer.calculate_lossr%Ú
clip_rangerecCs8||}|t |d|d|¡}t t ||¡¡S)NrZ)rrZr;Úmaximum)rŒr%rireÚunclipped_lossÚclipped_lossrHrHrIr\Ps
ýz_UnslothDDPOTrainer.losscCsL|jjr
ddl}|jj}ntjj}|||jj|jj|jj	f|jj
|jjd�S)Nr)ÚlrÚbetasÚweight_decayÚeps)r™rvÚbitsandbytesÚoptimÚ	AdamW8bitrÚAdamWrwrxryrzr{)rŒÚtrainable_layers_parametersrqÚ
optimizer_clsrHrHrIrì^s
ûz$_UnslothDDPOTrainer._setup_optimizercCs|j |||¡| ¡dSr½)rœÚsave_checkpointÚpop)rŒÚmodelsÚweightsÚ
output_dirrHrHrIrænsz$_UnslothDDPOTrainer._save_model_hookcCs|j ||¡| ¡dSr½)rœÚload_checkpointrx)rŒryÚ	input_dirrHrHrIrèrsz$_UnslothDDPOTrainer._load_model_hookcsdg}g}ˆjj ¡ˆj |dd¡}t|ƒD]—}t‡fdd„t|ƒDƒŽ\}}ˆjj|dddˆjjjd�j	 
ˆjj¡}	ˆj 
|	¡d}
ˆ ¡�"ˆj|
|ˆjjˆjjˆjjdd	�}|j}|j}
|j}Wd
ƒn1slwYtj|
dd�}
tj|dd�}ˆjjj |d¡}| |	|
||
d
d
…d
d…f|
d
d
…dd
…f||d
œ¡| |||g¡q||fS)a4
        Generate samples from the model

        Args:
            iterations (int): Number of iterations to generate samples for
            batch_size (int): Batch size to use for sampling

        Returns:
            samples (list[dict[str, torch.Tensor]]), prompt_image_pairs (list[list[Any]])
        r2csg|]}ˆ ¡‘qSrH)rÂr,rrHrIr¥ˆrz9_UnslothDDPOTrainer._generate_samples.<locals>.<listcomp>r¶r·Tr¸r)Ú
prompt_embedsÚnegative_prompt_embedsÚnum_inference_stepsÚguidance_scalerVÚoutput_typeNr1r,)r#r~r&r.r/r0r)rœrãÚevalróÚrepeatrAr6rðrñròr7rÔrárârõr™rprrrqrr.r0rrCÚ	schedulerr&r=)rŒrrrr
Úsample_neg_prompt_embedsr¢rrr#r~Ú	sd_outputrr.r0r&rHrrIr3vsXûú
ú	ôùÿz%_UnslothDDPOTrainer._generate_samplesc
Csºttƒ}t|ƒD]Ò\}}|jjrt |d|dg¡}n|d}t|jƒD]´}	|j	 
|jj¡�u| 
|ddd…|	f|ddd…|	f|ddd…|	f|ddd…|	f|d|¡\}
}}|d	 |¡|d
 |¡|d |
¡|j	 |
¡|j	jr“|j	 t|jtƒsŒ|j ¡n|j|jj¡|j ¡|j ¡Wdƒn1s§wY|j	jrÙdd
„| ¡Dƒ}|j	j|dd�}| ||dœ¡|j	j||d�|d7}ttƒ}q%q|S)a
        Train on a batch of samples. Main training segment

        Args:
            inner_epoch (int): The current inner epoch
            epoch (int): The current epoch
            global_step (int): The current global step
            batched_samples (list[dict[str, torch.Tensor]]): The batched samples to train on

        Side Effects:
            - Model weights are updated
            - Logs the statistics to the accelerator trackers.

        Returns:
            global_step (int): The updated global step
        rr~r.Nr&r/r0r%rfrgr\cSs"i|]
\}}|t t |¡¡“qSrH)rr;rCr'rHrHrIrés"z>_UnslothDDPOTrainer._train_batched_samples.<locals>.<dictcomp>r;)Ú	reduction)rrOr!r2)rrÊr5r™rrrrArÒrÔÚ
accumulaterœrãrhr=ÚbackwardrHÚclip_grad_norm_rírørîr}rïr"Ú	zero_gradrBÚreducer>r:)
rŒrOrrÚbatched_samplesrÛÚ_irXr_Újr\rfrgrHrHrIrG´sN
ú
ÿü
ê€ß"z*_UnslothDDPOTrainer._train_batched_samplesÚreturncCs¶|jj|jj|jj}|jj|jj|jj}|jj|jjks/dd|jj›d|jj›d�fS|jj|jjdksHdd|jj›d|jj›d�fS||dksYdd|›d|›d�fSd	S)
NFzSample batch size (z9) must be greater than or equal to the train batch size (ú)rz-) must be divisible by the train batch size (zNumber of samples per epoch (z3) must be divisible by the total train batch size ()TrR)r™rsrÔr?rtrur|)rŒÚsamples_per_epochÚtotal_train_batch_sizerHrHrIrÕñs*ÿÿþÿþþþz!_UnslothDDPOTrainer._config_checkÚepochscCs6d}|dur