DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothDDPOTrainer.cpython-311.pyc
361 lines
46 KiB

2025-08-13 23:50:20 +00:00
§
3$hXšãóÔdZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZm
Z
mZmZmZmZm Z mZmZmZmZmZmZmZm Z m!Z!mZm"Z"ddlZddlTddl#m$Z$m%Z%dd l&m'Z'ddlZddl(Z)dd
l*m+Z+ddlmZdd l,m-Z-m.Z/d d
d d
d
dœZ0ej1d d e0¬¦«d¦«Z2e$Gdde¦«¦«Z3 Gdde¦«Z4Gdde4¦«Z5dS)z8
2025.8.4
2025.8.5
4.55.1
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)Ú Acceleratorrr Ú
DDPOConfigÚDDPOStableDiffusionPipelineÚ DDPOTrainerrÚPathÚPerPromptStatTrackerÚProjectConfigurationÚPyTorchModelHubMixinr Ú defaultdictÚfuturesÚgenerate_model_cardÚget_comet_experiment_urlÚis_wandb_availableÚloggerÚosÚset_seedÚtextwrapÚtorchÚwarnings)Ú*)Ú dataclassÚfield)ÚVersion)Ú nullcontext)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionscó’tj| d|jd¦«dd¬¦«}tj| d¦«dd¬¦«}g}t ||¦«D]\}}| tj¦«}tj|d| d¦«¬¦«  d¦«}tj
|d¬¦«}||z
} |  | ¦«Œ’ tj |¦«}| |jd|jdf¦«}|S)Néÿÿÿÿér)ÚchunksÚdim)r1Úindex©r1é)
rÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
Úlogitsr2Úchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logpss
ú^/workspace/Fine-tuning/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothDDPOTrainer.pyÚchunked_selective_log_softmaxrK"s5õ”[ §¢°°F´LÀÔ4DÑ!EÔ!EÐPQÐYZÐ[€NÝ”[ §¢¨rÑ!2Ô!2¸QÀaÐH€MØÐå%(¨¸Ñ%GÔ%Gð #—¥u¤}Ñ Ýœ, |¸2À{×G\ÒG\Ð]_ÑG`ÔG`Ða×iÐjlÑmˆÝ œ?¨<¸Ø)Ð,<Ñ<ˆØ×" Ýœ,Ð':ÑØ-×5°v´|ÀA´ÈÌ ÐUVÌÐ6XÑØ ÐócóÞeZdZUdZedddi¬¦«Zeeed<edddi¬¦«Z ee
ed < d!ˆfd „ Z ˆxZ S)"ÚUnslothDDPOConfigaÎ
Configuration class for the [`DDPOTrainer`].
Using [`~transformers.HfArgumentParser`] we can turn this class into
[argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
command line.
Parameters:
exp_name (`str`, *optional*, defaults to `os.path.basename(sys.argv[0])[: -len(".py")]`):
Name of this experiment (defaults to the script file name without its extension).
run_name (`str`, *optional*, defaults to `""`):
Name of this run.
seed (`int`, *optional*, defaults to `0`):
Random seed.
log_with (`Literal["wandb", "tensorboard"]` or `None`, *optional*, defaults to `None`):
Log with either 'wandb' or 'tensorboard'. See
https://huggingface.co/docs/accelerate/usage_guides/tracking for more details.
tracker_kwargs (`Dict`, *optional*, defaults to `{}`):
Keyword arguments for the tracker (e.g. wandb_project).
accelerator_kwargs (`Dict`, *optional*, defaults to `{}`):
Keyword arguments for the accelerator.
project_kwargs (`Dict`, *optional*, defaults to `{}`):
Keyword arguments for the accelerator project config (e.g. `logging_dir`).
tracker_project_name (`str`, *optional*, defaults to `"trl"`):
Name of project to use for tracking.
logdir (`str`, *optional*, defaults to `"logs"`):
Top-level logging directory for checkpoint saving.
num_epochs (`int`, *optional*, defaults to `100`):
Number of epochs to train.
save_freq (`int`, *optional*, defaults to `1`):
Number of epochs between saving model checkpoints.
num_checkpoint_limit (`int`, *optional*, defaults to `5`):
Number of checkpoints to keep before overwriting old ones.
mixed_precision (`str`, *optional*, defaults to `"fp16"`):
Mixed precision mode to use for training (for example `"fp16"` or `"bf16"`).
allow_tf32 (`bool`, *optional*, defaults to `True`):
Allow `tf32` on Ampere GPUs.
resume_from (`str`, *optional*, defaults to `""`):
Resume training from a checkpoint.
sample_num_steps (`int`, *optional*, defaults to `50`):
Number of sampler inference steps.
sample_eta (`float`, *optional*, defaults to `1.0`):
Eta parameter for the DDIM sampler.
sample_guidance_scale (`float`, *optional*, defaults to `5.0`):
Classifier-free guidance weight.
sample_batch_size (`int`, *optional*, defaults to `1`):
Batch size (per GPU) to use for sampling.
sample_num_batches_per_epoch (`int`, *optional*, defaults to `2`):
Number of batches to sample per epoch.
train_batch_size (`int`, *optional*, defaults to `1`):
Batch size (per GPU) to use for training.
train_use_8bit_adam (`bool`, *optional*, defaults to `False`):
Use 8bit Adam optimizer from bitsandbytes.
train_learning_rate (`float`, *optional*, defaults to `3e-4`):
Learning rate.
train_adam_beta1 (`float`, *optional*, defaults to `0.9`):
Adam beta1.
train_adam_beta2 (`float`, *optional*, defaults to `0.999`):
Adam beta2.
train_adam_weight_decay (`float`, *optional*, defaults to `1e-4`):
Adam weight decay.
train_adam_epsilon (`float`, *optional*, defaults to `1e-8`):
Adam epsilon.
train_gradient_accumulation_steps (`int`, *optional*, defaults to `1`):
Number of gradient accumulation steps.
train_max_grad_norm (`float`, *optional*, defaults to `1.0`):
Maximum gradient norm for gradient clipping.
train_num_inner_epochs (`int`, *optional*, defaults to `1`):
Number of inner epochs per outer epoch.
train_cfg (`bool`, *optional*, defaults to `True`):
Whether to use classifier-free guidance during training.
train_adv_clip_max (`float`, *optional*, defaults to `5.0`):
Clip advantages to the range `[-train_adv_clip_max, train_adv_clip_max]`.
train_clip_range (`float`, *optional*, defaults to `1e-4`):
PPO clip range.
train_timestep_fraction (`float`, *optional*, defaults to `1.0`):
Fraction of timesteps to train on.
per_prompt_stat_tracking (`bool`, *optional*, defaults to `False`):
Whether to track statistics for each prompt separately.
per_prompt_stat_tracking_buffer_size (`int`, *optional*, defaults to `16`):
Number of reward values to store in the buffer for each prompt.
per_prompt_stat_tracking_min_count (`int`, *optional*, defaults to `16`):
Minimum number of reward values a prompt's buffer must hold before its per-prompt statistics are used.
async_reward_computation (`bool`, *optional*, defaults to `False`):
Whether to compute rewards asynchronously.
max_workers (`int`, *optional*, defaults to `2`):
Maximum number of workers to use for async reward computation.
negative_prompts (`str`, *optional*, defaults to `""`):
Comma-separated list of prompts to use as negative examples.
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether to push the final model checkpoint to the Hub.
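The parameters above interact in a few simple ways, and the following is a minimal sketch of that arithmetic rather than the trainer's own code. `DDPOBudgetSketch`, `epoch_budget` and `num_processes` are hypothetical names introduced here for illustration. The sketch assumes that the number of trained timesteps is `sample_num_steps * train_timestep_fraction` truncated to an integer, and that the per-GPU batch sizes are multiplied by the number of processes.

```python
from dataclasses import dataclass


@dataclass
class DDPOBudgetSketch:
    """Hypothetical stand-in holding a few documented fields and their defaults."""
    sample_num_steps: int = 50
    train_timestep_fraction: float = 1.0
    sample_batch_size: int = 1
    sample_num_batches_per_epoch: int = 2
    train_batch_size: int = 1
    train_gradient_accumulation_steps: int = 1


def epoch_budget(cfg, num_processes=1):
    """Derive per-epoch quantities from the config (assumed formulas, see lead-in)."""
    # Assumption: only a fraction of the sampler steps is used for training.
    num_train_timesteps = int(cfg.sample_num_steps * cfg.train_timestep_fraction)
    # Assumption: batch sizes are per GPU, so scale by the number of processes.
    samples_per_epoch = (
        cfg.sample_batch_size * num_processes * cfg.sample_num_batches_per_epoch
    )
    effective_train_batch = (
        cfg.train_batch_size * num_processes * cfg.train_gradient_accumulation_steps
    )
    return {
        "num_train_timesteps": num_train_timesteps,
        "samples_per_epoch": samples_per_epoch,
        "effective_train_batch": effective_train_batch,
    }


if __name__ == "__main__":
    print(epoch_budget(DDPOBudgetSketch(), num_processes=1))
```

Working through these numbers before a run can help confirm that the samples collected per epoch divide evenly into training batches.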
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsr.z8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksÚ inferenceÚéO
ÚtrlÚlogsédr4éÚfp16Té2çð?çFç-Cëâ6
?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç{®Gáz„?ç:Œ0âŽyE>ç-Cëâ6c) ó:t¦«jd'id|d|d|d|d|d|d|d|d | “d
|
d | d | d
|
d|d|d|d|d|d|d|d|d|d|d|d|d|d|d|d|d|d|d | “d!|!“d"|"“d#|#“d$|$“d%|%“d&|&“|)¤Ž|'|_|(|_dS)(NÚexp_nameÚrun_nameÚseedÚlog_withÚtracker_project_nameÚlogdirÚ
num_epochsÚ save_freqÚnum_checkpoint_limitÚmixed_precisionÚ
allow_tf32Ú resume_fromÚsample_num_stepsÚ
sample_etaÚsample_guidance_scaleÚsample_batch_sizeÚsample_num_batches_per_epochÚtrain_batch_sizeÚtrain_use_8bit_adamÚtrain_learning_rateÚtrain_adam_beta1Útrain_adam_beta2Útrain_adam_weight_decayÚtrain_adam_epsilonÚ!train_gradient_accumulation_stepsÚtrain_max_grad_normÚtrain_num_inner_epochsÚ train_cfgÚtrain_adv_clip_maxÚtrain_clip_rangeÚtrain_timestep_fractionÚper_prompt_stat_trackingÚ$per_prompt_stat_tracking_buffer_sizeÚ"per_prompt_stat_tracking_min_countÚasync_reward_computationÚ max_workersÚnegative_promptsÚ push_to_hub©)ÚsuperÚ__init__rRrS)+Úselfrhrirjrkrlrmrnrorprqrrrsrtrurvrwrxryrzr{r|r}r~rr€rrr„r…r†r‡r‰rrrRrSÚkwargsÚ __class__s+ €rJrzUnslothDDPOConfig.__init__s6ø€ðZ ŒÔð& 0ð& 0ð& 0Ø& 0à& 0ð& 0ð xð & 0ð
$8Ð#7ð & 0ð 
& 0ð$˜ð& 0ð"˜ ð& 0ð$8Ð#7ð& 0ð.˜& 0ð$˜ð& 0ð&˜& 0ð& 0ð$˜ð& 0ð%:Ð$9ð& 0ð !2Ð 1ð!& 0ð",HÐ+Gð#& 0ð$/ð%& 0ð&#6Ð"5ð'& 0ð(#6Ð"5ð)& 0ð*/ð+& 0ð,/ð-& 0ð.'>Ð&=ð/& 0ð0"4Ð!3ð1& 0ð21RÐ0Qð3& 0ð4#6Ð"5ð5& 0ð6&<Ð%;ð7& 0ð8"˜ ð9& 0ð:"4Ð!3ð;& 0ð</ð=& 0ð>'>Ð&=ð?& 0ð@(@Ð'?ðA& 0ðB4XÐ3WðC& 0ðD2TÐ1SðE& 0ðF(@Ð'?ðG& 0ðH&˜+ðI& 0ðJ/ðK& 0ðL&˜ðM& 0ð& 0ð& 0ðN%9ˆÔ!Ø"4ˆÔÐÐrL)(rTrUrVNrWrXrYr4rZr[TrUr\r]r^r4r_r4Fr`rarbrcrdr_r]r4Tr^rer]FrfrfFr_rUFNr.)
Ú__name__Ú
__module__Ú __qualname__Ú__doc__r"rRrrÚ__annotations__rSÚintrÚ
__classcell__©r“s@rJrNrN3sEø€ð]ð]ð|+0¨%ØØÐ+ñ+ô+И( 3œ-ððñð*/¨ØØÐ*ñ*ô*И #œððñð ØØØØØØØ Ø ØØØØØ #ØØ'(ØØØ Ø"&Ø"Ø,-Ø!Ø!"ØØ Ø!Ø"%Ø#(Ø/1Ø-/Ø#(ØØØØðSUUUUUUUUUU5rLrNcóÀeZdZdZddgZ d$dedeeje e
e e gejfdege e
e ffde d e
ee e e ge ff
d
Zd%d Zd
edefdZdZdejdedejfdZdZdZdZdZdZde ee
ffdZd$de
efdZdZˆfdZ d&d e
e
d!e
e
d"ee
e e
dffd#„Z!ˆxZ"S)'Ú_UnslothDDPOTrainerrUrWÚddpoNÚconfigÚreward_functionÚprompt_functionÚ sd_pipelineÚimage_samples_hookc óÚ
tjdt¦«|tjd¦«||_||_||_||_td i|jj¤Ž}|jj rJtj   tj  
|jj ¦«¦«|j_ dtj  |jj ¦«vrÏtt!dtj|jj ¦«¦«¦«}t%|¦«dkrt'd|jj ¦«t)d|D¦«¦«}tj  |jj d|d¦«|j_ |dd z|_t/|jj|jjz¦«|_t7d |jj|jj||jj|jzd
œ|jj¤Ž|_ | !¦«\} }
| st'|
¦«|jduo
|jd k} |j j"rg|j  #|jj$| s"tK| &¦«¬ ¦«n| &¦«|jj'¬
¦«tQj)d|¦«tU|jj+d¬¦«||_,|j, -d |j j. ddd¬¦«|j jdkr
t^j0} n)|j jdkr
t^j1} n t^j2} |j,j3 4|j j5| ¬¦«|j,j6 4|j j5| ¬¦«|j,j7 4|j j5| ¬¦«|j, 8¦«}
|j  9|j:¦«|j  ;|j<¦«|jj=rdt^j>j?j@_=| At…|
t¦«s|
 C¦«n|
¦«|_D|j, 6|j, E|jjFdgn |jjFddd|j,jEjG¬¦«jH 4|j j5¦«¦«d|_I|jJrt—|jL|jM¦«|_N|j,jOp |j jO|_O|j,d¦«rj|j,jQr^|j  R|
|jD¦«\}|_Dtt!d| C¦«¦«¦«|_Sn-|j  R|
|jD¦«\|_S|_D|jjTrjV|jW¬¦«|_X|j rrtQj)d|j ¦«|j  Y|j ¦«t/|j  Zd¦«d¦«d z|_[dSd|_[dS)!Nz@DDPOTrainer is deprecated and will be removed in version 0.23.0.z8No image_samples_hook provided; no images will be loggedÚ checkpoint_có
d|vS)Nr¥)Úxs rJú<lambda>z._UnslothDDPOTrainer.__init__.<locals>.<lambda>s  -°1Ð"4€rLrzNo checkpoints found in có^g|]*}t| d¦«d¦«Œ+S)Ú_r.)r™Úsplit)Ú.0r§s rJú
<listcomp>z0_UnslothDDPOTrainer.__init__.<locals>.<listcomp>s/Ð,XÐ,XÐ,XÀq­S°·²¸±´¸bÔ1AÑ-BÔ-BÐ,XÐ,XÐ,XrLr.r4)rkrqÚproject_configÚgradient_accumulation_stepsÚ tensorboard)Úddpo_trainer_config)Ú init_kwargsÚ
T)Údevice_specificFÚTimestep)ÚpositionÚdisableÚleaveÚdescÚ
dynamic_ncolsr[Úbf16)ÚdtyperUÚptÚ
max_length©Úreturn_tensorsÚpaddingÚ
truncationr¾Úuse_loracó|jS©N)Ú
requires_grad)Úps rJz._UnslothDDPOTrainer.__init__.<locals>.<lambda>|s¸!¼/€rL)rzResuming from rª)\rÚwarnÚDeprecationWarningÚ prompt_fnÚ reward_fnrŸÚimage_samples_callbackrÚproject_kwargsrsrÚpathÚnormpathÚ
expanduserÚbasenameÚlistÚfilterÚlistdirÚlenÚ
ValueErrorÚsortedÚjoinÚ iterationr™rtr†Únum_train_timestepsr
rkrqr€Úaccelerator_kwargsÚ acceleratorÚ
_config_checkÚis_main_processÚ
init_trackersrlÚdictÚto_dictÚtracker_kwargsrÚinforrjÚset_progress_bar_configÚis_local_main_processrÚfloat16Úbfloat16r:Úvaer9ÚdeviceÚ text_encoderÚunetÚget_trainable_layersÚregister_save_state_pre_hookÚ_save_model_hookÚregister_load_state_pre_hookÚ_load_model_hookrrÚbackendsÚcudaÚmatmulÚ_setup_optimizerÚ
isinstanceÚ
parametersÚ optimizerÚ tokenizerrŒÚmodel_max_lengthÚ input_idsÚneg_prompt_embedr‡rrˆr‰Ú stat_trackerÚautocastÚhasattrrÃÚprepareÚtrainable_layersrŠrÚThreadPoolExecutorrÚexecutorÚ
load_stater«Ú first_epoch)rr Úaccelerator_project_configÚ checkpointsÚcheckpoint_numbersÚis_okayÚmessageÚis_using_tensorboardÚinference_dtypers rJrz_UnslothDDPOTrainer.__init__øs‡õ Œ
Ø ñ
ô
ð
ð Ð ŒMÐ ŒØŒØˆŒ Ø&8ˆÔ#å%9Ð%WÐ%W¸D¼KÔ<VÐ%WÐ%WÐ Œ;Ô  RÝ&(¤g×&6Ò&6µr´w×7IÒ7IÈ$Ì+ÔJaÑ7bÔ7bÑ&cÔ&cˆDŒKÔ ¥B¤G×$4Ò$4°T´[Ô5LÑ$MÔ$MÐØœ
 4¤;Ô#:Ñôñô õ # qÒ$Ð%YÀÄ Ô@WÐ%YÐ%YÑZÝ%+Ð,XÐ,XÈKÐ,XÑ,XÔ,XÑ%YÔ%YÐ"Ý*,¬'¯,ª,Ø”KÔ:Ð"4°RÔ"8Ð+ô+ Ô
8JÈ"Ô7MÐPQÑ7QÐ$' t¤{Ô'CÀdÄkÔFiÑ'iÑ#jÔ#jˆÔ å 
Ø”[Ô œKÔ)-¬ Ô(UÐX\ÔXpÑ(pð 
ð 
ðŒkÔ 
ð 
ˆÔð ׈Øð˜WÑ °dÐ_¸v¼ÐR_Ò?_Ðà Ô Ô  Ø Ô ×  Ô0ØI]Ðs•t°·²Ñ0@Ô0@ÐAÐci×cqÒcqÑcsÔcsØ œKÔ

ô
ð
õ Œ M˜MÔ!°4ÐÔà Ô×ØÔØØð 
ô
ð
ð Ô Ô +¨vÒ #œmˆOˆOØ
Ô
Ô
Ò
#œnˆOˆOå#œmˆOà ÔÔ×Ò Ô 0Ô 7¸ÐÑ ÔÔÔ)9Ô)@ÈÐ ÔÔ× Ò  Ô!1Ô!8ÀÐ ÑÔà Ô×5°dÔ6KÑ Ô×5°dÔ6KÑ Œ;Ô  9Ø48EŒNÔ Ô ×.Ý1;Ð<LÍdÑ1SÔ1SÐ × )ÐYiñ
ô
ˆŒð!%Ô 0× =Ò =Ø Ô × œ Ô<À$Ä+ÔB^ØØÔ

ô
ô Ÿš˜4Ô!
ô!
ð ô!
ˆÔð Ô  Ý 4ØÔÔ!ô!ˆDÔ ðÔN°TÔ5EÔ5NˆŒ
å #   o°TÔ5EÔ5Nð oØ#'Ô#3×#;Ò#;Ð<LÈdÌnÑ#]Ô#]Ñ ˆD$”.Ý$(­Ð0IÐ0IÈ4Ï?Ê?ÑK\ÔK\Ñ)]Ô)]Ñ$^Ô$^ˆ !à48Ô4D×4LÒ4LÐM]Ð_cÔ_mÑ4nÔ4nÑ 1ˆ ! 4¤>à Œ;Ô  WÝ6À6ÔCUÐVˆDŒMà Ô ð ŒKÐÔ);Ð Ô × Ô(:Ñ " 6Ô#5×#;Ò#;¸CÑ#@Ô#@ÀÔ#DÑÑIˆDÔ Ð Ð à ˆDÔ Ð Ð rLFcó(|s[g}|D]U\}}} |||¦«\}}| tj|jj¬¦«|f¦«ŒVn,‰j ˆfd|¦«}ˆfd|D¦«}t|ŽS)cój|ŽS))rs €rJz5_UnslothDDPOTrainer.compute_rewards.<locals>.<lambda>sø€°.°$´.À!Ð2D€rLcó¢g|]K\}}tj| ¦«jj¬¦«| ¦«fŒLS©r
)rÚ as_tensorÚresultrÜ)ÚrewardÚreward_metadatars €rJr­z7_UnslothDDPOTrainer.compute_rewards.<locals>.<listcomp>—s\ø€ðððá+F˜Oõ §¢¡¤¸Ô9IÔ9PÐQÐSb×SiÒSiÑSkÔSkÐððrL) rËr?rrrÚmapr8) rÚprompt_image_pairsÚis_asyncÚrewardsÚimagesÚpromptsÚprompt_metadatarrs ` rJÚcompute_rewardsz#_UnslothDDPOTrainer.compute_rewardsŠø€Øð ؈GØ4Fð
ð
Ñ0˜ Ø*.¯.ª.¸ÀÈ/Ñ*ZÔ*ZÑ'˜Øåœ¨°tÔ7GÔ7NÐñôððð
ð”m×'Ð(DÐ(DÐ(DÐ(DÐFXÑYˆðððà/6ðñôˆGõ
GˆrLÚepochÚ global_stepcó¬ jjjj¬¦«\Š}ˆfdd ¦«D¦«Š |jj¬¦«\}}t|¦«D](\}}| ||||g¦«Œ)‰j '‰  ||j
j d¦«tj
|¦«}j
 |¦« ¦« ¦«}j
 ||| ¦«| ¦«dœ|¬¦«jjrj
 d¦« ¦« ¦«}jj |d ¬
¦«} ‰j | |¦«}
n/|| ¦«z
| ¦«d zz }
tj|
¦« j
jd ¦«j
j j
j¦«d
<d=dj \} ŠtCjj"¦«D]O} tj#| j
j¬¦«Šˆfd $¦«D¦«Štj%ˆˆfdtC| ¦«D¦«¦«}
dD]=}|tj&| j
j¬¦«dddf|
f|<Œ>‰ ¦«Š '¦«}ˆfd|D¦«}tQ|Ž}ˆfd|D¦«}jj) *¦« +| |||¦«}j
j,st[d¦«ŒQ|dkr8|jj.zdkr%‰j
j/rj
 0¦«|S)a
Perform a single step of training.
Args:
epoch (int): The current epoch.
global_step (int): The current global step.
Side Effects:
- Model weights are updated
- Logs the statistics to the accelerator trackers.
- If `self.image_samples_callback` is not None, it will be called with the prompt_image_pairs, global_step,
and the accelerator tracker.
Returns:
global_step (int): The updated global step.
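The statistics logged during a step include the reward mean and standard deviation, which can also be used to turn raw rewards into advantages. The following is a minimal standard-library sketch of that normalization when per-prompt tracking is disabled. `normalize_rewards` and the `eps` value of 1e-8 are assumptions made for illustration, not names or constants confirmed by this file.

```python
import statistics


def normalize_rewards(rewards, eps=1e-8):
    """Return (advantages, mean, std) for a list of scalar rewards.

    Assumption: advantages are rewards standardized with the population
    standard deviation, with a small eps to avoid division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return advantages, mean, std


if __name__ == "__main__":
    adv, mean, std = normalize_rewards([0.2, 0.5, 0.9])
    print(mean, std, adv)
```

When all rewards in a batch are equal, the standard deviation is zero and every advantage becomes zero, so that batch contributes no learning signal.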
)Ú
iterationsÚ
batch_sizecóTi|]#ŠtjˆfdD¦«¦«Œ$S)có g|]
}|Œ S)Úks €rJr­z7_UnslothDDPOTrainer.step.<locals>.<dictcomp>.<listcomp>¶sø€Ð 7Ð 7Ð 7¨!  Ð 7Ð 7Ð 7rL)rÚcat)r%Úsampless @€rJú
<dictcomp>z,_UnslothDDPOTrainer.step.<locals>.<dictcomp>¶s;øø€ÐT¸Q1•e”iÐ 7Ð 7Ð 7Ð 7¨wÐ 7Ñ 7Ô 7ÑTrLr)rN)rrÚ reward_meanÚ
reward_std©ÚstepÚ
prompt_idsT)Úskip_special_tokensrdr.Ú
advantagesÚ timestepsr
có(i|]\}}||ŒS)r%Úperms €rJr(z,_UnslothDDPOTrainer.step.<locals>.<dictcomp>äs#ø€Ð>¡d q˜!˜Dœ'Ð>rLcóPg|]"}tjjj¬¦«Œ#Sr)rÚrandpermrÜ)Ú
num_timestepsrs €€rJr­z,_UnslothDDPOTrainer.step.<locals>.<listcomp>és/ø€ÐpÐST• 
°dÔ6FÔ6MÐprL)r0ÚlatentsÚ next_latentsÚ log_probscó\g|](}|jdjjg|jdd¢RŽŒ)S)r.r4N)r6ryr7)r2rs €rJr­z,_UnslothDDPOTrainer.step.<locals>.<listcomp>õsBø€ÐrÐ]^˜y˜qœy¨¨T¬[Ô-IÐXÈAÌGÐTUÐTVÐTVÌKÐrrLcóJg|]}tt|¦«¦«Œ S)r8)Ú
row_valuesÚ
original_keyss €rJr­z,_UnslothDDPOTrainer.step.<locals>.<listcomp>ús+ø€Ð
t¥C¨
°zÑ$BÔ$BÑhrLzsOptimization step should have been performed by this point. Please check calculated gradient accumulation settings.)1Ú_generate_samplesrŸrxrwÚkeysrÚ enumerateÚextendrÌÚtrackersrr&r;ÚcpuÚnumpyÚlogÚmeanÚstdr‡Ú batch_decoderüÚupdaterr6Ú
num_processesÚ
process_indexr9r7Úrangerr5ÚitemsÚstackÚarangeÚvaluesr8ÚtrainÚ_train_batched_samplesÚsync_gradientsrÖroÚ
save_state)rrrÚprompt_image_datarÚrewards_metadataÚ
image_datar-rr/Útotal_batch_sizeÚ inner_epochÚpermsÚkeyÚoriginal_valuesÚreshaped_valuesÚtransposed_valuesÚsamples_batchedr6r=r3r's` @@@@rJr,z_UnslothDDPOTrainer.stepžs~øøøøø€ð$&*×%;Ò%;Ø”{Ô”{Ô&<ñ&
ô&
Ñ"ˆÐ UÐTÀ'È!Ä*Ç/Â/ÑBSÔBSÐTˆØ$(×$8Ò$8Ø ¨¬ Ô(Lð%9ñ%
ô%
Ñ!ˆÐ'Ð'8Ñ Að A‰MˆAˆ × Ò ˜w qœzÐ+;¸AÔ+>Ð Ô × 'Ð(9¸ÔHXÔHaÐbcÔHdÑ ”)˜$ˆØÔ"×)¨'Ñ2×8×@ˆà Ô×ÒàØ&Ÿ|š|™~œ~Ø%Ÿkšk™mœmð 
ð
ð ð ñ
ô
ð
ð Œ;Ô  MàÔ¸Ô1FÑM×UˆÔ0×=¸jÐ^bÐcˆÔ*×1°'¸CˆJˆ! G§L¢L¡N¤NÑ2°w·{²{±}´}ÀtÑ7KÑLˆ
ŒO˜
ŠW3°RÑ
8¸Ô9IÔ9Wô
Yç
ŠRÔ Ô
  Ñð
 !à*1°+Ô*>Ô*Dј  ¤Ô!CÑ! ñ! ˆ”>Ð"2¸4Ô;KÔ;RÐSˆ>¨g¯mªm©o¬oÐ>ˆ”KØpÕX]Ð^nÑXoÔXoÐôˆEðMð
ð
Ø& sœ|Ý”LÐ!1¸$Ô:JÔ:QÐRÐSTÐSTÐSTÐVZÐSZÔðô ˜ ð
$ŸLšL™NœNˆ%ŸnšnÑ.ˆrÐbqÐrˆ!$ _Ð 5Ð àhÐVgÐhˆ Ô Ô !× ×5°kÀ5È+ÐWfÑgˆÔ
Ý ðJñôðñ
ð
AŠ:ˆ:˜% $¤+Ô"7Ñ7¸ÔAQÔAaÐ Ô × ÐrLcó¦| ¦«5|jjr{|j t j|gdz¦«t j|gdz¦«|¦«j}| d¦«\}} ||jj | |z
zz}n!|j |||¦«j}|j 
||||jj |¬¦«}
|
j } ddd¦«n #1swxYwYt j
||jj |jj¦«}t j| |z
¦«} | ||jj| ¦«}
dt j| |z
dz¦«z}t jt j| dz
¦«|jjk ¦«¦«}|
||fS)a
Calculate the loss for a batch of an unpacked sample.
Args:
latents (torch.Tensor):
The latents sampled from the diffusion model, shape: [batch_size, num_channels_latents, height, width]
timesteps (torch.Tensor):
The timesteps sampled from the diffusion model, shape: [batch_size]
next_latents (torch.Tensor):
The next latents sampled from the diffusion model, shape: [batch_size, num_channels_latents, height,
width]
log_probs (torch.Tensor):
The log probabilities of the latents, shape: [batch_size]
advantages (torch.Tensor):
The advantages of the latents, shape: [batch_size]
embeds (torch.Tensor):
The embeddings of the prompts, shape: [2*batch_size or batch_size, ...]. The first form applies when
train_cfg is True, in which case negative prompt embeddings are expected to be concatenated to the embeds.
Returns:
loss (torch.Tensor), approx_kl (torch.Tensor), clipfrac (torch.Tensor) (all of these are of shape (1,))
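The three returned quantities can be illustrated with a small standard-library sketch of a PPO-style clipped objective over scalar log probabilities. This is a sketch under stated assumptions, not the trainer's implementation: `ddpo_loss_sketch` is a hypothetical name, the real method works on tensors and first runs the diffusion model to obtain the new log probabilities, and the default values simply mirror `train_clip_range` and `train_adv_clip_max` from the configuration docstring.

```python
import math


def ddpo_loss_sketch(new_log_probs, old_log_probs, advantages,
                     clip_range=1e-4, adv_clip_max=5.0):
    """Return (loss, approx_kl, clipfrac) for lists of scalars.

    Assumption: a standard PPO clipped surrogate with clipped advantages.
    """
    n = len(advantages)
    # Clip advantages to [-adv_clip_max, adv_clip_max].
    adv = [max(-adv_clip_max, min(adv_clip_max, a)) for a in advantages]
    # Importance ratio between the current and the sampling policy.
    ratios = [math.exp(new - old) for new, old in zip(new_log_probs, old_log_probs)]
    unclipped = [-a * r for a, r in zip(adv, ratios)]
    clipped = [
        -a * max(1.0 - clip_range, min(1.0 + clip_range, r))
        for a, r in zip(adv, ratios)
    ]
    # Pessimistic bound: take the larger (worse) loss elementwise, then average.
    loss = sum(max(u, c) for u, c in zip(unclipped, clipped)) / n
    # Quadratic approximation of the KL divergence between the two policies.
    approx_kl = 0.5 * sum(
        (new - old) ** 2 for new, old in zip(new_log_probs, old_log_probs)
    ) / n
    # Fraction of samples whose ratio falls outside the clip range.
    clipfrac = sum(1.0 for r in ratios if abs(r - 1.0) > clip_range) / n
    return loss, approx_kl, clipfrac


if __name__ == "__main__":
    print(ddpo_loss_sketch([0.1], [0.0], [1.0], clip_range=0.01))
```

A rising `clipfrac` suggests the policy is moving far from the one that generated the samples, which is one reason the default clip range is kept small.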
r_)ÚetaÚ prev_sampleNgà?r])rr&Úsampler5rvÚscheduler_steprur9Úclampr„ÚexpÚlossr…rFÚabsÚfloat)rr7r0r8r9r/ÚembedsÚ
noise_predÚnoise_pred_uncondÚnoise_pred_textÚscheduler_step_outputÚlog_probÚratiorhÚ approx_klÚclipfracs rJÚcalculate_lossz"_UnslothDDPOTrainer.calculate_loss sð.]Š]‰_Œ_ðŒ{Ô
Ø”I˜w˜i¨!™mÑ”I˜y˜k¨A™oÑñôôð ð
6@×5EÒ5EÀaÑ5HÔ5HÑ! ´Ô1RØ#Ð&7Ñ2ñ
ðØØñôôð ð%)Ô$4×$CÒ$CØØØØ”KÔ %Dñ%ô%Ð 6ˆHð7 7øøøð 7õ:”[Ø Ø
Œ[Ô
ŒKÔ 
ô
ˆ
õ ” ˜( YÑàyŠy˜ T¤[Ô%AÀ5ÑIˆà%œ* Ñ&:¸qÑ%@ÑAˆ å”:uœy¨°©Ñ¼ Ô8TÒ^ˆàY Ð(s•CC:Ã:C>ÄC>r/Ú
clip_rangerqcóœ| |z}| tj|d|z
d|z¦«z}tjtj||¦«¦«S)Nr])rrfrFÚmaximum)rr/rurqÚunclipped_lossÚ clipped_losss rJrhz_UnslothDDPOTrainer.lossMs[ð %˜ ,ˆØ"{¥U¤[Ø Ø  Ø  ñ&
ô&
ñ
ˆ õ
Œz%œ-¨¸ ÑFrLcóæ|jjrddl}|jj}nt
jj}|||jj|jj|jj f|jj
|jj ¬¦«S)Nr)ÚlrÚbetasÚ weight_decayÚeps) rzÚ bitsandbytesÚoptimÚ AdamW8bitrÚAdamWr{r|r}r~r)rÚtrainable_layers_parametersrÚ
optimizer_clss rJz$_UnslothDDPOTrainer._setup_optimizer[sxØ Œ;Ô  Ð Ð Ð à8ˆMˆMå!œKÔ-ˆˆ Œ{Ô”;Ô´Ô1MÐœÔ Ô 
ñ
ô
ð
rLcóf|j |||¦«| ¦«dS)Úsave_checkpointÚpop)rÚmodelsÚweightsÚ
output_dirs rJz$_UnslothDDPOTrainer._save_model_hookks.Ø Ô×°¸*Ñ Š
Œ
ˆ
ˆ
ˆ
rLcód|j ||¦«| ¦«dS)Úload_checkpointr‡)rrˆÚ input_dirs rJz$_UnslothDDPOTrainer._load_model_hookos,Ø Ô×°Ñ
Š
Œ ˆ ˆ ˆ rLc ó
g}g}jj ¦«j |dd¦«}t |¦«D]°}t
ˆfdt |¦«D¦«Ž\}}j |dddjjj¬¦«j  
j j ¦«} ‰j