DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothPPOTrainer.cpython-310.pyc
2025-08-28 17:57:59 +00:00
o
<—°hÝã@sBdZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZm Z m!Z!m"Z"m#Z#m$Z$m%Z%m&Z&m'Z'm(Z(m)Z)m Z m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1m2Z2m3Z3m4Z4m5Z5m6Z6m7Z7m8Z8m9Z9m:Z:m;Z;m<Z<m=Z=m>Z>m?Z?m@Z@mAZAmZmBZBmCZCmDZDmEZEmFZFmGZGmHZHmIZImJZJmKZKmZmLZLmMZMm
Z
m"Z"m'Z'm;Z;mDZDmZddlDZDddlTddlNmOZOmPZPdd lQmRZRddlZddlSZBdd
lTmCZCddlmZdd lUmVZVmWZXd d
d d
d
dœZYejZd d eYdddƒZ[eOGdddeƒƒZ\ Gddde'ƒZ]Gddde]ƒZ^dS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)GÚ AcceleratorÚBaseImageProcessorÚCallbackHandlerÚDEFAULT_CALLBACKSÚDEFAULT_PROGRESS_CALLBACKÚDataCollatorWithPaddingÚ
DataLoaderÚDatasetÚExportableStateÚFeatureExtractionMixinÚGenerationConfigÚINVALID_LOGPROBÚOnlineTrainerStaterÚ PPOConfigÚ
PPOTrainerÚPathÚ
PeftConfigÚ PeftModelÚPolicyAndValueWrapperÚPreTrainedTokenizerBaseÚPrinterCallbackÚProcessorMixinÚTrainerÚTrainerCallbackÚTrainerControlrÚbatch_generationÚ broadcastÚcontextmanagerÚcreate_reference_modelÚ defaultdictÚdisable_dropout_in_modelÚ empty_cacheÚ exact_divÚfirst_true_indicesÚforwardÚ
gather_objectÚgcÚgenerate_model_cardÚget_comet_experiment_urlÚget_peft_modelÚ#get_reporting_integration_callbacksÚ
get_rewardÚis_peft_availableÚis_rich_availableÚis_wandb_availableÚlog_table_to_comet_experimentÚ masked_meanÚ
masked_whitenÚmathÚnnÚnpÚ nullcontextÚosÚpdÚpeft_module_casting_to_bf16Úprepare_deepspeedÚprint_rich_tableÚselective_log_softmaxÚtextwrapÚtimeÚtorchÚtruncate_responseÚunwrap_model_for_generationrrr"r6r@rH)Ú*)Ú dataclassÚfield)ÚVersion)r?)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionsc
Ctj| d|jd¡ddd}tj| d¡ddd}g}t||ƒD](\}}| tj¡}tj|d| d¡d  d¡}tj
|dd}||} |  | ¡q! t  |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rZÚindex©rZé)
rHÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
Úlogitsr[Úchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©rsúQ/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothPPOTrainer.pyÚchunked_selective_log_softmax"s  
rucs eZdZUdZedddidZeeed<edddidZ ee
ed <  
 
                 

   
 
!
   
"
       
 
"      # $ 
%     

  &  


 !    " 
 ' (
 
 

    ) * +   ,  -        . -  
/ 0 1  1  2   d5‡fd3d4„ Z Z S)6ÚUnslothPPOConfigaþ
    Configuration class for the [`PPOTrainer`].

    This class includes only the parameters that are specific to PPO training. For a full list of training arguments,
    please refer to the [`~transformers.TrainingArguments`] and [`OnPolicyConfig`] documentation. Note that default
    values in this class may differ from those in [`~transformers.TrainingArguments`].

    Using [`~transformers.HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        exp_name (`str`, *optional*, defaults to `os.path.basename(__file__)[:-3]`):
            Name of this experiment.
        reward_model_path (`str`, *optional*, defaults to `"EleutherAI/pythia-160m"`):
            Path to the reward model.
        model_adapter_name (`str` or `None`, *optional*, defaults to `None`):
            Name of the train target PEFT adapter, when using LoRA with multiple adapters.
        ref_adapter_name (`str` or `None`, *optional*, defaults to `None`):
            Name of the reference PEFT adapter, when using LoRA with multiple adapters.
        num_ppo_epochs (`int`, *optional*, defaults to `4`):
            Number of epochs to train.
        whiten_rewards (`bool`, *optional*, defaults to `False`):
            Whether to whiten the rewards.
        kl_coef (`float`, *optional*, defaults to `0.05`):
            KL coefficient.
        kl_estimator (`Literal["k1", "k3"]`, *optional*, defaults to `"k1"`):
            Which estimator for KL-Divergence to use from [Approximating KL
            Divergence](http://joschu.net/blog/kl-approx.html). Defaults to "k1", a straightforward, unbiased
            estimator. Can be set to "k3", an unbiased estimator with lower variance which "appears to be a strictly
            better estimator". Cannot be set to "k2", as it is used for logging purposes.
        cliprange (`float`, *optional*, defaults to `0.2`):
            Clip range.
        vf_coef (`float`, *optional*, defaults to `0.1`):
            Value function coefficient.
        cliprange_value (`float`, *optional*, defaults to `0.2`):
            Clip range for the value function.
        gamma (`float`, *optional*, defaults to `1.0`):
            Discount factor.
        lam (`float`, *optional*, defaults to `0.95`):
            Lambda value for GAE.
        ds3_gather_for_generation (`bool`, *optional*, defaults to `True`):
            This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation,
            improving generation speed. However, disabling this option allows training models that exceed the VRAM
            capacity of a single GPU, albeit at the cost of slower generation.
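The `kl_estimator` options above can be illustrated with a small sketch. It follows the definitions in the linked "Approximating KL Divergence" post: with tokens sampled from the policy and `logr = ref_logprob - logprob`, "k1" is `-logr` and "k3" is `(exp(logr) - 1) - logr`. The function name `kl_estimates` and the plain-list inputs are assumptions made for illustration; this is not the trainer's own code, which operates on tensors.

```python
import math


def kl_estimates(logprobs, ref_logprobs, estimator="k1"):
    """Per-token KL estimates between a policy and a reference model.

    Hypothetical helper for illustration only. Both inputs hold the
    log-probabilities of the same sampled tokens, where the tokens were
    sampled from the policy.
    """
    if estimator not in ("k1", "k3"):
        raise ValueError("estimator must be 'k1' or 'k3'")
    estimates = []
    for logprob, ref_logprob in zip(logprobs, ref_logprobs):
        # Log of the probability ratio reference / policy.
        logr = ref_logprob - logprob
        if estimator == "k1":
            # k1: unbiased, but individual values can be negative.
            estimates.append(-logr)
        else:
            # k3: (r - 1) - log r, unbiased and never negative.
            estimates.append(math.expm1(logr) - logr)
    return estimates
```

Because "k3" is non-negative for every token, its per-token values do not cancel each other the way "k1" values can, which is one way to see why its variance tends to be lower.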
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrWz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksFÚnorXéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsr]éôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéé@é
é5çffffffæ?úEleutherAI/pythia-160mÚ
ppo_configçš™™™™™©?Úk1çš™™™™™É?çffffffî?c£¥ sv|dkr td|dƒ|dkrtd|dƒ|dur(|#dkr(|$dkr(d}d }#|dur:d
d lmt|¤ƒd d
ƒ}|‰d
krBtdƒ|‰dkrJtdƒtƒjd±id|d|d|d|d|d|d|d|d| “d|
d| d| d|
d|d|d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-|d.|d/|d0| “d1|!“d2|"“d3|#“d4|$“d5|%“d6|&“d7|'“d8|(“d9|)“d:|*“d;|+“d<|,“d=|-“d>|.“d?|/“d@|0“dA|1“dB|2“dC|3“dD|4“dE|5“dF|6“dG|7“dH|8“dI|9“dJ|:“dK|;“dL|<“dM|=“dN|>“dO|?“dP|@“dQ|A“dR|B“dS|C“dT|D“dU|E“dV|F“dW|G“dX|H“dY|I“dZ|J“d[|K“d\|L“d]|M“d^|N“d_|O“d`|P“da|Q“db|R“dc|S“dd|T“de|U“df|V“dg|W“dh|X“di|Y“dj|Z“dk|[“dl|\“dm|]“dn|^“do|_“dp|`“dq|a“dr|b“ds|c“dt|d“du|e“dv|f“dw|g“dx|h“dy|i“dz|j“d{|k“d||l“d}|m“d~|n“d|o“d€|p“d|q“d|r“dƒ|s“d„|t“d…|u“d†|v“d‡|w“dˆ|x“d‰|y“dŠ|z“d|{“dŒ||“d|}“dŽ|~“d|d|€“d|d’|‚“d“|ƒ“d”|„“d•|…“d–|†“d—|‡“d˜|ˆ“d™|‰“dš|Š“d›|‹“dœ|Œ“d|dž|Ž“dŸ|d |d¡|‘“d¢|’“d£|““d¤|”“d¥|•“d¦|–“d§|—“d¨|˜“d©|™“dª|š“d«|›“d¬|œ“d­|d®|ž“d¯|Ÿ“d°| “|£¤Ž|¡|_|¢|_ dS)²NçH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!r]za` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rŠrÚunsloth_training_checkpointsr|r)Ú cpu_countrXr}zUUnsloth: Please set a positive non-zero temperature since your results will be wrong.r—zgUnsloth: Please set a positive non-zero temperature less than 10, since sampling will be quite erratic.Ú
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚdataset_num_procÚnum_mini_batchesÚtotal_episodesÚ local_rollout_forward_batch_sizeÚnum_sample_generationsÚresponse_lengthÚ
stop_tokenÚ
stop_token_idÚ temperatureÚmissing_eos_penaltyÚsft_model_pathÚ
world_sizeÚnum_total_batchesÚmicro_batch_sizeÚlocal_batch_sizeÚ
batch_sizeÚlocal_mini_batch_sizeÚmini_batch_sizeÚexp_nameÚreward_model_pathÚmodel_adapter_nameÚref_adapter_nameÚnum_ppo_epochsÚwhiten_rewardsÚkl_coefÚ kl_estimatorÚ cliprangeÚvf_coefÚcliprange_valueÚgammaÚlamÚds3_gather_for_generationrs)
ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr¢ÚmaxÚ MathErrorÚsuperÚ__init__rzr{)¥Úselfr£r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrrrrrr r!r"r#r$r%r&r'r(r)r*r+r,r-r.r/r0r1r2r3r4r5r6r7r8r9r:r;r<r=r>r?r@rArBrzr{Úkwargsr¢©Ú __class__rsrtrIns&(  ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·JKµL´M³N²O±P°Q¯R®S­T¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžcdœefšgh˜ijklmnopqrŽstŒuvŠwxˆyz{|}ƒ~ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"
zUnslothPPOConfig.__init__)¢NNFFFr|FrXrXNNr}r}rr~rr€rrr„r…rWr†r‡rrˆr‰TNrŠFr]FrŠrNTFFFFFFrŒFFFFrFFNrWNNFrFNrNrWNNTNFNNFrrNNNNrrNFFrNNNNTFTFFNNr“NNFNFNFTrŽNNNrTFNr”r•FNNFFNNFFFNFTNr]Nrr—r˜NNr™NršNNNNNNNrNNrXFrœrr‡r„TNrW)
Ú__name__Ú
__module__Ú __qualname__Ú__doc__rMrzrrÚ__annotations__r{ÚintrIÚ
__classcell__rsrsrLrtrv3s\
0þþÜrvcsPeZdZddgZ     d,dedeeeee e
fde j dee j d e j d
e
d e j d eed
eee
eee
ffdeejjejjjfdeeededddfddZdefddZdefddZeddƒZd-deedeffdd
Z d d!„Z!d.d"efd#d$„Z"‡fd%d&„Z#   d/d'eed(eed)eeeedffd*d+„Z$‡Z%S)0Ú_UnslothPPOTrainerÚtrlÚppoN©NNÚargsÚprocessing_classÚmodelÚ ref_modelÚ reward_modelÚ
train_datasetÚ value_modelÚ
data_collatorÚ eval_datasetÚ
optimizersÚ callbacksÚ peft_configrÚreturnc
Cs\||urtdƒ||_||_||_|durt|jƒ}|jr$|jr$tdƒ|jr?|jdkr6|j|jj_|_ntd|jdƒ|j|jj_|_|jj dvrRtdƒt
ƒs]| dur]t dƒt
ƒr†| dur†t |jt
ƒrp|j ¡|_t|j| ƒ|_|jr†t|jd d
ƒr†t|jƒt
ƒoŽt |jt
ƒ|_|j|_|j|_|rž||_n
|jr¥d|_nt|jƒ|_||_||_t|ƒ|_||_||_| |_|
\|_|_ d|_!|j"durÖt#|j$|jƒ|_"t%|j&d }
|
|_'|
j(|_)|j*|j&|_+t#|j*|j)ƒ|_,t#|j+|j)ƒ|_-t.|j-|j/d ƒ|_0t.|j+|j/d
ƒ|_1|j2r!|j1dks!Jd|j1dƒt3 4|j"|j-¡|_5t6j7t#t8 ƒ|
j9d}t:|dƒ }|j<d|j=d||_>|j=|
j?d|_@|jAdkrdtBd|j5|jAƒ|_C|j+|_D|j|j|j|jfD] }|dur}tE|ƒqrtF|j|jƒ|_G|jjH|jG_H|jI|j5dtJtK|jjLƒ}| dur£|n|| |_MtN|jM|jG|j|j|j ƒ|_O| P|jjQr¿tRntS¡tTƒ|_UtV| | dd|jOjM|jUgDƒd|_Yd|_Zd|_[t|j'jYddƒdu|_\t|j'jYddƒdu|_]d|_^|jj_r| |jjartbjc|jjdddte|jGdƒr!|jG f|jg¡th|j|jDd|jdd|_it6 j|j=¡|
 k|jG|j|ji¡\|_G|_|_it6 j|j@¡th|j|jl|jdd |_m|
 k|jm¡|_m|j\rtn|j|j*|jo|jƒ|_|jdur}|js{td!ƒdStn|j|j*|jo|jƒ|_dS|jdurš|js™td!ƒn |j p|j'j9¡|_|j p|j'j9¡|_dS)"Nzœ`model` and `ref_model` cannot be the same object. If you want `ref_model` to be the same as `model`, you must make a copy of it, or `None` if you use peft.z5You cannot set both `stop_token` and `stop_token_id`.ÚeoszUnknown `stop_token` z9. Allowed values are: `'eos'` and `None` (no stop token).>rÚk3zákl_estimator must be either 'k1' (straightforward, unbiased) or 'k3' (lower variance, unbiased, appears to be a strictly better estimator). See [Approximating KL Divergence](http://joschu.net/blog/kl-approx.html) for details.zvPEFT is not installed and you passed a `peft_config` in the trainer's kwargs, please install it to use the PEFT modelsÚis_loaded_in_4bitF)z5`batch_size` must be a multiple of `num_mini_batches`z;`local_batch_size` must be a multiple of `num_mini_batches`ézPer-rank minibatch size z is insufficient for whitening©ÚdevicerÚ__i£†r])Únum_training_stepscSsg|] }t|tƒr|qSrs)Ú
isinstancer)Ú.0ÚcbrsrsrtÚ
<listcomp>_s

ÿÿz/_UnslothPPOTrainer.__init__.<locals>.<listcomp>)Úis_local_process_zeroÚis_world_process_zeroÚstateful_callbacksÚdeepspeed_pluginÚ fsdp_pluginT)Úexist_okÚadd_model_tags)r2ÚshuffleÚ
collate_fnÚ drop_last)r2rzr{z1No reference model and model is not a Peft model.)qÚ
ValueErrorrYrZÚ policy_modelrr)r*Ú eos_token_idÚgeneration_configr<r6Ú ImportErrorrnrÚmerge_and_unloadr3ÚgetattrrBÚ
is_peft_modelr7r8r\r(r]r^ÚlenÚtrain_dataset_lenr_r`raÚ optimizerÚ lr_schedulerÚoptimizer_cls_and_kwargsr%rSr¸r Ú acceleratorÚ
num_processesr.r1r0r2r,r$r4r3r:r<Úceilr/rHÚtensorrGrkr&Úitemr5Ú
process_indexÚ
local_seedr'rFÚsample_generations_freqÚlocal_dataloader_batch_sizer*rr[ÚconfigÚcreate_optimizer_and_schedulerrr4rcrÚcallback_handlerÚ add_callbackrår rr$ÚcontrolrrrrsÚstateÚ current_flosÚhp_search_backendÚis_deepspeed_enabledÚis_fsdp_enabledrrÚ init_hf_repoÚ should_saver@Úmakedirsr£ÚhasattrrxÚ
_tag_namesrÚ
dataloaderÚ manual_seedÚpreparer«Úeval_dataloaderrCrb)rJrYrZr[r\r]r^r_r`rarbrcrdr‰Ú time_tensorÚtime_intÚmoduleÚdefault_callbacksrsrsrtrIÌs ÿ
 
 ÿ ÿÿ  
 
 
 
ÿ
ÿ ÿ
ÿ 
 ÿÿÿý

û  üÿ ÿ
ÿ ÿz_UnslothPPOTrainer.__init__cCó|jS©©rJrsrsrtÚget_train_dataloaderžóz'_UnslothPPOTrainer.get_train_dataloadercC)rsrsrtÚget_eval_dataloader¡z&_UnslothPPOTrainer.get_eval_dataloaderccs”|jr|js|j |jj¡ ¡ntƒ,|jr |jj |j¡dV|jr8|jj |j p.d¡WdƒdSWdƒdS1sCwYdS)zWContext manager for handling null reference model (that is, peft adapter manipulation).Nrx)
r8r‰Ú unwrap_modelr[ÚpolicyÚdisable_adapterr?Ú set_adapterr7rsrsrtÚnull_ref_context¤sÿÿý÷"øz#_UnslothPPOTrainer.null_ref_contextFr£Ú_internal_callcsL|j}|jj|_|jr|j}|j|_tƒ ||¡||_|jr$||_dSdS)r[rHÚ
save_model)rJÚ backup_modelÚbackup_deepspeedrLrsrt²s

ÿz_UnslothPPOTrainer.save_modelcr
s0|j}|j}|j}|j}|j}|j}|j}|j|j}fdd} t | ƒƒ}
t
|j |j ddddd} | 
d¡t ¡} |j|j|jf}
tj|
|d }tj|
|d }tj|
|d }tj|
|d }tj|
|d }tj|
|d }tj|
|d }| ¡d
|j_d
|j_|j|j_|j|j|j_|jdurª|jd kr¥t |jj|j¡|j_n|j|j_|j durÆ|j d krÁt |jj|j ¡|j_ n|j |j_ |j!durâ|j!d krÝt |jj|j!¡|j_!n|j!|j_!|j" #||j|j$¡|_$|j%rø|j|_&|j|_'t(d |jd ƒD]î}|jjd |j)7_t*|
ƒ}t ˆ|d  ,|¡}|j-d }g}g}g}g}g}g}g}t.|j|j|jj/d
} t0| j1||j2|j3| ƒ\}!}"Wdƒn 1sVwYt(d
|j-d
|j2ƒD]è}#||#|#|j2}$|!|#|#|j2}%|%dd|df}&|"|#|#|j2}'t4|'|&ƒ}(~'t5ƒ|dur¸| t7|j1|%|j3ƒ})Wdƒn 1s²wYnt7||%|j3ƒ})|)j8dd|d df}*|*|j d}*t4|*|&ƒ}+~)~*t5ƒ|&},|j9durít:|j9|j3|&ƒ},t ;|$|,fd ¡}-t<|,|j3kƒd }.| =|¡j>}/t?|/|%|j3|ƒ\}0}1}1|0dd|d df @d¡}2t?||-|j3|ƒ\}1}3}1| A|&¡| A|,¡| A|(¡| A|+¡| A|.¡| A|3¡| A|2¡qet ;|d
¡}t ;|d
¡}t ;|d
¡}t ;|d
¡}t ;|d
¡}t ;|d
¡}t ;|d
¡}~(~+~0~2~3~ t5ƒtB tjD||jjEkdd}4|jjFdur¢||4|jjF8<tjG|j-d |jd  H|j-d
d ¡}5|5| Id ¡k}6t J||6tK¡}t J||6tK¡}|d }7|5|7 Id ¡k}8t J||8d
¡}||}9|jLdkrè|9 n|9 d |9}:|jN |:};|; }<tjG|< Pd
¡|<jd }=t Q|7|< Pd ¡k|7|¡}>|<|=|>g|7<|jRr.tS|<|8dd}<t J|<|8d
¡}<d
}?g}@|j-d }AtTt(|AƒƒD]:}B|B|Ad krP|dd|Bd fnd}C|<dd|Bf|jU|C|dd|Bf}D|D|jU|jV|?}?|@ A|?¡q=tjW|@dddd d}E|E|}FtS|E|6ƒ}Et J|E|6d
¡}Et5ƒWdƒn 1s£wYt(|jƒD]Ê}GtXjY Z|j[¡}Hd
}It(d
|j[|j\ƒD]´}J|J|j\}K|H|J|K…}Ld
}Mt(d
|j\|j]ƒD]x}N| ^|¡b|N|j]}O|L|N|O…}P|E|P}Q||P}R|!|P}S||P}T|F|P}U||P}Vt7||S|j3ƒ\}W}X|Wj8dd|d df}'|'|j d}'t4|'|Rƒ}Yt J|Y|6|PtK¡}Y|Xdd|d df @d¡}Zt J|Z|8|Pd
¡}Zt _|Z|V|j`|V|j`¡}[t a|Z|U¡}\t a|[|U¡}]t b|\|]¡}^dtc|^|8|Pƒ}_tc|]|\k |8|Pƒ}`|Y|T}at M|a¡}b|Q |b}c|Q t _|bd|jed|je¡}dt b|c|d¡}etc|e|6|Pƒ}f|f|jf|_}g| g|g¡| | t ptc|d|ck |6|Pƒ}htjjjkjl|'dtjmd ,|'jn¡}itjo|'ddtjp|i|'dd}jd|ad }k|k||G|I|Mf<|h||G|I|Mf<|f||G|I|Mf<|_||G|I|Mf<|`||G|I|Mf<|j ||G|I|Mf<|b ||G|I|Mf<Wdƒn 1s8wYWdƒn 1sHwY|Md 7}MqÙ|Id 7}I~W~X~'~Y~Z~[~\~]~_~`~a~b~c~d~e~f~g~h~i~j~k~U~Q~V~R~S~Tt5ƒq­t |: pd ¡ }l|  pd ¡ }m|; pd ¡ }n|n| }otr|jjt ¡| ƒ}pi}q|p|qd<|j s|l¡  |qd<|j s|m¡  |qd<|j s|n¡  |qd<|j s|o¡  |qd<|j s| ¡  |qd<|j s|¡  |qd<|j s|¡  |qd<|j s|¡  |qd<|j s|¡  |qd <|j s|¡  |qd!<|j s|¡  |qd"<|j s|¡  |qd#<|j s|¡  |qd$<||jEk  |qd%<|jv d
|qd&<|jj|qd'<|jj|j|j_x|jjd 7_| y|q¡Wdƒn 1s†wY|jv |j" z||j|j$¡|_$|j$j{r³|j||dd(|j" }|j|j|j$¡|_$~:~l~m~n~~q~;t5ƒtB |j~d
krÚ|d |jd
krÚ|j€dd)t5ƒ~!~~~~~~~4~7~5~6~8~<~=~>~E~Ft5ƒq|j" ||j|j$¡|_$|j$j{r|j||ddd*|j" }|j|j|j$¡|_$dSdS)+Nc3s ˆEdHqrsrsrsrtÚrepeat_generatorÌs
ÿz2_UnslothPPOTrainer.train.<locals>.repeat_generatorr rr„Úmax_new_tokensr+Útop_kÚtop_pÚ do_samplez===training policy===rjrr]Ú input_ids©Úgather_deepspeed3_paramsrWr\rF)ÚmaskÚ
shift_mean)Úaxisgà?)rZÚdtyper}Úepsz objective/klzobjective/entropyzobjective/non_score_rewardzobjective/rlhf_rewardzobjective/scoreszpolicy/approxkl_avgzpolicy/clipfrac_avgzloss/policy_avgzloss/value_avgzval/clipfrac_avgzpolicy/entropy_avgz val/ratioz
val/ratio_varzval/num_eos_tokensÚlrÚepisode)Útrial)Úsampling)Úmetrics)rYr‰r†r[r\r]rZrkÚiterrr(r+ÚprintrGr9r$rHÚzerosÚtrainr—Ú global_steprÈr/r%r…r¸r<rr”Úon_train_beginrÚ
model_wrappedÚranger2ÚnextÚno_gradrbr`rJrBr%r&Ú pad_token_idrEr+r´r.rjr*rIÚcatr-r_r5rfrhr0ÚcollectÚanyr~r,ÚarangeÚrepeatreÚ masked_fillrr<Úexpr;ÚcloneÚsizeÚwherer:r;Úreversedr@rAÚstackr>ÚrandomÚ permutationr1r3Ú
accumulateÚclampr?ÚsquarerFr:Úfloatr=r>ÚbackwardÚstepÚ zero_gradr=rÚsoftmaxrcrgÚsumÚmeanrSÚgather_for_metricsrÚvarr‡Ú get_last_lrÚepochÚlogÚ on_step_endrÚ_save_checkpointÚon_saver'rÚgenerate_completionsÚ on_train_end)rrJrYr‰r†r[Ú
ref_policyr]rZrkÚiter_dataloaderrÚ
start_timeÚ stats_shapeÚapproxkl_statsÚpg_clipfrac_statsÚ
pg_loss_statsÚ
vf_loss_statsÚvf_clipfrac_statsÚ
entropy_statsÚ ratio_statsÚupdateÚdataÚqueriesÚcontext_lengthÚ responsesÚpostprocessed_responsesÚlogprobsÚ ref_logprobsÚscoresÚsequence_lengthsÚvaluesÚunwrapped_modelÚquery_responsesÚlogitssÚqueryÚquery_responseÚresponserjÚlogprobÚ
ref_outputÚ
ref_logitsÚ ref_logprobÚpostprocessed_responseÚpostprocessed_query_responseÚsequence_lengthÚunwrapped_value_modelÚ
full_valueÚvalueÚscoreÚcontain_eos_tokenÚ
response_idxsÚ padding_maskÚsequence_lengths_p1Úpadding_mask_p1ÚlogrÚklÚnon_score_rewardÚrewardsÚ actual_startÚ
actual_endÚ
lastgaelamÚadvantages_reversedÚ
gen_lengthÚ
nextvaluesÚdeltaÚ
advantagesÚreturnsÚ
ppo_epoch_idxÚb_indsÚ
minibatch_idxÚmini_batch_startÚmini_batch_endÚmini_batch_indsÚgradient_accumulation_idxÚmicro_batch_startÚmicro_batch_endÚmicro_batch_indsÚ mb_advantageÚ mb_responsesÚmb_query_responsesÚ mb_logprobsÚ mb_returnÚ mb_valuesÚoutputÚ
vpred_tempÚ new_logprobsÚvpredÚ vpredclippedÚ
vf_losses1Ú
vf_losses2Ú vf_loss_maxÚvf_lossÚ vf_clipfracÚ
logprobs_diffÚratioÚ pg_lossesÚ
pg_losses2Ú pg_loss_maxÚpg_lossÚlossÚ pg_clipfracÚ prob_distÚentropyÚapproxklÚmean_klÚ mean_entropyÚmean_non_score_rewardÚ rlhf_rewardrÆrsrtÁsR 
û










 
ÿ
ûý 


ÿ
 
ÿ 

ÿ

ÿ





        $" 
&* }
 
 

ÿý ÿ