DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothPPOTrainer.cpython-311.pyc
307 lines
65 KiB
2025-08-13 23:50:20 +00:00
§
5$hãó”dZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZm Z m!Z!m"Z"m#Z#m$Z$m%Z%m&Z&m'Z'm(Z(m)Z)m Z m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1m2Z2m3Z3m4Z4m5Z5m6Z6m7Z7m8Z8m9Z9m:Z:m;Z;m<Z<m=Z=m>Z>m?Z?m@Z@mAZAmZmBZBmCZCmDZDmEZEmFZFmGZGmHZHmIZImJZJmKZKmZmLZLmMZMm
Z
m"Z"m'Z'm;Z;mDZDmZddlDZDddlTddlNmOZOmPZPdd lQmRZRddlZddlSZBdd
lTmCZCddlmZdd lUmVZVmWZXd d
d d
d
dœZYejZd d eY¬¦«d¦«Z[eOGdde¦«¦«Z\ Gdde'¦«Z]Gdde]¦«Z^dS)z8
2025.8.4
2025.8.5
4.55.1
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)GÚ AcceleratorÚBaseImageProcessorÚCallbackHandlerÚDEFAULT_CALLBACKSÚDEFAULT_PROGRESS_CALLBACKÚDataCollatorWithPaddingÚ
DataLoaderÚDatasetÚExportableStateÚFeatureExtractionMixinÚGenerationConfigÚINVALID_LOGPROBÚOnlineTrainerStaterÚ PPOConfigÚ
PPOTrainerÚPathÚ
PeftConfigÚ PeftModelÚPolicyAndValueWrapperÚPreTrainedTokenizerBaseÚPrinterCallbackÚProcessorMixinÚTrainerÚTrainerCallbackÚTrainerControlr Úbatch_generationÚ broadcastÚcontextmanagerÚcreate_reference_modelÚ defaultdictÚdisable_dropout_in_modelÚ empty_cacheÚ exact_divÚfirst_true_indicesÚforwardÚ
gather_objectÚgcÚgenerate_model_cardÚget_comet_experiment_urlÚget_peft_modelÚ#get_reporting_integration_callbacksÚ
get_rewardÚis_peft_availableÚis_rich_availableÚis_wandb_availableÚlog_table_to_comet_experimentÚ masked_meanÚ
masked_whitenÚmathÚnnÚnpÚ nullcontextÚosÚpdÚpeft_module_casting_to_bf16Úprepare_deepspeedÚprint_rich_tableÚselective_log_softmaxÚtextwrapÚtimeÚtorchÚtruncate_responseÚunwrap_model_for_generationrrr#r7rArI)Ú*)Ú dataclassÚfield)ÚVersion)r@)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionscó’tj| d|jd¦«dd¬¦«}tj| d¦«dd¬¦«}g}t ||¦«D]\}}| tj¦«}tj|d| d¦«¬¦«  d¦«}tj
|d¬¦«}||z
} |  | ¦«Œ’ tj |¦«}| |jd|jdf¦«}|S)Néÿÿÿÿér)ÚchunksÚdim)r\Úindex©r\é)
rIÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
Úlogitsr]Úchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logpss
ú]/workspace/Fine-tuning/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothPPOTrainer.pyÚchunked_selective_log_softmaxrv"s5õ”[ §¢°°F´LÀÔ4DÑ!EÔ!EÐPQÐYZÐ[€NÝ”[ §¢¨rÑ!2Ô!2¸QÀaÐH€MØÐå%(¨¸Ñ%GÔ%Gð #—¥u¤}Ñ Ýœ, |¸2À{×G\ÒG\Ð]_ÑG`ÔG`Ða×iÐjlÑmˆÝ œ?¨<¸Ø)Ð,<Ñ<ˆØ×" Ýœ,Ð':ÑØ-×5°v´|ÀA´ÈÌ ÐUVÌÐ6XÑØ ÐócóÒeZdZUdZedddi¬¦«Zeeed<edddi¬¦«Z ee
ed < d4ˆfd3„ Z ˆxZ S)5ÚUnslothPPOConfigaþ
Configuration class for the [`PPOTrainer`].

This class includes only the parameters that are specific to PPO training. For a full list of training arguments,
please refer to the [`~transformers.TrainingArguments`] and [`OnPolicyConfig`] documentation. Note that default
values in this class may differ from those in [`~transformers.TrainingArguments`].

Using [`~transformers.HfArgumentParser`] we can turn this class into
[argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
command line.

Parameters:
    exp_name (`str`, *optional*, defaults to `os.path.basename(__file__)[:-3]`):
        Name of this experiment.
    reward_model_path (`str`, *optional*, defaults to `"EleutherAI/pythia-160m"`):
        Path to the reward model.
    model_adapter_name (`str` or `None`, *optional*, defaults to `None`):
        Name of the train target PEFT adapter, when using LoRA with multiple adapters.
    ref_adapter_name (`str` or `None`, *optional*, defaults to `None`):
        Name of the reference PEFT adapter, when using LoRA with multiple adapters.
    num_ppo_epochs (`int`, *optional*, defaults to `4`):
        Number of epochs to train.
    whiten_rewards (`bool`, *optional*, defaults to `False`):
        Whether to whiten the rewards.
    kl_coef (`float`, *optional*, defaults to `0.05`):
        KL coefficient.
    kl_estimator (`Literal["k1", "k3"]`, *optional*, defaults to `"k1"`):
        Which estimator for KL-Divergence to use from [Approximating KL
        Divergence](http://joschu.net/blog/kl-approx.html). Defaults to "k1", a straightforward, unbiased
        estimator. Can be set to "k3", an unbiased estimator with lower variance which "appears to be a strictly
        better estimator". Cannot be set to "k2", as it is used for logging purposes.
    cliprange (`float`, *optional*, defaults to `0.2`):
        Clip range.
    vf_coef (`float`, *optional*, defaults to `0.1`):
        Value function coefficient.
    cliprange_value (`float`, *optional*, defaults to `0.2`):
        Clip range for the value function.
    gamma (`float`, *optional*, defaults to `1.0`):
        Discount factor.
    lam (`float`, *optional*, defaults to `0.95`):
        Lambda value for GAE.
    ds3_gather_for_generation (`bool`, *optional*, defaults to `True`):
        This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation,
        improving generation speed. However, disabling this option allows training models that exceed the VRAM
        capacity of a single GPU, albeit at the cost of slower generation.
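The `kl_estimator` parameter described above selects between two per-token estimators from the linked "Approximating KL Divergence" post. The following is a minimal sketch in plain Python, not the trainer's actual code: the function names `kl_k1` and `kl_k3` are hypothetical, and it assumes `logprob` and `ref_logprob` are the log-probabilities of the same sampled token under the policy and the reference model.

```python
import math


def kl_k1(logprob: float, ref_logprob: float) -> float:
    """k1 estimator: the negated log-ratio log(ref / policy).

    Unbiased, but individual samples can be negative.
    (Hypothetical helper name, for illustration only.)
    """
    log_ratio = ref_logprob - logprob
    return -log_ratio


def kl_k3(logprob: float, ref_logprob: float) -> float:
    """k3 estimator: (r - 1) - log(r), with r = ref / policy.

    Also unbiased; every sample is non-negative because
    exp(x) - 1 - x >= 0 for all real x.
    (Hypothetical helper name, for illustration only.)
    """
    log_ratio = ref_logprob - logprob
    return (math.exp(log_ratio) - 1.0) - log_ratio


if __name__ == "__main__":
    # One token sampled from the policy, scored by both models.
    policy_lp, ref_lp = -1.0, -2.0
    print("k1:", kl_k1(policy_lp, ref_lp))
    print("k3:", kl_k3(policy_lp, ref_lp))
```

Both functions coincide at zero when the two models agree on a token. They differ in how individual samples behave: k1 can go negative, while k3 cannot, which is the source of its lower variance.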
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrYz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksFÚnorZéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsr_éôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéé@é
é5çffffffæ?úEleutherAI/pythia-160mÚ
ppo_configçš™™™™™©?Úk1çš™™™™™É?çffffffî?c£ ó8|dkrtd|d¦«|dkrtd|d¦«||#dkr
|$dkrd}d }#|€!d
d lmt |¤¦«d zd ¦«}|‰d
krt d
¦«|‰dkrt d¦«t
¦«jd°id|d|d|d|d|d|d|d|d| “d|
d| d| d|
d|d|d|d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-|d.|d/| “d0|!“d1|"“d2|#“d3|$“d4|%“d5|&“d6|'“d7|(“d8|)“d9|*“d:|+“d;|,“d<|-“d=|.“d>|/“d?|0“d@|1“dA|2“dB|3“dC|4“dD|5“dE|6“dF|7“dG|8“dH|9“dI|:“dJ|;“dK|<“dL|=“dM|>“dN|?“dO|@“dP|A“dQ|B“dR|C“dS|D“dT|E“dU|F“dV|G“dW|H“dX|I“dY|J“dZ|K“d[|L“d\|M“d]|N“d^|O“d_|P“d`|Q“da|R“db|S“dc|T“dd|U“de|V“df|W“dg|X“dh|Y“di|Z“dj|[“dk|\“dl|]“dm|^“dn|_“do|`“dp|a“dq|b“dr|c“ds|d“dt|e“du|f“dv|g“dw|h“dx|i“dy|j“dz|k“d{|l“d||m“d}|n“d~|o“d|p“d€|q“d|r“d|s“dƒ|t“d„|u“d…|v“d†|w“d‡|x“dˆ|y“d‰|z“dŠ|{“d||“dŒ|}“d|~“dŽ|d|€“d|d‘|‚“d’|ƒ“d“|„“d”|…“d•|†“d–|‡“d—|ˆ“d˜|‰“d™|Š“dš|‹“d›|Œ“dœ|d|Ž“dž|dŸ|d |‘“d¡|’“d¢|““d£|”“d¤|•“d¥|–“d¦|—“d§|˜“d¨|™“d©|š“dª|›“d«|œ“d¬|d­|ž“d®|Ÿ“d¯| “|£¤Ž|¡|_|¢|_ dS)±NçH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!r_za` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rÚunsloth_training_checkpointsrr)Ú cpu_countr€zUUnsloth: Please set a positive non-zero temperature since your results will be wrong.ršzgUnsloth: Please set a positive non-zero temperature less than 10, since sampling will be quite erratic.Ú
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚdataset_num_procÚnum_mini_batchesÚtotal_episodesÚ local_rollout_forward_batch_sizeÚnum_sample_generationsÚresponse_lengthÚ
stop_tokenÚ
stop_token_idÚ temperatureÚmissing_eos_penaltyÚsft_model_pathÚ
world_sizeÚnum_total_batchesÚmicro_batch_sizeÚlocal_batch_sizeÚ
batch_sizeÚlocal_mini_batch_sizeÚmini_batch_sizeÚexp_nameÚreward_model_pathÚmodel_adapter_nameÚref_adapter_nameÚnum_ppo_epochsÚwhiten_rewardsÚkl_coefÚ kl_estimatorÚ cliprangeÚvf_coefÚcliprange_valueÚgammaÚlamÚds3_gather_for_generation©)
ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr¦ÚminÚ MathErrorÚsuperÚ__init__r}r~)¦Úselfr§r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrrrrrr r!r"r#r$r%r&r'r(r)r*r+r,r-r.r/r0r1r2r3r4r5r6r7r8r9r:r;r<r=r>r?r@rArBrCrDrErFr}r~Úkwargsr¦Ú __class__s¦ €rurNzUnslothPPOConfig.__init__msz ø€ðL ˜ Ð Õ'9ð;VÐ]jð;Vð;Vð;Vñ(Wô(Wð"WØ ˜ Ð ¥Mð3FÐUbð3Fð3Fð3Fñ%Gô%GðGØ Ð  -°7Ò":Ð":¸zÈSÒ?PÐ?PØ7ˆ ˆ Ð " 9 9¡;¤;¨q¡=°!Ñ Ø ˜!Ò Ð ÝÐ
˜
Ð
ÝðFñGôGð
Gð ŒÔð` Lð` Lð` LØ#˜ð` Là#7Ð#7ð` Lð ` Lðgð ` Lð
$˜ð ` Lð *˜
` Lð$8Ð#7ð` Lð+FÐ*Eð` Lð*DÐ)Cð` Lð(@Ð'?ð` Lð'>Ð&=ð` Lð+FÐ*Eð` Lð'>Ð&=ð` Lð$˜ð` Lð'>Ð&=ð` Lð *˜Mð!` Lð"(˜<ð#` Lð$$˜ð%` Lð&$˜ð'` Lð((˜<ð)` Lð**˜Mð+` Lð,/ð-` Lð."˜ ð/` Lð0!2Ð 1ð1` Lð2(˜<ð3` Lð4(˜<ð5` Lð6"˜ ð7` Lð8!2Ð 1ð9` Lð:/ð;` Lð<&˜+ð=` Lð>/ð?` Lð@"4Ð!3ðA` LðB*˜MðC` LðD&<Ð%;ðE` LðF*˜MðG` LðH$˜ðI` LðJ/ðK` LðL/ðM` LðN!2Ð 1ðO` LðP.˜oðQ` LðR7^Ð6]ðS` LðTgðU` LðVgðW` LðX,˜^ðY` LðZ4ð[` Lð\"˜ ð]` Lð^*˜Mð_` Lð` xða` Lðb4ðc` Lðd4ðe` Lðf,˜^ðg` Lðh&<Ð%;ði` Lðj,˜^ðk` Lðl,˜^ðm` Lðn4ðo` Lðp$˜ðq` Lðr&˜+ðs` Lðt*˜Mðu` Lðv!2Ð 1ðw` LðxEðy` Lðz$8Ð#7ð{` Lð|$˜ð}` Lð~&<Ð%;ð` Lð@*DÐ)CðA` LðB$˜ðC` LðD xðE` LðF(˜<ðG` LðH%:Ð$9ðI` LðJ&˜+ðK` LðL&<Ð%;ðM` LðN%:Ð$9ðO` LðP!2Ð 1ðQ` LðR/ðS` LðT4ðU` LðV#6Ð"5ðW` LðX&˜+ðY` LðZ2TÐ1Sð[` Lð\"4Ð!3ð]` Lð^"˜ ð_` Lð`&<Ð%;ða` LðbEðc` Lðd$˜ðe` Lðf"˜ ðg` Lðh.˜oði` Lðj"4Ð!3ðk` Lðl"˜ ðm` Lðn*DÐ)Cðo` Lðp!2Ð 1ðq` Lðr%:Ð$9ðs` Lðt%:Ð$9ðu` Lðv-JÐ,Iðw` Lðx#6Ð"5ðy` Lðz*DÐ)Cð{` Lð|&˜+ð}` Lð~&<Ð%;ð` Lð@(˜<ðA` LðB(˜<ðC` LðD"˜ ðE` LðF/ðG` LðH.˜oðI` LðJ(˜<ðK` LðL&<Ð%;ðM` LðN-JÐ,IðO` LðP*DÐ)CðQ` LðR&<Ð%;ðS` LðT(˜<ðU` LðV$8Ð#7ðW` LðX(@Ð'?ðY` LðZ!2Ð 1ð[` Lð\*˜Mð]` Lð^$8Ð#7ð_` Lð`/ða` Lðb&˜+ðc` Lðd"˜ ðe` Lðf&˜+ðg` Lðh*˜Mði` Lðj%:Ð$9ðk` Lðl"4Ð!3ðm` Lðn)BÐ(Aðo` Lðp-JÐ,Iðq` Lðr#6Ð"5ðs` Lðt$8Ð#7ðu` Lðv"4Ð!3ðw` Lðx*˜Mðy` Lðz/ð{` Lð|#6Ð"5ð}` Lð~&<Ð%;ð` Lð@-JÐ,IðA` LðB/ðC` LðD/ðE` LðF,˜^ðG` LðH0PÐ/OðI` LðJ&<Ð%;ðK` LðL.˜oðM` LðN$˜ðO` LðP*˜MðQ` LðR&˜+ðS` LðT#6Ð"5ðU` LðV,˜^ðW` LðX$˜ðY` LðZ!2Ð 1ð[` Lð\/ð]` Lð^/ð_` Lð`$˜ða` Lðb%:Ð$9ðc` Lðd.˜oðe` Lðf xðg` Lðh!2Ð 1ði` Lðj"4Ð!3ðk` Lðl/ðm` Lðn,˜^ðo` Lðp,˜^ðq` Lðrgðs` Lðt(˜<ðu` Lðv"˜ ðw` Lðxgðy` Lðz.˜oð{` Lð|Eð}` Lð~` Lð@)BÐ(AÀFðA` Lð` Lð` LðB%9ˆÔ!Ø"4ˆÔÐÐrw)¢NNFFFrFrZrZNNr€r€rrrr„r…r†r‡rˆrYr‰rrTNrFr_FrNTFFFFFFrrFFFFrrFFNrYNNFrFNrNrYNNTNFNNFrrNNNNr“r”NFFr•NNNNTFTFFNNrNNFNFNFTrNNNrTFNr—r˜FNNFFNNFFFNFTNr_Nr™rNNrœNrNNNNNNNržrNNrZFrŸr r‡TNrY)
Ú__name__Ú
__module__Ú __qualname__Ú__doc__rNr}rrÚ__annotations__r~ÚintrNÚ
__classcell__©rQs@ruryry3ø€ð/ð/ð`+0¨%ØØÐ+ñ+ô+И( 3œ-ððñð*/¨ØØÐ*ñ*ô*И #œððñð ØØØØØ$Ø&'Ø%&Ø#'Ø"&Ø&'Ø"#ØØ"%ØØØØØØØØØØØØØØØ!&ØØØØØØ27ØØØØØØØØØØØ!'ØØØØØØØØØ!"Ø%)ØØØØ $ØØ!&Ø $Ø Ø ØØØØ-1ØØ!$ØØØØØØ%)Ø Ø $Ø $Ø(-Ø"Ø%*ØØ!%ØØØØØØ!&Ø(,Ø%*Ø!%ØØ#Ø#'Ø ØØ ØØØØØ $Ø!Ø$)Ø(-ØØ Ø"Ø!&Ø(,ØØØØ+-Ø!#ØØØØØØ ØØØØ $ØØØØØØØØØØØØØ$(ØðGVVVVVVVVVV5rwrycóeZdZddgZ d#dedeeeee e
fde j dee j d e j d
e
d e j d eed
eee
eee
ffdeejjejjjfdeeededddfdZdefdZdefdZed¦«Zd$deedefˆfd
Z dZ!d%defdZ"ˆfdZ# d&deed eed!eeeedffd"„Z$ˆxZ%S)'Ú_UnslothPPOTrainerÚtrlÚppoN©NNÚargsÚprocessing_classÚmodelÚ ref_modelÚ reward_modelÚ
train_datasetÚ value_modelÚ
data_collatorÚ eval_datasetÚ
optimizersÚ callbacksÚ peft_configrÚreturnc
óÂ||urtd¦«||_||_||_|t |j¦«}|jr|jrtd¦«|jrA|jdkr|jx|jj_|_n5td|jd¦«|jx|jj_|_|jj dvrtd¦«t¦«s| td¦«t¦«r…| ƒt|jt¦«r|j ¦«|_t|j| ¦«|_|jr*t#|jd d
¦«rt%|j¦«t¦«ot|jt¦«|_|j|_|j|_|r||_n(|jrd|_nt/|j¦«|_||_||_t5|¦«|_||_||_| |_|
\|_|_ d|_!|j"€!tG|j$|jz¦«|_"tK|j&¬ ¦«}
|
|_'|
j(|_)|j*|j&z|_+tG|j*|j)z¦«|_,tG|j+|j)z¦«|_-t]|j-|j/d ¦«|_0t]|j+|j/d
¦«|_1|j2r|j1dksJd|j1d¦«tgj4|j"|j-z ¦«|_5tmj7tGtqj8¦«¦«|
j9¬¦«}tu|d¦« ;¦«}|j<d|j=d||_>|j=|
j?dzz|_@|jAdkr"t…d|j5|jAz¦«|_C|j+|_D|j|j|j|jfD]}|t|¦«Œt|j|j¦«|_G|jjH|jG_H| I|j5¬¦«t”t—|jjL¦«z}| |n|| z|_Mt|jM|jG|j|j|j ¦«|_O| P|jjQrn¦«¦«|_Ut­| W¦«| X¦«d|jOjM|jUgzD¦«¬¦«|_Yd|_Zd|_[t#|j'jYdd¦«du|_\t#|j'jYdd¦«du|_]d|_^|jj_r| `¦«|jjar tÅjc|jjdd¬¦«|jGd¦«r|jG f|jg¦«|j|jDd|jd¬¦«|_itmjj|j=¦«|
 k|jG|j|ji¦«\|_G|_|_itmjj|j@¦«|j|jl|jd¬¦«|_m|
 k|jm¦«|_m|j\rwtÝ|j|j*|jo|j¦«|_|j|jstd ¦«dS|j|j*|jo|j¦«|_dS|j|jstd ¦«n)|j p|j'j9¦«|_|j p|j'j9¦«|_dS)!Nzœ`model` and `ref_model` cannot be the same object. If you want `ref_model` to be the same as `model`, you must make a copy of it, or `None` if you use peft.z5You cannot set both `stop_token` and `stop_token_id`.ÚeoszUnknown `stop_token` z9. Allowed values are: `'eos'` and `None` (no stop token).>r Úk3zákl_estimator must be either 'k1' (straightforward, unbiased) or 'k3' (lower variance, unbiased, appears to be a strictly better estimator). See [Approximating KL Divergence](http://joschu.net/blog/kl-approx.html) for details.zvPEFT is not installed and you passed a `peft_config` in the trainer's kwargs, please install it to use the PEFT modelsÚis_loaded_in_4bitF)z5`batch_size` must be a multiple of `num_mini_batches`z;`local_batch_size` must be a multiple of `num_mini_batches`ézPer-rank minibatch size z is insufficient for whitening©ÚdevicerÚ__i£†r_)Únum_training_stepscó<g|]
isinstancer)Ú.0Úcbs ruú
<listcomp>z/_UnslothPPOTrainer.__init__.<locals>.<listcomp>\s:ð ð ð ØÕQ[Ð\^Õ`oÑQpÔQpð Øð ð ð rw)Úis_local_process_zeroÚis_world_process_zeroÚstateful_callbacksÚdeepspeed_pluginÚ fsdp_pluginT)Úexist_okÚadd_model_tags)r6ÚshuffleÚ
collate_fnÚ drop_last)r6rz1No reference model and model is not a Peft model.)qÚ
ValueErrorr_r`Ú policy_modelrr-r.Ú eos_token_idÚgeneration_configr@r7Ú ImportErrorrvrÚmerge_and_unloadr4r×ÚgetattrrCÚ
is_peft_modelr;r<rbr)rcrdÚlenÚtrain_dataset_lenrerfrgÚ optimizerÚ lr_schedulerÚoptimizer_cls_and_kwargsr)rWr
Ú acceleratorÚ
num_processesr2r5r4r6r-r(r8r7r>r=Úceilr3rIÚtensorrHrrr'Úitemr9Ú
process_indexÚ
local_seedr+ÚmaxÚsample_generations_freqÚlocal_dataloader_batch_sizer+rraÚconfigÚcreate_optimizer_and_schedulerrr5rirÚcallback_handlerÚ add_callbackrér!rr%Úcontrolrrzr{ÚstateÚ current_flosÚhp_search_backendÚis_deepspeed_enabledÚis_fsdp_enabledrrÚ init_hf_repoÚ should_saverAÚmakedirsr§Úhasattrr€Ú
_tag_namesrÚ
dataloaderÚ manual_seedÚpreparer¯Úeval_dataloaderrDrd)rOr_r`rarbrcrdrerfrgrhrirjrÚ time_tensorÚtime_intÚmoduleÚdefault_callbackss rurNz_UnslothPPOTrainer.__init__Ésð$ ˜Ð Ð ÝðZñôð
ð
ˆŒ Ø 0ˆÔØÔð Ð Ý3°DÔ4IÑJˆMð Œ?ð
g˜
gÝÐ
Œ_ð gØŒ 'ØXhÔXuÐuÔ@À4ÔCUÐCUå Øv¨D¬OÐôððUYÔTfÐ fˆ Ô <¸tÔ?Qð Œ9Ô Ð ðdñôð
õ ? {Ð'>ÝðIñôð
õÑ
Ô
ð ? [Ð%<å˜+­
IØ$(Ô$5×$FÒ$FÑ$HÔ$HÔ!/¨tÔ/@À+Ñ NÔ Nˆ ØŒyð
?W TÔ%6Ð8KÈUÑ
+¨DÔ,=Ñ]µZÀÔ@QÕS\Ñ5]Ô5]ˆÔØ"&Ô"9ˆÔØ $Ô 5ˆÔà ð GØ&ˆDŒNˆ
Ô
ð GØ!ˆDŒNˆ3°DÔ4EÑFˆDŒNàÔØÔÝ!$ ]Ñ!3Ô!3ˆÔØÔØÔØÔØ,6ÑŒ˜Ô)Ø(,ˆÔ
Ô Ð &Ý"% dÔ&;¸dÔ>TÑ&TÑ"UÔ"Uˆ Ý!¸dÔ>^Ð_ˆ ØÔØŒØ $Ô @À4ÔCcÑ cˆÔÝ # DÔ$DÀtÄÑ$VÑ WÔ WˆÔݘdÔ3°d´oÑŒÝ ŒO˜TÔ2Ð4kñ
ô
ˆÔõ&/Ø Ô ! 4Ô#8Ð:wñ&
ô&
ˆÔ Ô ð ØÔÒe¨4Ô+EÐ
"&¤Ø Ô  $¤/Ñ "
ô"
ˆÔõ”l¥3¥t¤y¡{¤{Ñ#3Ô#3¸KÔ<NÐOˆ ݘ[¨!ÑØœ=ÐC¨D¬IÐÐŒ
Øœ) kÔ&?À&Ñ&HÑHˆŒØ Ô Ò *Ý+.¨q°$Ô2HÈDÔLgÑ2gÑ+hÔ+hˆ (Ø+/Ô+@ˆÔ
Ô(¨$¬.¸$Ô:JÈDÔL]Ð 1ˆÐÑ0øÝ*¨4Ô+<¸dÔ>NÑOˆŒ
Ø Ô4ˆŒ
ÔØ ×
ô
ð
õ.Õ0SÐTXÔT]ÔTgÑ0hÔ0hÑØ.7Ð.?Ð*ÐEVÐYbÑEbˆŒÝ /Ø ŒN˜DœJ¨Ô(=¸t¼~ÈtÔO`ñ!
ô!
ˆÔð
×Ò¨T¬YÔ-CÐb/˜/ÕIbÑ'ˆŒ Ý'Ø"&×"<Ò"<Ñ">Ô">Ø"&×"<Ò"<Ñ">Ô">ð ð ØÄ ¸~Ñ ñ ô ð
ñ
ô
ˆŒ
ðˆÔØ!%ˆÔÝ$+¨DÔ,<Ô,BÐDVÐX\Ñ$]Ô$]ÐeiÐ$iˆÔ& tÔ'7Ô'=¸}ÈdÑSÐ[_ÐÔà ˆÔØ Œ9Ô ð Ø × Ò Ñ Ô Ð Ø Œ9Ô ð ŒK˜œ Ô,°tÐ  4”:Ð  ŒJ× % d¤oÑ
 Ô ØÔØÔð 
ñ
ô
ˆŒõ Ô˜$œ)Ñ$Ø6A×6IÒ6IÈ$Ì*ÐVZÔVdÐfjÔfuÑ6vÔ6vÑ3ˆŒ
D”N D¤OÝ
Ô˜$œ/Ñ Ô ØÔÔð 
ñ
ô
ˆÔð 2°4Ô3GÑÔà Ô  NÝ 1ØÔ! 4Ô#CÀTÄYÐPTÔPYñ!ô!ˆDÔ ðŒ~ÐÔZÝ$Ð%XÑZðZõ"3Ø”N DÔ$DÀdÄiÐQUÔQZñ"ô"ðŒ~ÐÔZÝ$Ð%XÑZð"&¤×!2Ò!2°4Ô3CÔ3JÑ!KÔ!KØ $Ô 1× 4Ò 4°TÔ5EÔ5LÑ MÔ Mˆ Ð Ð rwcó|jS©©rOs ruÚget_train_dataloaderz'_UnslothPPOTrainer.get_train_dataloaders
ØŒÐrwcó|jS)r­s ruÚget_eval_dataloaderz&_UnslothPPOTrainer.get_eval_dataloaderžs ØÔ#rwc#ó˜K|jr=|js6|j |jj¦« ¦«n
t¦«5|jr$|jj |j¦«dV|jr&|jj |j pd¦«ddd¦«dS#1swxYwYdS)zWContext manager for handling null reference model (that is, peft adapter manipulation).Nr{)
rr<rÚ unwrap_modelraÚpolicyÚdisable_adapterr@Ú set_adapterr;s ruÚnull_ref_contextz#_UnslothPPOTrainer.null_ref_context¡s'èèð
Ô
Ø*.Ô*?ð
ˆDÔ × )¨$¬*Ô*;Ñ <× ð Tð Tð
Ô
EØ
Ô!×-¨dÔ.CÑ ˆEˆEˆEØÔ
TØ
Ô-¨dÔ.EÐ.RÈÑ Tð Tð Tñ Tô Tð Tð Tð Tð Tð Tð Tð Tøøøð Tð Tð Tð Tð Tð TsÁAB?Â?CÃCFr§Ú_internal_callcóÞ|j}|jj|_|jr|j}|j|_t ¦« ||¦«||_|jr ||_dSdS)rarMÚ
save_model)rOr¿Ú backup_modelÚbackup_deepspeedrQs €ruz_UnslothPPOTrainer.save_model¯ssø€Ø”zˆ Ø”ZÔŒ
à Ô #œ~Ð Ø!œZˆDŒNå
Œ×Ò˜: ~ÑŒ
à Ô -ˆDŒNˆNˆ .rwc ó€*‡r—|j}|j}|j}|j}|j}|j}|j}|jŠr|j}ˆrfd} t| ¦«¦«}
t|j |j dzddd¬¦«} | 
d¦«tj¦«} |j|j|jf}
t%j|
|¬¦«}t%j|
|¬¦«}t%j|
|¬¦«}t%j|
|¬¦«}t%j|
|¬¦«}t%j|
|¬¦«}t%j|
|¬¦«}| ¦«d |j_d |j_|j|j_|j|jz |j_|jM|jd
kr1t=j|jj|jz¦«|j_n|j|j_|j M|j d
kr1t=j|jj|j z¦«|j_ n|j |j_ |j!M|j!d
kr1t=j|jj|j!z¦«|j_!n|j!|j_!|j" #||j|j$¦«|_$|j%r|j|_&|j|_'tQd
|jd
z¦«D]³}|jxjd
|j)zz
c_tU|
¦«}t%j+¦«5|d  ,|¦«}|j-d
}g}g}g}g}g}g}g}t]|j|j|jj/¬ ¦«5} ta| j1||j2|j3| ¦«\}!}"ddd¦«n #1swxYwYtQd |j-d |j2¦«D]f}#||#|#|j2z}$|!|#|#|j2z}%|%dd|df}&|"|#|#|j2z}'ti|'|&¦«}(~'tk¦«|€H| 6¦«5to|j1|%|j3¦«})ddd¦«n #1swxYwYnto||%|j3¦«})|)j8dd|d
z
d
f}*|*|j dzz}*ti|*|&¦«}+~)~*tk¦«|&},|j9tu|j9|j3|&¦«},t%j;|$|,fd
¦«}-ty|,|j3k¦«d
z
}.| =|¦«j>}/t|/|%|j3|¦«\}0}1}1|0dd|d
z
d
f @d
¦«}2t||-|j3|¦«\}1}3}1| A|&¦«| A|,¦«| A|(¦«| A|+¦«| A|.¦«| A|3¦«| A|2¦«Œht%j;|d ¦«}t%j;|d ¦«}t%j;|d ¦«}t%j;|d ¦«}t%j;|d ¦«}t%j;|d ¦«}t%j;|d ¦«}~(~+~0~2~3~ tk¦«t…jC¦«t%jD||jjEkd
¬¦«}4|jjF||4xx|jjFzcc<t%jG|j-d
|j¬¦« H|j-d d
¦«}5|5| Id
¦«k}6t%jJ||6t¦«}t%jJ||6t¦«}|d
z}7|5|7 Id
¦«k}8t%jJ||8d ¦«}||z
}9|jLdkr|9 n|9 M¦«d
z
|9z
}:|jN |:z};|; O¦«}<t%jG|< Pd ¦«|<j¬¦«}=t%jQ|7|< Pd
¦«k|7|¦«}>|<|=|>gxx|z
cc<|jRr)t§|<|8d¬¦«}<t%jJ|<|8d ¦«}<d }?g}@|j-d
}At©tQ|A¦«¦«D]j}B|B|Ad
z
kr|dd|Bd
zfnd}C|<dd|Bf|jU|Czz|dd|Bfz
}D|D|jU|jVz|?zz}?|@ A|?¦«Œkt%jW|@ddd
d
¬¦«}E|E|z}Ft§|E|6¦«}Et%jJ|E|6d ¦«}Etk¦«ddd¦«n #1swxYwYtQ|j¦«D]c}Gt°jY Z|j[¦«}Hd }ItQd |j[|j\¦«D]}J|J|j\z}K|H|J|K…}Ld }MtQd |j\|j]¦«D]¹}N| ^|¦«5|N|j]z}O|L|N|O…}P|E|P}Q||P}R|!|P}S||P}T|F|P}U||P}Vto||S|j3¦«\}W}X|Wj8dd|d
z
d
f}'|'|j dzz}'ti|'|R¦«}Yt%jJ|Y|6|Pt¦«}Y|Xdd|d
z
d
f @d
¦«}Zt%jJ|Z|8|Pd ¦«}Zt%j_|Z|V|j`z
|V|j`z¦«}[t%ja|Z|Uz
¦«}\t%ja|[|Uz
¦«}]t%jb|\|]¦«}^d|^|8|P¦«z}_tÇ|]|\k d¦«|8|P¦«}`|Y|Tz
}at%jM|a¦«}b|Q |bz}c|Q t%j_|bd|jez
d|jez¦«z}dt%jb|c|d¦«}etÇ|e|6|P¦«}f|f|jf|_zz}g| g|g¦«| h¦«| i¦«t%j+¦«5|d|ck d¦«|6|P¦«}ht$jjjk l|'d
¬¦«}it%jm|'d
¬¦«t%jn|i|'zd
¬¦«z
}jd|adz o¦«z}k|k||G|I|Mf<|h||G|I|Mf<|f||G|I|Mf<|_||G|I|Mf<|`||G|I|Mf<|j o¦«||G|I|Mf<|b o¦«||G|I|Mf<ddd¦«n #1swxYwYddd¦«n #1swxYwY|Md
z
}MŒ»|Id
z
}I~W~X~'~Y~Z~[~\~]~_~`~a~b~c~d~e~f~g~h~i~j~k~U~Q~V~R~S~Ttk¦«ŒŒet%j+¦«5|: nd
¦« o¦«}l|  nd
¦« o¦«}m|; nd
¦« o¦«}n|n| o¦«z}otá|jjtj¦«| z
z ¦«}pi}q|p|qd<|j q|l¦« o¦« r¦«|qd<|j q|m¦« o¦« r¦«|qd<|j q|n¦« o¦« r¦«|qd<|j q|o¦« o¦« r¦«|qd<|j q| o¦«¦« o¦« r¦«|qd<|j q|¦« o¦« r¦«|qd<|j q|¦« o¦« r¦«|qd<|j q|¦« o¦« r¦«|qd<|j q|¦« o¦« r¦«|qd<|j q|¦« o¦« r¦«|qd<|j q|¦« o¦« r¦«|qd <|j q|¦« o¦« r¦«|qd!<|j q|¦« s¦« r¦«|qd"<||jEk n¦« r¦«|qd#<|jt u¦«d |qd$<|jj|qd%<|jj|jz |j_v|jxjd
z
c_| w|q¦«ddd¦«n #1swxYwY|jt h¦«|j" x||j|j$¦«|_$|j$jyrG| z|d¬&¦«|j" {|j|j|j$¦«|_$~:~l~m~n~~q~;tk¦«t…jC¦«|j|d kr5|d
z
|j}zd kr$| ~d¬'¦«tk¦«~!~~~~~~~4~7~5~6~8~<~=~>~E~Ftk¦«Œµ|j" ||j|j$¦«|_$|j$jyrJ| z|dd¬(¦«|j" {|j|j|j$¦«|_$dSdS))Nc3óK Ed{VŒ r³rGr´s€ruÚrepeat_generatorz2_UnslothPPOTrainer.train.<locals>.repeat_generatorÉs)øèèð

&rwr“r‡Úmax_new_tokensr/Útop_kÚtop_pÚ do_samplez===training policy===rqrr_Ú input_ids©Úgather_deepspeed3_paramsrYr^r F)ÚmaskÚ
shift_mean)Úaxisgà?r€Úepsz objective/klzobjective/entropyzobjective/non_score_rewardzobjective/rlhf_rewardzobjective/scoreszpolicy/approxkl_avgzpolicy/clipfrac_avgzloss/policy_avgzloss/value_avgzval/clipfrac_avgzpolicy/entropy_avgz val/ratioz
val/ratio_varzval/num_eos_tokensÚlrÚepisode)Útrial)Úsampling)Úmetrics)€r_rrarbrcr`rrÚiterrr,r/ÚprintrHr=r(rIÚzerosÚtrainr Ú global_steprÔr3r)rr=r“rÚon_train_beginrŸÚ
model_wrappedÚranger6ÚnextÚno_gradrdrbrKrFr&r*Ú pad_token_idrFr,r/rlr.rJÚcatr.rer6rhrjr1ÚcollectÚanyr†r0ÚarangeÚrepeatrgÚ masked_fillrr@Úexpr?ÚcloneÚsizeÚwherer>r<ÚreversedrDrEÚstackr?ÚrandomÚ permutationr5r7Ú