DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothRLOOTrainer.cpython-311.pyc
259 lines
56 KiB
2025-08-13 23:50:20 +00:00
§
5$hiÜãóddZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZm Z m!Z!m"Z"m#Z#m$Z$m%Z%m&Z&m Z m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1m2Z2m3Z3m4Z4m5Z5m6Z6m7Z7m8Z8mZm9Z9m:Z:m;Z;m<Z<m=Z=m>Z>m?Z?m@Z@mZmAZAmBZBm
Z
m$Z$m:Z:mZddl:Z:ddlTddlCmDZDmEZEdd lFmGZGddlZddlHZ9dd
lImJZJddlmZdd lKmLZLmMZNd d
d d
d
dœZOejPd d eO¬¦«d¦«ZQeDGdde"¦«¦«ZR Gdde$¦«ZSGddeS¦«ZTdS)z8
2025.8.4
2025.8.5
4.55.1
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable);Ú AcceleratorÚBaseImageProcessorr ÚCallbackHandlerÚDEFAULT_CALLBACKSÚDEFAULT_PROGRESS_CALLBACKÚDataCollatorWithPaddingÚ
DataLoaderÚDatasetÚExportableStateÚFeatureExtractionMixinÚGenerationConfigÚINVALID_LOGPROBÚOnlineTrainerStaterÚPathÚPreTrainedTokenizerBaseÚPrinterCallbackÚProcessorMixinÚ
RLOOConfigÚ RLOOTrainerÚTrainerÚTrainerCallbackÚTrainerControlr Úbatch_generationÚ broadcastÚ defaultdictÚdisable_dropout_in_modelÚ empty_cacheÚ exact_divÚfirst_true_indicesÚforwardÚ
gather_objectÚgcÚgenerate_model_cardÚget_comet_experiment_urlÚ#get_reporting_integration_callbacksÚ
get_rewardÚis_rich_availableÚis_wandb_availableÚlog_table_to_comet_experimentÚmathÚnnÚnpÚosÚpdÚprepare_deepspeedÚprint_rich_tableÚselective_log_softmaxÚtextwrapÚtimeÚtorchÚtruncate_responseÚunwrap_model_for_generationrr r7r>)Ú*)Ú dataclassÚfield)ÚVersion)Ú nullcontext)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionscó’tj| d|jd¦«dd¬¦«}tj| d¦«dd¬¦«}g}t ||¦«D]\}}| tj¦«}tj|d| d¦«¬¦«  d¦«}tj
|d¬¦«}||z
} |  | ¦«Œ’ tj |¦«}| |jd|jdf¦«}|S)Néÿÿÿÿér)ÚchunksÚdim)rRÚindex©rRé)
r>ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
ÚlogitsrSÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logpss
ú^/workspace/Fine-tuning/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothRLOOTrainer.pyÚchunked_selective_log_softmaxrl"s5õ”[ §¢°°F´LÀÔ4DÑ!EÔ!EÐPQÐYZÐ[€NÝ”[ §¢¨rÑ!2Ô!2¸QÀaÐH€MØÐå%(¨¸Ñ%GÔ%Gð #—¥u¤}Ñ Ýœ, |¸2À{×G\ÒG\Ð]_ÑG`ÔG`Ða×iÐjlÑmˆÝ œ?¨<¸Ø)Ð,<Ñ<ˆØ×" Ýœ,Ð':ÑØ-×5°v´|ÀA´ÈÌ ÐUVÌÐ6XÑØ ÐócóÎeZdZUdZedddi¬¦«Zeeed<edddi¬¦«Z ee
ed < d3ˆfd2„ Z ˆxZ S)4ÚUnslothRLOOConfigaÆ
    Configuration class for the [`RLOOTrainer`].

    This class includes only the parameters that are specific to RLOO training. For a full list of training arguments,
    please refer to the [`~transformers.TrainingArguments`] and [`OnPolicyConfig`] documentation. Note that default
    values in this class may differ from those in [`~transformers.TrainingArguments`].

    Using [`~transformers.HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        exp_name (`str`, *optional*, defaults to `os.path.basename(__file__)[: -len(".py")]`):
            Name of this experiment.
        reward_model_path (`str`, *optional*, defaults to `"EleutherAI/pythia-160m"`):
            Path to the reward model.
        num_ppo_epochs (`int`, *optional*, defaults to `4`):
            Number of epochs to train.
        whiten_rewards (`bool`, *optional*, defaults to `False`):
            Whether to whiten the rewards.
        kl_coef (`float`, *optional*, defaults to `0.05`):
            KL coefficient.
        cliprange (`float`, *optional*, defaults to `0.2`):
            Clip range.
        rloo_k (`int`, *optional*, defaults to `2`):
            REINFORCE Leave-One-Out (RLOO) number of online samples per prompt.
        normalize_reward (`bool`, *optional*, defaults to `False`):
            Whether to normalize rewards.
        reward_clip_range (`float`, *optional*, defaults to `10.0`):
            Clip range for rewards.
        normalize_advantage (`bool`, *optional*, defaults to `False`):
            Whether to normalize advantages.
        token_level_kl (`bool`, *optional*, defaults to `True`):
            Whether to use token-level KL penalty or sequence-level KL penalty.
        ds3_gather_for_generation (`bool`, *optional*, defaults to `True`):
            This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation,
            improving generation speed. However, disabling this option allows training models that exceed the VRAM
            capacity of a single GPU, albeit at the cost of slower generation.
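
The `rloo_k` parameter above sets how many completions are sampled per prompt, and a leave-one-out baseline needs at least two of them. The following is a minimal pure-Python sketch of that baseline, not the trainer's own code: the function name `leave_one_out_advantages` and the `rewards[sample][prompt]` layout are assumptions made for illustration. Each sample's advantage is its reward minus the mean reward of the other `rloo_k - 1` samples drawn for the same prompt.

```python
from typing import List


def leave_one_out_advantages(rewards: List[List[float]]) -> List[List[float]]:
    """Hypothetical helper: rewards[i][j] is the reward of sample i for prompt j.

    The number of rows plays the role of `rloo_k`.
    """
    k = len(rewards)
    if k < 2:
        # With a single sample there are no "other" samples to average over.
        raise ValueError("rloo_k must be at least 2 for a leave-one-out baseline")
    n_prompts = len(rewards[0])
    # Per-prompt sum of rewards over all k samples.
    totals = [sum(rewards[i][j] for i in range(k)) for j in range(n_prompts)]
    # Baseline for sample i is the mean of the remaining k - 1 rewards.
    return [
        [rewards[i][j] - (totals[j] - rewards[i][j]) / (k - 1) for j in range(n_prompts)]
        for i in range(k)
    ]
```

With `rloo_k = 2` and rewards `[[1.0, 0.0], [3.0, 4.0]]`, each sample is compared directly against its sibling, so the advantages for a given prompt are equal and opposite. For any `rloo_k`, the advantages for one prompt sum to zero.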
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrOz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksFÚnorPéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrUéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéé@é
é5çffffffæ?úEleutherAI/pythia-160mÚ rloo_configçš™™™™™©?çš™™™™™É?ç$@c¡ ó,|dkrtd|d¦«|dkrtd|d¦«||#dkr
|$dkrd}d }#|€!d
d lmt |¢¦«d zd ¦«}|‰d
krt d
¦«|‰dkrt d¦«t
¦«jd®id|d|d|d|d|d|d|d|d| “d|
d| d| d|
d|d|d|d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-|d.|d/| “d0|!“d1|"“d2|#“d3|$“d4|%“d5|&“d6|'“d7|(“d8|)“d9|*“d:|+“d;|,“d<|-“d=|.“d>|/“d?|0“d@|1“dA|2“dB|3“dC|4“dD|5“dE|6“dF|7“dG|8“dH|9“dI|:“dJ|;“dK|<“dL|=“dM|>“dN|?“dO|@“dP|A“dQ|B“dR|C“dS|D“dT|E“dU|F“dV|G“dW|H“dX|I“dY|J“dZ|K“d[|L“d\|M“d]|N“d^|O“d_|P“d`|Q“da|R“db|S“dc|T“dd|U“de|V“df|W“dg|X“dh|Y“di|Z“dj|[“dk|\“dl|]“dm|^“dn|_“do|`“dp|a“dq|b“dr|c“ds|d“dt|e“du|f“dv|g“dw|h“dx|i“dy|j“dz|k“d{|l“d||m“d}|n“d~|o“d|p“d€|q“d|r“d|s“dƒ|t“d„|u“d…|v“d†|w“d‡|x“dˆ|y“d‰|z“dŠ|{“d||“dŒ|}“d|~“dŽ|d|€“d|d‘|‚“d’|ƒ“d“|„“d”|…“d•|†“d–|‡“d—|ˆ“d˜|‰“d™|Š“dš|‹“d›|Œ“dœ|d|Ž“dž|dŸ|d |‘“d¡|’“d¢|““d£|”“d¤|•“d¥|–“d¦|—“d§|˜“d¨|™“d©|š“dª|›“d«|œ“d¬|d­|ž“|¡¤Ž|Ÿ|_| |_ dS)¯NçH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rUza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rƒr„Úunsloth_training_checkpointsrur)Ú cpu_countrvzUUnsloth: Please set a positive non-zero temperature since your results will be wrong.rzgUnsloth: Please set a positive non-zero temperature less than 10, since sampling will be quite erratic.Ú
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚdataset_num_procÚnum_mini_batchesÚtotal_episodesÚ local_rollout_forward_batch_sizeÚnum_sample_generationsÚresponse_lengthÚ
stop_tokenÚ
stop_token_idÚ temperatureÚmissing_eos_penaltyÚsft_model_pathÚ
world_sizeÚnum_total_batchesÚmicro_batch_sizeÚlocal_batch_sizeÚ
batch_sizeÚlocal_mini_batch_sizeÚmini_batch_sizeÚexp_nameÚreward_model_pathÚnum_ppo_epochsÚwhiten_rewardsÚkl_coefÚ cliprangeÚrloo_kÚnormalize_rewardÚreward_clip_rangeÚnormalize_advantageÚtoken_level_klÚds3_gather_for_generation©)
ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingrÚminÚ MathErrorÚsuperÚ__init__rsrt)¤Úselfrœrr r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrrrrrr r!r"r#r$r%r&r'r(r)r*r+r,r-r.r/r0r1r2r3r4r5r6r7r8r9rsrtÚkwargsrÚ __class__s¤ €rkrAzUnslothRLOOConfig.__init__fs[ ø€ðH ˜ Ð Õ'9ð;VÐ]jð;Vð;Vð;Vñ(Wô(Wð"WØ ˜ Ð ¥Mð3FÐUbð3Fð3Fð3Fñ%Gô%GðGØ Ð  -°7Ò":Ð":¸zÈSÒ?PÐ?PØ7ˆ ˆ Ð " 9 9¡;¤;¨q¡=°!Ñ Ø ˜!Ò Ð ÝÐ
˜
Ð
ÝðFñGôGð
Gð ŒÔð^ Lð^ Lð^ LØ#˜ð^ Là#7Ð#7ð^ Lð ^ Lðgð ^ Lð
$˜ð ^ Lð *˜
^ Lð$8Ð#7ð^ Lð+FÐ*Eð^ Lð*DÐ)Cð^ Lð(@Ð'?ð^ Lð'>Ð&=ð^ Lð+FÐ*Eð^ Lð'>Ð&=ð^ Lð$˜ð^ Lð'>Ð&=ð^ Lð *˜Mð!^ Lð"(˜<ð#^ Lð$$˜ð%^ Lð&$˜ð'^ Lð((˜<ð)^ Lð**˜Mð+^ Lð,/ð-^ Lð."˜ ð/^ Lð0!2Ð 1ð1^ Lð2(˜<ð3^ Lð4(˜<ð5^ Lð6"˜ ð7^ Lð8!2Ð 1ð9^ Lð:/ð;^ Lð<&˜+ð=^ Lð>/ð?^ Lð@"4Ð!3ðA^ LðB*˜MðC^ LðD&<Ð%;ðE^ LðF*˜MðG^ LðH$˜ðI^ LðJ/ðK^ LðL/ðM^ LðN!2Ð 1ðO^ LðP.˜oðQ^ LðR7^Ð6]ðS^ LðTgðU^ LðVgðW^ LðX,˜^ðY^ LðZ4ð[^ Lð\"˜ ð]^ Lð^*˜Mð_^ Lð` xða^ Lðb4ðc^ Lðd4ðe^ Lðf,˜^ðg^ Lðh&<Ð%;ði^ Lðj,˜^ðk^ Lðl,˜^ðm^ Lðn4ðo^ Lðp$˜ðq^ Lðr&˜+ðs^ Lðt*˜Mðu^ Lðv!2Ð 1ðw^ LðxEðy^ Lðz$8Ð#7ð{^ Lð|$˜ð}^ Lð~&<Ð%;ð^ Lð@*DÐ)CðA^ LðB$˜ðC^ LðD xðE^ LðF(˜<ðG^ LðH%:Ð$9ðI^ LðJ&˜+ðK^ LðL&<Ð%;ðM^ LðN%:Ð$9ðO^ LðP!2Ð 1ðQ^ LðR/ðS^ LðT4ðU^ LðV#6Ð"5ðW^ LðX&˜+ðY^ LðZ2TÐ1Sð[^ Lð\"4Ð!3ð]^ Lð^"˜ ð_^ Lð`&<Ð%;ða^ LðbEðc^ Lðd$˜ðe^ Lðf"˜ ðg^ Lðh.˜oði^ Lðj"4Ð!3ðk^ Lðl"˜ ðm^ Lðn*DÐ)Cðo^ Lðp!2Ð 1ðq^ Lðr%:Ð$9ðs^ Lðt%:Ð$9ðu^ Lðv-JÐ,Iðw^ Lðx#6Ð"5ðy^ Lðz*DÐ)Cð{^ Lð|&˜+ð}^ Lð~&<Ð%;ð^ Lð@(˜<ðA^ LðB(˜<ðC^ LðD"˜ ðE^ LðF/ðG^ LðH.˜oðI^ LðJ(˜<ðK^ LðL&<Ð%;ðM^ LðN-JÐ,IðO^ LðP*DÐ)CðQ^ LðR&<Ð%;ðS^ LðT(˜<ðU^ LðV$8Ð#7ðW^ LðX(@Ð'?ðY^ LðZ!2Ð 1ð[^ Lð\*˜Mð]^ Lð^$8Ð#7ð_^ Lð`/ða^ Lðb&˜+ðc^ Lðd"˜ ðe^ Lðf&˜+ðg^ Lðh*˜Mði^ Lðj%:Ð$9ðk^ Lðl"4Ð!3ðm^ Lðn)BÐ(Aðo^ Lðp-JÐ,Iðq^ Lðr#6Ð"5ðs^ Lðt$8Ð#7ðu^ Lðv"4Ð!3ðw^ Lðx*˜Mðy^ Lðz/ð{^ Lð|#6Ð"5ð}^ Lð~&<Ð%;ð^ Lð@-JÐ,IðA^ LðB/ðC^ LðD/ðE^ LðF,˜^ðG^ LðH0PÐ/OðI^ LðJ&<Ð%;ðK^ LðL.˜oðM^ LðN$˜ðO^ LðP*˜MðQ^ LðR&˜+ðS^ LðT#6Ð"5ðU^ LðV,˜^ðW^ LðX$˜ðY^ LðZ!2Ð 1ð[^ Lð\/ð]^ Lð^/ð_^ Lð`$˜ða^ Lðb%:Ð$9ðc^ Lðd.˜oðe^ Lðf xðg^ Lðh!2Ð 1ði^ Lðj,˜^ðk^ Lðl,˜^ðm^ Lðngðo^ Lðp"˜ ðq^ LðrVðs^ Lðt/ðu^ Lðv!2Ð 1ðw^ Lðx#6Ð"5ðy^ Lðz,˜^ð{^ Lð|)BÐ(AÀFð}^ Lð^ Lð^ Lð~%9ˆÔ!Ø"4ˆÔÐÐrm) NNFFFruFrPrPNNrvrvrrwrxryrzr{r|r}r~rOrr€rrrTNrƒFrUFrƒr„NTFFFFFFr…r…FFFFr†r‡FFNrONNFrˆFNrNrONNTNFNNFrˆrNNNNr‰NFFrNNNNTFTFFNNrŒNNFNFNFTr‡NNNrˆTFNrFNNFFNNFFFNFTNrUNrrrNNrNr“NNNNNNNr”r“rPFr•rrvFr—FFTNrO)
Ú__name__Ú
__module__Ú __qualname__Ú__doc__rCrsrrÚ__annotations__rtÚintrAÚ
__classcell__©rDs@rkroro3ø€ð(ð(ðR+0¨%ØØÐ+ñ+ô+И( 3œ-ððñð*/¨ØØÐ*ñ*ô*И #œððñð ØØØØØ$Ø&'Ø%&Ø#'Ø"&Ø&'Ø"#ØØ"%ØØØØØØØØØØØØØØØ!&ØØØØØØ27ØØØØØØØØØØØ!'ØØØØØØØØØ!"Ø%)ØØØØ $ØØ!&Ø $Ø Ø ØØØØ-1ØØ!$ØØØØØØ%)Ø Ø $Ø $Ø(-Ø"Ø%*ØØ!%ØØØØØØ!&Ø(,Ø%*Ø!%ØØ#Ø#'Ø ØØ ØØØØØ $Ø!Ø$)Ø(-ØØ Ø"Ø!&Ø(,ØØØØ+-Ø!#ØØØØØØ ØØØØ $ØØ ØØØØØØ Ø ØØ$(ØðCRRRRRRRRRR5rmrocóìeZdZddgZ ddedeeeee e
fde j de j d ee j e
eegeeffd
ed eed eeeeeeffd
eejjejjjfdeeeddfdZdefdZdefdZdZdde fdZ!ˆfdZ" ddeedeedeeeedffdZ#ˆxZ$S)Ú_UnslothRLOOTrainerÚtrlÚrlooN©NNÚconfigÚprocessing_classÚpolicyÚ
ref_policyÚ reward_modelÚ
train_datasetÚ
data_collatorÚ eval_datasetÚ
optimizersÚ callbacksÚreturnc ó„
||urtd¦«||_|} ||_||_|t |j¦«}d|jj_d|jj_||_||_ ||_
t|¦«|_ ||_
||_| \|_|_d|_| j€!t'| j|j z¦«| _t+| j¬¦«} | |_| j| _| j| jz| jz| _t'| j| jz¦«| _t'| j| jz¦«| _t?| j| jd¦«| _ t?| j| jd¦«| _!tEj#| j| jz ¦«| _$tKj&t'tOj'¦«¦«| j(¬¦«}
tS|
d¦« *¦«}| j+d| j,d|| _-| j,| j.dzz|_/| j0dkr"tcd | j$| j0z¦«|_2t?| j| j3d
¦«|_4|||fD]+}tk|tlj7¦«rtq|¦«Œ,| j9r| j9d kr|jj| _:||_;| <| j$¬ ¦«tzt}|jj?¦«z}|
|n||
z|_@|j@|j;|j|j|j¦«|_B| C|jjDrn¦«t¦«|_Ht“| J¦«| K¦«d
|jBj@|jHgzD¦«¬¦«|_Ld|_Md|_N|jjLdd¦«du|_P|jjLdd¦«du|_Qd|_R|jjSr| T¦«|jjUr t­jW|jjXd¬¦«d|_Y|j;d¦«r|j; [|j\¦«|j
|j4d|j
d¬¦«|_^tKj_| j,¦«|  `|j;|j|j^¦«\|_;|_|_^tKj_|j/¦«|j| ja|j
d¬¦«|_b|  `|jb¦«|_b|jPrƒtk|j tlj7¦«r+tÇ|j | j| jd| je¦«|_ |j| j| jd| je¦«|_|j;|_fdS|j g|jj(¦«|_tk|j tlj7¦«r+|j  g|jj(¦«|_ dSdS)Nz `policy` and `ref_policy` cannot be the same object. If you want `ref_policy` to be the same as `policy`, you must mass a copy of it, or `None` if you use peft.)z5`batch_size` must be a multiple of `num_mini_batches`z;`local_batch_size` must be a multiple of `num_mini_batches`©ÚdevicerÚ__i£†rUz/`local_batch_size` must be a multiple of rloo_kÚeos)Únum_training_stepscó<g|]}t|t¦«¯|ŒSr:)Ú
isinstancer)Ú.0Úcbs rkú
<listcomp>z0_UnslothRLOOTrainer.__init__.<locals>.<listcomp>%s:ð ð ð ØÕQ[Ð\^Õ`oÑQpÔQpð Øð ð ð rm)Úis_local_process_zeroÚis_world_process_zeroÚstateful_callbacksÚdeepspeed_pluginÚ fsdp_pluginT)Úexist_okÚadd_model_tags)r+ÚshuffleÚ
collate_fnÚ drop_last)r+rprq)hÚ
ValueErrorÚargsrSrTrÚgeneration_configÚ eos_token_idÚ pad_token_idrUrVrWÚlenÚtrain_dataset_lenrXrYÚ optimizerÚ lr_schedulerÚoptimizer_cls_and_kwargsrrJr
Ú acceleratorÚ
num_processesr'rr*r)r+r(r-r,r4Úceilr(r>Útensorr=r_r$Úitemr.Ú
process_indexÚ
local_seedr ÚmaxÚsample_generations_freqr4Úlocal_dataloader_batch_sizerdr5ÚModuler&r"r#ÚmodelÚcreate_optimizer_and_schedulerrr/r[rÚcallback_handlerÚ add_callbackrÞrrr"ÚcontrolrrhriÚstateÚ current_flosÚhp_search_backendÚgetattrÚis_deepspeed_enabledÚis_fsdp_enabledrûÚ init_hf_repoÚ should_saver7ÚmakedirsrœÚ backup_modelÚhasattrrnÚ
_tag_namesrÚ
dataloaderÚ manual_seedÚpreparer¤Úeval_dataloaderr9rZ)rBrRrSrTrUrVrWrXrYrZr[rsr|Ú time_tensorÚtime_intÚmoduleÚdefault_callbackss rkrAz_UnslothRLOOTrainer.__init__¾s•ð ˜Ð Ð Ýð[ñôð
ð
ˆŒ ØˆØ 0ˆÔ؈Œ ð Ð Ý3°DÔ4IÑJˆMð
ð
Œ Ô6:ˆŒ ÔŒØÔØÔÝ!$ ]Ñ!3Ô!3ˆÔØÔØÔØ,6ÑŒ˜Ô)Ø(,ˆÔ
Ô Ð &Ý"% dÔ&;¸dÔ>TÑ&TÑ"UÔ"Uˆ Ý!¸dÔ>^Ð_ˆ ØÔØŒà Ô ,¨tÔ/OÑ OÐRVÔRgÑ 
Ôõ!$ DÔ$DÀtÄÑ$VÑ WÔ WˆÔݘdÔ3°d´oÑŒÝ ŒO˜TÔ2Ð4kñ
ô
ˆÔõ&/Ø Ô ! 4Ô#8Ð:wñ&
ô&
ˆÔ"&¤Ø Ô  $¤/Ñ "
ô"
ˆÔõ”l¥3¥t¤y¡{¤{Ñ#3Ô#3¸KÔ<NÐOˆ ݘ[¨!ÑØœ=ÐC¨D¬IÐÐŒ
Øœ) kÔ&?À&Ñ&HÑHˆŒØ Ô Ò *Ý+.¨q°$Ô2HÈDÔLgÑ2gÑ+hÔ+hˆ (Ý+4Ø Ô ! 4¤;Ð0añ,
ô,
ˆÔ˜z¨<Ð 1ˆ˜&¥"¤)Ñ
Ñ0øØ Œ?ð D˜°%Ò7Ø!%Ô!6Ô!CˆDÔ ØˆŒ
Ø ×
ô
ð
õ.Õ0SÐTXÔT]ÔTgÑ0hÔ0hÑØ.7Ð.?Ð*ÐEVÐYbÑEbˆŒÝ /Ø ŒN˜DœJ¨Ô(=¸t¼~ÈtÔO`ñ!
ô!
ˆÔð
×Ò¨T¬YÔ-CÐb/˜/ÕIbÑ'ˆŒ Ý'Ø"&×"<Ò"<Ñ">Ô">Ø"&×"<Ò"<Ñ">Ô">ð ð ØÄ ¸~Ñ ñ ô ð
ñ
ô
ˆŒ
ðˆÔØ!%ˆÔÝ$+¨DÔ,<Ô,BÐDVÐX\Ñ$]Ô$]ÐeiÐ$iˆÔ& tÔ'7Ô'=¸}ÈdÑSÐ[_ÐÔà ˆÔØ Œ9Ô ð Ø × Ò Ñ Ô Ð Ø Œ9Ô ð ŒK˜œ Ô,°tÐ  ˆÔõ 4”:Ð  ŒJ× % d¤oÑ
 Ô ØÔØÔð 
ñ
ô
ˆŒõ Ô˜$œ)Ñ$Ø6A×6IÒ6IÈ$Ì*ÐVZÔVdÐfjÔfuÑ6vÔ6vÑ3ˆŒ
D”N D¤OÝ
Ô˜$œ/Ñ Ô ØÔÔð 
ñ
ô
ˆÔð 2°4Ô3GÑÔà Ô Rݘ$Ô+­R¬YÑ
Ý$5ØÔ% tÔ'GÈÌÐTXÔT]ñ%ô%Ô Ô!AÀ4Ä9ÈdÌiñôˆDŒOð"œZˆDŒNˆNˆ"œo×Ô1AÔ1HÑIˆDŒOݘ+­R¬YÑ
RØ$(Ô$5×$8Ò$8¸Ô9IÔ9PÑ$QÔ$QÔ
Rð
Rrmcó|jS©r˜©rBs rkÚget_train_dataloaderz(_UnslothRLOOTrainer.get_train_dataloader`s
ØŒÐrmcó|jS)rs rkÚget_eval_dataloaderz'_UnslothRLOOTrainer.get_eval_dataloadercs ØÔ#rmc óŒ&‡^—|j}|j}|j}|j}|j|_|j}|j}|j}|jŠ^|j }ˆ^fd} t| ¦«¦«}
t|j |j
dzddd¬¦«} | d¦«tj¦«} |j|j|jf}
t'j|
|¬¦«}t'j|
|¬¦«}t'j|
|¬¦«}t'j|
|¬¦«}t'j|
|¬¦«}t'j|
|¬¦«}| ¦«d |j_d |j_|j|jzd
z|j_|j|jz |j_|jM|jd kr1t?j |jj|jz¦«|j_n|j|j_|j!M|j!d kr1t?j |jj|j!z¦«|j_!n|j!|j_!|j"M|j"d kr1t?j |jj|j"z¦«|j_"n|j"|j_"|j# $||j|j%¦«|_%tMd |jd z¦«D]×}|jxjd |j'zz
c_tQ|
¦«}t'j)¦«5|d  *|¦«}| +|j,d ¦«}|j-d }g}g}g}g}g}g}t]|j|j|jj/¬
¦«5}ta|||j1|j2| ¦«\}} ddd¦«n #1swxYwYtMd |j-d |j1¦«D]}!||!|!|j1z}"||!|!|j1z}#|#dd|df}$| |!|!|j1z}%tg|%|$¦«}&~%ti¦«tk||#|j2¦«}'|'j6dd|d z
df}(|(|j
dzz}(tg|(|$¦«})~'~(ti¦«|$}*|j7tq|j7|j2|$¦«}*t'j9|"|*fd ¦«}+tu|*|j2k¦«d z
},tw|txj=¦«rt}||+|j2|¦«\}-}.}-nQt'j?|| @|+d¬¦«¦«t&jA¬¦« *|¦«}.| B|$¦«| B|*¦«| B|&¦«| B|)¦«| B|,¦«| B|.¦«Œt'j9|d ¦«}t'j9|d ¦«}t'j9|d ¦«}t'j9|d ¦«}t'j9|d ¦«}t'j9|d ¦«}~&~)~.ti¦«t‡jD¦«t'jE||jFkd¬¦«}/|jG||/xx|jjGzcc<t'jH|j-d |j ¬¦« +|j-d d ¦«}0|0| Id ¦«k}1t'jJ||1t¦«}t'jJ||1t¦«}||z
}2|jLrP|| M¦«z
| N¦«dzz }t'jO||jP |jP¦«}|jQré|jR |2z}3|1 Sd ¦«d z
|1 T¦« U¦« Vd d¬¦«z
}4t'jW|2¦«}5| Xdd ¦« *|2jY¦«}6|5 Zd |4|6¬¦«|3 [d ¦«}7|5|3z}8|8 [d ¦«}9n%|2 [d ¦«}:|jR |:z}7|7|z}9|9 X|j,d¦«}9|9 [d ¦«|9z
|j,d z
z };|9|;z
}<|< \¦«}<|j]r/|<|< M¦«z
|< N¦«dzz }<ti¦«ddd¦«n #1swxYwYtM|j¦«D]v}=t¼j_ `|ja¦«}>d }?tMd |ja|jb¦«D]0}@|@|jbz}A|>|@|A…}Bd }CtMd |jb|jc¦«D]Ö}D| d|¦«5|D|jcz}E|B|D|E…}F|<|F}G||F}H||F}I||F}Jtk||I|j2¦«}K|Kj6dd|d z
df}%|%|j
dzz}%tg|%|H¦«}Lt'jJ|L|1|Ft¦«}L|L|Jz
 e¦«}M|L [d ¦«}L|J [d ¦«}J|L|Jz
}Nt'je|N¦«}O|G |Oz}P|G t'jO|Od|jfz
d|jfz¦«z}Qt'jg|P|Q¦«}R|R M¦«}S|S}T| h|T¦«| i¦«| j¦«t'j)¦«5|Q|Pk A¦« M¦«}Ut&j<jk l|%d¬¦«}Vt'jm|%d¬¦«t'j[|V|%zd¬¦«z
}Wd|Nd
z M¦«z}X|X||=|?|Cf<|U||=|?|Cf<|S||=|?|Cf<|W M¦«||=|?|Cf<|M M¦«||=|?|Cf<ddd¦«n #1swxYwYddd¦«n #1swxYwY|Cd z
}Cί|?d z
}?~K~%~L~N~O~P~Q~S~T~U~V~W~X~G~H~I~Jti¦«Œ2Œxt'j)¦«5|2 [d ¦« M¦«}Y|  [d ¦« M¦«}Z|7 M¦«}[tÝ|jjtj¦«| z
z ¦«}\i}]|\|]d<|j o|Y¦« M¦« p¦«|]d<|j o|Z¦« M¦« p¦«|]d<|j o|[¦« M¦« p¦«|]d<|j o|9¦« M¦« p¦«|]d<|j o| M¦«¦« M¦« p¦«|]d<|j o|¦« M¦« p¦«|]d<|j o|¦« M¦« p¦«|]d<|j o|¦« M¦« p¦«|]d<|j o|¦« M¦« p¦«|]d<|j o|¦« M¦« p¦«|]d <|j o|¦« M¦« p¦«|]d!<|j o|¦« q¦« p¦«|]d"<||jFk [¦« p¦«|]d#<|jr s¦«d |]d$<|jj|]d%<|jj|j,|jzz |j_t| u|]¦«ddd¦«n #1swxYwY~2~Y~Z~|jr i¦«|jxjd z
c_|j# v||j|j%¦«|_%|j%jwrG| x|d¬&¦«|j# y|j|j|j%¦«|_%ti¦«t‡jD¦«|jzd kr'|d z
|j{zd kr| |d¬'¦«ŒÙ|j# }||j|j%¦«|_%|j%jwrJ| x|dd¬(¦«|j# y|j|j|j%¦«|_%dSdS))Nc3óK Ed{VŒ r¡r:s€rkÚrepeat_generatorz3_UnslothRLOOTrainer.train.<locals>.repeat_generatorrs)øèèð

&rmr™r‰r}Úmax_new_tokensr$Útop_kÚtop_pÚ do_samplez===training policy===r^rrvrUÚ input_ids©Úgather_deepspeed3_paramsrO©Úskip_special_tokens©ÚdtyperTr|)rRÚkeepdim)rRrSÚsrcgà?Úepsz objective/klzobjective/entropyzobjective/non_score_rewardzobjective/rlhf_rewardzobjective/scoreszpolicy/approxkl_avgzpolicy/clipfrac_avgzloss/policy_avgzval/clipfrac_avgzpolicy/entropy_avgz val/ratioz
val/ratio_varzval/num_eos_tokensÚlrÚepisode)Útrial)Úsampling)Úmetrics)~rsr|ryr‡Ú
model_wrappedrUrVrSr˜r_Úiterrr!r$Úprintr=r0rr>ÚzerosÚtrainrŒÚ global_steprºr(rrxr4r~r¿r‰Úon_train_beginrÚranger+ÚnextÚno_gradrZÚrepeatr4rXr@r9r#rrvr;r'r*rbr#r?Úcatr)rdr5r†r0rÚ batch_decodeÚfloatr`r,ÚcollectÚanyrur%Úaranger]Ú masked_fillrr5ÚmeanÚstdÚclampr6r8r2ÚsizeÚlongÚfliplrÚargmaxÚ
zeros_likerWÚscatter_ÚsumÚflattenr7r6ÚrandomÚ permutationr*r,Ú
accumulateÚexpr3ÚbackwardÚstepÚ zero_gradrÚsoftmaxr_rJÚgather_for_metricsr€ÚvarrzÚ get_last_lrÚepochÚlogÚ on_step_endr“Ú_save_checkpointÚon_saver r„Úgenerate_completionsÚ on_train_end)_rBrsr|ryr‡rUrVrSr_Úiter_dataloaderrtÚ
start_timeÚ stats_shapeÚapproxkl_statsÚpg_clipfrac_statsÚ
pg_loss_statsÚvf_clipfrac_statsÚ
entropy_statsÚ ratio_statsÚupdateÚdataÚqueriesÚcontext_lengthÚ responsesÚpostprocessed_responsesÚlogprobsÚ ref_logprobsÚscoresÚsequence_lengthsÚunwrapped_modelÚquery_responsesÚlogitssÚqueryÚquery_responseÚresponserbÚlogprobÚ
ref_outputÚ
ref_logitsÚ ref_logprobÚpostprocessed_responseÚpostprocessed_query_responseÚsequence_lengthÚscoreÚcontain_eos_tokenÚ
response_idxsÚ padding_maskÚklÚ kl_rewardÚ eos_indicesÚ last_rewardÚ
scores_shapedÚnon_score_rewardÚrewardÚ rlhf_rewardÚ sequence_klÚbaselineÚ
advantagesÚ
ppo_epoch_idxÚb_indsÚ
minibatch_idxÚmini_batch_startÚmini_batch_endÚmini_batch_indsÚgradient_accumulation_idxÚmicro_batch_startÚmicro_batch_endÚmicro_batch_indsÚ mb_advantageÚ mb_responsesÚmb_query_responsesÚ mb_logprobsÚoutputÚ new_logprobsÚ new_ratioÚ
logprobs_diffÚratioÚ pg_lossesÚ
pg_losses2Ú pg_loss_maxÚpg_lossÚlossÚ pg_clipfracÚ prob_distÚentropyÚapproxklÚmean_klÚ mean_entropyÚmean_non_score_rewardr¸r˜s_ @rkz_UnslothRLOOTrainer.trainfs,ø€ØŒyˆØÔ Ø”Nˆ Ø
ˆØ!œZˆÔØ”_ˆ
ØÔ ØÔØ”_ˆ
ØÔ#ˆðÐÝÔÔ)¨DÑØØð 
ñ
ô
Ðð ×ÒДY‘[”[ˆ
ØÔ*¨DÔ,AÀ4ÔCcÐ Ýœ [¸ÐÝ!œK¨ ¸FÐÝœ  K¸Ð
Ý!œK¨ ¸FÐÝœ  K¸Ð
Ý”k +°fÐ Ø
Š
Œ