DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothRLOOTrainer.cpython-310.pyc
308 lines
29 KiB
2025-08-28 17:57:59 +00:00
o
2025-08-28 22:41:56 +00:00
ö×°hßã@sdZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZm Z m!Z!m"Z"m#Z#m$Z$m%Z%m&Z&m Z m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1m2Z2m3Z3m4Z4m5Z5m6Z6m7Z7m8Z8mZm9Z9m:Z:m;Z;m<Z<m=Z=m>Z>m?Z?m@Z@mZmAZAmBZBm
Z
m$Z$m:Z:mZddl:Z:ddlTddlCmDZDmEZEdd lFmGZGddlZddlHZ9dd
lImJZJddlmZdd lKmLZLmMZNd d
d d
d
dœZOejPd d eOdddƒZQeDGddde"ƒƒZR Gddde$ƒZSGdddeSƒZTdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable);Ú AcceleratorÚBaseImageProcessorr ÚCallbackHandlerÚDEFAULT_CALLBACKSÚDEFAULT_PROGRESS_CALLBACKÚDataCollatorWithPaddingÚ
DataLoaderÚDatasetÚExportableStateÚFeatureExtractionMixinÚGenerationConfigÚINVALID_LOGPROBÚOnlineTrainerStaterÚPathÚPreTrainedTokenizerBaseÚPrinterCallbackÚProcessorMixinÚ
RLOOConfigÚ RLOOTrainerÚTrainerÚTrainerCallbackÚTrainerControlrÚbatch_generationÚ broadcastÚ defaultdictÚdisable_dropout_in_modelÚ empty_cacheÚ exact_divÚfirst_true_indicesÚforwardÚ
gather_objectÚgcÚgenerate_model_cardÚget_comet_experiment_urlÚ#get_reporting_integration_callbacksÚ
get_rewardÚis_rich_availableÚis_wandb_availableÚlog_table_to_comet_experimentÚmathÚnnÚnpÚosÚpdÚprepare_deepspeedÚprint_rich_tableÚselective_log_softmaxÚtextwrapÚtimeÚtorchÚtruncate_responseÚunwrap_model_for_generationrrr6r=)Ú*)Ú dataclassÚfield)ÚVersion)Ú nullcontext)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionsc
Ctj| d|jd¡ddd}tj| d¡ddd}g}t||ƒD](\}}| tj¡}tj|d| d¡d  d¡}tj
|dd}||} |  | ¡q! t  |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rPÚindex©rPé)
r=ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
ÚlogitsrQÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©riúR/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothRLOOTrainer.pyÚchunked_selective_log_softmax"s  
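
The identifiers readable in the dump above (`chunked_selective_log_softmax`, `chunk_logits`, `chunk_index`, `selected_logits`, `logsumexp_values`, `per_token_logps`) suggest a helper that computes, for each token, the log-probability of the selected index as `logit[index] - logsumexp(logits)`, working through the rows in chunks to limit peak memory. The following is a minimal pure-Python sketch of that idea, not a decompilation: the function names, the list-based inputs, and the default chunk count of 4 are assumptions made for illustration, and the compiled code operates on `torch` tensors instead.

```python
import math


def selective_log_softmax(logits, index):
    """For each row, return log_softmax(row)[index[row_number]].

    Hypothetical list-based stand-in for the tensor version:
    log p(i) = logit_i - logsumexp(logits).
    """
    result = []
    for row, idx in zip(logits, index):
        row_max = max(row)  # subtract the max for numerical stability
        logsumexp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        result.append(row[idx] - logsumexp)
    return result


def chunked_selective_log_softmax_sketch(logits, index, chunks=4):
    """Same result as selective_log_softmax, computed chunk by chunk.

    `chunks` is an assumed default; only one chunk of rows is processed
    at a time, which is the point of chunking in the tensor setting.
    """
    n_rows = len(logits)
    chunk_size = max(1, math.ceil(n_rows / chunks))
    per_token_logps = []
    for start in range(0, n_rows, chunk_size):
        stop = start + chunk_size
        per_token_logps.extend(
            selective_log_softmax(logits[start:stop], index[start:stop])
        )
    return per_token_logps
```

Chunking does not change the values, only how many rows are held in float32 at once, so the chunked and unchunked results should agree.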
rkceZdZUdZedddidZeeed<edddidZ ee
ed <  
 
                 

   
 
!
   
"
       
 
"      # $ 
%     

  &  


 !    " 
 ' (
 
 

    ) * +   ,  -        . -
/ 0
1
  d4‡fd2d3„ Z Z S)5ÚUnslothRLOOConfigaÆ
Configuration class for the [`RLOOTrainer`].

This class includes only the parameters that are specific to RLOO training. For a full list of training arguments,
please refer to the [`~transformers.TrainingArguments`] and [`OnPolicyConfig`] documentation. Note that default
values in this class may differ from those in [`~transformers.TrainingArguments`].

Using [`~transformers.HfArgumentParser`] we can turn this class into
[argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
command line.

Parameters:
    exp_name (`str`, *optional*, defaults to `os.path.basename(__file__)[: -len(".py")]`):
        Name of this experiment.
    reward_model_path (`str`, *optional*, defaults to `"EleutherAI/pythia-160m"`):
        Path to the reward model.
    num_ppo_epochs (`int`, *optional*, defaults to `4`):
        Number of epochs to train.
    whiten_rewards (`bool`, *optional*, defaults to `False`):
        Whether to whiten the rewards.
    kl_coef (`float`, *optional*, defaults to `0.05`):
        KL coefficient.
    cliprange (`float`, *optional*, defaults to `0.2`):
        Clip range.
    rloo_k (`int`, *optional*, defaults to `2`):
        REINFORCE Leave-One-Out (RLOO) number of online samples per prompt.
    normalize_reward (`bool`, *optional*, defaults to `False`):
        Whether to normalize rewards.
    reward_clip_range (`float`, *optional*, defaults to `10.0`):
        Clip range for rewards.
    normalize_advantage (`bool`, *optional*, defaults to `False`):
        Whether to normalize advantages.
    token_level_kl (`bool`, *optional*, defaults to `True`):
        Whether to use token-level KL penalty or sequence-level KL penalty.
    ds3_gather_for_generation (`bool`, *optional*, defaults to `True`):
        This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation,
        improving generation speed. However, disabling this option allows training models that exceed the VRAM
        capacity of a single GPU, albeit at the cost of slower generation.
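
The `rloo_k` parameter above sets how many completions are sampled per prompt. In REINFORCE Leave-One-Out, each completion's baseline is the mean reward of the other `rloo_k - 1` completions for the same prompt, and the advantage is the reward minus that baseline. The sketch below illustrates this for a single prompt. The function name `rloo_advantages` and the plain-list input are hypothetical choices for illustration, and it leaves out the KL penalty, reward clipping, and optional normalization that the configuration also controls.

```python
def rloo_advantages(rewards, rloo_k):
    """Leave-one-out advantages for the rloo_k completions of ONE prompt.

    Hypothetical helper: `rewards` is a plain list of rloo_k scalar rewards.
    baseline_i  = (sum of all rewards - reward_i) / (rloo_k - 1)
    advantage_i = reward_i - baseline_i
    """
    if rloo_k < 2:
        raise ValueError("rloo_k must be at least 2 to leave one sample out")
    if len(rewards) != rloo_k:
        raise ValueError("expected exactly rloo_k rewards for the prompt")
    total = sum(rewards)
    return [r - (total - r) / (rloo_k - 1) for r in rewards]
```

For example, with `rloo_k = 2` and rewards `[1.0, 3.0]`, each completion is compared against the other one, so the lower-scoring completion receives a negative advantage and the higher-scoring one a positive advantage of equal size.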
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrMz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksFÚnorNéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrSéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéé@é
é5çffffffæ?úEleutherAI/pythia-160mÚ rloo_configçš™™™™™©?çš™™™™™É?ç$@c¡£ sj|dkr td|dƒ|dkrtd|dƒ|dur(|#dkr(|$dkr(d}d }#|dur:d
d lmt|¢ƒd d
ƒ}|‰d
krBtdƒ|‰dkrJtdƒtƒjd¯id|d|d|d|d|d|d|d|d| “d|
d| d| d|
d|d|d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-|d.|d/|d0| “d1|!“d2|"“d3|#“d4|$“d5|%“d6|&“d7|'“d8|(“d9|)“d:|*“d;|+“d<|,“d=|-“d>|.“d?|/“d@|0“dA|1“dB|2“dC|3“dD|4“dE|5“dF|6“dG|7“dH|8“dI|9“dJ|:“dK|;“dL|<“dM|=“dN|>“dO|?“dP|@“dQ|A“dR|B“dS|C“dT|D“dU|E“dV|F“dW|G“dX|H“dY|I“dZ|J“d[|K“d\|L“d]|M“d^|N“d_|O“d`|P“da|Q“db|R“dc|S“dd|T“de|U“df|V“dg|W“dh|X“di|Y“dj|Z“dk|[“dl|\“dm|]“dn|^“do|_“dp|`“dq|a“dr|b“ds|c“dt|d“du|e“dv|f“dw|g“dx|h“dy|i“dz|j“d{|k“d||l“d}|m“d~|n“d|o“d€|p“d|q“d|r“dƒ|s“d„|t“d…|u“d†|v“d‡|w“dˆ|x“d‰|y“dŠ|z“d|{“dŒ||“d|}“dŽ|~“d|d|€“d|d’|‚“d“|ƒ“d”|„“d•|…“d–|†“d—|‡“d˜|ˆ“d™|‰“dš|Š“d›|‹“dœ|Œ“d|dž|Ž“dŸ|d |d¡|‘“d¢|’“d£|““d¤|”“d¥|•“d¦|–“d§|—“d¨|˜“d©|™“dª|š“d«|›“d¬|œ“d­|d®|ž“|¡¤Ž|Ÿ|_| |_ dS)°NçH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rSza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!r€rÚunsloth_training_checkpointsrrr)Ú cpu_countrNrszUUnsloth: Please set a positive non-zero temperature since your results will be wrong.rzgUnsloth: Please set a positive non-zero temperature less than 10, since sampling will be quite erratic.Ú
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚdataset_num_procÚnum_mini_batchesÚtotal_episodesÚ local_rollout_forward_batch_sizeÚnum_sample_generationsÚresponse_lengthÚ
stop_tokenÚ
stop_token_idÚ temperatureÚmissing_eos_penaltyÚsft_model_pathÚ
world_sizeÚnum_total_batchesÚmicro_batch_sizeÚlocal_batch_sizeÚ
batch_sizeÚlocal_mini_batch_sizeÚmini_batch_sizeÚexp_nameÚreward_model_pathÚnum_ppo_epochsÚwhiten_rewardsÚkl_coefÚ cliprangeÚrloo_kÚnormalize_rewardÚreward_clip_rangeÚnormalize_advantageÚtoken_level_klÚds3_gather_for_generationri)
ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr—ÚmaxÚ MathErrorÚsuperÚ__init__rprq)£Úselfr˜r™rrr r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrrrrrr r!r"r#r$r%r&r'r(r)r*r+r,r-r.r/r0r1r2r3r4r5rprqÚkwargsr—©Ú __class__rirjr<gs&  ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·JKµL´M³N²O±P°Q¯R®S­T¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžcdœefšgh˜ijklmnopqrŽstŒuvŠwxˆyz{|}ƒ~ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá
zUnslothRLOOConfig.__init__) NNFFFrrFrNrNNNrsrsrrtrurvrwrxryrzr{rMr|r}rr~rTNr€FrSFr€rNTFFFFFFrrFFFFrƒr„FFNrMNNFr…FNrNrMNNTNFNNFr…rNNNNr†r‡NFFrˆNNNNTFTFFNNr‰NNFNFNFTr„NNNr…TFNrŠrFNNFFNNFFFNFTNrSNrŒrNNrNrNNNNNNNrrrNFrr“rsFr”FFTNrM)
Ú__name__Ú
__module__Ú __qualname__Ú__doc__rBrprrÚ__annotations__rqÚintr<Ú
__classcell__ririr?rjrl3sX
)þþÞrlcs.eZdZddgZ    d#dedeeeee e
fde j de j d ee j e
eegeeffd
ed eed eeeeeeffd
eejjejjjfdeeeddfddZdefddZdefddZddZd$de fddZ!‡fddZ"   d%deedeed eeeedffd!d"„Z#‡Z$S)&Ú_UnslothRLOOTrainerÚtrlÚrlooN©NNÚconfigÚprocessing_classÚpolicyÚ
ref_policyÚ reward_modelÚ
train_datasetÚ
data_collatorÚ eval_datasetÚ
optimizersÚ callbacksÚreturnc Cs(||urtdƒ||_|} ||_||_|durt|jƒ}d|jj_d|jj_||_||_ ||_
t |ƒ|_ ||_
||_| \|_|_d|_| jdurQt| j|j ƒ| _t| jd} | |_| j| _| j| j| j| _t| j| jƒ| _t| j| jƒ| _t| j| jdƒ| _ t| j| jdƒ| _!t" #| j| j¡| _$t%j&tt' ƒ| j(d}
t)|
dƒ }| j+d| j,d|| _-| j,| j.d|_/| j0dkrÎt1d | j$| j0ƒ|_2t| j| j3d
ƒ|_4|||fD] }t5|t6j7ƒrèt8|ƒqÜ| j9rö| j9d krö|jj| _:||_;|j<| j$d t=t>|jj?ƒ}|
dur|n||
|_@tA|j@|j;|j|j|jƒ|_B| C|jjDr+tEntF¡tGƒ|_HtI| | d
d|jBj@|jHgDƒd|_Ld|_Md|_NtO|jjLddƒdu|_PtO|jjLddƒdu|_Qd|_R|jjSrq| |jjUrtVjW|jjXddd|_YtZ|j;dƒr|j; [|j\¡t]|j
|j4d|j
dd|_^t% _| j,¡|  `|j;|j|j^¡\|_;|_|_^t% _|j/¡t]|j| ja|j
dd|_b|  `|jb¡|_b|jPröt5|j t6j7ƒrätc|j | j| jd| jeƒ|_ tc|j| j| jd| jeƒ|_|j;|_fdS|j g|jj(¡|_t5|j t6j7ƒr|j  g|jj(¡|_ dSdS)Nz `policy` and `ref_policy` cannot be the same object. If you want `ref_policy` to be the same as `policy`, you must mass a copy of it, or `None` if you use peft.)z5`batch_size` must be a multiple of `num_mini_batches`z;`local_batch_size` must be a multiple of `num_mini_batches`©ÚdevicerÚ__i£†rSz/`local_batch_size` must be a multiple of rloo_kÚeos)Únum_training_stepscSsg|] }t|tƒr|qSri)Ú
isinstancer)Ú.0ÚcbririrjÚ
<listcomp>(s

ÿÿz0_UnslothRLOOTrainer.__init__.<locals>.<listcomp>)Úis_local_process_zeroÚis_world_process_zeroÚstateful_callbacksÚdeepspeed_pluginÚ fsdp_pluginT)Úexist_okÚadd_model_tags)r'ÚshuffleÚ
collate_fnÚ drop_last)r'rhri)hÚ
ValueErrorÚargsrMrNrÚgeneration_configÚ eos_token_idÚ pad_token_idrOrPrQÚlenÚtrain_dataset_lenrRrSÚ optimizerÚ lr_schedulerÚoptimizer_cls_and_kwargsrrFr­r Ú acceleratorÚ
num_processesr#rr&r%r'r'r)r(r3Úceilr$r=Útensorr<rXr#Úitemr*Ú
process_indexÚ
local_seedrr9Úsample_generations_freqr0Úlocal_dataloader_batch_sizer\r4ÚModuler%rrÚmodelÚcreate_optimizer_and_schedulerrr.rUrÚcallback_handlerÚ add_callbackrÚrrr!Úcontrolrr`raÚstateÚ current_flosÚhp_search_backendÚgetattrÚis_deepspeed_enabledÚis_fsdp_enabledr÷Ú init_hf_repoÚ should_saver6Úmakedirsr˜Ú backup_modelÚhasattrrfÚ
_tag_namesrÚ
dataloaderÚ manual_seedÚpreparer Úeval_dataloaderr8rX)r=rLrMrNrOrPrQrRrSrTrUrkrtÚ time_tensorÚtime_intÚmoduleÚdefault_callbacksririrjr<Áÿ
ÿ

 
 ÿ
ÿ
ÿ
ÿ

ÿ 
ÿÿÿý

û  üÿÿ ÿz_UnslothRLOOTrainer.__init__cCó|jS©r©r=ririrjÚget_train_dataloadercóz(_UnslothRLOOTrainer.get_train_dataloadercCr—r˜)rririrjÚget_eval_dataloaderfz'_UnslothRLOOTrainer.get_eval_dataloaderc^
s |j}|j}|j}|j}|j|_|j}|j}|j}|j|j }fdd} t
| ƒƒ}
t |j |j
ddddd} | d¡t ¡} |j|j|jf}
tj|
|d }tj|
|d }tj|
|d }tj|
|d }tj|
|d }tj|
|d }| ¡d
|j_d
|j_|j|jd |j_|j|j|j_|jdur¬|jd kr§t  |jj|j¡|j_n|j|j_|j!durÈ|j!d krÃt  |jj|j!¡|j_!n|j!|j_!|j"durä|j"d krßt  |jj|j"¡|j_"n|j"|j_"|j# $||j|j%¡|_%t&d |jd ƒD]ê}|jjd |j'7_t(|
ƒ}t !|d
 *|¡}| +|j,d ¡}|j-d }g}g}g}g}g}g}t.|j|j|jj/d}t0|||j1|j2| ƒ\}} Wdƒn 1sQwYt&d
|j-d
|j1ƒD]º}!||!|!|j1}"||!|!|j1}#|#dd|df}$| |!|!|j1}%t3|%|$ƒ}&~%t4ƒt5||#|j2ƒ}'|'j6dd|d df}(|(|j
d}(t3|(|$ƒ})~'~(t4ƒ|$}*|j7durÅt8|j7|j2|$ƒ}*t 9|"|*fd ¡}+t:|*|j2kƒd },t;|t<j=ƒrét>||+|j2|ƒ\}-}.}-ntj?||j@|+ddƒtjAd *|¡}.| B|$¡| B|*¡| B|&¡| B|)¡| B|,¡| B|.¡q`t 9|d
¡}t 9|d
¡}t 9|d
¡}t 9|d
¡}t 9|d
¡}t 9|d
¡}~&~)~.t4ƒtC tjE||jFkdd}/|jGdurd||/|jjG8<tjH|j-d |j d  +|j-d
d ¡}0|0| Id ¡k}1t J||1tK¡}t J||1tK¡}||}2|jLr©|| | d}t O||jP |jP¡}|jQrë|jR |2}3|1 Sd ¡d |1  jVd dd}4t W|2¡}5| Xdd ¡ *|2jY¡}6|5jZd |4|6d|3 [d ¡}7|5|3}8|8 [d ¡}9n|2 [d ¡}:|jR |:}7|7|}9|9 X|j,d¡}9|9 [d
¡|9|j,d };|9|;}<|< }<|j]r%|<|< |< d}<t4ƒWdƒn 1s3wYt&|jƒD][}=t^j_ `|ja¡}>d
}?t&d
|ja|jbƒD]E}@|@|jb}A|>|@|A…}Bd
}Ct&d
|jb|jcƒD]}D| d|¡þ|D|jc}E|B|D|E…}F|<|F}G||F}H||F}I||F}Jt5||I|j2ƒ}K|Kj6dd|d df}%|%|j
d}%t3|%|Hƒ}Lt J|L|1|FtK¡}L|L|J }M|L [d ¡}L|J [d ¡}J|L|J}Nt e|N¡}O|G |O}P|G t O|Od|jfd|jf¡}Qt g|P|Q¡}R|R }S|S}T| h|T¡| | t ^|Q|Pk  }Utj<jkjl|%dtjmd *|%jY¡}Vtjn|%ddtj[|V|%dd}Wd|Nd  }X|X||=|?|Cf<|U||=|?|Cf<|S||=|?|Cf<|W ||=|?|Cf<|M ||=|?|Cf<Wdƒn 1scwYWdƒn 1sswY|Cd 7}Cqi|?d 7}?~K~%~L~N~O~P~Q~S~T~U~V~W~X~G~H~I~Jt4ƒqQq=t ê|2 [d ¡ }Y|  [d ¡ }Z|7 }[to|jjt ¡| ƒ}\i}]|\|]d<|j p|Y¡  |]d<|j p|Z¡  |]d<|j p|[¡  |]d<|j p|9¡  |]d<|j p| ¡  |]d<|j p|¡  |]d<|j p|¡  |]d<|j p|¡  |]d <|j p|¡  |]d!<|j p|¡  |]d"<|j p|¡  |]d#<|j p|¡  |]d$<||jFk  |]d%<|js d
|]d&<|jj|]d'<|jj|j,|j|j_u| v|]¡Wdƒn 1sŒwY~2~Y~Z~|js |jjd 7_|j# w||j|j%¡|_%|j%jxrÅ|jy|dd(|j# z|j|j|j%¡|_%t4ƒtC |j{d
krâ|d |j|d
krâ|j}dd)q÷|j# ~||j|j%¡|_%|j%jxr |jy|ddd*|j# z|j|j|j%¡|_%dSdS)+Nc3s ˆEdHqr˜ririr™rirjÚrepeat_generatorus
ÿz3_UnslothRLOOTrainer.train.<locals>.repeat_generatorr•r†rzÚmax_new_tokensr Útop_kÚtop_pÚ do_samplez===training policy===rWrrsrSÚ input_ids©Úgather_deepspeed3_paramsrM©Úskip_special_tokens©ÚdtyperRry)rPÚkeepdim)rPrQÚsrc)rPgà?Úepsz objective/klzobjective/entropyzobjective/non_score_rewardzobjective/rlhf_rewardzobjective/scoreszpolicy/approxkl_avgzpolicy/clipfrac_avgzloss/policy_avgzval/clipfrac_avgzpolicy/entropy_avgz val/ratioz
val/ratio_varzval/num_eos_tokensÚlrÚepisode)Útrial)Úsampling)Úmetrics)rkrtrqr~Ú
model_wrappedrOrPrMrrXÚiterrrr Úprintr<r,rr=ÚzerosÚtrainrƒÚ global_stepr¯r$rrpr­r¸r3rvr€Úon_train_beginrÚranger'ÚnextÚno_gradrXÚrepeatr0rVr?r5r"rrnr:r&r)r`rr>Úcatr(r\r4r}r/rwÚ batch_decodeÚfloatr^r+ÚcollectÚanyrmr!Úaranger[Ú masked_fillrr1ÚmeanÚstdÚclampr2r4r.ÚsizeÚlongÚfliplrÚargmaxÚ
zeros_likerUÚscatter_ÚsumÚflattenr3r5ÚrandomÚ permutationr&r(Ú
accumulateÚexpr/r9ÚbackwardÚstepÚ zero_gradrÚsoftmaxrYr]rFÚgather_for_metricsrxÚvarrrÚ get_last_lrÚepochÚlogÚ on_step_endrŠÚ_save_checkpointÚon_saverr{Úgenerate_completionsÚ on_train_end)^r=rkrtrqr~rOrPrMrXÚiter_dataloaderrlÚ
start_timeÚ stats_shapeÚapproxkl_statsÚpg_clipfrac_statsÚ
pg_loss_statsÚvf_clipfrac_statsÚ
entropy_statsÚ ratio_statsÚupdateÚdataÚqueriesÚcontext_lengthÚ responsesÚpostprocessed_responsesÚlogprobsÚ ref_logprobsÚscoresÚsequence_lengthsÚunwrapped_modelÚquery_responsesÚlogitssÚqueryÚquery_responseÚresponser`ÚlogprobÚ
ref_outputÚ
ref_logitsÚ ref_logprobÚpostprocessed_responseÚpostprocessed_query_responseÚsequence_lengthÚscoreÚcontain_eos_tokenÚ
response_idxsÚ padding_maskÚklÚ kl_rewardÚ eos_indicesÚ last_rewardÚ
scores_shapedÚnon_score_rewardÚrewardÚ rlhf_rewardÚ sequence_klÚbaselineÚ
advantagesÚ
ppo_epoch_idxÚb_indsÚ
minibatch_idxÚmini_batch_startÚmini_batch_endÚmini_batch_indsÚgradient_accumulation_idxÚmicro_batch_startÚmicro_batch_endÚmicro_batch_indsÚ mb_advantageÚ mb_responsesÚmb_query_responsesÚ mb_logprobsÚoutputÚ new_logprobsÚ new_ratioÚ
logprobs_diffÚratioÚ pg_lossesÚ
pg_losses2Ú pg_loss_maxÚpg_lossÚlossÚ pg_clipfracÚ prob_distÚentropyÚapproxklÚmean_klÚ mean_entropyÚmean_non_score_rewardr­rir™rji 
û









 
ÿ
ûý 

 
ÿ
 ÿ ÿüû




       $ $

 
 ö

  
 

ÿ 



  

  ÿõÖ 6 
º
Iÿ  ç

  
þz_UnslothRLOOTrainer.trainFr±c
Cs8|j}|j}t|jjddddd}ttƒ}t|j|j|jj d±}|j
D]¥}|d}t   ¡|j
d} t|||j
d |j|ƒ\}
} |
dd| df} | }
|jdur[t|j|j| ƒ}
|d