DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothRewardTrainer.cpython-310.pyc
200 lines
25 KiB
2025-08-28 17:57:59 +00:00
o
2025-08-28 22:41:56 +00:00
ö×°h¯¤ã@dZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZm
Z
mZmZmZmZmZmZmZmZmZm Z m!Z!m Z m"Z"m#Z#m$Z$m%Z%m&Z&m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0mZm1Z1m2Z2m3Z3m4Z4m5Z5mZm6Z6m
Z
mZmZm Z m+Z+m1Z1mZddl1Z1ddlTddl7m8Z8m9Z9dd l:m;Z;ddlZddl<Z=dd
l>m?Z?ddlmZdd l@mAZAmBZCd d
d d
d
dœZDejEd d eDdddƒZFe8GdddeƒƒZG Gddde ƒZHGdddeHƒZIdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)3rÚBaseImageProcessorr Ú DataCollatorÚDatasetÚEvalPredictionÚFeatureExtractionMixinÚFrozenInstanceErrorrÚ PartialStateÚPathÚ PeftModelÚPreTrainedModelÚPreTrainedTokenizerBaseÚProcessorMixinÚ RewardConfigÚRewardDataCollatorWithPaddingÚ
RewardTrainerÚTrainerÚTrainerCallbackrÚ _tokenizeÚcompute_accuracyÚdecode_and_strip_paddingÚ defaultdictÚdisable_dropout_in_modelÚ
gather_objectÚgenerate_model_cardÚget_comet_experiment_urlÚinspectÚis_peft_availableÚis_rich_availableÚis_wandb_availableÚlog_table_to_comet_experimentÚmaybe_apply_chat_templateÚ
nested_detachÚnnÚosÚpdÚprepare_model_for_kbit_trainingÚprint_rich_tableÚreplaceÚtorchÚwarningsrrrrr&r-r2)Ú*)Ú dataclassÚfield)ÚVersion)Ú nullcontext)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionsc
Ctj| d|jd¡ddd}tj| d¡ddd}g}t||ƒD](\}}| tj¡}tj|d| d¡d  d¡}tj
|dd}||} |  | ¡q! t  |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rDÚindex©rDé)
r2ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
ÚlogitsrEÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©r]úT/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothRewardTrainer.pyÚchunked_selective_log_softmax"s  
r_cs†eZdZUdZedddidZeeed<edddidZ ee
ed <eddd
idZ ee
ed <  
                            ! ! " #     $          $      % &  '         (      #    $   ) *       +      d.‡fd,d-„ Z Z
S)/ÚUnslothRewardConfigaI
    Configuration class for the [`RewardTrainer`].

    This class includes only the parameters that are specific to Reward training. For a full list of training
    arguments, please refer to the [`~transformers.TrainingArguments`] documentation. Note that default values in this
    class may differ from those in [`~transformers.TrainingArguments`].

    Using [`~transformers.HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        max_length (`int` or `None`, *optional*, defaults to `1024`):
            Maximum length of the sequences (prompt + completion) in the batch, filters out entries that exceed the
            limit. This argument is required if you want to use the default data collator.
        disable_dropout (`bool`, *optional*, defaults to `True`):
            Whether to disable dropout in the model.
        dataset_num_proc (`int`, *optional*, defaults to `None`):
            Number of processes to use for processing the dataset.
        center_rewards_coefficient (`float`, *optional*, defaults to `None`):
            Coefficient to incentivize the reward model to output mean-zero rewards (proposed by
            https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: `0.01`.
        remove_unused_columns (`bool`, *optional*, defaults to `False`):
            Whether to remove the columns that are not used by the model's forward pass. Can be `True` only if the
            dataset is pretokenized.
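
The `center_rewards_coefficient` parameter described above can be illustrated with a small sketch. It assumes the loss is the mean negative log-sigmoid of the chosen-minus-rejected reward gap (minus an optional margin), plus the coefficient times the mean squared sum of the two rewards; that reading is based on the names visible in the compiled `compute_loss` (`logsigmoid`, `mean`, `margin`, `rewards_chosen`, `rewards_rejected`) and is not confirmed from readable source. The helper names `log_sigmoid` and `reward_loss` are hypothetical, and plain Python floats stand in for tensors.

```python
import math
from typing import Optional, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def reward_loss(
    rewards_chosen: Sequence[float],
    rewards_rejected: Sequence[float],
    margins: Optional[Sequence[float]] = None,
    center_rewards_coefficient: Optional[float] = None,
) -> float:
    """Pairwise preference loss with an optional mean-zero penalty (illustrative only)."""
    n = len(rewards_chosen)
    if margins is None:
        margins = [0.0] * n
    # Mean of -log(sigmoid(chosen - rejected - margin)) over the batch.
    loss = -sum(
        log_sigmoid(c - r - m)
        for c, r, m in zip(rewards_chosen, rewards_rejected, margins)
    ) / n
    if center_rewards_coefficient is not None:
        # Penalise rewards whose sum drifts away from zero.
        penalty = sum((c + r) ** 2 for c, r in zip(rewards_chosen, rewards_rejected)) / n
        loss += center_rewards_coefficient * penalty
    return loss
```

With equal rewards the pairwise term is `log(2)` regardless of their magnitude, so only the centering penalty distinguishes a pair of rewards at `1.0` from a pair at `0.0`.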
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrAz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksz'Maximum sequence length to truncate to.Úmax_seq_lengthFÚnorBéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrGéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéécˆŠ s´|dkr td|dƒ|dkrtd|dƒ|dur(|#dkr(|$dkr(d}d }#|ƒdur:d
d lm}‰t|‰ƒd d
ƒ}ƒtƒjdid|d|d|d|d|d|d|d|d| “d|
d| d| d|
d|d|d|d|d|d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-| “d.|!“d/|"“d0|#“d1|$“d2|%“d3|&“d4|'“d5|(“d6|)“d7|*“d8|+“d9|,“d:|-“d;|.“d<|/“d=|0“d>|1“d?|2“d@|3“dA|4“dB|5“dC|6“dD|7“dE|8“dF|9“dG|:“dH|;“dI|<“dJ|=“dK|>“dL|?“dM|@“dN|A“dO|B“dP|C“dQ|D“dR|E“dS|F“dT|G“dU|H“dV|I“dW|J“dX|K“dY|L“dZ|M“d[|N“d\|O“d]|P“d^|Q“d_|R“d`|S“da|T“db|U“dc|V“dd|W“de|X“df|Y“dg|Z“dh|[“di|\“dj|]“dk|^“dl|_“dm|`“dn|a“do|b“dp|c“dq|d“dr|e“ds|f“dt|g“du|h“dv|i“dw|j“dx|k“dy|l“dz|m“d{|n“d||o“d}|p“d~|q“d|r“d€|s“d|t“d|u“dƒ|v“d„|w“d…|x“d†|y“d‡|z“dˆ|{“d‰||“dŠ|}“d|~“dŒ|d|€“dŽ|d|‚“d|ƒ“d‘|„“|ˆ¤Ž|…|_|†|_|‡|_ dS)“NgH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rGza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rurvÚunsloth_training_checkpointsrgr)Ú cpu_countrBrhÚ
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚ
max_lengthÚdisable_dropoutÚdataset_num_procÚcenter_rewards_coefficientr])
ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingrƒÚmaxÚsuperÚ__init__rdrerf)ŠÚselfr„r…r†r‡r‰rrrrrrr“r”r•r–r—r™rrr r­r¿rÿrrrrrrrrrdrerfÚkwargsrƒ©Ú __class__r]r^r
]s@   ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·JKµL´M³N²O±P°Q¯R®S­T¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžcdœefšgh˜ijklmnopqrŽstŒuvŠwxˆyz{|}ƒ~ÿþýüû
zUnslothRewardConfig.__init__)‡NNFFFrgFrBrBNNrhrhrrirjrkrlrmrnrorprArqrrrrsrtTNruFrGFrurvNTFFFFFFrwrwFFFFrxryFFNrANNFrzFNrNrANNFNFNNFrzrNNNNr{r|NFFr}NNNNTFTFFNNr~NNFNFNFTryNNNrzTFNrr€FNNFFNNFFFNFTrTNNNrAN)Ú__name__Ú
__module__Ú __qualname__Ú__doc__r6rdrrÚ__annotations__reÚintrfr
Ú
__classcell__r]r]rr^r`3s.
þþþ÷r`cseZdZddgZ            d(deeeejfdee dee
dee d eee e e
e ffd
eeeeeefd eegefd eeege fd
eeedeejjejjjfdeeejejgejfdee ffdd
Z  d)deeejfde e
eejeffdeejeeje e
ejffffddZ d*deeejfde e
eejeffde deee
deeejeejeejff
ddZ!‡fddZ"de#fdd „Z$‡fd!d"„Z%   d+d#ee
d$ee
d%ee
ee
dffd&d'„Z&‡Z'S),Ú_UnslothRewardTrainerÚtrlzreward-trainerN©NNÚmodelÚargsÚ
data_collatorÚ
train_datasetÚ eval_datasetÚprocessing_classÚ
model_initÚcompute_metricsÚ callbacksÚ
optimizersÚpreprocess_logits_for_metricsÚ peft_configc

sHtƒs | dur tdƒtƒrV| durVt|tƒsVt|ddƒs#t|ddƒrTdtt t¡j ƒv}
d|j
i}|
s@|j dur@t  
dt¡n |
rL|j durL|j |d<t|fi|¤Ž}|}|jr]t|ƒ|durct}|dur˜|durotd ƒ|jt|ƒ}|jr”zd|_Wntyt|dd
}Ynwt  
d t¡d |_nd|_d |jd
<d|jvrtƒ ¡Nd|i}|jtd|id}|jtd ||jd}|j ‡fdd|jd}|durò|jtd|id}|jt|d |jd}|j ‡fdd|jd}Wdƒn1süwYt!ƒj"||||||||| |
| d t#|j$dƒr"|j$ %|j&¡dSdS)
        Initialize RewardTrainer.

        Args:
            model (`transformers.PreTrainedModel`):
                The model to train, preferably an `AutoModelForSequenceClassification`.
            args (`RewardConfig`):
                The arguments to use for training.
            data_collator (`transformers.DataCollator`):
                The data collator to use for training. If None is specified, the default data collator
                (`RewardDataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of
                the sequences in the batch, given a dataset of paired sequences.
            train_dataset (`datasets.Dataset`):
                The dataset to use for training.
            eval_dataset (`datasets.Dataset`):
                The dataset to use for evaluation.
            processing_class ([`~transformers.PreTrainedTokenizerBase`], [`~transformers.BaseImageProcessor`], [`~transformers.FeatureExtractionMixin`] or [`~transformers.ProcessorMixin`], *optional*, defaults to `None`):
                Processing class used to process the data. If provided, will be used to automatically process the
                inputs for the model, and it will be saved along the model to make it easier to rerun an interrupted
                training or reuse the fine-tuned model.
            model_init (`Callable[[], transformers.PreTrainedModel]`):
                The model initializer to use for training. If None is specified, the default model initializer will be
                used.
            compute_metrics (`Callable[[transformers.EvalPrediction], dict]`, *optional* defaults to `compute_accuracy`):
                The metrics to use for evaluation. If no metrics are specified, the default metric (`compute_accuracy`)
                will be used.
            callbacks (`list[transformers.TrainerCallback]`):
                The callbacks to use for training.
            optimizers (`tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]`):
                The optimizer and scheduler to use for training.
            preprocess_logits_for_metrics (`Callable[[torch.Tensor, torch.Tensor], torch.Tensor]`):
                The function to use to preprocess the logits before computing the metrics.
            peft_config (`dict`, defaults to `None`):
                The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped
                in a PEFT model.
NzvPEFT is not installed and you passed a `peft_config` in the trainer's kwargs, please install it to use the PEFT modelsÚis_loaded_in_8bitFÚ is_quantizedrêÚuse_gradient_checkpointingzÂYou passed `gradient_checkpointing_kwargs` in the trainer's kwargs, but your peft version does not support it. please update to the latest version of peft to use `gradient_checkpointing_kwargs`.zYA processing_class must be specified when using the default RewardDataCollatorWithPadding)z°When using RewardDataCollatorWithPadding, you should set `remove_unused_columns=False` in your RewardConfig we have set it for you, but you should do it yourself in the future.TÚestimate_tokensÚinput_ids_chosenÚ tokenizer)Ú fn_kwargs)Úbatchedr.Únum_proccó t|dƒˆkot|dƒˆkS©Nr,Úinput_ids_rejected©Úlen©Úrr]r^Ú<lambda>s z0_UnslothRewardTrainer.__init__.<locals>.<lambda>)r0)r.r/r0cr1r2r4r6r8r]r^r9s) rrrrr r!r"r#r$r%r&Úadd_model_tags)'r&Ú
ValueErrorÚ
isinstancerÚgetattrÚlistr%Ú signaturer/Ú
parametersrér3ÚwarnÚ UserWarningrr!rrrrr1Úuse_reward_data_collatorÚwarnings_issuedÚ column_namesrÚmain_process_firstÚmapr*rrÚfilterr r
Úhasattrrr:Ú
_tag_names)rrrrrr r!r"r#r$r%r&r'Ú_supports_gc_kwargsÚprepare_model_kwargsr.rr8r^r
~s´8ÿ

ÿ
ý
ÿ
 ÿý
 ü
þÿü
ýã#õÿz_UnslothRewardTrainer.__init__FÚinputsÚreturncC||d|dddd}||d|dddd}d|vr.tj |||d¡ ¡ }n tj ||¡ ¡ }|jjdurN||jjt ||d ¡7}|rW|||d
œfS|S) Nr,Úattention_mask_chosenT)Ú input_idsÚattention_maskÚ return_dictrTr3Úattention_mask_rejectedÚmarginrh)Úrewards_chosenÚrewards_rejected)r,rÚ
logsigmoidÚmeanrrr2)rrrMÚreturn_outputsÚnum_items_in_batchrUrVÚlossr]r]r^Ú compute_loss0s2ýüýü  þz"_UnslothRewardTrainer.compute_lossrŠÚ ignore_keysc | |¡}ˆdurt|jdƒrt|jjdgƒngt ¡|j||dd\}}Wdƒn1s3wY|r?|ddfS| ¡}t ‡fdd| 
¡Dƒƒ}t |ƒ}t  |¡j
ddjd dj}t |jd ¡}| |¡}|||fS)
configÚkeys_to_ignore_at_inferenceT)rYc3s |] \}}|ˆvr|VqdS©Nr])Ú.0Úr]r]r^Ú <genexpr>fsz8_UnslothRewardTrainer.prediction_step.<locals>.<genexpr>rhrFr)Ú_prepare_inputsrIrr=r^r2Úno_gradr\ÚdetachÚtupleÚitemsr+ÚstackrXÚsoftmaxÚzerosrJ) rrrMr]r[Ú logits_dictrTÚlabelsr]rdr^Úprediction_stepQs"
 
ÿ


z%_UnslothRewardTrainer.prediction_stepcs(| dd¡}| |¡tƒj|i|¤ŽS)num_print_samplesrB)ÚpopÚvisualize_samplesr Úevaluate)rrrrrrr]r^ruqs 
z_UnslothRewardTrainer.evaluaterrc Cs>| ¡}ttƒ}t|ƒD]P\}}|j|j|dd\}}}t|d|jƒ}t|d|jƒ}|d t |ƒ¡|d t |ƒ¡|d t dd „| 
¡Dƒƒ¡|d
kr\t |dƒ|kr\nq t  
|¡} |jjd
krtƒrst| d |ƒd |jjvrd
d l}
|
jd ur|
 d