unsloth_compiled_cache/__pycache__/UnslothKTOTrainer.cpython-310.pyc

o
ö×°hB‡ã@sBdZddlmZddlZddlmZddlmZddlmZm	Z	m
Z
mZmZm
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZmZmZm Z m!Z!m"Z"m#Z#m$Z$m%Z%mZm&Z&m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1m2Z2m3Z3m4Z4m5Z5m6Z6m7Z7m8Z8m9Z9m:Z:m;Z;mZm<Z<m=Z=m>Z>m?Z?m@Z@mAZAmBZBmCZCmDZDmEZEmFZFmZmGZGmHZHmZm
Z
mZmZm#Z#m5Z5m>Z>mZddl>Z>ddlTddlImJZJmKZKdd	lLmMZMddlZddlNZ<dd
lOm=Z=ddlmZddlPmQZQmRZSdd
dd
d
dœZTejUddeTd�dd„ƒZVeJGdd„deƒƒZW	Gdd„de#ƒZXGdd„deXƒZYdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)GrÚAutoModelForCausalLMÚBaseImageProcessorrÚDPODataCollatorWithPaddingÚDataCollatorÚ
DataLoaderÚDatasetÚEvalLoopOutputÚFÚFeatureExtractionMixinÚ	KTOConfigÚ
KTOTrainerÚLiteralrÚPartialStateÚPathÚ	PeftModelÚPreTrainedModelÚPreTrainedTokenizerBaseÚProcessorMixinÚSequentialSamplerÚTrainerÚTrainerCallbackÚTrainingArgumentsrÚ_get_kl_datasetÚ_process_tokensÚ	_tokenizeÚautocastÚconcatenate_datasetsÚcontextmanagerÚcreate_reference_modelÚdefaultdictÚdisable_dropout_in_modelÚgenerate_model_cardÚget_comet_experiment_urlÚ
has_lengthÚinspectÚis_comet_availableÚis_liger_kernel_availableÚis_peft_availableÚis_wandb_availableÚ
itemgetterÚlog_table_to_comet_experimentÚmaybe_apply_chat_templateÚmaybe_extract_promptÚmaybe_unpair_preference_datasetÚnnÚnpÚnullcontextÚosÚ
pad_to_lengthÚpdÚpeft_module_casting_to_bf16Úprepare_deepspeedÚprepare_model_for_kbit_trainingÚrandomÚselective_log_softmaxÚtextwrapÚtorchÚtqdmÚwarningsrrrrrr1r;rD)Ú*)Ú	dataclassÚfield)ÚVersion)r:)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚmax_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ	fullgraphÚoptionsc
Cs¾tj| d|jd¡ddd�}tj| d¡ddd�}g}t||ƒD](\}}| tj¡}tj|d| d¡d� 	d¡}tj
|dd�}||}	| |	¡q!	t |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rVÚindex)rVé)
rDÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ	unsqueezeÚsqueezeÚ	logsumexpÚappendÚconcat)
ÚlogitsrWÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚchunk_logitsÚchunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©rnúQ/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothKTOTrainer.pyÚchunked_selective_log_softmax"s
rpcs¤eZdZUdZedddid�Zeeed<edddid�Z	ee
ed	<eddd
id�Zee
ed<						
																														 									!	!					"	#								$														$						%	&				'												(									#				$				)	*														+	,			-			.		/									0			d3‡fd1d2„	Z‡Z
S)4ÚUnslothKTOConfiguÐ
    
    Configuration class for the [`KTOTrainer`].

    This class includes only the parameters that are specific to KTO training. For a full list of training arguments,
    please refer to the [`~transformers.TrainingArguments`] documentation. Note that default values in this class may
    differ from those in [`~transformers.TrainingArguments`].

    Using [`~transformers.HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        max_length (`int` or `None`, *optional*, defaults to `1024`):
            Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want
            to use the default data collator.
        max_prompt_length (`int` or `None`, *optional*, defaults to `512`):
            Maximum length of the prompt. This argument is required if you want to use the default data collator.
        max_completion_length (`int` or `None`, *optional*, defaults to `None`):
            Maximum length of the completion. This argument is required if you want to use the default data collator
            and your model is an encoder-decoder.
        beta (`float`, *optional*, defaults to `0.1`):
            Parameter controlling the deviation from the reference model. Higher β means less deviation from the
            reference model.
        loss_type (`str`, *optional*, defaults to `"kto"`):
            Type of loss to use. Possible values are:

                - `"kto"`: KTO loss from the [KTO](https://huggingface.co/papers/2402.01306) paper.
                - `"apo_zero_unpaired"`: Unpaired variant of APO-zero loss from the
                  [APO](https://huggingface.co/papers/2408.06266) paper.

        desirable_weight (`float`, *optional*, defaults to `1.0`):
            Desirable losses are weighted by this factor to counter unequal number of desirable and undesirable pairs.
        undesirable_weight (`float`, *optional*, defaults to `1.0`):
            Undesirable losses are weighted by this factor to counter unequal number of desirable and undesirable pairs.
        label_pad_token_id (`int`, *optional*, defaults to `-100`):
            Label pad token id. This argument is required if you want to use the default data collator.
        padding_value (`int` or `None`, *optional*, defaults to `None`):
            Padding value to use. If `None`, the padding value of the tokenizer is used.
        truncation_mode (`str`, *optional*, defaults to `"keep_end"`):
            Truncation mode to use when the prompt is too long. Possible values are `"keep_end"` or `"keep_start"`.
            This argument is required if you want to use the default data collator.
        generate_during_eval (`bool`, *optional*, defaults to `False`):
            If `True`, generates and logs completions from both the model and the reference model to W&B or Comet
            during evaluation.
        is_encoder_decoder (`bool` or `None`, *optional*, defaults to `None`):
            When using the `model_init` argument (callable) to instantiate the model instead of the `model` argument,
            you need to specify if the model returned by the callable is an encoder-decoder model.
        precompute_ref_log_probs (`bool`, *optional*, defaults to `False`):
            Whether to precompute reference model log probabilities for training and evaluation datasets. This is
            useful when training without the reference model to reduce the total GPU memory needed.
        model_init_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
            Keyword arguments to pass to `AutoModelForCausalLM.from_pretrained` when instantiating the model from a
            string.
        ref_model_init_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
            Keyword arguments to pass to `AutoModelForCausalLM.from_pretrained` when instantiating the reference model
            from a string.
        dataset_num_proc (`int` or `None`, *optional*, defaults to `None`):
            Number of processes to use for processing the dataset.
        disable_dropout (`bool`, *optional*, defaults to `True`):
            Whether to disable dropout in the model and reference model.
        use_liger_loss (`bool`, *optional*, defaults to `False`):
            Whether to use Liger loss. It requires liger-kernel to be installed.
        base_model_attribute_name (`str`, *optional*, defaults to `"model"`):
            Name of the attribute in the model that contains the base model. This is used to get the base model from
            the model when the model does not have a `get_decoder` method in the case when `use_liger_loss` is `True`.
    
    NÚhelpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrSz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksz'Maximum sequence length to truncate to.Úmax_seq_lengthFÚnorTéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?ç@Úlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrXéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéééÚktoéœÿÿÿÚkeep_endÚmodelc—™s|dkrtd|›d�ƒ‚|dkrtd|›d�ƒ‚|dur(|#dkr(|$dkr(d}d	}#|‘dur:d
dlm}˜t|˜ƒdd
ƒ}‘tƒjd¡id|“d|“d|“d|“d|“d|“d|“d|“d|	“d|
“d|“d|“d|
“d|“d|“d|“d|“d|“d |“d!|“d"|“d#|“d$|“d%|“d&|“d'|“d(|“d)|“d*|“d+|“d,|“d-| “d.|!“d/|"“d0|#“d1|$“d2|%“d3|&“d4|'“d5|(“d6|)“d7|*“d8|+“d9|,“d:|-“d;|.“d<|/“d=|0“d>|1“d?|2“d@|3“dA|4“dB|5“dC|6“dD|7“dE|8“dF|9“dG|:“dH|;“dI|<“dJ|=“dK|>“dL|?“dM|@“dN|A“dO|B“dP|C“dQ|D“dR|E“dS|F“dT|G“dU|H“dV|I“dW|J“dX|K“dY|L“dZ|M“d[|N“d\|O“d]|P“d^|Q“d_|R“d`|S“da|T“db|U“dc|V“dd|W“de|X“df|Y“dg|Z“dh|[“di|\“dj|]“dk|^“dl|_“dm|`“dn|a“do|b“dp|c“dq|d“dr|e“ds|f“dt|g“du|h“dv|i“dw|j“dx|k“dy|l“dz|m“d{|n“d||o“d}|p“d~|q“d|r“d€|s“d�|t“d‚|u“dƒ|v“d„|w“d…|x“d†|y“d‡|z“dˆ|{“d‰||“dŠ|}“d‹|~“dŒ|“d�|€“dŽ|�“d�|‚“d�|ƒ“d‘|„“d’|…“d“|†“d”|‡“d•|ˆ“d–|‰“d—|Š“d˜|‹“d™|Œ“dš|�“d›|Ž“dœ|�“d�|�“dž|‘“dŸ|’“d |““|—¤Ž|”|_|•|_|–|_	dS)¢NgH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rXza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!r†r‡Úunsloth_training_checkpointsrxr)Ú	cpu_countrTryÚ
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚweight_decayÚ
adam_beta1Ú
adam_beta2Úadam_epsilonÚ
max_grad_normÚnum_train_epochsÚ	max_stepsÚlr_scheduler_typeÚwarmup_ratioÚwarmup_stepsÚ	log_levelÚlog_level_replicaÚlog_on_each_nodeÚlogging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ	data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚdisable_tqdmÚremove_unused_columnsÚlabel_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚfsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ	deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ	adafactorÚgroup_by_lengthÚlength_column_nameÚ	report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚpush_to_hubÚresume_from_checkpointÚhub_model_idÚhub_strategyÚ	hub_tokenÚhub_private_repoÚhub_always_pushÚhub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚfp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚtorchdynamoÚ	ray_scopeÚddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚ
max_lengthÚmax_prompt_lengthÚmax_completion_lengthÚbetaÚ	loss_typeÚdesirable_weightÚundesirable_weightÚlabel_pad_token_idÚ
padding_valueÚtruncation_modeÚgenerate_during_evalÚis_encoder_decoderÚdisable_dropoutÚprecompute_ref_log_probsÚmodel_init_kwargsÚref_model_init_kwargsÚdataset_num_procÚuse_liger_lossÚbase_model_attribute_namern)
ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr™ÚmaxÚsuperÚ__init__rurvrw)™Úselfršr›rœr�ržrŸr r¡r¢r£r¤r¥r¦r§r¨r©rªr«r¬rr®r¯r°r±r²r³r´rµr¶r·r¸r¹rºr»r¼r½r¾r¿rÀrÁrÂrÃrÄrÅrÆrÇrÈrÉrÊrËrÌrÍrÎrÏrÐrÑrÒrÓrÔrÕrÖr×rØrÙrÚrÛrÜrÝrÞrßràrárârãrärårærçrèrérêrërìrírîrïrðrñròrórôrõrör÷rørùrúrûrürýrþrÿrrrrrrrrrr	r
rrr
rrrrrrrrrrrrrrrrrrr r!r"r#r$r%r&r'r(r)r*r+r,rurvrwÚkwargsr™©Ú	__class__rnror2…s¸ÿþýüûúùø	÷
öõô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·J¶KµL´M³N²O±P°Q¯R®ST¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžc�dœe›fšg™h˜i—j–k•l”m“n’o‘p�q�rŽs�tŒu‹vŠw‰xˆy‡z†{…|„}ƒ~‚��ÿ�þ�ý�ü�û�ú�ù�ø	�÷
�ö�õ�ô
�ó�ò�ñ�ð�ï�î�í�ì
zUnslothKTOConfig.__init__)–NNFFFrxFrTrTNNryryrrzr{r|r}r~rr€r�rSr‚rƒrr„r…TNr†FrXFr†r‡NTFFFFFFrˆrˆFFFFr‰rŠFFNrSNNFr‹FNrNrSNNTNFNNFr‹rNNNNrŒr�NFFrŽNNNNTFTFFNNr�NNFNFNFTrŠNNNr‹TFNr�r‘FNNFFNNFFFNFTr’r“Nrƒr”r€r€r•Nr–FNTFNNNFr—NrSN)Ú__name__Ú
__module__Ú__qualname__Ú__doc__rIrurrÚ__annotations__rvÚintrwr2Ú
__classcell__rnrnr5rorq3sL
Dþþþ�èrqc süeZdZdZddgZ															d^deeeje	fde
eeeje	fded	e
ed
e
eee
e	effde
eeeeefde
ed
e
egefde
eedeejjejjjfde
eejejgejfde
e
de
eege
fde
e	de
e	f‡fdd„
Zedd„ƒZ de!f‡fdd„Z"d_d
e
ede!f‡fdd„
Z#de
de
fdd „Z$e%	!	"	!d`d#ej&d$ej'd%e(d&e)d'e(dej&fd(d)„ƒZ*dejd*e
e	eeej'ffdeej&ej&ej&ej&ffd+d,„Z+d-ej&d.ej&d/ej&d0ej&d1ej&d2ej&deej&ej&ej&ej&ffd3d4„Z,d5d6„Z-d7d8„Z.d*e
e	eeej'fffd9d:„Z/	!	dadeeejfd;e
e	eeje0ffdeejeeje
e	ejffffd<d=„Z1dbd?e
e	e2fd@e3dAddfdBdC„Z4d_dDe
ede
ej5j6j7fdEdF„Z8d*e
e	ej'fdee	e	ffdGdH„Z9	d_deeejfd;e
e	eeje0ffdIe(dJe
ee	fdKdL„Z:			MdcdNe!dOe	dIe
e(dJe
ee	dPe	def‡fdQdR„
Z;d_dSe
e	e2fdTe
e2ddf‡fdUdV„
Z<‡fdWdX„Z=			dddYe
e	dZe
e	d[ee	ee	dffd\d]„Z>‡Z?S)eÚ_UnslothKTOTrainerr‹Útrlr”N©NNr—Ú	ref_modelÚargsÚ
train_datasetÚeval_datasetÚprocessing_classÚ
data_collatorÚ
model_initÚ	callbacksÚ
optimizersÚpreprocess_logits_for_metricsÚpeft_configÚcompute_metricsÚmodel_adapter_nameÚref_adapter_namec$
sœ	t|ƒtur
tdƒ‚t|tƒs||urtdƒ‚|jduri}n9t|tƒs(tdƒ‚|j}| d¡}|durXt|tƒrB|dkrBtt|ƒ}|dkrTt|tj	ƒsTtd|›d�ƒ‚||d<|j
dur`i}n9t|tƒsitdƒ‚|j
}| d¡}|dur™t|tƒrƒ|dkrƒtt|ƒ}|dkr•t|tj	ƒs•td|›d�ƒ‚||d<t|tƒr§tj|fi|¤Ž}t|tƒrµtj|fi|¤Ž}d	|_
tƒsÃ|durÃtd
ƒ‚tƒ�r5|du�r5t|tƒrÕ| ¡}t|dd	ƒsât|dd	ƒ�rt|d
ƒoðd
tt t¡jƒv}d|ji}|rý|j|d
<t|fi|¤Ž}n|j�r t|dƒ�r| ¡ndd„}| ¡ |¡|}|j�r4t|dd	ƒ�r4t|ƒd|_
n|j�rOt|dƒ�rD| ¡ndd„}| ¡ |¡|j�r_tƒ�s_t ƒ�s_tdƒ‚|du�rj|j!j"|_"n|j"du�rttdƒ‚|j"|_"tƒ�o€t|tƒ|_#||_$||_%|�r�||_&n|j#�s—|j'�r›d|_&nt(|ƒ|_&|du�r©tdƒ‚|j)du�r·t* +dt,¡d}|j)du�rÀ|j)}|j-du�rÎt* +dt,¡d}|j-du�r×|j-}d}|j.du�rë|j"�rët* +dt,¡d}|j.du�rø|j"�rø|j.}|du�rt/|j0|j1|j"d�}|j2�rd	|_2t* +dt,¡d|_3nd	|_3|j4�r.t5|ƒ|j&du�r.t5|j&ƒ|j6|_6||_)|j|_|j1|_1|j7du�rF|j7n|j0|_7||_-|j8|_8||_.||_9|j'|_'d|_:|j6dv�rgd	|_:d	|_;d	|_<t=dd „ƒ|_>|j?|_?|j@|_@|jA|_At|j!d!d	ƒ|_Bt|j!d"d#ƒ|_C|jB�r |jCd#k�r t* +d$t,¡d|jDd%<tEƒ F¡��qˆjGtH|jId&d'�‰tJˆ|jId(d)�‰ˆjGtKd*|i|jId+d,�‰ˆdu�rëˆjGtH|jId-d'�‰tJˆ|jId.d)�‰ˆjGtKd*|i|jId/d,�‰ˆjGtLdd*|j9i|jId0d1�‰d2|j"|j9|j)|j8|j1|j-|j.d3œ}ˆjGtM||jId4d,�‰ˆdu�r2ˆjGtLd*|j9id|jId5d6�‰ˆjGtM||jId7d,�‰|j:�r—|jNd8k�r@td9ƒ‚ˆjGtOd|jN|jId:d;�}d<|d=<|jGtM||jI‡fd>d?„|jPDƒd@dA�}tQˆ|gd8dB�‰ˆdu�r—ˆjGtOd|jN|jIdCd;�}|jGtM||jI‡fdDd?„|jPDƒdEdA�}tQˆ|gd8dB�‰tRtSˆdFƒd8ƒ}tRtTˆdFƒ|d8ƒ}||k�rtU||jA|d8dGƒ}tU||jA|dHdGƒ}tU||j@|dHdGƒ} tU||j@|d8dGƒ}!||j@k�oë|kn}"| |jAk�où|!kn}#|"�s|#�st* +dI|›dJ|›dK| ›dJ|!›dL�	t,¡Wdƒn	1�s wYtVƒjW|||ˆˆ|||
|	|
|dM�d	|_Xt|jYdNƒ�rG|jY Z|j[¡t|dOƒ�sQt\dPƒ‚|j]�rf|j^j_j`jadQk�rf|j'�rftdRƒ‚|j&du�ry|j#�sx|j'�sxtdSƒ‚n|j]�r†tb|j&|j^ƒ|_&n
|j^jc|j&ddT�|_&|jdje�rÌtfƒ�s�tgdUƒ‚|j6dv�r§tdVƒ‚|j'�r¯tdWƒ‚|j#�s¹|j%du�r½tdXƒ‚th|j1|j?|j&dudY�|_idSdS)ZNz1Please use `KTOConfig` instead TrainingArguments.zœ`model` and `ref_model` cannot be the same object. If you want `ref_model` to be the same as `model`, you must mass a copy of it, or `None` if you use peft.zRYou passed model_kwargs to the KTOTrainer. But your model is already instantiated.Útorch_dtyperŠznInvalid `torch_dtype` passed to the KTOConfig. Expected a string with either `torch.dtype` or 'auto', but got Ú.zZYou passed ref_model_kwargs to the KTOTrainer. But your ref_model is already instantiated.FzŽPEFT is not installed and you passed a `peft_config` in the trainer's kwargs, please install it with `pip install peft` to use the PEFT modelsÚis_loaded_in_8bitÚis_loaded_in_4bitrÚuse_gradient_checkpointingÚenable_input_require_gradscSó| d¡dS©NT©Úrequires_grad_©ÚmoduleÚinputÚoutputrnrnroÚmake_inputs_require_grad-óz=_UnslothKTOTrainer.__init__.<locals>.make_inputs_require_gradTcSrUrVrWrYrnrnror]Br^z‚`generate_during_eval=True` requires Weights and Biases or Comet to be installed. 
Please install `wandb` or `comet-ml` to resolve.zMWhen no model is provided, you need to pass the parameter is_encoder_decoder.zdmax_length or a processing_class must be specified when using the default DPODataCollatorWithPaddingz¬When using DPODataCollatorWithPadding, you should set `max_length` in the KTOTrainer's init it will be set to `512` by default, but you should do it yourself in the future.r“z³When using DPODataCollatorWithPadding, you should set `max_prompt_length` in the KTOTrainer's init it will be set to `128` by default, but you should do it yourself in the future.é€zÜWhen using DPODataCollatorWithPadding with an encoder decoder architecture, you should set `max_completion_length` in the KTOTrainer's init it will be set to `128` by default, but you should do it yourself in the future.)Úpad_token_idr!r%zªWhen using DPODataCollatorWithPadding, you should set `remove_unused_columns=False` in your KTOConfig we have set it for you, but you should do it yourself in the future.)Úapo_zero_unpairedcSsttƒS©N)r)ÚlistrnrnrnroÚ<lambda>³sz-_UnslothKTOTrainer.__init__.<locals>.<lambda>Úoutput_router_logitsÚrouter_aux_loss_coefrŒa-You set `output_router_logits` to `True` in the model config, but `router_aux_loss_coef` is set to `0.0`, meaning the auxiliary loss will not be used. Either set `router_aux_loss_coef` to a value greater than `0.0`, or set `output_router_logits` to `False` if you don't want to use the auxiliary loss.Úestimate_tokensz$Extracting prompt from train dataset)Únum_procÚdesczUnpairing train dataset)riÚ	tokenizerz'Applying chat template to train dataset)Ú	fn_kwargsrhriz#Extracting prompt from eval datasetzUnpairing eval datasetz&Applying chat template to eval datasetzTokenizing train dataset)Úbatchedrkrhrir‹)Úprefixr%rjrr#r!rrz"Processing tokenized train datasetzTokenizing eval dataset)rkrlrhriz!Processing tokenized eval datasetrXz‡Actual (not effective) batch size must be > 1. 
KTO will not work properly because the KL term will be equivalent to the implied reward.zExtracting KL train dataset)rlÚ
batch_sizerhriÚKL_rmcóg|]	}|ˆjvr|‘qSrn©Úcolumn_names©Ú.0Úc)rCrnroÚ
<listcomp>/óz/_UnslothKTOTrainer.__init__.<locals>.<listcomp>z%Processing tokenized train KL dataset)rkrhÚremove_columnsri)ÚaxiszExtracting eval KL datasetcrprnrqrs©rDrnrorvDrwz$Processing tokenized eval KL datasetÚlabelrygHáz®Gõ?zìYou have different amounts of desirable/positive and undesirable/negative examples but the weights on the desirable and undesirable losses don't seem to be in an ideal range. Based on your data, we recommend EITHER desirable_weight in [z, z] or undesirable_weight in [zN] (but NOT BOTH). See the documentation on how to optimally set these weights.)r—rBrFrCrDrErGrLrHrIrJÚadd_model_tagsÚacceleratorzXYour `Trainer` does not have an `accelerator` object. Consider upgrading `transformers`.ézrYou cannot use `precompute_ref_log_probs=True` with Deepspeed ZeRO-3. Please set `precompute_ref_log_probs=False`.z]No reference model and model is not a Peft model. Try setting `precompute_ref_log_probs=True`)Úevaluation_modez‚You set `use_liger_loss=True` but the liger kernel is not available. Please install liger-kernel first: `pip install liger-kernel`znYou cannot set `loss_type='apo_zero_unpaired'` with liger-kernel.Only KTO loss is supported with liger-kernel.znYou cannot use `precompute_ref_log_probs=True` with liger kernel. Please set `precompute_ref_log_probs=False`.zYYou cannot use `use_liger_loss=True` with Peft models. Please set `use_liger_loss=False`.)Úignore_indexrÚ
use_ref_model)jÚtyper!Ú
ValueErrorÚ
isinstanceÚstrr(ÚgetÚgetattrrDÚdtyper)rÚfrom_pretrainedÚ_peft_has_been_casted_to_bf16r1rÚmerge_and_unloadÚhasattrrcr.Ú	signaturer@Ú
parametersrÿrrTÚget_input_embeddingsÚregister_forward_hookrÊr>r$r2r/Úconfigr%Ú
is_peft_modelrMrNrAr'r(rrFÚwarnÚUserWarningrrrr`r!rÝÚuse_dpo_data_collatorr&r*rr"r#rEÚcalculate_KLÚ _precomputed_train_ref_log_probsÚ_precomputed_eval_ref_log_probsr)Ú_stored_metricsrrr Úaux_loss_enabledÚ
aux_loss_coefÚwarnings_issuedrÚmain_process_firstÚmapr6r*r7r5r$r#r¡r"rrr&r0ÚsumÚlenÚroundr1r2Úmodel_accepts_loss_kwargsr—r|Ú
_tag_namesÚAttributeErrorÚis_deepspeed_enabledr}ÚstateÚdeepspeed_pluginÚ
zero_stager?Ú
prepare_modelrBr+r0ÚImportErrorÚLigerFusedLinearKTOLossÚkto_loss_fn)$r3r—rArBrCrDrErFrGrHrIrJrKrLrMrNr(rOr)Ú_support_gc_kwargsÚprepare_model_kwargsr]rrrrkÚtrain_kl_datasetÚeval_kl_datasetÚ
num_desirableÚnum_undesirableÚdes_weight_lower_boundÚdes_weight_upper_boundÚund_weight_lower_boundÚund_weight_upper_boundÚdes_weight_in_rangeÚund_weight_in_ranger5)rDrCror2Æs´ÿ


ÿ

ÿ


ÿ

ÿ
ÿ
ÿþ


€
ÿ


ÿýýý
ýý
û
ÿÿü
ÿÿüû	øü
ûüÿûû	
ûû	
ýýüüù€�ôõÿÿÿ€
ÿÿÿÿÿìz_UnslothKTOTrainer.__init__ccsŽ�|jr|js|j |j¡ ¡ntƒ�*|jr|j |j¡dV|jr5|j |jp+d¡WdƒdSWdƒdS1s@wYdS)zWContext manager for handling null reference model (that is, peft adapter manipulation).Nrs)	r’rNr}Úunwrap_modelr—Údisable_adapterr:Úset_adapterrM©r3rnrnroÚnull_ref_context«s€ÿÿý÷"øz#_UnslothKTOTrainer.null_ref_contextÚreturncsü|jry|jsy|jj|j|jj|jjddœ}|j t	|j
fi|¤Ž¡}g}g}t|dd�D]&}| |¡\}}|j 
|¡}| | ¡¡|jrR|j 
|¡}| | ¡¡q,|j
jdt |¡ ¡ ¡d�|_
|jrv|j
jdt |¡ ¡ ¡d�|_
d|_tƒ ¡S)	z·
        Returns the training [`~torch.utils.data.DataLoader`].

        Subclass of transformers.src.transformers.trainer.get_train_dataloader to precompute `ref_log_probs`.
        F©rnÚ
collate_fnÚnum_workersÚ
pin_memoryÚshufflez!Train dataset reference log probs©ÚiterableriÚreference_logps©ÚnameÚcolumnÚreference_KL_logpsT)r'r—rBr¡rFrØrór}ÚpreparerrCrEÚcompute_reference_log_probsÚgather_for_metricsrcÚcpur–Ú
add_columnrDÚcatÚfloatÚnumpyr1Úget_train_dataloader)r3Údataloader_paramsÚdata_loaderÚreference_completion_logpsrÊÚpadded_batchÚreference_completion_logpÚreference_KL_logpr5rnrorÓ¹s6û	€ÿÿ
z'_UnslothKTOTrainer.get_train_dataloaderc	s2|dur
|jdur
tdƒ‚|dur|n|j}|jr’|js’|jj|j|jj|jjddœ}|j	 
t|fi|¤Ž¡}g}g}t|dd�D]&}| 
|¡\}}|j	 |¡}| | ¡¡|jrg|j	 |¡}| | ¡¡qA|jdt |¡ ¡ ¡d�}|jr‡|jd	t |¡ ¡ ¡d�}|jdur�||_d
|_tƒj|d�S)aé
        Returns the evaluation [`~torch.utils.data.DataLoader`].

        Subclass of transformers.src.transformers.trainer.get_eval_dataloader to precompute `ref_log_probs`.

        Args:
            eval_dataset (`torch.utils.data.Dataset`, *optional*):
                If provided, will override `self.eval_dataset`. If it is a [`~datasets.Dataset`], columns not accepted
                by the `model.forward()` method are automatically removed. It must implement `__len__`.
        Nz-Trainer: evaluation requires an eval_dataset.Fr¿z Eval dataset reference log probsrÄrÆrÇrÊTrz)rDrƒr'r˜rBr¢rFrØrór}rËrrErÌrÍrcrÎr–rÏrDrÐrÑrÒr1Úget_eval_dataloader)	r3rDrÔrÕrÖrÊr×rØrÙr5rnrorÚås@û	€ÿÿ
z&_UnslothKTOTrainer.get_eval_dataloaderr×c	CsÎt ¡�²|jdurg| ¡�P|jr<|j|d|d| d¡|dd�j}|jr;|j|d|d| d	¡|d
d�j}n|j|d|dd
�j}|jrW|j|d|dd
�j}Wdƒn1sawYnH|jr”|j|d|d| d¡|dd�j}|jr“|j|d|d| d	¡|d
d�j}n|j|d|dd
�j}|jr¯|j|d|dd
�j}Wdƒn1s¹wY|j	||dd|j|j
d�}|jrá|j	||d
d|j|j
d�}||fSd}||fS)zfComputes log probabilities of the reference model for a single padded batch of a KTO specific dataset.NÚprompt_input_idsÚprompt_attention_maskÚcompletion_decoder_input_idsÚcompletion_labels)Úattention_maskÚdecoder_input_idsÚlabelsÚKL_prompt_input_idsÚKL_prompt_attention_maskÚKL_completion_decoder_input_idsÚKL_completion_labelsÚcompletion_input_idsÚcompletion_attention_mask)rßÚKL_completion_input_idsÚKL_completion_attention_maskF©Úaverage_log_probr%r!)rDÚno_gradrAr½r%r—r†rer–Úget_batch_logpsr!)r3r×Úcompletion_logitsÚ	KL_logitsÚcompletion_logpsÚKL_logpsrnrnrorÌs²


üûüû€þýþý€é€üûüû€ÿþþý€Í8ûû
þz._UnslothKTOTrainer.compute_reference_log_probsFr•rerárër!r%cCs¤|jdd…|jkrtdƒ‚|s*|dd…dd…f ¡}|dd…dd…dd…f}n| ¡}||k}d|||k<t||ƒ}|rK|| d¡| d¡S|| d¡S)aCompute the log probabilities of the given labels under the given logits.

        Args:
            logits:
                Logits of the model (unnormalized). Shape: (batch_size, sequence_length, vocab_size)
            labels:
                Labels for which to compute the log probabilities. Label tokens with a value of label_pad_token_id are
                ignored. Shape: (batch_size, sequence_length)
            average_log_prob:
                If True, return the average log probability per (non-masked) token. Otherwise, return the sum of the
                log probabilities of the (non-masked) tokens.

        Returns:
            A tensor of shape (batch_size,) containing the average/sum log probabilities of the given labels under the
            given logits.
        NrSzKLogits (batch and sequence length dim) and labels must have the same shape.rXr)r[rƒÚclonerBrŸ)rerárër!r%Ú	loss_maskrmrnrnroríks
z"_UnslothKTOTrainer.get_batch_logpsÚbatchcs"| |ˆ¡}|jrˆdˆ d¡dœni}|jrd|d<|ˆdfdˆdi|¤Ž}|j}|j|ˆdd	|j|jd
�}|jdtˆdƒkrJt	d
ƒ‚‡fdd„t
|jdƒDƒ}‡fdd„t
|jdƒDƒ}	||df}
||	df}||df}||	df}
|jrŠ|
|||
||jfS|
|||
|fS)NrÞrÝ©ráràTrerærßrçFrêrr{z‡There is a mismatch between the number of examples in this batch and the number of examples for which an output sequence was predicted.có g|]}ˆd|dur|‘qS©r{Trn©rtÚi©rôrnrorv¾ó z._UnslothKTOTrainer.forward.<locals>.<listcomp>crö©r{Frnrørúrnrorv¿rû.)Ú_compute_kl_logpsr%r†ršrerír!r[r rƒÚrangeÚaux_loss)r3r—rôrñÚmodel_kwargsÚoutputsrîrðÚ
chosen_idxÚrejected_idxÚchosen_logpsÚrejected_logpsÚ
chosen_logitsÚrejected_logitsrnrúroÚforward™sLüþúÿþýûÿz_UnslothKTOTrainer.forwardÚpolicy_chosen_logpsÚpolicy_rejected_logpsÚpolicy_KL_logpsÚreference_chosen_logpsÚreference_rejected_logpsrÊcCs¢|jr|| ¡ ¡}|j |¡ ¡jdd�}n	t d¡ |j	¡}|j
ddks/|j
ddkr\||}|jdkrEdt 
|j||¡}	n|jdkrTdt 
|j|¡}	|j| ¡}
nt g¡ |jj	¡}	t g¡ |jj	¡}
|j
ddks~|j
ddkr©||}|jdkr”dt 
|j||¡}n
|jdkr¡t 
|j|¡}|j| ¡}
nt g¡ |jj	¡}t g¡ |jj	¡}
t |j|	|j|fd¡}||
|
|fS)avCompute the KTO loss for a batch of policy and reference model log probabilities.

        Args:
            policy_chosen_logps:
                Log probabilities of the policy model for the chosen responses. Shape: (num(chosen) in batch_size,)
            policy_rejected_logps:
                Log probabilities of the policy model for the rejected responses. Shape: (num(rejected) in batch_size,)
            policy_KL_logps: Log probabilities of the policy model for the KL responses. Shape: (batch_size,)
            reference_chosen_logps:
                Log probabilities of the reference model for the chosen responses. Shape: (num(chosen) in batch_size,)
            reference_rejected_logps:
                Log probabilities of the reference model for the rejected responses. Shape: (num(rejected) in
                batch_size,)
            reference_KL_logps: Log probabilities of the reference model for the KL responses. Shape: (batch_size,)

        Returns:
            A tuple of four tensors: (losses, chosen_rewards, rejected_rewards, KL). The losses tensor contains the KTO
            loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for
            the chosen and rejected responses, respectively. The KL tensor contains the detached KL divergence estimate
            between the policy and reference models.
        r©ÚminrXr”ra)r–ÚmeanÚdetachr}rÍÚclamprDÚzerosr]Údevicer[rrÚsigmoidrrrÐrr )r3r	r
rrr
rÊÚklÚchosen_logratiosÚ
chosen_lossesÚchosen_rewardsÚrejected_logratiosÚrejected_lossesÚrejected_rewardsÚlossesrnrnroÚkto_lossÌs6


þz_UnslothKTOTrainer.kto_losscCsœd}|jrL|jr|d|d|d| d¡dœ}n	|d|dd	œ}t ¡�|di|¤Žj}Wdƒn1s9wY|j||dd
|j|jd�}|S)
z/Compute KL log probabilities for a given batch.Nrârãrårä)Ú	input_idsrßráràrèré)rrßFrêrn)r–r%r†rDrìrerír!)r3r—rôrñÚKL_model_kwargsrïrnrnrorýs,üþ
ÿûz$_UnslothKTOTrainer._compute_kl_logpscCsž| ||¡}| |j|¡}|jr%|| ¡ ¡}|j |¡ ¡jdd�}n
t 	d¡ 
|jj¡}|jr<|d| 
d¡dœni}|jrEd|d<|jr�| ¡|d	f|d
ddœ|¤Ž}| ¡d|d|jd
dœ|¤Ž}|j ¡|d	f|d
ddœ|¤Ž}	|j ¡d|d|	jd
dœ|¤Ž}
nCt|dƒr—| ¡}nt||jjƒ}||d	f|d
d
dœ|¤Ž}t|jdƒr¹|j ¡}nt|j|jjƒ}||d	f|d
d
dœ|¤Ž}
| ¡}
|j ¡}|j|jsé|jdd…dd…fn|j|
j|ddd…dd…ft|
dƒ�r|
jndtj|dtjd� 
|jj¡|j�s|
jdd…dd…fn|j|jt|
dƒ�r,|jnd|d�	\}\}}}}}}||||||||dœ}|j�rM|j|d<|S)a!
        Compute the KTO loss using the Liger-Kernel's LigerFusedLinearKTOLoss.

        Args:
            model:
                The policy model used for generating log probabilities and outputs. It could be an encoder-decoder
                model or a regular language model.
            batch: A dictionary containing the input data and labels for the batch.

        Returns:
            A dictionary containing the following keys:
                - "loss": The computed KTO loss for the batch.
                - "chosen_logits_sum": Sum of the logits for the chosen responses from the policy model.
                - "rejected_logits_sum": Sum of the logits for the rejected responses from the policy model.
                - "chosen_logps": Log probabilities of the chosen responses from the policy model.
                - "rejected_logps": Log probabilities of the rejected responses from the policy model.
                - "chosen_rewards": Rewards for the chosen responses.
                - "rejected_rewards": Rewards for the rejected responses.
                - "kl": The KL divergence between the policy and reference models (detached).

            If auxiliary loss is enabled, the dictionary will also include:
                - "aux_loss": The auxiliary loss from the model outputs.
        rrrXrÞrÝrõTrerærç)rßÚreturn_dictràF)rÚencoder_hidden_statesÚ	use_cacheÚget_decoder)rßr#NrSÚbiasr{)rˆ)	Ú_inputÚ
lin_weightÚtargetr%Úpreference_labelsÚ	ref_inputÚ
ref_weightÚref_biasr)ÚlossÚchosen_logits_sumÚrejected_logits_sumÚchosen_logps_sumÚrejected_logps_sumÚchosen_rewards_sumÚrejected_rewards_sumrrÿrn)rýrAr–rrr}rÍrrDrr]rr%r†ršÚget_encoderr$Úlast_hidden_staterŒr‡rBr,Úget_output_embeddingsr¬Úweightr%ÚtensorÚboolrÿ)r3r—rôrrÊrrÚencoder_outputsrÚref_encoder_outputsÚref_outputsÚ
base_modelÚref_base_modelÚlm_headÚref_lm_headr-r0r1r.r/r2r3r\rnrnroÚ_compute_loss_liger6sÐüþúÿýüýüÿýü
ýü

ÿýüÿýü
 ÿõöø

z&_UnslothKTOTrainer._compute_loss_ligerc	s@i}‡fdd„ˆ ¡Dƒ‰t ˆd¡}| ¡ ˆjj¡}t|ƒ| ˆjj¡}ˆjj	rZˆ 
|ˆ¡}|d}|d}	|d}
|d}|d}|d	}
|d
}|d}ˆjrY|d}n³ˆ |ˆ¡}|d
d…\}}}	}
}ˆjrr|d}dˆvr±‡fdd„t
ˆdjdƒDƒ}‡fdd„t
ˆdjdƒDƒ}ˆd|df}ˆd|df}ˆjr®ˆd}nQd
}nNt ¡�Bˆjd
uràˆ ¡�ˆ ˆjˆ¡d
d…\}}}}}Wd
ƒn1sÚwYnˆ ˆjˆ¡d
d…\}}}}}Wd
ƒn1súwYˆ ||||||¡\}}
}}| ¡|d<ˆj |¡ ¡ ¡}ˆj |¡ ¡ ¡}|dk�rZˆj |
 ¡¡ ¡ ¡|d<ˆj | ¡¡ ¡ ¡|d<ˆj |	 ¡¡ ¡ ¡|d<||d<|dk�r�ˆj | ¡¡ ¡ ¡|d<ˆj | ¡¡ ¡ ¡|d<ˆj |
 ¡¡ ¡ ¡|d<||d<| ¡}ˆj�rœ|ˆj|7}||fS)zWCompute the KTO loss and other metrics for the given batch of inputs for train or test.cs0i|]\}}|t|tjƒr| ˆjj¡n|“qSrn)r„rDrr]r}r©rtÚkÚvr¼rnroÚ
<dictcomp>Çs0z=_UnslothKTOTrainer.get_batch_loss_metrics.<locals>.<dictcomp>r{r-r.r/r0r1r2r3rrÿNérÆcrör÷rnrørúrnrorvçrûz=_UnslothKTOTrainer.get_batch_loss_metrics.<locals>.<listcomp>rcrörürnrørúrnrorvèrû.rÊzrewards/chosen_sumzlogps/chosen_sumúlogits/chosen_sumzcount/chosenzrewards/rejected_sumzlogps/rejected_sumúlogits/rejected_sumzcount/rejected)ÚitemsrDr8rŸr]r}rr rBr+rAršrrþr[r–rìrAr½r—rÚitemrÍÚnansumÚnanmeanr›)r3r—rôÚmetricsráÚ
num_chosenÚnum_rejectedÚmodel_outputrÚpolicy_chosen_logitsÚpolicy_rejected_logitsr	r
rrrrÿÚforward_outputrrrrr
rÊÚ_Úall_num_chosenÚall_num_rejectedr-rn)rôr3roÚget_batch_loss_metricsÀs°€
ú  


úú€ú€ðú	
ÿÿÿ
ÿÿÿz)_UnslothKTOTrainer.get_batch_loss_metricsÚinputscCs‚|jr
t|jjjƒntƒ}|�| ||¡\}}Wdƒn1s"wY| |jj¡}|jj	r9|j
|dd�|r?||fS|S)NÚtrain©Ú
train_eval)rŠr%r}rr‚r:rWr]rBÚis_main_processÚ
store_metrics)r3r—rXÚreturn_outputsÚnum_items_in_batchÚcompute_loss_context_managerr-rMrnrnroÚcompute_loss0sÿÿz_UnslothKTOTrainer.compute_lossrYrMr[)rYÚevalcCs*| ¡D]\}}|j|| |¡qdSrb)rIr™rc)r3rMr[ÚkeyÚvaluernrnror]Hsÿz _UnslothKTOTrainer.store_metricsÚdatasetcCs*|dur|j}|dust|ƒsdSt|ƒSrb)rCr-r)r3rernrnroÚ_get_train_samplerLs
z%_UnslothKTOTrainer._get_train_samplerc	Cs:|jr
t|jjjƒntƒ}|�`|j|d|d|jd|jj	d�}d|vr*|d}n>|j
durV| ¡�|jj|d|d|jd|jj	d�}Wdƒn1sPwYn|j
j|d|d|jd|jj	d�}Wdƒn1srwYt
||j|jj	ƒ}|jj|dd�}t
||j|jj	ƒ}|jj|dd�}||fS)zRGenerate samples from the model and reference model for the given batch of inputs.rÛrÜT)rrßrÚ	do_sampler`Úreference_outputN)Úskip_special_tokens)rŠr%r}rr‚r:ÚgeneraterrEr`rAr½r—r<Úbatch_decode)r3r—rôÚgenerate_context_managerÚ
policy_outputrhÚpolicy_output_decodedÚreference_output_decodedrnrnroÚgenerate_from_model_and_refSsJÿû	


ûÿ€	û€éz._UnslothKTOTrainer.generate_from_model_and_refr Úignore_keysc	s>ˆdurt|dƒrt|jdgƒ‰ng‰|jrt|jjjƒntƒ}t	 
¡�"|�| ||¡\}}Wdƒn1s:wYWdƒn1sIwY|jjrY|j
|dd�|rb| ¡ddfSi}d|vrn|d|d<d|vrx|d|d<‡fd	d
„| ¡Dƒ}	t	j|	|jjd�}	t	j|	jd|jjd�}
| ¡|	|
fS)
Nr‘Úkeys_to_ignore_at_inferencerbrZrGzeval_logits/chosenrHzeval_logits/rejectedcsg|]
\}}|ˆvr|‘qSrnrnrB©rqrnrorv£sz6_UnslothKTOTrainer.prediction_step.<locals>.<listcomp>)rr)rŒr‡r‘rŠr%r}rr‚r:rDrìrWr\r]rrIr8rr[)r3r—rXr rqÚprediction_context_managerr-rMÚlogits_dictrerárnrsroÚprediction_stepƒs0
ÿÿ€z"_UnslothKTOTrainer.prediction_steprbÚ
dataloaderÚdescriptionÚmetric_key_prefixcs$|jr†t|jƒ}tjt|ƒ|jjd�}|j |¡}| 	|¡}	| 
|	¡}	tj|	dtj
|jjd�}
t |
¡d}|	d||	d|t|Ž|	dƒdœ}| |j|¡\}
}tjgd	¢d
d„t|d|
|ƒDƒd�}d
|jjvrzt dtj|d�i¡d|jjvr†td|d�tƒ |||||¡}|S)zÞ
        Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by
        `Trainer.evaluate()` and `Trainer.predict()`.