Files
DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothCPOTrainer.cpython-310.pyc
434 lines
46 KiB
o
õ×°h=3ã@sdZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZmZmZm Z m!Z!m"Z"m#Z#m Z m$Z$m%Z%m&Z&m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1m2Z2mZm3Z3m4Z4m5Z5m6Z6m7Z7m8Z8m9Z9m:Z:m;Z;m<Z<mZm=Z=mZm
Z
mZmZm"Z"m-Z-m5Z5mZddl5Z5ddlTddl>m?Z?m@Z@dd lAmBZBddlZddlCZ3dd
lDm4Z4ddlmZdd lEmFZFmGZHd d
d d
d
dœZIejJd d eIdddƒZKe?GdddeƒƒZL Gddde"ƒZMGdddeMƒZNdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)<rÚAutoModelForCausalLMÚBaseImageProcessorÚ CPOConfigÚ
CPOTrainerr ÚDPODataCollatorWithPaddingÚ DataCollatorÚ
DataLoaderÚDatasetÚEvalLoopOutputÚFeatureExtractionMixinÚLiteralrÚ PartialStateÚPathÚ PeftModelÚPreTrainedModelÚPreTrainedTokenizerBaseÚProcessorMixinÚTrainerÚTrainerCallbackrÚadd_bos_token_if_neededÚadd_eos_token_if_neededÚautocastÚ defaultdictÚdisable_dropout_in_modelÚgenerate_model_cardÚget_comet_experiment_urlÚinspectÚis_comet_availableÚis_peft_availableÚis_torch_fx_proxyÚis_wandb_availableÚlog_table_to_comet_experimentÚmaybe_apply_chat_templateÚmaybe_extract_promptÚnnÚnpÚ nullcontextÚosÚ
pad_to_lengthÚpdÚpeft_module_casting_to_bf16Úprepare_model_for_kbit_trainingÚrandomÚselective_log_softmaxÚtextwrapÚtorchÚwarningsrrrrrr)r2r:)Ú*)Ú dataclassÚfield)ÚVersion)r1)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionsc
Ctj| d|jd¡ddd}tj| d¡ddd}g}t||ƒD](\}}| tj¡}tj|d| d¡d  d¡}tj
|dd}||} |  | ¡q! t  |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rKÚindex©rKé)
r:ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
ÚlogitsrLÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©rdúQ/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothCPOTrainer.pyÚchunked_selective_log_softmax"s  
rfceZdZUdZedddidZeeed<edddidZ ee
ed <eddd
idZ ee
ed <  
                            ! ! " #     $           $      % &  '         (      #    $   ) *       + ,   % -   . /  0      d3‡fd1d2„ Z Z
S)4ÚUnslothCPOConfiguy
Configuration class for the [`CPOTrainer`].
This class includes only the parameters that are specific to CPO training. For a full list of training arguments,
please refer to the [`~transformers.TrainingArguments`] documentation. Note that default values in this class may
differ from those in [`~transformers.TrainingArguments`].
Using [`~transformers.HfArgumentParser`] we can turn this class into
[argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
command line.
Parameters:
max_length (`int` or `None`, *optional*, defaults to `1024`):
Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want
to use the default data collator.
max_prompt_length (`int` or `None`, *optional*, defaults to `512`):
Maximum length of the prompt. This argument is required if you want to use the default data collator.
max_completion_length (`int` or `None`, *optional*, defaults to `None`):
Maximum length of the completion. This argument is required if you want to use the default data collator
and your model is an encoder-decoder.
beta (`float`, *optional*, defaults to `0.1`):
Parameter controlling the deviation from the reference model. Higher β means less deviation from the
reference model. For the IPO loss (`loss_type="ipo"`), β is the regularization parameter denoted by τ in
the [paper](https://huggingface.co/papers/2310.12036).
label_smoothing (`float`, *optional*, defaults to `0.0`):
Label smoothing factor. This argument is required if you want to use the default data collator.
loss_type (`str`, *optional*, defaults to `"sigmoid"`):
Type of loss to use. Possible values are:
- `"sigmoid"`: sigmoid loss from the original [DPO](https://huggingface.co/papers/2305.18290) paper.
- `"hinge"`: hinge loss on the normalized likelihood from the
[SLiC](https://huggingface.co/papers/2305.10425) paper.
- `"ipo"`: IPO loss from the [IPO](https://huggingface.co/papers/2310.12036) paper.
- `"simpo"`: SimPO loss from the [SimPO](https://huggingface.co/papers/2405.14734) paper.
disable_dropout (`bool`, *optional*, defaults to `True`):
Whether to disable dropout in the model.
cpo_alpha (`float`, *optional*, defaults to `1.0`):
Weight of the BC regularizer in CPO training.
simpo_gamma (`float`, *optional*, defaults to `0.5`):
Target reward margin for the SimPO loss, used only when `loss_type="simpo"`.
label_pad_token_id (`int`, *optional*, defaults to `-100`):
Label pad token id. This argument is required if you want to use the default data collator.
padding_value (`int` or `None`, *optional*, defaults to `None`):
Padding value to use. If `None`, the padding value of the tokenizer is used.
truncation_mode (`str`, *optional*, defaults to `"keep_end"`):
Truncation mode to use when the prompt is too long. Possible values are `"keep_end"` or `"keep_start"`.
This argument is required if you want to use the default data collator.
generate_during_eval (`bool`, *optional*, defaults to `False`):
If `True`, generates and logs completions from the model to W&B or Comet during evaluation.
is_encoder_decoder (`bool` or `None`, *optional*, defaults to `None`):
When using the `model_init` argument (callable) to instantiate the model instead of the `model` argument,
you need to specify if the model returned by the callable is an encoder-decoder model.
model_init_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
Keyword arguments to pass to `AutoModelForCausalLM.from_pretrained` when instantiating the model from a
string.
dataset_num_proc (`int` or `None`, *optional*, defaults to `None`):
Number of processes to use for processing the dataset.
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrHz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksz'Maximum sequence length to truncate to.Úmax_seq_lengthFÚnorIéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrNéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéééÚsigmoidçà?éœÿÿÿÚkeep_endc” |dkr td|dƒ|dkrtd|dƒ|dur(|#dkr(|$dkr(d}d }#|dur:d
d lm}•t|•ƒd d
ƒ}tƒjdžid|d|d|d|d|d|d|d|d| “d|
d| d| d|
d|d|d|d|d|d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-| “d.|!“d/|"“d0|#“d1|$“d2|%“d3|&“d4|'“d5|(“d6|)“d7|*“d8|+“d9|,“d:|-“d;|.“d<|/“d=|0“d>|1“d?|2“d@|3“dA|4“dB|5“dC|6“dD|7“dE|8“dF|9“dG|:“dH|;“dI|<“dJ|=“dK|>“dL|?“dM|@“dN|A“dO|B“dP|C“dQ|D“dR|E“dS|F“dT|G“dU|H“dV|I“dW|J“dX|K“dY|L“dZ|M“d[|N“d\|O“d]|P“d^|Q“d_|R“d`|S“da|T“db|U“dc|V“dd|W“de|X“df|Y“dg|Z“dh|[“di|\“dj|]“dk|^“dl|_“dm|`“dn|a“do|b“dp|c“dq|d“dr|e“ds|f“dt|g“du|h“dv|i“dw|j“dx|k“dy|l“dz|m“d{|n“d||o“d}|p“d~|q“d|r“d€|s“d|t“d|u“dƒ|v“d„|w“d…|x“d†|y“d‡|z“dˆ|{“d‰||“dŠ|}“d|~“dŒ|d|€“dŽ|d|‚“d|ƒ“d‘|„“d’|…“d“|†“d”|‡“d•|ˆ“d–|‰“d—|Š“d˜|‹“d™|Œ“dš|d›|Ž“dœ|d||”¤Ž||_||_|“|_ dS)ŸNgH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rNza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!r|r}Úunsloth_training_checkpointsrnr)Ú cpu_countrIroÚ
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚ
max_lengthÚmax_prompt_lengthÚmax_completion_lengthÚbetaÚlabel_smoothingÚ loss_typeÚdisable_dropoutÚ cpo_alphaÚ simpo_gammaÚlabel_pad_token_idÚ
padding_valueÚtruncation_modeÚgenerate_during_evalÚis_encoder_decoderÚmodel_init_kwargsÚdataset_num_procrd)
ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingrÚmaxÚsuperÚ__init__rkrlrm)–Úselfrrrr“r”r•r–r—r™rrr r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrrrrrrkrlrmÚkwargsr©Ú __class__rdrer%~s   ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·JKµL´M³N²O±P°Q¯R®S­T¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžcdœefšgh˜ijklmnopqrŽstŒuvŠwxˆyz{|}ƒ~ÿþýüûúùø ÷
ö õ ô
óòñðï
zUnslothCPOConfig.__init__)“NNFFFrnFrIrINNrororrprqrrrsrtrurvrwrHrxryrrzr{TNr|FrNFr|r}NTFFFFFFr~r~FFFFrr€FFNrHNNFrFNrNrHNNTNFNNFrrNNNNrNFFr„NNNNTFTFFNNr…NNFNFNFTr€NNNrTFNr†r‡FNNFFNNFFFNFTrˆr‰NryrTrvrNrFNNNNrHN)Ú__name__Ú
__module__Ú __qualname__Ú__doc__r>rkrrÚ__annotations__rlÚintrmr%Ú
__classcell__rdrdr(rerg3sF
=þþþërgceZdZdZddgZ            dRdeeeej e
fdee dee d ee
d
eee
ee
e
ffd eeeeeefd eegefd
eeedeejjejjjfdeeejejgejfdeedeeegefffdd
ZddZdSdeeeej fdefddZ e!    dTdee
eeej"ffde#de$de$d eej%dee
ej"ff d!d"„ƒZ&d#ej'd$ej'deej'ej'ej'ffd%d&„Z(e!   dUd'ej'd(ej"d)e#de$de#dej'f d*d+„ƒZ)dej dee
eeej"ffdeej'ej'ej'ej'ffd,d-„Z* .dVdee
eeej"ffd/e+d0fd1d2„Z,  dWdeeej fd3ee
eeje-ffdeejeejee
ejffffd4d5„Z.dee
ej"fde
fd6d7„Z/ dSdeeej fd3ee
eeje-ffd8e#d9eee
fd:d;„Z0dVd<ee
e1fd/e+d0ddfd=d>„Z2   ?dXd@e3dAe
d8ee#d9eee
dBe
def fdCdD„
Z4dSdEee
e1fdFee1ddffdGdH„
Z5dIdJ„Z6‡fdKdL„Z7   dYdMee
dNee
dOee
ee
dffdPdQ„Z8‡Z9S)ZÚ_UnslothCPOTrainerrÚtrlÚcpoN©NNÚmodelÚargsÚ
data_collatorÚ
train_datasetÚ eval_datasetÚprocessing_classÚ
model_initÚ callbacksÚ
optimizersÚpreprocess_logits_for_metricsÚ peft_configÚcompute_metricsc

sF|jduri}
n9t|tƒstdƒ|j}
|
 d¡}|durAt|tƒr+|dkr+tt|ƒ}|dkr=t|tjƒs=td|dƒ||
d<t|tƒrOtj |fi|
¤Ž}d|_
t ƒs]| dur]tdƒt ƒrÈ| durÈt|t ƒrm| 
¡}t|ddƒsyt|d dƒrt|d
ƒo‡d
tt t¡jƒv}d |ji}|r”|j|d
<t|fi|¤Ž}n|jrµt|d ƒrª| ¡n d
d}| ¡ |¡|}|jrÇt|d dƒrÇt|ƒd|_
n|jràt|d ƒrÕ| ¡n dd}| ¡ |¡|jrítƒsítƒsítdƒ|dur÷|jj|_n|jdurtdƒ|j|_|jr|jj |_ |jj!|_!|durtdƒ|j"dur+t# $dt%¡d}n|j"}|j&dur=t# $dt%¡d}n|j&}||ksPtd|d|dƒ|j'durc|jrct# $dt%¡d}n|j'}|dur†t(|j!|j)|jd}|j*rd|_*t# $dt%¡d|_+nd|_+|j,rt-|ƒ||_"|j|_|j)|_)|j.dur¥|j.n|j!|_.||_&|j/|_/||_'||_0|j1dvrÍ|j2dkrÍt# $d |j1d!t%¡|j1d"kr×td#ƒ|j3|_3|j2|_2|j1|_1|j4|_4t|jd$dƒ|_5t|jd%d&ƒ|_6|j5r|j6d&krt# $d't%¡|j1d(kr|j7|_7t8d)d*„ƒ|_9d|j:d+<t;ƒ J|j=t>|j?d,}|j=t@d-|i|j?d.}|durN|j=t>|j?d,}|j=t@d-|i|j?d.}|j=|jA|j?d,}|dure|j=|jA|j?d,}Wdƒn 1spwYtBƒjC|||||||| || |
d/ d|_Dt|jEd0ƒr—|jE F|jG¡t|d1ƒs¡tHd2ƒdS)3NzRYou passed model_kwargs to the CPOTrainer. But your model is already instantiated.Ú torch_dtyper€znInvalid `torch_dtype` passed to the CPOConfig. Expected a string with either `torch.dtype` or 'auto', but got Ú.FzvPEFT is not installed and you passed a `peft_config` in the trainer's kwargs, please install it to use the PEFT modelsÚis_loaded_in_8bitÚis_loaded_in_4bitröÚuse_gradient_checkpointingÚenable_input_require_gradscSó| d¡dS©NT©Úrequires_grad_©ÚmoduleÚinputÚoutputrdrdreÚmake_inputs_require_gradþóz=_UnslothCPOTrainer.__init__.<locals>.make_inputs_require_gradTcSrGrHrIrKrdrdrerOrPz`generate_during_eval=True` requires Weights and Biases or Comet to be installed. Please install `wandb` or `comet-ml` to resolve.zMWhen no model is provided, you need to pass the parameter is_encoder_decoder.z=processing_class must be specified to tokenize a CPO dataset.z`max_length` is not set in the CPOConfig's init it will default to `512` by default, but you should do it yourself in the future.r‰zˆ`max_prompt_length` is not set in the CPOConfig's init it will default to `128` by default, but you should do it yourself in the future.é€zmax_prompt_length (z+) should be strictly less than max_length (z).z¼When using an encoder decoder architecture, you should set `max_completion_length` in the CPOConfig's init it will default to `128` by default, but you should do it yourself in the future.)Ú pad_token_idrrz²When using DPODataCollatorWithPadding, you should set `remove_unused_columns=False` in your TrainingArguments we have set it for you, but you should do it yourself in the future.)ÚhingeÚiporzYou are using the z™ loss type that does not support label smoothing. The `label_smoothing` parameter will be ignored. Set `label_smoothing` to `0.0` to remove this warning.Úkto_pairzKSupport for kto_pair has been removed in CPOTrainer. 
Please use KTOTrainer.Úoutput_router_logitsÚrouter_aux_loss_coefra-You set `output_router_logits` to `True` in the model config, but `router_aux_loss_coef` is set to `0.0`, meaning the auxiliary loss will not be used. Either set `router_aux_loss_coef` to a value greater than `0.0`, or set `output_router_logits` to `False` if you don't want to use the auxiliary loss.ÚsimpocSsttƒS©N)r#ÚlistrdrdrdreÚ<lambda>‰sz-_UnslothCPOTrainer.__init__.<locals>.<lambda>Úestimate_tokens)Únum_procÚ tokenizer)Ú fn_kwargsr]) r5r6r7r8r9r:r;r@r<r=r>Úadd_model_tagsÚ acceleratorzXYour `Trainer` does not have an `accelerator` object. Consider upgrading `transformers`.)IrÚ
isinstanceÚstrÚ
ValueErrorÚgetÚgetattrr:Údtyper Úfrom_pretrainedÚ_peft_has_been_casted_to_bf16r)rÚmerge_and_unloadÚhasattrrZr'Ú signaturer6Ú
parametersrõrFÚget_input_embeddingsÚregister_forward_hookrÀr5rr+r(ÚconfigrÚdecoder_start_token_idrRrr;ÚwarnÚ UserWarningrrrrÚuse_dpo_data_collatorrr$rrr:rrrrÚaux_loss_enabledÚ
aux_loss_coefrr#Ú_stored_metricsÚwarnings_issuedrÚmain_process_firstÚmapr.rr-Ú tokenize_rowr$r%Úmodel_accepts_loss_kwargsr5r`Ú
_tag_namesÚAttributeError)r&r5r6r7r8r9r:r;r<r=r>r?r@rrAÚ_support_gc_kwargsÚprepare_model_kwargsrOrrrr(rdrer%¹s\




ÿ
ÿ
ÿ
ÿþ





ÿ  


 ý ý
ÿý
ýý ý û 
 ÿ
ý
ïõ ÿÿz_UnslothCPOTrainer.__init__c Cs |j||dd}|j|ddd}|dt|ƒd}|dt|ƒd}t ||g¡}t |d¡}t|ƒt|ƒkr@tdƒt|ƒ} ||dd| …krR| d8} |dd| …}|dd| …}
t|ƒt|
ƒkrntdƒ|d| d}|d| d}t||
||d S)
a
Llama tokenizer does not satisfy `enc(a + b) = enc(a) + enc(b)`. It does ensure `enc(a + b) = enc(a) + enc(a +
b)[len(enc(a)):]`. Reference:
https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
Úadd_special_tokensÚ input_idsNÚattention_maskzBPrompt input ids and answer input ids should have the same length.rNz@Prompt input ids and attention mask should have the same length.)Úprompt_input_idsÚprompt_attention_maskrƒr„)r:Úlenr0Ú concatenateÚarrayrdÚdict) r&ÚpromptÚanswerÚfull_tokenizedr…Úanswer_input_idsÚanswer_attention_maskÚfull_concat_input_idsÚfull_input_idsÚresponse_token_ids_start_idxr†rdrdreÚbuild_tokenized_answerÅs.üz)_UnslothCPOTrainer.build_tokenized_answerÚreturncsi}|d}|d}|d}|js±t|tƒs tdt|ƒƒ|j|dd}dd| ¡Dƒ}t|tƒs>td t|ƒƒ| ||¡t|tƒsRtd
t|ƒƒ| ||¡t|d ƒ}tˆd ƒ} tˆd ƒ}
t | |
ƒ}| ¡D] \} } | d ||| <qst
d
dt ˆd ˆd ƒDƒƒ}
t | |
ƒ}|
dksž|dkr¢tdƒt
|jj||| ˆ|
ˆƒ\}t|jjˆˆƒ\ttˆdƒtˆdƒƒ}ˆˆ|fD]D}t|d ƒ||jkr|jdkrñdD]
} || d |j|| <qâqÍ|jdkr dD]} || |j d || <qùqÍtd|jƒq͈ˆfD]#}t|d ƒ||jkr8dD]} || d |j|j|| <q&qfdddDƒ}fdddDƒ}|dd d |d<|jgtˆd ƒ|dd tˆd ƒ<|dd d |d<|jgtˆd ƒ|dd tˆd ƒ<|||dœ ¡D]\} }| ¡D]\}}|dkq˜||| |<q˜q|S|j|d|jdd|j|d|jdd|j|d|jdd}ˆd|d<ˆd|d<|d|d <|d |d!<|d ur t|d"ƒr |jt |d¡d#|d$<|jt |d¡d#|d%<|S)&a.Tokenize a single row from a CPO specific dataset.
At this stage, we don't convert to PyTorch tensors yet; we just handle the truncation in case the prompt +
chosen or prompt + rejected responses is/are too long. First we truncate the prompt; if we're still too long,
we truncate the chosen/rejected.
We also create the labels for the chosen/rejected responses, which are of length equal to the sum of the length
of the prompt and the chosen/rejected response, with label_pad_token_id for the prompt tokens.
rÚchosenÚrejectedz prompt should be an str but got FrcSsi|]
\}}d||qS©Úprompt_rd©Ú.0ÚvrdrdreÚ
<dictcomp>óz3_UnslothCPOTrainer.tokenize_row.<locals>.<dictcomp>z chosen should be an str but got z"rejected should be an str but got r…NcSsg|]\}}||kqSrdrd)ÚbrdrdreÚ
<listcomp>&sz3_UnslothCPOTrainer.tokenize_row.<locals>.<listcomp>rNzdChosen and rejected prompt_input_ids might only differ on the last token due to tokenizer merge ops.rƒÚ
keep_start)r…r†rzUnknown truncation mode: )r„có$i|]}|ˆd|ˆ|qSr—rd©r)Ú
chosen_tokensrdrerTóÿcr—rd)Úrejected_tokensrdrerWÚlabels)Úchosen_Ú rejected_rÚtoken_type_idsT)Ú
truncationrrÚ
chosen_labelsÚrejected_labelsr„r†Ú%prepare_decoder_input_ids_from_labels)Úrejected_decoder_input_idsÚchosen_decoder_input_ids)rrbrcrdÚtyper:Úitemsr“r‡ÚminÚsumrRÚabsr Ú bos_token_idr!Ú eos_token_idr#rrrrrrkr:Útensor)r&Úfeaturer5Úbatchrr•rÚ
prompt_tokensÚprompt_len_input_idsÚchosen_prompt_len_input_idsÚrejected_prompt_len_input_idsrÚnum_diff_tokensÚ num_diff_lenÚlonger_response_lengthÚ
answer_tokensÚchosen_sequence_tokensÚrejected_sequence_tokensÚtoksÚtype_keyÚtokensrd)rer{ö


 
    
ÿ ÿ
ù
ÿ
ÿ ÿø  
ÿ
ÿÿ
þÿ
þý
ü
ýé
ÿ
ÿ
ÿ     
ÿ 
ÿz_UnslothCPOTrainer.tokenize_rowFrŒrrrrÚdevicec
Ci}|rt|djd|djdƒ}nt|djd|djdƒ}|D]8}| d¡r]t||tjƒr]d|vs:|r=|}n| d¡rE|}n| d ¡rLd
}| dd ¡} t||||d || <q%|D]E}| d
¡r¥t||tjƒr¥d|vsu|rx|}n| d¡r€|}n| d ¡r‡d
}| d
d ¡} tj || t||||d fd
dj
|d|| <q`|rÄ|d  dd¡j
|d|d<|d  dd¡j
|d|d<|S)Concatenate the chosen and rejected inputs into a single tensor.
Args:
batch:
A batch of data. Must contain the keys 'chosen_input_ids' and 'rejected_input_ids', which are tensors
of shape (batch_size, sequence_length).
is_encoder_decoder:
Whether the model is an encoder-decoder model.
label_pad_token_id:
The label pad token id.
padding_value:
The padding value to use for the concatenated inputs_ids.
device:
The device for the concatenated inputs.
Returns:
A dictionary containing the concatenated inputs under the key 'concatenated_input_ids'.
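The concatenation described above can be illustrated without tensors. The following is a rough sketch, assuming each batch is a list of token-id lists; `pad_row` and `concatenate_pairs` are hypothetical names, and the real method works on `torch` tensors and also handles attention masks and labels.

```python
def pad_row(row, length, pad_value):
    """Right-pad one sequence with pad_value up to the given length."""
    return list(row) + [pad_value] * (length - len(row))


def concatenate_pairs(chosen_ids, rejected_ids, padding_value=0):
    """Pad chosen and rejected rows to a shared length and stack them.

    The first len(chosen_ids) rows of the result are the chosen sequences,
    the remaining rows are the rejected ones.
    """
    max_len = max(len(row) for row in list(chosen_ids) + list(rejected_ids))
    chosen = [pad_row(row, max_len, padding_value) for row in chosen_ids]
    rejected = [pad_row(row, max_len, padding_value) for row in rejected_ids]
    return chosen + rejected
```

Stacking both halves into one batch lets a single forward pass score chosen and rejected responses together; the outputs are then split back at index `len(chosen_ids)`.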
r­rNÚchosen_input_idsÚrejected_input_idsr•Ú
_input_idsÚ_attention_maskrÚ concatenated)Ú pad_valuerrM©r…roÚconcatenated_input_idsr†Úconcatenated_attention_mask) r#rQÚ
startswithrbr:rÚendswithÚreplacer3ÚcatrSÚrepeat)
rrrÚconcatenated_batchrrÚconcatenated_keyrdrdreÚconcatenated_inputs‡sL  

  

 þû
úÿz&_UnslothCPOTrainer.concatenated_inputsÚpolicy_chosen_logpsÚpolicy_rejected_logpscCs4|| |jj¡}|jdkr3|j|j}||}t |j|¡ d|jt |j |¡|j}nJ|jdkrSt |j|¡ d|jt |j |¡|j}n*|jdkrct  
d|j|¡}n|jdkrt|dd|jd}n t d|jdƒ|j| |jj¡  ¡}|j| |jj¡  ¡}|||fS) aµCompute the CPO loss for a batch of policy and reference model log probabilities.
Args:
policy_chosen_logps:
Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
policy_rejected_logps:
Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
Returns:
A tuple of three tensors: (losses, chosen_rewards, rejected_rewards). The losses tensor contains the CPO
loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for
the chosen and rejected responses, respectively.
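For a single example, the loss variants named in the configuration docstring can be sketched with scalar arithmetic. This is an illustrative sketch rather than the compiled method: it uses `math` instead of `torch`, the function names are hypothetical, and the formulas follow the commonly documented forms of the sigmoid, hinge, IPO and SimPO preference losses.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) for a Python float."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cpo_loss_scalar(chosen_logp, rejected_logp, beta=0.1, label_smoothing=0.0,
                    loss_type="sigmoid", simpo_gamma=0.5):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair."""
    logits = chosen_logp - rejected_logp

    if loss_type == "simpo":
        # SimPO subtracts a target margin, expressed in log-ratio units.
        logits = logits - simpo_gamma / beta

    if loss_type in ("sigmoid", "simpo"):
        loss = (-log_sigmoid(beta * logits) * (1 - label_smoothing)
                - log_sigmoid(-beta * logits) * label_smoothing)
    elif loss_type == "hinge":
        loss = max(0.0, 1 - beta * logits)
    elif loss_type == "ipo":
        loss = (logits - 1 / (2 * beta)) ** 2
    else:
        raise ValueError(f"Unknown loss type: {loss_type}")

    # Rewards are the beta-scaled log probabilities.
    return loss, beta * chosen_logp, beta * rejected_logp
```

When the chosen and rejected log probabilities are equal, the sigmoid loss reduces to `log(2)`, which is a handy sanity check for any reimplementation.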
rXrNrSrTrozUnknown loss type: z7. Should be one of ['sigmoid', 'hinge', 'ipo', 'simpo'])
rSrarrrrÚ
logsigmoidrr:ÚrelurdÚdetach)r&r[Úgamma_logratiosÚlossesÚchosen_rewardsÚrejected_rewardsrdrdreÚcpo_lossËs.
 ÿÿ
ÿÿ

 ÿ
z_UnslothCPOTrainer.cpo_lossr[Úaverage_log_probcC|jdd|jkrtdƒ|s)|ddddf ¡}|ddddddf}||k}d|||k<t||ƒ}|rF|| d¡| d¡S|| d¡S)Compute the log probabilities of the given labels under the given logits.
Args:
logits: Logits of the model (unnormalized). Shape: (batch_size, sequence_length, vocab_size)
labels:
Labels for which to compute the log probabilities. Label tokens with a value of label_pad_token_id are
ignored. Shape: (batch_size, sequence_length)
average_log_prob:
If True, return the average log probability per (non-masked) token. Otherwise, return the sum of the
log probabilities of the (non-masked) tokens.
label_pad_token_id: The label pad token id.
is_encoder_decoder: Whether the model is an encoder-decoder model.
Returns:
A tensor of shape (batch_size,) containing the average/sum log probabilities of the given labels under the
given logits.
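The masked log-probability computation described above can be sketched with nested lists. This is a minimal sketch, assuming `logits` is `[batch][seq][vocab]` and `labels` is `[batch][seq]`; `batch_logps` is a hypothetical name, and the compiled method operates on tensors and uses a chunked log-softmax instead of this per-token loop.

```python
import math


def batch_logps(logits, labels, average_log_prob=False,
                label_pad_token_id=-100, is_encoder_decoder=False):
    """Sum (or average) the log probabilities of the non-masked label tokens."""
    results = []
    for row_logits, row_labels in zip(logits, labels):
        if len(row_logits) != len(row_labels):
            raise ValueError("Logits and labels must have the same sequence length.")
        if not is_encoder_decoder:
            # Decoder-only models: position t predicts token t + 1.
            row_labels = row_labels[1:]
            row_logits = row_logits[:-1]

        total, count = 0.0, 0
        for token_logits, label in zip(row_logits, row_labels):
            if label == label_pad_token_id:
                continue  # masked position (prompt or padding)
            peak = max(token_logits)
            log_norm = peak + math.log(sum(math.exp(v - peak) for v in token_logits))
            total += token_logits[label] - log_norm
            count += 1

        if average_log_prob and count > 0:
            results.append(total / count)
        else:
            results.append(total)
    return results
```

Averaging matters for length-normalized objectives: the configuration docstring describes `"hinge"` as operating on the normalized likelihood, while the default sigmoid loss typically uses the summed log probability.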
NrHzKLogits (batch and sequence length dim) and labels must have the same shape.rNr)rQrdÚcloner8)r[rrÚ loss_maskrcrdrdreÚget_batch_logpss 
z"_UnslothCPOTrainer.get_batch_logpscsDˆj|ˆjˆjˆjˆjjd}|djd}ˆjr"dˆ |d¡ini}ˆjr+d|d<||df|d d
d œ|¤Ž}|j }fd d
}|d 
¡} ˆj dkrYt  
d¡ ˆjj¡}
n
||d|| d|ƒ}
ˆj||dˆjdvˆjˆjd} | d|} | |d}
|d|}||d}ˆjr| |
|||
|jfS| |
|||
fS)zÆRun the given model on the given batch of inputs, concatenating the chosen and rejected inputs together.
We do this to avoid doing two forward passes, because it's faster for FSDP.
)rrrr­rÚdecoder_input_idsÚconcatenated_labelsTrVF)r„Ú use_cachecsrˆjs|dddddf ¡}|dddf ¡}t ¡}| d|jd¡}| d¡}| |j¡}|||ƒ}|S)N.rHrN)rÚ
contiguousr/ÚCrossEntropyLossÚviewrQrS)r[Úloss_fctÚloss©r&rdreÚcross_entropy_lossNs
 
zC_UnslothCPOTrainer.concatenated_forward.<locals>.cross_entropy_lossrN)rTrX)rr)rrrrarQÚ _shift_rightrur[rr:rSrÚaux_loss)r&r5Ú
len_chosenÚ model_kwargsÚoutputsÚ
all_logitsròÚnll_lossÚ all_logpsÚ chosen_logpsÚrejected_logpsÚ
chosen_logitsÚrejected_logitsrdreÚconcatenated_forward+sXûýÿûÿýü  
û    z'_UnslothCPOTrainer.concatenated_forwardÚtrainÚ
train_eval)rÚevalcCi}| ||¡}|dd\}}}} }
|jr|d} | ||¡\} }
}|  ¡|j|
}|
|k ¡}|dkr8dnd}|j |
¡ ¡ ¡||d<|j |¡ ¡ ¡||d<|j |¡ ¡ ¡||d<|j |
|¡ ¡ ¡||d <|j |¡  ¡ ¡ ¡||d
<|j |¡  ¡ ¡ ¡||d <|j |   ¡ ¡¡ ¡ ¡||d <|j |  ¡ ¡¡ ¡ ¡||d
<|j |
¡  ¡ ¡ ¡||d<|jrÛ||j
| 7}||fS)zWCompute the CPO loss and other metrics for the given batch of inputs for train or test.NérÚeval_rzrewards/chosenzrewards/rejectedzrewards/accuracieszrewards/marginszlogps/rejectedz logps/chosenzlogits/rejectedz
logits/chosenrù) rÿruÚmeanrÚfloatraÚgather_for_metricsÚitemrßrv)r&r5rÚmetricsÚforward_outputrÛÚpolicy_chosen_logitsÚpolicy_rejected_logitsÚpolicy_nll_lossrôÚreward_accuraciesÚprefixrdrdreÚget_batch_loss_metricsvsF 
ú
þ  ÿ ÿ ÿ ÿ ÿ"z)_UnslothCPOTrainer.get_batch_loss_metricsÚinputscCsp|jr
t|jjjƒntƒ}||j||dd\}}Wdƒn1s$wY|j|dd|r6||fS|S)Nr©r)rir"rar1rÚ
store_metrics)r&r5rÚreturn_outputsÚnum_items_in_batchÚcompute_loss_context_managerrðr rdrdreÚ compute_loss¬sÿÿz_UnslothCPOTrainer.compute_losscCs†|jr
t|jjjƒntƒ}||j|d|d|jd|jj d}Wdƒn1s+wYt
||j|jj ƒ}|jj |dd}|S)zRGenerate samples from the model and reference model for the given batch of inputs.r…r†T)r„rÚ do_samplerRN)Úskip_special_tokens) rir"rar1Úgeneraterr:rRr3Ú batch_decode)r&r5Úgenerate_context_managerÚ
policy_outputÚpolicy_output_decodedrdrdreÚgenerate_from_modelÁsÿûÿ z&_UnslothCPOTrainer.generate_from_modelrÚ ignore_keysc s ˆdurt|dƒrt|jdgƒng|jrt|jjjƒntƒ}t  
¡$||j ||dd\}}Wdƒn1s<wYWdƒn1sKwY|j |dd|r`| 
¡ddfS|d|ddœ}fdd „| ¡Dƒ} t j| |jjd
} t j| jd |jjd
}
| 
¡| |
fS) NrpÚkeys_to_ignore_at_inferencerrúeval_logits/chosenúeval_logits/rejected)r"r#csg|]
\}}|ˆvr|qSrdrdr™©r rdre÷z6_UnslothCPOTrainer.prediction_step.<locals>.<listcomp>rÐr)rkrfrprir"rar1r:Úno_gradrrÚzerosrQ) r&r5rrr Úprediction_context_managerrðr Ú logits_dictr[rdr$reÚprediction_stepØs*
ÿÿþz"_UnslothCPOTrainer.prediction_stepr cCs*| ¡D]\}}|j|| |¡qdSrY)rwrY)r&r rÚkeyÚvaluerdrdrerýsÿz _UnslothCPOTrainer.store_metricsrÚ
dataloaderÚ descriptionÚmetric_key_prefixc
|jrZt|jƒ}tjt|ƒ|jjd}|j |¡}|  |¡} | 
| ¡} |  |j | ¡}
t
jddgddt| d|
ƒDƒd} d|jjvrNt d tj| d
i¡d |jjvrZtd | d
tƒ |||||¡} | S)
Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by
`Trainer.evaluate()` and `Trainer.predict()`.
Works both with or without labels.
)rÚPromptÚPolicycSs$g|]\}}||t|ƒdgqSrY)r‡)rÚpolrdrdrez6_UnslothCPOTrainer.evaluation_loop.<locals>.<listcomp>r)ÚcolumnsÚdataÚwandbÚgame_log)r3Úcomet_mlz game_log.csv)ÚnameÚtable)rr‡Údatasetr7ÚsampleÚranger6Úeval_batch_sizeÚselectr7Ú_prepare_inputsrr5r4Ú DataFramerRr4ÚlogÚTabler,r$Úevaluation_loop)
r&r,r-rr r.Ú num_samplesÚrandom_indicesÚrandom_batch_datasetÚ random_batchrr8Úinitial_outputr(rdrerBs0
 

 ÿþ  þ
ÿz"_UnslothCPOTrainer.evaluation_loopÚlogsÚ
start_timecsTd|vrdnd}|j| ¡D]\}}t |¡ ¡ ¡||<q|j|=tƒ ||¡S)a1
Log `logs` on the various objects watching training, including stored metrics.
Args:
logs (`dict[str, float]`):
The values to log.
start_time (`float` or `None`, *optional*, defaults to `None`):
Start time of the training.
rr)rwr:rrr$r@)r&rHrIrr*r r(rdrer@3s
 z_UnslothCPOTrainer.logcCs´|jdur tdƒt|ƒr+t |jddd|j¡}tj||dddfgdd}n| |j¡}|dddf ¡|dddf<|j|d<|j durOtdƒ| 
|d k|j ¡|S)
Nz]model.config.decoder_start_token_id has to be defined. It is usually set to the pad_token_id.rH)rN.rMrN).rz,model.config.pad_token_id has to be defined.rŒ) rqrdr*r:ÚfullrQÚ new_zerosrærRÚ masked_fill_)r&Úshifted_input_idsrdrdreEs
ÿ   

z_UnslothCPOTrainer._shift_rightcsL|jjdurt|jjƒj}n |jj d¡d}|j|dtƒ ||¡dS)/rH)Ú
model_name) r6rrr7ÚsplitÚcreate_model_cardr$Ú_save_checkpoint)r&r5ÚtrialrOr(rdrerR]s
 z#_UnslothCPOTrainer._save_checkpointrOÚ dataset_nameÚtagsc
C| ¡sdSt|jjdƒrtj |jjj¡s|jjj}nd}|dur&tƒ}n
t |t
ƒr/|h}nt|ƒ}t|jjdƒr?|  d¡|  |j
¡t d¡}t|||j||tƒr]tjdur]tjjndtƒd|ddd }| tj |jjd
¡¡dS) 
Creates a draft of a model card using the information available to the `Trainer`.
Args:
model_name (`str` or `None`, *optional*, defaults to `None`):