DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothOnlineDPOTrainer.cpython-310.pyc
324 lines
40 KiB
2025-08-28 17:57:59 +00:00
o
;—°hã@s.dZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZm Z m!Z!m"Z"m#Z#m$Z$m%Z%m&Z&m Z m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1m2Z2m3Z3m4Z4mZm5Z5m6Z6m7Z7m8Z8mZm9Z9m:Z:m;Z;m<Z<m=Z=mZm/Z/m5Z5mZmZm
Z
m Z m!Z!m%Z%m0Z0m5Z5mZddl5Z5ddlTddl>m?Z?m@Z@dd lAmBZBddlZddlCZDdd
lEmFZFddlmZdd lGmHZHmIZJd d
d d
d
dœZKejLd d eKdddƒZMddZNe?GdddeƒƒZO Gddde%ƒZPGdddePƒZQdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)@rÚAutoModelForCausalLMÚBaseImageProcessorÚBasePairwiseJudger ÚDPODataCollatorWithPaddingÚ DataCollatorÚ
DataLoaderÚDatasetÚEvalPredictionÚFeatureExtractionMixinÚGenerationConfigÚIterableDatasetÚOnlineDPOConfigÚOnlineDPOTrainerÚOptimizerNamesrÚPathÚ PeftModelÚPreTrainedModelÚPreTrainedTokenizerBaseÚProcessorMixinÚSIMPLE_CHAT_TEMPLATEÚTrainerÚTrainerCallbackrÚapply_chat_templateÚcreate_reference_modelÚdatasetsÚdisable_dropout_in_modelÚ empty_cacheÚgenerate_model_cardÚget_comet_experiment_urlÚ
get_rewardÚis_conversationalÚis_peft_availableÚis_wandb_availableÚjinja2ÚloggingÚmaybe_apply_chat_templateÚnnÚosÚprepare_deepspeedÚ seed_workerÚtextwrapÚtorchÚtruncate_rightÚunwrap_model_for_generationÚversionÚwarningsÚwrapsrr+r2r6rrrrr!r,r2r6)Ú*)Ú dataclassÚfield)ÚVersion)Ú nullcontext)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionsc
Ctj| d|jd¡ddd}tj| d¡ddd}g}t||ƒD](\}}| tj¡}tj|d| d¡d  d¡}tj
|dd}||} |  | ¡q! t  |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rLÚindex©rLé)
r6ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
ÚlogitsrMÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©reúW/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothOnlineDPOTrainer.pyÚchunked_selective_log_softmax"s  
rgcKs$ddlm}|di|¤Ž}||_|S)Nr)ÚSamplingParamsre)ÚvllmrhÚ _set_kwargs)ÚkwargsrhÚsampling_paramsrererfÚvLLMSamplingParams3s rmceZdZUdZedddidZeeed<edddidZ ee
ed <eddd
idZ ee
ed <  
                            ! ! " #     $           $      % &  '         (      #    $   ) *         + ,   -   . /     d2‡fd0d1„ Z Z
S)3ÚUnslothOnlineDPOConfigu¥
Configuration class for the [`OnlineDPOTrainer`].
This class includes only the parameters that are specific to Online DPO training. For a full list of training
arguments, please refer to the [`~transformers.TrainingArguments`] documentation. Note that default values in this
class may differ from those in [`~transformers.TrainingArguments`].
Using [`~transformers.HfArgumentParser`] we can turn this class into
[argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
command line.
Parameters:
reward_model_path (`str` or `None`, *optional*, defaults to `None`):
Path to the reward model. Either `judge` or `reward_model_path` must be set, but not both.
judge (`str` or `None`, *optional*, defaults to `None`):
Name of the judge to use. Either `judge` or `reward_model_path` must be set, but not both.
max_new_tokens (`int`, *optional*, defaults to `64`):
Maximum number of tokens to generate per completion.
max_length (`int`, *optional*, defaults to `256`):
Maximum total length of the sequence (prompt + completion) used to compute log probabilities. If the
sequence exceeds this limit, the leftmost tokens will be truncated to preserve as much of the completion as
possible.
temperature (`float`, *optional*, defaults to `0.9`):
Temperature for sampling. The higher the temperature, the more random the completions.
missing_eos_penalty (`float` or `None`, *optional*, defaults to `None`):
Penalty applied to the score when the model fails to generate an EOS token. This is useful for encouraging the model to
generate completions shorter than the maximum length (`max_new_tokens`). The penalty must be a positive
value.
beta (`float` or `list[float]`, *optional*, defaults to `0.1`):
Parameter controlling the deviation from the reference model. Higher β means less deviation from the
reference model. For the IPO loss (`loss_type="ipo"`), β is the regularization parameter denoted by τ in
the [paper](https://huggingface.co/papers/2310.12036). If a list of floats is provided, the β at the index of the
current epoch is used, and the last β is used for any remaining epochs.
loss_type (`str`, *optional*, defaults to `"sigmoid"`):
Type of loss to use. Possible values are:
- `"sigmoid"`: sigmoid loss from the original [DPO](https://huggingface.co/papers/2305.18290) paper.
- `"ipo"`: IPO loss from the [IPO](https://huggingface.co/papers/2310.12036) paper.
dataset_num_proc (`int` or `None`, *optional*, defaults to `None`):
Number of processes to use for processing the dataset.
disable_dropout (`bool`, *optional*, defaults to `True`):
Whether to disable dropout in the model and reference model.
use_vllm (`bool`, *optional*, defaults to `False`):
Whether to use vLLM for generating completions. Requires vLLM to be installed (`pip install vllm`).
vllm_model_impl (`str`, *optional*, defaults to `"vllm"`):
Model implementation to use for vLLM. Must be one of `"transformers"` or `"vllm"`. `"transformers"`: Use
the `transformers` backend for model implementation. `"vllm"`: Use the `vllm` library for model
implementation.
gpu_memory_utilization (`float`, *optional*, defaults to `0.55`):
Fraction of GPU memory that vLLM is allowed to use. The default value is 0.55.
ds3_gather_for_generation (`bool`, *optional*, defaults to `True`):
This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation,
improving generation speed. However, disabling this option allows training models that exceed the VRAM
capacity of a single GPU, albeit at the cost of slower generation.
model_init_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
Keyword arguments to pass to `AutoModelForCausalLM.from_pretrained` when instantiating the model from a
string.
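
The `beta` and `loss_type` parameters described above can be illustrated with a small standard-library sketch. This is not the trainer's code: the helper names `beta_for_epoch` and `online_dpo_loss` are hypothetical, the epoch is assumed to be an integer index, and scalars stand in for the per-example tensors the trainer works with. The formulas follow the cited DPO and IPO papers, where the logit is the policy log-ratio minus the reference log-ratio.

```python
import math

def beta_for_epoch(beta, epoch):
    """Pick beta for an epoch (hypothetical helper, integer epoch assumed).

    A list supplies one beta per epoch; once the list is exhausted,
    the last entry is reused. A plain float is returned unchanged.
    """
    if isinstance(beta, list):
        return beta[epoch] if epoch < len(beta) else beta[-1]
    return beta

def online_dpo_loss(pi_logratio, ref_logratio, beta, loss_type="sigmoid"):
    """Scalar loss for one chosen/rejected pair (hypothetical helper).

    pi_logratio:  log p_policy(chosen) - log p_policy(rejected)
    ref_logratio: log p_ref(chosen)    - log p_ref(rejected)
    """
    logits = pi_logratio - ref_logratio
    if loss_type == "sigmoid":
        # -log(sigmoid(x)) written in a numerically stable form.
        x = beta * logits
        return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    if loss_type == "ipo":
        # Squared distance of the logit from the target 1 / (2 * beta).
        return (logits - 1.0 / (2.0 * beta)) ** 2
    raise NotImplementedError(f"invalid loss type {loss_type}")
```

With a larger β the sigmoid loss penalizes a given gap between policy and reference more sharply, which matches the docstring's note that a higher β means less deviation from the reference model.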
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrIz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksz'Maximum sequence length to truncate to.Úmax_seq_lengthFÚnorJéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrOéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéé@éÚsigmoidriçš™™™™™á?c’ s|dkr td|dƒ|dkrtd|dƒ|dur(|#dkr(|$dkr(d}d }#|ˆdur:d
d lm}“t|“ƒd d
ƒ}ˆ|…d
krBtdƒ|…dkrJtdƒtƒjdŸid|d|d|d|d|d|d|d|d| “d|
d| d| d|
d|d|d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-|d.|d/|d0| “d1|!“d2|"“d3|#“d4|$“d5|%“d6|&“d7|'“d8|(“d9|)“d:|*“d;|+“d<|,“d=|-“d>|.“d?|/“d@|0“dA|1“dB|2“dC|3“dD|4“dE|5“dF|6“dG|7“dH|8“dI|9“dJ|:“dK|;“dL|<“dM|=“dN|>“dO|?“dP|@“dQ|A“dR|B“dS|C“dT|D“dU|E“dV|F“dW|G“dX|H“dY|I“dZ|J“d[|K“d\|L“d]|M“d^|N“d_|O“d`|P“da|Q“db|R“dc|S“dd|T“de|U“df|V“dg|W“dh|X“di|Y“dj|Z“dk|[“dl|\“dm|]“dn|^“do|_“dp|`“dq|a“dr|b“ds|c“dt|d“du|e“dv|f“dw|g“dx|h“dy|i“dz|j“d{|k“d||l“d}|m“d~|n“d|o“d€|p“d|q“d|r“dƒ|s“d„|t“d…|u“d†|v“d‡|w“dˆ|x“d‰|y“dŠ|z“d|{“dŒ||“d|}“dŽ|~“d|d|€“d|d’|‚“d“|ƒ“d”|„“d•|…“d–|†“d—|‡“d˜|ˆ“d™|‰“dš|Š“d›|‹“dœ|Œ“d|dž|Ž“|’¤Ž||_||_ ||_
dS) NgH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rOza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rƒr„Úunsloth_training_checkpointsrur)Ú cpu_countrJrvzUUnsloth: Please set a positive non-zero temperature since your results will be wrong.é
zgUnsloth: Please set a positive non-zero temperature less than 10, since sampling will be quite erratic.Ú
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚreward_model_pathÚjudgeÚmax_new_tokensÚ
max_lengthÚ temperatureÚmissing_eos_penaltyÚ loss_typeÚdataset_num_procÚdisable_dropoutÚuse_vllmÚvllm_model_implÚgpu_memory_utilizationÚds3_gather_for_generationÚmodel_init_kwargsre) ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr”ÚmaxÚ MathErrorÚsuperÚ__init__rrrsrt)”Úselfrr—r˜r™rrr r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrrrrrr r!r"r#rrrsrtrkr”©Ú __class__rerfr*ƒs˜  ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·JKµL´M³N²O±P°Q¯R®S­T¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžcdœefšgh˜ijklmnopqrŽstŒuvŠwxˆyz{|}ƒ~ÿþýüûúùø ÷
ö õ ô
óòñ
zUnslothOnlineDPOConfig.__init__)NNFFFruFrJrJNNrvrvrrwrxryrzr{r|r}r~rIrr€rrrTNrƒFrOFrƒr„NTFFFFFFr…r…FFFFr†r‡FFNrINNFrˆFNrNrINNTNFNNFrˆrNNNNr‰NFFrNNNNTFTFFNNrŒNNFNFNFTr‡NNNrˆTFNrFNNFFNNFFFNFTNNrrrzNrNTFrirTNNrIN)Ú__name__Ú
__module__Ú __qualname__Ú__doc__r>rrrrÚ__annotations__rsÚintrtr*Ú
__classcell__rerer,rfrn8sB
=þþþírnc"s>eZdZdZddgZ              d6deeeje fdeeejdfdeeejdfd e
e d
e
e d e
e
d e
eeed
fde
eeee efd
fde
eeeeefde
ede
ede
eegefde
eedeejjejjjfde
eejejgejfddf ‡fdd
Z e!ddƒZ"e#de$dedee e%ffddƒZ&e'e(j)ƒde*fdd „ƒZ)e'e(j+ƒd7de
ee efde*fd!d"„ƒZ+d#d$„Z,d%d&„Z-d'd(„Z. d7dejd)ee eeje%ffd*e
e/dejfd+d,„Z0 d7d-d.„Z1‡fd/d0„Z2   d8d1e
e d2e
e d3ee ee dffd4d5„Z3‡Z4S)9Ú_UnslothOnlineDPOTrainerrˆÚtrlz
online-dpoN©NNÚmodelÚ ref_modelÚ reward_modelrÚargsÚ
data_collatorÚ
train_datasetzdatasets.DatasetÚ eval_datasetÚprocessing_classÚreward_processing_classÚ peft_configÚcompute_metricsÚ callbacksÚ
optimizersÚpreprocess_logits_for_metricsÚreturnc s~t|dƒrt|dƒrt|ddƒdkrd|_||urtdƒ||_|dur1|dur1t dt¡d}n |dur=|dur=tdƒ||_|
|_ ||_
|j durS|durStdƒ|dur[td ƒ| durctd
ƒ|j pgi}t
|tƒr¤|}| d ¡}t
|tjƒs|d ks|durƒnt
|tƒrtt|ƒ}||d <ntd
|dƒtj|fi|¤Ž}n |j dur­tdƒ|jj|_ |jrÄt|ƒ|jdurÄt|jƒ|dud|_n||_|j ¡|jdurß|j ¡|durét| jd}|j |_ gggggggggggdœ |_!|jdurg|j!d<g|j!d<g|j!d<|jr7|j"|_#d|_$t%d!d|j&|j'ddddœtt|dt(ƒƒdiƒ¤Ž|_)nt*|j&|j'ddd|j+rEdndd|_)d|j,d<t-ƒj.|||||| | |
||d
t|j/d ƒrm|j/ 0|j1¡|j2|_3|j4r|jdur‡t5|j|j6|j7|j8ƒ|_|jdurt5|j|j6|j7|j8ƒ|_dSdS|jdur¬|j 9|j:j;¡|_|jdur½|j 9|j:j;¡|_dSdS)"NÚ vllm_enginerFTzš`model` and `ref_model` cannot be the same object. If you want `ref_model` to be the same as `model`, either omit the `ref_model` argument or pass `None`.z€Both `reward_model` and `judge` are provided. Please choose provide only one of them. Ignoring `judge` and using `reward_model`.z2Either `reward_model` or `judge` must be provided.z@`missing_eos_penalty` is not supported when `judge` is provided.z`args` must be provided.z$`processing_class` must be provided.Ú torch_dtyper‡zŽInvalid `torch_dtype` passed to `OnlineDPOConfig`. Expected either 'auto' or a string representing a `torch.dtype` (e.g., 'float32'), but got Ú.z¦You passed `model_init_kwargs` to the `OnlineDPOConfig`, but your model is already instantiated. This argument can only be used when the `model` argument is a string.zfPEFT is not available and passed `peft_config`. Please install PEFT with `pip install peft` to use it.)Ú pad_token_id) ú objective/klúobjective/entropyúobjective/non_score_rewardúrewards/chosenúrewards/rejectedúrewards/accuraciesúrewards/marginsú logps/chosenúlogps/rejectedúval/contain_eos_tokenÚbetaúobjective/rlhf_rewardúobjective/scores_marginúobjective/scoresrrvé2r})Ú
max_tokensrÚtop_kÚtop_pÚ
detokenizerrrj)rrr\r]Ú do_sampleÚ use_cacheÚestimate_tokens)
r8r;r<r=r>r?rBrCrDrEÚadd_model_tagsre)<ÚhasattrÚgetattrrÚ
ValueErrorr9r:ÚwarnÚ UserWarningr:r@rrr#Ú
isinstanceÚstrÚgetr6Údtyper Úfrom_pretrainedÚconfigÚis_encoder_decoderr,Ú ImportErrorrÚmerge_and_unloadrr&r$ÚevalrrJrÚstatsrGÚllmÚ_last_loaded_steprhrrrmÚgeneration_configrÚwarnings_issuedr)r*r8rbÚ
_tag_namesrUÚ_betaÚis_deepspeed_enabledr3rrTÚ acceleratorÚdevice)r+r8r9r:rr;r<r=r>r?r@rArBrCrDrEr#Úmodel_idrHr,rerfr*¿ÿý





ÿÿ
ÿ





 õ


ú
ù
ú
ö ÿ 
ÿÿ  ÿz!_UnslothOnlineDPOTrainer.__init__cCs<t|jtƒr|jj}|t|jƒkr|j|S|jdS|jS)NrI)rhrxÚlistÚstateÚepochÚlen)r+rrererfrUs "z_UnslothOnlineDPOTrainer.betarnÚ tokenizercCs|s6||ddd}|jdur5t|dƒ}|dks"|j|ddkr5|jg|d|d<dg|d|d<n||dd d}d
d | ¡Dƒ}|S) z2Tokenize a single row from a DPO specific dataset.ÚpromptF)Úadd_special_tokensNÚ input_idsrrOÚattention_maskTcSsi|]
\}}d||qS)Úprompt_re)Ú.0ÚkeyÚvaluerererfÚ
<dictcomp>¥sz9_UnslothOnlineDPOTrainer.tokenize_row.<locals>.<dictcomp>)Ú bos_token_idr€Úitems)ÚfeaturernrÚbatchÚprompt_len_input_idsrererfÚ tokenize_row˜s
 z%_UnslothOnlineDPOTrainer.tokenize_rowcCs|jdur tdƒ|j}|j}|j||jj|jj|jjdœ}t|t j
j j ƒs<| 
¡|d<|jj|d<t|d<|jj|d<|j t|fi|¤Ž¡S)Nz+Trainer: training requires a train_dataset.©Ú
batch_sizeÚ
collate_fnÚ num_workersÚ
pin_memoryÚpersistent_workersÚsamplerÚ drop_lastÚworker_init_fnÚprefetch_factor)r=rer<Ú_train_batch_sizer;rhr6ÚutilsÚdatarÚ_get_train_samplerrÒr4rzÚpreparer)r+r=r<Údataloader_paramsrererfÚget_train_dataloader©s
û   z-_UnslothOnlineDPOTrainer.get_train_dataloadercCs |dur
|jdur
tdƒt|tƒr|nd}t|dƒr-||jvr-|jjr-|j  |j|¡St|tƒr7|j|n|dur=|n|j}|j
}|jj ||jj |jj
|jjdœ}t|tjjjƒsn| |¡|d<|jj|d<|jj|d<t|fi|¤Ž}|jjrŠt|dƒr…||j|<n||i|_|j  |¡S)Nz-Trainer: evaluation requires an eval_dataset.rqÚ_eval_dataloadersrr—r˜)r>rerhrircr;rzr<Úeval_batch_sizerÔr6rrÚ_get_eval_samplerrÒr)r+r>Údataloader_keyr<r Úeval_dataloaderrererfÚget_eval_dataloaderÁs@ÿ
ÿ ÿÿûû  
 
 z,_UnslothOnlineDPOTrainer.get_eval_dataloadercsd|jj|jj td|diƒr$|jj||jd|jjddddn|jj ||jd|jjddddfdd „t
d
ƒDƒ}fd d „t
d
ƒDƒ}t d d
|Dƒƒfdd „|Dƒ}fdd „|Dƒ}|jj fdd „|Dƒ}fdd „|Dƒ}fdd „|Dƒ}t
j||jjd}t
j||jjd}t
j||jjd}t
j||jjd}||||fS)NrrFZonline_dpo_trainer_lora_modelT)Ú load_tensors)Úuse_tqdmÚ lora_requestcs&g|]}ˆD]
}t|j|jƒqqSre)r}ÚoutputsÚ token_ids)r‡Úoutput©rerfÚ
<listcomp>s&z;_UnslothOnlineDPOTrainer._generate_vllm.<locals>.<listcomp>rvcs g|] }ˆD]}t|jƒqqSre)r}Úprompt_token_ids)r‡Ú_r®rerfs css|]}t|ƒVqdS©r€©r‡ÚidsrererfÚ <genexpr>sz:_UnslothOnlineDPOTrainer._generate_vllm.<locals>.<genexpr>cs,g|]}dgˆt|ƒdgt|ƒqS)rrOr´)Úmax_prompt_lengthrerfó,cs"g|]
}ˆgˆt|ƒ|qSrer´)r¸rJrerfó"cs,g|]}dgt|ƒdgˆt|ƒqS)rOrr´)r[rerfcs2g|]}|dˆkrt|ƒˆkr|ˆgn|qS)rIr´)Ú eos_token_idr[rerf s$ÿÿcs"g|]
}|ˆgˆt|ƒqSrer´)r[rJrerf
©r{)r?rJr+rsÚchatrur8Ú load_loraÚgenerateÚranger'r[r6Útensorrzr{)r+r8ÚpromptsÚcompletion_idsÚ
prompt_idsÚ prompt_maskÚcompletion_maskre)r¸r[rJrfÚ_generate_vllmñs.$" þ z'_UnslothOnlineDPOTrainer._generate_vllmc ˆjj}ˆjj}dd|Dƒ}fdd|Dƒ}fdd|Dƒ}ˆ |¡}ˆ |¡}|d dd¡}|d dd¡}t|ˆjˆjj d }|j
||ˆj d
} Wdƒn1sYwY| dd|  d¡df}
t
|
||ƒ\}
} |||
| fS) NcSsg|]}d|iqS©rre©r‡rrererfóz6_UnslothOnlineDPOTrainer._generate.<locals>.<listcomp>cóg|]}t|ˆjƒqSre)r0r?©r‡Úr+rerfócsg|] }ˆ |ˆjˆj¡qSre)rrnr?rerfsÚprompt_input_idsrvrOÚprompt_attention_mask)Úgather_deepspeed3_params)r„r…ru)r?rJr<Ú_prepare_inputsÚrepeatr8rzr;r"r¿ruÚsizer7) r+r8rJÚinputsrÄÚunwrapped_modelr®rerfÚ _generates,

 ÿýý  z"_UnslothOnlineDPOTrainer._generatecCt| d¡| d¡|jdƒ}|dd|df}|dd|df}tj||fdd}tj||fdd}|||d} | d¡}
|
dkrI|
dnd} | jdd| df} tj| jdd| d¡dd  d¡}
|
S)NrOrrN)r…rIrv)
r'rr6Úcatr\Útake_along_dimÚ log_softmaxrWrX)r+r8Únum_tokens_to_truncateÚprompt_completion_idsÚprompt_completion_maskr®Ú
prompt_lenÚ start_idxr\ÚlogprobsrererfÚ_forward4s  
$z!_UnslothOnlineDPOTrainer._forwardrÖÚnum_items_in_batchc= s| ¡|d}t|ƒ}ˆjjrˆ ||¡\}}}} n
ˆ ||¡\}}}} tj|ˆjj kdd}
ˆ 
||||| ¡} t  ¡7ˆj durNˆ 
ˆj |||| ¡} nˆj
 ¡ˆ 
ˆj
|||| ¡} Wdƒn1shwYWdƒn1swwY| j}
ˆjj|dd}td|diƒrdd|Dƒ}ˆjdurßtd|diƒr¾t ¡}| t¡fd d|Dƒ}fd
d|Dƒ}ˆj |tt|d|||dƒƒ¡}tjd d|Dƒ|
d }n—d
|}td|diƒr
ddt||ƒDƒ}fdd|Dƒ}dd|Dƒ}dd|Dƒ}ˆj|ddddd |
¡}|jd}ˆj|ddddd |
¡}tj||fdd}t ¡'tˆj |ˆjj!|ƒ\}}}ˆjj"dur[||
ˆjj"8<Wdƒn 1sfwY| #|¡\}}||k}tj$||
d }|||}|||}tj||fdd}| |}| |} |  }!|!|}"||" &d¡}#| |" &d¡}$t #|#|¡\}%}&t #|$|¡\}'}(|%|&})|'|(}*|)|*}+ˆjj'dkrát( )ˆj*|+¡ },nˆjj'dkrô|+dd
ˆj*d
},nt+dˆj'ƒ|, }-ˆj dur2||||}.ˆj-d .ˆj/ 0|. ¡  ¡ˆj-d .ˆj/ 0| ¡  ¡ˆj-d .|
 2¡  ¡ˆj-d .ˆj/ 0|%¡  ¡ˆj-d .ˆj/ 0|&¡  ¡| | }/|/ &d¡ }0ˆj-d  .ˆj/ 0|0¡  ¡ˆj* |/ &d¡}1|1 }2ˆj-d! .ˆj/ 0|2¡  ¡ˆj dur²||1}3ˆj-d" .ˆj/ 0|3¡  ¡|  &d¡  }4ˆj-d# .ˆj/ 0|4¡  ¡ˆj*|%|'}5ˆj/ 0|5¡}6ˆj-d$ .|6  ¡ˆj*|&|(}7ˆj/ 0|7¡}8ˆj-d% .|8  ¡|6|8}9ˆj-d& .|9  ¡|9dk}:ˆj-d' .|:   ¡ˆj-d( .ˆj*¡ˆjj3dur<ˆj4j5ˆjj3dkr<t6ƒi};ˆjj7t8j9t8j:fvrOˆ |;d)<ˆjj<dkrZ|- }-ˆj=r{t> ?|-ˆj@¡ }<|< Wdƒn 1suwYn
ˆj/jA|-fi|;¤Ž|- ˆjjCS)*NrrIrNT)Úskip_special_tokensrcSsg|]}d|dœgqS)Ú assistant)ÚroleÚcontentre©r‡Ú
completionrererfgz:_UnslothOnlineDPOTrainer.training_step.<locals>.<listcomp>cóg|]}ˆj|dqS©)Úmessages©ÚrenderrÉ©ÚtemplatererfrcrerfscSsg|]}|dkqS)rre)r‡Úrankrererf|rvcSsg|] \}}||dœqS))rre)r‡Úcrererfscre)r#r@©r‡ÚexamplerÎrerfƒcSóg|]}|dqSrerererfcS)rerererfÚptÚleft)ÚpaddingÚreturn_tensorsÚ padding_sider„rOÚrightrÚipozinvalid loss type rWrXrTrRrSrKrMrVrLrNrOrQrPrU)DÚtrainr€r;rr6Úanyr?Úno_gradr9r8Údisable_adapterr{Ú batch_decoder+rr.Ú EnvironmentÚ from_stringr r}rSr@rTrRÚinference_moder*r:rJrÚsplitÚarangeÚboolÚsumrrÚ
logsigmoidrUÚNotImplementedErrorÚmeanrrrZrzÚgather_for_metricsÚitemÚfloatr¤r~Ú global_stepr'rÚLOMOÚADALOMOÚ_get_learning_rateÚn_gpuÚuse_apexÚampÚ
scale_lossÚ optimizerÚbackwardÚdetachr¡)=r+r8rÚcontain_eos_tokenráÚ ref_logprobsr{Ú completionsÚ environmentÚranks_of_first_completionÚmaskÚexamplesÚ prompts_idsÚcontext_lengthÚcompletions_idsrÝÚscoresÚ
first_halfÚ second_halfÚ batch_rangeÚchosen_indicesÚrejected_indicesÚ
cr_indicesÚ cr_logprobsÚcr_ref_logprobsÚ padding_maskÚcr_padding_maskÚcr_logprobs_sumÚcr_ref_logprobs_sumÚchosen_logprobs_sumÚrejected_logprobs_sumÚchosen_ref_logprobs_sumÚrejected_ref_logprobs_sumÚ pi_logratiosÚ
ref_logratiosr\ÚlossesÚlossÚ
scores_marginÚklÚmean_klÚnon_score_rewardÚmean_non_score_rewardÚ rlhf_rewardÚ mean_entropyÚchosen_rewardsÚgathered_chosen_rewardsÚrejected_rewardsÚgathered_rejected_rewardsÚmarginÚaccuracyrkÚ scaled_lossre)r+rfÚ
training_stepLs

 ÿü

ÿÿþþ
ÿþþ

ÿø  
 
ÿ$   
ÿ      
ÿz&_UnslothOnlineDPOTrainer.training_stepc Csj|jjr~|jj|jkr~i} | |¡ ¡ ¡}
||8}t|
|jj|jdƒ| d<|dur<t |t
j ƒr8|  ¡ ¡n|| d<|durE|| d<n| 
¡| d<|j ¡D]\} } t| ƒt| ƒ| | <qPdd|jDƒ|_|j|
7_|jj|_| ¡| | |¡d}
|jjr| ||¡}
|j|
|d}|jjdkr||j_|jjr³| ||¡|j |j|j|j¡|_dSdS) NrJr9Ú grad_normr¥cSsi|]}|gqSrere)r‡rˆrererfszE_UnslothOnlineDPOTrainer._maybe_log_save_evaluate.<locals>.<dictcomp>)ÚmetricsÚtrialÚbest)ÚcontrolÚ
should_logr~rÚ_globalstep_last_loggedÚ_nested_gatherr rÚroundrhr6rrrrrr r€Ú_total_loss_scalarÚ
store_flosÚlogÚshould_evaluateÚ _evaluateÚ_determine_best_metricr;r¸Ú should_saveÚ_save_checkpointÚcallback_handlerÚon_save)r+Útr_lossrIr8rKrÚignore_keys_for_evalÚ
start_timer¥ÚlogsÚtr_loss_scalarrˆÚvalrJÚis_new_best_metricrererfÚ_maybe_log_save_evaluates6 
 
    þz1_UnslothOnlineDPOTrainer._maybe_log_save_evaluatecsL|jjdurt|jjƒj}n |jj d¡d}|j|dtƒ ||¡dS)/rI)Ú
model_name) r;rrÚnamerÚcreate_model_cardr)rY)r+r8rKrer,rerfrY,s
 z)_UnslothOnlineDPOTrainer._save_checkpointreÚ dataset_nameÚtagsc
C| ¡sdSt|jjdƒrtj |jjj¡s|jjj}nd}|dur&tƒ}n
t |t
ƒr/|h}nt|ƒ}t|jjdƒr?|  d¡|  |j
¡t d¡}t|||j||tƒr]tjdur]tjjndtƒd|ddd }| tj |jjd
¡¡dS) 
Creates a draft of a model card using the information available to the `Trainer`.
Args:
model_name (`str` or `None`, *optional*, defaults to `None`):
Name of the model.
dataset_name (`str` or `None`, *optional*, defaults to `None`):
Name of the dataset used for training.
tags (`str`, `list[str]` or `None`, *optional*, defaults to `None`):
Tags to be associated with the model card.
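
The `tags` argument accepts `None`, a single string, or a list of strings. A minimal sketch of how such input might be normalized into a set is shown here. The function name `normalize_tags` and the `is_unsloth` flag are hypothetical; the default base tags `"trl"` and `"online-dpo"` and the extra `"unsloth"` tag are taken from strings visible in this file, not from a confirmed API.

```python
def normalize_tags(tags=None, base_tags=("trl", "online-dpo"), is_unsloth=False):
    """Return the model-card tags as a set (hypothetical helper)."""
    if tags is None:
        result = set()
    elif isinstance(tags, str):
        # A bare string is one tag, not an iterable of characters.
        result = {tags}
    else:
        result = set(tags)
    if is_unsloth:
        # Assumption: models carrying an Unsloth version marker get this tag.
        result.add("unsloth")
    result.update(base_tags)
    return result
```

Checking for `str` before calling `set(...)` matters, because `set("rlhf")` would otherwise split the tag into single characters.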
_name_or_pathÚunsloth_versionÚunslotha» @article{guo2024direct,
title = {{Direct Language Model Alignment from Online AI Feedback}},
author = {Shangmin Guo and Biao Zhang and Tianlin Liu and Tianqi Liu and Misha Khalman and Felipe Llinares and Alexandre Ram{'{e}} and Thomas Mesnard and Yao Zhao and Bilal Piot and Johan Ferret and Mathieu Blondel},
year = 2024,
eprint = {arXiv:2402.04792}
}z
Online DPOz7Direct Language Model Alignment from Online AI Feedbackz
2402.04792) Ú