DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothGKDTrainer.cpython-310.pyc
o
õ×°hr¡ã@dZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZmZmZmZm Z m Z m!Z!m"Z"m#Z#m$Z$m%Z%mZm&Z&m'Z'm(Z(m)Z)mZm*Z*ddl&Z&ddlTddl+m,Z,m-Z-dd l.m/Z/ddlZddl0Z1dd
l2m3Z3ddlmZdd l4m5Z5m6Z7d d
d d
d
dœZ8ej9d d e8dddƒZ:e,GdddeƒƒZ; GdddeƒZ<Gddde<ƒZ=dS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)!rÚAutoModelForCausalLMÚBaseImageProcessorr Ú DataCollatorÚDataCollatorForChatMLÚDatasetÚEvalPredictionÚFeatureExtractionMixinÚ GKDConfigÚ
GKDTrainerÚGenerationConfigrÚ
PeftConfigÚPreTrainedModelÚPreTrainedTokenizerBaseÚProcessorMixinÚ
SFTTrainerÚTrainerCallbackrÚdisable_dropout_in_modelÚ empty_cacheÚgenerate_model_cardÚget_comet_experiment_urlÚis_wandb_availableÚnnÚosÚprepare_deepspeedÚrandomÚtextwrapÚtorchÚunwrap_model_for_generation)Ú*)Ú dataclassÚfield)ÚVersion)Ú nullcontext)ÚDataCollatorForSeq2SeqÚDataCollatorForLanguageModelingTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionsc
Ctj| d|jd¡ddd}tj| d¡ddd}g}t||ƒD](\}}| tj¡}tj|d| d¡d  d¡}tj
|dd}||} |  | ¡q! t  |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)r9Úindex©r9é)
r'ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
Úlogitsr:Úchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©rRúQ/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothGKDTrainer.pyÚchunked_selective_log_softmax"s  
rTceZdZUdZedddidZeeed<edddidZ ee
ed <eddd
idZ ee
ed <  
                            ! ! " #     $           $      % &  '         (      #    $   ) *         +     , -     . . /      d2‡fd0d1„ Z Z
S)3ÚUnslothGKDConfigaB
Configuration class for [`GKDTrainer`].
This class includes only the parameters that are specific to GKD training. For a full list of training arguments,
please refer to the [`~transformers.TrainingArguments`] and [`SFTConfig`] documentation.
Args:
temperature (`float`, *optional*, defaults to `0.9`):
Temperature for sampling. The higher the temperature, the more random the completions.
lmbda (`float`, *optional*, defaults to `0.5`):
Lambda parameter that controls the student data fraction (i.e., the proportion of on-policy
student-generated outputs).
beta (`float`, *optional*, defaults to `0.5`):
Interpolation coefficient between `0.0` and `1.0` of the Generalized Jensen-Shannon Divergence loss. When
beta is `0.0`, the loss is the KL divergence. When beta is `1.0`, the loss is the Inverse KL Divergence.
max_new_tokens (`int`, *optional*, defaults to `128`):
Maximum number of tokens to generate per completion.
teacher_model_name_or_path (`str` or `None`, *optional*, defaults to `None`):
Model name or path of the teacher model. If `None`, the teacher model will be the same as the model being
trained.
teacher_model_init_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
Keyword arguments to pass to `AutoModelForCausalLM.from_pretrained` when instantiating the teacher model
from a string.
disable_dropout (`bool`, *optional*, defaults to `True`):
Whether to disable dropout in the model.
seq_kd (`bool`, *optional*, defaults to `False`):
Seq_kd parameter that controls whether to perform Sequence-Level KD (can be viewed as supervised FT on
teacher-generated output).
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsr6z8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksz'Maximum sequence length to truncate to.Úmax_seq_lengthFÚnor7éréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsr<éôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéÚtextéÚbfdçà?造 s†|dkr td|dƒ|dkrtd|dƒ|dur(|#dkr(|$dkr(d}d }#|…dur:d
d lm}t|ƒd d
ƒ}…tj dd¡dkrWd
dlm }ž|žrW|ŒdurWd
dlm
|Ÿ}Œ|d
kr_t dƒ|dkrgt dƒt ƒj
d®id|d|d|d|d|d|d|d|d| “d|
d | d!| d"|
d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-|d.|d/|d0|d1|d2|d3|d4|d5| “d6|!“d7|"“d8|#“d9|$“d:|%“d;|&“d<|'“d=|(“d>|)“d?|*“d@|+“dA|,“dB|-“dC|.“dD|/“dE|0“dF|1“dG|2“dH|3“dI|4“dJ|5“dK|6“dL|7“dM|8“dN|9“dO|:“dP|;“dQ|<“dR|=“dS|>“dT|?“dU|@“dV|A“dW|B“dX|C“dY|D“dZ|E“d[|F“d\|G“d]|H“d^|I“d_|J“d`|K“da|L“db|M“dc|N“dd|O“de|P“df|Q“dg|R“dh|S“di|T“dj|U“dk|V“dl|W“dm|X“dn|Y“do|Z“dp|[“dq|\“dr|]“ds|^“dt|_“du|`“dv|a“dw|b“dx|c“dy|d“dz|e“d{|f“d||g“d}|h“d~|i“d|j“d€|k“d|l“d|m“dƒ|n“d„|o“d…|p“d†|q“d‡|r“dˆ|s“d‰|t“dŠ|u“d|v“dŒ|w“d|x“dŽ|y“d|z“d|{“d||“d|}“d“|~“d”|d•|€“d–|d—|‚“d˜|ƒ“d™|„“dš|…“d›|†“dœ|‡“d|ˆ“dž|‰“dŸ|Š“d |‹“d¡|Œ“d¢|d£|Ž“d¤|d¥|d¦|‘“d§|’“d¨|““d©|”“dª|•“d«|–“d¬|—“d­|˜“|œ¤Ž|™|_|š|_||_dS)¯NgH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!r<za` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rjrkÚunsloth_training_checkpointsr\r)Ú cpu_countr7r]ÚUNSLOTH_ENABLE_FLEX_ATTENTIONÚ1)ÚHAS_FLEX_ATTENTION)ÚFLEX_ATTENTION_BLOCK_SIZEzUUnsloth: Please set a positive non-zero temperature since your results will be wrong.é
zgUnsloth: Please set a positive non-zero temperature less than 10, since sampling will be quite erratic.Ú
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚmodel_init_kwargsÚchat_template_pathÚdataset_text_fieldÚdataset_kwargsÚdataset_num_procÚ eos_tokenÚ pad_tokenÚ
max_lengthÚpackingÚpacking_strategyÚ padding_freeÚpad_to_multiple_ofÚ eval_packingÚcompletion_only_lossÚassistant_only_lossÚactivation_offloadingÚ temperatureÚlmbdaÚbetaÚmax_new_tokensÚteacher_model_name_or_pathÚteacher_model_init_kwargsÚdisable_dropoutÚseq_kdrR)ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr|Úmaxr#ÚenvironÚgetZunsloth_zoo.flex_attentionr€rZ MathErrorÚsuperÚ__init__rYrZr[) Úselfrƒr„r…r†r‡r‰rrrrrrr“r”r•r–r—r™rrr r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrYrZr[Úkwargsr|r€r©Ú __class__rRrSr"`      ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·JKµL´M³N²O±P°Q¯R®S­T¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžcdœefšgh˜ijklmnopqrŽstŒuvŠwxˆyz{|}ƒ~ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèç
zUnslothGKDConfig.__init__)NNFFFr\Fr7r7NNr]r]rr^r_r`rarbrcrdrer6rfrgrrhriTNrjFr<FrjrkNTFFFFFFrlrlFFFFrmrnFFNr6NNFroFNrNr6NNTNFNNFrorNNNNrprqNFFrrNNNNTFTFFNNrsNNFNFNFTrnNNNroTFNrtruFNNFFNNFFFNFTNNrvNNNNrwFrxFNNNFFraryryrzNNTFNr6N)Ú__name__Ú
__module__Ú __qualname__Ú__doc__r+rYrrÚ__annotations__rZÚintr[r"Ú
__classcell__rRrRr%rSrU3sV
þþþãrUcs eZdZddgZ             d)deeeeje fdeeeje fdee
dee d ee d
eee e
e e ffd eeeeeefd eeege
fd
eeedeejjejjjfdeeejejgejfdeddeeffdd
Ze d*ddƒZd+ddZ ed,ddƒZ! d,dejde
e eeje"ffd ee#d!ejffd"d#„
Z$   d-d$ee d%ee d&ee ee dffd'd(„Z%‡Z&S).Ú_UnslothGKDTrainerÚtrlÚgkdN©NNÚmodelÚ
teacher_modelÚargsÚ
data_collatorÚ
train_datasetÚ eval_datasetÚprocessing_classÚcompute_metricsÚ callbacksÚ
optimizersÚpreprocess_logits_for_metricsÚ peft_configrÚformatting_funccsXd|_t||jd}tƒj|||||||| |
| | |
d |jdur$i}nt|tƒs-tdƒ|j}|ddvr:|dnt t
|dƒ|d<t|tƒrQt j |fi|¤Ž}|j
rYt|jƒ|jrdt||jƒ|_n |jj|dd|_|j|_|j|_|j|_|j|_t|j|jdd |jr‰dnd|jjd
|_t|jjd ƒr¨|jjj durª|jjj |j_ dSdSdS) NF)Ú tokenizerr
) r4r5r6r7r8r9r:r;r<r=r>zfYou passed teacher_model_init_kwargs to the GKDConfig, but your teacher_model is already instantiated.Ú torch_dtype)rnNT)Úevaluation_moder)rrÚ do_sampleÚtop_kÚ use_cacheÚ pad_token_idÚ eos_token_id)!rÆrr
r!r"rÚ
isinstanceÚstrÚ
ValueErrorÚgetattrr'r Úfrom_pretrainedrrr2Úis_deepspeed_enabledr$Ú acceleratorr3Ú
prepare_modelrrrrrrr8rEÚgeneration_configÚhasattrrF)r#r2r3r4r5r6r7r8r9r:r;r<r=r>rr%rRrSr"´shô

ÿ ÿ ý

 ú
ÿüz_UnslothGKDTrainer.__init__ryrdÚ batchmeanc
CsT||}||}tj|dd}tj|dd}|dkr$tj||ddd}nJ|dkr2tj||ddd}n<tj||jd}tjt |t d|¡|t |¡g¡dd} tj| |ddd}
tj| |ddd} ||
d|| }|d urz|d
k} || }|d kr˜|d urŠ|  ¡|   ¡S|  ¡| 
d¡| 
d¡S|d kr |  ¡S|d
kr¨|  ¡S|S)
Compute the generalized Jensen-Shannon Divergence loss for knowledge distillation using F.kl_div. See Eq. (1)
of https://huggingface.co/papers/2306.13649 for the definition.
Args:
student_logits:
Tensor of shape (batch_size, sequence_length, vocab_size)
teacher_logits:
Tensor of shape (batch_size, sequence_length, vocab_size)
labels:
Tensor of shape (batch_size, sequence_length) with -100 for padding tokens to ignore when computing
loss
beta:
Interpolation coefficient between 0 and 1 (default: 0.5)
temperature:
Softmax temperature (default: 1.0)
reduction:
Specifies the reduction to apply to the output (default: 'batchmean')
Returns:
loss: Scalar tensor with the generalized JSD loss
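The docstring above can be illustrated with a minimal pure-Python sketch of the generalized JSD for a single token position. It is not the trainer's implementation: it assumes the mixture is `(1 - beta) * student + beta * teacher` and that the result is `beta * KL(teacher || mixture) + (1 - beta) * KL(student || mixture)`, and it leaves out the `-100` label masking and the `reduction` handling. The helper names `log_softmax`, `kl`, and `generalized_jsd` are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over one vocabulary vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def kl(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def generalized_jsd(student_logits, teacher_logits, beta=0.5, temperature=1.0):
    """Generalized JSD for one position (assumed form, see lead-in)."""
    s = log_softmax(student_logits, temperature)
    t = log_softmax(teacher_logits, temperature)
    if beta == 0:
        return kl(t, s)  # forward KL: teacher against student
    if beta == 1:
        return kl(s, t)  # inverse KL: student against teacher
    # Log of the mixture (1 - beta) * student + beta * teacher.
    mix = [
        math.log((1 - beta) * math.exp(a) + beta * math.exp(b))
        for a, b in zip(s, t)
    ]
    return beta * kl(t, mix) + (1 - beta) * kl(s, mix)
```

With identical student and teacher logits the value is zero for any `beta`, and at `beta = 0.5` the measure is symmetric in its two arguments.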
r6r;rÚnoneT)Ú reductionÚ
log_targetr<)ÚdtypeNéœÿÿÿrQÚsumÚmean) rÚ log_softmaxÚkl_divr'ÚtensorrUrFÚstackÚlogrWÚsizerX)
Ústudent_logitsÚteacher_logitsÚlabelsrrrSÚstudent_log_probsÚteacher_log_probsÚjsdÚmixture_log_probsÚ
kl_teacherÚ
kl_studentÚmaskrRrRrSÚgeneralized_jsd_loss
s4$þ4z'_UnslothGKDTrainer.generalized_jsd_lossFc C||d|dd}|j ¡t ¡|j|d|dd}Wdƒn1s)wY|djd}|jdd|ddddf}|jdd|ddddf} |ddd|df}
|j|| |
|jd} tƒ|rt| |fS| S) NÚ input_idsÚattention_mask)rjrkÚpromptsr<r6ra)r_r`rar) r3Úevalr'Úno_gradr?rIrirr) r#r2ÚinputsÚreturn_outputsÚnum_items_in_batchÚoutputs_studentÚoutputs_teacherÚprompt_lengthsÚshifted_student_logitsÚshifted_teacher_logitsÚshifted_labelsÚlossrRrRrSÚ compute_lossQs.þ

þÿ  üz_UnslothGKDTrainer.compute_losscCs`|j|d| dd¡|dd}|j}t |¡}| ¡}|dur+d|||k<d|||k<|||fS)NrlÚprompt_attention_maskT)rjrkrOÚreturn_dict_in_generaterVr)Úgenerater Ú sequencesr'Ú ones_likeÚclone)r2rorOrEÚgenerated_outputsÚgenerated_tokensÚnew_attention_maskÚ
new_labelsrRrRrSÚgenerate_on_policy_outputsts
ü
  
z-_UnslothGKDTrainer.generate_on_policy_outputsrorqÚreturnc |jr4t|j|jƒ}| |||j|jj¡\}}}Wdƒn1s#wY||d<||d<||d<t ¡|j krkt||jƒ}| |||j|jj¡\}}}Wdƒn1sZwY||d<||d<||d<t
ƒ  |||¡}|S)aa
Perform a training step for the Generalized Knowledge Distillation (GKD) model.
This method implements the on-policy learning approach described in the GKD paper. With probability
`self.lmbda`, it generates new responses using the student model, which are then used for training instead of
the original inputs.
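The data-source choice described above can be sketched as a small stand-alone function. This is an assumption-laden illustration rather than the trainer's code: it assumes that teacher generation happens first when `seq_kd` is set and that a student-generated batch, drawn with probability `lmbda`, then replaces it. The function name `choose_batch_source` and its string labels are hypothetical.

```python
import random


def choose_batch_source(lmbda, seq_kd, rng):
    """Return which completions one training step would use.

    lmbda:  probability of training on student-generated (on-policy) outputs.
    seq_kd: whether teacher-generated outputs replace the dataset targets.
    rng:    a random.Random instance, passed in so the draw is reproducible.
    """
    source = "dataset"
    if seq_kd:
        source = "teacher"  # sequence-level KD: teacher completions
    if rng.random() <= lmbda:
        source = "student"  # on-policy: student completions take precedence
    return source


# Rough check of the on-policy fraction for lmbda = 0.5.
rng = random.Random(0)
draws = [choose_batch_source(0.5, False, rng) for _ in range(10_000)]
student_fraction = draws.count("student") / len(draws)
```

Over many steps `student_fraction` should land close to `lmbda`, which is the sense in which `lmbda` controls the student data fraction.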
Nrjrkra) rr(r3rMr„rOr8rEr%rr!Ú
training_step) r#r2rorqÚunwrapped_modelÚ
new_input_idsrrxr%rRrSr†s(
 ÿÿ ÿÿz _UnslothGKDTrainer.training_stepÚ
model_nameÚ dataset_nameÚtagsc
C| ¡sdSt|jjdƒrtj |jjj¡s|jjj}nd}|dur&tƒ}n
t |t