DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/__pycache__/UnslothSFTTrainer.cpython-310.pyc

334 lines
39 KiB

2025-08-28 17:57:59 +00:00
o
2025-08-28 22:41:56 +00:00
ö×°h:ýã@s*dZddlmZddlZddlmZddlmZddlmZm Z m
Z
m Z m Z m
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZmZmZmZmZm Z m!Z!m"Z"m#Z#m$Z$m Z m%Z%m&Z&m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0mZm1Z1m2Z2m3Z3m4Z4m5Z5m6Z6mZm7Z7m8Z8mZmZmZmZmZm
Z
m Z m1Z1m2Z2m3Z3m
Z
mZmZm"Z"m/Z/m1Z1m4Z4mZm1Z1ddl1Z1ddlTddl(m'Z'm9Z9dd l:m;Z;ddlZddl<Z=dd
l&m>Z>ddlmZdd l?m@Z@mZAd d
d d
d
dœZBejCd d eBdddƒZDe'Gddde ƒƒZE Gddde"ƒZFGdddeFƒZGdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)ArÚAutoModelForCausalLMÚ
AutoTokenizerÚBaseImageProcessorr Ú DataCollatorÚDataCollatorForLanguageModelingÚDatasetÚEvalPredictionÚFeatureExtractionMixinÚIterableDatasetrÚPathÚ
PeftConfigÚ PeftModelÚPreTrainedModelÚPreTrainedTokenizerBaseÚProcessorMixinÚ SFTConfigÚ
SFTTrainerÚTrainerÚTrainerCallbackÚTrainingArgumentsrÚclone_chat_templateÚ
contextlibÚ dataclassÚ dataclassesÚ defaultdictÚgenerate_model_cardÚget_act_offloading_ctx_managerÚget_comet_experiment_urlÚget_peft_modelÚis_conversationalÚis_peft_availableÚis_wandb_availableÚnnÚosÚ pack_datasetÚpadÚpeftÚpeft_module_casting_to_bf16Úprepare_model_for_kbit_trainingÚtorchÚversionÚwarningsr rrrrrrr-r.r/rrrrr*r-r0r3r-)Ú*)r"Úfield)ÚVersion)Ú nullcontext©ÚDataCollatorForSeq2SeqrTF)Úepilogue_fusionÚ max_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ fullgraphÚoptionsc
Ctj| d|jd¡ddd}tj| d¡ddd}g}t||ƒD](\}}| tj¡}tj|d| d¡d  d¡}tj
|dd}||} |  | ¡q! t  |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rEÚindex)rEé)
r3ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ unsqueezeÚsqueezeÚ logsumexpÚappendÚconcat)
ÚlogitsrFÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚ chunk_logitsÚ chunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©r]úQ/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothSFTTrainer.pyÚchunked_selective_log_softmax"s  
r_ceZdZUdZedddidZeeed<edddidZ ee
ed <eddd
idZ ee
ed <  
                            ! ! " #     $           $      % &  '         (      #    $   ) *         +     , -      d0‡fd.d/„ Z Z
S)1ÚUnslothSFTConfiga5
Configuration class for the [`SFTTrainer`].
This class includes only the parameters that are specific to SFT training. For a full list of training arguments,
please refer to the [`~transformers.TrainingArguments`] documentation. Note that default values in this class may
differ from those in [`~transformers.TrainingArguments`].
Using [`~transformers.HfArgumentParser`] we can turn this class into
[argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
command line.
Parameters:
> Parameters that control the model
model_init_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
Keyword arguments for [`~transformers.AutoModelForCausalLM.from_pretrained`], used when the `model`
argument of the [`SFTTrainer`] is provided as a string.
chat_template_path (`str` or `None`, *optional*, defaults to `None`):
If specified, sets the model's chat template. This can either be the path to a tokenizer (local directory
or Hugging Face Hub model) or a direct path to a Jinja template file. When using a Jinja file, you must
ensure that any special tokens referenced in the template are added to the tokenizer and that the model's
embedding layer is resized accordingly.
> Parameters that control the data preprocessing
dataset_text_field (`str`, *optional*, defaults to `"text"`):
Name of the column that contains text data in the dataset.
dataset_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
Dictionary of optional keyword arguments for the dataset preparation. The only supported key is
`skip_prepare_dataset`.
dataset_num_proc (`int` or `None`, *optional*, defaults to `None`):
Number of processes to use for processing the dataset.
eos_token (`str` or `None`, *optional*, defaults to `None`):
Token used to indicate the end of a turn or sequence. If `None`, it defaults to
`processing_class.eos_token`.
pad_token (`str` or `None`, *optional*, defaults to `None`):
Token used for padding. If `None`, it defaults to `processing_class.pad_token`, or if that is also `None`,
it falls back to `processing_class.eos_token`.
max_length (`int` or `None`, *optional*, defaults to `1024`):
Maximum length of the tokenized sequence. Sequences longer than `max_length` are truncated from the right.
If `None`, no truncation is applied. When packing is enabled, this value sets the sequence length.
packing (`bool`, *optional*, defaults to `False`):
Whether to group multiple sequences into fixed-length blocks to improve computational efficiency and reduce
padding. Uses `max_length` to define sequence length.
packing_strategy (`str`, *optional*, defaults to `"bfd"`):
Strategy for packing sequences. Can be either `"bfd"` (best-fit decreasing, default), or `"wrapped"`.
padding_free (`bool`, *optional*, defaults to `False`):
Whether to perform forward passes without padding by flattening all sequences in the batch into a single
continuous sequence. This reduces memory usage by eliminating padding overhead. Currently, this is only
supported with FlashAttention 2 or 3, which can efficiently handle the flattened batch structure. When
packing is enabled with strategy `"bfd"`, padding-free is enabled, regardless of the value of this
parameter.
pad_to_multiple_of (`int` or `None`, *optional*, defaults to `None`):
If set, the sequences will be padded to a multiple of this value.
eval_packing (`bool` or `None`, *optional*, defaults to `None`):
Whether to pack the eval dataset. If `None`, uses the same value as `packing`.
> Parameters that control the training
completion_only_loss (`bool` or `None`, *optional*, defaults to `None`):
Whether to compute loss only on the completion part of the sequence. If set to `True`, loss is computed
only on the completion, which is supported only for [prompt-completion](#prompt-completion) datasets. If
`False`, loss is computed on the entire sequence. If `None` (default), the behavior depends on the dataset:
loss is computed on the completion for [prompt-completion](#prompt-completion) datasets, and on the full
sequence for [language modeling](#language-modeling) datasets.
assistant_only_loss (`bool`, *optional*, defaults to `False`):
Whether to compute loss only on the assistant part of the sequence. If set to `True`, loss is computed
only on the assistant responses, which is supported only for [conversational](#conversational) datasets. If `False`,
loss is computed on the entire sequence.
activation_offloading (`bool`, *optional*, defaults to `False`):
Whether to offload the activations to the CPU.
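The `completion_only_loss` behavior described above can be illustrated with a minimal, library-free sketch. It assumes the common convention that label positions set to `-100` are skipped by the cross-entropy loss (the default `ignore_index` in PyTorch). The helper name `build_labels` and the token ids are hypothetical and for illustration only; this is not the trainer's actual collator code.

```python
# Hypothetical sketch of completion-only loss masking; not taken from the trainer.
IGNORE_INDEX = -100  # assumption: positions with this label are ignored by the loss


def build_labels(input_ids, completion_mask=None):
    """Build training labels from token ids.

    input_ids: token ids of the prompt followed by the completion.
    completion_mask: 0/1 flags, 1 marking completion tokens. If None,
        the loss covers the entire sequence (language-modeling style).
    """
    if completion_mask is None:
        return list(input_ids)
    if len(completion_mask) != len(input_ids):
        raise ValueError("completion_mask must have the same length as input_ids")
    # Keep completion tokens as labels; mask prompt tokens out of the loss.
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(input_ids, completion_mask)]


# Made-up token ids: three prompt tokens followed by three completion tokens.
prompt_ids = [101, 7, 8]
completion_ids = [42, 43, 2]
input_ids = prompt_ids + completion_ids
completion_mask = [0] * len(prompt_ids) + [1] * len(completion_ids)

labels = build_labels(input_ids, completion_mask)
print(labels)  # [-100, -100, -100, 42, 43, 2]
```

The same masking idea would apply to `assistant_only_loss`, with a mask that marks assistant turns instead of the completion.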
helpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrBz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksz'Maximum sequence length to truncate to.Úmax_seq_lengthFÚnorCéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?çlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrGéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéÚtextéÚbfdc”˜ s6|dkr td|dƒ|dkrtd|dƒ|dur(|#dkr(|$dkr(d}d }#|…dur:d
d lm}•t|•ƒd d
ƒ}…tj dd¡dkrWd
dlm }|rW|ŒdurWd
dlm
}—|—}Œt ƒj d£id|d|d|d|d|d|d|d|d| “d|
d| d| d|
d |d!|d"|d#|d$|d%|d&|d'|d(|d)|d*|d+|d,|d-|d.|d/|d0|d1|d2| “d3|!“d4|"“d5|#“d6|$“d7|%“d8|&“d9|'“d:|(“d;|)“d<|*“d=|+“d>|,“d?|-“d@|.“dA|/“dB|0“dC|1“dD|2“dE|3“dF|4“dG|5“dH|6“dI|7“dJ|8“dK|9“dL|:“dM|;“dN|<“dO|=“dP|>“dQ|?“dR|@“dS|A“dT|B“dU|C“dV|D“dW|E“dX|F“dY|G“dZ|H“d[|I“d\|J“d]|K“d^|L“d_|M“d`|N“da|O“db|P“dc|Q“dd|R“de|S“df|T“dg|U“dh|V“di|W“dj|X“dk|Y“dl|Z“dm|[“dn|\“do|]“dp|^“dq|_“dr|`“ds|a“dt|b“du|c“dv|d“dw|e“dx|f“dy|g“dz|h“d{|i“d||j“d}|k“d~|l“d|m“d€|n“d|o“d|p“dƒ|q“d„|r“d…|s“d†|t“d‡|u“dˆ|v“d‰|w“dŠ|x“d|y“dŒ|z“d|{“dŽ||“d|}“d|~“d|d|€“d“|d”|‚“d•|ƒ“d–|„“d—|…“d˜|†“d™|‡“dš|ˆ“d›|‰“dœ|Š“d|‹“dž|Œ“dŸ|d |Ž“d¡|d¢||”¤Ž|‘|_
||_|“|_dS)¤NgH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rGza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rurvÚunsloth_training_checkpointsrgr©Ú cpu_countrCrhÚUNSLOTH_ENABLE_FLEX_ATTENTIONÚ1)ÚHAS_FLEX_ATTENTION)ÚFLEX_ATTENTION_BLOCK_SIZEÚ
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚ weight_decayÚ
adam_beta1Ú
adam_beta2Ú adam_epsilonÚ
max_grad_normÚnum_train_epochsÚ max_stepsÚlr_scheduler_typeÚ warmup_ratioÚ warmup_stepsÚ log_levelÚlog_level_replicaÚlog_on_each_nodeÚ logging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚ ddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚ disable_tqdmÚremove_unused_columnsÚ label_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚ fsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ adafactorÚgroup_by_lengthÚlength_column_nameÚ report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚ push_to_hubÚresume_from_checkpointÚ hub_model_idÚ hub_strategyÚ hub_tokenÚhub_private_repoÚhub_always_pushÚ hub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚ fp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚ torchdynamoÚ ray_scopeÚ ddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚmodel_init_kwargsÚchat_template_pathÚdataset_text_fieldÚdataset_kwargsÚdataset_num_procÚ eos_tokenÚ pad_tokenÚ
max_lengthÚpackingÚpacking_strategyÚ padding_freeÚpad_to_multiple_ofÚ eval_packingÚcompletion_only_lossÚassistant_only_lossÚactivation_offloadingr])ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr†Úmaxr-ÚenvironÚgetÚunsloth_zoo.flex_attentionrŠrÚsuperÚ__init__rdrerf)˜ÚselfrŒrrrrrr“r”r•r–r—r™rrr r­r¿rÿrrrrrrrrrr r
r r r
rrrrrrrrrrrrrrrdrerfÚkwargsr†r©Ú __class__r]r^r$     ÿþýüûúùø ÷
ö õ ô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·JKµL´M³N²O±P°Q¯R®S­T¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžcdœefšgh˜ijklmnopqrŽstŒuvŠwxˆyz{|}ƒ~ÿþýüûúùø ÷
ö õ ô
óòñðï
zUnslothSFTConfig.__init__)“NNFFFrgFrCrCNNrhrhrrirjrkrlrmrnrorprBrqrrrrsrtTNruFrGFrurvNTFFFFFFrwrwFFFFrxryFFNrBNNFrzFNrNrBNNTNFNNFrzrNNNNr{r|NFFr}NNNNTFTFFNNr~NNFNFNFTryNNNrzTFNrr€FNNFFNNFFFNFTNNrNNNNrFrƒFNNNFFNrBN)Ú__name__Ú
__module__Ú __qualname__Ú__doc__r7rdrrÚ__annotations__reÚintrfr$Ú
__classcell__r]r]r'r^r`3sF
Jþþþër`c sdeZdZdZddgZ             d7deeeje fde
ee e fde
e
d e
eeefd
e
eeeeeffd e
eeeeefd e
ed
e
eegefde
eedee
ejje
ejjjfde
eeejjeee ffde
eej!ej!gej!fde
dde
eegefffdd
Z"dede de fddZ#de de de de fddZ$de de de fddZ%de de de fdd „Z&d!eeefd"e'de
eegefd#edeeeff
d$d%„Z(d&d'„Z)d8‡fd)d*„ Z*‡fd+d,„Z+d9d-eee,fd.e
e,ddffd/d0„
Z-‡fd1d2„Z.   d:d3e
ed#e
ed4eeeedffd5d6„Z/‡Z0S);Ú_UnslothSFTTrainerrzÚtrlÚsftN©NNÚmodelÚargsÚ
data_collatorÚ
train_datasetÚ eval_datasetÚprocessing_classÚcompute_loss_funcÚcompute_metricsÚ callbacksÚ
optimizersÚoptimizer_cls_and_kwargsÚpreprocess_logits_for_metricsÚ peft_configrÚformatting_funccst|tƒr|n|jj}ˆdur| d¡d}t|dƒntˆtƒr=tˆtƒs=ˆ ¡}ˆj|d<|  d¡td*i|¤ŽˆdurFt
  |¡ˆj durgˆj }ˆ 
|¡}|durdtd|dˆjjdƒ|ˆ_ˆjdurvt|tƒsvt d ¡t|tƒrˆ |ˆ¡}ˆjdur½tj ˆj¡r²ˆj d
¡r²tˆjd d 
}| ¡ˆ_Wdƒn1sªwYg}n
t|ˆˆjƒ\}}ng} ˆj#pˈj$oˈj%dkˆ_#|jj&dv}ˆj#rÿ|durÞtdƒˆj$rëˆj%dkrët d¡|sòt d¡ˆj'dkrÿˆj$sÿt d¡t(t)|ƒƒ}ˆj*durd|vˆ_*nˆj*ˆ_*|durHˆj+p$ˆj+p$ˆj }ˆ 
|¡}|dur<td|dˆjjdƒt,|ˆj*ˆj#|ˆj-d}ˆj$rZˆj%dkrZ|sZt d¡ˆj.rgt/|ƒsgtdƒˆj0duptˆj0 1dd
¡ }|r¿ˆj*rƒˆrƒtd ƒˆ 2|ˆˆˆj$ˆd!¡}|dur¿ˆj3durœˆj$nˆj3‰t|t4ƒrµ‡fd"d#„| Dƒ}n
ˆ 2|ˆˆˆˆd$¡}t6t7ƒt6t7ƒd%œˆ_8d&ˆ_9t:ƒj;|ˆ|||ˆ||| |
| | d' ˆj<j=rët>ˆj?d(ˆ_@ntA ˆ_@tCˆj?d)ƒrˆj? DˆjE¡dSdS)+Nú/rBz-SFTrízThe specified `eos_token` ('zC') is not found in the vocabulary of the given `processing_class` (zX). Ensure that the `eos_token` exists in the vocabulary before using it as an EOS token.zYou passed model_init_kwargs to the `SFTConfig`, but your model is already instantiated. The `model_init_kwargs` will be ignored.)z.jinjaz.j2zutf-8)ÚencodingFÚ embed_tokensÚlm_heada-Cloning chat template added new tokens to the tokenizer, but 'lm_head' is not in PEFT's `modules_to_save`. As a result, the model may not learn to generate outputs with these new tokens, leading to degraded generation quality. To fix this, add `modules_to_save=['lm_head']` to your PEFT configuration.rƒ)Úflash_attention_2z"kernels-community/vllm-flash-attn3zHPassing a custom data collator is not supported when using padding-free.Úwrappedz¯You are passing `padding_free=True` with the 'wrapped' packing strategy, which is not recommended. Please refer to the documentation to understand why this is not recommended.açPadding-free training is enabled, but the attention implementation is not set to 'flash_attention_2'. Padding-free training flattens batches into a single sequence, and 'flash_attention_2' is the only known attention mechanism that reliably supports this. Using other implementations may lead to unexpected behavior. To ensure compatibility, set `attn_implementation='flash_attention_2'` in the model configuration, or verify that your attention mechanism can handle flattened sequences.rGzÎYou are using a per_device_train_batch_size of 1 with padding-free training. Using a batch size of 1 anihilate the benefits of padding-free training. Please consider increasing the batch size to at least 2.ÚpromptzThe specified `pad_token` ('z[). 
Ensure that the `pad_token` exists in the vocabulary before using it as a padding token.)Ú pad_token_idrrÚreturn_position_idsra$You are using packing, but the attention implementation is not set to 'flash_attention_2' or 'kernels-community/vllm-flash-attn3'. Packing flattens batches into a single sequence, and Flash Attention is the only known attention mechanisms that reliably support this. Using other implementations may lead to cross-contamination between batches. To avoid this, either disable packing by setting `packing=False`, or set `attn_implementation='flash_attention_2'` or `attn_implementation='kernels-community/vllm-flash-attn3'` in the model configuration.z…You set `assistant_only_loss=True`, but the dataset is not conversational. This option is only supported for conversational datasets.Úskip_prepare_datasetaEA formatting function was provided while `completion_only_loss=True`, which is incompatible. Using a formatter converts the dataset to a language modeling type, conflicting with completion-only loss. To resolve this, apply your formatting function before passing the dataset, or disable `completion_only_loss` in `SFTConfig`.Útrainc s&i|]\}}|ˆ |ˆˆˆˆ|¡qSr])Ú_prepare_dataset)Ú.0ÚkeyÚdataset©r5rArr9r%r]r^Ú
<dictcomp>ƒsÿÿz/_UnslothSFTTrainer.__init__.<locals>.<dictcomp>Úeval)rLrSr) r4r5r6r7r8r9r:r;r<r=r>r?)r4Úadd_model_tagsr])FÚ
isinstanceÚstrÚconfigÚ
_name_or_pathÚsplitrrÚto_dictríÚpopr
Úfrom_pretrainedrÚconvert_tokens_to_idsÚ
ValueErrorr(r)Ú eos_token_idr r5ÚwarnÚ_create_model_from_pathr
r-ÚpathÚisfileÚendswithÚopenÚreadÚ
chat_templater Útrainable_token_indicesÚextendÚmodules_to_saverRrrrÚ_attn_implementationr“ÚnextÚiterrrrrrr)rr!rMrÚdictÚitemsr$ÚlistÚ_metricsÚ_total_train_tokensr#r$r5rr&r4Ú maybe_activation_offload_contextr!r9ÚhasattrrTÚ
_tag_names)r%r4r5r6r7r8r9r:r;r<r=r>r?r@rAÚmodel_idÚ
model_nameÚ dict_argsrr_Úchat_template_fileÚ added_tokensÚuse_flash_attentionÚdataset_samplerrIÚpreprocess_datasetr'rQr^r$Ì




ÿÿÿ
 
 ÿ ÿ ÿÿÿ   


ÿÿú ÿÿÿÿ
 þ ÿô

ÿz_UnslothSFTTrainer.__init__Ú
model_pathÚreturncCsv|jpi}| d¡}t|tjƒs|dks|durnt|tƒr(tt|ƒ}||d<ntd|dƒtj |fi|¤Ž}|S)z0Creates a model from a path or model identifier.Ú torch_dtyperyNzˆInvalid `torch_dtype` passed to `SFTConfig`. Expected either 'auto' or a string representing a `torch.dtype` (e.g., 'float32'), but got Ú.)
r r!rUr3ÚdtyperVÚgetattrr^r r\)r%r~r5r r€r4r]r]r^ra¯s




ÿÿ z*_UnslothSFTTrainer._create_model_from_pathcCstƒstdƒt|ddƒpt|ddƒ}d}t|ddƒr3| ¡D]\}}|jjdkr2|jjjdv}nq|rE|sE|  ||¡}t
j |dd}n |j rN| 
||¡}|durrt tj¡t d ¡krmt|ddƒrm|rmt||dd
}nt||ƒ}|jrt|ddƒr|st|ƒ|S) z#Prepares a model for PEFT training.z9To use PeftModel, you need to install the `peft` library.Úis_loaded_in_4bitFÚis_loaded_in_8bitÚ
Params4bit>ÚcpuÚmeta)Nz0.12)Úautocast_adapter_dtype)r*Ú ImportErrorrƒÚnamed_parametersr(r)ÚdataÚdeviceÚtypeÚ _prepare_model_for_kbit_trainingr#ÚreplacerñÚ_enable_gradient_checkpointingr4Úparser0Ú __version__r(r1)r%r4r@r5Úis_qloraÚis_sharded_qloraÚparamr]r]r^Ú_prepare_peft_modelÆs4  þ  
ÿþ
z&_UnslothSFTTrainer._prepare_peft_modelcCs"|j|jpidœ}t|fi|¤ŽS)z-Prepares a quantized model for kbit training.)Úuse_gradient_checkpointingrò)r2)r%r4r5Úprepare_model_kwargsr]r]r^rïsþz3_UnslothSFTTrainer._prepare_model_for_kbit_trainingcCsN|jpi}d|vp |d}|r%t|dƒr| ¡|Sdd}| ¡ |¡|S)z-Enables gradient checkpointing for the model.Ú
use_reentrantÚenable_input_require_gradscSs| d¡dS)NT)Úrequires_grad_)ÚmoduleÚinputÚoutputr]r]r^Úmake_inputs_require_gradszS_UnslothSFTTrainer._enable_gradient_checkpointing.<locals>.make_inputs_require_grad)rtÚget_input_embeddingsÚregister_forward_hook)r%r4r5rr]r]r^røs
ÿ
ûz1_UnslothSFTTrainer._enable_gradient_checkpointingrPrÚ dataset_namecsz
t|tƒr |WSWnYi}t|tƒ}t|dƒ} || r"|jt|ddƒˆdkr2t|ddƒˆdkr<t|ddƒˆdkrFt|ddƒˆdkrNtdƒt|ddƒˆdkd ‰d
}
ttt |ƒƒ 
¡ƒ} d g} d | vrr|   d ¡dd
l m
}
m}d| vr| rŽtˆdƒsŽtd|jdƒ|
ˆƒ|_|   d¡d }
n,d | vr¹| r¯tˆdƒs¯td|jdƒ|ˆd d|_d }
nˆ| vrÇd
ˆdurÇtdƒ |
rˆrãˆtt |ƒƒƒ}t|tƒsÞtdƒ|d}n
tt |ƒƒˆd}t|ddƒ}|dkrÿ| rÿtˆddƒ}|durd}d
t|ddƒ}tˆddƒ}|p|}|dur/| |¡s)||vr/d ‰tdƒ fdd} t|tƒsat|ddƒ}|dur\ddlm}t|ƒddƒ}||d<n|jj|d <|rrd!ˆd"|d#<|j|fd$d
i|¤Ž}| rt|dƒs|ˆd d}||_ |rÆztWn td%ƒ|YSˆdkr­td&ƒ|r¸d'|d(|d#<t| | ¡ˆt|d)d*ƒ|ƒ} |S)+NÚ tokenizerrrrfÚmax_seqz1Unsloth: max_seq_length is 0! Please specify one!rrFTÚ input_idsÚattention_maskr:Úlabelsr/z Unsloth: z does not have .pad!)Úmlmz-Unsloth: You must specify a `formatting_func`zIUnsloth: The `formatting_func` should return a list of processed strings.rgrzÚ bos_tokenzHUnsloth: We found double BOS tokens - we shall remove one automatically.cs"ˆˆs|ˆnˆ|ƒˆˆdˆdS)NF)Ú
truncationrÚreturn_token_type_idsÚadd_special_tokensr])Úexample©rÚdo_formatting_funcÚ
do_truncationrArfr]r^Ú _tokenizehsûz6_UnslothSFTTrainer._prepare_dataset.<locals>._tokenizerr…rCrhÚnum_procÚ
batch_sizezUnsloth: Tokenizing ["z"]ÚdescÚbatchedzPUnsloth: Hugging Face's packing is currently buggy - we're disabling it for now!z:When packing is enabled, `max_seq_length` can't be `None`.zUnsloth: Packing z datasetr)rUÚConstantLengthDatasetrrtÚ RuntimeErrorÚsetrlrmÚkeysrRÚ transformersr;rr(r6rpr^Ú
startswithÚprintrrr†rÚ _ex_iterablerµÚmapr.Úselect_columns)r%rPr9r5rrAÚ
map_kwargsÚuse_descÚis_vlmÚ do_tokenizeÚ column_namesÚused_column_namesr;rÚ test_textrgÚ bos_token_1Ú bos_token_2r«rr†r6r]r^rM 



  



ÿ
   
  

  
 
  

üz#_UnslothSFTTrainer._prepare_datasetcCs|jdur gd¢|_dSdS)N)Ú seq_lengthsÚcompletion_maskÚassistant_masks)Ú_signature_columns)r%r]r]r^Ú _set_signature_columns_if_needed™s
ÿz3_UnslothSFTTrainer._set_signature_columns_if_neededFcstƒj||||d}|S)N)Úreturn_outputsÚnum_items_in_batch)r#Ú compute_loss)r%r4ÚinputsrÐÚoutputsr'r]r^§süz_UnslothSFTTrainer.compute_losscs<|jtƒj|i|¤ŽWdƒS1swYdS©N)rsr#Ú
training_step)r%r5r&r'r]r^±s$ÿz _UnslothSFTTrainer.training_stepÚlogsÚ
start_timecsn|jjrdnd}dd|j| ¡Dƒ}|dkr!dd| ¡Dƒ}i|¥|¥}tƒ ||¡|j| ¡dS)NrLrScSs"i|]
\}}|t|ƒt|ƒqSr])ÚsumÚlen©rNrOÚvalr]r]r^rR·s"z*_UnslothSFTTrainer.log.<locals>.<dictcomp>cSsi|]
\}}d||qS)Úeval_r]r]r]r^rR¼s)r4Útrainingrqror#ÚlogÚclear)r%r×ÚmodeÚmetricsr'r]r^µs z_UnslothSFTTrainer.logcsL|jjdurt|jjƒj}n |jj d¡d}|j|dtƒ ||¡dS)NrBrB)rw) r5rÚnamerYÚcreate_model_cardr#Ú_save_checkpoint)r%r4Útrialrwr'r]r^Ãs
 z#_UnslothSFTTrainer._save_checkpointrwÚtagsc
C| ¡sdSt|jjdƒrtj |jjj¡s|jjj}nd}|dur&tƒ}n
t |t
ƒr/|h}nt|ƒ}t|jjdƒr?|  d¡|  |j
¡t|||j|t|ƒtƒrZtjdurZtjjndtƒdd}| tj |jjd¡¡dS)
Creates a draft of a model card using the information available to the `Trainer`.
Args:
model_name (`str` or `None`, *optional*, defaults to `None`):
Name of the model.
dataset_name (`str` or `None`, *optional*, defaults to `None`):
Name of the dataset used for training.
tags (`str`, `list[str]` or `None`, *optional*, defaults to `None`):
Tags to be associated with the model card.
NrXÚunsloth_versionÚunslothÚSFT)Ú
base_modelrwÚ wandb_urlÚ comet_urlÚ trainer_namez README.md)Úis_world_process_zerortr4rWr-rbÚisdirrXrUrVÚaddÚupdaterur%rpr+ÚwandbÚrunÚurlr'ÚsaveÚjoinr5)r%rwÚ
model_cardr]r]r^Ës0  

 ø z$_UnslothSFTTrainer.create_model_card)
NNNNNNNNr3NNNN)FNrÕ)NNN)1r)r*r+r,rurrVr,ÚModulerrrrrrrrnrrrrr rrprÚtupler3Ú OptimizerÚ lr_schedulerÚLambdaLRrŽrrr$rar˜rrÚboolrMÚfloatrßr/r]r]r'r^r0Çïþýüûúÿù
ö õ
ô
óòñðïd) 
þûúù

ø
( 
üþýür0cs:eZdZdZ            dfdd„ ZZS)ÚUnslothSFTTrainera¢
Trainer for Supervised Fine-Tuning (SFT) method.
This class is a wrapper around the [`transformers.Trainer`] class and inherits all of its attributes and methods.
Example:
```python
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")
trainer = SFTTrainer(model="Qwen/Qwen2-0.5B-Instruct", train_dataset=dataset)
trainer.train()
```
Args:
model (`Union[str, PreTrainedModel]`):
Model to be trained. Can be either:
- A string, being the *model id* of a pretrained model hosted inside a model repo on huggingface.co, or a
path to a *directory* containing model weights saved using
[`~transformers.PreTrainedModel.save_pretrained`], e.g., `'./my_model_directory/'`. The model is loaded
using [`~transformers.AutoModelForCausalLM.from_pretrained`] with the keyword arguments in
`args.model_init_kwargs`.
- A [`~transformers.PreTrainedModel`] object. Only causal language models are supported.
args ([`SFTConfig`], *optional*, defaults to `None`):
Configuration for this trainer. If `None`, a default configuration is used.
data_collator (`DataCollator`, *optional*):
Function to use to form a batch from a list of elements of the processed `train_dataset` or `eval_dataset`.
Will default to a custom [`DataCollatorForLanguageModeling`].
train_dataset ([`~datasets.Dataset`] or [`~datasets.IterableDataset`]):
Dataset to use for training. SFT supports both [language modeling](#language-modeling) type and