unsloth_compiled_cache/__pycache__/UnslothSFTTrainer.cpython-310.pyc

o
ö×°h:ýã@s*dZddlmZddlZddlmZddlmZddlmZm	Z	m
Z
mZmZm
Z
mZmZddlmZmZmZmZmZmZmZmZmZmZmZm
Z
mZmZmZmZmZmZm Z m!Z!m"Z"m#Z#m$Z$mZm%Z%m&Z&m'Z'm(Z(m)Z)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0mZm1Z1m2Z2m3Z3m4Z4m5Z5m6Z6mZm7Z7m8Z8mZmZmZmZmZm
Z
mZm1Z1m2Z2m3Z3m
Z
mZmZm"Z"m/Z/m1Z1m4Z4mZm1Z1ddl1Z1ddlTddl(m'Z'm9Z9dd	l:m;Z;ddlZddl<Z=dd
l&m>Z>ddlmZddl?m@Z@mZAdd
dd
d
dœZBejCddeBd�dd„ƒZDe'Gdd„de ƒƒZE	Gdd„de"ƒZFGdd„deFƒZGdS)z9
2025.8.9
2025.8.10
4.55.4
0.21.0
__UNSLOTH_VERSIONING__
é)ÚTensorN)Ú
functional)ÚAnyÚListÚOptionalÚTupleÚUnionÚDictÚSetÚCallable)ArÚAutoModelForCausalLMÚ
AutoTokenizerÚBaseImageProcessorrÚDataCollatorÚDataCollatorForLanguageModelingÚDatasetÚEvalPredictionÚFeatureExtractionMixinÚIterableDatasetrÚPathÚ
PeftConfigÚ	PeftModelÚPreTrainedModelÚPreTrainedTokenizerBaseÚProcessorMixinÚ	SFTConfigÚ
SFTTrainerÚTrainerÚTrainerCallbackÚTrainingArgumentsrÚclone_chat_templateÚ
contextlibÚ	dataclassÚdataclassesÚdefaultdictÚgenerate_model_cardÚget_act_offloading_ctx_managerÚget_comet_experiment_urlÚget_peft_modelÚis_conversationalÚis_peft_availableÚis_wandb_availableÚnnÚosÚpack_datasetÚpadÚpeftÚpeft_module_casting_to_bf16Úprepare_model_for_kbit_trainingÚtorchÚversionÚwarningsrrrrrrrr-r.r/rrrrr*r-r0r3r-)Ú*)r"Úfield)ÚVersion)Únullcontext©ÚDataCollatorForSeq2SeqrTF)Úepilogue_fusionÚmax_autotuneÚ
shape_paddingz
trace.enabledztriton.cudagraphs)ÚdynamicÚ	fullgraphÚoptionsc
Cs¾tj| d|jd¡ddd�}tj| d¡ddd�}g}t||ƒD](\}}| tj¡}tj|d| d¡d� 	d¡}tj
|dd�}||}	| |	¡q!	t |¡}| |jd|jdf¡}|S)Néÿÿÿÿér)ÚchunksÚdim)rEÚindex)rEé)
r3ÚchunkÚreshapeÚshapeÚzipÚtoÚfloat32ÚgatherÚ	unsqueezeÚsqueezeÚ	logsumexpÚappendÚconcat)
ÚlogitsrFÚchunked_logitsÚ
chunked_indexÚall_per_token_logpsÚchunk_logitsÚchunk_indexÚselected_logitsÚlogsumexp_valuesÚper_token_logps©r]úQ/workspace/DS-LLM-TEMPLATE-FINETUNING/unsloth_compiled_cache/UnslothSFTTrainer.pyÚchunked_selective_log_softmax"s
r_csžeZdZUdZedddid�Zeeed<edddid�Z	ee
ed	<eddd
id�Zee
ed<						
																														 									!	!					"	#								$														$						%	&				'												(									#				$				)	*																+					,		-									d0‡fd.d/„	Z‡Z
S)1ÚUnslothSFTConfiga5
    
    Configuration class for the [`SFTTrainer`].

    This class includes only the parameters that are specific to SFT training. For a full list of training arguments,
    please refer to the [`~transformers.TrainingArguments`] documentation. Note that default values in this class may
    differ from those in [`~transformers.TrainingArguments`].

    Using [`~transformers.HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        > Parameters that control the model

        model_init_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
            Keyword arguments for [`~transformers.AutoModelForCausalLM.from_pretrained`], used when the `model`
            argument of the [`SFTTrainer`] is provided as a string.
        chat_template_path (`str` or `None`, *optional*, defaults to `None`):
            If specified, sets the model's chat template. This can either be the path to a tokenizer (local directory
            or Hugging Face Hub model) or a direct path to a Jinja template file. When using a Jinja file, you must
            ensure that any special tokens referenced in the template are added to the tokenizer and that the model's
            embedding layer is resized accordingly.

        > Parameters that control the data preprocessing

        dataset_text_field (`str`, *optional*, defaults to `"text"`):
            Name of the column that contains text data in the dataset.
        dataset_kwargs (`dict[str, Any]` or `None`, *optional*, defaults to `None`):
            Dictionary of optional keyword arguments for the dataset preparation. The only supported key is
            `skip_prepare_dataset`.
        dataset_num_proc (`int` or `None`, *optional*, defaults to `None`):
            Number of processes to use for processing the dataset.
        eos_token (`str` or `None`, *optional*, defaults to `None`):
            Token used to indicate the end of a turn or sequence. If `None`, it defaults to
            `processing_class.eos_token`.
        pad_token (`str` or `None`, *optional*, defaults to `None`):
            Token used for padding. If `None`, it defaults to `processing_class.pad_token`, or if that is also `None`,
            it falls back to `processing_class.eos_token`.
        max_length (`int` or `None`, *optional*, defaults to `1024`):
            Maximum length of the tokenized sequence. Sequences longer than `max_length` are truncated from the right.
            If `None`, no truncation is applied. When packing is enabled, this value sets the sequence length.
        packing (`bool`, *optional*, defaults to `False`):
            Whether to group multiple sequences into fixed-length blocks to improve computational efficiency and reduce
            padding. Uses `max_length` to define sequence length.
        packing_strategy (`str`, *optional*, defaults to `"bfd"`):
            Strategy for packing sequences. Can be either `"bfd"` (best-fit decreasing, default) or `"wrapped"`.
        padding_free (`bool`, *optional*, defaults to `False`):
            Whether to perform forward passes without padding by flattening all sequences in the batch into a single
            continuous sequence. This reduces memory usage by eliminating padding overhead. Currently, this is only
            supported with FlashAttention 2 or 3, which can efficiently handle the flattened batch structure. When
            packing is enabled with strategy `"bfd"`, padding-free is enabled, regardless of the value of this
            parameter.
        pad_to_multiple_of (`int` or `None`, *optional*, defaults to `None`):
            If set, the sequences will be padded to a multiple of this value.
        eval_packing (`bool` or `None`, *optional*, defaults to `None`):
            Whether to pack the eval dataset. If `None`, uses the same value as `packing`.

        > Parameters that control the training

        completion_only_loss (`bool` or `None`, *optional*, defaults to `None`):
            Whether to compute loss only on the completion part of the sequence. If set to `True`, loss is computed
            only on the completion, which is supported only for [prompt-completion](#prompt-completion) datasets. If
            `False`, loss is computed on the entire sequence. If `None` (default), the behavior depends on the dataset:
            loss is computed on the completion for [prompt-completion](#prompt-completion) datasets, and on the full
            sequence for [language modeling](#language-modeling) datasets.
        assistant_only_loss (`bool`, *optional*, defaults to `False`):
            Whether to compute loss only on the assistant part of the sequence. If set to `True`, loss is computed
            only on the assistant responses, which is supported only for [conversational](#conversational) datasets. If `False`,
            loss is computed on the entire sequence.
        activation_offloading (`bool`, *optional*, defaults to `False`):
            Whether to offload the activations to the CPU.
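
The `"bfd"` packing strategy described above can be illustrated with a minimal sketch of best-fit decreasing bin packing. This is not the library's implementation: the helper name `pack_bfd` is hypothetical, it works on plain sequence lengths rather than datasets, and it assumes sequences longer than `max_length` have already been truncated.

```python
def pack_bfd(lengths, max_length):
    """Group sequence indices into bins of capacity `max_length`.

    Best-fit decreasing: visit sequences from longest to shortest and put
    each one into the open bin with the least free space that still fits it.
    """
    bins = []  # each bin tracks its free capacity and the indices it holds
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        # Assumption: overlong sequences were truncated to max_length earlier.
        n = min(lengths[i], max_length)
        best = None
        for b in bins:
            if b["free"] >= n and (best is None or b["free"] < best["free"]):
                best = b
        if best is None:
            # No open bin can hold this sequence, so start a new one.
            best = {"free": max_length, "items": []}
            bins.append(best)
        best["items"].append(i)
        best["free"] -= n
    return [b["items"] for b in bins]


lengths = [5, 3, 4, 2, 6]
packed = pack_bfd(lengths, max_length=8)
```

Each inner list holds the indices of the sequences that share one block of `max_length` tokens. Placing the longest sequences first tends to leave small gaps that shorter sequences can fill, which reduces the padding each block needs.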
    
    NÚhelpzvLLM SamplingParams)ÚdefaultÚmetadataÚvllm_sampling_paramsrBz8Chunk size to reduce memory usage. -1 is most efficient.Úunsloth_num_chunksz'Maximum sequence length to truncate to.Úmax_seq_lengthFÚnorCéréúç-Cëâ6
?ç{®Gáz„?çÍÌÌÌÌÌì?ç+‡ÙÎ÷ï?ç:Œ0âŽyE>çð?ç@Úlinearçš™™™™™¹?ÚpassiveÚwarningTÚstepsrGéôéO
ÚO1ÚautoÚçÚ
adamw_8bitÚlengthÚ
every_saveÚlastéÚtextéÚbfdc”˜s6|dkrtd|›d�ƒ‚|dkrtd|›d�ƒ‚|dur(|#dkr(|$dkr(d}d	}#|…dur:d
dlm}•t|•ƒdd
ƒ}…tj dd¡dkrWd
dlm	}–|–rW|ŒdurWd
dlm
}—|—}Œtƒjd£id|“d|“d|“d|“d|“d|“d|“d|“d|	“d|
“d|“d|“d|
“d |“d!|“d"|“d#|“d$|“d%|“d&|“d'|“d(|“d)|“d*|“d+|“d,|“d-|“d.|“d/|“d0|“d1|“d2| “d3|!“d4|"“d5|#“d6|$“d7|%“d8|&“d9|'“d:|(“d;|)“d<|*“d=|+“d>|,“d?|-“d@|.“dA|/“dB|0“dC|1“dD|2“dE|3“dF|4“dG|5“dH|6“dI|7“dJ|8“dK|9“dL|:“dM|;“dN|<“dO|=“dP|>“dQ|?“dR|@“dS|A“dT|B“dU|C“dV|D“dW|E“dX|F“dY|G“dZ|H“d[|I“d\|J“d]|K“d^|L“d_|M“d`|N“da|O“db|P“dc|Q“dd|R“de|S“df|T“dg|U“dh|V“di|W“dj|X“dk|Y“dl|Z“dm|[“dn|\“do|]“dp|^“dq|_“dr|`“ds|a“dt|b“du|c“dv|d“dw|e“dx|f“dy|g“dz|h“d{|i“d||j“d}|k“d~|l“d|m“d€|n“d�|o“d‚|p“dƒ|q“d„|r“d…|s“d†|t“d‡|u“dˆ|v“d‰|w“dŠ|x“d‹|y“dŒ|z“d�|{“dŽ||“d�|}“d�|~“d‘|“d’|€“d“|�“d”|‚“d•|ƒ“d–|„“d—|…“d˜|†“d™|‡“dš|ˆ“d›|‰“dœ|Š“d�|‹“dž|Œ“dŸ|�“d |Ž“d¡|�“d¢|�“|”¤Ž|‘|_
|’|_|“|_dS)¤NgH¯¼šò×z>z Unsloth: Your learning rate of `zi` is too small and less than 1e-7! Consider increasing it, otherwise gradient updates will be close to 0!rGza` is way too larger > 1! Consider decreasing it to 1e-1, otherwise gradient updates will explode!rurvÚunsloth_training_checkpointsrgr©Ú	cpu_countrCrhÚUNSLOTH_ENABLE_FLEX_ATTENTIONÚ0Ú1)ÚHAS_FLEX_ATTENTION)ÚFLEX_ATTENTION_BLOCK_SIZEÚ
output_dirÚoverwrite_output_dirÚdo_trainÚdo_evalÚ
do_predictÚ
eval_strategyÚprediction_loss_onlyÚper_device_train_batch_sizeÚper_device_eval_batch_sizeÚper_gpu_train_batch_sizeÚper_gpu_eval_batch_sizeÚgradient_accumulation_stepsÚeval_accumulation_stepsÚ
eval_delayÚtorch_empty_cache_stepsÚ
learning_rateÚweight_decayÚ
adam_beta1Ú
adam_beta2Úadam_epsilonÚ
max_grad_normÚnum_train_epochsÚ	max_stepsÚlr_scheduler_typeÚwarmup_ratioÚwarmup_stepsÚ	log_levelÚlog_level_replicaÚlog_on_each_nodeÚlogging_dirÚlogging_strategyÚlogging_first_stepÚ
logging_stepsÚlogging_nan_inf_filterÚ
save_strategyÚ
save_stepsÚsave_total_limitÚsave_safetensorsÚsave_on_each_nodeÚsave_only_modelÚ'restore_callback_states_from_checkpointÚno_cudaÚuse_cpuÚuse_mps_deviceÚseedÚ	data_seedÚ
jit_mode_evalÚuse_ipexÚbf16Úfp16Úfp16_opt_levelÚhalf_precision_backendÚbf16_full_evalÚfp16_full_evalÚtf32Ú
local_rankÚddp_backendÚ
tpu_num_coresÚtpu_metrics_debugÚdebugÚdataloader_drop_lastÚ
eval_stepsÚdataloader_num_workersÚdataloader_prefetch_factorÚ
past_indexÚrun_nameÚdisable_tqdmÚremove_unused_columnsÚlabel_namesÚload_best_model_at_endÚmetric_for_best_modelÚgreater_is_betterÚignore_data_skipÚfsdpÚfsdp_min_num_paramsÚfsdp_configÚ"fsdp_transformer_layer_cls_to_wrapÚaccelerator_configÚ	deepspeedÚlabel_smoothing_factorÚoptimÚ
optim_argsÚ	adafactorÚgroup_by_lengthÚlength_column_nameÚ	report_toÚddp_find_unused_parametersÚddp_bucket_cap_mbÚddp_broadcast_buffersÚdataloader_pin_memoryÚdataloader_persistent_workersÚskip_memory_metricsÚuse_legacy_prediction_loopÚpush_to_hubÚresume_from_checkpointÚhub_model_idÚhub_strategyÚ	hub_tokenÚhub_private_repoÚhub_always_pushÚhub_revisionÚgradient_checkpointingÚgradient_checkpointing_kwargsÚinclude_inputs_for_metricsÚeval_do_concat_batchesÚfp16_backendÚpush_to_hub_model_idÚpush_to_hub_organizationÚpush_to_hub_tokenÚ
mp_parametersÚauto_find_batch_sizeÚfull_determinismÚtorchdynamoÚ	ray_scopeÚddp_timeoutÚ
torch_compileÚtorch_compile_backendÚtorch_compile_modeÚinclude_tokens_per_secondÚinclude_num_input_tokens_seenÚneftune_noise_alphaÚoptim_target_modulesÚbatch_eval_metricsÚ
eval_on_startÚuse_liger_kernelÚliger_kernel_configÚeval_use_gather_objectÚaverage_tokens_across_devicesÚmodel_init_kwargsÚchat_template_pathÚdataset_text_fieldÚdataset_kwargsÚdataset_num_procÚ	eos_tokenÚ	pad_tokenÚ
max_lengthÚpackingÚpacking_strategyÚpadding_freeÚpad_to_multiple_ofÚeval_packingÚcompletion_only_lossÚassistant_only_lossÚactivation_offloadingr])ÚFloatingPointErrorÚ
OverflowErrorÚmultiprocessingr†Úmaxr-ÚenvironÚgetÚunsloth_zoo.flex_attentionrŠr‹ÚsuperÚ__init__rdrerf)˜ÚselfrŒr�rŽr�r�r‘r’r“r”r•r–r—r˜r™ršr›rœr�ržrŸr r¡r¢r£r¤r¥r¦r§r¨r©rªr«r¬rr®r¯r°r±r²r³r´rµr¶r·r¸r¹rºr»r¼r½r¾r¿rÀrÁrÂrÃrÄrÅrÆrÇrÈrÉrÊrËrÌrÍrÎrÏrÐrÑrÒrÓrÔrÕrÖr×rØrÙrÚrÛrÜrÝrÞrßràrárârãrärårærçrèrérêrërìrírîrïrðrñròrórôrõrör÷rørùrúrûrürýrþrÿrrrrrrrrrr	r
rrr
rrrrrrrrrrrrrrrdrerfÚkwargsr†rŠr‹©Ú	__class__r]r^r$‹sªÿþýüûúùø	÷
öõô
óòñðïîíìëêéèçæåäãâá à!ß"Þ#Ý$Ü%Û&Ú'Ù(Ø)×*Ö+Õ,Ô-Ó.Ò/Ñ0Ð1Ï2Î3Í4Ì5Ë6Ê7É8È9Ç:Æ;Å<Ä=Ã>Â?Á@ÀA¿B¾C½D¼E»FºG¹H¸I·J¶KµL´M³N²O±P°Q¯R®ST¬U«VªW©X¨Y§Z¦[¥\¤]£^¢_¡` aŸbžc�dœe›fšg™h˜i—j–k•l”m“n’o‘p�q�rŽs�tŒu‹vŠw‰xˆy‡z†{…|„}ƒ~‚��ÿ�þ�ý�ü�û�ú�ù�ø	�÷
�ö�õ�ô
�ó�ò�ñ�ð�ï
zUnslothSFTConfig.__init__)“NNFFFrgFrCrCNNrhrhrrirjrkrlrmrnrorprBrqrrrrsrtTNruFrGFrurvNTFFFFFFrwrwFFFFrxryFFNrBNNFrzFNrNrBNNTNFNNFrzrNNNNr{r|NFFr}NNNNTFTFFNNr~NNFNFNFTryNNNrzTFNrr€FNNFFNNFFFNFTNNr�NNNNr‚FrƒFNNNFFNrBN)Ú__name__Ú
__module__Ú__qualname__Ú__doc__r7rdrrÚ__annotations__reÚintrfr$Ú
__classcell__r]r]r'r^r`3sF
Jþþþ�ër`c sdeZdZdZddgZ													d7deeeje	fde
eeefde
e
d	e
eeefd
e
eeeeeffde
eeeeefde
ed
e
eegefde
eedee
ejje
ejjjfde
eeejjeee ffde
eej!ej!gej!fde
dde
eegeff‡fdd„
Z"dedede	fdd„Z#de	de dede	fdd„Z$de	dede	fdd„Z%de	dede	fdd „Z&d!eeefd"e'de
eegefd#edeeeff
d$d%„Z(d&d'„Z)d8‡fd)d*„	Z*‡fd+d,„Z+d9d-eee,fd.e
e,ddf‡fd/d0„
Z-‡fd1d2„Z.			d:d3e
ed#e
ed4eeeedffd5d6„Z/‡Z0S);Ú_UnslothSFTTrainerrzÚtrlÚsftN©NNÚmodelÚargsÚ
data_collatorÚ
train_datasetÚeval_datasetÚprocessing_classÚcompute_loss_funcÚcompute_metricsÚ	callbacksÚ
optimizersÚoptimizer_cls_and_kwargsÚpreprocess_logits_for_metricsÚpeft_configrÚformatting_funccst|tƒr|n|jj}ˆdur| d¡d}t|›d�ƒ‰ntˆtƒr=tˆtƒs=ˆ ¡}ˆj|d<| 	d¡td*i|¤Ž‰ˆdurFt
 |¡‰ˆjdurgˆj}ˆ 
|¡}|durdtd|›dˆjj›d�ƒ‚|ˆ_ˆjdurvt|tƒsvt d	¡t|tƒr�ˆ |ˆ¡}ˆjdur½tj ˆj¡r²ˆj d
¡r²tˆjdd��
}| ¡ˆ_Wdƒn1sªwYg}n
t|ˆˆjƒ\}‰}ng}		ˆj#pËˆj$oËˆj%dkˆ_#|jj&dv}ˆj#rÿ|durÞtdƒ‚ˆj$rëˆj%dkrët d¡|sòt d¡ˆj'dkrÿˆj$sÿt d¡t(t)|ƒƒ}ˆj*du�rd|vˆ_*nˆj*ˆ_*|du�rHˆj+�p$ˆj+�p$ˆj}ˆ 
|¡}|du�r<td|›dˆjj›d�ƒ‚t,|ˆj*ˆj#|ˆj-d�}ˆj$�rZˆj%dk�rZ|�sZt d¡ˆj.�rgt/|ƒ�sgtdƒ‚ˆj0du�ptˆj0 1dd
¡}|�r¿ˆj*�rƒˆ�rƒtd ƒ‚ˆ 2|ˆˆˆj$ˆd!¡}|du�r¿ˆj3du�rœˆj$nˆj3‰t|t4ƒ�rµ‡‡‡‡‡fd"d#„| 5¡Dƒ}n
ˆ 2|ˆˆˆˆd$¡}t6t7ƒt6t7ƒd%œˆ_8d&ˆ_9t:ƒj;|ˆ|||ˆ|||	|
||d'�ˆj<j=�rët>ˆj?d(�ˆ_@ntA B¡ˆ_@tCˆj?d)ƒ�rˆj? DˆjE¡dSdS)+Nú/rBz-SFTrírøzThe specified `eos_token` ('zC') is not found in the vocabulary of the given `processing_class` (zX). Ensure that the `eos_token` exists in the vocabulary before using it as an EOS token.z�You passed model_init_kwargs to the `SFTConfig`, but your model is already instantiated. The `model_init_kwargs` will be ignored.)z.jinjaz.j2zutf-8)ÚencodingFÚembed_tokensÚlm_heada-Cloning chat template added new tokens to the tokenizer, but 'lm_head' is not in PEFT's `modules_to_save`. As a result, the model may not learn to generate outputs with these new tokens, leading to degraded generation quality. To fix this, add `modules_to_save=['lm_head']` to your PEFT configuration.rƒ)Úflash_attention_2z"kernels-community/vllm-flash-attn3zHPassing a custom data collator is not supported when using padding-free.Úwrappedz¯You are passing `padding_free=True` with the 'wrapped' packing strategy, which is not recommended. Please refer to the documentation to understand why this is not recommended.açPadding-free training is enabled, but the attention implementation is not set to 'flash_attention_2'. Padding-free training flattens batches into a single sequence, and 'flash_attention_2' is the only known attention mechanism that reliably supports this. Using other implementations may lead to unexpected behavior. To ensure compatibility, set `attn_implementation='flash_attention_2'` in the model configuration, or verify that your attention mechanism can handle flattened sequences.rGzÎYou are using a per_device_train_batch_size of 1 with padding-free training. Using a batch size of 1 anihilate the benefits of padding-free training. Please consider increasing the batch size to at least 2.ÚpromptzThe specified `pad_token` ('z[). 
Ensure that the `pad_token` exists in the vocabulary before using it as a padding token.)Úpad_token_idrrÚreturn_position_idsra$You are using packing, but the attention implementation is not set to 'flash_attention_2' or 'kernels-community/vllm-flash-attn3'. Packing flattens batches into a single sequence, and Flash Attention is the only known attention mechanisms that reliably support this. Using other implementations may lead to cross-contamination between batches. To avoid this, either disable packing by setting `packing=False`, or set `attn_implementation='flash_attention_2'` or `attn_implementation='kernels-community/vllm-flash-attn3'` in the model configuration.z…You set `assistant_only_loss=True`, but the dataset is not conversational. This option is only supported for conversational datasets.Úskip_prepare_datasetaEA formatting function was provided while `completion_only_loss=True`, which is incompatible. Using a formatter converts the dataset to a language modeling type, conflicting with completion-only loss. To resolve this, apply your formatting function before passing the dataset, or disable `completion_only_loss` in `SFTConfig`.Útraincs&i|]\}}|ˆ |ˆˆˆˆ|¡“qSr])Ú_prepare_dataset)Ú.0ÚkeyÚdataset©r5rArr9r%r]r^Ú
<dictcomp>ƒsÿÿz/_UnslothSFTTrainer.__init__.<locals>.<dictcomp>Úeval)rLrSr)r4r5r6r7r8r9r:r;r<r=r>r?)r4Úadd_model_tagsr])FÚ
isinstanceÚstrÚconfigÚ
_name_or_pathÚsplitrrÚto_dictríÚpopr
Úfrom_pretrainedrÚconvert_tokens_to_idsÚ
ValueErrorr(r)Úeos_token_idrr5ÚwarnÚ_create_model_from_pathr
r-ÚpathÚisfileÚendswithÚopenÚreadÚ
chat_templater Útrainable_token_indicesÚextendÚmodules_to_saverRrrrÚ_attn_implementationr“ÚnextÚiterrrrrrr)rr!rMrÚdictÚitemsr$ÚlistÚ_metricsÚ_total_train_tokensr#r$r5rr&r4Ú maybe_activation_offload_contextr!r9ÚhasattrrTÚ
_tag_names)r%r4r5r6r7r8r9r:r;r<r=r>r?r@rAÚmodel_idÚ
model_nameÚ	dict_argsrr_Úchat_template_fileÚadded_tokensÚuse_flash_attentionÚdataset_samplerrIÚpreprocess_datasetr'rQr^r$Ìsø


ÿÿÿ

ÿÿÿÿÿ


ÿÿú	ÿÿÿÿ
þÿô

ÿz_UnslothSFTTrainer.__init__Ú
model_pathÚreturncCsv|jpi}| d¡}t|tjƒs|dks|durnt|tƒr(tt|ƒ}||d<ntd|›d�ƒ‚tj	|fi|¤Ž}|S)z0Creates a model from a path or model identifier.Útorch_dtyperyNzˆInvalid `torch_dtype` passed to `SFTConfig`. Expected either 'auto' or a string representing a `torch.dtype` (e.g., 'float32'), but got Ú.)
rr!rUr3ÚdtyperVÚgetattrr^rr\)r%r~r5rr€r4r]r]r^ra¯s


ÿÿ	z*_UnslothSFTTrainer._create_model_from_pathcCstƒstdƒ‚t|ddƒpt|ddƒ}d}t|ddƒr3| ¡D]\}}|jjdkr2|jjjdv}nq|rE|sE| 	||¡}t
j|dd�}n	|jrN| 
||¡}|durrt tj¡t d	¡krmt|ddƒrm|rmt||dd
�}nt||ƒ}|jr�t|ddƒr�|s�t|ƒ|S)z#Prepares a model for PEFT training.z9To use PeftModel, you need to install the `peft` library.Úis_loaded_in_4bitFÚis_loaded_in_8bitÚ
Params4bit>ÚcpuÚmeta)rñNz0.12)Úautocast_adapter_dtype)r*ÚImportErrorrƒÚnamed_parametersr(r)ÚdataÚdeviceÚtypeÚ _prepare_model_for_kbit_trainingr#ÚreplacerñÚ_enable_gradient_checkpointingr4Úparser0Ú__version__r(r¼r1)r%r4r@r5Úis_qloraÚis_sharded_qloraÚ_Úparamr]r]r^Ú_prepare_peft_modelÆs4þ
ÿþ
z&_UnslothSFTTrainer._prepare_peft_modelcCs"|j|jpidœ}t|fi|¤ŽS)z-Prepares a quantized model for kbit training.)Úuse_gradient_checkpointingrò)rñròr2)r%r4r5Úprepare_model_kwargsr]r]r^r�ïsþz3_UnslothSFTTrainer._prepare_model_for_kbit_trainingcCsN|jpi}d|vp|d}|r%t|dƒr| ¡|Sdd„}| ¡ |¡|S)z-Enables gradient checkpointing for the model.Ú
use_reentrantÚenable_input_require_gradscSs| d¡dS)NT)Úrequires_grad_)ÚmoduleÚinputÚoutputr]r]r^Úmake_inputs_require_gradszS_UnslothSFTTrainer._enable_gradient_checkpointing.<locals>.make_inputs_require_grad)ròrtrœÚget_input_embeddingsÚregister_forward_hook)r%r4r5ròr›r¡r]r]r^r‘øs
ÿ
ûz1_UnslothSFTTrainer._enable_gradient_checkpointingrPrÚdataset_namecs’z
t|tƒr	|WSWnYi}t|tƒ}t|dƒ}	|‰|	r"|j‰t|ddƒ‰ˆdkr2t|ddƒ‰ˆdkr<t|ddƒ‰ˆdkrFt|ddƒ‰ˆdkrNtdƒ‚t|ddƒ‰ˆdk‰d	‰d
}
ttt	|ƒƒ 
¡ƒ}dg}d|vrr| d¡dd
lm
}
m}d|vr›|	rŽtˆdƒsŽtd|j›d�ƒ‚|
ˆƒ|_| d¡d	}
n,d|vr¹|	r¯tˆdƒs¯td|j›d�ƒ‚|ˆd	d�|_d	}
nˆ|vrÇd
‰ˆdurÇtdƒ‚	|
�r�ˆrãˆtt	|ƒƒƒ}t|tƒsÞtdƒ‚|d}n
tt	|ƒƒˆd}t|ddƒ}|dkrÿ|	rÿtˆddƒ}|du�rd}d
‰t|ddƒ}tˆddƒ}|�p|}|du�r/| |¡�s)||v�r/d	‰tdƒ	‡‡‡‡‡‡‡fdd„}	t|tƒ�sat|ddƒ}|du�r\ddlm}t|ƒddƒ}||d<n|jj|d <|�rrd!ˆ›d"�|d#<|j|fd$d
i|¤Ž}|	�r�t|dƒ�s�|ˆd	d�}||_		|�rÆztWntd%ƒ|YSˆdk�rtd&ƒ‚|�r¸d'|›d(�|d#<t| |¡ˆt|d)d*ƒ|ƒ}	|S)+NÚ	tokenizerrrrfÚmax_seqz1Unsloth: max_seq_length is 0! Please specify one!rr�FTÚ	input_idsÚattention_maskr:Úlabelsr/z	Unsloth: z does not have .pad!)Úmlmz-Unsloth: You must specify a `formatting_func`zIUnsloth: The `formatting_func` should return a list of processed strings.rgrzÚ	bos_tokenzHUnsloth: We found double BOS tokens - we shall remove one automatically.cs"ˆˆs|ˆnˆ|ƒˆˆdˆd�S)NF)Ú
truncationrÚreturn_token_type_idsÚadd_special_tokensr])Úexample©r®rÚdo_formatting_funcÚ
do_truncationrArfr¥r]r^Ú	_tokenizehsûz6_UnslothSFTTrainer._prepare_dataset.<locals>._tokenizerr…rCrhÚnum_procÚ
batch_sizezUnsloth: Tokenizing ["z"]ÚdescÚbatchedzPUnsloth: Hugging Face's packing is currently buggy - we're disabling it for now!z:When packing is enabled, `max_seq_length` can't be `None`.zUnsloth: Packing z datasetrrƒ)rUÚConstantLengthDatasetrrtr¥rƒÚRuntimeErrorÚsetrlrmÚkeysrRÚtransformersr;rr(r6rpr^Ú
startswithÚprintrrr†rÚ_ex_iterablerµÚmapr.Úselect_columns)r%rPr9r5rrAr¤Ú
map_kwargsÚuse_descÚis_vlmÚdo_tokenizeÚcolumn_namesÚused_column_namesr;rÚ	test_textrgÚbos_token_1Úbos_token_2r«r³rr†r6r]r°r^rMs¾


ÿ


üz#_UnslothSFTTrainer._prepare_datasetcCs|jdurgd¢|_dSdS)N)r§r©Úseq_lengthsÚcompletion_maskÚassistant_masks)Ú_signature_columns)r%r]r]r^Ú _set_signature_columns_if_needed™s
ÿz3_UnslothSFTTrainer._set_signature_columns_if_neededFcstƒj||||d�}|S)N)Úreturn_outputsÚnum_items_in_batch)r#Úcompute_loss)r%r4ÚinputsrÐrÑÚoutputsr'r]r^rÒ§süz_UnslothSFTTrainer.compute_losscs<|j�tƒj|i|¤ŽWdƒS1swYdS©N)rsr#Ú
training_step)r%r5r&r'r]r^rÖ±s$ÿz _UnslothSFTTrainer.training_stepÚlogsÚ
start_timecsn|jjrdnd}dd„|j| ¡Dƒ}|dkr!dd„| ¡Dƒ}i|¥|¥}tƒ ||¡|j| ¡dS)NrLrScSs"i|]
\}}|t|ƒt|ƒ“qSr])ÚsumÚlen©rNrOÚvalr]r]r^rR·s"z*_UnslothSFTTrainer.log.<locals>.<dictcomp>cSsi|]
\}}d|›�|“qS)Úeval_r]rÛr]r]r^rR¼s)r4Útrainingrqror#ÚlogÚclear)r%r×rØÚmodeÚmetricsr'r]r^rßµsz_UnslothSFTTrainer.logcsL|jjdurt|jjƒj}n	|jj d¡d}|j|d�tƒ ||¡dS)NrBrB)rw)	r5rërrŒÚnamerYÚcreate_model_cardr#Ú_save_checkpoint)r%r4Útrialrwr'r]r^råÃs
z#_UnslothSFTTrainer._save_checkpointrwÚtagsc
CsÞ| ¡sdSt|jjdƒrtj |jjj¡s|jjj}nd}|dur&tƒ}n
t	|t
ƒr/|h}nt|ƒ}t|jjdƒr?| d¡| |j
¡t|||j|t|ƒtƒrZtjdurZtjjndtƒdd�}| tj |jjd¡¡dS)aî
        Creates a draft of a model card using the information available to the `Trainer`.

        Args:
            model_name (`str` or `None`, *optional*, defaults to `None`):
                Name of the model.
            dataset_name (`str` or `None`, *optional*, defaults to `None`):
                Name of the dataset used for training.
            tags (`str`, `list[str]` or `None`, *optional*, defaults to `None`):
                Tags to be associated with the model card.
        NrXÚunsloth_versionÚunslothÚSFT)Ú
base_modelrwrër¤rçÚ	wandb_urlÚ	comet_urlÚtrainer_namez	README.md)Úis_world_process_zerortr4rWr-rbÚisdirrXrºrUrVÚaddÚupdaterur%rërpr+ÚwandbÚrunÚurlr'ÚsaveÚjoinr5rŒ)r%rwr¤rçrëÚ
model_cardr]r]r^räËs0 

øz$_UnslothSFTTrainer.create_model_card)
NNNNNNNNr3NNNN)FNrÕ)NNN)1r)r*r+r,rurrVr,ÚModulerrrrrrrrnrrrrrrrprÚtupler3rÜÚ	OptimizerÚlr_schedulerÚLambdaLRrŽrrr$rar˜r�r‘ÚboolrMrÏrÒrÖÚfloatrßrårär/r]r]r'r^r0Çsžïþýüûúÿù
öõ
ô
óòñðïd)	
þûúù

ø
(
üþýür0cs:eZdZdZ												d‡fdd„	Z‡ZS)ÚUnslothSFTTrainera¢
    
    Trainer for Supervised Fine-Tuning (SFT) method.

    This class is a wrapper around the [`transformers.Trainer`] class and inherits all of its attributes and methods.

    Example:

    ```python
    from datasets import load_dataset
    from trl import SFTTrainer

    dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")

    trainer = SFTTrainer(model="Qwen/Qwen2-0.5B-Instruct", train_dataset=dataset)
    trainer.train()
    ```

    Args:
        model (`Union[str, PreTrainedModel]`):
            Model to be trained. Can be either:

            - A string, being the *model id* of a pretrained model hosted inside a model repo on huggingface.co, or a
              path to a *directory* containing model weights saved using
              [`~transformers.PreTrainedModel.save_pretrained`], e.g., `'./my_model_directory/'`. The model is loaded
              using [`~transformers.AutoModelForCausalLM.from_pretrained`] with the keyword arguments in
              `args.model_init_kwargs`.
            - A [`~transformers.PreTrainedModel`] object. Only causal language models are supported.
        args ([`SFTConfig`], *optional*, defaults to `None`):
            Configuration for this trainer. If `None`, a default configuration is used.
        data_collator (`DataCollator`, *optional*):
            Function to use to form a batch from a list of elements of the processed `train_dataset` or `eval_dataset`.
            Will default to a custom [`DataCollatorForLanguageModeling`].
        train_dataset ([`~datasets.Dataset`] or [`~datasets.IterableDataset`]):
            Dataset to use for training. SFT supports both [language modeling](#language-modeling) type and