DS-LLM-TEMPLATE-FINETUNING/__pycache__/classification_pipeline.cpython-312.pyc
107 lines
16 KiB
2025-08-06 22:45:37 +01:00
Ë
Ào“hµ:ã óbGdd«Z d dedededeedef
dZd „Zed
k(re«yy) c
ó€eZdZdZdZ ddedeedeededed ef d
Zd ed e e
e e ee fffd Z
d
e
e d ed e ee
e ffdZd
e
e d e
e fdZdd
e
e ded e
e fdZdd
e
e ded e
e fdZdd
e
e ded e
e fdZdd
e
e dedefdZ d d ededed e ee ffdZd ed e ee ffdZy)!ÚClassificationPipelinezMain classification pipelinecó@t«|_t«|_y©N)Ú
DataValidatorÚ validatorÚHuggingFaceDataLoaderÚ hf_loader)Úselfs úclassification_pipeline.pyÚ__init__zClassificationPipeline.__init__sÜ&ˆŒÜ0ˆó data_sourceÚ dataset_nameÚ data_pathÚ input_fieldÚ label_fieldÚreturnc ó®|dk(r|s td«td||||dœ|¤Ž}|S|dk(r|s td«td||||dœ|¤Ž}|Std|«) z#Create classification configurationÚ huggingfacez2Dataset name is required for Hugging Face datasets)rrrrÚcustomz)Data path is required for custom datasets)rrrrúUnsupported data source: ©)Ú
ValueErrorÚClassificationConfig)r
rrrrrÚkwargsÚconfigs r Ú
create_configz$ClassificationPipeline.create_config
ð ˜-Ò Ü Ð!UÓØð
ñ ˆFð(ˆ
ð˜
Ü Ð!LÓØð
ñ ˆˆ
ôÐ8¸¸
Ð Gr
rcó|jdk(r8|jj|«}|jj||«}n_|jdk(r8|jj|«}|jj||«}nt d|j«|rét
|dj««}td|«|j|v}|j|v}|r|sŸd}|d|jd|jd z
}|d
t|«d z
}|Dcgc]Štˆfd d
D««sŒŒ}}|Dcgc]ŠtˆfddD««sŒŒ} }|r |d|d z
}| r |d| d z
}t |«|jj||«\}
} |
s?tj!d«| D]} tj!d| «Œt d«|jj#||«}
tj%d«tj%d|
d«tj%d|
d«tj%d|
d«||
fScc}wcc}w)zLoad and preprocess datarrrézAvailable fields in dataset: z(Configured fields not found in dataset.
zLooking for: input_field='z', label_field='z'
zAvailable fields: Ú
c3óBK|]}|j«vŒy­wr©Úlower©Ú.0ÚkeywordÚfs €r Ú <genexpr>z=ClassificationPipeline.load_and_preprocess.<locals>.<genexpr>Js(øèø€ðBeÙ\cÀ'ÈQÏWÉWËYÔBVÙcùóƒ©ÚtextÚsentenceÚcontentÚinputÚcommentÚmessageÚtitleÚbodyc3óBK|]}|j«vŒy­wrr"r$s €r r(z=ClassificationPipeline.load_and_preprocess.<locals>.<genexpr>Ls(øèø€ðCbÙ]dÀ7ÈaÏgÉgËiÔCWÙ`ùr)©ÚlabelÚclassÚcategoryÚtargetÚemotionÚlabelsÚtagÚtypezSuggested text fields: zSuggested label fields: zData validation failed:z - zData validation failedzDataset analysis:z Total samples: Ú
total_samplesz Unique labels: Ú
unique_labelsz Label distribution: Úlabel_distribution)rr ÚloadÚ
preprocessÚ
custom_loaderrÚsetÚkeysÚprintrrÚlistÚanyrÚvalidate_classification_dataÚloggerÚerrorÚanalyze_datasetÚinfo)r
rÚdataÚavailable_fieldsÚinput_field_existsÚlabel_field_existsÚ error_msgr'Ú text_fieldsÚ label_fieldsÚis_validÚerrorsrJÚanalysiss ` r Úload_and_preprocessz*ClassificationPipeline.load_and_preprocess.ø€ð × Ñ  Ò —>>×& .ˆ—>>×,¨T°6Ó:‰DØ
×
Ñ
 
×*¨6Ó2ˆDØ×%×°vÓ>‰DäÐ8¸×9KÑ9KÐ8LÐ  Ü" ¡7§<¡<£>Ó Ü Ð1Ð2BÐ1CÐ "(×!3Ñ!3Ð7GÐ!GÐ Ø!'×!3Ñ!3Ð7GÐ!GÐ á%Ñ-?ØG ØÐ9¸&×:LÑ:LÐ9MÐM]Ð^d×^pÑ^pÐ]qÐqtÐu ØÐ1´$Ð7GÓ2HÐ1IÈÐL ñ+;ôfÑ*: Q¼cóBeÙBeõ?ešqÐ*: ðfá+;ô cÑ+; a¼sóCbÙCbõ@b¢Ð+; ð cñØÐ#:¸;¸-ÀrÐ!JÑJØÐ#;¸L¸Ð!LÑL  Ó Ÿ>™>×FÀtÈVÓˆÜ L‰LÐ Ü ˜t E 7˜^Õ äÐ —>>×1°$¸Ó?ˆä РаÑ(AÐ'BРаÑ(AÐ'BÐ Ð,¨XÐ6JÑ-KÐ,LÐXˆùò9fùò csÄ-I9ÅI9ÅI>Å,I>rMc
ó2tt|«|jz«}tt|«|jz«}|d|}||||z}|||zd}|||dœ}tj dt|«dt|«dt|««|S)z*Split data into train/validation/test setsN)ÚtrainÚ
validationÚtestzData splits: Train=z, Val=z, Test=)ÚintÚlenÚ train_splitÚvalidation_splitrIrL) r
rMrÚ
train_sizeÚval_sizeÚ
train_dataÚval_dataÚ test_dataÚsplitss r Ú
split_dataz!ClassificationPipeline.split_datahôœ˜T› V×%7Ñ%7Ñ8ˆ
Ü”s˜4“y 6×#:Ñ#:Ñ;ˆà˜+˜&ˆ
ؘ
 :°Ñ#8ÐØ˜ hÑ0ˆ ð Øñ
ˆô  Ð)¬#¨j«/Ð):¸ÀXÃÀÈwÔWZÐ[dÓWeÐVfЈ
r
cóLg}|D]}|j|d|ddœ«Œ|S)z=Convert classification data to standard classification formatr.r5)r+r5©Úappend)r
rMÚclassification_dataÚitems r Ú convert_to_classification_formatz7ClassificationPipeline.convert_to_classification_format}s;à Ðãˆ × ˜W™
ؘg™ñ(õ
ðð "r
Útask_descriptionc ó`g}|D]&}|j||dt|d«dœ«Œ(|S)z1Convert classification data to instruction formatr.r5)Ú instructionr.Úoutput©riÚstr)r
rMrmÚinstruction_datarks r Úconvert_to_instruction_formatz4ClassificationPipeline.convert_to_instruction_format‰sCàÐãˆDØ × ˜g™Ü˜d 7™mÓ%õ
ðð Ðr
cóng}|D]-}d|d|ddœdd|ddœg}|jd|i«Œ/|S) z2Convert classification data to conversation formatÚuserú
r.)Úroler-Ú assistantzThe classification is: r5Ú
conversationsrh)r
rMrmÚconversation_datarkrzs r Úconvert_to_conversation_formatz5ClassificationPipeline.convert_to_conversation_formatspàÐãˆ#Ø"2Ð!3°4¸¸
°Ðð
(Ø!8¸¸g¹¸Ðð ˆMð
×  ð&õ
ðð  r
Úquestion_templatecóhg}|D]*}|j|d|dt|d«dœ«Œ,|S)z)Convert classification data to Q&A formatrwr.r5)ÚquestionÚanswerrq)r
rMr}Úqa_datarks r Úconvert_to_qa_formatz+ClassificationPipeline.convert_to_qa_format¬sHàˆãˆDØ N‰Nذd¸7±m°_Иd 7™mÓõ
ðð ˆr
Ú output_pathÚformatcó.t|«}|jjdd¬«|dk(rIt|dd¬«5}|D]+}|j t
j
|d¬«d z«Œ- d
d
d
«nc|d k(r1t|dd¬«5}t
j||dd ¬
«d
d
d
«n-|dk(r(tj|«}|j|d¬«tjdt|«d|«y
#1swYŒ.xYw#1swYŒ:xYw)zSave processed data to fileT©ÚparentsÚexist_okÚjsonlÚwzutf-8)ÚencodingF)Ú ensure_asciir jsoné)ÚindentÚcsv)ÚindexzSaved z samples to )ÚPathÚparentÚmkdirÚopenÚwriterÚdumpsÚdumpÚpdÚ DataFrameÚto_csvrIrLr])r
rMr„Ú output_filer'rkÚdfs r Ú save_dataz ClassificationPipeline.save_data¸ä˜ Ø×Ñ× Ñ ¨¸Ð Ô  Ük 3°Õ9¸QÛ —G‘GœDŸJ™J t¸%˜JÓ@À4Ñ
Ük 3°Õ9¸QÜ— ‘ ˜$ °¸a Ô