DS-LLM-TEMPLATE-FINETUNING/pipelines/styling/__pycache__/data_processor.cpython-311.pyc
339 lines
74 KiB
2025-08-13 21:17:01 +01:00
§
–Éœh$ôã
ó^ddlZddlZddlZddlmZddlmZm Z m
Z
m Z m Z m
Z
ddlmZmZddlZddlmZddlmZmZddlZddlmZddlZddlZddlZddlZeje¦«Z e  !ej"¦«eGdd ¦«¦«Z#Gd
d ¦«Z$Gd d
e¦«Z%Gdde%¦«Z&Gdde%¦«Z'Gdd¦«Z(d*de)de)de)de)de#f
dZ*d+de)d e)de)de)de)de#f d!„Z+d"e)de,fd#„Z-d$e)fd%„Z.d&e)dee)e ffd'„Z/d(„Z0ed)kr e0¦«dSdS),éN)ÚPath)ÚDictÚListÚOptionalÚUnionÚAnyÚTuple)ÚDatasetÚ load_dataset)Ú dataclass)ÚABCÚabstractmethod)Útrain_test_splitcó°eZdZUdZdZeed<dZeeed<dZ eeed<dZ
eed<d Z eed
<d Z eed <d
Z
eed<dZeeed<dZeed<dZeed<dZeed<dZeed<dZeed<dZeed<dZeed<dZeed<dZeed<d Zeed!<d"Zeed#<dZeeed$<d"Zeed%<d"Zeed&<d'Z eed(<d)Z!eed*<d+Z"eed,<d-Z#eed.<dS)/Ú
StylingConfigzConfiguration for styling tasksÚ huggingfaceÚ data_sourceNÚ dataset_nameÚ data_pathÚjsonlÚ data_formatÚtextÚ input_fieldÚ styled_textÚ output_fieldú,Rewrite the following text in a formal styleÚ instructionÚ max_samplesçš™™™™™é?Ú train_splitçš™™™™™¹?Úvalidation_splitÚ
test_splitTÚ
clean_textFÚremove_special_charsÚ lowercaseé
Ú
min_lengthéèÚ
max_lengthÚstylingÚ
output_formatú./dataÚ
output_dirÚtrainÚhf_splitÚ hf_cache_dirÚtest_split_fromÚval_split_fromúutf-8Úencodingú,Ú delimiterzÆBelow is an instruction that describes a task, paired with an input that provides further context. Write a response that follows the instruction
### Instruction:
{}
### Input:
{}
### Response:
{}Ú
alpaca_promptz
<|eot_id|>Ú eos_token)$Ú__name__Ú
__module__Ú __qualname__Ú__doc__rÚstrÚ__annotations__rrrrrrrrÚintr Úfloatr"r#r$Úboolr%r&r(r*r,r.r0r1r2r3r5r7r8r9©óúY/Users/macbook/Desktop/blessing_ai/mkd/fine-tune-task/pipelines/styling/data_processor.pyrrà$€KÐ$Ø"&€L(˜3”-Ð#€Iˆx˜Œ}ЀKÐÐÑð€KÐÐÑØ%€LE€KÐ"&€K˜#”ЀKÐÐÑØ€JÐÐÑð€JÐÐÑØ!&И€IˆÐÑØ€JÐÐÑØ€JÐÐÑð#€M€JÐÐÑð€HˆcÐÐÑØ"&€L(˜3”-Ð#€O!€N€HˆcÐÐÑØ€IˆsÐÐÑð €M ð ñ ð"€Iˆ!rDrc
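The string and numeric constants in the `StylingConfig` code object above are readable enough to sketch the dataclass they appear to belong to. This is a reconstruction from the bytecode dump, not the original source. Field names, types and most defaults are read off the constants; the `"train"` defaults for `test_split_from` and `val_split_from` are inferred from constant ordering, and the exact whitespace inside `alpaca_prompt` is not recoverable, so treat both as assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StylingConfig:
    """Configuration for styling tasks"""

    # Data source
    data_source: str = "huggingface"
    dataset_name: Optional[str] = None
    data_path: Optional[str] = None
    data_format: str = "jsonl"

    # Field mapping and task instruction
    input_field: str = "text"
    output_field: str = "styled_text"
    instruction: str = "Rewrite the following text in a formal style"

    # Sampling and split ratios
    max_samples: Optional[int] = None
    train_split: float = 0.8
    validation_split: float = 0.1
    test_split: float = 0.1

    # Text cleaning and length filters
    clean_text: bool = True
    remove_special_chars: bool = False
    lowercase: bool = False
    min_length: int = 10
    max_length: int = 1000

    # Output
    output_format: str = "styling"
    output_dir: str = "./data"

    # Hugging Face options (the two *_from defaults are inferred, see above)
    hf_split: str = "train"
    hf_cache_dir: Optional[str] = None
    test_split_from: str = "train"
    val_split_from: str = "train"

    # Local file options
    encoding: str = "utf-8"
    delimiter: str = ","

    # Prompt template; line breaks and spacing are approximate
    alpaca_prompt: str = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that follows the "
        "instruction\n### Instruction:\n{}\n### Input:\n{}\n### Response:\n{}"
    )
    eos_token: str = "<|eot_id|>"
```

The template has three positional `{}` slots, so a call such as `cfg.alpaca_prompt.format(cfg.instruction, source_text, styled_text) + cfg.eos_token` would be one plausible way to render a training example; how the original module uses it is not visible in this part of the dump.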
óÊeZdZdZed
deeeefdede de
e eeffd¦«Z ed
deeeefdede deee ffd¦«Z
d S) Ú
DataValidatorz)Validates styling data quality and formatFÚdataÚconfigÚ is_processedÚreturnc
ó g}gd¢}|D]C}||vr| d|d¦«Œ |dkr||s| d¦«ŒD|rd|fStd| ¦«D¦«¦«}t d|d ¦«|rd
n|jŠ|rd n|jŠ| ¦«D]1\}}|st d |d
¦«Œ't d|dt|¦«d¦«d} d}
t|¦«D]S\} } | vr#| dd|d| ¦«| dz
} ‰| vr#| dd|d| ¦«|
dz
}
ŒTt |d| ›¦«t |d|
¦«d}
t|¦«D]\} } t|   d¦«t¦«s#| dd|d| ¦«|
dz
}
t|   d¦«t¦«s#| dd|d| ¦«|
dz
}
Œžt |d|
¦«tˆfd|D¦«¦«}tˆfd|D¦«¦«}|dkr| d|d |d
¦«|dkr| d|d!|d
¦«t |d"|¦«t |d#|¦«|r¯t d$|d%¦«ttd&t|¦«¦«¦«D]f} || } t d'| d(|   d¦«d)d*…d+|   d¦«d)d*…d,¦«ŒgŒ3t|¦«dk|fS)-zValidate styling dataset splits©r/Ú
validationÚtestz Missing 'z' splitr/zTrain split cannot be emptyFc3ó4K|]}t|¦«VŒdS©Úlen)Ú.0Ú
split_datas rEú <genexpr>z6DataValidator.validate_styling_data.<locals>.<genexpr>as(èèÐ
C 
™OœOÐLrDz Validating z# total samples across all splits...ÚinputÚoutputzSkipping validation for empty z splitú split with ú samples...rzMissing input field 'z' in z
split, item ézMissing output field 'ú - Items missing input field: ú - Items missing output field: Úú
Input field 'z' must be string in úOutput field 'z - Type errors: c3ólK|].}| d¦« ¦«°*dVŒ/dS©r^r[ÚgetÚstrip)rTÚitemrs €rErVz6DataValidator.validate_styling_data.<locals>.<genexpr>ŒsCøèèÐa T¸t¿xºxÈ ÐUWÑ?XÔ?X×?^Ò?^Ñ?`Ô?`Ða˜arDc3ólK|].}| d¦« ¦«°*dVŒ/dSrbrc)rTrfrs €rErVz6DataValidator.validate_styling_data.<locals>.<genexpr>sCøèèÐc dÀÇÂÈÐWYÑ@ZÔ@Z×@`Ò@`Ñ@bÔ@bÐc ÐcrDzFound z items with empty input text in z! items with empty output text in z - Empty inputs: z - Empty outputs: zSample processed items from úz Item z : input='Né2z...', output='ú...')ÚappendÚsumÚvaluesÚloggerÚinforrÚitemsrSÚ enumerateÚ
isinstancerdr>ÚrangeÚmin)rHrIrJÚerrorsÚexpected_splitsÚsplitÚ
total_samplesÚ
split_namerUÚmissing_input_countÚmissing_output_countÚirfÚ type_errorsÚ empty_inputsÚ
empty_outputsrrs @@rEÚvalidate_styling_dataz#DataValidator.validate_styling_dataPsEøø€ðˆð9ˆØ >ˆ˜ Ð Ø
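The messages embedded in `DataValidator.validate_styling_data` above ("Missing '...' split", "Train split cannot be empty", missing-field, type and empty-text messages) suggest the overall shape of the check. The following is a simplified sketch under that reading, not the original function: the name `validate_splits`, its flat parameter list and the condensed error wording are illustrative assumptions, and the logging calls seen in the dump are omitted.

```python
from typing import Dict, List, Tuple

# Split names taken from the constants in the dump
EXPECTED_SPLITS = ["train", "validation", "test"]


def validate_splits(
    data: Dict[str, List[dict]],
    input_field: str = "input",
    output_field: str = "output",
) -> Tuple[bool, List[str]]:
    """Return (is_valid, errors) for a dict of train/validation/test splits."""
    errors: List[str] = []

    # Structural check: every split key must exist, and train must be non-empty
    for split in EXPECTED_SPLITS:
        if split not in data:
            errors.append(f"Missing '{split}' split")
        elif split == "train" and not data[split]:
            errors.append("Train split cannot be empty")
    if errors:
        return False, errors

    # Per-item checks: field present, string-typed, and not blank
    for split_name, items in data.items():
        for i, item in enumerate(items):
            for field in (input_field, output_field):
                if field not in item:
                    errors.append(
                        f"Missing field '{field}' in {split_name} split, item {i}"
                    )
                elif not isinstance(item[field], str):
                    errors.append(
                        f"Field '{field}' must be string in {split_name} split, item {i}"
                    )
                elif not item[field].strip():
                    errors.append(
                        f"Empty '{field}' text in {split_name} split, item {i}"
                    )

    return len(errors) == 0, errors
```

Returning early on structural errors mirrors what the bytecode appears to do: per-item checks only make sense once all three split keys are known to exist.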
Ð8¨%И!¨$¨u¬+Ð
Ð=øð ð˜& åÐL¸d¿kºk¹m¼mÐLˆ
Ý Š ÐT -Ð".ÐEgg°6Ô3Eˆ Ø#/ÐHxx°VÔ5Hˆ ð'+§j¢j¡l¤lð3 Kñ3 KÑ "ˆJ˜
Øð
Ý ÐO¸ZÐå KŠKÐZ jÐZ½cÀ*¹o¼oÐ #$Ð Ø#$Ð å$ 

.‘ —MMÐ"h¸+Ð"hÐ"hÈJÐ"hÐ"hÐefÐ"hÐ"hÑ'¨1Ñ —MMÐ"j¸<Ð"jÐ"jÈjÐ"jÐ"jÐghÐ"jÐ"jÑ(¨AÑ(øå KŠK˜ZÐEXÐ KŠK˜\ÐFZÐ ˆKÝ$ 

%‘! $§(¢(¨;¸Ñ";Ô";½SÑ—MMÐ"o°+Ð"oÐ"oÐS]Ð"oÐ"oÐlmÐ"oÐ"oÑ $! $§(¢(¨<¸Ñ"<Ô"<½cÑ—MMÐ"q°<Ð"qÐ"qÐU_Ð"qÐ"qÐnoÐ"qÐ"qÑ $Køå KŠK˜:ÐD°{Ð Ða¨ZÐaˆLÝÐc¨jÐcˆMà˜aÒÐØ
Ðg gÐU_ИqÒ Ð Ø
Ði iÐWaÐ KŠK˜:Ð Ð KŠK˜:ÐÐ ð
KÝ ÐH¸:Ðs 1¥c¨*¡o¤oÑKðK% aœ=—K!J¨!ð!Jð!J°d·h²h¸{ÈBÑ6OÔ6OÐPSÐQSÐPSÔ6Tð!Jð!JÐdh×dlÒdlÐmyÐ{}Ñd~Ôd~ð@CðACð@CôeDð!Jð!Jð!JñKôKðKðKùå6‰{Œ{˜ Ð'rDcó¶ ididœdœ}|rdn|j}|rdn|j}| ¦«D]#\}}|s#diidœ}||d|<d|dd |<Œ+t|¦«iidœ}d|fd|ffD]c\} Š ˆ fd
|D¦«}
|
rNt |
¦«t |
¦«t
j|
¦«t
j|
¦«d œ|d | <Œd||fD](Š tˆ fd
|D¦«¦«} | |d <Œ)||d|<|ddxxt|¦«z
cc<t|¦«|dd |<Œ%|S)z1Analyze dataset characteristics across all splitsr)ryÚ split_sizes)ÚsplitsÚoverallrWrX)ryÚtext_length_statsÚmissing_valuesr„r…cóVg|]%}t| d¦«¦«Œ&S)r^)rSrd©rTrfÚfields €rEú
<listcomp>z1DataValidator.analyze_dataset.<locals>.<listcomp>Äs/ø€ÐP¸ D§H¢H¨U°BÑ$7Ô$7Ñ 8Ô 8ÐPrD)ruÚmaxÚmeanÚmedianr†c3óFK|]}| ¦«°dVŒdS©r[N)rdr‰s €rErVz0DataValidator.analyze_dataset.<locals>.<genexpr>Ïs2øèèÐ#TÐ#T¨$ÀDÇHÂHÈUÁOÄOÐ#T AÐ#TÐ#TÐ#TÐ#TÐ#TÐ#TrDr‡ry)
rrrqrSruÚnprrm)
rHrIrJÚanalysisrrrzrUÚsplit_analysisÚ
field_nameÚ text_lengthsÚ
missing_countrŠs
@rEÚanalyze_datasetzDataValidator.analyze_dataset sø€ðà!"Øðð
ð
ˆð".ÐEgg°6Ô3Eˆ Ø#/ÐHxx°VÔ5Hˆ ð'+§j¢j¡l¤lð$ Mñ$ MÑ "ˆJ˜
Øð
ð&'Ø)+Ø&(ð"ð"ð
2@˜Ô" .ØAB˜Ô# 2°:Ñõ"% ¤Ø%'Ø"$ððˆ(/° Ð&<¸Ð>VÐ%Wð
ð
Ñ!
˜EØPÀZÐP Øðå" " 0Ý "¤¨ Ñ 5Ô 5Ý"$¤)¨LÑ"9Ô"9ð GðGNÐ#6Ô7¸
ÑCøð& 
Hð
HÝ #Ð#TÐ#TÐ#TÐ#T°*Ð#TÑ#TÔ#TÑ TÔ T
Ø:GÐÑ7à-;ˆH ˜zÑ   Ð 0µC¸
±O´OÑ 0Ý=@À¹_¼_ˆH  
Ô .¨zÑ ˆrDN)F)r:r;r<r=Ú staticmethodrr>rrrBr rrr—rCrDrErGrGMØðMM( ¨d°4¬j¨Ô$9ðM(À=ðM(Ð`dðM(ÐqvÐw{ð~BðCFô~GðxGôrHðMMM„\ðM(ð^ð5ð5˜d ¨T¬
 5¸5ÐZ^ð5ÐkoÐpsÐuxÐpxÔkyð5ð5ð5ñ„\ð5ð5ð5rDrGc
óªeZdZdZededeeeeffd¦«Z edeeeefdedeeeeffd¦«Z
dS)ÚBaseDataLoaderz$Abstract base class for data loadersrIrKcódS)z:Load data and return dictionary with train/val/test splitsNrC)ÚselfrIs rEÚloadzBaseDataLoader.loadÛó ð
ˆrDrHcódS)z'Apply preprocessing steps to all splitsNrC)rHrIs rEÚ
preprocesszBaseDataLoader.preprocessàrDN) r:r;r<r=rrrr>rrr rCrDrEØØð
˜
¨T°#°t¸D´z°/Ô-Bð
ð
ð
ñ„^ð
ðð
˜t C¨¨d¬ OÔ
¸
ÐPTÐUXÐZ^Ð_cÔZdÐUdÔPeð
ð
ð
ñ„^ð
ð
ð
rDc ó¾eZdZdZdedeeeeffdZdeeeefdedeeeeffdZ dedede
efdZ d ededefd
Z d S) ÚHuggingFaceDataLoaderz#Load datasets from Hugging Face HubrIrKc óú|jstd¦«t d|j¦« t |j|j¬¦«}t
| ¦«¦«}t d|¦«gggdœ}d|vrF|d}t dt|¦«d¦«t
|¦«|d<nOt  d ¦«t  d
|¦«td |jd ¦«|j
d
krJd|vrF|d}t dt|¦«d¦«t
|¦«|d<nä|j
d
krJd|vrF|d}t dt|¦«d¦«t
|¦«|d<n|j
d
kr^t  d¦«t d
|¦«t d|j dzd¦«n&t d|j dzd¦«|j
dkrKd|vrG|d}t dt|¦«d¦«t
|¦«|d<n9|j
d
krJd|vrF|d}t dt|¦«d¦«t
|¦«|d<nä|j
d
krJd|vrF|d}t dt|¦«d¦«t
|¦«|d<n|j
dkr^t  d¦«t d
|¦«t d|jdzd¦«n&t d|jdzd¦«|dr |ds|d}t|¦«d kr<t  d!t|¦«d"¦«||d<g|d<g|d<n:|j|j z|jz} | d#krKt  d$| d%¦«|j| z |_|j | z |_ |j| z |_|ds|ds
t!t|¦«|jz¦«}
t!t|¦«|j z¦«} t|¦«d&krGd'|_d(|_ d(|_t d)|jd*|j d+|j¦«t#d,t!t|¦«d-z¦«¦«} t#d,t!t|¦«d-z¦«¦«}
t#| t!t|¦«|j z¦«¦«} t#|
t!t|¦«|jz¦«¦«}t|¦«| z
|z
}
|
d,krD| d,kr | d,z} |
d,z
}
n|d,kr
|d,z}|
d,z
}
t d.|
d*| d+|¦«t%|| |zd/¬0¦«\}}t%|| |zd1kr|| |zz nd1d/¬0¦«\}}||d<||d<||d<n³|dsRt#d,t!t|¦«|j z¦«¦«} t%|| d/¬0¦«\}}||d<||d<nY|dsQt#d,t!t|¦«|jz¦«¦«}t%||d/¬0¦«\}}||d<||d<t d2¦«t d3t|d¦«d¦«t d4t|d¦«d¦«t d5t|d¦«d¦«d|vrg|d<d|vrg|d<|jrq|D]n}||rdt||¦«}||d6|j||<t d7|d8|d9t||¦«d¦«Œo| ¦«D]¸\}}|r¯t d:|d;|d1¦«t d<|d=t
|d1 ¦«¦«¦«|j|d1vr“t  d>|jd?|d@t
|d1 ¦«¦«¦«dA„|d1 ¦«D¦«}|r t dB|d;|¦«|j|d1vr“t  dC|jd?|d@t
|d1 ¦«¦«¦«dD„|d1 ¦«D¦«}|r t dE|d;|¦«Œºt dF|j¦«|S#t.$r+}t  dG|jd;|¦«d6}~wwxYw)Hz?Load dataset from Hugging Face Hub with flexible split handlingz2Dataset name is required for Hugging Face datasetszLoading Hugging Face dataset: )Ú cache_dirzAvailable splits in dataset: rMr/zUsing 'train' split with ú samplesz"No 'train' split found in dataset!zAvailable splits: zDataset z does not have a 'train' splitÚuse_val_if_availablerNzUsing 'validation' split with ÚvalzUsing 'val' split with zCNo validation split found in dataset. Will create from train split.z Will use édz% of train data for validationz.Will create validation split from train data (z%)Úuse_test_if_availablerOzUsing 'test' split with z&Using 'validation' split as test with zUsing 'val' split as test with z=No test split found in dataset. Will create from train split.z% of train data for testz(Will create test split from train data (riúDataset has only ú& samples. Using all data for training.gð?z(Split percentages don't sum to 1.0 (got z). Normalizing...r'ç333333ã?çš™™™™™É?ú8Small dataset detected. Adjusted split ratios to: train=ú, val=ú, test=r[r!zAdjusted split sizes: train=é*©Ú test_sizeÚ random_staterzFinal split sizes:ú Train: ú Validation: ú Test: NzLimited z split from z to zSample data item from ú: úAvailable fields in z split: r_ú' not found in ú. Available fields: cóJg|]ŠtˆfddD¦«¦«¯Œ S)c3óDK|]}| ¦«vVŒdSrQ©Úlower©rTÚkeywordÚfs €rErVz8HuggingFaceDataLoader.load.<locals>.<listcomp>.<genexpr>©sVøèèðNrðNrÐgnÈgÐYZ×Y`ÒY`ÑYbÔYbÐNbðNrðNrðNrðNrðNrðNrrD)rÚsentenceÚcontentrWÚcommentÚmessage©Úany©
ValueErrorrorpr r1ÚlistÚkeysrSÚerrorr3Úwarningr"r2r#r r@rrrqrrÚ Exception)rIÚdatasetÚavailable_splitsÚ splits_dataÚ
train_datasetÚ val_datasetÚ test_datasetÚ
train_dataÚtotal_train_percentageÚ
train_sizeÚval_sizeÚ min_val_sizeÚ
min_test_sizer³Ú new_trainÚ temp_dataÚnew_valÚnew_testrzÚ
original_sizerUÚ text_fieldsÚ
output_fieldsÚes rErzHuggingFaceDataLoader.loadés
àÔ SÝÐ  Š ÐJ°VÔ5HÐH åÔ ÔñôˆGõ $ G§L¢L¡N¤NÑ Ý KŠKÐJÐ8HÐ Ø ØððˆKðÐ*Ø '¨Ô 0
Ý Ð¸MÑ8JÔ8JÐUÝ'+¨MÑ':Ô': ˜ Ð ÐDÐ2BÐ Ð!_¨FÔ,?Ð!_Ð!_Ð!_ÑÔ$Ð(>Ò>À<ÐScÐCcÐCcØ% 3 Ý ÐW½SÀÑ=MÔ=MÐXÝ,0°Ñ,=Ô,= ˜Ô&Ð*@Ò@ÀUÐN^ÐE^ÐE^Ø% eœn Ý ÐPµc¸+Ñ6FÔ6FÐQÝ,0°Ñ,=Ô,= ˜Ô&Ð*@ÒÐ ÐCÐ1AÐ ÐÔ(?À#Ñ(EÐ ÐnÈVÔMdÐgjÑMjÐÔ%Ð)@Ò@ÀVÐO_ÐE_ÐE_Ø&  Ý ÐRµs¸<Ñ7HÔ7HÐSÝ&*¨<Ñ&8Ô&8 ˜Ô'Ð+AÒAÀlÐVfÐFfÐFfØ& 4 Ý Ð`ÅSÈÑEVÔEVÐaÝ&*¨<Ñ&8Ô&8 ˜Ô'Ð+AÒAÀeÐO_ÐF_ÐF_Ø& uœ~ Ý ÐY½cÀ,Ñ>OÔ>OÐZÝ&*¨<Ñ&8Ô&8 ˜Ô'Ð+BÒÐ ÐCÐ1AÐ ÐÔ(9¸CÑ(?Ð ÐbÀvÔGXÐ[^ÑG^И|ÔY
7°KÀÔ4GñY
Ô1
õz?”? —NNÐ#nµs¸´Ð#nÐ#nÐ#nÑoØ+5K Ñ(Ø02K  Ñ-Ø*,K Ñ.4Ô-?À&ÔBYÑ-YÐ\bÔ\mÑ-mÐÒŸšÐ'{ÐRhÐ'{Ð'{Ð'{Ñ|à-3Ô-?ÐBXÑ-X˜Ô*Ø28Ô2IÐLbÑ2b˜Ô/Ø,2Ô,=Ð@VÑ,V˜Ô' E7¸Ô=PñE7å%(­¨Z©¬¸6Ô;MÑ)MÑ%NÔ%N˜
Ý#&¥s¨:¡¤¸Ô9PÑ'PÑ#QÔ#Q˜õ˜z™?œ?¨RÒ/à14˜.Ø69˜3Ø03˜"ŸKšKð)qÐciÔcuð)qð)qð~Dô~Uð)qð)qð^dô^oð)qð)qñrôrðrõ(+¨1­cµ#°j±/´/ÀCÑ2GÑ.HÔ.HÑ'IÔ'I˜ Ý(+¨A­sµ3°z±?´?ÀSÑ3HÑ/IÔ/IÑ(JÔ(J˜
å#& |µS½¸¼È6ÔKbÑ9bÑ5cÔ5cÑ#dÔ#d˜Ý$'¨
µs½3¸z¹?¼?ÈVÔM^Ñ;^Ñ7_Ô7_Ñ$`Ô$`˜ Ý%(¨¡_¤_°xÑ%?À)Ñ%K˜
ðš>˜'¨!š|˜|Ø (¨A¡
 Ø *¨a¡ 
 
Ø!*¨Q¢ Ø )¨Q¡  Ø *¨a¡ 
Ý"ŸKšKÐ(uÀzÐ(uÐ(uÐYaÐ(uÐ(uÐjsÐ(uÐ(uÑ0@Ø&Ø&.°Ñ&:Ø)+ð0ñ0ô0Ñ,˜  -=Ø%ØMUÐXaÑMaÐefÒLfÐLf i°8¸iÑ3GÑ&HÐ&HÐlmØ)+ð-ñ-ô-Ñ)˜ ð 09˜  ,Ø4;˜  1Ø.6˜  Ô7å#& q­#­c°*©o¬oÀÔ@WÑ.WÑ*XÔ*XÑ#YÔ#Y˜Ý-=Ø&Ø&.Ø)+ð.ñ.ô.Ñ*˜  
09˜  ,Ø4;˜  Ô 7å$'¨­3­s°:©¬ÀÔARÑ/RÑ+SÔ+SÑ$TÔ$T˜ Ý.>Ø&Ø&/Ø)+ð/ñ/ô/Ñ+˜  
09˜  ,Ø.6˜   KŠKÐ KŠKÐG¥C¨ °GÔ(<Ñ$=Ô$=Ð KŠKÐQ­¨[¸Ô-FÑ)GÔ)GÐ KŠKÐE¥3 {°6Ô':Ñ#;Ô#;Ð  ;Ð.Ø,. ˜˜(Ø&( ˜Ô
BØ"-ðBðB" BÝ(+¨K¸
Ô,CÑ(DÔ(D˜
Ø2=¸jÔ2IÐJ]È6ÔK]ÐJ]Ô2^˜  Ÿ š ð%A¨zð%Að%AÀ}ð%Að%AÕZ]Ð^iÐjtÔ^uÑZvÔZvð%Að%Að%AñBôBðBøð+6×*;Ò*;Ñ*=Ô*=ð
fñ
fÑ&
˜JØñfÝ—K’KÐ V¸Ð VÐ VÀzÐRSÄ}Ð VÐ VÑ—K’KÐ g°zÐ gÐ gÍ4ÐPZÐ[\ÔP]×PbÒPbÑPdÔPdÑKeÔKeÐ gÐ gÑÔ¸A´ÐŸšð(W°vÔ7Ið(Wð(WÐZdð(Wð(WÕz~ð@JðKLô@M÷@Rò@Rñ@Tô@Tñ{Uô{Uð(Wð(WñXôXðXð'sð's°*¸Q´-×2DÒ2DÑ2FÔ2Fð'sñ'sô's˜ ØbÝ"ŸKšKÐ(`ÀZÐ(`Ð(`ÐS^Ð(`Ð(`ÑÔ*°*¸Q´-Пšð(Y¸Ô8Kð(Yð(YÐ\fð(Yð(Yõ}AðBLðMNôBO÷BTòBTñBVôBVñ}Wô}Wð(Yð(YñZôZðZð)kð)k°J¸q´M×4FÒ4FÑ4HÔ4Hð)kñ)kô)k˜
ØfÝ"ŸKšKÐ(dÀzÐ(dÐ(dÐUbÐ(dÐ(dÑeùå KŠKÐL°vÔ7JÐ Ð øåð ð ð Ý LŠLÐL°&Ô2EÐÐ øøøøð øøøsºj
kë
k:ë&k5ë5k:rHc óòi}t d¦«| ¦«D]Ã\}}t d|dt|¦«d¦«|rÔt |d ¦«¦«}t d|d|¦«t djd ‰jd
¦«j|vr(t d jd |d
|¦«j|vr(t djd |d
|¦«tˆfd|D¦«¦«}tˆfd|D¦«¦«}t |d|¦«t |d|¦«t d| 
¦«d¦«ttdt|¦«¦«¦«D]·} || }
t d| d|d¦«|
 ¦«D]w\} } t| t¦«r=t| ¦«dkr*t d| d| ddd¦«ŒWt d| d| ¦«ŒxŒ¸g}
d}d}d|_t!|¦«D]f\} }
| |
¦«}||
 |¦«|dz
}Œ8|dz
}|dkr#t d| d|d|
¦«Œg|
||<t |d |d!|d"¦«|
r†t d#| 
¦«d$¦«ttdt|
¦«¦«¦«D]+} t d%| d|d|
| ¦«Œ,ŒÅ|S)&ú2Apply preprocessing steps to all splits separatelyz=== PREPROCESSING DATA ===ú Processing rYú items...rr¸zLooking for input field: 'z', output field: 'ú'r_r`c3óbK|])}j|vs| j¦«°%dVŒ*dSr)rrd©rTrfrIs €rErVz3HuggingFaceDataLoader.preprocess.<locals>.<genexpr>ÏsIøèèÐРd¸FÔ<NÐVZÐ<ZÐ<ZÐbf×bjÒbjÐkqÔk}Ñb~Ôb~Ð<Z Ð<ZÐ<ZÐ<ZÐ<ZÐÐrDc3óbK|])}j|vs| j¦«°%dVŒ*dSr)rrds €rErVz3HuggingFaceDataLoader.preprocess.<locals>.<genexpr>Ðsaøèèð!Cð!C t¸VÔ=PÐX\Ð=\Ð=\Ðdh×dlÒdlÐmsônAñeBôeBÐ=\ Ð=\Ð=\Ð=\Ð=\ð!Cð!CrDr\r]z=== SAMPLE RAW DATA FROM z BEFORE PREPROCESSING ===riz Raw item ú from rhz z: 'Nrkr[ú
Skipped item ú - Preprocessed ú samples, skipped r¥z=== SAMPLE PROCESSED DATA FROM z ===zProcessed item )rorprqrSÚsetrÒrrrmÚupperrtrursr>Ú _debug_countrrÚ_preprocess_itemrl)rHrIÚprocessed_splitsrzrUÚavailable_fieldsÚ
missing_inputÚmissing_outputr}rfÚkeyÚvalueÚprocessed_dataÚprocessed_countÚ
skipped_countÚprocessed_items ` rEr z HuggingFaceDataLoader.preprocessºø€àÐå Š Ð2à&*§j¢j¡l¤lð9 ^ñ9 ^Ñ "ˆJ˜
Ý KŠKÐX jÐX½cÀ*¹o¼oÐ ð
JÝ#& z°!¤}×'9Ò'9Ñ';Ô';Ñ#<Ô#<Ð Ý ÐS°:ÐSÐAQÐ ÐÔ9KÐuÐ_eÔ_rÐÔ%Ð-=ЗL"G°Ô1Cð"Gð"GÐT^ð"Gð"GðuEð"Gð"GñHôHðHØÔ&Ð.>ЗL"I°&Ô2Eð"Ið"IÐV`ð"Ið"IðwGð"Ið"IñJôJðJõ ÐÐÐШjÐÑÔÑÔˆMÝ ð!Cð!Cð!Cð!C¨zð!Cñ!Cô!CñCôCˆ KŠK˜:ÐTÀ]Ð KŠK˜:ÐVÀnÐ 
KŠKÐa°J×4DÒ4DÑ4FÔ4FÐ 3˜q¥# j¡/¤/Ñ

9Ø! !”}Ý ÐÐÐ?Ø"&§*¢*¡,¤,ð9JC˜Ý! %­Ñ9µ#°e±*´*¸sÒ2BÐ2BÝŸ š Ð$B¨Ð$BÐ$B°°t¸°t´Ð$BÐ$BÐ$BÑŸ š Ð$7¨Ð$7Ð$7°Ð$7Ð$7Ñ ˆˆˆ!"ˆ å$ 
Sð
S4Ø!%×!6Ò!6°t¸VÑ!DÔ!DØ"×)¨.Ñ# (O! &ÒŸ š Ð$Q°AÐ$QÐ$Q¸ZÐ$QÐ$QÈ4Ð$QÐ$QÑRøà+9Ð ˜ KŠK˜:ÐÐqÐZgÐ ð
^Ý ÐV¸j×>NÒ>NÑ>PÔ>PÐs 1¥c¨.Ñ&9Ô&9Ñ^ð^—K’KÐ \°!Ð \Ð \¸:Ð \Ð \ÈÐXYÔIZÐ \Ð \Ñ]ùàÐrDrfc óø| |jd¦«}| |jd¦«}t|d¦«r|xjdz
c_nd|_|jdkrmt
 d|jd¦«t
 d|jd|¦«t
 d |jd|¦«|d}|d}t|¦«}t|¦«}|jdkr1t
 d |d
d d
|d
d d¦«|jr|}|}|  ||¦«}|  ||¦«}|jdkrbt
 d|d
d d|d
d d¦«t
 d|d
d d|d
d d¦«t|¦«|j kst|¦«|j krH|jdkr;t
 dt|¦«d|j d|j d¦«d
St|¦«|j kst|¦«|j krH|jdkr;t
 dt|¦«d|j d|j d¦«d
S||dœ}|jdkrt
 d|¦«|S)úPreprocess a single itemr^r[rizProcessing item rhz Looking for input field 'z': z Looking for output field 'Nz After conversion - input: 'rjz...', output: 'rkz After cleaning - input: 'z ...' -> 'z After cleaning - output: 'z Skipping - input length z not in range [z, ú]z Skipping - output length ©rWrXz Final processed item: )
rdrrÚhasattrrøroÚdebugr>r$Ú _clean_textrSr(r*)rfrIÚ
input_textÚ output_textÚoriginal_inputÚoriginal_outputrs rEz&HuggingFaceDataLoader._preprocess_itemýð—XX˜0°"Ñ5ˆ
Ø—h’h˜vÔ2°BÑ õ 4˜Ñ  Ð Ô  Ñ Ô Ð à !ˆDÔ à Ô  Ò LŠLÐ@¨DÔ,=Ð LŠLÐZ°vÔ7IÐZÈjÐ LŠLÐ]¸Ô8KÐ]ÐP[Ð  Р؈JØ Ð ØˆKõ˜‘_”_ˆ
ݘ&ˆ à Ô  Ò LŠLÐÀCÀRÀC¼ÐoÐYdÐehÐfhÐehÔYiÐ  Ô ð sØ'ˆ)ˆ×)¨*°fÑ=ˆJØ×*¨;¸Ñ?ˆKØÔ  AÒ Ðn¸>È#È2È#Ô;NÐnÐYcÐdgÐegÐdgÔYhÐ Ðq¸OÈCÈRÈCÔ<PÐqÐ[fÐgjÐhjÐgjÔ[kÐ ˆz‰?Œ?˜VÔ .µ#°j±/´/ÀFÔDUÒ2UÐ2UØÔ  AÒ ðD½#¸j¹/¼/ðDðDÐZ`ÔZkðDðDÐouôpAðDðDðDñEôEðEØ ˆ{Ñ Ô ˜fÔ /µ3°{Ñ3CÔ3CÀfÔFWÒ3WÐ3WØÔ  AÒ ðF½3¸{Ñ;KÔ;KðFðFÐ\bÔ\mðFðFÐqwôrCðFðFðFñGôGðGØ Ø
ð
ˆð
Ô  Ò LŠLÐD°NÐ ÐrDrcóôt|t¦«sdStjdd|¦« ¦«}|jr| ¦«}|jrtjdd|¦«}|S©úClean and normalize textr^z\s+ú z[^\w\s]©rsr>ÚreÚsubrer&r¿r%©rrIs rEr
z!HuggingFaceDataLoader._clean_text;óyå˜Ñ ØŒvf˜c ð Ô ð Ø—:’:<”<ˆ Ô ”6˜* b¨$Ñ/ˆˆ rDN)
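The `_clean_text` code object above is short enough that its constants (`\s+`, a single space, `[^\w\s]`) and referenced names (`isinstance`, `re.sub`, `strip`, `lowercase`, `lower`, `remove_special_chars`) indicate what it does. A minimal standalone sketch, assuming the two config flags are passed as plain keyword arguments rather than read from a `StylingConfig` instance:

```python
import re


def clean_text(text, lowercase: bool = False, remove_special_chars: bool = False) -> str:
    """Clean and normalize text"""
    # Non-string input is mapped to an empty string
    if not isinstance(text, str):
        return ""

    # Collapse any run of whitespace into one space, then trim the ends
    text = re.sub(r"\s+", " ", text).strip()

    if lowercase:
        text = text.lower()

    # Drop everything that is neither a word character nor whitespace
    if remove_special_chars:
        text = re.sub(r"[^\w\s]", "", text)

    return text
```

Note that removing special characters after whitespace collapsing can leave double spaces where a standalone symbol used to sit (for example `"a & b"` becomes `"a  b"`); whether the original handles that is not visible in the dump.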
r:r;r<r=rrr>rrr rr
rCrDrEæØO˜=ðO¨T°#°t¸D´z°/Ô-BðOðOðOðOðbA ˜t ¨d¬ OÔA ¸A ÐPTÐUXÐZ^Ð_cÔZdÐUdÔPeðA ðA ðA ðA ðF< <°=ð<ÀXÈdÄ^ð<ð<ð<ð<ð| ð¨]ð¸sððððððrDc óZeZdZdZdedeeeeffdZdeededeeeeffdZ de
dedeefdZ de
dedeefd „Z de
dedeefd
Z
deeeefdedeeeeffd Zd ededeefd
ZdededefdZdS)ÚCustomDataLoaderz%Load custom datasets from local filesrIrKcóŒ|jstd¦«t|j¦«}| ¦«st d|¦«t
 d|¦«|jdkr| ||¦«}n[|jdkr|  ||¦«}n9|jdkr| 
||¦«}ntd|j¦«|j r|d|j }t
 d t|¦«d
|¦«| 
||¦«}|S) z5Load custom dataset from local file and create splitsz)Data path is required for custom datasetszData file not found: zLoading custom dataset: rÚcsvÚjsonzUnsupported format: NzLoaded z samples from )rrÚexistsÚFileNotFoundErrorrorprÚ _load_jsonlÚ _load_csvÚ
_load_jsonrrSÚ_create_splits)rIÚ file_pathÚraw_datarØs rErzCustomDataLoader.loadQs]àÔð JÝÐ ˜Ô*ˆ à×ÒÑ IÝ#Ð$G¸IÐ$GÐ$GÑ  Š Ð:¨yÐ Ô  Ò ×'¨ °6Ñ:ˆHˆ
Ô
 
—~’~ i°Ñ8ˆHˆHØ
Ô
 
 y°&Ñ9ˆHˆÐH°FÔ4FÐ Ô ðÐ 3 Ô!3Ð 3Ô4ˆ Š ÐFc (™mœmÐF¸9Ð×)¨(°FÑ;ˆ àÐrDrHcót dt|¦«d¦«t|¦«dkr1t dt|¦«d¦«|ggdœSt|¦«}t dt |dz¦«¦«}t dt |dz¦«¦«}|d krGd
|_d |_d |_t d |jd
|jd|j¦«t |t ||jz¦«¦«}t |t ||jz¦«¦«}||z
|z
}|dkrD|dkr |dz}|dz
}n|dkr
|dz}|dz
}t d|d
|d|¦«t d|d|d|¦«|dkr
|dkr|ggdœ} nw|dkrt||d¬¦«\}
} |
g| dœ} nU|dkrt||d¬¦«\}
} |
| gdœ} n3t|||zd¬¦«\}
}
t|
|d¬¦«\} } |
| | dœ} t d¦«t dt| d¦«d¦«t dt| d¦«d¦«t dt| d¦«d¦«| S)z1Create train/validation/test splits from raw datazCreating splits from rZrirMr[r!r'r­zBAdjusted split sizes to ensure train has at least 1 sample: train=zSplit sizes: train=z
, validation=rzCreated splits:rµr/rNrO)
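The log messages in `CustomDataLoader._create_splits` above ("Dataset has only ... samples. Using all data for training.", "Adjusted split sizes to ensure train has at least 1 sample") point to a size computation with per-split minimums and a fix-up step. The sketch below isolates just that arithmetic. The function name `compute_split_sizes` and the `min_total` threshold of 3 are assumptions made for illustration; the original also rewrites the ratios for small datasets and calls `train_test_split`, which is left out here.

```python
from typing import Tuple


def compute_split_sizes(
    total: int,
    val_ratio: float = 0.1,
    test_ratio: float = 0.1,
    min_total: int = 3,  # assumed threshold below which no split is attempted
) -> Tuple[int, int, int]:
    """Return (train_size, val_size, test_size) for `total` samples."""
    # Too few samples: keep everything in train
    if total < min_total:
        return total, 0, 0

    # Each held-out split gets at least one sample
    min_size = max(1, int(total * 0.1))
    val_size = max(min_size, int(total * val_ratio))
    test_size = max(min_size, int(total * test_ratio))
    train_size = total - val_size - test_size

    # If the held-out splits consumed everything, hand one sample back to train
    if train_size < 1:
        if val_size > 1:
            val_size -= 1
            train_size += 1
        elif test_size > 1:
            test_size -= 1
            train_size += 1

    return train_size, val_size, test_size
```

Computing integer sizes first and only then splitting keeps the three sizes summing exactly to `total`, which ratio-based splitting alone does not guarantee after rounding.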
rorprSr@r r"r#r)rHrIryÚ test_dataÚval_datarãs rEr"zCustomDataLoader._create_splitspå Š ÐB­C°©I¬IÐ ˆt‰9Œ9qŠ=ˆ NŠNÐ`­s°4©y¬yÐ Ø Øððð
õ˜D™ œ ˆ
õ˜1c -°#Ñ"5Ñ7ˆ ݘAs =°3Ñ#6Ñ8ˆ
ð ˜2Ò Ð à!$ˆ Ø&)ˆ #Ø #ˆ Ý KŠKðaÐSYÔSeðaðaÐmsônEðaðaðNTôN_ðaðañ
bô
bð
bõ|¥S¨¸Ô9PÑ)PÑ%QÔ%QÑRˆÝ˜
¥s¨=¸6Ô;LÑ+LÑ'MÔ'MÑNˆ Ø" -° Ñ9ˆ
𠘊>ˆ˜!Š|ˆ˜A
ؘa
ؘQ’ؘQ‘ ؘa‘
Ý KŠKðLÐ]gðLðLÐowðLðLðAJðLðLñ
Mô
Mð
Må Š Ð_¨*Ð_À8Ð_ÐT]Ð qŠ=ˆ=˜Y¨!š^˜^ðØ ØððˆKˆKð
˜Š]ˆ]å$4°TÀYÐ]_Ð$`Ñ$`Ô$`Ñ !ˆJ˜ à ØðˆKˆKð
˜!Š^ˆ^å#3°DÀHÐ[]Ð#^Ñ#^Ô#^Ñ ˆJ˜àððˆKˆ%5ØØ" ð%ñ%ô%Ñ !ˆJ˜ õ#3ØØð#ñ#ô#Ñ ˆHðˆ  Š Ð Š Ð Ô$8Ñ 9Ô 9Ð Š ÐM¥S¨°\Ô)BÑ%CÔ%CÐ Š ÐAs ;¨vÔ#6ÑÐrDr#c óˆg}t|d|j¬¦«5}t|d¦«D]~\}}| ¦«re | t j|¦«¦«ŒB#t
j$r*}t  d|d|¦«Yd}~Œvd}~wwxYwŒ ddd¦«n #1swxYwY|S)zLoad JSONL fileÚr5r[zInvalid JSON at line r¸N)
Úopenr5rrrerlrÚloadsÚJSONDecodeErrorro)r#rIrHÚline_numÚlinerés rErzCustomDataLoader._load_jsonlÓs>àˆÝ
)˜S¨6¬?Ð
 P¸qÝ"+¨A¨q¡/¤/ð
Pð
P˜—::<”<ðPðPØŸ š ¥D¤J¨tÑ$4Ô$4Ñ5øÝÔPðPðPÝŸšÐ'N¸xÐ'NÐ'NÈ1Ð'NÐ'NÑOøøøøðPøøøðPð
Pð Pð Pð Pñ Pô Pð Pð Pð Pð Pð Pð Pøøøð Pð Pð Pð Pðˆ s;š*B7Á'A-Á,B7Á-B&Á< B!ÂB7Â!B&Â&B7Â7B;Â>B;cóntj||j|j¬¦«}| d¦«S)z
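The `_load_jsonl` code object above references `open`, `enumerate` starting at 1, `strip`, `json.loads`, `json.JSONDecodeError` and the message "Invalid JSON at line ...", which suggests a tolerant line-by-line reader. A minimal sketch under that reading, written as a free function with `encoding` passed directly instead of via the config object (both are simplifications, and the log level used in the original is an assumption):

```python
import json
import logging
from pathlib import Path
from typing import Dict, List, Union

logger = logging.getLogger(__name__)


def load_jsonl(file_path: Union[str, Path], encoding: str = "utf-8") -> List[Dict]:
    """Load a JSONL file, skipping blank and malformed lines."""
    data: List[Dict] = []
    with open(file_path, "r", encoding=encoding) as f:
        # Line numbers start at 1 so warnings match what an editor shows
        for line_num, line in enumerate(f, 1):
            if not line.strip():
                continue
            try:
                data.append(json.loads(line.strip()))
            except json.JSONDecodeError as e:
                # A bad line is reported and skipped rather than aborting the load
                logger.warning(f"Invalid JSON at line {line_num}: {e}")
    return data
```

Skipping malformed lines keeps a single corrupt record from discarding an otherwise usable dataset, at the cost of silently shrinking it; comparing `len(data)` against the file's line count is a cheap way to notice that.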
Load CSV file©r5r7Úrecords)ÚpdÚread_csvr5r7Úto_dict)r#rIÚdfs rEr zCustomDataLoader._load_csvßs/å
Œ[˜¨V¬_ÈÔHXÐ
ØzŠz˜$rDcóþt|d|j¬¦«5}tj|¦«}ddd¦«n #1swxYwYt |t
¦«r|St |t ¦«r d|vr|dS|gS)zLoad JSON filer)r*NrH)r+r5rrrsÚdict)r#rIrHs rEr!zCustomDataLoader._load_jsonäå