ds_erp_ai/data_ingestion/__pycache__/utils.cpython-311.pyc

122 lines
20 KiB

2024-08-05 22:14:19 +01:00
§
2024-08-13 22:16:12 +01:00
ÅÊ»fZHãóBddlmZddlmZddlZddlmZddlmZddl m
2024-08-07 17:50:40 +01:00
Z
ddl m Z ddl m Z dd l
2024-08-13 22:16:12 +01:00
mZdd
2024-08-13 21:30:01 +01:00
lmZdd lmZddlZddlZdd lmZddlZddlZddlZdd
lmZddlZddlZddlm Z ddl!Z!ddl"m#Z#e#¦«ej$d¦«a%eej$d¦«¬¦«Z&dZ'dZ(e(¦«Z)d5dZ*dZ+dZ,dZ-dZ.dZ/dZ0d6dZ1dZ2d7d!„Z3d8d#„Z4d9d%„Z5d&e6d'e7fd(„Z8d:d*„Z9e)d)fd+„Z:d,e6fd-„Z;d.e<d/e<d0e<fd1„Z=d.e<d/e<d0e<fd2„Z>e:¦«Z?d;d4„Z@dS)<é©ÚHuggingFaceBgeEmbeddings)ÚRecursiveCharacterTextSplitterN)ÚInMemoryDocstore)ÚFAISS)Ú PyPDFLoader)Ú
TextLoader)ÚDocx2txtLoader)Úuuid4)ÚDocument)Ú
TextExtractor)ÚGroq)Ú AudioSegment)Ú
VideoFileClip)Ú load_dotenvÚOPENAI_API_KEYÚ GROQ_API_KEY)Úapi_keyúwhisper-large-v3có>d}ddi}ddi}t|||¬¦«}|S)NzBAAI/bge-small-enÚdeviceÚcudaÚnormalize_embeddingsT)Ú
2024-08-07 17:50:40 +01:00
model_nameÚ model_kwargsÚ
2024-08-13 21:30:01 +01:00
encode_kwargsr)rrrÚ
embeddingss úWc:\Users\timmy_3aupohg\Downloads\Manaknight Projects\ds_aiindex\data_ingestion\utils.pyÚload_embedding_modelr"s?Ø$€Jؘ%€LØ+¨TÐ2€MÝ%°LÐP]ðñô€Jð ÐóÚtextcóV|dj}|dj}tddtd¬¦«}| |g¦«}g}t |¦«D]N\}}| ¦«} || d<|| d<t|j| ¬¦«}
| |
¦«ŒO|S) Nréèé
2024-08-08 14:58:44 +01:00
F)Ú
2024-08-13 21:30:01 +01:00
chunk_sizeÚ
chunk_overlapÚlength_functionÚis_separator_regexÚpageÚ file_type©Ú page_contentÚmetadata) r,r-rÚlenÚcreate_documentsÚ enumerateÚcopyr Úappend) Údocr*r!r-Ú
2024-08-13 21:30:01 +01:00
text_splitterÚdocsÚ documentsÚchunkÚ doc_metadataÚdocuments rr/r/1Ø ˆqŒ6Ô €DØ1ŒvŒ€HÝØÝØ ð ñô€Mð × )¨4¨&Ñ 1€Dà€Iݘd‘O”Oð#‰ˆˆ5à—}’}‘ˆ Ø ˆ Ø$-ˆ ¨Ô);ÀlÐØ×Ò˜Ñ Ðr cóš t|¦«}| ¦«}t|¦«}|S#td|¦«xYw©NzError loading -- )r Úloadr/Ú
ValueError)Ú
document_pathÚtxt_docr!r5s rÚload_txt_documentrAGsSð˜Ø|Š|‰~Œ~ˆå Ñ؈ øðÐ<¨]Ð=øøøó 36A
cóš t|¦«}| ¦«}t|¦«}|S#td|¦«xYwr<)r
r=r/r>)r?Údocx_docr!r5s rÚload_docx_documentrERsSð! Ø}Š}‰Œˆå Ñ؈ øðÐ<¨]Ð=øøørBcó| t|¦«}| ¦«}|S#td|¦«xYwr<)rÚload_and_splitr>)r?Úpdf_docÚpagess rÚload_pdf_documentrJ^sJð˜,ˆØ×(ˆØˆ øðÐ<¨]Ð=øøøs$'§;cóþ| d¦«rt|¦«S| d¦«rt|¦«S| d¦«rt|¦«St d|¦«)Nz.pdfz.txtz.docxzUnsupported document type for )ÚendswithrJrArEr>)r?s rÚ
load_documentrMhsˆØ×Ò˜fÑKÝ  Ñ× Ò  Ñ 'Ô 'ðKÝ  Ñ× Ò  Ñ (Ô (ðKÝ! ÐI¸Jr cóÈt|d¦«5}tj| ¦«¦« d¦«cddd¦«S#1swxYwYdS)rbzutf-8)ÚopenÚbase64Ú b64encodeÚreadÚdecode)Ú
image_pathÚ
image_files rÚ encode_imagerWts™Ý ˆJ˜ÑÔðÝ Ô ˜JŸOšOÑ .× 5°gÑ ?øøøð?s9AÁAÁAcó(t|¦«}ddtdœ} dddddœd d
d |id œgd
2024-08-13 21:30:01 +01:00
œgddœ}tjd||¬¦«}| ¦«dddd}n#t
$r }d}Yd}~nd}~wwxYw|S)Nzapplication/jsonzBearer )z Content-TypeÚ
Authorizationz gpt-4o-miniÚuserr!uWhat’s in this image?)Útyper!Ú image_urlÚurlzdata:image/jpeg;base64,)r[r\)ÚroleÚcontenti,)ÚmodelÚmessagesÚ
max_tokensz*https://api.openai.com/v1/chat/completions)ÚheadersÚjsonÚchoicesrÚmessager_ú$Image not good enough for processing)rWrÚrequestsÚpostrdÚ Exception)rUÚ base64_imagercÚpayloadÚresponseÚes rÚ
2024-08-09 16:33:21 +01:00
process_imageroysõ  
Ñ+€Lð,¥7Ðð€Gð
%+Ø$=ððð
%0à %Ð'OÀÐ'OÐ'Oð*ððð ðððð"ð'
ð
ˆõ,”=Ð!MÐW^ÐelÐmˆà—=’=‘?”? 9Ô-¨aÔÔ;¸IÔˆøÝ ðˆˆˆˆˆøøøøð:øøøð €OsžAA9Á9
BÂB
Â
BÚimagecóZ| d¦«d d¦«d}||dœ}t¦«}| |¦«}d d|D¦«¦«}|dkrt |¦«}|dkrt ||¬ ¦«}|gSdS)
/éÿÿÿÿú.r)Úfilenamer*Úc3óvK|]4}| ¦«s| ¦«s|dk¯0|VŒ5dS)ú
N)ÚisalnumÚisspace)Ú.0rns rú <genexpr>z(create_image_document.<locals>.<genexpr>­s@èèРa§i¢i¡k¤kÐN°Q·Y²Y±[´[ÐNÀAÈÂIÀI1ÀIÀIÀIÀIÐNr rgr+)Úsplitr
Úread_text_from_imageÚjoinror )rUr*Ú
image_namer-Útext_extractorr!r3s rÚcreate_image_documentr¥à×! & *×Ñ5°aÔ8€Jà&°YÐ?€HÝ"_”_€NØ × .¨zÑ :€Dà
7Š7ÐN˜dÐ N€Dð ˆr‚z€zݘZÑð РD°8Ð<ˆàˆuˆ à ˆr cóÔt|d¦«5}tjj || ¦«fd¬¦«}ddd¦«n #1swxYwY|jS)NrOr)Úfiler`)rPÚclientÚaudioÚ translationsÚcreaterSr!)Úfilepathr„Ú translations rÚ
audio_to_textr¾Ý
ˆh˜Ñ Ô ð
 Ý”lÔ˜DŸIšI™KœKÐ
ô
ˆ ð
ð
ð
ñ
ô
2024-08-13 21:30:01 +01:00
ð
ð
ð
2024-08-09 16:33:21 +01:00
ð
ð
ð
2024-08-13 21:30:01 +01:00
øøøð
ð
ð
ð
ð
Ô Ðs;AÁAÁATcó |dzdz}tj|¦«}t|¦«}tj |¦« d¦«d}|d}tj |¦«stj|¦«g}||kr||z||zdkrdndz} t| ¦«D]r}
2024-08-09 16:33:21 +01:00
|
2024-08-13 21:30:01 +01:00
|z} t| |z|¦«} || | }
2024-08-09 16:33:21 +01:00
|d|d|
2024-08-13 21:30:01 +01:00
dzd }|
  |d
¬ ¦«|  |¦«|rtd |¦«ŒsnH|d|d
}|  |d
¬ ¦«|  |¦«|rtd |¦«||fS)Né<r#rtrÚ_chunksérrÚ_chunkú.mp3Úmp3)Úformatz
Exporting z _chunk1.mp3)rÚ from_filer.ÚosÚpathÚbasenamer}ÚexistsÚmakedirsÚrangeÚminÚexportr2Úprint)Úaudio_file_pathÚchunk_duration_minutesÚ print_outputÚchunk_length_msr†Úaudio_duration_msÚ
base_filenameÚ chunk_folderÚ chunk_pathsÚ
num_chunksr7Ústart_msÚend_msr8Úchunk_filenames rÚsplit_audio_by_durationrªÇà,¨rÑ1°DÑ8€Oõ
Ô "  3€EݘE™
œ
Ðõ”G×$ _Ñ;¸CÑÔC€MØ,€LÝ
Œ7>Š>˜ 
Œ €Kà˜&¨/Ñ9ÐBSÐVeÑBeÐijÒBjÐBj¸Q¸QÐpqÑrˆ
å 5ˆ˜*ˆHݘ OÑ3Ð5FÑGˆFؘ( 6˜*ˆEØ ,ÐM¨}ÐMÀAÀaÁCÐMˆNØ LŠL˜°ˆLÑ × Ò ˜~Ñ ð
Ð3 4øðE¨=ÐØ
Š ^¨Eˆ Ñ×Ò˜>Ñ ð Ð/˜~Ð ˜Ð $r r†c ót||¦«\}}g}|D]}t|¦«}tj |¦«}t jd|¦«} | r8|  d¦«}
t|  d¦«¦«} n'tj  |¦«d}
d} | dz
|z} | |z}
t|
ttj
|¦«¦«dz¦«}|
| d|
d|dœ}t||¬ ¦«}| |¦«Œt!j|¦«|S)
Nz(.*)_chunk(\d+)\.mp3$réri`êú-z minutes)ruÚdurationr*r+)rr•rr—ÚreÚsearchÚgroupÚintÚsplitextrr.rr”r r2ÚshutilÚrmtree)r*r6Ú
chunk_pathÚ
transcriptr©Úmatchr£Ú chunk_indexÚ start_minÚend_minÚactual_end_minr-r:s rÚtranscribe_audio_chunksr½îsaå 7¸ÐI_Ñ `Ô `Ñ€L€IØ
å" .ˆ
2024-08-08 14:58:44 +01:00
õœ×)¨*ÑÝ” Ð2°NÑØ ð Ø!ŸKšK¨™NœNˆMݘeŸkšk¨!™nœnÑ-ˆKˆKõœG×,¨^Ñ<¸?ˆˆ! 1_Ð(>Ñ>ˆ ØÐ 6Ñ6ˆÝ˜W¥s­<Ô+AÀ/Ñ+RÔ+RÑ'SÔ'SÐW\Ñ'\Ñ^ˆð<¨7Ððˆõ
2024-08-09 16:33:21 +01:00
¨¸hÐØ×Ò˜Ñ „MÔÐà Ðr écó(t|||¦«}|S)N))r*r6s rÚcreate_audio_documentrÀsÝÐ9OÐQZÑ[€IØ Ðr Ú
2024-08-08 14:58:44 +01:00
video_pathÚ
time_intervalcózt|¦«}|j}| dd¦«}|j |¦«}t
j t
2024-08-13 21:30:01 +01:00
j |¦«¦«d}t
2024-08-09 16:33:21 +01:00
j  t
j 
2024-08-13 21:30:01 +01:00
|¦«|d¦«}t j |d¬¦«d}tj
|¦«} t| dd ¦«}tdt!|¦«|¦«D]h}
|
} t
j  |d
| d zd ¦«} tj|| ¬
2024-08-13 22:16:12 +01:00
¦« | d¬¦« ¦«Œit)d|d¦«t+|d¬¦«}
|
S)Nz.mp4rrÚ
_snapshotsT)Úexist_oké´r“Ú frame_at_rzmin.png)Ússr)ÚvframeszSnapshots saved in rtÚvideo)r*)rÚreplacer†Úwrite_audiofiler•rr—rÚdirnamer™ÚffmpegÚprobeÚfloatršÚinputÚoutputÚrunr)Ú
audio_pathÚ
video_nameÚ snapshot_dirÚintervalrÏr7Ú
2024-08-09 16:33:21 +01:00
frame_timeÚ frame_imgr6s rÚpreprocess_video_datarÛõ
˜*Ñ %€EðŒ~€Hð×# F¨FÑ3€JØ
2024-08-13 21:30:01 +01:00
Œ ×# JÑ/€Aõ×!¥"¤'×"2Ò"2°:Ñ">Ô">ÑÔB€Jõ”7—<¤§¢°
Ñ ;Ô ;À
2024-08-13 22:16:12 +01:00
Ð=VÐ=VÐ=VÑW€LÝ„K  €Hõ