ds_erp_ai/data_ingestion/__pycache__/utils.cpython-311.pyc
91 lines
18 KiB
2024-08-05 22:14:19 +01:00
§
2024-08-09 16:33:21 +01:00
|4¶f«=ãóddlmZddlmZddlZddlmZddlmZddl m
2024-08-07 17:50:40 +01:00
Z
ddl m Z ddl m Z dd l
2024-08-09 16:33:21 +01:00
mZdd
lmZdd lmZddlZddlZdd lmZddlZddlZddlZdd
lmZddlZddlZddlm Z e ¦«ej!d¦«a"eej!d¦«¬¦«Z#dZ$dZ%e%¦«Z&dZ'dZ(dZ)dZ*dZ+dZ,dZ-dZ.dZ/d-dZ0dZ1d.d!„Z2d/d#„Z3e&d"fd$„Z4d%e5fd&„Z6d'e7d(e7d)e7fd*„Z8d'e7d(e7d)e7fd+„Z9d.d,„Z:dS)0é©ÚHuggingFaceBgeEmbeddings)ÚRecursiveCharacterTextSplitterN)ÚInMemoryDocstore)ÚFAISS)Ú PyPDFLoader)Ú
TextLoader)ÚDocx2txtLoader)Úuuid4)ÚDocument)Ú
2024-08-07 17:50:40 +01:00
TextExtractor)ÚGroq)Ú AudioSegment)Ú load_dotenvÚOPENAI_API_KEYÚ GROQ_API_KEY)Úapi_keyúwhisper-large-v3có>d}ddi}ddi}t|||¬¦«}|S)NzBAAI/bge-small-enÚdeviceÚcudaÚnormalize_embeddingsT)Ú
2024-08-09 16:33:21 +01:00
model_nameÚ model_kwargsÚ
encode_kwargsr)rrrÚ
embeddingss úWc:\Users\timmy_3aupohg\Downloads\Manaknight Projects\ds_aiindex\data_ingestion\utils.pyÚload_embedding_modelr s?Ø$€JؘfÐ%€LØ+¨TÐ2€MÝ%°LÐP]ðñô€Jð ÐócóL|dj}|dj}tddtd¬¦«}| |g¦«}g}t |¦«D]I\}}| ¦«}||d<t|j|¬¦«} | | ¦«ŒJ|S)Nréèé
F)Ú
chunk_sizeÚ
chunk_overlapÚlength_functionÚis_separator_regexÚpage©Ú page_contentÚmetadata) r)r*rÚlenÚcreate_documentsÚ enumerateÚcopyr Úappend)
2024-08-08 14:58:44 +01:00
ÚdocÚtextr*Ú
2024-08-09 16:33:21 +01:00
text_splitterÚdocsÚ documentsÚchunkÚ doc_metadataÚdocuments
rr,r,/Ø ˆqŒ6Ô €DØ1ŒvŒ€HÝØÝØ ð ñô€Mð × )¨4¨&Ñ 1€Dà€Iݘd‘O”Oð#‰ˆˆ5à—}’}‘ˆ Ø ˆ ݨÔ);ÀlÐØ×Ò˜Ñ Ðrcóš t|¦«}| ¦«}t|¦«}|S#td|¦«xYw©NzError loading -- )r Úloadr,Ú
2024-08-09 16:33:21 +01:00
ValueError)Ú
document_pathÚtxt_docr1r3s rÚload_txt_documentr?DsSð˜Ø|Š|‰~Œ~ˆå Ñ؈ øðÐ<¨]Ð=øøøó 36A
cóš t|¦«}| ¦«}t|¦«}|S#td|¦«xYwr:)r
r;r,r<)r=Údocx_docr1r3s rÚload_docx_documentrCOsSð! Ø}Š}‰Œˆå Ñ؈ øðÐ<¨]Ð=øøør@có| t|¦«}| ¦«}|S#td|¦«xYwr:)rÚload_and_splitr<)r=Úpdf_docÚpagess rÚload_pdf_documentrH[sJð˜,ˆØ×(ˆØˆ øðÐ<¨]Ð=øøøs$'§;cóþ| d¦«rt|¦«S| d¦«rt|¦«S| d¦«rt|¦«St d|¦«)Nz.pdfz.txtz.docxzUnsupported document type for )ÚendswithrHr?rCr<)r=s rÚ
load_documentrKesˆØ×Ò˜fÑKÝ  Ñ× Ò  Ñ 'Ô 'ðKÝ  Ñ× Ò  Ñ (Ô (ðKÝ! ÐI¸JrcóÈt|d¦«5}tj| ¦«¦« d¦«cddd¦«S#1swxYwYdS)rbzutf-8)ÚopenÚbase64Ú b64encodeÚreadÚdecode)Ú
image_pathÚ
image_files rÚ encode_imagerUqs™Ý ˆJ˜ÑÔðÝ Ô ˜JŸOšOÑ .× 5°gÑ ?øøøð?s9AÁAÁAcó(t|¦«}ddtdœ} dddddœd d
d |id œgd
œgddœ}tjd||¬¦«}| ¦«dddd}n#t
$r }d}Yd}~nd}~wwxYw|S)Nzapplication/jsonzBearer )z Content-TypeÚ
Authorizationz gpt-4o-miniÚuserr1uWhat’s in this image?)Útyper1Ú image_urlÚurlzdata:image/jpeg;base64,)rYrZ)ÚroleÚcontenti,)ÚmodelÚmessagesÚ
2024-08-09 16:33:21 +01:00
max_tokensz*https://api.openai.com/v1/chat/completions)ÚheadersÚjsonÚchoicesrÚmessager]ú$Image not good enough for processing)rUrÚrequestsÚpostrbÚ Exception)rSÚ base64_imageraÚpayloadÚresponseÚes rÚ
process_imagermvsõ  
Ñ+€Lð,¥7Ðð€Gð
%+Ø$=ððð
%0à %Ð'OÀÐ'OÐ'Oð*ððð ðððð"ð'
ð
ˆõ,”=Ð!MÐW^ÐelÐmˆà—=’=‘?”? 9Ô-¨aÔÔ;¸IÔˆøÝ ðˆˆˆˆˆøøøøð:øøøð €OsžAA9Á9
BÂB
Â
BcóX| d¦«d d¦«d}d|i}t¦«}| |¦«}d d|D¦«¦«}|dkrt |¦«}|dkrt ||¬ ¦«}|gSdS)
/éÿÿÿÿú.rÚfilenameÚc3óvK|]4}| ¦«s| ¦«s|dk¯0|VŒ5dS)ú
N)ÚisalnumÚisspace)Ú.0rls rú <genexpr>z(create_image_document.<locals>.<genexpr>ªs@èèРa§i¢i¡k¤kÐN°Q·Y²Y±[´[ÐNÀAÈÂIÀI1ÀIÀIÀIÀIÐNrrer()Úsplitr
Úread_text_from_imageÚjoinrmr )rSÚ
image_namer*Útext_extractorr1r0s rÚcreate_image_documentr¢à×! & *×Ñ5°aÔ8€Jà˜'€HÝ"_”_€NØ × .¨zÑ :€Dà
7Š7ÐN˜dÐ N€Dð ˆr‚z€zݘZÑð РD°8Ð<ˆàˆuˆ à ˆrcóÔt|d¦«5}tjj || ¦«fd¬¦«}ddd¦«n #1swxYwY|jS)NrMr)Úfiler^)rNÚclientÚaudioÚ translationsÚcreaterQr1)ÚfilepathrÚ translations rÚ
audio_to_textrˆ»Ý
ˆh˜Ñ Ô ð
 Ý”lÔ˜DŸIšI™KœKÐ
ô
ˆ ð
ð
ð
ñ
ô
ð
ð
ð
ð
ð
ð
øøøð
ð
ð
ð
ð
Ô Ðs;AÁAÁATcó |dzdz}tj|¦«}t|¦«}tj |¦« d¦«d}|d}tj |¦«stj|¦«g}||kr||z||zdkrdndz} t| ¦«D]r}
|
|z} t| |z|¦«} || | }
|d|d|
dzd }|
  |d
¬ ¦«|  |¦«|rtd |¦«ŒsnH|d|d
}|  |d
¬ ¦«|  |¦«|rtd |¦«||fS)Né<r!rqrÚ_chunkséroÚ_chunkz.mp3Úmp3)Úformatz
2024-08-08 14:58:44 +01:00
Exporting z _chunk1.mp3)rÚ from_filer+ÚosÚpathÚbasenamerzÚexistsÚmakedirsÚrangeÚminÚexportr/Úprint)Úaudio_file_pathÚchunk_duration_minutesÚ print_outputÚchunk_length_msrƒÚaudio_duration_msÚ
2024-08-09 16:33:21 +01:00
base_filenameÚ chunk_folderÚ chunk_pathsÚ
2024-08-08 14:58:44 +01:00
num_chunksr5Ústart_msÚend_msr6Úchunk_filenames rÚsplit_audio_by_durationr¦Äà,¨rÑ1°DÑ8€Oõ
Ô "  3€EݘE™
œ
2024-08-09 16:33:21 +01:00
Ðõ”G×$ _Ñ;¸CÑÔC€MØ,€LÝ
Œ7>Š>˜ 
Œ €Kà˜&¨/Ñ9ÐBSÐVeÑBeÐijÒBjÐBj¸Q¸QÐpqÑrˆ
å 5ˆ˜*ˆHݘ OÑ3Ð5FÑGˆFؘ( 6˜*ˆEØ ,ÐM¨}ÐMÀAÀaÁCÐMˆNØ LŠL˜°ˆLÑ × Ò ˜~Ñ ð
Ð3 4øðE¨=ÐØ
Š ^¨Eˆ Ñ×Ò˜>Ñ ð Ð/˜~Ð ˜Ð $rc óŽt||¦«\}}g}|D]}t|¦«}tj |¦«}t jd|¦«}|r8| d¦«} t| d¦«¦«}
n'tj  |¦«d} d}
|
dz
2024-08-09 16:33:21 +01:00
|z} |
|z} t| ttj
|¦«¦«dz¦«}
| | d| ddœ}t||¬ ¦«}| |¦«Œt!j|¦«|S)