ds_citationpro/__pycache__/utils.cpython-311.pyc
260 lines
22 KiB
2024-11-23 20:44:04 +01:00
§
²ö@g]Lãó4ddlZddlZddlZddlmZddlmZddlmZddl m
Z
ddlmZddl m Z ddl
mZmZddlmZdd l mZdd
lmZddl m
Z
ddlZdd lmZdd lmZdd
lmZddlmZmZddlZddlmZddl Z e
¦«ej!d¦«ej"d<e d¬¦«Z#ed¬¦«Z$de%defdZ&d/de%dede'fdZ(e&de#¦«Z)e¦«Z*d0dZ+d0dZ,dZ-d „Z.d!Z/d"e%fd#„Z0d$„Z1d%„Z2d&„Z3d1d(„Z4d)„Z5d*e%d+e%d,e6de7fd-„Z8e)d'fd.„Z9dS)2éN)ÚDocument)Úconvert)ÚOpenAI)Ú load_dotenv)ÚOpenAIEmbeddings)ÚStrOutputParserÚJsonOutputParser)Úlogger)Ú
ChatOpenAI)ÚPromptTemplate)Úuuid4)ÚInMemoryDocstore)ÚFAISS)ÚThreadPoolExecutorÚ as_completedÚOPENAI_API_KEYztext-embedding-3-large)Úmodelúgpt-4oÚ file_pathÚreturncó0tj||d¬¦«S)a
Load a vector store from a local file.
Args:
- file_path (str): Path to the file where the vector store is saved.
- embeddings: The embedding function to use for loading the vector store.
Returns:
- FAISS: The loaded vector store.
T)Úallow_dangerous_deserialization)rÚ
load_local)rÚ
embeddingss úLc:\Users\timmy_3aupohg\Downloads\Manaknight Projects\ds_citationpro\utils.pyÚload_vector_storersõ Ô ˜I zÐSWÐ éÚqueryÚ vector_storeÚtop_kcóH| ||¬¦«}d|D¦«S)a}
Perform a similarity search in the vector store.
Args:
- query (str): The query string to search for.
- vector_store (FAISS): The vector store to perform the search on.
- top_k (int): The number of top similar documents to return.
Returns:
- List of tuples containing page_number and page_content of documents that are most similar to the query.
)Úkcó6g|]}|jd|jfŒS)Ú page_number)ÚmetadataÚ page_content©Ú.0Údocs rú
<listcomp>z,search_similar_documents.<locals>.<listcomp>8s'Ð ˆSŒ\˜-Ô
(¨#Ô*:Ð Or)Úsimilarity_search)rr r!Úresultss rÚsearch_similar_documentsr.+s0ð×,¨U°eÐ<€GØ OÀwÐ OrÚ APA_indexÚ
output_imagescóØtj |¦«stj|¦«t j|¦«}g}t
t|¦«¦«D]o}||}| ¦«}tj  |d|dzd¦«}| 
|¦«|  |¦«Œp|  ¦«|S)z×
Convert a PDF file to images using PyMuPDF.
Args:
- pdf_path (str): Path to the PDF file.
- output_folder (str): Folder to save the output images.
Returns:
- List of image file paths.
Úpage_éú.png)
ÚosÚpathÚexistsÚmakedirsÚfitzÚopenÚrangeÚlenÚ
get_pixmapÚjoinÚsaveÚappendÚclose)Úpdf_pathÚ
output_folderÚ pdf_documentÚ image_pathsr%ÚpageÚpixÚ
image_paths rÚ
pdf_to_imagesrI@õ Œ7>Š>˜ 
Œ ”9˜&€LØ€KõS Ñ'ˆ ؘØoŠoÑԈݔW—\’\ -Ð1N¸Àq¹Ð1NÐ1NÐ1NÑOˆ
Ø ŠÑÔÐØ×Ò˜×ÒÑÔÐØ Ðrcóütj |¦«stj|¦«tj |¦«ddz}t ||¦«t
||¦«}|S)aS
Convert a DOCX file to images using an intermediate PDF conversion and PyMuPDF for rendering.
Args:
- docx_path (str): Path to the DOCX file.
- pdf_to_images_func (function): Function to convert PDF to images.
- output_folder (str): Folder to save the output images.
Returns:
- List of image file paths.
rú.pdf)r5r6r7r8ÚsplitextrrI)Ú docx_pathrCrBrEs rÚdocx_to_imagesrN^srõ Œ7>Š>˜ 
Œ Œw×Ò  Ñ*¨1ÔÑ6€HÝ ˆI Ô Ð õ  ¨-Ñ8€Kà Ðrcóâtj |¦«d ¦«}|dkrt |¦«S|dkrt |¦«St
d¦«)a
Convert a PDF or DOCX file to images.
Args:
- file_path (str): Path to the document file (PDF or DOCX).
- output_folder (str): Folder to save the output images.
- dpi (int): Resolution of the output images.
Returns:
- List of image file paths.
r3rKz.docxz;Unsupported file format. Please provide a PDF or DOCX file.)r5r6rLÚlowerrIrNÚ
ValueError)rÚfile_extensions rÚdocument_to_imagesrSwskõ”W×% iÑÔ;€NؘÒÐݘ˜7Ò "Ð "ݘÐWrcóg}d}tj|¦«D]ï}| ¦« |¦«rÆtj ||¦«} t
|d¦«5}tj|  ¦«¦« 
d¦«}|  ||f¦«ddd¦«n #1swxYwYŒÃ#t$r}td|d|¦«Yd}~Œçd}~wwxYwŒð|S)a
Convert all images in the specified directory to Base64-encoded strings.
Args:
- directory_path (str): Path to the directory containing image files.
Returns:
- List of tuples containing the image filename and its Base64-encoded string.
r4Úrbzutf-8NzError processing file z: )r5ÚlistdirrPÚendswithr6r>r:Úbase64Ú b64encodeÚreadÚdecoder@Ú ExceptionÚprint)Údirectory_pathÚ
base64_imagesÚsupported_extensionsÚfilenamerÚ
image_fileÚencoded_stringÚes rÚimages_to_base64reŒsð€Mðõ”J˜~Ñ
@ð
@ˆà >Š>Ñ Ô × $Ð%9Ñ  @ÝœŸ š  ^°XÑ>ˆ
@ݘ) E¨jå%+Ô%5°j·o²oÑ6GÔ6GÑ%HÔ%H×%OÒ%OÐPWÑ%XÔ%X!×(¨(°NÐ)CÑEðEðEñEôEðEðEðEðEðEðEøøøðEðEðEðEøøõð
@ð
@ð
@ÝÐ>¨xÐ>¸?øøøøð
@øøøð @ð Ðs=Á#CÁ3ACÃ CÃC ÃCÃC ÃCÃ
DÃ&DÄDas
You are an APA Compliance and Document Review Agent, highly specialized in ensuring strict adherence to APA guidelines as defined in the "APA Publication Manual, 7th Edition" by the American Psychological Association.
Your task is to analyze the provided text or document images to identify and correct errors in the following areas:
1. **Grammatical Errors:** Identify grammatical issues, focusing on APA-specific grammar requirements (e.g., third-person writing, formal tone, active voice).
2. **Document Structure Errors:** Ensure the document adheres to APA formatting requirements, including title page layout, abstract structure, headings, and reference list organization.
3. **Referencing Errors:** Detect and correct issues with references, such as missing references, improper formatting, or inconsistencies in style.
4. **Citation Errors:** Identify problems with in-text citations, such as missing elements, improper punctuation, or placement errors.
For each page/image, provide a detailed analysis and return a structured dictionary with the following format:
{
"Page/Image": <Page number or Image identifier>,
"Errors": [
{
"Line Number(s)": <Line number(s) where the error occurs>,
"Error Text": "<Exact text of the flawed element>",
"Description of the Error": "<Detailed explanation of why it is incorrect, referencing specific pages and sections from the APA 7th Edition Manual>",
"Suggested Correction": "<The correct or improved version of the text>"
},
...
],
"Summary": "<A brief summary stating whether the page/image meets APA standards or the total number of errors detected.>"
}
**Additional Instructions:**
1. For grammar, include both generic grammatical errors and APA-specific grammar violations. Cite the relevant page and section for APA grammar standards.
2. For document structure, verify that all APA-required sections are present and correctly formatted. Reference the relevant section (e.g., "Running Head: APA 7th Edition, p. 37").
3. For citations and references, explicitly state the page number and section of the "APA Publication Manual, 7th Edition" that supports your findings.
4. Be strict and exhaustive in your evaluation, ensuring no potential flaws are overlooked.
5. Include specific and detailed descriptions that allow the user to locate the correction in the APA manual.
6. If no errors are found on a page/image, include a summary confirming adherence to APA standards and set "Errors" to an empty list.
Your response must follow this structured format and be concise, well-structured, and formatted in JSON for easy parsing.
Úimage_directoryc
ópt|¦«}g}t|¦«D]\}\}}|dz}tjj dddt dœddd|id œgd
œg¬ ¦«}|jd jj } |jd jj d
d}tj |¦«}n7#ttj
f$r} td| ¦«i}Yd} ~ nd} ~ wwxYwd|| dg¦«| dd¦«dœ}
| |
¦«Œ|S)
Evaluate images for APA citation errors.
Args:
- image_directory (str): Path to the directory containing image files.
Returns:
- List of dictionaries containing page/image identifiers and citation error details.
r3rÚuserÚtext)ÚtyperiÚ image_urlÚurlzdata:image/jpeg;base64,)rjrk)ÚroleÚcontent)rÚmessagesrééýÿÿÿz#Error processing response content: NzImage ÚErrorsÚSummaryzNo errors found.)ú
Page/Imagerrrs)reÚ enumerateÚclientÚchatÚ completionsÚcreateÚpromptÚchoicesÚmessagernÚjsonÚloadsÚ
IndexErrorÚJSONDecodeErrorr]Úgetr@) rfr_r-Úindexrarcr%ÚresponseÚresponse_contentrdÚresults rÚevaluate_images_for_citationsr†Ðõ% _Ñ5€Mà€Gå-6°}Ñ-EÔ-Eð&ñ&Ñ)ˆÑ)˜.ؘa‘iˆ Ý”;Ôð%+Ý$*ððð
%0à %Ð'QÀÐ'QÐ'Qð*ððð ðððð
ô
ˆð*+¨AÔðÔBÀ1ÀRÀ4Ô Ý#œzÐ*:Ñ Ð øÝ Ð;¸Ð  Ð Ð Ð Ð Ð øøøøð "øøøð 1 &×*¨8°RÑ'×+¨IÐ7IÑ
ð
ˆð ŠÔÐÑà €NsÂ3B7Â7C+Ã
C&Ã&C+có¢g}|D]I}| dg¦«}|D].}| d¦«}|r| |¦«Œ/ŒJ|S)a
Extracts 'Description of the Error' from the list of documents.
Parameters:
documents (list): A list of dictionaries containing page information and errors.
Returns:
list: A list of all 'Description of the Error' values in the order they appear.
rrzDescription of the Error©rr@)Ú documentsÚ descriptionsÚdocumentÚerrorsÚerrorÚ descriptions rÚextract_error_descriptionsr
sxð€Làðà˜h¨Ñðð 1ˆŸ)š)Ð$>Ñ?ˆð
×# 0øð
Ðrcó¢g}|D]I}| dg¦«}|D].}| d¦«}|r| |¦«Œ/ŒJ|S)
Extracts 'Error Text' from the list of documents.
Parameters:
documents (list): A list of dictionaries containing page information and errors.
Returns:
list: A list of all 'Error Text' values in the order they appear.
rrú
Error Textrˆ)r‰Ú error_textsrrÚ
error_texts rÚextract_error_textsr”$swð€Kàðà˜h¨Ñðð /ˆŸš 0ˆJØð
×" .øð
Ðrcóøg}|D]t}| dd¦«}| dg¦«D]E}|| dd¦«| dd¦«dœ}| |¦«ŒFŒu|S) a¢
Extracts individual errors from the document errors list and converts
them into a flat list of dictionaries with essential details only.
Args:
- doc_errors (list): A list of dictionaries representing document pages with errors.
Returns:
- list: A flat list of dictionaries, each representing an individual error
with minimal details (Doc Page, Line Number(s), Error Text).
rtz Unknown PagerrúLine Number(s)z Unknown Linerz
No Error Text)zDoc Pagerrrˆ)Ú
doc_errorsÚ flat_errorsÚ page_datarFrÚ
flat_errors rÚextract_errors_minimalr=ð€Kàð

+ˆ à}Š}˜\¨>Ñ:ˆð—]] 8¨RÑ +ˆ!Ø"'§)¢)Ð,<¸nÑ"MÔ"MØ#Ÿiši¨ °oÑðˆJð
× Ò ˜zÑ  Ðré
cóˆ‡ —tjd¦« ˆˆfdŠ g}t¦«5Šˆˆ fd|D¦«}t|¦«D]Z} | ¦«}| |¦«Œ-#t $r!}tjd|¦«Yd}~ŒSd}~wwxYw ddd¦«n #1swxYwY|S)NzGetting Similar Documentscó>t|¦«}d|D¦«S)Ncó.g|]}|d|dfŒS)rr3©r(s rr+zEget_similar_documents.<locals>.fetch_similar_docs.<locals>.<listcomp>ns%Ð9 SQ”˜˜QœÐ Ð9r)r.)Ú similar_docsr#Úvec_dbs €€rÚfetch_similar_docsz1get_similar_documents.<locals>.fetch_similar_docsks(ø€å ¸VÀQÑ Ø9¨LÐ9rcó>i|]} |¦«|ŒSr )Úsubmit)r)ÚdescÚexecutorr£s €€rú
<dictcomp>z)get_similar_documents.<locals>.<dictcomp>rs+ø€Ð jÐ jÐ jÐUY §¢Ð1CÀTÑ!JÔ!JÈDÐ jÐ jÐ jrzError processing description: )r
Úinforrr…r@r\r)
r#Úsimilar_documents_contentÚfuture_to_descriptionÚfutureÚ page_infords
`` @@rÚget_similar_documentsr®^skøøøø€Ý
„KÐ
ð
!#ÐÝ Ñ Ô ðC Ø jÐ jÐ jÐ jÐ jÐ]iÐ jÑ jÔ jÐå"Ð#8Ñ Cð Cˆ
CØ"ŸMšM™OœO Ø)×Ñ;øÝð
Cð
Cð
CÝ ÐA¸aÐBøøøøð
Cøøøð  CðCðCðCñCôCðCðCðCðCðCðCøøøðCðCðCðCð $s;°!B7Á)A<Á;B7Á<
B'ÂB"ÂB7Â"B'Â'B7Â7B;Â>B;có tj |¦«r)tj|¦«t d|d¦«dSt d|d¦«dS#t $r}t d|¦«Yd}~dSd}~wwxYw)
Deletes a folder and all its contents.
Args:
- folder_path (str): The path to the folder to be deleted.
Returns:
- bool: True if the folder was successfully deleted, False otherwise.
zFolder 'z' deleted successfully.Tz' does not exist.Fz-An error occurred while deleting the folder: N)r5r6r7ÚshutilÚrmtreer]r\)Ú folder_pathrds rÚ
delete_folderr³~ð
Ý
Œ7>Š>˜  Ý ŒM˜ ÐA˜  Ð;˜ 5øÝ ðððÝ
ÐA¸aЈuˆuˆuˆuˆuøøøøðøøøs‚AAÁ
AÁ
BÁ)BÂBrÚerror_descriptionÚreference_pagescó¸tjd¦«tdgd¢¬¦«}|tzt ¦«z}| |||dœ¦«}|S)NzIdentifying Reference Pages a <|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an Advanced Document Analysis AI Agent. You are very good at understanding books and identifying the right one.
You are assigned the task of selecting the right reference book for a grammatical or citation error that occurred in a document.
You are provided with the following information:
1. Identified error.
2. The description of the error.
3. A list of tuples of the reference document's page and content.
Note: The reference document is a textbook on APA citation: the "APA Publication Manual, 7th Edition".
Your task is to do the following:
1. Understand the provided information.
2. Identify the right document: the one that is the proper reference for the error and the error description.
3. You might want to reference multiple contents in the shared documents, but it must be the same page.
4. Identify the page and the reference statement (might be a combo of multiple statements).
5. Generate a correction explanation to that error based on the reference document.
6. Generate the corrected version of the error.
So after that you want to prepare a JSON that has the following details:
1. reference_page: The identified page as seen above.
2. content: The content that speaks about the error and how to fix it, as seen on the identified page in the reference document.
3. correction_explanation: The explanation of the correction for the error, as mentioned on the referenced page.
4. correction: The correct version of the error.
Verify the following output:
1. reference_page.
2. content.
3. correction_explanation.
4. correction.
Please make sure they are there.
It should always come out this way.
Lastly, the JSON structure is very important.
<|eot_id|><|start_header_id|>user<|end_header_id|>
Error: {error}
Error_Description: {error_description}
Reference_Pages: {reference_pages}