Pdf Ingestion pipeline completed

timothyafolami
2024-08-05 22:14:19 +01:00
parent b0c3eb8032
commit c34de21971
15 changed files with 318 additions and 90 deletions
@@ -0,0 +1,16 @@
---- 1. Load User Document
----> Starting with text documents, such as PDF, txt, and docx files.
----> Data ingestion takes in the user data, loads the embedding model, then creates a vector database from that data.
----> Considerations:
1. PDFs already have pages, so a text splitter won't be used. We want to be able to reference the pages on which a retrieved document can be found.
2. The approach for other data types can be different. We can use a text splitter for txt files and, if possible, add page numbers to the resulting chunks for easy reference.
3.
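The two considerations above can be sketched as follows. This is a minimal illustration, not the code from this commit: the `Chunk` class, both function names, and the `chunk_size` default are hypothetical, and the PDF page texts are assumed to have already been extracted by whatever PDF library the pipeline uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    """A piece of text plus the metadata needed to cite it later."""
    text: str
    source: str
    page: int  # 1-based page number (real for PDFs, synthetic for txt)


def chunks_from_pdf_pages(pages: list[str], source: str) -> list[Chunk]:
    """Consideration 1: one chunk per PDF page, no text splitter."""
    return [
        Chunk(text=text.strip(), source=source, page=number)
        for number, text in enumerate(pages, start=1)
        if text.strip()  # skip blank pages but keep the original numbering
    ]


def chunks_from_text(text: str, source: str, chunk_size: int = 500) -> list[Chunk]:
    """Consideration 2: fixed-size splitting, with a pseudo page per chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        Chunk(text=piece, source=source, page=number)
        for number, piece in enumerate(pieces, start=1)
    ]
```

Keeping the page number on every chunk means a search hit can always be reported as "source, page N", whether the page is a real PDF page or a synthetic one assigned during splitting.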
Data Ingestion Module:
This module will handle the data ingestion process.
utils.py --> keeps the reusable functions
pdf_ingest.py --> This module will handle pdfs
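As a rough sketch of how this split might look, the block below puts a reusable `embed` helper next to a PDF-specific `build_index` and `search`. All three names are hypothetical. The word-count "embedding" and the in-memory list are stand-ins chosen so the example runs without dependencies; they are not the embedding model or vector database used in this commit.

```python
from __future__ import annotations

import math
from collections import Counter

# --- reusable helper (the kind of function the utilities file would hold) ---

def embed(text: str) -> dict[str, float]:
    """Stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


# --- PDF-specific logic (the kind of function pdf_ingest.py would hold) ---

def build_index(pages: list[str]) -> list[tuple[int, str, dict[str, float]]]:
    """Create one (page number, text, vector) entry per PDF page."""
    return [(n, text, embed(text)) for n, text in enumerate(pages, start=1)]


def search(index, query: str, top_k: int = 1) -> list[tuple[int, str]]:
    """Return the (page number, text) pairs most similar to the query."""
    q = embed(query)

    def score(entry) -> float:
        # Dot product of two normalised vectors, i.e. cosine similarity.
        return sum(weight * entry[2].get(word, 0.0) for word, weight in q.items())

    ranked = sorted(index, key=score, reverse=True)
    return [(n, text) for n, text, _ in ranked[:top_k]]
```

For example, `search(build_index(["cats purr softly", "engines burn fuel"]), "fuel engines")` returns the entry for page 2, so the caller can point the user to that page.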
Logging Module:
This module will keep logs of what the pipeline is doing.
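One possible shape for such a module, using only Python's standard `logging` library. The `get_logger` name, the log format, and the logger and file names in the usage lines are assumptions made for illustration.

```python
import logging
from typing import Optional


def get_logger(name: str, log_file: Optional[str] = None) -> logging.Logger:
    """Return a logger that writes to the console and, optionally, a file."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured; avoid duplicate handlers
        return logger

    logger.setLevel(logging.INFO)
    formatter = logging.Formatter(
        "%(asctime)s | %(name)s | %(levelname)s | %(message)s"
    )

    handlers = [logging.StreamHandler()]
    if log_file is not None:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))

    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Hypothetical usage from an ingestion module:
logger = get_logger("pdf_ingest")
logger.info("Loaded %d pages from %s", 12, "example.pdf")
```

Checking `logger.handlers` before configuring keeps repeated imports from attaching the same handler twice, which would otherwise duplicate every log line.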