Pdf Ingestion pipeline completed

2024-08-05 22:14:19 +01:00
parent b0c3eb8032
commit c34de21971
15 changed files with 318 additions and 90 deletions
@@ -0,0 +1,16 @@
+---- 1. Load User Document
+    ----> Start with text documents: PDF, txt, and docx files.
+    ----> Data ingestion takes in the user's data, loads the embedding model, then creates a vector database from that data.
+    ----> Considerations: 
+            1. PDFs already have pages, so the text splitter won't be used. We want to be able to reference the page on which a search result is found.
+            2. The approach for other data types can be different. We can use a text splitter for txt files and, if possible, add page numbers to the resulting chunks for easy reference.
+            3. 
+        
+        Data Ingestion Module: 
+            This module will handle the data ingestion process.
+                uitls.py --> keep the reusable functions 
+                pdf_ingest.py --> This module will handle pdfs 
+
+        
+        Logging Module: 
+            This module will keep logs of what's going on here.
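
A minimal sketch of considerations 1 and 2 above, assuming the page text has already been extracted from the PDF by whatever reader the pipeline uses. The names `Chunk`, `chunks_from_pdf_pages`, `chunks_from_plain_text`, and the `chunk_size` parameter are hypothetical and do not come from this commit; embedding and the vector database are left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A piece of text plus the page it came from (hypothetical record type)."""
    text: str
    page: int
    source: str


def chunks_from_pdf_pages(page_texts, source):
    """Keep one chunk per PDF page, so no text splitter is needed.

    Page numbers are 1-based; blank pages are skipped without renumbering.
    """
    return [
        Chunk(text=text.strip(), page=number, source=source)
        for number, text in enumerate(page_texts, start=1)
        if text.strip()
    ]


def chunks_from_plain_text(text, source, chunk_size=1000):
    """Split plain text on whitespace and number the chunks as pseudo-pages.

    Each chunk holds at most chunk_size characters, except that a single
    word longer than chunk_size is kept whole.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        Chunk(text=chunk, page=number, source=source)
        for number, chunk in enumerate(chunks, start=1)
    ]
```

Both functions return the same record type, so a later embedding step could treat PDF pages and txt chunks uniformly and still report a page number with each search result.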