Pdf Ingestion pipeline completed
@@ -0,0 +1,16 @@
---- 1. Load User Document
----> Start with text documents such as PDF, TXT, and DOCX files.
----> Data ingestion takes in the user's data, loads the embedding model, and then builds a vector database from the embedded documents.
----> Considerations:
1. PDFs already have pages, so no text splitter is used for them. We want search results to reference the page on which the matching text is found.
2. The approach for other data types can differ. For TXT files we can use a text splitter and, if possible, attach page numbers to the resulting chunks for easy reference.
3.
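
Considerations 1 and 2 can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this commit: the names `Chunk`, `chunks_from_pdf_pages`, and `chunks_from_text` are hypothetical, the PDF page texts are assumed to have been extracted already by some PDF library, and the TXT splitter is a plain fixed-size character splitter whose chunk index stands in for a page number.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int    # 1-based page number (a pseudo-page for TXT files)
    source: str  # name of the file the chunk came from


def chunks_from_pdf_pages(page_texts, source):
    """One chunk per PDF page. Empty pages are skipped, but the
    original page numbering is preserved for later reference."""
    return [
        Chunk(text=text.strip(), page=number, source=source)
        for number, text in enumerate(page_texts, start=1)
        if text.strip()
    ]


def chunks_from_text(text, source, chunk_size=1000, overlap=100):
    """Fixed-size character splitter (an assumption, not the project's
    splitter). Each chunk gets a running pseudo-page number."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start, page = 0, 1
    while start < len(text):
        chunks.append(Chunk(text[start:start + chunk_size], page, source))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
        page += 1
    return chunks
```

With this shape, a search hit can report `source` and `page` for both PDFs and TXT files, even though only the PDF page numbers correspond to real pages.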
Data Ingestion Module:
This module will handle the data ingestion process.
utils.py --> keeps the reusable functions
pdf_ingest.py --> handles PDF ingestion
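
A rough sketch of how the ingestion flow described in these notes (embed the page text, then store it in a vector database with its page number) might fit together. Everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, `InMemoryVectorStore` is a toy replacement for a real vector database, and `ingest_pages` is a hypothetical entry point, not the actual contents of `pdf_ingest.py`.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in for a real embedding model: sparse bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Toy vector store that keeps source and page next to each vector."""

    def __init__(self):
        self._rows = []  # tuples of (vector, text, source, page)

    def add(self, text, source, page):
        self._rows.append((embed(text), text, source, page))

    def search(self, query, k=3):
        """Return up to k (score, source, page, text) tuples, best first."""
        query_vector = embed(query)
        scored = [
            (cosine(query_vector, vector), source, page, text)
            for vector, text, source, page in self._rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


def ingest_pages(store, page_texts, source):
    """Hypothetical ingestion step: one record per non-empty page."""
    for page, text in enumerate(page_texts, start=1):
        if text.strip():
            store.add(text, source, page)
```

Because each record carries `source` and `page`, a search result can point the user back to the page where the text was found.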
Logging Module:
This module will keep logs of what the ingestion pipeline is doing.
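
One possible shape for such a module, using only Python's standard `logging` package. The function name `get_logger`, the logger name `"ingestion"`, and the log format are assumptions for illustration, not the project's actual setup.

```python
import logging


def get_logger(name="ingestion", level=logging.INFO):
    """Return a named logger with a single stream handler attached."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


# Example usage with made-up values.
logger = get_logger()
logger.info("Loaded %d pages from %s", 12, "example.pdf")
```

Other modules could then call `get_logger()` instead of configuring logging themselves, which keeps the format consistent across the pipeline.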