Commit Graph

94 Commits

Author SHA1 Message Date
rafaelsideguide 3ebdf93342 removed console.logs 2024-06-24 16:43:12 -03:00
rafaelsideguide 21d29de819 testing crawl with new.abb.com case 2024-06-24 16:25:07 -03:00
    many unnecessary console.logs for tracing the code execution
Eric Ciarla 34e37c5671 Add unit tests to replace e2e 2024-06-15 16:43:37 -04:00
Eric Ciarla a6b7197737 Fix for maxDepth 2024-06-14 19:40:37 -04:00
Eric Ciarla 2c5f5c0ea2 Merge branch 'main' into feat/maxDepthRelative 2024-06-14 11:49:12 -04:00
Rafael Miller f9c7ca9388 Merge branch 'main' into feat/issue-266 2024-06-14 11:47:58 -03:00
Rafael Miller 3e2e76311c Merge branch 'main' into feat/issue-205 2024-06-14 11:25:20 -03:00
Eric Ciarla 59451754f5 Add tests 2024-06-14 10:14:07 -04:00
Eric Ciarla 71c98d8b80 Update logic 2024-06-13 18:00:52 -04:00
Eric Ciarla 095951aa4d Update test 2024-06-13 17:40:00 -04:00
Eric Ciarla 5e8aa92788 Update index.ts 2024-06-13 17:33:13 -04:00
Eric Ciarla 65d63bae45 Update index.ts 2024-06-13 17:17:44 -04:00
Eric Ciarla 32e814bedc Update index.ts 2024-06-13 17:02:30 -04:00
rafaelsideguide bb859ae9a7 Added metadata.pageStatusCode and metadata.pageError properties to the responses 2024-06-13 17:08:40 -03:00
rafaelsideguide 676d6e8ab5 Added pageOptions.removeTags 2024-06-13 10:51:05 -03:00
rafaelsideguide e37d151404 added parsePDF option to pageOptions 2024-06-12 15:06:47 -03:00
    user can decide if they are going to let us take care of the parse or they are going to parse the pdf by themselves
rafaelsideguide dc6acbf1f0 Merge remote-tracking branch 'origin/main' into feat/allowbackwardcrawling-option 2024-06-12 11:01:05 -03:00
Nicolas 520739c9f4 Nick: fixed bugs associated with absolute path replacements 2024-06-11 12:43:16 -07:00
rafaelsideguide ee282c3d55 Added allowBackwardCrawling option 2024-06-11 15:24:39 -03:00
Nicolas f6b06ac27a Nick: ignoreSitemap, better crawling algo 2024-06-10 18:12:41 -07:00
Nicolas 3091f0134c Nick: 2024-06-10 16:27:10 -07:00
Nicolas b4c6819a54 Nick: 2024-06-05 11:11:09 -07:00
Rafael Miller 02fe470e20 Merge pull request #148 from mendableai/nsc/improvemnts-fixes-misc 2024-06-04 14:31:10 -03:00
    Better fallbacks for initial crawl start
rafaelsideguide 6920ec8a61 bugfixing. already on main 2024-06-04 11:05:50 -03:00
Nicolas 918059ee9e Merge branch 'main' into nsc/improvemnts-fixes-misc 2024-06-03 16:46:02 -07:00
Nicolas df6c3d1e7d Merge branch 'main' into detect-pdfs 2024-05-17 09:55:51 -07:00
Nicolas 9d635cb2a3 Nick: docx support 2024-05-16 11:48:02 -07:00
Nicolas 098db17913 Update index.ts 2024-05-15 17:37:09 -07:00
Nicolas 6ca368327f Merge branch 'main' into test/crawl-options 2024-05-15 17:18:25 -07:00
Nicolas ade4e05cff Nick: working 2024-05-15 17:13:04 -07:00
Nicolas bfccaf670d Nick: fixes most of it 2024-05-15 15:30:37 -07:00
rafaelsideguide d91043376c not working yet 2024-05-15 18:54:40 -03:00
rafaelsideguide fa014defc7 Fixing child links only bug 2024-05-15 18:35:09 -03:00
Nicolas 2ba743fb1a Merge pull request #27 from eltociear/patch-1 2024-05-15 13:28:38 -07:00
    refactor: fix typo in WebScraper/index.ts
Nicolas 1b0d6341d3 Update index.ts 2024-05-15 11:48:12 -07:00
Nicolas d10f81e7fe Nick: fixes 2024-05-15 11:28:20 -07:00
Nicolas 87570bdfa1 Update index.ts 2024-05-15 11:06:03 -07:00
Ikko Eltociear Ashimine e91c122c69 Merge branch 'main' into patch-1 2024-05-15 12:14:52 +09:00
Nicolas a0fdc6f7c6 Nick: 2024-05-14 12:12:40 -07:00
Nicolas 7f31959be7 Nick: 2024-05-14 12:04:36 -07:00
Nicolas 8a72cf556b Nick: 2024-05-13 21:10:58 -07:00
Nicolas 26a092f780 Update index.ts 2024-05-13 21:04:49 -07:00
Nicolas 8101cbee37 Update index.ts 2024-05-13 21:02:47 -07:00
Nicolas 86b8439844 Nick: 2024-05-13 20:51:42 -07:00
Nicolas a96fc5b96d Nick: 4x speed 2024-05-13 20:45:11 -07:00
rafaelsideguide 8eb2e95f19 Cleaned up 2024-05-13 16:13:10 -03:00
Nicolas 2ce045912f Nick: disable vision right now 2024-05-13 10:56:08 -07:00
rafaelsideguide f4348024c6 Added check during scraping to deal with pdfs 2024-05-13 09:13:42 -03:00
    Checks if the URL is a PDF during the scraping process (single_url.ts).
    TODO: Run integration tests - Does this strat affect the running time?
    ps. Some comments need to be removed if we decide to proceed with this strategy.
Rafael Miller 5a2712fa5a Merge branch 'main' into detect-pdfs 2024-05-10 15:53:13 -03:00
Nicolas dcedb8d798 Merge branch 'main' into feat/max-depth 2024-05-07 10:20:49 -07:00