Commit Graph

83 Commits

Author SHA1 Message Date
Nicolas d2de01d342 Nick: fixes 2024-07-18 13:19:44 -04:00
Nicolas f11137352c Merge branch 'main' into feat/fire-engine-chrome-cdp 2024-07-18 12:48:42 -04:00
Caleb Peffer c5d1e7260d Caleb: made changes per Rafaels requests 2024-07-17 11:29:05 -07:00
Caleb Peffer d39d3be649 Caleb: now extracting and returning a list of all links on the page for a customer 2024-07-16 18:38:03 -07:00
Thomas Kosmas 5c65ec58e5 Support chrome-cdp and restructure sitemap fire-engine support. 2024-07-15 18:40:43 +03:00
Nicolas 066d92f643 Update single_url.ts 2024-07-03 18:38:17 -03:00
Nicolas 90c54c32fd Nick: refactor 2024-07-03 18:01:17 -03:00
Nicolas 90cf799a3c Update single_url.ts 2024-07-03 17:56:21 -03:00
Nicolas b36406e465 Nick: log scrpaers 2024-07-03 17:28:53 -03:00
rafaelsideguide 7b7154ba1e bugfixed pageStatusCode 2024-07-02 10:51:35 -03:00
Nicolas 42cd58a679 Merge pull request #332 from mendableai/feat/rawHtmlExtraction 2024-07-01 18:23:26 -03:00
    Adds pageOptions.includeRawHtml and new extraction mode "llm-extraction-from-raw-html"
rafaelsideguide 16aac7f8c5 Update single_url.ts 2024-07-01 18:21:15 -03:00
Eric Ciarla 87b54488d3 update to includeRawHtml 2024-06-28 17:07:47 -04:00
Eric Ciarla 70fcf2ce03 init 2024-06-28 16:39:09 -04:00
Nicolas 9bf74bc774 Update single_url.ts 2024-06-28 15:51:18 -03:00
Nicolas 7e17498bcf Update single_url.ts 2024-06-28 15:45:16 -03:00
Nicolas e7be17db92 Nick: metadata fixes and lock duration for bull decreased to 2 hrs 2024-06-25 15:21:14 -03:00
rafaelsideguide 3ebdf93342 removed console.logs 2024-06-24 16:43:12 -03:00
rafaelsideguide 21d29de819 testing crawl with new.abb.com case 2024-06-24 16:25:07 -03:00
    many unnecessary console.logs for tracing the code execution
rafaelsideguide 9c539e9113 Fixed includeHTML to use cleanedHtml as response 2024-06-18 16:26:54 -03:00
rafaelsideguide 6c726a02eb Moved to utils/removeUnwantedElements, added unit tests 2024-06-18 09:46:42 -03:00
AndyMik90 8b3c3aae91 Added support for RegEx in removeTags 2024-06-18 07:31:46 +02:00
rafaelsideguide ad7795f973 Merge remote-tracking branch 'origin/main' into test/load-testing 2024-06-14 15:14:01 -03:00
Rafael Miller f9c7ca9388 Merge branch 'main' into feat/issue-266 2024-06-14 11:47:58 -03:00
Rafael Miller 3e2e76311c Merge branch 'main' into feat/issue-205 2024-06-14 11:25:20 -03:00
rafaelsideguide 5dd18ca79b fixed edge cases 2024-06-14 09:46:55 -03:00
rafaelsideguide bb859ae9a7 Added metadata.pageStatusCode and metadata.pageError properties to the responses 2024-06-13 17:08:40 -03:00
rafaelsideguide 676d6e8ab5 Added pageOptions.removeTags 2024-06-13 10:51:05 -03:00
rafaelsideguide e37d151404 added parsePDF option to pageOptions 2024-06-12 15:06:47 -03:00
    user can decide if they are going to let us take care of the parse or they are going to parse the pdf by themselves
Nicolas 7ae9778642 Update single_url.ts 2024-06-10 16:57:31 -07:00
Nicolas 913c1dd568 Nick: fetch -> axios and fix timeouts 2024-06-10 16:49:03 -07:00
rafaelsideguide 164676c70a bugfix screenshot for readme pages 2024-06-05 15:34:42 -03:00
rafaelsideguide 0d51b11dcd missing breaks 2024-06-05 15:02:28 -03:00
Rafael Miller 9e000ded03 Merge branch 'main' into feat/better-gdrive-pdf-fetch 2024-06-05 14:07:56 -03:00
rafaelsideguide ccc55127d6 Added scroll xpaths on fire-engine for handling readme docs 2024-06-05 11:48:41 -03:00
rafaelsideguide b5045d1661 [feat] improved the scrape for gdrive pdfs 2024-06-04 17:47:28 -03:00
Nicolas 674500affa Nick: 2024-06-04 12:15:39 -07:00
rafaelsideguide 5ae4d1caf5 Update single_url.ts 2024-06-04 15:28:09 -03:00
rafaelsideguide 64a4338ff0 Update single_url.ts 2024-06-04 14:40:05 -03:00
Rafael Miller b80fb374e5 Merge branch 'main' into playwright-service-bug-222 2024-06-04 11:57:17 -03:00
Nicolas 2ea01f1456 Update single_url.ts 2024-06-03 23:42:39 -07:00
Nicolas 854d5b3cb3 Update single_url.ts 2024-06-03 23:32:55 -07:00
Nicolas d30ced4394 Merge pull request #221 from mendableai/nsc/fwd-header-auth 2024-06-03 16:33:40 -07:00
    feat: Ability to forward headers to reliable providers for auth etc...
rafaelsideguide 1fc3a15149 Update single_url.ts 2024-06-03 15:24:40 -03:00
Nicolas fde522c3e1 Update single_url.ts 2024-06-02 20:23:45 -07:00
Matt Joyce deefe65cbe Change the way the playwright response is parsed 2024-06-01 19:16:56 +10:00
    Was failing with a Type Error, but actually looked ok.
    This fixes the type error, and stop scraper fallback.
Nicolas 3b8059edb6 Update single_url.ts 2024-05-31 15:43:06 -07:00
Nicolas 6bea803120 Nick: 2024-05-31 15:39:54 -07:00
Nicolas 6c939d534d Nick: small refactor 2024-05-29 19:43:51 -07:00
Eric Ciarla 37915e11e8 Final push 2024-05-29 21:18:24 -04:00