Commit Graph

187 Commits

Author SHA1 Message Date
Gergő Móricz 73e7884df4 fix(queue-worker/crawl): only report successful page count in num_docs (#1179) 2025-02-13 13:14:24 -03:00
Móricz Gergő 8a8d7d645f fix(concurrency): proper job timeouting 2025-01-31 11:22:10 +01:00
Gergő Móricz b005450a34 port most of cheerio stuff to rust (#1089) 2025-01-24 22:04:54 +01:00
Gergő Móricz 0d9c9f36b8 feat(queue-worker): add verbosity for lock extension 2025-01-24 19:35:25 +01:00
Nicolas 498558d358 Nick: formatting done 2025-01-22 18:47:44 -03:00
Nicolas 56f048aeff Reapply "Nick:" This reverts commit 4b4385c520. 2025-01-22 17:26:32 -03:00
Nicolas 4b4385c520 Revert "Nick:" This reverts commit 6718ce8908. 2025-01-22 17:26:09 -03:00
Nicolas 6718ce8908 Nick: 2025-01-22 17:25:48 -03:00
Nicolas 92b8d97be3 Nick: 2025-01-19 13:09:29 -03:00
Nicolas 513f61a2d1 Nick: map improvements 2025-01-19 12:33:44 -03:00
Gergő Móricz dbc6d07871 fix(queue-worker): bring done add to earlier 2025-01-17 17:46:29 +01:00
Gergő Móricz dcd3d6d98d fix(kickoff): mark as finished if it errors out 2025-01-17 17:11:19 +01:00
Gergő Móricz d5929af010 fix(queue-worker/kickoff): make crawls wait for kickoff to finish (matters on big sitemapped sites) 2025-01-17 16:04:01 +01:00
Móricz Gergő 4a947e385f fix(queue-worker): fill out time taken on failure too 2025-01-17 11:28:37 +01:00
Gergő Móricz 655753cd27 fix(url): allow domains with ports 2025-01-16 16:30:14 +01:00
Gergő Móricz cbe67d89a5 feat(queue-worker): proactive job cancel 2025-01-15 19:02:20 +01:00
Gergő Móricz ce2f6ff884 fix(queue-worker/billing): fix crawl overbilling 2025-01-15 17:22:52 +01:00
Nicolas 5e5b5ee0e2 (feat/extract) New re-ranker + multi entity extraction (#1061)
* agent that decides if splits schema or not

* split and merge properties done

* wip

* wip

* changes

* ch

* array merge working!

* comment

* wip

* dereferentiate schema

* dereference schemas

* Nick: new re-ranker

* Create llm-links.txt

* Nick: format

* Update extraction-service.ts

* wip: cooking schema mix and spread functions

* wip

* wip getting there!!!

* nick:

* moved functions to helpers

* nick:

* cant reproduce the error anymore

* error handling all scrapes failed

* fix

* Nick: added the sitemap index

* Update sitemap-index.ts

* Update map.ts

* deduplicate and merge arrays

* added error handler for object transformations

* Update url-processor.ts

* Nick:

* Nick: fixes

* Nick: big improvements to rerank of multi-entity

* Nick: working

* Update reranker.ts

* fixed transformations for nested objs

* fix merge nulls

* Nick: fixed error piping

* Update queue-worker.ts

* Update extraction-service.ts

* Nick: format

* Update queue-worker.ts

* Update pnpm-lock.yaml

* Update queue-worker.ts

---------

Co-authored-by: rafaelmmiller <150964962+rafaelsideguide@users.noreply.github.com>
Co-authored-by: Thomas Kosmas <thomas510111@gmail.com>
2025-01-13 22:30:15 -03:00
Nicolas f4d10c5031 Nick: formatting fixes 2025-01-10 18:35:10 -03:00
Gergő Móricz 29c1f126ab feat(scrape-status): adapt 2025-01-09 19:14:00 +01:00
Gergő Móricz 2849ce2f13 fix(queue-worker): errored job logging 2025-01-09 18:48:47 +01:00
Gergő Móricz 0da386914d fix(queue-worker): graceful shutdown 2025-01-09 16:04:59 +01:00
Móricz Gergő 49e584f8e1 fix(queue-worker/crawl): use SCARD to generate num_docs field 2025-01-09 09:51:34 +01:00
Nicolas f82a742cd1 Merge pull request #1044 from mendableai/nsc/extract-queue (feat/extract) Move extract to a queue system 2025-01-07 18:10:46 -03:00
Nicolas 11af214db1 Nick: update extract in case there is an error 2025-01-07 16:21:51 -03:00
Nicolas eb254547e5 Nick: 2025-01-07 16:16:01 -03:00
Gergő Móricz c6a63793bb crawl incomplete issues 2025-01-07 19:38:17 +01:00
Gergő Móricz ccfada98ca various queue fixes 2025-01-07 19:15:23 +01:00
Nicolas 86e34d7c6c Nick: wip 2025-01-07 12:13:12 -03:00
Móricz Gergő 7a03275575 add comment 2025-01-07 13:57:47 +01:00
Móricz Gergő 7d73ebdbf1 fix(crawl): never invalidate first crawl scrape if redirects 2025-01-07 13:57:23 +01:00
Móricz Gergő b96b97ed72 fix(crawl): don't push rawhtml to db unless requested 2025-01-07 10:09:15 +01:00
Nicolas bb27594443 Merge branch 'main' into nsc/extract-queue 2025-01-06 13:01:15 -03:00
Gergő Móricz b92a4eb79b fix(queue-worker): only do redirect handling logic on crawls, not batch scrape 2025-01-04 16:59:35 +01:00
Nicolas f2e0bfbfe3 Nick: url normalization 2025-01-03 23:54:03 -03:00
Nicolas c655c6859f Nick: fixed 2025-01-03 22:50:53 -03:00
Nicolas a4f7c38834 Nick: fixed 2025-01-03 22:15:23 -03:00
Nicolas 8df1c67961 Update queue-worker.ts 2025-01-03 21:48:28 -03:00
Nicolas 432b410678 Update queue-worker.ts 2025-01-03 21:26:05 -03:00
Nicolas 27457ed5db Nick: init 2025-01-03 20:44:27 -03:00
Nicolas bd81b41d5f Update queue-worker.ts 2024-12-30 21:43:59 -03:00
Gergő Móricz 9005757de3 fix(queue-worker): do not follow redirect URLs if they are not allowed by the crawl options 2024-12-30 14:41:31 +01:00
Gergő Móricz 0421f81020 Sitemap fixes (#1010)
* sitemap fixes iter 1

* feat(sitemap): dedupe improvements

---------

Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2024-12-27 19:59:26 +01:00
Móricz Gergő bd36c441d3 feat(queue-worker): improve team-based logging 2024-12-17 22:06:36 +01:00
Nicolas 3b6edef9fa chore: formatting 2024-12-17 16:58:57 -03:00
Gergő Móricz 30fa78cd9e feat(queue-worker): fix redirect slipping 2024-12-15 20:16:52 +01:00
Nicolas 4987880b32 Nick: random fixes 2024-12-15 02:52:06 -03:00
Nicolas 8a1c404918 Nick: revert trailing comma 2024-12-11 19:51:08 -03:00
Nicolas 00335e2ba9 Nick: fixed prettier 2024-12-11 19:46:11 -03:00
Gergő Móricz d9e017e5e2 feat(queue-worker/crawl): solidify redirect behaviour 2024-12-10 22:34:26 +01:00