firecrawl

Author	SHA1	Message	Date
Nicolas	8a1c404918	Nick: revert trailing comma	2024-12-11 19:51:08 -03:00
Nicolas	00335e2ba9	Nick: fixed prettier	2024-12-11 19:46:11 -03:00
Gergő Móricz	8d467c8ca7	`WebScraper` refactor into `scrapeURL` (#714) * feat: use strictNullChecking * feat: switch logger to Winston * feat(scrapeURL): first batch * fix(scrapeURL): error swallow * fix(scrapeURL): add timeout to EngineResultsTracker * fix(scrapeURL): report unexpected error to sentry * chore: remove unused modules * feat(transfomers/coerce): warn when a format's response is missing * feat(scrapeURL): feature flag priorities, engine quality sorting, PDF and DOCX support * (add note) * feat(scrapeURL): wip readme * feat(scrapeURL): LLM extract * feat(scrapeURL): better warnings * fix(scrapeURL/engines/fire-engine;playwright): fix screenshot * feat(scrapeURL): add forceEngine internal option * feat(scrapeURL/engines): scrapingbee * feat(scrapeURL/transformars): uploadScreenshot * feat(scrapeURL): more intense tests * bunch of stuff * get rid of WebScraper (mostly) * adapt batch scrape * add staging deploy workflow * fix yaml * fix logger issues * fix v1 test schema * feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions * scrapeURL: v0 backwards compat * logger fixes * feat(scrapeurl): v0 returnOnlyUrls support * fix(scrapeURL/v0): URL leniency * fix(batch-scrape): ts non-nullable * fix(scrapeURL/fire-engine/chromecdp): fix wait action * fix(logger): remove error debug key * feat(requests.http): use dotenv expression * fix(scrapeURL/extractMetadata): extract custom metadata * fix crawl option conversion * feat(scrapeURL): Add retry logic to robustFetch * fix(scrapeURL): crawl stuff * fix(scrapeURL): LLM extract * fix(scrapeURL/v0): search fix * fix(tests/v0): grant larger response size to v0 crawl status * feat(scrapeURL): basic fetch engine * feat(scrapeURL): playwright engine * feat(scrapeURL): add url-specific parameters * Update readme and examples * added e2e tests for most parameters. Still a few actions, location and iframes to be done. * fixed type * Nick: * Update scrape.ts * Update index.ts * added actions and base64 check * Nick: skipTls feature flag? * 403 * todo * todo * fixes * yeet headers from url specific params * add warning when final engine has feature deficit * expose engine results tracker for ScrapeEvents implementation * ingest scrape events * fixed some tests * comment * Update index.test.ts * fixed rawHtml * Update index.test.ts * update comments * move geolocation to global f-e option, fix removeBase64Images * Nick: * trim url-specific params * Update index.ts --------- Co-authored-by: Eric Ciarla <ericciarla@yahoo.com> Co-authored-by: rafaelmmiller <8574157+rafaelmmiller@users.noreply.github.com> Co-authored-by: Nicolas <nicolascamara29@gmail.com>	2024-11-07 20:57:33 +01:00
rafaelsideguide	c1f98d0371	fixed developer.notion special case	2024-10-11 10:54:59 -03:00
rafaelsideguide	6208ecdbc0	added logger	2024-07-23 17:30:46 -03:00
rafaelsideguide	0175152577	Fixed PDF match custom scraping. Now it's working for both `https://getgc.ai/privacy` and `https://prairie.cards/products/wood-designs` use cases.	2024-07-02 11:25:17 -03:00
rafaelsideguide	5f69fc7677	Fixed the regex test	2024-06-25 18:24:01 -03:00
rafaelsideguide	e37d151404	added parsePDF option to pageOptions. Users can decide whether to let us handle the parsing or to parse the PDF themselves.	2024-06-12 15:06:47 -03:00
Nicolas	7cb14edec8	Nick:	2024-06-05 10:13:52 -07:00
Rafael Miller	9e000ded03	Merge branch 'main' into feat/better-gdrive-pdf-fetch	2024-06-05 14:07:56 -03:00
rafaelsideguide	ccc55127d6	Added scroll xpaths on fire-engine for handling readme docs	2024-06-05 11:48:41 -03:00
rafaelsideguide	b5045d1661	[feat] improved the scrape for gdrive pdfs	2024-06-04 17:47:28 -03:00
Nicolas	96257b7b17	Update handleCustomScraping.ts	2024-06-04 12:22:46 -07:00
Nicolas	674500affa	Nick:	2024-06-04 12:15:39 -07:00

14 Commits