# Firecrawl Crawl

Blocks for crawling multiple pages of a website using Firecrawl.

## Firecrawl Crawl

### What it is

Firecrawl crawls websites to extract comprehensive data while bypassing blockers.

### How it works

This block uses Firecrawl's API to crawl multiple pages of a website starting from a given URL. It follows links from page to page, handling JavaScript rendering and bypassing anti-bot measures to extract clean content from each one.

Cap the number of pages crawled with the `limit` parameter, choose one or more output formats (markdown, HTML, raw HTML, links, screenshots, and more), and optionally restrict extraction to each page's main content. The block also supports cached results with a configurable maximum age, plus a wait time that gives dynamic content a chance to load.
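
To make the flow concrete, here is a minimal sketch of a crawl against Firecrawl's v1 REST API directly. The endpoint and field names follow Firecrawl's public API documentation, but the block wires these values from its inputs internally, so treat this as an illustration rather than the block's exact implementation; the URL and parameter values are placeholders.

```python
# Sketch: start a Firecrawl crawl job and poll it to completion.
import os
import time

import requests

API_BASE = "https://api.firecrawl.dev/v1"
headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# Kick off a crawl. `limit` caps the number of pages; `scrapeOptions`
# carries the per-page settings described in the Inputs table below.
job = requests.post(
    f"{API_BASE}/crawl",
    headers=headers,
    json={
        "url": "https://docs.example.com",
        "limit": 10,
        "scrapeOptions": {
            "formats": ["markdown", "links"],
            "onlyMainContent": True,
            "waitFor": 200,     # ms to wait for dynamic content
            "maxAge": 3600000,  # accept cached pages up to 1 hour old
        },
    },
).json()

# Crawling is asynchronous: poll the job until it finishes.
while True:
    status = requests.get(f"{API_BASE}/crawl/{job['id']}", headers=headers).json()
    if status["status"] == "completed":
        break
    time.sleep(2)

# Each entry in `data` is one crawled page in the requested formats.
for page in status["data"]:
    print(page["metadata"]["sourceURL"], len(page.get("markdown", "")))
```

The block exposes the same knobs as typed inputs, so you normally never issue these HTTP calls yourself.
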

### Inputs

| Input | Description | Type | Required |
|-------|-------------|------|----------|
| url | The URL to crawl | str | Yes |
| limit | The number of pages to crawl | int | No |
| only_main_content | Only return the main content of the page, excluding headers, navigation, footers, etc. | bool | No |
| max_age | The maximum acceptable age of a cached page, in milliseconds (default: 1 hour) | int | No |
| wait_for | Delay in milliseconds before fetching the content, allowing the page sufficient time to load | int | No |
| formats | The output formats for the crawl | List["markdown" \| "html" \| "rawHtml" \| "links" \| "screenshot" \| "screenshot@fullPage" \| "json" \| "changeTracking"] | No |

### Outputs

| Output | Description | Type |
|--------|-------------|------|
| error | Error message if the crawl failed | str |
| data | The full result of the crawl, one entry per page | List[Dict[str, Any]] |
| markdown | The markdown content of a crawled page | str |
| html | The HTML content of a crawled page | str |
| raw_html | The raw HTML of a crawled page | str |
| links | The links discovered by the crawl | List[str] |
| screenshot | A screenshot of a crawled page | str |
| screenshot_full_page | A full-page screenshot of a crawled page | str |
| json_data | Structured JSON data extracted by the crawl | Dict[str, Any] |
| change_tracking | Change-tracking data for the crawl | Dict[str, Any] |
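
A hedged sketch of consuming these outputs downstream: `result` below is a hypothetical dict bundling the outputs listed above, since the exact object your graph passes between blocks may differ.

```python
# Sketch: collect markdown from a crawl result, surfacing failures.
def handle_crawl_result(result: dict) -> list[str]:
    if result.get("error"):
        raise RuntimeError(f"Crawl failed: {result['error']}")

    corpus = []
    # `data` holds one dict per crawled page; pick the format you requested.
    for page in result.get("data", []):
        if "markdown" in page:
            corpus.append(page["markdown"])
    return corpus
```
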

### Possible use case

**Documentation Indexing:** Crawl entire documentation sites to build searchable knowledge bases or training data.

**Competitor Research:** Extract content from competitor websites for market analysis and comparison.

**Content Archival:** Systematically archive website content for backup or compliance purposes.
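
For the documentation-indexing case above, the snippet below builds a toy keyword index from crawled pages. It is illustrative only; `pages` is assumed to have the shape of the `data` output (one dict per page with `markdown` content and metadata).

```python
# Sketch: map each token to the set of page URLs that contain it.
from collections import defaultdict

def build_index(pages: list[dict]) -> dict[str, set[str]]:
    index: dict[str, set[str]] = defaultdict(set)
    for page in pages:
        url = page.get("metadata", {}).get("sourceURL", "")
        for token in page.get("markdown", "").lower().split():
            index[token].add(url)
    return index

# Usage: build_index(result["data"])["webhook"] -> pages mentioning "webhook"
```
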