Crawlee is a web scraping and browser automation library
It helps you build reliable crawlers. Fast.
pipx run crawlee create my-crawler
Reliable crawling 🏗️
Crawlee won't fix broken selectors for you (yet), but it helps you build and maintain your crawlers faster.
When a website adds JavaScript rendering, you don't have to rewrite everything, only switch to a browser crawler. When you later find a great API to speed up your crawls, flip the switch back.
Crawlee is built by people who scrape for a living and use it every day to scrape millions of pages. Meet our community on Discord.
Python with type hints
Crawlee for Python is written in a modern way using type hints, providing code completion in your IDE and helping you catch bugs early, at build time.
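For example, a static type checker such as mypy flags type mismatches before your crawler ever runs. The snippet below is a generic illustration of that idea, not part of Crawlee's API:

```python
def parse_price(raw: str) -> float:
    """Convert a scraped price string like '$3.50' to a number."""
    return float(raw.strip('$'))

price = parse_price('$3.50')  # OK: 3.5
# parse_price(42)  # mypy: Argument 1 has incompatible type "int"; expected "str"
```

The same annotations also power IDE completion: with a typed crawling context, your editor can suggest attributes like `context.request.url` as you type.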
Headless browsers
Switch your crawlers from HTTP to a headless browser in 3 lines of code. Crawlee builds on top of Playwright and adds its own features. Chrome, Firefox and more.
Automatic scaling and proxy management
Crawlee automatically manages concurrency based on available system resources and smartly rotates proxies. Proxies that often time out, return network errors or bad HTTP codes like 401 or 403 are discarded.
Try Crawlee out 👾
The fastest way to try Crawlee out is to use the Crawlee CLI and choose one of the provided templates. The CLI will prepare a new project for you, and add boilerplate code for you to play with.
pipx run crawlee create my-crawler
If you prefer to integrate Crawlee into your own project, you can follow the example below. Crawlee is available on PyPI, so you can install it using pip. Since the example uses PlaywrightCrawler, you will also need to install the crawlee package with the playwright extra. It is not included with Crawlee by default to keep the installation size minimal.
pip install 'crawlee[playwright]'
Currently we have the Python packages crawlee and playwright installed. There is one more essential requirement: the Playwright browser binaries. You can install them by running:
playwright install
Now we are ready to execute our first Crawlee project:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5,  # Limit the crawl to 5 requests.
        headless=False,  # Show the browser window.
        browser_type='firefox',  # Use the Firefox browser.
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links found on the page.
        await context.enqueue_links()

        # Extract data from the page using Playwright API.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'content': (await context.page.content())[:100],
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the entire dataset to a JSON file.
    await crawler.export_data('results.json')

    # Or work with the data directly.
    data = await crawler.get_data()
    crawler.log.info(f'Extracted data: {data.items}')


if __name__ == '__main__':
    asyncio.run(main())