Release Summary for Version 0.4.1 (December 8, 2024): Major Efficiency Boosts with New Features!

This post was generated with the help of ChatGPT, so take everything with a grain of salt. 🧂

Hi everyone,

I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.


Handling Lazy Loading Better (Images Included)

One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI waits for all images to load before moving forward. This is useful because many modern websites only load images when they’re in the viewport or after some JavaScript executes.

Here’s how to enable it:

await crawler.crawl(
    url="https://example.com",
    wait_for_images=True  # Add this argument to ensure images are fully loaded
)

With wait_for_images=True, the crawler:

  1. Waits for the page to reach a "network idle" state.
  2. Ensures all images on the page have been completely loaded.

This single change handles the majority of lazy-loading cases you’re likely to encounter.


Text-Only Mode (Fast, Lightweight Crawling)

Sometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you can enable text-only mode to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling 3-4 times faster in most cases.

Here’s how to turn it on:

crawler = AsyncPlaywrightCrawlerStrategy(
    text_only=True  # Set this to True to enable text-only crawling
)

When text_only=True, the crawler automatically:

  • Disables GPU processing.
  • Blocks image and JavaScript resources.
  • Reduces the viewport size to 800x600 (you can override this with viewport_width and viewport_height).

If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.
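
If the default 800x600 viewport is too small for a particular layout, you can override it while keeping text-only mode on. Here's a minimal sketch; the width and height values are just examples, and viewport_width/viewport_height are the override parameters mentioned above:

crawler = AsyncPlaywrightCrawlerStrategy(
    text_only=True,        # still blocks images, JavaScript, and GPU work
    viewport_width=1280,   # example value, overrides the 800x600 default
    viewport_height=900    # example value, overrides the 800x600 default
)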


Adjusting the Viewport Dynamically

Another useful addition is the ability to dynamically adjust the viewport size to match the content on the page. This is particularly helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.

Here’s how it works:

  1. The crawler calculates the page’s width and height after it loads.
  2. It adjusts the viewport to fit the content dimensions.
  3. (Optional) It uses Chrome DevTools Protocol (CDP) to simulate zooming out so everything fits in the viewport.

To enable this, use:

await crawler.crawl(
    url="https://example.com",
    adjust_viewport_to_content=True  # Dynamically adjusts the viewport
)

This ensures the entire page fits within the viewport, which is especially useful for layouts that only load content once it becomes visible.


Simulating Full-Page Scrolling

Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for full-page scanning. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.

Here’s an example:

await crawler.crawl(
    url="https://example.com",
    scan_full_page=True,   # Enables scrolling
    scroll_delay=0.2       # Waits 200ms between scrolls (optional)
)

What happens here:

  1. The crawler scrolls down in increments, waiting for content to load after each scroll.
  2. It stops when no new content appears (i.e., dynamic elements stop loading).
  3. It scrolls back to the top before finishing (if necessary).

If you’ve ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.
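
If a page both lazy-loads images and keeps adding content as you scroll, the options from the sections above can be combined in a single call. Here's a quick sketch using the same flags already shown:

await crawler.crawl(
    url="https://example.com",
    wait_for_images=True,             # wait for lazy-loaded images (see above)
    adjust_viewport_to_content=True,  # size the viewport to the content
    scan_full_page=True,              # scroll to the bottom to trigger loading
    scroll_delay=0.2                  # 200ms pause between scroll steps
)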


Reusing Browser Sessions (Save Time on Setup)

By default, every time you crawl a page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.

I added a method called create_session for this:

session_id = await crawler.create_session()

# Use the same session for multiple crawls
await crawler.crawl(
    url="https://example.com/page1",
    session_id=session_id  # Reuse the session
)
await crawler.crawl(
    url="https://example.com/page2",
    session_id=session_id
)

This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.
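
To make the savings concrete, here's a rough sketch of a larger crawl that reuses one session across a list of URLs. The urls list and the way results are collected are placeholders, not part of the library; only create_session, crawl, and session_id come from the examples above:

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

session_id = await crawler.create_session()  # one browser tab for the whole run

results = []
for url in urls:
    # Each crawl reuses the same tab instead of opening a new one
    result = await crawler.crawl(url=url, session_id=session_id)
    results.append(result)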


Other Updates

Here are a few smaller updates I’ve made:

  • Light Mode: Use light_mode=True to disable background processes, extensions, and other unnecessary features, making the browser more efficient (see the sketch after this list).
  • Logging: Improved logs to make debugging easier.
  • Defaults: Added sensible defaults for things like delay_before_return_html (now set to 0.1 seconds).
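
Here's a rough sketch of how the light mode and the new default might look in practice. I'm assuming light_mode is passed to the crawler strategy the same way as text_only, and that delay_before_return_html is a per-crawl argument; the 0.1 just spells out the new default:

crawler = AsyncPlaywrightCrawlerStrategy(
    light_mode=True  # assumption: a strategy-level flag, like text_only
)

await crawler.crawl(
    url="https://example.com",
    delay_before_return_html=0.1  # the new default; raise it for slow pages
)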

How to Get the Update

You can install or upgrade to version 0.4.1 like this:

pip install crawl4ai --upgrade

As always, I’d love to hear your thoughts. If there’s something you think could be improved or if you have suggestions for future versions, let me know!

Enjoy the new features, and happy crawling! 🕷️