Crawling error #105
@BlackChila Would you please share the link of the page where you experienced this?
Hey unclecode, thanks for your answer!
@BlackChila Thanks for sharing. Please do us a favor and try the asynchronous method; let's see whether you get something similar with it. If you still face issues, we'll run a stress test by crawling a set of links and websites to see when such things happen. For now, please try the asynchronous method first and let me know. Thank you.
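For anyone following along, a minimal sketch of the asynchronous usage being suggested here (the target URL is a placeholder; the call shape matches the snippets later in this thread):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Crawl one page with the asynchronous API and print the extracted markdown
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://example.com", bypass_cache=True)
        print(result.markdown)

asyncio.run(main())
```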
@unclecode Getting the same NoneType error. Here are the logs:

INFO: Started server process [14039]
Thanks @RhonnieAl for posting the same issue with the asynchronous method here!
Getting the same error when trying to crawl a Notion site:

[ERROR] 🚫 Failed to crawl, error: Failed to extract content from the website: error: can only concatenate str (not "NoneType") to str
+1. I am encountering this same issue.
Related to unclecode#105

Fix the 'NoneType' object has no attribute 'get' error in `AsyncWebCrawler`.

* **crawl4ai/async_webcrawler.py**
  - Add a check in the `arun` method to ensure `html` is not `None` before further processing.
  - Raise a descriptive error if `html` is `None`.
* **crawl4ai/async_crawler_strategy.py**
  - Add a check in the `crawl` method of the `AsyncPlaywrightCrawlerStrategy` class to handle cases where `html` is `None`.
  - Raise a descriptive error if `html` is `None`.
* **tests/async/test_basic_crawling.py**
  - Add a test case to verify handling of `None` values for the `html` variable in the `test_invalid_url` function.
* **tests/async/test_error_handling.py**
  - Add a test case to verify handling of `None` values for the `html` variable in the `test_network_error` function.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/unclecode/crawl4ai/issues/105?shareId=XXXX-XXXX-XXXX-XXXX).
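A hypothetical sketch of the guard described in the list above (illustrative only, not the actual patch; the helper name and signature are placeholders):

```python
async def arun_guard_example(url: str, html: str | None) -> str:
    # Illustrative guard only; mirrors the checks described in the bullet list above:
    # fail early with a descriptive error instead of letting a None html value
    # propagate and surface later as "'NoneType' object has no attribute 'get'".
    if html is None:
        raise ValueError(
            f"Failed to crawl {url}: the crawler returned no HTML "
            "(the page may be blocked or may have failed to load)."
        )
    return html
```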
@unclecode Thanks for the amazing library. I'm having a bit of trouble understanding the error I'm getting: the library fails to crawl a news website given a specific news topic. I have used the Gemini API for authentication. I have a few questions in mind.
I have the same error when using WebCrawler. The problem is in the file utils.py, so I patched it with a try/except:
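The patch itself was not included in the comment; as a rough illustration of that kind of guard (names such as `meta_tag` are placeholders, not the actual crawl4ai utils.py code), it could look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical illustration: soup.find(...) returns None when the tag is missing,
# and calling .get() on None raises "'NoneType' object has no attribute 'get'".
soup = BeautifulSoup("<html><head></head></html>", "html.parser")
meta_tag = soup.find("meta", attrs={"name": "description"})  # None here

try:
    description = meta_tag.get("content")
except AttributeError:
    description = None  # fall back gracefully instead of crashing the crawl

print(description)
```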
@RhonnieAl Sorry for my delayed response. The links you are trying to crawl have very strong bot detection, which is why they won't navigate to the page. As for the error message, we made some adjustments in the new version, 0.3.7, so it is a little more informative; you can update to that version to get a better message. I think I'm going to release the new version within a day or two. One thing you can always do is set headless to False, so that you can see what's happening and get an understanding of what's going on. Here's a screenshot of what's happening. FYI, you can apply scripts and techniques using the hooks we have in our library before navigating to a page to work around some of these issues. However, if you use the new version, the error message contains useful information for you to try on different websites. Anyway, hopefully this is helpful for you.
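A minimal sketch of both suggestions, assuming the `headless` flag used elsewhere in this thread and the `set_hook`/`before_goto` hook mechanism (treat the exact hook name and signature as assumptions and check the docs for your version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # headless=False opens a visible browser window so you can watch the navigation
    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        async def before_goto(page, **kwargs):
            # Runs before navigating to the page; a place to tweak headers, cookies, etc.
            await page.set_extra_http_headers({"Accept-Language": "en-US,en;q=0.9"})

        crawler.crawler_strategy.set_hook("before_goto", before_goto)

        result = await crawler.arun(url="https://example.com", bypass_cache=True)
        print(result.success)

asyncio.run(main())
```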
Hi, would you please share your code snippet with me, so I can check it for you?
@mobyds This is interesting. Would you please share the URL that caused this issue? Thanks!
@mobyds It works for me; perhaps you can share your code as well as your system specs.

```python
import asyncio
import base64
import os

from crawl4ai import AsyncWebCrawler

__data = "output"  # placeholder for the output directory used in the original snippet

async def main():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "https://chantepie.fr/"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True
        )
        # Save screenshot to file
        with open(os.path.join(__data, "chantepie.png"), "wb") as f:
            f.write(base64.b64decode(result.screenshot))
        print(result.markdown)

asyncio.run(main())
```
It was with WebCrawler, not with AsyncWebCrawler.
@unclecode Hi, below is the code snippet I used for extraction. I find the issue most common with hindustantimes and NDTV; the news block is not getting extracted completely.

```python
url1 = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
related_content = []

async def process_urls():
    ...  # body not included in the original comment

# Execute the asynchronous function
asyncio.run(process_urls())

print(f"Number of related items extracted: {len(related_content)}")
```
@mobyds Oh, I see. Yes, I think it's better to switch to async, because I plan to remove the synchronous version very soon. Additionally, I want to cut the dependency on Selenium and stick with Playwright. Anyway, if there are any other issues, don't hesitate to reach out. Thank you for trying our library.
@DhrubojyotiDey I followed the first link you shared here. The page is actually very long. Let me explain how the LLM extraction strategy works. By default, there is a chunking stage: when you pass the content, it is broken into smaller chunks, and every chunk is sent to the language model in parallel. This is designed to suit smaller language models, which may not have a long context window, so we can make the most of them this way. If you're using a language model that supports a long context window, such as Gemini in your code, the best way to handle it is to either turn off this feature or use a very long chunk length. Here's an example of both approaches; in my case, they work perfectly. I hope this is helpful for you.

```python
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    extraction_strategy = LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=os.getenv('OPENAI_API_KEY'),
        # Option 1: turn off chunking entirely
        apply_chunking=False,
        # Option 2: keep chunking but use a very long chunk length
        # chunk_token_threshold=2 ** 14,  # 16k tokens
        instruction="""Extract only content related to Israel and hamas war and extract URL if available"""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://www.nbcnews.com/news/world/live-blog/live-updates-hamas-leader-yahya-sinwar-possibly-killed-gaza-rcna175922"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        extracted_content = json.loads(result.extracted_content)
        print(extracted_content)
        print("Done")

asyncio.run(main())
```
OK, and thanks a lot for this very useful lib.
You're welcome @mobyds.
Hey, and thanks for this nice package!
I am having the following issue: some websites are randomly not scraped, while others get scraped correctly. Which websites are scraped or not varies randomly on each run of the code. For the websites that are not scraped, I get the following error:
[ERROR] 🚫 Failed to crawl https://random-website.com, error: 'NoneType' object has no attribute 'get'.
I save the .html files after scraping, and the websites affected by this bug end up as an html file containing just ['', None].
I tried to update all packages and also set up a new conda environment, but it didn't fix the issue. I am using WebCrawler, not the AsyncWebCrawler.
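For reference, a minimal sketch of the synchronous usage described above, assuming the `WebCrawler` warmup/run API of the versions discussed in this thread (the URL is a placeholder):

```python
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()  # prepares the default crawler strategy

result = crawler.run(url="https://example.com", bypass_cache=True)

# Save the returned HTML, as described above; affected pages come back nearly empty
with open("page.html", "w", encoding="utf-8") as f:
    f.write(result.html or "")
```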