
[Bug]: TrafilaturaWebReader - adds document with NoneType text #16891

Open
mingqxu7 opened this issue Nov 9, 2024 · 2 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments


mingqxu7 commented Nov 9, 2024

Bug Description

File "/Users/mingqxu/miniforge3arm64/envs/rag/lib/python3.11/site-packages/llama_index/readers/web/trafilatura_web/base.py", line 77, in load_data
    documents.append(Document(text=response, id_=url))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mingqxu/miniforge3arm64/envs/rag/lib/python3.11/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Document
text
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]

Here's the code:

def load_data(
    self,
    urls: List[str],
    include_comments=True,
    output_format="txt",
    include_tables=True,
    include_images=False,
    include_formatting=False,
    include_links=False,
    show_progress=False,
    no_ssl=False,
    **kwargs,
) -> List[Document]:
    """Load data from the urls.

    Args:
        urls (List[str]): List of URLs to scrape.
        include_comments (bool, optional): Include comments in the output. Defaults to True.
        output_format (str, optional): Output format. Defaults to 'txt'.
        include_tables (bool, optional): Include tables in the output. Defaults to True.
        include_images (bool, optional): Include images in the output. Defaults to False.
        include_formatting (bool, optional): Include formatting in the output. Defaults to False.
        include_links (bool, optional): Include links in the output. Defaults to False.
        show_progress (bool, optional): Show progress bar. Defaults to False.
        no_ssl (bool, optional): Bypass SSL verification. Defaults to False.
        kwargs: Additional keyword arguments for the `trafilatura.extract` function.

    Returns:
        List[Document]: List of documents.

    """
    import trafilatura

    if not isinstance(urls, list):
        raise ValueError("urls must be a list of strings.")
    documents = []

    if show_progress:
        from tqdm import tqdm

        iterator = tqdm(urls, desc="Downloading pages")
    else:
        iterator = urls
    for url in iterator:
        downloaded = trafilatura.fetch_url(url, no_ssl=no_ssl)
        response = trafilatura.extract(
            downloaded,
            include_comments=include_comments,
            output_format=output_format,
            include_tables=include_tables,
            include_images=include_images,
            include_formatting=include_formatting,
            include_links=include_links,
            **kwargs,
        )
        documents.append(Document(text=response, id_=url))

The code does not check whether `response` is None before constructing the `Document`. One case where it can be None: `load_data` is called from a thread and the thread is terminated after a timeout set by the invoking program.

    return documents
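The failure mode can be reproduced without llama-index or trafilatura. Below is a minimal sketch (the `Document` stand-in and `build_documents` helper are hypothetical, written only to mirror the reader's loop and the pydantic validation): `trafilatura.extract` may return `None` when nothing can be extracted, and passing that straight into a text-validated model raises, while a `None` guard skips the bad entry instead.

```python
from typing import List, Optional


class Document:
    """Stand-in for the pydantic Document: text must be a str."""

    def __init__(self, text: str, id_: str):
        if not isinstance(text, str):
            # Mirrors pydantic's "Input should be a valid string" error
            raise ValueError("Input should be a valid string")
        self.text = text
        self.id_ = id_


def build_documents(extracted: List[Optional[str]], urls: List[str]) -> List[Document]:
    """Build documents from extraction results, skipping None entries."""
    documents = []
    for text, url in zip(extracted, urls):
        if text is None:
            # The guard missing in the reader: extract() can yield None
            continue
        documents.append(Document(text=text, id_=url))
    return documents
```

Without the `if text is None` check, the second entry below would raise exactly as in the traceback above.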

Version

0.11.19

Steps to Reproduce

tasks = [asyncio.to_thread(self.loader.load_data, [url]) for url in urls]
for task in asyncio.as_completed(tasks, timeout=1):
    docs = await task

Relevant Logs/Tracebacks

See the Bug Description.
@mingqxu7 mingqxu7 added bug Something isn't working triage Issue needs to be triaged/prioritized labels Nov 9, 2024

dosubot bot commented Nov 9, 2024

To address the issue where the TrafilaturaWebReader's load_data function attempts to append a Document with NoneType text, you can modify the function to check if the response is None before appending it to the documents list. Here's how you can update the code:

def load_data(
    self,
    urls: List[str],
    include_comments=True,
    output_format="txt",
    include_tables=True,
    include_images=False,
    include_formatting=False,
    include_links=False,
    show_progress=False,
    no_ssl=False,
    **kwargs,
) -> List[Document]:
    """Load data from the urls.

    Args:
        urls (List[str]): List of URLs to scrape.
        include_comments (bool, optional): Include comments in the output. Defaults to True.
        output_format (str, optional): Output format. Defaults to 'txt'.
        include_tables (bool, optional): Include tables in the output. Defaults to True.
        include_images (bool, optional): Include images in the output. Defaults to False.
        include_formatting (bool, optional): Include formatting in the output. Defaults to False.
        include_links (bool, optional): Include links in the output. Defaults to False.
        show_progress (bool, optional): Show progress bar. Defaults to False.
        no_ssl (bool, optional): Bypass SSL verification. Defaults to False.
        kwargs: Additional keyword arguments for the `trafilatura.extract` function.

    Returns:
        List[Document]: List of documents.

    """
    import trafilatura

    if not isinstance(urls, list):
        raise ValueError("urls must be a list of strings.")
    documents = []

    if show_progress:
        from tqdm import tqdm
        iterator = tqdm(urls, desc="Downloading pages")
    else:
        iterator = urls
    for url in iterator:
        downloaded = trafilatura.fetch_url(url, no_ssl=no_ssl)
        if not downloaded:
            logger.warning(f"Failed to download content from url: {url}")
            continue
        response = trafilatura.extract(
            downloaded,
            include_comments=include_comments,
            output_format=output_format,
            include_tables=include_tables,
            include_images=include_images,
            include_formatting=include_formatting,
            include_links=include_links,
            **kwargs,
        )
        if response is None:
            logger.warning(f"Response is None for url: {url}")
            continue
        documents.append(Document(text=response, id_=url))

    return documents

This modification adds a check for None responses and logs a warning if the response is None, allowing the function to skip appending a Document with NoneType text [1][2].

To continue talking to Dosu, mention @dosu.


mingqxu7 commented Nov 9, 2024

Edit: repro steps:

loader = TrafilaturaWebReader()
urls = [...]  # give a list of urls
tasks = [asyncio.to_thread(loader.load_data, [url]) for url in urls]
for task in asyncio.as_completed(tasks, timeout=1):
    docs = await task
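Until the reader itself guards against `None`, the caller can tolerate per-URL failures so one `ValidationError` (or a timeout) does not abort the whole batch. The sketch below is a hypothetical workaround, not part of llama-index: `fetch_all` and `FakeLoader` are assumed names, and the broad `except Exception` stands in for catching pydantic's `ValidationError` without importing pydantic here.

```python
import asyncio


async def fetch_all(loader, urls, timeout=1.0):
    # Run each load_data call in a worker thread, as in the repro above
    tasks = [asyncio.to_thread(loader.load_data, [url]) for url in urls]
    results = []
    for task in asyncio.as_completed(tasks, timeout=timeout):
        try:
            results.extend(await task)
        except asyncio.TimeoutError:
            break  # overall deadline hit; remaining pages are abandoned
        except Exception:
            continue  # e.g. ValidationError when extract() returned None
    return results
```

Note that `asyncio.to_thread` cannot actually kill the worker thread on timeout; the thread keeps running in the background, which is consistent with the partially completed state described in the bug report.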
