
[Bug]: TrafilaturaWebReader - adds document with NoneType text #16891

Open
mingqxu7 opened this issue Nov 9, 2024 · 2 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments


mingqxu7 commented Nov 9, 2024

Bug Description

File "/Users/mingqxu/miniforge3arm64/envs/rag/lib/python3.11/site-packages/llama_index/readers/web/trafilatura_web/base.py", line 77, in load_data
    documents.append(Document(text=response, id_=url))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mingqxu/miniforge3arm64/envs/rag/lib/python3.11/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Document
text
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]

Here's the code:

def load_data(
    self,
    urls: List[str],
    include_comments=True,
    output_format="txt",
    include_tables=True,
    include_images=False,
    include_formatting=False,
    include_links=False,
    show_progress=False,
    no_ssl=False,
    **kwargs,
) -> List[Document]:
    """Load data from the urls.

    Args:
        urls (List[str]): List of URLs to scrape.
        include_comments (bool, optional): Include comments in the output. Defaults to True.
        output_format (str, optional): Output format. Defaults to 'txt'.
        include_tables (bool, optional): Include tables in the output. Defaults to True.
        include_images (bool, optional): Include images in the output. Defaults to False.
        include_formatting (bool, optional): Include formatting in the output. Defaults to False.
        include_links (bool, optional): Include links in the output. Defaults to False.
        show_progress (bool, optional): Show progress bar. Defaults to False.
        no_ssl (bool, optional): Bypass SSL verification. Defaults to False.
        kwargs: Additional keyword arguments for the `trafilatura.extract` function.

    Returns:
        List[Document]: List of documents.

    """
    import trafilatura

    if not isinstance(urls, list):
        raise ValueError("urls must be a list of strings.")
    documents = []

    if show_progress:
        from tqdm import tqdm

        iterator = tqdm(urls, desc="Downloading pages")
    else:
        iterator = urls
    for url in iterator:
        downloaded = trafilatura.fetch_url(url, no_ssl=no_ssl)
        response = trafilatura.extract(
            downloaded,
            include_comments=include_comments,
            output_format=output_format,
            include_tables=include_tables,
            include_images=include_images,
            include_formatting=include_formatting,
            include_links=include_links,
            **kwargs,
        )
        documents.append(Document(text=response, id_=url))

The code does not check whether `response` is None before constructing the `Document`. One case where it can be None: `load_data` is called from a thread and the thread is terminated after a timeout set by the invoking program.

    return documents
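The failure mode can be reproduced without llama-index or trafilatura. Below is a minimal sketch (the `Document` stand-in and `build_documents` helper are hypothetical, written only to mirror the reader's loop and the pydantic validation): `trafilatura.extract` may return `None` when nothing can be extracted, and passing that straight into a text-validated model raises, while a `None` guard skips the bad entry instead.

```python
from typing import List, Optional


class Document:
    """Stand-in for the pydantic Document: text must be a str."""

    def __init__(self, text: str, id_: str):
        if not isinstance(text, str):
            # Mirrors pydantic's "Input should be a valid string" error
            raise ValueError("Input should be a valid string")
        self.text = text
        self.id_ = id_


def build_documents(extracted: List[Optional[str]], urls: List[str]) -> List[Document]:
    """Build documents from extraction results, skipping None entries."""
    documents = []
    for text, url in zip(extracted, urls):
        if text is None:
            # The guard missing in the reader: extract() can yield None
            continue
        documents.append(Document(text=text, id_=url))
    return documents
```

Without the `if text is None` check, the second entry below would raise exactly as in the traceback above.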

Version

0.11.19

Steps to Reproduce

tasks = [asyncio.to_thread(self.loader.load_data, [url]) for url in urls]
for task in asyncio.as_completed(tasks, timeout=1):
    docs = await task

Relevant Logs/Tracebacks

See the Bug Description.
@mingqxu7 mingqxu7 added bug Something isn't working triage Issue needs to be triaged/prioritized labels Nov 9, 2024

dosubot bot commented Nov 9, 2024

To address the issue where the TrafilaturaWebReader's load_data function attempts to append a Document with NoneType text, you can modify the function to check if the response is None before appending it to the documents list. Here's how you can update the code:

def load_data(
    self,
    urls: List[str],
    include_comments=True,
    output_format="txt",
    include_tables=True,
    include_images=False,
    include_formatting=False,
    include_links=False,
    show_progress=False,
    no_ssl=False,
    **kwargs,
) -> List[Document]:
    """Load data from the urls.

    Args:
        urls (List[str]): List of URLs to scrape.
        include_comments (bool, optional): Include comments in the output. Defaults to True.
        output_format (str, optional): Output format. Defaults to 'txt'.
        include_tables (bool, optional): Include tables in the output. Defaults to True.
        include_images (bool, optional): Include images in the output. Defaults to False.
        include_formatting (bool, optional): Include formatting in the output. Defaults to False.
        include_links (bool, optional): Include links in the output. Defaults to False.
        show_progress (bool, optional): Show progress bar. Defaults to False.
        no_ssl (bool, optional): Bypass SSL verification. Defaults to False.
        kwargs: Additional keyword arguments for the `trafilatura.extract` function.

    Returns:
        List[Document]: List of documents.

    """
    import trafilatura

    if not isinstance(urls, list):
        raise ValueError("urls must be a list of strings.")
    documents = []

    if show_progress:
        from tqdm import tqdm
        iterator = tqdm(urls, desc="Downloading pages")
    else:
        iterator = urls
    for url in iterator:
        downloaded = trafilatura.fetch_url(url, no_ssl=no_ssl)
        if not downloaded:
            logger.warning(f"Failed to download content from url: {url}")
            continue
        response = trafilatura.extract(
            downloaded,
            include_comments=include_comments,
            output_format=output_format,
            include_tables=include_tables,
            include_images=include_images,
            include_formatting=include_formatting,
            include_links=include_links,
            **kwargs,
        )
        if response is None:
            logger.warning(f"Response is None for url: {url}")
            continue
        documents.append(Document(text=response, id_=url))

    return documents

This modification adds a check for None responses and logs a warning if the response is None, allowing the function to skip appending a Document with NoneType text [1][2].

To continue talking to Dosu, mention @dosu.


mingqxu7 commented Nov 9, 2024

Edit: repro steps:

loader = TrafilaturaWebReader()
urls = [...]  # give a list of urls
tasks = [asyncio.to_thread(loader.load_data, [url]) for url in urls]
for task in asyncio.as_completed(tasks, timeout=1):
    docs = await task
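Until the reader itself guards against `None`, the caller can tolerate per-URL failures so one `ValidationError` (or a timeout) does not abort the whole batch. The sketch below is a hypothetical workaround, not part of llama-index: `fetch_all` and `FakeLoader` are assumed names, and the broad `except Exception` stands in for catching pydantic's `ValidationError` without importing pydantic here.

```python
import asyncio


async def fetch_all(loader, urls, timeout=1.0):
    # Run each load_data call in a worker thread, as in the repro above
    tasks = [asyncio.to_thread(loader.load_data, [url]) for url in urls]
    results = []
    for task in asyncio.as_completed(tasks, timeout=timeout):
        try:
            results.extend(await task)
        except asyncio.TimeoutError:
            break  # overall deadline hit; remaining pages are abandoned
        except Exception:
            continue  # e.g. ValidationError when extract() returned None
    return results
```

Note that `asyncio.to_thread` cannot actually kill the worker thread on timeout; the thread keeps running in the background, which is consistent with the partially completed state described in the bug report.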
