[Bug]: TrafilaturaWebReader - adds document with NoneType text. #16891
Comments
To address the issue, `load_data` can skip any URL whose download fails or whose extraction result is `None`, instead of passing it on to `Document`:

```python
def load_data(
    self,
    urls: List[str],
    include_comments=True,
    output_format="txt",
    include_tables=True,
    include_images=False,
    include_formatting=False,
    include_links=False,
    show_progress=False,
    no_ssl=False,
    **kwargs,
) -> List[Document]:
    """Load data from the urls.

    Args:
        urls (List[str]): List of URLs to scrape.
        include_comments (bool, optional): Include comments in the output. Defaults to True.
        output_format (str, optional): Output format. Defaults to 'txt'.
        include_tables (bool, optional): Include tables in the output. Defaults to True.
        include_images (bool, optional): Include images in the output. Defaults to False.
        include_formatting (bool, optional): Include formatting in the output. Defaults to False.
        include_links (bool, optional): Include links in the output. Defaults to False.
        show_progress (bool, optional): Show progress bar. Defaults to False.
        no_ssl (bool, optional): Bypass SSL verification. Defaults to False.
        kwargs: Additional keyword arguments for the `trafilatura.extract` function.

    Returns:
        List[Document]: List of documents.
    """
    import trafilatura

    if not isinstance(urls, list):
        raise ValueError("urls must be a list of strings.")
    documents = []
    if show_progress:
        from tqdm import tqdm

        iterator = tqdm(urls, desc="Downloading pages")
    else:
        iterator = urls
    for url in iterator:
        downloaded = trafilatura.fetch_url(url, no_ssl=no_ssl)
        if not downloaded:
            logger.warning(f"Failed to download content from url: {url}")
            continue
        response = trafilatura.extract(
            downloaded,
            include_comments=include_comments,
            output_format=output_format,
            include_tables=include_tables,
            include_images=include_images,
            include_formatting=include_formatting,
            include_links=include_links,
            **kwargs,
        )
        if response is None:
            logger.warning(f"Response is None for url: {url}")
            continue
        documents.append(Document(text=response, id_=url))
    return documents
```

This modification adds a check for a `None` return from `trafilatura.extract` (and for a failed download), logging a warning and skipping the URL so that `Document(text=None, ...)` is never constructed.
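The None-guard above can be exercised in isolation. Below is a minimal sketch of the same control flow, with `fetch_url` and `extract` replaced by hypothetical stubs (so trafilatura itself is not required), including a stand-in `Document` type:

```python
import logging
from typing import List, NamedTuple

logger = logging.getLogger(__name__)


class Document(NamedTuple):
    # Hypothetical stand-in for llama_index's Document
    text: str
    id_: str


# Hypothetical page table: one URL extracts fine, one extracts to None.
PAGES = {"http://ok": "hello world", "http://empty": None}


def fetch_url(url):
    # Stub for trafilatura.fetch_url: raw HTML on success, None on failure.
    return "<html>stub</html>" if url in PAGES else None


def extract(downloaded, url):
    # Stub for trafilatura.extract: may return None when nothing is extractable.
    return PAGES.get(url)


def load_data(urls: List[str]) -> List[Document]:
    documents = []
    for url in urls:
        downloaded = fetch_url(url)
        if not downloaded:
            logger.warning("Failed to download content from url: %s", url)
            continue
        response = extract(downloaded, url)
        if response is None:  # the guard: skip rather than Document(text=None)
            logger.warning("Response is None for url: %s", url)
            continue
        documents.append(Document(text=response, id_=url))
    return documents


docs = load_data(["http://ok", "http://missing", "http://empty"])
# Only the URL with extractable text yields a Document.
```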
Edit: repro steps: `loader = TrafilaturaWebReader()`
Bug Description

```
File "/Users/mingqxu/miniforge3arm64/envs/rag/lib/python3.11/site-packages/llama_index/readers/web/trafilatura_web/base.py", line 77, in load_data
    documents.append(Document(text=response, id_=url))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mingqxu/miniforge3arm64/envs/rag/lib/python3.11/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Document
text
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
```

Here's the code:
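Until the reader guards against `None` extraction results itself, callers can work around the crash by loading URLs one at a time and skipping the ones that raise. A minimal sketch, using a hypothetical stub loader in place of `TrafilaturaWebReader`:

```python
from typing import List


class StubLoader:
    """Hypothetical stand-in for TrafilaturaWebReader: raises on one URL,
    mimicking the ValidationError pydantic raises for text=None."""

    def load_data(self, urls: List[str]) -> List[str]:
        docs = []
        for url in urls:
            if url == "http://bad":
                raise ValueError("Input should be a valid string")
            docs.append(f"doc:{url}")
        return docs


def safe_load(loader, urls: List[str]) -> List[str]:
    # Load one URL per call so a single failing page cannot abort the batch.
    documents = []
    for url in urls:
        try:
            documents.extend(loader.load_data([url]))
        except Exception as exc:  # pydantic's ValidationError with the real reader
            print(f"skipping {url}: {exc}")
    return documents


docs = safe_load(StubLoader(), ["http://a", "http://bad", "http://b"])
```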
```python
def load_data(
    self,
    urls: List[str],
    include_comments=True,
    output_format="txt",
    include_tables=True,
    include_images=False,
    include_formatting=False,
    include_links=False,
    show_progress=False,
    no_ssl=False,
    **kwargs,
) -> List[Document]:
    """Load data from the urls.
```
Version
0.11.19
Steps to Reproduce

```python
tasks = [asyncio.to_thread(self.loader.load_data, [url]) for url in urls]
for task in asyncio.as_completed(tasks, timeout=1):
    docs = await task
```
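The reproduction drives the synchronous loader from asyncio via `asyncio.to_thread`. The same pattern can be sketched self-contained, with `blocking_load` as a hypothetical stand-in for `loader.load_data`:

```python
import asyncio
import time
from typing import List


def blocking_load(urls: List[str]) -> List[str]:
    # Hypothetical stand-in for loader.load_data: a blocking (synchronous) call
    # that must not run directly on the event loop.
    time.sleep(0.01)
    return [f"doc:{u}" for u in urls]


async def load_all(urls: List[str]) -> List[str]:
    # One worker thread per URL; consume results in completion order.
    tasks = [asyncio.to_thread(blocking_load, [url]) for url in urls]
    results: List[str] = []
    for task in asyncio.as_completed(tasks):
        results.extend(await task)
    return results


docs = asyncio.run(load_all(["http://a", "http://b"]))
```

Note that results arrive in completion order, not submission order, which is why the bug surfaces nondeterministically across URLs in a batch.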
Relevant Logs/Tracebacks