Use mozilla/readability to extract the text content of webpages #84

tomasgvivo · 2023-02-24T23:15:48Z

I would like the extension to include the websites' content, not only the headlines.
The package https://github.com/mozilla/readability lets you extract the readable text from a web page (it is what Firefox uses for its "reader mode"). This may simplify extracting text from websites, but you will obviously run into the per-message token limit.

After reading other issues, I think that if you split the web-access process into several steps/prompts, you might be able to avoid that limit by spreading the results across multiple messages.
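
A minimal sketch of the splitting idea, in TypeScript. It assumes the budget is measured in characters rather than tokens (the real limit is in tokens, so a character count is only a stand-in), and `splitIntoChunks` and `maxChars` are hypothetical names, not part of the extension.

```typescript
// Hypothetical helper: split extracted page text into pieces that each stay
// within a character budget, breaking on paragraph boundaries where possible.
function splitIntoChunks(text: string, maxChars: number): string[] {
  if (maxChars <= 0) {
    throw new Error("maxChars must be positive");
  }
  const chunks: string[] = [];
  let current = "";
  for (const paragraph of text.split(/\n{2,}/)) {
    const trimmed = paragraph.trim();
    if (trimmed.length === 0) continue;
    // A single paragraph longer than the budget is cut into fixed-size pieces.
    for (let i = 0; i < trimmed.length; i += maxChars) {
      const piece = trimmed.slice(i, i + maxChars);
      const candidate = current.length === 0 ? piece : current + "\n\n" + piece;
      if (candidate.length <= maxChars) {
        // Still within budget: keep accumulating into the current chunk.
        current = candidate;
      } else {
        // Budget exceeded: close the current chunk and start a new one.
        chunks.push(current);
        current = piece;
      }
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each returned chunk could then be sent as its own message, so no single message exceeds the budget.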

A couple of days ago, I tried something like this:

You are now in "Text Ingestion Mode".

When I send you a message, reply with '...'.
If I send you the string "EOM", exit "Text Ingestion Mode".

At first it worked and responded with "..." after the first text to ingest, but after the second text it jumped to conclusions and tried to give an opinion about the text I had provided. I think this is just a problem with the initial prompt, and after some tweaks it should work most of the time.

So what I imagine is:

User:
{prompt}

GPT:
Welcome to WebGPT [...] How can I help you?

User:
{query}

GPT:
SEARCH: {gpt_generated_query}

User:
Result 1/5
{content of first result's site}

GPT:
Next result.

User:
Result 2/5
{content of second result's site}

GPT:
Next result.

User:
Result 3/5
{content of third result's site}

GPT:
Next result.

User:
Result 4/5
{content of fourth result's site}

GPT:
Next result.

User:
Result 5/5
{content of fifth result's site}

GPT:
{gpt_answer_to_query}
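
The exchange above could be driven by something like the following TypeScript sketch. The function names are hypothetical and nothing here comes from the extension's source; it only builds the "Result i/n" messages and checks whether a reply is the expected "Next result." acknowledgement.

```typescript
// Hypothetical helper: turn the extracted text of each search result into
// one message per result, labelled "Result i/n" as in the exchange above.
function buildResultMessages(pageTexts: string[]): string[] {
  const total = pageTexts.length;
  return pageTexts.map((text, index) => `Result ${index + 1}/${total}\n${text}`);
}

// Hypothetical check: true only if the model replied with the bare
// acknowledgement instead of commenting on the text it was given.
function isAcknowledgement(reply: string): boolean {
  return reply.trim() === "Next result.";
}
```

A caller would send the messages in order, check every reply except the last with `isAcknowledgement`, and treat the reply to the final message as the answer to the query. A failed check is a sign that the model has jumped ahead, as happened with the "Text Ingestion Mode" prompt.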

qunash · 2023-02-25T11:08:08Z

Collaborator

I'm already testing mozilla/readability for text extraction. If you'd like to try it, check out the serverless branch and build the extension from source. Then type /page:url to extract the text from that URL.
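
For illustration only, here is one way a command of the form /page:url might be recognised in the user's input. The regular expression and the `parsePageCommand` name are assumptions, not taken from the serverless branch, which may handle the command differently.

```typescript
// Hypothetical parser: returns the text after "/page:" up to the first
// whitespace character, or null if the input does not start with "/page:".
function parsePageCommand(input: string): string | null {
  const match = /^\/page:(\S+)/.exec(input.trim());
  return match ? match[1] ?? null : null;
}
```

For example, `parsePageCommand("/page:https://example.com/article")` yields `"https://example.com/article"`, while ordinary queries yield `null`.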
