H1 Headers ignored/skipped on https://www.astralcodexten.com/p/practically-a-book-review-rootclaim #855

Open
uqs opened this issue Mar 29, 2024 · 4 comments · May be fixed by #868

Comments

@uqs

uqs commented Mar 29, 2024

Hi, pretty much every article on that substack is missing the headers when turning on reader mode.

  1. Go to https://www.astralcodexten.com/p/practically-a-book-review-rootclaim and turn on reader mode
  2. The article starts with an <h1> I. Foo, followed by some <p> and then another <h1> II. Bar. The headers are not turned into headings in reader mode.
  3. The title on the page has a tagline and a date; both are missing in reader mode, though it does produce the author name.
@yagudaev

yagudaev commented Apr 18, 2024

I was also surprised by how Readability handles headings.

The demo below converts a page to markdown. It first uses Readability to eliminate unnecessary content.

However, the heading levels (h1, h2, h3) are lost: every heading ends up as an h2.

[Screenshot: CleanShot 2024-04-18 at 16 29 51@2x]

Is there a way to turn this off?

Here is a live demo

@cmkm

cmkm commented Apr 19, 2024

The Substack issue may be due to including header in the unlikely candidates list. We should try removing it and see if it fixes this and other Substack parsing issues.
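
To illustrate the suspicion above, here is a minimal sketch of how a class-name filter of this kind could discard the Substack headings. The two patterns are heavily abbreviated stand-ins rather than the library's full lists, and `isUnlikelyCandidate` is a hypothetical helper, not the library's own function. The class name `header-with-anchor-widget` is the one reported further down in this thread.

```javascript
// Abbreviated, illustrative patterns (assumption: the real lists are much longer).
const UNLIKELY = /banner|comment|footer|header|sidebar|sponsor/i;
const MAYBE_CANDIDATE = /and|article|body|column|content|main/i;

// Hypothetical helper: a node is dropped when its class/id string looks
// unlikely and nothing in it suggests real content.
function isUnlikelyCandidate(className, id, unlikely = UNLIKELY) {
  const matchString = `${className} ${id}`;
  return unlikely.test(matchString) && !MAYBE_CANDIDATE.test(matchString);
}

const substackHeading = "header-with-anchor-widget";

// With "header" in the list, the heading is filtered out.
console.log(isUnlikelyCandidate(substackHeading, "")); // true

// With "header" removed from the list, it survives this step.
const withoutHeader = /banner|comment|footer|sidebar|sponsor/i;
console.log(isUnlikelyCandidate(substackHeading, "", withoutHeader)); // false
```

Passing this filter only means the node is still in play; later scoring steps can remove it again.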

@yagudaev: Thank you for your contribution, but I think this is unrelated to the issue @uqs filed. Would you mind opening a new issue, please?

@yagudaev

yagudaev commented Apr 26, 2024

@cmkm got it, I'll open a new issue 😊.

Created a new issue here: #863 -- cleaned up the demo and made the description clearer

@inhumantsar
Contributor

had a quick look at this. removing "header" from unlikely does prevent the h1 from being culled, but it gets removed later on due to low class weight, probably because the class name is header-with-anchor-widget and widget is listed in the negative regex.
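
A rough sketch of the class-weight step described above, assuming a scheme where a class name matching a negative pattern loses points and one matching a positive pattern gains them. The patterns are abbreviated, and both the weight of 25 and the `classWeight` helper are illustrative assumptions rather than the library's exact code.

```javascript
// Abbreviated, illustrative patterns (assumption: not the full lists).
const NEGATIVE = /banner|comment|footer|sidebar|sponsor|widget/i;
const POSITIVE = /article|body|content|entry|main|post|text/i;

// Hypothetical scoring helper; the weight of 25 is an assumed value.
function classWeight(className, positive = POSITIVE, negative = NEGATIVE) {
  let weight = 0;
  if (negative.test(className)) weight -= 25;
  if (positive.test(className)) weight += 25;
  return weight;
}

const heading = "header-with-anchor-widget";

// "widget" matches the negative pattern, so the heading scores below zero.
console.log(classWeight(heading)); // -25

// A positive pattern that also lists "header" offsets the penalty.
const positiveWithHeader = /article|body|content|entry|header|main|post|text/i;
console.log(classWeight(heading, positiveWithHeader)); // 0
```

In this sketch a heading is at risk only while its weight is negative, so a matching positive term is enough to cancel the `widget` penalty.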

removing header from unlikely and adding header to positive does ensure all four headers appear in the output properly. this also improves the parsing on quite a few of the other test cases, e.g. re-adding the rubric as well as headings for "Pros", "Cons", and "Summary" on the Engadget review, and definition terms from the Google SRE test case. it does introduce a couple of issues on other pages though, like adding back the site index section on NYT pages and a pair of duplicated headings on the Mercurial test case.
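
One way to spot regressions like the ones listed above is to diff the headings in a test page's output before and after the change. The sketch below is a hypothetical helper, not part of the project's test suite. It pulls heading text out with a simple regular expression, which is only adequate for well-formed fixture markup, and the sample HTML strings are made up for illustration.

```javascript
// Extract the text of every h1-h6 element from an HTML string.
// A regex is only good enough for simple, well-formed fixture markup.
function headingTexts(html) {
  const pattern = /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi;
  return Array.from(html.matchAll(pattern), (m) =>
    m[2].replace(/<[^>]+>/g, "").trim()
  );
}

// Report headings gained, lost, and duplicated between two outputs.
function diffHeadings(beforeHtml, afterHtml) {
  const before = headingTexts(beforeHtml);
  const after = headingTexts(afterHtml);
  return {
    added: after.filter((h) => !before.includes(h)),
    removed: before.filter((h) => !after.includes(h)),
    duplicated: after.filter((h, i) => after.indexOf(h) !== i),
  };
}

// Made-up sample output, for illustration only.
const beforeHtml = "<div><p>Intro</p><h2>Summary</h2></div>";
const afterHtml =
  "<div><h2>Pros</h2><p>Intro</p><h2>Summary</h2><h2>Summary</h2></div>";

console.log(diffHeadings(beforeHtml, afterHtml));
// { added: [ 'Pros' ], removed: [], duplicated: [ 'Summary' ] }
```

Headings under `added` are candidates for improvements, while anything under `duplicated` deserves a manual look.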

i'll push up a quick draft PR for review
