H1 Headers ignored/skipped on https://www.astralcodexten.com/p/practically-a-book-review-rootclaim #855

uqs · 2024-03-29T14:45:50Z

Hi, pretty much every article on that substack is missing the headers when turning on reader mode.

Go to https://www.astralcodexten.com/p/practically-a-book-review-rootclaim and turn on reader mode
The article starts with an <h1> I. Foo, followed by some <p> and then another <h1> II. Bar. The headers are not turned into headings in reader mode.
The title on the page has a tagline and the date; both of them are missing, though reader mode does produce the author name.

yagudaev · 2024-04-18T23:34:26Z

I was also surprised by how Readability handles headings.

The demo below converts a page to markdown. It first uses Readability to strip out the unnecessary content.

However, the heading information (h1, h2, h3) is destroyed: every heading ends up as an h2.

Is there a way to turn this off?

Here is a live demo

cmkm · 2024-04-19T16:33:54Z

The Substack issue may be due to including header in the unlikely candidates list. We should try removing it and see if it fixes this and other Substack parsing issues.
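For illustration, here is a minimal Python sketch of how an "unlikely candidates" check could end up culling a Substack heading. It is not Readability's actual code: the regex lists are short made-up subsets, and `is_unlikely_candidate` is a hypothetical helper name. It assumes the check is a case-insensitive match against the element's class name and id, with an override list for names that may still be content.

```python
import re

# Illustrative subsets only (assumptions); the real lists are longer.
UNLIKELY = re.compile(r"banner|comment|footer|header|sidebar", re.I)
UNLIKELY_WITHOUT_HEADER = re.compile(r"banner|comment|footer|sidebar", re.I)
MAYBE_OK = re.compile(r"article|body|content|main", re.I)


def is_unlikely_candidate(class_name, element_id="", unlikely=UNLIKELY):
    """Return True if the element would be dropped as an unlikely candidate."""
    match_string = f"{class_name} {element_id}"
    looks_unlikely = unlikely.search(match_string) is not None
    looks_ok = MAYBE_OK.search(match_string) is not None
    return looks_unlikely and not looks_ok


# Class name reported for the Substack headings in this thread.
substack_heading = "header-with-anchor-widget"
culled_today = is_unlikely_candidate(substack_heading)
culled_if_relaxed = is_unlikely_candidate(
    substack_heading, unlikely=UNLIKELY_WITHOUT_HEADER
)
```

Under these assumptions the heading is culled because its class name contains "header", and it is no longer culled at this step once "header" is dropped from the list.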

@yagudaev: Thank you for your contribution, but I think this is unrelated to the issue @uqs filed. Would you mind opening a new issue, please?

yagudaev · 2024-04-26T21:14:44Z

@cmkm got it, I'll open a new issue 😊.

Created a new issue here: #863 -- cleaned up the demo and made the description more clear

inhumantsar · 2024-05-04T22:32:16Z

had a quick look at this. removing "header" from unlikely does prevent the h1 from being culled but it gets removed later on due to low class weight, probably because the class name is header-with-anchor-widget and widget is listed in the negative regex.

removing header from unlikely and adding header to positive does ensure all four headers appear in the output properly. this also improves the parsing on quite a few of the other test cases, eg: re-adding the rubric as well as headings for "Pros", "Cons", and "Summary" on the Engadget review, and definition terms on the Google SRE test case. it does introduce a couple of issues on other pages though, like adding back the site index section on NYT pages and a pair of duplicated headings on the Mercurial test case.
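a rough Python sketch of the class-weight effect described above, for anyone following along. this is not the library's code: the regexes are short illustrative subsets, the helper names are hypothetical, and the step of 25 and the "drop headings with weight below zero" rule are assumptions about how the scoring works.

```python
import re

# Illustrative subsets only (assumptions); the real lists are longer.
NEGATIVE = re.compile(r"comment|footer|sidebar|sponsor|widget", re.I)
POSITIVE = re.compile(r"article|body|content|main|post", re.I)
# Same positive list with "header" added, as tried in this thread.
POSITIVE_WITH_HEADER = re.compile(r"article|body|content|main|post|header", re.I)


def class_weight(class_name, positive=POSITIVE, negative=NEGATIVE, step=25):
    """Score a class name: subtract for a negative match, add for a positive one."""
    weight = 0
    if negative.search(class_name):
        weight -= step
    if positive.search(class_name):
        weight += step
    return weight


def heading_survives(class_name, positive=POSITIVE):
    """Assumed rule: a heading is dropped when its class weight is below zero."""
    return class_weight(class_name, positive=positive) >= 0


substack_heading = "header-with-anchor-widget"
weight_before = class_weight(substack_heading)
weight_after = class_weight(substack_heading, positive=POSITIVE_WITH_HEADER)
```

with these assumptions, "widget" pulls the weight below zero on its own, and adding "header" to the positive list cancels it out so the heading is kept.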

i'll push up a quick draft PR for review

cmkm added has-website-testcase reader-mode-has-issues parsing-issue labels Apr 19, 2024

inhumantsar linked a pull request May 4, 2024 that will close this issue

fix: relax filtering of heading elements with classnames that include the word "header" #868

Open

ragnar48h mentioned this issue Jun 1, 2024

Missing section titles from substack.com #821

Open
