Extract feed and item images from more places #220
Apart from a few tests affected by the issue mentioned in the review above, I think this should work fine. I'd like to get some input on that review before I convert this from a draft.
@infogulch I think the fallback image sources you added to the translator function look clean and make sense to me, including the HTML parsing code. I had no clue that so many feeds stash their images in there, lol.
Force-pushed from 1eed9c3 to b1ed5bf
@infogulch update looks good to me. I might create a separate issue to think about what to do with naked HTML markup within tags.
Thank you for your contribution, @infogulch! Now I just need to tackle #210 and, hopefully, turn gating of PRs on passing tests back on.
PR mmcdole#220 introduced a failing test for detecting images in the "content" element. It should instead be testing the "content:encoded" element. Fixing that uncovered an issue with how extensions were being detected: the "content" namespace was being treated as an extension namespace. As a more robust way of checking for the "content" namespace, this PR exposes `shared.PrefixForNamespace()` as a public function so it can be used in the RSS parser. This provides a solution to PR mmcdole#211 (and includes @JLugagne's test case from that PR). Once the fixes to xml:base handling in mmcdole#222 are merged, this should fix the remaining failing test reported in mmcdole#210.
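The namespace check described in that commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `prefixForNamespace` here is a hypothetical stand-in that takes a plain map of namespace URI to declared prefix, and `isContentEncoded` is a made-up helper name. Neither is the library's actual signature. The idea is to resolve the prefix from the namespace URI of the RSS content module rather than trusting the literal prefix string.

```go
package main

import "fmt"

// contentNS is the namespace URI of the RSS 1.0 content module.
const contentNS = "http://purl.org/rss/1.0/modules/content/"

// prefixForNamespace is a hypothetical stand-in for the helper named in the
// commit message. It assumes the parser keeps a map from namespace URI to the
// prefix the feed declared for it, and returns "" when the URI is undeclared.
func prefixForNamespace(space string, declared map[string]string) string {
	if prefix, ok := declared[space]; ok {
		return prefix
	}
	return ""
}

// isContentEncoded reports whether the element prefix:local is the content
// module's "encoded" element, judged by namespace URI rather than by the
// literal prefix text.
func isContentEncoded(prefix, local string, declared map[string]string) bool {
	p := prefixForNamespace(contentNS, declared)
	return p != "" && prefix == p && local == "encoded"
}

func main() {
	declared := map[string]string{contentNS: "content"}
	declared["http://search.yahoo.com/mrss/"] = "media"

	fmt.Println(isContentEncoded("content", "encoded", declared)) // true
	fmt.Println(isContentEncoded("media", "content", declared))   // false
}
```

With a check like this, "content:encoded" is handled as core RSS content, while elements in other namespaces, such as Media RSS, can still be treated as extensions.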
I'd like to comment on fetching the first image in the body. Take for example the feed from Slashdot: https://rss.slashdot.org/Slashdot/slashdotMain. The first image in the body will be https://a.fsdn.com/sd/twitter_icon_large.png, which is 56x20 pixels. This is clearly unsuitable as a thumbnail for an article. Perhaps it would be better to place the first body image as an …
Additional locations where image extraction is attempted:
- `<img>` in content or description

Fixes #133
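As a rough illustration of the `<img>` fallback, the sketch below pulls the `src` of the first `<img>` element out of an HTML fragment such as an item's content or description. It is not the code from this PR: `firstImageSrc` is a hypothetical helper, and it uses the standard library's `encoding/xml` decoder in non-strict HTML mode purely to keep the example dependency-free, where a real HTML parser would be more tolerant of messy markup.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// firstImageSrc returns the src attribute of the first <img> element in an
// HTML fragment, or "" if there is none. Hypothetical helper for illustration.
func firstImageSrc(fragment string) string {
	dec := xml.NewDecoder(strings.NewReader(fragment))
	// Relax the decoder so that typical HTML (unclosed <img>, HTML entities)
	// does not abort tokenization.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	for {
		tok, err := dec.Token()
		if err != nil {
			// io.EOF or markup the decoder cannot handle: no image found.
			return ""
		}
		el, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(el.Name.Local, "img") {
			continue
		}
		for _, attr := range el.Attr {
			if strings.EqualFold(attr.Name.Local, "src") {
				return attr.Value
			}
		}
	}
}

func main() {
	description := `<p>Hello <img src="https://example.com/cover.png" alt="cover"> world</p>`
	fmt.Println(firstImageSrc(description))
}
```

Because this returns whatever image comes first, it will also pick up small icons or tracking pixels, so a caller may want to treat the result as a last-resort fallback rather than a guaranteed thumbnail.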