My eventual solution was to zap the evil windows-1253 from the HTML content and replace it with UTF-8.
How can this be solved properly (though I don't mind the zapping)?
Secondly, I tried to tell HTML5::DOM not to concern itself with Unicode at all and to hand me back un-encoded text, so that I could encode it myself, using parse(..., {utf8=>0}). Either I made a mistake or this is not possible, because I ended up with even more gibberish. On second thought, why use utf8=>0 when the encoding is not UTF-8?
Below is a self-contained example demonstrating the problem.
Many thanks,
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML5::DOM;
use Encode;
my $ua = LWP::UserAgent->new();
my $response = $ua->request(
    HTTP::Request->new(
        'GET' => 'https://www.areiospagos.gr/proedros.htm',
        [
            'Connection' => 'keep-alive',
            'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
            'Accept-Encoding' => 'gzip, deflate',
            'Accept-Language' => 'en-GB,en;q=0.5',
            'Referer' => 'http://www.polignosi.com/cgibin/hweb',
            'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0',
            'Upgrade-Insecure-Requests' => '1'
        ],
    )
);
die unless $response && $response->is_success;
my $html = $response->decoded_content;
print "encoding using detect(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detect($html))."\n";
print "encoding using detectUnicode(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectUnicode($html))."\n";
print "encoding using detectByPrescanStream(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectByPrescanStream($html))."\n";
# The above html contains
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">
# replacing the crappy windows-1253 with UTF-8 solves my problem
#$html =~ s/charset=windows-1253/charset=UTF-8/g;
my $parser = HTML5::DOM->new();
my $tree = $parser->parse($html, {scripts => 0});
my $is_utf8_enabled = $tree->utf8;
# report the value of the tree's utf8 flag
print "is_utf8_enabled=".($is_utf8_enabled ? "true" : "false")."\n";
my $text = $tree->find('body table#table1 tbody tr td table#table2 tbody tr td p span')->[0]->text();
# it prints gibberish (doubly-encoded)
print $text;
# it is solved by replacing the windows-1253 charset in $html, see above
Hi, and thank you for HTML5::DOM, which has served me superbly quite a few times.
Alas, it failed me when I tried to parse the contents of a webpage which states that it is encoded with "charset=windows-1253" (via this: <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">). The result is that parse() returns nodes whose text, when printed on a Linux console, appears as gibberish (the typical horror of Perl's screen-of-unicode-death: §Ξ΅Ξ—ΣΤΛΩΛΩ).
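For what it is worth, here is a minimal sketch of how I suspect the gibberish arises. It assumes that the text is already a Perl character string, and that its UTF-8 bytes are then decoded a second time as windows-1253 because of the <meta> charset. That is my assumption, not something I have confirmed in the HTML5::DOM documentation. The sketch uses only the core Encode module, needs no network access, and the two Greek letters are a hypothetical input chosen purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical input: GREEK CAPITAL LETTER RHO and ETA as a Perl character
# string, i.e. the kind of value decoded_content() hands over.
my $chars = "\x{03A1}\x{0397}";

# Serialise the characters to UTF-8 bytes (CE A1 CE 97) ...
my $utf8_bytes = encode('UTF-8', $chars);

# ... and then (wrongly) decode those bytes as windows-1253. This is the
# assumed second decoding step, not a documented HTML5::DOM behaviour.
my $garbled = decode('cp1253', $utf8_bytes);

# Print code points rather than glyphs, so the result does not depend on
# how the terminal is configured.
my $codepoints = join ' ', map { sprintf 'U+%04X', ord } split //, $garbled;
print "$codepoints\n";
```

The two input characters come back as four: capital xi, dialytika tonos, capital xi again, and an em dash. That is the same pattern as the "Ξ΅Ξ" run in the sample above, which is why I describe the output as doubly-encoded.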