Problems when html's charset is windows-1253 #10

Open
hadjiprocopis opened this issue Aug 28, 2023 · 0 comments

Hi, and thank you for HTML5::DOM, which has served me superbly quite a few times.

Alas, it failed me when I tried to parse the contents of a webpage that declares its encoding as "charset=windows-1253" (via this: <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">). As a result, parse() returns nodes whose text, when printed on a Linux console, appears as gibberish (the typical horror of Perl's screen-of-unicode-death §Ξ΅Ξ—ΣΤΛΩΛΩ).

My eventual solution was to zap the evil windows-1253 from the html content and replace it with UTF-8.

How can this be solved properly (though I don't mind the zapping)?

Secondly, I tried to tell HTML5::DOM not to be concerned with Unicode at all and to return un-encoded text, which I would then encode myself, by using parse(..., {utf8=>0}). Either I made a mistake or this is not possible, because I ended up with even more gibberish. On second thought, why use utf8=>0 when the encoding is not UTF-8?

Below is a self-contained example demonstrating the problem.

Many thanks,

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML5::DOM;
use Encode;

my $ua = LWP::UserAgent->new();
my $response = $ua->request(
  HTTP::Request->new(
	'GET' => 'https://www.areiospagos.gr/proedros.htm',
	[
		'Connection' => 'keep-alive',
		'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
		'Accept-Encoding' => 'gzip, deflate',
		'Accept-Language' => 'en-GB,en;q=0.5',
		'Referer' => 'http://www.polignosi.com/cgibin/hweb',
		'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0',
		'Upgrade-Insecure-Requests' => '1'
	],
  )
);
die unless $response && $response->is_success;

my $html = $response->decoded_content;

print "encoding using detect(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detect($html))."\n";
print "encoding using detectUnicode(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectUnicode($html))."\n";
print "encoding using detectByPrescanStream(): ".HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding::detectByPrescanStream($html))."\n";

# The above html contains
#  <meta http-equiv="Content-Type" content="text/html; charset=windows-1253">
# replacing the crappy windows-1253 with UTF-8 solves my problem
#$html =~ s/charset=windows-1253/charset=UTF-8/g;

my $parser = HTML5::DOM->new();

my $tree = $parser->parse($html, {scripts => 0});

my $is_utf8_enabled = $tree->utf8;
# report whether the tree's utf8 flag is set
print "is_utf8_enabled=".($is_utf8_enabled ? "true" : "false")."\n";
my $text = $tree->find('body table#table1 tbody tr td table#table2 tbody tr td p span')->[0]->text();
# it prints gibberish (doubly-encoded)
print $text;
# it is solved by replacing the windows-1253 charset in $html, see above
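
A minimal sketch of what might be going on, using only the core Encode module. It assumes (this is not confirmed) that decoded_content() has already turned the windows-1253 bytes into Perl characters, and that the charset named in the meta tag is then applied a second time to the UTF-8 form of that text. The sample string and all variable names are hypothetical stand-ins for the page content.

```perl
use strict;
use warnings;
use utf8;                       # this source file contains literal Greek text
use Encode qw(decode encode);

# Hypothetical sample text standing in for the page content.
my $greek = "Πρόεδρος";

# What a server would send: the text as windows-1253 (cp1253) bytes.
my $raw_bytes = encode('cp1253', $greek);

# Decoding once yields the correct Perl character string.
my $decoded_once = decode('cp1253', $raw_bytes);

# Assumed failure mode: the already-decoded text is serialised as UTF-8
# bytes, and those bytes are then decoded again as windows-1253.
my $utf8_bytes    = encode('UTF-8', $decoded_once);
my $decoded_twice = decode('cp1253', $utf8_bytes);

# Encode on output so the console receives well-formed UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "decoded once:  $decoded_once\n";
print "decoded twice: $decoded_twice\n";
```

If this is indeed the cause, the second line of output should resemble the gibberish above. It would also explain why rewriting the meta charset to UTF-8 helps. Another option that might work, untested here, is to fetch the raw bytes with $response->decoded_content(charset => 'none') so that only one decoding step is applied.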