Decode HTML entities in text values #40

Delagen · 2018-02-02T07:36:31Z

Checklist

Are you running the latest version?
Include sample input XML
Include actual output
Include expected output
Did you try online tool?
Did you star the repository for further updates? ;)

I will submit a PR soon.

parse html entities in text values, fixes #40

amitguptagwl · 2018-02-02T09:28:32Z

Published

amitguptagwl · 2018-02-02T09:40:01Z

I believe it should not parse HTML entities for CDATA.

Delagen · 2018-02-02T10:51:59Z

@amitguptagwl CDATA can also contain HTML entities: https://en.wikipedia.org/wiki/CDATA
It just indicates that the content is the text value of the node, not markup (children).

amitguptagwl · 2018-02-02T12:04:03Z

Yes. :)
So what I expect from <tag><![CDATA[&lt;sender&gt;John Smith&lt;/sender&gt;]]></tag> is

{
   "tag" : "&lt;sender&gt;John Smith&lt;/sender&gt;"
}

But what I'm getting from current parser is

{
   "tag" : "<sender>John Smith</sender>"
}

Expectation 2: <tag><![CDATA[<sender>John Smith</sender>]]></tag> is

{
   "tag" : "<sender>John Smith</sender>"
}

Expectation 3: <tag>&lt;sender&gt;John Smith&lt;/sender&gt;</tag>, which is equivalent to <tag><![CDATA[<sender>John Smith</sender>]]></tag>, is

{
   "tag" : "<sender>John Smith</sender>"
}
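For comparison, here is a minimal sketch of how a conforming XML parser treats these three inputs. It uses Python's standard `xml.etree.ElementTree` purely as a reference point, not fast-xml-parser itself, and it assumes the first input carries escaped entities inside the CDATA section; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Entities written inside CDATA stay literal (assumed form of the first input).
escaped_in_cdata = ET.fromstring(
    "<tag><![CDATA[&lt;sender&gt;John Smith&lt;/sender&gt;]]></tag>"
)
# Literal markup inside CDATA is returned as plain text, not as child nodes.
literal_in_cdata = ET.fromstring(
    "<tag><![CDATA[<sender>John Smith</sender>]]></tag>"
)
# Escaped markup in ordinary text is decoded by the parser.
escaped_in_text = ET.fromstring(
    "<tag>&lt;sender&gt;John Smith&lt;/sender&gt;</tag>"
)

print(escaped_in_cdata.text)  # &lt;sender&gt;John Smith&lt;/sender&gt;
print(literal_in_cdata.text)  # <sender>John Smith</sender>
print(escaped_in_text.text)   # <sender>John Smith</sender>
```

The second and third inputs yield the same string, which is the equivalence Expectation 3 relies on.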

Delagen · 2018-02-02T12:23:37Z

"But what I'm getting from current parser is"

And that output is right.

This started a holy war at work. :)
The spec only says:

"left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;""

But it says nothing about other entities, and even that is not stated precisely: "need not" and "cannot" are not the same as "must not".

For example, CDATA can contain HTML fragments:

<tag><![CDATA[<sender>&amp;John Smith&copy;</sender>]]></tag>

For consistency, you can disable entity decoding inside CDATA or make it optional. But that needs more refactoring, because as far as I know the code joins CDATA and text together.
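One way to picture the "make it optional" idea is a rough Python sketch like the one below. It is not how fast-xml-parser is implemented: `decode_text` and its `decode_cdata` flag are hypothetical names, and the sketch assumes the raw inner content of a node is available as a string with the CDATA markers still in place.

```python
import html
import re

# Matches one CDATA section and captures its raw content.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def decode_text(raw: str, decode_cdata: bool = False) -> str:
    """Join text and CDATA segments, decoding entities per segment."""
    parts = []
    pos = 0
    for match in CDATA_RE.finditer(raw):
        # Ordinary text before this CDATA section is always decoded.
        parts.append(html.unescape(raw[pos:match.start()]))
        content = match.group(1)
        # CDATA content is decoded only when the caller opts in.
        parts.append(html.unescape(content) if decode_cdata else content)
        pos = match.end()
    # Trailing ordinary text after the last CDATA section.
    parts.append(html.unescape(raw[pos:]))
    return "".join(parts)


inner = "<![CDATA[<sender>&amp;John Smith&copy;</sender>]]>"
print(decode_text(inner))                     # <sender>&amp;John Smith&copy;</sender>
print(decode_text(inner, decode_cdata=True))  # <sender>&John Smith©</sender>
```

Splitting into segments before decoding is what keeps the two behaviours separable; once CDATA and text are joined into one string, the boundary is lost.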

amitguptagwl · 2018-02-02T14:33:54Z

As per the wiki,

Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup. CDATA sections are useful for writing XML code as text data within an XML document.

if the numeric character reference &#240; appears in element content, it will be interpreted as the single Unicode character 00F0 (small letter eth). But if the same appears in a CDATA section, it will be parsed as six characters: ampersand, hash mark, digit 2, digit 4, digit 0, semicolon.

In my own words (as I understand it):

if &#240; (which represents ð) appears in a CDATA section, it should not be parsed to ð but kept as &#240;.
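The behaviour described in the quote can be checked with a short Python sketch. It uses the standard `xml.etree.ElementTree` module as a stand-in for a spec-conforming parser, not fast-xml-parser, and assumes the reference in question is the decimal character reference for U+00F0.

```python
import xml.etree.ElementTree as ET

# Numeric character reference in ordinary element content: resolved by the parser.
plain = ET.fromstring("<tag>&#240;</tag>")

# The same characters inside a CDATA section: kept literally.
cdata = ET.fromstring("<tag><![CDATA[&#240;]]></tag>")

print(repr(plain.text), len(plain.text))  # 'ð' 1
print(repr(cdata.text), len(cdata.text))  # '&#240;' 6
```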

Delagen · 2018-02-02T14:45:19Z

@amitguptagwl I will do some refactoring and submit a PR tomorrow.

amitguptagwl · 2018-02-02T15:34:47Z

Thanks @Delagen. If you are busy with other higher-priority work, we can just roll back the changes for the time being and handle them later. I'm planning to rewrite the parser anyway, but it'll take at least a month 😸

Delagen · 2018-02-02T17:25:59Z

Thanks @amitguptagwl for the work. I submitted a PR so that CDATA is not decoded.
But it seems the parser ignores sibling text when CDATA is present:

<a>
<![CDATA[asdf]]>a
</a>

result in:

{
"a":{
"#text":"asdf"
}
}

I know it's a rare case, but I am sure the result must be asdfa.
xml2js parses it correctly.

Example:

<a>a
<![CDATA[asdf&amp;]]>&amp;<![CDATA[asdf]]>
a</a>

{ "a": "a\nasdf&amp;&asdf\na" }
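As a cross-check of the expected result, the sketch below feeds the same document to Python's standard `xml.etree.ElementTree`, used here only as a reference parser (it is neither fast-xml-parser nor the other parser mentioned above). Text and CDATA segments are concatenated in document order, and only the entity outside CDATA is decoded.

```python
import xml.etree.ElementTree as ET

# Text, CDATA, an entity in ordinary text, and a second CDATA section.
doc = "<a>a\n<![CDATA[asdf&amp;]]>&amp;<![CDATA[asdf]]>\na</a>"
root = ET.fromstring(doc)
print(repr(root.text))  # 'a\nasdf&amp;&asdf\na'

# The simpler case: sibling text after a CDATA section is kept.
simple = ET.fromstring("<a>\n<![CDATA[asdf]]>a\n</a>")
print(repr(simple.text.strip()))  # 'asdfa'
```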

amitguptagwl · 2018-02-03T15:43:47Z

Sorry, I'm holding this PR because I'm rewriting the parser so that it can handle this situation and large files, reduce the time spent on separate validation, etc. I'm 60% done with the change and will update you on the progress.

Delagen · 2018-02-03T19:03:44Z

Thanks for your project. I'm happy to contribute if the project owner is interested in making the code better. :)

amitguptagwl · 2018-02-08T08:30:03Z

v3 is live and handles this issue.

amitguptagwl closed this as completed in 4c35142 Feb 2, 2018

amitguptagwl added a commit that referenced this issue Feb 2, 2018

Merge pull request #41 from Delagen/master

2485df3

parse html entities in text values, fixes #40

amitguptagwl reopened this Feb 2, 2018

Delagen added a commit to Delagen/fast-xml-parser that referenced this issue Feb 2, 2018

fix(entities): decode only text in attribute values and non CDATA tex…

86fb2e4

…t, updates NaturalIntelligence#40

amitguptagwl closed this as completed Feb 8, 2018

mwardle mentioned this issue Jan 9, 2024

Entities are being processed in CDATA sections #632

Closed

6 tasks
