Decode HTML entities in text values #40

Closed
6 tasks done
Delagen opened this issue Feb 2, 2018 · 12 comments

Comments

@Delagen
Contributor

Delagen commented Feb 2, 2018

Checklist

  • Are you running the latest version?
  • Include sample input XML
  • Include actual output
  • Include expected output
  • Did you try online tool?
  • Did you star the repository for further updates? ;)

I will include a PR soon.

amitguptagwl added a commit that referenced this issue Feb 2, 2018
parse html entities in text values, fixes #40
@amitguptagwl
Member

Published

@amitguptagwl
Member

amitguptagwl commented Feb 2, 2018

I believe it should not parse HTML entities for CDATA.

amitguptagwl reopened this Feb 2, 2018
@Delagen
Contributor Author

Delagen commented Feb 2, 2018

@amitguptagwl CDATA can also contain HTML entities: https://en.wikipedia.org/wiki/CDATA
It just indicates that the content is the node's text value, not markup (children).

@amitguptagwl
Member

amitguptagwl commented Feb 2, 2018

Yes. :)
So what I expect from <tag><![CDATA[&lt;sender&gt;John Smith&lt;/sender&gt;]]></tag> is

{
   "tag" : "&lt;sender&gt;John Smith&lt;/sender&gt;"
}

But what I'm getting from current parser is

{
   "tag" : "<sender>John Smith</sender>"
}

Expectation 2: <tag><![CDATA[<sender>John Smith</sender>]]></tag> is

{
   "tag" : "<sender>John Smith</sender>"
}

Expectation 3: <tag>&lt;sender&gt;John Smith&lt;/sender&gt;</tag>, which is equivalent to <tag><![CDATA[<sender>John Smith</sender>]]></tag>, is

{
   "tag" : "<sender>John Smith</sender>"
}
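
The three expectations above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code in fast-xml-parser: `decodeEntities` and `elementValue` are hypothetical names, only the five predefined XML entities are handled, and the element content is assumed to be either one CDATA section or plain character data.

```javascript
// Hypothetical helpers for illustration; not part of fast-xml-parser.
const PREDEFINED = { lt: "<", gt: ">", amp: "&", quot: '"', apos: "'" };

// Decode the five predefined XML entities in a single pass,
// so "&amp;lt;" becomes "&lt;" and is not decoded a second time.
function decodeEntities(text) {
  return text.replace(/&(lt|gt|amp|quot|apos);/g, (_, name) => PREDEFINED[name]);
}

// Return the value of an element whose content is either a single
// CDATA section (taken verbatim) or plain character data (decoded).
function elementValue(content) {
  const cdata = /^<!\[CDATA\[([\s\S]*?)\]\]>$/.exec(content);
  return cdata ? cdata[1] : decodeEntities(content);
}

// Expectation 1: entities inside CDATA stay as written.
console.log(elementValue("<![CDATA[&lt;sender&gt;John Smith&lt;/sender&gt;]]>"));
// Expectation 2: literal markup inside CDATA stays as written.
console.log(elementValue("<![CDATA[<sender>John Smith</sender>]]>"));
// Expectation 3: entities in ordinary text are decoded.
console.log(elementValue("&lt;sender&gt;John Smith&lt;/sender&gt;"));
```

With this sketch, expectations 2 and 3 produce the same string, while expectation 1 keeps the `&lt;` and `&gt;` sequences untouched.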

@Delagen
Contributor Author

Delagen commented Feb 2, 2018

But what I'm getting from current parser is

And that is correct.

I started a holy war about this at work :)
The spec only says:

left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;"

but it says nothing about other entities, and even this is not specified precisely: NEED NOT and CANNOT are not the same as MUST NOT.

For example, CDATA can contain HTML fragments:

<tag><![CDATA[<sender>&amp;John Smith&copy;</sender>]]></tag>

For consistency, you could disable entity decoding for CDATA, or make it optional. But that needs more refactoring, because as far as I know the code joins CDATA and text together.

@amitguptagwl
Member

As per the wiki,

Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup. CDATA sections are useful for writing XML code as text data within an XML document.

if the numeric character reference &#240; appears in element content, it will be interpreted as the single Unicode character 00F0 (small letter eth). But if the same appears in a CDATA section, it will be parsed as six characters: ampersand, hash mark, digit 2, digit 4, digit 0, semicolon.

The above in my own words (as I understand it):

if &#240; (which represents ð) appears in a CDATA section, it should not be decoded to ð but kept as the literal &#240;.
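
The quoted rule about numeric character references can be illustrated as follows. This is a minimal sketch, assuming a hypothetical `decodeNumericRefs` helper that a parser would apply to element content only; it is not taken from the project's code.

```javascript
// Hypothetical helper: decode decimal (&#240;) and hexadecimal (&#xF0;)
// character references. Intended for element content, never for CDATA.
function decodeNumericRefs(text) {
  return text.replace(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave references beyond the Unicode range untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

// In element content the reference becomes one character, U+00F0.
const inElementContent = decodeNumericRefs("&#240;");
// In a CDATA section the same text is kept verbatim: six characters.
const inCdata = "&#240;";

console.log(inElementContent.length); // 1
console.log(inCdata.length);          // 6
```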

@Delagen
Contributor Author

Delagen commented Feb 2, 2018

@amitguptagwl I will do some refactoring and submit a PR tomorrow.

@amitguptagwl
Member

Thanks @Delagen. If you are busy with other higher-priority work, we can just roll back the changes for the time being and handle them later. I'm planning to rewrite the parser anyway, but it'll take at least a month 😸

Delagen added a commit to Delagen/fast-xml-parser that referenced this issue Feb 2, 2018
@Delagen
Contributor Author

Delagen commented Feb 2, 2018

Thanks @amitguptagwl for your work. I submitted a PR to not decode CDATA.
But it seems the parser ignores sibling text when CDATA is present:

<a>
<![CDATA[asdf]]>a
</a>

results in:

{
"a":{
"#text":"asdf"
}
}

I know it's a rare case, but I am sure it must be asdfa.
xml2js parses it correctly.

Example:

<a>a
<![CDATA[asdf&amp;]]>&amp;<![CDATA[asdf]]>
a</a>
{ "a": "a\nasdf&amp;&asdf\na" }
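
The concatenation behavior in the example above could be sketched like this. It is a rough illustration under assumptions, not the parser's actual implementation: `joinCharacterData` is a hypothetical name, only the five predefined entities are decoded, and the input is assumed to be the raw content of an element with no child elements.

```javascript
// Hypothetical helpers for illustration; not part of fast-xml-parser.
const PREDEFINED = { lt: "<", gt: ">", amp: "&", quot: '"', apos: "'" };

function decodeEntities(text) {
  return text.replace(/&(lt|gt|amp|quot|apos);/g, (_, name) => PREDEFINED[name]);
}

// Walk the element content, decoding entities in ordinary text and
// copying CDATA sections verbatim, then join everything in document order.
function joinCharacterData(content) {
  const cdataPattern = /<!\[CDATA\[([\s\S]*?)\]\]>/g;
  let result = "";
  let lastIndex = 0;
  for (const match of content.matchAll(cdataPattern)) {
    result += decodeEntities(content.slice(lastIndex, match.index)); // text before CDATA
    result += match[1];                                              // CDATA body, untouched
    lastIndex = match.index + match[0].length;
  }
  return result + decodeEntities(content.slice(lastIndex)); // trailing text
}

const content = "a\n<![CDATA[asdf&amp;]]>&amp;<![CDATA[asdf]]>\na";
console.log(JSON.stringify(joinCharacterData(content))); // "a\nasdf&amp;&asdf\na"
```

Keeping the sibling text instead of dropping it is what makes the first case in this comment come out as asdfa once surrounding whitespace is trimmed.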

@amitguptagwl
Member

Sorry, I'm holding this PR as I'm rewriting the parser so that it can handle this situation and large files, reduce the time spent on separate validation, etc. I'm 60% done with the change and will update you on the progress.

@Delagen
Contributor Author

Delagen commented Feb 3, 2018

Thanks for your project. I would be glad to contribute if the project owner is interested in making the code better. :)

@amitguptagwl
Member

v3 is live and handles this issue.
