Attributes parsed more efficiently by Read operations #138

beevik · 2024-07-04T01:51:45Z

When parsing an XML element, this package no longer checks whether attribute names appear more than once in the same element. Instead, duplicate attribute names are allowed, just as with the encoding/xml package.

CreateAttr continues to behave as before ("If an attribute with same key already exists on this element, then its value is replaced").

Deprecated the ReadSettings.PreserveDuplicateAttrs setting since this is now standard behavior and no longer performs any function..

DemiMarie · 2024-07-04T20:49:06Z

When parsing an XML element, this package no longer checks whether attribute names appear more than once in the same element. Instead, duplicate attribute names are allowed, just as with the encoding/xml package.

I’m concerned that this could break applications that relied on this check, possibly in a security-sensitive way. What about using a map instead?

beevik · 2024-07-05T16:11:56Z

I’m concerned that this could break applications that relied on this check, possibly in a security-sensitive way. What about using a map instead?

A map would break the existing API, since Attr is currently an exposed field in Element. (It would also be less space efficient and possibly less time efficient in many cases.)

In retrospect it was a bad idea to default to deduplicating attributes, since it's slow when done with slices of attributes. Also, encoding/xml doesn't do deduplication, and many people switched to this package as an alternative to that one.

I am sensitive to the possibility that this change could break existing applications that depend on deduplication, however. Perhaps this entire change will need to wait until a 2.0 release.

DemiMarie · 2024-07-05T16:46:33Z

I’m concerned that this could break applications that relied on this check, possibly in a security-sensitive way. What about using a map instead?

A map would break the existing API, since Attr is currently an exposed field in Element. (It would also be less space efficient and possibly less time efficient in many cases.)

In retrospect it was a bad idea to default to deduplicating attributes, since it's slow when done with slices of attributes. Also, encoding/xml doesn't do deduplication, and many people switched to this package as an alternative to that one.

An xml.StartElement provides all attributes at once, so one option would be to sort the attributes (using Go’s slices.SortFunc). Once the attributes are sorted, checking for duplicates is trivial. That doesn’t solve the problem of adding new attributes to an element, but I think this is less critical than creating a tree from (possibly malicious) XML.

I am sensitive to the possibility that this change could break existing applications that depend on deduplication, however. Perhaps this entire change will need to wait until a 2.0 release.

I filed golang/go#68295 for Go not rejecting duplicate attributes.

beevik · 2024-07-06T02:41:20Z

An xml.StartElement provides all attributes at once, so one option would be to sort the attributes (using Go’s slices.SortFunc).

This could be done, although I'd still want the processed element to maintain the original attribute order after stripping duplicates. Doing anything else might break existing users of this package. I haven't yet come up with a way to do sorting-based deduplication without the overhead of making extra (temporary) copies of the original attribute slices. Moreover, it would still be O(nlogn), which is not ideal (even though it's better than O(n^2)).

beevik · 2024-07-07T16:44:24Z

Ok I modified the branch to use a temporary map to eliminate duplicate attributes. This implementation perfectly preserves the same result as the current main branch.

When reading an XML document, this package uses a more time-efficient technique to detect and remove attributes with duplicated names (within each element).

beevik · 2024-07-10T03:09:52Z

Branch merged into main.

beevik force-pushed the attr-duplicates branch 3 times, most recently from 7756312 to 52ce29e Compare July 7, 2024 16:41

beevik force-pushed the attr-duplicates branch 8 times, most recently from 055a38c to 97b5e18 Compare July 8, 2024 06:17

Attributes deduplicated more efficiently

97b5e18

When reading an XML document, this package uses a more time-efficient technique to detect and remove attributes with duplicated names (within each element).

beevik closed this Jul 10, 2024

beevik deleted the attr-duplicates branch July 10, 2024 03:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attributes parsed more efficiently by Read operations #138

Attributes parsed more efficiently by Read operations #138

beevik commented Jul 4, 2024

DemiMarie commented Jul 4, 2024

beevik commented Jul 5, 2024 •

edited

Loading

DemiMarie commented Jul 5, 2024

beevik commented Jul 6, 2024

beevik commented Jul 7, 2024

beevik commented Jul 10, 2024

Attributes parsed more efficiently by Read operations #138

Attributes parsed more efficiently by Read operations #138

Conversation

beevik commented Jul 4, 2024

DemiMarie commented Jul 4, 2024

beevik commented Jul 5, 2024 • edited Loading

DemiMarie commented Jul 5, 2024

beevik commented Jul 6, 2024

beevik commented Jul 7, 2024

beevik commented Jul 10, 2024

beevik commented Jul 5, 2024 •

edited

Loading