Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attributes parsed more efficiently by Read operations #138

Closed
wants to merge 1 commit into from

Conversation

beevik
Copy link
Owner

@beevik beevik commented Jul 4, 2024

When parsing an XML element, this package no longer checks whether attribute names appear more than once in the same element. Instead, duplicate attribute names are allowed, just as with the encoding/xml package.

CreateAttr continues to behave as before ("If an attribute with same key already exists on this element, then its value is replaced").

Deprecated the ReadSettings.PreserveDuplicateAttrs setting since this is now standard behavior and no longer performs any function..

@DemiMarie
Copy link

When parsing an XML element, this package no longer checks whether attribute names appear more than once in the same element. Instead, duplicate attribute names are allowed, just as with the encoding/xml package.

I’m concerned that this could break applications that relied on this check, possibly in a security-sensitive way. What about using a map instead?

@beevik
Copy link
Owner Author

beevik commented Jul 5, 2024

I’m concerned that this could break applications that relied on this check, possibly in a security-sensitive way. What about using a map instead?

A map would break the existing API, since Attr is currently an exposed field in Element. (It would also be less space efficient and possibly less time efficient in many cases.)

In retrospect it was a bad idea to default to deduplicating attributes, since it's slow when done with slices of attributes. Also, encoding/xml doesn't do deduplication, and many people switched to this package as an alternative to that one.

I am sensitive to the possibility that this change could break existing applications that depend on deduplication, however. Perhaps this entire change will need to wait until a 2.0 release.

@DemiMarie
Copy link

I’m concerned that this could break applications that relied on this check, possibly in a security-sensitive way. What about using a map instead?

A map would break the existing API, since Attr is currently an exposed field in Element. (It would also be less space efficient and possibly less time efficient in many cases.)

In retrospect it was a bad idea to default to deduplicating attributes, since it's slow when done with slices of attributes. Also, encoding/xml doesn't do deduplication, and many people switched to this package as an alternative to that one.

An xml.StartElement provides all attributes at once, so one option would be to sort the attributes (using Go’s slices.SortFunc). Once the attributes are sorted, checking for duplicates is trivial. That doesn’t solve the problem of adding new attributes to an element, but I think this is less critical than creating a tree from (possibly malicious) XML.

I am sensitive to the possibility that this change could break existing applications that depend on deduplication, however. Perhaps this entire change will need to wait until a 2.0 release.

I filed golang/go#68295 for Go not rejecting duplicate attributes.

@beevik
Copy link
Owner Author

beevik commented Jul 6, 2024

An xml.StartElement provides all attributes at once, so one option would be to sort the attributes (using Go’s slices.SortFunc).

This could be done, although I'd still want the processed element to maintain the original attribute order after stripping duplicates. Doing anything else might break existing users of this package. I haven't yet come up with a way to do sorting-based deduplication without the overhead of making extra (temporary) copies of the original attribute slices. Moreover, it would still be O(nlogn), which is not ideal (even though it's better than O(n^2)).

@beevik beevik force-pushed the attr-duplicates branch 3 times, most recently from 7756312 to 52ce29e Compare July 7, 2024 16:41
@beevik
Copy link
Owner Author

beevik commented Jul 7, 2024

Ok I modified the branch to use a temporary map to eliminate duplicate attributes. This implementation perfectly preserves the same result as the current main branch.

@beevik beevik force-pushed the attr-duplicates branch 8 times, most recently from 055a38c to 97b5e18 Compare July 8, 2024 06:17
When reading an XML document, this package uses a more
time-efficient technique to detect and remove attributes
with duplicated names (within each element).
@beevik
Copy link
Owner Author

beevik commented Jul 10, 2024

Branch merged into main.

@beevik beevik closed this Jul 10, 2024
@beevik beevik deleted the attr-duplicates branch July 10, 2024 03:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants