feat: add benchmarks for XML deserialization #605

ianbotsf · 2022-03-17T19:16:25Z

Issue #

Addresses aws-sdk-kotlin#538

Description of changes

This change adds benchmarks for XML lexing and deserialization.

As part of running the benchmarks and comparing the results with the previous XmlPull implementation, several optimizations were made to the lexer. (Unfortunately, some of these changes hamper the readability and separation of that part of the code. Suggestions welcome on how to address that.)
Now that benchmarks show roughly even performance between the new XML lexer and XmlPull, the XmlPull deserializer is removed.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

aajtodd

Looks great overall. A couple of minor suggestions.

aajtodd · 2022-03-18T16:47:09Z

...de/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/tokenization/StringTextStream.kt

-    fun peekAtMost(length: Int): String {
-        val actualLength = min(length, end - offset)
-        return sliceByLength(actualLength)
+    private fun checkBounds(length: Int, errCondition: String) {


you could probably mark this as inline

I can do that but there's no significant difference in benchmark speed either way.

then no point inlining

aajtodd · 2022-03-18T16:49:18Z

...de/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/tokenization/StringTextStream.kt

+            '\u2fef' < c && c < '\u3001' ||
+            '\ud7ff' < c
+        ) {
+            error("Unable to find valid XML start name character")


may be useful to add the character that was found
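A minimal sketch of that suggestion, assuming hypothetical helper names (`describeChar` and `invalidNameStartMessage` are illustrative and not part of the lexer). It includes both the character and its code point so that unprintable characters remain identifiable:

```kotlin
// Hypothetical helper (not from the PR): renders a character together with its
// UTF-16 code unit in U+XXXX form.
fun describeChar(c: Char): String =
    "'$c' (U+${c.code.toString(16).uppercase().padStart(4, '0')})"

// Hypothetical helper: builds the error text with the offending character appended.
fun invalidNameStartMessage(c: Char): String =
    "Unable to find valid XML start name character; found ${describeChar(c)}"
```

The call site would then read `error(invalidNameStartMessage(c))` instead of passing the fixed string.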

aajtodd · 2022-03-18T17:16:31Z

runtime/serde/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/tokenization/XmlLexer.kt

-private fun Char.toRange() = this..this
-
-// https://www.w3.org/TR/xml/#NT-Name
-private val nameStartCharRanges = setOf(


interesting. I would not have expected these to be a bottleneck. Certainly easier to read and map to the spec this way.

Agreed, it was more readable before. I think the bottleneck arises because neither the JVM stdlib nor the Kotlin stdlib special-cases small sets or sets of scalars. This used a full linked hash set implementation with Char boxing/unboxing to perform the lookups, which was slower than a tight if condition directly on scalars.
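To make the trade-off concrete, here is a rough sketch of the two styles side by side. It is an illustration under assumptions, not the removed code: the range list is a deliberately incomplete subset of the spec's NameStartChar production, and the `any` scan is a guess at how the set was consulted. The function names are hypothetical.

```kotlin
// Deliberately incomplete subset of the NameStartChar ranges from
// https://www.w3.org/TR/xml/#NT-Name (ASCII plus two Latin-1 ranges).
private val nameStartCharRanges: Set<CharRange> = setOf(
    ':'..':',
    'A'..'Z',
    '_'..'_',
    'a'..'z',
    '\u00c0'..'\u00d6',
    '\u00d8'..'\u00f6',
)

// Set-based style (assumed shape): iterates the set and tests each range object.
fun isNameStartViaSet(c: Char): Boolean = nameStartCharRanges.any { c in it }

// Same subset written as direct comparisons on the Char value,
// with no collection or iterator involved.
fun isNameStartViaComparisons(c: Char): Boolean =
    'a' <= c && c <= 'z' ||
        'A' <= c && c <= 'Z' ||
        c == ':' || c == '_' ||
        '\u00c0' <= c && c <= '\u00d6' ||
        '\u00d8' <= c && c <= '\u00f6'
```

Both functions accept the same characters; the second trades the one-to-one mapping to the spec for a form that needs no collection traversal.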

aajtodd · 2022-03-18T17:30:46Z

...de/serde-xml/common/src/aws/smithy/kotlin/runtime/serde/xml/tokenization/StringTextStream.kt

+        var peekOffset = offset + 1
+        while (peekOffset < end) {
+            val ch = source[peekOffset]
+            if (


The multiline if statement is fine. Though it may be cleaner to look for the valid character ranges rather than the invalid ones. I'd also expect that most XML we get is going to end a name by finding the end of the tag > (or a space indicating start of an attribute).

The way this is currently structured, you end up checking every branch to prove that a character isn't an invalid name char, and you keep doing so until you hit an invalid character. It may be faster to look for valid characters (prioritizing ASCII) first.

The opposite may be quicker since we expect to find valid chars in most cases (especially given that the XML names come from Smithy shape names, which use a restricted character set anyway).

    when (val ch = source[peekOffset]) {
        in 'a'..'z', in 'A'..'Z', ..., in '\u203f'..'\u2040' -> { peekOffset++; continue }
        else -> error(...)
    }

You could even prioritize the common branches for how a name will end:

    when (val ch = source[peekOffset++]) {
        in 'a'..'z', in 'A'..'Z' -> continue
        ' ', '>' -> break
        // other valid cases
        else -> invalid
    }

You're right, optimizing for valid and expected characters first improves the performance. Using when or range checks (e.g., c in 'A'..'Z') actually hurts performance so I'll skip those (although they are more readable).
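A sketch of what "valid and expected characters first" could look like with plain comparisons. This is not the merged implementation: `scanNameEnd` and `isRareNameChar` are hypothetical names, the terminator set is an assumption, the non-ASCII check covers only a small subset of the spec's NameChar ranges, and the first character of the name is assumed to have been validated already.

```kotlin
// Hypothetical slow path: a small, incomplete subset of the non-ASCII NameChar ranges.
fun isRareNameChar(c: Char): Boolean =
    c == '\u00b7' ||
        '\u00c0' <= c && c <= '\u00d6' ||
        '\u00d8' <= c && c <= '\u00f6' ||
        '\u203f' <= c && c <= '\u2040'

// Returns the index just past the name whose characters begin at `start`.
// Assumes the character at `start` was already checked as a valid name start.
fun scanNameEnd(source: String, start: Int): Int {
    var i = start
    while (i < source.length) {
        val c = source[i]
        if (
            'a' <= c && c <= 'z' ||
            'A' <= c && c <= 'Z' ||
            '0' <= c && c <= '9' ||
            c == ':' || c == '-' || c == '_' || c == '.'
        ) {
            i++ // fast path: common ASCII name characters
        } else if (c == ' ' || c == '>' || c == '/' || c == '=') {
            return i // expected ways for a name to end
        } else if (isRareNameChar(c)) {
            i++ // slow path: rarely seen but valid characters
        } else {
            error("Invalid XML name character '$c' at index $i")
        }
    }
    return i // input ended while still inside the name
}
```

For example, `scanNameEnd("<Foo bar>", 1)` stops at the space after `Foo`. The ordering means a typical ASCII name costs only the first branch per character, and the rare-character ranges are consulted only when both the fast path and the terminator check miss.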

sonarqubecloud · 2022-03-21T16:51:44Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
3 Code Smells

No Coverage information
0.0% Duplication

feat: add benchmarks for XML deserialization

bf1ba4d

ianbotsf requested a review from a team as a code owner March 17, 2022 19:16

ianbotsf requested a review from aajtodd March 17, 2022 19:16

aajtodd approved these changes Mar 18, 2022


ianbotsf added 2 commits March 21, 2022 16:47

addressing PR feedback with more optimizations

55477d9

updating README.md with latest benchmark baseline

07f0a3c

ianbotsf merged commit dec805b into feat-kmp-xml Mar 21, 2022

ianbotsf deleted the xml-deserializer-benchmarks branch March 21, 2022 17:02
