
Static Object Optimizations #197

Merged: 18 commits merged into master on Jan 9, 2024
Conversation

@ahirreddy (Collaborator) commented on Jan 4, 2024:

• This PR uses the Parser & StaticOptimizer thread-local string interner for keys in static objects (a sketch of the interning idea follows this list).
• Similarly, we deduplicate the String -> Boolean map used to determine whether a field is static.
• For static objects we also use immutable.VectorMap (with a JavaWrapper) for the field set.
• Lastly, for the value cache, we size it according to the number of keys in the object. This reduces unnecessary up-sizing for large objects and, more importantly, removes the large number of sparse maps we previously had for small objects (the default was 16 elements).
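The interner itself isn't shown in this thread; here is a minimal sketch of the thread-local deduplication idea (the names are hypothetical, not sjsonnet's actual API):

```scala
import scala.collection.mutable

object KeyInterner {
  // One cache per thread, so no synchronization is needed on the hot path.
  private val cache = new ThreadLocal[mutable.HashMap[String, String]] {
    override def initialValue(): mutable.HashMap[String, String] =
      new mutable.HashMap[String, String]
  }

  // Returns a canonical instance of `s`; identical keys across many static
  // objects then share a single String on the heap.
  def intern(s: String): String = cache.get().getOrElseUpdate(s, s)
}
```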

Before: 855MB for the parsed file

After: 425MB

@lihaoyi removed their request for review on January 5, 2024.
@szeiger (Collaborator) left a comment:

What's the performance impact of these changes? Are we trading memory for speed?

Comment on lines 77 to 80:

```scala
// This is a dummy benchmark to see how much memory is used by the interpreter.
// You're meant to execute it, and once it prints "sleeping" you can attach YourKit and take a heap
// dump. Because we store the cache, the parsed objects will have strong references - and thus will
// be in the heap dump.
```
@szeiger:

Could we turn this into a simple command-line program that runs a GC before and after parsing and then prints the memory usage diff to the console, so you can easily retest this for future changes without having to attach a profiler?

@ahirreddy (Author):
Done and updated readme.
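A harness along the lines suggested might look roughly like this (a sketch only, not the code that landed; `parseInput` is a placeholder for the real parse step):

```scala
object MemoryDiff {
  // Ask for a GC and report used heap; System.gc() is only a hint, so the
  // numbers are approximate.
  private def usedHeap(): Long = {
    System.gc()
    Thread.sleep(200)
    val rt = Runtime.getRuntime
    rt.totalMemory() - rt.freeMemory()
  }

  def main(args: Array[String]): Unit = {
    val before = usedHeap()
    // Stand-in for the real work: parse the input and keep a strong
    // reference so the parsed tree survives the second GC.
    val parsed: AnyRef = parseInput()
    val after = usedHeap()
    println(s"Retained by parse: ${(after - before) / (1024L * 1024L)} MB")
    require(parsed ne null)
  }

  private def parseInput(): AnyRef =
    Array.fill(1000000)("x" * 16) // placeholder allocation
}
```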

Comment on lines 21 to 25:

```scala
// HashMap to deduplicate strings.
private[this] val strings = new mutable.HashMap[String, String]

private[this] val fieldSet = new mutable.HashMap[Val.StaticObjectFieldSet, java.util.LinkedHashMap[String, java.lang.Boolean]]
```
@szeiger:

Why is this cache separate from the Parser's? Should it be a single cache at the Interpreter level?

@ahirreddy (Author):

Changed, there's now a single cache at the interpreter level.

Comment on diff `@@ -297,15 +298,45 @@` in `object Val`:

```scala
def staticObject(pos: Position, fields: Array[Expr.Member.Field]): Obj = {
  final case class StaticObjectFieldSet(keys: Array[String]) {
```
@szeiger:

This shouldn't be a case class. Array has no useful equality or toString, and you're overriding equals and hashCode anyway.

@ahirreddy (Author):

Fixed
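A sketch of the shape of the fix, using `java.util.Arrays` for element-wise equality (the merged code may differ):

```scala
import java.util.Arrays

// A plain class with Arrays-based equals/hashCode, since an Array's
// default equality is reference identity.
final class StaticObjectFieldSet(private val keys: Array[String]) {
  override def hashCode: Int =
    Arrays.hashCode(keys.asInstanceOf[Array[AnyRef]])
  override def equals(other: Any): Boolean = other match {
    case that: StaticObjectFieldSet =>
      Arrays.equals(keys.asInstanceOf[Array[AnyRef]],
                    that.keys.asInstanceOf[Array[AnyRef]])
    case _ => false
  }
}
```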

@ahirreddy (Author) commented:
Sorry I missed your most important question. The performance impact here was undetectable in the benchmark. I think the fact that the object interning is thread-local and unsynchronized makes it pretty fast.

Comment on diff `@@ -3,6 +3,8 @@` in `package sjsonnet`:

```scala
import java.io.StringWriter
import java.util.concurrent.TimeUnit

import scala.collection.mutable.HashMap
```
@szeiger:
I'd prefer to keep mutable types qualified (i.e. only import scala.collection.mutable) everywhere for consistency.

@lihaoyi-databricks lihaoyi-databricks merged commit c462833 into master Jan 9, 2024
1 check passed
stephenamar-db pushed a commit that referenced this pull request Jan 10, 2025
…e object creation in common cases (#258)

This PR bundles together several small optimizations, most aimed at
reducing garbage object creation. Collectively, these changes result in
a large performance improvement for some of our largest jsonnet inputs.

## Optimizations

These are somewhat coherently split into multiple smaller intermediate commits, which can help when navigating this change. I'll describe each optimization in more detail below.

### Optimizing `Obj` key lookup methods for objects without `super`

The `hasKeys`, `containsKey`, `containsVisibleKey`, `allKeyNames`, and
`visibleKeyNames` methods can be optimized in the common case of objects
that don't have `super` objects.

We already perform an optimization for `static` objects, pre-populating
`allKeys`, but for non-static objects we had to run `gatherAllKeys` to
populate a `LinkedHashMap` of keys and a boolean indicating visibility.

If a non-static object has no `super` then we can just use `value0` to
compute the keys: this avoids an additional `LinkedHashMap` allocation
and also lets us pre-size the resulting string arrays, avoiding wasteful
array copies from resizes or trims.

In `visibleKeyNames`, I chose to pre-size the output builder based on the _total_ key count: this is based on a common-case assumption that most objects don't have hidden keys.

This optimization makes a huge difference in `std.foldl(std.mergePatch,
listOfPatches, {})`, where the intermediate merge targets' visible keys are
repeatedly recomputed. In these cases, the intermediate objects contain
_only_ visible keys, allowing this optimization to maximally avoid
unnecessary array allocations. A sketch of the fast path follows.
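Here is a hedged sketch of that super-free fast path; the types are stand-ins (sjsonnet's real `value0` and member types differ slightly), and the visibility flag is an assumption:

```scala
// Stand-in for sjsonnet's member type; only the visibility flag matters here.
final case class Member(visible: Boolean)

// When there is no `super`, the object's own `value0` map already holds
// every key, so we can size the output array exactly. The array is only
// trimmed if hidden keys turned out to be present.
def visibleKeyNames(value0: java.util.LinkedHashMap[String, Member]): Array[String] = {
  val out = new Array[String](value0.size())
  var n = 0
  val it = value0.entrySet().iterator()
  while (it.hasNext) {
    val e = it.next()
    if (e.getValue.visible) { // assumed per-member visibility flag
      out(n) = e.getKey
      n += 1
    }
  }
  if (n == out.length) out else java.util.Arrays.copyOf(out, n)
}
```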

### Pre-size various hash maps 

This builds on an idea from #197: there are multiple places where we construct hashmaps that are either over- or under-sized. An over-sized map wastes space (I saw >90% of backing array slots wasted in some heap dumps), and an under-sized map wastes time and space in re-sizing upwards during construction.

Here, I've generalized that PR's pre-sizing to apply in more contexts.

One notable special case is the `valueCache`: if an object inherits
fields then it's not free to determine the map size. As a result, I've
only special-sized this for `super`-free objects. This map is a little
bit different from `value0` or `allFields` because its final size is a
function of whether or not object field values are actually computed.
Given this, I've chosen to start pretty conservatively by avoiding
changing the size in cases where it's not an obvious win; I may revisit
this further in a future followup.
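The sizing rule itself is mechanical; a sketch of the kind of helper this implies (the name is hypothetical):

```scala
// Pick an initial capacity so that `expected` entries fit under the default
// 0.75 load factor without any intermediate rehash; this mirrors
// java.util.HashMap's growth rule.
def preSizedMap[K, V](expected: Int): java.util.HashMap[K, V] =
  new java.util.HashMap[K, V](math.max(4, (expected / 0.75).toInt + 1))
```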

### Change `valueCache` from a Scala map to a Java map

This was originally necessary because the Scala 2.12 version of
`mutable.HashMap` does not support capacity / load factor configuration,
which got in the way with the pre-sizing work described above.

But a nice secondary benefit is that Java maps let me avoid closure /
anonfun allocation in `map.getOrElse(k, default)` calls: even if we
don't invoke `default`, we still end up doing some allocations for the
lambda / closure / thunk. I had noticed this overhead previously in
`Obj.value` and this optimization should fix it.
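To illustrate the difference (a sketch; `compute` is a stand-in, not sjsonnet's API):

```scala
import java.util.{HashMap => JHashMap}

// With mutable.HashMap, `cache.getOrElse(k, compute(k))` passes the default
// by name, allocating a thunk even on a cache hit. A Java map returns null
// on a miss, so the hit path allocates nothing:
def cachedValue(cache: JHashMap[String, AnyRef], key: String): AnyRef = {
  val hit = cache.get(key)
  if (hit != null) hit
  else {
    val v = compute(key) // hypothetical miss path
    cache.put(key, v)
    v
  }
}

def compute(key: String): AnyRef = key.reverse // stand-in
```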

### Remove most Scala sugar in `std.mergePatch`, plus other optimizations

The `recMerge` and `recSingle` methods used by `std.mergePatch`
contained big Scala `for` comprehensions and used `Option` for handling
nulls. This improves readability but comes at a surprising performance
cost. I would have naively assumed that most of those overheads would
have been optimized out but empirically this was not the case in my
benchmarks.

Here, I've rewritten this with Java-style imperative `while` loops and
explicit null checks.
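An illustrative before/after of the pattern (not the actual `mergePatch` body):

```scala
// Sugared: the tuple extraction and filter allocate closures per element.
def sugared(pairs: Seq[(String, AnyRef)]): Seq[String] =
  for ((k, v) <- pairs if v != null) yield k

// Imperative rewrite: explicit indices and null checks, no per-element
// allocation beyond the output array itself.
def imperative(pairs: Array[(String, AnyRef)]): Array[String] = {
  val out = new Array[String](pairs.length)
  var i = 0
  var n = 0
  while (i < pairs.length) {
    val p = pairs(i)
    if (p._2 != null) { out(n) = p._1; n += 1 }
    i += 1
  }
  if (n == out.length) out else java.util.Arrays.copyOf(out, n)
}
```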

### Optimize `std.mergePatch`'s distinct key computation

After fixing other bottlenecks, I noticed that the 

```scala
val keys: Array[String] = (l.visibleKeyNames ++ r.visibleKeyNames).distinct
```

step in `std.mergePatch` was very expensive. Under the hood, this
constructs a combined array, allocates an ArrayBuilder, and uses an
intermediate HashSet for detecting already-seen keys.

Here, I've added an optimized fast implementation for the cases where
`r.visibleKeyNames.length < 8`. I think it's much more common for the
LHS of a merge to be large and the RHS to be small, in which case we're
conceptually better off by building a hash set on the RHS and removing
RHS elements as they're seen during the LHS traversal. But if the RHS is
small enough then the cost of hashing and probing will be higher than a
simple linear scan of a small RHS array.

Here, `8` is a somewhat arbitrarily chosen threshold based on some local
microbenchmarking.
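A sketch of the small-RHS linear-scan path under the assumptions above (the merged implementation may differ; it assumes each input array is internally duplicate-free, as `visibleKeyNames` is):

```scala
import scala.collection.mutable

def distinctKeys(lhs: Array[String], rhs: Array[String]): Array[String] = {
  val out = new mutable.ArrayBuilder.ofRef[String]
  out.sizeHint(lhs.length + rhs.length)
  out ++= lhs
  var i = 0
  while (i < rhs.length) {
    val k = rhs(i)
    // Linear containment scan over the LHS: cheaper than building and
    // probing a hash set when the RHS is small.
    var j = 0
    var seen = false
    while (j < lhs.length && !seen) {
      if (lhs(j) == k) seen = true
      j += 1
    }
    if (!seen) out += k
    i += 1
  }
  out.result()
}
```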

### Special overload of `Val.Obj.mk` to skip an array copy

Pretty self-explanatory: we often have an `Array[(String, Val.Member)]`
and we can avoid a copy by defining a `Val.Obj.mk` overload which
accepts the array directly.
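A hedged sketch of what the overload pair might look like, with stand-in types rather than sjsonnet's real ones:

```scala
// Stand-ins for sjsonnet's types, just to make the shape concrete.
final class Position
final class Member
final class Obj(val attrs: Array[(String, Member)])

object ObjFactory {
  // Varargs convenience: copies the Seq into an array, then delegates.
  def mk(pos: Position, attrs: (String, Member)*): Obj = mk(pos, attrs.toArray)
  // Array overload: uses the caller's array directly, skipping the copy.
  def mk(pos: Position, attrs: Array[(String, Member)]): Obj = new Obj(attrs)
}
```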

### Make `PrettyNamed` implicits into constants

This is pretty straightforward, just changing a `def` to a `val`, but it makes a huge difference in reducing ephemeral garbage in some parts of the evaluator.
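A minimal illustration of why this matters (assuming a simple `PrettyNamed` constructor; the real definition may differ):

```scala
class PrettyNamed[T](val name: String)

object Before {
  // A new PrettyNamed is allocated at every implicit resolution site:
  implicit def stringName: PrettyNamed[String] = new PrettyNamed("string")
}
object After {
  // Allocated once, then reused as a constant:
  implicit val stringName: PrettyNamed[String] = new PrettyNamed("string")
}
```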

## Other changes

I also added `Error.fail` calls in a couple of `case _ =>` matches which
should never be hit. We weren't actually hitting these, but it felt like
a potentially dangerous pitfall to silently ignore those cases.