Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate YAML with comments indicating which line of code generated each item #105

Merged
merged 18 commits into from
Dec 16, 2020

Conversation

lihaoyi-databricks
Copy link
Contributor

@lihaoyi-databricks lihaoyi-databricks commented Dec 4, 2020

Generates YAML that looks like this:

builds: # kube-config/runbot/dev/runbot-app.jsonnet:159:5
- args: # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:147:29
  - name: sha # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:147:37
    required: true # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:147:54
  autorun: # kube-config/runbot/autorun.jsonnet.TEMPLATE:3:22
  - repeat: true # kube-config/runbot/autorun.jsonnet.TEMPLATE:12:13
    sequence: # kube-config/runbot/autorun.jsonnet.TEMPLATE:4:16
    - args: # kube-config/runbot/autorun.jsonnet.TEMPLATE:10:15
        sha: # kube-config/runbot/autorun.jsonnet.TEMPLATE:10:22
          value: origin/master # kube-config/runbot/autorun.jsonnet.TEMPLATE:10:29
      interval: 3600000 # kube-config/runbot/autorun.jsonnet.TEMPLATE:8:29
  cleanupCommands: # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:44
  - - bash # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:221:8
    - -c # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:221:16
    - docker ps -aq | xargs docker rm -f; true # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:221:22
  - - df # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:222:8
    - -h # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:222:14
  - - git # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:48
    - clean # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:55
    - -xdf # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:64
  - - git # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:74
    - reset # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:81
    - --hard # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:90
    - HEAD # kube-config/runbot/runbot-app.jsonnet.TEMPLATE:202:100

The idea being to make it easier to figure out how the materialized values are being constructed: by letting the user get back to the last line of jsonnet that generated a value, they can then use the https://github.com/databricks/intellij-jsonnet IDE plugin to further explore the jsonnet code to see how the values are passed into that line.

How it works

We capture a Position value along with every runtime value sjsonnet.Val in our program. This is a thing wrapper around a file path and a character offset; we don't do any heavy computation until the end when we need to generate path+linenumber comments, and so it's reasonably lightweight and cheap as long as comment-generation isn't enabled (see perf numbers below). Every time a new value is computed, we store the current Position of the expression that computed it, and the positions of any upstream values that fed into the computation are discarded/ignored

Although we could in theory capture the entire computation graph (DAG) leading up to each value, the size of this forest of graphs is likely to be large and too verbose for general usage. The IDE effectively lets the user interactively explore a subset of the graph, which is probably good enough for now. Capturing and making sense of the entire computation forest can be explored in future

Only works with direct YAML output, since JSON doesn't allow comments, though in theory we could generate a look-aside-table/source-map-file with the same information and somehow get the IDE to hook them up.

As the upickle.core.Visitor interface doesn't support anything other than Ints as offsets to each callback method, we instead plumb the currentPos in a mutable variable which the Materializer sets using a callback and the PrettyYamlRenderer reads using a callback. Not pretty, but it works. While in theory we could encode the entire Position value using the offset: Int that upickle.core.Visitor receives, I suspect it would be even more inelegant than this shared mutable variable thing.

I had to plumb offset: Int or pos: Position information into a bunch of places where we previously didn't need it (e.g. all of Std.scala). Where previously we only needed position information where we have a chance of throwing errors, now we need position information in all places where we construct new sjsonnet.Val objects.

Output comment generation is controlled under a flag --yaml-debug

Comment Positioning

Due to the somewhat heterogenous output style of PrettyYamlRenderer, which is copied from the PyYAML package, there is some subtlety in how the positioning of comments works. At a high-level:

  • For most terminal values (bools, strings, nulls, numbers, empty arrays, empty lists), these are only ever printed at the end of a line, and so the source line comment is simply appended at the end of the line after the value, EXCEPT

  • For Strings rendered block-style (|- , |+, |2, etc.) the only place we can put a comment is after the block delimiter, before the first line of the string, so we put the source line comment there.

  • Lists and dictionaries in general are not guaranteed to appear on a line on their own (e.g. you can have nested lists in one line - - - foo or nested dictionaries - foo: bar) and so in general we do not include a source line comment for where collections values were constructed, even though we have the data available, EXCEPT

  • Lists and dictionaries that are themselves values in another dictionary do have a spot of space where we can place a comment: after the preceding key of the enclosing dictionary. This is because nested lists/dictionaries on one line such as - - foo: can only ever have a dictionary in the right-most position, and if the dictionary key is a collection (whether list or dict) that is always placed on the following lines. Thus in this case we put the source line comment of the following dictionary value at the end of the line of the preceding dictionary key

Performance

Performance on my laptop, # of iterations per 20s of generation some arbitrary set of config files:

Case Run1 Run2 Run3 Avg StdDev
Master Generating JSON 1157 1142 1140 1146 9
PR Generating JSON, positions disabled with env var 1139 1107 1119 1122 16
PR Generating JSON, positions enabled 1126 1093 1101 1106 17
PR Generating PrettyYAML, positions enabled 670 698 681 683 14
PR Generating PrettyYAML + Comments, positions enabled 263 294 292 283 17

In addition to the code in this PR, I tried disabling Position-management "globally" via an environment variable (made Position.apply return null, wrapped a lot of position-related code blocks in if guards) to see if it made a significant difference. It turns out that even with positions enabled, the performance degradation of PR Generating JSON, positions enabled compared to the Master generating JSON benchmark is small enough (2-3%) that it's probably not worth the complexity of juggling environment variables

The PR Generating PrettyYAML + Comments, positions enabled Is about 2x slower than without the source line comment generation; this is about as much a slowdown as I expected. To reach this, I had to cache calls to fastparse.internal.Util.lineNumberLookup that performs the building of line-number tables for the Jsonnet files we load, and
I re-implemented fastparse.IndexedParserInput.prettyIndex in sjsonnet.Util.prettyIndex using a binary search for performance rather than a linear scan; without these changes, it's about 300x slower on the arbitrary set of configs I was profiling above

Materializing the largest configuration file i could find on my laptop (not part of the benchmarks above) with a hot Sjsonnet JVM took ~3s with --yaml-debug vs ~2.7s with --yaml-out, and on a cold JVM it was ~19s vs ~17s. Either way it's probably reasonably fast enough for now, especially since it's intended mostly for manual usage when debugging things and not really intended to be part of automated workflows

Merge-Patch

In order to preserve Positions, I rewrote the mergePatch std lib function to avoid round-tripping the sjsonnet.Val data structures through ujson.Values, and instead performing the merge-patch algorithm lazily directly on the lazy sjsonnet.Val data structure. This allows us to do a best-effort preservation of Positions:

  • Terminal types (numbers, booleans, nulls, strings) are passed through unchanged with positions preserved
  • Arrays are also passed through unchanged (since they do not participate in the merge-patch algorithm) with positions preserved
  • Objects that do not get merged on the LHS are passed through unchanged with positions preserved
  • Objects that do not get merged on the RHS are passed through with nulls removed, but with positions preserved
  • Objects that do get merged are the only case where the position is over-written with the position of the std.mergePatch call

There are probably other std.* methods that we could rewrite in order to better preserve Position information while they do their work, but empirically it seems mergePatch is the most commonly used one in our company's config codebase

Testing

Added some rudimentary unit tests in PrettyYamlRendererTests, covering both this new line-comment generation as well as the previously-untested PrettyYamlRenderer. Not very thorough, but at least it's something...

Added a small fuzz-test to compare the output of the two prettyIndex implementations on a range of inputs, to try and catch any deviations between them. All such tests pass both on Scala-JVM and Scala.JS

Added few test cases in StdMergePatchTests to cover cases where I had trouble with semantic differences during implementation. These are in addition to the existing upstream test cases vendored in sjsonnet/test/resources/test_suite/merge.jsonnet.

Cleaned up the test suite a bit to DRY up copy-paste code, extracting a few common helpers in TestUtils.scala

All existing unit tests pass. The correctness of the changes when --yaml-debug is not passed is also validated by running this on our work configuration codebase, and verifying that no output YAMLs were changed as the result of these refactors

We don't actually have any fuzz tests to validate that adding the source-line comments doesn't change the parsed contents of the YAML output, but since it's mostly for manual debugging that's probably fine for now

@lihaoyi-databricks lihaoyi-databricks changed the title POC generate YAML with comments indicating which line of code generated each item Generate YAML with comments indicating which line of code generated each item Dec 14, 2020
Copy link
Contributor

@iuliand-db iuliand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of comments, but overall ship it!



object Util{
def binarySearch(lineStartsMin: Int, lineStartsMax: Int, lineStarts: Array[Int], index: Int): Int = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Arrays.binarySearch would work as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with Arrays.binarySearch is that it only can check for exact value == input, whereas this binary search needs to check for value > input. In theory it should be about the same logic, but I didn't find any convenient verison built in :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it returns the -insertion_point for a value that's not in the array, if I understand this correctly:

Returns:
index of the search key, if it is contained in the array within the specified range; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the array: the index of the first element in the range greater than the key, or toIndex if all elements in the range are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah cool find, I'll make use of that then!

Copy link
Contributor Author

@lihaoyi-databricks lihaoyi-databricks Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out that the comment explaining how we use the result of java.util.Arrays.binarySearch is longer than the binary search implementation in the first place! OTOH it lets me delete 115 lines of test suite, so definitely still a win overall

def saveCurrentPos() = {
val current = getCurrentPosition()
if (current != null){
bufferedComment = " # " + current.currentFile.renderOffsetStr(current.offset, loadedFileContents)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in a hot code path (almost every line of output will have a comment), so maybe it would be better to avoid string concatenation. The logic where the buffer is appended to the StringWriter could do out.append(" # ") before appending the actual offset. Not sure if this would result in an observable speedup, but operations on Strings used to be a significant slowdown in the Scala compiler. Since this seems like an easy change, I'd be curious to know the outcome for very large files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try this and see if there's any measurable difference

@lihaoyi-databricks
Copy link
Contributor Author

All tests pass, merging this

@lihaoyi-databricks lihaoyi-databricks merged commit ee72a0b into master Dec 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants