-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate YAML with comments indicating which line of code generated each item #105
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of comments, but overall ship it!
sjsonnet/src/sjsonnet/Util.scala
Outdated
|
||
|
||
object Util{ | ||
def binarySearch(lineStartsMin: Int, lineStartsMax: Int, lineStarts: Array[Int], index: Int): Int = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Arrays.binarySearch
would work as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with Arrays.binarySearch
is that it only can check for exact value == input
, whereas this binary search needs to check for value > input
. In theory it should be about the same logic, but I didn't find any convenient verison built in :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it returns the -insertion_point
for a value that's not in the array, if I understand this correctly:
Returns:
index of the search key, if it is contained in the array within the specified range; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the array: the index of the first element in the range greater than the key, or toIndex if all elements in the range are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah cool find, I'll make use of that then!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out that the comment explaining how we use the result of java.util.Arrays.binarySearch
is longer than the binary search implementation in the first place! OTOH it lets me delete 115 lines of test suite, so definitely still a win overall
def saveCurrentPos() = { | ||
val current = getCurrentPosition() | ||
if (current != null){ | ||
bufferedComment = " # " + current.currentFile.renderOffsetStr(current.offset, loadedFileContents) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in a hot code path (almost every line of output will have a comment), so maybe it would be better to avoid string concatenation. The logic where the buffer is appended to the StringWriter
could do out.append(" # ")
before appending the actual offset. Not sure if this would result in an observable speedup, but operations on Strings used to be a significant slowdown in the Scala compiler. Since this seems like an easy change, I'd be curious to know the outcome for very large files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try this and see if there's any measurable difference
All tests pass, merging this |
Generates YAML that looks like this:
The idea being to make it easier to figure out how the materialized values are being constructed: by letting the user get back to the last line of jsonnet that generated a value, they can then use the https://github.com/databricks/intellij-jsonnet IDE plugin to further explore the jsonnet code to see how the values are passed into that line.
How it works
We capture a
Position
value along with every runtime valuesjsonnet.Val
in our program. This is a thing wrapper around a file path and a character offset; we don't do any heavy computation until the end when we need to generate path+linenumber comments, and so it's reasonably lightweight and cheap as long as comment-generation isn't enabled (see perf numbers below). Every time a new value is computed, we store the currentPosition
of the expression that computed it, and the positions of any upstream values that fed into the computation are discarded/ignoredAlthough we could in theory capture the entire computation graph (DAG) leading up to each value, the size of this forest of graphs is likely to be large and too verbose for general usage. The IDE effectively lets the user interactively explore a subset of the graph, which is probably good enough for now. Capturing and making sense of the entire computation forest can be explored in future
Only works with direct YAML output, since JSON doesn't allow comments, though in theory we could generate a look-aside-table/source-map-file with the same information and somehow get the IDE to hook them up.
As the
upickle.core.Visitor
interface doesn't support anything other thanInt
s as offsets to each callback method, we instead plumb thecurrentPos
in a mutable variable which theMaterializer
sets using a callback and thePrettyYamlRenderer
reads using a callback. Not pretty, but it works. While in theory we could encode the entirePosition
value using theoffset: Int
thatupickle.core.Visitor
receives, I suspect it would be even more inelegant than this shared mutable variable thing.I had to plumb
offset: Int
orpos: Position
information into a bunch of places where we previously didn't need it (e.g. all ofStd.scala
). Where previously we only needed position information where we have a chance of throwing errors, now we need position information in all places where we construct newsjsonnet.Val
objects.Output comment generation is controlled under a flag
--yaml-debug
Comment Positioning
Due to the somewhat heterogenous output style of
PrettyYamlRenderer
, which is copied from the PyYAML package, there is some subtlety in how the positioning of comments works. At a high-level:For most terminal values (bools, strings, nulls, numbers, empty arrays, empty lists), these are only ever printed at the end of a line, and so the source line comment is simply appended at the end of the line after the value, EXCEPT
For Strings rendered block-style (
|-
,|+
,|2
, etc.) the only place we can put a comment is after the block delimiter, before the first line of the string, so we put the source line comment there.Lists and dictionaries in general are not guaranteed to appear on a line on their own (e.g. you can have nested lists in one line
- - - foo
or nested dictionaries- foo: bar
) and so in general we do not include a source line comment for where collections values were constructed, even though we have the data available, EXCEPTLists and dictionaries that are themselves values in another dictionary do have a spot of space where we can place a comment: after the preceding key of the enclosing dictionary. This is because nested lists/dictionaries on one line such as
- - foo:
can only ever have a dictionary in the right-most position, and if the dictionary key is a collection (whether list or dict) that is always placed on the following lines. Thus in this case we put the source line comment of the following dictionary value at the end of the line of the preceding dictionary keyPerformance
Performance on my laptop, # of iterations per 20s of generation some arbitrary set of config files:
In addition to the code in this PR, I tried disabling
Position
-management "globally" via an environment variable (madePosition.apply
returnnull
, wrapped a lot of position-related code blocks inif
guards) to see if it made a significant difference. It turns out that even with positions enabled, the performance degradation ofPR Generating JSON, positions enabled
compared to theMaster generating JSON
benchmark is small enough (2-3%) that it's probably not worth the complexity of juggling environment variablesThe
PR Generating PrettyYAML + Comments, positions enabled
Is about 2x slower than without the source line comment generation; this is about as much a slowdown as I expected. To reach this, I had to cache calls tofastparse.internal.Util.lineNumberLookup
that performs the building of line-number tables for the Jsonnet files we load, andI re-implemented
fastparse.IndexedParserInput.prettyIndex
insjsonnet.Util.prettyIndex
using a binary search for performance rather than a linear scan; without these changes, it's about 300x slower on the arbitrary set of configs I was profiling aboveMaterializing the largest configuration file i could find on my laptop (not part of the benchmarks above) with a hot Sjsonnet JVM took ~3s with
--yaml-debug
vs ~2.7s with--yaml-out
, and on a cold JVM it was ~19s vs ~17s. Either way it's probably reasonably fast enough for now, especially since it's intended mostly for manual usage when debugging things and not really intended to be part of automated workflowsMerge-Patch
In order to preserve
Position
s, I rewrote themergePatch
std lib function to avoid round-tripping thesjsonnet.Val
data structures throughujson.Values
, and instead performing the merge-patch algorithm lazily directly on the lazysjsonnet.Val
data structure. This allows us to do a best-effort preservation ofPosition
s:std.mergePatch
callThere are probably other
std.*
methods that we could rewrite in order to better preservePosition
information while they do their work, but empirically it seemsmergePatch
is the most commonly used one in our company's config codebaseTesting
Added some rudimentary unit tests in
PrettyYamlRendererTests
, covering both this new line-comment generation as well as the previously-untestedPrettyYamlRenderer
. Not very thorough, but at least it's something...Added a small fuzz-test to compare the output of the two
prettyIndex
implementations on a range of inputs, to try and catch any deviations between them. All such tests pass both on Scala-JVM and Scala.JSAdded few test cases in
StdMergePatchTests
to cover cases where I had trouble with semantic differences during implementation. These are in addition to the existing upstream test cases vendored insjsonnet/test/resources/test_suite/merge.jsonnet
.Cleaned up the test suite a bit to DRY up copy-paste code, extracting a few common helpers in
TestUtils.scala
All existing unit tests pass. The correctness of the changes when
--yaml-debug
is not passed is also validated by running this on our work configuration codebase, and verifying that no output YAMLs were changed as the result of these refactorsWe don't actually have any fuzz tests to validate that adding the source-line comments doesn't change the parsed contents of the YAML output, but since it's mostly for manual debugging that's probably fine for now