Support multiple top level JSON arrays for UnwrapArray #438

Merged

mdedetrich
Contributor

@mdedetrich mdedetrich commented Feb 17, 2022

Currently, Jawn's AsyncParser in UnwrapArray mode (which is designed to stream individual JSON elements from a top-level JSON array) only supports a single top-level JSON array in the byte stream.

Although this may appear sensible on the surface, it causes issues when you use Jawn's AsyncParser in actual streaming scenarios. In many streaming cases there isn't really an "end": you are fed a constant stream of bytes, and in some cases (depending on the streaming library you use) you may end up combining multiple JSON arrays into a single stream (e.g. you have a single JSON array per file but want to stream one file after another; streaming libraries will typically combine these into a single massive byte stream).
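To make the desired behaviour concrete, here is a minimal, dependency-free sketch of "unwrap every element from several concatenated top-level arrays". It is a toy illustration only and not Jawn's implementation: the names `UnwrapSketch` and `unwrapAll` are hypothetical, the input is assumed to be well-formed, and elements are returned as raw text rather than parsed values.

```scala
object UnwrapSketch {
  /** Splits text made of one or more concatenated top-level JSON arrays
    * into the raw text of each array element. No validation is performed.
    */
  def unwrapAll(input: String): Vector[String] = {
    val out = Vector.newBuilder[String]
    val cur = new StringBuilder
    var depth = 0        // nesting depth of [ ] and { }
    var inString = false // currently inside a JSON string literal
    var escaped = false  // previous char was a backslash inside a string

    // Emit the element collected so far, if any.
    def flush(): Unit = {
      val s = cur.toString.trim
      if (s.nonEmpty) out += s
      cur.clear()
    }

    input.foreach { c =>
      if (inString) {
        cur += c
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else c match {
        case '"' =>
          inString = true
          cur += c
        case '[' | '{' =>
          if (depth > 0) cur += c // the outermost '[' is not part of any element
          depth += 1
        case ']' | '}' =>
          depth -= 1
          if (depth == 0) flush() // a top-level array ended; another may follow
          else cur += c
        case ',' if depth == 1 => flush() // separator between top-level elements
        case _ => if (depth > 0) cur += c
      }
    }
    out.result()
  }
}
```

With this sketch, `UnwrapSketch.unwrapAll("[1,2][3]")` yields the three elements `1`, `2` and `3`, which is the semantic this PR is after when several arrays arrive back to back in one stream.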

As a comparison, Akka's Alpakka JSON streaming which uses jsurfer underneath doesn't exhibit this limitation (see akka/alpakka#2830 and https://discuss.lightbend.com/t/working-with-eof-for-a-source-flow/9481).

Originally I tried to work around this in my own library https://github.com/mdedetrich/akka-streams-json; however, since we are dealing with read-only ByteBuffers, it's not really possible/sensible to peek into the byte array ahead of time to see if you are going to hit another [ (indicating multiple top-level JSON arrays).

So I ended up creating this PR, which handles this special case only when you are using AsyncParser.UnwrapArray. Currently the PR hardcodes this behaviour; however, multiValue could instead be added as a parameter to AsyncParser.apply if requested. That is a bit complicated if you want a default parameter, because afaik a default parameter can't depend on another value in the same parameter list, and ideally the default would be mode == UnwrapArray. One can argue, however, that you can just be explicit with a default value of false.

Property-based tests have also been added to SyntaxCheck to verify this behavior.

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch 2 times, most recently from b862137 to 2f10867 Compare February 17, 2022 11:31
@djspiewak
Member

Yeah this is wildly tricky. I'm not sure whether or not this is the right approach to it. Definitely works for this case, but in general you're talking about broadly unwinding and incrementally flattening nested data structures, and there are a lot more cases than just this one. It's also not clear that it's always the right thing to do to just recursively and dynamically flatten in this fashion.

The right™ solution is probably to have some more controllable output layers and fewer assumptions in those layers around finity of the structures in question. This is basically the Tectonic approach to the problem. This would require some significant (and breaking) changes to the facade though, and it's also not entirely clear that people necessarily want the API that it would generate. So that brings us back to this PR.

Honestly I'm not sure what the right answer is. Like I said, this works but it hard-codes a very specific semantic and it's a bit philosophically odd. It's hard for me to say for certain whether it would be better to focus more on a "Jawn 2.0" which solves some of these issues, or to simply go with more incremental fixes like this PR.

@mdedetrich
Contributor Author

@djspiewak Thanks for the reply!

Would you be happy if I adjusted the PR so that this behaviour is enabled via an explicit parameter that by default is false?

Regarding your points about Jawn 2.0, I definitely agree, and there is an argument that we can also generalise the solution to use JSON path selectors like jsurfer does, so you can even select a nested JSON array without having to rely on it being top level.

For now though if appropriate I would push for an incremental solution. Not having this is actually blocking/causing other issues for my use case which I explained earlier.

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch from 2f10867 to fc8e8c6 Compare March 3, 2022 07:43
@mdedetrich
Contributor Author

I just rebased the PR; the only change was adding a type annotation, since CI complained about it.

@mdedetrich
Contributor Author

@djspiewak Is there any update on this?

@mdedetrich
Contributor Author

Pinging

@eed3si9n
Collaborator

eed3si9n commented Jul 4, 2022

The readme https://github.com/typelevel/jawn#parsing currently says:

  • UnwrapArray if the top-level element is an array, return values as they become available.

so I guess the new parameter should at least be documented. Or better yet, would it make sense to add a new mode like UnwrapMultiArray? That way, people who need array-chaining behavior can opt into it explicitly?

@mdedetrich
Contributor Author

mdedetrich commented Jul 4, 2022

Yes, agreed. I will update the PR to try to add the extra parameter and document it appropriately. If that's not possible I will make a new mode, but to me the extra parameter sounds better.

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch from fc8e8c6 to 2beb73f Compare July 4, 2022 20:23
@mdedetrich
Contributor Author

@eed3si9n In the end your suggestion of UnwrapMultiArray ended up being the most appropriate. I have updated the original commit using UnwrapMultiArray as a new mode and also updated the README.md documentation.

Let me know if anything else is needed.

Collaborator

@eed3si9n eed3si9n left a comment

LGTM pending CI

@mdedetrich
Contributor Author

So mima has failed, i.e.

[error] parser: Failed binary compatibility check against org.typelevel:jawn-parser_3:1.1.2 (e:info.versionScheme=early-semver)! Found 1 potential problems (filtered 60)
[error]  * method this(Int,Int,org.typelevel.jawn.FContext,scala.collection.immutable.List,Array[Byte],Int,Int,Int,Boolean,Int)Unit in class org.typelevel.jawn.AsyncParser does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.typelevel.jawn.AsyncParser.this")
[info] jawn-benchmarks: mimaPreviousArtifacts is empty, not analyzing binary compatibility.
[error] java.lang.RuntimeException: Failed binary compatibility check against org.typelevel:jawn-parser_3:1.1.2 (e:info.versionScheme=early-semver)! Found 1 potential problems (filtered 60)
[error] 	at scala.sys.package$.error(package.scala:30)
[error] 	at com.typesafe.tools.mima.plugin.SbtMima$.reportModuleErrors(SbtMima.scala:89)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2(MimaPlugin.scala:36)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2$adapted(MimaPlugin.scala:26)
[error] 	at scala.collection.Iterator.foreach(Iterator.scala:943)
[error] 	at scala.collection.Iterator.foreach$(Iterator.scala:943)
[error] 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1(MimaPlugin.scala:26)
[error] 	at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1$adapted(MimaPlugin.scala:25)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error] 	at sbt.std.Transform$$anon$4.work(Transform.scala:68)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:282)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:23)
[error] 	at sbt.Execute.work(Execute.scala:291)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:282)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:64)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] 	at java.lang.Thread.run(Thread.java:750)

Is this an issue? I think it may be due to AsyncParser's protected constructor (this), which should probably be private anyway.

@eed3si9n
Collaborator

eed3si9n commented Jul 4, 2022

[error] parser: Failed binary compatibility check against org.typelevel:jawn-parser_3:1.1.2 (e:info.versionScheme=early-semver)! Found 1 potential problems (filtered 60)
[error]  * method this(Int,Int,org.typelevel.jawn.FContext,scala.collection.immutable.List,Array[Byte],Int,Int,Int,Boolean,Int)Unit in class org.typelevel.jawn.AsyncParser does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.typelevel.jawn.AsyncParser.this")

Given that the constructor is protected it should be ok to ignore this, but I guess you can provide an overload constructor if you want to be more safe?

@mdedetrich
Contributor Author

mdedetrich commented Jul 4, 2022

but I guess you can provide an overload constructor if you want to be more safe?

I am trying this now, but because AsyncParser's primary constructor uses both vars and protected, it doesn't appear to be possible in Scala, either by using def this or by using apply in the companion object, i.e.

def apply[J](var state: Int, ...): AsyncParser[J]

or

def this[J](var state: Int, ...): AsyncParser[J]

is not valid Scala, and I can't think of another way to do this (honestly this should be private[jawn] rather than protected[jawn], especially considering it's a stateful class with vars).

Should I add a mima ProblemFilter?

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch 2 times, most recently from 72af0e0 to 16d31af Compare July 5, 2022 07:34
@mdedetrich
Contributor Author

So in addition to rebasing off the latest origin/master I have added an extra commit that adds a MIMA filter. Let me know if any other changes are needed.
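For reference, a MiMa exclusion of the kind suggested by the report above would look roughly like the following in an sbt build. This is only a sketch: it assumes sbt-mima-plugin is already enabled for the module, and the actual commit may organise its filters differently.

```scala
// build.sbt (sketch)
import com.typesafe.tools.mima.core._

mimaBinaryIssueFilters ++= Seq(
  // Filter string copied from the "filter with:" line of the MiMa report
  ProblemFilters.exclude[DirectMissingMethodProblem]("org.typelevel.jawn.AsyncParser.this")
)
```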

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch 2 times, most recently from fe9f6c3 to 766806f Compare July 5, 2022 19:28
@mdedetrich
Contributor Author

So the latest bincompat issue from mima came from the js target (beforehand it was the jvm target), so I have just adjusted the mima commit to add async.parser.backwards.excludes to all relevant targets, i.e. jvm/js/native.

It's a bit silly and there is code repetition, but at least the mima check should pass now.

Member

@armanbilge armanbilge left a comment

I'm confused why a secondary constructor with the old signature cannot be added. There is no need to use var in the constructor arguments; they can simply be unqualified arguments (like a method signature) that are passed to the primary constructor.

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch from 766806f to cac2888 Compare July 5, 2022 20:53
@mdedetrich
Contributor Author

mdedetrich commented Jul 5, 2022

I'm confused why a secondary constructor with the old signature cannot be added. There is no need to use var in the constructor arguments; they can simply be unqualified arguments (like a method signature) that are passed to the primary constructor.

Indeed, you are correct; I didn't know about Scala's unqualified arguments. I added a second this constructor that matches the original signature, and at least locally on my machine it passes the mima check.
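A minimal sketch of that approach, using a hypothetical `Parser` class as a stand-in for AsyncParser (the names and the reduced parameter list are assumptions for illustration): the primary constructor keeps its `var` parameters and gains the new one, while a secondary constructor with the old signature takes plain parameters and forwards them.

```scala
object ConstructorSketch {
  // Hypothetical stand-in for AsyncParser: mutable state lives in `var`
  // constructor parameters, and the constructor is not public.
  final class Parser protected[ConstructorSketch] (
      var state: Int,
      var done: Boolean,
      var multiValue: Boolean // newly added parameter
  ) {
    // Secondary constructor with the pre-change signature. Its parameters are
    // plain (no `var`/`val`) and are simply forwarded to the primary constructor,
    // so code linked against the old signature still finds a matching method.
    protected[ConstructorSketch] def this(state: Int, done: Boolean) =
      this(state, done, false)
  }

  def oldStyle(): Parser = new Parser(0, false)       // old signature
  def newStyle(): Parser = new Parser(0, false, true) // new signature
}
```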

Comment on lines 32 to 35
sealed abstract class Mode(val start: Int, val value: Int)
case object UnwrapArray extends Mode(-5, 1)
case object UnwrapMultiArray extends Mode(-5, 1)
case object ValueStream extends Mode(-1, 0)
case object SingleValue extends Mode(-1, -1)
Member

Unfortunately this change is not backwards compatible, since it adds a new member to a sealed hierarchy. This will break any previous total pattern matches since they will not be accounting for this new case.

See:

Contributor Author

@mdedetrich mdedetrich Jul 5, 2022

So what is your recommendation? Reading lightbend-labs/mima#200 (comment), there is an argument that this is out of scope for mima because it's a compilation error, not a linking error, which is historically what mima has been responsible for.

I also agree with the general sentiment there, which is that warning in this case seems sensible; however, treating it the same as the other mima errors is out of scope for mima.

Member

Indeed it is not a linking error, but it can still crash at runtime. If I wrote a pattern match that checks for UnwrapArray, ValueStream, and SingleValue, and assumes those are the only 3 possible cases (as promised by sealed) then that code is in big trouble if it encounters an instance of UnwrapMultiArray.

My recommendation is to find a way to make this change backwards-compatibly :)
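A small sketch of the failure mode described above, with simplified, hypothetical `Mode` definitions (no constructor arguments). The `describe` function stands in for downstream code that was compiled against the old three-member hierarchy; `@unchecked` is used here only to silence the exhaustivity warning that a fresh compile would emit.

```scala
object SealedSketch {
  sealed abstract class Mode
  case object UnwrapArray extends Mode
  case object ValueStream extends Mode
  case object SingleValue extends Mode
  case object UnwrapMultiArray extends Mode // the newly added member

  // Written as if only three members existed, as old client code would be.
  def describe(mode: Mode): String = (mode: @unchecked) match {
    case UnwrapArray => "unwrap"
    case ValueStream => "stream"
    case SingleValue => "single"
  }

  // Passing the new member falls through every case and throws MatchError.
  def failsAtRuntime: Boolean =
    try { describe(UnwrapMultiArray); false }
    catch { case _: MatchError => true }
}
```

Nothing fails at link time here, which is why MiMa does not flag it, but the old match still breaks as soon as it is handed the new case.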

Contributor Author

@mdedetrich mdedetrich Jul 5, 2022

So afaik the only way to make this binary compatible while still making sure multiValue is explicit (i.e. not changing current Jawn behavior) is the initial implementation, which was actually changed because design-wise it's worse, i.e.

object AsyncParser {
  sealed abstract class Mode(val start: Int, val value: Int)
  case object UnwrapArray extends Mode(-5, 1)
  case object ValueStream extends Mode(-1, 0)
  case object SingleValue extends Mode(-1, -1)

  def apply[J](mode: Mode = SingleValue): AsyncParser[J] =
    new AsyncParser(
      state = mode.start,
      curr = 0,
      context = null,
      stack = Nil,
      data = new Array[Byte](131072),
      len = 0,
      allocated = 131072,
      offset = 0,
      done = false,
      streamMode = mode.value,
      multiValue = false
    )

  def apply[J](mode: Mode, multiValue: Boolean): AsyncParser[J] =
    new AsyncParser(
      state = mode.start,
      curr = 0,
      context = null,
      stack = Nil,
      data = new Array[Byte](131072),
      len = 0,
      allocated = 131072,
      offset = 0,
      done = false,
      streamMode = mode.value,
      multiValue = multiValue
    )
}

The reason this is worse is that multiValue has no meaning for any Mode other than ValueStream, so it's misleading. Furthermore, due to how overloading works, you cannot have a duplicate default parameter for mode: Mode (which is why the second, new apply is missing the = SingleValue default param).

There was also another idea, which is making ValueStream a case class that takes a multiValue parameter, but afaik even with default parameters that is not binary compatible at all (in the traditional/"correct" sense).

If you have any other ideas/suggestions, that would be greatly appreciated, because otherwise I think I am kind of stuck here. I guess I can add something like an AsyncParser.unwrapMultiArray function, although it still kind of looks weird.
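The overloading constraint can be seen in a stripped-down sketch (the `Parser` class and the reduced `Mode` hierarchy are hypothetical stand-ins, not Jawn's real definitions). Scala 2 rejects multiple overloaded alternatives that define default arguments, so only the existing `apply` keeps its default and the new overload requires both arguments.

```scala
object OverloadSketch {
  sealed abstract class Mode
  case object UnwrapArray extends Mode
  case object SingleValue extends Mode

  final class Parser(val mode: Mode, val multiValue: Boolean)

  object Parser {
    // Existing entry point: keeps its default argument and its old behaviour.
    def apply(mode: Mode = SingleValue): Parser =
      apply(mode, multiValue = false)

    // New overload: no default arguments here, since a second alternative
    // with defaults would not compile under Scala 2.
    def apply(mode: Mode, multiValue: Boolean): Parser =
      new Parser(mode, multiValue)
  }
}
```

Existing call sites such as `Parser()` or `Parser(UnwrapArray)` keep resolving to the first alternative, while callers that want the new behaviour write `Parser(UnwrapArray, multiValue = true)`.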

@mdedetrich
Contributor Author

So I just added another commit which implements the secondary apply for the AsyncParser object. It's not the nicest solution design-wise, but at least to me it's the least worst solution that is binary and backwards compatible.

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch from 9c09847 to ca3c0cf Compare July 6, 2022 07:16
@mdedetrich
Contributor Author

@armanbilge Is there anything else needed for this PR?

@armanbilge
Member

I'm not really a maintainer here :) I believe as-written it is backwards compatible.

It's not the nicest solution design-wise, but at least to me it's the least worst solution that is binary and backwards compatible.

I'm glad it solves your problem, but do you mind if I ask: do you believe that this is a good change for this library overall?

I thought Daniel's review in #438 (comment) raised some important points.

Definitely works for this case, but in general you're talking about broadly unwinding and incrementally flattening nested data structures, and there are a lot more cases than just this one. It's also not clear that it's always the right thing to do to just recursively and dynamically flatten in this fashion.

Since there are many possible behaviors here, we must be careful not to box ourselves with compatibility constraints.

@mdedetrich
Contributor Author

mdedetrich commented Aug 2, 2022

I'm not really a maintainer here :) I believe as-written it is backwards compatible.

Ah okay, I thought you were a maintainer.

I'm glad it solves your problem, but do you mind if I ask: do you believe that this is a good change for this library overall?

I thought Daniel's review in #438 (comment) raised some important points.

Since there are many possible behaviors here, we must be careful not to box ourselves with compatibility constraints.

Oh definitely, I agree with Daniel. However, as he also said, I don't think it's possible to solve these problems in a principled way without either breaking binary compatibility or introducing a new API, which Daniel contemplated would be better solved in a Jawn 2.0 (and I also suggested RFCs that have standards on how to deal with nesting of JSON structures).

My intention with this PR is to unblock the specific problem I have in a way that is still reasonable. I don't have any problems with a new API/better design, but that dramatically increases the scope (and judging from Daniel's comment earlier, he also seems to be on the same page and is open to a quick fix as long as it's broadly sensible).

@mdedetrich
Contributor Author

@rossabaker @eed3si9n Do you mind reviewing/re-reviewing the PR since I believe you are the currently active maintainers?

Collaborator

@eed3si9n eed3si9n left a comment

LGTM

@eed3si9n
Collaborator

eed3si9n commented Aug 2, 2022

I am only nominally a maintainer, and haven't been active, but overall I think this change is fine.

@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch 2 times, most recently from 8550e0d to a8f809f Compare August 3, 2022 18:29
@mdedetrich mdedetrich force-pushed the support-multiple-json-array-unwrap branch from a8f809f to d2493b4 Compare August 3, 2022 18:41
@mdedetrich
Contributor Author

@eed3si9n Thanks for the review, just rebased and pushed the PR with your changes

@rossabaker Would it be possible to look at/review this PR? You had a brief look at it previously when doing the new release, and the issues from back then have been resolved.

@rossabaker rossabaker merged commit 305dd34 into typelevel:main Sep 29, 2022
@mdedetrich mdedetrich deleted the support-multiple-json-array-unwrap branch September 29, 2022 14:50
@mdedetrich
Contributor Author

Thanks for merging this through, really appreciate it!

5 participants