
Flexible re-use: deferred keywords vs schema transforms #515

Closed
handrews opened this issue Nov 28, 2017 · 235 comments


@handrews
Contributor

handrews commented Nov 28, 2017

NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.

We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.

The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.


This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the "additionalProperties": false use cases that do not work well with our existing modularity and re-use features.


TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.

Existing implementations are generally Level 3 in the following list. Draft-07 introduces annotation collection rules, which are optional to implement. Implementations that do support annotation collection will be Level 4. This issue proposes Level 5 and Level 6, and also examines how competing proposals (schema transforms) impact Level 1.

EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.

  • Level 1: Basic media type functionality. Identify and link schemas, allow for basic modularity and re-use
  • Level 2: Full structural access. Apply subschemas to the current location and combine the results, and/or apply subschemas to child locations
  • Level 3: Assertions. Evaluate the assertions within a schema object without regard to the contents of any other schema object
  • Level 4: Annotations. Collect all annotations that apply to a given location and combine the values as defined by each keyword
  • Level 5: Deferred Assertions. Evaluate these assertions across all subschemas that apply to a given location
  • Level 6: Deferred Annotations. Collect annotations and combine them with existing level 4 results as specified by the keyword. Deferred annotations may specify rules for overriding level 4 annotations collected from subschemas

A general JSON Schema processing model

With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.

NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.

NOTE 2: Even if this approach is used, the steps are not executed linearly. $ref must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.

  1. Process schema linking and URI base keywords ($schema, $id, $ref, and definitions as discussed in #512)
  2. Process applicability keywords to determine the set of subschema objects relevant to the current instance location, and the logic rules for combining their assertion results
  3. Process each subschema object's assertions, and remove any subschema objects with failed assertions from the set
  4. Collect annotations from the remaining relevant subschemas

There is a basic example in one of the comments.

Note that (assuming #512 is accepted) step 1 is entirely determined by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.

Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.

Steps 3 and 4 are where things get more interesting.

Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).

Implementations that want to follow draft-07's guidance on the annotation keywords in the validation spec would need to add step 4 (however, this is optional in draft-07).

Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.

So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).

To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:

Deferred processing

To solve the re-use problems I propose defining a step 5:

  • Process additional assertions (a.k.a. deferred assertions) that may make use of all subschemas that are relevant at the end of step 4. Note that we must already process all existing subschema keywords before we can provide the overall result for a schema object.

EDIT: The proposal was originally called unknownProperties, which produced confusion over the definition of "known" as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior unevaluatedProperties instead. But that name does not otherwise appear until much later in this issue.

This easily allows a keyword to implement "ban unknown properties", among other things. We can define unevaluatedProperties to be a deferred assertion analogous to additionalProperties. Its value is a schema that is applied to all properties that are not addressed by the union, over all relevant schemas, of properties and patternProperties.

There is an example of how unevaluatedProperties, called unknownProperties in the example, would work in the comments. You should read the basic processing example in the previous comment first if you have not already.

We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be unevaluatedItems, which would be analogous to additionalItems except that it would apply to elements beyond the longest items array across all relevant schemas. (I don't think anyone's ever asked for this, though.)
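As a quick sketch of how that could look (the keyword name and exact behavior here are hypothetical):

{
  "allOf": [
    {"items": [{"type": "integer"}]},
    {"items": [{"type": "integer"}, {"type": "string"}]}
  ],
  "unevaluatedItems": {"type": "boolean"}
}

The longest items array across the relevant schemas covers two elements, so the deferred subschema would apply from the third element onward: [1, "a", true] would be valid, while [1, "a", 5] would not.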

Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like deferredDefault, which would override any/all default values. And perhaps it would trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it, do not take this as a serious proposal).
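Purely to make that made-up behavior concrete, it might look something like this:

{
  "allOf": [
    {"default": "fast"},
    {"default": "safe"}
  ],
  "deferredDefault": "safe"
}

At level 4, the two regular default values would be collected as conflicting annotations; at level 6, deferredDefault would override both with a single unambiguous value.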


Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt-in to this as an additional level of functionality.

Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).

Schema transforms

In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as $merge or $patch would be added as a step 1.5, as they are processed after $ref but before all other keywords.

These keywords introduce schema transformations, which are not present in the above processing model. All of the remaining proposals ($spread, $use, single-level overrides) can be described as limited versions of $merge and/or $patch, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.


It's not clear to me how schema transform keywords work with the idea that $ref is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).

[EDIT: @epoberezkin has proposed a slightly different $merge syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion]

If $ref is lazily replaced with its target (with $id and $schema adjusted accordingly), then transforms are straightforward. However, we currently forbid changing $schema while processing a schema document, and merging schema objects that use different $schema values seems impossible to do correctly in the general case.

Imposing a restriction of identical $schemas seems undesirable, given that a target schema maintainer could change their draft version independent of the source schema maintainer.

On the other hand, if $ref is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotations). This works fine with different $schema values, but it is not at all clear to me how schema transforms would apply.

@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?

Conclusions

Based on my understanding so far, I prefer deferred keywords as a solution. It does not break any aspect of the existing model, it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.

Schema transforms introduce an entirely new behavior to the processing model. They do not seem to work with how we are now conceptualizing $ref, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against them.

I also still dislike having arbitrary editing/transform functionality as part of JSON Schema at all, but that's more of a philosophical thing, and I still haven't figured out how to articulate it in a convincing way.

I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)

@erayd

erayd commented Nov 28, 2017

I like deferred keywords as a concept, but they do not obviate my need for schema transforms.

My primary use-case for transforms is re-use of a schema fragment, with the ability to override some of the keywords. To take a trivial example, using {"type": "integer", "maximum": 5} but with a higher maximum is currently impossible without a lot of copy / paste that reduces maintainability.
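For concreteness, with the $merge syntax from ajv-merge-patch, that override would look something like this (the $ref target is just a placeholder):

{
  "$merge": {
    "source": {"$ref": "#/definitions/smallInteger"},
    "with": {"maximum": 10}
  }
}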

@erayd

erayd commented Nov 28, 2017

Also for the record, I think that $ref should not be related in any way to schema transforms. It should be an immutable delegation (i.e. essentially a black-box function call).

@handrews
Contributor Author

@erayd I don't see that type of transform- arbitrarily slicing up and combining schema fragments- as within the scope of JSON Schema. Although that view is certainly debatable.

Applying arbitrary transforms to JSON like that has nothing to do with JSON Schema. There is no awareness needed of the source or target being schemas or having particular keyword behavior. You're just manipulating JSON text at a raw level. That is why I see it as out of scope- there is simply nothing that requires it to be part of JSON Schema at all.

This is different from $ref where it's simply not possible to have a usable system without some mechanism for modularity and cyclic references. The media type would be useless for any non-trivial purpose without it. However, it's always possible to refactor to avoid schema transforms, and frankly if anyone submitted a PR on a schema doing "re-use" by what is essentially textual editing, I'd send it back.

Violating the opacity of $ref (which it seems at least you, @epoberezkin, and I all prefer to preserve) invites a huge class of unpredictable errors due to unexpected changes on the target side. Your result across a regular delegation-style $ref may change in ways that you can't see or predict, but you have established an interface contract: I am referring to whatever functionality is identified by the target URI.

With arbitrary editing, there is no contract. You're snipping a bit of JSON and doing something with it, which may or may not have anything to do with its original purpose in the target document. It still just makes no sense to me.

@handrews
Contributor Author

Hopefully others can talk about how their use cases line up with these proposals. The primary use cases that I remember (OO-style inheritance for strictly typed systems, and disambiguating multiple annotations) can both be solved by deferred keywords.

So I would be particularly interested in use cases that stop short of "I want to be able to do arbitrary transforms regardless of schema-ness" but are beyond what can be addressed with deferred keywords.

@erayd

erayd commented Nov 28, 2017

@handrews

I don't see that type of transform- arbitrarily slicing up and combining schema fragments- as within the scope of JSON Schema.

It doesn't have to be. I think it just makes more sense to define it as part of JSON schema in order for JSON schema to have a standard and consistent way of solving the problem. To my mind, this is fundamentally a preprocessing step, and could easily be defined as a separate, referenced standard (e.g. perhaps JSON schema specifies that the last step of core processing before applying $ref is to transform based on Transform Spec XYZ). That would solve the underlying problem, but without cluttering up the JSON schema spec with it.

With arbitrary editing, there is no contract. Your snipping a bit of JSON and doing something with it, which may or may not have anything to do with its original purpose in the target document. It still just makes no sense to me.

I guess I see it as forming a new contract at the point of reuse, rather than trying to preserve whatever that piece of schema may have been doing before.

As an OOP example, defining a child class and then overriding one of the parent methods does not result in a child class that is guaranteed to behave in the same manner as the parent - but it allows for multiple children that share some of their behavior without having to redefine that behavior inside every child class.

...OO-style inheritance for strictly typed systems... can be solved by deferred keywords.

Are you able to clarify that a bit? Because even in strictly typed OO inheritance, the behavior in a child class can still override the parent and break whatever behavioral assumptions you may be making based on how the parent works. The only guarantee you have is that the types are the same [Ed: and that the methods etc. exist].

In my ideal world, any reuse mechanism would be applied before $ref is processed. This enforces the "$ref is a black box" approach, and makes the outcome much easier to reason about.

@erayd

erayd commented Nov 28, 2017

Also for what it's worth, I care more about $ref being opaque than I care about having a transform mechanism. If it comes down to it, I'd rather have no transform mechanism at all than compromise $ref.

@handrews
Contributor Author

@erayd I don't consider violations of the Liskov Substitution Principle to be proper OO modeling. Once you break the parent's interface contract you're just doing random stuff and the programmer can't reason about the type system in any consistent way.

I'd like to avoid going down a rathole on this before anyone else has had a chance to weigh in. These issues rapidly get too long for most people to read, and this one is long to start with. If you want to argue about type systems let's take it to email (you can get mine off the spec, at the bottom of the document) and see if we can leave space for others to cover their own use cases here.

@erayd

erayd commented Nov 28, 2017

Fair call - let's switch to email.

@Anthropic
Collaborator

@handrews so the TL;DR would be "I want to add a step to the theoretical processing sequence so in future we can peg new keywords to that point in execution"?
You mentioned deferredDefault and unknownProperties; do you have many other examples/ideas for use cases?

@handrews
Contributor Author

@Anthropic basically, yeah. LOL my TL;DRs need TL;DRs.

It's not really intended to be theoretical- we would do this to add keywords immediately. I just want to settle on a why and how because I feel like arguing over all of the concrete keywords in this area didn't get us anywhere useful. Just a huge pile of conflicting proposals that people voted for in weird patterns that didn't resolve anything.

unknownProperties is pretty compelling on its own. How much time have we spent on "additionalProperties": false + "allOf" not working the way people think it should? unknownProperties solves that. I mean, we all agreed after the vote-a-rama that solving that alone would be sufficient to publish a draft-08. It's why people were OK with deferring the discussion out of draft-07.

I did make up the deferredDefault thing as a way to think about how this would solve the $use use cases (#98). One problem with default is that if you end up applying different default values to the same property across multiple branches of an allOf, which default do you use? deferredDefault would say "ignore any regular defaults that might have been stuffed in there somewhere and use this." deferredDefault is not a good name and the use case is not well-developed, but it's relevant. Same issue for title and description. I can see ways to solve those without deferred keywords, but there's a possible class of things to consider there.

@handrews
Contributor Author

handrews commented Nov 29, 2017

Here's an example of the overall process:

{
  "title": "an example",
  "description": "something that can be a number or a string",
  "anyOf": [
    {
      "description": "the number is for calculating",
      "type": "integer",
      "examples": [42]
    },
    {
      "description": "strings are fun, too!",
      "type": "string",
      "examples": ["hello"]
    }
  ]
}

NOTE: Again, this is not necessarily how an implementation would or should work in terms of step order

So for step 1, there's nothing to do b/c there are no $id or $ref keywords (nothing's changed about this step so I'm leaving it out).

Step 2 is to determine what's applicable, which means looking for keywords like anyOf. In this case, we have three schema objects that are applicable: each of the objects within the anyOf, plus the parent object containing the anyOf. If we identify these with URI fragment JSON Pointers, the set is ("#/anyOf/0", "#/anyOf/1", "#")

Step 3 is to evaluate assertions. Let's assume an instance of 100.

  • An integer is valid against "#/anyOf/0", so keep it
  • It is not valid against "#/anyOf/1", so remove that schema object from the set
  • anyOf ORs its results, so overall the instance is valid against "#", so keep the root object

So now our set is ("#/anyOf/0", "#")

Step 4 is to collect annotations. By default, multiple values for an annotation are put in an unordered list, while the values of examples are flattened into a single list (this is all in draft-07). So if we made a JSON document out of the annotations, it would be something like:

{
  "title": ["an example"],
  "description": [
    "something that can be a number or a string",
    "the number is for calculating"
  ],
  "examples": [42]
}

I'll do another example showing the deferred keyword stuff next.

@handrews
Contributor Author

handrews commented Nov 29, 2017

This example illustrates how deferred keywords work, using unknownProperties. You should read the previous comment's example first.

{
  "type": "object",
  "required": ["x"],
  "properties": {
    "x": {"type": "boolean"}
  },
  "allOf": [
    {
      "if": {
        "properties": {
          "x": {"const": true}
        }
      },
      "then": {
        "required": ["y"],
        "properties": {
          "y": {"type": "string"}
        }
      },
      "else": {
        "required": ["z"],
        "properties": {
          "z": {"type": "integer"}
        }
      }
    },
    {
      "patternProperties": {
        "^abc": true
      }
    }
  ]
}

Assuming an instance of {"x": true, "y": "stuff", "abc123": 456}, after going through our first three steps, we end up with the following schema objects in the set:

("#/allOf/0/if", "#/allOf/0/then", "#/allOf/0", "#/allOf/1", "#")

Now of course, if we put "additionalProperties": false in the root schema, the whole thing falls apart. We can't have a valid instance without "x", but depending on "x" we're also required to have either "y" or "z". But that addlProps would only 'see' property "x", so having either "y" or "z" would fail validation. So there are no valid instances if you do that. But what if we have deferred keywords and unknownProperties?

{
  "type": "object",
  "required": ["x"],
  "properties": {
    "x": {"type": "boolean"}
  },
  "allOf": [{...}, {...}],
  "unknownProperties": false
}

So now we once again consider our set that we have after step 3. There are no annotation keywords in this schema document, so there's nothing to do for step 4. But we have a deferred keyword, so we have a step 5 to consider.

Unlike immediate keywords at step 3, which can only work in each schema object separately, deferred keywords can look across the whole set of relevant schema objects.

This is because we cannot know the full relevant set until after step 3 is complete. So step 3 can't depend on knowing the set that it determines.

However, step 5 can. We go into step 5 knowing our full set of relevant schema objects. So, as specified by unknownProperties in the first comment of this issue, we take a look at the union of all properties and patternProperties:

  • "#/anyOf/0/if" defines "x"
  • "#/anyOf/0/then" defines "y"
  • "#/anyOf/1" defines the pattern "^abc"
  • "#" defines "x"

So the known properties are "x", "y", and any property matching pattern "^abc".

This means that our instance

{"x": true, "y": "stuff", "abc123": 456}

is valid, but

{"q': "THIS SHOULDN'T BE HERE", "x": true, "y": "stuff", "abc123": 456}

is not. Which is the behavior people have been asking for LITERALLY FOR YEARS.

@handrews
Contributor Author

handrews commented Nov 29, 2017

Another idea for implementing deferred keywords is to have a core keyword, $deferred, which is an object where all deferred keywords live. I'm not sure if that actually makes implementation (including choosing an implementation level that may stop short of deferred keywords) easier or not. But I'll leave it here in case folks have thoughts on it.

{
    "allOf": [{...}, {...}],
    "$deferred": {
        "unknownKeywords": false
    }
}

instead of

{
    "allOf": [{...}, {...}],
    "unknownKeywords": false
}

@handrews
Contributor Author

handrews commented Nov 29, 2017

With $deferred you could even use the same keyword as an immediate assertion and just apply it across all relevant schemas. This wouldn't make a difference for most keywords (e.g. maximum has the same effect whether immediate or deferred), but additionalProperties and additionalItems would have well-defined modified behavior, as explained for the proposed unknownProperties and unknownItems.
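For example (with the same sketch status as above), the earlier schema could reuse additionalProperties directly:

{
    "allOf": [{...}, {...}],
    "$deferred": {
        "additionalProperties": false
    }
}

Here the deferred additionalProperties would be evaluated against the union of properties and patternProperties across all relevant schemas, rather than against its own schema object alone.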

Again, not sure if this is more or less confusing. Just thinking out loud about different ways to manage this, so that folks have some more concrete options to consider.

@handrews
Contributor Author

@erayd and I have been having a fantastic side discussion about OO design, subtyping, merge/patch, and other related ideas. He'll post his own summary when he gets a chance, but I wanted to copy over some key points about why merge/patch as optional functionality is hard even though we're perfectly happy to have format be optional for validating.

TL;DR:

  • format is still useful to applications even when it is not validated, and failing to validate it has no further impact on the current processing model
  • failing to validate format can impact deferred keywords if we add them, but it is easy to add an extensibility hook for applications to register their own handlers (and many implementations do this)
  • $merge/$patch cannot safely be ignored, as the document may not make any sense, and the implementation cannot make assumptions about how useful the document is even if it is otherwise a valid schema
  • Because of lazy evaluation and $ref, you can't make a simple callback to implement schema transforms in your application- it needs to be something more like a co-routine which is much more difficult.

Annotating Assertions

format, contentMediaType, and contentEncoding are what I call annotating assertions, where the assertion part of the functionality is optional. Since we have never had a formal specification about what to do with annotations before draft-07, that's more or less been viewed as making the whole keyword optional.

But the nature of format and content* is that even if the validator ignores them, they still convey all of the information needed to validate them up to the application. The application can choose to do its own validation. So even when validation is not implemented, these keywords are still useful.

Validating them is also somewhere between challenging and impossible (for instance, there is no perfect regex for validating email addresses). So even when format is supported it's not as strong of a guarantee as something like maxLength. And content* is even harder to validate in any general sense.

Callback Extensibility

Annotating assertions are handled at steps 3 (assertion) and 4 (annotation) of the processing model. Most existing implementations provide only steps 1-3. Instead of step 4 (only defined in draft-07, and still optional), most implementations assume the application will find and use annotations however it wants to.

Let's say we have this schema (yes, I know that oneOf would work and avoid at least one problem, but it doesn't illustrate my point as well, just roll with it please):

{
  "type": "string",
  "anyOf": [
    {
      "type": "string",
      "format": "email",
      "title": "email username"
    },
    {
      "pattern": "^[a-z]\\w*[\\w\\d]$",
      "title": "basic username"
    }
  ]
}

If we have handrews as the instance, then a level 3 implementation will correctly accept that as valid, whether it supports validating the "email" format or not.

A level 4 implementation that validates the "email" format will return an annotation set of

{"title": ["basic username"]}

while one that does not validate format will return an annotation set of

{"title": ["basic username", "email username"], "format": ["email"]}

(recall that format is also an annotation, and annotation values are collected as unordered arrays).

So we see that not implementing format can cause a problem in a level 4 implementation. However, this can be avoided in implementations that allow registering a callback or something similar for format. The implementation makes the callback while processing level 3, and then moves on to level 4 just fine. There are interoperability concerns, but basically this is easily managed if we want to manage it.

Extra Level Extensibility

The whole deferred keyword proposal (level 5) relies on the idea that adding a later processing step is an easy extension model. For that matter, so did defining an algorithm for collecting annotations (level 4) in draft-07. All existing level 3 (assertions) implementations are still valid without having to change anything at all. They can add support for the new levels or not, and it's easy to explain what level of support is provided.

Level 1 Extensibility Challenges

This doesn't work when you change level 1, which is what schema transforms such as $merge and $patch do. You can only process level 2 (applicability) once you have resolved references and executed any transforms. Because references require lazy evaluation, so do transforms, and you are likely to bounce back and forth between the two. Your transform almost always references at least one part by $ref, and that part may itself include another transform which uses a $ref, etc.

So you can't just ignore the transforms, because the schemas you pass to level 2 are flat out wrong. But you can't just provide a simple callback for the keyword because level 1 processing is more complex- your application-side callback would need to call back into the JSON Schema implementation when it hits $ref.

Also, real implementations will go back and forth among levels 1, 2, and 3, because you can't find all $refs without examining applicability keywords, and you can't determine which subschemas are worth recursing into without checking assertions. Inserting schema transform processing into this as an application-side extension would be very challenging.

This, in addition to conflicting with $ref-as-delegation, is why $merge and $patch are not suitable for handling as extensions. Obviously you can do it (see ajv-merge-patch), but it's complex (see ajv-merge-patch's disclaimers about evaluation context).

@epoberezkin
Member

@handrews re $merge/$patch: it is a pre-processing step, so it's not step 1.5, it's step 0, which should happen before anything else. Ignore the way it's defined in ajv-merge-patch; it uses $refs to essentially include schemas, which is not consistent with the delegation model. So if we add it, it should have a different syntax.

@erayd some of the re-use ideas can be better implemented with $params (#322) than with $merge.

unknownProperties is the same idea as a banUnknownProperties mode, but as a schema keyword. The presence of compound keywords (anyOf etc.) complicates the definition of "known properties", though. The way @handrews proposes it, it seems that what is known will depend on the actual data, which from my point of view leads to non-determinism, potential contradictions (I need to think about the example) and, at the very least, inefficiency. For example, if the idea is that a property is known only if it passes validation by the subschema where the property is defined, then ALL branches of "anyOf" must be validated; you cannot short-circuit (the idea of collecting annotations probably suffers from the same problem).

I think that for all real use-cases it would be sufficient (and better) if "unknownProperties" operated on a predefined set of properties that does not depend on the validated data and can be obtained via static schema analysis. That analysis would need to traverse referenced schemas as well, but would not traverse them more than once, in order to correctly handle recursion. In this case we would avoid deferred processing entirely and keep things simple while achieving the same benefits.

The example above would treat x, y, z, and any property matching abc* as known, regardless of the data, and if some additional restrictions need to be applied (e.g. making y and z mutually exclusive), that can easily be achieved by adding some extra keywords.
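For example (this is my reading of both proposals), given the schema above plus "unknownProperties": false, the instance

{"x": true, "y": "stuff", "z": 5}

would pass under static analysis, because z is always a known property. Under the data-dependent proposal it would fail, because the else branch is not relevant when x is true, so no relevant schema defines z.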

If we defined unknownProperties based on static schema analysis we would break the shallowness principle, but at least not the processing model.

Still, I find some pre-processing syntax more useful and less ambiguous than deferred data-dependent keywords, and even than statically defined keywords that require deep schema traversal to define their behaviour, even though pre-processing can result in invalid schemas (e.g. from combining different drafts). It could be either a generic $merge or a more specialised syntax, either for extending properties of the nearest parent schema or for merging properties from a child subschema.

If my memory is correct, $merge also received the most votes. I guess people like it because it is simple, clearly defined, introduces no ambiguity in the results, and solves the extension problem as well as others.

@handrews
Contributor Author

handrews commented Dec 3, 2017

@epoberezkin I'm going to respond in pieces to several of your points. Feel free to lay out new points here, but for back and forth on things we've stated here let's please do that on email to avoid overwhelming everyone else (a frequent complaint we've both heard rather often). We can each come back with summaries of the offlist discussion, as @erayd and I are also doing.

@handrews
Contributor Author

handrews commented Dec 3, 2017

@epoberezkin said:

For example, if the idea is that property is known only if it passes validation by subschema where the property is defined, then ALL branches of "anyOf" should be validated, you cannot short-circuit it (probably the idea of collecting annotations suffers from the same problem).

Yes, that limitation on short-circuiting has been in the spec explicitly for two drafts now, and has always been implicit in the definition of the meta-data keywords. We've just never required validators to collect annotations (nor do we in draft-07, we just state how to do it for implementations that wish to do so).

The no-short-circuit requirement is explicitly defined for validation in draft-07 of Validation, Section 3.3 (Annotations), in particular Section 3.3.2 (Annotations and Short-Circuit Validation). I do hope you at least skimmed the table of contents during the full month that spec was posted for pre-publication feedback. There were at least three PRs on the topic, at least two of which were open for the standard 2 week feedback period.

In draft-06 it was in the section defining how Hyper-Schema builds on validation, which I would not particularly have expected you to pay attention to, as you don't implement Hyper-Schema. But it's really always been there for annotations. For example, if you want to find a possible default, you have to look everywhere for it.

Validation has never been required to do this and still is not required. That is the point of the opt-in multi-level proposal. A Level 3 validator such as Ajv can be much faster than a Level 4 annotation-gathering validator. That's great! Many people would rather have speed. The set of people who need complex annotation gathering is relatively small, and implementation requirements for validation should not be constrained by their use cases.

However, all hyper-schema implementations need to be Level 4. Or else they just don't work. I can go into this in more detail, but static analysis produces incorrect results. While I'm generally willing to defer to you on validation itself, you do not implement hyper-schema and have never expressed any interest in doing so. I have put a lot of thought into that. So if you want to convince me that static analysis is sufficient, you are going to have to dig deep into Hyper-Schema (which, essentially, is just a rather complex annotation) and demonstrate how it could work statically.

But I only have a link if the instance matches the relevant schema. That's been part of Hyper-Schema since the beginning. I'm just making the implications more clear.

@epoberezkin
Member

epoberezkin commented Dec 29, 2017

However, I do not understand the concrete use case(s) for your proposed behavior for oneOf, anyOf, if/then/else

@handrews The behaviour I propose is to treat the properties defined in subschemas of these keywords as "unknown" and apply the subschema in "unknownProperties" to those properties.
The argument for such behaviour is that only picking up "known" properties from "allOf" is really needed for schema extension, and all the necessary scenarios can be implemented as the example above demonstrates.

I suggest that you present a use case where it is not possible to design the schema if only properties inside allOf are considered "known".

Also, could you please address the problems of your proposal I explained above:

  • inconsistency of unknown properties with your proposal for different vocabularies
  • inconsistency because of optional validation keywords (format, content).
  • contradiction with separating validation from applicability (#513)

The correct intended behavior for my proposal ... is that it applies to any properties that have not successfully validated against at least one subschema.

That is not clear at all to me.

@handrews
Contributor Author

handrews commented Jan 1, 2018

@epoberezkin I'm not ignoring your questions, but the things that you see as problems I see as essential goals, and vice versa. I am trying to figure out how to make my perspective clear.

Fundamentally, to me a schema is a runtime contract. If my data validates against the schema then it satisfies the runtime contract and I can use any annotations the schema provides.

The contract may be documented in some human-readable way expressing the key aspects of the structure that I need to know to ensure that I can create and consume instances, but I neither know nor care about the actual structure of the schemas ($refs, allOfs, etc.). As long as I have tools that understand the schema, I care about the outcome, not the internals.

This is true of the APIs at my current employer. The human-readable documentation is a condensed and simplified form of a very complex set of schema files. The schema files are organized to facilitate things like sharing structures among internal vs external APIs. There are a lot of oneOfs that are used to expand the capabilities of an existing structure (for instance, an "owner" field was originally a user, but once organizations became a feature, "owner" became a oneOf of user or organization).

So there are two ways to publish the schemas:

  1. Documentation, which is reduced to the most straightforward human presentation
  2. Re-usable, extensible runtime schemas, which produce a consistent result but may change substantially in their internal structure for a variety of reasons (these are not currently published externally for the API I'm working with, but they are published for re-use among teams internally).

For things to be extensible, they need to not use additionalProperties. To be able to check for misspelled properties, you need to be able to add a keyword that has that effect at the point where you connect the data type schemas into a hyper-schema. When you add that keyword, it needs to work with the runtime behavior of the schema because that is what is interesting. Static behavior is not interesting. Schema validation is a runtime thing.

I also can't arbitrarily refactor things away on the grounds of them not being the ideal data design for how you want a particular feature to work. I have to deal with an existing system. Some things are changeable, but others are not. Or are not high enough priority to put resources on changing them. Or are complex enough that the plan to improve them is a long-term plan, not a quick fix.

I don't mean that I'm dealing with some sort of difficult situation- these limitations have existed at every company I've ever worked for. That's just reality when you come into an existing system that was built rapidly while a start-up is trying to prove its viability. Then you work to improve things, but that's an ongoing process.

This is why I am frustrated by your response to my example being to re-factor it into something different. That Vehicle example is actually not directly from my current job (the owner == one of user vs organization is, though). I can't just do that. So when you do that, it's not helpful to me.

You've at times accused me of being overly academic and theoretical, but when you are discarding anything that isn't "ideal" data design, now I'm the one who is frustrated with you not being willing to deal with imperfect but real systems.

Getting back to inconsistencies: the behavior of format and content* and other as-yet-unknown vocabularies is not specific to this problem here. Everyone who uses those features needs to figure out how to handle situations where they are not supported and therefore the outcome of regular old validation is not necessarily consistent. So I don't see that as any more problematic than it already is.

As for using other vocabularies- again, it's the runtime behavior that I want. If I'm working with vocabularies that are not universally supported, I need to deal with that problem whether or not there is some keyword that depends on a not-necessarily-dependable validation outcome. Presumably any application using those schemas also has to deal with that variability.

The answer for this issue is no different than any other usage: Either have a plan for how to degrade gracefully, and document it, or do something to make sure that the users you care about have access to the right tools. But it's not a valid objection to any proposal here because it is already a problem.

@handrews
Contributor Author

handrews commented Jan 1, 2018

@epoberezkin @Relequestual @philsturgeon I can work up a concrete example based on constraints from my current job, but before I do so I want assurances that responses to such a thing:

  1. Cannot change the problem that I say I am trying to solve, or declare that I should not solve that problem
  2. Cannot refactor the schemas that I present (although see the next comment for clarification, and if something really looks refactorable, it would be fine to ask if it is possible and I will either say that it is or give some indication of why it is not)
  3. Cannot pick apart any aspect of the premise of the example with any sort of "I would not do that" response

If everyone promises to respect those constraints in writing in this issue then I will be happy to spend the time to build up a more detailed example. I started trying to do this over the past week, but I kept having to put in disclaimers trying to prevent responses from going off the rails, and it started to become > 50% disclaimers. It's too exhausting, particularly knowing that I'll certainly miss something, and my example will end up dismissed as a result of it. As has already happened repeatedly in this issue.

So, if we can agree to some pre-conditions, I will delve more deeply into the questions that @epoberezkin has asked. As I said, I went through @epoberezkin's schema refactoring / data design objections and tried to make it work. It does not work, but in order to show how it does not work the scenario is fairly complex. Easy-to-follow examples are also easy-to-unravel examples, in my experience. The real world is complicated.

@handrews
Contributor Author

handrews commented Jan 1, 2018

Let me explain the "cannot refactor" request a bit, since refactoring came up as an alternative to schema transforms. I'm making a distinction based on the view of schemas as runtime contracts.

The schema transform cases where I recommended refactoring involved reaching inside a schema, snipping out some keywords, and splicing them into another schema. This violates the abstraction / encapsulation of the contract. So the correct thing is to refactor to produce the right contract.

As an example, if you have a schema that describes Cars, and you want to write a schema for Trucks, and you like how the Cars schema represents engines, the correct approach is to factor out an Engines schema that both Cars and Trucks can use with a regular old $ref. Then the maintainer of the Engine schema knows that it has multiple uses, and needs to keep describing appropriate representations of Engines for those uses.

Now let's think about that Engines schema. It could be implemented as a oneOf of several different types of engines. Or it could be implemented with a relatively flat object structure plus some use of if/then or the schema form of dependencies to ensure that combinations that are needed for certain engine types are present, and combinations that make no sense are not.

As a consumer of the published Engines schema, I shouldn't know or care. So they, the maintainers, are free to refactor as they wish, but I, the consumer who wishes to use the schema and then add "unevaluatedProperties": false on top of it, cannot and should not demand that they refactor their internals.
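As a sketch (the URIs and internal structure here are invented purely for illustration), the factored-out schemas might look like:

{
  "$id": "https://example.com/engine",
  "type": "object",
  "oneOf": [{...}, {...}]
}

{
  "$id": "https://example.com/truck",
  "type": "object",
  "properties": {
    "engine": {"$ref": "https://example.com/engine"},
    "payload": {"type": "number"}
  }
}

and a consumer could then layer the deferred keyword on top without knowing anything about the Engines schema's internals:

{
  "allOf": [{"$ref": "https://example.com/truck"}],
  "unevaluatedProperties": false
}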

There are still a lot of places where re-use and encapsulation don't map cleanly to JSON Schema, but this is all about trying to get us closer to a place where they do, without throwing away the useful properties of JSON Schema that have gotten us this far.

@handrews
Contributor Author

handrews commented Jan 2, 2018

Note: @Relequestual has let me know that he's ill and will likely not be catching up with or commenting on issues for another week. @philsturgeon and @awwright are both traveling this week, I believe. So things will likely be quiet for a bit longer here.

@epoberezkin if the sort of example I talk about putting together in my last few comments is appealing and you're willing to engage it on its own premises, do let me know and I will go ahead and work on that. Otherwise this is on hold until we have a quorum.

@Relequestual
Member

@epoberezkin Thanks for defining your problem statement.
As we differ in opinion on whether you provided the example that I requested, I'll try to be a little more explicit, sorry.

I'm more sold on the approach @handrews presents because I can SEE the use case presented as a real example, where the behaviour he suggests would be desirable.

I want to see the use case which your proposed behaviour then solves. Not abstract, and not re-hashing a different example to work with your proposed behaviour. Totally ignore the other proposal for the purposes of your example. Pretend it doesn't exist! (I feel I adequately understand the differences in the behaviours.)

When I see @handrews's example, his behaviour seems like the right solution.
I want to see a totally different example which you feel would lead me to the same conclusion for your proposed behaviour.

Does that make sense?
Sorry if my previous comments have seemed somewhat abrupt, I was just trying to stay terse =]

@epoberezkin
Member

epoberezkin commented Jan 4, 2018

(A) For things to be extensible, they need to not use additionalProperties. To be able to check for misspelled properties, you need to be able to add a keyword that has that effect at the point where you connect the data type schemas into a hyper-schema. (B) When you add that keyword, it needs to work with the runtime behavior of the schema because that is what is interesting. Static behavior is not interesting. Schema validation is a runtime thing.

@handrews, it is really difficult to try to shoot a moving target here. This requirement (A) above, as well as the requirement to introduce the desired outcome without refactoring existing schemas, has little to do with what was agreed to be the desired outcomes previously.

In addition to that, (B) above doesn't follow from (A). While I understand the desire to achieve A, even though it's different from the agreed objective of the proposed change, I disagree that B is required to achieve A.

Getting back to inconsistencies: the behavior of format and content* and other as-yet-unknown vocabularies is not specific to this problem here.

Indeed, but together with the proposal it creates a much bigger problem than currently exists. There is a big difference between (1) "validation results for some keywords being not strictly defined" (e.g. format, content) and (2) "what properties some schema is applied to is not strictly defined" (as a consequence of 1). Most users of JSON schema seem to see the distinction between validating/processing of the data structure and validating/processing of the property values. This proposal undermines this distinction by making the structure validation dependent on the results of validating property values.

In addition to these problems, this proposal violates most existing JSON schema paradigms, such as shallowness and context-independence.

So I really think we need to agree on what objectives we want to achieve with this change. I do think, for example, that (1) "the ability to introduce the new behaviours to the existing schemas without refactoring" should be a much lower priority than (2) "the ability to create schemas that are easy to maintain and reason about". This (1) being important for a given user of JSON Schema doesn't mean that this should be the consideration affecting what is added to the specification.

If you really believe that this proposal is the only possible solution, it would be much better to implement this proposal in some validator(s) and gather some usage practice for at least 6-12 months - that would make it clearer whether my concerns cause real issues quite often or if they only cause problems in some rare edge cases.

It has been said many times that usage practice should precede standardisation. Nevertheless, here we are discussing the merits of an idea that has never been used and rejecting an idea that has some usage practice ($merge).

@epoberezkin
Member

epoberezkin commented Jan 4, 2018

When I see @handrews's example, his behaviour seems like the right solution.
I want to see a totally different example which you feel would lead me to the same conclusion for your proposed behaviour.

@Relequestual I was just trying to show that the same outcome can be achieved by a simpler change. I am presenting the alternative idea only as an illustration that the proposed keyword is not the minimal change needed to implement the desired outcomes. I do not know how this new keyword should work; some schema extension mechanism seems a better solution to me, but if we MUST avoid it, I could live with some new keyword that violates the shallowness principle but does not violate context-independence (as this proposal does).

So, as I wrote, I see little point comparing proposals until:

  1. there is an agreement on what we want to achieve - it seems to be a moving target at the moment
  2. some usage practice before anything is added to the spec

@handrews
Contributor Author

handrews commented Jan 5, 2018

shallowness and context-independence

@epoberezkin let's take a look at these while waiting for @Relequestual and/or @philsturgeon's next responses. Perhaps if I understand how you see these things, I may change my position, as these concepts are important to me.

Here is how I think about evaluating a schema and instance, and what those two concepts mean. Let me know if you agree, or if not, I would love to see how you view them. I don't necessarily think I have the "correct" view here, so this is not intended to be an argument! :-) Purely fact-finding.

Given:

  • S: a schema document
  • s: an arbitrary schema object within S
  • k1, k2, ... kn: the keywords within schema object s
  • I, an instance document
  • i, a portion of I to which schema object s applies
  1. The result of evaluating s against i is determined by combining the results of evaluating all keywords k1, etc.
  2. Evaluating a keyword k1 is allowed to depend on:
    i. The value of the keyword k1
    ii. The results of any subschemas of k1
    iii. The value of i
    iv. The immediate values of k2, k3, ... kn, but not the contents of any of their subschemas
    v. The results of k2, k3, ... kn

2.v. is for if/then/else. How you evaluate then and else is dependent upon the outcome of evaluating if. You can evaluate them without waiting to evaluate if, but then can only cause the whole schema object s to fail validation if if succeeded. Likewise, else can only cause an overall failure if if failed.
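A minimal example of 2.v:

{
  "if": {"properties": {"x": {"const": true}}},
  "then": {"required": ["y"]},
  "else": {"required": ["z"]}
}

For the instance {"x": true}, the else subschema fails (there is no "z"), but that result is ignored because if succeeded; only then's result (which fails for lack of "y") affects the outcome. Nothing here requires looking inside if, only knowing its result.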

Shallowness

Shallowness is about how subschemas are or are not involved in schema object evaluation.

2.ii, 2.iv, and 2.v illustrate shallowness.

For 2.ii, if s is

{
    "anyOf": [
        {"type": "integer"},
        {"pattern": "^foo"}
    ]
}

k1 is anyOf, and the result of that anyOf depends on the results of the subschemas {"type": "integer"} and {"pattern": "^foo"}. But only their results. The fact that they happen to involve the type and pattern keywords is irrelevant. Changing the subschema keywords may change the outcome, but it does not change the process of determining the result of the anyOf keyword.

For 2.iv., properties and patternProperties both take an object of schemas.
Which instance properties additionalProperties affects is determined by the property names in those objects of schemas (for properties, by directly matching instance property names; for patternProperties, by pattern-matching its property names against instance property names). But none of this depends on the contents of the subschemas in those objects of schemas.

For 2.v., with if/then/else, evaluating then and else does not require knowing the contents of the if subschema (what keywords and values it uses), only the results.

Context Independence

Context independence is about parent and sibling schemas not being involved in schema object validation.

It doesn't matter whether s is a subschema under an if vs then vs else vs *Of vs properties vs items vs..., or if it is the root schema. The evaluation rules given above are identical.

Technically, $schema is inherited into all subschemas, and the base URI is also inherited and possibly modified by $id and then used in $ref, but that is totally independent of the processes of concern to this issue. And if you want to, you can go through beforehand and set $schema and $id to absolute URIs in every subschema, and resolve all $refs to absolute URIs. At that point, evaluation is truly context-free, as the necessary context has now been written into every subschema without violating the spec or changing the outcome.
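For instance (with an illustrative URI), this schema:

{
  "$id": "https://example.com/root",
  "properties": {
    "a": {"$ref": "#/definitions/x"}
  }
}

can be rewritten ahead of time as:

{
  "$id": "https://example.com/root",
  "properties": {
    "a": {"$ref": "https://example.com/root#/definitions/x"}
  }
}

after which evaluating the subschema under "a" no longer depends on any inherited base URI.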


@epoberezkin do you agree with the above? If not, how would you define shallowness and context independence?

@erayd

erayd commented Jan 5, 2018

@handrews I'm currently on holiday and have not had time to catch up with all the recent discussion in this issue (will try to get up to speed again over the weekend), so feel free to ignore me if I've just missed something. However:

iv. The immediate values of k2, k3, ... kn, but not the contents of any of their subschemas

Am I correct that your proposal includes bubbling up property names as part of the result of evaluating those subschemas? I thought that's what you were proposing, but that's necessarily content-dependent, which would seem to contradict your point above.

I have said this earlier, but it feels worth repeating - I think we really, really need to have some discussion around implementation concepts before trying to put anything in the spec. There still seems to be a fair bit of confusion around what is actually intended, and discussing implementation should hopefully get everyone on the same page pretty quickly - code (or pseudocode) is not woolly and open to interpretation the way English can be.

@Relequestual
Member

@Relequestual I was just trying to show that the same outcome can be achieved by a simpler change. I am presenting the alternative idea only as the illustration that the proposed keyword is not the minimal required change needed to implement the desired outcomes.

See, I disagree that your proposed schema to arrive at the same outcome is simpler. I consider it far more complex, and requiring more changes. I expect you mean simpler for the implementor and not the schema author, in which case I would say that JSON Schema is complex enough, and our primary focus needs to be on ease of use for the schema author, and clarity of specification for the implementor. All other considerations, to me, are secondary.

@epoberezkin
Member

@handrews for shallowness the definition is quite close but with a few corrections. Also, I assume we are trying to define, more precisely, how it works in the absence of this proposal.

Shallowness:

2.ii. The results of any direct subschemas of k1, where "direct" includes array items and property values.
2.iv. The immediate values of k2, k3, ... kn, including the names of the properties where their values are objects containing multiple subschemas, but not the contents of any of their subschemas, and excluding values of keywords that are schemas.

2.ii change restricts subschemas to direct subschemas
2.iv allows for additionalProperties to work based on the property names from "properties" and "patternProperties", but prohibits depending on schemas in other keywords (e.g. "not" - although maybe you can refer to the value of "not" as its subschema).

By context-independence I meant independence of the schema applicability from the property/item values - currently the applicability only relies on property names and item indices (i.e. on data structure), but not on their values. With this proposal, the applicability starts depending on property and item values.
Maybe "context-independence" is an incorrect term, but I think that's what was used in arguments against "$data". But given that "$data" can only be used with validation and not applicability keywords, it's different.

@handrews
Contributor Author

handrews commented Jan 5, 2018

@erayd

Am I correct that your proposal includes bubbling up property names as part of the result of evaluating those subschemas? I thought that's what you were proposing, but that's necessarily content-dependent, which would seem to contradict your point above.

I have said this earlier, but it feels worth repeating - I think we really, really need to have some discussion around implementation concepts before trying to put anything in the spec. There still seems to be a fair bit of confusion around what is actually intended, and discussing implementation should hopefully get everyone on the same page pretty quickly - code (or pseudocode) is not woolly and open to interpretation the way English can be.

I'm getting there. Let me sort out with @epoberezkin what the principles that he's concerned about mean first so that I can either address those or change the proposal to reflect them if needed.

@handrews
Contributor Author

handrews commented Jan 5, 2018

@epoberezkin

2.ii. The results of any direct subschemas of k1, where "direct" includes array items and property values.

I'm not entirely sure that I follow this. The result of allOf, anyOf, oneOf, not, and if/then/else depends on their subschemas, which are independent of whether the instance is an object, array, or something else.

2.iv. The immediate values of k2, k3, ... kn, including the names of the properties where their values are objects containing multiple subschemas, but not the contents of any of their subschemas, and excluding values of keywords that are schemas.

Let me see if I can state this a different way to ensure that I'm understanding: Immediate values in the sense of object property names and array indices are available for static analysis (this is how additionalProperties and additionalItems work). However, the contents of subschemas, whether they are immediate values of keywords or are within an object or an array, are off-limits from static examination.

I'm saying "static examination" because we do agree (I think?) that the dynamic results of a subschema are a factor in the results of the keyword (that's kind of the whole point of subschemas, right?).

I'm going to post later about the context-independence part; there's some good new information for me there that I need to think through. Thanks!

@epoberezkin
Member

epoberezkin commented Jan 5, 2018

I'm not entirely sure that I follow this. The result of allOf, anyOf, oneOf, not, and if/then/else depends on their subschemas.

By adding "direct", I mean that the keyword cannot depend on sub-sub-schemas (we don't have a precedent of it at the moment). EDIT: by "array items and property values" I meant that the subschemas of "allOf", for example, are "array items" and the subschemas of "properties" are "property values" (of the value of the "properties" keyword). Sorry for the confusion.

Let me see if I can state this a different way to ensure that I'm understanding: Immediate values in the sense of object property names and array indices are available for static analysis (this is how additionalProperties and additionalItems work). However, the contents of subschemas, whether they are immediate values of keywords or are within an object or an array, are off-limits from static examination.

We talk about the same thing (I think :), I just wanted to clarify.

I'm saying "static examination" because we do agree (I think?) that the dynamic results of a subschema are a factor in the results of the keyword (that's kind of the whole point of subschemas, right?).

Correct, that is covered by 2.ii and 2.v.

I'm going to post later about the context-independence part

Thank you

@handrews
Contributor Author

handrews commented Jan 5, 2018

By adding "direct", I mean that the keyword cannot depend on sub-sub-schemas

Awesome- I am on board with this.

Still working on writing up context-independence and addressing your concerns about depending on property/item values.

@handrews
Contributor Author

handrews commented Jan 6, 2018

@epoberezkin regarding context-independence:

By context-independence I meant independence of the schema applicability from the property/item values - currently the applicability only relies on property names and item indices (i.e. on data structure), but not on their values. With this proposal, the applicability starts depending on property and item values.

(I don't actually remember what was said about $data anymore so I'm skipping that bit)

I think the key thing here is that I'm making a distinction between:

  • A keyword's immediate non-subschema values (including the property names and array indices for objects or arrays of subschemas) [OK to use]
  • The contents of subschemas as would be seen by static examination of the schema document(s) [Not OK to use]
  • The runtime result of evaluating subschemas [OK to use]

The runtime result of evaluating a subschema of course depends on both the subschema's contents and the instance data. But the subschema contents and instance data remain opaque for the purposes of evaluating the parent schema object.

It may be possible to infer things about the subschema contents based on those results, and on the immediate property names / array indices that are fair game to examine, but that's not the same thing as actually looking at the subschema contents and instance data as a separate process from evaluating the subschema.

Does this make sense? If we're just depending on results then both of these objects as subschemas: {"patternProperties": {"^.*$": {"type": "string"}}} and {"additionalProperties": {"type": "string"}} have the same behavior (every object property is evaluated, and every object property's value must be a string).
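Formatted for readability, with a hypothetical instance that both subschemas treat identically:

```json
{ "patternProperties": { "^.*$": { "type": "string" } } }

{ "additionalProperties": { "type": "string" } }
```

Both accept {"a": "x"} and both reject {"a": 1}; from outside the subschema, only those results are observable.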

In this view, we are not allowed to look into the subschema and see whether the result was achieved with additionalProperties or with a patternProperties that matches all possible names.

So I'm claiming that if we are only using results, then we are still context-independent. Does that make sense?

@epoberezkin
Member

epoberezkin commented Jan 6, 2018

So I'm claiming that if we are only using results, then we are still context-independent. Does that make sense?

Yes, as long as by "results" we mean "boolean result of assertions", i.e. valid or invalid.

The reason for that limitation is that if you arbitrarily define validation results, then they can include something which is either "context" (i.e. data values) or something that depends on the "context", so we are no longer context independent.

The way annotation collection is defined makes this exactly the case: collected annotations are context-dependent.

EDIT: actually, annotations expose parts of the schema itself, so making a keyword dependent on annotations (or something similar) violates shallowness, not context-independence.

@epoberezkin
Member

@handrews Another way to explain the problem I see with this proposal is related to the "applicability" concept and how this proposal changes it. Regardless of which section of the spec we put some keywords in, we have keywords that apply subschemas to either a child or the current location of the data instance. They, by definition (section 3.1), belong to the applicability group.

Currently the locations in the data instance to which subschemas should be applied can be determined by:
(1). the keyword logic, as defined in the spec
(2). the keyword value, excluding subschemas
(3). sibling keywords values, excluding subschemas
(4). data structure, i.e. property names and indices of the data instance (but not values of properties and array items).

So applicability keywords have stronger context-independence than validation keywords (that need data values).

To illustrate:

  • allOf, anyOf, oneOf, not, if/then/else - need only (1); they apply all their subschemas to the current data instance
  • properties - needs (1), (2), and (4); it applies subschemas to the corresponding child instances
  • patternProperties - needs (1), (2), and (4); it applies subschemas to child instances whose property names match the patterns
  • additionalProperties - needs (1), (2), (3), and (4) (see the sketch after this list)
    etc.
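As a hypothetical example of that last case: determining that "other" below is routed to the additionalProperties subschema uses the keyword logic (1), the keyword's own value (2), the sibling values of properties and patternProperties (3), and the instance's property names (4), but never any property or item values.

```json
{
  "properties": { "id": { "type": "integer" } },
  "patternProperties": { "^x-": {} },
  "additionalProperties": { "type": "string" }
}
```

Against {"id": 1, "x-a": [], "other": "ok"}, the routing of "other" is decided from names and patterns alone.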

The problem with the proposed keyword is that it makes applicability dependent on data values, as data structure is no longer sufficient to determine whether the subschema of unwhateverProperties will be applied to some child instance.
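A hypothetical sketch of that concern, using the placeholder keyword name from this thread and assuming the proposed "evaluated" semantics: whether the unwhateverProperties subschema applies to "extra" depends on which oneOf branch validated, which in turn depends on the value of "kind".

```json
{
  "oneOf": [
    {
      "properties": { "kind": { "const": "a" }, "extra": {} },
      "required": ["kind"]
    },
    {
      "properties": { "kind": { "const": "b" } },
      "required": ["kind"]
    }
  ],
  "unwhateverProperties": false
}
```

With {"kind": "a", "extra": 1} the first branch evaluates "extra", so nothing is left unevaluated; with {"kind": "b", "extra": 1} the second branch does not, so "extra" falls to unwhateverProperties and validation fails. Data structure alone cannot tell these two instances apart.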

Do you follow this argument or something needs clarifying? Do you see the problem?

I believe that we can and should solve the problems at hand (extending schemas, avoiding typos in property names, etc.) without changing how applicability works.

@handrews
Contributor Author

handrews commented Jan 7, 2018

As with other controversial issues right now, I'm locking this rather than responding further until people who are currently ill and/or traveling can get back and catch up.

@json-schema-org json-schema-org locked as too heated and limited conversation to collaborators Jan 7, 2018
@handrews
Contributor Author

I have filed #530 for nailing down how annotations are collected, since it doesn't really have anything to do with this issue. We may end up using that process, but it's not at all specific to or driven by this concept.

@erayd you'll get your pseudocode there (whether it ends up being relevant here or not; if not, we'll work out whatever we need for this issue here).

@handrews
Contributor Author

handrews commented Mar 2, 2018

I've been talking with the OpenAPI Technical Steering Committee, and one thing that's going on with their project is that the schema for version 3.0 of their specification (the schema for the OAS file, not the schemas used in the file) has been stalled for months.

The main reason it stalled is concern over the massive duplication required to get "additionalProperties": false in all of the situations where the OAS 3.0 specification forbids additional properties. Rather than using allOf and oneOf to avoid duplication, every variation on a schema must be entirely listed out so that additionalProperties can have the desired effect.

I have refactored the schema to use allOf, oneOf, and unevaluatedProperties, which not only dramatically shrank the file (1500 lines down to 845) but allowed a different approach consisting of a number of "mix-in" schemas grouping commonly used fields, which are then referenced throughout a set of object schemas.

See the refactored schema

Note that there is a link to the original PR in the comment on the gist.
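The gist has the real schema; as a rough, simplified sketch of the mix-in pattern (hypothetical definitions, not the actual OAS schema):

```json
{
  "definitions": {
    "described": {
      "properties": { "description": { "type": "string" } }
    },
    "extensible": {
      "patternProperties": { "^x-": {} }
    },
    "parameter": {
      "allOf": [
        { "$ref": "#/definitions/described" },
        { "$ref": "#/definitions/extensible" }
      ],
      "properties": { "name": { "type": "string" } },
      "unevaluatedProperties": false
    }
  }
}
```

With "additionalProperties": false in "parameter", instances containing "description" or "x-" properties would be rejected, because the sibling $refs are invisible to additionalProperties; unevaluatedProperties instead sees that the referenced mix-ins already evaluated those properties.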

I think that this is pretty compelling evidence in favor of unevaluatedProperties. None of the other solutions proposed here could accomplish this due to the heavy use of oneOf. OpenAPI is a well-established, widely used project, and they have found the current situation to be enough of a problem to leave the schema unfinished for months.

@philsturgeon
Collaborator

This implementation of the OpenAPI spec in JSON Schema provides a powerful example of the problem at hand. Multiple people have been discussing multiple different problems, asking for examples of the other problems, and talking past each other; this thread got to an unreadable point due to that confusion.

Now that we have this very specific real-world example solving the problem we're trying to solve, other problems can be discussed in other issues and potentially solved in other threads.

I think we can move along now, closing this issue, happy and content that we have a great example. We have fundamentally solved a giant issue with JSON Schema, and that's fantastic news.

@Relequestual
Member

This is a clear solution to a real problem which has affected aspects of an important project. Let's fix this. Let's go with unevaluatedProperties!

Can you file a new issue specifically for that option? Then we can move directly to a pull request. I feel the general consensus is that we need this.

Unrelated, hello from the UK! ❄️ ❄️ ❄️ ❄️
