Flexible re-use: deferred keywords vs schema transforms #515
I like deferred keywords as a concept, but they do not obviate my need for schema transforms. My primary use-case for transforms is re-use of a schema fragment, with the ability to override some of the keywords. To take a trivial example, using …
Also for the record, I think that …
@erayd I don't see that type of transform- arbitrarily slicing up and combining schema fragments- as within the scope of JSON Schema. Although that view is certainly debatable. Applying arbitrary transforms to JSON like that has nothing to do with JSON Schema. There is no awareness needed of the source or target being schemas or having particular keyword behavior. You're just manipulating JSON text at a raw level. That is why I see it as out of scope- there is simply nothing that requires it to be part of JSON Schema at all. This is different from … The violation of the opacity of $ref … With arbitrary editing, there is no contract. You're snipping a bit of JSON and doing something with it, which may or may not have anything to do with its original purpose in the target document. It still just makes no sense to me.
Hopefully others can talk about how their use cases line up with these proposals. The primary use cases that I remember (OO-style inheritance for strictly typed systems, and disambiguating multiple annotations) can both be solved by deferred keywords. So I would be particularly interested in use cases that stop short of "I want to be able to do arbitrary transforms regardless of schema-ness" but are beyond what can be addressed with deferred keywords.
It doesn't have to be. I think it just makes more sense to define it as part of JSON Schema in order for JSON Schema to have a standard and consistent way of solving the problem. To my mind, this is fundamentally a preprocessing step, and could easily be defined as a separate, referenced standard (e.g. perhaps JSON Schema specifies that the last step of core processing before applying …).
I guess I see it as forming a new contract at the point of reuse, rather than trying to preserve whatever that piece of schema may have been doing before. As an OOP example, defining a child class and then overriding one of the parent methods does not result in a child class that is guaranteed to behave in the same manner as the parent - but it allows for multiple children that share some of their behavior without having to redefine that behavior inside every child class.
Are you able to clarify that a bit? Because even in strictly typed OO inheritance, the behavior in a child class can still override the parent and break whatever behavioral assumptions you may be making based on how the parent works. The only guarantee you have is that the types are the same [Ed: and that the methods etc. exist]. In my ideal world, any reuse mechanism would be applied before …
Also for what it's worth, I care more about …
@erayd I don't consider violations of the Liskov Substitution Principle to be proper OO modeling. Once you break the parent's interface contract you're just doing random stuff and the programmer can't reason about the type system in any consistent way. I'd like to avoid going down a rathole on this before anyone else has had a chance to weigh in. These issues rapidly get too long for most people to read, and this one is long to start with. If you want to argue about type systems let's take it to email (you can get mine off the spec, at the bottom of the document) and see if we can leave space for others to cover their own use cases here.
Fair call - let's switch to email.
@handrews so the TL;DR would be "I want to add a step to the theoretical processing sequence so in future we can peg new keywords to that point in execution"?
@Anthropic basically, yeah. LOL my TL;DRs need TL;DRs. It's not really intended to be theoretical- we would do this to add keywords immediately. I just want to settle on a why and how because I feel like arguing over all of the concrete keywords in this area didn't get us anywhere useful. Just a huge pile of conflicting proposals that people voted for in weird patterns that didn't resolve anything.
I did make up the …
Here's an example of the overall process:

{
"title": "an example",
"description": "something that can be a number or a string",
"anyOf": [
{
"description": "the number is for calculating",
"type": "integer",
"examples": [42]
},
{
"description": "strings are fun, too!",
"type": "string",
"examples": ["hello"]
}
]
}

NOTE: Again, this is not necessarily how an implementation would or should work in terms of step order.

So for step 1, there's nothing to do b/c there are no $refs (or other core keywords) in this schema. Step 2 is to determine what's applicable, which means looking for keywords like anyOf. Step 3 is to evaluate assertions. Let's assume an instance of 42.
So now our set is ("#/anyOf/0", "#").

Step 4 is to collect annotations. By default, multiple annotations are put in an unordered list, while … Collecting over our set of relevant schema objects gives:

{
"title": ["an example"],
"description": [
"something that can be a number or a string",
"the number is for calculating"
],
"examples": [42]
}

I'll do another example showing the deferred keyword stuff next.
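For comparison (this second case is just my own illustration of the same collection rules), a string instance of "hello" would give the set ("#/anyOf/1", "#") instead, and the collected annotations would be:

{
"title": ["an example"],
"description": [
"something that can be a number or a string",
"strings are fun, too!"
],
"examples": ["hello"]
}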
This example illustrates how deferred keywords work, using unknownProperties:

{
"type": "object",
"required": ["x"],
"properties": {
"x": {"type": "boolean"}
},
"allOf": [
{
"if": {
"properties": {
"x": {"const": true}
}
},
"then": {
"required": ["y"],
"properties": {
"y": {"type": "string"}
}
},
"else": {
"required": ["z"],
"properties": {
"z": {"type": "integer"}
}
}
},
{
"patternProperties": {
"^abc": true
}
}
]
}

Assuming an instance where "x" is true, our set of relevant schema objects after step 3 is:

("#/allOf/0/if", "#/allOf/0/then", "#/allOf/0", "#/allOf/1", "#")

Now of course, if we put unknownProperties in the root schema object, it looks like this:

{
"type": "object",
"required": ["x"],
"properties": {
"x": {"type": "boolean"}
},
"allOf": [{...}, {...}],
"unknownProperties": false
}

So now we once again consider our set that we have after step 3. There are no annotation keywords in this schema document, so there's nothing to do for step 4. But we have a deferred keyword, so we have a step 5 to consider.

Unlike immediate keywords at step 3, which can only work in each schema object separately, deferred keywords can look across the whole set of relevant schema objects. This is because we cannot know the full relevant set until after step 3 is complete. So step 3 can't depend on knowing the set that it determines. However, step 5 can. We go into step 5 knowing our full set of relevant schema objects. So, as specified by unknownProperties, we take the union of properties and patternProperties across all of the relevant schema objects.
So the known properties are "x", "y", and any property matching pattern "^abc". This means that our instance … is valid, but … is not. Which is the behavior people have been asking for LITERALLY FOR YEARS.
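To make that concrete (these particular instances are my own stand-ins; any data with the same shape behaves the same way): something like

{"x": true, "y": "hello", "abc123": 1}

passes, because every property is either declared in a relevant schema object or matches "^abc", while

{"x": true, "y": "hello", "other": 1}

fails, because "other" is not declared anywhere in the relevant set and is therefore caught by unknownProperties: false.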
Another idea for implementing deferred keywords is to have a core keyword, $deferred, that holds them:

{
"allOf": [{...}, {...}],
"$deferred": {
"unknownProperties": false
}
}

instead of

{
"allOf": [{...}, {...}],
"unknownProperties": false
}
With $deferred, … Again, not sure if this is more or less confusing. Just thinking out loud about different ways to manage this, so that folks have some more concrete options to consider.
@erayd and I have been having a fantastic side discussion about OO design, subtyping, merge/patch, and other related ideas. He'll post his own summary when he gets a chance, but I wanted to copy over some key points about why merge/patch as optional functionality is hard even though we're perfectly happy to have …

TL;DR:
Annotating Assertions
But the nature of format values … Validating them is also somewhere between challenging and impossible (for instance, there is no perfect regex for validating email addresses). So even when …

Callback Extensibility

Annotating assertions are handled at steps 3 (assertion) and 4 (annotation) of the processing model. Most existing implementations provide only steps 1-3. Instead of step 4 (only defined in draft-07, and still optional), most implementations assume the application will find and use annotations however it wants to. Let's say we have this schema (yes, I know that …):

{
"type": "string",
"anyOf": [
{
"type": "string",
"format": "email",
"title": "email username"
},
{
"pattern": "^[a-z]\\w*[\\w\\d]$",
"title": "basic username"
}
]
}

If we have …, a level 4 implementation that validates the "email" format will return an annotation set of …,
while one that does not validate format will return …
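As a rough sketch of the difference (assuming, purely for illustration, an instance such as "jdoe1", which matches the pattern but is not a valid email address): the format-validating implementation would produce something like

{
"title": ["basic username"]
}

because the email branch fails validation for it, while the non-validating implementation would produce something like

{
"title": ["email username", "basic username"]
}

because for it the email branch passes and contributes its annotations too.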
(recall that …) So we see that not implementing format validation changes which annotations you get.

Extra Level Extensibility

The whole deferred keyword proposal (level 5) relies on the idea that adding a later processing step is an easy extension model. For that matter, so did defining an algorithm for collecting annotations (level 4) in draft-07. All existing level 3 (assertions) implementations are still valid without having to change anything at all. They can add support for the new levels or not, and it's easy to explain what level of support is provided.

Level 1 Extensibility Challenges

This doesn't work when you change level 1, which is what schema transforms such as $merge and $patch do. So you can't just ignore the transforms, because the schemas you pass to level 2 are flat out wrong. But you can't just provide a simple callback for the keyword, because level 1 processing is more complex- your application-side callback would need to call back into the JSON Schema implementation when it hits a $ref. Also, real implementations will go back and forth among levels 1, 2, and 3, because you can't find all … This, in addition to conflicting with …
@handrews re $merge/$patch: it is a pre-processing step, so it's not step 1.5, it's step 0 that should happen before anything else. Ignore the way it's defined in ajv-merge-patch; it uses $refs to essentially include schemas, which is not consistent with the delegation model. So if we add it, it should have a different syntax.

@erayd some of the re-use ideas can be better implemented with $params (#322) than with $merge.

unknownProperties is the same idea as a banUnknownProperties mode, but as a schema keyword. The presence of compound keywords (anyOf etc.) complicates the definition of "known properties" though. The way @handrews proposes it, it seems that what is known will depend on the actual data, which from my point of view leads to non-determinism, potential contradictions (I need to think about the example) and, at the very least, inefficiency. For example, if the idea is that a property is known only if it passes validation by the subschema where the property is defined, then ALL branches of "anyOf" have to be validated; you cannot short-circuit (probably the idea of collecting annotations suffers from the same problem).

I think that for all real use-cases it would be sufficient (and better) if "unknownProperties" operated on a predefined set of properties that does not depend on the validated data and can be obtained via static schema analysis. That analysis would require traversal of referenced schemas as well, but would not traverse them more than once, to correctly handle recursion. In this case we would avoid deferred processing entirely and keep it simple, while achieving the same benefits. The example above would treat x, y, z and abc* as known properties, regardless of the data, and if some additional restrictions need to be applied (e.g. making y and z mutually exclusive) that can easily be achieved by adding some extra keywords.

If we defined unknownProperties based on static schema analysis we would break the shallowness principle, but at least not the processing model. Still, I find some pre-processing syntax more useful and less ambiguous than deferred, data-dependent keywords, and even than statically defined keywords that require deep schema traversal to define their behaviour, even though it can result in invalid schemas (e.g. from combining different drafts). It can be either a generic $merge or a more specialised syntax, either for extending properties of the nearest parent schema or for merging properties from a child subschema. If my memory is correct, $merge also received the most votes. I guess people like it because it is simple, clearly defined, introduces no ambiguity in the results, and solves both the extension problem and other problems.
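To make the difference concrete (my own illustration, reusing the if/then/else example from above): under static analysis the known set is always x, y, z and abc*, so an instance like

{"x": false, "y": "hello", "z": 1}

passes unknownProperties, because y is declared somewhere in the schema even though the then branch does not apply. Under the data-dependent proposal, only the else branch is relevant when x is false, so the very same "y" is treated as unknown and rejected.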
@epoberezkin I'm going to respond in pieces to several of your points. Feel free to lay out new points here, but for back and forth on things we've stated here let's please do that on email to avoid overwhelming everyone else (a complaint we've both heard rather often). We can each come back with summaries of the offlist discussion, as @erayd and I are also doing.
@epoberezkin said: …
Yes, that limitation on short-circuiting has been in the spec explicitly for two drafts now, and has always been implicit in the definition of the meta-data keywords. We've just never required validators to collect annotations (nor do we in draft-07, we just state how to do it for implementations that wish to do so). The no-short-circuit requirement is explicitly defined for validation in draft-07 of Validation, Section 3.3: Annotations, in particular Section 3.3.2, Annotations and Short-Circuit Validation. I do hope you at least skimmed the table of contents during the full month that spec was posted for pre-publication feedback. There were at least three PRs on the topic, at least two of which were open for the standard 2 week feedback period. In draft-06 it was in the section on [defining how hyper-schema builds on validation].

Validation has never been required to do this and still is not required. That is the point of the opt-in multi-level proposal. A Level 3 validator such as Ajv can be much faster than a Level 4 annotation-gathering validator. That's great! Many people would rather have speed. The set of people who need complex annotation gathering is relatively small, and implementation requirements for validation should not be constrained by their use cases. However, all hyper-schema implementations need to be Level 4. Or else they just don't work.

I can go into this in more detail, but static analysis produces incorrect results. While I'm generally willing to defer to you on validation itself, you do not implement hyper-schema and have never expressed any interest in doing so. I have put a lot of thought into that. So if you want to convince me that static analysis is sufficient, you are going to have to dig deep into Hyper-Schema (which, essentially, is just a rather complex annotation) and demonstrate how it could work statically. But I only have a link if the instance matches the relevant schema. That's been part of Hyper-Schema since the beginning. I'm just making the implications more clear.
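As a tiny illustration of what I mean (an ad-hoc example of my own, not from any real API), consider a hyper-schema along the lines of:

{
"type": "object",
"properties": {"status": {"type": "string"}},
"anyOf": [
{
"properties": {"status": {"const": "published"}},
"links": [{"rel": "related", "href": "/feeds/published"}]
},
{
"properties": {"status": {"const": "draft"}},
"links": [{"rel": "edit", "href": "/drafts"}]
}
]
}

Which of those two links applies depends on the value of "status" in the instance, so no amount of static schema analysis can tell you the link set for a given document.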
@handrews The behaviour I propose is to treat the properties defined in subschemas of these keywords as "unknown" and to apply the subschema in "unknownProperties" to those properties. I suggest that you present a use case where it is not possible to design the schema when only the properties inside allOf are considered "known". Also, could you please address the problems with your proposal that I explained above: …
That is not clear at all to me.
@epoberezkin I'm not ignoring your questions, but the things that you see as problems I see as essential goals, and vice versa. I am trying to figure out how to make my perspective clear.

Fundamentally, to me a schema is a runtime contract. If my data validates against the schema then it satisfies the runtime contract and I can use any annotations the schema provides. The contract may be documented in some human-readable way expressing the key aspects of the structure that I need to know to ensure that I can create and consume instances, but I neither know nor care about the actual structure of the schemas (…).

This is true of the APIs at my current employer. The human-readable documentation is a condensed and simplified form of a very complex set of schema files. The schema files are organized to facilitate things like sharing structures among internal vs external APIs. There are a lot of … So there are two ways to publish the schemas: …
For things to be extensible, they need to not use "additionalProperties": false …

I also can't arbitrarily refactor things away on the grounds of them not being the ideal data design for how you want a particular feature to work. I have to deal with an existing system. Some things are changeable, but others are not. Or are not high enough priority to put resources on changing them. Or are complex enough that the plan to improve them is a long-term plan, not a quick fix. I don't mean that I'm dealing with some sort of difficult situation- these limitations have existed at every company I've ever worked for. That's just reality when you come into an existing system that was built rapidly while a start-up is trying to prove its viability. Then you work to improve things, but that's an ongoing process.

This is why I am frustrated by your response to my example being to re-factor it into something different. That Vehicle example is actually not directly from my current job (the owner == one of user vs organization is, though). I can't just do that. So when you do that, it's not helpful to me. You've at times accused me of being overly academic and theoretical, but when you are discarding anything that isn't "ideal" data design, now I'm the one who is frustrated with you not being willing to deal with imperfect but real systems.

Getting back to inconsistencies: the behavior of …

As for using other vocabularies- again, it's the runtime behavior that I want. If I'm working with vocabularies that are not universally supported, I need to deal with that problem whether or not there is some keyword that depends on a not-necessarily-dependable validation outcome. Presumably any application using those schemas also has to deal with that variability. The answer for this issue is no different than any other usage: either have a plan for how to degrade gracefully, and document it, or do something to make sure that the users you care about have access to the right tools. But it's not a valid objection to any proposal here because it is already a problem.
@epoberezkin @Relequestual @philsturgeon I can work up a concrete example based on constraints from my current job, but before I do so I want assurances that responses to such a thing: …
If everyone promises to respect those constraints in writing in this issue then I will be happy to spend the time to build up a more detailed example. I started trying to do this over the past week, but I kept having to put in disclaimers trying to prevent responses from going off the rails, and it started to become > 50% disclaimers. It's too exhausting, particularly knowing that I'll certainly miss something, and my example will end up dismissed as a result of it. As has already happened repeatedly in this issue.

So, if we can agree to some pre-conditions, I will delve more deeply into the questions that @epoberezkin has asked. As I said, I went through @epoberezkin's schema refactoring / data design objections and tried to make it work. It does not work, but in order to show how it does not work the scenario is fairly complex. Easy-to-follow examples are also easy-to-unravel examples, in my experience. The real world is complicated.
Let me explain the "cannot refactor" request a bit, since refactoring came up as an alternative to schema transforms. I'm making a distinction based on the view of schemas as runtime contracts.

The schema transform cases where I recommended refactoring involved reaching inside a schema, snipping out some keywords, and splicing them into another schema. This violates the abstraction / encapsulation of the contract. So the correct thing is to refactor to produce the right contract. As an example, if you have a schema that describes Cars, and you want to write a schema for Trucks, and you like how the Cars schema represents engines, the correct approach is to factor out an Engines schema that both Cars and Trucks can use with a regular old $ref.

Now let's think about that Engines schema. It could be implemented as a … As a consumer of the published Engines schema, I shouldn't know or care. So they, the maintainers, are free to refactor as they wish, but I, the consumer who wishes to use the schema and then add …

There are still a lot of places where re-use and encapsulation don't map cleanly to JSON Schema, but this is all about trying to get us closer to a place where they do, without throwing away the useful properties of JSON Schema that have gotten us this far.
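To sketch that factoring (the names and properties here are made up purely for illustration), the maintainers publish something like:

{
"$id": "https://example.com/schemas/engine",
"type": "object",
"properties": {
"cylinders": {"type": "integer"},
"fuel": {"type": "string"}
}
}

and both Cars and Trucks simply declare:

{"properties": {"engine": {"$ref": "https://example.com/schemas/engine"}}}

The Engines contract stays opaque to its consumers, and its maintainers can reorganize its internals however they like without breaking either referrer.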
Note: @Relequestual has let me know that he's ill and will likely not be catching up with or commenting on issues for another week. @philsturgeon and @awwright are both traveling this week, I believe. So things will likely be quiet for a bit longer here. @epoberezkin if the sort of example I talk about putting together in my last few comments is appealing and you're willing to engage it on its own premises, do let me know and I will go ahead and work on that. Otherwise this is on hold until we have a quorum.
@epoberezkin Thanks for defining your problem statement. I'm more sold on the approach @handrews presents because I can SEE the use case presented as a real example, where the behaviour he suggests would be desirable. I want to see the use case which your proposed behaviour then solves. Not abstract, and not re-hashing a different example to work with your proposed behaviour. Totally ignore the other proposal for the purposes of your example. Pretend it doesn't exist! (I feel I adequately understand the differences in the behaviours.) When I see @handrews's example, his behaviour seems like the right solution. Does that make sense?
@handrews, it is really difficult to try to shoot a moving target here. This requirement (A) above, as well as the requirement to introduce the desired outcome without refactoring existing schemas, has little to do with what was agreed to be the desired outcomes previously. In addition to that, (B) above doesn't follow from (A). While I understand the desire to achieve A, even though it's different from the agreed objective of the proposed change, I disagree that B is required to achieve A.
Indeed, but together with the proposal it creates a much bigger problem than currently exists. There is a big difference between (1) "validation results for some keywords being not strictly defined" (e.g. format, content) and (2) "what properties some schema is applied to is not strictly defined" (as a consequence of 1). Most users of JSON Schema seem to see the distinction between validating/processing the data structure and validating/processing the property values. This proposal undermines this distinction by making the structure validation dependent on the results of validating property values.

In addition to these problems, this proposal violates most existing JSON Schema paradigms, such as shallowness and context-independence. So I really think we need to agree on what objectives we want to achieve with this change. I do think, for example, that (1) "the ability to introduce the new behaviours to the existing schemas without refactoring" should be a much lower priority than (2) "the ability to create schemas that are easy to maintain and reason about". The fact that (1) is important for a given user of JSON Schema doesn't mean that it should be the consideration affecting what is added to the specification.

If you really believe that this proposal is the only possible solution, it would be much better to implement it in some validator(s) and gather usage practice for at least 6-12 months - that would make it clearer whether my concerns cause real issues quite often or only cause problems in some rare edge cases. It was said many times that usage practice should precede standardisation. Nevertheless, here we are discussing the merits of an idea that has never been used, and rejecting an idea that has some usage practice ($merge).
@Relequestual I was just trying to show that the same outcome can be achieved by a simpler change. I am presenting the alternative idea only as an illustration that the proposed keyword is not the minimal change needed to implement the desired outcomes. I do not know how this new keyword should work; some schema extension mechanism seems a better solution to me, but if we MUST avoid it, I could live with some new keyword that violates the shallowness principle but does not violate context-independence (as this proposal does). So, as I wrote, I see little point comparing proposals until: …
@epoberezkin let's take a look at these while waiting for @Relequestual and/or @philsturgeon's next responses. Perhaps if I understand how you see these things, I may change my position, as these concepts are important to me.

Here is how I think about evaluating a schema and instance, and what those two concepts mean. Let me know if you agree, or if not, I would love to see how you view them. I don't necessarily think I have the "correct" view here so this is not intended to be an argument! :-) Purely fact-finding.

Given: …
2.v. is for …

Shallowness

Shallowness is about how subschemas are or are not involved in schema object evaluation. 2.ii, 2.iv. and 2.v. illustrate shallowness. For 2.ii, if s is

{
"anyOf": [
{"type": "integer"},
{"pattern": "^foo"}
]
}

then k1 is anyOf … For 2.iv., … For 2.v., with …

Context Independence

Context independence is about parent and sibling schemas not being involved in schema object validation. It doesn't matter whether s is a subschema under an … Technically, …

@epoberezkin do you agree with the above? If not, how would you define shallowness and context independence?
@handrews I'm currently on holiday and have not had time to catch up with all the recent discussion in this issue (will try to get up to speed again over the weekend), so feel free to ignore me if I've just missed something. However:
Am I correct that your proposal includes bubbling up property names as part of the result of evaluating those subschemas? I thought that's what you were proposing, but that's necessarily content-dependent, which would seem to contradict your point above.

I have said this earlier, but it feels worth repeating - I think we really, really need to have some discussion around implementation concepts before trying to put anything in the spec. There still seems to be a fair bit of confusion around what is actually intended, and discussing implementation should hopefully get everyone on the same page pretty quickly - code (or pseudocode) is not woolly and open to interpretation the way English can be.
See, I disagree that your proposed schema to arrive at the same outcome is simpler. I consider it far more complex, and requiring more changes. I expect you mean simpler for the implementor and not the schema author, in which case I would say that JSON Schema is complex enough, and our primary focus needs to be on ease of use for the schema author, and clarity of specification for the implementor. All other considerations, to me, are secondary.
@handrews for shallowness the definition is quite close, but with a few corrections. Also, I assume we are trying to define, more precisely, how it works in the absence of this proposal.

Shallowness: 2.ii. The results of any direct subschemas of k1, where "direct" includes array items and property values. The change to 2.ii restricts subschemas to direct subschemas.

By context-independence I meant independence of the schema applicability from the property/item values - currently the applicability relies only on property names and item indices (i.e. on data structure), but not on their values. With this proposal, applicability starts depending on property and item values.
I'm getting there. Let me sort out with @epoberezkin what the principles that he's concerned about mean first so that I can either address those or change the proposal to reflect them if needed.
I'm not entirely sure that I follow this. The result of …
Let me see if I can state this a different way to ensure that I'm understanding: immediate values, in the sense of object property names and array indices, are available for static analysis (this is how …). I'm saying "static examination" because we do agree (I think?) that the dynamic results of a subschema are a factor in the results of the keyword (that's kind of the whole point of subschemas, right?).

I'm going to post later about the context-independence part, some good new information for me there that I need to think through- thanks!
By adding "direct", I mean that the keyword cannot depend on sub-sub-schemas (we don't have a precedent for it at the moment). EDIT: by "array items and property values" I meant that the subschemas of "allOf", for example, are "array items", and the subschemas of "properties" are "property values" (of the value of the "properties" keyword). Sorry for the confusion.
We talk about the same thing (I think :), I just wanted to clarify.
Correct, that is covered by 2.ii and 2.v.
Thank you
Awesome- I am on board with this. Still working on writing up context-independence and addressing your concerns about depending on property/item values.
@epoberezkin regarding context-independence: …
(I don't actually remember what was said about ….)

I think the key thing here is that I'm making a distinction between:

1. depending on the results of evaluating a subschema, and
2. examining the contents of that subschema (or the instance data it was applied to) directly.
The runtime result of evaluating a subschema of course depends on both the subschema's contents and the instance data. But the subschema contents and instance data remain opaque for the purposes of evaluating the parent schema object. It may be possible to infer things about the subschema contents based on those results, and on the immediate property names / array indices that are fair game to examine, but that's not the same thing as actually looking at the subschema contents and instance data as a separate process from evaluating the subschema. Does this make sense?

If we're just depending on results, then both of these objects as subschemas: … In this view, we are not allowed to look into the subschema and see whether the result was achieved with …

So I'm claiming that if we are only using results, then we are still context-independent. Does that make sense?
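For illustration (these two subschemas are my own stand-ins, not the ones originally shown): from the parent schema object's point of view,

{"type": "integer"}

and

{"enum": [1, 2, 3]}

are indistinguishable for an instance of 2. Each simply reports "valid", and the parent keyword is not allowed to peek inside to see which keywords produced that result.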
Yes, as long as by "results" we mean "the boolean result of assertions", i.e. valid or invalid. The reason for that limitation is that if you define validation results arbitrarily, then they can include something which is either "context" (i.e. data values) or something that depends on the "context", so we are no longer context-independent. The way annotation collection is defined makes this exactly the case - collected annotations are context-dependent. EDIT: actually annotations add parts of the schema itself, so making a keyword dependent on annotations (or something similar) violates shallowness, not context-independence.
@handrews Another way to explain the problem I see with this proposal is related to the "applicability" concept and how this proposal changes it. Regardless of which section of the spec we put some keywords in, we have keywords that apply subschemas to either a child location or the current location of the data instance. They, by definition (section 3.1), belong to the applicability group. Currently the locations in the data instance to which subschemas should be applied can be determined by: … So applicability keywords have stronger context-independence than validation keywords (which need data values). To illustrate: …
The problem with the proposed keyword is that it makes applicability dependent on data values, as data structure is no longer sufficient to determine whether the subschema of unwhateverProperties will be applied to some child instance. Do you follow this argument, or does something need clarifying? Do you see the problem? I believe that we can and should solve the problems at hand (extending schemas, avoiding typos in property names, etc.) without changing how applicability works.
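To spell out that dependence with a deliberately simple pair of instances (my own, based on the if/then/else example earlier in this thread):

{"x": true, "y": "hello"}

{"x": false, "y": "hello", "z": 1}

In the first case "y" is covered by the then branch, so the unknownProperties subschema is not applied to it; in the second case only the else branch is relevant, so the very same property "y" now falls under unknownProperties. Whether the subschema applies to "y" is decided by the value of "x", not by the structure of the data.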
As with other controversial issues right now, I'm locking this rather than responding further until people who are currently ill and/or traveling can get back and catch up.
I have filed #530 for nailing down how annotations are collected, since it doesn't really have anything to do with this issue. We may end up using that process, but it's not at all specific to or driven by this concept. @erayd you'll get your pseudocode there (whether it ends up being relevant here or not- if not, we'll work out whatever we need for this issue here).
I've been talking with the OpenAPI Technical Steering Committee, and one thing that's going on with their project is that the schema for version 3.0 of their specification (the schema for the OAS file, not the schemas used in the file) has been stalled for months. The main reason it stalled is concern over the massive duplication required to get … I have refactored the schema to use … Note that there is a link to the original PR in the comment on the gist. I think that this is pretty compelling evidence in favor of this approach.
This implementation of the OpenAPI spec in JSON Schema provides a powerful example of the problem at hand. Multiple different people have been discussing multiple different problems, asking for examples of the other problems, and talking past each other, and generally this thread got to an unreadable point due to this confusion. Now that we have this very specific real-world example solving the problem we're trying to solve, other problems can be discussed in other issues and potentially solved in other threads. I think we can move along now, closing this issue, happy and content that we have a great example. We have fundamentally solved a giant issue with JSON Schema, and that's fantastic news.
This is a clear solution to a real problem which has affected aspects of an important project. Let's fix this. Let's go with unevaluatedProperties. Can you file a new issue specifically for that option? Then we can move directly to a pull request. I feel the general consensus is we need this. Unrelated, hello from the UK! ❄️ ❄️ ❄️ ❄️
NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.
We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.
The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.
This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the "additionalProperties": false use cases that do not work well with our existing modularity and re-use features.

TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.
Existing implementations are generally Level 3 by the following list. Draft-07 introduces annotation collection rules which are optional to implement. Implementations that do support annotation collection will be Level 4. This issue proposes Level 5 and Level 6, and also examines how competing proposals (schema transforms) impact Level 1.
EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.
A general JSON Schema processing model
With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.
NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.
NOTE 2: Even if this approach is used, the steps are not executed linearly. $ref must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.

1. Process the schema as a document, handling the core keywords ($schema, $id, $ref, definitions, as discussed in Move "definitions" to core (as "$defs"?) #512)
2. Determine which schema objects are applicable to which locations in the instance
3. Evaluate assertions
4. Collect annotations

There is a basic example in one of the comments.
Note that (assuming #512 is accepted), step 1 is entirely determined by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.
Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.
Steps 3 and 4 are where things get more interesting.
Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).
Implementations that want to implement draft-07's guidance on annotations with the annotation keywords in the validation spec would need to add step 4 (however, this is optional in draft-07).
Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.
So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).
To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:
Deferred processing
To solve the re-use problems I propose defining a step 5:

5. Process deferred keywords (assertions and/or annotations), which work with the complete set of relevant schema objects produced by the earlier steps.

EDIT: The proposal was originally called unknownProperties, which produced confusion over the definition of "known" as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior unevaluatedProperties instead. But that name does not otherwise appear until much later in this issue.

This easily allows a keyword to implement "ban unknown properties", among other things. We can define unevaluatedProperties to be a deferred assertion analogous to additionalProperties. Its value is a schema that is applied to all properties that are not addressed by the union, over all relevant schemas, of properties and patternProperties.

There is an example of how unevaluatedProperties, called unknownProperties in the example, would work in the comments. You should read the basic processing example in the previous comment first if you have not already.

We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be unevaluatedItems, which would be analogous to additionalItems except that it would apply to elements after the maximum-length items array across all relevant schemas. (I don't think anyone's ever asked for this, though.)

Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like deferredDefault, which would override any/all default values. And perhaps it would trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it, do not take this as a serious proposal.)

Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt in to this as an additional level of functionality.
Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).
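As a compact sketch of the intended usage (using the updated unevaluatedProperties name; the property names are just placeholders, and the shape mirrors the longer worked example in the comments):

{
"allOf": [
{"properties": {"name": {"type": "string"}}},
{"properties": {"age": {"type": "integer"}}}
],
"unevaluatedProperties": false
}

Because the keyword is deferred, it sees the union of the properties declared in both allOf branches, so "name" and "age" are allowed and anything else is banned - the effect people currently try, and fail, to get from "additionalProperties": false.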
Schema transforms
In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as $merge or $patch would be added as a step 1.5, as they are processed after $ref but before all other keywords.

These keywords introduce schema transformations, which are not present in the above processing model. All of the other remaining proposals ($spread, $use, single-level overrides) can be described as limited versions of $merge and/or $patch, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.

It's not clear to me how schema transform keywords work with the idea that $ref is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).

[EDIT: @epoberezkin has proposed a slightly different $merge syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion]
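For readers who haven't followed the transform proposals, a $merge-style schema looks roughly like this (following the general shape used by ajv-merge-patch; the exact syntax varies between proposals, and the file name is just a placeholder):

{
"$merge": {
"source": {"$ref": "base.json#"},
"with": {
"properties": {"extra": {"type": "number"}}
}
}
}

Evaluating the transform produces a new schema - the $ref target with the "with" fragment merged into it - which is exactly the kind of step 1.5 rewriting described above.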
$ref
is lazily replaced with its target (with$id
and$schema
adjusted accordingly), then transforms are straightforward. However, we currently forbid changing$schema
while processing a schema document, and merging schema objects that use different$schema
values seems impossible to do correctly in the general case.Imposing a restriction of identical
$schema
s seems undesirable, given that a target schema maintainer could change their draft version indepedent of the source schema maintainer.On the other hand, if
$ref
is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotation). This works fine with different$schema
values but it is not at all clear to me how schema transforms would apply.@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?
Conclusions
Based on my understanding so far, I prefer deferred keywords as a solution. It does not break any aspect of the existing model, it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.
Schema transforms introduces an entirely new behavior to the processing model. It does not seem to work with how we are now conceptualizing
$ref
, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against it.I still also dislike that arbitrary editing/transform functionality as a part of JSON Schema at all, but that's more of a philosophical thing and I still haven't figured out how to articulate it in a convincing way.
I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)