
Improvements in union matching logic during validation #1332

Merged
sydney-runkle merged 26 commits into main from union-val-fix
Jun 19, 2024

Conversation

sydney-runkle
Member

@sydney-runkle sydney-runkle commented Jun 16, 2024

This PR augments the current union validation logic by adding tie-breaker logic for union members that tie in terms of exactness.

The current naive approach: when validating model-like data (including dataclasses and typed dicts), we count the number of fields attached to the validator. Matches with more fields beat out those with fewer, which helps us deal with cases of subclasses, etc.

Fix pydantic/pydantic#9094
Fix pydantic/pydantic#8690
Fix pydantic/pydantic#7223
Fix pydantic/pydantic#9224
Fix pydantic/pydantic#7085
Fix pydantic/pydantic#6637
Fix pydantic/pydantic#9664

Almost fixes the problems in pydantic/pydantic#9095, but not entirely.


codspeed-hq bot commented Jun 16, 2024

CodSpeed Performance Report

Merging #1332 will not alter performance

Comparing union-val-fix (ed8ddb9) with main (46379ac)

Summary

✅ 155 untouched benchmarks

@sydney-runkle
Member Author

sydney-runkle commented Jun 16, 2024

This happens to fix pydantic/pydantic#8690, but really this belongs in the "want to fix" category below, as we need to account for defaults more carefully.

Interestingly, after writing this PR, I noticed that a similar suggestion for a fix was made in pydantic/pydantic#9095. We don't exactly use model_fields_set, because that hasn't been initialized at that time, but the idea is the same.

There's still more to be done here, specifically, we can fix the below issues if we address correctness of a match based on model fields set, and not defaults used.

I do think that issues with default handling should also be addressed in this PR, because without that, existing test cases are failing with the current naive addition.

It would be exciting to close 9 issues with one PR 😂. This speaks both to our inability to close duplicates, but is also helpful as inspiration for a variety of test cases!

Thus, tomorrow I plan to experiment with the __pydantic_fields_set__ data, and with how to extract that info for dataclasses and TypedDicts as well :). Right now we've handled the square, but need to address the rectangle.
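For reference, __pydantic_fields_set__ records which fields were explicitly provided, as opposed to filled from defaults:

```python
from pydantic import BaseModel

class M(BaseModel):
    x: int = 1
    y: int = 2

m = M(y=5)
# Only explicitly provided fields are recorded; defaulted fields are not.
print(m.__pydantic_fields_set__)  # {'y'}
```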

Another follow up question I have: would it make sense to externally manage the 'scoring' associated with various union matches, instead of updating the best match with every new attempt? I don't think the performance will be much different, but we should consider readability.
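To illustrate that alternative, here is a hypothetical Python sketch (not the actual Rust implementation) of externally managed scoring: run every attempt, record its metadata, and compare all candidates at the end instead of updating a running best match.

```python
# Hypothetical sketch of external scoring for smart-mode unions.
def smart_union_validate(attempts, value):
    """Each attempt returns (result, exactness, fields_set_count) or
    raises ValueError. Fields-set count is compared before exactness."""
    scored = []
    for attempt in attempts:
        try:
            result, exactness, fields_set = attempt(value)
        except ValueError:
            continue
        scored.append(((fields_set, exactness), result))
    if not scored:
        raise ValueError('no union member matched')
    # max() keeps the first (leftmost) candidate on ties.
    return max(scored, key=lambda pair: pair[0])[1]
```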

Want to fix:

@sydney-runkle
Member Author

Update, I believe I have now fixed the following:

That being said, I'd definitely be open to reworking this design in general. Specifically, I'm wondering:

  1. Do we really want to attach the num_fields method to the various validators in question? I'm unsure of a better way to approach this given that the validator property is private, but I'm guessing there's a better way to do this lookup on a validator instance.
  2. Do we want to go with an approach to this rating that does more external state management to keep track of each of the attempted validations + related metadata, then compare them at the end? I think this would be less performant, and not that much more readable...
  3. I don't love reading the __pydantic_fields_set__ attribute value in pydantic-core, but we fallback to num_fields() in the case of this getattr failing, so maybe it's ok?
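For point 3, the getattr-with-fallback logic reads roughly like this hypothetical Python sketch (names illustrative; the real code lives in Rust):

```python
from pydantic import BaseModel

def fields_set_metric(validated, static_num_fields):
    """Prefer the runtime __pydantic_fields_set__ value when present;
    fall back to the validator's static field count otherwise."""
    fields_set = getattr(validated, '__pydantic_fields_set__', None)
    if fields_set is not None:
        return len(fields_set)
    return static_num_fields

class M(BaseModel):
    x: int = 1
    y: int = 2
```

For example, fields_set_metric(M(y=5), 2) counts only the one explicitly set field, while a plain dict falls back to the static count.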

@sydney-runkle
Member Author

Things I still definitely need to do here:

  • Tests for all of the above issues that I'm claiming this fixes
  • Get feedback on the above 3 questions to improve this implementation

@sydney-runkle sydney-runkle changed the title Draft: improvements in union matching logic during validation Improvements in union matching logic during validation Jun 17, 2024
@sydney-runkle sydney-runkle marked this pull request as ready for review June 17, 2024 11:55
Contributor

@davidhewitt davidhewitt left a comment


It's really exciting to see progress in this area!

So I will be honest and say that I'm quite scared about all possible solutions in this space and how they generalise to meet every user's expectations.

Especially the more complex we make this logic, the more we risk that users will find edge cases which don't work as they hope and we have to add even more complexity to this process to meet their needs. Where does the line get drawn, and how do we document this process to users? At the moment smart mode unions are very opaque and I think we've permitted ourselves to adjust their behaviour in future releases as long as we believe the new behaviour is better, but it's a fine line.

Using this heuristic solves a lot of the current reports so it seems like a win and we should move forward; let's just tread carefully.

It doesn't need implementing right away, but it'd also be good for us to have an idea of what we could offer as a Python API for users to customise this heuristic. Maybe that's a can of worms, but it'd also finally give us a way to deal with a load of the user cases here.

src/validators/union.rs (two outdated review threads, resolved)
@sydney-runkle
Member Author

It's really exciting to see progress in this area!

🎈

So I will be honest and say that I'm quite scared about all possible solutions in this space and how they generalise to meet every user's expectations.

Especially the more complex we make this logic, the more we risk that users will find edge cases which don't work as they hope and we have to add even more complexity to this process to meet their needs. Where does the line get drawn, and how do we document this process to users? At the moment smart mode unions are very opaque and I think we've permitted ourselves to adjust their behaviour in future releases as long as we believe the new behaviour is better, but it's a fine line.

Yep, this all makes sense to me. I think that because we have some flexibility with what "smart" means, we can make this change. Additionally, users who have specialized cases that behave differently with this change can get the old behavior by using left-to-right mode. I've also tried to preserve some left-to-right behavior by only overriding the current union member match if the fields metric of a new match is greater than that of the current.

Using this heuristic solves a lot of the current reports so it seems like a win and we should move forward; let's just tread carefully.

👍

It doesn't need implementing right away, but it'd also be good for us to have an idea of what we could offer as a Python API for users to customise this heuristic. Maybe that's a can of worms, but it'd also finally give us a way to deal with a load of the user cases here.

I like this idea. At least for now, custom validators provide some sort of shortcut for this.

@sydney-runkle
Member Author

I'll note, pydantic/pydantic#9101 seems like a different issue.

@sydney-runkle
Member Author

Now using ValidationState to keep track of the fields_set_count, as we don't really want to use schema information here, but rather validation time info.

I still need to add support for this to dataclasses, and then add some tests for TypedDict and dataclass

// been one success, we turn on strict mode, to avoid unnecessary
// coercions for further validation?
success = Some((new_success, new_exactness));
let new_fields_set = state.fields_set_count;
Contributor

@davidhewitt davidhewitt Jun 18, 2024


I think we probably need to reset fields_set_count between each loop turn? (e.g. like exactness is reset on line 124).

Propagating on the state like this potentially has surprising effects like list[ModelA] | list[ModelB] probably now gets chosen based on whether the last entry fitted ModelA or ModelB better. We could deal with that by explicitly clearing num_fields_set in compound validators like list validators.

Overall I'm fine with this approach of using state, I think either this or num_fields (and exactness) all are fiddly to implement and will likely just need lots of hardening via test cases.

Member Author


Done!

Member Author


Behavior looks good for the list case you've mentioned above:

from pydantic import BaseModel, TypeAdapter

class ModelA(BaseModel):
    a: int

class ModelB(ModelA):
    b: int

ta = TypeAdapter(list[ModelA] | list[ModelB])

print(repr(ta.validate_python([{'a': 1}, {'a': 2}])))
#> [ModelA(a=1), ModelA(a=2)]

print(repr(ta.validate_python([{'a': 1, 'b': 2}, {'a': 2, 'b': 3}])))
#> [ModelB(a=1, b=2), ModelB(a=2, b=3)]

print(repr(ta.validate_python([{'a': 1, 'b': 2}, {'a': 2}])))
#> [ModelA(a=1), ModelA(a=2)]

print(repr(ta.validate_python([{'a': 1}, {'a': 2, 'b': 3}])))
#> [ModelA(a=1), ModelA(a=2)]

I think this is because I'm now clearing the relevant parts of the state, and fields_set_count is by default None, in which case we only consider exactness.

@sydney-runkle
Member Author

sydney-runkle commented Jun 19, 2024

As discussed with @dmontagu offline, I made some changes so that fields_set_count bubbles up for each union, so that fields set on a nested sub-model count as contributing to the success of a match. For example, this case now returns a ModelB:

from pydantic import BaseModel, TypeAdapter


class SubModelA(BaseModel):
    x: int = 1

class SubModelB(BaseModel):
    y: int = 2

    
class ModelA(BaseModel):
    sub: SubModelA


class ModelB(BaseModel):
    sub: SubModelB


ta = TypeAdapter(ModelA | ModelB)

print(repr(ta.validate_python({'sub': {'y': 3}})))
#> Before: ModelA(sub=SubModelA(x=1))
#> After: ModelB(sub=SubModelB(y=3))

@sydney-runkle
Member Author

I've also added a bunch of new test cases that verify we're getting the desired behavior associated with this change.

src/validators/dataclass.rs (outdated review thread, resolved)
Comment on lines 147 to 153
let new_success_is_best_match: bool =
success.as_ref().map_or(true, |(_, cur_exactness, cur_fields_set)| {
match (*cur_fields_set, new_fields_set) {
(Some(cur), Some(new)) if cur != new => cur < new,
_ => *cur_exactness < new_exactness,
}
});
Contributor


Can we add a comment here describing the logic? I read it as follows:

  • We prefer number of fields to exactness
  • In case where number of fields is tied, or one side doesn't have number of fields, we use exactness

What about ModelA | dict[str, int]? If we pass {'a': 2, 'b': 3} as input, then according to the above the dict validator doesn't currently have number of fields, and will win based on exactness.

I think instead if we passed {'a': '2', 'b': '3'} with string keys, the exactness is the same (lax) and so ModelA probably wins as the leftmost member.

Maybe dict validators should also count fields set?

Member Author


I've added the comment :)

Member Author


That's correct re current behavior:

from pydantic import BaseModel, TypeAdapter

class DictModel(BaseModel):
    a: int
    b: int

ta = TypeAdapter(DictModel | dict[str, int])

print(repr(ta.validate_python({'a': 1, 'b': 2})))
#> {'a': 1, 'b': 2}

print(repr(ta.validate_python({'a': '1', 'b': '2'})))
#> DictModel(a=1, b=2)

print(repr(ta.validate_python(DictModel(a=1, b=2))))
#> DictModel(a=1, b=2)

Member Author


Honestly, I think this might be a case where we suggest that users use left-to-right or a discriminator if they really want to customize behavior.
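For completeness, the discriminator escape hatch mentioned above looks like this (hypothetical models; standard pydantic API):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter

class Cat(BaseModel):
    kind: Literal['cat'] = 'cat'
    meows: int

class Dog(BaseModel):
    kind: Literal['dog'] = 'dog'
    barks: int

# The 'kind' field selects the member directly; no smart-mode scoring runs.
ta = TypeAdapter(Annotated[Union[Cat, Dog], Field(discriminator='kind')])
print(repr(ta.validate_python({'kind': 'dog', 'barks': 3})))
```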

src/validators/union.rs (review thread, resolved)
force_setattr(py, self_instance, intern!(py, ROOT_FIELD), &output)?;
state.fields_set_count = Some(fields_set.len() + state.fields_set_count.unwrap_or(0));
Contributor


I think by adding in this way, when we consider a list we're summing up the models across all list items:

[{'a': 1, 'b': 2}, {'a': 2, 'b': 3}]  # total of ModelA is 2, total of ModelB is 4, so ModelB wins

If I follow correctly, that means that things like how many defaulted members exist can swing the outcome depending on the input length. I get the following behavior, where the list can flip between the A1 and A2 types based on the total count of successful fields:

from pydantic import TypeAdapter, BaseModel

class A1(BaseModel):
    a: int
    b: int = 0

class A2(BaseModel):
    a: int
    c: int = 0
    d: int = 0

fits_a1_better = {'a': 0, 'b': 0}
fits_a2_better = {'a': 0, 'c': 0, 'd': 0}

validate = TypeAdapter(list[A1] | list[A2]).validate_python

print(validate([fits_a1_better, fits_a2_better]))
#> [A2(a=0, c=0, d=0), A2(a=0, c=0, d=0)]
print(validate([fits_a1_better, fits_a1_better, fits_a2_better]))
#> [A1(a=0, b=0), A1(a=0, b=0), A1(a=0, b=0)]
print(validate([fits_a1_better, fits_a1_better, fits_a2_better, fits_a2_better]))
#> [A2(a=0, c=0, d=0), A2(a=0, c=0, d=0), A2(a=0, c=0, d=0), A2(a=0, c=0, d=0)]

... is this ok? Unsure, but it's a less intuitive result than current behaviour, I think. This is why I think list validators might want to remove fields_set. And then there's a whole load of validators with similar problems 😥

Member Author


@davidhewitt,

What other validators would you also like to disable this metric for?

Member Author


Another case where left_to_right helps:

from pydantic import TypeAdapter, BaseModel, Field
from typing import Annotated

class A1(BaseModel):
    a: int
    b: int = 0

class A2(BaseModel):
    a: int
    c: int = 0
    d: int = 0

fits_a1_better = {'a': 0, 'b': 0}
fits_a2_better = {'a': 0, 'c': 0, 'd': 0}

validate = TypeAdapter(Annotated[list[A1] | list[A2], Field(union_mode='left_to_right')]).validate_python

print(validate([fits_a1_better, fits_a2_better]))
#> [A1(a=0, b=0), A1(a=0, b=0)]
print(validate([fits_a1_better, fits_a1_better, fits_a2_better]))
#> [A1(a=0, b=0), A1(a=0, b=0), A1(a=0, b=0)]
print(validate([fits_a1_better, fits_a1_better, fits_a2_better, fits_a2_better]))
#> [A1(a=0, b=0), A1(a=0, b=0), A1(a=0, b=0), A1(a=0, b=0)]

Contributor


@davidhewitt,

What other validators would you also like to disable this metric for?

I think basically every container of variable size has the same problem? sets, lists, tuples, dicts etc

Member Author


Yeah. Makes sense. I think for now we can move forward with the current fix and add a docs update for the left_to_right suggestions.

@sydney-runkle
Member Author

I'll note, I'll accompany this with a docs change on the pydantic end to update our explanation of smart mode and when to use left_to_right.

@sydney-runkle sydney-runkle merged commit fcc77f8 into main Jun 19, 2024
28 checks passed
@sydney-runkle sydney-runkle deleted the union-val-fix branch June 19, 2024 16:10
blast-hardcheese added a commit to replit/river-python that referenced this pull request Aug 15, 2024
This is to get pydantic/pydantic-core#1332 and
can be switched back to the next release of pydantic after v2.8.2
blast-hardcheese added a commit to replit/river-python that referenced this pull request Aug 16, 2024
Why
===

An upstream bugfix in
pydantic/pydantic-core#1332 isn't able to be
installed due to tight version pinning in
[`[email protected]/pyproject.toml#L53`](https://github.com/pydantic/pydantic/blob/v2.8.2/pyproject.toml#L53).

For the time being, just switch to the latest release in the pydantic
repo as of the time of writing, which appears to resolve the generation
issue I'm seeing when running the codegen.


![image](https://github.com/user-attachments/assets/b85ea67c-91cc-455c-bcc1-176cce9610a8)

What changed
============

- Set pydantic version to track git @
pydantic/pydantic@f5d6acf

Test plan
=========

Manually running the codegen against production schemas appears to
generate sensible results
blast-hardcheese added a commit to replit/river-python that referenced this pull request Aug 26, 2024
See #65 for context, or pydantic/pydantic-core#1332 directly.

The issue we were seeing was specifically around union decoders failing to decode
properties of the selected variant, so even though we'd get the right model it
wouldn't be filled out.
blast-hardcheese added a commit to replit/river-python that referenced this pull request Aug 26, 2024
Why
===

See #65 for context, or
pydantic/pydantic-core#1332 directly.

The issue we were seeing was specifically around union decoders failing
to decode properties of the selected variant, so even though we'd get
the right model it wouldn't be filled out.

What changed
============

Bump pydantic to the latest released beta

Test plan
=========

Run codegen, observe that the generated code matches the previously
generated code to a reasonable degree.