feat: Add label polymorphism #4388
Conversation
Overall I like the direction this proposal is going. Have a few questions below.
libflux/Polymorphic_Labels.md (outdated):
```
{ 'column: A } <=> { b: string, a: int }
```

There may be a consistent way to unify these records in the face of type variables; however, an easy workaround would be to delay the unification of records with unknown fields until they have been resolved, at which point they can unify normally. If there is a field that is still unknown when type checking is done, we can designate that as an error.
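As a rough sketch of how a record with an unknown field like `'column` arises in practice, here is an invented helper, assuming the proposal where `fill`'s `column` parameter becomes a label variable:

```
import "array"

// Inside `fillZero` the field named by `column` is not yet known, so the record
// type of `tables` contains a field 'column that cannot unify with a concrete
// record until a call site fixes it.
fillZero = (tables=<-, column) => tables |> fill(column: column, value: 0)

// Here 'column unifies with the literal label "a", at which point the delayed
// record unification against { b: string, a: int } can proceed as usual.
array.from(rows: [{a: 1, b: "x"}])
    |> fillZero(column: "a")
```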
To me, delaying unification until records with unknown fields have been resolved makes the most sense; it's how I reason about this logic in the first place. Are there any drawbacks to delaying? Multiple passes over the semantic graph?
It adds some complexity and could possibly give worse error messages (though as long as we take care to store all the context we need for the errors, it shouldn't be an issue).
There may be some performance implications as well, since we need to know/decide when to evaluate these delayed constraints. Constraints could (in theory at least) depend on each other in complex ways, which we would need to untangle to resolve them all.
A bit of a tangent, but Rust's typechecker has a largely equivalent system of "obligations" which it brute-forces when solving them. I tried to improve that with rust-lang/rust#69218, which would make them know more precisely when they would be solvable, but the overhead was too much.
func(opt: "a")

// Possible extension where we only allow some specific labels to be passed in
builtin func : (opt: "option1" | "option2") => int
Is the idea here to make a sort of enum type? For things like the `method` parameter to the `quantile` function?
Why would we use labels for this, other than that it's convenient? Meaning, I don't see a use case yet where a label used in the context of a column name would be constrained to a specific set.
> Is the idea here to make a sort of enum type? For things like the method parameter to the quantile function?

Yes, at least a limited one. I mostly just extracted it from #4410, where I found some functions which accept only a limited number of strings as an argument (see `group` and `stddev`).
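For concreteness, a sketch of what such a restricted signature could look like for `quantile`; the `|` union of label literals is the proposed extension (not shipped syntax), the parameter list is abridged, and the method names are the ones the current docs list:

```
// Sketch only: `method` would accept exactly these three label literals, and
// any other string literal would be rejected at type-check time.
builtin quantile : (
        <-tables: stream[{A with _value: float}],
        q: float,
        ?method: "estimate_tdigest" | "exact_mean" | "exact_selector",
        ?compression: float,
    ) => stream[{A with _value: float}]
```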
// We should (probably) still allow dynamic strings for backwards compatibility's sake so the `string` type
// also implements the `Label` Kind
c = "a" + ""
fill(column: c, value: 0)
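For comparison, a sketch of a column name that is only produced by evaluating the program and would therefore keep the plain `string` type; the data and the particular helper used here are invented for illustration:

```
import "array"
import "strings"

// The column name only exists after evaluation, so it cannot become a
// label("...") literal at type-check time.
c = strings.joinStr(arr: ["mem", "used"], v: "_")

array.from(rows: [{mem_used: 0, region: "eu"}])
    |> fill(column: c, value: 1)
```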
I wonder how often a truly dynamic column name is used? Maybe we can get some of that from the query archive?
Also, what do we consider dynamic? Anything that requires evaluating the semantic graph? What about this case:
c = "foo"
fill(column: c, value: 0)
Is that considered dynamic or static? It seems like it could be considered static. I would also expect this case to be very common, as a user might have to use the column name in several different functions and so would like to put it in a variable to make changing it easier.
Yes, that could be considered static in this context (it is probably harder to force it to be "dynamic" than not).
For just about all functions defined with polymorphic labels in #4410 I can think of a way to redefine them so that they do not require polymorphic labels, which may be an argument for not doing this.
I didn't go through all of the functions thoroughly, but other than issues implementing these changed functions, it seems pretty plausible to "only" provide functions that operate on streams of plain values instead of records (or records with specific fields). Lists of polymorphic labels (which are not part of this PR) seem more complicated, but at least some operations could be defined.
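As a hedged illustration of that direction, "streams of plain values" could mean splitting today's column-parameterised aggregates into a projection step plus a value-level reducer; none of these signatures exist, and the names are invented:

```
// Hypothetical projection: pull a single labelled column out of each record.
builtin values : (<-tables: stream[{A with C: B}], column: C) => stream[B] where C: Label

// Hypothetical reducer over plain values, with no knowledge of records at all.
builtin sumValues : (<-values: stream[A]) => A where A: Numeric
```

The obvious gap in this shape is group keys, which is exactly what the following exchange digs into.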
In general I like the idea of using smaller composable operations, especially for this example with aggregate functions:
How does this work for a typical example where there are group key columns? Say I have a query where I want to get the sum of memory used by region. How do we preserve the separation between regions if aggregates always produce streams of plain values?
A broader question: are you thinking that users will not have to change their scripts in order to get the same behavior as before? Or are you assuming that some changes will be required of users?
True, aggregators need to take groups into account, and the group concept is sort of shared between records and streams. The stream tracks each group, but checking the group key of a record goes through the record itself. So returning or accepting streams of raw values is probably not that useful. I suppose there could be some way to try to separate group keys from records, but that likely strays too far towards breaking changes, and it may just end up replicating what records already do. So I think we need to stick with streams of records for that reason, which in turn forces aggregate functions (and other transformations) to accept which column(s) to work with as a parameter.
No, users would be able to keep everything as is, but when working with pivoted data they would need to add
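To make the group-key point concrete, here is a sketch of the kind of query being discussed; the bucket, measurement, and field names are invented:

```
// Because `sum` still consumes a stream of records and takes the column as a
// parameter, the per-`region` group keys survive the aggregation.
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "mem" and r._field == "used")
    |> group(columns: ["region"])
    |> sum(column: "_value")
```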
For #4388 we need a way to describe a type (label) variable in place of a record field. As type variables are described as a single, uppercase letter, mimicking that syntax will make it impossible to describe the literal field name "A", "B", etc. By allowing a string literal in this location one can always use this syntax to describe these single letter field names and also describe any other fields which do not fit into an `identifier`, for example `"field with spaces"`, `"symbols#%^"`.
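For illustration, a record type using the proposed string-literal field syntax; the function `f` itself is made up:

```
// Single-letter and non-identifier field names become expressible.
builtin f : (r: {"A": int, "field with spaces": string, "symbols#%^": float}) => int
```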
Co-authored-by: Scott Anderson <[email protected]>
`<error>` isn't the best in this error message, but it isn't worse than the previous error.
This was causing errors due to accessing the `experimental/universe` imports, where accessing, say, `universe.fill` would instantiate all the functions in the `universe` record, but only the labels of the used function would get filled, which then triggered the error. I thought this was necessary to prevent ambiguities from arising when doing unifications like

```
{ A with B: string } <=> { C with test: string }
{ A with B: string } <=> { D with abc: string }
```

as unifying the labels here (`B <=> "test"/"abc"`) would give different results depending on which unification were done first. However, I was able to remove that code, so record labels that do not become concrete will just hang around unused, which at least will not cause any problems.
POC to introduce "label polymorphism", which will allow type signatures to more precisely specify how they transform records, thereby being able to statically determine more errors. This works by letting string literals get the new `label("literal")` type instead of just `string`, which, in conjunction with record fields being capable of being type variables, lets functions accept a "label" as a parameter that determines which fields the function runs on.
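A minimal sketch of the idea applied to `fill`; the label-variable field syntax and the `Label` constraint spelling are assumptions for illustration, not necessarily the final design:

```
// `C` is a label variable; `B` is the type of the values in that column.
builtin fill : (<-tables: stream[{A with C: B}], column: C, value: B) => stream[{A with C: B}]
    where C: Label

// At a call site such as
//
//     data |> fill(column: "a", value: 0)
//
// the literal "a" gets the type label("a"), so C = "a" and the checker knows
// statically that both the input and output records contain a field `a` whose
// type matches `value`.
```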
Open questions and follow-ups:

- Allow multiple labels to be passed in an array (`columns: [L]`)? Should be a separate issue, implemented later if desired. Probably needs to treat arrays as heterogeneous lists of some sort so they are typed like `["a", "b"]` instead of `[string]`.
- On the other hand, functions still want to accept a plain dynamic `string` (as they do today), so unification needs to cope with that as well. We may need "dynamic" labels for this, `{ A with 'dynamic: int, 'dynamic: B }`, where a record with one (or more!) dynamic labels would accept any field being accessed. We are moving ahead with just statically defined labels; dynamic ones should be rare enough that we can accept any breakages (though we do need to verify that).
- Record fields that are type variables are currently written as `{ A: int }`, but that makes it impossible to define a record type with the field `A`. Add a `"A"` syntax perhaps (same as the record creation/indexing syntax)? See feat: Accept string literals in the fields of a record type #4664.
- Type variables are currently re-numbered during the `convert` pass, so you can get confusing errors where it says that the field `C` does not exist even though the field was actually specified with `B` (as seen below). (This happens on the master branch today as well.) Should be another issue; may not bother with this.
- We need to know what label `B` has before we do the record unification. A better way may be to add a constraint on `B`, like "`B` must be one of the fields in record X", which can be solved later.
- Ensure the tests in `labels.rs` have adequate coverage.

Based on #4664