Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two regex-related items on a wish list #1849

Closed
pdeffebach opened this issue Jun 11, 2019 · 12 comments
Closed

Two regex-related items on a wish list #1849

pdeffebach opened this issue Jun 11, 2019 · 12 comments
Labels
non-breaking The proposed change is not breaking
Milestone

Comments

@pdeffebach
Copy link
Contributor

I really like the new regular expressions indexing! We are really getting feature parity with R because of this.

However I want to look beyond R and focus on two Stata features where regex is super useful.

  1. Mixing regex and symbols in select.

    Right now you can't do

    select(df, r"a", :b) 
    

    or

    select(df, [r"a", :b])
    

    This is a pattern that is really useful in Stata when you want, for example, a bunch of nicely titled variables you just made and an ID variable that you use for joining.

  2. Collapse based regular expressions. Inspired by this tweet.

    In Stata you can do the following super nicely:

     collapse (first) id (mean) m_* (sum) s_*
    

    Obviously with Add more splatting and appending of arguments in by #1620 you can make the vectors of symbols first and use that. But you could do that before with the getindex but we added a convenience method for regular expressions. It's worth thinking about adding them here.

Let me know if this is better suited for DataFramesMeta as well.

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

Mixing regex and symbols in select.

Currently the way to write this is select(df, r"a|^b$"), which is not ideal but it is not that bad I think.
However, select API is currently very basic and can be extended (this is a plan actually to e.g. allow for column transformations) then select(df, r"a", :b) would be a natural syntax to use.

Collapse based regular expressions.

I think I get the idea what you want but to be sure could you give an example of proposed Julia target code (as you did in point 1) and expected output? Thank you!

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

Actually regarding 1, if we decided to introduce getcolindex function (or colindex) then you could write:

select(df, vcat(getcolindex.(r"a", :b)...))

Also note that the subtlety is that sometimes several selectors could match the same column names and we would have to decide what to do in this case (throw an error or deduplicate).

@nalimilan
Copy link
Member

Why not allow tuples/vectors mixing regexes and other selectors, if that doesn't make the code too complex. Only trying an implementation will tell...

@bkamins
Copy link
Member

bkamins commented Jun 12, 2019

The initial reason was the following:

  • it is OK in select and select! only
  • it will be a nightmare to support mixing them in all operations that allow column indexing to be consistent (e.g. consider adding it to getindex and view code)

It is doable - simply this will add tons of technological debt. Therefore if we move in this direction a careful redesign of internal API is required. I have been thinking about if for some time now. The initial idea is that all public API functions that take columns should have three versions only the following types of col are allowed:

  • taking an Int
  • taking a AbstractVector{Int}
  • catch all taking Any as column which calls index(df)[col] which guarantees to return Int or AbstractVector{Int} and passing the result to one of the earlier defined functions

Then any such change as proposed here can be implemented only in one place - namely Index. This approach introduces a small overhead, because sometimes you pass cols several times when performing some operations.

This brings us to the question I have been raising for some time now - do we want to expose index(df)[cols] in public API (the getcolindex function or something like this). I know I have been raising this for some time now, but the issue keeps returning constantly. Namely: we should have one canonical representation of column index which is Union{Int, AbstractVector{Int}} IMO and have a method to convert non-canonical representations to canonical one.

If what I propose here is OK with you then in #1847 I will redesign this following what I propose above. Please let me know.

@bkamins
Copy link
Member

bkamins commented Jun 12, 2019

And then when #1847 is done adding the support you ask for here should be a small PR.

@nalimilan
Copy link
Member

Makes sense, but probably better not spend too much time on internal redesigning before 1.0. Features like this issue can wait for 1.x.

@pdeffebach
Copy link
Contributor Author

I think I get the idea what you want but to be sure could you give an example of proposed Julia target code (as you did in point 1) and expected output? Thank you!

For the second point, I am thinking of the following syntax, which admittedly doesn't work because of broadcasting.

by(df, :id, r"^m" .=> mean, r"^s" .=> sum)

In this case r"^m" would behave like an array of all columns that start woth :m. Of course, this breaks parsing rules because Julia would try to broadcast the regex first.

This brings us to the question I have been raising for some time now - do we want to expose index(df)[cols] in public API (the getcolindex function or something like this). I know I have been raising this for some time now, but the issue keeps returning constantly. Namely: we should have one canonical representation of column index which is Union{Int, AbstractVector{Int}} IMO and have a method to convert non-canonical representations to canonical one.

I think this is a great idea. Every time we make a new feature, I am likely to say "well, it would be great if this worked with an array of tons of different kinds of inputs". I think an infrastructure for taking

  • Symbols
  • Ints
  • Regular expressions
  • Named tuple from keyword arguments

in scalar, tuple, or vector form, and then having DataFrames take them and put it into a nice array would be a good investment.

@nalimilan
Copy link
Member

For the second point, I am thinking of the following syntax, which admittedly doesn't work because of broadcasting.

by(df, :id, r"^m" .=> mean, r"^s" .=> sum)

In this case r"^m" would behave like an array of all columns that start woth :m. Of course, this breaks parsing rules because Julia would try to broadcast the regex first.

Yeah, that's a more general problem we have with column selectors (see comments around #1256 (comment)). We should probably add a wrapper type to be used like All(r"^m") => mean or All(r"^m") .=> mean. That would also work for All(:) or just All(). Though the name All isn't great, it should almost be called Broadcast (since that's essentially a way to perform broadcasting, but deferring it until the regex/selector has been expanded to actual column names).

@bkamins
Copy link
Member

bkamins commented Jun 13, 2019

In long term this would be solved by the colindices(cols) function which would expose index(df)[cols] and then you can throw anything as cols. Functions like by could implicitly call colindices in such cases as we discuss here (and then only need to be able to work with canonical Int and AbstractVector{Int} column specifications).

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins bkamins added this to the 2.0 milestone Feb 12, 2020
@bkamins
Copy link
Member

bkamins commented Apr 16, 2020

@pdeffebach - could you please summarize what you think is left from your proposal to be implemented given the current API we have (note that r"x" .=> fun cannot be supported as we would have to override boradcasting of Regex from Base which is not possible).

@pdeffebach
Copy link
Contributor Author

I think the names(df, r"x") .=> fun is very solid and this will be enough to satisfy Stata users. The only issue is that when doing

by(df, :group, names(df, r"income_") .=> mean)

you have to specify df twice. This is fine for now. We can improve upon this when broadcasting of All (or Cols!) gets implemented.

@bkamins
Copy link
Member

bkamins commented Apr 16, 2020

OK - so I close this as we have a separate issue for broadcasting of All etc. that tracks this specific point.

@bkamins bkamins closed this as completed Apr 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

3 participants