Two regex-related items on a wish list #1849

pdeffebach · 2019-06-11T21:43:56Z

I really like the new regular expressions indexing! We are really getting feature parity with R because of this.

However I want to look beyond R and focus on two Stata features where regex is super useful.

Mixing regex and symbols in select.

Right now you can't do
```
select(df, r"a", :b) 
```
or
```
select(df, [r"a", :b])
```
This is a pattern that is really useful in Stata when you want, for example, a bunch of nicely titled variables you just made and an ID variable that you use for joining.
Collapse based regular expressions. Inspired by this tweet.

In Stata you can do the following super nicely:
```
 collapse (first) id (mean) m_* (sum) s_*
```
Obviously with Add more splatting and appending of arguments in by #1620 you can make the vectors of symbols first and use that. But you could do that before with the getindex but we added a convenience method for regular expressions. It's worth thinking about adding them here.

Let me know if this is better suited for DataFramesMeta as well.

The text was updated successfully, but these errors were encountered:

bkamins · 2019-06-11T22:04:38Z

Mixing regex and symbols in select.

Currently the way to write this is select(df, r"a|^b$"), which is not ideal but it is not that bad I think.
However, select API is currently very basic and can be extended (this is a plan actually to e.g. allow for column transformations) then select(df, r"a", :b) would be a natural syntax to use.

Collapse based regular expressions.

I think I get the idea what you want but to be sure could you give an example of proposed Julia target code (as you did in point 1) and expected output? Thank you!

bkamins · 2019-06-11T22:07:34Z

Actually regarding 1, if we decided to introduce getcolindex function (or colindex) then you could write:

select(df, vcat(getcolindex.(r"a", :b)...))

Also note that the subtlety is that sometimes several selectors could match the same column names and we would have to decide what to do in this case (throw an error or deduplicate).

nalimilan · 2019-06-12T11:41:53Z

Why not allow tuples/vectors mixing regexes and other selectors, if that doesn't make the code too complex. Only trying an implementation will tell...

bkamins · 2019-06-12T12:50:33Z

The initial reason was the following:

it is OK in select and select! only
it will be a nightmare to support mixing them in all operations that allow column indexing to be consistent (e.g. consider adding it to getindex and view code)

It is doable - simply this will add tons of technological debt. Therefore if we move in this direction a careful redesign of internal API is required. I have been thinking about if for some time now. The initial idea is that all public API functions that take columns should have three versions only the following types of col are allowed:

taking an Int
taking a AbstractVector{Int}
catch all taking Any as column which calls index(df)[col] which guarantees to return Int or AbstractVector{Int} and passing the result to one of the earlier defined functions

Then any such change as proposed here can be implemented only in one place - namely Index. This approach introduces a small overhead, because sometimes you pass cols several times when performing some operations.

This brings us to the question I have been raising for some time now - do we want to expose index(df)[cols] in public API (the getcolindex function or something like this). I know I have been raising this for some time now, but the issue keeps returning constantly. Namely: we should have one canonical representation of column index which is Union{Int, AbstractVector{Int}} IMO and have a method to convert non-canonical representations to canonical one.

If what I propose here is OK with you then in #1847 I will redesign this following what I propose above. Please let me know.

bkamins · 2019-06-12T12:52:46Z

And then when #1847 is done adding the support you ask for here should be a small PR.

nalimilan · 2019-06-12T16:34:44Z

Makes sense, but probably better not spend too much time on internal redesigning before 1.0. Features like this issue can wait for 1.x.

pdeffebach · 2019-06-12T16:42:19Z

I think I get the idea what you want but to be sure could you give an example of proposed Julia target code (as you did in point 1) and expected output? Thank you!

For the second point, I am thinking of the following syntax, which admittedly doesn't work because of broadcasting.

by(df, :id, r"^m" .=> mean, r"^s" .=> sum)

In this case r"^m" would behave like an array of all columns that start woth :m. Of course, this breaks parsing rules because Julia would try to broadcast the regex first.

This brings us to the question I have been raising for some time now - do we want to expose index(df)[cols] in public API (the getcolindex function or something like this). I know I have been raising this for some time now, but the issue keeps returning constantly. Namely: we should have one canonical representation of column index which is Union{Int, AbstractVector{Int}} IMO and have a method to convert non-canonical representations to canonical one.

I think this is a great idea. Every time we make a new feature, I am likely to say "well, it would be great if this worked with an array of tons of different kinds of inputs". I think an infrastructure for taking

Symbols
Ints
Regular expressions
Named tuple from keyword arguments

in scalar, tuple, or vector form, and then having DataFrames take them and put it into a nice array would be a good investment.

nalimilan · 2019-06-13T16:18:07Z

For the second point, I am thinking of the following syntax, which admittedly doesn't work because of broadcasting.
by(df, :id, r"^m" .=> mean, r"^s" .=> sum)
In this case r"^m" would behave like an array of all columns that start woth :m. Of course, this breaks parsing rules because Julia would try to broadcast the regex first.

Yeah, that's a more general problem we have with column selectors (see comments around #1256 (comment)). We should probably add a wrapper type to be used like All(r"^m") => mean or All(r"^m") .=> mean. That would also work for All(:) or just All(). Though the name All isn't great, it should almost be called Broadcast (since that's essentially a way to perform broadcasting, but deferring it until the regex/selector has been expanded to actual column names).

bkamins · 2019-06-13T19:11:27Z

In long term this would be solved by the colindices(cols) function which would expose index(df)[cols] and then you can throw anything as cols. Functions like by could implicitly call colindices in such cases as we discuss here (and then only need to be able to work with canonical Int and AbstractVector{Int} column specifications).

bkamins · 2020-04-16T12:48:40Z

@pdeffebach - could you please summarize what you think is left from your proposal to be implemented given the current API we have (note that r"x" .=> fun cannot be supported as we would have to override boradcasting of Regex from Base which is not possible).

pdeffebach · 2020-04-16T13:13:45Z

I think the names(df, r"x") .=> fun is very solid and this will be enough to satisfy Stata users. The only issue is that when doing

by(df, :group, names(df, r"income_") .=> mean)

you have to specify df twice. This is fine for now. We can improve upon this when broadcasting of All (or Cols!) gets implemented.

bkamins · 2020-04-16T13:38:17Z

OK - so I close this as we have a separate issue for broadcasting of All etc. that tracks this specific point.

bkamins mentioned this issue Oct 6, 2019

Allow rename when selecting #1975

Closed

bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020

bkamins added this to the 2.0 milestone Feb 12, 2020

bkamins closed this as completed Apr 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Two regex-related items on a wish list #1849

Two regex-related items on a wish list #1849

pdeffebach commented Jun 11, 2019

bkamins commented Jun 11, 2019

bkamins commented Jun 11, 2019

nalimilan commented Jun 12, 2019

bkamins commented Jun 12, 2019

bkamins commented Jun 12, 2019

nalimilan commented Jun 12, 2019

pdeffebach commented Jun 12, 2019

nalimilan commented Jun 13, 2019

bkamins commented Jun 13, 2019

bkamins commented Apr 16, 2020

pdeffebach commented Apr 16, 2020

bkamins commented Apr 16, 2020

Two regex-related items on a wish list #1849

Two regex-related items on a wish list #1849

Comments

pdeffebach commented Jun 11, 2019

bkamins commented Jun 11, 2019

bkamins commented Jun 11, 2019

nalimilan commented Jun 12, 2019

bkamins commented Jun 12, 2019

bkamins commented Jun 12, 2019

nalimilan commented Jun 12, 2019

pdeffebach commented Jun 12, 2019

nalimilan commented Jun 13, 2019

bkamins commented Jun 13, 2019

bkamins commented Apr 16, 2020

pdeffebach commented Apr 16, 2020

bkamins commented Apr 16, 2020