narrow_types! operation to wrangle Any columns #9

Closed
anandijain opened this issue Oct 11, 2020 · 4 comments

@anandijain
Contributor

anandijain commented Oct 11, 2020

There have been a number of times when I have wanted to coerce a DataFrame/Table with Any or incorrect column types into something more specific.

I end up with poorly written conversions using tryparse, etc.
For a new user who just wants their data in the correct types without this hassle, a utility function like this would be pretty handy:

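function narrow_types!(df)
    for c in names(df)
        # Promote the concrete types of all values to get the narrowest common element type
        T = mapreduce(typeof, promote_type, df[!, c])
        # Replace the column with a copy converted to that element type
        df[!, c] = Vector{T}(df[!, c])
    end
    return df
end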

example:

julia> df = DataFrame(a = Any[0, missing, 12.8, 30.2])

julia> eltype(df.a)
Any

julia> narrow_types!(df)

julia> eltype(df.a)
Union{Missing, Float64}
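To make the result above less magical, here is a small sketch that spells out the reduction for the same values using only Base Julia. The variable names are illustrative and not part of the proposal.

```julia
vals = Any[0, missing, 12.8, 30.2]

# Concrete type of each element, in order
types = map(typeof, vals)

# promote_type is folded pairwise over those types:
# Int with Missing gives Union{Missing, Int}, and adding Float64 widens the numeric part
T = reduce(promote_type, types)

# Converting to Vector{T} turns the integer 0 into 0.0 and keeps `missing` as is
narrowed = Vector{T}(vals)

# Values with no common supertype other than Any stay Any
mixed = mapreduce(typeof, promote_type, Any[1, "a"])
```

Two caveats follow from this: a column mixing unrelated types (such as numbers and strings) promotes back to Any, and an empty column makes mapreduce throw because there is nothing to reduce over.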

This function could probably be improved to work with specified columns, or an individual column, but this is the gist of it.
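As a rough sketch of the "specified columns" variant, the following works on a plain NamedTuple of vectors so that it needs no packages. The names narrow and narrow_columns are hypothetical helpers, not existing DataFrames.jl or TableOperations.jl functions.

```julia
# Hypothetical helper: narrow a single column; empty columns are returned unchanged
narrow(v) = isempty(v) ? v : Vector{mapreduce(typeof, promote_type, v)}(v)

# Hypothetical helper: narrow only the columns listed in `cols` (all columns by default)
function narrow_columns(tbl::NamedTuple, cols = keys(tbl))
    narrowed = map(keys(tbl)) do nm
        nm in cols ? narrow(tbl[nm]) : tbl[nm]
    end
    return NamedTuple{keys(tbl)}(narrowed)
end

tbl = (a = Any[0, missing, 12.8], b = Any[1, 2, 3], c = Any["x", "y", "z"])

# Narrow :a and :c, leave :b as Vector{Any}
out = narrow_columns(tbl, (:a, :c))
```

Unlike narrow_types!, this version is non-mutating and returns a new NamedTuple; columns that are not selected are shared with the input rather than copied.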

@quinnj
Member

quinnj commented Oct 12, 2020

Yeah, I can see the utility of this. Here's a few thoughts:

  • TableOperations.jl has (so far) focused on "lazy" transformations, whereas this example is "eager". This could be built in a lazy way with a TableOperations.NarrowTypes struct that takes any Tables.jl input, computes the narrowed types, and stores the new Tables.Schema alongside the original table. Then we'd define Tables.getcolumn(x::NarrowTypes, ...) to do the actual Vector{T}(original_col) operation.
  • Does that make sense?
  • If you (or anyone) is willing to make a PR, I'm happy to give pointers/review and we can merge it in.
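The eager/lazy distinction can be illustrated without any packages. The sketch below is only a stand-in for the idea: LazyNarrow, lazy_narrow and getcol are hypothetical names, a Dict plays the role of the stored Tables.Schema, and a real version would extend Tables.getcolumn instead of defining its own accessor.

```julia
# Hypothetical, package-free stand-in for the proposed lazy wrapper
struct LazyNarrow{T<:NamedTuple}
    source::T                  # original columns, left untouched
    types::Dict{Symbol,Type}   # narrowed element type per column, computed once up front
end

narrowed_eltype(col) = mapreduce(typeof, promote_type, col)

# Construction scans the data to compute the types but converts nothing
lazy_narrow(source::NamedTuple) =
    LazyNarrow(source, Dict{Symbol,Type}(nm => narrowed_eltype(source[nm]) for nm in keys(source)))

# The conversion (and its allocation) happens only when a column is requested
getcol(ln::LazyNarrow, nm::Symbol) = Vector{ln.types[nm]}(ln.source[nm])

src = (a = Any[0, missing, 12.8], b = Any[1, 2, 3])
ln = lazy_narrow(src)
col = getcol(ln, :a)
```

Note that "lazy" here still means one pass over the data at construction time to find the types; only the copy into a narrower vector is deferred, and it is repeated on every access unless the result is cached.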

@anandijain
Contributor Author

Thank you! Yes, I believe it makes sense, although I'm a little confused about storing the new Schema and about specifying a subset of columns to narrow (as opposed to all columns).

I don't have a ton of experience with lazy evaluation, so this probably needs a bunch of work. I've started by omitting the ability to specify which columns to narrow (i.e. it just narrows the whole table), but I imagine that part would be quite similar to Select and namesubset:

struct NarrowTypes{T}
    x::T
    schema::Tables.Schema 
end

# Narrowest common element type of a column, found by promoting the types of its values
narrow_arr(x) = mapreduce(typeof, promote_type, x)
narrow_types(t) = NarrowTypes(t, Tables.Schema(Tables.columnnames(t), [narrow_arr(Tables.getcolumn(t, nm)) for nm in Tables.columnnames(t)]))

# nt.schema.types is indexed by position, so a name is first mapped to its column index
Tables.getcolumn(nt::NarrowTypes, nm::Symbol) = Tables.getcolumn(nt, findfirst(==(nm), nt.schema.names))
Tables.getcolumn(nt::NarrowTypes, i::Int) = Vector{nt.schema.types[i]}(Tables.getcolumn(getfield(nt, 1), i))

Tables.columnnames(nt::NarrowTypes) = Tables.columnnames(getfield(nt, 1)) # or nt.schema.names?
Tables.schema(nt::NarrowTypes) = nt.schema

Tables.istable(::Type{<:NarrowTypes}) = true

MWE (I believe any df could be used)

using Tables, TableOperations, CSV, DataFrames

df = CSV.read("purple_air_data.csv", DataFrame)

t = Tables.table(Matrix(df))
Tables.MatrixTable{Array{Any,2}}:

nt = narrow_types(t)

t_sch = Tables.schema(t)
nt_sch = Tables.schema(nt)

julia> Tables.getcolumn(t, 5)
13555-element Array{Any,1}:

julia> Tables.getcolumn(nt, 5)
13555-element Array{Int64,1}:

I'm not exactly certain I've done lazy evaluation correctly, so I'd appreciate you letting me know whether I'm on the right track.

Last things I noted:

  • I don't have any check for rowaccess (as Select does) yet.
  • Should I be mutating the schema or storing a new one? It could be confusing to have an incorrect copy in nt.x.schema alongside the correct nt.schema with updated types.
  • There seems to be a decent amount of copy-pasted code involved in defining a new operation. Not that defining one is very common, but could a macro help avoid a bunch of getfields?

Thanks!

@quinnj
Member

quinnj commented Oct 27, 2020

That looks pretty good so far; mind putting it in a pull request?

@quinnj
Member

quinnj commented Nov 19, 2020

Implemented in #14

@quinnj quinnj closed this as completed Nov 19, 2020