add blog about working with tabular data using FastAI.jl #94
To start working, we'll have to take our tabular data and load it in a form that supports the interface defined by [Tables.jl](https://tables.juliadata.org/stable/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source)-1). Most of the popular packages for loading data from different formats do so already, so you probably won't have to worry about this.

Here, we have a `path` to a CSV file, which we'll load using the [CSV.jl](https://github.com/JuliaData/CSV.jl) package, and get a DataFrame using [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl).
If your data is present in a different format, you could use a package which supports loading that format, provided that the final object created supports the required interface.
What is this required interface? Can you link it? Or is this a general comment?
Ah, seems like you are referring to the Tables.jl Interface, maybe explicitly note that?
Yes, this was referring to the Tables.jl interface. Sure I'll do that.
[FastAI.jl](https://github.com/FluxML/FastAI.jl) is a package inspired by [fastai](https://github.com/fastai/fastai), and its goal is to make it easy to create state-of-the-art models.

This blog post shows how to get started working with tabular data using FastAI.jl and related packages. The work presented here was done as part of [GSoC'21](https://summerofcode.withgoogle.com/projects/#5088642453733376) under the mentorship of Kyle Daruwalla, Brian Chen and Lorenz Ohly.
I think you should not undersell your work here. I am truthfully unfamiliar with the deep technical detail, but consider saying something like "Before my GSoC project, we could only do x and y. Now we can do XY & Z together with this unified interface". This will make it very clear why someone should read this post.
Agreed, this project was no small feat. Just look at how long it's taken other frameworks to (not) add support for new modalities!
Thanks for the comments! I'll add this in as well.
Co-authored-by: Logan Kilpatrick <[email protected]>
julia> path = joinpath(datasetpath("adult_sample"), "adult.csv");

julia> df = CSV.File(path)|> DataFrames.DataFrame; first(df, 5)
Suggested change:
- julia> df = CSV.File(path)|> DataFrames.DataFrame; first(df, 5)
+ julia> df = DataFrames.DataFrame(CSV.File(path))
+ julia> first(df, 5)
This `TableDataset` object lets us get the observation at a particular index using `getindex(td, index)` and the total number of observations using `nobs(td)`.
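To make the indexing idea concrete, here is a minimal sketch in plain Julia with no packages loaded. `ToyTableDataset`, its `columns` field, the local `nobs` function, and the sample values are all hypothetical stand-ins invented for illustration; this is not FastAI.jl's actual `TableDataset` implementation, only one way a container could expose rows through the same `getindex` and `nobs` pattern.

```julia
# Hypothetical stand-in for a table container (NOT FastAI.jl's TableDataset).
# It stores a table as a NamedTuple of equal-length column vectors.
struct ToyTableDataset{T<:NamedTuple}
    columns::T
end

# Assumption for this sketch: the number of observations is the number of rows,
# taken here as the length of the first column.
nobs(td::ToyTableDataset) = length(first(td.columns))

# An observation is one row, returned as a NamedTuple keyed by column name.
Base.getindex(td::ToyTableDataset, i::Integer) = map(col -> col[i], td.columns)

# Made-up sample values, used only to exercise the two functions.
td = ToyTableDataset((age = [39, 50, 38],
                      workclass = ["Private", "Self-emp", "Private"]))

row = td[2]    # same bracket syntax as indexing into an array
n = nobs(td)   # total number of rows
```

Because `td[2]` is ordinary `getindex` syntax, code written to walk over array-like data by index could in principle walk over rows of a table in the same way.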
Maybe add in a line about why this is cool, and how it generalises the usual `getindex`-based approach for arrays to data frames?
julia> item = DataAugmentation.TabularItem(row, Tables.columnnames(df));

julia> DataAugmentation.apply(normalize, item).data
Show the `TabularItem` here to clarify what is written in the next sentence? We never see the `TabularItem` post normalisation.
Hey @manikyabard it would be great to get this wrapped up, let me know if I can help in any way!
Sure @logankilpatrick, I'll get this done (although the next 2-3 weeks look a little busy for me, so this might take a bit). Also just wanted to get a confirmation from @darsnack, @ToucheSir, or @lorenzoh if it's fine to put this post here since we were talking about putting this on the FastAI.jl website as well. I think we did discuss this a few ML Community calls ago but can't remember what our opinion was on that. Another thing is that this blog mainly focused previously on loading the data and performing some transformations on it (mainly because this was all the code that was written at that time), but we have come a long way from that, and can probably include more functionalities such as creating and training tabular models with the data.
I think it's fine to post this on FluxML, but I agree the content should be expanded to include the full GSoC.
The post explores some of the work done for FastAI.jl Development as a part of GSoC'21 (container pr, transformation pr) under the mentorship of @darsnack, @ToucheSir and @lorenzoh, and shows how to get started with working on tabular data by creating a container, and performing various transformations on it.