Writing large DataFrame extremely slow #1017

Closed
jariji opened this issue Aug 30, 2022 · 2 comments
jariji commented Aug 30, 2022

I have a 3M x 200 DataFrame with string and numeric columns. CSV.write writes about 300 rows per second, which at 3M rows works out to roughly 10,000 seconds (close to 3 hours), far too long to be usable.

Member

quinnj commented Aug 30, 2022

Can you provide more details, like the types of the columns and how big some of the values are (e.g. are there some really big string values? Is it mostly integers?)? It would also help to provide a flamegraph sample, such as one produced with the PProf.jl package, to show where most of the time is being spent.

Author

jariji commented Aug 30, 2022

Ok, I noticed there was actually a NamedTuple column in there that I'd forgotten about, with components
Tuple{Int64, String, Dates.Date, String, String, String, Dates.Date, Union{Missing, Dates.Date}, Union{Missing, String}, Union{Missing, String}, Union{Missing, Int64}, Int64}}. Including that column it takes 68 seconds to write 10k rows.

After dropping that, it took 32 seconds to write 100k rows, which isn't fast but at least it's not hours.
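
For readers who hit the same slowdown, here is a minimal sketch of the "drop the nested column before writing" step described above. The toy table and its column names (`id`, `name`, `meta`) are made up for illustration and are not the data from this issue; the idea is simply to find columns whose element type is a NamedTuple or Tuple and exclude them from the write.

```julia
using CSV, DataFrames

# Toy stand-in for the real table; column names here are hypothetical.
d = DataFrame(
    id = 1:3,
    name = ["a", "b", "c"],
    meta = [(x = 1, y = "p"), (x = 2, y = "q"), (x = 3, y = "r")],
)

# Collect the names of columns whose non-missing element type is a
# NamedTuple or Tuple, i.e. values that are not plain scalars or strings.
nested = [n for n in names(d)
          if nonmissingtype(eltype(d[!, n])) <: Union{NamedTuple, Tuple}]

# Write everything except the nested columns.
path = tempname()
CSV.write(path, select(d, Not(nested)))
```

If the nested data is needed in the file, an alternative under the same assumptions would be to expand it into ordinary columns first rather than dropping it.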

julia> sort(freqtable(eltype.(eachcol(d))))
10-element Named Vector{Int64}
Dim1                    │ 
────────────────────────┼───
Float64                 │  1
Bool                    │  1
Union{Missing, Date}    │  7
Date                    │  7
String                  │ 11
Union{Missing, Bool}    │ 13
Union{Missing, Float64} │ 24
Union{Missing, String}  │ 33
Int64                   │ 40
Union{Missing, Int64}   │ 61

The longest string is less than 40 characters.

PProf failed to spawn its pprof binary:

julia> PProf.@pprof CSV.write(path, d)
ERROR: IOError: could not spawn `/home/user/.julia/artifacts/373d20d2dd1459e5066c22ec847146e85dfe6818/bin/pprof -http=localhost:57599 -relative_percentages profile.pb.gz`: no such file or directory (ENOENT)
Stacktrace:
  [1] _spawn_primitive(file::String, cmd::Cmd, stdio::Vector{Any})
    @ Base ./process.jl:100
  [2] #690
    @ ./process.jl:113 [inlined]
  [3] setup_stdios(f::Base.var"#690#691"{Cmd}, stdios::Vector{Any})
    @ Base ./process.jl:197
  [4] _spawn
    @ ./process.jl:112 [inlined]
  [5] open(cmds::Cmd, stdio::Base.DevNull; write::Bool, read::Bool)
    @ Base ./process.jl:371
  [6] open (repeats 2 times)
    @ ./process.jl:362 [inlined]
  [7] (::PProf.var"#7#8"{String, Int64, String, String})(pprof_path::String)
    @ PProf ~/.julia/packages/PProf/vjh6a/src/PProf.jl:325
  [8] (::JLLWrappers.var"#2#3"{PProf.var"#7#8"{String, Int64, String, String}, String})()
    @ JLLWrappers ~/.julia/packages/JLLWrappers/QpMQW/src/runtime.jl:49
  [9] withenv(::JLLWrappers.var"#2#3"{PProf.var"#7#8"{String, Int64, String, String}, String}, ::Pair{String, String}, ::Vararg{Pair{String, String}})
    @ Base ./env.jl:172
 [10] withenv_executable_wrapper(f::Function, executable_path::String, PATH::String, LIBPATH::String, adjust_PATH::Bool, adjust_LIBPATH::Bool)
    @ JLLWrappers ~/.julia/packages/JLLWrappers/QpMQW/src/runtime.jl:48
 [11] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base ./essentials.jl:716
 [12] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base ./essentials.jl:714
 [13] pprof(f::Function; adjust_PATH::Bool, adjust_LIBPATH::Bool)
    @ pprof_jll ~/.julia/packages/JLLWrappers/QpMQW/src/products/executable_generators.jl:21
 [14] pprof(f::Function)
    @ pprof_jll ~/.julia/packages/JLLWrappers/QpMQW/src/products/executable_generators.jl:21
 [15] refresh(; webhost::String, webport::Int64, file::String, ui_relative_percentages::Bool)
    @ PProf ~/.julia/packages/PProf/vjh6a/src/PProf.jl:324
 [16] pprof(data::Nothing, lidict::Nothing; sampling_delay::Nothing, web::Bool, webhost::String, webport::Int64, out::String, from_c::Bool, full_signatures::Bool, drop_frames::Nothing, keep_frames::Nothing, ui_relative_percentages::Bool)
    @ PProf ~/.julia/packages/PProf/vjh6a/src/PProf.jl:281
 [17] pprof (repeats 2 times)
    @ ~/.julia/packages/PProf/vjh6a/src/PProf.jl:104 [inlined]
 [18] top-level scope
    @ ~/.julia/packages/PProf/vjh6a/src/PProf.jl:351
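
When the pprof web viewer cannot be launched, as in the error above, one possible fallback is Julia's built-in Profile standard library, which prints a text report and does not spawn an external binary. The sketch below uses a small made-up table rather than the data from this issue, and the `mincount` threshold is an arbitrary choice.

```julia
using Profile
using CSV, DataFrames

# Small hypothetical table standing in for the real DataFrame.
d = DataFrame(a = rand(10_000), b = string.(rand(Int, 10_000)))
path = tempname()

# Run once first so that compilation time is not part of the profile.
CSV.write(path, d)

# Collect samples for the write and print a flat, count-sorted text report.
Profile.clear()
@profile CSV.write(path, d)
Profile.print(format = :flat, sortedby = :count, mincount = 5)
```

The flat report lists the most frequently sampled lines last, which is usually enough to tell whether time is going into string formatting, type dispatch, or I/O.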
