Writing large DataFrame extremely slow #1017
Comments
Can you provide more details, such as the types of the columns and how big some of the values are (i.e. are there some really big string values? mostly integers?)? It would be helpful to provide a flamegraph sample to indicate where most of the time is being spent, such as one produced with the PProf.jl package.
Ok, I notice there was actually a NamedTuple column in there that I'd forgotten about with components
After dropping that, it took 32 seconds to write 100k rows, which isn't fast but at least it's not hours.
julia> sort(freqtable(eltype.(eachcol(d))))
10-element Named Vector{Int64}
Dim1                    │
────────────────────────┼───
Float64                 │  1
Bool                    │  1
Union{Missing, Date}    │  7
Date                    │  7
String                  │ 11
Union{Missing, Bool}    │ 13
Union{Missing, Float64} │ 24
Union{Missing, String}  │ 33
Int64                   │ 40
Union{Missing, Int64}   │ 61
The longest string is less than 40 characters. PProf failed to spawn something:
julia> PProf.@pprof CSV.write(path, d)
ERROR: IOError: could not spawn `/home/user/.julia/artifacts/373d20d2dd1459e5066c22ec847146e85dfe6818/bin/pprof -http=localhost:57599 -relative_percentages profile.pb.gz`: no such file or directory (ENOENT)
Stacktrace:
[1] _spawn_primitive(file::String, cmd::Cmd, stdio::Vector{Any})
@ Base ./process.jl:100
[2] #690
@ ./process.jl:113 [inlined]
[3] setup_stdios(f::Base.var"#690#691"{Cmd}, stdios::Vector{Any})
@ Base ./process.jl:197
[4] _spawn
@ ./process.jl:112 [inlined]
[5] open(cmds::Cmd, stdio::Base.DevNull; write::Bool, read::Bool)
@ Base ./process.jl:371
[6] open (repeats 2 times)
@ ./process.jl:362 [inlined]
[7] (::PProf.var"#7#8"{String, Int64, String, String})(pprof_path::String)
@ PProf ~/.julia/packages/PProf/vjh6a/src/PProf.jl:325
[8] (::JLLWrappers.var"#2#3"{PProf.var"#7#8"{String, Int64, String, String}, String})()
@ JLLWrappers ~/.julia/packages/JLLWrappers/QpMQW/src/runtime.jl:49
[9] withenv(::JLLWrappers.var"#2#3"{PProf.var"#7#8"{String, Int64, String, String}, String}, ::Pair{String, String}, ::Vararg{Pair{String, String}})
@ Base ./env.jl:172
[10] withenv_executable_wrapper(f::Function, executable_path::String, PATH::String, LIBPATH::String, adjust_PATH::Bool, adjust_LIBPATH::Bool)
@ JLLWrappers ~/.julia/packages/JLLWrappers/QpMQW/src/runtime.jl:48
[11] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Base ./essentials.jl:716
[12] invokelatest(::Any, ::Any, ::Vararg{Any})
@ Base ./essentials.jl:714
[13] pprof(f::Function; adjust_PATH::Bool, adjust_LIBPATH::Bool)
@ pprof_jll ~/.julia/packages/JLLWrappers/QpMQW/src/products/executable_generators.jl:21
[14] pprof(f::Function)
@ pprof_jll ~/.julia/packages/JLLWrappers/QpMQW/src/products/executable_generators.jl:21
[15] refresh(; webhost::String, webport::Int64, file::String, ui_relative_percentages::Bool)
@ PProf ~/.julia/packages/PProf/vjh6a/src/PProf.jl:324
[16] pprof(data::Nothing, lidict::Nothing; sampling_delay::Nothing, web::Bool, webhost::String, webport::Int64, out::String, from_c::Bool, full_signatures::Bool, drop_frames::Nothing, keep_frames::Nothing, ui_relative_percentages::Bool)
@ PProf ~/.julia/packages/PProf/vjh6a/src/PProf.jl:281
[17] pprof (repeats 2 times)
@ ~/.julia/packages/PProf/vjh6a/src/PProf.jl:104 [inlined]
[18] top-level scope
@ ~/.julia/packages/PProf/vjh6a/src/PProf.jl:351
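The trace above fails while PProf is spawning its bundled pprof binary for the web UI, so no profile was ever shown. One possible way to still see where the time goes, assuming a flat text report is enough, is to use Julia's stdlib Profile directly, which spawns no external process. This is only a sketch: the helper name `profile_to_text` and the report path are hypothetical and not part of CSV.jl or PProf.jl.

```julia
using Profile

# Hypothetical helper: profile the zero-argument function `f` with the stdlib
# sampling profiler and write a flat text report to `report_path`.
function profile_to_text(f; report_path::AbstractString = "profile.txt")
    Profile.clear()          # discard samples left over from earlier runs
    Profile.@profile f()     # collect stack samples while f runs
    open(report_path, "w") do io
        # Flat format lists each frame once; sorting by sample count
        # orders the report by how often each frame was sampled.
        Profile.print(io; format = :flat, sortedby = :count)
    end
    return report_path
end
```

For the case in this thread the call would look like `profile_to_text(() -> CSV.write(path, d))`, ideally on a subset of rows so it finishes quickly. The `pprof` signature in the trace also lists a `web::Bool` keyword, so calling `pprof(; web=false)` after `@profile` may write `profile.pb.gz` without starting the server, though that was not tried here.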
I have a 3M x 200 DataFrame with string and numeric columns. CSV.write goes about 300 rows per second, which means all 3M rows would take roughly 10,000 seconds (nearly 3 hours), way too long to finish.
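According to the follow-up comment, the slowdown went away after a forgotten NamedTuple column was dropped. A quick check for such columns before writing might look like the sketch below. The names `is_simple` and `complex_columns` are hypothetical, and which element types count as "simple" is an assumption, not something CSV.jl defines.

```julia
using Dates

# Assumption: numbers, strings, chars, symbols and date/time values are the
# "simple" element types; Missing is ignored when classifying a column.
is_simple(T::Type) =
    nonmissingtype(T) <: Union{Number, AbstractString, AbstractChar, Symbol, Dates.TimeType}

# Hypothetical helper: given an iterator of name => column pairs, return the
# names of columns whose element type is not simple (e.g. NamedTuple, Vector).
complex_columns(named_cols) =
    [name for (name, col) in named_cols if !is_simple(eltype(col))]

# Small stand-in for a table: a NamedTuple of column vectors.
cols = (
    a = [1, 2],
    b = Union{Missing, String}["x", missing],
    c = [(x = 1, y = 2.0), (x = 3, y = 4.0)],   # NamedTuple-valued column
)
flagged = complex_columns(pairs(cols))
```

With a DataFrame `d`, the same helper should accept `pairs(eachcol(d))`, and the flagged columns could then be dropped or flattened before calling CSV.write.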