performance of creating an array from a range #4784

Closed · sungpily opened this issue Nov 11, 2013 · 24 comments
Labels: performance (Must go faster)

@sungpily

I have tested the following three variations of an array-creation statement:

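# Julia 0.2-era code: tic()/toc() time the loop, Array(Float64, n) allocates,
# and [a:inc:b] concatenated a range into a Vector{Float64} at the time.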
function func1(a, b, inc, iter)
    tic()
    for i=1:iter
        A = [a:inc:b]
    end
    toc()
end

function func2(a, b, inc, iter)
    A = Array(Float64, length(a:inc:b))
    tic()
    for i=1:iter
        A = [a:inc:b]
    end
    toc()
end

function func3(a, b, inc, iter)
    A = Array(Float64, length(a:inc:b))
    tic()
    for i=1:iter
        A[:] = a:inc:b
    end
    toc()
end

and their execution times with a=0, b=10000, inc=0.1, iter=1000:

func1: 0.67 seconds
func2: 0.69 seconds
func3: 0.34 seconds

The equivalent Python/NumPy function

from time import time
import numpy as np

def func(a, b, inc, iter):
    t0 = time()
    for i in range(iter):
        A = np.arange(a, b, inc)
    print(time() - t0, " seconds elapsed")

takes about 0.1 seconds, and the equivalent MATLAB function

function func(a, b, inc, iter)
    tic()
    for i=1:iter
        A = [a:inc:b];
    end
    toc()
end

takes about 0.1 seconds.

[jiahao - edited for formatting. Please use triple backticks when posting code, otherwise it's unreadable.

also x-ref julia-users]

@StefanKarpinski (Member)

Thanks for filing the issue and trying out all these different variations!

@kmsquire (Member)

Can you post the actual output of your test calls? As I mentioned on the mailing list, for me, func3's timings are equivalent to Python's.

(At one point, I thought that Python was a lot faster, but that was because I had put the arguments in the wrong order.)

@StefanKarpinski (Member)

I get this on my laptop:

julia> [ f(0,10000,0.1,1000) for f=[func1,func2,func3] ]
3-element Array{Any,1}:
 0.689111
 0.671948
 0.309497

julia> versioninfo()
Julia Version 0.2.0-rc4
Commit a37b4d6 (2013-11-11 18:47 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.0.0)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

On the same machine, Python takes 0.091638 seconds, so there's definitely a discrepancy on some systems.

@StefanKarpinski (Member)

On the julia.mit.edu Linux machine (the official benchmark system), on the other hand, I get this:

>>> import time
>>> import numpy as np
>>> def func(a, b, inc, iter):
...   t0 = time.time()
...   for i in range(1000):
...     A = np.arange(a, b, inc)
...   print(time.time()-t0, " seconds elapsed")
...
>>> func(0,10000,0.1,1000)
(0.5556738376617432, ' seconds elapsed')

and Julia:

julia> [ f(0,10000,0.1,1000) for f=[func1,func2,func3] ]
3-element Array{Any,1}:
 1.06376
 1.06842
 0.564819

julia> versioninfo()
Julia Version 0.2.0-rc4+2
Commit 249fdca (2013-11-11 20:53 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

So this may be an OS X vs. Linux thing to some extent. @sungpily, @skariel – what operating systems are you on?

@sungpily (Author)

My results are very consistent with Stefan's. I am using a Windows 7 machine.

@sungpily (Author)

By the way, thanks for reformatting the original post. I could not figure out how.

@sungpily (Author)

IMHO, func3 looks "hackish" even though it performs better. I would be happy to see func1 and func2 work as fast as Python or MATLAB.

@StefanKarpinski (Member)

Yeah, the point is not that people should use that version. We're trying to figure out which versions are faster or slower, and why, so that they can all be fast.

@skariel (Contributor) commented Nov 12, 2013

This version is orders of magnitude faster than func3 above:

function func6(a, inc, b)
    A = Array(Float64, length(a:inc:b))
    for i=1:1000
        ix=1
        while a<b
            A[ix] = a
            ix+=1
            a+=inc
        end
    end
end

If you allocate on each iteration, it is still significantly faster than func3 above.

I also compared to C++; code is here. I'm on Win7, using Julia-RC3-amd64, Python 3.3, and the latest NumPy (all amd64).

A couple of interesting observations:

  • When not allocating and de-allocating on each iteration (i.e. Julia's func3 above), Julia is as fast as C (when using the same algorithm)
  • When only allocating and de-allocating, but not filling in the range, Julia is again as fast as C
  • When allocating/de-allocating on each iteration, NumPy beats func6 above by a factor of ~2 (0.1 vs 0.22 secs)

So this seems like a GC issue to me, where Julia is using colder memory (the GC has not always had time to recycle recently used memory, so each allocation hands back fresh, cold memory).

The only fix I think is appropriate is to optimize the range notation [a:inc:b] to use an algorithm like func6 above. The hot/cold memory issue is really a non-issue, since if you need hot memory you just don't de-allocate it.
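
One way to isolate the GC from the measurement (a minimal sketch using the gc_disable()/gc_enable() functions of early Julia):

gc_disable()               # stop the collector so the timed loop sees no GC pauses
func1(0, 10000, 0.1, 1000)
gc_enable()                # re-enable collection
gc()                       # and run a full collection afterwards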

About using integer types and changing them to Float64 - I deleted that post from the forums about a minute after writing it since I timed the wrong function. It still got mailed, though... sorry about that.

@kmsquire (Member)

> About using integer types and changing them to Float64 - I deleted that post from the forums about a minute after writing it since I timed the wrong function. It still got mailed, though... sorry about that.

No worries, and thanks for your tests!

@StefanKarpinski (Member)

Yes, thanks for the excellent analysis. Now we just need to make it do the fast thing :-)

@timholy (Member) commented Nov 12, 2013

@skariel, in func6 I think you need to reset the value of a on each iteration.
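
For example, copying a into a local that is reset on every pass (a minimal sketch in the 0.2-era syntax used above):

function func6(a, inc, b)
    A = Array(Float64, length(a:inc:b))
    for i=1:1000
        x = a          # reset the running value each pass
        ix = 1
        while x < b
            A[ix] = x
            ix += 1
            x += inc
        end
    end
end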

@skariel (Contributor) commented Nov 12, 2013

Oh man.... This is embarrassing.... I'll just maintain a low profile for the next few months.... :(

@timholy (Member) commented Nov 12, 2013

@skariel, please don't! We need people who care about performance and have the ability to do the deep digging.

I remember that one of my earliest questions to the Julia community was "I see there's an isless() function, don't we need an isgreater() function?" Stefan very gently pointed out that you can implement isgreater(x,y) as isless(y,x). By that standard, you're doing just fine :-).

Hope to see you around a lot more!

@kmsquire (Member) commented Nov 12, 2013

I'll second that--I really appreciate that you're taking the time to explore this so deeply.

Kevin

@quinnj (Member) commented Aug 21, 2014

Here's an update on this issue with a gist for all my code.
A few thoughts on the results:

  • vcat(::FloatRange) with no GC vs. devectorizing: results are about the same, with devec a little faster
  • vcat(::StepRange) with no GC vs. devectorizing: devec is significantly faster (the fastest result overall; see the sketch after this list)
  • vcat on a FloatRange or StepRange (i.e. func1) is the only method impacted by GC
  • ::StepRanges seem to get killed by the A[:] = range method; not sure why, but I'll look into it more
  • On the other hand, the best overall performance for FloatRanges comes from the A[:] = range technique
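
For concreteness, "devectorizing" here means filling the output with an explicit loop instead of calling vcat; a minimal sketch (the helper name is illustrative, not taken from the gist):

function devec_fill!(A, r)
    @inbounds for i = 1:length(r)
        A[i] = r[i]    # copy each range element into the preallocated array
    end
    return A
end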

A few other meta-thoughts:

  • These results are certainly affected by the rangepocalypse, particularly the introduction of FloatRange and its more specialized construction behavior
  • It's also important to remember that Matlab usually runs multi-threaded, and it seems even NumPy's range-to-array code is multithreaded (plus, in that case we're really comparing against C code, not Python)

@StefanKarpinski (Member)

I think that we'll have to settle for matching single-threaded performance of other systems for now – so single-threaded C is the benchmark to compare against. Once we have threading in Julia, we can revisit this and make sure that we're as fast as other systems that use threads to speed this up.

@JeffBezanson (Member)

Small update on this. It seems we are doing a bit better now. Here are the times I get for func2 and func3 on three Julia versions (func1 and func2 time identically):

# 0.3:
elapsed time: 1.223933704 seconds
elapsed time: 0.575497855 seconds

# 0.4:
elapsed time: 0.565238758 seconds
elapsed time: 1.041991829 seconds

# 0.5:
elapsed time: 0.484999175 seconds
elapsed time: 0.236237736 seconds

@KristofferC (Member)

# 0.6:
elapsed time: 0.00059386 seconds
elapsed time: 0.328283315 seconds

@vtjnash (Member) commented May 26, 2017

func1 and func2 need a trailing ; to do the same measurement ([a:inc:b;])

@mbauman (Member) commented May 26, 2017

Yeah, we're still a factor of 4 slower than NumPy with func3 (in-place) and a factor of 7 with func2 on my machine. It's worth noting, though, that our float ranges are doing a whole lot more work (by default) than NumPy's… and this is a deliberate decision and trade-off.

If we use a lower precision range type, we're doing much better:

julia> begin
       function func4(a, b, inc, iter)
           A = Array{Float64}(length(a:inc:b))
           tic()
           for i=1:iter
               A = [StepRangeLen(a,inc,b*10+1);]
           end
           toc()
       end
       function func5(a, b, inc, iter)
           A = Array{Float64}(length(a:inc:b))
           tic()
           for i=1:iter
               A[:] = StepRangeLen(a,inc,b*10+1)
           end
           toc()
       end
       end

julia> func4(0,10000,.1,1000)
       func5(0,10000,.1,1000);
elapsed time: 0.523196442 seconds
elapsed time: 0.116625565 seconds

The Python definition on my machine takes 0.144 seconds.

@heitorPB commented May 7, 2019

I think things have improved since this issue was opened. Here are the results on my laptop:

function func1(a, b, inc, iter)
    for i=1:iter
        A = [a:inc:b]
    end
end

function func2(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A = [a:inc:b]
    end
end

function func3(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A[:] = a:inc:b
    end
end
function func4(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A = [StepRangeLen(a,inc,b*10+1);]
    end
end
function func5(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A[:] = StepRangeLen(a,inc,b*10+1)
    end
end

@time func1(0,10000,.1,1000)
@time func2(0,10000,.1,1000)
@time func3(0,10000,.1,1000)
@time func4(0,10000,.1,1000)
@time func5(0,10000,.1,1000)

println("")
@time func1(0,10000,.1,1000)
@time func2(0,10000,.1,1000)
@time func3(0,10000,.1,1000)
@time func4(0,10000,.1,1000)
@time func5(0,10000,.1,1000)
$ julia a.jl 
  0.119092 seconds (397.67 k allocations: 21.202 MiB)
  0.009710 seconds (18.41 k allocations: 1.809 MiB)
  0.360205 seconds (90.05 k allocations: 5.279 MiB)
  0.236766 seconds (96.16 k allocations: 768.615 MiB, 8.72% gc time)
  0.136129 seconds (83.31 k allocations: 4.851 MiB)

  0.000187 seconds (1.00 k allocations: 125.156 KiB)
  0.000293 seconds (1.01 k allocations: 906.547 KiB)
  0.332319 seconds (6 allocations: 781.547 KiB)
  0.272638 seconds (2.01 k allocations: 763.840 MiB, 10.49% gc time)
  0.104920 seconds (6 allocations: 781.547 KiB)

@mbauman (Member) commented May 7, 2019

What Julia means by [a:inc:b] has changed: it used to do concatenation, but now it just puts the range itself inside a one-element array. You need to add a ; to make it do what it used to. And I think your laptop is faster than whatever was used in the original post; we've actually (intentionally) regressed a bit since 0.5.
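
For example (REPL output from a 1.x Julia):

julia> r = 0:0.1:10000;

julia> length([r])    # one element: the range object itself
1

julia> length([r;])   # the ; concatenates, materializing a Vector{Float64}
100001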

In my tests, Julia 1.3-dev is still at roughly the same speed as version 0.6. Here's a slight update to the tests using BenchmarkTools; I think there's also some adversarial compiler shenanigans going on in both languages.

julia> using BenchmarkTools

julia> a, b, inc = 0,10000,.1;

julia> A = Array{Float64}(undef, length(a:inc:b));

julia> @btime [$a:$inc:$b;];
  260.461 μs (2 allocations: 781.39 KiB)

julia> @btime $A[:] = $a:$inc:$b;
  287.677 μs (0 allocations: 0 bytes)

julia> @btime $A .= $a:$inc:$b;
  253.028 μs (0 allocations: 0 bytes)

julia> @btime [StepRangeLen($a,$inc,$b*10+1);];
  84.170 μs (2 allocations: 781.39 KiB)

julia> @btime $A[:] = StepRangeLen($a,$inc,$b*10+1);
  112.257 μs (0 allocations: 0 bytes)

julia> @btime $A .= StepRangeLen($a,$inc,$b*10+1);
  26.295 μs (0 allocations: 0 bytes)

Compare this to numpy:

julia> using PyCall

julia> @pyimport numpy

julia> @btime pycall(numpy.arange, PyObject, $a, $b, $inc);
  86.354 μs (9 allocations: 192 bytes)

So we're around a factor of 3-4 away for the colon construction, but the StepRangeLen construction is about the same. On its face this makes sense, as NumPy does roughly the same calculation as the latter, whereas the former uses higher-precision arithmetic. Despite the threading macros in the NumPy source, it doesn't appear to be doing any multithreading.
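
To see the precision trade-off concretely (this assumes NumPy fills elements as start + i*step in plain Float64, an approximation of its actual fill loop):

julia> 0.1 + 2*0.1              # naive start + i*step picks up rounding error
0.30000000000000004

julia> (0.1:0.1:0.3)[3]         # the range's extended-precision arithmetic lands exactly
0.3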

@JeffBezanson (Member)

Since we intentionally use higher precision for float ranges, maybe this should be closed?

timholy closed this as completed on Oct 2, 2019.