performance of creating an array from a range #4784

Closed · sungpily opened this issue Nov 11, 2013 · 24 comments
Labels: performance (Must go faster)

@sungpily

I have tested the following three variations of an array-creation statement:

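# Julia 0.2-era code: tic()/toc() time the loop, Array(Float64, n) allocates,
# and [a:inc:b] concatenated a range into a Vector{Float64} at the time.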
function func1(a, b, inc, iter)
    tic()
    for i=1:iter
        A = [a:inc:b]
    end
    toc()
end

function func2(a, b, inc, iter)
    A = Array(Float64, length(a:inc:b))
    tic()
    for i=1:iter
        A = [a:inc:b]
    end
    toc()
end

function func3(a, b, inc, iter)
    A = Array(Float64, length(a:inc:b))
    tic()
    for i=1:iter
        A[:] = a:inc:b
    end
    toc()
end

and their execution times with a=0, b=10000, inc=0.1, iter=1000:

func1: 0.67 seconds
func2: 0.69 seconds
func3: 0.34 seconds

The equivalent Python/NumPy function

from time import time
import numpy as np

def func(a, b, inc, iter):
    t0 = time()
    for i in range(iter):
        A = np.arange(a, b, inc)
    print(time() - t0, " seconds elapsed")

takes about 0.1 seconds, and the equivalent MATLAB function

function func(a, b, inc, iter)
    tic()
    for i=1:iter
        A = [a:inc:b];
    end
    toc()
end

takes about 0.1 seconds.

[jiahao - edited for formatting. Please use triple backticks when posting code, otherwise it's unreadable.

also x-ref julia-users]

@StefanKarpinski (Member)

Thanks for filing the issue and trying out all these different variations!

@kmsquire (Member)

Can you post the actual output of your test calls? As I mentioned on the mailing list, for me, func3's timings are equivalent to Python's.

(At one point, I thought that Python was a lot faster, but that was because I had put the arguments in the wrong order.)

@StefanKarpinski (Member)

I get this on my laptop:

julia> [ f(0,10000,0.1,1000) for f=[func1,func2,func3] ]
3-element Array{Any,1}:
 0.689111
 0.671948
 0.309497

julia> versioninfo()
Julia Version 0.2.0-rc4
Commit a37b4d6 (2013-11-11 18:47 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.0.0)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

On the same machine, Python takes 0.091638 seconds, so there's definitely a discrepancy on some systems.

@StefanKarpinski (Member)

On the julia.mit.edu Linux machine (the official benchmark system), on the other hand, I get this:

>>> import time
>>> import numpy as np
>>> def func(a, b, inc, iter):
...   t0 = time.time()
...   for i in range(1000):
...     A = np.arange(a, b, inc)
...   print(time.time()-t0, " seconds elapsed")
...
>>> func(0,10000,0.1,1000)
(0.5556738376617432, ' seconds elapsed')

and Julia:

julia> [ f(0,10000,0.1,1000) for f=[func1,func2,func3] ]
3-element Array{Any,1}:
 1.06376
 1.06842
 0.564819

julia> versioninfo()
Julia Version 0.2.0-rc4+2
Commit 249fdca (2013-11-11 20:53 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

So this may be an OS X vs. Linux thing to some extent. @sungpily, @skariel – what operating systems are you on?

@sungpily (Author)

My results are very consistent with Stefan's. I am using a Windows 7 machine.

@sungpily (Author)

By the way, thanks for reformatting the original post. I could not figure out how.

@sungpily (Author)

IMHO, func3 looks "hackish" even though it performs better. I would be happy to see func1 and func2 work as fast as Python or MATLAB.

@StefanKarpinski (Member)

Yeah, the point is not that people should use that version. We're trying to figure out which versions are faster or slower, and why, so that they can all be fast.

@skariel (Contributor) commented Nov 12, 2013

This version is orders of magnitude faster than func3 above:

function func6(a, inc, b)
    A = Array(Float64, length(a:inc:b))
    for i=1:1000
        ix=1
        while a<b
            A[ix] = a
            ix+=1
            a+=inc
        end
    end
end

If you allocate on each iteration, it is still significantly faster than func3 above.

I also compared to C++; code is here. I'm on Win7, using Julia-RC3-amd64, Python 3.3, and the latest NumPy (all amd64).

A couple of interesting observations:

  • When not allocating and de-allocating on each iteration (i.e. Julia's func3 above), Julia is as fast as C (when using the same algorithm)
  • When only allocating and de-allocating, but not filling in the range, Julia is again as fast as C
  • When allocating/de-allocating on each iteration, NumPy beats func6 above by a factor of ~2 (0.1 vs 0.22 secs)

So this seems like a GC issue to me, where Julia is using colder memory (the GC has not always had time to recycle recently used memory, so each allocation hands back fresh, cold memory).

The only fix I think is appropriate is to optimize the range notation [a:inc:b] to use an algorithm like func6 above. The hot/cold memory issue is really a non-issue, since if you need hot memory you just don't de-allocate it.
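
One way to isolate the GC from the measurement (a minimal sketch using the gc_disable()/gc_enable() functions of early Julia):

gc_disable()               # stop the collector so the timed loop sees no GC pauses
func1(0, 10000, 0.1, 1000)
gc_enable()                # re-enable collection
gc()                       # and run a full collection afterwards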

About using integer types and changing them to Float64 - I deleted that post from the forums about a minute after writing it since I timed the wrong function. It still got mailed, though... sorry about that.

@kmsquire (Member)

> About using integer types and changing them to Float64 - I deleted that post from the forums about a minute after writing it since I timed the wrong function. It still got mailed, though... sorry about that.

No worries, and thanks for your tests!

@StefanKarpinski (Member)

Yes, thanks for the excellent analysis. Now we just need to make it do the fast thing :-)

@timholy (Member) commented Nov 12, 2013

@skariel, in func6 I think you need to reset the value of a on each iteration.
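
For example, copying a into a local that is reset on every pass (a minimal sketch in the 0.2-era syntax used above):

function func6(a, inc, b)
    A = Array(Float64, length(a:inc:b))
    for i=1:1000
        x = a          # reset the running value each pass
        ix = 1
        while x < b
            A[ix] = x
            ix += 1
            x += inc
        end
    end
end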

@skariel (Contributor) commented Nov 12, 2013

Oh man.... This is embarrassing.... I'll just maintain a low profile for the next few months.... :(

@timholy (Member) commented Nov 12, 2013

@skariel, please don't! We need people who care about performance and have the ability to do the deep digging.

I remember that one of my earliest questions to the Julia community was "I see there's an isless() function, don't we need an isgreater() function?" Stefan very gently pointed out that you can implement isgreater(x,y) as isless(y,x). By that standard, you're doing just fine :-).

Hope to see you around a lot more!

@kmsquire (Member) commented Nov 12, 2013

I'll second that--I really appreciate that you're taking the time to explore this so deeply.

Kevin

@quinnj (Member) commented Aug 21, 2014

Here's an update on this issue with a gist for all my code.
A few thoughts on the results:

  • vcat(::FloatRange) with no GC vs. devectorizing: results are about the same, with devec a little faster
  • vcat(::StepRange) with no GC vs. devectorizing: devec is significantly faster (the fastest result overall; see the sketch after this list)
  • vcat on a FloatRange or StepRange (i.e. func1) is the only method impacted by GC
  • ::StepRanges seem to get killed by the A[:] = range method; not sure why, but I'll look into it more
  • On the other hand, the best overall performance for FloatRanges comes from the A[:] = range technique
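
For concreteness, "devectorizing" here means filling the output with an explicit loop instead of calling vcat; a minimal sketch (the helper name is illustrative, not taken from the gist):

function devec_fill!(A, r)
    @inbounds for i = 1:length(r)
        A[i] = r[i]    # copy each range element into the preallocated array
    end
    return A
end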

A few other meta-thoughts:

  • These results are certainly affected by the rangepocalypse, particularly the introduction of FloatRange and its more specialized construction behavior
  • It's also important to remember that Matlab usually runs multi-threaded, and it seems even NumPy's range-to-array code is multithreaded (plus, in that case we're really comparing against C code, not Python)

@StefanKarpinski (Member)

I think that we'll have to settle for matching single-threaded performance of other systems for now – so single-threaded C is the benchmark to compare against. Once we have threading in Julia, we can revisit this and make sure that we're as fast as other systems that use threads to speed this up.

@JeffBezanson (Member)

Small update on this. It seems we are doing a bit better now. Here are the times I get for func2 and func3 on three Julia versions (func1 and func2 time identically):

# 0.3:
elapsed time: 1.223933704 seconds
elapsed time: 0.575497855 seconds

# 0.4:
elapsed time: 0.565238758 seconds
elapsed time: 1.041991829 seconds

# 0.5:
elapsed time: 0.484999175 seconds
elapsed time: 0.236237736 seconds

@KristofferC (Member)

# 0.6:
elapsed time: 0.00059386 seconds
elapsed time: 0.328283315 seconds

@vtjnash (Member) commented May 26, 2017

func1 and func2 need a trailing ; to do the same measurement ([a:inc:b;])

@mbauman (Member) commented May 26, 2017

Yeah, we're still a factor of 4 slower than NumPy with func3 (in-place) and a factor of 7 with func2 on my machine. It's worth noting, though, that our float ranges are doing a whole lot more work (by default) than NumPy's… and this is a deliberate decision and trade-off.

If we use a lower precision range type, we're doing much better:

julia> begin
       function func4(a, b, inc, iter)
           A = Array{Float64}(length(a:inc:b))
           tic()
           for i=1:iter
               A = [StepRangeLen(a,inc,b*10+1);]
           end
           toc()
       end
       function func5(a, b, inc, iter)
           A = Array{Float64}(length(a:inc:b))
           tic()
           for i=1:iter
               A[:] = StepRangeLen(a,inc,b*10+1)
           end
           toc()
       end
       end

julia> func4(0,10000,.1,1000)
       func5(0,10000,.1,1000);
elapsed time: 0.523196442 seconds
elapsed time: 0.116625565 seconds

The Python definition on my machine takes 0.144 seconds.

@heitorPB commented May 7, 2019

I think things have improved since this issue was opened. Here are the results on my laptop:

function func1(a, b, inc, iter)
    for i=1:iter
        A = [a:inc:b]
    end
end

function func2(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A = [a:inc:b]
    end
end

function func3(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A[:] = a:inc:b
    end
end
function func4(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A = [StepRangeLen(a,inc,b*10+1);]
    end
end
function func5(a, b, inc, iter)
    A = Array{Float64}(undef, length(a:inc:b))
    for i=1:iter
        A[:] = StepRangeLen(a,inc,b*10+1)
    end
end

@time func1(0,10000,.1,1000)
@time func2(0,10000,.1,1000)
@time func3(0,10000,.1,1000)
@time func4(0,10000,.1,1000)
@time func5(0,10000,.1,1000)

println("")
@time func1(0,10000,.1,1000)
@time func2(0,10000,.1,1000)
@time func3(0,10000,.1,1000)
@time func4(0,10000,.1,1000)
@time func5(0,10000,.1,1000)
$ julia a.jl 
  0.119092 seconds (397.67 k allocations: 21.202 MiB)
  0.009710 seconds (18.41 k allocations: 1.809 MiB)
  0.360205 seconds (90.05 k allocations: 5.279 MiB)
  0.236766 seconds (96.16 k allocations: 768.615 MiB, 8.72% gc time)
  0.136129 seconds (83.31 k allocations: 4.851 MiB)

  0.000187 seconds (1.00 k allocations: 125.156 KiB)
  0.000293 seconds (1.01 k allocations: 906.547 KiB)
  0.332319 seconds (6 allocations: 781.547 KiB)
  0.272638 seconds (2.01 k allocations: 763.840 MiB, 10.49% gc time)
  0.104920 seconds (6 allocations: 781.547 KiB)

@mbauman (Member) commented May 7, 2019

What Julia means by [a:inc:b] has changed: it used to do concatenation, but now it just puts the range itself inside a one-element array. You need to add a ; to make it do what it used to. And I think your laptop is faster than whatever was used in the original post; we've actually (intentionally) regressed a bit since 0.5.
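
For example (REPL output from a 1.x Julia):

julia> r = 0:0.1:10000;

julia> length([r])    # one element: the range object itself
1

julia> length([r;])   # the ; concatenates, materializing a Vector{Float64}
100001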

In my tests, Julia 1.3-dev is still at roughly the same speed as version 0.6. Here's a slight update to the tests using BenchmarkTools; I think there's also some adversarial compiler shenanigans going on in both languages.

julia> using BenchmarkTools

julia> a, b, inc = 0,10000,.1;

julia> A = Array{Float64}(undef, length(a:inc:b));

julia> @btime [$a:$inc:$b;];
  260.461 μs (2 allocations: 781.39 KiB)

julia> @btime $A[:] = $a:$inc:$b;
  287.677 μs (0 allocations: 0 bytes)

julia> @btime $A .= $a:$inc:$b;
  253.028 μs (0 allocations: 0 bytes)

julia> @btime [StepRangeLen($a,$inc,$b*10+1);];
  84.170 μs (2 allocations: 781.39 KiB)

julia> @btime $A[:] = StepRangeLen($a,$inc,$b*10+1);
  112.257 μs (0 allocations: 0 bytes)

julia> @btime $A .= StepRangeLen($a,$inc,$b*10+1);
  26.295 μs (0 allocations: 0 bytes)

Compare this to numpy:

julia> using PyCall

julia> @pyimport numpy

julia> @btime pycall(numpy.arange, PyObject, $a, $b, $inc);
  86.354 μs (9 allocations: 192 bytes)

So we're around a factor of 3-4 away for the colon construction, but the StepRangeLen construction is about the same. On its face this makes sense, as NumPy does roughly the same calculation as the latter, whereas the former uses higher-precision arithmetic. Despite the threading macros in the NumPy source, it doesn't appear to be doing any multithreading.
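
To see the precision trade-off concretely (this assumes NumPy fills elements as start + i*step in plain Float64, an approximation of its actual fill loop):

julia> 0.1 + 2*0.1              # naive start + i*step picks up rounding error
0.30000000000000004

julia> (0.1:0.1:0.3)[3]         # the range's extended-precision arithmetic lands exactly
0.3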

@JeffBezanson (Member)

Since we intentionally use higher precision for float ranges, maybe this should be closed?

timholy closed this as completed on Oct 2, 2019.