Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libsleef] Add modified Payne Hanek argument reduction #197

Merged
merged 15 commits into from
Jul 6, 2018

Conversation

shibatch
Copy link
Owner

@shibatch shibatch commented Jun 19, 2018

This patch adds a modified Payne Hanek argument reduction that can be used for very large arguments.

Payne Hanek reduction algorithm can handle very large arguments by table look-ups. In this patch, a vectorized version of the algorithm is implemented. The argument range for DP and SP trig functions will become [-1e+299, 1e+299] and [-1e+28, 1e+28], respectively. In order to avoid using 64 bit or 128 bit multiplication, the algorithm is modified to use DD computation. Gather instructions are used for table look-ups. The reduction subroutine is tested to confirm that it correctly handle the worst case with 6381956970095103.0 * 2.0^797.

The traditional implementation can be seen in the following gist.
https://gist.github.com/simonbyrne/d640ac1c1db3e1774cf2d405049beef3

@shibatch shibatch requested a review from fpetrogalli June 19, 2018 16:21
Copy link
Collaborator

@fpetrogalli fpetrogalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shibatch Is there any performance improvements with this?

@shibatch
Copy link
Owner Author

I haven't taken a benchmark, but it depends on how fast gather instructions are. On Skylake, the latency and throughput of a gather instruction is 20 and 4, so executing 4 gather instructions take (at least) 20 + 4 * 3 = 32 clocks, which is already pretty slow.

The execution time should not matter in practice, since the old reduction algorithms are still enabled. I would say there is something wrong if someone is calling a trig function with a very large argument. The point is that when adopting SLEEF to some project, that person wants to feel safe to replace the existing calls to the standard math functions with SLEEF functions.

@fpetrogalli
Copy link
Collaborator

@shibatch

I haven't taken a benchmark, but it depends on how fast gather instructions are. On Skylake, the latency and throughput of a gather instruction is 20 and 4, so executing 4 gather instructions take (at least) 20 + 4 * 3 = 32 clocks, which is already pretty slow.

But isn't this patch using gather instructions? If that's the case, it seems to be not a good idea given that they are slow.

The execution time should not matter in practice, since the old reduction algorithms are still enabled.

I think that execution time is a very important constraint.

I would say there is something wrong if someone is calling a trig function with a very large argument.

You are probably right, but penalizing those users that compute trigonometric functions on proper values with en extra vector test instruction doesn't seem fair.

The point is that when adopting SLEEF to some project, that person wants to feel safe to replace the existing calls to the standard math functions with SLEEF functions.

I agree. But I think that we should provide them via separate functions, not merging them in the same function with that if statement on the "test all one" function (this would of course add another degree of freedom for the run-time system, if we ever come up with it).

@shibatch
Copy link
Owner Author

But isn't this patch using gather instructions? If that's the case, it seems to be not a good idea given that they are slow.

Payne-Henek algorithm is a slow algorithm, compared to Cody-Waite.

You are probably right, but penalizing those users that compute trigonometric functions on proper values with en extra vector test instruction doesn't seem fair.

CPUs with out-of-order execution and branch prediction should have only small performance impact with this. We have to think the critical path to estimate the execution time, rather than simply summing up the execution time of all the instructions. Since the extra vector test instructions are not in the critical path, they don't affect the overall execution time so much.

@shibatch
Copy link
Owner Author

shibatch commented Jun 21, 2018

Here is the benchmark.
This is the results with the latest patch.

DP graph
SP graph

@shibatch
Copy link
Owner Author

shibatch commented Jun 21, 2018

As a workaround of the bug in armclang, I changed the definition of INLINE macro and it does not insert always_inline attribute when SLEEF is compiled for SVE target. armclang still aggressively inlines functions, so I think this is not a problem. Please check the assembly output from the compiler.

https://github.com/shibatch/sleef/wiki/197/sleefsimddp.s

Testing for SVE target is now enabled.

@shibatch
Copy link
Owner Author

I moved the fixup code to the reduction part, and now most of the functions are slightly faster than before, if the argument is small. The graphs are updated.

@shibatch shibatch changed the title [libsleef] Add modified Payne Henek argument reduction [libsleef] Add modified Payne Hanek argument reduction Jun 22, 2018
@fpetrogalli
Copy link
Collaborator

Hi @shibatch, a couple of questions before I dig into the code review:

  1. Am I correct thinking that the plots you have reported are for x86? If so, could you please report the same benchmarks for an AArch64 machine?

  2. Would it be possible to support such big range values by improving the reduction algorithms (especially for periodic functions like sin and cosine), instead of specializing the polynomials?

  3. Are you happy about the ~4x slowdown of the 4ULP version of sin/cos/tan (SP and DP) when sampling the input values outside of [0,6.28]?

  4. It is very nice that the functions are faster in [0, 6.28]. Is that due to the effect of using __builtin_expect? If so, what is the maximum range where this effect is still working

  5. How does this compares to SVML or libmvec on x86? Is SLEEF any better due to the changes in this patch? Specifically, can you say that now 4ULP sin is faster than sin in SVML for the new very large input values?

@shibatch
Copy link
Owner Author

Am I correct thinking that the plots you have reported are for x86? If so, could you please report the same benchmarks for an AArch64 machine?

Sure. I will post the results.

Would it be possible to support such big range values by improving the reduction algorithms (especially for periodic functions like sin and cosine), instead of specializing the polynomials?

Reduction algorithms for trig functions is not easy, and there are basically two algorithms. Cody-Waite algorithm has been used since the first version of SLEEF, and Payne-Hanek is the one introduced in this PR. The versions used in SLEEF are both modified versions of these algorithms, but basically it is not easy to further improve the algorithms.

Are you happy about the ~4x slowdown of the 4ULP version of sin/cos/tan (SP and DP) when sampling the input values outside of [0,6.28]?

For the DP versions of the functions, there is actually almost no slowdown, since the that large input domain was not supported. For the 1 ulp SP version, the situation is same, and there is almost no slowdown.

There is some slowdown in 3.5 ulp SP version of the functions, between 125 to 39000, but I don't know if people care about this. For the input domain less that 125, it should be a little faster than before.

It is very nice that the functions are faster in [0, 6.28]. Is that due to the effect of using __builtin_expect? If so, what is the maximum range where this effect is still working

For DP functions, the fastest algorithm is used up to 15, and the second algorithm is used up to 1e+9.

For SP functions, the faster algorithm is used up to 125.

The effect is not only __builtin_expect. It is a combination of __builtin_expect and moving fixup code to the slow reduction routine, and adjusting the polynomial for reduction.

How does this compares to SVML or libmvec on x86? Is SLEEF any better due to the changes in this patch? Specifically, can you say that now 4ULP sin is faster than sin in SVML for the new very large input values?

They are a little slower than SVML for very large arguments. I will post the graph.

@shibatch
Copy link
Owner Author

These are comparison between SVML and SLEEF with this patch on Core i7-6700.
SVML DP graph
SVML SP graph

These graphs are comparison on RK3399.
AArch64 DP graph
AArch64 SP graph

* Benchmarking tool now compiles for aarch64
@shibatch
Copy link
Owner Author

I put back the reduction routine for [125, 39000] range to 3.5 ulp SP functions.
There should be no big slowdown in those functions anymore.

@@ -290,6 +290,12 @@ static INLINE vdouble vloadu_vd_p(const double *ptr) { return _mm256_loadu_pd(pt
static INLINE void vstore_v_p_vd(double *ptr, vdouble v) { _mm256_store_pd(ptr, v); }
static INLINE void vstoreu_v_p_vd(double *ptr, vdouble v) { _mm256_storeu_pd(ptr, v); }

static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this beint a[VECLENSP]?

@@ -477,6 +483,13 @@ static INLINE vfloat vloadu_vf_p(const float *ptr) { return _mm256_loadu_ps(ptr)
static INLINE void vstore_v_p_vf(float *ptr, vfloat v) { _mm256_store_ps(ptr, v); }
static INLINE void vstoreu_v_p_vf(float *ptr, vfloat v) { _mm256_storeu_ps(ptr, v); }

static INLINE vfloat vgather_vf_p_vi2(const float *ptr, vint2 vi2) {
int a[8];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, shouldn't it be int a[VECLENSP]?

@@ -74,6 +74,18 @@ static INLINE void vstoreu_v_p_vf(float *ptr, vfloat v) { ptr[0] = v[0]; ptr[1]

static INLINE void vscatter2_v_p_i_i_vd(double *ptr, int offset, int step, vdouble v) { vstore_v_p_vd((double *)(&ptr[2*offset]), v); }

static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also be int[VECLENDP]?

}

static INLINE vfloat vgather_vf_p_vi2(const float *ptr, vint2 vi2) {
int a[4];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also be int[VECLENSP]?

@@ -267,6 +267,12 @@ static INLINE vdouble vloadu_vd_p(const double *ptr) { return _mm_loadu_pd(ptr);
static INLINE void vstore_v_p_vd(double *ptr, vdouble v) { _mm_store_pd(ptr, v); }
static INLINE void vstoreu_v_p_vd(double *ptr, vdouble v) { _mm_storeu_pd(ptr, v); }

static INLINE vdouble vgather_vd_p_vi(const double *ptr, vint vi) {
int a[4];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int a[VECLENDP];?

}

static float vcast_f_vf(vfloat v) {
float a[64];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int a[svcntw()];

}

static int vcast_i_vi(vint v) {
int a[64];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int a[svcntw()];

}

static int vcast_i_vi2(vint2 v) {
int a[64];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int a[svcntw()];

#ifdef ENABLE_SVE
typedef __sizeless_struct {
vfloat d;
vint2 i;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be vint i; and not vint2 i;?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vint2 is correct here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, sorry

#else
typedef struct {
vfloat d;
vint2 i;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, shouldn't this be vint i;?

@shibatch shibatch merged commit 4d5e75f into master Jul 6, 2018
@shibatch shibatch deleted the Add-payne-hanek-reduction branch July 6, 2018 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants