Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

This patch introduces erf, erfc, lgamma, tgamma and sinpi. #23

Merged
merged 4 commits into from
May 15, 2017

Conversation

shibatch
Copy link
Owner

@shibatch shibatch commented May 1, 2017

sinpi is just a by-product. It also fixes the following minor bugs.

  • TRIGRANGEMAX3 should be used in sleefld.c and sleefqp.c
  • fma could return inf if the true return value is above 1e+303, instead of 1e+304.

…ust a by-product.

It also fixes the following minor bugs.
* TRIGRANGEMAX3 should be used in sleefld.c and sleefqp.c
* fma could return inf if the true return value is above 1e+303, instead of 1e+304.
@@ -350,6 +351,28 @@ static INLINE vdouble vsel_vd_vo_vd_vd(vopmask mask, vdouble x, vdouble y) {
return vbslq_f64(vreinterpretq_u64_u32(mask), x, y);
}

static INLINE CONST vdouble vsel_vd_vo_d_d(vopmask o, double d0, double d1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you use the same implementation as the one in the AVX target here? It seems very expensive to me to create the idx vector each time the function is called.

Copy link
Owner Author

@shibatch shibatch May 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idx vector is common among the same series of coefficients. So, it is generated only once. You can see the assembly output from gcc-6.3.0 via the following link.

https://www.dropbox.com/s/n71fn42cscgzlbs/sleefsimddp.advsimd.s.txt?dl=0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assembly file is too long to check. :)

If I understood correctly, the idx is generated only once because the function is inlined. Is that the case? Even so, the generic implementation in the AVX header file seems to need less instructions. Am I missing something?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's "common subexpression elimination." All modern compilers remove redundant codes for calculating the same results.

https://en.wikipedia.org/wiki/Common_subexpression_elimination

Copy link
Owner Author

@shibatch shibatch May 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do benchmarking to get the definitive answer as to which implementation is faster, but the generic implementation for choosing between 4 values requires 4 loads into 4 different registers and three blending instructions. The tbl implementation requires two loads and one tbl instruction, in addition to two adrp instructions.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took benchmark of both implementation. It looks the generic implementation is around 10% faster. The following values show execution time of each function in micro second.

TBL
Sleef_lgammad2_u10advsimd, 0.553044
Sleef_tgammad2_u10advsimd, 0.572347
Sleef_erfd2_u10advsimd, 0.21031
Sleef_erfcd2_u15advsimd, 0.286495

Generic
Sleef_lgammad2_u10advsimd, 0.50789
Sleef_tgammad2_u10advsimd, 0.519268
Sleef_erfd2_u10advsimd, 0.187486
Sleef_erfcd2_u15advsimd, 0.256248

return (vdouble) vqtbl1q_u8(tab, idx);
}

static INLINE vdouble vsel_vd_vo_vo_vo_d_d_d_d(vopmask o0, vopmask o1, vopmask o2, double d0, double d1, double d2, double d3) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sam here - I think this is not optimal. Isn't the AVX code better?

t = vsel_vd_vo_vd_vd(o2, vrec_vd_vd(x.x), ddnormalize_vd2_vd2(ddadd2_vd2_vd2_vd(x, vsel_vd_vo_d_d(o0, -1, -2))).x);

u = vsel_vd_vo_vo_d_d_d(o2, o0, -156.801412704022726379848862, +0.2947916772827614196e+2, +0.7074816000864609279e-7);
u = vmla_vd_vd_vd_vd(u, t, vsel_vd_vo_vo_d_d_d(o2, o0, +1.120804464289911606838558160000, +0.1281459691827820109e+3, +0.4009244333008730443e-6));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be expensive on aarch64 expensive (lines 1950 to 1971). It would be good to write a #ifdef indexed_vmla version for those architectures that can run the code using less loads by means of using indexed mla instructions (see e.g. vfmad_lane_f64).

I am fine for you to leave the code as it is for now, but please a TODO comment in the source about the potential optimization.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean using a gather instruction to choose a value from four values? Both SVE and AVX2 have permutation instructions, which should be less expensive than a gather instruction. So I think that's pointless.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, a large part of the computation time is actually the latency of FMA. My intention for this part is to hide the time taken by load and permutation in this latency. Load is actually pretty fast. Permutation takes time if there is no hardware support.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are having concern about SVE with 128 bit vector length, I would introduce a conditional branch that checks the vector length. If the vector length is 128 bit, the generic implementation should work well. If the vector length is 256 bit or longer, then TBL implementation would be good. For each FMA, one 256 bit load and one TBL are required.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is easier to understand what I mean if I show the code snippet. The generated AVX2 assembly code of the corresponding part is as follows.

    vpermd  .LC20(%rip), %ymm6, %ymm2
    vfmadd132pd     %ymm0, %ymm2, %ymm1
    vpermd  .LC21(%rip), %ymm6, %ymm2
    vfmadd231pd     %ymm0, %ymm1, %ymm2
    vpermd  .LC22(%rip), %ymm6, %ymm1
    vfmadd132pd     %ymm0, %ymm1, %ymm2
    vpermd  .LC23(%rip), %ymm6, %ymm1
    vfmadd231pd     %ymm0, %ymm2, %ymm1
    vpermd  .LC24(%rip), %ymm6, %ymm2
    vfmadd132pd     %ymm0, %ymm2, %ymm1

However, my benchmarking result shows that the latency of vpermd is not completely hidden in the latency of vfmadd. The overall computation speed is not bad compared to SVML, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks goot, it is just that there is a potential optimization here for AArch64 because we have vector-scalar FMA instruction, that allows loading the coefficient in vector registers and use fewer vsel operations. All I am asking is to add a TODO comment saing

/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */

u = vsel_vd_vo_vd_vd(o0, vmul_vd_vd_vd(a, a), a);

t = vsel_vd_vo_vo_d_d_d(o0, o1, +0.6801072401395392157e-20, +0.2830954522087717660e-13, -0.5846750404269610493e-17);
t = vmla_vd_vd_vd_vd(t, u, vsel_vd_vo_vo_d_d_d(o0, o1, -0.2161766247570056391e-18, -0.1509491946179481940e-11, +0.6076691048812607898e-15));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, please add a TODO comment about the potential optimization that can be done for architectures with indexed mla instructions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */


u = vsel_vd2_vo_vd2_vd2(o0, ddmul_vd2_vd_vd(a, a), vsel_vd2_vo_vd2_vd2(o1, vcast_vd2_vd_vd(a, vcast_vd_d(0)), dddiv_vd2_vd2_vd2(vcast_vd2_d_d(1, 0), vcast_vd2_vd_vd(a, vcast_vd_d(0)))));

t = vsel_vd_vo_vo_vo_d_d_d_d(o0, o1, o2, +0.6801072401395386139e-20, +0.3438010341362585303e-12, -0.5757819536420710449e+2, +0.2334249729638701319e+5);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, please use add a TOD comment for the potential indexed mla optimization.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */

@@ -130,11 +130,11 @@ static INLINE vfloat vsqrt_vf_vf(vfloat d) { return vsqrtq_f32(d); }

// Multiply accumulate: z = z + x * y
static INLINE vfloat vmla_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) {
return vmlaq_f32(z, x, y);
return vfmaq_f32(z, x, y);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason behind this change?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can safely assume that FMA is the faster than any other combination of multiplication and addition. I saw the assembly output from the compiler and vmlaq is converted into multiplication and addition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Strange. Could you please provide a minimal example that shows what seems to be an inconsistent behavior? You might have found a bug in gcc.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiply-accumulate instructions are all fused in aarch64, so this is not a bug.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, I was just asking you to provide an example of code that generated multiplication and addition (separated) from the vmlaq_f32 intrinsics.

No worries if you don't have time.

}
// Multiply subtract: z = z = x * y
static INLINE vfloat vmlanp_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) {
return vmlsq_f32(z, x, y);
return vfmsq_f32(z, x, y);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

shibatch added 2 commits May 7, 2017 16:12
…ic implementation is faster on aarch64. So, the generic implementation is now default on aarch64.

* This patch fixes the issue of exp(1.0) not correctly rounded, only when FMA is available.
* This patch also includes small performance optimization for expk and expk2 functions.
…F, which is probably faster than the generic implementation.
t = vsel_vd_vo_vd_vd(o2, vrec_vd_vd(x.x), ddnormalize_vd2_vd2(ddadd2_vd2_vd2_vd(x, vsel_vd_vo_d_d(o0, -1, -2))).x);

u = vsel_vd_vo_vo_d_d_d(o2, o0, -156.801412704022726379848862, +0.2947916772827614196e+2, +0.7074816000864609279e-7);
u = vmla_vd_vd_vd_vd(u, t, vsel_vd_vo_vo_d_d_d(o2, o0, +1.120804464289911606838558160000, +0.1281459691827820109e+3, +0.4009244333008730443e-6));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks goot, it is just that there is a potential optimization here for AArch64 because we have vector-scalar FMA instruction, that allows loading the coefficient in vector registers and use fewer vsel operations. All I am asking is to add a TODO comment saing

/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */

u = vsel_vd_vo_vd_vd(o0, vmul_vd_vd_vd(a, a), a);

t = vsel_vd_vo_vo_d_d_d(o0, o1, +0.6801072401395392157e-20, +0.2830954522087717660e-13, -0.5846750404269610493e-17);
t = vmla_vd_vd_vd_vd(t, u, vsel_vd_vo_vo_d_d_d(o0, o1, -0.2161766247570056391e-18, -0.1509491946179481940e-11, +0.6076691048812607898e-15));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */


u = vsel_vd2_vo_vd2_vd2(o0, ddmul_vd2_vd_vd(a, a), vsel_vd2_vo_vd2_vd2(o1, vcast_vd2_vd_vd(a, vcast_vd_d(0)), dddiv_vd2_vd2_vd2(vcast_vd2_d_d(1, 0), vcast_vd2_vd_vd(a, vcast_vd_d(0)))));

t = vsel_vd_vo_vo_vo_d_d_d_d(o0, o1, o2, +0.6801072401395386139e-20, +0.3438010341362585303e-12, -0.5757819536420710449e+2, +0.2334249729638701319e+5);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */

@shibatch shibatch merged commit c28e257 into master May 15, 2017
@shibatch shibatch deleted the Introducing_erf_and_gamma branch May 15, 2017 10:10
@avmgithub avmgithub mentioned this pull request May 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants