-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
~10% speedup #1
~10% speedup #1
Conversation
Thanks for the commit, i will try them on earlier cuda cards and on the testnet... before commiting PS: the testnet is not "testable" as-is, there is a line to comment in ccminer to allow the empty version (data[0]) |
So, the first review is fine... with mixed results on linux regarding generations, but works well SM 3.0, 5 and 5.2. I ask decred team, if its me which should merge that ;) |
Good. Will need tuning on 750. Maybe just removing the GS4 calls on < 5.2. I only have 970s. |
Hi @pallas1 . That's a nice work there. I'd recommend to use a return statement inside the final if condition just like i did on blake8-rounds. https://github.com/alexis78/ccminer/blob/windows/Algo256/blake256.cu#L287 This addition may sound dummy, but as you'll notice it's not dummy at all.
On the other side, register usage of the kernel is increasing. |
Thanks, I will try that! |
Waiting for @jolan 's comment because I think he is the only one with GPUs. |
I have some 750 cards (Ubuntu 14.04 and CUDA 7.5). I see an increase (which is great, thanks!) but it's not that large with my setup - from about 461 to 462.5 MH/s. If I can provide any other info or help, I'd be glad to. |
The best improvements i've seen are on the GTX 9xx on linux, but indeed that's uncommon the 750ti doesn't profit... |
I had reports of both the 750 and the windows build to be slower. Unfortunately I do not have a 750 nor a windows setup ready for testing and tuning. I suspect it has to do with the "aggregated" GS4 calls, please try replacing them with standard blake rounds. If that helps, we might remove the GS4 calls altogether, or add some precompiler directives to select the best method based on operating system and card. |
I'm not sure how to do that. If you can explain it to me, or send me the change, I can recompile and test. |
replace this function: __device__ __forceinline__
uint32_t blake256_compress_14(uint32_t *m, uint32_t *v_init, uint32_t d_data6, uint32_t d_data7)
{
uint32_t v[16];
#pragma unroll
for (uint32_t i = 0; i < 16; i++) v[i] = v_init[i];
// these two are not modified:
v[ 9] = 0x85A308D3;
v[13] = 0x299F31D0 ^ (180U*8U);
// round 1 with nonce
GSPREC(1, 5, 0x9, 0xD, 2, 3);
GSPREC(0, 5, 0xA, 0xF, 8, 9);
GSPREC(1, 6, 0xB, 0xC, 10, 11);
GSPREC(2, 7, 0x8, 0xD, 12, 13);
GSPREC(3, 4, 0x9, 0xE, 14, 15);
// round 2
GSPREC(0, 4, 0x8, 0xC, 14, 10);
GSPREC(1, 5, 0x9, 0xD, 4, 8);
GSPREC(2, 6, 0xA, 0xE, 9, 15);
GSPREC(3, 7, 0xB, 0xF, 13, 6);
GSPREC(0, 5, 0xA, 0xF, 1, 12);
GSPREC(1, 6, 0xB, 0xC, 0, 2);
GSPREC(2, 7, 0x8, 0xD, 11, 7);
GSPREC(3, 4, 0x9, 0xE, 5, 3);
// round 3
GSPREC(0, 4, 0x8, 0xC, 11, 8);
GSPREC(1, 5, 0x9, 0xD, 12, 0);
GSPREC(2, 6, 0xA, 0xE, 5, 2);
GSPREC(3, 7, 0xB, 0xF, 15, 13);
GSPREC(0, 5, 0xA, 0xF, 10, 14);
GSPREC(1, 6, 0xB, 0xC, 3, 6);
GSPREC(2, 7, 0x8, 0xD, 7, 1);
GSPREC(3, 4, 0x9, 0xE, 9, 4);
// round 4
GSPREC(0, 4, 0x8, 0xC, 7, 9);
GSPREC(1, 5, 0x9, 0xD, 3, 1);
GSPREC(2, 6, 0xA, 0xE, 13, 12);
GSPREC(3, 7, 0xB, 0xF, 11, 14);
GSPREC(0, 5, 0xA, 0xF, 2, 6);
GSPREC(1, 6, 0xB, 0xC, 5, 10);
GSPREC(2, 7, 0x8, 0xD, 4, 0);
GSPREC(3, 4, 0x9, 0xE, 15, 8);
// round 5
GSPREC(0, 4, 0x8, 0xC, 9, 0);
GSPREC(1, 5, 0x9, 0xD, 5, 7);
GSPREC(2, 6, 0xA, 0xE, 2, 4);
GSPREC(3, 7, 0xB, 0xF, 10, 15);
GSPREC(0, 5, 0xA, 0xF, 14, 1);
GSPREC(1, 6, 0xB, 0xC, 11, 12);
GSPREC(2, 7, 0x8, 0xD, 6, 8);
GSPREC(3, 4, 0x9, 0xE, 3, 13);
// round 6
GSPREC(0, 4, 0x8, 0xC, 2, 12);
GSPREC(1, 5, 0x9, 0xD, 6, 10);
GSPREC(2, 6, 0xA, 0xE, 0, 11);
GSPREC(3, 7, 0xB, 0xF, 8, 3);
GSPREC(0, 5, 0xA, 0xF, 4, 13);
GSPREC(1, 6, 0xB, 0xC, 7, 5);
GSPREC(2, 7, 0x8, 0xD, 15,14);
GSPREC(3, 4, 0x9, 0xE, 1, 9);
// round 7
GSPREC(0, 4, 0x8, 0xC, 12, 5);
GSPREC(1, 5, 0x9, 0xD, 1, 15);
GSPREC(2, 6, 0xA, 0xE, 14,13);
GSPREC(3, 7, 0xB, 0xF, 4, 10);
GSPREC(0, 5, 0xA, 0xF, 0, 7);
GSPREC(1, 6, 0xB, 0xC, 6, 3);
GSPREC(2, 7, 0x8, 0xD, 9, 2);
GSPREC(3, 4, 0x9, 0xE, 8, 11);
// round 8
GSPREC(0, 4, 0x8, 0xC, 13,11);
GSPREC(1, 5, 0x9, 0xD, 7, 14);
GSPREC(2, 6, 0xA, 0xE, 12, 1);
GSPREC(3, 7, 0xB, 0xF, 3, 9);
GSPREC(0, 5, 0xA, 0xF, 5, 0);
GSPREC(1, 6, 0xB, 0xC, 15, 4);
GSPREC(2, 7, 0x8, 0xD, 8, 6);
GSPREC(3, 4, 0x9, 0xE, 2, 10);
// round 9
GSPREC(0, 4, 0x8, 0xC, 6, 15);
GSPREC(1, 5, 0x9, 0xD, 14, 9);
GSPREC(2, 6, 0xA, 0xE, 11, 3);
GSPREC(3, 7, 0xB, 0xF, 0, 8);
GSPREC(0, 5, 0xA, 0xF, 12, 2);
GSPREC(1, 6, 0xB, 0xC, 13, 7);
GSPREC(2, 7, 0x8, 0xD, 1, 4);
GSPREC(3, 4, 0x9, 0xE, 10, 5);
// round 10
GSPREC(0, 4, 0x8, 0xC, 10, 2);
GSPREC(1, 5, 0x9, 0xD, 8, 4);
GSPREC(2, 6, 0xA, 0xE, 7, 6);
GSPREC(3, 7, 0xB, 0xF, 1, 5);
GSPREC(0, 5, 0xA, 0xF, 15,11);
GSPREC(1, 6, 0xB, 0xC, 9, 14);
GSPREC(2, 7, 0x8, 0xD, 3, 12);
GSPREC(3, 4, 0x9, 0xE, 13, 0);
// round 11
GSPREC(0, 4, 0x8, 0xC, 0, 1);
GSPREC(1, 5, 0x9, 0xD, 2, 3);
GSPREC(2, 6, 0xA, 0xE, 4, 5);
GSPREC(3, 7, 0xB, 0xF, 6, 7);
GSPREC(0, 5, 0xA, 0xF, 8, 9);
GSPREC(1, 6, 0xB, 0xC, 10,11);
GSPREC(2, 7, 0x8, 0xD, 12,13);
GSPREC(3, 4, 0x9, 0xE, 14,15);
// round 12
GSPREC(0, 4, 0x8, 0xC, 14,10);
GSPREC(1, 5, 0x9, 0xD, 4, 8);
GSPREC(2, 6, 0xA, 0xE, 9, 15);
GSPREC(3, 7, 0xB, 0xF, 13, 6);
GSPREC(0, 5, 0xA, 0xF, 1, 12);
GSPREC(1, 6, 0xB, 0xC, 0, 2);
GSPREC(2, 7, 0x8, 0xD, 11, 7);
GSPREC(3, 4, 0x9, 0xE, 5, 3);
// round 13
GSPREC(0, 4, 0x8, 0xC, 11, 8);
GSPREC(1, 5, 0x9, 0xD, 12, 0);
GSPREC(2, 6, 0xA, 0xE, 5, 2);
GSPREC(3, 7, 0xB, 0xF, 15,13);
GSPREC(0, 5, 0xA, 0xF, 10,14);
GSPREC(1, 6, 0xB, 0xC, 3, 6);
GSPREC(2, 7, 0x8, 0xD, 7, 1);
GSPREC(3, 4, 0x9, 0xE, 9, 4);
// round 14
GSPREC(0, 4, 0x8, 0xC, 7, 9);
GSPREC(1, 5, 0x9, 0xD, 3, 1);
GSPREC(2, 6, 0xA, 0xE, 13,12);
GSPREC(3, 7, 0xB, 0xF, 11,14);
GSPREC(0, 5, 0xA, 0xF, 2, 6);
GSPREC(2, 7, 0x8, 0xD, 4, 0);
if ((d_data7 ^ v[7] ^ v[15]) == 0) {
GSPREC(1, 6, 0xB, 0xC, 5, 10);
GSPREC(3, 4, 0x9, 0xE, 15, 8);
return (d_data6 ^ v[6] ^ v[14]);
}
return UINT32_MAX;
} |
nope, that doesnt improve... but the vanilla "return;" does |
I tried the "return" (also "break"), and it hashed slower on my rig. |
read or pastebin the ptx obtained by cuobjdump . "break;" wont work |
I don't see any difference going back to the standard blake rounds |
Try playing with TPB and NPR. They are defined at the beginning. |
i tried too, but return; improve from 520 to 540MH see c0fca5c |
Any tips on what types of values to try for TPB and NPR? Stay in powers of two? Multiples of the number of cores? |
TPB try 256, NPR power of two |
I tried a bunch of different settings and here's what I got: TPB 256, NPR 64/128/256 all gave about 456 MH/s TPB 512 NPR 128 w return line added from c0fca5c gave 463 MH/s So not much difference either way, seems like your initial TPB/NPR settings were optimal, and the return line made another small improvement. @tpruvot are you getting 540 MH/s on a 750 or something else? If you are, then perhaps I'm doing something else wrong to have such a low starting point compared to you and not seeing any noticeable improvements, which would invalidate my tests. |
750 Ti Black Edition, a bit faster one, the important is the +20MHs
NVIDIA-SMI 352.79
|
@tpruvot right, I know the change is what's important, but I'm not seeing that much of a change... strange. My cards are stock 750TI from EVGA @alexis78 In my first post I mentioned that I'm using CUDA 7.5 on Ubuntu 14.04. from nvidia-smi:
|
For the record, my ~10% improvement is measured at fixed power usage, attained using a low TDP settings on nvidia-smi |
Sorry, I double checked and I hadn't actually pulled down the file from c0fca5c - with that version, I'm getting 481MH/s from 463MH/s. That's 18MH/s. |
done |
Untested on ARCH < 5.2