-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cryptonight variant 2 #4218
Cryptonight variant 2 #4218
Conversation
Can you clarify whether this would brick an existing ASIC or just slow it down? I.e., if an ASIC does exist for the current PoW (Cryptonight Variant 1), would these modifications make the ASIC useless or would it just decrease performance? |
It would definitely brick existing ASICs (if there are any): they don't have integer division and square root and they access memory in 16-byte chunks, not 64-byte chunks. |
And new ASICs will be even more expensive to build and operate. |
New ASICs will also be many times slower because of the nature of these changes. |
Well, that depends on ASIC implementation. We don't know for sure how their internals work. But according to a lot of scientific papers about hardware implementations of division and square root - yes, looks like 16x times slower. |
src/crypto/slow-hash.c
Outdated
} while (0) | ||
|
||
#define VARIANT2_SHUFFLE_SSE2(base_ptr, offset) \ | ||
do if (variant == 2) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make these >= 2, we don't want to revert for v3 without thinking about it first (and v3 will be the same as v2 anyway, there'll be just one day between the two).
src/crypto/slow-hash.c
Outdated
{ \ | ||
VARIANT2_INTEGER_MATH_DIVISION_STEP(b, ptr) \ | ||
const double z = sqrt_input; \ | ||
const int64_t z1 = sqrt(z); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not realize this would actually use floating point. Is it stable across the whole 64 bit range on various platforms ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's stable on any system that supports IEEE-754. Assuming that underlying hardware is IEEE-754 compliant, this code works like this:
- sqrt_input is 48-bit integer, it's converted to double without rounding and loosing precision
- sqrt(z) does rounding, but it's safe here: the result is less than 2^24 and absolute error is less than 2^-28 because double precision is used
- z is then truncated to get z1, but absolute error is too small to make this truncation incorrect
This was the theory, but I also tested this code (both portable and SSE2 versions) and it gives correct results for all integers less than 2^48.
I can publish the testing program so everyone else could confirm it. It's actually easy to write a testing program independently. The definition of integer square root is simple: INT_SQRT(N) is an integer number M such that both inequalities hold: M^2 <= N and (M+1)^2 > N. This number always exists if N >= 0 and there is always only one such number.
P.S. I can rewrite the fallback implementation (the last one in slow-hash.c) to use only integers to calculate the square root. It will become slower though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IEEE-754 behavior isn't required by the C or C++ standard. There is a static check in numeric_limits<T>
for IEEE. I doubt this check fails frequently though.
- sqrt(z) does rounding, but it's safe here: the result is less than 2^24 and absolute error is less than 2^-28 because double precision is used
- z is then truncated to get z1, but absolute error is too small to make this truncation incorrect
I might be missing something, but this feels awfully close to a +/-1 bug with rounding, where the float itself is 3.999999
or 4.00000
. I would need to brush up on my floating point behavior. @hyc ?
EDIT: changed "defined" to "required" above because that was my intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IEEE-754 is not required, but even without it this code is far away to getting incorrect rounding: worst case is when sqrt(2^48 - 1) is calculated: https://onlinegdb.com/r1AmAyzrX
Rounding error is never larger than 1 ULP by definition, and the distance to the next integer is 16 ULP in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick pass at the code semantics.
src/crypto/slow-hash.c
Outdated
const __m128i chunk1 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x10))); \ | ||
const __m128i chunk2 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x20))); \ | ||
const __m128i chunk3 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x30))); \ | ||
_mm_store_si128((__m128i *)((base_ptr) + ((offset) ^ 0x10)), _mm_shufflelo_epi16(_mm_shuffle_epi32(chunk3, _MM_SHUFFLE(2, 0, 3, 1)), _MM_SHUFFLE(3, 1, 2, 0))); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently alters the CPU / GPU performance ratios. The CPU has 64-byte cachelines while most GPUs seem to have 32-byte cachelines. This is also touching more cachelines per iteration, giving a further performance edge to CPUs.
Not arguing for or against this altereration, but I thought it was worth pointing out for those that may be interested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CPU/GPU performance ratio remains pretty much intact, we already have a lot of performance numbers: https://github.com/SChernykh/xmr-stak-cpu#performance
src/crypto/slow-hash.c
Outdated
uint32_t sqrt_result; \ | ||
do if (variant == 2) \ | ||
{ \ | ||
*((uint64_t*)division_result) = 0; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the aliasing violation, just do division_result = {0, 0};
. Detecting multi-wide initialization is trivial for the compiler.
src/crypto/slow-hash.c
Outdated
#define VARIANT2_SHUFFLE_PORTABLE(base_ptr, offset) \ | ||
do if (variant == 2) \ | ||
{ \ | ||
uint32_t* chunk1 = (uint32_t*)((base_ptr) + ((offset) ^ 0x10)); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't portable. This is an aliasing violation that can result in misaligned access errors on some CPUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
base_ptr is actually a scratchpad pointer here. It's allocated by malloc/VirtualAlloc/mmap. It will be 8-byte aligned in the worst case. offset is always a multiple of 16, so misaligned access is not possible here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes, its reliant on the code below. This is still an aliasing violation, but AFAIK will work on our targeted compilers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safe as long as compiled code passes all tests. This PR has test vectors to catch any bug.
Regarding concerns about using floating point to calculate the integer square root: there is only 1 place in the algorithm where a rounding error happens (the sqrt() call itself), but this rounding error is too small to cause any trouble. This code was actually tested and confirmed correct for all 48-bit integers that can possibly be an input to it. |
I see the >> 16 reducing the range, so yes it seems fine. Nevermind my comment then. |
The issue would be portability, which cannot be easily tested. What platforms were tested for this? |
x86 and ARM (64 bit), also OpenCL on AMD and NVIDIA GPUs - all hash the same. And I also added test vectors in this pull request, so buildbot has already tested it on Windows, Linux and MacOS (32 and 64 bit) - no discrepancies. Portability won't be a problem here. It will run the same on any IEEE-754 compliant hardware: C standard is pretty strict when it comes to integer <-> double conversions and sqrt() implementation. So once it's confirmed correct (which it is) on one platform supporting IEEE-754, it will run the same on all platforms with IEEE-754 support. |
Have you measured relative power usage on CPUs and GPUs? The liked document above only discusses speed (H/s) not power (H/W). That's another efficiency factor to consider along with speed. If the power usage doesn't increase much that is good result. If not then some of the hypothesized relative gain against ASICs may illusory (and perhaps suggest looking at some variant that preserves a bit more power efficiency?) Certainly still valid in terms of 'bricking' of existing designs. |
Power increase is minimal. It's a bit higher than the old Cryptonight, but nothing dramatic. I've just asked people to give me numbers: SChernykh/xmr-stak-cpu#1 (comment) |
@iamsmooth I did a quick test on my Radeon RX 560: HWINFO64 reports "GPU chip power" to be 34.2 watts for original Cryptonight at 477 H/S and 38.7 watts for Cryptonight variant 2 at 447 H/S. I can't measure power at the wall - it's usually much larger than what HWINFO64 reports, but the absolute difference in wattage should be similar. |
This requires making sure that the compiler is actually configured to perform floating point in a strictly compliant manner. That (configuration) is not entirely portable, although I guess there aren't too many compilers we are dealing with in practice. Maybe this should be noted in documentation somewhere, for reference of future ports/updates. That said its pretty hard to believe the specific case of integer <-> double conversion would ever change. I would expect more risk on other operations. |
This code only needs 48 bits of precision for integer -> double conversion and sqrt. IEEE-754 guarantees 52 bits. Even if some implementations are not fully compliant and give a bit larger error for sqrt like an error in last bit sometimes (for example, some corners are cut when rounding), it will still be good enough. Edit: I'll add test vectors that go through one of edge cases (N^2-1, N^2, N^2+1) for sqrt in the main loop. I'll first have to make something like a vanity generator, but for test vectors :) This will help catch non-compliant implementations. |
@SChernykh I agree there are not likely to be any problems here on sane platforms. Rather I was replying on the specific point that what the C standard requires of the compiler and what the compiler actually does is not always the same thing. Even when a compiler does implement strict standard compliance (not always), it may require specific options. On that note, I noticed this in https://en.wikipedia.org/wiki/C_data_types
So the assumption of 48/52 bits of precision is not a given. As noted above (or maybe in the commit reviews), IEEE-754 is not strictly required (though is often the case in practice, at least with the right compiler options). Again, a good resolution of this is to clearly note the (new) dependencies in the developer documentation, so anyone porting can be made aware of them. |
So practically & effectively this makes FPGAs 4x slower, making them much worse in terms of So this makes GPUs & CPUs the best thing to mine with, if Monero keeps the strict 6 months fork schedule. |
Yes, it's correct. At least 4 times slowdown compared to Cryptonight v1 for all kinds of ASIC/FPGA. |
@SChernykh would it be possible that you could try to estimate ASIC production cost and real production time if the community would fund you a Bitmain CN 220khs miner (or is there any good take-apart video?)? As they will still be an ordner of magnitude better both in price/hash and much better in power/hash. If Bitmain can tape one out in 1 month there would still be a big problem. Everyone in the forums states 6 months production time from tape-out. How did everyone arrive at that number? Maybe it would be good to get several opinions. |
Someone will probably come up with some small simple change, either myself, othe, smooth, vtnerd, and we'lll discuss it, and post it on github, and if it passes review, it gets merged. |
I'm not a hardware expert, no need to send Bitmain ASIC to me. I haven't seen any teardown videos for their CN miner however.
@moneromooo-monero I'll also do one really small change near the fork. I had it in my plans since the beginning. It will be small, won't affect performance in any way and will also improve ASIC resistance a bit. So there will be two changes from two different sources. |
@SChernykh @moneromooo-monero okay fair enough. It should be noted that those infos probably have dozens of millions of value to Bitmain and that they'll try anything to get it (personal speculation) and it would be good if we had a way of preventing them getting the info earlier than when the general public does a few days before fork. Thanks for your efforts and taking decentralization so serious, unlike other chains...! I'll be sure to point all my GPUs & CPUs to Monero, as always :-) |
Two days before the fork is too little. Variant 2 should be finalized 2 weeks before the fork, together will all major miner software pull requests. Everyone should have enough time to upgrade. |
True that's the other side. So ASIC manufacturer's will also have an additional 2 weeks. Edit: Personally I think it would make sense to add an official "surprise POW tweak fork" once or twice a year to add further uncertainty to their economic models. Date would be determined by a pre-defnied block height block hash + time needed from devs to make a small tweak. |
@SChernykh |
These changes are big from hardware point of view, they'll require a completely new design. I think they won't even be ready with a proper design before the fork - they'll have to spend a lot of time optimizing div+sqrt logic for low latency before starting mass production. P.S. And I still think that even optimized low latency logic will not be fast enough to ROI in 6 months. |
This is exciting news as I mine Monero7 or as some call it MoneroV1 I will post this thread on my bitcointalk thread. |
The mining algorithm is called Cryptonight variant 1 or CryptonightV1 or CNv1. "7" is current protocol version number, not the algorithm name. Many people including whattomine and many other sites confused them since the beginning. New algorithm will be called Cryptonight variant 2 or CryptonightV2 or CNv2. |
Just LOL. There is literally an error in every line in your code. Did you even try to compile it and run on sample inputs? You did not. |
I gave simplified code... It is not targetting the sample inputs, just showing simpler ways (and less computational) to do exactly what you are attempting in the proposed patch. Add to that the fact that I have not enclosed my code in code markup, should tell you clearly, it is example code... Would you like me to submit a patch that works? |
So are you trying to make faster implementation? Whatever, calling "pow" in C++ is not faster than calling "sqrt". And calling "sqrt" is slower than calling SSE intrinsic for sqrt. |
But one uses more cachelines than the other. the difference in runtime is trivial, but the impact on performance of code following is substantial. And not all miners will have the sqrt SSE intrinsic now, will they? |
@philtimmes You're not skilled/competent enough for productive discussion, it seems. Do more research, check the actual miner code: https://github.com/SChernykh/xmr-stak-cpu - my version Try to make this code even 0.1% faster and if you succeed, then we can talk. P.S. All 64-bit x86 CPUs have sqrt SSE intrinsic. |
Skilled / competent enough? You say that with 0 indication of who I am... or even checking what I wrote with more varacity than a copy pasta... I proposed a patch, and that was ignored... |
@philtimmes The only thing you proposed so far was slow sqrt implementation with a bug in every line. You're right, I have 0 indication of who you are, so until I see some actual working code from you, it will be like this. Don't think I'm grumpy, but I had a lot of time wasted recently in such conversations. |
Yessir... |
Under what set of circumstances would you submit a patch that does not work (and be expected to be taken seriously)? |
@plavirudar While I admit there are 2 ways to read what I wrote, I would assume you could see both of them. |
Portable version needs improvement actually. Let's wait for what @philtimmes comes up with. |
@SChernykh An optimization for a portable code would be nice for webassembly, since there is no SIMD available at the moment. By the way: Converting to double before doing the division |
@SChernykh Please update the original cpuminer at https://github.com/hyc/cpuminer-multi |
@miki-bgd-011 This repository looks abandoned. xmrig and xmr-stak already have optimized CPU versions, I feel it would be a waste of time to add support there. |
:( ok |
I mean, I can add C code, but it will be significantly slower than what xmrig and xmr-stak have. Adding assembler versions will be a lot more work and not worth it. |
I understand and agree. |
@SChernykh Fwiw - no, not abandoned, I use this code still. Some people still like small C projects with few external dependencies... |
@hyc Ok, I'll add support later this week, but don't expect high performance from this. |
YAY!! |
@hyc @miki-bgd-011 Added it: hyc/cpuminer-multi#3 |
@SChernykh Thank you, works just fine on http://killallasics.moneroworld.com/. @hyc I've noticed the miner wastes cpu when connection to the pool is not established. Would be nice to fix that, to start mining only when connection to the pool is successfully established :) |
I know it's a bit late to change this, but it looks like the new algorithm is going to lock the network onto closed / locked Intel and AMD CPUs (ME/PSP concerns + secure boot etc.). The new algorithm has (inadvertently?) inserted a rather nasty FPU performance microbench that is knocking non-x86 CPUs, including owner controllable ones, out of consideration (half the hashrate in many cases). My main concern is that we've exchanged one kind of ASIC (mining ASIC) for another (Windows / x86 locked CPUs). I am not convinced the network can be properly secured when only one CPU architecture, controlled by a fairly hostile duopoly, is economical for mining. |
You sound like an unhappy ASIC owner. |
Nope, no ASIC here 😄 Small fleet of POWER9 machines that are far more secure than your x86 AMD/Intel controlled stuff though! |
Reference code: monero-project/monero#4218
Contains two modifications to improve ASIC resistance: shuffle and integer math.
Shuffle makes use of the whole 64-byte cache line instead of 16 bytes only, making Cryptonight 4 times more demanding for memory bandwidth.
Integer math adds 64:32 bit integer division followed by 64 bit integer square root, adding large and unavoidable computational latency to the main loop.
More or less complete description of these changes and performance numbers are here: https://github.com/SChernykh/xmr-stak-cpu/blob/master/README.md
Discussion that preceded this pull request: SChernykh/xmr-stak-cpu#1