Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cryptonight variant 2 #4218

Merged
merged 1 commit into from
Sep 11, 2018
Merged

Cryptonight variant 2 #4218

merged 1 commit into from
Sep 11, 2018

Conversation

SChernykh
Copy link
Contributor

@SChernykh SChernykh commented Aug 3, 2018

Contains two modifications to improve ASIC resistance: shuffle and integer math.

Shuffle makes use of the whole 64-byte cache line instead of 16 bytes only, making Cryptonight 4 times more demanding for memory bandwidth.

Integer math adds 64:32 bit integer division followed by 64 bit integer square root, adding large and unavoidable computational latency to the main loop.

More or less complete description of these changes and performance numbers are here: https://github.com/SChernykh/xmr-stak-cpu/blob/master/README.md

Discussion that preceded this pull request: SChernykh/xmr-stak-cpu#1

@Gingeropolous
Copy link
Collaborator

Can you clarify whether this would brick an existing ASIC or just slow it down? I.e., if an ASIC does exist for the current PoW (Cryptonight Variant 1), would these modifications make the ASIC useless or would it just decrease performance?

@SChernykh
Copy link
Contributor Author

It would definitely brick existing ASICs (if there are any): they don't have integer division and square root and they access memory in 16-byte chunks, not 64-byte chunks.

@el00ruobuob
Copy link
Contributor

And new ASICs will be even more expensive to build and operate.

@SChernykh
Copy link
Contributor Author

New ASICs will also be many times slower because of the nature of these changes.

@SChernykh
Copy link
Contributor Author

Well, that depends on ASIC implementation. We don't know for sure how their internals work. But according to a lot of scientific papers about hardware implementations of division and square root - yes, looks like 16x times slower.

} while (0)

#define VARIANT2_SHUFFLE_SSE2(base_ptr, offset) \
do if (variant == 2) \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make these >= 2, we don't want to revert for v3 without thinking about it first (and v3 will be the same as v2 anyway, there'll be just one day between the two).

{ \
VARIANT2_INTEGER_MATH_DIVISION_STEP(b, ptr) \
const double z = sqrt_input; \
const int64_t z1 = sqrt(z); \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not realize this would actually use floating point. Is it stable across the whole 64 bit range on various platforms ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's stable on any system that supports IEEE-754. Assuming that underlying hardware is IEEE-754 compliant, this code works like this:

  • sqrt_input is 48-bit integer, it's converted to double without rounding and loosing precision
  • sqrt(z) does rounding, but it's safe here: the result is less than 2^24 and absolute error is less than 2^-28 because double precision is used
  • z is then truncated to get z1, but absolute error is too small to make this truncation incorrect

This was the theory, but I also tested this code (both portable and SSE2 versions) and it gives correct results for all integers less than 2^48.

I can publish the testing program so everyone else could confirm it. It's actually easy to write a testing program independently. The definition of integer square root is simple: INT_SQRT(N) is an integer number M such that both inequalities hold: M^2 <= N and (M+1)^2 > N. This number always exists if N >= 0 and there is always only one such number.

P.S. I can rewrite the fallback implementation (the last one in slow-hash.c) to use only integers to calculate the square root. It will become slower though.

Copy link
Contributor

@vtnerd vtnerd Aug 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IEEE-754 behavior isn't required by the C or C++ standard. There is a static check in numeric_limits<T> for IEEE. I doubt this check fails frequently though.

  • sqrt(z) does rounding, but it's safe here: the result is less than 2^24 and absolute error is less than 2^-28 because double precision is used
  • z is then truncated to get z1, but absolute error is too small to make this truncation incorrect

I might be missing something, but this feels awfully close to a +/-1 bug with rounding, where the float itself is 3.999999 or 4.00000. I would need to brush up on my floating point behavior. @hyc ?

EDIT: changed "defined" to "required" above because that was my intent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IEEE-754 is not required, but even without it this code is far away to getting incorrect rounding: worst case is when sqrt(2^48 - 1) is calculated: https://onlinegdb.com/r1AmAyzrX

Rounding error is never larger than 1 ULP by definition, and the distance to the next integer is 16 ULP in this case.

Copy link
Contributor

@vtnerd vtnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick pass at the code semantics.

const __m128i chunk1 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x10))); \
const __m128i chunk2 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x20))); \
const __m128i chunk3 = _mm_load_si128((__m128i *)((base_ptr) + ((offset) ^ 0x30))); \
_mm_store_si128((__m128i *)((base_ptr) + ((offset) ^ 0x10)), _mm_shufflelo_epi16(_mm_shuffle_epi32(chunk3, _MM_SHUFFLE(2, 0, 3, 1)), _MM_SHUFFLE(3, 1, 2, 0))); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently alters the CPU / GPU performance ratios. The CPU has 64-byte cachelines while most GPUs seem to have 32-byte cachelines. This is also touching more cachelines per iteration, giving a further performance edge to CPUs.

Not arguing for or against this altereration, but I thought it was worth pointing out for those that may be interested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPU/GPU performance ratio remains pretty much intact, we already have a lot of performance numbers: https://github.com/SChernykh/xmr-stak-cpu#performance

uint32_t sqrt_result; \
do if (variant == 2) \
{ \
*((uint64_t*)division_result) = 0; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for the aliasing violation, just do division_result = {0, 0};. Detecting multi-wide initialization is trivial for the compiler.

#define VARIANT2_SHUFFLE_PORTABLE(base_ptr, offset) \
do if (variant == 2) \
{ \
uint32_t* chunk1 = (uint32_t*)((base_ptr) + ((offset) ^ 0x10)); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't portable. This is an aliasing violation that can result in misaligned access errors on some CPUs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base_ptr is actually a scratchpad pointer here. It's allocated by malloc/VirtualAlloc/mmap. It will be 8-byte aligned in the worst case. offset is always a multiple of 16, so misaligned access is not possible here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, its reliant on the code below. This is still an aliasing violation, but AFAIK will work on our targeted compilers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe as long as compiled code passes all tests. This PR has test vectors to catch any bug.

@SChernykh
Copy link
Contributor Author

Regarding concerns about using floating point to calculate the integer square root: there is only 1 place in the algorithm where a rounding error happens (the sqrt() call itself), but this rounding error is too small to cause any trouble. This code was actually tested and confirmed correct for all 48-bit integers that can possibly be an input to it.

@moneromooo-monero
Copy link
Collaborator

I see the >> 16 reducing the range, so yes it seems fine. Nevermind my comment then.

@vtnerd
Copy link
Contributor

vtnerd commented Aug 3, 2018

Regarding concerns about using floating point to calculate the integer square root: there is only 1 place in the algorithm where a rounding error happens (the sqrt() call itself), but this rounding error is too small to cause any trouble. This code was actually tested and confirmed correct for all 48-bit integers that can possibly be an input to it.

The issue would be portability, which cannot be easily tested. What platforms were tested for this?

@SChernykh
Copy link
Contributor Author

The issue would be portability, which cannot be easily tested. What platforms were tested for this?

x86 and ARM (64 bit), also OpenCL on AMD and NVIDIA GPUs - all hash the same. And I also added test vectors in this pull request, so buildbot has already tested it on Windows, Linux and MacOS (32 and 64 bit) - no discrepancies.

Portability won't be a problem here. It will run the same on any IEEE-754 compliant hardware: C standard is pretty strict when it comes to integer <-> double conversions and sqrt() implementation. So once it's confirmed correct (which it is) on one platform supporting IEEE-754, it will run the same on all platforms with IEEE-754 support.

@iamsmooth
Copy link
Contributor

iamsmooth commented Aug 3, 2018

Have you measured relative power usage on CPUs and GPUs? The liked document above only discusses speed (H/s) not power (H/W).

That's another efficiency factor to consider along with speed. If the power usage doesn't increase much that is good result. If not then some of the hypothesized relative gain against ASICs may illusory (and perhaps suggest looking at some variant that preserves a bit more power efficiency?)

Certainly still valid in terms of 'bricking' of existing designs.

@SChernykh
Copy link
Contributor Author

Power increase is minimal. It's a bit higher than the old Cryptonight, but nothing dramatic. I've just asked people to give me numbers: SChernykh/xmr-stak-cpu#1 (comment)

@SChernykh
Copy link
Contributor Author

@iamsmooth I did a quick test on my Radeon RX 560: HWINFO64 reports "GPU chip power" to be 34.2 watts for original Cryptonight at 477 H/S and 38.7 watts for Cryptonight variant 2 at 447 H/S. I can't measure power at the wall - it's usually much larger than what HWINFO64 reports, but the absolute difference in wattage should be similar.

@iamsmooth
Copy link
Contributor

iamsmooth commented Aug 3, 2018

Portability won't be a problem here. It will run the same on any IEEE-754 compliant hardware: C standard is pretty strict when it comes to integer <-> double conversions and sqrt() implementation

This requires making sure that the compiler is actually configured to perform floating point in a strictly compliant manner. That (configuration) is not entirely portable, although I guess there aren't too many compilers we are dealing with in practice. Maybe this should be noted in documentation somewhere, for reference of future ports/updates.

That said its pretty hard to believe the specific case of integer <-> double conversion would ever change. I would expect more risk on other operations.

@SChernykh
Copy link
Contributor Author

SChernykh commented Aug 3, 2018

This requires making sure that the compiler is actually configured to perform floating point in a strictly compliant manner.

This code only needs 48 bits of precision for integer -> double conversion and sqrt. IEEE-754 guarantees 52 bits. Even if some implementations are not fully compliant and give a bit larger error for sqrt like an error in last bit sometimes (for example, some corners are cut when rounding), it will still be good enough.

Edit: I'll add test vectors that go through one of edge cases (N^2-1, N^2, N^2+1) for sqrt in the main loop. I'll first have to make something like a vanity generator, but for test vectors :) This will help catch non-compliant implementations.

@iamsmooth
Copy link
Contributor

iamsmooth commented Aug 3, 2018

@SChernykh I agree there are not likely to be any problems here on sane platforms. Rather I was replying on the specific point that what the C standard requires of the compiler and what the compiler actually does is not always the same thing. Even when a compiler does implement strict standard compliance (not always), it may require specific options.

On that note, I noticed this in https://en.wikipedia.org/wiki/C_data_types

The actual size and behavior of floating-point types also vary by implementation. The only guarantee is that long double is not smaller than double, which is not smaller than float

So the assumption of 48/52 bits of precision is not a given. As noted above (or maybe in the commit reviews), IEEE-754 is not strictly required (though is often the case in practice, at least with the right compiler options).

Again, a good resolution of this is to clearly note the (new) dependencies in the developer documentation, so anyone porting can be made aware of them.

@MoneroCrusher
Copy link

So practically & effectively this makes FPGAs 4x slower, making them much worse in terms of $/hash (see Xilinx FPGA, 22 kH/s CN7 for 4-5k$) than GPUs and ASICs 16x slower still being better than GPUs in terms of production cost/hash but they'll be useless within a couple months and not break even.

So this makes GPUs & CPUs the best thing to mine with, if Monero keeps the strict 6 months fork schedule.
Is this correct?

@SChernykh
Copy link
Contributor Author

Yes, it's correct. At least 4 times slowdown compared to Cryptonight v1 for all kinds of ASIC/FPGA.

@MoneroCrusher
Copy link

@SChernykh would it be possible that you could try to estimate ASIC production cost and real production time if the community would fund you a Bitmain CN 220khs miner (or is there any good take-apart video?)? As they will still be an ordner of magnitude better both in price/hash and much better in power/hash. If Bitmain can tape one out in 1 month there would still be a big problem. Everyone in the forums states 6 months production time from tape-out. How did everyone arrive at that number? Maybe it would be good to get several opinions.
@moneromooo-monero
How is it determined & who determines the last minute changes prior to fork? It would be good to know. And how is it made sure that persons with access to that info don't get bribed by Bitmain to pass it on?

@moneromooo-monero
Copy link
Collaborator

Someone will probably come up with some small simple change, either myself, othe, smooth, vtnerd, and we'lll discuss it, and post it on github, and if it passes review, it gets merged.

@SChernykh
Copy link
Contributor Author

I'm not a hardware expert, no need to send Bitmain ASIC to me. I haven't seen any teardown videos for their CN miner however.

How is it determined & who determines the last minute changes prior to fork?

@moneromooo-monero I'll also do one really small change near the fork. I had it in my plans since the beginning. It will be small, won't affect performance in any way and will also improve ASIC resistance a bit. So there will be two changes from two different sources.

@MoneroCrusher
Copy link

MoneroCrusher commented Aug 4, 2018

@SChernykh @moneromooo-monero okay fair enough. It should be noted that those infos probably have dozens of millions of value to Bitmain and that they'll try anything to get it (personal speculation) and it would be good if we had a way of preventing them getting the info earlier than when the general public does a few days before fork.
As the current method implies trust in a few people (not that I don't trust you, but that's not the idea).
Maybe actively propagate changes up until 1-2 days before fork from all community members and then randomly choose one based on a pre-defined monero block hash number (last number or letter corresponding to 1 version of a proposed change). I'm sure there are better ways though.

Thanks for your efforts and taking decentralization so serious, unlike other chains...! I'll be sure to point all my GPUs & CPUs to Monero, as always :-)

@SChernykh
Copy link
Contributor Author

Two days before the fork is too little. Variant 2 should be finalized 2 weeks before the fork, together will all major miner software pull requests. Everyone should have enough time to upgrade.

@MoneroCrusher
Copy link

MoneroCrusher commented Aug 4, 2018

True that's the other side. So ASIC manufacturer's will also have an additional 2 weeks.
Wondering if there's a way to find out how long it took Bitmain & Baikal for their old CNv0 ASICs from planning to end product.

Edit:
I did some calculations with my own imginatory assumptions.
10'000 ASICs like their old X3 miner would result in 220 MH/s. Or about 45% of our current network. 10'000 ASICs is not a lot for a multi billion dollar company like Bitmain.
If we assume 1 ASIC costs $500 in production (including R&D) (they sold for low $k and S9 used to cost several $k and look at their prices now, I think they're still selling them for a profit, they only accomodate their prices to crypto income, not actual production cost).
So if the $500 assumption is true, then that's a $5M investment, which is nothing for Bitmain. That $5M investment would yield 45% of XMR (daily emission is 3024 XMR) therefore 1360,8 XMR go to Bitmain, or $163'296 in daily proceeds. They would consume 4.65MW of power, assuming a price of 3c per kW/h would mean daily costs of $3'348 and a net profit of $159'948. That would mean a break-even time of 31.26 days. With the new algo it's 125 days if they use external memory, 500.16 days if on-chip memory (or they'll just cram more on there if they have cheap access to it, as I also don't know what type of memory they use).

Personally I think it would make sense to add an official "surprise POW tweak fork" once or twice a year to add further uncertainty to their economic models. Date would be determined by a pre-defnied block height block hash + time needed from devs to make a small tweak.

@MoneroCrusher
Copy link

@SChernykh
I have another question: Does it make sense, hardware-wise, for an ASIC manufacturer to implement your code in a prototype ASIC and then wait for the 2 tweaks 2 weeks before fork so they would only have to do small adjustments to their ASIC before HF and therefore they would be ready to start production within the first month of the HF?
Just trying to view this from every angle a greedy company not caring about Blockchain would.

@SChernykh
Copy link
Contributor Author

SChernykh commented Aug 4, 2018

These changes are big from hardware point of view, they'll require a completely new design. I think they won't even be ready with a proper design before the fork - they'll have to spend a lot of time optimizing div+sqrt logic for low latency before starting mass production.

P.S. And I still think that even optimized low latency logic will not be fast enough to ROI in 6 months.

@philipma1957
Copy link

This is exciting news as I mine Monero7 or as some call it MoneroV1
I use mostly ryzen 1800 2700 and some thread rippers with a few rx560's

I will post this thread on my bitcointalk thread.

@SChernykh
Copy link
Contributor Author

Monero7 or as some call it MoneroV1

The mining algorithm is called Cryptonight variant 1 or CryptonightV1 or CNv1. "7" is current protocol version number, not the algorithm name. Many people including whattomine and many other sites confused them since the beginning. New algorithm will be called Cryptonight variant 2 or CryptonightV2 or CNv2.

@SChernykh
Copy link
Contributor Author

SChernykh commented Sep 30, 2018

the square root code is perfectly correct, and matches your integer square root perfectly.

Just LOL. There is literally an error in every line in your code. Did you even try to compile it and run on sample inputs? You did not.

@philtimmes
Copy link

I gave simplified code... It is not targetting the sample inputs, just showing simpler ways (and less computational) to do exactly what you are attempting in the proposed patch. Add to that the fact that I have not enclosed my code in code markup, should tell you clearly, it is example code... Would you like me to submit a patch that works?

@SChernykh
Copy link
Contributor Author

So are you trying to make faster implementation? Whatever, calling "pow" in C++ is not faster than calling "sqrt". And calling "sqrt" is slower than calling SSE intrinsic for sqrt.

@philtimmes
Copy link

But one uses more cachelines than the other. the difference in runtime is trivial, but the impact on performance of code following is substantial. And not all miners will have the sqrt SSE intrinsic now, will they?

@SChernykh
Copy link
Contributor Author

SChernykh commented Sep 30, 2018

@philtimmes You're not skilled/competent enough for productive discussion, it seems. Do more research, check the actual miner code: https://github.com/SChernykh/xmr-stak-cpu - my version
https://github.com/xmrig/xmrig/tree/dev - latest xmrig with variant 2 support

Try to make this code even 0.1% faster and if you succeed, then we can talk.

P.S. All 64-bit x86 CPUs have sqrt SSE intrinsic.

@philtimmes
Copy link

Skilled / competent enough? You say that with 0 indication of who I am... or even checking what I wrote with more varacity than a copy pasta... I proposed a patch, and that was ignored...
Again, I submit that I would be more than happy to submit a patch (for portable / non-SIMD bound side) along with before and after results.

@SChernykh
Copy link
Contributor Author

@philtimmes The only thing you proposed so far was slow sqrt implementation with a bug in every line. You're right, I have 0 indication of who you are, so until I see some actual working code from you, it will be like this. Don't think I'm grumpy, but I had a lot of time wasted recently in such conversations.

@philtimmes
Copy link

Yessir...
I will submit a patch shortly.

@plavirudar
Copy link

plavirudar commented Sep 30, 2018

I gave simplified code... It is not targetting the sample inputs, just showing simpler ways (and less computational) to do exactly what you are attempting in the proposed patch. Add to that the fact that I have not enclosed my code in code markup, should tell you clearly, it is example code... Would you like me to submit a patch that works?

Under what set of circumstances would you submit a patch that does not work (and be expected to be taken seriously)?

@philtimmes
Copy link

@plavirudar While I admit there are 2 ways to read what I wrote, I would assume you could see both of them.

@SChernykh
Copy link
Contributor Author

Portable version needs improvement actually. Let's wait for what @philtimmes comes up with.

@notgiven688
Copy link

notgiven688 commented Oct 1, 2018

@SChernykh An optimization for a portable code would be nice for webassembly, since there is no SIMD available at the moment. By the way:

Converting to double before doing the division
const uint64_t division = (uint64_t)((double)dividend / (double)divisor); did improve the overall speed of the hash function by about 1-2% in my test cases - the problem is that it does not yield the correct hash in all cases. Someone (maybe myself) needs to check if this is a possible route for a portable optimization.

@miki-bgd-011
Copy link

@SChernykh Please update the original cpuminer at https://github.com/hyc/cpuminer-multi

@SChernykh
Copy link
Contributor Author

@miki-bgd-011 This repository looks abandoned. xmrig and xmr-stak already have optimized CPU versions, I feel it would be a waste of time to add support there.

@miki-bgd-011
Copy link

:( ok

@SChernykh
Copy link
Contributor Author

I mean, I can add C code, but it will be significantly slower than what xmrig and xmr-stak have. Adding assembler versions will be a lot more work and not worth it.

@miki-bgd-011
Copy link

I understand and agree.

@hyc
Copy link
Collaborator

hyc commented Oct 3, 2018

@SChernykh Fwiw - no, not abandoned, I use this code still. Some people still like small C projects with few external dependencies...

@SChernykh
Copy link
Contributor Author

@hyc Ok, I'll add support later this week, but don't expect high performance from this.

@miki-bgd-011
Copy link

YAY!!

@SChernykh
Copy link
Contributor Author

@hyc @miki-bgd-011 Added it: hyc/cpuminer-multi#3

@miki-bgd-011
Copy link

@SChernykh Thank you, works just fine on http://killallasics.moneroworld.com/.

@hyc I've noticed the miner wastes cpu when connection to the pool is not established. Would be nice to fix that, to start mining only when connection to the pool is successfully established :)

@madscientist159
Copy link

madscientist159 commented Oct 15, 2018

I know it's a bit late to change this, but it looks like the new algorithm is going to lock the network onto closed / locked Intel and AMD CPUs (ME/PSP concerns + secure boot etc.). The new algorithm has (inadvertently?) inserted a rather nasty FPU performance microbench that is knocking non-x86 CPUs, including owner controllable ones, out of consideration (half the hashrate in many cases).

My main concern is that we've exchanged one kind of ASIC (mining ASIC) for another (Windows / x86 locked CPUs). I am not convinced the network can be properly secured when only one CPU architecture, controlled by a fairly hostile duopoly, is economical for mining.

@kio3i0j9024vkoenio
Copy link

I know it's a bit late to change this, but it looks like the new algorithm is going to lock the network onto closed / locked Intel and AMD CPUs (ME/PSP concerns + secure boot etc.). The new algorithm has (inadvertently?) inserted a rather nasty FPU performance microbench that is knocking non-x86 CPUs, including owner controllable ones, out of consideration (half the hashrate in many cases).

My main concern is that we've exchanged one kind of ASIC (mining ASIC) for another (Windows / x86 locked CPUs). I am not convinced the network can be properly secured when only one CPU architecture, controlled by a fairly hostile duopoly, is economical for mining.

You sound like an unhappy ASIC owner.

@madscientist159
Copy link

I know it's a bit late to change this, but it looks like the new algorithm is going to lock the network onto closed / locked Intel and AMD CPUs (ME/PSP concerns + secure boot etc.). The new algorithm has (inadvertently?) inserted a rather nasty FPU performance microbench that is knocking non-x86 CPUs, including owner controllable ones, out of consideration (half the hashrate in many cases).
My main concern is that we've exchanged one kind of ASIC (mining ASIC) for another (Windows / x86 locked CPUs). I am not convinced the network can be properly secured when only one CPU architecture, controlled by a fairly hostile duopoly, is economical for mining.

You sound like an unhappy ASIC owner.

Nope, no ASIC here 😄 Small fleet of POWER9 machines that are far more secure than your x86 AMD/Intel controlled stuff though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.