Cryptonight + random math discussion #1
That probably means that GPU kernels will need to be recompiled every block. Have you tested the impact on GPUs? |
Yes. If we choose block height as a seed for the random math, they can precompile in the background, so there will be no performance hit. It's a bit more tricky with the previous block hash, but if we take the hash from 2 blocks back, it can also be precompiled. P.S. I don't have a GPU implementation that is dynamic yet - I've only done static tests with different random sequences. But it shouldn't be hard to implement. |
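The background precompilation described above could look roughly like the following sketch. The `generate_source` and `compile_kernel` hooks are hypothetical placeholders (not xmrig APIs), and it assumes block heights advance one at a time; it only illustrates compiling the kernel for block N+1 while block N is being mined.

```cpp
#include <cstdint>
#include <future>
#include <string>

// Placeholders standing in for the real random-code generator and the OpenCL/CUDA compile step.
std::string generate_source(uint64_t height) { return "/* kernel for height " + std::to_string(height) + " */"; }
int         compile_kernel(const std::string& source) { return static_cast<int>(source.size()); }

struct KernelCache {
    std::future<int> pending;   // kernel for the *next* block, compiling in the background

    // Called once per new block: hand back the kernel prepared earlier and
    // immediately start compiling the one for height + 1.
    int on_new_block(uint64_t height)
    {
        // Fall back to a synchronous compile on the very first block.
        int kernel = pending.valid() ? pending.get()
                                     : compile_kernel(generate_source(height));
        pending = std::async(std::launch::async, [height] {
            return compile_kernel(generate_source(height + 1));
        });
        return kernel;
    }
};
```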
I think it's also safer to use block height as a seed because it also makes it possible to use auto-generated optimized code in daemon/wallet software. Updated readme.md:
|
Allowing precompilation also allows searching for easy nonces. |
There are no easy nonces - code changes only once per block and miners don't control it. |
Updated readme.md:
|
Profit-switching miners can still take advantage of it, especially if this tweak is adopted by multiple currencies. |
@tevador Hashrate shouldn't change between different blocks: the code generator doesn't just emit random code, it emulates the CPU's pipeline and produces just enough code to account for the required latency. This is the main idea, and it's already working very well - hashrate changes only 1-2% in my tests with different seeds, and I plan to fix the remaining differences. |
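To make the idea above concrete, here is a heavily simplified sketch of a latency-targeting generator. The per-opcode latencies and the single serial dependency chain are assumptions for illustration only; the real generator also models register dependencies and the number of parallel ALUs.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Assumed per-opcode latencies in cycles: MUL is slow, everything else takes 1 cycle.
struct OpInfo { const char* name; int latency; };
static const OpInfo kOps[] = {
    {"MUL", 3}, {"ADD", 1}, {"SUB", 1}, {"ROR", 1}, {"ROL", 1}, {"XOR", 1},
};

// Keep emitting random instructions until the modeled latency reaches a fixed
// target, so every generated program costs the CPU roughly the same number of
// cycles regardless of the seed.
std::vector<int> generate_program(uint64_t seed, int target_latency)
{
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<int> pick(0, 5);

    std::vector<int> program;
    int latency = 0;
    while (latency < target_latency) {
        const int op = pick(rng);
        program.push_back(op);
        latency += kOps[op].latency;   // simplification: one serial dependency chain
    }
    return program;
}
```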
Reference implementation (variant4 code in the following files):
The same implementation in Monero code base:
Optimized CPU miner:
Optimized GPU miner:
Pool software:
Test pools: |
The test pool is up and running; you can now test CryptonightR with xmrig/xmrig-amd (see links above). Update (2019-02-18): this test pool was for the original CryptonightR without tweaks; it's incompatible with the final version. |
Please test with different CPUs/GPUs and compare hashrate/power consumption with CryptonightV2. I need data for further CryptonightR tweaking. |
I've created a separate issue for test results: #2 |
Here's a dump of today's discussion in IRC:
|
@timolson , have you weighed in on this re: asic design? |
also, let's ping @dave-andersen .... hey, how's it going? It's your lovable friends over at monero again :) |
I left a brief review in the RandomX project... is CryptonightR a parallel effort or similar? I could give an hour or two if this is something different to look at, but my schedule's very tight through mid-to-late January. I'd be glad to do a deeper investigation after about 4 weeks from now. |
CryptonightR is a modification of Cryptonight, whereas RandomX is done completely from scratch. The main purpose of CryptonightR is to be the next PoW for Monero until RandomX is ready. The only difference from CNv2 is the integer math part - it's randomly generated now, just like in ProgPoW. It's also supposed to be more computationally expensive (+ higher computation latency) than the div+sqrt in CNv2. |
An example of generated random math: https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl |
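For readers who don't open that file: a generated sequence is a short, branch-free mix of MUL/ADD/SUB/ROR/ROL/XOR over a handful of 32-bit registers. The snippet below is a hand-written illustration of the shape of such code, not an excerpt from random_math.inl.

```cpp
#include <cstdint>

// Illustrative only - the real generated sequences are 60-69 instructions long and use more registers.
static inline void random_math_example(uint32_t& r0, uint32_t& r1, uint32_t& r2, uint32_t& r3)
{
    r0 += r1;
    r2 ^= r3;
    r1  = (r1 >> (r0 & 31)) | (r1 << ((32 - (r0 & 31)) & 31));  // ROR of r1 by r0
    r3 *= r2;
    r0 ^= r2;
    r2 += r3;
}
```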
I’ll assume you care about modifying existing ASICs, not ASIC-from-scratch, which is a different question.

Time for a reveal: when we heard you were going to tweak the PoW, we added a bespoke VLIW processor to our design which could intercept various points along the CryptoNight datapath, perform programmable calculations, then inject the result back into the standard CryptoNight datapath. It potentially (subject to some details I don’t have time to look up) could have handled this new random math without changing our silicon at all! The VLIW could do shift/rotate/xor, then a “primary” operation like add or divide, then a trailing set of bitops before injection. In effect, any bitops would have been “free” because of the VLIW design. And yes, there was enough register space. Not sure if our instruction buffer was long enough... I think so.

What it couldn’t do is generate the random code itself. It needed to be programmed ahead of time. We could probably have used our SoC controller to do the program generation, but programming the chip involves extra IO which could be slow. Also, each VLIW instruction cost one cycle (three for division). The length of your programs would have absolutely crushed our speed, since our inner loop was only 4-6 cycles.

Anyone with a Monero ASIC from before the announcement of the tweak threat would not have such a coprocessor, but someone developing an ASIC after your tweaking announcement could very well have planned like we did for the tweaks. However, v2 added new datapaths to memory, which was a real blow. An ASIC designer would have needed to go through a completely new physical layout for v2, and also have added some kind of coprocessor like we did in anticipation of future tweaks. I find this unlikely, but possible.

If you are changing the program every nonce, it would be a big problem for such a coprocessor design, since it would need to be reprogrammed every nonce. If you are changing the program slowly like ProgPoW, your random math may not be safe. Slow-changing programs also open the door to FPGA implementations. In terms of an ECO on a chip that doesn’t have a coprocessor, it is probably too much to change, even if they left plenty of room. I would recommend one or more of the following:
Overall, I think the threat of modifying an existing ASIC to handle this tweak is low, and if it did handle the tweak, the chip would be slow. However, you may consider the changes above out of an abundance of caution. If our chip had gone to production, and we had redone layouts to keep up with v2, I would be giggling about this tweak. I think we could have handled it with maybe zero changes, but only because we had this on-chip VLIW coprocessor in anticipation of your tweaks. This was a super-quick review and I could be missing something really important in your design. LMK if that’s the case and I’ll jump back into the convo. |
1 - This will kill GPU performance too, so it's not an option for now.
2 - Possible for further tweaking; it's not hard to add something new to the current algorithm design as long as it's a single x86 instruction and fast enough on GPUs.
3 - I tried a lot of things, but couldn't find a "memory path" tweak that doesn't hurt CPU performance (yet).
4 - They're already as long as they can be without slowing down CPUs. They could maybe be a bit longer if they had fewer MUL and more simple ADD/ROTATE/XOR operations, but that would also change the current performance ratio between CPU and GPU.
That's the whole point of this tweak, explicitly mentioned in readme:
It makes a single ASIC core limited by both memory access and the computation part - whichever is slower - hopefully making a single ASIC core no faster, or even slower, than a single CPU core. The performance/power ratio is still much better for an ASIC, but RandomX will get there eventually.
Zero changes, but still with a huge drop in performance? |
I like #2 best if you can find the right operation. v1 was annoying because
we didn’t plan for sqrt. Add/mul/bitops are too predictable.
|
That's why RandomX uses a lot of floating point math. If you want a good FPU in your chip, you will probably need to license some IP core. |
It was not a matter of licensing... Synopsys will synthesize IEEE floats
just fine out of the box, and licensing something else isn’t a problem. It
was simply a matter of choice: we didn’t think floats were worth the die
area, since it’s unusual to use floats in a PoW.
|
Yep. I didn’t say RandomX wasn’t effective against tweaked ASICs... just pointing out some ways to make it stronger. Just curious, have you seen any evidence of ASICs returning after v1/v2 tweaks? PM me if you’d like to keep it quiet. Also let me say: WhatsMiner is kicking BitMain’s butt, and BitMain is rumored to be divesting their mining operations, with Jihan Wu being demoted as well. The whole “ASICs are monopolized and scary” argument is certainly weakening. |
v1 ASICs - yes, including CN/xtl variant. All CNv1 coins are 3-4 times less profitable now per 1 KH/s. CNv2 and CN/heavy coins - no signs of ASICs. |
Excellent, that would align with expectations. v2 was the real killer with new datapaths to memory. I’d say your risk of ASICs existing that have both a coprocessor and v2 capability is very low. RandomX should be fine and is probably even overkill at this point. Oh, just to be sure... the new math is in the inner loop near the AES and MUL, right? Not in the initialization or finalization loops? |
It replaces div+sqrt, so it's in the inner loop. |
Testing showed that hashrate fluctuations on CPUs are bigger than we want, especially on AMD Bulldozer, so the code generator will be revised. |
A small concern is that there is no native bit rotation instruction on GPUs. Other operations look good. They may achieve ~0.5 instructions per clock on NVIDIA GPUs. In fact, a carefully designed ASIC could still outperform a GPU by spending more resources/area on the bottlenecks. The memory bandwidth can be greatly improved using more, smaller DRAM partitions and parallel memory controllers with address interleaving. The random math cannot utilize the GPU’s floating-point ALUs, tensor cores and certain on-chip memory, which occupy much more area than the tiny integer ALUs. An ASIC implementation could just build more simplified integer ALUs and multi-bank register files with a very simple decoder for better thread-level parallelism. It is also possible to achieve chained operations with a reconfigurable ALU array. |
@SChernykh CUDA PTX is not the native instruction set. It is good to see how this instruction is translated to native code on different architectures using the CUDA binary utilities. I guess we probably need to call a special intrinsic function to generate the desired instruction. Here is a table of instruction throughput |
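For reference, a portable 32-bit rotate usually looks like the snippet below in C/C++. Whether it lowers to a single native rotate, a funnel shift, or a shift/shift/OR sequence depends on the target architecture, which is exactly why inspecting the generated machine code (e.g. with cuobjdump, as suggested above) is the right way to check.

```cpp
#include <cstdint>

// Portable rotate-left. x86 compilers turn this into a single ROL; a GPU without a
// native rotate instruction emits two shifts and an OR (or a funnel shift where available).
static inline uint32_t rotl32(uint32_t x, uint32_t c)
{
    c &= 31;  // masking keeps both shift counts in [0, 31] and avoids undefined behavior
    return (x << c) | (x >> ((32 - c) & 31));
}
```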
@timolson do you think ASICs will be less than 2x more expensive than mass-market GPUs? |
Given the same process, there is no way a general purpose processor could keep an ASIC's advantage under ~2x in terms of perf/$ or perf/watt, unless we make the program do every possible task that the processor is designed for, such as graphics and AI. It seems random math is aiming in the right direction but is far from the goal. Even some AI chips can beat GPUs by 10x in terms of power efficiency. |
@ga10b This is a temporary solution for the next 6 months, of course - RandomX is much more advanced, check tevador's repository. NVIDIA GPUs don't have performance issues with cn/r, so I think rotation instructions are natively supported. |
@ga10b Can you have a look this? https://github.com/tevador/RandomX The documentation is a bit outdated (memory access is being reworked at the moment), but it should be enough for a brief review. |
@ga10b I use
|
I've updated the random code generator. The first version had a number of issues (high hashrate fluctuations on some CPUs) and also one small coding error that resulted in a lower minimal theoretical latency (parameter
Testnet pool should be up and running with the updated version next week (maybe even this weekend). |
Hi @SChernykh, just wondering re: timolson's 2 suggestions.
What if we used both 1) and 4) at varying levels, as 'control levers' to adjust the CPU/GPU performance ratio, and at the same time reduce FPGA and ASIC performance? I'm thinking that if it hurts FPGAs and ASICs more than CPUs/GPUs, it may outweigh the drop in performance. Thanks, |
@MoneroChan There is no way to estimate this; I need to actually implement it to see the impact on CPU/GPU. GPUs will have to either use something very close to the reference code, with a switch-case for every instruction, so they'll be hit very hard by code divergence, or they'll have to run all 6 possible instructions at every step and save/load results to/from LDS, which will require ~18 times more GPU instructions to execute. Either way, GPUs will be hit hard if the random program changes every nonce. |
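A rough sketch of the first option (a switch-based interpreter) is shown below. The instruction encoding is hypothetical and only loosely modeled on the six variant4 operations; the point is that when every nonce has its own program, neighboring GPU work items take different branches through the switch and the warp/wavefront serializes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical encoding: opcode, destination/source register indices, immediate used by ADD.
enum class Op : uint8_t { Mul, Add, Sub, Ror, Rol, Xor };

struct Insn {
    Op       op;
    uint8_t  dst;
    uint8_t  src;
    uint32_t imm;
};

static inline uint32_t rotr32(uint32_t x, uint32_t c) { c &= 31; return (x >> c) | (x << ((32 - c) & 31)); }
static inline uint32_t rotl32(uint32_t x, uint32_t c) { c &= 31; return (x << c) | (x >> ((32 - c) & 31)); }

// Straightforward interpreter over a small register file. Fine on a CPU; on a GPU with
// per-nonce programs every thread walks a different instruction stream, so the branches diverge.
void execute(const Insn* code, size_t n, uint32_t r[9])
{
    for (size_t i = 0; i < n; ++i) {
        const Insn& ins = code[i];
        switch (ins.op) {
        case Op::Mul: r[ins.dst] *= r[ins.src];                     break;
        case Op::Add: r[ins.dst] += r[ins.src] + ins.imm;           break;
        case Op::Sub: r[ins.dst] -= r[ins.src];                     break;
        case Op::Ror: r[ins.dst]  = rotr32(r[ins.dst], r[ins.src]); break;
        case Op::Rol: r[ins.dst]  = rotl32(r[ins.dst], r[ins.src]); break;
        case Op::Xor: r[ins.dst] ^= r[ins.src];                     break;
        }
    }
}
```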
Thanks @SChernykh. It looks like the situation is suddenly changing very quickly for the worse. Based on the ongoing reddit discussion, FPGAs or ASICs with FPGA controllers are strongly suspected. Your offer of an 18x slower but still competitive GPU with per-nonce changes is starting to look like a better option than having no GPUs left mining at all. Any thoughts? :-/ |
@MoneroChan GPUs won't get 18x slower, they still have plenty of computing power available. They will get 3-4, maybe 5 times slower, but the problem is that an ASIC won't get slower if the program changes every nonce instead of every 2 minutes - the program generator is simple enough to integrate into the ASIC pipeline. GPUs will get much slower, but future ASICs won't, so the network as a whole will be more vulnerable. |
Sooner or later, we will have to choose a CPU-only or a GPU-only PoW. Keeping both brings too many limitations in fighting ASICs. |
My thoughts... and just thoughts, from a WOW "CPU" miner's perspective. I see little difference in the comparison of GPU to ASIC or FPGA. With that said, and being fairly new at mining (since summer of 2018): I have mined cn/2 since Sept 2018 and was not aware that people could rent hash power in such magnitude. So I did a little research this morning and looked at what was available to rent for cn/2 hash power... at any given moment people can rent up to 20 MH/s of cn/2 with a few clicks of a mouse.

Being a small miner and an advocate of PoW and the crypto ecosystem, a question came to mind. Small miners have less chance of getting rewards than large miners, small pools have less chance of getting rewards than large pools, small miners on small pools vs large miners on small pools, and so on. The larger your budget, the larger your hash. So my mind pondered a few questions: how does PoW have any fairness scale at all? How can it grow an ecosystem? How do you get people involved and keep them involved in the current infrastructure? I know PoW spends a fair amount of time on dev because I see some of it on Git.

I come to the conclusion that you would only feel this way because you are invested in GPUs. Now as a CPU advocate, if purpose-built hardware becomes available that is economical and faster than GPUs, I would buy it or rent it. Further review and study would have to be made to create or evolve a different environment. Are we really just creating or supporting an existing GPU environment? I will continue to mine WOW with CPUs. ASIC and FPGA resistance seems to be a state of mind. I program in Verilog. All GPU miners have a CPU. |
I agree with @tevador, and we can strategically use the 'math' for the next hardfork to slowly start shifting the performance ratio to our target hardware (CPU or GPU) to allow a 'smooth gradual transition'; people will be less likely to complain, so the earlier the better. Personally, I'm invested in decentralization; either CPU or GPU is fine by me. |
The program generator must run on the GPU, to avoid the compile/download overhead per nonce. I've stated my opinion before though - our primary focus should be CPU; GPU optimization can come later if someone feels compelled to do it. |
I was just pointed at this thread. Most of the above quoted response doesn't apply to ProgPoW. I've left a response in relation to ProgPoW here:
Why? We plan to decrease the ProgPoW period to change the random program about every 2 minutes, similar to the Monero block time. It only takes ~1 second to compile a kernel and it won't be hard to have the CPU compile GPU program N+1 while program N is executing. There's no overhead on having the miner call a different program when the block switches.

Btw feel free to use our miner as a reference for how to do on-line compilation of randomly generated programs, for both OpenCL and CUDA. Note that to on-line compile CUDA programs you'll need to distribute nvrtc. See this issue for details: https://github.com/AndreaLanfranchi/ethminer/issues/39

Now a quick review of CryptonightR. The high-level idea is fairly similar to ProgPoW in that a random sequence is generated every few minutes and compiled offline. This random sequence will almost certainly brick, or drastically reduce performance on, any existing or late-production ASIC. However, it does not fundamentally prevent ASICs from being manufactured. As Tim pointed out, ASIC manufacturers are building more flexibility into their ASICs so they can handle tweaks exactly like these. I would not be surprised if in the near future an ASIC is produced that is simply a bunch of ARM A53s, or a similar highly efficient ARM core, attached to a large pool of on-die memory. Adapting this system for a new algorithm variant would be as easy as recompiling some new miner software.

To get an idea of the gains possible from this, consider that an iPhone A12 scores around 11,000 multi-core Geekbench while consuming <10 watts (it's hard to find precise power numbers). An AMD Threadripper 2950X scores 35,000 but consumes 180 watts. That's a 3x perf difference but a 20x power difference, making the ARM 6x as efficient.

On top of the CPU core efficiency differences, there are 2 fundamental points that make Cryptonight (all variants) attractive to ASICs:
Given the above I would not be surprised if an FPGA bitstream was developed fairly quickly. There'd be one small part to generate the random sequence for the current nonce. The rest of the logic would be simple in-order CPU cores that would have the AES and general math needed to execute the random sequence. The overall performance on a VU9P should be around that of a 16-core x86 CPU (since 16 2 MB buffers could fit) at significantly lower power. I think any algorithm that is designed to be "CPU-friendly" can have at least a 2x, and sometimes a 10x, ASIC made for it, since a sea of low-power ARM cores could run it efficiently. All that said, I think CryptonightR is an improvement over CryptonightV2, so there's no reason not to go to it. However, it's not an end-point. I'll also do a review of RandomX in the near future. I think having a random program per-nonce actually makes the algorithm more ASIC-friendly, but I need some time to work out the details. |
I think you misunderstood something. We've had it since the beginning, including precompilation for block N+1 while block N is running. Feel free to use our xmrig CryptonightR branch as a reference for how to do on-line compilation of randomly generated programs, for both OpenCL and CUDA. |
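For context, the NVRTC flow that both comments refer to boils down to something like the minimal sketch below (error handling omitted; every nvrtcResult should be checked in real code). The resulting PTX string would then be loaded through the CUDA driver API, e.g. cuModuleLoadData.

```cpp
#include <nvrtc.h>
#include <string>
#include <vector>

// Compile generated CUDA source to PTX at runtime.
std::string compile_to_ptx(const std::string& source)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source.c_str(), "random_program.cu", 0, nullptr, nullptr);

    const char* opts[] = { "--gpu-architecture=compute_50" };   // example target, pick per device
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());

    nvrtcDestroyProgram(&prog);
    return std::string(ptx.begin(), ptx.end());
}
```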
Geekbench != good reference. It's not 100% parallelizable unlike PoW in general. It's better to compare server ARM processors like Cavium ThunderX with their x86 counterparts in 100% parallelizable server tasks. They are not much more efficient. |
The size of the generated random programs - 60-69 instructions, taking at least 45 cycles to execute (given 3 cycles for MUL, 1 cycle for the rest, and unlimited parallel ALUs) - simply makes each FPGA/ASIC core with on-chip memory very slow. In my tests, a 4 GHz Ryzen can't execute a random math iteration faster than 12.5 ns (without the rest of the Cryptonight loop). An ASIC would have to run at a similar speed to be as fast per core, while an FPGA just can't run that fast. An FPGA would take at least 100 ns to execute a random math iteration, which equals ~19 h/s per core or ~400 h/s for the whole VU9P chip (21 cores can fit). |
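A quick check of the arithmetic quoted above, taking 524,288 main-loop iterations per Cryptonight hash and the 100 ns and 21-core figures from the comment:

```cpp
#include <cstdio>

int main()
{
    const double ns_per_iteration    = 100.0;     // FPGA latency per random-math iteration
    const double iterations_per_hash = 524288.0;  // Cryptonight main-loop iterations
    const double cores               = 21.0;      // 2 MB scratchpads fitting in a VU9P

    const double seconds_per_hash = ns_per_iteration * iterations_per_hash * 1e-9;
    std::printf("per core:   %.1f h/s\n", 1.0 / seconds_per_hash);    // ~19 h/s
    std::printf("whole chip: %.0f h/s\n", cores / seconds_per_hash);  // ~400 h/s
    return 0;
}
```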
I never said an ASIC is impossible for CryptonightR. It's not a permanent solution, rather just the PoW for the next 6 months. The point is to slow down ASICs enough to make R&D -> deployment -> mining unprofitable within a 6 month time frame. |
Thanks, good to know. I didn't have a chance to dig into the code and just saw the comment above about not having dynamic GPU kernels yet.
I fully agree that geekbench isn't a good reference, but then again neither are most server workloads. The big, power-hungry portions of CPUs like caches, reorder buffers, prefetchers, etc help a lot on server workloads but just waste power for PoW algorithms. I do think there is a 2-4x efficiency gain available from CPU architecture alone, without considering the efficiency gains related to memory. It's hard to find good benchmarks of an A53 or A55.
I hadn't fully worked out the details of an FPGA implementation. Thinking through more details, I'm now inclined to agree with you that a CPU-based FPGA probably would not have enough performance to be interesting. It might not be practical given existing tool chains, but the random sequence is the type of logic that FPGAs excel at - a math pipeline with little state and no external communication. In theory an FPGA could be structured where all the surrounding logic is fixed and just the random sequence is reconfigured every block. There's already precedent for doing rapid dynamic reconfiguration for algorithms like X16R. The hardest part would probably be setting up the CPU farm to compile the sequences and distribute them to a farm of FPGAs. (ProgPoW reduces the benefit of this type of FPGA acceleration by having a ton of state that needs to flow through the pipeline and having 1/3rd of the random ops be memory accesses)
Assuming no one has already created an ASIC that is a sea of low-power ARM CPUs, then I agree that CryptonightR should accomplish its goals as a stop-gap measure. |
They only excel at it because they can do it in a massively parallel way - i.e. high throughput but much higher latency compared to CPUs. The hard part of CryptonightR for FPGAs is not just the randomness, but the need for a 2 MB scratchpad for each independent core. Without massive parallelism, FPGAs suffer from the high latency of calculations, even if they solve the dynamic reconfiguration issues - it's not just a random sequence of 16 hashes, but a random sequence of 60-69 instructions with random source/destination registers. P.S. What was said above only applies to the on-chip memory approach; FPGAs with HBM will be doing fine with CryptonightR, just like GPUs. They will be ~2 times faster than Vega with HBM because Vega uses only ~50% of its memory bandwidth. |
Hi @SChernykh, I'm wondering if the following 3 ideas would help,
|
@MoneroChan The plan is to have the maximal possible program length from the beginning, so any further increase would slow down CPUs and ASICs similarly. ASICs will be able to handle any program length if it's needed. |
You don't seem to be on IRC anymore, so I'm posting here: we're considering using your changes here in the april fork (plus late tweak). |
@moneromooo-monero I'm still on IRC, I just don't login often. Any chat logs available? CN/R is mostly done, it will be tested in Wownero mainnet soon. I have a few ideas about possible final tweak, but I'll keep it in private for now. |
Let's discuss my approach to random math in Cryptonight. You can post test results in this thread: #2
Basic algorithm description:
Reference implementation (variant4 code in the following files):
The same implementation in Monero code base:
Optimized CPU miner:
Optimized GPU miner:
Pool software:
Test pools:
@moneromooo-monero this is what I was talking about for the next fork. The only difference with CNv2 is the part where it does math.
@tevador @hyc your comments, suggestions?