Cryptonight + random math discussion #1
That probably means that GPU kernels will need to be recompiled every block. Have you tested the impact on GPUs? |
Yes. If we choose block height as a seed for the random math, they can precompile in the background, so there will be no performance hit. It's a bit more tricky with the previous block hash, but if we take the hash from 2 blocks back, it can also be precompiled. P.S. I don't have a GPU implementation that is dynamic yet - I've only done static tests with different random sequences. But it shouldn't be hard to implement. |
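The background precompilation described above could look roughly like the following sketch. The `generate_source` and `compile_kernel` hooks are hypothetical placeholders (not xmrig APIs), and it assumes block heights advance one at a time; it only illustrates compiling the kernel for block N+1 while block N is being mined.

```cpp
#include <cstdint>
#include <future>
#include <string>

// Placeholders standing in for the real random-code generator and the OpenCL/CUDA compile step.
std::string generate_source(uint64_t height) { return "/* kernel for height " + std::to_string(height) + " */"; }
int         compile_kernel(const std::string& source) { return static_cast<int>(source.size()); }

struct KernelCache {
    std::future<int> pending;   // kernel for the *next* block, compiling in the background

    // Called once per new block: hand back the kernel prepared earlier and
    // immediately start compiling the one for height + 1.
    int on_new_block(uint64_t height)
    {
        // Fall back to a synchronous compile on the very first block.
        int kernel = pending.valid() ? pending.get()
                                     : compile_kernel(generate_source(height));
        pending = std::async(std::launch::async, [height] {
            return compile_kernel(generate_source(height + 1));
        });
        return kernel;
    }
};
```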
I think it's also safer to use block height as a seed because it also makes it possible to use auto-generated optimized code in daemon/wallet software. Updated readme.md:
|
Allowing precompilation also allows searching for easy nonces. |
There are no easy nonces - code changes only once per block and miners don't control it. |
Updated readme.md:
|
Profit-switching miners can still take advantage of it, especially if this tweak is adopted by multiple currencies. |
@tevador Hashrate shouldn't change between different blocks: the code generator doesn't just emit random code, it emulates the CPU's pipeline and produces just enough code to account for the required latency. This is the main idea, and it's already working very well - hashrate changes only 1-2% in my tests with different seeds, and I plan to fix the remaining differences. |
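To make the idea above concrete, here is a heavily simplified sketch of a latency-targeting generator. The per-opcode latencies and the single serial dependency chain are assumptions for illustration only; the real generator also models register dependencies and the number of parallel ALUs.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Assumed per-opcode latencies in cycles: MUL is slow, everything else takes 1 cycle.
struct OpInfo { const char* name; int latency; };
static const OpInfo kOps[] = {
    {"MUL", 3}, {"ADD", 1}, {"SUB", 1}, {"ROR", 1}, {"ROL", 1}, {"XOR", 1},
};

// Keep emitting random instructions until the modeled latency reaches a fixed
// target, so every generated program costs the CPU roughly the same number of
// cycles regardless of the seed.
std::vector<int> generate_program(uint64_t seed, int target_latency)
{
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<int> pick(0, 5);

    std::vector<int> program;
    int latency = 0;
    while (latency < target_latency) {
        const int op = pick(rng);
        program.push_back(op);
        latency += kOps[op].latency;   // simplification: one serial dependency chain
    }
    return program;
}
```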
Reference implementation (variant4 code in the following files):
The same implementation in Monero code base:
Optimized CPU miner:
Optimized GPU miner:
Pool software:
Test pools: |
The test pool is up and running; you can now test CryptonightR with xmrig/xmrig-amd (see links above). Update (2019-02-18): this test pool was for the original CryptonightR without tweaks; it's incompatible with the final version. |
Please test with different CPUs/GPUs and compare hashrate/power consumption with CryptonightV2. I need data for further CryptonightR tweaking. |
I've created a separate issue for test results: #2 |
Here's a dump of today's discussion in IRC:
|
@timolson , have you weighed in on this re: asic design? |
also, let's ping @dave-andersen .... hey, how's it going? It's your lovable friends over at monero again :) |
I left a brief review in the RandomX project... is CryptonightR a parallel effort or similar? I could give an hour or two if this is something different to look at, but my schedule's very tight through mid-to-late January. I'd be glad to do a deeper investigation after about 4 weeks from now. |
CryptonightR is a modification of Cryptonight, whereas RandomX is done completely from scratch. The main purpose of CryptonightR is to be the next PoW for Monero until RandomX is ready. The only difference from CNv2 is the integer math part - it's randomly generated now, just like in ProgPoW. It's also supposed to be more computationally expensive (+ higher computation latency) than the div+sqrt in CNv2. |
An example of generated random math: https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl |
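For readers who don't open that file: a generated sequence is a short, branch-free mix of MUL/ADD/SUB/ROR/ROL/XOR over a handful of 32-bit registers. The snippet below is a hand-written illustration of the shape of such code, not an excerpt from random_math.inl.

```cpp
#include <cstdint>

// Illustrative only - the real generated sequences are 60-69 instructions long and use more registers.
static inline void random_math_example(uint32_t& r0, uint32_t& r1, uint32_t& r2, uint32_t& r3)
{
    r0 += r1;
    r2 ^= r3;
    r1  = (r1 >> (r0 & 31)) | (r1 << ((32 - (r0 & 31)) & 31));  // ROR of r1 by r0
    r3 *= r2;
    r0 ^= r2;
    r2 += r3;
}
```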
I’ll assume you care about modifying existing ASICs, not ASIC-from-scratch, which is a different question.

Time for a reveal: when we heard you were going to tweak the PoW, we added a bespoke VLIW processor to our design which could intercept various points along the CryptoNight datapath, perform programmable calculations, then inject the result back into the standard CryptoNight datapath. It potentially (subject to some details I don’t have time to look up) could have handled this new random math without changing our silicon at all! The VLIW could do shift/rotate/xor, then a “primary” operation like add or divide, then a trailing set of bitops before injection. In effect, any bitops would have been “free” because of the VLIW design. And yes, there was enough register space. Not sure if our instruction buffer was long enough... I think so.

What it couldn’t do is generate the random code itself. It needed to be programmed ahead of time. We could probably have used our SoC controller to do the program generation, but programming the chip involves extra IO which could be slow. Also, each VLIW instruction cost one cycle (three for division). The length of your programs would have absolutely crushed our speed, since our inner loop was only 4-6 cycles.

Anyone with a Monero ASIC from before the announcement of the tweak threat would not have such a coprocessor, but someone developing an ASIC after your tweaking announcement could very well have planned like we did for the tweaks. However, v2 added new datapaths to memory, which was a real blow. An ASIC designer would have needed to go through a completely new physical layout for v2, and also have added some kind of coprocessor like we did in anticipation of future tweaks. I find this unlikely, but possible.

If you are changing the program every nonce, it would be a big problem for such a coprocessor design, since it would need to be reprogrammed every nonce. If you are changing the program slowly like ProgPoW, your random math may not be safe. Slow-changing programs also open the door to FPGA implementations. In terms of an ECO on a chip that doesn’t have a coprocessor, it is probably too much to change, even if they left plenty of room. I would recommend one or more of the following:
Overall, I think the threat of modifying an existing ASIC to handle this tweak is low, and if it did handle the tweak, the chip would be slow. However, you may consider the changes above out of an abundance of caution. If our chip had gone to production, and we had redone layouts to keep up with v2, I would be giggling about this tweak. I think we could have handled it with maybe zero changes, but only because we had this on-chip VLIW coprocessor in anticipation of your tweaks. This was a super-quick review and I could be missing something really important in your design. LMK if that’s the case and I’ll jump back into the convo. |
1 - This will kill GPU performance too, so it's not an option for now.
2 - Possible for further tweaking; it's not hard to add something new to the current algorithm design as long as it's a single x86 instruction and fast enough on GPUs.
3 - I tried a lot of things, but couldn't find a "memory path" tweak that doesn't hurt CPU performance (yet).
4 - They're already as long as they can be without slowing down CPUs. They could maybe be a bit longer if they had fewer MUL and more simple ADD/ROTATE/XOR operations, but that would also change the current performance ratio between CPU and GPU.
That's the whole point of this tweak, explicitly mentioned in readme:
It makes a single ASIC core limited by both memory access and the computation part - whichever is slower - hopefully making a single ASIC core no faster, or even slower, than a single CPU core. The performance/power ratio is still much better for an ASIC, but RandomX will get there eventually.
Zero changes, but still with a huge drop in performance? |
I like #2 best if you can find the right operation. v1 was annoying because
we didn’t plan for sqrt. Add/mul/bitops are too predictable.
|
That's why RandomX uses a lot of floating point math. If you want a good FPU in your chip, you will probably need to license some IP core. |
It was not a matter of licensing... Synopsys will synthesize IEEE floats
just fine out of the box, and licensing something else isn’t a problem. It
was simply a matter of choice: we didn’t think floats were worth the die
area, since it’s unusual to use floats in a PoW.
|
Yep. I didn’t say RandomX wasn’t effective against tweaked ASICs... just pointing out some ways to make it stronger. Just curious, have you seen any evidence of ASICs returning after v1/v2 tweaks? PM me if you’d like to keep it quiet. Also let me say: WhatsMiner is kicking BitMain’s butt, and BitMain is rumored to be divesting their mining operations, with Jihan Wu being demoted as well. The whole “ASICs are monopolized and scary” argument is certainly weakening. |
v1 ASICs - yes, including CN/xtl variant. All CNv1 coins are 3-4 times less profitable now per 1 KH/s. CNv2 and CN/heavy coins - no signs of ASICs. |
Excellent, that would align with expectations. v2 was the real killer with new datapaths to memory. I’d say your risk of ASICs existing that have both a coprocessor and v2 capability is very low. RandomX should be fine and is probably even overkill at this point. Oh, just to be sure... the new math is in the inner loop near the AES and MUL, right? Not in the initialization or finalization loops? |
It replaces div+sqrt, so it's in the inner loop. |
Testing showed that hashrate fluctuations on CPUs are bigger than we want, especially on AMD Bulldozer, so the code generator will be revised. |
A small concern is that there is no native bit rotation instruction on GPUs. Other operations look good. They may achieve ~0.5 instructions per clock on NVIDIA GPUs. In fact, a carefully designed ASIC could still outperform a GPU by spending more resources/area on the bottlenecks. The memory bandwidth can be greatly improved using more, smaller DRAM partitions and parallel memory controllers with address interleaving. The random math cannot utilize the GPU’s floating-point ALUs, tensor cores and certain on-chip memory, which occupy much more area than the tiny integer ALUs. An ASIC implementation could just build more simplified integer ALUs and multi-bank register files with a very simple decoder for better thread-level parallelism. It is also possible to achieve chained operations with a reconfigurable ALU array. |
@SChernykh CUDA PTX is not the native instruction set. It is good to see how this instruction is translated to native code on different architectures using the CUDA binary utilities. I guess we probably need to call a special intrinsic function to generate the desired instruction. Here is a table of instruction throughput |
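For reference, a portable 32-bit rotate usually looks like the snippet below in C/C++. Whether it lowers to a single native rotate, a funnel shift, or a shift/shift/OR sequence depends on the target architecture, which is exactly why inspecting the generated machine code (e.g. with cuobjdump, as suggested above) is the right way to check.

```cpp
#include <cstdint>

// Portable rotate-left. x86 compilers turn this into a single ROL; a GPU without a
// native rotate instruction emits two shifts and an OR (or a funnel shift where available).
static inline uint32_t rotl32(uint32_t x, uint32_t c)
{
    c &= 31;  // masking keeps both shift counts in [0, 31] and avoids undefined behavior
    return (x << c) | (x >> ((32 - c) & 31));
}
```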
@timolson do you think ASICs will be less than 2x more expensive than mass-market GPUs? |
Given the same process, there is no way a general purpose processor could keep an ASIC's advantage under ~2x in terms of perf/$ or perf/watt, unless we make the program do every possible task that the processor is designed for, such as graphics and AI. It seems random math is aiming in the right direction but is far from the goal. Even some AI chips can beat GPUs by 10x in terms of power efficiency. |
@ga10b This is a temporary solution for the next 6 months, of course - RandomX is much more advanced, check tevador's repository. NVIDIA GPUs don't have performance issues with cn/r, so I think rotation instructions are natively supported. |
@ga10b Can you have a look this? https://github.com/tevador/RandomX The documentation is a bit outdated (memory access is being reworked at the moment), but it should be enough for a brief review. |
@ga10b I use
|
I've updated the random code generator. The first version had a number of issues (high hashrate fluctuations on some CPUs) and also one small coding error that resulted in a lower minimal theoretical latency (parameter
Testnet pool should be up and running with the updated version next week (maybe even this weekend). |
Hi @SChernykh, just wondering re: timolson's 2 suggestions.
What if we used both 1) and 4) at varying levels, as 'control levers' to adjust the CPU/GPU performance ratio, and at the same time reduce FPGA and ASIC performance? I'm thinking that if it hurts FPGAs and ASICs more than CPUs/GPUs, it may outweigh the drop in performance. Thanks, |
@MoneroChan There is no way to estimate this; I need to actually implement it to see the impact on CPU/GPU. GPUs will have to either use something very close to the reference code, with a switch-case for every instruction, so they'll be hit very hard by code divergence, or they'll have to run all 6 possible instructions at every step and save/load results to/from LDS, which will require ~18 times more GPU instructions to execute. Either way, GPUs will be hit hard if the random program changes every nonce. |
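A rough sketch of the first option (a switch-based interpreter) is shown below. The instruction encoding is hypothetical and only loosely modeled on the six variant4 operations; the point is that when every nonce has its own program, neighboring GPU work items take different branches through the switch and the warp/wavefront serializes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical encoding: opcode, destination/source register indices, immediate used by ADD.
enum class Op : uint8_t { Mul, Add, Sub, Ror, Rol, Xor };

struct Insn {
    Op       op;
    uint8_t  dst;
    uint8_t  src;
    uint32_t imm;
};

static inline uint32_t rotr32(uint32_t x, uint32_t c) { c &= 31; return (x >> c) | (x << ((32 - c) & 31)); }
static inline uint32_t rotl32(uint32_t x, uint32_t c) { c &= 31; return (x << c) | (x >> ((32 - c) & 31)); }

// Straightforward interpreter over a small register file. Fine on a CPU; on a GPU with
// per-nonce programs every thread walks a different instruction stream, so the branches diverge.
void execute(const Insn* code, size_t n, uint32_t r[9])
{
    for (size_t i = 0; i < n; ++i) {
        const Insn& ins = code[i];
        switch (ins.op) {
        case Op::Mul: r[ins.dst] *= r[ins.src];                     break;
        case Op::Add: r[ins.dst] += r[ins.src] + ins.imm;           break;
        case Op::Sub: r[ins.dst] -= r[ins.src];                     break;
        case Op::Ror: r[ins.dst]  = rotr32(r[ins.dst], r[ins.src]); break;
        case Op::Rol: r[ins.dst]  = rotl32(r[ins.dst], r[ins.src]); break;
        case Op::Xor: r[ins.dst] ^= r[ins.src];                     break;
        }
    }
}
```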
Thanks @SChernykh. It looks like the situation is suddenly changing very quickly for the worse. Based on the ongoing reddit discussion, FPGAs or ASICs with FPGA controllers are strongly suspected. Your offer of an 18x slower but still competitive GPU with per-nonce changes is starting to look like a better option than having no GPUs left mining at all. Any thoughts? :-/ |
@MoneroChan GPUs won't get 18x slower, they still have plenty of computing power available. They will get 3-4, maybe 5 times slower, but the problem is that an ASIC won't get slower if the program changes every nonce instead of every 2 minutes - the program generator is simple enough to integrate into the ASIC pipeline. GPUs will get much slower, but future ASICs won't, so the network as a whole will be more vulnerable. |
Sooner or later, we will have to choose a CPU-only or a GPU-only PoW. Keeping both brings too many limitations in fighting ASICs. |
My thoughts... and just thoughts, from a WOW "CPU" miner's perspective. I see little difference in the comparison of GPU to ASIC or FPGA. With that said, and being fairly new at mining (since summer of 2018): I have mined cn/2 since Sept 2018 and was not aware that people could rent hash power in such magnitude. So I did a little research this morning and looked at what was available to rent for cn/2 hash power... at any given moment people can rent up to 20 MH/s of cn/2 with a few clicks of a mouse.

Being a small miner and an advocate of PoW and the crypto ecosystem, a question came to mind. Small miners have less chance of getting rewards than large miners, small pools have less chance of getting rewards than large pools, small miners on small pools vs large miners on small pools, and so on. The larger your budget, the larger your hash. So my mind pondered a few questions: how does PoW have any fairness scale at all? How can it grow an ecosystem? How do you get people involved and keep them involved in the current infrastructure? I know PoW spends a fair amount of time on dev because I see some of it on Git.

I come to the conclusion that you would only feel this way because you are invested in GPUs. Now as a CPU advocate, if purpose-built hardware becomes available that is economical and faster than GPUs, I would buy it or rent it. Further review and study would have to be made to create or evolve a different environment. Are we really just creating or supporting an existing GPU environment? I will continue to mine WOW with CPUs. ASIC and FPGA resistance seems to be a state of mind. I program in Verilog. All GPU miners have a CPU. |
I agree with @tevador, and we can strategically use the 'math' for the next hardfork to slowly start shifting the performance ratio to our target hardware (CPU or GPU) to allow a 'smooth gradual transition'; people will be less likely to complain, so the earlier the better. Personally, I'm invested in decentralization; either CPU or GPU is fine by me. |
The program generator must run on the GPU, to avoid the compile/download overhead per nonce. I've stated my opinion before though - our primary focus should be CPU; GPU optimization can come later if someone feels compelled to do it. |
I was just pointed at this thread. Most of the above quoted response doesn't apply to ProgPoW. I've left a response in relation to ProgPoW here:
Why? We plan to decrease the ProgPoW period to change the random program about every 2 minutes, similar to the Monero block time. It only takes ~1 second to compile a kernel and it won't be hard to have the CPU compile GPU program N+1 while program N is executing. There's no overhead on having the miner call a different program when the block switches.

Btw feel free to use our miner as a reference for how to do on-line compilation of randomly generated programs, for both OpenCL and CUDA. Note that to on-line compile CUDA programs you'll need to distribute nvrtc. See this issue for details: https://github.com/AndreaLanfranchi/ethminer/issues/39

Now a quick review of CryptonightR. The high-level idea is fairly similar to ProgPoW in that a random sequence is generated every few minutes and compiled offline. This random sequence will almost certainly brick, or drastically reduce performance on, any existing or late-production ASIC. However, it does not fundamentally prevent ASICs from being manufactured. As Tim pointed out, ASIC manufacturers are building more flexibility into their ASICs so they can handle tweaks exactly like these. I would not be surprised if in the near future an ASIC is produced that is simply a bunch of ARM A53s, or a similar highly efficient ARM core, attached to a large pool of on-die memory. Adapting this system for a new algorithm variant would be as easy as recompiling some new miner software.

To get an idea of the gains possible from this, consider that an iPhone A12 scores around 11,000 multi-core Geekbench while consuming <10 watts (it's hard to find precise power numbers). An AMD Threadripper 2950X scores 35,000 but consumes 180 watts. That's a 3x perf difference but a 20x power difference, making the ARM 6x as efficient.

On top of the CPU core efficiency differences, there are 2 fundamental points that make Cryptonight (all variants) attractive to ASICs:
Given the above I would not be surprised if an FPGA bitstream was developed fairly quickly. There'd be one small part to generate the random sequence for the current nonce. The rest of the logic would be simple in-order CPU cores that would have the AES and general math needed to execute the random sequence. The overall performance on a VU9P should be around that of a 16-core x86 CPU (since 16 2 MB buffers could fit) at significantly lower power. I think any algorithm that is designed to be "CPU-friendly" can have at least a 2x, and sometimes a 10x, ASIC made for it, since a sea of low-power ARM cores could run it efficiently. All that said, I think CryptonightR is an improvement over CryptonightV2, so there's no reason not to go to it. However, it's not an end-point. I'll also do a review of RandomX in the near future. I think having a random program per-nonce actually makes the algorithm more ASIC-friendly, but I need some time to work out the details. |
I think you misunderstood something. We've had it since the beginning, including precompilation for block N+1 while block N is running. Feel free to use our xmrig CryptonightR branch as a reference for how to do on-line compilation of randomly generated programs, for both OpenCL and CUDA. |
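For context, the NVRTC flow that both comments refer to boils down to something like the minimal sketch below (error handling omitted; every nvrtcResult should be checked in real code). The resulting PTX string would then be loaded through the CUDA driver API, e.g. cuModuleLoadData.

```cpp
#include <nvrtc.h>
#include <string>
#include <vector>

// Compile generated CUDA source to PTX at runtime.
std::string compile_to_ptx(const std::string& source)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source.c_str(), "random_program.cu", 0, nullptr, nullptr);

    const char* opts[] = { "--gpu-architecture=compute_50" };   // example target, pick per device
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());

    nvrtcDestroyProgram(&prog);
    return std::string(ptx.begin(), ptx.end());
}
```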
Geekbench != good reference. It's not 100% parallelizable unlike PoW in general. It's better to compare server ARM processors like Cavium ThunderX with their x86 counterparts in 100% parallelizable server tasks. They are not much more efficient. |
The size of the generated random programs - 60-69 instructions, taking at least 45 cycles to execute (given 3 cycles for MUL, 1 cycle for the rest, and unlimited parallel ALUs) - simply makes each FPGA/ASIC core with on-chip memory very slow. In my tests, a 4 GHz Ryzen can't execute a random math iteration faster than 12.5 ns (without the rest of the Cryptonight loop). An ASIC would have to run at a similar speed to be as fast per core, while an FPGA just can't run that fast. An FPGA would take at least 100 ns to execute a random math iteration, which equals ~19 h/s per core or ~400 h/s for the whole VU9P chip (21 cores can fit). |
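A quick check of the arithmetic quoted above, taking 524,288 main-loop iterations per Cryptonight hash and the 100 ns and 21-core figures from the comment:

```cpp
#include <cstdio>

int main()
{
    const double ns_per_iteration    = 100.0;     // FPGA latency per random-math iteration
    const double iterations_per_hash = 524288.0;  // Cryptonight main-loop iterations
    const double cores               = 21.0;      // 2 MB scratchpads fitting in a VU9P

    const double seconds_per_hash = ns_per_iteration * iterations_per_hash * 1e-9;
    std::printf("per core:   %.1f h/s\n", 1.0 / seconds_per_hash);    // ~19 h/s
    std::printf("whole chip: %.0f h/s\n", cores / seconds_per_hash);  // ~400 h/s
    return 0;
}
```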
I never said an ASIC is impossible for CryptonightR. It's not a permanent solution, rather just the PoW for the next 6 months. The point is to slow down ASICs enough to make R&D -> deployment -> mining unprofitable within a 6 month time frame. |
Thanks, good to know. I didn't have a chance to dig into the code and just saw the comment above about not having dynamic GPU kernels yet.
I fully agree that geekbench isn't a good reference, but then again neither are most server workloads. The big, power-hungry portions of CPUs like caches, reorder buffers, prefetchers, etc help a lot on server workloads but just waste power for PoW algorithms. I do think there is a 2-4x efficiency gain available from CPU architecture alone, without considering the efficiency gains related to memory. It's hard to find good benchmarks of an A53 or A55.
I hadn't fully worked out the details of an FPGA implementation. Thinking through more details, I'm now inclined to agree with you that a CPU-based FPGA probably would not have enough performance to be interesting. It might not be practical given existing tool chains, but the random sequence is the type of logic that FPGAs excel at - a math pipeline with little state and no external communication. In theory an FPGA could be structured where all the surrounding logic is fixed and just the random sequence is reconfigured every block. There's already precedent for doing rapid dynamic reconfiguration for algorithms like X16R. The hardest part would probably be setting up the CPU farm to compile the sequences and distribute them to a farm of FPGAs. (ProgPoW reduces the benefit of this type of FPGA acceleration by having a ton of state that needs to flow through the pipeline and having 1/3rd of the random ops be memory accesses)
Assuming no one has already created an ASIC that is a sea of low-power ARM CPUs, then I agree that CryptonightR should accomplish its goals as a stop-gap measure. |
They only excel at it because they can do it in a massively parallel way - i.e. high throughput but much higher latency compared to CPUs. The hard part of CryptonightR for FPGAs is not just the randomness, but the need for a 2 MB scratchpad for each independent core. Without massive parallelism, FPGAs suffer from the high latency of calculations, even if they solve the dynamic reconfiguration issues - it's not just a random sequence of 16 hashes, but a random sequence of 60-69 instructions with random source/destination registers. P.S. What was said above only applies to the on-chip memory approach; FPGAs with HBM will be doing fine with CryptonightR, just like GPUs. They will be ~2 times faster than Vega with HBM because Vega uses only ~50% of its memory bandwidth. |
Hi @SChernykh, I'm wondering if the following 3 ideas would help,
|
@MoneroChan The plan is to have the maximal possible program length from the beginning, so any further increase would slow down CPUs and ASICs similarly. ASICs will be able to handle any program length if it's needed. |
You don't seem to be on IRC anymore, so I'm posting here: we're considering using your changes here in the april fork (plus late tweak). |
@moneromooo-monero I'm still on IRC, I just don't login often. Any chat logs available? CN/R is mostly done, it will be tested in Wownero mainnet soon. I have a few ideas about possible final tweak, but I'll keep it in private for now. |
Let's discuss my approach to random math in Cryptonight. You can post test results in this thread: #2
Basic algorithm description:
Reference implementation (variant4 code in the following files):
The same implementation in Monero code base:
Optimized CPU miner:
Optimized GPU miner:
Pool software:
Test pools:
@moneromooo-monero this is what I was talking about for the next fork. The only difference with CNv2 is the part where it does math.
@tevador @hyc your comments, suggestions?