Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProgPow ASIC possibilities evaluated by 2 experts #24

Open
MoneroCrusher opened this issue Jan 19, 2019 · 45 comments
Open

ProgPow ASIC possibilities evaluated by 2 experts #24

MoneroCrusher opened this issue Jan 19, 2019 · 45 comments

Comments

@MoneroCrusher
Copy link

SChernykh/CryptonightR#1 (comment)

In fact, a carefully designed ASIC could still outperform GPU by spending more resource/area on the bottlenecks. The memory bandwidth can be greatly improved using more smaller DRAM partitions and parallel memory controllers with address interleaving. The random math cannot utilize GPU’s float point ALUs, tensor cores and certain on chip memory, which occupies much more area than the tiny integer ALUs. An ASIC implementation could just build more simplified integer ALUs, multi-bank RFs with a very simple decoder for better TLP. It is also possible to achieve chained operations with reconfigurable ALU-array.

What's IfDefElse's take on this?

@jean-m-cyr
Copy link

No argument, but you understate the complexity of such an ASIC. What you are talking about is a set of integer math or DAG accessing logic blocks with programmable sources and destination. These blocks would need to be programmatically sequenced to form the necessary computational chain. When I say programmatically, I don't mean it in terms of software control, but rather in terms of digital logic block inputs.

ASICs achieve efficiency by maximizing logic block usage with a technique called pipelining. Imagine applying it to progpow. Each stage of the pipe will need to carry its own copy of the mix array, the sequencer state and the multiple transform block inputs. Then of course you need to manage multiple stages competing for dag access. Order of magnitude more complicated than an ethash ASIC!

@salanki
Copy link

salanki commented Jan 19, 2019

To be able to fully assess any specialized chip manufacturing, we need to know not only potential power efficiency improvement, but all of the following:

  • Development Cost
  • Development Time (part of cost)
  • Manufacturing cost
  • Performance per manufacturing or purchase $

The analysis in here, except for the DRAM statement (that I will let someone more knowledgeable than me answer) seems to be in line with the potential optimizations that IfDefElse have already outlined. We have seen different claims (10%, 20%, 50%, 800%) of potential efficiency improvement. What I have not seen from anyone is an analysis of the actual effort to build all of this.

@ifdefelse
Copy link
Owner

ifdefelse commented Jan 21, 2019

I had not looked into CryptonightR before. It looks promising but has some limitations. The basis is similar to ProgPoW in that a random sequence of math is generated every few minutes and compiled offline. This is unlike RandomX where every hash has a unique random sequence. RandomX is actually more ASIC-friendly because of the per-hash sequence, which requires online compilation that an ASIC could directly implement.

This comment I think is extremely good, showing that simple changes like Ethash 1a, which just change a constant, will not be effective as ASIC manufacturers can easily anticipate small tweaks like that and build flexibility into their designs:
SChernykh/CryptonightR#1 (comment)
https://eips.ethereum.org/EIPS/eip-1355

The quoted comment is given in the context of CryptonightR, not ProgPoW. Cryptonight does tiny accesses to DRAM, which are exceptionally inefficient for existing GPU memory controllers. Cryptonight V2 is better after it increased the access from 16 bytes to 64 bytes, but it's still wasting memory bandwidth. ProgPoW increases Ethash's 128 byte access to 256 bytes in order to be efficient on all existing GPU memory controllers.

ProgPoW makes use of a GPU's shared (aka local) memory while CryptonightR does not. Both CryptonightR and ProgPoW do not make use of floating point, tensor cores, L2 cache, graphics stuff (like textures), etc. However this underutilized portions of the GPU should not have any significant effect on power consumption. An ASIC that removed these portions would be have a smaller die size, but not otherwise be meaningfully more efficient.

I don't think a chained ALU array is practical to implement for either ProgPoW or CryptonightR. This is especially the case for ProgPoW where the mix state is 16 * 32 * 32-bits and about 1/3rd of the instructions need to access random locations in the cache.

@tevador
Copy link

tevador commented Jan 23, 2019

RandomX is actually more ASIC-friendly because of the per-hash sequence, which requires online compilation that an ASIC could directly implement

JIT compilation of RandomX code takes less than 0.5% of the time (~10 μs per hash). A ProgPoW ASIC can get a bigger efficiency gain than this just by having a native 32x32 multiplier [1].

The advantage of "online" compilation is that is forces an ASIC to dedicate die area to instruction decoders and dispatchers (which the CPU already has) rather than using a small FPGA to handle the slowly changing part of the algorithm.

@jean-m-cyr
Copy link

@tevador The notion of 'using a small FPGA to handle the slowly changing part of the algorithm' is not a workable solution. Compiling a bit stream to program an FPGA is an hour long process running on a high end system, for even the simplest algorithm. Furthermore, it requires the use of vendor specific proprietary tools that could not run on individual miner platforms.

@tevador
Copy link

tevador commented Feb 2, 2019

@jean-m-cyr I'm no expert in this field, but are you suggesting it's impossible to design a custom ASIC with hardwired operation blocks (such as multiplication, addition and popcnt) and only program a small SRAM block that specifies how operations are ordered and what their inputs are?

Based on the description of the ProgPoW loop, the 'bit stream' for ProgPoW would be roughly 400 bits [1] long, which is many orders of magnitude less than a typical bit stream for a big FPGA chip like Xilinx VCU 1525 (several megabytes).

[1] 12 cache loads with a random source and destination and one of 4 merge methods = 2257928402 * 412 total options. 20 math operations with a random source, destination, one of 11 operations and one of 4 merge methods = (32*31*4*11)20 total options. Estimated ~388 bits of entropy in total.

@jean-m-cyr
Copy link

jean-m-cyr commented Feb 2, 2019

400 bits sounds low given that the FPGA would need a RAM interface for DAG access (or at least a bus to the main ASIC) ... Nevertheless, how would you generate that bit stream on the fly? Using which tool set?

388 bits of entropy != 388 bit stream.

Not to mention that FPGAs are notoriously power inefficient.

@tevador
Copy link

tevador commented Feb 2, 2019

RAM interface for DAG access

RAM interface would be hardwired, not part of the FPGA.

Nevertheless, how would you generate that bit stream on the fly? Using which tool set?

Probably a custom toolset developed by the company that is designing the ASIC. Given the relative simplicity, it could run on the ARM controller which would precompile bit streams in advance.

388 bits of entropy != 388 bit stream

True, but I don't think the bit stream would need to be more than the order of ~1 kilobit.

@jean-m-cyr
Copy link

jean-m-cyr commented Feb 2, 2019

RAM interface would be hardwired, not part of the FPGA.

Even if the RAM ctlr. is on the main ASIC, the FPGA still requires high speed access to the DAG via the ASIC's mem controller. This implies a wide data bus that supports addressing between the FPGA and ASIC. Going off-chip with a high speed bus is expensive in terms of power.

Probably a custom toolset developed by the company that is designing the ASIC. Given the relative simplicity, it could run on the ARM controller which would precompile bit streams in advance.

FPGA tools contain highly proprietary algorithms that require large amounts of RAM and processing power. I think you underestimate the complexity involved with generating an FPGA bit stream.

True, but I don't think the bit stream would need to be more than the order of ~1 kilobit.

I doubt it! In your entropy calculation you speak of 11 arithmetic ops. and assign 4 bits of entropy. Each of those 11 blocks will require far more than 4 bits in the stream to specify and interconnect the required FPGA elements.

@tevador
Copy link

tevador commented Feb 2, 2019

This implies a wide data bus that supports addressing between the FPGA and ASIC.

'FPGA' and 'ASIC' here are not two separate chips. Everything is on a single die. Only the interconnects would use FPGA-style LUTs. The execution units themselves would be hardwired for maximum efficiency.

Each of those 11 blocks will require far more than 4 bits in the stream to specify and interconnect the required FPGA elements.

You have a pair of 32-bit datapaths, which can go into one of 11 execution units. I imagine you can use a multiplexer with 4 bits of SRAM to select the path to the execution engine. Am I wrong?

@ifdefelse
Copy link
Owner

are you suggesting it's impossible to design a custom ASIC with hardwired operation blocks (such as multiplication, addition and popcnt) and only program a small SRAM block that specifies how operations are ordered and what their inputs are?

You have a pair of 32-bit datapaths, which can go into one of 11 execution units. I imagine you can use a multiplexer with 4 bits of SRAM to select the path to the execution engine. Am I wrong?

You're essentially describing how classical CPUs, GPUs, DSPs, etc are designed. A handful of fixed function datapaths and then some muxes that route data between the register file, caches, and datapaths. Whether those muxes are controlled on-the-fly by instructions or semi-statically via a few bits of FPGA-style SRAM doesn't make a meaningful difference.

FPGAs gain significant efficiency when you can create a dataflow pipeline, where intermediate results are never stored but are directly consumed by the next step of the pipeline. This structure doesn't work for ProgPoW. ProgPoW requires a large amount of state (16 lanes * 32 words * 32 bits = 2 kilobyte) that needs to last across many cycles (at least 12 cycles for the 12 cache accesses). Setting up a pipeline to directly compute the random ProgPoW sequence would result in a massive duplication of compute resources (would need all 11 options available for every compute stage, even though only 1 is used) and repeating mix data that mostly doesn't change.

Once the mix data is stored in a register file any design is going to look a lot like a CPU's SIMD unit or a GPU.

@Sonia-Chen
Copy link

https://medium.com/@Linzhi/eip-1057-progpow-open-chip-design-for-only-1-cost-power-increase-eip-1057-progpow-d106d9baa6eb

// Random math between two input values
uint32_t Math(uint32_t a, uint32_t b, uint32_t r)
{
switch (r % 11)
{
case 0: return a + b;
case 1: return a * b;
case 2: return mul_hi(a, b);
case 3: return min(a, b);
case 4: return ROTL32(a, b);
case 5: return ROTR32(a, b);
case 6: return a & b;
case 7: return a | b;
case 8: return a ^ b;
case 9: return clz(a) + clz(b);
case 10: return popcount(a) + popcount(b);
}
}

*) modulo 11 operation, this is quite small logic, ~400gate, 1ck latency, can be a parallel process during mix read so we can hide this latency.
*) 32-bit Add, simple logic, ~300gate for a fast one.
*) 32-bit Multiplier, mature IP, ~20Kgate for a fast one, since multiplier only have ~4/11 activity rate, we can use a two cycle multiplier to half the area, small possibility to increase delay.
*) Rotation operation can easily map to a multiplier, for example I want to calculate ROTL(0x12345678, 8), I can do 0x12345678 * 0x00000100 = 0x0000001234567800, then we just need to OR higher word and lower word together to get 0x34567812. so just cost ~160gate extra logic
*) logic operation, A&B only cost 32 gate, A|B 32 gate, A^B 96 gate, it looks like three different instructions but actually extremely small on silicon (<30um²)
*) clz and popcount are also very small
*) We only need a multiplexer to select output.
*) Total size of Math() is about 0.0015mm² on a TSMC16ULP process.
*) Merge() is similar but even smaller, only shifter, adder, and tiny logic (no multipliers because constant multiply can be mapped into adder).
*) Size of Merge() is roughly ~0.0005mm².

4-stage pipeline
CK1: Mix regfile read, two operators in parallel (calculate %11 at same time)
CK2: Math()
CK3: Merge()
CK4: Mix write back to regfile

Task latency is 4 cycles, but we can have 4 independent threads so we can make this pipeline fully loaded.
Mix Regfile we use a 1W2R (1 write port, 2 read port) Regfile, mature IP, 8KB (for 4 threads), single cycle read/write, 1GHz operation.
If you want load/unload mix data on the fly without disturbing the pipeline, we can upgrade to a 12KB 2W3R Regfile. We then have extra read/write ports for load/unload, and 12KB is enough for 6 tasks (4 running, 1 loading, 1 unloading).
An ASIC can implement 10K sets of that block:

*) running at 1GHz frequency
*) 0.55V voltage (typical voltage for TSMC16ULP)
*) generating ~10T Math() + Merge() throughput per second
*) Power estimate roughly 3mW each pipeline, 30W in total.
*) No customized circuit/layout
*) All standard cells, auto placement route
*) Use mature IP only
*) No aggressive overclocking
*) No aggressive under voltage
*) ~8.4K Merge() and ~4.8K Math() per hash
*) pipeline includes 1 Merge() and 1 Math()
*) pipeline is only the logic part, not the memory part
*) the xGB memory are not included
*) we have 10T throughput on single chip asic, divided by 8.4K Merge() per hash, means 1.2GHash
Performance
*) increases die cost/power by about 1%, and causes a die increase of <1mm²

The design (logic part only) can provide 1.2 GHash ProgPoW performance at 30W power.
Nvidia GTX1070Ti can provide 15.7 MHash at 115W, but including memory.

@solardiz
Copy link
Contributor

@Sonia-Chen Interesting.

How many instances of the 8 KB (or 12 KB) register file would you need for "10K sets of that block"? Wouldn't that be significant die area to include in your estimates and conclusions, which doesn't appear to be included now?

The modulo operations in Math and Merge are compile-time. While you could be generating the inputs to them on the fly, I doubt you'd want to - you'd more likely have a pre-compiled ProgPoW program uploaded to your ASIC from host, like it's done when ProgPoW is run on GPUs.

@salanki
Copy link

salanki commented Mar 29, 2019

Thank you for this technical argument, greatly appreciated. It’s above my head, but on the first glance there are two things that stand out to me:

  • The kiss99 computations are missing.
  • You cannot compare the performance of a “PoW loosely based on ProgPoW without DAG access” to ProgPoW performance. The performance comparisons you state with a 1070Ti are completely invalid.

@solardiz
Copy link
Contributor

@Sonia-Chen Also, even without register files, ignoring the external bandwidth needs, and blindly trusting your numbers, they don't add up: (0.0015+0.0005)*10000 = 20 mm^2, but you claim under 1 mm^2 for "10K sets of that block". You also seem to claim that 30W would be dissipated in that 1 mm^2, ouch. Am I misreading what you meant or is the rest of what you wrote just as unreliable? ;-(

I don't mean to defend ProgPoW here. I actually think ASICs will have significant advantage at it. But we need realistic analyses, and this one looks bogus at least at a high level.

@tevador
Copy link

tevador commented Mar 29, 2019

@Sonia-Chen We have checked the numbers and it seems they are off by a factor of 4.

There are ~33K calls to Merge() and ~18K calls to Math() per hash in ProgPoW 0.9.3.

So the 30W logic can do about 300 MH/s. However, this is still at least 20x higher compute efficiency than a GPU.

@solardiz
Copy link
Contributor

Oh, maybe I get what was meant by "die increase of <1mm²" - this is probably scaled back from this ASIC's to GPU compute performance at this algorithm. A weird metric, but at least that would make some sense.

@MoneroCrusher
Copy link
Author

MoneroCrusher commented Mar 29, 2019

Very concerning to say the least. If this is true then ProgPoW is instantly invalidated and Ethereum is better off by keeping Ethash and maybe applying a smaller tweak every year to make ASICs more flexible and reduce the gap that way via increased cost on the ASIC side (until the switch to PoS).

@solardiz
Copy link
Contributor

@MoneroCrusher I'm not sure you're reading this right. What it says is that the advantage of ProgPoW over Ethash is tiny, not that ProgPoW is weaker than Ethash wrt ASICs. Either needs lots of memory bandwidth. It's just that ProgPoW's added limited programmability and computation is probably relatively lightweight when implemented in an ASIC. Perhaps not to the extent of it adding only 1% as claimed here (if I understand this right, now), but nevertheless adding not a lot.

@tevador
Copy link

tevador commented Mar 29, 2019

@solardiz If this ASIC estimate is correct, then ProgPoW is actually worse for GPUs than for ASICs. For example, RX480 consumes +30W when mining ProgPoW compared to Ethash. ASIC will only use about +1 W for the same hashrate compared to Ethash.

@Sonia-Chen
Copy link

@solardiz @MoneroCrusher that's right this is only about the advantage. We are not trying to trash ProgPoW, we are trying to study it. The post was in response to an earlier comment about pipelines, and we are focusing on the random program.
What's hard to compare is the power consumption difference between asic logic and gpu logic.
I will check on some of those mistakes or unclear descriptions pointed out by @solardiz @tevador and @salanki , needs a little time though. Thanks a lot!

@Sonia-Chen
Copy link

@MoneroCrusher I'm not sure you're reading this right. What it says is that the advantage of ProgPoW over Ethash is tiny, not that ProgPoW is weaker than Ethash wrt ASICs. Either needs lots of memory bandwidth. It's just that ProgPoW's added limited programmability and computation is probably relatively lightweight when implemented in an ASIC. Perhaps not to the extent of it adding only 1% as claimed here (if I understand this right, now), but nevertheless adding not a lot.

That is exactly what it was trying to say. I want to look into the feedback wrt Regfile, 4x and 1% calculation though, this can all be improved. Thanks again!

@MoneroCrusher
Copy link
Author

Yes, it adds little overhead to ASICs in comparison to GPUs, hence Ethash is actually better for GPU miners if this is all true.

@solardiz
Copy link
Contributor

@tevador @MoneroCrusher You're right. If we factor in the relative power consumption, then yes, ProgPoW's programmability might not look attractive overall. ProgPoW's more efficient external bandwidth usage (with its 256 rather than 128 byte blocks) might fare better.

Thanks @Sonia-Chen. I'll stay tuned. I also have a suggestion: rather than mention virtual/unrealistic compute-only hashrates, pick an external memory bandwidth that is realistic to have in an ASIC and estimate that realistic ASIC's parameters. Compare it to an Ethash ASIC that has as much bandwidth (or is normalized by some other parameter). Compare both against GPUs' hashrates and power consumption at ProgPoW and Ethash, respectively. Yes, this is much more work, maybe more than you volunteered for - but it'd be appreciated.

BTW, I put together and use/hack this C implementation for my experiments - https://github.com/solardiz/c-progpow - you might find it useful too, such as to count how many times some computation is done, etc.

@solardiz
Copy link
Contributor

solardiz commented Apr 1, 2019

@Sonia-Chen As I now understand, your company has designed a very impressive Ethash ASIC:

Ethash Miner Announcement, ETC Summit Seoul, September 2018
Specs: Ethash, 1400 MH/s, 1000 Watts, price commitment 4-6 months ROI.
Schedule: 12/2018 TapeOut, 04/2019 Samples, 06/2019 Mass Production.

(BTW, this greatly exceeds ProgPoW designers' expectation that only a ~2x improvement over GPUs would be possible for Ethash.)

I understand you might not want to reveal too much detail prematurely as this is a competitive market, but it'd help the community to know that ASIC unit's total off-die memory bandwidth and across how many ASIC chips it's split. For GPU-like bandwidth per chip, I'm guessing it could be e.g. 50x 10W chips (leaving 500W for the memory), but that's nothing more than a guess to illustrate the question.

It'd be interesting if you managed to avoid the external bandwidth needs by producing everything on a DRAM process. This feels plausible.

It'd be even more interesting if you do use off-die memory, but you managed to reduce the external bandwidth needs per hash computed, which might be possible through doing a huge number of concurrent computations and (un)stalling groups of them in a way that lets you reuse fetches from external memory into on-die cache to advance an entire group of those concurrent computations per fetch. I suggested this in 2014 as a way to turn random access into semi-sequential access (for making use of more than one item per an otherwise unnecessarily wide memory fetch) with sequential memory-hard functions that use tiny lookups, but I think it also applies to parallelizable memory-hard functions such as Ethash (for making use of a fetch more than once). The main trade-off is in needing to maintain internal state for those many concurrent computations, which can become very large and would require plenty of on-die memory. Another trade-off is between the computations state memory and external fetches cache memory. The question is then whether there exists an optimal trade-off point given target hash type, its parameters, and current technology - and what it is.

Once you reveal some properties (such as external bandwidth, die area, hashrate, power) of one ASIC chip in that Ethash unit, it'd be easier for us all to see how much ProgPoW might add to that and whether that's significant added cost or not.

Thanks in advance! ;-)

@solardiz
Copy link
Contributor

solardiz commented Apr 1, 2019

@tevador wrote:

There are ~33K calls to Merge() and ~18K calls to Math() per hash in ProgPoW 0.9.3.

I am getting 36864 and 20480. Block 30k:

Digest = 11f19805c58ab46610ff9c719dcf0a5f18fa2f1605798eef770c47219274767d
Merge 36864 total (8192 12288 5120 11264)
Math  20480 total (2048 4096 2048 1024 2048 1024 4096 1024 1024 2048 0)

Block 10M:

Digest = b2403f56c426177856eaf0eedd707c86ae78a432b9169c3689a67058fcf2a848
Merge 36864 total (9216 7168 10240 10240)
Math  20480 total (5120 2048 2048 2048 1024 0 1024 2048 1024 3072 1024)

BTW, quite often a certain math operation is left entirely unused. I think this is OK'ish since it'd be needed again 50 blocks later. But (in a tweak) we'd probably want to add code generation constraints to ensure we're never below a certain ratio of MULs.

@tevador
Copy link

tevador commented Apr 1, 2019

@solardiz I think your numbers are for ProgPoW 0.9.2 which has a higher number of math operations. See the table here: https://github.com/ifdefelse/ProgPOW#progpow-algorithm-walkthrough

@salanki
Copy link

salanki commented Apr 1, 2019

0.9.3 also has a shorter ProgPoW period of 10 blocks.

@solardiz
Copy link
Contributor

solardiz commented Apr 1, 2019

@tevador Oh, you're right. But this means the test vectors we have in here are also for 0.9.2 only. So I wouldn't be able to easily test my code for computing 0.9.3 correctly just yet. Correct?

@salanki
Copy link

salanki commented Apr 1, 2019

Test vectors for 0.9.3 are yet to be produced AFAIK.

@solardiz
Copy link
Contributor

solardiz commented Apr 3, 2019

@tevador @salanki Thanks. Here's what I am getting with my c-progpow when I substituted the parameter changes for 0.9.3:

ProgPoW version 0.9.3
Block   30000
Digest  6018c151b0f9895ebe44a4ca6ce2829e5ba6ae1a68a4ccd05a67ac01219655c1
Merge   33792 total (8192 7168 10240 8192)
Math    18432 total (0 2048 0 5120 5120 0 0 0 4096 0 2048)

Block   10000000
Digest  3cb394b257046429e7a18528f2d1bc64e3b712031534ecb1f60f5c6d61fd60ca
Merge   33792 total (7168 10240 8192 8192)
Math    18432 total (1024 1024 3072 2048 2048 1024 0 2048 0 1024 5120)

I also gave some further thought to how @Sonia-Chen's company might have managed that hashrate (and thus likely corresponding memory bandwidth) at Ethash at that power level, and I wouldn't be surprised if it's through use of SRAM, e.g. as 4x1 GB stacked or otherwise interlinked dies. There's a hint that this is possibly the case in https://github.com/LinzhiChips/linzhichips.github.io/raw/master/docs/LWP8-ECIP-1043-Revisited.pdf saying "16nm or deeper processes allow integrated 1GB RAM on a single silicon die." Since each hash computation uses a small portion of the DAG only, somewhat defective dies are usable - it's OK if a moderate percentage of hash computations are incorrect - thus achieving adequate yield even with huge mostly-SRAM dies like this.

If this guess is correct, I'd be very interested to know how the dies are interconnected. Is it TSVs or maybe e.g. TCI. Here's an academic example from last year with 9 much smaller stacked dies, 1 logic + 8 SRAM, in 40nm, TCI: https://fuse.wikichip.org/news/1206/quest-a-tci-based-3d-stacked-sram-neural-processor/ This is only 96 MB total. Back-of-the-envelope scaling from "121.55 mm² (14.3mm × 8.5mm)" to "maximum chip size is, generally, 26mm x 33mm" (the latter from Linzhi's LWP8 document referenced above) and from 40nm to 16nm gives: 96*26*33/121.55*(40/16)^2 = 4235 MB. That's our 4 GB in 8 stacked dies. Linzhi's LWP8 suggests it can also be done in just 4? Maybe, but either way 8 would be realistic to produce too.

@solardiz
Copy link
Contributor

solardiz commented Apr 3, 2019

@Sonia-Chen When you mention 20K gates for the multiplier, etc. you refer to just one 32-bit lane, right? But then you forget to multiply these estimates by 16 lanes before you derive the hashrate estimates, don't you? (Somehow your 8 KiB for the register file does appear to include all 16 lanes and the 4 pipeline stages. But then you forgot to include it in further consideration, as already pointed out. You also didn't include the 16 KiB cache.)

Edit: figured it out: I guess you indirectly included the lanes count in the presumably all-lane Math+Merge count you used as a divisor. (As already pointed out, this number is ~4x lower than actual, but that's another story.)

@solardiz
Copy link
Contributor

solardiz commented Apr 4, 2019

I just realized that Ethash still lacks the shuffling introduced into ProgPoW 0.9.1+ in #13, which means that @Sonia-Chen company's Ethash ASIC simply doesn't need very fast interconnect between the SRAM dies (or whatever). It also means that ProgPoW has major advantage there, totally unrelated to its added computation.

@timolson
Copy link

timolson commented May 30, 2019

I've developed both GPU miners and ASIC's, and together with 7400 and Salt released an open-source core for Cryptonight Classic that was 5x better H/J than Bitmain's, while using only 28nm. We've reviewed ProgPoW from a theory and design perspective, and would like to go on record with these qualitative comments:

Overall, ProgPoW looks to be a genuine, professional attempt by GPU-interested parties. That is to say, we found no obvious backdoors or any suggestion that ProgPoW is anything but an honest attempt at ASIC resistance.

The inner loop does try to cover the shader datapaths pretty well, but obviously GPU's without a unified texture/L1 architecture will waste some texture area, and all geometry pipelines go unused. Also, ProgPoW is strictly integer math, while GPU's predominantly focus on float performance, so that overlap is also less than 100%. However, we are not GPU insiders and cannot quantify the GPU die area that would go unused in ProgPoW.

We do point out that while GPU's are not especially good at integer multiplication and are outright bad at bit operations, five of eleven random math operations in ProgPoW are bitops and two are multiplies. For Nvidia, bitops take 2 cycles each, the same as addition, and multiplies are so slow the official docs say only "multiple cycles." In ASIC's, bit operations especially can run considerably faster.

We suspect a VLIW architecture may help exploit this, by combining multiple instructions into a bundle that can be computed in fewer clock cycles than each instruction individually. If we group the 11 operations into three categories: bitops, adds, and muls (also rot), then our slow-seed compiler can generate instructions like bitop-muladd-bitop that frequently match branches of the abstract syntax tree and run in far less than the 8+ cycles this would take on a GPU. The timings and dependencies of instructions may be precalculated by the compiler, such that no on-chip sequencing logic is necessary. Also, the set of VLIW instructions may be generated from the distribution of the program space, and this distribution may also inform the number of compute instances for each instruction. There would be many bitop-bitop units for example, and fewer bitop-add-mul-add-bitop units, efficiently matching transistor count to the frequency of each op sequence.

These gains all together may or may not give a 2x speedup, and we can't say without deeper analysis.

Overall we think ProgPoW is a good try, probably the best anti-ASIC attempt so far. It is relatively simple and straightforward, and professionally designed and documented, yet we remain uncertain of its chances for keeping the ASIC gap to 2x.

@solardiz
Copy link
Contributor

@timolson Thanks!

How would you tackle the memory bandwidth requirements?

As to speeding up the computations, you appear to focus on reducing the latency of individual groups of sequential operations, but in ProgPoW most operations are data-independent and can proceed in parallel. (That's in part because ProgPoW is designed to write into each register at least once per loop iteration, and with currently proposed settings it has similar numbers of operations per loop iteration and registers, so the "at least once" turns into "typically exactly once".) I think this is a speedup of per-hash computation far greater than 2x over GPU (more like 10x+), which you can exploit by having more execution units (and enough register file ports or banks) per core. Of course, this would cost some die area, but so would having the specialized units you propose. And it can be VLIW either way. The real issue is that speeding up per-hash computation doesn't necessarily translate to increasing hashrate per die area and per Watt, and even when it does it's not necessarily the most optimal way, so we need analysis in those terms.

@timolson
Copy link

timolson commented Jun 2, 2019

you appear to focus on reducing the latency of individual groups of sequential operations

No, this increases throughput. There will be a small increase in latency, because the first few instructions must be computed in serial since they don’t have any previously pipelined results as input.

ProgPoW most operations are data-independent and can proceed in parallel

If you mean the “lanes” then yes, but my description uses implied lanes. Each VLIW in our arch would actually be a 16-wide SIMD. Call it VLIWMD? Ugh no 😀 VLIW shortens the time it takes to compute the “length” of each sequence, regardless of the “width” of the datapaths.

this would cost some die area, but so would having the specialized units you propose

Actually no, we would expect a more efficient use of transistors (on average) under this design, reducing overall area. This is an average effect: some programs will run worse, but most programs will run better. The VLIW ISA may be designed analytically to take maximum advantage of the specific instructions that are inefficient in GPU’s.

speeding up per-hash computation doesn't necessarily translate to increasing hashrate per die area and per Watt

Could you explain this comment? Doing the same work with fewer cycles and fewer transistors should be strictly superior in terms of both area and power.

@solardiz
Copy link
Contributor

solardiz commented Jun 3, 2019

If you mean the “lanes”

No, I mean there's also plenty of parallelism within each lane. Plenty of the merge and math operations pending at the start of a loop iteration (maybe 1/3 or 1/2 of a loop iteration's total - need to simulate this with specific parameters) can all proceed in parallel - those where their input registers are still at their values as of the start of the loop iteration. Then there are some data dependencies between the rest, but overall the dependency tree depth is much lower (several times lower? need to simulate this to make sure) than the number of operations in a loop iteration. And that's when considering each iteration on its own; by considering parallelism available across loop iterations (merging the end of one with start of the next), even more parallelism can be extracted.

Doing the same work with fewer cycles and fewer transistors should be strictly superior

I agree. My comment was based on assumption that you went for an increase in transistor count to implement those specialized multi-op units.

@solardiz
Copy link
Contributor

I discovered the other day that in October last year Linzhi finally published architecture overview of their planned Ethash ASICs: https://medium.com/@Linzhi/linzhi-e1400-architecture-overview-6fed5a19ef70

Curiously, this looks like a general-purpose distributed RAM, NOC, and interconnect architecture similar to those found in many-core high-performance computing architectures like Adapteva's Epiphany, Kalray's, or even the once-top Chinese supercomputer's Sunway SW26010. Only with the general-purpose computation capability excluded.

Linzhi's claimed hashrate and power consumption put this at roughly 10x the efficiency of GPUs. That's for Ethash.

If Linzhi (or others) can in fact produce these ASICs in a cost-effective manner (and they'd cost quite a lot with the 64 medium-sized chips each), then (re-)adding general-purpose computation in a future revision looks do-able. If so and if Linzhi's current efficiency estimates are correct, my guesstimate is that the efficiency at ProgPoW vs. GPUs would be within 2x of that for Ethash, so 5x+ improvement over GPUs.

It isn't surprising to me that many-core architectures can be more efficient than GPUs for suitable tasks. We even played with one back in 2013-2014, getting 50x energy-efficiency improvement on 64-core Epiphany vs. then-current same-tech (28 nm) GPU for (I have to admit) a GPU-unfriendly computation task: https://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-48.html. Many-core would fare much worse for a GPU-friendly task, but there's potential for an improvement nevertheless.

So in a sense it's a win for Ethash and ProgPoW - seeing how even the ASIC designs to implement them are almost general-purpose computing devices, with a memory architecture so capable that it's a pity to waste it on mining.

However, if this materializes, it'd also be a win of many-core "CPU" + distributed RAM + NOC + mesh over GPU + DRAM.

@solardiz
Copy link
Contributor

A drawback of both Ethash and ProgPoW is that they're (almost solely) parallelizable memory-hard. While a sequential memory-hard hash function filling 4 GB would generally be considered too way slow for cryptocurrency PoW use case (in terms of PoW verification time, although MTP was an attempt to address that), it should be possible to add a smaller amount of sequential memory-hardness on top of the parallelizable memory-hardness. In fact, ProgPoW already does a tiny bit of that: its use of register files is per-hash, and greater than Ethash's. But that's very little. Implementing #41 would add some more per-hash memory, but also still quite little in absolute terms. To add much more (perhaps some megabytes), we'd have to accept a much lower hashrate - e.g., thousands instead of millions - which should be acceptable. We'd also have to start using GPUs' global memory not only for the read-only DAG, but also for the per-hash RAM. Unfortunately, this will reduce the portion of memory bandwidth available for the DAG reads, yet it could be a good trade-off.

Linzhi's proposed Ethash ASIC board (64 chips) has a total of 4.5 GiB RAM - same as we'd currently use on a GPU. Yet it has 129 Tbps (or ~16 TB/s) of peak communication bandwidth, which is maybe 30x greater than a GPU's global memory bandwidth. If we required a significant amount of per-hash RAM, then either this ASIC design would have to add a lot more RAM or it wouldn't be able to fully utilize its bandwidth advantage. Either way, its advantage would be greatly reduced.

So I think that adding significant sequential memory-hardness and accepting a numerically much lower (yet acceptable) hashrate is the way to go at minimizing the advantage of specialized ASICs. "Programmability" is also good (and might encourage those ASICs to be reusable for other computation once they're retired from mining), but on its own it can't achieve as much as sequential memory-hardness would.

[Not surprisingly, my yescrypt design for large-scale password hashing combines parallelizable and sequential memory-hardness. That's on CPU+RAM (typically with 100+ GB "ROM" and a few to tens of MB of per-hash "RAM"), but there's no reason why the same general approach couldn't be applied to mining on GPUs.]

@jean-m-cyr
Copy link

jean-m-cyr commented Mar 16, 2020

I don't quite follow the Linzi strategy here where one of 72 mixer cores holding a cached 1MB segment of the DAG gets selected for the next MIX stage depending on the calculated next DA. That's only 72MB of cached DAG to cover 4GB of DAG in DDR. Given the pseudo random nature of DAG accesses the hit rate on the cached portions of the DAG will be very low! Thus the design would be mostly limited by the DDR throughput anyway. In addition each cache fill operation to the MIX cores would need to transfer 1MB of DAG data as opposed to a single element. Transferring that much data to a cached 1MB segment unlikely to be reused seems wasteful!

If, on the other hand, the 1MB MIX core ram is not used to cache the DAG, then each MIX core would be responsible for handling a fixed DAG segment with 72 segments covering the entire DAG. The DAG accesses would still need to be serialized through a single 4GB DDR bank, unless there are 72 individual DDR controllers with 64MB of DDR each. The diagrams don't explicitly show this.

Am I missing something?

@solardiz
Copy link
Contributor

@jean-m-cyr My understanding is that they hold the entire DAG in this distributed memory, accessing its remote locations through inter-chip communication as necessary. No cache. No DDR.

@jean-m-cyr
Copy link

jean-m-cyr commented Mar 16, 2020

I don't believe that is likely with only 72 MIX elements, meaning 4GB would need to be distributed over 72 elements = nearly 64MB per element. Possible in a 72 chip design but not effective from a system cost perspective. Furthermore, off chip interconnects are power inefficient.

@solardiz
Copy link
Contributor

@jean-m-cyr Their design is 64 chips with 72 MB each. That's enough for the entire DAG. Yes, this is expensive. Like I wrote above, it is unclear if they "can in fact produce these ASICs in a cost-effective manner (and they'd cost quite a lot with the 64 medium-sized chips each)".

Yes, all of this consumes power. It remains to be seen if their estimates are correct or not. I guess they themselves didn't yet know at the time of publication of that Medium post. Where they provide easily verifiable numbers, those check out - e.g., 129 Tbps can be derived as 64 chips times 63 links times 32 Gbps per link. Can these ~4000 links consume only under 1 kW? That's under 250 mW per link. I'd say this looks realistic, considering e.g. the existence of low-power devices with PCIe where each lane is of a similar speed.

@jean-m-cyr
Copy link

jean-m-cyr commented Mar 16, 2020

A PCIe 3.0 lane has theoretical max of only 8 Gbits/sec. Each chip needs 63 lanes of PCIe or similar interface. Hmm...
A 64X64 crossbar would make more sense.

@solardiz
Copy link
Contributor

PCIe 5.0 is 32 Gbps/lane. I searched for power consumption figures, but couldn't find them for PCIe 5.0 yet. For older PCIe (up to Gen4) and for other SerDes, as well as for PCIe retimers and redrivers, I found figures in the range of 5 mW/Gbps to 7.5 mW/Gbps, e.g.: https://www.pads.com/multimedia/overview/tackling-design-and-verification-challenges-of-low-power-serdes-for-datacenter-and-automotive-34c1c1dd-7ab2-4ae2-b77c-05dbbf0b19a7

At 5 mW/Gbps, we'll have 160 mW per 32 Gbps link. This barely fits in the stated power budget, leaving little for powering the SRAMs and logic - about 350W out of the 1 kW budget. Can the whole thing fit in 1 kW given present tech? Doubtful, but not clearly impossible.

With all of those crossing links, I think the PCB design and getting it to actually work reliably would be tricky. It's probably still a long road from taping out the individual chips to getting the PCB to work. Yes, a crossbar could help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants