Skip to content

solardiz/ProgPOW

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ProgPoW - A Programmatic Proof of Work - unofficial fork, 3x faster on NVIDIA Maxwell

ProgPoW is a proof-of-work algorithm designed to close the efficiency gap available to specialized ASICs.

Build and Test Instructions

After cloning this repository into ProgPOW, it can be built with commands like:

cd ProgPOW
git submodule update --init --recursive
mkdir build
cd build
cmake .. -DETHASHCUDA=ON
make -sj8

and benchmarked with commands like:

ethminer/ethminer -U -M 10000000
ethminer/ethminer -G -M 10000000

which use CUDA and OpenCL, respectively.

ProgPoW Overview

The design goal of ProgPoW is to have the algorithm’s requirements match what is available on commodity GPUs: If the algorithm were to be implemented on a custom ASIC there should be little opportunity for efficiency gains compared to a commodity GPU.

The main elements of the algorithm are:

  • Changes keccak_f1600 (with 64-bit words) to keccak_f800 (with 32-bit words) to reduce impact on total power
  • Increases mix state.
  • Adds a random sequence of math in the main loop.
  • Adds reads from a small, low-latency cache that supports random addresses.
  • Increases the DRAM read from 128 bytes to 2048 bytes (random reads of groups of 4 sequential blocks of 512 bytes).

The random sequence changes every PROGPOW_PERIOD (10 blocks, or about 2.5 minutes for Ethereum). When mining source code is generated for the random sequence and compiled on the host CPU. The GPU will execute the compiled code where what math to perform and what mix state to use are already resolved.

While a custom ASIC to implement this algorithm is still possible, the efficiency gains available are smaller than for Ethash. Much of a commodity GPU is required to support the above elements. Some available ASIC optimizations are:

  • Remove the graphics pipeline (displays, geometry engines, texturing, etc)
  • Remove floating point math
  • A few ISA tweaks, like instructions that exactly match the merge() function

Rationale for PoW on Commodity Hardware

With the growth of large mining pools, the control of hashing power has been delegated to the top few pools to provide a steadier economic return for small miners. While some have made the argument that large centralized pools defeats the purpose of “ASIC resistance,” it’s important to note that ASIC based coins are even more centralized for several reasons.

  1. No natural distribution: There isn’t an economic purpose for ultra-specialized hardware outside of mining and thus no reason for most people to have it.
  2. No reserve group: Thus, there’s no reserve pool of hardware or reserve pool of interested parties to jump in when coin price is volatile and attractive for manipulation.
  3. High barrier to entry: Initial miners are those rich enough to invest capital and ecological resources on the unknown experiment a new coin may be. Thus, initial coin distribution through mining will be very limited causing centralized economic bias.
  4. Delegated centralization vs implementation centralization: While pool centralization is delegated, hardware monoculture is not: only the limiter buyers of this hardware can participate so there isn’t even the possibility of divesting control on short notice.
  5. No obvious decentralization of control even with decentralized mining: Once large custom ASIC makers get into the game, designing back-doored hardware is trivial. ASIC makers have no incentive to be transparent or fair in market participation.

While the goal of “ASIC resistance” is valuable, the entire concept of “ASIC resistance” is a bit of a fallacy. CPUs and GPUs are themselves ASICs. Any algorithm that can run on a commodity ASIC (a CPU or GPU) by definition can have a customized ASIC created for it with slightly less functionality. Some algorithms are intentionally made to be “ASIC friendly” - where an ASIC implementation is drastically more efficient than the same algorithm running on general purpose hardware. The protection that this offers when the coin is unknown also makes it an attractive target for a dedicate mining ASIC company as soon as it becomes useful.

Therefore, ASIC resistance is: the efficiency difference of specialized hardware versus hardware that has a wider adoption and applicability. A smaller efficiency difference between custom vs general hardware mean higher resistance and a better algorithm. This efficiency difference is the proper metric to use when comparing the quality of PoW algorithms. Efficiency could mean absolute performance, performance per watt, or performance per dollar - they are all highly correlated. If a single entity creates and controls an ASIC that is drastically more efficient, they can gain 51% of the network hashrate and possibly stage an attack.

ProgPoW Algorithm Walkthrough

The DAG is generated exactly as in Ethash. All the parameters (epoch length, DAG size, etc) are unchanged. See the original Ethash spec for details on generating the DAG.

ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:

  • PROGPOW_PERIOD: Number of blocks before changing the random program
  • PROGPOW_LANES: The number of parallel lanes that coordinate to calculate a single hash instance
  • PROGPOW_REGS: The register file usage size
  • PROGPOW_DAG_LOADS: Number of uint32 loads from the DAG per lane
  • PROGPOW_CACHE_BYTES: The size of the cache
  • PROGPOW_CNT_DAG: The number of DAG accesses, defined as the outer loop of the algorithm (64 is the same as Ethash)
  • PROGPOW_CNT_CACHE: The number of cache accesses per loop
  • PROGPOW_CNT_MATH: The number of math operations per loop

The value of these parameters has been tweaked between version 0.9.2 (live on the Gangnam testnet) and 0.9.3 (proposed for Ethereum adoption). See this medium post for details.

Parameter 0.9.2 0.9.3 0.9.3m1
PROGPOW_PERIOD 50 10 10
PROGPOW_LANES 16 16 16
PROGPOW_REGS 32 32 64
PROGPOW_DAG_LOADS 4 4 8
PROGPOW_CACHE_BYTES 16x1024 16x1024 16x1024
PROGPOW_CNT_DAG 64 64 32
PROGPOW_CNT_CACHE 12 11 22
PROGPOW_CNT_MATH 20 18 36

Besides the parameter changes above, this unofficial ProgPoW 0.9.3m1 further differs from 0.9.3 in that it makes every four adjacent PROGPOW_DAG_LOADS sequential, only proceeding to a new random offset afterwards, effectively increasing the random DAG read block size by a further factor of 4, for a total of 8 (from 256 bytes in 0.9.3 to 2048 bytes in 0.9.3m1).

The random program changes every PROGPOW_PERIOD blocks to ensure the hardware executing the algorithm is fully programmable. If the program only changed every DAG epoch (roughly 5 days for Ethereum) certain miners could have time to develop hand-optimized versions of the random sequence, giving them an undue advantage.

Sample code is written in C++, this should be kept in mind when evaluating the code in the specification.

All numerics are computed using unsigned 32 bit integers. Any overflows are trimmed off before proceeding to the next computation. Languages that use numerics not fixed to bit lengths (such as Python and JavaScript) or that only use signed integers (such as Java) will need to keep their languages' quirks in mind. The extensive use of 32 bit data values aligns with modern GPUs internal data architectures.

ProgPoW uses a 32-bit variant of FNV1a for merging data. The existing Ethash uses a similar variant of FNV1 for merging, but FNV1a provides better distribution properties.

Test vectors can be found in the test vectors file.

const uint32_t FNV_PRIME = 0x1000193;
const uint32_t FNV_OFFSET_BASIS = 0x811c9dc5;

uint32_t fnv1a(uint32_t h, uint32_t d)
{
    return (h ^ d) * FNV_PRIME;
}

ProgPow uses KISS99 for random number generation. This is the simplest (fewest instruction) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.

Test vectors can be found in the test vectors file.

typedef struct {
    uint32_t z, w, jsr, jcong;
} kiss99_t;

// KISS99 is simple, fast, and passes the TestU01 suite
// https://en.wikipedia.org/wiki/KISS_(algorithm)
// http://www.cse.yorku.ca/~oz/marsaglia-rng.html
uint32_t kiss99(kiss99_t &st)
{
    st.z = 36969 * (st.z & 65535) + (st.z >> 16);
    st.w = 18000 * (st.w & 65535) + (st.w >> 16);
    uint32_t MWC = ((st.z << 16) + st.w);
    st.jsr ^= (st.jsr << 17);
    st.jsr ^= (st.jsr >> 13);
    st.jsr ^= (st.jsr << 5);
    st.jcong = 69069 * st.jcong + 1234567;
    return ((MWC^st.jcong) + st.jsr);
}

The fill_mix function populates an array of int32 values used by each lane in the hash calculations.

Test vectors can be found in the test vectors file.

void fill_mix(
    uint64_t seed,
    uint32_t lane_id,
    uint32_t mix[PROGPOW_REGS]
)
{
    // Use FNV to expand the per-warp seed to per-lane
    // Use KISS to expand the per-lane seed to fill mix
    kiss99_t st;
    st.z = fnv1a(FNV_OFFSET_BASIS, seed);
    st.w = fnv1a(st.z, seed >> 32);
    st.jsr = fnv1a(st.w, lane_id);
    st.jcong = fnv1a(st.jsr, lane_id);
    for (int i = 0; i < PROGPOW_REGS; i++)
            mix[i] = kiss99(st);
}

Like Ethash Keccak is used to seed the sequence per-nonce and to produce the final result. The keccak-f800 variant is used as the 32-bit word size matches the native word size of modern GPUs. The implementation is a variant of SHAKE with width=800, bitrate=576, capacity=224, output=256, and no padding. The result of keccak is treated as a 256-bit big-endian number - that is result byte 0 is the MSB of the value.

As with Ethash the input and output of the keccak function are fixed and relatively small. This means only a single "absorb" and "squeeze" phase are required. For a pseudo-code implementation of the keccak_f800_round function see the Round[b](A,RC) function in the "Pseudo-code description of the permutations" section of the official Keccak specs.

Test vectors can be found in the test vectors file.

hash32_t keccak_f800_progpow(hash32_t header, uint64_t seed, hash32_t digest)
{
    uint32_t st[25];

    // Initialization
    for (int i = 0; i < 25; i++)
        st[i] = 0;

    // Absorb phase for fixed 18 words of input
    for (int i = 0; i < 8; i++)
        st[i] = header.uint32s[i];
    st[8] = seed;
    st[9] = seed >> 32;
    for (int i = 0; i < 8; i++)
        st[10+i] = digest.uint32s[i];

    // keccak_f800 call for the single absorb pass
    for (int r = 0; r < 22; r++)
        keccak_f800_round(st, r);

    // Squeeze phase for fixed 8 words of output
    hash32_t ret;
    for (int i=0; i<8; i++)
        ret.uint32s[i] = st[i];

    return ret;
}

The inner loop uses FNV and KISS99 to generate a random sequence from the prog_seed. This random sequence determines which mix state is accessed and what random math is performed.

Since the prog_seed changes only once per PROGPOW_PERIOD (50 blocks or about 12.5 minutes) it is expected that while mining progPowLoop will be evaluated on the CPU to generate source code for that period's sequence. The source code will be compiled on the CPU before running on the GPU. You can see an example sequence and generated source code in kernel.cu.

Test vectors can be found in the test vectors file.

kiss99_t progPowInit(uint64_t prog_seed, int mix_seq_dst[PROGPOW_REGS], int mix_seq_src[PROGPOW_REGS])
{
    kiss99_t prog_rnd;
    prog_rnd.z = fnv1a(FNV_OFFSET_BASIS, prog_seed);
    prog_rnd.w = fnv1a(prog_rnd.z, prog_seed >> 32);
    prog_rnd.jsr = fnv1a(prog_rnd.w, prog_seed);
    prog_rnd.jcong = fnv1a(prog_rnd.jsr, prog_seed >> 32);
    // Create a random sequence of mix destinations for merge() and mix sources for cache reads
    // guarantees every destination merged once
    // guarantees no duplicate cache reads, which could be optimized away
    // Uses Fisher-Yates shuffle
    for (int i = 0; i < PROGPOW_REGS; i++)
    {
        mix_seq_dst[i] = i;
        mix_seq_src[i] = i;
    }
    for (int i = PROGPOW_REGS - 1; i > 0; i--)
    {
        int j;
        j = kiss99(prog_rnd) % (i + 1);
        swap(mix_seq_dst[i], mix_seq_dst[j]);
        j = kiss99(prog_rnd) % (i + 1);
        swap(mix_seq_src[i], mix_seq_src[j]);
    }
    return prog_rnd;
}

The math operations that merges values into the mix data are ones chosen to maintain entropy.

Test vectors can be found in the test vectors file.

// Merge new data from b into the value in a
// Assuming A has high entropy only do ops that retain entropy
// even if B is low entropy
// (IE don't do A&B)
uint32_t merge(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 4)
    {
    case 0: return (a * 33) + b;
    case 1: return (a ^ b) * 33;
    // prevent rotate by 0 which is a NOP
    case 2: return ROTL32(a, ((r >> 16) % 31) + 1) ^ b;
    case 3: return ROTR32(a, ((r >> 16) % 31) + 1) ^ b;
    }
}

The math operations chosen for the random math are ones that are easy to implement in CUDA and OpenCL, the two main programming languages for commodity GPUs. The mul_hi, min, clz, and popcount functions match the corresponding OpenCL functions. ROTL32 matches the OpenCL rotate function. ROTR32 is rotate right, which is equivalent to rotate(i, 32-v).

Test vectors can be found in the test vectors file.

// Random math between two input values
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0: return a + b;
    case 1: return a * b;
    case 2: return mul_hi(a, b);
    case 3: return min(a, b);
    case 4: return ROTL32(a, b);
    case 5: return ROTR32(a, b);
    case 6: return a & b;
    case 7: return a | b;
    case 8: return a ^ b;
    case 9: return clz(a) + clz(b);
    case 10: return popcount(a) + popcount(b);
    }
}

The flow of the inner loop is:

  • Lane ((loop >> 2) % LANES) is chosen as the leader for that loop iteration
  • The leader's mix[0] value modulo the number of 512-byte DAG entries is used to select where to read from the full DAG
  • Each lane reads DAG_LOADS sequential words, using (lane ^ (loop >> 2)) % LANES as the starting offset within the entry
  • The random sequence of math and cache accesses is performed
  • The DAG data read at the start of the loop is merged at the end of the loop
  • Except on every fourth iteration of the loop (when mix[0] is a result of the random math, just like other mix registers are), mix[0] is advanced such that the next sequential 512-byte DAG entry would be read

prog_seed and loop come from the outer loop, corresponding to the current program seed (which is block_number/PROGPOW_PERIOD) and the loop iteration number. mix is the state array, initially filled by fill_mix. dag is the bytes of the Ethash DAG grouped into 32 bit unsigned ints in litte-endian format. On little-endian architectures this is just a normal int32 pointer to the existing DAG.

DAG_BYTES is set to the number of bytes in the current DAG, which is generated identically to the existing Ethash algorithm.

Test vectors for 0.9.2 can be found in the test vectors file.

void progPowLoop(
    const uint64_t prog_seed,
    const uint32_t loop,
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
    const uint32_t *dag)
{
    // dag_entry holds the 512 bytes of data loaded from the DAG
    uint32_t dag_entry[PROGPOW_LANES][PROGPOW_DAG_LOADS];
    // On every fourth loop iteration rotate which lane is the source of the DAG address.
    // The source lane's mix[0] value is used to ensure the last loop's DAG data feeds into this loop's address.
    // dag_addr_base is which 512-byte entry within the DAG will be accessed
    uint32_t orig_mix0 = mix[(loop >> 2) % PROGPOW_LANES][0];
    uint32_t dag_addr_base = orig_mix0 %
        (DAG_BYTES / (PROGPOW_LANES*PROGPOW_DAG_LOADS*sizeof(uint32_t)));
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
        // Lanes access DAG_LOADS sequential words from the dag entry
        // Shuffle which portion of the entry each lane accesses each iteration by XORing lane and loop.
        // This prevents multi-chip ASICs from each storing just a portion of the DAG
        size_t dag_addr_lane = dag_addr_base * PROGPOW_LANES + (l ^ (loop >> 2)) % PROGPOW_LANES;
        for (int i = 0; i < PROGPOW_DAG_LOADS; i++)
            dag_entry[l][i] = dag[dag_addr_lane * PROGPOW_DAG_LOADS + i];
    }

    // Initialize the program seed and sequences
    // When mining these are evaluated on the CPU and compiled away
    int mix_seq_dst[PROGPOW_REGS];
    int mix_seq_src[PROGPOW_REGS];
    int mix_seq_dst_cnt = 0;
    int mix_seq_src_cnt = 0;
    kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq_dst, mix_seq_src);

    int max_i = max(PROGPOW_CNT_CACHE, PROGPOW_CNT_MATH);
    for (int i = 0; i < max_i; i++)
    {
        if (i < PROGPOW_CNT_CACHE)
        {
            // Cached memory access
            // lanes access random 32-bit locations within the first portion of the DAG
            int src = mix_seq_src[(mix_seq_src_cnt++)%PROGPOW_REGS];
            int dst = mix_seq_dst[(mix_seq_dst_cnt++)%PROGPOW_REGS];
            int sel = kiss99(prog_rnd);
            for (int l = 0; l < PROGPOW_LANES; l++)
            {
                uint32_t offset = mix[l][src] % ((PROGPOW_CACHE_BYTES/sizeof(uint32_t) << 2));
                mix[l][dst] = merge(mix[l][dst], dag[offset >> 2], sel);
            }
        }
        if (i < PROGPOW_CNT_MATH)
        {
            // Random Math
            // Generate 2 unique sources 
            int src_rnd = kiss99(prog_rnd) % (PROGPOW_REGS * (PROGPOW_REGS-1));
            int src1 = src_rnd % PROGPOW_REGS; // 0 <= src1 < PROGPOW_REGS
            int src2 = src_rnd / PROGPOW_REGS; // 0 <= src2 < PROGPOW_REGS - 1
            if (src2 >= src1) ++src2; // src2 is now any reg other than src1
            int sel1 = kiss99(prog_rnd);
            int dst  = mix_seq_dst[(mix_seq_dst_cnt++)%PROGPOW_REGS];
            int sel2 = kiss99(prog_rnd);
            for (int l = 0; l < PROGPOW_LANES; l++)
            {
                uint32_t data = math(mix[l][src1], mix[l][src2], sel1);
                mix[l][dst] = merge(mix[l][dst], data, sel2);
            }
        }
    }
    // Consume the global load data at the very end of the loop to allow full latency hiding
    // Always merge into mix[0] to feed the offset calculation
    for (int i = 0; i < PROGPOW_DAG_LOADS; i++)
    {
        int dst = (i==0) ? 0 : mix_seq_dst[(mix_seq_dst_cnt++)%PROGPOW_REGS];
        int sel = kiss99(prog_rnd);
        if (i == 0 && (loop & 3) != 3)
            dst = 1;
        for (int l = 0; l < PROGPOW_LANES; l++)
        {
            mix[l][dst] = merge(mix[l][dst], dag_entry[l][i], sel);
            if (i == 0 && (loop & 3) != 3)
            {
                mix[l][1] += mix[l][0];
                // To maintain sequential reads, it'd be sufficient to do this for the leader lane, but we do it for all
                mix[l][0] = orig_mix0 + 1;
            }
        }
    }
}

The flow of the overall algorithm is:

  • A keccak hash of the header + nonce to create a seed
  • Use the seed to generate initial mix data
  • Loop multiple times, each time hashing random loads and random math into the mix data
  • Hash all the mix data into a single 256-bit value
  • A final keccak hash is computed
  • When mining this final value is compared against a hash32_t target
hash32_t progPowHash(
    const uint64_t prog_seed, // value is (block_number/PROGPOW_PERIOD)
    const uint64_t nonce,
    const hash32_t header,
    const uint32_t *dag // gigabyte DAG located in framebuffer - the first portion gets cached
)
{
    uint32_t mix[PROGPOW_LANES][PROGPOW_REGS];
    hash32_t digest;
    for (int i = 0; i < 8; i++)
        digest.uint32s[i] = 0;

    // keccak(header..nonce)
    hash32_t seed_256 = keccak_f800_progpow(header, nonce, digest);
    // endian swap so byte 0 of the hash is the MSB of the value
    uint64_t seed = ((uint64_t)bswap(seed_256.uint32s[0]) << 32) | bswap(seed_256.uint32s[1]);

    // initialize mix for all lanes
    for (int l = 0; l < PROGPOW_LANES; l++)
        fill_mix(seed, l, mix[l]);

    // execute the randomly generated inner loop
    for (int i = 0; i < PROGPOW_CNT_DAG; i++)
        progPowLoop(prog_seed, i, mix, dag);

    // Reduce mix data to a per-lane 32-bit digest
    uint32_t digest_lane[PROGPOW_LANES];
    for (int l = 0; l < PROGPOW_LANES; l++)
    {
        digest_lane[l] = FNV_OFFSET_BASIS;
        for (int i = 0; i < PROGPOW_REGS; i++)
            digest_lane[l] = fnv1a(digest_lane[l], mix[l][i]);
    }
    // Reduce all lanes to a single 256-bit digest
    for (int i = 0; i < 8; i++)
        digest.uint32s[i] = FNV_OFFSET_BASIS;
    for (int l = 0; l < PROGPOW_LANES; l++)
        digest.uint32s[l%8] = fnv1a(digest.uint32s[l%8], digest_lane[l]);

    // keccak(header .. keccak(header..nonce) .. digest);
    return keccak_f800_progpow(header, seed, digest);
}

Example / Testcase

For ProgPoW 0.9.2:

The random sequence generated for block 30,000 (prog_seed 600) can been seen in kernel.cu.

The algorithm run on block 30,000 produces the following digest and result:

header ffeeddccbbaa9988776655443322110000112233445566778899aabbccddeeff
nonce 123456789abcdef0

digest: 11f19805c58ab46610ff9c719dcf0a5f18fa2f1605798eef770c47219274767d
result: 5b7ccd472dbefdd95b895cac8ece67ff0deb5a6bd2ecc6e162383d00c3728ece

A full run showing some intermediate values can be seen in result.log

Additional test vectors can be found in the test vectors file.

Change History

  • 0.9.3m2 (proposed here) - Index cache with byte offsets. See this GitHub issue for details.
  • 0.9.3m1 (proposed here) - Double REGS, DAG_LOADS, CNT_MATH, and CNT_CACHE, halve CNT_DAG. Make every four adjacent DAG_LOADS sequential, only proceeding to a new random offset afterwards, effectively increasing the random DAG read block size by a further factor of 4, for a total of 8 (from 256 bytes in 0.9.3 to 2048 bytes in 0.9.3m1). See this GitHub issue comments thread for details. While at it, drop the overly optimistic ASIC resistance claims from this README.md.
  • 0.9.3 (proposed) - Reduce parameters PERIOD, CNT_MATH, and CNT_CACHE. See this medium post for details.
  • 0.9.2 - Unique sources for math() and prevent rotation by 0 in merge(). Suggested by SChernykh
  • 0.9.1 - Shuffle what part of the DAG entry each lane accesses. Suggested by mbevand
  • 0.9.0 - Unique cache address sources, re-tune parameters
  • 0.8.0 - Original spec

License

The ProgPoW algorithm and this specification are a new work. Copyright and related rights are waived via CC0.

The reference ProgPoW mining implementation in this repository is a derivative of ethminer so retains the GPL license.

About

A Programmatic Proof-of-Work for Ethash. Forked from https://github.com/ethereum-mining/ethminer

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 79.3%
  • C 16.1%
  • CMake 2.2%
  • Cuda 2.1%
  • Shell 0.3%