-
-
Notifications
You must be signed in to change notification settings - Fork 123
Segmented sieve of Eratosthenes
This page contains a step by step explanation of a simple but fast C++ implementation of the segmented sieve of Eratosthenes that generates the primes below n using operations and space. It runs significantly faster than a traditional sieve of Eratosthenes implementation due to its more efficient CPU cache usage i.e. it uses the CPU's L1 data cache size as its sieve array size. This ensures that less than 1 percent of the memory accesses will be cache misses. This implementation generates the primes below 10^9 in just 0.8 seconds (single-threaded) on an Intel Core i7-6700 3.4 GHz CPU.
We set the size of the sieve array named segment_size
to be the
same size as the CPU's L1 data cache size per core e.g. 32768 bytes.
The segment size must not be smaller than the square root of the
sieving limit otherwise the run-time complexity of the algorithm
deteriorates.
/// Set your CPU's L1 data cache size (in bytes) here
const int64_t L1D_CACHE_SIZE = 32768;
/// Generate primes using the segmented sieve of Eratosthenes.
/// This algorithm uses O(n log log n) operations and O(sqrt(n)) space.
/// @param limit Sieve primes <= limit.
///
void segmented_sieve(int64_t limit)
{
int64_t sqrt = (int64_t) std::sqrt(limit);
int64_t segment_size = std::max(sqrt, L1D_CACHE_SIZE);
We sieve one segment in each iteration of the outer sieving loop. At the beginning of each iteration we reset the sieve array and set the lower and upper bounds of the current segment. Hence the current segment corresponds to the interval [low, high].
for (int64_t low = 0; low <= limit; low += segment_size)
{
std::fill(sieve.begin(), sieve.end(), true);
// current segment = [low, high]
int64_t high = low + segment_size - 1;
high = std::min(high, limit);
Before we can start sieving the current segment [low, high] we need to generate the sieving primes ≤ sqrt(high). We do this using a simple sieve of Eratosthenes implementation. When we find a new sieving prime we store it in the primes array and we also store the index of the first multiple that needs to be crossed off (in the segmented sieve) in the multiples array.
// generate sieving primes using simple sieve of Eratosthenes
for (; i * i <= high; i += 2)
if (is_prime[i])
for (int64_t j = i * i; j <= sqrt; j += i)
is_prime[j] = false;
// initialize sieving primes for segmented sieve
for (; s * s <= high; s += 2)
{
if (is_prime[s])
{
primes.push_back(s);
multiples.push_back(s * s - low);
}
}
After having generated the sieving primes we can proceed with the
actual prime sieving. We start crossing off the multiples of the first
sieving prime 3 until we reach a multiple of 3 ≥ segment_size
,
when this happens we calculate the index of that multiple in the next
segment using (multiple - segment_size) and store it in the
multiples array. We then cross off the multiples of the other sieving
primes using the same procedure.
// sieve the current segment
for (std::size_t i = 0; i < primes.size(); i++)
{
int64_t j = multiples[i];
for (int64_t k = primes[i] * 2; j < segment_size; j += k)
sieve[j] = false;
multiples[i] = j - segment_size;
}
Once we have crossed off the multiples of all sieving primes in the current segment we iterate over the sieve array and count (or print) the primes.
for (; n <= high; n += 2)
if (sieve[n - low]) // n is a prime
count++;
/// @file segmented_sieve.cpp
/// @author Kim Walisch, <[email protected]>
/// @brief This is a simple implementation of the segmented sieve of
/// Eratosthenes with a few optimizations. It generates the
/// primes below 10^9 in 0.8 seconds (single-threaded) on an
/// Intel Core i7-6700 3.4 GHz CPU.
/// @license Public domain.
#include <iostream>
#include <algorithm>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <stdint.h>
/// Set your CPU's L1 data cache size (in bytes) here
const int64_t L1D_CACHE_SIZE = 32768;
/// Generate primes using the segmented sieve of Eratosthenes.
/// This algorithm uses O(n log log n) operations and O(sqrt(n)) space.
/// @param limit Sieve primes <= limit.
///
void segmented_sieve(int64_t limit)
{
int64_t sqrt = (int64_t) std::sqrt(limit);
int64_t segment_size = std::max(sqrt, L1D_CACHE_SIZE);
int64_t count = (limit < 2) ? 0 : 1;
// we sieve primes >= 3
int64_t i = 3;
int64_t n = 3;
int64_t s = 3;
std::vector<char> sieve(segment_size);
std::vector<char> is_prime(sqrt + 1, true);
std::vector<int64_t> primes;
std::vector<int64_t> multiples;
for (int64_t low = 0; low <= limit; low += segment_size)
{
std::fill(sieve.begin(), sieve.end(), true);
// current segment = [low, high]
int64_t high = low + segment_size - 1;
high = std::min(high, limit);
// generate sieving primes using simple sieve of Eratosthenes
for (; i * i <= high; i += 2)
if (is_prime[i])
for (int64_t j = i * i; j <= sqrt; j += i)
is_prime[j] = false;
// initialize sieving primes for segmented sieve
for (; s * s <= high; s += 2)
{
if (is_prime[s])
{
primes.push_back(s);
multiples.push_back(s * s - low);
}
}
// sieve the current segment
for (std::size_t i = 0; i < primes.size(); i++)
{
int64_t j = multiples[i];
for (int64_t k = primes[i] * 2; j < segment_size; j += k)
sieve[j] = false;
multiples[i] = j - segment_size;
}
for (; n <= high; n += 2)
if (sieve[n - low]) // n is a prime
count++;
}
std::cout << count << " primes found." << std::endl;
}
/// Usage: ./segmented_sieve n
/// @param n Sieve the primes up to n.
///
int main(int argc, char** argv)
{
if (argc >= 2)
segmented_sieve(std::atoll(argv[1]));
else
segmented_sieve(1000000000);
return 0;
}
Download segmented_sieve.cpp, open a terminal and run:
# Compile using default C++ compiler
c++ -O3 segmented_sieve.cpp -o segmented_sieve
# Count the primes below 10^9
time ./segmented_sieve 1000000000
The segmented sieve of Eratosthenes implementation presented before is both simple and fast. But what if we wanted to further optimize it so that it runs even faster? Well the next step is to change our implementation so that it uses a bit array instead of a byte array for sieving. The idea is to map numbers to bits in the sieve array e.g. we define that each bit in the sieve array corresponds to an odd number. Hence bit0 = 1, bit1 = 3, bit2 = 5, bit3 = 7 and so forth. This also means that each byte of the sieve array corresponds to a interval, as each byte has 8 bits byte0 = [0, 15], byte1 = [16, 31], byte2 = [32, 47] and so forth.
This loop is very similar to the segmented byte sieving loop. But
here when we cross off a multiple we unset the corresponding bit instead
of setting the corresponding byte to false. We divide the multiple by
16 (right shift by 4) to find the corresponding byte and we use the
unset_bit
lookup table to mask off the bit corresponding to the
multiple in the sieve array.
// sieve segment using bit array
for (std::size_t i = 0; i < primes.size(); i++)
{
int64_t j = multiples[i];
for (int64_t k = primes[i] * 2; j < segment_size; j += k)
sieve[j >> 4] &= unset_bit[j & 15];
multiples[i] = j - segment_size;
}
Each 1 bit that remains in our sieve array after the sieving step corresponds to a prime. Instead of iterating over each bit and checking whether it is set or not we will use a precomputed popcnt lookup table that tells us how many bits are set in each byte, this is much faster. We also need to unset the bits in the sieve array that correspond to numbers > sieving limit. These are the most significant bits in the last byte of the sieve array.
// unset bits > limit
if (high == limit)
{
int64_t bits = 0xff << (limit % 16 + 1) / 2;
sieve[sieve_size - 1] &= ~bits;
}
// count primes
for (int64_t n = 0; n < sieve_size; n++)
count += popcnt[sieve[n]];
Download segmented_bit_sieve.cpp, open a terminal and run:
# Compile using default C++ compiler
c++ -O3 segmented_bit_sieve.cpp -o segmented_bit_sieve
# Count the primes below 10^9
time ./segmented_bit_sieve 1000000000
The performance of the segmented sieve of Eratosthenes mainly depends on CPU caching. If the sieve array fits into the CPU's L1 or L2 cache sieving will be extremely fast, if however the sieve array is larger than the CPU's cache size the sieving performance will deteriorate by up to 2 orders of magnitude. As the size of the sieve array is set to the square root of the sieving limit the sieve array will not fit into the CPU's cache if the sieving limit is sufficiently large e.g. 10^10. By using a bit array for sieving the sieve array uses less memory and hence it will fit into the CPU's caches even for larger sieving limits. Actually the bit sieving array will fit into the CPU's L1 cache up to ~ 10^11.4 and into the CPU's L2 cache up to ~ 10^13.8.
x | Prime Count |
Segmented byte sieve Intel Core i7-6700, 32 KiB L1 cache |
Segmented bit sieve Intel Core i7-6700, 32 KiB L1 cache |
107 | 664,579 | 0.03s | 0.03s |
108 | 5,761,455 | 0.11s | 0.11s |
109 | 50,847,534 | 0.80s | 0.65s |
1010 | 455,052,511 | 11.10s | 7.25s |
1011 | 4,118,054,813 | 141.88s | 88.47s |
1012 | 37,607,912,018 | 1742.95s | 1104.00s |