## Tuesday, April 9, 2013

### Segmented Sieve of Eratosthenes

From my last playing around with primes post, I've finally managed to segment the sieving algorithm. This does a few different things:

• Reduces memory usage. Now the usage is proportional to the number of primes found plus some small fixed buffer.
• Increases locality when sieving. This should help with performance.
• Hopefully makes parallelization easier/possible.
• Finding the first x primes becomes a relatively straightforward and quick task.

Here's the code:

public static int[] seg_sieve(int n)
{
if (n < 2)
{
return null;
}
int[] results = new int[(int) Math.ceil(1.25506 * n / Math.log(n))];
results[0] = 2;
results[1] = 3;
int size = 1;

// use 8kb buffer
char[] is_composite = new char[0x1000];
// base_shift_factor = 5
// segment_shift_factor = log2(is_composite.length) + base_shift_factor
final int seg_count = n - 2 >> 17;

for (int seg = 0; seg <= seg_count; ++seg)
{
final int seg_start = (seg << 17) + 3;
final int seg_end = Math.min((seg + 1 << 17) + 3, n);
final int limit_i = (int) Math.ceil(Math.sqrt(seg_end));

Arrays.fill(is_composite, (char) 0);

// pre-sieve primes we've already found
for (int i = 1; i < size && results[i] <= limit_i; ++i)
{
// start at max of prime squared or minimum multiple of result in range that's not
// also a multiple of 2
int mark_start = seg_start / results[i] * results[i];
if ((seg_start / results[i] & 0x1) == 1)
{
if (mark_start < seg_start)
{
mark_start += 2 * results[i];
}
}
else
{
mark_start += results[i];
}
for (int j = Math.max(results[i] * results[i], mark_start); j < seg_end; j += results[i] << 1)
{
is_composite[j - seg_start >> 5] |= 1 << ((j - seg_start & 0x1F) >> 1);
}
}

// start sieving
for (int i = seg_start; i <= limit_i; i += 2)
{
if ((is_composite[i - seg_start >> 5] & 1 << ((i - seg_start & 0x1F) >> 1)) == 0)
{
// i is prime, mark off all multiples
results[size++] = i;
// start at i squared
for (int j = i * i; j < seg_end; j += i << 1)
{
is_composite[j - seg_start >> 5] |= 1 << ((j - seg_start & 0x1F) >> 1);
}
}
}
// gather primes past sqrt(n) and are in range
for (int i = Math.max(limit_i + (limit_i - 1) % 2, seg_start); i < seg_end; i += 2)
{
if ((is_composite[i - seg_start >> 5] & 1 << ((i - seg_start & 0x1F) >> 1)) == 0)
{
results[size++] = i;
}
}
}
// resize array to only fit primes
int[] primes = new int[size];
for (int i = 0; i < size; ++i)
{
primes[i] = results[i];
}
return primes;
}


I'm using a fixed 8kb buffer for sieving. Through some testing I found that this size performed best for my setup. I haven't tested it extensively yet for correctness, but it should give correct results. Some basic benchmarking produced promising results. Here's a short summary of timings for generating primes up to 0x3FFFFFFF (~1 billion):

• Segmented code: ~5.8 sec
• Array-storage code: ~9.7 sec
• Arraylist somewhat obfuscated/optimized code: ~17.3 sec
• Naive implementation: ~63 sec

Definitely some good performance gains, but there is still some room for improvement. I'm not sure where I'm going to look for performance next, or when. I may consider multi-threading as that should be a fairly straightforward job. Alternatively, I could try another re-write and try to incorporate some other techniques like a modulo 210 wheel factorization.