## Tuesday, June 25, 2013

### Wheel Factorization with Variable Increments

Just a quick update to my previous wheel factorization code. I found a better way to implement the wheel factorization code which doesn't use bit masks to sieve out values. This version has a decent performance boost, and I think is easier to implement.

# Increment-based Wheel Factorization

The problem lies with this code:

// i is an actual number
for (int i = seg_start; i <= limit_i; i += 2)


The problem is there are numbers being checked that we know cannot be prime. The mask method solved this problem by pre-setting the appropriate bits, but this loop is still checking that flag. To get around this, we need to have a variable increment value. Consider a 30-wheel, then numbers which could be prime have the form:

\begin{bmatrix} 30m + 7& 30m + 11& 30m + 13& 30m + 17& 30m + 19& 30m + 23& 30m + 29& 30m + 31 \end{bmatrix}

To get to the next value, we just need to subtract adjacent values, ending up with the increment map:

\begin{bmatrix} 4 & 2 & 4& 2& 4& 6& 2& 6 \end{bmatrix}

Note that the last 6 is what is required to get from the last element to the start of the next period. Because we never have to check known composite numbers filtered out by the wheel, we don't need to pre-set using a bit mask. The bits are still there and could be set or unset by the sieving process, but we just don't care since we never check to see if these values are prime.

Here's a 210 wheel implementation using this technique:

private static int[] wheel210_inc = new int[] { 2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2,
6, 4, 6, 8, 4, 2, 4, 2, 4, 8, 6, 4, 6, 2, 4, 6, 2, 6, 6, 4, 2, 4, 6, 2, 6, 4, 2, 4, 2,
10, 2, 10 };

public static int[] wheel210_sieve(int n)
{
if (n < 2)
{
return null;
}
else if (n < 3)
{
return new int[] { 2 };
}
else if (n < 5)
{
return new int[] { 2, 3 };
}
else if (n < 7)
{
return new int[] { 2, 3, 5 };
}
else if (n < 11)
{
return new int[] { 2, 3, 5, 7 };
}
int[] results = new int[(int) Math.ceil(1.25506 * n / Math.log(n))];
int wheel_offset = 0;

long[] is_composite = new long[0x1000];
results[0] = 2;
results[1] = 3;
results[2] = 5;
results[3] = 7;
int size = 4;

int seg_count = n - 10 >> 19;
long val = 11;

for (int seg = 0; seg <= seg_count; ++seg)
{
int seg_start = (seg << 19) + 11;
int seg_end = (int) Math.min(n, (seg + 1L << 19) + 11L);
int limit_i = (int) Math.ceil(Math.sqrt(seg_end));

// wheel-factor away small primes
for (int i = 0; i < is_composite.length; ++i)
{
is_composite[i] = 0;
}
// pre-sieve primes we've already found
for (int i = 4; i < size && results[i] <= limit_i; ++i)
{
// start at max of prime squared or minimum multiple of result in range that's not
// also a multiple of 2
long mark_start = seg_start / results[i] * results[i];
if ((seg_start / results[i] & 0x1) == 1)
{
if (mark_start < seg_start)
{
mark_start += 2 * results[i];
}
}
else
{
mark_start += results[i];
}
for (long j = mark_start; j < seg_end; j += results[i] << 1)
{
is_composite[(int) (j - seg_start >> 7)] |= 1L << ((j - seg_start & 0x7FL) >> 1);
}
}

// start sieving
while (val <= limit_i)
{
if ((is_composite[(int) (val - seg_start >> 7)] & 1L << ((val - seg_start & 0x7F) >> 1)) == 0)
{
// val is prime, mark off all multiples
results[size++] = (int) val;
// start at i squared
for (long j = val * val; j < seg_end; j += val << 1)
{
is_composite[(int) (j - seg_start >> 7)] |= 1L << ((j - seg_start & 0x7FL) >> 1);
}
}
val += wheel210_inc[wheel_offset++];

if (wheel_offset >= wheel210_inc.length)
{
wheel_offset = 0;
}
}
// gather primes past sqrt(n) and are in range
while (val < seg_end)
{
if ((is_composite[(int) (val - seg_start >> 7)] & 1L << ((val - seg_start & 0x7FL) >> 1)) == 0)
{
results[size++] = (int) val;
}
val += wheel210_inc[wheel_offset++];
if (wheel_offset >= wheel210_inc.length)
{
wheel_offset = 0;
}
}
}
// resize array to only fit primes
int[] primes = new int[size];
for (int i = 0; i < size; ++i)
{
primes[i] = results[i];
}
return primes;
}


# Benchmark Results

Here are the benchmark results using this technique. I'm using the same hardware as before (Intel i5-430m), going to n=0x3FFFFFFF.

• Segmented Sieve of Eratosthenes + long buffer: 5 sec
• 30 wheel mask + Segmented Sieve of Eratosthenes: 4.35 sec
• 210 wheel mask + Segmented Sieve of Eratosthenes: 4.17 sec
• 2310 wheel mask + Segmented Sieve of Eratosthenes: 4.1 sec
• 30 wheel increments + Segmented Sieve of Eratosthenes: 4.11 sec
• 210 wheel increments + Segmented Sieve of Eratosthenes: 3.68 sec
• 2310 wheel increments + Segmented Sieve of Eratosthenes: 3.45 sec
• 30030 wheel increments + Segmented Sieve of Eratosthenes: 3.36 sec

Some nice performance gains for a fairly simple code change.