## Tuesday, June 18, 2013

### Rudimentary Wheel Factorization using Bit Masks

Time to have more fun with prime numbers! I've made a few updates to my sieving code making it faster and more robust. I've also worked through a basic wheel-factorization implementation, which had reasonable performance benefits.

# Base Segmented Sieve of Eratosthenes Changes

The one thing that really bugs me about Java is the lack of unsigned integer support. That's the primary reason I originally chose to use the char data type because this is the only unsigned primitive in Java. However, after looking through my code, I figured that changing to an int or long buffer doesn't matter. So I changed the code to use longs for the sieve buffer. I also went through the code analyzing operations to see if they would overflow or not. I believe The code should be robust enough to handle all integer inputs safely.

public static int[] seg_sieve(int n)
{
if (n < 2)
{
return null;
}
int[] results = new int[(int) Math.ceil(1.25506 * n / Math.log(n))];

long[] is_composite = new long[0x1000];
results[0] = 2;
int size = 1;

int seg_count = n - 2 >> 19;

for (int seg = 0; seg <= seg_count; ++seg)
{
int seg_start = (seg << 19) + 3;
int seg_end = (int) Math.min(n, (seg + 1L << 19) + 3L);
int limit_i = (int) Math.ceil(Math.sqrt(seg_end));

for (int i = 0; i < is_composite.length; ++i)
{
is_composite[i] = 0;
}
// pre-sieve primes we've already found
for (int i = 1; i < size && results[i] <= limit_i; ++i)
{
// start at max of prime squared or minimum multiple of result in range that's not
// also a multiple of 2
long mark_start = seg_start / results[i] * results[i];
if ((seg_start / results[i] & 0x1) == 1)
{
if (mark_start < seg_start)
{
mark_start += 2 * results[i];
}
}
else
{
mark_start += results[i];
}
for (long j = Math.max((long) results[i] * (long) results[i], mark_start); j < seg_end; j += results[i] << 1)
{
is_composite[(int) (j - seg_start >> 7)] |= 1L << ((j - seg_start & 0x7FL) >> 1);
}
}

// start sieving
for (int i = seg_start; i <= limit_i; i += 2)
{
if ((is_composite[i - seg_start >> 7] & 1L << ((i - seg_start & 0x7F) >> 1)) == 0)
{
// i is prime, mark off all multiples
results[size++] = i;
// start at i squared
for (long j = i * i; j < seg_end; j += i << 1)
{
is_composite[(int) (j - seg_start >> 7)] |= 1L << ((j - seg_start & 0x7FL) >> 1);
}
}
}
// gather primes past sqrt(n) and are in range
for (long i = Math.max(limit_i + (limit_i - 1 & 0x1), seg_start); i < seg_end; i += 2)
{
if ((is_composite[(int) (i - seg_start >> 7)] & 1L << ((i - seg_start & 0x7FL) >> 1)) == 0)
{
results[size++] = (int) i;
}
}
}
// resize array to only fit primes
int[] primes = new int[size];
for (int i = 0; i < size; ++i)
{
primes[i] = results[i];
}
return primes;
}

# Wheel Factorization Basic Implementation

The next area I worked on is adding wheel factorization to remove multiples of small primes quickly. The Wikipedia article does a decent job explaining what Wheel Factorization works, I'm going to focus on how I put this idea into practice. Luckily, my chosen method was remarkably easy to implement. I'm not entirely sure if this is how it's intended to be implemented or not, but it works.

The basic idea I'm using is to use a pre-defined bit-mask which allows me to mark of multiples of small primes quickly. Because this mask is guaranteed to be period, I only need to store a single cycle and repeat the mask as needed. Now because the wheel factors don't line up with the byte boundary limits, the easiest method is to track the number of wheel periods required to align with the data type boundary (in my case, 64-bit byte boundaries).

Suppose we have some wheel with a period $$2m$$ (6, 30, 210, 2310, etc.). This span can be represented using $$m$$ bits because I'm pre-eliminating all even numbers. The question is when does this repeat cycle start on a 64-bit boundary?

As it turns out, we will need exactly $$m$$ 64-bit values before the cycle repeats. Here's some helper code I used to generate the appropriate bit-fields:

/**
*
* @param n
*            use primes less than/equal n, last prime is first seg_start
* @return
*/
public static long[] getWheel(int n)
{
int[] primes = Primality.sieve(n);
int seg_start = primes[primes.length - 1];
int domain = 1;
for (int i = 0; i < primes.length - 1; ++i)
{
domain *= primes[i];
}
System.out.println(domain);
long[] data = new long[domain / 2];
int seg_end = data.length * 128 + seg_start;
for (int j = seg_start; j < seg_end; j += 2)
{
for (int i = 1; i < primes.length - 1; ++i)
{
int prime = primes[i];
if (j % prime == 0)
{
// mark as non-prime
data[j - seg_start >> 7] |= 1L << ((j - seg_start & 0x7F) >> 1);
break;
}
}
}
return data;
}


The only thing remains is to initialize each sieve segment with the appropriate data. I used a simple circular counter to keep track of which element should be loaded next:

// wheel-factor away small primes
for (int i = 0; i < is_composite.length; ++i)
{
// actually has a value, not just 0
is_composite[i] = wheel210[wheel_offset];
wheel_offset = (wheel_offset + 1) % wheel210.length;
}

Here's the completed code using a 210 wheel. I used a static pre-computed buffer because there's really no need to have to compute the buffer at runtime, and it's relatively small.

private static long wheel210[] = new long[] { 0x6b74976b2cda59a4L, 0xb4b349e792cd2d9aL, 0x9a5b34d6e92ed659L, 0x5dacb3696693cf25L,
0x279e4b34b669add2L, 0xd35ba4bb5966d2cdL, 0xcda59a4f3c96696cL, 0x2cd2d9a6b74976b2L, 0x92ed659b4b349e79L, 0x693cf259a5b34d6eL,
0x669add25dacb3696L, 0x966d2cd279e4b34bL, 0xc96696cd35ba4bb5L, 0x74976b2cda59a4f3L, 0xb349e792cd2d9a6bL, 0x5b34d6e92ed659b4L,
0xacb3696693cf259aL, 0x9e4b34b669add25dL, 0x5ba4bb5966d2cd27L, 0xa59a4f3c96696cd3L, 0xd2d9a6b74976b2cdL, 0xed659b4b349e792cL,
0x3cf259a5b34d6e92L, 0x9add25dacb369669L, 0x6d2cd279e4b34b66L, 0x6696cd35ba4bb596L, 0x976b2cda59a4f3c9L, 0x49e792cd2d9a6b74L,
0x34d6e92ed659b4b3L, 0xb3696693cf259a5bL, 0x4b34b669add25dacL, 0xa4bb5966d2cd279eL, 0x9a4f3c96696cd35bL, 0xd9a6b74976b2cda5L,
0x659b4b349e792cd2L, 0xf259a5b34d6e92edL, 0xdd25dacb3696693cL, 0x2cd279e4b34b669aL, 0x96cd35ba4bb5966dL, 0x6b2cda59a4f3c966L,
0xe792cd2d9a6b7497L, 0xd6e92ed659b4b349L, 0x696693cf259a5b34L, 0x34b669add25dacb3L, 0xbb5966d2cd279e4bL, 0x4f3c96696cd35ba4L,
0xa6b74976b2cda59aL, 0x9b4b349e792cd2d9L, 0x59a5b34d6e92ed65L, 0x25dacb3696693cf2L, 0xd279e4b34b669addL, 0xcd35ba4bb5966d2cL,
0x2cda59a4f3c96696L, 0x92cd2d9a6b74976bL, 0xe92ed659b4b349e7L, 0x6693cf259a5b34d6L, 0xb669add25dacb369L, 0x5966d2cd279e4b34L,
0x3c96696cd35ba4bbL, 0xb74976b2cda59a4fL, 0x4b349e792cd2d9a6L, 0xa5b34d6e92ed659bL, 0xdacb3696693cf259L, 0x79e4b34b669add25L,
0x35ba4bb5966d2cd2L, 0xda59a4f3c96696cdL, 0xcd2d9a6b74976b2cL, 0x2ed659b4b349e792L, 0x93cf259a5b34d6e9L, 0x69add25dacb36966L,
0x66d2cd279e4b34b6L, 0x96696cd35ba4bb59L, 0x4976b2cda59a4f3cL, 0x349e792cd2d9a6b7L, 0xb34d6e92ed659b4bL, 0xcb3696693cf259a5L,
0xe4b34b669add25daL, 0xba4bb5966d2cd279L, 0x59a4f3c96696cd35L, 0x2d9a6b74976b2cdaL, 0xd659b4b349e792cdL, 0xcf259a5b34d6e92eL,
0xadd25dacb3696693L, 0xd2cd279e4b34b669L, 0x696cd35ba4bb5966L, 0x76b2cda59a4f3c96L, 0x9e792cd2d9a6b749L, 0x4d6e92ed659b4b34L,
0x3696693cf259a5b3L, 0xb34b669add25dacbL, 0x4bb5966d2cd279e4L, 0xa4f3c96696cd35baL, 0x9a6b74976b2cda59L, 0x59b4b349e792cd2dL,
0x259a5b34d6e92ed6L, 0xd25dacb3696693cfL, 0xcd279e4b34b669adL, 0x6cd35ba4bb5966d2L, 0xb2cda59a4f3c9669L, 0x792cd2d9a6b74976L,
0x6e92ed659b4b349eL, 0x96693cf259a5b34dL, 0x4b669add25dacb36L, 0xb5966d2cd279e4b3L, 0xf3c96696cd35ba4bL };

public static int[] wheel210_sieve(int n)
{
if (n < 2)
{
return null;
}
else if (n < 3)
{
return new int[] { 2 };
}
else if (n < 5)
{
return new int[] { 2, 3 };
}
else if (n < 7)
{
return new int[] { 2, 3, 5 };
}
else if (n < 11)
{
return new int[] { 2, 3, 5, 7 };
}
int[] results = new int[(int) Math.ceil(1.25506 * n / Math.log(n))];
int wheel_offset = 0;

long[] is_composite = new long[0x1000];
results[0] = 2;
results[1] = 3;
results[2] = 5;
results[3] = 7;
int size = 4;

int seg_count = n - 10 >> 19;

for (int seg = 0; seg <= seg_count; ++seg)
{
int seg_start = (seg << 19) + 11;
int seg_end = (int) Math.min(n, (seg + 1L << 19) + 11L);
int limit_i = (int) Math.ceil(Math.sqrt(seg_end));

// wheel-factor away small primes
for (int i = 0; i < is_composite.length; ++i)
{
is_composite[i] = wheel210[wheel_offset];
wheel_offset = (wheel_offset + 1) % wheel210.length;
}
// pre-sieve primes we've already found
for (int i = 4; i < size && results[i] <= limit_i; ++i)
{
// start at max of prime squared or minimum multiple of result in range that's not
// also a multiple of 2
long mark_start = seg_start / results[i] * results[i];
if ((seg_start / results[i] & 0x1) == 1)
{
if (mark_start < seg_start)
{
mark_start += 2 * results[i];
}
}
else
{
mark_start += results[i];
}
for (long j = Math.max((long) results[i] * (long) results[i], mark_start); j < seg_end; j += results[i] << 1)
{
is_composite[(int) (j - seg_start >> 7)] |= 1L << ((j - seg_start & 0x7FL) >> 1);
}
}

// start sieving
for (int i = seg_start; i <= limit_i; i += 2)
{
if ((is_composite[i - seg_start >> 7] & 1L << ((i - seg_start & 0x7F) >> 1)) == 0)
{
// i is prime, mark off all multiples
results[size++] = i;
// start at i squared
for (long j = i * i; j < seg_end; j += i << 1)
{
is_composite[(int) (j - seg_start >> 7)] |= 1L << ((j - seg_start & 0x7FL) >> 1);
}
}
}
// gather primes past sqrt(n) and are in range
for (long i = Math.max(limit_i + (limit_i - 1 & 0x1), seg_start); i < seg_end; i += 2)
{
if ((is_composite[(int) (i - seg_start >> 7)] & 1L << ((i - seg_start & 0x7FL) >> 1)) == 0)
{
results[size++] = (int) i;
}
}
}
// resize array to only fit primes
int[] primes = new int[size];
for (int i = 0; i < size; ++i)
{
primes[i] = results[i];
}
return primes;
}


# Benchmarking

Like always, we must benchmark or code to test for any actual performance benefits. So I ran some tests computing primes up to 0x3FFFFFFF (~1 billion). Here are the typical expected running times on my laptop (Intel i5-430m), which is the same hardware I did my previous testing on:

• Segmented Sieve of Eratosthenes + char buffer (previous best): 5.8 sec
• Segmented Sieve of Eratosthenes + long buffer: 5 sec
• 30 wheel factorization + Segmented Sieve of Eratosthenes + long buffer: 4.35 sec
• 210 wheel factorization + Segmented Sieve of Eratosthenes + long buffer: 4.17 sec
• 2310 wheel factorization + Segmented Sieve of Eratosthenes + long buffer: 4.1 sec

Quite a nice performance benefit compared to before whith wheel factorization. There's not too much more performance benefits to be had by using larger wheels, and there may even be some performance losses due to memory usage and cache utilization. I'm still not sure where primesieve is getting that last bit of performance from, I could parallelize and theoretically get ~2 seconds (assuming 2 threads), but that's still nowhere near primesieve's ~0.26 second runtime even with just one thread. In the future I may try to look into the compression techniques the author claims to be using, and perhaps this will improve cache utilization to the point where I could get an order of magnitude performance boost.