Eratosthenes Sieve on Risc PC with StrongARM, version 8

Introduction

Here is an implementation in ARM of the Eratosthenes sieve, especially optimized for the Risc PC with a StrongARM processor (202 MHz). With this configuration, the bandwidth is very low compared to the processor speed. After testing the version 7 of the Eratosthenes sieve, we could notice that a program that writes a constant into memory (with the same kind of memory access, i.e. using the cache by reading non initialised data) is only about 10% faster than version 7. Thus we seek to:

decrease the number of writes to the memory (cache or RAM) by avoiding redundant writes (as in version 7),
use the cache as much as possible by splitting the array of booleans into blocks (that hold in the cache) and by processing these blocks the one after the other (as in version 7),
avoid loading non initialised data into the cache (contrary to version 7): the cache will continuously contain a fixed block where the intermediate operations will be performed; at the end of each iteration, this block will be copied to the array into RAM by using STMs with 4 registers (faster than STRs).

Contrary to version 7, we will not seek to use the write buffer as much as possible... maybe in a future version?

You will find an article (in French) with more details in issue 0C of ARMada News (review published by the ARMada, a French association of users of Acorn machines).

Routines

The ARM routine is below. You can assemble it with !ObjAsm, possibly after replacing the file name h:RegNames. To use it, you can take the small C program below the ARM routine. Otherwise you can call it as follows to search for the primes from 0 to N²-1: register R0 must contain the integer N (higher or equal to 5), register R1 must contain the address (multiple of 32) of an array of 32.⌈N²/32⌉ bytes, register R2 must contain the address (multiple of 32) of another array (for temporary storages) and register R3 must contain the block size (multiple of 480), which must be less or equal to the size of array (R2) minus 8.π(N), where π(N) is the number of primes less or equal to N. At the return, the registers are not modified and the first N² bytes of array (R1) tell whether the corresponding number is a prime (1) or not (0).

Note: the output format (array of bytes) was chosen and fixed in advance. One could have had a faster routine if another format, using less memory (array of bits and/or non-stored even numbers), had been chosen. However, this changes the problem, which consists in finding an optimization for a fixed output format. Anyway the same techniques would be used.

; Eratosthenes Sieve, version 8 (1997-02-11)
; Search for the prime numbers up to N^2-1.

; In:
; R0: integer N >= 5.
; R1: array of 32*CEIL(N^2/32) bytes. Address: multiple of 32.
; R2: array (temporary data). Address: multiple of 32.
; R3: block size (multiple of 480), <= size(R2) - 8*pi(N).

; Out:
; The registers are not modified.
; Results in R1[0..N^2-1]: 0 (not a prime) or 1 (prime).

		GET	h:RegNames	;Define R0, ..., R14, SP, LR.

		AREA	area_erat, READONLY, CODE, PIC
		EXPORT	erat

erat		STMFD	SP!, {R0,R4-R12,LR}
		ADR	R12, offsets
		MOV	R11, #0
		MOV	R10, R2
		MLA	R5, R0, R0, R1
		ADD	R4, R2, R3
		MOV	R0, #0
		STR	R0, [R4]	;Empty list.

; In the following:
; R0: 0.
; R1: array of booleans (will contain the results).
; R2: block where the storages are performed (should be in the cache).
; R3: size of block (R2), multiple of 480.
; R4: list of couples (prime, address).
; R5: end of array (R1).
; R6: prime number p, in general.
; R9: pointer into block (R2).
; R10: address of the array read to find the next prime number
;   (R2 at the first iteration, then R1).
; R11: offset of the image of array (R1) in block (R2).
; R12: array of offsets.

; Load block (R2) into the cache.

		MOV	R9, R2
load_block	LDR	R14, [R9], #32	;32: cache line size.
		CMP	R9, R4
		BCC	load_block

; Block initialisation: suppression of the multiples of 2, 3 and 5 only.
; One stores: 0100 0001 0001 0100 0101 0001 0000 0101 0000 0100 0101 0001
; 0100 0100 0001... (period)

next_block	MOV	R9, R2
		MOV	R8, #&01000000	;0001.
		MOV	R6, #&00000100	;0100.
		ORR	R7, R8, R6	;0101.
init_block	STMIA	R9!, {R6,R8}
		STR	R8, [R9], #4
		STMIA	R9!, {R6-R8}
		STMIA	R9!, {R0,R7}
		STMIA	R9!, {R0,R6-R8}
		STR	R6, [R9], #4
		STMIA	R9!, {R6,R8}
		CMP	R9, R4
		BCC	init_block

		MOV	R8, R4		;R8: beginning of the list
		B	old_prime	;of the (p,addr)'s.

; Suppress the multiples of p in the block.

old_loop	STRB	R0, [R9], R7, LSL #1	;Suppress p(30k+1).
		CMP	R9, R4
		BCS	old_store1
		STRB	R0, [R9], R6, LSL #2	;Suppress p(30k+7).
		CMP	R9, R4
		BCS	old_store2
		STRB	R0, [R9], R6, LSL #1	;Suppress p(30k+11).
		CMP	R9, R4
		BCS	old_store3
		STRB	R0, [R9], R6, LSL #2	;Suppress p(30k+13).
		CMP	R9, R4
		BCS	old_store4
		STRB	R0, [R9], R6, LSL #1	;Suppress p(30k+17).
		CMP	R9, R4
		BCS	old_store5
		STRB	R0, [R9], R6, LSL #2	;Suppress p(30k+19).
		CMP	R9, R4
		BCS	old_store6
		STRB	R0, [R9], R7, LSL #1	;Suppress p(30k+23).
		CMP	R9, R4
		BCS	old_store7
		STRB	R0, [R9], R6, LSL #1	;Suppress p(30k+29).
		CMP	R9, R4
		BCC	old_loop
old_store0	STMDB	R8, {R6,R9}
old_prime	LDMIA	R8!, {R6,R9}	;Next (p,addr) or R6 = 0.
		ADD	R14, PC, R6, LSR #25
		BICS	R6, R6, #&F8000000	;C = 1.
		SUB	R9, R9, R3
		CMPNE	R9, R4		;Note: R9 (odd) != R4 (even).
		ADDCC	R7, R6, R6, LSL #1	;R7 = 3p.
		SUBCC	PC, R14, #4*28
		STRNE	R9, [R8, #-4]
		BNE	old_prime

; New prime numbers...

		SUB	R8, R8, #8
		TEQ	R8, R4
		LDRNE	R6, [R8, #-8]	;R6: last prime in the list
		BNE	last_prime	;if this list isn't empty.
		MOV	R6, #7		;First prime: 7.
		B	new_prime

new_loop	STRB	R0, [R9], R7, LSL #1	;Suppress p(30k+1).
		CMP	R9, R4
		BCS	new_store1
		STRB	R0, [R9], R6, LSL #2	;Suppress p(30k+7).
		CMP	R9, R4
		BCS	new_store2
		STRB	R0, [R9], R6, LSL #1	;Suppress p(30k+11).
		CMP	R9, R4
		BCS	new_store3
		STRB	R0, [R9], R6, LSL #2	;Suppress p(30k+13).
		CMP	R9, R4
		BCS	new_store4
		STRB	R0, [R9], R6, LSL #1	;Suppress p(30k+17).
		CMP	R9, R4
		BCS	new_store5
		STRB	R0, [R9], R6, LSL #2	;Suppress p(30k+19).
		CMP	R9, R4
		BCS	new_store6
		STRB	R0, [R9], R7, LSL #1	;Suppress p(30k+23).
		CMP	R9, R4
		BCS	new_store7
		STRB	R0, [R9], R6, LSL #1	;Suppress p(30k+29).
		CMP	R9, R4
		BCC	new_loop
new_store0	STMIA	R8!, {R6,R9}

last_prime	BIC	R6, R6, #&F8000000
next_odd	ADD	R6, R6, #2	;Next odd number.
		LDRB	R14, [R10, R6]
		TEQ	R14, #0		;Loop while it isn't
		BEQ	next_odd	;a prime number p.

new_prime	MLA	R9, R6, R6, R2
		LDR	R7, [R12, #data-offsets]
		SUB	R9, R9, R11
		MUL	R14, R7, R6
		CMP	R9, R4
		LDRCCB	R14, [R12, R14, LSR #28]
		ADDCC	R7, R6, R6, LSL #1	;R7 = 3p.
		SUBCC	PC, PC, R14, LSL #2

		STR	R0, [R8]	;End of the list of the (p,addr)'s.

; Copy the block into the array.

		MOV	R14, R2
		ADD	R11, R1, R11
copy_block	LDMIA	R14!, {R6-R9}
		STMIA	R11!, {R6-R9}
		CMP	R14, R4
		BCC	copy_block
		CMP	R11, R5
		SUBCC	R11, R11, R1
		MOVCC	R10, R1
		BCC	next_block

; End.

		LDMDB	R12, {R8,R9}	;Correction for the integers
		STMIA	R1, {R8,R9}	;from 0 to 7.
		LDMFD	SP!, {R0,R4-R12,PC}

old_store1	ORR	R6, R6, #&18000000
		B	old_store0
old_store2	ORR	R6, R6, #&30000000
		B	old_store0
old_store3	ORR	R6, R6, #&48000000
		B	old_store0
old_store4	ORR	R6, R6, #&60000000
		B	old_store0
old_store5	ORR	R6, R6, #&78000000
		B	old_store0
old_store6	ORR	R6, R6, #&90000000
		B	old_store0
old_store7	ORR	R6, R6, #&A8000000
		B	old_store0

new_store1	ORR	R6, R6, #&18000000
		B	new_store0
new_store2	ORR	R6, R6, #&30000000
		B	new_store0
new_store3	ORR	R6, R6, #&48000000
		B	new_store0
new_store4	ORR	R6, R6, #&60000000
		B	new_store0
new_store5	ORR	R6, R6, #&78000000
		B	new_store0
new_store6	ORR	R6, R6, #&90000000
		B	new_store0
new_store7	ORR	R6, R6, #&A8000000
		B	new_store0

data		DCD	&11111111
		DCB	0,0,1,1,0,1,0,1
offsets		DCB	0,39,27,0,24,0,0,36,21,0,0,33,0,30,18

		END

The following program takes 3 arguments:

the number N (to search for the primes from 0 to N²-1),
the block size (multiple of 480),
a 0 to display the prime numbers, or a positive integer to have the timing (it is the number of iterations, the higher it is, the more it is accurate).

/* Eratosthenes Sieve - test and timing
 *
 * Usage: erat <N> <block_size> <p>
 * --> Search for the prime numbers up to N^2-1
 * p = 0: display
 * p > 0: benchmark (number of iterations)
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void erat(int, char *, char *, int);

int ceil32(int x)
{
  return (x + 31) & ~31;
}

int main(int argc, char **argv)
{
  int i, n, s, p;
  char *t, *u;
  clock_t start, stop;

  if (argc != 4)
    exit(1);
  if ((n = atoi(argv[1])) < 5)
    exit(2);
  if ((s = atoi(argv[2])) <= 0 || s % 480)
    exit(3);
  if ((p = atoi(argv[3])) < 0)
    exit(4);
  if ((t = malloc(n*(n+8)+s+64)) == NULL)
    exit(5);
  t = (char *) ceil32((int) t);
  u = t + ceil32(n*n);
  if (p)
    {
      start = clock();
      for (i = 0; i < p; i++)
        erat(n, t, u, s);
      stop = clock();
      printf("Time = %.3f ms\n",
             (double) (stop-start) * (1000.0/CLK_TCK) / p);
    }
  else
    {
      erat(n, t, u, s);
      for (i = 0; i < n*n; i++)
        if (t[i])
          printf("%d\n", i);
    }
  return 0;
}

Results

For N = 1000 (search for primes up to 1,000,000), with a block size of 9600 bytes, it takes 30.8 ms. Note that it is very fast, since writing one million bytes to memory with STMs takes 24.0 ms, and writing with STRs takes 39.9 ms.

Download

Archive containing the source and the binary to run from the command line.

Last updated:

XML source: 2003-06-18, 23:39:56 (20)
XSL stylesheet: 2003-08-06, 22:11:54 (488)

webmaster@vinc17.org