Version française

Eratosthenes Sieve on Risc PC with StrongARM, version 7

Introduction

Here is an implementation in ARM of the Eratosthenes sieve, especially optimized for the Risc PC with a StrongARM processor (202 MHz). With this configuration, the bandwidth is very low compared to the processor speed. Thus we seek to:

You will find an article (in French) with more details in issue 0C of ARMada News (review published by the ARMada, a French association of users of Acorn machines).

Routines

The ARM routine is below. You can assemble it with !ObjAsm, possibly after replacing the file name h:RegNames. To use it, you can take the small C program below the ARM routine. Otherwise you can call it as follows to search for the primes from 0 to N²-1: register R0 must contain the integer N (higher or equal to 5), register R1 must contain the address (multiple of 32) of an array of 4.⌊N²/4⌋ + 128 + π(N) bytes, where π(N) is the number of primes less or equal to N, and register R2 must contain the block size (multiple of 32 higher or equal to 64). At the return, the registers are not modified and the first N² bytes of the array tell whether the corresponding number is a prime (1) or not (0).

Note: the output format (array of bytes) was chosen and fixed in advance. One could have had a faster routine if another format, using less memory (array of bits and/or non-stored even numbers), had been chosen. However, this changes the problem, which consists in finding an optimization for a fixed output format. Anyway the same techniques would be used.

; Eratosthenes Sieve, version 7 (1997-02-09)
; Search for the prime numbers up to N^2-1.

; In:
; R0: integer N >= 5.
; R1: array of 4*[N^2/4]+128+8*pi(N) bytes. Address: multiple of 32.
; R2: block size (multiple of 32, >= 64).

; Out:
; The registers are not modified.
; Results in R1[0..N^2-1]: 0 (not a prime) or 1 (prime).

		GET	h:RegNames	;Define R0, ..., R14, SP, LR.

		AREA	area_erat, READONLY, CODE, PIC
		EXPORT	erat

erat		STMFD	SP!, {R0,R3-R12,LR}
		MLA	R3, R0, R0, R1	;R3 = R1+N^2.
		MOV	R5, #0		;R5 = 0.
		ADR	R14, offsets
		BIC	R3, R3, #1	;R3 = 2*FLOOR(R3/2).
		ADD	R4, R3, #116	;R4 = 4*FLOOR(R3/4)+116.
		MOV	R7, R1
		STMIB	R4!, {R1,R5}

; Charge le premier bloc.

		ADD	R8, R7, R2	;End of the first block...
		CMP	R8, R3
		MOVHI	R8, R3		;or end of the array of booleans.
		MOV	R0, R1
load_first	LDRB	R11, [R0], #32
		CMP	R0, R8
		BCC	load_first

; In the following:
; R1: array.
; R2: block size.
; R3: R1+N^2 (N even) or R1+N^2-1 (N odd).
; R4: temporary data: couples (prime, address).
; R5: 0.
; R6: current prime number p.
; R7: end of the current block.
; R14: array of offsets (length: 15 words).

next_block	MOV	R8, R7		;Beginning of the next block.
		ADD	R7, R7, R2	;End of the block...
		CMP	R7, R3
		MOVHI	R7, R3		;or end of the array of booleans.

; Block initialisation and suppression of the multiples of 2, 3 and 5 (one
; can exceed the end of the block by at most 116 bytes): one stores 0100
; 0001 0001 0100 0101 0001 0000 0101 0000 0100 0101 0001 0100 0100 0001.

		LDR	R10, [R4, #-4]
		MOV	R9, #&01000000	;0001.
		MOV	R6, #&00000100	;0100.
		ORR	R8, R9, R6	;0101.
suppr_235	STMIA	R10!, {R6,R9}
		STR	R9, [R10], #4
		STMIA	R10!, {R6,R8,R9}
		STMIA	R10!, {R5,R8}
		STMIA	R10!, {R5,R6,R8,R9}
		STR	R6, [R10], #4
		STMIA	R10!, {R6,R9}
		CMP	R10, R7
		BCC	suppr_235
		STR	R10, [R4, #-4]
		MOV	R8, R4
		ADD	R0, R7, R2	;End of the block...
		CMP	R0, R3		;or end of the array of booleans
		BICHI	R0, R3, #&1F	;(lower multiple of 32).
		B	old1_prime

; Load the next block and suppress the multiples of p in the current block.

old1_loop	ADDS	R11, R11, #&80000000	;LDR: once out of 2.
		CMPCS	R0, R7
		LDRCSB	R11, [R0], #-32		;Load the next block...
		STRB	R5, [R9], R12, LSL #1	;Suppress p(30k+1).
		CMP	R9, R7
		BCS	old1_store1
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+7).
		CMP	R9, R7
		BCS	old1_store2
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+11).
		CMP	R9, R7
		BCS	old1_store3
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+13).
		CMP	R9, R7
		BCS	old1_store4
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+17).
		CMP	R9, R7
		BCS	old1_store5
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+19).
		CMP	R9, R7
		BCS	old1_store6
		STRB	R5, [R9], R12, LSL #1	;Suppress p(30k+23).
		CMP	R9, R7
		BCS	old1_store7
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+29).
		CMP	R9, R7
		BCC	old1_loop		;Loop.
old1_store0	STMDB	R8, {R6,R9}	;Store (p,addr).
		CMP	R0, R7		;Branch if the next block
		BCC	old2_prime	;is entirely loaded.
old1_prime	LDMIA	R8!, {R6,R9}	;Next (p,addr) or R6 = 0.
		ADD	R10, PC, R6, LSR #25
		BICS	R6, R6, #&F8000000	;Note: C = 1.
		CMPNE	R9, R7		;Note: R9 (odd) != R7 (even).
		ADDCC	R12, R6, R6, LSL #1	;R12 = 3p.
		SUBCC	PC, R10, #4*30	;Branch to the right STRB.
		BNE	old1_prime	;Branch if R6 != 0.
load_block	CMP	R0, R7		;Load the end of the next block.
		LDRCSB	R11, [R0], #-32
		BCS	load_block
		B	first_new

old2_loop	STRB	R5, [R9], R12, LSL #1	;Suppress p(30k+1).
		CMP	R9, R7
		BCS	old2_store1
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+7).
		CMP	R9, R7
		BCS	old2_store2
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+11).
		CMP	R9, R7
		BCS	old2_store3
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+13).
		CMP	R9, R7
		BCS	old2_store4
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+17).
		CMP	R9, R7
		BCS	old2_store5
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+19).
		CMP	R9, R7
		BCS	old2_store6
		STRB	R5, [R9], R12, LSL #1	;Suppress p(30k+23).
		CMP	R9, R7
		BCS	old2_store7
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+29).
		CMP	R9, R7
		BCC	old2_loop		;Loop.
old2_store0	STMDB	R8, {R6,R9}	;Store (p,addr).
old2_prime	LDMIA	R8!, {R6,R9}	;Next (p,addr) or R6 = 0.
		ADD	R10, PC, R6, LSR #25
		BICS	R6, R6, #&F8000000	;Note: C = 1.
		CMPNE	R9, R7		;Note: R9 (odd) != R7 (even).
		ADDCC	R12, R6, R6, LSL #1	;R12 = 3p.
		SUBCC	PC, R10, #4*28	;Branch to the right STRB.
		BNE	old2_prime	;Branch if R6 != 0.

; New prime numbers.

first_new	LDR	R11, [R14, #data-offsets] ;&11111111.
		SUB	R8, R8, #8
		TEQ	R8, R4
		LDRNE	R6, [R8, #-8]	;R6: last prime in the list
		BNE	last_prime	;if this list isn't empty.
		MOV	R6, #7		;First prime: 7.
		B	new_prime

new_loop	STRB	R5, [R9], R12, LSL #1	;Suppress p(30k+1).
		CMP	R9, R7
		BCS	new_store1
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+7).
		CMP	R9, R7
		BCS	new_store2
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+11).
		CMP	R9, R7
		BCS	new_store3
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+13).
		CMP	R9, R7
		BCS	new_store4
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+17).
		CMP	R9, R7
		BCS	new_store5
		STRB	R5, [R9], R6, LSL #2	;Suppress p(30k+19).
		CMP	R9, R7
		BCS	new_store6
		STRB	R5, [R9], R12, LSL #1	;Suppress p(30k+23).
		CMP	R9, R7
		BCS	new_store7
		STRB	R5, [R9], R6, LSL #1	;Suppress p(30k+29).
		CMP	R9, R7
		BCC	new_loop		;Loop.
new_store0	STMIA	R8!,{R6,R9}	;Store (p,addr).

last_prime	BIC	R6, R6, #&F8000000
next_odd	ADD	R6, R6, #2	;Next odd number.
		LDRB	R10, [R1, R6]
		TEQ	R10, #0		;Loop while it isn't
		BEQ	next_odd	;a prime number p.

new_prime	MLA	R9, R6, R6, R1	;R9: pointer to the first
		MUL	R10, R11, R6	;multiple to suppress: p^2.
		CMP	R9, R7
		LDRCCB	R10, [R14, R10, LSR #28]
		ADDCC	R12, R6, R6, LSL #1	;R12 = 3p.
		SUBCC	PC, PC, R10, LSL #2

		STR	R5, [R8]	;End of the list of the (p,addr)'s.
		TEQ	R7, R3
		BNE	next_block

		LDMDB	R14, {R11,R12}	;Correction for the integers
		STMIA	R1, {R11,R12}	;from 0 to 7.
		LDMFD	SP!, {R0,R3-R12,PC}

old1_store1	ORR	R6, R6, #&18000000
		B	old1_store0
old1_store2	ORR	R6, R6, #&30000000
		B	old1_store0
old1_store3	ORR	R6, R6, #&48000000
		B	old1_store0
old1_store4	ORR	R6, R6, #&60000000
		B	old1_store0
old1_store5	ORR	R6, R6, #&78000000
		B	old1_store0
old1_store6	ORR	R6, R6, #&90000000
		B	old1_store0
old1_store7	ORR	R6, R6, #&A8000000
		B	old1_store0

old2_store1	ORR	R6, R6, #&18000000
		B	old2_store0
old2_store2	ORR	R6, R6, #&30000000
		B	old2_store0
old2_store3	ORR	R6, R6, #&48000000
		B	old2_store0
old2_store4	ORR	R6, R6, #&60000000
		B	old2_store0
old2_store5	ORR	R6, R6, #&78000000
		B	old2_store0
old2_store6	ORR	R6, R6, #&90000000
		B	old2_store0
old2_store7	ORR	R6, R6, #&A8000000
		B	old2_store0

new_store1	ORR	R6, R6, #&18000000
		B	new_store0
new_store2	ORR	R6, R6, #&30000000
		B	new_store0
new_store3	ORR	R6, R6, #&48000000
		B	new_store0
new_store4	ORR	R6, R6, #&60000000
		B	new_store0
new_store5	ORR	R6, R6, #&78000000
		B	new_store0
new_store6	ORR	R6, R6, #&90000000
		B	new_store0
new_store7	ORR	R6, R6, #&A8000000
		B	new_store0

data		DCD	&11111111
		DCB	0,0,1,1,0,1,0,1
offsets		DCB	0,37,25,0,22,0,0,34,19,0,0,31,0,28,16

		END

The following program takes 3 arguments:

  1. the number N (to search for the primes from 0 to N²-1),
  2. the block size (multiple of 32),
  3. a 0 to display the prime numbers, or a positive integer to have the timing (it is the number of iterations, the higher it is, the more it is accurate).

The optimal block size should be about a little less than C/2 - 4N/Ln(N), where C is the size of the data cache. When N becomes very high, the blocks can become very small; but the value of N is first limited by the quantity of RAM.

/* Eratosthenes Sieve - test and timing
 *
 * Usage: erat <N> <block_size> <p>
 * --> Search for the prime numbers up to N^2-1
 * p = 0: display
 * p > 0: benchmark (number of iterations)
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void erat(int, char *, int);

int main(int argc, char **argv)
{
  int i,n,s,p;
  char *t;
  clock_t start,stop;

  if (argc != 4) exit(1);
  if ((n = atoi(argv[1])) < 5) exit(2);
  if ((s = atoi(argv[2])) < 64 || s%32) exit(3);
  if ((p = atoi(argv[3])) < 0) exit(4);
  if (!(t = malloc(n*(n+8)+160))) exit(5);
  t = (char *) (((int) t + 31) & ~31);
  if (p) {
      start = clock();
      for(i=0;i<p;i++) erat(n,t,s);
      stop = clock();
      printf("Time = %.3f ms\n",
             (double) (stop-start) * (1000.0/CLK_TCK) / p);
    }
  else {
      erat(n,t,s);
      for(i=0;i<n*n;i++)
        if (t[i]) printf("%d\n", i);
    }
  return 0;
}

Results

For N = 1000 (search for primes up to 1,000,000), with a block size of 7168 bytes (7 KB), it takes 43.3 ms.

See also a faster version: version 8.



Valid XHTML 1.0!
Last updated:
webmaster@vinc17.org