Here is an implementation in ARM of the Eratosthenes sieve, especially optimized for the Risc PC with a StrongARM processor (202 MHz). With this configuration, the bandwidth is very low compared to the processor speed. Thus we seek to:
You will find an article (in French) with more details in issue 0C of ARMada News (review published by the ARMada, a French association of users of Acorn machines).
The ARM routine is below. You can assemble it with !ObjAsm, possibly after replacing the file name h:RegNames. To use it, you can take the small C program below the ARM routine. Otherwise you can call it as follows to search for the primes from 0 to N²-1: register R0 must contain the integer N (higher or equal to 5), register R1 must contain the address (multiple of 32) of an array of 4.⌊N²/4⌋ + 128 + π(N) bytes, where π(N) is the number of primes less or equal to N, and register R2 must contain the block size (multiple of 32 higher or equal to 64). At the return, the registers are not modified and the first N² bytes of the array tell whether the corresponding number is a prime (1) or not (0).
Note: the output format (array of bytes) was chosen and fixed in advance.
One could have had a faster
routine if another format, using less
memory (array of bits and/or non-stored even numbers), had been chosen.
However, this changes the problem, which consists in finding an optimization
for a fixed output format. Anyway the same techniques would be used.
; Eratosthenes Sieve, version 7 (1997-02-09) ; Search for the prime numbers up to N^2-1. ; In: ; R0: integer N >= 5. ; R1: array of 4*[N^2/4]+128+8*pi(N) bytes. Address: multiple of 32. ; R2: block size (multiple of 32, >= 64). ; Out: ; The registers are not modified. ; Results in R1[0..N^2-1]: 0 (not a prime) or 1 (prime). GET h:RegNames ;Define R0, ..., R14, SP, LR. AREA area_erat, READONLY, CODE, PIC EXPORT erat erat STMFD SP!, {R0,R3-R12,LR} MLA R3, R0, R0, R1 ;R3 = R1+N^2. MOV R5, #0 ;R5 = 0. ADR R14, offsets BIC R3, R3, #1 ;R3 = 2*FLOOR(R3/2). ADD R4, R3, #116 ;R4 = 4*FLOOR(R3/4)+116. MOV R7, R1 STMIB R4!, {R1,R5} ; Charge le premier bloc. ADD R8, R7, R2 ;End of the first block... CMP R8, R3 MOVHI R8, R3 ;or end of the array of booleans. MOV R0, R1 load_first LDRB R11, [R0], #32 CMP R0, R8 BCC load_first ; In the following: ; R1: array. ; R2: block size. ; R3: R1+N^2 (N even) or R1+N^2-1 (N odd). ; R4: temporary data: couples (prime, address). ; R5: 0. ; R6: current prime number p. ; R7: end of the current block. ; R14: array of offsets (length: 15 words). next_block MOV R8, R7 ;Beginning of the next block. ADD R7, R7, R2 ;End of the block... CMP R7, R3 MOVHI R7, R3 ;or end of the array of booleans. ; Block initialisation and suppression of the multiples of 2, 3 and 5 (one ; can exceed the end of the block by at most 116 bytes): one stores 0100 ; 0001 0001 0100 0101 0001 0000 0101 0000 0100 0101 0001 0100 0100 0001. LDR R10, [R4, #-4] MOV R9, #&01000000 ;0001. MOV R6, #&00000100 ;0100. ORR R8, R9, R6 ;0101. suppr_235 STMIA R10!, {R6,R9} STR R9, [R10], #4 STMIA R10!, {R6,R8,R9} STMIA R10!, {R5,R8} STMIA R10!, {R5,R6,R8,R9} STR R6, [R10], #4 STMIA R10!, {R6,R9} CMP R10, R7 BCC suppr_235 STR R10, [R4, #-4] MOV R8, R4 ADD R0, R7, R2 ;End of the block... CMP R0, R3 ;or end of the array of booleans BICHI R0, R3, #&1F ;(lower multiple of 32). B old1_prime ; Load the next block and suppress the multiples of p in the current block. old1_loop ADDS R11, R11, #&80000000 ;LDR: once out of 2. CMPCS R0, R7 LDRCSB R11, [R0], #-32 ;Load the next block... STRB R5, [R9], R12, LSL #1 ;Suppress p(30k+1). CMP R9, R7 BCS old1_store1 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+7). CMP R9, R7 BCS old1_store2 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+11). CMP R9, R7 BCS old1_store3 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+13). CMP R9, R7 BCS old1_store4 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+17). CMP R9, R7 BCS old1_store5 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+19). CMP R9, R7 BCS old1_store6 STRB R5, [R9], R12, LSL #1 ;Suppress p(30k+23). CMP R9, R7 BCS old1_store7 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+29). CMP R9, R7 BCC old1_loop ;Loop. old1_store0 STMDB R8, {R6,R9} ;Store (p,addr). CMP R0, R7 ;Branch if the next block BCC old2_prime ;is entirely loaded. old1_prime LDMIA R8!, {R6,R9} ;Next (p,addr) or R6 = 0. ADD R10, PC, R6, LSR #25 BICS R6, R6, #&F8000000 ;Note: C = 1. CMPNE R9, R7 ;Note: R9 (odd) != R7 (even). ADDCC R12, R6, R6, LSL #1 ;R12 = 3p. SUBCC PC, R10, #4*30 ;Branch to the right STRB. BNE old1_prime ;Branch if R6 != 0. load_block CMP R0, R7 ;Load the end of the next block. LDRCSB R11, [R0], #-32 BCS load_block B first_new old2_loop STRB R5, [R9], R12, LSL #1 ;Suppress p(30k+1). CMP R9, R7 BCS old2_store1 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+7). CMP R9, R7 BCS old2_store2 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+11). CMP R9, R7 BCS old2_store3 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+13). CMP R9, R7 BCS old2_store4 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+17). CMP R9, R7 BCS old2_store5 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+19). CMP R9, R7 BCS old2_store6 STRB R5, [R9], R12, LSL #1 ;Suppress p(30k+23). CMP R9, R7 BCS old2_store7 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+29). CMP R9, R7 BCC old2_loop ;Loop. old2_store0 STMDB R8, {R6,R9} ;Store (p,addr). old2_prime LDMIA R8!, {R6,R9} ;Next (p,addr) or R6 = 0. ADD R10, PC, R6, LSR #25 BICS R6, R6, #&F8000000 ;Note: C = 1. CMPNE R9, R7 ;Note: R9 (odd) != R7 (even). ADDCC R12, R6, R6, LSL #1 ;R12 = 3p. SUBCC PC, R10, #4*28 ;Branch to the right STRB. BNE old2_prime ;Branch if R6 != 0. ; New prime numbers. first_new LDR R11, [R14, #data-offsets] ;&11111111. SUB R8, R8, #8 TEQ R8, R4 LDRNE R6, [R8, #-8] ;R6: last prime in the list BNE last_prime ;if this list isn't empty. MOV R6, #7 ;First prime: 7. B new_prime new_loop STRB R5, [R9], R12, LSL #1 ;Suppress p(30k+1). CMP R9, R7 BCS new_store1 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+7). CMP R9, R7 BCS new_store2 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+11). CMP R9, R7 BCS new_store3 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+13). CMP R9, R7 BCS new_store4 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+17). CMP R9, R7 BCS new_store5 STRB R5, [R9], R6, LSL #2 ;Suppress p(30k+19). CMP R9, R7 BCS new_store6 STRB R5, [R9], R12, LSL #1 ;Suppress p(30k+23). CMP R9, R7 BCS new_store7 STRB R5, [R9], R6, LSL #1 ;Suppress p(30k+29). CMP R9, R7 BCC new_loop ;Loop. new_store0 STMIA R8!,{R6,R9} ;Store (p,addr). last_prime BIC R6, R6, #&F8000000 next_odd ADD R6, R6, #2 ;Next odd number. LDRB R10, [R1, R6] TEQ R10, #0 ;Loop while it isn't BEQ next_odd ;a prime number p. new_prime MLA R9, R6, R6, R1 ;R9: pointer to the first MUL R10, R11, R6 ;multiple to suppress: p^2. CMP R9, R7 LDRCCB R10, [R14, R10, LSR #28] ADDCC R12, R6, R6, LSL #1 ;R12 = 3p. SUBCC PC, PC, R10, LSL #2 STR R5, [R8] ;End of the list of the (p,addr)'s. TEQ R7, R3 BNE next_block LDMDB R14, {R11,R12} ;Correction for the integers STMIA R1, {R11,R12} ;from 0 to 7. LDMFD SP!, {R0,R3-R12,PC} old1_store1 ORR R6, R6, #&18000000 B old1_store0 old1_store2 ORR R6, R6, #&30000000 B old1_store0 old1_store3 ORR R6, R6, #&48000000 B old1_store0 old1_store4 ORR R6, R6, #&60000000 B old1_store0 old1_store5 ORR R6, R6, #&78000000 B old1_store0 old1_store6 ORR R6, R6, #&90000000 B old1_store0 old1_store7 ORR R6, R6, #&A8000000 B old1_store0 old2_store1 ORR R6, R6, #&18000000 B old2_store0 old2_store2 ORR R6, R6, #&30000000 B old2_store0 old2_store3 ORR R6, R6, #&48000000 B old2_store0 old2_store4 ORR R6, R6, #&60000000 B old2_store0 old2_store5 ORR R6, R6, #&78000000 B old2_store0 old2_store6 ORR R6, R6, #&90000000 B old2_store0 old2_store7 ORR R6, R6, #&A8000000 B old2_store0 new_store1 ORR R6, R6, #&18000000 B new_store0 new_store2 ORR R6, R6, #&30000000 B new_store0 new_store3 ORR R6, R6, #&48000000 B new_store0 new_store4 ORR R6, R6, #&60000000 B new_store0 new_store5 ORR R6, R6, #&78000000 B new_store0 new_store6 ORR R6, R6, #&90000000 B new_store0 new_store7 ORR R6, R6, #&A8000000 B new_store0 data DCD &11111111 DCB 0,0,1,1,0,1,0,1 offsets DCB 0,37,25,0,22,0,0,34,19,0,0,31,0,28,16 END
The following program takes 3 arguments:
The optimal block size should be about a little less than C/2 - 4N/Ln(N), where C is the size of the data cache. When N becomes very high, the blocks can become very small; but the value of N is first limited by the quantity of RAM.
/* Eratosthenes Sieve - test and timing * * Usage: erat <N> <block_size> <p> * --> Search for the prime numbers up to N^2-1 * p = 0: display * p > 0: benchmark (number of iterations) */ #include <stdio.h> #include <stdlib.h> #include <time.h> void erat(int, char *, int); int main(int argc, char **argv) { int i,n,s,p; char *t; clock_t start,stop; if (argc != 4) exit(1); if ((n = atoi(argv[1])) < 5) exit(2); if ((s = atoi(argv[2])) < 64 || s%32) exit(3); if ((p = atoi(argv[3])) < 0) exit(4); if (!(t = malloc(n*(n+8)+160))) exit(5); t = (char *) (((int) t + 31) & ~31); if (p) { start = clock(); for(i=0;i<p;i++) erat(n,t,s); stop = clock(); printf("Time = %.3f ms\n", (double) (stop-start) * (1000.0/CLK_TCK) / p); } else { erat(n,t,s); for(i=0;i<n*n;i++) if (t[i]) printf("%d\n", i); } return 0; }
For N = 1000 (search for primes up to 1,000,000), with a block size of 7168 bytes (7 KB), it takes 43.3 ms.
See also a faster version: version 8.