Here is an implementation in ARM of the Eratosthenes sieve, especially optimized for the Risc PC with a StrongARM processor (202 MHz). With this configuration, the bandwidth is very low compared to the processor speed. After testing the version 7 of the Eratosthenes sieve, we could notice that a program that writes a constant into memory (with the same kind of memory access, i.e. using the cache by reading non initialised data) is only about 10% faster than version 7. Thus we seek to:
Contrary to version 7, we will not seek to use the write buffer as much as possible... maybe in a future version?
You will find an article (in French) with more details in issue 0C of ARMada News (review published by the ARMada, a French association of users of Acorn machines).
The ARM routine is below. You can assemble it with !ObjAsm, possibly after replacing the file name h:RegNames. To use it, you can take the small C program below the ARM routine. Otherwise you can call it as follows to search for the primes from 0 to N²-1: register R0 must contain the integer N (higher or equal to 5), register R1 must contain the address (multiple of 32) of an array of 32.⌈N²/32⌉ bytes, register R2 must contain the address (multiple of 32) of another array (for temporary storages) and register R3 must contain the block size (multiple of 480), which must be less or equal to the size of array (R2) minus 8.π(N), where π(N) is the number of primes less or equal to N. At the return, the registers are not modified and the first N² bytes of array (R1) tell whether the corresponding number is a prime (1) or not (0).
Note: the output format (array of bytes) was chosen and fixed in advance.
One could have had a faster
routine if another format, using less
memory (array of bits and/or non-stored even numbers), had been chosen.
However, this changes the problem, which consists in finding an optimization
for a fixed output format. Anyway the same techniques would be used.
; Eratosthenes Sieve, version 8 (1997-02-11) ; Search for the prime numbers up to N^2-1. ; In: ; R0: integer N >= 5. ; R1: array of 32*CEIL(N^2/32) bytes. Address: multiple of 32. ; R2: array (temporary data). Address: multiple of 32. ; R3: block size (multiple of 480), <= size(R2) - 8*pi(N). ; Out: ; The registers are not modified. ; Results in R1[0..N^2-1]: 0 (not a prime) or 1 (prime). GET h:RegNames ;Define R0, ..., R14, SP, LR. AREA area_erat, READONLY, CODE, PIC EXPORT erat erat STMFD SP!, {R0,R4-R12,LR} ADR R12, offsets MOV R11, #0 MOV R10, R2 MLA R5, R0, R0, R1 ADD R4, R2, R3 MOV R0, #0 STR R0, [R4] ;Empty list. ; In the following: ; R0: 0. ; R1: array of booleans (will contain the results). ; R2: block where the storages are performed (should be in the cache). ; R3: size of block (R2), multiple of 480. ; R4: list of couples (prime, address). ; R5: end of array (R1). ; R6: prime number p, in general. ; R9: pointer into block (R2). ; R10: address of the array read to find the next prime number ; (R2 at the first iteration, then R1). ; R11: offset of the image of array (R1) in block (R2). ; R12: array of offsets. ; Load block (R2) into the cache. MOV R9, R2 load_block LDR R14, [R9], #32 ;32: cache line size. CMP R9, R4 BCC load_block ; Block initialisation: suppression of the multiples of 2, 3 and 5 only. ; One stores: 0100 0001 0001 0100 0101 0001 0000 0101 0000 0100 0101 0001 ; 0100 0100 0001... (period) next_block MOV R9, R2 MOV R8, #&01000000 ;0001. MOV R6, #&00000100 ;0100. ORR R7, R8, R6 ;0101. init_block STMIA R9!, {R6,R8} STR R8, [R9], #4 STMIA R9!, {R6-R8} STMIA R9!, {R0,R7} STMIA R9!, {R0,R6-R8} STR R6, [R9], #4 STMIA R9!, {R6,R8} CMP R9, R4 BCC init_block MOV R8, R4 ;R8: beginning of the list B old_prime ;of the (p,addr)'s. ; Suppress the multiples of p in the block. old_loop STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+1). CMP R9, R4 BCS old_store1 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+7). CMP R9, R4 BCS old_store2 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+11). CMP R9, R4 BCS old_store3 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+13). CMP R9, R4 BCS old_store4 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+17). CMP R9, R4 BCS old_store5 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+19). CMP R9, R4 BCS old_store6 STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+23). CMP R9, R4 BCS old_store7 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+29). CMP R9, R4 BCC old_loop old_store0 STMDB R8, {R6,R9} old_prime LDMIA R8!, {R6,R9} ;Next (p,addr) or R6 = 0. ADD R14, PC, R6, LSR #25 BICS R6, R6, #&F8000000 ;C = 1. SUB R9, R9, R3 CMPNE R9, R4 ;Note: R9 (odd) != R4 (even). ADDCC R7, R6, R6, LSL #1 ;R7 = 3p. SUBCC PC, R14, #4*28 STRNE R9, [R8, #-4] BNE old_prime ; New prime numbers... SUB R8, R8, #8 TEQ R8, R4 LDRNE R6, [R8, #-8] ;R6: last prime in the list BNE last_prime ;if this list isn't empty. MOV R6, #7 ;First prime: 7. B new_prime new_loop STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+1). CMP R9, R4 BCS new_store1 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+7). CMP R9, R4 BCS new_store2 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+11). CMP R9, R4 BCS new_store3 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+13). CMP R9, R4 BCS new_store4 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+17). CMP R9, R4 BCS new_store5 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+19). CMP R9, R4 BCS new_store6 STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+23). CMP R9, R4 BCS new_store7 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+29). CMP R9, R4 BCC new_loop new_store0 STMIA R8!, {R6,R9} last_prime BIC R6, R6, #&F8000000 next_odd ADD R6, R6, #2 ;Next odd number. LDRB R14, [R10, R6] TEQ R14, #0 ;Loop while it isn't BEQ next_odd ;a prime number p. new_prime MLA R9, R6, R6, R2 LDR R7, [R12, #data-offsets] SUB R9, R9, R11 MUL R14, R7, R6 CMP R9, R4 LDRCCB R14, [R12, R14, LSR #28] ADDCC R7, R6, R6, LSL #1 ;R7 = 3p. SUBCC PC, PC, R14, LSL #2 STR R0, [R8] ;End of the list of the (p,addr)'s. ; Copy the block into the array. MOV R14, R2 ADD R11, R1, R11 copy_block LDMIA R14!, {R6-R9} STMIA R11!, {R6-R9} CMP R14, R4 BCC copy_block CMP R11, R5 SUBCC R11, R11, R1 MOVCC R10, R1 BCC next_block ; End. LDMDB R12, {R8,R9} ;Correction for the integers STMIA R1, {R8,R9} ;from 0 to 7. LDMFD SP!, {R0,R4-R12,PC} old_store1 ORR R6, R6, #&18000000 B old_store0 old_store2 ORR R6, R6, #&30000000 B old_store0 old_store3 ORR R6, R6, #&48000000 B old_store0 old_store4 ORR R6, R6, #&60000000 B old_store0 old_store5 ORR R6, R6, #&78000000 B old_store0 old_store6 ORR R6, R6, #&90000000 B old_store0 old_store7 ORR R6, R6, #&A8000000 B old_store0 new_store1 ORR R6, R6, #&18000000 B new_store0 new_store2 ORR R6, R6, #&30000000 B new_store0 new_store3 ORR R6, R6, #&48000000 B new_store0 new_store4 ORR R6, R6, #&60000000 B new_store0 new_store5 ORR R6, R6, #&78000000 B new_store0 new_store6 ORR R6, R6, #&90000000 B new_store0 new_store7 ORR R6, R6, #&A8000000 B new_store0 data DCD &11111111 DCB 0,0,1,1,0,1,0,1 offsets DCB 0,39,27,0,24,0,0,36,21,0,0,33,0,30,18 END
The following program takes 3 arguments:
/* Eratosthenes Sieve - test and timing * * Usage: erat <N> <block_size> <p> * --> Search for the prime numbers up to N^2-1 * p = 0: display * p > 0: benchmark (number of iterations) */ #include <stdio.h> #include <stdlib.h> #include <time.h> void erat(int, char *, char *, int); int ceil32(int x) { return (x + 31) & ~31; } int main(int argc, char **argv) { int i, n, s, p; char *t, *u; clock_t start, stop; if (argc != 4) exit(1); if ((n = atoi(argv[1])) < 5) exit(2); if ((s = atoi(argv[2])) <= 0 || s % 480) exit(3); if ((p = atoi(argv[3])) < 0) exit(4); if ((t = malloc(n*(n+8)+s+64)) == NULL) exit(5); t = (char *) ceil32((int) t); u = t + ceil32(n*n); if (p) { start = clock(); for (i = 0; i < p; i++) erat(n, t, u, s); stop = clock(); printf("Time = %.3f ms\n", (double) (stop-start) * (1000.0/CLK_TCK) / p); } else { erat(n, t, u, s); for (i = 0; i < n*n; i++) if (t[i]) printf("%d\n", i); } return 0; }
For N = 1000 (search for primes up to 1,000,000), with a block size of 9600 bytes, it takes 30.8 ms. Note that it is very fast, since writing one million bytes to memory with STMs takes 24.0 ms, and writing with STRs takes 39.9 ms.
Archive containing the source and the binary to run from the command line.