Armpi 3 - Calculating Pi

(Document written in August 1995.)

Introduction

Armpi 3 is a program that calculates π in decimal for the ARM processors (in particular for the Acorn Risc PC). The following formula will be used: π/4 = arctan(1/2) + arctan(1/3) with arctan x = x - x³/3 + x⁵/5 - x⁷/7 + ... because it is simple and converges quite fast, although other faster formulas are often preferred. I already used this formula on the Apple II (6502) and the Atari ST (68000); you can see the comparisons.

The results will be given at the end.

Running Conditions

The program given here is a standard routine whose return address must be in register LR (Link Register). At the call, register R0 contains the wanted number of digits up to round errors, and register R1 contains the address of the memory block reserved for the calculations and the storage of the result. This memory block must hold at least 2N + B + 1544 bytes (see below), where B is the least multiple of 4 greater or equal to 5N/12. R0 and R1 must be multiples of 4.

At the return, registers R0, R1 and R13 are not modified, and the result is stored at the address given by R1, least significant digit first, with one decimal digit in each byte.

Note: the ARM must be in little-endian configuration.

The Algorithm

The calculations will be performed in binary. Then the final result will be converted into decimal. For N decimal digits, one needs slightly fewer than 5N/12 bytes (1 byte = 8 binary digits). Thus the calculations will be performed with B-byte numbers, where B is the least multiple of 4 greater or equal to 5N/12 (it's better to choose a too high number of digits than a too small one).

Divisions by a same number will be grouped, i.e. instead of calculating and adding (1/2ⁿ)/n and (1/3ⁿ)/n, one will calculate (1/2ⁿ+1/3ⁿ)/n. The calculation of 1/2ⁿ consists of a shift; in fact, a 1 will be added to 1/3ⁿ, which will take a minor constant time. The calculation of 1/3ⁿ consists of a division by 9 at each iteration; a special, very fast algorithm will be chosen for this division: the idea is to multiply by the inverse of 9 in Z/2³²Z. The division by n will be performed by calculating tables and reading in these tables. The additions and subtractions will be performed in a conventional way. Finally there is the conversion into decimal, which relatively takes a long time; the general algorithm is conventional, but the calculations will be performed on 4 decimal digits in parallel (there are 4 digits in a 32-bit word), with a redundant representation (a digit may be greater than 9).

The Program

The following sections contain the explanations of each part of the program. The assembly source is given in a separate file. Here are the different parts:

the division by 9,
the division by n,
the calculation of the table of the partial quotients for the division by n,
the calculation of the table of the products for the division by n,
the conversion into decimal,
the main part.

The routines given before the main part will be used as macros, but here the syntax is different: for more readability, the $ character for the parameters will be omitted. Perl scripts enable to extract the assembly source from the given file in accordance with the standard syntax or the source for extASM. Another Perl script creates a BASIC program (for the Risc PC) from the assembly source returned by the first script.

Long Division by 9

We want to divide a large number by 9; the result must be stored in place of the number. We'll perform a series of divisions by 9 of 36-bit numbers N = (NA,NB) (where the most significant 4 bits NA form a number less than 9), with a 32-bit quotient and a 4-bit remainder.

First we get rid of the most significant 4 bits by subtracting NA × &FFFFFFFC, then by adding 36 if the result is negative. So we obtain a partial quotient equal to NA.a (possibly + 4) with a = &FFFFFFFC / 9 = &1C71C71C, the remainder being the new value of NB. Now we have to divide a 32-bit number by 9.

The inverse of 9 in Z/2³²Z is 2a+1. Let R be the 32-bit result of N.(2a+1). The following table gives the value of N modulo 9 as a function of the interval containing R:

[0,a]......... N % 9 = 0	[a+1,2a]...... N % 9 = 5
[2a+1,3a+1]... N % 9 = 1	[3a+2,4a+1]... N % 9 = 6
[4a+2,5a+2]... N % 9 = 2	[5a+3,6a+2]... N % 9 = 7
[6a+3,7a+3]... N % 9 = 3	[7a+4,8a+3]... N % 9 = 8
[8a+4,9a+3]... N % 9 = 4

NB is multiplied by 2a+1, which is written in base 2: 00111000111000111000111000111001 and which can also be written, using signed digits: 0100m00100m00100m00100m00100m001 where m = -1. So the multiplication can be performed with 1 SUB and 3 ADDs. Then the remainder is determined to have a result in [0,a].

Long Division by `n`

Here we want to divide a large number by a 32-bit number. The quotient will be stored in another block of memory (contrary to the division by 9). The algorithm given here is valid only with a divisor whose MSB (most significant bit) is zero; thus the divisor is in fact a 31-bit number (this also corresponds to the positive divisors in signed arithmetic), but this is widely sufficient for the calculations that will be performed.

The long division is performed in a conventional way in base 2³². At each iteration, the remainder of the division performed at the previous iteration (32-bit word less than the divisor) is known; it is zero at the first iteration. One has to divide the 64-bit number, called partial dividend, composed of the remainder and the next digit of the dividend (32-bit number). One obtains a digit of the quotient (32-bit number) and the remainder, which will be used at the next iteration. The partial dividend and the divisor will be normalized, i.e. left-shifted by a same number of bits so that the MSB of the divisor is 1; since the dividend and the divisor are multiplied by a same number, the quotient is unchanged and the remainder is automatically normalized during the calculation.

Now let's explain the division of the 64-bit partial dividend X = (NA,NB) (2 words) by the 32-bit divisor DV (whose MSB is 1). The division will be performed in 4 identical steps, with tables calculated before the global division (these tables only depend on the divisor). By hypothesis, one has NA < DV.

Considering the most significant 9 bits of the normalized partial dividend, a table with (at least) 512 entries gives 8 bits of the quotient, which form a number NQ called partial quotient. In fact, there may generally be 2 possible partial quotients: NQ and NQ+1; to have a unique quotient, more bits (up to 38) should be considered, which are too many for a table. Thus one assumes that the partial quotient is NQ (as if the bits following the first 9 bits were all zeros), and the quotient will be fixed later if it is NQ+1. Then (NQ<<25).DV must be subtracted from X, by reading NQ.DV in a 256-entry table (each entry being a possible partial quotient), and the result must be shifted by 8 bits to the left. NQ.DV is a 39-bit number (NQ has 8 bits and DV has 31 bits). But if NQ is the real quotient, we know that the most significant 8 bits of X-NQ.DV are zeros, and in this case, the most significant 8 bits of NQ.DV need not be represented; only the least significant bit of these 8 bits is useful to determine the quotient. Therefore only 32 bits of the product NQ.DV will be stored, which exactly form a word P (thus this algorithm is not valid for a 32-bit divisor). So we calculate X' = ((X<<7)-P)<<1. If the bit lost in the last shift is 1 or if X' ≥ DV, then the real quotient is NQ+1, and not NQ; the possible correction of the quotient and X' is done now. These calculations are performed 3 more times. After 4 steps, the 32 bits of the quotient are obtained (and stored) and the remainder is normalized.

Computation of the Table of the Partial Quotients

This table is calculated in the following way, where DV is the normalized divisor:

(NA,NB) = 0 and NQ = 0.
Repeat
- Store NQ into the table.
- Add 2.DV to (NA,NB).
- Store NQ into the table if a carry is generated by the addition of the least significant words.
- Increase NQ.
While NQ<256.

NA contains the most significant 9 bits of the products NQ.DV. 256 to 512 values are stored.

Computation of the Table of the Products

This table is calculated in the following way, where DV is the normalized divisor:

T1 = 0.
Repeat 256 times:
- Store T1 into the table.
- Add DV/2 to T1.

Note: since the divisor is a 31-bit number, the LSB of DV is 0. In the additions, the most significant bits (up to 7 bits) are lost (see Long Division by n).

Conversion Into Base 10

We want to convert a binary number, defined up to a power of 2, into base 10: π (in fact, π/4 was calculated, but 4 is a power of 2). 3 variables (long numbers) will be used during the conversion: the binary number, a decimal number T equal to a power of 2 (the exponent is decreased at each iteration, and becomes negative), and the temporary result R in base 10 (more and more accurate). The two decimal numbers have the same number of digits (including the non significant zeros) and the value of the least significant digit must be (slightly) greater than the value of the least significant digit of the binary number.

Since the most significant bit of the binary number π has the value 2, the first considered power of 2 is 2. The initial value of R is obviously 0. The bits of the binary number are successively read. At each iteration, T is added to R if the bit is 1; then T is divided by 2. One iterates until T is 0 (due to lack of precision).

The decimal numbers have one digit per byte. Like the binary numbers, the decimal numbers are stored with their least significant digit first: this is due to the fact that the ARM is little-endian by default.

In order to optimise, the numbers T and R are interlaced in a 32-bit way; the words of T are stored before the corresponding words of R. This optimisation is specific to the ARM, which can read (or write) several consecutive words with only one instruction. Also fewer registers are needed.

We could perform the division by 2 and the addition as we could do them manually (i.e. one digit after another, in making sure that each digit is between 0 and 9). But the calculation can be much strongly accelerated, first by performing elementary 32-bit operations instead of 8-bit operations, i.e. by considering 4 digits at once, then by using a redundant representation, i.e. digits greater than 9 (but never greater than 255), which avoids calculating the carries at each addition.

The division by 2 is performed word by word starting with the most significant word. At each iteration, a right-rotation with carry is performed; a word M and a new carry are obtained. Bits 7, 15, 23 and 31 contain the carries that will affect the 4 digits. Thus one writes M = N+C, where C contains the carries and N contains the other bits. C>>5 + C>>7 = 5(C>>7) is added to N in order to obtain the correct result. Note: these calculations in parallel are possible due to the fact that the base (10) is divisible by the divisor (2).

The addition is performed word by word starting with the least significant words, without carry, so that it is very fast. But the value of the digits of the result must regularly be decreased to avoid overflows. This cleaning is performed every 8 iterations, i.e. each time a byte of the binary number has been entirely read, after the possible addition. The digits of the result are read one at a time, starting with the least significant digit; each time the digit + the carry is greater or equal to 128, 80 is subtracted and a carry equal to 8 is generated. So the digits are always between 0 and 199, and overflows never occur.

At the end of the conversion, the digits are between 0 and 190. Then the result is normalized: a number whose digits are between 0 and 9 is obtained. The normalized number is copied at the same time where it must be (its words are not interlaced any longer).

Main Part

In the first part (before the conversion), 3 memory blocks are used: one containing the numbers of the form 4/3²ⁿ⁺¹, one containing the quotient in the long division by 2n+1, and one containing the temporary result.

First B is calculated in the following way: B = ((N + N/4)/3 + 3) BIC 3. The addresses of the different blocks are calculated, the divisor and the shift count for the normalization are initialized. We store 4 (1/2 + 1/3) = 1101010101... in the block that will contain the binary result (the most significant bit has the value 2), and 4/3 = 01010101... in the block that contains the negative powers of 3. Then the calculations start.

At each iteration, a division by 9 is performed, the table of the partial quotients and the table of the products for the division by 2n+1 are calculated; a bit, corresponding to 4/2²ⁿ⁺¹, is set, and we go to the conversion if this bit is out of the block, i.e. if the wanted precision has been reached. Then the division by 2n+1 is performed, and the quotient is added or subtracted to/from the result according to the parity of n. The bit that has been set is cleared. 2 is added to the divisor, and the shift count is decreased if the divisor has exceeded a new power of 2. Then we start a new iteration.

For the conversion, new memory blocks are considered. The decimal numbers are initialized and the conversion is performed as described above.

Results and Comparisons

200,000 decimals have been calculated in 7 hours 40 minutes 7 seconds: the binary calculation has lasted 3 hours 47 minutes 15 seconds and the conversion into decimal has lasted 3 hours 52 minutes 51 seconds.

The 3 calculation times are just about proportional to the square of the number of calculated decimals. With 20,000 decimals, the times are: 4 minutes 36 seconds in total, 2 minutes 22 seconds for the binary calculation and 2 minutes 14 seconds for the conversion. The times have respectively been divided by 100, 96, 104.

The binary calculation and the conversion into decimal take just about the same time. Therefore it may be interesting to optimise the one or the other, but if both are not optimised, the acceleration will be limited. Concerning the binary calculation, another formula can be chosen (another sum of atan's). Concerning the conversion, it's harder: the cleaning may be performed slightly less often (every 8 additions instead of every 8 read bits) or a division by 4 may be performed instead of 2 divisions by 2 when a bit is 0 and there is no addition to perform; see Armpi 4. It may be better to perform all calculations in a base of the form 10ⁿ, where n is well chosen.

The comparisons with other computers and other π calculation programs I've written are given in a separate file; all the programs up to Armpi 4 use the same formula, but the algorithms are quite different.