m68k/fpsp040/setox.S

4 |	The entry point setox computes the exponential of a value.
5 |	setoxd does the same except the input value is a denormalized
6 |	number.	setoxm1 computes exp(X)-1, and setoxm1d computes
7 |	exp(X)-1 for denormalized X.
10 |	-----
11 |	Double-extended value in memory location pointed to by address
15 |	------
16 |	exp(X) or exp(X)-1 returned in floating-point register fp0.
19 |	-------------------------
20 |	The returned result is within 0.85 ulps in 64 significant bit, i.e.
21 |	within 0.5001 ulp to 53 bits if the result is subsequently rounded
22 |	to double precision. The result is provably monotonic in double
26 |	-----
27 |	Two timings are measured, both in the copy-back mode. The
28 |	first one is measured when the function is invoked the first time
29 |	(so the instructions and data are not in cache), and the
30 |	second one is measured when the function is reinvoked at the same
33 |	The program setox takes approximately 210/190 cycles for input
35 |	is the usual situation.	For the less common arguments,
36 |	depending on their values, the program may run faster or slower --
37 |	but no worse than 10% slower even in the extreme cases.
39 |	The program setoxm1 takes approximately ??? / ??? cycles for input
41 |	approximately ??? / ??? cycles. For the less common arguments,
42 |	depending on their values, the program may run faster or slower --
43 |	but no worse than 10% slower even in the extreme cases.
46 |	----------------------------------
49 |	------
52 |	Step 2.	Return	ans := ans + sign(X)*2^(-126). Exit.
53 |	Notes:	This will always generate one exception -- inexact.
57 |	-----
60 |		1.1	If |X| >= 2^(-65), go to Step 1.3.
61 |		1.2	Go to Step 7.
62 |		1.3	If |X| < 16380 log(2), go to Step 2.
63 |		1.4	Go to Step 8.
64 |	Notes:	The usual case should take the branches 1.1 -> 1.3 -> 2.
65 |		 To avoid the use of floating-point comparisons, a
67 |		 32-bit integer, the upper (more significant) 16 bits are
68 |		 the sign and biased exponent field of |X|; the lower 16
69 |		 bits are the 16 most significant fraction (including the
70 |		 explicit bit) bits of |X|. Consequently, the comparisons
72 |		 Note also that the constant 16380 log(2) used in Step 1.3
73 |		 is also in the compact form. Thus taking the branch
76 |		 but close to, 16380 log(2) and the branch to Step 9 is
79 |	Step 2.	Calculate N = round-to-nearest-int( X * 64/log2 ).
80 |		2.1	Set AdjFlag := 0 (indicates the branch 1.3 -> 2 was taken)
81 |		2.2	N := round-to-nearest-integer( X * 64/log2 ).
83 |		2.4	Calculate	M = (N - J)/64; so N = 64M + J.
84 |		2.5	Calculate the address of the stored value of 2^(J/64).
85 |		2.6	Create the value Scale = 2^M.
86 |	Notes:	The calculation in 2.2 is really performed by
89 |			N := round-to-nearest-integer(Z)
93 |			constant := single-precision( 64/log 2 ).
95 |		 Using a single-precision constant avoids memory access.
96 |		 Another effect of using a single-precision "constant" is
97 |		 that the calculated value Z is
99 |			Z = X*(64/log2)*(1+eps), |eps| <= 2^(-24).
103 |	Step 3.	Calculate X - N*log2/64.
104 |		3.1	R := X + N*L1, where L1 := single-precision(-log2/64).
105 |		3.2	R := R + N*L2, L2 := extended-precision(-log2/64 - L1).
106 |	Notes:	a) The way L1 and L2 are chosen ensures L1+L2 approximate
107 |		 the value	-log2/64	to 88 bits of accuracy.
110 |		 c) The calculation X+N*L1 is also exact due to cancellation.
115 |			N = rnd-to-int( X*64/log2 (1+eps) ), |eps|<=2^(-24)
117 |			X*64/log2 - N	=	f - eps*X 64/log2
118 |			X - N*log2/64	=	f*log2/64 - eps*X
123 |			|X - N*log2/64| <= (0.5 + 16446/2^(18))*log2/64
127 |	Step 4.	Approximate exp(R)-1 by a polynomial
129 |	Notes:	a) In order to reduce memory access, the coefficients are
132 |		 b) Even with the restrictions above,
133 |			|p - (exp(R)-1)| < 2^(-68.8) for all |R| <= 0.0062.
135 |		 c) To fully utilize the pipeline, p is separated into
143 |		 where T and t are the stored values for 2^(J/64).
147 |		 to 62 bits so that the last two bits of T are zero. The
148 |		 reason for such a special form is that T-1, T-2, and T-8
149 |		 will all be exact --- a property that will give much
150 |		 more accurate computation of the function EXPM1.
154 |		6.1	If AdjFlag = 0, go to 6.3
156 |		6.3	Restore the user FPCR
164 |		 When that is the case, AdjScale = 2^(M1) where M1 is
167 |		 The inexact exception is not generated in 6.4. Although
168 |		 one can argue that the inexact flag should always be
169 |		 raised, to simulate that exception cost to much than the
176 |	Notes:	For non-zero X, the inexact exception will always be
177 |		 raised by 7.3. That is the only exception raised by 7.3.
178 |		 Note also that we use the FMOVEM instruction to move X
180 |		 the FMOVEM may not seem relevant since X is normalized,
181 |		 the precaution will be useful in the library version of
182 |		 this code where the separate entry for denormalized inputs
186 |		8.1	If |X| > 16480 log2, go to Step 9.
187 |		(mimic 2.2 - 2.6)
188 |		8.2	N := round-to-integer( X * 64/log2 )
190 |		8.4	K := (N-J)/64, M1 := truncate(K/2), M = K-M1, AdjFlag := 1.
191 |		8.5	Calculate the address of the stored value 2^(J/64).
192 |		8.6	Create the values Scale = 2^M, AdjScale = 2^M1.
193 |		8.7	Go to Step 3.
194 |	Notes:	Refer to notes for 2.2 - 2.6.
197 |		9.1	If X < 0, go to 9.3
198 |		9.2	ans := Huge, go to 9.4
202 |	Notes:	Exp(X) will surely overflow or underflow, depending on
204 |		 extended-precision numbers whose square over/underflow
205 |		 with an inexact result. Thus, 9.5 always raises the
210 |	--------
215 |	Notes:	This will return X with the appropriate rounding
216 |		 precision prescribed by the user FPCR.
219 |	-------
222 |		1.1	If |X| >= 1/4, go to Step 1.3.
223 |		1.2	Go to Step 7.
224 |		1.3	If |X| < 70 log(2), go to Step 2.
225 |		1.4	Go to Step 10.
226 |	Notes:	The usual case should take the branches 1.1 -> 1.3 -> 2.
228 |		 because EXPM1 is intended to evaluate exp(X)-1 accurately
229 |		 when |X| is small. For further details on the comparisons,
230 |		 see the notes on Step 1 of setox.
232 |	Step 2.	Calculate N = round-to-nearest-int( X * 64/log2 ).
233 |		2.1	N := round-to-nearest-integer( X * 64/log2 ).
235 |		2.3	Calculate	M = (N - J)/64; so N = 64M + J.
236 |		2.4	Calculate the address of the stored value of 2^(J/64).
237 |		2.5	Create the values Sc = 2^M and OnebySc := -2^(-M).
238 |	Notes:	See the notes on Step 2 of setox.
240 |	Step 3.	Calculate X - N*log2/64.
241 |		3.1	R := X + N*L1, where L1 := single-precision(-log2/64).
242 |		3.2	R := R + N*L2, L2 := extended-precision(-log2/64 - L1).
243 |	Notes:	Applying the analysis of Step 3 of setox in this case
247 |	Step 4.	Approximate exp(R)-1 by a polynomial
249 |	Notes:	a) In order to reduce memory access, the coefficients are
252 |		 b) Even with the restriction above,
253 |			|p - (exp(R)-1)| <	|R| * 2^(-72.7)
255 |		 c) To fully utilize the pipeline, p is separated into
263 |		 where T and t are the stored values for 2^(J/64).
267 |		 to 62 bits so that the last two bits of T are zero. The
268 |		 reason for such a special form is that T-1, T-2, and T-8
269 |		 will all be exact --- a property that will be exploited
270 |		 in Step 6 below. The total relative error in p is no
271 |		 bigger than 2^(-67.7) compared to the final result.
273 |	Step 6.	Reconstruction of exp(X)-1
274 |			exp(X)-1 = 2^M * ( 2^(J/64) + p - 2^(-M) ).
275 |		6.1	If M <= 63, go to Step 6.3.
276 |		6.2	ans := T + (p + (t + OnebySc)). Go to 6.6
277 |		6.3	If M >= -3, go to 6.5.
278 |		6.4	ans := (T + (p + t)) + OnebySc. Go to 6.6
282 |	Notes:	The various arrangements of the expressions give accurate
285 |	Step 7.	exp(X)-1 for |X| < 1/4.
286 |		7.1	If |X| >= 2^(-65), go to Step 9.
287 |		7.2	Go to Step 8.
289 |	Step 8.	Calculate exp(X)-1, |X| < 2^(-65).
290 |		8.1	If |X| < 2^(-16312), goto 8.3
291 |		8.2	Restore FPCR; return ans := X - 2^(-16382). Exit.
293 |		8.4	Restore FPCR; ans := ans - 2^(-16382).
295 |	Notes:	The idea is to return "X - tiny" under the user
297 |		 inefficiency, we stay away from denormalized numbers the
298 |		 best we can. For |X| >= 2^(-16312), the straightforward
299 |		 8.2 generates the inexact exception as the case warrants.
301 |	Step 9.	Calculate exp(X)-1, |X| < 1/4, by a polynomial
303 |	Notes:	a) In order to reduce memory access, the coefficients are
307 |		 b) Even with the restriction above,
308 |			|p - (exp(X)-1)| < |X| 2^(-70.6)
311 |		 c) To fully preserve accuracy, the polynomial is computed
314 |		 d) To fully utilize the pipeline, Q is separated into
319 |	Step 10.	Calculate exp(X)-1 for |X| >= 70 log 2.
320 |		10.1 If X >= 70log2 , exp(X) - 1 = exp(X) for all practical
321 |		 purposes. Therefore, go to Step 1 of setox.
322 |		10.2 If X <= -70log2, exp(X) - 1 = -1 for all practical purposes.
323 |		 ans := -1
325 |		 Return ans := ans + 2^(-126). Exit.
326 |	Notes:	10.2 will always create an inexact and return -1 + tiny
327 |		 in the user rounding precision and mode.
334 |       For details on the license for this file, please see the
450 |--entry point for EXP(X), X is denormalized
453 	oril		#0x00800000,%d0		| ...sign(X)*2^(-126)
454 	movel		%d0,-(%sp)
462 |--entry point for EXP(X), here X is finite, non-zero, and not NaN's
464 |--Step 1.
467 	cmpil		#0x3FBE0000,%d0	| ...2^(-65)
472 |--The case |X| >= 2^(-65)
479 |--Step 2.
480 |--This is the normal branch:	2^(-65) <= |X| < 16380 log2.
485 	fmovemx	%fp2-%fp2/%fp3,-(%a7)		| ...save fp2
489 	fmovel		%d0,%fp0		| ...convert to floating-format
501 |--Step 3.
502 |--fp1,fp2 saved on the stack. fp0 is N, fp1 is X,
503 |--a0 points to 2^(J/64), D0 is biased expo. of 2^(M)
505 	fmuls		#0xBC317218,%fp0	| ...N * L1, L1 = lead(-log2/64)
506 	fmulx		L2,%fp2		| ...N * L2, L1+L2 = -log2/64
511 |--Step 4.
512 |--WE NOW COMPUTE EXP(R)-1 BY A POLYNOMIAL
513 |-- R + R*R*(A1 + R*(A2 + R*(A3 + R*(A4 + R*A5))))
514 |--TO FULLY UTILIZE THE PIPELINE, WE COMPUTE S = R*R
515 |--[R+R*S*(A2+S*A4)] + [S*(A1+S*(A3+S*A5))]
546 	faddx		%fp2,%fp0		| ...fp0 is EXP(R) - 1
549 |--Step 5
550 |--final reconstruction process
551 |--EXP(X) = 2^M * ( 2^(J/64) + 2^(J/64)*(EXP(R)-1) )
553 	fmulx		%fp1,%fp0		| ...2^(J/64)*(Exp(R)-1)
554 	fmovemx	(%a7)+,%fp2-%fp2/%fp3	| ...fp2 restored
560 |--Step 6
571 |--Step 7
572 	fmovemx	(%a0),%fp0-%fp0	| ...in case X is denormalized
578 |--Step 8
581 |--Steps 8.2 -- 8.6
586 	fmovemx	 %fp2-%fp2/%fp3,-(%a7)		| ...save fp2
590 	fmovel		%d0,%fp0		| ...convert to floating-format
607 	bra		EXPCONT1		| ...go back to Step 3
610 |--Step 9
620 |--entry point for EXPM1(X), here X is denormalized
621 |--Step 0.
627 |--entry point for EXPM1(X), here X is finite, non-zero, non-NaN
629 |--Step 1.
630 |--Step 1.1
638 |--Step 1.3
639 |--The case |X| >= 1/4
646 |--Step 2.
647 |--This is the case:	1/4 <= |X| <= 70 log2.
652 	fmovemx	%fp2-%fp2/%fp3,-(%a7)		| ...save fp2
656 	fmovel		%d0,%fp0		| ...convert to floating-format
667 |--Step 3.
668 |--fp1,fp2 saved on the stack. fp0 is N, fp1 is X,
669 |--a0 points to 2^(J/64), D0 and a1 both contain M
671 	fmuls		#0xBC317218,%fp0	| ...N * L1, L1 = lead(-log2/64)
672 	fmulx		L2,%fp2		| ...N * L2, L1+L2 = -log2/64
678 |--Step 4.
679 |--WE NOW COMPUTE EXP(R)-1 BY A POLYNOMIAL
680 |-- R + R*R*(A1 + R*(A2 + R*(A3 + R*(A4 + R*(A5 + R*A6)))))
681 |--TO FULLY UTILIZE THE PIPELINE, WE COMPUTE S = R*R
682 |--[R*S*(A2+S*(A4+S*A6))] + [R+S*(A1+S*(A3+S*A5))]
703 	negw		%d0		| ...D0 is -M
705 	addiw		#0x3FFF,%d0	| ...biased expo. of 2^(-M)
710 	oriw		#0x8000,%d0	| ...signed/expo. of -2^(-M)
711 	movew		%d0,ONEBYSC(%a6)	| ...OnebySc is -2^(-M)
722 	faddx		%fp2,%fp0		| ...fp0 IS EXP(R)-1
724 	fmovemx	(%a7)+,%fp2-%fp2/%fp3	| ...fp2 restored
726 |--Step 5
727 |--Compute 2^(J/64)*p
729 	fmulx		(%a1),%fp0	| ...2^(J/64)*(Exp(R)-1)
731 |--Step 6
732 |--Step 6.1
736 |--Step 6.2	M >= 64
743 |--Step 6.3	M <= 63
744 	cmpil		#-3,%d0
747 |--Step 6.4	M <= -4
753 |--Step 6.5	-3 <= M <= 63
760 |--Step 6.6
767 |--Step 7	|X| < 1/4.
768 	cmpil		#0x3FBE0000,%d0	| ...2^(-65)
772 |--Step 8	|X| < 2^(-65)
773 	cmpil		#0x00330000,%d0	| ...2^(-16312)
775 |--Step 8.2
776 	movel		#0x80010000,SC(%a6)	| ...SC is -2^(-16382)
786 |--Step 8.3
799 |--Step 9	exp(X)-1 by a simple polynomial
802 	fmovemx	%fp2-%fp2/%fp3,-(%a7)	| ...save fp2
842 	fmovemx	(%a7)+,%fp2-%fp2/%fp3	| ...fp2 restored
853 |--Step 10	|X| > 70 log2
857 |--Step 10.2
858 	fmoves		#0xBF800000,%fp0	| ...fp0 is -1
860 	fadds		#0x00800000,%fp0	| ...-1 + 2^(-126)