mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2010-10-27, 13:02   #1
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

26×113 Posts
Default Fast Approximate Ceiling Function

A challenge: a is double.

The generic ceil() function in the C math library is slow. Horribly slow.

The following code is much faster, but does not work correctly for
positive exact integers:

#define iceil(a) (a <= 0.0? (int)a : (int)a + 1)

For the application I have, exact integers for a are very very very rare
and it does not matter if iceil(a) is wrong by 1.

Can anyone find a faster way? Is there anyway to do this without
the branch? (this macro gets called many,many,many times)
R.D. Silverman is offline   Reply With Quote
Old 2010-10-27, 15:14   #2
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

26×113 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
A challenge: a is double.

The generic ceil() function in the C math library is slow. Horribly slow.

The following code is much faster, but does not work correctly for
positive exact integers:

#define iceil(a) (a <= 0.0? (int)a : (int)a + 1)

For the application I have, exact integers for a are very very very rare
and it does not matter if iceil(a) is wrong by 1.

Can anyone find a faster way? Is there anyway to do this without
the branch? (this macro gets called many,many,many times)
I may just choose to accept a small number of additional inaccuracies
by just computing e.g. (int)(a + .999999)
R.D. Silverman is offline   Reply With Quote
Old 2010-10-27, 17:08   #3
xilman
Bamboozled!
 
xilman's Avatar
 
"π’‰Ίπ’ŒŒπ’‡·π’†·π’€­"
May 2003
Down not across

278416 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
A challenge: a is double.

The generic ceil() function in the C math library is slow. Horribly slow.

The following code is much faster, but does not work correctly for
positive exact integers:

#define iceil(a) (a <= 0.0? (int)a : (int)a + 1)

For the application I have, exact integers for a are very very very rare
and it does not matter if iceil(a) is wrong by 1.

Can anyone find a faster way? Is there anyway to do this without
the branch? (this macro gets called many,many,many times)
I'm assuming your variable a is a 64-bit double. If it's a 32-bit float you need to change the shift count in the macro below to 31. I'm also assuming IEEE floating point representation, so the sign bit is stored in the MSB of a floating point variable.
Code:
#define iceil(a) ((int)a + !(*(unsigned* ) &a) >> 63)))
Note that this will not work if a is anything but a variable. Whether it is faster than the conditional expression is open to experiment.

This variant allows a to be an arbitrary expression but may not be faster because it has an implicit branch.
Code:
#define iceil(a) ((int)(a) + (int (a) < 0.0))
Your macro doesn't work if a is an expression but is easily fixed with the addition of some parentheses.

Paul

Last fiddled with by xilman on 2010-10-27 at 17:09 Reason: Remove a spurious '\'' character.
xilman is offline   Reply With Quote
Old 2010-10-29, 18:10   #4
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

348910 Posts
Default

If you needed a round-to-nearest and not round-to-next-larger, and knew the bit size M of a floating point mantissa, then a standard trick for rounding to integer computes ((a + c) - c), where c is 3*2^(M-1). The addition will right-justify and round the mantissa in an FPU register, the subtraction replaces bits to the right of the 'binary point' with zeros. There are a few caveats though:

- you have to make sure the compiler executes the addition and then subtraction, instead of optimizing them away

- you have to know the mantissa size. For x86 machines this could be 53 bits or 64 bits, depending on the OS, compiler, whatever else your app is doing, whether the operation uses SSE or x87 floating point, etc.

To get around the first problem you can initialize two global variables to c and copy them to stack variables when you need rounding. You can get around the second problem by forcing the x87 precision to 53 bits, so that SSE and non-SSE code will work the same.

There is a similar trick for converting floating point to integer: adding 2^M will right-justify the mantissa without rounding, so you can store to memory and then read the low 32 bits as an integer. The latter is not as necessary now as it was back in the Pentium days because integer conversion is a lot faster on modern x86 machines than it used to be; in the Pentium days the FPU would stall for 6 clocks on an integer conversion, and that was too much for a unit that could churn out an add or multiply every clock.
jasonp is offline   Reply With Quote
Old 2010-10-29, 23:18   #5
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

26·113 Posts
Default

Quote:
Originally Posted by jasonp View Post
If you needed a round-to-nearest and not round-to-next-larger, and knew the bit size M of a floating point mantissa, then a standard trick for rounding to integer computes ((a + c) - c), where c is 3*2^(M-1). The addition will right-justify and round the mantissa in an FPU register, the subtraction replaces bits to the right of the 'binary point' with zeros. There are a few caveats though:

- you have to make sure the compiler executes the addition and then subtraction, instead of optimizing them away

- you have to know the mantissa size. For x86 machines this could be 53 bits or 64 bits, depending on the OS, compiler, whatever else your app is doing, whether the operation uses SSE or x87 floating point, etc.

To get around the first problem you can initialize two global variables to c and copy them to stack variables when you need rounding. You can get around the second problem by forcing the x87 precision to 53 bits, so that SSE and non-SSE code will work the same.

There is a similar trick for converting floating point to integer: adding 2^M will right-justify the mantissa without rounding, so you can store to memory and then read the low 32 bits as an integer. The latter is not as necessary now as it was back in the Pentium days because integer conversion is a lot faster on modern x86 machines than it used to be; in the Pentium days the FPU would stall for 6 clocks on an integer conversion, and that was too much for a unit that could churn out an add or multiply every clock.
The Pentium/Core-2 also has an instruction that allows the FPU to set
the rounding mode to "nearest". (one can set other modes as well)

However, I actually need "ceiling", not "nearest"
R.D. Silverman is offline   Reply With Quote
Old 2010-10-30, 00:30   #6
Robert Holmes
 
Robert Holmes's Avatar
 
Oct 2007

23·13 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
The Pentium/Core-2 also has an instruction that allows the FPU to set
the rounding mode to "nearest". (one can set other modes as well)

However, I actually need "ceiling", not "nearest"
Wouldn't nearest(x + 0.4999...) do the ceiling?
Robert Holmes is offline   Reply With Quote
Old 2010-10-30, 01:53   #7
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

161008 Posts
Default

Quote:
Originally Posted by Robert Holmes View Post
Wouldn't nearest(x + 0.4999...) do the ceiling?
Yes, but there is a latency associated with setting the rounding mode.
And if you want to intermingle other FP computations you constantly
set/reset the rounding mode. I've tried it. It is slow.
R.D. Silverman is offline   Reply With Quote
Old 2010-10-30, 05:41   #8
WMHalsdorf
 
WMHalsdorf's Avatar
 
Feb 2005
Bristol, CT

10018 Posts
Default

This should work

#define iceil(a) (a <= 0.0? (int)a : a > (int)a? (int)a+1:(int)a)

Last fiddled with by WMHalsdorf on 2010-10-30 at 06:36
WMHalsdorf is offline   Reply With Quote
Old 2010-10-30, 11:47   #9
science_man_88
 
science_man_88's Avatar
 
"Forget I exist"
Jul 2009
Dumbassville

8,369 Posts
Default

why not floor() + 1 is that any faster ?
science_man_88 is offline   Reply With Quote
Old 2010-10-30, 12:19   #10
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

348910 Posts
Default

The problem is that floor() and ceil() internally change the rounding mode and then do the integer conversion. Maybe they also have to check for extreme floating point values like inifinity and NaN. Changing the rounding mode is a pretty slow operation, and it's not necessary to do it every time you call floor() and ceil(), you can do it once and then do the same arithmetic they do for many inputs at once.

Getting rid of those functions and converting 'a' to an integer won't help you either, because

- the C language requires integer conversion to always truncate, rather than round to nearest, so to be strictly C compliant the library has to still fiddle with the rounding mode

- the CPU has to get the converted result from an FPU register to a CPU register and back (the output of iceil is an integer but later code needs it to be a double). If you are not using SSE2, that means bouncing through memory, doing the comparison and bouncing back into the FPU, which can be extremely slow. With SSE2 you can copy from the SSE2 unit to an integer register directly, but I think even today that operation has a high latency.

Maybe the most efficient choice is to figure out how to use round-to-nearest and try to correct it afterwards...

Last fiddled with by jasonp on 2010-10-30 at 12:32
jasonp is offline   Reply With Quote
Old 2010-10-30, 12:28   #11
science_man_88
 
science_man_88's Avatar
 
"Forget I exist"
Jul 2009
Dumbassville

8,369 Posts
Default

Quote:
Originally Posted by jasonp View Post
The problem with all of those choices is that floor() and ceil() internally change the rounding mode and then do the integer conversion. Changing the rounding mode is a pretty slow operation, and it's not necessary to do it every time you call floor() and ceil(), you can do it once and then do the same arithmetic they do for many inputs at once.

Getting rid of those functions and converting 'a' to an integer won't help you either, because

- the C language requires integer conversion to always truncate, rather than round to nearest, so to be strictly C compliant the library has to still fiddle with the rounding mode

- the CPU has to get the converted result from an FPU register to a CPU register and back (the output of iceil is an integer but later code needs it to be a double). If you are not using SSE2, that means bouncing through memory, doing the comparison and bouncing back into the FPU, which can be extremely slow. With SSE2 you can copy from the SSE2 unit to an integer register directly, but I think even today that operation has a high latency.

TRUNCATE NUMBER IS LIKE FLOOR THEN JUST +1 LOL sorry caps lock was on.

Last fiddled with by jasonp on 2010-10-30 at 12:34 Reason: umm, lol?
science_man_88 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Requestion for fast chebyshev theta function danaj Computer Science & Computational Number Theory 9 2018-03-31 14:59
A useful function. JM Montolio A Miscellaneous Math 28 2018-03-08 14:29
Fun with partition function Batalov And now for something completely different 24 2018-02-27 17:03
phi function rula Homework Help 3 2017-01-18 01:41
Do normal adults give themselves an allowance? (...to fast or not to fast - there is no question!) jasong jasong 35 2016-12-11 00:57

All times are UTC. The time now is 00:10.

Fri Oct 23 00:10:44 UTC 2020 up 42 days, 21:21, 0 users, load averages: 1.15, 1.56, 1.67

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.