[Left] Portrait of Isaac Newton. (Unlike his almost-cipher-like contemporary Raphson, there is no shortage of portraits of Newton). [Right] Frontispiece of the (second edition of) Joseph Raphson's Analysis Aequationum Universalis, the work in which what is now referred to as the Newton-Raphson method first appeared in more or less its modern form.

Iterative Methods for Solving Nonlinear Algebraic Equations

The Classical Newton-Raphson Scheme:

"Given some function ƒ(x), for which values x* is ƒ(x*) = 0?" is a ubiquitous problem in mathematics, engineering and the sciences. This page begins with a now more-or-less-standard exposition about iterative techniques for solving this problem, but then delves into some further issues which the author hopes will be of interest, mainly focusing on higher-order-convergent iterative schemes, applications of the resulting iterative methods in other areas (such as computer arithmetic, numerical solution of differential equations and computational number theory), and efficient hardware implementation.

One must first address the question "what is the nature of the function ƒ and of the independent variable x?" Our analysis and examples below will focus almost exclusively on the simplest case: x a scalar (as opposed to vector) real-valued variable and ƒ a scalar real-valued function of x. However, many of the algorithmic methods we will develop will be of a kind which successfully generalize to multiple dimensions, systems of equations and complex-valued functions and independent variables.

When ƒ(x) is nonlinear, including polynomials in x of more than degree 4, finding the zeros (or equivalently, "roots") of ƒ is generally a nontrivial task, and one which only rarely results in closed-form solutions. In the general case one must deploy some kind of successive-approximation-based or iterative numerical method, typically one which begins by bracketing one or more roots (but preferably just one) in some reasonable-sized subinterval x ∈ [a,b]. One then makes some initial guess − call it x0 − at the root's location within that interval and uses information about the function at x0 to generate an improved guess x1. One then repeats the procedure until these successive iterates converge to the root x* within a desired accuracy. Numerous methods have been developed to accomplish this task. By far the most famous − thanks to its combination of very rapid convergence (when provided with a sufficiently good initial guess), typically-modest computational expense and broad extensibility to more-general settings − is due to the legendary scientist and mathematician Isaac Newton. Assuming the function is "well behaved" in the interval of interest − at a minimum that will mean that ƒ is continuous and has a continuous derivative ƒ′ − Newton's method is based on a deceptively simple geometrical observation: In cases where the function has a simple root, that is ƒ(x) crosses the x-axis with nonzero slope, when one's current approximation xn is sufficiently close to the root, the behavior of ƒ(x) in the neighborhood of the current iterate xn can be well-approximated by the tangent line to the function at xn. By computing the x-location where this tangent line crosses the x-axis one can obtain an improved approximation to the location of the root, and by repeating this procedure one can approximate the root as closely as one likes, within one's needs and computational capabilities. This geometrical construction yields the now-famous iterative-update formula

xn+1 = xn − ƒ/ƒ′ .   [NR]

(Note that in formulae such as the above we will often eschew the subscripted xn notation on functions or replace it with the simpler 'x' − whenever the x-value at which a function or its derivatives are evaluated is not given explicitly but appears in the context of a specific iterate (e.g. the current iterate xn which begins the right-hand-side of the above formula), it is implied that all other terms on the same side of the equals sign are also evaluated at that iterate.)

A geometric illustration of the above iterative method is shown in Figure 1, for the function ƒ(x) = (−x3 + 9x2 − 18x + 6)/6. In the figure, the iteration begins with an initial guess x0 = −1, coinciding with the left boundary of the plot. The red line segments are the local tangents used by the iteration; the point where each intersects the x-axis is the resulting next-iterate. The dashed green vertical line segments represent the function value at each successive iterate, whence the next tangent is generated. For the initial guess pictured, the algorithm yields successive iterates x1 = −0.1282..., x2 = 0.2872..., x3 = 0.4056..., which converge toward the root lying in the interval x ∈ [0,1]. It can be seen that successive iterates appear to get closer to each other as they approach the root being sought, as a consequence of the approximation of the function by its local tangent becoming in some sense "better" (in a fashion we shall quantify more precisely) as the root is neared. We can similarly successively approximate the root in x ∈ [2,3] by starting with an initial guess of, say, x0 = 4. (What happens if we take x0 = 4.1? What about 4.2?)

Figure 1: Newton-Raphson method applied to ƒ(x) = (−x3 + 9x2 − 18x + 6)/6, with initial guess x0 = −1.
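For readers who prefer code to pictures, a minimal C sketch of the tangent-line iteration of Figure 1 follows (the helper names f and fprime and the stopping tolerance are merely illustrative choices); starting from x0 = −1 it reproduces the iterates x1 = −0.1282..., x2 = 0.2872... quoted above:

    #include <math.h>
    #include <stdio.h>

    /* The cubic of Figure 1 and its first derivative */
    static double f(double x)      { return (-x*x*x + 9*x*x - 18*x + 6)/6.0; }
    static double fprime(double x) { return (-3*x*x + 18*x - 18)/6.0; }

    int main(void)
    {
        double x = -1.0;                      /* initial guess x0 of Figure 1 */
        for (int n = 1; n <= 20; n++) {
            double dx = -f(x)/fprime(x);      /* tangent-line (N-R) increment, Eq. [NR] */
            x += dx;
            printf("x%-2d = %.15f\n", n, x);
            if (fabs(dx) < 1e-14) break;      /* stop when successive iterates agree */
        }
        return 0;
    }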

Newton introduced the basic idea behind the above method in 1685, but it is the improved form of Newton's method published in 1690 by his contemporary Joseph Raphson which is given in modern notation above, so the method is often referred to as the Newton-Raphson method [Kline]. (We henceforth will often just use "N-R" for brevity, since we will be referring to the method frequently.)

From the form ƒ/ƒ′ of the term in [NR] it is evident that care must be taken to prevent iterates from landing too close to points where the slope of the function is zero or very close to zero. This risks the tangent line "shooting off to infinity" and is most easily avoided by restricting the iterates to remain within one's initial root-bracketing interval. When an iterate lands outside this interval, further examination of the behavior of ƒ in the region of interest is warranted, followed by a hopefully-better initial guess.

(Aside: The only time a vanishing slope is not disastrous is when ƒ has a root at the point(s) where ƒ′ = 0, and the zero of ƒ is of a type so as to cancel the vanishing denominator there (i.e. the ratio ƒ/ƒ′ vanishes at the same value of x where ƒ′ = 0). But such points have their own problems − try using N-R to find the double root of the parabola ƒ(x) = x2 starting with an initial guess x0=1, for instance. How rapid is the resulting convergence?)

The classical geometrical construction of the above iterative scheme is very useful, but the most-useful mathematical tool for understanding its behavior and the higher-order analogs which we shall derive and examine is the Taylor series expansion, a tool already being used in various forms although the theory behind it was not yet fully developed in Newton's time. This takes the value of ƒ and its derivatives at some chosen value of x called the expansion point and uses those data to reconstruct the value of ƒ(x) in some neighborhood of the expansion point (or even globally, if the function satisfies certain added conditions about differentiability and has no singularities either in its real domain or when extended to complex values of its argument). Roughly speaking, the Taylor series allows us to reconstruct wider-range behavior from local information, which sounds similar in spirit to what N-R seeks to do, namely using local information about a function to find a nearby root.

In classical analysis the expansion point is typically chosen based on the fact that the derivatives of any order ƒ(m) (or at least to a sufficiently high order) are easy to evaluate there, or is located relative to any complex singularities (or poles) of the function in a way which permits the expansion to converge in some neighborhood of interest. In the case of the N-R and analogous schemes this technique is actually used in a kind of "inverse" sense: We have a function typically given via some algebraic formula which we can evaluate (including any desired derivatives) at any desired point in some region of interest. We seek to find a value of the independent variable for which ƒ(x) = 0. In this context, the current iterate xn will serve as the expansion point of the Taylor series, and the "nearby point" of interest is the root x*, although the necessary truncation of the series by retention of only the leading few terms will instead produce a low-order polynomial equation whose solution is the (hopefully) improved next iterate xn+1.

The Taylor series for ƒ(xn+1) in terms of the data at the current iterate is
ƒ(xn+1) = ƒ(xn) + ƒ′(xn)⋅Δx + ƒ′′(xn)⋅(Δx)2/2 + ... + ƒ(m)(xn)⋅(Δx)m/m! + ... , [Tay]
where Δx := xn+1−xn is the (as yet unknown) amount by which we will increment the current iterate. Again, given some reasonable approximation to a root of the target function ƒ(x), our aim is to produce a tractable formula to compute an updated xn+1 so as to make ƒ(xn+1) as near zero as possible, and thus also to put xn+1 as close to x* as possible. In particular, an iterative scheme which is convergent to (N)th order will yield an xn+1 which makes the leading portion of the right-hand-side sum of the expansion, ƒ(xn) + ... + ƒ(N−1)(xn)⋅(Δx)N−1/(N−1)!, vanish by mutual cancellation, followed by a nonzero "error term" involving the x-increment of order (Δx)N and higher. When Δx is small this nonvanishing remainder scales as the Nth power of the previous error, shrinking ever more rapidly with each succeeding iteration.

The simplest truncation of the infinite sum in the right-hand-side of the above formula (besides the trivial one-term truncation which gives the "do-nothing" iteration ƒ(xn+1) = ƒ(xn)) is to ignore terms of quadratic and higher order in Δx and solve the resulting linearized approximation

0 = ƒ(xn) + ƒ′(xn)⋅Δx ,

yielding Δx = −ƒ/ƒ′, or in increment-expanded form, the familiar N-R formula, xn+1 = xn − ƒ/ƒ′ .


Example 1: Decaying Exponential, Meet Straight Line

By way of a numerical example, let us use N-R to find the (one and only) root of the function ƒ(x) = exp(−x) − x, or alternatively written in terms of the natural exponent e = 2.7182818..., ƒ(x) = e−x − x. Graphically, the root is the value of x where the decaying exponential function intersects the line y = x. We start by bounding the location of the root: At x=0 we have exp(−0) = 1 which is greater than 0, hence greater than x there, thus ƒ(0) > 0. At x=1 (without even needing the constant e to any great accuracy, merely knowing that it is greater than one) we know that exp(−1) = 1/e < 1 so ƒ(1) < 0, hence the interval x ∈ [0,1] must contain the root. So we take our initial guess somewhere in this bounding interval, say x0=1. The N-R iteration formula applied to this function is
xn+1 = xn + (e−x − x) / (e−x + 1) ,
where all x-dependent terms in the fraction are evaluated at the current iterate xn, and in an actual implementation one would compute the exp(−x) term just once on each iteration and re-use the result in both the numerator and denominator of the fraction, for the sake of computational efficiency.
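In code, one such update loop might look as follows − a minimal C sketch, with the single exp(−x) evaluation per iteration made explicit; it settles on the root x* = 0.56714329... (to double precision) after a handful of iterations, in agreement with Table 1 below:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0;                          /* initial guess x0 */
        for (int n = 1; n <= 10; n++) {
            double ex = exp(-x);                 /* computed once per iteration...              */
            double dx = (ex - x)/(ex + 1.0);     /* ...and reused in numerator and denominator  */
            x += dx;
            printf("x%d = %.17f  |dx| = %.3e\n", n, x, fabs(dx));
            if (fabs(dx) < 1e-15) break;         /* double precision "bottoms out" near 1e-16   */
        }
        return 0;
    }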

Table 1 below lists the results, along with the absolute value of the difference between successive iterates, which we above refer to as the iterate-increment Δx. Here we refer to the x-increment loosely as the "error" of the current iterate (since it is a measure of how far ƒ is from 0), and replace the generic Δx with a subscripted en := xn−xn−1, allowing us to write expressions involving increments at subsequent iterations. For the numerical values of the error we use standard decimal-exponential floating-point notation, e.g. 1.234e−10 stands for 1.234⋅10−10. We stop when successive iterates are identical to within "machine double precision" which refers to the IEEE standard for floating-point arithmetic and translates to roughly 16 decimal digits of precision. For the computation summarized in the table we used the unix/linux 'bc' utility which comes preinstalled on such systems (this also includes Apple's OS X) and a floating-point-emulation precision of roughly 100 decimal digits.

(Aside: For those not familiar with bc, a good starting point is, as always, 'man bc'. Simply typing 'bc' in a terminal starts a bc session in the default pure-integer mode, which automatically supports arbitrary-length integers, though for arguments more than a thousand digits or so arithmetic operations will begin to slow down noticeably. For the above computation, one invokes bc in floating-arithmetic mode using 'bc −l', then types 'scale = 100' to override the default floating-point precision of 15-20 decimal digits prior to doing any arithmetic. In terms of bc's limited built-in mathematical-function library, for some real argument x the sine, cosine, exponential and natural-log functions are invoked as s(x), c(x), e(x) and l(x), thus the constant e is given by e(1); to compute the constant π one can use the library arctangent function a(x), e.g. 'pi = 4*a(1)'. One can get an idea of the current floating-point-emulation precision by printing out some floating-point datum having − in exact-arithmetic terms − a nonterminating decimal expansion. For instance, to view the default precision prior to setting the scale value one could compute and print the natural-exponential via 'e = e(1); e' (spaces only needed for readability), then set the scale to some value and repeat the computation/printout. Lastly, permissible identifiers in bc are much as in many programming languages, with one added restriction, namely that the first character must not only be alphabetic, but must be lowercase alphabetic. Windows users should consider using the freeware number-theory-oriented Pari program (also available for unix/linux/OS X), where one would set the default precision via e.g. 'default(realprecision,100)', the above functions are invoked via the syntax sin(x), cos(x), exp(x), log(x) and atan(x), and π can be more easily gotten using the built-in constant e.g. 'x = Pi'.)

We thus list a few more digits in the xn than double precision affords, along with levels of error which in an actual double precision computation would instead "bottom out" at around 1e−15 to 1e−16:

Table 1. N-R convergence history for ƒ(x) = exp(−x) − x, with x0=1
 n   xn                          en = (xn−xn−1)   en/(en−1)2
__   ________________________    ______________   __________
 0   1
 1   0.53...                        −4.6e−1
 2   0.5669...                       2.9e−2        .136...
 3   0.567143285...                  1.6e−4        .1845...
 4   0.567143290409783869...         4.4e−9        .18096...
 5   0.56714329040978387299...       3.5e−18       .180948128...
 6   0.56714329040978387299...       2.3e−36       .18094812831744461...

Notice how the error − once it is sufficiently small − behaves as en ≅ C⋅(en−1)2, where C is a proportionality constant, the so-called "asymptotic error ratio", which here is around 0.181. The precise value of such a constant typically matters less than the qualitative behavior it reflects. In the present instance the fact that the ratio en/(en−1)2 tends to a nonzero constant means that the number of converged "good" digits of the approximation roughly doubles with each succeeding iteration, at least until the limit set by the precision of the compute hardware (or software) being used. We shall revisit the issue of the precise form of the proportionality constant in more detail below.


A Brief Note on Order of Magnitude Notation

(Note to the reader: Those less interested in the somewhat-technical concept of "asymptotic order of magnitude" discussed in the following several paragraphs should be able to condense things to the following synopsis:

"In this document, the 'Big Oh' notation O(xn) means that the function or algebraic term thus characterized behaves like the function c⋅xn for some nonzero constant c, the so-called asymptotic constant of proportionality, in whichever limit (x → 0 or x → ∞) is applicable at the point where the O() notation is being used"
without any serious loss of understanding. If on the other hand the above italicized quote prompted an involuntary "How dare you, good sir, engage in such egregious notational abuse..." from your internal sense of mathematical propriety, it would perhaps be best to read our disclaimer below before proceeding further.)

We shall be making much use of "Big Oh" notation to encode asymptotic behavior of functions, as in ƒ(x) = O(g(x)), which would be read as "the function ƒ of x is Big Oh of the function g of x", with g(x) being some kind of convenient reference "measuring stick" function, most commonly constructed from powers, logarithms and/or exponentials of x. One can think of Big Oh (formally an uppercase Greek omicron symbol, but here substituted by a simple keyboard uppercase O) and its related variants [Ο, ο, Θ, Ω, ω] as all being shorthand for "order of", with a different specific kind of asymptotic limiting relationship associated with each. We need to be clear what we mean by our particular (mis)usage of O() because we will be using this notation here in a more-restricted sense than is common for this particular order-of-magnitude notation, in no small part because it allows us to be a lazier typist than the more-precise tight-upper-bound "Big Theta" notation Θ() would. (And even the latter differs slightly from our personal Big Oh usage, as noted below).

So, long story short: Unless specifically noted, in this document when we write that some function ƒ(x) = O(g(x)), we mean this in the sense of a maximally tight asymptotic bound, namely that the ratio ƒ(x)/g(x) tends to some nonzero constant in the limit as x either vanishes or tends to infinity. (When discussing quantities where the argument of the function vanishes in the limit, such as error terms, we are formally using "infinitesimal asymptotics"; we also will use the O() notation in the opposite limit of an argument tending to infinity, so-called "infinite asymptotics". In our discussion, which limit applies will always be clear from the context).

Thus our usage of O() is equivalent to the standard Θ() with c1 = c2, but we further allow the asymptotic constant which is the limit of the ratio to be of either sign, whereas all of the standard O() notational conventions listed above define the bounds in terms of a ratio of absolute values (or complex magnitudes, for complex-valued functions) and thus permit only positive asymptotic constants of proportionality. In other words our quirky/eccentric/unheard-of usage of O() is in fact expressible very simply, via the following slight generalization of the above italicized synopsis:

"In this document, the notation ƒ(x) = O(g(x)) means that the ratio ƒ(x)/g(x) tends to some nonzero constant c, in whichever limit (x → 0 or x → ∞) applies."

Lastly, since we shall only need to consider limiting behavior expressible in terms of nonnegative integer powers of the independent variable herein, in the limit x → 0 any term containing a sum of various powers of x will asymptote to the lowest-power term of the sum, whereas in the limit x → ∞, such a summation of terms asymptotes to the highest-power term.


Example 2: A "Better than Expected" Case

The above function has a single simple root, meaning ƒ′(x) is nonzero at the root. Higher derivatives do not appear explicitly in the N-R algorithm, but do affect its convergence properties, as does the presence of more than one root, as our next example will illustrate. Consider ƒ = cos(x), and specifically focus on the root at x = π/2. The N-R iteration formula for this function is xn+1 = xn + cos(x)/sin(x). Table 2 summarizes the results of these first few iterations starting from x0=1, but now with an added error-related column at right which we will explain shortly. In the table we use the known exact root value to tabulate the successive approximants to the root to the first incorrect digit (we don't do any rounding of that digit based on the ones that follow it, since we care about larger-scale effects):

(Aside: Since the exact value of the root here is known a priori and can be computed to any desired precision we could instead use a convergence criterion based on the level of error relative to that, that is |xn − π/2|. However, this is not useful in most real-world applications since in general the exact solution we seek has no known closed-form expression, and the reason we are using an iterative algorithm like this to begin with is to try to numerically approximate the (unknown) exact solution to within some acceptably small level of error. Thus, the nearness to zero of the function ƒ − either by itself or, better, as reflected in the difference between successive iterates − is nearly always used as a proxy for convergence to the true root, whatever it may be.)

Table 2. N-R convergence history for ƒ(x) = cos(x), with x0=1
 n   xn                                                                               en         en/(en−1)2   en/(en−1)3
__   _____________________________________________________________________________   ________   __________   __________
 0   1
 1   1.64...                                                                          +6.4e−1
 2   1.5706...                                                                        −7.1e−2    −.173        −.269...
 3   1.570796326795...                                                                +1.2e−4     2.4e−2      −.332...
 4   1.57079632679489661923132169163975144202... [accurate to 38 digits]              −5.9e−13   −4.0e−5      −.333333330...
 5   1.57079632679489661923132169163975144209858469968755291048747... [112 digits]    +6.9e−38    2.0e−13     −.3333333333333333333333332...

Here the error behaves "better than expected", in that it decreases faster than the expected quadratic convergence, as evidenced by the ratio en/(en−1)2 tending toward zero rather than to a nonzero constant. (In the order-of-magnitude notation we shall make more use of in the sections that follow, if we denote the distance of the current iterate from the root as Δx, the asymptotic behavior of the error of the scheme applied to cos(x) is cubic O(Δx3) convergence rather than the expected quadratic convergence, O(Δx2)). This behavior is only hinted at in a strictly double-precision computation where one only clearly sees the trend from iteration 3 to 4 before the error "hits bottom", but things are quite clear at the extended-precision used to generate the above table. Using higher precision the trend of the absolute error over iterations 3, 4 and 5 has the negative exponent in |xn−xn−1| roughly tripling at each successive iteration and thus appears to indicate cubic rather than quadratic convergence. Computing the ratio en/(en−1)3 shows that this does in fact rapidly tend to a constant, which moreover appears to be rapidly converging to the exact fraction −1/3.

We can better understand the above behavior in terms of the Taylor series of the respective target functions. The N-R algorithm zeros the leftmost 2 terms of the right-hand-side of [Tay], leaving an error which is, to leading order, just the ƒ′′(xn)⋅(Δx)2/2 term. For the first of our two test functions above the second derivative is everywhere nonzero and tends to zero only as x tends to +∞, far away from the location of the root. But for ƒ(x) = cos(x) we have ƒ′′(x) = −ƒ(x), thus the second derivative vanishes exactly at the roots of the function, meaning that the leading-order error term in the N-R iterate is the cubic ƒ′′′(xn)⋅(Δx)3/6 term, which explains the "faster than expected" convergence.
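The −1/3 limit is easy to confirm in just a few lines of code; the following minimal C sketch uses long double merely to push the precision floor down a little, and prints the ratio en/(en−1)3 of successive iterate-differences, which settles near −1/3 as in Table 2:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        long double x = 1.0L, e_prev = 0.0L;          /* x0 = 1; e_prev holds e_{n-1}          */
        for (int n = 1; n <= 8; n++) {
            long double dx = cosl(x)/sinl(x);         /* N-R increment for f(x) = cos(x)       */
            x += dx;                                  /* dx is e_n := x_n - x_{n-1}            */
            if (fabsl(dx) < 1e-15L) break;            /* precision floor reached               */
            if (n > 1)                                /* cubic-convergence ratio, cf. Table 2  */
                printf("n=%d  e_n = %+.3Le  e_n/e_{n-1}^3 = %.9Lf\n",
                       n, dx, dx/(e_prev*e_prev*e_prev));
            e_prev = dx;
        }
        return 0;
    }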


The Asymptotic Error Ratio:

To obtain an analytical expression for the asymptotic error ratio we again use the Taylor expansion, but now turn the tables and use the root x* as the expansion point for both the function ƒ and its derivative. We define an iterated small parameter which is the distance of the corresponding iterate from the true root:

εn := xn − x* .

Starting with the N-R iteration formula

xn+1 = xn − ƒ/ƒ′ ,
and subtracting x* from both sides shows that this error measure satisfies the same formula as the basic iterative update:

εn+1 = εn − ƒ/ƒ′ .
At any location a small distance ε from the root we have

ƒ(x*+ε) = εƒ′(x*) + ε2ƒ′′(x*)/2 + ... ,

ƒ′(x*+ε) = ƒ′(x*) + εƒ′′(x*) +... ,

In terms of which we have

ƒ/ƒ′ = ( εƒ′ + ε2ƒ′′/2 + ... ) / ( ƒ′ + εƒ′′ + ... ) ,
where all function evaluations are at the root. Defining a lumped small parameter δ := εƒ′′/ƒ′ and making use of the identity 1/(1 + δ) = 1 − δ + δ2 − ..., we can approximate the division by multiplication:

ƒ/ƒ′ = ( εƒ′ + ε2ƒ′′/2 + ... ) ⋅ ( 1 − εƒ′′/ƒ′ +... ) / ƒ′

= ε ⋅ (1 + εƒ′′ / (2ƒ′) + ... ) ⋅ ( 1 − εƒ′′/ƒ′ +... )

= ε ⋅ (1 − εƒ′′ / (2ƒ′) + ... )
where all the unsubscripted epsilons in the right-hand-sides are short for εn. We see that the leading εn terms cancel, leaving εn+1 = εn2 ⋅ ƒ′′/(2ƒ′) + O(εn3) . Assuming ƒ′′ ≠ 0 at the root, as εn tends to zero we can neglect the cubic and higher-order terms lumped into the O(εn3) and thus obtain the expression for the asymptotic error ratio:

limεn → 0 ( εn+1/εn2 ) = ƒ′′/(2ƒ′) . [AsympRatio]

For our first example this ratio ƒ′′/(2ƒ′) takes the value −exp(−x*)/[2⋅(exp(−x*) + 1)] = −x*/[2⋅(x*+1)] ≅ −0.181. The en tabulated in Table 1 are differences between successive iterates rather than true errors, with en ≅ −εn−1 asymptotically, so the tabulated ratio en/(en−1)2 tends to the opposite-signed value x*/[2⋅(x*+1)], and we leave it to the reader to confirm that this matches the experimentally found error ratio given in the rightmost column of Table 1.
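As a quick numerical cross-check of [AsympRatio] against Table 1, a few lines of double-precision C suffice (any language would do equally well here); the second number printed should agree with the limiting value .180948128317... seen in the table's rightmost column:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double xr = 1.0;                        /* will hold the root of exp(-x) - x   */
        for (int n = 0; n < 50; n++) {          /* plain N-R, exactly as in Example 1  */
            double ex = exp(-xr);
            double dx = (ex - xr)/(ex + 1.0);
            xr += dx;
            if (fabs(dx) < 1e-16) break;
        }
        /* [AsympRatio] gives f''/(2f') = -xr/(2*(xr+1)) at the root; the ratio of
           successive-iterate differences tabulated in Table 1 is its negative.    */
        printf("root          = %.16f\n", xr);
        printf("xr/(2*(xr+1)) = %.15f\n", xr/(2.0*(xr + 1.0)));
        return 0;
    }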


The Basin of Attraction:

For the preceding example of N-R applied to the cosine function, some numerical experiments show the root x* = π/2 is the converged solution for initial guesses in the approximate interval xa < x0 < xb, with xa = 0.4052... and xb = 2.736... . Outside this interval the iterates shoot off far to the left or right of the root, in most cases then converging to one of the other infinitely many roots spaced evenly along the x-axis; for instance an initial guess of x0 = 3 gives x1 ≅ −4 and then quickly converges to the root at −3π/2. Another consequence of an iterate being too close to one of the maxima of the cosine function (as is x0 = 3) and thus giving a large positive or negative x-increment is that the interval immediately surrounding each zero of cos(x) is only one piece − the largest piece, in this case − of the basin of attraction of the root. For example, our root of interest above at π/2 is also found by starting from points inside a smaller subinterval just to the left of the local maximum at 2π (try x0 = 6, for example). But that means that any initial value of x which happens to hit one of these subintervals which end up at π/2 is also part of the basin of attraction of the root. Thus, for example, x0 = 6 gives x1 = 2.56..., x2 = 1.03... and then quickly converges to x = π/2. For any selected root of this function the basin of attraction, as it turns out, consists not of a single subinterval of the x-axis, nor even of finitely many, but of an infinite collection of disjoint subintervals.

n xn
__ ______________________
0 6.2604292
1 −37.67...
2 6.02...
3 2.25...
4 1.43...
5 1.571...
6 1.5707963266...
7 1.57079632679489661923...
. .

For example, consider the sequence of iterates captured in the table at left, starting from a value of x0 chosen for purposes of illustration. Here the starting point is quite close to the local maximum of cos(x) at x = 2π, as a result of which our first iteration takes us far to the left of the origin. As it happens, however, x1 is itself close to another local maximum at −12π, and the next iterate again shoots off far away, in this case in "lucky" fashion which again returns us to near our starting point, but now x2 is sufficiently far from the maximum at 2π that the ensuing update does not take us quite as far afield, and in fact x3 happens to lie within the primary basin of attraction of the root at π/2, so a few more iterations converge to that root to 10 or more digits of precision. To borrow terminology from billiards, that is quite a nifty "trick shot". In fact the billiards analogy is a very good one, and the interested reader will want to try changing the starting value of the above example just a tad − say, just changing the rightmost digit of x0 from 2 to 1 or 3 − and repeating the experiment, to get a better sense of how little margin for error this particular trick shot has.

Figure: A common kind of trick shot in "Newton-Raphson billiards", illustrating the convergence history of the above table.
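The trick shot is easy to replay; a minimal C sketch follows, and − as suggested above − the reader is encouraged to perturb the last digit of x0 and watch how differently things can play out:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 6.2604292;                 /* the starting point used in the table above */
        printf("x0  = %.7f\n", x);
        for (int n = 1; n <= 10; n++) {
            x += cos(x)/sin(x);               /* N-R update for f(x) = cos(x)               */
            printf("x%-2d = %.10f\n", n, x);  /* ends up at the root pi/2 = 1.5707963268... */
        }
        return 0;
    }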

In the region around each maximum of cos(x), the deeper we look, the more detail emerges. In fact, the boundary of each root's basin of attraction for the N-R scheme is an infinite set of disjoint pieces, with ever-tinier fragments capturing ever-more-distant starting points and ever-more-complex "billiard shots" which end up converging on the root in question. The regions around the local extrema of cos(x) consist of an infinitely, recursively commingled "dust" of pieces of the basins of all the roots of the function. The basin boundaries for the iteration are in fact fractals [cf. the famous book by B. Mandelbrot].

In order to describe some of the intricacies associated with N-R applied to functions having multiple roots, we borrow terminology from dynamical systems theory. An actual root of ƒ(x) is a fixed point of the iteration, meaning that it yields xn+1 = xn, exactly. We define the basin of attraction of a given root as the set of points which, when used as starting points for the N-R iteration, yield a sequence of iterates converging to that root. Here we assume infinitely high numerical precision and an arbitrarily large number of iterations, since in actual finite computations we never exactly reach the root. Despite this real-world limitation, finite-precision computational trials with a limited number of iterations can give a good idea of the complex dynamics which can result even for "simple" target functions. The first observation we make − and we proceed without formal proof here, though there is no small amount of nontrivial set-theoretic subtlety at work − is that the basin of attraction always has a piece consisting of some contiguous subinterval containing the root, with the root strictly in the interior of said interval. We refer to this subinterval as the primary basin of attraction of the root. The points bounding the primary basin will be the closest points on either side of the root for which an iterative update yields a next-iterate which is not itself in the primary basin. This definition may appear circular at first glance, so to clarify, we must consider the various ways in which this can occur. We first observe that any point x can be classified with respect to the basin of a particular root in the following fashion:

1. x lies in the primary basin of the root in question;

2. x lies in the basin of the root in question, but in one of its pieces other than the primary basin;

3. x lies in the basin of some other root of ƒ;

4. x lies in the basin of no root at all.

The first three of these are more or less self-explanatory (perhaps with the aid of some sketches and examples like the one below); the fourth possibility requires a bit more explanation. One obvious possibility is if iterates starting at x diverge; for example try applying N-R to the target function ƒ(x) := 2 − 1/x which has a single root at x* = ½, starting with some initial guess x0 > 1.

Another possibility is that a point x is part of a sequence of iterates which neither converges nor diverges. The simplest possible case of such a sequence is one consisting of a single point which is some other fixed point of the N-R iteration for the function in question. However, from the definition [NR] such a fixed point would itself be another root of ƒ, which would place it in the category of "in the basin of some other root of ƒ". Thus for N-R such cases must come in the form of a sequence of at least 2 iterates. For our simple cosine-function example, the bounding points of the primary basin of each root (which bound the primary basin but are not part of it) form a 2-point limit cycle, with successive iterates bouncing back and forth endlessly between xa and xb. For points strictly inside the interval bounded by xa and xb, the iterates alternate in terms of which side of the root they lie on, but their distance from the root shrinks monotonically. An initial guess lying a tiny distance away from either xa or xb but lying inside the primary basin will only move closer to the root in a sequence of incremental "baby steps" at first (whose error x − x* alternates in sign but continually decreases in magnitude − it is the incremental decrease in error magnitude to which we are referring when we say "baby steps"), but nonetheless is guaranteed to eventually get there (this is in effect the famous Banach fixed-point theorem in action). An initial guess lying just outside either end of the primary basin behaves similarly but with the error increasing at each step, until it gets large enough that something dramatic happens. We leave it to the reader to play with the numerics and fill in what the possibilities for "drama" are here. Again borrowing terminology from dynamical-systems theory, such iterative cycles which repeat endlessly without converging to a fixed point are called limit cycles of the system, or in the present context, of the iteration. More generally, points lying exactly on the boundary of two basins (by this we mean of any of the pieces forming the basins of two distinct roots) might be part of a cycle of period longer than 2, or even of one which contains an infinitude of points and never repeats, a so-called strange attractor, which is the stuff of chaos theory. The books by Strang [Strang] and others have a much-more in-depth treatment of these issues, so we do not belabor them here, fascinating though they are.
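To see this "baby step" behavior concretely, the following minimal C sketch starts just inside the left end of the primary basin (x0 = 0.41, a little to the right of xa ≅ 0.4052) and prints the signed error at each step; the errors alternate in sign and shrink in magnitude − slowly at first, then very rapidly − until the double-precision roundoff floor is reached:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double root = 2.0*atan(1.0);     /* pi/2, the root being approached            */
        double x = 0.41;                       /* just inside the primary basin (xa ~ 0.4052) */
        for (int n = 0; n <= 8; n++) {
            printf("n=%d  x = %.12f  x - pi/2 = %+.3e\n", n, x, x - root);
            x += cos(x)/sin(x);                /* N-R update for f(x) = cos(x)               */
        }
        return 0;
    }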

(The following paper has further references related to this property of Newton-style iterative methods and some nice graphical images (cf. especially section 4 thereof) illustrating the phenomenon for the method applied to a simple cubic function in the complex plane, that is, in a 2-D setting.)


Exercise: For the function ƒ(x) = exp(−x) − x, show that the primary basin of attraction of the root is the entire x-axis.


Higher-order analogs of N-R, Part 1: A Taylor-Series Approach

While it is true that N-R may be "good enough" for most applications, we are motivated to find higher-order-convergent analogs of the scheme for several reasons:

1. We hope to gain further insights into the general properties of such schemes, especially with respect to the question "what types of scenarios cause problems for such schemes, and how can these problems be best-mitigated?" One obvious issue here is whether using more local information about the target function ƒ(x) − which means computing derivatives of higher-order than the first derivative needed for N-R − can yield a scheme which is not only more rapidly convergent near a root, but also more robust with respect to poor initial guesses. It is always good to maximize the size of the basins of attraction for the roots, if this can be done cost-effectively.

2. There are applications which benefit from higher-order convergence, especially if it can be obtained at modest cost.

3. It is always useful to have at least several tools to choose from, especially when tackling difficult problems (e.g. functions with multiple roots, "interesting" local behavior of the target function, etc). While visual inspection of the target function will almost always reveal the cause of any difficulties the user of the basic N-R scheme may encounter, this is not a good option when a high degree of robustness and "it should just work" automation are required.

Our first attempt to devise a higher-order scheme uses our old friend the Taylor series [Tay] (with the usual list of caveats about derivatives existing in some suitable neighborhood of the root in question). The obvious thing to do in an effort to obtain a cubically convergent scheme is to retain terms up to order (Δx)2 in the Taylor expansion about the current iterate. Geometrically this amounts to approximating the target function in the neighborhood of the current iterate not by a tangent line but by an osculating parabola (from the Latin, literally, "kissing" parabola), described by the resulting quadratic equation in the unknown Δx

0 = ƒ(xn) + ƒ′(xn)⋅Δx + ƒ′′(xn)⋅(Δx)2/2 .

In terms of the customary way of writing the quadratic, this has coefficients a = ƒ′′/2, b = ƒ′, c = ƒ and has a pair of solutions

Δx = { −ƒ′ ± √[(ƒ′)2 − 2⋅ƒ⋅ƒ′′] } / ƒ′′ .
This poses several immediate problems. Firstly, if the argument of the square root is negative we are sunk because the resulting pair of complex-conjugate increments is nonsensical in the context of the presumed real root of the real-valued function ƒ(x). Nonnegativity of the argument is guaranteed if 2⋅ƒ⋅ƒ′′ ≤ (ƒ′)2, thus our current iterate needs to be "close enough" to the desired root to assure the above condition is met. Since "closeness" depends on the local values of the first and second derivatives, any actual machine implementation of such a scheme should compute the needed derivatives and not proceed any further if the argument of the square root is negative. In such cases it would be best to issue a warning and revert to standard N-R, at least for the current iterate, and in general until one is close enough to the root. If the argument is even "too close to zero", say if the resulting square root is much smaller in magnitude than the ƒ′ it will be added to or subtracted from, it may also be advisable to revert to N-R.

But assume that we are in fact close enough to the root to yield a real nonzero square root. From the form of the above inequality we can see that this is assured if the root is simple (ƒ′ nonzero there) and the second derivative ƒ′′ is well-behaved there, meaning that there will exist some decent-sized interval containing the root in which ƒ ≤ (ƒ′)2/(2⋅ƒ′′). Now we hit the second issue: Which of the two distinct solutions should we choose? That one is easy: We should take the one which leaves us closest to our current iterate (i.e. which yields the smaller-in-magnitude increment Δx := xn+1 − xn), firstly because the other solution may be spurious (especially if it takes us far from our current xn), and even if there is another root "out there" which is reasonably well-approximated by the second solution, we don't want our scheme to "bounce back and forth" between pairs of possible roots; we want it to converge to the nearby root, and pronto. Specifically, in order to minimize our increment size we must choose the solution of the quadratic for which the sign out front of the square root is opposite that of (-ƒ′), i.e. the one of the same sign as ƒ′, yielding the scheme (the equation label [3Q] is a mnemonic for "3rd-order iteration, resulting from solving a Quadratic")

xn+1 = xn − { ƒ′ − sgn(ƒ′)⋅√[(ƒ′)2 − 2⋅ƒ⋅ƒ′′] } / ƒ′′ ,   [3Q]
where we choose the equation label to reflect the convergence order (3), the trailing 'Q' serving to distinguish this quadratic-solve-based scheme from the other third-order schemes we shall consider. The "sign function" sgn(u) is defined as +1 if u > 0 and −1 if u < 0. (We avoid the indeterminacy at u = 0 by reverting to N-R if the square root is 0, as noted above). In a C-style programming-language where arithmetic comparisons return a value of 1 if true and 0 if false, a sgn(u) function can be implemented efficiently as sgn(u) = 2*(u > 0) − 1, where the program in question again should avoid the indeterminacy at zero by first checking for zero value of the argument and handling that special case appropriately.
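By way of illustration, here is one possible way to organize such an update in C − a minimal sketch only, with arbitrary guard thresholds, and with the revert-to-N-R logic described above included. Applied to the Example-1 function ƒ(x) = exp(−x) − x starting from x0 = 1, it reaches the root 0.56714329... in about three iterations:

    #include <math.h>
    #include <stdio.h>

    static double f (double x) { return exp(-x) - x; }      /* Example-1 target function */
    static double f1(double x) { return -exp(-x) - 1.0; }   /* its first derivative      */
    static double f2(double x) { return exp(-x); }          /* its second derivative     */

    /* One update of scheme [3Q], falling back to plain N-R when the argument of the
       square root is negative or the second derivative is too small to divide by.   */
    static double update_3Q(double x)
    {
        double a = f(x), b = f1(x), c = f2(x);
        double disc = b*b - 2.0*a*c;
        if (disc < 0.0 || fabs(c) < 1e-12*fabs(b))
            return x - a/b;                       /* revert to N-R in the problem cases     */
        double sgn = (b > 0.0) ? 1.0 : -1.0;      /* sgn(f'); here f' < 0 everywhere, so
                                                     the f' = 0 tie-break case never arises */
        return x - (b - sgn*sqrt(disc))/c;
    }

    int main(void)
    {
        double x = 1.0;                           /* initial guess x0 */
        for (int n = 1; n <= 8; n++) {
            double xnew = update_3Q(x);
            printf("x%d = %.17f\n", n, xnew);
            if (fabs(xnew - x) < 1e-15) break;
            x = xnew;
        }
        return 0;
    }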

The third issue with this scheme is what to do if the second derivative of ƒ is small or zero, i.e. if the denominator in the above expression threatens to vanish and leave us with an update that causes our scheme to "fly off the handle", as it were, analogously to what happens with the N-R scheme if the first derivative becomes small enough to make the ratio ƒ/ƒ′ dangerously large. However, let us assume ƒ′′ is small in some sense which we will make more clear shortly, and see what happens. We already assume that our initial guess is 'reasonable' in the sense that ƒ(x0) (and thus ƒ(xn) for all subsequent iterates) is not large. So if we further take ƒ′′ as 'small', let us be specific and assume that implies that ƒ⋅ƒ′′ << (ƒ′)2. Rewriting the square root as ƒ′⋅√(1 − 2⋅f⋅ƒ′′/(ƒ′)2) we see that this is in the form ƒ′⋅√(1 − ε) with |ε| << 1 a small parameter, which allows the square root to be approximated as √(1 − ε) ≅ (1 − ε/2), thus sgn(ƒ′)⋅√[(ƒ′)2 − 2⋅f⋅ƒ′′] approximates as ƒ′⋅(1 − ƒ⋅ƒ′′/(ƒ′)2), the leading part of which cancels upon subtraction from ƒ′, leaving us with

xn+1 = xn − [ƒ⋅ƒ′′/ƒ′] / ƒ′′ = xn − ƒ/ƒ′ ,
which is just the Newton-Raphson scheme. So a vanishing (or nearly-vanishing) divisor is not a problem here as it is with N-R, unless both the first and second derivatives vanish. (And we assume the reader is not so foolish as to attempt to find the roots of a function such as exp(-x) via such an iterative method.) This is a much more restrictive requirement for problematic "going off on a wild tangent" behavior than was the vanishing first derivative, which is a hint that this third-order scheme should be more robust in terms of handling poor initial guesses than N-R is.


Question: Can we have ƒ′′ 'small' in the sense that ƒ′/ƒ′′ is large but the argument of the square root is not approximable as above?


The fourth and potentially problematic issue is one of computational cost: floating-point square roots tend to be computationally expensive relative to "simpler" hardware operations such as add, subtract and multiply. Even on modern cutting-edge microprocessors where much silicon and human ingenuity has been expended in order to make common floating-point operations (including taking square roots) fast, that 'fast' almost always means in a pipelined sense, that is "if you do a whole bunch of these in sequence, the first result will not be available for several dozen CPU clock cycles, but after that, one fresh result will finish every cycle or two". In other words, even on hardware with high pipelined square-root bandwidth, such hardware operations still have high latency, which makes it difficult for a compiler or human programmer to schedule efficiently, i.e. to begin the root computation far enough in advance of when the result will be needed, and to still keep the CPU busy doing other things in the interim. In the above iterative scheme, by the time we have computed the derivatives needed for the square root, there is very little left to do, except possibly to compute the inverse denominator 1/ƒ′′ in preparation for multiplying that inverse by the computed numerator.

The fifth fly in the ointment − and perhaps the most fatal in terms of generality − is that our "solve a quadratic" scheme generalizes very poorly (in some cases not at all) to more general settings in which analogs of Newtonian linearization-based iteration are frequently used, such as coupled systems of nonlinear algebraic equations and iterative solution methods for discretized nonlinear differential equations.

But before seeking computationally more efficient and hopefully more-generalizable alternatives, let us quickly try our (allegedly) cubic scheme on an actual example, say, the same root of cos(x) at x = π/2 we used above to illustrate some key properties of the Newton-Raphson method. The iterative-update formula [3Q] for ƒ(x) = cos(x) is

xn+1 = xn + {-sin(x) + sgn(sin(x))⋅√[sin2(x) + 2⋅cos2(x)]}/cos(x)

= xn + {-sin(x) + sgn(sin(x))⋅√[1 + cos2(x)]}/cos(x) .

We note that one nice property of the target function in this case is that the argument of the square root is strictly positive, simplifying the code logic needed for the iterative update. The osculating parabola passing through ƒ(x0) is shown in the figure, and the resulting convergence history is summarized in the table.

Figure: The osculating parabola (red) passing through ƒ(x) = cos(x) at x0=1 and yielding the updated iterate x1=1.54... .

Table 3a. Cubic scheme [3Q] applied to ƒ(x) = cos(x), with x0=1
 n   xn                                                          |xn−xn−1|   |xn−π/2|
__   _________________________________________________________   _________   ________
 0   1                                                                        5.4e-1
 1   1.54...                                                       5.5e-1     2.5e-2
 2   1.570793...                                                   2.5e-2     2.5e-6
 3   1.570796326794896616...                                       2.5e-6     2.5e-18
 4   1.57079632679489661923132169163975144209858469968755290...    2.5e-18    2.5e-54

and we observe that the error decreases cubically − meaning that the number of significant figures roughly triples − on each iteration, as expected. We have also included an extra column at right showing the absolute difference between the current iterate and the known exact root, in order to show that the convergence criterion based on the difference between successive iterates (xn−xn−1) is quite conservative, in that it "lags one iteration behind" the one based on (xn−x*), because once we get near the root the latest iterate xn will be asymptotically closer to the root than xn−1 was. This means our earlier note to the effect that "in general the exact root is not known a priori, the latter convergence measure will not be available" was overly pessimistic: If we are using a scheme with known asymptotic convergence properties, and our numerics are sufficiently precise to establish that for the function and the root in question the convergence behavior is following the expected asymptotic behavior, then we can safely extrapolate the pessimistic error estimate based on the difference between the most-recent pair of iterates to one relative to the exact root being approximated, even though the root will never be known exactly. In fact, the ensuing iteration is the mechanism which implements that extrapolation and uses it to produce an improved estimate of the root.

Some numerical experimentation reveals that the primary basin for this third-order scheme about the root is the entire interval containing points closer to this root than to any other root, that is, x0 ∈ (0, π). There is no issue − at least not for this target function − of fractal basin boundaries (nor even basins consisting of more than one contiguous subinterval) as there is for the N-R scheme. The only indeterminacy of any kind is at the extremal points k*π (k integer) of the function, where sin(x) is zero and the sign is indeterminate. At such points both of the possible solutions of the quadratic are equidistant from the current iterate, and we need some way of breaking the tie. E.g. if we take x0 = 0 our quadratic yields the two possibilities x1 = ± √2, both of which are quite-good approximations to the two roots nearest the origin at x = ± π/2, and which will similarly converge to one of these roots to double-precision accuracy in 4 iterations if we "flip a coin" in order to bias the first iteration in the direction of one of these two equidistant roots.

By way of another example, for the simple parabola ƒ(x) = x2 which reduces N-R to a linearly-convergent crawl, our cubic method yields the iteration

xn+1 = xn − {2x − sgn(x)⋅√[4x2 − 4x2]}/2 ,

where the vanishing square root causes the entire right-hand side to vanish, yielding xn+1 = x* = 0, i.e. no matter where our initial guess is, one iteration takes us to the (double) root, within whatever level of roundoff error our numerical implementation of the above formula yields. This should be no surprise: We constructed the scheme by neglecting terms higher than second order in the Taylor expansion around the current iterate and solving the resulting quadratic equation, thus the scheme is guaranteed to yield the exact root of any quadratic polynomial (assuming it has real roots) closest to the initial guess in exactly one step, just as N-R does for the root of any linear equation with nonzero slope (that is, having a unique root). For a cubic polynomial with a degenerate (multiple) root such as ƒ(x) = x3, on the other hand, N-R gives

xn+1 = 2xn/3 ,

which converges via agonizingly slow one-third-size "baby steps". Our (nominally) higher-order scheme is not even in the running here, as the square root term is √(9x4 − 12x4), and thus has a negative argument for all values of x other than the exact solution x = 0.


Exercise: Use the well-known analytical formulae for the roots of a cubic polynomial to derive a fourth-order analog of the iterative scheme [3Q]. What, if any, are the restrictions on ƒ and its derivatives near the root?


Higher-order analogs of N-R, Part 2: Halley's Simplification

The search for effective, efficiently computable higher-order-convergent methods was an active topic of research already among Newton's contemporaries. The most famous such scheme is the cubic method derived by Newton's astronomer friend Edmond Halley, best-known for the famous recurring comet which bears his name. Here we shall rederive Halley's method using modern notation, and then see if Halley's approach can be generalized to obtain even higher-order convergence.

Plaque honoring Edmond Halley and his eponymous recurring comet in the South Cloister of Westminster Abbey.

First, a brief notational comment: In our discussion, we generally use Δx = xn+1 − xn to denote the increment between successive iterates. But Δx is also a measure of the error of the current iterate xn, since presumably the iterates are converging toward the root, meaning that xn+1 is closer to the root − preferably asymptotically closer − than xn is. The key difference in our following analysis is that Δx-as-increment is something to be solved for as accurately as possible within the acceptable level of complexity of the resulting iterative scheme, whereas Δx-as-error is a quantity whose order of magnitude is known a priori, but whose precise value is not. We use the order-of-magnitude estimate to help guide us as to how accurate the simplifying substitutions we will be making are, and then promptly neglect the various higher-order error terms in the resulting expressions to obtain linear equations for the iterate-increment which will take us closer to the root we seek. To help keep clear which is which, we denote an (n)th power of the increment-to-be solved for as (Δx)n, and use O(Δxn) for an error term having (n)th order of magnitude, again in the more-restricted-than-normal sense described above: when we say an error term Q is of such an order, we mean that the ratio Q(Δx)/Δxn tends to some nonzero constant in the limit as Δx vanishes, and we allow the constant to be of either sign.

Halley's insight was in effect to linearize the quadratic term in the Taylor series by approximating it with a known quantity at the current iterate, such that the resulting approximation preserves the convergence order obtained by solving the full quadratic. He did this by considering the quadratic term ƒ′′(xn)⋅(Δx)2/2 and replacing one power − but only one − of the squared (Δx)2 increment by the very approximation formula whose order he was trying to improve, that is, the Newton-Raphson formula. It is important to keep in mind that N-R used as an iteration formula is exact (in the sense that we use it to compute the next iterate to within the error of our computational tools), but in terms of approximating the true root N-R is just that, an approximation. The specific approximation encoded in N-R is quadratically accurate, meaning that approximating the iterate-increment needed to take us to the true root with the substitution Δx = −ƒ/ƒ′ leaves a residual error which is O(Δx2). Thus if we are trying to derive an improved formula for Δx, we need to keep track of the resulting error (at least in terms of its order of magnitude), and to that end, using N-R in this fashion means replacing Δx with the quadratic approximation (−ƒ/ƒ′ + O(Δx2)). Doing that for one power of the quadratic increment gives

ƒ(xn+1) = ƒ(xn) + ƒ′⋅Δx + ƒ′′⋅(−ƒ/ƒ′ + O(Δx2))⋅Δx/2 + O(Δx3) .

Since that substitution is itself multiplied by Δx, retaining only the −ƒ/ƒ′ portion of it and ignoring the rest (as we must, since the error term of the current iterate is unknown at the current iteration − in a very real sense "we are trying to solve for the error" when we compute the next iterate) introduces an error of order O(Δx3), which is the same order as the largest of the remaining neglected terms of the Taylor series. Thus all these cubic-order errors are lumped together and ignored, leaving an approximation which is still third-order accurate but which is now linear in Δx. One then does as usual − sets the left-hand side of the equation ƒ(xn+1) to the desired value of zero and throws away the higher-order error terms in the right-hand-side, obtaining the third-order accurate approximation for the increment:

0 = ƒ(xn) + ƒ′⋅Δx + ƒ′′⋅(−ƒ/ƒ′)⋅Δx/2 ,

which is easily solved for Δx to yield the third-order scheme first derived by Halley (Philos. Trans. Roy. Soc. London, 18 (1694), 136-145):

xn+1 = xn − ƒ⋅ƒ′ / [(ƒ′)2 − ƒ⋅ƒ′′/2] .   [3H]

In effect, Halley made clever inductive use of Newton-Raphson, by using N-R to "bootstrap" to the next-higher convergence order.
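Before looking at the numbers, it is worth noting how compactly [3H] can be coded; the minimal C sketch below (again for the Example-1 target function ƒ(x) = exp(−x) − x) reproduces the first few iterates of Table 3b to the extent double precision allows:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0;                                   /* initial guess x0               */
        for (int n = 1; n <= 8; n++) {
            double ex = exp(-x);                          /* one exp() serves f, f' and f'' */
            double f = ex - x, f1 = -ex - 1.0, f2 = ex;
            double dx = -f*f1/(f1*f1 - 0.5*f*f2);         /* Halley increment, Eq. [3H]     */
            x += dx;
            printf("x%d = %.17f  |dx| = %.2e\n", n, x, fabs(dx));
            if (fabs(dx) < 1e-15) break;
        }
        return 0;
    }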

Let's check that the resulting scheme is truly third-order. Again one can see that the absolute error reflected in the number of significant figures is "one iteration ahead" of the difference between successive iterates tabulated in column 3, and that the cubic error ratio en/(en−1)3 tends to a nonzero constant as expected:

Table 3b. Halley method convergence history for ƒ(x) = exp(−x) − x
 n   xn                                                               en = (xn−xn−1)   en/(en−1)3
__   _______________________________________________________________  ______________   __________
 0   1
 1   0.564...                                                            −4.4e−1
 2   0.5671432907...                                                      2.2e−3        −.0270...
 3   0.567143290409783872999968662209...                                 −3.0e−10       −.027568...
 4   0.567143290409783872999968662210355... [accurate to 91 digits]       7.7e−31       −.02757381763...

Let us further try [3H] on the periodic target function ƒ(x) = cos(x), starting at x0=1. The Halley-method iteration for this test function is xn+1 = xn + (cos(x)⋅sin(x))/[sin2(x) + cos2(x)/2] = xn + 2⋅cos(x)⋅sin(x)/[1 + sin2(x)]:

Table 3c. Halley method applied to ƒ(x) = cos(x), with x0=1
 n   xn                                                                                   en
__   ___________________________________________________________________________________  _______
 0   1
 1   1.53...                                                                               5.3e-1
 2   1.57078...                                                                            3.8e-2
 3   1.5707963267948964...                                                                 9.5e-6
 4   1.5707963267948966192313216916397514420985846996870... [accurate to 49 digits]        1.4e-16
 5   1.5707963267948966192313216916397514420985846996875529104874722... [146 digits]       4.8e-49

which is again third-order convergent, although the asymptotic constant en/(en−1)3 is somewhat larger for this second example function. Thus Halley's method applied to a function which has a simple root at which the second derivative is zero yields no additional "free" order of convergence as we obtain with N-R applied to such a function, although the above convergence history does reflect a smaller asymptotic error ratio en/(en−1)3 than we obtain with N-R.


Exercise: Numerically compute the asymptotic constant en/(en−1)3 for the example in Table 3c.


Exercise: Apply Halley's method to the cubic ƒ(x) = x3 and compare the results to N-R in terms of convergence speed.


Besides being guaranteed-third-order, the above method has one added advantage over N-R. Further numerical convergence experiments on the cosine function show the primary basin of convergence for this root extends arbitrarily close to x = 0 and x = π, though as x approaches either of these extrema of ƒ(x) (0 from above, π from below) the iterates take ever-longer to "get moving" back toward the root. This behavior results from the x-increment near the extremum (where ƒ′ -> 0 but ƒ and ƒ′′ remain O(1)) reducing to −(ƒ⋅ƒ′)/[(ƒ′)2 − ƒ⋅ƒ′′/2] ≅ −(ƒ⋅ƒ′)/(−ƒ⋅ƒ′′/2) = 2⋅ƒ′/ƒ′′, which tends to 0 at the local extrema of ƒ(x). Thus, while the scheme does not appear to suffer from fragmented basins of attraction as does N-R, it can slow to a crawl if an iterate happens to land close to a local maximum or minimum of the function in question.
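The slowdown is easy to observe numerically: near x = 0 the increment 2⋅ƒ′/ƒ′′ for the cosine function is approximately 2x, so each early iterate is only about triple its predecessor. The following minimal C sketch, started at x0 = 0.001, shows the long initial "crawl" away from the extremum followed by the usual rapid convergence once the iterates reach O(1) distance from it:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 0.001;                     /* start very close to the extremum at x = 0  */
        printf("x0  = %.15f\n", x);
        for (int n = 1; n <= 14; n++) {
            double s = sin(x), c = cos(x);
            x += 2.0*c*s/(1.0 + s*s);         /* Halley increment for f(x) = cos(x)          */
            printf("x%-2d = %.15f\n", n, x);  /* roughly triples per step until x = O(1)     */
        }
        return 0;
    }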


Exercise: What is the order of the error introduced by replacing both powers of Δx in the second-order term of the Taylor series with the N-R approximation for Δx?


Let us try extending this approach to fourth order, for which the relevant terms of the Taylor series are:

ƒ(xn+1) = ƒ(xn) + ƒ′⋅Δx + ƒ′′⋅Δx2/2 + ƒ′′′⋅Δx3/6 + O(Δx4) .

Now we need to replace one Δx in the second-order term and two powers of Δx in the third-order term with approximations based on data already in hand, in such a way that the resulting introduced errors are O(Δx4), and thus again can be lumped in with the neglected terms of the series. That implies that our replacement for one of the Δx's in the second-order ƒ′′⋅Δx2/2 term needs to be third-order accurate, so to bootstrap to fourth order we now need to use Halley's formula rather than N-R for that substitution. (The two powers of Δx in the ƒ′′′⋅Δx3/6 term are less demanding: since that term retains one power of the exact Δx, even the second-order-accurate N-R substitution for the remaining Δx2 would leave only an O(Δx4) error there; but having the Halley increment in hand anyway, we simply use it for both substitutions.) Defining a "Halley increment" H(x) := −ƒ⋅ƒ′ / [(ƒ′)2 − ƒ⋅ƒ′′/2], the approximation of the as-yet-unsolved-for fourth-order-accurate increment is Δx = H(xn) + O(Δx3). Making the above substitutions in order to linearize (in terms of the unknown Δx) the second- and third-order terms of the Taylor series expansion we obtain

ƒ(xn+1) = ƒ(xn) + ƒ′⋅Δx + ƒ′′⋅(H + O(Δx3))⋅Δx/2 + ƒ′′′⋅(H2 + O(Δx3))⋅Δx/6 + O(Δx4) .

Setting the left-hand side of the equation to zero and neglecting the fourth-order error terms in the right-hand-side gives a fourth-order accurate linear equation for the increment, with solution:

Δx = -ƒ(xn) / [ƒ′ + ƒ′′⋅H/2 + ƒ′′′⋅H2/6] .

The scheme can be implemented in the above form − first computing the Halley-increment H(xn), then plugging that into the denominator of the above formula − or perhaps with a further slight rearrangement of the denominator and recasting of the result into the form of a 2-step iterative update, where as usual all right-hand-side terms are taken as being evaluated at the current iterate xn:

H(xn) = −ƒ⋅ƒ′ / [(ƒ′)2 − ƒ⋅ƒ′′/2] ,       in terms of which       xn+1 = xn − ƒ / [ƒ′ + H⋅(ƒ′′ + ƒ′′′⋅H/3)/2] . [4a]

A quick test on our test function ƒ(x) = exp(−x) − x confirms the scheme as being fourth-order. Note that in these high-precision tabulations we do not carry so many significant figures merely for the purpose of listing impressively-long strings of converged digits; rather, we need to make the error small enough to establish that the error ratio in the rightmost column truly is converging:

Table 4a. Method [4a] convergence history for ƒ(x) = exp(−x) − x
n   xn   en   en/(en−1)4
__ ______________________ ______ ______
0 1
1 0.5668... −4.3e−1
2 0.56714329040978383... 2.5e−4 .0070...
3 0.56714329040978387299996866221035554975381578718651250813513107922302... [accurate to 68 digits] −3.7e−17 .010091...
4 0.567143290409783872999968662210355549753815787186512508135131079223045... [accurate to nearly 300 digits] 2.0e−68 .010089580...

With an eye to perhaps going to even higher order, it is worth expanding the H-terms in the denominator of [4a] in terms of derivatives of ƒ. Further multiplying both numerator and denominator of the resulting right-hand-side quotient by the square of the denominator of H(x), [(ƒ′)2 − ƒ⋅ƒ′′/2]2, we obtain

xn+1 = xn − (ƒ/ƒ′)⋅[(ƒ′)2 − ƒ⋅ƒ′′/2]2 / [(ƒ′)4 − 3⋅ƒ⋅(ƒ′)2⋅ƒ′′/2 + (ƒ⋅ƒ′′)2/2 + ƒ2⋅ƒ′⋅ƒ′′′/6] , [4A]

where we append an uppercase 'A' to the convergence order forming the equation number in order to differentiate this version of the current scheme from [4a] written in terms of the combination of ƒ and its first two derivatives encoded in the function H(x).

(Aside: This form of the iterative-update (both here and in the general case) also permits a relatively simple form of sanity-checking one's algebra, if one is doing things by hand: In both the numerator and denominator, the various powers of ƒ in each term must sum to the same total, as must the total number of derivatives taken. For example, the terms (ƒ′)4 and ƒ2⋅ƒ′⋅ƒ′′′/6 in the denominator both have 4 powers of ƒ and 4 total derivatives. This is reminiscent of dimensional scalings in physics and engineering, where arithmetic expressions involving various dimensional quantities must have all terms yielding the same combination of the various dimensions. Here, if the independent variable x had (say) dimensions of meters and ƒ(x) of kilograms, each term of the denominator must have the same kga⋅mb combination, in this case kg4⋅m−4. Moreover, since the x-increment we are computing obviously has "dimensions of x", the overall combination of ƒ-terms which gives us our increment Δx must also have the same dimension as x. In the formula [4A] we see that the [(ƒ′)2 − ƒ⋅ƒ′′/2]2 numerator term has the same dimensions as the various terms of the denominator, so those dimensions cancel, leaving just the (ƒ/ƒ′) term, which − just as it does in the simpler N-R method − is easily seen to have the same dimension as x, as it must.)

Although this method has the desired convergence order and can be left in terms of the lumped H-terms as written in equation [4a], we observe that when written out in terms of ƒ-derivatives the result [4A] is "far from pretty". It is often a dangerous thing in mathematics to let a subjective sense of aesthetics come into play, but even putting subjective aesthetics aside, one might certainly be moved to conclude from the growth in formulaic complexity in going from third to fourth order via the Halley-style "linearization bootstrapping" method that the trend is distinctly un-promising. Let us compare the cost of each iterative update for formula [3H] with that of [4a], assuming the usual floating-point arithmetic expedient of replacing the divides by constants such as 2 and 3 with a multiply-by-precomputed-inverse, lumping additions and subtractions together under the rubric "addition", and replacing explicit negations such as in the definition of H(x) in [4a] with "inlined negations" such as using (-H) in the denominator of the x-update expression:


Exercise: Can one achieve any savings in the 4th-order-update operation count by using formula [4A] in place of [4a]?


Note that in practice the function evaluations may dominate the computation, depending on the complexity of the function ƒ(x). Assuming here that the cost of these is simply of the same order as the other arithmetic operations, the increase from convergence order 3 to 4 here nearly doubles our per-iteration operation count. Our mathematical-formulaic art critic might sigh and say, "Sure, this works, but there has to be a better way". Thankfully, there is.

In our derivation of the fourth-order scheme [4a]/[4A], we made the same substitution of the Halley approximation H(x) for (all but one of) the powers of Δx in the quadratic and cubic term, even though those Taylor series terms contain different powers of Δx. This prompts us to ask whether we can instead replace one power of Δx in the ƒ′′′⋅Δx3/6 term by the simpler (and less accurate) Newtonian approximation (−ƒ/ƒ′+O(Δx2)), and the other one by the 3rd-order Halley approximation H(x), rather than replacing both by H? Based on the order-of-the-error algebra one would not expect that to yield a 4th-order scheme because

(−ƒ/ƒ′ + O(Δx2)) ⋅ (H + O(Δx3)) = (−ƒ/ƒ′)⋅H + H⋅O(Δx2) + (−ƒ/ƒ′)⋅O(Δx3) + O(Δx5) ,

appears to contain an unacceptably large second-order error term. Let us, moved perhaps by idle curiosity or sheer stubbornness, proceed nonetheless. Making the above approximating substitution and neglecting the O() error terms, we obtain the following equation, which is linear in the increment Δx:

0 = ƒ(xn) + ƒ′⋅Δx + ƒ′′⋅H⋅Δx/2 + ƒ′′′⋅(−ƒ/ƒ′)⋅H⋅Δx/6 + ... .
The resulting iterative scheme is (with H(x) defined as in [4a]):

xn+1 = xn − ƒ / [ƒ′ + H⋅(ƒ′′ − (ƒ/ƒ′)⋅ƒ′′′/3)/2] . [4b]

Let's give it a try on our test function ƒ(x) = exp(−x) − x:

Table 4b. Method [4b] convergence history for ƒ(x) = exp(−x) − x
n   xn   en   en/(en−1)4
__ ______________________ ______ ______
0 1
1 0.56711... −4.3e−1
2 0.5671432904097838730... 3.3e−5 .00093...
3 0.567143290409783872999968662210355549753815787186512508135131079223045... [87 digits] −9.5e−22 −.00082465...
4 0.567143290409783872999968662210355549753815787186512508135131079223045... [accurate to over 300 digits] −6.6e−88 −.00082449504495...

Much to our surprise, the resulting scheme is in fact 4th-order, with asymptotic constant of proportionality (for the above test function) = −0.00082449504495..., an order of magnitude smaller than the constant found for [4a], which explains the smaller error at every step. (But since the constant depends on ƒ(x), we care more about the convergence order.)

The reason our above order-of-magnitude estimate of the error resulting from our substitutions is misleading at first glance is that, in the error-estimate expression, we have not taken account of the fact that H itself is an approximation to Δx, and thus is also of O(Δx) in magnitude. Thus the H⋅O(Δx2) error term is really O(Δx3), which − most importantly − is one order smaller than the retained leading term (−ƒ/ƒ′)⋅H.

As we did for our first fourth-order scheme, if we further write out the lumped H-term in [4b] in terms of the function ƒ, after some simplifying rearrangement, we obtain a reasonably compact fourth-order scheme:

xn+1 = xn − ƒ⋅[(ƒ′)2 − ƒ⋅ƒ′′/2] / [(ƒ′)3 − ƒ⋅ƒ′⋅ƒ′′ + ƒ2⋅ƒ′′′/6] . [4B]

In this case, the starting formula [4b] simplifies to a "nicer" result (only one power of [(ƒ′)2 − ƒ⋅ƒ′′/2] in the numerator, fewer terms in the denominator) than the scheme [4a] did. The formula [4B] turns out to be the same as the fourth-order scheme we shall derive by more-elegant geometry-inspired means in the next section. Let us see if the "good looks" of formula [4B] are reflected in the scheme's per-update cost by comparing it to the Halley method and our first attempt at a 4th-order method. We first give a small C-style snippet of pseudocode showing how one might efficiently implement [4B] in practice:

	/* Iterates stored in successive elements of an array x[]; current iteration = n.

Derivatives of f(x) assumed in variables f0, f1, f2, f3.
Double-precision reciprocals 1/2 and 1/3 in di2 and di3.
n integer; all other quantities double-precision. */
f1f1 = f1*f1; /* (f')^2 */
f0h = f0*di2; /* f/2 */
f0f3 = f0*f3*di3; /* f*f'''/3 */
f0f2h = f0h*f2; /* f*f''/2 */
f0f0f3 = f0h*f0f3; /* f^2*f'''/6 */
f0f2 = f0f2h+f0f2h; /* f*f'' */
num = f0*(f1f1 - f0f2h); /* f*[(f')^2 - f*f''/2] */
den = f1*(f1f1 - f0f2) + f0f0f3; /* (f')^3 - f*f'*f'' + f^2*f'''/6 */
x[n+1] = x[n] - num/den;

and use the resulting operation count in our comparison:

So the "good looks" of this latest scheme are reflected more in the fact that the ƒ-form of the scheme [4B] is roughly as cheap to evaluate as the 2-step H / xn+1 form [4b], rather that in some genuine reduction in optimal implementational cost.

There are several further approximational possibilities at this convergence order, which we shall encapsulate in the form of the following


Exercise: Repeat the above derivations and numerical tests, now for the iterative scheme which results from replacing two powers of Δx in the third-order term in the Taylor series with the Newtonian approximation (−ƒ/ƒ′). What is the convergence order of the resulting scheme? Does the formula allow for any "nice" simplifications when written out in terms of ƒ and its derivatives?

What happens if you replace all three powers of Δx in the third-order Taylor series term with the approximation (−ƒ/ƒ′)?


As we proceed to higher order in this vein, we find ourselves suffering from a proverbial embarrassment of riches, in the form of a proliferation of possible approximating substitutions for the various powers of Δx in the Taylor expansion as we retain additional higher-order terms. While the resulting complications are not entirely unwelcome, it would be nice to be able to narrow our choices again somewhat, preferably in some fashion which is easily systematized and which preferentially leads to "nice looking" formulae of the desired convergence order. The following section shows how to accomplish that goal, in a manner which is illuminating, yields iterative schemes at each order which tend to be algebraically compact, and which moreover proves to be encodable via a reasonably simple recurrence relation that can be followed to any desired order without the tedium of ever-increasing amounts of asymptotic analysis.


Higher-order analogs of N-R, Part 3: An Improved Approach

We shall re-derive Halley's method below by more general (and quite elegant) means; moreover, the method used will allow us to derive generalized N-R-style schemes of arbitrarily high order, the only cost of the more-rapid convergence being the need to evaluate derivatives of higher order, and to combine the various derivatives via formulae whose complexity increases roughly linearly with the convergence order obtained.

In 1994, in a brief but very interesting SIAM Review Classroom Note, J. Gerlach ("Accelerated Convergence in Newton's Method," SIAM Review 36, 272-276) used the insight that Newton's method works best on functions which are as nearly linear as possible in the neighborhood of the root being sought to rederive Halley's method in wonderfully elegant fashion, by recasting the geometric property in the form of the question: "For general nonlinear functions, can we locally modify the function near the desired root so as to make it look like a linear function and then apply the Newton-Raphson method to that?" The answer proves to be emphatically "yes", with a small number of caveats which we shall note later.

Here is the recipe: Assume we are near a root of ƒ(x), which is continuous and has continuous first and second derivatives in the neighborhood of interest. (Typically that means the interval bracketed by the initial guess and the nearby root, or some slightly larger neighborhood containing it and enclosing no other roots or local functional extrema.) Assume xn is near the root, i.e. ƒ(xn) is 'small' in some quantifiable sense. Further assume the nearby root is a simple root, i.e. ƒ′(x*) ≠ 0. We could thus define 'smallness' of ƒ(xn) with respect to the slope, say, |ƒ(xn)| << |ƒ′(xn)|. It matters whether ƒ′′ is already zero at the root, but we first consider the most-general case and assume that like ƒ′, the second derivative ƒ′′ is nonzero there (that is, ƒ(x) is nonlinear), and that both ƒ′ and ƒ′′ are large in magnitude compared to ƒ at the current iterate xn. We seek to modify ƒ(x) in the neighborhood of the root by multiplying it by some as-yet-unknown function g(x), so as to nullify the second derivative of the modified function F(x) := ƒ(x)⋅g(x) there. (The choice of a multiplicative modifier function is to ensure that F(x) has the same root as the one of ƒ(x) which we seek.) In other words, we seek g(x) such that the product function F is linear near the root. The second derivative of the modified function F(x) is

F′′ = ƒ′′⋅g + 2⋅ƒ′⋅g′ + ƒ⋅g′′ .

This does not look especially promising until we make use of our assumption that |ƒ| << (|ƒ′| and |ƒ′′|) in order to justify neglect of the ƒ⋅g′′ term. (This also requires assuming things about the relative magnitudes of g, g′ and g′′, but since g is as yet unknown we proceed based on the assumption that g′′ is not so large relative to g and g′ as to offset the smallness of ƒ in the neighborhood of the root, an assumption we must of course validate once we have found a candidate function g(x).) Now we have

F′′ ≅ ƒ′′⋅g + 2⋅ƒ′⋅g′ ,

and since we seek to make this zero via a proper choice of modifier function g, we immediately obtain (this will be exact at the root)

ƒ′′⋅g + 2⋅ƒ′⋅g′ = 0 ,

which can be rewritten as

g′/g = −ƒ′′/(2⋅ƒ′) ,

and further in terms of logarithmic differentiation as

d(log(g))/dx = d(−log(ƒ′)/2)/dx .

This immediately integrates to (we set the constant of integration to 0, since it proves to be unnecessary to consider nonzero values)

log(g) = −log(ƒ′)/2 ,

which reveals the modifier function we seek to be
g = 1/√(ƒ′) .[G2]

(We leave it to the reader to satisfy him-or-herself that if ƒ′ and ƒ′′ are nonvanishing at the root in question, none of g, g′ or g′′ is such as to invalidate our assumption that F′′ reduces to ƒ′′⋅g + 2⋅ƒ′⋅g′ there.)

Now applying N-R to the modified function F(x) we obtain F = ƒ⋅g = ƒ/√(ƒ′), F′ = ƒ′⋅g + ƒ⋅g′ = √(ƒ′) − ƒ (ƒ′)−3/2⋅ƒ′′/2, yielding the iterative update

xn+1 = xn − F/F′ = xn − ƒ/[ƒ′ − ƒ⋅ƒ′′/(2⋅ƒ′)]

and multiplying numerator and denominator of the iterative update term subtracted from xn by ƒ′, we obtain Halley's formula [3H], which we have here rediscovered by more-elegant means based on the geometrical insight and aforementioned reposing of the question "for what kinds of functions does the Newton-Raphson method work best?"
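As a quick numerical sanity check of this construction, one can apply plain N-R to the modified function F = ƒ/√(ƒ′) and confirm that the iterates coincide with Halley's. The minimal C sketch below uses an illustrative test function of our own choosing, ƒ(x) = x2 − 2, whose derivative is positive near the root so that √(ƒ′) stays real:

#include <math.h>
#include <stdio.h>

/* Numerical check of the modified-function idea: plain N-R applied to
   F(x) = f(x)/sqrt(f'(x)) should reproduce the Halley iterates [3H] exactly
   (up to roundoff). Illustrative test function: f(x) = x*x - 2, whose
   derivative is positive near the root x* = sqrt(2). */
int main(void)
{
	double x = 1.0, xh = 1.0;	/* x: N-R on F;  xh: Halley [3H] applied to f */
	for (int i = 1; i <= 4; i++) {
		double f = x*x - 2.0, f1 = 2.0*x, f2 = 2.0;
		double s = sqrt(f1);
		double F = f/s;
		double Fp = s - f*f2/(2.0*s*s*s);	/* F' = f'*g + f*g', with g = 1/sqrt(f') */
		x = x - F/Fp;

		double fh = xh*xh - 2.0, fh1 = 2.0*xh, fh2 = 2.0;
		xh = xh - fh*fh1/(fh1*fh1 - fh*fh2*0.5);	/* Halley [3H] */

		printf("%d  %.16f  %.16f\n", i, x, xh);
	}
	return 0;
}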

Note that if ƒ′′ = 0 to begin with (i.e. without needing any modifier function applied to it) at the root, Halley's formula immediately can be seen to reduce to the Newton-Raphson method, which is another way of reinforcing that N-R applied to such functions is "automatically third-order convergent".

It is also worth investigating what happens with the modified-function approach used above if the test function ƒ(x) has nonzero slope at the desired root (i.e. the root is simple) but ƒ′′ = 0 there. In that case N-R applied to ƒ(x) near the root will automatically behave in cubically convergent fashion, but − being ambitious − we may wish to see if we can obtain even faster convergence by modifying ƒ(x) to make the next-higher derivative, ƒ′′′, also zero at the (as yet undetermined) root. Again using g(x) for the multiplicative modifier function (but with the expectation that this will be a different function than we derived above) we take one more derivative of our expression for the second derivative of the product function F(x):

F′′′ = ƒ′′′⋅g + 3⋅ƒ′′⋅g′ + 3⋅ƒ′⋅g′′ + ƒ⋅g′′′ .

Again assuming that we are sufficiently close to the root to make ƒ(x) negligible, and now assuming the same for ƒ′′, gives

F′′′ ≅ ƒ′′′⋅g + 3⋅ƒ′⋅g′′ ,

and since we seek to make this zero via the modifier function g, this reduces further to

ƒ′′′⋅g + 3⋅ƒ′⋅g′′ = 0 ,

which can be rewritten as

g′′/g = −ƒ′′′/(3⋅ƒ′) ,

Note that d2(log(ƒ′))/dx2 = (ƒ′′/ƒ′)′ = [ƒ′′′⋅ƒ′ − (ƒ′′)2]/(ƒ′)2, which simplifies to ƒ′′′/ƒ′ as a result of our assumption that ƒ′′ = 0. The left-hand-side term g′′/g is a bit more problematic, since although d2(log(g))/dx2 = (g′/g)′ = [g′′⋅g − (g′)2]/g2 = g′′/g − (g′/g)2, we have not made any assumptions about g(x) which would allow us to neglect the (g′/g)2 term. In order to proceed we must thus further require that any g(x) we find must have g′ vanishing at the root, so that g′′/g may there be replaced by d2(log(g))/dx2 and our equation becomes d2(log(g))/dx2 = −(1/3)⋅d2(log(ƒ′))/dx2. Integrating this twice (each time taking the constant of integration to be 0) gives

log(g) = −log(ƒ′)/3 ,

thus the candidate modifier function for zeroing the third derivative of the product ƒ(x)⋅g(x) is the inverse cube root of ƒ′,

g = (ƒ′)−1/3 .[G3]

This has first derivative g′ = −(ƒ′)−4/3⋅ƒ′′/3, so we post-hoc validate that (g′/g) = −ƒ′′/(3⋅ƒ′) is indeed zero at the root, as required.

Applying N-R to the modified function F(x) we obtain F = ƒ⋅(ƒ′)−1/3 and F′ = ƒ′⋅g + ƒ⋅g′ = (ƒ′)2/3 − ƒ⋅(ƒ′)−4/3⋅ƒ′′/3, yielding an iterative update formula much like Halley's formula above, just with a 3 dividing the ƒ⋅ƒ′′ term in place of a 2. Again, this particular iteration is specifically designed to locally nullify the third derivative near the root and thus be fourth-order-convergent when applied to a function having zero second derivative at the root − it will not be fourth-order, nor even third-order, when applied to a general function:
xn+1 = xn − ƒ⋅ƒ′ / [(ƒ′)2 − ƒ⋅ƒ′′/3] . [G4]


Exercise: Apply the iterative formula [G4] to the function ƒ(x) = cos(x) and determine the resulting convergence order. Can you explain the result?


Now any general nonlinear function one may care to write down (or encounter in practice) is unlikely to be of a special form for which ƒ′′(x*) = 0. The main usefulness of the above result is to couple it with the observation that starting with some target function satisfying just the minimal simple-root criteria ƒ(x*) = 0, ƒ′(x*) ≠ 0, we have already shown how to convert that into a function having a zero second derivative at the root. In other words, in this general case, the function to which we should apply the above "cubic to quartic order-raising" transformation [G3] is just the modified function we already constructed to go from quadratic to cubic convergence order, F(x) = ƒ(x)⋅g2(x) = ƒ/√(ƒ′), where we now add a subscript on the multiplicative-modifier function to indicate the order of the derivative of ƒ(x) which it is constructed to nullify. The function g2 = (ƒ′)−½ has first and second derivatives

g2′ = −(ƒ′)−3/2⋅ƒ′′/2 ,

g2′′ = (3/4)⋅(ƒ′)−5/2 (ƒ′′)2 −(ƒ′)−3/2⋅ƒ′′′/2 ,

hence

F′ = [ƒ′⋅g2 + ƒ⋅g2′] = [(ƒ′)1/2 − ƒ⋅(ƒ′)−3/2⋅ƒ′′/2] ,

F′′ = [ƒ′′⋅g2 + 2⋅ƒ′⋅g2′ + ƒ⋅g2′′] = [(ƒ′)−1/2⋅ƒ′′ − (ƒ′)−1/2⋅ƒ′′ + (3/4)⋅ƒ⋅(ƒ′)−5/2⋅(ƒ′′)2 − ƒ⋅(ƒ′)−3/2⋅ƒ′′′/2] , in which the first two terms cancel.

Further using that

(F′)2 = [(ƒ′)1/2 − ƒ⋅(ƒ′)−3/2⋅ƒ′′/2]2 = [ƒ′ − ƒ⋅ƒ′′/ƒ′ + (ƒ⋅ƒ′′/2)2⋅(ƒ′)−3] ,

and that

F⋅F′′ = (ƒ/ƒ′)2⋅[3⋅(ƒ′′/2)2/ƒ′ − ƒ′′′/2] ,

substitution of these expressions into the iterative update [G4] suitable for a function with zero second derivative at the root yields

xn+1 = xn − F⋅F′ / [(F′)2 − F⋅F′′/3] = xn − ƒ⋅(ƒ′)−1/2⋅[(ƒ′)1/2 − ƒ⋅(ƒ′)−3/2⋅ƒ′′/2] / { [ƒ′ − ƒ⋅ƒ′′/ƒ′ + (ƒ⋅ƒ′′/2)2⋅(ƒ′)−3] − (ƒ/ƒ′)2⋅[(ƒ′′/2)2/ƒ′ − ƒ′′′/6] } .

This looks a bit daunting at first glance (the mathematical term is "gnarly" :), but can be substantially simplified. Multiplying the numerator and denominator by (ƒ′)2 and noting the cancellation of the two (ƒ⋅ƒ′′/2)2/ƒ′ terms in the resulting denominator, we obtain a result identical to the scheme [4B] of the preceding section (we here append a 'G' in honor of Gerlach's contribution):

xn+1 = xn − ƒ⋅[(ƒ′)2 − ƒ⋅ƒ′′/2] / [(ƒ′)3 − ƒ⋅ƒ′⋅ƒ′′ + ƒ2⋅ƒ′′′/6] . [4G]

One can derive schemes of arbitrarily high convergence order in this fashion. Gerlach proves that for a target function satisfying ƒ(x*) = 0, ƒ′(x*) ≠ 0, ƒ′′(x*) = ... = ƒ(m-1)(x*) = 0 and ƒ(m)(x*) ≠ 0, the general form of the modifier function needed to zero out the (m)th derivative of ƒ is

gm = (ƒ′)−1/m .[Gm]

At any given order the algebra needed to raise the order "one more notch" is thus quite tractable; the Newton-Raphson formula applied to the generic (m)th-derivative-zeroing product function is of the same form as the last equation above, just with a (1/m) multiplying the ƒ⋅ƒ′′ term in the denominator. Of course for generic target functions having just a simple root at some x which one wishes to pin down − the ones most commonly encountered in real-world applications − one must apply the formula (m-1) times inductively in order to successively zero the 2nd through the (m)th derivatives, and in this context the algebra needed to derive higher-order schemes becomes progressively more unwieldy, although the main theoretical result is that one can do so, simply by following the recipe step-by-step. Nowadays one can use symbolic-algebra software to aid with such derivations, but even that proves unnecessary, since it turns out that the resulting higher-order formulae can be expressed compactly via a simple recurrence relation, which we describe as a way of concluding this section.

After reading the above paper with interest, the present author discovered an elegant recurrence formula for these generalized iterative schemes, namely that for ƒ(x) sufficiently smooth, a (k)th-order-converging iterative scheme is defined by

xn+1 = xn − ƒ⋅Gk-1 / Gk , [E1]

where all terms in the right-hand-side fraction are evaluated at the current iterate xn and the function sequence Gk(x) is defined by the following simple recurrence, starting with G1=1 (that is, with k = 2):

Gk = ƒ′⋅Gk-1 − ƒ⋅G′k-1/(k-1) . [E2]

For example, k = 2 gives G2(x) = ƒ′(x), immediately yielding the classical 2nd-order Newton-Raphson scheme, [NR]. k = 3 gives G3(x) = (ƒ′)2 − ƒ(x)⋅ƒ′′/2, yielding the 3rd-order Halley scheme [3H] which we derived above, and so forth. Gerlach and I exchanged some excited mail about this in the months after his paper appeared in 1994, but never formally published it. The same method was later independently reported by Ford and Pennline (SIAM Review 38, 658-659). Happy 300th anniversary, Edmond Halley!
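A compact way to convince oneself of [E1]/[E2] numerically is to hard-code the first few Gk and compare the resulting iterates against the schemes already tabulated above. The sketch below (function and variable names are our own; the test function is again ƒ(x) = exp(−x) − x) implements the k = 2 and k = 3 members, which should reproduce the N-R and Halley iterates respectively:

#include <math.h>
#include <stdio.h>

/* Test function from the earlier tables, plus its first two derivatives. */
static double f (double x) { return exp(-x) - x; }
static double f1(double x) { return -exp(-x) - 1.0; }
static double f2(double x) { return exp(-x); }

/* One step of the general update [E1], x - f*G_{k-1}/G_k, for k = 2 and k = 3,
   using G1 = 1, G2 = f', G3 = (f')^2 - f*f''/2 from the recurrence [E2]. */
static double step_k2(double x) { return x - f(x)/f1(x); }                /* = [NR] */
static double step_k3(double x) {
	double g2 = f1(x), g3 = f1(x)*f1(x) - f(x)*f2(x)*0.5;
	return x - f(x)*g2/g3;                                            /* = [3H] */
}

int main(void)
{
	double x2 = 1.0, x3 = 1.0;
	for (int i = 1; i <= 5; i++) {
		x2 = step_k2(x2);
		x3 = step_k3(x3);
		printf("%d  %.16f  %.16f\n", i, x2, x3);
	}
	return 0;
}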


Exercise: Extend the above functional recurrence to k = 4 and compare the result with the fourth-order iterative scheme [4]. Now extend to k = 5, find an efficient computational implementation of the resulting scheme, compare your operation count to the ones given above for the 3rd and 4th-order schemes, and verify the 5th-order convergence speed. (Note that a further useful sanity-check of one's algebra here is that the coefficients of the various ƒ-derivative products appearing in Gk(x) must sum to 1/(k-1)! E.g. for Halley's method we observe the coefficients of the ƒ-terms in the denominator G3 have sum = (1 - 1/2) = 1/2!; for the fourth-order scheme [4B] the denominator G4 yields sum = 1-1+1/6 = 1/3!, etc).


Exercise: Compare the above formulae for schemes of arbitrary convergence order to Householder's method, which approaches the issue via the mathematics of geometric sequences. (Cf. also the webpage by [SeGour]).




Special Case #1: Iterative Inversion

As we have previously mentioned, floating-point division tends to be slow (except possibly in a pipelined sense) on most computer hardware, so especially when one is willing to accept some hopefully-small degradation in the accuracy of the result, it is worth exploring speedier alternatives. Here we consider fast algorithms for computing the multiplicative inverse of a nonzero constant c, which will allow us to replace division by multiplication-by-inverse. The idea is to generate some initial approximation to the true inverse and then use an iterative approach to successively refine that. This points to some form of the Newton-Raphson iteration, but applied to which target function? The "obvious choice" ƒ = x − 1/c does not work, because N-R applied to it yields the tautology xn+1 = 1/c. That is, even though this target function is linear and thus of the kind for which N-R converges in a single step, here we have an added restriction on whatever iterative scheme we devise, namely that it must not involve an explicit division. Luckily there is a way out (at least at this order), which is to replace the foregoing linear target function with one which inverts the x and the c-terms:

ƒ(x) := c − 1/x .

Then ƒ′ = 1/x2 which upon substitution into [NR] and a bit of rearrangement yields the well-known iteration formula (the equation label [Inv2] is short for "inverse, second-order")

xn+1 = xn⋅(2 − c⋅xn) ,[Inv2]

which requires only the simple and fast arithmetic operations of multiplication and subtraction.
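A bare-bones C illustration of [Inv2] follows; the constant c = 3 and the starting guess x0 = 0.25 are arbitrary choices of ours, and (as the exercise below explores) the initial guess must lie within the scheme's basin of convergence:

#include <stdio.h>

/* Approximate 1/c via the division-free Newton-Raphson iteration [Inv2]. */
int main(void)
{
	const double c = 3.0;
	double x = 0.25;	/* crude initial guess, inside the basin of convergence */
	for (int i = 1; i <= 6; i++) {
		x = x*(2.0 - c*x);	/* [Inv2]: only multiplies and a subtract */
		printf("%d  %.16f\n", i, x);
	}
	return 0;
}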


Exercise: Use the formula [Inv2] in an attempt to iteratively compute the (admittedly trivial) inverse of c = 2, starting with initial guesses 1 and 0.75. Follow this with some experiments attempting to map out the basin of convergence.


One typically applies a bit-shift (in the floating point sense this translates to "multiplication by a suitable power of 2") to the number to be inverted so as to scale it into a standard interval, such as [1.0, 2.0), for which the inverse then also falls into the known range (0.5, 1.0], which − importantly − allows the exponent bits (with respect to the IEEE-standard representation of the floating data type being used) of the scaled denominator and inverse to be known in advance. One then has various options to use this normalized denominator in order to generate an initial guess of some guaranteed accuracy, in order to speed convergence of the subsequent iterates. One obvious approach is to extract the leftmost few significand bits of the normalized denominator and use the result as an index into a precomputed lookup table; say one uses the leftmost byte of the scaled c and does a lookup into a length-256 table of precomputed inverses, each also encoded as a single byte and then converted to a datum of the proper type prior to the N-R iteration. One can obtain one more bit of accuracy in this procedure if one assumes inputs are normalized, in which case one can use one's bytes to store bits 1-8 of the significands, knowing that bit 0, the hidden bit, is guaranteed to = 1 (so that 9 leading significand bits are known in all).
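To make the table-lookup idea concrete, here is a simplified sketch which assumes the input has already been scaled into [1.0, 2.0); for clarity it stores full single-precision entries rather than the packed bytes described above, and the table size, names and memcpy-based bit extraction are illustrative choices rather than a canonical recipe:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* 256-entry table of approximate reciprocals for scaled inputs c in [1.0, 2.0).
   Entry i covers the subinterval [1 + i/256, 1 + (i+1)/256); we store the
   reciprocal of the interval midpoint, giving roughly 9 good bits. */
static float rcp_table[256];

static void init_rcp_table(void)
{
	for (int i = 0; i < 256; i++)
		rcp_table[i] = 1.0f/(1.0f + (i + 0.5f)/256.0f);
}

/* Initial guess for 1/c, c assumed pre-scaled into [1.0, 2.0): index the table
   with the leading 8 explicit significand bits (the hidden bit is known to be 1). */
static float rcp_guess(float c)
{
	uint32_t bits;
	memcpy(&bits, &c, sizeof bits);		/* type-pun the IEEE-754 encoding */
	return rcp_table[(bits >> 15) & 0xff];	/* bits 22..15 = top 8 mantissa bits */
}

int main(void)
{
	init_rcp_table();
	float c = 1.7f, x = rcp_guess(c);
	for (int i = 0; i < 2; i++)
		x = x*(2.0f - c*x);	/* two [Inv2] steps suffice for single precision */
	printf("approx 1/%g = %.9g\n", c, x);
	return 0;
}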

Or one could use some kind of optimized (to balance efficiency and compactness against the resulting accuracy) "canned" curve-fit to generate approximate inverses for use as initial guesses. This same approach can be used for general division, with c playing the role of denominator, and the above power-of-2 scalings applied to a generic numerator rather than to 1. The following exercise results in just such a curve-fit formula.


Exercise: Generate a nearly-optimal linear curve-fit to the reciprocal function 1/x in the interval x ∈ [1, 2] by the following steps:


So far, so good. What if we desire a third-order-convergent iterative inversion scheme? Applying Halley's cubic-convergence formula to the same target function yields (all terms on the right-hand side are at the current iterate xn)

xn+1 = xn − (c/x2 − 1/x3) / (c/x3) = xn − (c⋅x − 1)/c ,

which, like our failed first attempt to apply N-R to the trial function x − 1/c, reduces to 1/c, the very quantity we are trying to approximate without use of explicit inversion. The same thing happens for the fourth-order scheme [4]. Clearly, a different approach is needed if we wish to go to higher order.

Consider the current iterate xn (or just x for short) to be close to the desired result, in the sense that

x = 1/c + ε,

with |ε| << 1 a small parameter. We now compute various powers of x and seek simple combinations involving

x2 = 1/c2 + ε⋅(2/c) + ε2 .

Notice that the combination 2⋅x − c⋅x2 leaves a residual consisting of the desired inverse 1/c at zeroth order in terms of ε and cancels the first-order terms, leaving an error which is quadratic in ε. When recast in the form of an iterative-update this result gives us back the Newton-Raphson iterative inversion scheme [Inv2]. The question is, does this alternative approach permit derivation of iterative schemes of higher order, as well? Indeed it does. Write out the cube of x:

x3 = 1/c3 + ε⋅(3/c2) + ε2⋅(3/c) + ε3 .

We now seek a combination of x, x2 and x3 which cancels the ε and ε2 terms, leaving just 1/c and a residual error which is cubic in the small parameter, i.e. A⋅x + B⋅c⋅x2 + C⋅c2⋅x3 = 1/c + O(ε3). Collecting terms of order 0,1 and 2 in ε gives
ε0:   A +  B +  C = 1
ε1:   A + 2B + 3C = 0
ε2:        B + 3C = 0

This three-term linear system is easily solved to yield A = −B = 3, C = 1, yielding the desired third-order iterative scheme

xn+1 = 3⋅x − 3⋅c⋅x2 + c2⋅x3 = x⋅(3 − 3⋅c⋅x + c2⋅x2) = x⋅(3 − c⋅x⋅(3 − c⋅x)) ,

which we can cast in terms of a 2-step iterative update:
yn = c⋅xn ,
xn+1 = xn⋅(3 − yn⋅(3 − yn)) .[Inv3]

Similarly we can derive a quartically-convergent scheme of the form A⋅x + B⋅c⋅x2 + C⋅c2⋅x3 + D⋅c3⋅x4 = 1/c + O(ε4), which leads to the fourth-order linear system

ε0:   A +  B +  C +  D = 1
ε1:   A + 2B + 3C + 4D = 0
ε2:        B + 3C + 6D = 0
ε3:             C + 4D = 0

with solution A = 4, B = −6, C = 4, D = −1, yielding the fourth-order iterative scheme

xn+1 = x⋅(4 − 6⋅c⋅x + 4⋅c2⋅x2 − c3⋅x3) = x⋅(4 − (c⋅x)⋅(6 − (c⋅x)⋅(4 − (c⋅x)))) .

We can again efficiently perform each iteration of this scheme in 2-step fashion with the same kind of nested multiply-accumulate in the second step as in the cubic case:

yn = c⋅xn ,
xn+1 = xn⋅(4 − yn⋅(6 − yn⋅(4 − yn))) ,[Inv4]

This needs a total of 4 multiplications and 3 subtractions for each iteration. By way of comparison each iteration of the cubic scheme [Inv3] needs 3 multiplications and 2 subtractions, and each quadratic Newton-Raphson iteration [Inv2] requires 2 multiplications and 1 subtraction. Thus we see that going to higher order is comparatively cheap, since each additional order of convergence needs just one more multiplication and one more subtraction.

Lastly, note the simple pattern of constants appearing in the sequence of nested (const − yn⋅(...)) terms in the above schemes:

2nd order:  2
3rd order:  3, 3
4th order:  4, 6, 4

These are in fact nothing other than the binomial coefficients of the 1st through (n-1)th powers of α in (1 + α)n, i.e. the interior coefficients (excluding the left and right bracketing '1's) of a Pascal's triangle, with the (k)th row (taking the initial solitary '1' as the "zero row") corresponding to the (k)th-order-convergent iterative scheme, the general 2-step iterative formula for which thus is

yn = c⋅xn ,
xn+1 = xn⋅(C(k,1) − yn⋅(C(k,2) − yn⋅( ... yn⋅(C(k,k-1) − yn) ... ))) ,[InvK]

where the binomial coefficients can alternatively be written in standard combinatoric "n choose k" form:

C(n,k) = n! / [k!⋅(n−k)!] .

Using this shortcut (the proof of which requires generalization and inductive solution of the above linear systems) one can simply write out iterative-inversion formulae of any desired order, for example the quintic scheme is

yn = c⋅xn ,
xn+1 = xn⋅(5 − yn⋅(10 − yn⋅(10 − yn⋅(5 − yn)))) ,[Inv5]

and the sextic is

yn = c⋅xn ,
xn+1 = xn⋅(6 − yn⋅(15 − yn⋅(20 − yn⋅(15 − yn⋅(6 − yn))))) .[Inv6]


Exercise − "Dude, that is one sick equation": Numerically validate the order of convergence of the quintic and sextic iterative inversion formulae [Inv5] and [Inv6]. Now write down the 7th-order − we hesitate to use the Latinate descriptor here, since "septic" has negative connotations related to its usage in other fields than mathematics − inversion scheme using the above shortcut and validate its order of convergence.


When used in "controlled" fashion, that is, with denominator scaled into a known interval and either a lookup table or curve-fit used to generate an initial guess which is guaranteed to converge, it is worth asking whether there is ever any practical need for the above higher-order-convergent algorithms. One common application of the above techniques is to generate inverses "accurate to nearly hardware precision", most commonly meaning either IEEE Standard floating-point single (32-bit with 24-bit significand, where we assume normal data with hidden bit=1) and double (64-bit with 53-bit significand, including the hidden bit) precision. If we start with (say) an 8-bit-accurate initial guess, we will need either two iterations of [Inv2] or one of [Inv3] to get a "fully converged" result, so there may be a slight savings from using the third-order scheme. Again starting with an 8-bit estimate, a double-precision result would need either three [Inv2] steps, or one step of [Inv2] and one of [Inv3], or one step of [Inv7], whose form the reader has already derived, if the reader has been diligently doing the exercises while working through this material.

But increasingly, designers of modern microprocessors have made our work even easier in this regard, by providing hardware operations to speedily generate initial guesses at some useful level of guaranteed precision. For example, the x86 instruction set architecture (ISA) used in Intel and AMD CPUs of recent vintage (specifically ones supporting the SSE vector-data instructions) provides a pair of hardware instructions rcpss and rcpps, which take one and four single-precision inputs, respectively, stored in a 128-bit-wide hardware SSE register and return an approximate floating-point reciprocal with a relative error of at most 1.5⋅2−12, which translates (by design) to "half the bits" of a single-precision significand. One can use that and follow it with just a single N-R iteration using [Inv2] to generate a fully-converged single-precision reciprocal. Such hardware operations are of course architecture-specific and require either assembly code or so-called "compiler intrinsics" to access.
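For readers who wish to experiment, a minimal intrinsics-based sketch of the single-precision case might look as follows (the function name is ours; compiler support for SSE intrinsics is assumed):

#include <xmmintrin.h>	/* SSE intrinsics: _mm_set_ss, _mm_rcp_ss, _mm_cvtss_f32 */
#include <stdio.h>

/* Approximate 1/c to roughly full single precision: rcpss supplies ~12 good bits,
   and one [Inv2] step then doubles that to ~24 bits. */
static float fast_recip(float c)
{
	float x = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(c)));	/* hardware estimate */
	return x*(2.0f - c*x);					/* one N-R refinement */
}

int main(void)
{
	printf("%.9g\n", fast_recip(3.0f));
	return 0;
}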

There are no analogous approximate-reciprocal instructions for double-precision data, but such are not needed, since the same ISA provides fast single-to-double-precision conversion instructions for floating-point SSE-register data, so if double-precision reciprocals are needed one can take a 24-bit-converged single-precision reciprocal, convert to double, and follow with either one more iteration of [Inv2] to get a "nearly converged" result accurate to 47-48 bits, or 2 such iterations, or one iteration of [Inv3] to get a fully converged double-precision reciprocal. For a discussion of this SSE-based implementation including assembly-code samples, optimization ideas and the resulting timing data, we refer the reader to this math-forum thread. (As it happens, the foregoing thread is what prompted the present author to finally put together some long-simmering thoughts on the topic of iterative schemes in a coherent webpage form).


Special Case #2: Iterative Square Root

Another very common "slow" hardware operation for which an alternate fast emulated alternative is frequently needed is floating-point square root. Rather than becoming less of an issue in recent years as a result of to pipelined "one result per cycle after some large initial latency" implementation on high-end microprocessors, for square root the need for fast versions based on simpler hardware operations has if anything grown more acute due to the use of this operation in computer graphics software used in areas as diverse as CAD software, CGI software for the movie industry, and − of course − computer games.

As we did for inversion, we first consider the issue of which target function to feed to the Newton-Raphson algorithm in order to obtain an efficient iterative scheme for computation of √c. The obvious choice ƒ(x) := x2 − c has the same kind of "obvious" problem that x − 1/c did for iterative inversion, namely that it yields a computationally "expensive" iterative scheme

xn+1 = (xn + c/xn)/2 .[Sqrt2]

Even if it leads to a stable iteration with reasonable basin-of-attraction properties in the neighborhood of the root we seek, the resulting computation involves explicit inversion, the very operation we previously spent no small amount of effort avoiding. (We can of course compute 1/x at each step using the fast iterative-inversion [Inv2], but having to do a full cycle of inverse-iterations on each square-root iteration is not going to be especially cheap.) The cost of each iterative update here is one division, 2 multiplications (since when working in floating-point arithmetic the divide by 2 can be replaced with a multiply by 0.5 with no loss of precision) and one addition. If one's compute hardware can divide appreciably faster than it can take roots this may still be OK, so let's at least give it a whirl. Taking for example c = 2, starting with x0 = 1, our resulting convergence history at double precision is

n xn
__ ______________________
0 1
1 1.5
2 1.416...
3 1.414215...
4 1.414213562374...
5 1.414213562373095048801689...

which has the expected second-order convergence. Due to the quadratic term in the target function any analysis of the basin-of-attraction properties of the scheme must take account of the fact that both +√c and −√c are fixed points of the iteration, and indeed if we start the above case off with an initial guess x0 = −1 we obtain the same convergence history as above, just with a minus sign on each iterate. Thus the basin of attraction for any positive c is in fact a symmetric pair of basins separated by x = 0, which is invalid as an initial iterate due to the c/x term.

The iterative scheme [Sqrt2] whereby one takes an initial guess at the square root of c, divides c by the guess and then takes the arithmetic average of the current guess x and c/x to obtain an improved approximation is also interesting due to its long historical pedigree. Amazingly, this very same method for successively approximating square roots was known to and used by the ancient Babylonians, who themselves did not possess an algorithm for long division (much like the lack of a direct division capability by early compute hardware of that other ancient period, the 1940s through 1980s), again necessitating use of approximation algorithms to compute the needed inverse 1/x. In the case of the Babylonians, they used − ta da! − lookup tables for the needed reciprocals. Thus, this particular implementation of a quadratically convergent square-root-finding method likely predates the more general Newton-Raphson method by over 3000 years.

If we start things off very close to 0 our x1 iterate will be proportionally large but since the scheme proves stable for large x of either sign, it will still converge. E.g. for c = 2 and x0 = 0.01 we obtain

n xn
__ ______________________
0 0.01
1 100.005
2 50.01...
3 25.02...
4 12.55...
5 6.3...
6 3.3...
7 1.9...
8 1.49...
9 1.416...
10 1.414215...
11 1.4142135623738...
12 1.414213562373095048801689...

In other words, the cost of a poor initial guess is not failure to converge, merely slow convergence, since for iterates far larger than the desired root the error is only cut roughly in half at each iteration, until the iterate comes near enough for the scheme to "feel the attraction of the root" and begin to exhibit the quadratic convergence guaranteed for asymptotically small error.

Using the same target function we can also derive a cubic scheme using Halley's formula. Substituting ƒ = x2− c, ƒ′ = 2x, ƒ′′ = 2 into [3H] gives

xn+1 = xn⋅(xn2 + 3⋅c) / (3⋅xn2 + c) . [Sqrt3]

A quick substitution-check confirms that x = ±√c are again fixed points of the resulting iteration. The cost of each iterative update is 2 additions, 4 multiplications (we assume 3⋅c is precomputed, saving one multiplication per update, and xn2 is computed just once for use in both numerator and denominator) and one division to obtain the denominator in the above quotient. Repeating our sample computation with c = 2 and x0 = 1 we obtain the convergence history

n xn
__ ______________________
0 1
1 1.4
2 1.4142131...
3 1.4142135623730950487...
4 1.41421356237309504880168872420969807856967187537694807317667971... ,

which shows the expected cubic property.

Thus, the above target function ƒ(x) = x2− c works well if we are willing to do a division each iteration − although we shall shortly remove that necessity − and need no higher than third-order convergence. Since all derivatives of ƒ beyond the second vanish identically, it is not clear whether there are any higher-order schemes to be gotten for this choice of ƒ(x). Let us test the question using the standard fourth-order member [4] of the generalized Newtonian iterative schemes,

xn+1 = xn − ƒ⋅[(ƒ′)2 − ƒ⋅ƒ′′/2] / [(ƒ′)3 − ƒ⋅ƒ′⋅ƒ′′ + ƒ2⋅ƒ′′′/6] .

Using that ƒ′′′ = 0 this can be simplified to

xn+1 = xn − ƒ⋅[(ƒ′)2 − ƒ⋅ƒ′′/2] / {ƒ′⋅[(ƒ′)2 − ƒ⋅ƒ′′]} .

where we note that the resulting iteration formula does not reduce to Halley's method merely as a result of the vanishing third derivative. Further substituting ƒ(x) = x2− c we obtain

xn+1 = xn − (x2 − c)⋅(3⋅x2 + c) / [4⋅x⋅(x2 + c)] . [Sqrt4]

The result is slightly more computationally expensive than Halley's method − using standard abbreviations for the key hardware operations and treating addition and subtraction as equivalent in terms of computational cost, the per-update cost is 4 ADD, 5 MUL and 1 DIV − but is it truly a fourth-order scheme? Again trying this on our example case c = 2 with x0 = 1 indicates that it is:

n xn
__ ______________________
0 1
1 1.416...
2 1.414213562374...
3 1.414213562373095048801688724209698078569671875377... ,

and direct substitution of x = √c + ε into the iteration formula followed by some algebra to simplify things confirms the error in xn+1 to indeed be O(ε4). However, since all the above schemes require explicit inversion and convergence speed beyond cubic is unlikely to be of much practical use, we do not consider them further here.


Exercise: Further investigate the idea of speeding the iterative square root algorithm [Sqrt2] by coupling it with an iterative inversion scheme in order to compute the reciprocal of the denominator at each step of the square-root iteration. Notice that since xn is presumed to be converging to √c, one does not need to begin each fresh iterative-inversion cycle with a naive initial guess, but rather can use the converged inverse for the preceding iterate xn-1 to start things off in a "bootstrapping" fashion.


If the square-root iterate is converged to some relative-error ε, to what error level must the corresponding inverse computation be converged in order to preserve the overall convergence order of the scheme based on the iterative square-root formula used?

Make use of the above ideas to minimize the iterative-inversion work. If the square root computation needs O(n) iterations (where we now use "Big Oh of" in the infinite-asymptotic sense), does coupling that with an iterative inversion at each iteration (assume this also needs O(n) iterations to fully converge starting with a naive non-bootstrapped initial value) make the overall arithmetic operation count for the combined-iteration scheme O(n2), or is the work asymptotically less? (Note that the best-case scenario here would be the same O(n) order of magnitude as the operation count for the division-based scheme).


Next, we investigate whether it is possible to derive a division-free set of such iterative-square-root schemes in similar fashion as we did for the case of the iterative inverse. If we use an alternative target function involving the inverse of the current iterate, say ƒ(x) = 1/x2 − c, applying N-R to this yields a second-order iterative formula for the reciprocal square root of the computationally efficient kind we seek, with a per-iteration cost of 1 ADD and 4 MUL:

xn+1 = xn⋅(3 − c⋅xn2)/2 ,[Isqrt2]

The result is of the simple arithmetic form we desire, and the fact that the iteration converges − if it converges − to 1/√c is not a problem since we can simply multiply the final converged result by c to obtain √c. Let's try it out on our c = 2 case, again starting with x0 = 1 and here multiplying each iterate (rather than just the final one) by c to ease comparison with the above results:

n c⋅xn
__ ______________________
0 2
1 1
2 1.25
3 1.38...
4 1.413...
5 1.414212...
6 1.414213562372...
7 1.4142135623730950488016884...

The scheme appears to take a bit longer to "get going" than did N-R using the target function ƒ(x) = x2− c, but once the result gets close to the true answer the convergence is again quadratic. The basin of attraction of the scheme turns out to again be the semi-infinite interval containing the square root having the same sign as x0, but the qualitative behavior of the scheme is somewhat different far from the root. For example, starting things off with a deliberately poor guess x0 = 0.01 yields a sequence of iterates which increase by a multiplicative factor of roughly 3/2 at each step as long as the iterates are close to 0, and which approach the root strictly from below, in contrast to the wild-overshoot followed by a sequence of approximate halvings we observed for the previous second-order scheme. However, such issues can (and should) be avoided entirely by use of an input normalization and initial guess taken from a table lookup or curve-fit formula of the kind we derived for iterative inversion.

At this point we note one other property shared by all of the iterative-update formulae we shall derive, which results from the fact that they involve just the arithmetic operations of addition, subtraction, multiplication and division. Namely, if the constant c whose square root is being sought is a whole number (or more generally, a rational number) and the initial iterate is also rational, that is x0 = p/q with p and q integer, then the iterative update, if performed exactly, can yield only rational iterates. If √c is not rational − as will be the case for most rational c's, since √c is rational only when c is itself the square of a rational − this immediately demonstrates that in such cases these iterative methods will never actually achieve the exact value they are constructed to converge upon. For example, the above quadratically convergent iterative-update scheme for the inverse square root applied to an initial rational iterate x0 = p0/q0 can be rewritten in terms of a paired update of the numerator and denominator, which the reader may enjoy trying with some simple starting values such as c = 2, p0 = q0 = 1, using the arbitrary-precision integer-arithmetic library of their choice:

pn+1 = pn⋅(3⋅qn2 − c⋅pn2) ,           qn+1 = 2⋅qn3 .[Isqrt2_Ratio]

Each of the above updates should ideally be followed by division of both pn+1 and qn+1 by their greatest common divisor, gcd(pn+1, qn+1) in order to reduce the resulting fraction to its simplest form, but this additional step may be omitted for the purposes of simple numerical experimentation, if the user does not have a gcd functionality handy. It is an interesting question to compare the number of bits of accuracy in the current iterate xn (as compared to the exact inverse-square root) with the total number of bits needed to exactly store pn and qn; in other words to explore whether the exact-rational approach yields an efficient approximant.
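For readers who would like to try [Isqrt2_Ratio] without reaching for an arbitrary-precision library, the toy C sketch below runs a few exact steps for c = 2 using plain 64-bit integers, which overflow beyond roughly the fourth iterate (the gcd reduction mentioned above is omitted):

#include <stdio.h>

/* A few exact-rational steps of [Isqrt2_Ratio] for c = 2, p0 = q0 = 1.
   64-bit integers overflow beyond roughly the 4th iterate, so we stop there;
   serious experiments need an arbitrary-precision integer library. */
int main(void)
{
	const unsigned long long c = 2;
	unsigned long long p = 1, q = 1;
	for (int i = 1; i <= 4; i++) {
		unsigned long long pn = p*(3*q*q - c*p*p);
		unsigned long long qn = 2*q*q*q;
		p = pn; q = qn;
		printf("%d  %llu/%llu = %.15f\n", i, p, q, (double)p/(double)q);
	}
	return 0;
}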

Moving on to higher-order, if we substitute our new target function ƒ(x) = 1/x2 − c into Halley's formula we obtain

xn+1 = xn⋅(3 + c⋅xn2) / (1 + 3⋅c⋅xn2) ,

which is cubically convergent but needs a division. Once again, we need to try other means in order to obtain a higher-order formula requiring no hardware divide. To do this, we proceed much as we did in the case of iterative inversion: Again using our second-order iterative formula [Isqrt2] as a guide, assume the current iterate xn (or just x for short) to be close to the desired result, in the sense that

x = 1/√c + ε,

with |ε| << 1 a small parameter. We now seek simple combinations involving x and various powers of

c⋅x2 = 1 + ε⋅(2⋅√c) + ε2⋅c .

Multiplying the above two expressions together gives

c⋅x3 = 1/√c + 3⋅ε + ε2⋅(3⋅√c) + ε3⋅c ,

where we only care about the terms through O(ε2). We seek a linear combination of x and c⋅x3 satisfying

A⋅x + B⋅c⋅x3 = 1/√c + 0⋅ε + O(ε2) .

Substituting the above expressions for x and c⋅x3 into this last expression and separately collecting terms of order unity and O(ε) yields the linear system A + B = 1, A + 3⋅B = 0, which has solution A = 3/2, B = −1/2, corresponding (by construction) to our already-in-hand second-order formula [Isqrt2].

Continuing in this vein, we write out the expression for c2⋅x5 which results from multiplying the above formulae for c⋅x2 and c⋅x3 and retaining terms through O(ε3):

c2⋅x5 = 1/√c + 5⋅ε + ε2⋅(10⋅√c) + O(ε3) .

We now seek a linear combination of x, c⋅x3 and c2⋅x5 which cancels the ε and ε2 terms, leaving just 1/√c and a residual error which is cubic in the small parameter, i.e. A⋅x + B⋅c⋅x3 + C⋅c2⋅x5 = 1/√c + O(ε3). Collecting terms of order 0,1 and 2 in ε gives

ε0:   A +  B +   C = 1
ε1:   A + 3B +  5C = 0
ε2:       3B + 10C = 0

This three-term linear system is easily solved to yield A = 15/8, B = −5/4, C = 3/8, yielding the desired third-order iterative scheme, written in terms of a 2-step iterative update:

yn = c⋅xn2 ,
xn+1 = xn⋅(15 − 10⋅yn + 3⋅yn2)/8 .[Isqrt3]


Exercise: Numerically validate the order of convergence of the cubic iterative inverse-square-root formula [Isqrt3]. What rational constant does the asymptotic error ratio appear to be converging to? Can you find a simple "nested multiply-accumulate" way of rewriting the second step of the 2-step update?


Note that (again analogously to the case of iterative-inverse) the SSE-supporting versions of the x86 instruction set architecture provide a pair of hardware instructions rsqrtss and rsqrtps (the 'rsqrt' instruction prefix is a mnemonic for "reciprocal square root") specifically for the purpose of generating starting values for the above fast iterative inverse-square-root algorithms. These 2 hardware operations take one and four single-precision inputs, respectively, stored in a 128-bit-wide hardware SSE register and return an approximate floating-point reciprocal square root with a relative error of at most 1.5⋅2−12, which translates (by design) to "half the bits" of a single-precision significand. One can use that and follow it with just a single N-R iteration using [Isqrt2] to generate a fully-converged single-precision reciprocal square root, and follow that with a fast hardware-conversion-to-IEEE-double and two more iterations using [Isqrt2] or one of [Isqrt3] to efficiently generate double-precision-accurate results. If one requires more architecture-generic functionality, either a byte-sized lookup table or low-order curve-fit are good options for generating initial guesses of known accuracy. A properly constructed byte-sized lookup table which accounts for the hidden bit of the IEEE floating-point representation and thus uses the table data to store the next 8 bits of the initial-guess significands will yield at least 9 starting bits of accuracy, which can be followed by one iteration of [Isqrt2] and one of [Isqrt3] (in either order) to produce a fully converged (that is, within the roundoff error incurred by the computation) double-precision result.
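In the same spirit as the reciprocal sketch given earlier, a minimal single-precision version (function name ours; SSE intrinsic support assumed) might read:

#include <xmmintrin.h>	/* SSE intrinsics */
#include <stdio.h>

/* Approximate 1/sqrt(c) to roughly full single precision: rsqrtss supplies
   ~12 good bits, and one step of [Isqrt2] then doubles that to ~24 bits. */
static float fast_rsqrt(float c)
{
	float x = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(c)));	/* hardware estimate */
	return 0.5f*x*(3.0f - c*x*x);				/* one [Isqrt2] step */
}

int main(void)
{
	printf("%.9g\n", fast_rsqrt(2.0f));	/* ~ 1/sqrt(2) = 0.7071067811... */
	return 0;
}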

On the curve-fit front, the Extended Exercise in the section below shows how to efficiently obtain an approximate initial guess for 1/x (or more generally, other functions of interest) by way of a series expansion in Chebyshev Polynomials. Such techniques are extremely effective ways of obtaining rapidly convergent series-expansion approximations of a target function, and thus of obtaining maximal accuracy or "information content" with a minimal number of terms, which moreover are efficiently computable.


Extended Exercise − "Hilbert Space ... the Final Frontier": Efficient Approximation of Functions via Orthogonal Polynomials

(Note: Readers already well-acquainted with approximation in Hilbert spaces, Fourier series and orthogonal polynomials may wish to skip directly ahead to the section on the Chebyshev Polynomials, at their own peril, of course. Galahad: "Oh, let me face the peril!" Launcelot: "No, It's too perilous.")

A powerful way of accurately approximating smooth functions over some desired interval is via spectral methods. We do not seek here to give anything approaching a complete tutorial on such methods, but do need to introduce a small amount of machinery from the field of functional analysis by way of background. This quickly leads in the direction of much deeper mathematical topics, but understanding the small tidbit we develop here requires no more than a basic algebra and high-school-level calculus background. We ask the interested reader possessing such to please not be daunted by the "fancy notation" − in the end it all is just convenient notational shorthand for key operations definable in terms of familiar things such as absolute values, sums, differences and integrals. While the formal mathematical term for the current topic is the rather weighty-sounding approximation in a Hilbert Space, if you have ever done a homework problem involving approximating a function using a Fourier series expansion, you already have some practical experience with the subject, though you may not have been informed of this. (Perhaps for your own protection.)

The key to the theory of functional approximation is the concept of how to measure the "distance" between two given functions, call them ƒ and g. To do this we proceed analogously as for finite-dimensional vector spaces (e.g. 2-vectors (x,y) in the plane) and first define a function space S simply as a collection of objects, in this case functions of one or more variables, which is closed under the operations of addition and scalar multiplication: for any pair of functions ƒ and g in S (what it means for a function to be "in a space" will be clarified shortly), ƒ + g is in S, as is αƒ, where α is a real (or more generally, complex) number. Given a space S we now define a metric m(⋅, ⋅) which is an operation taking as its input any two objects in S (each dot in the preceding m(⋅, ⋅) notation is standard mathematical shorthand for "any object in S") and producing a nonnegative real number, the distance between the 2 input objects. Here our objects are functions, and for any three functions ƒ, g and h in S, for a metric to define a "sensible" measure of distance we further require it to satisfy the following three properties:
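m(ƒ, g) ≥ 0, with m(ƒ, g) = 0 if and only if ƒ = g    (positivity);
m(ƒ, g) = m(g, ƒ)    (symmetry);
m(ƒ, h) ≤ m(ƒ, g) + m(g, h)    (the triangle inequality).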

We will be especially concerned with convergent sequences of functions in the space, which are sequences {ƒ0, ƒ1, ... } in S such that limn→∞ ƒn = ƒ, where the limit of the sequence, ƒ, is also a function in S. This completeness property is in the same vein as well-known examples from the basic arithmetic of numbers: For example, taking c = 2 and x0 = 1 in our iterative-square-root formula [Sqrt2] yields an infinite sequence of ever-better rational approximations to √2: 1, 3/2, 17/12, 577/408, 665857/470832, ... , which converges in the sense that limn→∞ xn = √2, but since the limit itself is not a rational number, we conclude that the set of rational numbers is not complete. When dealing with function spaces, completeness is significantly harder to establish and characterize than e.g. for the rational numbers − it requires a particular subclass of convergent sequences, the so-called Cauchy sequences, and further required an entirely new way of understanding integration of functions, the Lebesgue integral, to be developed − but for simplicity (and thanks to the hard work of many brilliant mathematicians such as the aforementioned Messrs. Cauchy and Lebesgue), we shall simply assume it.

Normed Linear Spaces

In order to better quantify the accuracy of our functional approximations we will actually use a slightly more restricted measure of "closeness" called a norm, denoted || ⋅ ||, which is a function taking any single object in S and producing a nonnegative real number. Much like a metric a norm must satisfy three properties, but since a norm takes just a single function as its input, the symmetry property does not apply to the norm, at least not directly:
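||ƒ|| ≥ 0, with ||ƒ|| = 0 if and only if ƒ is the zero function    (positivity);
||α⋅ƒ|| = |α|⋅||ƒ|| for any scalar α    (homogeneity);
||ƒ + g|| ≤ ||ƒ|| + ||g||    (the triangle inequality).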

Given a norm we immediately have a metric by defining m(ƒ, g) = ||ƒ−g||, where the symmetry property on metrics follows from the symmetry of subtraction. Note that while every norm defines a metric, the converse need not hold; this is what we meant above when we described a norm as a more restricted measure of "closeness" than a metric. Norms also provide a very convenient means of defining what it means for a function to be "in a space": Any function space with a norm is called a normed space, and in fact we can define the space S as the set of all functions ƒ with bounded norm, ||ƒ|| < ∞. Two examples of norms, one defined in terms of an integral, the second not, are the L2 (pronounced "ell two") norm, also commonly referred to as the square norm or Euclidean norm, which defines a space consisting of all functions which are square-integrable:

||ƒ(x)||L2   =   [ ∫x |ƒ(x)|2 dx ]½ ,

and the L∞ or supremum norm (also known as the uniform norm),

||ƒ(x)||L∞   =   maxx |ƒ(x)| .

(The reason for the L∞ subscript in the definition of the supremum norm is that it can be considered as the limit of the Lp family of norms defined below, as p→∞.) The function space associated with a norm usually takes its name from that of the norm; thus the function space S defined by the L2 norm is usually simply referred to as "L2".

Both of the above are special cases of a more general family of function spaces parametrized by a power p ≥ 1, the so-called Lp spaces, defined by the norm

||ƒ(x)||Lp   =   [ ∫x |ƒ(x)|p dx ]1/p .

One often restricts the integral to some particular range of the independent variable, x ∈ [a,b]; in such cases the notation for the space defined by the particular norm will also include these bounds, e.g. the space of functions defined by the Lp norm over the interval [a,b] is abbreviated Lp[a,b].

Most such norms defined on functions have obvious analogs for finite-dimensional vectors, the result of simply replacing "integral of a function" with "sum over components of a vector". For example, doing this replacement for the L2 norm yields the well-known formula for the Euclidean norm on an n-vector x = {x0, x1, ... ,xn-1}:

||x||L2   =   [ x ⋅ x ]½   =   [ ∑i=0..n−1 xi2 ]½ ,

where the dot between the two bold-faced vector symbols now really means a vector inner product. One crucial difference between norms on finite-dimensional vector spaces and infinite-dimensional function spaces is that while the former are all equivalent (in the sense that given two finite-vector norms ||⋅||a and ||⋅||b, we can always find two positive constants c1 and c2 so as to bound the second norm above and below in terms of the first, c1⋅||x||a ≤ ||x||b ≤ c2⋅||x||a), this is definitively not the case for function spaces. To borrow a simple example from [Keener], ƒ(x) = x−1/3 is in both of the spaces L1[0,1] and L2[0,1], but is not in any space Lp[0,1] having p ≥ 3, since the resulting integral diverges.
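To make the Keener example concrete, here is a minimal symbolic sketch (using the sympy library; the variable names and loop are our own illustrative choices, not anything from [Keener]) which evaluates the integral defining the pth power of the Lp[0,1] norm of ƒ(x) = x−1/3 for several values of p:

import sympy as sp

# Check which Lp[0,1] spaces contain f(x) = x^(-1/3): the integral of |f|^p
# over [0,1] is finite for p = 1, 2 but diverges once p >= 3.
x = sp.symbols('x', positive=True)
f = x**sp.Rational(-1, 3)
for p in (1, 2, 3, 4):
    print(p, sp.integrate(f**p, (x, 0, 1)))   # expect 3/2, 3, then divergence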

Inner Product Spaces

Of special usefulness in functional approximation are specially chosen norms which further define an inner product; these correspond to a special subcategory of function spaces called inner product spaces or Hilbert spaces, after the famous German mathematician David Hilbert, who helped pioneer the development of their theoretical foundations in the first decade of the 20th century. Note that while every inner product has an associated norm, the converse does not generally hold; the above Lp spaces, for instance, only give rise to an associated inner product for the special case p = 2.

An inner product on a function space works very much the same way as the more-familiar (and geometrically more easily visualizable) kind on finite-dimensional vectors: It allows us to measure redundancy between any 2 objects in the space via projection of one object onto another. In essence the inner product of a pair of functions ƒ and g measures the "overlap" between the functions in a precise way. For function spaces the inner product does not have a nice geometrical interpretation in terms of "angle between vectors" as it does for finite-dimensional spaces, but the associated concepts of linear independence and orthogonality still capture the same essential information-theoretic ideas: Linear independence of two functions means they are less than 100% redundant, and orthogonality means there is no redundancy whatsoever between the functions. Formally, an inner product must satisfy the following four criteria (where we restrict ourselves to real-valued functions):

[1] ⟨ƒ, g⟩ = ⟨g, ƒ⟩ (symmetry);

[2] ⟨αƒ + βg, h⟩ = α⟨ƒ, h⟩ + β⟨g, h⟩ for any scalars α, β (linearity);

[3] ⟨ƒ, ƒ⟩ ≥ 0 (nonnegativity);

[4] ⟨ƒ, ƒ⟩ = 0 only if ƒ is the zero function (definiteness).

The only modification of the above needed to accommodate complex-valued functions is that criterion [1] must be generalized to conjugate-symmetry, which means replacing one side of the equality with its complex conjugate. (This is needed for the associated norm to satisfy the requirements for such). Just as every norm naturally gives rise to a metric on the space (but not vice versa), every inner product has an associated norm, defined simply by ||ƒ||2 = ⟨ ƒ, ƒ ⟩, where once again the reverse is not generally true. We now make the following

Definition: A Hilbert space is a complete normed function space with an inner product.

(Note that this is not to be confused with a Dilbert Space, which is a completely dys-functional workspace with no applicable norms, run by a capriciously malevolent pointy-haired boss.)

This innocuous-looking definition encompasses some profound mathematics, and it is difficult to overstate its usefulness and importance, not just in mathematics but also in areas such as physics: It provides the mathematical underpinnings for quantum mechanics, to name just one notable use. We will be using it here in the sense that it provides a powerful framework for approximation of functions. Our motivation for this little mathematical "side trip" is to provide an efficient way of approximating a given target function over some domain of interest. Efficiency in this context translates to "capture the behavior of the function to a desired accuracy using a sequence of approximating functions which is as short as possible". We will do this by choosing a space (typically by specifying an inner product which has some desirable mathematical qualities) and defining a basis, which is a (necessarily infinite) set of functions which spans the space, that is, which permits any function in the space to be written as a weighted sum of the individual basis functions. The coefficients of the resulting expansion are found by projecting the target function onto the basis set using the same inner product which defines the space.

The Gram-Schmidt Orthogonalization

Since in actual computations we must necessarily truncate the series expansion resulting from the projection to a finite (perhaps quite small) number of terms, in order to achieve efficiency in the resulting sequential approximation we want each added basis function onto which we project to capture maximal information content. Intuitively, this means minimizing the redundancy among our basis functions. The inner product also provides the means for doing this, by allowing us to take any 2 functions and first measure, then remove, the redundant "overlap" between them. This process is referred to as orthogonalization, and proceeds just as in finite-dimensional vector spaces using the same famous recipe, the Gram-Schmidt algorithm, named after its inventors. Gram-Schmidt allows us to start with any linearly independent set of basis vectors and turn those into an orthogonal set. This makes the resulting operation of projection of a target function onto the basis much simpler, because we no longer need to worry about "information sharing" among the basis functions: If our basis functions were not orthogonal, each one would be capturing some piece of the target function unique to itself along with pieces shared with others of its brethren, and there would be no direct way to construct the series coefficients resulting from the projection. If we wish to approximate a target function ƒ using such a linearly independent but non-orthogonal set of basis functions {b0(x), b1(x), ... } by way of the expansion

ƒ(x)   =   ∑k=0..∞ αk⋅bk(x) ,

then for any choice of a finite subset of the basis, a so-called n-term truncation of the basis which consists of only keeping the first n terms of the above expansion, we must solve an n x n linear system of equations

Mα   =   y,

where α = {α0, α1, ... , αn-1 } is the vector of coefficients we need to solve for, the (j,k)th element of the multiplier matrix (meaning the entry in the (j)th row and (k)th column, where we as usual use zero-offset indexing and thus have both indices ranging from 0 to n − 1) is Mj,k = ⟨bj, bk⟩, and the elements of the right-hand-side vector are given by projecting the target function onto the elements of the basis: yj = ⟨ƒ, bj⟩, for j = 0, ... , n−1. With an orthogonal basis, on the other hand, the off-diagonal (j ≠ k) terms of the matrix M vanish, and thus each coefficient is simply the projection of the target function onto that and only that basis function, scaled by its squared norm: αj = ⟨ƒ, bj⟩/||bj||2. The hope is that using the Gram-Schmidt procedure, generating such an orthogonal basis can be done with appreciably less effort than solving the above general matrix equation.
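As a concrete illustration of the above linear system, the following sketch assembles M and y for the (non-orthogonal) monomial basis under the plain L2 inner product on [−1, 1] and solves for the coefficient vector α; the function names and the choice of target function ex are ours, purely for illustration:

import numpy as np
from scipy.integrate import quad

# Sketch: approximate a target function in the span of the NON-orthogonal
# monomial basis {1, x, ..., x^(n-1)} by assembling and solving the n x n
# system M.alpha = y described above.
n = 5
target = np.exp                      # function to approximate on [-1, 1]

def inner(u, v):                     # plain (unweighted) L2 inner product
    return quad(lambda x: u(x) * v(x), -1.0, 1.0)[0]

basis = [lambda x, j=j: x**j for j in range(n)]
M = np.array([[inner(bj, bk) for bk in basis] for bj in basis])
y = np.array([inner(target, bj) for bj in basis])
alpha = np.linalg.solve(M, y)        # coefficients of the n-term expansion
print(alpha)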

Given any two functions in the space ƒ and g, we can orthogonalize the pair with respect to each other by picking one (say ƒ) as our "reference" function and projecting the second onto it to measure the overlap − let us call the resulting numerical coefficient r, for "redundancy" − then subtracting rƒ from g. In other words we desire to compute r such that ⟨g − rƒ, ƒ⟩ = 0. Using the linearity and associativity properties of the inner product this can be rewritten as

0   =   ⟨g, ƒ⟩ − ⟨rƒ, ƒ⟩   =   ⟨ƒ, g⟩ − r⋅||ƒ||2 ,

from which we obtain the projection coefficient

r   =   ⟨g, ƒ⟩ / ||ƒ||2 ,

which allows us to construct an orthogonalized version of g:

φ(x)   =   g − rƒ   =   g − ( ⟨g, ƒ⟩ / ||ƒ||2 )⋅ƒ .

For any validly defined inner product, assume we have in hand a linearly independent set of basis functions {b0(x), b1(x), ... }. The Gram-Schmidt recipe proceeds inductively, by assuming we have orthogonalized the first (k) elements of our basis, yielding the result {φ0(x), φ1(x), ... , φk-1(x), bk(x), bk+1(x), ... }. The first not-yet-processed term bk(x) is orthogonalized by projecting it onto each of the already-orthogonalized terms φ0(x), φ1(x), ... , φk-1(x) in turn, and subtracting the overlap with each:

φk(x)   =   bk − ∑j=0..k−1 ( ⟨bk, φj⟩ / ||φj||2 )⋅φj .
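The recipe just described translates almost line-for-line into a short symbolic sketch (sympy; the function name gram_schmidt and its arguments are our own illustrative choices). With the default weight ω(x) = 1 on [−1, 1] it reproduces, up to normalization, the orthogonal polynomials listed under Orthogonal Polynomials below:

import sympy as sp

x = sp.symbols('x')

def gram_schmidt(nterms, weight=1, a=-1, b=1):
    # Orthogonalize the monomials 1, x, x^2, ... under the weighted L2
    # inner product <u, v> = integral of u*v*weight over [a, b].
    inner = lambda u, v: sp.integrate(u * v * weight, (x, a, b))
    phi = []
    for k in range(nterms):
        bk = x**k
        for pj in phi:
            bk = bk - (inner(bk, pj) / inner(pj, pj)) * pj   # remove overlap with phi_j
        phi.append(sp.expand(bk))
    return phi

print(gram_schmidt(4))   # [1, x, x**2 - 1/3, x**3 - 3*x/5]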

So, given a norm and an associated inner product, we can achieve maximal approximational efficiency by expressing the function in question (which in our examples below will be given, but more generally may itself be the sought-after solution to some algebraic or differential equation being studied) as a coefficient-weighted sum of an infinite series of orthogonal basis functions {φ0(x), φ1(x), φ2(x), ...}:

ƒ(x)   =   ∑j=0..∞ cj⋅φj(x) , where cj   =   ⟨ƒ, φj⟩ / ||φj||2 , which reduces to simply cj = ⟨ƒ, φj⟩ when the φj are scaled to unit norm (as we shall assume in the general formulae below).

Orthogonal Polynomials

In practice one takes an initial basis either of a priori orthogonal functions (such as the Fourier series basis listed below) or a set of non-orthogonal but algebraically simple − meaning simple to pairwise-integrate and thus orthogonalize − functions such as the one consisting of the monomials {1, x, x2, ... }. This latter approach yields a basis of orthogonal polynomials, and following that link reveals a veritable bestiary of such bases, several of which we shall examine in more detail shortly. The zero-order basis function has nothing against which it needs to be orthogonalized, hence is nearly always taken simply as φ0(x) = b0(x) = 1. A common (because useful) family of inner products consists of the weighted L2 inner products:

L2[a,b; ω]:             ⟨ƒ, g⟩   =   ∫x∈[a,b] ƒ⋅g⋅ω(x) dx .

Here ω(x) is a weight function which must be of suitable form to define a proper norm (nonnegative, zero-valued at most on a set of measure zero, itself at least L1-integrable, etc) and the integral is over the interval of definition of the basis set, which must coincide with that over which ƒ(x) is defined, or must be made to do so by way of a suitable coordinate transformation. (We shall illustrate several examples of such coordinate mappings applied to the particular basis of Chebyshev polynomials below.) In terms of this weighted L2 inner product, our basis functions will be constructed to satisfy the following orthogonality property:

⟨φk, φk⟩ = ||φk||2 > 0,         ⟨φj, φk⟩   =   ∫x∈[a,b] φj(x)⋅φk(x)⋅ω(x) dx   =   0    if    j ≠ k .

Different x-intervals and weight functions yield various sets of orthogonal functions, and note that our basis need not be a polynomial one: for example the particular choice ω(x) = 1 and x ∈ [0, 2π] admits of the famous Fourier Series orthogonal basis, which is the best-known example of an a priori orthogonal basis:

{φ(x)} = { cos(kx), sin(kx) }    for    k = 0, ... , ∞.

Such "orthogonal by construction" bases are handy when available, but note that they do possess drawbacks: For the above, the computational expense of evaluating sines and cosines is just one example. On the other hand taking ω(x) = 1 and x ∈ [−1, 1] and applying Gram-Schmidt on a simple starting basis of the monomials yields the following orthogonal-polynomial basis:

φ0 = 1, φ1 = x, φ2 = x2 − 1/3, φ3 = x⋅(x2 − 3/5), ...

When performing Gram-Schmidt one usually normalizes the resulting orthogonal basis in some consistent way, thus producing an orthonormal basis. For instance one might scale each orthogonalized resultant basis function by a multiplicative constant such that ||φk||2 = 1 for all k. The precise normalized value need not be unity, nor need it be the same for each value of the index k; it usually makes better sense to choose some value which is "natural" for the basis in question. For the orthogonal polynomials of various kinds, one good way of judging "naturalness" is whether the normalization yields a recurrence relation for the basis in question (more on that will follow soon) which has a convenient form.

Our present family of polynomials is typically normalized such that ||φk||2 = 2/(2k+1). The lowest-order basis functions φ0 and φ1 already satisfy this normalization requirement; for φ2 we compute ||φ2||2 = ∫x∈[−1,+1][x4 − 2x2/3 + 1/9]dx = (2/5 − 2/9) = 8/45 compared to the desired normalized value of 2/5, so φ2 must be scaled by multiplying it by 3/2, and so forth. The thus-normalized basis functions are given the letter P in place of the φ, and are known as the Legendre polynomials, the first few of which are plotted in the figure below. (The zero-order one, P0 = 1, coincides with the top frame of the figure bounding box).

The first few Legendre Polynomials

Most of the better-known orthogonal-polynomial bases have closed-form shortcuts of various kinds for generating the basis elements up to any desired order, but all of them share a key property, namely that they are obtainable via a three-term recurrence formula, allowing the (k+1)th basis function to be obtained by combining the (k)th and (k-1)th according to some algebraic formula, without explicitly referencing any terms of lower order. This is both conceptually convenient and computationally efficient. For the Legendre polynomials this recurrence is known as Bonnet’s recursion formula and takes the form

(k+1)⋅Pk+1 = (2k+1)⋅x⋅Pk − k⋅Pk−1   for k ≥ 1,   with P0 = 1 and P1 = x. [Legendre]
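As a quick check, the following sketch (again sympy, with illustrative names) generates the first few Pk directly from Bonnet's recursion and verifies the normalization ||Pk||2 = 2/(2k+1) quoted above:

import sympy as sp

x = sp.symbols('x')
P = [sp.Integer(1), x]               # P_0 and P_1 seed the recursion
for k in range(1, 5):
    P.append(sp.expand(((2*k + 1)*x*P[k] - k*P[k-1]) / (k + 1)))

print(P)                                              # 1, x, (3x^2 - 1)/2, ...
print([sp.integrate(p**2, (x, -1, 1)) for p in P])    # 2, 2/3, 2/5, 2/7, ...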

There are two other families of such orthogonal polynomials which are of interest in our present context because they are defined on other kinds of x-intervals. The first are the Laguerre polynomials (or sometimes in "Franco-Russian co-production" form as the Laguerre-Sonin polynomials), defined via the L2 inner product on the semi-infinite interval x ∈ [0, ∞) with weight function ω(x) = exp(−x). The Laguerre polynomials are given in closed form by Rodrigues' formula − which however requires a (k)th-order derivative to be taken − and more practically, via a three-term recurrence:

Lk(x)   =   (ex / k!) ⋅ dk/dxk ( e−x⋅xk ) ;              Lk+1(x)   =   [ (2k + 1 − x)⋅Lk(x) − k⋅Lk−1(x) ] / (k + 1)   for k ≥ 1,   with L0 = 1 and L1 = 1 − x. [Laguerre]

The first few Laguerre polynomials are plotted in the figure below. Note that even though the associated norm is formally defined only over x ∈ [0, ∞), the resulting polynomials are defined for negative x-values, as well; they simply are not L2 integrable over the doubly-infinite interval, since the weight function diverges as x → -∞:

The first few Laguerre Polynomials

The second family of interest is the Hermite polynomials (pronounced "Air-MEET", for the benefit of our non-Francophone readership), defined on the doubly-infinite interval x ∈ (−∞, ∞) with weight function ω(x) = exp(−x2). Similarly to the Laguerre polynomials, the Hermite polynomial of any desired order can be obtained either by taking a (k)th-order derivative or by iterating a three-term recurrence:

Hk(x)   =   (−1)k⋅ex2 ⋅ dk/dxk ( e−x2 ) ;              Hk+1(x) = 2x⋅Hk(x) − 2k⋅Hk−1(x)   for k ≥ 1,   with H0 = 1 and H1 = 2x. [Hermite]

Before moving on to the particular basis we shall be using for the function-approximation exercise which will conclude this section, there are two important things which must be clarified. Firstly, we have blithely thrown around the term "basis" for our constructions above, but have not in fact proved that the function sets in question truly are bases, in the sense that the approximation

∑k=0..n ⟨ƒ, φk⟩⋅φk

really does converge to ƒ(x) as n → ∞, for any ƒ in the space. Function sets which satisfy that property are referred to as complete sets, not to be confused with the concept of completeness as applied to the space itself. The main thing to note in this regard is that every sequence we refer to as a basis is by definition complete, though the formal proof is often nontrivial. For an example of an infinite function set which is not a basis, consider retaining only the sine terms of the above Fourier Series basis. We see immediately that there will be a problem approximating any function ƒ which is nonzero at one or both endpoints of the interval [0, 2π] using such a Fourier sine series, since all of the functions in the alleged basis set are identically zero at the endpoints of the interval. Thus, this set of functions is not a basis. (The rigorous proof of this is a bit more involved, but we content ourselves here with the obviously glaring endpoint-value problem.)


Question: Does the Fourier cosine series suffer from a similar deficiency?


Secondly, when we say that a basis for a Hilbert space satisfies the infinite-summation equality

ƒ(x)   =   ∑k=0..∞ ⟨ƒ, φk⟩⋅φk ,

for every ƒ in the space, it is important to note that the mathematical theory is very specific about the sense in which the concept of equality as applied to a convergent function sequence is meant, and it is not the usual "equals really means equals everywhere" sense. The equals sign in the foregoing expression in fact is defined in terms of the "distance between functions" concept we used to open this section, which when specialized to a Hilbert space means in the sense of the applicable norm. And in fact, given any two functions ƒ and g the norm ||ƒ − g|| may vanish without the two functions agreeing at every point at which they are defined. Anyone who has ever done the classic homework exercise of approximating a square wave or sawtooth function using Fourier series is familiar with the Gibbs phenomenon, those high-frequency "wiggles" in the series approximation which appear around discontinuities in the target function or its derivatives. Distressingly, those wiggles do not "go away" as one takes ever-more basis functions in one's approximation, although their amplitude is bounded; what does happen is that the overshoot oscillations become ever-more localized around the discontinuities, hence their contribution to the norm of the difference between ƒ and its series approximation does tend to zero, that is, the above equality really means

limn→∞ || ƒ(x) − ∑k=0..n−1 ⟨ƒ, φk⟩⋅φk ||   =   0 .

(The analogous effect when using polynomial approximation is known as Runge's phenomenon, and there the problem can be even worse, in the sense that the maximum amplitude of the overshoots may be unbounded as n → ∞.) In other words, we do not guarantee pointwise − much less uniform − convergence of a target function and its infinite-term expansion-approximation, but rather equality "almost everywhere" in the limit, which is formally defined as being everywhere except possibly on a set of zero measure. We realize this may seem rather a bit of mathematical flimflammery, but it is a well-defined (and well-characterized) bit of mathematical flimflammery. And for the present discussion we are only interested in approximating smooth functions, for which there is no issue of the Gibbs phenomenon, i.e. convergence will be both in the sense of the applicable function-space norm and in the uniform "at every point" sense.

Lastly, while we have mentioned that in numerical practice one deals strictly with n-term finite truncations of one's chosen basis, we have so far sidestepped the question, "which n terms should we retain?" In general one simply retains the n lowest-order terms and adds higher-order ones incrementally if one needs higher resolution. (To gauge the latter one needs some kind of convergence measure; typical ones include computing the incremental change in the lower-order coefficients resulting from retaining more terms, and also the absolute magnitude of the coefficients of the highest-order terms). Most discussions of this topic take it as given that one should proceed in this manner, without providing any justification. The reason is that all of our bases (both of the Fourier and polynomial variety) have been constructed to have "low frequency" basis functions (as characterized by their number of roots in the computational interval) at low index values, with increasing index corresponding to adding higher-frequency elements to the approximation. Proceeding from low to higher frequency (and hence higher physical resolution) is generally effective because an experienced numerical analyst will always try to match the characteristic length scale (or equivalently, characteristic frequency spectrum) of the problem being studied to the chosen basis and numerical solution domain.

There are many cases where this simple strategy is however sub-optimal, sometimes extremely so. For example, if one is dealing with a target function with known even or odd parity, the coefficients of the basis elements of the opposite parity will be zero, so there is no need to include them in one's computation. As another example, the author once encountered a problem where the solution exhibited behavior asymptotically described by a simple power law xp (p a nonnegative integer) in the neighborhood of x = 0. The problem was, when the parameter p was large, approximation via orthogonal polynomials became prohibitively expensive because one needed O(p) terms in the expansion to accurately capture this power-law behavior. Some numerical trial-and-error and plotting of solutions for moderate values of p revealed that as p increased, the extrema of the solutions became ever-more localized in the interior of the computational domain, whereas the chosen basis was one optimized for high resolution near x = 0. This type of issue with highly localized behavior, especially when one cannot predict in advance where the action will occur, is a known limitation of Fourier and orthogonal-polynomial bases, and a variety of alternative techniques have been developed to cope with it, such as wavelet bases, which are built up of elements which are themselves localized. These are however outside the scope of the current discussion.

So, in summary, given a target function and some orthonormal basis with which one wishes to approximate it, one truncates the partial sum to retain only the n lowest-order terms

ƒn(x)   =   ∑k=0..n−1 ck⋅φk(x) , [fExpand]

whose coefficients c0, ... , cn−1 are then solved for numerically, the idea being that with a properly chosen basis the resulting n-term finite approximations will not only converge, ||ƒ − ƒn|| → 0 as n → ∞, but will do so rapidly. Even within the context of Fourier and orthogonal-polynomial bases, the question still remains, "are some bases better than others?" The answer to this is a qualified "yes", in the sense that it depends on both the target function and its interval of definition. Nonetheless, there are some bases which are "more equal than others" in terms of their numerical properties and variety of usages. We shall now introduce one of the most famous and numerically-useful ones.

The Chebyshev Polynomials

The weight function giving rise to the Chebyshev Polynomials

The Chebyshev polynomials are the orthogonal family which results from taking the same interval x ∈ [−1, 1] as for the Legendre polynomials but using the at-first-glance peculiar-looking weight function ω(x) = (1 − x2)−½. This weight function (pictured at right) is shaped like a letter U, has a minimum ω = 1 at x = 0, the midpoint of the interval, and diverges to +∞ at both endpoints, but in an integrable fashion. (The resulting orthogonal basis is specifically known as the Chebyshev polynomials of the first kind, as there are two distinct families of polynomials bearing that famous mathematician's name, but as we shall deal with only the one set, we have no need of the extra qualifier in our remaining discussion). These are denoted Tk(x), where the "T" is taken from any of several alternative transliterations of the same Russian surname which begin with that letter, e.g. "Tchebychev".

Let us compute the first few of the Tk by applying Gram-Schmidt to our standard monomial basis. First we compute the (squared) norm of the zero-order basis function, T0 = b0 = 1:

||T0||2   =   ∫x∈[−1,+1] (1 − x2)−½ dx .

Using a simple trigonometric substitution, we let x = cos(θ), in terms of which the integral transforms into a trivial integration of dθ between 0 and π, whence ||T0||2 = π .

For the Chebyshev polynomials, the "natural" normalization turns out to be the following one:

||T0||2 = π ;   ||Tk||2 = π/2   for   k > 0.

In doing the term-by-term orthogonalization it proves convenient to first compute a pair of "basic integrals", of the weight function and of its reciprocal, over the interval. The first of these is the norm-squared integral whose value we just computed:

I0   :=   ∫x∈[−1,+1] ω(x) dx   =   ||T0||2   =   π ;             J0   :=   ∫x∈[−1,+1] ω−1(x) dx   =   ∫x∈[−1,+1] (1 − x2)½ dx .

Using the same trigonometric substitution as above, we let x = cos(θ), in terms of which the J-integral transforms into

J0   =   ∫θ∈[0,π] sin2(θ) dθ   =   ∫θ∈[0,π] ½⋅[1 − cos(2θ)] dθ   =   π/2   =   ½⋅I0 .

We next observe that since our weight function is of even parity, ω(−x) = ω(x), as is the interval of definition, multiplying the weight function by an odd power of x yields the integral of an odd function over an even interval, with result 0. Thus in terms of the orthogonalization, we only need to concern ourselves with two generalized families of the above integrals, in which an even power of x is included in the respective integrands:

I2k   =   ∫x∈[−1,+1] x2k⋅ω(x) dx ;               J2k   =   ∫x∈[−1,+1] x2k⋅ω−1(x) dx ,   k a positive integer.

The J2k-integral will appear naturally as a result of integration by parts of the next-higher-indexed I-integral, I2k+2. For example, taking k = 1 yields an expression which can easily be integrated by parts, using that d((1 − x2)½)/dx = −x⋅ω(x):

I2   =   ∫x∈[−1,+1] x2⋅(1 − x2)−½ dx   =   [ −x⋅(1 − x2)½ ]x=−1..+1   +   ∫x∈[−1,+1] (1 − x2)½ dx   =   0 + J0   =   ½⋅I0 .

We leave it as an exercise to the reader to show that in the general case,

I2k+2   =   (2k+1)⋅J2k ,

Again using trigonometric substitution and double-angle identities (inductively at higher index values) one can derive identities needed in the Gram-Schmidt orthogonalization such as

I2   =   J0   =   I0/2 ;         I4   =   3⋅J2   =   3⋅I0/8 ,   etc.

The usefulness of this is that the inner product of any two elements of our starting monomial basis can be written in terms of the I-integrals as

⟨bj, bk⟩   =   Ij+k   for   (j+k) even;   0 otherwise.

The first two elements of our "starter basis" need no orthogonalization, that is T0 = b0 = 1 and T1 = b1 = x. Let us use the above tabulated integrals to illustrate the orthonormalization of b2. We now include a to-be-determined normalization constant c in our procedure and write

c⋅T2   =   b2 − ( ⟨b2, T0⟩ / ||T0||2 )⋅T0   =   x2 − I2/I0   =   x2 − 1/2 .

In order to obtain the constant c we compute the norm-squared of T2 with respect to the weight function ω(x), again using our tabulated integrals to ease the work:

||T2||2   =   ⟨T2, T2⟩   =   ∫x∈[−1,+1] ω(x)⋅(x2 − 1/2)2/c2 dx   =   (I4 − I2 + I0/4)/c2   =   (3/8 − 1/2 + 1/4)⋅I0/c2   =   I0/(8c2) .

Recall that our desired normalization was ||Tk||2 = π/2 = I0/2 for all k > 0, thus c = ½ and our orthonormalized second-order basis function is

T2   =   2x2 − 1 .

The next few Tk are

T3 = 4x3 − 3x ,    T4 = 8x4 − 8x2 + 1 ,    T5 = 16x5 − 20x3 + 5x , ... .

One can see that the coefficients of the various powers of x in each of the basis functions sum to unity. An immediate consequence of this is that all of the Chebyshev basis functions take a value of precisely unity at the right endpoint of the interval: Tk(1) = +1. At the left endpoint of the interval, Tk(−1) = (−1)k = ±1 depending on whether k is even or odd, respectively. This nice (anti)symmetry property is only a small hint at the many nice properties of this family of polynomials, the first few even and odd of which are plotted separately in the two figures below.

The first few even and odd Chebyshev Polynomials

Properties of the Chebyshev polynomials:

In practice, after computing the first several such terms, the observant algebraist will quickly spot a pattern, and realize that one can replace the somewhat-tedious algebra of term-by-term orthonormalization with a handy shortcut, namely that the Tk can be generated using the following simple three-term recurrence formula, which we offer here without proof:

Tk+1 = 2x⋅Tk − Tk−1   for k ≥ 1,   with T0 = 1 and T1 = x. [ChebRecur]
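A few lines of sympy (illustrative names only) suffice to generate the Tk from [ChebRecur] and confirm both the explicit forms listed above and the endpoint values Tk(±1):

import sympy as sp

x = sp.symbols('x')
T = [sp.Integer(1), x]               # T_0 and T_1 seed the recurrence
while len(T) < 6:
    T.append(sp.expand(2*x*T[-1] - T[-2]))

print(T)                             # 1, x, 2x^2 - 1, 4x^3 - 3x, ...
print([t.subs(x, 1) for t in T])     # all +1 at the right endpoint
print([t.subs(x, -1) for t in T])    # alternating +1, -1 at the left endpoint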

The Chebyshev polynomials are arguably the most-versatile of all the orthogonal-polynomial bases, for this and several additional reasons. Firstly, the "curious" weight function used to define the Tk has as a consequence that the roots (that is, the zeros) of the Tk become increasingly closely clustered near the endpoints of the interval as k → ∞. To see this, we take advantage of another useful property of the Tk, namely that they satisfy the delightful and useful trigonometric identity,

Tk(cos(θ)) = cos(k⋅θ) , [ChebCos]

where the Chebyshev interval x ∈ [−1,+1] formally maps to θ ∈ [−π, 0], but we are free to instead use the mapping x = −cos(θ) if we prefer to have both coordinates cover their respective intervals "in the same direction." It is the above identity which prompts Boyd [Boyd] to refer to the Chebyshev polynomials as "a Fourier cosine series in disguise". One can combine the above two formulae to obtain a useful trigonometric identity:

cos((k+1)⋅θ) = 2⋅cos(θ)⋅cos(k⋅θ) − cos((k−1)⋅θ) .

The identity [ChebCos] further allows one to write down a closed-form expression for the roots of the basis function of any order: The roots of Tn are simply the n points in the same interval where n⋅θ is an odd integer multiple of π/2:

n⋅θk = (2k+1)⋅π/2,    for    k = 0, ... , n−1 .

In terms of the untransformed x-coordinate the roots or "collocation points" as they are often called thus are given by

xk = cos[ (k + ½)⋅π/n ],    for    k = 0, ... , n−1 .[ChebZero]

Tn similarly has n+1 extrema, located at

xk = cos[ k⋅π/n ],    for    k = 0, ... , n .[ChebMax]

With our choice of normalization, Tn takes values of exactly ± 1 at the extrema, and this regularity property of the extremal amplitudes of the Tn is one of the keys to the extraordinary accuracy of functional approximation using Chebyshev series.

The "clustering around the endpoints" of the roots of the Tn may prove suboptimal if that is not where the action is, so to speak, but in many computational problems one − e.g. in so-called "boundary value problems" − one will have a well-defined subdomain, often set by the presence in a physical boundary in the problem being modeled, of one's problem of interest where one desires high spatial resolution, in which case one maps one of the endpoints of the Chebyshev domain there. In the context of our current application of functional approximation this will not prove to be the case, but as we shall see the resulting convergence is still sufficiently rapid that the increasingly tight cosine clustering of the roots of the higher-order Tn will not have an opportunity to become an issue.

Coordinate Mappings From Physical to "Chebyshev Space":

In practice the domain of interest (restricting ourselves to functions of a single real variable) might be a finite one [a, b], a semi-infinite one such as [0, ∞), or the doubly infinite interval (−∞, ∞). For functions whose natural domain of definition does not coincide with that of the Chebyshev basis, we can map x ∈ [−1, +1] to an arbitrary finite interval or to a semi- or doubly-infinite one via a suitable "coordinate stretching". For example if the domain of our function is the finite interval y ∈ [a,b] the simple uniform mapping from "Chebyshev space" to this interval is

y = ½⋅[(b − a)⋅x + (b + a)] .

(Note that in the above we have used x and y in the opposite sense in which they are used in the analogous derivation in Numerical Recipes [NumRec] − we reserve the variable x as corresponding to the Chebyshev interval at all times, whereas the aforementioned reference abruptly switches to x being the independent variable in the "physical" domain at this point of their discussion.)
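For later use, here is a minimal pair of helper functions implementing this mapping and its inverse (the names x_to_y and y_to_x are our own, chosen to preserve the x-is-Chebyshev convention just described):

def x_to_y(x, a, b):
    # Map the Chebyshev variable x in [-1, +1] to the physical interval [a, b].
    return 0.5 * ((b - a) * x + (b + a))

def y_to_x(y, a, b):
    # Inverse map, from y in [a, b] back to the Chebyshev interval.
    return (2.0 * y - (b + a)) / (b - a)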

For functions defined on a semi- or doubly-infinite interval, we have more to say in the next section.

Numerical Approximation via Chebyshev Polynomials:

Given some chosen orthogonal polynomial basis defined over a variable x and a function ƒ(x) which one wishes to approximate using some finite n-term truncation of said basis, the general approach for obtaining a near-maximally accurate approximation − in the sense that the uniform norm of the error, maxx |ƒ(x) − ∑k=0..n−1 ck⋅φk(x)|, is minimized over all possible choices of the coefficients c0, ... , cn−1 − is to require the approximation to be exact at some "optimally chosen" set of n discrete points in the domain of definition. (Since we have n coefficient values to choose, n is the maximum number of degrees of freedom in this respect). The question "Which n values of x are the best ones at which to sample the function being approximated?" is nontrivial but proves to have a surprisingly simple answer, namely that the optimal set of points at which to require the approximation to hold exactly is just the n zeros of the first neglected basis function, φn(x).

To give a rough sense of why this set of points is optimal, if one makes the simplifying (but not ridiculously so) assumption that all the error of the n-term approximation lies in the first neglected basis function Tn (i.e. that all coefficients beyond cn are identically zero), then in order to minimize the maximum error resulting from neglecting that term we would like to "spread the error out" as evenly as possible. Since the extrema of Tn all have the same amplitude, the easiest way to accomplish this uniformization of the error resulting from neglect of the higher-order term is to "pin things down" by requiring the error to be zero at the zeros of Tn, thus effectively limiting how large its amplitude can be in the subintervals between adjacent "pinning points", that is, adjacent zeros of Tn.

Note that the precise choice of collocation points is not critical in terms of accuracy: A common alternative choice occurs when one is dealing with a boundary value problem requiring B independent boundary conditions (in the form of some conditions on ƒ and perhaps one or more of its derivatives to be satisfied at x = a and x = b), in which case one usually makes use of an "extrema plus endpoints" mesh of collocation points. Since we have n basis functions whose coefficients are to be solved for and B of our n resulting "degrees of freedom" are taken up in satisfying boundary conditions, we have (n − B) degrees of freedom in the interior of our Chebyshev interval. This points toward a mesh based on the zeros of Tn−B, but for B > 0 (as is true by assumption) this would mean trying to satisfy some equations at the zeros of one of our retained basis functions, which is a recipe for very bad things to happen, in the form of numerical ill-conditioning. Instead we use that the (k)th-order Chebyshev basis function Tk has (k − 1) extrema in the interior of the interval (plus the two endpoint extrema), and set (n − B) = (k − 1), which yields the proper choice of interior collocation points, namely the (n − B) interval-interior extrema of Tn − B + 1. Any one of those may happen to lie fairly close to one or more zeros of the other retained basis functions, but not systematically close to all of the zeros of any one of them. We satisfy our function-being-approximated at those interior points, plus the B conditions to be satisfied at the two endpoints. This combination again leads to (n − B) + B = n equations in the n unknown Chebyshev coefficients, that is, a well-posed linear system.

For our current problem, where there are no explicit boundary conditions, the crucial reason for preferring an interior-points mesh consisting of the n zeros of Tn is that it allows us to take advantage of a discrete orthogonality condition we shall describe shortly, which leads to an extremely efficient method of functional approximation.

In general we would evaluate each of the n retained basis functions φ0(x), ... , φn−1 at each of the n zeros of the first neglected basis function φn (this so-called "spectral collocation" approach is not specific to a Chebyshev basis, so we use the generic-basis notation φ here) to set up an n x n linear system of equations

Ac = f,

where c = {c0, ... , cn-1 } is the vector of basis function coefficients we need to solve for, the (j,k)th element of the matrix A is the result of evaluating the (j)th basis function at the (k)th of the n zeros: Aj,k = φj(xk), and the elements of the right-hand-side column vector are simply the values of the target function at those same points: f = {ƒ(x0), ... , ƒ(xn−1)}. For the case of a boundary-value problem one would satisfy the above equation at (n − B) of the n collocation points, and the remaining B rows of the linear system would instead implement the boundary conditions. The effort of solving the linear system is O(n3) using a standard implementation of Gaussian elimination, which is quite reasonable on a modern PC for n up to around 1000, more than enough for most practical problems. This technique also generalizes well to cases where the function ƒ(x) itself is not known, but rather is the solution of some nontrivial equation (e.g. a differential equation), so one should always have a decent-quality linear system solver handy as part of one's basic numerical toolkit. But for the specific case of approximating a known function using the specific basis of the Chebyshev polynomials, even the relatively modest effort of solving a linear system is more than is in fact needed.

As is nicely described in Numerical Recipes, the regularity of the extrema of the Tn is key to their approximational accuracy, and one can further obtain an extremely efficient (in terms of computational cost) approximation algorithm by making use of the fact that in addition to the continuous (integral) orthogonality criterion by which the Tn are constructed, they also satisfy a discrete one: If xk (k = 0, ... , n−1) are the n zeros of Tn given by the formula [ChebZero], then any two of the lower-order basis functions satisfy the following discrete orthogonality relation:

∑k=0..n−1 Ti(xk)⋅Tj(xk)   =   { 0   if i ≠ j ;   n/2   if i = j ≠ 0 ;   n   if i = j = 0 } . [DiscOrtho]

For example, taking n = 2 and using that the zeros of T2 are located at x0,1 = ±1/√2, we use that T0 = 1 and T1 = x to verify that the above summation formula yields the expected results:

T0(x0)⋅T0(x0) + T0(x1)⋅T0(x1) = 1 + 1 = 2 = n;

T0(x0)⋅T1(x0) + T0(x1)⋅T1(x1) = −1/√2 + 1/√2 = 0;

T1(x0)⋅T1(x0) + T1(x1)⋅T1(x1) = ½ + ½ = 1 = n/2;

The importance of this discrete orthogonality relation is that, given the value of the function ƒ(x) we wish to approximate, evaluated at the zeros of the first-neglected basis function as given by [ChebZero], we can compute the coefficients of our n-term Chebyshev series in very efficient fashion, much more cheaply than even straightforward solution of the linear system entails. Each row of our matrix equation can be written as a discrete n-term summation:

∑j=0..n−1 Tj(xk)⋅cj   =   ƒ(xk),    for    k = 0, ... , n−1 .

Now consider multiplying each row by T0(xk) and summing all the resulting rows into a single equation.

∑k=0..n−1 ∑j=0..n−1 T0(xk)⋅Tj(xk)⋅cj   =   ∑k=0..n−1 ƒ(xk)⋅T0(xk) .

Interchanging the order of the nested left-hand-side summations and using [DiscOrtho] we see that the only term which survives the resulting cancellation is the j=0 one which causes c0 to be summed n times, yielding n⋅c0. Thus we obtain the following simple expression for the zeroth-order coefficient:

c0   =   (1/n) ∑k=0..n−1 ƒ(xk)⋅T0(xk) . [Csum0]

We can repeat the above multiply-each-row-and-sum procedure n−1 times using Tj(xk) with j = 1, ... , n−1 as the respective row-multipliers; for these higher-order Tjs, the only difference is that discrete-orthogonality cancellation yields n⋅cj/2 on the left-hand-side of the result. Thus the higher-order series coefficients are similarly the result of evaluating an n-term discrete sum and multiplying the result by 2/n, rather than the 1/n scaling factor unique to c0:

cj   =   (2/n) ∑k=0..n−1 ƒ(xk)⋅Tj(xk),    for    j = 1, ... , n−1 . [CsumJ]

Thus our problem reduces to one of evaluating n such n-term sums, which requires just O(n2) arithmetic operations, a full factor of n fewer than solution of the general linear system using the standard procedure of Gaussian elimination.
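A minimal numpy sketch of the coefficient sums [Csum0] and [CsumJ] follows; it uses the identity [ChebCos] to evaluate Tj(xk) = cos(j⋅θk) directly, so it is the straightforward O(n2)-cosine variant (the cheaper recurrence-based variant is described in the Exercise below). The function name cheb_coeffs is ours:

import numpy as np

def cheb_coeffs(f, n):
    # Coefficients of the n-term Chebyshev fit to f on [-1, +1], following
    # [Csum0]/[CsumJ]: c_0 carries the 1/n factor, the rest carry 2/n.
    k = np.arange(n)
    theta = (k + 0.5) * np.pi / n          # x_k = cos(theta_k), per [ChebZero]
    fx = f(np.cos(theta))                  # f at the n collocation points
    c = np.array([np.sum(fx * np.cos(j * theta)) for j in range(n)])   # T_j(x_k) = cos(j*theta_k)
    c *= 2.0 / n
    c[0] *= 0.5
    return c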

Given basic functions to perform the above Chebyshev-approximation computations, there are several additional utility formulae which prove very useful. The first of these takes the coefficients c0, ... , cn−1 of an n-term Chebyshev approximation to a function ƒ(x) and efficiently computes the corresponding coefficients d0, ... , dn−1 of the derivative ƒ′(x). The algorithm results from the identity 2⋅Tk = T′k+1/(k+1) − T′k−1/(k−1) (valid for k ≥ 2, with a simple modification at k = 1), which relates each basis function to the derivatives of its two neighbors and thus yields a recurrence connecting the dk to the ck. This allows the coefficients of ƒ′(x) to be computed by starting at the upper index n−1, computing the upper two "seed" coefficients separately, and the rest via a reverse-running loop. The following algorithm can be applied D times to an initial set of coefficients in order to generate the coefficients of the series-truncated approximation to the Dth derivative of the function in question:

dn−1 = 0; dn−2 = 2⋅(n−1)⋅cn−1;      [loop] dk = dk+2 + 2⋅(k+1)⋅ck+1       for   k = n−3, ... , 0,   with d0 requiring a final halving after the loop finishes . [ChebDeriv]
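A direct transcription of [ChebDeriv] into code might look as follows (a sketch for the standard interval [−1, +1]; the remark about scaling for a mapped interval is our addition, not part of the formula above):

import numpy as np

def cheb_deriv_coeffs(c):
    # Given n coefficients c of a Chebyshev fit on [-1, +1], return the n
    # coefficients d of its derivative via [ChebDeriv].  (For a fit mapped
    # to an interval [a, b], scale the result by 2/(b - a).)
    n = len(c)
    d = np.zeros(n)
    if n > 1:
        d[n-2] = 2.0 * (n - 1) * c[n-1]
    for k in range(n - 3, -1, -1):         # k = n-3, ..., 0, as in the text
        d[k] = d[k+2] + 2.0 * (k + 1) * c[k+1]
    d[0] *= 0.5                            # the final halving of d_0
    return d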

Once one has computed the coefficients of one's n-term Chebyshev approximation, it is also important to have an efficient method to evaluate the resulting approximation at any desired point in the interval of definition. One can use the three-term recurrence [ChebRecur] to generate the values of T0, ... , Tn−1 at any desired value of x while simultaneously accumulating the coefficient-weighted partial sums, but note that when dealing with such sequential computation of terms of recurrences there are nontrivial issues of numerical stability to consider, which can wreak havoc with one's numerics if not dealt with carefully. Numerical Recipes has an excellent summary of this topic; the upshot for the matter at hand is that while the Chebyshev recurrence tends to have good numerical properties in this regard, it is both more accurate and more efficient to use a method known as Clenshaw's recurrence formula for the evaluation of the approximation ƒn(x) from its expansion coefficients. This recasts the three-term recurrence one wishes to evaluate in terms of another recurrence which "runs in reverse". For the case of an n-term Chebyshev approximation with coefficients c0, ... , cn−1, the Clenshaw formula starts with a pair of null "seed values" en+1 = en = 0 and then runs through the Chebyshev coefficients like so:

ek = 2x⋅ek+1 − ek+2 + ck       for   k = n−1, ... , 1, in terms of which the n-term approximation is     ƒn(x) = x⋅e1 − e2 + c0 . [ChebEval]

(Note that the Numerical Recipes routine chebev instead ends with ½⋅c0; in that reference's convention all of the coefficients − including c0 − carry the 2/n scaling and ½⋅c0 is subtracted from the resulting sum, whereas with our [Csum0] convention the full c0 is added.)

The above is implemented in [NumRec] in the function chebev; note that since one only needs two contiguous e-coefficients at every step of the index-decrementing loop, one can simply define a pair of variables − call these (say) e1 and e2 − to store the values of ek+1 and ek+2 for the current value of the loop index k. We also need an added variable tmp for the purpose of temporary storage. Then on each loop iteration we copy the current value of e1 into tmp, perform the above computation of ek and store the result in e1, then copy tmp into e2. Upon exit from the loop we use the final values of e1 and e2 together with that of c0 to generate the desired result, ƒn(x).
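In code, the loop just described collapses to a few lines; the sketch below follows this page's convention ƒn(x) = ∑k ck⋅Tk(x), so the trailing term is the full c0 (the Numerical Recipes chebev routine differs for the reason noted above), and Python's tuple assignment plays the role of the explicit tmp variable:

def cheb_eval(c, x):
    # Clenshaw evaluation [ChebEval] of f_n(x) = sum_k c_k*T_k(x) on [-1, +1].
    e1 = e2 = 0.0
    for ck in reversed(c[1:]):                 # k = n-1, ..., 1
        e1, e2 = 2.0 * x * e1 - e2 + ck, e1    # tuple assignment replaces tmp
    return x * e1 - e2 + c[0]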

With the above formulae for efficiently obtaining the coefficients of the approximation ƒn(x) via evaluation of ƒ(x) at the roots of Tn(x) and for doing the reverse operation of quickly and stably evaluating ƒn(x) at any desired value of x ∈ [−1, +1], we are at last ready to proceed to the long-promised


Exercise: Generate a nearly-optimal general curve-fit to the reciprocal function ƒ(y) = 1/y in the interval y ∈ [1, 2] by using Chebyshev approximation (We again use y as the independent variable for the domain of definition of ƒ so as to not confuse it with the Chebyshev interval). Specifically, implement code in your favorite programming language to use the above summation formulae [Csum0] and [CsumJ] to compute the Chebyshev coefficients of some user-provided function on a general finite interval [a, b], using IEEE double-precision floating-point arithmetic. Note that Numerical Recipes has a routine chebft which implements the above coefficients-evaluation sums, but which uses O(n2) cosine evaluations and hence is computationally inefficient in the sense that the constant of proportionality associated with the quadratic asymptotic operation-count estimate is fairly large, by way of a double summation in which each execution of the innermost loop evaluates one series coefficient. The authors note that if runtime is a concern one can use fast cosine transform methods to evaluate the above sums in just O(n log n) operations. If one is willing to sacrifice a small amount of accuracy, a quicker way to speed things − which still needs O(n2) operations but reduces the proportionality constant − is to flip the order of the double summation in one's code and instead have the inner loop evaluate all n basis functions Tj at one specific value of xk using the three-term recurrence [ChebRecur], thus each pass through the inner loop increments all the coefficients by one more term of their respective sums, and the outer loop iterates over the xk values, that is the n roots of Tn(x). The final coefficient values are then scaled by the appropriate factor 1/n or 2/n upon completion of the double summation. This will generally be a tad less accurate than the Numerical Recipes approach, since the three-term-recurrence allows roundoff error to accumulate in contrast with the approach of computing each term afresh using [ChebCos], but of course the recurrence formula is extremely cheap to compute numerically. Having outlined several options, we leave the implementational details up to the reader.

Test your code on the target function ƒ(y) = 1/y, defined on the interval y ∈ [1,2], using values of n ranging from 1 to 20. A very simple test is to validate that the result for n = 1 is c0 = 2/3, within machine epsilon; another is to check that as n increases the zero-order coefficient converges to 1/√2. Then answer the following questions:
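One possible way to run these checks, reusing the cheb_coeffs, cheb_eval, x_to_y and y_to_x sketches given earlier (all of them illustrative, not library routines):

import numpy as np

a, b = 1.0, 2.0
g = lambda x: 1.0 / x_to_y(x, a, b)          # f(y) = 1/y pulled back to x in [-1, +1]

print(cheb_coeffs(g, 1)[0])                  # n = 1: should be exactly 2/3
print(cheb_coeffs(g, 20)[0], 1/np.sqrt(2))   # zero-order coefficient -> 1/sqrt(2)

c = cheb_coeffs(g, 20)
yy = np.linspace(a, b, 7)
err = [abs(cheb_eval(c, y_to_x(y, a, b)) - 1.0/y) for y in yy]
print(max(err))                              # maximum error of the 20-term fit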


Newton-Raphson For Numerical Solution of Nonlinear Differential Equations:

Our final section ties together several key ideas we have already developed in order to show how the Newton-Raphson method can be used to develop iterative techniques not only for solving for the roots of a nonlinear target function, but also for finding functions which themselves are solutions of some nonlinear differential equation of interest, in other words in settings where the "target function" is a differential equation, and the "root" is a function satisfying the equation and any applicable boundary conditions. The idea of linearization about an approximation to the true solution applies both to ordinary (i.e. in a single independent variable) and partial differential equations (those in 2 or more variables), but because the latter subject involves far more issues than the linearization technique alone, along with multiple degrees of freedom by way of the independent variables, we shall only consider ordinary differential equations, for which application of Newton's method is much more reliably useful.

We shall illustrate the key ideas on a well-studied nonlinear ordinary differential equation from the author's major field of college study, fluid mechanics. The equation is the famous one governing the fluid velocity distribution in the neighborhood of an idealized flat surface, which can be imagined to be modeled as a perfectly smooth, infinitely thin flat solid plate inserted into a uniform, time-invariant constant 2-dimensional (i.e. we assume no-z-direction variation, and thus deal only with the x,y-plane) velocity stream moving purely in the x-direction. The plate's leading edge can be taken to be at the origin (x = y = 0), and the plate is assumed to coincide with the positive part of the x-axis, i.e. to extend infinitely in the positive direction. If we assume the plate "magically appears" at time t = 0 and is somehow held perfectly fixed for all later times, the immediate effect is that the fluid is brought to rest where it contacts the plate due to viscous friction exerted by the solid surface, the so-called no-slip condition. This will transiently cause some complex flow adjustments, but once the flow has had time to "settle down" it will again become time-invariant or steady. Extensive laboratory measurements have shown that once this steady-state condition is achieved, while the flow in the immediate neighborhood of the leading-edge point is structurally rather complex, sufficiently far downstream of that point, the flow profiles all look remarkably similar, when properly scaled to reflect the gradual increase in boundary-layer thickness with increasing downstream distance. This idea of self-similarity together with an order-of-magnitude analysis of the various terms in the governing equations allow one to simplify said equations, the Navier-Stokes equations of continuum fluid mechanics. These are in their full glory a system of time-dependent coupled nonlinear partial differential equations whose general solution is considered so daunting that proving even some of their most-basic aspects in the general sense, such as "under what circumstances do solutions of the equations exist?" is a feat which will net the first person to accomplish it a million-dollar prize.

Fortunately, fluid-mechanicists are a not-easily-daunted lot, and there has been tremendous progress made, especially in the area of approximate solutions and exact solutions under restricted assumptions. The flat-plate boundary layer is a famous example of the latter. Under the above assumptions, together with the scaling properties which result from the assumption that viscous effects are predominantly confined to a thin layer near the solid surface, the Navier-Stokes equations can be reduced to a single ordinary differential equation written in terms of the variation in the transverse variable y (distance from the plate) of a stream function ƒ(y), whose y-derivative is the streamwise flow velocity:

ƒ′′′ + ƒ⋅ƒ′′ = 0 ,  subject to   ƒ(0) = ƒ′(0) = 0 and ƒ′ → 1 as y → ∞. [Blasius]

This is the famous Blasius boundary layer equation. Its deriver Paul Blasius (pronounced "BLAH-zee-oos") was a student of legendary early-twentieth-century fluid mechanicist Ludwig Prandtl (pronounced "Prandtl"). Despite the relatively simple appearance of the equation and the massive simplification of the Navier-Stokes equations it embodies, it nonetheless still contains a nonlinear ƒ⋅ƒ′′ term and admits of no known closed-form solutions, not even of the "indirect kind" such as the elliptic integrals used to write solutions of another famous ordinary differential equation, that governing the motion of an idealized pendulum in 2 space dimensions, or the Green's function integral techniques used to "solve" for many other kinds of equations. (We put quotes around the word solve because we do not mean "produce a solution function" in the ordinary sense, but rather "express the solution of the differential equation in the form of an integral which must be evaluated", most often numerically. If we could evaluate it analytically, there would be our solution function in direct form). Much effort by many clever minds has gone into various approaches for approximating the solution of the Blasius equation; in the early days this typically involved expansion of the solution in some kind of low-order polynomial near y = 0, but since all such polynomials necessarily diverge as y → ∞ such approximations are at best useful only in the immediate neighborhood of the boundary layer. (But since that is the region of greatest interest, that is often perfectly adequate for purposes of engineering analysis). Around mid-century the technique of matched asymptotic expansions came into its heyday; this involves decomposing a problem into multiple subregions in each of which different balances of terms predominate, and in each of which at least one term of the full equation is parametrizable by a small parameter ε. One then attempts to approximate the solutions of these various subproblems by way of an expansion in various powers of the small parameter, and then "splicing together" the results in the transition zones between adjacent subregions. In the present case one would decompose the physical y-domain into an inner region including the boundary layer where viscous terms predominate, and an outer region where viscous effects quickly become negligible with increasing y and in which the solution asymptotes to its behavior at infinity.

Around the same time that matched asymptotic expansion techniques were becoming all the rage, the field of numerical analysis was also rapidly coming into its own. The earliest generally successful numerical approaches for dealing with equations like the Blasius one involved numerical discretization via finite differences, in which one approximates the domain of interest (in this case the positive y-axis) by a grid of discrete points. In our case the simplest version of that would mean defining a set of mesh points starting at the inner boundary y = 0 and occurring at integer multiples of a constant mesh spacing Δy:

yj = j⋅Δy ,    for   j = 0, 1, 2, ...

and then approximating the derivatives appearing in the equation by finite differences, e.g. via formulae such as the following. The accuracy of such approximations can be quantified by plugging them into the Taylor series expansion of ƒ about the point of interest − we list the finite-difference truncation error for each derivative formula by way of an "order-of" error term:

ƒ′(yj) = [ƒ(yj+1) − ƒ(yj)] / Δy + O(Δy)   (forward difference);     ƒ′(yj) = [ƒ(yj) − ƒ(yj−1)] / Δy + O(Δy)   (backward difference);     ƒ′(yj) = [ƒ(yj+1) − ƒ(yj−1)] / (2⋅Δy) + O(Δy2)   (central difference).

The second and third derivatives might be approximated using formulae such as the following, whose accuracy we leave to the reader to determine:

ƒ′′(yj) ≈ [ƒ(yj+1) − 2⋅ƒ(yj) + ƒ(yj−1)] / Δy2 ,          ƒ′′′(yj) ≈ [ƒ(yj+2) − 2⋅ƒ(yj+1) + 2⋅ƒ(yj−1) − ƒ(yj−2)] / (2⋅Δy3) .

The accuracy of such finite-difference approximations improves as the mesh spacing shrinks, but this of course means more computational work, so in practice one must balance the two. Since our example equation is only a boundary-value problem in the loose sense of having an "outer boundary at infinity", one cannot simply set up a discretized form of the Blasius equation on a finite domain and solve it via numerical algebra. The typical approach to dealing with this issue was to employ a "shooting" method, a form of numerical integration which starts at the inner boundary and uses the finite-difference formulae for the derivatives to compute ƒ at the next-farther point out, continuing until the numerical solution exhibits far-field asymptotic behavior, or something which can at least be compared to it. Since shooting has no knowledge of the far-field when it starts off it invariably involves guessing at the value of some inner-boundary parameter whose true value is not known a priori. In this case we know that ƒ(0) = ƒ′(0) = 0 from the boundary conditions, but since we are dealing with a third-order equation we need three independent pieces of information to specify the solution there. The third of these, the value of ƒ′′(0), the shear stress parameter (because the shear stress resulting from frictional "pull" of the fluid flow on the plate is proportional to the y-derivative of the streamwise velocity at the wall) is the a priori unknown one and must be guessed at. The resulting behavior of the numerically integrated approximate solution once one is "sufficiently far" outside the boundary layer will generally not satisfy the desired far-field behavior ƒ′ → 1 as y → ∞, but one can then adjust one's guess for ƒ′′(0) and try again in iterative fashion until the required far-field behavior is achieved within desired accuracy, much like sighting in a rifle on a target range.
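For concreteness, here is a minimal shooting sketch for [Blasius] using scipy's general-purpose integrator and root bracketer; the outer-boundary location ymax = 10 and the initial bracket for ƒ′′(0) are our own illustrative choices:

from scipy.integrate import solve_ivp
from scipy.optimize import brentq

def blasius_rhs(y, u):
    # State u = (f, f', f''); the Blasius equation gives f''' = -f*f''.
    return [u[1], u[2], -u[0] * u[2]]

def shoot(s, y_max=10.0):
    # Integrate out from the wall with guessed wall shear f''(0) = s and
    # return the mismatch in the far-field condition f'(y_max) -> 1.
    sol = solve_ivp(blasius_rhs, (0.0, y_max), [0.0, 0.0, s],
                    rtol=1e-10, atol=1e-12)
    return sol.y[1, -1] - 1.0

s_star = brentq(shoot, 0.1, 1.0)   # adjust the guess until f'(y_max) = 1
print(s_star)                      # roughly 0.47 for this normalization of [Blasius]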

It was not until the 1980s and 1990s that more-powerful and less-trial-and-error-necessitating numerical methods were devised for dealing with nonlinear equations on semi-infinite and doubly-infinite domains. We shall illustrate these using our by-now old friend, the spectral-numerical Chebyshev polynomial expansion.

We begin with the linearization step, which is essentially the Newton-Raphson method applied to our unknown solution function. (The precise mathematical statement of this in the function-space setting is the Kantorovich convergence theorem of the 1940s; in honor of that famous result, such iterative approaches in function-space settings are often referred to as the Newton−Kantorovich method.) Assume we have some guess as to what the solution function looks like; in the manner of iterative methods, let us say that at the current (k)th iteration − we here use k rather than n, because the latter variable will be used to parametrize our truncated polynomial expansion − some function ƒk(y) is our approximation to the true solution ƒ(y). We assume ƒk(y) is "close" to ƒ(y) in the sense that ||ƒ − ƒk|| << ||ƒ|| for some suitable functional norm − the uniform norm should serve as well as any in this regard. We next define the difference function as the point-by-point discrepancy between the current approximation and the exact solution:

δƒ(y) := ƒ(y) − ƒk(y) . [fDiff]

It is this difference function δƒ for which we will solve. Of course, obtaining the exact difference function would require solving a form of our original nonlinear equation, but if our assumption that ||δƒ|| << ||ƒ|| is reasonable, then substituting into our original equation and exploiting the presumed smallness of δƒ to produce a linearized approximation should get us appreciably closer to our true solution, and in an accelerating fashion as δƒ gets smaller from one iteration to the next. Let us illustrate using the Blasius equation: substitution of [fDiff] into [Blasius] yields

k + δƒ)′′′ + (ƒk + δƒ)⋅(ƒk + δƒ)′′ = 0 .

Multiplying out the nonlinear term and neglecting the quadratic term in δƒ in the result yields a linear differential equation for δƒ, in which the Blasius equation evaluated at the current iterate serves as a right-hand-side "forcing term":

δƒ′′′ + δƒ⋅ƒ′′k + δƒ′′⋅ƒk = −(ƒ′′′k + ƒk⋅ƒ′′k) . [BlasiusLinear]

Owing to the neglect of the quadratic error term, solution of this equation will not yield a function satisfying the original equation; it will simply (hopefully) get us an improved estimate

ƒk+1 = ƒk + δƒ ,

with the error ||δƒ|| decreasing quadratically with each iteration once we are sufficiently close to our desired exact solution.

In practice one would start with some initial guess at the solution function ƒ0(y) and then expand δƒ(y) in a Chebyshev series using an extrema-plus-endpoints collocation mesh. This yields a linear system for the Chebyshev series coefficients, with the forcing term −(ƒ′′′0 + ƒ0⋅ƒ′′0) evaluated at each of the same collocation points appearing as a right-hand-side column vector. An immediate problem is how to handle the semi-infinite domain of the equation when using a basis set defined on a finite interval. Since the technique of function approximation using orthogonal polynomials is rooted in the mathematics of Hilbert spaces, we consider that issue by first contemplating how to handle a typical "well-behaved" function on a semi-infinite interval, meaning one which is L2 integrable. We shall then have to figure out how to apply the result to the solution of the Blasius equation, which diverges as y → ∞ and hence is not L2 integrable, but that proves not overly difficult.

For a function ƒ(y) defined on the semi-infinite interval y ∈ [0,∞), the requirement that ƒ have finite norm means that ƒ(y) must vanish at infinity, and moreover do so quickly enough to make the square-integral of ƒ converge. In other words, one expects that all the "interesting behavior" of ƒ(y) will be at finite y, and the only thing really mattering in the limit y → ∞ is the rapidity of decay of ƒ(y): e.g. algebraic decay y^(−α) for some constant α > 0, or exponential decay, or whatever. In order to efficiently capture such behavior using a finite-term Chebyshev series approximation it is desirable to get rid of the clustering of collocation points near the right endpoint of the interval x ∈ [−1,+1], which means choosing a suitable nonlinear mapping of this interval to y ∈ [0,∞). A typical such mapping is

y = A⋅(1 + x)/(1 − x) ,

which maps x = −1 to y = 0 and x = +1 to y = ∞.

Here A is a computational parameter (typically of order unity) we dub the "half-points location", because precisely half the collocation points lie to the left of y = A and the other half to the right. (If n is odd, the middle collocation point has x = 0 and thus lies directly atop y = A in the transformed coordinates.) The parameter A should be located right around the region where the presumably nicely decaying far-field behavior of the target function sets in; as long as that choice is "reasonable" based on the behavior of the target function, the precise value chosen for A does not drastically affect the convergence properties of the resulting approximation. In our numerical solutions discussed below, we note that the scalings giving rise to the Blasius equation assume a boundary layer of order-unity thickness in the scaled transverse coordinate y, so we simply take A = 1 and see where that gets us.
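
The following few lines (a sketch of ours, with n = 16 and A = 1 chosen arbitrarily; here n counts mesh intervals, so there are n + 1 points) build the extrema-plus-endpoints mesh and apply the above map, confirming the "half-points" property numerically:

    import numpy as np

    n, A = 16, 1.0
    x = np.cos(np.pi*np.arange(n + 1)/n)       # Chebyshev-Lobatto points; x[0] = +1, x[n] = -1
    with np.errstate(divide='ignore'):
        y = A*(1.0 + x)/(1.0 - x)              # x = -1 -> y = 0,  x = +1 -> y = infinity
    print("points with y < A:", np.sum(y < A)) # 8
    print("points with y > A:", np.sum(y > A)) # 8 (one of them at y = infinity)
    print("mapped mesh:", np.round(y, 3))      # middle point (x = 0) sits exactly at y = A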

Before proceeding with the discretization and the numerics, there is one nontrivial aspect of the above semi-infinite-interval mapping function which must be dealt with, which the author encountered when he first tried to use the above approach on a similar nonlinear equation with an asymptotically nonvanishing solution function. Namely, this stretched-grid method will only converge if the dependent variables being solved for are "well-behaved" in the far field, in the specific sense that the solution function must remain bounded and its derivatives must tend to zero at infinity. (The latter implies that the solution function must in fact tend to a constant, which however need not itself be zero.) A little analysis shows that this is a direct consequence of the above nonlinear mapping between x and y-space being singular at x = 1, as a result of which the dependent variables must behave as y → ∞ in a manner which cancels out this singularity in the mapping function.

In the present case, we require the streamwise velocity to tend to the constant far-field velocity as y → ∞, which shows up in the 2-point boundary value problem defined in the above expression labeled [Blasius] as the outer boundary condition ƒ′ → 1. Thus we know that ƒ ∼ y + C in the far field, though the constant C is a priori unknown. This suggests defining g(y) = ƒ(y) − y as our modified solution function, which tends to the constant C at infinity and whose derivatives all vanish there. (We observe that L2-integrability is not in fact required.)
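
For concreteness (the substitution is elementary, but it is what the discretization below actually operates on), putting ƒ = g + y into [Blasius] and its boundary conditions, and noting that ƒ′ = g′ + 1 while ƒ′′ = g′′ and ƒ′′′ = g′′′, gives

g′′′ + (g + y)⋅g′′ = 0 ,    with   g(0) = 0 ,   g′(0) = −1 ,   and   g′ → 0 as y → ∞ ;

the two inner conditions here are precisely the ones we impose on the discretized linear system further below.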

Once one has chosen a suitable far-field rescaling in the form of a modified solution function, one must numerically evaluate the needed derivatives at the collocation points of one's discrete mesh. For a Chebyshev basis, one can make use of the three-term-recurrence to obtain the derivatives to any needed order: Differentiating [ChebRecur] term-by-term yields

T′k+1 = 2x⋅T′k + 2⋅Tk − T′k−1   for k ≥ 1,   with T′0 = 0 and T′1 = 1 ,

which could be implemented by first computing the needed Tk values using [ChebRecur], then using those to supply the 2⋅Tk term on the right-hand side of the above recurrence. This "bootstrapping" approach can be extended to any desired derivative order. Once one has the needed derivatives with respect to the x-coordinate of the Chebyshev interval [−1,+1], one must use one's coordinate transformation to map them to the domain of definition of one's target equation, via the identity which results from differentiating [fExpand] with respect to the independent variable y and applying the chain rule to write the result in terms of derivatives of the basis functions with respect to x:

g′(y) = (dx/dy) ⋅ Σk=0..n−1 ck⋅T′k(x) .
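
Here is a short sketch (ours; the helper name and the sample coefficients are invented for illustration) of the "bootstrapping" evaluation of Tk and T′k via the two recurrences, followed by the chain-rule factor for the above mapping, for which differentiating y = A⋅(1 + x)/(1 − x) gives dx/dy = (1 − x)²/(2⋅A):

    import numpy as np

    def cheb_T_and_dT(x, nmax):
        """Return T[k] = T_k(x) and dT[k] = T'_k(x) for k = 0..nmax at the points x."""
        x = np.asarray(x, dtype=float)
        T, dT = np.empty((nmax + 1,) + x.shape), np.empty((nmax + 1,) + x.shape)
        T[0],  T[1]  = 1.0, x                  # T0 = 1, T1 = x
        dT[0], dT[1] = 0.0, 1.0                # T'0 = 0, T'1 = 1
        for k in range(1, nmax):
            T[k+1]  = 2.0*x*T[k] - T[k-1]                  # [ChebRecur]
            dT[k+1] = 2.0*x*dT[k] + 2.0*T[k] - dT[k-1]     # its term-by-term derivative
        return T, dT

    # sanity check against closed forms T3 = 4x^3 - 3x, T'3 = 12x^2 - 3:
    x = np.linspace(-1.0, 1.0, 5)
    T, dT = cheb_T_and_dT(x, 3)
    print(np.allclose(T[3], 4*x**3 - 3*x), np.allclose(dT[3], 12*x**2 - 3))

    # chain rule: g'(y) = (dx/dy) * sum_k c_k T'_k(x), with dx/dy = (1-x)^2/(2A)
    A = 1.0
    c = np.array([0.5, -0.2, 0.1, 0.05])       # example coefficients c_0..c_3 (arbitrary)
    gprime_y = ((1.0 - x)**2/(2.0*A)) * (c @ dT)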

The last thing which is required is an initial guess, now in the form of a solution function. Published velocity profiles for the flat-plate boundary layer suggested to the author to try a hyperbolic tangent function as an initial guess for the streamwise velocity U(y), which translates into the modified solution function as g0′ = tanh(y) − 1. This must be integrated and differentiated, to yield g0 = log(cosh(y)) − y (we take the integration constant to be 0 here, which equates to assuming C = 0 in terms of the above far-field asymptotics), g0′′ = 1 − tanh²(y) and g0′′′ = −2⋅tanh(y)⋅g0′′(y). This is functionally well-behaved but numerically problematic for large y, since the positive-exponential component of the hyperbolic cosine overflows in floating-point arithmetic, poisoning the subsequent computations with infinities and "not a number" exceptions. Since for such large y we have cosh(y) ∼ exp(y)/2, the exact large-y behavior is g0 = log(cosh(y)) − y → −log 2 ≈ −0.69; as this is merely an initial guess, whose far-field constant offset the iteration adjusts in any event, for large y (we used y > 50 as the criterion) we simply replace the above formulae with g0 = g0′ = g0′′ = g0′′′ = 0.
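
An alternative to the hard large-y cutoff, sketched below (ours, not the author's implementation), is to evaluate log(cosh(y)) in overflow-free form as logaddexp(y, −y) − log 2, which never forms cosh(y) explicitly and reproduces the −log 2 far-field limit noted above automatically:

    import numpy as np

    def initial_guess(y):
        """Overflow-safe evaluation of g0 = log(cosh(y)) - y and its first three derivatives."""
        y = np.asarray(y, dtype=float)
        g0    = np.logaddexp(y, -y) - np.log(2.0) - y   # log(cosh(y)) - y, no overflow
        g0p   = np.tanh(y) - 1.0                        # g0'
        g0pp  = 1.0 - np.tanh(y)**2                     # g0''  (= sech^2(y))
        g0ppp = -2.0*np.tanh(y)*g0pp                    # g0'''
        return g0, g0p, g0pp, g0ppp

    y = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
    print(np.column_stack([y, *initial_guess(y)]))      # well-behaved even for y = 1000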

One final note regarding boundary conditions: while we are formally dealing with a 2-point boundary value problem, the "boundary condition at infinity" carries no real information, since proper solutions of the equation will naturally have velocity tending to the far-field constant value; there is no physical mechanism to cause the velocity to deviate from this as y → ∞, as there is at y = 0 due to the presence of the solid wall there (as conveyed by the no-slip inner boundary conditions). Thus, in our numerics we simply replaced the rows of our linear system corresponding to y = 0 and y = ∞ by the inner boundary conditions g(0) = 0 and g′(0) = −1 and left the intermediate solution values "freely dangling" in the far field. A few step-through trial iterations showed the entire far-field portion of the function adjusting more or less in lockstep on each iteration and quickly converging to the a priori unknown constant C. Our iterative trials always converged (in the sense that the maximum absolute element of the right-hand-side residuals vector dropped below 10⁻¹²) in 4-6 iterations.
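
Pulling the pieces together, here is a compact end-to-end sketch (ours, and deliberately simplified: it works with grid-point values and standard Chebyshev differentiation matrices rather than solving for the series coefficients directly as described above, an equivalent formulation but not the author's implementation; the choices N = 80 and 8 Newton-Kantorovich iterations are likewise ours). It assembles the mapped operators, linearizes the transformed g-equation written out earlier, replaces the y = 0 and y = ∞ rows by the two inner boundary conditions, and iterates:

    import numpy as np

    def cheb(N):
        """Chebyshev-Lobatto points and first-derivative matrix (Trefethen's cheb.m in numpy)."""
        x = np.cos(np.pi*np.arange(N + 1)/N)
        c = np.hstack([2.0, np.ones(N - 1), 2.0])*(-1.0)**np.arange(N + 1)
        X = np.tile(x, (N + 1, 1)).T
        dX = X - X.T
        D = np.outer(c, 1.0/c)/(dX + np.eye(N + 1))
        D -= np.diag(D.sum(axis=1))
        return D, x

    N, A = 80, 1.0
    D, x = cheb(N)                             # x[0] = +1 (y = inf), x[N] = -1 (y = 0)
    s = (1.0 - x)**2/(2.0*A)                   # dx/dy for the map y = A(1+x)/(1-x)
    Dy = np.diag(s) @ D                        # d/dy
    D2y = Dy @ Dy                              # d^2/dy^2 (by composition)
    D3y = Dy @ D2y                             # d^3/dy^3
    with np.errstate(divide='ignore'):
        y = A*(1.0 + x)/(1.0 - x)
    yf = np.where(np.isfinite(y), y, 0.0)      # y with the point at infinity masked out;
                                               # that row of the system is overwritten below anyway

    # initial guess g0 = log(cosh(y)) - y, overflow-free (the y = inf node simply gets 0 here)
    g = np.logaddexp(yf, -yf) - np.log(2.0) - yf

    for it in range(8):                        # Newton-Kantorovich iteration on g''' + (g+y)g'' = 0
        gpp = D2y @ g
        res = D3y @ g + (g + yf)*gpp                       # nonlinear residual
        J = D3y + np.diag(g + yf) @ D2y + np.diag(gpp)     # linearized (Jacobian) operator
        # replace the y = 0 and y = inf rows by the inner conditions g(0) = 0, g'(0) = -1:
        J[N, :] = 0.0; J[N, N] = 1.0;  res[N] = g[N]
        J[0, :] = Dy[N, :];            res[0] = (Dy @ g)[N] + 1.0
        dg = np.linalg.solve(J, -res)
        g += dg
        print("iter %d   max|dg| = %.3e" % (it + 1, np.max(np.abs(dg))))

    fpp0 = (D2y @ g)[N]                        # wall shear stress parameter f''(0) = g''(0)
    print("f''(0) ~= %.10f   (accepted value 0.33205733...)" % fpp0)
    print("far-field constant C ~= %.5f   (text finds -1.72078...)" % g[0])

The updates should shrink rapidly over the first several iterations, in line with the 4-6 iteration counts quoted above; the attainable accuracy of this nodal variant is limited by the conditioning of the composed third-derivative matrix and will sit somewhat below the coefficient-space results tabulated below.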

Convergence of wall shear stress parameter for Chebyshev-discretized Blasius equation. Significant digits underlined. All arithmetic performed in double precision.
   n    ƒ′′(0)
 ----   -----------------------
    8   0.6985300874574...
    9   0.3030667227405...
   10   0.3244042508420...
   15   0.3019109279829...
   20   0.3236058395200...
   30   0.3338257052582...
   40   0.3321166740668...
   60   0.3320536562997...
   80   0.3320573042616...
  100   0.3320573362103...
  150   0.3320573362150...
  200   0.3320573362152...
  ...   ...

The above combination of techniques yields approximate solutions of the Blasius equation with extremely rapid convergence and high accuracy. To convey an idea of the power of the combination of the Chebyshev basis and a well-chosen coordinate mapping: the aforelinked Wikipedia page cites the accepted value of the shear stress parameter obtained via approximate numerical solution of the Blasius equation as ƒ′′(0) ≈ 0.332; more-recent literature such as this paper cites high-accuracy computations yielding as many as 9 significant digits in this parameter, whose value (when multiplied by the fluid viscosity) gives the skin-friction drag per unit area due to the boundary-layer flow. (Regarding the linked manuscript, since that result is less accurate than we achieve here, the reader may wonder what all that mathematical sophistication accomplished. The problem is that the authors used an inferior coordinate mapping function, which no amount of basis-related optimization can make up for. As we shall show shortly, with a better choice of mapping function and an understanding of the interplay between far-field asymptotics and the numerics, there simply is no need to do anything more clever with the basis set than use the standard Chebyshev basis.)

Our results are summarized in the table above and the figures below. Starting at around n = 10 things converge very rapidly, though note that the closeness of the shear stress value obtained at n = 10 appears to be sheer happenstance. We obtain 11 figures of accuracy in this parameter for n = 100, and one more if we push the number of basis functions even higher. Double-precision roundoff error limits things here; we could obtain more significant digits by going to 128-bit quadruple-precision floating-point arithmetic, and doing so (and also experimenting with various values of the half-points parameter A) gives the shear stress constant converged to an impressive 33 decimal digits, 0.332057336215196298937180062010582..., confirming our double-precision 12-digit-converged result and showing that the least-significant listed double-precision digit (which alternated between 1 and 2 in multiple large-n runs not shown in the table, but nearly always rounded to 2 based on the next-lower digit) is correct in the rounded-to-nearest sense.

Note that the number of converged digits in the latter result (obtained with n = 500-600 Chebyshev coefficients) is slightly more than one could reasonably expect using the IEEE xfloat-standardized quad-precision floating-point format with its 113 bits (112 explicit, 1 hidden) of mantissa precision; this is because for these computations the author used his own custom "hand-rolled" qfloat data package, which emulates a 128-bit float using a pair of 64-bit integers, the upper one of which allocates the same sign bit and 11 binary exponent bits as the IEEE double type, the lower one providing 64 added bits of mantissa precision. This compares to the 128 total bits and 15 exponent bits of IEEE xfloat; thus we gain 4 more mantissa bits in exchange for having an argument-magnitude range no greater than that of the IEEE double type. The author has never found the need for the 15 exponent bits of xfloat in his own work, and indeed the 11-bit exponent restriction is not in the least problematic for the present computation, allowing us to enjoy the full benefits of the added precision.

The figure below at right displays the distribution of basis function coefficients (in log-absolute-value form) for the numerical results for various values of n. With the exception of occasional outliers, all the distributions follow the same trend, and for n = 100 we can see the higher-order coefficients getting close to double-precision-epsilon levels, though clear signs of the resulting expected "bottoming out" only manifest for n > 100. The figure at left displays the numerical solution profiles (g(y) and its first three derivatives) for n = 100. (We truncate the plotted domain at y = 7 in order to capture the region "where the action is".) The red plot of g(y) asymptotes to the a priori unknown integration constant C = −1.72078... in the far field; physically this constant represents the integrated velocity defect (or, when multiplied by the fluid mass density, the total momentum deficit) due to the presence of the boundary layer. The purple curve of g′, upon re-adding the subtracted constant 1, is the famous Blasius boundary-layer streamwise velocity profile. This exhibits nearly-linear increasing behavior near the wall, then transitions rapidly to the constant far-field "freestream" velocity for coordinate values ranging from roughly y = 3 to y = 6. Correspondingly the shear stress g′′ (green) has its maximum value 0.332... at the wall (y = 0) and is negligible (roughly 1000x smaller) by the time the right plot boundary is reached. The local minimum of the shear-stress gradient (g′′′, light blue) around y = 2.8 could serve to mark the outer edge of the "inner viscous region", although the similarly-arbitrary definition based on the location where the streamwise velocity reaches 99% of its freestream value (which occurs at roughly y ≈ 5) is generally used to measure the thickness of the boundary layer.

Solution profiles and Chebyshev coefficients for Blasius equation discretized using the coordinate mapping y = A⋅(1+x)/(1-x) with A = 1, for various values of n.

Lastly, note that in this case the unknown parameter governing the far-field behavior appeared as an additive constant in the zeroth derivative of the solution function, allowing it to "fall out" as an output of the iterative convergence. For an example (also drawn from fluid dynamics, in this case the theory of concentrated viscous vortex flows) featuring far-field asymptotics in which the unknown parameter must be explicitly included as an unknown and iterated upon, as are the Chebyshev coefficients of the solution function here, we refer the interested reader to this unpublished paper by Mayer & Lee. (The reasons it was not published have to do with questions about the criterion used to infer physical realizability of the flows under study, not with the asymptotics or numerics of the novel "automated matched asymptotic expansion via iterative refinement" procedure used there.)



http://mersenneforum.org/mayer/nr.html -- Last Revised: 07 Jun 2014
Direct comments, questions, corrections and suggestions to ewmayer@aol.com
If you enjoyed this page, please consider linking to it!