New Algorithms for Computing the Matrix Sine and Cosine Separately or Simultaneously

. Several existing algorithms for computing the matrix cosine employ polynomial or rational approximations combined with scaling and use of a double angle formula. Their derivations are based on forward error bounds. We derive new algorithms for computing the matrix cosine, the matrix sine, and both simultaneously that are backward stable in exact arithmetic and behave in a forward stable manner in ﬂoating point arithmetic. Our new algorithms employ both Pad´e approxi- mants of sin x and new rational approximants to cos x and sin x obtained from Pad´e approximants to e x . The amount of scaling and the degree of the approximants are chosen to minimize the computa- tional cost subject to backward stability in exact arithmetic. Numerical experiments show that the new algorithms have backward and forward errors that rival or surpass those of existing algorithms and are particularly favorable for triangular matrices. and sine. Our third experiment compares cosmsinm new to cosmsinm . For each test matrix we show the largest of the two errors in evaluating the matrix cosine and sine: if a particular test matrix results in backward or forward errors e c and e s for the cosine and sine, respectively, we plot max( e c , e s ). In the plots the quantity cond( f, A ) u denotes the larger of cond(cos , A ) u and cond(sin , A ) u . It is clear from Figure 5 that the new algorithm has signiﬁcantly better forward and backward error performance than cosmsinm . Plots of the individual errors for the sine and cosine have similar forms. The largest relative errors returned by cosmsinm were 2.9e2 and 1e3 for the matrix cosine and sine, respectively (both achieved for the ﬁrst test matrix in the plot), while the corresponding errors for cosmsinm new were 1.5e-7 and 5e-7.


Introduction.
In recent years research into the computation of matrix functions has primarily focused on the matrix exponential, the logarithm, and matrix powers. Also of interest are the matrix sine and cosine, which can be defined for A ∈ C n×n by their Maclaurin series Their importance stems from their role in second order differential equations. For example, the second order system (1.3) y (t) + Ay(t) = g(t), y(0) = y 0 , y (0) = y 0 , which arises in finite element semidiscretization of the wave equation, has solution where √ A denotes any square root of A [13, p. 124], [32]; see also [21,Prob. 4.1] for the case g(t) = 0. In practical computation a Krylov subspace method can be used The algorithm has two parameters: the amount of scaling s and the function r, which is either a truncated Taylor series or a Padé approximant.
Serbin and Blalock do not propose any specific algorithmic parameters. Higham and Smith [25] develop an algorithm based on the [8/8] Padé approximant with the scaling parameter s chosen with the aid of a forward error bound. Hargreaves and Higham [15] derive an algorithm with a variable choice of Padé degree (up to degree 20), again based on forward error bounds. They also give an algorithm that computes cos A and sin A simultaneously at a lower computational cost than computing the two functions separately. However they do not give an algorithm to compute sin A alone because the double angle formula sin(2X) = 2 sin X cos X for the sine involves the cosine. Indeed, as far as we are aware, no algorithm of the general form in Algorithm 1.1 has been proposed for computing sin A directly.
Recently Sastre et al. [31] have derived a new algorithm for the cosine that combines Taylor series approximations of degree up to 40 with sharper forward error bounds derived using ideas similar to those in [1, sect. 4].
In this work we develop three new algorithms for the sine and cosine that are based on backward error analysis and are backward stable in exact arithmetic. Our backward error analysis has two advantages over the forward error analyses used in the derivation of existing algorithms. First, the backward error analysis applies to the overall algorithm, whereas the forward error analyses bound the forward error of the function of the scaled matrix only, and the best overall forward error bound contains a term exponential in the number of double angle steps [21,Thm. 12.5]. Second, the current forward error-based algorithms actually bound the absolute error of the function of the scaled matrix rather than the relative error, which is a heuristic choice based on numerical experiments. With backward error analysis there is no need for such considerations, as relative backward errors are unequivocally appropriate.
A second key feature of our algorithms is that they exploit triangularity. They optionally reduce the original matrix to Schur form, but particular benefits are obtained when the original matrix is (real quasi-)triangular.
The algorithm for the matrix cosine follows the outline of Algorithm 1.1 but uses Padé approximants of the exponential rather than the cosine. The algorithm for the matrix sine scales by powers of 3 rather than 2, employs Padé approximants to the sine and the exponential, and uses the triple angle formula to undo the effects of the scaling. The algorithm for computing the cosine and the sine simultaneously makes use of a new choice of double angle formula that improves stability. The outline of the paper is as follows. In section 2 we perform backward error analysis of the Padé approximants for the sine and cosine and show how certain limitations can be overcome using alternative rational approximations based on the matrix exponential. In section 3 we give explicit formulas for the cosine and sine of 2 × 2 (real quasi-)triangular matrices, which are used in the algorithms for better accuracy. In sections 4-6 we design cost-effective methods for computing the matrix cosine and sine (separately and simultaneously) using the double and triple angle formulas. We then compare our algorithms numerically against existing alternatives in section 7 and give some concluding remarks in section 8.

Backward error analysis for the sine and cosine.
We begin by analyzing the backward error of Padé approximants to the matrix sine and cosine. We find that the backward error analysis restricts the range of applicability of the Padé approximants-so much so for the cosine as to make the approximants unusable. We therefore propose and analyze an alternative method of approximation based on Padé approximants to the exponential.

Padé approximants of matrix sine.
Let r m (x) = p m (x)/q m (x) denote the [m/m] Padé approximant to the sine function. Since the sine function is odd the Padé table splits into 2 × 2 blocks containing identical entries having odd numerator p m and even denominator q m [5, p. 65]. From this we can show that for k ≥ 2 the number of matrix multiplications required to form r 2k is equal to that for forming r 2k+1 , and hence we need only consider diagonal Padé approximants of odd order after r 1 and r 2 . For a similar discussion on the Padé table of the cosine see [28] and [35, p. 245].
We begin our analysis by defining h 2m+1 : C → C as where arcsin x denotes the principal arc sine, that is, the one for which arcsin x ∈ [−π/2, π/2] for x on the interval [−1, 1]. For x ∈ C with |x| ≤ 1 we have sin arcsin x = x. It follows that for X ∈ C n×n with ρ(r m (X)) ≤ 1, where ρ denotes the spectral radius, arcsin r m (X) is defined and, from (2.1), This means that ΔX = h 2m+1 (X) represents the backward error of approximating sin X by the [m/m] Padé approximant, assuming that ρ(r m (X)) ≤ 1. Figure 1 shows the order stars for the Padé approximants r m , that is, the regions of the complex plane where |r m (x)| ≤ 1 for a range of m [6], [27]. For a given value of m, if all the eigenvalues of X lie within this region then our backward error analysis holds. Since the regions are not circular it is difficult to check this criterion in practice, but we found numerically that for m = 1 : 13 the largest circular disk centered at the origin that lies within all the regions has radius approximately arcsinh 1 ≈ 0.881. Therefore it is sufficient to check that ρ(X) ≤ arcsinh 1 for our analysis to hold. Since and because arcsin and sin are odd functions we can write for some coefficients c m,k . Taking the norm of this equation, using [1, Thm. 4.2(b)], and recalling that ΔX = h 2m+1 (X), we can bound the normwise relative backward error by and the integer p ≥ 1 is such that m ≥ p(p − 1). Note that we have ρ(X) ≤ α p (X) ≤ X and α p (X) can be much smaller than X for nonnormal X, so the gains from working with (2.4) instead of the corresponding bound expressed solely in terms of X can be significant. It is easy to show that α 3 (X) ≤ α 2 (X) ≤ α 1 (X) and α 4 (X) ≤ α 2 (X), but the relationship between α 3 (X), α 4 (X), and α 5 (X) depends on X, so we need to compute or estimate all of the latter three quantities and use the smallest, subject to the constraint m ≥ p(p − 1).
To ensure maximum accuracy we would like to bound the backward error (2.4) by the unit roundoff u = 2 −53 ≈ 1.1 × 10 −16 in IEEE double precision arithmetic. If we define (2.6) β m = max β : then α p (X) ≤ β m ≤ arcsinh 1 implies that ΔX / X ≤ u for m ≤ 13. In the second row of Table 2 we show the values β m calculated using variable precision arithmetic in the Symbolic Math Toolbox for several values of m. We have β m > arcsinh 1 ≈ 0.881 for m ≥ 9, but our backward error bounds require that ρ(X) ≤ arcsinh 1. This means that for β 7 < α p (X) ≤ arcsinh 1 we can use m = 9, but for arcsinh 1 < α p (X) ≤ β 9 our backward error results are not applicable, so we artificially reduce β 9 to 0.881. From this point on we will not consider m > 9.

Padé approximants of matrix cosine.
For the matrix cosine we can attempt a similar analysis. Let r m (x) be the [m/m] Padé approximant to the cosine and define h 2m (x) = arccos r m (x) − x, where arccos denotes the principal arc cosine, which maps [−1, 1] to [0, π]. Analogously to the argument in the previous section, in order to obtain a backward error relation we need |x| ≤ 1. Figure 2 shows the order stars of r m for a range of m. Requiring the eigenvalues of X to lie inside the order stars imposes tight restrictions on them. Even more prohibitively, the eigenvalues must also lie in the strip of the right half-plane with real part between 0 and π, so that the principal branch of arccos gives arccos cos x = x.
As we are not aware of any satisfactory way to overcome these restrictions we will use an alternative rational approximation whose backward error analysis is applicable for every set of eigenvalues.

Exploiting Padé approximants of the exponential. Let the [m/m]
Padé approximant to e x be denoted by r m (x) = p m (x)/q m (x). Al-Mohy and Higham [1, sect. 3] show that if ρ(e −X r m (X) − I) < 1 and ρ(X) < min{ |t| : q m (t) = 0 } then where h 2m+1 (X) := log(e −X r m (X)), with log the principal logarithm. As shown in [1, sect. 5], h 2m+1 is an odd function, so h 2m+1 (−X) = −h 2m+1 (X). Hence which suggests using as approximations the rational functions with real coefficients From (2.7) and (2.9) we have so the two approximations have the same backward error, φ(X). To bound this backward error, for a power series p(x) let p(x) be the power series where each coefficient of p(x) is replaced by its modulus. Then φ(x) = ∞ k=m |c k,m |x 2k+1 and so we have where the θ m are given in Table 1 (reproduced from [20, Table 2.1]) and p is chosen to minimize α p (X) in (2.5) subject to m ≥ p(p − 1). For the cosine we will use c m but for the sine we will use a mixture of s m and Padé approximants to sin x. We now need to devise a strategy for choosing the parameters s (the amount of scaling) and m (the degree of the Padé approximant). We have θ m ≥ β m for m ≤ 21, as can partially be seen from Tables 1 and 2, which means that less scaling is required if we approximate sin X by s m than if we approximate sin X by the Padé approximant r m . On the other hand, the numerator and denominator polynomials of c m and s m are of higher degree than those of the Padé approximants of the cosine and sine, so c m and s m will be more expensive to evaluate. In all cases considered here, s m is a [2m − 1/2m] order rational approximant whereas c m is of order [2m/2m]. For example In sections 4-6 we explain how to balance the degree of the rational approximant with the amount of scaling in order to achieve the desired backward error at minimal cost.

Backward error propagation in multiple angle formulas.
We have shown how to achieve a small backward error for the rational approximation of the sine or cosine of the scaled matrix. We now show that this backward error propagates linearly through the multiple angle formula phase of the algorithm (lines 3-5 of Algorithm 1.1), resulting in a small overall backward error. We state the result for the cosine; an analogous result holds for the sine.
Lemma 2.1. Let A ∈ C n×n and X = η −s A for a positive integer η and nonnegative integer s, and suppose that r(X) = cos(X + ΔX) for a rational function r. Then the approximation Y from the "scaling and multiple angle" method satisfies Y = cos(A + ΔA), where ΔA = η s ΔX in exact arithmetic, and hence Proof. We prove the result for η = 2 (the double angle formula) for simplicity, though the same argument can be applied for any integer η (and the corresponding multiple angle formula).
By assumption, the initial approximation to the cosine from the rational approximant is C 0 = cos(X + ΔX), where X = 2 −s A. Applying the double angle formula s times gives Therefore we have cos A ≈ Y = C s = cos(A + ΔA) where ΔA = 2 s ΔX, and ΔX / X = 2 s ΔX / 2 s X = ΔA / A . Lemma 2.1 shows that there is no growth in the relative backward error during the multiple angle phase in exact arithmetic. Hence by choosing the parameters s and m so that ΔX / X ≤ u we achieve an overall backward error bounded by u. This contrasts with the algorithms for the matrix cosine in [15], [21,Alg. 12.6], [31], which choose r to make r(X) − cos X small, since the overall error r(A) − cos A bears no simple relation to r(X) − cos X .

3.
Recomputing the cosine and sine of 2 × 2 matrices. If we begin with an initial (real) Schur decomposition of our input matrix A, so that we are working with upper (quasi-)triangular matrices, then we can obtain higher overall accuracy by explicitly computing (instead of approximating) the diagonal blocks (including all of the first superdiagonal) explicitly throughout the algorithm. This idea has been used to good effect in recent algorithms for the matrix exponential [1], the logarithm [2], [3], and matrix powers [22], [23].

A463
For the cosine and sine we can avoid subtractive cancellation in the numerator of the divided difference when λ 1 ≈ λ 2 by computing These formulas are evaluated to high relative accuracy in floating point arithmetic when λ 1 ≈ λ 2 because the function sin x/x is well conditioned for small |x|; moreover, A 2 × 2 block of a real upper quasi-triangular matrix computed by the LAPACK Schur decomposition code dgees [4] has the form We can use the polynomial definition of a matrix function [21, Def. 1.4] to show that where θ = (−bc) 1/2 . Assuming that accurate implementations of sinh and cosh are available these formulae will compute the cosine and sine of 2 × 2 matrices to high componentwise accuracy without any subtractive cancellation.

Algorithm for the matrix cosine.
We now build an algorithm for the matrix cosine based upon the rational approximation c m of (2.11) along with the double angle formula.
The cost of our algorithm will be dominated by the matrix multiplications performed when forming the rational approximant and during the repeated application of the double angle formula. Each multiplication costs 2n 3 flops for full matrices or n 3 /3 flops for (quasi-)triangular matrices if a Schur decomposition is used.
We choose the two parameters m, the order of approximation, and s, the number of scalings, to minimize the number of matrix multiplications while ensuring a backward error of order u (in exact arithmetic).
First we discuss which values of m need to be considered for the approximant c m (X). We will consider m = 1: m max , where m max is defined as the largest m for which the denominator of c m (X) has condition number, in the appropriate norm, smaller than 10 for any X in the region where α p (X) ≤ θ m . The appropriate norm is as follows: for any > 0 there exists a norm · satisfying X ≤ ρ(X)+ ≤ α p (X)+ , where we will choose = u to be the unit roundoff. This norm is necessary since we intend to choose the value of m using α p (X), which is potentially much smaller than X , and therefore we have no a priori knowledge of how large X will be. In this norm the corresponding condition number of the denominator q m (X) of c m (X) can Table 1 The number of matrix multiplications π(cm(X)) required to evaluate cm(X), values of θm, values of p to be considered, and upper bounds κm for the condition number in the · norm of the denominator polynomial of cm (and sm) for αp(X) ≤ θm. Values of m for which π(cm(X)) = π(c m+1 (X)) are not shown as they provide no benefit. For m = 21, θ 21 is artificially reduced from 13.95 to 13 in order to ensure κ 21 ≤ 10.
where ∞ k=0 a k x k is the Taylor series of q m (x) −1 and q m is the same polynomial as q m with all coefficients replaced by their absolute values.
Using 250 digit arithmetic and calculating the first 350 terms of the Taylor series in the bound (4.2), we found that for the matrix cosine this means considering m max = 21, where θ 21 is reduced from 13.95, which has corresponding condition number 10.75, to 13 with condition number 7.86 for additional stability. Note that this part of our analysis also applies to the matrix sine since c m and s m share the same denominator polynomial.
This condition ensures that solving the multiple right-hand side system to form the rational approximant does not introduce significant errors into the computationat least as long as · is not too badly scaled. The upper bounds κ m on the condition number for each m of interest are shown in Table 1.
Now that we have a range of m to consider, we must find those values for which c m can be formed in the smallest number of matrix multiplications. There are many ways to evaluate the rational approximants: in the appendix we show that applying the Paterson-Stockmeyer scheme [21, pp. 73-74], [30] is more efficient than explicit computation of the necessary powers. Evaluating the numerator and denominator polynomials using the Paterson-Stockmeyer scheme as described in the appendix we see that, for example, c 7 (X) and c 8 (X) can both be formed using a minimum of six matrix multiplications. Since c 8 (X) can be applied to X with a larger value of α p (X) it renders c 7 (X) redundant. Table 1 contains the relevant values of m and θ m along with the number of matrix multiplications required to form c m (X). It also shows the values of p that we need to consider in order to minimize α p (X) in (2.5) for each m. one matrix multiplication and therefore it is worth increasing s by q if more than q multiplications can be saved when evaluating the rational approximant. (Note that this logic applies equally well to full matrices and triangular matrices.) From Table 1 we see that there are a number of places where such an increase in s is desirable. For example when θ 10 < α 3 (X) ≤ 2θ 8 we can perform one extra scaling and use c 8 (X) instead of c 12 (X), saving one multiplication.
We also point out that computing α p (X) can sometimes require more powers of X than the rational approximant. For instance when m = 2 we need α 2 (X) = min( X 4 1/4 , X 6 1/6 ), but forming c 2 (X) requires only X 4 (we use explicit powers for m ≤ 4, which have the same cost as the Paterson-Stockmeyer scheme: see Table 4). However, we will use the 1-norm, for which we can estimate the norm of X for any integer > 0, without calculating the matrix power explicitly, using the block 1-norm estimator of Higham and Tisseur [26]. A further optimization is that after eliminating m = 1 as a possibility, for example, all higher order approximants c m (X) require X 4 so we can compute and store X 4 and hence use X 4 1/4 1 rather than an estimate of it in the logical tests. A similar optimization can be performed after eliminating m = 2 and m = 6. Taking all these issues into account leads to the following algorithm to determine the parameters s and m. The many logical tests are the price we pay for making the most efficient choice of parameters.  Table 1. 10 Compute and store A 6 . 11 d 6 = A 6 1/6 1 % Compute exact value and update α 2 .
We can now present our full algorithm for computing the matrix cosine, which is backward stable in exact arithmetic. The algorithm is written using an initial Schur decomposition, but a transformation-free algorithm can be obtained by removing lines 5, 8, and 10 and replacing line 1 with T = A.
Algorithm 4.2. Given A ∈ C n×n this algorithm computes C = cos A. The algorithm is designed for use with IEEE double precision arithmetic. 1 Compute the (real if A ∈ R n×n ) Schur decomposition A = QT Q * . Recompute the diagonal blocks of C = cos(2 j T ) using (3.1) (3.3), (3.6). 9 end 10 C ← QCQ * Cost. (28 + (π + s + 1)/3)n 3 flops where π denotes the number of matrix multiplications needed to form c m (x). The transformation-free version of the algorithm costs (2(s + π) + 8/3)n 3 flops. Comparing these two we see that it is cheaper to use the Schur decomposition if π + s ≥ 16. The latter inequality is readily satisfied, for example if A = 448I, and using the Schur decomposition allows us to carry out the accurate recomputation of the diagonal blocks.

Algorithm for the matrix sine.
In this section we design an algorithm to compute the matrix sine. Whereas for the cosine we used the rational approximations Table 2 The number of matrix multiplications π(·) required to evaluate rm(x), values of βm in (2.6), values of p to be considered, and an upper bound κm for the condition number of the denominator polynomial of rm for αp(X) ≤ βm. Values of m for which π(rm(X)) = π(r m+1 (X)) are not shown as they provide no benefit. The value of β 9 has been artificially reduced from 1.14 to 8.81e-1 as required by our backward error analysis in section 2.1; this changes the value of κ 9 . c m (X), for the sine we must consider both s m (X) and the Padé approximants for optimal efficiency. Another difference from the cosine case is that we will use the triple angle formula sin(3X) = 3 sin X − 4 sin 3 X instead of the double-angle formula sin(2X) = 2 sin X cos X, as the latter formula requires cos X. Table 2 shows the number of matrix multiplications required to form the Padé approximants r m (X), denoted π(r m (X)). The values of π(s m (X)) are π(s 1 (X)) = 1 and for m ≥ 2 we have π(s m (X)) = π(c m (X))+1, where π(c m (X)) is given in Table 1. The table also contains the values of p that we need to consider and the values β m such that α p (X) ≤ β m implies that the backward error of the result is bounded by u in exact arithmetic. The last row of the table shows an upper bound κ m on the condition number of the denominator polynomial of r m (X) for α p (X) ≤ β m , for the · norm defined in the previous section. For s m (X) the values for θ m , p, and κ m exactly match those already given for the cosine in Table 1: the values of θ m match due to the relationship to the matrix exponential (see section 2.3), the p match due to the degree of the approximants, and the κ m match as c m and s m have identical denominator polynomials.
For the Padé approximants r m , the maximum order of approximation considered is m = 9 in order to ensure the validity of the bounds, as explained in section 2.1. On the other hand, as for the matrix cosine, we consider up to m = 21 for the alternative rational approximants s m (X). The use of s m (X) allows larger p within the values α p (X) used in our backward error bounds and can therefore reduce the amount of scaling required. For example, suppose A is nilpotent of degree 10 with α 3 (A) β 9 . Using only Padé approximants we would require a large amount of scaling, but since A is nilpotent we know that α 5 (A) = 0, allowing the use of s 21 (A) with no scaling required.
We scale X = 3 −s A and each application of the triple angle formula requires two matrix multiplications. Hence it is beneficial to increase s by q if we can save more than 2q matrix multiplications in forming s m . For example, suppose that θ 12 < min(α 3 (X), α 4 (X)) ≤ θ 15 but additionally α 3 (X) ≤ 9β 9 ; then by performing two extra scalings we can use r 9 (X) at a cost of 4 + π(r 9 (X)) = 9 multiplications, as opposed to π(s 15 (X)) = 10 multiplications. For each X we choose among the r m and s m approximants, while considering extra scaling as above, always striving for the parameters with minimal cost subject to the backward error constraint.
This discussion leads to the following parameter selection algorithm.  r 1 (x), quit, end 5 Compute and store A 2 . 6 d 2 = A 2 1/2 1 % Compute exact value and update α 1 .

Algorithm for simultaneous computation of the matrix cosine and sine.
We now design an algorithm to compute the matrix cosine and sine simultaneously, a requirement that arises in the evaluation of (1.4), for example. We use the rational approximants c m (x) and s m (x) introduced in section 2.3, since they have the same denominator polynomial: this saves a significant amount of computation since we need only compute one denominator for both approximants and furthermore we can reuse an LU factorization of the denominator when computing c m (X) and s m (X).
Since we are now computing both the cosine and sine we can use the double angle formulas for both. However for the cosine there are two such formulas for us to choose from: (6.1) cos(2X) = 2 cos 2 X − I, cos(2X) = I − 2 sin 2 X.
We have found empirically that using cos(2X) = I − 2 sin 2 X generally gives more accurate computed results, sometimes significantly so, though a theoretical explanation for this observation is currently lacking. In Table 3 we show the number of matrix multiplications π m required to form both c m (x) and s m (x) using the Paterson-Stockmeyer scheme (see appendix). Since each invocation of the double angle formulas for the cosine and sine costs two matrix multiplications in total, it is worth performing q extra scalings if more than 2q matrix multiplications can be saved in the formation of the rational approximant. This results in the following parameter selection algorithm. Algorithm 6.1. Given A ∈ C n×n this algorithm determines the parameters s and m such that the algorithm for approximating cos A and sin A based on c m (2 −s A) and s m (2 −s A) and the double angle formulas produces (in exact arithmetic) backward errors bounded by u for both functions. The algorithm uses the parameters θ m given in Table 1.  Recompute the block diagonals of C = cos(2 j T ) using (3.1), (3.3), (3.6). 14 end 15 S ← QSQ * 16 C ← QCQ * Cost. (95/3 + (π + 2s)/3)n 3 flops where π denotes the number of matrix multiplications needed to form both approximants. The corresponding transformation-free algorithm costs (14/3 + 2(π + 2s))n 3 flops. Comparing these two costs we see that it is cheaper to use the Schur decomposition if π + 2s ≥ 17.
For comparison, the cost of calling Algorithms 4.2 and 5.2 separately-but using only one Schur decomposition-is (31 + (π c + π s + s c + s s + 2)/3)n 3 flops, or (16/3 + 2(π c + π s + s c + s s ))n 3 flops if the transformation-free version is used, where the subscripts c and s denote the values obtained from the cosine and sine algorithms, respectively. To illustrate the benefit to the overall cost gained by computing the matrix cosine and sine simultaneously consider the matrix A = gallery('frank',1000): we obtain the values π c = 9, π s = 5, s c = 14, s s = 11 when using Algorithms 4.2 and 5.2, and π = 14, s = 14 when using Algorithm 6.2. Therefore the cost of computing the matrix cosine and sine separately is (220/3)n 3 and (316/3)n 3 flops (with and without a Schur decomposition, respectively) as opposed to (137/3)n 3 and (266/3)n 3 flops when computing both functions simultaneously. For the Schur decomposition algorithm this is a computational saving of around 38 percent.  [17], the MATLAB gallery function, and the matrix function literature. We select matrices for which none of the algorithms overflow and multiply them by a uniform random scalar between 1 and 60 to ensure that some scaling will occur in the algorithms. Each matrix is then used once in this form and a second time multiplied by the imaginary unit i in order to test the algorithms for complex matrices. In total 76 matrices are used. Similar test matrices were used in recent work on the matrix cosine [15], [21, chap. 12], [31].
Our numerical experiments compare the normwise forward and backward errors of the competing methods. If Y denotes the computed value of f (A) then the forward error is and the backward error is The backward error is difficult to compute exactly so we make a linearized approximation of it, which is justified as we are interested in cases where ψ(Y ) 1. Our approximation is based on the first order expansion where K f (A) ∈ C n 2 ×n 2 is the Kronecker matrix and vec(E) stacks the columns of E vertically from first to last [21, chap. 3], [24]. We approximate the backward error ψ(Y ) by the minimum 2-norm solution of (7.4), where the Kronecker matrix (obtained using [21,Alg. 3.17]) and f (A) are computed, and the system solved, in 250 digit arithmetic. A similar idea is used effectively by Deadman and Higham in [8, sect. 5] to compute linearized backward errors of functional identities. Since we showed in Lemma 2.1 that all our algorithms are backward stable in exact arithmetic we expect the relative error (7.1) to be bounded by a modest multiple of cond(f, A)u, where the condition number cond(f, A) is defined in [21, chap. 3].
Accurate values of cos A and sin A for use in (7.1) and (7.4) are generated in 250 digit arithmetic using the Symbolic Toolbox by adding a random perturbation of norm 10 −125 to A, then diagonalizing it, and taking the cosine (or sine) of the eigenvalues; the perturbation ensures the eigenvalues are distinct so that the diagonalization is always possible. This idea is based on [7] and has been used successfully in [3], [23], and [24]. The condition number of f is estimated using the code funm condest1 from the Matrix Function Toolbox [18], [21,Alg. 3.20].
We test our new algorithms against the existing alternatives: • cosm for cos A from [18], which implements an algorithm of Hargreaves  [31].
The first two algorithms use variable degree Padé approximants, while costay uses a variable degree truncated Taylor series. None of these algorithms performs a Schur decomposition of the input matrix or recomputes the diagonal blocks (as described in section 3), and all are based on forward error analysis. We will perform comparisons both with and without the initial Schur decomposition within our algorithms.
Our first three experiments are designed to test our new algorithms against the aforementioned alternatives on full matrices. We denote our new algorithms by • cosm new: Algorithm 4.2, • sinm new: Algorithm 5.2, • cosmsinm new: Algorithm 6.2. In each case we plot our results in a 4 × 2 grid. Each plot in the (1, 1) position shows the forward errors (7.1), with the new algorithm in transformation-free form, along with an estimate of cond(f, A)u. The (1, 2) plot presents the same relative errors as a performance profile [11], [16, sect. 22.4], to which we apply the strategy from [10] to avoid tiny relative errors skewing the results. Briefly, for a given curve on the performance profile and a given α the corresponding p value is the proportion of problems for which the error for that method was within a factor α of the smallest error over all methods. The (2, 1) plot shows the backward errors (7.2), and the (2, 2) plot is a performance profile for the backward error results. The remaining four plots show the same results when we allow our new algorithms to compute an initial Schur decomposition, allowing recomputation of the diagonal blocks (section 3). The colored triangles in each plot show any cases where the algorithm returned an error greater than one.
Our next experiment tests each algorithm on matrices that are already triangular: we precompute the Schur decomposition of each test matrix before sending the triangular factor to the algorithms. This shows how well the algorithms exploit triangularity. In some applications the original matrices are triangular; see, for example, [37] in the case of the matrix exponential.
The final experiment compares the accuracy of cosm new and costay on a matrix resulting from the semidiscretization of a nonlinear wave problem [12,Prob. 4]. We compare the two algorithms for a range of discretization points, where the matrix becomes more nonnormal (and hence the problem more difficult) as the resolution increases. Figure 3 shows the results for the matrix cosine. For the transformation-free case of cosm new we see from the (1, 2) plot that costay was the most accurate algorithm for about 60 percent of the matrices, as shown by the α = 1 ordinate, but was less reliable than cosm new, as shown by the relative position of the curves for α ≥ 2. Indeed visible in the (1, 1) plot is one relative error returned by costay (represented by the green triangle), which was 1, as opposed to 9e-8 and 3e-2 for cosm new and cosm, respectively. The condition number of this problem was 4.8e9, so we would expect a forward error of order 10 −7 from a forward stable algorithm. From the (2, 1) and (2, 2) plots we see that none of the algorithms is always backward stable and that cosm new shows a small advantage over costay. The largest backward errors were 1, 2.1e1, and 1.4e-2 for costay, cosm, and cosm new, respectively.

Matrix cosine.
Also visible in the (1, 1) plot is one case where cosm returned a relative error > 1 (denoted by the black triangle) while cosm new returned an answer with essentially full accuracy.
For the tests in which a Schur decomposition is used (the lower four plots) we see an improvement in backward stability of cosm new at the expense of some slight The (1,1) and (1, 2) plots of Figure 6 show the cost of each algorithm in multiples of n 3 flops using the transformation-free and Schur versions, respectively. We see that the transformation-free version of cosm new is marginally more expensive than costay and cosm in most cases but is significantly cheaper than the latter on occasion. The Schur version has a relatively stable cost of around 32n 3 flops, due to the fixed overhead of the Schur decomposition, which could be advantageous for highly oscillatory differential equations [36], where the matrices have large eigenvalues and hence a heavy scaling may be needed, requiring a large number of (now triangular) matrix multiplications. The precise criterion under which the Schur algorithm is cheaper than the transformation-free version was explained at the end of section 4.

Matrix sine.
For our second experiment, since there are no other algorithms dedicated to computing the matrix sine, we compare sinm new against the use of costay to compute sin A = cos(A−πI/2). Our comparison of these two algorithms is shown in Figure 4. For the transformation-free algorithm we see that sinm new has significantly better forward error and backward error performance than costay. The use of the Schur decomposition improves the forward stability of sinm new: the forward errors in the (3, 1) plot are always within a factor 15 of the condition number times u as opposed to 186 for the (1, 1) plot. As in the previous experiment we see that costay is not reliable as it returns a relative error and a backward error 1 in one test case.
The cost of each algorithm is given in the (2, 1) and (2, 2) plots of Figure 6 for the transformation-free and Schur versions, respectively. In each case sinm new is generally more expensive than costay, and using the Schur decomposition gives a fairly stable cost of around 32n 3 flops.

Matrix cosine and sine.
Our third experiment compares cosmsinm new to cosmsinm. For each test matrix we show the largest of the two errors in evaluating the matrix cosine and sine: if a particular test matrix results in backward or forward errors e c and e s for the cosine and sine, respectively, we plot max(e c , e s ). In the plots the quantity cond(f, A)u denotes the larger of cond(cos, A)u and cond(sin, A)u. It is clear from Figure 5 that the new algorithm has significantly better forward and backward error performance than cosmsinm. Plots of the individual errors for the sine and cosine have similar forms. The largest relative errors returned by cosmsinm were 2.9e2 and 1e3 for the matrix cosine and sine, respectively (both achieved for the first test matrix in the plot), while the corresponding errors for cosmsinm new were 1.5e-7 and 5e-7.
Plots showing the cost of each algorithm, both avoiding and utilizing an initial Schur decomposition, are given in the (3,1) and (3,2) positions of Figure 6, respectively. We see that the transformation-free version of cosmsinm new is slightly more expensive than cosmsinm, but allowing an initial Schur decomposition makes our new algorithm cheaper in some of the test cases. Figures 7-9 show the results of applying the algorithms to matrices that are already (quasi)-triangular, obtained by taking the (real) Schur form of each matrix before passing it to the competing algorithms. The new algorithms are greatly superior to the existing ones in terms of both forward and backward error, often achieving values of order u. By comparison with Figures 3-5 it is clear that the Schur decomposition is a significant source of error in our algorithms.

Triangular matrices.
The cost of the competing algorithms for triangular matrices is shown in Figure 10. In the majority of cases our new algorithms are more efficient than the alternatives. 7.5. Wave discretization problem. Our final experiment compares the forward errors of cosm new and costay when computing the cosine of a matrix arising from the semidiscretization of the wave equation [12,Problem 4] where x ∈ (0, 1), t > 0, and With Dirichlet boundary conditions the solution is u(x, t) = a(x) cos(10t). If we perform a semidiscretization in the spatial variable with mesh size 1/n we obtain a system of the form (1.3) with parametrized matrix This matrix becomes increasingly nonnormal as n grows.
In Figure 11 we show the forward errors of the two algorithms when computing cos A for a range of n as α varies between 0 and 10. The results shown are for the transformation-free version of cosm new; similar results were obtained using the Schur decomposition. For n = 10, both algorithms behave in a forward stable manner, with costay generally more accurate. As n increases cosm new becomes significantly more accurate than costay, the latter showing signs of forward instability and having errors varying much less smoothly with α.
8. Concluding remarks. The goal of this work was to derive algorithms for computing the matrix sine and cosine-both separately and together-that are backward stable in exact arithmetic, thereby providing a more rigorous foundation than for previous algorithms, all of which are based on bounding absolute or forward errors of the function of a scaled matrix. Algorithms with this form of backward stability are already available for the matrix exponential and matrix logarithm. A key finding is that Padé approximants of the matrix cosine do not lend themselves to deriving backward stable algorithms, while those for the matrix sine put strong constraints on the spectral radius of the matrix. We therefore introduced new rational approximants obtained from Padé approximants of the exponential, which yield backward stable approximants of the sine and cosine with no a priori limit on the spectral radius. We also gave the first multiple angle formula-based algorithm for the matrix sine, which uses the triple angle formula in order to avoid the cosines that would be needed by the double angle formula. Experiments show that the new algorithms behave in a forward stable manner in floating point arithmetic, have better backward stability properties than their competitors, and are especially effective for triangular matrices.