Linear systems of equations and their solution form the cornerstone of much Engineering and Science. Linear algebra is a paragon of Mathematics in the sense that its theory is what mathematicians try to emulate when they develop theory for many other less neat subjects. I think Linear Algebra ought to be required mathematics for any scientist or engineer. (For example, I think Quantum Mechanics makes a lot more sense when taught in terms of *inner products* than just some magic which drops from the sky.) Unfortunately, in many schools, it is not. You can learn it online, and Professor Gilbert Strang’s lectures and books are *the best*. (I actually prefer the second edition of his *Linear Algebra and Its Applications*, but I confess I haven’t looked at the fourth edition of the text, just the third, and I haven’t looked at his fifth edition of Introduction to Linear Algebra.)

There’s a lot to learn about numerical methods for linear systems, too, and Strang’s *Applications* teaches a lot that’s very good there, including the SVD of which Professor Strang writes “it is not nearly as famous as it should be.” I very much agree. You’ll see it usable *everywhere*, from dealing with some of the linear systems I’ll mention below to support for Principal Components Analysis in Statistics, to singular-spectrum analysis of time series, to Recommender Systems, a keystone algorithm in so-called Machine Learning work.

The study of numerical linear algebra is widespread and taught in several excellent books. My favorites are Golub and Van Loan’s *Matrix Computations*, Björck’s *Numerical Methods for Least Squares Problems*, and Trefethen and Bau’s *Numerical Linear Algebra*. But it’s interesting how fragile these solution methods are, and how quickly one needs to appeal to Calculus directly with but small changes in these problems. That’s what this post is about.

So what am I talking about? I’ll use small systems of linear equations as examples, despite it being entirely possible and even common to work with systems which have thousands or millions of variables and equations. Here’s a basic one:

written for brevity

Of course, in any application the equation looks more like:

In **R** or MATLAB the result is easily obtained. I work and evangelize **R**, so any computation here will be recorded in it. Doing

solve(A,b)

or

lm(b ~ A + 0)

will produce

It’s also possible to solve several at once, for example, from:

And, of course, having an unknown but a known is direct, just using matrix multiplication:

yielding:

Linear Algebra gives us sensible ways to interpret inconsistent systems like:

by making reasonable assumptions about what the solution to such a system should mean. **R** via `lm(.)`

gives:

Sets of solutions to things like

can be countenanced and there is even a way which I’ll talk about below for picking out a unique one: the *minimum norm* solution. This is where the SVD comes in. To learn about all the ways these things can be thought about and done, I recommend:

D. D. Jackson, “Interpretation of inaccurate, insufficient and inconsistent data”, *Geophysical Journal International*, 1972, 28(2), 97-109.

(That’s an * awesome* title for a paper, by the way.)

**What if there are holes?**

Going back to (3), however, suppose instead it looks something like this:

and we don’t know *what* is. Can it be calculated?

Well, it *has* to be able to be calculated: It’s the only unknown in this system, with the rules of matrix multiplication just being a shorthand for combining things. So, it’s entirely correct to think that the constants could be manipulated algebraically so they all show up on one side of equals, and on the other. That’s a lot of algebra, though.

We might *guess* that was symmetric so think . But what about the following case, in (12)?

Now there are 3 unknowns, , , and . The answer is available in (3), but suppose that wasn’t known?

This problem is one of finding those parameters, *searching* for them if you like. To search, it helps to have a measure of how far away from a goal one is, that is, some kind of score. (14) is what I propose as a score, obtained by taking (12) and rewriting it as below, (13):

is an norm, and .

In other words, is the length of the vector . It’s non-negative. Accordingly, making (14) as small as possible means pushing the left and right sides of (12) towards each other. When (14) is zero the left and right sides are equal.

Now, there are many possible values for , , and . In most applications, considering all *flonum* values for these is not necessary. Typically, the application suggests a reasonable range for each of them, from a *low* value to a *high* value. Let

be the range of values for ,

be the range of values for , and

be the range of values for , each dictated by the application. If , , and are each randomly but independently chosen from the unit interval, then a particular value of (14) can be expressed

where

So, this is an *optimization problem* where what’s wanted is to make (16) as small as possible, searching among triplets of values for , , and . How does that get done? **R** package **nloptr**. This is a package from *CRAN* which does a rich set of numerical nonlinear optimizations, allowing the user to choose the algorithm and other controls, like ranges of search and *constraints* upon the control parameters.

Another reason why these techniques are interesting is it is intriguing *and fun* to see how far one can get knowing very little. And when little is known, letting algorithms run for a while to make up for that ignorance doesn’t seem like such a bad trade.

**An illustration**

In order to illustrate the *I don’t know much* case, I’m opting for:

What a run produces is:

Call:

nloptr(x0 = rep(0.5, 3), eval_f = objective1, lb = rep(0, 3), ub = rep(1, 3), opts = nloptr.options1, alpha.beta = alpha.beta)

```
```Minimization using NLopt version 2.4.2

NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )

Number of Iterations....: 100000

Termination conditions: xtol_rel: 1e-04 maxeval: 1e+05

Number of inequality constraints: 0

Number of equality constraints: 0

Current value of objective function: 0.000734026668840609

Current value of controls: 0.74997066329 0.5556247383 0.75010835335

**Y1 resulting estimates for , , and are: 1.00, 4.00, 3**

That’s nloptr-speak for reporting on the call, the termination conditions and result. The bottom line in bold tells what was expected, that .

What about the code? The pertinent portion is shown below, and all the code is downloadable as a single **R** script from here. There’s also a trace of the execution of that script available as well.

L2norm<- function(x)

{

sqrt( sum(x*x) )

}

```
```r<- function(sigma, alpha, beta)

{

stopifnot( (0 <= sigma) && (sigma <= 1) )

stopifnot( alpha < beta )

alpha*(1 - sigma) + beta*sigma

}

# Recall original was:

#

# A<- matrix(c(1, 2, 3, 2, 1, 4, 3, 4, 1), 3, 3, byrow=TRUE)

P1.func<- function(x, alpha.beta)

{

stopifnot( is.vector(x) )

stopifnot( 3 == length(x) )

#

sigma11<- x[1]

sigma23<- x[2]

sigma31<- x[3]

alpha11<- alpha.beta[1]

beta11<- alpha.beta[2]

alpha23<- alpha.beta[3]

beta23<- alpha.beta[4]

alpha31<- alpha.beta[5]

beta31<- alpha.beta[6]

#

P1<- matrix( c( r(sigma11,alpha11,beta11), 2, 3,

2, 1, r(sigma23,alpha23,beta23),

r(sigma31,alpha31,beta31), r(sigma23,alpha23,beta23), 1

),

nrow=3, ncol=3, byrow=TRUE )

return(P1)

}

objective1<- function(x, alpha.beta)

{

stopifnot( is.vector(x) )

stopifnot( 3 == length(x) )

b<- matrix(c(12,4,16),3,1)

x.right<- matrix(c(-3,6,1),3,1)

P1<- P1.func(x, alpha.beta)

d<- b - P1 %*% x.right

# L2 norm

return( L2norm(d) )

}

nloptr.options1<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-6, "print_level"=0, "maxeval"=100000, "population"=1000)

alpha.beta<- c(-2, 2, -1, 8, -6, 6)

Y1<- nloptr(x0=rep(0.5,3),

eval_f=objective1,

lb=rep(0,3), ub=rep(1,3),

opts=nloptr.options1,

alpha.beta=alpha.beta

)

print(Y1)

cat(sprintf("Y1 resulting estimates for a_{11}, a_{23}, and a_{31} are: %.2f, %.2f, %2.f\n",

r(Y1$solution[1], alpha.beta[1], alpha.beta[2]), r(Y1$solution[2], alpha.beta[3], alpha.beta[4]),

r(Y1$solution[3], alpha.beta[5], alpha.beta[6])))

**But what is it good for? Case 1: Markov chain transition matrices**

Consider again (1):

This happens also to be the template for a 3-state *Markov chain* with their many applications.

The following example is taken from the famous paper by Rabiner, as presented by Resch:

- L. R. Rabiner, “A tutorial on Hidden Markov Models and selected applications in speech recognition”,
*Proceedings of the IEEE*, February 1989, 77(2), DOI:10.1109/5.18626. - B. Resch, “Hidden Markov Models”, notes for the course
*Computational Intelligence*, Graz University of Technology, 2011. - N. J. Welton, A. E. Ades, “Estimation of Markov Chain transition probabilities and rates from fully and partially observed data: Uncertainty propagation, evidence synthesis, and model calibration”,
*Medical Decision Making*, 2005, 25(6), 633-645. - T. P. Runarsson, X. Yao, “Search biases in constrained evolutionary optimization,”
*IEEE Transactions on Systems, Man, and Cybernetics*, Part C:*Applications and Reviews*, 205, 35(2), 233-243. - T. P. Runarsson, X. Yao, “Stochastic ranking for constrained evolutionary optimization,”
*IEEE Transactions on Evolutionary Computation*, 2000, 4(3), 284-294.

They begin with the transition diagram:

which if cast into the form of (1′) and (2) looks like:

The rows, top-to-bottom, are labeled *sunny*, *rainy*, and *foggy*, as are the columns, left-to-right. Cell gives the probability for going from state to state . For example, the probability of going from *sunny* to *foggy* is 0.15. Here’s a prettier rendition from Resch:

Resch and Rabiner go on to teach *Hidden Markov Models* (“HMMs”) where is not known and, moreover, the weather is not directly observed. Instead, information about the weather is obtained by observing whether or not a third party takes an umbrella to work or not. Here, however, suppose the weather *is* directly known. And suppose is known except nothing is known about what happens after *foggy*except when it remains *foggy*. Symbolically:

Note in (18) or Resch’s tableau how the rows each sum to one. This is a characteristic of first order Markov models: Once in a state, the transition has to go *somewhere*, even if to stay in that state. Transitions can’t just cause the system to disappear, so all the outgoing probabilities need to sum to one. This means, however, that when what happens when it is *foggy* is introduced, there aren’t two unconstrained parameters, there is only one. Accordingly, rather than introducing , I could write . As it turns out, in my experience with nloptr, it is often better to specify this constraint explicitly so the optimizer

knows about it rather than building it implicitly into the objective function, even at the price of introducing another parameter and its space to explore.

The challenge I’ll pose here is somewhat tougher than that faced by HMMs. The data in hand is not a series of *sunny*, *rainy*, or *foggy* weather records but, because, say, the records were jumbled, all that’s in hand is a *count* of how many *sunny*, *rainy*, and *foggy* days there were, and what the count of days were following them. In particular:

meaning that the first day of a set of pairs began where the first day was *sunny* 1020 times, *rainy* 301 times, and *foggy* 155 times. Statistical spidey sense wonders about how many observations are needed to pin town transition probabilities well, but let’s set that aside for now. (At least it’s plausible that if ordering information is given up, there might be a need for more count information.) *And* the count of what the weather was on the second days is:

or 854 *sunny* days, 416 *rainy* days, and 372 *foggy* foggy days.

Note that unlike in (16) here in (19) there is no need to pick upper and lower bounds on the value: This is a probability so it is by definition limited to the unit interval. *But* always so *that* constraint needs to be stated.

Here’s the code:

P2.func<- function(x)

{

# Sunny, Rainy, Foggy

stopifnot( is.vector(x) )

stopifnot( 2 == length(x) )

#

a.31<- x[1]

a.32<- x[2]

#

P2<- matrix( c( 0.8, 0.05, 0.15,

0.2, 0.6, 0.2,

a.31, a.32, 0.5

),

nrow=3, ncol=3, byrow=TRUE )

return(P2)

}

```
```objective2<- function(x)

{

stopifnot( is.vector(x) )

stopifnot( 2 == length(x) )

x.right<- matrix(c(1020, 301, 155), 3, 1)

b<- matrix(c(854, 416, 372),3,1)

P2<- P2.func(x)

d<- b - P2 %*% x.right

# L2 norm

return( L2norm(d) )

}

constraint2<- function(x)

{

return( (x[1] + x[2] - 0.5 ))

}

nloptr.options2<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-4, "print_level"=0, "maxeval"=100000, "population"=1000)

Y2<- nloptr(x0=rep(0.5,2),

eval_f=objective2,

eval_g_eq=constraint2,

lb=rep(0,2), ub=rep(1,2),

opts=nloptr.options2

)

print(Y2)

cat(sprintf("Y2 resulting estimates for a_{31}, a_{32} are: %.2f, %.2f\n",

Y2$solution[1], Y2$solution[2]))

This run results in:

Call:

nloptr(x0 = rep(0.5, 2), eval_f = objective2, lb = rep(0, 2), ub = rep(1, 2), eval_g_eq = constraint2, opts = nloptr.options2)

```
```Minimization using NLopt version 2.4.2

NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )

Number of Iterations....: 100000

Termination conditions: xtol_rel: 1e-04 maxeval: 1e+05

Number of inequality constraints: 0

Number of equality constraints: 1

Current value of objective function: 0.500013288564363

Current value of controls: 0.20027284199 0.29972776012

`Y2 resulting estimates for are: 0.20, 0.30`

Suppose some of the data is missing? In particular, suppose instead:

where is on the unit interval and so all that’s known is that is between 155 and 1020, that is, bounded by the other two terms in .

Now there are two parameters to search, but they are unconstrained, apart from being on the unit interval. The code for this is:

P3.func<- function(x)

{

# Sunny, Rainy, Foggy

stopifnot( is.vector(x) )

stopifnot( 3 == length(x) )

#

a.31<- x[1]

a.32<- x[2]

# There's an x[3] but it isn't used in the P3.func. See

# the objective3.

#

P3<- matrix( c( 0.8, 0.05, 0.15,

0.2, 0.6, 0.2,

a.31, a.32, 0.5

),

nrow=3, ncol=3, byrow=TRUE )

return(P3)

}

```
```objective3<- function(x)

{

stopifnot( is.vector(x) )

stopifnot( 3 == length(x) )

x.right<- matrix(c(1020, r(x[3], 155, 1020), 155), 3, 1)

b<- matrix(c(854, 416, 372),3,1)

P3<- P3.func(x)

d<- b - P3 %*% x.right

# L2 norm

return( L2norm(d) )

}

constraint3<- function(x)

{

stopifnot( 3 == length(x) )

return( (x[1] + x[2] - 0.5 ))

}

nloptr.options3<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-4, "print_level"=0, "maxeval"=100000, "population"=1000)

Y3<- nloptr(x0=rep(0.5,3),

eval_f=objective3,

eval_g_eq=constraint3,

lb=rep(0,3), ub=rep(1,3),

opts=nloptr.options3

)

print(Y3)

`cat(sprintf("Y3 resulting estimates for a_{31}, a_{32}, and eta are: %.2f, %.2f, %.2f\n",`

Y3$solution[1], Y3$solution[2], Y3$solution[3]))

The results are:

Call:

nloptr(x0 = rep(0.5, 3), eval_f = objective3, lb = rep(0, 3), ub = rep(1, 3), eval_g_eq = constraint3, opts = nloptr.options3)

```
```Minimization using NLopt version 2.4.2

NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )

Number of Iterations....: 100000

Termination conditions: xtol_rel: 1e-04 maxeval: 1e+05

Number of inequality constraints: 0

Number of equality constraints: 1

Current value of objective function: 0.639962390444759

Current value of controls: 0.20055501795 0.29944464945 0.16847867543

Y3 resulting estimates for a_{31}, a_{32}, and are: 0.20, 0.30, 0.17, with that corresponding to **301**

That 301 versus the true 372 isn’t too bad.

An example of where this kind of estimation is done more generally, see:

**But what is it good for? Case 2: Learning prediction matrices**

When systems like (2) arise in cases of statistical regression, the matrix is called a *prediction* or *design matrix*. The idea is that its columns represent sequences of predictions for the *response*, represented by the column vector , and the purpose of regression is to find the best weights, represented by column vector , for predicting the response.

Consider (2) again but instead of and being column vectors, as in (5), they are matrices, and , respectively. In other words, the situation is that there are lots of pairs available. And then suppose *nothing* is known about , that is, it just contains nine unknown parameters:

There are, of course, questions about how many pairs are needed in tandem with choice of number of iterations (see maxeval discussion in *Miscellaneous Notes* below.) Here, 8 pairs were used for purposes of illustration.

and

The code for this case is:

```
```objective4<- function(x)

{

stopifnot( is.vector(x) )

stopifnot( 9 == length(x) )

B<- matrix(c(5215, 13693, 7265, 4217, 9367, 10588, 14372, 12043,

7528, 17825, 11024, 4989, 14860, 9447, 16162, 13087,

6161, 12798, 7702, 3023, 9551, 8908, 11429, 8734

), 3, 8, byrow=TRUE)

X.right<- matrix(c(1356, 7505, 4299, 3419, 7132, 1965, 8365, 8031,

5689, 8065, 7001, 638, 8977, 1088, 3229, 1016,

3777, 8135, 3689, 1993, 3635, 9776, 8967, 7039

), 3, 8, byrow=TRUE)

P4<- matrix(x, nrow=3, ncol=3, byrow=TRUE)

d<- B - P4 %*% X.right

# L2 norm for matrix

return( L2normForMatrix(d, scaling=1000) )

}

nloptr.options4<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-6, "print_level"=0, "maxeval"=300000, "population"=1000)

Y4<- nloptr(x0=rep(0.5,9),

eval_f=objective4,

lb=rep(0,9), ub=rep(1,9),

opts=nloptr.options4

)

print(Y4)

`cat("Y4 resulting estimates for :\n")`

print(matrix(Y4$solution, 3, 3, byrow=TRUE))

The run results are:

Call:

nloptr(x0 = rep(0.5, 9), eval_f = objective4, lb = rep(0, 9), ub = rep(1, 9), opts = nloptr.options4)

```
```Minimization using NLopt version 2.4.2

NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )

Number of Iterations....: 300000

Termination conditions: xtol_rel: 1e-06 maxeval: 3e+05

Number of inequality constraints: 0

Number of equality constraints: 0

Current value of objective function: 0.0013835300300619

Current value of controls: 0.66308125177 0.13825982301 0.93439957114 0.92775614187 0.63095968859 0.70967190127 0.3338899268 0.47841968691 0.79082981177

`Y4 resulting estimates for mathbf{A}:`

[,1] [,2] [,3]

[1,] 0.66308125177 0.13825982301 0.93439957114

[2,] 0.92775614187 0.63095968859 0.70967190127

[3,] 0.33388992680 0.47841968691 0.79082981177

In fact, the held back version of used to generate these test data sets was:

and that matches the result rather well. So, in a sense, the algorithm has “learned” from the 8 data pairs presented.

*Miscellaneous notes*

##### All the runs of **nloptr** here were done with the following settings. The algorithm is always *ISRES*. The parameters xrel_tol = 1.0e-4 and population = 1000. maxeval, the number of iterations, varied depending upon the problem. For Y1, Y2, Y3, and Y4 it was 100000, 100000, 100000, and 300000, respectively In all instances, the appropriate optimization controls are given by the nloptr.options*n* variable, where .

Per the description, *ISRES*, which is an acronym for *Improved Stochastic Ranking Evolution Strategy*:

The evolution strategy is based on a combination of a mutation rule (with a log-normal step-size update and exponential smoothing) and differential variation (a Nelder–Mead-like update rule). The fitness ranking is simply via the objective function for problems without nonlinear constraints, but when nonlinear constraints are included the stochastic ranking proposed by Runarsson and Yao is employed. The population size for

ISRESdefaults to 20×(n+1) in n dimensions, but this can be changed with the nlopt_set_population function.This method supports arbitrary nonlinear inequality and equality constraints in addition to the bound constraints, and is specified within

NLoptas NLOPT_GN_ISRES.

Further notes are available in: