Linear systems of equations and their solution form the cornerstone of much Engineering and Science. Linear algebra is a paragon of Mathematics in the sense that its theory is what mathematicians try to emulate when they develop theory for many other less neat subjects. I think Linear Algebra ought to be required mathematics for any scientist or engineer. (For example, I think Quantum Mechanics makes a lot more sense when taught in terms of inner products than just some magic which drops from the sky.) Unfortunately, in many schools, it is not. You can learn it online, and Professor Gilbert Strang’s lectures and books are the best. (I actually prefer the second edition of his Linear Algebra and Its Applications, but I confess I haven’t looked at the fourth edition of the text, just the third, and I haven’t looked at his fifth edition of Introduction to Linear Algebra.)
There’s a lot to learn about numerical methods for linear systems, too, and Strang’s Applications teaches a lot that’s very good there, including the SVD, of which Professor Strang writes “it is not nearly as famous as it should be.” I very much agree. You’ll see it used everywhere, from dealing with some of the linear systems I’ll mention below, to support for Principal Components Analysis in Statistics, to singular-spectrum analysis of time series, to Recommender Systems, a keystone algorithm in so-called Machine Learning work.
The study of numerical linear algebra is widespread and taught in several excellent books. My favorites are Golub and Van Loan’s Matrix Computations, Björck’s Numerical Methods for Least Squares Problems, and Trefethen and Bau’s Numerical Linear Algebra. But it’s interesting how fragile these solution methods are, and how quickly one needs to appeal to Calculus directly with but small changes in these problems. That’s what this post is about.
So what am I talking about? I’ll use small systems of linear equations as examples, despite it being entirely possible and even common to work with systems which have thousands or millions of variables and equations. Here’s a basic one:

$$\begin{aligned} a_{11} x_{1} + a_{12} x_{2} + a_{13} x_{3} &= b_{1} \\ a_{21} x_{1} + a_{22} x_{2} + a_{23} x_{3} &= b_{2} \\ a_{31} x_{1} + a_{32} x_{2} + a_{33} x_{3} &= b_{3} \end{aligned} \tag{1}$$

written for brevity as

$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{2}$$

Of course, in any application the equation looks more like:

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 4 \\ 3 & 4 & 1 \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \\ 16 \end{bmatrix} \tag{3}$$
In R or MATLAB the result is easily obtained. I work in and evangelize R, so any computation here will be recorded in it. Doing
solve(A,b)
or
lm(b ~ A + 0)
will produce

$$\mathbf{x} = \begin{bmatrix} -3 \\ 6 \\ 1 \end{bmatrix}$$
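For concreteness, here is that computation end to end, using the $\mathbf{A}$ and $\mathbf{b}$ of (3):

A<- matrix(c(1, 2, 3, 2, 1, 4, 3, 4, 1), 3, 3, byrow=TRUE)
b<- c(12, 4, 16)
solve(A, b)          # direct solution: -3  6  1
coef(lm(b ~ A + 0))  # least-squares route; same answer for this square, consistent system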
It’s also possible to solve several such systems at once, by collecting the corresponding right-hand sides as the columns of a matrix $\mathbf{B}$ and solving $\mathbf{A}\mathbf{X} = \mathbf{B}$ for the matrix of solutions $\mathbf{X}$, for example with solve(A, B).
And, of course, having an unknown $\mathbf{b}$ but a known $\mathbf{A}$ and $\mathbf{x}$ is direct, just using matrix multiplication:

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 4 \\ 3 & 4 & 1 \end{bmatrix} \begin{bmatrix} -3 \\ 6 \\ 1 \end{bmatrix}$$

yielding:

$$\mathbf{b} = \begin{bmatrix} 12 \\ 4 \\ 16 \end{bmatrix}$$
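In R that forward computation is a single line, reusing the same $\mathbf{A}$:

A<- matrix(c(1, 2, 3, 2, 1, 4, 3, 4, 1), 3, 3, byrow=TRUE)
A %*% matrix(c(-3, 6, 1), 3, 1)   # returns the column vector (12, 4, 16)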
Linear Algebra gives us sensible ways to interpret inconsistent systems, ones with more equations than unknowns where no $\mathbf{x}$ satisfies every equation exactly, by making reasonable assumptions about what the solution to such a system should mean. R, via lm(.), gives the least-squares answer, the $\mathbf{x}$ which makes the residual $\mathbf{b} - \mathbf{A}\mathbf{x}$ as short as possible.
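As a stand-in illustration (the numbers here are my own, chosen only to make the system overdetermined and inconsistent):

# 4 equations, 2 unknowns, deliberately inconsistent
A<- matrix(c(1, 1, 1, 2, 1, 3, 1, 4), 4, 2, byrow=TRUE)
b<- c(6, 5, 7, 10)
coef(lm(b ~ A + 0))   # the least-squares compromise; no x satisfies A x = b exactly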
Sets of solutions to underdetermined systems, ones with fewer equations than unknowns so that many $\mathbf{x}$ satisfy them, can also be countenanced, and there is even a way, which I’ll talk about below, for picking out a unique one: the minimum norm solution. This is where the SVD comes in. To learn about all the ways these things can be thought about and done, I recommend:
D. D. Jackson, “Interpretation of inaccurate, insufficient and inconsistent data”, Geophysical Journal International, 1972, 28(2), 97-109.
(That’s an awesome title for a paper, by the way.)
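To make the minimum-norm idea concrete, here is a small underdetermined system of my own devising, with its minimum $L_{2}$-norm solution computed from the SVD via the pseudoinverse:

# 2 equations, 3 unknowns: infinitely many solutions, so pick the shortest one
A<- matrix(c(1, 2, 3, 4, 5, 6), 2, 3, byrow=TRUE)
b<- c(7, 8)
s<- svd(A)
# pseudoinverse; assumes full row rank so no singular value is zero
x.min<- s$v %*% diag(1/s$d) %*% t(s$u) %*% b
x.min            # the minimum-norm solution
A %*% x.min      # reproduces b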
What if there are holes?
Going back to (3), however, suppose instead it looks something like this:

$$\begin{bmatrix} a_{11} & 2 & 3 \\ 2 & 1 & 4 \\ 3 & 4 & 1 \end{bmatrix} \begin{bmatrix} -3 \\ 6 \\ 1 \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \\ 16 \end{bmatrix}$$

and we don’t know what $a_{11}$ is. Can it be calculated?

Well, it has to be able to be calculated: It’s the only unknown in this system, with the rules of matrix multiplication just being a shorthand for combining things. So, it’s entirely correct to think that the constants could be manipulated algebraically so they all show up on one side of the equals sign, and $a_{11}$ on the other. That’s a lot of algebra, though.
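Just to make that concrete (a small check of my own, not part of the post’s script): the first equation alone isolates $a_{11}$.

# a11*(-3) + 2*6 + 3*1 = 12, so rearranging:
a11<- (12 - 2*6 - 3*1) / (-3)
a11   # 1, agreeing with the original A in (3)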
We might guess that $\mathbf{A}$ was symmetric and so think $a_{32} = a_{23}$. But what about the following case, in (12)?

$$\begin{bmatrix} a_{11} & 2 & 3 \\ 2 & 1 & a_{23} \\ a_{31} & a_{23} & 1 \end{bmatrix} \begin{bmatrix} -3 \\ 6 \\ 1 \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \\ 16 \end{bmatrix} \tag{12}$$
Now there are 3 unknowns, $a_{11}$, $a_{23}$, and $a_{31}$. The answer is available in (3), but suppose that wasn’t known?
This problem is one of finding those parameters, searching for them if you like. To search, it helps to have a measure of how far away from a goal one is, that is, some kind of score. (14) is what I propose as a score, obtained by taking (12) and rewriting it as below, (13):

$$\mathbf{b} - \mathbf{A}\mathbf{x} = \mathbf{0} \tag{13}$$

$$\left\| \mathbf{b} - \mathbf{A}\mathbf{x} \right\|_{2} \tag{14}$$

$\left\| \cdot \right\|_{2}$ is an $L_{2}$ norm, and $\left\| \mathbf{v} \right\|_{2} = \sqrt{\sum_{i} v_{i}^{2}}$. In other words, $\left\| \mathbf{v} \right\|_{2}$ is the length of the vector $\mathbf{v}$. It’s non-negative. Accordingly, making (14) as small as possible means pushing the left and right sides of (12) towards each other. When (14) is zero the left and right sides are equal.
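As a tiny check of the scoring idea (a snippet of my own, not from the post’s script): plugging the true entries from (3) into (14) gives a score of exactly zero.

A.true<- matrix(c(1, 2, 3, 2, 1, 4, 3, 4, 1), 3, 3, byrow=TRUE)
x<- matrix(c(-3, 6, 1), 3, 1)
b<- matrix(c(12, 4, 16), 3, 1)
sqrt(sum((b - A.true %*% x)^2))   # 0: left and right sides of (12) coincide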
Now, there are many possible values for $a_{11}$, $a_{23}$, and $a_{31}$. In most applications, considering all flonum values for these is not necessary. Typically, the application suggests a reasonable range for each of them, from a low value to a high value. Let $[\alpha_{11}, \beta_{11}]$ be the range of values for $a_{11}$, $[\alpha_{23}, \beta_{23}]$ be the range of values for $a_{23}$, and $[\alpha_{31}, \beta_{31}]$ be the range of values for $a_{31}$, each dictated by the application. If $\sigma_{11}$, $\sigma_{23}$, and $\sigma_{31}$ are each randomly but independently chosen from the unit interval, then a particular value of (14) can be expressed

$$\left\| \mathbf{b} - \hat{\mathbf{A}}(\sigma_{11}, \sigma_{23}, \sigma_{31})\, \mathbf{x} \right\|_{2} \tag{16}$$

where $\hat{\mathbf{A}}(\sigma_{11}, \sigma_{23}, \sigma_{31})$ is the matrix of (12) with its unknown cells filled in by

$$\hat{a}_{11} = \alpha_{11}(1 - \sigma_{11}) + \beta_{11}\,\sigma_{11}, \quad \hat{a}_{23} = \alpha_{23}(1 - \sigma_{23}) + \beta_{23}\,\sigma_{23}, \quad \hat{a}_{31} = \alpha_{31}(1 - \sigma_{31}) + \beta_{31}\,\sigma_{31}$$
So, this is an optimization problem where what’s wanted is to make (16) as small as possible, searching among triplets of values for $\sigma_{11}$, $\sigma_{23}$, and $\sigma_{31}$. How does that get done? With the R package nloptr, a package from CRAN which does a rich set of numerical nonlinear optimizations, allowing the user to choose the algorithm and other controls, like ranges of search and constraints upon the control parameters.
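Before the problem at hand, here is a minimal, illustrative nloptr call of my own, just to show the shape of the interface: a toy one-dimensional objective, bound constraints, and the ISRES algorithm used throughout this post.

library(nloptr)

toy<- function(x) (x - 0.3)^2   # minimized at x = 0.3
fit<- nloptr(x0=0.5, eval_f=toy, lb=0, ub=1,
             opts=list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-6, "maxeval"=10000))
fit$solution   # close to 0.3; ISRES is stochastic, so not exact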
Another reason why these techniques are interesting is that it is intriguing and fun to see how far one can get knowing very little. And when little is known, letting algorithms run for a while to make up for that ignorance doesn’t seem like such a bad trade.
An illustration
In order to illustrate the “I don’t know much” case, I’m opting for:

$$-2 \le a_{11} \le 2, \qquad -1 \le a_{23} \le 8, \qquad -6 \le a_{31} \le 6$$
What a run produces is:
Call:
nloptr(x0 = rep(0.5, 3), eval_f = objective1, lb = rep(0, 3), ub = rep(1, 3), opts = nloptr.options1, alpha.beta = alpha.beta)
Minimization using NLopt version 2.4.2
NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )
Number of Iterations....: 100000
Termination conditions: xtol_rel: 1e-04 maxeval: 1e+05
Number of inequality constraints: 0
Number of equality constraints: 0
Current value of objective function: 0.000734026668840609
Current value of controls: 0.74997066329 0.5556247383 0.75010835335
Y1 resulting estimates for $a_{11}$, $a_{23}$, and $a_{31}$ are: 1.00, 4.00, 3
That’s nloptr-speak for reporting on the call, the termination conditions, and the result. The bottom line in bold tells what was expected: that $a_{11} = 1$, $a_{23} = 4$, and $a_{31} = 3$.
What about the code? The pertinent portion is shown below, and all the code is downloadable as a single R script from here. There’s also a trace of the execution of that script available.
L2norm<- function(x)
{
sqrt( sum(x*x) )
}
# map sigma in [0, 1] affinely onto the interval [alpha, beta]
r<- function(sigma, alpha, beta)
{
stopifnot( (0 <= sigma) && (sigma <= 1) )
stopifnot( alpha < beta )
alpha*(1 - sigma) + beta*sigma
}
# Recall original was:
#
# A<- matrix(c(1, 2, 3, 2, 1, 4, 3, 4, 1), 3, 3, byrow=TRUE)
P1.func<- function(x, alpha.beta)
{
stopifnot( is.vector(x) )
stopifnot( 3 == length(x) )
#
sigma11<- x[1]
sigma23<- x[2]
sigma31<- x[3]
alpha11<- alpha.beta[1]
beta11<- alpha.beta[2]
alpha23<- alpha.beta[3]
beta23<- alpha.beta[4]
alpha31<- alpha.beta[5]
beta31<- alpha.beta[6]
#
P1<- matrix( c( r(sigma11,alpha11,beta11), 2, 3,
2, 1, r(sigma23,alpha23,beta23),
r(sigma31,alpha31,beta31), r(sigma23,alpha23,beta23), 1
),
nrow=3, ncol=3, byrow=TRUE )
return(P1)
}
objective1<- function(x, alpha.beta)
{
stopifnot( is.vector(x) )
stopifnot( 3 == length(x) )
b<- matrix(c(12,4,16),3,1)
x.right<- matrix(c(-3,6,1),3,1)
P1<- P1.func(x, alpha.beta)
d<- b - P1 %*% x.right
# L2 norm
return( L2norm(d) )
}
nloptr.options1<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-6, "print_level"=0, "maxeval"=100000, "population"=1000)
alpha.beta<- c(-2, 2, -1, 8, -6, 6)
Y1<- nloptr(x0=rep(0.5,3),
eval_f=objective1,
lb=rep(0,3), ub=rep(1,3),
opts=nloptr.options1,
alpha.beta=alpha.beta
)
print(Y1)
cat(sprintf("Y1 resulting estimates for a_{11}, a_{23}, and a_{31} are: %.2f, %.2f, %2.f\n",
r(Y1$solution[1], alpha.beta[1], alpha.beta[2]), r(Y1$solution[2], alpha.beta[3], alpha.beta[4]),
r(Y1$solution[3], alpha.beta[5], alpha.beta[6])))
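A quick sanity check after such a run (my addition; it assumes the script above has just been executed) is to map the recovered sigmas back through r(.) and multiply:

A.hat<- P1.func(Y1$solution, alpha.beta)
print(A.hat)                                # estimated version of the matrix in (12)
print(A.hat %*% matrix(c(-3, 6, 1), 3, 1))  # should be very close to (12, 4, 16)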
But what is it good for? Case 1: Markov chain transition matrices
Consider again the coefficient matrix from (1):

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

This happens also to be the template for a 3-state Markov chain, with its many applications.
The following example is taken from the famous paper by Rabiner, as presented by Resch:
- L. R. Rabiner, “A tutorial on Hidden Markov Models and selected applications in speech recognition”, Proceedings of the IEEE, February 1989, 77(2), DOI:10.1109/5.18626.
- B. Resch, “Hidden Markov Models”, notes for the course Computational Intelligence, Graz University of Technology, 2011.
- N. J. Welton, A. E. Ades, “Estimation of Markov Chain transition probabilities and rates from fully and partially observed data: Uncertainty propagation, evidence synthesis, and model calibration”, Medical Decision Making, 2005, 25(6), 633-645.
- T. P. Runarsson, X. Yao, “Search biases in constrained evolutionary optimization,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2005, 35(2), 233-243.
- T. P. Runarsson, X. Yao, “Stochastic ranking for constrained evolutionary optimization,” IEEE Transactions on Evolutionary Computation, 2000, 4(3), 284-294.
They begin with a transition diagram among the three weather states, which if cast into the form of (1′) and (2) looks like:

$$\mathbf{A} = \begin{bmatrix} 0.8 & 0.05 & 0.15 \\ 0.2 & 0.6 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{bmatrix} \tag{18}$$
The rows, top-to-bottom, are labeled sunny, rainy, and foggy, as are the columns, left-to-right. Cell $(i, j)$ gives the probability of going from state $i$ to state $j$. For example, the probability of going from sunny to foggy is 0.15. Resch presents a prettier rendition of the same matrix as a tableau in his notes.
Resch and Rabiner go on to teach Hidden Markov Models (“HMMs”), where $\mathbf{A}$ is not known and, moreover, the weather is not directly observed. Instead, information about the weather is obtained by observing whether or not a third party takes an umbrella to work. Here, however, suppose the weather is directly known. And suppose $\mathbf{A}$ is known except that nothing is known about what happens after foggy, except when it remains foggy. Symbolically:

$$\begin{bmatrix} 0.8 & 0.05 & 0.15 \\ 0.2 & 0.6 & 0.2 \\ a_{31} & a_{32} & 0.5 \end{bmatrix} \tag{19}$$
Note in (18) or Resch’s tableau how the rows each sum to one. This is a characteristic of first order Markov models: Once in a state, the transition has to go somewhere, even if to stay in that state. Transitions can’t just cause the system to disappear, so all the outgoing probabilities need to sum to one. This means, however, that when the unknown foggy row is introduced, there aren’t two unconstrained parameters, there is only one. Accordingly, rather than introducing $a_{32}$, I could simply write $a_{32} = 1 - 0.5 - a_{31} = 0.5 - a_{31}$. As it turns out, in my experience with nloptr, it is often better to specify this constraint explicitly so the optimizer knows about it, rather than building it implicitly into the objective function, even at the price of introducing another parameter and its space to explore.
The challenge I’ll pose here is somewhat tougher than that faced by HMMs. The data in hand is not a series of sunny, rainy, or foggy weather records but, because, say, the records were jumbled, all that’s in hand is a count of how many sunny, rainy, and foggy days there were, and a count of the kinds of days which followed them. In particular:

$$\begin{bmatrix} 1020 \\ 301 \\ 155 \end{bmatrix}$$

meaning that, across the set of day pairs, the first day was sunny 1020 times, rainy 301 times, and foggy 155 times. Statistical spidey sense wonders about how many observations are needed to pin down transition probabilities well, but let’s set that aside for now. (At least it’s plausible that if ordering information is given up, there might be a need for more count information.) And the count of what the weather was on the second days is:

$$\begin{bmatrix} 854 \\ 416 \\ 372 \end{bmatrix}$$

or 854 sunny days, 416 rainy days, and 372 foggy days.
Note that unlike in (16), here in (19) there is no need to pick upper and lower bounds on the unknown values: These are probabilities, so by definition they are limited to the unit interval. But always $a_{31} + a_{32} = 0.5$, so that constraint needs to be stated.
Here’s the code:
P2.func<- function(x)
{
# Sunny, Rainy, Foggy
stopifnot( is.vector(x) )
stopifnot( 2 == length(x) )
#
a.31<- x[1]
a.32<- x[2]
#
P2<- matrix( c( 0.8, 0.05, 0.15,
0.2, 0.6, 0.2,
a.31, a.32, 0.5
),
nrow=3, ncol=3, byrow=TRUE )
return(P2)
}
objective2<- function(x)
{
stopifnot( is.vector(x) )
stopifnot( 2 == length(x) )
x.right<- matrix(c(1020, 301, 155), 3, 1)
b<- matrix(c(854, 416, 372),3,1)
P2<- P2.func(x)
d<- b - P2 %*% x.right
# L2 norm
return( L2norm(d) )
}
constraint2<- function(x)
{
# equality constraint: a_31 + a_32 must equal 0.5 so the foggy row sums to one
return( (x[1] + x[2] - 0.5 ))
}
nloptr.options2<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-4, "print_level"=0, "maxeval"=100000, "population"=1000)
Y2<- nloptr(x0=rep(0.5,2),
eval_f=objective2,
eval_g_eq=constraint2,
lb=rep(0,2), ub=rep(1,2),
opts=nloptr.options2
)
print(Y2)
cat(sprintf("Y2 resulting estimates for a_{31}, a_{32} are: %.2f, %.2f\n",
Y2$solution[1], Y2$solution[2]))
This run results in:
Call:
nloptr(x0 = rep(0.5, 2), eval_f = objective2, lb = rep(0, 2), ub = rep(1, 2), eval_g_eq = constraint2, opts = nloptr.options2)
Minimization using NLopt version 2.4.2
NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )
Number of Iterations....: 100000
Termination conditions: xtol_rel: 1e-04 maxeval: 1e+05
Number of inequality constraints: 0
Number of equality constraints: 1
Current value of objective function: 0.500013288564363
Current value of controls: 0.20027284199 0.29972776012
Y2 resulting estimates for $a_{31}$, $a_{32}$ are: 0.20, 0.30
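As a quick verification of my own (not part of the original script), plugging the recovered values back into P2.func and multiplying by the first-day counts reproduces the observed second-day counts:

P2.hat<- P2.func(c(0.20, 0.30))
round(P2.hat %*% matrix(c(1020, 301, 155), 3, 1))   # gives (854, 416, 372)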
Suppose some of the data is missing? In particular, suppose instead:

$$\begin{bmatrix} 0.8 & 0.05 & 0.15 \\ 0.2 & 0.6 & 0.2 \\ a_{31} & a_{32} & 0.5 \end{bmatrix} \begin{bmatrix} 1020 \\ 155 + \eta\,(1020 - 155) \\ 155 \end{bmatrix} = \begin{bmatrix} 854 \\ 416 \\ 372 \end{bmatrix}$$

where $\eta$ is on the unit interval, and so all that’s known of the middle count is that it lies between 155 and 1020, that is, bounded by the other two terms in the first-day count vector.

Now there are effectively two free parameters to search, $a_{31}$ (with $a_{32}$ following from the constraint) and $\eta$, and apart from lying on the unit interval they are unconstrained. The code for this is:
P3.func<- function(x)
{
# Sunny, Rainy, Foggy
stopifnot( is.vector(x) )
stopifnot( 3 == length(x) )
#
a.31<- x[1]
a.32<- x[2]
# There's an x[3] but it isn't used in the P3.func. See
# the objective3.
#
P3<- matrix( c( 0.8, 0.05, 0.15,
0.2, 0.6, 0.2,
a.31, a.32, 0.5
),
nrow=3, ncol=3, byrow=TRUE )
return(P3)
}
objective3<- function(x)
{
stopifnot( is.vector(x) )
stopifnot( 3 == length(x) )
x.right<- matrix(c(1020, r(x[3], 155, 1020), 155), 3, 1)
b<- matrix(c(854, 416, 372),3,1)
P3<- P3.func(x)
d<- b - P3 %*% x.right
# L2 norm
return( L2norm(d) )
}
constraint3<- function(x)
{
stopifnot( 3 == length(x) )
return( (x[1] + x[2] - 0.5 ))
}
nloptr.options3<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-4, "print_level"=0, "maxeval"=100000, "population"=1000)
Y3<- nloptr(x0=rep(0.5,3),
eval_f=objective3,
eval_g_eq=constraint3,
lb=rep(0,3), ub=rep(1,3),
opts=nloptr.options3
)
print(Y3)
cat(sprintf("Y3 resulting estimates for a_{31}, a_{32}, and eta are: %.2f, %.2f, %.2f\n",
Y3$solution[1], Y3$solution[2], Y3$solution[3]))
The results are:
Call:
nloptr(x0 = rep(0.5, 3), eval_f = objective3, lb = rep(0, 3), ub = rep(1, 3), eval_g_eq = constraint3, opts = nloptr.options3)
Minimization using NLopt version 2.4.2
NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )
Number of Iterations....: 100000
Termination conditions: xtol_rel: 1e-04 maxeval: 1e+05
Number of inequality constraints: 0
Number of equality constraints: 1
Current value of objective function: 0.639962390444759
Current value of controls: 0.20055501795 0.29944464945 0.16847867543
Y3 resulting estimates for $a_{31}$, $a_{32}$, and $\eta$ are: 0.20, 0.30, 0.17, with that $\eta$ corresponding to a count of 301.
That recovered 301 agrees with the 301 actually held out of the first-day counts, so the estimate isn’t bad at all.
For an example of where this kind of estimation is done more generally, see the Welton and Ades paper cited above.
But what is it good for? Case 2: Learning prediction matrices
When systems like (2) arise in cases of statistical regression, the matrix $\mathbf{A}$ is called a prediction or design matrix. The idea is that its columns represent sequences of predictions for the response, represented by the column vector $\mathbf{b}$, and the purpose of regression is to find the best weights, represented by the column vector $\mathbf{x}$, for predicting the response.

Consider (2) again, but instead of $\mathbf{x}$ and $\mathbf{b}$ being column vectors, as in (5), they are matrices, $\mathbf{X}$ and $\mathbf{B}$, respectively. In other words, the situation is that there are lots of $(\mathbf{x}, \mathbf{b})$ pairs available. And then suppose nothing is known about $\mathbf{A}$, that is, it just contains nine unknown parameters:

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$
There are, of course, questions about how many pairs are needed, in tandem with the choice of the number of iterations (see the maxeval discussion in Miscellaneous Notes below). Here, 8 pairs were used for purposes of illustration:

$$\mathbf{X} = \begin{bmatrix} 1356 & 7505 & 4299 & 3419 & 7132 & 1965 & 8365 & 8031 \\ 5689 & 8065 & 7001 & 638 & 8977 & 1088 & 3229 & 1016 \\ 3777 & 8135 & 3689 & 1993 & 3635 & 9776 & 8967 & 7039 \end{bmatrix}$$

and

$$\mathbf{B} = \begin{bmatrix} 5215 & 13693 & 7265 & 4217 & 9367 & 10588 & 14372 & 12043 \\ 7528 & 17825 & 11024 & 4989 & 14860 & 9447 & 16162 & 13087 \\ 6161 & 12798 & 7702 & 3023 & 9551 & 8908 & 11429 & 8734 \end{bmatrix}$$
The code for this case is:
objective4<- function(x)
{
stopifnot( is.vector(x) )
stopifnot( 9 == length(x) )
B<- matrix(c(5215, 13693, 7265, 4217, 9367, 10588, 14372, 12043,
7528, 17825, 11024, 4989, 14860, 9447, 16162, 13087,
6161, 12798, 7702, 3023, 9551, 8908, 11429, 8734
), 3, 8, byrow=TRUE)
X.right<- matrix(c(1356, 7505, 4299, 3419, 7132, 1965, 8365, 8031,
5689, 8065, 7001, 638, 8977, 1088, 3229, 1016,
3777, 8135, 3689, 1993, 3635, 9776, 8967, 7039
), 3, 8, byrow=TRUE)
P4<- matrix(x, nrow=3, ncol=3, byrow=TRUE)
d<- B - P4 %*% X.right
# L2 norm for matrix
return( L2normForMatrix(d, scaling=1000) )
}
nloptr.options4<- list("algorithm"="NLOPT_GN_ISRES", "xtol_rel"=1.0e-6, "print_level"=0, "maxeval"=300000, "population"=1000)
Y4<- nloptr(x0=rep(0.5,9),
eval_f=objective4,
lb=rep(0,9), ub=rep(1,9),
opts=nloptr.options4
)
print(Y4)
cat("Y4 resulting estimates for :\n")
print(matrix(Y4$solution, 3, 3, byrow=TRUE))
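One helper used in objective4, L2normForMatrix, lives in the downloadable script rather than in the excerpt above. A minimal version consistent with how it is called here, assuming it is simply the Frobenius norm of the residual matrix divided by the scaling constant, would be:

# assumed definition, not taken from the original script: Frobenius norm of a
# matrix, divided by a scaling constant to keep the objective modestly sized
L2normForMatrix<- function(x, scaling=1)
{
sqrt( sum(x*x) ) / scaling
}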
The run results are:
Call:
nloptr(x0 = rep(0.5, 9), eval_f = objective4, lb = rep(0, 9), ub = rep(1, 9), opts = nloptr.options4)
Minimization using NLopt version 2.4.2
NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was reached. )
Number of Iterations....: 300000
Termination conditions: xtol_rel: 1e-06 maxeval: 3e+05
Number of inequality constraints: 0
Number of equality constraints: 0
Current value of objective function: 0.0013835300300619
Current value of controls: 0.66308125177 0.13825982301 0.93439957114 0.92775614187 0.63095968859 0.70967190127 0.3338899268 0.47841968691 0.79082981177
Y4 resulting estimates for mathbf{A}:
[,1] [,2] [,3]
[1,] 0.66308125177 0.13825982301 0.93439957114
[2,] 0.92775614187 0.63095968859 0.70967190127
[3,] 0.33388992680 0.47841968691 0.79082981177
In fact, the held-back version of $\mathbf{A}$ used to generate these test data sets matches this result rather well, and so, in a sense, the algorithm has “learned” $\mathbf{A}$ from the 8 data pairs presented.
Miscellaneous notes
All the runs of nloptr here were done with the following settings. The algorithm is always ISRES. The population parameter was always 1000, and xtol_rel was 1.0e-4 (1.0e-6 for the Y1 and Y4 runs). maxeval, the number of iterations, varied depending upon the problem: for Y1, Y2, Y3, and Y4 it was 100000, 100000, 100000, and 300000, respectively. In all instances, the appropriate optimization controls are given by the nloptr.optionsn variable, where n ∈ {1, 2, 3, 4}.
Per the NLopt documentation, ISRES is an acronym for Improved Stochastic Ranking Evolution Strategy:
The evolution strategy is based on a combination of a mutation rule (with a log-normal step-size update and exponential smoothing) and differential variation (a Nelder–Mead-like update rule). The fitness ranking is simply via the objective function for problems without nonlinear constraints, but when nonlinear constraints are included the stochastic ranking proposed by Runarsson and Yao is employed. The population size for ISRES defaults to 20×(n+1) in n dimensions, but this can be changed with the nlopt_set_population function.
This method supports arbitrary nonlinear inequality and equality constraints in addition to the bound constraints, and is specified within NLopt as NLOPT_GN_ISRES.
Further notes are available in the two Runarsson and Yao papers cited above.