Thursday, December 21, 2017

Ridge regression


"Ridge regression adds an additional λI to the matrix XTX so that it's non-singular, and we can take the inverse of the whole thing" [1]

It adds bias. Take a matrix that has rank 1 (that is, every row is just a multiple of the first, so the extra rows add no new information).

import numpy as np
from numpy.linalg import matrix_rank, det, inv, norm

mat_rank_1 = np.matrix('1    2    3;'
                       '10   20   30;'
                       '100  200  300')

We can demonstrate it does indeed have rank 1:

print mat_rank_1, "has rank ", matrix_rank(mat_rank_1)  # it is indeed 1

Now, if we were to use it, say, in linear regression, we'd have problems when we came to invert XTX:

mTm = np.dot(mat_rank_1.T, mat_rank_1)

print "\ndeterminant of rank 1 matrix multiplied by its transpose:", det(mTm)  # and not too surprisingly, the determinant is 0

The following blows up with "numpy.linalg.linalg.LinAlgError: Singular matrix":

inv(mTm)

So, we can add some bias:

lam_I = np.eye(3, 3) * 0.1  # lambda * I, with lambda = 0.1

wbTwb = np.dot(mat_rank_1.T, mat_rank_1) + lam_I
print wbTwb, "has determinant", det(wbTwb), "and rank", matrix_rank(wbTwb)

print inv(wbTwb)

and this doesn't blow up.

"Ridge regression was originally developed to deal with the problem of having more features than data points. But it can also be used to add bias into our estimations. We can use the λ value to impose a maximum value on the sum of all our [weights]" [1]

John D Cook warns: "A little noise makes the system go from theoretically impossible to solve to theoretically possible to solve, but that may not be very useful in practice. It will let you compute a solution, but that solution may be meaningless... You use knowledge of your domain beyond the specific data at hand to guide how you change your matrix."

Ill-conditioned

He goes on: "The condition number of a matrix is the norm of the matrix times the norm of its inverse... A small change to a matrix might not change its norm much, but it might change the norm of its inverse a great deal. If that is the case, the matrix is called ill-conditioned because it has a large condition number.

"You can think of condition number as an error multiplier... Note that condition number isn’t limited to loss of precision due to floating point arithmetic. If you have some error in your input b, say due to measurement error, then you will have some corresponding error in your solution to Ax = b, even if you solve the system Ax = b exactly. If the matrix A is ill-conditioned, any error in b (rounding error, measurement error, statistical uncertainty, etc.) will result in a correspondingly much larger error in x."

In our case, we can calculate this condition number thus:

print "norm of original matrix", norm(wbTwb)
print "norm of inverse of original matrix", norm(inv(wbTwb))
print "Condition number", norm(wbTwb) * norm(inv(wbTwb))

Which results in:

norm of original matrix 141414.1
norm of inverse of original matrix 14.1421356238
Condition number 1999897.38132

Yoiks. That modest λ=0.1 stops the matrix being singular, but leaves it quite ill-conditioned.
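
As an aside, numpy will do this calculation for us: np.linalg.cond(wbTwb, 'fro') reproduces the Frobenius-norm figure above, while the default np.linalg.cond(wbTwb) uses the 2-norm and gives a number of the same order.

from numpy.linalg import cond

print("Frobenius condition number: %g" % cond(wbTwb, 'fro'))
print("2-norm condition number:    %g" % cond(wbTwb))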

Interestingly, making our original matrix rank 3 by avoiding linear dependencies between the rows:

mat_rank_3 = np.matrix('1    2    3;'
                       '20   10   30;'
                       '100  300  200')

reduces the condition number by an order of magnitude even for the same value of λ.

What's more, making λ much bigger (say, 10) brings the condition number of the original rank-1 case down by a further two orders of magnitude. But of course, now we're significantly distorting our data - the "meaningless" solution that Cook warns of.
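
A quick way to see both of these effects is to loop over a few values of λ for each matrix and print the resulting condition numbers (again using np.linalg.cond with the Frobenius norm):

from numpy.linalg import cond

for name, m in [("rank 1", mat_rank_1), ("rank 3", mat_rank_3)]:
    for lam in [0.1, 1.0, 10.0]:
        regularised = np.dot(m.T, m) + lam * np.eye(3)
        print("%s matrix, lambda = %4.1f: condition number %g"
              % (name, lam, cond(regularised, 'fro')))

Bigger λ always improves the conditioning; whether the resulting solution still means anything is exactly the question Cook raises.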

An excellent article on how ill-conditioning can affect neural networks can be found here.

[1] Machine Learning in Action

