5.0 Controlling ML Models - Regularisation in Practice

Questions and discussion for this lecture live here. Hit Reply below to ask a question.

Thanks for this lecture on regularisation (or, inevitably, “regularization”). The technique used by Lasso is evidently very effective (and, trying Ridge, it also seems almost as good)… But, as I understand it, the main point that Tim makes is that, in linear regression, the relationship can be a linear combination of any functions/features of the independent variable(s).
So I tried messing about with features which are exponentials, and Lasso seems to handle them too, but nowhere near as well as polynomials…
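(For concreteness, here’s a minimal sketch of the sort of experiment I mean — toy data and features of my own invention, using scikit-learn’s Lasso:)

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Toy data: a smooth 1-D relationship with a little noise (made up for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Polynomial features x, x^2, ..., x^8 of the (already 0-1) input
X_poly = PolynomialFeatures(degree=8, include_bias=False).fit_transform(x[:, None])

# Exponential features exp(k * x) for a few hand-picked rates k
ks = np.array([1.0, 2.0, 3.0, 4.0])
X_exp = np.exp(x[:, None] * ks)

for name, X in [("polynomial", X_poly), ("exponential", X_exp)]:
    model = Lasso(alpha=1e-3, max_iter=100_000).fit(X, y)
    print(name, "R^2 =", round(model.score(X, y), 3))
```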

You invited us to keep the questions coming: so here’s another one!
What is this business of scaling the independent variable(s)?! Probably due to my background (as an applied physicist), my mathematical knowledge is generally weak, and I’ve just learned techniques as I need them for application in specific areas. But … I thought that these regression techniques worked best with “affine functions” (which, as I understand it, have the same “shape” regardless of scale) and, also as I understand it, functions which raise a variable to a power have that property. More generally, if scaling affects the result, why is the range 0.0 to 1.0 special?

Hi Jon,

I am sorry that I missed these in the move over. You raise a good point.

Polynomials tend, in general, to work very well because of the Taylor expansion (or series): any sufficiently smooth function can be expressed as a linear combination of polynomial terms expanded about a point.

Taylor series - Wikipedia.

Equally, expansions in terms of sine and cosine waves of different wavelengths work well, since all continuous functions are well represented in Fourier space.

The key thing here is that the “features” matter. In linear models you have to put them in at the beginning, and this requires some engineering judgement. With more advanced models you learn the representation as well!
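To make this concrete, here is a quick sketch (not lecture code, just an illustration assuming scikit-learn) of giving the same linear model either polynomial features or Fourier features — the “linear” part refers only to the coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Toy target on [0, 1] (illustrative only)
x = np.linspace(0.0, 1.0, 300)
y = np.exp(-x) * np.sin(4 * np.pi * x)

# Feature set 1: polynomial terms x, x^2, ..., x^10 (a Taylor-style expansion)
X_poly = PolynomialFeatures(degree=10, include_bias=False).fit_transform(x[:, None])

# Feature set 2: Fourier terms sin(2*pi*k*x) and cos(2*pi*k*x) for k = 1..5
ks = np.arange(1, 6)
X_fourier = np.hstack([np.sin(2 * np.pi * x[:, None] * ks),
                       np.cos(2 * np.pi * x[:, None] * ks)])

# The model stays linear; only the hand-engineered features change
for name, X in [("polynomial", X_poly), ("Fourier", X_fourier)]:
    print(name, "R^2 =", round(Ridge(alpha=1e-4).fit(X, y).score(X, y), 3))
```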

Hope this helps.

Tim

Generally with polynomials you want to normalise your data, otherwise the parameters you learn become very sensitive.

Take, for example, a model where the input is x and x ranges from 0 to 1000.

When I build a polynomial model I then have a_0 + a_1*x + ... + a_5*x^5.

When x = 1000, x^5 = 10^15, so the value of a_5 really drives the response and my fitting is very sensitive to it.

If x is rescaled so that its maximum is 1, then at x = 1, x^5 is 1, hence the output is insensitive to the precision of a_5.
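A quick numerical illustration of that sensitivity (toy numbers only):

```python
# How much does a tiny error in a_5 move the prediction?
delta_a5 = 1e-6               # small perturbation of the degree-5 coefficient

x_raw = 1000.0                # unscaled input at the top of its range
x_scaled = x_raw / 1000.0     # same point after rescaling the input to a max of 1

print(delta_a5 * x_raw**5)    # 1e-6 * 1e15 = 1e9  -> the output swings wildly
print(delta_a5 * x_scaled**5) # 1e-6 * 1    = 1e-6 -> the output barely notices
```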

Does this answer the question?

Thank you, Tim. Where to start?!? (This might not be short!)

I think I understand your specific point about normalisation … is this ultimately about the stability of the numerical solution due to the finite precision of the numerical values? If so, that makes sense.

Presumably, to make sense of the results, it’s also necessary to know the factor(s) by which the values were normalised so that they can be converted back to their original scale for interpretation later.
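(I am guessing something like scikit-learn’s MinMaxScaler is the intended way to keep track of those factors — a minimal sketch of what I mean:)

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0], [250.0], [1000.0]])    # toy unscaled inputs

scaler = MinMaxScaler()                     # maps each column onto the range 0-1
X_scaled = scaler.fit_transform(X)          # the scaler remembers the min and range
X_back = scaler.inverse_transform(X_scaled) # recover the original units

print(X_scaled.ravel())   # [0.   0.25 1.  ]
print(X_back.ravel())     # [   0.  250. 1000.]
```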

On your more general point about polynomials, I think I also have a lot to (re)learn! During my “applied physicist” course, one of the (actually, very!) interesting topics was titled “approximate mathematics” (which is often all that physicists need!). I recall one major guideline when (approximately!) solving equations: if the boundary conditions are at finite locations, use polynomials; if they’re at infinity, use exponentials. Having later engaged in research using Fourier approaches, I also have to say that I still tend to have a predilection for (complex) exponentials! But maybe I haven’t fully understood which can express which!

I have yet another (more general) question! In all of the examples so far, the results have been a set of “values” (parameters/coefficients/hyperparameters/whatevers) which can be used to predict a result. However, the training has only occurred over a specific range of the independent variable(s). Doesn’t that range specify the region of applicability of the predictions? If so, shouldn’t that range of independent variable(s), as well as the “values”, be reported as part of the results?
While implementing your solutions/walkthroughs, I have also been applying the predictions in regions outside the range of the training data. The results are very variable!! For simple polynomial approaches, they scream off to (+/-) infinity very quickly; for more sophisticated approaches the departure is less rapid, but it happens sooner or later.
Presumably this is about interpolation versus extrapolation?
I.e., in general, how important is the inclusion of the range of validity of the independent variable(s) in the reporting of the results?
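(For what it’s worth, here is roughly the kind of check I have been doing — a toy polynomial fit of my own, not your walkthrough code:)

```python
import numpy as np

# Degree-5 polynomial fitted to noisy sin data on [0, 1] (a toy example)
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 50)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.standard_normal(x_train.size)
coeffs = np.polyfit(x_train, y_train, deg=5)

# Predictions inside and well outside the training range
for x in [0.5, 1.0, 2.0, 5.0]:
    print(f"x = {x}: prediction = {np.polyval(coeffs, x):.3g}")
# Inside [0, 1] the predictions stay around the size of the training targets;
# beyond that the x^5 term takes over and they run away rapidly.
```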

Hi Jon,

All great questions.

You have got it with the first one. Numerical stability is the issue - perfect. It is also important in methods where you look at distances in input space. The simplest example I cover is k-nearest neighbours: if the inputs are scaled differently, then the distance is dominated by one variable over another. Hopefully I covered that fully in the KNN lecture.
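A tiny illustration of the distance point (toy numbers, not the lecture code):

```python
import numpy as np

# Two features on very different scales: e.g. income in pounds and age in years
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 60.0])

# Raw Euclidean distance is dominated by the income axis;
# the 35-year age difference barely registers
print(np.linalg.norm(a - b))                  # ~1000.6

# After scaling each feature to 0-1 (using made-up ranges 0-100k and 0-100),
# both features contribute sensibly to the distance
scale = np.array([100_000.0, 100.0])
print(np.linalg.norm(a / scale - b / scale))  # ~0.35
```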

Your points on polynomials, Fourier expansions and exponentials make complete sense. The bottom line is that for linear models you have to choose your features, and the choice of these features plays a key role in how well you can “approximate” the true underlying relationship. A good example of this in physics is that Fourier modes are not good for approximating a localised function (e.g. a soliton), because Fourier modes are all periodic and have wiggling tails out to infinity; infinitely many different wavelengths then have to cancel each other out in the tails. Exponentials, on the other hand, are perfect for describing localised features.

Your final point tells me you are a man after my own heart. You are a Bayesian. As you rightly say, the problem with models which just predict, like the ones we cover (and also neural networks), is that you don’t really have a feel for how far you are extrapolating from the data. The answer to this (I believe) is that you need Bayesian Linear Regression . . . which is to come in Mikkel’s course in the next few weeks, called “Introduction to Bayesian Uncertainty Quantification”. A quick Google search gives this as an example with some code: https://alpopkes.com/posts/machine_learning/bayesian_linear_regression/ but Mikkel will break this down a bit more for you.
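As a teaser (my own quick sketch with scikit-learn’s BayesianRidge, not Mikkel’s material): the predictive standard deviation grows as you move away from the training data, which gives you exactly that feel for how far you are extrapolating.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import PolynomialFeatures

# Toy training data on [0, 1]
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 40)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(x_train.size)

poly = PolynomialFeatures(degree=5, include_bias=False)
model = BayesianRidge().fit(poly.fit_transform(x_train[:, None]), y_train)

# Predictive mean and standard deviation inside and outside the training range
x_test = np.array([0.5, 1.0, 1.5, 2.0])
mean, std = model.predict(poly.transform(x_test[:, None]), return_std=True)
for xi, m, s in zip(x_test, mean, std):
    print(f"x = {xi}: mean = {m:.2f}, std = {s:.2f}")
# The standard deviation grows as x moves away from the data the model was fitted on.
```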

Glad to hear you enjoyed the Bluebird demo at British Science Week.

Best

Tim