2.0 Projecting data features into higher dimensions

Questions and discussion for this lecture live here. Hit Reply below to ask a question.

At last, I’m getting started on this course by Andy Corbett, and it’s very interesting!

I think I spotted something odd in this lesson (“2.0 Projecting data features …”) and, having followed lessons 1-5 so far, still think it odd. At 16:00, when describing the condition at the decision surface, surely the value of r (i.e. beta0 + beta1 * x1 + beta2 * x2) should be zero, rather than 0.5, which is the value of f(x1, x2). Or have I misunderstood this?

Well spotted! The function f(x1, x2) = sigma(b0 + b1x1 + b2x2) is our predictor, with values in [0, 1]. We draw a line in the sand at 0.5, so our decision surface is

(*) b0 + b1x1 + b2x2 = sigma^{-1}(0.5)

but in the lecture the sigma has been missed out: it’s incorrectly written as “b0 + b1x1 + b2x2 = 0.5”. Consequently, the next equation should read

(**) x2 = (sigma^{-1}(0.5) - b0 - b1x1) / b2.

And as you note, sigma^{-1}(0.5) = 0 in this case. But of course, in practice, other flavours of sigma and 0.5 are available!
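To make that concrete, here’s a minimal sketch in Python (the coefficients b0, b1, b2 are made-up values, purely for illustration): every point on the corrected boundary line x2 = -(b0 + b1x1)/b2 gets a predicted value of exactly 0.5.

```python
import numpy as np

def sigmoid(r):
    """Logistic function sigma(r) = 1 / (1 + exp(-r))."""
    return 1.0 / (1.0 + np.exp(-r))

# Hypothetical fitted coefficients, just for illustration.
b0, b1, b2 = -1.0, 2.0, 3.0

def f(x1, x2):
    """Predictor f(x1, x2) = sigma(b0 + b1*x1 + b2*x2), with values in [0, 1]."""
    return sigmoid(b0 + b1 * x1 + b2 * x2)

# Decision surface at threshold 0.5: since sigma^{-1}(0.5) = 0,
#   b0 + b1*x1 + b2*x2 = 0   =>   x2 = -(b0 + b1*x1) / b2.
x1 = np.linspace(-2.0, 2.0, 5)
x2_boundary = -(b0 + b1 * x1) / b2

# On the boundary the predictor returns exactly 0.5.
print(f(x1, x2_boundary))   # [0.5 0.5 0.5 0.5 0.5]
```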

Thanks, Andy. I’m sure that there are various flavours of sigma.
But I’m not sure that there are different flavours of “a half”, unless that’s the difference between “a glass half full” and “a glass half empty”!

You’ve got it. The real question is, “what should we do with points close to the decision boundary?”. The glass-half-empty gang might want to push the threshold up to 0.8 so that points close to the decision boundary are classified as zero.
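For instance, here’s a quick sketch (with made-up scores, just to illustrate) of how raising the threshold from 0.5 to 0.8 pushes borderline points into class zero:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

# Hypothetical predictor scores for a handful of points near the boundary.
scores = sigmoid(np.array([-0.2, -0.05, 0.0, 0.1, 1.5]))

# Default line in the sand at 0.5 versus a more cautious 0.8.
print((scores >= 0.5).astype(int))   # [0 0 1 1 1]
print((scores >= 0.8).astype(int))   # [0 0 0 0 1]
```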

But this topic, the ambiguity at the decision boundary, is the motivation for formulating ‘Support Vector Machine’ models. And as serendipity would have it, that’s the next topic of the course!

My comment was meant to be a bit of a joke! However, I’ve always thought the glass half full or empty difference is not really about how (in which direction) one is looking at it but, more importantly, whether one is filling it or emptying it, i.e. the rate of change of the level.
You’re right of course, that the threshold could be at other levels and, presumably, this also relates to the ROC curve described by Tim in the previous course.
How does the choice of the function (in this case, sigma) affect this? Obviously, it determines the value of r for a given threshold (with a simple step function at zero, the threshold would have no effect). But, for example, do the slope and curvature (and higher derivatives) of the function affect the performance in other ways? (If this is covered later in the course, then, no doubt, I’ll find out then!)

Differentiability is an interesting one. We are looking at a smooth S shape here; its derivative is continuous. But if we instead chose an angular activation function, a Z shape, then its derivative is no longer continuous at the bends. So, if you are relying on the gradients for optimisation, you might experience a large discontinuous jump which could derail you from convergence.
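To illustrate, here’s a rough sketch comparing the smooth sigmoid with one possible angular alternative (a linear ramp clipped to [0, 1], which I’ve called hard_sigmoid here; the name and shape are just my choice for this example). The sigmoid’s gradient varies smoothly, while the ramp’s gradient jumps at the bends:

```python
import numpy as np

def sigmoid(r):
    """Smooth S shape: its derivative sigma * (1 - sigma) is continuous everywhere."""
    return 1.0 / (1.0 + np.exp(-r))

def sigmoid_grad(r):
    s = sigmoid(r)
    return s * (1.0 - s)

def hard_sigmoid(r):
    """Angular alternative: a linear ramp clipped to [0, 1]."""
    return np.clip(0.5 + 0.5 * r, 0.0, 1.0)

def hard_sigmoid_grad(r):
    # The derivative jumps from 0.5 to 0 at the bends r = -1 and r = +1.
    return np.where(np.abs(r) < 1.0, 0.5, 0.0)

# Evaluate the gradients just either side of the bends.
r = np.array([-1.001, -0.999, 0.999, 1.001])
print(sigmoid_grad(r))        # all roughly 0.197 -- no jump
print(hard_sigmoid_grad(r))   # [0.  0.5 0.5 0. ] -- discontinuous at the bends
```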

This is not a problem here. But these functions are really important when we look at neural networks, where there are even more interesting features to consider. We shall indeed chew over the nuances of these choices in the coming course “Deep Learning 101”.