Well, between the rainy weather, and playing lots of games of FIFA 2012 this weekend, I banged by head on the wall for a couple of hours on the Stanford Machine Learning problem set (the third one), and I sadly couldn’t get my Octave code to function.
The video lectures are about neural networks, but the problem set is about implementing an All Vs. One classification algorithm, for handwriting identification. You are given a data set (5000 rows, with 400 features) – each row represents an image of some person writing a number from 0 to 9 inclusive. I think the 400 features are a pixel representation of those images (so 400 pixels, with some number representing grayscale for the image at that pixel location).
So the first part of the problem set is to write the cost function J.
This is a standard logistic regression problem. So H(theta), the hypothesis, is represented by the sigmoid function – H(theta) = 1 / (1 + e) ^ (-z), where z is equal to Theta(transpose) dot product with X(i).
Yeah, it sounds like a lot of greek doesn’t it? It probably helps to know a little linear algebra – so I think that’s why the first week of lectures included a brush up on vectors and Matrix multiplication.
This important sigmoid function looks like an S curve – that is assymptotic to 0 as x goes to negative infinity, and assymptotic to 1 as x goes to positive infinity. The S curve crosses the y axis at 0.5. It’s a neat function, and you are using it essentially to map any Real number as the input to your prediction, and your prediction will range from 0 to 1. This is important because in log regressions, you typically are trying to train a model, using your data set, to output a prediction in range (0, 1), which is representative of your probability of whether you think a new input is classified as true (y=1) or false (y=0). So examples of log regressions they talk about in this course include classifying whether a tumor is benign or malignant based on some observed features (like size of the tumor, etc.).
So – each row in the 5000 row data set represents 400 pixels, so there are 400 x’s (x1, x2, …. x400) storing a number mapped to pixel intensity for each of those 400 pixels. And the vector y, which is 5000 rows by 1 column, represents what that number actually was (what did the handwriting really indicate – so it’s a number from 0 to 9).
One wrinkle in the model is you have to add a column of ones to the matrix X. (The matrix X houses all the 5000 rows, 400 columns of pixel intensity data). This column of ones is a bias column, or really a constant number. Think of it in terms of your linear equation y = mx + b. B is the contant number (adjustment) that shifts your line up or down (because b is the y-intercept). So in the form of this model, you also have a constant factor which will shift your prediction up or down.
So really, the goal of your logistic regression, is to solve for theta (which is a vector of length 401). Remember there were 400 features, but we added that bias feature (or constant) as the first column in the matrix X.
So here’s where vectorization is handy. This calculation Theta(transpose) dot product with the ith row of the X matrix is equal to theta(0)*x(0) + theta(1)*x(1) + ….. + theta(401)*x(401). So it’s a real number in R1 (just like a scalar). And this number can be anything in R1. Now if you apply that neat sigmoid function : 1 divided by(1+e) raised to the negative z (where z is this dotproduct of theta(transpose) times the ith row of the X matrix) – you can transform that row to an output of range (0,1), which represents your prediction, or probability that your classification algorithm thinks this data point is true (y=1).
Now, in this case, the y vector can be 0, 1, 2, 3, …. , 9, so you really need to run 10 different log regressions – which is what an all versus one classification algo does.
So in your first regression, let’s say we’ll train an algorithm to identify whether a handwritten number is zero, or whether it is not – you just set the y vector to be a binary classification, y is set to 1 if the data in the y vector was zero. Otherwise you set the y vector to be 0.
Now – you just need to solve for theta – which is the weighting (or coefficients) – that give you a good prediction.
So, in order to solve for theta, you need to come up with a cost function. In linear regression, the cost function is something you may be familiar with – it’s the sum of the squared distances between the actual observation and the prediction – times a scalar (1/2m) where m is the number of rows.
In log regression – the cost function J is a little different:
J(theta) = (1/m) * sum from i = 1 to m of [-y(i)*log(h(i)) - (1-y)log(1-h(i))]
Where h is the hypothesis function which is the sigmoid, or 1/(1+e)^(-theta(transpose) * x(i).
What??? lol
Really what you’re doing here is looking at each of your rows in the Matrix – doing that dotproduct theta(transpose) times ith row of X, and coming up with a number. Then you make that number negative and take the sigmoid function – and you come up with a prediction, or probability, in the range (0,1).
And now you compare that prediction to the real observation in your y vector – was y =1 , or was y =0 . In the log regression, there are only those two discrete possibilities.
If you predicted 1, and y was 1 – hooray – that was a good prediction.
If you predicted 1, and y was 0 – boo – horrible prediction.
So what you’re doing in the cost function is quantifying the effectiveness of your prediction model. For each row (or observation in your data set) – you’re saying how good (or awful) was my prediction based on my parameters theta (the coefficients in my model).
Here’s part of the intuition. If the observation y was 1, and your prediction y was 1, then you want your cost model for that prediction to be zero. You were perfect. So remember your cost model has two parts the -y*log(h) and the (1-y)log(1-h). So if y was 1, you can ignore the second half of the cost function (for that observation because it gets zeroed out). In the first half, log(h), should equal zero, because your cost function should reflect that you were correct. So that log function just asks, what is the power of e that gives me the answer h? In this case you want the log to be zero, which occurs when h is 1, right? e to the zero power is 1. So your hypothesis h was 1, your log of 1 was zero, and your negative y * zero was zero. Ok, you can go through the intuition for the other scenarios, but essentially your cost function gets very large (infinity) if you predict 0 and the actual observation was 1, and vice versa.
So how the hell do you know which thetas give you the solution.
So once you have your cost function J, you can use a gradient descent algorithm to solve for theta. Essentially, you are taking partial derivatives for the cost function dJ/dtheta(1), dJ/dtheta(2), which are the gradients of the cost function. Think of them as the slope, or the direction that the cost curve slopes as you go in the direction of the theta(dimension). And with differentiable, convex functions, you can get to a minimum, when you set the derivative to zero. In 2 dimensions – the derivative of y = x^2, dy/dx (or the gradient) is 2x. This derivative is zero, when x = 0. It’s a bit harder to see in multiple dimensions, but it’s the same idea – only this time we have 401 dimensions, and we need to solve for 401 thetas and calculate 401 partial derivatives dJ/dtheta.
My calculus is very rusty – so I wasn’t really able to do the derivative of the cost function. I know you need to apply the chain rule somewhere here, and use the notion that the derivative of the log(x) is 1/x.
So there is also a formula for the gradient – which it turns out looks pretty familiar
dJ/dtheta = (1/m) * sum for i=1 to m (h(i) – y(i))*x(i)
So to read this – you’re saying for every row in the matrix – calculate the difference between my prediction and the actual value of y (which is a 5000 row vector) and then you multiply that by the vector of the x (feature). Remember, you’re doing this for 401 partial derivatives, so in your matrix, you have 401 columns, each column represents a feature of x, which is also a 5000 row vector. So you can dot product them, and get a scalar number.
Finally, this problem set also had you work on something called regularization – which involves another parameter lambda, used to smooth out your predictions – to prevent overfitting and underfitting the data.
In Octave, there are functions like fminunc and fmincg – where if you pass them the cost function, the gradient, the matrix of data X, and the solution vector y, and the lambda for the regulatization feature – Octave will generate and solve for the coefficients theta.
Anyways – even after watching all these videos, and monkeying around with the Octave code, I still can’t get it to work. Frustrating. I hope they post the solution code – as I know I have a bug somewhere in my code, and I can’t find it. I would ask for help in the Q&A forums, but the Stanford honor code says not to share your code, so I’m going to take a zero for this week’s problem set, barring some Archimedes moment in the next week or so.
Instead, I will focus on making sure I can get Chelsea promoted to Division 8 in FIFA 2012. I don’t know why I chose Chelsea. Drogba is pretty cool!
{ 2 comments… read them below or add one }
I too was stuck with this ‘fmincg’ and googling landed me here. Though, I guess, I have kind of figured it out.
BTW, there is no such inbuilt function like ‘fmincg’. The help doc that you see are the comments written inside the function.
If you feel like discussing, cost function calculation is almost the same as the last week’s exercise. As for fmincg, like fminunc, it returns optimum value of theta. Since, the question was about one vs all, we need to train the data-sets for “num_labels” different functions, i.e num_labels different theta vectors. Now, since your cost function returns a “column vector” and as we need to populate the all_theta matrix with these “vectors in rows”, we will find the transpose of the returned vector and populate the matrix. Now, since there are num_labels number of classifiers. We will have to do this num_labels times.
Well, if you don’t feel like discussing and want to adhere to strictly self-learned rule, you can just ignore the above few lines.
PS: I kind of hate football.
Hi Harsh – thanks for the note and the offer. I let this problem stew in my mind for a couple days, and I picked it up again this morning and I found my error.
I actually did two things wrong – first, I thought I could write the Octave code for the cost function with one simple Matrix vector product, which was wrong. And so I went back and implemented using the hint in the problem set (using the .* operator in Octave, and then the sum function to sum up the entries in the resulting vector).
I also found a second error in my code. I was trying to regularize the cost function by also just doing a simple dot product of theta transpose with theta. I forgot that you don’t need to include the first entry of that theta vector (that corresponds to the bias unit) in that regularization process.
These problem sets are very tricky. I think I like Python more than Octave, haha.
Rob