Univariate Linear Regression is probably the simplest form of Machine Learning. Understanding the theory is important, but so is being able to turn the concept into working code. In this Univariate Linear Regression using Octave – Machine Learning Step by Step tutorial we will see how to implement it in Octave. Even when the mathematics is clear, the implementation can be tedious. Since many professors and researchers use Octave/MATLAB for teaching Machine Learning, it may well be the first tool you use to explore the subject. This tutorial will help you get familiar with Octave/MATLAB and understand the implementation without spending a lot of time on the tooling itself.
Objective:
Implement Univariate Linear Regression using Gradient Descent and Normal Equation in Octave/MATLAB.
Case Study:
We will use a data set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) on car mileage. I have truncated the training set to 100 examples and uploaded it to the git repository (link given below). The data set has two columns:
- weight of the cars in thousand lb – [ \( x_1 \) ]
- mpg for the cars – [ y ]
We will first visualize the relationship between weight and mpg, then use the data to predict the mpg of a car from its weight using Linear Regression.
Data Structure:
Here is the data, as (weight, mpg) pairs. The weight is in units of 1000 lb, so 3.5040 = 3.5040 X 1000 lb = 3504 lb.
```
3.5040,18.00
3.6930,15.00
3.4360,18.00
3.4330,16.00
3.4490,17.00
4.3410,15.00
4.3540,14.00
....
```
Gradient Descent:
In this tutorial we will use the simplest form of Gradient Descent. Just to recap, let's go through the equations:
- Linear Regression hypothesis function:
\(
\hat y=h_\theta(x)=\theta_0+\theta_1x
\) - Cost Function:
\(
J(\theta_0,\theta_1)=\frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
\) - Gradient Descent Generic Function :
\(
\theta_j:=\theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)
\) - Gradient Descent for Linear Regression :
Assuming, \( x_0 = 1 \)
\(
\theta_j:=\theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j \right)
\)
for j = 0, 1, …, n
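Since the rest of this tutorial implements the update in Octave, here is an equivalent NumPy sketch of a single gradient-descent step, purely as an illustration (the function name `gradient_step` and all variable names are ours, not part of the original code):

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient-descent update for linear regression.

    X is (m, n+1) with a leading column of ones (x0 = 1),
    y has shape (m,), theta has shape (n+1,).
    """
    m = len(y)
    errors = X @ theta - y         # h_theta(x^(i)) - y^(i), shape (m,)
    gradient = (X.T @ errors) / m  # (1/m) * sum_i errors_i * x_j^(i), for each j
    return theta - alpha * gradient
```

Note that the gradient is computed for all \( \theta_j \) before any of them is changed, which is what "simultaneous update" means in the equation above.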
Visualize the Data:
Our first step is to visualize the data. Let's open Octave/MATLAB and create a file named main.m. Make sure data.txt is in the same folder. Load the data using the load() function, then assign the first column to the variable x and the second column to y. Both the x and y matrices will have 100 rows and 1 column.
Next, use the plot() function to draw the scatter plot, and set the labels for the axes.
```matlab
car_data = load('data.txt');
x = car_data(:,1);   % x has 100 rows and 1 column
y = car_data(:,2);   % y has 100 rows and 1 column

figure;
plot(x, y, 'bx', 'MarkerSize', 10);
xlabel('(x) Weight -->', 'fontsize', 14);
ylabel('(y) mpg -->', 'fontsize', 14);
```
Once you execute the above code the following chart will be displayed. You can see how the weight of a car is inversely related to its mpg: a car weighing 5 X 1000 lb = 5000 lb gets below 10 mpg.
Calculate \( \theta_0 \) and \( \theta_1 \) :
Initialize the required variables :
As per the Gradient Descent for Linear Regression equation, we need to set \( \alpha \) and the number of iterations. For our example we will set \( \alpha \) to 0.1 and num_of_iterations to 1000. Later we will find out whether num_of_iterations is enough, too many, or too few.
We will calculate the number of training examples m using the length() function.
Afterwards, we will initialize \( \theta \) to [0; 0].
Our x matrix has only one variable, \( x_1 \); however, since we assumed \( x_0 = 1 \) in order to generalize the Gradient Descent equation, we need to add \( x_0 \) to the x matrix as a new column.
Refer to the last line of the code fragment below: we add another column to the matrix x using the ones() function. The ones() function returns a matrix of size (length of training set) X 1 (rows X cols), which gets added as a new column. x will now be a 100 X 2 matrix.
```matlab
alpha = 0.1;
num_of_iterations = 1000;
m = length(y);          % number of training examples
theta = zeros(2,1);     % initialize theta using the zeros() function
x = [ones(m,1), x];     % add the x0 = 1 column
```
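For comparison, the same bias-column step can be sketched in NumPy (an illustration only; the variable names and the sample weights are ours):

```python
import numpy as np

x = np.array([3.504, 3.693, 3.436])        # a few car weights (in 1000 lb)
X = np.column_stack([np.ones(len(x)), x])  # prepend the x0 = 1 column
# X is now (3, 2): each row is [1, weight]
```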
Here is part of the x matrix after we have added \( x_0 \).
```
1.0000   3.5040
1.0000   3.6930
1.0000   3.4360
1.0000   3.4330
1.0000   3.4490
1.0000   4.3410
1.0000   4.3540
.....
```
Iterate the Gradient Descent Function :
Our next task is to calculate \( \theta \), iterating 1000 times for convergence. So let's create a for loop, then calculate \( h_\theta(x) \) by multiplying x and theta (refer to the equation above). x is a (100 X 2) matrix and theta is a (2 X 1) matrix, so the resultant matrix is (100 X 1).
Then we do an element-wise subtraction of y. The result, h_of_x, is still a (100 X 1) matrix.
Note: the code below could also be written using explicit loops; however, matrix calculations are easier to write and faster to compute.
```matlab
for i = 1:num_of_iterations
    h_of_x = (x * theta) - y;
end
```
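To make the shapes concrete, here is the same error computation in NumPy on a tiny made-up design matrix (the numbers and names are ours, chosen only to illustrate the dimensions):

```python
import numpy as np

X = np.array([[1.0, 3.504],
              [1.0, 3.693]])    # toy (2 x 2) design matrix: [x0, weight] rows
theta = np.array([0.5, -0.1])   # toy parameters, shape (2,)
y = np.array([18.0, 15.0])      # toy targets, shape (2,)

# (2 x 2) @ (2,) gives (2,); subtracting y is element-wise
h_of_x = X @ theta - y          # vector of prediction errors
```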
It's time to calculate \( \theta_j \). Since we have only one feature, we will calculate both \( \theta_0 \) and \( \theta_1 \) explicitly, without looping over j.
The calculations below are straightforward, but notice that we use the transpose of h_of_x. Here h_of_x and x(:,1) (or x(:,2)) are both (100 X 1) matrices, so we can't multiply them without transposing one of them. If you take the transpose of h_of_x and then multiply it with x(:,1) or x(:,2), the result is a scalar value (element-wise multiplication, then summation). This is also called the product of a row vector and a column vector, which always yields a scalar. Here is the simplified version of the equation:
\(
\begin{bmatrix}
a & b
\end{bmatrix}
\cdot
\begin{bmatrix}
c \\
d
\end{bmatrix}
=
[a \cdot c + b \cdot d]
\)
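You can verify this row-times-column product quickly in NumPy (an illustration with made-up numbers):

```python
import numpy as np

row = np.array([2.0, 3.0])   # [a, b], a row vector
col = np.array([4.0, 5.0])   # [c, d], a column vector
product = row @ col          # a*c + b*d, a single scalar
```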
Here is the code to calculate \( \theta \).
```matlab
for i = 1:num_of_iterations
    h_of_x = (x * theta) - y;
    theta(1) = theta(1) - (alpha/m) * h_of_x' * x(:,1);
    theta(2) = theta(2) - (alpha/m) * h_of_x' * x(:,2);
end
```
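The same loop can be sketched in NumPy for comparison (illustrative only; `gradient_descent` and the variable names are ours). Note that h_of_x is computed once per pass, before either parameter is changed, so the update is effectively simultaneous:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_of_iterations=1000):
    """Per-parameter updates mirroring theta(1) and theta(2) in the Octave loop."""
    m = len(y)
    theta = np.zeros(2)
    for _ in range(num_of_iterations):
        h_of_x = X @ theta - y   # errors, computed once per iteration
        theta[0] -= (alpha / m) * (h_of_x @ X[:, 0])
        theta[1] -= (alpha / m) * (h_of_x @ X[:, 1])
    return theta
```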
Plot the Hypothesis:
Let's first print the value of \( \theta \), then plot the fitted line on the previous scatter plot. We already have \( x_1 \) (weight), so we pass that to the plot function as the first argument. Remember to pass only the 2nd column of the x matrix.
To calculate the predicted y, we multiply the x (100 X 2) matrix by theta (2 X 1), which outputs a (100 X 1) matrix (this was already discussed during the calculation of h_of_x).
```matlab
fprintf('θ0 = %f θ1 = %f \n', theta(1), theta(2));

hold on;
% x is already available; for y, compute the prediction:
% (100 X 2) * (2 X 1) = (100 X 1)
plot(x(:,2), x*theta, 'r-', 'linewidth', 2);
legend('Training data', 'Linear regression');
hold off;
```
Once you run the code you can see the straight line.
Prediction:
We can predict the mpg for any given weight. Let's make three predictions and also plot them on the same chart. We will predict the mpg for 1.3 X 1000 lb = 1300 lb, 3.6 X 1000 lb = 3600 lb, and 5.5 X 1000 lb = 5500 lb respectively.
Here, the x value is supplied as i in the plot function. The y value is calculated as in the previous code (x * theta); however, this time we manually pass x as [1, i], since \( x_0 = 1 \) and \( x_1 = i \).
```matlab
predict = [1.3, 3.6, 5.5];

hold on;
for i = predict
    plot(i, [1, i]*theta, 'g*', 'MarkerSize', 14);
end
legend('Training data', 'Linear regression', 'Prediction');
hold off;
```
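The prediction step itself is just a dot product; here is a NumPy sketch with an illustrative (not fitted) theta, so the numbers are only an example:

```python
import numpy as np

theta = np.array([40.0, -6.0])   # illustrative values, not the fitted theta
weights = [1.3, 3.6, 5.5]        # in units of 1000 lb
# for each weight, build [x0, x1] = [1, weight] and multiply by theta
predictions = [np.array([1.0, w]) @ theta for w in weights]
```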
Here is the final output. The three predictions are displayed as green stars.
Cost Function:
Another very important aspect is the cost function. Initially we set the number of iterations to 1000, but how do we make sure 1000 is enough, too many, or too few? And how can we tell whether gradient descent is converging at all?
The cost function answers both questions. We will store the cost at each iteration in an array and plot it against the iteration number; we can then see how the cost gradually converges towards its minimum.
Here is the code. We have added the last line inside the for loop to calculate the cost; later you can plot it.
```matlab
J = zeros(num_of_iterations, 1);

for i = 1:num_of_iterations
    h_of_x = (x * theta) - y;
    % ....... skipping the code
    % compute J(θ) - Cost
    J(i) = 1/(2*m) * sum(h_of_x.^2);
end

% .....

figure;
plot([1:num_of_iterations], J, 'linewidth', 2);
xlabel('Num of iterations -->', 'fontsize', 14);
ylabel('Cost Function J(theta) -->', 'fontsize', 14);
```
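The cost computation translates directly to NumPy (a sketch; `cost` and the variable names are ours):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    h_of_x = X @ theta - y
    return (h_of_x @ h_of_x) / (2 * m)   # dot product = sum of squared errors
```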
From the plot below you can clearly see how \( J(\theta_0,\theta_1) \) has decreased over the iterations and is converging towards its minimum. You can also see that the curve flattens out from around iteration 300 onwards, which tells us that the function has essentially converged.
Normal Equation :
We can also use the Normal Equation to calculate \( \theta \). The Normal Equation is easy to understand and implement, and it is fast for a small number of features; however, if you have hundreds of features it tends to take more compute time than gradient descent. Again, for recap, here is the Normal Equation:
\(\theta=(X^TX)^{-1}X^Ty
\)
Calculate \( \theta \) :
In Octave/MATLAB we can implement this in just one line. We will then plot on the same chart as Gradient Descent in order to compare the two results.
```matlab
thetaNormal = pinv(x'*x) * x' * y;

hold on;
plot(x(:,2), x*thetaNormal, 'k-', 'linewidth', 1);
hold off;
```
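Here is a NumPy sketch of the same Normal Equation on a tiny made-up data set where the exact line is known (the data and names are ours, for illustration only):

```python
import numpy as np

# toy data lying exactly on the line y = 1 + 2x
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# theta = (X^T X)^{-1} X^T y, using pinv as in the Octave code
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```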
Refer to the chart below: the black line is from the Normal Equation, and it overlays the line from Gradient Descent. This confirms that both methods produce the same result.
Full Octave Code:
```matlab
clear; close all; clc;

% Load Car Data
% weight,mpg - data.txt (100 rows)
car_data = load('data.txt');
x = car_data(:,1);   % x has 100 rows and 1 column
y = car_data(:,2);   % y has 100 rows and 1 column
m = length(y);

% Plot the data in a scatter plot
figure;
plot(x, y, 'bx', 'MarkerSize', 10);
xlabel('(x) Weight -->', 'fontsize', 14);
ylabel('(y) mpg -->', 'fontsize', 14);

fprintf('Press any key to continue ...');
pause;

% Gradient Descent
% add x0 to the feature matrix
x = [ones(m,1), x];

% initialize theta = [theta0; theta1]
theta = zeros(2,1);
num_of_iterations = 1000;
alpha = 0.1;
J = zeros(num_of_iterations, 1);

for i = 1:num_of_iterations
    h_of_x = (x * theta) - y;

    % h_of_x' * x(:,j) yields a real number
    % (element-wise multiplication + sum)
    theta(1) = theta(1) - (alpha/m) * h_of_x' * x(:,1);
    theta(2) = theta(2) - (alpha/m) * h_of_x' * x(:,2);

    % compute J(θ) - Cost
    J(i) = 1/(2*m) * sum(h_of_x.^2);
end

fprintf('θ0 = %f θ1 = %f \n', theta(1), theta(2));

hold on;
% Plot the fitted line: (100 X 2) * (2 X 1) = (100 X 1)
plot(x(:,2), x*theta, 'r-', 'linewidth', 2);
hold off;

fprintf('Press any key to continue ...');
pause;

predict = [1.3, 3.6, 5.5];

hold on;
for i = predict
    plot(i, [1, i]*theta, 'g*', 'MarkerSize', 14);
end
legend('Training data', 'Linear regression', 'Prediction');
hold off;

fprintf('Press any key to continue ...');
pause;

thetaNormal = pinv(x'*x) * x' * y;

hold on;
plot(x(:,2), x*thetaNormal, 'k-', 'linewidth', 1);
hold off;

fprintf('Press any key to continue ...');
pause;

figure;
plot([1:num_of_iterations], J, 'linewidth', 2);
xlabel('Num of iterations -->', 'fontsize', 14);
ylabel('Cost Function J(theta) -->', 'fontsize', 14);
```
You can find the data.txt file and the above Octave code (main.m) on GitHub. Here is the link below.
Conclusion:
We saw a very simple implementation of Univariate Linear Regression using both Gradient Descent and the Normal Equation. In the next tutorial we will learn more about Multivariate Linear Regression.