#### Introduction

Julia's first-class array implementation and high performance make it an ideal language for performing statistical analysis on large data sets.

The linreg function not only makes it easy to perform simple linear regression on data sets containing one independent variable; it can also perform multiple regression, that is, linear regression on data sets with multiple independent variables.

#### Background

The data set we will analyze in this example represents samples of gasoline of various octane ratings. For each sample, the octane rating was measured along with the component makeup in terms of three components, which we will refer to as "component 1", "component 2", and "component 3". We want to model octane rating as a function of the component makeup of gasoline.

Here's a sample of the data:

| Index | Component 1 (%) | Component 2 (%) | Component 3 (%) | Condition | Octane Rating |
|-------|-----------------|-----------------|-----------------|-----------|---------------|
| 1     | 53.33           | 1.72            | 54              | 1.66219   | 92.19         |
| 2     | 59.13           | 1.2             | 53              | 1.58399   | 92.74         |
| 3     | 57.39           | 1.42            | 55              | 1.61731   | 91.88         |
| 4     | 56.43           | 1.78            | 55              | 1.66228   | 92.8          |

\* Data taken from Helmuth Spaeth, Mathematical Algorithms for Linear Regression. Academic Press, 1991.

We want to find the coefficients that satisfy the following relationship:

$octane = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$

Where the values for x represent the weight fraction of components 1 through 3, the values for $\beta$ are constant coefficients, and $\epsilon$ represents an error or y-offset term. We can express this more generally:

$y = \sum_{i=1}^{p}{\beta_i x_i} + \epsilon$

Or in vector form, which is more convenient for translating this expression to Julia:

$y = X\beta + \epsilon$

Where:

$y = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right), X = \left( \begin{array}{c} x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{n}^{T} \end{array} \right) = \left( \begin{array}{ccc} x_{11} & \ldots & x_{1p} \\ x_{21} & \ldots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \ldots & x_{np}\end{array} \right), \beta = \left( \begin{array}{c} \beta_{1} \\ \vdots \\ \beta_{p} \end{array} \right), \epsilon = \left( \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{array} \right)$

Referring to the above table, we will treat column 6 as y, and columns 2 through 4 as X.

Our challenge will be to find the set of coefficients, $\beta$, that minimizes the sum of the squares of the residuals, where a residual is defined as the difference between an actual y value and our predicted value, which we will call $\hat{y}$.

The general solution to this problem, using ordinary least squares (the simplest method), is of the form:

$\hat{\beta} = (X^T X)^{-1}X^Ty$
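As a sanity check, the closed-form solution above can be computed directly with Julia's linear-algebra operators. Here is a minimal sketch on small synthetic numbers (not the gasoline data), also comparing against the backslash operator, which solves the same least-squares problem with better numerical behavior:

```julia
# Closed-form OLS solution vs. Julia's backslash operator.
# Synthetic data for illustration only.
X = [1.0 2.0; 1.0 3.0; 1.0 5.0; 1.0 7.0]   # design matrix, intercept column included
y = [3.1, 4.9, 9.2, 12.8]                  # observed responses

beta_closed = inv(X' * X) * (X' * y)   # (XᵀX)⁻¹Xᵀy
beta_solve  = X \ y                    # same solution, preferred in practice

# Both approaches agree to floating-point precision
@assert maximum(abs.(beta_closed - beta_solve)) < 1e-8
```

In practice you would rarely form $(X^T X)^{-1}$ explicitly; the backslash operator avoids the explicit inverse.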

Julia has a built-in function for loading CSV files: readcsv. We will use it to read the data from our sample file into an array for processing. Let's create a new directory called Gasoline for our project and, within it, create a file called main.jl using Julia Studio.

For this tutorial, we will use the gasoline.csv file: download the data set and place it in the Gasoline directory.

To load in the data, we simply call the readcsv function:

data = readcsv("gasoline.csv")


Now that we have the data loaded into memory, we can get the subranges that represent the slices of data that correspond to x and y. Much like Python and MATLAB®, Julia allows you to specify ranges within matrices using the : operator. For example, data[1,:] gets all the data in the first row of the matrix. data[:,1] gets all the data in the first column.

# Get the columns that correspond with our X values
# and create a new matrix to hold them
# syntax: data[row, column]
# to get a whole row do data[row, :]
# to get a whole column do data[:, column]
# to get a range of columns do data[:, column_a:column_b]

x = data[:, 2:4]
y = data[:, 6]


Try printing out these values in the console to confirm that the data has the shape you expect.
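One quick way to check the shapes is with size, which reports (rows, columns) for a matrix and a one-element tuple for a vector. A hypothetical session, using the sample rows from the table above in place of the full file:

```julia
# Stand-in for the loaded file: two sample rows from the data table.
data = [1.0 53.33 1.72 54.0 1.66219 92.19;
        2.0 59.13 1.2  53.0 1.58399 92.74]
x = data[:, 2:4]
y = data[:, 6]

println(size(x))   # (2, 3): one row per sample, three component columns
println(size(y))   # (2,): one octane rating per sample
```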

#### Regression

Now that we have all the data arranged in the form we need it, we can perform linear regression on it.

The syntax for calling linreg is simple: linreg(x_values, y_values, weight_values). All arguments must be Arrays of Floats, and the weight_values argument is optional. This call returns the coefficients of the linear fit, beginning with the y-offset.

# Call linreg
coefs = linreg(x, y)


Paste the previous commands into the main.jl file, and our final program looks like this:

# in main.jl

# Load CSV data
data = readcsv("gasoline.csv")

# Dissect data
x = data[:, 2:4]
y = data[:, 6]

# Call linreg
coefs = linreg(x, y)
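To judge how good the fit is, you can compute the predicted values $\hat{y}$ and the residual sum of squares, the quantity least squares minimizes. A self-contained sketch using the four sample rows from the table as stand-in data (with the real file, reuse the x, y, and coefs from the steps above); the fit itself uses the standard least-squares formulation with an explicit intercept column, so coefs[1] is the y-offset, matching linreg's layout:

```julia
# Stand-in data: the four sample rows from the table above.
x = [53.33 1.72 54.0; 59.13 1.2 53.0; 57.39 1.42 55.0; 56.43 1.78 55.0]
y = [92.19, 92.74, 91.88, 92.8]

# Least-squares fit with an explicit intercept column;
# coefs[1] is the y-offset, coefs[2:end] the component coefficients.
coefs = hcat(ones(length(y)), x) \ y

y_hat = coefs[1] .+ x * coefs[2:end]   # predicted octane ratings
residuals = y - y_hat
rss = sum(residuals .^ 2)              # residual sum of squares
println("residual sum of squares: ", rss)
```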


#### Conclusion

Running the regression produces the following y-offset and coefficients for the three components:

| Offset ($\epsilon$) | Component 1 ($\hat{\beta_1}$) | Component 2 ($\hat{\beta_2}$) | Component 3 ($\hat{\beta_3}$) |
|---------------------|-------------------------------|-------------------------------|-------------------------------|
| 102.44000           | -0.09526                      | -0.14831                      | -0.08314                      |

If you found this too easy and you're feeling ambitious, try looking at the Generalized Linear Models package and producing similar results. For instance, generalized least squares replaces the ordinary solution with:

$\hat{\beta} = (X^T \Omega^{-1} X)^{-1}X^T\Omega^{-1}y$

where $\Omega$ is the covariance matrix of the error terms.
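This generalized formula can be sketched directly in Julia, assuming $\Omega$ is a known, positive-definite covariance matrix of the errors. The diagonal values and data below are hypothetical, chosen only to illustrate the computation:

```julia
# Generalized least squares: (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y
# Synthetic data; Omega is a hypothetical diagonal error covariance.
X = [1.0 2.0; 1.0 3.0; 1.0 5.0; 1.0 7.0]
y = [3.1, 4.9, 9.2, 12.8]
Omega = [1.0 0.0 0.0 0.0;
         0.0 0.5 0.0 0.0;
         0.0 0.0 2.0 0.0;
         0.0 0.0 0.0 1.0]

W = inv(Omega)                              # Ω⁻¹ weights each observation
beta_gls = inv(X' * W * X) * (X' * W * y)   # generalized least squares

# When Omega is the identity, GLS reduces to ordinary least squares:
beta_ols = X \ y
```

Observations with smaller error variance (here, the second row) get more weight in the fit; with $\Omega = I$, beta_gls and beta_ols coincide.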