## AI Vectorization

In the Coursera class "Machine Learning" offered by Stanford University, the professor claims the algorithms are vectorized. However, the algorithms use for loops. Can the algorithms be vectorized further, for efficiency?

In this post, I would like to explore this question, and to share my results. All vectors are assumed to be column vectors.

Before we dive in, I want to recommend the Machine Learning Coursera class. It has lots of resources, and the professor has very solid explanations. Even for people with prior AI experience, it shows new tricks, and it will help solidify your understanding.

### Linear Regression

#### Refresher

Before we start, here is a quick refresher of what each variables mean:

$$x$$ - Matrix of examples
$$x_i$$ - Vector of an example
$$m$$ - The number of examples
$$y_i$$ - The actual value, for example i, of some statistic we're predicting
$$\theta$$ - Model parameters
$$h_{\theta} (x_i)$$ - Our model's prediction for the example i
$$h_{\theta} (x)$$ - Our model's predictions for the examples
$$J(\theta)$$ - The cost function, aka how bad our model is

The linear model used to make predictions is:

$$h_{\theta} (x_i) = x_i^T \cdot \theta \\ h_{\theta} (x) = x \cdot \theta$$

The standard cost function is:

$$J(\theta) = (1/2m) * (\sum_{i=1}^{n} ( h_{\theta} (x_i) - y_i)^2 )$$

"Machine Learning" offers the following algorithm for computing the gradient of the cost:

$$\partial_0 J(\theta) = (1/m) * (\sum_{i=1}^{m} h_{\theta} (x_i) - y_i) \\ \partial_j J(\theta) = (1/m) * (\sum_{i=1}^{m} (h_{\theta} (x_i) - y_i) * (x_j))$$

The main issue with this is that the algorithm is iterative, and hence must be run multiple times. However, the algorithm can be vectorized, making it much faster to computer:

$$\nabla J(\theta) = (1/m) * ( x^T \cdot ( h_{\theta} (x) - y) )$$

#### Proof

Edit January 1, 2019: A few minor mistakes have been fixed.

Let's start from the beginning and take the gradient of J. We'll start by rewriting it a little:

$$\nabla J(\theta) = (1/2m) * \nabla ( g(\theta)^T \cdot g(\theta) )\\ g(\theta) = h_{\theta} (x) - y$$

Then breaking it up via the chain rule. Due to the matrices, the order of the derivatives matters:

$$\nabla J(\theta) = \nabla g(\theta)^T \cdot (1/2m) * \nabla ( g^T \cdot g ) \vert_g \\ \nabla J(\theta) = \nabla g(\theta)^T \cdot (1/2m) * ( 2 * g ) \\ \nabla J(\theta) = \nabla h_{\theta} (x)^T \cdot (1/m) * ( h_{\theta} (x) - y )$$

Next, let's find the gradient of h, with respect to $$\theta$$:

$$\nabla h_{\theta} (x) = x$$

Inserting back, we get:

$$\nabla J(\theta) = x^T \cdot (1/m) * ( h_{\theta} (x) - y )$$

Finally, let's rewrite it neatly - how it's mostly commonly written:

$$\nabla J(\theta) = (1/m) * ( x^T \cdot ( h_{\theta} (x) - y ) )$$

Similarly, "Machine Learning" offers the following gradient descent algorithm:

$$\partial_j J(\theta) = (\sum_{i=1}^{m} (h_{\theta} (x_i) - y_i) * (x_j)) * (1/m)$$

Thanks to nice partials, the gradient is the same as in linear regression. Hence, we can re-use the fully vectorized gradient equation for linear regression:

$$\nabla J(\theta) = ( X^T \cdot ( h_{\theta} (x) - y) ) * (1/m)$$

As a snack, here's a refresher for what each variable means:

$$a^{(1)}$$ - The activations of NN layer n
$$\delta^{(n)}$$ - The "error term" of NN layer n
$$\Delta^{(n)}$$ - The cumulative gradient of NN layer n

The "Machine Learning" algorithm for computing a NN's gradient is: This looks very daunting. Can we vectorize this behemoth of a for loop? Let's work our way down.

For forward propagation, let's make each row of matrix $$a_1$$ contain the vector $$a^{(1)}$$ for each example:

$$a_1 = x\\ a_2 = sigmoid(a_1 \cdot \theta_1^T)\\ a_3 = sigmoid(a_2 \cdot \theta_2^T)$$

For back propagation, let's make each row of matrix $$\delta_3$$ contain the vector $$\delta^{(3)}$$ for each example:

$$\delta_3 = a_3 - Y\\ \delta_2 = ((\delta_3 \cdot \theta_2) * (a_2 * (1-a_2)))$$

Finally, we are left with vectorizing $$\Delta^{(n)}$$, however there are issues. $$\Delta^{(n)}$$ is a matrix, yet our input data, $$a^{(l)}$$ and $$\delta^{(l)}$$, are a vector and matrix respectively. Ignoring algorithmic correctness, there is no way to multiply a vector and a matrix to get a matrix. This might be possible with tensor mathematics, so if anyone knows, please comment below. However, that begs the question if we still gain optimization benefits?

On that note, unless performance ia a constraint for your AI, it's best to keep readable code. In that situation, I recommend using "Machine Learning" 's semi-vectorized approach.