Lessons from DeepLearning.ai - Sequence Models

As I continue my daily hacking, I sometimes like to take a side course to learn a new field or topic. One such course is Andrew Ng's Sequence Models class. Below I'll be sharing some of the lessons taught in the class.

This class is the fifth and final class in the Deep Learning Specialization, and was only recently released. It covers Recurrent Neural Networks (RNNs) and doesn't assume any prior knowledge beyond basic neural networks and Python.

As always, I highly recommend Andrew's classes to everyone.

What is an RNN?

Recurrent Neural Networks (RNNs) are neural networks adapted to work on sequence-based data. For example, they can interpret sentences, music, speech, and even DNA strands!

RNNs are like regular neural networks, with the main difference being that they have an internal state. This internal state is used to keep track of information over time, like a personal notebook.

A basic RNN like this is a many-to-many model: for each input \( x^{<t>} \) there is an output \( y^{<t>} \).
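
To make the "internal state" idea concrete, here's a minimal NumPy sketch of a basic many-to-many RNN. The weight names (Wax, Waa, Wya) and the sizes are placeholders I picked for illustration, and the weights are random rather than trained, so treat it as a sketch of the mechanics only.

```python
import numpy as np

def rnn_forward(xs, Wax, Waa, Wya, ba, by):
    """Run a basic RNN over a sequence, returning one output per input step."""
    a = np.zeros(Waa.shape[0])                 # internal state, starts empty
    ys = []
    for x in xs:                               # one step per sequence element
        a = np.tanh(Waa @ a + Wax @ x + ba)    # update the "notebook"
        ys.append(Wya @ a + by)                # output y^<t> for this step
    return np.array(ys), a

# Illustrative sizes only: 4 time steps, 3 input features, 5 state units, 2 outputs.
rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 3, 5, 2, 4
xs = rng.normal(size=(T, n_x))
ys, a_last = rnn_forward(
    xs,
    Wax=rng.normal(size=(n_a, n_x)),
    Waa=rng.normal(size=(n_a, n_a)),
    Wya=rng.normal(size=(n_y, n_a)),
    ba=np.zeros(n_a),
    by=np.zeros(n_y),
)
print(ys.shape)  # (4, 2): one output vector per input step
```

The same state vector is carried from step to step, which is the only thing separating this from running a plain dense layer on each input independently.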


LSTM and GRU are twists on the normal RNN building block. Their goal is to help capture long-term dependencies, largely by reducing the vanishing gradient problem (exploding gradients are usually handled separately, with gradient clipping). In Keras, it's very simple to set these up:

from keras.layers import LSTM
X = LSTM(128, return_sequences=True)(X)

To cover notation, 128 is the number of units, which is both the hidden state size and the size of each output vector. The code is currently set up for a many-to-many model, but can be switched to a many-to-one model by setting return_sequences to False. Set return_state to True to also receive the final hidden state and cell state.
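
To show what those flags change, here's a toy LSTM layer in plain NumPy. It is not Keras's implementation: the function name lstm_forward, the single stacked weight matrix W, and the gate ordering are my own assumptions, chosen only so the output shapes mirror the return_sequences and return_state behavior described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(xs, W, b, units, return_sequences=False, return_state=False):
    """Toy LSTM layer. W has shape (4 * units, units + input_dim)."""
    h = np.zeros(units)   # hidden state (also the per-step output)
    c = np.zeros(units)   # cell state (the long-term memory)
    hs = []
    for x in xs:
        z = W @ np.concatenate([h, x]) + b
        f = sigmoid(z[:units])                # forget gate
        i = sigmoid(z[units:2 * units])       # input (update) gate
        o = sigmoid(z[2 * units:3 * units])   # output gate
        g = np.tanh(z[3 * units:])            # candidate cell values
        c = f * c + i * g                     # keep some old memory, add some new
        h = o * np.tanh(c)
        hs.append(h)
    out = np.array(hs) if return_sequences else h
    return (out, h, c) if return_state else out

rng = np.random.default_rng(0)
units, input_dim, T = 128, 8, 10              # input_dim and T are arbitrary
W = rng.normal(scale=0.1, size=(4 * units, units + input_dim))
b = np.zeros(4 * units)
xs = rng.normal(size=(T, input_dim))

seq = lstm_forward(xs, W, b, units, return_sequences=True)   # many-to-many
last = lstm_forward(xs, W, b, units)                         # many-to-one
out, h, c = lstm_forward(xs, W, b, units, return_state=True)
print(seq.shape, last.shape)  # (10, 128) (128,)
```

With return_sequences off, the output is just the last row of the full sequence, which is why the many-to-one case is a one-flag change.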

Bi-directional RNN

One limitation of basic RNNs is that each prediction for a word only uses information from the preceding words. For example, let's say we want a system that tells us which words are names. Halfway through a sentence, we have:

"my favorite is teddy"

Does "teddy" correspond with "Teddy Roosevelt" or "teddy bear"? That's unknown, so we can't accurately mark "teddy" as part of a name or not.

To address this challenge, we can use bi-directional RNNs. They read the sentence both left-to-right and right-to-left, so each prediction can use all the words of the sentence, but they're also twice as large and need the whole sentence before they can predict anything.
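
Here's a rough NumPy sketch of the idea: run one RNN forward, run a second one over the reversed sentence, and glue the two states together at each position. The helper names (run_rnn, bidirectional) and the random weights are assumptions for illustration, not anything from the class materials.

```python
import numpy as np

def run_rnn(xs, Wax, Waa):
    """Return the hidden state at every step of a basic RNN."""
    a = np.zeros(Waa.shape[0])
    states = []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x)
        states.append(a)
    return np.array(states)

def bidirectional(xs, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]   # flip back to the original order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
n_x, n_a, T = 3, 4, 4                          # arbitrary illustrative sizes
fwd = (0.5 * rng.normal(size=(n_a, n_x)), 0.5 * rng.normal(size=(n_a, n_a)))
bwd = (0.5 * rng.normal(size=(n_a, n_x)), 0.5 * rng.normal(size=(n_a, n_a)))
xs = rng.normal(size=(T, n_x))

out = bidirectional(xs, fwd, bwd)
print(out.shape)  # (4, 8): twice the state size at every position
```

The backward half of the state at position 0 has already seen every later input, which is what lets the model use "Roosevelt" or "bear" when labeling "teddy".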

Generalization through Word Embeddings

One of the challenges for Natural Language Processing (NLP) systems is how to represent input so that the network runs quickly but also learns well. It's possible to represent each word as a one-hot vector, but with a large vocabulary that's computationally slow. It's also possible to represent each word as a single number, but then unrelated words end up looking very similar just because their numbers are close.

Instead, why not use a mix? Introducing word embeddings. We'll represent each word as an n-dimensional vector, with each dimension representing a trait of the word. For example, "fruit" might be represented as {food: 0.99, gender: -0.05, size: 0.2}, and "king" might be represented as {food: -0.9, gender: 0.92, size: 0.56}.

One amazing benefit of word embeddings is generalization. Let's say we have an NLP system that predicts the next word in a sequence:

I drank a glass of orange _____.
I drank a glass of apple _____.

Now let's say it's heard of "orange juice", but not "apple juice"! A tragedy. However, since "orange" and "apple" have similar features, our system can guess that apples would make a nice juice too. Hence it will output "juice" as the next word in both cases.
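
One way to see "similar features" numerically is cosine similarity between embedding vectors. In the sketch below, the "fruit" and "king" vectors are the ones from the example above, while the "orange" and "apple" vectors are numbers I made up purely for illustration; real embeddings are learned and have far more dimensions.

```python
import numpy as np

# Dimensions are (food, gender, size), as in the example above.
embeddings = {
    "fruit":  np.array([0.99, -0.05, 0.20]),   # from the text
    "king":   np.array([-0.90, 0.92, 0.56]),   # from the text
    "orange": np.array([0.97, 0.01, 0.15]),    # made-up values
    "apple":  np.array([0.96, -0.02, 0.18]),   # made-up values
}

def cosine_similarity(u, v):
    """1.0 means same direction, 0 means unrelated, -1.0 means opposite."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

apple_orange = cosine_similarity(embeddings["apple"], embeddings["orange"])
apple_king = cosine_similarity(embeddings["apple"], embeddings["king"])
print(f"apple vs orange: {apple_orange:.3f}")
print(f"apple vs king:   {apple_king:.3f}")
```

Because "apple" and "orange" point in nearly the same direction, whatever the network learned about "orange" mostly carries over to "apple".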

Word embeddings are also great for transfer learning, since embeddings learned on one task can be reused directly on another task. For a list of pretrained word embeddings, go here.

Reducing Bias in Word Embeddings

Lots of biases exist in the real world. Women are statistically more likely to hold certain jobs, and men others. These biases are very likely to show up in training data, so as scientists, we need to be careful. Thankfully, it can be easier to eliminate biases in machine learning than in the real world.

One tool for reducing bias in word embeddings is as follows. First figure out which dimension (more generally, which direction in the embedding space) corresponds to a certain bias, like gender, then normalize the embeddings along it in two steps. Neutralize: if a word is gender-neutral, set its "bias" component to 0. Equalize: if a pair of words is gender-specific, like "man" and "woman", set their embeddings equal except for their "bias" components, which should be exact opposites.

One concrete example is for the word "fruit". Before:

{food: 0.99, gender: -0.05, size: 0.2}

After normalization:

{food: 0.99, gender: 0, size: 0.2}
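
Here's a small NumPy sketch of both steps. It assumes the bias direction g is already known (here simply the gender axis), and the equalize function is a simplified version that follows the description above rather than the exact formula from the class. The "man" and "woman" vectors are made-up numbers.

```python
import numpy as np

def bias_component(e, g):
    """Projection of embedding e onto the bias direction g."""
    return (e @ g) / (g @ g) * g

def neutralize(e, g):
    """Remove the bias component from a word that should be neutral."""
    return e - bias_component(e, g)

def equalize(e1, e2, g):
    """Make a pair identical off the bias axis and opposite along it."""
    mean = (e1 + e2) / 2
    shared = mean - bias_component(mean, g)      # common non-bias part
    half_gap = bias_component(e1 - e2, g) / 2    # symmetric bias part
    return shared + half_gap, shared - half_gap

# Dimensions are (food, gender, size); g is assumed to be the gender axis.
g = np.array([0.0, 1.0, 0.0])

fruit = np.array([0.99, -0.05, 0.2])
fruit_neutral = neutralize(fruit, g)
print(fruit_neutral)             # gender component is now 0

man = np.array([0.0, 0.9, 0.5])      # made-up values
woman = np.array([0.1, -0.7, 0.4])   # made-up values
man_eq, woman_eq = equalize(man, woman, g)
print(man_eq, woman_eq)          # same food/size, opposite gender
```

With real embeddings, g would not line up with a single coordinate, which is why the code works with projections instead of zeroing one index.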

Beam Search

Computers aren't gods. When a computer looks at the universe of possibilities, the universe stares back, and the computer dies.

When working with sentence generation systems, there are tons of possible sentences that our system needs to sort through. The number grows exponentially with sentence length. As such, one critical component is beam search, which limits the number of sentences being considered at a time.

For a slightly more technical definition... Beam search is like breadth-first search, but instead of expanding every possibility, we keep only the n best partial sentences at each depth, and only consider branches continuing from these.
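
A minimal sketch of beam search in plain Python follows. The toy_model function and its probability table are entirely hypothetical stand-ins for a real language model; the point is just to show the "keep the n best per depth" loop.

```python
import math

def beam_search(next_log_probs, start, beam_width, max_len, end_token="<eos>"):
    """Return the beam_width best sequences as (log-probability, tokens)."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end_token:        # finished sentences carry over
                candidates.append((score, tokens))
                continue
            for token, logp in next_log_probs(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        # Keep only the beam_width best partial sentences at this depth.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == end_token for _, tokens in beams):
            break
    return beams

# Hypothetical next-word probabilities, keyed by the previous word only.
TABLE = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def toy_model(tokens):
    return {t: math.log(p) for t, p in TABLE[tokens[-1]].items()}

greedy = beam_search(toy_model, "<s>", beam_width=1, max_len=5)
wide = beam_search(toy_model, "<s>", beam_width=2, max_len=5)
print(greedy[0][1])  # ['<s>', 'a', 'cat', '<eos>']
print(wide[0][1])    # ['<s>', 'the', 'cat', '<eos>']
```

With a beam width of 1 this is just greedy search, which grabs "a" first and ends up with a less likely sentence (0.33) than the one found with a width of 2 (0.36). Wider beams cost more compute but miss fewer good sentences.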

Speech Recognition

As many of you know from Siri, speech recognition ain't easy. However, it is used in many applications.

With speech recognition, one usually wants a standard many-to-many RNN. The input is a sequence of audio features arriving at, say, 100 Hz (100 time steps per second). The output is the letters or words that it is hearing. This creates a mismatch between input and output length, since a 10-second clip is 1,000 input steps but far fewer output characters, but it is fixable.
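
One common fix for the length mismatch is CTC (connectionist temporal classification): the network emits one character or a special blank per time step, and the final text comes from merging repeated characters and then dropping the blanks. The sketch below shows only that collapsing rule, not the CTC training cost, and the "_" blank symbol is my own choice of notation.

```python
from itertools import groupby

def ctc_collapse(frames, blank="_"):
    """Merge runs of repeated characters, then drop blank symbols."""
    merged = (char for char, _ in groupby(frames))
    return "".join(char for char in merged if char != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # the q
print(ctc_collapse("hh_e_ll_lo"))             # hello
```

The blank between the two l runs in the second example is what allows a genuine double letter to survive the merge.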

This becomes easier with the simplified problem called "trigger word detection". That is, a device listens for keywords and acts on them, like "Hey Siri" or "Okay Google". Since this reduces what the system has to look for, the output is much simpler and easier to learn.
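
To show how simple that output can be, here's a sketch of building training targets for trigger word detection: a 0 at every time step, with a short run of 1s right after each trigger word ends. The function name and the length of the run (ones_after) are assumptions for illustration, not fixed parts of the technique.

```python
def trigger_word_targets(n_steps, trigger_end_steps, ones_after=50):
    """Build a 0/1 target per time step, with 1s just after each trigger word."""
    targets = [0] * n_steps
    for end in trigger_end_steps:
        # Mark the steps right after the trigger word finished, clipped to the clip length.
        for t in range(end + 1, min(end + 1 + ones_after, n_steps)):
            targets[t] = 1
    return targets

# A 10-step clip where the trigger word ends at step 3.
print(trigger_word_targets(10, [3], ones_after=2))
# [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
```

Using a run of 1s instead of a single 1 keeps the targets from being almost all zeros, which makes the problem easier to train on.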