Andrej Karpathy. Neural Networks: Zero to Hero
Episode 2. The spelled-out intro to language modeling: building makemore
Hi, everyone. Hope you're well. Next up, what I'd like to do is build out makemore. Like micrograd before it, makemore is a repository that I have on my GitHub web page. You can look at it, but just like with micrograd, I'm going to build it out step by step and I'm going to spell everything out. We're going to build it out slowly and together.

Now, what is makemore? Makemore, as the name suggests, makes more of things that you give it. Here's an example: names.txt is an example dataset to makemore. When you look at names.txt, you'll find that it's a very large dataset of names. There are lots of different types of names; in fact, I believe there are 32,000 names that I found randomly on a government website. If you train makemore on this dataset, it will learn to make more of things like this. In particular, in this case, that will mean more things that sound name-like, but are actually unique names. Maybe if you have a baby and you're trying to assign a name, maybe you're looking for a cool new-sounding unique name, makemore might help you. Here are some example generations from the neural network once we train it on our dataset. Here are some example unique names that it will generate: dontell, irot, zendi, and so on. All of these sound name-like, but of course they're not existing names.

Under the hood, makemore is a character-level language model. What that means is that it is treating every single line here as an example, and within each example it's treating them all as sequences of individual characters. So r-e-e-s-e is this example, and that's the sequence of characters, and that's the level on which we are building out makemore. What it means to be a character-level language model, then, is that it's just modeling those sequences of characters and it knows how to predict the next character in the sequence.

Now, we're actually going to implement a large number of character-level language models, in terms of the neural networks that are involved in predicting the next character in a sequence: very simple bigram and bag-of-words models, multilayer perceptrons, recurrent neural networks, all the way to modern transformers. In fact, the transformer that we will build will be basically the equivalent transformer to GPT-2, if you have heard of GPT. That's kind of a big deal; it's a modern network, and by the end of the series you will actually understand how that works on the level of characters.

Now, to give you a sense of the extensions here: after characters, we will probably spend some time on the word level, so that we can generate documents of words, not just little segments of characters, but entire, much larger documents. Then we're probably going to go into images and image-text networks, such as DALL-E, Stable Diffusion, and so on. But for now, we have to start here, with character-level language modeling. Let's go.
Like before, we are starting with a completely blank Jupyter notebook page. The first thing I would like to do is basically load up the dataset, names.txt. We're going to open up names.txt for reading, and we're going to read everything into a massive string. Then, because it's a massive string and we only want the individual words, let's put them in a list: we call splitlines on that string to get all of our words as a Python list of strings.

We can look at, for example, the first 10 words, and we have that it's a list of emma, olivia, ava, and so on. If we look at the top of the page here, that is indeed what we see, so that's good. This list actually makes me feel that this is probably sorted by frequency. But these are the words.

Now we'd like to learn a little bit more about this dataset. Let's look at the total number of words; we expect this to be roughly 32,000. Then, what is, for example, the shortest word? So min of len(w) for w in words: the shortest word is length 2. And max of len(w) for w in words: the longest word is 15 characters.
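A minimal sketch of the loading and inspection steps just described (it assumes names.txt sits next to the notebook; variable names follow the narration):

```python
# Load the dataset: one name per line, split into a list of strings.
words = open('names.txt', 'r').read().splitlines()

print(words[:10])                   # ['emma', 'olivia', 'ava', ...]
print(len(words))                   # roughly 32,000 names
print(min(len(w) for w in words))   # shortest name: 2 characters
print(max(len(w) for w in words))   # longest name: 15 characters
```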
Let's now think through our very first language model. As I mentioned, a character-level language model is predicting the next character in a sequence given some concrete sequence of characters before it. Now, what we have to realize here is that every single word here, like isabella, is actually quite a few examples packed into that single word. Because what is the existence of a word like isabella in the dataset really telling us? It's saying that the character i is a very likely character to come first in the sequence of a name. The character s is likely to come after i. The character a is likely to come after is. The character b is very likely to come after isa, and so on, all the way to the a following isabell. Then there's one more example actually packed in here, and that is that after isabella, the word is very likely to end. So that's one more explicit piece of information that we have here that we have to be careful with.

So there's a lot packed into a single individual word in terms of the statistical structure of what's likely to follow in these character sequences. And of course, we don't have just an individual word, we actually have 32,000 of these, so there's a lot of structure here to model.

Now, in the beginning, what I'd like to start with is building a bigram language model. In a bigram language model, we're always working with just two characters at a time. So we're only looking at one character that we are given, and we're trying to predict the next character in the sequence: what characters are likely to follow r, what characters are likely to follow a, and so on. We're just modeling that little local structure. We're forgetting the fact that we may have a lot more information; we're always just looking at the previous character to predict the next one. So it's a very simple and weak language model, but I think it's a great place to start.
So now let's begin by looking at these bigrams in our dataset and what they look like. These bigrams, again, are just two characters in a row. So for w in words — each w here is an individual word, a string — we want to iterate over this word with consecutive characters, so two characters at a time, sliding through the word.

Now, an interesting, nice, cute way to do this in Python, by the way, is doing something like this: for ch1, ch2 in zip(w, w[1:]), print ch1, ch2 — and let's not do all the words, let's just do the first three words. I'm going to show you in a second how this works, but for now, as an example, let's just do the very first word alone, emma. You see how we have emma here, and this will just print em, mm, ma. The reason this works is because w is the string emma, w[1:] is the string mma, and zip takes two iterators, pairs them up, and then creates an iterator over the tuples of their consecutive entries. If any one of these lists is shorter than the other, then it will just halt and return. So basically, that's why we get em, mm, ma: because the second iterator runs out of elements, zip just ends, and that's why we only get these tuples. So, pretty cute — these are the consecutive elements in the first word.
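A sketch of the zip trick as described, run on just the first word for illustration:

```python
# Slide over consecutive character pairs (bigrams) using zip.
for w in words[:1]:                    # just 'emma' for illustration
    for ch1, ch2 in zip(w, w[1:]):     # pairs 'emma' with 'mma'; zip stops at the shorter iterator
        print(ch1, ch2)                # prints: e m / m m / m a
```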
Now we have to be careful, because we actually have more information here than just these three examples. As I mentioned, we know that e is very likely to come first, and we know that a, in this case, is coming last. So one way to do this is that we're going to create a special list here, chs, and we're going to hallucinate a special start token, which I'm going to call <S>. So chs is a list of that one element, plus w, plus a special end character, <E>. The reason I'm wrapping w in list() here is that w is a string, emma, and list(w) gives the individual characters as a list. Doing this again now, but iterating not over w but over chs, will give us something like this: this is a bigram of the start character and e, and this is a bigram of a and the special end character.

Now we can look at, for example, what this looks like for olivia or ava. Indeed, we could potentially do this for the entire dataset, but we won't print that, that's going to be too much. But these are the individual character bigrams, and we can print them.
Now, in order to learn the statistics about which characters are likely to follow other characters, the simplest way in a bigram language model is to simply do it by counting. So we're basically just going to count how often any one of these combinations occurs in the training set, in these words. So we're going to need some kind of dictionary that's going to maintain counts for every one of these bigrams. Let's use a dictionary b, and it will map these bigrams: bigram is a tuple of (ch1, ch2), and then b[bigram] will be b.get(bigram, 0) + 1. The .get is basically the same as indexing with b[bigram], but in the case that the bigram is not in the dictionary b, it returns zero by default — and then we add one. So this will add up all the bigrams and count how often they occur.

Let's get rid of the printing — or rather, let's keep the printing — and let's just inspect what b is in this case. We see that many bigrams occur just a single time. This one allegedly occurred three times: a was an ending character three times, and that's true for all of these words — all of emma, olivia, and ava end with a. That's why it occurred three times. Now let's do it for all the words. Oops, I should not have printed, I meant to erase that — let's kill this. Let's just run, and now b will have the statistics of the entire dataset. So these are the counts, across all the words, of the individual bigrams.
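A sketch of the dictionary-based counting over the whole dataset, following the narration (with the <S>/<E> tokens mentioned above):

```python
# Count how often every consecutive character pair occurs across all names.
b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        b[bigram] = b.get(bigram, 0) + 1   # .get returns 0 if the bigram hasn't been seen yet
```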
We could, for example, look at some of the most common ones and the least common ones. There are a few ways to do this in Python, but the simplest way I like is to just use b.items(). b.items() returns the tuples of (key, value); in this case, the keys are the character bigrams and the values are the counts. Then what we want to do is sorted() of this. But by default, sort works on the first item of the tuple, and we want to sort by the values, which are the second element of the tuple. So we use a key=lambda that takes the key-value pair and returns the element at index one — not at zero but at one — which is the count. So we sort by the count of these elements, and actually we want it to go backwards.

So here what we have is that a bigram like q-r occurs only a single time, d-z occurred only a single time, and so on. When we sort the other way around, we see the most likely bigrams. We see that n was very often an ending character, many many times, and apparently an n very often follows an a; that's a very likely combination as well. So these are the individual counts that we achieve over the entire dataset.
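A sketch of the sorting just described (most frequent bigrams first); the exact top entries are the very common ones called out above:

```python
# Sort bigrams by their count, descending.
sorted(b.items(), key=lambda kv: -kv[1])
# top entries are things like an n followed by <E>, or an n following an a;
# the tail is bigrams that occur only a single time
```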
Now, it's actually going to be significantly more convenient for us to keep this information in a two-dimensional array instead of a Python dictionary. So we're going to store this information in a 2D array, where the rows are going to be the first character of the bigram and the columns are going to be the second character. Each entry in this two-dimensional array will tell us how often the second character follows the first character in the dataset.

In particular, the array representation that we're going to use, or the library, is that of PyTorch. PyTorch is a deep learning neural network framework, but part of it is also torch.Tensor, which allows us to create multi-dimensional arrays and manipulate them very efficiently. So let's import PyTorch, which you can do with import torch. Then we can create arrays. Let's create an array of zeros, and we give it the size of the array: a three-by-five array as an example, and this is a three-by-five array of zeros. By default, you'll notice that a.dtype, which is short for data type, is float32, so these are single-precision floating-point numbers. Because we are going to represent counts, let's actually use dtype=torch.int32, so these are 32-bit integers. Now you see that we have integer data inside this tensor.

Now, tensors allow us to manipulate all the individual entries, and to do it very efficiently. So for example, if we want to change this entry, we have to index into the tensor. In particular, this is the first row and this is the second row, because it's zero-indexed; so this is row index one, and the columns are index zero, one, two, three. So with a[1, 3] we can set that to one, and then a will have a one over there. We can, of course, also do things like a[1, 3] += 1, so now a will be two over there, or three. Also, we can, for example, say a[0, 0] is five, and then a will have a five over here. So that's how we can index into the arrays.
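A sketch of the small tensor experiment being described:

```python
import torch

# 3x5 array of 32-bit integer zeros (default dtype would be float32).
a = torch.zeros((3, 5), dtype=torch.int32)
a[1, 3] = 1       # row index 1, column index 3
a[1, 3] += 1      # now 2 (and 3 if incremented again)
a[0, 0] = 5
print(a)
print(a.dtype)    # torch.int32
```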
Now, of course, the array that we are interested in is much, much bigger. For our purposes, we have 26 letters of the alphabet, and then we have two special characters, <S> and <E>. So we want 26 plus 2, a 28-by-28 array, and let's call it capital N, because it's going to represent the counts. Let me erase this stuff. So that's the array that starts at zeros, 28 by 28. Now, let's copy-paste the counting loop here, but instead of having a dictionary b, which we're going to erase, we'll count into N.

Now, the problem here is that we have these characters, which are strings, but we have to index into an array, and we have to index using integers. So we need some kind of lookup table from characters to integers. Let's construct such a character list. The way we're going to do this is we take all the words, which is a list of strings, and concatenate all of it into a massive string, so this is just the entire dataset as a single string. We pass this to the set constructor, which takes this massive string and throws out duplicates, because sets do not allow duplicates. So the set of this will just be the set of all the lowercase characters, and there should be a total of 26 of them. Now, we actually don't want a set, we want a list — but we don't want a list sorted in some weird arbitrary way, we want it sorted from a to z. So those are our characters.

Now, what we want is this lookup table, as I mentioned. So let's create a mapping I'll call stoi — s is for string, or character, and i is for integer — built as s mapped to i for i, s in enumerate of these characters. Enumerate basically gives us an iterator over both the integer index and the actual element of the list, and then we map the character to the integer. So stoi is a mapping from a to 0, b to 1, and so on, all the way from z to 25. That's going to be useful here, but we also have to specifically set that stoi of <S> will be 26, and stoi of <E> will be 27, because z was 25.

So those are the lookups. Now we can come here and map both ch1 and ch2 to their integers: ix1 will be stoi of ch1, and ix2 will be stoi of ch2. Now we should be able to do this line, but using our array: N at ix1, ix2 — this is the two-dimensional array indexing I've shown you before — and it's simply += 1, because everything starts at zero. So this should work and give us a large 28-by-28 array of all these counts.
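Putting the pieces just described together, a sketch of this first 28-by-28 version:

```python
# Character-to-integer lookup: 'a'..'z' -> 0..25, plus the two special tokens.
chars = sorted(list(set(''.join(words))))
stoi = {s: i for i, s in enumerate(chars)}
stoi['<S>'] = 26
stoi['<E>'] = 27

# Count every bigram into a 28x28 integer array.
N = torch.zeros((28, 28), dtype=torch.int32)
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
```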
So if we print N, this is the array, but of course it looks ugly. Let's erase this ugly mess and try to visualize it a bit more nicely. For that, we're going to use a library called matplotlib. Matplotlib allows us to create figures, so we can do things like plt.imshow of the count array. This is the 28-by-28 array, and this is the structure, but even this, I would say, is still pretty ugly. So we're going to try to create a much nicer visualization of it, and I wrote a bunch of code for that.

The first thing we're going to need is to invert this dictionary. So stoi is a mapping from s to i, and in itos, we're going to reverse this dictionary: iterate over all the items and just swap them around. So itos maps inversely, from 0 to a, 1 to b, and so on. We'll need that.

Then here's the code that I came up with to try to make this a little bit nicer. We create a figure, we plot N, and then we visualize a bunch of things here. Let me just run it so you get a sense of what this is. You see here that we have the array spaced out, and every one of these cells is basically saying things like: b follows g 0 times, b follows h 41 times, a follows j 175 times. What you can see that I'm doing here is: first I show the entire array, then I iterate over all the individual cells, and I create a character string, which is the inverse mapping itos of the integer i and the integer j — so those are the bigrams in a character representation. Then I plot just the bigram text, and then I plot the number of times that this bigram occurs.

Now, the reason there's a .item() here is that when you index into these arrays — these are torch tensors — you see that we still get a tensor back. So the type of this thing, you'd think it would be just an integer, 149, but it's actually a torch.Tensor. If you do .item(), then it pops out that individual integer, so it'll just be 149. That's what's happening there. These are just some options to make it look nice.
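A sketch of that visualization, reconstructed from the narration (the figure size, colormap, and text-placement options are assumptions; only the overall structure is described above):

```python
import matplotlib.pyplot as plt

# Inverse mapping: integer index back to the character.
itos = {i: s for s, i in stoi.items()}

plt.figure(figsize=(16, 16))
plt.imshow(N, cmap='Blues')
for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]                 # the bigram as a two-character string
        plt.text(j, i, chstr, ha='center', va='bottom', color='gray')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='gray')  # .item() pops out the int
plt.axis('off')
```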
So what is the structure of this array? We have all these counts, and we see that some of them occur often and some of them do not occur often. Now, if you scrutinize this carefully, you will notice that we're not actually being very clever. That's because, when you come over here, you'll notice that, for example, we have an entire row of completely zeros. That's because the end character <E> is never possibly going to be the first character of a bigram, because we're always placing these end tokens at the end of a word. Similarly, we have an entire column of zeros here, because the <S> character will never possibly be the second element of a bigram, because we always start with <S> and we end with <E>, and we only have the words in between. So we have an entire column of zeros, an entire row of zeros, and in this little two-by-two block here as well, the only entry that could possibly be non-zero is <E> directly following <S>. That could only be non-zero if we had a word with no letters at all — an empty word — where we'd just have an <S> followed by an <E>. But the other ones are just not possible.

So we're basically wasting space, and not only that, but the <S> and the <E> are getting very crowded here. I was using these brackets because it's a convention in natural language processing to use these kinds of brackets to denote special tokens, but we're going to use something else. So let's fix all this and make it prettier. We're not actually going to have two special tokens, we're only going to have one special token. So we're going to have an N array of 27 by 27 instead; instead of having two special characters, we'll just have one, and I will call it a dot. Let me swing this over here.

Now, one more thing that I would like to do is to make this special character have position zero, and to offset all the other letters by one. I find that a little bit more pleasing. So we need a plus one here, so that the first character, which is a, will start at one. So in stoi, a now starts at one and dot is zero. For itos, of course, we're not changing anything, because itos just creates the reverse mapping, and it will work fine: one is a, two is b, zero is dot. So we've reversed that. Here, we use a dot at the start and a dot at the end, and this should work fine. I make sure to start N at zeros again for the counts, and here we don't go up to 28 anymore, it's 27, and this should just work.

So we see that dot-dot never happened; it's at zero, because we don't have empty words. Then this row here is now just very simply the counts for all the first letters: how many words start with j, how many start with h, how many start with i, and so on. Then these are all the ending characters, and in between we have the structure of what characters follow each other. So this is the counts array of our entire dataset.
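A sketch of the cleaner 27-by-27 version described above, with the single '.' token at position zero:

```python
# 'a'..'z' -> 1..26, and the single start/end token '.' -> 0.
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

# Count every bigram into a 27x27 integer array.
N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1
```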
This array actually has all the information necessary for us to sample from this bigram character-level language model. Roughly speaking, what we're going to do is just start following these probabilities and these counts, and we're going to start sampling from the model.

In the beginning, of course, we start with the dot, the start token. So to sample the first character of a name, we're looking at this row here. We see that we have the counts, and those counts are essentially telling us how often any one of these characters is to start a word. If we take this N and we grab the first row, we can do that by indexing at zero and then using the colon notation for the rest of that row: N[0, :] indexes into the zeroth row and then grabs all the columns. So this gives us a one-dimensional array of the first row — you'll notice 0, 4410, and so on — it's just the first row, and the shape of this is 27; it's just a row of 27. The other way you can do this is to just grab the zeroth row like this, N[0]; it's equivalent.

Now, these are the counts, and what we'd like to do is sample from this. Since these are the raw counts, we actually have to convert them to probabilities. So we create a probability vector: we take N[0] and convert it to float first, so these integers become floating-point numbers. The reason we're creating floats is that we're about to normalize these counts. To create a probability distribution here, we basically want to do p = p / p.sum(). Now we get a vector of smaller numbers, and these are now probabilities. Of course, because we divided by the sum, the sum of p is now one, so this is a nice, proper probability distribution: it sums to one, and it is giving us the probability of any single character to be the first character of a word.
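A sketch of just those two steps:

```python
# Turn the first row of counts into a probability distribution over the first character.
p = N[0].float()   # counts of each character starting a word, as floats
p = p / p.sum()    # normalize; p now sums to 1
```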
So now we can try to sample from this distribution. To sample from these distributions, we're going to use torch.multinomial, which I've pulled up here. torch.multinomial returns samples from the multinomial probability distribution, which is a complicated way of saying: you give me probabilities, and I will give you integers, which are sampled according to that probability distribution. This is the signature of the method.

To make everything deterministic, we're going to use a generator object in PyTorch. This makes everything deterministic, so when you run this on your computer, you're going to get the exact same results that I'm getting here on my computer. Let me show you how this works. Here's a deterministic way of creating a torch Generator object, seeding it with some number that we can agree on. That seeds a generator and gives us an object g. Then we can pass that g to a function that creates random numbers: torch.rand creates random numbers, three of them, and it's using this generator object as a source of randomness. So without normalizing, I can just print: these are numbers between 0 and 1 that are random, according to this generator. Whenever I run it again, I'm always going to get the same result, because I keep using the same generator object, which I'm seeding here. Then, if I divide to normalize, I get a nice probability distribution of just three elements.

Then we can use torch.multinomial to draw samples from it, and this is what that looks like. torch.multinomial takes the torch tensor of probabilities, and then we can ask for a number of samples, let's say 20. replacement=True means that when we draw an element, we can draw it and then put it back into the list of eligible indices to draw again. We have to specify replacement as True because, by default, for some reason, it's False; I think it's just something to be careful with. The generator is passed in here, so we're always going to get deterministic results, the same results. If I run these two, we get a bunch of samples from this distribution.

Now, you'll notice here that the probability of the first element in this tensor is 60 percent. So in these 20 samples, we'd expect 60 percent of them to be zero, and we'd expect 30 percent of them to be one. Because the element at index two has only 10 percent probability, very few of these samples should be two, and indeed, we only have a small number of twos. We can sample as many as we would like; the more we sample, the more these numbers should roughly match the distribution: we should have lots of zeros, roughly half as many ones, and about three times fewer twos than ones. You see that we have very few twos, we have some ones, and most of them are zero. That's what torch.multinomial is doing.
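A sketch of this generator-plus-multinomial demo; the narration only says the seed is "some number that we can agree on", so the specific value below is an assumption:

```python
# Seeded generator so the random draws are reproducible.
g = torch.Generator().manual_seed(2147483647)

p = torch.rand(3, generator=g)     # three random numbers in [0, 1)
p = p / p.sum()                    # normalize into a probability distribution
print(p)                           # roughly a 60% / 30% / 10% split, as discussed above

samples = torch.multinomial(p, num_samples=20, replacement=True, generator=g)
print(samples)                     # mostly 0s, some 1s, very few 2s
```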
For us here, we are interested in this row. We've created this p here, and now we can sample from it. If we use the same seed and then sample from this distribution — let's just get one sample — then we see that the sample is, say, 13. This will be the index. You see how it's a tensor that wraps 13? We again have to use .item() to pop out that integer, so now ix is just the number 13. Of course, we can map this through itos to figure out exactly which character we're sampling here: we're sampling m. We're saying that the first character in our generation is m.

Just looking at the row here, m was drawn, and we can see that m actually starts a large number of words: m started about 2,500 words out of 32,000 words, so a bit less than 10 percent of the words start with m. This was actually a fairly likely character to draw. So that would be the first character of our word, and now we can continue to sample more characters, because we know that m has already been sampled. To draw the next character, we come back here and look for the row that starts with m. You see m, and we have a row here: we see that m-dot is 516, m-a is this many, m-b is this many, and so on. These are the counts for the next character, and that's the next character that we are going to now generate.
I think we are ready to actually just write out the loop, because I think you're starting to get a sense of how this is going to go. We always begin at index 0, because that's the start token. Then, while True, we grab the row corresponding to the index that we're currently on, so that's the N array at ix, converted to float — that's p. Then we normalize p to sum to one. (I accidentally ran the infinite loop there.) Then we need this generator object, which we're going to initialize up here, and we draw a single sample from this distribution. That tells us what index is going to come next. If the index sampled is zero, then that's the end token, so we break; otherwise, we print itos of ix. That's pretty much it; this should work.

Mor — that's the name that we've sampled. We started with m, the next step was o, then r, and then dot. This dot, we print it here as well. Let's now do this a few times. Let's actually create an out list here, and instead of printing, we're going to append, so out.append this character. Then at the end, let's just join up all of the out list and print it. Now, we're always getting the same result because of the generator, so if we want to do this a few times, we can go for i in range(10): we sample 10 names by just doing that 10 times. These are the names that we're getting out. Let's do 20.
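A sketch of the full sampling loop, following the structure just described (same assumed seed as before):

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(20):
    out = []
    ix = 0                                   # always begin at the '.' start token
    while True:
        p = N[ix].float()
        p = p / p.sum()                      # normalize the current row into probabilities
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])                 # the trailing '.' gets appended too, as noted above
        if ix == 0:                          # sampled the end token
            break
    print(''.join(out))
```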
I'll be honest with you, this doesn't look right, so I stared at it for a few minutes to convince myself that it actually is right. The reason these samples are so terrible is that the bigram language model is actually just really terrible. We can generate a few more here, and you can see that they're a little bit name-like — like yanu, erily, and so on — but they're just totally messed up.

Here's why this is so bad. Say we're generating just h as a name. You have to think through it from the model's eyes: it doesn't know that this h is the very first character of the name. All it knows is that h came previously, and now: how likely is the word to end after h? Well, it's somewhat likely, so it just makes h the last character. It doesn't know whether there were other things before it or not. That's why it's generating all these nonsense names.

Another way to convince yourself that it is actually doing something reasonable, even though it's so terrible, is to look at these little p's here, which are vectors of 27 probabilities. How about if we did something like this: instead of p having any structure whatsoever, what if p was just torch.ones of 27 — by default that's float32, so this is fine — divided by 27? What I'm doing here is the uniform distribution, which makes everything equally likely, and we can sample from that. Let's see if that does any better. This is what you get from a model that is completely untrained, where everything is equally likely, so it's obviously garbage. And then, if we have a trained model, which is trained on just bigrams, this is what we get. You can see that it is more name-like — it is actually working. It's just that the bigram model is so terrible, and we have to do better.
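The untrained baseline just mentioned is only a one-line change inside the loop:

```python
# Uniform distribution over all 27 characters: every character equally likely (~3.7% each).
p = torch.ones(27) / 27.0   # sampling from this gives pure garbage, unlike the counted bigram rows
```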
Now, next, I would like to fix an inefficiency that we have going on here. What we're doing is we're always fetching a row of N from the counts matrix up above, and then we're always doing the same things: converting to float and dividing, every single iteration of this loop. We just keep renormalizing these rows over and over again, and it's extremely inefficient and wasteful. What I'd like to do instead is prepare a matrix, capital P, that will just have the probabilities in it. In other words, it's going to be the same as the capital N matrix of counts, but every single row will be normalized to one, indicating the probability distribution for the next character given the character before it, as defined by which row we're in. Basically, we'd like to do this once, up front, and then just use that row here: we would like to just do p = P[ix] instead.

The other reason I want to do this is not just for efficiency, but also because I would like us to practice these n-dimensional tensors, to practice their manipulation, and especially something that's called broadcasting, which we'll go into in a second. We're going to have to become very good at these tensor manipulations, because if we're going to build out all the way to transformers, we're going to be doing some pretty complicated array operations for efficiency, and we need to really understand that and be very good at it.

Intuitively, what we want to do is: first we grab a floating-point copy of N — I'm mimicking the earlier line here, so P = N.float() — and then we want to divide all the rows so that they sum to one. We'd like to do something like P = P / P.sum(). But we have to be careful, because P.sum(), as written, sums up all of the counts of this entire matrix and gives us a single number, the summation of everything. That's not the way we want to divide: we want to simultaneously, and in parallel, divide all the rows by their respective sums.

So what we have to do now is go into the documentation for torch.sum, and we can scroll down to a definition that is relevant to us, which is the one where we don't only provide an input array that we want to sum, but we also provide the dimension along which we want to sum. In particular, we want to sum up over rows. Now, one more argument I want you to pay attention to here is keepdim, which is False by default. If keepdim is True, then the output tensor is of the same size as the input, except of course the dimension along which you summed, which becomes just one. But if you pass in keepdim as False, then that dimension is squeezed out: torch.sum not only does the sum and collapses the dimension to size one, but in addition it does what's called a squeeze, where it squeezes out that dimension. Basically, what we want here is P.sum along a particular axis.
In particular, notice that P.shape is 27 by 27. When we sum up across axis zero, we would be taking the zeroth dimension and summing across it. When keepdim is True, this not only gives us the counts along the columns, but notice that the shape of it is 1 by 27: we just get a row vector. The reason we get a row vector here is because we passed in dimension zero: that zeroth dimension becomes one, we've done the sum, and we get a row. Basically, we've done the sum this way, vertically, and arrived at just a single 1-by-27 vector of counts. What happens when you take out keepdim is that we just get 27: it squeezes out that dimension, and we get a one-dimensional vector of size 27.

Now, we don't actually want a 1-by-27 row vector, because that gives us the counts, or the sums, across the columns. We actually want to sum the other way, along dimension one, and you'll see that the shape of that is 27 by 1, so it's a column vector: a 27-by-1 vector of counts. That's because what's happened here is that we're going horizontally, and this 27-by-27 matrix becomes a 27-by-1 array.

Now, you'll notice, by the way, that the actual numbers in these two sets of counts are identical. That's because this special array of counts comes from bigram statistics, and it just so happens, because of the way this array is constructed, that the sums along the columns and along the rows — horizontally or vertically — are identical. But what we actually want to do in this case is sum across the rows, horizontally. So what we want here is the sum along dimension one with keepdim set to True: the 27-by-1 column vector.

Now, what we want to do is divide by that, and here we have to be careful again. P.shape, as you see here, is 27 by 27. Is it possible to take a 27-by-27 array and divide it by a 27-by-1 array? Is that an operation that you can do? Whether or not you can perform this operation is determined by what's called broadcasting rules. If you search for "broadcasting semantics" in torch, you'll see that there's a special definition for broadcasting that decides whether these two arrays can be combined in a binary operation like division. The first condition is that each tensor has at least one dimension, which is the case for us. Then, when iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them must be one, or one of them must not exist.

So we align the two shapes, which is very easy because both of these shapes have two elements, so they're aligned. Then we iterate from the right, going to the left: each dimension must either be equal, one of them must be a one, or one of them must not exist. In this case they're not equal, but one of them is a one, so this is fine. Then for the other dimension they're both equal, so that's fine too. All the dimensions are fine, and therefore this operation is broadcastable, which means that this operation is allowed.

What is it that these arrays do when you divide a 27-by-27 by a 27-by-1? What broadcasting does is take the dimension of size one and stretch it out: it copies it to match the 27 in this case. In our case, it takes this column vector, which is 27 by 1, and copies it 27 times, so that internally both of them are effectively 27 by 27 — you can think of it that way. It copies those counts, and then it does an element-wise division, which is what we want, because we want to divide by these counts for every single one of the columns in this matrix. This, we expect, will normalize every single row. We can check that this is true by taking the first row, for example, and taking its sum: we expect it to be one, because it's now normalized. And if we actually correctly normalized all the rows, we expect the sampling to give the exact same result as before. Let's run this. It's the exact same result. This is correct.
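A sketch of the normalization just walked through:

```python
# Build the full probability matrix P by normalizing every row of counts.
P = N.float()
P = P / P.sum(dim=1, keepdim=True)   # (27, 27) / (27, 1): each row is divided by its own sum

print(P[0].sum())                    # tensor(1.) -- every row is now a proper probability distribution
```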
Now I would like to scare you a little bit. I basically encourage you, very strongly, to read through the broadcasting semantics, and I encourage you to treat this with respect. It's not something to play fast and loose with; it's something to really respect, really understand, and maybe look up some tutorials on broadcasting and practice it, and be careful with it, because you can very quickly run into bugs. Let me show you what I mean.

You see how here we have P.sum(1, keepdim=True), and the shape of it is 27 by 1. Let me take out this line, just so we have N, and then we can see the counts. We can see that these are all the counts across the rows, as a 27-by-1 column vector. Now, suppose that I tried to do the same thing but erased the keepdim=True here. What does that do? If keepdim is not True — it's False — then remember, according to the documentation, it gets rid of that dimension of one; it squeezes it out. Basically, we just get all the same counts, the same result, except the shape of it is not 27 by 1, it's just 27: the one disappears. But all the counts are the same, so you'd think that dividing by this would work.

First of all, can we even write this, and is it even expected to run? Is it broadcastable? Let's determine whether it is: the shape of P.sum(1) is 27, so we're broadcasting a 27-by-27 against a 27. Now, the rules of broadcasting: number one, align all the dimensions on the right — done. Then iterate over all the dimensions, starting from the right and going to the left: each dimension must either be equal, one of them must be one, or one of them must not exist. Here, they are both equal. Here, the dimension does not exist, so internally what broadcasting will do is create a one there, and then we see that one of them is a one, so it will get copied, and this will run — this will broadcast.

So you'd expect this to work, because we can divide by this. Now, if I run it, you'd expect it to work, but it doesn't. You actually get garbage: you get a wrong result, because this is actually a bug. The keepdim=True is what makes it work; without it, this is a bug. In both cases, we are computing the correct counts — we are summing up across the rows — but keepdim is what's saving us and making it work.

At this point, I'd like to encourage you to potentially pause the video and try to think about why this is buggy and why keepdim was necessary here. The reason for this is what I was hinting at earlier, when I was giving you a bit of a hint on how this works: this vector of 27, internally, inside the broadcasting, becomes a 1 by 27 — and a 1 by 27 is a row vector. Now we are dividing a 27 by 27 by a 1 by 27, and torch will replicate this dimension: it takes this row vector and copies it vertically, 27 times. Now the shapes align as 27 by 27, and it divides element-wise. So what's happening here, basically, is that we're actually normalizing the columns instead of normalizing the rows. You can check that this is what's happening: P[0], which is the first row of P, does not sum to one — it's seven. It is the first column, as an example, that sums to one.

To summarize, where does the issue come from? The issue comes from the silent adding of a dimension here, because in the broadcasting rules, you align on the right, go from right to left, and if a dimension doesn't exist, you create it. That's where the problem happens. We still did the counts correctly: we did the counts across the rows and got the counts, on the right here, as a column vector. But because keepdim was not True, that dimension was discarded, and now we just have a vector of 27. Because of the way broadcasting works, this vector of 27 suddenly becomes a row vector, and then this row vector gets replicated vertically, so at every single point we are dividing by the count in the opposite direction. So this just doesn't work; this needs to be keepdim=True in this case. Then P[0] is normalized, and conversely, the first column you'd expect to potentially not be normalized — and that is what makes it work.

This is pretty subtle, and hopefully it helps to scare you a little: you should have respect for broadcasting. Be careful, check your work, and understand how it works under the hood; make sure that it's broadcasting in the direction that you intend. Otherwise, you're going to introduce very subtle, very hard-to-find bugs, so just be careful.
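A sketch of the buggy variant being described, so you can see the failure directly (the specific numbers follow what was noted above for this particular count matrix):

```python
# Dropping keepdim makes the row-sums broadcast the wrong way.
P_bad = N.float()
P_bad = P_bad / P_bad.sum(dim=1)   # (27, 27) / (27,): the (27,) is treated as a (1, 27) row vector
print(P_bad[0].sum())              # not 1 (about 7 here): the rows are NOT normalized
print(P_bad[:, 0].sum())           # the columns end up normalized instead (sums to 1 for this matrix)
```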
One more note on efficiency. We don't want to be doing it the way it is here, because this creates a completely new tensor that we then store into P. We prefer to use in-place operations if possible: that would be an in-place operation, it has the potential to be faster, and it doesn't create new memory under the hood.
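For concreteness, the in-place variant looks like this:

```python
# /= modifies P in place instead of allocating a new tensor for the result.
P = N.float()
P /= P.sum(dim=1, keepdim=True)
```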
Then let's erase this, we don't need it, and let's also just sample fewer names, so I'm not wasting space.

We're actually in a pretty good spot now. We trained a bigram language model, and we trained it really just by counting how frequently any pairing occurs and then normalizing, so that we get a nice probability distribution. Really, the elements of this array P are the parameters of our bigram language model, giving us and summarizing the statistics of these bigrams. We trained the model, and then we know how to sample from it: we just iteratively sample the next character and feed it in each time to get the next character.

Now what I'd like to do is to somehow evaluate the quality of this model. We'd like to summarize the quality of this model into a single number. How good is it at predicting the training set, for example? On the training set, we can now evaluate the training loss, and this training loss tells us about the quality of the model in a single number, just like we saw in micrograd.

So let's try to think through the quality of the model and how we would evaluate it. Basically, what we're going to do is copy-paste the code that we previously used for counting. Let me just print these bigrams first: we're going to use f-strings, and I'm going to print character 1 followed by character 2 — these are the bigrams — and I don't want to do it for all the words, just the first three words. So here we have the bigrams of emma, olivia, and ava. Now, what we'd like to do is look at the probability that the model assigns to every one of these bigrams. In other words, we can look at the probability, which is summarized in the matrix P at ix1, ix2, and then we can print it here as the probability. Because these probabilities get printed with way too many digits, let me use a .4f format to truncate them a bit.
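A sketch of that printout (the exact f-string layout in the notebook may differ):

```python
# Print the probability the model assigns to each bigram in the first few words.
for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        print(f'{ch1}{ch2}: {prob:.4f}')
```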
What do we have here? We're looking at the probabilities that the model assigns to every one of these bigrams in the dataset, and we can see that some of them are 4 percent, 3 percent, and so on. Just to have a measuring stick in our mind: with 27 possible characters, or tokens, if everything were equally likely, you'd expect all of these probabilities to be roughly 4 percent. Anything above 4 percent means that we've learned something useful from these bigram statistics. You see that roughly some of these are 4 percent, but some of them are as high as 40 percent, 35 percent, and so on. So the model actually assigns a pretty high probability to whatever is in the training set, and that's a good thing. Basically, if you have a very good model, you'd expect these probabilities to be near one, because that means your model is correctly predicting what's going to come next, especially on the training set, where you trained your model.

Now we'd like to think about how we can summarize these probabilities into a single number that measures the quality of this model. When you look into the literature on maximum likelihood estimation, statistical modeling and so on, you'll see that what's typically used here is something called the likelihood. The likelihood is the product of all of these probabilities, and it's really telling us about the probability of the entire dataset as assigned by the model that we've trained. That is a measure of quality: the product of these should be as high as possible. When you are training the model, and when you have a good model, the product of these probabilities should be very high.

Now, the product of these probabilities is an unwieldy thing to work with — you can see that all of them are between zero and one, so the product will be a very tiny number. So for convenience, what people usually work with is not the likelihood but what's called the log likelihood. The product of these is the likelihood; to get the log likelihood, we just take the log of the probability. The log of x, for x from zero to one — you can see here — is a monotonic transformation of the probability, where if you pass in one, you get zero. So probability one gets you a log probability of zero, and then as the probability gets lower and lower, the log grows more and more negative, all the way to negative infinity at zero.

So here we have logprob, which is really just torch.log of the probability. Let's print it out to get a sense of what that looks like: logprob, also with .4f formatting. As you can see, when we plug in numbers that are close to one — some of our higher probabilities — we get closer and closer to zero, and when we plug in very bad probabilities, we get a more and more negative number.

The reason we work with this is, to a large extent, convenience: mathematically, if you have some product a times b times c of all these probabilities — the likelihood being the product of all these probabilities — then the log of that is just log of a plus log of b plus log of c, if you remember your logs from high school or undergrad. So we have, basically, that the likelihood is the product of the probabilities, and the log likelihood is just the sum of the logs of the individual probabilities.
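In symbols, just restating what was said, with $p_i$ the probability the model assigns to the $i$-th bigram:

$$
\text{likelihood} = \prod_{i} p_i, \qquad
\log \text{likelihood} = \sum_{i} \log p_i
$$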
So log_likelihood starts at zero, and then here we can just accumulate it, simply, and at the end we can print it: print the log likelihood. Maybe you're familiar with this. So the log likelihood is negative 38.

Now, how high can the log likelihood get? It can go up to zero: when all the probabilities are one, the log likelihood will be zero, and when the probabilities are lower, it grows more and more negative. Now, we don't actually like this, because what we'd like is a loss function, and a loss function has the semantics that we're trying to minimize it. So we actually need to invert this, and that's what gives us something called the negative log likelihood. The negative log likelihood is just the negative of the log likelihood. (These are f-strings, by the way, if you'd like to look that up.) So the negative log likelihood is a very nice loss function, because the lowest it can get is zero, and the higher it is, the worse the predictions you're making are.

Then one more modification that people sometimes make, for convenience, is to normalize it: they like to make it an average instead of a sum. So here, let's just keep a count as well: n += 1, starting at zero. Then we can compute a normalized log likelihood — if we just normalize by the count, we get the average log likelihood. This average negative log likelihood is usually what we would use as our loss function. So our loss function for the training set, as assigned by the model, is 2.4. That's the quality of this model: the lower it is, the better off we are, and the higher it is, the worse off we are. The job of our training is to find the parameters that minimize the negative log likelihood loss, and that would give us a high-quality model.
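A sketch of this evaluation loop; run over words[:3] it gives the 2.4-ish number above, and over the full dataset it gives roughly the 2.45 that comes up shortly:

```python
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)    # log of the model's probability for this bigram
        log_likelihood += logprob
        n += 1                       # count the bigrams so we can average

nll = -log_likelihood
print(f'{nll/n}')                    # the average negative log likelihood: roughly 2.45 here
```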
link |
Okay, so to summarize, I actually wrote it out here.
link |
So our goal is to maximize likelihood,
link |
which is the product of all the probabilities
link |
assigned by the model.
link |
And we want to maximize this likelihood
link |
with respect to the model parameters.
link |
And in our case, the model parameters here
link |
are defined in the table.
link |
These numbers, the probabilities are the model parameters
link |
sort of in our Barygram language model so far.
link |
But you have to keep in mind that here we are storing
link |
everything in a table format, the probabilities.
link |
But what's coming up as a brief preview
link |
is that these numbers will not be kept explicitly,
link |
but these numbers will be calculated by a neural network.
link |
So that's coming up.
link |
And we want to change and tune the parameters
link |
of these neural networks.
link |
We want to change these parameters
link |
to maximize the likelihood,
link |
the product of the probabilities.
link |
Now, maximizing the likelihood
link |
is equivalent to maximizing the log likelihood
link |
because log is a monotonic function.
link |
Here's the graph of log.
link |
And basically all it is doing is a scaling;
link |
you can look at it as just a scaling of the loss function.
link |
And so the optimization problem here and here
link |
are actually equivalent because this is just scaling.
link |
You can look at it that way.
link |
And so these are two identical optimization problems.
link |
Maximizing the log likelihood is equivalent
link |
to minimizing the negative log likelihood.
link |
And then in practice, people actually minimize
link |
the average negative log likelihood
link |
to get numbers like 2.4.
link |
And then this summarizes the quality of your model.
link |
And we'd like to minimize it and make it
link |
as small as possible.
link |
And the lowest it can get is zero.
link |
And the lower it is, the better off your model is
link |
because it's assigning high probabilities to your data.
link |
Now let's estimate the probability
link |
over the entire training set,
link |
just to make sure that we get something around 2.4.
link |
Let's run this over the entire, oops.
link |
Let's take out the print statement as well.
link |
Okay, 2.45 over the entire training set.
link |
Now what I'd like to show you is that you can actually
link |
evaluate the probability for any word that you want.
link |
Like for example, if we just test a single word, andrej,
link |
and bring back the print statement,
link |
then you see that andrej is actually kind of like
link |
an unlikely word, or like on average,
link |
it takes about three in negative log probability per character to represent it.
link |
And roughly that's because Ej apparently
link |
is very uncommon as an example.
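If you want to reproduce that check, here is a small sketch; it assumes `stoi` and `P` from earlier in the video, and the helper name avg_nll is mine, not from the video.

```python
import torch

# Sketch: score an arbitrary test word under the bigram table.
def avg_nll(word):
    log_likelihood = 0.0
    n = 0
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[ch1], stoi[ch2]])
        n += 1
    return -log_likelihood / n             # average negative log likelihood of this word

print(avg_nll('andrej'))                   # roughly 3: a fairly unlikely word under this model
```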
link |
Now think through this.
link |
When I take andrej and I append q,
link |
and I test the probability of it, andrejq,
link |
we actually get infinity.
link |
And that's because jq has a 0% probability
link |
according to our model.
link |
So the log likelihood,
link |
so the log of zero will be negative infinity.
link |
We get infinite loss.
link |
So this is kind of undesirable, right?
link |
Because we plugged in a string
link |
that could be like a somewhat reasonable name.
link |
But basically what this is saying is that this model
link |
is exactly 0% likely to predict this name.
link |
And our loss is infinity on this example.
link |
And really what the reason for that is that j
link |
is followed by q zero times.
link |
And so jq is 0% likely.
link |
So it's actually kind of gross
link |
and people don't like this too much.
link |
To fix this, there's a very simple fix
link |
that people like to do to sort of like
link |
smooth out your model a little bit.
link |
And it's called model smoothing.
link |
And roughly what's happening is that
link |
we will add some fake counts.
link |
So imagine adding a count of one to everything.
link |
So we add a count of one like this,
link |
and then we recalculate the probabilities.
link |
And that's model smoothing.
link |
And you can add as much as you like.
link |
You can add five and that will give you a smoother model.
link |
And the more you add here,
link |
the more uniform model you're gonna have.
link |
And the less you add, the more peaked model
link |
you are gonna have, of course.
link |
So one is like a pretty decent count to add.
link |
And that will ensure that there will be no zeros
link |
in our probability matrix P.
link |
And so this could of course change the generations.
link |
In this case it didn't, but in principle it could.
link |
But what that's gonna do now is that
link |
nothing will be infinity unlikely.
link |
So now our model will predict some other probability.
link |
And we see that jq now has a very small probability
link |
so the model still finds it very surprising
link |
that this was a word or a bigram,
link |
but we don't get negative infinity.
link |
So it's kind of like a nice fix
link |
that people like to apply sometimes
link |
and it's called model smoothing.
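As a sketch, the add-one smoothing described above looks like this, assuming the raw bigram count matrix `N` (a 27 by 27 integer tensor) from earlier is defined.

```python
# Add a fake count of 1 everywhere so no bigram has zero probability,
# then re-normalize each row into a probability distribution.
P = (N + 1).float()
P = P / P.sum(1, keepdim=True)
```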
link |
Okay, so we've now trained a respectable bigram
link |
character level language model.
link |
And we saw that we both sort of trained the model
link |
by looking at the counts of all the bigrams
link |
and normalizing the rows to get probability distributions.
link |
We saw that we can also then use those parameters
link |
of this model to perform sampling of new words.
link |
So we sample new names according to those distributions.
link |
And we also saw that we can evaluate
link |
the quality of this model.
link |
And the quality of this model is summarized
link |
in a single number, which is the negative log likelihood.
link |
And the lower this number is, the better the model is
link |
because it is giving high probabilities
link |
to the actual next characters
link |
in all the bigrams in our training set.
link |
So that's all well and good,
link |
but we've arrived at this model explicitly
link |
by doing something that felt sensible.
link |
We were just performing counts
link |
and then we were normalizing those counts.
link |
Now, what I would like to do is I would like
link |
to take an alternative approach.
link |
We will end up in a very, very similar position,
link |
but the approach will look very different
link |
because I would like to cast the problem
link |
of bigram character level language modeling
link |
into the neural network framework.
link |
And in the neural network framework,
link |
we're going to approach things slightly differently,
link |
but again, end up in a very similar spot.
link |
I'll go into that later.
link |
Now, our neural network is going to be
link |
still a bigram character level language model.
link |
So it receives a single character as an input,
link |
then there's neural network with some weights
link |
or some parameters W,
link |
and it's going to output the probability distribution
link |
over the next character in a sequence.
link |
It's going to make guesses as to what is likely
link |
to follow this character that was input to the model.
link |
And then in addition to that,
link |
we're going to be able to evaluate any setting
link |
of the parameters of the neural net
link |
because we have the loss function,
link |
the negative log likelihood.
link |
So we're going to take a look at its probability distributions
link |
and we're going to use the labels,
link |
which are basically just the identity
link |
of the next character in that bigram, the second character.
link |
So knowing what second character actually comes next
link |
in the bigram allows us to then look at
link |
how high of probability the model assigns
link |
to that character.
link |
And then we of course want the probability to be very high.
link |
And that is another way of saying that the loss is low.
link |
So we're going to use gradient-based optimization then
link |
to tune the parameters of this network
link |
because we have the loss function
link |
and we're going to minimize it.
link |
So we're going to tune the weights
link |
so that the neural net is correctly predicting
link |
the probabilities for the next character.
link |
So let's get started.
link |
The first thing I want to do is I want to compile
link |
the training set of this neural network, right?
link |
So create the training set of all the bigrams, okay?
link |
And here, I'm going to copy-paste this code
link |
because this code iterates over all the bigrams.
link |
So here we start with the words,
link |
we iterate over all the bigrams,
link |
and previously, as you recall, we did the counts.
link |
But now we're not going to do counts,
link |
we're just creating a training set.
link |
Now this training set will be made up of two lists.
link |
We have the bigrams,
link |
we have the inputs and the targets, the labels.
link |
And these bigrams will denote x, y,
link |
those are the characters, right?
link |
And so we're given the first character of the bigram
link |
and then we're trying to predict the next one.
link |
Both of these are going to be integers.
link |
So here we'll do xs.append of ix1,
link |
and ys.append of ix2.
link |
And then here, we actually don't want lists of integers,
link |
we will create tensors out of these.
link |
So xs is torch.tensor of xs,
link |
and ys is torch.tensor of ys.
link |
And then we don't actually want to take all the words
link |
just yet because I want everything to be manageable.
link |
So let's just do the first word, which is emma.
link |
And then it's clear what these x's and y's would be.
link |
Here, let me print character one, character two,
link |
just so you see what's going on here.
link |
So the bigrams of these characters are: .e, em, mm, ma, a.
link |
So this single word, as I mentioned,
link |
has one, two, three, four, five examples
link |
for our neural network.
link |
There are five separate examples in emma.
link |
And those examples are summarized here.
link |
When the input to the neural network is integer zero,
link |
the desired label is integer five, which corresponds to E.
link |
When the input to the neural network is five,
link |
we want its weights to be arranged
link |
so that 13 gets a very high probability.
link |
When 13 is put in, we want 13 to have a high probability.
link |
When 13 is put in,
link |
we also want one to have a high probability.
link |
When one is input,
link |
we want zero to have a very high probability.
link |
So there are five separate input examples,
link |
to a neural net, in this data set.
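Here is a sketch of that training set construction for the first word only, assuming `words` and `stoi` from earlier in the video are defined.

```python
import torch

# Build the (input, target) pairs for the bigrams of the first word.
xs, ys = [], []
for w in words[:1]:                  # just 'emma' for now, to keep things small
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])         # input character index
        ys.append(stoi[ch2])         # target: the next character's index

xs = torch.tensor(xs)                # tensor([ 0,  5, 13, 13,  1])
ys = torch.tensor(ys)                # tensor([ 5, 13, 13,  1,  0])
```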
link |
I wanted to add a tangent of a note of caution
link |
to be careful with a lot of the APIs
link |
of some of these frameworks.
link |
You saw me silently use torch.tensor with a lowercase T,
link |
and the output looked right.
link |
But you should be aware that there's actually two ways
link |
of constructing a tensor.
link |
There's torch.tensor with a lowercase t,
link |
and there's also a torch.Tensor class with a capital T.
link |
Which you can also construct.
link |
So you can actually call both.
link |
You can also do torch.Tensor with a capital T,
link |
and you get an Xs and Ys as well.
link |
So that's not confusing at all.
link |
There are threads on what is the difference
link |
between these two.
link |
And unfortunately, the docs are just like not clear
link |
on the difference.
link |
And when you look at the docs of torch.tensor,
link |
constructs tensor with no autograd history by copying data.
link |
It's just like, it doesn't have a lot to do with this.
link |
It doesn't make sense.
link |
So the actual difference, as far as I can tell,
link |
is explained eventually in this random thread
link |
that you can Google.
link |
And really it comes down to, I believe,
link |
that, where is this?
link |
torch.tensor infers the dtype, the data type, automatically,
link |
while torch.Tensor just returns a float tensor.
link |
I would recommend you stick to torch.tensor with the lowercase t.
link |
So indeed, we see that when I construct this
link |
with a capital T, the data type here of Xs is float 32.
link |
But with torch.tensor,
link |
you see how the dtype of Xs is now integer.
link |
So it's advised that you use lowercase T
link |
and you can read more about it if you like
link |
in some of these threads.
link |
But basically, I'm pointing out some of these things
link |
because I want to caution you
link |
and I want you to get used to reading
link |
a lot of documentation
link |
and reading through a lot of Q and As
link |
and threads like this.
link |
And some of this stuff is unfortunately not easy
link |
and not very well documented
link |
and you have to be careful out there.
link |
What we want here is integers
link |
because that's what makes sense.
link |
And so torch.tensor is what we are using.
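For completeness, a tiny sketch of the dtype difference being pointed out here, using the emma indices as example values.

```python
import torch

a = torch.tensor([0, 5, 13, 13, 1])  # lowercase t: dtype is inferred -> torch.int64
b = torch.Tensor([0, 5, 13, 13, 1])  # capital T: defaults to torch.float32
print(a.dtype, b.dtype)
```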
link |
Okay, now we want to think through
link |
how we're going to feed in these examples
link |
into a neural network.
link |
Now, it's not quite as straightforward as plugging it in
link |
because these examples right now are integers.
link |
So there's like a zero, five, or 13.
link |
It gives us the index of the character
link |
and you can't just plug an integer index
link |
into a neural net.
link |
These neural nets are sort of made up of these neurons
link |
and these neurons have weights.
link |
And as you saw in microGrad,
link |
these weights act multiplicatively on the inputs,
link |
wx plus b, there's tanh's and so on.
link |
And so it doesn't really make sense
link |
to make an input neuron take on integer values
link |
that you feed in and then multiply on with weights.
link |
So instead, a common way of encoding integers
link |
is what's called one-hot encoding.
link |
In one-hot encoding, we take an integer like 13
link |
and we create a vector that is all zeros
link |
except for the 13th dimension, which we turn to a one.
link |
And then that vector can feed into a neural net.
link |
Now, conveniently, PyTorch actually has something
link |
called the one_hot function inside torch.nn.functional.
link |
It takes a tensor made up of integers.
link |
Long is an integer.
link |
And it also takes a number of classes,
link |
which is how large you want your vector to be.
link |
So here, let's import torch.nn.functional as F.
link |
This is a common way of importing it.
link |
And then let's do F.one_hot.
link |
And we feed in the integers that we want to encode.
link |
So we can actually feed in the entire array of Xs.
link |
And we can tell it that num classes is 27.
link |
So it doesn't have to try to guess it.
link |
It may have guessed that it's only 13
link |
and would give us an incorrect result.
link |
So this is the one-hot.
link |
Let's call this xenc, for x encoded.
link |
And then we see that xenc.shape is five by 27.
link |
And we can also visualize it, plt.imshow of xenc,
link |
to make it a little bit more clear
link |
because this is a little messy.
link |
So we see that we've encoded
link |
all the five examples into vectors.
link |
We have five examples, so we have five rows.
link |
And each row here is now an example into a neural net.
link |
And we see that the appropriate bit is turned on as a one.
link |
And everything else is zero.
link |
So here, for example, the zeroth bit is turned on.
link |
The fifth bit is turned on.
link |
The 13th bit is turned on for both of these examples.
link |
And the first bit here is turned on.
link |
So that's how we can encode integers into vectors.
link |
And then these vectors can feed in to neural nets.
link |
One more issue to be careful with here, by the way,
link |
is let's look at the data type of the encoding.
link |
We always want to be careful
link |
with data types. What would you expect
link |
xenc's data type to be?
link |
When we're plugging numbers into neural nets,
link |
we don't want them to be integers.
link |
We want them to be floating-point numbers
link |
that can take on various values.
link |
But the dtype here is actually 64-bit integer.
link |
And the reason for that, I suspect,
link |
is that one-hot received a 64-bit integer here,
link |
and it returned the same data type.
link |
And when you look at the signature of one-hot,
link |
it doesn't even take a dtype, a desired data type,
link |
of the output tensor.
link |
And so we can't. In a lot of functions in torch,
link |
we'd be able to do something like dtype equals
link |
torch.float32, which is what we want,
link |
but one-hot does not support that.
link |
So instead, we're going to want to cast this
link |
to float like this,
link |
so that these, everything is the same,
link |
everything looks the same, but the dtype is float32.
link |
And floats can feed into neural nets.
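Here is a sketch of that encoding step, assuming `xs` is the tensor of integer inputs built above.

```python
import torch
import torch.nn.functional as F

# One-hot encode the inputs and cast to float so they can feed a neural net.
xenc = F.one_hot(xs, num_classes=27).float()
print(xenc.shape, xenc.dtype)   # torch.Size([5, 27]) torch.float32
```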
link |
So now let's construct our first neural net.
link |
So now let's construct our first neuron.
link |
This neuron will look at these input vectors.
link |
And as you remember from microGrad,
link |
these neurons basically perform a very simple function,
link |
wx plus b, where wx is a dot product, right?
link |
So we can achieve the same thing here.
link |
Let's first define the weights of this neuron, basically.
link |
What are the initial weights at initialization?
link |
Let's initialize them with torch.randn.
link |
torch.randn fills a tensor with random numbers
link |
drawn from a normal distribution.
link |
And a normal distribution has a bell-shaped probability density function centered at zero.
link |
And so most of the numbers drawn from this distribution
link |
will be around zero, but some of them will be as high
link |
as almost three and so on.
link |
And very few numbers will be above three in magnitude.
link |
So we need to take a size as an input here.
link |
And I'm going to use a size of 27 by one.
link |
So 27 by one, and then let's visualize w.
link |
So w is a column vector of 27 numbers.
link |
And these weights are then multiplied by the inputs.
link |
So now to perform this multiplication,
link |
we can take X encoding and we can multiply it with w.
link |
The @ symbol is the matrix multiplication operator in PyTorch.
link |
And the output of this operation is five by one.
link |
The reason it's five by one is the following.
link |
We took X encoding, which is five by 27,
link |
and we multiplied it by 27 by one.
link |
And in matrix multiplication, you see that the output
link |
will become five by one, because the inner 27s multiply and cancel out.
link |
So basically what we're seeing here out of this operation
link |
is we are seeing the five activations of this neuron
link |
on these five inputs.
link |
And we've evaluated all of them in parallel.
link |
We didn't feed in just a single input to the single neuron.
link |
We fed in simultaneously all the five inputs
link |
into the same neuron.
link |
And in parallel, PyTorch has evaluated the wx plus b,
link |
but here it's just wx, there's no bias.
link |
It has evaluated w times x for all of them independently.
link |
Now, instead of a single neuron though,
link |
I would like to have 27 neurons.
link |
And I'll show you in a second why I want 27 neurons.
link |
So instead of having just a one here,
link |
which is indicating the presence of one single neuron,
link |
we can use 27. And then when W is 27 by 27,
link |
this will, in parallel, evaluate all the 27 neurons
link |
on all the five inputs.
link |
Giving us a much, much bigger result.
link |
So now what we've done is five by 27 multiplied 27 by 27.
link |
And the output of this is now five by 27.
link |
So we can see that the shape of this is five by 27.
link |
So what is every element here telling us, right?
link |
It's telling us for every one of 27 neurons that we created,
link |
what is the firing rate of those neurons
link |
on every one of those five examples?
link |
So the element, for example, three comma 13,
link |
is giving us the firing rate of the 13th neuron
link |
looking at the third input.
link |
And the way this was achieved is by a dot product
link |
between the third input
link |
and the 13th column of this W matrix here.
link |
So using matrix multiplication,
link |
we can very efficiently evaluate the dot product
link |
between lots of input examples in a batch
link |
and lots of neurons where all of those neurons have weights
link |
in the columns of those w's.
link |
And in matrix multiplication,
link |
we're just doing those dot products in parallel.
link |
Just to show you that this is the case,
link |
we can take xenc and we can take the third row.
link |
And we can take the w and take its 13th column.
link |
And then we can do xenc at three,
link |
element-wise multiply with W at column 13, and sum that up.
link |
Well, there's no plus b, it's just the wx dot product.
link |
And that's this number.
link |
So you see that this is just being done efficiently
link |
by the matrix multiplication operation
link |
for all the input examples
link |
and for all the output neurons of this first layer.
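As a sketch of this layer and the spot check just described, assuming `xenc` (the five by 27 float tensor) from above is defined; the generator seed here is arbitrary, just for reproducibility.

```python
import torch

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)   # 27 neurons, each with 27 weights, stored in the columns
logits = xenc @ W                        # (5, 27): every neuron evaluated on every example

print(logits[3, 13])                     # firing rate of neuron 13 on input 3
print((xenc[3] * W[:, 13]).sum())        # the same number, computed as an explicit dot product
```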
link |
Okay, so we fed our 27 dimensional inputs
link |
into a first layer of a neural net that has 27 neurons.
link |
So we have 27 inputs and now we have 27 neurons.
link |
These neurons perform w times x.
link |
They don't have a bias
link |
and they don't have a non-linearity like tanh.
link |
We're going to leave them to be a linear layer.
link |
In addition to that,
link |
we're not going to have any other layers.
link |
This is going to be it.
link |
It's just going to be the dumbest, smallest,
link |
simplest neural net, which is just a single linear layer.
link |
And now I'd like to explain
link |
what I want those 27 outputs to be.
link |
Intuitively, what we're trying to produce here
link |
for every single input example
link |
is we're trying to produce
link |
some kind of a probability distribution
link |
for the next character in a sequence.
link |
And there's 27 of them.
link |
But we have to come up with precise semantics
link |
for exactly how we're going to interpret
link |
these 27 numbers that these neurons take on.
link |
Now intuitively, you see here that some of these numbers are negative
link |
and some of them are positive, et cetera.
link |
And that's because these are coming out
link |
of a neural net layer initialized
link |
with these normal distribution parameters.
link |
But what we want is we want something like we had here.
link |
Like each row here told us the counts
link |
and then we normalize the counts to get probabilities.
link |
And we want something similar to come out of a neural net.
link |
But what we just have right now
link |
is just some negative and positive numbers.
link |
Now we want those numbers to somehow represent
link |
the probabilities for the next character.
link |
But you see that probabilities,
link |
they have a special structure.
link |
They're positive numbers and they sum to one.
link |
And so that doesn't just come out of a neural net.
link |
And then they can't be counts
link |
because counts are positive integers,
link |
and these outputs are neither.
link |
So counts are also not really a good thing
link |
to output from a neural net.
link |
So instead what the neural net is going to output
link |
and how we are going to interpret the 27 numbers
link |
is that these 27 numbers are giving us log counts, basically.
link |
So instead of giving us counts directly,
link |
like in this table, they're giving us log counts.
link |
And to get the counts, we're going to take the log counts
link |
and we're going to exponentiate them.
link |
Now, exponentiation takes the following form.
link |
It takes numbers that are negative or they are positive.
link |
It takes the entire real line.
link |
And then if you plug in negative numbers,
link |
you're going to get e to the x, which is always below one.
link |
So you're getting numbers lower than one.
link |
And if you plug in numbers greater than zero,
link |
you're getting numbers greater than one
link |
all the way growing to the infinity.
link |
And on the negative side, it decays towards zero.
link |
So basically we're going to take these numbers here
link |
and instead of them being positive and negative
link |
and all over the place,
link |
we're going to interpret them as log counts.
link |
And then we're going to element-wise
link |
exponentiate these numbers.
link |
Exponentiating them now gives us something like this.
link |
And you see that these numbers now
link |
because they went through an exponent,
link |
all the negative numbers turned into numbers below one,
link |
And all the positive numbers originally turned into
link |
even more positive numbers, sort of greater than one.
link |
So like for example, seven is some positive number
link |
over here that is greater than zero.
link |
But exponentiated outputs here basically give us something
link |
that we can use and interpret as the equivalent
link |
of counts originally.
link |
So you see these counts here, 112, 751, 1, et cetera.
link |
The neural net is kind of now predicting counts.
link |
And these counts are positive numbers.
link |
They can never be below zero.
link |
So that makes sense.
link |
And they can now take on various values
link |
depending on the settings of W.
link |
So let me break this down.
link |
We're going to interpret these to be the log counts.
link |
In other words, the term for this
link |
that is often used is so-called logits.
link |
These are logits, log counts.
link |
And these will be sort of the counts, logits exponentiated.
link |
And this is equivalent to the n matrix,
link |
sort of the n array that we used previously.
link |
Remember this was the n.
link |
This is the array of counts.
link |
And each row here are the counts for the next character.
link |
So those are the counts.
link |
And now the probabilities are just the counts normalized.
link |
And so I'm not going to find the same,
link |
but basically I'm not going to scroll all over the place.
link |
We've already done this.
link |
We want counts.sum along the first dimension,
link |
and we want keepdim to be true.
link |
We went over this,
link |
and this is how we normalize the rows of our counts matrix
link |
to get our probabilities.
link |
So now these are the probabilities.
link |
And these are the counts that we have currently.
link |
And now when I show the probabilities,
link |
you see that every row here, of course,
link |
will sum to one, because they're normalized.
link |
And the shape of this is five by 27.
link |
And so really what we've achieved is
link |
for every one of our five examples,
link |
we now have a row that came out of a neural net.
link |
And because of the transformations here,
link |
we made sure that the outputs of this neural net
link |
are probabilities,
link |
or at least that we can interpret them to be probabilities.
link |
So our WX here gave us logits,
link |
and then we interpret those to be log counts.
link |
We exponentiate to get something that looks like counts,
link |
and then we normalize those counts
link |
to get a probability distribution.
link |
And all of these are differentiable operations.
link |
So what we've done now is we are taking inputs.
link |
We have differentiable operations
link |
that we can back propagate through,
link |
and we're getting out probability distributions.
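Here is a sketch of that logits-to-probabilities transformation, assuming `logits` (five by 27) from the linear layer above is defined.

```python
# Interpret logits as log-counts, exponentiate to make them positive 'counts',
# then normalize each row so it sums to 1 (this pair of steps is a softmax).
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
print(probs.shape, probs[0].sum())   # torch.Size([5, 27]), tensor(1.)
```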
link |
So, for example, for the zeroth example that fed in,
link |
which was the zeroth example here,
link |
was a one-hot vector of zero.
link |
And it basically corresponded
link |
to feeding in this example here.
link |
So we're feeding in a dot into a neural net.
link |
And the way we fed the dot into a neural net
link |
is that we first got its index,
link |
then we one-hot encoded it,
link |
then it went into the neural net,
link |
and out came this distribution of probabilities.
link |
And its shape is 27.
link |
There's 27 numbers.
link |
And we're going to interpret this
link |
as the neural net's assignment
link |
for how likely every one of these characters,
link |
the 27 characters, are to come next.
link |
And as we tune the weights W,
link |
we're going to be, of course,
link |
getting different probabilities out
link |
for any character that you input.
link |
And so now the question is just,
link |
can we optimize and find a good W
link |
such that the probabilities coming out are pretty good?
link |
And the way we measure pretty good is by the loss function.
link |
Okay, so I organized everything into a single summary
link |
so that hopefully it's a bit more clear.
link |
So it starts here.
link |
We have an input dataset.
link |
We have some inputs to the neural net,
link |
and we have some labels
link |
for the correct next character in a sequence.
link |
These are integers.
link |
Here I'm using torch generators now
link |
so that you see the same numbers that I see.
link |
And I'm generating 27 neurons weights.
link |
And each neuron here receives 27 inputs.
link |
Then here, we're going to plug in
link |
all the input examples, Xs, into a neural net.
link |
So here, this is a forward pass.
link |
First, we have to encode all of the inputs
link |
into one-hot representations.
link |
So we have 27 classes.
link |
We pass in these integers.
link |
And xenc becomes an array that is five by 27,
link |
zeros, except for a few ones.
link |
We then multiply this by W in the first layer of the neural net to get logits,
link |
exponentiate the logits to get fake counts, sort of,
link |
and normalize these counts to get probabilities.
link |
So these last two lines, by the way, here,
link |
are called the softmax, which I pulled up here.
link |
Softmax is a very often used layer in a neural net
link |
that takes these Zs, which are logits,
link |
exponentiates them, and divides and normalizes.
link |
It's a way of taking outputs of a neural net layer.
link |
And these outputs can be positive or negative.
link |
And it outputs probability distributions.
link |
It outputs something that always sums to one
link |
and is made up of positive numbers, just like probabilities.
link |
So this is kind of like a normalization function,
link |
if you want to think of it that way.
link |
And you can put it on top of any other linear layer
link |
inside a neural net.
link |
And it basically makes a neural net output probabilities
link |
that's very often used.
link |
And we used it as well here.
link |
So this is the forward pass,
link |
and that's how we made a neural net output probabilities.
link |
Now, you'll notice that all of these,
link |
this entire forward pass is made up of differentiable layers.
link |
Everything here, we can backpropagate through.
link |
And we saw some of the backpropagation in microGrad.
link |
This is just multiplication and addition.
link |
All that's happening here is just multiply and then add.
link |
And we know how to backpropagate through them.
link |
Exponentiation, we know how to backpropagate through.
link |
And then here we are summing,
link |
and sum is easily backpropagatable as well,
link |
and division as well.
link |
So everything here is a differentiable operation,
link |
and we can backpropagate through.
link |
Now, we achieve these probabilities,
link |
which are five by 27.
link |
For every single example,
link |
we have a vector of probabilities that sum to one.
link |
And then here I wrote a bunch of stuff
link |
to sort of like break down the examples.
link |
So we have five examples making up Emma, right?
link |
And there are five bigrams inside Emma.
link |
So bigram example one is that E
link |
is the beginning character right after dot.
link |
And the indexes for these are zero and five.
link |
So then we feed in a zero,
link |
that's the input to the neural net.
link |
We get probabilities from the neural net
link |
that are 27 numbers.
link |
And then the label is five,
link |
because E actually comes after dot.
link |
So that's the label.
link |
And then we use this label five
link |
to index into the probability distribution here.
link |
So this index five here is zero, one, two, three, four, five.
link |
It's this number here, which is here.
link |
So that's basically the probability assigned
link |
by the neural net to the actual correct character.
link |
You see that the network currently thinks
link |
that this next character, that E following dot,
link |
is only 1% likely,
link |
which is of course not very good, right?
link |
Because this actually is a training example,
link |
and the network thinks that this is currently
link |
very, very unlikely.
link |
But that's just because we didn't get very lucky
link |
in generating a good setting of W.
link |
So right now this network thinks this is unlikely,
link |
and 0.01 is not a good outcome.
link |
So the log likelihood then is very negative.
link |
And the negative log likelihood is very positive.
link |
And so four is a very high negative log likelihood,
link |
and that means we're going to have a high loss.
link |
Because what is the loss?
link |
The loss is just the average negative log likelihood.
link |
So the second bigram is e, m.
link |
And you see here that also the network thought
link |
that M following E is very unlikely, 1%.
link |
For M following M, it thought it was 2%.
link |
And for A following M,
link |
it actually thought it was 7% likely.
link |
So just by chance,
link |
this one actually has a pretty good probability,
link |
and therefore a pretty low negative log likelihood.
link |
And finally here, it thought this was 1% likely.
link |
So overall, this is not a very good result.
link |
So overall, our average negative log likelihood,
link |
which is the loss, the total loss that summarizes
link |
basically how well this network currently works,
link |
at least on this one word,
link |
not on the full data set, just the one word,
link |
is 3.76, which is actually a fairly high loss.
link |
This is not a very good setting of Ws.
link |
Now here's what we can do.
link |
We're currently getting 3.76.
link |
We can actually come here and we can change our W.
link |
We can resample it.
link |
So let me just add one to have a different seed,
link |
and then we get a different W,
link |
and then we can rerun this.
link |
And with this different seed,
link |
with this different setting of Ws, the loss comes out lower.
link |
So this is a much better W, right?
link |
And it's better because the probabilities
link |
just happen to come out higher
link |
for the characters that actually are next.
link |
And so you can imagine actually just resampling this.
link |
So, okay, this was not very good.
link |
Let's try one more.
link |
Okay, this was terrible setting
link |
because we have a very high loss.
link |
So anyway, I'm gonna erase this.
link |
What I'm doing here, which is just guess and check
link |
of randomly assigning parameters
link |
and seeing if the network is good,
link |
that is amateur hour.
link |
That's not how you optimize a neural net.
link |
The way you optimize a neural net
link |
is you start with some random guess,
link |
and we're gonna come up with a random guess
link |
and we're gonna commit to this one,
link |
even though it's not very good.
link |
But now the big deal is we have a loss function.
link |
So this loss is made up only of differentiable operations.
link |
And we can minimize the loss by tuning Ws,
link |
by computing the gradients of the loss
link |
with respect to these W matrices.
link |
And so then we can tune W to minimize the loss
link |
and find a good setting of W
link |
using gradient-based optimization.
link |
So let's see how that would work.
link |
Now, things are actually going to look almost identical
link |
to what we had with microGrad.
link |
So here I pulled up the lecture from microGrad,
link |
the notebook that's from this repository.
link |
And when I scroll all the way to the end
link |
where we left off with microGrad,
link |
we had something very, very similar.
link |
We had a number of input examples.
link |
In this case, we had four input examples inside Xs
link |
and we had their targets, desired targets.
link |
Just like here, we have our Xs now,
link |
but we have five of them.
link |
And they're now integers instead of vectors.
link |
But we're going to convert our integers to vectors,
link |
except our vectors will be 27 large instead of three large.
link |
And then here, what we did is,
link |
first we did a forward pass
link |
where we ran a neural net on all of the inputs
link |
to get predictions.
link |
Our neural net at the time, this n of x,
link |
was a multilayer perceptron.
link |
Our neural net is going to look different
link |
because our neural net is just a single layer,
link |
single linear layer, followed by a softmax.
link |
So that's our neural net.
link |
And the loss here was the mean squared error.
link |
So we simply subtracted the prediction
link |
from the ground truth and squared it and summed it all up.
link |
And that was the loss.
link |
And loss was the single number
link |
that summarized the quality of the neural net.
link |
And when loss is low, like almost zero,
link |
that means the neural net is predicting correctly.
link |
So we had a single number
link |
that summarized the performance of the neural net.
link |
And everything here was differentiable
link |
and was stored in a massive compute graph.
link |
And then we iterated over all the parameters.
link |
We made sure that the gradients are set to zero
link |
and we called loss.backward.
link |
And loss.backward initiated backpropagation
link |
at the final output node of loss, right?
link |
So, yeah, remember these expressions?
link |
We had loss all the way at the end.
link |
We start backpropagation and we went all the way back
link |
and we made sure that we populated
link |
all the parameters.grad.
link |
So .grad started at zero, but backpropagation filled it in.
link |
And then in the update, we iterated all the parameters
link |
and we simply did a parameter update
link |
where every single element of our parameters
link |
was notched in the opposite direction of the gradient.
link |
And so we're going to do the exact same thing here.
link |
So I'm going to pull this up on the side here,
link |
so that we have it available.
link |
And we're actually going to do the exact same thing.
link |
So this was the forward pass.
link |
And probs is our output.
link |
So now we have to evaluate the loss,
link |
but we're not using the mean squared error.
link |
We're using the negative log likelihood
link |
because we are doing classification.
link |
We're not doing regression as it's called.
link |
So here, we want to calculate loss.
link |
Now, the way we calculate it is,
link |
it's just this average negative log likelihood.
link |
Now this probs here has a shape of five by 27.
link |
And so to get all the,
link |
we basically want to pluck out the probabilities
link |
at the correct indices here.
link |
So in particular, because the labels are stored here
link |
in the array ys,
link |
basically what we're after is for the first example,
link |
we're looking at probability of five, right?
link |
For the second example,
link |
at the second row or row index one,
link |
we are interested in the probability assigned to index 13.
link |
At the third example, row index two, we also want 13.
link |
At the fourth row, row index three, we want one.
link |
And at the last row, row index four, we want zero.
link |
So these are the probabilities we're interested in, right?
link |
And you can see that they're not amazing.
link |
They're not amazing as we saw above.
link |
So these are the probabilities we want,
link |
but we want like a more efficient way
link |
to access these probabilities,
link |
not just listing them out in a tuple like this.
link |
So it turns out that the way to do this in PyTorch,
link |
one of the ways at least,
link |
is we can basically pass in all of these,
link |
all of these integers as tensors.
link |
So these ones, you see how they're just zero, one, two,
link |
we can actually create that using torch.arange,
link |
torch.arange of five: zero, one, two, three, four.
link |
So we can index here with torch.arange of five,
link |
and here we index with ys.
link |
And you see that that gives us exactly these numbers.
link |
So that plucks out the probabilities
link |
that the neural network assigns
link |
to the correct next character.
link |
Now we take those probabilities and we don't,
link |
we actually look at the log probability.
link |
So we want to dot log,
link |
and then we want to just average that up.
link |
So take the mean of all of that.
link |
And then we take the negative of that, which gives the average negative log likelihood.
link |
So the loss here is 3.7 something.
link |
And you see that this loss 3.76, 3.76 is exactly
link |
as we've obtained before,
link |
but this is a vectorized form of that expression.
link |
So we get the same loss.
link |
And the same loss we can consider sort of
link |
as part of this forward pass,
link |
and we've achieved here now loss.
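Here is a sketch of that vectorized loss, assuming `probs` (five by 27) and the target tensor `ys` from above are defined.

```python
import torch

# Pluck out p(correct next character) for each row and average the negative logs.
loss = -probs[torch.arange(5), ys].log().mean()
print(loss.item())   # ~3.76 for the random W used here
```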
link |
Okay, so we made our way all the way to loss.
link |
We've defined the forward pass.
link |
We forwarded the network and the loss.
link |
Now we're ready to do the backward pass.
link |
So backward pass, we want to first make sure
link |
that all the gradients are reset.
link |
So they're at zero.
link |
Now in PyTorch, you can set the gradients to be zero,
link |
but you can also just set it to none.
link |
And setting it to none is more efficient.
link |
And PyTorch will interpret none as like a lack
link |
of a gradient and it's the same as zeros.
link |
So this is a way to set to zero the gradient.
link |
And now we do loss.backward.
link |
Before we do loss.backward, we need one more thing.
link |
If you remember from micrograd,
link |
PyTorch actually requires that we pass in requires grad
link |
is true so that we tell PyTorch that we are interested
link |
in calculating the gradient for this leaf tensor.
link |
By default, this is false.
link |
So let me recalculate with that.
link |
And then set to none and loss.backward.
link |
Now something magical happened when loss.backward was run
link |
because PyTorch, just like micrograd,
link |
when we did the forward pass here,
link |
it keeps track of all the operations under the hood.
link |
It builds a full computational graph.
link |
Just like the graphs we produced in micrograd,
link |
those graphs exist inside PyTorch.
link |
And so it knows all the dependencies
link |
and all the mathematical operations of everything.
link |
And when you then calculate the loss,
link |
we can call .backward on it.
link |
And .backward then fills in the gradients
link |
of all the intermediates all the way back to w's,
link |
which are the parameters of our neural net.
link |
So now we can do w.grad
link |
and we see that it has structure.
link |
There's stuff inside it.
link |
And these gradients, every single element here,
link |
so w.shape is 27 by 27.
link |
w.grad.shape is the same, 27 by 27.
link |
And every element of w.grad is telling us
link |
the influence of that weight on the loss function.
link |
So for example, this number all the way here,
link |
if this element, the zero zero element of w,
link |
because the gradient is positive,
link |
it's telling us that this has a positive influence
link |
on the loss, slightly nudging w, slightly taking w zero zero
link |
and adding a small h to it would increase the loss mildly
link |
because this gradient is positive.
link |
Some of these gradients are also negative.
link |
So that's telling us about the gradient information
link |
and we can use this gradient information
link |
to update the weights of this neural network.
link |
So let's now do the update.
link |
It's going to be very similar to what we had in microGrad.
link |
We need no loop over all the parameters
link |
because we only have one parameter tensor, and that is W.
link |
So we simply do w.data plus equals,
link |
we can actually copy this almost exactly,
link |
negative 0.1 times w.grad.
link |
And that would be the update to the tensor.
link |
So that updates the tensor and because the tensor is updated,
link |
we would expect that now the loss should decrease.
link |
So here, if I print loss.item, it was 3.76, right?
link |
So we've updated the w here.
link |
So if I recalculate forward pass,
link |
loss now should be slightly lower.
link |
So 3.76 goes to 3.74.
link |
And then we can again set grad to none and backward, update.
link |
And now the parameter is changed again.
link |
So if we recalculate the forward pass,
link |
we expect a lower loss again, 3.72, okay?
link |
And this is again doing the,
link |
we're now doing gradient descent.
link |
And when we achieve a low loss, that will mean that the network
link |
is assigning high probabilities
link |
to the correct next characters.
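Here is a sketch of one such forward/backward/update step, assuming `xs`, `ys`, and a weight matrix `W` of shape 27 by 27 created with requires_grad=True are defined.

```python
import torch
import torch.nn.functional as F

# forward pass
xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss = -probs[torch.arange(xs.nelement()), ys].log().mean()
print(loss.item())

# backward pass
W.grad = None        # setting to None is the efficient way to zero the gradient
loss.backward()

# update
W.data += -0.1 * W.grad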
link |
Okay, so I rearranged everything
link |
and I put it all together from scratch.
link |
So here is where we construct our dataset of bigrams.
link |
You see that we are still iterating
link |
only over the first word, emma.
link |
I'm going to change that in a second.
link |
I added a number that counts the number of elements in xs,
link |
so that we explicitly see that the number of examples is five,
link |
because currently we're just working with emma,
link |
and there's five bigrams there.
link |
And here I added a loop of exactly what we had before.
link |
So we had 10 iterations of gradient descent
link |
of forward pass, backward pass, and an update.
link |
And so running these two cells, initialization
link |
and gradient descent,
link |
gives us some improvement on the loss function.
link |
But now I want to use all the words.
link |
And there's not five, but 228,000 bigrams now.
link |
However, this should require no modification whatsoever.
link |
Everything should just run,
link |
because all the code we wrote doesn't care
link |
if there's five bigrams or 228,000 bigrams,
link |
and with everything we should just work.
link |
So you see that this will just run.
link |
But now we are optimizing over the entire training set
link |
of all the bigrams.
link |
And you see now that we are decreasing very slightly.
link |
So actually we can probably afford a larger learning rate.
link |
We can probably afford an even larger learning rate.
link |
Even 50 seems to work on this very, very simple example.
link |
So let me reinitialize and let's run 100 iterations.
link |
We seem to be coming up to some pretty good losses here.
link |
2.47, let me run 100 more.
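As a sketch, the full training loop on all the bigrams looks like this, assuming `xs` and `ys` now cover every word and `W` is a 27 by 27 tensor with requires_grad=True; the learning rate of 50 is the aggressive value that happens to work for this tiny model.

```python
import torch
import torch.nn.functional as F

num = xs.nelement()
for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean()
    print(loss.item())

    # backward pass and update
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad
```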
link |
What is the number that we expect, by the way, in the loss?
link |
We expect to get something around
link |
what we had originally, actually.
link |
So all the way back, if you remember,
link |
in the beginning of this video,
link |
when we optimized just by counting,
link |
our loss was roughly 2.47.
link |
So that's the number that we expect to get,
link |
just by counting, our loss was roughly 2.47
link |
after we added smoothing.
link |
But before smoothing, we had roughly a 2.45 loss,
link |
And so that's actually roughly the vicinity
link |
of what we expect to achieve.
link |
But before we achieved it by counting,
link |
and here we are achieving roughly the same result,
link |
but with gradient-based optimization.
link |
So we come to about 2.46, 2.45, et cetera.
link |
And that makes sense, because fundamentally,
link |
we're not taking in any additional information.
link |
We're still just taking in the previous character
link |
and trying to predict the next one.
link |
But instead of doing it explicitly by counting
link |
we are doing it with gradient-based learning.
link |
And it just so happens that the explicit approach
link |
happens to very well optimize the loss function
link |
without any need for a gradient-based optimization,
link |
because the setup for bigram language models
link |
is so straightforward and so simple.
link |
We can just afford to estimate those probabilities directly
link |
and maintain them in a table.
link |
But the gradient-based approach
link |
is significantly more flexible.
link |
So we've actually gained a lot,
link |
because what we can do now is we can expand this approach
link |
and complexify the neural net.
link |
So currently, we're just taking a single character
link |
and feeding into a neural net,
link |
and the neural net is extremely simple.
link |
But we're about to iterate on this substantially.
link |
We're going to be taking multiple previous characters,
link |
and we're going to be feeding them into increasingly
link |
more complex neural nets.
link |
But fundamentally, the output of the neural net
link |
will always just be logits.
link |
And those logits will go through
link |
the exact same transformation.
link |
We are going to take them through a softmax,
link |
calculate the loss function and the negative log-likelihood,
link |
and do gradient-based optimization.
link |
And so actually, as we complexify the neural nets
link |
and work all the way up to transformers,
link |
none of this will really fundamentally change.
link |
None of this will fundamentally change.
link |
The only thing that will change is the way we do
link |
the forward pass, where we take in some previous characters
link |
and calculate logits for the next character in a sequence.
link |
That will become more complex,
link |
but we'll use the same machinery to optimize it.
link |
And it's not obvious how we would have extended
link |
this bigram approach into the case
link |
where there are many more characters at the input,
link |
because eventually these tables would get way too large
link |
because there's way too many combinations
link |
of what previous characters could be.
link |
If you only have one previous character,
link |
we can just keep everything in a table that counts.
link |
But if you have the last 10 characters that are input,
link |
we can't actually keep everything in the table anymore.
link |
So this is fundamentally an unscalable approach,
link |
and the neural network approach
link |
is significantly more scalable.
link |
And it's something that actually we can improve on over time.
link |
So that's where we will be digging next.
link |
I wanted to point out two more things.
link |
Number one, I want you to notice that this xinc here,
link |
this is made up of one-hot vectors.
link |
And then those one-hot vectors
link |
are multiplied by this W matrix.
link |
And we think of this as multiple neurons being forwarded
link |
in a fully connected manner.
link |
But actually what's happening here is that, for example,
link |
if you have a one-hot vector here that has a one
link |
at say the fifth dimension,
link |
then because of the way the matrix multiplication works,
link |
multiplying that one-hot vector with W
link |
actually ends up plucking out the fifth row of W.
link |
Logits would become just the fifth row of W.
link |
And that's because of the way the matrix multiplication works.
link |
So that's actually what ends up happening.
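You can verify this row-lookup behavior with a tiny sketch, assuming `W` (27 by 27) is defined; the index 5 is an arbitrary example.

```python
import torch
import torch.nn.functional as F

ix = 5
onehot = F.one_hot(torch.tensor(ix), num_classes=27).float()
print(torch.allclose(onehot @ W, W[ix]))   # True: the matmul simply selects row ix of W
```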
link |
So, but that's actually exactly what happened before.
link |
Because remember all the way up here,
link |
We took the first character
link |
and then that first character indexed
link |
into a row of this array here.
link |
And that row gave us the probability distribution
link |
for the next character.
link |
So the first character was used as a lookup
link |
into a matrix here to get the probability distribution.
link |
Well, that's actually exactly what's happening here.
link |
Because we're taking the index,
link |
we're encoding it as one-hot and multiplying it by W.
link |
So Logits literally becomes the appropriate row of W.
link |
And that gets, just as before,
link |
exponentiated to create the counts
link |
and then normalized and becomes probability.
link |
So this W here is literally the same as this array here.
link |
But W remember is the log counts, not the counts.
link |
So it's more precise to say that W exponentiated
link |
W.exp, is this array.
link |
But this array was filled in by counting
link |
and by basically populating the counts of bigrams.
link |
Whereas in the gradient-based framework,
link |
we initialize it randomly and then we let loss guide us
link |
to arrive at the exact same array.
link |
So this array exactly here is basically the array W
link |
at the end of optimization,
link |
except we arrived at it piece by piece by following the loss.
link |
And that's why we also obtain
link |
the same loss function at the end.
link |
And the second notice, if I come here,
link |
remember the smoothing where we added fake counts
link |
to our counts in order to smooth out
link |
and make more uniform the distributions
link |
of these probabilities.
link |
And that prevented us from assigning zero probability
link |
to any one bigram.
link |
Now, if I increase the count here,
link |
what's happening to the probability?
link |
As I increase the count,
link |
probability becomes more and more uniform, right?
link |
Because these counts go only up to like 900 or whatever.
link |
So if I'm adding plus a million to every single number here,
link |
you can see how the row and its probability then
link |
when we divide is just going to become more and more close
link |
to exactly even probability, uniform distribution.
link |
It turns out that the gradient-based framework
link |
has an equivalent to smoothing.
link |
In particular, think through these Ws here,
link |
which we initialized randomly.
link |
We could also think about initializing Ws to be zero.
link |
If all the entries of W are zero,
link |
then you'll see that logits will become all zero
link |
and then exponentiating those logits becomes all one.
link |
And then the probabilities turn out to be exactly the same.
link |
So basically when Ws are all equal to each other,
link |
or say, especially zero,
link |
then the probabilities come out completely uniform.
link |
So trying to incentivize W to be near zero
link |
is basically equivalent to label smoothing.
link |
And the more you incentivize that in a loss function,
link |
the more smooth distribution you're going to achieve.
link |
So this brings us to something that's called regularization,
link |
where we can actually augment the loss function
link |
to have a small component
link |
that we call a regularization loss.
link |
In particular, what we're going to do is we can take W
link |
and we can, for example, square all of its entries.
link |
And then we can, oops, sorry about that.
link |
We can take all the entries of W and we can sum them.
link |
And because we're squaring, there will be no signs anymore.
link |
Negatives and positives all get squashed
link |
to be positive numbers.
link |
And then the way this works is you achieve zero loss
link |
if W is exactly zero,
link |
but if W has non-zero numbers, you accumulate loss.
link |
And so we can actually take this and we can add it on here.
link |
So we can do something like loss plus W square dot sum.
link |
Or let's actually, instead of sum, let's take a mean,
link |
because otherwise the sum gets too large.
link |
So mean is like a little bit more manageable.
link |
And then we have a regularization loss here,
link |
let's say 0.01 times or something like that.
link |
You can choose the regularization strength.
link |
And then we can just optimize this.
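Here is a sketch of that regularized loss, assuming `probs`, `ys`, `num`, and `W` are defined as in the training loop above; the 0.01 is just an example strength.

```python
import torch

# Data loss (average negative log likelihood) plus a regularization loss
# that penalizes non-zero entries of W, pulling the model towards uniformity.
loss = -probs[torch.arange(num), ys].log().mean() + 0.01 * (W**2).mean()
```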
link |
And now this optimization actually has two components.
link |
Not only is it trying to make
link |
all the probabilities work out,
link |
but in addition to that, there's an additional component
link |
that simultaneously tries to make all Ws be zero,
link |
because if Ws are non-zero, you feel a loss.
link |
And so minimizing this, the only way to achieve that
link |
is for W to be zero.
link |
And so you can think of this as adding like a spring force
link |
or like a gravity force that pushes W to be zero.
link |
So W wants to be zero and the probabilities
link |
want to be uniform, but they also simultaneously
link |
want to match up your probabilities
link |
as indicated by the data.
link |
And so the strength of this regularization
link |
is exactly controlling the amount of counts
link |
that you add here.
link |
Adding a lot more counts here
link |
corresponds to increasing this number,
link |
because the more you increase it,
link |
the more this part of the loss function dominates this part,
link |
and the more these weights will be unable to grow,
link |
because as they grow, they accumulate way too much loss.
link |
And so if this is strong enough,
link |
then we are not able to overcome the force of this loss.
link |
And we will never, and basically everything
link |
will be uniform predictions.
link |
So I thought that's kind of cool.
link |
Okay, and lastly, before we wrap up,
link |
I wanted to show you how you would sample
link |
from this neural net model.
link |
And I copy pasted the sampling code from before,
link |
where remember that we sampled five times.
link |
And all we did, we started at zero,
link |
we grabbed the current IX row of P,
link |
and that was our probability row
link |
from which we sampled the next index
link |
and just accumulated that and break when zero.
link |
And running this gave us these results.
link |
I still have the P in memory, so this is fine.
link |
Now, this p doesn't come from a row of P,
link |
instead it comes from this neural net.
link |
First, we take IX and we encode it into a one-hot row.
link |
of xenc. This xenc multiplies our W,
link |
which really just plucks out the row of W
link |
corresponding to IX, really that's what's happening.
link |
And that gets our logits,
link |
and then we take those logits,
link |
exponentiate to get counts,
link |
and then normalize to get the distribution.
link |
And then we can sample from the distribution.
link |
So if I run this, kind of anticlimactic or climactic,
link |
depending on how you look at it, but we get the exact same result.
link |
And that's because this is the identical model.
link |
Not only does it achieve the same loss,
link |
but as I mentioned, these are identical models,
link |
and this W is the log counts of what we've estimated before.
link |
But we came to this answer in a very different way,
link |
and it's got a very different interpretation.
link |
But fundamentally, this is basically the same model
link |
and gives the same samples here.
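For reference, a sketch of that sampling loop driven by the neural net, assuming the trained `W` and the `itos` (index to character) mapping from earlier in the video are defined; the generator seed is arbitrary.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)

for _ in range(5):
    out = []
    ix = 0                                        # start at the '.' token
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W                         # effectively plucks out row ix of W
        counts = logits.exp()
        p = counts / counts.sum(1, keepdim=True)  # distribution over the next character
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:                               # '.' ends the name
            break
        out.append(itos[ix])
    print(''.join(out))
```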
link |
And so that's kind of cool. Okay, so we've actually covered a lot of ground.
link |
We introduced the bigram character level language model.
link |
We saw how we can train the model,
link |
how we can sample from the model,
link |
and how we can evaluate the quality of the model
link |
using the negative log likelihood loss.
link |
And then we actually trained the model in two completely different ways
link |
that actually get the same result and the same model.
link |
In the first way, we just counted up the frequency of all the bigrams
link |
and normalized. In the second way,
link |
we used the negative log likelihood loss as a guide
link |
to optimizing the counts matrix or the counts array
link |
so that the loss is minimized in a gradient-based framework.
link |
And we saw that both of them give the same result, and that's it.
link |
Now, the second one of these, the gradient-based framework,
link |
is much more flexible.
link |
And right now, our neural network is super simple.
link |
We're taking a single previous character,
link |
and we're taking it through a single linear layer to calculate the logits.
link |
This is about to complexify.
link |
So in the follow-up videos,
link |
we're going to be taking more and more of these characters,
link |
and we're going to be feeding them into a neural net.
link |
But this neural net will still output the exact same thing.
link |
The neural net will output logits.
link |
And these logits will still be normalized in the exact same way,
link |
and all the loss and everything else in the gradient-based framework,
link |
everything stays identical.
link |
It's just that this neural net will now complexify all the way to transformers.
link |
So that's going to be pretty awesome, and I'm looking forward to it.