Andrej Karpathy. Neural Networks: Zero to Hero
Episode 2. The spelled-out intro to language modeling: building makemore

            
Hi, everyone. Hope you're well. Next up, what I'd like to do is build out makemore. Like micrograd before it, makemore is a repository that I have on my GitHub web page. You can look at it, but just like with micrograd, I'm going to build it out step by step and I'm going to spell everything out, so we're going to build it slowly and together.

Now, what is makemore? Makemore, as the name suggests, makes more of things that you give it. Here's an example: names.txt is an example dataset for makemore. When you look at names.txt, you'll find that it's a very large dataset of names. There are lots of different types of names; in fact, I believe there are 32,000 names that I found randomly on a government website. If you train makemore on this dataset, it will learn to make more of things like this. In particular, in this case, that will mean more things that sound name-like but are actually unique names. Maybe if you're having a baby and you're trying to assign a name, maybe you're looking for a cool new-sounding unique name, makemore might help you. Here are some example generations from the neural network once we train it on our dataset: Dontel, iRot, Zendi, and so on. All of these sound name-like, but of course they're not actual names.
             
            
Under the hood, makemore is a character-level language model. What that means is that it is treating every single line here as an example, and within each example it's treating them all as sequences of individual characters. So R-E-E-S-E is one example, and that's the sequence of characters, and that's the level on which we are building out makemore. What it means to be a character-level language model, then, is that it's just modeling those sequences of characters, and it knows how to predict the next character in the sequence.

Now, we're actually going to implement a large number of character-level language models, in terms of the neural networks that are involved in predicting the next character in a sequence: very simple bigram and bag-of-words models, multilayer perceptrons, recurrent neural networks, all the way to modern transformers. In fact, the transformer that we will build will be basically equivalent to the transformer in GPT-2, if you have heard of GPT. That's kind of a big deal: it's a modern network, and by the end of the series you will actually understand how that works on the level of characters.

Now, to give you a sense of the extensions here: after characters, we will probably spend some time on the word level, so that we can generate documents of words, not just little segments of characters, but entire, much larger documents. Then we're probably going to go into images and image-text networks such as DALL-E, Stable Diffusion, and so on. But for now, we have to start here, with character-level language modeling. Let's go.
             
            
Like before, we are starting with a completely blank Jupyter notebook page. The first thing I would like to do is load up the dataset, names.txt. We're going to open up names.txt for reading, and we're going to read everything into a massive string. Then, because it's a massive string and we only want the individual words, we'll put them into a list: let's call splitlines on that string to get all of our words as a Python list of strings. Then we can look at, for example, the first 10 words, and we see that it's a list of Emma, Olivia, Ava, and so on. If we look at the top of the page here, that is indeed what we see, so that's good. This list actually makes me feel that it's probably sorted by frequency. But these are the words.
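
As a minimal sketch of what that first cell looks like (assuming names.txt sits in the notebook's working directory), the loading step is roughly:

    # read the whole file into one string, then split on newlines to get a Python list of names
    words = open('names.txt', 'r').read().splitlines()
    words[:10]   # e.g. ['emma', 'olivia', 'ava', ...]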
             
            
Now, we'd like to learn a little bit more about this dataset. Let's look at the total number of words; we expect this to be roughly 32,000. Then, what is, for example, the shortest word? So min of len(w) for w in words: the shortest word turns out to be length 2. And max of len(w) for w in words: the longest word is 15 characters.
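
In code, these quick dataset statistics are roughly the following one-liners:

    len(words)                     # total number of names, about 32,000
    min(len(w) for w in words)     # shortest name length: 2
    max(len(w) for w in words)     # longest name length: 15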
             
            
Let's now think through our very first language model. As I mentioned, a character-level language model is predicting the next character in a sequence, given some concrete sequence of characters before it. Now, what we have to realize here is that every single word here, like isabella, actually has quite a few examples packed into that single word. Because what is the existence of a word like isabella in the dataset really telling us? It's saying that the character i is a very likely character to come first in the sequence of a name. The character s is likely to come after i. The character a is likely to come after is. The character b is very likely to come after isa, and so on, all the way to the final a following isabell. Then there's one more example actually packed in here, and that is that after isabella, the word is very likely to end. So that's one more explicit piece of information that we have here, and we have to be careful with it. So there's a lot packed into a single individual word in terms of the statistical structure of what's likely to follow in these character sequences. And of course, we don't have just an individual word, we actually have 32,000 of these, so there's a lot of structure here to model.
             
            
Now, in the beginning, what I'd like to start with is building a bigram language model. In a bigram language model, we're always working with just two characters at a time: we're only looking at one character that we are given, and we're trying to predict the next character in the sequence. So: what characters are likely to follow r, what characters are likely to follow a, and so on. We're just modeling that little local structure. We're forgetting the fact that we may have a lot more information; we're always just looking at the previous character to predict the next one. So it's a very simple and weak language model, but I think it's a great place to start.
             
            
So now let's begin by looking at these bigrams in our dataset and what they look like. These bigrams, again, are just two characters in a row. So: for w in words, where each w here is an individual word, a string, we want to iterate over this word with consecutive characters, so two characters at a time, sliding through the word. Now, an interesting, cute way to do this in Python, by the way, is something like this: for ch1, ch2 in zip(w, w[1:]), print ch1, ch2. And let's not do all the words, let's just do the first three words. I'm going to show you in a second how this works, but for now, as an example, let's just do the very first word alone, emma. You see how we have emma, and this will just print em, mm, ma.

The reason this works is that w is the string emma, w[1:] is the string mma, and zip takes two iterators, pairs them up, and creates an iterator over the tuples of their consecutive entries. If any one of these iterators is shorter than the other, then zip just halts and returns. So basically, that's why we get em, mm, ma: the second iterator runs out of elements, zip just ends, and that's why we only get these tuples. So pretty cute. These are the consecutive elements in the first word.
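
A sketch of that bigram iteration (ch1 and ch2 are just the loop-variable names used here for character one and character two):

    for w in words[:3]:                    # only the first three words for now
        for ch1, ch2 in zip(w, w[1:]):     # consecutive character pairs, e.g. ('e','m'), ('m','m'), ('m','a')
            print(ch1, ch2)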
             
            
Now we have to be careful, because we actually have more information here than just these three examples. As I mentioned, we know that e is very likely to come first, and we know that a, in this case, is coming last. So one way to do this is: we're going to create a special list here, characters, and we're going to hallucinate a special start token; I'm going to call it special start. So this is a list of one element, plus w, and then plus a special end character. The reason I'm wrapping w in list() here is that w is a string, emma, and list(w) will just give the individual characters in a list. Then, doing this again, but iterating not over w but over these characters, will give us something like this: this is a bigram of the start character and e, and this is a bigram of the a and the special end character. Now we can look at, for example, what this looks like for olivia or ava. Indeed, we could potentially do this for the entire dataset, but we won't print that, that would be too much. But these are the individual character bigrams, and we can print them.
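
Adding the hallucinated start and end tokens, the cell roughly becomes the following (the exact token strings '<S>' and '<E>' are one possible choice of markers):

    for w in words[:3]:
        chs = ['<S>'] + list(w) + ['<E>']   # wrap each word with a special start and end token
        for ch1, ch2 in zip(chs, chs[1:]):
            print(ch1, ch2)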
             
            
Now, in order to learn the statistics about which characters are likely to follow other characters, the simplest way, in the bigram language model, is to simply do it by counting. So we're basically just going to count how often any one of these combinations occurs in the training set, in these words. We're going to need some kind of dictionary that maintains a count for every one of these bigrams. So let's use a dictionary b, and this will map the bigrams to counts. So bigram is a tuple of ch1 and ch2, and then b[bigram] will be b.get(bigram, 0) + 1. The .get is basically the same as indexing b at bigram, except that in the case where the bigram is not yet in the dictionary, it returns 0 by default, and then we add one. So this will add up all the bigrams and count how often they occur.

Let's get rid of the printing, or rather, let's keep the printing and just inspect what b is in this case. We see that many bigrams occur just a single time. This one, allegedly, occurred three times: a was an ending character three times, and that's true for all of these words, since all of emma, olivia, and ava end with a. So that's why this occurred three times.

Now let's do it for all the words. Oops, I should not have printed, I meant to erase that; let's kill this and just run. Now b will have the statistics of the entire dataset. So these are the counts across all the words of the individual bigrams.
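
A minimal sketch of this counting cell, using a plain dict keyed by (ch1, ch2) tuples as described:

    b = {}                                     # maps bigram tuple -> count
    for w in words:
        chs = ['<S>'] + list(w) + ['<E>']
        for ch1, ch2 in zip(chs, chs[1:]):
            bigram = (ch1, ch2)
            b[bigram] = b.get(bigram, 0) + 1   # default to 0 if unseen, then add one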
             
            
We could, for example, look at some of the most common ones and the least common ones. There are a few ways to do this in Python, but the simplest way I like is to just use b.items(). b.items() returns the tuples of key and value; in this case, the keys are the character bigrams and the values are the counts. Then what we want to do is call sorted on this. But by default, sorting is on the first item of the tuple, and we want to sort by the values, which are the second element of each key-value tuple. So we use key equals a lambda that takes the key-value pair and returns the element at index one, not at zero but at one, which is the count. So we sort by the count of these elements, and actually we want it to go backwards. So here what we have is that the bigram q-r occurs only a single time, and d-z occurred only a single time. When we sort this the other way around, we see the most likely bigrams. We see that n was an ending character many, many times, and apparently an n very often follows an a; that's a very likely combination as well. So these are the individual counts that we get over the entire dataset.
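
The sorting step is roughly the one-liner below; sorting descending with reverse=True is one option (negating the count inside the lambda works just as well):

    # sort the bigrams by their count, most frequent first
    sorted(b.items(), key=lambda kv: kv[1], reverse=True)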
             
            
Now, it's actually going to be significantly more convenient for us to keep this information in a two-dimensional array instead of a Python dictionary. So we're going to store this information in a 2D array, where the rows are going to be the first character of the bigram and the columns are going to be the second character. Each entry in this two-dimensional array will tell us how often the second character follows the first character in the dataset. In particular, the array representation, or the library, that we're going to use is PyTorch. PyTorch is a deep learning neural network framework, but part of it is also torch.Tensor, which allows us to create multi-dimensional arrays and manipulate them very efficiently.
             
            
So let's import PyTorch, which you can do with import torch. Then we can create arrays. Let's create an array of zeros, and we give it the size of the array; let's create a three-by-five array as an example, and this is a three-by-five array of zeros. By default, you'll notice that a.dtype, which is short for data type, is float32, so these are single-precision floating-point numbers. Because we are going to represent counts, let's actually use dtype=torch.int32, so these are 32-bit integers, and now you see that we have integer data inside this tensor.

Now, tensors allow us to manipulate all the individual entries, and to do it very efficiently. So for example, if we want to change this entry, we have to index into the tensor. In particular, this one here is at row index one (because it's zero-indexed) and column index three: zero, one, two, three. So a at [1, 3], we can set that to one, and then a will have a one over there. We can, of course, also do things like this, so now a will be two over there, or three. We can also, for example, say a[0, 0] is five, and then a will have a five over here. So that's how we can index into the arrays.
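
A compact sketch of this PyTorch warm-up, where a is just the small practice tensor:

    import torch

    a = torch.zeros((3, 5), dtype=torch.int32)   # 3x5 array of integer zeros
    a[1, 3] = 1       # row index 1, column index 3
    a[1, 3] += 1      # in-place increments work too; this entry is now 2
    a[0, 0] = 5
    a                 # inspect the tensor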
             
            
Now, of course, the array that we are interested in is much, much bigger. For our purposes, we have the 26 letters of the alphabet, and then we have two special characters, S and E. So we want 26 plus 2, a 28-by-28 array, and let's call it capital N, because it's going to represent the counts. Let me erase this stuff. So that's the array, starting at zeros, 28 by 28. Now, let's copy-paste the counting loop here, but instead of having the dictionary b, which we're going to erase, we're now going to use this N array.

Now, the problem here is that we have these characters, which are strings, but we have to index into an array, and we have to index using integers. So we need some kind of lookup table from characters to integers; let's construct such a lookup table. The way we're going to do this is: we take all the words, which is a list of strings, and concatenate all of it into one massive string, so this is just the entire dataset as a single string. We pass this to the set constructor, which takes the massive string and throws out duplicates, because sets do not allow duplicates. So the set of this will just be the set of all the lowercase characters, and there should be a total of 26 of them. Now, we actually don't want a set, we want a list. But we don't want a list sorted in some weird arbitrary way, we want it sorted from a to z, so let's call sorted on that list. So those are our characters.

Now, what we want is this lookup table, as I mentioned. So let's create a special S2I, I will call it, where s stands for string (or character) and i for integer, and this will be a mapping built for i, s in enumerate of these characters. Enumerate basically gives us an iterator over the integer index and the actual element of the list, and then we map the character to the integer. So S2I is a mapping from a to 0, b to 1, and so on, all the way to z being 25. That's going to be useful here, but we also have to specifically set that S2I at S will be 26 and S2I at E will be 27, because z was 25. So those are the lookups.
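
A sketch of the lookup-table construction; stoi is the code spelling of the S2I mapping described above, and the '<S>'/'<E>' strings match the earlier start/end tokens:

    chars = sorted(list(set(''.join(words))))    # the 26 lowercase letters, a..z
    stoi = {s: i for i, s in enumerate(chars)}   # 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
    stoi['<S>'] = 26                             # special start token
    stoi['<E>'] = 27                             # special end token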
             
            
Now, we can come here, and we can map both ch1 and ch2 to their integers. So ix1 will be S2I at ch1, and ix2 will be S2I at ch2. Now, we should be able to do this line, but using our array: N at ix1, ix2, which is the two-dimensional array indexing I've shown you before, and then just += 1, because everything starts at zero. So this should work and give us a large 28-by-28 array of all these counts.
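
Putting the pieces together, the counting loop over the array looks roughly like this:

    N = torch.zeros((28, 28), dtype=torch.int32)   # 26 letters + 2 special tokens

    for w in words:
        chs = ['<S>'] + list(w) + ['<E>']
        for ch1, ch2 in zip(chs, chs[1:]):
            ix1 = stoi[ch1]
            ix2 = stoi[ch2]
            N[ix1, ix2] += 1                       # count how often ch2 follows ch1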
             
            
So if we print N, this is the array, but of course it looks ugly. So let's erase this ugly mess and try to visualize it a bit more nicely. For that, we're going to use a library called matplotlib, which allows us to create figures, so we can do things like plt.imshow of the count array. This is the 28-by-28 array, and you can see its structure, but even this, I would say, is still pretty ugly. So we're going to try to create a much nicer visualization of it, and I wrote a bunch of code for that.

The first thing we're going to need is to invert this dictionary here. S2I is a mapping from s to i, and in I2S we're going to reverse it: iterate over all the items and flip each key and value. So I2S maps inversely, from 0 to a, 1 to b, and so on. We'll need that.

Then here's the code that I came up with to make this a little bit nicer. We create a figure, we plot N, and then we visualize a bunch of things on top. Let me just run it so you get a sense of what it is. You see here that we have the array spaced out, and every one of these cells is basically telling us things like: b follows g zero times, b follows h 41 times, a follows j 175 times. What you can see I'm doing here is: first I show the entire array, then I iterate over all the individual cells, and I create a character string, which is the inverse mapping I2S of the integer i and the integer j; those are the bigrams in a character representation. Then I plot the bigram text, and then I plot the number of times that this bigram occurs. Now, the reason there's a .item() here is that when you index into these arrays (these are torch tensors), we still get a tensor back. So the type of this thing, you'd think it would be just an integer, 149, but it's actually a torch.Tensor; if you call .item() on it, it pops out that individual integer, so it'll just be 149. That's what's happening there. The rest are just some options to make it look nice.
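
The visualization code is roughly the sketch below; itos is the code spelling of I2S, and the figure size, colormap, and text colors are incidental styling choices:

    import matplotlib.pyplot as plt

    itos = {i: s for s, i in stoi.items()}   # invert the lookup: 0 -> 'a', 1 -> 'b', ...

    plt.figure(figsize=(16, 16))
    plt.imshow(N, cmap='Blues')              # the raw count grid
    for i in range(28):
        for j in range(28):
            chstr = itos[i] + itos[j]        # the bigram as a two-character string
            plt.text(j, i, chstr, ha='center', va='bottom', color='gray')
            plt.text(j, i, N[i, j].item(), ha='center', va='top', color='gray')   # .item() pulls the plain int out of the tensor
    plt.axis('off')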
             
            
So what is the structure of this array? We have all these counts, and we see that some of them occur often and some of them do not occur often. Now, if you scrutinize this carefully, you will notice that we're not actually being very clever. That's because when you come over here, you'll notice that, for example, we have an entire row of completely zeros. That's because the end character can never possibly be the first character of a bigram, since we only ever place the end token at the end. Similarly, we have an entire column of zeros, because the S character can never possibly be the second element of a bigram: we always start with S and we end with E, and we only have the word in between. So we have an entire column of zeros, an entire row of zeros, and in this little two-by-two block here as well, the only entry that can possibly be non-zero is the one where E directly follows S. That could only happen if we had a word with no letters at all: an empty word, which would just be S followed immediately by E. The other combinations are just not possible.

So we're basically wasting space, and not only that, but the S and the E are getting very crowded here. I was using these brackets because it's a convention in natural language processing to use these kinds of brackets to denote special tokens, but we're going to use something else.
             
            
So let's fix all this and make it prettier. We're not actually going to have two special tokens, we're only going to have one special token, so we're going to have an array of 27 by 27 instead. Instead of having two special characters, we will just have one, and I will call it a dot. Let me swing this over here. Now, one more thing that I would like to do is make this special character have position zero, and offset all the other letters by one; I find that a little bit more pleasing. So we need a plus one here so that the first character, which is a, will start at one. In S2I, a now starts at one and dot is zero. I2S, of course, we're not changing, because I2S just creates the reverse mapping, and this will work fine: one is a, two is b, zero is dot. So we've reversed that. Here, in the loop, we now have a dot and a dot, and this should work fine. Let me make sure I start with zero counts, and then here we don't go up to 28, we go up to 27, and this should just work.

So we see that dot-dot never happened; it's at zero because we don't have empty words. Then this row here is now very simply the counts for all the first letters, so j starts a word, h starts a word, i starts a word, and so on, and these here are all the ending characters. In between, we have the structure of which characters follow each other. So this is the counts array of our entire dataset.
             
            
                link |
                
                 This array actually has all the information necessary for us to
             
            
                link |
                
                 actually sample from this bigram character level language model.
             
            
                link |
                
                 Roughly speaking, what we're going to do is we're just going to
             
            
                link |
                
                 start following these probabilities and these counts,
             
            
                link |
                
                 and we're going to start sampling from the model.
             
            
                link |
                
                 In the beginning, of course,
             
            
                link |
                
                 we start with the dot,
             
            
                link |
                
                 the start token dot.
             
            
                link |
                
                 So to sample the first character of a name,
             
            
                link |
                
                 we're looking at this row here.
             
            
                link |
                
                 So we see that we have the counts,
             
            
                link |
                
                 and those counts essentially are telling us how
             
            
                link |
                
                 often any one of these characters is to start a word.
             
            
                link |
                
                 So if we take this n and we grab the first row,
             
            
                link |
                
                 we can do that by using just indexing at zero,
             
            
                link |
                
                 and then using this notation colon for the rest of that row.
             
            
                link |
                
                 So n[0, :] is indexing into the zeroth row,
             
            
                link |
                
                 and then it's grabbing all the columns.
             
            
                link |
                
                 So this will give us a one-dimensional array of the first row.
             
            
            
                link |
                
                 You'll notice 0, 4410, and so on.
             
            
            
                link |
                
                 It's just the first row.
             
            
                link |
                
                 The shape of this is 27.
             
            
                link |
                
                 It's just the row of 27.
             
            
                link |
                
                 The other way that you can do this also is you
             
            
                link |
                
                 just grab the zero row like this.
             
            
                link |
                
                 This is equivalent.
             
            
                link |
                
                 Now, these are the counts.
             
            
                link |
                
                 Now, what we'd like to do is we'd like to
             
            
                link |
                
                 basically sample from this.
             
            
                link |
                
                 Since these are the raw counts,
             
            
                link |
                
                 we actually have to convert this to probabilities.
             
            
                link |
                
                 So we create a probability vector.
             
            
                link |
                
                 So we'll take n of zero,
             
            
                link |
                
                 and we'll actually convert this to float first.
             
            
                link |
                
                 So these integers are converted to floating-point numbers.
             
            
                link |
                
                 The reason we're creating floats is because we're
             
            
                link |
                
                 about to normalize these counts.
             
            
                link |
                
                 So to create a probability distribution here,
             
            
                link |
                
                 we basically want to do p divide p dot sum.
             
            
                link |
                
                 Now, we get a vector of smaller numbers,
             
            
                link |
                
                 and these are now probabilities.
             
            
                link |
                
                 So of course, because we divided by the sum,
             
            
                link |
                
                 the sum of p now is one.
             
            
                link |
                
                 So this is a nice proper probability distribution.
             
            
                link |
                
                 It sums to one, and this is giving us the probability for
             
            
                link |
                
                 any single character to be the first character of a word.
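                 In code, this step might look roughly like:

                     p = N[0].float()   # counts of each character starting a word (the row for the dot token)
                     p = p / p.sum()    # normalize; p now sums to one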
             
            
                link |
                
                 So now we can try to sample from this distribution.
             
            
                link |
                
                 To sample from these distributions,
             
            
                link |
                
                 we're going to use Torch.Multinomial,
             
            
                link |
                
                 which I've pulled up here.
             
            
                link |
                
                 So Torch.Multinomial returns samples
             
            
                link |
                
                 from the multinomial probability distribution,
             
            
                link |
                
                 which is a complicated way of saying,
             
            
                link |
                
                 you give me probabilities and I will give you integers,
             
            
                link |
                
                 which are sampled according to the probability distribution.
             
            
                link |
                
                 So this is the signature of the method.
             
            
                link |
                
                 To make everything deterministic,
             
            
                link |
                
                 we're going to use a generator object in PyTorch.
             
            
                link |
                
                 So this makes everything deterministic.
             
            
                link |
                
                 So when you run this on your computer,
             
            
                link |
                
                 you're going to get the exact same results
             
            
                link |
                
                 that I'm getting here on my computer.
             
            
                link |
                
                 So let me show you how this works.
             
            
                link |
                
                 Here's a deterministic way of creating
             
            
                link |
                
                 a Torch generator object,
             
            
                link |
                
                 seeding it with some number that we can agree on.
             
            
                link |
                
                 So that seeds a generator, gives us an object g.
             
            
                link |
                
                 Then we can pass that g to a function that creates random numbers.
             
            
                link |
                
                 Torch.rand creates random numbers,
             
            
                link |
                
                 three of them, and it's using
             
            
                link |
                
                 this generator object as a source of randomness.
             
            
                link |
                
                 So without normalizing it, I can just print.
             
            
                link |
                
                 This is like numbers between 0 and 1 that are random,
             
            
                link |
                
                 according to this thing.
             
            
                link |
                
                 Whenever I run it again,
             
            
                link |
                
                 I'm always going to get the same result because I
             
            
                link |
                
                 keep using the same generator object which I'm seeding here.
             
            
                link |
                
                 Then if I divide to normalize,
             
            
                link |
                
                 I'm going to get a nice probability distribution
             
            
                link |
                
                 of just three elements.
             
            
                link |
                
                 Then we can use Torch.Multinomial to draw samples from it.
             
            
                link |
                
                 So this is what that looks like.
             
            
                link |
                
                 Torch.Multinomial will take
             
            
                link |
                
                 the Torch tensor of probability distributions.
             
            
                link |
                
                 Then we can ask for a number of samples, let's say 20.
             
            
                link |
                
                 Replacement equals true means that when we draw an element,
             
            
                link |
                
                 we can draw it and then we can put it back into
             
            
                link |
                
                 the list of eligible indices to draw again.
             
            
                link |
                
                 We have to specify replacement as true because by
             
            
                link |
                
                 default for some reason it's false.
             
            
                link |
                
                 I think it's just something to be careful with.
             
            
                link |
                
                 The generator is passed in here,
             
            
                link |
                
                 so we're going to always get
             
            
                link |
                
                 deterministic results, the same results.
             
            
                link |
                
                 If I run these two,
             
            
                link |
                
                 we're going to get a bunch of samples from this distribution.
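                 A small sketch of this experiment; the particular seed value is arbitrary, it just has to be one we agree on so the results match:

                     import torch

                     g = torch.Generator().manual_seed(2147483647)   # any agreed-upon seed
                     p = torch.rand(3, generator=g)                  # three random numbers in [0, 1)
                     p = p / p.sum()                                 # normalize into a 3-element distribution
                     ix = torch.multinomial(p, num_samples=20, replacement=True, generator=g)
                     print(p)
                     print(ix)   # 20 integer samples in {0, 1, 2}, drawn roughly in proportion to p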
             
            
                link |
                
                 Now, you'll notice here that the probability for
             
            
                link |
                
                 the first element in this tensor is 60 percent.
             
            
                link |
                
                 In these 20 samples,
             
            
                link |
                
                 we'd expect 60 percent of them to be zero.
             
            
                link |
                
                 We'd expect 30 percent of them to be one.
             
            
                link |
                
                 Because the element index two has only 10 percent probability,
             
            
                link |
                
                 very few of these samples should be two.
             
            
                link |
                
                 Indeed, we only have a small number of twos.
             
            
                link |
                
                 We can sample as many as we would like.
             
            
                link |
                
                 The more we sample, the more
             
            
                link |
                
                 these numbers should roughly have the distribution here.
             
            
                link |
                
                 We should have lots of zeros,
             
            
                link |
                
                 half as many ones,
             
            
                link |
                
                 and compared to the ones,
             
            
                link |
                
                 we should have three times fewer twos.
             
            
                link |
                
                 You see that we have very few twos,
             
            
                link |
                
                 we have some ones, and most of them are zero.
             
            
                link |
                
                 That's what Torch.multinomial is doing.
             
            
                link |
                
                 For us here, we are interested in this row.
             
            
                link |
                
                 We've created this p here,
             
            
                link |
                
                 and now we can sample from it.
             
            
                link |
                
                 If we use the same seed,
             
            
                link |
                
                 and then we sample from this distribution,
             
            
                link |
                
                 let's just get one sample,
             
            
                link |
                
                 then we see that the sample is say 13.
             
            
                link |
                
                 This will be the index.
             
            
                link |
                
                 You see how it's a tensor that wraps 13?
             
            
                link |
                
                 We again have to use .item() to pop out that integer.
             
            
                link |
                
                 Now, index would be just the number 13.
             
            
                link |
                
                 Of course, we can map
             
            
                link |
                
                 the itos of ix to figure out
             
            
                link |
                
                 exactly which character we're sampling here.
             
            
                link |
                
                 We're sampling m. We're saying that
             
            
                link |
                
                 the first character is m in our generation.
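                 As a sketch, drawing one sample and mapping it back to a character (assuming p is the normalized first row from above and g is the seeded generator):

                     ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
                     print(ix, itos[ix])   # with the seed used here this comes out as 13, which is 'm'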
             
            
                link |
                
                 Just looking at the row here,
             
            
                link |
                
                 m was drawn, and we can see that
             
            
                link |
                
                 m actually starts a large number of words.
             
            
                link |
                
                 m started 2,500 words out of 32,000 words,
             
            
                link |
                
                 so a bit less than 10 percent
             
            
                link |
                
                 of the words start with m.
             
            
                link |
                
                 This was actually a fairly likely character to draw.
             
            
                link |
                
                 That would be the first character of our word,
             
            
                link |
                
                 and now we can continue to sample
             
            
                link |
                
                 more characters because now we know that m is already sampled.
             
            
                link |
                
                 Now to draw the next character,
             
            
                link |
                
                 we will come back here and we will look
             
            
                link |
                
                 for the row that starts with m.
             
            
                link |
                
                 You see m and we have a row here.
             
            
                link |
                
                 We see that m followed by dot is 516,
             
            
                link |
                
                 ma is this many, mb is this many, etc.
             
            
                link |
                
                 These are the counts for the next row,
             
            
                link |
                
                 and that's the next character that we are going to now generate.
             
            
                link |
                
                 I think we are ready to actually just write out the loop
             
            
                link |
                
                 because I think you're starting to get a sense
             
            
                link |
                
                 of how this is going to go.
             
            
                link |
                
                 We always begin at index 0 because that's the start token.
             
            
                link |
                
                 Then while true, we're going to grab
             
            
                link |
                
                 the row corresponding to index that we're currently on,
             
            
                link |
                
                 so that's the n array at ix,
             
            
                link |
                
                 converted to float, and that's p.
             
            
                link |
                
                 Then we normalize p to sum to one.
             
            
                link |
                
                 I accidentally ran the infinite loop.
             
            
                link |
                
                 We normalize p to sum to one.
             
            
                link |
                
                 Then we need this generator object.
             
            
                link |
                
                 We're going to initialize it up here and we're going to
             
            
                link |
                
                 draw a single sample from this distribution.
             
            
                link |
                
                 Then this is going to tell us what index is going to be next.
             
            
                link |
                
                 If the index sampled is zero,
             
            
                link |
                
                 then that's now the end token, so we will break.
             
            
                link |
                
                 Otherwise, we are going to print itos of ix.
             
            
                link |
                
                 That's pretty much it.
             
            
                link |
                
                 This should just work.
             
            
                link |
                
                 Mor. That's the name that we've sampled.
             
            
                link |
                
                 We started with m,
             
            
                link |
                
                 the next step was o,
             
            
                link |
                
                 then r, and then dot.
             
            
                link |
                
                 This dot, we print it here as well.
             
            
                link |
                
                 Let's now do this a few times.
             
            
                link |
                
                 Let's actually create an out list here.
             
            
                link |
                
                 Instead of printing, we're going to append,
             
            
                link |
                
                 so out.append this character.
             
            
                link |
                
                 Then here, let's just print it at the end.
             
            
                link |
                
                 Let's just join up all the outs and we're just going to print more.
             
            
                link |
                
                 Now, we're always getting the same result because of the generator.
             
            
                link |
                
                 If we want to do this a few times,
             
            
                link |
                
                 we can go for i in range 10.
             
            
                link |
                
                 We can sample 10 names and we can just do that 10 times.
             
            
                link |
                
                 These are the names that we're getting out. Let's do 20.
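                 Putting it together, the sampling cell at this point might look roughly like this (assuming N, itos, and the same seeding convention as before):

                     g = torch.Generator().manual_seed(2147483647)

                     for i in range(20):
                         out = []
                         ix = 0                                  # start at the dot token
                         while True:
                             p = N[ix].float()
                             p = p / p.sum()                     # normalize the current row to probabilities
                             ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
                             out.append(itos[ix])
                             if ix == 0:                         # sampled the dot token again: end of the name
                                 break
                         print(''.join(out))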
             
            
                link |
                
                 I'll be honest with you, this doesn't look right.
             
            
                link |
                
                 I stared at it for a few minutes to convince myself that it actually is right.
             
            
                link |
                
                 The reason these samples are so terrible is
             
            
                link |
                
                 that the bigram language model is actually just really terrible.
             
            
                link |
                
                 We can generate a few more here.
             
            
                link |
                
                 You can see that they're a little bit name-like, like Yanu,
             
            
                link |
                
                 Erily, etc, but they're just totally messed up.
             
            
                link |
                
                 The reason that this is so bad,
             
            
                link |
                
                 we're generating h as a name,
             
            
                link |
                
                 but you have to think through it from the model's eyes.
             
            
                link |
                
                 It doesn't know that this h is the very first h.
             
            
                link |
                
                 All it knows is that h came previously,
             
            
                link |
                
                 and now, how likely is h to be the last character?
             
            
                link |
                
                 Well, it's somewhat likely,
             
            
                link |
                
                 and so it just makes it the last character.
             
            
                link |
                
                 It doesn't know that there were other things before it,
             
            
                link |
                
                 or there were not other things before it.
             
            
                link |
                
                 That's why it's generating all these nonsense names.
             
            
                link |
                
                 Another way to
             
            
                link |
                
                 convince yourself that it is actually doing something reasonable,
             
            
                link |
                
                 even though it's so terrible,
             
            
                link |
                
                 is this: these little p vectors here have 27 elements,
             
            
                link |
                
                 just 27 numbers. How about if we did something like this?
             
            
                link |
                
                 Instead of p having any structure whatsoever,
             
            
                link |
                
                 how about if p was just torch.ones of 27?
             
            
                link |
                
                 By default, this is a float 32,
             
            
                link |
                
                 so this is fine. Divide by 27.
             
            
                link |
                
                 What I'm doing here is this is
             
            
                link |
                
                 the uniform distribution which will make everything equally likely,
             
            
                link |
                
                 and we can sample from that.
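                 As a sketch, this untrained baseline just replaces the normalized row with a uniform vector inside the loop:

                     p = torch.ones(27) / 27.0   # uniform distribution: every next character equally likely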
             
            
                link |
                
                 Let's see if that does any better.
             
            
                link |
                
                 This is what you have from a model that is completely untrained,
             
            
                link |
                
                 where everything is equally likely,
             
            
                link |
                
                 so it's obviously garbage.
             
            
                link |
                
                 Then if we have a trained model which is trained on just bigrams,
             
            
                link |
                
                 this is what we get.
             
            
                link |
                
                 You can see that it is more name-like, it is actually working.
             
            
                link |
                
                 It's just that the bigram model is so terrible and we have to do better.
             
            
                link |
                
                 Now next, I would like to fix an inefficiency that we have going on here.
             
            
                link |
                
                 Because what we're doing here is we're always fetching
             
            
                link |
                
                 a row of n from the counts matrix up above,
             
            
                link |
                
                 and then we're always doing the same things.
             
            
                link |
                
                 We're converting to float and we're dividing,
             
            
                link |
                
                 and we're doing this every single iteration of this loop,
             
            
                link |
                
                 and we just keep renormalizing these rows over and over again,
             
            
                link |
                
                 and it's extremely inefficient and wasteful.
             
            
                link |
                
                 What I'd like to do is I'd like to actually prepare
             
            
                link |
                
                 a matrix capital P that will just have the probabilities in it.
             
            
                link |
                
                 In other words, it's going to be the same as the capital N matrix here of counts,
             
            
                link |
                
                 but every single row will have the row of probabilities that is normalized to one,
             
            
                link |
                
                 indicating the probability distribution for the next character,
             
            
                link |
                
                 given the character before it as defined by which row we're in.
             
            
                link |
                
                 Basically, what we'd like to do is we'd like to just do it up front here,
             
            
                link |
                
                 and then we would like to just use that row here.
             
            
                link |
                
                 Here, we would like to just do p equals capital P at ix instead.
             
            
                link |
                
                 The other reason I want to do this is not just for efficiency,
             
            
                link |
                
                 but also I would like us to practice
             
            
                link |
                
                 these n-dimensional tensors and I'd like us to practice
             
            
                link |
                
                 their manipulation and especially something that's called
             
            
                link |
                
                 broadcasting that we'll go into in a second.
             
            
                link |
                
                 We're actually going to have to become very good at
             
            
                link |
                
                 these tensor manipulations because if we're
             
            
                link |
                
                 going to build out all the way to transformers,
             
            
                link |
                
                 we're going to be doing some pretty complicated array operations for
             
            
                link |
                
                 efficiency and we need to really understand that and be very good at it.
             
            
                link |
                
                 Intuitively, what we want to do is that we first want to grab
             
            
                link |
                
                 the floating point copy of n,
             
            
                link |
                
                 and I'm mimicking the line here basically,
             
            
                link |
                
                 and then we want to divide all the rows so that they sum to one.
             
            
                link |
                
                 We'd like to do something like this, p divide p dot sum.
             
            
                link |
                
                 But now we have to be careful because p dot sum actually
             
            
                link |
                
                 produces a sum, sorry,
             
            
                link |
                
                 p equals n dot float, the copy.
             
            
                link |
                
                 p dot sum sums up all of the counts of this entire matrix n,
             
            
                link |
                
                 and gives us a single number of just the summation of everything.
             
            
                link |
                
                 That's not the way we want to divide.
             
            
                link |
                
                 We want to simultaneously and in parallel,
             
            
                link |
                
                 divide all the rows by their respective sums.
             
            
                link |
                
                 What we have to do now is we have to go into documentation for
             
            
                link |
                
                 torch.sum, and we can scroll down here to a definition that is relevant to us,
             
            
                link |
                
                 which is where we don't only provide an input array that we want to sum,
             
            
                link |
                
                 but we also provide the dimension along which we want to sum.
             
            
                link |
                
                 In particular, we want to sum up over rows.
             
            
                link |
                
                 Now, one more argument that I want you to pay attention to here
             
            
                link |
                
                 is that keepdim is false by default.
             
            
                link |
                
                 If keepdim is true,
             
            
                link |
                
                 then the output tensor is of the same size as input,
             
            
                link |
                
                 except of course the dimension along which you summed,
             
            
                link |
                
                 which will become just one.
             
            
                link |
                
                 But if you pass in keepdim as false,
             
            
                link |
                
                 then this dimension is squeezed out.
             
            
                link |
                
                 Torch.sum not only does the sum and collapses dimension to be of size one,
             
            
                link |
                
                 but in addition, it does what's called a squeeze,
             
            
                link |
                
                 where it squeezes out that dimension.
             
            
                link |
                
                 Basically, what we want here is to instead do p dot sum along some axis.
             
            
                link |
                
                 In particular, notice that p dot shape is 27 by 27.
             
            
                link |
                
                 When we sum up across axis zero,
             
            
                link |
                
                 then we would be taking the zeroth dimension and we would be summing across it.
             
            
                link |
                
                 When keepdim is true,
             
            
                link |
                
                 then this thing will not only give us the counts along the columns,
             
            
                link |
                
                 but notice that basically the shape of this is one by 27.
             
            
                link |
                
                 We just get a row vector.
             
            
                link |
                
                 The reason we get a row vector here, again, is because we passed in dimension zero.
             
            
                link |
                
                 This zeroth dimension becomes one and we've done a sum and we get a row.
             
            
                link |
                
                 Basically, we've done the sum this way,
             
            
                link |
                
                 vertically, and arrived at just a single one by 27 vector of counts.
             
            
                link |
                
                 What happens when you take out keepdim is that we just get 27.
             
            
                link |
                
                 It squeezes out that dimension and we just get a one-dimensional vector of size 27.
             
            
                link |
                
                 Now, we don't actually want one by 27 row vector,
             
            
                link |
                
                 because that gives us the counts or the sums across the columns.
             
            
                link |
                
                 We actually want to sum the other way along dimension one.
             
            
                link |
                
                 You'll see that the shape of this is 27 by one,
             
            
                link |
                
                 so it's a column vector.
             
            
                link |
                
                 It's a 27 by one vector of counts.
             
            
                link |
                
                 That's because what's happened here is that we're going horizontally,
             
            
                link |
                
                 and this 27 by 27 matrix becomes a 27 by one array.
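                 A small illustration of the two directions, using a stand-in matrix rather than the real counts:

                     import torch

                     M = torch.randint(0, 10, (27, 27)).float()   # stand-in for the 27 by 27 count matrix
                     print(M.sum(0, keepdim=True).shape)   # torch.Size([1, 27]): summed vertically, a row vector
                     print(M.sum(1, keepdim=True).shape)   # torch.Size([27, 1]): summed horizontally, a column vector
                     print(M.sum(1).shape)                 # torch.Size([27]): keepdim=False squeezes that dimension out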
             
            
                link |
                
                 Now, you'll notice, by the way,
             
            
                link |
                
                 that the actual numbers of these counts are identical.
             
            
                link |
                
                 That's because this special array of counts here comes from bigram statistics.
             
            
                link |
                
                 Actually, it just so happens by chance,
             
            
                link |
                
                 or because of the way this array is constructed,
             
            
                link |
                
                 that the sums along the columns or along the rows,
             
            
                link |
                
                 horizontally or vertically, are identical.
             
            
                link |
                
                 But actually, what we want to do in this case is we want to sum across the rows horizontally.
             
            
                link |
                
                 What we want here is the sum along dimension one with keepdim true, the
             
            
                link |
                
                 27 by one column vector.
             
            
                link |
                
                 Now, what we want to do is we want to divide by that.
             
            
                link |
                
                 Now, we have to be careful here again.
             
            
                link |
                
                 p.shape, as you see here, is 27 by 27.
             
            
                link |
                
                 Is it possible to take a 27 by 27 array and divide it by what is a 27 by one array?
             
            
                link |
                
                 Is that an operation that you can do?
             
            
                link |
                
                 Whether or not you can perform this operation is determined by what's called broadcasting rules.
             
            
                link |
                
                 If you just search broadcasting semantics in Torch,
             
            
                link |
                
                 you'll notice that there's a special definition for what's called broadcasting,
             
            
                link |
                
                 for whether or not these two arrays can be combined in a binary operation like division.
             
            
                link |
                
                 The first condition is each tensor has at least one dimension,
             
            
                link |
                
                 which is the case for us.
             
            
                link |
                
                 Then when iterating over the dimension sizes,
             
            
                link |
                
                 starting at the trailing dimension,
             
            
                link |
                
                 the dimension sizes must either be equal,
             
            
                link |
                
                 one of them is one, or one of them does not exist.
             
            
            
                link |
                
                 We need to align the two arrays and their shapes,
             
            
                link |
                
                 which is very easy because both of these shapes have two elements, so they're aligned.
             
            
                link |
                
                 Then we iterate from the right, going to the left.
             
            
                link |
                
                 Each dimension must be either equal, one of them is a one, or one of them does not exist.
             
            
                link |
                
                 In this case, they're not equal, but one of them is a one, so this is fine.
             
            
                link |
                
                 Then this dimension, they're both equal, so this is fine.
             
            
                link |
                
                 All the dimensions are fine and therefore this operation is broadcastable.
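                 A tiny sketch of checking these rules on shapes like ours, with a stand-in tensor rather than the real counts:

                     import torch

                     a = torch.rand(27, 27)             # stand-in 27 by 27 tensor
                     b = a.sum(1, keepdim=True)         # shape (27, 1)
                     # Align the shapes on the right:  (27, 27)
                     #                                 (27,  1)
                     # trailing dimension: 27 vs 1  -> one of them is one, OK (the one gets stretched to 27)
                     # next dimension:     27 vs 27 -> equal, OK
                     print((a / b).shape)               # torch.Size([27, 27]); the division broadcasts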
             
            
                link |
                
                 That means that this operation is allowed.
             
            
                link |
                
                 What is it that these arrays do when you divide a 27 by 27 by a 27 by one?
             
            
                link |
                
                 What it does is that it takes this dimension one and it stretches it out,
             
            
                link |
                
                 it copies it to match 27 here in this case.
             
            
                link |
                
                 In our case, it takes this column vector,
             
            
                link |
                
                 which is 27 by one and it copies it 27 times to make these both be 27 by 27 internally.
             
            
                link |
                
                 You can think of it that way.
             
            
                link |
                
                 It copies those counts and then it does an element-wise division,
             
            
                link |
                
                 which is what we want because these counts,
             
            
                link |
                
                 we want to divide by them on every single one of these columns in this matrix.
             
            
                link |
                
                 We expect this to normalize every single row.
             
            
                link |
                
                 We can check that this is true by taking the first row,
             
            
                link |
                
                 for example, and taking its sum.
             
            
                link |
                
                 We expect this to be one because it's now normalized.
             
            
                link |
                
                 Then, if we actually correctly normalized all the rows,
             
            
                link |
                
                 we expect to get the exact same result here.
             
            
                link |
                
                 Let's run this. It's the exact same result. This is correct.
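                 A sketch of the normalization being described, assuming N is the 27 by 27 count matrix built earlier:

                     P = N.float()                     # floating-point copy of the counts
                     P = P / P.sum(1, keepdim=True)    # (27, 27) / (27, 1): divide every row by its own sum
                     print(P[0].sum())                 # tensor(1.), the first row is now a proper distribution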
             
            
                link |
                
                 Now I would like to scare you a little bit.
             
            
                link |
                
                 I basically encourage you very strongly to read through broadcasting semantics,
             
            
                link |
                
                 and I encourage you to treat this with respect.
             
            
                link |
                
                 It's not something to play fast and loose with,
             
            
                link |
                
                 it's something to really respect,
             
            
                link |
                
                 really understand, and look up maybe some tutorials for broadcasting and practice it,
             
            
                link |
                
                 and be careful with it because you can very quickly run into bugs.
             
            
                link |
                
                 Let me show you what I mean.
             
            
                link |
                
                 You see how here we have p.sum of one, keepdim as true.
             
            
                link |
                
                 The shape of this is 27 by one.
             
            
                link |
                
                 Let me take out this line just so we have the n and then we can see the counts.
             
            
                link |
                
                 We can see that this is all the counts across all the rows,
             
            
                link |
                
                 and it's a 27 by one column vector.
             
            
                link |
                
                 Now, suppose that I tried to do the following,
             
            
                link |
                
                 but I erase keepdim equals true here. What does that do?
             
            
                link |
                
                 If keepdim is not true, it's false.
             
            
                link |
                
                 Then remember, according to documentation,
             
            
                link |
                
                 it gets rid of this dimension one,
             
            
                link |
                
                 it squeezes it out.
             
            
                link |
                
                 Basically, we just get all the same counts,
             
            
                link |
                
                 the same result, except the shape of it is not 27 by one,
             
            
                link |
                
                 it's just 27, the one that disappears.
             
            
                link |
                
                 But all the counts are the same.
             
            
                link |
                
                 You'd think that this division would work.
             
            
                link |
                
                 First of all, can we even write this and is it even expected to run?
             
            
                link |
                
                 Is it broadcastable? Let's determine if this result is broadcastable.
             
            
                link |
                
                 The shape of p.sum(1) is 27,
             
            
            
                link |
                
                 so 27 by 27 broadcasting into 27.
             
            
                link |
                
                 Now, rules of broadcasting, number 1,
             
            
                link |
                
                 align all the dimensions on the right, done.
             
            
                link |
                
                 Now, iteration over all the dimensions starting from the right,
             
            
                link |
                
                 going to the left.
             
            
                link |
                
                 All the dimensions must either be equal,
             
            
                link |
                
                 one of them must be one or one of them does not exist.
             
            
                link |
                
                 Here, they are all equal.
             
            
                link |
                
                 Here, the dimension does not exist.
             
            
                link |
                
                 Internally, what broadcasting will do is it will create a one here,
             
            
                link |
                
                 and then we see that one of them is a one,
             
            
                link |
                
                 and this will get copied,
             
            
                link |
                
                 and this will run, this will broadcast.
             
            
                link |
                
                 You'd expect this to work because we can divide this.
             
            
                link |
                
                 Now, if I run this,
             
            
                link |
                
                 you'd expect it to work, but it doesn't.
             
            
                link |
                
                 You actually get garbage.
             
            
                link |
                
                 You get a wrong result because this is actually a bug.
             
            
                link |
                
                 This keepdim equals true makes it work.
             
            
                link |
                
                 This is a bug. In both cases,
             
            
                link |
                
                 we are doing the correct counts.
             
            
                link |
                
                 We are summing up across the rows,
             
            
                link |
                
                 but keepdim is saving us and making it work.
             
            
                link |
                
                 In this case, I'd like to encourage you to potentially
             
            
                link |
                
                 pause this video at this point and try to think about
             
            
                link |
                
                 why this is buggy and why the keepdim was necessary here.
             
            
                link |
                
                 The reason for this is what I was trying to hint at here
             
            
                link |
                
                 when I was walking you through how broadcasting works.
             
            
                link |
                
                 This 27 vector internally inside the broadcasting,
             
            
                link |
                
                 this becomes a one by 27,
             
            
                link |
                
                 and one by 27 is a row vector.
             
            
                link |
                
                 Now, we are dividing a 27 by 27 by a one by 27,
             
            
                link |
                
                 and torch will replicate this dimension.
             
            
                link |
                
                 Basically, it will take this row vector,
             
            
                link |
                
                 and it will copy it vertically now 27 times.
             
            
                link |
                
                 The 27 by 27 aligns exactly,
             
            
                link |
                
                 and element-wise divides.
             
            
                link |
                
                 Basically, what's happening here is we're
             
            
                link |
                
                 actually normalizing the columns instead of normalizing the rows.
             
            
                link |
                
                 You can check that what's happening here is that
             
            
                link |
                
                 p at zero, which is the first row of p,
             
            
                link |
                
                 its sum is not one, it's seven.
             
            
                link |
                
                 It is the first column, for example, that sums to one.
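                 A sketch of the bug, using a hypothetical name P_bad for the wrongly normalized matrix; the columns come out normalized here only because, as noted earlier, the row totals and column totals of this particular count matrix coincide:

                     P_bad = N.float()
                     P_bad = P_bad / P_bad.sum(1)    # sum(1) has shape (27,): it broadcasts as a (1, 27) row vector
                     print(P_bad[0].sum())           # not one: the rows are not normalized
                     print(P_bad[:, 0].sum())        # approximately one: the columns got normalized instead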
             
            
                link |
                
                 To summarize, where does the issue come from?
             
            
                link |
                
                 The issue comes from the silent adding of a dimension here,
             
            
                link |
                
                 because in broadcasting rules,
             
            
                link |
                
                 you align on the right and go from right to left,
             
            
                link |
                
                 and if dimension doesn't exist, you create it.
             
            
                link |
                
                 That's where the problem happens.
             
            
                link |
                
                 We still did the counts correctly.
             
            
                link |
                
                 We did the counts across the rows,
             
            
                link |
                
                 and we got the counts on the right here as a column vector.
             
            
                link |
                
                 But because keepdim was false,
             
            
                link |
                
                 this dimension was discarded,
             
            
                link |
                
                 and now we just have a vector of 27.
             
            
                link |
                
                 Because of broadcasting the way it works,
             
            
                link |
                
                 this vector of 27 suddenly becomes a row vector,
             
            
                link |
                
                 and then this row vector gets replicated vertically,
             
            
                link |
                
                 and then at every single point we are dividing by
             
            
                link |
                
                 the count in the opposite direction.
             
            
                link |
                
                 This thing just doesn't work.
             
            
                link |
                
                 This needs to be keepdim equals True in this case.
             
            
                link |
                
                 Then we have that p at zero is normalized.
             
            
                link |
                
                 Conversely, the first column,
             
            
                link |
                
                 you'd expect to potentially not be normalized,
             
            
                link |
                
                 and this is what makes it work.
             
            
                link |
                
                 Pretty subtle, and hopefully this helps to scare you,
             
            
                link |
                
                 that you should have respect for broadcasting,
             
            
                link |
                
                 be careful, check your work,
             
            
                link |
                
                 and understand how it works under the hood,
             
            
                link |
                
                 and make sure that it's broadcasting
             
            
                link |
                
                 in the direction that you like.
             
            
                link |
                
                 Otherwise, you're going to introduce very subtle bugs,
             
            
                link |
                
                 very hard to find bugs, and just be careful.
             
            
                link |
                
                 One more note on efficiency.
             
            
                link |
                
                 We don't want to be doing this here because this
             
            
                link |
                
                 creates a completely new tensor that we store into p.
             
            
                link |
                
                 We prefer to use in-place operations if possible.
             
            
                link |
                
                 This would be an in-place operation,
             
            
                link |
                
                 which has the potential to be faster.
             
            
                link |
                
                 It doesn't create new memory under the hood.
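                 A hedged sketch of the difference (P here is any float tensor of counts):

                     import torch

                     P = torch.randint(1, 10, (27, 27)).float()  # stand-in for the count matrix

                     # Out-of-place: allocates a brand new tensor and rebinds the name P to it.
                     P = P / P.sum(1, keepdim=True)

                     # In-place: writes the result into P's existing memory; no new allocation,
                     # so it has the potential to be faster.
                     P /= P.sum(1, keepdim=True)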
             
            
                link |
                
                 Then let's erase this, we don't need it.
             
            
                link |
                
                 Let's also just do fewer,
             
            
                link |
                
                 just so I'm not wasting space.
             
            
                link |
                
                 We're actually in a pretty good spot now.
             
            
                link |
                
                 We trained a bigram language model,
             
            
                link |
                
                 and we trained it really just by counting
             
            
                link |
                
                 how frequently any pairing occurs and then
             
            
                link |
                
                 normalizing so that we get a nice probability distribution.
             
            
                link |
                
                 The elements of this array P are
             
            
                link |
                
                 really the parameters of our bigram language model,
             
            
                link |
                
                 giving us and summarizing the statistics of these bigrams.
             
            
                link |
                
                 We trained the model and then we know how to sample from the model.
             
            
                link |
                
                 We just iteratively sampled
             
            
                link |
                
                 the next character and fed it in each time to get the next character.
             
            
                link |
                
                 Now what I'd like to do is I'd like to somehow
             
            
                link |
                
                 evaluate the quality of this model.
             
            
                link |
                
                 We'd like to somehow summarize
             
            
                link |
                
                 the quality of this model into a single number.
             
            
                link |
                
                 How good is it at predicting the training set as an example?
             
            
                link |
                
                 In the training set, we can evaluate now
             
            
                link |
                
                 the training loss and this training loss is telling us
             
            
                link |
                
                 about the quality of this model in
             
            
                link |
                
                 a single number just like we saw in micrograd.
             
            
                link |
                
                 Let's try to think through the quality of
             
            
                link |
                
                 the model and how we would evaluate it.
             
            
                link |
                
                 Basically, what we're going to do is we're going to copy-paste
             
            
                link |
                
                 this code that we previously used for counting.
             
            
                link |
                
                 Let me just print these bigrams first.
             
            
                link |
                
                 We're going to use fstrings and I'm going to
             
            
                link |
                
                 print character 1 followed by character 2,
             
            
                link |
                
                 these are the bigrams and then I don't want
             
            
                link |
                
                 to do it for all the words, just the first three words.
             
            
                link |
                
                 Here we have Emma, Olivia, and Ava bigrams.
             
            
                link |
                
                 Now what we'd like to do is we'd like to basically look at
             
            
                link |
                
                 the probability that the model
             
            
                link |
                
                 assigns to every one of these bigrams.
             
            
                link |
                
                 In other words, we can look at the probability,
             
            
                link |
                
                 which is summarized in the matrix P of ix1,
             
            
                link |
                
                 ix2 and then we can print it here as probability.
             
            
                link |
                
                 Because these printed probabilities have way too many digits,
             
            
                link |
                
                 let me add a :.4f format to the f-string to truncate it a bit.
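                 Roughly, the printing loop he is describing looks like this (assuming the words list, the stoi mapping, and the normalized probability matrix P built earlier in the lecture):

                     for w in words[:3]:
                         chs = ['.'] + list(w) + ['.']
                         for ch1, ch2 in zip(chs, chs[1:]):
                             ix1 = stoi[ch1]
                             ix2 = stoi[ch2]
                             prob = P[ix1, ix2]
                             print(f'{ch1}{ch2}: {prob:.4f}')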
             
            
                link |
                
                 What do we have here? We're looking at the probabilities that
             
            
                link |
                
                 the model assigns to every one of these bigrams in the dataset.
             
            
                link |
                
                 We can see some of them are 4 percent,
             
            
                link |
                
                 3 percent, etc, just to have a measuring stick in our mind,
             
            
                link |
                
                 by the way, with 27 possible characters or tokens.
             
            
                link |
                
                 If everything was equally likely,
             
            
                link |
                
                 then you'd expect all these probabilities to
             
            
                link |
                
                 be 4 percent roughly.
             
            
                link |
                
                 Anything above 4 percent means that we've learned
             
            
                link |
                
                 something useful from these bigram statistics.
             
            
                link |
                
                 You see that roughly some of these are 4 percent,
             
            
                link |
                
                 but some of them are as high as 40 percent,
             
            
                link |
                
                 35 percent, and so on.
             
            
                link |
                
                 You see that the model actually assigned
             
            
                link |
                
                 a pretty high probability to whatever's in the training set,
             
            
                link |
                
                 and so that's a good thing.
             
            
                link |
                
                 Basically, if you have a very good model,
             
            
                link |
                
                 you'd expect that these probabilities should be near one,
             
            
                link |
                
                 because that means that your model
             
            
                link |
                
                 is correctly predicting what's going to come next,
             
            
                link |
                
                 especially on the training set where you train your model.
             
            
                link |
                
                 Now we'd like to think about how can we summarize
             
            
                link |
                
                 these probabilities into a single number
             
            
                link |
                
                 that measures the quality of this model.
             
            
                link |
                
                 Now, when you look at the literature into
             
            
                link |
                
                 maximum likelihood estimation and statistical modeling and so on,
             
            
                link |
                
                 you'll see that what's typically used
             
            
                link |
                
                 here is something called the likelihood,
             
            
                link |
                
                 and the likelihood is the product of all of these probabilities.
             
            
                link |
                
                 The product of all of these probabilities is
             
            
                link |
                
                 the likelihood and it's really telling us about
             
            
                link |
                
                 the probability of the entire data set
             
            
                link |
                
                 assigned by the model that we've trained,
             
            
                link |
                
                 and that is a measure of quality.
             
            
                link |
                
                 The product of these should be as high as possible.
             
            
                link |
                
                 When you are training the model and when you have a good model,
             
            
                link |
                
                 your product of these probabilities should be very high.
             
            
                link |
                
                 Now, because the product of these probabilities
             
            
                link |
                
                 is an unwieldy thing to work with,
             
            
                link |
                
                 you can see that all of them are between zero and one,
             
            
                link |
                
                 so your product of these probabilities will be a very tiny number.
             
            
                link |
                
                 So for convenience,
             
            
                link |
                
                 what people work with usually is not the likelihood,
             
            
                link |
                
                 but they work with what's called the log likelihood.
             
            
                link |
                
                 So the product of these is the likelihood.
             
            
                link |
                
                 To get the log likelihood,
             
            
                link |
                
                 we just have to take the log of the probability.
             
            
                link |
                
                 The log of the probability here,
             
            
                link |
                
                 the log of x from zero to one,
             
            
                link |
                
                 the log is a, you see here,
             
            
                link |
                
                 monotonic transformation of the probability,
             
            
                link |
                
                 where if you pass in one, you get zero.
             
            
                link |
                
                 So probability one gets you log probability of zero.
             
            
                link |
                
                 And then as you go lower and lower probability,
             
            
                link |
                
                 the log will grow more and more negative
             
            
                link |
                
                 until all the way to negative infinity at zero.
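                 A quick check of that behavior with torch.log:

                     import torch

                     probs = torch.tensor([1.0, 0.5, 0.04, 1e-6])
                     print(torch.log(probs))
                     # roughly tensor([  0.0000,  -0.6931,  -3.2189, -13.8155])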
             
            
                link |
                
                 So here we have a log prob,
             
            
                link |
                
                 which is really just a torch dot log of probability.
             
            
                link |
                
                 Let's print it out to get a sense of what that looks like.
             
            
                link |
                
                 Log prob, also with .4f.
             
            
                link |
                
                 So as you can see,
             
            
                link |
                
                 when we plug in probabilities that are very close to one,
             
            
                link |
                
                 some of our higher numbers,
             
            
                link |
                
                 we get closer and closer to zero.
             
            
                link |
                
                 And then if we plug in very bad probabilities,
             
            
                link |
                
                 we get a more and more negative number.
             
            
            
                link |
                
                 So, and the reason we work with this is
             
            
                link |
                
                 to a large extent, convenience, right?
             
            
                link |
                
                 Because we have mathematically that if you have
             
            
                link |
                
                 some product a times b times c of all these probabilities,
             
            
                link |
                
                 right, the likelihood is the product
             
            
                link |
                
                 of all these probabilities.
             
            
                link |
                
                 Then the log of these is just log of a plus log of b plus log of c,
             
            
            
                link |
                
                 If you remember your logs from your high school
             
            
                link |
                
                 or undergrad and so on.
             
            
                link |
                
                 So we have that basically,
             
            
                link |
                
                 the likelihood is the product of probabilities.
             
            
                link |
                
                 The log likelihood is just the sum of the logs
             
            
                link |
                
                 of the individual probabilities.
             
            
                link |
                
                 So log likelihood starts at zero.
             
            
                link |
                
                 And then log likelihood here,
             
            
                link |
                
                 we can just accumulate simply.
             
            
                link |
                
                 And then at the end, we can print this.
             
            
                link |
                
                 Print the log likelihood.
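                 Putting that together, the evaluation loop is roughly (same assumed words, stoi, and P as before):

                     log_likelihood = 0.0
                     for w in words[:3]:
                         chs = ['.'] + list(w) + ['.']
                         for ch1, ch2 in zip(chs, chs[1:]):
                             ix1, ix2 = stoi[ch1], stoi[ch2]
                             prob = P[ix1, ix2]
                             logprob = torch.log(prob)
                             log_likelihood += logprob
                             print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
                     print(f'{log_likelihood=}')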
             
            
            
                link |
                
                 Maybe you're familiar with this.
             
            
                link |
                
                 So log likelihood is negative 38.
             
            
            
                link |
                
                 Now we actually want,
             
            
                link |
                
                 so how high can log likelihood get?
             
            
                link |
                
                 It can go to zero.
             
            
                link |
                
                 So when all the probabilities are one,
             
            
                link |
                
                 log likelihood will be zero.
             
            
                link |
                
                 And then when all the probabilities are lower,
             
            
                link |
                
                 this will grow more and more negative.
             
            
                link |
                
                 Now, we don't actually like this
             
            
                link |
                
                 because what we'd like is a loss function.
             
            
                link |
                
                 And a loss function has the semantics that lower is better,
             
            
            
                link |
                
                 because we're trying to minimize the loss.
             
            
                link |
                
                 So we actually need to invert this.
             
            
                link |
                
                 And that's what gives us something called
             
            
                link |
                
                 the negative log likelihood.
             
            
                link |
                
                 Negative log likelihood is just negative
             
            
                link |
                
                 of the log likelihood.
             
            
                link |
                
                 These are F strings, by the way,
             
            
                link |
                
                 if you'd like to look this up.
             
            
                link |
                
                 Negative log likelihood equals.
             
            
                link |
                
                 So negative log likelihood now is just the negative of it.
             
            
                link |
                
                 And so the negative log likelihood is a very nice
             
            
                link |
                
                 loss function because the lowest it can get is zero.
             
            
                link |
                
                 And the higher it is,
             
            
                link |
                
                 the worse off the predictions are that you're making.
             
            
                link |
                
                 And then one more modification to this
             
            
                link |
                
                 that sometimes people do is that for convenience,
             
            
                link |
                
                 they actually like to normalize by,
             
            
                link |
                
                 they like to make it an average instead of a sum.
             
            
                link |
                
                 And so here, let's just keep some counts as well.
             
            
                link |
                
                 So n += 1, where n starts at zero.
             
            
                link |
                
                 And then here we can have sort of like
             
            
                link |
                
                 a normalized log likelihood.
             
            
                link |
                
                 If we just normalize it by the count,
             
            
                link |
                
                 then we will sort of get the average log likelihood.
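                 Extending that loop with a count gives the average negative log likelihood, roughly:

                     log_likelihood = 0.0
                     n = 0
                     for w in words[:3]:
                         chs = ['.'] + list(w) + ['.']
                         for ch1, ch2 in zip(chs, chs[1:]):
                             ix1, ix2 = stoi[ch1], stoi[ch2]
                             log_likelihood += torch.log(P[ix1, ix2])
                             n += 1

                     nll = -log_likelihood
                     print(f'{nll=}')
                     print(f'{nll/n=}')  # average negative log likelihood: the loss we want to be low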
             
            
                link |
                
                 So this would usually be our loss function here;
             
            
                link |
                
                 this is what we would use.
             
            
                link |
                
                 So our loss function for the training set
             
            
                link |
                
                 assigned by the model is 2.4.
             
            
                link |
                
                 That's the quality of this model.
             
            
                link |
                
                 And the lower it is, the better off we are.
             
            
                link |
                
                 And the higher it is, the worse off we are.
             
            
                link |
                
                 And the job of our training is to find the parameters
             
            
                link |
                
                 that minimize the negative log likelihood loss.
             
            
                link |
                
                 And that would be like a high quality model.
             
            
                link |
                
                 Okay, so to summarize, I actually wrote it out here.
             
            
                link |
                
                 So our goal is to maximize likelihood,
             
            
                link |
                
                 which is the product of all the probabilities
             
            
                link |
                
                 assigned by the model.
             
            
                link |
                
                 And we want to maximize this likelihood
             
            
                link |
                
                 with respect to the model parameters.
             
            
                link |
                
                 And in our case, the model parameters here
             
            
                link |
                
                 are defined in the table.
             
            
                link |
                
                 These numbers, the probabilities are the model parameters
             
            
                link |
                
                 sort of, in our bigram language model so far.
             
            
                link |
                
                 But you have to keep in mind that here we are storing
             
            
                link |
                
                 everything in a table format, the probabilities.
             
            
                link |
                
                 But what's coming up as a brief preview
             
            
                link |
                
                 is that these numbers will not be kept explicitly,
             
            
                link |
                
                 but these numbers will be calculated by a neural network.
             
            
                link |
                
                 So that's coming up.
             
            
                link |
                
                 And we want to change and tune the parameters
             
            
                link |
                
                 of these neural networks.
             
            
                link |
                
                 We want to change these parameters
             
            
                link |
                
                 to maximize the likelihood,
             
            
                link |
                
                 the product of the probabilities.
             
            
                link |
                
                 Now, maximizing the likelihood
             
            
                link |
                
                 is equivalent to maximizing the log likelihood
             
            
                link |
                
                 because log is a monotonic function.
             
            
                link |
                
                 Here's the graph of log.
             
            
                link |
                
                 And basically all it is doing is it's just scaling your,
             
            
                link |
                
                 you can look at it as just a scaling of the loss function.
             
            
                link |
                
                 And so the optimization problem here and here
             
            
                link |
                
                 are actually equivalent because this is just scaling.
             
            
                link |
                
                 You can look at it that way.
             
            
                link |
                
                 And so these are two identical optimization problems.
             
            
                link |
                
                 Maximizing the log likelihood is equivalent
             
            
                link |
                
                 to minimizing the negative log likelihood.
             
            
                link |
                
                 And then in practice, people actually minimize
             
            
                link |
                
                 the average negative log likelihood
             
            
                link |
                
                 to get numbers like 2.4.
             
            
                link |
                
                 And then this summarizes the quality of your model.
             
            
                link |
                
                 And we'd like to minimize it and make it
             
            
                link |
                
                 as small as possible.
             
            
                link |
                
                 And the lowest it can get is zero.
             
            
                link |
                
                 And the lower it is, the better off your model is
             
            
                link |
                
                 because it's assigning high probabilities to your data.
             
            
                link |
                
                 Now let's estimate the probability
             
            
                link |
                
                 over the entire training set,
             
            
                link |
                
                 just to make sure that we get something around 2.4.
             
            
                link |
                
                 Let's run this over the entire, oops.
             
            
                link |
                
                 Let's take out the print statement as well.
             
            
                link |
                
                 Okay, 2.45 over the entire training set.
             
            
                link |
                
                 Now what I'd like to show you is that you can actually
             
            
                link |
                
                 evaluate the probability for any word that you want.
             
            
                link |
                
                 Like for example, if we just test a single word, andrej,
             
            
                link |
                
                 and bring back the print statement,
             
            
                link |
                
                 then you see that andrej is actually kind of like
             
            
                link |
                
                 an unlikely word, or like on average,
             
            
                link |
                
                 we pay about 3 in negative log probability to represent it.
             
            
                link |
                
                 And roughly that's because Ej apparently
             
            
                link |
                
                 is very uncommon as an example.
             
            
                link |
                
                 Now think through this.
             
            
                link |
                
                 When I take andrej and I append q,
             
            
                link |
                
                 and I test the probability of it, andrejq,
             
            
                link |
                
                 we actually get infinity.
             
            
                link |
                
                 And that's because jq has a 0% probability
             
            
                link |
                
                 according to our model.
             
            
                link |
                
                 So the log likelihood,
             
            
                link |
                
                 so the log of zero will be negative infinity.
             
            
                link |
                
                 We get infinite loss.
             
            
                link |
                
                 So this is kind of undesirable, right?
             
            
                link |
                
                 Because we plugged in a string
             
            
                link |
                
                 that could be like a somewhat reasonable name.
             
            
                link |
                
                 But basically what this is saying is that this model
             
            
                link |
                
                 is exactly 0% likely to predict this name.
             
            
                link |
                
                 And our loss is infinity on this example.
             
            
                link |
                
                 And really what the reason for that is that j
             
            
                link |
                
                 is followed by q zero times.
             
            
            
            
                link |
                
                 And so jq is 0% likely.
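                 The same loop can score any string; a hedged sketch of testing a made-up name:

                     for w in ['andrejq']:  # any test string; 'jq' never occurs in the training data
                         chs = ['.'] + list(w) + ['.']
                         for ch1, ch2 in zip(chs, chs[1:]):
                             ix1, ix2 = stoi[ch1], stoi[ch2]
                             prob = P[ix1, ix2]
                             logprob = torch.log(prob)
                             print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
                             # without smoothing, the count for 'jq' is 0, so log(0) = -inf
                             # and the loss on this example is infinite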
             
            
                link |
                
                 So it's actually kind of gross
             
            
                link |
                
                 and people don't like this too much.
             
            
                link |
                
                 To fix this, there's a very simple fix
             
            
                link |
                
                 that people like to do to sort of like
             
            
                link |
                
                 smooth out your model a little bit.
             
            
                link |
                
                 And it's called model smoothing.
             
            
                link |
                
                 And roughly what's happening is that
             
            
                link |
                
                 we will add some fake counts.
             
            
                link |
                
                 So imagine adding a count of one to everything.
             
            
                link |
                
                 So we add a count of one like this,
             
            
                link |
                
                 and then we recalculate the probabilities.
             
            
                link |
                
                 And that's model smoothing.
             
            
                link |
                
                 And you can add as much as you like.
             
            
                link |
                
                 You can add five and that will give you a smoother model.
             
            
                link |
                
                 And the more you add here,
             
            
                link |
                
                 the more uniform model you're gonna have.
             
            
                link |
                
                 And the less you add, the more peaked model
             
            
                link |
                
                 you are gonna have, of course.
             
            
                link |
                
                 So one is like a pretty decent count to add.
             
            
                link |
                
                 And that will ensure that there will be no zeros
             
            
                link |
                
                 in our probability matrix P.
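                 In code, the smoothing is just one added constant before normalizing (sketch, using the raw count matrix N from earlier):

                     # add-one smoothing: a fake count of 1 for every possible bigram
                     P = (N + 1).float()
                     P /= P.sum(1, keepdim=True)
                     # larger constants (N + 5, N + 10, ...) push the model toward uniform;
                     # smaller ones keep it more peaked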
             
            
                link |
                
                 And so this will of course change the generations a little bit.
             
            
            
                link |
                
                 In this case it didn't, but in principle it could.
             
            
                link |
                
                 But what that's gonna do now is that
             
            
                link |
                
                 nothing will be infinity unlikely.
             
            
                link |
                
                 So now our model will predict some other probability.
             
            
                link |
                
                 And we see that jq now has a very small probability
             
            
                link |
                
                 so the model still finds it very surprising
             
            
                link |
                
                 that this was a word or a bigram,
             
            
                link |
                
                 but we don't get negative infinity.
             
            
                link |
                
                 So it's kind of like a nice fix
             
            
                link |
                
                 that people like to apply sometimes
             
            
                link |
                
                 and it's called model smoothing.
             
            
                link |
                
                 Okay, so we've now trained a respectable bigram
             
            
                link |
                
                 character level language model.
             
            
                link |
                
                 And we saw that we both sort of trained the model
             
            
                link |
                
                 by looking at the counts of all the bigrams
             
            
                link |
                
                 and normalizing the rows to get probability distributions.
             
            
                link |
                
                 We saw that we can also then use those parameters
             
            
                link |
                
                 of this model to perform sampling of new words.
             
            
                link |
                
                 So we sample new names according to those distributions.
             
            
                link |
                
                 And we also saw that we can evaluate
             
            
                link |
                
                 the quality of this model.
             
            
                link |
                
                 And the quality of this model is summarized
             
            
                link |
                
                 in a single number, which is the negative log likelihood.
             
            
                link |
                
                 And the lower this number is, the better the model is
             
            
                link |
                
                 because it is giving high probabilities
             
            
                link |
                
                 to the actual next characters
             
            
                link |
                
                 in all the bigrams in our training set.
             
            
                link |
                
                 So that's all well and good,
             
            
                link |
                
                 but we've arrived at this model explicitly
             
            
                link |
                
                 by doing something that felt sensible.
             
            
                link |
                
                 We were just performing counts
             
            
                link |
                
                 and then we were normalizing those counts.
             
            
                link |
                
                 Now, what I would like to do is I would like
             
            
                link |
                
                 to take an alternative approach.
             
            
                link |
                
                 We will end up in a very, very similar position,
             
            
                link |
                
                 but the approach will look very different
             
            
                link |
                
                 because I would like to cast the problem
             
            
                link |
                
                 of bigram character level language modeling
             
            
                link |
                
                 into the neural network framework.
             
            
                link |
                
                 And in the neural network framework,
             
            
                link |
                
                 we're going to approach things slightly differently,
             
            
                link |
                
                 but again, end up in a very similar spot.
             
            
                link |
                
                 I'll go into that later.
             
            
                link |
                
                 Now, our neural network is going to be
             
            
                link |
                
                 still a bigram character level language model.
             
            
                link |
                
                 So it receives a single character as an input,
             
            
                link |
                
                 then there's neural network with some weights
             
            
                link |
                
                 or some parameters W,
             
            
                link |
                
                 and it's going to output the probability distribution
             
            
                link |
                
                 over the next character in a sequence.
             
            
                link |
                
                 It's going to make guesses as to what is likely
             
            
                link |
                
                 to follow this character that was input to the model.
             
            
                link |
                
                 And then in addition to that,
             
            
                link |
                
                 we're going to be able to evaluate any setting
             
            
                link |
                
                 of the parameters of the neural net
             
            
                link |
                
                 because we have the loss function,
             
            
                link |
                
                 the negative log likelihood.
             
            
                link |
                
                 So we're going to take a look at its probability distributions
             
            
                link |
                
                 and we're going to use the labels,
             
            
                link |
                
                 which are basically just the identity
             
            
                link |
                
                 of the next character in that bigram, the second character.
             
            
                link |
                
                 So knowing what second character actually comes next
             
            
                link |
                
                 in the bigram allows us to then look at
             
            
                link |
                
                 how high of probability the model assigns
             
            
                link |
                
                 to that character.
             
            
                link |
                
                 And then we of course want the probability to be very high.
             
            
                link |
                
                 And that is another way of saying that the loss is low.
             
            
                link |
                
                 So we're going to use gradient-based optimization then
             
            
                link |
                
                 to tune the parameters of this network
             
            
                link |
                
                 because we have the loss function
             
            
                link |
                
                 and we're going to minimize it.
             
            
                link |
                
                 So we're going to tune the weights
             
            
                link |
                
                 so that the neural net is correctly predicting
             
            
                link |
                
                 the probabilities for the next character.
             
            
                link |
                
                 So let's get started.
             
            
                link |
                
                 The first thing I want to do is I want to compile
             
            
                link |
                
                 the training set of this neural network, right?
             
            
                link |
                
                 So create the training set of all the bigrams, okay?
             
            
                link |
                
                 And here, I'm going to copy-paste this code
             
            
                link |
                
                 because this code iterates over all the bigrams.
             
            
                link |
                
                 So here we start with the words,
             
            
                link |
                
                 we iterate over all the bigrams,
             
            
                link |
                
                 and previously, as you recall, we did the counts.
             
            
                link |
                
                 But now we're not going to do counts,
             
            
                link |
                
                 we're just creating a training set.
             
            
                link |
                
                 Now this training set will be made up of two lists.
             
            
                link |
                
                 We have the bigrams,
             
            
                link |
                
                 we have the inputs and the targets, the labels.
             
            
                link |
                
                 And these bigrams will denote x, y,
             
            
                link |
                
                 those are the characters, right?
             
            
                link |
                
                 And so we're given the first character of the bigram
             
            
                link |
                
                 and then we're trying to predict the next one.
             
            
                link |
                
                 Both of these are going to be integers.
             
            
                link |
                
                 So here we'll take xs.append of ix1,
             
            
                link |
                
                 and ys.append of ix2.
             
            
                link |
                
                 And then here, we actually don't want lists of integers,
             
            
                link |
                
                 we will create tensors out of these.
             
            
                link |
                
                 So x's is torch dot tensor of x's,
             
            
                link |
                
                 and y's is torch dot tensor of y's.
             
            
                link |
                
                 And then we don't actually want to take all the words
             
            
                link |
                
                 just yet because I want everything to be manageable.
             
            
                link |
                
                 So let's just do the first word, which is emma.
             
            
                link |
                
                 And then it's clear what these x's and y's would be.
             
            
                link |
                
                 Here, let me print character one, character two,
             
            
                link |
                
                 just so you see what's going on here.
             
            
                link |
                
                 So the bigrams of these characters are: .e, em, mm, ma, a.
             
            
                link |
                
                 So this single word, as I mentioned,
             
            
                link |
                
                 has one, two, three, four, five examples
             
            
                link |
                
                 for our neural network.
             
            
                link |
                
                 There are five separate examples in emma.
             
            
                link |
                
                 And those examples are summarized here.
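                 Concretely, the training-set construction he just described looks roughly like this (again assuming words and stoi from earlier, with '.' = 0, a = 1, ..., e = 5, m = 13 as in the lecture's mapping):

                     import torch

                     xs, ys = [], []
                     for w in words[:1]:  # just the first word, 'emma', for now
                         chs = ['.'] + list(w) + ['.']
                         for ch1, ch2 in zip(chs, chs[1:]):
                             xs.append(stoi[ch1])
                             ys.append(stoi[ch2])

                     xs = torch.tensor(xs)  # tensor([ 0,  5, 13, 13,  1])
                     ys = torch.tensor(ys)  # tensor([ 5, 13, 13,  1,  0])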
             
            
                link |
                
                 When the input to the neural network is integer zero,
             
            
                link |
                
                 the desired label is integer five, which corresponds to E.
             
            
                link |
                
                 When the input to the neural network is five,
             
            
                link |
                
                 we want its weights to be arranged
             
            
                link |
                
                 so that 13 gets a very high probability.
             
            
                link |
                
                 When 13 is put in, we want 13 to have a high probability.
             
            
                link |
                
                 When 13 is put in,
             
            
                link |
                
                 we also want one to have a high probability.
             
            
                link |
                
                 When one is input,
             
            
                link |
                
                 we want zero to have a very high probability.
             
            
                link |
                
                 So there are five separate input examples,
             
            
                link |
                
                 to a neural net, in this data set.
             
            
                link |
                
                 I wanted to add a tangent of a note of caution
             
            
                link |
                
                 to be careful with a lot of the APIs
             
            
                link |
                
                 of some of these frameworks.
             
            
                link |
                
                 You saw me silently use torch.tensor with a lowercase T,
             
            
                link |
                
                 and the output looked right.
             
            
                link |
                
                 But you should be aware that there's actually two ways
             
            
                link |
                
                 of constructing a tensor.
             
            
                link |
                
                 There's torch.tensor with a lowercase t,
             
            
                link |
                
                 and there's also a torch.Tensor class with a capital T.
             
            
                link |
                
                 Which you can also construct.
             
            
                link |
                
                 So you can actually call both.
             
            
                link |
                
                 You can also do torch.Tensor with a capital T,
             
            
                link |
                
                 and you get an Xs and Ys as well.
             
            
                link |
                
                 So that's not confusing at all.
             
            
                link |
                
                 There are threads on what is the difference
             
            
                link |
                
                 between these two.
             
            
                link |
                
                 And unfortunately, the docs are just like not clear
             
            
                link |
                
                 on the difference.
             
            
                link |
                
                 And when you look at the docs of torch.tensor with a lowercase t,
             
            
                link |
                
                 constructs tensor with no autograd history by copying data.
             
            
                link |
                
                 It's just like, it doesn't have a lot to do with this.
             
            
                link |
                
                 It doesn't make sense.
             
            
                link |
                
                 So the actual difference, as far as I can tell,
             
            
                link |
                
                 is explained eventually in this random thread
             
            
                link |
                
                 that you can Google.
             
            
                link |
                
                 And really it comes down to, I believe,
             
            
                link |
                
                 that, where is this?
             
            
                link |
                
                 torch.tensor infers the dtype, the data type, automatically,
             
            
                link |
                
                 while torch.Tensor just returns a float tensor.
             
            
                link |
                
                 I would recommend sticking to torch.tensor with a lowercase t.
             
            
                link |
                
                 So indeed, we see that when I construct this
             
            
                link |
                
                 with a capital T, the data type here of xs is float32.
             
            
                link |
                
                 But with torch.tensor with a lowercase t,
             
            
                link |
                
                 you see how xs.dtype is now an integer type.
             
            
                link |
                
                 So it's advised that you use lowercase T
             
            
                link |
                
                 and you can read more about it if you like
             
            
                link |
                
                 in some of these threads.
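                 A quick way to see the difference for yourself:

                     import torch

                     a = torch.Tensor([0, 5, 13])  # capital T: always returns float32
                     b = torch.tensor([0, 5, 13])  # lowercase t: infers the dtype from the data
                     print(a.dtype)  # torch.float32
                     print(b.dtype)  # torch.int64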
             
            
                link |
                
                 But basically, I'm pointing out some of these things
             
            
                link |
                
                 because I want to caution you
             
            
                link |
                
                 and I want you to get used to reading
             
            
                link |
                
                 a lot of documentation
             
            
                link |
                
                 and reading through a lot of Q and As
             
            
                link |
                
                 and threads like this.
             
            
                link |
                
                 And some of this stuff is unfortunately not easy
             
            
                link |
                
                 and not very well documented
             
            
                link |
                
                 and you have to be careful out there.
             
            
                link |
                
                 What we want here is integers
             
            
                link |
                
                 because that's what makes sense.
             
            
                link |
                
                 And so lowercaseTensor is what we are using.
             
            
                link |
                
                 Okay, now we want to think through
             
            
                link |
                
                 how we're going to feed in these examples
             
            
                link |
                
                 into a neural network.
             
            
                link |
                
                 Now, it's not quite as straightforward as plugging it in
             
            
                link |
                
                 because these examples right now are integers.
             
            
                link |
                
                 So there's like a zero, five, or 13.
             
            
                link |
                
                 It gives us the index of the character
             
            
                link |
                
                 and you can't just plug an integer index
             
            
                link |
                
                 into a neural net.
             
            
                link |
                
                 These neural nets are sort of made up of these neurons
             
            
                link |
                
                 and these neurons have weights.
             
            
                link |
                
                 And as you saw in microGrad,
             
            
                link |
                
                 these weights act multiplicatively on the inputs,
             
            
                link |
                
                 wx plus b, there's tanh's and so on.
             
            
                link |
                
                 And so it doesn't really make sense
             
            
                link |
                
                 to make an input neuron take on integer values
             
            
                link |
                
                 that you feed in and then multiply on with weights.
             
            
                link |
                
                 So instead, a common way of encoding integers
             
            
                link |
                
                 is what's called one-hot encoding.
             
            
                link |
                
                 In one-hot encoding, we take an integer like 13
             
            
                link |
                
                 and we create a vector that is all zeros
             
            
                link |
                
                 except for the 13th dimension, which we turn to a one.
             
            
                link |
                
                 And then that vector can feed into a neural net.
             
            
                link |
                
                 Now, conveniently, PyTorch actually has something
             
            
                link |
                
                 called the one_hot function inside torch.nn.functional.
             
            
                link |
                
                 It takes a tensor made up of integers.
             
            
                link |
                
                 Long is an integer.
             
            
                link |
                
                 And it also takes a number of classes,
             
            
                link |
                
                 which is how large you want your vector to be.
             
            
                link |
                
                 So here, let's import torch.nn.functional as F.
             
            
                link |
                
                 This is a common way of importing it.
             
            
                link |
                
                 And then let's do F.one_hot.
             
            
                link |
                
                 And we feed in the integers that we want to encode.
             
            
                link |
                
                 So we can actually feed in the entire array of Xs.
             
            
                link |
                
                 And we can tell it that num classes is 27.
             
            
                link |
                
                 So it doesn't have to try to guess it.
             
            
                link |
                
                 It may have guessed that it's only 13
             
            
                link |
                
                 and would give us an incorrect result.
             
            
                link |
                
                 So this is the one-hot.
             
            
                link |
                
                 Let's call this xenc for x encoded.
             
            
                link |
                
                 And then we see that xenc.shape is five by 27.
             
            
                link |
                
                 And we can also visualize it, plt.imshow of xenc,
             
            
                link |
                
                 to make it a little bit more clear
             
            
                link |
                
                 because this is a little messy.
             
            
                link |
                
                 So we see that we've encoded
             
            
                link |
                
                 all the five examples into vectors.
             
            
                link |
                
                 We have five examples, so we have five rows.
             
            
                link |
                
                 And each row here is now an example into a neural net.
             
            
                link |
                
                 And we see that the appropriate bit is turned on as a one.
             
            
                link |
                
                 And everything else is zero.
             
            
                link |
                
                 So here, for example, the zeroth bit is turned on.
             
            
                link |
                
                 The fifth bit is turned on.
             
            
                link |
                
                 The 13th bit is turned on for both of these examples.
             
            
                link |
                
                 And the first bit here is turned on.
             
            
                link |
                
                 So that's how we can encode integers into vectors.
             
            
                link |
                
                 And then these vectors can feed in to neural nets.
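
                 As a small sketch of this encoding step, assuming xs holds the five integer indices just described (0, 5, 13, 13, 1):

                     import torch
                     import torch.nn.functional as F

                     xs = torch.tensor([0, 5, 13, 13, 1])  # the five example indices
                     xenc = F.one_hot(xs, num_classes=27)  # one row per example, a single 1 per row
                     print(xenc.shape)                     # torch.Size([5, 27])

                     # to visualize, assuming matplotlib is available:
                     # import matplotlib.pyplot as plt
                     # plt.imshow(xenc)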
             
            
                link |
                
                 One more issue to be careful with here, by the way,
             
            
                link |
                
                 is let's look at the data type of the encoding.
             
            
                link |
                
                 We always want to be careful
             
            
                link |
                
                 with data types. What would you expect
             
            
                link |
                
                 xenc's data type to be?
             
            
                link |
                
                 When we're plugging numbers into neural nets,
             
            
                link |
                
                 we don't want them to be integers.
             
            
                link |
                
                 We want them to be floating-point numbers
             
            
                link |
                
                 that can take on various values.
             
            
                link |
                
                 But the dtype here is actually 64-bit integer.
             
            
                link |
                
                 And the reason for that, I suspect,
             
            
                link |
                
                 is that one-hot received a 64-bit integer here,
             
            
                link |
                
                 and it returned the same data type.
             
            
                link |
                
                 And when you look at the signature of one-hot,
             
            
                link |
                
                 it doesn't even take a dtype, a desired data type,
             
            
                link |
                
                 of the output tensor.
             
            
                link |
                
                 In a lot of functions in torch,
             
            
                link |
                
                 we'd be able to do something like dtype equals
             
            
                link |
                
                 torch.float32, which is what we want,
             
            
                link |
                
                 but one_hot does not support that.
             
            
                link |
                
                 So instead, we're going to want to cast this
             
            
                link |
                
                 to float like this,
             
            
                link |
                
                 so that the values stay the same,
             
            
                link |
                
                 everything looks the same, but the dtype is float32.
             
            
                link |
                
                 And floats can feed into neural nets.
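
                 Since one_hot does not take a dtype argument, the cast happens afterwards; continuing the sketch above:

                     xenc = F.one_hot(xs, num_classes=27).float()  # cast the int64 one-hot vectors to float32
                     print(xenc.dtype)                             # torch.float32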
             
            
                link |
                
                 So now let's construct our first neural net.
             
            
                link |
                
                 So now let's construct our first neuron.
             
            
                link |
                
                 This neuron will look at these input vectors.
             
            
                link |
                
                 And as you remember from microGrad,
             
            
                link |
                
                 these neurons basically perform a very simple function,
             
            
                link |
                
                 wx plus b, where wx is a dot product, right?
             
            
                link |
                
                 So we can achieve the same thing here.
             
            
                link |
                
                 Let's first define the weights of this neuron, basically.
             
            
                link |
                
                 What are the initial weights at initialization of this neuron?
             
            
            
                link |
                
                 Let's initialize them with torch.randn.
             
            
                link |
                
                 torch.randn fills a tensor with random numbers
             
            
                link |
                
                 drawn from a normal distribution.
             
            
                link |
                
                 And a normal distribution has a probability density function that looks like a bell curve centered at zero.
             
            
            
                link |
                
                 And so most of the numbers drawn from this distribution
             
            
                link |
                
                 will be around zero, but some of them will be as high
             
            
                link |
                
                 as almost three and so on.
             
            
                link |
                
                 And very few numbers will be above three in magnitude.
             
            
                link |
                
                 So it needs to take a size as an input here.
             
            
                link |
                
                 And I'm going to use a size of 27 by one.
             
            
                link |
                
                 So 27 by one, and then let's visualize w.
             
            
                link |
                
                 So w is a column vector of 27 numbers.
             
            
                link |
                
                 And these weights are then multiplied by the inputs.
             
            
                link |
                
                 So now to perform this multiplication,
             
            
                link |
                
                 we can take xenc and we can multiply it with w.
             
            
                link |
                
                 This is a matrix multiplication operator in PyTorch.
             
            
                link |
                
                 And the output of this operation is five by one.
             
            
                link |
                
                 The reason it's five by one is the following.
             
            
                link |
                
                 We took xenc, which is five by 27,
             
            
                link |
                
                 and we multiplied it by 27 by one.
             
            
                link |
                
                 And in matrix multiplication, you see that the output
             
            
                link |
                
                 will become five by one, because these 27s multiply and add up, collapsing that dimension.
             
            
            
                link |
                
                 So basically what we're seeing here out of this operation
             
            
                link |
                
                 is we are seeing the five activations of this neuron
             
            
                link |
                
                 on these five inputs.
             
            
                link |
                
                 And we've evaluated all of them in parallel.
             
            
                link |
                
                 We didn't feed in just a single input to the single neuron.
             
            
                link |
                
                 We fed in simultaneously all the five inputs
             
            
                link |
                
                 into the same neuron.
             
            
                link |
                
                 And in parallel, PyTorch has evaluated the wx plus b,
             
            
                link |
                
                 but here it's just wx, there's no bias.
             
            
                link |
                
                 It has evaluated w times x for all of them independently.
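
                 A sketch of this single neuron, assuming xenc is the five-by-27 float encoding from the sketch above:

                     W = torch.randn((27, 1))  # one neuron: 27 weights drawn from a standard normal
                     out = xenc @ W            # (5, 27) @ (27, 1) -> (5, 1): the neuron's activation on all five inputs
                     print(out.shape)          # torch.Size([5, 1])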
             
            
                link |
                
                 Now, instead of a single neuron though,
             
            
                link |
                
                 I would like to have 27 neurons.
             
            
                link |
                
                 And I'll show you in a second why I want 27 neurons.
             
            
                link |
                
                 So instead of having just a one here,
             
            
                link |
                
                 which is indicating the presence of one single neuron, let's place 27 here instead.
             
            
            
                link |
                
                 And then when w is 27 by 27,
             
            
                link |
                
                 this will, in parallel, evaluate all the 27 neurons
             
            
                link |
                
                 on all the five inputs.
             
            
                link |
                
                 Giving us a much, much bigger result.
             
            
                link |
                
                 So now what we've done is five by 27 multiplied 27 by 27.
             
            
                link |
                
                 And the output of this is now five by 27.
             
            
                link |
                
                 So we can see that the shape of this is five by 27.
             
            
                link |
                
                 So what is every element here telling us, right?
             
            
                link |
                
                 It's telling us for every one of 27 neurons that we created,
             
            
                link |
                
                 what is the firing rate of those neurons
             
            
                link |
                
                 on every one of those five examples?
             
            
                link |
                
                 So the element, for example, three comma 13,
             
            
                link |
                
                 is giving us the firing rate of the 13th neuron
             
            
                link |
                
                 looking at the third input.
             
            
                link |
                
                 And the way this was achieved is by a dot product
             
            
                link |
                
                 between the third input
             
            
                link |
                
                 and the 13th column of this w matrix here.
             
            
                link |
                
                 So using matrix multiplication,
             
            
                link |
                
                 we can very efficiently evaluate the dot product
             
            
                link |
                
                 between lots of input examples in a batch
             
            
                link |
                
                 and lots of neurons where all of those neurons have weights
             
            
                link |
                
                 in the columns of those w's.
             
            
                link |
                
                 And in matrix multiplication,
             
            
                link |
                
                 we're just doing those dot products in parallel.
             
            
                link |
                
                 Just to show you that this is the case,
             
            
                link |
                
                 we can take xenc and we can take the third row.
             
            
                link |
                
                 And we can take the w and take its 13th column.
             
            
                link |
                
                 And then we can do xenc at three,
             
            
                link |
                
                 element-wise multiply with w at 13, and then sum all of that up, and that gives us wx plus b.
             
            
            
            
                link |
                
                 Well, there's no plus b, it's just wx dot product.
             
            
                link |
                
                 And that's this number.
             
            
                link |
                
                 So you see that this is just being done efficiently
             
            
                link |
                
                 by the matrix multiplication operation
             
            
                link |
                
                 for all the input examples
             
            
                link |
                
                 and for all the output neurons of this first layer.
             
            
                link |
                
                 Okay, so we fed our 27 dimensional inputs
             
            
                link |
                
                 into a first layer of a neural net that has 27 neurons.
             
            
                link |
                
                 So we have 27 inputs and now we have 27 neurons.
             
            
                link |
                
                 These neurons perform w times x.
             
            
                link |
                
                 They don't have a bias
             
            
                link |
                
                 and they don't have a non-linearity like tanh.
             
            
                link |
                
                 We're going to leave them to be a linear layer.
             
            
                link |
                
                 In addition to that,
             
            
                link |
                
                 we're not going to have any other layers.
             
            
                link |
                
                 This is going to be it.
             
            
                link |
                
                 It's just going to be the dumbest, smallest,
             
            
                link |
                
                 simplest neural net, which is just a single linear layer.
             
            
                link |
                
                 And now I'd like to explain
             
            
                link |
                
                 what I want those 27 outputs to be.
             
            
                link |
                
                 Intuitively, what we're trying to produce here
             
            
                link |
                
                 for every single input example
             
            
                link |
                
                 is we're trying to produce
             
            
                link |
                
                 some kind of a probability distribution
             
            
                link |
                
                 for the next character in a sequence.
             
            
                link |
                
                 And there's 27 of them.
             
            
                link |
                
                 But we have to come up with precise semantics
             
            
                link |
                
                 for exactly how we're going to interpret
             
            
                link |
                
                 these 27 numbers that these neurons take on.
             
            
                link |
                
                 Now intuitively, you see here that some of these numbers are negative
             
            
                link |
                
                 and some of them are positive, et cetera.
             
            
                link |
                
                 And that's because these are coming out
             
            
                link |
                
                 of a neural net layer initialized
             
            
                link |
                
                 with these normal distribution parameters.
             
            
                link |
                
                 But what we want is we want something like we had here.
             
            
                link |
                
                 Like each row here told us the counts
             
            
                link |
                
                 and then we normalize the counts to get probabilities.
             
            
                link |
                
                 And we want something similar to come out of a neural net.
             
            
                link |
                
                 But what we just have right now
             
            
                link |
                
                 is just some negative and positive numbers.
             
            
                link |
                
                 Now we want those numbers to somehow represent
             
            
                link |
                
                 the probabilities for the next character.
             
            
                link |
                
                 But you see that probabilities,
             
            
                link |
                
                 they have a special structure.
             
            
                link |
                
                 They're positive numbers and they sum to one.
             
            
                link |
                
                 And so that doesn't just come out of a neural net.
             
            
                link |
                
                 And then they can't be counts
             
            
                link |
                
                 because these counts are positive
             
            
                link |
                
                 and counts are integers.
             
            
                link |
                
                 So counts are also not really a good thing
             
            
                link |
                
                 to output from a neural net.
             
            
                link |
                
                 So instead what the neural net is going to output
             
            
                link |
                
                 and how we are going to interpret the 27 numbers
             
            
                link |
                
                 is that these 27 numbers are giving us log counts, basically.
             
            
                link |
                
                 So instead of giving us counts directly,
             
            
                link |
                
                 like in this table, they're giving us log counts.
             
            
                link |
                
                 And to get the counts, we're going to take the log counts
             
            
                link |
                
                 and we're going to exponentiate them.
             
            
                link |
                
                 Now, exponentiation takes the following form.
             
            
                link |
                
                 It takes numbers that are negative or they are positive.
             
            
                link |
                
                 It takes the entire real line.
             
            
                link |
                
                 And then if you plug in negative numbers,
             
            
                link |
                
                 you're going to get e to the x, which is always below one.
             
            
                link |
                
                 So you're getting numbers lower than one.
             
            
                link |
                
                 And if you plug in numbers greater than zero,
             
            
                link |
                
                 you're getting numbers greater than one
             
            
                link |
                
                 all the way growing to the infinity.
             
            
                link |
                
                 And down here it only approaches zero, never going negative.
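
                 For a tiny concrete example of this behavior (output values rounded):

                     import torch
                     print(torch.exp(torch.tensor([-2.0, 0.0, 2.0])))  # tensor([0.1353, 1.0000, 7.3891])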
             
            
                link |
                
                 So basically we're going to take these numbers here
             
            
                link |
                
                 and instead of them being positive and negative
             
            
                link |
                
                 and all over the place,
             
            
                link |
                
                 we're going to interpret them as log counts.
             
            
                link |
                
                 And then we're going to element-wise
             
            
                link |
                
                 exponentiate these numbers.
             
            
                link |
                
                 Exponentiating them now gives us something like this.
             
            
                link |
                
                 And you see that these numbers now
             
            
                link |
                
                 because they went through an exponent,
             
            
                link |
                
                 all the negative numbers turned into numbers below one, but still positive.
             
            
            
                link |
                
                 And all the positive numbers originally turned into
             
            
                link |
                
                 even more positive numbers, sort of greater than one.
             
            
                link |
                
                 So like for example, seven is some positive number
             
            
                link |
                
                 over here that is greater than zero.
             
            
                link |
                
                 But exponentiated outputs here basically give us something
             
            
                link |
                
                 that we can use and interpret as the equivalent
             
            
                link |
                
                 of counts originally.
             
            
                link |
                
                 So you see these counts here, 112, 751, 1, et cetera.
             
            
                link |
                
                 The neural net is kind of now predicting counts.
             
            
                link |
                
                 And these counts are positive numbers.
             
            
                link |
                
                 They can never be below zero.
             
            
                link |
                
                 So that makes sense.
             
            
                link |
                
                 And they can now take on various values
             
            
                link |
                
                 depending on the settings of W.
             
            
                link |
                
                 So let me break this down.
             
            
                link |
                
                 We're going to interpret these to be the log counts.
             
            
                link |
                
                 In other words, the name for these
             
            
                link |
                
                 that is often used is so-called logits.
             
            
                link |
                
                 These are logits, log counts.
             
            
                link |
                
                 And these will be sort of the counts, logits exponentiated.
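
                 Continuing the sketch, the interpretation is just:

                     logits = xenc @ W      # log-counts coming out of the layer
                     counts = logits.exp()  # element-wise e^x: strictly positive, count-like numbers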
             
            
                link |
                
                 And this is equivalent to the n matrix,
             
            
                link |
                
                 sort of the n array that we used previously.
             
            
                link |
                
                 Remember this was the n.
             
            
                link |
                
                 This is the array of counts.
             
            
                link |
                
                 And each row here are the counts for the next character.
             
            
                link |
                
                 So those are the counts.
             
            
                link |
                
                 And now the probabilities are just the counts normalized.
             
            
                link |
                
                 And so I'm not going to find the same code,
             
            
                link |
                
                 but basically I'm not going to scroll all over the place.
             
            
                link |
                
                 We've already done this.
             
            
                link |
                
                 We want counts.sum along the first dimension,
             
            
                link |
                
                 and we want keepdim to be true.
             
            
                link |
                
                 We've went over this,
             
            
                link |
                
                 and this is how we normalize the rows of our counts matrix
             
            
                link |
                
                 to get our probabilities.
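
                 The normalization step, continuing the sketch above:

                     probs = counts / counts.sum(1, keepdim=True)  # divide each row by its sum
                     print(probs.shape)                            # torch.Size([5, 27])
                     print(probs[0].sum())                         # tensor(1.) -- each row sums to one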
             
            
            
                link |
                
                 So now these are the probabilities.
             
            
                link |
                
                 And these are the counts that we have currently.
             
            
                link |
                
                 And now when I show the probabilities,
             
            
                link |
                
                 you see that every row here, of course,
             
            
                link |
                
                 will sum to one, because they're normalized.
             
            
                link |
                
                 And the shape of this is five by 27.
             
            
                link |
                
                 And so really what we've achieved is
             
            
                link |
                
                 for every one of our five examples,
             
            
                link |
                
                 we now have a row that came out of a neural net.
             
            
                link |
                
                 And because of the transformations here,
             
            
                link |
                
                 we made sure that this output of this neural net now
             
            
                link |
                
                 are probabilities,
             
            
                link |
                
                 or we can interpret them to be probabilities.
             
            
                link |
                
                 So our WX here gave us logits,
             
            
                link |
                
                 and then we interpret those to be log counts.
             
            
                link |
                
                 We exponentiate to get something that looks like counts,
             
            
                link |
                
                 and then we normalize those counts
             
            
                link |
                
                 to get a probability distribution.
             
            
                link |
                
                 And all of these are differentiable operations.
             
            
                link |
                
                 So what we've done now is we are taking inputs.
             
            
                link |
                
                 We have differentiable operations
             
            
                link |
                
                 that we can back propagate through,
             
            
                link |
                
                 and we're getting out probability distributions.
             
            
                link |
                
                 So, for example, for the zeroth example that fed in,
             
            
                link |
                
                 which was the zeroth example here,
             
            
                link |
                
                 was a one-hot vector of zero.
             
            
                link |
                
                 And it basically corresponded
             
            
                link |
                
                 to feeding in this example here.
             
            
                link |
                
                 So we're feeding in a dot into a neural net.
             
            
                link |
                
                 And the way we fed the dot into a neural net
             
            
                link |
                
                 is that we first got its index,
             
            
                link |
                
                 then we one-hot encoded it,
             
            
                link |
                
                 then it went into the neural net,
             
            
                link |
                
                 and out came this distribution of probabilities.
             
            
                link |
                
                 And its shape is 27.
             
            
                link |
                
                 There's 27 numbers.
             
            
                link |
                
                 And we're going to interpret this
             
            
                link |
                
                 as the neural net's assignment
             
            
                link |
                
                 for how likely every one of these characters,
             
            
                link |
                
                 the 27 characters, are to come next.
             
            
                link |
                
                 And as we tune the weights W,
             
            
                link |
                
                 we're going to be, of course,
             
            
                link |
                
                 getting different probabilities out
             
            
                link |
                
                 for any character that you input.
             
            
                link |
                
                 And so now the question is just,
             
            
                link |
                
                 can we optimize and find a good W
             
            
                link |
                
                 such that the probabilities coming out are pretty good?
             
            
                link |
                
                 And the way we measure pretty good is by the loss function.
             
            
                link |
                
                 Okay, so I organized everything into a single summary
             
            
                link |
                
                 so that hopefully it's a bit more clear.
             
            
                link |
                
                 So it starts here.
             
            
                link |
                
                 We have an input dataset.
             
            
                link |
                
                 We have some inputs to the neural net,
             
            
                link |
                
                 and we have some labels
             
            
                link |
                
                 for the correct next character in a sequence.
             
            
                link |
                
                 These are integers.
             
            
                link |
                
                 Here I'm using torch generators now
             
            
                link |
                
                 so that you see the same numbers that I see.
             
            
                link |
                
                 And I'm generating the weights for 27 neurons.
             
            
                link |
                
                 And each neuron here receives 27 inputs.
             
            
                link |
                
                 Then here, we're going to plug in
             
            
                link |
                
                 all the input examples, Xs, into a neural net.
             
            
                link |
                
                 So here, this is a forward pass.
             
            
                link |
                
                 First, we have to encode all of the inputs
             
            
                link |
                
                 into one-hot representations.
             
            
                link |
                
                 So we have 27 classes.
             
            
                link |
                
                 We pass in these integers.
             
            
                link |
                
                 And xenc becomes an array that is five by 27,
             
            
                link |
                
                 zeros, except for a few ones.
             
            
                link |
                
                 We then multiply this by the first layer of the neural net to get the logits, then we
             
            
            
                link |
                
                 exponentiate the logits to get fake counts, sort of,
             
            
                link |
                
                 and normalize these counts to get probabilities.
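
                 Putting the summary together as one sketch of the forward pass; the seed here is arbitrary and only for reproducibility, and xs is assumed to hold the five bigram inputs for "emma" as before:

                     import torch
                     import torch.nn.functional as F

                     xs = torch.tensor([0, 5, 13, 13, 1])           # the five bigram inputs for "emma"

                     g = torch.Generator().manual_seed(2147483647)  # arbitrary seed, only for reproducibility
                     W = torch.randn((27, 27), generator=g)         # 27 neurons, 27 inputs each

                     xenc = F.one_hot(xs, num_classes=27).float()   # (5, 27) one-hot inputs
                     logits = xenc @ W                              # predict log-counts
                     counts = logits.exp()                          # counts, like the N array before
                     probs = counts / counts.sum(1, keepdim=True)   # probabilities for the next character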
             
            
                link |
                
                 So these last two lines, by the way, here,
             
            
                link |
                
                 are called the softmax, which I pulled up here.
             
            
                link |
                
                 Softmax is a very often used layer in a neural net
             
            
                link |
                
                 that takes these Zs, which are logits,
             
            
                link |
                
                 exponentiates them, and divides and normalizes.
             
            
                link |
                
                 It's a way of taking outputs of a neural net layer.
             
            
                link |
                
                 And these outputs can be positive or negative.
             
            
                link |
                
                 And it outputs probability distributions.
             
            
                link |
                
                 It outputs something that always sums to one
             
            
                link |
                
                 and are positive numbers, just like probabilities.
             
            
                link |
                
                 So this is kind of like a normalization function,
             
            
                link |
                
                 if you want to think of it that way.
             
            
                link |
                
                 And you can put it on top of any other linear layer
             
            
                link |
                
                 inside a neural net.
             
            
                link |
                
                 And it basically makes a neural net output probabilities
             
            
                link |
                
                 and it's very often used.
             
            
                link |
                
                 And we used it as well here.
             
            
                link |
                
                 So this is the forward pass,
             
            
                link |
                
                 and that's how we made a neural net output probabilities.
             
            
                link |
                
                 Now, you'll notice that all of these,
             
            
                link |
                
                 this entire forward pass is made up of differentiable layers.
             
            
                link |
                
                 Everything here, we can backpropagate through.
             
            
                link |
                
                 And we saw some of the backpropagation in microGrad.
             
            
                link |
                
                 This is just multiplication and addition.
             
            
                link |
                
                 All that's happening here is just multiply and then add.
             
            
                link |
                
                 And we know how to backpropagate through them.
             
            
                link |
                
                 Exponentiation, we know how to backpropagate through.
             
            
                link |
                
                 And then here we are summing,
             
            
                link |
                
                 and sum is easily backpropagatable as well,
             
            
                link |
                
                 and division as well.
             
            
                link |
                
                 So everything here is a differentiable operation,
             
            
                link |
                
                 and we can backpropagate through.
             
            
                link |
                
                 Now, we achieve these probabilities,
             
            
                link |
                
                 which are five by 27.
             
            
                link |
                
                 For every single example,
             
            
                link |
                
                 we have a vector of probabilities that sum to one.
             
            
                link |
                
                 And then here I wrote a bunch of stuff
             
            
                link |
                
                 to sort of like break down the examples.
             
            
                link |
                
                 So we have five examples making up Emma, right?
             
            
                link |
                
                 And there are five bigrams inside Emma.
             
            
                link |
                
                 So bigram example one is that E
             
            
                link |
                
                 comes right after the beginning character dot.
             
            
                link |
                
                 And the indexes for these are zero and five.
             
            
                link |
                
                 So then we feed in a zero,
             
            
                link |
                
                 that's the input to the neural net.
             
            
                link |
                
                 We get probabilities from the neural net
             
            
                link |
                
                 that are 27 numbers.
             
            
                link |
                
                 And then the label is five,
             
            
                link |
                
                 because E actually comes after dot.
             
            
                link |
                
                 So that's the label.
             
            
                link |
                
                 And then we use this label five
             
            
                link |
                
                 to index into the probability distribution here.
             
            
                link |
                
                 So this index five here is zero, one, two, three, four, five.
             
            
                link |
                
                 It's this number here, which is here.
             
            
                link |
                
                 So that's basically the probability assigned
             
            
                link |
                
                 by the neural net to the actual correct character.
             
            
                link |
                
                 You see that the network currently thinks
             
            
                link |
                
                 that this next character, that E following dot,
             
            
                link |
                
                 is only 1% likely,
             
            
                link |
                
                 which is of course not very good, right?
             
            
                link |
                
                 Because this actually is a training example,
             
            
                link |
                
                 and the network thinks that this is currently
             
            
                link |
                
                 very, very unlikely.
             
            
                link |
                
                 But that's just because we didn't get very lucky
             
            
                link |
                
                 in generating a good setting of W.
             
            
                link |
                
                 So right now this network thinks this is unlikely,
             
            
                link |
                
                 and 0.01 is not a good outcome.
             
            
                link |
                
                 So the log likelihood then is very negative.
             
            
                link |
                
                 And the negative log likelihood is very positive.
             
            
                link |
                
                 And so four is a very high negative log likelihood,
             
            
                link |
                
                 and that means we're going to have a high loss.
             
            
                link |
                
                 Because what is the loss?
             
            
                link |
                
                 The loss is just the average negative log likelihood.
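
                 As a sketch of that bookkeeping, assuming probs comes from the forward pass above and ys holds the five correct next-character indices for "emma":

                     ys = torch.tensor([5, 13, 13, 1, 0])  # correct next-character indices (the labels)
                     p = probs[torch.arange(5), ys]        # probability the net assigns to each correct character
                     loss = -p.log().mean()                # average negative log likelihood
                     print(loss)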
             
            
                link |
                
                 So the second character is E, M.
             
            
                link |
                
                 And you see here that also the network thought
             
            
                link |
                
                 that M following E is very unlikely, 1%.
             
            
                link |
                
                 For M following M, it thought it was 2%.
             
            
                link |
                
                 And for A following M,
             
            
                link |
                
                 it actually thought it was 7% likely.
             
            
                link |
                
                 So just by chance,
             
            
                link |
                
                 this one actually has a pretty good probability,
             
            
                link |
                
                 and therefore a pretty low negative log likelihood.
             
            
                link |
                
                 And finally here, it thought this was 1% likely.
             
            
                link |
                
                 So overall, this is not a very good result.
             
            
                link |
                
                 So overall, our average negative log likelihood,
             
            
                link |
                
                 which is the loss, the total loss that summarizes
             
            
                link |
                
                 basically how well this network currently works,
             
            
                link |
                
                 at least on this one word,
             
            
                link |
                
                 not on the full data set, just the one word,
             
            
                link |
                
                 is 3.76, which is actually a fairly high loss.
             
            
                link |
                
                 This is not a very good setting of Ws.
             
            
                link |
                
                 Now here's what we can do.
             
            
                link |
                
                 We're currently getting 3.76.
             
            
                link |
                
                 We can actually come here and we can change our W.
             
            
                link |
                
                 We can resample it.
             
            
                link |
                
                 So let me just add one to have a different seed,
             
            
                link |
                
                 and then we get a different W,
             
            
                link |
                
                 and then we can rerun this.
             
            
                link |
                
                 And with this different seed,
             
            
                link |
                
                 with this different setting of Ws, we get a lower loss.
             
            
            
                link |
                
                 So this is a much better W, right?
             
            
                link |
                
                 And it's better because the probabilities
             
            
                link |
                
                 just happen to come out higher
             
            
                link |
                
                 for the characters that actually are next.
             
            
                link |
                
                 And so you can imagine actually just resampling this.
             
            
            
                link |
                
                 So, okay, this was not very good.
             
            
                link |
                
                 Let's try one more.
             
            
            
                link |
                
                 Okay, this was a terrible setting
             
            
                link |
                
                 because we have a very high loss.
             
            
                link |
                
                 So anyway, I'm gonna erase this.
             
            
                link |
                
                 What I'm doing here, which is just guess and check
             
            
                link |
                
                 of randomly assigning parameters
             
            
                link |
                
                 and seeing if the network is good,
             
            
                link |
                
                 that is amateur hour.
             
            
                link |
                
                 That's not how you optimize a neural net.
             
            
                link |
                
                 The way you optimize a neural net
             
            
                link |
                
                 is you start with some random guess,
             
            
                link |
                
                 and we're gonna come up with a random guess
             
            
                link |
                
                 and we're gonna commit to this one,
             
            
                link |
                
                 even though it's not very good.
             
            
                link |
                
                 But now the big deal is we have a loss function.
             
            
                link |
                
                 So this loss is made up only of differentiable operations.
             
            
                link |
                
                 And we can minimize the loss by tuning Ws,
             
            
                link |
                
                 by computing the gradients of the loss
             
            
                link |
                
                 with respect to these W matrices.
             
            
                link |
                
                 And so then we can tune W to minimize the loss
             
            
                link |
                
                 and find a good setting of W
             
            
                link |
                
                 using gradient-based optimization.
             
            
                link |
                
                 So let's see how that would work.
             
            
                link |
                
                 Now, things are actually going to look almost identical
             
            
                link |
                
                 to what we had with microGrad.
             
            
                link |
                
                 So here I pulled up the lecture from microGrad,
             
            
                link |
                
                 the notebook that's from this repository.
             
            
                link |
                
                 And when I scroll all the way to the end
             
            
                link |
                
                 where we left off with microGrad,
             
            
                link |
                
                 we had something very, very similar.
             
            
                link |
                
                 We had a number of input examples.
             
            
                link |
                
                 In this case, we had four input examples inside Xs
             
            
                link |
                
                 and we had their targets, desired targets.
             
            
                link |
                
                 Just like here, we have our Xs now,
             
            
                link |
                
                 but we have five of them.
             
            
                link |
                
                 And they're now integers instead of vectors.
             
            
                link |
                
                 But we're going to convert our integers to vectors,
             
            
                link |
                
                 except our vectors will be 27 large instead of three large.
             
            
                link |
                
                 And then here, what we did is,
             
            
                link |
                
                 first we did a forward pass
             
            
                link |
                
                 where we ran a neural net on all of the inputs
             
            
                link |
                
                 to get predictions.
             
            
                link |
                
                 Our neural net at the time, this n(x),
             
            
                link |
                
                 was a multilayer perceptron.
             
            
                link |
                
                 Our neural net is going to look different
             
            
                link |
                
                 because our neural net is just a single layer,
             
            
                link |
                
                 single linear layer, followed by a softmax.
             
            
                link |
                
                 So that's our neural net.
             
            
                link |
                
                 And the loss here was the mean squared error.
             
            
                link |
                
                 So we simply subtracted the prediction
             
            
                link |
                
                 from the ground truth and squared it and summed it all up.
             
            
                link |
                
                 And that was the loss.
             
            
                link |
                
                 And loss was the single number
             
            
                link |
                
                 that summarized the quality of the neural net.
             
            
                link |
                
                 And when loss is low, like almost zero,
             
            
                link |
                
                 that means the neural net is predicting correctly.
             
            
                link |
                
                 So we had a single number
             
            
                link |
                
                 that summarized the performance of the neural net.
             
            
                link |
                
                 And everything here was differentiable
             
            
                link |
                
                 and was stored in a massive compute graph.
             
            
                link |
                
                 And then we iterated over all the parameters.
             
            
                link |
                
                 We made sure that the gradients are set to zero
             
            
                link |
                
                 and we called loss.backward.
             
            
                link |
                
                 And loss.backward initiated backpropagation
             
            
                link |
                
                 at the final output node of loss, right?
             
            
                link |
                
                 So, yeah, remember these expressions?
             
            
                link |
                
                 We had loss all the way at the end.
             
            
                link |
                
                 We start backpropagation and we went all the way back
             
            
                link |
                
                 and we made sure that we populated
             
            
                link |
                
                 all the parameters.grad.
             
            
                link |
                
                 So .grad started at zero, but backpropagation filled it in.
             
            
                link |
                
                 And then in the update, we iterated all the parameters
             
            
                link |
                
                 and we simply did a parameter update
             
            
                link |
                
                 where every single element of our parameters
             
            
                link |
                
                 was nudged in the opposite direction of the gradient.
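                 As a reminder, the shape of that loop was roughly as follows. This is a self-contained PyTorch stand-in for the micrograd code, not the notebook itself; the toy data mirrors the four examples used there, and the tanh linear model is just illustrative.

```python
import torch

# toy stand-in for the micrograd setup: four input examples and their desired targets
xs = torch.tensor([[2.0, 3.0, -1.0],
                   [3.0, -1.0, 0.5],
                   [0.5, 1.0, 1.0],
                   [1.0, 1.0, -1.0]])
ys = torch.tensor([1.0, -1.0, -1.0, 1.0])
w = torch.randn(3, requires_grad=True)

for k in range(20):
    # forward pass: run the (stand-in) net on all inputs
    ypred = torch.tanh(xs @ w)
    loss = ((ypred - ys) ** 2).sum()     # squared error, summed over examples

    # backward pass
    w.grad = None                        # reset gradients
    loss.backward()                      # backpropagate from the loss node

    # update: nudge each parameter against its gradient
    w.data += -0.05 * w.grad
```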
             
            
                link |
                
                 And so we're going to do the exact same thing here.
             
            
                link |
                
                 So I'm going to pull this up on the side here,
             
            
                link |
                
                 so that we have it available.
             
            
                link |
                
                 And we're actually going to do the exact same thing.
             
            
                link |
                
                 So this was the forward pass.
             
            
            
                link |
                
                 And probs is our probabilities.
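                 To make that concrete, here is a minimal sketch of the forward pass, assuming the stoi mapping from earlier in the notebook ('.' is 0, 'e' is 5, 'm' is 13, 'a' is 1) and a freshly sampled W; the seed is just one possible choice.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)         # one possible seed; the video resamples a few times
W = torch.randn((27, 27), generator=g, requires_grad=True)

xs = torch.tensor([0, 5, 13, 13, 1])                  # integer inputs for the five "emma" bigrams

xenc = F.one_hot(xs, num_classes=27).float()          # (5, 27) one-hot encoding of the inputs
logits = xenc @ W                                     # (5, 27) log-counts from the linear layer
counts = logits.exp()                                 # analogous to the count matrix
probs = counts / counts.sum(1, keepdim=True)          # softmax: each row sums to 1
```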
             
            
                link |
                
                 So now we have to evaluate the loss,
             
            
                link |
                
                 but we're not using the mean squared error.
             
            
                link |
                
                 We're using the negative log likelihood
             
            
                link |
                
                 because we are doing classification.
             
            
                link |
                
                 We're not doing regression as it's called.
             
            
                link |
                
                 So here, we want to calculate loss.
             
            
                link |
                
                 Now, the way we calculate it is,
             
            
                link |
                
                 it's just this average negative log likelihood.
             
            
                link |
                
                 Now this probs here has a shape of five by 27.
             
            
                link |
                
                 And so to get all the,
             
            
                link |
                
                 we basically want to pluck out the probabilities
             
            
                link |
                
                 at the correct indices here.
             
            
                link |
                
                 So in particular, because the labels are stored here
             
            
                link |
                
                 in the array ys,
             
            
                link |
                
                 basically what we're after is for the first example,
             
            
                link |
                
                 we're looking at the probability at index five, right?
             
            
            
                link |
                
                 For the second example,
             
            
                link |
                
                 at the second row or row index one,
             
            
                link |
                
                 we are interested in the probability assigned to index 13.
             
            
                link |
                
                 At the third example, row index two, we also want 13.
             
            
                link |
                
                 At the fourth row, row index three, we want one.
             
            
                link |
                
                 And at the last row, row index four, we want zero.
             
            
                link |
                
                 So these are the probabilities we're interested in, right?
             
            
                link |
                
                 And you can see that they're not amazing.
             
            
                link |
                
                 They're not amazing as we saw above.
             
            
                link |
                
                 So these are the probabilities we want,
             
            
                link |
                
                 but we want like a more efficient way
             
            
                link |
                
                 to access these probabilities,
             
            
                link |
                
                 not just listing them out in a tuple like this.
             
            
                link |
                
                 So it turns out that the way to do this in PyTorch,
             
            
                link |
                
                 one of the ways at least,
             
            
                link |
                
                 is we can basically pass in all of these,
             
            
            
                link |
                
                 all of these integers in as tensors.
             
            
                link |
                
                 So these ones, you see how they're just zero, one, two,
             
            
            
                link |
                
                 we can actually create that using np, not np, sorry,
             
            
                link |
                
                 torch.arange(5), which is zero, one, two, three, four.
             
            
                link |
                
                 So we can index here with torch.arange(5),
             
            
                link |
                
                 and here we index with ys.
             
            
                link |
                
                 And you see that that gives us exactly these numbers.
             
            
                link |
                
                 So that plucks out the probabilities
             
            
                link |
                
                 that the neural network assigns
             
            
                link |
                
                 to the correct next character.
             
            
                link |
                
                 Now we take those probabilities and we don't,
             
            
                link |
                
                 we actually look at the log probability.
             
            
                link |
                
                 So we want to call .log(),
             
            
                link |
                
                 and then we want to just average that up.
             
            
                link |
                
                 So take the mean of all of that.
             
            
                link |
                
                 And then it's the negative average log likelihood
             
            
            
                link |
                
                 So the loss here is 3.7 something.
             
            
                link |
                
                 And you see that this loss, 3.76, is exactly
             
            
                link |
                
                 as we've obtained before,
             
            
                link |
                
                 but this is a vectorized form of that expression.
             
            
                link |
                
                 So we get the same loss.
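                 In code, that vectorized pluck-and-average looks roughly like this, continuing from the forward-pass sketch above, with ys holding the five target indices.

```python
ys = torch.tensor([5, 13, 13, 1, 0])                  # correct next-character indices for "emma"

# for each row i, pick out the probability assigned to the correct label ys[i]
picked = probs[torch.arange(5), ys]                   # shape (5,)
loss = -picked.log().mean()                           # average negative log likelihood
print(loss.item())                                    # about 3.76 for the particular random W in the video
```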
             
            
                link |
                
                 And the same loss we can consider sort of
             
            
                link |
                
                 as part of this forward pass,
             
            
                link |
                
                 and we've now arrived here at the loss.
             
            
                link |
                
                 Okay, so we made our way all the way to loss.
             
            
                link |
                
                 We've defined the forward pass.
             
            
                link |
                
                 We've run the network forward and computed the loss.
             
            
                link |
                
                 Now we're ready to do the backward pass.
             
            
                link |
                
                 So backward pass, we want to first make sure
             
            
                link |
                
                 that all the gradients are reset.
             
            
                link |
                
                 So they're at zero.
             
            
                link |
                
                 Now in PyTorch, you can set the gradients to be zero,
             
            
                link |
                
                 but you can also just set it to none.
             
            
                link |
                
                 And setting it to none is more efficient.
             
            
                link |
                
                 And PyTorch will interpret none as like a lack
             
            
                link |
                
                 of a gradient and it's the same as zeros.
             
            
                link |
                
                 So this is a way to zero out the gradient.
             
            
                link |
                
                 And now we do loss.backward.
             
            
                link |
                
                 Before we do loss.backward, we need one more thing.
             
            
                link |
                
                 If you remember from micrograd,
             
            
                link |
                
                 PyTorch actually requires that we pass in requires grad
             
            
                link |
                
                 is true so that we tell PyTorch that we are interested
             
            
                link |
                
                 in calculating the gradient for this leaf tensor.
             
            
                link |
                
                 By default, this is false.
             
            
                link |
                
                 So let me recalculate with that.
             
            
                link |
                
                 And then set to none and loss.backward.
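                 A sketch of that backward pass, assuming W was created with requires_grad=True as in the sketch above:

```python
# backward pass
W.grad = None         # PyTorch treats None like a zero gradient, just more efficiently
loss.backward()       # fills in W.grad by backpropagating through probs, counts, logits

print(W.grad.shape)   # torch.Size([27, 27]), same shape as W
```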
             
            
                link |
                
                 Now something magical happened when loss.backward was run
             
            
                link |
                
                 because PyTorch, just like micrograd,
             
            
                link |
                
                 when we did the forward pass here,
             
            
                link |
                
                 it keeps track of all the operations under the hood.
             
            
                link |
                
                 It builds a full computational graph.
             
            
                link |
                
                 Just like the graphs we produced in micrograd,
             
            
                link |
                
                 those graphs exist inside PyTorch.
             
            
                link |
                
                 And so it knows all the dependencies
             
            
                link |
                
                 and all the mathematical operations of everything.
             
            
                link |
                
                 And when you then calculate the loss,
             
            
                link |
                
                 we can call .backward on it.
             
            
                link |
                
                 And .backward then fills in the gradients
             
            
                link |
                
                 of all the intermediates all the way back to w's,
             
            
                link |
                
                 which are the parameters of our neural net.
             
            
                link |
                
                 So now we can do w.grad
             
            
                link |
                
                 and we see that it has structure.
             
            
                link |
                
                 There's stuff inside it.
             
            
                link |
                
                 And these gradients, every single element here,
             
            
                link |
                
                 so w.shape is 27 by 27.
             
            
                link |
                
                 w.grad.shape is the same, 27 by 27.
             
            
                link |
                
                 And every element of w.grad is telling us
             
            
                link |
                
                 the influence of that weight on the loss function.
             
            
                link |
                
                 So for example, this number all the way here,
             
            
                link |
                
                 if this element, the zero zero element of w,
             
            
                link |
                
                 because the gradient is positive,
             
            
                link |
                
                 it's telling us that this has a positive influence
             
            
                link |
                
                 on the loss: slightly nudging W, taking W zero zero
             
            
                link |
                
                 and adding a small h to it would increase the loss mildly
             
            
                link |
                
                 because this gradient is positive.
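                 If you want to convince yourself of that interpretation, here is a small finite-difference check. It is not part of the lecture, just a hypothetical sanity check that reuses the sketches above.

```python
def forward_loss(W, xs, ys):
    # same forward pass as before, packaged as a function
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    return -probs[torch.arange(xs.nelement()), ys].log().mean()

h = 1e-3
with torch.no_grad():
    base = forward_loss(W, xs, ys)
    W[0, 0] += h                      # nudge one weight by a small h
    bumped = forward_loss(W, xs, ys)
    W[0, 0] -= h                      # undo the nudge

# (bumped - base) / h should roughly match the analytic gradient W.grad[0, 0]
print((bumped - base) / h, W.grad[0, 0])
```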
             
            
                link |
                
                 Some of these gradients are also negative.
             
            
                link |
                
                 So that's telling us about the gradient information
             
            
                link |
                
                 and we can use this gradient information
             
            
                link |
                
                 to update the weights of this neural network.
             
            
                link |
                
                 So let's now do the update.
             
            
                link |
                
                 It's going to be very similar to what we had in microGrad.
             
            
                link |
                
                 We don't need a loop over all the parameters
             
            
                link |
                
                 because we only have one parameter tensor, and that is W.
             
            
                link |
                
                 So we simply do w.data plus equals,
             
            
                link |
                
                 we can actually copy this almost exactly,
             
            
                link |
                
                 negative 0.1 times w.grad.
             
            
                link |
                
                 And that would be the update to the tensor.
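                 That one-line update, as a sketch:

```python
# update: nudge every element of W against its gradient (learning rate 0.1 here)
W.data += -0.1 * W.grad

# re-running the forward pass now should give a slightly lower loss (3.76 -> ~3.74 in the video)
```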
             
            
                link |
                
                 So that updates the tensor and because the tensor is updated,
             
            
                link |
                
                 we would expect that now the loss should decrease.
             
            
                link |
                
                 So here, if I print loss.item, it was 3.76, right?
             
            
                link |
                
                 So we've updated the w here.
             
            
                link |
                
                 So if I recalculate forward pass,
             
            
                link |
                
                 loss now should be slightly lower.
             
            
                link |
                
                 So 3.76 goes to 3.74.
             
            
                link |
                
                 And then we can again set grad to none and backward, update.
             
            
                link |
                
                 And now the parameter is changed again.
             
            
                link |
                
                 So if we recalculate the forward pass,
             
            
                link |
                
                 we expect a lower loss again, 3.72, okay?
             
            
                link |
                
                 And in doing this,
             
            
                link |
                
                 we're now doing gradient descent.
             
            
                link |
                
                 And when we achieve a low loss, that will mean that the network
             
            
                link |
                
                 is assigning high probabilities
             
            
                link |
                
                 to the correct next characters.
             
            
                link |
                
                 Okay, so I rearranged everything
             
            
                link |
                
                 and I put it all together from scratch.
             
            
                link |
                
                 So here is where we construct our dataset of bigrams.
             
            
                link |
                
                 You see that we are still iterating
             
            
                link |
                
                 only over the first word, emma.
             
            
                link |
                
                 I'm going to change that in a second.
             
            
                link |
                
                 I added a number that counts the number of elements in Xs,
             
            
            
                link |
                
                 so that we explicitly see that the number of examples is five,
             
            
                link |
                
                 because currently we're just working with emma,
             
            
                link |
                
                 and there's five bigrams there.
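                 For reference, a sketch of that dataset construction, where words and stoi come from earlier in the notebook and num is the count being printed:

```python
# build the bigram training set (xs -> ys)
xs, ys = [], []
for w in words[:1]:                          # just "emma" for now; use all of words in a moment
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples:', num)            # 5 here; roughly 228,000 over all the words
```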
             
            
                link |
                
                 And here I added a loop of exactly what we had before.
             
            
                link |
                
                 So we had 10 iterations of gradient descent
             
            
                link |
                
                 of forward pass, backward pass, and an update.
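                 The loop itself looks roughly like this; the learning rate and iteration count are the knobs adjusted next.

```python
for k in range(10):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean()
    print(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update (the learning rate gets cranked up, eventually to 50, on the full dataset)
    W.data += -0.1 * W.grad
```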
             
            
                link |
                
                 And so running these two cells, initialization
             
            
                link |
                
                 and gradient descent,
             
            
                link |
                
                 gives us some improvement on the loss function.
             
            
                link |
                
                 But now I want to use all the words.
             
            
                link |
                
                 And there's not five, but 228,000 bigrams now.
             
            
                link |
                
                 However, this should require no modification whatsoever.
             
            
                link |
                
                 Everything should just run,
             
            
                link |
                
                 because all the code we wrote doesn't care
             
            
                link |
                
                 if there's five bigrams or 228,000 bigrams,
             
            
                link |
                
                 and everything should just work.
             
            
                link |
                
                 So you see that this will just run.
             
            
                link |
                
                 But now we are optimizing over the entire training set
             
            
                link |
                
                 of all the bigrams.
             
            
                link |
                
                 And you see now that the loss is decreasing very slightly.
             
            
                link |
                
                 So actually we can probably afford a larger learning rate.
             
            
                link |
                
                 We can probably afford an even larger learning rate.
             
            
                link |
                
                 Even 50 seems to work on this very, very simple example.
             
            
                link |
                
                 So let me reinitialize and let's run 100 iterations.
             
            
            
                link |
                
                 We seem to be coming up to some pretty good losses here.
             
            
                link |
                
                 2.47, let me run 100 more.
             
            
                link |
                
                 What is the number that we expect, by the way, in the loss?
             
            
                link |
                
                 We expect to get something around
             
            
                link |
                
                 what we had originally, actually.
             
            
                link |
                
                 So all the way back, if you remember,
             
            
                link |
                
                 in the beginning of this video,
             
            
                link |
                
                 when we optimized just by counting,
             
            
                link |
                
                 our loss was roughly 2.47.
             
            
                link |
                
                 So that's the number that we expect to get:
             
            
                link |
                
                 just by counting, our loss was roughly 2.47
             
            
                link |
                
                 after we added smoothing.
             
            
                link |
                
                 But before smoothing, we had roughly 2.45 negative log likelihood,
             
            
            
                link |
                
                 And so that's actually roughly the vicinity
             
            
                link |
                
                 of what we expect to achieve.
             
            
                link |
                
                 But before we achieved it by counting,
             
            
                link |
                
                 and here we are achieving roughly the same result,
             
            
                link |
                
                 but with gradient-based optimization.
             
            
                link |
                
                 So we come to about 2.46, 2.45, et cetera.
             
            
                link |
                
                 And that makes sense, because fundamentally,
             
            
                link |
                
                 we're not taking in any additional information.
             
            
                link |
                
                 We're still just taking in the previous character
             
            
                link |
                
                 and trying to predict the next one.
             
            
                link |
                
                 But instead of doing it explicitly by counting
             
            
            
                link |
                
                 we are doing it with gradient-based learning.
             
            
                link |
                
                 And it just so happens that the explicit approach
             
            
                link |
                
                 happens to optimize the loss function very well
             
            
                link |
                
                 without any need for a gradient-based optimization,
             
            
                link |
                
                 because the setup for bigram language models
             
            
                link |
                
                 is so straightforward and so simple.
             
            
                link |
                
                 We can just afford to estimate those probabilities directly
             
            
                link |
                
                 and maintain them in a table.
             
            
                link |
                
                 But the gradient-based approach
             
            
                link |
                
                 is significantly more flexible.
             
            
                link |
                
                 So we've actually gained a lot,
             
            
                link |
                
                 because what we can do now is we can expand this approach
             
            
                link |
                
                 and complexify the neural net.
             
            
                link |
                
                 So currently, we're just taking a single character
             
            
                link |
                
                 and feeding into a neural net,
             
            
                link |
                
                 and the neural net is extremely simple.
             
            
                link |
                
                 But we're about to iterate on this substantially.
             
            
                link |
                
                 We're going to be taking multiple previous characters,
             
            
                link |
                
                 and we're going to be feeding them into increasingly
             
            
                link |
                
                 more complex neural nets.
             
            
                link |
                
                 But fundamentally, the output of the neural net
             
            
                link |
                
                 will always just be logits.
             
            
                link |
                
                 And those logits will go through
             
            
                link |
                
                 the exact same transformation.
             
            
                link |
                
                 We are going to take them through a softmax,
             
            
                link |
                
                 calculate the loss function and the negative log-likelihood,
             
            
                link |
                
                 and do gradient-based optimization.
             
            
                link |
                
                 And so actually, as we complexify the neural nets
             
            
                link |
                
                 and work all the way up to transformers,
             
            
                link |
                
                 none of this will really fundamentally change.
             
            
                link |
                
                 None of this will fundamentally change.
             
            
                link |
                
                 The only thing that will change is the way we do
             
            
                link |
                
                 the forward pass, where we take in some previous characters
             
            
                link |
                
                 and calculate logits for the next character in a sequence.
             
            
                link |
                
                 That will become more complex,
             
            
                link |
                
                 but we'll use the same machinery to optimize it.
             
            
                link |
                
                 And it's not obvious how we would have extended
             
            
                link |
                
                 this bigram approach into the case
             
            
                link |
                
                 where there are many more characters at the input,
             
            
                link |
                
                 because eventually these tables would get way too large
             
            
                link |
                
                 because there's way too many combinations
             
            
                link |
                
                 of what previous characters could be.
             
            
                link |
                
                 If you only have one previous character,
             
            
                link |
                
                 we can just keep everything in a table of counts.
             
            
                link |
                
                 But if you have the last 10 characters as input,
             
            
                link |
                
                 we can't actually keep everything in the table anymore.
             
            
                link |
                
                 So this is fundamentally an unscalable approach,
             
            
                link |
                
                 and the neural network approach
             
            
                link |
                
                 is significantly more scalable.
             
            
                link |
                
                 And it's something that actually we can improve on over time.
             
            
                link |
                
                 So that's where we will be digging next.
             
            
                link |
                
                 I wanted to point out two more things.
             
            
                link |
                
                 Number one, I want you to notice that this xenc here,
             
            
                link |
                
                 this is made up of one-hot vectors.
             
            
                link |
                
                 And then those one-hot vectors
             
            
                link |
                
                 are multiplied by this W matrix.
             
            
                link |
                
                 And we think of this as multiple neurons being forwarded
             
            
                link |
                
                 in a fully connected manner.
             
            
                link |
                
                 But actually what's happening here is that, for example,
             
            
                link |
                
                 if you have a one-hot vector here that has a one
             
            
                link |
                
                 at say the fifth dimension,
             
            
                link |
                
                 then because of the way the matrix multiplication works,
             
            
                link |
                
                 multiplying that one-hot vector with W
             
            
                link |
                
                 actually ends up plucking out the fifth row of W.
             
            
                link |
                
                 Logits would become just the fifth row of W.
             
            
                link |
                
                 And that's because of the way the matrix multiplication works.
             
            
                link |
                
                 So that's actually what ends up happening.
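                 You can check this equivalence directly. This is a small sketch, not code from the video:

```python
ix = 5
one_hot_row = F.one_hot(torch.tensor([ix]), num_classes=27).float()   # (1, 27) one-hot row
via_matmul = one_hot_row @ W                                          # (1, 27) matrix multiply
via_indexing = W[ix]                                                  # (27,) just pluck the row

print(torch.allclose(via_matmul[0], via_indexing))                    # True: identical numbers
```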
             
            
                link |
                
                 But that's actually exactly what happened before.
             
            
                link |
                
                 Because remember all the way up here,
             
            
            
                link |
                
                 We took the first character
             
            
                link |
                
                 and then that first character indexed
             
            
                link |
                
                 into a row of this array here.
             
            
                link |
                
                 And that row gave us the probability distribution
             
            
                link |
                
                 for the next character.
             
            
                link |
                
                 So the first character was used as a lookup
             
            
                link |
                
                 into a matrix here to get the probability distribution.
             
            
                link |
                
                 Well, that's actually exactly what's happening here.
             
            
                link |
                
                 Because we're taking the index,
             
            
                link |
                
                 we're encoding it as one-hot and multiplying it by W.
             
            
                link |
                
                 So Logits literally becomes the appropriate row of W.
             
            
                link |
                
                 And that gets, just as before,
             
            
                link |
                
                 exponentiated to create the counts
             
            
                link |
                
                 and then normalized and becomes probability.
             
            
                link |
                
                 So this W here is literally the same as this array here.
             
            
                link |
                
                 But W remember is the log counts, not the counts.
             
            
                link |
                
                 So it's more precise to say that W exponentiated
             
            
                link |
                
                 W.exp(), is this array.
             
            
                link |
                
                 But this array was filled in by counting
             
            
                link |
                
                 and by basically populating the counts of bigrams.
             
            
                link |
                
                 Whereas in the gradient-based framework,
             
            
                link |
                
                 we initialize it randomly and then we let loss guide us
             
            
                link |
                
                 to arrive at the exact same array.
             
            
                link |
                
                 So this array exactly here is basically the array W
             
            
                link |
                
                 at the end of optimization,
             
            
                link |
                
                 except we arrived at it piece by piece by following the loss.
             
            
                link |
                
                 And that's why we also obtain
             
            
                link |
                
                 the same loss function at the end.
             
            
                link |
                
                 And the second notice, if I come here,
             
            
                link |
                
                 remember the smoothing where we added fake counts
             
            
                link |
                
                 to our counts in order to smooth out
             
            
                link |
                
                 and make more uniform the distributions
             
            
                link |
                
                 of these probabilities.
             
            
                link |
                
                 And that prevented us from assigning zero probability
             
            
                link |
                
                 to any one bigram.
             
            
                link |
                
                 Now, if I increase the count here,
             
            
                link |
                
                 what's happening to the probability?
             
            
                link |
                
                 As I increase the count,
             
            
                link |
                
                 probability becomes more and more uniform, right?
             
            
                link |
                
                 Because these counts go only up to like 900 or whatever.
             
            
                link |
                
                 So if I'm adding plus a million to every single number here,
             
            
                link |
                
                 you can see how the row and its probability then
             
            
                link |
                
                 when we divide is just going to become more and more close
             
            
                link |
                
                 to exactly even probability, uniform distribution.
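                 In the counting model from earlier, that smoothing looked roughly like this, where N is the 27 by 27 bigram count matrix and the added constant is the smoothing knob:

```python
# add fake counts: a bigger constant makes every row more uniform
P = (N + 1).float()                      # try (N + 1000000) and watch the rows flatten out
P = P / P.sum(1, keepdim=True)           # normalize each row into a probability distribution
```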
             
            
                link |
                
                 It turns out that the gradient-based framework
             
            
                link |
                
                 has an equivalent to smoothing.
             
            
                link |
                
                 In particular, think through these Ws here,
             
            
                link |
                
                 which we initialized randomly.
             
            
                link |
                
                 We could also think about initializing Ws to be zero.
             
            
                link |
                
                 If all the entries of W are zero,
             
            
                link |
                
                 then you'll see that logits will become all zero
             
            
                link |
                
                 and then exponentiating those logits makes them all ones.
             
            
                link |
                
                 And then the probabilities turn out to be exactly the same.
             
            
            
                link |
                
                 So basically when Ws are all equal to each other,
             
            
                link |
                
                 or say, especially zero,
             
            
                link |
                
                 then the probabilities come out completely uniform.
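                 A quick sketch of that special case, again just to illustrate rather than code from the video:

```python
W0 = torch.zeros((27, 27))                                # all weights equal to zero
logits0 = F.one_hot(xs, num_classes=27).float() @ W0      # logits are all zero
counts0 = logits0.exp()                                   # exp(0) = 1 everywhere
probs0 = counts0 / counts0.sum(1, keepdim=True)           # every entry is exactly 1/27 ~= 0.037
print(probs0[0])
```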
             
            
                link |
                
                 So trying to incentivize W to be near zero
             
            
                link |
                
                 is basically equivalent to label smoothing.
             
            
                link |
                
                 And the more you incentivize that in a loss function,
             
            
                link |
                
                 the more smooth distribution you're going to achieve.
             
            
                link |
                
                 So this brings us to something that's called regularization,
             
            
                link |
                
                 where we can actually augment the loss function
             
            
                link |
                
                 to have a small component
             
            
                link |
                
                 that we call a regularization loss.
             
            
                link |
                
                 In particular, what we're going to do is we can take W
             
            
                link |
                
                 and we can, for example, square all of its entries.
             
            
                link |
                
                 And then we can, oops, sorry about that.
             
            
                link |
                
                 We can take all the entries of W and we can sum them.
             
            
                link |
                
                 And because we're squaring, there will be no signs anymore.
             
            
                link |
                
                 Negatives and positives all get squashed
             
            
                link |
                
                 to be positive numbers.
             
            
                link |
                
                 And then the way this works is you achieve zero loss
             
            
                link |
                
                 if W is exactly zero,
             
            
                link |
                
                 but if W has non-zero numbers, you accumulate loss.
             
            
                link |
                
                 And so we can actually take this and we can add it on here.
             
            
                link |
                
                 So we can do something like loss plus W square dot sum.
             
            
                link |
                
                 Or let's actually, instead of sum, let's take a mean,
             
            
                link |
                
                 because otherwise the sum gets too large.
             
            
                link |
                
                 So mean is like a little bit more manageable.
             
            
                link |
                
                 And then we have a regularization loss here,
             
            
                link |
                
                 let's say 0.01 times or something like that.
             
            
                link |
                
                 You can choose the regularization strength.
             
            
                link |
                
                 And then we can just optimize this.
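                 Concretely, the regularized loss looks roughly like this, with 0.01 as the regularization strength you can tune:

```python
# data term plus a small pull of W towards zero
loss = -probs[torch.arange(num), ys].log().mean() + 0.01 * (W**2).mean()
```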
             
            
                link |
                
                 And now this optimization actually has two components.
             
            
                link |
                
                 Not only is it trying to make
             
            
                link |
                
                 all the probabilities work out,
             
            
                link |
                
                 but in addition to that, there's an additional component
             
            
                link |
                
                 that simultaneously tries to make all Ws be zero,
             
            
                link |
                
                 because if Ws are non-zero, you feel a loss.
             
            
                link |
                
                 And so minimizing this, the only way to achieve that
             
            
                link |
                
                 is for W to be zero.
             
            
                link |
                
                 And so you can think of this as adding like a spring force
             
            
                link |
                
                 or like a gravity force that pushes W to be zero.
             
            
                link |
                
                 So W wants to be zero and the probabilities
             
            
                link |
                
                 want to be uniform, but they also simultaneously
             
            
                link |
                
                 want to match up your probabilities
             
            
                link |
                
                 as indicated by the data.
             
            
                link |
                
And so the strength of this regularization exactly controls the amount of fake counts that you add in the smoothing of the counting-based model. Adding a lot more counts there corresponds to increasing this number, because the more you increase it, the more this part of the loss function dominates the data part, and the less these weights are able to grow, because as they grow they accumulate way too much loss. And if the regularization is strong enough, we are not able to overcome its force, and everything ends up as uniform predictions. So I thought that's kind of cool.
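
To make that analogy concrete, here is a hedged sketch of the smoothing side; N stands for the 27-by-27 bigram count matrix from the counting-based model (the random tensor below is only a stand-in so the snippet runs on its own):

```python
import torch

# N stands for the 27x27 bigram count matrix from the counting-based model;
# the random tensor here is only a stand-in so this snippet runs by itself.
N = torch.randint(0, 100, (27, 27))

def smoothed_probs(N, fake_counts=1):
    # Adding fake counts to every cell is the counting-model counterpart of the
    # regularization strength: the larger it is, the closer every row gets to
    # the uniform distribution of 1/27.
    P = (N + fake_counts).float()
    return P / P.sum(1, keepdim=True)

P_light = smoothed_probs(N, 1)       # mild smoothing
P_heavy = smoothed_probs(N, 100000)  # heavy smoothing: rows are nearly uniform
```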
             
            
Okay, and lastly, before we wrap up, I wanted to show you how you would sample from this neural net model. I copy-pasted the sampling code from before, where, remember, we sampled five times. All we did was start at zero, grab the current ix row of P as our probability row, sample the next index from it, accumulate the characters, and break when we hit zero. And running this gave us these results. I still have the P in memory, so this is fine.

Now, this probability distribution doesn't come from a row of P; instead it comes from the neural net. First, we take ix and we encode it into a one-hot row, xenc. This xenc multiplies our W, which really just plucks out the row of W corresponding to ix; that's really what's happening. That gives us our logits; we exponentiate those logits to get counts, and then we normalize the counts to get the distribution. And then we can sample from that distribution.
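
Here is a minimal sketch of that sampling loop; it assumes the trained 27-by-27 matrix W and the itos index-to-character mapping from earlier, and the only line that really differs from the counting-based loop is how p is computed (previously it was just the row P[ix]):

```python
import torch
import torch.nn.functional as F

# A sketch of the neural-net sampling loop, assuming the trained 27x27 matrix W
# and the itos (index -> character) mapping from earlier in the video.
g = torch.Generator().manual_seed(2147483647)

for _ in range(5):
    out = []
    ix = 0  # index 0 is the '.' start token
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W                          # plucks out row ix of W
        counts = logits.exp()                      # log-counts -> counts
        p = counts / counts.sum(1, keepdim=True)   # normalize to a distribution
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:   # sampled the end token, stop
            break
    print(''.join(out))
```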
             
            
So if I run this, it's kind of anticlimactic, or climactic, depending on how you look at it, but we get the exact same result. And that's because this is the identical model. Not only does it achieve the same loss, but, as I mentioned, these are identical models, and this W is the log of the counts that we estimated before. We came to this answer in a very different way, and it has a very different interpretation, but fundamentally it is the same model and gives the same samples here.
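
One hedged way to check that claim numerically, assuming the learned W and the count-based probability matrix P are both still in memory:

```python
import torch

# Convert the learned log-counts W back into probabilities and compare them with
# the counting-based matrix P. The gap won't be exactly zero, because of finite
# optimization and the regularization pull on W, but it should be small.
counts = W.exp()
P_from_W = counts / counts.sum(1, keepdim=True)
print((P_from_W - P).abs().max())
```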
             
            
And so that's kind of cool. Okay, so we've actually covered a lot of ground. We introduced the bigram character-level language model. We saw how we can train the model, how we can sample from the model, and how we can evaluate the quality of the model using the negative log likelihood loss. And then we actually trained the model in two completely different ways that give the same result and the same model. In the first way, we just counted up the frequency of all the bigrams and normalized. In the second way, we used the negative log likelihood loss as a guide to optimizing the counts matrix, or the counts array, so that the loss is minimized in a gradient-based framework. And we saw that both of them give the same result, and that's it.
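
For reference, that evaluation criterion boils down to a few lines; this sketch assumes the probability matrix P and the bigram index tensors xs and ys from earlier:

```python
import torch

# Average negative log likelihood of the training bigrams under the model;
# P is row-normalized, so P[xs, ys] are the probabilities assigned to each bigram.
log_likelihood = P[xs, ys].log().sum()
nll = -log_likelihood
print(nll / xs.nelement())  # lower is better
```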
             
            
Now, the second of these, the gradient-based framework, is much more flexible. Right now our neural network is super simple: we take a single previous character and pass it through a single linear layer to calculate the logits. This is about to complexify. In the follow-up videos, we're going to be taking more and more of these characters and feeding them into a neural net. But this neural net will still output the exact same thing: the neural net will output logits. And these logits will still be normalized in the exact same way, and all the loss and everything else in the gradient-based framework stays identical. It's just that this neural net will now complexify all the way to transformers. So that's going to be pretty awesome, and I'm looking forward to it.