
Andrej Karpathy. Neural Networks: Zero to Hero

Episode 1. The spelled-out intro to neural networks and backpropagation: building micrograd



link |
00:00:00.000
Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade
link |
00:00:04.800
and in this lecture I'd like to show you what neural network training looks like under the
link |
00:00:08.880
hood. So in particular we are going to start with a blank Jupyter notebook and by the end of this
link |
00:00:13.920
lecture we will define and train a neural net and you'll get to see everything that goes on under
link |
00:00:18.720
the hood and exactly sort of how that works on an intuitive level. Now specifically what I would
link |
00:00:23.600
like to do is I would like to take you through building of micrograd. Now micrograd is this
link |
00:00:29.440
library that I released on github about two years ago but at the time I only uploaded the source
link |
00:00:34.160
code and you'd have to go in by yourself and really figure out how it works. So in this lecture
link |
00:00:40.160
I will take you through it step by step and kind of comment on all the pieces of it. So what's
link |
00:00:44.560
micrograd and why is it interesting? Micrograd is basically an autograd engine. Autograd is short
link |
00:00:52.800
for automatic gradient and really what it does is it implements backpropagation. Now backpropagation
link |
00:00:58.320
is this algorithm that allows you to efficiently evaluate the gradient of some kind of a loss
link |
00:01:04.320
function with respect to the weights of a neural network and what that allows us to do then is we
link |
00:01:09.840
can iteratively tune the weights of that neural network to minimize the loss function and therefore
link |
00:01:14.320
improve the accuracy of the network. So backpropagation would be at the mathematical
link |
00:01:18.960
core of any modern deep neural network library like say PyTorch or JAX. So the functionality
link |
00:01:24.800
of micrograd is I think best illustrated by an example. So if we just scroll down here
link |
00:01:29.760
you'll see that micrograd basically allows you to build out mathematical expressions
link |
00:01:34.240
and here what we are doing is we have an expression that we're building out where you
link |
00:01:38.080
have two inputs a and b and you'll see that a and b are negative 4 and 2 but we are wrapping those
link |
00:01:46.160
values into this value object that we are going to build out as part of micrograd.
link |
00:01:50.480
So this value object will wrap the numbers themselves and then we are going to build
link |
00:01:55.520
out a mathematical expression here where a and b are transformed into c d and eventually e f and g
link |
00:02:03.360
and I'm showing some of the functionality of micrograd and the operations that it supports.
link |
00:02:08.160
So you can add two value objects, you can multiply them, you can raise them to a constant power,
link |
00:02:13.920
you can offset by one, negate, squash at zero, square, divide by constant, divide by it, etc.
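
For reference, the kind of expression being built here looks roughly like this (reproduced from memory of the micrograd README, so the exact expression may differ slightly; the printed numbers match the ones quoted below):

    from micrograd.engine import Value

    a = Value(-4.0)
    b = Value(2.0)
    c = a + b
    d = a * b + b**3
    c += c + 1
    c += 1 + c + (-a)
    d += d * 2 + (b + a).relu()
    d += 3 * d + (b - a).relu()
    e = c - d
    f = e**2
    g = f / 2.0
    g += 10.0 / f
    print(f'{g.data:.4f}')  # forward pass result, about 24.7
    g.backward()            # backpropagation starting at g
    print(f'{a.grad:.4f}')  # dg/da, about 138.8
    print(f'{b.grad:.4f}')  # dg/db, about 645.6
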
link |
00:02:22.240
And so we're building out an expression graph with these two inputs a and b and we're creating
link |
00:02:27.600
an output value of g and micrograd will in the background build out this entire mathematical
link |
00:02:33.600
expression. So it will for example know that c is also a value, c was a result of an addition
link |
00:02:39.680
operation and the child nodes of c are a and b because it will maintain pointers to a and b
link |
00:02:48.160
value objects. So we'll basically know exactly how all of this is laid out and then not only can we
link |
00:02:54.080
do what we call the forward pass where we actually look at the value of g of course, that's pretty
link |
00:02:58.160
straightforward, we will access that using the dot data attribute and so the output of the forward
link |
00:03:04.080
pass, the value of g, is 24.7 it turns out, but the big deal is that we can also take this g value
link |
00:03:11.280
object and we can call dot backward and this will basically initialize backpropagation at the node g.
link |
00:03:18.960
And what backpropagation is going to do is it's going to start at g and it's going to go
link |
00:03:23.040
backwards through that expression graph and it's going to recursively apply the chain rule from
link |
00:03:27.840
calculus and what that allows us to do then is we're going to evaluate basically the derivative
link |
00:03:34.000
of g with respect to all the internal nodes like e d and c but also with respect to the inputs a
link |
00:03:41.120
and b and then we can actually query this derivative of g with respect to a for example
link |
00:03:47.360
that's a dot grad in this case it happens to be 138 and the derivative of g with respect to b
link |
00:03:53.040
which also happens to be here 645 and this derivative we'll see soon is very important
link |
00:03:58.960
information because it's telling us how a and b are affecting g through this mathematical expression
link |
00:04:05.840
so in particular a dot grad is 138 so if we slightly nudge a and make it slightly larger
link |
00:04:13.840
138 is telling us that g will grow and the slope of that growth is going to be 138
link |
00:04:19.040
and the slope of growth of b is going to be 645 so that's going to tell us about how g will respond
link |
00:04:25.040
if a and b get tweaked a tiny amount in a positive direction okay now you might be confused about
link |
00:04:32.960
what this expression is that we built out here and this expression by the way is completely
link |
00:04:37.040
meaningless i just made it up i'm just flexing about the kinds of operations that are supported
link |
00:04:41.840
by micrograd and what we actually really care about are neural networks but it turns out that
link |
00:04:46.320
neural networks are just mathematical expressions just like this one but actually a slightly bit
link |
00:04:51.040
less crazy even neural networks are just a mathematical expression they take the input data
link |
00:04:57.040
as an input and they take the weights of a neural network as an input and it's a mathematical
link |
00:05:01.200
expression and the output are your predictions of your neural net or the loss function we'll
link |
00:05:05.840
see this in a bit but basically neural networks just happen to be a certain class of mathematical
link |
00:05:10.720
expressions but the back propagation is actually significantly more general it doesn't actually care
link |
00:05:16.000
about neural networks at all it only deals with arbitrary mathematical expressions and then we
link |
00:05:20.480
happen to use that machinery for training of neural networks now one more note i would like
link |
00:05:25.440
to make at this stage is that as you see here micrograd is a scalar valued autograd engine
link |
00:05:30.240
so it's working on the you know level of individual scalars like negative four and two and we're
link |
00:05:35.120
taking neural nets and we're breaking them down all the way to these atoms of individual
link |
00:05:39.200
scalars and all the little pluses and times and it's just excessive and so obviously you would
link |
00:05:43.840
never be doing any of this in production it's really just done for pedagogical reasons because it
link |
00:05:48.480
allows us to not have to deal with these n-dimensional tensors that you would use in
link |
00:05:52.800
modern deep neural network library so this is really done so that you understand and intuit
link |
00:05:58.400
backpropagation and the chain rule and understand training and then if you actually
link |
00:06:04.400
want to train bigger networks you have to be using these tensors but none of the math changes this is
link |
00:06:08.560
done purely for efficiency we are basically taking all the scalar values and we're
link |
00:06:13.600
packaging them up into tensors which are just arrays of these scalars and then because we have
link |
00:06:18.560
these large arrays we're making operations on those large arrays that allows us to take advantage of
link |
00:06:23.920
the parallelism in a computer and all those operations can be done in parallel and then the
link |
00:06:28.800
whole thing runs faster but really none of the math changes and it's done purely for efficiency
link |
00:06:33.040
so i don't think that it's pedagogically useful to be dealing with tensors from scratch
link |
00:06:37.840
and that's why i fundamentally wrote micrograd because you can understand how things work
link |
00:06:42.160
at the fundamental level and then you can speed it up later okay so here's the fun part my claim is
link |
00:06:47.840
that micrograd is what you need to train your own networks and everything else is just efficiency
link |
00:06:52.400
so you'd think that micrograd would be a very complex piece of code and that turns out to not
link |
00:06:57.440
be the case so if we just go to micrograd and you'll see that there's only two files here in
link |
00:07:03.680
micrograd this is the actual engine it doesn't know anything about neural nets and this is the
link |
00:07:08.320
entire neural nets library on top of micrograd so engine and nn.py so the actual backpropagation
link |
00:07:16.960
autograd engine that gives you the power of neural networks is literally 100 lines of code
link |
00:07:24.800
100 lines of code of like very simple python which we'll understand by the end of this lecture
link |
00:07:30.640
and then nn.py this neural network library built on top of the autograd engine is like a joke it's
link |
00:07:38.080
like we have to define what is a neuron and then we have to define what is a layer of neurons and
link |
00:07:43.360
then we define what is a multi-layer perceptron which is just a sequence of layers of neurons
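
As a rough structural sketch of what is being described (not the actual nn.py, whose details differ): a Neuron holds one weight per input plus a bias, a Layer is just a list of Neurons, and an MLP is just a sequence of Layers.

    import random

    class Neuron:
        def __init__(self, nin):
            # one weight per input plus a bias (in micrograd these become Value objects)
            self.w = [random.uniform(-1, 1) for _ in range(nin)]
            self.b = random.uniform(-1, 1)

    class Layer:
        def __init__(self, nin, nout):
            self.neurons = [Neuron(nin) for _ in range(nout)]

    class MLP:
        def __init__(self, nin, nouts):
            sizes = [nin] + nouts
            self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]
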
link |
00:07:47.840
and so it's just a total joke so basically there's a lot of power that comes from only 150 lines of
link |
00:07:54.640
code and that's all you need to understand to understand neural network training and everything
link |
00:07:59.200
else is just efficiency and of course there's a lot to efficiency but fundamentally that's all
link |
00:08:04.800
that's happening okay so now let's dive right in and implement micrograd step by step the first
link |
00:08:09.600
thing i'd like to do is i'd like to make sure that you have a very good understanding intuitively of
link |
00:08:13.440
what a derivative is and exactly what information it gives you so let's start with some basic imports
link |
00:08:19.440
that i copy-paste in every jupyter notebook always and let's define the function scalar-valued
link |
00:08:25.280
function f of x as follows so i just made this up randomly i just wanted a scalar-valued function
link |
00:08:32.080
that takes a single scalar x and returns a single scalar y and we can call this function of course
link |
00:08:37.680
so we can pass in say 3.0 and get 20 back now we can also plot this function to get a sense of its
link |
00:08:43.920
shape you can tell from the mathematical expression that this is probably a parabola
link |
00:08:48.160
it's a quadratic and so if we just create a set of scalar values that we can feed in using for
link |
00:08:57.440
example np.arange from negative five to five in steps of 0.25 so x is just from
link |
00:09:04.000
negative five to five not including five in steps of 0.25 and we can actually call this function on
link |
00:09:10.720
this numpy array as well so we get a set of y's if we call f on x's and these y's are basically
link |
00:09:18.000
also applying uh function on every one of these elements independently and we can plot this using
link |
00:09:24.640
matplotlib it's a plt.plot x's and y's and we get a nice parabola so previously here we fed in
link |
00:09:31.280
3.0 somewhere here and we received 20 back which is here the y coordinate so now i'd like to think
link |
00:09:37.200
through what is the derivative of this function at any single input point x right so what is the
link |
00:09:43.520
derivative at different points x of this function now if you remember back to your calculus class
link |
00:09:48.800
you've probably derived derivatives so we take this mathematical expression 3x squared minus 4x
link |
00:09:54.080
plus 5 and you would write out on a piece of paper and you would you know apply the product rule and
link |
00:09:58.000
all the other rules and derive the mathematical expression of the derivative of the original
link |
00:10:02.720
function and then you could plug in different x's and see what the derivative is we're not going to
link |
00:10:07.760
actually do that because no one in neural networks actually writes out the expression for the neural
link |
00:10:12.960
net it would be a massive expression it would be you know thousands tens of thousands of terms no
link |
00:10:17.760
one actually derives the derivative of course and so we're not going to take this kind of like
link |
00:10:22.400
symbolic approach instead what i'd like to do is i'd like to look at the
link |
00:10:25.600
derivative and just make sure that we really understand what the derivative is measuring
link |
00:10:29.200
what it's telling you about the function and so if we just look up derivative
link |
00:10:39.200
we see that um okay so this is not a very good definition of derivative this is a definition of
link |
00:10:43.360
what it means to be differentiable but if you remember from your calculus it is the limit as
link |
00:10:47.840
h goes to zero of f of x plus h minus f of x over h so basically what it's saying is if you slightly
link |
00:10:55.760
bump up you're at some point x that you're interested in and if you slightly bump up
link |
00:11:01.440
you know you slightly increase it by small number h how does the function respond with what
link |
00:11:06.480
sensitivity does it respond where's the slope at that point does the function go up or does it go
link |
00:11:11.120
down and by how much and that's the slope of that function the the slope of that response at that
link |
00:11:17.760
point and so we can basically evaluate the derivative here numerically by taking a very small
link |
00:11:23.760
h of course the definition would ask us to take h to zero we're just going to pick a very small h
link |
00:11:28.960
0.001 and let's say we're interested in x of 3.0 so we can look at f of x which of course is 20
link |
00:11:34.880
and now f of x plus h so if we slightly nudge x in a positive direction how is the function going to
link |
00:11:40.560
respond and just looking at this do you expect f of x plus h to be slightly greater
link |
00:11:46.240
than 20 or do you expect it to be slightly lower than 20 and since this 3 is here and this is 20
link |
00:11:52.960
if we slightly go positively the function will respond positively so you'd expect this to be
link |
00:11:58.240
slightly greater than 20 and by how much it is
link |
00:12:02.720
greater than 20 is telling you the sort of the strength of that slope right
link |
00:12:09.360
the size of the slope so f of x plus h minus f of x this is how much the function responded
link |
00:12:16.000
in a positive direction and we have to normalize by the run so we have the rise over run to get
link |
00:12:22.400
the slope so this of course is just a numerical approximation of the slope because we have to make
link |
00:12:28.240
h very very small to converge to the exact amount now if i'm doing too many zeros at some point
link |
00:12:36.800
i'm going to get an incorrect answer because we're using floating point arithmetic and the
link |
00:12:41.120
representations of all these numbers in computer memory is finite and at some point we get into
link |
00:12:45.680
trouble so we can converge towards the right answer with this approach but basically um at
link |
00:12:52.160
3 the slope is 14 and you can see that by taking 3x square minus 4x plus 5 and differentiating it
link |
00:12:59.280
in our head so 3x squared gives 6x and minus 4x gives minus 4 and then we plug in x equals 3 so that's 18 minus 4
link |
00:13:08.000
is 14 so this is correct so that's at 3 now how about the slope at say negative 3 would you expect
link |
00:13:18.480
what would you expect for the slope now telling the exact value is really hard but what is the
link |
00:13:23.200
sign of that slope so at negative 3 if we slightly go in the positive direction at x the function
link |
00:13:30.400
would actually go down and so that tells you that the slope would be negative so we'll get a
link |
00:13:34.560
number slightly below f of x and so if we take the slope we expect something negative, negative 22
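
A compact version of these numerical slope estimates, using the f and the h from above (2/3 is the zero-slope point discussed next):

    import numpy as np
    import matplotlib.pyplot as plt

    def f(x):
        return 3*x**2 - 4*x + 5   # the scalar-valued function from above

    xs = np.arange(-5, 5, 0.25)   # inputs from -5 (inclusive) to 5 (exclusive)
    plt.plot(xs, f(xs))           # the parabola

    h = 0.001
    for x in [3.0, -3.0, 2/3]:
        print(x, (f(x + h) - f(x)) / h)   # roughly 14, -22 and 0; analytically the slope is 6x - 4
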
link |
00:13:41.920
okay and at some point here of course the slope would be 0 now for this specific function i looked
link |
00:13:48.640
it up previously and it's at point 2 over 3 so at roughly 2 over 3 that's somewhere here
link |
00:13:57.040
this derivative would be 0 so basically at that precise point
link |
00:14:03.440
yeah at that precise point if we nudge in a positive direction the function doesn't respond
link |
00:14:08.160
this stays the same almost and so that's why the slope is zero okay now let's look at a bit more
link |
00:14:12.560
complex case so we're going to start you know complexifying a bit so now we have a function
link |
00:14:19.680
here with output variable d that is a function of three scalar inputs a b and c so a b and c
link |
00:14:27.200
are some specific values three inputs into our expression graph and a single output d
link |
00:14:31.920
and so if we just print d we get four and now what i like to do is i'd like to again look at
link |
00:14:37.680
the derivatives of d with respect to a b and c and think through uh again just the intuition of
link |
00:14:44.320
what this derivative is telling us so in order to evaluate this derivative we're going to get a bit
link |
00:14:50.160
hacky here we're going to again have a very small value of h and then we're going to fix the inputs
link |
00:14:56.320
at some values that we're interested in so these are the this is the point abc at which we're going
link |
00:15:02.640
to be evaluating the the derivative of d with respect to all a b and c at that point so there's
link |
00:15:09.040
the inputs and now we have d1 is that expression and then we're going to for example look at the
link |
00:15:14.160
derivative of d with respect to a so we'll take a and we'll bump it by h and then we'll get d2 to
link |
00:15:20.480
be the exact same function and now we're going to print um you know f1 d1 is d1 d2 is d2 and print
link |
00:15:32.640
slope so the derivative or slope here will be of course d2 minus d1 divided by h so d2 minus d1 is
link |
00:15:44.880
how much the function increased uh when we bumped the uh the specific input that we're interested in
link |
00:15:52.880
by a tiny amount and this is then normalized by h to get the slope
link |
00:16:06.560
so if i just run this we're going to print d1 which we know is four now d2 will be bumped a will
link |
00:16:17.920
be bumped by h so let's just think through a little bit uh what d2 will be uh printed out here
link |
00:16:27.600
in particular d1 will be four will d2 be a number slightly greater than four or slightly lower than
link |
00:16:35.280
four and that's going to tell us the the sign of the derivative so we're bumping a by h
link |
00:16:45.440
b is minus three c is 10 so you can just intuitively think through this derivative
link |
00:16:50.720
and what it's doing a will be slightly more positive but b is a negative number
link |
00:16:57.520
so if a is slightly more positive because b is negative three we're actually going to be
link |
00:17:04.400
adding less to d so you'd actually expect that the value of the function will go down so let's
link |
00:17:14.160
just see this yeah and so we went from four to 3.9996 and that tells you that the slope will be
link |
00:17:22.320
negative and it will be a negative number because we went down and then the exact
link |
00:17:29.680
amount of slope is negative three and you can also convince yourself that
link |
00:17:35.120
negative three is the right answer mathematically and analytically because if you have a times b
link |
00:17:39.920
plus c and you are you know you have calculus then uh differentiating a times b plus c with
link |
00:17:46.000
respect to a gives you just b and indeed the value of b is negative three which is the derivative
link |
00:17:51.760
that we have so you can tell that that's correct so now if we do this with b so if we bump b by a
link |
00:17:58.880
little bit in a positive direction we'd get different slopes so what is the influence of b
link |
00:18:04.400
on the output d so if we bump b by a tiny amount in a positive direction then because a is positive
link |
00:18:11.600
we'll be adding more to d right so um and now what is the what is the sensitivity what is the
link |
00:18:17.840
slope of that addition and it might not surprise you that this should be two and why is it two
link |
00:18:25.280
because dd by db differentiating with respect to b would give us a and the value of a
link |
00:18:32.720
is two so that's also working well and then if c gets bumped a tiny amount by h then of course
link |
00:18:40.480
a times b is unaffected and now c becomes slightly bit higher what does that do to the function it
link |
00:18:45.920
makes it slightly bit higher because we're simply adding c and it makes it slightly bit higher by
link |
00:18:50.480
the exact same amount that we added to c and so that tells you that the slope is one that will be
link |
00:18:59.200
the rate at which d will increase as we scale c okay so we now have some intuitive sense of
link |
00:19:06.880
what this derivative is telling you about the function and we'd like to move to neural networks
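
Before moving on, here is a compact version of the numerical checks just described for d = a*b + c (using a small h; the exact value of h in the lecture may differ):

    h = 0.0001
    a, b, c = 2.0, -3.0, 10.0

    d1 = a*b + c                                 # 4.0
    print('dd/da ~', ((a + h)*b + c - d1) / h)   # about -3, i.e. the value of b
    print('dd/db ~', (a*(b + h) + c - d1) / h)   # about  2, i.e. the value of a
    print('dd/dc ~', (a*b + (c + h) - d1) / h)   # about  1
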
link |
00:19:10.960
now as i mentioned neural networks will be pretty massive expressions mathematical expressions
link |
00:19:15.040
so we need some data structures that maintain these expressions and that's what we're going
link |
00:19:18.320
to start to build out now so we're going to uh build out this value object that i showed you
link |
00:19:24.880
in the readme page of micrograd so let me copy paste a skeleton of the first very simple value
link |
00:19:32.160
object so class value takes a single scalar value that it wraps and keeps track of and that's it so
link |
00:19:40.800
we can for example do value of 2.0 and then we can get we can look at its content and python
link |
00:19:48.080
will internally use the __repr__ function to return this string like that so this is a value object
link |
00:19:59.040
with data equals two that we're creating here now we'd like to do is like we'd like to be able to
link |
00:20:04.720
have not just like two values but we'd like to do a plus b right we'd like to add them so currently
link |
00:20:11.840
you would get an error because python doesn't know how to add two value objects so we have to tell it
link |
00:20:18.720
so here's addition so you have to basically use these special double underscore methods in python
link |
00:20:27.200
to define these operators for these objects so if we
link |
00:20:33.280
use this plus operator python will internally call a dot __add__ of b that's
link |
00:20:44.000
what will happen internally and so b will be the other and uh self will be a and so we see that
link |
00:20:51.600
what we're going to return is a new value object and it's just uh it's going to be wrapping the
link |
00:20:57.040
plus of their data but remember now because data is the actual number, a python number
link |
00:21:04.160
so this operator here is just the typical floating point plus addition now it's not an addition of
link |
00:21:10.000
value objects and will return a new value so now a plus b should work and it should print value of
link |
00:21:17.200
negative one because that's two plus minus three there we go okay let's now implement multiply
link |
00:21:23.520
just so we can recreate this expression here so multiply i think it won't surprise you
link |
00:21:28.240
will be fairly similar so instead of add we're going to be using __mul__ and then here of course
link |
00:21:34.560
we want to do times and so now we can create a c value object which will be 10.0 and now we should
link |
00:21:40.560
be able to do a times b well let's just do a times b first um that's value of negative six now
link |
00:21:49.280
and by the way i skipped over this a little bit suppose that i didn't have the __repr__ function
link |
00:21:52.960
here then it's just that you'll get some kind of an ugly expression so what __repr__ is doing is
link |
00:21:59.120
it's providing us a way to print out like a nicer looking expression in python uh so we don't just
link |
00:22:04.800
have something cryptic we actually are you know it's value of negative six so this gives us a
link |
00:22:11.280
times and then this we should now be able to add c to it because we've defined and told the python
link |
00:22:17.120
how to do __mul__ and __add__ and so this will
link |
00:22:22.400
basically be equivalent to a dot __mul__ of b and then this new value object will be dot __add__ of c
link |
00:22:32.560
and so let's see if that worked yep so that worked well that gave us four which is what
link |
00:22:37.200
we expect from before and i believe we can just call them manually as well there we go so
link |
00:22:42.800
yeah okay so now what we are missing is the connected tissue of this expression as i mentioned
link |
00:22:47.680
we want to keep these expression graphs so we need to know and keep pointers about what values
link |
00:22:53.040
produce what other values so here for example we are going to introduce a new variable which we'll
link |
00:22:58.480
call children and by default it will be an empty tuple and then we're actually going to keep a
link |
00:23:03.040
slightly different variable in the class which we'll call underscore prev which will be the set
link |
00:23:07.920
of children this is how i did it in the original micrograd looking at my code here
link |
00:23:13.920
i can't remember exactly the reason i believe it was efficiency but this underscore children will
link |
00:23:18.320
be a tuple for convenience but then when we actually maintain it in the class it will be
link |
00:23:21.920
just this set yes i i believe for efficiency um so now when we are creating a value like this
link |
00:23:30.160
with a constructor children will be empty and prev will be the empty set but when we are creating
link |
00:23:35.040
a value through addition or multiplication we're going to feed in the children of this
link |
00:23:40.160
value which in this case is self and other so those are the children here
link |
00:23:49.040
so now we can do d.prev and we'll see that the children of d we now know are this value of
link |
00:23:56.480
negative six and value of ten and this of course is the value resulting from a times b and the c
link |
00:24:02.640
value which is ten now the last piece of information we don't know so we know now the
link |
00:24:08.320
children of every single value but we don't know what operation created this value so we need one
link |
00:24:13.520
more element here let's call it underscore op and by default this is the empty set for leaves
link |
00:24:20.480
and then we'll just maintain it here and now the operation will be just a simple string
link |
00:24:26.400
and in the case of addition it's plus in the case of multiplication it's times so now we don't
link |
00:24:33.520
just have d.prev we also have a d.op and we know that d was produced by an addition of those two
link |
00:24:39.520
values and so now we have the full mathematical expression and we're building out this data
link |
00:24:44.880
structure and we know exactly how each value came to be by what expression and from what other values
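
Putting the pieces so far together, the Value class sketched up to this point looks roughly like this (labels get added a little later):

    class Value:
        def __init__(self, data, _children=(), _op=''):
            self.data = data
            self._prev = set(_children)   # the child Values that produced this one
            self._op = _op                # the operation that produced it ('' for leaves)

        def __repr__(self):
            return f"Value(data={self.data})"

        def __add__(self, other):
            return Value(self.data + other.data, (self, other), '+')

        def __mul__(self, other):
            return Value(self.data * other.data, (self, other), '*')

    a = Value(2.0)
    b = Value(-3.0)
    c = Value(10.0)
    d = a * b + c
    print(d, d._prev, d._op)   # Value(data=4.0), its two children, and '+'
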
link |
00:24:50.000
now because these expressions are about to get quite a bit larger we'd like a way to nicely
link |
00:24:54.160
visualize these expressions that we're building out so for that i'm going to copy paste a bunch of
link |
00:24:59.360
slightly scary code that's going to visualize this these expression graphs for us so here's
link |
00:25:04.880
the code and i'll explain it in a bit but first let me just show you what this code does basically
link |
00:25:10.160
what it does is it creates a new function draw dot that we can call on some root node
link |
00:25:15.040
and then it's going to visualize it so if we call draw dot on d which is this final value here that
link |
00:25:21.200
is a times b plus c it creates something like this so this is d and you see that this is a times b
link |
00:25:29.280
creating an intermediate value plus c gives us this output node d so that's draw dot of d and
link |
00:25:37.440
i'm not going to go through this in complete detail
link |
00:25:41.120
you can take a look at
link |
00:25:46.240
graphviz and its api graphviz is an open source graph visualization software and what we're doing
link |
00:25:52.080
here is we're building out this graph in graphvis api and you can basically see that trace is this
link |
00:25:58.960
helper function that enumerates all the nodes and edges in the graph so that just builds a
link |
00:26:03.760
set of all the nodes and edges and then we iterate for all the nodes and we create special node
link |
00:26:08.080
objects for them using dot.node and then we also create edges using dot.edge
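
For reference, a rough sketch of what such a trace/draw_dot pair can look like, assuming the graphviz Python package (the code pasted in the lecture differs in its details, for example in how it lays out the op nodes described next):

    from graphviz import Digraph

    def trace(root):
        # enumerate all nodes and edges in the expression graph
        nodes, edges = set(), set()
        def build(v):
            if v not in nodes:
                nodes.add(v)
                for child in v._prev:
                    edges.add((child, v))
                    build(child)
        build(root)
        return nodes, edges

    def draw_dot(root):
        dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # left to right layout
        nodes, edges = trace(root)
        for n in nodes:
            uid = str(id(n))
            # a rectangular node for every Value, showing its data
            dot.node(name=uid, label="{ data %.4f }" % (n.data,), shape='record')
            if n._op:
                # a fake op node ('+' or '*') so the picture shows what produced this value
                dot.node(name=uid + n._op, label=n._op)
                dot.edge(uid + n._op, uid)
        for n1, n2 in edges:
            dot.edge(str(id(n1)), str(id(n2)) + n2._op)
        return dot
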
link |
00:26:16.880
and the only thing that's like slightly tricky here is you'll notice that i
link |
00:26:20.480
basically add these fake nodes which are these operation nodes so for example this node here
link |
00:26:25.760
is just like a plus node and i create these special op nodes here and i connect them
link |
00:26:35.760
accordingly so these nodes of course are not actual nodes in the original graph they're not
link |
00:26:42.160
actually a value object the only value objects here are the things in squares those are actual
link |
00:26:47.840
value objects or representations thereof and these op nodes are just created in this draw dot
link |
00:26:53.200
routine so that it looks nice let's also add labels to these graphs just so we know what
link |
00:26:58.240
variables are where so let's create a special underscore label um or let's just do label equals
link |
00:27:06.160
empty by default and save it in each node and then here we're going to do a dot label is a
link |
00:27:13.520
b dot label is b c dot label is c and then let's create a special e equals a times b
link |
00:27:30.800
and e dot label will be e it's kind of naughty and d will be e plus c and d dot label will be
link |
00:27:40.000
d okay so nothing really changes i just added this new e variable and then here when
link |
00:27:48.720
we are printing this i'm going to print the label here so this will be a percent s bar and this will
link |
00:27:56.560
be n dot label and so now we have the label on the left here so it says a b creating e and then
link |
00:28:06.160
e plus c creates d just like we have it here and finally let's make this expression just one layer
link |
00:28:12.240
deeper so d will not be the final output node instead after d we are going to create a new value
link |
00:28:19.600
object called f we're going to start running out of variables soon f will be negative 2.0 and its
link |
00:28:26.880
label will of course just be f and then l capital l will be the output of our graph and l will be
link |
00:28:35.360
and l will be d times f okay so l will be negative 8 is the output so now we don't just draw a d we
link |
00:28:46.880
draw l okay and somehow the label of l is undefined oops although label has to be explicitly sort of
link |
00:28:57.920
given to it there we go so l is the output so let's quickly recap what we've done so far
link |
00:29:04.080
we are able to build out mathematical expressions using only plus and times so far
link |
00:29:09.200
they are scalar valued along the way and we can do this forward pass and build out a mathematical
link |
00:29:15.440
expression so we have multiple inputs here a b c and f going into a mathematical expression that
link |
00:29:21.680
produces a single output l and this here is visualizing the forward pass so the output of
link |
00:29:27.920
the forward pass is negative 8 that's the value now what we'd like to do next is we'd like to
link |
00:29:33.760
run back propagation and in back propagation we are going to start here at the end and we're going
link |
00:29:39.120
to reverse and calculate the gradient along along all these intermediate values and really what
link |
00:29:46.000
we're computing for every single value here we're going to compute the derivative of that
link |
00:29:51.920
node with respect to l so the derivative of l with respect to l is just one and then we're going to
link |
00:30:01.200
derive what is the derivative of l with respect to f with respect to d with respect to c with
link |
00:30:06.400
respect to e with respect to b and with respect to a and in a neural network setting you'd be
link |
00:30:12.320
very interested in the derivative of basically this loss function l with respect to the weights
link |
00:30:18.080
of a neural network and here of course we have just these variables a b c and f but some of
link |
00:30:23.040
these will eventually represent the weights of a neural net and so we'll need to know how those
link |
00:30:27.360
weights are impacting the loss function so we'll be interested basically in the derivative of the
link |
00:30:32.320
output with respect to some of its leaf nodes and those leaf nodes will be the weights of the neural
link |
00:30:37.360
net and the other leaf nodes of course will be the data itself but usually we will not want or
link |
00:30:42.640
use the derivative of the loss function with respect to data because the data is fixed but
link |
00:30:47.600
the weights will be iterated on using the gradient information so next we are going to create a
link |
00:30:53.440
variable inside the value class that maintains the derivative of l with respect to that value
link |
00:31:01.040
and we will call this variable grad so there is a self dot data and there's a self dot grad and initially
link |
00:31:08.240
it will be zero and remember that zero basically means no effect so at initialization
link |
00:31:14.160
we're assuming that every value does not affect the output right because
link |
00:31:20.240
if the gradient is zero that means that changing this variable is not changing the loss function
link |
00:31:25.680
so by default we assume that the gradient is zero and then now that we have grad and it's 0.0
link |
00:31:36.640
we are going to be able to visualize it here after the data so here grad is %.4f
link |
00:31:41.360
and this will be in that graph and now we are going to be showing both the data and the grad
link |
00:31:50.080
initialized at zero and we are just about getting ready to calculate the back propagation
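
As a small sketch of the two changes just described (the exact label format used in the lecture may differ):

    # inside Value.__init__: every node starts with zero gradient, meaning
    # "no effect on the output" until backpropagation fills it in
    self.grad = 0.0

    # inside draw_dot: the node label now shows both data and grad
    # (uid here is the node's unique id string from the sketch above)
    dot.node(name=uid, label="{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
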
link |
00:31:56.480
and of course this grad again as i mentioned is representing the derivative of the output
link |
00:32:01.040
in this case l with respect to this value so with respect to so this is the derivative of
link |
00:32:06.400
l with respect to f with respect to d and so on so let's now fill in those gradients and actually
link |
00:32:11.520
do back propagation manually so let's start filling in these gradients and start all the way at the
link |
00:32:15.760
end as i mentioned here first we are interested to fill in this gradient here so what is the derivative
link |
00:32:21.520
of l with respect to l in other words if i change l by a tiny amount h how much does l change it
link |
00:32:31.440
changes by h so it's proportional and therefore the derivative will be one we can of course
link |
00:32:38.160
measure these or estimate these numerical gradients numerically just like we've seen before so if i
link |
00:32:43.840
take this expression and i create a def lol function here and put this here now the reason
link |
00:32:51.040
i'm creating a function lol here is because i don't want to pollute or mess up the global scope
link |
00:32:56.160
here this is just kind of like a little staging area and as you know in python all of these will
link |
00:33:00.400
be local variables to this function so i'm not changing any of the global scope here so here l1
link |
00:33:06.720
will be l and then copy pasting this expression we're going to add a small amount h in for example
link |
00:33:18.640
a right and this would be measuring the derivative of l with respect to a so here this will be l2
link |
00:33:27.040
and then we want to print that derivative so print l2 minus l1 which is how much l changed
link |
00:33:33.840
and then normalize it by h so this is the rise over run and we have to be careful because l
link |
00:33:39.840
is a value node so we actually want its data so that these are floats dividing by h and this should
link |
00:33:48.400
print the derivative of l with respect to a because a is the one that we bumped a little bit by h so
link |
00:33:54.640
what is the derivative of l with respect to a it's six okay and obviously if we change l
link |
00:34:05.040
by h then that would be here effectively this looks really awkward but changing l by h
link |
00:34:16.080
you see the derivative here is one
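
A sketch of the numerical check being described, with the function named lol as in the lecture so that the temporaries stay local and don't pollute the global scope:

    def lol():
        h = 0.0001

        a = Value(2.0); b = Value(-3.0); c = Value(10.0); f = Value(-2.0)
        L1 = ((a * b + c) * f).data

        a = Value(2.0 + h); b = Value(-3.0); c = Value(10.0); f = Value(-2.0)   # bump a by h
        L2 = ((a * b + c) * f).data

        print((L2 - L1) / h)   # numerical estimate of dL/da, which comes out to about 6

    lol()
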
link |
00:34:17.680
that's kind of like the base case of what we are doing here so basically we can come up here and
link |
00:34:23.680
we can manually set l.grad to one this is our manual backpropagation l.grad is one and let's
link |
00:34:30.480
redraw and we'll see that we filled in grad is one for l we're now going to continue the
link |
00:34:36.880
backpropagation so let's here look at the derivatives of l with respect to d and f let's do
link |
00:34:43.040
a d first so what we are interested in if i create a markdown on here is we'd like to know
link |
00:34:48.960
basically we have that l is d times f and we'd like to know what is uh d l by d d what is that
link |
00:34:58.640
and if you know your calculus uh l is d times f so what is d l by d d it would be f and if you
link |
00:35:05.920
don't believe me we can also just derive it because the proof would be fairly straightforward
link |
00:35:09.920
we go to the definition of the derivative which is f of x plus h minus f of x divide by h
link |
00:35:18.320
as a limit limit of h goes to zero of this kind of expression so when we have l is d times f then
link |
00:35:25.040
increasing d by h would give us the output of d plus h times f that's basically f of x plus h
link |
00:35:33.520
right minus d times f and then divide h and symbolically expanding out here we would have
link |
00:35:42.560
basically d times f plus h times f minus d times f divide h and then you see how the df minus df
link |
00:35:50.480
cancels so you're left with h times f divide h which is f so in the limit as h goes to zero of
link |
00:35:58.400
you know the derivative definition we just get f in the case of d times f
link |
00:36:07.440
so symmetrically d l by d f will just be d so what we have is that f dot grad we see now
link |
00:36:17.440
is just the value of d which is four
link |
00:36:20.320
and we see that d dot grad is just uh the value of f
link |
00:36:28.560
and so the value of f is negative two so we'll set those manually
link |
00:36:36.720
let me erase this markdown node and then let's redraw what we have
link |
00:36:40.080
okay and let's just make sure that these were correct so we seem to think that d l by d d is
link |
00:36:45.920
negative two so let's double check um let me erase this plus h from before and now we want
link |
00:36:51.840
the derivative with respect to f so let's just come here when i create f and let's do a plus h
link |
00:36:57.040
here and this should print the derivative
link |
00:37:03.120
of l with respect to f so we expect to see four
link |
00:37:10.640
yeah and this is four up to floating point funkiness and then d l by d d should be f
link |
00:37:19.200
which is negative two grad is negative two so if we again come here and we change d
link |
00:37:25.920
d dot data plus equals h right here so we expect so we've added a little h and then we see how l
link |
00:37:32.800
changed and we expect to print uh negative two there we go so we've numerically verified what
link |
00:37:43.440
we're doing here is kind of like an inline gradient check gradient check is when we are
link |
00:37:48.640
deriving this like back propagation
link |
00:37:53.040
and getting the derivative with
link |
00:37:57.440
respect to all the intermediate results and then numerical gradient is just you know
link |
00:38:03.200
estimating it using small step size now we're getting to the crux of back propagation so this
link |
00:38:09.440
will be the most important node to understand because if you understand the gradient for this
link |
00:38:14.400
node you understand all of back propagation and all of training of neural nets basically so we
link |
00:38:20.000
need to derive dl by dc in other words the derivative of l with respect to c because
link |
00:38:26.880
we've computed all these other gradients already now we're coming here and we're continuing the
link |
00:38:31.200
back propagation manually so we want dl by dc and then we'll also derive dl by de now here's
link |
00:38:38.800
a problem how do we derive dl by dc we actually know the derivative of l with respect to d so we know
link |
00:38:47.760
how l is sensitive to d but how is l sensitive to c so if we wiggle c how does that impact l through
link |
00:38:56.000
d so we know dl by dd and we also here know how c impacts d and so just very intuitively if you
link |
00:39:06.160
know the impact that c is having on d and the impact that d is having on l then you should be
link |
00:39:11.920
able to somehow put that information together to figure out how c impacts l and indeed this is what
link |
00:39:17.440
we can actually do so in particular we know just concentrating on d first let's look at how what is
link |
00:39:23.920
the derivative basically of d with respect to c so in other words what is dd by dc
link |
00:39:31.600
so here we know that d is c plus e that's what we know and now we're interested in dd by dc
link |
00:39:39.600
if you just know your calculus again and you remember then differentiating c plus e with
link |
00:39:44.000
respect to c you know that that gives you 1.0 and we can also go back to the basics and derive this
link |
00:39:50.880
because again we can go to our f of x plus h minus f of x divided by h that's the definition
link |
00:39:57.280
of a derivative as h goes to zero and so here focusing on c and its effect on d we can basically
link |
00:40:05.120
do the f of x plus h will be c incremented by h plus e that's the first evaluation of our function
link |
00:40:12.640
minus c plus e and then divide h and so what is this just expanding this out this will be c plus h
link |
00:40:22.640
plus e minus c minus e divide h and then you see here how c minus c cancels e minus e cancels we're
link |
00:40:30.320
left with h over h which is 1.0 and so by symmetry also d d by d e will be 1.0 as well so basically
link |
00:40:43.440
the derivative of a sum expression is very simple and this is the local derivative so i call this
link |
00:40:48.960
the local derivative because we have the final output value all the way at the end of this graph
link |
00:40:53.520
and we're now like a small node here and this is a little plus node and the little plus node
link |
00:40:59.680
doesn't know anything about the rest of the graph that it's embedded in all it knows is that it did
link |
00:41:04.720
plus it took a c and an e added them and created a d and this plus node also knows the local
link |
00:41:11.280
influence of c on d or rather the derivative of d with respect to c and it also knows the
link |
00:41:17.440
derivative of d with respect to e but that's not what we want that's just a local derivative what
link |
00:41:22.960
we actually want is dl by dc and l is here just one step away but in a general case
link |
00:41:30.720
this little plus node could be embedded in like a massive graph so again we know how l impacts d
link |
00:41:37.920
and now we know how c and e impact d how do we put that information together to write dl by dc
link |
00:41:44.240
and the answer of course is the chain rule in calculus and so i pulled up a chain rule here
link |
00:41:51.360
from Wikipedia and i'm going to go through this very briefly so chain rule Wikipedia sometimes
link |
00:41:58.080
can be very confusing and calculus can be very confusing like this is the way i learned chain
link |
00:42:05.280
rule and it was very confusing like what is happening it's just complicated so i like this
link |
00:42:10.320
expression much better if a variable z depends on a variable y which itself depends on a variable x
link |
00:42:16.960
then z depends on x as well obviously through the intermediate variable y in this case the chain
link |
00:42:22.400
rule is expressed as if you want dz by dx then you take the dz by dy and you multiply it by dy
link |
00:42:31.200
by dx so the chain rule fundamentally is telling you how we chain these derivatives together
link |
00:42:39.680
correctly so to differentiate through a function composition we have to apply
link |
00:42:47.200
a multiplication of those derivatives so that's really what chain rule is telling us and there's
link |
00:42:55.280
a nice little intuitive explanation here which i also think is kind of cute the chain rule states
link |
00:42:59.840
that knowing the instantaneous rate of change of z with respect to y and y relative to x allows
link |
00:43:04.240
one to calculate the instantaneous rate of change of z relative to x as a product of those two rates
link |
00:43:09.440
of change simply the product of those two so here's a good one if a car travels twice as
link |
00:43:15.600
fast as a bicycle and the bicycle is four times as fast as a walking man then the car travels
link |
00:43:21.200
two times four eight times as fast as a man and so this makes it very clear that the correct thing
link |
00:43:27.840
to do sort of is to multiply so car is twice as fast as bicycle and bicycle is four times as fast
link |
00:43:35.600
as man so the car will be eight times as fast as the man and so we can take these intermediate
link |
00:43:42.720
rates of change if you will and multiply them together and that justifies the chain rule
link |
00:43:49.120
intuitively so have a look at chain rule about here really what it means for us is there's a
link |
00:43:54.240
very simple recipe for deriving what we want which is dl by dc and what we have so far is we know
link |
00:44:02.240
what is the impact of d on l so we know dl by dd the derivative of l with respect
link |
00:44:13.760
to d we know that that's negative two and now because of this local reasoning that we've done
link |
00:44:19.360
here we know dd by dc so how does c impact d and in particular this is a plus node so the local
link |
00:44:28.880
derivative is simply 1.0 it's very simple and so the chain rule tells us that dl by dc going
link |
00:44:36.480
through this intermediate variable will just be simply dl by dd times
link |
00:44:47.920
dd by dc that's chain rule so this is identical to what's happening here except
link |
00:44:56.240
z is our l y is our d and x is our c so we literally just have to multiply these and because
link |
00:45:08.400
these local derivatives like dd by dc are just one we basically just copy over dl by dd because
link |
00:45:15.600
this is just times one so what is it so because dl by dd is negative two what is dl by dc well
link |
00:45:24.080
it's the local gradient 1.0 times dl by dd which is negative two so literally what a plus node does
link |
00:45:31.280
you can look at it that way is it literally just routes the gradient because the plus nodes local
link |
00:45:36.720
derivatives are just one and so in the chain rule one times dl by dd is just dl by dd and so that
link |
00:45:48.080
derivative just gets routed to both c and to e in this case so basically we have that e.grad
link |
00:45:57.040
or let's start with c since that's the one we looked at is negative two times one negative two
link |
00:46:06.160
and in the same way by symmetry e.grad will be negative two that's the claim so we can set those
link |
00:46:12.960
we can redraw and you see how we just assign negative two negative two so this back propagating
link |
00:46:19.120
signal which is carrying the information of like what is the derivative of l with respect to all
link |
00:46:23.200
the intermediate nodes we can imagine it almost like flowing backwards through the graph and a
link |
00:46:28.000
plus node will simply distribute the derivative to all the leaf nodes sorry to all the children
link |
00:46:33.040
nodes of it so this is the claim and now let's verify it so let me remove the plus node and
link |
00:46:39.440
now instead what we're going to do is we want to increment c so c.data will be incremented by h
link |
00:46:44.800
and when i run this we expect to see negative two negative two and then of course for e
link |
00:46:53.200
so e.data plus equals h and we expect to see negative two simple
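
To recap the chain rule at this plus node with the numbers from this example:

    dL_dd = -2.0             # derivative of L with respect to d, computed above (it equals f.data)
    dd_dc = 1.0              # local derivative of d = c + e with respect to c
    dd_de = 1.0              # local derivative of d = c + e with respect to e
    c.grad = dL_dd * dd_dc   # -2.0: the plus node just routes the gradient through
    e.grad = dL_dd * dd_de   # -2.0
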
link |
00:46:57.680
so those are the derivatives of these internal nodes and now we're going to recurse our way
link |
00:47:07.840
backwards again and we're again going to apply the chain rule so here we go our second application
link |
00:47:13.840
of chain rule and we will apply it all the way through the graph we just happen to only have
link |
00:47:17.840
one more node remaining and we're going to apply the chain rule
link |
00:47:23.680
we have that dl by de as we have just calculated is negative two
link |
00:47:30.720
so we know that so we know the derivative of l with respect to e
link |
00:47:36.720
and now we want dl by da right and the chain rule is telling us that that's just dl by de
link |
00:47:46.240
negative two times the local gradient so what is the local gradient basically de by da we have to
link |
00:47:54.800
look at that so i'm a little times node inside a massive graph and i only know that i did a times b
link |
00:48:03.760
and i produced an e so now what is de by da and de by db that's the only thing that i sort of
link |
00:48:11.120
know about that's my local gradient so because we have that e is a times b we're asking what is de
link |
00:48:18.800
by da and of course we just did that here we had a times so i'm not going to re-derive it
link |
00:48:26.560
but if you want to differentiate this with respect to a you'll just get b right the value of b which
link |
00:48:33.600
in this case is negative 3.0 so basically we have that dl by da well let me just do it right here
link |
00:48:43.680
we have that a dot grad and we are applying chain rule here is dl by de which we see here is
link |
00:48:50.800
negative two times what is de by da it's the value of b which is negative three
link |
00:48:59.440
that's it and then we have b dot grad is again dl by de which is negative two just the same way
link |
00:49:08.800
times what is de by db it's the value of a which is 2.0 that's the value of a
link |
00:49:19.120
so these are our claimed derivatives let's redraw and we see here that a dot grad turns out to be
link |
00:49:28.720
six because that is negative two times negative three and b dot grad
link |
00:49:35.760
is negative two times two which is negative four so those are our claims let's delete this and
link |
00:49:41.360
let's verify them we have a here a dot data plus equals h so the claim is that a dot grad is six
link |
00:49:55.120
let's verify six and then we have
link |
00:50:05.600
b dot data plus equals h so nudging b by h and looking at what happens we claim it's negative four
link |
00:50:15.120
and indeed it's negative four plus minus again float oddness and that's it this that was the
link |
00:50:25.360
manual back propagation all the way from here to all the leaf nodes and we've done it piece by
link |
00:50:32.560
piece and really all we've done is as you saw we iterated through all the nodes one by one
link |
00:50:37.840
and locally applied the chain rule we always know what is the derivative of l with respect to this
link |
00:50:43.040
little output and then we look at how this output was produced this output was produced through some
link |
00:50:48.080
operation and we have the pointers to the children nodes of this operation and so in this little
link |
00:50:53.760
operation we know what the local derivatives are and we just multiply them onto the derivative
link |
00:50:58.560
always so we just go through and recursively multiply on the local derivatives and that's
link |
00:51:04.320
what back propagation is it's just a recursive application of chain rule backwards through the
link |
00:51:08.880
computation graph let's see this power in action just very briefly what we're going to do is we're
link |
00:51:14.560
going to nudge our inputs to try to make l go up so in particular what we're doing is we want a
link |
00:51:22.000
data we're going to change it and if we want l to go up that means we just have to go in the
link |
00:51:27.280
direction of the gradient so a should increase in the direction of gradient by like some small
link |
00:51:33.920
step amount this is the step size and we don't just want this for a but also for b
link |
00:51:41.440
also for c also for f those are leaf nodes which we usually have control over and if we nudge in
link |
00:51:52.000
direction of the gradient we expect a positive influence on l so we expect l to go up positively
link |
00:51:59.920
so it should become less negative it should go up to say negative you know six or something like
link |
00:52:04.560
that it's hard to tell exactly and we have to rerun the forward pass so let me just do that here
link |
00:52:16.560
this would be the forward pass f would be unchanged this is effectively the forward pass
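
A sketch of this single nudge-and-rerun step (the step size here is an assumed small value; the one used in the lecture may differ):

    step = 0.01
    for leaf in [a, b, c, f]:
        leaf.data += step * leaf.grad   # nudge each leaf in the direction of its gradient

    # rerun the forward pass with the nudged leaves
    e = a * b
    d = e + c
    L = d * f
    print(L.data)   # slightly less negative than -8.0
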
link |
00:52:21.840
and now if we print l dot data we expect because we nudged all the values all the inputs in the
link |
00:52:28.480
direction of the gradient we expect a less negative l we expect it to go up so maybe it's negative
link |
00:52:34.080
six or so let's see what happens okay negative seven and this is basically one step of an
link |
00:52:41.840
optimization that we will end up running and really this gradient just gives us some power because we
link |
00:52:47.200
know how to influence the final outcome and this will be extremely useful for training
link |
00:52:50.880
neural nets as we'll see soon so now i would like to do one more example of manual back propagation
link |
00:52:57.760
using a bit more complex and useful example we are going to back propagate through a neuron
link |
00:53:05.440
so we want to eventually build out neural networks and in the simplest case these are
link |
00:53:10.960
multi-layer perceptrons as they're called so this is a two layer neural net and it's got these hidden
link |
00:53:16.480
layers made up of neurons and these neurons are fully connected to each other now biologically
link |
00:53:20.800
neurons are very complicated devices but we have very simple mathematical models of them
link |
00:53:26.080
and so this is a very simple mathematical model of a neuron you have some inputs x's and then you
link |
00:53:32.160
have these synapses that have weights on them so the w's are weights and then the synapse interacts
link |
00:53:41.920
with the input to this neuron multiplicatively so what flows to the cell body of this neuron
link |
00:53:48.080
is w times x but there's multiple inputs so there's many w times x's flowing to the cell body
link |
00:53:54.960
the cell body then has also like some bias so this is kind of like the
link |
00:53:59.680
inner innate sort of trigger happiness of this neuron so this bias can make it a bit more trigger
link |
00:54:05.280
happy or a bit less trigger happy regardless of the input but basically we're taking all the w
link |
00:54:10.160
times x of all the inputs adding the bias and then we take it through an activation function
link |
00:54:16.000
and this activation function is usually some kind of a squashing function like a sigmoid or tanh or
link |
00:54:21.040
something like that so as an example we're going to use the tanh in this example numpy has a np dot
link |
00:54:29.280
tanh so we can call it on a range then we can plot it this is the tanh function and you see that the
link |
00:54:38.160
inputs as they come in get squashed on the y coordinate here so right at zero we're going to
link |
00:54:45.760
get exactly zero and then as you go more positive in the input then you'll see that the function
link |
00:54:51.200
will only go up to one and then plateau out and so if you pass in very positive inputs we're going
link |
00:54:57.440
to cap it smoothly at one and on the negative side we're going to cap it smoothly to negative one
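For reference, a minimal sketch of the kind of plot being described here (the exact range is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

# tanh squashes its input smoothly into the range (-1, 1)
xs = np.arange(-5, 5, 0.2)
plt.plot(xs, np.tanh(xs))
plt.grid()
plt.show()
```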
link |
00:55:02.560
so that's tanh and that's the squashing function or an activation function and what comes out of
link |
00:55:08.800
this neuron is just the activation function applied to the dot product of the weights and the inputs
link |
00:55:16.000
so let's write one out i'm going to copy paste because
link |
00:55:22.640
i don't want to type too much but okay so here we have the inputs
link |
00:55:31.520
x1 x2 so this is a two-dimensional neuron so two inputs are going to come in these are thought of
link |
00:55:36.800
as the weights of this neuron weights w1 w2 and these weights again are the synaptic strengths
link |
00:55:43.520
for each input and this is the bias of the neuron b and now we want to do is according to this model
link |
00:55:52.160
we need to multiply x1 times w1 and x2 times w2 and then we need to add bias on top of it
link |
00:56:01.440
and it gets a little messy here but all we are trying to do is x1 w1 plus x2 w2 plus b
link |
00:56:07.920
and these are multiplied here except i'm doing it in small steps so that we actually
link |
00:56:12.800
have pointers to all these intermediate nodes so we have an x1 w1 variable and an x2 w2 variable
link |
00:56:19.360
and i'm also labeling them so n is now the cell body raw activation without
link |
00:56:28.400
the activation function for now and this should be enough to basically plot it so draw a dot of n
link |
00:56:37.840
gives us x1 times w1 x2 times w2 being added then the bias gets added on top of this
link |
00:56:45.040
and this n is this sum so we're now going to take it through an activation function
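A hedged sketch of the forward pass being built in this cell, assuming the Value class from earlier in the lecture (with data, a label, and + and * overloads); the bias is set to the value used a bit later in the video so the numbers come out nice:

```python
# inputs x1, x2 and weights w1, w2 of a two-dimensional neuron
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')   # bias of the neuron

# x1*w1 + x2*w2 + b, done in small steps so we keep pointers to every node
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'            # raw cell-body activation
```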
link |
00:56:52.000
and let's say we use the tanh so that we produce the output so what we'd like to do here is we'd
link |
00:56:57.520
like to do the output and i'll call it o is n dot tanh okay but we haven't yet written the tanh
link |
00:57:07.760
now the reason that we need to implement another tanh function here is that tanh is a hyperbolic
link |
00:57:14.800
function and we've only so far implemented a plus and a times and you can't make a tanh
link |
00:57:19.520
out of just pluses and times you also need exponentiation so tanh is this kind of a formula
link |
00:57:25.440
here you can use either one of these and you see that there is exponentiation involved which we
link |
00:57:30.240
have not implemented yet for our little value node here so we're not going to be able to produce
link |
00:57:35.040
tanh yet and we have to go back up and implement something like it now one option here is we could
link |
00:57:42.560
actually implement exponentiation right and we could return the exp of a value instead of a tanh
link |
00:57:50.560
of a value because if we had exp then we have everything else that we need so because we know
link |
00:57:56.880
how to add and we know how to multiply so we'd be able to
link |
00:58:03.280
create tanh if we knew how to exp but for the purposes of this example i specifically wanted to
link |
00:58:08.720
show you that we don't necessarily need to have the most atomic pieces in this value object
link |
00:58:16.960
we can actually like create functions at arbitrary points of abstraction they can be complicated
link |
00:58:24.080
functions but they can be also very very simple functions like a plus and it's totally up to us
link |
00:58:28.720
the only thing that matters is that we know how to differentiate through any one function so we
link |
00:58:33.840
take some inputs and we make an output the only thing that matters it can be arbitrarily complicated
link |
00:58:38.000
arbitrarily complex function as long as you know how to create the local derivative if you know the
link |
00:58:43.360
local derivative of how the inputs impact the output then that's all you need so we're going
link |
00:58:47.760
to cluster up all of this expression and we're not going to break it down to its atomic pieces
link |
00:58:53.200
we're just going to directly implement tanh so let's do that def tanh and then out will be a value
link |
00:59:00.720
of and we need this expression here so um let me actually copy paste
link |
00:59:14.160
let's grab n which is self dot data and then this i believe is the tanh math dot exp of
link |
00:59:22.560
two n minus one over math dot exp of two n plus one maybe i can call this x just so that it matches
link |
00:59:32.000
exactly okay and now this will be t and uh children of this node there's just one child
link |
00:59:42.080
and i'm wrapping it in a tuple so this is a tuple of one object just self and here the name of this
link |
00:59:48.400
operation will be tanh and we're going to return that okay so now values should be implementing tanh
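Collected into code, a minimal sketch of the tanh method just described, assuming the Value constructor from earlier takes (data, _children, _op):

```python
import math

# inside the Value class:
def tanh(self):
    x = self.data
    # tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')   # a single child: self, op label 'tanh'
    return out
```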
link |
00:59:59.920
and now we can scroll all the way down here and we can actually do n dot tanh and that's going to
link |
01:00:05.200
return the tanh output of n and now we should be able to draw it out of o not of n so let's see how
link |
01:00:13.600
that worked there we go and went through tanh to produce this output so now tanh is a sort of
link |
01:00:26.000
our little micro grad supported node here as an operation and as long as we know the derivative
link |
01:00:33.120
of tanh then we'll be able to back propagate through it now let's see this tanh in action
link |
01:00:38.240
currently it's not squashing too much because the input to it is pretty low so if the bias was increased
link |
01:00:44.400
to say eight then we'll see that what's flowing into the tanh now is two and tanh is squashing
link |
01:00:53.600
it to 0.96 so we're already hitting the tail of this tanh and it will sort of smoothly go up to
link |
01:00:59.200
one and then plateau out over there okay so now i'm going to do something slightly strange i'm
link |
01:01:04.000
going to change this bias from eight to this number 6.88 etc and i'm going to do this for
link |
01:01:10.880
specific reasons because we're about to start back propagation and i want to make sure that
link |
01:01:16.400
our numbers come out nice they're not like very crazy numbers they're nice numbers that we can
link |
01:01:20.640
sort of understand in our head let me also add o's label o is short for output here
link |
01:01:26.560
so that's the error okay so 0.88 flows into tanh comes out 0.7 so on so now we're going to do back
link |
01:01:33.440
propagation and we're going to fill in all the gradients so what is the derivative o with respect
link |
01:01:39.440
to all the inputs here and of course in a typical neural network setting what we really care about
link |
01:01:45.200
the most is the derivative of these neurons on the weights specifically the w2 and w1 because
link |
01:01:52.400
those are the weights that we're going to be changing part of the optimization and the other
link |
01:01:56.320
thing that we have to remember is here we have only a single neuron but in the neural net you
link |
01:01:59.920
typically have many neurons and they're connected so this is only like a one small neuron a piece
link |
01:02:05.520
of a much bigger puzzle and eventually there's a loss function that sort of measures the accuracy
link |
01:02:09.760
of the neural net and we're back propagating with respect to that accuracy and trying to increase it
link |
01:02:15.280
so let's start off with back propagation
link |
01:02:20.960
here at the end what is the derivative of o with respect to o the base case sort of we know always
link |
01:02:27.440
is that the gradient is just 1.0 so let me fill it in and then uh let me split out uh the drawing
link |
01:02:37.760
function here and then clear this output here okay so now when we draw o we'll see
link |
01:02:51.920
that o dot grad is one so now we're going to back propagate through the tanh so to back propagate
link |
01:02:57.840
through tanh we need to know the local derivative of tanh so if we have that uh o is tanh of n then
link |
01:03:08.640
what is d o by d n now what you could do is you could come here and you could take this expression
link |
01:03:15.360
and you could do your calculus derivative taking um and that would work but we can also just scroll
link |
01:03:21.840
down wikipedia here into a section that hopefully tells us that derivative uh d by dx of tanh of x
link |
01:03:30.560
is any of these i like this one one minus tanh square of x so this is one minus tanh of x squared
link |
01:03:39.360
so basically what this is saying is that d o by dn is one minus tanh of n squared
link |
01:03:49.280
and we already have tanh of n it's just o so it's one minus o squared so o is the output here so
link |
01:03:56.480
the output is this number o dot data is this number and then what this is saying is that d o
link |
01:04:07.520
by dn is one minus this squared so one minus o dot data squared is 0.5 conveniently so that's
link |
01:04:17.920
the local derivative of this tanh operation here is 0.5 and uh so that would be d o by dn
link |
01:04:25.840
so we can fill in that n dot grad is 0.5 we'll just fill it in
link |
01:04:37.120
so this is exactly 0.5 one half so now we're going to continue the back propagation
link |
01:04:49.280
this is 0.5 and this is a plus node so what is backprop going to do
link |
01:04:55.680
here and if you remember our previous example a plus is just a distributor of gradient so this
link |
01:05:02.160
gradient will simply flow to both of these equally and that's because the local derivative of this
link |
01:05:06.960
operation is one for every one of its nodes so one times 0.5 is 0.5 so therefore we know that
link |
01:05:14.960
this node here which we called this its grad is just 0.5 and we know that b dot grad is also 0.5
link |
01:05:24.880
so let's set those and let's draw
link |
01:05:28.880
so those are 0.5 continuing we have another plus 0.5 again we'll just distribute so 0.5 will flow
link |
01:05:35.840
to both of these so we can set x1w1 as well as x2w2 dot grad to 0.5 and let's redraw pluses
link |
01:05:49.520
are my favorite operations to back propagate through because it's very simple so now what's
link |
01:05:55.520
flowing into these expressions is 0.5 and so really again keep in mind what the derivative
link |
01:05:59.520
is telling us at every point in time along here this is saying that if we want the output of this
link |
01:06:05.600
neuron to increase then the influence on these expressions is positive on the output both of them
link |
01:06:13.200
are positive contribution to the output so now back propagating to x2 and w2 first this is a
link |
01:06:24.640
times node so we know that the local derivative is you know the other term so if we want to
link |
01:06:29.920
calculate x2 dot grad then can you think through what it's going to be so x2 dot grad will be w2
link |
01:06:43.440
dot data times this x2w2 dot grad right and w2 dot grad will be x2 dot data times x2w2 dot grad
link |
01:07:01.360
right so that's the little local piece of chain rule
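Written out, the little piece of chain rule for this times node is just (values as in this example):

```python
# local derivative of a product is "the other term", chained with the
# gradient flowing in from the x2*w2 node
x2.grad = w2.data * x2w2.grad   # = 1.0 * 0.5 = 0.5
w2.grad = x2.data * x2w2.grad   # = 0.0 * 0.5 = 0.0
```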
link |
01:07:07.040
let's set them and let's redraw so here we see that the gradient on our weight
link |
01:07:11.840
two is zero because x2's data was zero right but x2 will have the gradient 0.5 because data here
link |
01:07:19.440
was one and so what's interesting here right is because the input x2 was zero then because
link |
01:07:25.840
of the way the times works of course this gradient will be zero and think about intuitively why that
link |
01:07:31.760
is derivative always tells us the influence of this on the final output if i wiggle w2 how is
link |
01:07:40.080
the output changing it's not changing because we're multiplying by zero so because it's not
link |
01:07:45.200
changing there is no derivative and zero is the correct answer because we're squashing that zero
link |
01:07:52.160
and let's do it here 0.5 should come here and flow through this times and so we'll have
link |
01:07:58.960
that x1 dot grad is can you think through a little bit what what this should be
link |
01:08:05.040
the local derivative of times with respect to x1 is going to be w1 so w1's data times x1w1
link |
01:08:14.960
dot grad and w1 dot grad will be x1 dot data times x1w1 dot grad let's see what those
link |
01:08:25.760
came out to be so this is 0.5 so this would be negative 1.5 and this would be 1 and this
link |
01:08:33.440
1 and we've back propagated through this expression these are the actual final derivatives
link |
01:08:39.120
so if we want this neuron's output to increase we know that what's necessary is that w2 we have no
link |
01:08:48.080
gradient w2 doesn't actually matter to this neuron right now but this neuron this weight
link |
01:08:53.360
should go up so if this weight goes up then this neuron's output would have gone up and
link |
01:08:59.680
proportionally because the gradient is one okay so doing the back propagation manually is obviously
link |
01:09:04.480
ridiculous so we are now going to put an end to this suffering and we're going to see how we can
link |
01:09:09.200
implement the backward pass a bit more automatically we're not going to be doing all of
link |
01:09:13.280
it manually out here it's now pretty obvious to us by example how these pluses and times are back
link |
01:09:18.560
propagating gradients so let's go up to the value object and we're going to start codifying what
link |
01:09:25.520
we've seen in the examples below so we're going to do this by storing a special self dot
link |
01:09:34.800
underscore backward and this will be a function which is going to do that little
link |
01:09:39.680
piece of chain rule at each little node that took inputs and produced an output
link |
01:09:45.200
we're going to store how we are going to chain the output's gradient into the inputs' gradients
link |
01:09:51.440
so by default this will be a function that doesn't do anything so and you can also see
link |
01:09:59.760
that here in the value in micro grad so we have this backward function by default doesn't do
link |
01:10:06.080
anything this is a empty function and that would be sort of the case for example for a leaf node
link |
01:10:11.600
for a leaf node there's nothing to do but now when we're creating these out values these out
link |
01:10:18.720
values are an addition of self and other and so we will want to set out's backward to be
link |
01:10:28.560
the function that propagates the gradient so let's define what should happen
link |
01:10:39.600
and we're going to store it in the closure let's define what should happen when we call out's backward
link |
01:10:44.800
for addition our job is to take outs grad and propagate it into selfs grad and other dot grad
link |
01:10:54.080
so basically we want to set self dot grad to something and we want to set other dot grad
link |
01:10:59.680
to something okay and the way we saw below how chain rule works we want to take the local derivative
link |
01:11:07.840
times the sort of global derivative i should call it which is the derivative of the final output of
link |
01:11:17.120
the expression with respect to outs data with respect to out so the local derivative of self
link |
01:11:27.440
in an addition is 1.0 so it's just 1.0 times outs grad that's the chain rule and others dot grad
link |
01:11:37.520
will also be 1.0 times out's grad and what you're basically seeing here is that out's grad will simply be
link |
01:11:43.840
copied onto selfs grad and others grad as we saw happens for an addition operation
link |
01:11:49.920
so we're going to later call this function to propagate the gradient having done an addition
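As a sketch, the __add__ being described ends up looking roughly like this (the plain assignments here get changed to += accumulation later in the lecture):

```python
# inside the Value class:
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # local derivative of + is 1.0 for both inputs, so out.grad
        # is simply copied onto self.grad and other.grad
        self.grad = 1.0 * out.grad
        other.grad = 1.0 * out.grad
    out._backward = _backward

    return out
```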
link |
01:11:55.760
let's now do the multiplication we're going to also define that backward
link |
01:11:59.280
and we're going to set its backward to be backward
link |
01:12:07.840
and we want to chain out grad into self dot grad
link |
01:12:14.400
and others dot grad and this will be a little piece of chain rule for multiplication
link |
01:12:20.400
so we'll have so what should this be can you think through
link |
01:12:24.000
here so what is the local derivative here the local derivative was others dot data
link |
01:12:35.440
times out dot grad that's chain rule
link |
01:12:42.560
and here we have self dot data times out dot grad that's what we've been doing
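And the analogous sketch for multiplication:

```python
# inside the Value class:
def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # local derivative of * is the other factor, chained with out.grad
        self.grad = other.data * out.grad
        other.grad = self.data * out.grad
    out._backward = _backward

    return out
```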
link |
01:12:46.480
and finally here for tanh def backward and then we want to set out's backward to be just backward
link |
01:12:57.520
and here we need to back propagate we have out that grad and we want to chain it into self dot grad
link |
01:13:06.480
and self dot grad will be the local derivative of this operation that we've done here which is tanh
link |
01:13:12.240
and so we saw that the local gradient is one minus the tanh of x squared which here is t
link |
01:13:19.840
that's the local derivative because t is the output of this tanh so one minus t squared is
link |
01:13:24.720
the local derivative and then the gradient has to be multiplied because of the chain rule
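And for tanh, extending the earlier sketch with its _backward:

```python
# inside the Value class:
def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # d/dn tanh(n) = 1 - tanh(n)**2 = 1 - t**2, chained with out.grad
        self.grad = (1 - t**2) * out.grad
    out._backward = _backward

    return out
```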
link |
01:13:30.960
so out grad is chained through the local gradient into self dot grad and that should be basically
link |
01:13:36.880
it so we're going to redefine our value node we're going to swing all the way down here
link |
01:13:44.160
and we're going to redefine our expression make sure that all the grads are zero okay but now we
link |
01:13:52.960
don't have to do this manually anymore we are going to basically be calling the dot backward
link |
01:13:58.480
in the right order so first we want to call o's backward so o was the outcome of tanh
link |
01:14:14.000
right so calling o's underscore backward will be this function this is what it will do
link |
01:14:21.200
now we have to be careful because there's a times out dot grad and out dot grad remember is
link |
01:14:28.640
initialized to zero so here we see grad zero so as a base case we need to set os dot grad to 1.0
link |
01:14:41.920
to initialize this with one
link |
01:14:43.360
and then once this is one we can call o dot backward and what that should do is it should
link |
01:14:48.560
propagate this grad through tanh so the local derivative times the global derivative which
link |
01:14:55.520
is initialized at one so this should um
link |
01:15:01.760
oh no so i thought about redoing it but i figured i should just leave the error in here
link |
01:15:10.720
because it's pretty funny why is a nonetype object not callable uh it's because i screwed up we're
link |
01:15:18.080
trying to save these functions so this is correct so we're going to call this function
link |
01:15:23.600
this here you don't want to call the function because that returns none these functions return
link |
01:15:32.480
none we just want to store the function so let me redefine the value object and then we're going
link |
01:15:38.320
to come back in redefine the expression draw dot everything is great o dot grad is one
link |
01:15:44.720
o dot grad is one and now this should work of course okay so after o dot backward
link |
01:15:53.280
this grad should now be 0.5 if we redraw and if everything went correctly 0.5 yay okay so now we
link |
01:16:01.200
need to call n's dot backward sorry n's underscore backward and then we're going to call
link |
01:16:10.400
n's underscore backward so that seems to have worked
link |
01:16:17.840
so n's backward routed the gradient to both of these so this is looking great
link |
01:16:24.640
now we could of course call b dot backward what's gonna happen
link |
01:16:32.000
well because b is a leaf node b's backward is by
link |
01:16:38.960
initialization the empty function so nothing would happen but we can still call it on it
link |
01:16:45.840
but when we call this one's backward
link |
01:16:53.520
then we expect this 0.5 to get further routed right so there we go 0.5 0.5
link |
01:16:59.840
and then finally we want to call it here on x2w2 and on x1w1
link |
01:17:14.720
let's do both of those and there we go so we get 0.5 negative 1.5 and 1 exactly as we did before
link |
01:17:24.160
but now we've done it through calling that backward sort of manually so we have one last
link |
01:17:32.880
piece to get rid of which is us calling underscore backward manually so let's think through what we
link |
01:17:38.400
are actually doing we've laid out a mathematical expression and now we're trying to go backwards
link |
01:17:43.840
through that expression so going backwards through the expression just means that we never want to
link |
01:17:49.600
call a dot backward for any node before we've done sort of everything after it so we have to do
link |
01:17:59.280
everything after it before we're ever going to call dot backward on any one node we have to get
link |
01:18:03.360
all of its full dependencies everything that it depends on has to propagate to it before we can
link |
01:18:08.960
continue back propagation so this ordering of graphs can be achieved using something called
link |
01:18:15.200
topological sort so topological sort is basically a laying out of a graph such that all the edges
link |
01:18:23.120
go only from left to right basically so here we have a graph it's a directed acyclic graph a DAG
link |
01:18:30.560
and this is two different topological orders of it I believe where basically you'll see that it's
link |
01:18:35.680
laying out of the nodes such that all the edges go only one way from left to right and implementing
link |
01:18:41.760
topological sort you can look in wikipedia and so on I'm not going to go through it in detail
link |
01:18:47.760
but basically this is what builds a topological graph we maintain a set of visited nodes and then
link |
01:18:56.160
we are going through starting at some root node which for us is O that's where I want to start
link |
01:19:02.960
the topological sort and starting at O we go through all of its children and we need to lay
link |
01:19:08.400
them out from left to right and basically this starts at O if it's not visited then it marks it
link |
01:19:16.080
as visited and then it iterates through all of its children and calls build topological on them
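A minimal sketch of that recursive topological sort, assuming each Value keeps its children in a _prev attribute as earlier in the lecture:

```python
topo = []
visited = set()

def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        # a node appends itself only after all of its children are in the list
        topo.append(v)

build_topo(o)   # o is the output node of the expression
```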
link |
01:19:23.120
and then after it's gone through all the children it adds itself so basically this node that we're
link |
01:19:30.080
going to call it on like say O is only going to add itself to the topo list after all of the
link |
01:19:36.400
children have been processed and that's how this function is guaranteeing that you're only
link |
01:19:41.600
going to be in the list once all your children are in the list and that's the invariant that is
link |
01:19:45.600
being maintained so if we build topo on O and then inspect this list we're going to see that it
link |
01:19:52.320
ordered our value objects and the last one is the value of 0.707 which is the output so this is O
link |
01:20:01.760
and then this is N and then all the other nodes get laid out before it so that builds the topological
link |
01:20:10.080
graph and really what we're doing now is we're just calling dot underscore backward on all of
link |
01:20:16.000
the nodes in a topological order so if we just reset the gradients they're all zero what did we
link |
01:20:22.640
do we started by setting O dot grad to be one that's the base case then we built a topological
link |
01:20:34.320
order and then we went for node in reversed of topo in the reverse order because this
link |
01:20:48.080
list goes from the inputs to the output so we need to go through it in reversed order so starting at O node dot backward
link |
01:20:57.360
and this should be it there we go those are the correct derivatives finally we are going to hide
link |
01:21:07.600
this functionality so i'm going to copy this and we're going to hide it inside the value class
link |
01:21:13.520
because we don't want to have all that code lying around so instead of an underscore backward we're
link |
01:21:18.640
now going to define an actual backward so that backward without the underscore
link |
01:21:24.880
and that's going to do all the stuff that we just derived so let me just clean this up a little bit
link |
01:21:29.440
so we're first going to build a topological graph starting at self so build topo of self
link |
01:21:41.920
build topo of self will populate the topological order into the topo list which is a local variable
link |
01:21:49.280
then we set self dot grads to be one and then for each node in the reversed list
link |
01:21:55.920
so starting at self and going through all the children we call underscore backward and that should be it
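Put together, the backward method being described looks roughly like this:

```python
# inside the Value class:
def backward(self):
    # topological order of all nodes in the graph that feeds into self
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # base case, then apply the chain rule one node at a time, output first
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()
```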
link |
01:22:04.800
save come down here we define okay all the grads are zero and now what we can do is
link |
01:22:13.360
O dot backward without the underscore and
link |
01:22:20.240
there we go and that's back propagation
link |
01:22:25.280
at least for one neuron now we shouldn't be too happy with ourselves actually because we have
link |
01:22:30.400
a bad bug um and we have not surfaced the bug because of some specific conditions that we
link |
01:22:36.080
have to think about right now so here's the simplest case that shows the bug say i create
link |
01:22:43.440
a single node a and then i create a b that is a plus a and then i call backward
link |
01:22:51.440
so what's going to happen is a is three and then a b is a plus a so there's two arrows on top of
link |
01:22:57.440
each other here then we can see that b is of course the forward pass works b is just a plus a
link |
01:23:05.200
which is six but the gradient here that we calculate automatically is not actually correct
link |
01:23:12.640
and that's because um of course uh just doing calculus
link |
01:23:21.600
in your head the derivative of b with respect to a should be uh two one plus one it's not one
link |
01:23:30.800
intuitively what's happening here right so b is the result of a plus a
link |
01:23:34.400
and then we call backward on it so let's go up and see what that does
link |
01:23:39.120
um b is a result of addition so out is b and then when we call backward what happened is self.grad
link |
01:23:48.960
was set to one and then other.grad was set to one but because we're doing a plus a self and other
link |
01:23:58.240
are actually the exact same object so we are overriding the gradient we are setting it to one
link |
01:24:04.160
and then we are setting it again to one and that's why it stays at one so that's a problem
link |
01:24:11.760
there's another way to see this in a little bit more complicated expression
link |
01:24:18.720
so here we have a and b and then uh d will be the multiplication of the two and e will be the
link |
01:24:27.600
addition of the two and then we multiply e times d to get f and then we call it f.backward and
link |
01:24:35.200
these gradients if you check will be incorrect so fundamentally what's happening here again is
link |
01:24:42.400
basically we're going to see an issue anytime we use a variable more than once until now in these
link |
01:24:47.360
expressions above every variable is used exactly once so we didn't see the issue but here if a
link |
01:24:52.800
variable is used more than once what's going to happen during backward pass we're back propagating
link |
01:24:57.120
from f to e to d so far so good but now e calls it backward and it deposits its gradients to a and
link |
01:25:04.000
b but then we come back to d and call backward and it overwrites those gradients at a and b
link |
01:25:11.600
so that's obviously a problem and the solution here if you look at the multivariate case of the
link |
01:25:18.880
chain rule and its generalization there the solution there is basically that we have to
link |
01:25:23.680
accumulate these gradients these gradients add and so instead of setting those gradients
link |
01:25:31.760
we can simply do plus equals we need to accumulate those gradients so plus equals in every one of
link |
01:25:39.120
these backward functions and this will be okay remember because we are initializing
link |
01:25:49.120
them at zero so they start at zero and then any contribution that flows backwards will simply add
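In code, the fix is to turn the assignments inside every _backward closure into accumulations, for example for addition:

```python
# inside the Value class:
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad += 1.0 * out.grad    # accumulate instead of overwrite
        other.grad += 1.0 * out.grad
    out._backward = _backward

    return out

# the simple bug case now comes out right:
#   a = Value(3.0); b = a + a; b.backward()   ->   a.grad == 2.0
```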
link |
01:25:58.800
so now if we redefine this one because the plus equals this now works because a dot grad started
link |
01:26:07.280
at zero and when we call b dot backward we deposit one and then we deposit one again and now this is
link |
01:26:13.360
two which is correct and here this will also work and we'll get correct gradients because when we
link |
01:26:18.960
call e dot backward we will deposit the gradients from this branch and then we get to back to d dot
link |
01:26:24.160
backward it will deposit its own gradients and then those gradients simply add on top of each
link |
01:26:29.360
other and so we just accumulate those gradients and that fixes the issue okay now before we move
link |
01:26:34.320
on let me actually do a bit of cleanup here and delete some of these some of this intermediate
link |
01:26:39.760
work so i'm not going to need any of this now that we've derived all of it um we are going to keep
link |
01:26:47.040
this because i want to come back to it delete the tanh delete the earlier example delete the step
link |
01:26:55.920
delete this keep the code that draws and then delete this example and leave behind only the
link |
01:27:03.280
definition of value and now let's come back to this non-linearity here that we implemented the
link |
01:27:08.320
tanh now i told you that we could have broken down tanh into its explicit atoms in terms of other
link |
01:27:15.360
expressions if we had the exp function so if you remember tanh is defined like this and we chose to
link |
01:27:21.040
develop tanh as a single function and we can do that because we know its derivative and we can
link |
01:27:25.440
back propagate through it but we can also break down tanh into and express it as a function of exp
link |
01:27:31.200
and i would like to do that now because i want to prove to you that you get all the same results
link |
01:27:34.800
and all the same gradients but also because it forces us to implement a few more expressions
link |
01:27:39.840
it forces us to do exponentiation addition subtraction division and things like that and
link |
01:27:45.200
i think it's a good exercise to go through a few more of these okay so let's scroll up
link |
01:27:50.080
to the definition of value and here one thing that we currently can't do is we can do like a
link |
01:27:55.120
value of say 2.0 but we can't do you know here for example we want to add constant one and we
link |
01:28:02.720
can't do something like this and we can't do it because it says int object has no attribute data
link |
01:28:08.480
that's because a plus one comes right here to add and then other is the integer one and then here
link |
01:28:15.360
python is trying to access one dot data and that's not a thing that's because basically one is not a
link |
01:28:20.480
value object and we only have addition for value objects so as a matter of convenience so that we
link |
01:28:25.760
can create expressions like this and make them make sense we can simply do something like this
link |
01:28:32.320
basically we let other alone if other is an instance of value but if it's not an instance
link |
01:28:38.080
of value we're going to assume that it's a number like an integer or a float and we're going to
link |
01:28:41.520
simply wrap it in in value and then other will just become value of other and then other will
link |
01:28:46.720
have a data attribute and this should work so if i just save this redefine value then this should
link |
01:28:52.320
work there we go okay now let's do the exact same thing for multiply because we can't do something
link |
01:28:57.680
like this again for the exact same reason so we just have to go to mall and if other is not a
link |
01:29:05.280
value then let's wrap it in value let's redefine value and now this works now here's a kind of
link |
01:29:11.360
unfortunate and not obvious part a times two works we saw that but two times a is that going to work
link |
01:29:19.600
you'd expect it to write but actually it will not and the reason it won't is because python doesn't
link |
01:29:25.120
know like when you do a times two basically um so a times two python will go and it will basically
link |
01:29:32.400
do something like a dot mul of two that's basically what it will call but to it two times
link |
01:29:38.560
a is the same as two dot mul of a and that doesn't work because two can't multiply a value and so it's really
link |
01:29:46.000
confused about that so instead what happens is in python the way this works is you are free to
link |
01:29:50.800
define something called rmul and rmul is kind of like a fallback so if python can't do
link |
01:29:58.800
two times a it will check if um if by any chance a knows how to multiply two and that will be called
link |
01:30:06.720
into rmul so because python can't do two times a it will check is there an rmul in value and
link |
01:30:13.680
because there is it will now call that and what we'll do here is we will swap the order of the
link |
01:30:19.520
operands so basically two times a will redirect to rmul and rmul will basically call a times two
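A sketch of the two conveniences just described: wrapping plain numbers into Value inside __mul__ (and the same in __add__), plus an __rmul__ fallback that swaps the operands:

```python
# inside the Value class:
def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')
    # ... _backward as before ...
    return out

def __rmul__(self, other):   # handles other * self, e.g. 2 * a
    return self * other
```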
link |
01:30:26.080
and that's how that will work so redefining that with rmul two times a becomes four okay now
link |
01:30:32.880
looking at the other elements that we still need we need to know how to exponentiate and how to divide
link |
01:30:37.040
so let's first do the exponentiation part we're going to introduce
link |
01:30:41.760
a single function exp here and exp is going to mirror tanh in the sense that it's a simple
link |
01:30:48.880
single function that transforms a single scalar value and outputs a single scalar value
link |
01:30:53.120
so we pop out the python number we use math.exp to exponentiate it create a new value object
link |
01:30:58.640
everything that we've seen before the tricky part of course is how do you back propagate through
link |
01:31:02.720
e to the x and so here you can potentially pause the video and think about what should go here
link |
01:31:12.640
okay so basically we need to know what is the local derivative of e to the x so d by dx of e
link |
01:31:18.880
to the x is famously just e to the x and we've already just calculated e to the x and it's inside
link |
01:31:24.720
out.data so we can do out.data times out.grad that's the chain rule
link |
01:31:32.000
so we're just chaining on to the current running grad and this is what the expression looks like
link |
01:31:37.120
it looks a little confusing but this is what it is and that's the exponentiation
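As a sketch, exp ends up mirroring tanh, with the e^x derivative reusing out.data:

```python
# inside the Value class (math was imported earlier for tanh):
def exp(self):
    x = self.data
    out = Value(math.exp(x), (self,), 'exp')

    def _backward():
        # d/dx e^x = e^x, which we have already computed as out.data
        self.grad += out.data * out.grad
    out._backward = _backward

    return out
```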
link |
01:31:41.760
so redefining we should now be able to call a.exp and hopefully the backward pass works as well
link |
01:31:48.160
okay and the last thing we'd like to do of course is we'd like to be able to divide
link |
01:31:51.440
now i actually will implement something slightly more powerful than division because division is
link |
01:31:56.080
just a special case of something a bit more powerful so in particular just by rearranging
link |
01:32:02.000
if we have some kind of a b equals value of 4.0 here we'd like to basically be able to do a divide
link |
01:32:08.000
b and we'd like this to be able to give us 0.5 now division actually can be reshuffled as follows
link |
01:32:14.720
if we have a divide b that's actually the same as a multiplying 1 over b and that's the same as a
link |
01:32:20.080
multiplying b to the power of negative 1 and so what i'd like to do instead is i'd basically like
link |
01:32:25.600
to implement the operation of x to the k for some constant k so it's an integer or a float
link |
01:32:33.280
and we would like to be able to differentiate this and then as a special case negative 1 will
link |
01:32:38.720
be division and so i'm doing that just because it's more general and yeah you might as well do
link |
01:32:44.880
it that way so basically what i'm saying is we can redefine uh division
link |
01:32:51.280
which we will put here somewhere yeah we can put it here somewhere what i'm saying is that we can
link |
01:32:57.200
redefine division so self-divide other this can actually be rewritten as self times other to the
link |
01:33:03.760
power of negative one and now a value raised to the power of negative one we have now defined that
link |
01:33:10.640
so here's so we need to implement the pow function where am i going to put the pow function maybe
link |
01:33:17.200
here somewhere there's this color for it so this function will be called when we try to raise a
link |
01:33:24.080
value to some power and other will be that power now i'd like to make sure that other is only an
link |
01:33:30.160
int or a float usually other is some kind of a different value object but here other will be
link |
01:33:35.600
forced to be an int or a float otherwise the math uh won't work for what we're trying to achieve in
link |
01:33:42.320
this specific case because it would be a different derivative expression if we wanted other to be
link |
01:33:47.120
a value so here we create the output value which is just uh you know the data raised to the power
link |
01:33:53.040
of other and other here could be for example negative one that's what we are hoping to achieve
link |
01:33:58.320
and then uh this is the backward stub and this is the fun part which is what is the uh chain rule
link |
01:34:04.640
expression here for back propagating through the power function where we are raising to
link |
01:34:12.320
the power of some kind of a constant so this is the exercise and maybe pause the video here and
link |
01:34:16.800
see if you can figure it out yourself as to what we should put here okay so um you can actually go
link |
01:34:28.880
here and look at derivative rules as an example and we see lots of derivative rules that you can
link |
01:34:33.040
hopefully know from calculus in particular what we're looking for is the power rule
link |
01:34:37.920
because that's telling us that if we're trying to take d by dx of x to the n which is what we're
link |
01:34:42.000
doing here then that is just n times x to the n minus one right okay so that's telling us about
link |
01:34:51.520
the local derivative of this power operation so all we want here basically n is now other and
link |
01:34:59.840
self.data is x and so this now becomes other which is n times self.data which is now a python
link |
01:35:10.000
int or a float it's not a value object we're accessing the data attribute raised to the power
link |
01:35:16.800
of other minus one or n minus one i can put brackets around this but this doesn't matter
link |
01:35:22.080
because um power takes precedence over multiply in python so that would have been okay and then
link |
01:35:28.480
okay and that's the local derivative only but now we have to chain it and we chain it just
link |
01:35:33.520
simply by multiplying by out dot grad that's chain rule and this should uh technically work
link |
01:35:40.720
and we're gonna find out soon but now if we do this this should now work and we get 0.5
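A hedged sketch of the power function and of division expressed through it:

```python
# inside the Value class:
def __pow__(self, other):
    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
    out = Value(self.data**other, (self,), f'**{other}')

    def _backward():
        # power rule: d/dx x^n = n * x^(n-1), chained with out.grad
        self.grad += other * self.data**(other - 1) * out.grad
    out._backward = _backward

    return out

def __truediv__(self, other):   # self / other
    return self * other**-1
```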
link |
01:35:47.760
so the forward pass works but does the backward pass work and i realized that we actually also
link |
01:35:52.480
have to know how to subtract so right now a minus b will not work to make it work we need one more
link |
01:36:00.080
piece of code here and basically this is the subtraction and the way we're going to implement
link |
01:36:07.120
subtraction is we're going to implement it by addition of a negation and then to implement
link |
01:36:11.520
negation we're going to multiply by negative one so just again using the stuff we've already built
link |
01:36:15.920
and just expressing it in terms of what we have and a minus b is now working okay so now let's
link |
01:36:21.760
scroll again to this expression here for this neuron and let's just uh compute the backward
link |
01:36:27.840
pass here once we've defined o and let's draw it so here's the gradients for all these leaf
link |
01:36:33.920
nodes for this two-dimensional neuron that has a tanh that we've seen before so now what i'd like
link |
01:36:39.040
to do is i'd like to break up this tanh into this expression here so let me copy paste this here
link |
01:36:46.560
and now instead of we'll preserve the label and we will change how we define o so in particular
link |
01:36:53.760
we're going to implement this formula here so we need e to the 2x minus 1 over e to the 2x plus 1
link |
01:36:59.840
so e to the 2x we need to take 2 times n and we need to exponentiate it that's e to the 2x
link |
01:37:07.040
and then because we're using it twice let's create an intermediate variable e and then define o as
link |
01:37:13.680
e minus 1 over e plus 1 and that should be it and then we
link |
01:37:23.520
should be able to draw dot of o so now before i run this what do we expect to see number one we're
link |
01:37:30.560
expecting to see a much longer graph here because we've broken up tanh into a bunch of other
link |
01:37:35.200
operations but those operations are mathematically equivalent and so what we're expecting to see is
link |
01:37:40.400
number one the same result here so the forward pass works and number two because of that
link |
01:37:46.000
mathematical equivalence we expect to see the same backward pass and the same gradients on these
link |
01:37:50.640
leaf nodes so these gradients should be identical so let's run this so number one let's verify that
link |
01:37:58.800
instead of a single tanh node we have now exp and we have plus we have times negative one this is
link |
01:38:06.160
the division and we end up with the same forward pass here and then the gradients we have to be
link |
01:38:11.760
careful because they're in slightly different order potentially the gradients for w2x2 should
link |
01:38:16.320
be 0 and 0.5 w2 and x2 are 0 and 0.5 and w1x1 are 1 and negative 1.5 1 and negative 1.5 so that means
link |
01:38:26.400
that both our forward passes and backward passes were correct because this turned out to be
link |
01:38:31.040
equivalent to tanh before and so the reason I wanted to go through this exercise is number one
link |
01:38:37.200
we got to practice a few more operations and writing more backwards passes and number two
link |
01:38:42.400
I wanted to illustrate the point that the level at which you implement your operations is totally
link |
01:38:49.040
up to you you can implement backward passes for tiny expressions like a single individual plus or
link |
01:38:53.760
a single times or you can implement them for say tanh which is a kind of a potentially you can see
link |
01:39:00.400
it as a composite operation because it's made up of all these more atomic operations but really all
link |
01:39:05.120
of this is kind of like a fake concept all that matters is we have some kind of inputs and some
link |
01:39:08.960
kind of an output and this output is a function of the inputs in some way and as long as you can
link |
01:39:12.960
do forward pass and the backward pass of that little operation it doesn't matter what that
link |
01:39:18.160
operation is and how composite it is if you can write the local gradients you can chain the
link |
01:39:23.680
gradient and you can continue back propagation so the design of what those functions are is
link |
01:39:28.400
completely up to you so now I would like to show you how you can do the exact same thing
link |
01:39:32.960
but using a modern deep neural network library like for example PyTorch which I've roughly modeled
link |
01:39:38.640
micrograd by and so PyTorch is something you would use in production and I'll show you how you can
link |
01:39:45.040
do the exact same thing but in PyTorch API so I'm just going to copy paste it in and walk you through
link |
01:39:50.160
it a little bit this is what it looks like so we're going to import PyTorch and then we need to
link |
01:39:55.600
define these value objects like we have here now micrograd is a scalar valued engine so we only have
link |
01:40:04.560
scalar values like 2.0 but in PyTorch everything is based around tensors and like I mentioned
link |
01:40:10.320
tensors are just n-dimensional arrays of scalars so that's why things get a little bit more
link |
01:40:15.840
complicated here I just need a scalar valued tensor a tensor with just a single element
link |
01:40:21.520
but by default when you work with PyTorch you would use more complicated tensors like this
link |
01:40:28.480
so if I import PyTorch then I can create tensors like this and this tensor for example is a 2x3
link |
01:40:36.880
array of scalars in a single compact representation so we can check its shape
link |
01:40:44.080
we see that it's a 2x3 array and so on so this is usually what you would work with in the actual
link |
01:40:50.640
libraries so here I'm creating a tensor that has only a single element 2.0 and then I'm casting it
link |
01:41:00.160
to be double because Python is by default using double precision for its floating point numbers so
link |
01:41:06.320
I'd like everything to be identical by default the data type of these tensors will be float 32
link |
01:41:12.640
so it's only using a single precision float so I'm casting it to double so that we have float 64
link |
01:41:18.880
just like in Python so I'm casting to double and then we get something similar to value of two
link |
01:41:25.760
the next thing I have to do is because these are leaf nodes by default PyTorch assumes that they
link |
01:41:29.840
do not require gradients so I need to explicitly say that all of these nodes require gradients
link |
01:41:35.920
so this is going to construct scalar valued one element tensors make sure that PyTorch knows that
link |
01:41:41.920
they require gradients now by default these are set to false by the way because of efficiency
link |
01:41:47.040
reasons because usually you would not want gradients for leaf nodes like the inputs to the
link |
01:41:52.640
network and this is just trying to be efficient in the most common cases so once we've defined
link |
01:41:57.760
all of our values in PyTorch land we can perform arithmetic just like we can here in micrograd land
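For concreteness, a sketch of the PyTorch cell being walked through, using the same example values as the neuron from before:

```python
import torch

x1 = torch.Tensor([2.0]).double();                x1.requires_grad = True
x2 = torch.Tensor([0.0]).double();                x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double();               w1.requires_grad = True
w2 = torch.Tensor([1.0]).double();                w2.requires_grad = True
b  = torch.Tensor([6.8813735870195432]).double(); b.requires_grad = True

n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())   # ~0.7071, agreeing with micrograd's forward pass
o.backward()

print('x2', x2.grad.item())   # 0.5
print('w2', w2.grad.item())   # 0.0
print('x1', x1.grad.item())   # -1.5
print('w1', w1.grad.item())   # 1.0
```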
link |
01:42:03.360
so this will just work and then there's a torch.tanh also and what we get back is a tensor
link |
01:42:08.960
again and we can just like in micrograd it's got a data attribute and it's got grad attributes
link |
01:42:15.840
so these tensor objects just like in micrograd have a dot data and a dot grad and we can
link |
01:42:21.920
print them just like in micrograd and the only difference here is that we need to call dot item because
link |
01:42:28.400
otherwise PyTorch dot item basically takes a single tensor of one element and it just returns
link |
01:42:35.760
that element stripping out the tensor so let me just run this and hopefully we are going to get
link |
01:42:41.440
this is going to print the forward pass which is 0.707 and this will be the gradients which
link |
01:42:48.000
are 0.5 0 negative 1.5 and 1 so if we just run this there we go 0.7 so the forward pass agrees
link |
01:42:57.600
and then 0.5 0 negative 1.5 and 1 so PyTorch agrees with us and just to show you here basically
link |
01:43:04.480
oh here's a tensor with a single element and it's a double and we can call that item on it to just
link |
01:43:12.480
get the single number out so that's what item does and o is a tensor object like I mentioned and it's got
link |
01:43:19.360
a backward function just like we've implemented and then all of these also have a dot grad so
link |
01:43:24.480
like x2 for example has a grad and it's a tensor and we can pop out the individual number with dot
link |
01:43:29.680
item so basically Torch can do what we did in micrograd as a special case when your
link |
01:43:37.360
tensors are all single element tensors but the big deal with PyTorch is that everything is
link |
01:43:42.720
significantly more efficient because we are working with these tensor objects and we can do
link |
01:43:47.120
lots of operations in parallel on all of these tensors but otherwise what we've built very much
link |
01:43:53.600
agrees with the API of PyTorch okay so now that we have some machinery to build out pretty
link |
01:43:57.920
complicated mathematical expressions we can also start building up neural nets and as I mentioned
link |
01:44:02.560
neural nets are just a specific class of mathematical expressions so we're going to
link |
01:44:07.600
start building out a neural net piece by piece and eventually we'll build out a two-layer
link |
01:44:11.360
multi-layer perceptron as it's called and I'll show you exactly what that means let's start
link |
01:44:16.320
with a single individual neuron we've implemented one here but here I'm going to implement one that
link |
01:44:21.360
also subscribes to the PyTorch API in how it designs its neural network modules so just like
link |
01:44:27.840
we saw that we can like match the API of PyTorch on the autograd side we're going to try to do
link |
01:44:33.840
that on the neural network modules so here's class neuron and just for the sake of efficiency
link |
01:44:40.800
I'm going to copy paste some sections that are relatively straightforward
link |
01:44:45.520
so the constructor will take number of inputs to this neuron which is how many inputs come to a
link |
01:44:51.600
neuron so this one for example is three inputs and then it's going to create a weight that is some
link |
01:44:57.680
random number between negative one and one for every one of those inputs and a bias that controls
link |
01:45:02.960
the overall trigger happiness of this neuron and then we're going to implement a def underscore
link |
01:45:09.040
underscore call of self and x some input x and really what we want to do here is w times
link |
01:45:16.080
x plus b where w times x here is a dot product specifically now if you haven't seen call let me
link |
01:45:24.240
just return 0.0 here for now the way this works now is we can have an x which is say like 2.0 3.0
link |
01:45:30.720
then we can initialize a neuron that is two-dimensional because these are two
link |
01:45:34.640
numbers and then we can feed those two numbers into that neuron to get an output and so when
link |
01:45:40.320
you use this notation n of x python will use call so currently call just returns 0.0
link |
01:45:50.080
now we'd like to actually do the forward pass of this neuron instead so what we're going to
link |
01:45:55.440
do here first is we need to basically multiply all of the elements of w with all of the elements
link |
01:46:01.360
of x pairwise we need to multiply them so the first thing we're going to do is we're going to
link |
01:46:06.160
zip up self dot w and x and in python zip takes two iterators and it creates a new iterator
link |
01:46:14.320
that iterates over the tuples of their corresponding entries
link |
01:46:17.920
so for example just to show you we can print this list and still return 0.0 here
link |
01:46:23.840
so we see that these w's are paired up with the x's w with x
link |
01:46:41.440
and now what we want to do is
link |
01:46:42.720
um for w i x i in the zip we want to multiply w i times x i and then we want to sum all of that
link |
01:46:55.600
together to come up with an activation and add also self dot b on top so that's the raw activation
link |
01:47:02.480
and then of course we need to pass that through a non-linearity so what we're going to be returning
link |
01:47:06.720
is act dot tanh and here's out so now we see that we are getting some outputs and we get a different
link |
01:47:15.120
output from a neuron each time because we are initializing different weights and biases and then
link |
01:47:20.480
to be a bit more efficient here actually sum by the way takes a second optional parameter which
link |
01:47:26.480
is the start and by default the start is zero so these elements of this sum will be added on top
link |
01:47:33.360
of zero to begin with but actually we can just start with self dot b and then we just have an
link |
01:47:38.240
expression like this and then the generator expression here must be parenthesized in python
link |
01:47:52.400
yep so now we can forward a single neuron next up we're going to define a layer of neurons so here
link |
01:47:58.560
we have a schematic for a mlp so we see that these mlps each layer this is one layer has actually a
link |
01:48:05.760
number of neurons and they're not connected to each other but all of them are fully connected
link |
01:48:08.800
to the input so what is a layer of neurons it's just a set of neurons evaluated
link |
01:48:13.760
independently so in the interest of time i'm going to do something fairly straightforward here
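A sketch of the Layer container about to be described, reusing the Neuron class above:

```python
class Layer:

    def __init__(self, nin, nout):
        # nout completely independent neurons, each looking at the same nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        # evaluate every neuron independently on the same input
        return [n(x) for n in self.neurons]

# e.g. Layer(2, 3)([2.0, 3.0]) -> a list of three Value outputs
```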
link |
01:48:20.400
it's literally a layer is just a list of neurons and then how many neurons do we have we take that
link |
01:48:28.560
as an input argument here how many neurons do you want in your layer number of outputs in this layer
link |
01:48:34.000
and so we just initialize completely independent neurons with this given dimensionality and we
link |
01:48:39.040
call on it we just independently evaluate them so now instead of a neuron we can make a layer
link |
01:48:46.160
of neurons they are two-dimensional neurons and let's have three of them and now we see that we
link |
01:48:50.480
have three independent evaluations of three different neurons right okay and finally let's
link |
01:48:57.200
complete this picture and define an entire multi-layer perceptron or mlp and as we can
link |
01:49:02.320
see here in an mlp these layers just feed into each other sequentially so let's come here and
link |
01:49:07.760
i'm just going to copy the code here in the interest of time so an mlp is very similar to a layer
link |
01:49:14.800
we're taking the number of inputs as before but now instead of taking a single nout which is
link |
01:49:19.920
number of neurons in a single layer we're going to take a list of nouts and this list defines the
link |
01:49:25.200
sizes of all the layers that we want in our mlp so here we just put them all together and then
link |
01:49:30.400
iterate over consecutive pairs of these sizes and create layer objects for them and then in the call
link |
01:49:36.720
function we are just calling them sequentially so that's an mlp really and let's actually
link |
01:49:41.600
re-implement this picture so we want three input neurons and then two layers of four and an output
link |
01:49:46.400
unit so we want a three-dimensional input say this is an example input we want three inputs
link |
01:49:54.880
into two layers of four and one output and this of course is an mlp and there we go that's a
link |
01:50:02.880
forward pass of an mlp to make this a little bit nicer you see how we have just a single element
link |
01:50:07.760
but it's wrapped in a list because layer always returns lists so for convenience return outs at
link |
01:50:14.640
zero if len of outs is exactly one else return the full list and this will allow us to just get
link |
01:50:21.520
a single value out at the last layer that only has a single neuron and finally we should be able to
link |
01:50:27.440
draw dot of n of x and as you might imagine these expressions are now getting relatively involved
link |
01:50:36.000
relatively involved so this is an entire mlp that we're defining now
link |
01:50:45.280
all the way until a single output okay and so obviously you would never differentiate
link |
01:50:51.600
on pen and paper these expressions but with micrograd we will be able to back propagate
link |
01:50:56.320
all the way through this and back propagate into these weights of all these neurons so let's see
link |
01:51:03.200
how that works okay so let's create ourselves a very simple example data set here so this data
link |
01:51:09.200
set has four examples and so we have four possible inputs into the neural net and we have four
link |
01:51:16.240
desired targets so we'd like the neural net to assign or output 1.0 when it's fed this example
link |
01:51:24.240
negative one when it's fed these examples and one when it's fed this example so it's a very simple
link |
01:51:28.560
binary classifier neural net basically that we would like here now let's think what the neural
link |
01:51:33.440
net currently thinks about these four examples we can just get their predictions basically we
link |
01:51:38.640
can just call n of x for x in xs and then we can print so these are the outputs of the neural net
link |
01:51:47.040
on those four examples so the first one is 0.91 but we'd like it to be one so we should push this
link |
01:51:55.120
one higher this one we want to be higher this one says 0.88 and we want this to be negative one
link |
01:52:02.640
this is 0.88 we want it to be negative one and this one is 0.88 we want it to be one
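The tiny dataset and the current predictions look roughly like this (the exact numbers in the notebook may differ, but the shape is the same):

xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]    # desired targets, one per example

ypred = [n(x) for x in xs]     # what the randomly initialized net currently thinks
ypred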
link |
01:52:08.240
so how do we make the neural net and how do we tune the weights to better predict the desired
link |
01:52:15.200
targets and the trick used in deep learning to achieve this is to calculate a single number that
link |
01:52:21.920
somehow measures the total performance of your neural net and we call this single number the loss
link |
01:52:28.000
so the loss first is a single number that we're going to define that basically measures how well
link |
01:52:34.560
the neural net is performing right now we have the intuitive sense that it's not performing very well
link |
01:52:38.480
because we're not very close to the targets so the loss will be high and we'll want to minimize the
link |
01:52:43.760
loss so in particular in this case what we're going to do is we're going to implement the mean
link |
01:52:48.080
squared error loss so what this is doing is we're going to basically iterate for y ground truth
link |
01:52:56.400
and y output in zip of ys and ypred so we're going to pair up the ground truths with the
link |
01:53:04.800
predictions and the zip iterates over tuples of them and for each y ground truth and y output
link |
01:53:13.200
we're going to subtract them and square them so let's first see what these losses are these are
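As a sketch, the mean squared error described here is just a couple of lines, assuming the xs, ys and ypred from above and a Value class that supports subtraction and powers:

# one squared-error component per example: (prediction - target)**2
[(yout - ygt)**2 for ygt, yout in zip(ys, ypred)]

# the total loss is the sum of those components, a single Value
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss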
link |
01:53:20.560
individual loss components and so basically for each one of the four we are taking the
link |
01:53:27.440
prediction and the ground truth we are subtracting them and squaring them so because this one is so
link |
01:53:34.800
close to its target 0.91 is almost one subtracting them gives a very small number so here we would
link |
01:53:42.640
get like a negative 0.1 and then squaring it just makes sure that regardless of whether we
link |
01:53:49.280
are more negative or more positive we always get a positive number instead of squaring
link |
01:53:55.280
we could also take for example the absolute value we need to discard the sign and so you
link |
01:53:59.840
see that the expression is arranged so that you only get zero exactly when y out is equal to y
link |
01:54:05.600
ground truth when those two are equal so your prediction is exactly the target you are going
link |
01:54:09.760
to get zero and if your prediction is not the target you are going to get some other number
link |
01:54:15.040
so here for example we are way off and so that's why the loss is quite high and the more off we are
link |
01:54:21.920
the greater the loss will be so we don't want high loss we want low loss and so the final
link |
01:54:28.960
loss here will be just the sum of all of these numbers so you see that this should be zero
link |
01:54:36.320
roughly plus zero roughly but plus seven so loss should be about seven here and now we want to
link |
01:54:45.920
minimize the loss we want the loss to be low because if loss is low then every one of the
link |
01:54:52.960
predictions is equal to its target so the loss the lowest it can be is zero and the greater it is the
link |
01:55:00.480
worse off the neural net is predicting so now of course if we do loss dot backward something magical
link |
01:55:08.640
happened when i hit enter and the magical thing of course that happened is that we can look at
link |
01:55:13.600
n dot layers at zero so say the first layer dot neurons at zero
link |
01:55:22.400
because remember that mlp has the layers which is a list and each layer has neurons which is a list
link |
01:55:28.720
and that gives us an individual neuron and then it's got some weights
link |
01:55:32.800
and so we can for example look at the weights at zero
link |
01:55:38.160
um oops it's not called weights it's called w and that's a value but now this value also has a grad
link |
01:55:48.000
because of the backward pass and so we see that because this gradient here on this particular
link |
01:55:54.240
weight of this particular neuron of this particular layer is negative we see that its influence on the
link |
01:55:59.360
loss is also negative so slightly increasing this particular weight of this neuron of this layer
link |
01:56:05.600
would make the loss go down and we actually have this information for every single one of our
link |
01:56:11.840
neurons and all their parameters actually it's worth looking at also a draw dot of loss by the
link |
01:56:16.880
way so previously we looked at the draw dot of a single neuron neuron forward pass and that was
link |
01:56:22.160
already a large expression but what is this expression we actually forwarded every one of
link |
01:56:27.920
those four examples and then we have the loss on top of them with the mean squared error and so
link |
01:56:33.120
this is a really massive graph because this graph that we've built up now oh my gosh this graph that
link |
01:56:41.280
we've built up now which is kind of excessive it's excessive because it has four forward passes of a
link |
01:56:46.960
neural net for every one of the examples and then it has the loss on top and it ends with the value
link |
01:56:52.640
of the loss which was 7.12 and this loss will now back propagate through all the forward passes
link |
01:56:58.720
all the way through just every single intermediate value of the neural net all the way back to of
link |
01:57:04.880
course the parameters of the weights which are the input so these weight parameters here are inputs
link |
01:57:10.880
to this neural net and these numbers here these scalars are inputs to the neural net so if we
link |
01:57:17.280
went around here we will probably find some of these examples this 1.0 potentially maybe this
link |
01:57:23.920
1.0 or you know some of the others and you'll see that they all have gradients as well the thing is
link |
01:57:29.280
these gradients on the input data are not that useful to us and that's because the input data
link |
01:57:35.760
seems to be not changeable it's it's a given to the problem and so it's a fixed input we're not
link |
01:57:41.200
going to be changing it or messing with it even though we do have gradients for it but some of
link |
01:57:46.800
these gradients here will be for the neural network parameters the w's and the b's and those
link |
01:57:53.280
we of course we want to change okay so now we're going to want some convenience codes to gather up
link |
01:57:59.600
all the parameters of the neural net so that we can operate on all of them simultaneously and
link |
01:58:04.720
every one of them we will nudge a tiny amount based on the gradient information so let's collect the
link |
01:58:11.360
parameters of the neural net all in one array so let's create a parameters of self that just returns
link |
01:58:19.680
uh self.w which is a list concatenated with a list of self.b
link |
01:58:27.120
so this will just return a list list plus list just you know gives you a list so that's parameters
link |
01:58:33.200
of neuron and i'm calling it this way because also pytorch has parameters on every single
link |
01:58:38.800
nn module and it does exactly what we're doing here it just returns the parameter tensors
link |
01:58:45.360
for us it's the parameter scalars now layer is also a module so it will have parameters
link |
01:58:52.560
self and basically what we want to do here is something like this like
link |
01:58:57.680
uh params is here and then for neuron in self.neurons
link |
01:59:05.360
we want to get neuron.parameters and we want to params.extend
link |
01:59:12.000
right so these are the parameters of this neuron and then we want to put them on top of params so
link |
01:59:17.200
params.extend of ps and then we want to return params but this is way too much code
link |
01:59:25.520
so actually there's a way to simplify this which is return p for neuron in self.neurons
link |
01:59:37.680
for p in neuron.parameters so it's a single list comprehension in python you can sort of nest them
link |
01:59:46.880
like this and you can then create the desired array so this is these are identical we can
link |
01:59:55.760
take this out and then let's do the same here def parameters self and return p for layer
link |
02:00:09.040
in self.layers for p in layer.parameters and that should be good now let me pop out this
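Sketched out, the three parameters methods look like this (the __init__ and __call__ of each class stay as defined above):

class Neuron:
    # ... __init__ and __call__ as before ...
    def parameters(self):
        return self.w + [self.b]

class Layer:
    # ... __init__ and __call__ as before ...
    def parameters(self):
        # nested list comprehension: all parameters of all neurons, flattened
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    # ... __init__ and __call__ as before ...
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

len(n.parameters())   # 41 for MLP(3, [4, 4, 1]): 4*(3+1) + 4*(4+1) + 1*(4+1)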
link |
02:00:23.280
so we don't reinitialize our network because we need to reinitialize our
link |
02:00:29.840
okay so unfortunately we will have to probably reinitialize the network because we just
link |
02:00:37.920
added functionality to this class and of course i want to get all of n dot parameters but
link |
02:00:43.760
that's not going to work because this is the old class okay so unfortunately we do have to
link |
02:00:50.000
reinitialize the network which will change some of the numbers but let me do that so that we pick
link |
02:00:55.120
up the new api we can now do n dot parameters and these are all the weights and biases inside
link |
02:01:01.440
the entire neural net so in total this mlp has 41 parameters and now we'll be able to change them
link |
02:01:13.680
if we recalculate the loss here we see that unfortunately we have slightly different
link |
02:01:18.160
predictions and slightly different loss but that's okay okay so we see that this neurons
link |
02:01:26.720
gradient is slightly negative we can also look at its data right now which is 0.85 so this is
link |
02:01:33.760
the current value of this neuron and this is its gradient on the loss so what we want to do now is
link |
02:01:40.400
we want to iterate for every p in n dot parameters so for all the 41 parameters in this neural net
link |
02:01:50.960
we actually want to change p dot data slightly according to the gradient information okay so
link |
02:01:59.440
dot dot to do here but this will be basically a tiny update in this gradient descent scheme
link |
02:02:07.120
and in gradient descent we are thinking of the gradient as a vector pointing in the direction of
link |
02:02:14.240
increased loss and so in gradient descent we are modifying p dot data by a small step size
link |
02:02:24.800
in the direction of the gradient so the step size as an example could be like a very small number
link |
02:02:28.960
like 0.01 is the step size times p dot grad right but we have to think through some of the signs
link |
02:02:37.120
here so in particular working with this specific example here we see that if we just left it like
link |
02:02:44.880
this then this neuron's value would be currently increased by a tiny amount of the gradient
link |
02:02:52.160
the gradient is negative so this value of this neuron would go slightly down it would become
link |
02:02:57.280
like 0.8 you know 4 or something like that but if this neuron's value goes lower that would actually
link |
02:03:07.680
increase the loss that's because the derivative of this neuron is negative so increasing this makes
link |
02:03:16.320
the loss go down so increasing it is what we want to do instead of decreasing it so basically what
link |
02:03:22.320
we're missing here is we're actually missing a negative sign and again this other interpretation
link |
02:03:27.920
and that's because we want to minimize the loss we don't want to maximize the loss we want to
link |
02:03:31.280
decrease it and the other interpretation as i mentioned is you can think of the gradient
link |
02:03:35.360
vector so basically just the vector of all the gradients as pointing in the direction of increasing
link |
02:03:42.880
the loss but then we want to decrease it so we actually want to go in the opposite direction
link |
02:03:47.840
and so you can convince yourself that this is sort of like
link |
02:03:51.360
the right thing here with a negative because we want to minimize the loss
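A single gradient descent step over all the parameters is then, as a sketch, just this (with some small step size like 0.01):

for p in n.parameters():
    # minus sign: the gradient points toward increasing loss, so we step against it
    p.data += -0.01 * p.grad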
link |
02:03:55.440
so if we nudge all the parameters by a tiny amount
link |
02:04:00.640
then we'll see that this data will have changed a little bit so now this neuron is a tiny amount
link |
02:04:07.760
greater value so 0.854 went to 0.857 and that's a good thing because slightly increasing this neuron
link |
02:04:17.120
data makes the loss go down according to the gradient and so the correct thing has happened
link |
02:04:22.640
sign-wise and so now what we would expect of course is that because we've changed all these
link |
02:04:28.480
parameters we expect that the loss should have gone down a bit so we want to re-evaluate the
link |
02:04:34.560
loss let me basically this is just a data definition that hasn't changed but the forward
link |
02:04:41.440
pass here of the network we can recalculate and actually let me do it outside here so that we can
link |
02:04:50.400
compare the two loss values so here if i recalculate the loss we'd expect the new loss
link |
02:04:57.200
now to be slightly lower than this number so hopefully what we're getting now is a tiny bit
link |
02:05:02.080
lower than 4.84 4.36 okay and remember the way we've arranged this is that low loss means that our
link |
02:05:11.200
predictions are matching the targets so our predictions now are probably slightly closer
link |
02:05:16.320
to the targets and now all we have to do is we have to iterate this process so again we've done
link |
02:05:23.840
the forward pass and this is the loss now we can do loss dot backward let me take these out and we
link |
02:05:30.080
can do a step size and now we should have a slightly lower loss 4.36 goes to 3.9 and okay so
link |
02:05:39.840
we've done the forward pass here's the backward pass nudge and now the loss is 3.66
link |
02:05:48.240
3.47 and you get the idea we just continue doing this and this is gradient descent we're just
link |
02:05:55.600
iteratively doing forward pass backward pass update forward pass backward pass update and
link |
02:06:00.880
the neural net is improving its predictions so here if we look at ypred now ypred
link |
02:06:10.640
we see that this value should be getting closer to one so this value should be getting more
link |
02:06:15.840
positive these should be getting more negative and this one should be also getting more positive
link |
02:06:19.760
so if we just iterate this a few more times actually we may be able to afford to go a bit
link |
02:06:26.880
faster let's try a slightly higher learning rate oops okay there we go so now we're at 0.31
link |
02:06:37.360
if you go too fast by the way if you try to make it too big of a step you may actually overstep
link |
02:06:42.560
it's overconfidence because again remember we don't actually know exactly about the loss function
link |
02:06:46.720
the loss function has all kinds of structure and we only know about the very local dependence of
link |
02:06:51.840
all these parameters on the loss but if we step too far we may step into you know a part of the
link |
02:06:56.960
loss that is completely different and that can destabilize training and make your loss actually
link |
02:07:01.280
blow up even so the loss is now 0.04 so actually the predictions should be really
link |
02:07:08.000
quite close let's take a look so you see how this is almost one almost negative one almost one
link |
02:07:14.720
we can continue going so yep backward update oops there we go so we went way too fast and
link |
02:07:25.280
we actually overstepped so we got too eager where are we now oops okay seven e negative
link |
02:07:34.000
nine so this is very very low loss and the predictions are basically perfect so somehow we
link |
02:07:43.120
basically we were doing way too big updates and we briefly exploded but then somehow we ended up
link |
02:07:47.200
getting into a really good spot so usually this learning rate and the tuning of it is a is a subtle
link |
02:07:52.960
art you want to set your learning rate if it's too low you're going to take way too long to converge
link |
02:07:58.000
but if it's too high the whole thing gets unstable and you're not going to be able to
link |
02:08:01.360
converge and you might actually even explode the
link |
02:08:06.000
loss depending on your loss function so finding the step size to be just right it's it's a pretty
link |
02:08:12.160
subtle art sometimes when you're using sort of vanilla gradient descent but we happen to get into
link |
02:08:16.800
a good spot we can look at n dot parameters so this is the setting of weights and biases that
link |
02:08:26.800
makes our network predict the desired targets very very close and basically we've successfully
link |
02:08:36.880
trained a neural net okay let's make this a tiny bit more respectable and implement an actual
link |
02:08:41.600
training loop and what that looks like so this is the data definition that stays this is the forward
link |
02:08:46.800
pass so for k in range you know we're going to take a bunch of steps first you do the forward
link |
02:08:58.880
pass we evaluate the loss let's reinitialize the neural net from scratch and here's the data
link |
02:09:06.960
and we first do forward pass then we do the backward pass and then we do an update that's
link |
02:09:19.360
gradient descent and then we should be able to iterate this and we should be able to print the
link |
02:09:27.600
current step the current loss let's just print the sort of number of the loss and then we should
link |
02:09:36.320
be good and that should be it and then the learning rate 0.01 is a little too small 0.1
link |
02:09:44.720
we saw is like a little bit dangerously too high let's go somewhere in between and we'll optimize
link |
02:09:50.080
this for not 10 steps but let's go for say 20 steps let me erase all of this junk
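Put together, the training loop as written at this point looks roughly like this (note that, as admitted shortly, it is still missing the zero-grad step):

for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    loss.backward()

    # update: simple gradient descent with a step size between 0.01 and 0.1
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)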
link |
02:09:57.680
and let's run the optimization and you see how we've actually converged slower
link |
02:10:04.720
in a more controlled manner and got to a loss that is very low so I expect ypred to be quite good
link |
02:10:14.480
there we go
link |
02:10:20.640
and that's it okay so this is kind of embarrassing but we actually have a really terrible bug
link |
02:10:26.240
in here and it's a subtle bug and it's a very common bug and I can't believe I've done it for
link |
02:10:31.920
the 20th time in my life especially on camera and I could have re-shot the whole thing but I think
link |
02:10:37.600
it's pretty funny and you know you get to appreciate a bit what working with neural nets
link |
02:10:42.880
maybe is like sometimes we are guilty of a common bug I've actually tweeted the most common neural
link |
02:10:51.840
net mistakes a long time ago now and I'm not really going to explain any of these except for
link |
02:10:59.520
we are guilty of number three you forgot to zero grad before dot backward what is that
link |
02:11:07.440
basically what's happening and it's a subtle bug and I'm not sure if you saw it is that all of these
link |
02:11:13.600
weights here have a dot data and a dot grad and dot grad starts at zero
link |
02:11:19.120
and then we do backward and we fill in the gradients and then we do an update on the data
link |
02:11:23.840
but we don't flush the grad it stays there so when we do the second forward pass and we do backward
link |
02:11:31.120
again remember that all the backward operations do a plus equals on the grad and so these gradients
link |
02:11:36.880
just add up and they never get reset to zero so basically we didn't zero grad so here's how we
link |
02:11:44.480
zero grad before backward we need to iterate over all the parameters and we need to make sure that
link |
02:11:51.840
p dot grad is set to zero we need to reset it to zero just like it is in the constructor so remember
link |
02:11:59.600
all the way here for all these value nodes grad is reset to zero and then all these backward passes
link |
02:12:05.360
do a plus equals from that grad but we need to make sure that we reset the values to zero
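The fix is a few lines inserted into the loop above, just before calling backward:

    # zero grad: flush the gradients, because backward() accumulates with +=
    for p in n.parameters():
        p.grad = 0.0
    loss.backward()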
link |
02:12:11.760
so that when we do backward all of them start at zero and the actual backward pass accumulates
link |
02:12:18.240
the loss derivatives into the grads so this is zero grad in PyTorch and we will slightly
link |
02:12:27.360
we'll get a slightly different optimization let's reset the neural net the data is the same
link |
02:12:32.400
this is now I think correct and we get a much more you know we get a much more
link |
02:12:37.280
slower descent we still end up with pretty good results and we can continue this a bit more
link |
02:12:43.600
to get down lower and lower and lower yeah so the only reason that the previous thing worked
link |
02:12:52.960
it's extremely buggy the only reason that worked is that this is a very very simple problem
link |
02:13:00.000
and it's very easy for this neural net to fit this data and so the grads ended up accumulating
link |
02:13:06.480
and it effectively gave us a massive step size and it made us converge extremely fast
link |
02:13:13.760
but basically now we have to do more steps to get to very low values of loss and get
link |
02:13:19.520
ypred to be really good we can try to step a bit greater and we can get a much better result
link |
02:13:25.920
we can try to step a bit greater yeah we're going to get closer and closer to one minus one and one
link |
02:13:38.400
so working with neural nets is sometimes tricky because
link |
02:13:44.960
you may have lots of bugs in the code and your network might actually work just like ours worked
link |
02:13:50.640
but chances are is that if we had a more complex problem then actually this bug would have made
link |
02:13:55.760
us not optimize the loss very well and we were only able to get away with it because
link |
02:14:00.880
the problem is very simple so let's now bring everything together and summarize what we learned
link |
02:14:06.080
what are neural nets neural nets are these mathematical expressions fairly simple
link |
02:14:11.120
mathematical expressions in the case of a multi-layer perceptron that take as input
link |
02:14:16.400
the data and they take as input the weights and the parameters of the neural net
link |
02:14:21.600
and it's a mathematical expression for the forward pass followed by a loss function and the loss function
link |
02:14:26.640
tries to measure the accuracy of the predictions and usually the loss will be low when your
link |
02:14:31.920
predictions are matching your targets or where the network is basically behaving well so we
link |
02:14:37.120
we manipulate the loss function so that when the loss is low the network is doing what you
link |
02:14:41.200
want it to do on your problem and then we backward the loss using back propagation to get the gradient
link |
02:14:48.800
and then we know how to tune all the parameters to decrease the loss locally
link |
02:14:52.800
but then we have to iterate that process many times in what's called the gradient descent
link |
02:14:56.800
so we simply follow the gradient information and that minimizes the loss and the loss is
link |
02:15:01.680
arranged so that when the loss is minimized the network is doing what you want it to do
link |
02:15:05.360
and yeah so we just have a blob of neural stuff and we can make it do arbitrary things
link |
02:15:11.600
and that's what gives neural nets their power it's you know this is a very tiny network with 41
link |
02:15:16.560
parameters but you can build significantly more complicated neural nets with billions
link |
02:15:23.040
at this point almost trillions of parameters and it's a massive blob of neural tissue simulated
link |
02:15:28.640
neural tissue roughly speaking and you can make it do extremely complex problems and these neural
link |
02:15:35.440
nets then have all kinds of very fascinating emergent properties when you try to make them
link |
02:15:41.680
do significantly hard problems as in the case of gpt for example we have massive amounts of text
link |
02:15:48.640
from the internet and we're trying to get a neural net to take like a few words and try
link |
02:15:53.360
to predict the next word in a sequence that's the learning problem and it turns out that when you
link |
02:15:57.600
train this on all of internet the neural net actually has like really remarkable emergent
link |
02:16:01.920
properties but that neural net would have hundreds of billions of parameters but it works on
link |
02:16:07.440
fundamentally the exact same principles the neural net of course will be a bit more complex but
link |
02:16:12.400
otherwise evaluating the gradient is there and would be identical and the gradient descent would
link |
02:16:19.200
be there and would be basically identical but people usually use slightly different updates
link |
02:16:23.760
this is a very simple stochastic gradient descent update and the loss function would not be a mean
link |
02:16:29.680
squared error they would be using something called the cross entropy loss for predicting the next
link |
02:16:34.000
token so there's a few more details but fundamentally the neural network setup and
link |
02:16:37.600
neural network training is identical and pervasive and now you understand intuitively how that works
link |
02:16:43.760
under the hood in the beginning of this video i told you that by the end of it you would understand
link |
02:16:47.760
everything in micrograd and then we'd slowly build it up let me briefly prove that to you
link |
02:16:52.080
so i'm going to step through all the code that is in micrograd as of today actually potentially
link |
02:16:56.720
some of the code will change by the time you watch this video because i intend to continue
link |
02:16:59.920
developing micrograd but let's look at what we have so far at least init.py is empty when you
link |
02:17:05.680
go to engine.py that has the value everything here you should mostly recognize so we have the
link |
02:17:10.720
data dot grad attributes we have the backward function we have the previous set of children
link |
02:17:15.520
and the operation that produced this value we have addition multiplication and raising to a
link |
02:17:21.760
scalar power we have the relu non-linearity which is slightly different type of non-linearity than
link |
02:17:26.720
tanh that we used in this video both of them are non-linearities and notably tanh is not actually
link |
02:17:32.160
present in micrograd as of right now but i intend to add it later we have the backward which is
link |
02:17:37.840
identical and then all of these other operations which are built up on top of operations here
link |
02:17:43.040
so value should be very recognizable except for the non-linearity used in this video
link |
02:17:48.400
there's no massive difference between relu and tanh and sigmoid and these other non-linearities
link |
02:17:52.720
they're all roughly equivalent and can be used in mlps so i use tanh because it's a bit smoother
link |
02:17:57.760
and because it's a little bit more complicated than relu and therefore it's stressed a little
link |
02:18:02.000
bit more the local gradients and working with those derivatives which i thought would be useful
link |
02:18:06.960
nn.py is the neural networks library as i mentioned so you should recognize identical
link |
02:18:11.600
implementation of neuron layer and mlp notably or not so much we have a class module here there's a
link |
02:18:18.160
parent class of all these modules i did that because there's an nn.module class in pytorch
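The Module parent class in nn.py is roughly this, mirroring the nn.Module api in pytorch:

class Module:
    def zero_grad(self):
        # reset every parameter's gradient before the next backward pass
        for p in self.parameters():
            p.grad = 0.0

    def parameters(self):
        # overridden by Neuron, Layer and MLP
        return []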
link |
02:18:23.920
and so this exactly matches that api and nn.module in pytorch has also a zero grad which i refactored
link |
02:18:29.680
out here so that's the end of micrograd really then there's a test which you'll see basically
link |
02:18:38.000
creates two chunks of code one in micrograd and one in pytorch and we'll make sure that the forward
link |
02:18:43.920
and the backward paths agree identically for a slightly less complicated expression a slightly
link |
02:18:48.560
more complicated expression everything agrees so we agree with pytorch on all of these operations
link |
02:18:54.080
and finally there's a demo.ipynb here and it's a bit more complicated binary classification
link |
02:18:58.800
demo than the one i covered in this lecture so we only had a tiny data set of four examples
link |
02:19:04.480
here we have a bit more complicated example with lots of blue points and lots of red points
link |
02:19:09.200
and we're trying to again build a binary classifier to distinguish two-dimensional
link |
02:19:13.280
points as red or blue it's a bit more complicated mlp here it's a bigger
link |
02:19:18.560
mlp the loss is a bit more complicated because it supports batches so because our data set was
link |
02:19:25.360
so tiny we always did a forward pass on the entire data set of four examples but when your data set
link |
02:19:30.480
is like a million examples what we usually do in practice is we basically pick out some random
link |
02:19:36.240
subset we call that a batch and then we only process the batch forward backward and update
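Picking a batch is along these lines; Xtrain and ytrain here are hypothetical stand-ins for a large dataset, and the demo notebook does this with numpy rather than plain python:

import random

batch_size = 32
# random subset of indices into the (hypothetical) large training set
ix = random.sample(range(len(Xtrain)), batch_size)
xb = [Xtrain[i] for i in ix]
yb = [ytrain[i] for i in ix]
# then forward, backward and update only on xb, yb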
link |
02:19:41.360
so we don't have to forward the entire training set
link |
02:19:46.400
and this supports batching because there's
link |
02:19:51.360
a lot more examples here we do a forward pass the loss is slightly different this is a
link |
02:19:57.280
max margin loss that i implement here the one that we used was the mean squared error loss
link |
02:20:02.640
because it's the simplest one there's also the binary cross entropy loss all of them can be used
link |
02:20:07.600
for binary classification and don't make too much of a difference in the simple examples that we
link |
02:20:11.680
looked at so far there's something called l2 regularization used here this has to do with
link |
02:20:17.840
generalization of the neural net and controls the overfitting in machine learning setting
link |
02:20:22.480
but i did not cover these concepts in this video potentially later and the training
link |
02:20:27.360
loop you should recognize so forward backward with zero grad and update and so on you'll
link |
02:20:34.720
notice that in the update here the learning rate is scaled as a function of number of iterations
link |
02:20:39.440
and it shrinks and this is something called learning rate decay so in the beginning you
link |
02:20:44.640
have a high learning rate and as the network sort of stabilizes near the end you bring down the
link |
02:20:49.840
learning rate to get some of the fine details in the end and in the end we see the decision
link |
02:20:54.800
surface of the neural net and we see that it learned to separate out the red and the blue
link |
02:20:58.960
area based on the data points so that's the slightly more complicated example in the demo
link |
02:21:04.560
demo.ipynb that you're free to go over but yeah as of today that is micrograd i also wanted
link |
02:21:10.640
to show you a little bit of real stuff so that you get to see how this is actually implemented
link |
02:21:14.160
in a production grade library like pytorch so in particular i wanted to show i wanted to find and
link |
02:21:19.360
show you the backward pass for tanh in pytorch so here in micrograd we see that the backward
link |
02:21:25.120
pass for tanh is 1 minus t squared where t is the output of the tanh of x times out dot grad which
link |
02:21:34.480
is the chain rule so we're looking for something that looks like this now i went to pytorch which
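Concretely, the micrograd-style tanh from earlier in this lecture looks roughly like this, so the pattern to hunt for in pytorch is the (1 - t**2) * out.grad term:

import math

# a method on the Value class built earlier in the lecture (sketch)
def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # local derivative of tanh is (1 - t**2), chained with out.grad
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward

    return out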
link |
02:21:41.520
has an open source github code base and i looked through a lot of its code and honestly i spent
link |
02:21:49.040
about 15 minutes and i couldn't find tanh and that's because these libraries unfortunately
link |
02:21:53.920
they grow in size and entropy and if you just search for tanh you get apparently 2800 results
link |
02:22:00.320
in 406 files so i don't know what these files are doing honestly and why there are so many
link |
02:22:08.560
mentions of tanh but unfortunately these libraries are quite complex they're meant to be used not
link |
02:22:13.600
really inspected eventually i did stumble on someone who tries to change the tanh backward
link |
02:22:21.120
code for some reason and someone here pointed to the cpu kernel and the cuda kernel for tanh backward
link |
02:22:27.760
so this so basically depends on if you're using pytorch on a cpu device or on a gpu
link |
02:22:32.320
which these are different devices and i haven't covered this but this is the tanh backward kernel
link |
02:22:37.760
for cpu and the reason it's so large is that number one this is like if you're using a complex
link |
02:22:45.920
type which we haven't even talked about if you're using a specific data type of bfloat 16 which we
link |
02:22:50.480
haven't talked about and then if you're not then this is the kernel and deep here we see something
link |
02:22:57.360
that resembles our backward pass so they have a times one minus b square so this b here must be
link |
02:23:06.560
the output of the tanh and this is the out.grad so here we found it deep inside pytorch on this
link |
02:23:14.800
location for some reason inside binary ops kernel when tanh is not actually a binary op and then this
link |
02:23:21.760
is the gpu kernel if we're not complex we're here and here we go with one line of code so we did
link |
02:23:31.600
find it but basically unfortunately these code bases are very large and micrograd is very very
link |
02:23:37.760
simple but if you actually want to use real stuff finding the code for it you'll actually find that
link |
02:23:42.240
difficult i also wanted to show you a little example here where pytorch is showing you how
link |
02:23:48.240
you can register a new type of function that you want to add to pytorch as a lego building block
link |
02:23:53.600
so here if you want to for example add a legendre polynomial three here's how you could do it you
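The registration pattern described in the next few lines is along these lines (adapted from the pytorch docs' Legendre polynomial example, so treat it as a sketch rather than the exact page shown in the video):

import torch

class LegendrePolynomial3(torch.autograd.Function):
    # P3(x) = 1/2 * (5x^3 - 3x)
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        # local derivative dP3/dx = 3/2 * (5x^2 - 1), chained with grad_output
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

x = torch.tensor(0.7, requires_grad=True)
y = LegendrePolynomial3.apply(x)
y.backward()
x.grad   # gradient of P3 at 0.7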
link |
02:24:00.480
will register it as a class that subclasses torch.autograd.function and then you have to tell
link |
02:24:07.040
pytorch how to forward your new function and how to backward through it so as long as you
link |
02:24:13.360
can do the forward pass of this little function piece that you want to add and as long as you
link |
02:24:16.960
know the local derivative the local gradients which are implemented in the backward pytorch
link |
02:24:21.920
will be able to back propagate through your function and then you can use this as a lego block
link |
02:24:25.920
in a larger lego castle of all the different lego blocks that pytorch already has and so that's the
link |
02:24:31.760
only thing you have to tell pytorch and everything would just work and you can register new types of
link |
02:24:35.760
functions in this way following this example and that is everything that i wanted to cover in this
link |
02:24:40.720
lecture so i hope you enjoyed building out micrograd with me i hope you find it interesting
link |
02:24:45.120
insightful and yeah i will post a lot of the links that are related to this video in the video
link |
02:24:51.920
description below i will also probably post a link to a discussion forum or discussion group where
link |
02:24:57.440
you can ask questions related to this video and then i can answer or someone else can answer your
link |
02:25:02.480
questions and i may also do a follow-up video that answers some of the most common questions
link |
02:25:08.400
but for now that's it i hope you enjoyed it if you did then please like and subscribe
link |
02:25:12.480
so that youtube knows to feature this video to more people and that's it for now i'll see you later
link |
02:25:22.400
now here's the problem we know dl by wait what is the problem
link |
02:25:29.520
and that's everything i wanted to cover in this lecture so i hope um you enjoyed us building out
link |
02:25:34.640
micrograb micrograb okay now let's do the exact same thing for multiply because we can't do
link |
02:25:42.240
something like a times two oops i know what happened there