Andrej Karpathy. Neural Networks: Zero to Hero
Episode 1. The spelled-out intro to neural networks and backpropagation: building micrograd
Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade. In this lecture I'd like to show you what neural network training looks like under the hood. In particular, we are going to start with a blank Jupyter notebook, and by the end of this lecture we will define and train a neural net, and you'll get to see everything that goes on under the hood and exactly how that works, on an intuitive level.

Specifically, what I would like to do is take you through the building of micrograd. Micrograd is a library that I released on GitHub about two years ago, but at the time I only uploaded the source code, and you'd have to go in by yourself and really figure out how it works. So in this lecture I will take you through it step by step and comment on all the pieces of it.

So what is micrograd, and why is it interesting? Micrograd is basically an autograd engine. Autograd is short for automatic gradient, and really what it does is implement backpropagation. Backpropagation is the algorithm that allows you to efficiently evaluate the gradient of some kind of a loss function with respect to the weights of a neural network. What that allows us to do is iteratively tune the weights of the neural network to minimize the loss function, and therefore improve the accuracy of the network. So backpropagation is the mathematical core of any modern deep neural network library, like, say, PyTorch or JAX.
The functionality of micrograd is, I think, best illustrated by an example. If we just scroll down here, you'll see that micrograd basically allows you to build out mathematical expressions. Here we are building out an expression with two inputs, a and b, and you'll see that a and b are negative 4 and 2, but we are wrapping those raw values into this Value object that we are going to build out as part of micrograd. This Value object wraps the numbers themselves, and then we build out a mathematical expression where a and b are transformed into c, d, and eventually e, f, and g. I'm showing some of the functionality of micrograd and the operations that it supports: you can add two Value objects, you can multiply them, you can raise them to a constant power, you can offset by one, negate, squash at zero, square, divide by a constant, divide by a Value, and so on.

So we're building out an expression graph with these two inputs a and b, and we're creating an output value g, and micrograd will, in the background, build out this entire mathematical expression. It will, for example, know that c is also a Value, that c was the result of an addition operation, and that the child nodes of c are a and b, because it maintains pointers to the a and b Value objects. So it knows exactly how all of this is laid out.

Then, not only can we do what we call the forward pass, where we actually look at the value of g (that's pretty straightforward; we access it using the .data attribute, and the output of the forward pass, the value of g, turns out to be 24.7), but the big deal is that we can also take this g Value object and call .backward(), which will initialize backpropagation at the node g. What backpropagation is going to do is start at g and go backwards through the expression graph, recursively applying the chain rule from calculus. That lets us evaluate the derivative of g with respect to all the internal nodes, like e, d, and c, but also with respect to the inputs a and b. Then we can query the derivative of g with respect to a, for example: that's a.grad, and in this case it happens to be about 138; and the derivative of g with respect to b, which here happens to be about 645. This derivative, we'll see soon, is very important information, because it's telling us how a and b are affecting g through this mathematical expression. In particular, a.grad is 138, so if we slightly nudge a and make it slightly larger, 138 is telling us that g will grow, and the slope of that growth is going to be 138; and the slope of growth with respect to b is going to be 645. So that tells us how g will respond if a and b get tweaked by a tiny amount in the positive direction.
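For reference, the expression from the micrograd README looks roughly like this (treat the exact sequence of operations as illustrative; the printed numbers match the ones quoted above):

```python
from micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(g.data)   # forward pass: about 24.7
g.backward()    # backpropagation starting at g
print(a.grad)   # dg/da, about 138
print(b.grad)   # dg/db, about 645
```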
Now, you might be confused about what this expression is that we built out here. This expression, by the way, is completely meaningless; I just made it up. I'm just flexing about the kinds of operations that are supported by micrograd. What we actually care about are neural networks, but it turns out that neural networks are just mathematical expressions, just like this one, although actually slightly less crazy. Neural networks are just mathematical expressions: they take the input data as an input, they take the weights of the neural network as an input, and the output is the predictions of your neural net, or the loss function. We'll see this in a bit, but basically neural networks just happen to be a certain class of mathematical expressions. Backpropagation is actually significantly more general: it doesn't care about neural networks at all, it only cares about arbitrary mathematical expressions, and we then happen to use that machinery for training neural networks.

One more note I would like to make at this stage: as you see here, micrograd is a scalar-valued autograd engine. It works at the level of individual scalars, like negative 4 and 2, and we're taking neural nets and breaking them down all the way to these atoms of individual scalars and all the little pluses and times. That's excessive, and obviously you would never do any of this in production; it's really just for pedagogical reasons, because it allows us to not have to deal with the n-dimensional tensors that you would use in a modern deep neural network library. This is really done so that you understand backpropagation and the chain rule, and what training a neural net means, at an intuitive level. If you actually want to train bigger networks, you have to use tensors, but none of the math changes; that's done purely for efficiency. We're basically taking all the scalar values and packaging them up into tensors, which are just arrays of these scalars, and because we have these large arrays we can run operations on them that take advantage of the parallelism in a computer; all those operations can be done in parallel, and the whole thing runs faster. But really none of the math changes; it's done purely for efficiency. So I don't think it's pedagogically useful to be dealing with tensors from scratch, and that's fundamentally why I wrote micrograd: you can understand how things work at the fundamental level, and then you can speed it up later.
Okay, so here's the fun part. My claim is that micrograd is all you need to train neural networks, and everything else is just efficiency. You'd think that micrograd would be a very complex piece of code, but that turns out not to be the case. If we just go to micrograd, you'll see that there are only two files here. This is the actual engine; it doesn't know anything about neural nets. And this is the entire neural nets library on top of micrograd. So: engine.py and nn.py. The actual backpropagation autograd engine that gives you the power of neural networks is literally about 100 lines of very simple Python, which we'll understand by the end of this lecture. And then nn.py, the neural network library built on top of the autograd engine, is like a joke: we define what a neuron is, then what a layer of neurons is, and then a multi-layer perceptron, which is just a sequence of layers of neurons; it's a total joke. So basically there's a lot of power that comes from only about 150 lines of code, and that's all you need to understand neural network training. Everything else is just efficiency, and of course there's a lot to efficiency, but fundamentally that's all that's happening.
Okay, so now let's dive right in and implement micrograd step by step. The first thing I'd like to do is make sure you have a very good intuitive understanding of what a derivative is and exactly what information it gives you. So let's start with some basic imports that I copy-paste into every Jupyter notebook, and let's define a scalar-valued function f(x) as follows. I just made this up randomly; I just wanted a scalar-valued function that takes a single scalar x and returns a single scalar y. We can of course call this function, so we can pass in, say, 3.0 and get 20 back.

We can also plot this function to get a sense of its shape. You can tell from the mathematical expression that this is probably a parabola; it's a quadratic. If we just create a set of scalar values that we can feed in, using for example np.arange from negative five to five in steps of 0.25 (so the xs run from negative five up to, but not including, five, in steps of 0.25), we can call the function on this numpy array as well. We get a set of ys by calling f on the xs, and those ys are basically the function applied to every one of these elements independently. We can plot this using matplotlib with plt.plot(xs, ys), and we get a nice parabola. So previously we fed in 3.0, which is somewhere here, and we received 20 back, which is the y coordinate here.

Now I'd like to think through what the derivative of this function is at any single input point x. What is the derivative at different points x of this function? If you remember back to your calculus class, you've probably derived derivatives: you take this mathematical expression, 3x squared minus 4x plus 5, write it out on a piece of paper, apply the various rules, and derive the mathematical expression for the derivative of the original function; then you can plug in different xs and see what the derivative is. We're not going to actually do that, because no one in neural networks writes out the expression for the neural net; it would be a massive expression with thousands or tens of thousands of terms, and no one derives that derivative by hand. So we're not going to take this kind of symbolic approach. Instead, I'd like to look at the definition of the derivative and just make sure that we really understand what the derivative is measuring and what it's telling you about the function.
So if we just look up "derivative", we see that... okay, this is not a very good definition of a derivative; this is a definition of what it means to be differentiable. But if you remember from your calculus, the derivative is the limit as h goes to zero of f(x plus h) minus f(x), over h. Basically, what it's saying is: if you slightly bump up the point x that you're interested in, if you slightly increase it by a small number h, how does the function respond? With what sensitivity? What is the slope at that point? Does the function go up or down, and by how much? That's the slope of the function, the slope of that response, at that point.

We can evaluate the derivative here numerically by taking a very small h. Of course, the definition would ask us to take h to zero; we're just going to pick a very small h, like 0.001, and let's say we're interested in the point 3.0. We can look at f(x), which of course is 20, and now f(x plus h): if we slightly nudge x in the positive direction, how is the function going to respond? Just looking at this, do you expect f(x plus h) to be slightly greater than 20, or slightly lower than 20? Since we're at 3, on the rising side of this parabola, if we go slightly in the positive direction the function will respond positively, so you'd expect this to be slightly greater than 20. And by how much it responds tells you the strength, the size, of the slope. So f(x plus h) minus f(x) is how much the function responded in the positive direction, and we have to normalize by the run, so we have rise over run, to get the slope. This of course is just a numerical approximation of the slope, because we would have to make h very, very small to converge to the exact amount. If I add too many zeros, at some point I'll get an incorrect answer, because we're using floating-point arithmetic and the representations of all these numbers in computer memory are finite, so at some point we get into trouble. But we can converge towards the right answer with this approach, and basically at 3 the slope is 14. You can check that by taking 3x squared minus 4x plus 5 and differentiating it in your head: the derivative is 6x minus 4, and plugging in x equals 3 gives 18 minus 4, which is 14. So this is correct.

That's at 3. Now, how about the slope at, say, negative 3? What would you expect? Telling the exact value is really hard, but what is the sign of that slope? At negative 3, if we go slightly in the positive direction in x, the function actually goes down, so the slope will be negative: f(x plus h) will come out slightly below f(x), and when we take the slope we expect something negative, and indeed we get negative 22. And at some point, of course, the slope will be zero. For this specific function I looked it up previously: it's at 2 over 3. So at roughly 2/3, which is somewhere here, the derivative is zero. At that precise point, if we nudge in the positive direction, the function doesn't respond; it stays almost the same, and that's why the slope is zero.
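Here is a small sketch of this numerical estimate, assuming the usual numpy and matplotlib imports (the exact cell contents in the notebook may differ slightly):

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 3*x**2 - 4*x + 5

# plot the parabola over [-5, 5)
xs = np.arange(-5, 5, 0.25)
plt.plot(xs, f(xs))

# numerical slope at a few points, via rise over run
h = 0.001
for x in [3.0, -3.0, 2/3]:
    print(x, (f(x + h) - f(x)) / h)   # roughly 14, -22, 0
```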
Okay, now let's look at a slightly more complex case. We're going to start complexifying a bit. Now we have a function with an output variable d that is a function of three scalar inputs: a, b, and c. So a, b, and c are some specific values, three inputs into our expression graph, with a single output d, and if we just print d we get 4. What I'd like to do again is look at the derivatives of d with respect to a, b, and c, and think through, again, the intuition of what those derivatives are telling us.

In order to evaluate the derivatives we're going to get a bit hacky here: we'll again have a very small value h, and we'll fix the inputs at some values that we're interested in. This is the point (a, b, c) at which we're going to evaluate the derivative of d with respect to a, b, and c. So those are the inputs, and d1 is the expression evaluated there. Then, for example, to look at the derivative of d with respect to a, we take a, bump it by h, and compute d2 as the exact same expression. Then we print d1, d2, and the slope. The derivative, or slope, will of course be d2 minus d1 divided by h: d2 minus d1 is how much the function increased when we bumped the specific input we're interested in by a tiny amount, and this is then normalized by h to get the slope.

So if I just run this, we're going to print d1, which we know is 4. Now d2 will have a bumped by h, so let's think through a little bit what d2 will be. Will it be a number slightly greater than 4, or slightly lower than 4? That's going to tell us the sign of the derivative. We're bumping a by h; b is minus 3, c is 10, so you can just intuitively think through this derivative and what it's doing. a will be slightly more positive, but b is a negative number, so if a is slightly more positive, then, because b is negative 3, we're actually going to be adding less to d. So you'd expect the value of the function to go down. And indeed, we went from 4 to 3.9996999..., and that tells you that the slope will be negative, because we went down, and the exact amount of the slope is negative 3. You can also convince yourself that negative 3 is the right answer analytically: if you have a times b plus c and you know your calculus, then differentiating a times b plus c with respect to a gives you just b, and indeed the value of b is negative 3, which is the derivative we got. So you can tell that that's correct.

Now if we do this with b, if we bump b by a little bit in the positive direction, we get a different slope. What is the influence of b on the output d? If we bump b by a tiny amount in the positive direction, then, because a is positive, we'll be adding more to d. And what is the sensitivity, the slope, of that addition? It might not surprise you that it should be 2. Why 2? Because dd by db, differentiating with respect to b, gives us a, and the value of a is 2. So that's also working well. And then if c gets bumped by a tiny amount h, then of course a times b is unaffected, and c just becomes slightly higher. What does that do to the function? It makes it slightly higher, by the exact same amount that we added to c, and that tells you that the slope is 1: that's the rate at which d will increase as we scale c.
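A sketch of this hacky numerical check (the variable names follow the description in the lecture; the exact value of h and the print formatting are assumptions):

```python
h = 0.0001

# the inputs
a = 2.0
b = -3.0
c = 10.0
d1 = a*b + c

# bump the input whose slope we want to probe (here: a)
a += h
d2 = a*b + c

print('d1', d1)                 # 4.0
print('d2', d2)                 # slightly below 4, since b is negative
print('slope', (d2 - d1) / h)   # about -3.0, i.e. the value of b
```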
Okay, so we now have some intuitive sense of what the derivative is telling you about the function, and we'd like to move towards neural networks. As I mentioned, neural networks will be pretty massive mathematical expressions, so we need some data structures that maintain these expressions, and that's what we're going to start building now. We're going to build out this Value object that I showed you in the README page of micrograd. Let me copy-paste a skeleton of the first, very simple Value object. The class Value takes a single scalar value that it wraps and keeps track of, and that's it. We can, for example, do Value(2.0) and then look at its content, and Python will internally use the __repr__ function to return this string. So this is a Value object with data equal to 2 that we're creating here.

Now what we'd like is to not just have two Values, but to do a plus b; we'd like to add them. Currently you'd get an error, because Python doesn't know how to add two Value objects, so we have to tell it. Here's addition: you have to use these special double-underscore methods in Python to define these operators for your objects. If we use the plus operator, Python will internally call a.__add__(b); that's what happens internally. So b will be other, and self will be a, and what we return is a new Value object wrapping the plus of their data. Remember that data is the actual Python number, so this plus here is just the usual floating-point addition, not an addition of Value objects, and we return a new Value. So now a plus b should work, and it should print a Value of negative one, because that's 2 plus minus 3. There we go.

Okay, let's now implement multiply, just so we can recreate this expression here. Multiply, I think, won't surprise you: it will be fairly similar, except instead of __add__ we're going to use __mul__, and here of course we want times. Now we can create a c Value object, which will be 10.0, and we should be able to do a times b plus c. Well, let's just do a times b first: that's a Value of negative six. And by the way, I skipped over this a little bit: suppose I didn't have the __repr__ function here; then you'd just get some kind of ugly default printout. What __repr__ is doing is providing us a way to print out a nicer-looking expression in Python, so we don't see something cryptic; we actually see that it's a Value of negative six. So this gives us a times b, and then we should be able to add c to it, because we've defined and told Python how to do __mul__ and __add__. This is basically equivalent to calling a.__mul__(b), and then calling .__add__(c) on that new Value object. Let's see if that worked: yep, that gave us 4, which is what we expect from before, and I believe we can just call them manually as well. There we go.
Okay, so now what we are missing is the connective tissue of this expression. As I mentioned, we want to keep these expression graphs, so we need to know and keep pointers about which values produce which other values. So here, for example, we're going to introduce a new variable which we'll call _children, and by default it will be an empty tuple. Then we're actually going to keep a slightly different variable in the class, which we'll call _prev, which will be the set of children. This is how I did it in the original micrograd, looking at my code here; I can't remember exactly the reason, but I believe it was efficiency: the _children argument will be a tuple for convenience, but when we actually maintain it in the class it will be a set. Yes, I believe for efficiency. So now, when we create a Value with the constructor, _children will be empty and _prev will be the empty set; but when we create a Value through addition or multiplication, we're going to feed in the children of this Value, which in this case is self and other. Those are the children here. So now we can do d._prev, and we'll see that the children of d are the Value of negative six and the Value of ten: this of course is the Value resulting from a times b, and the c Value, which is 10.

Now the last piece of information we don't have: we know the children of every single value, but we don't know what operation created each value. So we need one more element here; let's call it _op, and by default this is the empty string, for leaf nodes. Then we'll just maintain it here, and the operation will be just a simple string: in the case of addition it's '+', in the case of multiplication it's '*'. So now we not only have d._prev, we also have d._op, and we know that d was produced by an addition of those two values. And so now we have the full mathematical expression: we're building out this data structure, and we know exactly how each value came to be, by what expression, and from what other values.
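To keep things concrete, here's a minimal sketch of the Value class as described up to this point (roughly what engine.py looks like at this stage of the lecture; the label and grad fields show up a little later, and there's no backward logic yet):

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)   # the Values that produced this one
        self._op = _op                # the operation that produced it ('+', '*', or '' for a leaf)

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a*b + c
print(d, d._prev, d._op)   # Value(data=4.0), its two children, and '+'
```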
Now, because these expressions are about to get quite a bit larger, we'd like a way to nicely visualize the expressions that we're building out. For that I'm going to copy-paste a bunch of slightly scary code that will visualize these expression graphs for us. Here's the code, and I'll explain it in a bit, but first let me show you what it does. Basically, it creates a new function, draw_dot, that we can call on some root node, and it will visualize it. So if we call draw_dot on d, which is the final value here that is a times b plus c, it creates something like this: this is d, and you see that a times b creates an intermediate value, and adding c gives us this output node d. So that's draw_dot of d.

I'm not going to go through this code in complete detail, but you can take a look at graphviz and its API. Graphviz is an open-source graph visualization package, and what we're doing here is building out this graph using the graphviz API. You can see that trace is a helper function that enumerates all of the nodes and edges in the graph, so it just builds a set of all the nodes and edges; then we iterate over all the nodes and create special node objects for them using dot.node, and we also create edges using dot.edge. The only thing that's slightly tricky here is that I add these fake nodes, which are the operation nodes. For example, this node here is just a plus node; I create these special op nodes and connect them accordingly. These nodes, of course, are not actual nodes in the original graph; they're not Value objects. The only Value objects here are the things in squares, those are actual Value objects or representations thereof, and the op nodes are just created in this draw_dot routine so that it looks nice.

Let's also add labels to these graphs, just so we know which variables are where. So let's create a special label, or let's just do label equals the empty string by default, and save it in each node. Then here we're going to do: a's label is 'a', b's label is 'b', c's label is 'c'. Then let's create a special e equal to a times b, and e.label will be 'e' (it's kind of naughty), and d will be e plus c, and d.label will be 'd'. So nothing really changes; I just added this new e variable, and then here, when we are printing, I'm also going to print the label, so this will be a %s and this will be n.label. And so now we have the label on the left here: it says a times b creating e, and then e plus c creates d, just like we have it here.
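For reference, here is a sketch of trace and draw_dot along the lines just described (the exact graphviz attributes and label formatting are approximations; the grad field gets added to the record label a bit later in the lecture):

```python
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges reachable from root
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # left-to-right layout
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # any Value in the graph gets a rectangular ('record') node
        dot.node(name=uid, label="{ %s | data %.4f }" % (n.label, n.data), shape='record')
        if n._op:
            # if this Value is the result of an operation, create a fake op node for it
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
```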
And finally, let's make this expression just one layer deeper. d will not be the final output node; instead, after d, we're going to create a new Value object called f (we're going to start running out of variables soon). f will be negative 2.0, and its label will of course just be 'f'. Then L, capital L, will be the output of our graph: L will be d times f, so L is negative 8; that's the output. So now we don't just draw d, we draw L. Okay, and somehow the label of L is undefined, oops; the label has to be explicitly given to it. There we go; so L is the output.

Let's quickly recap what we've done so far. We are able to build out mathematical expressions, using only plus and times so far, and they are scalar-valued along the way. We can do this forward pass and build out a mathematical expression: we have multiple inputs here, a, b, c, and f, going into a mathematical expression that produces a single output L, and this here is visualizing the forward pass. The output of the forward pass is negative 8; that's the value. Now what we'd like to do next is run backpropagation. In backpropagation, we're going to start here at the end, and we're going to go in reverse and calculate the gradient along all these intermediate values. Really, what we're computing for every single value here is the derivative of L with respect to that node. So the derivative of L with respect to L is just 1, and then we're going to derive the derivative of L with respect to f, with respect to d, with respect to c, with respect to e, with respect to b, and with respect to a.

In a neural network setting, you'd be very interested in the derivative of basically this loss function L with respect to the weights of the neural network. Here, of course, we just have these variables a, b, c, and f, but some of them will eventually represent the weights of a neural net, and we'll need to know how those weights are impacting the loss function. So we'll be interested in the derivative of the output with respect to some of its leaf nodes, and those leaf nodes will be the weights of the neural net; the other leaf nodes, of course, will be the data itself. Usually we will not want or use the derivative of the loss function with respect to the data, because the data is fixed, but the weights will be iterated on using the gradient information.
So next we're going to create a variable inside the Value class that maintains the derivative of L with respect to that value, and we'll call this variable grad. So there's a .data and there's a self.grad, and initially it will be zero. Remember that zero basically means no effect: at initialization we're assuming that every value does not affect the output, because if the gradient is zero, that means changing this variable is not changing the loss function. So by default we assume the gradient is zero. Now that we have grad, and it's 0.0, we can visualize it here after data; so here grad is another %.4f in the format string, and that will be in the graph, and now we are showing both the data and the grad, initialized at zero. We are just about ready to calculate the backpropagation. And of course this grad, again, as I mentioned, represents the derivative of the output, in this case L, with respect to this value: the derivative of L with respect to f, with respect to d, and so on.

So let's now fill in those gradients and actually do backpropagation manually. Let's start filling in these gradients all the way at the end, as I mentioned. First, we're interested in filling in this gradient here: what is the derivative of L with respect to L? In other words, if I change L by a tiny amount h, how much does L change? It changes by h, so it's proportional, and therefore the derivative is 1. We can of course estimate these gradients numerically, just like we've seen before. If I take this expression and create a def lol() function here and put the expression inside it — the reason I'm creating this gating function lol here is that I don't want to pollute or mess up the global scope; this is just kind of a little staging area, and as you know, in Python all of these will be local variables to this function, so I'm not changing any of the global scope. Here l1 will be L, and then, copy-pasting this expression, we're going to add a small amount h to, for example, a; this would be measuring the derivative of L with respect to a. So here this will be l2, and then we want to print that derivative: print l2 minus l1, which is how much L changed, and then normalize it by h; so this is the rise over run. We have to be careful, because L is a Value node, so we actually want its data, so that these are floats we're dividing by h. This should print the derivative of L with respect to a, because a is the one we bumped a little bit by h. So what is the derivative of L with respect to a? It's 6. Okay. And obviously if we change L itself by h, then... this looks really awkward, but changing L by h, you see the derivative here is 1.
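A sketch of that lol() staging function, reconstructed from the description above (the literal lol name and the general shape are from the lecture; the body just rebuilds the same expression twice, with the probed input bumped by h the second time):

```python
def lol():
    h = 0.001

    # forward pass at the original inputs
    a = Value(2.0)
    b = Value(-3.0)
    c = Value(10.0)
    e = a * b
    d = e + c
    f = Value(-2.0)
    L = d * f
    L1 = L.data

    # the same forward pass, with the input we're probing bumped by h
    a = Value(2.0 + h)
    b = Value(-3.0)
    c = Value(10.0)
    e = a * b
    d = e + c
    f = Value(-2.0)
    L = d * f
    L2 = L.data

    print((L2 - L1) / h)   # numerical estimate of dL/da, about 6

lol()
```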
That's kind of like the base case of what we're doing here. So basically we can come up here and manually set L.grad to 1; this is our manual backpropagation. L.grad is 1, and let's redraw, and we'll see that we filled in grad = 1 for L.

We're now going to continue the backpropagation. Let's look at the derivatives of L with respect to d and f. Let's do d first. What we're interested in, if I create a little markdown cell here, is: we have that L is d times f, and we'd like to know what dL by dd is. If you know your calculus: L is d times f, so dL by dd is just f. And if you don't believe me, we can also derive it, because the proof is fairly straightforward. We go to the definition of the derivative, which is f(x plus h) minus f(x), over h, in the limit as h goes to zero. So when we have L equals d times f, then increasing d by h gives us (d plus h) times f — that's basically our f(x plus h) — minus d times f, and then divided by h. Symbolically expanding this out, we have d times f, plus h times f, minus d times f, divided by h; you see how the df minus df cancels, so we're left with h times f over h, which is f. So in the limit as h goes to zero of the derivative definition, we just get f, in the case of d times f. And symmetrically, dL by df will just be d. So what we have is that f.grad, we see now, is just the value of d, which is 4, and d.grad is just the value of f, and the value of f is negative 2. So we'll set those manually. Let me erase this markdown cell, and then let's redraw what we have.

Okay, and let's just make sure these were correct. We seem to think that dL by dd is negative 2, so let's double check. Let me erase this plus h from before, and now we want the derivative with respect to f, so let's come here where I create f and do a plus h; this should print the derivative of L with respect to f, so we expect to see 4. Yeah, and this is 4, up to floating-point funkiness. Then dL by dd should be f, which is negative 2; d.grad is negative 2. So if we again come here and change d, d.data plus equals h right here, we've added a little h, and then we see how L changed, and we expect to print negative 2. There we go. So we've numerically verified it. What we're doing here is kind of like an inline gradient check: a gradient check is when we derive backpropagation and get the derivative with respect to all the intermediate results, and then the numerical gradient is just estimating it with a small step size and comparing the two.
Now we're getting to the crux of backpropagation. This will be the most important node to understand, because if you understand the gradient for this node, you understand all of backpropagation and all of training of neural nets, basically. We need to derive dL by dc, in other words the derivative of L with respect to c, because we've computed all these other gradients already. Now we're coming here and continuing the backpropagation manually. So we want dL by dc, and then we'll also derive dL by de.

Here's the problem: how do we derive dL by dc? We actually know the derivative of L with respect to d, so we know how L is sensitive to d, but how is L sensitive to c? If we wiggle c, how does that impact L, through d? So we know dL by dd, and we also know here how c impacts d. Just very intuitively, if you know the impact that c is having on d, and the impact that d is having on L, then you should be able to somehow put that information together to figure out how c impacts L. And indeed, this is what we can do.

In particular, concentrating on d first: what is the derivative of d with respect to c? In other words, what is dd by dc? We know that d is c plus e; that's what we know. And now we're interested in dd by dc. If you just know your calculus, then differentiating c plus e with respect to c gives you 1.0. We can also go back to basics and derive this, because again we can go to our f(x plus h) minus f(x), divided by h, as h goes to zero, the definition of a derivative. Focusing on c and its effect on d: f(x plus h) will be c incremented by h, plus e — that's the first evaluation of our function — minus (c plus e), and then divided by h. Expanding this out, this is c plus h plus e, minus c, minus e, divided by h; you see how c minus c cancels and e minus e cancels, and we're left with h over h, which is 1.0. By symmetry, dd by de will also be 1.0.

So basically the derivative of a sum expression is very simple, and this is the local derivative. I call it the local derivative because we have the final output value all the way at the end of this graph, and we're now at a small node here, a little plus node, and the little plus node doesn't know anything about the rest of the graph it's embedded in. All it knows is that it did a plus: it took a c and an e, added them, and created a d. This plus node also knows the local influence of c on d, or rather the derivative of d with respect to c, and it knows the derivative of d with respect to e. But that's not what we want; that's just a local derivative. What we actually want is dL by dc, and L is here just one step away, but in the general case this little plus node could be embedded in a massive graph. So again: we know how L impacts d, and now we know how c and e impact d. How do we put that information together to get dL by dc?
The answer, of course, is the chain rule from calculus. So I've pulled up the chain rule here from Wikipedia, and I'm going to go through it very briefly. The chain rule on Wikipedia can sometimes be very confusing, and calculus can be very confusing: this is the way I learned the chain rule, and it was very confusing, like, what is happening? It's just complicated. So I like this expression much better: if a variable z depends on a variable y, which itself depends on a variable x, then z depends on x as well, obviously, through the intermediate variable y. In this case the chain rule is expressed as: if you want dz by dx, you take dz by dy and you multiply it by dy by dx. So the chain rule is fundamentally telling you how to chain these derivatives together correctly: to differentiate through a function composition, we have to multiply those derivatives. That's really what the chain rule is telling us.

And there's a nice little intuitive explanation here, which I also think is kind of cute. The chain rule states that knowing the instantaneous rate of change of z relative to y, and of y relative to x, allows one to calculate the instantaneous rate of change of z relative to x as the product of those two rates of change; simply the product of the two. And here's a good one: if a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels two times four, eight times as fast as the man. That makes it very clear that the correct thing to do is to multiply: the car is twice as fast as the bicycle, and the bicycle is four times as fast as the man, so the car is eight times as fast as the man. So we can take these intermediate rates of change, if you will, and multiply them together, and that justifies the chain rule intuitively.
So that's the chain rule. What it really means for us is that there's a very simple recipe for deriving what we want, which is dL by dc. What we have so far is: we know dL by dd, the impact of d on L, and we know that it's negative 2. And because of the local reasoning we've done here, we know dd by dc, how c impacts d; in particular, this is a plus node, so the local derivative is simply 1.0, very simple. The chain rule tells us that dL by dc, going through this intermediate variable, is simply dL by dd times dd by dc. That's the chain rule. This is identical to what's happening here, except z is our L, y is our d, and x is our c. So we literally just have to multiply these, and because these local derivatives, like dd by dc, are just 1, we basically just copy over dL by dd, because this is just times 1.

So what is it? Because dL by dd is negative 2, what is dL by dc? Well, it's the local gradient, 1.0, times dL by dd, which is negative 2. So literally what a plus node does, you can look at it that way, is it just routes the gradient: because the plus node's local derivatives are just 1, in the chain rule 1 times dL by dd is just dL by dd, and that derivative gets routed to both c and e in this case. So basically we have that c.grad, since that's the one we looked at, is negative 2 times 1, which is negative 2, and in the same way, by symmetry, e.grad will be negative 2. That's the claim. So we can set those, we can redraw, and you see how we just assigned negative 2 and negative 2. This backpropagating signal, which is carrying the information of what the derivative of L is with respect to all the intermediate nodes, we can imagine almost flowing backwards through the graph, and a plus node simply distributes the derivative to all of its children nodes.

So that's the claim; now let's verify it. Let me remove the previous plus h, and instead we want to increment c: c.data will be incremented by h, and when I run this we expect to see negative 2. And then of course for e: e.data plus equals h, and we expect to see negative 2. Simple.
So those are the derivatives of these internal nodes, and now we're going to recurse our way backwards again and apply the chain rule once more. So here we go, our second application of the chain rule, and we will apply it all the way through the graph; we just happen to have only one more node remaining. We have that dL by de, as we have just calculated, is negative 2; so we know the derivative of L with respect to e. Now we want dL by da, and the chain rule is telling us that that's just dL by de, negative 2, times the local gradient. So what is the local gradient? Basically de by da; we have to look at that. I'm a little times node inside a massive graph, and all I know is that I did a times b and produced an e. So what are de by da and de by db? That's the only thing I know about; that's my local gradient. Because we have that e is a times b, we're asking what de by da is, and of course we just did that above — we had a times there, so I'm not going to re-derive it: if you differentiate this with respect to a, you just get b, the value of b, which in this case is negative 3.0.

So basically, let me just do it right here: a.grad, applying the chain rule, is dL by de, which we see here is negative 2, times de by da, which is the value of b, negative 3. That's it. And then b.grad is again dL by de, which is negative 2, just the same way, times de by db, which is the value of a, 2.0. So these are our claimed derivatives. Let's redraw, and we see here that a.grad turns out to be 6, because that is negative 2 times negative 3, and b.grad is negative 2 times 2, which is negative 4. So those are our claims; let's delete this and verify them. We have a here: a.data plus equals h, and the claim is that a.grad is 6; let's verify: 6. And we have b.data plus equals h, so nudging b by h and looking at what happens, we claim it's negative 4, and indeed it's negative 4, plus or minus, again, float oddness.

And that's it; that was the manual backpropagation all the way from the output to all the leaf nodes, and we've done it piece by piece. Really, all we've done, as you saw, is iterate through all the nodes one by one and locally apply the chain rule. We always know the derivative of L with respect to this little output, and then we look at how this output was produced: it was produced through some operation, and we have the pointers to the children nodes of that operation. In this little operation we know what the local derivatives are, and we just multiply them onto the derivative, always. So we just go through and recursively multiply onto the local derivatives, and that's what backpropagation is: it's just a recursive application of the chain rule backwards through the computation graph.
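Just to recap the manual pass in one place, here is a sketch of the grads we filled in by hand for L = (a*b + c) * f with a=2, b=-3, c=10, f=-2 (written as assignments; in the notebook these were set one at a time):

```python
L.grad = 1.0                 # base case: dL/dL = 1
f.grad = d.data              #  4.0, since L = d*f  =>  dL/df = d
d.grad = f.data              # -2.0, since dL/dd = f
c.grad = d.grad * 1.0        # -2.0: the + node just routes the gradient
e.grad = d.grad * 1.0        # -2.0
a.grad = e.grad * b.data     # -2 * -3 =  6.0, since e = a*b  =>  de/da = b
b.grad = e.grad * a.data     # -2 *  2 = -4.0, since de/db = a
```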
Let's see this power in action, just very briefly. What we're going to do is nudge our inputs to try to make L go up. In particular, we're going to change a.data, and if we want L to go up, that means we just have to go in the direction of the gradient: a should increase in the direction of the gradient by some small step amount; this is the step size. And we don't just want this for a, but also for b, also for c, also for f; those are the leaf nodes, which we usually have control over. If we nudge in the direction of the gradient, we expect a positive influence on L, so we expect L to go up, to become less negative; it should go up to, say, negative 6 or something like that, it's hard to tell exactly. And we have to rerun the forward pass, so let me just do that here; this is effectively the forward pass. Now if we print L.data, we expect, because we nudged all the inputs in the direction of the gradient, a less negative L; we expect it to go up, so maybe it's negative 6 or so. Let's see what happens. Okay, negative 7. This is basically one step of an optimization that we'll end up running, and really this gradient just gives us power, because we know how to influence the final outcome, and this will be extremely useful for training neural nets, as we'll see soon.
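A sketch of that single nudge-and-rerun step, under the assumption that the step size is a small constant like 0.01 (the exact value used in the notebook may differ):

```python
step = 0.01

# nudge every leaf node in the direction of its gradient
a.data += step * a.grad
b.data += step * b.grad
c.data += step * c.grad
f.data += step * f.grad

# rerun the forward pass with the nudged leaves
e = a * b
d = e + c
L = d * f
print(L.data)   # less negative than -8.0 (roughly -7 in the lecture)
```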
So now I'd like to do one more example of manual backpropagation, using a slightly more complex and useful example: we're going to backpropagate through a neuron. We want to eventually build up to neural networks, and in the simplest case these are multi-layer perceptrons, as they're called. This is a two-layer neural net, with hidden layers made up of neurons, and these neurons are fully connected to each other. Now, biologically, neurons are very complicated devices, but we have very simple mathematical models of them. This is a very simple mathematical model of a neuron: you have some inputs, the x's, and then you have these synapses that have weights on them, the w's. The synapse interacts with the input to this neuron multiplicatively, so what flows to the cell body of the neuron is w times x. But there are multiple inputs, so there are many w times x's flowing into the cell body. The cell body also has some bias; this is kind of the innate trigger-happiness of the neuron, so the bias can make it a bit more trigger-happy or a bit less trigger-happy, regardless of the input. Basically we take all the w times x of all the inputs, add the bias, and then take it through an activation function, and this activation function is usually some kind of squashing function, like a sigmoid or a tanh or something like that.

As an example, we're going to use tanh. Numpy has np.tanh, so we can call it on a range and plot it. This is the tanh function: you see that inputs, as they come in, get squashed on the y coordinate. Right at zero we get exactly zero, and then as the input gets more positive, the function only goes up to one and then plateaus out, so if you pass in very positive inputs, we cap them smoothly at one, and on the negative side we cap them smoothly at negative one. So that's tanh, and that's the squashing function, or activation function, and what comes out of this neuron is just the activation function applied to the dot product of the weights and the inputs.
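A quick sketch of that plot (the exact range and step are a guess):

```python
import numpy as np
import matplotlib.pyplot as plt

# tanh squashes its input smoothly into (-1, 1)
xs = np.arange(-5, 5, 0.2)
plt.plot(xs, np.tanh(xs))
plt.grid()
```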
So let's write one out. I'm going to copy-paste, because I don't want to type too much. Okay, so here we have the inputs x1 and x2; this is a two-dimensional neuron, so two inputs are going to come in. Then we have the weights of this neuron, w1 and w2, and these weights again are the synaptic strengths for each input, and this is the bias of the neuron, b. Now what we want to do, according to this model, is multiply x1 times w1 and x2 times w2, and then add the bias on top. It gets a little messy here, but all we're trying to do is x1*w1 plus x2*w2 plus b, and these are multiplied here, except I'm doing it in small steps so that we actually have pointers to all the intermediate nodes. So we have an x1w1 variable and an x2w2 variable, and I'm also labeling them. n is now the raw cell body activation, without the activation function, for now. And this should be enough to plot it: draw_dot of n gives us x1 times w1 and x2 times w2 being added, and then the bias gets added on top of that.
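For reference, a sketch of that cell using the input values from the lecture notebook as I recall them (x1=2, x2=0, w1=-3, w2=1, with the oddly specific bias that gets introduced a bit later; treat the exact numbers as assumptions):

```python
# inputs x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')

# x1*w1 + x2*w2 + b, built in small steps so every intermediate node has a name
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'

draw_dot(n)
```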
And this n is that sum, so we're now going to take it through an activation function; let's say we use tanh to produce the output. So what we'd like to do here is produce the output, and I'll call it o: o is n.tanh(). But we haven't yet written tanh. The reason we need to implement another function here is that tanh is a hyperbolic function, and so far we've only implemented a plus and a times, and you can't make a tanh out of just pluses and times; you also need exponentiation. tanh is this kind of formula here, you can use either one of these expressions, and you see that there's exponentiation involved, which we have not implemented yet for our little Value node. So we're not going to be able to produce tanh yet, and we have to go back up and implement something like it.

Now, one option is that we could actually implement exponentiation, and we could return the exp of a Value instead of the tanh of a Value, because if we had exp, we'd have everything else we need: we know how to add and we know how to multiply, so we'd be able to create tanh if we knew how to exp. But for the purposes of this example I specifically wanted to show you that we don't necessarily need the most atomic pieces in this Value object. We can actually create functions at arbitrary points of abstraction: they can be complicated functions, but they can also be very, very simple functions, like a plus, and it's totally up to us. The only thing that matters is that we know how to differentiate through any one function: we take some inputs, we make an output, and that function can be arbitrarily complex; as long as you know the local derivative of how the inputs impact the output, that's all you need. So we're going to cluster up all of this expression, and we're not going to break it down into its atomic pieces; we're just going to directly implement tanh.

So let's do that: def tanh, and then out will be a Value of... we need this expression here, so let me actually copy-paste. Let's grab n, which is self.data, and then this, I believe, is the tanh: math.exp of 2x, minus 1, over math.exp of 2x, plus 1. Maybe I can call this x, just so that it matches exactly. Okay, and now this will be t, and the children of this node: there's just one child, and I'm wrapping it in a tuple, so this is a tuple of one object, just self; and the name of this operation will be 'tanh', and we're going to return that.
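As a sketch, the method just described, to be added inside the Value class (this mirrors the description above; math is the only new import):

```python
import math

# added as a method on the Value class
def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)   # tanh written via exponentials
    out = Value(t, (self, ), 'tanh')                # one child (self), op name 'tanh'
    return out
```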
Okay, so now Value should be implementing tanh, and we can scroll all the way down here and actually do n.tanh(), which will return the tanh output of n, and now we should be able to draw_dot of o, not of n. Let's see how that worked. There we go: it went through tanh to produce this output. So now tanh is a micrograd-supported node here, as an operation, and as long as we know the derivative of tanh, we'll be able to backpropagate through it.

Now let's see this tanh in action. Currently it's not squashing very much, because the input to it is pretty low. If the bias were increased to, say, 8, then we'd see that what's flowing into the tanh is now 2, and tanh squashes it to 0.96, so we're already hitting the tail of the tanh: it will smoothly go up to 1 and then plateau out over there. Okay, so now I'm going to do something slightly strange: I'm going to change this bias from 8 to this number, 6.88 and so on, and I'm doing this for specific reasons. We're about to start backpropagation, and I want to make sure that our numbers come out nice, not crazy numbers, but numbers we can sort of understand in our head. Let me also add o's label; o is short for output here. So that's that. Okay, so 0.88 flows into tanh and comes out as 0.7, and so on.
propagation and we're going to fill in all the gradients so what is the derivative of o with respect
link |
to all the inputs here and of course in a typical neural network setting what we really care about
link |
the most is the derivative of these neurons on the weights specifically the w2 and w1 because
link |
those are the weights that we're going to be changing part of the optimization and the other
link |
thing that we have to remember is here we have only a single neuron but in the neural net you
link |
typically have many neurons and they're connected so this is only like a one small neuron a piece
link |
of a much bigger puzzle and eventually there's a loss function that sort of measures the accuracy
link |
of the neural net and we're back propagating with respect to that accuracy and trying to increase it
link |
so let's start off with back propagation
link |
here in the end what is the derivative of o with respect to o the base case sort of we know always
link |
is that the gradient is just 1.0 so let me fill it in and then uh let me split out uh the drawing
link |
function um here and then here we'll clear this output okay so now when we draw o we'll see
link |
that o dot grad is one so now we're going to back propagate through the tanh so to back propagate
link |
through tanh we need to know the local derivative of tanh so if we have that uh o is tanh of n then
link |
what is d o by d n now what you could do is you could come here and you could take this expression
link |
and you could do your calculus derivative taking um and that would work but we can also just scroll
link |
down wikipedia here into a section that hopefully tells us that derivative uh d by dx of tanh of x
link |
is any of these i like this one one minus tanh squared of x so this is one minus tanh of x squared
link |
so basically what this is saying is that d o by dn is one minus tanh of n squared
link |
and we already have tanh of n it's just o so it's one minus o squared so o is the output here so
link |
the output is this number o dot data is this number and then what this is saying is that d o
link |
by dn is one minus this squared so one minus o dot data squared is 0.5 conveniently so that's
link |
the local derivative of this tanh operation here is 0.5 and uh so that would be d o by dn
link |
so we can fill in that n dot grad is 0.5 we'll just fill it in
link |
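As a standalone sanity check (not part of the lecture code), you can compare this one minus tanh squared formula against a numerical slope:

```python
import math

x = 0.8814             # roughly the n flowing into tanh in this example (about 0.88)
h = 1e-6
numerical = (math.tanh(x + h) - math.tanh(x)) / h   # finite-difference slope
analytical = 1 - math.tanh(x)**2                    # the formula used here
print(numerical, analytical)                        # both come out at roughly 0.5
```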
so this is exactly 0.5 one half so now we're going to continue the back propagation
link |
this is 0.5 and this is a plus node so what is backprop going to do
link |
here and if you remember our previous example a plus is just a distributor of gradient so this
link |
gradient will simply flow to both of these equally and that's because the local derivative of this
link |
operation is one for every one of its nodes so one times 0.5 is 0.5 so therefore we know that
link |
this node here which we called this its grad is just 0.5 and we know that b dot grad is also 0.5
link |
so let's set those and let's draw
link |
so those are 0.5 continuing we have another plus 0.5 again we'll just distribute so 0.5 will flow
link |
to both of these so we can set x1w1 dot grad as well as x2w2 dot grad to 0.5 and let's redraw pluses
link |
are my favorite operations to back propagate through because it's very simple so now what's
link |
flowing into these expressions is 0.5 and so really again keep in mind what the derivative
link |
is telling us at every point in time along here this is saying that if we want the output of this
link |
neuron to increase then the influence on these expressions is positive on the output both of them
link |
are positive contribution to the output so now back propagating to x2 and w2 first this is a
link |
times node so we know that the local derivative is you know the other term so if we want to
link |
calculate x2 dot grad then can you think through what it's going to be so x2 dot grad will be w2
link |
dot data times this x2w2 dot grad right and w2 dot grad will be x2 dot data times x2w2 dot grad
link |
right so that's the little local piece of chain rule
link |
let's set them and let's redraw so here we see that the gradient on our weight
link |
two is zero because x2's data was zero right but x2 will have the gradient 0.5 because data here
link |
was one and so what's interesting here right is because the input x2 was zero then because
link |
of the way the times works of course this gradient will be zero and think about intuitively why that
link |
is derivative always tells us the influence of this on the final output if i wiggle w2 how is
link |
the output changing it's not changing because we're multiplying by zero so because it's not
link |
changing there is no derivative and zero is the correct answer because we're squashing that zero
link |
and let's do it here 0.5 should come here and flow through this times and so we'll have
link |
that x1 dot grad is can you think through a little bit what what this should be
link |
the local derivative of times with respect to x1 is going to be w1 so w1's data times x1w1
link |
dot grad and w1 dot grad will be x1 dot data times x1w1 dot grad let's see what those
link |
came out to be so this is 0.5 so this would be negative 1.5 and this would be 1 and this
link |
1 and we've back propagated through this expression these are the actual final derivatives
link |
so if we want this neuron's output to increase we know that what's necessary is that w2 we have no
link |
gradient w2 doesn't actually matter to this neuron right now but this neuron this weight
link |
should go up so if this weight goes up then this neuron's output would have gone up and
link |
proportionally because the gradient is one okay so doing the back propagation manually is obviously
link |
ridiculous so we are now going to put an end to this suffering and we're going to see how we can
link |
implement the backward pass a bit more automatically we're not going to be doing all of
link |
it manually out here it's now pretty obvious to us by example how these pluses and times are back
link |
propagating gradients so let's go up to the value object and we're going to start codifying what
link |
we've seen in the examples below so we're going to do this by storing a special self dot underscore backward
link |
and underscore backward will be a function which is going to do that little
link |
piece of chain rule at each little node that took inputs and produced an output
link |
we're going to store how we are going to chain the outputs gradient into the inputs gradients
link |
so by default this will be a function that doesn't do anything so and you can also see
link |
that here in the value class in micrograd we have this underscore backward function which by default doesn't do
link |
anything this is an empty function and that would be sort of the case for example for a leaf node
link |
for a leaf node there's nothing to do but now when we're creating these out values these out
link |
values are an addition of self and other and so we will want to set outs backward to be
link |
the function that propagates the gradient so let's define what should happen
link |
and we're going to store it in the closure let's define what should happen when we call outs grad
link |
for addition our job is to take outs grad and propagate it into selfs grad and other dot grad
link |
so basically we want to set self dot grad to something and we want to set others dot grad
link |
to something okay and the way we saw below how chain rule works we want to take the local derivative
link |
times the sort of global derivative i should call it which is the derivative of the final output of
link |
the expression with respect to outs data with respect to out so the local derivative of self
link |
in an addition is 1.0 so it's just 1.0 times outs grad that's the chain rule and others dot grad
link |
will also be 1.0 times outs grad and what you're basically seeing here is that outs grad will simply be
link |
copied onto selfs grad and others grad as we saw happens for an addition operation
link |
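A sketch of what __add__ looks like at this point, with the closure stored on the output node; note that the plain assignment here gets changed to an accumulation later in the lecture:

```python
# sketch: __add__ as a method of the Value class at this stage of the lecture
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # local derivative of a sum is 1.0 for each input, times out.grad (chain rule)
        self.grad = 1.0 * out.grad
        other.grad = 1.0 * out.grad
    out._backward = _backward

    return out
```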
so we're going to later call this function to propagate the gradient having done an addition
link |
let's now do the multiplication we're going to also define that backward
link |
and we're going to set its backward to be backward
link |
and we want to chain out grad into self dot grad
link |
and others dot grad and this will be a little piece of chain rule for multiplication
link |
so we'll have so what should this be can you think through
link |
here so what is the local derivative here the local derivative was others dot data
link |
and then oops others dot data and then times out dot grad that's the chain rule
link |
and here we have self dot data times out dot grad that's what we've been doing
link |
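And the corresponding sketch for __mul__, where the local derivative with respect to each input is the data of the other input:

```python
# sketch: __mul__ as a method of the Value class at this stage of the lecture
def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # local derivative of a product w.r.t. one input is the other input's data
        self.grad = other.data * out.grad
        other.grad = self.data * out.grad
    out._backward = _backward

    return out
```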
and finally here for tanh that backward and then we want to set outs backward to be just backward
link |
and here we need to back propagate we have out dot grad and we want to chain it into self dot grad
link |
and self dot grad will be the local derivative of this operation that we've done here which is tanh
link |
and so we saw that the local gradient is one minus the tanh of x squared which here is t
link |
that's the local derivative because t is the output of this tanh so one minus t squared is
link |
the local derivative and then the gradient has to be multiplied because of the chain rule
link |
so out dot grad is chained through the local gradient into self dot grad and that should be basically it
link |
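Putting the two pieces together for tanh, a sketch of the method with its backward closure, where t is the forward output:

```python
import math

# sketch: tanh with its backward closure, as a method of the Value class
def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    out = Value(t, (self, ), 'tanh')

    def _backward():
        # local derivative of tanh is 1 - tanh(x)**2, i.e. 1 - t**2
        self.grad = (1 - t**2) * out.grad
    out._backward = _backward

    return out
```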
so we're going to redefine our value node we're going to swing all the way down here
link |
and we're going to redefine our expression make sure that all the grads are zero okay but now we
link |
don't have to do this manually anymore we are going to basically be calling the dot backward
link |
in the right order so first we want to call o's underscore backward so o was the outcome of tanh
link |
right so calling o's underscore backward will be this function this is what it will do
link |
now we have to be careful because there's a times out dot grad and out dot grad remember is
link |
initialized to zero so here we see grad zero so as a base case we need to set o dot grad to 1.0
link |
to initialize this with one
link |
and then once this is one we can call o dot backward and what that should do is it should
link |
propagate this grad through 10h so the local derivative times the global derivative which
link |
is initialized at one so this should um
link |
uh don't so i thought about redoing it but i figured i should just leave the error in here
link |
because it's pretty funny why is a nonetype object not callable uh it's because i screwed up we're
link |
trying to save these functions so this is correct we don't want to call the function
link |
here because that returns none these functions return
link |
none we just want to store the function so let me redefine the value object and then we're going
link |
to come back in redefine the expression draw dot everything is great o dot grad is one
link |
o dot grad is one and now this should work of course okay so after o dot underscore backward
link |
this grad should now be 0.5 if we redraw and if everything went correctly 0.5 yay okay so now we
link |
need to call n's underscore backward so let's call
link |
n dot underscore backward and that seems to have worked
link |
so n's underscore backward routed the gradient to both of these so this is looking great
link |
now we could of course call uh b dot underscore backward what's gonna happen
link |
well because b is a leaf node b's underscore backward is by
link |
initialization the empty function so nothing would happen but we can call it on it
link |
but when we call this one's underscore backward
link |
then we expect this 0.5 to get further routed right so there we go 0.5 0.5
link |
and then finally we want to call it here on x2w2 and on x1w1
link |
let's do both of those and there we go so we get 0.5 negative 1.5 and 1 exactly as we did before
link |
but now we've done it through calling that backward sort of manually so we have one last
link |
piece to get rid of which is us calling underscore backward manually so let's think through what we
link |
are actually doing we've laid out a mathematical expression and now we're trying to go backwards
link |
through that expression so going backwards through the expression just means that we never want to
link |
call a dot backward for any node before we've done sort of everything after it so we have to do
link |
everything after it before we're ever going to call dot backward on any one node we have to get
link |
all of its full dependencies everything that it depends on has to propagate to it before we can
link |
continue back propagation so this ordering of graphs can be achieved using something called
link |
topological sort so topological sort is basically a laying out of a graph such that all the edges
link |
go only from left to right basically so here we have a graph it's a directed acyclic graph a DAG
link |
and this is two different topological orders of it I believe where basically you'll see that it's
link |
laying out of the nodes such that all the edges go only one way from left to right and implementing
link |
topological sort you can look in wikipedia and so on I'm not going to go through it in detail
link |
but basically this is what builds a topological graph we maintain a set of visited nodes and then
link |
we are going through starting at some root node which for us is O that's where I want to start
link |
the topological sort and starting at O we go through all of its children and we need to lay
link |
them out from left to right and basically this starts at O if it's not visited then it marks it
link |
as visited and then it iterates through all of its children and calls build topological on them
link |
and then after it's gone through all the children it adds itself so basically this node that we're
link |
going to call it on like say O is only going to add itself to the topo list after all of the
link |
children have been processed and that's how this function is guaranteeing that you're only
link |
going to be in the list once all your children are in the list and that's the invariant that is
link |
being maintained so if we build topo on O and then inspect this list we're going to see that it
link |
ordered our value objects and the last one is the value of 0.707 which is the output so this is O
link |
and then this is N and then all the other nodes get laid out before it so that builds the topological
link |
graph and really what we're doing now is we're just calling dot underscore backward on all of
link |
the nodes in a topological order so if we just reset the gradients they're all zero what did we
link |
do we started by setting O dot grad to be one that's the base case then we built a topological
link |
order and then we went for node in reversed of topo now in in the reverse order because this
link |
list goes from you know we need to go through it in reversed order so starting at O node dot backward
link |
and this should be it there we go those are the correct derivatives finally we are going to hide
link |
this functionality so i'm going to copy this and we're going to hide it inside the value class
link |
because we don't want to have all that code lying around so instead of an underscore backward we're
link |
now going to define an actual backward so that backward without the underscore
link |
and that's going to do all the stuff that we just derived so let me just clean this up a little bit
link |
so we're first going to build a topological graph starting at self so build topo of self
link |
build topo of self will populate the topological order into the topo list which is a local variable
link |
then we set self dot grad to be one and then for each node in the reversed list
link |
so starting at self and going to all the children we call underscore backward and that should be it
link |
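A sketch of this backward method, assuming the children of each node are stored in a _prev attribute as earlier in the lecture:

```python
# sketch: backward (no underscore) as a method of the Value class
def backward(self):
    # topological order of all nodes in the graph, children before parents
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # base case: derivative of the output with respect to itself is 1.0
    self.grad = 1.0
    # go one node at a time, in reverse topological order, applying the chain rule
    for node in reversed(topo):
        node._backward()
```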
so let's save come down here we redefine okay all the grads are zero and now what we can do is
link |
O dot backward without the underscore and
link |
there we go and that's back propagation
link |
at least for one neuron now we shouldn't be too happy with ourselves actually because we have
link |
a bad bug um and we have not surfaced the bug because of some specific conditions that we are
link |
have we have to think about right now so here's the simplest case that shows the bug say i create
link |
a single node a and then i create a b that is a plus a and then i call backward
link |
so what's going to happen is a is three and then a b is a plus a so there's two arrows on top of
link |
each other here then we can see that b is of course the forward pass works b is just a plus a
link |
which is six but the gradient here that we calculate automatically is not actually correct
link |
and that's because um of course uh just doing calculus
link |
in your head the derivative of b with respect to a should be uh two one plus one it's not one
link |
intuitively what's happening here right so b is the result of a plus a
link |
and then we call backward on it so let's go up and see what that does
link |
um b is a result of addition so out is b and then when we call backward what happened is self.grad
link |
was set to one and then other.grad was set to one but because we're doing a plus a self and other
link |
are actually the exact same object so we are overriding the gradient we are setting it to one
link |
and then we are setting it again to one and that's why it stays at one so that's a problem
link |
there's another way to see this in a little bit more complicated expression
link |
so here we have a and b and then uh d will be the multiplication of the two and e will be the
link |
addition of the two and then we multiply e times d to get f and then we call it f.backward and
link |
these gradients if you check will be incorrect so fundamentally what's happening here again is
link |
basically we're going to see an issue anytime we use a variable more than once until now in these
link |
expressions above every variable is used exactly once so we didn't see the issue but here if a
link |
variable is used more than once what's going to happen during backward pass we're back propagating
link |
from f to e to d so far so good but now e calls it backward and it deposits its gradients to a and
link |
b but then we come back to d and call backward and it overwrites those gradients at a and b
link |
so that's obviously a problem and the solution here if you look at the multivariate case of the
link |
chain rule and its generalization there the solution there is basically that we have to
link |
accumulate these gradients these gradients add and so instead of setting those gradients
link |
we can simply do plus equals we need to accumulate those gradients so plus equals in each of these
link |
backward functions and this will be okay remember because we are initializing
link |
them at zero so they start at zero and then any contribution that flows backwards will simply add
link |
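So inside each backward closure the assignments become accumulations, sketched here for __add__, and the simple a plus a case then behaves as expected:

```python
# sketch: __add__ with the accumulation fix
def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        # accumulate rather than overwrite, so a node that is used more than once
        # receives the sum of all its gradient contributions (multivariate chain rule)
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward

    return out

# expected behavior once every backward does += :
#   a = Value(3.0); b = a + a; b.backward(); a.grad -> 2.0
```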
so now if we redefine this one because the plus equals this now works because a dot grad started
link |
at zero and when we call b dot backward we deposit one and then we deposit one again and now this is
link |
two which is correct and here this will also work and we'll get correct gradients because when we
link |
call e dot backward we will deposit the gradients from this branch and then we get to back to d dot
link |
backward it will deposit its own gradients and then those gradients simply add on top of each
link |
other and so we just accumulate those gradients and that fixes the issue okay now before we move
link |
on let me actually do a bit of cleanup here and delete some of these some of this intermediate
link |
work so i'm not going to need any of this now that we've derived all of it um we are going to keep
link |
this because i want to come back to it delete the tanh delete the earlier example delete the step
link |
delete this keep the code that draws and then delete this example and leave behind only the
link |
definition of value and now let's come back to this non-linearity here that we implemented the
link |
tanh now i told you that we could have broken down tanh into its explicit atoms in terms of other
link |
expressions if we had the exp function so if you remember tanh is defined like this and we chose to
link |
develop tanh as a single function and we can do that because we know its derivative and we can
link |
back propagate through it but we can also break down tanh and express it as a function of exp
link |
and i would like to do that now because i want to prove to you that you get all the same results
link |
and all the same gradients but also because it forces us to implement a few more expressions
link |
it forces us to do exponentiation addition subtraction division and things like that and
link |
i think it's a good exercise to go through a few more of these okay so let's scroll up
link |
to the definition of value and here one thing that we currently can't do is we can do like a
link |
value of say 2.0 but we can't do you know here for example we want to add constant one and we
link |
can't do something like this and we can't do it because it says int object has no attribute data
link |
that's because a plus one comes right here to add and then other is the integer one and then here
link |
python is trying to access one dot data and that's not a thing that's because basically one is not a
link |
value object and we only have addition for value objects so as a matter of convenience so that we
link |
can create expressions like this and make them make sense we can simply do something like this
link |
basically we let other alone if other is an instance of value but if it's not an instance
link |
of value we're going to assume that it's a number like an integer or a float and we're going to
link |
simply wrap it in in value and then other will just become value of other and then other will
link |
have a data attribute and this should work so if i just save this redefine value then this should
link |
work there we go okay now let's do the exact same thing for multiply because we can't do something
link |
like this again for the exact same reason so we just have to go to mul and if other is not a
link |
value then let's wrap it in value let's redefine value and now this works now here's a kind of
link |
unfortunate and not obvious part a times two works we saw that but two times a is that going to work
link |
you'd expect it to write but actually it will not and the reason it won't is because python doesn't
link |
know like when you do a times two basically um so a times two python will go and it will basically
link |
do something like a dot mul of two that's basically what it will call but two times
link |
a is the same as two dot mul of a and two can't multiply a value and so it's really
link |
confused about that so instead what happens is in python the way this works is you are free to
link |
define something called rmul and rmul is kind of like a fallback so if python can't do
link |
two times a it will check if um if by any chance a knows how to multiply two and that will be called
link |
into rmul so because python can't do two times a it will check is there an rmul in value and
link |
because there is it will now call that and what we'll do here is we will swap the order of the
link |
operands so basically two times a will redirect to rmul and rmul will basically call a times two
link |
and that's how that will work so redefining that with rmul two times a becomes four
link |
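A sketch of these two convenience changes: wrap plain numbers into Value inside __add__ and __mul__, and add __rmul__ as the fallback for expressions like 2 * a (the ellipses stand for the method bodies already shown above):

```python
# sketch: wrapping plain numbers and adding the right-multiply fallback
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)  # so a + 1 works
    ...  # rest of __add__ as before

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)  # so a * 2 works
    ...  # rest of __mul__ as before

def __rmul__(self, other):  # other * self: the fallback Python uses for 2 * a
    return self * other
```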
okay now looking at the other elements that we still need we need to know how to exponentiate and how to divide
link |
so let's first do the exponentiation part we're going to introduce
link |
a single function exp here and exp is going to mirror tanh in the sense that it's a simple
link |
single function that transforms a single scalar value and outputs a single scalar value
link |
so we pop out the python number we use math.exp to exponentiate it create a new value object
link |
everything that we've seen before the tricky part of course is how do you back propagate through
link |
e to the x and so here you can potentially pause the video and think about what should go here
link |
okay so basically we need to know what is the local derivative of e to the x so d by dx of e
link |
to the x is famously just e to the x and we've already just calculated e to the x and it's inside
link |
out.data so we can do out.data times out.grad that's the chain rule
link |
so we're just chaining on to the current running grad and this is what the expression looks like
link |
it looks a little confusing but this is what it is and that's the exponentiation
link |
so redefining we should now be able to call a.exp and hopefully the backward pass works as well
link |
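A sketch of the exp method, where the local derivative of e to the x is e to the x itself, which is already sitting in out.data:

```python
import math

# sketch: exp as a method of the Value class
def exp(self):
    x = self.data
    out = Value(math.exp(x), (self, ), 'exp')

    def _backward():
        # d/dx e**x = e**x, which is exactly out.data, chained with out.grad
        self.grad += out.data * out.grad
    out._backward = _backward

    return out
```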
okay and the last thing we'd like to do of course is we'd like to be able to divide
link |
now i actually will implement something slightly more powerful than division because division is
link |
just a special case of something a bit more powerful so in particular just by rearranging
link |
if we have some kind of a b equals value of 4.0 here we'd like to basically be able to do a divide
link |
b and we'd like this to be able to give us 0.5 now division actually can be reshuffled as follows
link |
if we have a divide b that's actually the same as a multiplying 1 over b and that's the same as a
link |
multiplying b to the power of negative 1 and so what i'd like to do instead is i'd basically like
link |
to implement the operation of x to the k for some constant k so it's an integer or a float
link |
and we would like to be able to differentiate this and then as a special case negative 1 will
link |
be division and so i'm doing that just because it's more general and
link |
yeah you might as well do it that way so basically what i'm saying is we can redefine uh division
link |
which we will put here somewhere yeah we can put it here somewhere what i'm saying is that we can
link |
redefine division so self-divide other this can actually be rewritten as self times other to the
link |
power of negative one and now a value raised to the power of negative one we have not defined that
link |
so here's so we need to implement the pow function where am i going to put the pow function maybe
link |
here somewhere there's this color for it so this function will be called when we try to raise a
link |
value to some power and other will be that power now i'd like to make sure that other is only an
link |
int or a float usually other is some kind of a different value object but here other will be
link |
forced to be an int or a float otherwise the math uh won't work for what we're trying to achieve in
link |
the specific case and it would be a different derivative expression if we wanted other to be
link |
a value so here we create the output value which is just uh you know the data raised to the power
link |
of other and other here could be for example negative one that's what we are hoping to achieve
link |
and then uh this is the backward stub and this is the fun part which is what is the uh chain rule
link |
expression here for back for um back propagating through the power function where the power is to
link |
the power of some kind of a constant so this is the exercise and maybe pause the video here and
link |
see if you can figure it out yourself as to what we should put here okay so um you can actually go
link |
here and look at derivative rules as an example and we see lots of derivative rules that you can
link |
hopefully know from calculus in particular what we're looking for is the power rule
link |
because that's telling us that if we're trying to take d by dx of x to the n which is what we're
link |
doing here then that is just n times x to the n minus one right okay so that's telling us about
link |
the local derivative of this power operation so all we want here basically n is now other and
link |
self.data is x and so this now becomes other which is n times self.data which is now a python
link |
int or a float it's not a value object we're accessing the data attribute raised to the power
link |
of other minus one or n minus one i can put brackets around this but this doesn't matter
link |
because um power takes precedence over multiply in python so that would have been okay and then
link |
okay and that's the local derivative only but now we have to chain it and we chain it just
link |
simply by multiplying by out dot grad that's chain rule and this should uh technically work
link |
and we're gonna find out soon but now if we do this this should now work and we get 0.5
link |
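A sketch of __pow__ with the power rule in its backward closure, plus division rewritten as multiplication by a negative power:

```python
# sketch: __pow__ (constant int/float powers only) and division via it
def __pow__(self, other):
    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
    out = Value(self.data**other, (self, ), f'**{other}')

    def _backward():
        # power rule: d/dx x**n = n * x**(n-1), chained with out.grad
        self.grad += other * self.data**(other - 1) * out.grad
    out._backward = _backward

    return out

def __truediv__(self, other):  # self / other
    return self * other**-1
```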
so the forward pass works but does the backward pass work and i realized that we actually also
link |
have to know how to subtract so right now a minus b will not work to make it work we need one more
link |
piece of code here and basically this is the subtraction and the way we're going to implement
link |
subtraction is we're going to implement it by addition of a negation and then to implement
link |
negation we're going to multiply by negative one so just again using the stuff we've already built
link |
and just expressing it in terms of what we have and a minus b is now working
link |
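And the two small pieces for subtraction, expressed entirely through operations we already have:

```python
# sketch: negation and subtraction built from existing Value operations
def __neg__(self):          # -self
    return self * -1

def __sub__(self, other):   # self - other
    return self + (-other)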
okay so now let's scroll again to this expression here for this neuron and let's just uh compute the backward
link |
pass here once we've defined o and let's draw it so here's the gradients for all these leaf
link |
nodes for this two-dimensional neuron that has a tanh that we've seen before so now what i'd like
link |
to do is i'd like to break up this tanh into this expression here so let me copy paste this here
link |
and now instead of we'll preserve the label and we will change how we define o so in particular
link |
we're going to implement this formula here so we need e to the 2x minus 1 over e to the 2x plus 1
link |
so e to the 2x we need to take 2 times n and we need to exponentiate it that's e to the 2x
link |
and then because we're using it twice let's create an intermediate variable e and then define o as
link |
e minus 1 over e plus 1 and that should be it
link |
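A sketch of that decomposed forward pass, where n is the Value holding the neuron's pre-activation from above:

```python
# sketch: tanh written out in terms of exp, +, - and /
e = (2 * n).exp()        # e to the 2x
o = (e - 1) / (e + 1)    # (e**2x - 1) / (e**2x + 1), which equals tanh(x)
o.label = 'o'
```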
and then we should be able to draw dot of o so now before i run this what do we expect to see number one we're
link |
expecting to see a much longer graph here because we've broken up tanh into a bunch of other
link |
operations but those operations are mathematically equivalent and so what we're expecting to see is
link |
number one the same result here so the forward pass works and number two because of that
link |
mathematical equivalence we expect to see the same backward pass and the same gradients on these
link |
leaf nodes so these gradients should be identical so let's run this so number one let's verify that
link |
instead of a single tanh node we have now exp and we have plus we have times negative one this is
link |
the division and we end up with the same forward pass here and then the gradients we have to be
link |
careful because they're in slightly different order potentially the gradients for w2x2 should
link |
be 0 and 0.5 w2 and x2 are 0 and 0.5 and w1x1 are 1 and negative 1.5 1 and negative 1.5 so that means
link |
that both our forward passes and backward passes were correct because this turned out to be
link |
equivalent to tanh before and so the reason I wanted to go through this exercise is number one
link |
we got to practice a few more operations and writing more backwards passes and number two
link |
I wanted to illustrate the point that the level at which you implement your operations is totally
link |
up to you you can implement backward passes for tiny expressions like a single individual plus or
link |
a single times or you can implement them for say tanh which is a kind of a potentially you can see
link |
it as a composite operation because it's made up of all these more atomic operations but really all
link |
of this is kind of like a fake concept all that matters is we have some kind of inputs and some
link |
kind of an output and this output is a function of the inputs in some way and as long as you can
link |
do forward pass and the backward pass of that little operation it doesn't matter what that
link |
operation is and how composite it is if you can write the local gradients you can chain the
link |
gradient and you can continue back propagation so the design of what those functions are is
link |
completely up to you so now I would like to show you how you can do the exact same thing
link |
but using a modern deep neural network library like for example PyTorch which I've roughly modeled
link |
micrograd by and so PyTorch is something you would use in production and I'll show you how you can
link |
do the exact same thing but in PyTorch API so I'm just going to copy paste it in and walk you through
link |
it a little bit this is what it looks like so we're going to import PyTorch and then we need to
link |
define these value objects like we have here now micrograd is a scalar valued engine so we only have
link |
scalar values like 2.0 but in PyTorch everything is based around tensors and like I mentioned
link |
tensors are just n-dimensional arrays of scalars so that's why things get a little bit more
link |
complicated here I just need a scalar valued tensor a tensor with just a single element
link |
but by default when you work with PyTorch you would use more complicated tensors like this
link |
so if I import PyTorch then I can create tensors like this and this tensor for example is a 2x3
link |
array of scalars in a single compact representation so we can check its shape
link |
we see that it's a 2x3 array and so on so this is usually what you would work with in the actual
link |
libraries so here I'm creating a tensor that has only a single element 2.0 and then I'm casting it
link |
to be double because Python is by default using double precision for its floating point numbers so
link |
I'd like everything to be identical by default the data type of these tensors will be float 32
link |
so it's only using a single precision float so I'm casting it to double so that we have float 64
link |
just like in Python so I'm casting to double and then we get something similar to value of two
link |
the next thing I have to do is because these are leaf nodes by default PyTorch assumes that they
link |
do not require gradients so I need to explicitly say that all of these nodes require gradients
link |
so this is going to construct scalar valued one element tensors make sure that PyTorch knows that
link |
they require gradients now by default these are set to false by the way because of efficiency
link |
reasons because usually you would not want gradients for leaf nodes like the inputs to the
link |
network and this is just trying to be efficient in the most common cases so once we've defined
link |
all of our values in PyTorch land we can perform arithmetic just like we can here in micrograd land
link |
so this will just work and then there's a torch.tanh also and what we get back is a tensor
link |
again and we can just like in micrograd it's got a data attribute and it's got grad attributes
link |
so these tensor objects just like in micrograd have a dot data and a dot grad and we can
link |
run this and the only difference here from micrograd is that we need to call dot item because
link |
otherwise PyTorch dot item basically takes a single tensor of one element and it just returns
link |
that element stripping out the tensor so let me just run this and hopefully we are going to get
link |
this is going to print the forward pass which is 0.707 and this will be the gradients which
link |
are 0.5 0 negative 1.5 and 1 so if we just run this there we go 0.7 so the forward pass agrees
link |
and then 0.5 0 negative 1.5 and 1 so PyTorch agrees with us and just to show you here basically
link |
oh here's a tensor with a single element and it's a double and we can call dot item on it to just
link |
get the single number out so that's what item does and o is a tensor object like I mentioned and it's got
link |
a backward function just like we've implemented and then all of these also have a dot grad so
link |
like x2 for example has a grad and it's a tensor and we can pop out the individual number with dot
link |
item so basically Torch can do what we did in micrograd as a special case when your
link |
tensors are all single element tensors but the big deal with PyTorch is that everything is
link |
significantly more efficient because we are working with these tensor objects and we can do
link |
lots of operations in parallel on all of these tensors but otherwise what we've built very much agrees with the API of PyTorch
link |
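A sketch of that PyTorch version of the single neuron above; the exact bias digits are approximate, and the printed gradients should match the micrograd ones:

```python
import torch

# scalar, double-precision tensors mirroring the micrograd example above
x1 = torch.tensor([2.0]).double();  x1.requires_grad = True
x2 = torch.tensor([0.0]).double();  x2.requires_grad = True
w1 = torch.tensor([-3.0]).double(); w1.requires_grad = True
w2 = torch.tensor([1.0]).double();  w2.requires_grad = True
b  = torch.tensor([6.8813735870195432]).double(); b.requires_grad = True

n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())   # forward pass, roughly 0.7071
o.backward()
print(x2.grad.item(), w2.grad.item(), x1.grad.item(), w1.grad.item())
# roughly 0.5, 0.0, -1.5, 1.0
```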
okay so now that we have some machinery to build out pretty
link |
complicated mathematical expressions we can also start building up neural nets and as I mentioned
link |
neural nets are just a specific class of mathematical expressions so we're going to
link |
start building out a neural net piece by piece and eventually we'll build out a two-layer
link |
multi-layer perceptron as it's called and I'll show you exactly what that means let's start
link |
with a single individual neuron we've implemented one here but here I'm going to implement one that
link |
also subscribes to the PyTorch API in how it designs its neural network modules so just like
link |
we saw that we can like match the API of PyTorch on the autograd side we're going to try to do
link |
that on the neural network modules so here's class neuron and just for the sake of efficiency
link |
I'm going to copy paste some sections that are relatively straightforward
link |
so the constructor will take number of inputs to this neuron which is how many inputs come to a
link |
neuron so this one for example is three inputs and then it's going to create a weight that is some
link |
random number between negative one and one for every one of those inputs and a bias that controls
link |
the overall trigger happiness of this neuron and then we're going to implement a def underscore
link |
underscore call of self and x some input x and really what we want to do here is w times
link |
x plus b where w times x here is a dot product specifically now if you haven't seen call let me
link |
just return 0.0 here from now the way this works now is we can have an x which is say like 2.0 3.0
link |
then we can initialize a neuron that is two-dimensional because these are two
link |
numbers and then we can feed those two numbers into that neuron to get an output and so when
link |
you use this notation n of x python will use call so currently call just returns 0.0
link |
now we'd like to actually do the forward pass of this neuron instead so what we're going to
link |
do here first is we need to basically multiply all of the elements of w with all of the elements
link |
of x pairwise we need to multiply them so the first thing we're going to do is we're going to
link |
zip up self dot w and x and in python zip takes two iterators and it creates a new iterator
link |
that iterates over the tuples of their corresponding entries
link |
so for example just to show you we can print this list and still return 0.0 here
link |
so we see that these w's are paired up with the x's w with x
link |
and now what we want to do is
link |
um for w i x i in this zip we want to multiply w i times x i and then we want to sum all of that
link |
together to come up with an activation and add also self dot b on top so that's the raw activation
link |
and then of course we need to pass that through a non-linearity so what we're going to be returning
link |
is act dot 10h and here's out so now we see that we are getting some outputs and we get a different
link |
output from a neuron each time because we are initializing different weights and biases and then
link |
to be a bit more efficient here actually sum by the way takes a second optional parameter which
link |
is the start and by default the start is zero so these elements of this sum will be added on top
link |
of zero to begin with but actually we can just start with self dot b and then we just have an
link |
expression like this and then the generator expression here must be parenthesized in python
link |
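A sketch of the Neuron class as described, assuming the Value class from the first part of the lecture:

```python
import random

class Neuron:
    def __init__(self, nin):
        # one random weight per input, plus a bias controlling the trigger happiness
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # w . x + b, squashed with tanh
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

# e.g. x = [2.0, 3.0]; n = Neuron(2); n(x) gives a Value between -1 and 1
```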
yep so now we can forward a single neuron next up we're going to define a layer of neurons so here
link |
we have a schematic for a mlp so we see that these mlps each layer this is one layer has actually a
link |
number of neurons and they're not connected to each other but all of them are fully connected
link |
to the input so what is a layer of neurons it's just it's just a set of neurons evaluated
link |
independently so in the interest of time i'm going to do something fairly straightforward here
link |
it's literally a layer is just a list of neurons and then how many neurons do we have we take that
link |
as an input argument here how many neurons do you want in your layer number of outputs in this layer
link |
and so we just initialize completely independent neurons with this given dimensionality and we
link |
call on it we just independently evaluate them so now instead of a neuron we can make a layer
link |
of neurons they are two-dimensional neurons and let's have three of them and now we see that we
link |
have three independent evaluations of three different neurons right okay and finally let's
link |
complete this picture and define an entire multi-layer perceptron or mlp and as we can
link |
see here in an mlp these layers just feed into each other sequentially so let's come here and
link |
i'm just going to copy the code here in interest of time so an mlp is very similar to a multi-layer
link |
we're taking the number of inputs as before but now instead of taking a single nout which is
link |
number of neurons in a single layer we're going to take a list of nouts and this list defines the
link |
sizes of all the layers that we want in our mlp so here we just put them all together and then
link |
iterate over consecutive pairs of these sizes and create layer objects for them and then in the call
link |
function we are just calling them sequentially so that's an mlp really and let's actually
link |
re-implement this picture so we want three input neurons and then two layers of four and an output
link |
unit so we want a three-dimensional input say this is an example input we want three inputs
link |
into two layers of four and one output and this of course is an mlp and there we go that's a
link |
forward pass of an mlp to make this a little bit nicer you see how we have just a single element
link |
but it's wrapped in a list because layer always returns lists so for convenience return outs at
link |
zero if len of outs is exactly one else return the full list and this will allow us to just get
link |
a single value out at the last layer that only has a single neuron and finally we should be able to
link |
draw dot of n of x and as you might imagine these expressions are now getting relatively involved
link |
so this is an entire mlp that we're defining now all the way until a single output
link |
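A sketch of the Layer and MLP classes just described; the example input values at the bottom are only illustrative:

```python
class Layer:
    def __init__(self, nin, nout):
        # a layer is just nout independent neurons, each taking nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        # nouts lists the size of each layer; layers feed into each other sequentially
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = [2.0, 3.0, -1.0]      # an illustrative three-dimensional input
n = MLP(3, [4, 4, 1])     # three inputs, two layers of four, one output
print(n(x))
```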
okay and so obviously you would never differentiate
link |
on pen and paper these expressions but with micrograd we will be able to back propagate
link |
all the way through this and back propagate into these weights of all these neurons so let's see
link |
how that works okay so let's create ourselves a very simple example data set here so this data
link |
set has four examples and so we have four possible inputs into the neural net and we have four
link |
desired targets so we'd like the neural net to assign or output 1.0 when it's fed this example
link |
negative one when it's fed these examples and one when it's fed this example so it's a very simple
link |
binary classifier neural net basically that we would like here now let's think what the neural
link |
net currently thinks about these four examples we can just get their predictions basically we
link |
can just call n of x for x in xs and then we can print so these are the outputs of the neural net
link |
on those four examples so the first one is 0.91 but we'd like it to be one so we should push this
link |
one higher this one we want to be higher this one says 0.88 and we want this to be negative one
link |
this is 0.88 we want it to be negative one and this one is 0.88 we want it to be one
link |
so how do we make the neural net and how do we tune the weights to better predict the desired
link |
targets and the trick used in deep learning to achieve this is to calculate a single number that
link |
somehow measures the total performance of your neural net and we call this single number the loss
link |
so the loss first is a single number that we're going to define that basically measures how well
link |
the neural net is performing right now we have the intuitive sense that it's not performing very well
link |
because we're not very much close to this so the loss will be high and we'll want to minimize the
link |
loss so in particular in this case what we're going to do is we're going to implement the mean
link |
squared error loss so what this is doing is we're going to basically iterate for y ground truth
link |
and y output in zip of ys and ypred so we're going to pair up the ground truths with the
link |
predictions and the zip iterates over tuples of them and for each y ground truth and y output
link |
we're going to subtract them and square them so let's first see what these losses are these are
link |
individual loss components and so basically for each one of the four we are taking the
link |
prediction and the ground truth we are subtracting them and squaring them so because this one is so
link |
close to its target 0.91 is almost one subtracting them gives a very small number so here we would
link |
get like a negative 0.1 and then squaring it just makes sure that regardless of whether we
link |
are more negative or more positive we always get a positive number instead of squaring we should
link |
we could also take for example the absolute value we need to discard the sign and so you
link |
see that the expression is arranged so that you only get zero exactly when y out is equal to y
link |
ground truth when those two are equal so your prediction is exactly the target you are going
link |
to get zero and if your prediction is not the target you are going to get some other number
link |
so here for example we are way off and so that's why the loss is quite high and the more off we are
link |
the greater the loss will be so we don't want high loss we want low loss and so the final
link |
loss here will be just the sum of all of these numbers so you see that this should be zero
link |
roughly plus zero roughly but plus seven so loss should be about seven here and now we want to
link |
minimize the loss we want the loss to be low because if loss is low then every one of the
link |
predictions is equal to its target so the loss the lowest it can be is zero and the greater it is the worse off the neural net is predicting
link |
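A sketch of that loss computation; the input values below are placeholders, while the four targets follow the description above, and passing Value(0.0) as the start of sum keeps the whole expression inside Value land:

```python
# illustrative dataset: four 3-dimensional inputs, four desired targets
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]

ypred = [n(x) for x in xs]   # forward the MLP on every example
# squared error summed over the examples: zero only when every prediction hits its target
loss = sum(((yout - ygt)**2 for ygt, yout in zip(ys, ypred)), Value(0.0))
print(loss.data)
loss.backward()              # fills in .grad on every node, including the weights
```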
so now of course if we do loss dot backward something magical
link |
happened when i hit enter and the magical thing of course that happened is that we can look at
link |
n dot layers at say like the first layer dot neurons at zero
link |
because remember that mlp has the layers which is a list and each layer has neurons which is a list
link |
and that gives us an individual neuron and then it's got some weights
link |
and so we can for example look at the weights at zero
link |
um oops it's not called weights it's called w and that's a value but now this value also has a grad
link |
because of the backward pass and so we see that because this gradient here on this particular
link |
weight of this particular neuron of this particular layer is negative we see that its influence on the
link |
loss is also negative so slightly increasing this particular weight of this neuron of this layer
link |
would make the loss go down and we actually have this information for every single one of our
link |
neurons and all their parameters actually it's worth looking at also a draw dot of loss by the
link |
way so previously we looked at the draw dot of a single neuron neuron forward pass and that was
link |
already a large expression but what is this expression we actually forwarded every one of
link |
those four examples and then we have the loss on top of them with the mean squared error and so
link |
this is a really massive graph because this graph that we've built up now oh my gosh this graph that
link |
we've built up now which is kind of excessive it's excessive because it has four forward passes of a
link |
neural net for every one of the examples and then it has the loss on top and it ends with the value
link |
of the loss which was 7.12 and this loss will now back propagate through all the forward passes
link |
all the way through just every single intermediate value of the neural net all the way back to of
link |
course the parameters of the weights which are the input so these weight parameters here are inputs
link |
to this neural net and these numbers here these scalars are inputs to the neural net so if we
link |
went around here we will probably find some of these examples this 1.0 potentially maybe this
link |
1.0 or you know some of the others and you'll see that they all have gradients as well the thing is
link |
these gradients on the input data are not that useful to us and that's because the input data
link |
seems to be not changeable it's it's a given to the problem and so it's a fixed input we're not
link |
going to be changing it or messing with it even though we do have gradients for it but some of
link |
these gradients here will be for the neural network parameters the w's and the b's and those
link |
we of course we want to change okay so now we're going to want some convenience codes to gather up
link |
all the parameters of the neural net so that we can operate on all of them simultaneously and
link |
every one of them we will nudge a tiny amount based on the gradient information so let's collect the
link |
parameters of the neural net all in one array so let's create a parameters of self that just returns
link |
uh self.w which is a list concatenated with a list of self.b
link |
so this will just return a list list plus list just you know gives you a list so that's parameters
link |
of neuron and i'm calling it this way because also pytorch has parameters on every single
link |
nn module and it does exactly what we're doing here it just returns the parameter tensors
link |
for us it's the parameter scalars now layer is also a module so it will have parameters
link |
self and basically what we want to do here is something like this like
link |
uh params is here and then for neuron in self.neurons
link |
we want to get neuron.parameters and we want to params.extend
link |
right so these are the parameters of this neuron and then we want to put them on top of params so
link |
params.extend of piece and then we want to return params so this there's way too much code
link |
so actually there's a way to simplify this which is return p for neuron in self.neurons
link |
for p in neuron.parameters so it's a single list comprehension in python you can sort of nest them
link |
like this and you can then create the desired array so this is these are identical we can
link |
take this out and then let's do the same here def parameters self and return a parameter for layer
link |
in self.layers for p in layer.parameters and that should be good
link |
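A sketch of the three parameters methods, mirroring how PyTorch modules expose their parameters; the ellipses stand for the constructors and call methods already defined above:

```python
class Neuron:
    ...  # __init__ and __call__ as above
    def parameters(self):
        return self.w + [self.b]

class Layer:
    ...  # __init__ and __call__ as above
    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    ...  # __init__ and __call__ as above
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```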
now let me pop out this so we don't reinitialize our network because we need to reinitialize our
link |
okay so unfortunately we will have to probably reinitialize the network because we just
link |
added functionality to this class and of course i want to get all of n dot parameters but
link |
that's not going to work because this is the old class okay so unfortunately we do have to
link |
reinitialize the network which will change some of the numbers but let me do that so that we pick
link |
up the new api we can now do n dot parameters and these are all the weights and biases inside
link |
the entire neural net so in total this mlp has 41 parameters and now we'll be able to change them
link |
if we recalculate the loss here we see that unfortunately we have slightly different
link |
predictions and slightly different loss but that's okay okay so we see that this neurons
link |
gradient is slightly negative we can also look at its data right now which is 0.85 so this is
link |
the current value of this neuron and this is its gradient on the loss so what we want to do now is
link |
we want to iterate for every p in n dot parameters so for all the 41 parameters in this neural net
link |
we actually want to change p dot data slightly according to the gradient information okay so
link |
dot dot to do here but this will be basically a tiny update in this gradient descent scheme
link |
and in gradient descent we are thinking of the gradient as a vector pointing in the direction of
link |
increased loss and so in gradient descent we are modifying p dot data by a small step size
link |
in the direction of the gradient so the step size as an example could be like a very small number
link |
like 0.01 is the step size times p dot grad right but we have to think through some of the signs
link |
here so in particular working with this specific example here we see that if we just left it like
link |
this then this neuron's value would be currently increased by a tiny amount of the gradient
link |
the gradient is negative so this value of this neuron would go slightly down it would become
link |
like 0.8 you know 4 or something like that but if this neuron's value goes lower that would actually
link |
increase the loss that's because the derivative of this neuron is negative so increasing this makes
link |
the loss go down so increasing it is what we want to do instead of decreasing it so basically what
link |
we're missing here is we're actually missing a negative sign and again this other interpretation
link |
and that's because we want to minimize the loss we don't want to maximize the loss we want to
link |
decrease it and the other interpretation as i mentioned is you can think of the gradient
link |
vector so basically just the vector of all the gradients as pointing in the direction of increasing
link |
the loss but then we want to decrease it so we actually want to go in the opposite direction
link |
and so you can convince yourself that this sort of like that's the right thing here with this
link |
the right thing here with a negative because we want to minimize the loss
link |
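So the update itself is one line per parameter, sketched here with an illustrative step size of 0.01:

```python
# one gradient descent step: nudge every parameter against its gradient
for p in n.parameters():
    p.data += -0.01 * p.grad   # minus sign because we want the loss to go down
```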
so if we nudge all the parameters by a tiny amount
link |
then we'll see that this data will have changed a little bit so now this neuron is a tiny amount
link |
greater value so 0.854 went to 0.857 and that's a good thing because slightly increasing this neuron
link |
data makes the loss go down according to the gradient and so the correcting has happened
link |
sign-wise and so now what we would expect of course is that because we've changed all these
link |
parameters we expect that the loss should have gone down a bit so we want to re-evaluate the
link |
loss let me basically this is just a data definition that hasn't changed but the forward
link |
pass here of the network we can recalculate and actually let me do it outside here so that we can
link |
compare the two loss values so here if i recalculate the loss we'd expect the new loss
link |
now to be slightly lower than this number so hopefully what we're getting now is a tiny bit
link |
lower than 4.84 4.36 okay and remember the way we've arranged this is that low loss means that our
link |
predictions are matching the targets so our predictions now are probably slightly closer
link |
to the targets and now all we have to do is we have to iterate this process so again we've done
link |
the forward pass and this is the loss now we can call loss dot backward let me take these out and we
link |
can do a step size and now we should have a slightly lower loss 4.36 goes to 3.9 and okay so
link |
we've done the forward pass here's the backward pass nudge and now the loss is 3.66
link |
3.47 and you get the idea we just continue doing this and this is gradient descent we're just
link |
iteratively doing forward pass backward pass update forward pass backward pass update and
link |
the neural net is improving its predictions so here if we look at ypred now ypred
link |
we see that this value should be getting closer to one so this value should be getting more
link |
positive these should be getting more negative and this one should be also getting more positive
link |
so if we just iterate this a few more times actually we may be able to afford to go a bit
link |
faster let's try a slightly higher learning rate oops okay there we go so now we're at 0.31
link |
if you go too fast by the way if you try to make it too big of a step you may actually overstep
link |
it's a kind of overconfidence because again remember we don't actually know the loss function exactly
link |
the loss function has all kinds of structure and we only know about the very local dependence of
link |
all these parameters on the loss but if we step too far we may step into you know a part of the
link |
loss that is completely different and that can destabilize training and make your loss actually
link |
blow up even so the loss is now 0.04 so actually the predictions should be really
link |
quite close let's take a look so you see how this is almost one almost negative one almost one
link |
we can continue going so yep backward update oops there we go so we went way too fast and
link |
we actually overstepped so we got a little too eager where are we now okay we're at seven e negative
link |
nine so this is very very low loss and the predictions are basically perfect so somehow we
link |
basically we were doing way too big updates and we briefly exploded but then somehow we ended up
link |
getting into a really good spot so usually this learning rate and the tuning of it is a subtle
link |
art you want to set your learning rate if it's too low you're going to take way too long to converge
link |
but if it's too high the whole thing gets unstable and you're not going to be able to
link |
converge and you might actually even explode the
link |
loss depending on your loss function so finding the step size to be just right is a pretty
link |
subtle art sometimes when you're using sort of vanilla gradient descent but we happen to get into
link |
a good spot we can look at n dot parameters so this is the setting of weights and biases that
link |
makes our network predict the desired targets very very close and basically we've successfully
link |
trained a neural net okay let's make this a tiny bit more respectable and implement an actual
link |
training loop and what that looks like so this is the data definition that stays this is the forward
link |
pass so for k in range you know we're going to take a bunch of steps first you do the forward
link |
pass we evaluate the loss let's reinitialize the neural net from scratch and here's the data
link |
and we first do forward pass then we do the backward pass and then we do an update that's
link |
gradient descent and then we should be able to iterate this and we should be able to print the
link |
current step the current loss let's just print the step number and the value of the loss and then we should
link |
be good and that should be it and then the learning rate 0.01 is a little too small 0.1
link |
we saw is like a little bit dangerously too high let's go somewhere in between and we'll optimize
link |
this for not 10 steps but let's go for say 20 steps let me erase all of this junk
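a sketch of that loop, with a learning rate somewhere in between (say 0.05, an illustrative choice) and 20 steps, might look like this; note that as we're about to see this version is still missing one important step:

```python
for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    loss.backward()

    # update (gradient descent)
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)
```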
link |
and let's run the optimization and you see how we've actually converged slower
link |
in a more controlled manner and got to a loss that is very low so I expect ypred to be quite good
link |
and that's it okay so this is kind of embarrassing but we actually have a really terrible bug
link |
in here and it's a subtle bug and it's a very common bug and I can't believe I've done it for
link |
the 20th time in my life especially on camera and I could have re-shot the whole thing but I think
link |
it's pretty funny and you know you get to appreciate a bit what working with neural nets
link |
maybe is like sometimes we are guilty of a common bug I've actually tweeted the most common neural
link |
net mistakes a long time ago now and I'm not really going to explain any of these except for
link |
we are guilty of number three you forgot to zero grad before dot backward what is that
link |
basically what's happening and it's a subtle bug and I'm not sure if you saw it is that all of these
link |
weights here have a dot data and a dot grad and dot grad starts at zero
link |
and then we do backward and we fill in the gradients and then we do an update on the data
link |
but we don't flush the grad it stays there so when we do the second forward pass and we do backward
link |
again remember that all the backward operations do a plus equals on the grad and so these gradients
link |
just add up and they never get reset to zero so basically we didn't zero grad so here's how we
link |
zero grad before backward we need to iterate over all the parameters and we need to make sure that
link |
p dot grad is set to zero we need to reset it to zero just like it is in the constructor so remember
link |
all the way here for all these value nodes grad is reset to zero and then all these backward passes
link |
do a plus equals from that grad but we need to make sure that we reset the values to zero
link |
so that when we do backward all of them start at zero and the actual backward pass accumulates
link |
the loss derivatives into the grads so this is the equivalent of zero grad in pytorch
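sketched in code, the fix is simply to flush the gradients at the top of each iteration, before calling backward:

```python
# zero grad: reset every gradient before the backward pass
# so that gradients from previous iterations don't accumulate
for p in n.parameters():
    p.grad = 0.0
loss.backward()
```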
link |
with this fix we'll get a slightly different optimization let's reset the neural net the data is the same
link |
this is now I think correct and we get a much
link |
slower descent we still end up with pretty good results and we can continue this a bit more
link |
to get down lower and lower and lower yeah so the only reason that the previous thing worked
link |
it's extremely buggy the only reason that worked is that this is a very very simple problem
link |
and it's very easy for this neural net to fit this data and so the grads ended up accumulating
link |
and it effectively gave us a massive step size and it made us converge extremely fast
link |
but basically now we have to do more steps to get to very low values of loss and get
link |
ypred to be really good we can try to step a bit greater and we can get a much better result
link |
we can try to step a bit greater yeah we're going to get closer and closer to one minus one and one
link |
so working with neural nets is sometimes tricky because
link |
you may have lots of bugs in the code and your network might actually work just like ours worked
link |
but chances are is that if we had a more complex problem then actually this bug would have made
link |
us not optimize the loss very well and we were only able to get away with it because
link |
the problem is very simple so let's now bring everything together and summarize what we learned
link |
what are neural nets neural nets are these mathematical expressions fairly simple
link |
mathematical expressions in the case of a multi-layer perceptron that take the data as
link |
input and they also take as input the weights the parameters of the neural net
link |
and it's a mathematical expression for the forward pass followed by a loss function and the loss function
link |
tries to measure the accuracy of the predictions and usually the loss will be low when your
link |
predictions are matching your targets or where the network is basically behaving well so we
link |
manipulate the loss function so that when the loss is low the network is doing what you
link |
want it to do on your problem and then we backward the loss using back propagation to get the gradient
link |
and then we know how to tune all the parameters to decrease the loss locally
link |
but then we have to iterate that process many times in what's called gradient descent
link |
so we simply follow the gradient information and that minimizes the loss and the loss is
link |
arranged so that when the loss is minimized the network is doing what you want it to do
link |
and yeah so we just have a blob of neural stuff and we can make it do arbitrary things
link |
and that's what gives neural nets their power it's you know this is a very tiny network with 41
link |
parameters but you can build significantly more complicated neural nets with billions
link |
at this point almost trillions of parameters and it's a massive blob of neural tissue simulated
link |
neural tissue roughly speaking and you can make it do extremely complex problems and these neural
link |
nets then have all kinds of very fascinating emergent properties when you try to make them
link |
do significantly hard problems as in the case of gpt for example we have massive amounts of text
link |
from the internet and we're trying to get a neural net to take like a few words and try
link |
to predict the next word in a sequence that's the learning problem and it turns out that when you
link |
train this on all of the internet the neural net actually has like really remarkable emergent
link |
properties but that neural net would have hundreds of billions of parameters but it works on
link |
fundamentally the exact same principles the neural net of course will be a bit more complex but
link |
otherwise evaluating the gradient is there and would be identical and the gradient descent would
link |
be there and would be basically identical but people usually use slightly different updates
link |
this is a very simple stochastic gradient descent update and the loss function would not be a mean
link |
squared error they would be using something called the cross entropy loss for predicting the next
link |
token so there's a few more details but fundamentally the neural network setup and
link |
neural network training is identical and pervasive and now you understand intuitively how that works
link |
under the hood in the beginning of this video i told you that by the end of it you would understand
link |
everything in micrograd and then we'd slowly build it up let me briefly prove that to you
link |
so i'm going to step through all the code that is in micrograd as of today actually potentially
link |
some of the code will change by the time you watch this video because i intend to continue
link |
developing micrograd but let's look at what we have so far at least init.py is empty when you
link |
go to engine.py that has the value everything here you should mostly recognize so we have the
link |
dot data and dot grad attributes we have the backward function we have the previous set of children
link |
and the operation that produced this value we have addition multiplication and raising to a
link |
scalar power we have the relu non-linearity which is a slightly different type of non-linearity than
link |
tanh that we used in this video both of them are non-linearities and notably tanh is not actually
link |
present in micrograd as of right now but i intend to add it later we have the backward which is
link |
identical and then all of these other operations which are built up on top of operations here
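since relu is the one non-linearity in engine.py that we didn't build in this video, here is roughly what it looks like, written in the same style as our tanh (a paraphrase, not the verbatim source):

```python
# relu written as a method on the Value class, following the same
# pattern as the other operations: forward value plus a _backward closure
def relu(self):
    out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

    def _backward():
        # local derivative of relu is 1 where the output is positive, 0 otherwise
        self.grad += (out.data > 0) * out.grad
    out._backward = _backward

    return out
```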
link |
so value should be very recognizable except for the non-linearity used in this video
link |
there's no massive difference between relu and tanh and sigmoid and these other non-linearities
link |
they're all roughly equivalent and can be used in mlps so i use tanh because it's a bit smoother
link |
and because it's a little bit more complicated than relu and therefore it stressed a little
link |
bit more the local gradients and working with those derivatives which i thought would be useful
link |
nn.py is the neural networks library as i mentioned so you should recognize identical
link |
implementation of neuron layer and mlp notably or not so much we have a class module here which is a
link |
parent class of all these modules i did that because there's an nn dot module class in pytorch
link |
and so this exactly matches that api and nn dot module in pytorch also has a zero grad which i refactored
link |
out here so that's the end of micrograd really then there's a test which you'll see basically
link |
creates two chunks of code one in micrograd and one in pytorch and we'll make sure that the forward
link |
and the backward passes agree identically for a slightly less complicated expression and a slightly
link |
more complicated expression everything agrees so we agree with pytorch on all of these operations
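the flavor of that test is roughly the following (a simplified sketch, not the exact expressions in the real test file, and it assumes micrograd and pytorch are installed):

```python
import torch
from micrograd.engine import Value

# micrograd side
x = Value(-4.0)
z = 2 * x + 2 + x
y = (z * z).relu()
y.backward()

# pytorch side, same expression
xt = torch.tensor(-4.0, requires_grad=True)
zt = 2 * xt + 2 + xt
yt = (zt * zt).relu()
yt.backward()

# forward and backward passes should agree
assert abs(y.data - yt.item()) < 1e-6
assert abs(x.grad - xt.grad.item()) < 1e-6
```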
link |
and finally there's a demo dot ipynb here and it's a bit more complicated binary classification
link |
demo than the one i covered in this lecture so we only had a tiny data set of four examples
link |
here we have a bit more complicated example with lots of blue points and lots of red points
link |
and we're trying to again build a binary classifier to distinguish two-dimensional
link |
points as red or blue it's a bit more complicated mlp here it's a bigger
link |
mlp and the loss is a bit more complicated because it supports batches so because our data set was
link |
so tiny we always did a forward pass on the entire data set of four examples but when your data set
link |
is like a million examples what we usually do in practice is we basically pick out some random
link |
subset we call that a batch and then we only process the batch forward backward and update
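a hypothetical sketch of that batching idea (the names xs, ys and the batch size here are just for illustration):

```python
import random

batch_size = 32
# pick a random subset of the training examples
idx = random.sample(range(len(xs)), batch_size)
xb = [xs[i] for i in idx]
yb = [ys[i] for i in idx]
# then do the forward pass, backward pass and update on (xb, yb) only
```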
link |
so we don't have to forward the entire training set
link |
so this supports batching because there's
link |
a lot more examples here we do a forward pass the loss is slightly different this is a
link |
max margin loss that i implement here the one that we used was the mean squared error loss
link |
because it's the simplest one there's also the binary cross entropy loss all of them can be used
link |
for binary classification and they don't make too much of a difference in the simple examples that we looked at so far
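roughly, those two flavors of loss look like this, assuming targets yb in {-1, +1} and raw scores from the model (a sketch of the idea, not the exact code in the demo):

```python
# mean squared error style loss, as used in this lecture
mse_loss = sum((scorei - yi)**2 for yi, scorei in zip(yb, scores))

# max-margin (hinge) style loss: penalize scores that are not
# comfortably on the correct side of the decision boundary
hinge_loss = sum((1 + -yi * scorei).relu() for yi, scorei in zip(yb, scores))
```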
link |
there's also something called l2 regularization used here this has to do with
link |
generalization of the neural net and controls overfitting in a machine learning setting
link |
but i did not cover these concepts in this video potentially later and the training
link |
loop you should recognize so forward backward with zero grad and update and so on you'll
link |
notice that in the update here the learning rate is scaled as a function of number of iterations
link |
and it shrinks and this is something called learning rate decay so in the beginning you
link |
have a high learning rate and as the network sort of stabilizes near the end you bring down the
link |
learning rate to get some of the fine details in the end
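a sketch of that idea (the exact schedule in the demo may differ, and model here is just a stand-in name):

```python
for k in range(100):
    # ... forward pass, zero grad, backward pass ...
    # learning rate decay: start high and shrink linearly over the run
    learning_rate = 1.0 - 0.9 * k / 100
    for p in model.parameters():
        p.data -= learning_rate * p.grad
```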
link |
and in the end we see the decision surface of the neural net and we see that it learned to separate out the red and the blue
link |
area based on the data points so that's the slightly more complicated example in the demo
link |
demo dot ipynb that you're free to go over but yeah as of today that is micrograd i also wanted
link |
to show you a little bit of real stuff so that you get to see how this is actually implemented
link |
in a production grade library like pytorch so in particular i wanted to show i wanted to find and
link |
show you the backward pass for tanh in pytorch so here in micrograd we see that the backward
link |
pass for tanh is 1 minus t squared where t is the output of tanh of x times out dot grad
link |
which is the chain rule so we're looking for something that looks like this in the pytorch code base
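as a reminder, that backward pass in our micrograd code looked roughly like this:

```python
import math

# tanh as a method on the Value class, as built in this lecture
def tanh(self):
    x = self.data
    t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        # local derivative of tanh is 1 - t**2, chained with out.grad
        self.grad += (1 - t**2) * out.grad
    out._backward = _backward

    return out
```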
link |
now pytorch has an open source github code base so i went and looked through a lot of its code and honestly i spent
link |
about 15 minutes and i couldn't find tanh and that's because these libraries unfortunately
link |
they grow in size and entropy and if you just search for tanh you get apparently 2800 results
link |
in 406 files so i don't know what these files are doing honestly and why there are so many
link |
mentions of tanh but unfortunately these libraries are quite complex they're meant to be used not
link |
really inspected eventually i did stumble on someone who tries to change the tanh backward
link |
code for some reason and someone here pointed to the cpu kernel and the cuda kernel for tanh backward
link |
so this basically depends on whether you're using pytorch on a cpu device or on a gpu
link |
which are different devices and i haven't covered this but this is the tanh backward kernel
link |
for cpu and the reason it's so large is that number one this is like if you're using a complex
link |
type which we haven't even talked about if you're using a specific data type of bfloat 16 which we
link |
haven't talked about and then if you're not then this is the kernel and deep here we see something
link |
that resembles our backward pass so they have a times one minus b square so this b here must be
link |
the output of the tanh and this is the out dot grad so here we found it deep inside pytorch in this
link |
location for some reason inside the binary ops kernel even though tanh is not actually a binary op and then this
link |
is the gpu kernel and if we're not using complex types we're here and here we go with one line of code so we did
link |
find it but basically unfortunately these code bases are very large and micrograd is very very
link |
simple but if you actually want to use real stuff finding the code for it you'll actually find that
link |
difficult i also wanted to show you a little example here where pytorch is showing you how
link |
you can register a new type of function that you want to add to pytorch as a lego building block
link |
so here if you want to for example add a legendre polynomial three here's how you could do it you
link |
will register it as a class that subclasses torch dot autograd dot function and then you have to tell
link |
pytorch how to forward your new function and how to backward through it so as long as you
link |
can do the forward pass of this little function piece that you want to add and as long as you
link |
know the local derivative the local gradients which are implemented in the backward then pytorch
link |
will be able to back propagate through your function
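a sketch of that registration pattern, along the lines of the legendre polynomial example in the pytorch docs:

```python
import torch

class LegendrePolynomial3(torch.autograd.Function):
    """P3(x) = 1/2 * (5x^3 - 3x), registered as a new autograd building block."""

    @staticmethod
    def forward(ctx, input):
        # save what the backward pass will need
        ctx.save_for_backward(input)
        return 0.5 * (5 * input**3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        # local derivative: d/dx P3(x) = 1.5 * (5x^2 - 1), chained via grad_output
        (input,) = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input**2 - 1)

# usage: call through .apply so autograd tracks the custom function
x = torch.randn(3, requires_grad=True)
y = LegendrePolynomial3.apply(x)
y.sum().backward()
```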
link |
and then you can use this as a lego block in a larger lego castle of all the different lego blocks that pytorch already has and so that's the
link |
only thing you have to tell pytorch and everything would just work and you can register new types of
link |
functions in this way following this example and that is everything that i wanted to cover in this
link |
lecture so i hope you enjoyed building out micrograd with me i hope you find it interesting
link |
insightful and yeah i will post a lot of the links that are related to this video in the video
link |
description below i will also probably post a link to a discussion forum or discussion group where
link |
you can ask questions related to this video and then i can answer or someone else can answer your
link |
questions and i may also do a follow-up video that answers some of the most common questions
link |
but for now that's it i hope you enjoyed it if you did then please like and subscribe
link |
so that youtube knows to feature this video to more people and that's it for now i'll see you later
link |
now here's the problem we know dl by wait what is the problem
link |
and that's everything i wanted to cover in this lecture so i hope um you enjoyed us building out
link |
micrograd okay now let's do the exact same thing for multiply because we can't do
link |
something like a times two oops i know what happened there