My last post discussed the concept of loss when it comes to building a machine learning model. To summarize the last post (so we can move on to fixing things, damnit): think of a model as a line on a graph of many data points. The goal is for that line to come as close to as many of the data points as possible, and the distance between the data points and the line is the loss, which is bad. So how do we enable our models to predict accurately, and minimize loss?
Well… it’s not that easy or straightforward, but let’s start with the concept of gradient descent.
Don’t freak out. Take the phrase as the two meanings you probably know: gradient, as in moving step by step toward something from something, and descent, meaning going down.
Let’s hit definitions next, since the one I just used above ^ was so crappy, but also because definitions are the basis for clarity, which is important before beginning any conversation.
Hyperparameters are configuration settings used to tune how the model is trained. Let’s take the below snippet of code running a model as an example:
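Something like this toy training loop, where the hyperparameters are just plain variables (the names and the tiny y = 3x dataset are made up for illustration, not from any real library):

```python
# Illustrative hyperparameters -- configuration settings we pick before training.
learning_rate = 0.005       # how big each gradient step is
num_training_steps = 100    # how many gradient steps the training loop takes

# Tiny made-up dataset: y = 3x, so the "right answer" for our one weight is 3.
data = [(x, 3.0 * x) for x in range(-10, 11)]

w = 0.0  # the single weight of our toy model, y = w * x
for step in range(num_training_steps):
    # gradient of mean squared error with respect to w, over the whole dataset
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # one gradient step

print(round(w, 2))  # lands at the true weight, 3.0
```
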
Variables like how many training steps a model runs during the training phase are just one of the many hyperparameters that can be tweaked, not only when you train a model but also when you perform inference with it. These are all knobs that can be tuned in different ways to make your model train a certain way- messing with hyperparameters is one way to begin reducing loss.
Gradient steps are small steps in the direction that minimizes loss. The general strategy discussed here is gradient descent. Let’s consider the graph below:
Cool. This should look familiar, however, with one change… the line is a curve. This is because this graph actually represents loss (on the y-axis) against the value of the weight given to the dimension being tested (on the x-axis). Generally, loss looks like a curve (or, more commonly, a series of ups and downs, just like life). The ups and downs of life aspect will be important in a later discussion, so keep this in mind. For now, though, we will just deal with this one curve. Now, consider the next step:
Okay, not so bad. Now, how did we get to the end point? If you recall your algebra… does the below look familiar?
So, this should not look so different from what we call our negative gradient.
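To make that concrete, here is a tiny sketch (the bowl-shaped loss (w - 2)² is made up for illustration): the slope at a point is just rise-over-run from algebra, and stepping against it — the negative gradient — moves you downhill:

```python
def loss(w):
    # a bowl-shaped loss curve with its minimum at w = 2
    return (w - 2) ** 2

def slope(w):
    # the derivative of the loss: plain old rise-over-run from algebra
    return 2 * (w - 2)

w = 5.0
print(slope(w))             # 6.0 -- positive slope, the curve rises to the right
w = w - 0.1 * slope(w)      # step in the NEGATIVE gradient direction: downhill
print(loss(w) < loss(5.0))  # True -- the step reduced the loss
```
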
What are we moving towards? If the curve represents loss, then of course we want the lowest point on the loss curve, because loss is bad, and we want loss gone:
The learning rate controls how big those gradient steps are. If it is too big, you risk overshooting the minimum loss point. If the learning rate is too small, the steps are tiny and learning takes forever. The “sweet spot” is somewhere in-between. Visit the playground to play with this concept further if it is hard to comprehend- toggle the learning rate up and down to see how fast a model reaches convergence, the point where loss stops decreasing.
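You can see all three regimes on the same toy loss curve (again, (w - 2)² is just an illustrative stand-in for a real loss):

```python
def descend(learning_rate, steps=20, start=5.0):
    """Gradient descent on loss(w) = (w - 2)**2, whose minimum is at w = 2."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 2)  # gradient of (w - 2)**2 is 2(w - 2)
    return w

print(descend(0.01))  # too small: still around 4.0 -- barely moved toward 2
print(descend(0.5))   # sweet spot: 2.0 -- converged
print(descend(1.1))   # too big: around 117 -- each step overshoots further than the last
```
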
Last but not least… don’t bum out on me here… let’s review two ways you can optimize.
So far, we have treated the batch we compute and optimize loss on as our entire training dataset. This can get a bit out of hand with larger datasets, so there are two approaches for calculating minimum loss on large datasets in a reasonable amount of time: stochastic gradient descent (SGD for short, because… holy shit) and mini-batch SGD (because of a lack of creativity on the part of whoever made these terms).
SGD uses a batch of size one (literally, a single example) per iteration of optimizing loss. Because each step is based on just one example, the gradient estimates get very “noisy” in a data sense- they bounce around instead of heading straight downhill.
Mini-batch SGD uses somewhere between 10–1000 examples chosen at random. There are obvious trade-offs here (time, money, resources, accuracy, etc).
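Here is a toy comparison of all three, assuming a one-weight model y = w·x and a made-up 21-example dataset (everything here is illustrative, not a real training pipeline):

```python
import random

random.seed(0)  # fixed seed so the toy run is repeatable
data = [(x, 3.0 * x) for x in range(-10, 11)]  # 21 examples; the true weight is 3

def train(batch_size, steps=300, learning_rate=0.004):
    """Fit the one weight of y = w * x using batches of the given size."""
    w = 0.0
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        # gradient of mean squared error on just this batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= learning_rate * grad
    return w

print(round(train(1), 2))   # SGD: one example per step -- a noisy path downhill
print(round(train(10), 2))  # mini-batch SGD: a random handful per step
print(round(train(21), 2))  # full batch: the whole dataset every step
```

All three land on the same answer here; the trade-offs show up in how much compute each step costs and how noisy the path is along the way.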
And now… you generally know about optimizing to reduce loss! Stay tuned for the next post.