From 28183c74388565eb0bf35b4fca0f99d232ebaa07 Mon Sep 17 00:00:00 2001
From: wozeparrot
Date: Fri, 1 Dec 2023 13:56:18 -0500
Subject: [PATCH] feat: reword (#2549)

---
 docs/quickstart.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/quickstart.md b/docs/quickstart.md
index 99979deef2..8c52ecec0f 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -128,7 +128,7 @@ Training neural networks in tinygrad is super simple.
 All we need to do is define our neural network, define our loss function, and then call `.backward()` on the loss function to compute the gradients.
 They can then be used to update the parameters of our neural network using one of the many optimizers in [optim.py](/tinygrad/nn/optim.py).
 
-For our loss function we will be using sparse categorical cross entropy loss.
+For our loss function we will be using sparse categorical cross entropy loss. The implementation below is taken from [tensor.py](/tinygrad/tensor.py); it's copied here to highlight an important detail of tinygrad.
 
 ```python
 def sparse_categorical_crossentropy(self, Y, ignore_index=-1) -> Tensor:
@@ -138,9 +138,9 @@ def sparse_categorical_crossentropy(self, Y, ignore_index=-1) -> Tensor:
   return self.log_softmax().mul(y).sum() / loss_mask.sum()
 ```
 
-As we can see in this implementation of cross entropy loss, there are certain operations that tinygrad does not support.
+As we can see in this implementation of cross entropy loss, there are certain operations that tinygrad does not support natively.
 Namely, operations that are load/store or assigning a value to a tensor at a certain index.
-Load/store ops are not supported in tinygrad because they add complexity when trying to port to different backends and 90% of the models out there don't use/need them.
+Load/store ops are not supported natively in tinygrad because they add complexity when porting to different backends, 90% of the models out there don't use/need them, and they can be implemented, as shown above, with an `arange` mask.
 
 For our optimizer we will be using the traditional stochastic gradient descent optimizer with a learning rate of 3e-4.
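
Reviewer note: the two hunks skip the middle of `sparse_categorical_crossentropy`, so the `arange` mask the reworded sentence points at never actually appears in the patch context. Below is a minimal sketch of the trick, not the exact [tensor.py](/tinygrad/tensor.py) code: the `logits`/`labels` values and the 3-class shape are made up for illustration, and it assumes tinygrad's broadcasting, `==`, and `where` semantics.

```python
# Sketch of the `arange` mask trick: select each sample's label log-probability
# using only broadcasted elementwise ops -- no gather/scatter (load/store) indexing.
from tinygrad.tensor import Tensor

logits = Tensor([[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]])  # (batch=2, classes=3), made-up values
labels = Tensor([0, 1])                               # integer class labels, shape (batch,)

# An arange over the class axis compared against the labels broadcasts into a
# one-hot mask: mask[i, c] == 1.0 exactly where c == labels[i].
mask = (Tensor.arange(3).unsqueeze(0) == labels.unsqueeze(1)).where(1.0, 0.0)

# mul + sum with the mask stands in for the unsupported logits[i, labels[i]] lookup.
loss = -(logits.log_softmax() * mask).sum() / 2  # mean over the batch of 2
print(loss.numpy())
```

The real implementation additionally folds in `loss_mask`, visible in the hunk's `return` line, so labels equal to `ignore_index` contribute nothing to either the sum or the denominator.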
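
The untouched optimizer context line maps to a one-liner. This sketch assumes the `TinyNet`-style model defined earlier in quickstart.md; the layer sizes and the `net.l1.weight`/`net.l2.weight` names come from that section, not from this patch.

```python
from tinygrad.tensor import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.optim import SGD

# Stand-in for the quickstart's network; shapes assumed, not part of this patch.
class TinyNet:
  def __init__(self):
    self.l1 = Linear(784, 128, bias=False)
    self.l2 = Linear(128, 10, bias=False)
  def __call__(self, x: Tensor) -> Tensor:
    return self.l2(self.l1(x).leakyrelu())

net = TinyNet()
# Plain stochastic gradient descent over both weight matrices, lr = 3e-4.
opt = SGD([net.l1.weight, net.l2.weight], lr=3e-4)
```

A training step then pairs it with the loss above: `opt.zero_grad()`, `loss.backward()`, `opt.step()`.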