From 28183c74388565eb0bf35b4fca0f99d232ebaa07 Mon Sep 17 00:00:00 2001
From: wozeparrot
Date: Fri, 1 Dec 2023 13:56:18 -0500
Subject: [PATCH] feat: reword (#2549)

---
 docs/quickstart.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/quickstart.md b/docs/quickstart.md
index 99979deef2..8c52ecec0f 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -128,7 +128,7 @@ Training neural networks in tinygrad is super simple.
 All we need to do is define our neural network, define our loss function, and then call `.backward()` on the loss function to compute the gradients.
 They can then be used to update the parameters of our neural network using one of the many optimizers in [optim.py](/tinygrad/nn/optim.py).
 
-For our loss function we will be using sparse categorical cross entropy loss.
+For our loss function we will be using sparse categorical cross entropy loss. The implementation below is taken from [tensor.py](/tinygrad/tensor.py); it's copied here to highlight an important detail of tinygrad.
 
 ```python
 def sparse_categorical_crossentropy(self, Y, ignore_index=-1) -> Tensor:
@@ -138,9 +138,9 @@ def sparse_categorical_crossentropy(self, Y, ignore_index=-1) -> Tensor:
   return self.log_softmax().mul(y).sum() / loss_mask.sum()
 ```
 
-As we can see in this implementation of cross entropy loss, there are certain operations that tinygrad does not support.
+As we can see in this implementation of cross entropy loss, there are certain operations that tinygrad does not support natively.
 Namely, operations that are load/store or assigning a value to a tensor at a certain index.
-Load/store ops are not supported in tinygrad because they add complexity when trying to port to different backends and 90% of the models out there don't use/need them.
+Load/store ops are not supported natively in tinygrad because they add complexity when porting to different backends, 90% of the models out there don't use/need them, and they can be implemented, as shown above, with an `arange` mask.
 
 For our optimizer we will be using the traditional stochastic gradient descent optimizer with a learning rate of 3e-4.
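
Reviewer note: the two hunks skip the middle of `sparse_categorical_crossentropy`, so the `arange` mask the reworded sentence points at never actually appears in the patch context. Below is a minimal sketch of the trick, not the exact [tensor.py](/tinygrad/tensor.py) code: the `logits`/`labels` values and the 3-class shape are made up for illustration, and it assumes tinygrad's broadcasting, `==`, and `where` semantics.

```python
# Sketch of the `arange` mask trick: select each sample's label log-probability
# using only broadcasted elementwise ops -- no gather/scatter (load/store) indexing.
from tinygrad.tensor import Tensor

logits = Tensor([[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]])  # (batch=2, classes=3), made-up values
labels = Tensor([0, 1])                               # integer class labels, shape (batch,)

# An arange over the class axis compared against the labels broadcasts into a
# one-hot mask: mask[i, c] == 1.0 exactly where c == labels[i].
mask = (Tensor.arange(3).unsqueeze(0) == labels.unsqueeze(1)).where(1.0, 0.0)

# mul + sum with the mask stands in for the unsupported logits[i, labels[i]] lookup.
loss = -(logits.log_softmax() * mask).sum() / 2  # mean over the batch of 2
print(loss.numpy())
```

The real implementation additionally folds in `loss_mask`, visible in the hunk's `return` line, so labels equal to `ignore_index` contribute nothing to either the sum or the denominator.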
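
The untouched optimizer context line maps to a one-liner. This sketch assumes the `TinyNet`-style model defined earlier in quickstart.md; the layer sizes and the `net.l1.weight`/`net.l2.weight` names come from that section, not from this patch.

```python
from tinygrad.tensor import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.optim import SGD

# Stand-in for the quickstart's network; shapes assumed, not part of this patch.
class TinyNet:
  def __init__(self):
    self.l1 = Linear(784, 128, bias=False)
    self.l2 = Linear(128, 10, bias=False)
  def __call__(self, x: Tensor) -> Tensor:
    return self.l2(self.l1(x).leakyrelu())

net = TinyNet()
# Plain stochastic gradient descent over both weight matrices, lr = 3e-4.
opt = SGD([net.l1.weight, net.l2.weight], lr=3e-4)
```

A training step then pairs it with the loss above: `opt.zero_grad()`, `loss.backward()`, `opt.step()`.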