From 98ee10ca17772c41e9ef7848ec092f9b6389cabf Mon Sep 17 00:00:00 2001
From: lvmin
Date: Mon, 20 Feb 2023 16:28:43 -0800
Subject: [PATCH] training

---
 docs/train.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/train.md b/docs/train.md
index 80fa7bc..334fb66 100644
--- a/docs/train.md
+++ b/docs/train.md
@@ -267,4 +267,4 @@ Because we use zero convolutions, the SD should always be able to predict meanin
 
 You will always find that at some iterations, the model "suddenly" be able to fit some training conditions. This means that you will get a basically usable model at about 3k to 7k steps (future training will improve it, but that model after the first "sudden converge" should be basically functional).
 
-Note that 3k to 7k steps is not very large, and you should consider larger batch size rather than more training steps. If you can observe the "sudden converge" at 3k step using batch size 4, then, rather than train it with 300k further steps, a better idea is to use 100× gradient accumulation to re-train that 3k steps with 100× batch size. Note that perhaps we should not do this *too* extremely, but you should consider that, since "sudden converge" will always happen at some point, getting a better converge is more important.
+Note that 3k to 7k steps is not very large, and you should consider larger batch size rather than more training steps. If you can observe the "sudden converge" at 3k step using batch size 4, then, rather than train it with 300k further steps, a better idea is to use 100× gradient accumulation to re-train that 3k steps with 100× batch size. Note that perhaps we should not do this *too* extremely (perhaps 100x accumulation is too extreme), but you should consider that, since "sudden converge" will *always* happen at some point, getting a better converge is more important.
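The gradient-accumulation idea in the added paragraph (trade extra optimizer steps for a larger effective batch) can be sketched as follows. This is a minimal, framework-free illustration, not the repo's training code: the 1-D linear model, `grad`, and `train` are hypothetical names, and `accum` plays the role of the "100×" accumulation factor in the patch text. It also demonstrates why the trick works: accumulating scaled gradients over `accum` micro-batches before one weight update is mathematically the same as one update on the combined large batch.

```python
# Hypothetical sketch of gradient accumulation (not ControlNet's actual code).
# We fit a 1-D linear model y = w * x by gradient descent on squared error,
# and show that accumulating gradients over `accum` micro-batches before each
# optimizer step matches one step on the combined (accum * micro_batch) batch.

def grad(w, batch):
    # d/dw of the mean squared error 0.5 * (w*x - y)^2 over the batch
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, micro_batch, accum, lr=0.1, w=0.0):
    step = 0
    acc = 0.0
    for i in range(0, len(data), micro_batch):
        # accumulate gradients scaled by 1/accum; no update yet
        acc += grad(w, data[i:i + micro_batch]) / accum
        step += 1
        if step % accum == 0:  # optimizer step only every `accum` micro-batches
            w -= lr * acc
            acc = 0.0
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]  # ground truth w = 3

# two micro-batches of 4 with accumulation == one step on all 8 samples
w_accum = train(data, micro_batch=4, accum=2)
w_big = train(data, micro_batch=8, accum=1)
print(w_accum, w_big)  # the two results agree
```

Because the weights are frozen while gradients accumulate, the averaged gradient equals the large-batch gradient, so memory cost stays at one micro-batch while the effective batch size grows by `accum`.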