training

2026-04-24 03:00:54 -04:00 · 2023-02-20 17:27:10 -08:00
parent 34dfbdc4f4
commit 6a49452611
1 changed files with 2 additions and 0 deletions
--- a/docs/train.md
+++ b/docs/train.md
@@ -272,3 +272,5 @@ Note that 3k to 7k steps is not very large, and you should consider larger batch
 Because that "sudden converge" always happens, let's say it happens at step 3k and our budget covers 90k computation steps. Then we have two options: (1) train 3k steps, reach the "sudden converge", then train another 87k steps; (2) use 30x gradient accumulation and train 3k steps (90k real computation steps), then reach the "sudden converge".

 In my experiments, (2) is usually better than (1). In real cases, however, you may need to balance the steps before and after the "sudden converge" yourself. The training after the "sudden converge" is also important.
+
+But usually, if your logical batch size is already larger than 256, increasing it further is not very meaningful. In that case, a better idea may be to train for more steps.
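
The arithmetic behind the two options and the 256 rule of thumb can be sketched as below. This is only an illustration, not part of the training code: the helper names, the per-device batch size of 4, and the single-GPU default are assumptions made for the example, and the 256 threshold is the rough guideline from the text rather than a hard limit.

```python
def logical_batch_size(per_device_batch, accumulation, devices=1):
    """Samples contributing to one optimizer update."""
    return per_device_batch * accumulation * devices


def optimizer_steps(budget_steps, accumulation):
    """Optimizer updates affordable when each update costs `accumulation` computation steps."""
    return budget_steps // accumulation


def should_grow_batch(logical_batch, threshold=256):
    """Rule of thumb from the text: past the threshold, prefer more steps over a bigger batch."""
    return logical_batch <= threshold


budget = 90_000        # real computation steps we can afford
per_device_batch = 4   # hypothetical value, chosen only for illustration

# Option (1): no accumulation, 90k optimizer updates at a small logical batch.
opt1_updates = optimizer_steps(budget, accumulation=1)
opt1_batch = logical_batch_size(per_device_batch, accumulation=1)

# Option (2): 30x accumulation, 3k optimizer updates at a 30x larger logical batch.
opt2_updates = optimizer_steps(budget, accumulation=30)
opt2_batch = logical_batch_size(per_device_batch, accumulation=30)

print(f"option 1: {opt1_updates} updates, logical batch {opt1_batch}")
print(f"option 2: {opt2_updates} updates, logical batch {opt2_batch}")
print("grow batch further?", should_grow_batch(opt2_batch))
```

With these assumed numbers, option (2) still sits below 256, so more accumulation could help; once the logical batch size passes that point, the same budget is probably better spent on additional steps.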