lvmin
2023-02-20 17:27:10 -08:00
parent 34dfbdc4f4
commit 6a49452611

@@ -272,3 +272,5 @@ Note that 3k to 7k steps is not very large, and you should consider larger batch
Because the "sudden converge" always happens, let's say it will happen at step 3k and our budget pays for 90k steps of computation. Then we have two options: (1) train 3k steps, hit the "sudden converge", then train 87k more steps; (2) use 30x gradient accumulation and train 3k steps (90k real computation steps), then hit the "sudden converge".
In my experiments, (2) is usually better than (1). However, in real cases you may need to balance the steps before and after the "sudden converge" on your own. The training after the "sudden converge" is also important.
But usually, if your logical batch size (per-step batch size times gradient accumulation) is already bigger than 256, then extending the batch size further is not very meaningful. In that case, a better idea is probably to train for more steps.
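
The arithmetic behind option (2) and the 256 rule of thumb can be sketched as below. This is only an illustration: the helper name `plan_accumulation`, its parameters, and the per-step batch sizes in the usage lines are assumptions made for the example, not part of the training script.

```python
def plan_accumulation(budget_steps, converge_step, per_step_batch, max_logical_batch=256):
    """Return (accumulation factor, optimizer steps, logical batch size).

    Hypothetical helper: spend the whole compute budget by the step where
    the "sudden converge" is expected, but stop growing the logical batch
    size once it reaches max_logical_batch.
    """
    # Factor that fits the whole budget into converge_step optimizer steps.
    accum = max(1, budget_steps // converge_step)
    # Largest factor that keeps the logical batch size within the threshold.
    cap = max(1, max_logical_batch // per_step_batch)
    accum = min(accum, cap)
    # Each optimizer step costs `accum` real computation steps.
    optimizer_steps = budget_steps // accum
    return accum, optimizer_steps, per_step_batch * accum


# Budget of 90k computation steps, "sudden converge" expected at 3k steps.
print(plan_accumulation(90_000, 3_000, per_step_batch=4))   # (30, 3000, 120)
print(plan_accumulation(90_000, 3_000, per_step_batch=16))  # (16, 5625, 256)
```

In the second case the cap takes over: the accumulation factor stops at 16 and the remaining budget goes into more optimizer steps instead. If you train with PyTorch Lightning, the resulting factor would typically be passed as `accumulate_grad_batches` to the `Trainer`.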