From 6a49452611bbde827a10c2f98dc8d391b24f61c3 Mon Sep 17 00:00:00 2001
From: lvmin
Date: Mon, 20 Feb 2023 17:27:10 -0800
Subject: [PATCH] training

---
 docs/train.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/train.md b/docs/train.md
index 9cf1fa0..5acf157 100644
--- a/docs/train.md
+++ b/docs/train.md
@@ -272,3 +272,5 @@ Note that 3k to 7k steps is not very large, and you should consider larger batch
 Because that "sudden converge" always happens, lets say "sudden converge" will happen at 3k step and our money can optimize 90k step, then we have two options: (1) train 3k steps, sudden converge, then train 87k steps. (2) 30x gradient accumulation, train 3k steps (90k real computation steps), then sudden converge.
 
 In my experiments, (2) is usually better than (1). However, in real cases, perhaps you may need to balance the steps before and after the "sudden converge" on your own to find a balance. The training after "sudden converge" is also important.
+
+But usually, if your logic batch size is already bigger than 256, then further extending the batch size is not very meaningful. In that case, perhaps a better idea is to train more steps.
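
The two options discussed in the patched paragraph come down to simple arithmetic on a fixed compute budget. Here is a minimal sketch of that arithmetic, using the numbers from the text (90k computation steps, "sudden converge" at 3k optimizer steps). The helper names `plan` and `logical_batch_size` and the per-step batch size of 4 are hypothetical, chosen only for illustration; they are not part of the training code.

```python
def plan(budget_steps, converge_step, accumulation):
    """Split a fixed compute budget into optimizer steps before and
    after the assumed "sudden converge" point (hypothetical helper)."""
    # Each optimizer step costs `accumulation` computation steps.
    optimizer_steps = budget_steps // accumulation
    before = min(optimizer_steps, converge_step)
    after = optimizer_steps - before
    return {"optimizer_steps": optimizer_steps, "before": before, "after": after}


def logical_batch_size(per_step_batch, accumulation):
    """Logical batch size seen by each optimizer update."""
    return per_step_batch * accumulation


# Option (1): no accumulation, so 3k steps to converge and 87k steps after it.
option_1 = plan(budget_steps=90_000, converge_step=3_000, accumulation=1)

# Option (2): 30x accumulation, so 3k optimizer steps use the whole budget.
option_2 = plan(budget_steps=90_000, converge_step=3_000, accumulation=30)

# Assumed per-step batch size of 4 (illustrative only).
batch_option_2 = logical_batch_size(per_step_batch=4, accumulation=30)

print(option_1)
print(option_2)
print(batch_option_2)
```

With these assumed numbers, option (2) reaches a logical batch size of 120, which is still under the 256 mark that the added sentence gives as the point where a larger batch stops helping much.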