Mirror of https://github.com/acon96/home-llm.git (synced 2026-01-09 21:58:00 -05:00)

Commit: set up for rev 5.1
@@ -43,18 +43,21 @@ rev 4.2 - yeah nah it's the pad token
   - batch size 2
 
 rev 5 - new dataset
-  - 4 epochs
+  - 3 epochs (4th epoch was overfit)
   - train cx 512
   - batch size 2
   - learning rate cosine 1e-5
+  - actually stops generating text. not at the right... place but still!
+  - messing with temperature makes it generate some interesting output.
 
-
+TODO:
 rev 5.1 - gradient accumulation test
-  - 4 epochs
+  - 3 epochs
   - train cx 512
   - batch size 8
-  - learning rate cosine 1e-5
+  - learning rate cosine 5e-6
 
 Ideas:
-  - get rid of services block. will it just learn it on its own?
   - figure out how to penalize the wrong device name more?
+  - need to make the device name/description and device ID match less in the examples.
+  - it is learning to take the name of the device in the service call block from the description, not the states block
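Rev 4.2's note ("yeah nah it's the pad token") and rev 5's "actually stops generating text" are two sides of the same issue: if the pad token used for batching doubles as, or masks out, the EOS token, the model never learns where to stop. A minimal sketch of the usual fix, assuming a Hugging Face tokenizer; the model name and strings below are illustrative, not taken from this repo:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical base model

# Register a dedicated pad token so padding can be masked out of the loss
# while EOS still receives gradient. The model's embedding table must then
# be resized to match: model.resize_token_embeddings(len(tokenizer)).
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# Append EOS explicitly so every example teaches the model where to stop.
encoded = tokenizer(
    "turn on the kitchen light" + tokenizer.eos_token,
    padding="max_length", max_length=512, truncation=True,
)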
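The temperature observation is standard sampling behavior: logits are divided by the temperature before the softmax, so values below 1 sharpen the next-token distribution and values above 1 flatten it, which is why nudging it produces noticeably different output. A toy illustration in plain PyTorch:

import torch

logits = torch.tensor([2.0, 1.0, 0.5])        # example next-token logits
for temperature in (0.7, 1.0, 1.5):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs)                  # lower T -> peakier, higher T -> flatter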
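"learning rate cosine 1e-5" (dropped to 5e-6 for rev 5.1) describes a cosine decay schedule starting from LEARNING_RATE_START. A minimal sketch in plain PyTorch; the model, optimizer, and step count are placeholders, since train.py's training loop is not part of this diff:

import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # LEARNING_RATE_START

total_steps = 1000  # placeholder: really steps_per_epoch * TRAINING_EPOCHS
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.step()    # loss.backward() omitted from this sketch
    scheduler.step()    # LR follows a cosine curve from 1e-5 toward 0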
train.py (2 lines changed)
@@ -9,7 +9,7 @@ torch.set_default_device("cuda")
 torch.set_default_tensor_type('torch.cuda.FloatTensor')
 
 TRAIN_CTX_SIZE = 512    # The number of tokens to pad + truncate the input examples to
-BATCH_SIZE = 2          # The simulated "batch size" that we will train on; will tweak gradient accumulation steps
+BATCH_SIZE = 8          # The simulated "batch size" that we will train on; will tweak gradient accumulation steps
 MICRO_BATCH_SIZE = 2    # The actual batch size that will fit into VRAM on this machine
 TRAINING_EPOCHS = 4     # The number of times to train the model on each example
 LEARNING_RATE_START = 1e-5  # The starting learning rate (speed at which the model trains)
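The BATCH_SIZE / MICRO_BATCH_SIZE split above is exactly what the "gradient accumulation test" exercises: with BATCH_SIZE = 8 and MICRO_BATCH_SIZE = 2, gradients from 4 micro-batches are summed before each optimizer step, simulating a batch of 8 at the VRAM cost of 2. A self-contained sketch of the pattern; the model and data are stand-ins, since train.py's actual loop is not shown in this diff:

import torch

model = torch.nn.Linear(16, 1)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
micro_batches = [(torch.randn(2, 16), torch.randn(2, 1)) for _ in range(8)]

BATCH_SIZE = 8        # simulated batch size (the value this commit changes 2 -> 8)
MICRO_BATCH_SIZE = 2  # what actually fits in VRAM
accum_steps = BATCH_SIZE // MICRO_BATCH_SIZE  # 4 backward passes per update

optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    # Scale each micro-loss so the 4 accumulated gradients sum to the
    # gradient one batch of 8 would have produced.
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()   # gradients accumulate in .grad across calls
    if (i + 1) % accum_steps == 0:
        optimizer.step()        # one update per simulated batch of 8
        optimizer.zero_grad()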