set up for rev 5.1

Alex O'Connell
2023-10-16 00:01:38 -04:00
parent 01cef3dcd9
commit 82b7930624
2 changed files with 10 additions and 7 deletions


@@ -43,18 +43,21 @@ rev 4.2 - yeah nah it's the pad token
- batch size 2
rev 5 - new dataset
- 4 epochs
- 3 epochs (4th epoch was overfit)
- train ctx 512
- batch size 2
- learning rate cosine 1e-5
- actually stops generating text. not at the right... place but still!
- messing with temperature makes it generate some interesting output.
TODO:
rev 5.1 - gradient accumulation test
- 4 epochs
- 3 epochs
- train ctx 512
- batch size 8
- learning rate cosine 1e-5
- learning rate cosine 5e-6 (see the sketch below)
Ideas:
- get rid of services block. will it just learn it on its own?
- figure out how to penalize the wrong device name more?
- need to make the device name/description and device ID match less in the examples.
- it is learning to take the name of the device in the service call block from the description, not the states block

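The rev 5.1 notes above call for a cosine learning-rate schedule at 5e-6. A minimal sketch of one way to set that up with torch.optim.lr_scheduler.CosineAnnealingLR; the model, optimizer, step count, and the choice of treating 5e-6 as the peak rate (the notes don't say whether it is the peak or the floor) are assumptions, not taken from this repo:

import torch

# Assumed values: the notes give "cosine 5e-6"; this sketch treats it as the
# starting (peak) rate and decays it toward 0 over the run.
LEARNING_RATE_START = 5e-6
TOTAL_STEPS = 1000  # placeholder; in practice roughly len(dataloader) * epochs

model = torch.nn.Linear(8, 8)  # stand-in for the fine-tuned model
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE_START)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS)

for step in range(TOTAL_STEPS):
    loss = model(torch.randn(2, 8)).pow(2).mean()  # dummy loss for the sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # lr follows a cosine curve from 5e-6 down toward 0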

@@ -9,7 +9,7 @@ torch.set_default_device("cuda")
torch.set_default_tensor_type('torch.cuda.FloatTensor')
TRAIN_CTX_SIZE = 512 # The number of tokens to pad + truncate the input examples to
BATCH_SIZE = 2 # The simulated "batch size" that we will train on. will tweak gradient accumulation steps
BATCH_SIZE = 8 # The simulated "batch size" that we will train on. will tweak gradient accumulation steps
MICRO_BATCH_SIZE = 2 # The actual batch size that will fit into VRAM on this machine
TRAINING_EPOCHS = 4 # The number of times to train the model on each example
LEARNING_RATE_START = 1e-5 # The starting learning rate (speed at which the model trains)
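For the gradient accumulation test itself, BATCH_SIZE is the effective batch the config simulates and MICRO_BATCH_SIZE is what actually fits in VRAM, so the optimizer steps once every BATCH_SIZE / MICRO_BATCH_SIZE = 4 micro-batches. A minimal sketch of that loop; the model, optimizer, and fake data below are placeholders, not this repo's training code:

import torch

BATCH_SIZE = 8        # effective (simulated) batch size from the config above
MICRO_BATCH_SIZE = 2  # micro-batch that actually fits in VRAM
ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 4 micro-batches per optimizer step

model = torch.nn.Linear(16, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Fake loader yielding micro-batches; a real run would iterate the tokenized dataset.
dataloader = [
    (torch.randn(MICRO_BATCH_SIZE, 16), torch.randint(0, 2, (MICRO_BATCH_SIZE,)))
    for _ in range(2 * ACCUMULATION_STEPS)
]

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(dataloader):
    loss = loss_fn(model(inputs), labels)
    # Scale so the accumulated gradients match one full BATCH_SIZE batch.
    (loss / ACCUMULATION_STEPS).backward()
    if (i + 1) % ACCUMULATION_STEPS == 0:
        optimizer.step()       # apply one "simulated" batch worth of gradients
        optimizer.zero_grad()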