early home-llm experiments (phi-1.5)

rev1 - original test

  • 1 epoch
  • train ctx 1900
  • I think the learning rate was way too high (2e-4)
  • the service invocation syntax is ambiguous, causing repeated code blocks
  • it doesn't get the device name right like ever
  • eval dataset was disabled

rev2 - it kinda works

  • eval dataset at 10%
  • 2 epochs
  • train ctx 1200
  • batch size 1
  • learning rate linear 5e-5
  • gradient accumulation steps 1
  • fixed invocation syntax
  • still repeatedly spits out code blocks but at least it closes them correctly
  • names are MUCH more accurate. will still hallucinate names that don't exist
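
A rough sketch of how the rev2 settings above would map onto a HuggingFace transformers Trainer run. This is not the actual training script: the dataset file, text column, and output path are placeholders, and the pad-token line anticipates an issue that only gets sorted out later (rev 4.2 / rev 8.9).

```python
# Rough mapping of the rev2 hyperparameters onto a HuggingFace Trainer run.
# The dataset file and column name are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # see the pad-token saga below

raw = load_dataset("json", data_files="home_assistant_examples.json")["train"]
splits = raw.train_test_split(test_size=0.1)  # eval dataset at 10%

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1200)  # train ctx 1200

train_ds = splits["train"].map(tokenize, batched=True, remove_columns=splits["train"].column_names)
eval_ds = splits["test"].map(tokenize, batched=True, remove_columns=splits["test"].column_names)

args = TrainingArguments(
    output_dir="./phi-rev2",
    num_train_epochs=2,             # 2 epochs
    per_device_train_batch_size=1,  # batch size 1
    gradient_accumulation_steps=1,
    learning_rate=5e-5,             # learning rate linear 5e-5
    lr_scheduler_type="linear",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
print(trainer.evaluate())
```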

rev3 - ok it definitely works

  • 4 epochs
  • batch size 2
  • learning rate cosine 5e-5
  • doesn't seem like there's much difference but the loss is lower
  • still hallucinates device names (need to figure this one out)
  • need more examples for: garage_door, media_player,

rev4 - got to way lower loss. it tries really hard to stop generating text

  • 4 epochs
  • train ctx 512
  • batch size 2
  • learning rate cosine 1e-4
  • added system prompt and moved services block before states block

rev 4.1 - really doesn't work as well. loss dropped REALLY fast and then never got as low as rev4

  • 4 epochs
  • train ctx 512
  • batch size 3
  • learning rate cosine 1e-4
  • proper pad token

rev 4.2 - yeah nah it's the pad token

  • batch size 2
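
Since the pad token keeps coming up (rev 4.1, rev 4.2, and again at rev 8.9), here is a minimal sketch of the two usual ways to give phi-1.5's GPT-style tokenizer a pad token. Which variant each run actually used isn't recorded in these notes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Option A: reuse EOS as the pad token. No new embeddings to train, but
# padded positions look identical to end-of-text, so the attention mask /
# label masking has to do the work of ignoring them.
tokenizer.pad_token = tokenizer.eos_token

# Option B: add a dedicated pad token and grow the embedding matrix.
# The new row starts untrained, which can make early loss behave oddly.
# tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
# model.resize_token_embeddings(len(tokenizer))

model.config.pad_token_id = tokenizer.pad_token_id
```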

rev 5 - new dataset

  • 3 epochs (4th epoch was overfit)
  • train ctx 512
  • batch size 2
  • learning rate cosine 1e-5
  • actually stops generating text. not at the right... place but still!
  • messing with temperature makes it generate some interesting output.
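
A quick way to poke at sampling temperature with a fine-tuned checkpoint; the checkpoint path and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./phi-rev5")  # placeholder path
model = AutoModelForCausalLM.from_pretrained("./phi-rev5")

prompt = "turn on the kitchen light"
inputs = tokenizer(prompt, return_tensors="pt")

for temperature in (0.1, 0.7, 1.2):
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=64,
            pad_token_id=tokenizer.eos_token_id,
        )
    print(temperature, tokenizer.decode(out[0], skip_special_tokens=True))
```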

rev 5.1 - gradient accumulation test

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 1e-5
  • very meh

rev 5.2 - learning rate test

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 1e-4
  • higher learning rate really helped with the higher batch size
  • is able to more reliably generate the correct device name again
  • still need more examples for multi-device actions (really need room/group support in dataset)
  • need to have more variance in request format. need more informal + more formal versions

rev 5.3 - learning rate test 2

  • 4 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 6e-5
  • lower learning rate seemed to not be as effective even though it ran for longer

rev 6 - dataset revamp again

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 1e-4
  • all questions + responses are lowercase now
  • ensured there are no duplicate entries in the states block
  • definitely a bit overfit
  • maybe not so overfit: it can zero-shot requests to do stuff to a christmas tree
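
A sketch of what the rev6 dataset cleanup could look like; the field names and the shape of the states block are guesses, since the real dataset generator isn't part of these notes.

```python
import json

def clean_example(example: dict) -> dict:
    # all questions + responses are lowercase now
    example["question"] = example["question"].lower()
    example["response"] = example["response"].lower()
    # ensure there are no duplicate entries in the states block
    seen = set()
    deduped = []
    for state in example["states"]:
        if state not in seen:
            seen.add(state)
            deduped.append(state)
    example["states"] = deduped
    return example

with open("home_assistant_examples.json") as f:
    examples = [clean_example(e) for e in json.load(f)]

with open("home_assistant_examples_rev6.json", "w") as f:
    json.dump(examples, f, indent=2)
```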

rev 6.1 - lower learning rate

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 6e-5
  • also definitely a bit overfit. can't generate names it hasn't seen before

rev 6.2 - fewer epochs

  • 2 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 6e-5

rev 6.3 - higher batch

  • 2 epochs
  • train ctx 512
  • batch size 12
  • learning rate cosine 1e-4

rev 7 - tweak dataset again

  • 2 epochs
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
  • when generating results, don't end with a trailing space. it works WAY better (see the tokenization check below)
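
One plausible explanation for the trailing-space effect is how the BPE tokenizer splits it: the space at the end typically becomes its own token, so the model has to emit a lone space token before it can stop. A quick check makes the difference visible.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")

for text in ("turn on the light", "turn on the light "):
    ids = tokenizer(text)["input_ids"]
    print(repr(text), ids, tokenizer.convert_ids_to_tokens(ids))
```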

rev 7.1 - failed padding attempt

rev 7.2 - try to overfit less + no newline at end

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
  • it definitely works with only one epoch

rev 7.3 - try adding fake end of sentence token

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
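
A sketch of the "fake end of sentence token" idea, assuming the standard route of registering an extra special token; the literal token string used in rev 7.3 isn't recorded here, so "<|stop|>" is just a stand-in.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the stopping marker so it always tokenizes to a single id,
# then grow the embedding matrix to cover it.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|stop|>"]})
model.resize_token_embeddings(len(tokenizer))

# At inference time, pass the marker's id as the EOS so generation halts
# as soon as the model emits it.
stop_id = tokenizer.convert_tokens_to_ids("<|stop|>")
# model.generate(..., eos_token_id=stop_id)
```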

rev 8 - dataset tweaks. add status requests

  • service requests still mostly work but status requests are pretty broken

rev 8.1 - tweak example counts + ratios

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
  • seems to have worked better with lower example counts

rev 8.2 - tune the learning rate so the loss doesn't bottom out until the end of training

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-5 (didn't change loss at all)
  • learning rate 5e-5 (same)
  • learning rate 1e-5 (wayyyy better)
  • pretty sure i've been overcranking most of these and destroying most of the model
  • oh yuuhhhhh it was overcranked before. this run nails both request types (plus it even ends generation)
  • needs ambiguous device name examples: I just asked it an ambiguous question and it answered for the device I wasn't expecting
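
To reason about when a given "learning rate cosine X" setting actually bottoms out over a run, a tiny helper like this prints the scheduled learning rate at a few points; the step count is illustrative.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 1500  # illustrative; roughly one epoch at batch size 8
for peak_lr in (1e-4, 5e-5, 1e-5):
    opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=peak_lr)
    sched = get_cosine_schedule_with_warmup(opt, num_warmup_steps=0,
                                            num_training_steps=total_steps)
    lrs = []
    for _ in range(total_steps):
        lrs.append(sched.get_last_lr()[0])
        opt.step()
        sched.step()
    print(f"peak {peak_lr:g}: mid-run lr {lrs[total_steps // 2]:.2e}, final lr {lrs[-1]:.2e}")
```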

rev 8.3 - further reduced training rate

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • certainly not overfit like the pre-rev7 runs
  • has some creativity with how it responds
  • will often get the device name wrong on the first try

rev 8.4 - re-ordered prompt

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • put actions before the response and also made actions its own "block"
  • it works but is incredibly open ended
  • basically never stops generating text

rev 8.5 - tweaked prompt format again

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • re-ordered the response before actions again, but made actions less like a "block" in hopes it would stop generating
  • that worked rather badly

rev 8.6 - make prompt look more like other examples it has seen before

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • changed "done" to "just" and added 3 newlines at the end (idk, it keeps doing that for other prompt types before stopping)
  • it wants to generate the other prompt types much more with this config
  • only gets the correct response about 50% of the time
  • it totally stops correctly when it DOES work

rev 8.7 - try to fit a bit more. the last iteration jumps around on which format it chooses

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • similar issues as last model
  • altering the format (with newlines) makes it pick our format more often
  • comparing to 8.6 with modified format shows this one is better at getting device names right

rev 8.8 - train with newlines instead of spaces in requests/responses

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • definitely worse than the previous one
  • for some reason both 8.7 and 8.8 are horrible when using their actual template, but if you deviate slightly from it, inference works a lot better

rev 8.9 - actually fix pad token

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • properly generates a response (+ terminates) when using the actual template

rev 9 - reduced dataset size

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • didn't work as well
  • would often not generate a service call
  • went back to 8.9

Home 1B

home-1b-rev1

  • 1 epoch
  • 2048 train ctx
  • batch size 8
  • learning rate 1e-5
  • weight decay 0.1
  • gradient clipping 1.0
  • dataset changes:
    • updated chatml format
    • json function calling
    • included the alpaca split
  • it works OK with low temperatures
  • doesn't seem to handle the alpaca dataset very well
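
For context, a purely illustrative sketch of what one ChatML-formatted sample with a JSON function call might look like after these changes; the real system prompt, state layout, and call schema live in the dataset generator, not in these notes.

```python
import json

# All names below (devices, services, JSON keys) are made up for illustration.
system = (
    "You are a home automation assistant.\n"
    "Devices:\n"
    "light.kitchen_light = off\n"
    "switch.coffee_maker = off"
)
user = "turn on the kitchen light"
call = {"service": "light.turn_on", "target_device": "light.kitchen_light"}

sample = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user}<|im_end|>\n"
    f"<|im_start|>assistant\n"
    f"turning on the kitchen light for you\n"
    f"{json.dumps(call)}<|im_end|>"
)
print(sample)
```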

Home-1b-v1-GGUF

  • eval results: 0.767816091954023

home-1b-rev5/6 parameters

  • 1 epoch
  • 2048 train ctx
  • batch size 8
  • learning rate 1e-5
  • weight decay 0.1
  • gradient clipping 1.0
  • save model every 200 or 400 steps
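
These shared knobs map onto TrainingArguments roughly as below, along with a placeholder loop for scoring the saved checkpoints. The eval metric behind the numbers in the following sections isn't defined in these notes (the values look like an accuracy fraction), so score_checkpoint is left unimplemented.

```python
import glob
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./home-1b-rev5",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.1,    # weight decay 0.1
    max_grad_norm=1.0,   # gradient clipping 1.0
    save_strategy="steps",
    save_steps=200,      # save model every 200 (or 400) steps
)

def score_checkpoint(path: str) -> float:
    """Placeholder: load the checkpoint and run the eval set through it."""
    raise NotImplementedError

for ckpt in sorted(glob.glob("./home-1b-rev5/checkpoint-*")):
    print(ckpt, score_checkpoint(ckpt))
```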

home-1b-rev5

  • dataset size: medium
  • evaluation results:
    • 200: 0.553448275862069
    • 400: 0.7482758620689656 (+.19)
    • 600: 0.8103448275862069 (+.06)
    • 800: 0.8316091954022988 (+.02)
    • 1000: 0.8396551724137931 (+.008)
    • 1200: 0.8488505747126437 (+.009)
    • Final (1467): 0.8494252873563218 (+.0006)

home-1b-rev5_1

  • dataset size: small
  • evaluation results:
    • 200: 0.6057471264367816
    • 400: 0.7494252873563219 (+.143)
    • 600: 0.7683908045977011 (+.018)
    • 800: 0.7729885057471264 (+.0046)
    • Final (869): bad

home-1b-rev5_2

  • dataset size: large
  • evaluation results:
    • 200: --
    • 400: --
    • 600: 0.8425287356321839
    • 800: 0.8666666666666667
    • 1000: 0.8770114942528736
    • 1200: 0.8844827586206897
    • 1400: 0.8879310344827587
    • 1600: 0.8844827586206897
    • Final (1848): 0.8833333333333333

home-1b-rev6

  • dataset size: large (fixed templates + function calling arguments; brightness is broken)
  • evaluation results: 0.8254149971379507

home-1b-rev6_1

  • dataset size: xl (fixed templates + function calling arguments; 0-255 brightness is broken)
  • evaluation results:
    • 400: 0.7240984544934173
    • 800: 0.8311390955924441
    • 1200: 0.8471665712650257
    • 1600: 0.8597595878649112
    • 2000: 0.8551803091013166
    • Final (2322): 0.8586147681740126

home-1b-rev6_2 = Home-1B-v2-GGUF

  • dataset size: large (change brightness back to percentages; increase color references by ~2x)
  • evaluation results:
    • 400: 0.7856064418721691
    • 800: 0.864116759
    • 1200: 0.882234524
    • 1600: 0.885254152
    • 2000: 0.8852541519879215
    • Final (2048):

Home 3B

  • 1 epoch
  • 2048 train ctx
  • batch size 8
  • learning rate 1e-5
  • weight decay 0.1
  • gradient clipping 1.0
  • save model every 200 or 400 steps

Missing a lot of earlier 3B training results (not sure where they are)

Home-3b-v2-GGUF (broken training run)

  • evaluation result: 0.6908045977011494

home-3b-v3-rev1

  • dataset size: large
  • evaluation results: 0.9091954022988505

home-3b-v3-rev2 = Home-3B-v2-GGUF (republished)

  • dataset size: xl + alpaca
  • evaluation results: 0.8731756416708606

Home-3B-v2-GGUF:ha_only

  • dataset size: large
  • evaluation results: FAILED (again.....)

Potential Other Datasets to Use

SFT

  • Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
  • Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
  • WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
  • WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
  • Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
  • OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup

DPO

  • Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
  • Huggingface Ultrachat: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
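
If any of these get pulled in, a first sanity check is just loading them and eyeballing sizes and columns:

```python
from datasets import load_dataset

alpaca = load_dataset("yahma/alpaca-cleaned", split="train")
orca_dpo = load_dataset("Intel/orca_dpo_pairs", split="train")

print(len(alpaca), alpaca.column_names)      # expect instruction / input / output
print(len(orca_dpo), orca_dpo.column_names)  # expect system / question / chosen / rejected
```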