early home-llm experiments (phi-1.5)

rev1 - original test

  • 1 epoch
  • train ctx 1900
  • I think the learning rate was way too high (2e-4)
  • the service invocation syntax is ambiguous, causing repeated code blocks
  • it doesn't get the device name right like ever
  • eval dataset was disabled

rev2 - it kinda works

  • eval dataset at 10%
  • 2 epochs
  • train ctx 1200
  • batch size 1
  • learning rate linear 5e-5
  • gradient accumulation steps 1
  • fixed invocation syntax
  • still repeatedly spits out code blocks but at least it closes them correctly
  • names are MUCH more accurate. will still hallucinate names that don't exist
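
A rough sketch of how the rev2 settings above would map onto a HuggingFace transformers Trainer run. This is not the actual training script: the dataset file, text column, and output path are placeholders, and the pad-token line anticipates an issue that only gets sorted out later (rev 4.2 / rev 8.9).

```python
# Rough mapping of the rev2 hyperparameters onto a HuggingFace Trainer run.
# The dataset file and column name are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # see the pad-token saga below

raw = load_dataset("json", data_files="home_assistant_examples.json")["train"]
splits = raw.train_test_split(test_size=0.1)  # eval dataset at 10%

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1200)  # train ctx 1200

train_ds = splits["train"].map(tokenize, batched=True, remove_columns=splits["train"].column_names)
eval_ds = splits["test"].map(tokenize, batched=True, remove_columns=splits["test"].column_names)

args = TrainingArguments(
    output_dir="./phi-rev2",
    num_train_epochs=2,             # 2 epochs
    per_device_train_batch_size=1,  # batch size 1
    gradient_accumulation_steps=1,
    learning_rate=5e-5,             # learning rate linear 5e-5
    lr_scheduler_type="linear",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
print(trainer.evaluate())
```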

rev3 - ok it definitely works

  • 4 epochs
  • batch size 2
  • learning rate cosine 5e-5
  • doesn't seem like there's much difference but the loss is lower
  • still hallucinates device names (need to figure this one out)
  • need more examples for: garage_door, media_player,

rev4 - got to way lower loss. it tries really hard to stop generating text

  • 4 epochs
  • train ctx 512
  • batch size 2
  • learning rate cosine 1e-4
  • added system prompt and moved services block before states block

rev 4.1 - really doesn't work as well. loss dropped REALLY fast and then never got as low as rev4

  • 4 epochs
  • train ctx 512
  • batch size 3
  • learning rate cosine 1e-4
  • proper pad token

rev 4.2 - yeah nah it's the pad token

  • batch size 2
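
Since the pad token keeps coming up (rev 4.1, rev 4.2, and again at rev 8.9), here is a minimal sketch of the two usual ways to give phi-1.5's GPT-style tokenizer a pad token. Which variant each run actually used isn't recorded in these notes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Option A: reuse EOS as the pad token. No new embeddings to train, but
# padded positions look identical to end-of-text, so the attention mask /
# label masking has to do the work of ignoring them.
tokenizer.pad_token = tokenizer.eos_token

# Option B: add a dedicated pad token and grow the embedding matrix.
# The new row starts untrained, which can make early loss behave oddly.
# tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
# model.resize_token_embeddings(len(tokenizer))

model.config.pad_token_id = tokenizer.pad_token_id
```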

rev 5 - new dataset

  • 3 epochs (4th epoch was overfit)
  • train ctx 512
  • batch size 2
  • learning rate cosine 1e-5
  • actually stops generating text. not at the right... place but still!
  • messing with temperature makes it generate some interesting output.
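
A quick way to poke at sampling temperature with a fine-tuned checkpoint; the checkpoint path and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./phi-rev5")  # placeholder path
model = AutoModelForCausalLM.from_pretrained("./phi-rev5")

prompt = "turn on the kitchen light"
inputs = tokenizer(prompt, return_tensors="pt")

for temperature in (0.1, 0.7, 1.2):
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=64,
            pad_token_id=tokenizer.eos_token_id,
        )
    print(temperature, tokenizer.decode(out[0], skip_special_tokens=True))
```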

rev 5.1 - gradient accumulation test

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 1e-5
  • very meh

rev 5.2 - learning rate test

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 1e-4
  • higher learning rate really helped with the higher batch size
  • is able to more reliably generate the correct device name again
  • still need more examples for multi-device actions (really need room/group support in dataset)
  • need to have more variance in request format. need more informal + more formal versions

rev 5.3 - learning rate test 2

  • 4 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 6e-5
  • lower learning rate seemed to not be as effective even though it ran for longer

rev 6 - dataset revamp again

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 1e-4
  • all questions + responses are lowercase now
  • ensured there are no duplicate entries in the states block
  • definitely a bit overfit
  • maybe not so overfit: it can zero-shot requests to do stuff to a christmas tree
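
A sketch of what the rev6 dataset cleanup could look like; the field names and the shape of the states block are guesses, since the real dataset generator isn't part of these notes.

```python
import json

def clean_example(example: dict) -> dict:
    # all questions + responses are lowercase now
    example["question"] = example["question"].lower()
    example["response"] = example["response"].lower()
    # ensure there are no duplicate entries in the states block
    seen = set()
    deduped = []
    for state in example["states"]:
        if state not in seen:
            seen.add(state)
            deduped.append(state)
    example["states"] = deduped
    return example

with open("home_assistant_examples.json") as f:
    examples = [clean_example(e) for e in json.load(f)]

with open("home_assistant_examples_rev6.json", "w") as f:
    json.dump(examples, f, indent=2)
```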

rev 6.1 - lower learning rate

  • 3 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 6e-5
  • also definitely a bit overfit. can't generate names it hasn't seen before

rev 6.2 - fewer epochs

  • 2 epochs
  • train ctx 512
  • batch size 8
  • learning rate cosine 6e-5

rev 6.3 - higher batch

  • 2 epochs
  • train ctx 512
  • batch size 12
  • learning rate cosine 1e-4

rev 7 - tweak dataset again

  • 2 epochs
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
  • when generating results, don't end with a trailing space. it works WAY better (see the tokenization check below)
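
One plausible explanation for the trailing-space effect is how the BPE tokenizer splits it: the space at the end typically becomes its own token, so the model has to emit a lone space token before it can stop. A quick check makes the difference visible.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")

for text in ("turn on the light", "turn on the light "):
    ids = tokenizer(text)["input_ids"]
    print(repr(text), ids, tokenizer.convert_ids_to_tokens(ids))
```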

rev 7.1 - failed padding attempt

rev 7.2 - try to overfit less + no newline at end

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
  • it definitely works with only one epoch

rev 7.3 - try adding fake end of sentence token

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
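
A sketch of the "fake end of sentence token" idea, assuming the standard route of registering an extra special token; the literal token string used in rev 7.3 isn't recorded here, so "<|stop|>" is just a stand-in.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the stopping marker so it always tokenizes to a single id,
# then grow the embedding matrix to cover it.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|stop|>"]})
model.resize_token_embeddings(len(tokenizer))

# At inference time, pass the marker's id as the EOS so generation halts
# as soon as the model emits it.
stop_id = tokenizer.convert_tokens_to_ids("<|stop|>")
# model.generate(..., eos_token_id=stop_id)
```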

rev 8 - dataset tweaks. add status requests

  • service requests still mostly work but status requests are pretty broken

rev 8.1 - tweak example counts + ratios

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-4
  • seems to have worked better with lower example counts

rev 8.2 - tune the learning rate so the loss doesn't bottom out until the end of training

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-5 (didn't change loss at all)
  • learning rate 5e-5 (same)
  • learning rate 1e-5 (wayyyy better)
  • pretty sure i've been overcranking most of these and destroying most of the model
  • oh yuuhhhhh it was overcranked before. this run nails both request types (plus it even ends generation)
  • needs ambiguous device name examples: I just asked it an ambiguous question and it answered for the device I wasn't expecting
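
To reason about when a given "learning rate cosine X" setting actually bottoms out over a run, a tiny helper like this prints the scheduled learning rate at a few points; the step count is illustrative.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 1500  # illustrative; roughly one epoch at batch size 8
for peak_lr in (1e-4, 5e-5, 1e-5):
    opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=peak_lr)
    sched = get_cosine_schedule_with_warmup(opt, num_warmup_steps=0,
                                            num_training_steps=total_steps)
    lrs = []
    for _ in range(total_steps):
        lrs.append(sched.get_last_lr()[0])
        opt.step()
        sched.step()
    print(f"peak {peak_lr:g}: mid-run lr {lrs[total_steps // 2]:.2e}, final lr {lrs[-1]:.2e}")
```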

rev 8.3 - further reduced training rate

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • certainly not overfit like the pre-rev7 runs
  • has some creativity with how it responds
  • will often get the device name wrong on the first try

rev 8.4 - re-ordered prompt

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • put actions before the response and also made actions its own "block"
  • it works but is incredibly open ended
  • basically never stops generating text

rev 8.5 - tweaked prompt format again

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • re-ordered the response before actions again, but made actions less like a "block" in hopes it would stop generating
  • that worked rather badly

rev 8.6 - make prompt look more like other examples it has seen before

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 8e-6
  • changed "done" to "just" and added 3 newlines at the end (idk, it keeps doing that for other prompt types before stopping)
  • it wants to generate the other prompt types much more with this config
  • only gets the correct response about 50% of the time
  • it totally stops correctly when it DOES work

rev 8.7 - try to fit a bit more. the last iteration jumps around on which format it chooses

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • similar issues as last model
  • altering the format (with newlines) makes it pick our format more often
  • comparing to 8.6 with modified format shows this one is better at getting device names right

rev 8.8 - train with newlines instead of spaces in requests/responses

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • definitely worse than the previous one
  • for some reason both 8.7 and 8.8 are horrible when using their actual template, but if you deviate slightly from it, inference works a lot better

rev 8.9 - actually fix pad token

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • properly generates a response (+ terminates) when using the actual template

rev 9 - reduced dataset size

  • 1 epoch
  • train ctx 512
  • batch size 8
  • learning rate 1e-5
  • didn't work as well
  • would often not generate a service call
  • went back to 8.9

Home 1B

home-1b-rev1

  • 1 epoch
  • 2048 train ctx
  • batch size 8
  • learning rate 1e-5
  • weight decay 0.1
  • gradient clipping 1.0
  • dataset changes:
    • updated chatml format
    • json function calling
    • included the alpaca split
  • it works OK with low temperatures
  • doesn't seem to handle the alpaca dataset very well
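
For context, a purely illustrative sketch of what one ChatML-formatted sample with a JSON function call might look like after these changes; the real system prompt, state layout, and call schema live in the dataset generator, not in these notes.

```python
import json

# All names below (devices, services, JSON keys) are made up for illustration.
system = (
    "You are a home automation assistant.\n"
    "Devices:\n"
    "light.kitchen_light = off\n"
    "switch.coffee_maker = off"
)
user = "turn on the kitchen light"
call = {"service": "light.turn_on", "target_device": "light.kitchen_light"}

sample = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user}<|im_end|>\n"
    f"<|im_start|>assistant\n"
    f"turning on the kitchen light for you\n"
    f"{json.dumps(call)}<|im_end|>"
)
print(sample)
```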

Home-1b-v1-GGUF

  • eval results: 0.767816091954023

home-1b-rev5/6 parameters

  • 1 epoch
  • 2048 train ctx
  • batch size 8
  • learning rate 1e-5
  • weight decay 0.1
  • gradient clipping 1.0
  • save model every 200 or 400 steps
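
These shared knobs map onto TrainingArguments roughly as below, along with a placeholder loop for scoring the saved checkpoints. The eval metric behind the numbers in the following sections isn't defined in these notes (the values look like an accuracy fraction), so score_checkpoint is left unimplemented.

```python
import glob
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./home-1b-rev5",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.1,    # weight decay 0.1
    max_grad_norm=1.0,   # gradient clipping 1.0
    save_strategy="steps",
    save_steps=200,      # save model every 200 (or 400) steps
)

def score_checkpoint(path: str) -> float:
    """Placeholder: load the checkpoint and run the eval set through it."""
    raise NotImplementedError

for ckpt in sorted(glob.glob("./home-1b-rev5/checkpoint-*")):
    print(ckpt, score_checkpoint(ckpt))
```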

home-1b-rev5

  • dataset size: medium
  • evaluation results:
    • 200: 0.553448275862069
    • 400: 0.7482758620689656 (+.19)
    • 600: 0.8103448275862069 (+.06)
    • 800: 0.8316091954022988 (+.02)
    • 1000: 0.8396551724137931 (+.008)
    • 1200: 0.8488505747126437 (+.009)
    • Final (1467): 0.8494252873563218 (+.0006)

home-1b-rev5_1

  • dataset size: small
  • evaluation results:
    • 200: 0.6057471264367816
    • 400: 0.7494252873563219 (+.143)
    • 600: 0.7683908045977011 (+.018)
    • 800: 0.7729885057471264 (+.0046)
    • Final (869): bad

home-1b-rev5_2

  • dataset size: large
  • evaluation results:
    • 200: --
    • 400: --
    • 600: 0.8425287356321839
    • 800: 0.8666666666666667
    • 1000: 0.8770114942528736
    • 1200: 0.8844827586206897
    • 1400: 0.8879310344827587
    • 1600: 0.8844827586206897
    • Final (1848): 0.8833333333333333

home-1b-rev6

  • dataset size: large (fixed templates + function calling arguments; brightness is broken)
  • evaluation results: 0.8254149971379507

home-1b-rev6_1

  • dataset size: xl (fixed templates + function calling arguments; 0-255 brightness is broken)
  • evaluation results:
    • 400: 0.7240984544934173
    • 800: 0.8311390955924441
    • 1200: 0.8471665712650257
    • 1600: 0.8597595878649112
    • 2000: 0.8551803091013166
    • Final (2322): 0.8586147681740126

home-1b-rev6_2 = Home-1B-v2-GGUF

  • dataset size: large (change brightness back to percentages; increase color references by ~2x)
  • evaluation results:
    • 400: 0.7856064418721691
    • 800: 0.864116759
    • 1200: 0.882234524
    • 1600: 0.885254152
    • 2000: 0.8852541519879215
    • Final (2048):

Home 3B

  • 1 epoch
  • 2048 train ctx
  • batch size 8
  • learning rate 1e-5
  • weight decay 0.1
  • gradient clipping 1.0
  • save model every 200 or 400 steps

Missing a lot of earlier 3B training results (not sure where they are)

Home-3b-v2-GGUF (broken training run)

  • evaluation result: 0.6908045977011494

home-3b-v3-rev1

  • dataset size: large
  • evaluation results: 0.9091954022988505

home-3b-v3-rev2 = Home-3B-v2-GGUF (republished)

  • dataset size: xl + alpaca
  • evaluation results: 0.8731756416708606

Home-3B-v2-GGUF:ha_only

  • dataset size: large
  • evaluation results: FAILED (again.....)

Potential Other Datasets to Use

SFT

  • Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
  • Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
  • WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
  • WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
  • Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
  • OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup

DPO

  • Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
  • Huggingface Ultrachat: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
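
If any of these get pulled in, a first sanity check is just loading them and eyeballing sizes and columns:

```python
from datasets import load_dataset

alpaca = load_dataset("yahma/alpaca-cleaned", split="train")
orca_dpo = load_dataset("Intel/orca_dpo_pairs", split="train")

print(len(alpaca), alpaca.column_names)      # expect instruction / input / output
print(len(orca_dpo), orca_dpo.column_names)  # expect system / question / chosen / rejected
```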