random readme fixes + format notes

Alex O'Connell
2024-01-28 10:14:50 -05:00
parent c2d4d95212
commit 038d869ded
3 changed files with 102 additions and 69 deletions

View File

@@ -8,12 +8,17 @@ The latest models can be found on HuggingFace:
3B v2 (Based on Phi-2): https://huggingface.co/acon96/Home-3B-v2-GGUF
1B v2 (Based on Phi-1.5): https://huggingface.co/acon96/Home-1B-v2-GGUF
<details>
<summary>Old Models</summary>
3B v1 (Based on Phi-2): https://huggingface.co/acon96/Home-3B-v1-GGUF
1B v1 (Based on Phi-1.5): https://huggingface.co/acon96/Home-1B-v1-GGUF
</details>
Make sure you have `llama-cpp-python>=0.2.29` in order to run these models.
The main difference between the two models (besides parameter count) is the training data. The 1B model is ONLY trained on the synthetic dataset provided in this project, while the 3B model is trained on a mixture of this synthetic dataset and the cleaned Stanford Alpaca dataset.
The models are quantized using Llama.cpp so they can run in the very low-resource environments that are common for Home Assistant installations, such as Raspberry Pis.
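As a quick sanity check that one of these GGUF files loads and generates, you can call it directly with `llama-cpp-python`. The sketch below is only illustrative: the model file name, thread count, stop token, and prompt are placeholder assumptions, not the exact prompt format the integration uses.
```
# Minimal sketch: load a quantized Home-LLM GGUF with llama-cpp-python (>=0.2.29).
# File name, thread count, stop token, and prompt are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./Home-3B-v2.q4_k_m.gguf",  # path to a downloaded GGUF (placeholder name)
    n_ctx=2048,                             # context length the models were trained with
    n_threads=4,                            # tune for the host CPU (e.g. a Raspberry Pi)
)

output = llm(
    "turn on the kitchen light",            # illustrative request, not the real prompt template
    max_tokens=128,
    stop=["<|endoftext|>"],
)
print(output["choices"][0]["text"])
```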
@@ -63,6 +68,9 @@ The supported entity types are: light, fan, cover, lock, media_player, climate,
### Training
The 3B model was trained as a LoRA on an RTX 3090 (24GB) using the following settings for the custom training script. The embedding weights were "saved" and trained normally along with the rank matrices so that embeddings could be learned for the newly added tokens. The full model is merged together at the end. Training took approximately 10 hours.
<details>
<summary>Training Arguments</summary>
```
python3 train.py \
--run_name home-3b \
@@ -79,7 +87,12 @@ python3 train.py \
--use_lora --lora_rank 32 --lora_alpha 64 --lora_modules fc1,fc2,q_proj,v_proj,dense --lora_modules_to_save embed_tokens,lm_head --lora_merge
```
</details>
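The LoRA flags above correspond to a fairly standard PEFT configuration. The sketch below is only an illustration of that setup, not the project's `train.py`; the base model name and the added tokens are assumptions.
```
# Illustrative PEFT setup matching the LoRA flags above (not the project's train.py).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "microsoft/phi-2"  # assumed base model for the 3B variant
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder special tokens; the real token list comes from the project's dataset.
tokenizer.add_tokens(["<placeholder_token_1>", "<placeholder_token_2>"])
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["fc1", "fc2", "q_proj", "v_proj", "dense"],
    # Train the embedding and output matrices in full so the newly added
    # tokens end up with learned embeddings instead of random ones.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ... training loop / Trainer goes here ...

# Merge the adapter weights back into the base model at the end.
model = model.merge_and_unload()
```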
The 1B model was trained as a full fine-tuning on an RTX 3090 (24GB). Training took approximately 2.5 hours.
<details>
<summary>Training Arguments</summary>
```
python3 train.py \
@@ -94,6 +107,9 @@ python3 train.py \
--ctx_size 2048
```
</details>
<br/>
## Home Assistant Component
In order to integrate with Home Assistant, we provide a `custom_component` that exposes the locally running LLM as a "conversation agent". The agent can be interacted with through a chat interface, and it integrates with the Speech-to-Text and Text-to-Speech addons so you can also interact with the model by speaking.
@@ -142,7 +158,7 @@ You need the following settings in order to configure the "remote" backend:
With the remote text-generation-webui backend, the component will validate that the selected model is available for use and will ensure it is loaded remotely. The Generic OpenAI compatible version does NOT do any validation or model loading.
**Setting up with LocalAI**:
If you are an existing LocalAI user or would like to use LocalAI as your backend, please refer to [this](https://io.midori-ai.xyz/howtos/setup-with-ha/) website, which has instructions on how to set up LocalAI to work with Home-LLM, including automatic installation of the latest version of the Home-LLM model. The auto-installer (LocalAI Manager) will download and set up LocalAI and/or the model of your choice and automatically create the necessary template files for the model to work with this integration.
### Configuring the component as a Conversation Agent

View File

@@ -4,8 +4,26 @@ The dataset is generated from the different CSV "piles". The "piles" contain dif
## Generating the custom dataset
`python3 generate_home_assistant_data.py --train --test`
`python3 generate_home_assistant_data.py --train --test --large`
## Merging with other datasets for training
`python3 generate_home_assistant_data.py --merge <dataset>`
Supported datasets right now are:
- `alpaca`
- `wizardlm70k`
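Conceptually, merging combines the generated synthetic examples with a general instruction dataset before training. The sketch below shows roughly what that looks like using the Hugging Face `datasets` library; the file names, column layout, and formatting are assumptions, not what `generate_home_assistant_data.py` actually does internally.
```
# Conceptual sketch of merging the synthetic HA data with alpaca-cleaned.
# File names and the "text" column layout are assumptions.
from datasets import load_dataset, concatenate_datasets

# Synthetic Home Assistant examples produced by the generation script
# (assumed output file name).
home_ds = load_dataset("json", data_files="home_assistant_train.json", split="train")

# General instruction-following data to mix in.
alpaca_ds = load_dataset("yahma/alpaca-cleaned", split="train")

def to_text(example):
    # Flatten an Alpaca record into a single text field (illustrative format).
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    return {"text": prompt + "\n" + example["output"]}

alpaca_ds = alpaca_ds.map(to_text, remove_columns=alpaca_ds.column_names)

# Assumes the synthetic data also exposes a single "text" column.
merged = concatenate_datasets([home_ds, alpaca_ds]).shuffle(seed=42)
merged.to_json("combined_train.json")
```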
## Potential Other Datasets to Use
### SFT
Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
### DPO
Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
Huggingface Ultrachat: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized

View File

@@ -1,5 +1,5 @@
# early home-llm experiments (phi1.5)
### rev1 - original test
- 1 epoch
- train ctx 1900
- I think the learning rate was way too high (2e-4)
@@ -7,7 +7,7 @@ rev1 - original test
- it doesn't get the device name right like ever
- eval dataset was disabled
### rev2 - it kinda works
- eval dataset at 10%
- 2 epochs
- train ctx 1200
@@ -18,7 +18,7 @@ rev2 - it kinda works
+ still repeatedly spits out code blocks but at least it closes them correctly
+ names are MUCH more accurate. will still hallucinate names that don't exist
### rev3 - ok it definitely works
- 4 epochs
- batch size 2
- learning rate cosine 5e-5
@@ -26,24 +26,24 @@ rev3 - ok it definitely works
+ still hallucinates device names. (need to figure this one out)
+ need more examples for: garage_door, media_player,
### rev4 - got to way lower loss. it tries really hard to stop generating text
- 4 epochs
- train ctx 512
- batch size 2
- learning rate cosine 1e-4
- added system prompt and moved services block before states block
### rev 4.1 - really doesn't work as well. loss dropped REALLY fast and then never got as low as rev4
- 4 epochs
- train ctx 512
- batch size 3
- learning rate cosine 1e-4
- proper pad token
### rev 4.2 - yeah nah it's the pad token
- batch size 2
### rev 5 - new dataset
- 3 epochs (4th epoch was overfit)
- train ctx 512
- batch size 2
@@ -51,14 +51,14 @@ rev 5 - new dataset
+ actually stops generating text. not at the right... place but still!
+ messing with temperature makes it generate some interesting output.
### rev 5.1 - gradient accumulation test
- 3 epochs
- train ctx 512
- batch size 8
- learning rate cosine 1e-5
+ very meh
### rev 5.2 - learning rate test
- 3 epochs
- train ctx 512
- batch size 8
@@ -68,14 +68,14 @@ rev 5.2 - learning rate test
+ still need more examples for multi-device actions (really need room/group support in dataset)
+ need to have more variance in request format. need more informal + more formal versions
### rev 5.3 - learning rate test 2
- 4 epochs
- train ctx 512
- batch size 8
- learning rate cosine 6e-5
+ lower learning rate seemed to not be as effective even though it ran for longer
### rev 6 - dataset revamp again
- 3 epochs
- train ctx 512
- batch size 8
@@ -85,58 +85,58 @@ rev 6 - dataset revamp again
+ definitely a bit overfit
+ maybe not so overfit. able to 0 shot asking to do stuff to a christmas tree
### rev 6.1 - lower train rate
- 3 epochs
- train ctx 512
- batch size 8
- learning rate cosine 6e-5
+ also definitely a bit overfit. can't generate names it hasn't seen before
### rev 6.2 - fewer epochs
- 2 epochs
- train ctx 512
- batch size 8
- learning rate cosine 6e-5
### rev 6.3 - higher batch
- 2 epochs
- train ctx 512
- batch size 12
- learning rate cosine 1e-4
### rev 7 - tweak dataset again
- 2 epochs
- train ctx 512
- batch size 8
- learning rate 1e-4
+ when generating results, don't end with a space. it works WAY better
### rev 7.1 - failed padding attempt
### rev 7.2 - try to overfit less + no newline at end
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-4
+ it definitely works with only one epoch
### rev 7.3 - try adding fake end of sentence token
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-4
### rev 8 - dataset tweaks. add status requests
+ service requests still mostly work but status requests are pretty broken
### rev 8.1 - tweak example counts + ratios
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-4
+ seems to have worked better with lower example counts
### rev 8.2 - try to fit learning rate so loss doesn't bottom out till the end of training
- 1 epoch
- train ctx 512
- batch size 8
@@ -147,7 +147,7 @@ rev 8.2 - try to fit learning rate so loss doesn't bottom out till the end of tr
+ oh yuuhhhhh it's overcranked. nails both request types (plus even ending generation)
+ needs ambiguous device name examples because I totally just asked it an ambiguous question and it answered the one I wasn't expecting
### rev 8.3 - further reduced training rate
- 1 epoch
- train ctx 512
- batch size 8
@@ -156,7 +156,7 @@ rev 8.3 - further reduced training rate
+ has some creativity with how it responds
+ will often get the device name wrong on the first try
### rev 8.4 - re-ordered prompt
- 1 epoch
- train ctx 512
- batch size 8
@@ -165,7 +165,7 @@ rev 8.4 - re-ordered prompt
+ it *works* but is incredibly open ended
+ basically never stops generating text
### rev 8.5 - tweaked prompt format again
- 1 epoch
- train ctx 512
- batch size 8
@@ -173,7 +173,7 @@ rev 8.5 - tweaked prompt format again
- re-ordered response before actions again but made actions less like a "block" so it might stop generation
+ that worked rather badly
### rev 8.6 - make prompt look more like other examples it has seen before
- 1 epoch
- train ctx 512
- batch size 8
@@ -183,7 +183,7 @@ rev 8.6 - make prompt look more like other examples it has seen before
+ only get the correct response about 50% of the time
+ it totally stops correctly when it DOES work
### rev 8.7 - try to fit a bit more. the last iteration jumps around on which format it chooses
- 1 epoch
- train ctx 512
- batch size 8
@@ -192,7 +192,7 @@ rev 8.7 - try to fit a bit more. the last iteration jumps around on which format
+ altering the format (with newlines) makes it pick our format more often
+ comparing to 8.6 with modified format shows this one is better at getting device names right
### rev 8.8 - train with newlines instead of spaces in requests/response
- 1 epoch
- train ctx 512
- batch size 8
@@ -200,14 +200,14 @@ rev 8.8 - train with newlines instead of spaces in requets/response
+ definitely worse than the previous one
+ for some reason both 8.7 and 8.8 are horrible when using their actual template but if you deviate slightly it works a lot better on inference
### rev 8.9 - actually fix pad token
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-5
+ properly generates a response (+ terminates) when using the actual template
### rev 9 - reduced dataset size
- 1 epoch
- train ctx 512
- batch size 8
@@ -233,20 +233,19 @@ rev 9 - reduced dataset size
+ it works OK with low temperatures
+ seems to handle the alpaca dataset not so well
### Home-1b-v1-GGUF
- eval results: 0.767816091954023
## home-1b-rev5/6 parameters
- 1 epoch
- 2048 train ctx
- batch size 8
- learning rate 1e-5
- weight decay 0.1
- gradient clipping 1.0
- save model every 200 or 400 steps
### home-1b-rev5
- dataset size: medium
- evaluation results:
- 200: 0.553448275862069
@@ -257,7 +256,7 @@ home-1b-rev5
- 1200: 0.8488505747126437 (+.009)
- Final (1467): 0.8494252873563218 (+.00005)
### home-1b-rev5_1
- dataset size: small
- evaluation results:
- 200: 0.6057471264367816
@@ -266,7 +265,7 @@ home-1b-rev5_1
- 800: 0.7729885057471264 (+.0046)
- Final (869): bad
### home-1b-rev5_2
- dataset size: large
- evaluation results:
- 200: --
@@ -279,15 +278,11 @@ home-1b-rev5_2
- 1600: 0.8844827586206897
- Final (1848): 0.8833333333333333
### home-1b-rev6
- dataset size: large (fixed templates + function calling arguments; brightness is broken)
- evaluation results: 0.8254149971379507
### home-1b-rev6_1
- dataset size: xl (fixed templates + function calling arguments; 0-255 brightness is broken)
- evaluation results:
- 400: 0.7240984544934173
@@ -297,7 +292,7 @@ home-1b-rev6_1
- 2000: 0.8551803091013166
- Final (2322): 0.8586147681740126
### home-1b-rev6_2 = Home-1B-v2-GGUF
- dataset size: large (change brightness back to percentages; increase color references by ~2x)
- evaluation results:
- 400: 0.7856064418721691
@@ -307,24 +302,28 @@ home-1b-rev6_2 = Home-1B-v2-GGUF
- 2000: 0.8852541519879215
- Final (2048):
# Home 3B
- 1 epoch
- 2048 train ctx
- batch size 8
- learning rate 1e-5
- weight decay 0.1
- gradient clipping 1.0
- save model every 200 or 400 steps
Missing a lot of earlier 3B training results (not sure where they are)
### Home-3b-v2-GGUF (broken training run)
- evaluation result: 0.6908045977011494
### home-3b-v3-rev1
- dataset size: large
- evaluation results: 0.9091954022988505
### home-3b-v3-rev2 = Home-3B-v2-GGUF (republished)
- dataset size: xl + alpaca
- evaluation results: 0.8731756416708606
Home-3B-v2-GGUF:ha_only
### Home-3B-v2-GGUF:ha_only
- dataset size: large
- evaluation results: FAILED (again.....)