random readme fixes + format notes

Alex O'Connell
2024-01-28 10:14:50 -05:00
parent c2d4d95212
commit 038d869ded
3 changed files with 102 additions and 69 deletions

View File

@@ -8,12 +8,17 @@ The latest models can be found on HuggingFace:
3B v2 (Based on Phi-2): https://huggingface.co/acon96/Home-3B-v2-GGUF
1B v2 (Based on Phi-1.5): https://huggingface.co/acon96/Home-1B-v2-GGUF
<details>
<summary>Old Models</summary>
3B v1 (Based on Phi-2): https://huggingface.co/acon96/Home-3B-v1-GGUF
1B v1 (Based on Phi-1.5): https://huggingface.co/acon96/Home-1B-v1-GGUF
</details>
Make sure you have `llama-cpp-python>=0.2.29` in order to run these models.
The main difference between the two models (besides parameter count) is the training data. The 1B model is ONLY trained on the synthetic dataset provided in this project, while the 3B model is trained on a mixture of this synthetic dataset and the cleaned Stanford Alpaca dataset.
The models are quantized using Llama.cpp so they can run in the very low-resource environments that are common for Home Assistant installations, such as Raspberry Pis.
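As a quick sanity check that one of these GGUF files loads and generates, you can call it directly with `llama-cpp-python`. The sketch below is only illustrative: the model file name, thread count, stop token, and prompt are placeholder assumptions, not the exact prompt format the integration uses.
```
# Minimal sketch: load a quantized Home-LLM GGUF with llama-cpp-python (>=0.2.29).
# File name, thread count, stop token, and prompt are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./Home-3B-v2.q4_k_m.gguf",  # path to a downloaded GGUF (placeholder name)
    n_ctx=2048,                             # context length the models were trained with
    n_threads=4,                            # tune for the host CPU (e.g. a Raspberry Pi)
)

output = llm(
    "turn on the kitchen light",            # illustrative request, not the real prompt template
    max_tokens=128,
    stop=["<|endoftext|>"],
)
print(output["choices"][0]["text"])
```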
@@ -63,6 +68,9 @@ The supported entity types are: light, fan, cover, lock, media_player, climate,
### Training
The 3B model was trained as a LoRA on an RTX 3090 (24GB) using the following settings for the custom training script. The embedding weights were "saved" and trained normally along with the rank matrices so that embeddings could be learned for the newly added tokens. The full model is merged together at the end. Training took approximately 10 hours.
<details>
<summary>Training Arguments</summary>
```
python3 train.py \
--run_name home-3b \
@@ -79,7 +87,12 @@ python3 train.py \
--use_lora --lora_rank 32 --lora_alpha 64 --lora_modules fc1,fc2,q_proj,v_proj,dense --lora_modules_to_save embed_tokens,lm_head --lora_merge
```
</details>
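The LoRA flags above correspond to a fairly standard PEFT configuration. The sketch below is only an illustration of that setup, not the project's `train.py`; the base model name and the added tokens are assumptions.
```
# Illustrative PEFT setup matching the LoRA flags above (not the project's train.py).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "microsoft/phi-2"  # assumed base model for the 3B variant
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder special tokens; the real token list comes from the project's dataset.
tokenizer.add_tokens(["<placeholder_token_1>", "<placeholder_token_2>"])
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["fc1", "fc2", "q_proj", "v_proj", "dense"],
    # Train the embedding and output matrices in full so the newly added
    # tokens end up with learned embeddings instead of random ones.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ... training loop / Trainer goes here ...

# Merge the adapter weights back into the base model at the end.
model = model.merge_and_unload()
```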
The 1B model was trained as a full fine-tuning on an RTX 3090 (24GB). Training took approximately 2.5 hours.
<details>
<summary>Training Arguments</summary>
```
python3 train.py \
@@ -94,6 +107,9 @@ python3 train.py \
--ctx_size 2048
```
</details>
<br/>
## Home Assistant Component
In order to integrate with Home Assistant, we provide a `custom_component` that exposes the locally running LLM as a "conversation agent". The agent can be interacted with through a chat interface, and it integrates with the Speech-to-Text and Text-to-Speech addons so you can also interact with the model by speaking.
@@ -142,7 +158,7 @@ You need the following settings in order to configure the "remote" backend:
With the remote text-generation-webui backend, the component will validate that the selected model is available for use and will ensure it is loaded remotely. The Generic OpenAI compatible version does NOT do any validation or model loading.
**Setting up with LocalAI**:
If you are an existing LocalAI user or would like to use LocalAI as your backend, please refer to [this](https://io.midori-ai.xyz/howtos/setup-with-ha/) website, which has instructions on how to set up LocalAI to work with Home-LLM, including automatic installation of the latest version of the Home-LLM model. The auto-installer (LocalAI Manager) will download and set up LocalAI and/or the model of your choice and automatically create the necessary template files for the model to work with this integration.
### Configuring the component as a Conversation Agent

View File

@@ -4,8 +4,26 @@ The dataset is generated from the different CSV "piles". The "piles" contain dif
## Generating the custom dataset
`python3 generate_home_assistant_data.py --train --test`
`python3 generate_home_assistant_data.py --train --test --large`
## Merging with other datasets for training
`python3 generate_home_assistant_data.py --merge <dataset>`
Supported datasets right now are:
- `alpaca`
- `wizardlm70k`
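Conceptually, merging combines the generated synthetic examples with a general instruction dataset before training. The sketch below shows roughly what that looks like using the Hugging Face `datasets` library; the file names, column layout, and formatting are assumptions, not what `generate_home_assistant_data.py` actually does internally.
```
# Conceptual sketch of merging the synthetic HA data with alpaca-cleaned.
# File names and the "text" column layout are assumptions.
from datasets import load_dataset, concatenate_datasets

# Synthetic Home Assistant examples produced by the generation script
# (assumed output file name).
home_ds = load_dataset("json", data_files="home_assistant_train.json", split="train")

# General instruction-following data to mix in.
alpaca_ds = load_dataset("yahma/alpaca-cleaned", split="train")

def to_text(example):
    # Flatten an Alpaca record into a single text field (illustrative format).
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    return {"text": prompt + "\n" + example["output"]}

alpaca_ds = alpaca_ds.map(to_text, remove_columns=alpaca_ds.column_names)

# Assumes the synthetic data also exposes a single "text" column.
merged = concatenate_datasets([home_ds, alpaca_ds]).shuffle(seed=42)
merged.to_json("combined_train.json")
```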
## Potential Other Datasets to Use
### SFT
Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
### DPO
Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
Huggingface Ultrachat: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized

View File

@@ -1,5 +1,5 @@
# early home-llm experiments (phi1.5)
### rev1 - original test
- 1 epoch
- train ctx 1900
- I think the learning rate was way too high (2e-4)
@@ -7,7 +7,7 @@ rev1 - original test
- it doesn't get the device name right like ever
- eval dataset was disabled
### rev2 - it kinda works
- eval dataset at 10%
- 2 epochs
- train ctx 1200
@@ -18,7 +18,7 @@ rev2 - it kinda works
+ still repeatedly spits out code blocks but at least it closes them correctly
+ names are MUCH more accurate. will still hallucinate names that don't exist
### rev3 - ok it definitely works
- 4 epochs
- batch size 2
- learning rate cosine 5e-5
@@ -26,24 +26,24 @@ rev3 - ok it definitely works
+ still hallucinates device names. (need to figure this one out)
+ need more examples for: garage_door, media_player,
### rev4 - got to way lower loss. it tries really hard to stop generating text
- 4 epochs
- train ctx 512
- batch size 2
- learning rate cosine 1e-4
- added system prompt and moved services block before states block
### rev 4.1 - really doesn't work as well. loss dropped REALLY fast and then never got as low as rev4
- 4 epochs
- train ctx 512
- batch size 3
- learning rate cosine 1e-4
- proper pad token
### rev 4.2 - yeah nah it's the pad token
- batch size 2
### rev 5 - new dataset
- 3 epochs (4th epoch was overfit)
- train ctx 512
- batch size 2
@@ -51,14 +51,14 @@ rev 5 - new dataset
+ actually stops generating text. not at the right... place but still!
+ messing with temperature makes it generate some interesting output.
### rev 5.1 - gradient accumulation test
- 3 epochs
- train ctx 512
- batch size 8
- learning rate cosine 1e-5
+ very meh
### rev 5.2 - learning rate test
- 3 epochs
- train ctx 512
- batch size 8
@@ -68,14 +68,14 @@ rev 5.2 - learning rate test
+ still need more examples for multi-device actions (really need room/group support in dataset)
+ need to have more variance in request format. need more informal + more formal versions
### rev 5.3 - learning rate test 2
- 4 epochs
- train ctx 512
- batch size 8
- learning rate cosine 6e-5
+ lower learning rate seemed to not be as effective even though it ran for longer
### rev 6 - dataset revamp again
- 3 epochs
- train ctx 512
- batch size 8
@@ -85,58 +85,58 @@ rev 6 - dataset revamp again
+ definitely a bit overfit
+ maybe not so overfit. able to 0 shot asking to do stuff to a christmas tree
### rev 6.1 - lower train rate
- 3 epochs
- train ctx 512
- batch size 8
- learning rate cosine 6e-5
+ also definitely a bit overfit. can't generate names it hasn't seen before
### rev 6.2 - fewer epochs
- 2 epochs
- train ctx 512
- batch size 8
- learning rate cosine 6e-5
### rev 6.3 - higher batch
- 2 epochs
- train ctx 512
- batch size 12
- learning rate cosine 1e-4
### rev 7 - tweak dataset again
- 2 epochs
- train ctx 512
- batch size 8
- learning rate 1e-4
+ when generating results, don't end with a space. it works WAY better
### rev 7.1 - failed padding attempt
### rev 7.2 - try to overfit less + no newline at end
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-4
+ it definitely works with only one epoch
### rev 7.3 - try adding fake end of sentence token
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-4
### rev 8 - dataset tweaks. add status requests
+ service requests still mostly work but status requests are pretty broken
### rev 8.1 - tweak example counts + ratios
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-4
+ seems to have worked better with lower example counts
### rev 8.2 - try to fit learning rate so loss doesn't bottom out till the end of training
- 1 epoch
- train ctx 512
- batch size 8
@@ -147,7 +147,7 @@ rev 8.2 - try to fit learning rate so loss doesn't bottom out till the end of tr
+ oh yuuhhhhh it's overcranked. nails both request types (plus even ending generation)
+ needs ambiguous device name examples because I totally just asked it an ambiguous question and it answered the one I wasn't expecting
### rev 8.3 - further reduced training rate
- 1 epoch
- train ctx 512
- batch size 8
@@ -156,7 +156,7 @@ rev 8.3 - further reduced training rate
+ has some creativity with how it responds
+ will often get the device name wrong on the first try
### rev 8.4 - re-ordered prompt
- 1 epoch
- train ctx 512
- batch size 8
@@ -165,7 +165,7 @@ rev 8.4 - re-ordered prompt
+ it *works* but is incredibly open ended
+ basically never stops generating text
### rev 8.5 - tweaked prompt format again
- 1 epoch
- train ctx 512
- batch size 8
@@ -173,7 +173,7 @@ rev 8.5 - tweaked prompt format again
- re-ordered response before actions again but made actions less like a "block" so it might stop generation
+ that worked rather badly
### rev 8.6 - make prompt look more like other examples it has seen before
- 1 epoch
- train ctx 512
- batch size 8
@@ -183,7 +183,7 @@ rev 8.6 - make prompt look more like other examples it has seen before
+ only get the correct response about 50% of the time
+ it totally stops correctly when it DOES work
### rev 8.7 - try to fit a bit more. the last iteration jumps around on which format it chooses
- 1 epoch
- train ctx 512
- batch size 8
@@ -192,7 +192,7 @@ rev 8.7 - try to fit a bit more. the last iteration jumps around on which format
+ altering the format (with newlines) makes it pick our format more often
+ comparing to 8.6 with modified format shows this one is better at getting device names right
### rev 8.8 - train with newlines instead of spaces in requests/response
- 1 epoch
- train ctx 512
- batch size 8
@@ -200,14 +200,14 @@ rev 8.8 - train with newlines instead of spaces in requets/response
+ definitely worse than the previous one
+ for some reason both 8.7 and 8.8 are horrible when using their actual template but if you deviate slightly it works a lot better on inference
### rev 8.9 - actually fix pad token
- 1 epoch
- train ctx 512
- batch size 8
- learning rate 1e-5
+ properly generates a response (+ terminates) when using the actual template
### rev 9 - reduced dataset size
- 1 epoch
- train ctx 512
- batch size 8
@@ -233,20 +233,19 @@ rev 9 - reduced dataset size
+ it works OK with low temperatures
+ seems to handle the alpaca dataset not so well
### Home-1b-v1-GGUF
- eval results: 0.767816091954023
## home-1b-rev5/6 parameters
- 1 epoch
- 2048 train ctx
- batch size 8
- learning rate 1e-5
- weight decay 0.1
- gradient clipping 1.0
- save model every 200 or 400 steps
### home-1b-rev5
- dataset size: medium
- evaluation results:
- 200: 0.553448275862069
@@ -257,7 +256,7 @@ home-1b-rev5
- 1200: 0.8488505747126437 (+.009)
- Final (1467): 0.8494252873563218 (+.00005)
### home-1b-rev5_1
- dataset size: small
- evaluation results:
- 200: 0.6057471264367816
@@ -266,7 +265,7 @@ home-1b-rev5_1
- 800: 0.7729885057471264 (+.0046)
- Final (869): bad
### home-1b-rev5_2
- dataset size: large
- evaluation results:
- 200: --
@@ -279,15 +278,11 @@ home-1b-rev5_2
- 1600: 0.8844827586206897
- Final (1848): 0.8833333333333333
### home-1b-rev6
- dataset size: large (fixed templates + function calling arguments; brightness is broken)
- evaluation results: 0.8254149971379507
### home-1b-rev6_1
- dataset size: xl (fixed templates + function calling arguments; 0-255 brightness is broken)
- evaluation results:
- 400: 0.7240984544934173
@@ -297,7 +292,7 @@ home-1b-rev6_1
- 2000: 0.8551803091013166
- Final (2322): 0.8586147681740126
### home-1b-rev6_2 = Home-1B-v2-GGUF
- dataset size: large (change brightness back to percentages; increase color references by ~2x)
- evaluation results:
- 400: 0.7856064418721691
@@ -307,24 +302,28 @@ home-1b-rev6_2 = Home-1B-v2-GGUF
- 2000: 0.8852541519879215
- Final (2048):
# Home 3B
- 1 epoch
- 2048 train ctx
- batch size 8
- learning rate 1e-5
- weight decay 0.1
- gradient clipping 1.0
- save model every 200 or 400 steps
Missing a lot of earlier 3B training results (not sure where they are)
### Home-3b-v2-GGUF (broken training run)
- evaluation result: 0.6908045977011494
### home-3b-v3-rev1
- dataset size: large
- evaluation results: 0.9091954022988505
### home-3b-v3-rev2 = Home-3B-v2-GGUF (republished)
- dataset size: xl + alpaca
- evaluation results: 0.8731756416708606
Home-3B-v2-GGUF:ha_only
### Home-3B-v2-GGUF:ha_only
- dataset size: large
- evaluation results: FAILED (again.....)