Mirror of https://github.com/acon96/home-llm.git, synced 2026-01-09 13:48:05 -05:00
Upload dataset snapshot to HF
@@ -65,6 +65,9 @@ If Mary is 7 years old, then you are 10 years old (7+3=10).<|im_end|>
The synthetic dataset is aimed at covering basic day-to-day operations in Home Assistant, such as turning devices on and off.
The supported entity types are: light, fan, cover, lock, media_player, climate, switch
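For a concrete sense of the operations involved, the snippet below maps each supported entity domain to a typical Home Assistant service that a request might exercise. It is purely illustrative; it is not taken from the dataset and does not describe the dataset's output format.

```python
# Illustrative only: typical Home Assistant services per supported entity domain.
# The dataset's actual output format is defined by the generation script, not here.
EXAMPLE_SERVICES = {
    "light": ["light.turn_on", "light.turn_off"],
    "fan": ["fan.turn_on", "fan.turn_off"],
    "cover": ["cover.open_cover", "cover.close_cover"],
    "lock": ["lock.lock", "lock.unlock"],
    "media_player": ["media_player.media_play", "media_player.media_pause"],
    "climate": ["climate.set_temperature"],
    "switch": ["switch.turn_on", "switch.turn_off"],
}
```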
The dataset is available on HuggingFace: https://huggingface.co/datasets/acon96/Home-Assistant-Requests
The source for the dataset is in the [data](/data) folder of this repository.
### Training
The 3B model was trained as a LoRA on an RTX 3090 (24GB) using the following settings for the custom training script. The embedding weights were "saved" and trained normally along with the rank matrices in order to train the embeddings for the newly added tokens. The full model is merged together at the end. Training took approximately 10 hours.
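As a rough sketch of that setup (not the repository's custom training script; the base model name, target modules, and hyperparameters below are placeholder assumptions for a LLaMA-style model trained with Hugging Face `transformers` and `peft`), training the LoRA rank matrices while fully training the embedding and output layers looks roughly like this:

```python
# Hypothetical sketch: LoRA fine-tuning with the embedding/output layers trained in
# full so that newly added ChatML tokens receive trained embeddings.
# Model name, target modules, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "example/llama-style-3b"  # placeholder, not the actual base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Add the new special tokens and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    modules_to_save=["embed_tokens", "lm_head"],  # these layers are "saved" and trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...run the usual supervised fine-tuning loop, then merge the adapter back into the
# base weights (e.g. model.merge_and_unload()) to produce the final full model.
```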
TODO.md
@@ -18,7 +18,7 @@
- Ollama
- support chat completions API (might fix Ollama + adds support for text-gen-ui characters)
- [x] more config options for prompt template (allow other than chatml)
- [ ] publish snapshot of dataset on HF
- [x] publish snapshot of dataset on HF
- [ ] figure out DPO for refusals + fixing incorrect entity id
- [ ] mixtral + prompting (no fine tuning)
- [ ] use varied system prompts to add behaviors
@@ -1,12 +1,33 @@
# Dataset
---
license: mit
task_categories:
- question-answering
- text-generation
tags:
- automation
- home
- assistant
language:
- en
pretty_name: Home Assistant Requests
size_categories:
- 10K<n<100K
---
# Home Assistant Requests Dataset
This dataset contains a list of requests and responses for a user interacting with a personal assistant that controls an instance of [Home Assistant](https://www.home-assistant.io/).
The dataset is generated from the different CSV "piles". The "piles" contain different chunks of requests that are assembled into a final context that is presented to the LLM. For example, `piles/pile_of_device_names.csv` contains only the names of various devices, which are used as part of the context as well as inserted into the templates from `piles/pile_of_templated_actions.csv` and `piles/pile_of_status_requests.csv`. The logic for assembling the final dataset from the piles is contained in [generate_home_assistant_data.py](./generate_home_assistant_data.py).
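The real assembly logic is in [generate_home_assistant_data.py](./generate_home_assistant_data.py); the snippet below is only a simplified sketch of the idea, and the column names and `<device_name>` placeholder are assumptions rather than the script's actual schema:

```python
# Simplified, hypothetical sketch of assembling requests from the CSV "piles".
# Column names and the <device_name> placeholder are illustrative assumptions;
# see generate_home_assistant_data.py for the real logic.
import csv
import random

def load_pile(path: str, column: str) -> list[str]:
    """Read a single column of a pile CSV into a list of strings."""
    with open(path, newline="") as f:
        return [row[column] for row in csv.DictReader(f)]

device_names = load_pile("piles/pile_of_device_names.csv", "device_name")
templates = load_pile("piles/pile_of_templated_actions.csv", "template")

def build_request() -> str:
    """Fill a templated action with a randomly chosen device name."""
    return random.choice(templates).replace("<device_name>", random.choice(device_names))

print(build_request())  # e.g. "Turn off the <device_name>" -> "Turn off the kitchen light"
```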
## Generating the custom dataset
## Generating the dataset from piles
`python3 generate_home_assistant_data.py --train --test --large`
## Merging with other datasets for training
Supported dataset splits are `--test`, `--train`, & `--sample`
Arguments to set the train dataset size are `--small`, `--medium`, `--large`, & `--xl`.
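For example, a smaller training split alongside the test split (assuming the split and size flags combine the same way as in the command above):

`python3 generate_home_assistant_data.py --train --test --small`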
## Merging with other instruct-datasets for training
`python3 generate_home_assistant_data.py --merge <dataset>`
@@ -14,16 +35,4 @@ Supported datasets right now are:
- `alpaca`
- `wizardlm70k`
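For example, to mix in one of the supported datasets listed above:

`python3 generate_home_assistant_data.py --merge alpaca`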
## Potential Other Datasets to Use
### SFT
Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
### DPO
Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
Huggingface Ultrachat: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
Please note that the supported datasets all have different licenses. Be aware that the license of the resulting data mixture might be different from the license of this dataset alone.
@@ -326,4 +326,19 @@ Missing a lot of earlier 3B training results (not sure where they are)
### Home-3B-v2-GGUF:ha_only
- dataset size: large
- evaluation results: FAILED (again.....)
## Potential Other Datasets to Use
### SFT
Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
### DPO
Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
Huggingface Ultrachat: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized