Upload dataset snapshot to HF

This commit is contained in:
Alex O'Connell
2024-01-28 21:01:59 -05:00
parent 18f9b7cdc0
commit 680ab96bb7
4 changed files with 45 additions and 18 deletions


@@ -1,12 +1,33 @@
# Dataset
---
license: mit
task_categories:
- question-answering
- text-generation
tags:
- automation
- home
- assistant
language:
- en
pretty_name: Home Assistant Requests
size_categories:
- 10K<n<100K
---
# Home Assistant Requests Dataset
This dataset contains a list of requests and responses for a user interacting with a personal assistant that controls an instance of [Home Assistant](https://www.home-assistant.io/).
The dataset is generated from several CSV "piles". Each pile contains a different chunk of requests that is assembled into the final context presented to the LLM. For example, `piles/pile_of_device_names.csv` contains only the names of various devices; these names are used as part of the context and are also inserted into the templates in `piles/pile_of_templated_actions.csv` and `piles/pile_of_status_requests.csv`. The logic for assembling the final dataset from the piles is contained in [generate_home_assistant_data.py](./generate_home_assistant_data.py).
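As an illustration of the assembly step, here is a minimal sketch of filling a templated pile with device names. It is not taken from the actual generator: the column names and the `<device_name>` placeholder are assumptions, and the real logic lives in `generate_home_assistant_data.py`.

```python
# Hypothetical sketch of the pile-assembly idea; the column names and the
# <device_name> placeholder are assumptions, not the script's actual schema.
import csv
import random

with open("piles/pile_of_device_names.csv", newline="") as f:
    device_names = [row["device_name"] for row in csv.DictReader(f)]

with open("piles/pile_of_templated_actions.csv", newline="") as f:
    templates = [row["template"] for row in csv.DictReader(f)]

# Fill each action template with a randomly chosen device name.
examples = [t.replace("<device_name>", random.choice(device_names)) for t in templates]
print(examples[:3])
```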
## Generating the custom dataset
## Generating the dataset from piles
`python3 generate_home_assistant_data.py --train --test --large`
## Merging with other datasets for training
Supported dataset splits are `--test`, `--train`, & `--sample`.
Arguments that set the training dataset size are `--small`, `--medium`, `--large`, & `--xl`.
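For reference, the flag handling could look roughly like the following `argparse` sketch; this is an assumption about the script's interface, not its actual code.

```python
# Rough sketch of how the split/size flags might be parsed (assumed).
import argparse

parser = argparse.ArgumentParser()
for split in ("train", "test", "sample"):
    parser.add_argument(f"--{split}", action="store_true", help=f"generate the {split} split")

# The size flags presumably select one training-set size, so model them as exclusive.
size = parser.add_mutually_exclusive_group()
for s in ("small", "medium", "large", "xl"):
    size.add_argument(f"--{s}", action="store_true", help=f"use the {s} training-set size")

args = parser.parse_args()
```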
## Merging with other instruct-datasets for training
`python3 generate_home_assistant_data.py --merge <dataset>`
@@ -14,16 +35,4 @@ Supported datasets right now are:
- `alpaca`
- `wizardlm70k`
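A merge along these lines could be sketched with the Hugging Face `datasets` library; the local file path and the `question`/`answer` column names below are assumptions, not the script's actual schema.

```python
# Hypothetical sketch of merging with alpaca via the `datasets` library.
# The local file path and the question/answer schema are assumptions.
from datasets import load_dataset, concatenate_datasets

home = load_dataset("json", data_files="home_assistant_train.jsonl", split="train")
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")

# Map alpaca's instruction/output columns onto the assumed question/answer schema.
alpaca = alpaca.map(
    lambda row: {"question": row["instruction"], "answer": row["output"]},
    remove_columns=alpaca.column_names,
)

merged = concatenate_datasets([home, alpaca]).shuffle(seed=42)
```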
## Potential Other Datasets to Use
### SFT
- Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
- Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
- WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
- WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
- Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
- OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
### DPO
- Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
- Huggingface UltraFeedback (Binarized): https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
Please note that the supported datasets all have different licenses. Be aware that the license of the resulting data mixture may differ from the license of this dataset alone.