Upload dataset snapshot to HF

This commit is contained in:
Alex O'Connell
2024-01-28 21:01:59 -05:00
parent 18f9b7cdc0
commit 680ab96bb7
4 changed files with 45 additions and 18 deletions


@@ -1,12 +1,33 @@
# Dataset
---
license: mit
task_categories:
- question-answering
- text-generation
tags:
- automation
- home
- assistant
language:
- en
pretty_name: Home Assistant Requests
size_categories:
- 10K<n<100K
---
# Home Assistant Requests Dataset
This dataset contains a list of requests and responses for a user interacting with a personal assistant that controls an instance of [Home Assistant](https://www.home-assistant.io/).
The dataset is generated from several CSV "piles". Each pile contains a different chunk of requests that is assembled into the final context presented to the LLM. For example, `piles/pile_of_device_names.csv` contains only the names of various devices; these names are used as part of the context and are also inserted into the templates in `piles/pile_of_templated_actions.csv` and `piles/pile_of_status_requests.csv`. The logic for assembling the final dataset from the piles is contained in [generate_home_assistant_data.py](./generate_home_assistant_data.py).
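As an illustration of the assembly step, here is a minimal sketch of filling a templated pile with device names. It is not taken from the actual generator: the column names and the `<device_name>` placeholder are assumptions, and the real logic lives in `generate_home_assistant_data.py`.

```python
# Hypothetical sketch of the pile-assembly idea; the column names and the
# <device_name> placeholder are assumptions, not the script's actual schema.
import csv
import random

with open("piles/pile_of_device_names.csv", newline="") as f:
    device_names = [row["device_name"] for row in csv.DictReader(f)]

with open("piles/pile_of_templated_actions.csv", newline="") as f:
    templates = [row["template"] for row in csv.DictReader(f)]

# Fill each action template with a randomly chosen device name.
examples = [t.replace("<device_name>", random.choice(device_names)) for t in templates]
print(examples[:3])
```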
## Generating the custom dataset
## Generating the dataset from piles
`python3 generate_home_assistant_data.py --train --test --large`
## Merging with other datasets for training
Supported dataset splits are `--test`, `--train`, & `--sample`.
Arguments that set the training dataset size are `--small`, `--medium`, `--large`, & `--xl`.
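For reference, the flag handling could look roughly like the following `argparse` sketch; this is an assumption about the script's interface, not its actual code.

```python
# Rough sketch of how the split/size flags might be parsed (assumed).
import argparse

parser = argparse.ArgumentParser()
for split in ("train", "test", "sample"):
    parser.add_argument(f"--{split}", action="store_true", help=f"generate the {split} split")

# The size flags presumably select one training-set size, so model them as exclusive.
size = parser.add_mutually_exclusive_group()
for s in ("small", "medium", "large", "xl"):
    size.add_argument(f"--{s}", action="store_true", help=f"use the {s} training-set size")

args = parser.parse_args()
```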
## Merging with other instruct-datasets for training
`python3 generate_home_assistant_data.py --merge <dataset>`
@@ -14,16 +35,4 @@ Supported datasets right now are:
- `alpaca`
- `wizardlm70k`
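A merge along these lines could be sketched with the Hugging Face `datasets` library; the local file path and the `question`/`answer` column names below are assumptions, not the script's actual schema.

```python
# Hypothetical sketch of merging with alpaca via the `datasets` library.
# The local file path and the question/answer schema are assumptions.
from datasets import load_dataset, concatenate_datasets

home = load_dataset("json", data_files="home_assistant_train.jsonl", split="train")
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")

# Map alpaca's instruction/output columns onto the assumed question/answer schema.
alpaca = alpaca.map(
    lambda row: {"question": row["instruction"], "answer": row["output"]},
    remove_columns=alpaca.column_names,
)

merged = concatenate_datasets([home, alpaca]).shuffle(seed=42)
```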
## Potential Other Datasets to Use
### SFT
- Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
- Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
- WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
- WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
- Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
- OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
### DPO
- Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
- Huggingface UltraFeedback (Binarized): https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
Please note that the supported datasets all have different licenses. Be aware that the license of the resulting data mixture may differ from the license of this dataset alone.