---
license: mit
task_categories:
- question-answering
- text-generation
tags:
- automation
- home
- assistant
language:
- en
pretty_name: Home Assistant Requests
size_categories:
- 10K<n<100K
---

# Home Assistant Requests Dataset

This dataset contains a list of requests and responses for a user interacting with a personal assistant that controls an instance of [Home Assistant](https://www.home-assistant.io/).

The dataset is generated from several CSV "piles". Each pile contains a different chunk of requests that are assembled into the final context presented to the LLM. For example, `piles/pile_of_device_names.csv` contains only the names of various devices; these names are used as part of the context and are also substituted into the templates in `piles/pile_of_templated_actions.csv` and `piles/pile_of_status_requests.csv`. The logic for assembling the final dataset from the piles is contained in [generate_home_assistant_data.py](./generate_home_assistant_data.py).
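
To make the templating idea concrete, here is a minimal sketch, assuming hypothetical column names (`device_name`, `template`) and a `<device_name>` placeholder; the actual assembly logic is in [generate_home_assistant_data.py](./generate_home_assistant_data.py):

```python
import csv
import random

# Load the raw piles (column names here are assumptions for illustration).
with open("piles/pile_of_device_names.csv") as f:
    device_names = [row["device_name"] for row in csv.DictReader(f)]
with open("piles/pile_of_templated_actions.csv") as f:
    templates = [row["template"] for row in csv.DictReader(f)]

# Fill a random action template with a random device name to produce
# one training request.
request = random.choice(templates).replace("<device_name>", random.choice(device_names))
print(request)
```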

## Generating the dataset from piles

`python3 generate_home_assistant_data.py --train --test --large`

Supported dataset splits are `--test`, `--train`, & `--sample`.
Arguments to set the train dataset size are `--small`, `--medium`, `--large`, & `--xl`.
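
As a rough sketch of how this CLI surface might be wired up with `argparse` (the real argument handling lives in [generate_home_assistant_data.py](./generate_home_assistant_data.py) and may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Generate the Home Assistant Requests dataset")
# Split flags: any combination may be requested.
parser.add_argument("--train", action="store_true", help="generate the train split")
parser.add_argument("--test", action="store_true", help="generate the test split")
parser.add_argument("--sample", action="store_true", help="generate a small sample split")
# Size flags for the train split: mutually exclusive by construction here.
size = parser.add_mutually_exclusive_group()
for name in ("small", "medium", "large", "xl"):
    size.add_argument(f"--{name}", action="store_true", help=f"use the {name} train split size")
args = parser.parse_args()
```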

## Merging with other instruct datasets for training

`python3 generate_home_assistant_data.py --merge <dataset>`

Supported datasets right now are:
- `alpaca`
- `wizardlm70k`
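
Conceptually, the merge step maps the other dataset into the same text format as the generated data and concatenates the two. A minimal sketch using the `datasets` library (the output file name and the single `text` field are assumptions; the real mapping is in [generate_home_assistant_data.py](./generate_home_assistant_data.py)):

```python
from datasets import Dataset, load_dataset, concatenate_datasets

# Generated home assistant data, assumed here to be JSON with a "text" field.
home_data = Dataset.from_json("home_assistant_train.json")

# Alpaca has instruction/input/output columns; collapse them into "text".
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")

def to_text(example):
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n" + example["input"]
    return {"text": prompt + "\n" + example["output"]}

merged = concatenate_datasets([
    home_data,
    alpaca.map(to_text, remove_columns=alpaca.column_names),
])
```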

## Potential Other Datasets to Use

### SFT

- Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
- Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
- WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
- WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
- HuggingFace Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
- OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup

### DPO

- Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
- HuggingFace Ultrafeedback (binarized): https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
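
All of these are hosted on the Hugging Face Hub, so any of them can be pulled down for inspection with the `datasets` library:

```python
from datasets import load_dataset

# Each row of the DPO pairs has a prompt plus a chosen and a rejected response.
orca_dpo = load_dataset("Intel/orca_dpo_pairs", split="train")
print(orca_dpo[0].keys())
```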

Please note that the supported datasets all have different licenses. Be aware that the license of the resulting data mixture may be different from the license of this dataset alone.