Upload dataset snapshot to HF

This commit is contained in:
Alex O'Connell
2024-01-28 21:01:59 -05:00
parent 18f9b7cdc0
commit 680ab96bb7
4 changed files with 45 additions and 18 deletions

View File

@@ -65,6 +65,9 @@ If Mary is 7 years old, then you are 10 years old (7+3=10).<|im_end|>
The synthetic dataset aims to cover basic day-to-day operations in Home Assistant, such as turning devices on and off.
The supported entity types are: `light`, `fan`, `cover`, `lock`, `media_player`, `climate`, `switch`
The dataset is available on HuggingFace: https://huggingface.co/datasets/acon96/Home-Assistant-Requests
The source for the dataset is in the [data](/data) folder of this repository.
### Training
The 3B model was trained as a LoRA on an RTX 3090 (24GB) using the following settings for the custom training script. The embedding weights were kept trainable ("saved") and updated normally alongside the rank matrices so that the newly added tokens learned meaningful embeddings. The full model is merged together at the end. Training took approximately 10 hours.
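The merge step described above, folding the trained rank matrices back into the frozen base weights, can be sketched in plain Python. The function name, matrix shapes, and `alpha`/`r` values below are illustrative assumptions, not the actual training script:

```python
# Minimal sketch of a LoRA merge: the frozen base weight W is combined with
# the trained rank-r matrices B (d x r) and A (r x d) as W' = W + (alpha/r) * B @ A.
# Pure Python, for illustration only.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update into the base weight matrix."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# 2x2 base weight with a rank-1 update
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d x r
A = [[0.5, 0.5]]     # r x d
print(merge_lora(W, A, B, alpha=1.0, r=1))  # → [[1.5, 0.5], [1.0, 2.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the repo can ship a single full model.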

View File

@@ -18,7 +18,7 @@
- Ollama
- support the chat completions API (might fix Ollama + add support for text-gen-ui characters)
- [x] more config options for prompt template (allow other than chatml)
- [x] publish snapshot of dataset on HF
- [ ] figure out DPO for refusals + fixing incorrect entity id
- [ ] mixtral + prompting (no fine tuning)
- [ ] use varied system prompts to add behaviors

View File

@@ -1,12 +1,33 @@
# Dataset
---
license: mit
task_categories:
- question-answering
- text-generation
tags:
- automation
- home
- assistant
language:
- en
pretty_name: Home Assistant Requests
size_categories:
- 10K<n<100K
---
# Home Assistant Requests Dataset
This dataset contains a list of requests and responses for a user interacting with a personal assistant that controls an instance of [Home Assistant](https://www.home-assistant.io/).
The dataset is generated from the different CSV "piles". The "piles" contain different chunks of requests that are assembled into a final context that is presented to the LLM. For example, `piles/pile_of_device_names.csv` contains only names of various devices to be used as part of context as well as inserted into `piles/pile_of_templated_actions.csv` and `piles/pile_of_status_requests.csv`. The logic for assembling the final dataset from the piles is contained in [generate_home_assistant_data.py](./generate_home_assistant_data.py).
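As an illustration of the pile mechanism (with hypothetical column names and templates, not the repo's actual schema), device rows from one CSV pile can be substituted into templated requests:

```python
# Hypothetical sketch of assembling requests from CSV "piles".
# Column names and templates are illustrative, not the repo's actual code.
import csv
import io

# Stand-in for piles/pile_of_device_names.csv
device_csv = "device_name,friendly_name\nlight.kitchen,Kitchen Light\nfan.bedroom,Bedroom Fan\n"
# Stand-in for templated actions that reference a device by friendly name
templates = ["turn on the {friendly_name}", "turn off the {friendly_name}"]

devices = list(csv.DictReader(io.StringIO(device_csv)))
# Cross every device with every template to build request examples
samples = [t.format(**d) for d in devices for t in templates]
print(samples[0])  # → turn on the Kitchen Light
```

The real generator additionally builds the Home Assistant context block around each request; see [generate_home_assistant_data.py](./generate_home_assistant_data.py) for the actual logic.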
## Generating the dataset from piles
`python3 generate_home_assistant_data.py --train --test --large`
Supported dataset split flags are `--train`, `--test`, and `--sample`.
The training-set size is selected with one of `--small`, `--medium`, `--large`, or `--xl`.
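A hypothetical sketch of how those flags could be wired up with `argparse` (the real script's argument handling may differ): the split flags are independent booleans, while the size flags are mutually exclusive:

```python
# Hypothetical flag handling for the data generator; illustrative only.
import argparse

parser = argparse.ArgumentParser(description="generate_home_assistant_data sketch")
# Split flags can be combined freely
for split in ("train", "test", "sample"):
    parser.add_argument(f"--{split}", action="store_true")
# Size flags are mutually exclusive: only one way to size the train set
size = parser.add_mutually_exclusive_group()
for s in ("small", "medium", "large", "xl"):
    size.add_argument(f"--{s}", action="store_true")

args = parser.parse_args(["--train", "--test", "--large"])
print(args.train, args.test, args.large)  # → True True True
```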
## Merging with other instruct-datasets for training
`python3 generate_home_assistant_data.py --merge <dataset>`
@@ -14,16 +35,4 @@ Supported datasets right now are:
- `alpaca`
- `wizardlm70k`
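A minimal, hedged sketch of the merge idea: combine the Home Assistant examples with a general instruct dataset and shuffle, so fine-tuning does not erase general instruction-following. The field name `text` and the stand-in records are assumptions, not the repo's actual schema:

```python
# Hedged sketch of merging HA data with a general instruct dataset.
import random

ha_examples = [{"text": f"ha-{i}"} for i in range(3)]          # stand-in HA rows
alpaca_examples = [{"text": f"alpaca-{i}"} for i in range(3)]  # stand-in general rows

merged = ha_examples + alpaca_examples
random.seed(42)        # deterministic shuffle for the example
random.shuffle(merged)
print(len(merged))  # → 6
```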
Please note that the supported datasets all have different licenses. Be aware that the license of the resulting data mixture may differ from the license of this dataset alone.

View File

@@ -326,4 +326,19 @@ Missing a lot of earlier 3B training results (not sure where they are)
### Home-3B-v2-GGUF:ha_only
- dataset size: large
- evaluation results: FAILED (again.....)
## Potential Other Datasets to Use
### SFT
Alpaca: https://huggingface.co/datasets/yahma/alpaca-cleaned
Alpaca (Translated): https://huggingface.co/datasets/saillab/taco-datasets
WizardLM 200k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k
WizardLM 70k: https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k
Huggingface Ultrachat 200k: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
OpenOrca Slim Deduped (363k): https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
### DPO
Intel Orca DPO Pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs
Huggingface Ultrachat: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized