| license | task_categories | tags | language | pretty_name | size_categories |
|---|---|---|---|---|---|
| mit | | | | Home Assistant Requests | |
Home Assistant Requests Dataset
This dataset contains a list of requests and responses for a user interacting with a personal assistant that controls an instance of Home Assistant.
This dataset is NOT distributed as a static file, but as a Python script. The LLM fine-tuning ecosystem uses many different formats, and the goal is to support fine-tuning any desired model, so the dataset has to be generated in the exact format expected by the model you want to fine-tune.
The dataset is generated from the different CSV "piles". The "piles" contain different chunks of requests that are assembled into a final context that is presented to the LLM. For example, piles/<language>/pile_of_device_names.csv contains only device names, which are used as part of the context and are also inserted into piles/<language>/pile_of_templated_actions.csv and piles/<language>/pile_of_status_requests.csv. The logic for assembling the final dataset from the piles is contained in generate_data.py.
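As a rough illustration of how piles might be combined, the sketch below fills device names into templated requests. The column names, the `<device_name>` placeholder, and the sample rows are assumptions made for this example; they are not taken from generate_data.py, which may work differently.

```python
import csv
import io
import itertools

# Hypothetical stand-ins for two pile files; the real piles live under piles/<language>/.
DEVICE_NAMES_CSV = """device_name,description
light.kitchen,Kitchen Light
fan.bedroom,Bedroom Fan
"""

TEMPLATED_ACTIONS_CSV = """device_type,service,phrase
light,turn_on,Please turn on the <device_name>
fan,turn_off,Switch off the <device_name>
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fill_templates(devices, templates):
    """Pair each template with every device of the matching type."""
    examples = []
    for template, device in itertools.product(templates, devices):
        # Assumed convention: the text before the dot is the device type.
        device_type = device["device_name"].split(".")[0]
        if device_type != template["device_type"]:
            continue
        examples.append({
            "request": template["phrase"].replace("<device_name>", device["description"]),
            "service": f'{device_type}.{template["service"]}',
            "target": device["device_name"],
        })
    return examples


examples = fill_templates(load_rows(DEVICE_NAMES_CSV), load_rows(TEMPLATED_ACTIONS_CSV))
for example in examples:
    print(example)
```

Each printed dictionary pairs a natural-language request with the service call and target device it should produce, which is the kind of request/response pair the dataset describes.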
Prepare environment
Start by installing system dependencies:
sudo apt-get install python3-dev
Then create a Python virtual environment and install the required libraries:
python3 -m venv .generate_data
source ./.generate_data/bin/activate
pip3 install pandas==2.2.2 datasets==2.20.0 webcolors==1.13 babel==2.15.0
Generating the dataset from piles
python3 generate_data.py --train --test --large --sharegpt
Supported dataset splits are --test, --train, and --sample.
Arguments to set the train dataset size are --small, --medium, --large, and --xl.
Languages can be enabled using --language english german french spanish polish
Merging with other instruct-datasets for training
python3 generate_data.py --merge <dataset>
Supported datasets right now are:
alpaca
wizardlm70k
Please note that the supported datasets all have different licenses. Be aware that the license of the resulting data mixture might be different from the license of this dataset alone.
Adding a new personality
To add a new personality, you need to define a new system prompt and a new set of responses for the assistant. The system prompt is the description of the assistant's behavior that occurs at the start of the context. The responses are what is said back to the user when performing a task. The model should still respond with the correct service call no matter what the assistant's response is. The list of system prompts is stored in pile_of_system_prompts.csv, and the list of responses is stored in pile_of_responses.csv.
There are 2 columns in pile_of_system_prompts.csv:
persona: the name of the persona
prompt: the system prompt to use for that persona. Recommended to put this in quotes in case the prompt also has commas in it.
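The quoting recommendation can be checked with Python's standard csv module. The sketch below uses a made-up persona and prompt (both hypothetical) to show that a quoted prompt survives a round trip, while an unquoted prompt containing commas is split into extra fields.

```python
import csv
import io

# Hypothetical persona and prompt, used only to demonstrate quoting.
rows = [
    {
        "persona": "pirate",
        "prompt": "You are a helpful pirate assistant. Answer briefly, stay in character, and control the devices.",
    },
]

# QUOTE_MINIMAL wraps a field in quotes only when it contains a comma, quote, or newline.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["persona", "prompt"], quoting=csv.QUOTE_MINIMAL)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the prompt intact, commas included.
parsed = list(csv.DictReader(io.StringIO(csv_text)))

# Without quotes, the commas inside the prompt are treated as column separators.
unquoted = "persona,prompt\npirate,Answer briefly, stay in character\n"
broken = list(csv.DictReader(io.StringIO(unquoted)))
print(broken[0])
```

In the unquoted case, `csv.DictReader` keeps only the text before the first comma under `prompt` and collects the remainder under the `None` key, so the prompt would be silently truncated.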
The response pile is a CSV with the following headers: service,response,language,persona,short
service: the service name that we are responding to. Make sure you cover enough different services so that the model can learn how to respond in all situations.
response: the text of the response. Recommended to put this in quotes in case the response also has commas in it.
persona: the name of the persona the response belongs to. Use the name of your persona here.
short: either 0 or 1. If it is 1, the response is considered "short" and can be combined with other "short" responses using "and". These are used for examples where there are multiple service calls.
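A minimal sketch of how "short" responses could be joined for a request that triggers several service calls. The rows, the persona name, and the helper functions are hypothetical; only the column names come from the header above, and the actual joining logic in generate_data.py may differ.

```python
# Hypothetical rows following the service,response,language,persona,short header.
RESPONSES = [
    {"service": "light.turn_on", "response": "Sure, I have turned on the light for you.",
     "language": "english", "persona": "assistant", "short": "0"},
    {"service": "light.turn_on", "response": "turning on the light",
     "language": "english", "persona": "assistant", "short": "1"},
    {"service": "fan.turn_off", "response": "switching off the fan",
     "language": "english", "persona": "assistant", "short": "1"},
]


def pick_short(service, persona, rows):
    """Return the first response marked short=1 for this service and persona."""
    for row in rows:
        if row["service"] == service and row["persona"] == persona and row["short"] == "1":
            return row["response"]
    raise LookupError(f"no short response for {service!r} and persona {persona!r}")


def combined_response(services, persona, rows):
    """Join one short response per service call using "and"."""
    return " and ".join(pick_short(service, persona, rows) for service in services)


combined = combined_response(["light.turn_on", "fan.turn_off"], "assistant", RESPONSES)
print(combined)
```

The long response for light.turn_on is skipped because its short value is 0, which is why a persona needs short variants for any service that may appear in a multi-call example.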
Generating the full dataset using the Python script will print out a warning for any responses that are missing for a persona.
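A coverage check of this kind could look roughly like the following. This is a sketch under assumptions: the persona names, services, and the `find_missing` helper are invented for illustration, and the check performed by generate_data.py may be implemented differently.

```python
# Hypothetical response rows; only the columns needed for the check are shown.
RESPONSE_ROWS = [
    {"service": "light.turn_on", "persona": "assistant"},
    {"service": "fan.turn_off", "persona": "assistant"},
    {"service": "light.turn_on", "persona": "pirate"},
]


def find_missing(personas, services, rows):
    """List every (persona, service) pair that has no response row."""
    covered = {(row["persona"], row["service"]) for row in rows}
    return [
        (persona, service)
        for persona in personas
        for service in services
        if (persona, service) not in covered
    ]


missing = find_missing(
    personas=["assistant", "pirate"],
    services=["light.turn_on", "fan.turn_off"],
    rows=RESPONSE_ROWS,
)
for persona, service in missing:
    print(f"WARNING: persona {persona!r} has no response for service {service!r}")
```

Running a check like this before generating a large dataset makes it easier to see which services a new persona still needs responses for.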
Adding new Home Assistant functionality
TODO