# Backend Configuration

There are multiple backends to choose from for running the model that the Home Assistant integration uses. The options for each backend are described below.
## Common Options
These options are available for all backends and control model inference behavior, conversation memory, and integration-specific settings.
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Selected Language | The language to use for prompts and responses. Affects system prompt templates and examples. | en |
| LLM API | The API to use for tool execution. Select "Assist" for the built-in Home Assistant API, or "No control" to disable tool execution. Other options are specialized APIs like Home-LLM v1/v2/v3. | Assist |
| System Prompt | See here | |
| Additional attributes to expose in the context | Extra attributes that will be exposed to the model via the `{{ devices }}` template variable (e.g., rgb_color, brightness, temperature, humidity, fan_mode, volume_level) | See suggestions |
| Refresh System Prompt Every Turn | Flag to update the system prompt with updated device states on every chat turn. Disabling can significantly improve agent response times when using a backend that supports prefix caching (Llama.cpp) | Enabled |
| Remember conversation | Flag to remember the conversation history (excluding system prompt) in the model context. | Enabled |
| Number of past interactions to remember | If Remember conversation is enabled, the number of user-assistant interaction pairs to keep in history. Not used by the Generic OpenAI Responses backend. | |
| Enable in context learning (ICL) examples | If enabled, will load examples from the specified file and expose them as the `{{ response_examples }}` variable in the system prompt template | Enabled |
| In context learning examples CSV filename | The file to load in context learning examples from. Must be located in the same directory as the custom component | in_context_examples.csv |
| Number of ICL examples to generate | The number of examples to select when expanding the `{{ in_context_examples }}` template in the prompt | 4 |
| Thinking prefix | String prefix that marks the start of internal model reasoning (used when the model supports explicit thinking) | `<think>` |
| Thinking suffix | String suffix that marks the end of internal model reasoning | `</think>` |
| Tool call prefix | String prefix that marks the start of a function call in the model response | `<tool_call>` |
| Tool call suffix | String suffix that marks the end of a function call in the model response | `</tool_call>` |
| Enable legacy tool calling | If enabled, uses the legacy `homeassistant` tool calling format instead of the newer prefix/suffix format. Required for some older Home-LLM models. | Disabled |
| Max tool call iterations | Maximum number of times the model can make tool calls in sequence before the conversation is terminated | 3 |
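
To make the prefix/suffix options concrete, here is a minimal sketch (not the integration's actual parser) of extracting tool calls delimited by `<tool_call>`/`</tool_call>` from a raw model response. The response text and service call shown are hypothetical.

```python
import json
import re

TOOL_CALL_PREFIX = "<tool_call>"
TOOL_CALL_SUFFIX = "</tool_call>"

def extract_tool_calls(response_text: str) -> list[dict]:
    """Return every JSON object found between the tool call prefix and suffix."""
    pattern = re.escape(TOOL_CALL_PREFIX) + r"(.*?)" + re.escape(TOOL_CALL_SUFFIX)
    return [json.loads(match) for match in re.findall(pattern, response_text, re.DOTALL)]

# Hypothetical model output that wraps a Home Assistant service call in the delimiters
raw_response = (
    "Turning on the living room lights now.\n"
    '<tool_call>{"name": "light.turn_on", "arguments": {"entity_id": "light.living_room"}}</tool_call>'
)

for call in extract_tool_calls(raw_response):
    print(call["name"], call["arguments"])
```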
## Llama.cpp

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection & Model Selection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Chat Model | The Hugging Face model repository or local model filename to use for inference | acon96/Home-3B-v3-GGUF |
| Model Quantization | The quantization level to download for the selected model from Hugging Face | Q4_K_M |
| Model File Path | The full path to a local GGUF model file. If not specified, the model will be downloaded from Hugging Face | |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see the link above | 0.1 |
| Top K | Sampling parameter; see the link above | 40 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Min P | Sampling parameter; see the link above | 0.0 |
| Typical P | Sampling parameter; see the link above | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
### Performance Optimization
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Batch Size | Number of tokens to process in each batch. Higher values increase speed but consume more memory | 512 |
| Thread Count | Number of CPU threads to use for inference | (number of physical CPU cores) |
| Batch Thread Count | Number of threads to use for batch processing | (number of physical CPU cores) |
| Enable Flash Attention | Use Flash Attention optimization if supported by the model. Can significantly improve performance on compatible GPUs | Disabled |
### Advanced Features
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Enable GBNF Grammar | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors on quantized models | Enabled |
| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component. | `output.gbnf` for Home LLM and `json.gbnf` for any model using ICL |
| Enable Prompt Caching | Cache the system prompt to avoid recomputing it on every turn (requires Refresh System Prompt Every Turn to be disabled) | Disabled |
| Prompt Caching Interval | Number of seconds between prompt cache refreshes (if caching is enabled) | 30 |
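
As a rough illustration of how these options map onto llama-cpp-python, here is a minimal sketch of loading a local GGUF file and sampling with the values suggested above. The model path and prompt are placeholders, and keyword argument names may differ between llama-cpp-python versions.

```python
# Sketch only: maps the Llama.cpp backend options onto llama-cpp-python keyword arguments.
# Requires `pip install llama-cpp-python`; argument availability varies by version.
from llama_cpp import Llama

llm = Llama(
    model_path="/config/models/Home-3B-v3.q4_k_m.gguf",  # "Model File Path" (placeholder)
    n_ctx=2048,             # "Context Length"
    n_batch=512,            # "Batch Size"
    n_threads=4,            # "Thread Count"
    n_threads_batch=4,      # "Batch Thread Count"
    flash_attn=False,       # "Enable Flash Attention"
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "turn off the kitchen light"}],
    max_tokens=512,         # "Maximum tokens to return in response"
    temperature=0.1,
    top_k=40,
    top_p=1.0,
    min_p=0.0,
    typical_p=1.0,
)
print(output["choices"][0]["message"]["content"])
```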
### Wheels
The wheels for llama-cpp-python can be built or downloaded manually for installation/re-installation.
Take the appropriate wheel and copy it to the `custom_components/llama_conversation/` directory.
After the wheel file has been copied to the correct folder, attempt the wheel installation step of the integration setup. The local wheel file should be detected and installed.
#### Pre-built
Pre-built wheel files (*.whl) are built as part of a fork of llama-cpp-python and are available on the GitHub releases page for the fork.
To ensure compatibility with your Home Assistant and Python versions, select the correct .whl file for your hardware's architecture:
- For Home Assistant 2024.2.0 and newer, use the Python 3.12 wheels (`cp312`)
- ARM devices (e.g., Raspberry Pi 4/5):
  - Example filename: `llama_cpp_python-{version}-cp312-cp312-musllinux_1_2_aarch64.whl`
- x86_64 devices (e.g., Intel/AMD desktops):
  - Example filename: `llama_cpp_python-{version}-cp312-cp312-musllinux_1_2_x86_64.whl`
#### Build your own
- Clone the repository on the target machine that will be running Home Assistant
- Ensure `docker` is installed
- Run the `scripts/run_docker_to_make_wheels.sh` script
- The compatible wheel files will be placed in the folder you executed the script from
## Llama.cpp Server

The Llama.cpp Server backend is used when running inference via a separate llama-cpp-python HTTP server.

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the llama-cpp-python server | |
| Port | The port number the server is listening on | 8000 |
| SSL | Whether to use HTTPS for the connection | false |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top K | Sampling parameter; see the text-generation-webui wiki | 40 |
| Top P | Sampling parameter; see the text-generation-webui wiki | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### Advanced Features
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Enable GBNF Grammar | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors | Enabled |
| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component. | output.gbnf |
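
For reference, the options above roughly correspond to a request like the following against llama-cpp-python's OpenAI-compatible server. This is a sketch rather than the integration's actual request code; the host, port, and prompt are placeholders, and non-standard fields such as `top_k` are accepted by llama-cpp-python's server but may be ignored elsewhere.

```python
# Sketch of a chat request to a llama-cpp-python server (e.g. started with
# `python -m llama_cpp.server`) using the connection and sampling options above.
import requests

HOST, PORT, USE_SSL = "192.168.1.50", 8000, False  # placeholder connection settings
base_url = f"{'https' if USE_SSL else 'http'}://{HOST}:{PORT}"

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "turn on the porch light"}],
        "max_tokens": 512,   # "Maximum tokens to return in response"
        "top_p": 1.0,
        "top_k": 40,         # non-standard field; llama-cpp-python's server accepts it
    },
    timeout=90,              # "Request Timeout"
)
print(response.json()["choices"][0]["message"]["content"])
```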
## text-generation-webui

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the text-generation-webui server | |
| Port | The port number the server is listening on | 5000 |
| SSL | Whether to use HTTPS for the connection | false |
| Admin Key | The admin key for the text-generation-webui server (if configured for authentication) | |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see the link above | 0.1 |
| Top K | Sampling parameter; see the link above | 40 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Min P | Sampling parameter; see the link above | 0.0 |
| Typical P | Sampling parameter; see the link above | 1.0 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### UI Configuration
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Generation Preset/Character Name | The preset or character name to pass to the backend. If none is provided, the settings that are currently selected in the UI will be applied | |
| Chat Mode | See here | Instruct |
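
As a sketch of what these settings correspond to on the wire, a chat request against text-generation-webui's OpenAI-compatible API might look like the following. The host, character name, and prompt are placeholders, and extension-specific fields such as `mode` and `character` can vary between text-generation-webui releases.

```python
# Sketch of a request to text-generation-webui's OpenAI-compatible endpoint.
import requests

base_url = "http://192.168.1.50:5000"  # "Host", "Port", and "SSL" options (placeholders)

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "what lights are on?"}],
        "temperature": 0.1,
        "top_k": 40,
        "top_p": 1.0,
        "mode": "instruct",          # "Chat Mode"
        "character": "Assistant",    # "Generation Preset/Character Name" (placeholder)
    },
    timeout=90,                      # "Request Timeout"
)
print(response.json()["choices"][0]["message"]["content"])
```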
## Ollama

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the Ollama server | |
| Port | The port number the server is listening on | 11434 |
| SSL | Whether to use HTTPS for the connection | false |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top K | Sampling parameter; see the link above | 40 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Typical P | Sampling parameter; see the link above | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### Advanced Features
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| JSON Mode | Restricts the model to only output valid JSON objects. Enable this if you are using ICL and are getting invalid JSON responses. | True |
| Keep Alive/Inactivity Timeout | The duration in minutes to keep the model loaded after each request. Set to a negative value to keep the model loaded indefinitely | 30 (minutes) |
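
For reference, the Ollama options above map onto an `/api/chat` request roughly like this sketch; the host, model name, and prompt are placeholders, and this is not the integration's actual request code.

```python
# Sketch of how the Ollama backend options translate into an /api/chat request.
import requests

response = requests.post(
    "http://192.168.1.50:11434/api/chat",    # "Host", "Port", "SSL" (placeholders)
    json={
        "model": "llama3.1",                 # placeholder model name
        "messages": [{"role": "user", "content": "is the garage door open?"}],
        "stream": False,
        "format": "json",                    # "JSON Mode"
        "keep_alive": "30m",                 # "Keep Alive/Inactivity Timeout"
        "options": {
            "top_k": 40,
            "top_p": 1.0,
            "typical_p": 1.0,
            "num_predict": 512,              # "Maximum tokens to return in response"
            "num_ctx": 2048,                 # "Context Length"
        },
    },
    timeout=90,                              # "Request Timeout"
)
print(response.json()["message"]["content"])
```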
## Generic OpenAI API (Chat Completions)

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the OpenAI-compatible API server | |
| Port | The port number the server is listening on (leave empty for the default) | |
| SSL | Whether to use HTTPS for the connection | false |
| API Key | The API key for authentication (if required by your server) | |
| API Path | The path prefix for API requests (e.g., /v1 for OpenAI-compatible servers) | v1 |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top P | Sampling parameter; see the link above | 1.0 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
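
As an illustration of how the connection options are assembled into a request, here is a minimal sketch against an OpenAI-compatible Chat Completions endpoint; the host, API key, model name, and prompt are all placeholders.

```python
# Sketch of assembling the Host/Port/SSL/API Path/API Key options into a request URL.
import requests

HOST, PORT, USE_SSL, API_PATH = "api.example.com", 443, True, "v1"  # placeholders
API_KEY = "sk-placeholder"                                          # "API Key" (placeholder)

base_url = f"{'https' if USE_SSL else 'http'}://{HOST}:{PORT}/{API_PATH}"

response = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": "set the thermostat to 21 degrees"}],
        "top_p": 1.0,
    },
    timeout=90,                  # "Request Timeout"
)
print(response.json()["choices"][0]["message"]["content"])
```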
## Generic OpenAI Responses

The Generic OpenAI Responses backend uses time-based conversation memory instead of interaction counts and is compatible with specialized response APIs.

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the OpenAI-compatible API server | |
| Port | The port number the server is listening on (leave empty for the default) | |
| SSL | Whether to use HTTPS for the connection | false |
| API Key | The API key for authentication (if required by your server) | |
| API Path | The path prefix for API requests | v1 |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see the link above | 0.1 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### Memory & Conversation
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Remember conversation time (minutes) | Number of minutes to remember conversation history. Uses time-based memory instead of interaction count. | 2 (minutes) |
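
To illustrate the time-based memory behavior, here is a purely illustrative sketch (not the integration's actual implementation) of pruning conversation history older than the configured window before each request.

```python
# Illustrative sketch of time-based conversation memory: messages older than the
# configured window are dropped; only the window length comes from the option above.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

REMEMBER_MINUTES = 2  # "Remember conversation time (minutes)"

@dataclass
class TimedMessage:
    role: str
    content: str
    timestamp: datetime = field(default_factory=datetime.now)

def prune_history(history: list[TimedMessage]) -> list[TimedMessage]:
    """Keep only messages newer than the configured memory window."""
    cutoff = datetime.now() - timedelta(minutes=REMEMBER_MINUTES)
    return [message for message in history if message.timestamp >= cutoff]

# Hypothetical history: the first two messages fall outside the 2-minute window.
history = [
    TimedMessage("user", "turn on the office lamp", datetime.now() - timedelta(minutes=5)),
    TimedMessage("assistant", "The office lamp is now on.", datetime.now() - timedelta(minutes=5)),
    TimedMessage("user", "and dim it to 50%"),
]
print([m.content for m in prune_history(history)])  # only the most recent message survives
```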