# Backend Configuration

There are multiple backends to choose from for running the model that the Home Assistant integration uses. The options for each backend are described below.
## Common Options
These options are available for all backends and control model inference behavior, conversation memory, and integration-specific settings.
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Selected Language | The language to use for prompts and responses. Affects system prompt templates and examples. | en |
| LLM API | The API to use for tool execution. Select "Assist" for the built-in Home Assistant API, or "No control" to disable tool execution. Other options are specialized APIs like Home-LLM v1/v2/v3. | Assist |
| System Prompt | See here | |
| Additional attributes to expose in the context | Extra attributes that will be exposed to the model via the `{{ devices }}` template variable (e.g., rgb_color, brightness, temperature, humidity, fan_mode, volume_level) | See suggestions |
| Refresh System Prompt Every Turn | Flag to update the system prompt with updated device states on every chat turn. Disabling can significantly improve agent response times when using a backend that supports prefix caching (Llama.cpp) | Enabled |
| Remember conversation | Flag to remember the conversation history (excluding system prompt) in the model context. | Enabled |
| Number of past interactions to remember | If Remember conversation is enabled, the number of user-assistant interaction pairs to keep in history. Not used by the Generic OpenAI Responses backend. | |
| Enable in context learning (ICL) examples | If enabled, will load examples from the specified file and expose them as the `{{ response_examples }}` variable in the system prompt template | Enabled |
| In context learning examples CSV filename | The file to load in context learning examples from. Must be located in the same directory as the custom component | in_context_examples.csv |
| Number of ICL examples to generate | The number of examples to select when expanding the `{{ in_context_examples }}` template in the prompt | 4 |
| Thinking prefix | String prefix that marks the start of internal model reasoning (used when the model supports explicit thinking) | `<think>` |
| Thinking suffix | String suffix that marks the end of internal model reasoning | `</think>` |
| Tool call prefix | String prefix that marks the start of a function call in the model response | `<tool_call>` |
| Tool call suffix | String suffix that marks the end of a function call in the model response | `</tool_call>` |
| Enable legacy tool calling | If enabled, uses the legacy `homeassistant` tool calling format instead of the newer prefix/suffix format. Required for some older Home-LLM models. | Disabled |
| Max tool call iterations | Maximum number of times the model can make tool calls in sequence before the conversation is terminated | 3 |
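
To make the prefix/suffix options concrete, here is a minimal sketch (not the integration's actual parser) of extracting tool calls delimited by `<tool_call>`/`</tool_call>` from a raw model response. The response text and service call shown are hypothetical.

```python
import json
import re

TOOL_CALL_PREFIX = "<tool_call>"
TOOL_CALL_SUFFIX = "</tool_call>"

def extract_tool_calls(response_text: str) -> list[dict]:
    """Return every JSON object found between the tool call prefix and suffix."""
    pattern = re.escape(TOOL_CALL_PREFIX) + r"(.*?)" + re.escape(TOOL_CALL_SUFFIX)
    return [json.loads(match) for match in re.findall(pattern, response_text, re.DOTALL)]

# Hypothetical model output that wraps a Home Assistant service call in the delimiters
raw_response = (
    "Turning on the living room lights now.\n"
    '<tool_call>{"name": "light.turn_on", "arguments": {"entity_id": "light.living_room"}}</tool_call>'
)

for call in extract_tool_calls(raw_response):
    print(call["name"], call["arguments"])
```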
## Llama.cpp

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection & Model Selection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Chat Model | The Hugging Face model repository or local model filename to use for inference | acon96/Home-3B-v3-GGUF |
| Model Quantization | The quantization level to download for the selected model from Hugging Face | Q4_K_M |
| Model File Path | The full path to a local GGUF model file. If not specified, the model will be downloaded from Hugging Face | |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see the link above | 0.1 |
| Top K | Sampling parameter; see the link above | 40 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Min P | Sampling parameter; see the link above | 0.0 |
| Typical P | Sampling parameter; see the link above | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
### Performance Optimization
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Batch Size | Number of tokens to process in each batch. Higher values increase speed but consume more memory | 512 |
| Thread Count | Number of CPU threads to use for inference | (number of physical CPU cores) |
| Batch Thread Count | Number of threads to use for batch processing | (number of physical CPU cores) |
| Enable Flash Attention | Use Flash Attention optimization if supported by the model. Can significantly improve performance on compatible GPUs | Disabled |
### Advanced Features
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Enable GBNF Grammar | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors on quantized models | Enabled |
| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component. | `output.gbnf` for Home LLM and `json.gbnf` for any model using ICL |
| Enable Prompt Caching | Cache the system prompt to avoid recomputing it on every turn (requires Refresh System Prompt Every Turn to be disabled) | Disabled |
| Prompt Caching Interval | Number of seconds between prompt cache refreshes (if caching is enabled) | 30 |
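
As a rough illustration of how these options map onto llama-cpp-python, here is a minimal sketch of loading a local GGUF file and sampling with the values suggested above. The model path and prompt are placeholders, and keyword argument names may differ between llama-cpp-python versions.

```python
# Sketch only: maps the Llama.cpp backend options onto llama-cpp-python keyword arguments.
# Requires `pip install llama-cpp-python`; argument availability varies by version.
from llama_cpp import Llama

llm = Llama(
    model_path="/config/models/Home-3B-v3.q4_k_m.gguf",  # "Model File Path" (placeholder)
    n_ctx=2048,             # "Context Length"
    n_batch=512,            # "Batch Size"
    n_threads=4,            # "Thread Count"
    n_threads_batch=4,      # "Batch Thread Count"
    flash_attn=False,       # "Enable Flash Attention"
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "turn off the kitchen light"}],
    max_tokens=512,         # "Maximum tokens to return in response"
    temperature=0.1,
    top_k=40,
    top_p=1.0,
    min_p=0.0,
    typical_p=1.0,
)
print(output["choices"][0]["message"]["content"])
```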
### Wheels
The wheels for llama-cpp-python can be built or downloaded manually for installation/re-installation.
Take the appropriate wheel and copy it to the `custom_components/llama_conversation/` directory.
After the wheel file has been copied to the correct folder, attempt the wheel installation step of the integration setup. The local wheel file should be detected and installed.
#### Pre-built
Pre-built wheel files (*.whl) are built as part of a fork of llama-cpp-python and are available on the GitHub releases page for the fork.
To ensure compatibility with your Home Assistant and Python versions, select the correct .whl file for your hardware's architecture:
- For Home Assistant 2024.2.0 and newer, use the Python 3.12 wheels (`cp312`)
- ARM devices (e.g., Raspberry Pi 4/5):
  - Example filename: `llama_cpp_python-{version}-cp312-cp312-musllinux_1_2_aarch64.whl`
- x86_64 devices (e.g., Intel/AMD desktops):
  - Example filename: `llama_cpp_python-{version}-cp312-cp312-musllinux_1_2_x86_64.whl`
#### Build your own
- Clone the repository on the target machine that will be running Home Assistant
- Ensure `docker` is installed
- Run the `scripts/run_docker_to_make_wheels.sh` script
- The compatible wheel files will be placed in the folder you executed the script from
## Llama.cpp Server

The Llama.cpp Server backend is used when running inference via a separate llama-cpp-python HTTP server.

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the llama-cpp-python server | |
| Port | The port number the server is listening on | 8000 |
| SSL | Whether to use HTTPS for the connection | false |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top K | Sampling parameter; see the text-generation-webui wiki | 40 |
| Top P | Sampling parameter; see the text-generation-webui wiki | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### Advanced Features
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Enable GBNF Grammar | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors | Enabled |
| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component. | output.gbnf |
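
For reference, the options above roughly correspond to a request like the following against llama-cpp-python's OpenAI-compatible server. This is a sketch rather than the integration's actual request code; the host, port, and prompt are placeholders, and non-standard fields such as `top_k` are accepted by llama-cpp-python's server but may be ignored elsewhere.

```python
# Sketch of a chat request to a llama-cpp-python server (e.g. started with
# `python -m llama_cpp.server`) using the connection and sampling options above.
import requests

HOST, PORT, USE_SSL = "192.168.1.50", 8000, False  # placeholder connection settings
base_url = f"{'https' if USE_SSL else 'http'}://{HOST}:{PORT}"

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "turn on the porch light"}],
        "max_tokens": 512,   # "Maximum tokens to return in response"
        "top_p": 1.0,
        "top_k": 40,         # non-standard field; llama-cpp-python's server accepts it
    },
    timeout=90,              # "Request Timeout"
)
print(response.json()["choices"][0]["message"]["content"])
```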
## text-generation-webui

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the text-generation-webui server | |
| Port | The port number the server is listening on | 5000 |
| SSL | Whether to use HTTPS for the connection | false |
| Admin Key | The admin key for the text-generation-webui server (if configured for authentication) | |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see the link above | 0.1 |
| Top K | Sampling parameter; see the link above | 40 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Min P | Sampling parameter; see the link above | 0.0 |
| Typical P | Sampling parameter; see the link above | 1.0 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### UI Configuration
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Generation Preset/Character Name | The preset or character name to pass to the backend. If none is provided, the settings that are currently selected in the UI will be applied | |
| Chat Mode | See here | Instruct |
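
As a sketch of what these settings correspond to on the wire, a chat request against text-generation-webui's OpenAI-compatible API might look like the following. The host, character name, and prompt are placeholders, and extension-specific fields such as `mode` and `character` can vary between text-generation-webui releases.

```python
# Sketch of a request to text-generation-webui's OpenAI-compatible endpoint.
import requests

base_url = "http://192.168.1.50:5000"  # "Host", "Port", and "SSL" options (placeholders)

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "what lights are on?"}],
        "temperature": 0.1,
        "top_k": 40,
        "top_p": 1.0,
        "mode": "instruct",          # "Chat Mode"
        "character": "Assistant",    # "Generation Preset/Character Name" (placeholder)
    },
    timeout=90,                      # "Request Timeout"
)
print(response.json()["choices"][0]["message"]["content"])
```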
## Ollama

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the Ollama server | |
| Port | The port number the server is listening on | 11434 |
| SSL | Whether to use HTTPS for the connection | false |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top K | Sampling parameter; see the link above | 40 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Typical P | Sampling parameter; see the link above | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### Advanced Features
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| JSON Mode | Restricts the model to only output valid JSON objects. Enable this if you are using ICL and are getting invalid JSON responses. | True |
| Keep Alive/Inactivity Timeout | The duration in minutes to keep the model loaded after each request. Set to a negative value to keep the model loaded indefinitely | 30 (minutes) |
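
For reference, the Ollama options above map onto an `/api/chat` request roughly like this sketch; the host, model name, and prompt are placeholders, and this is not the integration's actual request code.

```python
# Sketch of how the Ollama backend options translate into an /api/chat request.
import requests

response = requests.post(
    "http://192.168.1.50:11434/api/chat",    # "Host", "Port", "SSL" (placeholders)
    json={
        "model": "llama3.1",                 # placeholder model name
        "messages": [{"role": "user", "content": "is the garage door open?"}],
        "stream": False,
        "format": "json",                    # "JSON Mode"
        "keep_alive": "30m",                 # "Keep Alive/Inactivity Timeout"
        "options": {
            "top_k": 40,
            "top_p": 1.0,
            "typical_p": 1.0,
            "num_predict": 512,              # "Maximum tokens to return in response"
            "num_ctx": 2048,                 # "Context Length"
        },
    },
    timeout=90,                              # "Request Timeout"
)
print(response.json()["message"]["content"])
```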
## Generic OpenAI API (Chat Completions)

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the OpenAI-compatible API server | |
| Port | The port number the server is listening on (leave empty for the default) | |
| SSL | Whether to use HTTPS for the connection | false |
| API Key | The API key for authentication (if required by your server) | |
| API Path | The path prefix for API requests (e.g., /v1 for OpenAI-compatible servers) | v1 |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top P | Sampling parameter; see the link above | 1.0 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
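
As an illustration of how the connection options are assembled into a request, here is a minimal sketch against an OpenAI-compatible Chat Completions endpoint; the host, API key, model name, and prompt are all placeholders.

```python
# Sketch of assembling the Host/Port/SSL/API Path/API Key options into a request URL.
import requests

HOST, PORT, USE_SSL, API_PATH = "api.example.com", 443, True, "v1"  # placeholders
API_KEY = "sk-placeholder"                                          # "API Key" (placeholder)

base_url = f"{'https' if USE_SSL else 'http'}://{HOST}:{PORT}/{API_PATH}"

response = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": "set the thermostat to 21 degrees"}],
        "top_p": 1.0,
    },
    timeout=90,                  # "Request Timeout"
)
print(response.json()["choices"][0]["message"]["content"])
```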
## Generic OpenAI Responses

The Generic OpenAI Responses backend uses time-based conversation memory instead of interaction counts and is compatible with specialized response APIs.

### Connection
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the OpenAI-compatible API server | |
| Port | The port number the server is listening on (leave empty for the default) | |
| SSL | Whether to use HTTPS for the connection | false |
| API Key | The API key for authentication (if required by your server) | |
| API Path | The path prefix for API requests | v1 |
### Sampling & Output
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see the link above | 0.1 |
| Top P | Sampling parameter; see the link above | 1.0 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
### Memory & Conversation
| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Remember conversation time (minutes) | Number of minutes to remember conversation history. Uses time-based memory instead of interaction count. | 2 (minutes) |
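
To illustrate the time-based memory behavior, here is a purely illustrative sketch (not the integration's actual implementation) of pruning conversation history older than the configured window before each request.

```python
# Illustrative sketch of time-based conversation memory: messages older than the
# configured window are dropped; only the window length comes from the option above.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

REMEMBER_MINUTES = 2  # "Remember conversation time (minutes)"

@dataclass
class TimedMessage:
    role: str
    content: str
    timestamp: datetime = field(default_factory=datetime.now)

def prune_history(history: list[TimedMessage]) -> list[TimedMessage]:
    """Keep only messages newer than the configured memory window."""
    cutoff = datetime.now() - timedelta(minutes=REMEMBER_MINUTES)
    return [message for message in history if message.timestamp >= cutoff]

# Hypothetical history: the first two messages fall outside the 2-minute window.
history = [
    TimedMessage("user", "turn on the office lamp", datetime.now() - timedelta(minutes=5)),
    TimedMessage("assistant", "The office lamp is now on.", datetime.now() - timedelta(minutes=5)),
    TimedMessage("user", "and dim it to 50%"),
]
print([m.content for m in prune_history(history)])  # only the most recent message survives
```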