
# Backend Configuration

There are multiple backends to choose from for running the model that the Home Assistant integration uses. Below is a description of the options for each backend.

## Common Options

These options are available for all backends and control model inference behavior, conversation memory, and integration-specific settings.

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Selected Language | The language to use for prompts and responses. Affects system prompt templates and examples. | en |
| LLM API | The API to use for tool execution. Select "Assist" for the built-in Home Assistant API, or "No control" to disable tool execution. Other options are specialized APIs like Home-LLM v1/v2/v3. | Assist |
| System Prompt | see here | |
| Additional attributes to expose in the context | Extra attributes that will be exposed to the model via the `{{ devices }}` template variable (e.g., `rgb_color`, `brightness`, `temperature`, `humidity`, `fan_mode`, `volume_level`) | See suggestions |
| Refresh System Prompt Every Turn | Flag to update the system prompt with updated device states on every chat turn. Disabling can significantly improve agent response times when using a backend that supports prefix caching (Llama.cpp) | Enabled |
| Remember conversation | Flag to remember the conversation history (excluding system prompt) in the model context. | Enabled |
| Number of past interactions to remember | If Remember conversation is enabled, number of user-assistant interaction pairs to keep in history. Not used by the Generic OpenAI Responses backend. | |
| Enable in context learning (ICL) examples | If enabled, will load examples from the specified file and expose them as the `{{ response_examples }}` variable in the system prompt template | Enabled |
| In context learning examples CSV filename | The file to load in context learning examples from. Must be located in the same directory as the custom component | in_context_examples.csv |
| Number of ICL examples to generate | The number of examples to select when expanding the `{{ in_context_examples }}` template in the prompt | 4 |
| Thinking prefix | String prefix to mark the start of internal model reasoning (used when the model supports explicit thinking) | `<think>` |
| Thinking suffix | String suffix to mark the end of internal model reasoning | `</think>` |
| Tool call prefix | String prefix to mark the start of a function call in the model response | `<tool_call>` |
| Tool call suffix | String suffix to mark the end of a function call in the model response | `</tool_call>` |
| Enable legacy tool calling | If enabled, uses the legacy ```` ```homeassistant ```` code block tool calling format instead of the newer prefix/suffix format. Required for some older Home-LLM models. | Disabled |
| Max tool call iterations | Maximum number of times the model can make tool calls in sequence before the conversation is terminated | 3 |
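
The thinking and tool call prefixes/suffixes are plain string markers that the integration looks for in the raw model output. The sketch below illustrates the general idea of stripping reasoning and extracting tool calls with these markers; the function and variable names are illustrative only, not the integration's actual implementation.

```python
import json
import re

# Illustrative marker values matching the defaults in the table above.
THINK_PREFIX, THINK_SUFFIX = "<think>", "</think>"
TOOL_PREFIX, TOOL_SUFFIX = "<tool_call>", "</tool_call>"

def split_model_output(raw: str) -> tuple[str, list[dict]]:
    """Strip reasoning blocks and collect tool calls from a raw model response."""
    # Drop any <think>...</think> reasoning so it is not shown to the user.
    text = re.sub(
        re.escape(THINK_PREFIX) + r".*?" + re.escape(THINK_SUFFIX),
        "", raw, flags=re.DOTALL,
    )

    # Collect each <tool_call>...</tool_call> block and parse it as JSON.
    tool_calls = []
    for match in re.finditer(
        re.escape(TOOL_PREFIX) + r"(.*?)" + re.escape(TOOL_SUFFIX),
        text, flags=re.DOTALL,
    ):
        tool_calls.append(json.loads(match.group(1)))

    # Remove the tool call blocks from the text that is displayed/spoken.
    text = re.sub(
        re.escape(TOOL_PREFIX) + r".*?" + re.escape(TOOL_SUFFIX),
        "", text, flags=re.DOTALL,
    ).strip()
    return text, tool_calls

# Example (hypothetical model output):
# split_model_output('Turning on the light. <tool_call>{"name": "HassTurnOn", '
#                    '"arguments": {"name": "kitchen light"}}</tool_call>')
```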

## Llama.cpp

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection & Model Selection

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Chat Model | The Hugging Face model repository or local model filename to use for inference | acon96/Home-3B-v3-GGUF |
| Model Quantization | The quantization level to download for the selected model from Hugging Face | Q4_K_M |
| Model File Path | The full path to a local GGUF model file. If not specified, the model will be downloaded from Hugging Face | |
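
When "Chat Model" points at a Hugging Face repository and no local file path is set, a GGUF file matching the selected quantization is downloaded. The sketch below shows roughly how such a lookup can be done with `huggingface_hub`; it is an illustration of the idea under that assumption, not the integration's exact download logic.

```python
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "acon96/Home-3B-v3-GGUF"   # "Chat Model"
quant = "Q4_K_M"                     # "Model Quantization"

# Find a GGUF file in the repo whose name contains the requested quantization.
gguf_files = [
    f for f in list_repo_files(repo_id)
    if f.lower().endswith(".gguf") and quant.lower() in f.lower()
]
if not gguf_files:
    raise FileNotFoundError(f"No {quant} GGUF file found in {repo_id}")

# Download (or reuse the cached copy of) the matching file and get its local path.
model_path = hf_hub_download(repo_id=repo_id, filename=gguf_files[0])
print(model_path)  # comparable to setting "Model File Path" directly
```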

### Sampling & Output

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see above link | 0.1 |
| Top K | Sampling parameter; see above link | 40 |
| Top P | Sampling parameter; see above link | 1.0 |
| Min P | Sampling parameter; see above link | 0.0 |
| Typical P | Sampling parameter; see above link | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
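
These options correspond to standard llama.cpp sampling parameters. As a rough illustration of how the suggested values map onto a llama-cpp-python call (the model path and prompt are placeholders, and this is not the integration's internal code):

```python
from llama_cpp import Llama

# Placeholder local model path; "Context Length" maps to n_ctx.
llm = Llama(model_path="/config/models/home-3b-v3.q4_k_m.gguf", n_ctx=2048)

output = llm.create_completion(
    prompt="Turn on the kitchen light.",
    max_tokens=512,    # Maximum tokens to return in response
    temperature=0.1,   # Temperature
    top_k=40,          # Top K
    top_p=1.0,         # Top P
    min_p=0.0,         # Min P
    typical_p=1.0,     # Typical P
)
print(output["choices"][0]["text"])
```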

### Performance Optimization

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Batch Size | Number of tokens to process in each batch. Higher values increase speed but consume more memory | 512 |
| Thread Count | Number of CPU threads to use for inference | (number of physical CPU cores) |
| Batch Thread Count | Number of threads to use for batch processing | (number of physical CPU cores) |
| Enable Flash Attention | Use Flash Attention optimization if supported by the model. Can significantly improve performance on compatible GPUs | Disabled |
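
These options correspond to loader arguments in recent llama-cpp-python releases. A hedged sketch of how they might be passed (the model path is a placeholder, and `os.cpu_count()` is only a rough stand-in for the physical core count):

```python
import os
from llama_cpp import Llama

physical_cores = os.cpu_count() or 4  # rough stand-in; counts logical cores

llm = Llama(
    model_path="/config/models/home-3b-v3.q4_k_m.gguf",  # placeholder path
    n_batch=512,                     # Batch Size
    n_threads=physical_cores,        # Thread Count
    n_threads_batch=physical_cores,  # Batch Thread Count
    flash_attn=False,                # Enable Flash Attention
)
```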

### Advanced Features

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Enable GBNF Grammar | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors on quantized models | Enabled |
| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component. | output.gbnf for Home LLM and json.gbnf for any model using ICL |
| Enable Prompt Caching | Cache the system prompt to avoid recomputing it on every turn (requires `refresh_system_prompt` to be disabled) | Disabled |
| Prompt Caching Interval | Number of seconds between prompt cache refreshes (if caching is enabled) | 30 |
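
With a GBNF grammar enabled, every sampled token must keep the output consistent with the grammar, which is what prevents malformed tool call syntax on heavily quantized models. A minimal llama-cpp-python sketch of the same idea (the model path and prompt are placeholders, not the integration's code):

```python
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="/config/models/home-3b-v3.q4_k_m.gguf", n_ctx=2048)

# Load the grammar file that sits alongside the custom component.
grammar = LlamaGrammar.from_file("output.gbnf")

# Generation is constrained so the response always matches the grammar.
output = llm.create_completion(
    prompt="Turn off the bedroom fan.",
    max_tokens=512,
    grammar=grammar,
)
print(output["choices"][0]["text"])
```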

### Wheels

The wheels for llama-cpp-python can be built or downloaded manually for installation/re-installation.

Take the appropriate wheel and copy it to the `custom_components/llama_conversation/` directory.

After the wheel file has been copied to the correct folder, attempt the wheel installation step of the integration setup. The local wheel file should be detected and installed.

#### Pre-built

Pre-built wheel files (*.whl) are built as part of a fork of llama-cpp-python and are available on the GitHub releases page for the fork.

To ensure compatibility with your Home Assistant and Python versions, select the correct .whl file for your hardware's architecture:

- For Home Assistant 2024.2.0 and newer, use the Python 3.12 wheels (`cp312`)
- ARM devices (e.g., Raspberry Pi 4/5): `llama_cpp_python-{version}-cp312-cp312-musllinux_1_2_aarch64.whl`
- x86_64 devices (e.g., Intel/AMD desktops): `llama_cpp_python-{version}-cp312-cp312-musllinux_1_2_x86_64.whl`

#### Build your own

1. Clone the repository on the target machine that will be running Home Assistant
2. Ensure Docker is installed
3. Run the `scripts/run_docker_to_make_wheels.sh` script
4. The compatible wheel files will be placed in the folder you executed the script from

## Llama.cpp Server

The Llama.cpp Server backend is used when running inference via a separate llama-cpp-python HTTP server.

### Connection

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the llama-cpp-python server | |
| Port | The port number the server is listening on | 8000 |
| SSL | Whether to use HTTPS for the connection | false |

### Sampling & Output

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top K | Sampling parameter; see the text-generation-webui wiki | 40 |
| Top P | Sampling parameter; see above link | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
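
The connection options combine into a base URL for the llama-cpp-python server, and the sampling options are sent with each request. A hedged example against the server's OpenAI-compatible completions endpoint (the host and prompt are placeholders, not values from this integration):

```python
import requests

host, port, ssl = "192.168.1.50", 8000, False   # Connection options (placeholders)
base_url = f"{'https' if ssl else 'http'}://{host}:{port}"

response = requests.post(
    f"{base_url}/v1/completions",
    json={
        "prompt": "Turn on the kitchen light.",
        "max_tokens": 512,   # Maximum tokens to return in response
        "top_k": 40,         # Top K
        "top_p": 1.0,        # Top P
    },
    timeout=90,              # Request Timeout
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```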

### Advanced Features

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Enable GBNF Grammar | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors | Enabled |
| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component. | output.gbnf |

## text-generation-webui

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the text-generation-webui server | |
| Port | The port number the server is listening on | 5000 |
| SSL | Whether to use HTTPS for the connection | false |
| Admin Key | The admin key for the text-generation-webui server (if configured for authentication) | |

### Sampling & Output

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see above link | 0.1 |
| Top K | Sampling parameter; see above link | 40 |
| Top P | Sampling parameter; see above link | 1.0 |
| Min P | Sampling parameter; see above link | 0.0 |
| Typical P | Sampling parameter; see above link | 1.0 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
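
text-generation-webui serves an OpenAI-compatible API on the configured port, and sampling settings like these can be passed as extra fields in the request body. A rough sketch only (the host and message are placeholders, and the exact set of accepted extra parameters depends on your text-generation-webui version):

```python
import requests

host, port, ssl = "192.168.1.50", 5000, False   # Connection options (placeholders)
base_url = f"{'https' if ssl else 'http'}://{host}:{port}"

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Turn on the kitchen light."}],
        "temperature": 0.1,   # Temperature
        "top_k": 40,          # Top K
        "top_p": 1.0,         # Top P
        "min_p": 0.0,         # Min P
        "typical_p": 1.0,     # Typical P
    },
    timeout=90,               # Request Timeout
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```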

### UI Configuration

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Generation Preset/Character Name | The preset or character name to pass to the backend. If none is provided then the settings that are currently selected in the UI will be applied | |
| Chat Mode | see here | Instruct |

## Ollama

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the Ollama server | |
| Port | The port number the server is listening on | 11434 |
| SSL | Whether to use HTTPS for the connection | false |

### Sampling & Output

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top K | Sampling parameter; see above link | 40 |
| Top P | Sampling parameter; see above link | 1.0 |
| Typical P | Sampling parameter; see above link | 1.0 |
| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response | 512 |
| Context Length | Maximum number of tokens the model can consider in its context window | 2048 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |

### Advanced Features

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| JSON Mode | Restricts the model to only output valid JSON objects. Enable this if you are using ICL and are getting invalid JSON responses. | True |
| Keep Alive/Inactivity Timeout | The duration in minutes to keep the model loaded after each request. Set to a negative value to keep loaded forever | 30 (minutes) |
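
On the wire, most of the sampling options end up in the `options` object of an Ollama API request, while JSON mode and the keep-alive timeout are top-level fields. A hedged sketch against Ollama's `/api/chat` endpoint (the host and model name are placeholders, not anything configured by this integration):

```python
import requests

host, port, ssl = "192.168.1.50", 11434, False  # Connection options (placeholders)
base_url = f"{'https' if ssl else 'http'}://{host}:{port}"

response = requests.post(
    f"{base_url}/api/chat",
    json={
        "model": "llama3.1:8b",                  # placeholder model name
        "messages": [{"role": "user", "content": "Turn on the kitchen light."}],
        "stream": False,
        "format": "json",                        # JSON Mode
        "keep_alive": "30m",                     # Keep Alive/Inactivity Timeout
        "options": {
            "top_k": 40,         # Top K
            "top_p": 1.0,        # Top P
            "typical_p": 1.0,    # Typical P
            "num_predict": 512,  # Maximum tokens to return in response
            "num_ctx": 2048,     # Context Length
        },
    },
    timeout=90,                                  # Request Timeout
)
response.raise_for_status()
print(response.json()["message"]["content"])
```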

## Generic OpenAI API (Chat Completions)

For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description

### Connection

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the OpenAI-compatible API server | |
| Port | The port number the server is listening on | (leave empty for default) |
| SSL | Whether to use HTTPS for the connection | false |
| API Key | The API key for authentication (if required by your server) | |
| API Path | The path prefix for API requests (e.g., `/v1` for OpenAI-compatible servers) | v1 |

### Sampling & Output

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Top P | Sampling parameter; see above link | 1.0 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
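
The connection options are assembled into the base URL handed to an OpenAI-compatible client. A minimal sketch of that idea using the official `openai` Python package (the host, port, model name, and API key are placeholders, not the integration's code):

```python
from openai import OpenAI

host, port, ssl, api_path = "192.168.1.50", 8080, False, "v1"  # placeholders
base_url = f"{'https' if ssl else 'http'}://{host}:{port}/{api_path}"

client = OpenAI(
    base_url=base_url,
    api_key="sk-placeholder",   # API Key (many local servers accept any value)
    timeout=90,                 # Request Timeout
)

completion = client.chat.completions.create(
    model="local-model",        # placeholder model name
    messages=[{"role": "user", "content": "Turn on the kitchen light."}],
    top_p=1.0,                  # Top P
)
print(completion.choices[0].message.content)
```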

## Generic OpenAI Responses

The Generic OpenAI Responses backend uses time-based conversation memory instead of interaction counts and is compatible with specialized response APIs.

### Connection

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Host | The hostname or IP address of the OpenAI-compatible API server | |
| Port | The port number the server is listening on | (leave empty for default) |
| SSL | Whether to use HTTPS for the connection | false |
| API Key | The API key for authentication (if required by your server) | |
| API Path | The path prefix for API requests | v1 |

### Sampling & Output

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Temperature | Sampling parameter; see above link | 0.1 |
| Top P | Sampling parameter; see above link | 1.0 |
| Request Timeout | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |

### Memory & Conversation

| Option Name | Description | Suggested Value |
| --- | --- | --- |
| Remember conversation time (minutes) | Number of minutes to remember conversation history. Uses time-based memory instead of interaction count. | 2 (minutes) |
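
Time-based memory means past messages are dropped once they are older than the configured window rather than after a fixed number of turns. A simple sketch of that pruning idea (illustrative data structures only, not the integration's internals):

```python
from datetime import datetime, timedelta, timezone

# Each history entry keeps the timestamp of the turn it belongs to.
history: list[dict] = [
    {"role": "user", "content": "Turn on the porch light.",
     "at": datetime.now(timezone.utc) - timedelta(minutes=5)},
    {"role": "user", "content": "And dim the living room lamp.",
     "at": datetime.now(timezone.utc)},
]

def prune_history(messages: list[dict], remember_minutes: int = 2) -> list[dict]:
    """Keep only messages newer than the configured memory window."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=remember_minutes)
    return [m for m in messages if m["at"] >= cutoff]

print(prune_history(history))  # only the recent message survives
```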