refresh backend options documentation

2026-01-08 05:14:02 -05:00 · 2025-10-25 23:48:59 -04:00
parent 2f9181ab37
commit 4b071647a9
1 changed files with 159 additions and 40 deletions
--- a/Configuration.md
+++ b/Configuration.md
@@ -3,42 +3,73 @@
 There are multiple backends to choose from for running the model that the Home Assistant integration uses. The options for each backend are described below.

 # Common Options
-| Option Name                                   | Description                                                                                                                                                                                            | Suggested Value |
-|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|
-| LLM API                                       | This is the set of tools that are provided to the LLM. Use Assist for the built-in API. If you are using Home-LLM v1, v2, or v3, then select the dedicated API                                         |                 |
-| System Prompt                                 | [see here](./Model%20Prompting.md)                                                                                                                                                                     |                 |
-| Maximum tokens to return in response          | Limits the number of tokens that can be produced by each model response                                                                                                                                | 512             |
-| Additional attribute to expose in the context | Extra attributes that will be exposed to the model via the `{{ devices }}` template variable                                                                                                           |                 |
-| Arguments allowed to be pass to service calls | Any arguments not listed here will be filtered out of service calls. Used to restrict the model from modifying certain parts of your home.                                                             |                 |
-| Service Call Regex                            | The regular expression used to extract service calls from the model response; should contain 1 repeated capture group                                                                                  |                 |
-| Refresh System Prompt Every Turn              | Flag to update the system prompt with updated device states on every chat turn. Disabling can significantly improve agent response times when using a backend that supports prefix caching (Llama.cpp) | Enabled         |
-| Remember conversation                         | Flag to remember the conversation history (excluding system prompt) in the model context.                                                                                                              | Enabled         |
-| Number of past interactions to remember       | If `Remember conversation` is enabled, number of user-assistant interaction pairs to keep in history.                                                                                                  |                 |
-| Enable in context learning (ICL) examples     | If enabled, will load examples from the specified file and expose them as the `{{ response_examples }}` variable in the system prompt template                                                         |                 |
-| In context learning examples CSV filename     | The file to load in context learning examples from. Must be located in the same directory as the custom component                                                                                      |                 |
-| Number of ICL examples to generate            | The number of examples to select when expanding the `{{ in_context_examples }}` template in the prompt                                                                                                 |                 |
+These options are available for all backends and control model inference behavior, conversation memory, and integration-specific settings.
+
+| Option Name                                   | Description                                                                                                                                                                                            | Suggested Value         |
+|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
+| Selected Language                             | The language to use for prompts and responses. Affects system prompt templates and examples.                                                                                                           | en                      |
+| LLM API                                       | The API to use for tool execution. Select "Assist" for the built-in Home Assistant API, or "No control" to disable tool execution. Other options are specialized APIs like Home-LLM v1/v2/v3.          | Assist                  |
+| System Prompt                                 | [see here](./Model%20Prompting.md)                                                                                                                                                                     |                         |
+| Additional attributes to expose in the context | Extra attributes that will be exposed to the model via the `{{ devices }}` template variable (e.g., rgb_color, brightness, temperature, humidity, fan_mode, volume_level)                             | See suggestions         |
+| Refresh System Prompt Every Turn              | Flag to update the system prompt with updated device states on every chat turn. Disabling can significantly improve agent response times when using a backend that supports prefix caching (Llama.cpp) | Enabled                 |
+| Remember conversation                         | Flag to remember the conversation history (excluding system prompt) in the model context.                                                                                                              | Enabled                 |
+| Number of past interactions to remember       | If `Remember conversation` is enabled, number of user-assistant interaction pairs to keep in history. Not used by Generic OpenAI Responses backend.                                                    |                         |
+| Enable in context learning (ICL) examples     | If enabled, will load examples from the specified file and expose them as the `{{ response_examples }}` variable in the system prompt template                                                         | Enabled                 |
+| In context learning examples CSV filename     | The file to load in context learning examples from. Must be located in the same directory as the custom component                                                                                      | in_context_examples.csv |
+| Number of ICL examples to generate            | The number of examples to select when expanding the `{{ response_examples }}` template in the prompt                                                                                                   | 4                       |
+| Thinking prefix                               | String prefix to mark the start of internal model reasoning (used when the model supports explicit thinking)                                                                                           | `<think>`               |
+| Thinking suffix                               | String suffix to mark the end of internal model reasoning                                                                                                                                              | `</think>`              |
+| Tool call prefix                              | String prefix to mark the start of a function call in the model response                                                                                                                               | `<tool_call>`           |
+| Tool call suffix                              | String suffix to mark the end of a function call in the model response                                                                                                                                 | `</tool_call>`          |
+| Enable legacy tool calling                    | If enabled, uses the legacy `` ```homeassistant `` tool calling format instead of the newer prefix/suffix format. Required for some older Home-LLM models.                                             | Disabled                |
+| Max tool call iterations                      | Maximum number of times the model can make tool calls in sequence before the conversation is terminated                                                                                                | 3                       |

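The thinking and tool call prefix/suffix options describe delimiters that wrap parts of the model response. The following is a minimal sketch of how such delimiters could be used to separate reasoning, tool calls, and visible text. It is not the integration's actual parser: the function name `split_response` is hypothetical, and it assumes that each tool call payload is a JSON object.

```python
import json
import re

# Hypothetical defaults mirroring the suggested values in the table above.
THINK_PREFIX, THINK_SUFFIX = "<think>", "</think>"
TOOL_PREFIX, TOOL_SUFFIX = "<tool_call>", "</tool_call>"


def split_response(text: str) -> tuple[str, list[dict]]:
    """Return (visible_text, tool_calls) for one model response."""
    # Drop reasoning spans delimited by the thinking prefix/suffix.
    think_re = re.compile(
        re.escape(THINK_PREFIX) + r".*?" + re.escape(THINK_SUFFIX), re.DOTALL
    )
    text = think_re.sub("", text)

    # Collect the payload of every tool call span (assumed to be JSON).
    tool_re = re.compile(
        re.escape(TOOL_PREFIX) + r"(.*?)" + re.escape(TOOL_SUFFIX), re.DOTALL
    )
    tool_calls = [json.loads(body.strip()) for body in tool_re.findall(text)]

    # Whatever remains is the text a user would see.
    visible = tool_re.sub("", text).strip()
    return visible, tool_calls


reply = (
    "<think>user wants the light on</think>Turning it on."
    '<tool_call>{"name": "HassTurnOn", "arguments": {"name": "kitchen light"}}</tool_call>'
)
visible, calls = split_response(reply)
print(visible)
print(calls)
```

Escaping the delimiters with `re.escape` keeps the sketch usable if the prefix or suffix is changed to a string that contains regular expression metacharacters.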
 # Llama.cpp
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
+
+## Connection & Model Selection
+| Option Name           | Description                                                                                                                    | Suggested Value        |
+|-----------------------|--------------------------------------------------------------------------------------------------------------------------------|------------------------|
+| Chat Model            | The Hugging Face model repository or local model filename to use for inference                                                 | acon96/Home-3B-v3-GGUF |
+| Model Quantization    | The quantization level to download for the selected model from Hugging Face                                                    | Q4_K_M                 |
+| Model File Path       | The full path to a local GGUF model file. If not specified, the model will be downloaded from Hugging Face                     |                        |
+
+## Sampling & Output
+| Option Name           | Description                                                                                                                     | Suggested Value |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------|
+| Temperature           | Sampling parameter; see above link                                                                                              | 0.1             |
+| Top K                 | Sampling parameter; see above link                                                                                              | 40              |
+| Top P                 | Sampling parameter; see above link                                                                                              | 1.0             |
+| Min P                 | Sampling parameter; see above link                                                                                              | 0.0             |
+| Typical P             | Sampling parameter; see above link                                                                                              | 1.0             |
+| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response                                          | 512             |
+| Context Length        | Maximum number of tokens the model can consider in its context window                                                           | 2048            |
+
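Context Length and the maximum response tokens interact: in general, the prompt and the generated tokens share the same context window. As a rough sketch of that arithmetic (the helper name `prompt_token_budget` is hypothetical and not part of the integration), the suggested values leave the following room for the system prompt and conversation history:

```python
def prompt_token_budget(context_length: int, max_response_tokens: int) -> int:
    """Tokens left for the prompt after reserving space for the response."""
    if max_response_tokens >= context_length:
        raise ValueError("response limit must be smaller than the context length")
    return context_length - max_response_tokens


# Suggested values from the table above: 2048 context, 512 response tokens.
print(prompt_token_budget(2048, 512))  # 1536
```

If exposed devices or ICL examples push the prompt past this budget, raising Context Length or reducing the number of exposed entities are the usual options.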
+## Performance Optimization
+| Option Name           | Description                                                                                                                     | Suggested Value                |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------|
+| Batch Size            | Number of tokens to process in each batch. Higher values increase speed but consume more memory                                 | 512                            |
+| Thread Count          | Number of CPU threads to use for inference                                                                                      | (number of physical CPU cores) |
+| Batch Thread Count    | Number of threads to use for batch processing                                                                                   | (number of physical CPU cores) |
+| Enable Flash Attention | Use Flash Attention optimization if supported by the model. Can significantly improve performance on compatible GPUs           | Disabled                       |
+
+## Advanced Features
 | Option Name           | Description                                                                                                                     | Suggested Value                                                    |
 |-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
-| Top K                 | Sampling parameter; see above link                                                                                              | 40                                                                 |
-| Top P                 | Sampling parameter; see above link                                                                                              | 1.0                                                                |
-| Temperature           | Sampling parameter; see above link                                                                                              | 0.1                                                                |
-| Min P                 | Sampling parameter; see above link                                                                                              | 0.1                                                                |
-| Typical P             | Sampling parameter; see above link                                                                                              | 0.95                                                               |
 | Enable GBNF Grammar   | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors on quantized models | Enabled                                                            |
 | GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component.                            | `output.gbnf` for Home LLM and `json.gbnf` for any model using ICL |
+| Enable Prompt Caching | Cache the system prompt to avoid recomputing it on every turn (requires `Refresh System Prompt Every Turn` to be disabled)      | Disabled                                                           |
+| Prompt Caching Interval | Number of seconds between prompt cache refreshes (if caching is enabled)                                                      | 30                                                                 |

 ## Wheels
-The wheels for `llama-cpp-python` can be built or downloaded manually for installation.
+The wheels for `llama-cpp-python` can be built or downloaded manually for installation/re-installation.

 Take the appropriate wheel and copy it to the `custom_components/llama_conversation/` directory.

 After the wheel file has been copied to the correct folder, attempt the wheel installation step of the integration setup. The local wheel file should be detected and installed.

 ## Pre-built
-Pre-built wheel files (`*.whl`) are provided as part of the [GitHub release](https://github.com/acon96/home-llm/releases/latest) for the integration.
+Pre-built wheel files (`*.whl`) are built as part of a fork of llama-cpp-python and are available on the [GitHub releases](https://github.com/acon96/llama-cpp-python/releases/latest) page for the fork.

 To ensure compatibility with your Home Assistant and Python versions, select the correct `.whl` file for your hardware's architecture:
 - For Home Assistant `2024.2.0` and newer, use the Python 3.12 wheels (`cp312`)
@@ -57,35 +88,123 @@ To ensure compatibility with your Home Assistant and Python versions, select the
 3. The compatible wheel files will be placed in the folder you executed the script from


+# Llama.cpp Server
+The Llama.cpp Server backend is used when running inference via a separate `llama-cpp-python` HTTP server.
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the llama-cpp-python server                                                                       |                                                                    |
+| Port                  | The port number the server is listening on                                                                                      | 8000                                                               |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+
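The Host, Port, and SSL options together identify the server to contact. A minimal sketch of how they could combine into a base URL is shown below. The helper `build_base_url` is hypothetical and the integration's own URL handling may differ.

```python
from typing import Optional


def build_base_url(host: str, port: Optional[int] = None, ssl: bool = False) -> str:
    """Combine the connection options into a base URL."""
    scheme = "https" if ssl else "http"
    # Omit the port when it is not set so the scheme default applies.
    netloc = host if port is None else f"{host}:{port}"
    return f"{scheme}://{netloc}"


print(build_base_url("192.168.1.10", 8000))
print(build_base_url("llm.example.local", ssl=True))
```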
+## Sampling & Output
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Top K                 | Sampling parameter; see [text-generation-webui wiki](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description) | 40                                                                 |
+| Top P                 | Sampling parameter; see above link                                                                                              | 1.0                                                                |
+| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response                                                                                | 512                                                                |
+| Request Timeout       | The maximum time in seconds that the integration will wait for a response from the remote server                                | 90 (higher if running on low resource hardware)                   |
+
+## Advanced Features
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Enable GBNF Grammar   | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors                    | Enabled                                                            |
+| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component.                            | `output.gbnf`                                                      |
+
+
 # text-generation-webui
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
-| Option Name                      | Description                                                                                                                                      | Suggested Value                                 |
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the text-generation-webui server                                                                  |                                                                    |
+| Port                  | The port number the server is listening on                                                                                      | 5000                                                               |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+| Admin Key             | The admin key for the text-generation-webui server (if configured for authentication)                                           |                                                                    |
+
+## Sampling & Output
+| Option Name                      | Description                                                                                                                      | Suggested Value                                 |
+|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| Temperature                      | Sampling parameter; see above link                                                                                              | 0.1                                             |
+| Top K                            | Sampling parameter; see above link                                                                                               | 40                                              |
+| Top P                            | Sampling parameter; see above link                                                                                               | 1.0                                             |
+| Min P                            | Sampling parameter; see above link                                                                                               | 0.0                                             |
+| Typical P                        | Sampling parameter; see above link                                                                                               | 1.0                                             |
+| Context Length                   | Maximum number of tokens the model can consider in its context window                                                             | 2048                                            |
+| Request Timeout                  | The maximum time in seconds that the integration will wait for a response from the remote server                                 | 90 (higher if running on low resource hardware) |
+
+## UI Configuration
+| Option Name                      | Description                                                                                                                      | Suggested Value                                 |
 |----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
-| Request Timeout                  | The maximum time in seconds that the integration will wait for a response from the remote server                                                 | 90 (higher if running on low resource hardware) |
 | Generation Preset/Character Name | The preset or character name to pass to the backend. If none is provided then the settings that are currently selected in the UI will be applied |                                                 |
-| Chat Mode                        | [see here](https://github.com/oobabooga/text-generation-webui/wiki/01-%E2%80%90-Chat-Tab#mode)                                                   | Instruct                                        |
-| Top K                            | Sampling parameter; see above link                                                                                                               | 40                                              |
-| Top P                            | Sampling parameter; see above link                                                                                                               | 1.0                                             |
-| Temperature                      | Sampling parameter; see above link                                                                                                               | 0.1                                             |
-| Min P                            | Sampling parameter; see above link                                                                                                               | 0.1                                             |
-| Typical P                        | Sampling parameter; see above link                                                                                                               | 0.95                                            |
+| Chat Mode                        | [see here](https://github.com/oobabooga/text-generation-webui/wiki/01-%E2%80%90-Chat-Tab#mode)                                   | Instruct                                        |

 # Ollama
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the Ollama server                                                                                 |                                                                    |
+| Port                  | The port number the server is listening on                                                                                      | 11434                                                              |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+
+## Sampling & Output
 | Option Name                   | Description                                                                                                                    | Suggested Value                                 |
 |-------------------------------|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
-| Request Timeout               | The maximum time in seconds that the integration will wait for a response from the remote server                               | 90 (higher if running on low resource hardware) |
-| Keep Alive/Inactivity Timeout | The duration in minutes to keep the model loaded after each request. Set to a negative value to keep loaded forever            | 30m                                             |
-| JSON Mode                     | Restricts the model to only ouput valid JSON objects. Enable this if you are using ICL and are getting invalid JSON responses. | True                                            |
 | Top K                         | Sampling parameter; see above link                                                                                             | 40                                              |
 | Top P                         | Sampling parameter; see above link                                                                                             | 1.0                                             |
-| Temperature                   | Sampling parameter; see above link                                                                                             | 0.1                                             |
-| Typical P                     | Sampling parameter; see above link                                                                                             | 0.95                                            |
+| Typical P                     | Sampling parameter; see above link                                                                                             | 1.0                                             |
+| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response                                                 | 512                                             |
+| Context Length                | Maximum number of tokens the model can consider in its context window                                                            | 2048                                            |
+| Request Timeout               | The maximum time in seconds that the integration will wait for a response from the remote server                               | 90 (higher if running on low resource hardware) |

-# Generic OpenAI API Compatible
+## Advanced Features
+| Option Name                   | Description                                                                                                                    | Suggested Value                                 |
+|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| JSON Mode                     | Restricts the model to only output valid JSON objects. Enable this if you are using ICL and are getting invalid JSON responses. | True                                            |
+| Keep Alive/Inactivity Timeout | The duration in minutes to keep the model loaded after each request. Set to a negative value to keep loaded forever            | 30 (minutes)                                    |
+
+# Generic OpenAI API (Chat Completions)
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
-| Option Name                   | Description                                                                                      | Suggested Value                                 |
-|-------------------------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------|
-| Request Timeout               | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
-| Top P                         | Sampling parameter; see above link                                                               | 1.0                                             |
-| Temperature                   | Sampling parameter; see above link                                                               | 0.1                                             |
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the OpenAI-compatible API server                                                                  |                                                                    |
+| Port                  | The port number the server is listening on (leave empty for default)                                                            |                                                                    |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+| API Key               | The API key for authentication (if required by your server)                                                                     |                                                                    |
+| API Path              | The path prefix for API requests (e.g., `/v1` for OpenAI-compatible servers)                                                   | v1                                                                |
+
+## Sampling & Output
+| Option Name           | Description                                                                                                                     | Suggested Value                                 |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| Top P                 | Sampling parameter; see above link                                                                                               | 1.0                                             |
+| Request Timeout       | The maximum time in seconds that the integration will wait for a response from the remote server                                | 90 (higher if running on low resource hardware) |
+
+# Generic OpenAI Responses
+The Generic OpenAI Responses backend remembers conversation history for a fixed amount of time rather than for a fixed number of interactions, and is compatible with specialized response APIs.
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the OpenAI-compatible API server                                                                  |                                                                    |
+| Port                  | The port number the server is listening on (leave empty for default)                                                            |                                                                    |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+| API Key               | The API key for authentication (if required by your server)                                                                     |                                                                    |
+| API Path              | The path prefix for API requests                                                                                                | v1                                                                |
+
+## Sampling & Output
+| Option Name                      | Description                                                                                                                     | Suggested Value                                 |
+|----------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| Temperature                      | Sampling parameter; see the link under "Generic OpenAI API (Chat Completions)"                                                  | 0.1                                             |
+| Top P                            | Sampling parameter; see the link under "Generic OpenAI API (Chat Completions)"                                                  | 1.0                                             |
+| Request Timeout                  | The maximum time in seconds that the integration will wait for a response from the remote server                                 | 90 (higher if running on low resource hardware) |
+
+## Memory & Conversation
+| Option Name                           | Description                                                                                                                     | Suggested Value |
+|---------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------|
+| Remember conversation time (minutes) | Number of minutes to remember conversation history. Uses time-based memory instead of interaction count.                       | 2 (minutes)     |