Merge pull request #315 from acon96/release/v0.4.3

Release v0.4.3
2026-01-09 13:48:05 -05:00 · 2025-11-02 12:59:28 -05:00
parent 8759b01739 08d3c6d2ee
commit a3cd30054b
7 changed files with 179 additions and 73 deletions
--- a/README.md
+++ b/README.md
@@ -158,6 +158,7 @@ python3 train.py \
 ## Version History
 | Version | Description                                                                                                                                                                                                                                           |
 |---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| v0.4.3  | Fix an issue with the integration not creating model configs properly during setup                                                                                                                                                                    |
 | v0.4.2  | Fix the following issues: not correctly setting default model settings during initial setup, non-integers being allowed in numeric config fields, being too strict with finish_reason requirements, and not letting the user clear the active LLM API |
 | v0.4.1  | Fix an issue with using Llama.cpp models downloaded from HuggingFace                                                                                                                                                                                  |
 | v0.4    | Rewrite integration to support tool calling models/agentic tool use loop, voice streaming, multiple config sub-entries per backend, and dynamic llama.cpp processor selection                                                                         |
--- a/TODO.md
+++ b/TODO.md
@@ -1,25 +1,27 @@
 # TODO
- [x] proper tool calling support  
- [ ] fix old GGUFs to support tool calling  
- [x] home assistant component text streaming support  
- [x] move llama-cpp build to forked repo + add support for multi backend builds (no more -noavx)  
- [ ] new model based on qwen3 0.6b  
+- [ ] new model based on qwen3 0.6b, 1.7b and 4b    
 - [ ] new model based on gemma3 270m  
 - [ ] support AI task API  
+- [ ] vision support for remote backends  
+- [ ] vision support for local backend (llama.cpp + llava)  
 - [ ] move llamacpp to a separate process because of all the crashing  
 - [ ] optional sampling parameters in options panel (don't pass to backend if not set)  
+- [ ] update dataset so new models will work with the Assist API  
+- [ ] make ICL examples into conversation turns  
+- [ ] translate ICL examples + make better ones  
+- [ ] figure out DPO to improve response quality  
+- [x] proper tool calling support  
+- [x] fix old GGUFs to support tool calling  
+- [x] home assistant component text streaming support  
+- [x] move llama-cpp build to forked repo + add support for multi backend builds (no more -noavx)  
 - [x] support new LLM APIs  
    - rewrite how services are called  
    - handle no API selected  
    - rewrite prompts + service block formats  
    - implement new LLM API that has `HassCallService` so old models can still work  
- [ ] update dataset so new models will work with the API  
- [ ] make ICL examples into conversation turns  
- [ ] translate ICL examples + make better ones  
 - [x] areas/room support  
 - [x] convert requests to aiohttp  
 - [x] detection/mitigation of too many entities being exposed & blowing out the context length  
- [ ] figure out DPO to improve response quality  
 - [x] setup github actions to build wheels that  are optimized for RPIs
 - [x] mixtral + prompting (no fine tuning)  
    - add in context learning variables to sys prompt template
@@ -58,24 +60,6 @@
    - [x] ollama backend  
    - [x] tailored_openai backend  
    - [x] generic openai responses backend  
- [ ] fix and re-upload all compatible old models (+ upload all original safetensors)  
+- [x] fix and re-upload all compatible old models (+ upload all original safetensors)  
 - [x] config entry migration function  
 - [x] re-write setup guide  
-
-## more complicated ideas
- [ ] "context requests"  
-    - basically just let the model decide what RAG/extra context it wants  
-    - the model predicts special tokens as the first few tokens of its output  
-    - the requested content is added to the context after the request tokens and then generation continues  
-    - needs more complicated training b/c multi-turn + there will be some weird masking going on for training the responses properly  
- [ ] integrate with llava for checking camera feeds in home assistant
-    - can check still frames to describe what is there
-    - for remote backends that support images, could also support this
-    - depends on context requests because we don't want to feed camera feeds into the context every time
- [ ] RAG for getting info for setting up new devices  
-    - set up vectordb  
-    - ingest home assistant docs  
-    - "context request" from above to initiate a RAG search  
- [ ] train the model to respond to house events (HA is calling these AI tasks)  
-    - present the model with an event + a "prompt" from the user of what you want it to do (i.e. turn on the lights when I get home = the model turns on lights when your entity presence triggers as being home)  
-    - basically lets you write automations in plain english  
--- a/custom_components/llama_conversation/backends/generic_openai.py
+++ b/custom_components/llama_conversation/backends/generic_openai.py
@@ -191,7 +191,9 @@ class GenericOpenAIAPIClient(LocalLLMClient):
        return endpoint, request_params

    def _extract_response(self, response_json: dict, llm_api: llm.APIInstance | None, user_input: conversation.ConversationInput) -> Tuple[Optional[str], Optional[List[llm.ToolInput]]]:
-        if len(response_json["choices"]) == 0: # finished
+        if "choices" not in response_json or len(response_json["choices"]) == 0: # finished
+            _LOGGER.warning("Response missing or empty 'choices'. Keys present: %s. Full response: %s",
+                            list(response_json.keys()), response_json)
            return None, None
        
        choice = response_json["choices"][0]
--- a/custom_components/llama_conversation/config_flow.py
+++ b/custom_components/llama_conversation/config_flow.py
@@ -1139,8 +1139,8 @@ class LocalLLMSubentryFlowHandler(ConfigSubentryFlow):
            selected_default_options[CONF_PROMPT] = build_prompt_template(
                selected_language, str(selected_default_options.get(CONF_PROMPT, DEFAULT_PROMPT))
            )
-
-            self.model_config = selected_default_options
+            
+            self.model_config = {**selected_default_options, **self.model_config}

        schema = vol.Schema(
            local_llama_config_option_schema(
--- a/custom_components/llama_conversation/const.py
+++ b/custom_components/llama_conversation/const.py
@@ -337,5 +337,5 @@ def option_overrides(backend_type: str) -> dict[str, Any]:
        },
    }

-INTEGRATION_VERSION = "0.4.2"
+INTEGRATION_VERSION = "0.4.3"
 EMBEDDED_LLAMA_CPP_PYTHON_VERSION = "0.3.16+b6153"
--- a/custom_components/llama_conversation/manifest.json
+++ b/custom_components/llama_conversation/manifest.json
@@ -1,7 +1,7 @@
 {
  "domain": "llama_conversation",
  "name": "Local LLMs",
-  "version": "0.4.2",
+  "version": "0.4.3",
  "codeowners": ["@acon96"],
  "config_flow": true,
  "dependencies": ["conversation"],
--- a/Configuration.md
+++ b/Configuration.md
@@ -3,42 +3,73 @@
 There are multiple backends to choose for running the model that the Home Assistant integration uses. Here is a description of all the options for each backend

 # Common Options
-| Option Name                                   | Description                                                                                                                                                                                            | Suggested Value |
-|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|
-| LLM API                                       | This is the set of tools that are provided to the LLM. Use Assist for the built-in API. If you are using Home-LLM v1, v2, or v3, then select the dedicated API                                         |                 |
-| System Prompt                                 | [see here](./Model%20Prompting.md)                                                                                                                                                                     |                 |
-| Maximum tokens to return in response          | Limits the number of tokens that can be produced by each model response                                                                                                                                | 512             |
-| Additional attribute to expose in the context | Extra attributes that will be exposed to the model via the `{{ devices }}` template variable                                                                                                           |                 |
-| Arguments allowed to be pass to service calls | Any arguments not listed here will be filtered out of service calls. Used to restrict the model from modifying certain parts of your home.                                                             |                 |
-| Service Call Regex                            | The regular expression used to extract service calls from the model response; should contain 1 repeated capture group                                                                                  |                 |
-| Refresh System Prompt Every Turn              | Flag to update the system prompt with updated device states on every chat turn. Disabling can significantly improve agent response times when using a backend that supports prefix caching (Llama.cpp) | Enabled         |
-| Remember conversation                         | Flag to remember the conversation history (excluding system prompt) in the model context.                                                                                                              | Enabled         |
-| Number of past interactions to remember       | If `Remember conversation` is enabled, number of user-assistant interaction pairs to keep in history.                                                                                                  |                 |
-| Enable in context learning (ICL) examples     | If enabled, will load examples from the specified file and expose them as the `{{ response_examples }}` variable in the system prompt template                                                         |                 |
-| In context learning examples CSV filename     | The file to load in context learning examples from. Must be located in the same directory as the custom component                                                                                      |                 |
-| Number of ICL examples to generate            | The number of examples to select when expanding the `{{ in_context_examples }}` template in the prompt                                                                                                 |                 |
+These options are available for all backends and control model inference behavior, conversation memory, and integration-specific settings.
+
+| Option Name                                   | Description                                                                                                                                                                                            | Suggested Value         |
+|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
+| Selected Language                             | The language to use for prompts and responses. Affects system prompt templates and examples.                                                                                                           | en                      |
+| LLM API                                       | The API to use for tool execution. Select "Assist" for the built-in Home Assistant API, or "No control" to disable tool execution. Other options are specialized APIs like Home-LLM v1/v2/v3.          | Assist                  |
+| System Prompt                                 | [see here](./Model%20Prompting.md)                                                                                                                                                                     |                         |
+| Additional attributes to expose in the context | Extra attributes that will be exposed to the model via the `{{ devices }}` template variable (e.g., rgb_color, brightness, temperature, humidity, fan_mode, volume_level)                             | See suggestions         |
+| Refresh System Prompt Every Turn              | Flag to update the system prompt with updated device states on every chat turn. Disabling can significantly improve agent response times when using a backend that supports prefix caching (Llama.cpp) | Enabled                 |
+| Remember conversation                         | Flag to remember the conversation history (excluding system prompt) in the model context.                                                                                                              | Enabled                 |
+| Number of past interactions to remember       | If `Remember conversation` is enabled, number of user-assistant interaction pairs to keep in history. Not used by Generic OpenAI Responses backend.                                                    |                         |
+| Enable in context learning (ICL) examples     | If enabled, will load examples from the specified file and expose them as the `{{ response_examples }}` variable in the system prompt template                                                         | Enabled                 |
+| In context learning examples CSV filename     | The file to load in context learning examples from. Must be located in the same directory as the custom component                                                                                      | in_context_examples.csv |
+| Number of ICL examples to generate            | The number of examples to select when expanding the `{{ in_context_examples }}` template in the prompt                                                                                                 | 4                       |
+| Thinking prefix                               | String prefix to mark the start of internal model reasoning (used when the model supports explicit thinking)                                                                                           | `<think>`               |
+| Thinking suffix                               | String suffix to mark the end of internal model reasoning                                                                                                                                              | `</think>`              |
+| Tool call prefix                              | String prefix to mark the start of a function call in the model response                                                                                                                               | `<tool_call>`           |
+| Tool call suffix                              | String suffix to mark the end of a function call in the model response                                                                                                                                 | `</tool_call>`          |
+| Enable legacy tool calling                    | If enabled, uses the legacy `\`\`\`homeassistant` tool calling format instead of the newer prefix/suffix format. Required for some older Home-LLM models.                                              | Disabled                |
+| Max tool call iterations                      | Maximum number of times the model can make tool calls in sequence before the conversation is terminated                                                                                                | 3                       |

 # Llama.cpp
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
+
+## Connection & Model Selection
+| Option Name           | Description                                                                                                                    | Suggested Value        |
+|-----------------------|--------------------------------------------------------------------------------------------------------------------------------|------------------------|
+| Chat Model            | The Hugging Face model repository or local model filename to use for inference                                                 | acon96/Home-3B-v3-GGUF |
+| Model Quantization    | The quantization level to download for the selected model from Hugging Face                                                    | Q4_K_M                 |
+| Model File Path       | The full path to a local GGUF model file. If not specified, the model will be downloaded from Hugging Face                     |                        |
+
+## Sampling & Output
+| Option Name           | Description                                                                                                                     | Suggested Value |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------|
+| Temperature           | Sampling parameter; see above link                                                                                              | 0.1             |
+| Top K                 | Sampling parameter; see above link                                                                                              | 40              |
+| Top P                 | Sampling parameter; see above link                                                                                              | 1.0             |
+| Min P                 | Sampling parameter; see above link                                                                                              | 0.0             |
+| Typical P             | Sampling parameter; see above link                                                                                              | 1.0             |
+| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response                                          | 512             |
+| Context Length        | Maximum number of tokens the model can consider in its context window                                                           | 2048            |
+
+## Performance Optimization
+| Option Name           | Description                                                                                                                     | Suggested Value                |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------|
+| Batch Size            | Number of tokens to process in each batch. Higher values increase speed but consume more memory                                 | 512                            |
+| Thread Count          | Number of CPU threads to use for inference                                                                                      | (number of physical CPU cores) |
+| Batch Thread Count    | Number of threads to use for batch processing                                                                                   | (number of physical CPU cores) |
+| Enable Flash Attention | Use Flash Attention optimization if supported by the model. Can significantly improve performance on compatible GPUs           | Disabled                       |
+
+## Advanced Features
 | Option Name           | Description                                                                                                                     | Suggested Value                                                    |
 |-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
-| Top K                 | Sampling parameter; see above link                                                                                              | 40                                                                 |
-| Top P                 | Sampling parameter; see above link                                                                                              | 1.0                                                                |
-| Temperature           | Sampling parameter; see above link                                                                                              | 0.1                                                                |
-| Min P                 | Sampling parameter; see above link                                                                                              | 0.1                                                                |
-| Typical P             | Sampling parameter; see above link                                                                                              | 0.95                                                               |
 | Enable GBNF Grammar   | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors on quantized models | Enabled                                                            |
 | GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component.                            | `output.gbnf` for Home LLM and `json.gbnf` for any model using ICL |
+| Enable Prompt Caching | Cache the system prompt to avoid recomputing it on every turn (requires refresh_system_prompt to be disabled)                   | Disabled                                                           |
+| Prompt Caching Interval | Number of seconds between prompt cache refreshes (if caching is enabled)                                                      | 30                                                                 |

 ## Wheels
-The wheels for `llama-cpp-python` can be built or downloaded manually for installation.
+The wheels for `llama-cpp-python` can be built or downloaded manually for installation/re-installation.

 Take the appropriate wheel and copy it to the `custom_components/llama_conversation/` directory.

 After the wheel file has been copied to the correct folder, attempt the wheel installation step of the integration setup. The local wheel file should be detected and installed.

 ## Pre-built
-Pre-built wheel files (`*.whl`) are provided as part of the [GitHub release](https://github.com/acon96/home-llm/releases/latest) for the integration.
+Pre-built wheel files (`*.whl`) are built as part of a fork of llama-cpp-python and are available on the [GitHub releases](https://github.com/acon96/llama-cpp-python/releases/latest) page for the fork.

 To ensure compatibility with your Home Assistant and Python versions, select the correct `.whl` file for your hardware's architecture:
 - For Home Assistant `2024.2.0` and newer, use the Python 3.12 wheels (`cp312`)
@@ -57,35 +88,123 @@ To ensure compatibility with your Home Assistant and Python versions, select the
 3. The compatible wheel files will be placed in the folder you executed the script from


+# Llama.cpp Server
+Llama.cpp Server backend is used when running inference via a separate `llama-cpp-python` HTTP server.
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the llama-cpp-python server                                                                       |                                                                    |
+| Port                  | The port number the server is listening on                                                                                      | 8000                                                               |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+
+## Sampling & Output
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Top K                 | Sampling parameter; see [text-generation-webui wiki](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description)                                                                                              | 40                                                                 |
+| Top P                 | Sampling parameter; see above link                                                                                              | 1.0                                                                |
+| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response                                                                                | 512                                                                |
+| Request Timeout       | The maximum time in seconds that the integration will wait for a response from the remote server                                | 90 (higher if running on low resource hardware)                   |
+
+## Advanced Features
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Enable GBNF Grammar   | Restricts the output of the model to follow a pre-defined syntax; eliminates function calling syntax errors                    | Enabled                                                            |
+| GBNF Grammar Filename | The file to load as the GBNF grammar. Must be located in the same directory as the custom component.                            | `output.gbnf`                                                      |
+
+
 # text-generation-webui
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
-| Option Name                      | Description                                                                                                                                      | Suggested Value                                 |
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the text-generation-webui server                                                                  |                                                                    |
+| Port                  | The port number the server is listening on                                                                                      | 5000                                                               |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+| Admin Key             | The admin key for the text-generation-webui server (if configured for authentication)                                           |                                                                    |
+
+## Sampling & Output
+| Option Name                      | Description                                                                                                                      | Suggested Value                                 |
+|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| Temperature                      | Sampling parameter; see above link                                                                                              | 0.1                                             |
+| Top K                            | Sampling parameter; see above link                                                                                               | 40                                              |
+| Top P                            | Sampling parameter; see above link                                                                                               | 1.0                                             |
+| Min P                            | Sampling parameter; see above link                                                                                               | 0.0                                             |
+| Typical P                        | Sampling parameter; see above link                                                                                               | 1.0                                             |
+| Context Length                   | Maximum number of tokens the model can consider in its context window                                                             | 2048                                            |
+| Request Timeout                  | The maximum time in seconds that the integration will wait for a response from the remote server                                 | 90 (higher if running on low resource hardware) |
+
+## UI Configuration
+| Option Name                      | Description                                                                                                                      | Suggested Value                                 |
 |----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
-| Request Timeout                  | The maximum time in seconds that the integration will wait for a response from the remote server                                                 | 90 (higher if running on low resource hardware) |
 | Generation Preset/Character Name | The preset or character name to pass to the backend. If none is provided then the settings that are currently selected in the UI will be applied |                                                 |
-| Chat Mode                        | [see here](https://github.com/oobabooga/text-generation-webui/wiki/01-%E2%80%90-Chat-Tab#mode)                                                   | Instruct                                        |
-| Top K                            | Sampling parameter; see above link                                                                                                               | 40                                              |
-| Top P                            | Sampling parameter; see above link                                                                                                               | 1.0                                             |
-| Temperature                      | Sampling parameter; see above link                                                                                                               | 0.1                                             |
-| Min P                            | Sampling parameter; see above link                                                                                                               | 0.1                                             |
-| Typical P                        | Sampling parameter; see above link                                                                                                               | 0.95                                            |
+| Chat Mode                        | [see here](https://github.com/oobabooga/text-generation-webui/wiki/01-%E2%80%90-Chat-Tab#mode)                                   | Instruct                                        |

 # Ollama
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the Ollama server                                                                                 |                                                                    |
+| Port                  | The port number the server is listening on                                                                                      | 11434                                                              |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+
+## Sampling & Output
 | Option Name                   | Description                                                                                                                    | Suggested Value                                 |
 |-------------------------------|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
-| Request Timeout               | The maximum time in seconds that the integration will wait for a response from the remote server                               | 90 (higher if running on low resource hardware) |
-| Keep Alive/Inactivity Timeout | The duration in minutes to keep the model loaded after each request. Set to a negative value to keep loaded forever            | 30m                                             |
-| JSON Mode                     | Restricts the model to only ouput valid JSON objects. Enable this if you are using ICL and are getting invalid JSON responses. | True                                            |
 | Top K                         | Sampling parameter; see above link                                                                                             | 40                                              |
 | Top P                         | Sampling parameter; see above link                                                                                             | 1.0                                             |
-| Temperature                   | Sampling parameter; see above link                                                                                             | 0.1                                             |
-| Typical P                     | Sampling parameter; see above link                                                                                             | 0.95                                            |
+| Typical P                     | Sampling parameter; see above link                                                                                             | 1.0                                             |
+| Maximum tokens to return in response | Limits the number of tokens that can be produced by each model response                                                 | 512                                             |
+| Context Length                | Maximum number of tokens the model can consider in its context window                                                            | 2048                                            |
+| Request Timeout               | The maximum time in seconds that the integration will wait for a response from the remote server                               | 90 (higher if running on low resource hardware) |

-# Generic OpenAI API Compatible
+## Advanced Features
+| Option Name                   | Description                                                                                                                    | Suggested Value                                 |
+|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| JSON Mode                     | Restricts the model to only output valid JSON objects. Enable this if you are using ICL and are getting invalid JSON responses. | True                                            |
+| Keep Alive/Inactivity Timeout | The duration in minutes to keep the model loaded after each request. Set to a negative value to keep loaded forever            | 30 (minutes)                                    |
+
+# Generic OpenAI API (Chat Completions)
 For details about the sampling parameters, see here: https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#parameters-description
-| Option Name                   | Description                                                                                      | Suggested Value                                 |
-|-------------------------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------|
-| Request Timeout               | The maximum time in seconds that the integration will wait for a response from the remote server | 90 (higher if running on low resource hardware) |
-| Top P                         | Sampling parameter; see above link                                                               | 1.0                                             |
-| Temperature                   | Sampling parameter; see above link                                                               | 0.1                                             |
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the OpenAI-compatible API server                                                                  |                                                                    |
+| Port                  | The port number the server is listening on (leave empty for default)                                                            |                                                                    |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+| API Key               | The API key for authentication (if required by your server)                                                                     |                                                                    |
+| API Path              | The path prefix for API requests (e.g., `/v1` for OpenAI-compatible servers)                                                   | v1                                                                |
+
+## Sampling & Output
+| Option Name           | Description                                                                                                                     | Suggested Value                                 |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| Top P                 | Sampling parameter; see above link                                                                                               | 1.0                                             |
+| Request Timeout       | The maximum time in seconds that the integration will wait for a response from the remote server                                | 90 (higher if running on low resource hardware) |
+
+# Generic OpenAI Responses
+Generic OpenAI Responses backend uses time-based conversation memory instead of interaction counts and is compatible with specialized response APIs.
+
+## Connection
+| Option Name           | Description                                                                                                                     | Suggested Value                                                    |
+|-----------------------|---------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
+| Host                  | The hostname or IP address of the OpenAI-compatible API server                                                                  |                                                                    |
+| Port                  | The port number the server is listening on (leave empty for default)                                                            |                                                                    |
+| SSL                   | Whether to use HTTPS for the connection                                                                                         | false                                                              |
+| API Key               | The API key for authentication (if required by your server)                                                                     |                                                                    |
+| API Path              | The path prefix for API requests                                                                                                | v1                                                                |
+
+## Sampling & Output
+| Option Name                      | Description                                                                                                                     | Suggested Value                                 |
+|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| Temperature                      | Sampling parameter; see above link                                                                                              | 0.1                                             |
+| Top P                            | Sampling parameter; see above link                                                                                               | 1.0                                             |
+| Request Timeout                  | The maximum time in seconds that the integration will wait for a response from the remote server                                 | 90 (higher if running on low resource hardware) |
+
+## Memory & Conversation
+| Option Name                           | Description                                                                                                                     | Suggested Value |
+|---------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------|
+| Remember conversation time (minutes) | Number of minutes to remember conversation history. Uses time-based memory instead of interaction count.                       | 2 (minutes)     |