Homelab Extension #1

Open
opened 2025-07-08 08:36:35 -04:00 by AtHeartEngineer · 0 comments

Originally created by @strhwste on 6/27/2025

Hi ParisNeo. Thanks for creating this :)

There is a good use-case for this in a homelab environment. For example, I have a small NUC running all my services, which is of course not powerful enough for LLM inference — maybe 1-3B models at best. But there are two powerful tower PCs which run only some of the time (turning them off is worthwhile, since each draws at least 200 W at idle). Most of the time they are used for non-crucial stuff, but sometimes they're used for gaming or rendering, and when they are, they shouldn't be handed inference requests.

So what I would love to implement is:

  • a ping to check whether a server is available (see the sketch after this list)
  • fallback to the next server in the list
  • if the requested model is not available on that server, take the best model that is available there (e.g. when only the NUC is running; sketched below)
    * maybe do a context-length check first to see whether the request can be handled by the smaller model?
    * if the requested model is not available, always use the largest available model? Or, even if a smaller model is available, wait for a model comparable to the requested one?
  • a client-side usage monitor: if the GPU is under heavy load, don't send requests to that machine (this could also replace the queue-based load approximation; sketched below)
  • MQTT publishing for Home Assistant (sketched below)
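
As a first rough cut, the ping and fallback could look something like this. A minimal sketch, assuming each backend is a plain Ollama server (which answers `GET /` when up); the hostnames, port, and timeout are placeholders:

```python
# Availability ping plus fallback to the next server in priority order.
import urllib.request

SERVERS = [
    "http://tower-1:11434",  # hypothetical gaming/rendering towers
    "http://tower-2:11434",
    "http://nuc:11434",      # small always-on box as last resort
]

def is_alive(base_url: str, timeout: float = 1.0) -> bool:
    """Ping the server root; Ollama replies with 200 when running."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError subclasses OSError; covers timeouts too
        return False

def pick_server() -> str | None:
    """Return the first reachable server in priority order."""
    return next((url for url in SERVERS if is_alive(url)), None)
```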
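
For the model fallback, the proxy could query the chosen server's model list and substitute the largest available model when the requested one is missing. A sketch, assuming Ollama's `GET /api/tags` endpoint and using on-disk size as a crude proxy for capability:

```python
# Pick the requested model if the server has it, else the biggest one.
import json
import urllib.request

def available_models(base_url: str) -> list[dict]:
    """Fetch the server's local model list via GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2.0) as resp:
        return json.load(resp).get("models", [])

def resolve_model(base_url: str, requested: str) -> str | None:
    """Use the requested model if present, else the largest one available."""
    models = available_models(base_url)
    if any(m["name"] == requested for m in models):
        return requested
    if not models:
        return None
    # Fallback: largest model on disk as a rough "best available" guess.
    return max(models, key=lambda m: m.get("size", 0))["name"]
```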
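
For the client-side usage monitor, something like polling `nvidia-smi` might do. A sketch, assuming NVIDIA GPUs with `nvidia-smi` on the PATH; the 30% threshold is an arbitrary example:

```python
# Refuse to route requests to a box whose GPU is busy (gaming/rendering).
import subprocess

BUSY_THRESHOLD = 30  # percent GPU utilization; tune per machine

def gpu_is_busy() -> bool:
    """Return True if any GPU on this host is above the busy threshold."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=2, check=True,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return False  # no nvidia-smi / no GPU: nothing to protect
    # One utilization value per line, one line per GPU.
    return any(int(v) > BUSY_THRESHOLD for v in out.split() if v.isdigit())
```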
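
And for the Home Assistant side, a small publisher over MQTT, e.g. with the common paho-mqtt package (`pip install paho-mqtt`). The broker address and topic name are placeholders; Home Assistant can read this as a plain MQTT sensor:

```python
# Publish the proxy's current routing state for Home Assistant to consume.
import json

import paho.mqtt.client as mqtt

client = mqtt.Client()  # paho-mqtt 1.x style; 2.x also wants a CallbackAPIVersion
client.connect("homeassistant.local", 1883)  # hypothetical broker address
client.loop_start()

def publish_status(server: str, alive: bool, model: str | None) -> None:
    """Publish the routing state, retained so HA sees it after restarts."""
    payload = json.dumps({"server": server, "alive": alive, "model": model})
    client.publish("ollama_proxy/status", payload, retain=True)
```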

Are you interested in these enhancements? If not, I'll just make a fork :)


Reference: github/ollama_proxy_server#1