

What I am talking about is when layers are split across GPUs. I guess this is loading the full model into each GPU to parallelize layers and do batching
Can you try setting num_ctx and num_predict using a Modelfile with ollama? https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter
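For anyone who has not written a Modelfile before, here is a minimal sketch of what setting the two parameters could look like, generated from Python so the values are easy to change. The base model name `llama3` and the values 8192 and 512 are placeholders I picked, not recommendations; `render_modelfile` is a hypothetical helper, not part of ollama.

```python
def render_modelfile(base_model: str, num_ctx: int, num_predict: int) -> str:
    """Return Modelfile text that sets num_ctx and num_predict on top of a base model."""
    lines = [
        f"FROM {base_model}",                     # base model to build on (placeholder name)
        f"PARAMETER num_ctx {num_ctx}",           # context window size in tokens
        f"PARAMETER num_predict {num_predict}",   # max number of tokens to generate
    ]
    return "\n".join(lines) + "\n"


# Example values are assumptions; pick ones your model actually supports.
modelfile_text = render_modelfile("llama3", 8192, 512)
print(modelfile_text)
```

Saving the printed text as a file named Modelfile and running `ollama create my-model -f Modelfile` (the name `my-model` is made up) should then give you a model with those defaults baked in.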
Are you using a tiny model (1.5B-7B parameters)? ollama pulls a 4-bit quant by default. It looks like vllm does not use quantized models by default, so this is likely the difference. Tiny models are impacted more by quantization.
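To get a feel for how much precision a 4-bit quant gives up, here is a rough back-of-the-envelope sketch comparing weight sizes. It assumes the unquantized model uses 16-bit weights and ignores everything except the weights themselves (no KV cache, no runtime overhead); real 4-bit formats also tend to use a bit more than 4 bits per weight, so treat the numbers as ballpark only.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts taken from the 1.5B-7B range mentioned above.
for name, params in [("1.5B", 1.5e9), ("7B", 7e9)]:
    q4 = weight_size_gb(params, 4)      # 4-bit quantized weights
    fp16 = weight_size_gb(params, 16)   # assumed 16-bit unquantized weights
    print(f"{name}: ~{q4:.2f} GB at 4-bit vs ~{fp16:.2f} GB at 16-bit")
```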
I have no problems with changing num_ctx or num_predict
Model layers are computed sequentially (the output of each layer is the input to the next layer in the sequence), so more GPUs do not offer any kind of performance benefit.
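A toy sketch of that reasoning, in case it helps: if consecutive layers are split across GPUs and each layer has to wait for the previous one, the time for one request is just the sum of the layer times, no matter how many GPUs hold them. The per-layer time, layer count, and transfer cost below are made-up numbers, and this only models a plain layer split for a single request.

```python
def single_request_latency(layer_times_ms, num_gpus, transfer_ms=0.0):
    """Latency of one forward pass when consecutive layers are split across GPUs.

    Each layer needs the previous layer's output, so only one GPU is busy
    at a time; extra GPUs only add hand-off cost between neighbours.
    """
    compute = sum(layer_times_ms)   # layers run one after another
    hops = num_gpus - 1             # activations cross GPU boundaries this many times
    return compute + hops * transfer_ms


layers = [2.0] * 32  # hypothetical: 32 layers at 2 ms each
one_gpu = single_request_latency(layers, num_gpus=1)
four_gpus = single_request_latency(layers, num_gpus=4, transfer_ms=0.5)
print(f"1 GPU: {one_gpu} ms, 4 GPUs: {four_gpus} ms")
```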
Ummm… did you try /set parameter num_ctx # and /set parameter num_predict #? Are you using a model that actually supports the context length that you desire…?
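If you are calling ollama from a script rather than the interactive prompt, the same two options can, as far as I know, also be passed per request in the `options` field of the REST API. The sketch below only builds the request and does not send it; the host is the usual local default, and the model name, prompt, and values are placeholders. `build_generate_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_ctx, num_predict,
                           host="http://localhost:11434"):
    """Build (but do not send) a POST request for ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON response instead of a stream
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder model and values; use a context length your model supports.
req = build_generate_request("llama3", "Why is the sky blue?", 8192, 256)

# To actually send it (needs a running ollama server):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```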
That’s great! Hopefully it shows up on F-Droid sometime soon
you can tell whoever wrote this has never run that command
Uh… isn’t that a good thing?
My guess is an x86 32bit machine
4690k was solid! Mine is retired, though. Now I selfhost on ARM
Wow sounds like great news for Russia and a huge win for them then, I’m sure their internet trolls are supporting this decision instead of crying about it all over the internet, right??
You can overwrite the model by reusing the same name instead of creating one with a new name, if it bothers you. Either way, there is no duplication of the LLM model file.