Ollama vs. vLLM: A Power-user's Perspective on LLM Serving

For the past 18 months, Ollama has been my go-to tool for running Large Language Models (LLMs). Its reliability and minimalist, user-experience-focused design are commendable, and installation, configuration and model management are incredibly straightforward. However, the rise of increasingly powerful and larger models like Llama 3, Qwen2.5 and DeepSeek-R1 is pushing the limits of my hardware. My RTX A4000 can barely handle a 32B parameter model with 4-bit quantisation, and token generation speed suffers noticeably. This slowdown, coupled with the demands of extended chain-of-thought reasoning, was starting to hurt my productivity.

This prompted me to explore vLLM and evaluate whether its promised performance gains would justify a switch for my personal use case.

What is vLLM and Why the Performance Hype?

vLLM is an inference engine optimised for serving LLMs at high throughput and low latency. While Ollama prioritises user-friendly installation and model management, vLLM employs sophisticated memory optimisations and tensor parallelism to maximise hardware utilisation.
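For a sense of what using it looks like, here is a minimal offline-inference sketch with vLLM's Python API; the model name is just an example, and it assumes the weights fit in your GPU's memory.

```python
# Minimal vLLM offline-inference sketch; the model name is illustrative
# and assumes the weights fit in GPU memory.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # fetched from Hugging Face on first run
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
for output in outputs:
    print(output.outputs[0].text)
```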

vLLM's key advantage lies in its PagedAttention mechanism. Traditional inference engines pre-allocate contiguous KV-cache memory per request, which can lead to fragmentation and wasted GPU memory, particularly with long prompts or concurrent requests. PagedAttention instead allocates the KV cache in small fixed-size blocks on demand, reducing that overhead and improving the handling of extended reasoning tasks. This can translate into significant performance improvements, especially for chat-based or agentic workflows involving multiple exchanges.
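To make the idea concrete, here is a toy sketch of block-based KV-cache allocation. It is purely illustrative and is not vLLM's actual implementation; the block size and class names are made up.

```python
# Toy illustration of paged KV-cache allocation -- not vLLM's real code.
# Sequences claim fixed-size blocks on demand instead of reserving one big
# contiguous buffer up front, and return them to a shared pool when done.
BLOCK_SIZE = 16  # tokens per block (made-up value)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))     # shared pool of physical blocks
        self.block_tables: dict[int, list[int]] = {}   # sequence id -> its block ids
        self.seq_lens: dict[int, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                   # current block full, or first token
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free_sequence(self, seq_id: int) -> None:
        # A finished sequence hands its blocks straight back to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```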

Furthermore, vLLM is built for batched inference: its scheduler uses continuous batching to process many requests concurrently with minimal overhead. This makes it ideal for applications serving multiple users or requiring high request throughput. While Ollama excels in single-user, local setups, vLLM is geared towards maximising aggregate token generation speed across numerous requests.
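As a rough sketch of what that looks like from the client side, the following fires several requests in parallel at a vLLM OpenAI-compatible server (started separately, e.g. with `vllm serve`); the URL, model name and prompts are all placeholders.

```python
# Fire several requests at a locally running vLLM OpenAI-compatible server.
# The server batches concurrent requests internally; URL and model name
# are assumptions for illustration.
import concurrent.futures
import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen/Qwen2.5-7B-Instruct"  # whatever the server was launched with

def ask(prompt: str) -> str:
    resp = requests.post(URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 128})
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

prompts = [f"Give one reason (#{i}) to batch LLM requests." for i in range(8)]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer.strip()[:80])
```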

OS Support: A Linux-Centric Approach

A crucial factor to consider is vLLM's focus on server environments, which limits it to Linux. While I initially used Ollama on my MacBook Pro, I eventually migrated it to an Ubuntu server: accessing it remotely over a VPN adds negligible latency and spares my laptop's battery. My frequent experimentation with new models also makes remote access a necessity – downloading large LLMs over public or tethered connections is simply impractical. This transition was seamless for me, but macOS users relying on Apple Silicon's Metal GPU would need to consider running vLLM in a Linux virtual machine, which is unlikely to be efficient.

Model Management: A Tale of Two Approaches

Ollama's model management is simple yet effective. Its model search interface provides a user-friendly way to explore and install model variants by parameter count, architecture and quantisation. In my experience, the stated model size has always been a reliable guide to the VRAM required on my RTX A4000.
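For scripting the same thing, Ollama also exposes a local REST API (port 11434 by default); here is a quick sketch for listing installed models and their sizes.

```python
# List locally installed Ollama models and their on-disk sizes via the
# REST API that the Ollama server exposes on port 11434 by default.
import requests

resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()
for model in resp.json()["models"]:
    print(f"{model['name']:<32} {model['size'] / 1e9:6.1f} GB")
```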

My experience with vLLM's model management was rather different. Ollama defaults to 4-bit quantised models, which is sensible given its target audience of users with potentially limited GPU resources; running unquantised 30B+ parameter models on a desktop PC requires high-end workstation hardware.
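A rough back-of-the-envelope calculation shows why; this only counts the weights and ignores the KV cache and runtime overhead, which add several more gigabytes in practice.

```python
# Rough weight-only VRAM estimate; ignores KV cache and runtime overhead.
def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 8, 4):
    print(f"32B model at {bits:>2}-bit: ~{weight_gib(32, bits):.0f} GiB of weights")
# 16-bit ~60 GiB, 8-bit ~30 GiB, 4-bit ~15 GiB -- which is why a 4-bit 32B
# model only just fits on a 16 GB card like the RTX A4000.
```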

During my investigation, vLLM seemed to favour larger, unquantised models. It can run quantised models – it supports pre-quantised formats such as AWQ (via AutoAWQ) and offers dynamic, in-flight quantisation through the bitsandbytes library – but I still ran into limitations. The largest model I could run under vLLM was a 4-bit quantised 14B parameter version of DeepSeek-R1. The 32B version, even quantised, exceeded my GPU's capacity, likely because of vLLM's own memory overhead: it reserves a large slice of VRAM up front for the KV cache. Ollama, on the other hand, could just squeeze the 32B quantised version into VRAM. This memory constraint was a deal-breaker and prematurely ended my vLLM exploration. It's also worth noting that AWQ support in vLLM is currently under-optimised, so performance may vary.
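For reference, this is roughly how those two quantisation routes are expressed with vLLM's Python API; the model IDs are illustrative, you would pick one of the two, and the memory-related arguments are the knobs I would reach for on a 16 GB card.

```python
# Two ways to fit a quantised model into vLLM (pick one, not both);
# argument names follow vLLM's LLM() API, model IDs are illustrative.
from vllm import LLM

# 1) Load a checkpoint that was already quantised with AWQ.
llm_awq = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM is allowed to reserve
    max_model_len=8192,           # a shorter context shrinks the KV cache
)

# 2) Quantise an unquantised checkpoint in flight with bitsandbytes.
llm_bnb = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
```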

Final Thoughts: Ollama Still Reigns Supreme (For Now)

For the time being, Ollama remains my preferred platform for hosting larger, quantised LLMs at home. I might revisit vLLM if smaller, distilled models become my preferred choice, as the performance benefits could then be more significant. Until then, Ollama's ease of use and ability to handle larger quantised models on my hardware make it the ideal tool for my needs. Perhaps future optimisations in vLLM's quantisation support, or the availability of more efficient quantised models, will change this assessment.