Open-source LLMs have become a favorite alternative for enthusiasts, programmers, and users who want to use generative AI in their daily work while maintaining privacy. These models provide excellent performance and, in many tasks, are comparable to huge closed-source models like GPT-4o or Claude 3.5 Sonnet.
While they are open source, that doesn't mean they're ready to use out of the box: a framework is needed to run them locally or on servers for specific use cases. Additionally, an OpenAI-compatible server has become the most popular way to deploy any model, as this API allows you to use your LLM with almost any SDK or client: the OpenAI SDK, Transformers, LangChain, etc. So the question is: what's the best framework for deploying LLMs with OpenAI compatibility?
In this three-part series, we will analyze Ollama and vLLM, two of the most popular frameworks for deploying models with OpenAI API compatibility. The analysis will cover performance, ease of use, customization, and other comparisons we consider helpful in choosing the best framework for your particular use case. I hope you enjoy it, and if so, please leave a comment.
The street fight begins!
What is Ollama?🦙
Ollama is a powerful framework that aims to make running LLMs as simple as possible. Think of it as Docker for LLMs — it simplifies the entire process of downloading, running, and managing large language models on your local machine or server.
Installation 🛠️
Getting started with Ollama is straightforward. Here’s how to install it on different platforms:
Linux (the one I will use)
curl -fsSL https://ollama.com/install.sh | sh
macOS
brew install ollama
Windows
I don't use Windows, but here is one of the advantages of Ollama: its versatility.
Install WSL (Windows Subsystem for Linux)
Follow the Linux installation instructions
Thanks to a suggestion from the user Quark Quark, you can also install Ollama natively on Windows by following the instructions at this link.
Usage 🚀
Ollama provides a ready-to-use model zoo that you can run with a single line of code: `ollama run <model_name>`. This allows you to run any model listed in the [Ollama models](https://ollama.com/search) repository in your terminal with ease. For this tutorial, I'll use one of my favorite models that can run on my RTX 4060 with 16GB of RAM, [Qwen2.5-14B](https://ollama.com/library/qwen2.5:14b):
```bash
ollama run qwen2.5:14b --verbose
```
And that's it! With just a single command, you can run an LLM on your machine or server and start asking it anything you want. I added the `--verbose` flag so you can see the tokens-per-second (tok/sec) performance; in my case, about 26 tok/sec:
Usage example and GPU consumption for Ollama
### Ollama parameters 🔧
OK, the previous section showed how easy Ollama is to use. But so far we have been running it with its default parameters. What if we want to modify them?
#### Modelfile creation 📝
To create your own model with specific parameters, you’ll need to create a Modelfile, a single plaintext file that contains the parameters you want to set. Here’s an example:
```bash
FROM qwen2.5:14b
# sets the temperature to 0.5 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.5
# sets the context window size to 8192; this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 8192
# sets the maximum number of tokens to generate to 4096
PARAMETER num_predict 4096
# System prompt configuration
SYSTEM """You are a helpful AI assistant."""
```
To build and run your customized model:
```bash
# Build the model
ollama create mymodel -f Modelfile
# Run the model
ollama run mymodel --verbose
```
The full list of parameters you can customize is available at [this link](https://github.com/ollama/ollama/blob/main/docs/modelfile.md).
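If you want to double-check what was baked into your custom model, here is a quick sketch using standard Ollama CLI commands (assuming the `mymodel` name used above):

```bash
# List all models available locally, including the newly created one
ollama list

# Print the Modelfile the custom model was built from
ollama show mymodel --modelfile
```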
### Ollama API 🔌
Here is what we have so far: we have run a model in a terminal, which is a wonderful feature that lets you try models with ease. However, the goal of this series is to use these models in a way that is compatible with OpenAI. How can we do this with Ollama? Ollama provides two ways to interact with models:
**1. Native REST API 📡**
Ollama runs a local server on port 11434 by default. You can interact with it using standard HTTP requests:
```python
import requests

# Basic chat completion request
# (replace localhost with your server address if Ollama runs remotely)
response = requests.post(
    'http://localhost:11434/api/chat',
    json={
        'model': 'qwen2.5:14b',
        'messages': [
            {
                'role': 'system',
                'content': 'You are a helpful AI assistant.'
            },
            {
                'role': 'user',
                'content': 'What is artificial intelligence?'
            }
        ],
        'stream': False
    }
)

print(response.json()['message']['content'])
```
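Since this is a plain HTTP endpoint, you are not tied to Python. As a minimal sketch (again assuming the default local address), the same request with curl looks like this:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:14b",
  "messages": [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is artificial intelligence?"}
  ],
  "stream": false
}'
```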
**2. OpenAI Compatibility Layer 🔄**
For seamless integration with existing applications, Ollama provides OpenAI API compatibility out of the box: the same local server on port 11434 also exposes an OpenAI-compatible endpoint under the `/v1` path, so there is nothing extra to start.
To use it with the OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="dummy"  # Ollama doesn't enforce an API key, so any string works here (vLLM can require one, one of its advantages over Ollama)
)

# Chat completion
response = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is artificial intelligence?"}
    ]
)

print(response.choices[0].message.content)
```
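The compatibility layer also covers streaming and per-request sampling parameters. Here is a minimal sketch (assuming the same local server and model as above) that streams the answer token by token while overriding the temperature for a single request:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="dummy")

# Stream the completion chunk by chunk, with a per-request temperature override
stream = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[{"role": "user", "content": "Explain what streaming means for LLM APIs in one paragraph."}],
    temperature=0.2,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```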
#### Key Ollama API Features 🎯
Ollama's API comes packed with essential features that make it a robust choice for developers. We will detail all the advantages and disadvantages of this framework in Part III of this series; meanwhile, let's list the main features:
- **Streaming Support:** Real-time token generation with full OpenAI API compatibility (as sketched above), perfect for creating responsive applications.
- **Multiple Model Management:** The ability to run different models simultaneously, with a caveat: when VRAM is limited, Ollama will stop one model to load another, which requires careful resource planning (see the `ollama ps` sketch after this list).
- **Parameter Control:** Highly customizable settings via API calls; this is a double-edged sword that offers great flexibility but can be overwhelming for beginners and risky on production servers.
- **CPU Compatibility:** Smart resource management that automatically offloads model layers to the CPU when VRAM is insufficient, making it possible to run large models even on systems with limited GPU memory.
- **Language Agnostic:** Freedom to use your preferred programming language, whether it's Python, JavaScript, Go, or any language with HTTP capabilities.
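As a quick way to see the second and fourth points in practice, the CLI can report what is currently loaded and how it is split between GPU and CPU. A minimal sketch follows; the second model, `llama3.1:8b`, is just an example from the Ollama library, and exact output columns may vary between Ollama versions:

```bash
# Ask two different models for a short answer, one after the other
ollama run qwen2.5:14b "Say hello in one sentence."
ollama run llama3.1:8b "Say hello in one sentence."

# Show which models are currently loaded in memory, their size,
# and how each one is split between CPU and GPU. If VRAM is tight,
# the first model may already have been unloaded to make room.
ollama ps
```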
### Conclusion 🎬
In this first part of our three-part series, we’ve explored Ollama, a powerful and user-friendly framework for running LLMs locally. From its straightforward installation process to its flexible API capabilities, Ollama stands out for its simplicity and ease of use, making it an attractive option for developers looking to experiment with open-source LLMs.
Key takeaways from this first exploration:
- Installation and basic usage are remarkably simple
- The framework provides both a native REST API and OpenAI compatibility
- Resource management is handled intelligently, though with some limitations
In Part II of this series, we’ll dive into vLLM, exploring its features and capabilities. This will set the stage for Part III, where we’ll conduct a detailed comparison of both frameworks, helping you make an informed decision based on your specific needs and use cases.
Stay tuned for the next part, where we’ll continue our journey into the world of LLM inference frameworks! 🚀
Happy coding! 💻🚀
[Next: Part II — Ollama vs vLLM: which framework is better for inference? 👊 (Part II) *Exploring vLLM: The Performance-Focused Framework*](https://henrynavarro.org/ollama-vs-vllm-which-framework-is-better-for-inference-part-ii-37f7e24d3899?sk=013f7a18930a907c53d521ee046afa60)
**Keywords:** Ollama vs vLLM, Ollama API, Ollama, vLLM, Ollama AI, Ollama library, vLLM GGUF, vLLM multi GPU, vLLM API, vLLM server, vLLM inference.
—
_Originally published on [Medium](https://medium.com/@hdnh2006/ollama-vs-vllm-which-framework-is-better-for-inference-part-i-d8211d7248d2) on January 21, 2025._