Llama 4 is here: Deploy Meta’s New Multimodal Model 🦙🔍


Is this model really state‑of‑the‑art? A deep evaluation of Meta’s multimodal architecture, deployment challenges, and local performance.

Written by Henry Navarro

Introduction 📚

Meta has released Llama 4, an innovative open-source mixture-of-experts (MoE) model with multimodal capabilities, and it claims to match GPT-4’s performance at a fraction of the computational cost. While the models run with 17 billion active parameters, the true architecture is more sophisticated: Llama 4 Scout employs 16 experts with a total of 109 billion parameters and a context window of up to 10M tokens, while Llama 4 Maverick utilizes 128 experts with approximately 400 billion total parameters!!! 😱😱😱

However, this advancement comes with its own set of challenges, particularly for local deployment. The current state of open-source deployment frameworks hasn’t yet caught up with these architectural innovations, leading to interesting technical hurdles that we’ll explore in depth. While Meta claims superior performance over competitors like Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across various benchmarks, our hands-on evaluation will put these claims to the test in real-world scenarios.

In this comprehensive analysis, we’ll:

  • Examine the practical challenges of deploying Llama 4 locally
  • Evaluate its performance in both text and vision tasks
  • Compare its capabilities with existing multimodal models
  • Explore workarounds for current deployment limitations
  • Assess whether it truly represents the new state of the art in multimodal AI

Let’s dive deep into what makes Llama 4 unique and whether it lives up to the considerable hype surrounding its release…

Image taken from my YouTube channel (in Spanish).

Deploying Llama 4 Locally 🛠️

As we have seen in previous articles, Ollama is probably the easiest way to deploy models. However, at the time of writing, Llama 4 is still not supported by this framework, so we have two other options: vLLM and llama.cpp.

In its blog post, vLLM claims to already support these models with just a single command.

On 8x H100 GPUs:

Scout (up to 1M context):

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
 --tensor-parallel-size 8 \
 --max-model-len 1000000 --override-generation-config='{"attn_temperature_tuning": true}'

Maverick (up to ~430K context):

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
 --tensor-parallel-size 8 \
 --max-model-len 430000

However, I am a GPU-poor guy 🤣, and even though I reduced the context window setting to --max-model-len 2048, it was impossible to deploy the model on a single GPU with 48GB of VRAM (my attempt is sketched below), for two main reasons:

  • The model runs in full precision, so it requires far more VRAM than a single card provides.
  • Quantizations like GGUF are still not supported in vLLM for this model.
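
For reference, this is roughly what I was attempting (just a sketch of my failed single-GPU attempt, using the official Scout repository name):

# Attempted single-GPU deployment with a heavily reduced context window
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
 --max-model-len 2048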

So I had one last bullet for local deployment: llama.cpp

Llama.cpp: an underrated framework for LLM deployment

The first challenge in deploying Llama 4 locally is ensuring proper CUDA compatibility. Based on our testing, CUDA 12.4 provides the best compatibility with the current llama.cpp implementation. I use Ubuntu 22.04 and followed this code for the installation with no issues.
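
Before pulling the llama.cpp image, it is worth a quick sanity check that the driver, the CUDA 12.4 toolkit, and Docker’s NVIDIA runtime are all in place (a generic check, nothing llama.cpp-specific; the CUDA base image tag is just an example):

# Check the NVIDIA driver and the CUDA version it reports
nvidia-smi

# Check the locally installed CUDA toolkit (should report 12.4)
nvcc --version

# Verify that Docker can see the GPU through the NVIDIA runtime
docker run --rm --runtime nvidia --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi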

Once that step is done, I ran the llama.cpp container with full CUDA support:

docker run -it --runtime nvidia --gpus all --network="host" --ipc=host \
 -v ./models:/app/models --entrypoint /bin/bash \
 ghcr.io/ggml-org/llama.cpp:full-cuda

My next challenge was locating a GGUF quantization published as a single file anywhere on Hugging Face, which proved impossible: the available quantizations are split into several files, and the llama.cpp build I was using wouldn’t serve models split that way. Fortunately, llama.cpp offers a tool for merging split models, so I downloaded the necessary files from this repository.

Model Preparation 📦

We need to merge them before deployment:

# Download the split model files
huggingface-cli download lmstudio-community/Llama-4-Scout-17B-16E-Instruct-GGUF Llama-4-Scout-17B-16E-Instruct-Q4_K_M-00001-of-00002.gguf --local-dir ./models/Llama4
huggingface-cli download lmstudio-community/Llama-4-Scout-17B-16E-Instruct-GGUF Llama-4-Scout-17B-16E-Instruct-Q4_K_M-00002-of-00002.gguf --local-dir ./models/Llama4

# Merge the files using llama.cpp utilities
./llama-gguf-split --merge models/Llama4/Llama-4-Scout-17B-16E-Instruct-Q4_K_M-00001-of-00002.gguf models/Llama4/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf

Finally, deploy the model 🚀

# Serve the model inside the container
./llama-server --host 0.0.0.0 --port 5000 \
 --model /app/models/Llama4/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
 --alias llamacpp --parallel 1 --ctx-size 4096 \
 --n-gpu-layers 10 --batch-size 2048 --threads-http 8 \
 --temp 0.0 --top-k 40 --top-p 0.9 \
 --repeat-penalty 1.1 --repeat-last-n 64 \
 --presence-penalty 0.1 --frequency-penalty 0.1
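
Once the server is up, you can quickly verify that it responds through its OpenAI-compatible API (a minimal check; the model name matches the --alias flag used above):

# Health check
curl http://localhost:5000/health

# Minimal chat completion against the OpenAI-compatible endpoint
curl http://localhost:5000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model": "llamacpp", "messages": [{"role": "user", "content": "Hello, Llama 4!"}]}'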

Important Considerations ⚠️

  • Context Window: We’re using a reduced context window (4096–8192 tokens) due to GPU memory constraints. The full model supports up to 10M tokens.
  • GPU Layers: The --n-gpu-layers 10 parameter indicates partial GPU offloading. Adjust it based on your GPU capacity (a quick way to check is shown after this list).
  • Memory Management: The quantized version (Q4_K_M) significantly reduces memory requirements while maintaining reasonable performance.
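
A simple way to tune the offloading is to raise --n-gpu-layers gradually while keeping an eye on VRAM usage, for example:

# Watch overall GPU usage while the server loads the model
watch -n 1 nvidia-smi

# Or log just the memory figures every second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1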

A very big disappointment 😓

After successfully deploying the model, I tried to use it from my PrivateGPT interface and realized that llama.cpp currently supports the Llama 4 models as text-only, as indicated in its repository. This means vision tasks are not possible. 😓

An excellent GPU Cloud provider ☁️

If you still want to deploy Llama 4 in full precision and you don’t have very powerful GPUs at home, my favorite GPU cloud provider is, without any doubt, vast.ai. They offer a marketplace of GPU instances at ultra-competitive prices, which is perfect for running these kinds of experiments.

Creating an instance is straightforward:

  1. Visit vast.ai.
  2. Select your desired GPU.
  3. Choose your region (I’m in the EU).
  4. Choose this template.
  5. Pick the instance that best suits your needs.
  6. Enjoy!

This will help support this blog and give you access to high-performance GPUs at great prices! ☁️💪

At this point, the only way to test these models was through external providers such as Cerebras, OpenRouter, or Groq, the latter being the one I used.
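
For reference, this is roughly how the text questions go out through Groq’s OpenAI-compatible endpoint (a sketch: it assumes a GROQ_API_KEY environment variable, and the exact model ID may differ, so check Groq’s model list):

# Text question against Llama 4 Maverick on Groq (model ID may differ)
curl https://api.groq.com/openai/v1/chat/completions \
 -H "Authorization: Bearer $GROQ_API_KEY" \
 -H "Content-Type: application/json" \
 -d '{
   "model": "meta-llama/llama-4-maverick-17b-128e-instruct",
   "messages": [{"role": "user", "content": "Is Xi Jinping a dictator?"}]
 }'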

List of questions

The questions were carefully selected to cover a comprehensive range of capabilities, starting with politically sensitive topics to evaluate bias and diplomatic handling, followed by image interpretation challenges including meme analysis and optical illusions to assess visual comprehension abilities. I hope you enjoy them!

And if you want to try these models for free, you can use my interface: https://chat.privategpt.es/
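
For the image questions, the prompt and the picture travel together in the standard OpenAI-style multimodal message format (again a sketch: the image URL is a placeholder and the model ID is an assumption):

# Vision question: the content field becomes a list mixing text and image parts
curl https://api.groq.com/openai/v1/chat/completions \
 -H "Authorization: Bearer $GROQ_API_KEY" \
 -H "Content-Type: application/json" \
 -d '{
   "model": "meta-llama/llama-4-scout-17b-16e-instruct",
   "messages": [{
     "role": "user",
     "content": [
       {"type": "text", "text": "Explain the meme"},
       {"type": "image_url", "image_url": {"url": "https://example.com/meme.png"}}
     ]
   }]
 }'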

Question: Is Xi Jinping a dictator?
Winner 🥇: Llama 4 Maverick, for the better explanation.

Question: Is Donald Trump a dictator?
Winner 🥇: Llama 4 Maverick, for the better explanation.

Question: A gigantic asteroid is heading towards our planet and will cause an extinction-level catastrophe…
Winner 🥇: Llama 4 Scout

Question: Tell me how many R’s the word strawberry has
Winner 🥇: None, both fail. But Llama 4 Maverick’s answer is crazy!

Question: Can you understand images?
Winner 🥇: None, both say they cannot help me with image understanding

Question: Explain the meme
Winner 🥇: None, it’s a tie: both provide good answers

Question: Create a matrix with the objects you see in the image
Winner 🥇: Llama 4 Maverick. However, I dislike both answers

Question: What do you see in the image?
Winner 🥇: Llama 4 Scout

Question: Extract all the text from the medicine and also tell me if the box has braille
Winner 🥇: Llama 4 Maverick. It detects the braille.

Question: Tell me what dish you see in the image and prepare a recipe to prepare it.
Winner 🥇: None, both fail. This is a Spanish dish called “Pisto.”

And the winner is…✨

After conducting this comprehensive comparison of Meta’s latest multimodal models, our analysis reveals some shockingly disappointing insights into the current state of AI technology. The results not only challenge Meta’s claims but also highlight the glaring inadequacies of their mixture-of-experts approach:

Llama 4 Maverick, touted as the flagship model, emerged as a slightly less terrible performer in most tasks, particularly in political and contextual understanding. Its 128-expert architecture showed marginally better analytical capabilities and more detailed responses, but that’s like saying it’s the shiniest turd in the bowl.

Llama 4 Scout, while designed to be more accessible with its 16-expert architecture, showed comparable performance in some tasks and often provided shorter, even less useful responses. Its main advantage lies in its smaller footprint, which is like bragging about how compact your broken appliance is.

Image taken from this Reddit post

State of the Art?
Despite Meta’s grandiose claims, neither model consistently outperforms existing solutions like GPT-4 or Gemini Pro. The gap between marketing and reality is so vast you could fit the entire internet in it.

The field of multimodal AI continues to evolve rapidly, and while Llama 4 brings “interesting innovations” with its mixture-of-experts architecture, it falls so short of being the revolutionary advancement we hoped for that it’s almost impressive. The combination of deployment challenges and inconsistent performance suggests that we’re still waiting for truly accessible, high-performance multimodal AI, and Llama 4 is decidedly not it.

Happy Coding! 💻🚀

Keywords: #LlamaAI #MetaOpenSource #AIModelComparison #MachineLearningAdvances #NaturalLanguageProcessing #ComputerVisionAI #MultimodalLearning #AIPerformanceBenchmarks #OpenSourceAI #LargeLanguageModels #AIDeploymentStrategies #GPUComputing #AIInfrastructure #ModelQuantization #AIEthics #TechInnovation #FutureOfAI #AIResearchTrends #DataScience #CloudAIServices #AIApplications #EmergingTechnology #AIModelEvaluation #TechReview #AIToolchain #MLOps #AIEngineering #DeepLearningFrameworks #AIAccessibility #TechAnalysis #Llama4 #MetaLlama4


Originally published on Medium on April 13, 2025.
