Self-hosting Devstral Small 2 on a 48 GB GPU

Mistral's Devstral Small 2 is a 24-billion-parameter coding model designed to be served behind an OpenAI-compatible API. On a 48 GB GPU, either workstation hardware (RTX 6000 Ada, A6000, A40) or a rented cloud instance from roughly $0.40/hr, it can be deployed with vLLM and consumed by any client that speaks the OpenAI Chat Completions protocol: Cline, Continue, OpenCode, Kilo Code, and others.

The model reports 68% on SWE-bench Verified and supports up to a 256K-token context window at the architecture level, though practical deployments on 48 GB hardware run with 32K to 65K to leave headroom for KV cache and batching. This post walks through the deployment, surveys the current ecosystem of agentic clients, and notes the rough edges that are still present at the time of writing.

Economics

Self-hosting Devstral Small 2 makes sense when privacy or per-seat cost dominates the decision. On a single A40 ($0.40-0.50/hr on Runpod), realistic capacity looks like:

Usage profile	Devs supported	Per-developer cost
Light (5-10 queries/hr/dev)	10-15	$0.03-0.05/hr
Moderate (15-25 queries/hr/dev)	6-9	$0.04-0.08/hr
Heavy (30+ queries/hr/dev, agentic)	2-5	$0.08-0.25/hr

Compared to commercial alternatives at moderate use (8 hrs/day):

Solution	Cost/dev/day	Privacy	Context
Self-hosted Devstral (A40)	$0.32-0.64	Full	32K-256K
GitHub Copilot	$0.45-0.86	Partial	Limited
Cursor	$0.90	None	Limited
Claude Pro	$0.90	None	200K

The economics favour self-hosting under three conditions: high concurrent load, strict privacy requirements, or sustained team-level adoption. Commercial subscriptions scale linearly with headcount; self-hosted infrastructure scales with concurrent throughput, which is usually a much smaller number.

A brief qualitative note before getting to deployment. Devstral Small 2 is genuinely good for a self-hosted 24B model, and the deployment path described here is straightforward. If privacy is not a primary driver, the output quality of frontier closed models (Claude, GPT-4-class) still justifies the per-seat cost for most teams. The case for self-hosting is privacy, predictability, and sustained throughput, not raw output quality.

Hardware requirements

Local:

NVIDIA RTX 6000 Ada, A40, or A6000 (48 GB VRAM)
Ubuntu 20.04+ or equivalent
Python 3.9-3.12
NVIDIA driver 535+
vLLM >= 0.8.5
mistral_common >= 1.5.5
120 GB storage (150 GB recommended)
64 GB system RAM recommended

Cloud:

Runpod GPU pod with A40, A6000, or RTX 6000 Ada (all 48 GB), $0.40-1.20/hr depending on tier

48 GB VRAM is the lower bound.

Devstral Small 2505 in full precision requires 47.3 GB just to load the weights. On a 48 GB GPU this leaves nothing for KV cache, batching, or inference overhead. The configurations below use GPTQ 4-bit quantization (31 GB VRAM) for reliable operation on 48 GB GPUs.

Note on third-party quantization.

This guide uses a community-quantized variant (mratsim/Devstral-Small-2505.w4a16-gptq) for tutorial simplicity. For production deployments where the model's integrity matters, quantize the official Mistral weights yourself with AutoGPTQ or llama.cpp. Third-party quantized weights could in principle contain modifications.

Quick start

Install vLLM and download the quantized model:

# vLLM (>= 0.8.5) and dependencies
pip install --upgrade vllm
pip install mistral_common hf_transfer

# Verify CUDA
python -c "import torch; print(torch.cuda.is_available())"

# Download the GPTQ 4-bit quantized model (~31 GB, 10-15 minutes)
huggingface-cli download mratsim/Devstral-Small-2505.w4a16-gptq

# Verify the cache
ls -la ~/.cache/huggingface/hub/ | grep -i devstral

Run vLLM in the foreground for testing:

vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000

For persistent background execution, run inside a tmux or screen session:

apt-get update && apt-get install -y tmux

tmux new-session -d -s vllm 'vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 8000'

# View logs
tmux attach -t vllm
# Detach: Ctrl+B then D

Wait for:

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000

Test:

curl http://localhost:8000/v1/models

Tool calling is intentionally disabled.

vLLM's --tool-call-parser mistral and --enable-auto-tool-choice flags currently have multiple unresolved issues with Mistral models: streaming tool calls trigger JSONDecodeError (vLLM #21303, closed as "not planned"), and tool-call validation errors on the index field (#17643). Coding clients (Cline, Continue, OpenCode, Kilo Code) work without formal tool calling by interacting through standard chat, which is more stable in practice.

Connecting agentic clients

Context window matters at this layer.

The model supports 256K tokens, but the actual ceiling for your client is whatever vLLM was started with via --max-model-len. The configurations in this post use 32K. If you tune for high throughput (65K) or low memory (16K), update the client configuration to match.

Most OpenAI-compatible clients work with this setup: Cline, Continue, Kilo Code, OpenCode, and others. Example configuration for Continue (.continue/config.yaml):

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: devstral-small-2
    provider: openai
    model: mratsim/Devstral-Small-2505.w4a16-gptq
    apiBase: http://localhost:8000/v1
    apiKey: none

For Runpod or other remote deployments, replace http://localhost:8000/v1 with the pod's exposed endpoint (e.g. http://12.345.67.89:54321/v1). Most other clients accept analogous provider: openai or openai-compatible configuration; consult the specific tool's documentation.

Mistral Vibe CLI is not currently usable with this stack.

Mistral Vibe CLI is Mistral's official terminal coding agent. Against vLLM with tool calling, it hits the streaming JSONDecodeError from vLLM #21303. The vLLM maintainers closed the issue as "not planned", and Vibe doesn't expose a flag to disable streaming. Use one of the alternatives above until this is resolved.

Performance tuning

GPTQ 4-bit quantization (31 GB VRAM) leaves headroom on 48 GB GPUs for batching and longer context windows.

High throughput (long contexts, more concurrent requests):

vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 65536 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 128 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

Balanced (recommended default):

vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000

Low memory (32 GB GPUs or heavy concurrent load):

vllm serve mratsim/Devstral-Small-2505.w4a16-gptq \
  --tokenizer-mode mistral \
  --config-format mistral \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.75 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --host 0.0.0.0 \
  --port 8000

Cloud deployment with Runpod

For deployment without local hardware:

Create an account at runpod.io.
Deploy a pod:
- GPU: A40 (48 GB, $0.40 - 0.50/ h r), A 6000 (48 GB,$ 0.80-1.00/hr), or RTX 6000 Ada (48 GB, $0.80-1.20/hr)
- Template: "RunPod Pytorch 2.4" or "RunPod Pytorch" (Python 3.10+ pre-installed)
- Disk: 120 GB minimum (150 GB recommended)
- Pod type: Secure Cloud (recommended) or Community Cloud
Open a web terminal: Connect > Start Web Terminal.
Follow the same vLLM installation and serving steps as the local deployment. The cache lives at /workspace/.cache/huggingface/hub/ instead of ~/.cache/huggingface/hub/, but vLLM handles this transparently.
Expose port 8000: Edit Pod > Expose Ports > add 8000. Note the external mapping (e.g. 12.345.67.89:54321).
Configure the agentic client with the public endpoint:

{
  "baseURL": "http://12.345.67.89:54321/v1",
  "model": "mratsim/Devstral-Small-2505.w4a16-gptq"
}

For local-only access, an SSH tunnel works if the pod has SSH configured:

ssh -L 8000:localhost:8000 root@<pod-id>.runpod.io -p <ssh-port> -N

Cost optimization: spot instances (50-70% cheaper), stopping pods when idle, and the pod's auto-stop setting all help.

Troubleshooting

torch.OutOfMemoryError: CUDA out of memory with the GPTQ model:

Verify 48 GB VRAM is actually present:
```
nvidia-smi
```

Check for residual processes occupying VRAM:

nvidia-smi
# Inspect the Processes section, kill stragglers with: kill -9 <PID>

Reduce memory pressure:

--gpu-memory-utilization 0.75 \
--max-model-len 16384 \
--max-num-seqs 32

Set PyTorch's allocator to reduce fragmentation:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

If the GPU is below 48 GB: 24 GB cards (RTX 3090/4090, A10G) cannot run Devstral Small 2 with vLLM, even quantized. Ollama with GGUF Q4 (~15 GB) is a workable fallback for smaller GPUs.

RuntimeError: Cannot find any model weights when loading the GPTQ model:

Drop --load-format mistral from the command. The GPTQ model needs vLLM to auto-detect the format from quantize_config.json.

Verify the download:

ls -la ~/.cache/huggingface/hub/models--mratsim--Devstral-Small-2505.w4a16-gptq/
# Should contain: *.safetensors, quantize_config.json, config.json

config.json not found at model load:

# Verify dependencies
pip show vllm mistral_common hf_transfer

# Reinstall if any are missing
pip install --upgrade vllm mistral_common hf_transfer

# Confirm cache, redownload if necessary
ls -la ~/.cache/huggingface/hub/ | grep -i devstral
huggingface-cli download mistralai/Devstral-Small-2505

CUDA unavailable or initialization error:

nvidia-smi
echo $CUDA_VISIBLE_DEVICES
echo $CUDA_HOME
python -c "import torch; print(torch.cuda.is_available()); print(torch.version.cuda); print(torch.cuda.device_count())"

Common fixes:

NVIDIA driver missing (nvidia-smi fails): sudo apt-get install nvidia-driver-535 && sudo reboot.
PyTorch built without CUDA support: pip uninstall torch && pip install torch --index-url https://download.pytorch.org/whl/cu121.
Conflicting environment variable: unset CUDA_VISIBLE_DEVICES.
PyTorch works but vLLM doesn't: pip uninstall vllm && pip install --upgrade vllm.

vLLM is GPU-only; there is no CPU fallback path.

Server fails to start:

# Port already in use
lsof -i :8000
kill -9 $(lsof -t -i:8000)

# GPU not visible
python -c "import torch; print(torch.cuda.device_count())"

Slow download: the model is roughly 31 GB; expect 15-30 minutes on a typical connection. huggingface-cli download shows progress; letting vLLM trigger the download implicitly does not.

Connection problems from the agentic client:

Confirm the server is responding: curl http://localhost:8000/v1/models.
Open the firewall if necessary: sudo ufw allow 8000.
Double-check the endpoint URL in the client configuration.

Cost comparison

Setup	Hardware cost	Ongoing cost	Throughput
RTX 6000 Ada local	$6,000-7,000	Electricity only	80-100 tok/s
A6000 local	$4,000-5,000 (used)	Electricity only	70-85 tok/s
A40 local	$3,000-4,000 (used)	Electricity only	60-75 tok/s
Runpod A40	$0	$0.40-0.50/hr	60-75 tok/s
Runpod A6000	$0	$0.80-1.00/hr	70-85 tok/s
Runpod RTX 6000 Ada	$0	$0.80-1.20/hr	80-100 tok/s
Runpod A100 (80 GB)	$0	$1.50-2.00/hr	90-110 tok/s

For occasional use, Runpod is cheaper. The A40 at $0.40-0.50/hr offers the best value-per-dollar for cloud deployment. For sustained daily use beyond 6-8 hours, local hardware amortizes in 12-18 months.