Setting Up Exo on a 3-Node Proxmox Cluster (And Why I Had to Abandon It)

I recently explored setting up Exo, a framework for distributing large language models (LLMs) across multiple nodes in a cluster. The goal was to create a powerful offline AI assistant using my 3-node Proxmox VE cluster.

My nodes are uniform, each with:

  • Intel HD Graphics 530 (integrated GPU)
  • Multi-core CPUs (no AVX-512)
  • 16 GB RAM per node
  • Running lightweight VMs with Debian-based distros
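
Before going further, it's worth confirming what the nodes actually expose, since both pain points that sank this project (no AVX-512, no usable GPU) can be checked from inside a VM. A quick sanity check on a Debian-based guest:

# Look for AVX-512 flags (these CPUs report none)
grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u

# See which display device the VM can reach
lspci | grep -i vga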

This blog post outlines my experience, step-by-step setup, issues I encountered, and ultimately why I could not use Exo effectively in this configuration.


Why Exo?

Exo promises:

  • Distributed inference for LLMs (like Mistral or LLaMA)
  • Support for containerized node deployment
  • Scalability with multiple compute nodes
  • Open-source and locally hosted

It’s a promising project, especially for offline use in homelab environments.

Step-by-Step Installation Summary

1. Git Clone and Setup

git clone https://github.com/exo-explore/exo.git
cd exo

2. Install Prerequisites (on all nodes)

sudo apt update && sudo apt install -y python3 python3-venv build-essential git

3. Create Python venv

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
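
4. Start Exo (on all nodes)

With the dependencies in place, Exo is launched on every node; peers discover each other automatically over the LAN, so no manual peer list is needed. A minimal sketch, assuming the install registered the project's exo entry point:

# Run on all three nodes; they auto-discover each other on the local network
source .venv/bin/activate
exo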

After installing countless dependencies, I finally got Exo running on all three nodes, and they could all see each other. Everything looked to be running smoothly at that point, but as soon as I tried to interact with the AI, the program would simply error out, essentially due to the lack of GPU hardware on the cluster.
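
For reference, this is roughly how you interact with the cluster from the command line (the model name and exact CLI syntax here are from memory of the project's README, so treat them as an approximation rather than gospel); this is the step that consistently failed for me:

# Send a prompt to the cluster from any node (model name is illustrative)
exo run llama-3.2-1b --prompt "Hello from my Proxmox cluster"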

Issues Encountered

🚫 No GPU Support for Exo Inference

  • Exo requires GPU acceleration to function efficiently.
  • My hardware (Intel HD 530) has no CUDA, no ROCm, and no OpenCL support for PyTorch.
  • CPU-only fallback is either not officially supported or extremely inefficient (see the quick check below).
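
You can verify this from inside a node. Assuming PyTorch is installed in the venv and the clinfo package is available, neither check turns up a usable accelerator on the HD 530 VMs:

# Does PyTorch see a CUDA device? Prints False on these nodes
python3 -c "import torch; print(torch.cuda.is_available())"

# Any OpenCL platforms exposed to the guest?
clinfo | grep -i 'number of platforms'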

🛠 Incompatibility with Llama.cpp

Some models I tested (GGUF via llama.cpp) were not compatible with Exo’s loading mechanism.

⚠️ Lack of Community Support for CPU-only Clusters

Very few users seem to be running Exo without GPUs. Issues related to CPU fallback were largely unanswered or marked as unsupported.


Workarounds I Attempted

  • ✅ Tried smaller models (TinyLLaMA, quantized Phi-2)
  • ✅ Lowered worker thread count and concurrency
  • ✅ Rewrote the config for minimal parallelism

Every variation ended the same way: either painfully slow or failing to start at all.

Final Decision: Move to Ollama + Load Balancing

Due to the lack of GPU support, I chose to:

  • Decommission Exo from my stack
  • Use Ollama instead — it runs well on CPU-only systems
  • Deploy one model per node in my Proxmox cluster
  • Build a load balancer + lightweight chat UI to mimic distributed inference
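
As a rough sketch of what that stack looks like (the install script URL is Ollama's official one; the node IPs, port, and model are placeholders for my setup), each node runs Ollama, and one node fronts them all with an nginx round-robin upstream:

# On each of the three nodes: install Ollama and pull a small model
# (Ollama must listen on the LAN, e.g. via OLLAMA_HOST=0.0.0.0)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2:1b

# On the load-balancer node: round-robin across the nodes (placeholder IPs)
cat <<'EOF' | sudo tee /etc/nginx/conf.d/ollama-lb.conf
upstream ollama_nodes {
    server 192.168.1.11:11434;
    server 192.168.1.12:11434;
    server 192.168.1.13:11434;
}
server {
    listen 8080;
    location / {
        proxy_pass http://ollama_nodes;
    }
}
EOF
sudo nginx -s reload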

My Conclusions

While Exo is an exciting project, and I genuinely wish I had it running on this cluster, it's currently not viable for:

  • Clusters without dedicated NVIDIA or AMD GPUs
  • Homelabs with integrated graphics only

If you’re in a similar situation and want to run LLMs locally:

  • ✅ Use Ollama for GGUF models (CPU-optimized)
  • ✅ Explore Text Generation WebUI or LM Studio
  • ✅ Create a local load-balanced endpoint for pseudo-clustered inference
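
Once the balanced endpoint is up, clients talk to it exactly as they would to a single Ollama instance, via Ollama's standard generate API (the port and model match the sketch above):

# Query the load-balanced endpoint; any of the three nodes may answer
curl http://localhost:8080/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'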

I will be adding a new server with an NVIDIA GPU to my homelab, so I will get Exo running eventually.

Hopefully Exo evolves to support CPU clusters in the future — I’ll be watching!