Why On-Prem Inference
Built for teams running LLM inference at scale who need full control over their stack. No API rate limits, no per-token billing, no data leaving your network. Dedicated VRAM, deterministic latency, and root access to everything.
Full Stack, Zero Config
Ships with vLLM, Ollama, Whisper, Docker, and Portainer pre-configured on Pop!_OS. PyTorch, ROCm/CUDA drivers, and model weights ready to serve. SSH in and start hacking, or use the OpenAI-compatible REST API out of the box. Your infrastructure, your root access.
Built For Engineers Who Need:
Air-Gapped Data Sovereignty
Zero egress. All inference runs locally on dedicated hardware. No telemetry, no third-party API calls. Meets HIPAA, GDPR, and SOC 2 requirements by architecture, not just policy.
Production-Ready in Minutes
Connect power + Ethernet, hit the API endpoint. Pre-loaded GGUF/GPTQ weights, containerized services, health checks configured. Skip weeks of driver debugging and dependency hell.
Fixed CapEx, Predictable Throughput
One-time hardware cost. No per-token fees, no rate limits, no throttling. At sustained batch inference loads, TCO breaks even with cloud GPU instances in 4-6 months.
Sub-Millisecond Network Latency
No round-trip to us-east-1. Inference on local GPUs with P99 latency you control. Works fully offline for edge deployments, air-gapped environments, and sites with unreliable uplinks.
Pre-loaded Model Zoo
Ships with Llama 3, Mistral, CodeLlama, Whisper, and more in optimized quantizations (GGUF Q4_K_M, GPTQ 4-bit, AWQ). Swap models via CLI or REST API. Pull additional weights from HuggingFace or load your own fine-tunes.
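As a sketch, pulling an extra GGUF model through the bundled Ollama REST service (the hostname deepengine.local and the model tag are placeholders; 11434 is Ollama's default port):

    import json
    import requests

    # Ollama's default REST port; adjust if your box maps it differently.
    OLLAMA_URL = "http://deepengine.local:11434"

    # Pull a model by tag; Ollama streams progress updates as JSON lines.
    with requests.post(
        f"{OLLAMA_URL}/api/pull",
        json={"name": "llama3:8b-instruct-q4_K_M"},  # placeholder tag
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                print(json.loads(line).get("status"))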
OpenAI-Compatible API
Drop-in replacement for /v1/chat/completions and /v1/embeddings. Switch from cloud APIs by changing one base URL. Streaming, function calling, and JSON mode supported. Plus Portainer UI for container management.
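Migrating an existing client is typically a one-line change, as in this sketch with the openai Python package (hostname and model name are placeholders for your deployment):

    from openai import OpenAI

    # Point the standard OpenAI client at the box instead of api.openai.com.
    # The local server ignores the API key, but the client requires a value.
    client = OpenAI(base_url="http://deepengine.local:8000/v1", api_key="unused")

    stream = client.chat.completions.create(
        model="llama-3-70b-instruct",  # whichever model the box is serving
        messages=[{"role": "user", "content": "Summarize the NDA clause below: ..."}],
        stream=True,  # token streaming works as with the cloud API
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)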
Full Network Isolation
Runs fully air-gapped. No outbound connections required post-setup. All inference, embedding generation, and STT processing stays on your LAN. You own the firewall rules.
TCO That Actually Makes Sense
At 100K+ tokens/day, cloud inference costs compound fast. DeepEngine pays for itself in months. No per-request fees, no egress charges, no seat licenses. Run unlimited concurrent requests on dedicated hardware.

Deploy in Three Steps
Rack & Connect
Power + Ethernet. DHCP or static IP. Box boots into a pre-configured Pop!_OS with all services auto-starting via systemd.
Hit the API
curl the /v1/models endpoint to verify. OpenAI-compatible API is live on port 8000. Portainer dashboard on :9443 for container ops.
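The same check from Python, assuming the box resolves as deepengine.local:

    import requests

    # A 200 response with a non-empty "data" array means the stack is serving.
    resp = requests.get("http://deepengine.local:8000/v1/models", timeout=5)
    resp.raise_for_status()
    for model in resp.json()["data"]:
        print(model["id"])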
Serve or Customize
Use pre-loaded models or pull your own from HuggingFace. Swap quantizations, adjust context window, tune batch sizes. Full root + SSH access.
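Pulling additional weights uses standard huggingface_hub tooling; a sketch (the repo ID is a public example, the target directory an assumed path):

    from huggingface_hub import snapshot_download

    # Download only the quantization you need into the box's model directory.
    # /opt/models is an assumed path; point this wherever your serving stack
    # is configured to look for weights.
    path = snapshot_download(
        repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        allow_patterns=["*Q4_K_M.gguf"],
        local_dir="/opt/models/mistral-7b-instruct",
    )
    print(f"Weights downloaded to {path}")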
Drivers, CUDA/ROCm, Docker, vLLM, and model weights: all pre-installed and tested. We handle the boring infra so you can focus on your inference pipeline.
Hardware & Software Specs
Hardware Configurations
Mini Box
Entry Level
Run 33B-70B models in GGUF/GPTQ. Ideal for single-model serving and dev/staging.
- GPU/AI Chips:
2x AMD Radeon 7900 XTX (24GB each, 48GB VRAM total; 960GB/s each)
- Total GPU RAM Bandwidth:
1920GB/s
- Compute Performance:
~246 TFLOPS FP16
- Memory:
64GB DDR5
- Storage:
2TB NVMe SSD
- Form Factor:
Mid-Tower ATX, ~1000W PSU
Multi-GPU Box
Professional
Tensor parallelism across 4-8 GPUs. Run 70B+ models unquantized (in the 8-GPU configuration) or serve multiple models concurrently.
- GPU/AI Chips:
4x AMD Radeon 7900 XTX (24GB each, 96GB VRAM total; 960GB/s each)
- Total GPU RAM Bandwidth:
3840GB/s
- Compute Performance:
~492 TFLOPS FP16
- Memory:
128GB DDR5
- Storage:
4TB NVMe SSD
- Form Factor:
Tower/4U Rack, ~2000W PSU
HPC Box
Enterprise
AMD EPYC with 12-24 memory channels. Run llama.cpp CPU inference for massive models and context windows on aggregate DDR5 bandwidth.
- CPU/AI Chip:
1x AMD EPYC (12 memory channels)
- Memory:
288GB DDR5 ECC RAM (12-channel)
- Storage:
2TB NVMe SSD
- Form Factor:
Rackmount/Tower, 750W PSU
- Performance:
DeepSeek R1 Q2_K_XL, ~10 tokens/s
Software Stack
Inference Runtimes & Models
Whisper (STT)
OpenAI Whisper large-v3 running as a containerized service with REST API. Real-time and batch transcription with word-level timestamps. Supports 99 languages.
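A transcription call might look like the sketch below; the port and route are assumptions (an OpenAI-style audio endpoint), so check your container's docs:

    import requests

    # Assumed route and port; the container may expose a different path.
    WHISPER_URL = "http://deepengine.local:9000/v1/audio/transcriptions"

    with open("ops-call.wav", "rb") as audio:
        resp = requests.post(
            WHISPER_URL,
            files={"file": audio},
            data={
                "model": "whisper-large-v3",
                "response_format": "verbose_json",     # includes timestamps
                "timestamp_granularities[]": "word",
            },
        )
    resp.raise_for_status()
    print(resp.json()["text"])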
Ollama
Pull and serve GGUF models with a single command. Built-in model management, automatic quantization detection, and OpenAI-compatible chat API. Great for rapid prototyping.
vLLM
Production-grade LLM serving with PagedAttention, continuous batching, and tensor parallelism. Supports GPTQ, AWQ, and FP16 models. Up to 24x the throughput of naive HuggingFace Transformers serving.
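A minimal offline-batching sketch with vLLM's Python API, assuming a GPTQ checkpoint already on disk and two GPUs for tensor parallelism:

    from vllm import LLM, SamplingParams

    # Shard a quantized model across two GPUs; vLLM handles PagedAttention
    # and continuous batching internally.
    llm = LLM(
        model="/opt/models/llama-3-70b-gptq",  # assumed local checkpoint path
        quantization="gptq",
        tensor_parallel_size=2,
    )

    params = SamplingParams(temperature=0.2, max_tokens=256)
    outputs = llm.generate(["Explain KV-cache paging in two sentences."], params)
    print(outputs[0].outputs[0].text)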
Infrastructure & Tooling
Pop!_OS + ML Toolchain
Ubuntu-based with Python 3.11, PyTorch, ROCm/CUDA drivers, and GPU toolchains pre-installed. apt-get works as expected. Full systemd integration for service management.
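A quick smoke test after SSH-ing in; on the ROCm build of PyTorch, torch.cuda reports AMD devices too:

    import torch

    # Verify the toolchain end to end: device visibility plus one GPU matmul.
    print(f"PyTorch {torch.__version__}, GPUs visible: {torch.cuda.device_count()}")
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    y = x @ x  # raises immediately if the driver stack is broken
    torch.cuda.synchronize()
    print("GPU matmul OK:", y.shape)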
Docker + Portainer
Every service runs in isolated containers with GPU passthrough. Portainer CE for web-based container management, or use docker-compose from the CLI. Your choice.
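For scripted container ops alongside Portainer, the docker Python SDK works too; this sketch uses the standard ROCm device-passthrough pattern (image and command are placeholders):

    import docker

    client = docker.from_env()

    # ROCm containers get GPU access by mapping the kernel fusion driver and
    # DRI devices into the container; NVIDIA setups use device_requests instead.
    logs = client.containers.run(
        "rocm/pytorch:latest",           # placeholder image tag
        command="rocm-smi",              # placeholder: print GPU status and exit
        devices=["/dev/kfd:/dev/kfd", "/dev/dri:/dev/dri"],
        group_add=["video"],             # typical group for GPU device access
        remove=True,
    )
    print(logs.decode())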
Monitoring Dashboard
Real-time GPU utilization, VRAM usage, tokens/sec throughput, and per-model latency metrics. Streamlit-based with customizable panels. Export to Prometheus/Grafana if needed.
Open WebUI
Self-hosted ChatGPT-like interface for interactive model testing. Supports system prompts, conversation history, multi-model switching, and RAG pipeline integration.
Observability & Tuning
GPU Metrics Dashboard
Real-time VRAM utilization, compute load, memory bandwidth saturation, and per-model tokens/sec. Web UI + CLI. Prometheus-compatible metrics endpoint.
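Scraping that endpoint from a script is straightforward; the path and metric names below are assumptions about this deployment:

    import requests

    METRICS_URL = "http://deepengine.local:8000/metrics"  # assumed path

    # Prometheus text format: one "name{labels} value" line per metric.
    # Exact metric names depend on the runtime (vLLM prefixes its own with
    # "vllm:"), so this just greps for throughput-related gauges.
    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        if "tokens" in line and not line.startswith("#"):
            print(line)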
Structured Logging
JSON-structured logs for every inference request with latency, token count, and model version. Ship to your existing ELK/Loki stack. Webhook alerts for OOM and service failures.
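A sketch of forwarding OOM events to a webhook; the log path and field names are illustrative, not a documented schema:

    import json
    import requests

    LOG_PATH = "/var/log/deepengine/inference.jsonl"  # assumed location
    WEBHOOK = "https://hooks.example.com/alerts"      # your alerting endpoint

    # Forward anything flagged as an OOM; field names are illustrative.
    with open(LOG_PATH) as f:
        for raw in f:
            event = json.loads(raw)
            if event.get("level") == "error" and "OOM" in event.get("message", ""):
                requests.post(WEBHOOK, json={
                    "model": event.get("model_version"),
                    "latency_ms": event.get("latency_ms"),
                    "message": event["message"],
                })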
Runtime Configuration
Hot-reload batch sizes, max concurrent requests, context window limits, and KV cache allocation without restarting services. Tune throughput vs latency trade-offs live.
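A live tuning call might look like this sketch; the admin route and parameter names are hypothetical, not a documented API:

    import requests

    ADMIN_URL = "http://deepengine.local:8000/admin/config"  # hypothetical route

    # Trade latency for throughput under batch load: raise the batch ceiling
    # and cap concurrency without restarting the serving container.
    resp = requests.patch(ADMIN_URL, json={
        "max_batch_size": 64,
        "max_concurrent_requests": 32,
        "kv_cache_gb": 36,
    })
    resp.raise_for_status()
    print(resp.json())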
Request Tracing
Full request lifecycle tracking: queue time, prefill latency, decode speed, and total completion time. Historical data for capacity planning and SLA monitoring.
Technical FAQ
TCO & Financing
CapEx hardware vs OpEx cloud inference. At sustained workloads, on-prem pays for itself in months. No per-token billing, no egress fees, no seat licenses.
CapEx vs Per-Token Billing
One-time hardware cost. Run unlimited requests with zero marginal cost per token. At 1M+ tokens/day, cloud API costs exceed hardware investment within months.
Break-Even in 4-6 Months
Compare your monthly cloud GPU spend (A100 instances, API fees, egress) against hardware cost. Most deployments cross the break-even point within two quarters at sustained load.
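A back-of-the-envelope version of that math (every figure below is a placeholder; substitute your own hardware quote and contract rates):

    # Every figure is a placeholder; substitute your quote and contract rates.
    HARDWARE_COST = 15_000.00    # one-time CapEx for the box, USD
    CLOUD_GPU_HOURLY = 4.00      # one A100 on-demand instance, USD/hr

    monthly_cloud = CLOUD_GPU_HOURLY * 24 * 30   # ~$2,880/mo at sustained load
    break_even_months = HARDWARE_COST / monthly_cloud
    print(f"Monthly cloud spend: ${monthly_cloud:,.0f}")
    print(f"Break-even after ~{break_even_months:.1f} months")  # ~5.2 here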
Leasing & Installments
Hardware leasing available to spread CapEx over 12-36 months. Convert to OpEx-like payments while retaining full ownership and control. Works with standard IT procurement processes.
Dedicated compute with predictable costs. No rate limits, no vendor lock-in, no surprise invoices. Scale your inference pipeline on your terms.
Deployment Scenarios
Real-world deployments across different infrastructure requirements and compliance constraints:
Legal: Air-Gapped Document Processing
Law firm deployed Mini Box for contract analysis and document summarization. Llama 3 70B Q4_K_M processing 5,000+ documents/month via REST API integrated into their DMS. Zero data egress, full GDPR compliance. 40% cost reduction vs cloud API spend at ~200K tokens/day steady load.
FinTech: Real-Time Transaction Analysis
Regional bank running Multi-GPU Box (4x 7900 XTX) in their data center for ML-powered fraud detection. Processes 50K+ transactions/day with sub-100ms P95 latency. No customer data leaves their network. ROI in 5 months vs cloud API costs at their volume. LangChain pipeline integrated with core banking via REST.
Manufacturing: Edge Inference on Factory Floor
Factory floor deployment running fully air-gapped. Mini Box serving a custom vision model + Whisper for operator voice commands. No internet uplink. Docker containers auto-start on boot via systemd. Processing production line camera feeds at 30fps with <50ms inference latency. All data stays on-site.
Gov/Defense: Classified Network Deployment
Government agency running HPC Box on an isolated network segment. Dual EPYC for DeepSeek R1 INT8 inference on classified documents. No GPU needed: pure CPU inference via llama.cpp, with 765GB of DDR5 holding model weights and KV cache. Integrated with internal document management via the OpenAI-compatible API behind an existing reverse proxy.
Custom Configurations & Fine-Tuned Models
Need domain-specific models pre-loaded? We ship with custom LoRA adapters for legal, medical, financial, and code generation use cases. Bring your own fine-tuned weights, or work with our team on custom quantization and optimization for your specific throughput targets.
Discuss Custom Config
Get a TCO Analysis for Your Workload
Tell us about your current inference stack and we will model on-prem vs cloud TCO for your specific throughput requirements. Free, no obligation.
Describe Your Stack
Current models, throughput requirements, API spend, and infrastructure constraints.
Get TCO Comparison
We model cloud GPU vs on-prem cost at your token volume and provide a break-even analysis.
Get a Hardware Recommendation
Optimal GPU config, model quantization, and deployment architecture for your use case.
Free TCO Analysis
We will model your specific workload against cloud GPU pricing and show you exact break-even timelines before discussing hardware configuration.
Stay Ahead in Private AI
Get curated updates on AI hardware, open-source LLM breakthroughs, on-premise deployment strategies, and DeepEngine product news. No fluff, just signal.
No spam. Unsubscribe anytime.