Now shipping to EU & US

On-Prem GPU Inference. Pre-configured. Ship Models, Not Data.

Run Llama 70B at 40+ tok/s on 48GB VRAM. Locally. 

Why On-Prem Inference

Built for teams running LLM inference at scale who need full control over their stack. No API rate limits, no per-token billing, no data leaving your network. Dedicated VRAM, deterministic latency, and root access to everything.

Full Stack, Zero Config

Ships with vLLM, Ollama, Whisper, Docker, and Portainer pre-configured on Pop!_OS. PyTorch, ROCm/CUDA drivers, and model weights ready to serve. SSH in and start hacking, or use the OpenAI-compatible REST API out of the box. Your infrastructure, your root access.

Built For Engineers Who Need:

Air-Gapped Data Sovereignty

Zero egress. All inference runs locally on dedicated hardware. No telemetry, no third-party API calls. Meets HIPAA, GDPR, and SOC 2 requirements by architecture, not just policy.

Production-Ready in Minutes

Connect power + ethernet, hit the API endpoint. Pre-loaded GGUF/GPTQ weights, containerized services, health checks configured. Skip weeks of driver debugging and dependency hell.

Fixed CapEx, Predictable Throughput

One-time hardware cost. No per-token fees, no rate limits, no throttling. At sustained batch inference loads, TCO breaks even with cloud GPU instances in 4-6 months.

Sub-Millisecond Network Latency

No round-trip to us-east-1. Inference on local GPUs with P99 latency you control. Works fully offline for edge deployments, air-gapped environments, and sites with unreliable uplinks.

Pre-loaded Model Zoo

Ships with Llama 3, Mistral, CodeLlama, Whisper, and more in optimized quantizations (GGUF Q4_K_M, GPTQ 4-bit, AWQ). Swap models via CLI or REST API. Pull additional weights from HuggingFace or load your own fine-tunes.
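Before swapping in a model, it helps to sanity-check whether it fits in VRAM. A minimal sketch of the usual back-of-envelope rule (weights at the quantization's effective bits-per-weight, plus an assumed ~20% overhead for KV cache and activations — the overhead factor is a rough guess, not a sizing guarantee):

```python
def vram_estimate_gb(n_params_b: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a model: weight bytes times an
    overhead factor for KV cache and activations. Real usage varies
    with context length and batch size."""
    weight_gb = n_params_b * bits_per_weight / 8  # billions of params -> GB
    return weight_gb * overhead

# Llama 70B at Q4_K_M (~4.5 effective bits/weight) lands just under
# the Mini Box's 48GB of VRAM:
print(round(vram_estimate_gb(70, 4.5), 1))
```

The same arithmetic explains why GPTQ/AWQ 4-bit variants of 70B-class models are the sweet spot for a 48GB configuration.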

OpenAI-Compatible API

Drop-in replacement for /v1/chat/completions and /v1/embeddings. Switch from cloud APIs by changing one base URL. Streaming, function calling, and JSON mode supported. Plus Portainer UI for container management.
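Because the endpoint paths and payloads match the OpenAI schema, migration really is a base-URL change. A dependency-free sketch using only the standard library (the host IP and model id below are placeholders — actual model names depend on which runtime is serving):

```python
import json
from urllib.request import Request

def chat_request(base_url: str, model: str, messages: list[dict]) -> Request:
    """Build a /v1/chat/completions request against any
    OpenAI-compatible server -- cloud or local -- by swapping
    only the base URL."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Point at the box on the LAN instead of a cloud endpoint:
req = chat_request("http://192.168.1.50:8000", "llama-3-70b",
                   [{"role": "user", "content": "Summarize this contract."}])
print(req.full_url)
```

Existing OpenAI SDK clients work the same way: set their base URL to the box and leave the rest of the integration untouched.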

Full Network Isolation

Runs fully air-gapped. No outbound connections required post-setup. All inference, embedding generation, and STT processing stays on your LAN. You own the firewall rules.

TCO That Actually Makes Sense

At 100K+ tokens/day, cloud inference costs compound fast. DeepEngine pays for itself in months. No per-request fees, no egress charges, no seat licenses. Run unlimited concurrent requests on dedicated hardware.


Deploy in Three Steps

1. Rack & Connect

Power + Ethernet. DHCP or static IP. Box boots into a pre-configured Pop!_OS with all services auto-starting via systemd.

2. Hit the API

curl the /v1/models endpoint to verify. OpenAI-compatible API is live on port 8000. Portainer dashboard on :9443 for container ops.
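The verification step can also be scripted. An OpenAI-style GET /v1/models response is a JSON list object; a small helper to pull out the served model ids (the sample body and its model ids are illustrative):

```python
import json

def list_model_ids(models_response: str) -> list[str]:
    """Extract model ids from an OpenAI-style GET /v1/models body."""
    payload = json.loads(models_response)
    return [m["id"] for m in payload.get("data", [])]

# Example response body, as returned by an OpenAI-compatible server:
sample = ('{"object": "list", "data": ['
          '{"id": "llama-3-70b-q4_k_m", "object": "model"}, '
          '{"id": "mistral-7b", "object": "model"}]}')
print(list_model_ids(sample))
```

An empty list (or a connection error) is the fastest signal that a serving container has not come up.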

3. Serve or Customize

Use pre-loaded models or pull your own from HuggingFace. Swap quantizations, adjust context window, tune batch sizes. Full root + SSH access.

Drivers, CUDA/ROCm, Docker, vLLM, model weights -- all pre-installed and tested. We handle the boring infra so you can focus on your inference pipeline.

Hardware & Software Specs

Hardware Configurations

Mini Box

Entry Level

Run 33B-70B models in GGUF/GPTQ. Ideal for single-model serving and dev/staging.

  • GPU/AI Chips:

    2x AMD Radeon 7900 XTX (24GB VRAM each, 48GB total; 960GB/s each)

  • Total GPU RAM Bandwidth:

    1920GB/s

  • Compute Performance:

    ~246 TFLOPS FP16

  • Memory:

    64GB DDR5

  • Storage:

    2TB NVMe SSD

  • Form Factor:

    Mid-Tower ATX, ~1000W PSU

Most Popular

Multi-GPU Box

Professional

Tensor parallelism across 4-8 GPUs. Run 70B+ models unquantized or serve multiple models concurrently.

  • GPU/AI Chips:

    4x AMD Radeon 7900 XTX (24GB VRAM each, 96GB total; 960GB/s each)

  • Total GPU RAM Bandwidth:

    3840GB/s

  • Compute Performance:

    ~492 TFLOPS FP16

  • Memory:

    128GB DDR5

  • Storage:

    4TB NVMe SSD

  • Form Factor:

    Tower/4U Rack, ~2000W PSU

HPC Box

Enterprise

AMD EPYC with 12-24 memory channels. Run llama.cpp CPU inference for massive context windows on DDR5 bandwidth.

  • CPU/AI Chip:

    1x AMD EPYC (12 memory channels)

  • Memory:

    288GB DDR5 ECC RAM (12CH)

  • Storage:

    2TB NVMe SSD

  • Form Factor:

    Rackmount/Tower, 750W PSU

  • Performance:

    DeepSeek R1 Q2_K_XL, ~10 tokens/s
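CPU decode speed is memory-bandwidth-bound: every generated token streams the active weights from RAM once. A rough estimate, with all numbers hedged — ~460GB/s is the theoretical peak for 12-channel DDR5-4800, ~37B active parameters assumes an MoE model like DeepSeek R1, ~2.7 bits/weight is an assumed effective rate for a 2-bit K-quant, and the 0.3 efficiency factor is a guess covering NUMA, cache, and KV-read losses:

```python
def decode_toks_per_sec(bandwidth_gbps: float, active_params_b: float,
                        bits_per_weight: float, efficiency: float = 0.3) -> float:
    """Bandwidth-bound estimate of CPU decode speed: tokens/s is
    roughly (RAM bandwidth) / (bytes of active weights per token),
    scaled by a real-world efficiency factor."""
    bytes_per_token_gb = active_params_b * bits_per_weight / 8
    return bandwidth_gbps / bytes_per_token_gb * efficiency

# 12-channel DDR5 (~460 GB/s peak), ~37B active params, ~2.7 bits/weight:
print(round(decode_toks_per_sec(460, 37, 2.7), 1))
```

The estimate lands in the same ballpark as the ~10 tokens/s figure above, which is why memory channels, not core count, are the spec that matters for this box.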

Software Stack

Inference Runtimes & Models

Whisper (STT)

OpenAI Whisper large-v3 running as a containerized service with REST API. Real-time and batch transcription with word-level timestamps. Supports 99 languages.

Ollama

Pull and serve GGUF models with a single command. Built-in model management, automatic quantization detection, and OpenAI-compatible chat API. Great for rapid prototyping.

vLLM

Production-grade LLM serving with PagedAttention, continuous batching, and tensor parallelism. Supports GPTQ, AWQ, and FP16 models. Up to 24x throughput vs naive inference.

Infrastructure & Tooling

Pop!_OS + ML Toolchain

Ubuntu-based with Python 3.11, PyTorch, ROCm/CUDA drivers, and GPU toolchains pre-installed. apt-get works as expected. Full systemd integration for service management.

Docker + Portainer

Every service runs in isolated containers with GPU passthrough. Portainer CE for web-based container management, or use docker-compose from the CLI. Your choice.

Monitoring Dashboard

Real-time GPU utilization, VRAM usage, tokens/sec throughput, and per-model latency metrics. Streamlit-based with customizable panels. Export to Prometheus/Grafana if needed.

Open WebUI

Self-hosted ChatGPT-like interface for interactive model testing. Supports system prompts, conversation history, multi-model switching, and RAG pipeline integration.

Observability & Tuning

GPU Metrics Dashboard

Real-time VRAM utilization, compute load, memory bandwidth saturation, and per-model tokens/sec. Web UI + CLI. Prometheus-compatible metrics endpoint.
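For scrape-based collection, metrics are rendered in the Prometheus text exposition format. A minimal sketch of what that output looks like — the metric names and `deepengine` prefix here are illustrative, not the box's actual metric names:

```python
def prometheus_lines(metrics: dict[str, float], prefix: str = "deepengine") -> str:
    """Render a flat metrics dict in Prometheus text exposition
    format: a '# TYPE' line followed by 'name value' per metric."""
    out = []
    for name, value in sorted(metrics.items()):
        out.append(f"# TYPE {prefix}_{name} gauge")
        out.append(f"{prefix}_{name} {value}")
    return "\n".join(out)

print(prometheus_lines({"vram_used_bytes": 41.2e9, "tokens_per_second": 42.5}))
```

Anything emitting this format can be scraped by an existing Prometheus server with one extra `scrape_configs` target.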

Structured Logging

JSON-structured logs for every inference request with latency, token count, and model version. Ship to your existing ELK/Loki stack. Webhook alerts for OOM and service failures.
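One JSON object per request keeps the logs grep-able locally and ingestible by ELK or Loki without a parsing stage. A sketch of such a log line — the field names are an assumed schema, not the box's exact one:

```python
import json
import time

def inference_log_record(model: str, latency_ms: float, tokens: int,
                         request_id: str) -> str:
    """One JSON log line per inference request, with the fields
    named above: latency, token count, and model identity."""
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "model": model,
        "latency_ms": latency_ms,
        "completion_tokens": tokens,
    })

line = inference_log_record("llama-3-70b", 812.4, 256, "req-0001")
print(line)
```

Because every record carries a `request_id`, the same id can correlate log lines with the request traces described below.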

Runtime Configuration

Hot-reload batch sizes, max concurrent requests, context window limits, and KV cache allocation without restarting services. Tune throughput vs latency trade-offs live.
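The core of hot-reload is distinguishing keys that can change live from keys that require a restart. A sketch of that gatekeeping logic — the set of reloadable keys below is an assumption for illustration:

```python
# Assumed set of keys that can change without a service restart:
RELOADABLE = {"max_batch_size", "max_concurrent_requests",
              "context_window", "kv_cache_gb"}

def apply_runtime_config(current: dict, updates: dict) -> tuple[dict, set]:
    """Merge only hot-reloadable keys into the live config and
    report any keys that would need a full restart."""
    applied = {k: v for k, v in updates.items() if k in RELOADABLE}
    needs_restart = set(updates) - RELOADABLE
    return {**current, **applied}, needs_restart

cfg, needs_restart = apply_runtime_config(
    {"max_batch_size": 8},
    {"max_batch_size": 16, "model_path": "/models/new"})
print(cfg, needs_restart)
```

Raising `max_batch_size` trades latency for throughput; the dashboard metrics above show the effect immediately, without dropping in-flight requests.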

Request Tracing

Full request lifecycle tracking: queue time, prefill latency, decode speed, and total completion time. Historical data for capacity planning and SLA monitoring.
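Given the four lifecycle timestamps, the per-phase breakdown is simple arithmetic. A sketch (timestamps in seconds, illustrative values):

```python
def trace_breakdown(t_enqueue: float, t_start: float,
                    t_first_token: float, t_done: float,
                    completion_tokens: int) -> dict[str, float]:
    """Split a request's lifecycle into the phases named above:
    queue wait, prefill latency, decode speed, and total time."""
    decode_s = t_done - t_first_token
    return {
        "queue_ms": (t_start - t_enqueue) * 1000,
        "prefill_ms": (t_first_token - t_start) * 1000,
        "decode_toks_per_s": completion_tokens / decode_s if decode_s > 0 else 0.0,
        "total_ms": (t_done - t_enqueue) * 1000,
    }

# 20ms queued, 300ms prefill, then 256 tokens decoded in 6.0s:
print(trace_breakdown(0.00, 0.02, 0.32, 6.32, 256))
```

Aggregating `queue_ms` over time is the capacity-planning signal: a rising queue wait at steady traffic means the box is saturating before latency SLAs visibly break.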

Technical FAQ

TCO & Financing

CapEx hardware vs OpEx cloud inference. At sustained workloads, on-prem pays for itself in months. No per-token billing, no egress fees, no seat licenses.

CapEx vs Per-Token Billing

One-time hardware cost. Run unlimited requests with zero marginal cost per token. At 1M+ tokens/day, cloud API costs exceed hardware investment within months.

Break-Even in 4-6 Months

Compare your monthly cloud GPU spend (A100 instances, API fees, egress) against hardware cost. Most deployments cross the break-even point within two quarters at sustained load.
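The break-even arithmetic itself is a one-liner. A sketch with illustrative figures — neither number is a quote:

```python
def break_even_months(hardware_cost: float, monthly_cloud_spend: float) -> float:
    """Months until the one-time hardware cost equals cumulative
    cloud spend (GPU instances + API fees + egress)."""
    return hardware_cost / monthly_cloud_spend

# e.g. a $15k box vs $3k/month of combined cloud GPU and API spend:
print(break_even_months(15_000, 3_000))
```

Plugging in your own monthly invoice shows immediately whether you land inside the 4-6 month window claimed above.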

Leasing & Installments

Hardware leasing available to spread CapEx over 12-36 months. Convert to OpEx-like payments while retaining full ownership and control. Works with standard IT procurement processes.

Dedicated compute with predictable costs. No rate limits, no vendor lock-in, no surprise invoices. Scale your inference pipeline on your terms.

Deployment Scenarios

Real-world deployments across different infrastructure requirements and compliance constraints:

Legal: Air-Gapped Document Processing

Law firm deployed Mini Box for contract analysis and document summarization. Llama 3 70B Q4_K_M processing 5,000+ documents/month via REST API integrated into their DMS. Zero data egress, full GDPR compliance. 40% cost reduction vs cloud API spend at ~200K tokens/day steady load.

Air-Gapped · REST API Integration · GDPR Compliant

FinTech: Real-Time Transaction Analysis

Regional bank running Multi-GPU Box (4x 7900 XTX) in their data center for ML-powered fraud detection. Processes 50K+ transactions/day with sub-100ms P95 latency. No customer data leaves their network. ROI in 5 months vs cloud API costs at their volume. LangChain pipeline integrated with core banking via REST.

Sub-100ms P95 · 50K+ req/day · SOC 2 Compliant

Manufacturing: Edge Inference on Factory Floor

Factory floor deployment running fully air-gapped. Mini Box serving a custom vision model + Whisper for operator voice commands. No internet uplink. Docker containers auto-start on boot via systemd. Processing production line camera feeds at 30fps with <50ms inference latency. All data stays on-site.

Fully Air-Gapped · <50ms Latency · Multi-Model Serving

Gov/Defense: Classified Network Deployment

Government agency running HPC Box on isolated network segment. Dual EPYC for DeepSeek R1 INT8 inference on classified documents. No GPU needed -- pure CPU inference via llama.cpp, with 765GB of DDR5 holding the model weights and KV cache. Integrated with internal document management via OpenAI-compatible API behind existing reverse proxy.

Network Isolated · CPU-Only Inference · 765GB RAM

Custom Configurations & Fine-Tuned Models

Need domain-specific models pre-loaded? We ship with custom LoRA adapters for legal, medical, financial, and code generation use cases. Bring your own fine-tuned weights, or work with our team on custom quantization and optimization for your specific throughput targets.

Discuss Custom Config

Get a TCO Analysis for Your Workload

Tell us about your current inference stack and we will model on-prem vs cloud TCO for your specific throughput requirements. Free, no obligation.

1. Describe Your Stack

Current models, throughput requirements, API spend, and infrastructure constraints.

2. Get TCO Comparison

We model cloud GPU vs on-prem cost at your token volume and provide a break-even analysis.

3. Get a Hardware Recommendation

Optimal GPU config, model quantization, and deployment architecture for your use case.

Free TCO Analysis

We will model your specific workload against cloud GPU pricing and show you exact break-even timelines before discussing hardware configuration.

Stay Ahead in Private AI

Get curated updates on AI hardware, open-source LLM breakthroughs, on-premise deployment strategies, and DeepEngine product news. No fluff, just signal.

No spam. Unsubscribe anytime.

Ship Your Inference Pipeline