SJVIK Labs · sjviklabs.com
Production infrastructure,
run from a 3-node cluster.
Proxmox VE on three HP EliteDesks. 14 LXC services, monitored, backed up, and recoverable from runbooks. The whole thing is documented as code in a private GitHub repo, and signed off as a public PDF every time main moves.
Stack · what's actually running
Three nodes, fourteen services.
All three are HP EliteDesk 800 G3 mini desktops. Quiet, low-power, enough headroom to run the whole thing without straining. The cluster runs Proxmox VE 9.1.6 with corosync over knet, secauth enabled, and is quorate.
Nodes
- nx-core-01 · Cluster leader
  i5-7500T · 64 GB RAM · 1 TB NVMe + 1 TB SATA
  11 LXCs (Traefik, AdGuard, monitoring, web apps)
- nx-ai-01 · Inference
  i5-7500T · 32 GB RAM · 500 GB NVMe + 1 TB SATA
  Ollama CPU inference, content services
- nx-store-01 · Storage & backup
  i5-7500T · 32 GB RAM · NVMe + 1 TB SATA
  Samba shares, Proxmox Backup Server
Services
- DNS · AdGuard Home (P0) · LAN-wide name resolution
- Reverse proxy · Traefik v3 · TLS termination, file-watcher
- Observability · Grafana + Prometheus · 5 alert rules, email contact point
- Status · Uptime Kuma · 13 HTTP monitors
- SIEM · Wazuh 4.14.4 · 14 agents, MITRE rules
- Backup · Proxmox Backup Server · 7 daily + 4 weekly snapshots
- Backup (offsite) · Restic · nightly to 3 repos
- Inference · Ollama · qwen2.5 family on CPU + GPU
- IaC · Ansible · 13 roles, GitHub Actions CI
Architecture · how a request gets served
Two single points of failure, runbooks for both.
Every internal URL hits AdGuard Home first (DNS), then Traefik (TLS + reverse proxy), then the backend LXC. If either of the first two goes down, every internal URL fails — which is why each has its own recovery runbook with copy-pasteable triage commands and a phased fix tree.
┌──────────┐ DNS query ┌──────────┐ HTTPS ┌─────────────┐
│ Client │ ───────────────▶ │ AdGuard │ ──────────▶ │ Traefik │ ───┐
│ (browser)│ │ (LXC 100)│ │ (LXC 104) │ │
└──────────┘ └──────────┘ └─────────────┘ │
▼
┌──────────────┐
│ Backend LXC │
│ (e.g. .31) │
└──────────────┘
AdGuard down → all *.lan fail. Recovery: sops/recovery/adguard-lxc-100.md
Traefik down → 502/connection-refused. Recovery: sops/recovery/traefik-lxc-104.md
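Because Traefik v3 runs with a file-watcher provider, wiring a new backend is a one-file change. A minimal sketch of such a dynamic-config file, assuming a Grafana container at .31; the hostname, entry point name, and IP below are illustrative placeholders, not the lab's actual values:

# Illustrative Traefik v3 dynamic config (file provider).
# Hostname, entry point name and backend IP are placeholders.
http:
  routers:
    grafana:
      rule: "Host(`grafana.lan`)"        # name resolved LAN-wide by AdGuard
      entryPoints:
        - websecure                      # TLS terminates at Traefik
      service: grafana
      tls: {}
  services:
    grafana:
      loadBalancer:
        servers:
          - url: "http://192.168.1.31:3000"   # backend LXC (the ".31" in the diagram)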
Recovery time target
< 5 min for AdGuard, < 10 min for Traefik
Backup retention
7 daily + 4 weekly (PBS), nightly Restic
Cluster firewall
DROP inbound default · per-LXC overrides
SSH posture
Key-only · ed25519 mesh · fail2ban active
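The offsite leg is plain restic on a schedule. A hedged sketch of how that nightly run could be expressed as an Ansible task; the repo URL, password-file path, source paths and timing are placeholders, and the lab pushes to three repos rather than the single one shown:

# Illustrative Ansible task for the nightly offsite restic run.
# Repo URL, password file, source paths and schedule are placeholders.
- name: Schedule nightly restic backup (one of three repos)
  ansible.builtin.cron:
    name: restic-offsite
    user: root
    minute: "30"
    hour: "2"
    job: >-
      restic -r sftp:backup@offsite.example.com:/srv/restic/lab
      --password-file /root/.restic-pass
      backup /etc /srv --tag nightly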
Practices · how it stays trustworthy
Built like work, not like a hobby.
- Documentation as code
  Every infra change lands as a PR with a state-doc update and a change-log entry. The handbook PDF is auto-generated from that source on every merge.
- Recovery before reaction
  P0 / P1 services have copy-pasteable runbooks. Decommissioning changes leave dated config backups in place, so rollback is one cp away.
- Defense-in-depth
  DROP-inbound cluster firewall, key-only SSH, fail2ban, unattended security upgrades, weekly state audits.
- Two-tier disclosure
  The handbook ships in two PDFs: a full lab-internal version, and a public-redacted version produced by an explicit redaction filter. Public-safe by construction, not by hope.
- Change-conscious
  Conventional commits, squash-merge to main, ansible-lint in CI. New services land via Ansible roles, not artisanal SSH sessions.
- Observability built in
  Grafana + Prometheus on every node, Wazuh SIEM with MITRE rules, Uptime Kuma fronting the *.lan estate. Alerts go to email, with the contact point provisioned as code.
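For a feel of what lives in the repo, here is a minimal sketch of one Prometheus alerting rule, assuming node-exporter targets under a job called node; the rule name, expression and wait period are illustrative, not one of the five actual rules:

# Illustrative Prometheus alerting rule; names and timings are placeholders.
groups:
  - name: node-health
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} has stopped reporting metrics"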
Handbook · auto-generated, every merge
The lab, as a PDF.
Every push to main on the infra repo regenerates two handbook PDFs: an internal one (full IPs, ports, DDNS) and a public-redacted one for external sharing. Both are attached to a tagged GitHub Release. The link below always resolves to the latest *-public.pdf.
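A sketch of how that build could be wired as a GitHub Actions workflow; the script names, the --redact flag and the release action are assumptions for illustration, not the repo's actual pipeline:

# Illustrative handbook-build workflow; scripts, flags and the release
# action shown here are assumptions, not the repo's actual pipeline.
name: handbook
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write          # needed to create the tagged release
    steps:
      - uses: actions/checkout@v4
      - name: Build internal handbook PDF
        run: ./scripts/build-handbook.sh --out handbook-internal.pdf
      - name: Build public-redacted handbook PDF
        run: ./scripts/build-handbook.sh --redact --out handbook-public.pdf
      - name: Attach both PDFs to a tagged GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: handbook-${{ github.run_number }}
          files: |
            handbook-internal.pdf
            handbook-public.pdf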
What's in it (table of contents)
- Architecture — project charter, lab overview, road map
- Inventory & Network — hardware, IPs, ports, SSH mesh, topology diagrams
- Recovery Runbooks — Traefik, AdGuard, full node restore
- Setup & Provisioning — Linux base, storage, monitoring, projects
- Appendix — recent change log entries