SJVIK Labs · sjviklabs.com
Production infrastructure,
run from a 3-node cluster.
Proxmox VE on three HP EliteDesks. 14 LXC services, monitored, backed up, and recoverable from runbooks. The whole thing is documented as code in a private GitHub repo, and signed off as a public PDF every time main moves.
Stack · what's actually running
Three nodes, fourteen services.
All three are HP EliteDesk 800 G3 mini desktops. Quiet, low-power, enough headroom to run the whole thing without straining. The cluster runs Proxmox VE 9.1.6 with corosync over knet, secauth enabled, and is quorate.
Nodes
- nx-core-01 · Cluster leader
  i5-7500T · 64 GB RAM · 1 TB NVMe + 1 TB SATA
  11 LXCs (Traefik, AdGuard, monitoring, web apps)
- nx-ai-01 · Inference
  i5-7500T · 32 GB RAM · 500 GB NVMe + 1 TB SATA
  Ollama CPU inference, content services
- nx-store-01 · Storage & backup
  i5-7500T · 32 GB RAM · NVMe + 1 TB SATA
  Samba shares, Proxmox Backup Server
Services
- DNS · AdGuard Home (P0) · LAN-wide name resolution
- Reverse proxy · Traefik v3 · TLS termination, file-watcher
- Observability · Grafana + Prometheus · 5 alert rules, email contact point
- Status · Uptime Kuma · 13 HTTP monitors
- SIEM · Wazuh 4.14.4 · 14 agents, MITRE rules
- Backup · Proxmox Backup Server · 7 daily + 4 weekly snapshots
- Backup (offsite) · Restic · nightly to 3 repos
- Inference · Ollama · qwen2.5 family on CPU + GPU
- IaC · Ansible · 13 roles, GitHub Actions CI
Architecture · how a request gets served
Two single points of failure, runbooks for both.
Every internal URL hits AdGuard Home first (DNS), then Traefik (TLS + reverse proxy), then the backend LXC. If either of the first two goes down, every internal URL fails — which is why each has its own recovery runbook with copy-pasteable triage commands and a phased fix tree.
┌──────────┐ DNS query ┌──────────┐ HTTPS ┌─────────────┐
│ Client │ ───────────────▶ │ AdGuard │ ──────────▶ │ Traefik │ ───┐
│ (browser)│ │ (LXC 100)│ │ (LXC 104) │ │
└──────────┘ └──────────┘ └─────────────┘ │
▼
┌──────────────┐
│ Backend LXC │
│ (e.g. .31) │
└──────────────┘
AdGuard down → all *.lan fail. Recovery: sops/recovery/adguard-lxc-100.md
Traefik down → 502/connection-refused. Recovery: sops/recovery/traefik-lxc-104.md
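Because Traefik v3 runs with a file-watcher provider, wiring a new backend is a one-file change. A minimal sketch of such a dynamic-config file, assuming a Grafana container at .31; the hostname, entry point name, and IP below are illustrative placeholders, not the lab's actual values:

# Illustrative Traefik v3 dynamic config (file provider).
# Hostname, entry point name and backend IP are placeholders.
http:
  routers:
    grafana:
      rule: "Host(`grafana.lan`)"        # name resolved LAN-wide by AdGuard
      entryPoints:
        - websecure                      # TLS terminates at Traefik
      service: grafana
      tls: {}
  services:
    grafana:
      loadBalancer:
        servers:
          - url: "http://192.168.1.31:3000"   # backend LXC (the ".31" in the diagram)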
Recovery time target
< 5 min for AdGuard, < 10 min for Traefik
Backup retention
7 daily + 4 weekly (PBS), nightly Restic
Cluster firewall
DROP inbound default · per-LXC overrides
SSH posture
Key-only · ed25519 mesh · fail2ban active
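The offsite leg is plain restic on a schedule. A hedged sketch of how that nightly run could be expressed as an Ansible task; the repo URL, password-file path, source paths and timing are placeholders, and the lab pushes to three repos rather than the single one shown:

# Illustrative Ansible task for the nightly offsite restic run.
# Repo URL, password file, source paths and schedule are placeholders.
- name: Schedule nightly restic backup (one of three repos)
  ansible.builtin.cron:
    name: restic-offsite
    user: root
    minute: "30"
    hour: "2"
    job: >-
      restic -r sftp:backup@offsite.example.com:/srv/restic/lab
      --password-file /root/.restic-pass
      backup /etc /srv --tag nightly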
Practices · how it stays trustworthy
Built like work, not like a hobby.
- Documentation as code
  Every infra change lands as a PR with a state-doc update and a change-log entry. The handbook PDF is auto-generated from that source on every merge.
- Recovery before reaction
  P0 / P1 services have copy-pasteable runbooks. Decommissioning changes leave dated config backups in place, so rollback is one cp away.
- Defense-in-depth
  DROP-inbound cluster firewall, key-only SSH, fail2ban, unattended security upgrades, weekly state audits.
- Two-tier disclosure
  The handbook ships in two PDFs: a full lab-internal version, and a public-redacted version produced by an explicit redaction filter. Public-safe by construction, not by hope.
- Change-conscious
  Conventional commits, squash-merge to main, ansible-lint in CI. New services land via Ansible roles, not artisanal SSH sessions.
- Observability built in
  Grafana + Prometheus on every node, Wazuh SIEM with MITRE rules, Uptime Kuma fronting the *.lan estate. Alerts go to email, with the contact point provisioned as code.
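For a feel of what lives in the repo, here is a minimal sketch of one Prometheus alerting rule, assuming node-exporter targets under a job called node; the rule name, expression and wait period are illustrative, not one of the five actual rules:

# Illustrative Prometheus alerting rule; names and timings are placeholders.
groups:
  - name: node-health
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} has stopped reporting metrics"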
Handbook · auto-generated, every merge
The lab, as a PDF.
Every push to main on the infra repo regenerates two handbook PDFs: an internal one (full IPs, ports, DDNS) and a public-redacted one for external sharing. Both are attached to a tagged GitHub Release. The link below always resolves to the latest *-public.pdf.
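A sketch of how that build could be wired as a GitHub Actions workflow; the script names, the --redact flag and the release action are assumptions for illustration, not the repo's actual pipeline:

# Illustrative handbook-build workflow; scripts, flags and the release
# action shown here are assumptions, not the repo's actual pipeline.
name: handbook
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write          # needed to create the tagged release
    steps:
      - uses: actions/checkout@v4
      - name: Build internal handbook PDF
        run: ./scripts/build-handbook.sh --out handbook-internal.pdf
      - name: Build public-redacted handbook PDF
        run: ./scripts/build-handbook.sh --redact --out handbook-public.pdf
      - name: Attach both PDFs to a tagged GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: handbook-${{ github.run_number }}
          files: |
            handbook-internal.pdf
            handbook-public.pdf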
What's in it (table of contents)
- Architecture — project charter, lab overview, road map
- Inventory & Network — hardware, IPs, ports, SSH mesh, topology diagrams
- Recovery Runbooks — Traefik, AdGuard, full node restore
- Setup & Provisioning — Linux base, storage, monitoring, projects
- Appendix — recent change log entries