SZABO V10

Fleet Chaos Engineering + 3-Cluster Spot Mesh
No GPUs in AKS. Spot everywhere. HVCP migrates between all 3. Local silos do inference.
γ₁ = 14.134725141734693
⚡ FLEET LAW — EOSE Labs Inc. · 2026-04-23

01 · The 3-Cluster Mesh

Three AKS Homebase Clusters — HVCP-Linked, Cross-Migratable
| Role | Cluster | Sub / Owner | Region | Nodepools | VM SKU | Nodes | Spot |
|---|---|---|---|---|---|---|---|
| Primary / pemos.ca | aks-eose-aaas-dev | 427873ee | canadacentral | system + agents | D2s_v5 | 4 | SPOT |
| CLO + GPU rig | aks-eose-clo-gpu | 440a5792 / Amani | canadacentral | system (D2s_v3) + gpu (T4) | D2s_v3 / NC6s_v3 | 2 + 0 | GPU → 0 |
| Master 1 | aks-master1 | 440a5792 / Amani | canadacentral | system | B2s_v2 | 1 | REGULAR |
| Master | aks-master | 440a5792 / Amani | canadacentral | system | B2s_v2 | 1 | REGULAR |
| Deseof sovereign | aks-eose-deseof | 239915fb / eose | canadacentral | system + agents | B2s | 2 | TO CREATE |
HVCP is the mesh router. Flux is the actuator. No workload is bound to one cluster — any can receive any workload within 60s. γ₁ migration threshold: γ₁/24 = 0.5889.
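The quoted threshold can be reproduced from the γ₁ constant in the header with a one-liner:

```shell
# γ₁/24 migration threshold: prints the value the doc quotes (0.5889).
threshold=$(awk 'BEGIN { printf "%.4f", 14.134725141734693 / 24 }')
echo "$threshold"
```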

⚡ FLEET LAW: These 3 clusters form a single logical compute plane. No cluster is indispensable.

⚡ FLEET LAW: The deseof cluster (239915fb) must be created before V11's month-1 chaos drill.

02 · No GPU in AKS — Why

GPU inference runs on local silos via Tailscale mesh — not AKS nodepools
Cost: NC6s_v3 (T4) ≈ $0.50/hr spot ≈ $360/mo if always on. The silos are already bought and paid for.
Spot risk: GPU nodes evict without warning. A model mid-inference means a corrupted request with no graceful drain.
Waste: an AKS GPU nodepool sits ~85% idle when not inferring. Local silos idle at $0 marginal cost.
Latency: Tailscale mesh ≤ 5 ms on LAN, comparable to in-cluster pod-to-pod for non-streaming requests.
Policy: GPU nodepools MUST remain scaled to 0. Kay's explicit approval is required to scale above 0 (ARB1-B1).
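The scale-to-zero policy can be enforced with a pre-flight guard. This is a hypothetical sketch: the resource-group placeholder, function name, and `KAY_APPROVAL` variable are illustrative, and the `az` command is echoed rather than executed.

```shell
# Hypothetical guard for ARB1-B1: refuse any GPU scale-up without explicit
# approval. Resource-group/cluster names are placeholders; the az command
# is printed, not run.
gpu_scale() {
  count="$1"
  if [ "$count" -gt 0 ] && [ "${KAY_APPROVAL:-}" != "granted" ]; then
    echo "DENIED (ARB1-B1): GPU nodepool stays at 0 without approval"
    return 1
  fi
  echo "az aks nodepool scale -g <rg> --cluster-name aks-eose-clo-gpu --name gpu --node-count $count"
}

gpu_scale 0   # always allowed: keeps the pool at zero
```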

🔥 forge — RTX 4090 24GB

IP: 192.168.2.12
Models: qwen2.5-coder:32b, deepseek-r1:32b
Proxy: ts-lianli01-ollama
Role: Primary code + deep reasoning

⚡ msclo — RTX 5090 32GB

IP: 192.168.2.19
Models: qwen3:14b, PEMCLAU CLO inference
Proxy: ts-msclo-ollama
Role: CLO + PEMCLAU GraphRAG primary

🟡 yone — RTX 5080 16GB

IP: 192.168.2.23
Models: nomic-embed-text, qwen3:8b
Proxy: ts-lounge-ollama
Role: Embeddings + fast inference

🟣 pcdev — RTX 3090 24GB

IP: 192.168.2.16
Models: Local experiments
Proxy: (direct)
Role: Dev + model testing
All three AKS clusters are GPU-free. At least three strong models stay reachable via Tailscale regardless of which cluster is active or under chaos test.

03 · Spot Eviction Handling — ARB-014

Graceful eviction doctrine for all user nodepools
| Mechanism | Detail | SLA |
|---|---|---|
| Graceful drain | Azure spot eviction notice → 30s grace period → pod checkpoints state → terminates cleanly | 30s |
| HVCP reroute | HVCP detects node loss, scores the remaining 2 clusters, reroutes traffic to the survivor | ≤ 30s |
| PodDisruptionBudgets | All critical workloads: minAvailable=1. No workload loses its last replica without migration. | Mandatory |
| Checkpoint save | Stateful pods write a checkpoint to Azure Blob / Qdrant before termination | Pre-drain |
| Spot discount | User nodepools on spot VMs (D2s_v5 spot vs regular) | 60–70% saving |
ARB-014: Spot eviction is not a failure event. It is a scheduled migration trigger. HVCP handles it without human intervention.
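The eviction notice in the table comes from Azure's Instance Metadata Service Scheduled Events endpoint (the endpoint, header, and `Preempt` event type are real Azure behavior); the parsing and the drain hook below are a minimal sketch over a sample payload, not the live handler.

```shell
# Detect a spot Preempt notice in an IMDS Scheduled Events payload.
# Live query (only works from inside an Azure VM):
#   curl -s -H Metadata:true \
#     'http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01'
has_preempt() {
  echo "$1" | grep -q '"EventType": *"Preempt"'
}

# Sample payload shaped like an IMDS response (not live data):
sample='{"DocumentIncarnation":1,"Events":[{"EventId":"e1","EventType":"Preempt","NotBefore":"Mon, 19 Sep 2026 18:29:47 GMT"}]}'
if has_preempt "$sample"; then
  echo "preempt detected: checkpoint state, then drain within the 30s grace window"
fi
```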

⚡ FLEET LAW (ARB1-B2): All user nodepools MUST use spot/preemptible VMs. System nodepools exempt. No waiver.
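The mandatory PDB row above corresponds to a manifest like this (name, namespace, and label selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: portal-pdb        # illustrative name
  namespace: pemos-system
spec:
  minAvailable: 1         # never lose the last replica during a drain
  selector:
    matchLabels:
      app: portal         # illustrative label
```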

04 · Chaos Engineering Stack — ARB-058

Monthly Chaos Drill — Required

Four test categories. C1, C2, and C4 run monthly and must pass before month-end. Quarterly: the full 3-cluster migration (C3).

C1
Spot Eviction Simulation
Drain a user nodepool node and verify workloads reschedule cleanly within SLA.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
Pass criteria: all pods Running within 60s on surviving nodes or migrated cluster.
C2
HVCP Failover Test
Kill primary ingress controller. Verify kantai routes via secondary cluster within 30s.
kubectl delete pod -n ingress-nginx -l app.kubernetes.io/component=controller
Pass criteria: pemos.ca responds from secondary cluster. HVCP score logged.
C3
3-Cluster Workload Migration Test
Quarterly: move pemos-system namespace workload to deseof cluster and back. No manual kubectl apply allowed.
HVCP selects target → Flux kustomization patch → verify → migrate back
Pass criteria: migration completes in ≤ 60s. All data intact. Flux reconciles clean.
C4
Inference Continuity Test
Verify: regardless of which AKS cluster is active, always 3 best models available via Tailscale.
curl ts-lianli01-ollama:11434/api/tags && curl ts-lounge-ollama:11434/api/tags && curl ts-msclo-ollama:11434/api/tags
Pass criteria: all 3 Tailscale proxies reachable from any cluster namespace.

05 · Data Layer — Cloud (All 3 Clusters)

Redundant data services across the 3-cluster mesh
| Service | Primary NS | Backup / Replication | Tech | Notes |
|---|---|---|---|---|
| Vector DB | qdrant | Replicated cross-cluster | Qdrant | pemclau-v11 · 10,435 vectors |
| Cache | redis | Replicated | Redis | Session + inference cache |
| Relational | postgres | Backup RGs | PostgreSQL | CRUD + event store |
| Graph | neo4j | | Neo4j | PEMCLAU knowledge graph |
| Secrets | external-secrets | Azure Key Vault | ESO | Sync across all clusters |
| Object Store | blob | Backup RGs | Azure Storage | Checkpoints, artifacts |
RPO target: ≤ 5 minutes for all cloud data services. RTO: ≤ 10 minutes per workload. Qdrant replication is the single most critical path — pemclau-v11 must replicate before any cluster migration.
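The replicate-before-migrate rule can be gated with a trivial pre-flight check. In practice the point counts would come from each cluster's Qdrant collection API; here they are passed as arguments, so this is a sketch of the gate, not the live check.

```shell
# Pre-flight: source and target Qdrant point counts must match (and be > 0)
# before a cluster migration of pemclau-v11 is allowed.
qdrant_in_sync() {
  [ "$1" -eq "$2" ] && [ "$1" -gt 0 ]
}

qdrant_in_sync 10435 10435 && echo "pemclau-v11 in sync: migration allowed"
```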

06 · DECLONAs in Cloud — 6 Sovereign Frameworks

All 6 DECLONAs deployed to cloud — redundant across cluster mesh

CASE-001

Intent Routing Engine — classifies and routes all inbound requests across the fleet

CGates

Sovereign gateway layer — controls cross-cluster traffic, HVCP integration point

CLO Cloak

Chief Legal Officer inference wrapper — privacy-preserving CLO context layer

ONBA

Orchestration & Node Binding Architecture — fleet node registration + health

EOSE Labs Inc.

Sovereign corporate framework — identity, compliance, billing governance

The Canon

6 symbols: γ₁⚓ H=H†⬡ LSOS〰️ WLD🌀 FEP γ FOF🌌 — constitutional ground truth

07 · Migration Pattern — How a Workload Moves

No manual kubectl. HVCP decides. Flux executes. γ₁-anchored threshold.
HVCP detects pressure (memory >85% OR cost spike OR eviction signal)
↓ score 3 clusters: [primary, clo, deseof]
↓ migration score must exceed γ₁/24 = 0.5889 to trigger
↓ select target cluster with lowest pressure + cost
→ Flux kustomization patch: update cluster context
→ Istio mesh reroutes traffic to new cluster
→ old pods graceful drain (PDB respected)
→ HVCP verifies new pods Running
→ external-dns updates A record if needed
↓ migration complete — target: <60s for stateless, <300s stateful
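The "Flux kustomization patch" step above can be sketched as a Flux v2 Kustomization retargeted at the deseof cluster; the repository layout, source name, and interval are illustrative assumptions.

```yaml
# Hypothetical Flux Kustomization after HVCP retargets pemos-system
# to the deseof cluster. Paths and names are placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: pemos-system
  namespace: flux-system
spec:
  interval: 1m
  path: ./clusters/aks-eose-deseof/pemos-system   # patched by HVCP
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet
```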
| Workload Type | Migration Time | Data Risk | Trigger |
|---|---|---|---|
| Stateless (portal, API) | ~30s | None | Auto (HVCP score) |
| Stateful (Qdrant, Redis) | ~120s | PVC reattach | Manual approval |
| Mail (Haraka) | ~60s | Queue drain first | Manual + drain gate |
| CLO workloads | ~90s | None | Auto (CLO cluster always target) |
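The trigger rule (score must exceed γ₁/24 ≈ 0.5889) can be sketched as a gate; the 70/30 memory-cost weighting here is an assumption for illustration, not the real HVCP scoring function.

```shell
# Gate: migrate only when the pressure score exceeds γ₁/24.
# The 0.7 memory / 0.3 cost weighting is a placeholder assumption.
should_migrate() {
  # $1 = memory utilisation (0..1), $2 = cost pressure (0..1)
  awk -v m="$1" -v c="$2" 'BEGIN {
    threshold = 14.134725141734693 / 24      # 0.5889...
    exit !(0.7 * m + 0.3 * c > threshold)
  }'
}

should_migrate 0.90 0.20 && echo "migrate"   # score 0.69, above threshold
should_migrate 0.50 0.10 || echo "hold"      # score 0.38, below threshold
```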

08 · Monthly Cost Profile — Current vs Target

| Cluster | Sub | Current | Target (spot) | Saving | Action |
|---|---|---|---|---|---|
| aks-eose-aaas-dev (primary) | 427873ee | ~$140/mo | ~$80/mo | $60 | Convert agents to spot |
| aks-eose-clo-gpu | 440a5792 | ~$70/mo | ~$40/mo | $30 | GPU stays @ 0, system → spot |
| aks-master1 + aks-master | 440a5792 | ~$60/mo | ~$40/mo | $20 | Downsize to B2s spot |
| aks-eose-deseof (new) | 239915fb | $0 | ~$25/mo | -$25 | Create (B2s spot) |
| aks-kantai-eose-dev (B4ms) | 458e8558 | ~$110/mo | $0 | $110 | CONFIRM empty → STOP |
| aks-kantai-eose-ce (meek-mail) | 458e8558 | ~$50/mo | ~$50/mo | $0 | KEEP (Haraka live) |
| TOTAL | | ~$430/mo | ~$235/mo | $195 saved | |
Daily target: ~$7.83 CAD/day (was ~$14.33)
April projection: CA$7,332 at day 19 → tracking ~$11,600 vs $10,552 in March
Spot savings: 60–70% on user nodepools (Azure spot discount)
GPU @ $0: all inference via Tailscale to local silos, zero cloud GPU cost
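The totals in the table are internally consistent; a quick arithmetic sanity check over the per-cluster figures above:

```shell
# Sum the per-cluster current/target figures and derive the daily target.
summary=$(awk 'BEGIN {
  split("140 70 60 0 110 50", cur); split("80 40 40 25 0 50", tgt)
  for (i = 1; i <= 6; i++) { c += cur[i]; t += tgt[i] }
  printf "current=%d target=%d saved=%d daily=%.2f", c, t, c - t, t / 30
}')
echo "$summary"
```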