SZABO V10

Fleet Chaos Engineering + 3-Cluster Spot Mesh
No GPUs in AKS. Spot everywhere. HVCP migrates between all 3. Local silos do inference.
γ₁ = 14.134725141734693
⚡ FLEET LAW — EOSE Labs Inc. · 2026-04-23

01 · The 3-Cluster Mesh

Three AKS Homebase Clusters — HVCP-Linked, Cross-Migratable
| Role | Cluster | Sub / Owner | Region | Nodepools | VM SKU | Nodes | Spot |
|---|---|---|---|---|---|---|---|
| Primary / pemos.ca | aks-eose-aaas-dev | 427873ee | canadacentral | system + agents | D2s_v5 | 4 | SPOT |
| CLO + GPU rig | aks-eose-clo-gpu | 440a5792 / Amani | canadacentral | system (D2s_v3) + gpu (T4) | D2s_v3 / NC6s_v3 | 2 + 0 | GPU → 0 |
| Master 1 | aks-master1 | 440a5792 / Amani | canadacentral | system | B2s_v2 | 1 | REGULAR |
| Master | aks-master | 440a5792 / Amani | canadacentral | system | B2s_v2 | 1 | REGULAR |
| Deseof sovereign | aks-eose-deseof | 239915fb / eose | canadacentral | system + agents | B2s | 2 | TO CREATE |
HVCP is the mesh router. Flux is the actuator. No workload is bound to one cluster — any can receive any workload within 60s. γ₁ migration threshold: γ₁/24 = 0.5889.
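The quoted threshold can be reproduced from the γ₁ constant in the header with a one-liner:

```shell
# γ₁/24 migration threshold: prints the value the doc quotes (0.5889).
threshold=$(awk 'BEGIN { printf "%.4f", 14.134725141734693 / 24 }')
echo "$threshold"
```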

⚡ FLEET LAW: These 3 clusters form a single logical compute plane. No cluster is indispensable.

⚡ FLEET LAW: The deseof cluster (239915fb) must be created before V11's month-1 chaos drill.

02 · No GPU in AKS — Why

GPU inference runs on local silos via Tailscale mesh — not AKS nodepools
Cost: NC6s_v3 (T4) ≈ $0.50/hr spot ≈ $360/mo if always on. The silos are already bought and paid for.
Spot risk: GPU nodes evict without warning. A model mid-inference means a corrupted request with no graceful drain.
Waste: an AKS GPU nodepool sits ~85% idle when not inferring. Local silos idle at $0 marginal cost.
Latency: Tailscale mesh ≤ 5 ms on LAN, comparable to in-cluster pod-to-pod for non-streaming requests.
Policy: GPU nodepools MUST remain scaled to 0. Kay's explicit approval is required to scale above 0 (ARB1-B1).
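The scale-to-zero policy can be enforced with a pre-flight guard. This is a hypothetical sketch: the resource-group placeholder, function name, and `KAY_APPROVAL` variable are illustrative, and the `az` command is echoed rather than executed.

```shell
# Hypothetical guard for ARB1-B1: refuse any GPU scale-up without explicit
# approval. Resource-group/cluster names are placeholders; the az command
# is printed, not run.
gpu_scale() {
  count="$1"
  if [ "$count" -gt 0 ] && [ "${KAY_APPROVAL:-}" != "granted" ]; then
    echo "DENIED (ARB1-B1): GPU nodepool stays at 0 without approval"
    return 1
  fi
  echo "az aks nodepool scale -g <rg> --cluster-name aks-eose-clo-gpu --name gpu --node-count $count"
}

gpu_scale 0   # always allowed: keeps the pool at zero
```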

🔥 forge — RTX 4090 24GB

IP: 192.168.2.12
Models: qwen2.5-coder:32b, deepseek-r1:32b
Proxy: ts-lianli01-ollama
Role: Primary code + deep reasoning

⚡ msclo — RTX 5090 32GB

IP: 192.168.2.19
Models: qwen3:14b, PEMCLAU CLO inference
Proxy: ts-msclo-ollama
Role: CLO + PEMCLAU GraphRAG primary

🟡 yone — RTX 5080 16GB

IP: 192.168.2.23
Models: nomic-embed-text, qwen3:8b
Proxy: ts-lounge-ollama
Role: Embeddings + fast inference

🟣 pcdev — RTX 3090 24GB

IP: 192.168.2.16
Models: Local experiments
Proxy: (direct)
Role: Dev + model testing
All three AKS clusters are GPU-free. At least three strong models stay reachable via Tailscale regardless of which cluster is active or under chaos test.

03 · Spot Eviction Handling — ARB-014

Graceful eviction doctrine for all user nodepools
| Mechanism | Detail | SLA |
|---|---|---|
| Graceful drain | Azure spot eviction notice → 30s grace period → pod checkpoints state → terminates cleanly | 30s |
| HVCP reroute | HVCP detects node loss, scores the remaining 2 clusters, reroutes traffic to the survivor | ≤ 30s |
| PodDisruptionBudgets | All critical workloads: minAvailable=1. No workload loses its last replica without migration. | Mandatory |
| Checkpoint save | Stateful pods write a checkpoint to Azure Blob / Qdrant before termination | Pre-drain |
| Spot discount | User nodepools on spot VMs (D2s_v5 spot vs regular) | 60–70% saving |
ARB-014: Spot eviction is not a failure event. It is a scheduled migration trigger. HVCP handles it without human intervention.
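The eviction notice in the table comes from Azure's Instance Metadata Service Scheduled Events endpoint (the endpoint, header, and `Preempt` event type are real Azure behavior); the parsing and the drain hook below are a minimal sketch over a sample payload, not the live handler.

```shell
# Detect a spot Preempt notice in an IMDS Scheduled Events payload.
# Live query (only works from inside an Azure VM):
#   curl -s -H Metadata:true \
#     'http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01'
has_preempt() {
  echo "$1" | grep -q '"EventType": *"Preempt"'
}

# Sample payload shaped like an IMDS response (not live data):
sample='{"DocumentIncarnation":1,"Events":[{"EventId":"e1","EventType":"Preempt","NotBefore":"Mon, 19 Sep 2026 18:29:47 GMT"}]}'
if has_preempt "$sample"; then
  echo "preempt detected: checkpoint state, then drain within the 30s grace window"
fi
```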

⚡ FLEET LAW (ARB1-B2): All user nodepools MUST use spot/preemptible VMs. System nodepools exempt. No waiver.
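The mandatory PDB row above corresponds to a manifest like this (name, namespace, and label selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: portal-pdb        # illustrative name
  namespace: pemos-system
spec:
  minAvailable: 1         # never lose the last replica during a drain
  selector:
    matchLabels:
      app: portal         # illustrative label
```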

04 · Chaos Engineering Stack — ARB-058

Monthly Chaos Drill — Required

Four test categories. C1, C2, and C4 run monthly and must pass before month-end. Quarterly: the full 3-cluster migration (C3).

C1
Spot Eviction Simulation
Drain a user nodepool node and verify workloads reschedule cleanly within SLA.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
Pass criteria: all pods Running within 60s on surviving nodes or migrated cluster.
C2
HVCP Failover Test
Kill primary ingress controller. Verify kantai routes via secondary cluster within 30s.
kubectl delete pod -n ingress-nginx -l app.kubernetes.io/component=controller
Pass criteria: pemos.ca responds from secondary cluster. HVCP score logged.
C3
3-Cluster Workload Migration Test
Quarterly: move pemos-system namespace workload to deseof cluster and back. No manual kubectl apply allowed.
HVCP selects target → Flux kustomization patch → verify → migrate back
Pass criteria: migration completes in ≤ 60s. All data intact. Flux reconciles clean.
C4
Inference Continuity Test
Verify: regardless of which AKS cluster is active, always 3 best models available via Tailscale.
curl ts-lianli01-ollama:11434/api/tags && curl ts-lounge-ollama:11434/api/tags && curl ts-msclo-ollama:11434/api/tags
Pass criteria: all 3 Tailscale proxies reachable from any cluster namespace.

05 · Data Layer — Cloud (All 3 Clusters)

Redundant data services across the 3-cluster mesh
| Service | Primary NS | Backup / Replication | Tech | Notes |
|---|---|---|---|---|
| Vector DB | qdrant | Replicated cross-cluster | Qdrant | pemclau-v11 · 10,435 vectors |
| Cache | redis | Replicated | Redis | Session + inference cache |
| Relational | postgres | Backup RGs | PostgreSQL | CRUD + event store |
| Graph | neo4j | | Neo4j | PEMCLAU knowledge graph |
| Secrets | external-secrets | Azure Key Vault | ESO | Sync across all clusters |
| Object Store | blob | Backup RGs | Azure Storage | Checkpoints, artifacts |
RPO target: ≤ 5 minutes for all cloud data services. RTO: ≤ 10 minutes per workload. Qdrant replication is the single most critical path — pemclau-v11 must replicate before any cluster migration.
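The replicate-before-migrate rule can be gated with a trivial pre-flight check. In practice the point counts would come from each cluster's Qdrant collection API; here they are passed as arguments, so this is a sketch of the gate, not the live check.

```shell
# Pre-flight: source and target Qdrant point counts must match (and be > 0)
# before a cluster migration of pemclau-v11 is allowed.
qdrant_in_sync() {
  [ "$1" -eq "$2" ] && [ "$1" -gt 0 ]
}

qdrant_in_sync 10435 10435 && echo "pemclau-v11 in sync: migration allowed"
```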

06 · DECLONAs in Cloud — 6 Sovereign Frameworks

All 6 DECLONAs deployed to cloud — redundant across cluster mesh

CASE-001

Intent Routing Engine — classifies and routes all inbound requests across the fleet

CGates

Sovereign gateway layer — controls cross-cluster traffic, HVCP integration point

CLO Cloak

Chief Legal Officer inference wrapper — privacy-preserving CLO context layer

ONBA

Orchestration & Node Binding Architecture — fleet node registration + health

EOSE Labs Inc.

Sovereign corporate framework — identity, compliance, billing governance

The Canon

6 symbols: γ₁⚓ H=H†⬡ LSOS〰️ WLD🌀 FEP γ FOF🌌 — constitutional ground truth

07 · Migration Pattern — How a Workload Moves

No manual kubectl. HVCP decides. Flux executes. γ₁-anchored threshold.
HVCP detects pressure (memory >85% OR cost spike OR eviction signal)
↓ score 3 clusters: [primary, clo, deseof]
↓ migration score must exceed γ₁/24 = 0.5889 to trigger
↓ select target cluster with lowest pressure + cost
→ Flux kustomization patch: update cluster context
→ Istio mesh reroutes traffic to new cluster
→ old pods graceful drain (PDB respected)
→ HVCP verifies new pods Running
→ external-dns updates A record if needed
↓ migration complete — target: <60s for stateless, <300s stateful
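The "Flux kustomization patch" step above can be sketched as a Flux v2 Kustomization retargeted at the deseof cluster; the repository layout, source name, and interval are illustrative assumptions.

```yaml
# Hypothetical Flux Kustomization after HVCP retargets pemos-system
# to the deseof cluster. Paths and names are placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: pemos-system
  namespace: flux-system
spec:
  interval: 1m
  path: ./clusters/aks-eose-deseof/pemos-system   # patched by HVCP
  prune: true
  sourceRef:
    kind: GitRepository
    name: fleet
```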
| Workload Type | Migration Time | Data Risk | Trigger |
|---|---|---|---|
| Stateless (portal, API) | ~30s | None | Auto (HVCP score) |
| Stateful (Qdrant, Redis) | ~120s | PVC reattach | Manual approval |
| Mail (Haraka) | ~60s | Queue drain first | Manual + drain gate |
| CLO workloads | ~90s | None | Auto (CLO cluster always target) |
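The trigger rule (score must exceed γ₁/24 ≈ 0.5889) can be sketched as a gate; the 70/30 memory-cost weighting here is an assumption for illustration, not the real HVCP scoring function.

```shell
# Gate: migrate only when the pressure score exceeds γ₁/24.
# The 0.7 memory / 0.3 cost weighting is a placeholder assumption.
should_migrate() {
  # $1 = memory utilisation (0..1), $2 = cost pressure (0..1)
  awk -v m="$1" -v c="$2" 'BEGIN {
    threshold = 14.134725141734693 / 24      # 0.5889...
    exit !(0.7 * m + 0.3 * c > threshold)
  }'
}

should_migrate 0.90 0.20 && echo "migrate"   # score 0.69, above threshold
should_migrate 0.50 0.10 || echo "hold"      # score 0.38, below threshold
```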

08 · Monthly Cost Profile — Current vs Target

| Cluster | Sub | Current | Target (spot) | Saving | Action |
|---|---|---|---|---|---|
| aks-eose-aaas-dev (primary) | 427873ee | ~$140/mo | ~$80/mo | $60 | Convert agents to spot |
| aks-eose-clo-gpu | 440a5792 | ~$70/mo | ~$40/mo | $30 | GPU stays @ 0, system → spot |
| aks-master1 + aks-master | 440a5792 | ~$60/mo | ~$40/mo | $20 | Downsize to B2s spot |
| aks-eose-deseof (new) | 239915fb | $0 | ~$25/mo | -$25 | Create (B2s spot) |
| aks-kantai-eose-dev (B4ms) | 458e8558 | ~$110/mo | $0 | $110 | CONFIRM empty → STOP |
| aks-kantai-eose-ce (meek-mail) | 458e8558 | ~$50/mo | ~$50/mo | $0 | KEEP (Haraka live) |
| TOTAL | | ~$430/mo | ~$235/mo | $195 saved | |
Daily target: ~$7.83 CAD/day (was ~$14.33)
April projection: CA$7,332 at day 19 → tracking ~$11,600 vs $10,552 in March
Spot savings: 60–70% on user nodepools (Azure spot discount)
GPU @ $0: all inference via Tailscale to local silos, zero cloud GPU cost
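The totals in the table are internally consistent; a quick arithmetic sanity check over the per-cluster figures above:

```shell
# Sum the per-cluster current/target figures and derive the daily target.
summary=$(awk 'BEGIN {
  split("140 70 60 0 110 50", cur); split("80 40 40 25 0 50", tgt)
  for (i = 1; i <= 6; i++) { c += cur[i]; t += tgt[i] }
  printf "current=%d target=%d saved=%d daily=%.2f", c, t, c - t, t / 30
}')
echo "$summary"
```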