EOSE Labs Consulting Assessment · TRB-CONSULT-NETBOX-001
NETBOX LABS · ENGINEERING FLOOR
Sovereign Engineering Standards · Day 86 · 2026-04-30
5 FLOORS · SECURITY DOMAIN · AI-NATIVE · HYBRID SAAS/ON-PREM
The Security Floors
Each floor has a name, a current state, a target state, and concrete actions. You don't move to the next floor until the previous one holds.
FLOOR 1
AUTH_FLOOR — Token & Identity Hardening
CURRENT STATE
v1 + v2 tokens mixed in production
No IP restriction on token usage
No max expiry enforced
REMOTE_AUTH_AUTO_CREATE_USER = True
Local accounts vulnerable to brute-force
TARGET STATE
v2 only — v1 disabled at config level
90-day max expiry enforced via TOKEN_MAX_EXPIRE_TIME
IP restricted to fleet CIDR (ALLOWED_IPS)
Zitadel/Keycloak OIDC via REMOTE_AUTH + social-auth-core
REMOTE_AUTH_AUTO_CREATE_USER = False — explicit provisioning
ACTIONS
Migrate all tokens to v2; revoke all v1 tokens
Configure API_TOKEN_PEPPERS in settings
Deploy oauth2-proxy or Istio AuthorizationPolicy at ingress
Wire Zitadel OIDC (already in fleet — 30-min task)
Audit REMOTE_AUTH settings in every environment (config sketch below)
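Pulled together, the Floor 1 target state is a handful of configuration.py lines. A minimal sketch, not a drop-in: the Zitadel endpoint and client credentials are placeholders, and TOKEN_MAX_EXPIRE_TIME is named per the target state above — confirm the parameter name and units against the deployed NetBox release.

    # configuration.py — Floor 1 target state (sketch)
    REMOTE_AUTH_ENABLED = True
    REMOTE_AUTH_BACKEND = 'social_core.backends.open_id_connect.OpenIdConnectAuth'
    REMOTE_AUTH_AUTO_CREATE_USER = False  # explicit provisioning only

    # OIDC wiring for Zitadel (placeholder values; inject the secret via env/Key Vault)
    SOCIAL_AUTH_OIDC_ENDPOINT = 'https://zitadel.fleet.internal'
    SOCIAL_AUTH_OIDC_KEY = '<client-id>'
    SOCIAL_AUTH_OIDC_SECRET = '<client-secret>'

    # 90-day max token lifetime, per the target state above — verify this
    # parameter exists in the release you run
    TOKEN_MAX_EXPIRE_TIME = 90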
FLOOR 2
NETWORK_FLOOR — No Public Exposure
CURRENT STATE
ALLOWED_HOSTS often blank in dev/default configs
Public internet exposure possible via LoadBalancer svc
No ingress auth by default
TARGET STATE
ALLOWED_HOSTS explicit in every environment
Never exposed to public internet — Tailscale mesh or Istio sidecar
ClusterIP only — no LoadBalancer service type for NetBox
JWT gate at ingress (oauth2-proxy or Istio AuthorizationPolicy)
ACTIONS
Set ALLOWED_HOSTS explicitly in all environment configs (fail-closed sketch below)
Convert Service type from LoadBalancer → ClusterIP
Deploy oauth2-proxy sidecar or Istio policy at ingress
Tailscale operator for on-prem node-to-node mesh
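One way to make "explicit in every environment" enforceable rather than aspirational — a fail-closed sketch for configuration.py. The NETBOX_ALLOWED_HOSTS variable name is an assumption, not a stock convention.

    # configuration.py — refuse to boot without an explicit host allowlist
    import os

    _hosts = os.environ.get('NETBOX_ALLOWED_HOSTS', '')  # e.g. 'netbox.fleet.internal'
    ALLOWED_HOSTS = [h for h in _hosts.split(',') if h]

    if not ALLOWED_HOSTS:
        # Fail closed: a blank allowlist is a misconfiguration, not "allow all"
        raise RuntimeError('NETBOX_ALLOWED_HOSTS must be set in every environment')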
FLOOR 3
DATA_FLOOR — Secrets, Storage, Session Security
CURRENT STATE
SECRET_KEY in configuration file / env var
Media files on local disk (ephemeral in K8s)
DEBUG = True possible in non-prod environments
CSRF / session cookies not always secure-flagged
TARGET STATE
SECRET_KEY injected from Azure Key Vault via external-secrets
Media in Azure Blob / S3 via django-storages
DEBUG = False hardcoded in production image build
CSRF_COOKIE_SECURE + SESSION_COOKIE_SECURE + SECURE_SSL_REDIRECT all True
ACTIONS
Wire external-secrets-operator → Key Vault for SECRET_KEY
Configure django-storages with Azure Blob backend
Dockerfile ENV DEBUG=False — immutable at build time
Add SECURE_HSTS_SECONDS + SECURE_CONTENT_TYPE_NOSNIFF (combined sketch below)
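The Floor 3 target state as a single configuration.py sketch. These are standard Django settings; confirm which ones your NetBox release surfaces in configuration.py versus a settings override, and treat the env var names as illustrative.

    # configuration.py — Floor 3 target state (sketch)
    import os

    # Injected by external-secrets-operator from Azure Key Vault — never baked in
    SECRET_KEY = os.environ['SECRET_KEY']

    DEBUG = False  # also hardcoded in the production image (ENV DEBUG=False)

    # Session / CSRF / TLS hardening
    CSRF_COOKIE_SECURE = True
    SESSION_COOKIE_SECURE = True
    SECURE_SSL_REDIRECT = True
    SECURE_HSTS_SECONDS = 31536000  # one year
    SECURE_CONTENT_TYPE_NOSNIFF = True

    # Media to Azure Blob via django-storages
    STORAGE_BACKEND = 'storages.backends.azure_storage.AzureStorage'
    STORAGE_CONFIG = {
        'AZURE_ACCOUNT_NAME': os.environ['AZURE_STORAGE_ACCOUNT'],
        'AZURE_CONTAINER': 'netbox-media',
    }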
FLOOR 4
AUDIT_FLOOR — Retention, Alerting, Observability
CURRENT STATE
ObjectChange accumulates indefinitely (DB bloat)
No alerting on authentication anomalies
Prometheus metrics present but not wired to alerting
TARGET STATE
365-day ObjectChange retention with automated purge job
Alert: >100 failed auth attempts in 5-minute window
Prometheus → Grafana → PagerDuty/Alertmanager pipeline live
ACTIONS
Schedule RQ job for ObjectChange pruning (built-in housekeeping)
Wire django-prometheus to Grafana dashboard
Create auth failure alert rule in Alertmanager
Enable LOGIN_REQUIRED — force auth on all pages (sketch below)
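Both settings are stock NetBox parameters, so the retention and auth pieces are two lines of configuration.py. The schedule comment is illustrative; newer NetBox releases run housekeeping as built-in background jobs, so the cron line applies to older ones.

    # configuration.py — Floor 4 target state (sketch)
    CHANGELOG_RETENTION = 365  # days of ObjectChange history kept by housekeeping
    LOGIN_REQUIRED = True      # no anonymous access to any page

    # Older releases: cron the housekeeping management command daily, e.g.
    #   0 3 * * *  /opt/netbox/venv/bin/python /opt/netbox/netbox/manage.py housekeeping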
FLOOR 5
SUPPLY_CHAIN_FLOOR — Build Security, SBOM, Scanning
CURRENT STATE
requirements.txt pinned (good)
No SBOM generated at build time
No container scan in CI pipeline
Dev dependencies potentially in production image
TARGET STATE
Trivy scan as required CI gate — no HIGH/CRIT = no merge
syft SBOM generated at build time, stored with release artifact
Renovate bot for automated dependency update PRs
Multi-stage Dockerfile: builder stage vs runtime stage — no dev deps in prod
ACTIONS
Add Trivy step to GitHub Actions — fail on HIGH+CRIT (gate sketch below)
Add syft step to Dockerfile build stage
Configure Renovate on netbox-community/netbox
Refactor Dockerfile: runtime stage installs production requirements only — pip has no --no-dev flag, so keep dev/test deps in a separate requirements file
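A sketch of the CI gate as a standalone check, for pipelines that keep the Trivy JSON report as part of the release artifact anyway. Field names follow Trivy's --format json output; verify against the pinned Trivy version. (Trivy's own --severity HIGH,CRITICAL --exit-code 1 flags enforce the same gate directly.)

    # ci/check_trivy.py — fail the pipeline on HIGH/CRITICAL findings (sketch)
    import json
    import sys

    BLOCKING = {'HIGH', 'CRITICAL'}

    def blocking_findings(report_path: str) -> list[str]:
        """Collect blocking vulnerabilities from a Trivy JSON report."""
        with open(report_path) as f:
            report = json.load(f)
        findings = []
        for result in report.get('Results', []):
            for vuln in result.get('Vulnerabilities') or []:
                if vuln.get('Severity') in BLOCKING:
                    findings.append(f"{vuln.get('VulnerabilityID')} ({vuln['Severity']})")
        return findings

    if __name__ == '__main__':
        found = blocking_findings(sys.argv[1])  # path to trivy-report.json
        if found:
            print('Blocking vulnerabilities:', *found, sep='\n  ')
            sys.exit(1)  # no HIGH/CRIT = no merge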
The Execution System
Not process for process's sake. Load-bearing structure for a multi-product, hybrid-delivery engineering org. Each block is a real constraint — remove any one and the system wobbles.
CYCLE CADENCE
2-week cycles. Monday planning (1hr max). Wednesday async status (no meeting). Friday demo (30min, working software only).
No status theater. If it's not shippable, it doesn't demo. A slide is not a demo.
Cycle artifacts: scope doc (1 page max), decision log, retro notes. Nothing longer.
The Monday planning doc is the contract for the cycle. Changes mid-cycle require explicit decision log entry.
OWNERSHIP MATRIX
Every component has exactly one owner (team, not person). No shared mutable responsibility.
Contract between teams: defined API + SLA. If you break the contract, you fix it — same day.
Foundations team owns: CI/CD, shared auth, base images, release tooling, test infrastructure.
Product teams own their domain end-to-end — including on-prem delivery of their component. No exceptions.
ESCALATION LADDER
L1: Team resolves within cycle — no escalation needed
L2: Director escalates to VP → unblocked within 48h
L3: VP escalates to CTO → strategy decisions only
L4: CTO → Board → architecture bets, major pivots
The goal: L3 escalations rare by Month 6. L4 escalations: quarterly at most. If VP is getting L1 questions, something is wrong with ownership clarity.
AI ACCOUNTABILITY MODEL
P0 (security, auth, billing): human-written only. AI does not author this code, and any AI-assisted review still requires a full human check.
P1 (core product logic): AI writes first draft, human reviews fully, automated tests required.
P2 (tooling, docs, tests): AI writes, human spot-checks, automated tests gate merge.
P3 (boilerplate, scaffolding): AI autonomous with linting gate only.
On-prem constraint: AI tooling must function with customer-controlled models. SaaS-only AI tooling is not acceptable for NetBox's customer base.
AI-Native Is Not a Feature. It's How the Work Gets Done.
These are not guidelines. They are load-bearing principles. Each one exists because the failure mode without it is documented.
1
SPEC-FIRST
Every feature starts with a spec. AI writes the first draft from the issue. Human reviews the spec — not the code — before a line of implementation is written.
Why: AI implementation quality is directly proportional to spec clarity. A vague spec produces vague code. Improve the spec, improve the output. The spec review is 30 minutes. The implementation refactor is 3 days.
2
COMPREHENSION GATE
AI can write it. A human must understand it before it merges. Not a review checkbox — a real comprehension check: "explain this change to me in plain language." If you can't, it doesn't merge.
Why: this is the constraint that keeps AI-native from becoming AI-chaotic. Non-negotiable at P0 and P1. The gate scales down (P2/P3 = spot-check only) but never disappears.
3
VERIFICATION-LED ITERATION
AI writes tests first (from spec). Human approves coverage. Then AI implements. Tests gate merge. This reverses the standard AI loop (implement→test→fix) into a verification-first loop.
Why: faster iteration because the verification system is defined before implementation starts. The bottleneck is coverage clarity, not implementation speed.
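A minimal sketch of the loop with hypothetical names: the test is drafted from the spec and approved before normalize_vlan_name exists anywhere.

    # Step 1 — AI drafts tests from the spec; human approves coverage first.
    # normalize_vlan_name and its rules are hypothetical spec examples.
    import pytest
    from vlan_utils import normalize_vlan_name  # not yet written — tests come first

    @pytest.mark.parametrize('raw, expected', [
        ('  Prod VLAN ', 'prod-vlan'),  # spec: trim, lowercase, spaces → hyphens
        ('UPLINK_01', 'uplink_01'),     # spec: underscores preserved
    ])
    def test_normalize_vlan_name(raw, expected):
        assert normalize_vlan_name(raw) == expected

    # Step 2 — only after coverage is approved does AI implement vlan_utils;
    # this suite then gates the merge.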
4
RISK-TIERED DELEGATION
Not all code is equal. Don't apply the same human review overhead to boilerplate as to auth flows. The P0–P3 model makes delegation explicit and measurable — not a judgment call, a rule.
Why: without explicit tiers, teams either apply maximum human oversight to everything (slow) or minimum oversight to everything (dangerous). The tier table eliminates the judgment call at review time.
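One way to make the tiers a rule rather than a judgment call — a hypothetical, machine-readable tier table a CI check can diff changed paths against. The paths are illustrative, not NetBox's actual layout.

    # ci/risk_tiers.py — hypothetical tier table (illustrative paths)
    RISK_TIERS = {
        'netbox/users/': 'P0',  # auth — human only
        'netbox/core/':  'P1',  # core logic — full review + tests
        'scripts/':      'P2',  # tooling — spot-check + tests
        'docs/':         'P3',  # boilerplate — lint gate only
    }

    def tier_for(path: str) -> str:
        """Strictest matching tier wins; unmatched paths default to P0."""
        matches = [t for prefix, t in RISK_TIERS.items() if path.startswith(prefix)]
        return min(matches, default='P0')  # 'P0' < 'P1' < ... sorts strictest first

A CI step resolves each changed file to a tier and blocks merge until the matching review requirement (full review, spot-check, or lint gate) is satisfied.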
5
ON-PREM FIRST
NetBox's on-prem customers cannot use SaaS AI tooling. AI-native practices must be designed for offline/self-hosted from day 1. Customer-controlled models (llama, qwen, mistral) must be first-class citizens in the AI toolchain — not an afterthought.
Why: PEMCLAU pattern proves this is solved. Local inference + structured graph retrieval = no external API dependency. The architecture exists. It just needs to be the default, not the exception.
Hybrid Delivery Is Not Two Systems. It's One System With Two Surfaces.
The failure mode is building SaaS and on-prem as separate pipelines. They share a codebase, a test suite, a release process. The surfaces differ; the system doesn't.
SAAS DELIVERY
GitOps via Flux/ArgoCD. Canary deploys. Full observability from day 1.
Weekly release cadence from HEAD. Feature flags gate on-prem-unsafe features.
SaaS gets everything first — it's the canary for the on-prem release.
Incident response: full observability, direct rollback capability, <15min MTTR target.
ON-PREM DELIVERY
netbox-operator pattern: K8s CRDs for NetBox state management. Customer pulls; never push.
Version pins with LTS track: monthly releases, 6-month support window per minor version.
Offline-capable: no external API dependency in core runtime. Works air-gapped.
Upgrade gates: customer must be on N-1 before upgrading to N. No skip upgrades. Ever.
THE OBSERVABILITY GAP
On-prem = less observability. This is the hardest problem in hybrid delivery. Accept it. Engineer around it.
Fix: structured telemetry export (opt-in, customer-controlled destination — S3, Blob, syslog).
Synthetic health checks run inside customer environment, report to their monitoring stack.
/api/status endpoint: version, uptime, last migration applied, health check results. Parseable by support team without direct access (payload sketch below).
Structured logs in known schema — support can read a log bundle without a shell session.
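A sketch of the payload shape. NetBox's stock /api/status/ already reports versions and RQ workers; the uptime, migration, and health fields are the proposed extensions, and every field name and value here is illustrative.

    # Proposed /api/status payload (illustrative field names and values)
    STATUS_EXAMPLE = {
        'netbox-version': '4.2.1',
        'uptime-seconds': 86400,
        'last-migration': 'dcim.0042_example',
        'health': {
            'database': 'ok',
            'redis': 'ok',
            'rq-workers-running': 2,
        },
    }
    # A stable, documented schema means support can parse a log bundle or a
    # synthetic check's output without shell access to the customer environment.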
RELEASE COORDINATION
Single release branch. SaaS deploys from HEAD. On-prem cuts from release branch tag.
Shared test suite runs against both delivery models in CI. One test suite — two surfaces.
On-prem customers get: release notes + migration guide + rollback procedure. Every release. No exceptions. Not optional.
Breaking change policy: 12-week deprecation window — announced in release notes three monthly on-prem releases before removal.
5 Floors. Each One Load-Bearing.
You don't build Floor 2 until Floor 1 holds. This is not a Gantt chart. It's a proof strategy. Each floor has a gate condition — the gate must pass before the next floor starts.
1
EMBED — Understand Before Changing
WEEKS 1–4
GOAL
Understand the system before changing it. No structural moves in Month 1.
Map what's working (protect it), what's fragile (add structure), and what's wrong (fix fast).
DELIVERABLES
Org map — who owns what, documented
Pain map — what's blocking each team
One small execution improvement shipped
GATE → FLOOR 2
I can describe exactly what's working, what's fragile, what's wrong — without asking anyone.
The cross-team coordination blocker is named.
2
CONTRACTS — Teams Know What They Own
MONTHS 2–3
GOAL
Teams know what they own and what they owe each other. Clear interfaces. No ambiguity.
DELIVERABLES
Ownership matrix published and agreed
Cycle cadence locked — first full cycle complete
Escalation ladder defined and used
Cross-team API contracts documented
GATE → FLOOR 3
No initiative is blocked by unclear ownership for more than 24 hours.
3
AI-NATIVE PILOT — Prove It With Data
MONTHS 4–6
GOAL
One team ships measurably faster with AI-native practices. Data is the recruiter for the other teams.
DELIVERABLES
Spec-driven dev on one product from day 1
P0–P3 risk tiers implemented and measured
Throughput delta published (cycle time, defect rate)
GATE → FLOOR 4
Pilot team shows measurable improvement vs pre-pilot baseline. Numbers, not opinions.
4
PLATFORM INVESTMENT — Reduce Friction
MONTHS 7–9
GOAL
Foundations team materially reduces delivery friction across all product teams. Measurably.
DELIVERABLES
Shared CI/CD improvements shipped and measured
Base image standardization complete
Cross-team duplication eliminated (identify top 3, remove 2)
GATE → FLOOR 5
Directors escalate to VP less than once per week on average.
5
SYSTEM SCALES — Works Without Me In the Room
MONTHS 10–12
GOAL
The system works without any individual. Directors self-sufficient. CTO out of execution.
DELIVERABLES
Directors independently run their teams
AI-native practices across all teams — measured, not assumed
CTO fully out of day-to-day execution management
SUCCESS CRITERIA
All 12-month JD success criteria green. The floor holds without the architect in the room.