EOSE Labs Consulting Assessment · TRB-CONSULT-NETBOX-001
NETBOX LABS · ENGINEERING FLOOR
Sovereign Engineering Standards · Day 86 · 2026-04-30
5 FLOORS · SECURITY DOMAIN · AI-NATIVE · HYBRID SAAS/ON-PREM
The Security Floors
Each floor has a name, a current state, a target state, and concrete actions. You don't move to the next floor until the previous one holds.
FLOOR 1
AUTH_FLOOR — Token & Identity Hardening
CURRENT STATE
v1 + v2 tokens mixed in production
No IP restriction on token usage
No max expiry enforced
REMOTE_AUTH_AUTO_CREATE_USER = True
Local accounts vulnerable to brute-force
TARGET STATE
v2 only — v1 disabled at config level
90-day max expiry enforced via TOKEN_MAX_EXPIRE_TIME
IP restricted to fleet CIDR (ALLOWED_IPS)
Zitadel/Keycloak OIDC via REMOTE_AUTH + social-auth-core
REMOTE_AUTH_AUTO_CREATE_USER = False — explicit provisioning
ACTIONS
Migrate all tokens to v2; revoke all v1 tokens
Configure API_TOKEN_PEPPERS in settings
Deploy oauth2-proxy or Istio AuthorizationPolicy at ingress
Wire Zitadel OIDC (already in fleet — 30-min task)
Audit REMOTE_AUTH settings in every environment (config sketch below)
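Pulled together, the Floor 1 target state is a handful of configuration.py lines. A minimal sketch, not a drop-in: the Zitadel endpoint and client credentials are placeholders, and TOKEN_MAX_EXPIRE_TIME is named per the target state above — confirm the parameter name and units against the deployed NetBox release.

    # configuration.py — Floor 1 target state (sketch)
    REMOTE_AUTH_ENABLED = True
    REMOTE_AUTH_BACKEND = 'social_core.backends.open_id_connect.OpenIdConnectAuth'
    REMOTE_AUTH_AUTO_CREATE_USER = False  # explicit provisioning only

    # OIDC wiring for Zitadel (placeholder values; inject the secret via env/Key Vault)
    SOCIAL_AUTH_OIDC_ENDPOINT = 'https://zitadel.fleet.internal'
    SOCIAL_AUTH_OIDC_KEY = '<client-id>'
    SOCIAL_AUTH_OIDC_SECRET = '<client-secret>'

    # 90-day max token lifetime, per the target state above — verify this
    # parameter exists in the release you run
    TOKEN_MAX_EXPIRE_TIME = 90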
FLOOR 2
NETWORK_FLOOR — No Public Exposure
CURRENT STATE
ALLOWED_HOSTS often blank in dev/default configs
Public internet exposure possible via LoadBalancer svc
No ingress auth by default
TARGET STATE
ALLOWED_HOSTS explicit in every environment
Never exposed to public internet — Tailscale mesh or Istio sidecar
ClusterIP only — no LoadBalancer service type for NetBox
JWT gate at ingress (oauth2-proxy or Istio AuthorizationPolicy)
ACTIONS
Set ALLOWED_HOSTS explicitly in all environment configs (fail-closed sketch below)
Convert Service type from LoadBalancer → ClusterIP
Deploy oauth2-proxy sidecar or Istio policy at ingress
Tailscale operator for on-prem node-to-node mesh
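One way to make "explicit in every environment" enforceable rather than aspirational — a fail-closed sketch for configuration.py. The NETBOX_ALLOWED_HOSTS variable name is an assumption, not a stock convention.

    # configuration.py — refuse to boot without an explicit host allowlist
    import os

    _hosts = os.environ.get('NETBOX_ALLOWED_HOSTS', '')  # e.g. 'netbox.fleet.internal'
    ALLOWED_HOSTS = [h for h in _hosts.split(',') if h]

    if not ALLOWED_HOSTS:
        # Fail closed: a blank allowlist is a misconfiguration, not "allow all"
        raise RuntimeError('NETBOX_ALLOWED_HOSTS must be set in every environment')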
FLOOR 3
DATA_FLOOR — Secrets, Storage, Session Security
CURRENT STATE
SECRET_KEY in configuration file / env var
Media files on local disk (ephemeral in K8s)
DEBUG = True possible in non-prod environments
CSRF / session cookies not always secure-flagged
TARGET STATE
SECRET_KEY injected from Azure Key Vault via external-secrets
Media in Azure Blob / S3 via django-storages
DEBUG = False hardcoded in production image build
CSRF_COOKIE_SECURE + SESSION_COOKIE_SECURE + SECURE_SSL_REDIRECT all True
ACTIONS
Wire external-secrets-operator → Key Vault for SECRET_KEY
Configure django-storages with Azure Blob backend
Dockerfile ENV DEBUG=False — immutable at build time
Add SECURE_HSTS_SECONDS + SECURE_CONTENT_TYPE_NOSNIFF (combined sketch below)
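The Floor 3 target state as a single configuration.py sketch. These are standard Django settings; confirm which ones your NetBox release surfaces in configuration.py versus a settings override, and treat the env var names as illustrative.

    # configuration.py — Floor 3 target state (sketch)
    import os

    # Injected by external-secrets-operator from Azure Key Vault — never baked in
    SECRET_KEY = os.environ['SECRET_KEY']

    DEBUG = False  # also hardcoded in the production image (ENV DEBUG=False)

    # Session / CSRF / TLS hardening
    CSRF_COOKIE_SECURE = True
    SESSION_COOKIE_SECURE = True
    SECURE_SSL_REDIRECT = True
    SECURE_HSTS_SECONDS = 31536000  # one year
    SECURE_CONTENT_TYPE_NOSNIFF = True

    # Media to Azure Blob via django-storages
    STORAGE_BACKEND = 'storages.backends.azure_storage.AzureStorage'
    STORAGE_CONFIG = {
        'AZURE_ACCOUNT_NAME': os.environ['AZURE_STORAGE_ACCOUNT'],
        'AZURE_CONTAINER': 'netbox-media',
    }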
FLOOR 4
AUDIT_FLOOR — Retention, Alerting, Observability
CURRENT STATE
ObjectChange accumulates indefinitely (DB bloat)
No alerting on authentication anomalies
Prometheus metrics present but not wired to alerting
TARGET STATE
365-day ObjectChange retention with automated purge job
Alert: >100 failed auth attempts in 5-minute window
Prometheus → Grafana → PagerDuty/Alertmanager pipeline live
ACTIONS
Schedule RQ job for ObjectChange pruning (built-in housekeeping)
Wire django-prometheus to Grafana dashboard
Create auth failure alert rule in Alertmanager
Enable LOGIN_REQUIRED — force auth on all pages (sketch below)
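Both settings are stock NetBox parameters, so the retention and auth pieces are two lines of configuration.py. The schedule comment is illustrative; newer NetBox releases run housekeeping as built-in background jobs, so the cron line applies to older ones.

    # configuration.py — Floor 4 target state (sketch)
    CHANGELOG_RETENTION = 365  # days of ObjectChange history kept by housekeeping
    LOGIN_REQUIRED = True      # no anonymous access to any page

    # Older releases: cron the housekeeping management command daily, e.g.
    #   0 3 * * *  /opt/netbox/venv/bin/python /opt/netbox/netbox/manage.py housekeeping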
FLOOR 5
SUPPLY_CHAIN_FLOOR — Build Security, SBOM, Scanning
CURRENT STATE
requirements.txt pinned (good)
No SBOM generated at build time
No container scan in CI pipeline
Dev dependencies potentially in production image
TARGET STATE
Trivy scan as required CI gate — no HIGH/CRIT = no merge
syft SBOM generated at build time, stored with release artifact
Renovate bot for automated dependency update PRs
Multi-stage Dockerfile: builder stage vs runtime stage — no dev deps in prod
ACTIONS
Add Trivy step to GitHub Actions — fail on HIGH+CRIT (gate sketch below)
Add syft step to Dockerfile build stage
Configure Renovate on netbox-community/netbox
Refactor Dockerfile: runtime stage installs production requirements only — pip has no --no-dev flag, so keep dev/test deps in a separate requirements file
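A sketch of the CI gate as a standalone check, for pipelines that keep the Trivy JSON report as part of the release artifact anyway. Field names follow Trivy's --format json output; verify against the pinned Trivy version. (Trivy's own --severity HIGH,CRITICAL --exit-code 1 flags enforce the same gate directly.)

    # ci/check_trivy.py — fail the pipeline on HIGH/CRITICAL findings (sketch)
    import json
    import sys

    BLOCKING = {'HIGH', 'CRITICAL'}

    def blocking_findings(report_path: str) -> list[str]:
        """Collect blocking vulnerabilities from a Trivy JSON report."""
        with open(report_path) as f:
            report = json.load(f)
        findings = []
        for result in report.get('Results', []):
            for vuln in result.get('Vulnerabilities') or []:
                if vuln.get('Severity') in BLOCKING:
                    findings.append(f"{vuln.get('VulnerabilityID')} ({vuln['Severity']})")
        return findings

    if __name__ == '__main__':
        found = blocking_findings(sys.argv[1])  # path to trivy-report.json
        if found:
            print('Blocking vulnerabilities:', *found, sep='\n  ')
            sys.exit(1)  # no HIGH/CRIT = no merge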
The Execution System
Not process for process's sake. Load-bearing structure for a multi-product, hybrid-delivery engineering org. Each block is a real constraint — remove any one and the system wobbles.
CYCLE CADENCE
2-week cycles. Monday planning (1hr max). Wednesday async status (no meeting). Friday demo (30min, working software only).
No status theater. If it's not shippable, it doesn't demo. A slide is not a demo.
Cycle artifacts: scope doc (1 page max), decision log, retro notes. Nothing longer.
The Monday planning doc is the contract for the cycle. Changes mid-cycle require explicit decision log entry.
OWNERSHIP MATRIX
Every component has exactly one owner (team, not person). No shared mutable responsibility.
Contract between teams: defined API + SLA. If you break the contract, you fix it — same day.
Foundations team owns: CI/CD, shared auth, base images, release tooling, test infrastructure.
Product teams own their domain end-to-end — including on-prem delivery of their component. No exceptions.
ESCALATION LADDER
L1: Team resolves within cycle — no escalation needed
L2: Director escalates to VP → unblocked within 48h
L3: VP escalates to CTO → strategy decisions only
L4: CTO → Board → architecture bets, major pivots
The goal: L3 escalations rare by Month 6. L4 escalations: quarterly at most. If VP is getting L1 questions, something is wrong with ownership clarity.
AI ACCOUNTABILITY MODEL
P0 (security, auth, billing): human-written only. AI does not author this code, and any AI-assisted review still requires a full human check.
P1 (core product logic): AI writes first draft, human reviews fully, automated tests required.
P2 (tooling, docs, tests): AI writes, human spot-checks, automated tests gate merge.
P3 (boilerplate, scaffolding): AI autonomous with linting gate only.
On-prem constraint: AI tooling must function with customer-controlled models. SaaS-only AI tooling is not acceptable for NetBox's customer base.
AI-Native Is Not a Feature. It's How the Work Gets Done.
These are not guidelines. They are load-bearing principles. Each one exists because the failure mode without it is documented.
1
SPEC-FIRST
Every feature starts with a spec. AI writes the first draft from the issue. Human reviews the spec — not the code — before a line of implementation is written.
Why: AI implementation quality is directly proportional to spec clarity. A vague spec produces vague code. Improve the spec, improve the output. The spec review is 30 minutes. The implementation refactor is 3 days.
2
COMPREHENSION GATE
AI can write it. A human must understand it before it merges. Not a review checkbox — a real comprehension check: "explain this change to me in plain language." If you can't, it doesn't merge.
Why: this is the constraint that keeps AI-native from becoming AI-chaotic. Non-negotiable at P0 and P1. The gate scales down (P2/P3 = spot-check only) but never disappears.
3
VERIFICATION-LED ITERATION
AI writes tests first (from spec). Human approves coverage. Then AI implements. Tests gate merge. This reverses the standard AI loop (implement→test→fix) into a verification-first loop.
Why: faster iteration because the verification system is defined before implementation starts. The bottleneck is coverage clarity, not implementation speed.
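A minimal sketch of the loop with hypothetical names: the test is drafted from the spec and approved before normalize_vlan_name exists anywhere.

    # Step 1 — AI drafts tests from the spec; human approves coverage first.
    # normalize_vlan_name and its rules are hypothetical spec examples.
    import pytest
    from vlan_utils import normalize_vlan_name  # not yet written — tests come first

    @pytest.mark.parametrize('raw, expected', [
        ('  Prod VLAN ', 'prod-vlan'),  # spec: trim, lowercase, spaces → hyphens
        ('UPLINK_01', 'uplink_01'),     # spec: underscores preserved
    ])
    def test_normalize_vlan_name(raw, expected):
        assert normalize_vlan_name(raw) == expected

    # Step 2 — only after coverage is approved does AI implement vlan_utils;
    # this suite then gates the merge.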
4
RISK-TIERED DELEGATION
Not all code is equal. Don't apply the same human review overhead to boilerplate as to auth flows. The P0–P3 model makes delegation explicit and measurable — not a judgment call, a rule.
Why: without explicit tiers, teams either apply maximum human oversight to everything (slow) or minimum oversight to everything (dangerous). The tier table eliminates the judgment call at review time.
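One way to make the tiers a rule rather than a judgment call — a hypothetical, machine-readable tier table a CI check can diff changed paths against. The paths are illustrative, not NetBox's actual layout.

    # ci/risk_tiers.py — hypothetical tier table (illustrative paths)
    RISK_TIERS = {
        'netbox/users/': 'P0',  # auth — human only
        'netbox/core/':  'P1',  # core logic — full review + tests
        'scripts/':      'P2',  # tooling — spot-check + tests
        'docs/':         'P3',  # boilerplate — lint gate only
    }

    def tier_for(path: str) -> str:
        """Strictest matching tier wins; unmatched paths default to P0."""
        matches = [t for prefix, t in RISK_TIERS.items() if path.startswith(prefix)]
        return min(matches, default='P0')  # 'P0' < 'P1' < ... sorts strictest first

A CI step resolves each changed file to a tier and blocks merge until the matching review requirement (full review, spot-check, or lint gate) is satisfied.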
5
ON-PREM FIRST
NetBox's on-prem customers cannot use SaaS AI tooling. AI-native practices must be designed for offline/self-hosted from day 1. Customer-controlled models (llama, qwen, mistral) must be first-class citizens in the AI toolchain — not an afterthought.
Why: PEMCLAU pattern proves this is solved. Local inference + structured graph retrieval = no external API dependency. The architecture exists. It just needs to be the default, not the exception.
Hybrid Delivery Is Not Two Systems. It's One System With Two Surfaces.
The failure mode is building SaaS and on-prem as separate pipelines. They share a codebase, a test suite, a release process. The surfaces differ; the system doesn't.
SAAS DELIVERY
GitOps via Flux/ArgoCD. Canary deploys. Full observability from day 1.
Weekly release cadence from HEAD. Feature flags gate on-prem-unsafe features.
SaaS gets everything first — it's the canary for the on-prem release.
Incident response: full observability, direct rollback capability, <15min MTTR target.
ON-PREM DELIVERY
netbox-operator pattern: K8s CRDs for NetBox state management. Customer pulls; never push.
Version pins with LTS track: monthly releases, 6-month support window per minor version.
Offline-capable: no external API dependency in core runtime. Works air-gapped.
Upgrade gates: customer must be on N-1 before upgrading to N. No skip upgrades. Ever.
THE OBSERVABILITY GAP
On-prem = less observability. This is the hardest problem in hybrid delivery. Accept it. Engineer around it.
Fix: structured telemetry export (opt-in, customer-controlled destination — S3, Blob, syslog).
Synthetic health checks run inside customer environment, report to their monitoring stack.
/api/status endpoint: version, uptime, last migration applied, health check results. Parseable by support team without direct access (payload sketch below).
Structured logs in known schema — support can read a log bundle without a shell session.
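A sketch of the payload shape. NetBox's stock /api/status/ already reports versions and RQ workers; the uptime, migration, and health fields are the proposed extensions, and every field name and value here is illustrative.

    # Proposed /api/status payload (illustrative field names and values)
    STATUS_EXAMPLE = {
        'netbox-version': '4.2.1',
        'uptime-seconds': 86400,
        'last-migration': 'dcim.0042_example',
        'health': {
            'database': 'ok',
            'redis': 'ok',
            'rq-workers-running': 2,
        },
    }
    # A stable, documented schema means support can parse a log bundle or a
    # synthetic check's output without shell access to the customer environment.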
RELEASE COORDINATION
Single release branch. SaaS deploys from HEAD. On-prem cuts from release branch tag.
Shared test suite runs against both delivery models in CI. One test suite — two surfaces.
On-prem customers get: release notes + migration guide + rollback procedure. Every release. No exceptions. Not optional.
Breaking change policy: 12-week deprecation window — announced in release notes three monthly on-prem releases before removal.
5 Floors. Each One Load-Bearing.
You don't build Floor 2 until Floor 1 holds. This is not a Gantt chart. It's a proof strategy. Each floor has a gate condition — the gate must pass before the next floor starts.
1
EMBED — Understand Before Changing
WEEKS 1–4
GOAL
Understand the system before changing it. No structural moves in Month 1.
Map what's working (protect it), what's fragile (add structure), and what's wrong (fix fast).
DELIVERABLES
Org map — who owns what, documented
Pain map — what's blocking each team
One small execution improvement shipped
GATE → FLOOR 2
I can describe exactly what's working, what's fragile, what's wrong — without asking anyone.
The cross-team coordination blocker is named.
2
CONTRACTS — Teams Know What They Own
MONTHS 2–3
GOAL
Teams know what they own and what they owe each other. Clear interfaces. No ambiguity.
DELIVERABLES
Ownership matrix published and agreed
Cycle cadence locked — first full cycle complete
Escalation ladder defined and used
Cross-team API contracts documented
GATE → FLOOR 3
No initiative is blocked by unclear ownership for more than 24 hours.
3
AI-NATIVE PILOT — Prove It With Data
MONTHS 4–6
GOAL
One team ships measurably faster with AI-native practices. Data is the recruiter for the other teams.
DELIVERABLES
Spec-driven dev on one product from day 1
P0–P3 risk tiers implemented and measured
Throughput delta published (cycle time, defect rate)
GATE → FLOOR 4
Pilot team shows measurable improvement vs pre-pilot baseline. Numbers, not opinions.
4
PLATFORM INVESTMENT — Reduce Friction
MONTHS 7–9
GOAL
Foundations team materially reduces delivery friction across all product teams. Measurably.
DELIVERABLES
Shared CI/CD improvements shipped and measured
Base image standardization complete
Cross-team duplication eliminated (identify top 3, remove 2)
GATE → FLOOR 5
Directors escalate to VP less than once per week on average.
5
SYSTEM SCALES — Works Without Me In the Room
MONTHS 10–12
GOAL
The system works without any individual. Directors self-sufficient. CTO out of execution.
DELIVERABLES
Directors independently run their teams
AI-native practices across all teams — measured, not assumed
CTO fully out of day-to-day execution management
SUCCESS CRITERIA
All 12-month JD success criteria green. The floor holds without the architect in the room.