I-JEPA: The Missing Structure
Image Joint Embedding Predictive Architecture · First Practical JEPA
Abstract: I-JEPA (2023) demonstrated that JEPA could learn semantic image representations without hand-crafted augmentations, using Vision Transformers at scale. It is the first concrete implementation of the JEPA framework. The 6 structural gaps identified in H-JEPA persist in I-JEPA, now manifested at the image patch level.
6 FORMAL GAPS · 1 PER CANON SYMBOL
No Invariant Anchor in Masked Patch Target Selection
γ₁ — THE FLOOR
I-JEPA selects target patches via a multi-block masking strategy. The selection is stochastic and distribution-dependent. There is no fixed invariant anchor that the patch representations must converge to regardless of masking strategy. The floor is absent: representations are defined relative to the dataset, not to any grounding truth.
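The stochastic, distribution-dependent selection described above can be sketched in a few lines. This is a minimal illustration, not I-JEPA's actual implementation; the grid size, scale range, and aspect-ratio range are assumptions chosen for the example.

```python
import random

def sample_block(grid=14, scale=(0.15, 0.2), aspect=(0.75, 1.5)):
    """Sample one rectangular block of patch indices from a grid x grid layout."""
    area = random.uniform(*scale) * grid * grid
    ar = random.uniform(*aspect)
    h = max(1, min(grid, round((area * ar) ** 0.5)))
    w = max(1, min(grid, round((area / ar) ** 0.5)))
    top = random.randrange(grid - h + 1)
    left = random.randrange(grid - w + 1)
    return {r * grid + c for r in range(top, top + h)
                         for c in range(left, left + w)}

def multiblock_targets(grid=14, n_blocks=4):
    """Union of several stochastic target blocks: a different set on every
    call, with no fixed anchor the representations must converge to."""
    targets = set()
    for _ in range(n_blocks):
        targets |= sample_block(grid)
    return targets
```

Every call to `multiblock_targets` yields a different target set, which is precisely the absence of a γ₁ floor: the prediction objective is defined relative to a sampling distribution, not to any invariant.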
Context Encoder Not Self-Adjoint With Target Encoder
H=H† — THE HONEST GATE
I-JEPA uses an exponential moving average of the context encoder as the target encoder (stop-gradient). This creates a permanent asymmetry: the context encoder cannot verify its own predictions against the target encoder in reverse. H=H† is violated by design. The Honest Gate requires symmetric verifiability.
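The one-way update can be made concrete with a minimal sketch of the EMA rule, here over plain parameter lists rather than real network weights. The momentum value is illustrative.

```python
def ema_update(context_params, target_params, momentum=0.996):
    """One-way EMA: the target trails the context encoder, and gradients
    never flow back through it (the stop-gradient). There is no symmetric
    update from target to context -- the asymmetry the text calls out."""
    return [momentum * t + (1.0 - momentum) * c
            for c, t in zip(context_params, target_params)]
```

Because the flow is strictly context → target, the context encoder's predictions are scored against a frozen follower; the target encoder never verifies anything in return.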
No Audit of Paradigm Shift Between Mask Strategies
LSOS — THE READER
I-JEPA's masking strategy determines what the system learns. Shifting between masking strategies (block masking, random masking, full masking) changes the learning paradigm without audit. LSOS would read the active mask paradigm and flag when representations are being shaped by an unacknowledged shift.
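A reader of the kind LSOS describes could be as simple as logging the active paradigm each step and flagging any shift. The function below is a hypothetical sketch of that audit, not an existing API.

```python
def audit_mask_paradigm(log, new_paradigm):
    """Record the active masking paradigm for this step and flag a shift.
    Returns True when the paradigm changed since the previous step --
    the unacknowledged shift the text warns about."""
    shifted = bool(log) and log[-1] != new_paradigm
    log.append(new_paradigm)
    return shifted
```

The point is not the bookkeeping itself but that the shift becomes an explicit, auditable event rather than a silent change in what the system learns.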
No Reset When Context Encoder Diverges From Target
WLD — THE RESET
When the EMA target encoder diverges too far from the context encoder — a known training instability — there is no mercy reset. The training either collapses or recovers stochastically. WLD provides a formal reset protocol: detect divergence, reset to last stable state, resume.
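The detect-reset-resume protocol can be sketched directly. The L2 gap as a divergence measure and the threshold value are assumptions for illustration; the checkpoint stands in for "last stable state".

```python
def check_and_reset(context, target, checkpoint, threshold=1.0):
    """WLD-style mercy reset (sketch): if the L2 gap between context and
    EMA target parameters exceeds the threshold, roll both back to the
    last stable checkpoint instead of letting training collapse."""
    gap = sum((c - t) ** 2 for c, t in zip(context, target)) ** 0.5
    if gap > threshold:
        return list(checkpoint), list(checkpoint), True   # reset triggered
    return context, target, False                          # continue as-is
```

In the formal protocol the reset is deterministic and logged; recovery is never left to the stochastic dynamics of the optimizer.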
No Continuity Guarantee From ViT-S to ViT-H
FEP — THE SWITCH
I-JEPA is demonstrated across ViT-S, ViT-B, ViT-L, and ViT-H model sizes. There is no formal guarantee that representations learned at ViT-S are consistent with those at ViT-H — that scaling preserves the learned paradigm. FEP ensures paradigm continuity across capacity switches.
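One way to operationalize a continuity check is to compare the relational geometry of two encoders' features on the same probe inputs, for example via the cosine similarity of their Gram matrices. This is a crude stand-in for a formal guarantee, offered only as a sketch.

```python
def gram(feats):
    """Gram matrix (pairwise dot products) of a list of feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in feats] for x in feats]

def continuity_score(feats_small, feats_large):
    """Cosine similarity between flattened Gram matrices of two encoders
    (e.g. ViT-S vs ViT-H) on shared probe inputs. 1.0 means the relational
    structure of the representations is identical up to scale."""
    g1 = [v for row in gram(feats_small) for v in row]
    g2 = [v for row in gram(feats_large) for v in row]
    num = sum(a * b for a, b in zip(g1, g2))
    den = (sum(a * a for a in g1) ** 0.5) * (sum(b * b for b in g2) ** 0.5)
    return num / den
```

A score near 1.0 is evidence of preserved paradigm across the capacity switch; FEP would demand such a check as a gate, not an afterthought.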
Maximum Patch Resolution Has No Formal Boundary
FOF — THE BREACH
I-JEPA operates on discretized patch grids. As patch resolution increases toward pixel-level, the architecture approaches a generative model. The boundary between the predictive regime and the generative regime is not formally defined. FOF names this boundary: the point where the JEPA assumption (predict in latent space) breaks down.
STE COMPLETION LAYER
What changes when you add the 8-symbol Canon
Adding the Canon to I-JEPA does not change the architecture. It adds the missing structural layer:

⚓ γ₁ — invariant anchor: mathematical ground truth latent representations must converge to.
⯛ H=H† — honest gate: bidirectional verification of every prediction.
〰️ LSOS — paradigm reader: reads active paradigm before reasoning begins.
🌀 WLD — mercy reset: detects collapse and resets to last stable state.
γ FEP — safe switch: continuity guarantee across paradigm transitions.
🌌 FOF — named ceiling: formal boundary of what the architecture can claim.
═ EVEN — substrate: ground beneath all the above. What holds when everything else is active.

The Canon is not an add-on. It is the formal completion of the JEPA programme.