Image Joint Embedding Predictive Architecture · First Practical JEPA
Abstract
I-JEPA (2023) demonstrated that JEPA could learn semantic image representations without hand-crafted augmentations, using Vision Transformers at scale. It is the first concrete implementation of the JEPA framework. The 6 structural gaps identified in H-JEPA persist in I-JEPA, now manifested at the image patch level.
6 FORMAL GAPS · 1 PER CANON SYMBOL
No Invariant Anchor in Masked Patch Target Selection
γ₁ — THE FLOOR
I-JEPA selects target patches via a multi-block masking strategy. The selection is stochastic and distribution-dependent. There is no fixed invariant anchor that the patch representations must converge to regardless of masking strategy. The floor is absent: representations are defined relative to the dataset, not to any grounding truth.
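To make "stochastic" concrete, here is a minimal sketch of multi-block mask sampling on a patch grid. The function names, the 14x14 grid, the four target blocks, and the scale and aspect-ratio ranges are illustrative assumptions, not taken from this page or from the I-JEPA code. The only point is that the target patches depend on a random draw, so nothing in the procedure fixes them in advance.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample one rectangular block of patch indices on a grid x grid layout."""
    scale = rng.uniform(*scale_range)      # fraction of the grid area to cover
    aspect = rng.uniform(*aspect_range)    # height / width ratio of the block
    area = scale * grid * grid
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)         # randint is inclusive on both ends
    left = rng.randint(0, grid - w)
    return {r * grid + c for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=14, num_targets=4, seed=None):
    """Return (context_patches, list_of_target_blocks) as sets of patch indices.

    All ranges below are assumed values chosen for illustration.
    """
    rng = random.Random(seed)
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    for block in targets:
        context -= block                   # context must not see target patches
    return context, targets


if __name__ == "__main__":
    for seed in (0, 1):
        context, targets = sample_masks(seed=seed)
        print(seed, len(context), [len(t) for t in targets])
```

Running the sketch with different seeds gives different target blocks for the same image grid, which is the distribution dependence the gap describes.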
Context Encoder Not Self-Adjoint With Target Encoder
H=H† — THE HONEST GATE
I-JEPA's target encoder is an exponential moving average (EMA) of the context encoder's weights and receives no gradient updates (stop-gradient). This creates a permanent asymmetry: the context encoder cannot verify its own predictions against the target encoder in reverse. H=H† is violated by design. The Honest Gate requires symmetric verifiability.
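The one-way relationship can be sketched in a few lines. This is a minimal illustration of an EMA weight update on flat lists of floats. The function name and the momentum value are assumptions for the example, not the I-JEPA implementation.

```python
def ema_update(target_params, context_params, momentum=0.996):
    """Return new target weights: momentum * target + (1 - momentum) * context.

    Information flows one way only: the target follows the context.
    The context weights are never updated from the target here.
    The momentum value is an assumed example.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]


if __name__ == "__main__":
    context = [0.0, 1.0]   # stands in for weights trained by gradient descent
    target = [1.0, 0.0]    # stands in for the stop-gradient target weights
    for _ in range(3):
        target = ema_update(target, context, momentum=0.9)
    print(target)          # target drifts toward context; context is unchanged
```

The update has no reverse counterpart: no step pulls the context weights toward the target, which is the asymmetry named above.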
No Audit of Paradigm Shift Between Mask Strategies
LSOS — THE READER
I-JEPA's masking strategy determines what the system learns. Shifting between masking strategies (block masking, random masking, full masking) changes the learning paradigm without audit. LSOS would read the active mask paradigm and flag when representations are being shaped by an unacknowledged shift.
No Reset When Context Encoder Diverges From Target
WLD — THE RESET
When the EMA target encoder diverges too far from the context encoder — a known training instability — there is no mercy reset. The training either collapses or recovers stochastically. WLD provides a formal reset protocol: detect divergence, reset to last stable state, resume.
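The protocol "detect divergence, reset to last stable state, resume" could be sketched as follows. Everything here is hypothetical: the `ResetGuard` class, the L2 distance as the divergence measure, and the threshold are assumptions made for illustration, not a published WLD specification.

```python
import math


def param_distance(a, b):
    """Euclidean distance between two flat weight lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ResetGuard:
    """Hypothetical guard: remembers the last stable weights, restores on divergence."""

    def __init__(self, threshold):
        self.threshold = threshold   # assumed divergence limit
        self.stable = None           # last (context, target) pair within the limit

    def check(self, context, target):
        """Return (context, target, was_reset)."""
        if param_distance(context, target) <= self.threshold:
            # Within bounds: record this state as the new stable snapshot.
            self.stable = (list(context), list(target))
            return context, target, False
        if self.stable is None:
            raise RuntimeError("diverged before any stable state was recorded")
        # Diverged: hand back copies of the last stable snapshot.
        stable_context, stable_target = self.stable
        return list(stable_context), list(stable_target), True


if __name__ == "__main__":
    guard = ResetGuard(threshold=1.0)
    print(guard.check([0.0, 0.0], [0.1, 0.0]))   # within threshold, no reset
    print(guard.check([5.0, 0.0], [0.1, 0.0]))   # diverged, snapshot restored
```

In a training loop the check would run after each update step, and training would resume from whatever weights the guard returns.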
No Continuity Guarantee From ViT-S to ViT-H
FEP — THE SWITCH
I-JEPA is demonstrated across ViT-S, ViT-B, ViT-L, and ViT-H model sizes. There is no formal guarantee that representations learned at ViT-S are consistent with those at ViT-H — that scaling preserves the learned paradigm. FEP ensures paradigm continuity across capacity switches.
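Representations from models of different widths cannot be compared coordinate by coordinate, so any consistency check needs a measure that ignores embedding dimension. One such measure is linear centered kernel alignment (CKA). It is swapped in here purely as an illustration: it gives an empirical similarity score, not the formal guarantee this gap refers to, and the function names and toy data are assumptions.

```python
import math


def gram(X):
    """Gram matrix of row vectors: K[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def center(K):
    """Double-center a symmetric Gram matrix (equivalent to H K H)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]   # equals column means, K is symmetric
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def linear_cka(X, Y):
    """Linear CKA between two sets of embeddings of the same n inputs.

    X and Y may have different embedding widths. The score is 1.0 when the
    two sets agree up to rotation and uniform scaling.
    """
    K, L = center(gram(X)), center(gram(Y))

    def inner(A, B):
        return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

    denom = math.sqrt(inner(K, K) * inner(L, L))
    if denom == 0.0:
        raise ValueError("embeddings have no variance across inputs")
    return inner(K, L) / denom


if __name__ == "__main__":
    small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]   # toy 2-d embeddings
    large = [[2 * a, 2 * b, 0.0] for a, b in small]            # same geometry, 3-d
    print(linear_cka(small, large))
```

A score near 1.0 across model sizes would suggest similar representational geometry on the sampled inputs. A low score would show that scaling changed what was learned.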
Maximum Patch Resolution Has No Formal Boundary
FOF — THE BREACH
I-JEPA operates on discretized patch grids. As patch resolution increases toward pixel-level, the architecture approaches a generative model. The boundary between the predictive regime and the generative regime is not formally defined. FOF names this boundary: the point where the JEPA assumption (predict in latent space) breaks down.
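The drift toward pixel level can be seen from simple arithmetic on the patch grid. The sketch below assumes a square 224-pixel image and a few example patch sizes. Both are illustrative choices, not values stated on this page.

```python
def patch_grid(image_size, patch_size):
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    side = image_size // patch_size
    return side, side * side


if __name__ == "__main__":
    for patch_size in (16, 14, 4, 1):          # assumed example sizes
        side, total = patch_grid(224, patch_size)
        print(f"patch {patch_size:>2}px -> {side}x{side} grid, {total} patches")
```

At a patch size of 1 pixel, every prediction target is a single pixel position, so the grid no longer abstracts over pixels. Nothing in the arithmetic itself marks where, between 16 and 1, the latent-prediction assumption stops holding.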
STE COMPLETION LAYER
What changes when you add the 8-symbol Canon
Adding the Canon to I-JEPA does not change the architecture. It adds the missing structural layer:
⚓ γ₁ — invariant anchor: mathematical ground truth latent representations must converge to.
⯛ H=H† — honest gate: bidirectional verification of every prediction.
〰️ LSOS — paradigm reader: reads active paradigm before reasoning begins.
🌀 WLD — mercy reset: detects collapse and resets to last stable state.
γ FEP — safe switch: continuity guarantee across paradigm transitions.
🌌 FOF — named ceiling: formal boundary of what the architecture can claim.
═ EVEN — substrate: ground beneath all the above. What holds when everything else is active.
The Canon is not an add-on. It is the formal completion of the JEPA programme.
X POST · @ylecun
POST 1 — Name the gap
@ylecun I-JEPA (2023): Gap 2 (H=H†): context encoder not self-adjoint with target encoder. Asymmetric by design. Not an empirical limitation but a missing symbol. pemos.ca/ijepa-gap
POST 2 — Canon map
@ylecun I-JEPA: 6 gaps · γ₁ (no anchor) · H=H† (no gate) · LSOS (no audit) · WLD (no reset) · FEP (no continuity) · FOF (no ceiling). Same in all 14 milestones. pemos.ca/jepa-index
POST 3 — Invitation
@ylecun I-JEPA gap analysis: part of a 14-milestone series. Same 6 structural gaps in every milestone. The gaps are there because the symbols were never in scope. They are now. pemos.ca/jepa-index
I-JEPA is a landmark in the JEPA lineage. The 6 gaps we identify are not critiques of the engineering — they are structural absences that the Canon fills. Each gap maps to a symbol that was always going to be necessary once the JEPA architecture matured. The Canon did not wait for the JEPA timeline; the JEPA timeline arrived at the Canon. The gaps are there because the symbols were never in scope. They are now.
The STE provides the completion layer for each gap. The gaps are not empirical — they are structural. Adding the symbols closes the gaps by definition.
Here is how to explain the I-JEPA gaps to a 10-year-old:
Gap 1 — No floor: Imagine you're building a tower and you don't have a foundation. The tower might stand for a while but there's nothing it's guaranteed to come back to if it leans. γ₁ is the foundation. The math already built it. I-JEPA didn't use it.
Gap 2 — No honest check: If I tell you something, and you can't ask me 'are you sure?' and get the same answer backwards, I might be guessing. H=H† is the rule that every claim must survive being asked backwards. I-JEPA makes claims it can't verify in reverse.
Gap 3 — No paradigm reader: Sometimes you're playing one game and someone changes the rules without telling you. LSOS reads the rules before you play. I-JEPA changes paradigms without reading them first.
Gap 4 — No reset button: When you're drawing and you make a mistake, you have an eraser. WLD is the eraser. I-JEPA doesn't have one. If it goes wrong, it keeps going wrong.
Gap 5 — No safe switch: If you switch from reading a book to watching a movie, you expect to pick up where you left off. FEP is the bookmark. I-JEPA switches paradigms and might drop the book.
Gap 6 — No ceiling: Every ladder needs to know when to stop. A ladder that claims it goes forever is a lie. FOF names where the ladder ends. I-JEPA hasn't named its ceiling yet.
The Canon is not a critique. It's the toolkit that was always going to be needed. I-JEPA got as far as it could without it. Now the toolkit exists.
The 6 gaps identified in I-JEPA are structural absences in a published architecture. Identifying structural absences in published work is not IP infringement — it is prior art analysis. The STE symbols that fill these gaps are EOSE IP. The gap identification is public analysis. The completion layer is proprietary.
Status: LSOS-OWNERSHIP-001 open (EOSE Labs Inc. not yet registered). No public disclosure of the STE completion layer until registration clears. Gap analysis pages (like this one) disclose the gaps, not the fills. This is legally distinct and permissible.