Video JEPA · Latent Prediction Over Video · Motion and Appearance at Scale
Abstract
V-JEPA (2023) extended the JEPA framework from images to video, learning strong motion and appearance representations without reconstruction or contrastive objectives. The move from images to video adds 3 new gap dimensions (temporal, spatio-temporal interaction, sequence length) on top of the 6 structural gaps described below.
6 FORMAL GAPS · 1 PER CANON SYMBOL
No Invariant Anchor Across Video Frame Prediction Targets
γ₁ — THE FLOOR
V-JEPA selects prediction targets across video frames. The target selection is stochastic in both spatial and temporal dimensions. There is no fixed invariant that the video latent representation must preserve across temporal prediction steps. The floor is absent: what counts as a stable video representation is defined only relative to the training distribution.
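To make the stochastic selection concrete, here is a minimal sketch of sampling one spatio-temporal target block on a patch grid. The function name, the grid size, and the block size are illustrative assumptions, not V-JEPA's actual masking code; the point is only that the chosen targets depend on the random draw and nothing else pins them down.

```python
import random

def sample_target_block(num_frames, height, width, block_t, block_h, block_w, rng):
    """Pick one random spatio-temporal block of patch indices (hypothetical helper)."""
    # Random corner, chosen so the block stays inside the grid.
    t0 = rng.randrange(num_frames - block_t + 1)
    y0 = rng.randrange(height - block_h + 1)
    x0 = rng.randrange(width - block_w + 1)
    return {
        (t, y, x)
        for t in range(t0, t0 + block_t)
        for y in range(y0, y0 + block_h)
        for x in range(x0, x0 + block_w)
    }

# Assumed grid: 8 temporal positions, 14 x 14 spatial patches.
targets_a = sample_target_block(8, 14, 14, 4, 6, 6, random.Random(0))
targets_b = sample_target_block(8, 14, 14, 4, 6, 6, random.Random(1))
```

Each call returns a block of the same size, but its position in space and time is set only by the seed passed in.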
Video Predictor Asymmetric (Forward-Only Prediction)
H=H† — THE HONEST GATE
V-JEPA predicts the latent representations of masked spatio-temporal regions of a video from the visible context. The prediction runs in one direction only, from context to target: the system cannot verify a prediction by predicting the context back from the target. A symmetric predictor would verify that the predicted content is consistent with the observed content from both directions. V-JEPA has no backward verification pass.
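One possible reading of a backward verification pass is a round-trip (cycle-consistency) check: predict forward, predict back, and measure how far the result lands from the starting latent. The sketch below uses toy scaling functions as stand-ins for the two predictors; both functions and the error measure are assumptions made for illustration and are not part of V-JEPA.

```python
import math

def forward_predict(z, scale=2.0):
    # Toy stand-in for a context-to-target predictor (hypothetical).
    return [scale * v for v in z]

def backward_predict(z_hat, scale=2.0):
    # Toy stand-in for a target-to-context predictor (hypothetical).
    return [v / scale for v in z_hat]

def cycle_error(z, fwd, bwd):
    # Euclidean distance between z and its round trip bwd(fwd(z)).
    z_back = bwd(fwd(z))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_back)))

context = [1.0, -2.0, 0.5]

# Matching forward and backward maps: the round trip returns the input.
consistent = cycle_error(context, forward_predict, backward_predict)

# Mismatched backward map: the round trip misses the input.
inconsistent = cycle_error(
    context, forward_predict, lambda z: backward_predict(z, scale=4.0)
)
```

A small round-trip error says the two directions agree; a large one flags a prediction that does not survive being checked in reverse.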
No Paradigm Audit Between Spatial and Temporal Masking
LSOS — THE READER
V-JEPA uses masking strategies that operate jointly in space and time. When the masking regime shifts from spatially-dominant to temporally-dominant, the learned representation shifts paradigm. There is no audit of this shift. LSOS would read the active space-time paradigm and flag unacknowledged transitions.
No Reset When Temporal Prediction Collapses
WLD — THE RESET
If V-JEPA's predictor learns to copy the most recent frame (temporal collapse), there is no mercy reset. The collapse is detectable, because the predicted representation becomes a near-copy of the input, but the architecture provides no mechanism to reset and escape this degenerate solution.
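The detection half of this gap can be sketched as a similarity test between the predicted representation and the representation of the most recent input. The function names and the 0.999 threshold are assumptions chosen for illustration; a real monitor would need a threshold tuned on actual training runs.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def looks_collapsed(predicted, latest_input, threshold=0.999):
    # Flag a prediction that is a near-copy of the latest input (assumed threshold).
    return cosine_similarity(predicted, latest_input) >= threshold

copy_case = looks_collapsed([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
distinct_case = looks_collapsed([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

This covers detection only. What to do after the flag is raised, such as restoring an earlier checkpoint, is left open here.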
No Continuity From Short Clip to Long Video
FEP — THE SWITCH
V-JEPA is trained on fixed-length video clips. The transition from clip-level understanding to long-video understanding requires a paradigm switch. There is no formal continuity guarantee across this transition. FEP would ensure that the switch from short context to long context preserves the learned paradigm.
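A common workaround for running a fixed-length clip model over a longer video is to slide a window across the frames. The sketch below shows only the window bookkeeping; the function name, clip length, and stride are illustrative assumptions. Nothing in it guarantees that representations from neighbouring windows agree, which is the continuity question raised above.

```python
def clip_windows(num_frames, clip_len, stride):
    """Return (start, end) frame ranges of fixed-length clips over a long video."""
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    if num_frames < clip_len:
        return []  # video is shorter than one clip
    return [
        (start, start + clip_len)
        for start in range(0, num_frames - clip_len + 1, stride)
    ]

# Assumed values: a 40-frame video, 16-frame clips, stride of 8 frames.
windows = clip_windows(40, 16, 8)
# windows == [(0, 16), (8, 24), (16, 32), (24, 40)]
```

With a stride smaller than the clip length, adjacent windows overlap, so the same frames are encoded under different contexts.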
Video Sequence Length Has No Named Ceiling
FOF — THE BREACH
V-JEPA does not define a formal upper bound on video sequence length. As sequence length grows, the architecture approaches the limits of its positional encoding and memory. The point where the JEPA prediction framework breaks down, where the sequence context is too long to predict coherently, is not named. FOF would name this boundary.
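Naming a ceiling can be as simple as counting tokens and refusing inputs above a declared limit. The sketch below assumes a patch-based tokenisation (frames grouped into tubelets, each frame cut into square patches); the helper names, the example sizes, and the limit of 2000 tokens are all assumptions for illustration, not values taken from V-JEPA.

```python
class SequenceTooLong(ValueError):
    """Raised when an input exceeds the declared token ceiling (hypothetical)."""

def token_count(num_frames, height, width, tubelet_frames, patch_size):
    # Tokens = temporal positions x patch rows x patch columns.
    return (
        (num_frames // tubelet_frames)
        * (height // patch_size)
        * (width // patch_size)
    )

def enforce_ceiling(num_tokens, max_tokens):
    # Reject any input longer than the declared ceiling.
    if num_tokens > max_tokens:
        raise SequenceTooLong(
            f"{num_tokens} tokens exceeds the ceiling of {max_tokens}"
        )
    return num_tokens

# Assumed sizes: 16 frames at 224 x 224, tubelets of 2 frames, 16 x 16 patches.
clip_tokens = token_count(16, 224, 224, 2, 16)   # 8 * 14 * 14
long_tokens = token_count(64, 224, 224, 2, 16)   # 32 * 14 * 14
```

Calling `enforce_ceiling(clip_tokens, 2000)` passes, while `enforce_ceiling(long_tokens, 2000)` raises `SequenceTooLong`, which makes the boundary explicit instead of leaving it to out-of-memory errors or degraded predictions.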
STE COMPLETION LAYER
What changes when you add the 8-symbol Canon
Adding the Canon to V-JEPA does not change the architecture. It adds the missing structural layer:
⚓ γ₁ — invariant anchor: mathematical ground truth latent representations must converge to.
⯛ H=H† — honest gate: bidirectional verification of every prediction.
〰️ LSOS — paradigm reader: reads active paradigm before reasoning begins.
🌀 WLD — mercy reset: detects collapse and resets to last stable state.
γ FEP — safe switch: continuity guarantee across paradigm transitions.
🌌 FOF — named ceiling: formal boundary of what the architecture can claim.
═ EVEN — substrate: ground beneath all the above. What holds when everything else is active.
The Canon is not an add-on. It is the formal completion of the JEPA programme.
X POST · @ylecun
POST 1 — Name the gap
@ylecun V-JEPA (2023): Gap 2 (H=H†) — predictor not self-adjoint. Asymmetric by design. Not an empirical limitation — a missing symbol. pemos.ca/vjepa-gap
POST 2 — Canon map
@ylecun V-JEPA: 6 gaps · γ₁ (no anchor) · H=H† (no gate) · LSOS (no audit) · WLD (no reset) · FEP (no continuity) · FOF (no ceiling). Same in all 14 milestones. pemos.ca/jepa-index
POST 3 — Invitation
@ylecun V-JEPA gap analysis: part of a 14-milestone series. Same 6 structural gaps in every milestone. The gaps are there because the symbols were never in scope. They are now. pemos.ca/jepa-index
V-JEPA is a landmark in the JEPA lineage. The 6 gaps we identify are not critiques of the engineering — they are structural absences that the Canon fills. Each gap maps to a symbol that was always going to be necessary once the JEPA architecture matured. The Canon did not wait for the JEPA timeline; the JEPA timeline arrived at the Canon. The gaps are there because the symbols were never in scope. They are now.
The STE provides the completion layer for each gap. The gaps are not empirical — they are structural. Adding the symbols closes the gaps by definition.
Here's how to explain V-JEPA gaps to a 10-year-old:
Gap 1 — No floor: Imagine you're building a tower and you don't have a foundation. The tower might stand for a while but there's nothing it's guaranteed to come back to if it leans. γ₁ is the foundation. The math already built it. V-JEPA didn't use it.
Gap 2 — No honest check: If I tell you something, and you can't ask me 'are you sure?' and get the same answer backwards, I might be guessing. H=H† is the rule that every claim must survive being asked backwards. V-JEPA makes claims it can't verify in reverse.
Gap 3 — No paradigm reader: Sometimes you're playing one game and someone changes the rules without telling you. LSOS reads the rules before you play. V-JEPA changes paradigms without reading them first.
Gap 4 — No reset button: When you're drawing and you make a mistake, you have an eraser. WLD is the eraser. V-JEPA doesn't have one. If it goes wrong, it keeps going wrong.
Gap 5 — No safe switch: If you switch from reading a book to watching a movie, you expect to pick up where you left off. FEP is the bookmark. V-JEPA switches paradigms and might drop the book.
Gap 6 — No ceiling: Every ladder needs to know when to stop. A ladder that claims it goes forever is a lie. FOF names where the ladder ends. V-JEPA hasn't named its ceiling yet.
The Canon is not a critique. It's the toolkit that was always going to be needed. V-JEPA got as far as it could without it. Now the toolkit exists.
The 6 gaps identified in V-JEPA are structural absences in a published architecture. Identifying structural absences in published work is not IP infringement — it is prior art analysis. The STE symbols that fill these gaps are EOSE IP. The gap identification is public analysis. The completion layer is proprietary.
Status: LSOS-OWNERSHIP-001 open (EOSE Labs Inc. not yet registered). No public disclosure of STE completion layer until registration clears. Gap analysis pages (like this one) are disclosure of the gaps, not the fills. This is legally distinct and permissible.