Operator kernels, min-cut, Tucker tensors, and persistent homology — a formal pipeline for measuring what cannot be said.
From the Dimensionality Illusion paper (Thorarinson & Hensgen, 2026): PCA variance is not semantic information. Variance maximization captures directions of largest geometric spread — the coarse inter-domain axes. The fine-grained intra-concept distinctions that this exercise exists to surface live in the long tail of low-variance dimensions. PCA discards them by design.
bhāvanā (cultivated mental construct), 念 (arising thought), and idea (eidetic form) are distinguished along dimensions that each explain <1% of variance. These are exactly the dimensions PCA throws away first. Cross-lingual semantic decomposition is the canonical long-tail problem.
Two failure modes apply directly:
Collapse (CISA). Distinct semantic axes that happen to co-vary across most of the panel merge under PCA. Sanskrit bhāvanā and Greek idea project to nearby points on the dominant substance–process axis, even though their underlying ontology of mind is incommensurable.
Distortion. Functionally equivalent encodings that use orthogonal grammatical machinery — Quechua evidential -mi and Turkish -miş — are torn apart in the principal subspace, even though they realize the same semantic dimension.
| Pair | Full-Space Relation | PCA-3 Relation | Aliasing Risk |
|---|---|---|---|
| Sanskrit bhāvanā ∼ Chinese 念 | Both process-ontology, but bhāvanā is cultivated; 念 is arising | Both at F1− pole, F3 differs subtly | High |
| Greek idea ∼ German Idee | Eidetic form vs Kantian regulative concept | Both at F1+ substance pole | High |
| Quechua evidential ∼ Bulgarian renarrative | Both grammaticalized witness-marking; Quechua 3-way, Bulgarian binary | Both at F2+ commitment pole | Medium |
| Navajo process-of-thinking ∼ Sanskrit bhāvanā | Substance-rejection vs cultivation: opposite philosophical postures | Both at F1− | Severe |
PCA confirms what you already knew and hides what you wanted to discover.
Encode each translation l as a vector x_l ∈ ℝ^d along ten hand-encoded dimensions (d ≈ 30 after one-hot expansion). Stack into X ∈ ℝ^{L×d}, L ≈ 50. The first three principal components explain ~62% of variance. Their interpretation is reported below as a diagnostic only, not as a substantive description of the semantic field.
kakos, Turkish kötü.böse) or technical unskillfulness (akushala).
bhāvanā separates from 念 and Tagalog tayo separates from kami.
Form the 3-tensor T ∈ ℝ^{L×W×A}, where L indexes languages, W indexes content-word slots (enemy, bad, people, idea), and A indexes semantic axes.
The core tensor 𝒢 exposes entanglement: which language–word pairs co-load on which axes? Diagonal core = clean factor model; large off-diagonal entries = a particular language treats a particular word in a non-generic way. Predicted hot spot: Navajo × idea × animacy is off-diagonal.
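The entanglement reading of the core can be sketched with a plain truncated HOSVD in NumPy. This is a minimal illustration under stated assumptions: the tensor is random toy data rather than the panel, the ℓ1 sparsity penalty on the core is omitted, and `off_diagonal_mass` is a hypothetical summary statistic, not a measure defined in the text.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor, ranks):
    """Truncated HOSVD of a 3-tensor. Returns (core, [U_language, U_word, U_axis])."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    # Project every mode onto its factor basis to obtain the core tensor.
    core = np.einsum("lwa,li,wj,ak->ijk", tensor, *factors)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core back out along all three modes."""
    return np.einsum("ijk,li,wj,ak->lwa", core, *factors)

def off_diagonal_mass(core):
    """Hypothetical summary: share of squared core energy off the superdiagonal."""
    total = float(np.sum(core ** 2))
    diag = sum(float(core[i, i, i] ** 2) for i in range(min(core.shape)))
    return (total - diag) / total if total > 0 else 0.0

# Toy stand-in for T (languages x word slots x axes); NOT real panel data.
rng = np.random.default_rng(0)
T = rng.normal(size=(6, 4, 5))
core, factors = hosvd(T, ranks=(6, 4, 5))
print(core.shape, round(off_diagonal_mass(core), 3))
```

With full ranks the decomposition is exact, so any truncation error comes only from the chosen `ranks`. Individual large entries of `core` away from the superdiagonal are the candidates to inspect for language-specific treatment of a word.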
Each translation is a bounded linear operator T_l : C → L_l from concept space to lexical space.
The orthogonal complement of ∑_l im(T_l^*) inside C, equivalently ⋂_l ker(T_l), is the residue that every language in the panel fails to reach.
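In finite dimensions the shared kernel ⋂_l ker(T_l^* T_l) is the null space of the matrix obtained by stacking all T_l, so it can be read off the right singular vectors with (near-)zero singular values. The sketch below assumes each operator is available as a matrix of shape (dim L_l, dim C); the function name, tolerance, and toy operators are illustrative assumptions.

```python
import numpy as np

def unsayable_subspace(operators, tol=1e-10):
    """Orthonormal basis (as columns) of the intersection of the kernels of all operators."""
    stacked = np.vstack(operators)               # shape: (sum of lexical dims, dim C)
    _, s, vt = np.linalg.svd(stacked, full_matrices=True)
    # Pad singular values with zeros so every right singular vector has one.
    s_full = np.zeros(vt.shape[0])
    s_full[: s.size] = s
    scale = max(float(s_full.max()), 1.0)
    # Keep the SMALL singular values: those directions are reached by no language.
    return vt[s_full <= tol * scale].T

# Toy concept space C = R^4 with two hypothetical "languages".
T1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]])
T2 = np.array([[0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]])
U = unsayable_subspace([T1, T2])
print(U.shape)  # one direction (the fourth coordinate) is reached by neither operator
```

With real embeddings no singular value is exactly zero, so the tolerance becomes a modelling choice and should be reported alongside the result.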
Treat the L translations as a point cloud in embedding space. Compute the Vietoris–Rips filtration.
Persistent 1-cycles are the most rigorous formalization of what is not communicated. Each one identifies a region of concept space that translations orbit around without occupying: a semantic lacuna with topological proof of existence.
Every step that compresses or projects the embedding space is gated by a domain-conditional retrieval test on the panel-critical pair set. A step is admissible only if it preserves the operationally critical distinctions; if it does not, it is used only as a diagnostic.
| # | Algorithm | Status | Output | PCA-Blindness Notes |
|---|---|---|---|---|
| 1 | Ensemble multilingual embed (LaBSE + BGE-M3 + e5-large) | substantive | x_l ∈ ℝ^d, d ∈ {768, 1024} | Single-model embedding inherits training biases; ensembling reduces idiosyncratic compression. Keep all three; do not reduce. |
| 2 | Hand-encode axis matrix A (two-pass: LLM → linguist audit) | substantive | A ∈ ℝ^{L×d_axis}, d_axis ≈ 30 | This is the field the variance-maximizers will distort. It is the artifact of record. |
| 5 | Procrustes panel (orthogonal Procrustes, pairwise) | substantive | Residual R_lm = ‖X_l − Q_lm X_m‖_F | Does not compress; rotates only. Information-preserving. |
| 6 | Min-cut on rank-correlation graph (Stoer–Wagner / Shi–Malik) | substantive | Cut partition + conductance Φ(S) | Edge weights from Spearman rank correlation, not Euclidean residuals. See §5. |
| 7 | Per-axis JSD (pairwise on hand-coded distributions) | substantive | Per-axis untranslatability scalar | Operates on hand-encoded matrix A directly. PCA-free. |
| 8 | Sinkhorn OT (ε-regularized Wasserstein-2) | substantive | Transport plan π* | Operates in full embedding space; no projection. Immune by construction. |
| 10 | Operator kernels (⋂_l ker(T_l^* T_l) over the panel) | substantive | Unsayable subspace U | The most theoretically clean step. Keep SMALL singular values. |
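Step 7 in the table above can be sketched in pure Python. The assumptions here are mine: each language's coding on one axis is given as a probability distribution over that axis's categories, the divergence uses base-2 logarithms (so it lies in [0, 1]), and the per-axis scalar is taken to be the mean pairwise JSD. The language labels and codes are hypothetical.

```python
from itertools import combinations
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def per_axis_untranslatability(axis_codes):
    """Mean pairwise JSD over all language pairs for a single axis (assumed aggregate)."""
    pairs = list(combinations(axis_codes.values(), 2))
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)

# Hypothetical one-hot codes on a three-category axis; not taken from the panel.
codes = {
    "lang_A": [1.0, 0.0, 0.0],
    "lang_B": [0.0, 1.0, 0.0],
    "lang_C": [1.0, 0.0, 0.0],
}
score = per_axis_untranslatability(codes)
print(round(score, 4))
```

Because the computation runs on the hand-encoded distributions directly, no projection of the embedding space is involved at any point.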
| # | Algorithm | Status | Output | PCA-Blindness Notes |
|---|---|---|---|---|
| 3 | PCA / oblique factor analysis (diagonalize X^T X + Promax) | diagnostic only | Loadings, scree plot | Run it. Look at it. Do not interpret as the field. Annotate with CISA failures. |
| 4 | UMAP (n_neighbors=15, min_dist=0.0) | diagnostic only | 2D coordinates | Preserves local rank, not variance. Validate against SCL on P before trusting. |
| 9 | Tucker decomposition (HOSVD with ℓ1 sparsity on core) | diagnostic only | Core 𝒢 + factor matrices | Same PCA-blindness applies per mode. Use for entanglement diagnosis, not compression. |
| # | Algorithm | Status | Output | PCA-Blindness Notes |
|---|---|---|---|---|
| 12 | LLM gap matrix (per-axis: "can language l encode dimension a?") | substantive | Binary gap matrix | The only step that catches absent concepts, which no geometric method can. |
| Test | Definition | Threshold |
|---|---|---|
| DCRP | Domain-Conditional Retrieval Precision — can a query for "akushala-style claim" retrieve language l's translation more accurately than chance? | Computed on P, not random pairs |
| SCL | Semantic Coherence Loss: SCL_k = 1 − ρ(sim_full, sim_k) via Spearman rank correlation over P | SCL_k < 0.10 |
| CISA | Compression-Induced Semantic Aliasing: count of pairs (a,b) ∈ P with sim_full(a,b) < τ_distinct but sim_k(a,b) ≥ τ_alias | CISA_k = 0 |
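The SCL and CISA gates in the table above can be sketched as follows. The similarity lists are assumed to be aligned over the pair set P (entry i in both lists refers to the same pair), the threshold values for τ_distinct and τ_alias are placeholders I chose for illustration, and the numbers are toy values rather than measurements.

```python
def average_ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks (undefined for constant input)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def scl(sim_full, sim_k):
    """SCL_k = 1 - rho(sim_full, sim_k) over the pair set P."""
    return 1.0 - spearman(sim_full, sim_k)

def cisa(sim_full, sim_k, tau_distinct, tau_alias):
    """Count pairs that are distinct in full space but aliased after projection."""
    return sum(1 for f, k in zip(sim_full, sim_k) if f < tau_distinct and k >= tau_alias)

def admissible(sim_full, sim_k, tau_distinct, tau_alias, scl_max=0.10):
    """A projection passes only if SCL_k < scl_max and CISA_k == 0."""
    return scl(sim_full, sim_k) < scl_max and cisa(sim_full, sim_k, tau_distinct, tau_alias) == 0

# Toy similarities over four pairs in P; thresholds are placeholder assumptions.
sim_full = [0.9, 0.2, 0.5, 0.7]
sim_k = [0.95, 0.85, 0.6, 0.8]
print(scl(sim_full, sim_k), cisa(sim_full, sim_k, 0.3, 0.8))
```

In this toy case the second pair is distinct in the full space (0.2) but looks similar after projection (0.85), so the projection fails both gates and would be kept as a diagnostic only.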
[INFORMATION-PRESERVING CORE]
1. Ensemble multilingual embed (LaBSE + BGE-M3 + e5-large)
2. Hand-encode axis matrix A (two-pass: LLM → linguist audit)
3. Compute panel-critical pair set P (~30 pairs)
5. Procrustes panel (rotation only, no compression)
6. Min-cut on rank-correlation graph (Spearman-weighted edges)
7. Per-axis JSD (operates on A, PCA-free)
8. Sinkhorn OT in full embedding space (no projection)
10. Operator-kernel intersection (keep SMALL singular values)

[DIAGNOSTICS, NOT SUBSTANTIVE]
3. PCA / oblique FA (diagnostic only; annotate CISA failures)
4. UMAP (validate against SCL before trusting)
9. Tucker w/ ℓ1 sparsity on core (use for entanglement, not compression)
11. Persistent H1 via witness complex (topological, full embedding)

[GAP DETECTION]
12. LLM gap matrix (catches absences geometry cannot)

[GATEKEEPERS, applied at every projection]
DCRP-analog on P, SCL_k < 0.10, CISA_k = 0
The original sketch built edge weights from the squared Procrustes residual ‖r_lm‖², a Euclidean quantity dominated by principal-component spread. This is exactly what PCA can already see.
The corrected edge weight between languages l and m is built from the Spearman correlation of rank_l(P) and rank_m(P), where rank_l(P) is the rank order of similarities over the panel-critical pair set as encoded in language l. This weight is invariant under monotone transformations of similarity, so it cannot be inflated by high-variance axes that dominate cosine distances.
Run Stoer–Wagner for global min-cut on the rank-correlation graph; normalized cut (Shi–Malik) for soft multi-way partitions.
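The rank-correlation graph and the global min-cut can be sketched together in pure Python. Several choices below are assumptions, not taken from the text: the Spearman correlation ρ is mapped to a non-negative edge weight via (1 + ρ) / 2, the rank computation assumes no tied similarities, and the five languages and their similarity profiles are toy data. The normalized-cut (Shi–Malik) variant is not shown.

```python
def ranks(values):
    """1-based ranks; assumes no ties (an assumption of this sketch)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for position, index in enumerate(order, start=1):
        r[index] = position
    return r

def spearman(x, y):
    """Spearman rho via the sum-of-squared-rank-differences formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def stoer_wagner(weights):
    """Global min-cut of an undirected graph with a symmetric non-negative weight matrix.
    Returns (cut_weight, sorted node indices on one side of the cut)."""
    w = [row[:] for row in weights]
    groups = [[i] for i in range(len(w))]   # original nodes merged into each super-node
    active = list(range(len(w)))
    best_weight, best_side = float("inf"), []
    while len(active) > 1:
        # Minimum-cut phase: repeatedly add the node most tightly connected to the grown set.
        start = active[0]
        conn = {v: w[start][v] for v in active if v != start}
        prev, last = None, start
        while conn:
            nxt = max(conn, key=conn.get)
            cut_of_phase = conn.pop(nxt)
            prev, last = last, nxt
            for v in conn:
                conn[v] += w[nxt][v]
        # cut_of_phase separates the last-added super-node from everything else.
        if cut_of_phase < best_weight:
            best_weight, best_side = cut_of_phase, sorted(groups[last])
        # Merge the last-added node into the one added just before it.
        groups[prev].extend(groups[last])
        for v in active:
            if v != prev and v != last:
                w[prev][v] += w[last][v]
                w[v][prev] = w[prev][v]
        active.remove(last)
    return best_weight, best_side

# Toy similarity profiles over five pairs in P; NOT real panel data.
profiles = {
    "A": [0.9, 0.7, 0.5, 0.3, 0.1],
    "B": [0.8, 0.6, 0.55, 0.2, 0.15],
    "C": [0.95, 0.5, 0.6, 0.25, 0.05],
    "D": [0.1, 0.3, 0.5, 0.7, 0.9],
    "E": [0.2, 0.25, 0.6, 0.65, 0.8],
}
names = list(profiles)
W = [[0.0 if a == b else (1 + spearman(profiles[a], profiles[b])) / 2 for b in names]
     for a in names]
cut_weight, side = stoer_wagner(W)
cut_languages = [names[i] for i in side]
print(cut_weight, cut_languages)
```

Rescaling any language's similarities by a monotone function leaves its ranks, and therefore every edge weight and the resulting cut, unchanged.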
Wasserstein-2 between empirical content-word embedding distributions, full dimensionality. The optimal transport plan π* identifies which content-word pairs pay the translation cost. Crucially: OT operates in the original space, no projection. It is immune to PCA-blindness by construction.
The marginals of π* over content-word slots give the per-word untranslatability budget across the panel:
[Figure: per-word untranslatability budget for the content-word slots idea, bad, people, enemy]

Prediction: idea and bad carry the bulk of the translation cost because they sit at the intersection of substance–process ontology and moral-lexeme texture, the two axes with highest cross-lingual divergence. people and enemy are near-universal; the transport plan barely moves them.
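The transport step can be sketched with a basic Sinkhorn iteration in NumPy. This is an illustration under assumptions of my own: uniform weights on the content-word embeddings, a squared-Euclidean ground cost, ε expressed relative to the largest cost for numerical stability, and a per-word budget read as the cost-weighted row sum of the plan. The embeddings are random toy vectors, so the printed numbers say nothing about the prediction above.

```python
import numpy as np

def sinkhorn_plan(x, y, eps=0.05, n_iter=500):
    """Entropy-regularized OT between uniform empirical measures on the rows of x and y.
    Returns (plan, cost), where cost is the squared Euclidean distance matrix."""
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    scale = float(cost.max()) or 1.0          # guard against an all-zero cost matrix
    kernel = np.exp(-cost / (eps * scale))
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)                # match the column marginals
        u = a / (kernel @ v)                  # match the row marginals
    plan = u[:, None] * kernel * v[None, :]
    return plan, cost

slots = ["idea", "bad", "people", "enemy"]
rng = np.random.default_rng(0)
emb_l = rng.normal(size=(4, 8))   # toy embeddings for language l; NOT real data
emb_m = rng.normal(size=(4, 8))   # toy embeddings for language m; NOT real data

plan, cost = sinkhorn_plan(emb_l, emb_m)
# Assumed reading of the "per-word budget": transport cost paid by each source word.
per_word_cost = (plan * cost).sum(axis=1)
for slot, value in zip(slots, per_word_cost):
    print(slot, round(float(value), 4))
```

The computation uses all embedding dimensions; nothing is projected before the cost matrix is formed.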
The legal-retrieval failure mode and the cross-lingual decomposition failure mode are the same structural problem: variance captures coarse structure while the long tail carries operationally critical signal.
| Property | Legal Domain | Cross-Lingual Domain |
|---|---|---|
| Canonical pair | Pflichtteil vs elective share | bhāvanā vs idea |
| What PCA sees | Both are "inheritance law" — coarse domain match | Both are "mental construct" — coarse substance pole |
| What PCA misses | Pflichtteil is mandatory share; elective share is opt-in. Operationally opposite. | bhāvanā is cultivated process; idea is eidetic form. Ontologically incommensurable. |
| Where signal lives | Dimensions 9–16 (variance <2% each) | Dimensions beyond F3 (variance <1% each) |
| Failure mode | CISA: distinct concepts aliased after compression | CISA: distinct ontologies aliased after compression |
| Correct method | Domain-conditional retrieval on critical pair set | Rank-preserving methods on panel-critical pair set |
| Core insight | Coarse benchmarks license silent failure on long-tail signal. The mechanism is universal. | |