The Mathematics of Untranslatability

Operator kernels, min-cut, Tucker tensors, and persistent homology — a formal pipeline for measuring what cannot be said.

12 Pipeline Stages · 7 Substantive Methods · 3 Diagnostic-Only Methods · 3 Gatekeeper Tests · ~30 Panel-Critical Pairs

§1. Why PCA Is Wrong (The Binding Constraint)

From the Dimensionality Illusion paper (Thorarinson & Hensgen, 2026): PCA variance is not semantic information. Variance maximization captures directions of largest geometric spread — the coarse inter-domain axes. The fine-grained intra-concept distinctions that this exercise exists to surface live in the long tail of low-variance dimensions. PCA discards them by design.

THE LONG-TAIL PROBLEM
The distinctions between bhāvanā (cultivated mental construct), 念 (arising thought), and idea (eidetic form) are encoded in dimensions that each explain <1% of variance. These are exactly the dimensions PCA throws away first. Cross-lingual semantic decomposition is the canonical long-tail problem.

Two failure modes apply directly:

Collapse (CISA). Distinct semantic axes that happen to co-vary across most of the panel merge under PCA. Sanskrit bhāvanā and Greek idea project to nearby points on the dominant substance–process axis, even though their underlying ontology of mind is incommensurable.

Distortion. Encodings that realize the same semantic dimension through orthogonal grammatical machinery, such as the Quechua direct evidential -mi and the Turkish indirect evidential -miş, are torn apart in the principal subspace even though both mark evidentiality.

COMPRESSION ADMISSIBILITY GATE
Compression admissible at dim k  ⇔  SCL_k < 0.10  AND  CISA_k = 0  on the panel-critical pair set P

CISA Hot Spots

| Pair | Full-Space Relation | PCA-3 Relation | Aliasing Risk |
|---|---|---|---|
| Sanskrit bhāvanā ∼ Chinese 念 | Both process-ontology, but bhāvanā is cultivated; 念 is arising | Both at F1− pole, F3 differs subtly | High |
| Greek idea ∼ German Idee | Eidetic form vs Kantian regulative concept | Both at F1+ substance pole | High |
| Quechua evidential ∼ Bulgarian renarrative | Both grammaticalized witness-marking; Quechua 3-way, Bulgarian binary | Both at F2+ commitment pole | Medium |
| Navajo process-of-thinking ∼ Sanskrit bhāvanā | Substance-rejection vs cultivation: opposite philosophical postures | Both at F1− | Severe |

PCA confirms what you already knew and hides what you wanted to discover.

§2. The Factor Model (Diagnostic Only)

Encode each translation l as a vector x_l ∈ ℝ^d along ten hand-encoded dimensions (d ≈ 30 after one-hot expansion). Stack into X ∈ ℝ^{L×d}, L ≈ 50. The first three principal components explain ~62% of variance. Their interpretation is reported below as a diagnostic only, not as a substantive description of the semantic field.

F1: Substance–Process Axis (28% of variance)
+ pole: Idea-as-noun, count morphology, definite articles, copular predication. Greek, Latin, English, German.
− pole: Verbal-noun idea, no plural, ostensive demonstratives, contrastive copula. Classical Chinese, Sanskrit, Navajo.

F2: Speaker-Commitment Axis (21% of variance)
+ pole: Evidentiality required, animacy hierarchy, inclusive/exclusive we. Quechua, Navajo, Tagalog, Māori.
− pole: No obligatory evidentiality, no clusivity. English, Mandarin, Russian.

F3: Moral-Lexeme Texture (13% of variance)
+ pole: Aesthetic/ignobility fusion. Greek kakos, Turkish kötü.
− pole: Malevolent agency (böse) or technical unskillfulness (akushala).
ENGLISH: THE MAXIMAL UNDERDETERMINATION CORNER
English sits at x_EN ≈ (+substance, −commitment, middle-moral). The diagnostic confirms that English occupies the corner of the panel where grammatical specification is lowest. But everything interesting is in the residual: the 38% of variance the factor model does not explain is precisely the long tail where bhāvanā separates from 念 and Tagalog tayo separates from kami.
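
The variance shares quoted in this section, and the residual they leave behind, come from a standard PCA computation. The following is a minimal sketch of that diagnostic, assuming NumPy; the matrix here is random synthetic data standing in for the real L × d panel, and the function name is hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component (rows = translations)."""
    Xc = X - X.mean(axis=0)                    # center every encoded dimension
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, sorted descending
    var = s ** 2
    return var / var.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))                  # synthetic stand-in, NOT the real panel
ratio = explained_variance_ratio(X)
top3 = ratio[:3].sum()                         # what F1..F3 would explain
residual = 1.0 - top3                          # the long tail the diagnostic ignores
print(f"top-3 share: {top3:.2f}, residual: {residual:.2f}")
```

The residual is the quantity of interest here: it measures how much of the encoded field sits outside the three reported factors.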

§3. Beyond Factors — Three Escalations

3a. Tucker Tensor Decomposition

Form the 3-tensor 𝒯 ∈ ℝ^{L×W×A} where L = language, W = content-word slot (enemy, bad, people, idea), A = semantic axis.

TUCKER DECOMPOSITION
𝒯 ≈ 𝒢 ×_1 U^(L) ×_2 U^(W) ×_3 U^(A)

The core tensor 𝒢 exposes entanglement: which language–word pairs co-load on which axes? Diagonal core = clean factor model; large off-diagonal entries = a particular language treats a particular word in a non-generic way. Predicted hot spot: Navajo × idea × animacy is off-diagonal.
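
To make the role of the core tensor concrete, here is a minimal sketch of a plain higher-order SVD (HOSVD) in NumPy. It omits the ℓ1 sparsity penalty on the core, uses a random tensor as a stand-in for real data, and all function names are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    """Matricize T along `mode`: rows index that mode, columns index the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-n product T x_n M, where M has shape (new_dim, T.shape[mode])."""
    out = np.tensordot(M, T, axes=(1, mode))   # contracted axis lands at position 0
    return np.moveaxis(out, 0, mode)

def hosvd(T, ranks):
    """Return (core, factors) with T ≈ core x_1 U0 x_2 U1 x_3 U2."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])               # leading left singular vectors
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)   # project onto each factor basis
    return core, factors

def reconstruct(core, factors):
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

rng = np.random.default_rng(0)
T = rng.normal(size=(6, 4, 5))                 # synthetic (language, word, axis) tensor
core, factors = hosvd(T, ranks=(6, 4, 5))      # full ranks: nothing is discarded
hot = np.unravel_index(np.argmax(np.abs(core)), core.shape)
print("largest core entry at index", hot)
```

With full ranks the decomposition is exact, so the core can be inspected for entanglement without any compression having taken place.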

3b. Operator-Theoretic Translation

Each translation is a bounded linear operator T_l : C → L_l from concept space to lexical space.

ROUND-TRIP DISTORTION
T_m^* T_l : C → C

Kernel of T_m^* T_l = the literally untranslatable subspace between l and m (the composition assumes L_l and L_m are identified in a common embedding space)

THE UNSAYABLE SUBSPACE
U = ∩_l ker(T_l^* T_l)

dim(U) = absolute untranslatability dimension

Because ker(T_l^* T_l) = ker(T_l), U is the orthogonal complement of ∑_l im(T_l^*) inside C: the residue that every language in the panel fails to reach.

INVERSE OF PCA
Represent Tl via paraphrase-back-translation regression. Estimate the kernel via SVD of the fitted operator, keeping the small singular values. The Dimensionality Illusion paper's central point applies inversely: the small singular directions are exactly the directions of interest.
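
A minimal sketch of the kernel step, assuming NumPy and assuming each fitted operator is available as a matrix acting on a shared concept space. Stacking the operators turns the intersection of kernels into a single null space, which the SVD exposes through its smallest singular values. The function name, the tolerance, and the toy operators are all hypothetical.

```python
import numpy as np

def unsayable_subspace(operators, tol=1e-8):
    """Orthonormal basis (as columns) of the intersection of ker(T_l) over all operators."""
    stacked = np.vstack(operators)             # ker(stacked) = intersection of the kernels
    _, s, Vt = np.linalg.svd(stacked, full_matrices=True)
    s_full = np.zeros(Vt.shape[0])             # pad when there are fewer rows than columns
    s_full[: len(s)] = s
    keep = s_full <= tol * s.max()             # keep the SMALL singular directions
    return Vt[keep].T

rng = np.random.default_rng(0)
dim_c = 6                                      # toy concept-space dimension
operators = []
for _ in range(3):                             # three toy "languages"
    T_l = rng.normal(size=(4, dim_c))
    T_l[:, -1] = 0.0                           # none of them can see the last concept axis
    operators.append(T_l)

U = unsayable_subspace(operators)
print("dim(U) =", U.shape[1])
```

For operators fitted from noisy back-translation data the singular values will not be exactly zero, so `tol` becomes a modeling choice rather than a numerical detail.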
3c. Persistent Homology

Treat the L translations as a point cloud in embedding space. Compute the Vietoris–Rips filtration.

CONCEPTUAL HOLES
Persistent H_1 generators = conceptual holes: regions encircled by translations but never landed on.

This is the most rigorous formalization of what is not communicated. Each persistent 1-cycle identifies a region of concept space that translations orbit without occupying: a semantic lacuna supported by topological evidence, relative to the sampled panel.

§4. The Corrected Pipeline (12 Stages)

Every step that compresses or projects the embedding space is gated by a domain-conditional retrieval test on the panel-critical pair set. A step is admissible only if it preserves the operationally critical distinctions; if it does not, it is used only as a diagnostic.

INFORMATION-PRESERVING CORE

| # | Algorithm | Status | Output | PCA-Blindness Notes |
|---|---|---|---|---|
| 1 | Ensemble multilingual embed (LaBSE + BGE-M3 + e5-large) | substantive | x_l ∈ ℝ^d, d ∈ {768, 1024} | Single-model embedding inherits training biases; ensembling reduces idiosyncratic compression. Keep all three; do not reduce. |
| 2 | Hand-encode axis matrix A (two-pass: LLM → linguist audit) | substantive | A ∈ ℝ^{L×d_axis}, d_axis ≈ 30 | This is the field the variance-maximizers will distort. It is the artifact of record. |
| 5 | Procrustes panel (orthogonal Procrustes, pairwise) | substantive | Residual R_lm = ‖X_l − Q_lm X_m‖_F | Does not compress; rotates only. Information-preserving. |
| 6 | Min-cut on rank-correlation graph (Stoer–Wagner / Shi–Malik) | substantive | Cut partition + conductance Φ(S) | Edge weights from Spearman rank correlation, not Euclidean residuals. See §5. |
| 7 | Per-axis JSD (pairwise on hand-coded distributions) | substantive | Per-axis untranslatability scalar | Operates on hand-encoded matrix A directly. PCA-free. |
| 8 | Sinkhorn OT (ε-regularized Wasserstein-2) | substantive | Transport plan π* | Operates in full embedding space; no projection. Immune by construction. |
| 10 | Operator kernels (∩_l ker(T_l^* T_l) over the panel) | substantive | Unsayable subspace U | The most theoretically clean step. Keep SMALL singular values. |
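
As an illustration of the Procrustes stage, the sketch below solves the orthogonal Procrustes problem in closed form via an SVD and returns the Frobenius residual. It assumes NumPy, treats each language as a d × n matrix whose columns are item embeddings, and uses synthetic data; the function name is hypothetical.

```python
import numpy as np

def procrustes_residual(X_l, X_m):
    """Return (R_lm, Q_lm), where Q_lm minimizes ||X_l - Q X_m||_F over orthogonal Q."""
    U, _, Vt = np.linalg.svd(X_l @ X_m.T)      # SVD of the d x d cross-covariance
    Q = U @ Vt                                 # orthogonal; may include a reflection
    return np.linalg.norm(X_l - Q @ X_m), Q    # matrix norm defaults to Frobenius

rng = np.random.default_rng(0)
d, n = 8, 20                                   # toy embedding dimension and item count
X_m = rng.normal(size=(d, n))
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Case 1: language l is an exact rotation of language m, so the residual vanishes.
r_exact, Q = procrustes_residual(Q_true @ X_m, X_m)

# Case 2: extra structure that no rotation can absorb leaves a positive residual.
r_noisy, _ = procrustes_residual(Q_true @ X_m + 0.1 * rng.normal(size=(d, n)), X_m)
print(f"exact: {r_exact:.2e}, noisy: {r_noisy:.2f}")
```

Because Q is orthogonal, no dimension is discarded; whatever the alignment cannot explain stays in the residual.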

DIAGNOSTICS, NOT SUBSTANTIVE

| # | Algorithm | Status | Output | PCA-Blindness Notes |
|---|---|---|---|---|
| 3 | PCA / oblique factor analysis (diagonalize X^T X + Promax) | diagnostic only | Loadings, scree plot | Run it. Look at it. Do not interpret as the field. Annotate with CISA failures. |
| 4 | UMAP (n_neighbors=15, min_dist=0.0) | diagnostic only | 2D coordinates | Preserves local rank, not variance. Validate against SCL on P before trusting. |
| 9 | Tucker decomposition (HOSVD with ℓ1 sparsity on core) | diagnostic only | Core 𝒢 + factor matrices | Same PCA-blindness applies per mode. Use for entanglement diagnosis, not compression. |

GAP DETECTION

| # | Algorithm | Status | Output | PCA-Blindness Notes |
|---|---|---|---|---|
| 12 | LLM gap matrix (per-axis: "can language l encode dimension a?") | substantive | Binary gap matrix | The only step that catches absent concepts, which no geometric method can. |

GATEKEEPERS (applied at every projection)

| Test | Definition | Threshold |
|---|---|---|
| DCRP | Domain-Conditional Retrieval Precision: can a query for an "akushala-style claim" retrieve language l's translation more accurately than chance? | Computed on P, not random pairs |
| SCL | Semantic Coherence Loss: SCL_k = 1 − ρ(sim_full, sim_k), with ρ the Spearman rank correlation over P | SCL_k < 0.10 |
| CISA | Compression-Induced Semantic Aliasing: count of pairs (a, b) ∈ P with sim_full(a, b) < τ_distinct but sim_k(a, b) ≥ τ_alias | CISA_k = 0 |
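
The SCL and CISA definitions translate directly into a small admissibility check. The sketch below assumes NumPy and cosine similarity; the values of τ_distinct and τ_alias are not given in the text, so the defaults here are placeholders, and the four toy vectors are constructed only to trigger one aliasing event. Function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(a, b):
    """Spearman correlation via ranks; ties are broken arbitrarily in this sketch."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def gate(X_full, X_k, pairs, tau_distinct=0.5, tau_alias=0.9):
    """Return (SCL_k, CISA_k, admissible) over the critical pair set."""
    sim_full = np.array([cosine(X_full[a], X_full[b]) for a, b in pairs])
    sim_k = np.array([cosine(X_k[a], X_k[b]) for a, b in pairs])
    scl = 1.0 - spearman(sim_full, sim_k)
    # Aliased: clearly distinct in the full space, near-identical after compression.
    cisa = int(np.sum((sim_full < tau_distinct) & (sim_k >= tau_alias)))
    return scl, cisa, bool(scl < 0.10 and cisa == 0)

# Toy data: items 0 and 1 differ only along the last (low-variance) dimension.
X_full = np.array([
    [1.0, 0.0, 0.0, 3.0],
    [1.0, 0.0, 0.0, -3.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.2, 0.0],
])
pairs = [(0, 1), (2, 3), (0, 2)]               # stand-in for the pair set P

scl_id, cisa_id, ok_id = gate(X_full, X_full, pairs)           # no compression
scl_cut, cisa_cut, ok_cut = gate(X_full, X_full[:, :3], pairs) # drop the last dimension
print(ok_id, ok_cut, cisa_cut)
```

Dropping the last dimension makes items 0 and 1 indistinguishable, so the projection is rejected even though it removes only one coordinate.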
PIPELINE ARCHITECTURE
[INFORMATION-PRESERVING CORE]
 1.  Ensemble multilingual embed           (LaBSE + BGE-M3 + e5-large)
 2.  Hand-encode axis matrix A             (two-pass: LLM → linguist audit)
     Compute panel-critical pair set P     (~30 pairs)
 5.  Procrustes panel                      (rotation only, no compression)
 6.  Min-cut on rank-correlation graph     (Spearman-weighted edges)
 7.  Per-axis JSD                          (operates on A, PCA-free)
 8.  Sinkhorn OT in full embedding space   (no projection)
10.  Operator-kernel intersection          (keep SMALL singular values)

[DIAGNOSTICS, NOT SUBSTANTIVE]
 3.  PCA / oblique FA                      (diagnostic only — annotate CISA failures)
 4.  UMAP                                  (validate against SCL before trusting)
 9.  Tucker w/ ℓ1 sparsity on core         (use for entanglement, not compression)
11.  Persistent H_1 via witness complex    (topological, full embedding)

[GAP DETECTION]
12.  LLM gap matrix                        (catches absences geometry cannot)

[GATEKEEPERS, applied at every projection]
     DCRP-analog on P,  SCL_k < 0.10,  CISA_k = 0

§5. Min-Cut, Corrected

The original sketch built edge weights from the Procrustes residual ‖r_lm‖², a Euclidean quantity dominated by principal-component spread. This is exactly what PCA can already see.

ORIGINAL (WRONG)
w_lm = ‖r_lm‖²   (Euclidean, PCA-dominated)
CORRECTED
w_lm = ρ(rank_l(P), rank_m(P))   (Spearman, rank-preserving)

Here rank_l(P) is the rank order of similarities over the panel-critical pair set as encoded in language l. This weight is invariant under monotone transformations of similarity, so it cannot be inflated by high-variance axes that dominate cosine distances. Since ρ ranges over [−1, 1] and Stoer–Wagner assumes non-negative edge weights, negative correlations need a monotone rescaling, for example (1 + ρ)/2, before the cut is computed.

CONDUCTANCE (replaces raw cut value)
Φ(S) = cut(S, S̄) / min(vol S, vol S̄)

Scale-invariant, aligned with the Laplacian spectral gap.

Run Stoer–Wagner for global min-cut on the rank-correlation graph; normalized cut (Shi–Malik) for soft multi-way partitions.
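
The sketch below builds the rank-correlation graph and finds the minimum-conductance cut. It assumes NumPy, rescales Spearman ρ to (1 + ρ)/2 so that weights are non-negative (an assumption of this sketch), and uses brute-force subset enumeration as a stand-in for Stoer–Wagner or Shi–Malik, which is only feasible for small panels. The data are synthetic and the function names are hypothetical.

```python
import itertools
import numpy as np

def spearman(a, b):
    """Spearman correlation via ranks; ties are broken arbitrarily in this sketch."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def rank_graph(S):
    """S[l] holds language l's similarities over P. Returns a symmetric weight matrix."""
    L = len(S)
    W = np.zeros((L, L))
    for l, m in itertools.combinations(range(L), 2):
        W[l, m] = W[m, l] = (1.0 + spearman(S[l], S[m])) / 2.0   # map [-1, 1] to [0, 1]
    return W

def conductance(W, subset):
    mask = np.zeros(len(W), dtype=bool)
    mask[list(subset)] = True
    cut = W[mask][:, ~mask].sum()              # total weight crossing the partition
    return cut / min(W[mask].sum(), W[~mask].sum())

def best_cut(W):
    """Exhaustive search over bipartitions; a stand-in for a real min-cut solver."""
    n = len(W)
    candidates = (c for r in range(1, n // 2 + 1)
                  for c in itertools.combinations(range(n), r))
    return min(candidates, key=lambda s: conductance(W, s))

rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=30), rng.normal(size=30)   # two synthetic orderings of P
S = np.array([base_a + 0.05 * rng.normal(size=30) for _ in range(3)] +
             [base_b + 0.05 * rng.normal(size=30) for _ in range(3)])
W = rank_graph(S)
best = best_cut(W)
print(best, round(conductance(W, best), 3))
```

Languages that order the critical pairs alike end up on the same side of the cut, regardless of how large their similarity values are.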

WHAT THE CUT NOW MEANS
The boundary across which the panel-critical pair set is ordered differently. Two languages are "close" iff they order the critical pairs the same way, regardless of metric magnitudes. This is the rigorous version of "where translation breaks down."

§6. Sinkhorn OT

Wasserstein-2 between empirical content-word embedding distributions, full dimensionality. The optimal transport plan π* identifies which content-word pairs pay the translation cost. Crucially: OT operates in the original space, no projection. It is immune to PCA-blindness by construction.

OPTIMAL TRANSPORT
W_2^2(μ_l, μ_m) = min_{π ∈ Π(μ_l, μ_m)} ∫ ‖x − y‖² dπ(x, y),   where Π(μ_l, μ_m) is the set of couplings with marginals μ_l and μ_m

ε-regularized Sinkhorn for computational tractability.
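
A minimal sketch of ε-regularized Sinkhorn between two small embedding clouds, assuming NumPy, uniform weights on the content-word slots, and a regularization strength expressed relative to the largest cost (both are assumptions of this sketch). The embeddings are synthetic: two slots are displaced by hand so that the per-slot cost has something to show. The function name is hypothetical.

```python
import numpy as np

def sinkhorn(X, Y, eps=0.05, n_iter=500):
    """Entropic OT plan between uniform measures on the rows of X and Y."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
    K = np.exp(-C / (eps * C.max()))                     # Gibbs kernel, eps relative to scale
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iter):                              # alternate marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :], C

slots = ["enemy", "bad", "people", "idea"]
X = 5.0 * np.eye(4, 16)            # synthetic, well-separated embeddings in language l
Y = X.copy()                       # language m: same geometry, except...
Y[1] += 0.5                        # ..."bad" is displaced (hand-set, illustrative only)
Y[3] += 0.8                        # ..."idea" is displaced further

plan, C = sinkhorn(X, Y)
cost = (plan * C).sum(axis=1)      # transport cost paid by each source slot
for name, c in zip(slots, cost):
    print(f"{name:>6}: {c:.3f}")
```

The row sums of the plan are fixed by the marginal constraint, so the informative quantity is the cost each slot pays under the plan, not the marginal itself.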

Aggregating the transport cost carried by π* over each content-word slot gives the per-word untranslatability budget across the panel:

idea: High (pays disproportionately)
bad: High (pays disproportionately)
people: Low (cheap to transport)
enemy: Low (cheap to transport)

Prediction: idea and bad carry the bulk of the translation cost because they sit at the intersection of substance–process ontology and moral-lexeme texture — the two axes with highest cross-lingual divergence. people and enemy are near-universal; the transport plan barely moves them.

§7. The Cross-Domain Connection

The legal-retrieval failure mode and the cross-lingual decomposition failure mode are the same structural problem: variance captures coarse structure while the long tail carries operationally critical signal.

| Property | Legal Domain | Cross-Lingual Domain |
|---|---|---|
| Canonical pair | Pflichtteil vs elective share | bhāvanā vs idea |
| What PCA sees | Both are "inheritance law": coarse domain match | Both are "mental construct": coarse substance pole |
| What PCA misses | Pflichtteil is mandatory share; elective share is opt-in. Operationally opposite. | bhāvanā is cultivated process; idea is eidetic form. Ontologically incommensurable. |
| Where signal lives | Dimensions 9–16 (variance <2% each) | Dimensions beyond F3 (variance <1% each) |
| Failure mode | CISA: distinct concepts aliased after compression | CISA: distinct ontologies aliased after compression |
| Correct method | Domain-conditional retrieval on critical pair set | Rank-preserving methods on panel-critical pair set |

Core insight: Coarse benchmarks license silent failure on long-tail signal. The mechanism is universal.

THE UNIVERSAL LAW
If the transfer coefficient τ_{D1→D2} between the legal and cross-lingual domains is near-constant, and extends to the medical, time-series, and regime-detection domains, the Dimensionality Illusion hardens from a domain-specific result to a universal law of variance-vs-information divergence in semantic spaces. The corollary applies to any future cross-domain work: do not collapse the long tail without a domain-conditional retrieval test on a panel-critical pair set.