The Mathematics of Untranslatability

Operator kernels, min-cut, Tucker tensors, and persistent homology — a formal pipeline for measuring what cannot be said.

Pipeline Stages · Substantive Methods · Diagnostic-Only Methods · Gatekeeper Tests · ~30 Panel-Critical Pairs

§1. Why PCA Is Wrong (The Binding Constraint)

From the Dimensionality Illusion paper (Thorarinson & Hensgen, 2026): PCA variance is not semantic information. Variance maximization captures directions of largest geometric spread — the coarse inter-domain axes. The fine-grained intra-concept distinctions that this exercise exists to surface live in the long tail of low-variance dimensions. PCA discards them by design.

THE LONG-TAIL PROBLEM

The distinctions between bhāvanā (cultivated mental construct) and 念 (arising thought) and idea (eidetic form) are encoded in dimensions that explain <1% of variance each. These are exactly the dimensions PCA throws away first. Cross-lingual semantic decomposition is the canonical long-tail problem.

Two failure modes apply directly:

Collapse (CISA). Distinct semantic axes that happen to co-vary across most of the panel merge under PCA. Sanskrit bhāvanā and Greek idea project to nearby points on the dominant substance–process axis, even though their underlying ontology of mind is incommensurable.

Distortion. Encodings of the same semantic dimension that use unrelated grammatical machinery, such as the Quechua direct evidential -mi and the Turkish indirect evidential -miş, are torn apart in the principal subspace, even though both realize evidentiality.

COMPRESSION ADMISSIBILITY GATE

Compression admissible at dim k ⇔ SCL_k < 0.10 AND CISA_k = 0 on the panel-critical pair set P

CISA Hot Spots

Pair	Full-Space Relation	PCA-3 Relation	Aliasing Risk
Sanskrit `bhāvanā` ∼ Chinese `念`	Both process-ontology, but bhāvanā is cultivated; 念 is arising	Both at F1− pole, F3 differs subtly	High
Greek `idea` ∼ German `Idee`	Eidetic form vs Kantian regulative concept	Both at F1+ substance pole	High
Quechua evidential ∼ Bulgarian renarrative	Both grammaticalized witness-marking; Quechua 3-way, Bulgarian binary	Both at F2+ commitment pole	Medium
Navajo process-of-thinking ∼ Sanskrit `bhāvanā`	Substance-rejection vs cultivation — opposite philosophical postures	Both at F1−	Severe

PCA confirms what you already knew and hides what you wanted to discover.

§2. The Factor Model (Diagnostic Only)

Encode each translation l as a vector x_l ∈ ℝ^d along ten hand-encoded dimensions (d ≈ 30 after one-hot expansion). Stack the vectors into X ∈ ℝ^(L×d), with L ≈ 50. The first three principal components explain ~62% of variance (28% + 21% + 13%). Their interpretation is reported below as a diagnostic only, not as a substantive description of the semantic field.
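The diagnostic itself is a few lines of linear algebra. The sketch below computes per-component variance shares and the residual left after the first three components. It assumes a synthetic stand-in for X, since the hand-encoded matrix is not reproduced here, and `explained_variance_ratio` is a hypothetical helper name.

```python
import numpy as np

def explained_variance_ratio(X: np.ndarray) -> np.ndarray:
    """Share of total variance carried by each principal component (rows = translations)."""
    Xc = X - X.mean(axis=0)                      # centre every hand-encoded column
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, largest first
    var = s ** 2
    return var / var.sum()

# Synthetic stand-in for the L x d matrix (L = 50, d = 30). Not real panel data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30)) * np.linspace(3.0, 0.3, 30)

ratio = explained_variance_ratio(X)
top3 = ratio[:3].sum()
residual = 1.0 - top3                            # the long tail the factor model ignores
print(f"top-3 share: {top3:.2f}, residual share: {residual:.2f}")
```

The residual share is the quantity the rest of the pipeline cares about: it is where the low-variance distinctions would have to live.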

F1: Substance–Process Axis

28% of variance

+ Idea-as-noun, count morphology, definite articles, copular predication. Greek, Latin, English, German.
− Verbal-noun idea, no plural, ostensive demonstratives, contrastive copula. Classical Chinese, Sanskrit, Navajo.

F2: Speaker-Commitment Axis

21% of variance

+ Evidentiality required, animacy hierarchy, inclusive/exclusive we. Quechua, Navajo, Tagalog, Māori.
− No obligatory evidentiality, no clusivity. English, Mandarin, Russian.

F3: Moral-Lexeme Texture

13% of variance

+ Aesthetic/ignobility fusion. Greek kakos, Turkish kötü.
− Malevolent agency (böse) or technical unskillfulness (akushala).

ENGLISH: THE MAXIMAL UNDERDETERMINATION CORNER

English sits at x_EN ≈ (+substance, −commitment, middle-moral). The diagnostic confirms that English occupies the corner of the panel where grammatical specification is lowest. But everything interesting is in the residual — the 38% of variance the factor model does not explain is precisely the long tail where bhāvanā separates from 念 and Tagalog tayo separates from kami.

§3. Beyond Factors — Three Escalations

3a. Tucker Tensor Decomposition

Form the 3-tensor 𝒯 ∈ ℝ^(L×W×A), where the modes are L = language, W = content-word slot (enemy, bad, people, idea), and A = semantic axis.

TUCKER DECOMPOSITION

𝒯 ≈ 𝒢 ×₁ U^(L) ×₂ U^(W) ×₃ U^(A)

The core tensor 𝒢 exposes entanglement: which language and word components co-load on which axis components? A superdiagonal core corresponds to a clean factor model; large off-diagonal entries indicate that a particular language treats a particular word in a non-generic way. Predicted hot spot: Navajo × idea × animacy is off-diagonal.
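A minimal sketch of the decomposition, assuming a plain truncated HOSVD in NumPy on a synthetic tensor. The ℓ₁ sparsity penalty on the core that the pipeline table mentions is not implemented here, and the function names are illustrative only.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated HOSVD: one orthonormal factor matrix per mode, plus the core tensor."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    # Core = T x1 U_L^T x2 U_W^T x3 U_A^T
    core = np.einsum("lwa,lp,wq,ar->pqr", T, *factors)
    return core, factors

def reconstruct(core, factors):
    """Inverse map: G x1 U_L x2 U_W x3 U_A."""
    return np.einsum("pqr,lp,wq,ar->lwa", core, *factors)

# Synthetic stand-in: 50 languages x 4 content-word slots x 30 axes.
rng = np.random.default_rng(0)
T = rng.normal(size=(50, 4, 30))

core, factors = hosvd(T, ranks=(3, 2, 3))
exact_core, exact_factors = hosvd(T, ranks=T.shape)   # no truncation

# Largest core entry by magnitude: the strongest language/word/axis interaction.
hot = np.unravel_index(np.argmax(np.abs(core)), core.shape)
print("core shape:", core.shape, "largest entry at", hot)
```

Core indices refer to latent components of each mode, so reading an entry as "Navajo × idea × animacy" requires first checking which raw languages, words, and axes load on those components in the factor matrices.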

3b. Operator-Theoretic Translation

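Each translation is modeled as a bounded linear operator T_l : C → L_l from concept space to lexical space. For the compositions below to be well-defined, the lexical spaces L_l are identified with a single shared embedding space.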

ROUND-TRIP DISTORTION

T_m^* T_l : C → C

Kernel of T_m^* T_l = the literally untranslatable subspace between l and m

THE UNSAYABLE SUBSPACE

U = ∩_l ker(T_l^* T_l)

dim(U) = absolute untranslatability dimension

Because ker(T_l^* T_l) = ker(T_l) = im(T_l^*)^⊥, U is equivalently the orthogonal complement of ∑_l im(T_l^*) inside C: the residue that every language in the panel fails to reach.

INVERSE OF PCA

Represent T_l via paraphrase-back-translation regression. Estimate the kernel via SVD of the fitted operator, keeping the small singular values. The Dimensionality Illusion paper's central point applies inversely: the small singular directions are exactly the directions of interest.
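The kernel intersection can be sketched with a single SVD, using the fact that the intersection of the kernels equals the kernel of the vertically stacked operators. The toy operators below are assumptions for illustration (random mixings of coordinate selections), not fitted back-translation regressions, and the helper names are hypothetical.

```python
import numpy as np

def null_space(M, tol=1e-10):
    """Orthonormal basis of ker(M): right singular vectors with singular value <= tol."""
    _, s, Vt = np.linalg.svd(M, full_matrices=True)
    rank = int((s > tol).sum())
    return Vt[rank:].T                      # columns span the kernel

def unsayable_subspace(operators, tol=1e-10):
    """Intersection of ker(T_l) over the panel, via the stacked operator."""
    return null_space(np.vstack(operators), tol)

# Toy panel: concept space of dimension 6, three languages that each express
# only three concept coordinates. Coordinates 4 and 5 are reached by nobody.
rng = np.random.default_rng(0)
I6 = np.eye(6)
expressible = [(0, 1, 2), (1, 2, 3), (0, 2, 3)]
operators = [rng.normal(size=(3, 3)) @ I6[list(idx)] for idx in expressible]

U = unsayable_subspace(operators)
print("dim(U) =", U.shape[1])
```

With operators estimated from data no singular value is exactly zero, so in practice `tol` would become a relative threshold on the small singular values rather than a numerical-zero cutoff.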

3c. Persistent Homology

Treat the L translations as a point cloud in embedding space. Compute the Vietoris–Rips filtration.

CONCEPTUAL HOLES

Persistent H₁ generators = conceptual holes — regions encircled by translations but never landed on.

This is the most rigorous formalization of what is not communicated. Each persistent 1-cycle identifies a region of concept space that translations orbit without occupying: a semantic lacuna whose existence is supported topologically, at the resolution of the sampled panel.

§4. The Corrected Pipeline (12 Stages)

Every step that compresses or projects the embedding space is gated by a domain-conditional retrieval test on the panel-critical pair set. A step is admissible only if it preserves the operationally critical distinctions; if it does not, it is used only as a diagnostic.

INFORMATION-PRESERVING CORE

#	Algorithm	Status	Output	PCA-Blindness Notes
1	Ensemble multilingual embed LaBSE + BGE-M3 + e5-large	substantive	x_l ∈ ℝ^d, d ∈ {768, 1024}	Single-model embedding inherits training biases; ensembling reduces idiosyncratic compression. Keep all three; do not reduce.
2	Hand-encode axis matrix A Two-pass: LLM → linguist audit	substantive	A ∈ ℝ^L×d_axis, d_axis ≈ 30	This is the field the variance-maximizers will distort. It is the artifact of record.
5	Procrustes panel Orthogonal Procrustes pairwise	substantive	Residual R_lm = ‖X_l − Q_lmX_m‖_F	Does not compress; rotates only. Information-preserving.
6	Min-cut on rank-correlation graph Stoer–Wagner / Shi–Malik	substantive	Cut partition + conductance Φ(S)	Edge weights from Spearman rank correlation, not Euclidean residuals. See §5.
7	Per-axis JSD Pairwise on hand-coded distributions	substantive	Per-axis untranslatability scalar	Operates on hand-encoded matrix A directly. PCA-free.
8	Sinkhorn OT ε-regularized Wasserstein-2	substantive	Transport plan π*	Operates in full embedding space; no projection. Immune by construction.
10	Operator kernels ker(T_l^*T_l) ∩ over panel	substantive	Unsayable subspace U	The most theoretically clean step. Keep SMALL singular values.
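Step 5 can be sketched with one SVD. The example assumes a row-vector convention (rows are shared items, columns are embedding dimensions), so the orthogonal map acts on the right as X_m Q, whereas the table writes it on the left; the data are synthetic and `procrustes_residual` is a hypothetical name.

```python
import numpy as np

def procrustes_residual(X_l, X_m):
    """Return (R_lm, Q), where Q minimizes ||X_l - X_m Q||_F over orthogonal Q."""
    U, _, Vt = np.linalg.svd(X_m.T @ X_l)
    Q = U @ Vt                                   # orthogonal: rotation or reflection
    return float(np.linalg.norm(X_l - X_m @ Q)), Q

# Synthetic stand-ins: 12 shared items embedded in 6 dimensions.
rng = np.random.default_rng(0)
X_m = rng.normal(size=(12, 6))
R_true, _ = np.linalg.qr(rng.normal(size=(6, 6)))    # a random orthogonal map
X_rot = X_m @ R_true                                  # same geometry, different frame
X_noisy = X_rot + 0.3 * rng.normal(size=X_rot.shape)  # geometry genuinely differs

r_rot, Q = procrustes_residual(X_rot, X_m)
r_noisy, _ = procrustes_residual(X_noisy, X_m)
print(f"pure rotation: {r_rot:.2e}, perturbed: {r_noisy:.2f}")
```

Because Q is orthogonal, no direction is discarded: whatever residual remains reflects a difference in geometry, not a loss from projection.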

DIAGNOSTICS, NOT SUBSTANTIVE

#	Algorithm	Status	Output	PCA-Blindness Notes
3	PCA / oblique factor analysis Diagonalize X^TX + Promax	diagnostic only	Loadings, scree plot	Run it. Look at it. Do not interpret as the field. Annotate with CISA failures.
4	UMAP n_neighbors=15, min_dist=0.0	diagnostic only	2D coordinates	Preserves local rank, not variance. Validate against SCL on P before trusting.
9	Tucker decomposition HOSVD with ℓ₁ sparsity on core	diagnostic only	Core G + factor matrices	Same PCA-blindness applies per mode. Use for entanglement diagnosis, not compression.

GAP DETECTION

#	Algorithm	Status	Output	PCA-Blindness Notes
12	LLM gap matrix Per-axis: "can language l encode dimension a?"	substantive	Binary gap matrix	The only step that catches absent concepts, which no geometric method can.

GATEKEEPERS (applied at every projection)

Test	Definition	Threshold
DCRP	Domain-Conditional Retrieval Precision — can a query for "akushala-style claim" retrieve language l's translation more accurately than chance?	Computed on P, not random pairs
SCL	Semantic Coherence Loss: SCL_k = 1 − ρ(sim_full, sim_k) via Spearman rank correlation over P	SCL_k < 0.10
CISA	Compression-Induced Semantic Aliasing: count of pairs (a,b) ∈ P with sim_full(a,b) < τ_distinct but sim_k(a,b) ≥ τ_alias	CISA_k = 0
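The SCL and CISA definitions translate directly into code. The sketch below assumes the similarities over P have already been computed in the full space and in the k-dimensional space. The values of `tau_distinct` and `tau_alias` are placeholders, since the source does not fix them, and the toy similarity arrays are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def scl(sim_full, sim_k):
    """Semantic Coherence Loss: 1 - Spearman rho between full and reduced similarities."""
    rho, _ = spearmanr(sim_full, sim_k)
    return 1.0 - rho

def cisa(sim_full, sim_k, tau_distinct, tau_alias):
    """Count pairs that are distinct in the full space but aliased after compression."""
    return int(np.sum((sim_full < tau_distinct) & (sim_k >= tau_alias)))

def admissible(sim_full, sim_k, tau_distinct=0.5, tau_alias=0.9):
    """Gate: SCL_k < 0.10 AND CISA_k = 0. The tau values are hypothetical defaults."""
    ok_rank = scl(sim_full, sim_k) < 0.10
    ok_alias = cisa(sim_full, sim_k, tau_distinct, tau_alias) == 0
    return bool(ok_rank and ok_alias)

# Toy similarities over four critical pairs (invented numbers).
sim_full = np.array([0.20, 0.80, 0.60, 0.40])
sim_k = np.array([0.950, 0.970, 0.960, 0.955])   # same ordering, everything squashed together

print("SCL :", scl(sim_full, sim_k))
print("CISA:", cisa(sim_full, sim_k, 0.5, 0.9))
print("admissible:", admissible(sim_full, sim_k))
```

The toy case shows why both tests are needed: the compressed similarities keep the rank order exactly, so SCL passes, yet two pairs that were distinct in the full space are now aliased, so the gate still rejects the projection.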

PIPELINE ARCHITECTURE

[INFORMATION-PRESERVING CORE]
 1.  Ensemble multilingual embed           (LaBSE + BGE-M3 + e5-large)
 2.  Hand-encode axis matrix A             (two-pass: LLM → linguist audit)
 2b. Compute panel-critical pair set P     (~30 pairs)
 5.  Procrustes panel                      (rotation only, no compression)
 6.  Min-cut on rank-correlation graph     (Spearman-weighted edges)
 7.  Per-axis JSD                          (operates on A, PCA-free)
 8.  Sinkhorn OT in full embedding space   (no projection)
10.  Operator-kernel intersection          (keep SMALL singular values)

[DIAGNOSTICS, NOT SUBSTANTIVE]
 3.  PCA / oblique FA                      (diagnostic only — annotate CISA failures)
 4.  UMAP                                  (validate against SCL before trusting)
 9.  Tucker w/ ℓ₁ sparsity on core           (use for entanglement, not compression)
11.  Persistent H₁ via witness complex        (topological, full embedding)

[GAP DETECTION]
12.  LLM gap matrix                        (catches absences geometry cannot)

[GATEKEEPERS, applied at every projection]
     DCRP-analog on P,  SCL_k < 0.10,  CISA_k = 0

§5. Min-Cut, Corrected

The original sketch built edge weights from the squared Procrustes residual R_lm², a Euclidean quantity dominated by principal-component spread. This is exactly what PCA can already see.

ORIGINAL (WRONG)

w_lm = R_lm² (Euclidean, PCA-dominated)

CORRECTED

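w_lm = ρ(rank_l(P), rank_m(P)) (Spearman, rank-preserving)

Here rank_l(P) is the rank order of similarities over the panel-critical pair set as encoded in language l. The weight is invariant under monotone transformations of similarity, so high-variance axes that dominate cosine distances cannot inflate it unless they change the ordering itself. Because ρ ranges over [−1, 1] while Stoer–Wagner assumes non-negative edge weights, negative correlations must be shifted or clipped, for example to (1 + ρ)/2, before cutting.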

CONDUCTANCE (replaces raw cut value)

Φ(S) = cut(S, S̄) / min(vol S, vol S̄)

Scale-invariant, and tied to the Laplacian spectral gap through Cheeger's inequality.

Run Stoer–Wagner for the global min-cut on the rank-correlation graph; use normalized cut (Shi–Malik) for soft multi-way partitions.
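The graph construction and the conductance score can be sketched as follows. The example uses the two-way spectral relaxation of normalized cut in place of Stoer–Wagner, maps ρ to (1 + ρ)/2 to keep weights non-negative (an assumption, not something the source specifies), and runs on invented rank profiles for six languages.

```python
import numpy as np
from scipy.stats import rankdata

def rank_correlation_weights(S):
    """S[l] holds language l's similarities over P. Returns Spearman-based weights in [0, 1]."""
    ranks = rankdata(S, axis=1)                  # rank each language's similarities
    rho = np.corrcoef(ranks)                     # Pearson on ranks = Spearman
    W = np.clip((1.0 + rho) / 2.0, 0.0, 1.0)     # assumed shift to non-negative weights
    np.fill_diagonal(W, 0.0)
    return W

def conductance(W, mask):
    """Phi(S) = cut(S, S_bar) / min(vol S, vol S_bar)."""
    cut = W[mask][:, ~mask].sum()
    deg = W.sum(axis=1)
    return cut / min(deg[mask].sum(), deg[~mask].sum())

def spectral_bipartition(W):
    """Two-way normalized cut: sign of the second eigenvector of the normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)              # eigenvalues in ascending order
    return d_inv_sqrt * vecs[:, 1] > 0

def swapped(v, i, j):
    w = v.copy()
    w[[i, j]] = w[[j, i]]
    return w

# Invented profiles: languages 0-2 order ten critical pairs one way, 3-5 nearly the reverse.
base = np.arange(10.0)
S = np.array([base, swapped(base, 0, 1), swapped(base, 2, 3),
              base[::-1], swapped(base[::-1], 0, 1), swapped(base[::-1], 2, 3)])

W = rank_correlation_weights(S)
mask = spectral_bipartition(W)
phi = conductance(W, mask)
print("partition:", mask.astype(int), "conductance:", round(float(phi), 4))
```

Only the ordering of each row of S enters the weights, so rescaling any language's similarities by a monotone function leaves the partition unchanged.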

WHAT THE CUT NOW MEANS

The cut is the boundary across which the panel-critical pair set is ordered differently. Two languages are "close" iff they order the critical pairs the same way, regardless of metric magnitudes. This is the rigorous version of "where translation breaks down."

§6. Sinkhorn OT

Compute the Wasserstein-2 distance between the empirical content-word embedding distributions of each language pair, at full dimensionality. The optimal transport plan π* identifies which content-word pairs pay the translation cost. Crucially, OT operates in the original space with no projection, so it is immune to PCA-blindness by construction.

OPTIMAL TRANSPORT

W₂²(μ_l, μ_m) = min_{π ∈ Π(μ_l, μ_m)} ∫ ‖x − y‖² dπ(x, y)

where Π(μ_l, μ_m) is the set of couplings with marginals μ_l and μ_m.

ε-regularized Sinkhorn for computational tractability.
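A minimal Sinkhorn sketch in NumPy, under stated assumptions: the embeddings are synthetic, the displacement given to each word is chosen by hand to mimic the article's prediction (so the output is an illustration, not evidence), and ε is applied to the cost after rescaling by its maximum.

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps=0.05, n_iter=500):
    """Entropy-regularized OT plan between histograms a and b for cost matrix C."""
    K = np.exp(-(C / C.max()) / eps)     # Gibbs kernel on the rescaled cost
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternate scaling to match both marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

words = ["idea", "bad", "people", "enemy"]

# Synthetic embeddings: language m is language l with a hand-picked displacement per word.
X = 5.0 * np.eye(4)
shift = np.array([1.0, 1.0, 0.1, 0.1])   # assumption: idea and bad move most
Y = X + shift[:, None] * np.ones((4, 4))

C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # squared Euclidean cost
a = b = np.full(4, 0.25)

P = sinkhorn_plan(a, b, C)
per_word_cost = (P * C).sum(axis=1)      # cost-weighted row sums of the plan
for w, c in zip(words, per_word_cost):
    print(f"{w:>7}: {c:.3f}")
```

The per-word figure is the row sum of the plan weighted by cost; the unweighted row sums just return the fixed marginal a.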

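The cost-weighted row sums of π*, c_i = ∑_j π*_ij ‖x_i − y_j‖², give the per-word untranslatability budget across the panel (the plain marginals of π* are fixed to μ_l and μ_m and carry no cost information):

Content word	Predicted cost	Note
idea	High	Pays disproportionately
bad	High	Pays disproportionately
people	Low	Cheap to transport
enemy	Low	Cheap to transport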

Prediction: idea and bad carry the bulk of the translation cost because they sit at the intersection of substance–process ontology and moral-lexeme texture — the two axes with highest cross-lingual divergence. people and enemy are near-universal; the transport plan barely moves them.

§7. The Cross-Domain Connection

The legal-retrieval failure mode and the cross-lingual decomposition failure mode are the same structural problem: variance captures coarse structure while the long tail carries operationally critical signal.

Property	Legal Domain	Cross-Lingual Domain
Canonical pair	`Pflichtteil` vs `elective share`	`bhāvanā` vs `idea`
What PCA sees	Both are "inheritance law" — coarse domain match	Both are "mental construct" — coarse substance pole
What PCA misses	Pflichtteil is mandatory share; elective share is opt-in. Operationally opposite.	bhāvanā is cultivated process; idea is eidetic form. Ontologically incommensurable.
Where signal lives	Dimensions 9–16 (variance <2% each)	Dimensions beyond F3 (variance <1% each)
Failure mode	CISA: distinct concepts aliased after compression	CISA: distinct ontologies aliased after compression
Correct method	Domain-conditional retrieval on critical pair set	Rank-preserving methods on panel-critical pair set
Core insight	Coarse benchmarks license silent failure on long-tail signal. The mechanism is universal.

THE UNIVERSAL LAW

If the transfer coefficient τ_D₁→D₂ between the legal and cross-lingual domains is near-constant — and extends to medical, time-series, and regime-detection domains — the Dimensionality Illusion hardens from a domain-specific result to a universal law of variance-vs-information divergence in semantic spaces. The corollary applies to any future cross-domain work: do not collapse the long tail without a domain-conditional retrieval test on a panel-critical pair set.