Beyond the Massive Text Embedding Benchmark (MTEB): A Topology-Aware Embedder Bake-Off Across Two Corpora
I ran 14 open-weights text encoders (15 evaluation rows; EmbeddingGemma was run twice, once with the same configuration as the other models and once with its prescribed task prefix). The evaluation is multi-layered: cosine geometry plus a 25-seed UMAP bootstrap fed into KeplerMapper, applied to a biomedical corpus (SciCUEval) and a general-knowledge corpus (MMLU non-STEM). The cosine ranking rewards aggressive contrastive training. The topology layer reorders everything: the contrastive-biomedical encoder MedCPT wins continuous-anchor faithfulness outright, and PubMed-pretrained MLMs (masked language models) that 'collapse' in cosine space recover clean topological structure. Architecture barely moves the needle; training regime does.
Cosine leaderboards rank by training objective, not corpus fit. Adding a Mapper-TDA stability layer and linguistic anchors changes which embedder wins. The fourteen-encoder roster spans both training-regime axes (contrastive vs. MLM, general vs. biomedical). The results indicate that the topology ranking is structured by the compatibility of training regime and pooling convention, not by cosine benchmark performance.
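To make the idea of 'collapse' in cosine space concrete, here is a minimal sketch using toy 2-D vectors. It assumes collapse is measured as a high mean pairwise cosine similarity, which is one common proxy and not necessarily the metric used in this bake-off. The function names (`cosine`, `mean_pairwise_cosine`, `nearest_neighbor`) and the toy data are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs.

    Assumption: values near 1.0 are read as a 'collapsed' embedding cloud.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def nearest_neighbor(index, vectors):
    """Index of the vector most cosine-similar to vectors[index]."""
    candidates = [j for j in range(len(vectors)) if j != index]
    return max(candidates, key=lambda j: cosine(vectors[index], vectors[j]))


# Toy data (illustrative only, not taken from either corpus).
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
collapsed = [[1.0, 0.01], [1.0, 0.02], [1.0, 0.03]]

spread_score = mean_pairwise_cosine(spread)
collapsed_score = mean_pairwise_cosine(collapsed)

# Even in the collapsed cloud, relative neighbor order is still defined.
collapsed_nn_of_first = nearest_neighbor(0, collapsed)
```

In this toy case the collapsed cloud scores close to 1.0 on the cosine proxy, yet the neighbor ordering among its points is still intact. That is the kind of structure a neighborhood-based method such as UMAP followed by Mapper can, in principle, still pick up.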