Research Blog

Exploring bioinformatics, data science, and sustainable agriculture through research and practical applications.

Beyond Massive Text Embedding Benchmarks (MTEB): A Topology-Aware Embedder Bake-Off Across Two Corpora

May 8, 2026 · 33 min read

I ran 14 open-weights text encoders (15 evaluation rows; EmbeddingGemma was run twice, once with the same config as the other models and once with its prescribed task prefix) through multi-layer evaluation criteria: cosine geometry plus a 25-seed UMAP bootstrap into KeplerMapper, applied to a biomedical corpus (SciCUEval) and a general-knowledge corpus (MMLU non-STEM). The cosine ranking rewards aggressive contrastive training. The topology layer reorders everything. The contrastive-biomedical encoder MedCPT wins continuous-anchor faithfulness outright. PubMed-pretrained MLMs (masked language models) that 'collapse' in cosine space recover clean topological structure. Architecture barely moves the needle; training regime does.

Cosine leaderboards rank by training objective, not corpus fit. Adding a Mapper-TDA stability layer and linguistic anchors changes which embedder wins. The fourteen-encoder roster spans both training-regime axes (contrastive vs MLM, general vs biomedical). Results indicate that the topology ranking is structured by training-regime and pooling-convention compatibility, not by cosine benchmark performance.

embeddings, topological data analysis, natural language processing, TDA, Mapper, BERT, biomedical NLP, scientific NLP, model evaluation, MTEB

Running an Academic Lab is Not Running a Small Business

April 30, 2026 · 20 min read

Academic principal investigators are often described as running 'small businesses' because they manage budgets, hire staff, and sell a vision. As someone who runs an actual small business, I find the analogy superficially appealing but structurally misleading. This post lays out where the comparison breaks down: who bears risk, who the customer is, what failure costs, and how time horizons differ. It is not a takedown of academia or of entrepreneurship, both of which are demanding in their own right, but an argument for retiring a comparison that flatters neither side when examined closely.

The PI-as-entrepreneur analogy flatters both sides but obscures where the real risks, incentives, and accountability actually live.

academia, entrepreneurship, small-business, research-management, incentives

Self-Hosting is a Beautiful Black Hole

January 26, 2026 · 7 min read

Self-hosting—running your own services on hardware you control—offers a DIY path to digital autonomy. From firewalls and home automation to NAS systems and git servers, the spectrum of possibilities is vast. This post explores what self-hosting means, shares personal experiences from academia to home labs, and honestly examines the trade-offs. Whether you want to fight platform enshittification, own your data, or just tinker with hardware for fun, self-hosting rewards curiosity and patience in equal measure.

How far do you want to take your DIY approach to services?

infrastructure, IT, data-science, bioinformatics, programming-languages, computational-biology, biotechnology

Bioinformatics Code Rot: Do We Have an Abandonware Problem?

December 5, 2025 · 17 min read

The bioinformatics community faces a sustainability crisis: graduate students must publish novel tools to graduate, but no one funds long-term maintenance. This creates an ecosystem of abandonware that breaks microservices architectures and wastes researcher time. Without wholesale funding reform, we need to raise our engineering standards and lower the bar for what counts as 'publishable.' This post examines what bioinformatics can learn from mature open source communities and offers practical steps toward more sustainable software development.

Graduate students need novel tools to publish. Labs lack funding for maintenance. How can we build sustainable bioinformatics software without fixing the incentive structure?

scientific computing, bioinformatics, abandonware, computational biology, software engineering

The Cost of 'Waiting for the Data': Why Curiosity Without Guardrails Undermines Research

November 10, 2025 · 11 min read

In fields like microbial ecology, agriculture, and molecular biology, the practice of 'waiting to see what the data say' is often justified as embracing discovery. But without rigorous upfront design—particularly computational modeling to guide measurement strategies and statistical approaches—this flexibility masks p-hacking and arbitrary choices. This post argues that modeling, preregistration, and honest distinction between exploratory and confirmatory work aren't constraints on discovery; they're prerequisites for credible science. Targeted at computational professionals and data scientists, it offers practical guidance for designing rigorous studies, collaborating with wet-lab teams, and building institutional cultures around methodological transparency.

Arbitrary experimental choices disguised as data-driven discovery lead to p-hacking, waste, and irreproducible results. How modeling and preregistration protect scientific integrity in computational biology, microbial ecology, and agriculture.

design-of-experiments, data-science, bioinformatics, preregistration, computational-biology, statistical-methods, reproducibility

Beyond Programming Language Maximalism in Data Science and Bioinformatics: The Case for Polyglot Programming

October 2, 2025 · 14 min read

An exploration of language maximalism in scientific computing and the benefits of strategic polyglot programming approaches.

Why choosing the right tool for each job beats forcing everything through your favorite programming language.

data-science, bioinformatics, programming-languages, computational-biology, biotechnology

Hello Data Blog!

July 30, 2025 · 3 min read

This inaugural post introduces my research focus areas and the types of content you can expect from this blog, covering bioinformatics methodologies, data science writ large, and computational methods in biotechnology.

Introduction to my data science journey and what you can expect from this blog about computational biology, data science, and biotechnology.

data-science, bioinformatics, introduction, computational-biology, agriculture, biotechnology