AI Summary
This study addresses the challenge that existing biomedical language models often conflate spurious correlations with genuine causal relationships in cross-domain semantic similarity assessment, leading to flawed causal reasoning. To mitigate this, the authors propose a two-stage embedding optimization framework that integrates knowledge-graph-guided hard negative mining (BODHI) with contrastive learning to reshape the geometric structure of the embedding space, thereby enhancing causal discriminability. An unexpected finding reveals that FP16 precision outperforms INT8 on Intel AMX hardware. Leveraging OpenVINO and AMX instruction sets, the approach achieves highly efficient inference: cross-domain separation improves to 2.30×, BIOSSES correlation reaches 0.828, per-query latency drops to 10 ms (a 133× speedup), and throughput attains 555 sentences per second.
Abstract
Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%.
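To make the failure concrete, the following is a minimal sketch of how a cosine score and a cross-domain discrimination accuracy could be computed. The vectors are toy values rather than real encoder outputs, and the 0.5 cutoff and the `discrimination_accuracy` helper are assumptions made for illustration; the abstract does not state the criterion used in the benchmark.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def discrimination_accuracy(unrelated_scores, threshold=0.5):
    """Fraction of unrelated pairs scored below the cutoff.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    below = sum(1 for s in unrelated_scores if s < threshold)
    return below / len(unrelated_scores)

# Toy vectors standing in for encoder outputs (not real embeddings).
cortisol = [0.9, 0.4, 0.1]
stocks = [0.8, 0.5, 0.3]
print(cosine_similarity(cortisol, stocks))

# Scores spanning the 0.76 to 0.92 range reported above all miss the cutoff.
print(discrimination_accuracy([0.76, 0.83, 0.92]))  # 0.0
```

Under any cutoff below 0.76, every unrelated pair in the reported range counts as a miss, which is one way a 0% accuracy figure can arise.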
Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness.
We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
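The idea of mining hard negatives from absent knowledge-graph edges can be sketched as follows. This is not the released BODHI generator: the toy graph, the `mine_hard_negatives` helper, the similarity table, and the rule that "hard" means "scored highest by the current encoder" are all assumptions made for illustration.

```python
from itertools import combinations

def mine_hard_negatives(nodes, edges, similarity, top_k=2):
    """Return unlinked node pairs that the encoder currently scores highest.

    A pair is a candidate negative when the graph has no edge between its
    members; ranking by similarity is an assumed hardness criterion.
    """
    linked = {frozenset(edge) for edge in edges}
    candidates = [
        (a, b)
        for a, b in combinations(nodes, 2)
        if frozenset((a, b)) not in linked
    ]
    candidates.sort(key=lambda pair: similarity(*pair), reverse=True)
    return candidates[:top_k]

# Toy graph invented for this example (not taken from the paper).
nodes = ["cortisol", "stress", "insomnia", "stock volatility"]
edges = [("cortisol", "stress"), ("stress", "insomnia")]

# Hypothetical encoder scores for the unlinked pairs.
toy_scores = {
    frozenset(("cortisol", "stock volatility")): 0.83,
    frozenset(("stress", "stock volatility")): 0.79,
    frozenset(("cortisol", "insomnia")): 0.60,
    frozenset(("insomnia", "stock volatility")): 0.40,
}

def toy_similarity(a, b):
    return toy_scores[frozenset((a, b))]

hard = mine_hard_negatives(nodes, edges, toy_similarity)
print(hard)
```

The selected pairs would then serve as negatives in a contrastive pass, so that pairs the graph does not link but the encoder places close together are pushed apart.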