🤖 AI Summary
To address the high computational overhead, latency, and low energy efficiency of proteomics database search and clustering under resource-constrained conditions, this paper proposes a lightweight incremental clustering and highly parallelized search architecture. Methodologically, it leverages mass spectrometry data characteristics to design a pre-clustering-guided incremental clustering strategy, integrated with bucket-level parallelism and dynamic query scheduling. The architecture is implemented on a 7 nm 3T2MTJ SOT-CAM in-memory computing hardware platform to enable compute–memory co-acceleration. Experiments on the human reference proteome (131 GB) show that initializing 2 million spectra consumes only 1.19 mJ, while 1,000 queries consume just 1.1 μJ. Compared to baseline methods, incremental clustering runs 20× faster with only a 0.3% increase in clustering error, bucket-wise hardware parallelism yields a 100× speedup, and search results overlap with mainstream tools by 96%, demonstrating a superior trade-off among accuracy, speed, and energy efficiency.
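The pre-clustering-guided incremental strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes spectra are represented as fixed-length vectors, assigns each incoming spectrum to the nearest pre-computed cluster centroid if it exceeds a similarity threshold, and otherwise opens a new cluster, so only a local update is needed rather than re-clustering from scratch. All names (`THRESHOLD`, `assign_spectrum`, the 0.8 cutoff) are illustrative assumptions.

```python
import math

THRESHOLD = 0.8  # minimum cosine similarity to join an existing cluster (assumed value)

def cosine(a, b):
    """Cosine similarity between two spectrum vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_spectrum(spectrum, centroids):
    """Assign one spectrum to the closest centroid or start a new cluster.

    Mutates `centroids` in place and returns the index of the cluster joined.
    """
    best_idx, best_sim = -1, -1.0
    for i, c in enumerate(centroids):
        sim = cosine(spectrum, c)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_sim >= THRESHOLD:
        # Local re-clustering only: nudge the matched centroid toward the new point.
        centroids[best_idx] = [(cx + sx) / 2
                               for cx, sx in zip(centroids[best_idx], spectrum)]
        return best_idx
    centroids.append(list(spectrum))  # no sufficiently close cluster: open a new one
    return len(centroids) - 1
```

Because each spectrum touches only its nearest cluster, the work per insertion stays small; this locality is what makes the incremental scheme far cheaper than full clustering, at the cost of a small accuracy gap.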
📝 Abstract
Database (DB) search and clustering are fundamental in proteomics, but conventional full clustering and search approaches demand high resources and incur long latency. We propose a lightweight incremental clustering and highly parallelizable DB search platform tailored for resource-constrained environments, delivering low energy and latency without compromising performance. By leveraging mass-spectrometry insights, we employ bucket-wise parallelization and query scheduling to reduce latency. A one-time hardware initialization with pre-clustered proteomics data enables continuous DB search and local re-clustering, offering a more practical and efficient alternative to clustering from scratch. Heuristics from pre-clustered data guide incremental clustering, accelerating the process by 20× with only a 0.3% increase in clustering error. DB search results overlap by 96% with state-of-the-art tools, validating search quality. The hardware leverages a 3T2MTJ SOT-CAM at the 7 nm node with a compute-in-memory design. For the human genome draft dataset (131 GB), setup requires 1.19 mJ for 2M spectra, while a 1,000-query search consumes 1.1 μJ. Bucket-wise parallelization further achieves a 100× speedup.
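The bucket-wise parallelization idea can be illustrated with a small software sketch. This is not the paper's CAM hardware design: it assumes the database is partitioned into buckets keyed by precursor mass, so each query scans only its own bucket (plus neighbors to cover tolerance at bucket edges), and independent queries are dispatched in parallel. The 1 Da bucket width, the 0.05 Da tolerance, and all function names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

BUCKET_WIDTH = 1.0  # Da per bucket (assumed)

def bucket_of(mass):
    """Map a precursor mass to its bucket index."""
    return int(mass // BUCKET_WIDTH)

def build_buckets(db):
    """db: list of (mass, spectrum_id) pairs. Returns {bucket: [(mass, id), ...]}."""
    buckets = {}
    for mass, sid in db:
        buckets.setdefault(bucket_of(mass), []).append((mass, sid))
    return buckets

def search_one(query_mass, buckets, tol=0.05):
    """Return ids of DB entries within tol of query_mass, scanning only nearby buckets."""
    hits = []
    b = bucket_of(query_mass)
    for nb in (b - 1, b, b + 1):  # neighbors handle matches straddling a bucket edge
        for mass, sid in buckets.get(nb, []):
            if abs(mass - query_mass) <= tol:
                hits.append(sid)
    return hits

def search_parallel(queries, buckets):
    """Dispatch independent queries concurrently (buckets are read-only here)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda q: search_one(q, buckets), queries))
```

In the paper's setting the per-bucket scan is what the SOT-CAM array performs as a massively parallel in-memory match, which is where the reported 100× speedup comes from; the sketch only shows the partitioning and scheduling structure.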