🤖 AI Summary
To address the trust gap in LLM inference services—where users cannot verify whether the deployed model, prompt, or quantization precision matches the provider’s claims—this paper proposes the first third-party-free, lightweight inference verification framework. Methodologically, it combines a compact locality-sensitive hash of intermediate activations with a polynomial encoding scheme for the resulting commitments; verification is robust across hardware configurations, GPU types, and algebraic reorderings, and detects unauthorized changes to the model, prompt, or quantization precision with 100% accuracy (zero false positives, zero false negatives). Empirical evaluation on Llama-3.1-8B-Instruct shows verification running substantially faster than the original inference, with a storage overhead of only 258 bytes per 32 tokens—roughly three orders of magnitude less than storing the raw token embeddings. The core contribution is the first end-to-end LLM inference verification scheme that simultaneously achieves high accuracy, low overhead, and strong robustness.
📝 Abstract
Large language models (LLMs) have proven to be very capable, but access to the best models currently relies on inference providers, which introduces trust challenges -- how can we be sure that the provider is using the model configuration they claim? We propose TOPLOC, a novel method for verifiable inference that addresses this problem. TOPLOC leverages a compact locality-sensitive hashing mechanism for intermediate activations that can detect unauthorized modifications to models, prompts, or precision with 100% accuracy, achieving no false positives or negatives in our empirical evaluations. Our approach is robust across diverse hardware configurations, GPU types, and algebraic reorderings, which allows for validation speeds significantly faster than the original inference. By introducing a polynomial encoding scheme, TOPLOC reduces the memory overhead of the generated commits by $1000\times$, requiring only 258 bytes of storage per 32 new tokens compared to the 262KB required to store the token embeddings directly for Llama-3.1-8B-Instruct. Our method empowers users to verify LLM inference computations efficiently, fostering greater trust and transparency in open ecosystems and laying a foundation for decentralized and verifiable AI services.
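The core idea—committing to a locality-sensitive fingerprint of intermediate activations that tolerates benign numerical noise (e.g. from hardware or algebraic reordering) but rejects different models or prompts—can be illustrated with a deliberately simplified sketch. The function names, the top-k fingerprint, and the integer rounding below are illustrative assumptions, not TOPLOC's actual hashing or polynomial encoding:

```python
import random

def commit(activations, k=8, scale=100):
    """Toy commitment (a simplification, not TOPLOC's actual scheme):
    keep the indices and coarsely rounded values of the top-k activation
    magnitudes as a locality-sensitive fingerprint."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: -abs(activations[i]))
    return [(i, round(activations[i] * scale)) for i in ranked[:k]]

def verify(activations, commitment, k=8, scale=100, tol=1):
    """Accept iff every committed (index, value) pair reappears, within a
    small rounding tolerance, in a wider top-4k recomputation (the slack
    absorbs benign rank shuffling near the cut-off)."""
    recomputed = dict(commit(activations, 4 * k, scale))
    return all(abs(recomputed.get(i, 10**9) - v) <= tol
               for i, v in commitment)

rng = random.Random(0)
acts = [rng.gauss(0, 1) for _ in range(2048)]
c = commit(acts)

# Tiny numerical noise (stand-in for hardware/reordering effects) still verifies:
noisy = [a + rng.gauss(0, 1e-5) for a in acts]
assert verify(noisy, c)

# Activations from a different computation do not:
other = [rng.gauss(0, 1) for _ in range(2048)]
assert not verify(other, c)
```

The storage savings in the abstract follow the same logic: a commitment stores only a handful of (index, rounded value) pairs per decoding step instead of the full hidden-state vector, which is how 262KB of embeddings compresses to a 258-byte commit per 32 tokens.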