TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models are costly to deploy because of their high memory and compute demands. Existing ternarization methods compress the weights but keep activations in high precision, because heavy-tailed activation distributions are hard to quantize, so they cannot deliver end-to-end low-bit inference. To overcome this, the authors propose TWLA, a post-training quantization framework that combines 1.58-bit (ternary) weights with 4-bit activations (W1.58A4) while maintaining high accuracy and delivering significant inference acceleration. The approach has three components: a Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), Kronecker Orthogonal Tri-Modal Shaping (KOTMS), and Inter-Layer Aware Activation Mixed Precision (ILA-AMP). Together, these components jointly optimize weight ternarization and activation quantization.
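As a brief aside on notation, the "1.58-bit" figure in W1.58A4 is the information content of a weight that can take one of three levels. The arithmetic below is a general fact about three-valued codes, not a result from the paper. The FP16 baseline and the resulting ratio are illustrative assumptions: they are an idealized bound that ignores scales, zero-points, and packing overhead.

```latex
\[
  \log_2 3 \approx 1.585 \ \text{bits per weight},
  \qquad
  \frac{16 \ \text{bits (FP16)}}{\log_2 3 \ \text{bits}} \approx 10.1\times
  \ \text{(idealized weight compression)}
\]
```

In practice, storage formats round this up. For example, packing five ternary values into one byte is possible because \(3^5 = 243 \le 256\), which gives 1.6 bits per weight.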

📝 Abstract

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at <https://github.com/Kishon-zzx/TWLA>.
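The ideas in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation. It uses a random Kronecker-structured orthogonal matrix as a stand-in for KOTMS, a plain symmetric absmean ternarizer as a stand-in for E2M-ATQ, and uniform per-token 4-bit quantization with no mixed-precision allocation in place of ILA-AMP. All function names, shapes, and the injected outlier channel are assumptions made for illustration. The sketch shows only two things: a shared orthogonal rotation applied to both activations and weights leaves the full-precision layer output unchanged, and the rotated tensors can then be quantized to ternary weights and 4-bit activations.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def ternarize_absmean(w):
    """Symmetric absmean ternarization: w ~= scale * codes, codes in {-1, 0, +1}.
    Assumption: a simple stand-in, not the paper's E2M-ATQ."""
    scale = np.mean(np.abs(w)) + 1e-12
    codes = np.clip(np.round(w / scale), -1, 1)
    return codes, scale

def quantize_int4_per_token(x):
    """Symmetric per-token 4-bit quantization: x ~= scale * codes, codes in [-7, 7]."""
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / 7.0 + 1e-12
    codes = np.clip(np.round(x / scale), -7, 7)
    return codes, scale

def quantized_output(x, w):
    """Layer output computed from 4-bit activations and ternary weights."""
    a_codes, a_scale = quantize_int4_per_token(x)
    w_codes, w_scale = ternarize_absmean(w)
    return (a_scale * a_codes) @ (w_scale * w_codes)

def rel_error(approx, exact):
    """Relative Frobenius-norm error."""
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
tokens, d_in, d_out = 16, 64, 32

# Toy activations with one heavy outlier channel (hypothetical data).
X = rng.standard_normal((tokens, d_in))
X[:, 3] *= 50.0
W = 0.02 * rng.standard_normal((d_in, d_out))

# Kronecker product of two 8 x 8 orthogonal matrices is a 64 x 64 orthogonal matrix.
Q = np.kron(random_orthogonal(8, rng), random_orthogonal(8, rng))

# Shared rotation: (X Q)(Q^T W) = X W, so the full-precision output is unchanged.
X_rot, W_rot = X @ Q, Q.T @ W
Y = X @ W

err_plain = rel_error(quantized_output(X, W), Y)
err_rot = rel_error(quantized_output(X_rot, W_rot), Y)
print(f"relative output error without rotation: {err_plain:.3f}")
print(f"relative output error with rotation:    {err_rot:.3f}")
```

The printed errors depend on the random draw and on this toy setup, so they say nothing about the accuracy reported in the paper. The Kronecker structure matters mainly for cost: the rotation is defined by two small factors rather than one dense `d_in x d_in` matrix.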

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Ternary Weights

Low-Bit Activations

Post-Training Quantization

Activation Outliers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ternary Quantization

Post-Training Quantization

Low-Bit Activation