🤖 AI Summary
This study addresses continuous vocal monitoring of agitation in bipolar disorder on edge devices, which requires disentangling stable speaker identity from dynamic emotional cues. The authors propose MP-IB, a framework that formulates mixed-precision quantization as an information bottleneck: an FP16 head preserves speaker identity, while an INT4 head compactly encodes agitation states. Combined with dynamic precision scheduling and multi-scale temporal fusion, MP-IB achieves an 8× information asymmetry between the two heads without adversarial training. Evaluated on Bridge2AI-Voice, the method attains a correlation coefficient (ρ) of 0.117, outperforming the WavLM-Adapter, β-VAE, and hand-crafted prosody baselines. It also transfers zero-shot to CREMA-D with an AUC of 0.817, keeps identity leakage near random level, and runs with a 617 KB model size and 23.4 ms end-to-end latency, enabling real-time deployment under tightly constrained resources.
📝 Abstract
Continuous monitoring of bipolar disorder agitation via voice biomarkers requires disentangling stable speaker traits from volatile affective states on resource-constrained edge devices. We introduce MP-IB, the first framework to treat mixed-precision quantization as an information bottleneck for clinical trait-state separation. The core insight is that numerical precision itself controls capacity: an FP16 trait head (1,024 bits) encodes speaker identity, while an INT4 state head (128 bits) captures agitation, yielding an 8× information asymmetry without adversarial training. We augment this with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. On Bridge2AI-Voice (N=833, 4 sessions per participant, strict speaker-independent CV), MP-IB achieves ρ = 0.117 (95% CI: [0.089, 0.145], p = 0.003 vs. chance), outperforming a 94M-parameter WavLM-Adapter with in-domain SSL continuation (ρ = -0.042), β-VAE disentanglement (ρ = 0.089), and hand-crafted prosody (ρ = 0.031) by 2.8 to 15.9 points absolute (0.028 to 0.159 in ρ). Zero-shot transfer to CREMA-D achieves AUC = 0.817. Identity leakage is suppressed to near-random levels (EER = 0.42, MIA-AUC = 0.52). End-to-end latency is 23.4 ms with a 617 KB footprint, enabling real-time monitoring on sub-$20 devices.
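The capacity argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the abstract gives the bit budgets (1,024 and 128) but not the head widths, so `TRAIT_DIM = 64` and `STATE_DIM = 32` are assumptions chosen to reproduce those budgets, and the symmetric uniform INT4 quantizer is a generic stand-in rather than the authors' scheme.

```python
import math
import struct

def to_fp16(values):
    """Round each float to IEEE 754 half precision (16 bits per element)."""
    return [struct.unpack("<e", struct.pack("<e", v))[0] for v in values]

def quantize_int4(values):
    """Symmetric per-vector INT4 quantization: integer codes in [-8, 7] plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale

def capacity_bits(num_elements, bits_per_element):
    """Raw storage budget of a head, ignoring the shared quantization scale."""
    return num_elements * bits_per_element

# Hypothetical head widths (not stated in the abstract), picked so that the
# budgets match the reported 1,024-bit and 128-bit figures.
TRAIT_DIM = 64  # 64 elements x 16 bits
STATE_DIM = 32  # 32 elements x 4 bits

trait_bits = capacity_bits(TRAIT_DIM, 16)
state_bits = capacity_bits(STATE_DIM, 4)
asymmetry = trait_bits // state_bits

# Toy activations standing in for the two heads' outputs.
trait_features = to_fp16([math.sin(0.37 * i) for i in range(TRAIT_DIM)])
state_features = [math.cos(0.5 * i) for i in range(STATE_DIM)]
state_codes, state_scale = quantize_int4(state_features)
state_recon = [c * state_scale for c in state_codes]

print(trait_bits, state_bits, asymmetry)  # 1024 128 8
```

Each INT4 element can take at most 16 distinct values, so the state head is forced to discard fine-grained detail, while the FP16 trait head retains far more resolution per element. Under these assumed widths, that difference is what produces the 8× ratio.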