TimberAgent: Gram-Guided Retrieval for Executable Music Effect Control

📅 2026-03-10

🤖 AI Summary

This study addresses the semantic gap between a user's perceptual description of a desired audio effect and the low-level signal-processing parameters exposed by digital audio workstations. To bridge this gap, the authors propose Texture Resonance Retrieval (TRR), a retrieval framework for editable audio effect control. TRR projects intermediate-layer Wav2Vec2 activations and builds Gram matrices from them, so the representation captures co-activation (texture) structure. A query is matched against a knowledge base of presets, and the output is an editable plugin configuration rather than a rendered waveform. The work reportedly introduces Gram-matrix-guided, texture-aware representations into audio effect retrieval for the first time. Evaluation follows Protocol-A, a cross-validation scheme designed to prevent train-test leakage. On a guitar-effects benchmark of 1,063 candidate presets and 204 queries, TRR achieves the lowest normalized parameter error among the evaluated methods, and a listening study with 26 participants provides complementary perceptual evidence.
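
As a rough illustration of the Gram-matrix idea described above, the following sketch builds a time-averaged Gram descriptor from projected frame activations and uses it for nearest-neighbor retrieval. It is not the authors' implementation: the random projection, the cosine similarity, the preset names, and the helper names (`project`, `gram_descriptor`, `retrieve`, `fake_activations`) are all assumptions made for this example, and random numbers stand in for real Wav2Vec2 hidden states.

```python
import math
import random

def project(frames, proj):
    """Map each H-dim frame to D dims with a fixed H x D projection matrix."""
    dims = len(proj[0])
    return [[sum(x * row[d] for x, row in zip(frame, proj)) for d in range(dims)]
            for frame in frames]

def gram_descriptor(frames):
    """Time-averaged Gram matrix G = X^T X / T, flattened to its upper triangle."""
    t, d = len(frames), len(frames[0])
    gram = [[sum(f[i] * f[j] for f in frames) / t for j in range(d)]
            for i in range(d)]
    # G is symmetric, so the upper triangle carries all the information.
    return [gram[i][j] for i in range(d) for j in range(i, d)]

def cosine(a, b):
    """Cosine similarity (an assumed choice of similarity for this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, knowledge_base):
    """Return the preset name whose descriptor is most similar to the query."""
    return max(knowledge_base, key=lambda name: cosine(query, knowledge_base[name]))

# Stand-in data: random numbers replace real Wav2Vec2 hidden states.
rng = random.Random(0)
H, D, T = 16, 4, 50  # hidden size, projected size, frame count (all hypothetical)
proj = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H)]

def fake_activations(tilt):
    """Hypothetical activations whose channel energies depend on `tilt`."""
    return [[rng.gauss(0, 1) * (1 + tilt * h) for h in range(H)] for _ in range(T)]

# Hypothetical preset names; each entry would normally also carry its parameters.
kb = {name: gram_descriptor(project(fake_activations(tilt), proj))
      for name, tilt in [("clean", 0.0), ("crunch", 0.5), ("fuzz", 2.0)]}

query = gram_descriptor(project(fake_activations(2.0), proj))
print(retrieve(query, kb))
```

Because the Gram matrix sums over frames, it discards frame order and keeps only which channels tend to be active together, which is what makes it a texture-style summary rather than a temporal one.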

📝 Abstract

Digital audio workstations expose rich effect chains, yet a semantic gap remains between perceptual user intent and low-level signal-processing parameters. We study retrieval-grounded audio effect control, where the output is an editable plugin configuration rather than a finalized waveform. Our focus is Texture Resonance Retrieval (TRR), an audio representation built from Gram matrices of projected mid-level Wav2Vec2 activations. This design preserves texture-relevant co-activation structure. We evaluate TRR on a guitar-effects benchmark with 1,063 candidate presets and 204 queries. The evaluation follows Protocol-A, a cross-validation scheme that prevents train-test leakage. We compare TRR against CLAP and internal retrieval baselines (Wav2Vec-RAG, Text-RAG, FeatureNN-RAG), using min-max normalized metrics grounded in physical DSP parameter ranges. Ablation studies validate TRR's core design choices: projection dimensionality, layer selection, and projection type. A near-duplicate sensitivity analysis confirms that results are robust to trivial knowledge-base matches. TRR achieves the lowest normalized parameter error among evaluated methods. A multiple-stimulus listening study with 26 participants provides complementary perceptual evidence. We interpret these results as benchmark evidence that texture-aware retrieval is useful for editable audio effect control, while broader personalization and real-audio robustness claims remain outside the verified evidence presented here.
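
The abstract does not spell out the exact form of its min-max normalized metrics, so the sketch below shows one plausible reading: each parameter is scaled to [0, 1] using its physical range, and the absolute errors are averaged. The parameter names, the ranges, and the choice of mean absolute error are assumptions for illustration only.

```python
# Hypothetical physical ranges for three effect parameters (not from the paper).
RANGES = {
    "drive": (0.0, 10.0),
    "mix": (0.0, 1.0),
    "delay_ms": (0.0, 1000.0),
}

def normalized_param_error(pred, target, ranges):
    """Mean absolute error after min-max scaling each parameter to [0, 1]."""
    errors = []
    for name, (lo, hi) in ranges.items():
        span = hi - lo
        p = (pred[name] - lo) / span
        t = (target[name] - lo) / span
        errors.append(abs(p - t))
    return sum(errors) / len(errors)

pred = {"drive": 5.0, "mix": 0.5, "delay_ms": 250.0}
target = {"drive": 7.0, "mix": 0.5, "delay_ms": 500.0}

# Per-parameter scaled errors are about 0.2, 0.0 and 0.25, so the mean is about 0.15.
print(normalized_param_error(pred, target, RANGES))
```

Scaling by the physical range keeps a parameter measured in milliseconds from dominating one measured on a 0 to 1 scale, so errors across heterogeneous DSP parameters become comparable.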

Problem

Research questions and friction points this paper is trying to address.

audio effect control

semantic gap

retrieval

executable configuration

perceptual intent

Innovation

Methods, ideas, or system contributions that make the work stand out.

Texture Resonance Retrieval

Gram matrix

audio effect control