What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the trust crisis in AI training and inference, spanning integrity, privacy, robustness, and bias, by proposing ConceptLens, a unified attribution framework grounded in concept shift. ConceptLens enables holistic integrity assessment and causal explanation across the data, model, and societal layers. Methodologically, it combines pretrained multimodal models, concept probes, concept importance attribution, and adversarial concept perturbation to localize threats such as data poisoning, covert ad injection, privacy leakage, and societal bias; to detect model over-reliance on misleading concepts; to quantify sociocultural bias in generated content; and to expose how even safety-aligned training and inference data can be inadvertently misused, challenging prevailing alignment assumptions. Experiments demonstrate accurate risk identification, early filtering of hazardous samples, and actionable strategies for strengthening trust.
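The page does not include code, so as a rough, hypothetical sketch of the concept-shift idea (not the authors' ConceptLens implementation), the snippet below uses a pretrained CLIP model as the multimodal backbone and flags probe samples whose concept distribution drifts from a clean reference profile. The model name, concept vocabulary, and drift threshold are all illustrative assumptions.

```python
# Hypothetical sketch of concept-shift probing; NOT the authors' code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"       # assumed backbone
CONCEPTS = ["a dog", "a cat", "a watermark",      # toy concept vocabulary
            "an advertisement logo"]

model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def concept_distribution(images):
    """Softmax over image-concept similarities for a list of PIL images."""
    inputs = processor(text=CONCEPTS, images=images,
                       return_tensors="pt", padding=True)
    return model(**inputs).logits_per_image.softmax(dim=-1)

@torch.no_grad()
def concept_shift(reference_images, probe_image, threshold=0.2):
    """Flag concepts whose probe mass drifts from the clean reference profile."""
    ref = concept_distribution(reference_images).mean(dim=0)  # clean profile
    probe = concept_distribution([probe_image])[0]
    drift = (probe - ref).abs()                               # per-concept drift
    return [c for c, d in zip(CONCEPTS, drift.tolist()) if d > threshold]

# Usage: screen a suspect training sample against trusted references.
# refs = [Image.open(p) for p in clean_paths]
# flagged = concept_shift(refs, Image.open("suspect.png"))
```

A large drift toward a concept such as "an advertisement logo" would be the kind of malicious concept shift the summary describes for covert ad injection.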

📝 Abstract
The growing adoption of artificial intelligence (AI) has amplified concerns about trustworthiness, including integrity, privacy, robustness, and bias. To assess and attribute these threats, we propose ConceptLens, a generic framework that leverages pre-trained multimodal models to identify the root causes of integrity threats by analyzing Concept Shift in probing samples. ConceptLens demonstrates strong detection performance for vanilla data poisoning attacks and uncovers vulnerabilities to bias injection, such as the generation of covert advertisements through malicious concept shifts. It identifies privacy risks in unaltered but high-risk samples, filters them before training, and provides insights into model weaknesses arising from incomplete or imbalanced training data. Additionally, at the model level, it attributes concepts that the target model is overly dependent on, identifies misleading concepts, and explains how disrupting key concepts negatively impacts the model. Furthermore, it uncovers sociological biases in generative content, revealing disparities across sociological contexts. Strikingly, ConceptLens reveals how safe training and inference data can be unintentionally and easily exploited, potentially undermining safety alignment. Our study yields actionable insights for building trust in AI systems, thereby accelerating adoption and driving greater innovation.
Problem

Research questions and friction points this paper is trying to address.

Identifying root causes of AI integrity threats through Concept Shift analysis
Detecting vulnerabilities to bias injection and privacy risks in training data
Uncovering sociological biases and model dependencies in generative AI content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the ConceptLens framework, which attributes integrity threats to concept shift
Leverages pre-trained multimodal models as concept probes and for concept importance attribution (a minimal sketch follows this list)
Identifies and filters privacy-risk samples before training
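For the attribution piece, a similarly hypothetical sketch: assuming a downstream classifier over CLIP image embeddings, each concept's text embedding is treated as a direction, and a concept's importance is scored by how much the top logit drops when that direction is projected out of the image embedding. The `classifier` and embedding inputs are stand-ins, not the paper's actual attribution method.

```python
# Hypothetical concept-importance attribution by direction ablation;
# not the paper's method. `classifier` maps an embedding to class logits.
import torch

@torch.no_grad()
def concept_importance(classifier, image_emb, concept_embs):
    """Score concepts by the logit drop when their direction is ablated."""
    image_emb = image_emb / image_emb.norm()
    base = classifier(image_emb).max()
    scores = {}
    for name, c in concept_embs.items():           # one text embedding per concept
        c = c / c.norm()
        ablated = image_emb - (image_emb @ c) * c  # project the concept out
        scores[name] = (base - classifier(ablated).max()).item()
    return scores  # larger = the model leans harder on that concept
```

A concept with a high score but no causal relevance to the task would be the kind of misleading dependency the summary says ConceptLens surfaces.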
🔎 Similar Papers
No similar papers found.
Jiamin Chang
University of New South Wales & CSIRO’s Data61, Sydney, Australia
Haoyang Li
Macquarie University, Sydney, Australia
Hammond Pearce
Senior Lecturer (a.k.a. Assistant Prof), UNSW School of Computer Science and Engineering
Cybersecurity · Embedded Systems · Hardware design · Large Language Models
Ruoxi Sun
CSIRO’s Data61, Sydney, Australia
Bo Li
University of Illinois at Urbana–Champaign, United States
Minhui Xue
CSIRO’s Data61, Sydney, Australia