🤖 AI Summary
This work addresses the challenge of detecting AI-assisted behavior in abstract, complex tasks—such as text generation, medical diagnosis, and autonomous driving—where such assistance is often latent and difficult to identify. We propose a multimodal representation framework that uniformly encodes unstructured human–AI interaction data (e.g., action logs, clickstreams) into both image and time-series formats. Specifically, we design four novel image encoding strategies and one time-series encoding mechanism to explicitly model users’ “exploration–exploitation” behavioral patterns. We further develop dedicated CNN, RNN, and parallel fusion architectures to enable cross-modal feature co-learning. Experiments demonstrate that our framework significantly improves generalization capability under low signal-to-noise ratios and few-shot conditions, achieving state-of-the-art performance across multiple real-world tasks. The approach provides a generic, robust, and interpretable technical pathway for identifying AI-assisted behavior in interactive systems.
📝 Abstract
Detecting assistance from artificial intelligence is increasingly important as AI systems become ubiquitous across complex tasks such as text generation, medical diagnosis, and autonomous driving. Aid detection is challenging for humans, especially when the task data is abstract. Artificial neural networks excel at classification thanks to their ability to quickly learn from and process large amounts of data -- assuming appropriate preprocessing. We pose detecting AI assistance as a classification task for such models. Much of the research in this space examines the classification of complex but concrete data, such as images; many AI-assistance detection scenarios, however, produce data that is not machine learning-friendly. We demonstrate that common models can effectively classify such data when it is appropriately preprocessed. To do so, we construct four distinct neural network-friendly image formulations, along with an additional time-series formulation that explicitly encodes users' exploration/exploitation behavior and thus generalizes to other abstract tasks. We benchmark the quality of each image formulation across three classical deep learning architectures, along with a parallel CNN-RNN architecture that leverages the additional time series to maximize test performance, showcasing the importance of encoding temporal and spatial quantities when detecting AI aid in abstract tasks.
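To make the dual encoding idea concrete, here is a minimal sketch of how an unstructured interaction log could be turned into both an image and a time series. This is an illustrative assumption, not the paper's actual formulations: the function name `encode_click_log`, the visit-count grid, and the step-distance series (a rough proxy for exploration vs. exploitation, where large steps suggest exploring and small steps suggest exploiting) are all hypothetical choices for demonstration.

```python
import numpy as np

def encode_click_log(clicks, grid=8):
    """Hypothetical preprocessing: turn a 2D click log (pairs of (x, y)
    coordinates in [0, 1)) into two model-friendly views:
      * an image  -- a grid of per-cell visit counts (spatial view), and
      * a series  -- distances between consecutive clicks (temporal view),
        a crude exploration/exploitation signal: large jumps ~ exploration,
        small local moves ~ exploitation.
    """
    clicks = np.asarray(clicks, dtype=float)
    # Spatial view: count how often each grid cell was visited.
    image = np.zeros((grid, grid))
    cells = np.minimum((clicks * grid).astype(int), grid - 1)
    for x, y in cells:
        image[y, x] += 1
    # Temporal view: Euclidean step length between consecutive clicks.
    series = np.linalg.norm(np.diff(clicks, axis=0), axis=1)
    return image, series

# Two small local moves followed by one large jump across the canvas.
img, ts = encode_click_log([(0.10, 0.10), (0.12, 0.10), (0.90, 0.90)])
```

Under this sketch, the image would feed a CNN branch and the step-length series an RNN branch, with the two branches fused for classification, mirroring the parallel CNN-RNN arrangement the abstract describes.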