🤖 AI Summary
Accurate prediction of readiness time during sugar cookie drying remains challenging due to strong process dynamics, sparse and heterogeneous multi-source sensor data (e.g., temperature, humidity, airflow), and limited temporal alignment between modalities.
Method: We propose an end-to-end multimodal online prediction framework that jointly encodes in-situ video streams and heterogeneous process parameters. It employs modality-specific CNN encoders and a temporal Transformer decoder, introducing a novel architecture supporting asynchronous cross-modal alignment and lightweight inference.
Contribution/Results: Evaluated on real industrial production data, our method achieves a mean prediction error of only 15 seconds, outperforming state-of-the-art multimodal fusion approaches by 65.69% and a video-only model by 11.30%. The model size is under 5 MB, with inference latency below 200 ms per sample. To the best of our knowledge, this is the first work to enable real-time, high-accuracy readiness prediction for food drying from video and process parameters, balancing precision, computational efficiency, and industrial deployability.
📝 Abstract
Food drying is essential for food production, extending shelf life, and reducing transportation costs. Accurate real-time forecasting of drying readiness is crucial for minimizing energy consumption, improving productivity, and ensuring product quality. However, this remains challenging due to the dynamic nature of drying, limited data availability, and the lack of effective predictive analytical methods. To address this gap, we propose an end-to-end multi-modal data fusion framework that integrates in-situ video data with process parameters for real-time food drying readiness forecasting. Our approach leverages a new encoder-decoder architecture with modality-specific encoders and a transformer-based decoder to effectively extract features while preserving the unique structure of each modality. We apply our approach to sugar cookie drying, where time-to-ready is predicted at each timestamp. Experimental results demonstrate that our model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%. Additionally, our model balances prediction accuracy, model size, and computational efficiency, making it well-suited for heterogeneous industrial datasets. The proposed model is extensible to various other industrial modality fusion tasks for online decision-making.
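To make the prediction target concrete, the sketch below shows one way a per-timestamp time-to-ready label and an average prediction error could be computed. It is only an illustration under stated assumptions: the abstract does not specify how labels are built or which error metric is used, so the mean absolute error, the clipping at zero, the function names `time_to_ready_targets` and `mean_absolute_error`, and all numeric values are hypothetical and are not results from the paper.

```python
from statistics import mean


def time_to_ready_targets(timestamps, ready_time):
    """Return the remaining seconds until readiness at each timestamp.

    Assumption: the label is simply (ready_time - t), clipped at zero
    once the product is ready. The paper may define it differently.
    """
    return [max(ready_time - t, 0.0) for t in timestamps]


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and targets.

    Assumption: "average prediction error" is read here as MAE in seconds.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


# Hypothetical drying run sampled once per minute, ready at t = 180 s.
timestamps = [0.0, 60.0, 120.0, 180.0]
targets = time_to_ready_targets(timestamps, ready_time=180.0)

# Made-up model outputs, used only to exercise the metric.
predictions = [190.0, 110.0, 70.0, 0.0]
error = mean_absolute_error(predictions, targets)
print(targets, error)
```

In an online setting, the target list would be known only after the run ends, whereas the predictions would be emitted at each timestamp as new video frames and process parameters arrive.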