Multi-Modal Fusion of In-Situ Video Data and Process Parameters for Online Forecasting of Cookie Drying Readiness

📅 2025-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Accurately predicting readiness time during sugar cookie drying remains challenging because of strong process dynamics, sparse and heterogeneous multi-source sensor data (e.g., temperature, humidity, airflow), and limited temporal alignment between modalities. Method: We propose an end-to-end multimodal online prediction framework that jointly encodes in-situ video streams and heterogeneous process parameters. It pairs modality-specific CNN encoders with a temporal Transformer decoder in a novel architecture that supports asynchronous cross-modal alignment and lightweight inference. Contribution/Results: Evaluated on real industrial production data, our method achieves a mean prediction error of only 15 seconds, improving over state-of-the-art multimodal approaches by 65.69% and over vision-only models by 11.30%. The model size is under 5 MB, with inference latency below 200 ms per sample. To the best of our knowledge, this is the first work to enable real-time, high-accuracy readiness prediction for food drying driven by both video and process parameters, balancing precision, computational efficiency, and industrial deployability.
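The summary mentions limited temporal alignment between video and sensor streams. As a rough illustration of what such alignment can involve, the sketch below pairs each video frame with the most recent sensor reading at or before it. This is a generic last-observation-carried-forward scheme, not the paper's alignment mechanism; the function name `align_to_frames` and all timestamps and readings are hypothetical.

```python
from bisect import bisect_right


def align_to_frames(frame_times, sensor_times, sensor_values):
    """For each frame time, return the latest sensor value at or before it.

    Assumes sensor_times is sorted ascending. Returns None for frames that
    precede the first sensor reading. Hypothetical helper, not from the paper.
    """
    aligned = []
    for t in frame_times:
        # Number of sensor samples whose timestamp is <= t.
        i = bisect_right(sensor_times, t)
        aligned.append(sensor_values[i - 1] if i > 0 else None)
    return aligned


# Made-up example data: frames every second, temperature logged irregularly.
frame_times = [0.0, 1.0, 2.0, 3.0]   # seconds
sensor_times = [0.5, 2.0]            # seconds
sensor_values = [61.0, 63.5]         # degrees Celsius

print(align_to_frames(frame_times, sensor_times, sensor_values))
# → [None, 61.0, 63.5, 63.5]
```

Carrying the last reading forward avoids using sensor values from the future, which matters when the forecast has to be produced online.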

📝 Abstract

Food drying is essential for food production, extending shelf life, and reducing transportation costs. Accurate real-time forecasting of drying readiness is crucial for minimizing energy consumption, improving productivity, and ensuring product quality. However, this remains challenging due to the dynamic nature of drying, limited data availability, and the lack of effective predictive analytical methods. To address this gap, we propose an end-to-end multi-modal data fusion framework that integrates in-situ video data with process parameters for real-time food drying readiness forecasting. Our approach leverages a new encoder-decoder architecture with modality-specific encoders and a transformer-based decoder to effectively extract features while preserving the unique structure of each modality. We apply our approach to sugar cookie drying, where time-to-ready is predicted at each timestamp. Experimental results demonstrate that our model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%. Additionally, our model balances prediction accuracy, model size, and computational efficiency, making it well-suited for heterogeneous industrial datasets. The proposed model is extensible to various other industrial modality fusion tasks for online decision-making.
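To make the evaluation terms concrete, the sketch below shows one way to build per-timestamp time-to-ready targets and to score predictions. It assumes the "average prediction error" is a mean absolute error in seconds and that the reported percentages are relative error reductions; neither is confirmed by the abstract. All function names and numbers are illustrative, not the paper's data.

```python
def time_to_ready(timestamps, ready_time):
    """Remaining time until readiness at each timestamp (clipped at zero)."""
    return [max(ready_time - t, 0.0) for t in timestamps]


def mean_absolute_error(pred, target):
    """Average absolute gap between predictions and targets."""
    return sum(abs(p - y) for p, y in zip(pred, target)) / len(target)


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to a baseline (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Made-up run: the batch is ready at 180 s, sampled once per minute.
targets = time_to_ready([0, 60, 120, 180], ready_time=180)
predictions = [200, 110, 70, 0]  # hypothetical model outputs in seconds

print(targets)                                    # → [180, 120, 60, 0]
print(mean_absolute_error(predictions, targets))  # → 10.0
print(relative_improvement(20.0, 15.0))           # → 25.0
```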

Problem

Research questions and friction points this paper is trying to address.

Real-time forecasting of food drying readiness

Multi-modal fusion of video and process data

Reducing prediction error for industrial efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal fusion of video and process data

Encoder-decoder with modality-specific feature extraction

Transformer-based decoder for real-time forecasting
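The three points above can be sketched as a small PyTorch model. This is a minimal illustration under stated assumptions, not the authors' architecture: the class name `FusionForecaster`, the layer sizes, the choice to use process-parameter tokens as decoder queries over video tokens, and the causal masking are all hypothetical.

```python
import torch
from torch import nn


class FusionForecaster(nn.Module):
    """Hypothetical sketch: per-modality encoders feeding a Transformer decoder."""

    def __init__(self, n_params: int = 4, d_model: int = 32):
        super().__init__()
        # Video encoder: a tiny CNN applied to each frame independently.
        self.video_encoder = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, d_model),
        )
        # Process-parameter encoder: an MLP over the readings at each timestamp.
        self.param_encoder = nn.Sequential(
            nn.Linear(n_params, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, 1)

    def forward(self, frames, params):
        # frames: (batch, time, 3, H, W); params: (batch, time, n_params)
        b, t = frames.shape[:2]
        video_tokens = self.video_encoder(frames.flatten(0, 1)).reshape(b, t, -1)
        param_tokens = self.param_encoder(params)
        # Causal mask: the output at timestamp i only attends to timestamps <= i.
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        fused = self.decoder(
            tgt=param_tokens, memory=video_tokens, tgt_mask=mask, memory_mask=mask
        )
        return self.head(fused).squeeze(-1)  # (batch, time) time-to-ready values


torch.manual_seed(0)
model = FusionForecaster().eval()
frames = torch.randn(2, 5, 3, 16, 16)  # random stand-in for video clips
params = torch.randn(2, 5, 4)          # random stand-in for process readings
with torch.no_grad():
    out = model(frames, params)
print(out.shape)  # → torch.Size([2, 5])
```

Keeping a separate encoder per modality lets each input keep its own structure (images versus tabular readings) until the decoder fuses them, and the causal mask keeps each per-timestamp forecast from depending on future inputs.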
