🤖 AI Summary
This work investigates the zero-shot transferability of Generative Flow Networks (GFlowNets) across arithmetic reasoning tasks, using the Game of 24 as the source task and the Game of 42 as the target. Method: We systematically analyze GFlowNet generalization via combinatorial search-space modeling and a joint diversity–accuracy evaluation framework, complemented by controlled fine-tuning experiments on small- and medium-scale LLMs. Contribution/Results: We find that GFlowNets suffer a 37% drop in solution diversity and a 29% decline in accuracy under cross-task transfer, revealing a strong dependence on task-specific priors, particularly operator distributions and numeric constraints. This is the first systematic identification of flow-structure bottlenecks in symbolic reasoning transfer, challenging the assumption that generative reasoning models can be deployed directly across tasks. Fine-tuning the LLMs does not fundamentally alleviate this limitation. Our findings establish a critical empirical benchmark and a theoretical caution for the transferability of generative reasoning models.
📝 Abstract
Generating diverse solutions is key to human-like reasoning, yet autoregressive language models are optimized to produce a single accurate response, limiting creativity. GFlowNets instead treat solution generation as sampling from a flow network, promising greater diversity. Our case study examines their zero-shot transferability by fine-tuning small and medium-sized large language models on the Game of 24 and testing them on the Game of 42 dataset. Results reveal that GFlowNets struggle to maintain solution diversity and accuracy, highlighting key limitations in their cross-task generalization and the need for future research on improved transfer learning.
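For readers unfamiliar with the source task: the Game of 24 asks whether four given numbers can be combined with the four basic arithmetic operators (and parentheses) to reach 24, and a puzzle typically admits several distinct expressions. The brute-force sketch below (our illustration, not code from the paper) enumerates that combinatorial search space; the count of distinct valid expressions it returns is one simple way to make "solution diversity" concrete:

```python
from itertools import permutations, product

def solve(numbers, target=24.0, eps=1e-6):
    """Enumerate all Game-of-24 expressions over four numbers.

    Tries every ordering of the operands, every choice of three binary
    operators, and every parenthesization pattern, returning the set of
    distinct expressions that evaluate to the target.
    """
    ops = ["+", "-", "*", "/"]
    # All five parenthesization shapes for four operands a, b, c, d.
    patterns = [
        "(({a}{p}{b}){q}{c}){r}{d}",
        "({a}{p}({b}{q}{c})){r}{d}",
        "({a}{p}{b}){q}({c}{r}{d})",
        "{a}{p}(({b}{q}{c}){r}{d})",
        "{a}{p}({b}{q}({c}{r}{d}))",
    ]
    solutions = set()
    for a, b, c, d in set(permutations(numbers)):
        for p, q, r in product(ops, repeat=3):
            for pat in patterns:
                expr = pat.format(a=a, b=b, c=c, d=d, p=p, q=q, r=r)
                try:
                    if abs(eval(expr) - target) < eps:
                        solutions.add(expr)
                except ZeroDivisionError:
                    continue
    return solutions
```

For example, `solve([4, 1, 8, 7])` finds multiple distinct expressions such as `(8-4)*(7-1)`. Changing `target` to 42 gives the target task; note that the reachable-value distribution and useful operator mix shift with the target, which is exactly the kind of task-specific prior the paper argues GFlowNets overfit to.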