🤖 AI Summary
This work investigates the multi-problem prompting (MPP) capability of large language models (LLMs), focusing on zero-shot multi-task performance. Methodologically, it evaluates 6 classification and 12 reasoning benchmarks using parallel multi-problem prompting, cross-benchmark performance analysis, and behavioral attribution to probe the underlying mechanisms. The study provides the first empirical evidence that mainstream LLMs possess robust native MPP ability; identifies instruction tuning as a critical driver that yields substantial quantitative gains in MPP performance; and reveals fundamental limitations in index localization and cross-source hybrid reasoning. By establishing MPP as a novel paradigm for assessing LLMs' multi-task generalization, the work delineates its empirical validity boundaries and identifies concrete optimization pathways.
📝 Abstract
Recent studies have proposed placing multiple problems in a single prompt to improve input token utilization for more efficient LLM inference. We call this multi-problem prompting (MPP), in contrast to conventional single-problem prompting (SPP), which prompts an LLM with one problem at a time. While MPP has been shown to work comparably well or even better than SPP under few-shot settings, its zero-shot performance is underexplored, even though the zero-shot setting better reveals the innate multi-problem handling capabilities of LLMs. To address this, we study the zero-shot MPP performance of various LLMs on 6 classification and 12 reasoning benchmarks and confirm that LLMs are competent zero-shot multi-problem solvers. We also examine the conditions under which zero-shot MPP is effective and explore several model-level factors that may enable MPP. We observe that LLMs consistently perform worse when asked to select the indices of texts belonging to a given class label, and when solving multiple reasoning problems drawn from mixed sources, indicating a lack of true understanding. We also find that instruction tuning is an important factor that enhances MPP.
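The MPP setup described above can be sketched as follows. This is an illustrative assumption rather than the paper's actual prompt template: the function name `build_mpp_prompt` and the exact wording are hypothetical, showing only the general idea of batching several indexed problems into one prompt instead of issuing one prompt per problem.

```python
def build_mpp_prompt(problems: list[str], task_instruction: str) -> str:
    """Combine several problems into one numbered prompt (MPP),
    instead of sending one prompt per problem (SPP).

    Hypothetical sketch: the real paper's template may differ.
    """
    lines = [task_instruction, ""]
    # Number each problem so the model can address answers by index.
    for i, problem in enumerate(problems, start=1):
        lines.append(f"Problem {i}: {problem}")
    lines.append("")
    lines.append("Answer each problem in order as 'Answer i: ...'.")
    return "\n".join(lines)


# SPP would instead call the model once per problem:
def build_spp_prompts(problems: list[str], task_instruction: str) -> list[str]:
    return [f"{task_instruction}\n\n{p}" for p in problems]
```

Under MPP, the shared task instruction is paid for once per batch rather than once per problem, which is the token-utilization gain motivating this line of work.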