BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

2026-06-07
AI Summary
Existing text-to-image generation models rely on unidirectional language modeling, which struggles to capture long-range semantic dependencies and is susceptible to vanishing gradients. This work proposes BLM-SGAN, the first approach to integrate bidirectional language modeling into a GAN framework by incorporating BERT's bidirectional attention mechanism. This enables joint modeling of semantic and spatial features, improving text-image alignment and generation quality. Because it models context across long sequences, BLM-SGAN avoids the limitations of conventional unidirectional architectures. Evaluated on the CUB birds dataset, the method achieves an Inception Score of 5.45 ± 0.08, which the authors report as higher than that of SSA-GAN, DF-GAN, SD-GAN, and AttnGAN.
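To make the contrast between unidirectional and bidirectional modeling concrete, the following is a minimal sketch of single-head self-attention with and without a causal mask. It is not taken from the BLM-SGAN code: the function names (`softmax`, `self_attention`), the toy two-dimensional token vectors, and the simplification that queries, keys, and values are all the raw token vectors are assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax; masked (-inf) entries receive zero weight.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, causal):
    """Toy single-head self-attention with Q = K = V = tokens (an assumption).

    causal=True  -> token i attends only to positions <= i (unidirectional).
    causal=False -> token i attends to every position (bidirectional).
    Returns (outputs, weights).
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for i, q in enumerate(tokens):
        scores = []
        for j, k in enumerate(tokens):
            if causal and j > i:
                scores.append(float("-inf"))  # hide later words from word i
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(dim))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * tokens[j][d] for j in range(len(tokens)))
                        for d in range(dim)])
    return outputs, weights

# Hypothetical embeddings for a three-word caption such as "small red bird".
caption = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, uni_w = self_attention(caption, causal=True)
_, bi_w = self_attention(caption, causal=False)
```

With the causal mask, the first word places all of its attention on itself, so its representation cannot reflect later words in the caption. Without the mask, every word receives non-zero weight from every other word, which is the property the summary attributes to BERT-style bidirectional attention.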
Abstract
Despite its success, image generation from text descriptions still faces hard challenges that span natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 ± 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.
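For readers unfamiliar with the reported metric, the sketch below shows how an Inception Score is commonly computed from classifier outputs: IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is the class distribution a pretrained classifier assigns to a generated image and p(y) is the average of those distributions. A figure such as "5.45 ± 0.08" is usually the mean and standard deviation of this score over several splits of the generated images; whether BLM-SGAN follows exactly that protocol is an assumption here. The helper names and the toy probabilities are made up for illustration and do not come from the paper or its repository.

```python
import math

def inception_score(probs):
    """IS = exp(mean over images of KL(p(y|x) || p(y))).

    probs: list of rows, each row a class-probability vector for one image.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal p(y): average of the per-image conditional distributions.
    marginal = [sum(row[c] for row in probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for row in probs:
        # Terms with p == 0 contribute nothing to the KL divergence.
        kl_sum += sum(p * math.log(p / marginal[c])
                      for c, p in enumerate(row) if p > 0)
    return math.exp(kl_sum / n)

def mean_and_std(values):
    # Population standard deviation across per-split scores.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

# Toy classifier outputs (made up): confident and diverse vs. uninformative.
sharp = [[1.0, 0.0], [0.0, 1.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
scores = [inception_score(sharp), inception_score(flat)]
```

The score is bounded below by 1 (every image gets the same prediction) and above by the number of classes (each image is classified confidently and the classes are evenly covered), so higher values indicate images that are both recognizable and varied.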
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
long-range dependencies
vanishing gradients
sequential processing
semantic-spatial modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional Language Modeling
Text-to-Image Generation
Generative Adversarial Networks
BERT Attention Mechanism
Semantic-Spatial Synthesis