This is an extension of COW (ICLR'24).
Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across image regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often causes undesired interference between image regions, degrading detail preservation and contextual consistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), an efficient unidirectional diffusion framework for precise information transfer with minimal disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to infer the semantic and spatial relationships within the image. Guided by these relationships, SOW modulates attention to dynamically regulate the direction and intensity of diffusion. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.
Figure 1: Comparison with existing methods for maintaining the fidelity of text and visual conditions in different application scenarios. We consistently achieve superior fidelity to both text and visual conditions in all three settings. In contrast, other learning-based approaches struggle to attain the same level of performance in diverse scenarios.
Diffusion in physics is a phenomenon caused by the random movement and collisions of particles. The diffusion model, drawing inspiration from non-equilibrium thermodynamics, establishes a Markov chain between a target data distribution and the Gaussian distribution.

Illustration of ``diffusion in diffusion''. In experiment (a), we invert pictures of pure gray and pure white to \( \mathbf{x_t} \), merge them, and then regenerate \( \mathbf{x_0} \) via deterministic denoising. In experiment (b), we enhance the attention scores from the upper-right quadrant to the lower-left quadrant, while in experiment (c) we suppress the attention scores from the upper-right quadrant towards all other areas. The resulting images show how regions within an image diffuse into and interfere with each other during denoising, and reveal the direct effect of attention on diffusion.
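To make experiments (b) and (c) concrete, below is a minimal sketch of one plausible way to reweight region-to-region attention inside a self-attention layer of the denoising U-Net. The function name `modulate_region_attention`, the quadrant masks, and the rescale-then-renormalize strategy on post-softmax attention maps are illustrative assumptions, not the exact operations used in our experiments.

```python
import torch

def modulate_region_attention(attn, query_mask, key_mask, scale):
    """Reweight how much `query_mask` tokens attend to `key_mask` tokens.

    attn: (batch, heads, N, N) post-softmax self-attention over N spatial tokens.
    query_mask, key_mask: (N,) boolean masks selecting two image regions.
    scale > 1 enhances information flow from the key region into the query
    region; scale < 1 suppresses it. Rows are renormalized so each remains
    a probability distribution over the keys.
    """
    region = query_mask[:, None] & key_mask[None, :]  # (N, N) query->key pairs
    attn = torch.where(region, attn * scale, attn)
    return attn / attn.sum(dim=-1, keepdim=True)

# Toy usage on a 16x16 latent grid (N = 256 spatial tokens).
N, side = 256, 16
idx = torch.arange(N)
row, col = idx // side, idx % side
upper_right = (row < side // 2) & (col >= side // 2)
lower_left = (row >= side // 2) & (col < side // 2)

attn = torch.softmax(torch.randn(1, 8, N, N), dim=-1)  # dummy attention maps
enhanced = modulate_region_attention(attn, upper_right, lower_left, 1.5)      # experiment (b)
suppressed = modulate_region_attention(attn, upper_right, ~upper_right, 0.5)  # experiment (c)
```

Scaling post-softmax weights and renormalizing keeps the modulation numerically stable regardless of the sign of the underlying logits, which is why this variant is sketched here.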
The pipeline of our proposed SOW method. Given the visual condition and the text condition, we first employ Gemini to infer a textual description, an adaptive location box for the visual condition, and a visual-condition-related box through a three-stage reasoning process. The input visual condition is then affixed to a predefined background, serving as the seed initialization for the cycle. During the Cyclic One-Way Diffusion process, we ``disturb'' and ``reconstruct'' the image cyclically, and ensure continuous one-way diffusion by consistently replacing the visual-condition region with its corresponding \(\mathbf{x_t}\). By integrating these pieces of prior information, we execute cyclic diffusion with dynamic attention modulation, enhancing the coherence and accuracy of the generated outputs.
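As a rough illustration of the ``disturb''/``reconstruct'' cycle and the one-way replacement step, the sketch below assumes generic helpers: `scheduler.add_noise(x, noise, t)` following the usual DDPM forward-diffusion convention, and `denoise_step(x, t)` wrapping one reverse step of the model. It is a simplified sketch of the cyclic loop, not the released implementation, and it omits the dynamic attention modulation described above.

```python
import torch

@torch.no_grad()
def cyclic_one_way_diffusion(x, cond_x0, cond_mask, scheduler, denoise_step,
                             num_cycles=3, t_high=600, t_low=200):
    """x: full-image latent after seed initialization; cond_x0: latent of the
    pasted visual condition; cond_mask: 1 inside the adaptive location box,
    0 elsewhere. Each cycle re-noises the image ("disturb") and denoises it
    back ("reconstruct"), overwriting the condition region at every step with
    the condition's own noisy version so information flows only outward."""
    for _ in range(num_cycles):
        x = scheduler.add_noise(x, torch.randn_like(x), t_high)             # "disturb"
        for t in range(t_high, t_low, -1):                                  # "reconstruct"
            cond_xt = scheduler.add_noise(cond_x0, torch.randn_like(cond_x0), t)
            x = cond_mask * cond_xt + (1 - cond_mask) * x                   # one-way replacement
            x = denoise_step(x, t)                                          # one reverse step
    return x
```

The replacement line is what enforces unidirectionality: the condition region is always reset to a noisy version of the original condition, so the background can absorb information from it while the reverse never happens.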
SOW strikes a balance between the visual and text conditions. For example, given a photo of a young woman and the text “an old person”, our method can age the woman to match the text description by adding wrinkles and changing skin elasticity and hair color, while preserving the facial expression and identity of the given woman.
SOW can be directly applied to visual conditions other than human faces.