Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis

Accepted by ECCV 2024

qizipeng1107@gmail.com
1Beihang University, 2Mohamed bin Zayed University of Artificial Intelligence, 3University of Bristol

Layout-to-Image


In this work, we mainly focus on the layout-to-image task. As shown in part (1) of the figure above, the user provides a layout condition, consisting of bounding boxes or an instance mask, together with a caption. The proposed method then generates a high-quality image that satisfies both the layout and the caption conditions.

In addition, the proposed method also supports editing a given image under an input layout condition, as shown in part (2), including replacing an object at a specified location, inserting an object at a specified location, and so on.


Contributions

(1) We propose Layered Rendering Diffusion (LRDiff) for layout-to-image generation in a zero-shot manner, eliminating the need for training and complex constraint designs. The proposed method effectively circumvents issues such as unintended concept blending or concept mismatches that arise when generating multiple concepts.

(2) We are the first to introduce the concept of vision guidance to achieve spatial controllability in the noise space.

(3) Three applications are enabled by the proposed method: bounding box-to-image, instance mask-to-image, and image editing.

Motivations


(1) The noise distribution inherently contains layout (or semantic) information. How can we adjust the noise to achieve a suitable distribution? The figure above provides an answer: by adding a specific vector, such as a colour vector related to the object description (e.g., derived from CLIP), to a specific area, we can generally make the object appear within that area.

(2) For different concepts, the specific vector in (1) can vary. We propose a layered strategy to handle multiple concepts effectively.


Method overview


The proposed vision guidance:

We factorise the vision guidance \( {\xi} \) into two components: a vector \( \delta \in \mathbb{R}^{D} \) and a binary mask \( {\mathcal{M}} \in \{0,1\}^{h \times w} \). Each element \( \xi_{j,k,l} \) of \( {\xi} \) is defined as follows: \[ \begin{aligned} \xi_{j,k,l} &= \delta_l \cdot \mathcal{M}_{j,k} - \delta_l \cdot (1 - \mathcal{M}_{j,k}) \\ &= \delta_l \cdot (2\mathcal{M}_{j,k} - 1), \end{aligned} \] where \( \mathcal{M}_{j,k} \) is assigned the value 1 if the spatial position \( (j,k) \) falls within the expected object region. For the region containing an object, we add \( {\delta} \) to enhance the generation tendency of that object; conversely, for areas outside the target region, we subtract \( {\delta} \) to suppress it. The binary mask \( {\mathcal{M}} \) can be derived from user input, e.g., by converting a user-provided bounding box or instance mask into a binary mask.
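
For concreteness, below is a minimal PyTorch-style sketch of how this factorisation could be assembled from a user-supplied bounding box; the helper names (box_to_mask, build_vision_guidance) are illustrative and not part of the released implementation.

    import torch

    def box_to_mask(box, h, w):
        # Convert a bounding box (x0, y0, x1, y1), given in latent-grid
        # coordinates, into a binary mask M of shape (h, w).
        x0, y0, x1, y1 = box
        mask = torch.zeros(h, w)
        mask[y0:y1, x0:x1] = 1.0
        return mask

    def build_vision_guidance(delta, mask):
        # delta: (D,) direction vector; mask: (h, w) binary mask M.
        # Implements xi_{j,k,l} = delta_l * (2 * M_{j,k} - 1):
        # +delta inside the object region, -delta outside it.
        return delta[:, None, None] * (2.0 * mask[None, :, :] - 1.0)

    # Example: a 64x64 latent grid with the object box in the upper-left area.
    mask = box_to_mask((4, 4, 28, 28), h=64, w=64)
    delta = 0.3 * torch.ones(4)                 # D = 4 latent channels
    xi = build_vision_guidance(delta, mask)     # shape (4, 64, 64)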

Next, we introduce two distinct approaches to compute the vector \( \delta \):
(a) Constant vector: A naïve approach for this configuration is to set the vector \( \delta \) to constant values. When the diffusion model operates in RGB space, we can set \( \delta \) to constant values corresponding to a colour described by the text prompt (e.g., [0.3, 0.3, 0.3], corresponding to a white colour with transparency). When operating in the latent space of a VAE, \( \delta \) can be set to the latent representation of those constant values, which requires operations such as dimension expansion and tensor repetition. Although manually setting \( \delta \) to constant values is versatile enough to generate objects with various visual concepts, it necessitates human intervention.
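
As one concrete (and hypothetical) way to realise the constant-vector option, the sketch below paints an image with the prompt-described colour and maps it into the latent space; it assumes a Stable Diffusion-style VAE object vae exposing a diffusers-like encode(...).latent_dist interface, and all names are placeholders.

    import torch

    def constant_delta_latent(vae, rgb=(0.3, 0.3, 0.3), size=512):
        # Paint a full image with the constant colour described by the prompt
        # (dimension expansion + tensor repetition, as noted above), ...
        colour = torch.tensor(rgb).view(1, 3, 1, 1).repeat(1, 1, size, size)
        with torch.no_grad():
            latent = vae.encode(colour * 2.0 - 1.0).latent_dist.mean  # (1, 4, h, w)
        # ... then reduce it to a single per-channel vector delta in latent space.
        return latent.mean(dim=(0, 2, 3))       # shape (4,)

    # When the diffusion model operates directly in RGB space, delta is simply
    # the colour values themselves:
    delta_rgb = torch.tensor([0.3, 0.3, 0.3])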

(b) Dynamic vector: Beyond simply assigning constant values to \( \delta \), we propose to dynamically adapt the values of \( \delta \) based on the input text conditions in order to reduce human intervention during generation. In this context, we consider the implementation of Stable Diffusion, wherein text tokens are connected with visual features via cross-attention modules. At the initial denoising step, i.e., \( t=T \), we extract the cross-attention map \( \mathbf{A} \in \mathbb{R}^{|c| \times hw} \) from an intermediate layer of the U-Net. For a more straightforward illustration, we consider the synthesis of an image containing a single object, corresponding to the \( i \)-th text token in the text prompt \( c \). To derive the vector \( \delta \), we then perform the following operations: \[ \begin{aligned} S &= \{(j,k) \mid \mathbf{A}_{j,k}^i > \mathrm{Threshold}_K(\mathbf{A}^i)\},\\ \delta &= \frac{\lambda}{|S|}\sum \{ \mathbf{x}_t(j,k) \mid (j,k) \in S\}, \end{aligned} \] where \( \mathbf{x}_t(j,k) \) denotes the element at spatial location \( (j,k) \) in \( \mathbf{x}_t \), and the \( \sum \) operation sums all items in the set \( S \). The operation \( \mathrm{Threshold}_K(\cdot) \) returns the \( K \)-th largest value in \( \mathbf{A}^i \). The strength of the vision guidance is modulated by the coefficient \( \lambda \), alongside the classifier-free guidance coefficient \( \gamma \). Given the presence of multiple cross-attention blocks within the score network, we select the block following the down-sampling in each stage and average their outputs.
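
The thresholding and averaging above can be written in a few lines; the following sketch assumes the cross-attention map \( \mathbf{A} \) (of shape \( |c| \times hw \)) has already been extracted at \( t=T \) and averaged over the selected U-Net blocks, and the function name is illustrative.

    import torch

    def dynamic_delta(x_t, attn, token_idx, K=50, lam=1.0):
        # x_t:  noisy latent at t = T, shape (D, h, w)
        # attn: cross-attention map A averaged over the chosen blocks,
        #       shape (num_tokens, h * w)
        D, h, w = x_t.shape
        a_i = attn[token_idx].view(h, w)               # A^i for the object's text token
        kth = torch.topk(a_i.flatten(), K).values[-1]  # Threshold_K(A^i): K-th largest value
        S = a_i > kth                                  # spatial set S
        # delta = (lambda / |S|) * sum of x_t(j, k) over (j, k) in S
        delta = lam * x_t[:, S].sum(dim=1) / S.sum().clamp(min=1)
        return delta                                   # shape (D,)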

The pipeline of the method:

To synthesise a scene, the user provides a global caption, layered captions, and the spatial layout entities that are used to construct the vision guidance. LRDiff divides the reverse-time diffusion process into two sections (a minimal sketch of this two-phase schedule follows the two items below):

(i) When \( t \geq t_0 \), each vision guidance is applied within its own layer to alter the denoising direction, ensuring that each object's contour is generated within its specified region.

(ii) When \( t < t_0 \), we perform the general reverse diffusion process to generate texture details that are consistent with the global caption.
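
The two-phase schedule can be organised as in the sketch below (shown for a single object layer for brevity). Here denoise_step stands for one reverse-diffusion update of the underlying Stable Diffusion model and is supplied by the caller; whether the vision guidance is injected once at \( t=T \) or, as here, at every early step is an implementation choice, and all names are placeholders rather than the released code.

    import torch

    def two_phase_sampling(denoise_step, xi, layer_prompt, global_prompt,
                           T=50, t0=35, latent_shape=(1, 4, 64, 64)):
        # denoise_step(x_t, t, prompt): one reverse-diffusion update of the
        # underlying diffusion model, provided by the caller in this sketch.
        # xi: vision guidance for the object layer, broadcastable to x_t.
        x_t = torch.randn(latent_shape)
        for t in reversed(range(T)):
            if t >= t0:
                # Phase (i), t >= t0: bias the noisy latent with the vision
                # guidance and denoise under the layer caption, so the object's
                # contour forms inside the masked region.
                x_t = denoise_step(x_t + xi, t, layer_prompt)
            else:
                # Phase (ii), t < t0: plain reverse diffusion under the global
                # caption to refine globally consistent texture details.
                x_t = denoise_step(x_t, t, global_prompt)
        return x_t

In practice, denoise_step would wrap the U-Net with classifier-free guidance (coefficient \( \gamma \)); LRDiff applies one such guidance per object layer rather than the single layer shown here.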


Results

Results under the instance mask condition

Results under the bounding box condition



BibTeX


      coming soon