ElasticDiffusion: Training-free Arbitrary Size Image Generation

Moayed Haji Ali; Guha Balakrishnan; Vicente Ordonez

doi:10.1109/cvpr52733.2024.00631

← back to publications

publication

ElasticDiffusion: Training-free Arbitrary Size Image Generation

Moayed Haji Ali, Guha Balakrishnan, Vicente Ordonez.

Conf. on Computer Vision and Pattern Recognition CVPR 2024. Seattle, WA.

paper project page code pdf raw bibtex

Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers at Rice University have developed a method called ElasticDiffusion that allows existing text-to-image AI models to generate pictures at sizes and shapes they were never trained on, without any additional training or significant extra memory. The problem they set out to solve is a fundamental limitation of popular diffusion models like Stable Diffusion, which are trained on images of a fixed size — typically 512×512 pixels — and tend to produce repetitive patterns, distorted objects, or incoherent images when asked to generate something taller, wider, or at a different resolution. The team's key insight was that the mathematical signals inside a diffusion model during image generation can be split into two distinct roles: a "global" signal that governs the overall structure and composition of a scene, and a "local" signal that handles fine pixel-level detail. ElasticDiffusion exploits this separation by computing the local signal in small patches at the model's native resolution and separately computing the global signal from a lower-resolution reference image, then upscaling and combining both to produce the final output. In tests on face and scene datasets, the method outperformed MultiDiffusion — a prior patch-stitching approach — and produced results competitive with Stable Diffusion XL, a much larger model explicitly retrained for higher resolutions, while using only about a third of its memory. The practical significance is that developers and researchers could use a single, already-trained diffusion model to generate portrait-mode, widescreen, or other non-standard image formats without the substantial computational cost of retraining.

abstract

Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project page: https://elasticdiffusion.github.io/

details

DOI: 10.1109/cvpr52733.2024.00631
journal ref: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
comment: Accepted at CVPR 2024. Project Page: https://elasticdiffusion.github.io/

citation

@inproceedings{ali2024elasticdiffusion,
  title = {ElasticDiffusion: Training-free Arbitrary Size Image Generation},
  author = {Ali, Moayed Haji and Balakrishnan, Guha and Ordonez, Vicente},
  year = {2024},
  booktitle = {Conf. on Computer Vision and Pattern Recognition CVPR 2024},
  url = {https://arxiv.org/abs/2311.18822},
}

automatically generated questions, main contributions and limitations of this paper

Questions this paper helps answer

What does ElasticDiffusion enable? ElasticDiffusion lets a pretrained text-to-image diffusion model generate images at sizes and aspect ratios beyond its original training resolution without retraining.
Why do standard diffusion models struggle with arbitrary sizes? Models such as Stable Diffusion are trained at fixed resolutions, so direct decoding at much larger, smaller, or differently shaped canvases can create repeated patterns, distorted structure, or poor composition.
What is the main technical idea? The method separates local and global diffusion signals: local detail is estimated on native-resolution patches, while global structure is guided by a lower-resolution reference signal.
How does ElasticDiffusion reduce patch boundary artifacts? It uses contextual patch estimation, reduced-resolution guidance, and resampling so large images remain coherent while avoiding heavy overlap between patches.
How does it compare to alternatives? The paper reports stronger coherence than standard Stable Diffusion and MultiDiffusion across resolutions and aspect ratios, with results competitive with SDXL at 1024 by 1024 while using a smaller base model.

Main contributions

The paper introduces a training-free decoding strategy for arbitrary-size text-to-image generation using existing pretrained diffusion models.
It identifies and exploits a useful separation between global class-direction guidance and local unconditional detail signals inside classifier-free guided diffusion.
ElasticDiffusion provides an efficient implicit-overlap patching method that reduces boundary discontinuities without the large number of forward calls required by heavily overlapping patch methods.
The method adds reduced-resolution guidance and iterative resampling to improve image coherence and detail at resolutions outside the base model's training size.
Experiments on CelebA-HQ and LAION-COCO show practical gains across square resolutions and multiple aspect ratios, making the approach useful for portrait, widescreen, and other non-standard outputs.

Limitations and cautions

ElasticDiffusion depends on estimating global and local diffusion signals accurately, so occasional artifacts can still appear; the paper directly addresses this with guidance and resampling mechanisms.
Reduced-resolution guidance can make outputs slightly blurrier when used strongly, but it is a practical control that helps remove artifacts and preserve overall composition.
The global content signal is initially estimated near the base model's training resolution, so extremely large scale jumps remain a challenging case and a natural direction for future refinement.
The method improves arbitrary-size decoding rather than replacing stronger base models; it is especially valuable because it can also be applied on top of better pretrained diffusion models.
The evaluation focuses on image generation quality and text alignment on face and scene datasets, leaving specialized downstream uses such as design layouts or production image editing as promising follow-up settings.

How to read this result

This paper is best read as a strong practical advance for diffusion model deployment: ElasticDiffusion makes fixed-resolution text-to-image models far more flexible, producing coherent arbitrary-size outputs without the cost of retraining or switching to a much larger model.