Crushing SDXL! The new-generation text-to-image model Stable Cascade is here!

Hugging Face already hosts an online demo of Stable Cascade for a quick try: https://huggingface.co/spaces/multimodalart/stable-cascade
The model and code for Stable Cascade (both inference and training) have also been open-sourced:

Model: https://huggingface.co/stabilityai/stable-cascade
Code: https://github.com/Stability-AI/StableCascade/tree/master

First, let's briefly introduce the architecture of Stable Cascade. It builds on the earlier Würstchen architecture and consists of three stages: Stage A, Stage B, and Stage C.

In summary, Stable Cascade consists of two latent diffusion models and a small VQGAN. One latent diffusion model (Stage C) is responsible for generation, while the other latent diffusion model (Stage B), combined with the VQGAN (Stage A), is responsible for reconstruction.

You might wonder why a diffusion model is needed for decoding instead of a small decoder like the one in SD. The main reason is that a 42x downsampling rate still loses a great deal of information, so decoding requires a diffusion model with stronger generative capability. (A small convolutional network is also publicly available at https://github.com/Stability-AI/StableCascade/blob/master/modules/previewer.py for decoding preview images of size 192x192.) In fact, a VAE is also a lossy compressor, which is why DALL-E 3 likewise introduces a diffusion-based latent decoder.

In addition, the inference order of the three stages is Stage C -> Stage B -> Stage A. So why are the stage indices reversed? My guess is that the indices follow the training order. Stage A is trained first, followed by Stage B. Note that the Semantic Compressor has to be trained together with Stage B (the EfficientNetV2-S model is pretrained on ImageNet and cannot encode image semantics accurately on its own); the Semantic Compressor is then frozen while Stage C is trained.
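To make the compression argument concrete, here is a minimal arithmetic sketch in Python. It loads no model. It assumes that a 1024x1024 image maps to a 24x24 Stage C latent (the figure quoted in the Stability AI announcement, which is where the roughly 42x factor comes from) and that an SD-style VAE downsamples by 8x to 128x128. The names `compression_factor` and `INFERENCE_ORDER` are purely illustrative and do not come from the Stable Cascade codebase.

```python
# Illustrative arithmetic only; nothing here touches the real models.
IMAGE_SIDE = 1024

# Assumed latent sizes for a 1024x1024 image:
#   Stable Cascade Stage C latent: 24x24 (from the Stability AI announcement)
#   SD-style VAE latent: 128x128 (8x downsampling per axis)
CASCADE_LATENT_SIDE = 24
SD_LATENT_SIDE = 128


def compression_factor(image_side: int, latent_side: int) -> float:
    """Per-axis spatial downsampling factor from image to latent."""
    return image_side / latent_side


cascade_factor = compression_factor(IMAGE_SIDE, CASCADE_LATENT_SIDE)
sd_factor = compression_factor(IMAGE_SIDE, SD_LATENT_SIDE)

# How many times fewer spatial positions the Stage C latent has
# compared with an SD-style latent of the same image.
position_ratio = SD_LATENT_SIDE**2 / CASCADE_LATENT_SIDE**2

# Inference order described in the text (hypothetical labels, not an API).
INFERENCE_ORDER = [
    ("Stage C", "text prompt -> highly compressed latent (generation)"),
    ("Stage B", "compressed latent -> VQGAN latent (diffusion decoding)"),
    ("Stage A", "VQGAN latent -> pixels (VQGAN decoder)"),
]

print(f"Stable Cascade: {cascade_factor:.1f}x per axis")  # 42.7x
print(f"SD-style VAE:   {sd_factor:.1f}x per axis")       # 8.0x
print(f"Spatial positions ratio: {position_ratio:.1f}x")  # 28.4x
for name, role in INFERENCE_ORDER:
    print(f"{name}: {role}")
```

Under these assumptions the Stage C latent has about 28 times fewer spatial positions than an SD-style latent, which is why a lightweight decoder alone struggles to recover fine detail and a diffusion model takes over the decoding step.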

In the model comparison published by Stability AI, Stable Cascade outperforms Playground v2, SDXL, SDXL Turbo, and Würstchen v2 in both text consistency (prompt alignment) and image quality.

References:
https://stability.ai/news/introducing-stable-cascade
https://huggingface.co/stabilityai/stable-cascade