andrewji8


Crushing SDXL! The new-generation text-to-image model Stable Cascade is here!

Hugging Face already hosts an online demo of Stable Cascade for a quick hands-on experience: https://huggingface.co/spaces/multimodalart/stable-cascade
The model and the code for Stable Cascade (covering both inference and training) have also been open-sourced:

Model: https://huggingface.co/stabilityai/stable-cascade
Code: https://github.com/Stability-AI/StableCascade/tree/master
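
For a quick local test, the official model card describes a two-stage inference flow using the diffusers library. The sketch below follows that flow, assuming a recent diffusers release that ships `StableCascadePriorPipeline` and `StableCascadeDecoderPipeline`, and a GPU with enough memory; repo names are taken from the model card.

```python
def generate(prompt: str, negative_prompt: str = ""):
    """Two-stage Stable Cascade inference: Stage C (prior) then Stage B+A (decoder)."""
    import torch
    from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

    device = "cuda"

    # Stage C: turn the text prompt into highly compressed 24x24 latents.
    prior = StableCascadePriorPipeline.from_pretrained(
        "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
    ).to(device)
    prior_out = prior(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=20,
        guidance_scale=4.0,
    )

    # Stage B + Stage A: reconstruct the full-resolution image from those latents.
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        "stabilityai/stable-cascade", torch_dtype=torch.float16
    ).to(device)
    image = decoder(
        image_embeddings=prior_out.image_embeddings.to(torch.float16),
        prompt=prompt,
        num_inference_steps=10,
        guidance_scale=0.0,
    ).images[0]
    return image
```

Calling `generate("an astronaut riding a horse, photorealistic")` will download both checkpoints on first use; note that the prior (Stage C) is the larger of the two models.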
First, let's briefly introduce the architecture of the Stable Cascade model. Stable Cascade builds on the earlier Würstchen architecture and consists of three stages, as shown below:

(Figure: the three-stage architecture of Stable Cascade)
In summary, Stable Cascade consists of two latent diffusion models plus a small VQGAN. One latent diffusion model (Stage C) is responsible for generation, while the other (Stage B), combined with the VQGAN (Stage A), is responsible for reconstruction. You might wonder why a diffusion model is needed for decoding instead of a small decoder as in SD. The main reason is that the 42x downsampling is still quite lossy, so a decoder with stronger generative capability is required. (A small convolutional network is also publicly available at https://github.com/Stability-AI/StableCascade/blob/master/modules/previewer.py for decoding 192x192 preview images directly from the latents.) In fact, a VAE also performs lossy compression, which is why DALL-E 3 likewise introduces a diffusion-based latent decoder.

In addition, the inference order of Stable Cascade's three stages is Stage C -> Stage B -> Stage A. So why are the stage names in reverse order relative to inference? My guess is that the names follow the training order. Stage A must be trained first, followed by Stage B. Note that when training Stage B, the Semantic Compressor is trained jointly (the EfficientNetV2-S backbone is pretrained on ImageNet and cannot encode image semantics accurately enough on its own); the Semantic Compressor is then frozen while training Stage C.
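As a sanity check on the numbers above, a few lines of Python make the latent shapes concrete. The 24x24 Stage C latent and the 4x Stage A downsampling factor come from the official announcement; treat the exact figures as illustrative.

```python
# Rough shape bookkeeping for Stable Cascade's three stages.
image_hw = 1024                                # working resolution (1024x1024)
stage_c_latent = 24                            # Stage C's compressed latent is 24x24
stage_a_factor = 4                             # Stage A (VQGAN) downsamples 4x spatially
stage_b_latent = image_hw // stage_a_factor    # 256x256: the space Stage B denoises in

compression = image_hw / stage_c_latent        # ~42x, hence the "42x" mentioned above
print(f"Stage C latent: {stage_c_latent}x{stage_c_latent}, "
      f"Stage B/A latent: {stage_b_latent}x{stage_b_latent}, "
      f"overall spatial compression: {compression:.1f}x")
```

Compare this with SD/SDXL, whose VAE only compresses 8x spatially; the much higher compression is what lets Stage C run cheaply, at the cost of needing a diffusion model (Stage B) to decode.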


In head-to-head comparisons, Stable Cascade outperforms Playground v2, SDXL, SDXL Turbo, and Würstchen v2 in both text consistency and image quality.
References:
https://stability.ai/news/introducing-stable-cascade
https://huggingface.co/stabilityai/stable-cascade
