andrewji8


Nvidia open-sources TangoFlux, an AI text-to-audio model that can generate 30 seconds of sound in just 3.7 seconds.

This text-to-audio model is mainly intended for generating sound effects, such as wind and rain, a pin dropping, or the roar of an airplane taking off.


Technical Features#

  1. Efficient Generation Capability:
    TangoFlux can generate up to 30 seconds of 44.1 kHz audio in just 3.7 seconds on a single A40 GPU. This gives it a significant speed advantage over comparable models, delivering high-quality audio in far less time.

  2. Flow Matching and Rectified Flows:
    The model employs a flow matching framework, specifically Rectified Flows, which learn a near-straight path from noise to the target distribution, preserving audio quality while reducing the number of sampling steps. This makes generation more efficient and stable and lowers the demand for computational resources.

  3. Clap Ranking Preference Optimization (CRPO):
    TangoFlux introduces CRPO, which uses the CLAP model as a proxy reward model and iteratively generates and optimizes preference data to strengthen the model's alignment. CRPO effectively improves how well the generated audio matches the text description, bringing the output closer to user intent and expectations.

  4. Multimodal Diffusion Transformer Architecture:
    The model is built on a Multimodal Diffusion Transformer (MMDiT) and Diffusion Transformer (DiT), combining text prompts with duration embeddings to generate audio of varying lengths and rich detail. This architecture strengthens the model's ability to handle complex text descriptions and produce diverse audio content.
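The rectified-flow sampling described in point 2 can be sketched with a plain Euler integrator. This is a minimal illustration, not TangoFlux's actual API: the model interface, latent shape, and the toy velocity field are all assumptions. The key idea is that a rectified flow learns a nearly straight path from noise to data, so a handful of Euler steps already lands close to the target.

```python
import numpy as np

def euler_sample(velocity_model, cond, num_steps=25, shape=(64, 256), seed=0):
    """Integrate the rectified-flow ODE dx/dt = v(x, t, cond) with Euler steps,
    from Gaussian noise at t=0 toward the target distribution at t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + velocity_model(x, t, cond) * dt     # straight-line Euler update
    return x

# Toy "trained" velocity: points along the straight line toward a fixed target,
# which is exactly the field a perfectly rectified flow would learn.
target = np.ones((64, 256))
toy_velocity = lambda x, t, cond: (target - x) / (1.0 - t + 1e-3)
sample = euler_sample(toy_velocity, cond=None)
```

Because the path is straight, the step count can be cut aggressively (here 25 steps) without the curvature error that plagues standard diffusion samplers, which is where the speed advantage comes from.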
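The CRPO step in point 3 boils down to ranking candidate generations by a CLAP-style reward and keeping winner/loser pairs for preference optimization. The sketch below uses placeholder embedding vectors; in the real pipeline the prompt and each audio candidate would be embedded with a pretrained CLAP checkpoint, and the pair-selection details here are an assumption.

```python
import numpy as np

def build_preference_pair(prompt_emb, audio_embs):
    """Score each candidate audio embedding by cosine similarity to the text
    embedding (a CLAP-style proxy reward) and return the (winner, loser)
    indices for a preference-optimization update."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    rewards = [cos(prompt_emb, e) for e in audio_embs]
    order = np.argsort(rewards)[::-1]       # best candidate first
    return (int(order[0]), int(order[-1])), rewards

# Toy example: candidate 1 is a lightly perturbed copy of the prompt embedding,
# candidate 2 points the opposite way, candidate 0 is unrelated noise.
rng = np.random.default_rng(0)
prompt = rng.standard_normal(8)
candidates = [rng.standard_normal(8),
              prompt + 0.1 * rng.standard_normal(8),
              -prompt]
pair, rewards = build_preference_pair(prompt, candidates)
```

Iterating this loop (generate candidates, rank with the proxy reward, fine-tune on the pairs) is what pushes the model's outputs toward the text description without any human labels.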
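Point 4 mentions that text prompts are combined with duration embeddings so the transformer can condition on the requested clip length. One plausible sketch is a sinusoidal duration token appended to the text tokens; the sinusoidal scheme, dimensions, and token layout here are illustrative assumptions, since the description only states that the two embeddings are combined.

```python
import numpy as np

def duration_embedding(seconds, dim=16, max_seconds=30.0):
    """Sinusoidal embedding of the requested clip length, usable as an extra
    conditioning token alongside the text-prompt embeddings."""
    pos = seconds / max_seconds                  # normalize duration to [0, 1]
    freqs = 2.0 ** np.arange(dim // 2)           # geometric frequency ladder
    angles = np.pi * pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

text_tokens = np.zeros((12, 16))                 # placeholder prompt embeddings
cond = np.vstack([text_tokens, duration_embedding(10.0)])  # append duration token
```

Feeding `cond` to the transformer lets a single model produce clips of any length up to the 30-second cap, rather than training one model per duration.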

GitHub Project Link

Hugging Face Try It Out Link

Paper Link
