ControlAudio: Tackling Text-Guided, Timing-Indicated and
Intelligible Audio Generation via Progressive Diffusion Modeling

Anonymous submission

Abstract

Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method fits distributions conditioned on progressively more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the order of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations.

Figure 1: The end-to-end progressive diffusion modeling of ControlAudio, which combines progressive model training with a progressively guided sampling process for decoupled control of temporal structure and speech content.

Timing-Controlled Audio Generation

Text prompt
A bathtub is being filled or washed.

Timing prompt
Bathtub (filling or washing): 0.0s - 6.598s
Bathtub (filling or washing): 6.748s - 10.0s

Structured prompt

A bathtub is being filled or washed. @{Bathtub (filling or washing) <0.00,6.60><6.75,10.00>}
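
The mapping from a text prompt and a timing prompt to the structured prompt above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `build_structured_prompt`, the `joiner` parameter, and the two-decimal rounding rule are assumptions inferred from the examples on this page.

```python
def format_span(start, end):
    """Render one (start, end) pair in seconds as '<s.ss,e.ee>'."""
    return f"<{start:.2f},{end:.2f}>"


def build_structured_prompt(caption, events, joiner=" "):
    """Append one '@{label spans}' block per event to the caption.

    events: list of (label, [(start, end), ...]) tuples.
    joiner: text placed between the label and its spans. Some examples
            on this page use " " and others " & " (assumed to be
            interchangeable presentation variants).
    """
    blocks = []
    for label, spans in events:
        span_text = "".join(format_span(s, e) for s, e in spans)
        blocks.append("@{" + label + joiner + span_text + "}")
    return caption + " " + "".join(blocks)


# Timing prompt of the bathtub example, with the original 3-decimal times.
prompt = build_structured_prompt(
    "A bathtub is being filled or washed.",
    [("Bathtub (filling or washing)", [(0.0, 6.598), (6.748, 10.0)])],
)
print(prompt)
```

Passing `joiner=" & "` yields the variant with an ampersand between the label and its time spans, as used in several of the other examples.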

ControlAudio

AudioLDM

Tango

Stable Audio*

AudioComposer

Ground Truth

Text prompt
A woman sings with music and an alarm clock sounds.

Timing prompt
A woman sings: 0.00s - 6.73s
Alarm clock: 6.73s - 10.00s

Structured prompt

A woman sings with music and an alarm clock sounds. @{A woman sings & <0.00,6.73>}@{Alarm clock & <6.73,10.00>}

ControlAudio

AudioLDM

Tango

Stable Audio*

AudioComposer

Ground Truth

Text prompt
A telephone rings repeatedly.

Timing prompt
Telephone bell ringing: 0.0s - 2.221s
Telephone bell ringing: 2.568s - 8.022s
Telephone bell ringing: 8.473s - 10.0s

Structured prompt

A telephone rings repeatedly. @{Telephone bell ringing & <0.00,2.22><2.57,8.02><8.47,10.00>}

ControlAudio

AudioLDM

Tango

Stable Audio*

AudioComposer

Ground Truth

Text prompt
A man is speaking and an engine is running.

Timing prompt
Engine is running: 1.40s - 10.00s
Male speech, man speaking: 0.00s - 1.40s
LLM Planning: "The engine’s running, let’s go."

Structured prompt

A man is speaking and an engine is running. @{Engine is running <0.00,10.00>}@{Male speech, man speaking <0.00,1.40><DH><AH0>< ><EH1><N><JH><AH0><N><Z>< ><R><AH1><N><IH0><NG>< ,>< ><L><EH1><T><S>< ><G><OW1>< .>}
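
The phoneme portion of the structured prompt above can be reproduced with a small sketch. Everything here is an assumption for illustration: the helper `to_phoneme_tokens` is hypothetical, the mini-lexicon is copied by hand from the phoneme string in this example, and the token rules (`< >` before each non-initial word, `< ,>` and `< .>` for punctuation) are inferred from that single string. A real pipeline would use a full pronunciation dictionary or a grapheme-to-phoneme model.

```python
import re

# Hypothetical mini-lexicon; entries are taken from the example prompt above.
LEXICON = {
    "the": ["DH", "AH0"],
    "engine's": ["EH1", "N", "JH", "AH0", "N", "Z"],
    "running": ["R", "AH1", "N", "IH0", "NG"],
    "let's": ["L", "EH1", "T", "S"],
    "go": ["G", "OW1"],
}
PUNCTUATION = {",", ".", "!", "?"}


def to_phoneme_tokens(text, lexicon=LEXICON):
    """Convert a sentence into a string of '<...>' phoneme tokens.

    Raises KeyError for words that are missing from the lexicon.
    """
    text = text.replace("\u2019", "'")  # normalise curly apostrophes
    pieces = re.findall(r"[A-Za-z']+|[,.!?]", text)
    tokens = []
    for index, piece in enumerate(pieces):
        if piece in PUNCTUATION:
            tokens.append(f"< {piece}>")
            continue
        if index > 0:
            tokens.append("< >")  # word-boundary token
        tokens.extend(f"<{phone}>" for phone in lexicon[piece.lower()])
    return "".join(tokens)


phonemes = to_phoneme_tokens("The engine\u2019s running, let\u2019s go.")
print(phonemes)
```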

ControlAudio

AudioLDM

Tango

Stable Audio*

AudioComposer

Ground Truth

Text prompt
Music plays, followed by mechanisms, typing, beeps, and an alarm.

Timing prompt
Music: 0.00s - 10.00s
Beeps: 1.00s - 1.20s, 3.00s - 3.20s, 4.90s - 5.10s, 6.90s - 7.10s
Typing: 1.20s - 7.80s
Alarm: 7.85s - 8.50s

Structured prompt

Music plays, followed by mechanisms, typing, beeps, and an alarm. @{Music. & <0.00,10.00>}@{Beeps. & <1.00,1.20><3.00,3.20><4.90,5.10><6.90,7.10>}@{Typing. & <1.20,7.80>}@{Alarm. & <7.85,8.50>}

ControlAudio

AudioLDM

Tango

Stable Audio*

AudioComposer

Ground Truth

Intelligible Audio Generation (For clarity, phoneme details are omitted here; in practice, speech content should be converted to phonemes before input to the model.)

Text prompt
People talking with the dull roar of a vehicle on the road.

Content prompt
What is she doing? What the fuck? What the fuck? She's in the middle of the road. Mate, he's on a joyride.

Structured prompt

People talking with the dull roar of a vehicle on the road. @{Road traffic ambience. & <0.00,10.00>}@{Male speech, man speaking. & <0.52,1.76>"What is she doing?"<2.00,2.62>"What the fuck?"<3.88,4.82>"What the fuck?"<5.45,7.64>"She's in the middle of the road, Mate."<8.20,9.70>"He's on a joyride."}

ControlAudio

AudioLDM 2 Speech

VoiceLDM-S

VoiceLDM-M

Ground Truth

Text prompt
A gun fires, followed by the sound of cocking and a man talking.

Content prompt
There you see? No problem.

Structured prompt

A gun fires, followed by the sound of cocking and a man talking. @{Outdoor shooting range ambience & <0.00,10.00>}@{Machine gun firing & <0.48,4.12>}@{Gunshot & <5.84,6.12>}@{Gun cocking & <6.64,6.90>}@{Male speech, man talking & <7.55,8.40>"There you see?"<8.68,9.56>"No problem."}
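
For readers who want to inspect such prompts programmatically, the sketch below splits a structured prompt back into its caption and timed events. It is an assumption-laden illustration rather than part of ControlAudio: the function `parse_structured_prompt` and both regular expressions are hypothetical, and they cover only the quoted-speech form shown in this section, not the phoneme-token form.

```python
import re

# One '@{...}' block; assumes the block body itself contains no '}'.
BLOCK = re.compile(r"@\{(.*?)\}")
# One '<start,end>' span, optionally followed by quoted speech content.
SPAN = re.compile(r'<\s*([\d.]+)\s*,\s*([\d.]+)\s*>(?:"([^"]*)")?')


def parse_structured_prompt(prompt):
    """Return (caption, events).

    events is a list of (label, [(start, end, speech_or_None), ...]).
    """
    caption = prompt.split("@{", 1)[0].strip()
    events = []
    for body in BLOCK.findall(prompt):
        first = SPAN.search(body)
        # The label is everything before the first span, minus the
        # optional ' & ' separator.
        label = body[: first.start()] if first else body
        label = label.rstrip(" &").strip()
        spans = [
            (float(start), float(end), speech or None)
            for start, end, speech in SPAN.findall(body)
        ]
        events.append((label, spans))
    return caption, events


caption, events = parse_structured_prompt(
    'A gun fires. @{Gunshot & <5.84,6.12>}'
    '@{Male speech, man talking & <7.55,8.40>"There you see?"'
    '<8.68,9.56>"No problem."}'
)
for label, spans in events:
    print(label, spans)
```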

ControlAudio

AudioLDM 2 Speech

VoiceLDM-S

VoiceLDM-M

Ground Truth

Text prompt
A man speaking over an intercom as a crowd of people talk followed by a dog barking.

Content prompt
and contain them until that person can be taken into custody effectively and safely on the part of the other team of police sheriffs.

Structured prompt

A man speaking over an intercom as a crowd of people talk followed by a dog barking. @{Crowd talking ambience & <0.00,10.00>}@{Male speech, man speaking & <0.46,5.14>"And contain them until that person can be taken into custody effectively and safely."<5.64,8.22>"On the part of the other team of police sheriffs."}@{Dog barking & <9.26,9.46>}

ControlAudio

AudioLDM 2 Speech

VoiceLDM-S

VoiceLDM-M

Ground Truth

Text prompt
Females voice narrating a scene as music is playing and rain drops are falling.

Content prompt
Daniel came out of the airport. He raised one arm to hail a taxi.

Structured prompt

Females voice narrating a scene as music is playing and rain drops are falling. @{Music & <0.00,10.00>}@{Female speech, woman narrating & <2.62,4.65>"Daniel came out of the airport."<5.37,8.26>"He raised one arm to hail a taxi."}@{Rain falling & <8.26,10.00>}

ControlAudio

AudioLDM 2 Speech

VoiceLDM-S

VoiceLDM-M

Ground Truth

Text prompt
Splashing water followed by a girl speaking then scraping and spitting.

Content prompt
This is the last time you did that first thing. Same thing.

Structured prompt

Splashing water followed by a girl speaking then scraping and spitting. @{Splashing water & <0.00,1.38>}@{Female speech, girl speaking & <1.57,4.52>"This is the last time you did that first thing. Same thing."}@{Scraping & <4.66,6.81><7.10,8.00>}@{Spitting & <8.10,8.48>}

ControlAudio

AudioLDM 2 Speech

VoiceLDM-S

VoiceLDM-M

Ground Truth
