ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
Anonymous submission
Abstract
Text-to-audio (TTA) generation with fine-grained control signals, such as precise timing control or intelligible speech content, has been explored in recent work. However, constrained by data scarcity, generation quality at scale remains limited. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce ControlAudio, a progressive diffusion modeling approach. Our method fits distributions conditioned on increasingly fine-grained information (text, timing, and phoneme features) through a step-by-step strategy. First, we propose a data construction method that spans both annotation and simulation and augments the condition information in the order of text, timing, and phoneme. Second, at the training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs to achieve scalable TTA generation, and then incrementally integrate the timing and phoneme features using unified semantic representations, which expands controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information and thereby aligns with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art temporal accuracy and speech clarity, significantly outperforming existing methods in both objective and subjective evaluations.

Figure 1: The end-to-end progressive diffusion modeling of ControlAudio, which combines progressive model training with progressively guided sampling for decoupled control of temporal structure and speech content.
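
The abstract describes progressively guided generation only in words and gives no formula. As a rough illustration of the idea, the sketch below assumes a classifier-free-guidance-style combination of three noise predictions (unconditional, text-only, and text plus timing/phoneme) and ramps the weight of the fine-grained term linearly over the sampling trajectory. The function name `progressively_guided_eps`, the linear ramp, and the default weights are all hypothetical choices made for this example; they are not taken from the paper.

```python
import numpy as np


def progressively_guided_eps(eps_uncond, eps_text, eps_fine, progress,
                             w_text=3.0, w_fine=3.0):
    """Combine three noise predictions with a step-dependent emphasis.

    eps_uncond: prediction with all conditions dropped.
    eps_text:   prediction conditioned on the text caption only.
    eps_fine:   prediction conditioned on text plus timing/phoneme features.
    progress:   0.0 at the first denoising step, 1.0 at the last.

    ASSUMPTION: the linear ramp and the weights are illustrative only.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    # Coarse guidance: push away from the unconditional prediction toward text.
    coarse_term = w_text * (eps_text - eps_uncond)
    # Fine guidance: push from text-only toward the fully conditioned
    # prediction, weighted more heavily as sampling approaches its end.
    fine_term = progress * w_fine * (eps_fine - eps_text)
    return eps_uncond + coarse_term + fine_term


if __name__ == "__main__":
    # Toy stand-ins for model outputs.
    eps_u = np.zeros(4)
    eps_t = np.ones(4)
    eps_f = 2.0 * np.ones(4)
    for p in (0.0, 0.5, 1.0):
        print(p, progressively_guided_eps(eps_u, eps_t, eps_f, p))
```

At `progress = 0.0` the expression reduces to guidance on the text condition alone, and the fine-grained term only reaches full weight at the final step, which is one simple way to mirror a coarse-to-fine schedule.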
Timing-Controlled Audio Generation
Structured prompt
A bathtub is being filled or washed. @{Bathtub (filling or washing) <0.00,6.60><6.75,10.00>}

| ControlAudio | AudioLDM | Tango | Stable Audio* | AudioComposer | Ground Truth |
|---|---|---|---|---|---|

Structured prompt
A woman sings with music and an alarm clock sounds. @{A woman sings & <0.00,6.73>}@{Alarm clock & <6.73,10.00>}

| ControlAudio | AudioLDM | Tango | Stable Audio* | AudioComposer | Ground Truth |
|---|---|---|---|---|---|

Structured prompt
A telephone rings repeatedly. @{Telephone bell ringing & <0.00,2.22><2.57,8.02><8.47,10.00>}

| ControlAudio | AudioLDM | Tango | Stable Audio* | AudioComposer | Ground Truth |
|---|---|---|---|---|---|

Structured prompt
A man is speaking and an engine is running. @{Engine is running <0.00,10.00>}@{Male speech, man speaking <0.00,1.40><DH><AH0>< ><EH1><N><JH><AH0><N><Z>< ><R><AH1><N><IH0><NG>< ,>< ><L><EH1><T><S>< ><G><OW1>< .>}

| ControlAudio | AudioLDM | Tango | Stable Audio* | AudioComposer | Ground Truth |
|---|---|---|---|---|---|

Structured prompt
Music plays, followed by mechanisms, typing, beeps, and an alarm. @{Music. & <0.00,10.00>}@{Beeps. & <1.00,1.20><3.00,3.20><4.90,5.10><6.90,7.10>}@{Typing. & <1.20,7.80>}@{Alarm. & <7.85,8.50>}

| ControlAudio | AudioLDM | Tango | Stable Audio* | AudioComposer | Ground Truth |
|---|---|---|---|---|---|

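
The structured prompts above share a visible pattern: a free-text caption followed by one or more `@{...}` blocks, each holding an event label and one or more `<start,end>` windows in seconds. A minimal parsing sketch for that pattern is given below. The names `Event`, `parse_structured_prompt`, and `check_windows` are hypothetical, and the 10-second clip length is an assumption inferred from the examples on this page rather than a documented limit.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# One @{...} block per sound event.
EVENT_RE = re.compile(r"@\{(.*?)\}")
# A numeric <start,end> window; optional spaces are tolerated.
WINDOW_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*>")


@dataclass
class Event:
    label: str
    windows: List[Tuple[float, float]]


def parse_structured_prompt(prompt: str) -> Tuple[str, List[Event]]:
    """Split a structured prompt into its caption and timed events."""
    caption = prompt.split("@{", 1)[0].strip()
    events = []
    for body in EVENT_RE.findall(prompt):
        # The label is everything before the first '<'; some prompts
        # separate it from the windows with '&', others do not.
        label = body.split("<", 1)[0].rstrip(" &").strip()
        windows = [(float(s), float(e)) for s, e in WINDOW_RE.findall(body)]
        events.append(Event(label, windows))
    return caption, events


def check_windows(events: List[Event], duration: float = 10.0) -> None:
    """Raise ValueError if a window is empty, reversed, or out of range.

    ASSUMPTION: duration=10.0 is inferred from the examples on this page.
    """
    for ev in events:
        for start, end in ev.windows:
            if not 0.0 <= start < end <= duration:
                raise ValueError(f"bad window <{start},{end}> for {ev.label!r}")


if __name__ == "__main__":
    prompt = ("A telephone rings repeatedly. "
              "@{Telephone bell ringing & <0.00,2.22><2.57,8.02><8.47,10.00>}")
    caption, events = parse_structured_prompt(prompt)
    check_windows(events)
    print(caption)
    for ev in events:
        print(ev.label, ev.windows)
```

Because the window pattern only matches numeric pairs, phoneme tokens such as `<DH>` inside a speech event are ignored by this parser rather than being mistaken for timings.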
Intelligible Audio Generation
For clarity, phoneme details are omitted here; in practice, speech content should be converted to phonemes before it is passed to the model.
Structured prompt
People talking with the dull roar of a vehicle on the road. @{Road traffic ambience. & <0.00,10.00>}@{Male speech, man speaking. & <0.52,1.76>"What is she doing?"<2.00,2.62>"What the fuck?"<3.88,4.82>"What the fuck?"<5.45,7.64>"She's in the middle of the road, Mate."<8.20,9.70>"He's on a joyride."}

| ControlAudio | AudioLDM 2 Speech | VoiceLDM-S | VoiceLDM-M | Ground Truth |
|---|---|---|---|---|

Structured prompt
A gun fires, followed by the sound of cocking and a man talking. @{Outdoor shooting range ambience & <0.00,10.00>}@{Machine gun firing & <0.48,4.12>}@{Gunshot & <5.84,6.12>}@{Gun cocking & <6.64,6.90>}@{Male speech, man talking & <7.55,8.40>"There you see?"<8.68,9.56>"No problem."}

| ControlAudio | AudioLDM 2 Speech | VoiceLDM-S | VoiceLDM-M | Ground Truth |
|---|---|---|---|---|

Structured prompt
A man speaking over an intercom as a crowd of people talk followed by a dog barking. @{Crowd talking ambience & <0.00,10.00>}@{Male speech, man speaking & <0.46,5.14>"And contain them until that person can be taken into custody effectively and safely."<5.64,8.22>"On the part of the other team of police sheriffs."}@{Dog barking & <9.26,9.46>}

| ControlAudio | AudioLDM 2 Speech | VoiceLDM-S | VoiceLDM-M | Ground Truth |
|---|---|---|---|---|

Structured prompt
Females voice narrating a scene as music is playing and rain drops are falling. @{Music & <0.00,10.00>}@{Female speech, woman narrating & <2.62,4.65>"Daniel came out of the airport."<5.37,8.26>"He raised one arm to hail a taxi."}@{Rain falling & <8.26,10.00>}

| ControlAudio | AudioLDM 2 Speech | VoiceLDM-S | VoiceLDM-M | Ground Truth |
|---|---|---|---|---|

Structured prompt
Splashing water followed by a girl speaking then scraping and spitting. @{Splashing water & <0.00,1.38>}@{Female speech, girl speaking & <1.57,4.52>"This is the last time you did that first thing. Same thing."}@{Scraping & <4.66,6.81><7.10,8.00>}@{Spitting & <8.10,8.48>}

| ControlAudio | AudioLDM 2 Speech | VoiceLDM-S | VoiceLDM-M | Ground Truth |
|---|---|---|---|---|
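
The speech events above pair each `<start,end>` window with a quoted transcript, and the note at the top of this section says the transcript is replaced by phonemes before it reaches the model. The sketch below shows one way to perform both steps: extracting the timed utterances, and rendering a transcript in the bracketed phoneme style seen in the "A man is speaking and an engine is running" prompt of the timing section (words separated by `< >`, punctuation written as `< ,>` or `< .>`). The names `parse_speech_segments`, `to_phoneme_tokens`, and `TOY_LEXICON` are hypothetical. The lexicon is a hand-written stand-in covering only the words of that one example; a real pipeline would need a grapheme-to-phoneme tool, which this page does not specify.

```python
import re
from typing import Dict, List, Tuple

# <start,end> immediately followed by a double-quoted transcript.
SEGMENT_RE = re.compile(
    r'<\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*>\s*"([^"]*)"'
)

# ASSUMPTION: toy lexicon copied from the single phoneme example on this
# page; it is not a real pronunciation dictionary.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["DH", "AH0"],
    "engine's": ["EH1", "N", "JH", "AH0", "N", "Z"],
    "running": ["R", "AH1", "N", "IH0", "NG"],
    "let's": ["L", "EH1", "T", "S"],
    "go": ["G", "OW1"],
}


def parse_speech_segments(event_body: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, transcript) for each timed utterance."""
    return [(float(s), float(e), text)
            for s, e, text in SEGMENT_RE.findall(event_body)]


def to_phoneme_tokens(text: str, lexicon: Dict[str, List[str]]) -> str:
    """Render text as bracketed phoneme tokens.

    Raises KeyError for words missing from the lexicon.
    """
    pieces = re.findall(r"[a-z']+|[.,?!]", text.lower())
    out: List[str] = []
    for i, piece in enumerate(pieces):
        if piece in ".,?!":
            out.append(f"< {piece}>")      # punctuation token, e.g. "< ,>"
            continue
        if i > 0:
            out.append("< >")              # separator before every later word
        out.extend(f"<{ph}>" for ph in lexicon[piece])
    return "".join(out)


if __name__ == "__main__":
    body = ('Male speech, man talking & <7.55,8.40>"There you see?"'
            '<8.68,9.56>"No problem."')
    print(parse_speech_segments(body))
    print(to_phoneme_tokens("The engine's running, let's go.", TOY_LEXICON))
```

Keeping extraction and phoneme rendering as separate functions means the quoted form shown on this page and the phoneme form fed to the model can be produced from the same parsed segments.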