ControlAudio: Tackling Text-Guided, Timing-Indicated
and Intelligible Audio Generation via Progressive Diffusion Modeling


Anonymous submission


Abstract

Recent work has explored text-to-audio (TTA) generation with fine-grained control signals, such as precise timing control or intelligible speech content. However, constrained by data scarcity, these methods still show compromised generation performance at scale. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method fits distributions conditioned on increasingly fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting the condition information in the order of text, timing, and phonemes. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information and aligns naturally with the coarse-to-fine sampling behavior of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations.




Figure 1: The end-to-end progressive diffusion modeling of ControlAudio, which combines progressive model training with progressively guided sampling for decoupled control of temporal structure and speech content.





Timing-Controlled Audio Generation

Text prompt
A bathtub is being filled or washed.
Timing prompt
Bathtub (filling or washing): 0.0s - 6.598s
Bathtub (filling or washing): 6.748s - 10.0s
Structured prompt
A bathtub is being filled or washed. @{Bathtub (filling or washing) <0.00,6.60><6.75,10.00>}
ControlAudio AudioLDM Tango Stable Audio* AudioComposer Ground Truth
Text prompt
A woman sings with music and an alarm clock sounds.
Timing prompt
A woman sings: 0.00s - 6.73s
Alarm clock: 6.73s - 10.00s
Structured prompt
A woman sings with music and an alarm clock sounds. @{A woman sings & <0.00,6.73>}@{Alarm clock & <6.73,10.00>}
ControlAudio AudioLDM Tango Stable Audio* AudioComposer Ground Truth
Text prompt
A telephone rings repeatedly.
Timing prompt
Telephone bell ringing: 0.0s - 2.221s
Telephone bell ringing: 2.568s - 8.022s
Telephone bell ringing: 8.473s - 10.0s
Structured prompt
A telephone rings repeatedly. @{Telephone bell ringing & <0.00,2.22><2.57,8.02><8.47,10.00>}
ControlAudio AudioLDM Tango Stable Audio* AudioComposer Ground Truth
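The mapping from a text prompt plus timing prompt to a structured prompt can be sketched as follows. This is a minimal illustration inferred only from the samples on this page, not the authors' implementation: the function name `build_structured_prompt` is hypothetical, the two-decimal rounding and the ` & ` separator are copied from the samples that use them, and intervals sharing a label are assumed to be merged into one `@{...}` block.

```python
def build_structured_prompt(caption, events):
    """Append one @{label & <start,end>...} block per event label.

    `events` is a list of (label, start_seconds, end_seconds) tuples.
    The layout is inferred from the samples on this page (assumption).
    """
    grouped = {}  # label -> list of (start, end), in first-seen order
    for label, start, end in events:
        grouped.setdefault(label, []).append((start, end))

    blocks = []
    for label, spans in grouped.items():
        # Timestamps are written with two decimals, as in the samples.
        span_text = "".join(f"<{s:.2f},{e:.2f}>" for s, e in spans)
        blocks.append(f"@{{{label} & {span_text}}}")
    return f"{caption} " + "".join(blocks)


prompt = build_structured_prompt(
    "A telephone rings repeatedly.",
    [
        ("Telephone bell ringing", 0.0, 2.221),
        ("Telephone bell ringing", 2.568, 8.022),
        ("Telephone bell ringing", 8.473, 10.0),
    ],
)
print(prompt)
```

Under these assumptions the call reproduces the structured prompt of the telephone sample, including the rounding of 2.568s to 2.57 and 8.473s to 8.47.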
Text prompt
A man is speaking and an engine is running.
Timing prompt
Engine is running : 1.40s - 10.00s
Male speech, man speaking : 0.00s - 1.40s
LLM Planning: "The engine’s running, let’s go."
Structured prompt
A man is speaking and an engine is running. @{Engine is running <0.00,10.00>}@{Male speech, man speaking <0.00,1.40><DH><AH0>< ><EH1><N><JH><AH0><N><Z>< ><R><AH1><N><IH0><NG>< ,>< ><L><EH1><T><S>< ><G><OW1>< .>}
ControlAudio AudioLDM Tango Stable Audio* AudioComposer Ground Truth
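The phoneme part of the structured prompt in the sample above can be reproduced with a short sketch. The page does not say which grapheme-to-phoneme tool is used, so the tiny `LEXICON` below is hand-written for this one sentence and purely hypothetical, as is the function name `phoneme_tokens`. The token layout (`<PH>` per phoneme, `< >` between words, `< ,>` and `< .>` for punctuation) is read off the sample itself.

```python
import re

# Hypothetical mini-lexicon covering only the planned sentence (assumption);
# a real system would use a full pronunciation dictionary or G2P model.
LEXICON = {
    "the": ["DH", "AH0"],
    "engine's": ["EH1", "N", "JH", "AH0", "N", "Z"],
    "running": ["R", "AH1", "N", "IH0", "NG"],
    "let's": ["L", "EH1", "T", "S"],
    "go": ["G", "OW1"],
}


def phoneme_tokens(sentence):
    """Format a sentence as the <PH>...< >... token string seen in the sample."""
    tokens = []
    for i, chunk in enumerate(sentence.split()):
        if i > 0:
            tokens.append("< >")  # word-boundary token
        # Split trailing punctuation from the word itself.
        match = re.fullmatch(r"(.*?)([,.!?]*)", chunk)
        word, punct = match.group(1), match.group(2)
        key = word.lower().replace("\u2019", "'")  # normalize curly apostrophe
        tokens.extend(f"<{p}>" for p in LEXICON[key])
        tokens.extend(f"< {c}>" for c in punct)  # e.g. "< ,>" or "< .>"
    return "".join(tokens)


tokens = phoneme_tokens("The engine's running, let's go.")
print(tokens)
```

A word missing from the lexicon raises `KeyError` here, which keeps the sketch honest about its limited coverage.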
Text prompt
Music plays, followed by mechanisms, typing, beeps, and an alarm.
Timing prompt
Music : 0.00s - 10.00s
Beeps : 1.00s - 1.20s, 3.00s - 3.20s, 4.90s - 5.10s, 6.90s - 7.10s
Typing : 1.20s - 7.80s
Alarm : 7.85s - 8.50s
Structured prompt
Music plays, followed by mechanisms, typing, beeps, and an alarm. @{Music. & <0.00,10.00>}@{Beeps. & <1.00,1.20><3.00,3.20><4.90,5.10><6.90,7.10>}@{Typing. & <1.20,7.80>}@{Alarm. & <7.85,8.50>}
ControlAudio AudioLDM Tango Stable Audio* AudioComposer Ground Truth
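For inspecting a multi-event structured prompt such as the one above, the format can also be read back into a caption and a list of timed events. The parser below is a sketch based only on the prompt strings shown on this page; the name `parse_structured_prompt` and the regular expressions are assumptions, not part of ControlAudio.

```python
import re

BLOCK = re.compile(r"@\{(.*?)\}")                         # one @{...} event block
SPAN = re.compile(r"<(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)>")   # one <start,end> pair


def parse_structured_prompt(prompt):
    """Return (caption, [(label, [(start, end), ...]), ...])."""
    caption = prompt.split("@{", 1)[0].strip()
    events = []
    for body in BLOCK.findall(prompt):
        spans = [(float(s), float(e)) for s, e in SPAN.findall(body)]
        first = SPAN.search(body)
        label = body[: first.start()] if first else body
        # Drop the optional "&" separator that some samples use.
        label = label.strip().rstrip("&").strip()
        events.append((label, spans))
    return caption, events


caption, events = parse_structured_prompt(
    "Music plays, followed by mechanisms, typing, beeps, and an alarm. "
    "@{Music. & <0.00,10.00>}"
    "@{Beeps. & <1.00,1.20><3.00,3.20><4.90,5.10><6.90,7.10>}"
    "@{Typing. & <1.20,7.80>}@{Alarm. & <7.85,8.50>}"
)
for label, spans in events:
    print(label, spans)
```

Phoneme tokens such as `<DH>` contain no comma-separated numbers, so they are ignored by `SPAN` and do not show up as time spans.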


Intelligible Audio Generation

Text prompt
People talking with the dull roar of a vehicle on the road.
Content prompt
What is she doing? What the fuck? What the fuck? She's in the middle of the road. Mate, he's on a joyride.
ControlAudio AudioLDM 2 Speech VoiceLDM-S VoiceLDM-M Ground Truth
Text prompt
A gun cocking followed by a gunshot and a man talking.
Content prompt
There you see? No problem.
ControlAudio AudioLDM 2 Speech VoiceLDM-S VoiceLDM-M Ground Truth
Text prompt
A man speaking over an intercom as a crowd of people talk followed by a dog barking.
Content prompt
and contain them until that person can be taken into custody effectively and safely on the part of the other team of police sheriffs.
ControlAudio AudioLDM 2 Speech VoiceLDM-S VoiceLDM-M Ground Truth
Text prompt
Females voice narrating a scene as music is playing and rain drops are falling.
Content prompt
Daniel came out of the airport. He raised one arm to hail a taxi.
ControlAudio AudioLDM 2 Speech VoiceLDM-S VoiceLDM-M Ground Truth
Text prompt
Splashing water followed by a girl speaking then scraping and spitting.
Content prompt
This is the last time you did that first thing. Same thing.
ControlAudio AudioLDM 2 Speech VoiceLDM-S VoiceLDM-M Ground Truth