ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
Anonymous submission
Abstract
Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method fits distributions conditioned on progressively more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, which augments the condition information in the order of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs to achieve scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations to expand controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information and aligns naturally with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations.
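The abstract does not spell out the guidance rule used at inference. As a rough illustration of what "sequentially emphasizing more fine-grained information" could look like, the sketch below applies classifier-free guidance toward a text-only prediction in the early (noisy) steps and toward a fully conditioned prediction (text, timing, phoneme) in the later steps. The function names, the hard switch at `switch_frac`, and the `scale` value are all hypothetical choices made for this example, not details taken from ControlAudio.

```python
from typing import List, Tuple


def guidance_weights(step: int, num_steps: int, switch_frac: float = 0.5) -> Tuple[float, float]:
    """Return (w_text, w_fine) for one sampling step; step 0 is the noisiest.

    Assumption: a hard switch from coarse (text) to fine (timing/phoneme)
    guidance once `switch_frac` of the steps have been taken.
    """
    progress = step / max(num_steps - 1, 1)
    if progress < switch_frac:
        return 1.0, 0.0
    return 0.0, 1.0


def guided_prediction(
    eps_uncond: List[float],
    eps_text: List[float],
    eps_fine: List[float],
    step: int,
    num_steps: int,
    scale: float = 3.0,
) -> List[float]:
    """Combine three noise predictions with stage-dependent guidance.

    eps_uncond: prediction with no condition
    eps_text:   prediction conditioned on the text prompt only
    eps_fine:   prediction conditioned on the full structured prompt
    """
    w_text, w_fine = guidance_weights(step, num_steps)
    return [
        u + scale * (w_text * (t - u) + w_fine * (f - u))
        for u, t, f in zip(eps_uncond, eps_text, eps_fine)
    ]


# Toy one-dimensional predictions, just to show which condition drives each stage.
early = guided_prediction([0.0], [1.0], [2.0], step=0, num_steps=10)
late = guided_prediction([0.0], [1.0], [2.0], step=9, num_steps=10)
print(early, late)
```

A smooth interpolation between the two weights would be an equally plausible schedule; the hard switch is used here only to keep the example short.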
Figure 1: End-to-end progressive diffusion modeling in ControlAudio, which combines progressive model training with a progressively guided sampling process for decoupled control of temporal structure and speech content.
Timing-Controlled Audio Generation
Text prompt
A bathtub is being filled or washed.
Timing prompt
Bathtub (filling or washing): 0.0s - 6.598s
Bathtub (filling or washing): 6.748s - 10.0s
Structured prompt
A bathtub is being filled or washed. @{Bathtub (filling or washing) <0.00,6.60><6.75,10.00>}
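The mapping from a text prompt plus timing prompt to the structured prompt can be reproduced with a few lines of Python. This is a minimal sketch inferred from the examples on this page, not the authors' preprocessing code: the function name `build_structured_prompt` and the `sep` parameter are hypothetical, and the two-decimal rounding is an assumption based on how 6.598s appears as 6.60 above.

```python
from typing import Dict, List, Tuple


def build_structured_prompt(
    caption: str,
    events: List[Tuple[str, float, float]],
    sep: str = " ",
) -> str:
    """Append one @{label <start,end>...} block per event label to the caption.

    events: (label, start_seconds, end_seconds) tuples from the timing prompt.
    sep:    text placed between the label and its spans (assumed; some
            examples on this page use " & " instead of a single space).
    """
    grouped: Dict[str, List[Tuple[float, float]]] = {}  # keeps first-seen order
    for label, start, end in events:
        grouped.setdefault(label, []).append((start, end))

    blocks = []
    for label, spans in grouped.items():
        span_text = "".join(f"<{s:.2f},{e:.2f}>" for s, e in sorted(spans))
        blocks.append("@{" + label + sep + span_text + "}")
    return caption + " " + "".join(blocks)


prompt = build_structured_prompt(
    "A bathtub is being filled or washed.",
    [
        ("Bathtub (filling or washing)", 0.0, 6.598),
        ("Bathtub (filling or washing)", 6.748, 10.0),
    ],
)
print(prompt)
# A bathtub is being filled or washed. @{Bathtub (filling or washing) <0.00,6.60><6.75,10.00>}
```

Passing `sep=" & "` yields the variant with an ampersand between the label and its spans.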
Audio samples: ControlAudio, AudioLDM, Tango, Stable Audio*, AudioComposer, Ground Truth
Text prompt
A woman sings with music and an alarm clock sounds.
Structured prompt
A woman sings with music and an alarm clock sounds. @{A woman sings & <0.00,6.73>}@{Alarm clock & <6.73,10.00>}
Audio samples: ControlAudio, AudioLDM, Tango, Stable Audio*, AudioComposer, Ground Truth
Text prompt
A telephone rings repeatedly.
Timing prompt
Telephone bell ringing: 0.0s - 2.221s
Telephone bell ringing: 2.568s - 8.022s
Telephone bell ringing: 8.473s - 10.0s
Structured prompt
A telephone rings repeatedly. @{Telephone bell ringing & <0.00,2.22><2.57,8.02><8.47,10.00>}
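Going the other way, a structured prompt can be split back into its caption and per-event time spans, which is convenient when checking generated audio against the requested timing. The parser below is a sketch based only on the prompt strings shown on this page; `parse_structured_prompt` is a hypothetical helper, and it assumes that labels never contain `<` or `}`.

```python
import re
from typing import List, Tuple

BLOCK_RE = re.compile(r"@\{(.*?)\}")  # one @{...} event block
SPAN_RE = re.compile(r"<(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)>")  # one <start,end> pair


def parse_structured_prompt(prompt: str) -> Tuple[str, List[Tuple[str, List[Tuple[float, float]]]]]:
    """Return (caption, [(label, [(start, end), ...]), ...])."""
    caption = prompt.split("@{", 1)[0].strip()
    events = []
    for body in BLOCK_RE.findall(prompt):
        spans = [(float(s), float(e)) for s, e in SPAN_RE.findall(body)]
        # The label is everything before the first "<"; drop an optional " & ".
        label = body.split("<", 1)[0].rstrip(" &").strip()
        events.append((label, spans))
    return caption, events


caption, events = parse_structured_prompt(
    "A telephone rings repeatedly. "
    "@{Telephone bell ringing & <0.00,2.22><2.57,8.02><8.47,10.00>}"
)
print(caption)
print(events)
```

Non-numeric tokens inside a block, such as phoneme tokens, are ignored by `SPAN_RE`, so the same function also handles speech events.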
Audio samples: ControlAudio, AudioLDM, Tango, Stable Audio*, AudioComposer, Ground Truth
Text prompt
A man is speaking and an engine is running.
Timing prompt
Engine is running: 1.40s - 10.00s
Male speech, man speaking: 0.00s - 1.40s
LLM Planning: "The engine’s running, let’s go."
Structured prompt
A man is speaking and an engine is running. @{Engine is running <0.00,10.00>}@{Male speech, man speaking <0.00,1.40><DH><AH0>< ><EH1><N><JH><AH0><N><Z>< ><R><AH1><N><IH0><NG>< ,>< ><L><EH1><T><S>< ><G><OW1>< .>}
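In the speech block above, the time span is followed by one `<...>` token per phoneme, with `< >` marking word boundaries and tokens such as `< ,>` and `< .>` carrying punctuation. The sketch below groups those tokens back into words. It assumes the tokens are ARPAbet-style symbols (as they appear to be here) and that any token that is not alphanumeric acts as a separator; `phoneme_words` is a hypothetical helper, not part of ControlAudio.

```python
import re
from typing import List

TOKEN_RE = re.compile(r"<([^<>]*)>")  # any <...> token
TIME_RE = re.compile(r"\d+(?:\.\d+)?,\d+(?:\.\d+)?")  # contents of a <start,end> span


def phoneme_words(block: str) -> List[List[str]]:
    """Group the phoneme tokens of one speech block into per-word lists."""
    tokens = [t for t in TOKEN_RE.findall(block) if not TIME_RE.fullmatch(t)]
    words: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok.strip().isalnum():
            current.append(tok)  # a phoneme symbol such as "DH" or "AH0"
        elif current:
            words.append(current)  # space or punctuation closes the word
            current = []
    if current:
        words.append(current)
    return words


speech_block = (
    "@{Male speech, man speaking <0.00,1.40><DH><AH0>< ><EH1><N><JH><AH0><N><Z>"
    "< ><R><AH1><N><IH0><NG>< ,>< ><L><EH1><T><S>< ><G><OW1>< .>}"
)
for word in phoneme_words(speech_block):
    print(" ".join(word))
```

The five groups line up with the five words of the planned sentence, "The engine’s running, let’s go."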
Audio samples: ControlAudio, AudioLDM, Tango, Stable Audio*, AudioComposer, Ground Truth
Text prompt
Music plays, followed by mechanisms, typing, beeps, and an alarm.
Structured prompt
Music plays, followed by mechanisms, typing, beeps, and an alarm. @{Music. & <0.00,10.00>}@{Beeps. & <1.00,1.20><3.00,3.20><4.90,5.10><6.90,7.10>}@{Typing. & <1.20,7.80>}@{Alarm. & <7.85,8.50>}
Audio samples: ControlAudio, AudioLDM, Tango, Stable Audio*, AudioComposer, Ground Truth
Intelligible Audio Generation
Text prompt
People talking with the dull roar of a vehicle on the road.
Content prompt
What is she doing? What the fuck? What the fuck? She's in the middle of the road. Mate, he's on a joyride.
Audio samples: ControlAudio, AudioLDM 2 Speech, VoiceLDM-S, VoiceLDM-M, Ground Truth
Text prompt
A gun cocking followed by a gunshot and a man talking.
Content prompt
There you see? No problem.
Audio samples: ControlAudio, AudioLDM 2 Speech, VoiceLDM-S, VoiceLDM-M, Ground Truth
Text prompt
A man speaking over an intercom as a crowd of people talk followed by a dog barking.
Content prompt
and contain them until that person can be taken into custody effectively and safely on the part of the other team of police sheriffs.
Audio samples: ControlAudio, AudioLDM 2 Speech, VoiceLDM-S, VoiceLDM-M, Ground Truth
Text prompt
Females voice narrating a scene as music is playing and rain drops are falling.
Content prompt
Daniel came out of the airport. He raised one arm to hail a taxi.
Audio samples: ControlAudio, AudioLDM 2 Speech, VoiceLDM-S, VoiceLDM-M, Ground Truth
Text prompt
Splashing water followed by a girl speaking then scraping and spitting.
Content prompt
This is the last time you did that first thing. Same thing.