SpongeBob: Sync-Aware Harmonious
Audio-Visual Generative Editing

The first framework for joint audio-visual editing within a unified dual-stream diffusion process

Sen Liang1,2†, Cong Wang2†, Fengbin Guan1, Zhentao Yu2, Yiting Lu1, Yuanzhi Wang2,
Yuan Zhou2, Xin Li1, Zhibo Chen1*
1University of Science and Technology of China    2Tencent Hunyuan
†Equal contribution   *Corresponding author
Abstract

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines that lack bidirectional interaction between modalities. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these limitations, we propose SpongeBob, the first end-to-end joint audio-visual editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Because paired data are scarce, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show that SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%.

Method

Overview of the SpongeBob framework

SpongeBob employs a dual-stream Diffusion Transformer (DiT) that simultaneously edits video and audio within a unified denoising process. The framework comprises three core designs: (1) a Sync-Aware Editing Mechanism that achieves audio-visual alignment through bidirectional cross-modal attention, mask-guided spatial routing, and three-way temporal RoPE unification; (2) a Context-Aware Module that lets the generated audio explicitly perceive the unedited audio-visual context, avoiding conflicts with preserved content; and (3) Sync-Preserving Training and Guidance (SPTG), which reduces context conflicts and enhances cross-modal synchronization through multi-task alignment training and two-stage inference guidance.
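The page does not spell out how the temporal RoPE unification works. As a minimal sketch, assume it means that tokens from different streams are indexed by a shared timestamp in seconds rather than by their own token index, so that rotary position embeddings agree across modalities. The function names and the token rates (4 video latents and 8 audio latents per second) are hypothetical and chosen only for illustration.

```python
import math

def shared_time_positions(num_tokens, tokens_per_second):
    """Map token indices of one stream to timestamps in seconds (assumed scheme)."""
    return [i / tokens_per_second for i in range(num_tokens)]

def rope_rotate(vec, position, base=10000.0):
    """Apply a standard 1-D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for k in range(0, dim, 2):
        theta = position / (base ** (k / dim))  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

# Hypothetical rates: 4 video latents/s and 8 audio latents/s.
video_times = shared_time_positions(4, 4.0)
audio_times = shared_time_positions(8, 8.0)

# Video token 2 and audio token 4 land on the same timestamp, so a query
# and a key at those tokens receive the same rotation angle.
query = rope_rotate([1.0, 0.0, 0.0, 1.0], video_times[2])
key = rope_rotate([1.0, 0.0, 0.0, 1.0], audio_times[4])
```

With positions expressed on a common time axis, the attention score between a rotated query and key depends only on their time offset, which is the property a cross-modal synchronization mechanism would rely on under this assumption.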

SpongeBob Framework

Figure: Given a source video with an object mask, text instructions, and a reference image, SpongeBob jointly edits the visual content and synthesizes synchronized audio through a dual-stream DiT with sync-aware editing mechanism and context-aware module.
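To make the interaction between the two streams concrete, the following is a simplified sketch of one bidirectional cross-modal attention step. It is not the authors' implementation: it omits learned projections, multiple heads, and normalization, and it assumes that mask-guided spatial routing means audio queries may only attend to video tokens inside the object mask. All names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values, key_allowed=None):
    """Scaled dot-product attention; key_allowed[j] == False hides key j."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = []
        for j, k in enumerate(keys):
            s = scale * sum(a * b for a, b in zip(q, k))
            if key_allowed is not None and not key_allowed[j]:
                s = -1e9  # large negative score gives ~zero attention weight
            scores.append(s)
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

def bidirectional_block(video, audio, video_in_mask):
    """One residual update in each direction between the two token streams."""
    # Audio queries read only from video tokens inside the edit mask (assumption).
    audio_update = cross_attend(audio, video, video, key_allowed=video_in_mask)
    # Video queries read from every audio token.
    video_update = cross_attend(video, audio, audio)
    new_video = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(video, video_update)]
    new_audio = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(audio, audio_update)]
    return new_video, new_audio
```

Calling `bidirectional_block(video, audio, video_in_mask)` with lists of equal-width token vectors returns updated copies of both streams; in this sketch, video tokens outside the mask contribute nothing to the audio update.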

Results Gallery

Click play to compare source video with SpongeBob's joint audio-visual editing result

001
Object Speech: "" Audio prompt: [Non-speech]. A dog barks loudly and repeatedly, its deep, resonant vocalizations echoing with energy and alertness.
🎬 Visual prompt: A Shiba Inu with orange-brown fur, upright ears, and a white chest barks energetically, its mouth opening wide with each vocalization.
Reference
Source
Masked
✨ SpongeBob
002
Object Speech: "" Audio prompt: [Non-speech]. A cat lets out a series of sharp, high-pitched meows, vocalizing with urgency and intensity.
🎬 Visual prompt: A fluffy light-colored cat with green eyes meows with urgency, its mouth opening and closing expressively as it vocalizes.
003
Object Speech: "" Audio prompt: [Non-speech]. A traditional Chinese pipa is being plucked, producing bright, resonant tones with rapid tremolo and melodic flourishes.
🎬 Visual prompt: A wooden pear-shaped stringed instrument with ornate circular sound-hole decorations is being plucked, producing bright resonant tones with rapid melodic flourishes.
004
Object Speech: "" Audio prompt: [Non-speech]. A cuckoo bird produces its distinctive two-note call, a melodic and repetitive sound echoing through the trees.
🎬 Visual prompt: A brown cuckoo-like bird with a pale throat and yellow beak perches calmly, occasionally turning its head and producing its distinctive two-note call.
005
Object Speech: "" Audio prompt: [Non-speech]. An acoustic guitar is being strummed and plucked, producing warm, resonant chords and melodic phrases.
🎬 Visual prompt: An acoustic guitar with a glossy sunburst body and dark fretboard is being strummed, producing warm resonant chords and melodic phrases.
006
Object Speech: "" Audio prompt: [Non-speech]. An acoustic guitar is being strummed and plucked, producing warm, resonant chords and melodic phrases.
🎬 Visual prompt: A front-facing acoustic guitar with a natural wood body and black pickguard is being plucked, its strings vibrating as warm tones fill the air.
007
Object Speech: "" Audio prompt: [Non-speech]. A cat lets out a series of sharp, high-pitched meows, vocalizing with urgency and intensity.
🎬 Visual prompt: A brown tabby cat with green eyes and upright ears meows expressively, its mouth opening wide to reveal its teeth and pink tongue.
009
Object Speech: "" Audio prompt: [Non-speech]. A goose honks loudly with a deep, resonant call, its rhythmic vocalizations carrying across the open air.
🎬 Visual prompt: A white goose with an orange beak and long neck honks loudly, its beak opening wide with each deep resonant call.
010
Object Speech: "" Audio prompt: [Non-speech]. A cow lets out a deep, prolonged moo, its low-pitched vocalization resonating with a warm, rumbling quality.
🎬 Visual prompt: A black-and-white dairy cow with irregular patches and a white face lets out a deep prolonged moo, its mouth slightly open.
011
Object Speech: "" Audio prompt: [Non-speech]. A crow caws loudly with a harsh, raspy call, its sharp vocalizations punctuating the air in rapid succession.
🎬 Visual prompt: A black-and-gray crow with a dark head and gray body feathers caws loudly, its beak opening sharply with each harsh raspy call.
012
Object Speech: "" Audio prompt: [Non-speech]. A frog croaks with a deep, guttural sound, its vocal sacs inflating as it produces rhythmic, resonant calls.
🎬 Visual prompt: A yellow-brown frog with dark eyes and speckled skin croaks deeply, its body pulsing rhythmically as it produces guttural calls.
001
Dual Speech: "His daddy's name is Forrest, too?"
🎬 Visual prompt: In a poignant close-up, a man with receding brown hair, fair skin, blue eyes, wearing a light suit jacket over a patterned blue shirt, stares intently with a serious and questioning expression. 🎧 Audio prompt: [Speech]. A person is speaking
002
Dual Speech: "Well now you close an eye for us."
🎬 Visual prompt: A man with medium-length dark hair with lighter highlights, glasses, a goatee, wearing a casual jacket, speaks with a calm and deliberate tone. 🎧 Audio prompt: [Speech]. A person is speaking
003
Dual Speech: "Hello. Yeah, it's my island."
🎬 Visual prompt: A man with receding light-brown hair, fair skin, and a lean angular face, wearing a dark suit, speaks earnestly with subtle facial movements. 🎧 Audio prompt: [Speech]. A person is speaking
004
Dual Speech: "You've been to room."
🎬 Visual prompt: A woman with shoulder-length blonde highlighted hair, fair skin, blue eyes, and dangling earrings speaks thoughtfully, her expression shifting between contemplation and engagement. 🎧 Audio prompt: [Speech]. A person is speaking
005
Dual Speech: "Named after a French swimming pool on a Japanese ship full of animals heading to Canada. Call me back."
🎬 Visual prompt: A man with short dark brown hair, fair skin, blue eyes, and angular features speaks animatedly with expressive gestures and a broad smile. 🎧 Audio prompt: [Speech]. A person is speaking
006
Dual Speech: "I don't want to leave."
🎬 Visual prompt: A man with short brown hair, fair skin, blue eyes, and a square jaw, wearing a teal shirt and dark vest, speaks with conviction. 🎧 Audio prompt: [Speech]. A person is speaking
008
Dual Speech: "You're going to teach me to read."
🎬 Visual prompt: A woman with long straight strawberry-blonde hair, very fair skin, light eyes, and elegant features speaks gently, her expression warm and attentive. 🎧 Audio prompt: [Speech]. A person is speaking
009
Dual Speech: "You cannot follow three different religions at the same time, Piscine."
🎬 Visual prompt: A man with short curly brown hair, fair skin, blue-green eyes, light stubble, wearing a black leather jacket over a white henley, speaks while eating. 🎧 Audio prompt: [Speech]. A person is speaking
010
Dual Speech: "What do you mean."
🎬 Visual prompt: A man with short dark hair, fair skin, light stubble, wearing a dark olive T-shirt, speaks while eating. 🎧 Audio prompt: [Speech]. A person is speaking
011
Dual Speech: "You've been to room."
🎬 Visual prompt: A young woman with straight brown hair, fair skin, light eyes, full lips, wearing a turquoise top, speaks softly with a thoughtful expression. 🎧 Audio prompt: [Speech]. A person is speaking
002
Addition Speech: "Here's looking at you, kid."
🎬 Visual prompt: A bearded man with dark swept-back hair, fair skin, green eyes, and a dark shirt speaks directly to the camera in front of a light blue six-panel door with a brass knocker. 🎧 Audio prompt: [Speech]. A person is speaking
003
Addition Speech: "Life is like a box of chocolates."
🎬 Visual prompt: A man with black hair, medium skin tone, light stubble, wearing a burgundy blazer over a gray shirt, speaks directly to the camera as a dark suburban night scene is visible through a window behind him. 🎧 Audio prompt: [Speech]. A person is speaking
004
Addition Speech: "You talking to me?"
🎬 Visual prompt: A middle-aged man with short brown hair, fair skin, light facial hair, glasses, and a dark jacket speaks directly to the camera on a quiet suburban street with colorful houses. 🎧 Audio prompt: [Speech]. A person is speaking
006
Addition Speech: "Life is like a box of chocolates."
🎬 Visual prompt: A woman with red hair, fair skin, green eyes, and a dark blue outfit speaks directly to the camera in an interior hallway with floral wallpaper in pastel green and pink tones. 🎧 Audio prompt: [Speech]. A person is speaking
001
Deletion "Long-haired German Shepherd barking on grass"
003
Deletion "Yellow and black warbler singing on branch"
004
Deletion "White ambulance with siren wailing"
005
Deletion "Silver airplane making loud crashing sounds"
006
Deletion "Reddish-brown hound dog howling and stretching"
007
Deletion "Gray sports car engine roaring on track"
008
Deletion "Camouflage military helicopter flying across the sky"
001
Single Speech: "Just you watch and see. Mama can haiku, too."
🎬 Visual prompt: A woman with long brown hair, fair skin, brown eyes, wearing a teal sweater and a silver watch, speaks with a serious expression, gesturing with her right hand. 🎧 Audio prompt: [Speech]. A person is speaking
002
Single Speech: "I searched every inch of every room in her house and guess what wasn't there."
🎬 Visual prompt: A woman with light brown hair, fair skin, blue eyes, wearing a black strapless dress, speaks earnestly with highly animated expressions, shifting between passionate gestures and pensive looks. 🎧 Audio prompt: [Speech]. A person is speaking
003
Single Speech: "Is this about that baked ziti I ordered last week?"
🎬 Visual prompt: A bald middle-aged man with fair skin, light stubble, wearing a light blue dress shirt and a dark patterned tie, stands before a mirror in an elegantly decorated room. 🎧 Audio prompt: [Speech]. A person is speaking
004
Single Speech: "What's everyone uh psyched for this summer?"
🎬 Visual prompt: A young man with short spiky hair, fair skin, blue eyes, a trimmed beard, wearing an orange hoodie over a gray shirt, looks up and offers a slight knowing smile. 🎧 Audio prompt: [Speech]. A person is speaking
005
Single Speech: "And dear old dad strikes again."
🎬 Visual prompt: A middle-aged man with short brown hair, a receding hairline, fair skin, wearing a gray suit and tie, raises his right hand with a knowing smirk. 🎧 Audio prompt: [Speech]. A person is speaking
007
Single Speech: "Life is like a box of chocolates, you never know what you're gonna get."
🎬 Visual prompt: A woman with long wavy reddish-brown hair, fair skin, brown eyes, wearing a vibrant lime green jacket and patterned scarf, speaks passionately in a heated argument in a modern wood-paneled room. 🎧 Audio prompt: [Speech]. A person is speaking
008
Single Speech: "Now that that shit you pulled in town nearly got us all killed."
🎬 Visual prompt: In a sun-dappled outdoor setting, a man with a beard, wearing a large black wide-brimmed hat and a blue chambray shirt over a dark vest, speaks deeply engaged in conversation. 🎧 Audio prompt: [Speech]. A person is speaking
009
Single Speech: "Intelligence agency is just putting out fires every bloody day."
🎬 Visual prompt: A woman with short wavy blonde hair, fair skin, green eyes, defined cheekbones, wearing a purple coat, speaks with a serious expression. 🎧 Audio prompt: [Speech]. A person is speaking
010
Single Speech: "Okay. Okay, do it fast."
🎬 Visual prompt: A man with a shaved head, short beard, wearing a dark hoodie and plaid shirt, stands indoors in a brightly lit room with a concerned and slightly frustrated expression. 🎧 Audio prompt: [Speech]. A person is speaking
011
Single Speech: "You should get ready. It's almost two."
🎬 Visual prompt: An elderly man with short white hair, fair skin, glasses, wearing a dark jacket, sits with a serious and contemplative expression in a dimly lit wood-paneled room. 🎧 Audio prompt: [Speech]. A person is speaking
012
Single Speech: "No. But I had a really good teacher."
🎬 Visual prompt: A man with short brown hair, fair skin, light stubble, wearing a dark leather jacket and white shirt, shifts from focused concern to a subtle confident smile. 🎧 Audio prompt: [Speech]. A person is speaking
015
Single Speech: "I just assume that one's with the authority and this one with us."
🎬 Visual prompt: A middle-aged man with tousled dark hair, fair skin, sunglasses, a trimmed goatee, wearing a dark suit, turns from a pensive look to gaze directly at the camera. 🎧 Audio prompt: [Speech]. A person is speaking
016
Single Speech: "I'm going to get my money and let it go."
🎬 Visual prompt: An older bald man with fair skin, blue eyes, wearing a black shirt, speaks with a somber expression in a tense indoor setting. 🎧 Audio prompt: [Speech]. A person is speaking
020
Single Speech: "They can't give it away. You booked it. You gave it a check."
🎬 Visual prompt: A blonde woman with a loose updo, fair skin, blue eyes, wearing a pale sleeveless top, speaks with an animated and concerned expression, her hands clasped near her face. 🎧 Audio prompt: [Speech]. A person is speaking
022
Single Speech: "I'll give you your privacy."
🎬 Visual prompt: A woman with light brown hair, fair skin, brown eyes, speaks naturally. 🎧 Audio prompt: [Speech]. A person is speaking
023
Single Speech: "Here's looking at you, kid."
🎬 Visual prompt: An East Asian man with medium-length black hair and a warm skin tone, wearing a black leather jacket, briefly closes his eyes with a somber expression in a suburban neighborhood. 🎧 Audio prompt: [Speech]. A person is speaking
024
Single Speech: ""
🎬 Visual prompt: A man with curly dark hair, fair skin, a full beard, wearing a dark heavy coat, speaks with quiet intensity in a wood-paneled room with warm high-contrast lighting. 🎧 Audio prompt: [Speech]. A person is speaking
025
Single Speech: "London, what your life would be like going forward."
🎬 Visual prompt: A man with short brown hair, fair skin, blue eyes, light stubble, wearing a light gray T-shirt, speaks with a furrowed brow in a measured manner. 🎧 Audio prompt: [Speech]. A person is speaking
026
Single Speech: "Out of a racket doing it, which is great."
🎬 Visual prompt: A man with dark hair, fair skin, a reddish-brown beard, wearing a navy suit and white shirt, speaks earnestly with focused determination. 🎧 Audio prompt: [Speech]. A person is speaking
027
Single Speech: "Now we go through every last pixel."
🎬 Visual prompt: A man with shaggy brown hair, fair skin, brown eyes, wearing a gray short-sleeved button-down shirt, stands in a modern kitchen with a concerned expression. 🎧 Audio prompt: [Speech]. A person is speaking
029
Single Speech: "What about this one?"
🎬 Visual prompt: A man with short brown hair, fair skin, light stubble, wearing a brown leather jacket and black shirt, shifts from a soft smile to a wide charming grin. 🎧 Audio prompt: [Speech]. A person is speaking
032
Single Speech: "Sorry. I honestly thought you were worse."
🎬 Visual prompt: An older man with gray curly hair, fair skin, a gray mustache, wearing a light teal blazer and dark shirt, speaks with a serious expression that gradually softens. 🎧 Audio prompt: [Speech]. A person is speaking


Method Comparison

Qualitative comparison with existing methods — click play to listen

02
Object "A cat lets out a series of sharp, high-pitched meows."
Source
AvED
CoherentAVEdit
VACE+Foley
AVI-Edit
✨ SpongeBob
03
Object "A dog barks loudly and repeatedly."
05
Single "It just I can't do this. You can. I'm just..." Audio prompt: [Speech]. A person is speaking
06
Single "Now that that shit you pulled in town nearly got us all killed." Audio prompt: [Speech]. A person is speaking
07
Dual "His daddy's name is Forrest, too?" Audio prompt: [Speech]. A person is speaking
08
Dual "I don't want to leave." Audio prompt: [Speech]. A person is speaking
Citation

If you find our work useful, please consider citing

@article{liang2025spongebob,
  title={SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing},
  author={Liang, Sen and Wang, Cong and Guan, Fengbin and Yu, Zhentao and Lu, Yiting and Wang, Yuanzhi and Zhou, Yuan and Li, Xin and Chen, Zhibo},
  journal={arXiv preprint arXiv:2605.25193},
  year={2026}
}