Gurudev Lipsync Shootout

23-second clip from Bhakti Sutras 11 (21:55-22:19) · VibeVoice-1.5B German dub · side-by-side lipsync outputs from each model · RTX 5070 Ti

How to evaluate this comparison

Goal: pick the lipsync model we'll integrate into the gurudev-knowledge German dubbing pipeline. The same 23-second source clip + same VibeVoice German audio went into every model — only the mouth synthesis differs.

1. Watch in this order

Original (English) first — anchor your perception of Gurudev's natural mouth shapes, head movement, and lighting.
Listen to the German audio alone (top of page) — get the pacing in your ear before judging mouth sync.
Then the model outputs in this order: LatentSync 512² → MuseTalk → VideoReTalking → Wav2Lip → LatentSync 256². Top to bottom in the grid runs roughly from highest quality to fastest.
Watch each output at least twice: once for sync (listen with your eyes closed first, then watch the mouth on the second pass) and once for naturalness (does the face look "right" or uncanny?).

2. What to look for

Criterion	What to check
Lip sync accuracy	Do mouth shapes hit consonants on time? Watch "besser", "gierig", "unglücklich" — the plosives are the giveaway.
Visual sharpness	Is the mouth crisp or does it look smeared/blurry compared to the rest of the face?
Identity preservation	Does it still look like Gurudev? Watch the beard line, chin, nose-mouth distance.
Temporal stability	Does the mouth jitter or "boil" between frames? Pause and step frame-by-frame if unsure.
Edge artifacts	Look at the boundary where the synthesized mouth meets the rest of the face — any visible seams?
Naturalness	Gut feeling — does the result feel like a real person speaking, or does it cross into uncanny valley?

3. Quick scoring

Score each model from 1 to 5 on each criterion and add notes. The winner is the one I'll wire into the gurudev-knowledge pipeline. If two are close, speed and reliability break the tie.
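As a rough sketch of how the scores could be tallied (this is not part of the pipeline), the Python below averages the six criteria per model, treats every model within a small margin of the best mean as "close", and orders the close ones by runtime. The criterion keys, the function name `rank_models`, and the `close_margin` default of 0.25 are assumptions made for illustration; reliability is not modelled here.

```python
from statistics import mean

# Keys mirror the six rows of the criteria table; the spellings are illustrative.
CRITERIA = (
    "lip_sync",
    "sharpness",
    "identity",
    "temporal_stability",
    "edge_artifacts",
    "naturalness",
)


def rank_models(scores, runtime_s, close_margin=0.25):
    """Return model names ordered best-first.

    scores:    {model: {criterion: int in 1..5}}
    runtime_s: {model: inference time in seconds}
    Models whose mean score is within close_margin of the best mean are
    treated as tied and ordered by runtime (fastest first); the remaining
    models follow in descending order of mean score.
    """
    means = {}
    for model, per_criterion in scores.items():
        # A missing criterion raises KeyError, which is the desired failure.
        values = [per_criterion[c] for c in CRITERIA]
        if not all(1 <= v <= 5 for v in values):
            raise ValueError(f"{model}: scores must be in 1..5")
        means[model] = mean(values)

    best = max(means.values())
    tied = sorted(
        (m for m in means if best - means[m] <= close_margin),
        key=lambda m: runtime_s[m],
    )
    rest = sorted(
        (m for m in means if best - means[m] > close_margin),
        key=lambda m: means[m],
        reverse=True,
    )
    return tied + rest
```

The `runtime_s` values can be taken from the timings listed with each output, for example 690 for LatentSync 512² (11.5 min) and 67 for LatentSync 256².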

4. Tips for viewing

Use full-screen (click the expand icon on the video player) — the mouth detail matters and is hard to judge in the small grid view.
Loop short sections by dragging the playhead — sync issues are most visible in the first 5 seconds where Gurudev says "Aber einer ist besser als der andere".
Use 0.5× playback speed (right-click → speed) to see frame-level mouth shapes.
Compare on a desktop/laptop, not a phone — the source clip is only 320×240 upscaled, so artifacts are subtle.

5. After you decide

Tell me the winner and I'll: (a) integrate it into the gurudev-knowledge pipeline as a new _de_lipsync.mp4 output slot, (b) batch-process Bhakti Sutras 1–20 over the next few nights, (c) re-publish here for QC. If none are good enough, we still have options: MuseTalk with the original dwpose path via WSL Docker, or a higher-res source clip (current source is only 320×240 upscaled — a 1080p original would let LatentSync 512² shine more).

Generated German audio VibeVoice-1.5B

Zero-shot voice clone from 45s Gurudev reference. Generation: 55s for 23.3s audio (RTF 2.36x). The text says: "Aber einer ist besser als der andere. Der Wissende ist besser als jemand, der nur sucht. Wer sucht, ist besser als jemand, der gierig ist. Und wer gierig ist, ist besser als jemand, der unglücklich ist, denn wer unglücklich ist, hört auf nichts. Ein Gieriger hört wenigstens zu."
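For reference, the RTF figure above is wall-clock generation time divided by the duration of the audio produced, so values above 1 mean slower than real time. A minimal check of the arithmetic (the function name is illustrative):

```python
def real_time_factor(generation_s, audio_s):
    """Wall-clock generation time divided by generated audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


rtf = real_time_factor(55.0, 23.3)
print(f"RTF {rtf:.2f}x")  # prints "RTF 2.36x"
```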

Original English source

BS 11, 320×240→640×480, 25 fps. Gurudev explaining gradations of devotees.

LatentSync 1.6 — 512² SOTA diffusion, max quality

stage2_512 · 20 steps · DeepCache · 11.5 min · 16 GB VRAM. Required killing ollama to free host RAM.

LatentSync 1.6 — 256² efficient

stage2_efficient · 20 steps · DeepCache · 67s inference · 256² face region. About 10× faster than the 512² variant (67 s vs. 11.5 min).

Wav2Lip GAN baseline

wav2lip_gan.pth · S3FD face detector · 580 frames in ~30s · the 2020 classic.

MuseTalk v1.5 Tencent latent inpainting

v15 unet (3.4 GB) · face_alignment replacing dwpose · 586 frames landmark-extracted in ~70s + inference ~60s. Required 4 patches to unblock on Windows/Blackwell.

VideoReTalking OpenTalker 3DMM+GFPGAN

7-step pipeline, including landmarks → 3DMM → expression stabilize → face enhance → lip synth → composite. 4 GB checkpoints. ~6 min on 5070 Ti. 6 patches needed to unblock.

Method notes

source Bhakti Sutras 11 — Attaining Divine Love, 21:55–22:19. Picked by sliding-window density scan across all 20 BS episodes — this window had 51 spoken words in 18.6 of 20 seconds (densest in the corpus, no long pauses, face on-camera throughout).
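One way such a density scan could look, as a hedged sketch rather than the script that was actually used: given per-word start and end times from a transcript, slide a fixed window over the episode and keep the window with the most spoken time. The input format, the 1-second step, and the choice to rank by spoken seconds (with word count as tie-breaker) are all assumptions. Word intervals are assumed not to overlap.

```python
def densest_window(words, window_s=20.0, step_s=1.0):
    """Find the window with the most speech.

    words: iterable of (start_s, end_s) tuples, one per spoken word.
    Returns (window_start_s, spoken_seconds, word_count) for the best window.
    Ties on spoken seconds go to the higher word count, then the earlier start.
    """
    words = list(words)
    if not words:
        raise ValueError("no words given")

    last_end = max(end for _, end in words)
    # Number of window positions that fit entirely before the last word ends
    # (at least one, so short recordings are still scored).
    n_steps = max(0, int((last_end - window_s) // step_s)) + 1

    best = None
    for i in range(n_steps):
        lo = i * step_s  # index-based to avoid accumulating float error
        hi = lo + window_s
        # Sum the part of each word that lies inside [lo, hi).
        spoken = sum(
            max(0.0, min(end, hi) - max(start, lo)) for start, end in words
        )
        count = sum(1 for start, _ in words if lo <= start < hi)
        key = (spoken, count, -lo)
        if best is None or key > best[0]:
            best = (key, (lo, spoken, count))
    return best[1]
```

Running this per episode and comparing the best windows would give a corpus-wide pick; a result like "51 words, 18.6 of 20 seconds spoken" corresponds to `count` and `spoken` here.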

resolution Original is 320×240 @ 15 fps (the BS series ships at low res). Upscaled to 640×480 @ 25 fps with lanczos + libx264 -crf 17 before lipsync. The face crop the models see is roughly 200×200, which is borderline-small for LatentSync 512² but works.
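The upscale step can be approximated with a single ffmpeg call. The sketch below only builds the command; the exact filter chain, filter order, and file names used for this comparison are not recorded here, so treat them as assumptions. The `scale` filter's `flags=lanczos` option, the `fps` filter, and `-c:v libx264 -crf 17` are standard ffmpeg options.

```python
import subprocess


def build_upscale_cmd(src, dst, width=640, height=480, fps=25, crf=17):
    """Build an ffmpeg command that upscales with Lanczos and resamples fps."""
    # Scale first, then convert the frame rate (the order is an assumption).
    video_filter = f"scale={width}:{height}:flags=lanczos,fps={fps}"
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it exists
        "-i", src,
        "-vf", video_filter,
        "-c:v", "libx264",
        "-crf", str(crf),  # lower CRF means higher quality
        dst,
    ]


def upscale(src, dst):
    """Run the command; requires ffmpeg on PATH."""
    subprocess.run(build_upscale_cmd(src, dst), check=True)
```

Going from 15 to 25 fps with the `fps` filter duplicates frames rather than interpolating, which keeps the mouth shapes of the source intact.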

audio VibeVoice-1.5B zero-shot clone from a 45-second reference of Gurudev (BS 01, 3:00–3:45). The German script was translated against a time budget derived from the English transcript so that the spoken duration would match the video; no speed adjustment was applied to either side. VibeVoice produced 23.3 s of audio, and the video clip was re-cut to 23.4 s to match.

hardware All inference on rog-beast: RTX 5070 Ti 16 GB (Blackwell sm_120), driver 595, CUDA 13.2, Windows 11 Pro. Every model needed dependency surgery to run on this combo — Blackwell needs torch ≥2.7/cu128 and most lipsync repos pin 2.5/cu121 or older.
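A quick way to spot a repo environment that will not run on this card is to look at the installed torch build string. The helper below is a sketch that assumes pip-style version strings such as "2.7.0+cu128" (the usual form of `torch.__version__` for CUDA wheels); the function names are illustrative, and the thresholds come from the torch ≥2.7/cu128 requirement stated above.

```python
import re


def parse_torch_build(version):
    """Split '2.7.0+cu128' into ((2, 7), 128).

    The CUDA tag is None for CPU-only or untagged builds.
    """
    release = re.match(r"(\d+)\.(\d+)", version)
    if release is None:
        raise ValueError(f"unrecognised torch version: {version!r}")
    cuda = re.search(r"\+cu(\d+)", version)
    torch_ver = (int(release.group(1)), int(release.group(2)))
    cuda_tag = int(cuda.group(1)) if cuda else None
    return torch_ver, cuda_tag


def supports_blackwell(version, min_torch=(2, 7), min_cuda=128):
    """True if the build meets the torch >= 2.7 / cu128 floor for sm_120."""
    torch_ver, cuda_tag = parse_torch_build(version)
    return (
        torch_ver >= min_torch
        and cuda_tag is not None
        and cuda_tag >= min_cuda
    )
```

Inside an environment this would be called as `supports_blackwell(torch.__version__)`; a pinned "2.5.1+cu121" build returns False.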

why these 4 LatentSync 1.6 is current SOTA on HDTF/VoxCeleb2. MuseTalk is the leading real-time alternative. Wav2Lip is the 2020 baseline everyone benchmarks against. VideoReTalking adds expression stabilization + GFPGAN face enhancement on top of a Wav2Lip-style core. These cover the four meaningful design points in the field.