Gurudev Lipsync Shootout

23-second clip from Bhakti Sutras 11 (21:55-22:19) · VibeVoice-1.5B German dub · side-by-side lipsync outputs from each model · RTX 5070 Ti

How to evaluate this comparison

Goal: pick the lipsync model we'll integrate into the gurudev-knowledge German dubbing pipeline. The same 23-second source clip + same VibeVoice German audio went into every model — only the mouth synthesis differs.

1. Watch in this order

  1. Original (English) first — anchor your perception of Gurudev's natural mouth shapes, head movement, and lighting.
  2. Listen to the German audio alone (top of page) — get the pacing in your ear before judging mouth sync.
  3. Then watch the outputs in this order: LatentSync 512² → MuseTalk → VideoReTalking → Wav2Lip → LatentSync 256². Top to bottom in the grid is roughly highest-quality to fastest.
  4. Watch each output twice: once for sync (listen with your eyes closed first, then watch the mouth on the second viewing), and once for naturalness (does the face look "right" or uncanny?).

2. What to look for

Criterion | What to check
Lip sync accuracy | Do mouth shapes hit consonants on time? Watch "besser", "gierig", "unglücklich": the plosives are the giveaway.
Visual sharpness | Is the mouth crisp, or does it look smeared/blurry compared to the rest of the face?
Identity preservation | Does it still look like Gurudev? Watch the beard line, chin, nose-mouth distance.
Temporal stability | Does the mouth jitter or "boil" between frames? Pause and step frame-by-frame if unsure.
Edge artifacts | Look at the boundary where the synthesized mouth meets the rest of the face. Are there any visible seams?
Naturalness | Gut feeling: does the result feel like a real person speaking, or does it cross into uncanny valley?

3. Quick scoring

For each model, score every criterion from 1 to 5 and add notes. The winner is the one I'll wire into the gurudev-knowledge pipeline. If two are close, speed and reliability break the tie.
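A minimal sketch of how the scorecards could be tallied, assuming higher scores are better and that "close" means within one point in total (both are assumptions, not stated above). The criterion keys, the `Scorecard` type, the model names, and all scores are hypothetical placeholders. Only speed is modelled as the tie-breaker; reliability would still need a human call.

```python
from dataclasses import dataclass

# Hypothetical short keys for the six criteria in the table above.
CRITERIA = ("lip_sync", "sharpness", "identity", "stability", "edges", "naturalness")


@dataclass
class Scorecard:
    model: str
    scores: dict      # criterion key -> integer score, 1..5
    runtime_s: float  # wall-clock time for the test clip, in seconds


def total(card):
    """Sum the six criterion scores, rejecting anything outside 1..5."""
    for name in CRITERIA:
        value = card.scores[name]  # raises KeyError if a criterion is missing
        if not 1 <= value <= 5:
            raise ValueError(f"{card.model}: {name}={value} is outside 1..5")
    return sum(card.scores[name] for name in CRITERIA)


def pick_winner(cards, close_margin=1):
    """Best total wins; among totals within close_margin, the fastest wins."""
    best = max(total(c) for c in cards)
    contenders = [c for c in cards if best - total(c) <= close_margin]
    return min(contenders, key=lambda c: c.runtime_s)


def uniform(value):
    """Helper for the example: the same score on every criterion."""
    return {name: value for name in CRITERIA}


# Placeholder data, not real review results.
cards = [
    Scorecard("model_a", uniform(4), 690.0),
    Scorecard("model_b", {**uniform(4), "edges": 3}, 67.0),
    Scorecard("model_c", uniform(2), 30.0),
]
winner = pick_winner(cards)
print(winner.model)  # model_b: one point behind model_a, but much faster
```

Widening `close_margin` gives speed more weight; setting it to 0 makes runtime matter only on an exact tie.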

4. Tips for viewing

5. After you decide

Tell me the winner and I'll: (a) integrate it into the gurudev-knowledge pipeline as a new _de_lipsync.mp4 output slot, (b) batch-process Bhakti Sutras 1–20 over the next few nights, (c) re-publish the results here for QC. If none is good enough, we still have options: MuseTalk with the original dwpose path via WSL Docker, or a higher-res source clip. The current source is only 320×240 upscaled; a 1080p original would let LatentSync 512² shine more.
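As a rough illustration of the planned output slot, the sketch below derives a `_de_lipsync.mp4` path from a source video path and lists the 20 episodes of the batch. Only the suffix comes from the text above. The `bhakti_sutras_NN.mp4` naming and the `videos` directory are hypothetical and may not match how the pipeline actually names files.

```python
from pathlib import Path

LIPSYNC_SUFFIX = "_de_lipsync.mp4"  # output slot named in the plan above


def lipsync_output_path(source: Path) -> Path:
    """Place the lipsynced file next to the source, with the slot suffix."""
    return source.with_name(source.stem + LIPSYNC_SUFFIX)


def batch_plan(episode_dir: Path, first: int = 1, last: int = 20):
    """Return (source, output) path pairs for episodes first..last inclusive."""
    plan = []
    for number in range(first, last + 1):
        # Hypothetical file naming; adjust to the real pipeline layout.
        source = episode_dir / f"bhakti_sutras_{number:02d}.mp4"
        plan.append((source, lipsync_output_path(source)))
    return plan


for source, output in batch_plan(Path("videos"))[:2]:
    print(source.name, "->", output.name)
```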

Generated German audio · VibeVoice-1.5B

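Zero-shot voice clone from a 45s Gurudev reference clip. Generation took 55s for 23.3s of audio (RTF 2.36x, i.e. slower than real time). Spoken text: "Aber einer ist besser als der andere. Der Wissende ist besser als jemand, der nur sucht. Wer sucht, ist besser als jemand, der gierig ist. Und wer gierig ist, ist besser als jemand, der unglücklich ist, denn wer unglücklich ist, hört auf nichts. Ein Gieriger hört wenigstens zu." (English: "But one is better than the other. The one who knows is better than someone who only seeks. Whoever seeks is better than someone who is greedy. And whoever is greedy is better than someone who is unhappy, because whoever is unhappy listens to nothing. A greedy person at least listens.")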
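The real-time factor quoted above is generation time divided by audio duration, so values above 1 mean slower than real time. A small check of the arithmetic (the function name is illustrative):

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """Seconds of compute spent per second of generated audio."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


rtf = real_time_factor(55.0, 23.3)
print(f"RTF = {rtf:.2f}x")  # RTF = 2.36x
```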

Original English source

Bhakti Sutras 11, 320×240 upscaled to 640×480, 25 fps. Gurudev explaining gradations of devotees.

LatentSync 1.6 (512²) · SOTA diffusion, max quality

stage2_512 · 20 steps · DeepCache · 11.5 min · 16 GB VRAM. Required killing ollama to free host RAM.

LatentSync 1.6 (256²) · efficient

stage2_efficient · 20 steps · DeepCache · 67s inference · 256² face region. About 10× faster than the 512² variant.

Wav2Lip · GAN baseline

wav2lip_gan.pth · S3FD face detector · 580 frames in ~30s · the 2020 classic.

MuseTalk v1.5 · Tencent latent inpainting

v15 unet (3.4 GB) · face_alignment replacing dwpose · landmark extraction for 586 frames in ~70s, then ~60s inference. Required 4 patches to unblock on Windows/Blackwell.

VideoReTalking · OpenTalker 3DMM+GFPGAN

7-step pipeline, including: landmarks → 3DMM → expression stabilize → face enhance → lip synth → composite. 4 GB checkpoints. ~6 min on the 5070 Ti. 6 patches needed to unblock.
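To compare the runtimes quoted on this page on a common footing, the sketch below divides each by the 23.3s clip length and projects the cost of a full episode. The runtimes are the approximate figures from the sections above (MuseTalk combines landmark extraction and inference). The projection assumes processing time scales linearly with video length and ignores model loading, and the 30-minute episode length is a hypothetical placeholder.

```python
CLIP_S = 23.3  # duration of the test clip's audio, in seconds

# Approximate wall-clock runtimes for the test clip, taken from this page.
RUNTIME_S = {
    "latentsync_512": 11.5 * 60,
    "latentsync_256": 67.0,
    "wav2lip": 30.0,
    "musetalk_v15": 70.0 + 60.0,  # landmark extraction + inference
    "videoretalking": 6 * 60.0,
}


def slowdown(runtime_s: float, clip_s: float = CLIP_S) -> float:
    """Seconds of processing per second of video."""
    return runtime_s / clip_s


def projected_hours(runtime_s: float, episode_min: float, clip_s: float = CLIP_S) -> float:
    """Estimated hours to process one episode, assuming linear scaling."""
    return slowdown(runtime_s, clip_s) * episode_min * 60.0 / 3600.0


EPISODE_MIN = 30.0  # hypothetical episode length, not taken from this page

for name, runtime in sorted(RUNTIME_S.items(), key=lambda item: item[1]):
    print(f"{name:16s} {slowdown(runtime):5.1f}x real time, "
          f"~{projected_hours(runtime, EPISODE_MIN):4.1f} h per episode")
```

Under these assumptions the 512² LatentSync run costs roughly 30 seconds of compute per second of video, which is worth keeping in mind when planning the overnight batch.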