23-second clip from Bhakti Sutras 11 (21:55-22:19) · VibeVoice-1.5B German dub · side-by-side lipsync outputs from each model · RTX 5070 Ti
Goal: pick the lipsync model we'll integrate into the gurudev-knowledge German dubbing pipeline. Every model received the same 23-second source clip and the same VibeVoice German audio; only the mouth synthesis differs.
| Criterion | What to check |
|---|---|
| Lip sync accuracy | Do mouth shapes hit consonants on time? Watch "besser", "gierig", "unglücklich" — the plosives are the giveaway. |
| Visual sharpness | Is the mouth crisp or does it look smeared/blurry compared to the rest of the face? |
| Identity preservation | Does it still look like Gurudev? Watch the beard line, chin, nose-mouth distance. |
| Temporal stability | Does the mouth jitter or "boil" between frames? Pause and step frame-by-frame if unsure. |
| Edge artifacts | Look at the boundary where the synthesized mouth meets the rest of the face — any visible seams? |
| Naturalness | Gut feeling — does the result feel like a real person speaking, or does it cross into uncanny valley? |
Score each model from 1 to 5 on each criterion and add notes. The winner is the one I'll wire into the gurudev-knowledge pipeline. If two are close, speed and reliability will break the tie.
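The scoring rule above can be sketched as a small score sheet. This is only an illustration: the `ScoreSheet` class, the `pick_winner` helper, the `close_margin` threshold of 0.25, and the use of runtime as the sole tie-breaker are all assumptions (reliability is not modeled), and the scores in the example are placeholders, not results.

```python
from dataclasses import dataclass
from statistics import mean

# The six criteria from the table above.
CRITERIA = (
    "Lip sync accuracy",
    "Visual sharpness",
    "Identity preservation",
    "Temporal stability",
    "Edge artifacts",
    "Naturalness",
)


@dataclass
class ScoreSheet:
    """One reviewer's 1-5 scores for one model (hypothetical structure)."""
    model: str
    scores: dict[str, int]   # criterion name -> score in 1..5
    runtime_s: float         # used only as a tie-breaker
    notes: str = ""

    def __post_init__(self) -> None:
        missing = set(CRITERIA) - set(self.scores)
        if missing:
            raise ValueError(f"missing criteria: {sorted(missing)}")
        for criterion, score in self.scores.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{criterion}: score {score} is outside 1-5")

    @property
    def average(self) -> float:
        return mean(self.scores[c] for c in CRITERIA)


def pick_winner(sheets: list[ScoreSheet], close_margin: float = 0.25) -> ScoreSheet:
    """Highest average wins; models within close_margin of the best
    are treated as tied and the fastest of them is chosen."""
    best = max(sheets, key=lambda s: s.average)
    contenders = [s for s in sheets if best.average - s.average <= close_margin]
    return min(contenders, key=lambda s: s.runtime_s)


if __name__ == "__main__":
    # Placeholder scores and runtimes for illustration only.
    sheets = [
        ScoreSheet("model_a", {c: 4 for c in CRITERIA}, runtime_s=600.0),
        ScoreSheet("model_b", {c: 3 for c in CRITERIA}, runtime_s=60.0),
    ]
    print(pick_winner(sheets).model)
```

An unweighted average treats all six criteria as equally important. If lip sync accuracy should count for more, a weighted mean would be the natural change.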
Tell me the winner and I'll: (a) integrate it into the gurudev-knowledge pipeline as a new _de_lipsync.mp4 output slot, (b) batch-process Bhakti Sutras 1–20 over the next few nights, and (c) re-publish here for QC. If none is good enough, two options remain: MuseTalk with the original dwpose path via WSL Docker, or a higher-resolution source clip (the current source is only 320×240, upscaled; a 1080p original would let LatentSync 512² show more of its advantage).
Zero-shot voice clone from 45s Gurudev reference. Generation: 55s for 23.3s audio (RTF 2.36x). The text says: "Aber einer ist besser als der andere. Der Wissende ist besser als jemand, der nur sucht. Wer sucht, ist besser als jemand, der gierig ist. Und wer gierig ist, ist besser als jemand, der unglücklich ist, denn wer unglücklich ist, hört auf nichts. Ein Gieriger hört wenigstens zu."
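The real-time factor quoted above is generation time divided by audio duration, so values above 1 mean slower than real time. A minimal sketch of that arithmetic follows; the function names are illustrative, and the projection helper assumes generation time scales linearly with audio length, which was not measured here.

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """Generation time divided by audio duration (>1 means slower than real time)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


def projected_generation_s(audio_s: float, rtf: float) -> float:
    """Projected generation time for longer audio, ASSUMING linear scaling."""
    return audio_s * rtf


rtf = real_time_factor(generation_s=55.0, audio_s=23.3)
print(f"RTF {rtf:.2f}x")  # RTF 2.36x
```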
BS 11, 320×240→640×480, 25 fps. Gurudev explaining gradations of devotees.
stage2_512 · 20 steps · DeepCache · 11.5 min · 16 GB VRAM. Required killing ollama to free host RAM.
stage2_efficient · 20 steps · DeepCache · 67s inference · 256² face region. About 10× faster than the 512 variant.
wav2lip_gan.pth · S3FD face detector · 580 frames in ~30s · the 2020 classic.
v15 unet (3.4 GB) · face_alignment replacing dwpose · 586 frames landmark-extracted in ~70s + inference ~60s. Required 4 patches to unblock on Windows/Blackwell.
7-step pipeline: landmarks → 3DMM → expression stabilize → face enhance → lip synth → composite. 4 GB checkpoints. ~6 min on the 5070 Ti. Needed 6 patches to unblock.
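Since speed is the tie-breaker and the plan is to batch-process full talks overnight, it may help to express the timings above as a slowdown relative to the 23.3-second clip. The sketch below uses the approximate figures quoted in the notes, keyed by checkpoint or pipeline label. Treating them as comparable end-to-end times, scaling them linearly to longer videos, and the 60-minute talk length are all assumptions made for illustration.

```python
CLIP_SECONDS = 23.3  # length of the test clip's audio

# Approximate processing times in seconds, taken from the notes above.
TIMINGS_S = {
    "stage2_512": 11.5 * 60,
    "stage2_efficient": 67,
    "wav2lip_gan": 30,
    "v15 unet": 70 + 60,       # landmark extraction + inference
    "7-step pipeline": 6 * 60,
}


def slowdown(processing_s: float, clip_s: float = CLIP_SECONDS) -> float:
    """Seconds of compute per second of video."""
    return processing_s / clip_s


def projected_hours(processing_s: float, talk_minutes: float,
                    clip_s: float = CLIP_SECONDS) -> float:
    """Projected compute hours for a talk, ASSUMING linear scaling."""
    return slowdown(processing_s, clip_s) * talk_minutes / 60


# 60 minutes is a hypothetical talk length, not a measured one.
for name, seconds in sorted(TIMINGS_S.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} {slowdown(seconds):5.1f}x realtime, "
          f"{projected_hours(seconds, 60):5.1f} h per 60-min talk")
```

Under these assumptions the 512 variant runs at roughly 30 times real time, so a single hour-long talk would occupy the GPU for more than a day. That matters for the "next few nights" batch plan.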