23-second clip from Bhakti Sutras 11 (21:55-22:19) · VibeVoice-1.5B German dub · side-by-side lipsync outputs from each model · RTX 5070 Ti
Zero-shot voice clone from 45s Gurudev reference. Generation: 55s for 23.3s audio (RTF 2.36×).
BS 11, 320×240→640×480, 25 fps. Gurudev explaining gradations of devotees.
stage2_512 · 20 steps · DeepCache · 11.5 min · 16 GB VRAM. Required killing ollama to free host RAM.
stage2_efficient · 20 steps · DeepCache · 67s inference · 256² face region. About 10× faster than the 512 variant (67s vs 11.5 min).
wav2lip_gan.pth · S3FD face detector · 580 frames in ~30s · the 2020 classic.
v15 unet (3.4 GB) · face_alignment replacing dwpose · 586 frames landmark-extracted in ~70s + inference ~60s. Required 4 patches to unblock on Windows/Blackwell.
7-step pipeline: landmarks → 3DMM → expression stabilize → face enhance → lip synth → composite. 4 GB checkpoints. ~6 min on 5070 Ti. 6 patches needed to unblock.
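
The timing figures quoted in the captions above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not part of any of the tools listed: the helper names `real_time_factor` and `speedup` are hypothetical, and it assumes the 11.5 min and 67s figures are comparable wall-clock timings for the same clip.

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """Seconds of compute per second of audio; above 1 means slower than real time."""
    return generation_s / audio_s


def speedup(slow_s: float, fast_s: float) -> float:
    """How many times faster the second timing is than the first."""
    return slow_s / fast_s


# VibeVoice dub: 55s of generation for 23.3s of audio.
rtf = real_time_factor(55.0, 23.3)

# stage2_512 (11.5 min) versus stage2_efficient (67s).
ratio = speedup(11.5 * 60, 67.0)

# Nominal frame count for a 23.3s clip at 25 fps.
frames = 23.3 * 25

print(f"RTF: {rtf:.2f}x")            # RTF: 2.36x
print(f"speedup: {ratio:.1f}x")      # speedup: 10.3x
print(f"nominal frames: {frames:.1f}")
```

The nominal frame count comes out near 582, which sits between the 580 and 586 frames reported by the two models that list a frame count; small differences of this kind are plausible from padding or trimming at the clip boundaries.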