📖 Tutorials & Guides

Get the Most Out of Wav2Lip

From your first lip sync to advanced troubleshooting — everything you need to produce clean, convincing results.

All Tutorials
🎬
Beginner
Your First Lip Sync in 5 Minutes
The complete beginner walkthrough — picking files, uploading, running the job, and downloading your output.
📹
Beginner
Choosing the Right Source Video
What makes a good input video? Learn about face angle, lighting, resolution, and clip length — and what to avoid.
🎙️
Beginner
Preparing Your Audio for Best Results
Not all audio works equally well. Find out how to clean up recordings, handle background noise, and pick the right format.
⚙️
Intermediate
Understanding Wav2Lip's Settings
Padding, resize factor, batch size, smoothing — what each parameter actually does and when you'd want to change it.
✂️
Intermediate
Getting Sharper Mouth Output
The GAN checkpoint vs the base model. How to reduce blurriness around the mouth region and improve edge sharpness.
🔧
Troubleshooting
Fixing Common Errors
Face not detected, job hangs, blurry output, audio out of sync — the most frequent problems and exactly how to fix them.
🖥️
Advanced
Self-Hosting Free Lip Sync Hub
Run the full stack on your own Linux server. Covers Python setup, model checkpoints, Apache config, and the diagnostic tool.
🌍
Advanced
Dubbing Videos into Other Languages
A full dubbing workflow — translating a script, recording or generating the audio, syncing it to the original speaker, and exporting.
🎭
Intermediate
Lip Syncing Presentations & Talking Heads
Using a still image or short looped clip as your video source. Tips for educators, marketers, and content creators.

Full Tutorials
Beginner

Your First Lip Sync in 5 Minutes

⏱ 5 min read 📹 Beginner-friendly

If you've never done lip syncing before, the concept is straightforward: you have a video of someone talking, and you want to replace the audio with something different — while making the mouth movements match the new audio. That's exactly what Wav2Lip does, and this tool puts a clean interface on top of it.

Step 1 — Find a good source video

You need a short clip of a single person talking or with their face visible. Ideal characteristics: face roughly forward-facing, reasonably steady, decent lighting, 2–30 seconds long. MP4 works best. If your clip is longer, trim it first — most free video editors can do this in seconds.

Step 2 — Prepare your audio

This is the audio that will replace the original speech. It could be a voiceover you recorded, a text-to-speech output, or a dubbed version in another language. MP3 and WAV both work fine. The audio can be longer or shorter than the video — Wav2Lip will sync whatever duration overlaps.
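If you want to check the overlap before uploading, you can read a WAV file's duration with Python's standard wave module. This is a minimal sketch; it assumes an uncompressed WAV (an MP3 would need a third-party library such as mutagen):

```python
import wave

def wav_duration_seconds(path):
    """Duration of an uncompressed WAV file, in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def synced_duration(audio_seconds, video_seconds):
    """Wav2Lip only lip syncs the portion where audio and video overlap."""
    return min(audio_seconds, video_seconds)
```

Compare the result against your video's length; if the audio runs long, the extra seconds simply won't be synced.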

Step 3 — Upload and run

  1. Open the tool: Go to the home page and scroll to the upload area, or click "Try Free" in the header.
  2. Upload your video: Click the video drop zone or drag your file onto it. You'll see a preview thumbnail once it loads.
  3. Upload your audio: Click the audio drop zone. A mini player will appear so you can confirm it's the right file.
  4. Click "Run Lip Sync": Processing typically takes 1–3× the video's duration. A progress bar will keep you updated.
  5. Preview and download: When it's done, the output video plays automatically in the browser. Click Download to save your MP4.

Tip: Start with a 5–10 second clip on your first try. Shorter clips process faster and let you check the quality before committing to a longer job.

Beginner

Choosing the Right Source Video

⏱ 4 min read 📹 Input quality guide

The single biggest factor in output quality is your source video. Wav2Lip is doing hard work — detecting a face in every frame and regenerating the mouth region. Give it a difficult video and the results will show it.

What works well

✅ Good face angle
  • Straight-on or slight angle (±30°)
  • Full face visible in frame
  • Consistent framing throughout
✅ Good lighting
  • Even, front-facing light source
  • No harsh shadows across the face
  • No heavy backlighting
✅ Good resolution
  • 720p or higher preferred
  • Face takes up a reasonable portion of frame
  • No heavy compression artefacts
✅ Good clip length
  • 2–30 seconds is the sweet spot
  • Steady head position helps
  • Single speaker only

What causes problems

  • Extreme side profiles — the face detector struggles past ~45° and the mouth generation gets very inaccurate.
  • Heavy beard or moustache — facial hair covering the lips confuses the model's understanding of mouth shape.
  • Rapid head movement — nodding, shaking, or turning mid-clip introduces tracking errors between frames.
  • Multiple faces in frame — Wav2Lip picks one and ignores the rest. It may not pick the one you want.
  • Low resolution — faces smaller than ~96×96 pixels in frame reduce accuracy significantly.
  • Heavy video compression — blocky artefacts around the mouth area get amplified by the model.

Tip: If your original clip is poor quality, try running it through a free video enhancer first. A crisper input means a crisper output.

Beginner

Preparing Your Audio for Best Results

⏱ 4 min read 🎙️ Audio quality guide

Wav2Lip generates mouth movements by reading a mel spectrogram — a visual representation of your audio's frequency and amplitude over time. The cleaner that spectrogram, the more precisely the model can match lips to phonemes.
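Wav2Lip builds its mel spectrograms with librosa, but the underlying idea is easy to see with plain NumPy: chop the waveform into overlapping windows and take the FFT magnitude of each. This is a simplified illustration only (the window and hop sizes here are arbitrary, not Wav2Lip's actual parameters, and there is no mel scaling):

```python
import numpy as np

def magnitude_spectrogram(samples, win=400, hop=160):
    """Columns are time steps, rows are frequency bins: a picture of
    which frequencies are present at each moment in the audio."""
    frames = [samples[i:i + win] * np.hanning(win)
              for i in range(0, len(samples) - win, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

# One second of a 440 Hz tone at 16 kHz: the energy lands in a single row.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
spec = magnitude_spectrogram(tone)
```

Noise, reverb, and background music smear energy across many rows at once, which is exactly why the model finds phoneme shapes harder to pin down in messy audio.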

Format and quality

  • WAV (uncompressed) is technically ideal, but MP3 at 128 kbps or higher works just as well in practice.
  • Sample rate doesn't matter much — Wav2Lip resamples internally.
  • Mono or stereo both work. The model converts to mono internally anyway.

Clean speech matters most

The model is optimized for speech. Music-only tracks, heavily produced audio with lots of reverb, or recordings with constant background noise all produce less accurate mouth movements — because the model is trying to find phoneme shapes in a signal that doesn't map cleanly to speech sounds.

⚠️ Avoid: heavy reverb, clipping (distortion from being too loud), background music mixed under speech, or very low-volume recordings. If your audio sounds muffled on playback, the lip sync will look off too.

Quick audio cleanup (free tools)

  • Audacity (free, desktop) — use Noise Reduction and Normalize to clean up a rough recording in minutes.
  • Adobe Podcast Enhance (free, browser) — one-click AI noise removal that works surprisingly well on voice recordings.
  • Descript or Cleanvoice — remove filler words and background noise automatically.

Tip: Record in a small, quiet room with soft furnishings (curtains, carpet, a sofa) rather than a hard-walled space. The difference in recording quality is dramatic and costs nothing.

Intermediate

Understanding Wav2Lip's Settings

⏱ 6 min read ⚙️ Configuration guide

The default settings work well for most clips. But if you're getting blurry edges, sync drift, or odd cropping, tweaking one of these parameters is usually the fix.

Padding

Padding controls how much of the face area around the detected mouth is included in the crop that gets processed and replaced. The four values are top, bottom, left, right (in pixels), defaulting to 0 10 0 0, i.e. 10 px of extra room below the lips.

  • Increasing bottom padding (e.g. 0 20 0 0) helps when the chin is getting cut off.
  • Increasing top padding helps when the nose region looks odd after blending.
  • Too much padding can cause ghosting artefacts outside the mouth region.
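As a sketch (not Wav2Lip's actual code), padding simply expands the detected face box before the crop, clamped to the frame edges:

```python
def pad_box(box, pads, frame_w, frame_h):
    """Expand a face box (x1, y1, x2, y2) by (top, bottom, left, right)
    padding in pixels, clamped so the crop stays inside the frame."""
    x1, y1, x2, y2 = box
    top, bottom, left, right = pads
    return (max(0, x1 - left), max(0, y1 - top),
            min(frame_w, x2 + right), min(frame_h, y2 + bottom))

# The default 0 10 0 0 gives the crop 10 extra pixels below the chin.
print(pad_box((100, 80, 200, 220), (0, 10, 0, 0), 640, 360))  # (100, 80, 200, 230)
```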

Resize Factor

Downscales the video before processing. 1 = original size (default). 2 = half size. Reducing this speeds up processing significantly but reduces output quality. Useful if you're running on a CPU and patience is short.

Wav2Lip Batch Size

How many frames are processed at once through the lip sync model. Larger batches = faster processing, but higher GPU memory usage. If you're seeing out-of-memory errors, reduce this from 16 to 8 or 4. If you have a powerful GPU and want to go faster, try 32.
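The trade-off is ordinary batching: fewer, larger chunks mean fewer model calls, but each call holds more frames in memory at once. A toy sketch:

```python
def batches(frames, batch_size):
    """Yield frames in fixed-size chunks; the last chunk may be smaller."""
    for i in range(0, len(frames), batch_size):
        yield frames[i:i + batch_size]

frames = list(range(70))  # pretend these are 70 decoded video frames
print(sum(1 for _ in batches(frames, 16)))  # 5 passes through the model
print(sum(1 for _ in batches(frames, 4)))   # 18 passes, but far less memory each
```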

Face Detection Batch Size

Same concept, but for the face detector. The default of 4 works for most hardware. Drop to 2 if you hit memory issues on long clips.

No Smooth

By default, Wav2Lip smooths the detected face position over a short window of frames to reduce flickering between frames. Turning on "No Smooth" disables this smoothing. It's useful if you notice the mouth lagging slightly behind the audio, as it can sharpen the sync at the cost of occasional frame flicker.

Tip: When tuning settings, always test on a short 5–10 second clip first. It's much faster to iterate on a short clip than to wait 10 minutes for a long one to process before discovering the padding was wrong.

Intermediate

Getting Sharper Mouth Output

⏱ 5 min read ✂️ Quality improvement

One of the most common complaints about Wav2Lip output is that the mouth region looks slightly soft or blurry compared to the rest of the face. There are two main levers to pull here.

The GAN checkpoint vs the base model

Wav2Lip ships with two model checkpoints. The base model (wav2lip.pth) is trained for sync accuracy — it produces the most accurately synced lips but the mouth region can look softer. The GAN model (wav2lip_gan.pth) adds a discriminator that pushes the output to look more photorealistic — sharper, more natural-looking, at a very slight trade-off in sync precision.

If both checkpoints are installed on your server, the tool will automatically prefer the GAN model. You can verify this is happening via the diagnostic page.
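The selection logic amounts to a simple preference order. This is an illustrative sketch, not the tool's actual source:

```python
from pathlib import Path

def pick_checkpoint(ckpt_dir):
    """Prefer the sharper GAN checkpoint; fall back to the base model;
    return None if neither file is installed."""
    for name in ("wav2lip_gan.pth", "wav2lip.pth"):
        if (Path(ckpt_dir) / name).exists():
            return name
    return None
```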

Source video resolution

Wav2Lip crops, resizes, processes, and pastes back the mouth region. If your source video is low resolution to begin with, the paste-back will look soft because there's no pixel detail to recover. Start with the highest resolution source you have. A 1080p input will always produce a sharper result than a 480p one, even if the output is the same size.

Post-processing with a video upscaler

For a noticeable quality jump, run your Wav2Lip output through an AI video enhancer after the fact. Tools like Topaz Video AI (paid) or Real-ESRGAN (free, open-source) can sharpen the overall frame and reduce the visual mismatch between the processed mouth region and the surrounding face.

Tip: The mouth-region softness is an inherent limitation of how Wav2Lip blends the generated mouth back into the original frame. It's most noticeable on close-up shots. On medium or wide shots, where the face is smaller in frame, it's rarely visible.

Troubleshooting

Fixing Common Errors

⏱ 7 min read 🔧 Problem-solving guide

Something went wrong? Here are the most frequent issues and what to do about each one.

❌ "No face detected" or job fails immediately

Wav2Lip needs to find a face in the first few frames to start processing. If it can't, the job will fail right away.

  • Check that your video actually contains a visible human face.
  • If the face only appears later in the clip, trim the beginning so the face is present from frame one.
  • Extreme side profiles (past ~45°) frequently fail detection. Use a more front-facing clip.
  • Very low-resolution video can cause detection failure — try a higher-quality source.
  • If you're self-hosting, check that s3fd.pth is installed. Run the diagnostic page to confirm.

❌ Processing hangs or times out

  • The video is too long. Try trimming to 30 seconds or less. Longer clips require much more GPU/CPU memory and time.
  • Server is under load. If others are using the tool simultaneously, your job may queue. Wait a few minutes and try again.
  • Out of memory. If self-hosting, reduce the Wav2Lip batch size and face detection batch size in the settings. On CPU-only machines, long clips frequently time out.

❌ Output looks blurry or mouth region is obvious

  • Start with a higher resolution source video — this is the biggest factor.
  • Confirm the GAN checkpoint is installed (check /?diag). The base model produces noticeably softer results.
  • Try reducing the resize factor to 1 if it's been set higher.
  • Post-process the output through a video sharpening tool.

❌ Lips are moving but out of sync with the audio

  • Try enabling No Smooth — the temporal smoothing can introduce a small lag.
  • Check your audio file isn't padded with silence at the start. Trim any silent intro from the audio track.
  • Make sure the audio codec is standard. Some obscure codecs confuse the audio processing pipeline. Convert to plain MP3 or WAV if in doubt.

❌ "Upload failed" or file rejected

  • Check your file formats — video must be MP4, MOV, AVI, MKV, or WEBM; audio must be MP3, WAV, OGG, M4A, or AAC.
  • Check file sizes — very large files (1 GB+) may exceed the server's upload limit. Compress or trim your video first.
  • If self-hosting, check PHP's upload_max_filesize and post_max_size settings in php.ini.

Still stuck? Check the diagnostic page at /?diag for a readout of your Python, PyTorch, OpenCV, and model checkpoint status. It tells you exactly what's missing. You can also contact us — paste the diagnostic output and we'll help you sort it.

Advanced

Self-Hosting Free Lip Sync Hub

⏱ 10 min read 🖥️ Server setup guide

Running the tool on your own server means your files never leave your machine at all — not even to our server. It's also the only way to process unlimited jobs without any constraints. Here's the full setup.

What you need

  • A Linux server (Ubuntu 20.04/22.04 recommended)
  • Python 3.8–3.10
  • Apache with PHP 7.4+ (PHP 8.x works fine)
  • A GPU with CUDA support (optional but strongly recommended — CPU-only is slow for anything over 15 seconds)
  • ~5 GB of disk space for model checkpoints

Installation steps

  1. Clone or download the Wav2Lip repo: Place it in your web root under /var/www/html/freelipsynchub/Wav2Lip/.
  2. Install Python dependencies: Run pip3 install torch torchvision opencv-python librosa numpy — and any other packages listed in Wav2Lip's requirements.txt.
  3. Download model checkpoints: Place wav2lip_gan.pth and wav2lip.pth in Wav2Lip/checkpoints/. Download links are in the official Wav2Lip repository.
  4. Configure Apache: Enable mod_rewrite, set AllowOverride All for your document root, and ensure PHP's exec() function is enabled.
  5. Verify with the diagnostic tool: Visit yourdomain.com/?diag — it will check Python, PyTorch, OpenCV, and all model files and tell you exactly what's missing.
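The diagnostic check boils down to probing imports and files. Here is a hypothetical stand-alone version you could run over SSH; note that in the real Wav2Lip repo the s3fd.pth face detector lives under face_detection/, not checkpoints/, so this sketch simplifies by checking a single directory:

```python
import importlib.util
from pathlib import Path

def diagnose(ckpt_dir, packages=("torch", "cv2", "librosa", "numpy")):
    """Report which Python packages and model files are missing."""
    missing_pkgs = [p for p in packages
                    if importlib.util.find_spec(p) is None]
    wanted = ("wav2lip_gan.pth", "wav2lip.pth", "s3fd.pth")
    missing_files = [f for f in wanted
                     if not (Path(ckpt_dir) / f).exists()]
    return {"missing_packages": missing_pkgs,
            "missing_files": missing_files}
```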

PHP configuration to check

; php.ini — adjust for your expected file sizes
upload_max_filesize = 512M
post_max_size = 512M
max_execution_time = 300
memory_limit = 512M

⚠️ CPU-only servers: Wav2Lip without a GPU is very slow — a 10-second clip can take 5+ minutes. If you're processing video regularly, a server with even a modest NVIDIA GPU (e.g. RTX 3060) will reduce that to under 30 seconds.

Advanced

Dubbing Videos into Other Languages

⏱ 8 min read 🌍 Full dubbing workflow

This is probably the most powerful use case for Wav2Lip: taking a video recorded in one language and producing a version where the speaker's lips match a translated audio track. Here's the full workflow from start to finish.

Step 1 — Transcribe the original

Before you can translate, you need the original script. Use a transcription tool (like our sister site FreeTranscribeAudioToText.com) to get an accurate transcript of the original speech, including timestamps if you need them.

Step 2 — Translate the script

Machine translation (DeepL or Google Translate) works well for common language pairs. For important content, have a native speaker review the output — machine translation is fast but not always idiomatic. Aim to match the original pacing: if the original speaker says a phrase in 3 seconds, your translated version should take roughly the same time.

Step 3 — Record or generate the dubbed audio

You have two options here:

  • Record it yourself — use a quiet room and a decent microphone. Even a phone mic in a soft-furnished room works. Match the energy and pacing of the original speaker.
  • Use a text-to-speech tool — modern TTS (ElevenLabs, Google TTS, Azure) produces very convincing speech. Clone the original speaker's voice if the tool supports it, for the most seamless result.

Step 4 — Sync the audio length to the video

This is the tricky part. If your dubbed audio is significantly longer or shorter than the original, the lip sync will cover only the overlapping portion. Use audio editing software (Audacity is free) to stretch or compress the dubbed track to match the video length — or trim the video to match the audio.
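The stretch amount is just a ratio. ffmpeg's atempo filter changes speed without shifting pitch, and a single pass classically accepts factors between 0.5 and 2.0 (chain two atempo filters for bigger changes). A small helper to compute the factor; the function name is mine, not from ffmpeg:

```python
def atempo_factor(audio_seconds, video_seconds):
    """Speed factor that makes the dubbed audio fit the video:
    ffmpeg's atempo gives output duration = input duration / factor."""
    factor = audio_seconds / video_seconds
    if not 0.5 <= factor <= 2.0:
        raise ValueError("needs more than one atempo pass; chain filters "
                         "or re-record closer to the original pacing")
    return round(factor, 3)

# 12 s of dubbed audio over a 10 s video: play the audio 1.2x faster.
print(atempo_factor(12.0, 10.0))  # 1.2
```

You'd then run something like ffmpeg -i dub.wav -filter:a atempo=1.2 fitted.wav (the filenames are placeholders).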

Step 5 — Run the lip sync

Upload the original video and your dubbed audio track to the tool. Let Wav2Lip do its thing. The output will be the original speaker's face, now moving in sync with the dubbed language.

Step 6 — Add subtitles (optional but recommended)

Even with good lip sync, dubbed video often benefits from subtitles in both the original and target language. Free tools like Subtitle Edit (desktop) or Kapwing (browser) make this straightforward.

Tip: Languages with very different phoneme structures (e.g. dubbing English into Mandarin, or vice versa) will produce less convincing lip sync — because the mouth shapes for those sounds are genuinely different. European language pairs (English ↔ Spanish, French ↔ Italian) tend to produce better results since the phoneme overlap is higher.

Intermediate

Lip Syncing Presentations & Talking Heads

⏱ 5 min read 🎭 Content creation guide

You don't need a long recorded video to use this tool. A short, looped clip of a face — or even a carefully prepared still image converted to a short video — can be an effective source for lip syncing presentations, explainer videos, or social media content.

Using a looped clip

Record or find a 3–5 second clip of a person with a neutral expression, facing the camera. Loop it to roughly match the length of your audio. Most video editors (even free ones like DaVinci Resolve or iMovie) can loop a clip. Use this looped video as your input — Wav2Lip will animate the mouth across every repeated loop.

Converting a photo to video

You can convert a static portrait photo to a short MP4 using FFmpeg:

ffmpeg -loop 1 -i portrait.jpg -t 10 -vf fps=25 -pix_fmt yuv420p output.mp4

This creates a 10-second, 25fps MP4 from a still image. Feed that into the lip sync tool along with your audio. The result is a "talking portrait" — useful for presentations, memorial videos, or social media content.

Tips for clean talking head output

  • Use a high-resolution portrait — at least 512×512 pixels, ideally 1080p. More pixels means more detail in the generated mouth region.
  • Choose a neutral mouth position — a closed or slightly parted mouth in the source image gives the model more to work with than an open smile.
  • Keep the head steady — since you're looping a short clip, any movement in the source will repeat visibly. A truly still source (photo-to-video) avoids this completely.
  • Ensure good face visibility — no sunglasses, hat brims shadowing the face, or hands near the mouth region.

Use case idea: Educators can record lecture audio once and apply it to a short looped video of themselves — meaning you only need to be on camera for 5 seconds, not the whole lesson.

Ready to Put It Into Practice?

The best way to learn is to run a job. Upload a short clip and see what Wav2Lip can do.

▶ Try the Free Lip Sync Tool