VoxAlign.
Correct the pitch. Keep the singer.
A pitch-correction tool that learns your vibrato, jitter, and formants before it touches a note. CREPE neural pitch detection at 10 ms frames, DTW alignment to a reference take, PSOLA shifting with WORLD and ONNX vocoder fallbacks. The opposite of Auto-Tune homogenization.
Most pitch correction erases the singer.
Auto-Tune and its descendants solve the pitch problem by flattening the thing that made the voice sound like a voice — the vibrato, the scoop into a note, the shimmer on a held vowel. VoxAlign tries to solve it the opposite way: measure what makes your voice yours first, then correct only the parts that actually missed.
Auto-Tune fixes pitch by deleting personality.
The first time I ran a rough vocal through a popular consumer pitch-correction tool, it did exactly what it said on the tin: every note was snapped to the nearest semitone. But the take came back sounding like somebody else. The scoop into the chorus note was gone. The slight shake on the held "oh" was gone. What came out was technically in tune and emotionally flat.
That's not a bug in Auto-Tune; it's the feature. The classic algorithm is a low-pass filter on pitch movement: anything that moves faster than the retune-speed setting allows gets smoothed away. Vibrato is faster than that. So is a scoop. So the price of "in tune" is a voice without a fingerprint.
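To make the low-pass argument concrete, here's a toy model of classic retune behavior: pull each frame's pitch toward the nearest semitone with a one-pole smoother whose time constant is the retune speed. It's an illustration only, not Auto-Tune's actual algorithm or VoxAlign's code, and every number in it is made up.

```python
import numpy as np

def naive_retune(f0_hz, frame_rate_hz=100.0, retune_ms=50.0):
    """Toy retune: chase the nearest semitone with a one-pole smoother.
    Anything wiggling faster than the time constant (vibrato at ~5-6 Hz,
    a scoop into a note) gets flattened along the way."""
    # Smoothing factor for a one-pole filter with time constant retune_ms.
    alpha = 1.0 - np.exp(-1.0 / (frame_rate_hz * retune_ms / 1000.0))
    out = np.zeros_like(f0_hz, dtype=float)
    corrected = None
    for i, f in enumerate(f0_hz):
        if f <= 0:                                        # unvoiced frame: leave it alone
            continue
        midi = 69 + 12 * np.log2(f / 440.0)               # pitch in semitones
        target = 440.0 * 2 ** ((round(midi) - 69) / 12)   # nearest semitone, back in Hz
        corrected = target if corrected is None else corrected + alpha * (target - corrected)
        out[i] = corrected
    return out
```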
VoxAlign is a different bet: build a model of your fingerprint first, and then only correct the notes that fall outside what your voice would naturally do.
What actually counts as “your voice.”
The musicology and audio-DSP literature converges on a surprisingly short list of features that make a voice recognizable: vibrato rate and depth, jitter (cycle-to-cycle pitch variation), shimmer (cycle-to-cycle amplitude variation), and the formants F1–F4, the resonant peaks shaped by your vocal tract that turn a glottal buzz into a vowel. Nail those four and you have something that'll pass an "is this the same singer?" test.
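For a sense of scale, the first three of those are a few lines of NumPy once you have a per-frame pitch track and amplitude envelope. The sketch below is illustrative rather than VoxAlign's profiler: it assumes a pitch array in Hz with zeros on unvoiced frames, at least a few hundred milliseconds of voiced audio, and it leaves the formant estimate out (that needs LPC, covered further down).

```python
import numpy as np

def fingerprint(f0_hz, rms, frame_rate_hz=100.0):
    """Rough versions of jitter, shimmer, and vibrato from per-frame data."""
    voiced = f0_hz > 0
    f0, amp = f0_hz[voiced], rms[voiced]

    # Jitter: frame-to-frame pitch variation, relative to the mean pitch.
    jitter = np.mean(np.abs(np.diff(f0))) / np.mean(f0)
    # Shimmer: the same idea applied to amplitude.
    shimmer = np.mean(np.abs(np.diff(amp))) / np.mean(amp)

    # Vibrato: dominant periodicity of the pitch curve (in cents),
    # searched by autocorrelation inside the 4-7 Hz band.
    cents = 1200 * np.log2(f0 / np.mean(f0))
    ac = np.correlate(cents - cents.mean(), cents - cents.mean(), mode="full")
    ac = ac[len(ac) // 2:]
    lo, hi = int(frame_rate_hz / 7), int(frame_rate_hz / 4) + 1
    lag = lo + int(np.argmax(ac[lo:hi]))
    vibrato_rate_hz = frame_rate_hz / lag
    vibrato_depth_cents = np.sqrt(2) * cents.std()   # rough sinusoid-amplitude estimate

    return {"jitter": jitter, "shimmer": shimmer,
            "vibrato_rate_hz": vibrato_rate_hz,
            "vibrato_depth_cents": vibrato_depth_cents}
```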
"The second Auto-Tune touches my vibrato, I sound like a robot. That's the only reason I still re-record instead of fixing pitch." // Producer interview · bedroom studio · pre-build
"I'd pay for something that lets me punch in a phrase without losing the vibe of the rest of the verse." // Field observation · singer-songwriter · Seattle
The non-negotiable that fell out: whatever the model learns about a singer has to transfer to the next take. Rebuilding a voice profile from scratch every session would defeat the point. VoxAlign's profiler keeps an exponential moving average (alpha = 0.3) of each feature across uploads, so the third take understands the first two.
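The update itself can be tiny. A minimal sketch, assuming the stored profile is a flat dict of per-feature scalars (the real schema is richer):

```python
def update_profile(stored, fresh, alpha=0.3):
    """Exponential moving average across uploads: each new take nudges
    every stored feature 30% of the way toward the fresh measurement,
    so session three still remembers sessions one and two."""
    if stored is None:
        return dict(fresh)   # first upload seeds the profile
    return {k: (1 - alpha) * stored[k] + alpha * v for k, v in fresh.items()}
```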
Extract, profile, align, correct.
The backend is a four-stage pipeline. CREPE runs a convolutional pitch estimator at 10 ms frame resolution with a 0.5 confidence threshold. The voice profiler layers on top, computing vibrato via autocorrelation in the 4–7 Hz band, jitter and shimmer from frame-to-frame variance, MFCCs for timbre, and formants from a 12th-order LPC analysis. Alignment runs fastdtw against a reference take with a Sakoe-Chiba band constraint so the warp path can't drift too far from the diagonal. Correction is PSOLA first, WORLD vocoder when the shift is too large, phase-vocoder via scipy as a safety net, and an ONNX neural path for the edge cases WORLD chokes on.
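In outline, the pipeline reads like this. Every helper name below is hypothetical; only the stage order and the numbers mirror the description above.

```python
def correct_take(take_wav, reference_wav, strength=1.0):
    # Stage 1: CREPE pitch track at 10 ms frames, low-confidence frames treated as unvoiced.
    f0, confidence = detect_pitch(take_wav)
    f0[confidence < 0.5] = 0.0
    # Stage 2: voice profile (vibrato, jitter, shimmer, MFCCs, formants).
    profile = build_voice_profile(take_wav, f0)
    # Stage 3: DTW warp of the take's timeline onto the reference's.
    warp_path = align_to_reference(f0, reference_wav)
    # Stage 4: correct only what falls outside the profile, PSOLA -> WORLD -> ONNX.
    return apply_correction(take_wav, f0, warp_path, profile, strength)
```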
Three calls I'd defend.
Classical pitch detectors like pYIN and SWIPE are fast and have the advantage of being pure signal processing — no model, no weights, no GPU required. CREPE is a convolutional net trained specifically on monophonic voice, and on clean singing the difference is small. On a breathy vocal take with room reverb, CREPE holds a stable pitch curve where pYIN starts octave-flipping. I'd rather pay the TensorFlow dependency tax and get a pitch track I can trust than spend cycles filtering octave errors out of a cheaper estimator.
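The CREPE call itself is small; this is roughly how the extraction stage uses it. The file name and the 0.5-threshold handling are illustrative, the API is crepe's own.

```python
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read("take.wav")
# 10 ms hop, Viterbi smoothing over the salience map for a stable contour.
time, frequency, confidence, activation = crepe.predict(
    audio, sr, viterbi=True, step_size=10)
# Treat low-confidence frames as unvoiced rather than trusting a guess.
frequency[confidence < 0.5] = 0.0
```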
The naïve move is to lock the singer's frame 0 to the reference's frame 0 and correct each frame independently. That breaks the first time the singer takes a half-beat longer on a word, or breathes mid-phrase. DTW with a Sakoe-Chiba band warps the singer's timeline onto the reference's, so a breath stays a breath and a held note stays held. The 15% band keeps the warp path from going unhinged on a take where the singer forgot a line.
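A sketch of the alignment step, with one honest simplification: fastdtw exposes a search radius rather than a true Sakoe-Chiba band, so using 15% of the reference length as that radius approximates the constraint rather than reproducing it exactly.

```python
import numpy as np
from fastdtw import fastdtw

def align_frames(take_f0, ref_f0, band=0.15):
    """Warp the take's frame timeline onto the reference's."""
    radius = max(1, int(band * len(ref_f0)))
    # Compare pitch on a log scale so distance means the same thing in any octave;
    # unvoiced frames (f0 == 0) are parked at 0.
    x = np.where(take_f0 > 0, np.log2(np.maximum(take_f0, 1e-9)), 0.0)
    y = np.where(ref_f0 > 0, np.log2(np.maximum(ref_f0, 1e-9)), 0.0)
    _, path = fastdtw(x, y, radius=radius)
    return path   # list of (take_frame, ref_frame) index pairs
```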
This is the thesis of the project and also the piece that makes it slower than commodity pitch correction. Building the voice profile costs real compute (vibrato autocorrelation, 12th-order LPC for formants, 13-dimensional MFCC frames), and it has to happen before the correction stage can decide which notes to leave alone. The payoff is that the output still sounds like the same singer on the other side, which is the entire point.
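The formant piece of that cost looks roughly like the textbook LPC-root method below. It assumes librosa for the linear prediction and skips the bandwidth filtering a production profiler would add, so treat it as a sketch rather than the shipped code.

```python
import numpy as np
import librosa

def formants_lpc(frame, sr, order=12):
    """Estimate F1-F4 from one analysis frame: roots of the 12th-order LPC
    polynomial near the unit circle mark vocal-tract resonances."""
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    a = librosa.lpc(emphasized * np.hamming(len(emphasized)), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]                # one of each conjugate pair
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)     # root angle -> Hz
    return [f for f in freqs if f > 90][:4]                           # drop near-DC roots, keep F1-F4
```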
MVP, measured.
VoxAlign runs end-to-end on a laptop: upload a take, attach a reference, pick a correction strength, get a corrected WAV back in under 2 × realtime. The backend sits at 360 of 364 tests passing (the 4 holdouts are environment-specific WORLD edge cases I'm tracking as known limitations). The frontend ships 22 test files covering waveform editing, reprocess flows, and the correction-controls state machine.
What I'd do differently.
I'd lock Python 3.11.x from day one. CREPE pulls in a TensorFlow C extension that isn't compatible with 3.12+, and I spent most of a weekend figuring that out the hard way, through pip resolver hell, before committing to a pinned interpreter and writing the lock reason into the README. The decision cost about eight hours I'm never getting back. If you're building anything on CREPE right now, pin the interpreter before you write a single line.
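One way to make the pin explicit, assuming a pyproject.toml-based setup (adapt to whatever your packaging tool actually reads):

```toml
[project]
# CREPE pulls in TensorFlow's C extension; its wheels lag new interpreters, so pin hard.
requires-python = ">=3.11,<3.12"
```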
I'd also build the ONNX neural vocoder fallback before I needed it, not after. The WORLD vocoder (pyworld 0.3.4) is extraordinary at what it does until it segfaults on a clip with a sharp transient or a clipped sample. The day that happened I had nothing — the correction returned garbage and the user had no path forward. Adding the ONNX fallback was straightforward once I decided to do it. The lesson was that any single dependency in the hot path of a pipeline should have a fallback by default, even if you're sure you won't need one.
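The shape of the fallback chain that came out of that, with hypothetical helper and exception names:

```python
def shift_pitch(segment, ratio, sr):
    """Correction hot path: cheapest method first, each rung catching the
    failure mode of the one above, so a WORLD crash degrades the result
    instead of returning garbage."""
    backends = (psola_shift, world_shift, phase_vocoder_shift, onnx_vocoder_shift)
    last_error = None
    for backend in backends:
        try:
            return backend(segment, ratio, sr)
        except CorrectionError as exc:    # each backend wraps its own failures
            last_error = exc
    raise CorrectionError("every pitch-shift backend failed") from last_error
```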
Last thing: I'd write the voice-profile schema with versioning from the start. The profiler's output is what makes "session 2 remembers session 1" possible, and when I tweaked the feature set mid-build I had to invalidate every stored profile in the dev database. A two-field version tag would've let me migrate.
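What that tag might look like on the stored record; the field names are illustrative, not the real schema:

```python
from dataclasses import dataclass, field

PROFILE_SCHEMA_VERSION = 2   # bump when the stored structure changes
FEATURE_SET_VERSION = 1      # bump when the meaning of the numbers changes

@dataclass
class VoiceProfile:
    schema_version: int = PROFILE_SCHEMA_VERSION
    feature_version: int = FEATURE_SET_VERSION
    features: dict = field(default_factory=dict)  # vibrato, jitter, shimmer, formants, MFCC stats
```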