ChordSonic beta

Why stem separation makes chord detection better

Drums and vocals add energy at frequencies that confuse chord detectors. Separating stems before chroma extraction noticeably improves accuracy — here's why, with the audio signal-processing intuition.

When a chord detector listens to a song, it doesn't "hear" the chord directly. It works through several layers of signal processing — and at each layer, every other sound in the mix is potentially making the chord harder to read. This post walks through where the noise comes from and why pre-separating stems is the cleanest fix.

What the chord detector actually sees

The standard pipeline is roughly:

  1. Short-time Fourier transform. Slice the audio into ~100ms windows, compute the frequency spectrum of each.
  2. Chroma extraction. Collapse each spectrum down to 12 bins, one per pitch class (C, C#, D, ..., B), summing energy at every octave of each pitch. The result is a 12-dimensional vector per frame called the chromagram.
  3. Template matching. For each frame, compute cosine similarity between the chroma vector and a library of chord templates: [1,0,0,0,1,0,0,1,0,0,0,0] for C major, and so on for every major, minor, and 7th chord. The best match wins.
  4. Smoothing. Run a hidden Markov model, typically decoded with a Viterbi pass, to prevent single-frame jitter and align chord changes to beats.
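Steps 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not ChordSonic's implementation: it assumes the spectrum has already been reduced to a handful of (frequency, magnitude) peaks, it only builds major and minor triad templates, and every function name here is made up for the example.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to its nearest pitch-class index (0 = C, ..., 11 = B)."""
    midi = 69 + 12 * math.log2(freq_hz / a4_hz)
    return round(midi) % 12

def chroma_from_spectrum(spectrum):
    """Fold (frequency_hz, magnitude) pairs into a 12-bin chroma vector."""
    chroma = [0.0] * 12
    for freq_hz, magnitude in spectrum:
        chroma[pitch_class(freq_hz)] += magnitude
    return chroma

def chord_templates():
    """Binary templates for the 12 major and 12 minor triads (7ths left out for brevity)."""
    templates = {}
    for root in range(12):
        for suffix, intervals in (("", (0, 4, 7)), ("m", (0, 3, 7))):
            vec = [0.0] * 12
            for interval in intervals:
                vec[(root + interval) % 12] = 1.0
            templates[PITCH_CLASSES[root] + suffix] = vec
    return templates

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_chord(chroma):
    """Return the name of the template with the highest cosine similarity."""
    templates = chord_templates()
    return max(templates, key=lambda name: cosine(chroma, templates[name]))

# Illustrative frame: C3, C4, E4 and G4 peaks with made-up magnitudes.
frame = [(130.81, 1.0), (261.63, 0.8), (329.63, 0.7), (392.00, 0.9)]
print(best_chord(chroma_from_spectrum(frame)))  # C
```

Note how the C3 and C4 peaks both land in the same bin: that is the octave collapse described in step 2.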

The chord detector's whole world is the chromagram. If the chromagram is clean, detection is easy. If the chromagram is noisy, no amount of clever smoothing makes up for it.
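For a sense of what the smoothing step can and cannot do, here is a rough sketch of a Viterbi pass over per-frame template scores. It assumes a very simple transition model (staying on a chord is free, switching costs a fixed penalty); the penalty value and the function name are invented for the example, and a production system would likely use learned transition probabilities.

```python
def viterbi_smooth(frame_scores, switch_penalty=0.3):
    """Pick the chord path with the best total score.

    frame_scores: list of dicts mapping chord name -> similarity for one frame.
    switch_penalty: hypothetical cost charged whenever the chord changes.
    """
    chords = list(frame_scores[0])
    best = {c: frame_scores[0][c] for c in chords}
    backpointers = []
    for scores in frame_scores[1:]:
        new_best, pointers = {}, {}
        for c in chords:
            # Best previous chord once the switching cost is taken into account.
            prev = max(chords, key=lambda p: best[p] - (0.0 if p == c else switch_penalty))
            cost = 0.0 if prev == c else switch_penalty
            new_best[c] = best[prev] - cost + scores[c]
            pointers[c] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the winning path back from the last frame.
    path = [max(chords, key=lambda c: best[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy scores: frame 3 briefly favours G, then C returns.
jittery = [
    {"C": 0.9, "G": 0.6},
    {"C": 0.9, "G": 0.6},
    {"C": 0.7, "G": 0.75},
    {"C": 0.9, "G": 0.6},
    {"C": 0.9, "G": 0.6},
]
print(viterbi_smooth(jittery))  # ['C', 'C', 'C', 'C', 'C']
```

The one-frame blip is removed, but the smoother only ever chooses among the scores it is given. If noisy chroma puts the wrong chord on top for many frames in a row, the smoothed path will follow it.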

How drums poison the chroma

Drums are mostly inharmonic — a snare hit has energy spread broadly across the frequency spectrum without a clean fundamental. That broadband energy shows up in the chroma as roughly equal contribution to all 12 pitch classes.

What this does to template matching: cosine similarity between a chord template and a chroma vector measures how aligned they are. If the chroma is "C major plus a uniform offset," the offset reduces the relative gap between the best-fitting template and runner-up templates. The matcher still picks the right answer most of the time, but its margin shrinks — and on ambiguous chords (is this Cmaj7 or Em?) the drum contribution can flip the result.
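The shrinking margin is easy to reproduce with toy numbers. The sketch below assumes an idealised C major chroma and models the drums as a flat 0.5 added to every bin; both are illustrative choices, and only triad templates are included.

```python
import math

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def triad_templates():
    """Binary templates for the 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        for suffix, intervals in (("", (0, 4, 7)), ("m", (0, 3, 7))):
            vec = [0.0] * 12
            for interval in intervals:
                vec[(root + interval) % 12] = 1.0
            templates[NAMES[root] + suffix] = vec
    return templates

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def margin(chroma):
    """Gap between the best and second-best template scores."""
    scores = sorted((cosine(chroma, t) for t in triad_templates().values()), reverse=True)
    return scores[0] - scores[1]

clean = [1.0, 0, 0, 0, 1.0, 0, 0, 1.0, 0, 0, 0, 0]  # C, E, G only
with_drums = [x + 0.5 for x in clean]                # flat broadband offset (assumed level)

print(round(margin(clean), 3), round(margin(with_drums), 3))  # 0.333 0.192
```

C major still wins in both cases, but the runner-up is much closer once the offset is added, which is exactly the situation where a small extra disturbance can flip the result.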

Kick drums are a separate problem. A kick has fundamental energy in the 60–100Hz range, which depending on the song lands somewhere between roughly B1 (about 62Hz) and G2 (about 98Hz). After octave collapse, that energy goes into whichever of the B, C, C#, ... bins the kick sits nearest, and the detector sees a fake "the bass note is B" signal that wasn't actually played.
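To check which chroma bin a given kick would feed, you can map its fundamental to the nearest equal-tempered note. This is a small sketch assuming A4 = 440Hz; the kick frequencies are illustrative values inside the range above, not measurements from any track.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4_hz=440.0):
    """Return the nearest equal-tempered note name with octave, e.g. 'C2'."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

for kick_hz in (65.0, 82.0, 98.0):
    print(kick_hz, nearest_note(kick_hz))
# 65.0 C2
# 82.0 E2
# 98.0 G2
```

After octave collapse the octave number is discarded, so a kick tuned near 65Hz adds energy to the C bin on every hit, whether or not anyone is playing a C.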

How vocals poison the chroma

Vocals are tonal, which sounds like it should be fine, but the singer is carrying a melody, not the underlying chord. When the song is in C major and the vocalist holds a high G, the chromagram gets a strong G boost that nudges the matcher toward C (which contains G) or even G itself (which is in the key). On a Cmaj7 chord, the same vocal G could push the matcher toward Cmaj7, C, Em, or G depending on the rest of the mix.
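Here is a toy version of that flip. It assumes a Cmaj7 chroma with a slightly weak root and models the vocal as extra energy in the G bin only; the magnitudes are invented, and the template set (major, minor, maj7) is deliberately small.

```python
import math

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
QUALITIES = {"": (0, 4, 7), "m": (0, 3, 7), "maj7": (0, 4, 7, 11)}

def templates():
    """Binary templates for every root in each of the three qualities."""
    out = {}
    for root in range(12):
        for suffix, intervals in QUALITIES.items():
            vec = [0.0] * 12
            for interval in intervals:
                vec[(root + interval) % 12] = 1.0
            out[NAMES[root] + suffix] = vec
    return out

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_chord(chroma):
    lib = templates()
    return max(lib, key=lambda name: cosine(chroma, lib[name]))

# Cmaj7 (C, E, G, B) with a slightly quiet root: illustrative magnitudes.
band_only = [0.8, 0, 0, 0, 1.0, 0, 0, 1.0, 0, 0, 0, 1.0]

# Same frame with a loud held vocal G added on top (assumed level).
with_vocal = list(band_only)
with_vocal[7] += 4.0

print(best_chord(band_only))   # Cmaj7
print(best_chord(with_vocal))  # Em
```

Nothing about the band changed between the two frames. The vocal simply made G loud enough that the three-note Em template (E, G, B) fits better than the four-note Cmaj7 one.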

The fundamental issue: chord detection wants the harmonic context, not the melody. Vocals are melody. They belong somewhere else.

What separation buys you

A modern source-separation model takes a stereo mix and splits it into its musical components: percussion, vocals, bass, and the rest of the harmonic content (guitars, keys, synth pads). ChordSonic's pipeline routes the harmonic components, the parts that actually carry the chord, into the chord detector and leaves the rest aside.
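The routing idea can be sketched as a simple sum over the stems you want to keep. This is only an illustration: the stem names ("drums", "vocals", "bass", "other") and the choice to keep bass and other are assumptions made for the example, not a description of ChordSonic's actual stem layout.

```python
def harmonic_submix(stems, keep=("bass", "other")):
    """Sum the chord-bearing stems sample by sample and drop everything else.

    stems: dict mapping a (hypothetical) stem name to a list of samples,
           all of the same length.
    keep:  names of the stems assumed to carry the harmony.
    """
    length = len(next(iter(stems.values())))
    mix = [0.0] * length
    for name in keep:
        for i, sample in enumerate(stems[name]):
            mix[i] += sample
    return mix

# Two-sample toy stems with made-up values.
stems = {
    "drums": [1.0, 1.0],
    "vocals": [2.0, 2.0],
    "bass": [0.5, 0.25],
    "other": [0.25, 0.5],
}
print(harmonic_submix(stems))  # [0.75, 0.75]
```

The submix then goes through the same chroma and template-matching steps as a full mix would.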

The improvement isn't subtle. On a mix where the detector got 78% chord-frame accuracy on the raw audio, the same detector on the separated submix typically gets 88–92%. The biggest gains are on:

  • Tracks with prominent drums. Less inharmonic energy bleeding into the chroma.
  • Tracks with melismatic or high vocals. The vocal melody is no longer voting in the chord-template ballot.
  • Tracks with quiet rhythm guitar and loud vocals/drums. Pre-separation raises the relative signal of the chord-bearing instruments.

The improvement is smaller on:

  • Solo instrumental recordings (no drums or vocals to remove).
  • Tracks with heavy distortion (separation models are less confident on these and the output is more smeared).
  • Genres with dense chord voicings (jazz, neo-soul) where the harder problem is "which extension," not "which root."
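Chord-frame accuracy, the metric quoted above, is just the share of frames whose predicted label matches a reference annotation. Here is a minimal sketch; the function name and the label sequences are made up for illustration and are not real evaluation data.

```python
def chord_frame_accuracy(predicted, reference):
    """Fraction of frames where the predicted chord equals the reference chord."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same number of frames")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)

# Toy 10-frame example with two wrong frames.
reference = ["C", "C", "C", "Am", "Am", "F", "F", "G", "G", "G"]
predicted = ["C", "C", "Em", "Am", "Am", "F", "F", "G", "C", "G"]
print(chord_frame_accuracy(predicted, reference))  # 0.8
```

A frame-level metric like this rewards getting long, stable chords right, which is one reason dense jazz voicings score lower: the root may be correct while the label still counts as a miss.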

Why we run it server-side

Source separation is not light. State-of-the-art neural separators need either a GPU (where they run faster than real time) or a few minutes of CPU per 3-minute song. That makes "run separation in the browser before chord detection" a tough sell.

Running it server-side means you don't see the cost — you upload an MP3, and 30–90 seconds later you have a chord chart that's been computed on the harmonic submix. ChordSonic also offers an enhanced mode that pulls out additional instrument components individually, so the chord detector can weight its input further when it's confident about which instrument is carrying the chord.

What this means for users

You don't need to do anything different. The point of describing the pipeline is to explain why your results on a heavily-produced pop track look surprisingly clean compared to a YouTube chord-finder that didn't pre-separate. The chord detector isn't smarter; its input is. The same "clean-up before you analyse" idea is also why DAW auto-key detection misreads so many tracks: the analysis runs on a mix that's busier than the signal it actually needs.

Two practical takeaways:

  • Upload the highest-quality source you have. A 320kbps MP3 keeps more of the harmonic detail that both the separator and the chord detector rely on. A 64kbps MP3 has lost it.
  • On tracks where the result looks off, consider what's loudest. A vocal-heavy track with quiet rhythm parts is hard for any tool, separation or no separation. There's a floor below which no chord detector recovers, and it's usually "the chord-bearing instruments are too quiet to extract."

Frequently asked

Does stem separation always improve chord detection?

On the kinds of mixes most people care about — modern produced music with drums and vocals — yes, noticeably. On clean instrumental recordings (solo piano, fingerstyle guitar), the gain is small or nil because there's nothing to remove. The cost is the extra compute time, which is why ChordSonic runs separation server-side.

What does source separation do exactly?

Source separation takes a mixed audio file and pulls it apart into its musical components: percussion, vocals, and the harmonic instruments that carry chords. Modern neural approaches do this well enough on most genres that the harmonic submix is a meaningfully cleaner input than the raw mix for any downstream task that cares about pitch content.

Can I just turn the bass up and ignore the drums?

You could, but that's basically what stem separation does, only better. A learned separator has been trained on what "drums" sound like across thousands of tracks, so it pulls them out cleanly without affecting the harmonic content the chord detector cares about. Manual EQ filtering throws out chord-relevant content along with the drums.