Lab Overview
Learning Objectives
After completing this lab, students will be able to:
- Apply the Fourier transform to real audio signals for frequency analysis
- Implement a simplified version of Shazam's audio fingerprinting algorithm
- Understand how spectrograms are used in audio recognition systems
- Generate and compare audio fingerprints using peak extraction
- Evaluate the robustness of fingerprinting techniques to noise
Background
Shazam's audio recognition technology relies on the Fast Fourier Transform (FFT) to convert time-domain audio signals into frequency-domain representations. By identifying unique patterns in the frequency spectrum (audio fingerprints), Shazam can match short audio samples against a database of millions of songs.
In this lab, you will implement the core components of this system, focusing on the signal processing aspects relevant to electrical engineering.
Pre-Lab Preparation
Complete these tasks before the lab session:
- Review Fourier Transform theory and properties
- Understand the relationship between the DFT and the FFT algorithms used to compute it
- Install Python with NumPy, SciPy, and Matplotlib libraries
- Download the sample audio files provided for the lab
Pre-Lab Questions
1. Explain why frequency-domain analysis (using FFT) is more effective than time-domain analysis for audio fingerprinting.
2. What is the purpose of applying a window function (such as Hann or Hamming) before performing an FFT on audio signals?
3. Calculate the frequency resolution of an FFT with N=4096 points for audio sampled at 44.1 kHz.
Frequency Resolution Δf = Fs / N
Where Fs = Sampling Frequency, N = FFT size
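As a quick numeric check of the formula, the sketch below evaluates Δf for a different FFT size than the one in Question 3: N = 1024, the default window size used later in Part 3. The helper name `frequency_resolution` is illustrative and not part of the lab code.

```python
def frequency_resolution(fs, n):
    """Return the FFT bin spacing in Hz for sampling rate fs and FFT size n."""
    return fs / n

# Example: a 1024-point FFT at 44.1 kHz
delta_f = frequency_resolution(44100, 1024)
print(f"Bin spacing: {delta_f:.2f} Hz")  # Bin spacing: 43.07 Hz
```

Two tones closer together than roughly one bin spacing cannot be separated reliably at this FFT size.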
Lab Procedure
Part 1: Audio Signal Generation
First, we'll generate synthetic audio signals to understand the FFT process. Create a Python script with the following functions:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

# Generate a test audio signal with multiple frequencies
def generate_test_signal(duration=2, fs=44100):
    t = np.linspace(0, duration, int(fs * duration), endpoint=False)
    # Create signal with three frequency components
    freqs = [440, 880, 1320]  # A4, A5, and approximately E6
    signal = np.zeros_like(t)
    for f in freqs:
        signal += 0.5 * np.sin(2 * np.pi * f * t)
    return t, signal, fs

# Add white noise to simulate real recording conditions
def add_noise(signal, snr_db=20):
    signal_power = np.mean(signal**2)
    noise_power = signal_power / (10**(snr_db/10))
    noise = np.random.normal(0, np.sqrt(noise_power), len(signal))
    return signal + noise
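One way to confirm that the noise level is what you asked for is to measure the SNR after the fact. The sketch below is self-contained, so it rebuilds a single 440 Hz tone and adds noise using the same power calculation as `add_noise`. The helper `measure_snr_db` and the fixed random seed are assumptions added for illustration.

```python
import numpy as np

def measure_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean signal and its noisy version."""
    noise = noisy - clean
    return 10 * np.log10(np.mean(clean**2) / np.mean(noise**2))

# Stand-in for the lab functions: a 440 Hz tone plus white Gaussian noise
rng = np.random.default_rng(0)          # fixed seed for repeatability
fs = 44100
t = np.arange(fs) / fs                  # one second of samples
clean = 0.5 * np.sin(2 * np.pi * 440 * t)

target_snr_db = 20
noise_power = np.mean(clean**2) / 10**(target_snr_db / 10)
noisy = clean + rng.normal(0, np.sqrt(noise_power), clean.size)

print(f"Measured SNR: {measure_snr_db(clean, noisy):.1f} dB")
```

The measured value should land close to the 20 dB target. It will not match exactly because the noise power is estimated from a finite number of random samples.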
Part 2: FFT Implementation & Analysis
Implement FFT calculation and analyze the frequency components of the audio signal.
def compute_fft(signal, fs, apply_window=True):
    n = len(signal)
    # Apply Hann window to reduce spectral leakage
    if apply_window:
        window = np.hanning(n)
        signal = signal * window
    # Compute FFT and keep only the positive-frequency half
    fft_result = np.fft.fft(signal)
    fft_magnitude = np.abs(fft_result[:n//2])
    fft_freq = np.fft.fftfreq(n, 1/fs)[:n//2]
    return fft_freq, fft_magnitude
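# Identify frequency peaks (simplified Shazam approach)
def find_peaks(frequencies, magnitude, threshold=0.1, min_distance=5):
    peaks = []
    last_idx = -min_distance  # bin index of the most recently accepted peak
    max_mag = np.max(magnitude)
    for i in range(1, len(magnitude)-1):
        if (magnitude[i] > magnitude[i-1] and
                magnitude[i] > magnitude[i+1] and
                magnitude[i] > threshold * max_mag):
            if i - last_idx >= min_distance:
                peaks.append((frequencies[i], magnitude[i]))
                last_idx = i
            elif magnitude[i] > peaks[-1][1]:
                # Closer than min_distance bins: keep the stronger peak
                peaks[-1] = (frequencies[i], magnitude[i])
                last_idx = i
    return peaks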
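For comparison with the hand-written peak picker, SciPy's `scipy.signal.find_peaks` offers similar thresholding and spacing controls through its `height` and `distance` parameters. The sketch below applies it to a 4096-sample excerpt of the three-tone test signal. The FFT size and the import alias `scipy_find_peaks` are choices made for this example only.

```python
import numpy as np
from scipy.signal import find_peaks as scipy_find_peaks

fs = 44100
n = 4096
t = np.arange(n) / fs
signal = sum(0.5 * np.sin(2 * np.pi * f * t) for f in (440, 880, 1320))

# Windowed magnitude spectrum (positive frequencies only)
magnitude = np.abs(np.fft.rfft(signal * np.hanning(n)))
freqs = np.fft.rfftfreq(n, 1 / fs)

# height: absolute threshold; distance: minimum spacing between peaks, in bins
idx, _ = scipy_find_peaks(magnitude, height=0.1 * magnitude.max(), distance=5)
peak_freqs = freqs[idx]
print(np.round(peak_freqs, 1))
```

Each reported peak should fall within one bin spacing (fs / n, about 10.8 Hz here) of 440, 880, and 1320 Hz, because the FFT can only report frequencies on its bin grid.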
Part 3: Spectrogram Generation
Create a spectrogram - a time-frequency representation essential for audio fingerprinting.
def generate_spectrogram(signal, fs, window_size=1024, hop_size=512):
    n_windows = (len(signal) - window_size) // hop_size + 1
    spectrogram = np.zeros((window_size//2, n_windows))
    window = np.hanning(window_size)  # same window for every segment
    for i in range(n_windows):
        start = i * hop_size
        end = start + window_size
        segment = signal[start:end] * window
        # Compute FFT for this segment
        fft_result = np.fft.fft(segment)[:window_size//2]
        magnitude = np.abs(fft_result)
        spectrogram[:, i] = magnitude
    time_axis = np.arange(n_windows) * hop_size / fs
    freq_axis = np.fft.fftfreq(window_size, 1/fs)[:window_size//2]
    return time_axis, freq_axis, spectrogram
Note: The spectrogram is a 2D representation with time on the x-axis and frequency on the y-axis. Color intensity represents magnitude at each time-frequency point.
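Spectrograms are usually easier to read on a decibel scale, since linear magnitudes let the strongest components hide everything else. The following is a minimal sketch of a dB conversion and a Matplotlib display. The helper names `to_decibels` and `plot_spectrogram`, and the -80 dB display floor, are assumptions rather than part of the lab code.

```python
import numpy as np

def to_decibels(magnitude, floor_db=-80.0):
    """Convert a magnitude spectrogram to dB relative to its maximum."""
    ref = max(np.max(magnitude), 1e-12)            # avoid division by zero
    db = 20 * np.log10(np.maximum(magnitude, 1e-12) / ref)
    return np.maximum(db, floor_db)                # clip very small values for display

def plot_spectrogram(time_axis, freq_axis, spectrogram):
    """Display a spectrogram in dB (requires Matplotlib)."""
    import matplotlib.pyplot as plt   # imported here so to_decibels works without it
    plt.pcolormesh(time_axis, freq_axis, to_decibels(spectrogram), shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.colorbar(label="Magnitude (dB re max)")
    plt.show()

# Small numeric check of the dB conversion
demo = np.array([[1.0, 0.1], [0.01, 0.0]])
print(to_decibels(demo))
```

`plot_spectrogram` takes the three values returned by `generate_spectrogram`, in the same order.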
Part 4: Audio Fingerprint Generation
Implement a simplified version of Shazam's fingerprinting approach by identifying constellations of peaks in the spectrogram and hashing pairs of nearby peaks.
def find_spectrogram_peaks(spectrogram, time_axis, freq_axis, threshold=0.3):
    peaks = []
    max_val = np.max(spectrogram)
    rows, cols = spectrogram.shape
    for t in range(1, cols-1):
        for f in range(1, rows-1):
            val = spectrogram[f, t]
            # Check if it's a local maximum
            if (val > threshold * max_val and
                    val > spectrogram[f-1, t] and
                    val > spectrogram[f+1, t] and
                    val > spectrogram[f, t-1] and
                    val > spectrogram[f, t+1]):
                peaks.append((time_axis[t], freq_axis[f], val))
    return peaks
# Create fingerprint hashes from peak pairs (simplified)
def create_fingerprint_hashes(peaks, max_time_diff=1.0, max_freq_diff=500):
    hashes = []
    n = len(peaks)
    for i in range(n):
        t1, f1, m1 = peaks[i]
        for j in range(i+1, min(i+5, n)):  # Limit pairs for efficiency
            t2, f2, m2 = peaks[j]
            time_diff = t2 - t1
            freq_diff = f2 - f1
            # Skip pairs that are too far apart in time or frequency
            if time_diff > max_time_diff or abs(freq_diff) > max_freq_diff:
                continue
            # Create hash from the pair (both frequencies and the time gap in ms)
            hash_val = hash((int(f1), int(f2), int(time_diff*1000)))
            hashes.append(hash_val)
    return hashes
Generated Fingerprint Hashes
These hash values summarize distinctive features of the audio signal. Run your pipeline and record a sample of the generated hashes here.
Key Concept: Shazam stores these hashes in a database. When you record audio, it generates similar hashes and looks for matches in the database. The matching process is efficient because it compares hashes rather than the full audio signal.
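The lookup idea can be sketched with a dictionary that maps each hash to the places where it occurs. This sketch assumes each hash is stored together with the time of its first peak, which the lab's `create_fingerprint_hashes` does not return. That extension, the function names `build_index` and `best_match`, and the toy hash values are all illustrative. Votes are grouped by time offset because hashes that truly match the same recording should agree on how far into the song the query starts.

```python
from collections import Counter, defaultdict

def build_index(songs):
    """Map each hash to the (song_id, time) pairs where it occurs.

    songs: dict of song_id -> list of (hash_value, time_in_seconds).
    """
    index = defaultdict(list)
    for song_id, entries in songs.items():
        for hash_value, t in entries:
            index[hash_value].append((song_id, t))
    return index

def best_match(index, query, bin_width=0.1):
    """Return (song_id, votes) for the most consistent time offset, or None."""
    votes = Counter()
    for hash_value, t_query in query:
        for song_id, t_song in index.get(hash_value, []):
            # Matching hashes from the same song share one time offset
            offset_bin = round((t_song - t_query) / bin_width)
            votes[(song_id, offset_bin)] += 1
    if not votes:
        return None
    (song_id, _), count = votes.most_common(1)[0]
    return song_id, count

# Toy database with made-up hash values
songs = {
    "song_a": [(101, 0.0), (102, 0.5), (103, 1.0), (104, 1.5)],
    "song_b": [(201, 0.0), (102, 0.7), (203, 1.4)],
}
index = build_index(songs)
query = [(102, 0.0), (103, 0.5), (104, 1.0)]   # excerpt starting 0.5 s into song_a
print(best_match(index, query))                # ('song_a', 3)
```

Hash 102 also appears in `song_b`, but only as an isolated hit, so it loses to the three hits in `song_a` that agree on the same 0.5 s offset.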
Data Analysis & Results
Analysis Questions
1. How does the window size affect the spectrogram? Compare time resolution vs frequency resolution.
2. What happens to the fingerprint when you add noise to the signal? Test with different SNR values.
3. How many unique fingerprint hashes were generated from your test signal? How might this scale for a full song?
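For Question 2, one possible way to quantify robustness is the fraction of clean-signal hashes that still appear after noise is added. The metric and the name `hash_survival_rate` are suggestions, not a prescribed method, and the hash lists below are made up.

```python
def hash_survival_rate(clean_hashes, noisy_hashes):
    """Fraction of distinct clean-signal hashes that also appear in the noisy set."""
    clean_set = set(clean_hashes)
    if not clean_set:
        return 0.0
    return len(clean_set & set(noisy_hashes)) / len(clean_set)

# Made-up hash lists for illustration only
clean = [11, 22, 33, 44]
noisy = [11, 22, 55, 66, 77]
print(hash_survival_rate(clean, noisy))   # 0.5
```

Plotting this rate against SNR gives a compact picture of how quickly the fingerprint degrades.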
Experimental Results
| Test Condition | Peaks Found | Fingerprint Hashes | Computation Time (ms) |
|---|---|---|---|
| Clean Signal | - | - | - |
| With Noise (SNR=20dB) | - | - | - |
| Different Window Size | - | - | - |
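To fill in the Computation Time column, a small timing helper built on `time.perf_counter` is one option. The helper name `time_ms` and the choice to report the best of several runs are assumptions. The example times a stand-in workload so that the block runs on its own.

```python
import time

def time_ms(func, *args, repeats=5, **kwargs):
    """Return (best elapsed time in ms, last result) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best, result

# Example with a stand-in workload; replace with e.g. generate_spectrogram
elapsed_ms, total = time_ms(sum, range(100_000))
print(f"{elapsed_ms:.3f} ms, result = {total}")
```

Taking the minimum over repeated runs reduces the influence of unrelated background activity on the measurement.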