Birdsong Classification Project

Complete Python and ML pipeline for bird song classification, enriched with presentation insights, conclusions, and visuals from the original slides.

🐦 Birdsong Classification Project – Complete Version

In this blog post, we’ll explore how to automatically classify bird species from their songs using audio analysis, signal processing, and machine learning.
This project merges the original technical pipeline with insights, visuals, and conclusions from the "Birdsong Classification Project" presentation.

(Figure: america – map of the study area)


🎶 Why Birdsong Classification?

Bird songs are unique to each species.
With Machine Learning, we can analyze their sound signatures to automatically recognize the corresponding species.

Objectives:

  • Identify species from their songs.
  • Obtain an accurate and robust model.
  • Contribute to automated biodiversity monitoring.

Key points:

  • 5 species studied
  • Focus: American continent
  • About 5000 audio samples

(Figure: bewick sparrow)


🌍 Overview

Birdsong classification plays a crucial role in:

  • Ecological monitoring: tracking biodiversity and population trends.
  • Citizen science: leveraging crowd-sourced recordings like Xeno-Canto.
  • Automated surveys: detecting species in remote regions.

We rely on two key datasets:

  • Kaggle bird songs dataset (recordings and metadata)
  • Xeno-Canto crowd-sourced recordings

🎯 Project Objectives

  1. Extract audio features (MFCC, ZCR, Spectral Centroid).
  2. Preprocess audio (filtering, normalization, silence trimming).
  3. Train and evaluate ML models (Random Forest, LightGBM, PyCaret).
  4. Visualize spatial and spectral data for interpretation.

πŸ› οΈ Tools & Libraries

  • Python 🐍
  • Librosa, NumPy, Pandas for feature extraction
  • Scikit-learn, LightGBM for ML
  • Seaborn, Matplotlib, Folium for visualization

🎵 1. Loading and Preparing Data

We begin by installing dependencies and loading metadata.
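
If the environment is fresh, the dependencies can be installed first. The package list below is inferred from the tools section above; exact versions may differ:

pip install librosa numpy pandas scipy scikit-learn lightgbm seaborn matplotlib folium pycaret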

import os, librosa, numpy as np, pandas as pd
import seaborn as sns, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import scipy.signal
from pycaret.classification import *
df = pd.read_csv("./bird_songs_metadata.csv")
df['path'] = df['filename'].apply(lambda x: os.path.join("wavfiles", x))
df.sample(5)

🎚️ 2. First Audio Analysis

A first look at the raw recordings gives two views of each sample:

  • Blue: frequencies and amplitudes
  • Purple: frequency evolution over time
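
As a minimal sketch (an assumption about how the slide figures were produced), such waveform and spectrogram views can be drawn with librosa:

import librosa.display  # display utilities; requires librosa >= 0.9 for waveshow

# Pick one raw recording and plot its waveform (blue) and spectrogram
y_raw, sr_raw = librosa.load(df['path'].iloc[0])
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(y_raw, sr=sr_raw, ax=ax1, color='blue')
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y_raw)), ref=np.max)
librosa.display.specshow(S_db, sr=sr_raw, x_axis='time', y_axis='hz', ax=ax2, cmap='magma')
plt.tight_layout()
plt.show()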

Conclusion:

  • Poor quality samples
  • Lots of background noise
  • Large amplitude disparity

(Figures: brut1, brut2 – the raw recordings)


🔧 3. Audio Preprocessing

To make the data usable, we perform:

  • Frequency filtering
  • Volume normalization
  • Silence trimming

We wrap these steps into a single feature-extraction function:

def extract_clean_features(file_path):
    # Load up to 5 seconds of audio, resampled to 22.05 kHz
    y, sr = librosa.load(file_path, sr=22050, duration=5.0)
    # Band-pass filter (200 Hz to 10 kHz) to keep the bird-song frequency range
    low, high = 200 / (sr / 2), 10000 / (sr / 2)
    b, a = scipy.signal.butter(2, [low, high], btype='band')
    y = scipy.signal.filtfilt(b, a, y)
    # Normalize the volume and trim leading/trailing silence
    y = librosa.util.normalize(y)
    y, _ = librosa.effects.trim(y, top_db=15)
    # Spectral features: MFCCs, zero-crossing rate, spectral centroid
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    zcr = librosa.feature.zero_crossing_rate(y=y)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    return np.hstack([np.mean(mfcc, axis=1), np.std(mfcc, axis=1), np.mean(zcr), np.mean(centroid)])

Goal: keep the essential sound and remove background noise.
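
With the cleaning function in place, we can assemble the feature matrix and label vector used for training. A minimal sketch, assuming the metadata exposes a species column (the column name is an assumption):

# Apply cleaning + feature extraction to every recording
X = np.vstack([extract_clean_features(p) for p in df['path']])
y = df['species'].values  # 'species' column name is an assumption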


🌲 4. Model Training – Random Forest

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
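
To see which species get mixed up (as noted in the results below), a confusion-matrix heatmap helps, using the confusion_matrix import from above; a minimal sketch:

# Rows are true species, columns are predicted species
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=clf.classes_, yticklabels=clf.classes_)
plt.xlabel("Predicted species")
plt.ylabel("True species")
plt.show()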

Results:

  • Good overall performance
  • Average accuracy: 0.90
  • Some confusion between similar species

(Figure: randomForest1 – Random Forest results)


πŸ—ΊοΈ 5. Geographic Analysis

To better understand our dataset, we clean and visualize location data.

df_clean = df.dropna(subset=['latitude', 'longitude', 'altitude'])
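
The tools section lists Folium and MarkerCluster for mapping; a minimal sketch of such a map (the popup column is an assumption):

import folium
from folium.plugins import MarkerCluster

# Center the map on the mean recording location and cluster nearby markers
m = folium.Map(location=[df_clean['latitude'].mean(), df_clean['longitude'].mean()], zoom_start=4)
cluster = MarkerCluster().add_to(m)
for _, row in df_clean.iterrows():
    folium.Marker([row['latitude'], row['longitude']], popup=str(row.get('species', ''))).add_to(cluster)
m.save("bird_locations.html")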

Conclusion:

  • Species form distinct geographic clusters
  • Latitude and longitude are therefore informative features for the models

(Figure: cluster – map of species clusters)


⚡ 6. Improved Training with Geospatial Features

We add geographic coordinates to our cleaned audio features (see the sketch after the feature list below).

Features:

  • Cleaned audio samples
  • Altitude
  • Longitude
  • Latitude
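
A minimal sketch of appending the geographic columns to the audio feature vectors (column names taken from the cleaning step above; the species column is an assumption):

# Recompute audio features on the rows that have geographic data,
# then append altitude, longitude and latitude to each feature vector
X_audio = np.vstack([extract_clean_features(p) for p in df_clean['path']])
geo = df_clean[['altitude', 'longitude', 'latitude']].to_numpy(dtype=float)
X_geo = np.hstack([X_audio, geo])
y_geo = df_clean['species'].values  # 'species' column name is an assumption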

Improvements:

  • Accuracy: 0.90 → 0.96
  • Fewer inter-species confusions

(Figure: randomForest1)


🚀 7. Model Upgrade – Light Gradient Boosting Machine

After comparing candidate models with PyCaret (see the sketch below), we switch to LightGBM for further optimization.
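
A minimal sketch of that comparison with PyCaret, assuming the extracted features and labels live in a single DataFrame named features_df with a 'species' target column (both names are assumptions):

from pycaret.classification import setup, compare_models

# Let PyCaret benchmark a range of classifiers on the feature table
setup(data=features_df, target='species', session_id=42)
best_model = compare_models()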

Split:

  • 60% Training
  • 20% Test
  • 20% Validation
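
A minimal sketch of that 60/20/20 split, built from two successive train_test_split calls on the geo-augmented features X_geo, y_geo from section 6:

# Carve out 20% for validation, then split the rest 75/25 (= 60/20 of the whole)
X_rest, X_val, y_rest, y_val = train_test_split(X_geo, y_geo, test_size=0.2, stratify=y_geo, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

We then fit LightGBM on the training split: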
import lightgbm as lgb
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=40, max_depth=10, random_state=42)
model.fit(X_train, y_train)

Result:

  • Accuracy: 0.96 → 0.97
  • Slight improvement on rare species

(Figure: LGBM – LightGBM results)


🧩 8. Model Comparison

  • Little difference between LGBM and Random Forest.
  • Random Forest is ~20× faster.
  • Audio cleaning is crucial.
  • Geographic features significantly increase accuracy.

Final conclusion:

For a balance between performance, efficiency, and speed, the Random Forest Classifier remains the best option.

(Figure: pycaret – model comparison)


🧠 9. Support & Tools

  • Mapping: Folium, MarkerCluster
  • Plotting: Matplotlib + Seaborn
  • Model 1: Random Forest Classifier
  • Model 2: LightGBM
  • Audio Analysis: Librosa
  • Datasets: Kaggle + Xeno-Canto

🔗 Sources:

  • Kaggle bird songs dataset
  • Xeno-Canto recordings

🧭 10. Conclusion

🌿 Key Insights

  • The audio cleaning and preprocessing stage proved to be the most crucial step in achieving reliable classification.
  • Incorporating geospatial features (latitude, longitude, altitude) significantly improved model precision and reduced inter-species confusion.
  • The Random Forest Classifier remains the most balanced choice, offering a strong accuracy-to-efficiency ratio.
  • While LightGBM achieved slightly higher accuracy, it demands substantially more computational resources, an important consideration for ecological and large-scale deployments.

📊 Model Performance Summary

| 🧠 Model | 🧾 Features Used | 🎯 Accuracy |
| --- | --- | --- |
| Random Forest | Raw audio features only | 0.90 |
| Random Forest + Geo | Cleaned audio + geographic data | 0.96 |
| LightGBM + Geo | Cleaned audio + geographic data | 0.97 |

Final Thought:
This project demonstrates how combining bioacoustic signal analysis with geospatial context allows machine learning models to more accurately identify bird species while maintaining computational efficiency.
In practice, the Random Forest model offers the best trade-off between precision, speed, and sustainability, making it a practical choice for real-world biodiversity monitoring.