Can AI Decode Whale and Dolphin Speech? What the Data Shows

For decades, researchers recorded hours of clicks, whistles and codas from whales and dolphins, then ran up against a hard limit: there was simply too much audio for any team to label by ear. Machine learning has changed the scale of what is searchable, clustering millions of sounds and flagging recurring structure that humans then test against behavior. The framing that matters here is narrow but important: these models surface candidate patterns; trained biologists still decide whether a pattern carries meaning. AI is the discovery instrument, not the interpreter. This report walks through the leading projects, the verifiable numbers behind them, and why humans stay firmly in the loop.

Key takeaways

Structure, not translation: A 2024 Project CETI study in Nature Communications analyzed 8,719 sperm whale codas and proposed a combinatorial “phonetic alphabet,” but the authors describe structure, not a decoded dictionary.
Models run in the field: Google’s DolphinGemma (2025) is a roughly 400-million-parameter model small enough to run on a Pixel phone, trained on Wild Dolphin Project audio collected since 1985.
Open foundation models exist: Earth Species Project’s NatureLM-audio reached state-of-the-art zero-shot results on 7 of 9 datasets in the BEANS-Zero benchmark and was released openly (ICLR 2025).
Scale is the unlock: The Wild Dolphin Project’s archive spans roughly 40 years of underwater audio-visual data, more than any team could review unaided.
Humans hold meaning: Across all three projects, AI clusters and predicts sounds; biologists pair those clusters with observed behavior to decide what, if anything, they mean.

8,719: sperm whale codas analyzed in the CETI study (Source: Sharma et al., Nature Communications, 2024)
~400M: parameters in DolphinGemma, small enough to run on a phone (Source: Google / DeepMind, 2025)
7 of 9: datasets where NatureLM-audio set zero-shot state of the art (Source: Earth Species Project, ICLR 2025)
~40 yrs: of dolphin audio-visual data feeding the models, since 1985 (Source: Wild Dolphin Project)

Project CETI: finding structure in sperm whale codas

Sperm whales communicate in codas: short, patterned bursts of clicks. In a 2024 study, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory working with Project CETI applied machine learning to 8,719 codas recorded from Eastern Caribbean whale families by the Dominica Sperm Whale Project. The analysis identified a combinatorial coding system built from four features the team named rhythm, tempo, rubato and ornamentation, with at least 143 distinguishable coda combinations occurring frequently. Crucially, the authors framed this as evidence of structure, a “phonetic alphabet,” not a decoded language: the AI revealed how the sounds are organized, while the question of what they mean remains open for biologists to test. Source: Sharma et al., Nature Communications, 2024.
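To make the idea of a combinatorial coding system concrete, here is a minimal sketch in Python. It is not the method used in the CETI study: the click times are made up, the feature definitions are simplified assumptions, the function names (coda_features, tempo_bin) and bin edges are hypothetical, and rubato and ornamentation are left out entirely. It only shows how two independent features, a tempo-independent rhythm and a coarse tempo class, can combine into distinct coda types that are then counted.

```python
from collections import Counter

def coda_features(click_times_ms):
    """Return (duration_ms, rhythm) for one coda from integer click onset times."""
    if len(click_times_ms) < 2:
        raise ValueError("a coda needs at least two clicks")
    duration_ms = click_times_ms[-1] - click_times_ms[0]
    if duration_ms <= 0:
        raise ValueError("click times must be increasing")
    # Inter-click intervals: the gaps between consecutive clicks.
    intervals = [b - a for a, b in zip(click_times_ms, click_times_ms[1:])]
    # Rhythm (simplified assumption): each interval as a whole-number percentage
    # of total duration, so the same pattern played faster or slower still matches.
    rhythm = tuple(round(100 * gap / duration_ms) for gap in intervals)
    return duration_ms, rhythm

def tempo_bin(duration_ms, edges_ms=(500, 1000)):
    """Bucket duration into coarse tempo classes; the edges are hypothetical."""
    return sum(duration_ms > edge for edge in edges_ms)

# Made-up click times in milliseconds, not real recordings.
codas = [
    [0, 200, 400, 600, 800],    # evenly spaced
    [0, 300, 600, 900, 1200],   # same rhythm, slower tempo
    [0, 100, 200, 600, 800],    # different rhythm, same duration as the first
]

# Count how often each (rhythm, tempo class) combination occurs.
combos = Counter()
for clicks in codas:
    duration_ms, rhythm = coda_features(clicks)
    combos[(rhythm, tempo_bin(duration_ms))] += 1

for combo, count in combos.items():
    print(combo, count)
```

In this toy data the first two codas share a rhythm but fall into different tempo classes, so they count as different combinations. Real recordings are noisy, so an actual analysis would need to cluster features rather than compare them for exact equality.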

DolphinGemma: a model that runs in the water

In 2025, Google introduced DolphinGemma, a roughly 400-million-parameter model built on the Gemma family and developed with Georgia Tech and the Wild Dolphin Project. It uses Google’s SoundStream tokenizer to turn dolphin vocalizations into token sequences, then learns to predict the next sound, much as a language model predicts the next word, helping researchers spot recurring patterns and structure. Compact enough to run on a Pixel phone, it is designed to assist analysis in the field rather than from a distant data center. The model is trained on the Wild Dolphin Project’s decades of labeled audio of Atlantic spotted dolphins; its job is to accelerate pattern-finding, leaving interpretation to the researchers who know the animals. Source: Google / DeepMind, 2025.
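The "predict the next sound" idea can be illustrated with a far simpler stand-in than DolphinGemma itself. The sketch below is a bigram counter, not a neural network, and it makes several assumptions: the integer tokens are invented placeholders for whatever an audio tokenizer would emit, and the function names (train_bigram, predict_next) are hypothetical. It shows only the shape of the task: learn which token tends to follow which, then predict the most likely continuation.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count, for every token, which tokens follow it and how often."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Invented token sequences; each integer stands in for one tokenized sound unit.
sequences = [
    [3, 7, 7, 2],
    [3, 7, 2, 5],
    [1, 3, 7, 2],
]

model = train_bigram(sequences)
print(predict_next(model, 3))  # most common token after 3
print(predict_next(model, 7))  # most common token after 7
print(predict_next(model, 5))  # 5 is never followed by anything here
```

A model that predicts well has captured some regularity in the sequences, which is what makes it useful for spotting recurring patterns. It says nothing about what those patterns mean.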

Earth Species Project: open foundation models for bioacoustics

The Earth Species Project is a nonprofit whose stated premise is that more than 8 million species share the planet while humans understand the language of only one. It released NatureLM-audio, described as the first audio-language foundation model built for bioacoustics. Prompted in plain English alongside an audio clip, the model can classify species and shows emergent abilities such as counting individuals in a recording. On the BEANS-Zero benchmark, the team reported state-of-the-art zero-shot performance on 7 of 9 datasets, and the model weights, code and benchmark were released openly. As a general-purpose tool, it is explicitly meant to give biologists a faster way to triage and query recordings, not to assign meaning on its own. Source: Earth Species Project, ICLR 2025.

What’s next, and what to watch for

The near-term trajectory is more data and tighter feedback loops: field-deployable models like DolphinGemma can flag patterns during an expedition, letting researchers test hypotheses against behavior in something closer to real time. The honest caveats matter too. None of these systems has demonstrated two-way conversation or a verified “translation,” and confirming that a sound pattern carries a specific meaning requires controlled behavioral evidence, not model output alone. The risk to guard against is over-reading a cluster as a “word.” The promise is real, but what AI offers is a discovery accelerator, and the burden of proof still sits with the scientists.
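One way to see why a cluster alone is not a "word" is to ask whether it co-occurs with a behavior more often than chance would produce. The sketch below uses a simple permutation test for that question. It is an illustration under stated assumptions, not a procedure taken from any of the projects above: the data are invented, the function names (cooccurrence, permutation_p_value) are hypothetical, and an observational check like this is at most a first filter before the controlled behavioral experiments the paragraph describes.

```python
import random

def cooccurrence(cluster_flags, behavior_flags):
    """Number of recordings where the sound cluster and the behavior both occur."""
    return sum(c and b for c, b in zip(cluster_flags, behavior_flags))

def permutation_p_value(cluster_flags, behavior_flags, n_perm=2000, seed=0):
    """Share of random relabelings with co-occurrence at least as high as observed."""
    rng = random.Random(seed)
    observed = cooccurrence(cluster_flags, behavior_flags)
    shuffled = list(behavior_flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between sound and behavior
        if cooccurrence(cluster_flags, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Invented data: 40 recordings, True where the cluster or behavior was observed.
cluster = [True] * 20 + [False] * 20
linked_behavior = [True] * 20 + [False] * 20   # always appears with the cluster
constant_behavior = [True] * 40                # appears in every recording

p_linked = permutation_p_value(cluster, linked_behavior)
p_constant = permutation_p_value(cluster, constant_behavior)
print(p_linked, p_constant)
```

A small value for the linked case says only that the pairing is unlikely to be coincidence in this sample. The constant case shows the opposite trap: a behavior that happens all the time co-occurs with every cluster and carries no evidence of meaning.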

The through-line: decision support, not replacement

Every project here follows the same division of labor. The model handles what humans cannot: clustering millions of clicks, surfacing combinatorial structure, predicting the next sound, querying archives by natural language. The biologist handles what the model cannot: pairing a flagged pattern with observed context, designing the experiment that tests for meaning, and deciding whether a finding holds. AI widens the funnel of what is worth investigating; people still make the call. That is the durable shape of this field: decision support, with the decision staying human.

Methodology & sources

8,719 codas, combinatorial structure and the four named features (rhythm, tempo, rubato, ornamentation) — Sharma et al., Nature Communications (2024)
MIT CSAIL / Project CETI framing of the “phonetic alphabet” — MIT News (2024)
DolphinGemma ~400M parameters, SoundStream tokenizer, runs on Pixel, built with the Wild Dolphin Project — Google DeepMind (2025)
NatureLM-audio, zero-shot state of the art on 7 of 9 BEANS-Zero datasets, open release — Earth Species Project (2024/2025)
Wild Dolphin Project founded 1985, longest-running underwater dolphin study, ~40 years of data — Wild Dolphin Project

Frequently asked questions

Has AI actually translated whale or dolphin language?

No. To date, AI has identified structure and patterns in animal sound, such as the combinatorial “phonetic alphabet” proposed for sperm whales, but no project has produced a verified translation or demonstrated two-way conversation. Confirming meaning requires behavioral evidence that scientists, not models, must establish.

What does the AI actually do in these projects?

It finds and predicts patterns at scale. Models cluster huge volumes of recordings, detect recurring structures, predict the next sound in a sequence, and let researchers query archives in plain language, work that would be impossible to do by ear across decades of audio.

Why is human interpretation still essential?

Because a statistical pattern is not the same as a meaning. Biologists pair the AI’s candidate patterns with observed behavior and controlled tests to decide whether a sound is significant, keeping the model as decision support while the scientific judgment stays human.

Part of our Real-World AI Use Cases series — how AI supports high-stakes decisions across surprising domains.