Dynamic Time Warping (DTW) is a well-established algorithm for comparing time series.

The basic problem that DTW attempts to solve is how to align two sequences in order to generate the most representative distance measure of their overall difference. If you have two signals encoded as sequences of evenly spaced values (representing, for example, the peak frequency of the signal at each point in time), then an obvious way to compare the signals is to sum the differences in frequency at each point along them. However, a problem arises if there is any discrepancy in the alignment of the signals, for example if one of the signals is stretched or compressed compared to the other: how do you decide which points to compare with each other?
The DTW algorithm uses a dynamic programming technique to solve this problem. The first step is to compare each point in one signal with every point in the second signal, generating a matrix of distances. The second step is to work through this matrix, starting at the bottom-left corner (corresponding to the beginning of both signals) and ending at the top-right (the end of both signals): for each cell, the cumulative distance is calculated by picking the neighbouring cell in the matrix to the left or beneath with the lowest cumulative distance, and adding this value to the distance of the focal cell. When this process is complete, the value in the top right-hand cell represents the distance between the two signals according to the most efficient pathway through the matrix.
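To make the procedure concrete, here is a minimal sketch of that dynamic-programming step, assuming one-dimensional signals and an absolute-difference cost; the function and variable names are illustrative and do not correspond to Luscinia's actual code.

```python
def dtw_distance(a, b):
    """Minimal DTW sketch: cumulative-distance matrix over two 1-D signals.

    Step 1: local distance between every pair of points.
    Step 2: accumulate costs from (0, 0) to (n-1, m-1), at each cell adding
    the cheapest neighbouring cumulative cost (left or beneath, as described
    above; many DTW formulations also allow a diagonal step).
    """
    n, m = len(a), len(b)
    # Step 1: local distance matrix
    d = [[abs(a[i] - b[j]) for j in range(m)] for i in range(n)]
    # Step 2: cumulative distance matrix
    D = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best = 0.0
            elif i == 0:
                best = D[0][j - 1]
            elif j == 0:
                best = D[i - 1][0]
            else:
                best = min(D[i - 1][j], D[i][j - 1])
            D[i][j] = d[i][j] + best
    return D[n - 1][m - 1]


# Example: an unwarped and a stretched version of the same frequency contour
print(dtw_distance([1, 2, 3, 2, 1], [1, 2, 2, 3, 3, 2, 1]))
```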

How does the DTW algorithm compare with alternative algorithms for comparing signals? Two other algorithms are widely used for comparing bioacoustic signals today. The first is spectrogram cross-correlation (SCC). SCC essentially involves multiplying the intensity values of two spectrograms together, point by point, and summing the total. If two signals are very similar, then they should overlap considerably in the spectrogram, and this will be reflected in a high SCC value. It has been recognized for some time that this algorithm does not always accurately reflect differences between signals: slight but consistent deviations in frequency, for example, could reduce the SCC score to 0! Efforts have been made to redress this problem by, for example, sliding the spectrograms over one another in the frequency and time domains and searching for the highest level of overlap. However, the algorithm is fundamentally rather brittle, meaning that it doesn't react in a consistent or linear fashion to deviations in similarity, and it is sensitive to background noise. On the other hand, an advantage of the SCC algorithm is that it doesn't require any parameter extraction from the spectrogram, since the spectrogram itself is the basis for comparison.
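As a rough illustration of the basic idea (not the implementation of any particular package), an SCC score for two equal-sized spectrogram matrices might be computed along these lines; sliding one spectrogram over the other amounts to repeating this calculation at each time and frequency offset and keeping the maximum.

```python
import numpy as np

def scc_score(spec_a, spec_b):
    """Sketch of basic spectrogram cross-correlation.

    spec_a, spec_b: 2-D arrays of spectrogram intensities with identical
    shape (frequency bins x time frames). The simplest form is just the raw
    sum of element-wise products; here the arrays are mean-subtracted and
    normalised so that identical spectrograms score 1.
    """
    a = spec_a - spec_a.mean()
    b = spec_b - spec_b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom)
```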

The second widely used algorithm, implemented in Sound Analysis Pro (SAP), is, like DTW, a distance-based algorithm that relies on derived acoustic parameters extracted from spectrograms, and it resembles DTW in several respects. As in DTW, signals are compared by summing distances between time series, and the first step is to generate a matrix of distances between every point in the first signal and every point in the second. At this point, however, the algorithms diverge: SAP fits the best straight line through the matrix that allows up to 30% warping in the time dimension, and sums the distances along this line.
The advantage of SAP over the basic form of DTW is that this process prevents arbitrary stretching and compression of signals: after all, if a signal is stretched to become twice as long, surely it is then not very similar to the original. However, the way this is implemented is, in my opinion, also rather brittle. 30% is a rather arbitrary figure, and the fact that signals can be scored as identical within this threshold, but that similarity declines rapidly once the threshold is breached, seems unappealingly nonlinear.
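My reading of that line-fitting step could be sketched roughly as follows; this is only an illustration of the idea of scoring straight lines with a bounded amount of time warping through the distance matrix, not SAP's actual procedure, and the parameter names are invented for this example.

```python
import numpy as np

def line_warp_distance(dist, max_warp=0.3, n_slopes=25):
    """Rough sketch of a straight-line warp through a distance matrix.

    dist: n x m numpy array of point-to-point distances between two signals
    (n, m >= 2). Candidate straight lines from the bottom-left corner are
    tried with slopes within +/- max_warp of the main diagonal; the mean
    distance along the cheapest line is returned.
    """
    n, m = dist.shape
    base_slope = (m - 1) / (n - 1)
    best = np.inf
    for factor in np.linspace(1 - max_warp, 1 + max_warp, n_slopes):
        slope = base_slope * factor
        cols = np.clip(np.round(np.arange(n) * slope).astype(int), 0, m - 1)
        best = min(best, dist[np.arange(n), cols].mean())
    return float(best)
```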

In Luscinia, the problem of weighting time warps is solved in a much simpler fashion: time is simply incorporated as another dimension that can be used in DTW. In other words, when the matrix is constructed, comparing each point within one signal with every point in the other, the distances are calculated as Euclidean distances, based on the parameters that the user selects. If “Time” is selected as a parameter (and I would recommend that it is!), then it becomes an ingredient in determining the overall distance between the signals. What this means is that time-warping is allowed, but that it is penalized in a straightforward linear manner.
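A sketch of what that local distance calculation might look like, assuming each point of a signal is stored as a row of user-selected parameters with (weighted) time as one of the columns; the names are illustrative and not Luscinia's API.

```python
import numpy as np

def local_distance_matrix(sig_a, sig_b):
    """Sketch: local distances with time treated as just another dimension.

    sig_a, sig_b: arrays of shape (n_points, n_params), where one column is
    the (possibly weighted) time of each point and the remaining columns are
    acoustic parameters such as peak frequency. Because time is a coordinate
    in the Euclidean distance, any time warp is penalised linearly, with no
    hard threshold.
    """
    diff = sig_a[:, None, :] - sig_b[None, :, :]   # shape (n, m, n_params)
    return np.sqrt((diff ** 2).sum(axis=-1))       # shape (n, m)


# Example: points as [time, peak_frequency] rows (time pre-scaled by a weight)
A = np.array([[0.0, 2000.0], [1.0, 2500.0], [2.0, 2200.0]])
B = np.array([[0.0, 2100.0], [1.5, 2600.0], [3.0, 2300.0]])
print(local_distance_matrix(A, B).shape)  # (3, 3)
```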

Refinements of the DTW algorithm in Luscinia

There are several problems with the standard DTW algorithm.

1) Overall offsets in a dimension cause signals to be scored as more different from each other than they are in reality. For example, two bird song elements could share exactly the same "shape", but one could be 500 Hz higher in frequency than the other. A third element could have an entirely different shape, but have the same average frequency as one of the first two elements. Which two elements of the three are most similar? By default, the DTW algorithm would tend to score the two elements with the same mean frequency as the most similar. In reality, people seem to rank the "shape" of the signal as a more important parameter. This problem with DTW is discussed by Keogh & Pazzani (2001, "Derivative Dynamic Time Warping", First SIAM International Conference on Data Mining).

In Luscinia, you can use frequency change parameters instead of or in addition to frequency measures. Frequency change measures will more accurately track the form or “shape” of an element.
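In the spirit of that approach (and of the derivative DTW idea cited above), a frequency-change track could be derived from a peak-frequency track along these lines; this is only a sketch of the principle, not Luscinia's exact estimator.

```python
import numpy as np

def frequency_change(freq, dt):
    """Sketch: convert a peak-frequency track into frequency-change values.

    freq: peak frequency at each sampled point (Hz); dt: sampling interval (s).
    Comparing these slopes instead of (or alongside) the raw frequencies makes
    two elements with the same shape but a constant frequency offset look
    similar, because the offset cancels out in the differences.
    """
    return np.diff(freq) / dt


# Same shape, 500 Hz offset: identical frequency-change tracks
a = np.array([2000.0, 2200.0, 2500.0, 2300.0])
b = a + 500.0
print(frequency_change(a, 0.01), frequency_change(b, 0.01))
```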

2) The second refinement of DTW in Luscinia is to interpolate between points along the time series. Why is this necessary? If a signal is rapidly frequency modulated, but is still sampled at the same time intervals, then it stands to reason that there are larger jumps in frequency between consecutive points in the series. Because two signals are unlikely to be sampled at exactly the same points, this creates a problem: the sampled points in a highly frequency-modulated signal do not coincide as closely with each other as the points in un-modulated signals. This means that highly modulated signals are not clustered together as closely as un-modulated signals. The solution used in Luscinia is that rather than calculating the initial matrix from the distances between each of the n and m points in the two signals, we calculate a matrix of distances between the n points in the first signal and the m-1 line segments connecting consecutive points in the second signal.
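A sketch of the point-to-segment distance that this interpolation implies, assuming each point is a small vector such as (time, frequency); the names and details are illustrative rather than Luscinia's actual code.

```python
import numpy as np

def point_to_segment(p, s0, s1):
    """Sketch: distance from point p to the line segment from s0 to s1.

    p, s0, s1: 1-D arrays (e.g. [time, frequency]). Projecting p onto the
    segment, rather than measuring only to the sampled points, stops rapidly
    frequency-modulated signals from being penalised just because their
    sampled points do not happen to coincide.
    """
    seg = s1 - s0
    denom = (seg ** 2).sum()
    t = 0.0 if denom == 0 else np.clip((p - s0) @ seg / denom, 0.0, 1.0)
    return float(np.linalg.norm(p - (s0 + t * seg)))


# The matrix entry for point i of signal A and segment j of signal B would be
# point_to_segment(A[i], B[j], B[j + 1]), giving an n x (m - 1) matrix.
```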