Recognizing pieces of music (a Shazam-style program)

Mathesis Wiki


Before describing the individual parts of the program, I would like to see a section that explains what is coming and how it fits together, ideally with a picture. Always keep in mind readers who know little about the subject.

In the section on the STFT, I think it would be good to spell out the problem from the quest, to give readers with no background a sense of how hard it is to find properties of pieces of music that are reasonably robust against interference. (A text describing the solution to a technical or scientific problem is especially readable when the reader can see not only that, but also why, this or that contributes to the solution. Here, for example, the STFT (together with the downstream fingerprinting, which you documented well) would become recognizable as an answer to the problem described. It is also nice if the reader gets to think about the characteristics before he or she is given the answer.)

The planning and the course of the project can easily be studied in the logbook. But I would still like a short narrative of how the project went, for a reader who wants a quick overview. (Including where it got stuck.)

And a (short) conclusion is still missing, e.g. a sketch of how things could be taken further.




    Similar to Shazam, SoundHound, etc., the goal is to recognize music that you hear, e.g. on the radio, and to output its title.

    Sources and Links


    The aim is not to recognize almost every piece of music, as the apps mentioned do. That would require an enormous amount of memory and is not the focus of the work, although it would of course be nice to recognize as many songs as possible. Instead, special emphasis is placed on analyzing the sound signals for characteristic features and thereby assigning them unambiguously to a song. For example, attention can also be paid to the sound patterns of individual instruments or parts of them (e.g. the snare of a drum kit).

    Bonus quests

    More songs. Cooperation with the conductor group: it would be particularly exciting to analyze the output signals of the conductor group and thus check whether the sound output succeeded. The chain would look like this: the conductor makes movements; a song is derived from them; this is recorded, then analyzed, and the title is output. Possibly the title could thus be derived from the conductor's movements alone.

    Important components of the program

    The basic functions of the program are explained below. They consist of:

    1. converting an audio file into a sequence of numbers that can then be computed with;
    2. the further transformation of this data by means of the short-time Fourier transform (STFT). This step is needed to determine the song's frequency content as a function of time;
    3. creating a spectrogram. This serves as an illustration and can be very helpful for understanding the process; in addition, the fingerprinting (presented afterwards) is implemented in our program on top of the spectrogram;
    4. audio fingerprinting, i.e. extracting distinctive, characteristic points from the Fourier-transformed data set; here this is done on the image of the spectrogram;
    5. creating a database, i.e. collecting the data for later comparison;
    6. the comparison, ultimately the decisive step: testing whether a currently playing song matches one from the database. The recorded song first goes through the first four steps, then the two data sets are compared with the comparison function.
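    The six steps above can be sketched end to end. The following is a deliberately toy Python 3 sketch (all names and the trivial argmax "fingerprint" are our own illustration, not the project's code): two synthetic tones serve as the "database", and a recording of one of them is recognized.

```python
import numpy as np

def read_audio(freq=440.0, n_samples=8000, rate=8000):
    """Step 1 (stand-in for wavread/getAudio): a synthetic test tone."""
    t = np.arange(n_samples) / rate
    return np.sin(2 * np.pi * freq * t), rate

def stft_magnitudes(signal, rate, frame=256, hop=128):
    """Steps 2+3: frame-wise windowed FFT magnitudes (the spectrogram data)."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def toy_fingerprint(spec):
    """Step 4 (toy version): the strongest frequency bin of each frame."""
    return set(int(b) for b in spec.argmax(axis=1))

def recognize(signal, rate, database):
    """Step 6: pick the database entry sharing the most fingerprint bins."""
    fp = toy_fingerprint(stft_magnitudes(signal, rate))
    return max(database, key=lambda title: len(fp & database[title]))

# Step 5: a tiny "database" of two precomputed fingerprints
database = {
    "tone_440": toy_fingerprint(stft_magnitudes(*read_audio(440.0))),
    "tone_880": toy_fingerprint(stft_magnitudes(*read_audio(880.0))),
}
signal, rate = read_audio(440.0)   # the "recording" to identify
print(recognize(signal, rate, database))  # → tone_440
```

    The real program replaces the argmax fingerprint with peak pairs and hashes, but the data flow is the same.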

    Reading in audio files

    "" "Reads a wav file into a numpy array" "" def wavread (filename): global NUMBER global RATE global DURATION RATE, y = (filename) # in y is the number of frames and with multiple channels the number of channels saved NUMBER = y.shape [0] iflen (y.shape) == 2: y = y.astype (float) yKonf = y.sum (axis = 1) / 2else: yKonf = y DURATION = NUMBER / RATE return np.array (yKonf, dtype = np.float) / 2 ** 15 "" "receive audio signals and write them into a numpy array" "" def getAudio (): # initialize audio input p = pyaudio.PyAudio ( ) stream = (format = pyaudio.paInt16, channels = CHANNELS, rate = RATE, input = True, frames_per_buffer = CHUNKSIZE) # Write data frames = [] # List of chunks (blocks) for i inrange (0, int (RATE / CHUNKSIZE * DAUER)): data = (CHUNKSIZE) frames.append (np.fromstring (data, dtype = np.int16)) # possibly only normal int? We'll have to estimate the data size later, which makes sense # convert list of numpy arrays to 1D array numpydata = np.hstack (frames) # close stream stream.stop_stream () stream.close () p.terminate return numpydata

    Data processing using short-time Fourier transform

    The short-time Fourier transform (STFT) is a variant of the Fourier transform that represents how the frequency spectrum of a signal, for example an audio signal, changes over time. The audio signal that has been read in is converted into a data set which can later be plotted as a spectrogram. By determining the frequencies, individual instruments and characteristic points in the song can be found. This is explained further in the fingerprinting section.

    "" "Short-Time Fourier Transformation" "" # Previous function def stft (x, fs, framesz, hop): x = np.concatenate ((x, np.zeros (1))) halfwindow = int (np.round (framesz * fs / 2)) framesamp = 2 * halfwindow + 1 hopsamp = int (np.round (hop * fs)) w = scipy.hamming (framesamp) X = scipy.array ([scipy.fft (w * x [i-halfwindow: i + halfwindow + 1]) for i inrange (halfwindow, len (x) -halfwindow, hopsamp)]) return X

    The STFT is still the biggest time sink when analyzing audio signals and creating fingerprints. In an earlier version, the program needed up to 10 minutes to analyze a single 3-minute piece. Thanks to some optimizations we were able to shorten this considerably. A new, even faster variant has yet to be implemented.

    "" "faster function, hanning instead of hamming, framesamp changed def stft (x, fs, framesz, hop): x = np.concatenate ((x, np.zeros (1))) halfwindow = int (np.round (framesz * fs / 2)) framesamp = 2 * halfwindow hopsamp = int (np.round (hop * fs)) w = scipy.hanning (framesamp) X = scipy.array ([scipy.fft (w * x [i-halfwindow : i + halfwindow]) for i in range (halfwindow, len (x) -halfwindow, hopsamp)]) print X.shape return X [:, 0: 800] "" "

    Create a spectrogram

    Spectrograms serve to depict a frequency spectrum over time. When working with audio signals, the intensity of the various frequencies at each moment can be read off a spectrogram.

    (a spectrogram can be seen in the audio fingerprinting section)

    "" "Create spectrogram" "" # "" "def spect (numpydata): window duration = 0.1 window overlap = 0.025 # in seconds A = stft (numpydata, RATE, window duration, window overlap) A = A [:, 0: 800] eps = 1e-8 # offset to be able to logarithmize r, s = A.shape return A # "" "

    Audio fingerprinting

    After a spectrogram has been created, the next goal is to create what is known as an "audio fingerprint". To do this, our spectrogram is first read in as a simple image.

    "" "Read spectrogram as image" "" def im_conversion (data): r, s = data.shape eps = 1e-8 data = np.log10 (np.absolute (data [:,: s / 2] + eps)) ) im = color.rgb2gray (data) return im

    An “audio fingerprint”, like a person's fingerprint, is a way of assigning an unambiguous identity to an audio file. For the final recognition of a piece of music, a way of identifying it by certain features is therefore essential. First, local maxima are sought in the spectrogram: points that look interesting for further processing. These lie where there are strong differences in color between neighbouring points, as can be seen in the picture; that is, points at which an abrupt jump in intensity occurs either in time (a vertical peak) or in frequency (a horizontal one). It is worth asking what causes these peaks, since they are supposed to characterize a song. So what could be characteristic of a song and produce corresponding peaks in the spectrogram? Vertical peaks are caused by short impulses with a broad frequency range, i.e. mainly by the drums; horizontal peaks are longer tones at one frequency, i.e. caused by other instruments. These characteristics are then processed further in the form of selected characteristic points.

    This was implemented in the code as follows. The peak_local_max() function is taken from the skimage package and finds local maxima in images.

    "" "find local maxima in the image" "" def fpeaks (im): peaks = peak_local_max (im, min_distance = 9) # min_distance determines the number of peaks peaks = zip (peaks [:, 0], peaks [:, 1] ) #print len ​​(peaks) return peaks

    However, given the sheer number of songs, it is not unlikely that two or more songs will share some local maxima in their spectrograms. This problem is elegantly circumvented using a so-called "hash function". A hash function takes several integer values as input and returns another integer value as output. From the frequencies of our local maxima and their time distance from one another, for example, a unique audio fingerprint can be created.
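    The hashing step can be demonstrated in isolation. A sketch (the function name and constants here are our own illustration; the project's fingerprint function combines the same three fields): two peak frequencies and their time distance are joined into one string and run through SHA-1, of which only a short prefix is kept.

```python
import hashlib

def peak_pair_hash(freq1, freq2, delta_time, cut=20):
    """Combine two peak frequencies and their time distance into a short hash."""
    key = "%s|%s|%s" % (freq1, freq2, delta_time)
    return hashlib.sha1(key.encode()).hexdigest()[:cut]

h1 = peak_pair_hash(440, 880, 17)
h2 = peak_pair_hash(440, 880, 17)   # same input → same hash
h3 = peak_pair_hash(441, 880, 17)   # any changed input → a different hash
print(h1 == h2, h1 == h3)  # → True False
```

    Determinism is what makes the comparison work: the same peak constellation in a recording and in the database produces the same hash.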

    "" "create fingerprint hashes" "" def fingerprint (peaks): # peaks tuples are already sorted according to time fingerprint_list = [] for i inrange (len (peaks)): for j inrange (1, PAIR_VAL): if (i + j = MIN_HASH_TIME_DELTA and delta_time <= MAX_HASH_TIME_DELTA): h = hashlib.sha1 ("% s |% s |% s"% (str (freq1), str (freq2), str (delta_time ))) fingerprint_list.append ((h.hexdigest () [0: FINGERPRINT_CUT], time1)) return fingerprint_list

    There are many ways to create these fingerprints. However, there is always a balancing act: more local maxima yield fingerprints that are easier to tell apart, while fewer local maxima give better suppression of ambient noise as well as higher execution speed.

    Create a database (lite)

    Our first thought was to create a small database containing information about the song, album, artist, genre and, above all, the fingerprint hashes, based on SQLite. Since we had to admit to ourselves towards the end of the semester that a database of several hundred songs is rather unnecessary for now (other parts of the program need optimizing first), the fingerprints of the currently read-in titles are instead saved to text files using pickle. Pickle is a Python module that can store objects with different kinds of content (numbers, strings, vectors, ...) in a considerably compressed form and can easily load them again. The fingerprints of each individual song go into their own text file; these can then be loaded one after the other when comparing audio signals. At the moment our database.txt file contains only the songs from the album "AM" by the Arctic Monkeys:

    01 Do I Wanna Know
    02 RU Mine _
    03 One For The Road
    04 Arabella
    05 I Want It All
    06 No 1 Party Anthem
    07 Mad Sounds
    08 Fireside
    09 Why`d You Only Call Me When You`re High
    10 Snap Out of It
    11 Knee Socks
    12 I Wanna Be Yours

    A fingerprint of a single song is a sequence of pairs containing the hash and the time: (hash, time), (hash, time), ... After going through pickle it looks something like this (only the first few lines; the complete file, about 6 MB of text, would go beyond the scope of this wiki):

    (lp1 (S'b06f456e4ed24d78d8be' cnumpy.core.multiarray scalar p2 (cnumpy dtype p3 (S'i8' I0 I1 tRp4 (I3 S'<' NNNI-1 I-1 I0 tbS'5\x00\x00\x00\x00\x00\x00\x00' tRp5 tp6 ...
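    The round trip through pickle takes only a few lines. A minimal sketch (Python 3 pickle API with binary file mode, unlike the Python 2 snippets in this article; the hash strings and file name are made up for illustration):

```python
import os
import pickle
import tempfile

# a tiny made-up fingerprint: (hash, time) pairs as described above
fingerprint_list = [("b06f456e4ed24d78d8be", 5), ("1a2b3c4d5e6f708192a3", 12)]

path = os.path.join(tempfile.gettempdir(), "toy_fingerprint.txt")
with open(path, "wb") as f:        # pickle requires binary mode in Python 3
    pickle.dump(fingerprint_list, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == fingerprint_list)  # → True
```

    One file per song, as described above, keeps each load small; only the matching step has to walk through all of them.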

    The comparison of two audio files

    "" "Comparison with fingerprint from database" "" def comp_fingprt (fpaktuell): withopen ("database.txt", "r") as db: # database is a text file with all file names to be compared with for fname in db: fname = fname.rstrip ("\ n") # with the append function a newline is added to the file name when reading in withopen ("% s.txt"% fname, "r") as f: # the newline character must deleted here fpdatenbank = pickle.load (f) "" "check if fingprt_aktuell is in fingprt_datenbank" "" matches = [] differences = [] for hash_a, time_a in fpaktuell: for hash_d, time_d in fpdatenbank: if hash_a == hash_d : matches.append ((hash_a, time_a, time_d)) tdiff = time_d - time_a differences.append (tdiff) variance = np.var (differences) iflen (matches)> 250and variance <10000000: # changing the values ​​determines the tolerance print fname # len (matches) = no. of matches, variance = index for timing errors


    In the first few weeks, the rough outline of the project was worked out and background reading was done. We also had an excerpt from the Shazam patent. A website (the fingerprint link in the sources) also helped us understand the principle. Then the work steps were defined and largely worked through one after the other, which went reasonably smoothly. Bigger problems arose with the Fourier transform, since this mathematical procedure is very time-consuming, so some optimizations had to be found. Towards the end, part of the group focused on fingerprinting while the other part focused on creating the database. This parallel work recovered some of the time lost on the STFT. As a last step, the comparison was programmed in the final week and into the lecture-free period.

    A more detailed description of the work steps is contained in the logbook.



    Even if the goals from the bonus area were obviously (and predictably) set too high, and we did not get as far as analyzing individual instruments or parts of them, in retrospect it can be said that the core goal was achieved and a very satisfying result emerged. The program works, and it is fun to watch it recognize the songs; so far there is a 100% hit rate with little background noise. There are problems with high levels of background noise, but apps like Shazam have those too! The speed of the computation could still be improved: a comparison against our 12-song database takes about 2 minutes, which is of course not very convenient. This can be reduced with improvements to the Fourier function and the comparison function, and is already being worked on. The outlook for our project is simply to keep working on it.

    Last modified: 2016/09/30 11:32 by