Acoustic model for the classification of Polish vowels
Abstract:
The study explored the performance of vowel recognition using an acoustic model built on Audio Fingerprint techniques [1]. The research compares the performance of Support Vector Machines (SVMs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and k-Nearest Neighbours (k-NN) classifiers in the recognition of isolated and within-word vowels and investigates the importance of different types of acoustic speech features in this process. Temporal, spectral, cepstral, formant, LPC
and perceptual features of speech were examined. Importance of features was tested using a random forest classifier. Vowel classification was tested at three confidence levels for feature importance: 90%, 95% and 99%. Two author databases consisting of a total of 1,200 samples from 20 speakers, recorded under household conditions, were used. The classifiers were evaluated by confusion matrix, accuracy, precision, sensitivity and F1 score. A segmentation of words into speech sounds was carried out using a tool based
on BiLSTM recurrent neural networks and the BIC criterion. Three most important features were determined: power spectral density, spectral cut-off, and Power-Normalised Cepstral Coefficients. In the isolated vowel recognition task, the SVM classifier was the most effective with a feature significance confidence level of 95% obtaining accuracy = 81%, precision = 81%, sensitivity = 81%, F1 score = 80%. In the task of recognising a vowel within a word, it was verified if the algorithm detected the presence
of vowels in the correct segment and if it recognised the correct vowel within it. The best results
were obtained by the k-NN classifier (statistical confidence level of feature importance of 99.9%). However, these results were low, correct recognition of the vowel in the word: A, E, U 20%, I, O: 7%, Y: 23%. This indicates strong influence of the neighbourhood of other speech sounds in speech on the acoustic model of vowels and their recognition.