Buckeye Corpus

Buckeye Corpus Creation

Why create the corpus?

Speech corpora like the Buckeye Corpus are used both for pure research and for applied research and product development. As a resource for pure research, the corpus provides one of the richest available sources of data on pronunciation variation in conversational English. The 40 hours of hand-transcribed speech in this corpus will therefore be particularly valuable for psycholinguists who study auditory word recognition, phonologists who study rules of pronunciation variation, sociolinguists who study age- and gender-related conditioning of pronunciation variation, and engineers who study the effects of pronunciation variation on automatic speech recognition. Additionally, phoneticians interested in gradient gestural overlap and hiding will find the corpus valuable to the extent that acoustic studies may reveal the phenomena of interest. We have also begun using the corpus as a source of stimuli in speech perception and word recognition studies.

On the applied side, because the speech has been phonetically labeled by hand and the acoustic signal is clean, we expect that this corpus may be of value in training acoustic models for speech recognition systems. Additionally, the range of phonetic realizations for each word should provide some input to studies of lexicon training for handling pronunciation variation. The database is too small to be of much use for grammar training. It is nonetheless an interesting testbed for any grammar that is supposed to handle real conversational speech, because people don't talk in the grammatical sentences we use when we write.

