Lesseme system, word prominence and other animals

Prominent words are the words we mark by intonation when we speak. The voice is a vital tool for conveying meaning: by emotionally coloring the most important part of an utterance, we help the listener perceive that meaning correctly. The topic does not get much attention right now and, to be honest, we came across it by accident.

Recently we dived into speech synthesis and, while doing the research, became fascinated by it. We read and coded a lot of related material. Among other things, we used the English speech dataset from the Blizzard Challenge 2013, the speech synthesis challenge of the same name. It consists of audiobooks narrated by a professional female broadcaster.

Part of the work with this dataset relies on a so-called Lessac annotation. It looks like this:

212iHfN LcT DcP _ D 63iHfN ^ N yN42iLfN LcT @ , || _ DH N42iLfN _ M 63iHfN NcG @ _ H 12iLfN _ L N12iHfN KcS TcD _ 62iHfN FcT ^ T Rx1iLfN R _ DH N41iLfN _ H 32iHfN R ^ S N41iLfU ZcT , || _ W N42iLfN ZcG _ DG N43iHfN ScT TcD @ _ 62iLfN ZcG _ DG N32iHfN NcT ^ TocU LV1iLfN _ 61iLfN ZcD _ 512iHfN cW ^ Rx1iLfN R _ M 62iHfN ScT ^ T Rx1iLfU R , || _ S 211iLfN _ W Y1iLfN _ W Rx2iLfN R _ W N33iHfN LcD @ _ 32iHfW FcT . ||

This is actually the transcript of the sentence below:

Old Daniel @ , # the man @ who looked after the horses , # was just @ as gentle as our master , # so we were well @ off . #

Kinda confusing? Yes, we get it.

Worry not. There is a comprehensive article that explains what these peculiar symbols mean. To sum up: these characters are called Lessemes, symbolic representations that carry segmental information. Each Lesseme can be broken down into a set of more specific phonetic symbols that extend the traditional phonetic system with coarticulation and suprasegmental information. For General American English, the present Lesseme specification contains more than 1,500 different Lessemes, whereas other symbol sets generally contain about 50. The finer distinction allows a more specific representation of the system, resulting in a clearer understanding.

The other symbols we see in the text are:

  • @ — operative/stress/prominent word (in this markup, it is usually selected based on the fundamental frequency of the tone)
  • # — prosody break, prominent intonation pause
  • | — minor prosody break, an almost uninterrupted pronunciation of different parts of the sentence.
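To make the markup concrete, here is a minimal sketch that reads a marked-up sentence and pulls out the prominent words and the breaks. The function name `parse_markup` and the rule that `@` applies to the word immediately before it are our assumptions, based on the example sentence above, not part of the dataset tooling.

```python
def parse_markup(sentence):
    """Split a marked-up sentence into words, prominent words and breaks."""
    words, prominent, breaks = [], [], []
    for token in sentence.split():
        if token == "@":
            # Assumption: "@" marks the word immediately before it as prominent.
            if words:
                prominent.append(words[-1])
        elif token in ("#", "|"):
            # Store the break symbol and the index of the word it follows.
            breaks.append((token, len(words) - 1))
        elif token.strip(",.") == "":
            # Skip stand-alone punctuation tokens such as "," and ".".
            continue
        else:
            words.append(token)
    return words, prominent, breaks


markup = ("Old Daniel @ , # the man @ who looked after the horses , # "
          "was just @ as gentle as our master , # so we were well @ off . #")
words, prominent, breaks = parse_markup(markup)
print(prominent)  # ['Daniel', 'man', 'just', 'well']
```

The break list keeps word indices rather than times, so it can later be lined up with any word-level alignment.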

The sentence is marked up manually or, rather, aurally, mainly judging by the pitch of the reader’s voice. We were especially interested in the prominent word markup, not least because this approach seemed to produce some errors.

Prominent word and how to find it?

As we have already mentioned, word prominence is a feature that helps us distinguish shades of meaning and ensures a clear understanding of what exactly is said. However, there are many views on how to define this notion.

Terken defines prominence as “words or syllables that are perceived as standing out from their environment”, while for Streefkerk it reflects the salience of the language unit that distinguishes its meaning.

One way to detect prominence is to calculate a prominence score based on acoustic metrics. The score is calculated for a nucleus of speech. Usually this is a small fragment containing a vowel, but we chose to use longer sections (more about that later). These are the features that help identify which word carries the prominence:

  1. Nucleus Duration — the duration of the segment (the nucleus) for which calculation is conducted
  2. Pitch Patterns (frequency features) — it can be any pitch characteristic in the nucleus, like max, min or median value.
  3. Spectral Intensity — spectrum power for different bands. Prior research has shown that energy in the 500–2000 Hz band has a maximum correlation with prominence.
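As a rough illustration of the duration and energy features, the sketch below computes them for one segment with a plain FFT. The helper names `band_energy` and `nucleus_features` are hypothetical, the input is a synthetic signal rather than dataset audio, and pitch features are left out because they need a pitch tracker.

```python
import numpy as np


def band_energy(segment, sr, f_lo=500.0, f_hi=2000.0):
    """Sum of squared FFT magnitudes of `segment` between f_lo and f_hi (Hz)."""
    spectrum = np.fft.rfft(segment)
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(np.sum(np.abs(spectrum[in_band]) ** 2))


def nucleus_features(segment, sr):
    """Duration, 500-2000 Hz energy and overall energy for one segment."""
    return {
        "duration_s": len(segment) / sr,
        "energy_500_2000": band_energy(segment, sr),
        "overall_energy": band_energy(segment, sr, 0.0, sr / 2),
    }


# Synthetic segment: a 1000 Hz tone (inside the band) plus a weaker
# 4000 Hz tone (outside the band), 100 ms long.
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
segment = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 4000 * t)
features = nucleus_features(segment, sr)
print(features)
```

Comparing `energy_500_2000` with `overall_energy` shows how much of the segment's energy falls into the band that prior research links to prominence.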

Researchers from the Institute of Bologna summed it up in an empirical formula.

The Prom function calculates the value of the prominence parameter for each syllable nucleus. en500–2000 is the energy in the 500–2000 Hz frequency band, dur is the nucleus duration, enov is the overall energy in the nucleus, and evamp is the TILT event amplitude (if an event is present in the nucleus, zero otherwise), all referring to a generic syllable nucleus i. The Prom function is built to express, mathematically, the fact that a prominent syllable is usually stressed, pitch accented, or both.

We took this formula as the foundation for our experiment and calculated the score for the Blizzard dataset. However, we slightly modified the approach.

Our formula:

The Prominence function. overall_energy is the overall energy in the nucleus, ev_amplitude is the event’s amplitude, dur is the duration of the nucleus.

Where we calculate the amplitude this way:
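Read off the code listing further down, the two formulas correspond to the following. This is a reconstruction from that code, not a copy of the original figures; the symbols p_start, p_end and p_max are our shorthand for the code variables pitch_at_start, pitch_at_end and max_pitch, and dur_i is counted in 10 ms frames.

```latex
\mathrm{Prominence}_i
  = \frac{\mathrm{overall\_energy}_i \cdot \mathrm{ev\_amplitude}_i}{\mathrm{dur}_i},
\qquad
\mathrm{ev\_amplitude}_i
  = \frac{\lvert p_{\mathrm{start}} - p_{\mathrm{end}} \rvert}
         {\lvert 2\,p_{\max} - p_{\mathrm{start}} - p_{\mathrm{end}} \rvert}
```

Unlike the Prom function above, this score has no separate 500–2000 Hz term: it keeps only the product of overall energy and event amplitude and divides it by the nucleus duration.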

Read more about pitch-tracking and how to estimate it in our most popular article on Medium.

The algorithm we followed:

  1. We found the pitch using YAAPT (Yet Another Algorithm for Pitch Tracking).
  2. We identified the runs where the pitch had non-zero values and treated each run as a nucleus.
  3. We calculated the score for each nucleus, normalizing it by the nucleus duration.
  4. We found the nucleus (or nuclei) with the maximum values. The greater the value, the more likely the nucleus is prominent. For the examples, we listened to the selected segments, but the aeneas toolbox can be used for automatic text-speech alignment.
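Step 2 is easy to check in isolation. The sketch below finds the voiced runs in a toy pitch track; the array values are made up for illustration, `nonzero_runs` is our own name for the helper, and the 10 ms frame step is assumed to match the pitch tracker settings used in the listing that follows.

```python
import numpy as np


def nonzero_runs(values):
    """Return [start, end) index pairs for each run of non-zero values."""
    # Pad with zeros so runs touching either edge still get both boundaries.
    flags = np.concatenate(([0], (np.asarray(values) != 0).astype(np.int8), [0]))
    # A run starts or ends wherever consecutive flags differ.
    edges = np.flatnonzero(np.diff(flags) != 0)
    return edges.reshape(-1, 2)


# Toy pitch track in Hz, one value per 10 ms frame; zeros are unvoiced frames.
toy_pitch = np.array([0, 0, 180, 190, 0, 210, 205, 200, 0])
runs = nonzero_runs(toy_pitch)
for start, end in runs:
    print(f"voiced run: frames {start}..{end - 1}, {start * 10}-{end * 10} ms")
```

Note that the end index is exclusive: it points at the first unvoiced frame after the run, which matters when indexing the pitch array at the run boundaries.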

Here is the code and below it are three examples from the Blizzard dataset, where you can see the results.

import librosa
import matplotlib.pyplot as plt
import numpy as np
import amfm_decompy.basic_tools as basic
import amfm_decompy.pYAAPT as pYAAPT


def calculate_mel(y, sr):
    # Despite the name, this returns a normalized log-magnitude spectrogram
    # on a linear frequency axis, shaped (frames, frequency bins).
    max_db = 100
    ref_db = 20
    frame_shift = 0.0125  # seconds
    frame_length = 0.05  # seconds
    hop_length = int(sr * frame_shift)  # samples
    win_length = int(sr * frame_length)  # samples
    # Pre-emphasis filter
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Windowed Fourier transform
    linear = librosa.stft(y=y, hop_length=hop_length, win_length=win_length)
    # Amplitude spectrum in dB
    mag = np.abs(linear)
    mag = 20 * np.log10(np.maximum(1e-5, mag))
    # Normalize
    mag = np.clip((mag - ref_db + max_db) / max_db, 1e-8, 1)
    # Transpose and convert to the necessary type
    mag = mag.T.astype(np.float32)
    return mag


def find_nonzero_runs(a):
    # Create an array that is 1 where a is nonzero, and pad each end with an extra 0.
    isnonzero = np.concatenate(([0], (np.asarray(a) != 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(isnonzero))
    # Runs start and end where absdiff is 1; each row is [start, end) in frames.
    nzranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return nzranges


signalpath = "/wav/CA-BB-01-42.wav"
y, sr = librosa.load(signalpath)
# Compute the YAAPT pitch track for the whole sentence (one value per 10 ms frame)
signal = basic.SignalObj(signalpath)
pitch = pYAAPT.yaapt(signal, frame_length=20, f0_min=75, f0_max=400).samp_values
# Find the nuclei: runs of voiced (non-zero pitch) frames
nucleus = find_nonzero_runs(pitch)
# Calculate spectrogram magnitudes
mag = calculate_mel(y, sr)
# Compute the score for each nucleus
scores = []
for nucl in nucleus:
    # Keep the end index inside the pitch array
    if nucl[1] >= pitch.size:
        nucl[1] = nucl[1] - 1

    pitch_at_start = pitch[nucl[0]]
    # nucl[1] is the exclusive end of the run, so this frame is usually unvoiced (pitch 0)
    pitch_at_end = pitch[nucl[1]]
    max_pitch = np.max(pitch[nucl[0]:nucl[1]])
    nucleus_times_st = nucl[0] * 0.01  # seconds
    nucleus_times_end = nucl[1] * 0.01  # seconds
    nucl_dur = nucleus_times_end - nucleus_times_st

    # Frame indices assume a 10 ms step, as in the pitch track
    # (calculate_mel itself uses a 12.5 ms frame shift).
    mel_frame_st = int(nucleus_times_st // 0.01)
    mel_frame_end = int(nucleus_times_end // 0.01)

    overall_energy = mag[mel_frame_st:mel_frame_end].sum()

    ev_amplitude = abs(pitch_at_start - pitch_at_end) / \
        abs(2 * max_pitch - pitch_at_start - pitch_at_end)
    # Normalize by the nucleus duration expressed in 10 ms frames
    prominence = overall_energy * ev_amplitude / (nucl_dur / 0.01)
    scores.append(prominence)

# Visualization
plt.plot(scores)
plt.xticks(np.arange(len(scores)))
plt.show()

best = nucleus[np.argmax(scores)]
second_best = nucleus[np.argsort(scores)[-2]]
print('Maximum scores are in the interval from ', best[0] * 10, 'to ',
      best[1] * 10, 'milliseconds')
print('Second Maximum scores are in the interval from ', second_best[0] * 10, 'to ',
      second_best[1] * 10, 'milliseconds')
# Per nucleus: index, integer score, start and end in seconds
for i, (score, run) in enumerate(zip(scores, nucleus)):
    print(i, int(score), run[0] / 100, run[1] / 100)
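The intervals printed by the script still have to be matched to words. We did this by ear, but given a word-level alignment the matching can be sketched as below. The function `word_for_interval` is hypothetical, and the alignment times are invented for illustration; they are not measured from the dataset, and in practice they could come from a forced aligner such as aeneas.

```python
def word_for_interval(start_s, end_s, word_intervals):
    """Return the word whose time span overlaps [start_s, end_s] the most."""
    best_word, best_overlap = None, 0.0
    for word, w_start, w_end in word_intervals:
        # Length of the intersection; zero or negative means no overlap.
        overlap = min(end_s, w_end) - max(start_s, w_start)
        if overlap > best_overlap:
            best_word, best_overlap = word, overlap
    return best_word


# Hypothetical alignment (word, start, end) in seconds.
alignment = [
    ("Leave", 0.10, 0.45),
    ("him", 0.45, 0.60),
    ("to", 0.60, 0.70),
    ("settle", 0.70, 1.10),
    ("that", 1.10, 1.40),
]
print(word_for_interval(0.12, 0.40, alignment))  # Leave
print(word_for_interval(0.58, 0.95, alignment))  # settle
```

Because our nuclei are longer than a single vowel, one nucleus can span several words; picking the word with the largest overlap is one simple way to resolve that, not the only one.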

Results

Audio 1, CA-BB-01-16 (the actual file name in the dataset)

Manual markup: Sometimes we had rather @ rough play , | for they would frequently @ bite and kick as well as gallop . #

Automatic markup: We found 14 nuclei. The prominent nuclei are between 8 and 11, with the peak at 9.

Table 1. The results for Audio 1. For each nucleus, the score was calculated, its duration and location in the sentence were found, and it was manually matched with the corresponding words.

Diagram 1. Intonational dynamics for Audio 1. On the X-axis (above) are the nuclei, on the Y-axis (on the left) are the calculated scores.

Audio 2, CA-BB-01-42

Manual markup: Old Daniel @ , # the man @ who looked after the horses , # was just @ as gentle as our master , # so we were well @ off . #

Automatic markup: We found 12 nuclei. The prominent nuclei are between 1 and 2, with the peak at 2; and 5 and 8, with the peak at 5.

Table 2. The results for Audio 2. For each nucleus, the score was calculated, its duration and location in the sentence were found, and it was manually matched with the corresponding words.

Diagram 2. Intonational dynamics for Audio 2. On the X-axis (above) are the nuclei, on the Y-axis (on the left) are the calculated scores.

Audio 3, CA-MP2-05-08

Manual markup: Leave @ him to settle that.

Automatic markup: We found 5 nuclei. The prominent nuclei are between 1 and 3, with the peak at 2.

Table 3. The results for Audio 3. For each nucleus, the score was calculated, its duration and location in the sentence were found, and it was manually matched with the corresponding words.

Diagram 3. Intonational dynamics for Audio 3. On the X-axis (above) are the nuclei, on the Y-axis (on the left) are the calculated scores.

Discussion

In this article we discussed the notion of word prominence and why it is important.

While using the Blizzard 2013 dataset, we realized that the automatic markup differs from its manual counterpart, and we wanted to point out this difference. You can try our approach on other examples yourself. Our objective was simple: to share the results of this small experiment. We really hope it will be of interest to those who want to research the topic further.


 

***

Authors: Eva Kazimirova, Research Scientist at Neurodata Lab, Elizaveta Zaitseva, SMM Specialist at Neurodata Lab.


You are welcome to comment on this article in our blog on Medium.