Addressing Emotion Bias in Music Emotion Recognition and Generation with Frechet Audio Distance

Yuanchao Li1, Azalea Gui2, Dimitra Emmanouilidou3, Hannes Gamper3

1University of Edinburgh, 2University of Toronto, 3Microsoft Research

Baseline Models

EmoGen

EmoGen uses four pre-extracted embeddings to represent the emotions of the four quadrants. It encodes the discrete quadrant index (e.g., 1 or 2) into an embedding as the emotion input, so all samples within a given quadrant share the same emotion condition (i.e., the emotion embeddings are identical for every sample in a quadrant, regardless of the samples' original labels). As a result, while the generated music exhibits precise emotions, we argue that it lacks variation: all samples with the same emotion condition tend to sound emotionally similar.
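The quadrant-based conditioning described above can be sketched as a simple lookup table. This is a minimal illustration, not EmoGen's actual implementation; the embedding dimension and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the four pre-extracted quadrant embeddings
# (dimension 4 chosen arbitrarily for the sketch).
QUADRANT_EMBEDDINGS = {q: rng.standard_normal(4) for q in (1, 2, 3, 4)}

def emotion_condition(quadrant: int) -> np.ndarray:
    """Return the shared emotion embedding for a quadrant label (1-4)."""
    return QUADRANT_EMBEDDINGS[quadrant]

# Two samples labelled Q1 receive identical conditions, regardless of their
# original continuous VA labels -- the source of the low-variation issue.
a = emotion_condition(1)
b = emotion_condition(1)
assert np.array_equal(a, b)
```

Because the condition depends only on the quadrant index, all within-quadrant label information is discarded at conditioning time.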

MidiEmo

MidiEmo encodes continuous VA labels (e.g., [0.8, 0.8]) as latent embeddings using linear layers, which are then concatenated with the music embeddings. We argue that this approach makes music samples with boundary labels more susceptible to subjective bias, as boundary emotions are usually difficult to distinguish. The difficulty of assigning boundary labels with high confidence often leads to ambiguity in the generated emotion. Consequently, while there is considerable variation, ambiguity is prevalent in the generated music, making precise emotions difficult to perceive.
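The continuous conditioning described above can be sketched as a linear projection of the VA pair, concatenated with the music embedding. Again, this is a hedged illustration: the dimensions, weights, and function names are assumptions, not MidiEmo's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear layer mapping a [valence, arousal] pair to a
# 4-dimensional emotion embedding (dimensions chosen for the sketch).
W = rng.standard_normal((4, 2))  # weights: emotion_dim x 2
b = rng.standard_normal(4)       # bias

def condition(va, music_emb):
    """Project VA to a latent embedding and concatenate with the music embedding."""
    emotion_emb = W @ np.asarray(va, dtype=float) + b
    return np.concatenate([music_emb, emotion_emb])

music = rng.standard_normal(8)

# Nearby boundary labels from adjacent quadrants yield nearly identical
# conditions, which is where the ambiguity argued above comes from.
c1 = condition([+0.1, 0.1], music)  # Q1, near the valence boundary
c2 = condition([-0.1, 0.1], music)  # Q2, near the same boundary
```

Since the projection is continuous, a small perturbation across a quadrant boundary barely changes the condition, even though the intended emotion category flips.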

Our Proposed Model

Our proposed model enhances emotion conditioning by combining the strengths of EmoGen and MidiEmo, enabling a trade-off between prominence and variation in the generated emotion.
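One way to picture such a combination is to add a fixed quadrant embedding (prominence, as in EmoGen) to a continuous projection of the VA pair (variation, as in MidiEmo). This is purely an illustrative sketch of the idea, not the paper's actual architecture; every name and dimension here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical components: a discrete per-quadrant embedding plus a linear
# projection of the continuous VA label (dimension 4 for illustration).
QUAD_EMB = {q: rng.standard_normal(4) for q in (1, 2, 3, 4)}
W = rng.standard_normal((4, 2))

def quadrant(va):
    """Map a [valence, arousal] pair to its Russell quadrant (1-4)."""
    v, a = va
    if v >= 0 and a >= 0:
        return 1
    if v < 0 and a >= 0:
        return 2
    if v < 0:
        return 3
    return 4

def combined_condition(va):
    """Discrete anchor (prominence) plus continuous offset (variation)."""
    return QUAD_EMB[quadrant(va)] + W @ np.asarray(va, dtype=float)

# Same quadrant, different VA labels -> related but non-identical conditions.
c1 = combined_condition([0.8, 0.8])
c2 = combined_condition([0.4, 0.6])
```

Under this sketch, samples in the same quadrant share a common anchor (keeping the emotion prominent) while their continuous labels still perturb the condition (keeping variation).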


Fig 1. Our proposed model (above) and its comparison to baseline models in terms of emotion conditioning (below).


Fig 2. Russell's four quadrants (4Q).

Music Samples

The emotion conditions for EmoGen are discrete scores, i.e., [1, 2, 3, 4], denoting Russell's four quadrants. Therefore, VA values do not apply.
The emotion conditions for MidiEmo and our proposed model are VA scores ranging from -1.0 to 1.0.
low VA denotes that both valence and arousal scores are assigned from [±0.1, ±0.2, ±0.3].
mid VA denotes that both valence and arousal scores are assigned from [±0.4, ±0.5, ±0.6, ±0.7].
high VA denotes that both valence and arousal scores are assigned from [±0.8, ±0.9, ±1.0].

For example, low VA: [+0.1, +0.2]; mid VA: [+0.4, -0.4]; high VA: [-0.8, +0.9].
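The binning above can be expressed as a small helper that classifies a VA component by its magnitude. This is a convenience sketch reflecting the ranges listed here, not code from the paper.

```python
def va_band(value: float) -> str:
    """Classify one VA component as low / mid / high by its magnitude.

    Bands follow the ranges above: low = 0.1-0.3, mid = 0.4-0.7, high = 0.8-1.0.
    """
    m = abs(value)
    if m <= 0.3:
        return "low"
    if m <= 0.7:
        return "mid"
    return "high"

# The worked examples above: [+0.1, +0.2] -> low; [+0.4, -0.4] -> mid;
# [-0.8, +0.9] -> high.
assert [va_band(x) for x in (0.1, -0.4, 0.9)] == ["low", "mid", "high"]
```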

Boundary labels usually come from the low VA and high VA ranges, making their emotions sound like those of other quadrants. For example, a music sample labelled with a Q3 emotion sounds like a Q4 emotion when its label is close to the y-axis.

The music samples generated by EmoGen are piano-only. However, this does not affect the FAD scores (EmoGen's high FAD scores do not arise from this), because the instruments in the two compared emotional music sets are aligned, eliminating the influence of instrumentation. This is verified by the FAD scores of EMOPIA (real music), which is also piano-only.
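For reference, the Fréchet Audio Distance between two embedding sets models each set as a multivariate Gaussian and computes ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)). The sketch below is a minimal NumPy-only version; real FAD pipelines extract the embeddings with a pretrained audio model first.

```python
import numpy as np

def _sqrtm_psd(a: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(x: np.ndarray, y: np.ndarray) -> float:
    """FAD between two embedding sets (rows = samples, columns = dimensions)."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    # Tr((S1 S2)^(1/2)) via the symmetric form (S1^(1/2) S2 S1^(1/2))^(1/2).
    rs1 = _sqrtm_psd(s1)
    tr_covmean = np.trace(_sqrtm_psd(rs1 @ s2 @ rs1))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_covmean)

rng = np.random.default_rng(0)
x = rng.standard_normal((300, 4))
# A set compared against itself scores (numerically) zero.
assert abs(frechet_audio_distance(x, x)) < 1e-6
```

A lower FAD indicates that the generated set's embedding distribution is closer to the reference set's; aligning instrumentation between the two sets, as noted above, keeps the score focused on emotional content.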


| Q1 | EmoGen | MidiEmo | Ours |
| --- | --- | --- | --- |
| sample 1 (low VA) | (audio) | (audio) | (audio) |
| sample 2 (mid VA) | (audio) | (audio) | (audio) |
| sample 3 (high VA) | (audio) | (audio) | (audio) |

| Q2 | EmoGen | MidiEmo | Ours |
| --- | --- | --- | --- |
| sample 1 (low VA) | (audio) | (audio) | (audio) |
| sample 2 (mid VA) | (audio) | (audio) | (audio) |
| sample 3 (high VA) | (audio) | (audio) | (audio) |

| Q3 | EmoGen | MidiEmo | Ours |
| --- | --- | --- | --- |
| sample 1 (low VA) | (audio) | (audio) | (audio) |
| sample 2 (mid VA) | (audio) | (audio) | (audio) |
| sample 3 (high VA) | (audio) | (audio) | (audio) |

| Q4 | EmoGen | MidiEmo | Ours |
| --- | --- | --- | --- |
| sample 1 (low VA) | (audio) | (audio) | (audio) |
| sample 2 (mid VA) | (audio) | (audio) | (audio) |
| sample 3 (high VA) | (audio) | (audio) | (audio) |