Contextual Emotion Detection in Textual Conversations Using Neural Networks
Nowadays, talking to conversational agents is becoming a daily routine, and it is crucial for dialogue systems to generate responses that are as human-like as possible. One of the main aspects of this is providing emotionally aware responses to users. In this article, we describe the recurrent neural network architecture for emotion detection in textual conversations that participated in SemEval-2019 Task 3 «EmoContext», part of the annual workshop on semantic evaluation (SemEval). The task objective is to classify the emotion (happy, sad, angry, or others) of the last utterance in a 3-turn conversational data set.
The rest of the article is organized as follows. Section 1 gives a brief overview of the EmoContext task and the provided data. Sections 2 and 3 focus on text pre-processing and word embeddings, respectively. Section 4 describes the architecture of the LSTM model used in our submission. In conclusion, the final performance of our system and the source code are presented. The model is implemented in Python using the Keras library.
1. Training Data
The SemEval-2019 Task 3 «EmoContext» is focused on contextual emotion detection in textual conversation. In EmoContext, given a textual user utterance along with two turns of context in a conversation, we must classify whether the emotion of the user utterance is «happy», «sad», «angry», or «others» (Table 1). There are only two conversation participants: an anonymous person (Turn-1 and Turn-3) and the AI-based chatbot Ruuh (Turn-2). For a detailed description, see (Chatterjee et al., 2019).
Table 1. Examples showing the EmoContext dataset (Chatterjee et al., 2019)
During the competition, we had access to 30160 human-labeled texts provided by the task organizers, with about 5000 samples each for the «angry», «sad», and «happy» classes and about 15000 for the «others» class (Table 2). The dev and test sets, which were also provided by the organizers, in contrast with the train set have a real-life distribution: about 4% for each emotional class and the rest for the «others» class. The data was provided by Microsoft and can be found in the official LinkedIn group.
Table 2. Emotion class label distribution in datasets (Chatterjee et al., 2019).
In addition to this data, we collected 900k English tweets in order to create a distant dataset of 300k tweets for each emotion. To form the distant dataset, we followed the strategy of Go et al. (2009), under which we simply associate tweets with the presence of emotion-related words such as «#angry», «#annoyed», «#happy», «#sad», «#surprised», etc. The list of query terms was based on the query terms of SemEval-2018 AIT DISC (Duppada et al., 2018).
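To illustrate this distant supervision step, the sketch below assigns a noisy label to a tweet based on a few of the hashtags listed above. The helper and its filtering rules are our own simplification rather than the original collection pipeline, and the full query-term list is not reproduced here.

# Hypothetical distant-labelling helper. Only a few of the hashtags mentioned
# above are listed; the full set of query terms followed SemEval-2018 AIT DISC.
HASHTAG_TO_EMOTION = {
    '#angry': 'angry', '#annoyed': 'angry',
    '#happy': 'happy',
    '#sad': 'sad',
}

def distant_label(tweet):
    # Returns (emotion, cleaned tweet), or None when the tweet carries zero
    # or several conflicting emotion hashtags.
    text = tweet.lower()
    found = {emotion for tag, emotion in HASHTAG_TO_EMOTION.items() if tag in text}
    if len(found) != 1:
        return None
    for tag in HASHTAG_TO_EMOTION:
        text = text.replace(tag, '')  # drop the query term itself (Go et al., 2009)
    return found.pop(), text.strip()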
The key performance metric of EmoContext is a micro-average F1 score for three emotion classes, i.e. «sad», «happy», and «angry».
def preprocessData(dataFilePath, mode):
    """Load conversations from a tab-separated file, tokenize each turn and,
    in "train" mode, also extract the emotion labels."""
    conversations = []
    labels = []
    with io.open(dataFilePath, encoding="utf8") as finput:
        finput.readline()  # skip the header line
        for line in finput:
            line = line.strip().split('\t')
            # columns 1-3 contain the three turns of the conversation
            for i in range(1, 4):
                line[i] = tokenize(line[i])
            if mode == "train":
                labels.append(emotion2label[line[4]])
            conv = line[1:4]
            conversations.append(conv)
    if mode == "train":
        return np.array(conversations), np.array(labels)
    else:
        return np.array(conversations)
texts_train, labels_train = preprocessData('./starterkitdata/train.txt', mode="train")
texts_dev, labels_dev = preprocessData('./starterkitdata/dev.txt', mode="train")
texts_test, labels_test = preprocessData('./starterkitdata/test.txt', mode="train")
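For reference, the key metric described above can be reproduced with scikit-learn as in the following sketch. This is our own helper, not the organizers' evaluation script; it relies on the same label encoding as emotion2label used in preprocessData.

from sklearn.metrics import f1_score

def emocontext_f1(y_true, y_pred):
    # Micro-averaged F1 over the emotional classes only (happy = 1, sad = 2,
    # angry = 3); the majority class «others» (0) is excluded from the average.
    return f1_score(y_true, y_pred, labels=[1, 2, 3], average='micro')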
2. Text Pre-Processing
Before any training stage, texts were pre-processed with the Ekphrasis text processing tool (Baziotis et al., 2017). This tool helps to perform spell correction, word normalization, and segmentation, and allows specifying which tokens should be omitted, normalized, or annotated with special tags. We used the following techniques for the pre-processing stage.
- URLs, emails, dates and times, usernames, percentages, currencies, and numbers were replaced with the corresponding tags.
- Repeated, censored, elongated, and capitalized terms were annotated with the corresponding tags.
- Elongated words were automatically corrected based on the built-in word statistics corpus.
- Hashtag and contraction unpacking (i.e. word segmentation) was performed based on the built-in word statistics corpus.
- A manually created dictionary was used to replace terms extracted from the text (mainly emoticons) with emotion tags in order to reduce their variety.
In addition, Ekphrasis provides a tokenizer which is able to identify most emojis, emoticons, and complicated expressions such as censored, emphasized, and elongated words, as well as dates, times, currencies, and acronyms.
Table 3. Text pre-processing examples.
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
import numpy as np
import re
import io
label2emotion = {0: "others", 1: "happy", 2: "sad", 3: "angry"}
emotion2label = {"others": 0, "happy": 1, "sad": 2, "angry": 3}
emoticons_additional = {
    # extra emoticons mapped to emotion tags, complementing the built-in
    # ekphrasis emoticons dictionary
    '(^・^)': '<happy>', ':‑c': '<sad>', '=‑d': '<happy>', ":'‑)": '<happy>',
    ':‑d': '<laugh>', ':‑(': '<sad>', ';‑)': '<happy>', ':‑)': '<happy>',
    ':\\/': '<sad>', 'd=<': '<annoyed>', ':‑/': '<annoyed>', ';‑]': '<happy>',
    '(^�^)': '<happy>', 'angru': 'angry', "d‑':": '<annoyed>', ":'‑(": '<sad>',
    ":‑[": '<annoyed>', '(�?�)': '<happy>', 'x‑d': '<laugh>',
}
text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens
    # corpus from which the word statistics are going to be used
    # for word segmentation
    segmenter="twitter",
    # corpus from which the word statistics are going to be used
    # for spell correction
    corrector="twitter",
    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # unpack contractions (can't -> can not)
    spell_correct_elong=True,  # spell correction for elongated words
    # select a tokenizer. You can use SocialTokenizer or pass your own;
    # the tokenizer should take a string as input and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    # list of dictionaries for replacing tokens extracted from the text
    # with other expressions. You can pass more than one dictionary.
    dicts=[emoticons, emoticons_additional]
)
def tokenize(text):
    text = " ".join(text_processor.pre_process_doc(text))
    return text
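As a quick sanity check, tokenize can be applied to a raw message. The message below is our own example, and the exact output depends on the Ekphrasis version and its downloaded word statistics, so it is only indicative.

# Repeated punctuation, the emoticon, and the hashtag are expected to be
# replaced or annotated with the corresponding tags after pre-processing.
print(tokenize("I miss u sooo much!!! :‑( #feelingsad"))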
3. Word Embeddings
Word embeddings have become an essential part of deep learning approaches to NLP. To determine the most suitable vectors for the emotion detection task, we tried the Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Joulin et al., 2017) models, as well as the DataStories pre-trained word vectors (Baziotis et al., 2017). The key concept of Word2Vec is to locate words that share common contexts in the training corpus close to each other in the vector space. Both the Word2Vec and GloVe models learn geometrical encodings of words from their co-occurrence information, but essentially the former is a predictive model and the latter is a count-based model. In other words, while Word2Vec tries to predict a target word (CBOW architecture) or a context (Skip-gram architecture), i.e. to minimize a loss function, GloVe computes word vectors by performing dimensionality reduction on the co-occurrence counts matrix. FastText is very similar to Word2Vec, except that it uses character n-grams to learn word vectors, so it is able to handle the out-of-vocabulary issue.
For all the techniques mentioned above, we used the default training parameters provided by the authors. We trained a simple LSTM model (dim = 64) on top of each of these embeddings and compared their effectiveness using cross-validation. According to the results, the DataStories pre-trained embeddings demonstrated the best average F1 score.
To enrich the selected word embeddings with the emotional polarity of words, we performed a distant pre-training phase: we fine-tuned the embeddings on the automatically labeled distant dataset. The importance of such pre-training was demonstrated in (Deriu et al., 2017). Specifically, we used the distant dataset to train a simple LSTM network to classify angry, sad, and happy tweets. The embeddings layer was frozen for the first training epoch in order to avoid significant changes in the embeddings weights, and then it was unfrozen for the next 5 epochs. After this training stage, the fine-tuned embeddings were saved for the subsequent training phases and made publicly available.
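The snippet below sketches this pre-training phase under our own assumptions: a plain unidirectional LSTM classifier, a sequence length of 24, and placeholder names (distant_embeddings_matrix, distant_x, distant_y) for data prepared from the distant dataset in the same way as the EmoContext data. It is not the exact script behind the published embeddings.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

def build_distant_model(embeddings_matrix, sequence_length, lstm_dim=64):
    # A plain LSTM classifier over the three distant classes (angry, sad, happy)
    model = Sequential()
    model.add(Embedding(embeddings_matrix.shape[0], embeddings_matrix.shape[1],
                        weights=[embeddings_matrix], input_length=sequence_length,
                        trainable=False))  # embeddings frozen for the first epoch
    model.add(LSTM(lstm_dim))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    return model

# distant_x / distant_y are assumed to be padded index sequences and one-hot
# labels built from the distant dataset; distant_embeddings_matrix is assumed
# to be built in the same way as shown for the EmoContext data below.
distant_model = build_distant_model(distant_embeddings_matrix, sequence_length=24)
distant_model.fit(distant_x, distant_y, epochs=1, batch_size=256)

# Unfreeze the embedding layer and fine-tune for the next five epochs
distant_model.layers[0].trainable = True
distant_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
distant_model.fit(distant_x, distant_y, epochs=5, batch_size=256)

# The fine-tuned vectors live in distant_model.layers[0].get_weights()[0]
# and can be written out in the plain-text format read by getEmbeddings below.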
def getEmbeddings(file):
    """Read word vectors from a text file in the standard
    `word v1 v2 ... vN` format."""
    embeddingsIndex = {}
    dim = 0
    with io.open(file, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            embeddingVector = np.asarray(values[1:], dtype='float32')
            embeddingsIndex[word] = embeddingVector
            dim = len(embeddingVector)
    return embeddingsIndex, dim

def getEmbeddingMatrix(wordIndex, embeddings, dim):
    """Build the weight matrix for the Keras Embedding layer;
    out-of-vocabulary words get a zero vector."""
    embeddingMatrix = np.zeros((len(wordIndex) + 1, dim))
    for word, i in wordIndex.items():
        embeddingMatrix[i] = embeddings.get(word, np.zeros(dim))
    return embeddingMatrix
from keras.preprocessing.text import Tokenizer
embeddings, dim = getEmbeddings('emosense.300d.txt')
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts([' '.join(list(embeddings.keys()))])
wordIndex = tokenizer.word_index
print("Found %s unique tokens." % len(wordIndex))
embeddings_matrix = getEmbeddingMatrix(wordIndex, embeddings, dim)
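The network described in the next section expects three padded integer sequences per example, one per conversation turn. The constant MAX_SEQUENCE_LENGTH and the helper below do not appear in the original snippets, so this is only a sketch of one reasonable way to prepare the model inputs.

from keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 24  # assumed value; pick it to cover the vast majority of turns

def get_sequences(texts, sequence_length=MAX_SEQUENCE_LENGTH):
    # texts has shape (n_samples, 3): one tokenized string per conversation turn.
    # Each turn is converted to word indices and padded independently.
    return [pad_sequences(tokenizer.texts_to_sequences(list(texts[:, i])),
                          maxlen=sequence_length)
            for i in range(3)]

X_train = get_sequences(texts_train)
X_dev = get_sequences(texts_dev)
X_test = get_sequences(texts_test)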
4. Neural Network Architecture
A recurrent neural network (RNN) is a family of artificial neural networks specialized in processing sequential data. In contrast with traditional feed-forward neural networks, RNNs are designed to deal with sequential data by sharing their internal weights across the sequence. For this purpose, the computation graph of an RNN includes cycles, representing the influence of previous information on the present one. Long Short-Term Memory networks (LSTMs), an extension of RNNs, were introduced in 1997 (Hochreiter and Schmidhuber, 1997). In LSTMs, recurrent cells are connected in a particular way to avoid vanishing and exploding gradient issues. Traditional LSTMs only preserve information from the past, since they process the sequence in one direction. Bidirectional LSTMs combine the output of two hidden LSTM layers moving in opposite directions, where one moves forward through time and the other moves backward, thereby capturing information from both past and future states simultaneously (Schuster and Paliwal, 1997).
Figure 1. A smaller version of the proposed architecture. The LSTM units for the first and third turns share weights.
A high-level overview of our approach is provided in Figure 1. The proposed architecture consists of the embedding unit and two bidirectional LSTM units (dim = 64). The former LSTM unit is intended to analyze the utterances of the first user (i.e. the first and the third turn of the conversation), and the latter is intended to analyze the utterance of the second user (i.e. the second turn). These two units learn not only semantic and sentiment feature representations, but also user-specific conversational features, which allows classifying emotions more accurately. In the first step, each user utterance is fed into the corresponding bidirectional LSTM unit using pre-trained word embeddings. Next, the three resulting feature maps are concatenated into a flattened feature vector and passed to a fully connected hidden layer (dim = 30), which analyzes the interactions between the obtained vectors. Finally, these features pass through the output layer with the softmax activation function to predict the final class label. To reduce overfitting, regularization layers with Gaussian noise were added after the embedding layer, and dropout layers (Srivastava et al., 2014) were added at each LSTM unit (p = 0.2) and before the hidden fully connected layer (p = 0.1).
from keras.layers import Input, Dense, Embedding, Concatenate, Activation, \
    Dropout, LSTM, Bidirectional, GlobalMaxPooling1D, GaussianNoise
from keras.models import Model

def buildModel(embeddings_matrix, sequence_length, lstm_dim, hidden_layer_dim, num_classes,
               noise=0.1, dropout_lstm=0.2, dropout=0.2):
    # One integer-sequence input per conversation turn
    turn1_input = Input(shape=(sequence_length,), dtype='int32')
    turn2_input = Input(shape=(sequence_length,), dtype='int32')
    turn3_input = Input(shape=(sequence_length,), dtype='int32')

    # Shared embedding layer initialized with the pre-trained vectors
    embedding_dim = embeddings_matrix.shape[1]
    embeddingLayer = Embedding(embeddings_matrix.shape[0],
                               embedding_dim,
                               weights=[embeddings_matrix],
                               input_length=sequence_length,
                               trainable=False)

    turn1_branch = embeddingLayer(turn1_input)
    turn2_branch = embeddingLayer(turn2_input)
    turn3_branch = embeddingLayer(turn3_input)

    # Gaussian noise on the embeddings as a regularizer
    turn1_branch = GaussianNoise(noise, input_shape=(None, sequence_length, embedding_dim))(turn1_branch)
    turn2_branch = GaussianNoise(noise, input_shape=(None, sequence_length, embedding_dim))(turn2_branch)
    turn3_branch = GaussianNoise(noise, input_shape=(None, sequence_length, embedding_dim))(turn3_branch)

    # The first and the third turn (same speaker) share one bidirectional LSTM unit
    lstm1 = Bidirectional(LSTM(lstm_dim, dropout=dropout_lstm))
    lstm2 = Bidirectional(LSTM(lstm_dim, dropout=dropout_lstm))

    turn1_branch = lstm1(turn1_branch)
    turn2_branch = lstm2(turn2_branch)
    turn3_branch = lstm1(turn3_branch)

    # Concatenate the three feature vectors and classify
    x = Concatenate(axis=-1)([turn1_branch, turn2_branch, turn3_branch])
    x = Dropout(dropout)(x)
    x = Dense(hidden_layer_dim, activation='relu')(x)
    output = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs=[turn1_input, turn2_input, turn3_input], outputs=output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    return model

model = buildModel(embeddings_matrix, MAX_SEQUENCE_LENGTH, lstm_dim=64, hidden_layer_dim=30, num_classes=4)
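With inputs prepared as in the sketches above, a training run could look as follows. The one-hot encoding via to_categorical, the number of epochs, and the batch size are our assumptions rather than the exact settings of the competition submission.

from keras.utils import to_categorical

# One-hot targets for the four classes (others, happy, sad, angry)
y_train = to_categorical(labels_train, num_classes=4)
y_dev = to_categorical(labels_dev, num_classes=4)

# Assumed training settings; the actual submission may have used different
# epoch counts, batch sizes, or early stopping.
model.fit(X_train, y_train,
          validation_data=(X_dev, y_dev),
          epochs=10, batch_size=200)

# Evaluate with the micro-averaged F1 over the emotional classes (see Section 1)
dev_pred = model.predict(X_dev).argmax(axis=-1)
print("Dev micro-F1 (emotional classes): %.4f" % emocontext_f1(labels_dev, dev_pred))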
5. Results
In the process of searching for the optimal architecture, we experimented not only with the number of cells in the layers, activation functions, and regularization parameters, but also with the architecture of the neural network itself. Detailed information about this phase can be found in the original paper.
The model described in the previous section demonstrated the best score on the dev dataset, so it was used in the final evaluation stage of the competition. On the final test dataset, it achieved a 72.59% micro-average F1 score for the emotional classes, while the maximum score among all participants was 79.59%. Nevertheless, this is well above the official baseline released by the task organizers, which was 58.68%.
The source code of the model and the word embeddings are available on GitHub.
The full version of the article and the task description paper can be found in the ACL Anthology.
The training dataset is located in the official competition group on LinkedIn.
Citation:
@inproceedings{smetanin-2019-emosense,
title = "{E}mo{S}ense at {S}em{E}val-2019 Task 3: Bidirectional {LSTM} Network for Contextual Emotion Detection in Textual Conversations",
author = "Smetanin, Sergey",
booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
year = "2019",
address = "Minneapolis, Minnesota, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/S19-2034",
pages = "210--214",
}