Thesis/dissertation etd-0801120-064208 details
Title page for etd-0801120-064208
Applying a multimodal emotion recognition chatbot to a live streaming
Year, semester
Number of pages
Advisory Committee
Date of Exam
Date of Submission
multimodal fusion, chatbot, live streaming, dialogue generation, emotion recognition
The thesis/dissertation has been browsed 5934 times, has been downloaded 2 times.
In an era of rapid growth in the Internet and technology, careers have emerged that were never popular before. For example, the live streaming industry has grown with the rise of live streaming platforms such as Twitch, and streamers can earn money by uploading videos daily. These careers are distinctive to the current era.
Two components, topic recognition and emotion recognition, are added to dialogue generation because traditional chatbots give poor responses that are simple and independent of content. The proposed approach makes the responses more relevant to the live stream context and improves their accuracy, diversity, and richness. Multimodal emotion recognition is preferable to unimodal emotion recognition because it is more stable: when one modality suffers interference, the others can still maintain a certain recognition accuracy. In this study, the chatbot is applied to a live stream system, a complex environment in which audio, video, and text are sent at the same time. The chatbot therefore needs to recognize emotion from multiple sources.
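The stability argument can be illustrated with a minimal sketch. It is not the fusion model used in the thesis; it assumes a simple late-fusion scheme that averages per-modality probability vectors and skips any modality that is unavailable. The label set, the function name `fuse_emotions`, and all scores are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical label set, chosen only for illustration.
EMOTIONS = ["happy", "sad", "angry", "neutral"]


def fuse_emotions(per_modality: Dict[str, Optional[List[float]]]) -> str:
    """Average the probability vectors of the available modalities
    and return the label with the highest fused score.

    A modality mapped to None is treated as lost to interference
    and is left out of the average.
    """
    available = [p for p in per_modality.values() if p is not None]
    if not available:
        raise ValueError("no modality produced a prediction")
    fused = [
        sum(p[i] for p in available) / len(available)
        for i in range(len(EMOTIONS))
    ]
    return EMOTIONS[fused.index(max(fused))]


# Made-up scores for one utterance in a stream.
all_modes = {
    "audio": [0.6, 0.1, 0.2, 0.1],
    "video": [0.5, 0.2, 0.1, 0.2],
    "text": [0.5, 0.1, 0.1, 0.3],
}
# The same utterance with the audio channel unavailable.
audio_lost = {**all_modes, "audio": None}

print(fuse_emotions(all_modes))
print(fuse_emotions(audio_lost))
```

In this toy case the fused label is the same with and without audio, because the remaining modalities still carry the signal. A unimodal audio recognizer would have produced no prediction at all.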
The chatbot proposed in this study can respond to both the live stream host and the chat room. It can play the role of either party, responding to the chat room when there are too many viewers and interacting with the live stream host as a viewer when there are too few. We expect the chatbot to be of real help to both streamers and audiences on live streaming platforms.
The results of our experiments show that multimodal emotion recognition performs better than unimodal emotion recognition in live stream environments. Moreover, dialogue generation that refers to topic, emotion, or both as context information helps the chatbot system generate better responses. Combining topic and emotion produces the best results.
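Table of Contents
Thesis Approval Form i
Chinese Abstract ii
Abstract iii
Table of Contents iv
List of Figures vi
List of Tables vii
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 3
1.3 Research Method and Process 4
Chapter 2 Literature Review 6
2.1 Live Streaming Platforms 6
2.1.1 Overview of Live Streaming Platforms 6
2.1.2 Twitch 8
2.2 Emotion Recognition 10
2.2.1 Unimodal Emotion Recognition 11
2.2.2 Multimodal Emotion Recognition 12
2.3 Dialogue Generation 15
2.3.1 Topic and Emotion Dialogue Generation 16
2.3.2 Chatbots 17
Chapter 3 Research Methods and Steps 19
3.1 Emotion Recognition Model 19
3.1.1 Model Architecture 19
3.1.2 Unimodal Feature Extraction 21
3.1.3 Context Modeling 23
3.1.4 Multimodal Fusion 25
3.1.5 Model Tuning 29
3.2 Dialogue Generation Model 30
3.2.1 Problem Statement 30
3.2.2 Conditional Wasserstein Auto-Encoder for Dialogue Modeling 31
3.3 Topic-Based Emotional Dialogue Generation 34
Chapter 4 Experimental Setup 38
4.1 Datasets 38
4.2 Metrics 40
Chapter 5 Experimental Results 45
5.1 Multimodal Emotion Recognition 45
5.2 Dialogue Generation 53
5.3 Human Evaluation 56
5.3.1 Questionnaire Testing 56
5.3.2 Data Analysis 57
5.3.3 Conclusion 61
5.4 Experimental Conclusions 63
Chapter 6 Conclusion 65
6.1 Conclusion 65
6.2 Recommendations and Limitations 67
6.3 Future Work 68
References 69
Appendix 76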
Fulltext
Thesis access permission: user-defined release date
Available:
Campus: publicly available
Off-campus: publicly available

Printed copies
Available: publicly available
