論文使用權限 Thesis access permission: 自定論文開放時間 user-defined
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title: 應用多模態情緒辨識聊天機器人於直播系統之研究 Applying a multimodal emotion recognition chatbot to a live streaming system
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 101
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2020-07-30
繳交日期 Date of Submission: 2020-09-01
關鍵字 Keywords: 多模態融合、情緒辨識、對話生成、直播系統、聊天機器人 multimodal fusion, emotion recognition, dialogue generation, live streaming, chatbot
統計 Statistics: 本論文已被瀏覽 6007 次,被下載 2 次 The thesis/dissertation has been browsed 6,007 times and downloaded 2 times.
中文摘要 Chinese Abstract
Today, both the Internet and technology are advancing at a remarkable pace, and in this environment many occupations that did not exist before have become popular: the live streaming industry that grew with the rise of Twitch, YouTubers who earn money by uploading videos of their daily lives, and companion players who emerged with the boom in competitive gaming. These are occupations unique to this era. Chatbots, which in the past were mainly used to answer single questions, are now also applied in live streaming systems. However, after actually using the chatbots on live streaming platforms, we found that their responses are monotonous and formulaic and often contain utterances unrelated to the current stream. This study therefore proposes a chatbot with a topic-based emotional dialogue generation architecture.

Multimodal emotion recognition is used because it is more stable than any single modality: when one modality suffers interference, the remaining modalities can maintain a certain level of recognition accuracy. Moreover, since the chatbot in this study is applied to a live streaming system, it requires an emotion recognition model that can handle multiple input sources. To improve the chatbot's monotonous or content-irrelevant replies, two factors, topic and emotion, are added to dialogue generation so that the generated responses better match the content and atmosphere of the stream, while also improving the accuracy, diversity, and richness of the chatbot's replies.

Because the proposed chatbot can generate responses to both the streamer and the chat room, it can play either role in practice: when there are too many viewers it takes the streamer's role and replies to the chat room, and when there are too few viewers it acts as an audience member, generating messages in the chat room to interact with the streamer. We hope that such a chatbot can genuinely help streamers and viewers on live streaming platforms.

The final experimental results show that multimodal emotion recognition achieves better accuracy than any single modality. In addition, dialogue generation conditioned on topic, emotion, or both outperforms generation that uses no context information, and using topic and emotion together yields the best results.
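As a rough illustration of the fusion idea summarized above, the following sketch (in PyTorch) projects text, audio, and video feature vectors into a shared space, concatenates them, and classifies the result into discrete emotions. It is a minimal late-fusion example under assumed feature dimensions and an assumed emotion set, not the model actually proposed in the thesis.

```python
# A minimal late-fusion sketch with placeholder feature dimensions and a
# generic emotion set; it is not the thesis architecture, only an illustration
# of why fused modalities are more robust than any single one.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "joy", "anger", "sadness", "surprise", "fear", "disgust"]

class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, text_dim=300, audio_dim=1582, video_dim=512, hidden=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.classifier = nn.Linear(3 * hidden, len(EMOTIONS))

    def forward(self, text, audio, video):
        # If one modality is corrupted (e.g. noisy audio), a zero tensor can be
        # substituted for it and the other two still drive the prediction,
        # which is the robustness argument made in the abstract.
        fused = torch.cat(
            [torch.relu(self.text_proj(text)),
             torch.relu(self.audio_proj(audio)),
             torch.relu(self.video_proj(video))],
            dim=-1)
        return self.classifier(fused)

# Forward pass on random features for a batch of two utterances.
model = LateFusionEmotionClassifier()
logits = model(torch.randn(2, 300), torch.randn(2, 1582), torch.randn(2, 512))
print([EMOTIONS[i] for i in logits.argmax(dim=-1).tolist()])
```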
Abstract
In an era of rapid growth in the Internet and technology, many careers that were never seen before have become popular. For example, the live streaming industry has risen with platforms such as Twitch, and streamers can earn money simply by uploading videos of their daily lives. These are distinctive careers of the current era. Traditional chatbots give poor responses that are simple and unrelated to the content, so two components, topic and emotion, have been added to dialogue generation. The proposed approach makes responses more relevant to the context of a live stream and improves their accuracy, diversity, and richness. Multimodal emotion recognition is preferred over unimodal recognition because it provides stability: when one modality suffers interference, the others can maintain a certain recognition accuracy. In this study, the chatbot is applied to a live streaming system, a complex environment that carries audio, video, and text at the same time, so the model must be able to recognize emotion from multiple sources. The proposed chatbot can respond to the streamer and to the chat room respectively, so it can play either role: it answers the chat room when there are too many viewers, and it acts as a viewer interacting with the streamer when the number of viewers is insufficient. We expect the chatbot to genuinely help both streamers and their audiences on live streaming platforms. The experimental results show that multimodal emotion recognition performs better than unimodal recognition in live streaming environments. Moreover, conditioning dialogue generation on topic, emotion, or both helps the system generate better responses, and combining topic and emotion produces the best results.
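To make the topic and emotion conditioning concrete, the short sketch below shows one common way such context can enter a generator: control tokens prepended to the flattened dialogue history. It is only an illustration under assumed token conventions; the thesis itself conditions a Wasserstein autoencoder on these signals rather than using plain control tokens.

```python
# A minimal sketch of injecting topic and emotion context into response
# generation by prepending control tokens to the dialogue history. This is an
# illustrative stand-in, not the conditional Wasserstein autoencoder used in
# the thesis; the token format, topic labels, and <eou> marker are assumptions.
def build_conditioned_input(context_utterances, topic, emotion):
    """Flatten the dialogue context and prepend <topic=...> and <emotion=...>
    control tokens so a generator can condition on both signals."""
    control = [f"<topic={topic}>", f"<emotion={emotion}>"]
    history = []
    for utterance in context_utterances:
        history.extend(utterance.split())
        history.append("<eou>")  # end-of-utterance marker between turns
    return control + history

if __name__ == "__main__":
    tokens = build_conditioned_input(
        ["that boss fight was insane", "chat is going wild"],
        topic="gaming",
        emotion="joy",
    )
    print(" ".join(tokens))
```

In a conditional autoencoder the same topic and emotion labels would typically condition the latent prior and the decoder instead of being spliced into the token stream, but the example shows where the two signals attach to the input.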
目次 Table of Contents
論文審定書 Thesis approval form i
摘要 Abstract (Chinese) ii
Abstract (English) iii
目錄 Table of contents iv
圖目錄 List of figures vi
表目錄 List of tables vii
第一章 Chapter 1: Introduction 1
1.1 Research background and motivation 1
1.2 Research objectives 3
1.3 Research methods and procedure 4
第二章 Chapter 2: Literature review 6
2.1 Live streaming platforms 6
2.1.1 Overview of live streaming platforms 6
2.1.2 Twitch 8
2.2 Emotion recognition 10
2.2.1 Unimodal emotion recognition 11
2.2.2 Multimodal emotion recognition 12
2.3 Dialogue generation 15
2.3.1 Topic and emotion dialogue generation 16
2.3.2 Chatbots 17
第三章 Chapter 3: Research methods and steps 19
3.1 Emotion recognition model 19
3.1.1 Model architecture 19
3.1.2 Unimodal feature extraction 21
3.1.3 Context modeling 23
3.1.4 Multimodal fusion 25
3.1.5 Model tuning 29
3.2 Dialogue generation model 30
3.2.1 Problem statement 30
3.2.2 Conditional Wasserstein autoencoder for dialogue modeling 31
3.3 Topic-based emotional dialogue generation 34
第四章 Chapter 4: Experimental setup 38
4.1 Datasets 38
4.2 Metrics 40
第五章 Chapter 5: Experimental results 45
5.1 Multimodal emotion recognition 45
5.2 Dialogue generation 53
5.3 Human evaluation 56
5.3.1 Questionnaire 56
5.3.2 Data analysis 57
5.3.3 Conclusion 61
5.4 Experimental conclusions 63
第六章 Chapter 6: Conclusion 65
6.1 Conclusions 65
6.2 Suggestions and limitations 67
6.3 Future work 68
參考文獻 References 69
附錄 Appendix 76
電子全文 Fulltext
This electronic fulltext is licensed to users for academic research purposes only, for personal, non-profit retrieval, reading, and printing. Please comply with the Copyright Act of the Republic of China (Taiwan) and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid infringement.
紙本論文 Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 onward. To check the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Library and Information Office. We apologize for any inconvenience.
開放時間 Available: 已公開 available