Thesis/dissertation etd-0804120-103028: detailed information
Title page for etd-0804120-103028
Exploration of Multimodal Machine Learning Model - Findings from Real Estate Valuation
Year, semester
Number of pages
Advisory Committee
Date of Exam
Date of Submission
Convolutional neural network, Real estate value evaluation, Multi-modal model, Model interpretability
This thesis/dissertation has been viewed 6135 times and downloaded 106 times.
Machine learning and deep learning have been widely used in various fields in recent years. However, many models sacrifice interpretability in pursuit of predictive performance, which makes them as hard to understand as a black box. In this thesis, we propose using a multi-modal model so that a model can have both predictive accuracy and interpretability. The interpretability of a model refers to whether we can understand how the model produces its predictions, or in other words, which features the model bases its predictions on. We take real estate valuation as an example and use our proposed multi-modal model architecture to verify our ideas, so that the model achieves good predictive performance while retaining a certain degree of explanatory power. As for interpretability, we further use the features learned by the model to provide global explanations and a local explainer to provide local explanations.
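To make the idea of a local explanation concrete, the following is a minimal sketch of a LIME-style local linear surrogate. It is not the thesis's implementation: the function `black_box_price`, its three inputs (floor area, building age, and a single image-derived score), and all numeric values are hypothetical stand-ins for a trained multi-modal valuation model. The sketch also omits steps that the LIME explainer performs, such as feature selection. It perturbs one instance, weights the perturbed samples by their proximity to it, and fits a weighted linear model whose coefficients act as local feature attributions.

```python
import math
import random


def black_box_price(x):
    """Hypothetical stand-in for a trained valuation model (an assumption).

    x = [floor_area, building_age, image_score]
    """
    area, age, img = x
    return 50.0 + 3.0 * area - 1.5 * age + 20.0 * img + 0.05 * area * img


def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def local_linear_explanation(predict, x0, scales, n_samples=500,
                             kernel_width=2.0, seed=0):
    """Fit a proximity-weighted linear surrogate of `predict` around x0.

    Returns (intercept, coefs). Each coefficient is the local change in the
    prediction per one `scale` unit of the corresponding feature.
    """
    rng = random.Random(seed)
    d = len(x0)
    # Accumulate the weighted normal equations (Z^T W Z) w = Z^T W y,
    # where each row of Z is [1, z_1, ..., z_d].
    ztz = [[0.0] * (d + 1) for _ in range(d + 1)]
    zty = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # standardized offsets
        x = [x0[j] + z[j] * scales[j] for j in range(d)]
        dist2 = sum(v * v for v in z)
        weight = math.exp(-dist2 / kernel_width ** 2)  # closer samples count more
        row = [1.0] + z
        y = predict(x)
        for i in range(d + 1):
            zty[i] += weight * row[i] * y
            for j in range(d + 1):
                ztz[i][j] += weight * row[i] * row[j]
    coef = solve_linear_system(ztz, zty)
    return coef[0], coef[1:]


# Explain one hypothetical property: 30 area units, 10 years old, image score 0.6.
intercept, coefs = local_linear_explanation(
    black_box_price, x0=[30.0, 10.0, 0.6], scales=[5.0, 3.0, 0.1]
)
for name, c in zip(["floor_area", "building_age", "image_score"], coefs):
    print(f"{name}: {c:+.3f}")
```

The sign and size of each coefficient indicate how the prediction moves locally when that feature changes, which is the kind of per-instance explanation the abstract calls a local explanation.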
Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Background & Related Work
2.1. Explainable AI
2.2. Housing Price Estimation
2.3. Representation Learning
2.4. Multi-modal Model
2.5. LIME explainer
3. Methodology
3.1. Real estate transaction dataset
3.2. Boosting model’s predictive performance by image features
3.3. Interpretability of model
3.3.1. Explaining the models by sets of labels
3.3.2. Explaining the models by LIME explainer
4. Experimental Results
4.1. Experiment environment
4.2. Data pre-processing
4.3. The base real estate value evaluation model
4.4. The effect of images’ embeddings
4.5. Model interpretability
4.5.1. Explain the models by images’ labels
4.5.2. Explain the models by LIME explainer
5. Conclusion
6. Reference
Fulltext
Thesis access permission: unrestricted
Available:
Campus: available
Off-campus: available

Printed copies
Available: available
