Big Data technologies and applications have flourished in recent years. However, big data solutions remain costly because they require expensive hardware and professional analysis software.
Since 1995, Taiwan's National Health Insurance program has accumulated the complete treatment records of Taiwan's entire population. In the past, Microsoft Excel and statistical packages such as SAS and SPSS were the main tools for analyzing these data. However, the data are now too large for these tools to process, or processing them takes too much time.
Therefore, the purpose of our research is to find a solution that can process big data at a moderate cost. In addition to the platform solution, we also design and implement a solution for automatic SQL code generation, which is very useful for users who are not IT experts but need to mine data from the platform. Our proposed platform consists of a computing cluster built from many off-the-shelf personal computers, to which we apply a virtualization tool, Linux Containers (LXC), to ensure data security, system scalability, and utilization. We also use OpenFlow to ensure the required network bandwidth during data mining.
We choose Cloudera Impala as the data mining tool; it uses standard SQL as the query language, which narrows the gap between users and the database. Impala, whose implementation takes an in-memory approach, offers faster query speed than tools based on MapReduce. Additionally, we use HTML5 to build the interface of the automatic SQL generator so that non-IT users can quickly obtain correct SQL code and then execute it.
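The idea of generating SQL from a non-expert user's selections can be illustrated with a minimal sketch. This is not the generator described above: the function name `build_query`, the table `visits`, and its columns are hypothetical, and the sketch only covers simple SELECT statements with AND-combined filters. It assumes the user's choices arrive from a form as plain values, and it rejects anything that does not look like a safe identifier or literal rather than trying to escape it.

```python
import re

# Hypothetical helper names; the whitelist approach is an assumption,
# not the method used by the generator described in the text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_OPERATORS = {"=", "<>", "<", "<=", ">", ">="}


def _identifier(name):
    """Accept only plain table/column names picked from the form."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def _literal(value):
    """Render a number or a simple string as a SQL literal."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    # Refuse quotes and backslashes instead of guessing escape rules.
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported characters in value: {text!r}")
    return f"'{text}'"


def build_query(table, columns, filters=(), limit=None):
    """Build a SELECT statement from form selections.

    filters is a sequence of (column, operator, value) triples that are
    combined with AND.
    """
    select = ", ".join(_identifier(c) for c in columns) if columns else "*"
    sql = f"SELECT {select} FROM {_identifier(table)}"
    clauses = []
    for column, operator, value in filters:
        if operator not in _OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        clauses.append(f"{_identifier(column)} {operator} {_literal(value)}")
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


if __name__ == "__main__":
    print(build_query(
        "visits",
        ["patient_id", "visit_date"],
        [("visit_year", "=", 2010), ("dept", "=", "cardiology")],
        limit=100,
    ))
```

The generated string could then be submitted to a SQL engine such as Impala through whatever client the platform provides; that step is left out here because it depends on the deployment.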