||For two sequences A = a1a2a3 … am and B = b1b2b3 … cn, a multiset interval ∆(A, i, j) = [ax | i ≤ x ≤ j], and a common multisets interval (CMI) is ∆(A, iA, jA) = ∆(B, iB, jB) for some $iA, jA, iB, jB. 1 ≤ iA ≤ jA ≤ m and 1 ≤ iB ≤ jB ≤ n, which is a multiset that appears in both sequences. |
Previously, researchers have proposed algorithms for finding the common set interval of permutations and sequences. In this thesis, we propose two algorithms to find common multiset intervals of two sequences. The first is the occurrence counting algorithm, which counts the occurrences of the characters in all intervals of the two input sequences and calculate the difference of character occurrences. Its time complexity is O(n3) time, where n denotes the length of the input sequences. The second is the hash key algorithm, which use the product of prime numbers and the modulo operation to build a hash table for quick search. The time complexity of the second algorithm is O(n2 + Gn + qn) or O(n2|Σ| + G|Σ|+ q|Σ|), where G denotes the number of answers and q denotes the number of error collisions. In our experiments, we use C/C++ source codes in CPE (Collegiate Programming Examination of Taiwan) as the data set for classification. The experimental results show that the BKS (behavior knowledge space) method with the combination of the LCS (longest common subsequence) and CMI classifiers can obtain better accuracy than the two methods alone.