Generalized Monge-Elkan Method for Approximate Text String Comparison

Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol. 5449, pp. 559-570 (2009)

https://doi.org/10.1007/978-3-642-00382-0_45

Abstract

The Monge-Elkan method is a general text string comparison method based on an internal character-based similarity measure (e.g. edit distance) combined with a token-level (i.e. word-level) similarity measure. We propose a generalization of this method that replaces the simple average in the Monge-Elkan expression with the generalized arithmetic mean. Experiments carried out with 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
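
The abstract does not spell out the expression itself. As a sketch, assuming the commonly cited formulation of Monge-Elkan (the symbols A, B, a_i, b_j, sim' and m are notation chosen here for illustration), the original measure averages, over the tokens of A, the best internal similarity found in B, and the generalization replaces that average with a generalized mean of exponent m:

```latex
% Original Monge-Elkan: arithmetic mean of best-match token similarities
\mathrm{sim}_{ME}(A, B) = \frac{1}{|A|} \sum_{i=1}^{|A|} \max_{j=1}^{|B|} \mathrm{sim}'(a_i, b_j)

% Generalized form: generalized mean with exponent m
\mathrm{sim}_{ME,m}(A, B) = \left( \frac{1}{|A|} \sum_{i=1}^{|A|} \Bigl( \max_{j=1}^{|B|} \mathrm{sim}'(a_i, b_j) \Bigr)^{m} \right)^{1/m}
```

Under this reading, m = 1 recovers the original method, and larger m gives more weight to tokens with a strong match.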

Key takeaways

  1. The proposed generalization replaces the simple average in the Monge-Elkan expression with the generalized arithmetic mean when combining token similarities.
  2. Experiments with 12 name-matching data sets show improved performance for exponent values m greater than 1, particularly m = 2.
  3. The Monge-Elkan method handles disordered and missing tokens, and the generalization keeps this robustness.
  4. With character-based measures such as edit distance and Jaro similarity used to compare tokens, the new method outperforms the original Monge-Elkan method.
  5. The study aims to improve approximate string matching in NLP tasks, in particular accuracy in name-matching applications.
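
The points above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalization of edit distance to a similarity in [0, 1], and the handling of empty inputs are all assumptions made for this example. It only shows how a generalized mean with exponent m could replace the simple average over best-match token similarities.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]. This normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def generalized_monge_elkan(
    a_tokens: Sequence[str],
    b_tokens: Sequence[str],
    m: float = 2.0,
    sim: Callable[[str, str], float] = edit_similarity,
) -> float:
    """Generalized mean (exponent m > 0) of each A token's best match in B.

    With m = 1 this reduces to the plain average of the original method.
    """
    if m <= 0:
        raise ValueError("this sketch assumes m > 0")
    if not a_tokens or not b_tokens:
        return 0.0  # assumption: empty input counts as no similarity
    best = [max(sim(a, b) for b in b_tokens) for a in a_tokens]
    return (sum(s ** m for s in best) / len(best)) ** (1.0 / m)


if __name__ == "__main__":
    # One token matches exactly, the other has no match at all.
    a, b = ["abc", "xyz"], ["abc"]
    print(generalized_monge_elkan(a, b, m=1))  # 0.5
    print(generalized_monge_elkan(a, b, m=2))  # about 0.707
```

Like the original method, this measure is asymmetric: it averages over the tokens of the first argument only, so swapping the arguments can change the score. Token order within each string does not matter, which matches the robustness to disordered tokens noted above.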
