Generalized Monge-Elkan Method for Approximate Text String Comparison

Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, Volume 5449, 2009, pp. 559-570

https://doi.org/10.1007/978-3-642-00382-0_45

12 pages

Abstract

The Monge-Elkan method is a general text string comparison method that combines an internal character-based similarity measure (e.g., edit distance) with a token-level (i.e., word-level) similarity measure. We propose a generalization of this method that replaces the simple average in the Monge-Elkan expression with the generalized arithmetic mean. Experiments carried out with 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
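
The generalization described in the abstract can be written out as follows. The notation is an assumption chosen for illustration, not taken from this page: $A$ and $B$ are the token sequences of the two strings, $\mathrm{sim}'$ is the internal character-based measure with values in $[0, 1]$, and $m$ is the exponent of the generalized mean.

$$
\mathrm{ME}_m(A, B) = \left( \frac{1}{|A|} \sum_{i=1}^{|A|} \Bigl( \max_{j=1,\dots,|B|} \mathrm{sim}'(a_i, b_j) \Bigr)^{m} \right)^{1/m}
$$

Setting $m = 1$ reduces the expression to the simple average of the best per-token matches, which is the original Monge-Elkan form.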

Key takeaways

The proposed generalization of the Monge-Elkan method replaces the simple average over token similarities with the generalized arithmetic mean.
Experiments with 12 name-matching data sets show improved performance when the exponent m of the generalized mean is greater than 1, particularly for m=2.
The Monge-Elkan method addresses disordered and missing tokens while the generalization maintains this robustness.
Using character-based measures like edit distance and Jaro similarity, the new method outperforms the original Monge-Elkan.
The study aims to improve approximate string matching techniques in NLP tasks, enhancing accuracy in name-matching applications.
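
As a rough illustration of the takeaways above, the following Python sketch scores two tokenized names by taking, for each token of the first string, its best match in the second, and then combining those best scores with a generalized mean of exponent `m`. The function names, the whitespace tokenization, and the choice of a length-normalized edit distance as the internal measure are all assumptions made for this example; they are not the authors' implementation.

```python
from typing import Callable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def generalized_monge_elkan(
    a_tokens: Sequence[str],
    b_tokens: Sequence[str],
    sim: Callable[[str, str], float] = edit_similarity,
    m: float = 2.0,
) -> float:
    """Generalized mean (exponent m) of each A-token's best match in B."""
    if m <= 0:
        raise ValueError("this sketch only handles m > 0")
    if not a_tokens or not b_tokens:
        return 0.0
    best = [max(sim(a, b) for b in b_tokens) for a in a_tokens]
    return (sum(s ** m for s in best) / len(best)) ** (1.0 / m)


if __name__ == "__main__":
    a = "john smith".split()
    b = "smith jon".split()
    print(generalized_monge_elkan(a, b, m=1))  # simple average (original form)
    print(generalized_monge_elkan(a, b, m=2))  # quadratic mean
```

Because each token looks for its best partner anywhere in the other string, reordering the tokens of the second string does not change the score. The measure is not symmetric: swapping the two arguments can give a different value. With `m` greater than 1 the mean weights high-scoring token pairs more heavily than the simple average does.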
