Detection of foreign words and names in written text

Bashir U Ahmed, Pace University

Abstract

Tremendous research effort has gone into the field of natural language processing and understanding during the last half of the 20th century, yet it remains one of the most challenging problems. Nevertheless, many companies have commercialized language processing applications such as Text-to-Speech (TTS) systems, domain-specific Automatic Speech Recognition (ASR) systems, Machine Translation (MT) systems, and limited-domain, speaker-dependent Speech-to-Speech (STS) prototype systems. While the quality of all natural language processing applications has improved steadily, they remain far from perfect. Due to globalization and the widespread use of the Internet, occurrences of foreign entities (foreign words, names, locations, and events) in written text are becoming commonplace. This further complicates the automatic processing of natural language text. Identification of such foreign entities is important for several reasons. For example, TTS systems will usually fail to pronounce such entities as they are pronounced in their native language, because they try to sound them out using the lexicon of the base language. These foreign entities also cause problems for MT programs, where translation is initially done word by word. This dissertation simplifies the N-gram-based Naïve Bayesian text-classification algorithm by taking its logarithm, names the result the Cumulative Frequency Addition (CFA) algorithm, and applies it to three text processing tasks: language identification, name-to-nationality identification, and detection of foreign words and names. In the language identification task, CFA yields 100% accuracy on strings longer than 150 characters. In the name-to-nationality task, it yields 86% accuracy on a 14-country database and 96% on a 7-country database within the top three choices. Identifying the nationality, or at least the language group, of a person from his or her name can be important not only for proper pronunciation but also for forensic studies and national security. Finally, in the task of detecting foreign words, we obtain 66.9% accuracy. This is the first study to apply natural language processing techniques to the latter two tasks.
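The idea of taking the logarithm of an N-gram Naïve Bayesian classifier, so that the product of N-gram probabilities becomes a sum, can be illustrated with a minimal Python sketch. This is a generic log-space character-trigram classifier and is not claimed to be the dissertation's CFA algorithm, whose exact definition is not given in the abstract. The function names (`char_ngrams`, `train`, `classify`), the toy corpus `TOY_CORPUS`, the trigram size, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (an assumption; real training data would be far larger).
TOY_CORPUS = {
    "english": [
        "the quick brown fox jumps over the lazy dog",
        "this is the house where the children live",
    ],
    "spanish": [
        "el rapido zorro marron salta sobre el perro perezoso",
        "esta es la casa donde viven los ninos",
    ],
}


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpus, n=3):
    """Count n-grams per label; returns {label: (Counter of n-grams, total count)}."""
    models = {}
    for label, texts in corpus.items():
        counts = Counter()
        for text in texts:
            counts.update(char_ngrams(text, n))
        models[label] = (counts, sum(counts.values()))
    return models


def classify(text, models, n=3):
    """Score each label by the SUM of log n-gram probabilities (log of the
    Naive Bayes product), using add-one smoothing for unseen n-grams."""
    vocab_size = len(set().union(*(counts for counts, _ in models.values())))
    scores = {}
    for label, (counts, total) in models.items():
        scores[label] = sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text, n)
        )
    best = max(scores, key=scores.get)
    return best, scores


models = train(TOY_CORPUS)
for sample in ("the children are in the house", "los ninos viven en la casa"):
    label, _ = classify(sample, models)
    print(f"{sample!r} -> {label}")
```

Summing log-probabilities gives the same ranking as multiplying the probabilities, while avoiding numerical underflow on long strings. The same scoring could in principle be applied to names labeled by nationality instead of sentences labeled by language, though the abstract does not describe how the dissertation sets up that task.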

Subject Area

Information systems|Communication|Artificial intelligence|Linguistics

Recommended Citation

Ahmed, Bashir U, "Detection of foreign words and names in written text" (2005). ETD Collection for Pace University. AAI3172339.
https://digitalcommons.pace.edu/dissertations/AAI3172339
