Information retrieval: A framework for recommending text-based classification algorithms

Hany Saleeb, Pace University


Classification is one of the central issues in information retrieval systems dealing with text data. The need for effective approaches has increased dramatically with the advent of the World Wide Web and massive digital libraries. Effective methods are invaluable for exploring information repositories with the aim of discovering similarities between groups of text-based documents. One goal of this thesis is the development of tools for supporting users of machine learning and data mining algorithms in the area of text classification. While interest in such technology is growing rapidly, tools for end-users who are not experts remain limited. This is because machine learning systems are difficult to design and their number keeps increasing. As a result, system designers are faced with two major research problems: algorithmic model selection and model combination, i.e., (a) selecting the most suitable model/algorithm to use on a given application, and (b) integrating this with useful and effective transformations of the data. Traditionally, these problems are resolved by trial and error or through consultation of experts. The first solution is time-consuming and unreliable. The second solution is expensive and biased by the expert's own prejudices and preferences. This thesis develops a meta-model framework called the Regression Model Framework (RMF) that supports system designers with model selection and method combination. RMF uses statistical regression analysis to combine prior meta-knowledge with meta-level learning.

The second major goal of this thesis is to investigate how text classification is performed on the Web. A great many text-based documents are available on the Internet and in corporate intranets, and categorizing them into useful semantic categories is a rewarding and challenging research problem. However, current approaches to text categorization on the Web mostly concentrate on simple representation schemes that are based on word occurrence and word frequency. The structural information that is inherent to documents on the Web is usually neglected. In analyzing Web documents, hypertext tags are investigated in order to ascertain their relative importance in predicting the relevance of unknown documents.
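
The idea of letting hypertext structure inform a document representation can be illustrated with a minimal sketch. It is not the method developed in the thesis: the function name `tag_weighted_terms`, the class `TagWeightedCounter`, and the weights in `TAG_WEIGHTS` are all assumptions chosen for illustration. The sketch counts terms as a plain word-frequency scheme would, but scales each occurrence by a weight attached to the innermost enclosing tag that has one.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights, assumed for illustration only.
# They are not values reported in the thesis.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedCounter(HTMLParser):
    """Accumulates a weight per term, scaled by the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the innermost enclosing tag that has an explicit weight.
        weight = next(
            (TAG_WEIGHTS[t] for t in reversed(self.open_tags) if t in TAG_WEIGHTS),
            DEFAULT_WEIGHT,
        )
        for term in data.lower().split():
            self.weights[term] += weight


def tag_weighted_terms(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


if __name__ == "__main__":
    page = "<title>Text mining</title><h1>Mining</h1><p>text data</p>"
    print(tag_weighted_terms(page))
```

In a study of the kind the abstract describes, such weights would be estimated from data rather than fixed by hand; the fixed values here only show where structural information would enter the representation.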

Subject Area

Computer science|Information Systems

Recommended Citation

Saleeb, Hany, "Information retrieval: A framework for recommending text-based classification algorithms" (2002). ETD Collection for Pace University. AAI3064838.


