Abstracting Keywords from Hypertext Documents
BEN CHOI & BAOLIN LI
Computer Science,
College of Engineering and Science
Louisiana Tech University, Ruston, LA 71272,
USA
pro@BenChoi.org
Abstract: This paper presents a process for abstracting keywords from hypertext or text documents. The abstracted keywords, like the keywords listed in a paper, identify the contents of a document. The proposed process can be used, for example, to identify the contents of HTML documents returned by a search engine, so that users can quickly find the information they need. Like related work, the process considers the occurrence frequency of a word in a document, but it also considers the occurrence frequency of the word's synonyms, as well as key phrases consisting of two or three words. To increase the accuracy of the word frequency counts, a stemming algorithm is used to remove suffixes. Our tests show that the stemming algorithm consumed on average 56.7% of the total computation time, and that the proposed process abstracted on average 52% of the keywords provided by the authors of the tested documents.
Keywords: web mining, keyword extraction, information retrieval, hypertext
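The process outlined in the abstract can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the suffix stripper `crude_stem` is a simplified stand-in for a real stemming algorithm, the `SYNONYMS` table and `STOPWORDS` set are hypothetical, and ranking words and phrases together by raw count is an assumption made only to keep the example short.

```python
import re
from collections import Counter

# Hypothetical synonym table: maps a stemmed word to a canonical stem,
# so that synonyms contribute to the same frequency count.
SYNONYMS = {"automobil": "car", "auto": "car"}

# Hypothetical stopword list; such words are never keywords on their own.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "are", "to", "in", "for", "on"}


def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text, top_n=5):
    """Return the top_n (term, count) pairs for words and 2-3 word phrases."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase word tokens
    stems = [crude_stem(t) for t in tokens]       # remove suffixes
    canon = [SYNONYMS.get(s, s) for s in stems]   # merge synonyms

    # Single-word frequencies, ignoring stopwords.
    counts = Counter(w for w in canon if w not in STOPWORDS)

    # Candidate key phrases of two or three words; a phrase may not
    # begin or end with a stopword.
    for n in (2, 3):
        for i in range(len(canon) - n + 1):
            gram = canon[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1

    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = "<p>The car is fast. Cars and automobiles are fast.</p>"
    print(extract_keywords(sample, top_n=2))
```

In this sketch, "car", "Cars", and "automobiles" all contribute to a single count, which shows why stemming and synonym merging change the ranking compared with counting surface forms alone.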
Full Paper: