Type: Conference Paper

Recognition of data records in semi-structured web-pages using ontology and χ² statistical distribution

Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (03029743)

Year: 2008

Volume: 5139

Issue:

Pages: 675 - 682

Keshavarzi A., Rahmani A.M., Mohsenzadeh M., Keshavarzi R.^a

a: University of Isfahan - IRAN (IR) - Isfahan

DOI: 10.1007/978-3-540-88192-6_71

Language: English

Abstract

Information extraction (IE) has emerged as a novel discipline in computer science. In IE, intelligent algorithms are employed to extract the required data and structure them so that they are suitable for querying. Most IE systems use the structure of a web page, e.g. its HTML tags, to recognize the sought information. In this article, an algorithm is developed that recognizes the main region of a web page containing the sought information by means of an ontology, the web-page structure, and the χ² goodness-of-fit test. After the main region has been recognized, the records within it are identified, and each record is written to a text file. © 2008 Springer-Verlag Berlin Heidelberg.
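The abstract does not spell out how the goodness-of-fit test is applied, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that each candidate page region is summarized by counts of ontology terms and that the ontology supplies expected proportions for those terms; the region names, the counts, the proportions, and the rule "lowest statistic wins" are all hypothetical. The only standard part is Pearson's statistic, χ² = Σ (O − E)² / E.

```python
def expected_counts(observed, proportions):
    """Scale the assumed ontology proportions to the region's total term count."""
    total = sum(observed)
    return [total * p for p in proportions]


def chi_square_statistic(observed, expected):
    """Pearson goodness-of-fit statistic: sum over categories of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical expected proportions of three ontology terms in a data region.
proportions = [0.5, 0.3, 0.2]

# Hypothetical observed term counts in two candidate regions of a page.
regions = {
    "sidebar": [1, 8, 1],
    "main": [10, 6, 4],
}

# A lower statistic means the region's term distribution fits the ontology better.
scores = {
    name: chi_square_statistic(obs, expected_counts(obs, proportions))
    for name, obs in regions.items()
}
best = min(scores, key=scores.get)
print(best, scores)
```

In a full test the statistic would be compared against a critical value of the χ² distribution with (number of categories − 1) degrees of freedom; that step is omitted here to keep the sketch dependency-free.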