This book constitutes the refereed proceedings of the 14th Industrial Conference on Advances in Data Mining, ICDM 2014, held in St. Petersburg, Russia, in July 2014. The 16 revised full papers presented were carefully reviewed and selected from numerous submissions. The topics range from theoretical aspects of data mining to applications of data mining, such as in multimedia data, in marketing, in medicine and agriculture, in process control, and in society.

After that, we present the results of the experiments and discuss them.

Multiple Template Detection Based on Segments

Table 1. Number of Web pages and their classes
Web sites: PCConnection, Amazon, CNet, J&R, PCMag, ZDnet
Notebook: 560, 410, 431, 60, 145
Camera: 156, 230, 206, 150, 138, 198, 139
Mobile: 20, 36, 42, 32, 47, 108
Printer: 423, 610, 123, 127, 110, 89
TV: 267, 589, 146, 171, 56, 72

Fig. 3.

1 Data Sets and Evaluation Measures

In this paper, we crawled six distinct commercial Web sites: PCConnection, Amazon, CNet, J&R, PCMag and ZDnet.

Many applications can realize a significant improvement in performance, so it is important to identify templates correctly and efficiently. In this work, we focus on discovering informative content based on the following observation: within a given Web site, templates usually share common presentation styles, and their contents tend to be similar or almost identical. Many extraction methods in the literature instead identify the informative content of a Web page by analyzing each page on its own.
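The observation above can be illustrated with a small sketch. It is not the method of the paper: it assumes each page has already been split into segments, represents a segment as a hypothetical (style, content) pair, and uses exact matching plus an arbitrary `min_fraction` threshold, whereas the text only says that template contents are "similar or almost identical".

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical segment representation: (presentation style, text content).
Segment = Tuple[str, str]


def template_candidates(pages: List[List[Segment]],
                        min_fraction: float = 0.8) -> Set[Segment]:
    """Return segments that occur on at least min_fraction of the pages."""
    page_counts: Counter = Counter()
    for segments in pages:
        # Count each distinct segment once per page.
        page_counts.update(set(segments))
    threshold = min_fraction * len(pages)
    return {seg for seg, n in page_counts.items() if n >= threshold}


def informative_segments(page: List[Segment],
                         templates: Set[Segment]) -> List[Segment]:
    """Keep only the segments of a page that are not template candidates."""
    return [seg for seg in page if seg not in templates]


# Toy data standing in for three pages of one site.
nav = ("div.nav", "Home | Products | Contact")
footer = ("div.footer", "Copyright")
pages = [
    [nav, ("div.main", "Notebook A"), footer],
    [nav, ("div.main", "Camera B"), footer],
    [nav, ("div.main", "Printer C"), footer],
]
templates = template_candidates(pages)
print(informative_segments(pages[0], templates))
```

Counting across the whole site, rather than inspecting one page at a time, is what lets the repeated navigation bar and footer stand out from the page-specific content.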

1. )
• parent is the pointer to its parent;
• children is the list of pointers to its children.

Figure 1 shows the HTML source code of a Web page and its corresponding DOM tree. In the figure, the circle is the actual content of the node. For example, for the tag “DIV”, the actual contents are “Welcome, my friends” and “Thanks for you coming”; for the tag “A”, the actual content is “See more”. For the tag “TABLE”, its style is represented by the attributes “width” and “height”. tagName). Whenever two sibling nodes have the same tagName, we distinguish them by adding the styleHash to their label values.
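The node structure and the sibling-labelling rule described above might be sketched as follows. This is a minimal illustration under stated assumptions: the field names `style` and `content`, the helper `sibling_labels`, the `#` separator, and the use of a truncated SHA-256 digest for the styleHash are all choices made for the example; the text does not say how the hash is computed.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent/children
class Node:
    tag_name: str
    style: Dict[str, str] = field(default_factory=dict)  # e.g. {"width": "100"}
    content: str = ""                                     # actual content of the node
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Node") -> "Node":
        """Attach child and keep the parent pointer consistent."""
        child.parent = self
        self.children.append(child)
        return child

    @property
    def style_hash(self) -> str:
        # Assumption: hash the sorted attribute pairs so attribute order is irrelevant.
        canonical = ";".join(f"{k}={v}" for k, v in sorted(self.style.items()))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def sibling_labels(parent: Node) -> List[str]:
    """Label each child by tagName, adding the styleHash when a tagName repeats."""
    counts = Counter(child.tag_name for child in parent.children)
    return [
        f"{child.tag_name}#{child.style_hash}" if counts[child.tag_name] > 1
        else child.tag_name
        for child in parent.children
    ]


body = Node("BODY")
body.add_child(Node("TABLE", {"width": "100", "height": "50"}))
body.add_child(Node("TABLE", {"width": "300", "height": "80"}))
body.add_child(Node("A", content="See more"))
print(sibling_labels(body))
```

In this sketch the two TABLE siblings receive distinct labels because their width and height differ, while the single A node keeps its plain tagName.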

