Scalable Representative Instance Selection and Ranking

 

Dr. Xingquan (Hill) Zhu

 Department of Computer Science

University of Vermont

 

Date: Monday March 7, 2005

Time: 12:20 p.m. - 1:10 p.m.

Location: 367 Votey

 

 

Abstract

 

Real-world datasets often come with a large data volume or are physically distributed. Finding a small set of representative instances can bring various benefits to data mining practitioners so they can (1) build a learner superior to the one constructed from the whole massive data; and (2) avoid working on the whole original dataset all the time. This problem becomes more important when one wants to perform data mining on physically distributed datasets, because having each distributed branch forward a small subset of representative instances to a central place likely helps users build trustworthy models. We propose a Scalable Representative Instance Selection And Ranking (SRISTAR pronounced 3STAR) mechanism, which carries three unique features: (1) it provides a representative instance ranking list, so that users can always select instances from the top to the bottom, based on the number of examples they prefer; (2) it investigates the behaviors of the underlying examples for instance selection, and the selection procedure tries to optimize the expected future error; and (3) it nicely accommodates both single and the distributed scenarios for representative instance selection. Given a dataset, we first cluster instances into small data cells, each of which consists of instances with similar behaviors. Then we progressively evaluate each of the data cells and their combinations, and order the data cells into a list such that the learners built from the top cells are more accurate. Experimental results from both single and distributed datasets demonstrate the effectiveness of our strategies.