Scalable Representative Instance
Selection and Ranking
Dr. Xingquan (Hill) Zhu
Department of Computer Science
Date:
Time:
Location: 367 Votey
Abstract
Real-world datasets often come with a large data volume or are physically distributed. Finding a small set of representative instances can bring various benefits to data mining practitioners so they can (1) build a learner superior to the one constructed from the whole massive data; and (2) avoid working on the whole original dataset all the time. This problem becomes more important when one wants to perform data mining on physically distributed datasets, because having each distributed branch forward a small subset of representative instances to a central place likely helps users build trustworthy models. We propose a Scalable Representative Instance Selection And Ranking (SRISTAR pronounced 3STAR) mechanism, which carries three unique features: (1) it provides a representative instance ranking list, so that users can always select instances from the top to the bottom, based on the number of examples they prefer; (2) it investigates the behaviors of the underlying examples for instance selection, and the selection procedure tries to optimize the expected future error; and (3) it nicely accommodates both single and the distributed scenarios for representative instance selection. Given a dataset, we first cluster instances into small data cells, each of which consists of instances with similar behaviors. Then we progressively evaluate each of the data cells and their combinations, and order the data cells into a list such that the learners built from the top cells are more accurate. Experimental results from both single and distributed datasets demonstrate the effectiveness of our strategies.