Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data

David M Rocke, Jian Dai

Research output: Contribution to journal, Article

22 Scopus citations

Abstract

This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
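As a rough illustration of the general idea in the abstract, the sketch below fits a mixture model by maximum likelihood on a random sample and then uses the fitted model to classify every object in the full data set. It is not the authors' algorithm: the abstract does not specify the sampling and subsampling scheme, so the one-dimensional two-component Gaussian mixture, the synthetic data, and all names (`fit_mixture`, `classify`, `cluster_via_sample`, `sample_size`) are assumptions made for illustration only.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fit_mixture(xs, n_iter=50):
    """EM for a 1-D two-component Gaussian mixture (illustrative only)."""
    lo, hi = min(xs), max(xs)
    # Crude starting values: means at the quartile points of the data range.
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    var = [max(((hi - lo) / 4) ** 2, 1e-6)] * 2
    w = [0.5, 0.5]
    n = len(xs)
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in xs:
            p0 = w[0] * normal_pdf(x, mu[0], var[0])
            p1 = w[1] * normal_pdf(x, mu[1], var[1])
            resp.append(p1 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate weights, means, and variances.
        n1 = max(sum(resp), 1e-12)
        n0 = max(n - sum(resp), 1e-12)
        mu = [sum((1 - r) * x for r, x in zip(resp, xs)) / n0,
              sum(r * x for r, x in zip(resp, xs)) / n1]
        var = [sum((1 - r) * (x - mu[0]) ** 2 for r, x in zip(resp, xs)) / n0 + 1e-6,
               sum(r * (x - mu[1]) ** 2 for r, x in zip(resp, xs)) / n1 + 1e-6]
        w = [n0 / n, n1 / n]
    return w, mu, var


def classify(x, params):
    """Assign x to the component with the larger weighted density."""
    w, mu, var = params
    scores = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
    return 0 if scores[0] >= scores[1] else 1


def cluster_via_sample(data, sample_size, seed=0):
    """Fit the mixture on a random sample, then label the full data set."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    params = fit_mixture(sample)
    return [classify(x, params) for x in data], params


# Synthetic stand-in for a large data set: two well-separated groups.
rng = random.Random(42)
data = ([rng.gauss(0.0, 1.0) for _ in range(5000)]
        + [rng.gauss(8.0, 1.0) for _ in range(5000)])
labels, params = cluster_via_sample(data, sample_size=500)
```

In this sketch the expensive step (EM) touches only 500 points, while the full 10,000 points need just one cheap classification pass each, which is the kind of saving a sampling strategy is meant to provide.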

Original language: English (US)
Pages (from-to): 215-232
Number of pages: 18
Journal: Data Mining and Knowledge Discovery
Volume: 7
Issue number: 2
State: Published - Apr 2003

Keywords

  • Clustering algorithm
  • Mixture likelihood
  • Sampling
  • Star/galaxy classification

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Artificial Intelligence
  • Information Systems
