Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data

David M Rocke, Jian Dai

Research output: Contribution to journal › Article

21 Citations (Scopus)

Abstract

This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
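The abstract names two ingredients: a mixture likelihood fit, and a sampling step so that the fit never has to touch the whole data set. The following is a minimal sketch of that general idea only, not the authors' algorithm. It assumes one-dimensional data, Gaussian components, a plain EM fit, and a single simple random sample, and it leaves out the subsampling stage because the abstract does not describe it. All function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture_em(points, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture by EM (assumed stand-in for the
    paper's mixture likelihood step)."""
    ordered = sorted(points)
    n = len(ordered)
    # Start the means at evenly spaced quantiles of the sample.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = sum(ordered) / n
    spread = sum((x - centre) ** 2 for x in ordered) / n
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in points:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, points)) / nj
            variances[j] = max(var_j, 1e-6)  # floor keeps the density finite
    return weights, means, variances


def classify(x, weights, means, variances):
    """Assign x to the component with the largest weighted density."""
    scores = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))


def cluster_by_sampling(data, sample_size, k=2, seed=0):
    """Fit the mixture on a random sample, then label every point in data."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    params = fit_mixture_em(sample, k)
    labels = [classify(x, *params) for x in data]
    return params, labels
```

In this sketch the expensive iterative fit runs only on `sample_size` points, while the full data set is visited once for labelling, which is why a sampling step of this kind can make a mixture fit affordable on large data.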

Original language: English (US)
Pages (from-to): 215-232
Number of pages: 18
Journal: Data Mining and Knowledge Discovery
Volume: 7
Issue number: 2
DOI: 10.1023/A:1022497517599
State: Published - Apr 2003

Fingerprint

Cluster analysis
Data mining
Sampling
Galaxies
Stars

Keywords

  • Clustering algorithm
  • Mixture likelihood
  • Sampling
  • Star/galaxy classification

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Artificial Intelligence
  • Information Systems

Cite this

Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data. / Rocke, David M; Dai, Jian.

In: Data Mining and Knowledge Discovery, Vol. 7, No. 2, 04.2003, p. 215-232.

@article{30b4e9778fb147a2ae878a427c738632,
title = "Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data",
abstract = "This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.",
keywords = "Clustering algorithm, Mixture likelihood, Sampling, Star/galaxy classification",
author = "Rocke, {David M} and Dai, Jian",
year = "2003",
month = "4",
doi = "10.1023/A:1022497517599",
language = "English (US)",
volume = "7",
pages = "215--232",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer Netherlands",
number = "2",
}

TY - JOUR

T1 - Sampling and Subsampling for Cluster Analysis in Data Mining

T2 - With Applications to Sky Survey Data

AU - Rocke, David M

AU - Dai, Jian

PY - 2003/4

Y1 - 2003/4

AB - This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.

KW - Clustering algorithm

KW - Mixture likelihood

KW - Sampling

KW - Star/galaxy classification

UR - http://www.scopus.com/inward/record.url?scp=0037287745&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0037287745&partnerID=8YFLogxK

U2 - 10.1023/A:1022497517599

DO - 10.1023/A:1022497517599

M3 - Article

AN - SCOPUS:0037287745

VL - 7

SP - 215

EP - 232

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

IS - 2

ER -