Abstract
This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
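The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea it names: fit a mixture model by maximum likelihood on a random sample, then use the fitted model to classify every object in the full data set. Everything here is an assumption made for illustration, not the authors' procedure: the one-dimensional synthetic data, the two Gaussian components, the sample size of 500, and the helper names `fit_two_component_mixture` and `classify`. The subsampling stage mentioned in the abstract is not modeled.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_mixture(sample, n_iter=50):
    """EM for a 1-D two-component Gaussian mixture (illustrative only)."""
    s = sorted(sample)
    half = len(s) // 2
    # Crude initialisation: lower and upper halves of the sorted sample.
    means = [sum(s[:half]) / half, sum(s[half:]) / (len(s) - half)]
    overall_mean = sum(s) / len(s)
    overall_var = sum((x - overall_mean) ** 2 for x in s) / len(s)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in sample:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([0.5, 0.5] if total == 0.0 else [p[0] / total, p[1] / total])
        # M-step: update weights, means and variances from the posteriors.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(sample)
            means[k] = sum(r[k] * x for r, x in zip(resp, sample)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, sample)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsed component
    return weights, means, variances


def classify(x, weights, means, variances):
    """Assign x to the component with the larger weighted density."""
    scores = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    return 0 if scores[0] >= scores[1] else 1


# Synthetic stand-in for a large data set: two well-separated groups.
rng = random.Random(0)
full_data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
full_data += [rng.gauss(6.0, 1.0) for _ in range(20000)]

# Fit on a small random sample, then label the whole data set.
sample = rng.sample(full_data, 500)
weights, means, variances = fit_two_component_mixture(sample)
labels = [classify(x, weights, means, variances) for x in full_data]
```

The expensive iterative fit touches only the 500 sampled points, while the full data set is visited once for labeling, which is the kind of saving a sampling strategy is meant to provide on very large data sets.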
Original language | English (US) |
---|---|
Pages (from-to) | 215-232 |
Number of pages | 18 |
Journal | Data Mining and Knowledge Discovery |
Volume | 7 |
Issue number | 2 |
DOIs | 10.1023/A:1022497517599 |
State | Published - Apr 2003 |
Keywords
- Clustering algorithm
- Mixture likelihood
- Sampling
- Star/galaxy classification
ASJC Scopus subject areas
- Control and Systems Engineering
- Artificial Intelligence
- Information Systems
Cite this
Sampling and Subsampling for Cluster Analysis in Data Mining: With Applications to Sky Survey Data. / Rocke, David M; Dai, Jian.
In: Data Mining and Knowledge Discovery, Vol. 7, No. 2, 04.2003, p. 215-232.
Research output: Contribution to journal › Article
TY - JOUR
T1 - Sampling and Subsampling for Cluster Analysis in Data Mining
T2 - With Applications to Sky Survey Data
AU - Rocke, David M
AU - Dai, Jian
PY - 2003/4
Y1 - 2003/4
AB - This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
KW - Clustering algorithm
KW - Mixture likelihood
KW - Sampling
KW - Star/galaxy classification
UR - http://www.scopus.com/inward/record.url?scp=0037287745&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0037287745&partnerID=8YFLogxK
U2 - 10.1023/A:1022497517599
DO - 10.1023/A:1022497517599
M3 - Article
AN - SCOPUS:0037287745
VL - 7
SP - 215
EP - 232
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
SN - 1384-5810
IS - 2
ER -