Identification of outliers in multivariate data

David M. Rocke, David L. Woodruff

Research output: Contribution to journal (Article)

213 Citations (Scopus)

Abstract

New insights are given into why the problem of detecting multivariate outliers can be difficult and why the difficulty increases with the dimension of the data. Significant improvements in methods for detecting outliers are described, and extensive simulation experiments demonstrate that a hybrid method extends the practical boundaries of outlier detection capabilities. Based on simulation results and examples from the literature, the question of what levels of contamination can be detected by this algorithm as a function of dimension, computation time, sample size, contamination fraction, and distance of the contamination from the main body of data is investigated. Software to implement the methods is available from the authors and STATLIB.
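To make the difficulty mentioned in the abstract concrete, here is a minimal sketch of the classical (non-robust) baseline: squared Mahalanobis distances from the sample mean and covariance, compared against a chi-square cutoff. It is not the authors' hybrid method or their software. The function names `mahalanobis_sq_2d` and `flag_outliers`, the 0.975 level, and the toy grid data are all assumptions made for illustration, and the code is restricted to two dimensions so that the covariance inverse and the chi-square quantile have closed forms.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared classical Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance entries (denominator n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the closed-form inverse of the 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

def flag_outliers(points, level=0.975):
    """Indices whose squared distance exceeds the chi-square(2) quantile at `level`."""
    # With 2 degrees of freedom the chi-square quantile is -2 * ln(1 - level).
    cutoff = -2.0 * math.log(1.0 - level)
    return [i for i, d in enumerate(mahalanobis_sq_2d(points)) if d > cutoff]

# Hypothetical toy data: a 5 x 5 grid of "clean" points around the origin.
clean = [(float(x), float(y)) for x in range(-2, 3) for y in range(-2, 3)]

# One distant point versus a cluster of eight identical distant points.
single = flag_outliers(clean + [(20.0, 20.0)])
cluster = flag_outliers(clean + [(20.0, 20.0)] * 8)
print(single)
print(cluster)
```

On this toy data the lone distant point is flagged, while the cluster of eight is not: the cluster pulls the mean toward itself and inflates the covariance, so its own distances fall below the cutoff. This masking effect is one reason robust estimators such as those listed in the keywords (M estimation, S estimation, minimum covariance determinant) are used in place of the sample mean and covariance.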

Original language: English (US)
Pages (from-to): 1047-1061
Number of pages: 15
Journal: Journal of the American Statistical Association
Volume: 91
Issue number: 435
State: Published - Sep 1996

Keywords

  • Heuristic search
  • M estimation
  • Minimum covariance determinant
  • S estimation

ASJC Scopus subject areas

  • Mathematics (all)
  • Statistics and Probability

Cite this

Identification of outliers in multivariate data. / Rocke, David M.; Woodruff, David L.

In: Journal of the American Statistical Association, Vol. 91, No. 435, 09.1996, p. 1047-1061.

@article{6b704dbfa2f64ce79ca8e45b3e83667a,
title = "Identification of outliers in multivariate data",
abstract = "New insights are given into why the problem of detecting multivariate outliers can be difficult and why the difficulty increases with the dimension of the data. Significant improvements in methods for detecting outliers are described, and extensive simulation experiments demonstrate that a hybrid method extends the practical boundaries of outlier detection capabilities. Based on simulation results and examples from the literature, the question of what levels of contamination can be detected by this algorithm as a function of dimension, computation time, sample size, contamination fraction, and distance of the contamination from the main body of data is investigated. Software to implement the methods is available from the authors and STATLIB.",
keywords = "Heuristic search, M estimation, Minimum covariance determinant, S estimation",
author = "Rocke, {David M.} and Woodruff, {David L.}",
year = "1996",
month = "9",
language = "English (US)",
volume = "91",
pages = "1047--1061",
journal = "Journal of the American Statistical Association",
issn = "0162-1459",
publisher = "Taylor and Francis Ltd.",
number = "435",
}

TY - JOUR

T1 - Identification of outliers in multivariate data

AU - Rocke, David M.

AU - Woodruff, David L.

PY - 1996/9

Y1 - 1996/9

N2 - New insights are given into why the problem of detecting multivariate outliers can be difficult and why the difficulty increases with the dimension of the data. Significant improvements in methods for detecting outliers are described, and extensive simulation experiments demonstrate that a hybrid method extends the practical boundaries of outlier detection capabilities. Based on simulation results and examples from the literature, the question of what levels of contamination can be detected by this algorithm as a function of dimension, computation time, sample size, contamination fraction, and distance of the contamination from the main body of data is investigated. Software to implement the methods is available from the authors and STATLIB.

KW - Heuristic search

KW - M estimation

KW - Minimum covariance determinant

KW - S estimation

UR - http://www.scopus.com/inward/record.url?scp=0030344143&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0030344143&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:0030344143

VL - 91

SP - 1047

EP - 1061

JO - Journal of the American Statistical Association

JF - Journal of the American Statistical Association

SN - 0162-1459

IS - 435

ER -