Anomaly Detection Based on Measures of Influence for Modelling Economic Phenomena

Authors

DOI:

https://doi.org/10.15678/AOC.2024.2604

Keywords:

anomaly detection, influential observations, econometric model, outliers

Abstract

Objective: Anomalies are data points (or sequences of points) for which the relationships between variables differ significantly from those observed under normal circumstances. Their presence in data used for estimating an econometric model may significantly influence the values of the parameter estimates. The result is a distorted picture of the real world and less accurate forecasts. The purpose of this study is to propose a method of identifying anomalies in data based on their influence on the regression function parameter estimates.

Research Design & Methods: This paper proposes a method of detecting anomalies by identifying the data points with the most significant influence on the estimates of the model parameters, using permutations of the dataset. The method was applied to data generated using copula functions, and anomalies were generated by changing the marginal distribution of the dependent variable. A fixed percentage of data points was identified as anomalies and removed. The method was compared with one based on the distance to the k nearest neighbours.
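The general idea of influence-based flagging can be illustrated with a minimal sketch. It does not reproduce the paper's permutation-based procedure. It assumes a simpler leave-one-out deletion measure in the spirit of Cook (1977), a toy linear dataset instead of copula-generated data, and anomalies created by shifting the dependent variable. All function names (`loo_influence`, `knn_distance`, `flag_top_fraction`) and parameter values are hypothetical.

```python
import numpy as np

def loo_influence(X, y):
    """Score each observation by how much the OLS fitted values change
    when that observation is dropped (an assumed deletion measure,
    not the permutation-based method of the paper)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta_full, *_ = np.linalg.lstsq(A, y, rcond=None)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        # Norm of the change in fitted values over the whole sample
        scores[i] = np.linalg.norm(A @ (beta_full - beta_i))
    return scores

def knn_distance(X, y, k=5):
    """Score each observation by the distance to its k-th nearest
    neighbour in the standardised (X, y) space."""
    Z = np.column_stack([X, y])
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    d.sort(axis=1)
    return d[:, k]                                # column 0 is the zero self-distance

def flag_top_fraction(scores, fraction):
    """Flag a fixed fraction of observations with the highest scores."""
    m = int(round(fraction * len(scores)))
    flagged = np.zeros(len(scores), dtype=bool)
    if m > 0:
        flagged[np.argsort(scores)[-m:]] = True
    return flagged

# Toy data (an assumption for illustration): linear relation plus noise,
# with the first 10 of 200 observations shifted in the dependent variable.
rng = np.random.default_rng(0)
n, n_anom = 200, 10
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=n)
y[:n_anom] += 15.0

fraction = n_anom / n                             # assumed share of anomalies
flag_influence = flag_top_fraction(loo_influence(x, y), fraction)
flag_knn = flag_top_fraction(knn_distance(x, y, k=5), fraction)

# Refit the model on the observations that were not flagged
A_clean = np.column_stack([np.ones(n), x])[~flag_influence]
beta_clean, *_ = np.linalg.lstsq(A_clean, y[~flag_influence], rcond=None)
print("flagged by influence:", np.flatnonzero(flag_influence))
print("flagged by kNN distance:", np.flatnonzero(flag_knn))
print("coefficients after removal:", beta_clean)
```

Because a fixed fraction of points is always flagged, the quality of the result in this sketch depends on how well `fraction` matches the true share of anomalies. If it is set too high, typical observations are removed along with the anomalous ones.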

Findings: Excluding the anomalies identified by the proposed method resulted in models with a significantly lower prediction error. Additionally, the influence-based method identified anomalies more accurately than the method based on the distance to the k nearest neighbours.

Implications/Recommendations: Excluding anomalies can be an important stage in data preparation for estimating an econometric model, particularly when the aim is prediction. Nevertheless, it is important to keep in mind the risk of deleting valid observations from the dataset.

Contribution: In the simulation study, removing the observations identified as anomalies resulted in models with a significantly lower prediction error, even when some typical observations were incorrectly classified as anomalies. The method based on influence on the model parameter estimates allowed for accurate identification of anomalies, although its accuracy depended on correctly predicting the percentage of anomalous observations in the data.

Published

2025-02-13

Issue

Section

Articles