Anomaly Detection Based on Measures of Influence for Modelling Economic Phenomena
DOI:
https://doi.org/10.15678/AOC.2024.2604
Keywords:
anomaly detection, influential observations, econometric model, outliers
Abstract
Objective: Anomalies are data points (or sequences of points) for which the relationships between variables differ significantly from those observed under normal circumstances. Their presence in data used for estimating an econometric model may significantly influence the values of the parameter estimates. The result is a skewed projection of the real world and less accurate forecasts. The purpose of this study is to propose a method of identifying anomalies in data based on their influence on the regression function parameter estimates.
Research Design & Methods: This paper proposes a method of detecting anomalies by identifying data points with the most significant influence on the estimates of the model parameters using permutations of the dataset. The method was applied to data generated using copula functions, and anomalies were generated by changing the marginal distribution of the dependent variable. A fixed percentage of data points was identified as anomalies and removed. This method was compared with one based on distance to k-nearest neighbours.
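Illustration: The abstract does not spell out the permutation-based procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an ordinary least squares model and swaps in a simpler leave-one-out scheme: each observation is scored by how much the coefficient estimates change when it is removed, and a fixed fraction of the highest-scoring points is flagged. The function names (`influence_scores`, `flag_top_fraction`), the simulated data, and the 5% fraction are all hypothetical choices made for this example.

```python
import numpy as np


def influence_scores(X, y):
    """Score each observation by the change in OLS coefficients when it is left out.

    Assumption: a linear model with intercept, fitted by least squares.
    This leave-one-out scheme is a stand-in, not the procedure from the paper.
    """
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    beta_full, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # drop observation i
        beta_i, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
        scores[i] = np.linalg.norm(beta_full - beta_i)
    return scores


def flag_top_fraction(scores, fraction):
    """Flag the given fraction of observations with the highest scores."""
    m = int(round(fraction * len(scores)))
    flagged = np.zeros(len(scores), dtype=bool)
    if m > 0:
        flagged[np.argsort(scores)[-m:]] = True
    return flagged


# Hypothetical data: a linear relationship with a few shifted responses.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

is_anomaly = np.zeros(n, dtype=bool)
is_anomaly[rng.choice(n, size=10, replace=False)] = True
y[is_anomaly] += 20.0  # anomalies: the dependent variable is shifted

scores = influence_scores(x.reshape(-1, 1), y)
flagged = flag_top_fraction(scores, fraction=0.05)

print("flagged:", int(flagged.sum()))
print("true anomalies among flagged:", int((flagged & is_anomaly).sum()))
```

As in the study, the fraction of flagged points is fixed in advance, so the result depends on how well that fraction matches the true share of anomalies in the data.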
Findings: Excluding the anomalies identified by the proposed method resulted in models with a significantly lower prediction error. Additionally, the method based on the influence of observations was more accurate in identifying anomalies.
Implications/Recommendations: Excluding anomalies can be an important stage in data preparation for estimating an econometric model, particularly when one aims to predict. Nevertheless, it is important to keep in mind the risk of deleting valid observations from the dataset.
Contribution: In the simulation study conducted, removing the observations identified as anomalies resulted in models with a significantly lower prediction error, even when some typical observations were incorrectly classified as anomalies. The method based on influence on the model parameter estimates allowed for accurate identification of anomalies, although its accuracy depended on correctly predicting the percentage of anomalous observations that would appear in the data.