Modern studies in environmental science and engineering show that deterministic models struggle to capture the relationship between the concentration of atmospheric pollutants and their emission sources. Recent advances in statistical modeling based on machine learning have emerged as a solution to this problem. The type of input variable is known to strongly affect the performance of an algorithm; however, it is not yet known why one algorithm is preferred over another for a given task. This work aims to highlight the underlying principles of machine learning techniques and their role in enhancing prediction performance. The study reviews the 38 most relevant studies in environmental science and engineering that have applied machine learning techniques during the last 6 years. The review explores several aspects of these studies: 1) the role of input predictors in improving prediction accuracy; 2) where geographically the studies were conducted; 3) the major techniques applied for pollutant concentration estimation or forecasting; and 4) whether these techniques were based on Linear Regression, Neural Network, Support Vector Machine or Ensemble learning algorithms. The results suggest that studies applying machine learning techniques were mainly conducted in Europe and America. Furthermore, a factorial analysis named multi-component analysis shows that pollution estimation is generally performed using ensemble learning and linear regression based approaches, whereas forecasting tasks tend to implement neural network and support vector machine based algorithms.
- There are 1.5 times more studies dedicated to estimation modeling than to forecast modeling.
- Estimation-based studies mainly apply ensemble learning and regression algorithms, whereas forecasting tasks tend to use neural network (NN) and support vector machine (SVM) based approaches.
- Predictive features such as land use and satellite images have a strong association with estimation models, but their association with forecast models is weak.
- Ensemble learning techniques are highly reliable, with an average correlation coefficient of 0.79, but their application in forecast modeling is limited.