This section describes each of the machine learning algorithms employed in this study.
Convolutional neural network
CNNs are a type of deep learning model that work particularly well for image-related tasks, including classification, object detection, and segmentation. They learn hierarchical feature representations from images automatically, with initial layers identifying basic patterns like edges and textures, while deeper layers recognize more intricate structures such as objects and scenes. CNNs are made up of convolutional layers that utilize filters to capture features, activation layers (usually ReLU) to add non-linearity, pooling layers to decrease spatial dimensions and computational expenses, and fully connected layers for generating final predictions. The model is trained through backpropagation, modifying the filters to reduce prediction errors. The effectiveness of CNNs arises from their implementation of parameter sharing and spatial invariance, which minimizes the parameter count and allows the model to identify patterns across different locations in the image.
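To make the layer-by-layer structure concrete, the sketch below assembles a small CNN with the TensorFlow/Keras API. It is a minimal illustration only: the input shape, filter counts, and number of output classes are assumed values, not the architecture used in this study.

```python
# Minimal CNN sketch (TensorFlow/Keras); shapes and layer sizes are illustrative.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),               # assumed grayscale 64x64 input
    layers.Conv2D(16, (3, 3), activation="relu"),  # filters capture local patterns (edges, textures)
    layers.MaxPooling2D((2, 2)),                   # pooling reduces spatial dimensions
    layers.Conv2D(32, (3, 3), activation="relu"),  # deeper layer learns more complex structures
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # assumed 10 output classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10)  # filters are adjusted via backpropagation
```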
CNNs have revolutionized fields such as audio processing, computer vision, and natural language processing. Their ability to extract and learn features from large, complex datasets has made them essential in applications ranging from medical imaging (e.g., disease detection) and autonomous vehicles (e.g., object recognition) to facial recognition and video analysis. Innovations like transfer learning have further enhanced CNNs’ versatility, allowing pre-trained models to be fine-tuned for specific tasks. Their scalability, computational efficiency, and capacity for handling mixed data types have solidified CNNs as a cornerstone in modern AI applications24,25.
Artificial neural network
ANNs are computational models designed to simulate the functioning of biological neurons in the human brain. The architecture typically comprises three types of layers: input layers that accept raw data, hidden layers that perform complex computations, and output layers that generate final predictions or decisions. Within these layers, each neuron receives inputs, applies a weighted sum followed by an activation function, and forwards the result to successive layers. The ANN learns by adjusting the weights between neurons to reduce prediction errors using algorithms such as backpropagation.
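As a minimal illustration of the input-hidden-output structure, the sketch below trains a small feed-forward network with scikit-learn on synthetic data; the layer widths, solver, and data are assumptions for demonstration only.

```python
# Small feed-forward ANN sketch (scikit-learn); hyperparameters are illustrative.
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)
ann = MLPRegressor(hidden_layer_sizes=(32, 16),  # two hidden layers
                   activation="relu",            # non-linear activation
                   solver="adam",                # gradient-based weight updates (backpropagation)
                   max_iter=1000, random_state=0)
ann.fit(X, y)                                    # weights adjusted to reduce prediction error
print(ann.predict(X[:3]))
```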
ANNs are widely applied across diverse areas, such as natural language processing, image recognition, and healthcare. They power systems which may detect objects in images or videos and diagnose medical situations. In spite of their efficacy, ANNs encounter some challenges like overfitting, and vanishing gradients, which hinder deep network training. Additionally, they require substantial labeled data for effective training. To enhance generalization and training efficiency, approaches such as regularization, and transfer learning are used, addressing these challenges and bolstering the applicability and robustness of ANNs in real-world tasks26,27.
Decision tree
A notable advantage of decision trees is their capacity to model nonlinear relationships in data without requiring feature scaling, as their splitting mechanism inherently accommodates features on different scales. These models handle both quantitative and categorical data. However, the technique is prone to overfitting, chiefly when the tree grows excessively deep and begins to capture noise in the training data. Overfitting can be mitigated through techniques like pruning, where nodes that do not contribute significant information are removed, thereby improving the model’s generalizability and performance on unseen data.
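A minimal sketch of this depth control and cost-complexity pruning idea, assuming scikit-learn and synthetic data, is shown below; the max_depth and ccp_alpha values are illustrative and not tuned for any particular dataset.

```python
# Decision tree with pruning controls (scikit-learn); values are illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)   # unconstrained: risks fitting noise
pruned_tree = DecisionTreeRegressor(max_depth=4,                    # limit depth
                                    ccp_alpha=1.0,                  # cost-complexity pruning strength
                                    random_state=0).fit(X_tr, y_tr)
# Compare generalization on held-out data
print(deep_tree.score(X_te, y_te), pruned_tree.score(X_te, y_te))
```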
Decision trees are widely implemented in prominent machine learning frameworks which offer tools for model construction, visualization, and optimization. These models are employed in various domains where interpretability is as critical as prediction accuracy. Though decision trees may not always achieve the predictive power of more sophisticated algorithms, they remain an essential tool in machine learning due to their simplicity, transparency, and ability to provide understandable decision-making processes. In scenarios where model interpretability is paramount, decision trees continue to be a valuable and widely adopted approach28,29,30,31.
Random forest
Random forest is an ensemble learning method useful for both classification and regression. It builds multiple decision trees on random subsets of the training data, a process known as bootstrap aggregating or bagging. This approach independently trains each tree on a different data subset, reducing variance and mitigating overfitting. Predictions are made by aggregating the outputs of all trees, using majority voting for classification or averaging for regression. Random feature selection at each split decorrelates the trees, enhancing robustness and generalizability.
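The sketch below illustrates bagging and random feature selection with scikit-learn on synthetic data; the tree count and feature-subsampling setting are assumed values.

```python
# Random forest sketch (scikit-learn); hyperparameters are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=800, n_features=10, noise=2.0, random_state=0)
rf = RandomForestRegressor(n_estimators=200,     # number of bootstrapped trees
                           max_features="sqrt",  # random feature subset per split (decorrelates trees)
                           oob_score=True,       # out-of-bag estimate of generalization
                           random_state=0)
rf.fit(X, y)                                     # each tree trained on a bootstrap sample; outputs averaged
print(rf.oob_score_)
```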
This method finds extensive applications across diverse fields because of its robustness and ability to process complex datasets. In healthcare, finance, and environmental science, it adeptly handles large-scale data for tasks like disease diagnosis, fraud detection, and pollution prediction. Its capacity to manage high-dimensional data and missing values underscores its versatility and broad applicability in machine learning32,33,34.
Linear regression
Linear regression is a fundamental statistical approach for modeling the relationship between a continuous dependent variable and one or more independent variables as a linear function. The model seeks to predict the dependent variable by estimating coefficients that quantify each independent variable’s influence. This estimation is achieved by minimizing the errors between observed and predicted values. The technique is widely used in fields such as healthcare, economics, and marketing because it is simple, interpretable, and easy to implement, making it a preferred choice for preliminary analyses and baseline models.
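A minimal least-squares fit with scikit-learn is sketched below on synthetic data; the feature count and noise level are assumptions for illustration.

```python
# Ordinary least squares sketch (scikit-learn); data are synthetic.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=3, noise=1.0, random_state=0)
lr = LinearRegression().fit(X, y)   # coefficients chosen to minimize squared error
print(lr.coef_, lr.intercept_)      # each coefficient quantifies one variable's influence
print(lr.predict(X[:2]))
```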
However, the method has limitations arising from its assumptions of linear relationships and constant error variance. It is also sensitive to outliers, which may skew predictions. The model struggles with complex, nonlinear associations unless the inputs are transformed. Techniques like regularization can mitigate issues such as overfitting, enhancing robustness when the assumptions hold and maintaining its utility in predictive and inferential analyses35,36,37,38.
Ridge regression
Ridge regression is a variant of linear regression that mitigates overfitting and multicollinearity by incorporating an L2 penalty proportional to the sum of the squared coefficients. This regularization term discourages excessive complexity by shrinking coefficient values, thereby preventing the model from fitting noise in the data, especially when predictors are intercorrelated or features are numerous. It modifies the least squares objective function to produce a more stable solution that is less sensitive to small changes in the training data. Ridge regression is particularly advantageous when dealing with many predictors, ensuring the model remains generalizable rather than overly complex.
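The sketch below contrasts ordinary least squares with an L2-penalized fit on two deliberately correlated predictors, using scikit-learn; the penalty strength alpha and the synthetic data are illustrative assumptions.

```python
# Ridge (L2-penalized) regression sketch (scikit-learn); alpha is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200)])  # two highly correlated predictors
y = 3 * x1 + rng.normal(size=200)

print(LinearRegression().fit(X, y).coef_)  # unstable, inflated coefficients under multicollinearity
print(Ridge(alpha=10.0).fit(X, y).coef_)   # L2 penalty shrinks coefficients toward a stable solution
```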
This technique finds extensive applications in scenarios where overfitting or multicollinearity is problematic, such as high-dimensional datasets in finance, genomics, and machine learning. In finance, it can be used to predict stock prices by handling correlated predictors. In genomics, it helps analyze gene expression data with complex interactions. Ridge regression’s ability to shrink coefficients and focus on relevant features enhances predictive accuracy and robustness, making it valuable for improving model performance in noisy scenarios39,40,41,42.
Lasso regression
Lasso regression extends linear regression by addressing overfitting and multicollinearity with a focus on feature selection. In contrast to ridge regression’s L2 penalty, Lasso employs an L1 penalty that drives some coefficients to exactly zero, thereby excluding non-essential features. This makes it particularly advantageous for high-dimensional datasets in which several predictors may be redundant or irrelevant, improving model generalization and interpretability by highlighting key predictors. Applications span genomics for gene selection, finance for identifying influential economic factors, and marketing for optimizing resource allocation. In machine learning, Lasso supports sparse modeling of complex datasets, such as text classification or image processing, by selecting only the relevant features, enhancing both model efficiency and accuracy. Lasso’s capacity to simplify models while curbing overfitting makes it invaluable for analyzing large-scale, complex data environments43,44.
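A minimal sketch of this L1-driven feature selection with scikit-learn follows; the penalty strength and the number of truly informative features in the synthetic data are assumptions.

```python
# Lasso (L1-penalized) regression sketch (scikit-learn); alpha is illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 50 predictors, only 5 of which are truly informative
X, y = make_regression(n_samples=300, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print((lasso.coef_ != 0).sum())  # irrelevant coefficients are driven exactly to zero (feature selection)
```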
Support vector regression
SVR extends SVM to regression tasks by focusing on capturing the relationship between input features and the target variable without minimizing errors for all data points. Instead, SVR balances error tolerance and model complexity by introducing an epsilon-insensitive “tube”: errors within the margin are not penalized, while those outside it incur penalties. This approach allows SVR to handle noise effectively while modeling both linear and nonlinear relationships via kernel functions such as the Radial Basis Function (RBF) kernel. SVR is particularly useful in fields requiring precise predictions and robustness, such as engineering, financial forecasting, and time-series analysis, due to its intrinsic regularization and ability to handle high-dimensional data28,45,46,47.
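The sketch below fits an RBF-kernel SVR with scikit-learn to a noisy nonlinear target; the C, epsilon, and data-generation choices are illustrative assumptions.

```python
# Support vector regression sketch (scikit-learn); hyperparameters are illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)  # nonlinear target with noise

svr = SVR(kernel="rbf",   # nonlinear relationship captured via the RBF kernel
          C=10.0,         # trade-off between model complexity and errors outside the tube
          epsilon=0.1)    # width of the epsilon-insensitive tube (no penalty inside)
svr.fit(X, y)
print(svr.predict([[0.5]]))
```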
Gradient boosting machine
Gradient Boosting Machine (GBM) is a prevailing ensemble method for classification and regression, building models sequentially to correct predecessor errors. Each iteration fits a shallow decision tree to the residuals of previous predictions, incrementally enhancing accuracy through a process moderated by a learning rate to prevent overfitting. Regularization methods, like tree pruning and subsampling, further enhance model generalization. Widely applied in finance for credit scoring, healthcare for predictive diagnostics, marketing for personalized recommendations, and NLP for sentiment analysis, GBM excels in complex, nonlinear problem-solving by effectively managing high-dimensional data and delivering superior predictive performance across diverse domains48,49.
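A minimal sketch of this sequential residual fitting with scikit-learn is shown below; the learning rate, tree depth, and subsampling rate are assumed values, not tuned settings.

```python
# Gradient boosting sketch (scikit-learn); hyperparameters are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=800, n_features=10, noise=2.0, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=300,   # shallow trees added sequentially
                                learning_rate=0.05, # moderates each tree's contribution
                                max_depth=3,        # keeps individual learners weak
                                subsample=0.8,      # stochastic subsampling for regularization
                                random_state=0)
gbm.fit(X, y)   # each new tree fits the residuals of the current ensemble
print(gbm.score(X, y))
```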
K-nearest neighbors
The KNN technique is a non-parametric, instance-driven learning algorithm that is utilized for regression and classification purposes. In a classification context, the process starts by measuring the distance from the data point that needs prediction to every other data point in the training dataset. Typical distance measures consist of Euclidean distance, Manhattan distance, or Minkowski distance, based on the specific problem being addressed. After calculating the distances, the KNN algorithm finds the K closest data points (neighbors) to the query point. The classification of the query point is based on a majority vote from these K neighbors; the class that occurs most often among the nearest neighbors is allocated to the query point. In regression, the forecast is usually the mean (or at times a weighted mean) of the target values from the K closest neighbors. The selection of K, which is the number of neighbors taken into account, is vital to the effectiveness of the model50.
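The sketch below shows both uses described above with scikit-learn: majority voting among neighbors for classification and (distance-weighted) averaging for regression. The choice of K, the Minkowski metric with p = 2 (Euclidean distance), and the synthetic data are illustrative assumptions.

```python
# K-nearest neighbors sketch (scikit-learn); K and the metric are illustrative.
from sklearn.datasets import make_classification, make_regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

Xc, yc = make_classification(n_samples=400, n_features=6, random_state=0)
knn_clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)  # p=2 -> Euclidean distance
knn_clf.fit(Xc, yc)
print(knn_clf.predict(Xc[:3]))   # class decided by majority vote among the 5 nearest neighbors

Xr, yr = make_regression(n_samples=400, n_features=6, noise=1.0, random_state=0)
knn_reg = KNeighborsRegressor(n_neighbors=5, weights="distance")        # weighted mean of neighbor targets
knn_reg.fit(Xr, yr)
print(knn_reg.predict(Xr[:3]))
```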
In finance, KNN is employed for credit scoring and fraud detection, where the algorithm classifies customers or transactions into safe or suspicious categories based on historical behavior. In e-commerce and marketing, KNN helps in customer segmentation, personalized recommendations, and predicting consumer behavior based on past interactions, allowing businesses to target specific groups of customers with tailored marketing strategies. In image recognition and computer vision, KNN can classify images or detect objects by comparing pixel features to those of known objects, making it useful in facial recognition, character recognition, and object detection tasks. In recommendation systems, KNN is used to suggest products, services, or content based on the preferences of similar users. It is also widely used in geospatial analysis for tasks like predicting geographic features, clustering regions with similar characteristics, and in areas like urban planning or agriculture for analyzing land use patterns28,51,52.
Extreme gradient boosting
XGBoost is a powerful machine learning technique based on the principles of gradient boosting, commonly used for classification and regression. The procedure begins by initializing a base model, typically a simple decision tree or other weak learner. It then iteratively builds decision trees in which each new tree corrects the errors made by the previously constructed trees. This is achieved by calculating the residuals of the predictions made by earlier trees, and the next tree is trained to minimize these residuals, adding further predictive power. Each subsequent model in the sequence is trained using a gradient descent approach, targeting the gradient of the loss function with respect to the model outputs. A key feature of XGBoost is the joint optimization of the tree structure and the regularization parameters to prevent overfitting. It incorporates two regularization terms, L1 and L2, which control model complexity and promote simpler, more generalizable trees. XGBoost also incorporates weighted quantile sketching, a method that enables efficient split finding, and handles sparse data effectively.
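The sketch below uses the xgboost Python package to show the learning rate and the L1/L2 regularization terms described above; all hyperparameter values and the synthetic data are illustrative assumptions.

```python
# XGBoost sketch (xgboost package); hyperparameters are illustrative.
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=12, noise=2.0, random_state=0)
model = XGBRegressor(n_estimators=400,
                     learning_rate=0.05,  # shrinks each tree's contribution
                     max_depth=4,         # controls tree complexity
                     reg_alpha=0.1,       # L1 regularization term
                     reg_lambda=1.0,      # L2 regularization term
                     subsample=0.8,
                     random_state=0)
model.fit(X, y)   # each new tree fits the gradient of the loss w.r.t. current predictions
print(model.predict(X[:3]))
```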
XGBoost is widely used in numerous fields due to its outstanding performance in handling large datasets and complex tasks. In finance, XGBoost is applied for risk assessment, credit scoring, and fraud detection by building predictive models that assess the likelihood of a loan default or identifying fraudulent transactions. Moreover, energy forecasting, credit card fraud detection, and image recognition in fields like autonomous driving or face recognition have seen significant improvements with XGBoost’s predictive power. The algorithm is also used in geospatial analysis to predict geographical trends and in gaming to predict player behavior and build intelligent game models. Its success in machine learning competitions is a testament to its effectiveness in solving a broad spectrum of prediction and classification problems across diverse industries53,54.
Light gradient boosting machine
LightGBM is a gradient boosting framework developed by Microsoft, optimized for speed and performance on large, high-dimensional datasets. It shares foundational principles with traditional gradient boosting, constructing decision trees sequentially to correct prior errors. However, LightGBM introduces significant enhancements by growing trees leaf-wise rather than level-wise, enabling faster convergence and improved accuracy. Its use of histogram-based algorithms further accelerates computation and decreases memory requirements. Its effectiveness in managing expansive datasets while maintaining accuracy and efficiency has cemented LightGBM’s popularity in data modeling, especially in scenarios demanding rapid model deployment and real-time predictions.
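A minimal sketch with the lightgbm package follows, highlighting the leaf-wise growth cap and histogram binning mentioned above; the parameter values and synthetic data are assumptions for illustration.

```python
# LightGBM sketch (lightgbm package); hyperparameters are illustrative.
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor

X, y = make_regression(n_samples=1000, n_features=12, noise=2.0, random_state=0)
lgbm = LGBMRegressor(n_estimators=400,
                     learning_rate=0.05,
                     num_leaves=31,   # leaf-wise growth is capped by leaf count, not depth
                     max_bin=255,     # histogram-based feature binning
                     random_state=0)
lgbm.fit(X, y)
print(lgbm.predict(X[:3]))
```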
LightGBM’s exceptional performance has led to its widespread adoption across various industries. In finance, it efficiently manages large datasets for risk assessment, credit scoring, high-frequency trading models, and fraud detection. Furthermore, in the technology and telecommunications industries, it aids in optimizing network operations and enhancing service quality. LightGBM’s efficiency and versatility make it an indispensable tool for tasks requiring expedient deployment and reliable predictive accuracy55,56.
Elastic net
Elastic Net combines Lasso and Ridge regression to address their individual drawbacks and provide effective prediction for high-dimensional datasets. It applies both L1 (Lasso) and L2 (Ridge) penalties to the regression model, resulting in a dual regularization approach. The L1 penalty encourages sparsity, effectively performing feature selection, while the L2 penalty ensures stability by shrinking the coefficients without eliminating them. This combination allows Elastic Net to handle situations where there are many correlated predictors, making it more robust than either Lasso or Ridge alone. The parameter λ controls the regularization strength, while the balance between the L1 and L2 penalties is determined by α, with α = 1 corresponding to Lasso and α = 0 to Ridge.
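The sketch below uses scikit-learn’s ElasticNet; note that scikit-learn names the overall penalty strength alpha and the L1/L2 mix l1_ratio, corresponding respectively to λ and α in the description above. The values and synthetic data are illustrative assumptions.

```python
# Elastic Net sketch (scikit-learn); penalty settings are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=400, n_features=40, n_informative=8,
                       noise=1.0, random_state=0)
enet = ElasticNet(alpha=0.5,     # overall regularization strength (lambda in the text)
                  l1_ratio=0.5)  # L1/L2 mix (alpha in the text): 1.0 -> Lasso, 0.0 -> Ridge
enet.fit(X, y)
print((enet.coef_ != 0).sum())   # partial sparsity from L1, stabilized by L2
```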
Elastic Net is particularly advantageous when dealing with datasets where predictors are highly correlated. While Lasso tends to select one variable from a group of associated inputs, potentially discarding relevant information, Elastic Net can retain multiple correlated features, improving both model interpretability and accuracy. This is achieved by adjusting the α parameter, allowing users to fine-tune the regularization approach based on the characteristics of the data. The method is widely used in machine learning and statistical modeling, especially when working with large datasets that contain many predictors, as it efficiently combines feature selection with regularization. Overall, Elastic Net offers a powerful and versatile solution for addressing the challenges of high-dimensional regression problems57,58.
Categorical boosting
CatBoost is a powerful gradient-boosting approach developed by Yandex, designed to address the challenges posed by categorical features in data modeling. Traditional gradient boosting methods typically require extensive preprocessing of categorical variables, converting them into numerical formats before modeling. In contrast, CatBoost natively handles categorical features, automatically transforming them through innovative techniques like permutation-driven transformations and ordered boosting. These processes help to preserve the intrinsic relationships within categorical data, leading to improved model accuracy and reduced overfitting. Furthermore, CatBoost employs symmetric trees, which enhance model inference speed and reduce prediction latency. Additionally, it supports GPU acceleration, enabling it to handle large-scale datasets efficiently. These features, combined with its ability to process mixed data types (categorical and numerical), make CatBoost a powerful and user-friendly tool for data scientists across various industries, including finance, e-commerce, and healthcare.
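The sketch below shows native categorical handling with the catboost package on a small synthetic data frame; the column names, hyperparameters, and data are hypothetical and purely for illustration.

```python
# CatBoost sketch (catboost package); data and hyperparameters are illustrative.
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=500),  # categorical feature
    "load": rng.normal(size=500),                                        # numerical feature
})
y = df["load"] * 2 + (df["region"] == "north") * 1.5 + rng.normal(0.0, 0.1, 500)

model = CatBoostRegressor(iterations=300, learning_rate=0.05, depth=6, verbose=0)
model.fit(df, y, cat_features=["region"])  # categorical column handled natively, no manual encoding
print(model.predict(df.head(3)))
```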
CatBoost’s strength lies in its capacity to handle complex, high-dimensional datasets that include significant categorical data. The library’s scalability, computational efficiency, and real-time predictive capabilities have made it a favored tool for large-scale, complex modeling tasks across a diverse range of sectors, including telecommunications, gaming, and automotive34.