Machine learning survival analysis

This is an adaptation of the scikit-survival Python package to analyze our dataset.

S. Pölsterl, “scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn,” Journal of Machine Learning Research, vol. 21, no. 212, pp. 1–6, 2020.

Encoding-Resampling

In the preceding analysis, we explored the optimal encoding strategy for our dataset. Building on that, the current phase investigates whether our significantly imbalanced event dataset requires resampling. To this end, we employed BayesSearchCV with RandomForestClassifier as the estimator.
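
A minimal sketch of this step, assuming scikit-optimize's BayesSearchCV; the synthetic data and search ranges below are placeholders for our encoded dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer

# Placeholder for our encoded, imbalanced event data (95/5 class split).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

search = BayesSearchCV(
    RandomForestClassifier(random_state=0),
    search_spaces={
        "criterion": Categorical(["gini", "entropy"]),
        "max_depth": Integer(2, 12),
        "n_estimators": Integer(10, 200),
    },
    n_iter=32,          # number of parameter settings sampled
    cv=3,
    scoring="roc_auc",  # threshold-free metric, robust to imbalance
    random_state=0,
)
search.fit(X, y)
print(search.best_score_, search.best_params_)
```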

The outcomes from this approach align with those from our deep learning model built with PyTorch's nn module, indicating that resampling is unnecessary. Notably, the results corroborate the earlier finding that the best-performing model was the one trained without resampling. This suggests that our modeling strategies are robust to class imbalance and can yield reliable predictions without altering the dataset's natural distribution.

Initial performance: 0.998

Best parameters:
criterion: 'entropy'
max_depth: 4
n_estimators: 88

No resampling suggested.

The starting performance was recorded at 0.998. The most effective parameters were criterion='entropy', max_depth=4, and n_estimators=88, and the analysis did not recommend resampling. Note that we set a performance threshold for the testing procedure: should the model's performance fall below 75%, training halts automatically.

random forest

Survival probability using random forest with 6 samples.

random forest

Cumulative hazard using random forest with 6 samples.
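
As a reference for how these curves are produced, here is a minimal sketch using sksurv's RandomSurvivalForest; the data generation below is a synthetic placeholder, not our actual preprocessing:

```python
import matplotlib.pyplot as plt
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

rsf = RandomSurvivalForest(n_estimators=100, random_state=0).fit(X, y)

# Each prediction is a step function; plot the first six samples.
for fn in rsf.predict_survival_function(X[:6]):
    plt.step(fn.x, fn(fn.x), where="post")
plt.xlabel("Time")
plt.ylabel("Survival probability")

# The cumulative hazard plot is produced the same way, via
# rsf.predict_cumulative_hazard_function(X[:6]).
```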

gradient boosting

Measurement of C-Index with learning rate 1.0 and max depth of 1.

Using an alpha of 0.1, we obtain Uno's C-index:

Returned values from concordance_index_ipcw: (0.9339609763243423, 74661, 4820, 0, 846)

Uno's C-index: 0.9339609763243423
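
For reference, a sketch of how this tuple is produced with concordance_index_ipcw, on synthetic placeholder data; Uno's estimator needs the training outcomes to model the censoring distribution, and the tau value here is an illustrative truncation horizon:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.metrics import concordance_index_ipcw
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = Surv.from_arrays(event=rng.random(400) < 0.6,
                     time=rng.exponential(scale=100, size=400))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingSurvivalAnalysis(
    learning_rate=1.0, max_depth=1, random_state=0).fit(X_tr, y_tr)

# The return tuple is (cindex, concordant, discordant, tied_risk, tied_time).
# tau truncates evaluation so test times stay within the training range.
result = concordance_index_ipcw(y_tr, y_te, model.predict(X_te), tau=100)
print("Uno's C-index:", result[0])
```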

component gradient boosting

Measurement of C-Index with learning rate 1.0. 

Number of non-zero coefficients: 4

Failure Type Encoded    0.569700
Type_H                 -0.207757
Torque_Nm               0.035211
Type_L                  0.028829
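
A sketch of how such coefficients are obtained with ComponentwiseGradientBoostingSurvivalAnalysis; the column names mirror our dataset, but the data here are synthetic placeholders:

```python
import numpy as np
import pandas as pd
from sksurv.ensemble import ComponentwiseGradientBoostingSurvivalAnalysis
from sksurv.util import Surv

# Synthetic stand-in; column names mirror our dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)),
                 columns=["Failure Type Encoded", "Type_H",
                          "Torque_Nm", "Type_L"])
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

est = ComponentwiseGradientBoostingSurvivalAnalysis(
    learning_rate=1.0, random_state=0).fit(X, y)

# The first entry of coef_ is the intercept; the rest align with the columns.
coef = pd.Series(est.coef_, index=["Intercept"] + list(X.columns))
nonzero = coef[coef != 0].drop("Intercept", errors="ignore")
print("Number of non-zero coefficients:", nonzero.shape[0])
print(nonzero.sort_values(key=abs, ascending=False))
```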

gradient boosting fine-tuning

The key tuning factor in gradient boosting is the number of base learners to employ (specified by the n_estimators argument). Increasing this number results in a more complex model, but it also risks overfitting the training data. The simplest solution is to reduce the number of base learners, but there are three other methods to prevent overfitting: shrinking the contribution of each base learner with a learning rate below 1, randomly dropping base learners (dropout), and fitting each base learner on a random subsample of the data (stochastic gradient boosting); see the sketch below.

Recall from the previous section that the best-performing deep learning model used no dropout and no regularization.
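
A minimal sketch of those three regularization options as exposed by GradientBoostingSurvivalAnalysis; the parameter values are illustrative, not tuned for our dataset:

```python
from sksurv.ensemble import GradientBoostingSurvivalAnalysis

# Three ways to curb overfitting besides lowering n_estimators:
shrinkage = GradientBoostingSurvivalAnalysis(
    n_estimators=1000, learning_rate=0.1)  # 1. shrinkage: small learning rate
subsampled = GradientBoostingSurvivalAnalysis(
    n_estimators=1000, learning_rate=0.1,
    subsample=0.5)                         # 2. stochastic gradient boosting
dropout = GradientBoostingSurvivalAnalysis(
    n_estimators=1000, learning_rate=0.1,
    dropout_rate=0.1)                      # 3. dropout of base learners
```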

oob improvement

The monitor computes the average improvement over the preceding 25 iterations; if it has been negative for the past 50 iterations, training is terminated. The graph above shows the improvement for each base learner together with the moving average.
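
A sketch of such an early-stopping monitor, following the pattern in the scikit-survival documentation; note that oob_improvement_ is only recorded when subsample < 1.0, and the data below are synthetic placeholders:

```python
import numpy as np
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.util import Surv

class EarlyStoppingMonitor:
    """Stop when the moving average of the OOB improvement stays negative."""

    def __init__(self, window_size=25, max_iter_without_improvement=50):
        self.window_size = window_size
        self.max_iter_without_improvement = max_iter_without_improvement
        self._best_step = -1

    def __call__(self, iteration, estimator, args):
        if iteration < self.window_size:
            return False  # always run at least window_size iterations
        start = iteration - self.window_size + 1
        improvement = np.mean(estimator.oob_improvement_[start:iteration + 1])
        if improvement > 1e-6:
            self._best_step = iteration
            return False
        # stop once there has been no improvement for the patience window
        return iteration - self._best_step >= self.max_iter_without_improvement

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

# subsample < 1.0 is required so that oob_improvement_ is recorded.
model = GradientBoostingSurvivalAnalysis(
    n_estimators=1000, subsample=0.5, random_state=0)
model.fit(X, y, monitor=EarlyStoppingMonitor(25, 50))
print("stopped after", len(model.estimators_), "iterations")
```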

gradient boosting loss (squared)

gradient boosting loss (coxph)

CUMULATIVE HAZARD (LOSS=SQUARED)

Time-To-Event Assumptions.

CUMULATIVE HAZARD (LOSS=COXPH)

Proportional Hazard Assumptions.

PENALIZING COX MODELS (FOR VARYING ALPHA)

It’s evident that when the penalty carries significant weight (towards the right), all coefficients are shrunk to nearly zero. As we lessen the weight of the penalty, the coefficient values rise. Notably, the paths for ‘Air temperature’ and ‘Failure Type’ swiftly diverge from the rest of the coefficients, suggesting that these features play a crucial role in predicting the time to failure.

Note: We also plotted the results using Lasso and Elastic Net below. The other features included in the calculation are hidden behind the Type_M line.
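
A sketch of how such a coefficient path is computed and plotted with CoxnetSurvivalAnalysis; l1_ratio=1.0 gives the Lasso path, while values in (0, 1) give Elastic Net. The data and column names below are placeholders:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

# Synthetic stand-in; column names mirror our dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=["Air temperature", "Failure Type",
                          "Torque_Nm", "Type_M", "Type_L"])
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

# l1_ratio=1.0 fits the Lasso path; values in (0, 1) give Elastic Net.
cox = CoxnetSurvivalAnalysis(l1_ratio=1.0, alpha_min_ratio=0.01).fit(X, y)

# coef_ holds one column per alpha along the regularization path.
for i, name in enumerate(X.columns):
    plt.semilogx(cox.alphas_, cox.coef_[i], label=name)
plt.xlabel("alpha")
plt.ylabel("coefficient")
plt.legend()
```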

CHOOSING THE PENALTY STRENGTH

The results can be graphically represented by plotting the average concordance index and its standard deviation across all folds for each alpha. The diagram illustrates that when alpha is too large (towards the right), all coefficients shrink to zero and performance drops to the 0.5 concordance index of a completely random model. Conversely, if alpha becomes too small, an excessive number of features enter the model and performance again begins to resemble that of a random model. The optimal point (indicated by the orange line) lies somewhere between these extremes.
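
A sketch of the underlying search, cross-validating one alpha at a time with GridSearchCV; by default each fit is scored with Harrell's concordance index, and the data are synthetic placeholders:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

# Fit once to obtain a path of candidate alphas, then evaluate each alpha
# individually; error_score=0.5 treats failed fits as a random model.
path = CoxnetSurvivalAnalysis(l1_ratio=1.0, alpha_min_ratio=0.01).fit(X, y)
gcv = GridSearchCV(
    CoxnetSurvivalAnalysis(l1_ratio=1.0),
    param_grid={"alphas": [[a] for a in path.alphas_]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    error_score=0.5,
).fit(X, y)
print(gcv.best_params_)
# gcv.cv_results_ provides the mean/std concordance index plotted above.
```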

survival probabilities from 10 samples

Once a specific alpha is chosen, we can carry out predictions, either as a risk score via the predict function or in terms of the survival or cumulative hazard function. The graph shows how positive or negative "Torque" affects each survival function. In our original dataset these failed samples are due to "Power Failure", hence the graph shows negative results.
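
A sketch of this prediction step; the alpha below is a placeholder for the value selected above, and fit_baseline_model=True is required before survival or cumulative hazard functions can be predicted:

```python
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

# alphas=[0.01] is a placeholder for the alpha selected above;
# fit_baseline_model=True enables survival/cumulative hazard predictions.
model = CoxnetSurvivalAnalysis(l1_ratio=1.0, alphas=[0.01],
                               fit_baseline_model=True).fit(X, y)

risk_scores = model.predict(X[:10])              # risk scores
for fn in model.predict_survival_function(X[:10]):
    print(fn.x[0], fn(fn.x[0]))                  # or plt.step(fn.x, fn(fn.x))
```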

c-index for cgb

Using Component Gradient Boosting with the same parameter grid.

SURVIVAL PROBABILITY for cgb

Using Component Gradient Boosting with six samples and their survival probability.

LINEAR SURVIVAL SUPPORT VECTOR MACHINE

The final segment of the setup outlines the array of parameters we aim to test and the number of training and testing repetitions we wish to execute for each parameter configuration. Ultimately, the parameters that consistently deliver the best performance across all test sets (in this case, 100) are chosen.

We interpreted the model in two distinct ways: as a ranking problem (ordering samples by their risk of failure) and as a regression problem (predicting the time of failure directly).
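
A sketch of this setup with FastSurvivalSVM; the alpha grid and split sizes are illustrative, and the data are synthetic placeholders. The rank_ratio parameter switches between the two interpretations: 1.0 is pure ranking, 0.0 is pure regression on (log) survival time:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sksurv.svm import FastSurvivalSVM
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

# 100 random train/test repetitions for every parameter configuration.
cv = ShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
gcv = GridSearchCV(
    FastSurvivalSVM(rank_ratio=1.0, max_iter=1000, random_state=0),
    param_grid={"alpha": 2.0 ** np.arange(-12, 13, 2)},
    cv=cv,
    n_jobs=-1,
).fit(X, y)
print(gcv.best_params_)
```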

KERNEL SURVIVAL SUPPORT VECTOR MACHINE

The Kernel Survival Support Vector Machine, an extension of the Linear Survival Support Vector Machine, is capable of handling intricate relationships between features and survival time. It’s implemented in sksurv.svm.FastKernelSurvivalSVM. However, the selection of the kernel function and its hyper-parameters can be challenging and often necessitates fine-tuning for optimal results. FastKernelSurvivalSVM also supports numerous other built-in kernel functions, which can be utilized by specifying their names as the kernel parameter.
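
A minimal sketch with an RBF kernel; the kernel choice and gamma value are illustrative assumptions, as are the synthetic data:

```python
import numpy as np
from sksurv.svm import FastKernelSurvivalSVM
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = Surv.from_arrays(event=rng.random(300) < 0.6,
                     time=rng.exponential(scale=100, size=300))

# kernel accepts names such as "linear", "poly", "rbf", "sigmoid", "cosine".
kssvm = FastKernelSurvivalSVM(kernel="rbf", gamma=0.1, alpha=1.0,
                              random_state=0).fit(X, y)
print(kssvm.score(X, y))  # Harrell's concordance index
```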

cumulative hazard

Component Wise Gradient Boosting Survival Analysis

hazard function

Component Wise Gradient Boosting Survival Analysis

TIME-DEPENDENT AREA UNDER THE ROC

The plot shows the estimated area under the time-dependent ROC at each time point, with the average across all time points as a dashed line.
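
For reference, a sketch of producing this plot with cumulative_dynamic_auc; the data are synthetic placeholders, and the evaluation times must lie within the observed test follow-up range:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import cumulative_dynamic_auc
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = Surv.from_arrays(event=rng.random(400) < 0.6,
                     time=rng.exponential(scale=100, size=400))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomSurvivalForest(random_state=0).fit(X_tr, y_tr)
times = np.percentile(y_te["time"], np.linspace(10, 80, 15))
auc, mean_auc = cumulative_dynamic_auc(y_tr, y_te, model.predict(X_te), times)

plt.plot(times, auc, marker="o")
plt.axhline(mean_auc, linestyle="--")  # average across all time points
plt.xlabel("Time")
plt.ylabel("Time-dependent AUC")
```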

EVALUATING THE SCIKIT ROC PREDICTION

Using the test data, we want to assess how well the model can distinguish survivors from failures in 7-minute intervals, up to 251 minutes from the time of observation. NOTE: The dashed line is the proportional hazard baseline assumption.

using time-dependent risk scores

Certainly, the Random Survival Forest exhibits marginally superior performance on average, primarily due to its better performance within the 125-250 min intervals. However, it performs worse over the earlier 25-125 min range at the start of the observation period. This demonstrates that while the mean AUC is a handy measure of overall performance, it may conceal interesting effects that only become apparent when examining the AUC at specific time points.

Using Metrics in Hyper-parameter Search

Typically, estimators have hyper-parameters that we want to optimize, such as the maximum tree depth for tree-based learners. To achieve this, we can employ scikit-learn's GridSearchCV to find the hyper-parameter configuration that performs best on average. By default, the performance of estimators is assessed using Harrell's concordance index, as implemented in concordance_index_censored.

In the plots above, the AUC gets a separate plot showing the mean and median for each depth tested. Note that when using the integrated Brier score (IBS), the optimal max depth is 9.
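
To optimize a metric other than Harrell's concordance index, scikit-survival provides scorer wrappers such as as_integrated_brier_score_scorer (and, analogously, as_concordance_index_ipcw_scorer and as_cumulative_dynamic_auc_scorer). A sketch with synthetic placeholder data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import as_integrated_brier_score_scorer
from sksurv.util import Surv

# Synthetic stand-in; times are chosen inside the observed follow-up range.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = Surv.from_arrays(event=rng.random(400) < 0.6,
                     time=rng.exponential(scale=100, size=400))
times = np.percentile(y["time"], np.linspace(10, 80, 8))

# The wrapper exposes the model under the "estimator" prefix and negates
# the IBS, so that GridSearchCV's maximization picks the lowest integrated
# Brier score.
gcv = GridSearchCV(
    as_integrated_brier_score_scorer(RandomSurvivalForest(random_state=0),
                                     times=times),
    param_grid={"estimator__max_depth": [3, 6, 9]},
    cv=3,
).fit(X, y)
print(gcv.best_params_)
```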

The plot above shows the results of a GridSearchCV experiment used to find the optimal max_depth for a random survival forest estimator.

GridSearchCV is a common technique for hyperparameter tuning in machine learning. It tries a variety of parameter values for a chosen model and evaluates the model's performance for each setting. In this case, the experiment evaluated different max_depth values for a random survival forest model.

Max depth is a hyperparameter that controls the maximum depth of trees in the forest. Deeper trees can potentially learn more complex relationships between features, but they are also more prone to overfitting.

The x-axis of the plot shows the different values of max_depth that were evaluated. The y-axis shows the test score of the model for each setting; since no custom scorer was specified, this is the estimator's default metric, Harrell's concordance index, as noted above.

"Fitting 3 folds for each of 9 candidates, totalling 27 fits". This suggests that the GridSearchCV experiment used 3-fold cross-validation. Cross-validation is a technique that is used to evaluate the generalizability of a machine learning model. In 3-fold cross-validation, the data is split into three folds. The model is trained on two of the folds, and then evaluated on the third fold. This process is repeated three times, so that each fold is used for evaluation once.

Looking at the plot itself, the test score increases with max_depth until it reaches a local maximum around 4; it then rises again to a peak at 7, the optimal depth. This suggests that a max_depth of 7 is the best choice for this particular model and dataset.

Overall, the plot suggests that the GridSearchCV experiment succeeded in finding a good value for the max_depth hyperparameter. The test score is highest when max_depth is set to 7, indicating that a random survival forest with a max_depth of 7 should perform well on unseen data.
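
A sketch of the experiment described above, with nine candidate depths and 3-fold cross-validation (27 fits in total), scored by default with Harrell's concordance index; the data are synthetic placeholders:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

# Synthetic stand-in for our preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = Surv.from_arrays(event=rng.random(400) < 0.6,
                     time=rng.exponential(scale=100, size=400))

# 9 candidate depths x 3 folds = 27 fits, matching the log message above.
gcv = GridSearchCV(
    RandomSurvivalForest(n_estimators=100, random_state=0),
    param_grid={"max_depth": np.arange(1, 10)},
    cv=3,
    verbose=1,  # prints "Fitting 3 folds for each of 9 candidates..."
).fit(X, y)
print(gcv.best_params_)  # e.g. {'max_depth': 7} in our experiment
```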