WTTE-RNN

Implementation of Weibull Time-To-Event with a Recurrent Neural Network.
References for this document: the GitHub repos of ragulpr and spacagna.

Deep learning survival analysis is covered in this document.

creating the model ARCHITECTURE

This architecture is tailored for scenarios where predicting the time until an event occurs is crucial, such as in predictive maintenance (forecasting when equipment will fail) or in healthcare (estimating the time to an event like hospital readmission). The LSTM layer enables the model to capture temporal dependencies and patterns in sequential data, while the output layer and activation functions ensure the predictions are in a valid range for Weibull parameters, making it suitable for time-to-event analysis.

Below is a description of the WTTE_RNN class, a neural network architecture specifically designed for a Weibull Time to Event (WTTE) prediction model using PyTorch. This architecture predicts the parameters of a Weibull distribution, which are used to model the time until an event occurs, typically in the context of survival analysis or predictive maintenance. Here's a breakdown of its components and functionality:

Components

The network's output layer produces the Weibull parameters α (shape) and β (scale) that this model predicts.

Forward Pass (forward Method)
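A minimal sketch of these components and the forward pass is shown below, assuming a single LSTM layer followed by a fully connected head; the layer sizes, activations, and use of the last time step are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class WTTE_RNN(nn.Module):
    """Sketch: LSTM that outputs Weibull parameters (alpha = shape, beta = scale)."""

    def __init__(self, input_size, hidden_size=100, num_layers=1, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True,
                            dropout=dropout if num_layers > 1 else 0.0)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, 2)      # two outputs: alpha and beta
        self.softplus = nn.Softplus()            # keeps both parameters strictly positive

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, _ = self.lstm(x)
        out = self.dropout(out[:, -1, :])        # hidden state at the last time step
        out = self.fc(out)
        alpha = self.softplus(out[:, 0]) + 1e-6  # shape parameter, > 0
        beta = self.softplus(out[:, 1]) + 1e-6   # scale parameter, > 0
        return alpha, beta
```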

We use the Weibull log-likelihood, parameterized by α and β, as the loss function of our model. We also added a Weibull beta penalty with location=30, the value at which the penalty starts to increase, and growth=10.0, which controls how fast the penalty grows for beta values beyond the location.
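A hedged sketch of such a loss, following this document's convention (α = shape, β = scale); the exponential penalty form below is an assumption based on the location/growth description above and may differ from the exact implementation.

```python
import torch

def weibull_loglik(alpha, beta, t, event, eps=1e-9):
    """Censored Weibull log-likelihood (alpha = shape, beta = scale)."""
    tb = (t + eps) / beta
    # uncensored: log f(t); censored: log S(t) = -(t/beta)^alpha
    return event * (torch.log(alpha) - torch.log(beta) + (alpha - 1.0) * torch.log(tb)) \
           - torch.pow(tb, alpha)

def wtte_loss(alpha, beta, t, event, location=30.0, growth=10.0):
    """Negative log-likelihood plus a penalty that kicks in for beta > location."""
    nll = -weibull_loglik(alpha, beta, t, event).mean()
    penalty = torch.exp(growth * (beta - location)).mean()  # grows rapidly past `location`
    return nll + penalty
```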

training and validation loss

With a dropout rate of 0.50, learning rate of 0.001, hidden size of 100, and 200 epochs, the model achieved a Test Loss of 0.0328 and a Concordance Index (CI) of 0.9427.

CI measures the ranking accuracy of a model's predictions. A CI of 1.0 indicates perfect concordance, meaning the model always predicts higher risks for individuals who experience events (e.g., death, failure) sooner than those who experience them later or not at all. A CI of 0.5 suggests no better accuracy than random chance.

To calculate the Concordance Index (CI), we converted the predicted Weibull parameters into a risk score. For a Weibull distribution, the median time to event for a given α (shape) and β (scale) is β(ln 2)^(1/α), so a shorter predicted median corresponds to a higher risk.
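A sketch of this calculation, assuming the lifelines package; alpha_pred, beta_pred, t_test, and e_test stand in for the model outputs and the test-set times and event flags. The predicted median is used directly as the score, since concordance_index expects higher scores for longer survival.

```python
import numpy as np
from lifelines.utils import concordance_index

def weibull_median(alpha, beta):
    """Median time-to-event for shape alpha and scale beta."""
    return beta * np.log(2.0) ** (1.0 / alpha)

# alpha_pred, beta_pred: model outputs; t_test, e_test: observed times and event indicators
predicted_median = weibull_median(alpha_pred, beta_pred)
ci = concordance_index(t_test, predicted_median, event_observed=e_test)
print(f"Concordance Index (CI): {ci:.4f}")
```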

effect of zero inflation

Alpha (Shape parameter): The estimate is 5.82476. This shape parameter, as discussed previously, affects the failure rate's behavior over time.

Beta (Scale parameter): The estimate is 3.54913. This scale parameter spreads out the time to failure or event over the timeline.

ZI (Zero-Inflation parameter): The estimate is 0, suggesting that the maximum likelihood estimation did not find evidence of zero inflation in the data. Essentially, the model fit concluded that there isn't a portion of the population for which the failure event never happens.

INTERPRETATION

In our plot above, most data points in the middle range seem to fit well, lying close to the fitted Weibull line, but there are some deviations at the lower and higher ends. This could indicate that while the Weibull model generally fits well, it's not perfect, particularly for very early or very late failures. We fitted different models to check how the events affect the alpha and beta values.

Given this, the Zero-Inflated Weibull model might not be necessary, and a standard Weibull model might suffice, as suggested by the zero estimate for the ZI parameter. The plot suggests that the data follows a Weibull distribution quite well for most of the range but may require additional analysis for early failures (the leftmost points below the fitted line) and potentially for late failures (if any points are significantly above the line at the rightmost end).

effect of beta penalty

Our model's current penalty and loss functions lead to overestimated beta values compared to the Weibull baseline. The penalty incorporates the location at which it starts to increase and the growth, which controls how fast it grows for beta values beyond that location.

Plots reveal aggressive predictions from the deep learning model, indicating a need for calibration. We should focus on refining the beta penalty calculation to optimize performance. After recalibration, both trace plots show "noisy" samples that do not settle down to a stable value, which is indicative of a very imbalanced dataset.

FITTING WEIBULL DISTRIBUTION

Censoring Indicators (dots)

If the individual's event is censored ('Target' == 0), a black circle ('ko') is used to mark the point on the curve corresponding to the last observed time.


If the event is not censored ('Target' == 1), a red circle ('ro') marks the event time on the curve.
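A minimal sketch of such a plot, assuming a dataframe with per-individual 'alpha', 'beta', 'Time', and 'Target' columns (the column names other than 'Target' are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def weibull_sf(t, alpha, beta):
    """Weibull survival function, alpha = shape, beta = scale."""
    return np.exp(-np.power(t / beta, alpha))

t_grid = np.linspace(0.01, 60, 200)
for _, row in df.head(20).iterrows():
    plt.plot(t_grid, weibull_sf(t_grid, row['alpha'], row['beta']), alpha=0.4)
    marker = 'ro' if row['Target'] == 1 else 'ko'   # red = event, black = censored
    plt.plot(row['Time'], weibull_sf(row['Time'], row['alpha'], row['beta']), marker)
plt.xlabel('Time')
plt.ylabel('Survival probability')
plt.show()
```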

RANDOM ALPHAS AND BETAS

When we use the Weibull log-likelihood function as a loss in our deep learning model, the goal is to find the best α (shape parameter) and β (scale parameter) for each iteration, aiming to maximize the log-likelihood (or equivalently, minimize the negative log-likelihood) of the observed data given the model parameters.


The Weibull distribution's log-likelihood function, given survival times T, event indicators E (where E=1 if the event of interest, like failure or death, has occurred, and E=0 otherwise), shape parameter α, and scale parameter β, is a way to measure how well our model's parameters fit the observed data. For censored observations (where E=0), the likelihood contribution is adjusted to account for the fact that the event has not been observed up to the time T.
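Written out in this document's convention (α = shape, β = scale), the censored Weibull log-likelihood is:

```latex
\log L(\alpha, \beta) = \sum_{i=1}^{n} \left[ E_i \left( \log\frac{\alpha}{\beta} + (\alpha - 1)\log\frac{T_i}{\beta} \right) - \left(\frac{T_i}{\beta}\right)^{\alpha} \right]
```

Uncensored observations (E_i = 1) contribute the full log-density, while censored ones (E_i = 0) contribute only the log-survival term -(T_i/β)^α.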


Deep Learning Optimization Process


Forward Pass: The model predicts α and β for each observation based on its input features. This prediction is made by the neural network's architecture, which learns complex relationships between the input features and the parameters of the Weibull distribution.


Loss Calculation: The predicted α and β are used in the Weibull log-likelihood function to calculate the loss. This function compares the model's predictions with the actual observed survival times and event indicators. By using the negative log-likelihood as the loss, we are essentially measuring how unlikely the observed data is under the model parameters predicted by the network.


Backward Pass (Gradient Descent): The gradient of the loss with respect to the model parameters (weights and biases in the neural network, including those affecting α and β) is computed. The model parameters are then updated in a direction that reduces the loss using an optimization algorithm; in this case, Adam.


Iteration: This process is repeated for many iterations (in this case, 200 epochs), with the model parameters continuously adjusted to reduce the loss. Over time, this converges towards the set of parameters that makes the observed data most probable under the Weibull model (maximizing the log-likelihood).
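Putting these steps together, a sketch of the training loop, assuming the WTTE_RNN model and wtte_loss sketched earlier; x_train, t_train, and e_train stand in for the training tensors, and batching and validation are omitted for brevity.

```python
import torch

model = WTTE_RNN(input_size=x_train.shape[-1], hidden_size=100, dropout=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    alpha, beta = model(x_train)                       # forward pass: predict Weibull parameters
    loss = wtte_loss(alpha, beta, t_train, e_train)    # negative log-likelihood + beta penalty
    loss.backward()                                    # backward pass: gradients w.r.t. weights
    optimizer.step()                                   # Adam update
```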


CALCULATING THE BRIER SCORE AND IBS:

After obtaining the alphas and betas from the trained model, we calculate the Brier Score based on the Weibull survival probabilities as a function of time, alpha, and beta. We obtained a Brier Score of 0.06667745.
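A sketch of the Brier Score at a single evaluation time t*, using the Weibull survival probability per individual; for simplicity this version ignores inverse-probability-of-censoring weights, so it is an approximation of the full censoring-adjusted score.

```python
import numpy as np

def weibull_sf(t, alpha, beta):
    """Weibull survival function, alpha = shape, beta = scale."""
    return np.exp(-np.power(t / beta, alpha))

def brier_score(t_star, alpha, beta, t_obs):
    """Mean squared difference between predicted survival at t_star and observed status."""
    s_pred = weibull_sf(t_star, alpha, beta)
    survived = (t_obs > t_star).astype(float)   # 1 if still event-free at t_star
    return np.mean((survived - s_pred) ** 2)
```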

After obtaining the Brier Score, we merge the scaled tensor output back onto our original data so we can check that the features align with our test set. Once aligned, we compute the Integrated Brier Score for each Weibull function to determine its score on this dataset.
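A sketch of the Integrated Brier Score calculation: the Brier Score is evaluated over a grid of times and averaged by trapezoidal integration, once per Weibull function (sf, cdf, pdf, hf, chf). It assumes weibull_sf from the sketch above, with alpha_pred, beta_pred, and t_obs standing in for the model outputs and test-set times.

```python
import numpy as np

def integrated_brier_score(prob_fn, alpha, beta, t_obs, times):
    """Integrate the Brier Score of prob_fn(t, alpha, beta) over `times` and normalize."""
    scores = []
    for t_star in times:
        pred = prob_fn(t_star, alpha, beta)
        survived = (t_obs > t_star).astype(float)
        scores.append(np.mean((survived - pred) ** 2))
    return np.trapz(scores, times) / (times[-1] - times[0])

times = np.linspace(t_obs.min(), t_obs.max(), 50)
print("weibull_sf IBS:", integrated_brier_score(weibull_sf, alpha_pred, beta_pred, t_obs, times))
```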

weibull_pdf Integrated Brier Score: 0.034

weibull_sf Integrated Brier Score: 0.16714285714285718

weibull_cdf Integrated Brier Score: 0.83285714285714298

weibull_hf Integrated Brier Score: 2.54276872764917e+47

weibull_chf Integrated Brier Score: 1.260373549927319e+50

The results for different Weibull functions reveal significant variability, indicating how each function's predictions align with the observed survival data.

weibull_pdf (Probability Density Function) IBS: 0.034

This is a very low IBS, suggesting that the probability density function provides good prediction accuracy for the time to event data. However, we will not use PDF for survival probability predictions directly, as PDF represents the rate of change in survival probability at each time point rather than the survival probability itself. 

weibull_sf (Survival Function) IBS: 0.16714285714285718

This score is typical for survival analysis, as the survival function (SF) directly estimates the survival probability over time. An IBS of ~0.167 indicates a decent model performance, showing that the model's survival function predictions align relatively well with the actual survival outcomes.

weibull_cdf (Cumulative Distribution Function) IBS: 0.83285714285714298

The CDF is the complement of the survival function (1 - SF), representing the probability of failure by time t. The higher IBS compared to SF suggests less predictive accuracy when directly interpreting CDF values as survival probabilities, which is expected since CDF and SF convey opposite probabilities.

weibull_hf (Hazard Function) IBS: 2.54276872764917e+47

The hazard function describes the instantaneous rate of failure at any given time, assuming the individual has survived up to that time. Such an extremely high IBS indicates that using the hazard function directly for survival probability prediction is inappropriate and leads to highly inaccurate predictions.

weibull_chf (Cumulative Hazard Function) IBS: 1.260373549927319e+50

Similar to the hazard function, the cumulative hazard function provides information on the accumulation of risk over time rather than direct survival probabilities. The astronomically high IBS further confirms that CHF is not suitable for direct survival probability predictions.

Interpretation and Recommendations

The survival function (SF) is the most appropriate Weibull function for calculating survival probabilities, as evidenced by its low IBS among the functions that can be interpreted directly as survival probabilities. Survival analysis typically relies on SF for estimating the probability that an event (e.g., death in medical analysis, failure for machinery or equipment) has not occurred by a certain time.

The PDF, CDF, HF, and CHF serve different purposes in survival analysis, providing insights into the event rate, accumulated probability of event occurrence, instantaneous risk, and accumulated risk, respectively.
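For reference, the five Weibull functions compared above, in this document's convention (α = shape, β = scale), are:

```latex
\begin{aligned}
S(t) &= e^{-(t/\beta)^{\alpha}} && \text{survival function (SF)} \\
F(t) &= 1 - e^{-(t/\beta)^{\alpha}} && \text{cumulative distribution function (CDF)} \\
f(t) &= \frac{\alpha}{\beta}\left(\frac{t}{\beta}\right)^{\alpha-1} e^{-(t/\beta)^{\alpha}} && \text{probability density function (PDF)} \\
h(t) &= \frac{\alpha}{\beta}\left(\frac{t}{\beta}\right)^{\alpha-1} && \text{hazard function (HF)} \\
H(t) &= \left(\frac{t}{\beta}\right)^{\alpha} && \text{cumulative hazard function (CHF)}
\end{aligned}
```

Only S(t) and its complement F(t) are bounded between 0 and 1, which is why plugging the unbounded hazard-based functions into a squared-error score produces the astronomically large IBS values above.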


REMAINING USEFUL LIFE

After calibration, we predicted the first sample of the dataset, which is censored. We converted the predicted RUL by transforming the scaled predicted time back to actual units, and then fitted the Weibull prediction for our probability-of-failure curve.
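A sketch of this step, using the predicted median as the RUL point estimate; time_scaler is a hypothetical fitted sklearn-style scaler for the time variable, and alpha_pred, beta_pred stand in for the model outputs.

```python
import numpy as np
import matplotlib.pyplot as plt

alpha_0, beta_0 = alpha_pred[0], beta_pred[0]             # first (censored) test sample
rul_scaled = beta_0 * np.log(2.0) ** (1.0 / alpha_0)      # predicted median time-to-event, scaled units
rul_actual = time_scaler.inverse_transform([[rul_scaled]])[0, 0]  # back to original units

# probability-of-failure curve in scaled time: F(t) = 1 - exp(-(t/beta)^alpha)
t_grid = np.linspace(1e-3, 3 * beta_0, 200)
plt.plot(t_grid, 1 - np.exp(-np.power(t_grid / beta_0, alpha_0)))
plt.axvline(rul_scaled, linestyle='--')                   # predicted median RUL
plt.xlabel('Scaled time')
plt.ylabel('Probability of failure')
plt.show()
```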

PREDICTION PLOTS:

For each row in results_df, which contains the parameters and actual times, it calculates the Weibull probability density function (PDF) based on the alpha (shape parameter) and beta (scale parameter) values from the LSTM model's results.
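A sketch of that per-row PDF calculation, assuming results_df has 'alpha', 'beta', and 'Tool_wear_min_original' columns:

```python
import numpy as np
import matplotlib.pyplot as plt

def weibull_pdf(t, alpha, beta):
    """Weibull density, alpha = shape, beta = scale."""
    return (alpha / beta) * np.power(t / beta, alpha - 1) * np.exp(-np.power(t / beta, alpha))

t_grid = np.linspace(0.01, results_df['Tool_wear_min_original'].max(), 200)
for _, row in results_df.head(50).iterrows():
    plt.plot(t_grid, weibull_pdf(t_grid, row['alpha'], row['beta']), alpha=0.3)
plt.xlabel('Tool wear (min)')
plt.ylabel('Weibull PDF')
plt.show()
```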

Maximum A Posteriori (MAP) predictions:

It compares the Maximum A Posteriori (MAP) predictions—median, mode, and mean—against the actual observed values (Tool_wear_min_original) for 50 random samples.
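The three central-tendency measures can be computed from α (shape) and β (scale) as sketched below; the column names are assumptions, and the mode formula applies for α > 1.

```python
import numpy as np
from scipy.special import gamma

def weibull_mean(alpha, beta):
    return beta * gamma(1.0 + 1.0 / alpha)

def weibull_median(alpha, beta):
    return beta * np.log(2.0) ** (1.0 / alpha)

def weibull_mode(alpha, beta):
    # the mode is 0 when alpha <= 1
    return np.where(alpha > 1, beta * ((alpha - 1.0) / alpha) ** (1.0 / alpha), 0.0)

results_df['predicted_mean'] = weibull_mean(results_df['alpha'], results_df['beta'])
results_df['predicted_median'] = weibull_median(results_df['alpha'], results_df['beta'])
results_df['predicted_mode'] = weibull_mode(results_df['alpha'], results_df['beta'])
```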

Three scatter plots are generated to visualize the relationship between actual tool wear times and the Weibull distribution's statistical measures (mean, median, and mode) for each parameter set.

This plot is helpful for assessing the accuracy of the Weibull model's central tendency predictions against the actual times.

PREDICTION ERROR:

It visualizes the distribution of the errors between the actual tool wear times and the mode predictions.

A histogram is used to plot the errors (the differences between the actual times and the mode predictions), with a Kernel Density Estimate (KDE) overlaid to show the error distribution's shape.
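A sketch of this error plot, assuming seaborn and the column names used above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

errors = results_df['Tool_wear_min_original'] - results_df['predicted_mode']
sns.histplot(errors, kde=True)                 # histogram of errors with a KDE overlay
plt.axvline(0, color='red', linestyle='--')    # unbiased predictions center here
plt.xlabel('Actual time - predicted mode (min)')
plt.show()
```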

This plot can be used to understand the dispersion and bias of the errors. Ideally, the error distribution should be centered around zero, indicating that the predictions are unbiased.

PREDICTION MODE:

The main scatterplot (Test Set):

Each point in the scatterplot corresponds to a single observation in the dataset.

The points are plotted based on the predicted mode from the Weibull distribution against the actual tool wear time.

The line through the scatterplot with a shaded region suggests a trend line with its confidence interval, indicating the general relationship between the predicted mode and the actual values. The trend line helps to understand whether there's a positive, negative, or no correlation between the predicted mode and the actual tool wear.

The histogram on the top:

Shows the distribution of the predicted_mode values. It provides an insight into the frequency of predicted mode values across the dataset.

The shape of this histogram can give us an idea about the central tendency and spread of the mode predictions.

The histogram on the right:

Shows the distribution of the Tool_wear_min_original values. It indicates how the actual tool wear times are distributed.

Similar to the top histogram, it tells about the central tendency and the variability in the actual observed tool wear times.

 If the scatterplot shows a diagonal pattern from the bottom left to the top right, it would indicate a positive correlation, suggesting that as the predicted mode increases, the actual tool wear time tends to increase as well. Conversely, a pattern from the top left to the bottom right would indicate a negative correlation.

In this plot, while there's a concentration of data points along a line in the scatterplot, suggesting some relationship, there's also a considerable spread indicating variance in the data that isn't captured by the mode predictions alone. The trend line with the confidence band provides a visual summary of this relationship and how strongly the predicted mode might be related to the actual tool wear.
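A scatterplot with marginal histograms and a regression trend line like this can be produced with seaborn's jointplot; a sketch, with column names assumed:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# scatter of predicted mode vs. actual tool wear, trend line with confidence band,
# and marginal histograms on the top and right
sns.jointplot(data=results_df, x='predicted_mode', y='Tool_wear_min_original', kind='reg')
plt.show()
```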

OPTIMIZATION OF WEIBULL PARAMETERS:

Optimizing using Nelder-Mead:

Estimated b (beta using discrete): 9.276061336099053

Optimizing using L-BFGS-B:

Estimated a (alpha using continuous): 20.488203501049078

Estimated b (beta using continuous): 14.410029960366254

Estimated a (alpha using discrete): 29.310627581578036

Estimated b (beta using discrete): 22.317539181071695

Optimizing using TNC:

Estimated b (beta using discrete): 17.56874203342972

Optimizing using SLSQP:

Estimated a (alpha using continuous): 15.719455367046596

Estimated b (beta using continuous): 32.430451955738576

Estimated a (alpha using discrete): 10.36003329617002

Estimated b (beta using discrete): 22.81236275597879

Optimizing using Powell:

Estimated b (beta using discrete): 4.14766842559694

Optimizing using CG:

Estimated a (alpha using continuous): 37.30013522521838

Estimated b (beta using continuous): 20.799440975912376

Estimated a (alpha using discrete): 15.355454118970854

Estimated b (beta using discrete): 19.15919557519194

Optimizing using BFGS:

Estimated b (beta using continuous): 6.9977827040329945

Optimizing using trust-constr:

Estimated a (alpha using continuous): 23.046254230898754

Estimated b (beta using continuous): 8.057315399608626

Estimated a (alpha using discrete): 11.176534457090613

Estimated b (beta using discrete): 13.152739509698579
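A sketch of how such estimates can be obtained with scipy.optimize.minimize, maximizing the continuous censored Weibull log-likelihood (the discrete variant replaces the density with the probability mass between consecutive time steps). The data here are synthetic placeholders, and starting values are illustrative, so the numbers will not match the results above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik_continuous(params, t, event):
    """Negative censored Weibull log-likelihood; params = (alpha, beta)."""
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return np.inf
    tb = t / beta
    loglik = event * (np.log(alpha / beta) + (alpha - 1) * np.log(tb)) - tb ** alpha
    return -np.sum(loglik)

# synthetic example data standing in for the real observed times and event flags
rng = np.random.default_rng(0)
t_obs = rng.weibull(5.0, size=200) * 20.0
event = rng.integers(0, 2, size=200)

for method in ['Nelder-Mead', 'L-BFGS-B', 'TNC', 'SLSQP', 'Powell', 'CG', 'BFGS', 'trust-constr']:
    res = minimize(neg_loglik_continuous, x0=[2.0, np.mean(t_obs)],
                   args=(t_obs, event), method=method)
    print(f"{method}: alpha={res.x[0]:.4f}, beta={res.x[1]:.4f}")
```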