---
title: "Life Data Analysis Part II - Estimation Methods for Parametric Lifetime Models"
subtitle: "Rank Regression and Maximum Likelihood"
author: 
  - "Tim-Gunnar Hensel" 
  - "David Barkemeyer"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_height: 6
    fig_width: 7
    fig_caption: yes
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Life Data Analysis Part II - Estimation Methods for Parametric Lifetime Models}
  %\VignetteEncoding{UTF-8}
---

```{r setup, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  screenshot.force = FALSE,
  comment = "#>"
)
library(weibulltools)
```

This document introduces two methods for the parameter estimation of lifetime distributions. Whereas _Rank Regression (RR)_ fits a straight line through transformed plotting positions (transformation is described precisely in `vignette(topic = "Life_Data_Analysis_Part_I", package = "weibulltools")`), _Maximum likelihood (ML)_ strives to maximize a function of the parameters given the sample data. If the parameters are obtained, a cumulative distribution function _(CDF)_ can be computed and added to a probability plot.  

In the theoretical part of this vignette the focus is on the two-parameter Weibull distribution. The second part is about the application of the provided estimation methods in {weibulltools}. All implemented models can be found in the help pages of `rank_regression()` and `ml_estimation()`.

## The Weibull Distribution 

The Weibull distribution is a continuous probability distribution, which is specified by the location parameter $\mu$ and the scale parameter $\sigma$. Its _CDF_ and _PDF (probability density function)_ are given by the following formulas: 

$$F(t)=\Phi_{SEV}\left(\frac{\log(t) - \mu}{\sigma}\right)$$

$$f(t)=\frac{1}{\sigma t}\;\phi_{SEV}\left(\frac{\log(t) - \mu}{\sigma}\right)$$
The practical benefit of the Weibull in the field of lifetime analysis is that the common profiles of failure rates, which are observed over the lifetime of a large number of technical products, can be described using this statistical distribution.

In the following, the estimation of the specific parameters $\mu$ and $\sigma$ is explained.  

## Rank Regression (RR) 

In _RR_ the _CDF_ is linearized such that the true, unknown population is estimated by a straight line which is analytically placed among the plotting pairs. The lifetime characteristic, entered on the x-axis, is displayed on a logarithmic scale. A double-logarithmic representation of the estimated failure probabilities is used for the y-axis. Ordinary Least Squares _(OLS)_ determines a best-fit line in order that the sum of squared deviations between this fitted regression line and the plotted points is minimized.  

In reliability analysis, it became prevalent that the line is placed in the probability plot in the way that the horizontal distances between the best-fit line and the points are minimized [^note1]. This procedure is called __x on y__ rank regression.  

[^note1]: Berkson, J.: _Are There Two Regressions?_, 
          _Journal of the American Statistical Association 45 (250)_, 
          DOI: 10.2307/2280676, 1950, pp. 164-180  
          
The formulas for estimating the slope and the intercept of the regression line according to the described method are given below.  

Slope: 

$$\hat{b}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})\cdot(y_i-\bar{y})}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$  

Intercept:  

$$\hat{a}=\bar{x}-\hat{b}\cdot\bar{y}$$  

With  

$$x_i=\log(t_i)\;;\; \bar{x}=\frac{1}{n}\cdot\sum_{i=1}^{n}\log(t_i)\;;$$  

as well as  

$$y_i=\Phi^{-1}_{SEV}\left[F(t)\right]=\log\left\{-\log\left[1-F(t_i)\right]\right\}\;and \; \bar{y}=\frac{1}{n}\cdot\sum_{i=1}^{n}\log\left\{-\log\left[1-F(t_i)\right]\right\}.$$  

The estimates of the intercept and slope are equal to the Weibull parameters $\mu$ and $\sigma$, i.e. 

$$\hat{\mu}=\hat{a}$$

and 

$$\hat{\sigma}=\hat{b}.$$

In order to obtain the parameters of the shape-scale parameterization the intercept and the slope need to be transformed [^note2].  

[^note2]: ReliaSoft Corporation: _Life Data Analysis Reference Book_, 
          online: [ReliaSoft](http://reliawiki.org/index.php/The_Weibull_Distribution), accessed 19 December 2020  

$$\hat{\eta}=\exp(\hat{a})=\exp(\hat{\mu})$$

and 

$$\hat{\beta}=\frac{1}{\hat{b}}=\frac{1}{\hat{\sigma}}.$$  

## Maximum Likelihood (ML)

The _ML_ method of Ronald A. Fisher estimates the parameters by maximizing the likelihood function. Assuming a theoretical distribution, the idea of _ML_ is that the specific parameters are chosen in such a way that the plausibility of obtaining the present sample is maximized. The likelihood and log-likelihood are given by the following equations:  

$$L = \prod_{i=1}^n\left\{\frac{1}{\sigma t_i}\;\phi_{SEV}\left(\frac{\log(t_i) - \mu}{\sigma}\right)\right\}$$ 

and 

$$\log L = \sum_{i=1}^n\log\left\{\frac{1}{\sigma t_i}\;\phi_{SEV}\left(\frac{\log(t_i) - \mu}{\sigma}\right)\right\}$$  

Deriving and nullifying the log-likelihood function according to parameters results in two formulas that have to be solved numerically in order to obtain the estimates.  

In large samples, ML estimators have optimality properties. In addition, the simulation studies by _Genschel and Meeker_ [^note3] have shown that even in small samples it is difficult to find an estimator that regularly has better properties than ML estimators.

[^note3]: Genschel, U.; Meeker, W. Q.: _A Comparison of Maximum Likelihood and Median-Rank Regression for Weibull Estimation_, 
          in: _Quality Engineering 22 (4)_, DOI: 10.1080/08982112.2010.503447, 2010, pp. 236-255


## Data

To apply the introduced parameter estimation methods the `shock` and `alloy` datasets are used. 

### Shock Absorber

In this dataset kilometer-dependent problems that have occurred on shock absorbers are reported. In addition to failed items the dataset also contains non-defective (*censored*) observations. The data can be found in _Statistical Methods for Reliability Data_ [^note4]. 

[^note4]: Meeker, W. Q.; Escobar, L. A.: _Statistical Methods for Reliability Data_, 
          _New York, Wiley series in probability and statistics_, 1998, p. 630  
          
For consistent handling of the data, {weibulltools} introduces the function `reliability_data()` that converts the original dataset into a `wt_reliability_data` object. This formatted object allows to easily apply the presented methods.  

```{r dataset_shock, message = FALSE}
shock_tbl <- reliability_data(data = shock, x = distance, status = status)
shock_tbl
```          

### Alloy T7989

The dataset `alloy` in which the cycles until a fatigue failure of a special alloy occurs are inspected. The data is also taken from Meeker and Escobar [^note5]. 

[^note5]: Meeker, W. Q.; Escobar, L. A.: _Statistical Methods for Reliability Data_, 
          _New York, Wiley series in probability and statistics_, 1998, p. 131  
          
Again, the data have to be formatted as a `wt_reliability_data` object: 

```{r, data_alloy}
# Data: 
alloy_tbl <- reliability_data(data = alloy, x = cycles, status = status)
alloy_tbl
```

## RR and ML with {weibulltools}

`rank_regression()` and `ml_estimation()` can be applied to complete data as well as failure and (multiple) right-censored data. Both methods can also deal with models that have a threshold parameter $\gamma$.  

In the following both methods are applied to the dataset `shock`.

### RR for two-parameter Weibull distribution

```{r RR_weibull, fig.cap = "Figure 1: RR for a two-parametric Weibull distribution.", message = FALSE}
# rank_regression needs estimated failure probabilities: 
shock_cdf <- estimate_cdf(shock_tbl, methods = "johnson")

# Estimating two-parameter Weibull: 
rr_weibull <- rank_regression(shock_cdf, distribution = "weibull")
rr_weibull 

# Probability plot: 
weibull_grid <- plot_prob(
  shock_cdf,
  distribution = "weibull", 
  title_main = "Weibull Probability Plot", 
  title_x = "Mileage in km", 
  title_y = "Probability of Failure in %",
  title_trace = "Defectives",
  plot_method = "ggplot2"
)

# Add regression line: 
weibull_plot <- plot_mod(
  weibull_grid,
  x = rr_weibull,
  title_trace = "Rank Regression"
)

weibull_plot
```

### ML for two-parameter Weibull distribution

```{r ML_weibull, fig.cap = "Figure 2: ML for a two-parametric Weibull distribution.", message = FALSE}
# Again estimating Weibull: 
ml_weibull <- ml_estimation(
  shock_tbl, 
  distribution = "weibull"
)

ml_weibull 

# Add ML estimation to weibull_grid: 
weibull_plot2 <- plot_mod(
  weibull_grid, 
  x = ml_weibull, 
  title_trace = "Maximum Likelihood"
)

weibull_plot2
```

### ML for two- and three-parameter log-normal distribution

Finally, two- and three-parametric log-normal distributions are fitted to the `alloy` data using maximum likelihood.  

```{r ML_estimation_log-normal, message = FALSE}
# Two-parameter log-normal:  
ml_lognormal <- ml_estimation(
  alloy_tbl,
  distribution = "lognormal"
)

ml_lognormal

# Three-parameter Log-normal:  
ml_lognormal3 <- ml_estimation(
  alloy_tbl,
  distribution = "lognormal3"
)

ml_lognormal3
```

<br> 

```{r ML_visualization_I, fig.cap = "Figure 3: ML for a two-parametric log-normal distribution.", message = FALSE}
# Constructing probability plot: 
tbl_cdf_john <- estimate_cdf(alloy_tbl, "johnson")

lognormal_grid <- plot_prob(
  tbl_cdf_john,
  distribution = "lognormal", 
  title_main = "Log-normal Probability Plot", 
  title_x = "Cycles", 
  title_y = "Probability of Failure in %",
  title_trace = "Failed units",
  plot_method = "ggplot2"
)

# Add two-parametric model to grid: 
lognormal_plot <- plot_mod(
  lognormal_grid,
  x = ml_lognormal,
  title_trace = "Two-parametric log-normal"
)

lognormal_plot
```

<br> 

```{r ML_visualization_II, fig.cap = "Figure 4: ML for a three-parametric log-normal distribution.", message = FALSE}
# Add three-parametric model to lognormal_plot:
lognormal3_plot <- plot_mod(
  lognormal_grid, 
  x = ml_lognormal3, 
  title_trace = "Three-parametric log-normal"
)

lognormal3_plot
```