Survival time data and their first statistical analysis: Review

In life testing, medical follow-up studies, and other fields, it is often impossible to observe the lifetimes of all experimental units in the study. These types of data are called survival data. Because of the nature of the data, we cannot obtain full information from survival data. Therefore, it is not possible to apply the standard statistical techniques to analyze such data. In this paper, I mainly focus on right-censored data and explain how to derive the basic nonparametric estimators of the survival function (the Kaplan-Meier estimator), the hazard function, and the cumulative hazard function from the observed data. In addition, I discuss how to compare the survival probabilities in two or more groups by using the log-rank test. I also introduce the proportional hazards model (Cox's model), which incorporates other related covariates into the analysis. Finally, I present a simulation study and a real data application.


Introduction
In life testing, medical follow-up studies, and other fields, it is often impossible to observe the lifetimes of all experimental units in the study. What makes measuring durations difficult is time itself. In most cases, not all events will have been observed by the time one wants to make inferences about lifetimes. For example, a medical professional will not wait fifty years for each individual in the study to pass away before closing the study; he or she is interested in whether lifetimes have improved after only a few years. The individuals who have not died by the end of the study period are labeled as right-censored: all the information we have on these individuals is their current lifetime durations, which are naturally less than their actual lifetimes.
The simplest kind of censoring is single censoring, which occurs when all observations are censored at the same time. There are two types of single censoring: Type I censoring and Type II censoring. In Type I censoring, the censoring time is predetermined. Type II censoring occurs when an experiment stops once a predetermined number of failures has been observed; the remaining subjects are then right-censored.
In many studies, observations are not censored at the same time; such data are frequently referred to as arbitrarily censored data. For example, in a clinical trial, censoring can occur because of events that are not related to what is being investigated in the study, such as self-removal from the study, dropout, and death from other causes. These are known as competing risks in the literature.
The analysis of survival data, such as lifetime data, is very important in many fields, including reliability, engineering, biology, and medicine. Survival data are highly non-normal in nature; therefore, the use of standard statistical techniques such as linear regression models is problematic.
Under the random censorship model, we assume that $X_1, X_2, \ldots, X_n$ are independent nonnegative random variables with the continuous distribution function $F(t) = P(X \le t)$.
The censoring variables $Y_1, Y_2, \ldots, Y_n$ are also nonnegative and are assumed to be a random sample, drawn independently of the $X_i$s from a population with the continuous distribution function $G(t) = P(Y \le t)$. The $Y_i$s right-censor the $X_i$s. The observable random variables are $Z_i = \min\{X_i, Y_i\}$ and $\delta_i = I\{Z_i = X_i\}$, where $\delta_i$ indicates whether $Z_i$ is an uncensored observation or not. In this model, the $X_i$s represent times to an endpoint event (e.g., death, relapse, malfunctioning) and the $Y_i$s represent censoring times. In the random censorship model, informative censoring occurs when the distribution function $G$ is informative about the distribution function $F$.
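To make the random censorship model concrete, the following Python sketch generates the observable pairs (observed time, event indicator) from latent lifetimes and censoring times. It is an illustration only and not the R code used in this paper: the function name `simulate_random_censoring`, the seed, the sample size, and the reading of 0.2 and 0.1 as exponential rate parameters are all assumptions.

```python
import random

def simulate_random_censoring(n, failure_rate, censor_rate, seed=None):
    """Return a list of (z, delta) pairs under the random censorship model."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.expovariate(failure_rate)  # latent lifetime
        y = rng.expovariate(censor_rate)   # latent censoring time
        z = min(x, y)                      # observed time
        delta = 1 if x <= y else 0         # 1 = event observed, 0 = right-censored
        sample.append((z, delta))
    return sample

# Assumed rates: 0.2 for lifetimes and 0.1 for censoring times.
sample = simulate_random_censoring(20000, failure_rate=0.2, censor_rate=0.1, seed=1)
observed_fraction = sum(delta for _, delta in sample) / len(sample)
print("fraction of uncensored observations:", round(observed_fraction, 3))
```

For independent exponential lifetimes and censoring times, the probability that an observation is uncensored is failure_rate / (failure_rate + censor_rate), which is 2/3 for the rates assumed here, so the printed fraction should be close to that value.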

Some Preliminary Definitions
In this subsection, I introduce some basic concepts and their definitions, namely the survival function and the hazard rate function.

The Survival Function
The survival function of $X$ is denoted by $S(t)$ and defined by $S(t) = 1 - F(t)$. This measures the probability that an individual survives from the time origin to a specific future time $t$.

The Hazard Rate Function
The hazard rate function is usually denoted by $h(t)$ and can be interpreted as the probability, per unit time, that an individual who is under observation at time $t$ has an event at that time. This means that $h(t)$ is the instantaneous event rate for an individual who has already survived to time $t$. The hazard rate function is defined mathematically by $h(t) = f(t)/S(t) = -S'(t)/S(t)$, where $f(t) = F'(t)$ and $S'(t)$ are the probability density function of $X$ and the derivative of $S(t)$, respectively.
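As a quick check of this definition, consider an exponential lifetime with rate $\lambda$. The symbol $\lambda$ and the choice of distribution are assumptions made for illustration; the ratio of the density to the survival function gives a hazard that does not depend on $t$.

```latex
\begin{align*}
F(t) &= 1 - e^{-\lambda t}, & S(t) &= 1 - F(t) = e^{-\lambda t}, \\
f(t) &= F'(t) = \lambda e^{-\lambda t}, & h(t) &= \frac{f(t)}{S(t)} = \lambda .
\end{align*}
```

An exponential lifetime therefore has a constant hazard: with $\lambda = 0.2$, the instantaneous event rate is 0.2 per unit time no matter how long the individual has already survived.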

Kaplan Meier Survival Estimate
In the literature, the survival function is most often estimated from the observed data, both uncensored and censored, by the Kaplan-Meier estimator. This is a nonparametric estimator of $S(t)$ and it is denoted by $\hat{S}(t)$. Consider the right-censored data $Z_i = \min\{X_i, Y_i\}$ and $\delta_i = I\{Z_i = X_i\}$, $i = 1, \ldots, n$, for $n$ patients in a medical study. Assume that $t_1 < t_2 < \cdots < t_m$ are the distinct event times for the above observations. For simplicity, I assume here that there are no ties in the event times. As events are assumed to occur independently of one another, the probabilities of surviving from one interval to the next may be multiplied together to give the cumulative survival probability. That is, the probability of being alive at time $t_j$, $S(t_j)$, is calculated from $S(t_{j-1})$, the probability of being alive at $t_{j-1}$; $n_j$, the number of patients alive just before $t_j$; and $d_j$, the number of events at $t_j$, by $S(t_j) = S(t_{j-1})(1 - d_j/n_j)$.
By using similar arguments, one can reach the following Kaplan-Meier (1958) product-limit formula for the survival function, $\hat{S}(t) = S(0) \prod_{j:\, t_j \le t} (1 - d_j/n_j)$, where $S(0) = 1$ is the probability of survival at time 0. The value of $\hat{S}(t)$ is constant between event times and, therefore, the estimated survival function is a step function. Confidence intervals for the survival probabilities can also be computed. In the next subsection, I show how to obtain the KM curve and confidence intervals for a simulated data set.
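In a similar fashion, one can show that the nonparametric estimator for the cumulative hazard function $H(t) = \int_0^t h(u)\,du$ is $\hat{H}(t) = \sum_{j:\, t_j \le t} d_j/n_j$.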
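The product-limit recursion can be traced by hand on a small data set. The Python sketch below is an illustration and not the paper's R code: the function name `kaplan_meier` and the six toy observations are hypothetical. Besides the survival estimate it accumulates the running sum of d_j / n_j, which is commonly used as a nonparametric estimate of the cumulative hazard (the Nelson-Aalen form).

```python
def kaplan_meier(times, events):
    """Return rows (t_j, n_j, d_j, S_hat, H_hat) at each distinct event time."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, cum_haz, rows = 1.0, 0.0, []
    for t_j in event_times:
        # n_j: subjects still under observation just before t_j
        n_j = sum(1 for t in times if t >= t_j)
        # d_j: events observed exactly at t_j
        d_j = sum(1 for t, e in zip(times, events) if t == t_j and e == 1)
        surv *= 1.0 - d_j / n_j   # S(t_j) = S(t_{j-1}) * (1 - d_j / n_j)
        cum_haz += d_j / n_j      # running sum of d_j / n_j
        rows.append((t_j, n_j, d_j, surv, cum_haz))
    return rows

# Hypothetical toy data: event indicator 1 = event, 0 = right-censored.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 0, 1, 1, 0, 1]
km_rows = kaplan_meier(toy_times, toy_events)
for t_j, n_j, d_j, surv, cum_haz in km_rows:
    print(t_j, n_j, d_j, round(surv, 4), round(cum_haz, 4))
```

Censored subjects (here at times 2 and 5) never trigger a step in the curve, but they remain in the risk set until their censoring time, which is how partial information from censored observations enters the estimate.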

Example 1
Consider an example of survival data. In this example, I simulate the failure times and the censoring times from the exponential distributions Exp(0.2) and Exp(0.1), respectively. I generate 50 observations and compute the Kaplan-Meier (KM) estimator and its 95% confidence bands by using R software. R code can be used to generate the KM curve shown in Figure 1. To obtain the median survival time and its 95% confidence interval, one can use the KM curve in Figure 1. In order to build a confidence interval for $S(t)$ with a different confidence level, say 90%, one should use conf.int = 0.9 in the R survfit() function. Next, one can compare the survival curves for two groups by extending the R code.
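The simulation described in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the R code behind Figure 1: the seed is arbitrary, 0.2 and 0.1 are read as rate parameters, and the interval is the plain Greenwood interval (estimate plus or minus z standard errors, truncated to [0, 1]), which may differ from the interval type that survfit() reports by default. The names `km_with_greenwood` and `median_survival` are hypothetical.

```python
import math
import random

def km_with_greenwood(times, events, z=1.96):
    """Rows (t_j, S_hat, lower, upper) using Greenwood's variance formula."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, var_sum, rows = 1.0, 0.0, []
    for t_j in event_times:
        n_j = sum(1 for t in times if t >= t_j)
        d_j = sum(1 for t, e in zip(times, events) if t == t_j and e == 1)
        surv *= 1.0 - d_j / n_j
        if n_j > d_j:
            var_sum += d_j / (n_j * (n_j - d_j))  # Greenwood's sum
        se = surv * math.sqrt(var_sum)
        rows.append((t_j, surv, max(0.0, surv - z * se), min(1.0, surv + z * se)))
    return rows

def median_survival(rows):
    """Smallest event time with estimated survival <= 0.5, or None."""
    for t_j, surv, _, _ in rows:
        if surv <= 0.5:
            return t_j
    return None

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
times, events = [], []
for _ in range(50):
    x = rng.expovariate(0.2)  # failure time, assumed rate 0.2
    y = rng.expovariate(0.1)  # censoring time, assumed rate 0.1
    times.append(min(x, y))
    events.append(1 if x <= y else 0)

rows = km_with_greenwood(times, events)
print("number of events:", sum(events))
print("estimated median survival time:", median_survival(rows))
```

Passing z=1.645 instead of the default gives an approximate 90% interval, which plays the same role as conf.int = 0.9 in survfit(). For reference, the true median of an exponential lifetime with rate 0.2 is log(2)/0.2, about 3.47.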

Figure 2: Comparisons of survival function for two groups
From the curves, it is clear that the survival probability functions of the two groups do not differ much in the middle part of the curves, but they differ at both ends. This conclusion can be checked with a standard statistical test such as the log-rank test.

The log-rank Test
As in standard statistics, two or more survival curves can be compared by hypothesis testing. Because of the nature of survival data, standard tests such as the two-sample t-test cannot be used. In such instances, the log-rank test can be used to check whether two or more survival curves are identical. Here we test the hypothesis $H_0: S_1(t) = S_2(t)$ for all $t \ge 0$, where $S_1(t)$ and $S_2(t)$ are the survival functions of the two groups. We can consider the composite hypothesis instead of the simple one. The log-rank test statistic for this hypothesis is given by $\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$, where $O_i$ and $E_i$ are the observed and expected numbers of events for group $i$ and $k$ is the number of groups. In Example 1, $k = 2$, where the two groups are females and males. This test statistic has approximately a chi-square distribution with $k - 1$ degrees of freedom if the null hypothesis is true.
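The observed and expected counts in this statistic can be computed directly from the risk sets. The Python sketch below is an illustration, not the paper's R code: the function name `logrank_chi_square`, the group labels, and the six toy observations are hypothetical. At each event time, the expected number of events in a group is the total number of events at that time multiplied by the group's share of the subjects at risk. Statistical software may report a variance-based version of the statistic, so its value can differ somewhat from this sum.

```python
import math

def logrank_chi_square(times, events, groups):
    """Return (observed, expected, chi_sq) with chi_sq = sum (O_i - E_i)^2 / E_i."""
    labels = sorted(set(groups))
    observed = {g: 0.0 for g in labels}
    expected = {g: 0.0 for g in labels}
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t_j in event_times:
        at_risk = {g: 0 for g in labels}
        deaths = {g: 0 for g in labels}
        for t, e, g in zip(times, events, groups):
            if t >= t_j:
                at_risk[g] += 1
                if t == t_j and e == 1:
                    deaths[g] += 1
        n_j = sum(at_risk.values())
        d_j = sum(deaths.values())
        for g in labels:
            observed[g] += deaths[g]
            expected[g] += d_j * at_risk[g] / n_j
    # Assumes every group has a positive expected count.
    chi_sq = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in labels)
    return observed, expected, chi_sq

# Hypothetical toy data for two groups "A" and "B".
toy_times = [1, 3, 5, 2, 4, 6]
toy_events = [1, 1, 0, 1, 0, 1]
toy_groups = ["A", "A", "A", "B", "B", "B"]
obs, exp_counts, chi_sq = logrank_chi_square(toy_times, toy_events, toy_groups)
# With two groups the reference distribution has 1 degree of freedom,
# for which the upper tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2.0))
print(obs, exp_counts, round(chi_sq, 4), round(p_value, 4))
```

In this toy example both groups have two observed events, while the expected counts are 1.4 and 2.6, so the statistic is 0.36/1.4 + 0.36/2.6, about 0.396, which gives no evidence of a difference.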
We use R code to test the above hypothesis for Example 1. The output of the log-rank test supports the conclusion drawn from the graph in Figure 2.
If we want to perform the Peto-Prentice Wilcoxon test, we need to specify rho = 1 in the above R code.
This test leads to a conclusion similar to that of the log-rank test. Finally, we discuss Cox's proportional hazards model. This is a regression-type model known as Cox's regression model (1972).

Cox's Proportional Hazard Model
We can use the log-rank test to compare the survival times in different groups, but it does not allow other covariates to be taken into account in the analysis. Cox's proportional hazards model is analogous to a multiple regression model. It allows survival data to be analyzed with regression models similar to linear models and generalized linear models. The scale on which linearity is assumed is the log-hazard scale; therefore, in Cox's model, the dependent variable is the hazard rate. This model enables one to compare the survival times of particular groups while taking into account other relevant factors, which are known as covariates.
For covariates $x_1, x_2, \ldots, x_p$, the model can be written as $h(t) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)$, where $h_0(t)$ is the baseline hazard function. Most of the time, the coefficients $\beta_1, \beta_2, \ldots, \beta_p$ are estimated by likelihood methods using the observed data.
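To illustrate what estimation by likelihood methods involves, the Python sketch below maximizes Cox's partial likelihood for a single covariate by Newton-Raphson iteration. It is a minimal illustration under stated assumptions, not the algorithm or code of any particular package: it assumes one covariate and no tied event times, and the function names and toy data are hypothetical. Each event contributes the covariate value of the subject who failed minus the risk-weighted average of the covariate over everyone still at risk.

```python
import math

def score_and_information(beta, times, events, x):
    """Score U(beta) and information I(beta) of the Cox partial log-likelihood."""
    score, info = 0.0, 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue  # censored subjects contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for t_j, x_j in zip(times, x):
            if t_j >= t_i:  # subject j is still at risk at time t_i
                w = math.exp(beta * x_j)
                s0 += w
                s1 += w * x_j
                s2 += w * x_j * x_j
        mean_x = s1 / s0
        score += x_i - mean_x
        info += s2 / s0 - mean_x ** 2
    return score, info

def fit_cox_one_covariate(times, events, x, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration started at beta = 0."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = score_and_information(beta, times, events, x)
        step = score / info  # assumes the covariate varies within risk sets
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical toy data: x is a binary covariate, event 1 = observed, 0 = censored.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 0, 1, 1, 1]
toy_x = [1, 0, 1, 0, 1, 0]
beta_hat = fit_cox_one_covariate(toy_times, toy_events, toy_x)
final_score, final_info = score_and_information(beta_hat, toy_times, toy_events, toy_x)
print("coef:", round(beta_hat, 4))
print("exp(coef):", round(math.exp(beta_hat), 4))
print("se(coef):", round(1.0 / math.sqrt(final_info), 4))
```

The baseline hazard never appears in the computation: it cancels out of each risk-set ratio, which is why the coefficients can be estimated without specifying its form.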

Real Data Application: Ovarian Data
In this subsection, we consider an example from the literature. An investigator collected data on 845 patients with primary epithelial ovarian carcinoma between January 1990 and December 1999 at the Western General Hospital in Edinburgh. Follow-up data were available up to the end of December 2000. By this time, 550 (75.9%) subjects had died (Clark et al., 2001).
We fit a Cox model to ovarian data with futime as a dependent variable and age as a covariate.
The following R code can be used to fit this model.
According to the output, the likelihood ratio test, the Wald test, and the log-rank test all indicate that the model is significant. These tests are equivalent in large samples but may differ a little in small samples. The coef, 0.16162, is the log hazard ratio for a one-unit increase in age, and exp(coef), 1.175, is the corresponding hazard ratio.
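Because age is the only covariate in this model, the reported coefficient can be read as a log hazard ratio per one-unit increase in age. The short calculation below works through that interpretation; the ten-unit comparison is an illustrative assumption and relies on the log-linear effect of age holding over that range.

```latex
\begin{align*}
\text{HR for a 1-unit increase in age} &= \exp(0.16162) \approx 1.175, \\
\text{HR for a 10-unit increase in age} &= \exp(10 \times 0.16162) = \exp(1.6162) \approx 5.03 .
\end{align*}
```

Under the fitted model, each additional unit of age multiplies the hazard by about 1.175, that is, an increase of roughly 17.5%, and these multiplicative effects compound across larger age differences.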

Discussion
In survival analysis, the statistical inference techniques that can be used differ from the standard statistical techniques. This difference arises mainly from the nature of survival data, especially right censoring, which prevents full information on the event of interest from being obtained for some subjects in the study. In this review, the Kaplan-Meier estimator is used to estimate the survival function, the log-rank test to compare the survival functions of two groups, and Cox's regression model to incorporate other covariates. These techniques can also be used with time-dependent covariates. Finally, one can extend these techniques to analyze recurrent event data, where more than one event is observed for a subject. For example, a subject with cancer may have multiple occurrences within the observation window.