Sunday, August 17, 2014

Survival Analysis

While working on a few assignments related to exploring disorder in cohort studies, I came across the concept of survival analysis. It seems very useful in many real-life scenarios.

What is Survival analysis?

Survival analysis is a branch of statistics that deals with the analysis of the time until one or more events happen [1]. The event of interest can be the development of a disease, the failure of a mechanical system, or a person getting married.

It is also called time-to-event analysis.

In survival analysis, subjects are generally followed over a certain time period and the focus is on the time at which the event of interest occurs.

Consider the following cases in a follow-up study:

Case 1. The patient develops the event of interest within the follow-up time.
Case 2. The patient does NOT develop the event of interest within the follow-up time.
Case 3. The patient becomes unreachable at some point during the follow-up study, and we have no information on whether the event of interest happened.

Cases 2 and 3 are similar: either the event did not occur during follow-up, or we have no information about whether it occurred. Such observations are called 'censored'.

Censoring: Observations are considered censored when the information about their survival time is incomplete.
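
The three cases above can be sketched as a small data set. The patient labels and month values here are made up for illustration; the only real API used is Surv() from the "survival" package.

```r
library(survival)

# Hypothetical follow-up data, one row per case described above.
follow_up <- data.frame(
  patient = c("case 1", "case 2", "case 3"),
  months  = c(14, 24, 9),   # time to event, end of study, or last contact
  status  = c(1, 0, 0)      # 1 = event observed, 0 = censored
)

# Surv() pairs each time with its status; censored times print with a "+".
surv_obj <- with(follow_up, Surv(months, status))
print(surv_obj)
```

Cases 2 and 3 both get status 0: all we know is that the event had not happened by the recorded time.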

Types of Censoring

Right censoring: Consider a survival analysis study where the event of interest is getting divorced. Assume subjects are followed for 20 years. A subject who does not get divorced (does not experience the event of interest) during the study is called right censored. The survival time for this person is known only to be at least as long as the duration of the study. A subject who is lost to follow-up (case 3 above) is also right censored, at the time of last contact.

Left censoring: If the event had already happened before the subject was first observed, we know only that the subject's survival time is less than the observed time. Such an observation is said to be left censored.
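
As a rough sketch of the divorce example, the snippet below builds right-censored data from made-up "true" times, and then a small left-censored object. All numbers are hypothetical. In a real study the true times of censored subjects are never seen.

```r
library(survival)

# Hypothetical true years until divorce (unknown in practice for censored subjects).
true_years   <- c(7, 31, 15, 26)
study_length <- 20

# What the study actually records: time is capped at the end of follow-up.
observed_years <- pmin(true_years, study_length)
divorced       <- as.numeric(true_years <= study_length)  # 0 = right censored
right_censored <- Surv(observed_years, divorced)

# Left censoring: for status 0 the event happened at some point BEFORE the time.
left_censored <- Surv(time = c(4, 10), event = c(0, 1), type = "left")

print(right_censored)
print(left_censored)
```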


Analogy with regression: The concept is quite similar to regression with one dependent variable and multiple independent variables. We also get similar output, containing coefficients, standard errors, p-values, etc. Then why can't we simply use linear regression instead? Because ordinary linear regression cannot deal effectively with censored data: it either treats a censored time as if it were the actual event time, or drops the censored subjects altogether.
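
A small simulation can illustrate the problem. Everything here is an assumption made for the sketch: event times are drawn from an exponential distribution with mean 10, and the study stops at time 8. An intercept-only linear regression is just the sample mean, so means are used to keep it short.

```r
set.seed(42)

n         <- 10000
true_time <- rexp(n, rate = 0.1)   # assumed true mean survival time = 10
cutoff    <- 8                     # assumed end of follow-up

time  <- pmin(true_time, cutoff)             # recorded time
event <- as.numeric(true_time <= cutoff)     # 1 = event seen, 0 = censored

# Two naive approaches that ignore censoring:
naive_mean       <- mean(time)               # censored times treated as events
events_only_mean <- mean(time[event == 1])   # censored subjects dropped

# Censoring-aware estimate for exponential data: total time at risk / events.
censoring_aware_mean <- sum(time) / sum(event)

c(naive = naive_mean, events_only = events_only_mean,
  censoring_aware = censoring_aware_mean)
```

Both naive estimates fall well below the true mean of 10, while the censoring-aware estimate should land close to it.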

Difference with regression: In survival analysis the dependent variable is made up of two variables:
  • Time to the event of interest
  • The event status
The event status flags whether each recorded time is an observed event or a censored observation, which is how incomplete survival times are handled. The point of the whole analysis is to estimate two functions:
  • Survival function: It gives the survival probability (the chance that the event of interest has not happened by a given time).
  • Hazard function: It gives the rate at which the event happens per unit time, given that the subject has survived up to that time.
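
To make the two functions concrete, here is a minimal hand computation of the Kaplan-Meier survival estimate and a discrete hazard on six made-up observations, using base R only. The data values are hypothetical.

```r
# Hypothetical data: observed times and status (1 = event, 0 = censored).
time   <- c(2, 3, 3, 5, 8, 9)
status <- c(1, 1, 0, 1, 0, 1)

# Distinct times at which at least one event occurred.
event_times <- sort(unique(time[status == 1]))

# Subjects still under observation just before each event time.
at_risk <- sapply(event_times, function(t) sum(time >= t))

# Number of events at each event time.
events <- sapply(event_times, function(t) sum(time == t & status == 1))

# Discrete hazard: share of at-risk subjects who have the event at that time.
hazard <- events / at_risk

# Kaplan-Meier survival: running product of (1 - hazard).
survival <- cumprod(1 - hazard)

data.frame(time = event_times, at_risk, events, hazard, survival)
```

Censored subjects never count as events, but they stay in the at-risk count until the time they were censored. This is what the time and status pair makes possible.
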

A popular model used in survival analysis is the Cox proportional hazards regression model. In R, it is available as the coxph() function in the "survival" package.

Here is the code written by Ani Katchova.

# Survival Analysis in R
# Copyright 2013 by Ani Katchova

# install.packages("survival")
library(survival)

mydata<- read.csv("C:/Econometrics/Data/survival_unemployment.csv")
attach(mydata)

# Define variables
time <- spell
event <- event
X <- cbind(logwage, ui, age)
group <- ui

# Descriptive statistics
summary(time)
summary(event)
summary(X)
summary(group)

# Kaplan-Meier non-parametric analysis
kmsurvival <- survfit(Surv(time,event) ~ 1)
summary(kmsurvival)
plot(kmsurvival, xlab="Time", ylab="Survival Probability")

# Kaplan-Meier non-parametric analysis by group
kmsurvival1 <- survfit(Surv(time, event) ~ group)
summary(kmsurvival1)
plot(kmsurvival1, xlab="Time", ylab="Survival Probability")

# Nelson-Aalen non-parametric analysis
nasurvival <- survfit(coxph(Surv(time,event)~1), type="aalen")
summary(nasurvival)
plot(nasurvival, xlab="Time", ylab="Survival Probability")

# Cox proportional hazard model - coefficients and hazard rates
# (result stored as cox_fit so the name does not mask the coxph() function)
cox_fit <- coxph(Surv(time,event) ~ X, method="breslow")
summary(cox_fit)

# Exponential, Weibull, and log-logistic parametric model coefficients
# Opposite signs from Stata results, Weibull results differ; same as SAS
exponential <- survreg(Surv(time,event) ~ X, dist="exponential")
summary(exponential)

weibull <- survreg(Surv(time,event) ~ X, dist="weibull")
summary(weibull)

loglogistic <- survreg(Surv(time,event) ~ X, dist="loglogistic")
summary(loglogistic)
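
The CSV file used above is not included with this post. As a stand-in, the sketch below fits the same kinds of models on the `lung` data set that ships with the "survival" package. This is a different data set with different variables, chosen here only so the code can be run as is. It also shows one way to read the "opposite signs" remark: survreg() models survival time, so a Weibull coefficient can be moved to the hazard scale by negating it and dividing by the fitted scale.

```r
library(survival)

# lung: time in days, status 1 = censored / 2 = dead (Surv() handles this coding).
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
hazard_ratios <- exp(coef(cox_fit))   # a value below 1 means a lower hazard

# Weibull model on the survival-time scale.
weibull_fit <- survreg(Surv(time, status) ~ age + sex, data = lung,
                       dist = "weibull")

# Convert to the hazard scale: drop the intercept, flip the sign, divide by scale.
weibull_ph <- -coef(weibull_fit)[-1] / weibull_fit$scale

# Side-by-side comparison of the Cox and converted Weibull coefficients.
cbind(cox = coef(cox_fit), weibull_ph = weibull_ph)
```

Passing `data = lung` instead of calling attach() keeps the variables out of the global workspace.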



This amazing video below explains the code.



References

[1] Survival analysis
[2] Cornell University stats newsletter
