Thursday, December 8, 2016

Selection Bias

[7 mins read]


Barack Obama's article at Wired. [1]
Stephen Hawking's article at The Guardian. [2]
Peter Thiel's speech at RNC. [3]

In the last two months, three renowned people have shared their thoughts about the time we live in.

All of them are highly successful and revered figures in their fields. They are all data driven; you will find them quoting facts and figures all the time. Yet there is a stark difference between their central messages.


***********

Case 1

Barack Obama wrote an article titled "Now is the greatest time to be alive". His argument is that we have achieved great breakthroughs, and though it's not utopia, considering history, the current time is the best time to live in.


".Just since 1983, when I finished college, things like crime rates, teen pregnancy rates, and poverty rates are all down. Life expectancy is up. The share of Americans with a college education is up too. Tens of mil­lions of Americans recently gained the security of health insurance. Blacks and Latinos have risen up the ranks to lead our businesses and communities. Women are a larger part of our workforce and are earning more money. Once-quiet factories are alive again, with assembly lines churning out the components of a clean-energy age.


...


And just as America has gotten better, so has the world. More countries know democracy. More kids are going to school. A smaller share of humans know chronic hunger or live in extreme poverty. In nearly two dozen countries—including our own—people now have the freedom to marry whomever they love. And last year the nations of the world joined together to forge the most comprehensive agreement to battle climate change in human history."


Indeed, these are facts. So that does seem like a step towards utopia, doesn't it? Being his admirer, I assumed the same.


***********


Case 2


Stephen Hawking published an article this week titled "This is the most dangerous time for our planet". As the title suggests, its central theme is pretty much the opposite of the first case.


"The concerns underlying these votes about the economic consequences of globalization and accelerating technological change are absolutely understandable. The automation of factories has already decimated jobs in traditional manufacturing, and the rise of artificial intelligence is likely to extend this job destruction deep into the middle classes, with only the most caring, creative or supervisory roles remaining. This in turn, will accelerate the already widening economic inequality around the world.


...


The consequences of this are plain to see: the rural poor flock to cities, to shanty towns, driven by hope. And then often, finding that the Instagram nirvana is not available there, they seek it overseas, joining the ever greater numbers of economic migrants in search of a better life. These migrants in turn place new demands on the infrastructures and economies of the countries in which they arrive, undermining tolerance and further fuelling political populism."


I think a lot of us can relate to what he is stating above. Sadly, it does appear to be the bigger picture at a global scale. We could face some serious issues in the near future.


***********


Case 3


Peter Thiel gave a speech at the RNC highlighting the poor state of the country. Basically, his stand was that the US cannot continue on its expected trajectory as a country, and that things are already bad.


"...today our government is broken. Our nuclear bases still use floppy disks. Our newest fighter jets can’t even fly in the rain. And it would be kind to say the government’s software works poorly, because much of the time it doesn’t even work at all. That is a staggering decline for the country that completed the Manhattan project. We don’t accept such incompetence in Silicon Valley, and we must not accept it from our government. Americans get paid less today than ten years ago. But healthcare and college tuition cost more every year. Meanwhile, Wall Street bankers inflate bubbles in everything."


If you think about it, he did mention some facts. The average healthcare cost per capita in the US has touched $10,000 per year. Medical debt appears to be the leading cause of personal bankruptcy in the US. Education is getting so expensive that people can spend a decade or more repaying education loans.


***********


Conclusion


If you look at the three cases, you will realize how "convenient" data selection can be used to support almost any argument. The difference in the arguments above could be due to differences in perception about how to measure things. Measuring things in the real world is an extremely hard problem. In research, "double-blinded + randomized + controlled" trials are considered the gold standard of evidence (though not the highest). Even with these gold standards and billions of dollars, an experiment can fail miserably at measuring things. For example, according to a paper published in the Journal of the American Medical Association, cancer drugs in the real world do not follow the expectations set by clinical trials of the same drugs. The average increase in survival time for patients on these drugs can be a lot less than the results in trials. Sometimes the average increase in survival time for real-world patients taking these drugs is less than the survival time of patients on placebo (sugar pills) in the experiment. [4]


That might look like an unnecessary example here. However, the point is that even a ton of money and the brightest minds working together cannot guarantee a good judgment of reality. So the least we could do is take things with a grain of salt rather than as absolute reality.


I think it's hard to eliminate selection bias completely, but it can be reduced. The examples above exhibit a comparatively decent level of selection bias; it can get really ugly and dangerous. Irrespective of its nature, selection bias will contribute to twisting the perception of reality (by definition) and possibly spreading misinformation. Some scenarios where we could spot it:


- While watching a TV debate with (loud) guests (and maybe equally loud host)
- Reading an article on a news website (whose sole aim could be click rate)
- Studying a biography of a highly successful or controversial person
- Watching your favorite politician or celebrity delivering a speech
- Public surveys and opinion polls (especially by political parties and related organizations)


Let's make a genuine attempt to observe whether it's the entire picture or just a "convenient" part of it.



[1] https://www.wired.com/2016/10/president-obama-guest-edits-wired-essay/
[2] https://www.theguardian.com/commentisfree/2016/dec/01/stephen-hawking-dangerous-time-planet-inequality
[3] http://time.com/4417679/republican-convention-peter-thiel-transcript/
[4] https://www.statnews.com/2016/11/21/cancer-clinical-trials/?s_campaign=stat%3Arss

Sunday, January 25, 2015

How to Create and Publish R package on CRAN : Step-by-Step Guide

Requirements:
  • R Studio (This tutorial is based on R studio 0.98.501)
  • Beginner level R programming skills
  • devtools package (to build and compile the code)
  • roxygen2 package (to create the documentation)
Let's break it down into seven simple steps:
  1. Create R project
  2. Create function(s)
  3. Create description file
  4. Create help file(s)
  5. Build, load and check the package
  6. Export package
  7. Submit on CRAN
Step 1

1.1 Open R Studio. Create a new project using "File > New Project > New Directory > Empty Project". Give the directory a name.



1.2  Install and load R packages "devtools" and "roxygen2".

install.packages("devtools")
install.packages("roxygen2")
library(devtools)
library(roxygen2)

1.3 Go to "Build > Configure buildtools"
Select "Package" from the dropdown menu


Check the option "Generate documentation with Roxygen". A popup window will open, make sure all six checkboxes are checked there.

1.4 Make sure the Build tab appears in the top-right panel.

Step 2

Go to the bottom-right panel. Click on "Files > New Folder" and name the new folder "R". This is the directory where we will save our code (functions).

In the top-left panel, click on "File > New File > R Script".
Write the function code in the script file and save the file inside the "R" directory.
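As a minimal sketch (the later steps of this tutorial use an AddNumbers example, so let's assume a simple function that adds two numbers; the exact signature is illustrative):

# AddNumbers.R: a toy function that adds two numbers
AddNumbers <- function(x, y) {
  return(x + y)
}

Save this as AddNumbers.R inside the "R" folder.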

Step 3

We need to create a description file where we can specify details like the package name, title, description, author, maintainer, license, etc.

A simple way to create a skeleton of the description file is to use the bottom-left panel (console) and give the command "load_all()". It basically loads all the files; in our case it will create the description file (since it is not there yet) and reload the package.


In bottom right panel you should be able to see the description file under "Files" tab.


Click on the description file; it will open in the top-left panel. Let's put values in the description file.
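A minimal sketch of what the filled-in description file might look like (the package name, version, and author details below are placeholders for illustration):

Package: AddNumbers
Title: Add Two Numbers
Version: 0.1
Author: Your Name <you@example.com>
Maintainer: Your Name <you@example.com>
Description: A toy package that adds two numbers, created to demonstrate the CRAN submission workflow.
License: GPL-3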


Save the file and use the console to give the command "load_all()". It will load the package with the newly created description file. You should not see any errors or warnings.

Step 4

Now the next step is creating a help file for the function we have written. We will add the information about the function in the same file containing the function code. Let's go to AddNumbers.R and add the function description, input parameters, return value, and references if any.
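A minimal sketch of what the documented AddNumbers.R might look like (the wording of the title, description, and parameter text is illustrative):

#' Add Two Numbers
#'
#' Adds two numeric values and returns their sum.
#'
#' @param x A numeric value.
#' @param y A numeric value.
#'
#' @return The sum of x and y.
#'
#' @references None.
#' @export
AddNumbers <- function(x, y) {
  return(x + y)
}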


As you can see in the example above, we have added 11 lines of roxygen comments (each starting with #') before the actual function code.

The last parameter called "@exports" makes sure this function is publicly available to users of the package.

In some cases we might write a function for internal use by other function(s) in the package. We can keep these internal functions private by not adding "@export".

Step 5

Go to the top-right panel, "Build" tab, and click on "Build & Reload". You should see something like the following:


In the bottom-left panel, you should be able to see that the package is re-loaded.

Now go to the bottom-right panel, "Packages" tab; you should see the package we have just created.

Click on it and check whether the description file and help file look fine.


The description file looks like this:

Click on the back arrow and go to the AddNumbers help file. It should look like this:

Now let's test, in the console, whether the function we have written actually works.
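For example (assuming the AddNumbers sketch from Step 2):

AddNumbers(2, 3)
# [1] 5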

Before we export the package, let's do a thorough check using "Build > Check" in the top-right panel.

There should NOT be any warnings or errors.
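If you prefer the console, devtools provides a check() function that runs the same package checks (this assumes devtools is still loaded, as in Step 1):

check()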

Step 6

Go to "Build > More > Build Source Package". It will create source package in 'tar.gz' format.
The output looks like,


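Alternatively, devtools can build the source package from the console (again assuming devtools is loaded; the file name is just what a tarball for this example might look like):

build()
# creates something like AddNumbers_0.1.tar.gz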
Step 7

Make sure you are NOT violating any CRAN submission policies before you proceed.

Go to CRAN website, cran.r-project.org/submit.html.

It is a three step process.

Fill in the basic details, upload the package and hit "Upload package".


It will take you to step 2, where you can verify the details and click Submit.


All maintainers of the package listed in the description file will get a confirmation email. After the maintainers confirm, CRAN moderators will review the package. If it adheres to CRAN policies, it should get approved.

Congratulations! You are now officially a contributor to CRAN!

Tuesday, October 28, 2014

Important Concepts in Statistics

This is a random collection of a few important statistical concepts. These notes provide simple explanations (not formal definitions) of the concepts and the reasons why we need them.


Sample space: This is the set of all possible outcome values.

So if we consider a coin flip, the sample space would be {head, tail}. If one unbiased die is thrown, the sample space would be {1, 2, 3, 4, 5, 6}.

Event: It is a subset of the sample space. For the event "getting an even number after throwing an unbiased die", the subset is {2, 4, 6}. So every time we run the experiment, either the event will occur or it won't.

Why we need it: Both the sample space and the event help us determine the probability of an event. When all outcomes are equally likely, the probability is simply the ratio of the number of elements in the event to the number of elements in the sample space.
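A quick sketch in R for the die example above (assuming equally likely outcomes):

# Sample space of a fair die and the event "even number"
sample_space <- 1:6
event <- c(2, 4, 6)

# Probability = size of event / size of sample space
prob_even <- length(event) / length(sample_space)
prob_even  # 0.5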


Probability distribution: It assigns a probability to every possible outcome in the sample space.

So for an unbiased die, the probability of every outcome is equal (1/6), which looks like the following.
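A minimal R sketch to visualize this uniform distribution (the plot styling is arbitrary):

# Probability of each face of a fair die
outcomes <- 1:6
probs <- rep(1/6, 6)
barplot(probs, names.arg = outcomes,
        xlab = "Outcome", ylab = "Probability",
        main = "Fair die: uniform distribution")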


When a probability distribution looks like this (equal probability for all outcomes), it is called a uniform probability distribution.

An important thing to note here is that, for any probability distribution, the probabilities of all outcomes sum to exactly one.

Why we need it: Most statistical modelling methods make certain assumptions about the underlying probability distribution. So based on what kind of distribution the data follows, we can choose appropriate methods. Sometimes we will transform the data (log transform, inverse transform) if the observed distribution is not what we expected or what certain statistical methods require.

We can categorize probability distributions into two classes: discrete and continuous.
  • Discrete: The sample space is a collection of discrete values, e.g. a coin flip or a die throw.
  • Continuous: The sample space is a collection of infinitely many continuous values, e.g. the heights of all people in the US, or the distance traveled to reach the workplace.

Normal distribution: It is one of the most important concepts in statistics. Many distributions in the real world are very similar to the normal distribution, which looks like a bell-shaped curve approaching zero at both ends.

In reality we almost never observe an exact normal distribution in nature; however, in many cases it provides a good approximation.

[Image: Normal Distribution PDF, by Inductiveload (public domain), via Wikimedia Commons]


When the mean of a normal distribution is zero and the standard deviation is 1, it is called the standard normal distribution. In the figure above, the red curve is the standard normal distribution.

Why we need it: Attaching a screenshot from a Quora discussion that sums it up pretty well.



Law of large numbers: The law of large numbers implies that the larger the sample size, the closer our sample mean is to the true (population) mean.

Why we need it: Have you ever wondered why, if the probability of each outcome (head or tail) for a fair coin is exactly half, 10 trials might still give you different results (e.g. 6 heads and 4 tails)? Well, the law of large numbers provides the answer: as we increase the number of trials, the mean of all trials comes closer to the expected value.

Another simple example: for an unbiased die the probability of every outcome {1, 2, 3, 4, 5, 6} is exactly the same (1/6), so the mean should be 3.5.
[Image: Largenumbers, by NYKevin (CC0), via Wikimedia Commons]

As we can see in the image above, the mean approaches 3.5 only after a large number of trials.
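A minimal R sketch of the same idea (simulating die rolls and watching the running mean; the number of rolls is arbitrary):

set.seed(42)
rolls <- sample(1:6, size = 10000, replace = TRUE)

# Running mean after each additional roll
running_mean <- cumsum(rolls) / seq_along(rolls)

# The running mean drifts towards the expected value of 3.5
plot(running_mean, type = "l", xlab = "Number of rolls", ylab = "Mean")
abline(h = 3.5, col = "red")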


Central limit theorem: Regardless of the underlying distribution, if we draw enough sufficiently large samples and plot the mean of each sample, the distribution of those sample means approximates a normal distribution.

Why we need it: If we know that given data is normally distributed, it provides more understanding about the data than an unknown distribution would. And the central limit theorem enables us to actually use real-world data (near-normal or non-normal) with statistical methods that assume normality of the data.

An article on about.com summarizes the practical use of the CLT as follows:

    "The assumption that data is from a normal distribution simplifies matters, but seems a little unrealistic. Just a little work with some real-world data shows that outliers, skewness, multiple peaks and asymmetry show up quite routinely. We can get around the problem of data from a population that is not normal. The use of an appropriate sample size and the central limit theorem help us to get around the problem of data from populations that are not normal.

    Thus, even though we might not know the shape of the distribution where our data comes from, the central limit theorem says that we can treat the sampling distribution as if it were normal."
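A minimal R sketch of the CLT in action (here the underlying data is exponential, i.e. clearly non-normal; the sample size and number of samples are arbitrary):

set.seed(1)

# Draw 1000 samples of size 50 from an exponential distribution
# and record the mean of each sample
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))

# The histogram of the sample means looks approximately normal
hist(sample_means, breaks = 30,
     main = "Distribution of sample means", xlab = "Sample mean")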


Correlation: A number representing the strength of association between two variables. A high value of the correlation coefficient implies that the two variables are strongly associated.

One way to measure it is Pearson's correlation coefficient. It is the most widely used method, and it can measure only linear relationships between variables. The coefficient value varies from -1 to 1.

A correlation coefficient of zero means there is no (linear) relationship between the two variables. A negative value means that as one variable increases, the other decreases.

The most important thing to remember here is that correlation does not necessarily mean there is causation. It only represents how two variables are associated with each other.

[Comic on correlation vs. causation (source: xkcd)]

Peter Flom, a statistical consultant, explains the difference in simple words as follows:
    "Correlation means two things go together. Causation means one thing causes another."

Once we find a correlation, controlled experiments can be conducted to check whether any causation exists. There are a few statistical methods, such as maximal correlation, that help us check for non-linear relationships between two variables.

Why we need it: A correlation coefficient tells us how strongly two variables are associated and the direction of the association.
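A minimal R sketch (simulated data, so the exact coefficient will vary from run to run):

set.seed(7)
temperature <- rnorm(100, mean = 25, sd = 5)
ice_cream_sales <- 10 * temperature + rnorm(100, sd = 20)

# Pearson's correlation coefficient, between -1 and 1
cor(temperature, ice_cream_sales, method = "pearson")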

P-value: The basic notion of this concept is a number representing how likely the observed results are to have occurred by chance. The smaller the number, the more reliable the results. Generally 0.05 is considered the threshold; a P-value smaller than that is treated as reliable.

Having said that, Fisher argued strongly that the interpretation of the P-value was ultimately up to the researcher. The threshold can vary depending on requirements.

Why we need it: So a P-value of 5%, or 0.05, tells us that roughly 1 out of every 20 such results could be produced by chance alone.
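A minimal R sketch (a one-sample t-test on simulated data; the numbers are arbitrary):

set.seed(3)
x <- rnorm(30, mean = 0.5, sd = 1)

# Test whether the mean of x differs from 0; the output includes a p-value
t.test(x, mu = 0)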


    Monday, September 15, 2014

    How to check normality of the data

All parametric tests make certain assumptions about the data. Most parametric tests, like the F-test and Z-test, assume the data is normally distributed. So it is always useful to test the assumption of normality before we proceed. I am sharing my notes about normality tests in this post.

At a high level, I would group the tests into two categories:
• Visual tests
• Statistical tests

Visual tests: These might not be the best way to check for normality and can sometimes be ambiguous and/or misleading. Let's get a high-level overview of how to use them.

Histogram: We can plot a histogram of the observed data and check (see the sketch after this list):
• whether it looks like a bell-shaped curve
• whether it is skewed in any direction
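A minimal R sketch (on simulated data; for real use, replace x with your own variable):

set.seed(10)
x <- rnorm(200)

# A roughly bell-shaped, symmetric histogram suggests approximate normality
hist(x, breaks = 20, main = "Histogram of x", xlab = "x")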


    Tuesday, September 9, 2014

    P-P plot vs Q-Q plot

P-P plots and Q-Q plots are both called probability plots.

A probability plot helps us compare two data sets in terms of their distributions. Generally one set is theoretical and one set is empirical (though this is not mandatory).
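A minimal R sketch of the most common case, comparing an empirical sample against the theoretical normal distribution (simulated data):

set.seed(5)
x <- rnorm(100)

# Q-Q plot of the sample against the normal distribution;
# points close to the reference line suggest normality
qqnorm(x)
qqline(x)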

    The two types of probability plots are
    • Q-Q plot (more common)
    • P-P plot
Before getting into the details, consider the following:
[Image: Diffusion of ideas.svg]

    Thursday, September 4, 2014

    Regression concepts simplified

The regression modelling technique is widely used in analytics and is perhaps the easiest to understand. In this post I am sharing my findings about the concept in simple words.

    What is Simple Linear Regression?

Simple linear regression allows you to determine the functional dependency between two sets of numbers. For example, we can use regression to determine the relation between ice cream sales and average temperature.

Since we are talking about a functional dependency between two sets of variables, we need one independent variable and one dependent variable. In the example above, if a change in temperature leads to a change in ice cream sales, then temperature is the independent variable and sales is the dependent variable.

The dependent variable is also called the criterion, response variable, or label. It is denoted by Y.

The independent variable is also referred to as the covariate, predictor, or feature. It is denoted by X.
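A minimal R sketch of fitting a simple linear regression on simulated data (the relationship and noise level are made up for illustration):

set.seed(2)
temperature <- runif(50, min = 15, max = 35)          # independent variable (X)
sales <- 20 + 12 * temperature + rnorm(50, sd = 25)   # dependent variable (Y)

# Fit Y ~ X and inspect the estimated intercept and slope
fit <- lm(sales ~ temperature)
summary(fit)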

    Sunday, August 31, 2014

    Statistical Modeling vs Machine Learning

I have often used the terms statistical modeling techniques and machine learning techniques interchangeably but was not sure about the similarities and differences. So I went through a few resources and am sharing my findings here.


Let's start with basic definitions.

    A statistical model is a formalization of relationships between variables in the form of mathematical equations.

Machine learning is a subfield of computer science and artificial intelligence which deals with building systems that can learn from data, instead of following explicitly programmed instructions.


Let's explore what books and courses say about both fields in their first chapter/lecture.

From the book "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani:

    Wednesday, August 27, 2014

    Useful Unix commands for exploring data

While dealing with big genetic data sets, I often got stuck with the limitations of programming languages when reading big files. Also, sometimes it is not convenient to load the data file into Python or R just to perform a few basic checks and exploratory analysis. Unix commands are pretty handy in these scenarios and often take significantly less time to execute.

Let's consider a movie data set from some parallel universe (with random values) for this exercise. There are 8 fields in total.



    Tuesday, August 19, 2014

    Interesting talk about AI and ML

    Microsoft researcher John Platt discusses his enthusiasm for artificial intelligence and machine learning. He is a Microsoft distinguished scientist and has been working in Artificial Intelligence for 32 years. In this video he talks about Artificial Intelligence, Machine Learning, Bing, Cortana and Project Adam.

    Sunday, August 17, 2014

    Survival Analysis

While working on a few assignments related to exploring disorders in cohort studies, I came across the concept of survival analysis. It seems very useful in many real-life scenarios.

    What is Survival analysis?

    "Survival analysis is a branch of statistics which deals with analysis of time duration to until one or more events happen".[1] The event of interest can be development of a disease, failure of a mechanical system or a person getting married.

    In survival analysis, subjects are generally followed over a certain time period and the focus is on the time at which the event of interest occurs.
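A minimal R sketch using the survival package (the package's built-in lung data set is used purely for illustration):

library(survival)

# Kaplan-Meier estimate of the survival curve for the lung data set
fit <- survfit(Surv(time, status) ~ 1, data = lung)
plot(fit, xlab = "Days", ylab = "Survival probability")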