Sunday, January 25, 2015

How to Create and Publish an R Package on CRAN: Step-by-Step Guide

You will need:
  • RStudio (this tutorial is based on RStudio 0.98.501)
  • Beginner-level R programming skills
  • The devtools package (to build and compile the code)
  • The roxygen2 package (to create the documentation)
Let's break it down into seven simple steps:
  1. Create R project
  2. Create function(s)
  3. Create description file
  4. Create help file(s)
  5. Build, load and check the package
  6. Export package
  7. Submit on CRAN
Step 1

1.1 Open RStudio. Create a new project via "File > New Project > New Directory > Empty Project" and give the directory a name.

1.2 Install and load the R packages "devtools" and "roxygen2".
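For example, in the console:

install.packages(c("devtools", "roxygen2"))  # install from CRAN
library(devtools)   # helpers to build and compile the package
library(roxygen2)   # generates the documentation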


1.3 Go to "Build > Configure Build Tools".
Select "Package" from the dropdown menu.

Check the option "Generate documentation with Roxygen". A popup window will open; make sure all six checkboxes are checked there.

1.4 Make sure the Build tab now appears in the top-right panel.

Step 2

Go to the bottom-right panel. Click "Files > New Folder" and name the new folder "R". This is the directory where we will save our code (functions).

In the top-left panel, click "File > New File > R Script".
Write the function code in the script file and save it inside the "R" directory.
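As a running example (matching the AddNumbers.R file used later in this post), the function could be as simple as the following sketch; the body here is just a guess at the simplest version:

# R/AddNumbers.R
AddNumbers <- function(a, b) {
  a + b
}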

Step 3

We need to create a description file (DESCRIPTION) where we can specify details like the package name, title, description, author, maintainer, license, etc.

A simple way to create a skeleton description file is to go to the bottom-left panel (console) and run the command "load_all()" from devtools. It loads all the files of the package; in our case it will also create the DESCRIPTION file, since it does not exist yet, and reload the package.

In the bottom-right panel you should be able to see the DESCRIPTION file under the "Files" tab.

Click on the DESCRIPTION file; it will open in the top-left panel. Let's put values in the description file:
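For reference, a minimal DESCRIPTION might look like the following; the package name, author, and other values here are placeholders, not the ones from the original screenshots:

Package: AddNumbers
Type: Package
Title: Add Two Numbers
Version: 0.1.0
Author: Your Name
Maintainer: Your Name <you@example.com>
Description: A toy package that adds two numbers.
License: GPL-3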

Save the file and run "load_all()" in the console again. It will load the package with the newly created DESCRIPTION file. You should not see any errors or warnings.

Step 4

The next step is creating a help file for the function we have written. We will add the documentation in the same file that contains the function code. Let's go to AddNumbers.R and add the function description, input parameters, return value, and references (if any).

As you can see in the screenshot above, we have added 11 lines before the actual function code.

The last tag, "@export", makes sure this function is publicly available to users of the package.

In some cases we might write a function only for internal use by other function(s) in the package. We can keep these internal functions private by not adding "@export".
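A minimal sketch of what the roxygen header in AddNumbers.R might look like (the exact wording in the original screenshot may differ):

#' Add two numbers
#'
#' Takes two numeric values and returns their sum.
#'
#' @param a A numeric value.
#' @param b A numeric value.
#' @return The sum of a and b.
#' @export
AddNumbers <- function(a, b) {
  a + b
}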

Step 5

Go to the top-right panel, "Build" tab, and click "Build & Reload". You should see something like the following:

In the bottom-left panel, you should be able to see that the package has been re-loaded.

Now go to the bottom-right panel, "Packages" tab; you should see the package we have just created.

Click on it and check whether the DESCRIPTION file and help file look fine.

The DESCRIPTION file looks like this:

Click the back arrow and go to the AddNumbers help file. It should look like this:

Now let's test in the console whether the function we have written actually works.
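With the toy AddNumbers function sketched earlier, a quick test might look like:

AddNumbers(2, 3)
# expected output: [1] 5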

Before we export the package, let's do a thorough check via "Build > Check" in the top-right panel.

There should NOT be any warnings or errors.
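The same check can also be run from the console with devtools:

devtools::check()   # wraps R CMD check; aim for zero errors and warnings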

Step 6

Go to "Build > More > Build Source Package". It will create source package in 'tar.gz' format.
The output looks like this:
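If you prefer the console, the same source package can be built with devtools:

devtools::build()   # builds the tar.gz source package for the current project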

Step 7

Make sure you are NOT violating any CRAN submission policies before you proceed.

Go to the CRAN website's package submission page.

It is a three-step process.

Fill in the basic details, upload the package, and hit "Upload package".

It will take you to step 2, where you can verify the details and click Submit.

All maintainers of the package listed in the DESCRIPTION file will get a confirmation email. After the maintainers confirm, CRAN moderators will review the submission. If the package adheres to CRAN policies, it should get approved.

Congratulations! You are now officially a contributor to CRAN!

Tuesday, October 28, 2014

Important Concepts in Statistics

This is a random collection of a few important statistical concepts. These notes provide a simple explanation (not a formal definition) of each concept and the reason why we need it.

Sample space: This is the set of all possible outcomes.

So if we consider a coin flip, the sample space is {head, tail}. If one unbiased die is thrown, the sample space is {1, 2, 3, 4, 5, 6}.

Event: It is a subset of the sample space. For the event "getting an even number after throwing an unbiased die", the subset is {2, 4, 6}. So every time we run the experiment, either the event occurs or it doesn't.

Why we need it: The sample space and the event together help us determine the probability of the event. When all outcomes are equally likely, probability is simply the ratio of the number of elements in the event to the number of elements in the sample space.
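For the "even number" event above, that works out to:

length(c(2, 4, 6)) / length(1:6)   # 3/6 = 0.5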

Probability distribution: It assigns a probability to every possible outcome in the sample space.

So for an unbiased die, the probability of every outcome is equal (1/6), which looks like this:
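A quick way to draw this in R (a sketch, not the original figure):

probs <- rep(1/6, 6)
names(probs) <- 1:6
barplot(probs, xlab = "Outcome of a fair die", ylab = "Probability")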

When a probability distribution looks like this (equal probability for all outcomes), it is called a uniform probability distribution.

An important thing to note here is that in a probability distribution, the probabilities sum to exactly one.

Why we need it: Most statistical modelling methods make certain assumptions about the underlying probability distribution. So based on what kind of distribution the data follows, we can choose appropriate methods. Sometimes we transform the data (log transform, inverse transform) if the observed distribution is not what we expected or not what a particular statistical method requires.

We can categorize probability distributions into two classes: discrete probability distributions and continuous probability distributions.
  • Discrete: The sample space is a collection of discrete values, e.g. a coin flip or a die throw.
  • Continuous: The sample space is a collection of infinitely many continuous values, e.g. the height of all people in the US, or the distance traveled to reach the workplace.

Normal distribution: It is one of the most important concepts in statistics. Many distributions in the real world are close to the normal distribution, which looks like a bell-shaped curve approaching zero at both ends.

In reality we almost never observe an exact normal distribution in nature; however, in many cases it provides a good approximation.

[Image: "Normal Distribution PDF" by Inductiveload (self-made; Mathematica, Inkscape). Licensed under public domain via Wikimedia Commons.]

When the mean of a normal distribution is zero and the standard deviation is 1, it is called the standard normal distribution. The red curve above is the standard normal distribution.

Why we need it: Attaching a screenshot from a Quora discussion which sums it up pretty well.

Law of large numbers: The law of large numbers implies that the larger the sample size, the closer our sample mean is to the true (population) mean.

Why we need it: Have you ever wondered why, even though the probability of each outcome (head or tail) for a fair coin is exactly one half, 10 trials might give you quite different results (e.g. 6 heads and 4 tails)? The law of large numbers provides the answer: as we increase the number of trials, the mean of all trials gets closer to the expected value.

Another simple example: for an unbiased die, the probability of every outcome {1, 2, 3, 4, 5, 6} is exactly the same (1/6), so the expected value of a roll is 3.5.

[Image: "Largenumbers" by NYKevin (own work). Licensed under CC0 via Wikimedia Commons.]

As we can see in the image above, the mean approaches 3.5 only after a large number of trials.
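A small simulation in R illustrates the same idea (a sketch, not the code behind the figure above):

set.seed(1)
rolls <- sample(1:6, 10000, replace = TRUE)       # 10,000 die throws
running_mean <- cumsum(rolls) / seq_along(rolls)  # mean after each throw
plot(running_mean, type = "l", xlab = "Number of throws", ylab = "Running mean")
abline(h = 3.5, col = "red")                      # expected value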

Central limit theorem: Regardless of the underlying distribution, if we draw large enough samples and plot the mean of each sample, the distribution of those sample means approximates a normal distribution.

Why we need it: If we know the data is normally distributed, that tells us more about the data than an unknown distribution would. The central limit theorem enables us to use real-world data (near-normal or non-normal) with statistical methods that assume normality of the data.

An article summarizes the practical use of the CLT as follows:

    "The assumption that data is from a normal distribution simplifies matters, but seems a little unrealistic. Just a little work with some real-world data shows that outliers, skewness, multiple peaks and asymmetry show up quite routinely. We can get around the problem of data from a population that is not normal. The use of an appropriate sample size and the central limit theorem help us to get around the problem of data from populations that are not normal.

    Thus, even though we might not know the shape of the distribution where our data comes from, the central limit theorem says that we can treat the sampling distribution as if it were normal."
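A quick simulation sketch: even though individual exponential draws are heavily skewed, the means of many samples look roughly normal.

set.seed(1)
sample_means <- replicate(5000, mean(rexp(40, rate = 1)))  # 5000 samples of size 40
hist(sample_means, breaks = 50, main = "Distribution of sample means")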

Correlation: A number representing the strength of association between two variables. A high absolute value of the correlation coefficient implies the two variables are strongly associated.

One way to measure it is Pearson's correlation coefficient. It is the most widely used method, although it can only measure a linear relationship between the variables. The coefficient varies from -1 to 1.

A correlation coefficient of zero means there is no linear relationship between the two variables. A negative value means that as one variable increases, the other decreases.
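A quick illustration in R with simulated data (the variables here are made up for the sketch):

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)   # positively related to x, plus noise
cor(x, y)                 # close to +1
cor(x, -y)                # close to -1
cor(x, rnorm(100))        # close to 0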

The most important thing to remember here is that correlation does not necessarily mean there is causation. It only represents how two variables are associated with each other.

Source: xkcd

Peter Flom, a statistical consultant, explains the difference in simple words:
"Correlation means two things go together. Causation means one thing causes another."

Once we find a correlation, controlled experiments can be conducted to check whether any causation exists. There are also a few statistical methods, such as maximal correlation, that help us check for non-linear relationships between two variables.

Why we need it: A correlation coefficient tells us how strongly two variables are associated and the direction of the association.

P-value: The basic notion of this concept is the probability of getting a result at least as extreme as the observed one purely by chance, assuming there is no real effect. The smaller the number, the more reliable the result. Generally 0.05 is used as the threshold; a P-value below it is considered statistically significant.

Having said that, Fisher argued strongly that the interpretation of the P-value is ultimately up to the researcher, and the threshold can vary depending on requirements.

Why we need it: A threshold of 5% (0.05) means that if there were no real effect, roughly 1 out of every 20 such results would still look significant purely by chance.
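For example, a two-sample t-test in R reports a P-value directly (simulated data, just a sketch):

set.seed(1)
group_a <- rnorm(30, mean = 0)
group_b <- rnorm(30, mean = 1)
t.test(group_a, group_b)$p.value   # small value => unlikely to be chance alone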

    Monday, September 15, 2014

    How to check normality of the data

All parametric tests make certain assumptions about the data. Most parametric tests, like the F-test and Z-test, assume the data is normally distributed. So it is always useful to test the assumption of normality before we proceed. I am sharing my notes about normality tests in this post.

At a high level, I would group the tests into two categories:
  • Visual tests
  • Statistical tests

Visual tests: These might not be the best way to check for normality and can sometimes be ambiguous and/or misleading. Let's get a high-level overview of how to use them.

Histogram: We can plot a histogram of the observed data (see the R sketch below) and check whether:
  • it looks like a bell-shaped curve, and
  • it is not skewed in any direction.
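A quick sketch in R:

set.seed(1)
x <- rnorm(500)   # replace with your observed data
hist(x, breaks = 30, main = "Does it look bell shaped and symmetric?")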

    Tuesday, September 9, 2014

    P-P plot vs Q-Q plot

P-P plots and Q-Q plots are called probability plots.

A probability plot helps us compare two data sets in terms of their distributions. Generally one set is theoretical and one set is empirical (though this is not mandatory).

The two types of probability plots are (a quick Q-Q example in R follows this list):
  • Q-Q plot (more common)
  • P-P plot
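For a quick sense of a Q-Q plot against the normal distribution, R's built-in functions can be used (a sketch):

set.seed(1)
x <- rnorm(200)          # replace with your data
qqnorm(x)                # sample quantiles vs theoretical normal quantiles
qqline(x, col = "red")   # reference line; points near it suggest normality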
Before getting into the details, consider the following.
[Image: Diffusion of ideas]

    Thursday, September 4, 2014

    Regression concepts simplified

The regression modelling technique is widely used in analytics and is perhaps the easiest to understand. In this post I am sharing my findings about the concept in simple words.

    What is Simple Linear Regression?

Simple linear regression allows you to determine a functional dependency between two sets of numbers. For example, we can use regression to determine the relation between ice cream sales and average temperature.

Since we are talking about a functional dependency between two variables, we need one independent variable and one dependent variable. In the example above, if a change in temperature leads to a change in ice cream sales, then temperature is the independent variable and sales is the dependent variable.

The dependent variable is also called the criterion, response variable, or label. It is denoted by Y.

The independent variable is also referred to as the covariate, predictor, or feature. It is denoted by X.
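A minimal sketch in R with simulated data (the variable names and numbers are hypothetical, not from a real data set):

set.seed(1)
temperature <- runif(100, 10, 35)                    # X: independent variable
sales <- 50 + 8 * temperature + rnorm(100, sd = 20)  # Y: dependent variable
fit <- lm(sales ~ temperature)
summary(fit)   # intercept and slope estimates, R-squared, etc.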

    Sunday, August 31, 2014

    Statistical Modeling vs Machine Learning

I have often used the terms statistical modeling techniques and machine learning techniques interchangeably, but I was not sure about the similarities and differences. So I went through a few resources and am sharing my findings here.

Let's start with basic definitions.

A statistical model is a formalization of relationships between variables in the form of mathematical equations.

Machine learning is a subfield of computer science and artificial intelligence which deals with building systems that can learn from data, instead of following explicitly programmed instructions.

Let's explore what books and courses say in their first chapter/lecture about both fields.

From the book "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani:

    Wednesday, August 27, 2014

    Useful Unix commands for exploring data

While dealing with big genetic data sets I often ran into the limitations of programming languages when reading big files. Also, sometimes it is not convenient to load a data file into Python or R just to perform a few basic checks and exploratory analysis. Unix commands are pretty handy in these scenarios and often take significantly less time to execute.

Let's consider a movie data set from some parallel universe (with random values) for this post. There are 8 fields in total:

    Tuesday, August 19, 2014

    Interesting talk about AI and ML

    Microsoft researcher John Platt discusses his enthusiasm for artificial intelligence and machine learning. He is a Microsoft distinguished scientist and has been working in Artificial Intelligence for 32 years. In this video he talks about Artificial Intelligence, Machine Learning, Bing, Cortana and Project Adam.

    Sunday, August 17, 2014

    Survival Analysis

While working on a few assignments related to exploring disorders in cohort studies, I came across the concept of survival analysis. It seems very useful in many real-life scenarios.

    What is Survival analysis?

    "Survival analysis is a branch of statistics which deals with analysis of time duration to until one or more events happen".[1] The event of interest can be development of a disease, failure of a mechanical system or a person getting married.

    In survival analysis, subjects are generally followed over a certain time period and the focus is on the time at which the event of interest occurs.
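A minimal sketch using the survival package and its built-in lung data set, just to show the shape of the API:

library(survival)
fit <- survfit(Surv(time, status) ~ 1, data = lung)  # time to event, with censoring
plot(fit, xlab = "Days", ylab = "Survival probability")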

    Saturday, August 16, 2014

    Simple R tricks for data processing

Count unique values in a column of a matrix or data frame
# Find number of unique first names
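One way to do it, assuming a data frame df with a (hypothetical) column first_name:

length(unique(df$first_name))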

Count specific values in a vector or column
# Find number of occurrences of "ac" in the vector nlist
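For example, assuming nlist is a character vector:

sum(nlist == "ac")
# or, to see counts for every value at once:
table(nlist)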