Wednesday, February 15, 2017

The Curse of Fake News and Possible Remedy

Well before the debate over "fake news" started trending, researchers at Stanford University began a project to study how well we evaluate information, especially from online sources. Starting in early 2015, the researchers studied the behaviour of students from schools and universities, including Stanford, for 18 months. In the summary of the report, they summed up their disappointment by stating, "in every case and at every level, we were taken aback by students’ lack of preparation." The participants did a pretty poor job of assessing the credibility of information and sources. Though unfortunate, this might not be entirely shocking, as it confirms the pattern we observe around us.

In the era of social media journalism, the reliability of information appears to have taken a back seat. Facebook announced its intention to crack down on fake news. Recently Twitter has also joined the call. Maybe a few other companies will follow suit. Though this is a commendable initiative, it doesn't seem enough for the enormous scale of the problem. There are 85+ virtual communities worldwide with at least a million registered users each (like Facebook and Twitter). Additionally, there are a few dozen instant messaging services like WhatsApp. Making all these platforms accountable seems like a practically impossible task. And even if a lot of them implement some measure of regulation, can we trust these platforms with their self-moderation policies?

Not just that, recently there were instances of mainstream media publishing news citing social media references, only to find out it was inaccurate. For example, in February 2017 multiple leading news agencies in India published a story about Shawna Pandya, a Canadian citizen of Indian origin, claiming she had been selected for one of NASA's 2018 flights. (Some of the news articles added extra feathers to her cap, like neuroscientist, opera singer etc.) Shawna had to debunk these claims in a Facebook post she published on February 10th 2017, stressing that it was merely a possibility at this point. This is not an isolated incident. So it would be unwise to view the mainstream media as a highly reliable source.

In general, trusting information platforms entirely to provide factually accurate and unbiased information doesn't seem like a wise strategy. An alternative approach would be to make information consumption points more resistant to the onslaught of misinformation. It would be worth exploring whether we can address the problems in information consumption (like subjective bias, exposure to highly exaggerated or false information) by employing some ideas from the field of experiment design. Experiment design techniques are the framework we humans have invented in the quest for the ultimate truth. This framework takes us closer to the ultimate truth by accounting for multiple biases. Though its application is mostly limited to research, let's see if we can borrow a few concepts from the framework for our daily life.


Just imagine if news articles started excluding the identities associated with remarks, opinions, speeches, policies etc. That would be really weird, right? Now we might have to actually read and analyse the content before we pass any judgement. There are some fascinating resources on how people with prejudice react when you hide or interchange the identities associated with the source of information.

For example, a guy asked a bunch of questions to Hillary supporters about some made-up stuff but presented it as Hillary's stand or policy.

Interviewer: One of Hillary's primary campaign promises is to expand Sharia law program in minority communities in America. You think that is the right campaign platform to be running on? 
Girl: Yeah 
Interviewer: Sharia law expansion? 
Girl: I would say yes! I am pro. 
Interviewer: To change the way women are treated in America by implementing Sharia law? 
Girl: Absolutely! 
Interviewer: Hillary knows what's best? 
Girl: Hillary is cool!

You will find similar videos for Trump supporters as well. In fact, this phenomenon seems universal amongst followers of politicians, celebrities, entrepreneurs etc. Most of us demonstrate a strong association bias, which makes us vulnerable to misinformation. It can also lead us into playing an active role in the misinformation distribution network.

In my board exam, the examiner would apply a sticker over the personal information box on the first page of my answer sheet. So anyone downstream dealing with my answer sheet would only be able to see my answers, not my name, town or any other details. This helps reduce the personal bias of evaluators or moderators so they can grade based on the content only. Similarly, let's say a good researcher wants to test which of two available medicines works better for the common cold. He would take the labels off both brands of pills, give them to two fairly similar groups of patients and measure the outcome. This helps reduce the personal bias of patients and doctors when they report the results to the researchers.

Basically, blinding helps us evaluate or measure subjective aspects in an unbiased manner. Unfortunately, we cannot blind ourselves selectively in the real world. So we either have to define what we wish to evaluate as objectively as possible or not let the source/association prejudice us. Even better if we could do both together.

Meta-analysis strategy

Let's assume Albert conducts a study and concludes that meditating daily for 10 minutes reduces anxiety attacks by 50%. Coincidentally, two of his friends have conducted similar studies "independently". However, their estimates of the reduction in anxiety attacks differ from Albert's conclusion. His first friend finds a reduction of 40% and the second finds 60%. Statistically, we can combine these results (percentage/effect, standard error, variance, confidence etc.) and get closer to the ground truth. Note that the result is not necessarily the simple average of the percentages (i.e. 50, 40 and 60).
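To see why the combined result need not be the simple average, here is a minimal sketch in R of one common way to pool such results, fixed-effect (inverse-variance) weighting. The three percentages come from the example above, but the standard errors are invented purely for illustration, and a real meta-analysis would involve more care than this.

```r
# Effect estimates from the example; the standard errors are hypothetical.
effect <- c(albert = 50, friend1 = 40, friend2 = 60)  # % reduction in anxiety attacks
se     <- c(albert = 5,  friend1 = 10, friend2 = 20)  # assumed, not from any study

# Fixed-effect (inverse-variance) pooling: more precise studies get more weight.
weight    <- 1 / se^2
pooled    <- sum(weight * effect) / sum(weight)
pooled_se <- sqrt(1 / sum(weight))

# Approximate 95% confidence interval for the pooled estimate.
ci <- pooled + c(-1, 1) * qnorm(0.975) * pooled_se

cat(sprintf("Simple average:  %.1f%%\n", mean(effect)))
cat(sprintf("Pooled estimate: %.1f%% (95%% CI %.1f to %.1f)\n", pooled, ci[1], ci[2]))
```

With these made-up standard errors the pooled estimate comes out a little below 50%, because Albert's study and the 40% study are assumed to be more precise than the 60% one.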

Assume you are a manager planning to hire an engineer for your team. A simple method you could follow is to have four of your teammates interview the candidate. Then collect the feedback from all four team members, which could be something like the following:

 Interviewer 1 (with 9 years of work experience) - Highly recommends hiring
 Interviewer 2 (with 7 years of work experience) - Recommends hiring
 Interviewer 3 (with 5 years of work experience) - Recommends not hiring
 Interviewer 4 (with 5 years of work experience) - Highly recommends not hiring

As a manager, you might end up hiring the candidate. However, the important detail here is that you would consider all four pieces of feedback in the final decision.

Consider a new piece of information in the same example. The position you are hiring for is highly technical. So you decide to consider the technical experience of the interviewers.

 Interviewer 1 (0 years of Technical experience)
 Interviewer 2 (0 years of Technical experience)
 Interviewer 3 (5 years of Technical experience)
 Interviewer 4 (5 years of Technical experience)

Now you might not hire the candidate. This is a reason why gathering the information from multiple and often diverse sources is crucial.
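The hiring example can be worked through as a weighted average. This is only a sketch: the numeric scale for the recommendations (+2 to -2) and the rule "hire if the weighted score is positive" are my assumptions, while the years of experience come from the lists above.

```r
# Recommendations coded on an assumed scale:
# +2 = highly recommends hiring ... -2 = highly recommends not hiring.
score <- c(2, 1, -1, -2)

work_years <- c(9, 7, 5, 5)  # overall work experience of interviewers 1 to 4
tech_years <- c(0, 0, 5, 5)  # technical experience of interviewers 1 to 4

# weighted.mean() ships with base R (stats package).
by_work <- weighted.mean(score, work_years)
by_tech <- weighted.mean(score, tech_years)

# Assumed decision rule: hire only if the weighted score is positive.
decision <- function(x) if (x > 0) "hire" else "do not hire"

cat("Weighted by work experience:     ", round(by_work, 2), "->", decision(by_work), "\n")
cat("Weighted by technical experience:", round(by_tech, 2), "->", decision(by_tech), "\n")
```

The same four opinions lead to opposite decisions depending on which weights we use, which is exactly why it matters to know how much each source should count.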

A lot of us seem to prefer consuming news from similar sources. These could be exclusively left-leaning, right-leaning, pro-environment, pro-industry etc. It is certainly not evil to have a political or social position, but building an information bubble out of sources favouring your sociopolitical position is a risky business. To make matters worse, the follow/unfollow options and recommendation algorithms of social media expedite this bubble construction. We see a version of this in mainstream media. By following only Fox News we might overlook the lives saved by the Affordable Care Act, and by following CNN alone we might not learn about a possible increase in health insurance premiums due to the same act. A voice of dissent could make our understanding less skewed. As people say in data science, outliers are interesting.

Exposing yourself to diverse information sources and conflicting ideologies could help you to arrive at a less biased conclusion.

Systematic Review approach

Arguably, systematic reviews are the highest level of evidence we have today. A systematic review can be a pretty time-consuming but highly methodical way to remove bias and arrive at a conclusion. It requires going through the available resources/literature associated with a specific question in a systematic way, evaluating the quality of each and then aggregating the insights to reach a conclusion. Technically, a meta-analysis is often part of a systematic review.

Obviously, you would not use this approach very frequently, only for critical decisions that call for gathering, weighing and aggregating comprehensive information. A less exact but more realistic version of this method could help us answer very important questions.

A classic application case for systematic reviews would be the electoral surveys (opinion polls) conducted in election-bound states by news agencies. Often, these surveys make contradictory claims, and very rarely are they close to the mark. Praveen Chakravarty published an article in The Hindu analysing 82 electoral surveys from 1996 to 2014 for Lok Sabha and State Assembly elections. The highly disappointing conclusion was that zero (yes, zero) surveys were fully accurate. They defined “fully accurate” as being within a +/-5 percent range in terms of seats predicted with regards to the actual results for both the winner and the runner-up.
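The "fully accurate" rule can be written down as a small check. This sketch assumes that "+/-5 percent" means a relative error of at most 5% of the seats actually won; the article may have measured it differently, and the seat numbers below are made up.

```r
# Assumption: the tolerance is relative to the seats actually won.
within_tolerance <- function(predicted, actual, tol = 0.05) {
  abs(predicted - actual) <= tol * actual
}

# A survey counts as fully accurate only if BOTH the winner's and the
# runner-up's seat predictions fall within the tolerance.
is_fully_accurate <- function(pred_winner, actual_winner,
                              pred_runner_up, actual_runner_up, tol = 0.05) {
  within_tolerance(pred_winner, actual_winner, tol) &&
    within_tolerance(pred_runner_up, actual_runner_up, tol)
}

# Hypothetical survey: close on the winner, 10% off on the runner-up.
is_fully_accurate(pred_winner = 205, actual_winner = 200,
                  pred_runner_up = 90, actual_runner_up = 100)
```

The hypothetical survey fails the check even though it gets the winner nearly right, which shows how demanding the two-sided criterion is.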

So, it would be worth trying to run systematic reviews on these surveys for a specific election. One would start by going through the methods of data collection and analysis. Once you have an idea of how systematic each survey is, you could take a weighted average. That might end the curse of utterly incorrect predictions.

Another takeaway here is that when presented with a survey, we should be asking how the data was collected, what kind of people participated and what kind of questions were asked. These questions are equally applicable when a government claims overwhelming support of citizens for a new nationwide initiative through a nontransparent internal survey. In late 2016, the Indian Prime Minister claimed that 93% of citizens supported his demonetization initiative, in which 86% of the currency notes were cancelled overnight without any warning to the general public or banks. If you ask the same set of questions discussed earlier, you would recognise flaws in the survey methods. In addition to relying on a highly biased (urban and pro-government) sample, the survey asked crafty questions like "Did you mind the inconvenience faced in our fight to curb corruption, black money, terrorism and counterfeiting of currency?" After considering the details, this survey seems like state-sponsored fake news. Again, I don't intend to single out this government, as every ruling force will exhibit this aspect to some degree as part of its propaganda machinery.


We cannot rely on information sources to be self-aware and account for various biases. Additionally, we need to work consistently on our own prejudice related to information consumption. As Rumi said eight hundred years ago, "Yesterday I was clever, so I wanted to change the world. Today I am wise, so I am changing myself."

Sunday, January 29, 2017

Influence of celebrities on Public Health

In late 2014, one of the leading Indian actresses, Anushka Sharma, spoke openly about her anxiety issues in an interview. She did not play the victim but suggested it is as normal as having a constant stomach pain and encouraged talking about it.

"I have anxiety. And I’m treating my anxiety. I’m on medication for my anxiety. Why am I saying this? Because it’s a completely normal thing. It’s a biological problem. In my family there have been cases of depression. More and more people should talk openly about it. There is nothing shameful about it or something to hide. If you had a constant stomach pain, wouldn’t you go to the doctor? It’s that simple. I want to make this my mission, to take any shame out of this, to educate people about this."

Around the same time another Indian actress, Deepika Padukone, spoke openly about her depression in an interview with a national newspaper. [1] She talked about her plans to create more awareness about depression and also used social media for the same.

"Anxiety,Depression and Panic Attacks are not signs of weakness.They are signs of trying to remain strong for way too long." - @deepikapadukone, 31 Dec, 2014 [2]

They might not be the first to open up about this in India, but their words did not go unnoticed. They triggered a sensible discussion and reasonably positive media coverage. A few more celebrities came out and talked about their issues. A good message was reaching the public: anxiety and depression are not the end of the world, and it is perfectly fine to talk about them or reach out for help. Talking about mental health issues has always appeared to be a huge taboo in India. So these incidents and the follow-up discussion seemed like a very welcome change. I think the government would have required a lot more resources for the same impact. Imagine a senior mental health researcher from AIIMS stating the same guidelines in an interview. I think you would agree that, though the researcher is well-qualified, the reach/impact would have been a lot less. So subject matter expertise does not necessarily assure reach/impact with the target audience. On the other hand, a couple of celebrity interviews did the trick.

In 2014, CNN published an article - India beats the odds, beats the Polio. [3]

"In 2009, India still reported half of the world's new cases -- 741 out of 1,604. India has millions of poor and uneducated people. The population is booming. Large areas lack hygiene and good sanitation, and polio spreads through contaminated water. Many health experts predicted India would be the last country in the world to get rid of polio. They were wrong."

It was great teamwork. The WHO, the government and so many other contributors played their part. But do you remember how you used to learn about the next planned vaccination drive? For the most part through Amitabh Bachchan, the superstar of Indian cinema. He appeared in TV and radio ads, on billboards and wherever he could with the message to vaccinate your children. In 2014 India was declared a polio-free country. Later that year Amitabh Bachchan was honoured for his contribution by the Union Health Minister and the UNICEF India representative. The Hindu published an article appreciating his efforts with the title "When Amitabh's voice did the trick to make India polio-free." [4] After this success, the actor seems to be gearing up for another campaign related to Hepatitis B. A wonderful way to use your charisma!

We can see a similar pattern in developed countries as well. In 2013, Angelina Jolie wrote in The New York Times about her mother's death due to breast cancer and her own high-risk situation. [6] In this article, she specifically mentioned the risk associated with BRCA genes and hoped many other women would get tested to learn about their risk level. This single article in the NYT appears to have had a significant impact. The BMJ (formerly the British Medical Journal) reported a spike in genetic tests for the BRCA genes, which are associated with an increased risk of breast cancer. [7] The conclusion of this observational study states, "Celebrity endorsements can have a large and immediate effect on the use of health services. Such announcements can be a low-cost means of reaching a broad audience quickly, but they may not effectively target the subpopulations that are most at risk for the relevant underlying condition."

A similar incident happened in Australia. The pop singer Kylie Minogue was diagnosed with breast cancer in 2005. The Medical Journal of Australia followed this case and reported, "News coverage of Kylie Minogue’s breast cancer diagnosis caused an unprecedented increase in bookings for mammography". Additionally, there was a 20-fold increase in news coverage of breast cancer, which emphasized that young women do get breast cancer and that early detection is critical. [8]

However, this celebrity influence on public health seems to be a double-edged sword. In 2014, Indian actress Madhuri Dixit advertised Maggi instant noodles, claiming they are very healthy. Months after the company launched the campaign, the Uttar Pradesh FDA found that these noodle packets contained MSG and lead above the permissible limits. To make matters worse, the courts in Muzaffarpur and Barabanki ordered FIRs against celebrities for endorsing Maggi. In addition to Madhuri Dixit, this included the same Amitabh Bachchan we thanked for the polio campaign. [9]

A researcher in 1998 published a fraudulent paper (which was later retracted) that appeared to link the MMR vaccine to autism. Things blew out of proportion, and it became the biggest science story of 2002. The fear spread throughout the country, and eventually Tony Blair was asked if his infant son had had the MMR jab. Mr. Blair, who had supported the MMR program, refused to state whether his son was vaccinated. Some of us might agree with Tony Blair on the grounds of privacy. However, the point is about the message that was sent to already scared parents and the things that followed. Sir Liam Donaldson, who was Chief Medical Officer of England during that period, has criticized Tony Blair for not going public about his son being vaccinated.[10] I guess when you hold a public office, a clear stand on public health issues is more important than your privacy concerns.

The rising consumption of sugary drinks is a major contributor to the obesity epidemic.[11] These drinks contain almost no nutritional value beyond sugar. France recently banned unlimited refills of these drinks as part of the battle against obesity. However, we constantly see celebrities from Britney Spears to Justin Timberlake endorsing these products. In fact, these soft drink companies appear to have advertising contracts with major celebrities from almost all countries.

How these celebrity opinions and endorsements affect the choice of people is a fascinating area.

Two years ago, researchers from the US and Canada looked into this by studying the existing research papers on this topic (like a study of existing studies). [12] The research was aimed at how celebrity engagements can benefit or hinder efforts to educate patients on evidence-based practices and improve their health literacy. The results section of this paper says, "According to the economics literature, celebrities distinguish endorsed items from competitors and can catalyze herd behavior. Neuroscience research supports these explanations, finding that celebrity endorsements activate brain regions involved in making positive associations, building trust and encoding memories. The psychology literature tells us that celebrity advice conditions people to react positively toward it."

It gets even more fascinating. There is a follow-up project which considered even more research literature with a significantly expanded team.[13] This time researchers aimed to get answers to more specific questions like,

-- Which health-related outcomes are influenced by celebrities?
-- How large of an impact do celebrities actually have on these health-related outcomes?
-- Under what circumstances do celebrities produce either beneficial or harmful impact?

They hope that the results of this will contribute to the understanding of celebrity influences and further help to design positive evidence-based celebrity health promotion activities. In addition, these findings can help inform the development of media reporting guidelines pertaining to celebrity health news.

If we can get reasonably accurate answers to these questions, it would be immensely helpful. We could choose celebrities for the promotion of health initiatives in order to maximize the impact. At the same time, there will be some risk of bad companies using this science for monetary benefits and against the public interest. Let's be positive and hope celebrities will understand the power of influencing public health they have. After all as one of them said, "With Great Power Comes Great Responsibility".



Sunday, January 22, 2017

Wordplay in Information Manipulation

There is a very interesting scene in the movie The Dark Knight. The Joker (bad guy) is holding Rachel (lead actress) hostage at the edge of a rooftop when Batman arrives. The short conversation goes something like,

Batman: Let her go!
Joker: Ohh, very poor choice of words

Indeed, maybe the Batman was under a lot of stress. If this poor^ information representation example serves as one end of the spectrum, then researchers might be on the other end.

The way (good) researchers choose their words seems remarkably careful^^. They would love to say something like, "X is associated with increased risk of Y with the p-value of blah-blah" (possibly with extra stress on the word associated).

^Poor = unintentional or careless
^^Careful = intentional and thoughtful

Interestingly, these are not necessarily the people who unfairly manipulate information. A poor or careful choice of words does not have a definitive relation with information manipulation, though poor word choice will lead to ambiguity or misrepresentation. On the other hand, I believe information manipulation can be traced back to either a poor or a careful choice of words (or other means of representation). So there is no easy way to spot it.

The Digital Trends website published an article two months ago with the title "Stanford study *concludes* next generation of robots won’t try to kill us".[1] This title is so far-fetched from the actual content of the report that it would be very hard to qualify it as the truth. Nowhere in the report can we find the conclusion stated by the article.[2] The funny thing is that this article cites another article, written by Fast Company, as the source for the catchy headline. So the title is basically Digital Trends' interpretation of Fast Company's interpretation of the study. A poor choice of words to create clickbait.

In the Indian epic of Mahabharata, Guru Dronacharya was invincible while holding a weapon. As long as he was alive, the Pandavas could NOT win the Dharm Yuddha. So Lord Krishna devised an ingenious plan to weaken Dronacharya by spreading a rumor of the death of his son Ashwatthama. Accordingly, Bhima killed an elephant named Ashwatthama and the message was spread that Ashwatthama had been killed. Guru Dronacharya found it hard to believe. There was one way to confirm: ask the man who had never lied, Yudhistira. Yudhistira, being a virtuous man, refused to tell any lies. However, Lord Krishna convinced him to say 'Ashwathama Hatahath, Naro Va Kunjaro Va', which means 'Ashwatthama has died (in a clear, loud voice, and then continuing in a low pitch) but it is not certain whether it was Drona's son or an elephant'. Hearing this, Guru Dronacharya was disheartened, laid down his weapons and was killed. Eventually, the Pandavas won the war. A very careful use of words to manipulate information.

Here is another very interesting example from the book Bad Science,

"The reports were based on a study that had observed participants over four years, and the results suggested, using natural frequencies, that you would expect one extra heart attack for every 1005 people taking ibuprofen. Or as the Daily Mail, in an article titled "How Pills for Your Headache Could Kill" reported: "British research revealed that patients taking ibuprofen to treat arthritis face a 24 percent increased risk of suffering a heart attack." Feed the fear.

Almost everyone reported the relative risk increases: diclofenac increases the risk of heart attack by 55 percent; ibuprofen, by 24 percent. The Boston Globe was clever enough to report the natural frequency: 1 extra heart attack in 1005 people on ibuprofen. The UK's Daily Mirror, meanwhile, tried and failed, reporting that 1 in 1005 people on ibuprofen "will suffer heart failure over the following year." No. It's heart attack, not heart failure, and it's 1 extra person in 1005, on the top of the heart attacks you'd get anyway. Several other papers repeated the same mistake."
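The gap between "24 percent" and "1 extra in 1005" is just arithmetic, and a short sketch makes it visible. Both figures are taken from the quoted passage; the baseline risk is not quoted, so it is back-calculated here under my assumption that both figures describe the same group of people over the same period.

```r
# Both inputs come from the quoted passage.
relative_increase <- 0.24      # "24 percent increased risk"
extra_per_person  <- 1 / 1005  # "1 extra heart attack in 1005 people"

# Assumption: relative_increase = extra / baseline for the same group and period,
# so the baseline risk can be recovered as extra / relative_increase.
baseline_risk  <- extra_per_person / relative_increase
risk_with_drug <- baseline_risk + extra_per_person

cat(sprintf("Implied baseline: %.1f heart attacks per 1000 people\n", 1000 * baseline_risk))
cat(sprintf("With ibuprofen:   %.1f heart attacks per 1000 people\n", 1000 * risk_with_drug))
```

Under that assumption the risk moves from roughly 4 to roughly 5 heart attacks per 1000 people. The same change can be reported as "24 percent more" or "about one extra person in a thousand", and the two framings feel very different.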

Creating catchy (possibly misleading) headlines directly corresponds to revenue in this age of click rates. To be fair, these reporters are generous enough to sneak the caveats related to the title in somewhere deep in the article. Unfortunately, this is not limited to science reporting. In 2011, when the anti-corruption movement against the UPA-2 government in India was at its peak, many news outlets used to publish similar articles. In a show called Devil’s Advocate on CNN-IBN, Mr. Kejriwal told Karan Thapar “Citizens are more important than Parliament. It is in the Constitution. Anna Hazare and every citizen is supreme. I think the Constitution says so”.[3] Irrespective of your views on Mr. Kejriwal, I think you can see through the memorable headline the Times of India created out of it: "Anna Hazare is above parliament: Arvind Kejriwal" [4]

Part of the problem is that a lot of us don't have the time or interest to see beyond the wordplay and verify the information from multiple sources (which reduces the possibility of selective reporting) or the original source (which reveals the ground truth). Another aspect is that a lot of us crave flashy headlines. Who wants to read Nature News when BuzzFeed is writing about science?

In some cases, we need a special qualification in order to interpret the words, as with legal systems. We can observe manipulation based on "poorly" worded laws and their "careful" interpretation. A specific example would be a colonial-era hate-speech law in India called Section 295A. It is often used to target rationalists in India who debunk godmen. Sanal Edamaruku, founder and president of Rationalist International, debunked an event perceived as a miracle by a church in Mumbai. Cases were registered against him in three separate police stations. [5]

I think technology can offer the solution to the reporting problem to some extent. Perhaps publishing a white box algorithm for periodically ranking reporters/anchors and newspapers/TV channels on selection bias, exaggeration factor etc in a peer-reviewed open-access journal might be a good start. Not sure how feasible it is considering so many constraints. And even if someone does come up with the algorithm, then industry adherence is another uphill battle. Meanwhile, let's watch out for words.

Thursday, December 8, 2016

Selection Bias

Barack Obama's article at Wired. [1]
Stephen Hawking's article at The Guardian. [2]
Peter Thiel's speech at RNC. [3]

In the last two months, three renowned people have shared their thoughts about the time we live in.

All of them are highly successful and revered figures in their fields. They are all data-driven; you will find them quoting facts and figures all the time. Yet there is a stark difference between their central messages here.

Case 1

Barack Obama wrote an article titled "Now is the greatest time to be alive". His argument is that we have achieved great breakthroughs. Though it's not utopia, considering history, the current time is the best time to live in.
"Just since 1983, when I finished college, things like crime rates, teen pregnancy rates, and poverty rates are all down. Life expectancy is up. The share of Americans with a college education is up too. Tens of millions of Americans recently gained the security of health insurance. Blacks and Latinos have risen up the ranks to lead our businesses and communities. Women are a larger part of our workforce and are earning more money. Once-quiet factories are alive again, with assembly lines churning out the components of a clean-energy age.

And just as America has gotten better, so has the world. More countries know democracy. More kids are going to school. A smaller share of humans know chronic hunger or live in extreme poverty. In nearly two dozen countries—including our own—people now have the freedom to marry whomever they love. And last year the nations of the world joined together to forge the most comprehensive agreement to battle climate change in human history."

Indeed, these are facts. So that does seem like a step towards utopia, doesn't it? Being his admirer I assumed the same.

Case 2

Stephen Hawking published an article this week, "This is the most dangerous time for our planet". As the title suggests, the central theme is pretty much the opposite of the first case.
"The concerns underlying these votes about the economic consequences of globalization and accelerating technological change are absolutely understandable. The automation of factories has already decimated jobs in traditional manufacturing, and the rise of artificial intelligence is likely to extend this job destruction deep into the middle classes, with only the most caring, creative or supervisory roles remaining. This in turn, will accelerate the already widening economic inequality around the world.
The consequences of this are plain to see: the rural poor flock to cities, to shanty towns, driven by hope. And then often, finding that the Instagram nirvana is not available there, they seek it overseas, joining the ever greater numbers of economic migrants in search of a better life. These migrants in turn place new demands on the infrastructures and economies of the countries in which they arrive, undermining tolerance and further fuelling political populism."

I think a lot of us can relate to what he is stating above. Sadly, it does appear to be the bigger picture at a global scale. We could face some serious issues in near future.

Case 3

Peter Thiel gave a speech at the RNC highlighting the poor state of the country. Basically, his stand was that the US as a country has not stayed on its expected trajectory and that things are already bad.
" our government is broken. Our nuclear bases still use floppy disks. Our newest fighter jets can’t even fly in the rain. And it would be kind to say the government’s software works poorly, because much of the time it doesn’t even work at all. That is a staggering decline for the country that completed the Manhattan project. We don’t accept such incompetence in Silicon Valley, and we must not accept it from our government. Americans get paid less today than ten years ago. But healthcare and college tuition cost more every year. Meanwhile, Wall Street bankers inflate bubbles in everything."

If you think about it, he did mention some facts. The average healthcare cost per capita in the US has touched $10,000 per year. Medical debt appears to be the leading cause of personal bankruptcy in the US. Education is getting so expensive that people can spend decade(s) repaying education loans.


If you look at the three cases, you will realize how "convenient" data selection can be used to support almost any argument. The difference in the arguments above could be due to a difference in perception about how to measure things. Measuring things in the real world is an extremely hard problem. In research, "double-blinded + randomized + controlled" trials are considered the gold standard of evidence (not the highest though). Even with these gold standards and billions of dollars, experiments can fail miserably at measuring things. For example, according to a paper published in the Journal of the American Medical Association, cancer drugs in the real world do not live up to the expectations set by clinical trials of the same drugs. The average increase in survival time for patients on these drugs could be a lot less than the results in trials. Sometimes the average increase in survival time for real-world patients taking these drugs is less than the survival time of the patients on placebo (sugar pills) in the experiment. [4]

That might look like an unnecessary example here. However, the point is that even a ton of money and the brightest minds working together cannot guarantee a good judgment of reality. So the least we can do is take things with a grain of salt rather than as absolute reality.
I think it's hard to eliminate selection bias completely, but it can be reduced. The examples above exhibit a comparatively mild level of selection bias. It can get really ugly and dangerous. Irrespective of its nature, selection bias contributes to twisting the perception of reality (by definition) and possibly to spreading misinformation.

Selection bias can twist society's perception significantly in cases such as:
- TV debate
- News article on a news website (whose sole aim could be click rate)
- Biography of a highly successful or controversial person
- Speeches of Politicians or celebrities
- Public surveys and opinion polls (especially by political parties and related organizations)

Let's make a genuine attempt to observe if it's the entire picture or just a "convenient" part of it.



Sunday, January 25, 2015

How to Create and Publish R package on CRAN : Step-by-Step Guide

  • RStudio (this tutorial is based on RStudio 0.98.501)
  • Beginner level R programming skills
  • devtools package (to build and compile the code)
  • roxygen2 package (to create the documentation)
Let's break it down into 7 simple steps as follows:
  1. Create R project
  2. Create function(s)
  3. Create description file
  4. Create help file(s)
  5. Build, load and check the package
  6. Export package
  7. Submit on CRAN
Step 1

1.1  Open RStudio. Create a new project using "file > new project > new directory > empty project". Give the directory a name.

1.2  Install and load R packages "devtools" and "roxygen2".


1.3 Go to "Build > Configure Build Tools".
Select "Package" from the dropdown menu.

Check the option "Generate documentation with Roxygen". A popup window will open; make sure all six checkboxes are checked there.

1.4 Make sure the Build tab appears in the top-right panel.

Step 2

Go to the bottom right panel. Click on "files > new folder" and name the new folder "R". This is the directory where we will save our code (functions).

In the top left panel, click on "File > new File > R script".
Write the function code in the script file and save the file inside the "R" directory.

Step 3

We need to create a description file where we can specify details like package name, title, description, author, maintainer, license etc.

A simple way to create a skeleton of the description file is to go to the bottom left panel (console) and run the command "load_all()". It basically loads all the files. In our case it will also create the description file, as it does not exist yet, and reload the package.

In the bottom right panel you should be able to see the description file under the "Files" tab.

Click on the description file and it will open in the top left panel. Let's put values in the description file,

Save the file and run the command "load_all()" in the console. It will load the package with the newly created description file. You should not see any errors or warnings.

Step 4

The next step is creating the help file for the function we have written. We will add information about the function in the same file that contains the function code. Let's go to AddNumbers.R and add the function description, input parameters of the function, return value, and references if any.

As you can see in the screenshot above, we have added 11 lines before the actual function code.

The last tag, "@export", makes sure this function is publicly available to users of the package.

In some cases we might write a function for internal use by other function(s) in the package. We can keep these internal functions private by not adding "@export".

Step 5

Go to the top right panel, "Build" tab and click on "Build & Reload". You should see something like the following,

In the bottom left panel, you should be able to see that the package is re-loaded.

Now go to the bottom right panel, "Packages" tab; you should see the package we have just created.

Click on it and check that the description file and help file look fine.

The description file looks like,

Click on the back arrow and go to the AddNumbers help file. It should look like,

Now let's test in the console whether the function we have written actually works.

Before we export the package, let's do a thorough check with "Build > Check" in the top right panel.

There should NOT be any warnings or errors.

Step 6

Go to "Build > More > Build Source Package". It will create a source package in 'tar.gz' format.
The output looks like,

Step 7

Make sure you are NOT violating any CRAN submission policies before you proceed.

Go to the CRAN website,

It is a three step process.

Fill in the basic details, upload the package and hit "Upload package".

It will take you to step 2, where you can verify the details and click Submit.

All maintainers of the package listed in the description file will get an email for confirmation. After the maintainers confirm it, CRAN moderators will review it. If the package adheres to CRAN policies it should get approved.

Congratulations! You are now officially a contributor to CRAN!

Tuesday, October 28, 2014

Important Concepts in Statistics

This is a random collection of a few important statistical concepts. These notes provide a simple explanation (not a formal definition) of each concept and the reason why we need it.

Sample space: This is the set of all possible outcome values.

So if we consider a coin flip, the sample space would be {head, tail}. If one unbiased die is thrown, the sample space would be {1, 2, 3, 4, 5, 6}.

Event: It is a subset of the sample space. For the event "getting an even number after throwing an unbiased die", the subset is {2, 4, 6}. So every time we run the experiment, either the event will occur or it won't.

Why we need it: Together, the sample space and the event help us determine the probability of the event. When all outcomes are equally likely, the probability is simply the ratio of the number of elements in the event to the number of elements in the sample space.
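The die example can be written out as a minimal Python sketch. It assumes every outcome is equally likely, and the helper name `probability` is made up for this illustration.

```python
from fractions import Fraction

# Sample space for one throw of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# The event "getting an even number" is a subset of the sample space.
even_event = {outcome for outcome in sample_space if outcome % 2 == 0}


def probability(event, space):
    """Probability of an event when every outcome in the space is equally likely."""
    if not event <= space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(space))


p_even = probability(even_event, sample_space)
print(sorted(even_event), p_even)
```

Using `Fraction` keeps the ratio exact instead of turning it into a rounded decimal.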

Probability distributions: A probability distribution assigns a probability to every possible outcome in the sample space.

So for an unbiased die, the probability of every outcome is equal to 1/6, which looks like,

When a probability distribution looks like this (equal probability for all outcomes), it is called a uniform probability distribution.

An important thing to note here is that the sum of all probabilities in a probability distribution is exactly equal to one.
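As a small sketch of the same idea in Python, the distribution of a fair die can be stored as a mapping from outcome to probability. The variable names are only illustrative.

```python
from fractions import Fraction

# Uniform distribution for a fair die: each outcome maps to probability 1/6.
outcomes = range(1, 7)
die_distribution = {outcome: Fraction(1, 6) for outcome in outcomes}

# The probabilities of a valid distribution must add up to exactly one.
total_probability = sum(die_distribution.values())

# The distribution is uniform when all probabilities are identical.
is_uniform = len(set(die_distribution.values())) == 1

print(total_probability, is_uniform)
```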

Why we need it: Most statistical modelling methods make certain assumptions about the underlying probability distribution. So based on the kind of distribution the data follows, we can choose appropriate methods. Sometimes we transform the data (log transform, inverse transform) if the observed distribution is not what we expected or what certain statistical methods require.

We can categorize probability distributions into two classes: discrete probability distributions and continuous probability distributions.
  • Discrete: The sample space is a collection of discrete values, e.g. coin flip, die throw, etc.
  • Continuous: The sample space is a continuous range containing infinitely many values, e.g. the height of all people in the US, the distance traveled to reach the workplace.

    Normal distribution: It is one of the most important concepts in statistics. Many distributions in the real world are very similar to the normal distribution, which looks like a bell-shaped curve approaching zero on both ends.

    In reality we almost never observe an exact normal distribution in nature; however, in many cases it provides a good approximation.

    "Normal Distribution PDF" by Inductiveload - self-made, Mathematica, Inkscape. Licensed under Public domain via Wikimedia Commons.

    When the mean of a normal distribution is zero and its standard deviation is 1, it is called the standard normal distribution. The red curve is the standard normal distribution.

    Why we need it: Attaching a screenshot from Quora discussion which sums it up pretty well.

    Law of Large numbers: The law of large numbers says that the larger the sample size, the closer our sample mean tends to be to the true (population) mean.

    Why we need it: Have you ever wondered why, if the probability of either outcome (head or tail) for a fair coin is exactly half, 10 trials might still give you different results (e.g. 6 heads and 4 tails)? Well, the law of large numbers provides the answer. It says that as we increase the number of trials, the mean of all trials comes closer to the expected value.

    Another simple example: for an unbiased die the probability of every outcome {1,2,3,4,5,6} is exactly the same (1/6), so the mean should be 3.5.
    "Largenumbers" by NYKevin - Own work. Licensed under CC0 via Wikimedia Commons.

    As we can see in the image above, only after a large number of trials does the mean approach 3.5.
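A quick way to see this for yourself is to simulate die rolls. The sketch below uses Python's standard `random` module with a fixed seed; the function name `running_means` and the number of rolls are arbitrary choices for illustration.

```python
import random


def running_means(num_rolls, seed=0):
    """Roll a simulated fair die num_rolls times and return the running mean after each roll."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    means = []
    for roll_count in range(1, num_rolls + 1):
        total += rng.randint(1, 6)  # randint includes both endpoints
        means.append(total / roll_count)
    return means


means = running_means(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(means[n - 1], 3))
```

The early running means can be noticeably off, while the later ones should sit close to 3.5.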

    Central Limit theorem: Regardless of the underlying distribution (as long as its variance is finite), if we draw large enough samples and plot the mean of each sample, the distribution of those sample means approximates a normal distribution.

    Why we need it: Knowing that data is normally distributed tells us much more about it than an unknown distribution would. The Central Limit Theorem lets us use real-world data (near-normal or non-normal) with statistical methods that assume normality of the data.

    An article on summarizes the practical use of CLT as follows,

    "The assumption that data is from a normal distribution simplifies matters, but seems a little unrealistic. Just a little work with some real-world data shows that outliers, skewness, multiple peaks and asymmetry show up quite routinely. We can get around the problem of data from a population that is not normal. The use of an appropriate sample size and the central limit theorem help us to get around the problem of data from populations that are not normal.

    Thus, even though we might not know the shape of the distribution where our data comes from, the central limit theorem says that we can treat the sampling distribution as if it were normal."
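The effect can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the population is exponential (clearly skewed), and the sample size of 50 and the 2,000 repetitions are arbitrary choices.

```python
import random
import statistics


def sample_means(num_samples, sample_size, seed=0):
    """Draw repeated samples from a skewed (exponential) population and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


means = sample_means(num_samples=2_000, sample_size=50)
center = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution, roughly 68% of values lie within one standard deviation of the mean.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(round(center, 3), round(spread, 3), round(within_one_sd, 3))
```

The sample means should cluster around the population mean of 1, with a spread close to 1/sqrt(50), even though the individual values are far from normally distributed.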

    Correlation: A number representing the strength of association between two variables. A correlation coefficient with a high absolute value implies that the two variables are strongly associated.

    One way to measure it is Pearson's correlation coefficient. It is the most widely used method, and it measures only the linear relationship between variables. The coefficient value varies from -1 to 1.

    A correlation coefficient of zero means there is no linear relationship between the two variables (a non-linear relationship may still exist). A negative value means that as one variable increases, the other decreases.

    The most important thing to remember here is that correlation does not necessarily mean there is causation. It only represents how two variables are associated with each other.

    source : xkcd

    Peter Flom, a statistical consultant, explains the difference in simple words as follows:
    "Correlation means two things go together. Causation means one thing causes another."

    Once we find a correlation, controlled experiments can be conducted to check whether any causation exists. There are also a few statistical methods, such as maximal correlation, that help us check for a non-linear relationship between two variables.

    Why we need it: A correlation coefficient tells us how strongly two variables are associated and the direction of the association.
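To make the coefficient concrete, here is a minimal from-scratch sketch of Pearson's correlation in Python. The function name `pearson_r` and the toy data are made up for illustration, and the function assumes neither input is constant (otherwise it would divide by zero).

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


x = [-2, -1, 0, 1, 2]
linear_r = pearson_r(x, [2 * v + 1 for v in x])   # perfectly linear, increasing
inverse_r = pearson_r(x, [-3 * v for v in x])     # perfectly linear, decreasing
curved_r = pearson_r(x, [v ** 2 for v in x])      # y depends on x, but not linearly
print(linear_r, inverse_r, curved_r)
```

The first two values come out at (approximately) 1 and -1. The third is essentially zero even though y is completely determined by x, because Pearson's coefficient captures only the linear part of a relationship.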

    P-value: The basic notion of this concept is a number representing how likely it is to get results like ours purely by chance, i.e. when there is no real effect. The smaller the number, the harder it is to explain the results by chance alone. Generally 0.05 is used as the threshold, and a result with a P-value below it is considered statistically significant.

    Having said that, Fisher argued strongly that interpretation of the P value was ultimately up to the researcher. The threshold can vary depending on requirements.

    Why we need it: A threshold of 5% or 0.05 tells us that, when there is no real effect, about 1 out of every 20 results will still look significant purely by chance.
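As a worked illustration, consider a coin that we assume to be fair. One common way to formalize the P-value here is the probability of seeing at least as many heads as we observed. The sketch below computes that one-sided value exactly; the function name is made up for this example.

```python
from math import comb


def one_sided_p_value(heads, flips):
    """Probability of seeing at least `heads` heads in `flips` flips of a fair coin."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


p_six = one_sided_p_value(6, 10)    # 6 heads out of 10
p_nine = one_sided_p_value(9, 10)   # 9 heads out of 10
print(p_six, p_nine)
```

Six heads out of ten gives a value of 386/1024 (about 0.38), so it is entirely unremarkable for a fair coin. Nine heads out of ten gives 11/1024 (about 0.011), which falls below the usual 0.05 threshold.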

    Monday, September 15, 2014

    How to check normality of the data

    All parametric tests make certain assumptions about the data. Most of the parametric tests, like the F-test and Z-test, assume the data is normally distributed. So it is always useful to test the assumption of normality before we proceed. I am sharing my notes about normality tests in this post.

    At a high level I would group the tests into two categories:
    • Visual tests
    • Statistical tests

    Visual tests: These might not be the best way to check for normality and can sometimes be ambiguous and/or misleading. Let's get a high-level overview of how to use them.

    Histogram: We can plot a histogram of the observed data and check whether
    • it looks like a bell-shaped curve
    • it is not skewed in any direction.
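Judging skew by eye can be backed up with a number. The sketch below is not a histogram; it computes a simple moment-based sample skewness, a complementary check swapped in here for illustration. The simulated data and the function name `sample_skewness` are assumptions.

```python
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness: about 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


rng = random.Random(0)
bell_shaped = [rng.gauss(0, 1) for _ in range(5_000)]
right_skewed = [rng.expovariate(1.0) for _ in range(5_000)]
print(round(sample_skewness(bell_shaped), 2), round(sample_skewness(right_skewed), 2))
```

A value near zero is consistent with a symmetric, bell-like shape, while a clearly positive or negative value signals skew to the right or left.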

    Tuesday, September 9, 2014

    P-P plot vs Q-Q plot

    P-P plots and Q-Q plots are called probability plots.

    A probability plot helps us compare two data sets in terms of their distribution. Generally one set is theoretical and one set is empirical (though this is not mandatory).

    The two types of probability plots are
    • Q-Q plot (more common)
    • P-P plot
    Before getting into the details, consider the following,
    Diffusion of ideas.svg

    Thursday, September 4, 2014

    Regression concepts simplified

    The regression modelling technique is widely used in analytics and is perhaps the easiest to understand. In this post I am sharing my findings about the concept in simple words.

    What is Simple Linear Regression?

    A Simple Linear Regression allows you to determine the functional dependency between two sets of numbers. For example, we can use regression to determine the relation between ice cream sales and average temperature.

    Since we are talking about functional dependency between two variables, we need one independent variable and one dependent variable. In the example above, if a change in temperature leads to a change in ice cream sales, then temperature is the independent variable and sales is the dependent variable.

    The dependent variable is also called the criterion, response variable or label. It is denoted by Y.

    The independent variable is also referred to as the covariate, predictor or feature. It is denoted by X.
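The ice cream example can be sketched as a least-squares line fit in Python. The temperature and sales figures below are invented purely for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustration data: X = average temperature (°C), Y = ice cream sales (units).
temperature = [15, 18, 21, 24, 27, 30]
sales = [120, 150, 175, 210, 240, 265]

intercept, slope = fit_line(temperature, sales)
predicted_sales_at_25 = intercept + slope * 25
print(round(slope, 2), round(intercept, 2), round(predicted_sales_at_25, 1))
```

The slope estimates how much Y (sales) changes for a one-unit change in X (temperature), and the fitted line can then be used to predict sales at a temperature that was not observed.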

    Sunday, August 31, 2014

    Statistical Modeling vs Machine Learning

    I have often used the terms statistical modeling techniques and machine learning techniques interchangeably, but was not sure about the similarities and differences. So I went through a few resources and am sharing my findings here.

    Let's start with the basic definitions,

    A statistical model is a formalization of relationships between variables in the form of mathematical equations.

    Machine learning is a subfield of computer science and artificial intelligence which deals with building systems that can learn from data, instead of explicitly programmed instructions.

    Let's explore what books and courses say in their first chapter/lecture about both fields.

    From the book “An introduction to statistical learning” by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani