Sunday, August 31, 2014

Statistical Modeling vs Machine Learning

I have often used the terms "statistical modeling techniques" and "machine learning techniques" interchangeably, but I was not sure about their similarities and differences. So I went through a few resources and am sharing my findings here.


Let's start with basic definitions:

A statistical model is a formalization of relationships between variables in the form of mathematical equations.

Machine learning is a subfield of computer science and artificial intelligence that deals with building systems that can learn from data rather than follow explicitly programmed instructions.


Let's explore what books and courses say about both fields in their first chapter or lecture.

From the book “An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani:



A relationship between a response variable and its predictor(s) can be written as:

Y = f(X) + e

where
f : some fixed but unknown function of X
X : an input vector of predictors X1, X2, …, Xn
Y : the output (response)
e : a random error term

Statistical learning refers to a set of approaches for estimating f.
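
As a minimal sketch of what "estimating f" can look like, the snippet below simulates data from Y = f(X) + e and recovers f by ordinary least squares. The linear form of f, the coefficients 2.0 and 3.0, the noise level, and the helper name fit_line are all assumptions made for this illustration; they do not come from the book.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def true_f(x):
    # Hypothetical "true" f, chosen only for this illustration.
    # In a real problem f is unknown and only the data are observed.
    return 2.0 + 3.0 * x

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]
ys = [true_f(x) + rng.gauss(0, 0.5) for x in xs]  # Y = f(X) + e

intercept_hat, slope_hat = fit_line(xs, ys)
print(f"estimated f(x) = {intercept_hat:.2f} + {slope_hat:.2f} * x")
```

The estimated coefficients should land close to the ones used to generate the data, but not exactly on them, because of the random error e.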

From the notes of the Caltech course “Learning from Data” by Yaser S. Abu-Mostafa:


Machine learning requires:

Input (X)
Output (Y)
Target function f : X -> Y (unknown)
Data (x.1, y.1), (x.2, y.2), (x.3, y.3) … (x.n, y.n)
Hypothesis g : X -> Y (chosen to approximate f)
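
The components listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the target function (a threshold at 0.6), the candidate set of threshold classifiers, and the names target_f, make_hypothesis and training_error are hypothetical and are not taken from the course notes. The learner only sees the data and picks the hypothesis g with the lowest training error.

```python
import random

def target_f(x):
    # Hypothetical target function f : X -> Y.
    # In a real problem f is unknown; it is used here only to label the data.
    return 1 if x >= 0.6 else -1

def make_hypothesis(threshold):
    """One candidate hypothesis: predict +1 at or above the threshold, else -1."""
    return lambda x: 1 if x >= threshold else -1

def training_error(h, data):
    """Fraction of examples (x, y) that hypothesis h labels incorrectly."""
    return sum(h(x) != y for x, y in data) / len(data)

rng = random.Random(1)  # fixed seed so the run is repeatable
# Data (x.1, y.1) … (x.n, y.n): inputs with labels produced by the target.
data = [(x, target_f(x)) for x in (rng.random() for _ in range(200))]

# Hypothesis set: thresholds 0.00, 0.05, ..., 1.00.
thresholds = [t / 20 for t in range(21)]
best_threshold = min(
    thresholds, key=lambda t: training_error(make_hypothesis(t), data)
)
g = make_hypothesis(best_threshold)  # final hypothesis g : X -> Y

print("chosen threshold:", best_threshold)
print("training error of g:", training_error(g, data))
```

The point of the sketch is the division of roles: f generates the labels, the data are all the learner observes, and g is whatever the learning procedure selects from the hypothesis set.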


The notes from Andrew Ng’s Machine Learning class at Stanford also cover the same basic concepts.



So we can say that both fields work with data to find a function that takes the data as input and produces the desired output.


Let's see what other people think about both fields.

Larry Wasserman, a statistician and professor at CMU, thinks there is no difference: "They are both concerned with the same question: how do we learn from data?" In his blog post he lists how the same concepts go by different names in the two fields (statistics term ~ machine learning term):
  • Estimation~Learning
  • Classifier~Hypothesis
  • Data point~Example/Instance
  • Regression~Supervised Learning
  • Classification~Supervised Learning
  • Covariate~Feature
  • Response~Label

However, he also mentions a few points that suggest differences:
  • Machine learning is a comparatively new field that evolved in the computer age, whereas statistical data analysis practices existed long before computers were invented.
  • Statistics emphasizes formal statistical inference (confidence intervals, hypothesis tests, optimal estimators) in low-dimensional problems, whereas machine learning emphasizes high-dimensional prediction problems.
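
To make the inference versus prediction distinction concrete, here is a small sketch that asks both kinds of question of the same fitted line. Everything in it is an assumption for illustration: the simulated data, the true slope of 0.5, the 150/50 train/test split, and the use of the normal-approximation multiplier 1.96 in place of an exact t quantile.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(42)  # fixed seed so the run is repeatable
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]  # simulated data

train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]
intercept, slope = fit_line(train_x, train_y)

# Inference-style question: how uncertain is the estimated slope?
n = len(train_x)
residuals = [y - (intercept + slope * x) for x, y in zip(train_x, train_y)]
sigma2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
mean_x = sum(train_x) / n
sxx = sum((x - mean_x) ** 2 for x in train_x)
se_slope = math.sqrt(sigma2 / sxx)  # standard error of the slope
ci = (slope - 1.96 * se_slope, slope + 1.96 * se_slope)  # approx. 95% interval

# Prediction-style question: how well does the fit do on unseen data?
test_mse = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(test_x, test_y)
) / len(test_x)

print(f"slope = {slope:.3f}, approx. 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"held-out mean squared error = {test_mse:.3f}")
```

The model is identical in both halves; only the question changes. The interval describes uncertainty about a parameter, while the held-out error describes how good the predictions are.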


Robert Tibshirani, a statistician and machine learning expert at Stanford, describes machine learning as a glamorous version of statistics in his class notes.



Brendan O’Connor, Assistant Professor at the University of Massachusetts Amherst, wrote along similar lines in a blog post back in 2008. He says, "Statistics and machine learning aren’t very different fields." He later added an update to his post saying, "Statistics, not machine learning, is the real deal, but unfortunately suffers from bad marketing." He explains the difference by mentioning a few techniques that exist in only one of the two fields:

"There are definitely a number of topics in ML that aren’t very related to statistics or probability. Max-margin methods: if all we care about is prediction, why bother using a probability model at all? Why not just optimize the spatial geometry instead? SVM’s don’t require a lick of probability theory to understand. (Of course probability-based approaches are huge in ML, but it’s important to remember they’re not the only game in town, and there is no necessary reason they must be.) And then there are non-traditional settings such as online learning, reinforcement learning, and active learning, where the structure of access to information is in play. There are certainly plenty of things in statistics that aren’t considered part of ML — say, regression diagnostics and significance testing."

Andrew Gelman, a statistician and professor at Columbia University, replied to this on his blog with the following points:
  • It's better to have two fields trying to solve similar problems.
  • He reiterates the point that statistics generally deals with lower-dimensional data than machine learning does.
  • Machine learning has made great progress on hard problems.

There is an interesting discussion in the comments section of this post.




Summary of findings:
  • Both fields are trying to solve similar problems. Unfortunately statistics suffers from bad marketing.
  • Statistics is a much older field that evolved from mathematics, whereas machine learning is fairly new and evolved from computer science and artificial intelligence.
  • Though there is a huge overlap between the two fields, each has a few unique techniques.
  • Machine learning uses computers extensively, which helps in solving many complex problems.
  • Statistics generally deals with low-dimensional data, whereas machine learning is generally associated with high-dimensional data.

Follow the discussion on Reddit.
