Machine Learning – Easy Reference

This post collects the must-know concepts for working with machine learning algorithms. Here is the list for your easy reference; bookmark this page!

Classification metrics

In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.

Confusion matrix: The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:

Main metrics: The following metrics are commonly used to assess the performance of classification models:

ROC: The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

AUC: The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:
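To make the link between the confusion matrix, the ROC curve and the AUC concrete, here is a minimal Python sketch. It assumes the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The function names and the toy labels and scores are illustrative only and do not come from any library.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP, FP, TN, FN for binary labels (1 = positive) at a score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(y_true, y_score):
    """Return (FPR, TPR) points obtained by sweeping the threshold over every score."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (made up): 3 positives and 3 negatives with model scores.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

counts_at_half = confusion_counts(y_true, y_score, 0.5)
area = auc(roc_points(y_true, y_score))
```

On this toy data the area comes out to 8/9, which matches the fraction of (positive, negative) pairs in which the positive example receives the higher score.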

Regression metrics:

Basic metrics: Given a regression model f, the following metrics are commonly used to assess the performance of the model:

Coefficient of determination: The coefficient of determination, often noted R^2 or r^2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:

Main metrics: The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:
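As a rough illustration of these regression metrics, the sketch below computes the coefficient of determination and an adjusted variant that penalizes the number of variables. It assumes the usual definitions R^2 = 1 - SS_res / SS_tot and adjusted R^2 = 1 - (1 - R^2)(m - 1) / (m - n - 1), with m samples and n variables. The function names and the toy values are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)              # total sum of squares
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # residual sum of squares
    return 1 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_variables):
    """R^2 corrected for the number of variables n (assumes m > n + 1 samples)."""
    m = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (m - 1) / (m - n_variables - 1)


# Toy data (made up): four observations and the predictions of a one-variable model.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)
r2_adj = adjusted_r_squared(y_true, y_pred, n_variables=1)
```

Here R^2 is about 0.98 and the adjusted value about 0.97; the adjusted figure is never higher than R^2, which is why it is preferred when comparing models with different numbers of variables.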

Model selection:

Vocabulary: When selecting a model, we distinguish 3 different parts of the data that we have as follows:

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:

Cross-validation: Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:

The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.
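The k-fold procedure described above can be sketched in a few lines of Python. This is a simplified illustration: it uses contiguous folds without shuffling, and the "model" is a hypothetical mean predictor chosen only to keep the example self-contained. All function names are assumptions for the sake of the example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_error(xs, ys, k, fit, predict, loss):
    """Train on k-1 folds, validate on the held-out fold, and average over the k folds."""
    folds = k_fold_indices(len(xs), k)
    fold_errors = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        fold_loss = sum(loss(ys[j], predict(model, xs[j])) for j in val_idx) / len(val_idx)
        fold_errors.append(fold_loss)
    return sum(fold_errors) / k


# Hypothetical model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


def squared_error(y, prediction):
    return (y - prediction) ** 2


xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_error = cross_validation_error(xs, ys, 3, fit_mean, predict_mean, squared_error)
```

In practice the data is usually shuffled before splitting, and the same loop is repeated for each candidate model so that their cross-validation errors can be compared.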

Regularization: The regularization procedure aims to keep the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:
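As a hedged sketch, the three penalties that are most often listed under this heading (LASSO, Ridge and Elastic Net) can be written as terms added to the training loss. The exact weighting used for the Elastic Net below, lam * ((1 - alpha) * L1 + alpha * L2), is one common parameterization and is an assumption here; libraries differ in how they scale these terms.

```python
def l1_penalty(theta, lam):
    """LASSO penalty: lam * sum of absolute coefficient values (pushes coefficients to 0)."""
    return lam * sum(abs(t) for t in theta)


def l2_penalty(theta, lam):
    """Ridge penalty: lam * sum of squared coefficient values (shrinks coefficients)."""
    return lam * sum(t * t for t in theta)


def elastic_net_penalty(theta, lam, alpha):
    """Elastic Net: a mix of the L1 and L2 terms, with alpha in [0, 1] as the mixing weight."""
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return lam * ((1 - alpha) * l1 + alpha * l2)


# Toy coefficient vector (made up).
theta = [1.0, -2.0, 0.5]
lasso = l1_penalty(theta, lam=0.1)
ridge = l2_penalty(theta, lam=0.1)
mixed = elastic_net_penalty(theta, lam=0.1, alpha=0.5)
```

The regularized objective is then the original loss plus one of these penalties, with lam controlling how strongly large coefficients are discouraged.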

Diagnostics:

Bias: The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points.

Variance: The variance of a model is the variability of the model prediction for given data points.

Bias/variance tradeoff: The simpler the model, the higher the bias, and the more complex the model, the higher the variance.

Error analysis: Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

Ablative analysis: Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.

I hope this is a useful quick reference for the important concepts.

Credits: Amidi brothers!

Life-cycle of a Data Science Project


Are you wondering what the life cycle of a data science project looks like? Here you go.
Problem Identification:


Have you ever heard the phrase “Here’s the data, can you do some analysis and find some insights?” Often, management approaches data scientists with vague or even undefined goals. Understanding the goal is important and sets up the rest of the project for success.

This step takes up about 10% of the time in the project life cycle.

Data Preparation:


By far everybody’s least favorite stage, but possibly the most important one. Data can come from different sources, be in ugly formats, and have errors and a myriad of other problems. A single error in this stage can render the rest of the analysis useless.

That’s why typically, up to 70% of the time is spent here.

Analyse the data:


Creating models, performing data mining, setting up simulations, etc. This is the most exciting part, and if the previous stages were done correctly, analyzing the data and getting insights will feel good.

Time needed here would be 10%

Visualization of the insights:


Visualizing comes hand-in-hand with analyzing. This is a powerful technique, as looking at the data in various forms and shapes can help reveal insights that are otherwise not evident. Also, several projects such as BI dashboards don’t need much analysis but rely on visualization instead.

Time needed here would be 10%

Presentation of the findings:


We’ve reached 100%, so the project is over! Actually, no. Presenting findings is a whole separate “additional” stage. You need to not only convey the insights in your audience’s language but also get buy-in from them to take action based on those insights. This is an art.

Time needed: extra 80% 🙂

Hope you benefited! Enjoy learning!

Treasure for Data Science blogs (A to Z)

This post will help you in your hunt for data science knowledge. The list below will help you easily find blogs that talk about data science. I hope you will find it useful.

A Blog From a Human-engineer-being http://www.erogol.com/ 
Aakash Japi http://aakashjapi.com/ 
Adit Deshpande https://adeshpande3.github.io/ 
Advanced Analytics & R http://advanceddataanalytics.net/ 
Adventures in Data Land http://blog.smola.org 
Agile Data Science http://blog.sense.io/ 
Ahmed El Deeb https://medium.com/@D33B 
Airbnb Data blog http://nerds.airbnb.com/data/ 
Alex Castrounis | InnoArchiTech http://www.innoarchitech.com/ 
Alex Perrier http://alexperrier.github.io/ 
Algobeans | Data Analytics Tutorials & Experiments for the Layman https://algobeans.com 
Amazon AWS AI Blog https://aws.amazon.com/blogs/ai/ 
Analytics Vidhya http://www.analyticsvidhya.com/blog/ 
Analytics and Visualization in Big Data @ Sicara https://blog.sicara.com 
Andreas Müller http://peekaboo-vision.blogspot.com/ 
Andrej Karpathy blog http://karpathy.github.io/ 
Andrew Brooks http://brooksandrew.github.io/simpleblog/ 
Andrey Kurenkov http://www.andreykurenkov.com/writing/ 
Anton Lebedevich’s Blog http://mabrek.github.io/ 
Arthur Juliani https://medium.com/@awjuliani 
Audun M. Øygard http://www.auduno.com/ 
Avi Singh https://avisingh599.github.io/ 
Beautiful Data http://beautifuldata.net/ 
Beckerfuffle http://mdbecker.github.io/ 
Becoming A Data Scientist http://www.becomingadatascientist.com/ 
Ben Bolte’s Blog http://benjaminbolte.com/ml/ 
Ben Frederickson http://www.benfrederickson.com/blog/ 
Berkeley AI Research http://bair.berkeley.edu/blog/ 
Big-Ish Data http://bigishdata.com/ 
Blog on neural networks http://yerevann.github.io/ 
Blogistic Regression http://d10genes.github.io/blog/ 
blogR | R tips and tricks from a scientist https://drsimonj.svbtle.com/ 
Brain of mat kelcey http://matpalm.com/blog/ 
Brilliantly wrong thoughts on science and programming https://arogozhnikov.github.io/ 
Bugra Akyildiz http://bugra.github.io/ 
Building Babylon https://building-babylon.net/ 
Carl Shan http://carlshan.com/ 
Chris Stucchio https://www.chrisstucchio.com/blog/index.html 
Christophe Bourguignat https://medium.com/@chris_bour 
Christopher Nguyen https://medium.com/@ctn 
Cloudera Data Science Posts http://blog.cloudera.com/blog/category/data-science/ 
colah’s blog http://colah.github.io/archive.html 
Cortana Intelligence and Machine Learning Blog https://blogs.technet.microsoft.com/machinelearning/ 
Daniel Forsyth http://www.danielforsyth.me/ 
Daniel Homola http://danielhomola.com/category/blog/ 
Daniel Nee http://danielnee.com 
Data Based Inventions http://datalab.lu/ 
Data Blogger https://www.data-blogger.com/ 
Data Labs http://blog.insightdatalabs.com/ 
Data Meets Media http://datameetsmedia.com/ 
Data Miners Blog http://blog.data-miners.com/ 
Data Mining Research http://www.dataminingblog.com/ 
Data Mining: Text Mining, Visualization and Social Media http://datamining.typepad.com/data_mining/ 
Data Piques http://blog.ethanrosenthal.com/ 
Data School http://www.dataschool.io/ 
Data Science 101 http://101.datascience.community/ 
Data Science @ Facebook https://research.facebook.com/blog/datascience/ 
Data Science Insights http://www.datasciencebowl.com/data-science-insights/ 
Data Science Tutorials https://codementor.io/data-science/tutorial 
Data Science Vademecum http://datasciencevademecum.wordpress.com/ 
Dataaspirant http://dataaspirant.com/ 
Dataclysm http://blog.okcupid.com/ 
DataGenetics http://datagenetics.com/blog.html 
Dataiku https://www.dataiku.com/blog/ 
DataKind http://www.datakind.org/blog 
DataLook http://blog.datalook.io/ 
Datanice https://datanice.wordpress.com/ 
Dataquest Blog https://www.dataquest.io/blog/ 
DataRobot http://www.datarobot.com/blog/ 
Datascope http://datascopeanalytics.com/blog 
DatasFrame http://tomaugspurger.github.io/ 
David Mimno http://www.mimno.org/ 
Dayne Batten http://daynebatten.com 
Deep Learning http://deeplearning.net/blog/ 
Deepdish http://deepdish.io/ 
Delip Rao http://deliprao.com/ 
DENNY’S BLOG http://blog.dennybritz.com/ 
Dimensionless https://dimensionless.in/blog/ 
Distill http://distill.pub/ 
District Data Labs http://districtdatalabs.silvrback.com/ 
Diving into data https://blog.datadive.net/ 
Domino Data Lab’s blog http://blog.dominodatalab.com/ 
Dr. Randal S. Olson http://www.randalolson.com/blog/ 
Drew Conway https://medium.com/@drewconway 
Dustin Tran http://dustintran.com/blog/ 
Eder Santana https://edersantana.github.io/blog.html 
Edwin Chen http://blog.echen.me 
EFavDB http://efavdb.com/ 
Emilio Ferrara, Ph.D. http://www.emilio.ferrara.name/ 
Entrepreneurial Geekiness http://ianozsvald.com/ 
Eric Jonas http://ericjonas.com/archives.html 
Eric Siegel http://www.predictiveanalyticsworld.com/blog 
Erik Bern http://erikbern.com 
ERIN SHELLMAN http://www.erinshellman.com/ 
Eugenio Culurciello http://culurciello.github.io/ 
Fabian Pedregosa http://fa.bianp.net/ 
Fast Forward Labs http://blog.fastforwardlabs.com/ 
FastML http://fastml.com/ 
Florian Hartl http://florianhartl.com/ 
FlowingData http://flowingdata.com/ 
Full Stack ML http://fullstackml.com/ 
GAB41 http://www.lab41.org/gab41/ 
Garbled Notes http://www.chioka.in/ 
Greg Reda http://www.gregreda.com/blog/ 
Hyon S Chu https://medium.com/@adailyventure 
i am trask http://iamtrask.github.io/ 
I Quant NY http://iquantny.tumblr.com/ 
inFERENCe http://www.inference.vc/ 
Insight Data Science https://blog.insightdatascience.com/ 
INSPIRATION INFORMATION http://myinspirationinformation.com/ 
Ira Korshunova http://irakorshunova.github.io/ 
I’m a bandit https://blogs.princeton.edu/imabandit/ 
Jason Toy http://www.jtoy.net/ 
Jeremy D. Jackson, PhD http://www.jeremydjacksonphd.com/ 
Jesse Steinweg-Woods https://jessesw.com/ 
Joe Cauteruccio http://www.joecjr.com/ 
John Myles White http://www.johnmyleswhite.com/ 
John’s Soapbox http://joschu.github.io/ 
Jonas Degrave http://317070.github.io/ 
Joy Of Data http://www.joyofdata.de/blog/ 
Julia Evans http://jvns.ca/ 
KDnuggets http://www.kdnuggets.com/ 
Keeping Up With The Latest Techniques http://colinpriest.com/ 
Kenny Bastani http://www.kennybastani.com/ 
Kevin Davenport http://kldavenport.com/ 
kevin frans http://kvfrans.com/ 
korbonits | Math ∩ Data http://korbonits.github.io/ 
Large Scale Machine Learning http://bickson.blogspot.com/ 
LATERAL BLOG https://blog.lateral.io/ 
Lazy Programmer http://lazyprogrammer.me/ 
Learn Analytics Here https://learnanalyticshere.wordpress.com/ 
LearnDataSci http://www.learndatasci.com/ 
Learning With Data http://learningwithdata.com/ 
Life, Language, Learning http://daoudclarke.github.io/ 
Locke Data https://itsalocke.com/blog/ 
Louis Dorard http://www.louisdorard.com/blog/ 
M.E.Driscoll http://medriscoll.com/ 
Machinalis http://www.machinalis.com/blog 
Machine Learning (Theory) http://hunch.net/ 
Machine Learning and Data Science http://alexhwoods.com/blog/ 
Machine Learning https://charlesmartin14.wordpress.com/ 
Machine Learning Mastery http://machinelearningmastery.com/blog/ 
Machine Learning Blogs https://machinelearningblogs.com/ 
Machine Learning, etc http://yaroslavvb.blogspot.com 
Machine Learning, Maths and Physics https://mlopezm.wordpress.com/ 
Machined Learnings http://www.machinedlearnings.com/ 
MAPPING BABEL https://jack-clark.net/ 
MAPR Blog https://www.mapr.com/blog 
MAREK REI http://www.marekrei.com/blog/ 
MARGINALLY INTERESTING http://blog.mikiobraun.de/ 
Math ∩ Programming http://jeremykun.com/ 
Matthew Rocklin http://matthewrocklin.com/blog/ 
Melody Wolk http://melodywolk.com/projects/ 
Mic Farris http://www.micfarris.com/ 
Mike Tyka http://mtyka.github.io/ 
minimaxir | Max Woolf’s Blog http://minimaxir.com/ 
Mirror Image https://mirror2image.wordpress.com/ 
Mitch Crowe http://www.dataphoric.com/ 
MLWave http://mlwave.com/ 
MLWhiz http://mlwhiz.com/ 
Models are illuminating and wrong https://peadarcoyle.wordpress.com/ 
Moody Rd http://blog.mrtz.org/ 
Moonshots http://jxieeducation.com/ 
Mourad Mourafiq http://mourafiq.com/ 
My thoughts on Data science, predictive analytics, Python http://shahramabyari.com/ 
Natural language processing blog http://nlpers.blogspot.fr/ 
Neil Lawrence http://inverseprobability.com/blog.html 
NLP and Deep Learning enthusiast http://camron.xyz/ 
no free hunch http://blog.kaggle.com/ 
Nuit Blanche http://nuit-blanche.blogspot.com/ 
Number 2147483647 https://no2147483647.wordpress.com/ 
On Machine Intelligence https://aimatters.wordpress.com/ 
Opiate for the masses Data is our religion. http://opiateforthemass.es/ 
p-value.info http://www.p-value.info/ 
Pete Warden’s blog http://petewarden.com/ 
Plotly Blog http://blog.plot.ly/ 
Probably Overthinking It http://allendowney.blogspot.ca/ 
Prooffreader.com http://www.prooffreader.com 
ProoffreaderPlus http://prooffreaderplus.blogspot.ca/ 
Publishable Stuff http://www.sumsar.net/ 
PyImageSearch http://www.pyimagesearch.com/ 
Pythonic Perambulations https://jakevdp.github.io/ 
quintuitive http://quintuitive.com/ 
R and Data Mining https://rdatamining.wordpress.com/ 
R-bloggers http://www.r-bloggers.com/ 
R2RT http://r2rt.com/ 
Ramiro Gómez http://ramiro.org/notebooks/ 
Random notes on Computer Science, Mathematics and Software Engineering http://barmaley-exe.github.io/ 
Randy Zwitch http://randyzwitch.com/ 
RaRe Technologies http://rare-technologies.com/blog/ 
Rayli.Net http://rayli.net/blog/ 
Revolutions http://blog.revolutionanalytics.com/ 
Rinu Boney http://rinuboney.github.io/ 
RNDuja Blog http://rnduja.github.io/ 
Robert Chang https://medium.com/@rchang 
Rocket-Powered Data Science http://rocketdatascience.org 
Sachin Joglekar’s blog https://codesachin.wordpress.com/ 
samim https://medium.com/@samim 
Sean J. Taylor http://seanjtaylor.com/ 
Sebastian Raschka http://sebastianraschka.com/blog/index.html 
Sebastian Ruder http://sebastianruder.com/ 
Sebastian’s slow blog http://www.nowozin.net/sebastian/blog/ 
SFL Scientific Blog https://sflscientific.com/blog/ 
Shakir’s Machine Learning Blog http://blog.shakirm.com/ 
Simply Statistics http://simplystatistics.org 
Springboard Blog http://springboard.com/blog
Startup.ML Blog http://startup.ml/blog 
Statistical Modeling, Causal Inference, and Social Science http://andrewgelman.com/ 
Stigler Diet http://stiglerdiet.com/ 
Stitch Fix Tech Blog http://multithreaded.stitchfix.com/blog/ 
Storytelling with Statistics on Quora http://datastories.quora.com/ 
StreamHacker http://streamhacker.com/ 
Subconscious Musings http://blogs.sas.com/content/subconsciousmusings/ 
Swan Intelligence http://swanintelligence.com/ 
TechnoCalifornia http://technocalifornia.blogspot.se/ 
TEXT ANALYSIS BLOG | AYLIEN http://blog.aylien.com/ 
The Angry Statistician http://angrystatistician.blogspot.com/ 
The Clever Machine https://theclevermachine.wordpress.com/ 
The Data Camp Blog https://www.datacamp.com/community/blog 
The Data Incubator http://blog.thedataincubator.com/ 
The Data Science Lab https://datasciencelab.wordpress.com/ 
THE ETZ-FILES http://alexanderetz.com/ 
The Science of Data http://www.martingoodson.com 
The Shape of Data https://shapeofdata.wordpress.com 
The unofficial Google data science Blog http://www.unofficialgoogledatascience.com/ 
Tim Dettmers http://timdettmers.com/ 
Tombone’s Computer Vision Blog http://www.computervisionblog.com/ 
Tommy Blanchard http://tommyblanchard.com/category/projects 
Trevor Stephens http://trevorstephens.com/ 
Trey Causey http://treycausey.com/ 
UW Data Science Blog http://datasciencedegree.wisconsin.edu/blog/ 
Wellecks http://wellecks.wordpress.com/ 
Wes McKinney http://wesmckinney.com/archives.html 
While My MCMC Gently Samples http://twiecki.github.io/ 
WildML http://www.wildml.com/ 
Will do stuff for stuff http://rinzewind.org/blog-en 
Will wolf http://willwolf.io/ 
WILL’S NOISE http://www.willmcginnis.com/ 
William Lyon http://www.lyonwj.com/ 
Win-Vector Blog http://www.win-vector.com/blog/ 
Yanir Seroussi http://yanirseroussi.com/ 
Zac Stewart http://zacstewart.com/ 
ŷhat http://blog.yhat.com/ 
ℚuantitative √ourney http://outlace.com/ 
大トロ http://blog.otoro.net/ 

Data Engineer vs Data Scientist (Infographic)

This infographic will help us better understand the skills and responsibilities of a Data Engineer and a Data Scientist. It also helps us compare salaries and the popular software and tools used by each. Hope this helps!


10 famous TV shows related to Data science & AI (Artificial Intelligence)

“If you want to become one, first get inspired by one”

There are always a few interesting ways to learn things and get inspired. Would you like to know a few TV shows that are based on data science and artificial intelligence? We always like to do things in the way we love. Here you go, and happy watching (learning)!

 


Thanks to AV for this.

Top 8 Viz features in Excel 2016 !

This is especially for the Excel lovers! In this post, we will see a few of the new and exciting data visualization features of Excel 2016.

Here is the list of new features

  1. Hierarchy Chart/Tree Map
  2. Sunburst
  3. Water fall or Stock Chart
  4. Transform Cold data into a cool picture
  5. Instant Histogram
  6. Pareto Chart
  7. 3D map
  8. One click forecast

These are the charts most wanted by dashboard creators. They are very simple and attractive. This set of features makes Excel more competitive with other expensive visualization tools.

1. Hierarchy Chart/Tree Map:

Select the data that you want to use for the chart, then go to the ‘Insert’ tab > Charts > Insert Hierarchy Chart.


Isn’t it cool? OK, we go to the next one.

2. Sunburst/Donut Chart:

It is another representation of a pie chart, an alternative to the boring pie chart. Go to ‘Insert’ > Charts > Insert Hierarchy Chart > Sunburst.

3. Water fall or Stock Chart

It is recommended to sort the data in some order to get better insights.

4. Transform Cold data into a cool picture

This one is based on the Add-ins.

Select your data to visualize.

Select ‘Settings’ to change the design of the charts.

5. Instant Histogram:

Create histograms quickly instead of going to the “Analysis ToolPak” add-in. Go to Insert > Charts > Histogram.

6. Pareto Chart:

Earlier, we had to customize the data structure to create a Pareto chart, but now a chart that explains the 80/20 principle is just a click away.
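Behind the chart, a Pareto analysis is just a descending sort plus a running total. As a rough Python sketch (the defect categories and counts are made up for illustration, and this is not a description of how Excel builds the chart internally):

```python
def pareto_table(counts):
    """Return (category, count, cumulative percentage) rows, largest category first."""
    total = sum(counts.values())
    rows, running = [], 0
    for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += count
        rows.append((name, count, 100 * running / total))
    return rows


# Hypothetical defect counts by category.
defects = {"dent": 30, "other": 5, "scratch": 50, "misprint": 15}
table = pareto_table(defects)

for name, count, cumulative in table:
    print(f"{name:10s} {count:4d} {cumulative:6.1f}%")
```

The bars of the chart correspond to the sorted counts and the line to the cumulative percentage, which makes it easy to see how few categories account for most of the total.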


7. 3D map:

Power Map, the popular 3-D geospatial visualization add-in for Excel 2013, is now fully integrated into Excel. We’ve also given this feature a more descriptive name, “3D Maps”. You’ll find this functionality alongside other visualization features on the Insert tab.

It will open another sheet.

Then we can change the theme and other options like ‘2D Map’. The “Play Tour” option will show an awesome chart with lively visuals.

8. One click Forecast

Forecasting has become easier for data analysts.

Select the data that you want to forecast, go to the ‘Data’ tab and click “Forecast Sheet”.

Adjust the “Seasonality” setting appropriately, and your forecast is ready.

Hope you like these features, with much more to come from Microsoft. Try them out and enjoy!

Data Viz ! Cheat sheet for R Data Analyst

Data visualization has become a vital part of the data science arena. Hence, our key tool should have strong capabilities on both fronts: data analysis as well as data visualization. With this shift in the landscape, R has gained immense popularity because of its splendid data visualization capabilities. With a few lines of code, you can produce beautiful charts and data stories. R contains superb libraries to create basic and more evolved visualizations like bar charts, histograms, scatter plots, map visualizations, mosaic plots and various others. Below is a cheat sheet of widely used visualizations for representing data. Thanks to my colleague for sharing this.

Data Viz Cheat Sheet

Introducing cricketr! : An R package to analyze performances of cricketers

A very good analysis using R in the field of cricket. Must see ! 🙂

Tinniam V Ganesh, Giga thoughts ...

Yet all experience is an arch wherethro’
Gleams that untravell’d world whose margin fades
For ever and forever when I move.
How dull it is to pause, to make an end,
To rust unburnish’d, not to shine in use!

Ulysses by Alfred Tennyson

Introduction

This is an initial post in which I introduce a cricketing package ‘cricketr’ which I have created. This package was a natural culmination of my earlier posts on cricket and my completing 9 modules of the Data Science Specialization from Johns Hopkins University at Coursera. The thought of creating this package struck me some time back, and I have finally been able to bring this to fruition.

So here it is. My R package ‘cricketr!!!’

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package only uses data from test cricket. I plan to develop functionality for One-day and…


A Complete List of Data Science Online Classes

Great resources to learn data science online ! Here you go !

Scott Ge, Hi, I'm Scott

The blog is now migrated to http://scottge.net/2015/06/08/complete-list-of-data-science-online-classes/

You can consider online classes from Coursera for self-study. Coursera provides online classes (most of them are free) offered by university professors, typically attended worldwide by thousands of students and working professionals. In particular, consider the Data Science Specialization from Johns Hopkins University, which offers a guaranteed certificate demonstrating your ability.

Coursera Courses

Additional online class resources

Acknowledgement

Please share in the comments…


Growth of Six Sigma

Below is the Google Trends search trend for Six Sigma over time.


There can be two reasons for the decreasing trend:

  1. Awareness of Six Sigma is almost complete, hence searches have declined over the period.
  2. Six sigma is not really a big deal.

I would go with the second one, but for a slightly different reason.

So many people get trained in Six Sigma just by paying money (2 days, 5 days, at most 10 days), and then they start to practice and teach Six Sigma.

I personally know so many ineffective people teaching Six Sigma!

So what they teach is what counts as Six Sigma now. I think this is the reason behind the fade-out of Six Sigma.

Eight Steps to become a Data Scientist ! (The Sexiest and the Hot Job of the Decade)

Thinking about how to become a Data Scientist? Here we go: the 8 steps to become a Data Scientist (The Sexiest and the Hot Job of the Decade).

Well, these steps are not so easy, but they are possible if we try. Most of the steps are free or very low cost.

https://i0.wp.com/blog.datacamp.com/wp-content/uploads/2014/08/How-to-become-a-data-scientist.jpg

Thanks to DataCamp for the nice infographic. Is this info useful? Then please share it with your circle.

Clash of the Titans ! (R vs Python)

This is for everyone out there who is wondering which is the better language to learn for data analysis and visualization, and whether to use R or Python for everyday data analysis tasks.

Both Python and R are amongst the most widely used languages for data analysis, and have their supporters and opponents. While Python is highly praised for being a general-purpose language with an easy-to-understand syntax, R’s functionality is developed with statisticians in mind, thus giving it field-specific advantages such as extensive features for data visualization.

DataCamp has recently released a new infographic for everyone interested in how these two (statistical) programming languages relate to each other. This superb infographic explores the strengths of R over Python and vice versa, and aims to provide a basic comparison between these two programming languages from a data science and statistics perspective.

R vs Python for data science

Note:

Let us not ignore the new entrant on the battlefield, the “Julia” language. It is a high-level dynamic programming language designed to address the requirements of high-performance numerical and scientific computing while also being effective for general-purpose programming. It is influenced by MATLAB, C, Python, Perl, R, Ruby and others.

Soon we expect Julia to join the clash !

Introduction to Six Sigma, in the way you want to know !

What is Six Sigma?

A method that helps organizations improve the capability of their business processes. This increase in performance and decrease in process variation leads to defect reduction and improvement in profits, employee morale, and quality of products or services. Six Sigma quality is a term generally used to indicate that a process is well controlled (within process limits of ±3σ from the center line in a control chart, and requirements/tolerance limits of ±6σ from the center line).

Diverse definitions have been proposed for Six Sigma, but they all share some common threads:

  • Use of teams that are assigned well-defined projects that have direct impact on the organization’s bottom line.
  • Training in “statistical thinking” at all levels and providing key people with extensive training in advanced statistics and project management. These key people are designated “Black Belts.”
  • Emphasis on the DMAIC approach to problem solving: define, measure, analyze, improve, and control.
  • A management environment that supports these initiatives as a business strategy.

Six Sigma has two key methodologies:

  • DMAIC: It refers to a data-driven quality strategy for improving processes. This methodology is used to improve an existing business process.
  • DMADV: It refers to a data-driven quality strategy for designing products & processes. This methodology is used to create new product designs or process designs in such a way that it results in a more predictable, mature and defect free performance.

There is one more methodology called DFSS – Design For Six Sigma. DFSS is a data-driven quality strategy for designing or redesigning a product or service from the ground up.

Sometimes a DMAIC project may turn into a DFSS project because the process in question requires complete redesign to bring about the desired degree of improvement.

DMAIC Methodology:

This methodology consists of the following five steps.

Define –> Measure –> Analyze –> Improve –> Control

  • Define: Define the problem or project goal that needs to be addressed.
  • Measure: Measure the problem and process from which it was produced.
  • Analyze: Analyze data and process to determine root causes of defects and opportunities.
  • Improve: Improve the process by finding solutions to fix, diminish, and prevent future problems.
  • Control: Implement, control, and sustain the improvement solutions to keep the process on the new course.

DMADV Methodology

This methodology consists of five steps:

Define –> Measure –> Analyze –> Design –> Verify

  • Define: Define the problem or project goal that needs to be addressed.
  • Measure: Measure and determine customer needs and specifications.
  • Analyze: Analyze the process to meet the customer needs.
  • Design: Design a process that will meet customer needs.
  • Verify: Verify the design performance and ability to meet customer needs.

DFSS Methodology

DFSS is a separate and emerging discipline related to Six Sigma quality processes. This is a systematic methodology utilizing tools, training, and measurements to enable us to design products and processes that meet customer expectations and can be produced at Six Sigma Quality levels.

This methodology can have the following five steps.

Define –> Identify –> Design –> Optimize –> Verify

  • Define: Define what the customers want, or what they do not want.
  • Identify: Identify the customer and the project.
  • Design: Design a process that meets customer needs.
  • Optimize: Determine process capability and optimize the design.
  • Verify: Test, verify, and validate the design.

Features of Six Sigma

  • Six Sigma’s aim is to eliminate waste and inefficiency, thereby increasing customer satisfaction by delivering what the customer is expecting.
  • Six Sigma follows a structured methodology, and has defined roles for the participants.
  • Six Sigma is a data driven methodology, and requires accurate data collection for the processes being analyzed.
  • Six Sigma is about putting results on Financial Statements.
  • Six Sigma is a business-driven, multi-dimensional structured approach for:
    • Improving Processes
    • Lowering Defects
    • Reducing process variability
    • Reducing costs
    • Increasing customer satisfaction
    • Increasing profits

The word Sigma is a statistical term that measures how far a given process deviates from perfection.

The central idea behind Six Sigma: if you can measure how many “defects” you have in a process, you can systematically figure out how to eliminate them and get as close to “zero defects” as possible. Specifically, Six Sigma means a failure rate of 3.4 parts per million, or 99.9997% perfect.
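The 3.4 parts per million figure can be reproduced from the normal distribution. The sketch below assumes the conventional 1.5 sigma long-term shift of the process mean, which is not stated in the text above, and counts only the tail beyond the nearer specification limit. The function name is illustrative.

```python
import math


def defects_per_million(sigma_level, shift=1.5):
    """Expected defects per million opportunities for a given sigma level.

    Assumes normally distributed output whose mean has drifted `shift`
    standard deviations toward one specification limit; only the tail
    beyond that nearer limit is counted.
    """
    z = sigma_level - shift
    tail_probability = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal
    return tail_probability * 1_000_000


six_sigma_dpmo = defects_per_million(6)
three_sigma_dpmo = defects_per_million(3)
```

With these assumptions, a 6 sigma process gives roughly 3.4 defects per million opportunities, while a 3 sigma process gives roughly 66,800.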

Key Concepts of Six Sigma

At its core, Six Sigma revolves around a few key concepts.

  • Critical to Quality : Attributes most important to the customer.
  • Defect : Failing to deliver what the customer wants.
  • Process Capability : What your process can deliver.
  • Variation : What the customer sees and feels.
  • Stable Operations : Ensuring consistent, predictable processes to improve what the customer sees and feels.
  • Design for Six Sigma : Designing to meet customer needs and process capability.

Our Customers Feel the Variance, Not the Mean. So Six Sigma focuses first on reducing process variation and then on improving the process capability.

Myths about Six Sigma

There are several myths and misunderstandings surrounding Six Sigma. A few of them are given below:

  • Six Sigma is only concerned with reducing defects.
  • Six Sigma is a process for production or engineering.
  • Six Sigma cannot be applied to engineering activities.
  • Six Sigma uses difficult-to-understand statistics.
  • Six Sigma is just training.

Benefits of Six Sigma

Six Sigma offers six major benefits that attract companies:

  • Generates sustained success
  • Sets a performance goal for everyone
  • Enhances value to customers
  • Accelerates the rate of improvement
  • Promotes learning and cross-pollination
  • Executes strategic change

Origin of Six Sigma

  • Six Sigma originated at Motorola in the early 1980s, in response to the goal of achieving a 10X reduction in product-failure levels in 5 years.
  • Engineer Bill Smith invented Six Sigma, but died of a heart attack in the Motorola cafeteria in 1993, never knowing the scope of the craze and controversy he had touched off.
  • Six Sigma is based on various quality management theories (e.g. Deming’s 14 points for management, Juran’s 10 steps for achieving quality).

There are three key elements of Six Sigma Process Improvement:

  • Customers
  • Processes
  • Employees

The Customers:

Customers define quality. They expect performance, reliability, competitive prices, on-time delivery, service, clear and correct transaction processing and more. This means it is important to provide what the customers need to gain customer delight.

The Processes:

Defining processes as well as defining their metrics and measures is the central aspect of Six Sigma.

In a business, quality should be looked at from the customer’s perspective, so we must look at a defined process from the outside in.

By understanding the transaction lifecycle from the point of view of the customer’s needs and processes, we can discover what they are seeing and feeling. This gives us a chance to identify weak areas within a process so that we can improve them.

The Employees:

A company must involve all its employees in the Six Sigma program. The company must provide opportunities and incentives for employees to focus their talents and abilities on satisfying customers.

It is important in Six Sigma that all team members have a well-defined role with measurable objectives.

Six Sigma Belts (remember karate belts ! 🙂 )

Six Sigma professionals exist at all levels, each with a different role to play. While execution and roles may vary, here is a straightforward guide to who does what.

At the project level, there are black belts, master black belts, green belts, yellow belts and white belts. These people conduct projects and implement improvements.

Levels, with their roles and responsibilities:

  • Executives : Provide overall alignment by establishing the strategic focus of the Six Sigma program within the context of the organization’s culture and vision.
  • Champions : Translate the company’s vision, mission, goals and metrics to create an organizational deployment plan and identify individual projects. Identify resources and remove roadblocks.
  • Master Black Belt (MBB) : Trains and coaches Black Belts and Green Belts. Functions more at the Six Sigma program level by developing key metrics and the strategic direction. Acts as an organization’s Six Sigma technologist and internal consultant.
  • Black Belt (BB) : Understands Six Sigma philosophies and principles, including the supporting systems and tools. Demonstrates team leadership and understands all aspects of the DMAIC model in accordance with Six Sigma principles. Leads problem-solving projects. Trains and coaches project teams.
  • Green Belt (GB) : Supports a Six Sigma Black Belt by analyzing and solving quality problems and is involved in quality-improvement projects. Assists with data collection and analysis for Black Belt projects. Leads Green Belt projects or teams.
  • Yellow Belt (YB) : Participates as a project team member. Reviews process improvements that support the project. Has a small role, interest, or need to develop foundational knowledge of Six Sigma, whether as an entry level employee or an executive champion.
  • White Belt (WB) : Can work on local problem-solving teams that support overall projects, but may not be part of a Six Sigma project team. Understands basic Six Sigma concepts from an awareness perspective.

Different views on the definition of Six Sigma:

Methodology— This view of Six Sigma recognizes the underlying and rigorous approach known as DMAIC (define, measure, analyze, improve and control). DMAIC defines the steps a Six Sigma practitioner is expected to follow, starting with identifying the problem and ending with the implementation of long-lasting solutions. While DMAIC is not the only Six Sigma methodology in use, it is certainly the most widely adopted and recognized.

Metrics – In simple terms, Six Sigma quality performance means 3.4 defects per million opportunities (DPMO).

Philosophy— The philosophical standpoint views all effort as processes that can be defined, measured, analyzed, improved and controlled. Processes require inputs (x) and produce outputs (y). If you control the inputs, you will control the outputs. This is commonly expressed as y = f(x).

Set of tools— The Six Sigma expert uses qualitative and quantitative techniques to drive process improvement. A few such tools include statistical process control (SPC), control charts, failure mode and effects analysis, and process mapping. Six Sigma professionals do not totally agree as to exactly which tools constitute the set.
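As an illustration of one of these tools, here is a minimal sketch of the limits for an individuals control chart in R. The measurements are made up, and the chart type is my choice for the example rather than anything prescribed above. The constant 1.128 is the standard d2 value for moving ranges of size 2.

```r
# Hypothetical measurements from a process (e.g. cycle time in minutes)
x <- c(10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.1, 12.5)

# Individuals (I) chart: estimate sigma from the average moving range.
# 1.128 is the standard d2 constant for moving ranges of size 2.
moving_range <- abs(diff(x))
sigma_hat <- mean(moving_range) / 1.128

center <- mean(x)
ucl <- center + 3 * sigma_hat  # upper control limit
lcl <- center - 3 * sigma_hat  # lower control limit

# Points outside the limits signal special-cause variation
out_of_control <- which(x > ucl | x < lcl)
out_of_control
```

Points inside the limits are treated as common-cause variation; a point outside them is a prompt to investigate what changed in the process.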

Steps to Learn Data Science using R

One of the common difficulties individuals face in learning R is the lack of an organized path. They don’t know where to start, how to proceed, or which way to choose. There is a surplus of good free resources on the Internet, but this can be overwhelming as well as puzzling.

After mining through countless resources and archives, here is a comprehensive learning path for R, starting from the beginning. It will help you learn R rapidly and proficiently.

Step 1: Download and Install R

The easiest way to proceed is to download the base version of R, along with installation instructions, from the CRAN site. R is available for Windows, Mac and Linux, and Windows and Mac users will most likely want one of the prebuilt versions offered there. R is also part of many Linux distributions, so you should check your Linux package management system in addition to CRAN.

You can now install various packages. There are more than 9000 packages in R for different purposes. CRAN Task Views group these packages by topic, so you can select the subset of packages that you want.

To install a package, call install.packages() with the package name in quotes. For example, to install a package called “animation”, use:

install.packages("animation")

Normally the package should just install. However:

  • if you are using Linux and don’t have root access, installing into the system library won’t work; install into a personal library instead.
  • you may be asked to select a CRAN mirror, i.e. the server from which the package is downloaded.
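Both situations can be handled with arguments to install.packages(). Here is a minimal sketch; install_if_missing is a hypothetical helper name of my own, and the mirror URL is just one possible choice.

```r
# Hypothetical helper: install a package only if it is not already available.
install_if_missing <- function(pkg,
                               repos = "https://cloud.r-project.org",
                               lib = .libPaths()[1]) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  # Setting repos avoids the interactive mirror prompt;
  # lib can point to a personal library you have write access to.
  install.packages(pkg, repos = repos, lib = lib)
  invisible(TRUE)
}

# Usage (not run here because it needs network access):
# install_if_missing("animation")
#
# Without root access, create and use a personal library:
# user_lib <- Sys.getenv("R_LIBS_USER")
# dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)
# install_if_missing("animation", lib = user_lib)
```

In an interactive session, R will usually offer to create a personal library for you, so the explicit lib argument mostly matters in scripts.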

You should also install RStudio. It makes R coding much easier, since it allows you to type multiple lines of code, handle plots, install and maintain packages, and navigate your programming environment.

Step 2: Learn the basics

You need to start by learning the basics of the language, its libraries and its data structures. The R track from DataCamp is the best place to start your journey. See the free Introduction to R course at https://www.datacamp.com/courses/introduction-to-r. After doing this course, you will be comfortable writing basic scripts in R and will understand basic data analysis. Alternatively, you can try Code School for R at http://tryr.codeschool.com/

If you want to learn R offline on your own time – you can use the interactive package swirl from http://swirlstats.com

Focus first on read.table, data frames, table, summary, describe (which comes from add-on packages such as Hmisc or psych), loading and installing packages, and data visualization using the plot command.
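Here is a short sketch of these basics using only base R. The inline table is made-up data standing in for a file on disk, and mtcars is a data set that ships with R.

```r
# read.table can parse text directly; here a small inline table stands in
# for a file on disk (the data are made up for illustration).
scores <- read.table(text = "
name  group score
Ann   A     82
Bob   B     75
Cara  A     91
Dev   B     68
", header = TRUE, stringsAsFactors = FALSE)

str(scores)             # structure of the data frame
summary(scores$score)   # min, quartiles, mean, max
table(scores$group)     # counts per group

# Basic scatter plot with the built-in mtcars data set
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

With a real file you would pass its path as the first argument of read.table instead of using text.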

Step 3: Learn Data Management

You will use data management techniques a lot for data cleaning, especially if you are going to work on text data. The best way to learn them is to go through the text manipulation and numerical manipulation assignments. You can learn about connecting to databases through the RODBC package and running SQL queries on data frames through the sqldf package.
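To give a flavour of text and numerical manipulation for data cleaning, here is a minimal base R sketch. The messy values are made up, and the particular cleaning steps are just one reasonable choice.

```r
# Made-up messy values of the kind found in raw text data
raw <- c("  New York ", "new york", "LONDON", "london  ", "Paris")

clean <- trimws(raw)               # strip leading/trailing whitespace
clean <- tolower(clean)            # normalize case
clean <- gsub("\\s+", "_", clean)  # replace inner whitespace with underscores

table(clean)                       # counts after cleaning

# Numerical manipulation: parse numbers stored as text
amounts <- c("1,200", "350", "2,075")
amounts_num <- as.numeric(gsub(",", "", amounts))
amounts_num
```

Before cleaning, the five raw values would be counted as five different cities; after cleaning, the duplicates collapse.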

Step 4: Study specific packages in R: data.table and dplyr

Here we go! Here is a brief introduction to numerous libraries. We need to start practising some common operations.

  • Practice the data.table tutorial thoroughly here. Print and study the cheat sheet for data.table.
  • Next, you can have a look at the dplyr tutorial here.
  • For text mining, start with creating a word cloud in R and then learn through this series of tutorials: Part 1 and Part 2.
  • For social network analysis read through these pages.
  • Do sentiment analysis using Twitter data – check out this and this analysis.
  • For optimization through R read here and here
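As an example of a common operation to practise, here is a grouped summary written three ways: in base R, with dplyr and with data.table. This is only a sketch; the dplyr and data.table parts run only if those packages are installed, and mtcars is a data set that ships with R.

```r
# Base R: mean mpg per number of cylinders in the built-in mtcars data
base_result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
base_result

# The same summary with dplyr, run only if the package is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  mtcars %>%
    group_by(cyl) %>%
    summarise(mpg = mean(mpg)) %>%
    print()
}

# And with data.table, again only if it is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(mtcars)
  print(dt[, .(mpg = mean(mpg)), by = cyl])
}
```

All three produce one row per cylinder count; the packages mainly differ in syntax and in speed on large data.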

Step 5: Effective Data Visualization through ggplot2

  • Read Edward Tufte and his principles on how to make data visualizations here. Especially read about data-ink, lie factor and data density.
  • Read about the common pitfalls of dashboard design by Stephen Few.
  • To learn the grammar of graphics and a good way to apply it in R, go through this link from Dr Hadley Wickham, creator of ggplot2 and one of the most brilliant R package creators in the world today. You can download the data and slides as well.
  • Are you interested in visualizing spatial data? Go through the amazing ggmap package.
  • Interested in making animations through R? Look through these examples. The animation package will help you here.
  • Slidify will help supercharge your graphics with HTML5.
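To get a first feel for the grammar of graphics, here is a minimal ggplot2 sketch. It assumes ggplot2 is installed (the code is skipped otherwise) and uses the mtcars data set that ships with R; the choice of variables is only for illustration.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Map data columns to aesthetics, then add a layer of points
  p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point() +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
    theme_minimal()  # a plain theme, in the spirit of keeping non-data ink low
  print(p)
}
```

The plot is built by adding layers with +, which is the core idea to take away before moving on to more complex charts.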

Step 6: Learn Data mining and Machine Learning

Now we come to the most valuable skill for a data scientist, which is data mining and machine learning. You can see a very comprehensive set of resources on data mining in R at http://www.rdatamining.com/. The rattle package helps with an easy-to-use graphical user interface (GUI). You can read a free, open source, easy-to-understand book at http://togaware.com/datamining/survivor/index.html. You will go through an overview of algorithms like regression, decision trees, ensemble modelling and clustering. You can also see the various machine learning options available in R in the relevant CRAN Task View.
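Two of the algorithm families mentioned above, regression and clustering, can be tried with base R alone. Here is a minimal sketch using the built-in mtcars and iris data sets; the choice of variables and of three clusters is only for illustration.

```r
# Regression: model fuel economy as a function of weight (built-in mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                 # intercept and slope
summary(fit)$r.squared    # proportion of variance explained

# Clustering: k-means on the four numeric columns of the built-in iris data
set.seed(42)              # k-means starts from random centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
table(cluster = km$cluster, species = iris$Species)
```

The cross-table at the end compares the clusters found with the known species, which is a quick sanity check for an unsupervised method.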

Step 7: Practice

Practice with example data available to you and on the internet. Stay in touch with what your fellow R coders are doing by subscribing to http://www.r-bloggers.com/, http://stats.stackexchange.com and www.stackoverflow.com. Go through the questions and answers that users come up with. Start interacting by asking questions and answering the ones you can! Happy learning !!! 🙂

Famous quotes about Statistics !

Here are a few famous quotes about Statistics!

  • A big computer, a complex algorithm and a long time does not equal science. – Robert Gentleman
  • Absence of evidence is not evidence of absence. – Carl Sagan
  • All generalizations are false, including this one. – Mark Twain
  • All models are wrong, but some are useful. – George E. P. Box
  • An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem. – John Tukey
  • Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin. – John von Neumann
  • Figures don’t lie, but liars do figure. – Mark Twain
  • He uses statistics like a drunken man uses a lamp post, more for support than illumination. – Andrew Lang
  • I think it is much more interesting to live with uncertainty than to live with answers that might be wrong. – Richard Feynman
  • If you torture the data enough, nature will always confess. – Ronald Coase
  • If your experiment needs statistics, you ought to have done a better experiment. – Ernest Rutherford
  • In God we trust. All others must bring data. – W. Edwards Deming
  • It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so. – Mark Twain
  • Say you were standing with one foot in the oven and one foot in an ice bucket. According to the percentage people, you should be perfectly comfortable. – Bobby Bragan, 1963
  • Statistical thinking will one day be as necessary a qualification for efficient citizenship as the ability to read and write. – H. G. Wells
  • Statisticians, like artists, have the bad habit of falling in love with their models. – George Box
  • Statistics – A subject which most statisticians find difficult but in which nearly all physicians are expert.
  • Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. – Aaron Levenstein
  • Strange events permit themselves the luxury of occurring. – Charlie Chan
  • The best thing about being a statistician is that you get to play in everyone’s backyard. – John Tukey
  • The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. – John Tukey
  • The death of one man is a tragedy. The death of millions is a statistic. – Joseph Stalin
  • The greatest value of a picture is when it forces us to notice what we never expected to see. – John Tukey
  • The statistician cannot evade the responsibility for understanding the process he applies or recommends. – Sir Ronald A. Fisher
  • There are no routine statistical questions, only questionable statistical routines. – D. R. Cox
  • To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. – R. A. Fisher (1938)
  • We are drowning in information and starving for knowledge. – Rutherford D. Rogers

You can add the famous quotes that you like in the comments. 🙂
