Machine Learning – Easy Reference

6 Feb 2021 Data Lover analytics, data-science, machine learning AI, bookmarks, machine learning, ML, reference, supervised, unsupervised

This post collects the must-know concepts for working with machine learning algorithms. Keep the list as an easy reference and bookmark this page!

Classification metrics

In the context of binary classification, these are the main metrics to track in order to assess the performance of the model.

Confusion matrix: The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:

Main metrics: The following metrics are commonly used to assess the performance of classification models:

ROC: The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

AUC: The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve, as shown in the following figure:
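
The confusion matrix, the TPR/FPR pairs of the ROC curve, and the AUC can be sketched in a few lines of plain Python. This is a minimal illustration, assuming labels are encoded as 0/1 and that both classes are present; the function names `confusion_counts`, `roc_points`, and `auc` are hypothetical helpers, not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every score.

    Assumes y_true contains at least one positive and one negative label,
    otherwise the rates below would divide by zero.
    """
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= thr else 0 for s in scores]
        tp, fp, tn, fn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_points(labels, scores))
    print(auc(roc_points(labels, scores)))
```

Sweeping the threshold from the highest score downward makes the curve start at (0, 0) and end at (1, 1), which is what the trapezoidal sum relies on.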

Regression metrics:

Basic metrics: Given a regression model f, the following metrics are commonly used to assess the performance of the model:

Coefficient of determination: The coefficient of determination, often noted R^2 or r^2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:

Main metrics: The following metrics are commonly used to assess the performance of regression models while taking into account the number of variables n that they use:
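
As a rough illustration, the coefficient of determination can be computed from the residual and total sums of squares, and the adjusted R^2 is one common metric that penalizes the number of variables n. The sketch below assumes m observations (the symbol m is introduced here and does not appear in the post), and the function names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, m, n):
    """Adjusted R^2 for m observations and n variables (requires m > n + 1)."""
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    r2 = r_squared(observed, predicted)
    print(r2, adjusted_r_squared(r2, m=len(observed), n=1))
```

The adjusted value is never larger than the plain R^2, so adding a variable only helps if it reduces the residual error enough to offset the penalty.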

Model selection:

Vocabulary: When selecting a model, we distinguish three different parts of the data (the training, validation, and test sets) as follows:

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:

Cross-validation: Cross-validation, also noted CV, is a method used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:

The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.
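
The k-fold procedure described above can be sketched as follows. This is a minimal illustration with contiguous, unshuffled folds; `k_fold_indices`, `cross_validation_error`, `fit_mean`, and `mse` are hypothetical names, and the "model" is just a constant predictor chosen to keep the example self-contained.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validation_error(xs, ys, k, fit, error):
    """Average validation error over k folds.

    fit(train_x, train_y) returns a callable model;
    error(model, val_x, val_y) returns a float.
    """
    errors = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(error(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(errors) / k


def fit_mean(train_x, train_y):
    """Toy model: always predict the mean of the training targets."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def mse(model, val_x, val_y):
    """Mean squared error of the model on the validation fold."""
    return sum((model(x) - y) ** 2 for x, y in zip(val_x, val_y)) / len(val_y)


if __name__ == "__main__":
    print(cross_validation_error([0, 1, 2, 3], [1, 1, 3, 3], 2, fit_mean, mse))
```

In practice the data is usually shuffled before splitting; the contiguous split here is only meant to make the fold boundaries easy to follow.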

Regularization: The regularization procedure aims to keep the model from overfitting the data and thus deals with high-variance issues. The following table sums up the different types of commonly used regularization techniques:

Diagnostics:

Bias: The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points.

Variance: The variance of a model is the variability of the model prediction for given data points.

Bias/variance tradeoff: The simpler the model, the higher the bias, and the more complex the model, the higher the variance.

Error analysis: Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

Ablative analysis: Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.

I hope this helps you look up the important things easily.

Credits: Amidi brothers!

Life-cycle of a Data Science Project

18 May 2018 / 19 May 2018 Data Lover analytics, data-science, General, life-cycle, machine learning, statistics, visualization analytics, basic, data mining, data scientist, statistics, tableau, visualization

Wondering what the life-cycle of a data science project looks like? Here you go.
Problem Identification:

Have you ever heard the phrase “Here’s the data, can you do some analysis and find some insights?” Often, management approaches data scientists with vague or even undefined goals. Understanding the goal is important and sets up the rest of the project for success.

This step takes up about 10% of the time in the project life-cycle.

Data Preparation:

By far everybody’s least favorite stage, but possibly the most important one. Data can come from different sources, be in an ugly format, and have errors and a myriad of other problems. A single error in this stage can render the rest of the analysis useless.

That’s why typically, up to 70% of the time is spent here.

Analyse the data:

This stage covers creating models, performing data mining, setting up simulations, etc. This is the most exciting part, and if the previous stages were done correctly, analyzing the data and getting insights will feel good.

Time needed here would be 10%

Visualization of the insights:

Visualizing comes hand-in-hand with analyzing. This is a powerful technique, as looking at the data in various forms and shapes can help reveal insights that are otherwise not evident. Also, several projects, such as BI dashboards, don’t need much analysis but rely on visualization instead.

Time needed here would be 10%

Presentation of the findings:

We’ve reached 100%, so the project is over! Actually, no. Presenting findings is a whole separate “additional” stage. You need to not only convey the insights in your audience’s language but also get buy-in from them to take action based on those insights. This is an art.

Time needed: extra 80% 🙂

Hope you benefited! Enjoy learning!

Treasure for Data Science blogs (A to Z)

15 Jul 2017 / 18 Jul 2017 Data Lover data-science, python., R, Uncategorized

This post will help you in your knowledge hunt for data science. The list below makes it easy to find blogs that talk about data science. I hope you will find it useful.

A Blog From a Human-engineer-being	http://www.erogol.com/
Aakash Japi	http://aakashjapi.com/
Adit Deshpande	https://adeshpande3.github.io/
Advanced Analytics & R	http://advanceddataanalytics.net/
Adventures in Data Land	http://blog.smola.org
Agile Data Science	http://blog.sense.io/
Ahmed El Deeb	https://medium.com/@D33B
Airbnb Data blog	http://nerds.airbnb.com/data/
Alex Castrounis | InnoArchiTech	http://www.innoarchitech.com/
Alex Perrier	http://alexperrier.github.io/
Algobeans | Data Analytics Tutorials & Experiments for the Layman	https://algobeans.com
Amazon AWS AI Blog	https://aws.amazon.com/blogs/ai/
Analytics Vidhya	http://www.analyticsvidhya.com/blog/
Analytics and Visualization in Big Data @ Sicara	https://blog.sicara.com
Andreas Müller	http://peekaboo-vision.blogspot.com/
Andrej Karpathy blog	http://karpathy.github.io/
Andrew Brooks	http://brooksandrew.github.io/simpleblog/
Andrey Kurenkov	http://www.andreykurenkov.com/writing/
Anton Lebedevich’s Blog	http://mabrek.github.io/
Arthur Juliani	https://medium.com/@awjuliani
Audun M. Øygard	http://www.auduno.com/
Avi Singh	https://avisingh599.github.io/
Beautiful Data	http://beautifuldata.net/
Beckerfuffle	http://mdbecker.github.io/
Becoming A Data Scientist	http://www.becomingadatascientist.com/
Ben Bolte’s Blog	http://benjaminbolte.com/ml/
Ben Frederickson	http://www.benfrederickson.com/blog/
Berkeley AI Research	http://bair.berkeley.edu/blog/
Big-Ish Data	http://bigishdata.com/
Blog on neural networks	http://yerevann.github.io/
Blogistic Regression	http://d10genes.github.io/blog/
blogR | R tips and tricks from a scientist	https://drsimonj.svbtle.com/
Brain of mat kelcey	http://matpalm.com/blog/
Brilliantly wrong thoughts on science and programming	https://arogozhnikov.github.io/
Bugra Akyildiz	http://bugra.github.io/
Building Babylon	https://building-babylon.net/
Carl Shan	http://carlshan.com/
Chris Stucchio	https://www.chrisstucchio.com/blog/index.html
Christophe Bourguignat	https://medium.com/@chris_bour
Christopher Nguyen	https://medium.com/@ctn
Cloudera Data Science Posts	http://blog.cloudera.com/blog/category/data-science/
colah’s blog	http://colah.github.io/archive.html
Cortana Intelligence and Machine Learning Blog	https://blogs.technet.microsoft.com/machinelearning/
Daniel Forsyth	http://www.danielforsyth.me/
Daniel Homola	http://danielhomola.com/category/blog/
Daniel Nee	http://danielnee.com
Data Based Inventions	http://datalab.lu/
Data Blogger	https://www.data-blogger.com/
Data Labs	http://blog.insightdatalabs.com/
Data Meets Media	http://datameetsmedia.com/
Data Miners Blog	http://blog.data-miners.com/
Data Mining Research	http://www.dataminingblog.com/
Data Mining: Text Mining, Visualization and Social Media	http://datamining.typepad.com/data_mining/
Data Piques	http://blog.ethanrosenthal.com/
Data School	http://www.dataschool.io/
Data Science 101	http://101.datascience.community/
Data Science @ Facebook	https://research.facebook.com/blog/datascience/
Data Science Insights	http://www.datasciencebowl.com/data-science-insights/
Data Science Tutorials	https://codementor.io/data-science/tutorial
Data Science Vademecum	http://datasciencevademecum.wordpress.com/
Dataaspirant	http://dataaspirant.com/
Dataclysm	http://blog.okcupid.com/
DataGenetics	http://datagenetics.com/blog.html
Dataiku	https://www.dataiku.com/blog/
DataKind	http://www.datakind.org/blog
DataLook	http://blog.datalook.io/
Datanice	https://datanice.wordpress.com/
Dataquest Blog	https://www.dataquest.io/blog/
DataRobot	http://www.datarobot.com/blog/
Datascope	http://datascopeanalytics.com/blog
DatasFrame	http://tomaugspurger.github.io/
David Mimno	http://www.mimno.org/
Dayne Batten	http://daynebatten.com
Deep Learning	http://deeplearning.net/blog/
Deepdish	http://deepdish.io/
Delip Rao	http://deliprao.com/
DENNY’S BLOG	http://blog.dennybritz.com/
Dimensionless	https://dimensionless.in/blog/
Distill	http://distill.pub/
District Data Labs	http://districtdatalabs.silvrback.com/
Diving into data	https://blog.datadive.net/
Domino Data Lab’s blog	http://blog.dominodatalab.com/
Dr. Randal S. Olson	http://www.randalolson.com/blog/
Drew Conway	https://medium.com/@drewconway
Dustin Tran	http://dustintran.com/blog/
Eder Santana	https://edersantana.github.io/blog.html
Edwin Chen	http://blog.echen.me
EFavDB	http://efavdb.com/
Emilio Ferrara, Ph.D.	http://www.emilio.ferrara.name/
Entrepreneurial Geekiness	http://ianozsvald.com/
Eric Jonas	http://ericjonas.com/archives.html
Eric Siegel	http://www.predictiveanalyticsworld.com/blog
Erik Bern	http://erikbern.com
ERIN SHELLMAN	http://www.erinshellman.com/
Eugenio Culurciello	http://culurciello.github.io/
Fabian Pedregosa	http://fa.bianp.net/
Fast Forward Labs	http://blog.fastforwardlabs.com/
FastML	http://fastml.com/
Florian Hartl	http://florianhartl.com/
FlowingData	http://flowingdata.com/
Full Stack ML	http://fullstackml.com/
GAB41	http://www.lab41.org/gab41/
Garbled Notes	http://www.chioka.in/
Greg Reda	http://www.gregreda.com/blog/
Hyon S Chu	https://medium.com/@adailyventure
i am trask	http://iamtrask.github.io/
I Quant NY	http://iquantny.tumblr.com/
inFERENCe	http://www.inference.vc/
Insight Data Science	https://blog.insightdatascience.com/
INSPIRATION INFORMATION	http://myinspirationinformation.com/
Ira Korshunova	http://irakorshunova.github.io/
I’m a bandit	https://blogs.princeton.edu/imabandit/
Jason Toy	http://www.jtoy.net/
Jeremy D. Jackson, PhD	http://www.jeremydjacksonphd.com/
Jesse Steinweg-Woods	https://jessesw.com/
Joe Cauteruccio	http://www.joecjr.com/
John Myles White	http://www.johnmyleswhite.com/
John’s Soapbox	http://joschu.github.io/
Jonas Degrave	http://317070.github.io/
Joy Of Data	http://www.joyofdata.de/blog/
Julia Evans	http://jvns.ca/
KDnuggets	http://www.kdnuggets.com/
Keeping Up With The Latest Techniques	http://colinpriest.com/
Kenny Bastani	http://www.kennybastani.com/
Kevin Davenport	http://kldavenport.com/
kevin frans	http://kvfrans.com/
korbonits | Math ∩ Data	http://korbonits.github.io/
Large Scale Machine Learning	http://bickson.blogspot.com/
LATERAL BLOG	https://blog.lateral.io/
Lazy Programmer	http://lazyprogrammer.me/
Learn Analytics Here	https://learnanalyticshere.wordpress.com/
LearnDataSci	http://www.learndatasci.com/
Learning With Data	http://learningwithdata.com/
Life, Language, Learning	http://daoudclarke.github.io/
Locke Data	https://itsalocke.com/blog/
Louis Dorard	http://www.louisdorard.com/blog/
M.E.Driscoll	http://medriscoll.com/
Machinalis	http://www.machinalis.com/blog
Machine Learning (Theory)	http://hunch.net/
Machine Learning and Data Science	http://alexhwoods.com/blog/
Machine Learning	https://charlesmartin14.wordpress.com/
Machine Learning Mastery	http://machinelearningmastery.com/blog/
Machine Learning Blogs	https://machinelearningblogs.com/
Machine Learning, etc	http://yaroslavvb.blogspot.com
Machine Learning, Maths and Physics	https://mlopezm.wordpress.com/
Machined Learnings	http://www.machinedlearnings.com/
MAPPING BABEL	https://jack-clark.net/
MAPR Blog	https://www.mapr.com/blog
MAREK REI	http://www.marekrei.com/blog/
MARGINALLY INTERESTING	http://blog.mikiobraun.de/
Math ∩ Programming	http://jeremykun.com/
Matthew Rocklin	http://matthewrocklin.com/blog/
Melody Wolk	http://melodywolk.com/projects/
Mic Farris	http://www.micfarris.com/
Mike Tyka	http://mtyka.github.io/
minimaxir | Max Woolf’s Blog	http://minimaxir.com/
Mirror Image	https://mirror2image.wordpress.com/
Mitch Crowe	http://www.dataphoric.com/
MLWave	http://mlwave.com/
MLWhiz	http://mlwhiz.com/
Models are illuminating and wrong	https://peadarcoyle.wordpress.com/
Moody Rd	http://blog.mrtz.org/
Moonshots	http://jxieeducation.com/
Mourad Mourafiq	http://mourafiq.com/
My thoughts on Data science, predictive analytics, Python	http://shahramabyari.com/
Natural language processing blog	http://nlpers.blogspot.fr/
Neil Lawrence	http://inverseprobability.com/blog.html
NLP and Deep Learning enthusiast	http://camron.xyz/
no free hunch	http://blog.kaggle.com/
Nuit Blanche	http://nuit-blanche.blogspot.com/
Number 2147483647	https://no2147483647.wordpress.com/
On Machine Intelligence	https://aimatters.wordpress.com/
Opiate for the masses Data is our religion.	http://opiateforthemass.es/
p-value.info	http://www.p-value.info/
Pete Warden’s blog	http://petewarden.com/
Plotly Blog	http://blog.plot.ly/
Probably Overthinking It	http://allendowney.blogspot.ca/
Prooffreader.com	http://www.prooffreader.com
ProoffreaderPlus	http://prooffreaderplus.blogspot.ca/
Publishable Stuff	http://www.sumsar.net/
PyImageSearch	http://www.pyimagesearch.com/
Pythonic Perambulations	https://jakevdp.github.io/
quintuitive	http://quintuitive.com/
R and Data Mining	https://rdatamining.wordpress.com/
R-bloggers	http://www.r-bloggers.com/
R2RT	http://r2rt.com/
Ramiro Gómez	http://ramiro.org/notebooks/
Random notes on Computer Science, Mathematics and Software Engineering	http://barmaley-exe.github.io/
Randy Zwitch	http://randyzwitch.com/
RaRe Technologies	http://rare-technologies.com/blog/
Rayli.Net	http://rayli.net/blog/
Revolutions	http://blog.revolutionanalytics.com/
Rinu Boney	http://rinuboney.github.io/
RNDuja Blog	http://rnduja.github.io/
Robert Chang	https://medium.com/@rchang
Rocket-Powered Data Science	http://rocketdatascience.org
Sachin Joglekar’s blog	https://codesachin.wordpress.com/
samim	https://medium.com/@samim
Sean J. Taylor	http://seanjtaylor.com/
Sebastian Raschka	http://sebastianraschka.com/blog/index.html
Sebastian Ruder	http://sebastianruder.com/
Sebastian’s slow blog	http://www.nowozin.net/sebastian/blog/
SFL Scientific Blog	https://sflscientific.com/blog/
Shakir’s Machine Learning Blog	http://blog.shakirm.com/
Simply Statistics	http://simplystatistics.org
Springboard Blog	http://springboard.com/blog
Startup.ML Blog	http://startup.ml/blog
Statistical Modeling, Causal Inference, and Social Science	http://andrewgelman.com/
Stigler Diet	http://stiglerdiet.com/
Stitch Fix Tech Blog	http://multithreaded.stitchfix.com/blog/
Storytelling with Statistics on Quora	http://datastories.quora.com/
StreamHacker	http://streamhacker.com/
Subconscious Musings	http://blogs.sas.com/content/subconsciousmusings/
Swan Intelligence	http://swanintelligence.com/
TechnoCalifornia	http://technocalifornia.blogspot.se/
TEXT ANALYSIS BLOG | AYLIEN	http://blog.aylien.com/
The Angry Statistician	http://angrystatistician.blogspot.com/
The Clever Machine	https://theclevermachine.wordpress.com/
The Data Camp Blog	https://www.datacamp.com/community/blog
The Data Incubator	http://blog.thedataincubator.com/
The Data Science Lab	https://datasciencelab.wordpress.com/
THE ETZ-FILES	http://alexanderetz.com/
The Science of Data	http://www.martingoodson.com
The Shape of Data	https://shapeofdata.wordpress.com
The unofficial Google data science Blog	http://www.unofficialgoogledatascience.com/
Tim Dettmers	http://timdettmers.com/
Tombone’s Computer Vision Blog	http://www.computervisionblog.com/
Tommy Blanchard	http://tommyblanchard.com/category/projects
Trevor Stephens	http://trevorstephens.com/
Trey Causey	http://treycausey.com/
UW Data Science Blog	http://datasciencedegree.wisconsin.edu/blog/
Wellecks	http://wellecks.wordpress.com/
Wes McKinney	http://wesmckinney.com/archives.html
While My MCMC Gently Samples	http://twiecki.github.io/
WildML	http://www.wildml.com/
Will do stuff for stuff	http://rinzewind.org/blog-en
Will wolf	http://willwolf.io/
WILL’S NOISE	http://www.willmcginnis.com/
William Lyon	http://www.lyonwj.com/
Win-Vector Blog	http://www.win-vector.com/blog/
Yanir Seroussi	http://yanirseroussi.com/
Zac Stewart	http://zacstewart.com/
ŷhat	http://blog.yhat.com/
ℚuantitative √ourney	http://outlace.com/
大トロ	http://blog.otoro.net/

Data Engineer vs Data Scientist (Infographic)

2 Mar 2017 Data Lover data-science, Infographic data mining, data scientist, data-engineer, data-science, DataCamp, machine learning, tableau, visualization

This infographic helps us better understand the skills and responsibilities of the Data Engineer and the Data Scientist. It also helps us compare salaries and the popular software and tools used by each. Hope this helps!

10 famous TV shows related to Data science & AI (Artificial Intelligence)

14 Feb 2017 / 16 Feb 2017 Data Lover data-science

“If you want to become one, first get inspired by one”

There are always a few interesting ways to learn things and get inspired. Would you like to know a few TV shows that are based on data science and artificial intelligence? We always like to do things the way we love. Here you go, and happy watching (learning)!

Thanks to AV for this.

Top 8 Viz features in Excel 2016 !

2 Jan 2016 / 4 Jan 2016 Data Lover excel, excel-2016, General, visualization excel, visualization

This is especially for the Excel lovers! In this blog, we will see a few of the new and exciting data visualization features of Excel 2016.

Here is the list of new features

Hierarchy Chart/Tree Map
Sunburst
Waterfall or Stock Chart
Transform Cold data into a cool picture
Instant Histogram
Pareto Chart
3D map
One click forecast

These are the charts that dashboard creators want most. They are simple and attractive. This set of features makes Excel more competitive with other, expensive visualization tools.

1. Hierarchy Chart/Tree Map:

Select the data that you want to use for the chart, then go to the ‘Insert’ tab > Charts > Insert Hierarchy Chart

Isn’t it cool? OK, we go to the next one.

2. Sunburst/Donut Chart:

It is another representation of a pie chart, an alternative to the boring pie chart. Go to ‘Insert’ > Charts > Insert Hierarchy Chart

3. Waterfall or Stock Chart

Sorting the data first is recommended for better insights.

4. Transform Cold data into a cool picture

This one is based on the Add-ins.

Select your data to visualize

Select ‘Settings’ to change the design of the charts

5. Instant Histogram:

Create histograms quickly instead of going to “Analysis Tool Pack” in add-ins. Go to Insert > Charts > Histogram

6. Pareto Chart:

Earlier, we had to customize the data structure to create a ‘Pareto chart’, but now explaining the 80/20 principle is just a click away.

7. 3D map:

Power Map, the popular 3-D geospatial visualization add-in for Excel 2013, is now fully integrated into Excel. We’ve also given this feature a more descriptive name, “3D Maps”. You’ll find this functionality alongside other visualization features on the Insert tab.

It will open another sheet like below

Then we can change the theme and other options like ‘2D Map’. The “Play Tour” option will show an awesome chart with lively visuals.

8. One click Forecast

Forecasting has become easier for data analysts.

Select the data that you want to forecast and go to the ‘Data’ tab > click on “Forecast Sheet”

Adjust the “Seasonality” appropriately

and your forecast is ready.

Hope you like these features; there is much more to come from Microsoft. Try them out and enjoy!

Data Viz ! Cheat sheet for R Data Analyst

26 Aug 2015 / 16 Feb 2017 Data Lover R, visualization analytics, data scientist, data-science, R, visualization

Data visualization has become a vital part of the data science arena. Hence, our key tool should have strong capabilities on both fronts: data analysis as well as data visualization. With this shift in the landscape, R has gained immense popularity because of its splendid data visualization capabilities. With a few lines of code, you can produce beautiful charts and data stories. R contains superb libraries for creating basic and more advanced visualizations such as bar charts, histograms, scatter plots, map visualizations, mosaic plots, and various others. Below is a cheat sheet of widely used visualizations for representing data. Thanks to my colleague for sharing this.

Introducing cricketr! : An R package to analyze performances of cricketers

8 Jul 2015 / 17 Feb 2017 Data Lover data-science, R

A very good analysis using R in the field of cricket. Must see! 🙂

Giga thoughts ...

Yet all experience is an arch wherethro’
Gleams that untravell’d world whose margin fades
For ever and forever when I move.
How dull it is to pause, to make an end,
To rust unburnish’d, not to shine in use!
Ulysses by Alfred Tennyson

Introduction

This is an initial post in which I introduce a cricketing package ‘cricketr’ which I have created. This package was a natural culmination to my earlier posts on cricket and my completing 9 modules of Data Science Specialization, from Johns Hopkins University at Coursera. The thought of creating this package struck me some time back, and I have finally been able to bring this to fruition.

So here it is. My R package ‘cricketr!!!’

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package only uses data from test cricket. I plan to develop functionality for One-day and…

A Complete List of Data Science Online Classes

12 Jun 2015 Data Lover data-science data-science

Great resources to learn data science online! Here you go!

Hi, I'm Scott

The blog is now migrated to http://scottge.net/2015/06/08/complete-list-of-data-science-online-classes/

You can consider online classes from Coursera for self-study. Coursera provides online classes (most of them are free) offered by university professors, typically attended worldwide by thousands of students and working professionals. In particular, consider the Data Science Specialization from Johns Hopkins University, which offers a guaranteed certificate demonstrating your ability.

Coursera Courses

Data Science Specialization
http://www.coursera.org/specialization/jhudatascience/1, scroll down to see a list of classes and availability
Data Analysis and Statistical Inference
http://www.coursera.org/course/statistics Excellent foundational class, highly regarded world-wide
Machine Learning
https://www.coursera.org/course/ml
Statistics class list
http://www.coursera.org/courses?orderby=upcoming&cats=stats

Additional online class resources

From MIT Open Courseware
http://ocw.mit.edu/courses/find-by-topic/#cat=mathematics&subcat=probabilityandstatistics
From edX
http://www.edx.org/course-list/allschools/statistics-data-analysis/allcourses
From Udacity
http://www.udacity.com/courses#!/all (search for “statistics”)
Stanford Online
http://online.stanford.edu/courses (search for “data”)
Stanford, Statistical Learning
http://online.stanford.edu/course/statistical-learning
List of MOOC classes [Massive Open Online Course]
http://www.mooc-list.com/, search for “statistics”
OpenIntro textbook
http://www.openintro.org/stat/textbook.php
OpenIntro labs
http://www.openintro.org/stat/labs.php
Khan Academy
http://www.khanacademy.org/math/probability

Acknowledgement

Please share in the comments…

Growth of Six Sigma

26 May 2015 / 17 Feb 2017 kajabux six sigma six sigma

Below is the Google Trends search trend for Six Sigma over time.

There can be two reasons for the decreasing trend:

Awareness of Six Sigma is almost complete, hence searches have reduced over the period.
Six Sigma is not really a big deal.

Maybe I would go with the second one, but the reason is slightly different:

So many people get trained on Six Sigma just by paying money (2 days, 5 days, and at most 10 days), then they start to practice and teach Six Sigma.

I personally know so many inefficient people teaching Six Sigma!

So whatever they teach is Six Sigma now. I think this is the reason behind the fade-out of Six Sigma.

Eight Steps to become a Data Scientist ! (The Sexiest and the Hot Job of the Decade)

16 May 2015 / 17 Feb 2017 Data Lover data-science, R data mining, data scientist, data-science, DataCamp, hadoop, julia, machine learning, python, quandl, R, SAS, spotfire, tableau

Thinking about how to become a Data Scientist? Here we go, the 8 steps to become a Data Scientist (The Sexiest and the Hot Job of the Decade)

Well, these steps are not so easy, but they are possible if we try. Most of the steps come at no cost or very low cost.

Thanks to DataCamp for the nice infographic. Is this info useful? Then please share it with your circle.

Clash of the Titans ! (R vs Python)

14 May 2015 / 17 Feb 2017 Data Lover R data-science, machine learning, python, R

This is for everyone out there who is wondering which language is better to learn for data analysis and visualization, and whether to use R or Python for everyday data analysis tasks.

Both Python and R are among the most widely used languages for data analysis, and both have their supporters and opponents. While Python is often praised for being a general-purpose language with an easy-to-understand syntax, R’s functionality is developed with statisticians in mind, thus giving it field-specific advantages such as extensive features for data visualization.

DataCamp has recently released a new infographic for everyone interested in how these two (statistical) programming languages relate to each other. This superb infographic explores the strengths of R over Python and vice versa, and aims to provide a basic comparison between these two programming languages from a data science and statistics perspective.

Note:

Not to be ignored is the new entrant on the battlefield, the “Julia” language. It is a high-level dynamic programming language designed to address the requirements of high-performance numerical and scientific computing while also being effective for general-purpose programming. It is influenced by MATLAB, C, Python, Perl, R, Ruby and others.

Soon we expect Julia to join the clash!

Introduction to Six Sigma, in the way you want to know !

13 May 2015 / 17 Jul 2017 Data Lover six sigma Black Belts, DMAIC, green belts, process capability, Sigma, six sigma

What is Six Sigma?

Six Sigma is a method that provides organizations with tools to improve the capability of their business processes. This increase in performance and decrease in process variation lead to defect reduction and improvement in profits, employee morale, and quality of products or services. Six Sigma quality is a term generally used to indicate that a process is well controlled (within process limits ±3s from the center line in a control chart, and requirements/tolerance limits ±6s from the center line).

Diverse definitions have been proposed for Six Sigma, but they all share some common threads:

Use of teams that are assigned well-defined projects that have direct impact on the organization’s bottom line.

Training in “statistical thinking” at all levels and providing key people with extensive training in advanced statistics and project management. These key people are designated “Black Belts.”

Emphasis on the DMAIC approach to problem solving: define, measure, analyze, improve, and control.

A management environment that supports these initiatives as a business strategy.

Six Sigma has two key methodologies:

DMAIC: It refers to a data-driven quality strategy for improving processes. This methodology is used to improve an existing business process.
DMADV: It refers to a data-driven quality strategy for designing products & processes. This methodology is used to create new product designs or process designs in such a way that it results in a more predictable, mature and defect free performance.

There is one more methodology called DFSS – Design For Six Sigma. DFSS is a data-driven quality strategy for designing or redesigning a product or service from the ground up.

Sometimes a DMAIC project may turn into a DFSS project because the process in question requires complete redesign to bring about the desired degree of improvement.

DMAIC Methodology:

This methodology consists of the following five steps.

Define –> Measure –> Analyze –> Improve –> Control

Define: Define the problem or project goal that needs to be addressed.
Measure: Measure the problem and process from which it was produced.
Analyze: Analyze data and process to determine root causes of defects and opportunities.
Improve: Improve the process by finding solutions to fix, diminish, and prevent future problems.
Control: Implement, control, and sustain the improvement solutions to keep the process on the new course.

DMADV Methodology

This methodology consists of five steps:

Define –> Measure –> Analyze –> Design –> Verify

Define: Define the Problem or Project Goal that needs to be addressed.
Measure: Measure and determine customer needs and specifications.
Analyze: Analyze the process to meet the customer needs.
Design: Design a process that will meet customer needs.
Verify: Verify the design performance and ability to meet customer needs.

DFSS Methodology

DFSS is a separate and emerging discipline related to Six Sigma quality processes. This is a systematic methodology utilizing tools, training, and measurements to enable us to design products and processes that meet customer expectations and can be produced at Six Sigma Quality levels.

This methodology can have the following five steps.

Define –> Identify –> Design –> Optimize –> Verify

Define: Define what the customers want, or what they do not want.
Identify: Identify the customer and the project.
Design: Design a process that meets customer needs.
Optimize: Determine process capability and optimize the design.
Verify: Test, verify, and validate the design.

Features of Six Sigma

Six Sigma’s aim is to eliminate waste and inefficiency, thereby increasing customer satisfaction by delivering what the customer is expecting.
Six Sigma follows a structured methodology, and has defined roles for the participants.
Six Sigma is a data driven methodology, and requires accurate data collection for the processes being analyzed.
Six Sigma is about putting results on Financial Statements.
Six Sigma is a business-driven, multi-dimensional structured approach for:
- Improving Processes
- Lowering Defects
- Reducing process variability
- Reducing costs
- Increasing customer satisfaction
- Increased profits

The word Sigma is a statistical term that measures how far a given process deviates from perfection.

The central idea behind Six Sigma: If you can measure how many “defects” you have in a process, you can systematically figure out how to eliminate them and get as close to “zero defects” as possible. Specifically, Six Sigma means a failure rate of 3.4 parts per million, or 99.9997% perfect.
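
The link between a defect rate and a sigma level can be sketched with the standard normal distribution. The sketch below assumes the conventional 1.5-sigma long-term shift, which the post does not mention but which is the usual convention under which 3.4 defects per million corresponds to six sigma; `sigma_level` and `dpmo_from_sigma` are hypothetical helper names.

```python
from statistics import NormalDist

# Assumption: the conventional 1.5-sigma long-term shift used in Six Sigma tables.
LONG_TERM_SHIFT = 1.5


def sigma_level(dpmo, shift=LONG_TERM_SHIFT):
    """Sigma level for a given number of defects per million opportunities."""
    yield_fraction = 1 - dpmo / 1_000_000
    return NormalDist().inv_cdf(yield_fraction) + shift


def dpmo_from_sigma(sigma, shift=LONG_TERM_SHIFT):
    """Defects per million opportunities (one-sided tail) for a sigma level."""
    return (1 - NormalDist().cdf(sigma - shift)) * 1_000_000


if __name__ == "__main__":
    print(round(sigma_level(3.4), 2))
    print(round(dpmo_from_sigma(6.0), 2))
```

Without the shift, a true six-sigma tail would be far smaller than 3.4 per million, so the assumed shift matters when comparing figures from different sources.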

Key Concepts of Six Sigma

At its core, Six Sigma revolves around a few key concepts.

Critical to Quality : Attributes most important to the customer.
Defect : Failing to deliver what the customer wants.
Process Capability : What your process can deliver.
Variation : What the customer sees and feels.
Stable Operations : Ensuring consistent, predictable processes to improve what the customer sees and feels.
Design for Six Sigma : Designing to meet customer needs and process capability.

Our Customers Feel the Variance, Not the Mean. So Six Sigma focuses first on reducing process variation and then on improving the process capability.
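To see why variance matters more than the mean, here is a small hedged sketch in R. It compares two hypothetical processes with the same mean but different spread; the specification limits, the normality assumption and the helper name fraction_out_of_spec are all illustrative choices, not from the text above.

```r
# Fraction of output falling outside the specification limits,
# assuming the process output is normally distributed (an assumption).
# `fraction_out_of_spec` is a hypothetical helper name.
fraction_out_of_spec <- function(mu, sigma, lower, upper) {
  pnorm(lower, mean = mu, sd = sigma) +
    pnorm(upper, mean = mu, sd = sigma, lower.tail = FALSE)
}

# Two made-up processes with the same mean (10) and the same limits (7 to 13).
steady  <- fraction_out_of_spec(mu = 10, sigma = 1, lower = 7, upper = 13)
erratic <- fraction_out_of_spec(mu = 10, sigma = 3, lower = 7, upper = 13)

print(round(c(steady = steady, erratic = erratic), 4))
```

Both processes look identical on average, yet the high-variance one produces far more out-of-spec results, which is what the customer actually experiences.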

Myths about Six Sigma

There are several myths and misunderstandings surrounding Six Sigma. A few of them are given below:

Six Sigma is only concerned with reducing defects.
Six Sigma is a process for production or engineering.
Six Sigma cannot be applied to engineering activities.
Six Sigma uses difficult-to-understand statistics.
Six Sigma is just training.

Benefits of Six Sigma

Six Sigma offers six major benefits that attract companies:

Generates sustained success
Sets a performance goal for everyone
Enhances value to customers
Accelerates the rate of improvement
Promotes learning and cross-pollination
Executes strategic change

Origin of Six Sigma

Six Sigma originated at Motorola in the early 1980s, in response to a goal of achieving a 10X reduction in product-failure levels in 5 years.
Engineer Bill Smith invented Six Sigma, but died of a heart attack in the Motorola cafeteria in 1993, never knowing the scope of the craze and controversy he had touched off.
Six Sigma is based on various quality management theories (e.g. Deming’s 14 points for management, Juran’s 10 steps on achieving quality).

There are three key elements of Six Sigma Process Improvement:

Customers
Processes
Employees

The Customers:

Customers define quality. They expect performance, reliability, competitive prices, on-time delivery, service, clear and correct transaction processing and more. This means it is important to provide what the customers need to gain customer delight.

The Processes:

Defining processes as well as defining their metrics and measures is the central aspect of Six Sigma.

In a business, quality should be looked at from the customer’s perspective, so we must look at a defined process from the outside in.

By understanding the transaction lifecycle from the customer’s needs and processes, we can discover what they are seeing and feeling. This gives us a chance to identify weak areas within a process and then improve them.

The Employees

A company must involve all its employees in the Six Sigma program. The company must provide opportunities and incentives for employees to focus their talents and abilities on satisfying customers.

It is important in Six Sigma that all team members have a well-defined role with measurable objectives.

Six Sigma Belts (remember karate belts ! 🙂 )

Six Sigma professionals exist at every level, each with a different role to play. While implementations and roles may vary, here is a straightforward guide to who does what.

At the project level, there are black belts, master black belts, green belts, yellow belts and white belts. These people conduct projects and implement improvements.

Level	Description with Roles and Responsibilities
Executives	Provide overall alignment by establishing the strategic focus of the Six Sigma program within the context of the organization’s culture and vision
Champions	Translate the company’s vision, mission, goals and metrics to create an organizational deployment plan and identify individual projects. Identify resources and remove roadblocks
Master Black Belt (MBB)	Trains and coaches Black Belts and Green Belts. Functions more at the Six Sigma program level by developing key metrics and the strategic direction. Acts as an organization’s Six Sigma technologist and internal consultant.
Black Belt (BB)	Understands Six Sigma philosophies and principles, including the supporting systems and tools. Demonstrates team leadership and understands all aspects of the DMAIC model in accordance with Six Sigma principles. Leads problem-solving projects. Trains and coaches project teams.
Green Belt (GB)	Supports a Six Sigma Black Belt by analyzing and solving quality problems and is involved in quality-improvement projects. Assists with data collection and analysis for Black Belt projects. Leads Green Belt projects or teams.
Yellow Belt (YB)	Participates as a project team member. Reviews process improvements that support the project. Has a small role, interest, or need to develop foundational knowledge of Six Sigma, whether as an entry level employee or an executive champion.
White Belt (WB)	Can work on local problem-solving teams that support overall projects, but may not be part of a Six Sigma project team. Understands basic Six Sigma concepts from an awareness perspective

Different views on the definition of Six Sigma:

Methodology— This view of Six Sigma recognizes the underlying and rigorous approach known as DMAIC (define, measure, analyze, improve and control). DMAIC defines the steps a Six Sigma practitioner is expected to follow, starting with identifying the problem and ending with the implementation of long-lasting solutions. While DMAIC is not the only Six Sigma methodology in use, it is certainly the most widely adopted and recognized.

Metrics – In simple terms, Six Sigma quality performance means 3.4 defects per million opportunities.

Philosophy— The philosophical standpoint views all effort as processes that can be defined, measured, analyzed, improved and controlled. Processes require inputs (x) and produce outputs (y). If you control the inputs, you will control the outputs. This is commonly expressed as y = f(x).

Set of tools— The Six Sigma expert uses qualitative and quantitative techniques to drive process improvement. A few such tools include statistical process control (SPC), control charts, failure mode and effects analysis, and process mapping. Six Sigma professionals do not totally agree as to exactly which tools constitute the set.

Steps to Learn Data Science using R

12 May 2015 | 17 Feb 2017 | Data Lover | data-science, R | analytics, basic, data mining, data-science, machine learning, R, statistics

One of the common difficulties people face in learning R is the lack of an organized path. They don’t know where to start, how to proceed, or which route to choose. Although there is a surplus of good free resources on the Internet, this can be overwhelming and confusing at the same time.

After mining through countless resources and archives, here is a comprehensive learning path for R, starting from the beginning. It will help you learn R quickly and proficiently.

Step 1: Download and Install R

The easiest way to proceed is to download the base version of R and its installation instructions from the CRAN site. R is available for Windows, Mac and Linux. Windows and Mac users will most likely want one of the precompiled versions of R from CRAN. R is also part of many Linux distributions, so you should check your Linux package management system as well as CRAN.

You can now install various packages. There are more than 9000 packages in R for different purposes. The CRAN Task Views group packages by topic, so you can select the subset of packages that you want.

To install a package, call install.packages() with the package name in quotes. For example, to install a package called “animation”, use

install.packages("animation")

Normally the package should just install, however:

if you are using Linux and don’t have root access, this command may fail unless you install into a personal library.
you may be asked to select a mirror, i.e. which server to use to download the package.
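Both caveats can be handled in code. The following is a sketch under stated assumptions: install_if_missing is a hypothetical helper (not part of base R), the personal library path is taken from the R_LIBS_USER environment variable, and https://cloud.r-project.org is used as the mirror.

```r
# `install_if_missing` is a hypothetical helper, not part of base R.
install_if_missing <- function(pkg,
                               lib = Sys.getenv("R_LIBS_USER"),
                               repos = "https://cloud.r-project.org") {
  # Nothing to do if the package can already be loaded.
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  # Create the personal library if needed, so no root access is required.
  if (!dir.exists(lib)) {
    dir.create(lib, recursive = TRUE)
  }
  # Make sure R searches the personal library in this session.
  .libPaths(c(lib, .libPaths()))
  # Naming the mirror up front avoids the interactive mirror prompt.
  install.packages(pkg, lib = lib, repos = repos)
  TRUE
}

# Example call (commented out because it needs network access):
# install_if_missing("animation")
```

The function returns FALSE when the package was already available and TRUE after attempting an installation.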

You should also install RStudio. It makes R coding much easier, since it lets you type multiple lines of code, handle plots, install and maintain packages and navigate your programming environment.

Step 2: Learn the basics

You need to start by knowing the basics of the language, its libraries and its data structures. The R track from Datacamp is the best place to start your journey. See the free Introduction to R course at https://www.datacamp.com/courses/introduction-to-r. After doing this course, you will be comfortable writing basic scripts in R and will also understand basic data analysis. Alternatively, you can also see Code School for R at http://tryr.codeschool.com/

If you want to learn R offline on your own time – you can use the interactive package swirl from http://swirlstats.com

Focus first on read.table, data frames, table, summary, describe, loading and installing packages, and data visualization using the plot command.
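As a quick illustration of these basics, here is a minimal sketch using only base R and a tiny made-up dataset. Note that describe is not a base R function; it comes from add-on packages such as psych or Hmisc, so it is left out here.

```r
# A tiny made-up dataset, inlined as text so the example is self-contained.
csv_text <- "name,group,score
Ann,A,82
Bob,B,75
Cid,A,91
Dee,B,68"

# read.table parses delimited text into a data frame.
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     stringsAsFactors = FALSE)

str(scores)                          # column types and a preview
print(summary(scores$score))         # min, quartiles, mean, max
group_counts <- table(scores$group)  # frequency of each group
print(group_counts)

# Base graphics: draw the scores only when running interactively.
if (interactive()) {
  plot(scores$score, xlab = "student", ylab = "score")
}
```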

Step 3: Learn Data Management:

You will use data management functions a lot for data cleaning, especially if you are going to work on text data. The best way to learn them is to go through text manipulation and numerical manipulation assignments. You can also learn to connect to databases through the RODBC package and to run SQL queries against data frames through the sqldf package.
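A short sketch of the kind of text and numerical manipulation meant here, using base R only. The sample names and amounts are made up for illustration; RODBC and sqldf are not shown because they need extra packages and a database.

```r
# Made-up messy input: stray whitespace and inconsistent case.
raw_names <- c("  alice SMITH ", "bob   jones", "CAROL white  ")

clean <- trimws(raw_names)          # strip leading/trailing whitespace
clean <- gsub("\\s+", " ", clean)   # collapse repeated spaces
clean <- tolower(clean)             # normalise case
# Capitalise the first letter of every word (\\U needs perl = TRUE).
clean <- gsub("\\b(\\w)", "\\U\\1", clean, perl = TRUE)

# Numerical manipulation: parse amounts stored as text, then round.
raw_amounts <- c("1,250.456", "99.5", "3,000")
amounts <- round(as.numeric(gsub(",", "", raw_amounts)), 1)

print(clean)
print(amounts)
```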

Step 4: Study specific packages in R – data.table and dplyr

Here we go! Here is a brief introduction to several libraries. Start by practising some common operations.

Practice the data.table tutorial thoroughly here. Print and study the cheat sheet for data.table
Next, you can have a look at the dplyr tutorial here.
For text mining, start with creating a word cloud in R and then work through this series of tutorials: Part 1 and Part 2.
For social network analysis read through these pages.
Do sentiment analysis using Twitter data – check out this and this analysis.
For optimization through R read here and here
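As a taste of the "common operations" these packages are built for, here is a hedged sketch of a filter and a grouped summary on the built-in mtcars dataset. It is written in base R first so it runs anywhere; the dplyr version runs only if dplyr happens to be installed, and data.table is not shown.

```r
# Base R: filter rows, then summarise by group, on the built-in mtcars data.
heavy_cars <- mtcars[mtcars$wt > 3, ]
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# The same grouped summary with dplyr, run only if dplyr is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  mpg_by_cyl_dplyr <- dplyr::summarise(dplyr::group_by(mtcars, cyl),
                                       mpg = mean(mpg))
  print(mpg_by_cyl_dplyr)
}
```

The dplyr functions are called with the dplyr:: prefix so the sketch does not depend on the package being attached.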

Step 5: Effective Data Visualization through ggplot2

Read Edward Tufte and his principles on how to make data visualizations here. In particular, read about data-ink, lie factor and data density.
Read about the common pitfalls on dashboard design by Stephen Few.
To learn the grammar of graphics and a good way to apply it in R, go through this link from Dr Hadley Wickham, creator of ggplot2 and one of the most brilliant R package creators in the world today. You can download the data and slides as well.
Interested in visualizing spatial data? Go through the amazing ggmap package.
Interested in making animations through R? Look through these examples. The animation package will help you here.
Slidify will help supercharge your graphics with HTML5.

Step 6: Learn Data mining and Machine Learning

Now we come to the most valuable skill for a data scientist: data mining and machine learning. You can find a very comprehensive set of resources on data mining in R at http://www.rdatamining.com/. The rattle package helps with an easy-to-use Graphical User Interface (GUI). A free, open-source, easy-to-understand book is available at http://togaware.com/datamining/survivor/index.html. It gives an overview of algorithms such as regression, decision trees, ensemble modelling and clustering. You can also see the various machine learning options available in R in the relevant CRAN view here.

Resources:

You can learn about time series forecasting from the booklet A Little Book of R for Time Series.
Some machine learning in R is here. You can enroll in a free course here.
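Two of the algorithm families mentioned above, regression and clustering, are available in base R without extra packages. The following is a minimal sketch on the built-in mtcars and iris datasets; the choice of variables, of three clusters and of the random seed are illustrative assumptions. Decision trees and ensembles need add-on packages and are not shown.

```r
# Regression: model fuel economy as a linear function of car weight.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
predicted_mpg <- predict(fit, newdata = data.frame(wt = 3))

# Clustering: k-means on the four numeric iris measurements.
set.seed(42)  # k-means starts from random centres
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Cross-tabulate the clusters against the known species for a rough check.
print(table(cluster = clusters$cluster, species = iris$Species))
```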

Step 7: Practice

Practice with example data that you have and that you find on the internet. Stay in touch with what your fellow R coders are doing by subscribing to http://www.r-bloggers.com/, http://stats.stackexchange.com and www.stackoverflow.com. Go through the questions and answers that users come up with. Start interacting by asking questions and answering the ones you can! Happy learning!!! 🙂

Famous quotes about Statistics !

12 May 2015 | 17 Feb 2017 | Data Lover | General | quotes

Here are a few famous quotes about statistics!

A big computer, a complex algorithm and a long time does not equal science. – Robert Gentleman
Absence of evidence is not evidence of absence. – Carl Sagan
All generalizations are false, including this one. – Mark Twain
All models are wrong, but some are useful. – George E. P. Box
An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem. – John Tukey
Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin. – John von Neumann
Figures don’t lie, but liars do figure. – Mark Twain
He uses statistics like a drunken man uses a lamp post, more for support than illumination. – Andrew Lang
I think it is much more interesting to live with uncertainty than to live with answers that might be wrong. – Richard Feynman
If you torture the data enough, nature will always confess. – Ronald Coase
If your experiment needs statistics, you ought to have done a better experiment. – Ernest Rutherford
In God we trust. All others must bring data. – W. Edwards Deming
It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so. – Mark Twain
Say you were standing with one foot in the oven and one foot in an ice bucket. According to the percentage people, you should be perfectly comfortable. – Bobby Bragan, 1963
Statistical thinking will one day be as necessary a qualification for efficient citizenship as the ability to read and write. – H. G. Wells
Statisticians, like artists, have the bad habit of falling in love with their models. – George Box
Statistics – A subject which most statisticians find difficult but in which nearly all physicians are expert.
Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. – Aaron Levenstein
Strange events permit themselves the luxury of occurring. – Charlie Chan
The best thing about being a statistician is that you get to play in everyone’s backyard. – John Tukey
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. – John Tukey
The death of one man is a tragedy. The death of millions is a statistic. – Joseph Stalin
The greatest value of a picture is when it forces us to notice what we never expected to see. – John Tukey
The statistician cannot evade the responsibility for understanding the process he applies or recommends. – Sir Ronald A. Fisher
There are no routine statistical questions, only questionable statistical routines. – D. R. Cox
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. – R. A. Fisher (1938)
We are drowning in information and starving for knowledge. – Rutherford D. Rogers

You can add the famous quotes that you like in the comments. 🙂
