From my experience reading posts in many data science groups on Facebook, users ask this question very frequently.
The answers usually provided vary drastically, from starting with Python or R programming courses to taking data science courses on YouTube, Coursera, etc.
Little attention is paid to the student's background. Generally, the responses tend to position data science as a programming or algorithmic field.
In addition, I have seen all kinds of questions about visualizations and modeling results produced in Python or R. Many of them show a lack of basic statistical understanding.
It is very important to point out that all these domains of specialization (Data Science, Data Analytics, Data Engineering) have the word "data" in their names. By definition, «Statistics is the science of conducting studies to collect, organize, summarize, analyze and draw conclusions from data.» It is the core foundation of all these new fields. It is not surprising that in the US, most undergraduate students are required to take a basic statistics course in order to graduate.
My suggestion for a beginner is to start by taking a sound basic statistics course. You can take such a course at any university or from a qualified instructor, preferably one with a background in statistics.
Some of the topics you will learn are:
- Statistics, data and statistical thinking
- Types of data
- Basic notions of samples and populations
- Methods for describing quantitative data and qualitative data
- Counting techniques (Permutations and Combinations)
- Probability
- Discrete random variables
- Continuous random variables (Normal distribution)
- Sampling distributions
- Inferences based on a single sample (Confidence intervals and tests of hypotheses)
- Inferences based on two samples
- ANOVA (Analysis of Variance)
- Correlations and Simple Linear Regression
- Multiple regression
- Basic categorical data analysis
A basic statistics course will provide the necessary foundation to start learning other machine learning topics.
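To make one of the topics above concrete, here is a minimal sketch of a confidence interval for a single sample mean, written in Python with only the standard library. The function name `mean_confidence_interval` and the exam scores are made up for illustration. The sketch also assumes a normal approximation; with a sample this small, a statistics course would typically teach you to use the t distribution instead, which gives a somewhat wider interval.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


# Hypothetical sample: exam scores of 10 students (made-up numbers).
scores = [72, 85, 90, 66, 78, 88, 95, 70, 81, 75]
low, high = mean_confidence_interval(scores)
print(f"mean = {statistics.mean(scores):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

The code itself is short. Knowing when the normal approximation is acceptable, and what the interval does and does not say about the population, is what the statistics course teaches.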
It is also recommended to have some basic math skills such as college algebra, calculus and linear algebra.
While taking the statistics course, the next courses to keep in mind are SQL and Spark SQL. You need to develop strong SQL skills in order to extract and analyze large datasets. Python and R are the programming languages you will need. You can start with Python. Over time, there may be cases where you need to learn R, because it is arguably the most complete statistical programming language.
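As a small illustration of how SQL and Python work together, the sketch below uses Python's built-in `sqlite3` module to load a few rows and summarize them with a `GROUP BY` query. The `sales` table, its columns, and its values are hypothetical. SQLite stands in here for whatever database or Spark SQL environment you end up using, and the query syntax in those systems may differ in details.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, for illustration only.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [
        ("North", 120.0),
        ("North", 80.0),
        ("South", 200.0),
        ("South", 100.0),
        ("South", 60.0),
    ],
)

# Let the database do the summarizing: row count and average per region.
query = """
    SELECT region, COUNT(*) AS n, AVG(amount) AS avg_amount
    FROM sales
    GROUP BY region
    ORDER BY region
"""
summary = conn.execute(query).fetchall()

for region, n, avg_amount in summary:
    print(f"{region}: n={n}, average={avg_amount:.2f}")

conn.close()
```

Doing the aggregation in SQL means only the small summary is pulled into Python, which matters much more once the table has millions of rows instead of five.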
