Statistics in Data Science: Importance and Applications

Statistics plays a crucial role in Data Science and underpins decision-making across many industries. Data Science has become one of the most sought-after skills in the current job market, and professionals with even a year of experience in the field are securing well-paid positions at top companies. To succeed in Data Science and secure a rewarding job, one must master several key aspects, and Statistics is one of the most important.

What is Statistics?

Statistics is the discipline of collecting, analyzing, and interpreting data in order to make decisions in uncertain situations. Businesses today rely heavily on Statistics and Data Science to drive their operations and make informed choices. For example, companies use sales tracking to analyze customer behavior and profits, while health insurance companies use Statistics to design suitable plans for different customer groups.

“How much Statistics is enough for a Data Scientist?”

The question of how much Statistics knowledge is sufficient for a Data Scientist is commonly asked. Since Statistics is a vast field with various techniques required for different situations in Data Science, there is no fixed answer to this question. However, beginners in Data Science should familiarize themselves with some fundamental statistical terms, such as:

  • Measurement of Central Tendency: Mean, Median, and Mode. Among these, the Median is often preferred when the data contain outliers, because extreme values barely affect it.
  • Measurement of Variations: Variance, Range, Standard Deviation
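The resistance of the median to outliers can be seen in a small sketch. The salary figures below are hypothetical and chosen only for illustration; the code relies on Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); the values are made up for illustration.
salaries = [30, 32, 35, 38, 40]
salaries_with_outlier = salaries + [400]  # add one extreme value

mean_before = mean(salaries)                    # 175 / 5 = 35
median_before = median(salaries)                # middle value = 35
mean_after = mean(salaries_with_outlier)        # 575 / 6, roughly 95.83
median_after = median(salaries_with_outlier)    # (35 + 38) / 2 = 36.5

print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme value moves the mean from 35 to about 95.83, while the median only shifts from 35 to 36.5.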

For more in-depth understanding, interested readers can refer to an informative article by Terence Shin.

Descriptive Statistics

Descriptive Statistics, as the name suggests, involves summarizing the features and characteristics of a large dataset in a more understandable format. It helps analysts and readers alike to comprehend complex data in a simplified manner.

Descriptive Statistics in Data Science often involves representing the dataset with graphs and charts. Creating these visuals requires formatting, cleaning, and analyzing the data first, which in turn requires some programming knowledge. Commonly used languages in Data Science include R and Python, and libraries such as TensorFlow are widely used for later modeling stages.

Descriptive Statistics can be further categorized into three types:

  1. Measurement of Central Tendency
  2. Measurement of Variability
  3. Measurement of Frequency

Let’s look at each of them one by one.

Measurement of Central Tendency

Measurement of Central Tendency involves computing values like Mean, Median, and Mode, which provide insights into the center point of the dataset’s distribution.

Example: Consider a dataset: (1, 2, 3, 4, 5, 5, 8). Mean, median and mode values are 4, 4, and 5, respectively.
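As a quick check of the example above, the same three values can be computed with Python's standard `statistics` module. This is only a minimal sketch; the variable names are chosen for illustration.

```python
from statistics import mean, median, mode

data = [1, 2, 3, 4, 5, 5, 8]

data_mean = mean(data)      # sum is 28, divided by 7 values
data_median = median(data)  # middle value of the sorted data
data_mode = mode(data)      # most frequent value

print(data_mean, data_median, data_mode)  # mean 4, median 4, mode 5
```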

Variability Measurement

Variability Measurement analyzes the spread or dispersion of the dataset’s distribution. It shows how tightly or loosely the values cluster around the center. Quantities such as Range, Quartiles, Variance, and Standard Deviation are measured in this category.

For example, the Range of the dataset 5, 18, 30, 44, 60, 75, 80, 100 is 95, which is obtained by subtracting the smallest observation from the largest observation (100 - 5 = 95).
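The range, along with the other spread measures mentioned in this section, can be sketched with the standard `statistics` module. The choice of the sample variance (dividing by n - 1) and of the default quartile method is an assumption made here for illustration; other conventions exist and give slightly different numbers.

```python
from statistics import quantiles, stdev, variance

data = [5, 18, 30, 44, 60, 75, 80, 100]

data_range = max(data) - min(data)   # 100 - 5 = 95
sample_variance = variance(data)     # sample variance (divides by n - 1)
sample_std_dev = stdev(data)         # square root of the sample variance
q1, q2, q3 = quantiles(data, n=4)    # quartiles; q2 is the median
iqr = q3 - q1                        # interquartile range

print("range:", data_range)
print("variance:", round(sample_variance, 2))
print("standard deviation:", round(sample_std_dev, 2))
print("quartiles:", q1, q2, q3, "IQR:", iqr)
```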

Distribution/Frequency Measurement

Distribution or frequency measurement involves counting the number of times each value appears in the dataset.
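A frequency count can be sketched with `collections.Counter` from the Python standard library. The dataset is the one used in the central tendency example; the variable names are illustrative.

```python
from collections import Counter

data = [1, 2, 3, 4, 5, 5, 8]

# Map each distinct value to the number of times it occurs.
frequency = Counter(data)

for value, count in sorted(frequency.items()):
    print(f"value {value}: appears {count} time(s)")
```

Here the value 5 appears twice and every other value appears once, which is also why 5 is the mode.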

Graphical representations like Bar graphs, scatterplots, Histograms, and Boxplots are commonly used in Descriptive Statistics to provide a visual overview of the dataset, including parameters like maximum, minimum, median, and quartiles.

While Descriptive Statistics is ideal for recording, summarizing, and representing the dataset at hand, it does not allow drawing inferences about anything beyond that dataset.

Inferential Statistics

Inferential Statistics in Data Science involves drawing conclusions or inferences about a population based on measurements taken from a smaller sample. A sample is a small group drawn from the larger population.

To understand Inferential Statistics and its applications in Data Science, one must grasp the concept of sampling the population. Sampling involves selecting a small group from the vast universal set of data or population to represent it as accurately as possible. Some common sampling techniques include Simple Random Sampling, Stratified Sampling, and Cluster Sampling.
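Two of these techniques can be sketched with Python's standard `random` module. The population below is hypothetical (100 customers tagged with a made-up `region` field), and the 10% sampling fraction is an arbitrary choice for illustration. Cluster sampling is not shown.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 70 customers in "north", 30 in "south".
population = [
    {"id": i, "region": "north" if i < 70 else "south"} for i in range(100)
]

# Simple random sampling: every member has the same chance of selection.
simple_sample = rng.sample(population, k=10)

# Stratified sampling: group members by stratum, then sample each
# stratum in proportion to its size.
strata = defaultdict(list)
for member in population:
    strata[member["region"]].append(member)

sampling_fraction = 0.1
stratified_sample = []
for region, members in strata.items():
    k = round(len(members) * sampling_fraction)
    stratified_sample.extend(rng.sample(members, k))

print(len(simple_sample), len(stratified_sample))
```

The stratified sample always contains 7 members from "north" and 3 from "south", whereas the simple random sample may over- or under-represent either region by chance.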

Inferential Statistics in Data Science is classified into two types:

  1. Hypothesis Testing
  2. Regression Analysis

Hypothesis Testing

Hypothesis Testing is a method that uses sample data to test an assumption made about a population, leading to conclusions about the entire population. The process of Hypothesis Testing typically involves:

  • Stating a null hypothesis (H0).
  • Stating an alternative hypothesis (Ha) that contradicts the null hypothesis.
  • Choosing a significance level (commonly 0.05) and calculating a test statistic (e.g., a Z, T, or F statistic) together with its p-value.
  • Comparing the p-value with the significance level or, equivalently, the test statistic with its critical value.
  • Rejecting or failing to reject the null hypothesis based on the comparison.
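The procedure can be sketched as a two-sided, one-sample Z test using only the standard library. All numbers below (hypothesized mean, known population standard deviation, sample size, sample mean) are hypothetical. The decision compares the p-value with a significance level, which is equivalent to comparing the test statistic with its critical value.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs, chosen only for illustration.
mu_0 = 50.0          # population mean under the null hypothesis H0
sigma = 6.0          # population standard deviation, assumed known
n = 36               # sample size
sample_mean = 52.0   # observed sample mean
alpha = 0.05         # significance level

# Test statistic: distance of the sample mean from mu_0 in standard errors.
standard_error = sigma / sqrt(n)
z = (sample_mean - mu_0) / standard_error

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
p_value = 2 * (1 - std_normal.cdf(abs(z)))      # two-sided p-value
z_critical = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05

reject_null = p_value < alpha  # same decision as abs(z) > z_critical

print(f"z = {z:.2f}, critical value = {z_critical:.2f}, p-value = {p_value:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```

With these inputs the statistic is 2.0, which exceeds the critical value of about 1.96, so the null hypothesis is rejected at the 5% level.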

Various tests, such as Z Test, T Test, and F Test, are conducted during Hypothesis Testing to draw conclusions about the population:

Z Test: This test is typically conducted when the sample size is greater than or equal to 30 and the variance of the population is known. It compares the sample mean with the hypothesized population mean. If the absolute value of the calculated test statistic is greater than the critical value, which is determined by the chosen significance level, then the null hypothesis is rejected. This decision rule is common to all the other tests as well.

T Test: This test is conducted when the population variance is unknown and must be estimated from the sample, which matters most when the sample size is less than 30.

F Test: This test is conducted to check whether there is a difference between the variances of two populations or samples.

Regression Analysis

Regression Analysis is another aspect of Inferential Statistics in Data Science. It helps understand how one variable changes in response to variations in another variable. The most commonly used type of Regression Analysis is Linear Regression, which can be further divided into simple linear regression (one independent variable) and multiple linear regression (several independent variables). Linear Regression examines the linear relationship between the dependent and independent variables.
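A simple linear regression can be sketched with ordinary least squares in plain Python. The data points and the helper name `fit_line` are hypothetical and used only for illustration; in practice a library such as scikit-learn or statsmodels would usually be used instead.

```python
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_mean, y_mean = mean(x), mean(y)
    # Slope = covariance of x and y divided by the variance of x.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data: x is the independent variable, y the dependent variable.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
prediction = slope * 6 + intercept  # predicted y for a new x value of 6

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction at x = 6: {prediction:.2f}")
```

For this data the fitted slope is about 1.99 and the intercept about 0.05, meaning y increases by roughly 2 for each unit increase in x.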