Data science is rapidly growing, with demand for skilled data scientists outpacing the supply. As a result, competition for data science jobs is intense, and companies are looking for the best and brightest to join their teams. This means that the data science interview process is becoming increasingly rigorous, with interviewers using a variety of questions to assess candidates’ technical and soft skills.
This article presents commonly asked Data Science interview questions for both freshers and experienced candidates.
Data Science Interview Questions for Freshers
Here are some common data science interview questions you may encounter:
1. What is the difference between Type I Error & Type II Error?
Type I and Type II errors are statistical errors that can occur when making decisions based on statistical evidence.
Type I error (false positive) is rejecting a null hypothesis when it is actually true. It occurs when a test detects a relationship or a difference that does not exist.
Type II error (false negative) is failing to reject a null hypothesis when it is actually false. It occurs when a test fails to detect a relationship or a difference that does exist.
2. What is Over-fitting and Under-fitting?
Over-fitting refers to a scenario in machine learning where a model is trained too well on the training data to the extent that it starts memorizing the noise or random fluctuations instead of generalizing the underlying patterns. This results in poor performance on unseen data or validation data.
On the other hand, under-fitting occurs when a model is too simple to capture the complexity of the relationship between the predictors and the target variable, i.e., it is not complex enough to capture the underlying patterns in the data. As a result, the model has high bias and low variance and performs poorly on both training and validation data.
3. What is Data Cleansing and why is it important?
Data cleansing is the process of detecting and correcting (or removing) errors and inconsistencies in data. It is an important step in preparing data for analysis because dirty data can lead to inaccurate results and misleading conclusions. Data cleansing can include correcting misspelled names, removing duplicate records, and filling in missing values. It also helps ensure that data meets quality standards and is consistent across different datasets. By cleansing their data, organizations can turn raw data assets into reliable, actionable insights.
4. What is a p-value?
The p-value is a statistical measure that represents the strength of evidence against a null hypothesis. A small p-value (usually less than 0.05) indicates strong evidence against the null hypothesis and supports the alternative hypothesis. It is used to assess the statistical significance of a result and to decide whether to reject the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis and the less likely it is that the result is due to chance.
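As a small illustration (a hedged sketch assuming SciPy is installed and using made-up sample data), a one-sample t-test returns a p-value that can be compared with a significance level such as 0.05:
import numpy as np
from scipy import stats

# hypothetical sample; null hypothesis: the population mean is 50
sample = np.array([52, 48, 51, 53, 49, 50, 54, 47, 52, 51])
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("p-value:", p_value)  # reject the null hypothesis if p_value < 0.05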
5. What is the use of Statistics in Data Science?
Statistics plays an important role in Data Science by providing techniques and methods to collect, analyze, and interpret data. It helps data scientists make informed decisions based on data-driven insights and provides the foundation for various machine learning algorithms. In Data Science, Statistics enables the assessment of the validity of hypotheses, the measurement of uncertainty, and the prediction of future outcomes based on past data trends.
6. How do you handle class imbalance in a given dataset?
To handle class imbalance in data, several techniques can be used, including:
- Resampling: This involves either oversampling the minority class or undersampling the majority class to balance the classes (a minimal sketch follows this list).
- Synthetic Data Generation: This involves creating synthetic data points for the minority class to balance the classes.
- Ensemble Methods: These methods use a combination of several classifiers to handle class imbalance by creating a more robust classifier that can accurately predict the minority class.
- Cost-Sensitive Learning: This approach modifies the learning algorithm to penalize incorrect predictions of the minority class more than inaccurate predictions of the majority class.
- Anomaly Detection: This approach treats the minority class as an anomaly and uses algorithms for anomaly detection to handle class imbalance.
The choice of technique depends on the problem’s specifics and the data type.
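For the resampling option, here is a minimal sketch (the toy data and column names are hypothetical) that randomly oversamples the minority class with scikit-learn's resample utility:
import pandas as pd
from sklearn.utils import resample

# toy imbalanced dataset: six rows of class 0 and two rows of class 1
df = pd.DataFrame({"feature": range(8),
                   "label":   [0, 0, 0, 0, 0, 0, 1, 1]})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# randomly oversample the minority class until it matches the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())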
7. What is a support vector machine, and how does it work?
Support Vector Machines (SVMs) are supervised machine learning algorithms used for classification and regression analysis. They work by finding the hyperplane in a high-dimensional space that best separates the data into classes by maximizing the margin between them. The data points closest to the hyperplane are known as support vectors and have the greatest impact on the position of the hyperplane. The algorithm then uses these support vectors to make predictions for new data.
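A minimal scikit-learn sketch (assuming the library is installed and using its bundled iris dataset) shows the typical fit/predict workflow:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# linear kernel; C controls how soft the margin is
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))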
8. What is the curse of dimensionality, and how does it affect machine learning?
The curse of dimensionality challenges the use of high-dimensional data in machine learning. As the number of features or dimensions in a dataset increases, the amount of data required to train a model effectively grows exponentially, leading to overfitting and poor performance on new data. Machine learning algorithms designed for low-dimensional spaces may struggle to find patterns in high-dimensional data, resulting in poor performance and increased computational cost. To overcome these challenges, techniques such as feature selection, dimensionality reduction, and regularization reduce the number of features in a dataset and enhance the performance of machine learning algorithms.
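As one example of dimensionality reduction (a sketch assuming scikit-learn and purely synthetic data), PCA compresses many features into a few components:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 samples with 50 synthetic features

pca = PCA(n_components=5)                 # keep only the top 5 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                    # (200, 5)
print(pca.explained_variance_ratio_.sum())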
9. What are the steps used in making a decision tree?
The steps in making a decision tree are as follows:
- Select the best attribute to split the data: Choose the attribute that results in the largest information gain or the lowest impurity, such as Gini impurity or entropy.
- Split the dataset: Split the dataset into subsets based on the chosen attribute.
- Create a decision node: Create a decision node for the selected attribute and assign the corresponding subsets to its child nodes.
- Repeat the process: Repeat the process for each child node recursively until all the data in the subsets belong to the same class or a stopping criterion is reached.
- Assign class labels: Assign the class label to each leaf node based on the majority class of the data in the corresponding subset.
- Evaluate the tree: Evaluate the performance of the tree on a separate test dataset to ensure its accuracy and generalizability.
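In practice, libraries automate these steps; a minimal scikit-learn sketch (assuming the library and its bundled wine dataset) looks like this:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion selects the impurity measure ("gini" or "entropy")
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))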
10. How will you calculate the Euclidean distance in Python?
plot1 = [1,4] and plot2 = [2,6]
You can calculate the Euclidean distance between two points in Python using the following code:
import numpy as np
plot1 = np.array([1,4])
plot2 = np.array([2,6])
distance = np.sqrt(np.sum((plot1 - plot2)**2))
print("Euclidean distance:", distance)
Euclidean distance: 2.23606797749979
11. What are correlation and covariance in statistics?
Correlation and covariance are two measures of the relationship between two variables in statistics.
Correlation: Correlation measures the linear association between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear correlation.
Covariance: Covariance measures how two variables vary together about their means. It can be positive, negative, or zero, indicating the direction of the relationship between the variables, but its magnitude depends on the units of the variables.
Both correlation and covariance provide information about the relationship between two variables. Still, correlation is a standardized measure, meaning it has a fixed scale and is easier to interpret than covariance, which has no fixed scale.
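A quick numerical illustration with NumPy (made-up data):
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

print("Covariance:", np.cov(x, y)[0, 1])        # unbounded, expressed in the original units
print("Correlation:", np.corrcoef(x, y)[0, 1])  # standardized, always between -1 and 1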
12. What are lambda functions?
Lambda functions, also called anonymous functions, are small, inline functions defined without a name in Python. They come in handy when you need to execute a small piece of code only once or when you want to create simple functions to pass as arguments to other functions.
The syntax for a lambda function is as follows:
lambda arguments: expression
For example, the following lambda function takes two arguments and returns their sum:
add = lambda a, b: a + b   # named "add" to avoid shadowing the built-in sum()
print(add(3, 4))
Output: 7
Lambda functions are used with higher-order functions, such as map(), filter(), and reduce(), to perform operations on lists and other iterable objects.
13. What is the Bias-Variance tradeoff?
The Bias-Variance tradeoff in machine learning involves finding a balance between two sources of error in a model: bias and variance. Bias arises when the model assumes a relationship between the features and the target variable that is too simple, causing underfitting. Variance results from the model's sensitivity to small changes in the training data, leading to overfitting. Building a model that generalizes well to new data requires finding the optimal balance between the two.
14. How can you calculate accuracy using a confusion matrix?
To calculate accuracy using a confusion matrix, sum the number of true positive and true negative predictions and divide by the total number of predictions made.
The formula for accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where true positives (TP) are instances correctly predicted as positive, true negatives (TN) are instances correctly predicted as negative, and the denominator, which also includes false positives (FP) and false negatives (FN), is the total number of predictions made by the model.
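A small worked example with hypothetical counts:
# hypothetical confusion-matrix counts
TP, TN, FP, FN = 50, 40, 5, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.9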
15. What is the difference between “long” and “wide” format data?
In data analysis, long and wide formats refer to the arrangement of data in a table or dataset.
“Long” format refers to a data structure where each row represents a single observation, and each column represents a variable or feature. The advantage of this format is that it allows for easy aggregation and analysis of the data, especially when dealing with time-series data.
“Wide” format refers to a data structure where each row represents a single entity and each column holds a separate observation or measurement for that entity (for example, one column per time period). The advantage of this format is that it is easier to perform descriptive statistics and data visualization, as all observations for a single entity are stored in a single row.
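A minimal pandas sketch (made-up sales data) converts between the two formats with melt() and pivot():
import pandas as pd

# wide format: one row per store, one column per quarter
wide = pd.DataFrame({"store": ["A", "B"],
                     "Q1": [100, 80],
                     "Q2": [120, 90]})

# wide -> long: one row per (store, quarter) observation
long_df = wide.melt(id_vars="store", var_name="quarter", value_name="sales")

# long -> wide again
wide_again = long_df.pivot(index="store", columns="quarter", values="sales")
print(long_df)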
16. What are skewed Distribution & uniform Distribution?
Skewed Distribution: A skewed distribution, also known as an asymmetrical distribution, is a probability distribution where the values are not evenly distributed around the mean. Instead, the values are concentrated more towards one side of the mean, creating a “skewed” shape. There are two types of skewed distributions: positively skewed (long tail to the right) and negatively skewed (long tail to the left).
Uniform Distribution: A uniform distribution, also known as a rectangular distribution, is a probability distribution where all values within a specified range have an equal probability of occurrence. The distribution is symmetrical, with a constant probability density over the specified interval, and its mean, median, and mode are all equal. The uniform distribution is commonly used in applications where a random variable is equally likely to take on any value within a specified range.
17. How do you create an empty DataFrame in Pandas?
You can create an empty DataFrame with specified column names by passing a list of names to the columns parameter of the pd.DataFrame() constructor:
import pandas as pd
df = pd.DataFrame(columns=['column1', 'column2', 'column3'])
18. How is time-series data declared as stationary?
To declare time-series data as stationary, its statistical properties, such as its mean, variance, and covariance, must remain constant over time. Two standard methods are used to determine whether a time series is stationary:
Visual inspection: Plotting the data to observe trends, seasonality, or fluctuations. If the mean and variance are constant over time, the data can be considered stationary.
Statistical testing: Conduct statistical tests such as the Augmented Dickey-Fuller (ADF) test or the KPSS test to identify unit roots, which are hallmarks of non-stationary time series. For the ADF test, the series is considered stationary if the test statistic is more negative than the critical value (equivalently, if the p-value is below the chosen significance level); note that the KPSS test reverses the null hypothesis, so its interpretation is the opposite.
When a time series is stationary, its mean, variance, and covariance remain unchanged. This property is crucial in time-series modeling, as many methods assume stationarity. If a time series is non-stationary, it may require transformation, such as taking first differences or using logarithmic scaling, to make it stationary before modeling.
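A hedged sketch of the statistical route (assuming statsmodels is installed and using synthetic data):
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = rng.normal(size=200).cumsum()   # a random walk, which is non-stationary
diffed = np.diff(series)                 # its first difference is roughly stationary

for name, data in [("raw", series), ("differenced", diffed)]:
    stat, p_value = adfuller(data)[:2]
    print(name, "ADF statistic:", round(stat, 3), "p-value:", round(p_value, 3))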
19. Why is R used in Data Visualization?
R is used in data visualization due to its comprehensive suite of graphical and statistical tools and its strong community of developers who contribute packages and functions that further expand its capabilities. R's dynamic and flexible syntax allows for easy data manipulation, the creation of complex visualizations, and the clear presentation of insights. R also integrates well with other data analysis tools, making it a versatile choice for data visualization projects.
Data Science Interview Questions for Experienced
20. What is the Difference between Standardization and Normalization?
The difference between normalization and standardization lies in how the data is scaled. Normalization scales the data to fall within a specific range, typically [0, 1]. This method is useful when the range of the data is important and needs to be maintained. On the other hand, standardization scales the data to have a mean of 0 and a standard deviation of 1. This method is useful when the distribution of the data is important and its properties need to be maintained.
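A minimal sketch with scikit-learn's scalers (made-up data) shows the difference:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # normalization: values rescaled to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # standardization: mean 0, standard deviation 1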
21. What are the steps of a Data Analytics Project?
To carry out a data analytics project, you can follow these steps:
- Define the problem and gather the data
- Clean and pre-process the data
- Exploratory data analysis and feature engineering
- Model selection and training
- Model evaluation and hyperparameter tuning
- Deployment and monitoring
- Refining and updating the model as needed.
22. Define the terms KPI, lift, model fitting, robustness and DOE?
- KPI (Key Performance Indicator): A metric used to measure the success of a specific business goal.
- Lift: The improvement in the performance of a predictive model compared to a baseline model.
- Model fitting: Training a machine learning model on a given dataset to make predictions.
- Robustness: The ability of a model to perform well on unseen data and maintain its performance even in the presence of minor variations or errors in the input data.
- DOE (Design of Experiments): A statistical method used to optimize the process of experimentation and evaluate the relationships between variables to improve the outcomes of a process or product.
23. Define confounding variables?
Confounding variables are extraneous variables that can affect the relationship between an independent and a dependent variable, making it difficult to determine cause and effect. Confounding variables can distort the results of a study by creating a spurious association between the variables being studied, leading to incorrect conclusions about their relationship. To control for confounding variables, it is important to either adjust for them in the analysis or use randomization in the study design.
24. How is SQL different from NoSQL?
SQL and NoSQL are two different approaches to data storage and retrieval that are optimized for different use cases. SQL is best suited for structured data and complex, multi-step queries, while NoSQL is designed for handling large amounts of unstructured data and is optimized for performance, scalability, and flexibility.
25. What is univariate, bivariate, and multivariate analysis?
Univariate analysis is the simplest form of statistical analysis and involves the study of a single variable. The goal of univariate analysis is to describe and summarize the distribution of the data, including measures such as the mean, median, mode, and range.
Bivariate analysis involves the study of two variables and is used to understand the relationship between them. This type of analysis includes methods such as scatter plots, correlation coefficients, and regression analysis, which are used to determine whether there is a relationship between the two variables and to quantify the strength and direction of that relationship.
Multivariate analysis involves the study of more than two variables and is used to understand the relationships among multiple variables. This type of analysis includes methods such as factor analysis, principal component analysis, and multivariate regression, which are used to identify underlying patterns and relationships between the variables and the most important variables that contribute to those relationships.
26. What is the pickle module in Python?
The pickle module in Python implements binary serialization and deserialization of Python objects. Serialization is the process of converting a Python object into a byte stream that can be stored in a file or transmitted over a network, and deserialization is the reverse process of converting a byte stream back into a Python object.
The pickle module allows Python objects, such as lists, dictionaries, and custom classes, to be easily saved to a file and loaded back into memory later. This makes it possible to persist the state of a Python program across multiple sessions or to store and share data between different Python programs.
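A minimal sketch (the file name is hypothetical):
import pickle

data = {"model": "example", "scores": [0.91, 0.88]}

with open("data.pkl", "wb") as f:   # serialize the dictionary to a file
    pickle.dump(data, f)

with open("data.pkl", "rb") as f:   # deserialize it back into a Python object
    restored = pickle.load(f)
print(restored)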
27. What is a Box-Cox Transformation?
The Box-Cox transformation is a method for transforming non-normally distributed data into a more standard or Gaussian-like distribution. The transformation aims to make the data more suitable for statistical analysis, as many statistical methods assume that the data is normally distributed.
The Box-Cox transformation is named after its inventors, George Box and David Cox, who developed the method in the 1960s. It is a power transformation that takes the form:
y = (x^λ - 1) / λ
Where x is the original data, λ is a parameter that controls the shape of the transformation, and y is the transformed data. The parameter λ can be estimated using maximum likelihood methods, and a value of λ=0 corresponds to the log transformation.
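A hedged sketch with SciPy (synthetic right-skewed data); scipy.stats.boxcox estimates λ by maximum likelihood when it is not supplied:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=500)   # strictly positive, right-skewed data

transformed, fitted_lambda = stats.boxcox(skewed)
print("Estimated lambda:", round(fitted_lambda, 3))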
28. What is the F1 score, and how to calculate it?
The F1 score measures the accuracy of a binary classification model. It is a harmonic mean of Precision and Recall and balances the trade-off between the two. The F1 score provides a single metric that summarizes the overall performance of a model and is particularly useful when Precision and Recall have different levels of importance.
The F1 score is calculated as follows:
F1 score = 2 * (Precision * Recall) / (Precision + Recall)
Where Precision is the number of true positive predictions divided by the sum of true positive predictions and false positive predictions, and Recall is the number of true positive predictions divided by the sum of true positive and false negative predictions.
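A small worked example with hypothetical counts:
# hypothetical counts of true positives, false positives, and false negatives
TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)   # 0.8
recall = TP / (TP + FN)      # about 0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))          # about 0.727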
29. Why do we use the summary function?
The summary function obtains a concise summary of an object in R, such as a linear regression model, a data frame, or a statistical analysis. The summary function provides important information about the object in a compact and easy-to-read format, such as the coefficients, residuals, and statistical measures.
30. What is a Markov chain?
A Markov chain is a mathematical model used to describe the evolution of a system over time, where the system's future state depends only on the current state and not on any past states. It consists of a set of states and a transition matrix that gives the probability of moving from one state to another. Markov chains are used in various applications, including weather forecasting, financial modeling, and queueing systems.
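A minimal simulation sketch (a made-up two-state weather model with a hypothetical transition matrix):
import numpy as np

states = ["sunny", "rainy"]
# row i holds the probabilities of moving from state i to each state
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state = 0                              # start in the "sunny" state
for _ in range(5):
    state = rng.choice(2, p=P[state])  # the next state depends only on the current one
    print(states[state])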
31. What is the goal of A/B Testing in data science?
A/B testing in data science compares two versions of a product, service, or process (A and B) to determine which version is more effective or preferred by customers or users.
The goal is to make data-driven decisions about design, marketing, and product development by comparing the performance of the two variations. A/B testing provides valuable insights into customer behavior and preferences, which can be used to improve products, services, and processes and to increase customer satisfaction and engagement.
32. What is Cluster Sampling?
Cluster sampling is a technique used in statistical surveys and research studies in which the entire population is divided into groups (also known as clusters), and a sample of these clusters is then selected for the study. The individuals within each selected cluster are then sampled and used to represent the larger population.
This technique is often used when it is difficult or costly to obtain a representative sample by simple random sampling. For example, suppose the population is spread over a large geographical area. In that case, it may be more efficient to sample clusters of individuals within regions rather than selecting individuals randomly from the entire population.
33. What do you mean by Deep Learning?
Deep Learning refers to a subfield of machine learning that employs algorithms modeled after the structure and functioning of the human brain, known as artificial neural networks, to process and analyze complex data such as images, audio, and text. The term “deep” alludes to the utilization of multiple layers in the neural network, allowing for more abstract representations of the input data.
34. What do you mean by Data Science?
Data science is an interdisciplinary field that combines statistical analysis, programming, and domain knowledge to extract insights and knowledge from data. It involves using algorithms, methods, and systems to collect, store, process, and analyze large and complex data sets in order to uncover hidden patterns, correlations, and other insights. The goal of data science is to turn raw data into actionable insights that can inform decision-making and drive business value.
35. What Are Hyperparameters?
Hyperparameters are parameters in machine learning models that are set before the learning process begins and are not learned from the data. They control the overall behavior of the model and are often set based on experience and intuition rather than being learned from the data itself. Examples of hyperparameters include:
- The learning rate in gradient descent.
- The number of hidden layers in a neural network.
- The regularization strength in linear regression.
- The number of clusters in a clustering algorithm.
36. What Is the Difference Between Iteration, Epoch, and Batch in Deep Learning?
In deep learning, the terms epoch, batch, and iteration describe the various steps in training a neural network model.
An epoch is a complete iteration over the entire training dataset. For example, if the training dataset has 1000 examples and a batch size of 100, then one epoch would consist of 10 iterations, where each iteration processes 100 examples. After one epoch, the model has seen all 1000 examples once.
A batch is a set of training examples processed together in a single iteration. The batch size is a hyperparameter that determines the number of examples processed in each iteration. In general, larger batch sizes can lead to faster training times, but too large a batch size can also slow down convergence and make it more difficult for the model to learn.
An iteration is a single update of the model parameters based on the gradients calculated from one batch of examples. In each iteration, the model updates its parameters based on the gradients calculated from the current batch and then moves on to the next batch. The number of iterations per epoch equals the number of examples in the training dataset divided by the batch size.
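The arithmetic from the example above, as a short sketch:
dataset_size, batch_size, epochs = 1000, 100, 5

iterations_per_epoch = dataset_size // batch_size  # 10
total_iterations = iterations_per_epoch * epochs   # 50
print(iterations_per_epoch, total_iterations)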
37. What is a computational graph?
A computational graph is a mathematical representation of a computational process, where nodes in the graph represent operations, and the edges between nodes represent the data flow. In the context of deep learning, a computational graph represents a neural network, where the nodes represent individual layers and operations, and the edges represent the flow of data between layers.
38. What is the use of cycle fields in Tableau?
Cycle fields in Tableau are used to create a repeating pattern of color or mark styles in visualizations. They are particularly useful when working with time-series data, as they allow you to distinguish different time periods or categories in a clear and intuitive way.
For example, you might use a cycle field to apply a different color to each quarter of a line chart, making it easier to see trends over time. Alternatively, you might use a cycle field to apply a different marker style to each category in a scatter plot, making it easier to distinguish between categories.
39. What are a Super key and a Candidate key?
A super key is any single attribute or combination of attributes that uniquely identifies records in a table. A super key may also contain additional attributes that are not actually required to distinguish the records.
A candidate key is a minimal super key: it consists of one or more attributes that uniquely identify records in a table, but, in contrast to a general super key, every attribute in a candidate key is necessary for identifying the records.
40. What is an auto-increment?
Auto-increment is a feature in database management systems that automatically assigns a unique numerical value to a record each time a new record is inserted into a table. It is typically used as a primary key, the unique identifier for each record in a table.
Auto-increment is commonly used in relational databases such as MySQL, PostgreSQL, and Microsoft SQL Server. It is a convenient way to manage unique identifiers for records, as it eliminates the need for manual assignment and helps ensure the integrity of the data in the table.
41. What is the purpose of DCL Language?
The main purpose of DCL (Data Control Language) is to allow database administrators to control who has access to the data stored in a database and what operations they can perform. For example, DCL commands such as GRANT and REVOKE can grant or revoke specific privileges, such as the ability to SELECT, INSERT, UPDATE, or DELETE data, to individual users or groups of users.
42. What is the difference between deep and shallow copy in Python?
In Python, a shallow copy of an object is a new object with the same values as the original object, but the objects themselves are not copied. Instead, the new object contains references to the same objects as the original. On the other hand, a deep copy of an object creates a new object with completely independent copies of all objects contained in the original.
For example, consider a list that contains lists as its elements. A shallow copy of this list would create a new list with the same elements, but the inner lists would still reference the same objects as in the original list. Modifying one of these inner lists would affect both the original and shallow copies.
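A minimal sketch with the standard-library copy module illustrates the difference:
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)     # new outer list, but the same inner lists
deep = copy.deepcopy(original)    # fully independent copies of the inner lists

original[0].append(99)
print(shallow[0])  # [1, 2, 99] -- shares the modified inner list
print(deep[0])     # [1, 2]     -- unaffected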
43. How does OLAP work?
OLAP (Online Analytical Processing) is a category of software tools that provides fast, interactive access to data stored in data warehouses and marts. OLAP supports multi-dimensional data analysis, enabling users to quickly analyze data from different perspectives and answer complex business questions.
Here’s how OLAP works:
- Data Warehouse: OLAP works with data stored in a data warehouse or data mart, a large repository of data organized and optimized for fast querying and analysis. The data in the data warehouse is typically extracted from transactional databases, transformed, and loaded into the data warehouse to support business intelligence (BI) and analytics.
- Multi-Dimensional Data Model: OLAP uses a multi-dimensional data model, which organizes data into a series of dimensions and hierarchies, such as time, geography, product, and customer. This allows users to analyze data from multiple angles, such as sales by product and region over time.
- Cubes: OLAP stores data in multi-dimensional data structures called cubes. A cube is a pre-aggregated and pre-calculated representation of data that provides fast access to the data for analysis. Cubes are constructed from the data in the data warehouse, and the calculations and aggregations within the cube are performed in advance to speed up querying.
- Querying and Analysis: OLAP provides fast and interactive access to the data in the cubes for querying and analysis. Users can interact with the data using various tools, such as pivot tables, drill-down and drill-up functions, and slicing and dicing the data. These tools allow users to quickly analyze data from different perspectives and answer complex business questions.
- Performance Optimization: OLAP uses a variety of performance optimization techniques, such as data compression, indexing, and caching, to provide fast querying and analysis of data. Additionally, OLAP tools may use in-memory technology to store data in RAM for even faster access to the data.
44. What is VIZQL in Tableau?
VIZQL (Visual Query Language) is the visual query language used by Tableau, a powerful data visualization and business intelligence tool. It is a declarative language that enables users to interact with and analyze data in Tableau by specifying the visualizations they want to create.
VIZQL combines the power of SQL and a graphical interface to allow users to create complex visualizations without writing any code. With VIZQL, users can drag and drop fields from a data source onto the view canvas to create a variety of visualizations, including bar charts, line charts, scatter plots, and maps.
45. Explain leave-p-out cross-validation?
Leave-p-out cross-validation is a resampling method used in machine learning and statistics to evaluate the performance of a model. Like k-fold cross-validation, it partitions the data into training and testing sets in order to estimate the model's accuracy. In leave-p-out cross-validation, p observations are held out as the test set in each iteration, and the remaining n - p observations are used as the training set. The process is repeated for every possible combination of p observations from the dataset, so each subset of size p serves as the test set exactly once, and the results are averaged. Leave-one-out cross-validation is the special case where p = 1.
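A minimal sketch with scikit-learn's LeavePOut (tiny synthetic data, p = 2):
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)   # 4 samples, 2 features
y = np.array([0, 0, 1, 1])

lpo = LeavePOut(p=2)             # every pair of samples serves as the test set exactly once
for train_idx, test_idx in lpo.split(X):
    print("train:", train_idx, "test:", test_idx)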