Wed, Nov 30, 2022
Top 40 Data Scientist Interview Questions and Answers
  1. What do you mean by Data Science?

Answer: Data science is a multidisciplinary blend of data inference, algorithms, machine learning principles, and technology used to solve analytically complex problems. The key objective is to extract valuable information from data.

[Image: data scientist vs data analyst. Image source: edureka!]

  1. Data cleaning holds an important role in the analysis. Why?

Answer: Data cleaning plays an important role in analysis in the following ways:

  • It helps to increase the accuracy of the model in machine learning.
  • It is a tiresome process: as the number of data sources increases, the time taken to clean the data also increases.
  • Cleaning the data from different sources helps to transform it into a form that data analysts can work with.
  1. What do you mean by Systematic Sampling?

Answer: Systematic Sampling is a statistical technique in which elements are selected from an ordered sampling frame at a regular interval. The list is traversed in a circular manner, so once you reach the end of the list, selection continues from the top again.
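To make the idea concrete, here is a minimal sketch in Python. The function name `systematic_sample`, the random starting point, and the 20-element frame are assumptions made for illustration only.

```python
import random

def systematic_sample(items, n, seed=None):
    """Pick n items at a fixed interval, treating the list as circular."""
    if not 0 < n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    k = len(items) // n                                  # sampling interval
    start = random.Random(seed).randrange(len(items))    # random starting point
    # Take every k-th element; the modulo wraps around to the top of the list.
    return [items[(start + i * k) % len(items)] for i in range(n)]

frame = list(range(1, 21))        # hypothetical ordered sampling frame
sample = systematic_sample(frame, 5, seed=0)
print(sample)
```

Because the interval is `len(items) // n`, the selected positions never repeat, even when the selection wraps around.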

  1. What do you think is the goal of A/B Testing?

Answer: A/B testing is a statistical hypothesis test for a randomized experiment with two variants, A and B.

The main aim of A/B Testing is to identify changes to a web page that increase an outcome of interest, such as the conversion rate.
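One common way to analyze such an experiment is a two-proportion z-test. The sketch below is only an illustration: the visitor and conversion counts are made up, and the function name is hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical data: 1000 visitors per variant, 100 vs 130 conversions.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(round(z, 2), round(p_value, 3))
```

A small p-value suggests that the difference between the two variants is unlikely to be due to chance alone.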

  1. Can you name the various kernel functions in SVM?

Answer: A few of the kernels in SVM are:

  • Radial basis
  • Sigmoid
  • Polynomial
  • Linear
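The four kernels listed above can be written as plain functions. This is a sketch using their usual textbook forms; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # Radial basis: depends only on the squared distance between the points.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```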
  1. Elucidate Decision Tree Algorithm in detail.

Answer: A decision tree is a supervised machine learning algorithm used for classification and regression. It breaks a data set down into smaller and smaller subsets while an associated decision tree is built incrementally.

It finally produces a tree which has leaf nodes and decision nodes.

  1. What is pruning in a Decision Tree?

Answer: The process of removing the sub-nodes of a decision node is called pruning. It can be seen as the opposite of splitting.

  1. What are recommender systems?

Answer: Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

These systems are widely used in movies, news, research articles, music, social tags, etc.

  1. What is Naïve Bayes?

Answer: The Naïve Bayes algorithm is based on Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event.
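As a small worked example of Bayes' Theorem itself, the sketch below updates a prior probability after seeing some evidence. The spam-filter numbers and the function name `posterior` are hypothetical.

```python
def posterior(prior, likelihood, false_alarm_rate):
    """P(event | evidence) computed with Bayes' Theorem.

    prior            = P(event)
    likelihood       = P(evidence | event)
    false_alarm_rate = P(evidence | no event)
    """
    evidence = likelihood * prior + false_alarm_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% of emails are spam; a keyword appears in 90% of
# spam emails and in 5% of legitimate emails.
p_spam_given_word = posterior(prior=0.01, likelihood=0.9, false_alarm_rate=0.05)
print(round(p_spam_given_word, 4))
```

Even with a strong signal, the posterior stays modest here because the prior probability of the event is low.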

  1. What do you mean by Deep Learning?

Answer: Deep learning is a subfield of machine learning built on artificial neural networks, which are inspired by the structure and function of the brain. Machine learning includes any number of algorithms, such as linear regression, neural networks, SVM, etc.

Ordinary neural nets have a small number of hidden layers, whereas deep learning algorithms use a large number of hidden layers to capture the input-output relationship.

  1. State the difference between deep learning and machine learning.

Answer: Machine learning:

Machine learning is a field of computer science that allows computers to learn without being explicitly programmed.

Machine learning (ML) can be categorized as follows:

  • Unsupervised ML
  • Supervised ML
  • Reinforcement learning

Deep Learning: It is a sub-field of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.

  1. Give some examples where a false positive holds more importance than a false negative.

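Answer: Differentiating false negatives and false positives:

False positives: A non-event wrongly classified as an event is known as a false positive. (This is also known as a Type 1 error.)

False negatives: An event wrongly classified as a non-event is known as a false negative. (This is also known as a Type 2 error.)

A false positive matters more when acting on a wrong alarm is costly. For example, in email spam filtering, a false positive sends a legitimate message to the spam folder, which is usually worse than letting one spam message through.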

  1. What do you mean by artificial neural networks?

Answer: Artificial neural networks are a specific set of algorithms that have revolutionized machine learning. Inspired by biological neural networks, they can adapt to changing input so that the network generates the best possible result without the need to redesign the output criteria.

  1. List the different Deep Learning frameworks.

Answer: The following are some of the different deep learning frameworks:

  • Caffe
  • Chainer
  • Microsoft Cognitive Toolkit
  • TensorFlow
  • Keras
  • PyTorch
  1. How does the ROC curve work?

Answer: The ROC curve is a graphical representation of the contrast between the true positive rate and the false positive rate at various classification thresholds.

It is often used as a proxy for the trade-off between the false positive rate and the true positive rate.
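The curve can be traced by lowering the decision threshold one example at a time and recording the two rates. The sketch below assumes binary labels (1 for positive, 0 for negative) and distinct scores; the data and the function names are made up for illustration, and the area-under-curve helper is an optional extra not discussed above.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit examples from the highest score to the lowest.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    # Trapezoidal rule over consecutive ROC points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                  # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]      # hypothetical classifier scores
points = roc_points(labels, scores)
print(points, area_under_curve(points))
```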

  1. Explain what you mean by regularization and how it is useful.

Answer: Regularization is the process of adding a tuning parameter to a model to encourage smoothness so that overfitting can be prevented.

Most of the time, this is done by adding a constant multiple of a norm of the existing weight vector to the loss function.

The norm used is often L1 (Lasso) or L2 (Ridge).
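A minimal sketch of a penalized loss is shown below. It assumes the penalty is added to some already-computed data loss; the name `lam` for the constant and the example weights are hypothetical.

```python
def penalized_loss(weights, data_loss, lam, norm="l2"):
    """Add a constant multiple of a weight-vector norm to the data loss."""
    if norm == "l1":
        penalty = sum(abs(w) for w in weights)    # Lasso-style penalty
    elif norm == "l2":
        penalty = sum(w * w for w in weights)     # Ridge-style penalty
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return data_loss + lam * penalty

weights = [0.5, -2.0, 1.5]                        # hypothetical model weights
l1_loss = penalized_loss(weights, data_loss=1.0, lam=0.1, norm="l1")
l2_loss = penalized_loss(weights, data_loss=1.0, lam=0.1, norm="l2")
print(l1_loss, l2_loss)
```

Larger weights raise the penalty, so minimizing this loss nudges the model toward smaller weights.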

  1. What is Normal Distribution?

Answer: Data can be distributed in different ways, with a bias to the left or to the right, or it can be all jumbled up. However, data can also be distributed around a central value without any bias to the left or right. In that case it follows a normal distribution, which takes the form of a bell-shaped curve.

  1. Elucidate Cross-validation.

Answer: Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set.

The aim of cross-validation is to set aside a data set to test the model during the training phase. Doing this limits problems like overfitting and gives insight into how the model will generalize to an independent data set.
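One widely used scheme is k-fold cross-validation, where each fold takes a turn as the held-out test set. The sketch below only generates the index splits; the function name and the choice of 10 samples and 3 folds are assumptions for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice a model would be trained on each `train` split and scored on the matching `test` split, and the scores averaged.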

  1. How should outlier values be treated?

Answer: We can identify outlier values by using any graphical analysis method. If there are only a few outliers, they can be assessed individually. If there are many, the values can be substituted with either the 1st or the 99th percentile value.
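Substituting extreme values with the 1st and 99th percentiles can be sketched as follows. This uses `statistics.quantiles` from the standard library with the "inclusive" method, which is one of several ways to define a percentile; the data set is made up.

```python
import statistics

def cap_outliers(values):
    """Replace values below the 1st or above the 99th percentile with those cut points."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[0], cuts[-1]
    return [min(max(v, low), high) for v in values]

data = list(range(1, 101)) + [10_000]     # hypothetical data with one extreme value
capped = cap_outliers(data)
print(min(capped), max(capped))
```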

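  1. What is gradient descent?

Answer: A gradient measures how much the output of a function changes when you change the input a little bit. We can also think of the gradient as the slope of a function. Gradient descent is an iterative optimization algorithm that repeatedly moves the parameters in the direction opposite to the gradient in order to minimize a function.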
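Here is a minimal sketch of gradient descent on a one-dimensional function. The function f(x) = (x - 3)^2, the learning rate, and the step count are assumptions picked so that the minimum at x = 3 is easy to check.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)
```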

  1. List the different variants of Backpropagation.

Answer: The different variants of Backpropagation, distinguished by how much data is used for each gradient update, are:

  • Batch gradient descent: We calculate the gradient for the whole data set and perform one update at each iteration.
  • Mini-batch gradient descent: One of the most popular optimization algorithms, it is a variant of Stochastic Gradient Descent that uses a mini-batch of samples instead of a single training example.
  • Stochastic gradient descent: We use a single training example to calculate the gradient and update all the parameters.
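The three variants differ only in how many examples feed each update, which the sketch below exposes as a `batch_size` parameter. The one-weight model y = w * x, the noise-free data, and all hyperparameters are assumptions chosen to keep the example short.

```python
import random

def train(xs, ys, batch_size, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w * x by minimizing squared error with the given batch size."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Mean gradient of (w * x - y)^2 with respect to w over the batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

xs = [0.1 * i for i in range(1, 11)]
ys = [2.0 * x for x in xs]                       # hypothetical data, true weight is 2

w_batch = train(xs, ys, batch_size=len(xs))      # batch gradient descent
w_mini = train(xs, ys, batch_size=4)             # mini-batch gradient descent
w_sgd = train(xs, ys, batch_size=1)              # stochastic gradient descent
print(w_batch, w_mini, w_sgd)
```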
  1. Define the role of Activation function.

Answer: The activation function introduces non-linearity into the neural network, which helps it learn more complex functions. Without it, the neural network would only be able to learn a linear function, that is, a linear combination of its input data.
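A tiny sketch shows why this matters: two stacked linear layers still behave like one linear function, while putting ReLU between them does not. The weights below are arbitrary, and ReLU is just one possible activation function.

```python
def linear(w, b):
    """Return a one-input linear layer x -> w * x + b."""
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

layer1 = linear(2.0, 1.0)     # hypothetical weights
layer2 = linear(-3.0, 0.5)

def no_activation(x):
    # Composes to -6x - 2.5, which is still a straight line.
    return layer2(layer1(x))

def with_relu(x):
    # ReLU flattens the output wherever layer1(x) is negative.
    return layer2(relu(layer1(x)))

for x in (-2.0, -1.0, 0.0, 1.0):
    print(x, no_activation(x), with_relu(x))
```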

  1. What do you mean by Autoencoder?

Answer: Autoencoders are simple learning networks that aim to transform inputs into outputs with the least possible error. This means that the output should be as close to the input as possible.

We add a couple of layers between the input and the output, and the sizes of these layers are smaller than the input layer.

  1. What do Boltzmann machines do?

Answer: Boltzmann machines have a simple learning algorithm that lets them discover interesting features that represent complex regularities in the training data.

  1. How do you differentiate supervised and unsupervised machine learning?

Answer: Supervised Machine Learning:

This kind of learning requires labeled training data.

Unsupervised Machine Learning:

This kind of learning doesn’t require labeled data.

  1. Why do you think it is necessary to clean a data set?

Answer: Cleaning data puts it into a format that data scientists can work with. It is crucial to clean data sets because uncleaned data may lead to biased information, which can alter business decisions. By some estimates, data scientists spend more than eighty percent of their time cleaning data.

  1. What are the assumptions made by data scientists in the case of linear regression?

Answer: Data scientists make several assumptions in the case of linear regression:

  • No auto-correlation
  • No or little multicollinearity
  • Linear Relationship
  • Multivariate normality
  1. Which programming language is used for text analytics?

Answer: One should choose Python as the programming language since it offers solid data analysis tools and simple data structures.

  1. What do you mean by Reinforcement Learning?

Answer: Reinforcement Learning is the process of learning what to do, that is, how to map situations to actions. The learner focuses on discovering the actions that yield the maximum reward. Inspired by the way human beings learn, Reinforcement Learning is based on a reward mechanism.

  1. What do you think are Recommender systems?

Answer: Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. These systems are widely used for research articles, products, music, etc.

  1. In ‘Naïve Bayes’, what is the significance of ‘Naïve’?

Answer: We call the algorithm ‘Naïve’ because it assumes that the features are independent of one another given the class, an assumption which may or may not turn out to be correct.

  1. What do you mean by the bias-variance trade-off?

Answer: Bias is the error introduced in the model by over-simplification of the machine learning algorithm.

Variance is the error introduced in the model by an overly complex machine learning algorithm: the model also learns the noise in the training data set and performs badly on the test data set.

The trade-off is to choose a model complexity that keeps both kinds of error low.

  1. What do you mean by confusion matrix?

Answer: The confusion matrix is a 2x2 table which contains the 4 outcomes produced by a binary classifier.

Various measures such as error rate, accuracy, sensitivity, precision and recall are derived from it.
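The sketch below counts the four outcomes and derives a few of the measures from them. The labels are made-up data, with 1 standing for the positive class.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

actual = [1, 1, 1, 0, 0, 0, 0, 1]        # hypothetical ground truth
predicted = [1, 0, 1, 0, 1, 0, 0, 1]     # hypothetical classifier output

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + fn + tn)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)                  # recall is another name for sensitivity
print(tp, fp, fn, tn, accuracy, precision, recall)
```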

  1. What do you mean by statistical interaction?

Answer: When the effect of one factor on the dependent variable differs among the levels of another factor, it is known as interaction.

  1. What are the supported data types in Python?

Answer: The supported data types in Python are:

  • Sequences
  • Numeric Types
  • Sets
  • Mappings
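A quick way to see these categories is to inspect a few built-in values. The grouping below follows the four categories listed above; the example values themselves are arbitrary.

```python
examples = {
    "numeric types": [42, 3.14, 2 + 3j],
    "sequences": ["text", [1, 2, 3], (1, 2, 3), range(3)],
    "sets": [{1, 2, 3}, frozenset({1, 2})],
    "mappings": [{"key": "value"}],
}

# Map each category to the names of the built-in types it contains.
type_names = {
    category: [type(value).__name__ for value in values]
    for category, values in examples.items()
}
for category, names in type_names.items():
    print(category, "->", names)
```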
  1. Which command is used to store R objects in a file?

Answer: save(x, file = "x.RData")

  1. Name the different types of sorting algorithms available in R language.

Answer: The different types of sorting algorithms that are available in R language are:

  • Selection sort
  • Bubble sort
  • Insertion sort
  1. Consider a case where a table contains duplicate rows. Would a query display the duplicate values by default? And how would you eliminate duplicate rows from a query result?

Answer: Yes, duplicate rows are displayed by default. They can be eliminated with the DISTINCT keyword, as in SELECT DISTINCT.
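In standard SQL, duplicate rows are removed from a result with the DISTINCT keyword. The sketch below tries this out with Python's built-in `sqlite3` module; the `visits` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")               # throwaway in-memory database
conn.execute("CREATE TABLE visits (city TEXT)")
conn.executemany("INSERT INTO visits VALUES (?)",
                 [("Pune",), ("Delhi",), ("Pune",)])

# A plain SELECT returns every row, duplicates included.
with_duplicates = conn.execute(
    "SELECT city FROM visits ORDER BY city").fetchall()

# SELECT DISTINCT collapses identical rows into one.
without_duplicates = conn.execute(
    "SELECT DISTINCT city FROM visits ORDER BY city").fetchall()
conn.close()

print(with_duplicates)
print(without_duplicates)
```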

  1. What is UNION used for?

Answer: UNION combines the result sets of two or more SELECT statements and removes duplicate records from the combined result.

  1. How would you differentiate ‘UNION’ and ‘UNION ALL’?

Answer: While UNION removes duplicate records, UNION ALL doesn’t.
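The difference is easy to see on two small tables that share one value. This sketch again uses Python's built-in `sqlite3` module; the table names `a` and `b` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")               # throwaway in-memory database
conn.execute("CREATE TABLE a (n INTEGER)")
conn.execute("CREATE TABLE b (n INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION keeps one copy of the shared value 2.
union_rows = conn.execute(
    "SELECT n FROM a UNION SELECT n FROM b ORDER BY n").fetchall()

# UNION ALL keeps every row from both tables.
union_all_rows = conn.execute(
    "SELECT n FROM a UNION ALL SELECT n FROM b ORDER BY n").fetchall()
conn.close()

print(union_rows)
print(union_all_rows)
```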

