20 Most Popular Data Science Interview Questions
Harvard Business Review referred to it as “The Sexiest Job of the 21st Century.” Glassdoor placed it in the first position on the 25 Best Jobs in America list. According to Forbes, this job stands at number 3 on the list of the “10 Toughest Jobs to Fill in 2016.”
Yes, it’s simply impossible to ignore the importance of data, and our capacity to analyze, consolidate, and contextualize it. Data science is without a doubt a thriving field. And with the explosion of big data and the need to track it, employers keep on hiring data scientists.
However, there is a serious dearth of qualified candidates.
Hone yourself to be the perfect candidate with these 20 most popular interview questions.
1. What are feature vectors?
- An n-dimensional vector of numerical features that represents some object
- Examples: term occurrence frequencies, the pixels of an image, etc.
- Feature space: vector space associated with these vectors
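As a minimal sketch, a bag-of-words feature vector over a fixed vocabulary (the vocabulary and document below are made up for illustration):

```python
# Build a term-frequency feature vector for a short document over a fixed
# vocabulary; the vocabulary defines the feature space.
vocabulary = ["data", "science", "model", "cloud"]

def feature_vector(text, vocab):
    """Map a document to an n-dimensional vector of term counts."""
    words = text.lower().split()
    return [words.count(term) for term in vocab]

doc = "Data science uses data to build a model"
print(feature_vector(doc, vocabulary))  # one numeric feature per vocabulary term
```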
2. Explain the steps in making a decision tree.
- Take the entire data set as input
- Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets
- Apply the split to the input data (divide step)
- Re-apply the split-and-divide steps to each resulting subset
- Stop when you meet some stopping criteria
- The final step is pruning: clean up the tree if you have gone too far making splits
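The split-search step above can be sketched as follows; the one-dimensional toy data and the Gini impurity criterion are illustrative choices, not the only possible ones:

```python
# For a 1-D dataset of (value, label) pairs, find the threshold that best
# separates the classes by minimizing the weighted Gini impurity.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(points):
    """Return the threshold whose two-way split minimizes weighted Gini."""
    best_t, best_score = None, float("inf")
    for t, _ in points:
        left = [lab for v, lab in points if v <= t]
        right = [lab for v, lab in points if v > t]
        n = len(points)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

data = [(1, "A"), (2, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "B")]
print(best_split(data))  # 3: this split separates the two classes perfectly
```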
3. What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents, but is now widely used in other areas. It is basically a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
4. What is logistic regression?
Logistic regression is also referred to as the logit model. It is a technique for forecasting a binary outcome from a linear combination of predictor variables.
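A minimal sketch of the prediction step, assuming hypothetical, made-up fitted coefficients: the linear combination of predictors is squashed into a probability by the logistic (sigmoid) function.

```python
import math

def predict_proba(x, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias  # linear combination
    return 1.0 / (1.0 + math.exp(-z))                    # sigmoid: squash to (0, 1)

weights, bias = [0.8, -0.4], -0.2   # hypothetical fitted parameters
p = predict_proba([2.0, 1.0], weights, bias)
print(round(p, 3))  # probability that the binary outcome is 1
```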
5. What are Recommender Systems?
Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.
6. Explain cross-validation.
It is a model validation technique for evaluating how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (the validation set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
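A sketch of the fold-splitting at the heart of k-fold cross-validation: each data point is held out exactly once as validation data while the model trains on the rest.

```python
# Generate (train, validation) index splits for k-fold cross-validation.

def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) pairs for k folds of n points."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i not in val]
        yield train, val
        start += size

for train, val in k_fold_indices(6, 3):
    print(train, val)
# each of the 6 points appears in exactly one validation fold
```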
7. What is Collaborative filtering?
It is the process of filtering used by most recommender systems to find patterns or information through collaboration among multiple viewpoints, data sources, and agents.
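A minimal user-based collaborative-filtering sketch: recommend to a user the items rated by the most similar other user. The ratings table and item names below are made up for illustration.

```python
import math

ratings = {
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 5, "item2": 3, "item4": 5},
    "carol": {"item1": 1, "item2": 5, "item4": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    """Suggest unseen items from the most similar other user."""
    others = [(cosine_similarity(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return sorted(set(ratings[nearest]) - set(ratings[user]))

print(recommend("alice"))  # items alice has not yet rated
```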
8. Do gradient descent methods always converge to the same point?
No. In some cases they reach a local minimum or local optimum rather than the global optimum; which point they reach depends on the data and the starting conditions.
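This can be demonstrated on a made-up non-convex function with two minima: the same gradient descent procedure, started from two different points, converges to two different solutions.

```python
# f(x) = (x^2 - 1)^2 + 0.1*x has two local minima, near x = -1 and x = +1.
# Gradient descent from different starting points reaches different minima.

def grad(x):
    return 4 * x * (x ** 2 - 1) + 0.1  # derivative of f

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)
right = descend(2.0)
print(round(left, 2), round(right, 2))  # two different convergence points
```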
9. What is the goal of A/B Testing?
It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The objective of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest.
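One common way to analyze such an experiment is a two-proportion z-test on conversion rates; the visitor and conversion counts below are made up for illustration.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(200, 2000, 260, 2000)   # 10% vs 13% conversion
print(round(z, 2), round(p, 4))
```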
10. What are the drawbacks of linear model?
Some drawbacks of the linear model are:
- It assumes a linear relationship between the predictors and the outcome
- It can’t be used for count or binary outcomes
- There are overfitting problems that it can’t solve
11. What is the Law of Large Numbers?
It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample mean, the sample variance and the sample standard deviation converge to what they are trying to estimate.
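A quick simulation of the theorem with fair coin flips (the seed is arbitrary): the sample mean approaches the true probability of 0.5 as the sample grows.

```python
import random

random.seed(42)

def sample_mean(n):
    """Fraction of heads in n simulated fair coin flips."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, sample_mean(n))
# the deviation from 0.5 shrinks as n increases
```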
12. What are confounding variables?
These are extraneous variables in a statistical model that correlate directly or inversely with both the dependent and the independent variable. The estimate fails to account for the confounding factor.
13. Explain star schema.
It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables, and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
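As a toy illustration of the idea (the tables and values are made up): the fact table stores compact IDs, and lookup tables resolve them to descriptions at query time.

```python
# Lookup (dimension) tables: map compact IDs to descriptions.
product_lookup = {1: "laptop", 2: "phone"}
region_lookup = {10: "EMEA", 20: "APAC"}

# Central fact table: stores only IDs plus measures, saving memory.
fact_sales = [  # (product_id, region_id, units)
    (1, 10, 5),
    (2, 20, 3),
]

# Join the fact table to its lookup tables through the ID fields.
report = [(product_lookup[p], region_lookup[r], units)
          for p, r, units in fact_sales]
print(report)
```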
14. How regularly must an algorithm be updated?
You want to update an algorithm when:
- You want the model to evolve as data streams through infrastructure
- The underlying data source is changing
- There is a case of non-stationarity
15. What are eigenvalues and eigenvectors?
Eigenvectors are for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; the corresponding eigenvalue is the scalar factor by which the transformation scales that direction.
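For a symmetric 2×2 matrix the eigenvalues follow directly from the quadratic formula, which makes for a compact worked example; the covariance values below are made up.

```python
import math

# Covariance matrix [[a, b], [b, c]] = [[2, 1], [1, 2]].
a, b, c = 2.0, 1.0, 2.0
mean = (a + c) / 2
d = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
eigenvalues = (mean + d, mean - d)
print(eigenvalues)  # (3.0, 1.0)

# Check the eigenvector (1, 1) for eigenvalue 3: A @ v == 3 * v,
# i.e. the transformation only stretches this direction, by a factor of 3.
v = (1.0, 1.0)
Av = (a * v[0] + b * v[1], b * v[0] + c * v[1])
print(Av)  # (3.0, 3.0)
```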
16. Why is resampling done?
Resampling is done in one of these cases:
- Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
- Substituting labels on data points when performing significance tests
- Validating models by using random subsets (bootstrapping, cross-validation)
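The first case can be sketched with the bootstrap: estimating the standard error of the sample mean by drawing randomly with replacement. The data and seed are made up for illustration.

```python
import random

random.seed(7)
data = [2.1, 2.5, 2.2, 3.0, 2.8, 2.4, 2.6, 2.9]

def bootstrap_se(data, n_resamples=1000):
    """Bootstrap estimate of the standard error of the sample mean."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]  # draw with replacement
        means.append(sum(resample) / len(resample))
    grand = sum(means) / len(means)
    return (sum((m - grand) ** 2 for m in means) / (len(means) - 1)) ** 0.5

print(round(bootstrap_se(data), 3))  # spread of the resampled means
```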
17. Explain selection bias.
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.
18. What are the types of biases that can occur during sampling?
- Selection bias
- Undercoverage bias
- Survivorship bias
19. Explain survivorship bias.
It is the logical error of focusing on the aspects that survived some process while overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.
20. How do you build a random forest?
The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:
- Build several decision trees on bootstrapped training samples of data
- On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
- Rule of thumb: at each split, m = √p
- Predictions: by majority rule
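The steps above can be sketched with a simplified "forest" of one-split trees (decision stumps), each fit on a bootstrap sample, with predictions combined by majority vote. The toy 1-D data are made up; with a single feature there is nothing to subsample, so the √p feature-sampling step is omitted here.

```python
import random

random.seed(1)
data = [(1, "A"), (2, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "B")]

def fit_stump(points):
    """Pick the threshold and side-labels that best separate two classes."""
    best_t, best_acc, best_labels = None, -1.0, None
    for t, _ in points:
        left = [lab for v, lab in points if v <= t]
        right = [lab for v, lab in points if v > t]
        for l_lab, r_lab in (("A", "B"), ("B", "A")):
            acc = (left.count(l_lab) + right.count(r_lab)) / len(points)
            if acc > best_acc:
                best_t, best_acc, best_labels = t, acc, (l_lab, r_lab)
    return best_t, best_labels

def forest_predict(x, stumps):
    """Majority vote over all stumps in the forest."""
    votes = [(l if x <= t else r) for t, (l, r) in stumps]
    return max(set(votes), key=votes.count)

stumps = []
for _ in range(25):
    boot = [random.choice(data) for _ in data]   # bootstrap sample
    stumps.append(fit_stump(boot))

print(forest_predict(2, stumps), forest_predict(8, stumps))
```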