Machine Learning Cheat Sheets
Q1: What is Data Science?
- Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. How is this different from what statisticians have been doing for years? The answer lies in the difference between explaining and predicting: statisticians work a posteriori, explaining results and designing a plan; data scientists use historical data to make predictions.
- Data Science
-
Q2: What are the assumptions required for linear regression?
There are four major assumptions:
- There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data,
- The errors or residuals of the data are normally distributed and independent from each other,
- There is minimal multicollinearity between explanatory variables, and
- Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.
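- A minimal sketch of how these assumptions might be checked in Python; the DataFrame, the column names x1, x2, y, and the choice of tests are illustrative assumptions, not part of the original answer:
```python
# Sketch: checking linear-regression assumptions on a hypothetical DataFrame.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.RandomState(0)
df = pd.DataFrame({"x1": rng.rand(100), "x2": rng.rand(100)})
df["y"] = 2 * df["x1"] + 3 * df["x2"] + rng.normal(scale=0.1, size=100)

X = sm.add_constant(df[["x1", "x2"]])
model = sm.OLS(df["y"], X).fit()

# 1. Normality of residuals (Shapiro-Wilk test)
print("Shapiro-Wilk p-value:", stats.shapiro(model.resid).pvalue)

# 2. Multicollinearity: variance inflation factor for each regressor
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))

# 3. Homoscedasticity: Breusch-Pagan test on the residuals
_, bp_pvalue, _, _ = het_breuschpagan(model.resid, X)
print("Breusch-Pagan p-value:", bp_pvalue)
```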
- Data Science
-
Q3: What is sampling? How many sampling methods do you know?
- Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined. It enables data scientists, predictive modelers and other data analysts to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly, while still producing accurate findings.
- Sampling can be particularly useful with data sets that are too large to efficiently analyze in full – for example, in big data analytics applications or surveys. Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.
- An important consideration, though, is the size of the required data sample and the possibility of introducing a sampling error. In some cases, a small sample can reveal the most important information about a data set. In others, using a larger sample can increase the likelihood of accurately representing the data as a whole, even though the increased size of the sample may impede ease of manipulation and interpretation.
- There are many different methods for drawing samples from data; the ideal one depends on the data set and situation. Sampling can be based on probability, an approach that uses random numbers that correspond to points in the data set to ensure that there is no correlation between points chosen for the sample.
- Sampling
-
Q4: What is a statistical interaction?
- Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor. When two or more independent variables are involved in a research design, there is more to consider than simply the "main effect" of each of the independent variables (also termed "factors"). That is, the effect of one independent variable on the dependent variable of interest may not be the same at all levels of the other independent variable. Another way to put this is that the effect of one independent variable may depend on the level of the other independent variable. In order to find an interaction, you must have a factorial design, in which the two (or more) independent variables are "crossed" with one another so that there are observations at every combination of levels of the two independent variables. Example: stress level and amount of practice when memorizing words; the effect of practice on performance may be smaller (or even reversed) at high stress levels.
- Data Science: Statistical Interaction
Q5: What is selection bias?
Selection (or ‘sampling’) bias occurs when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see.
That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.
- Selection bias is a kind of error that occurs when the researcher decides what has to be studied. It is associated with research where the selection of participants is not random. Therefore, some conclusions of the study may not be accurate.
The types of selection bias include:
- Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
- Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
- Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
- Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants), discounting trial subjects/tests that did not run to completion.
- Data Science: Selection Bias
-
Q6: What is an example of a data set with a non-Gaussian distribution?
- The Gaussian distribution is part of the Exponential family of distributions, but there are a lot more of them, with the same sort of ease of use, in many cases, and if the person doing the machine learning has a solid grounding in statistics, they can be utilized where appropriate.
- Binomial: multiple tosses of a coin, Bin(n, p). The binomial distribution consists of the probabilities of each of the possible numbers of successes on n trials for independent events that each have a probability p of occurring.
- Bernoulli: Bin(1,p) = Be(p)
- Poisson: Pois(λ)
- Data Science: data set with a non-Gaussian distribution
-
Q7: What is bias-variance trade-off?
- Bias: Bias is an error introduced in the model due to the oversimplification of the algorithm used (does not fit the data properly). It can lead to under-fitting.
Low bias machine learning algorithms — Decision Trees, k-NN and SVM
High bias machine learning algorithms — Linear Regression, Logistic Regression
- Variance: Variance is error introduced in the model due to an overly complex model; it performs very well on the training set but poorly on the test set. It can lead to high sensitivity to noise and overfitting.
Possible high variance – polynomial regression
- Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.
- Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
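- A small illustrative sketch of the trade-off using scikit-learn; the synthetic data and the specific polynomial degrees are assumptions for illustration, chosen so that training error keeps falling with complexity while cross-validated error eventually rises:
```python
# Sketch: bias-variance trade-off by varying model complexity (polynomial degree).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

for degree in (1, 4, 15):  # low, moderate, and high model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    train_mse = ((model.fit(X, y).predict(X) - y) ** 2).mean()
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")
```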
- Data Science: What is bias-variance trade-off?
-
Q8: What do you understand by the term Normal Distribution?
- A distribution is a function that shows the possible values for a variable and how often they occur. A Normal distribution, also known as a Gaussian distribution or The Bell Curve, is probably the most common distribution.
- Data is usually distributed in different ways with a bias to the left or to the right or it can all be jumbled up. However, there are chances that data is distributed around a central value without any bias to the left or right and reaches normal distribution in the form of a bell-shaped curve.
- Data Science: Normal Distribution
-
- The random variables are distributed in the form of a symmetrical, bell-shaped curve. Properties of Normal Distribution are as follows:
1- Unimodal (Only one mode)
2- Symmetrical (left and right halves are mirror images)
3- Bell-shaped (maximum height (mode) at the mean)
4- Mean, Mode, and Median are all located in the center
5- Asymptotic
Q9: What is correlation and covariance in statistics?
- Correlation measures how strongly two variables are related and is the standard technique for quantifying the relationship between them. Given two random variables, it is the covariance between them divided by the product of their standard deviations, so it always lies between -1 and 1.
-
- Covariance is a measure that indicates the extent to which two random variables change together. It describes the systematic relationship between a pair of random variables, wherein a change in one variable is accompanied by a corresponding change in the other. Unlike correlation, covariance is not normalized, so its magnitude depends on the units of the variables.
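- A minimal sketch contrasting the two on illustrative data (the sample values are assumptions):
```python
# Sketch: covariance vs. correlation for two illustrative variables.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]        # covariance (unit-dependent)
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation, always in [-1, 1]
print(cov_xy, corr_xy)

# Correlation is the covariance divided by the product of the standard deviations:
print(cov_xy / (x.std(ddof=1) * y.std(ddof=1)))  # equals corr_xy
```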
- Data Science: Correlation and covariance
-
Q10: What is the difference between Point Estimates and Confidence Interval?
- Point Estimation gives us a particular value as an estimate of a population parameter. Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.
- A confidence interval gives us a range of values which is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This probability is called the Confidence Level or Confidence Coefficient and is represented by 1 − α, where α is the level of significance.
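- A minimal sketch of a point estimate (the sample mean) versus a 95% confidence interval for the population mean, using a t-based interval; the sample values are illustrative assumptions:
```python
# Sketch: point estimate vs. 95% confidence interval for the mean.
import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4])

point_estimate = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=point_estimate, scale=sem)
print(point_estimate, (ci_low, ci_high))
```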
- Data Science: Point Estimates and Confidence Interval
Q11: What is the goal of A/B Testing?
- It is hypothesis testing for a randomized experiment with two variants, A and B.
The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. A/B testing is a fantastic method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads. An example of this could be identifying the click-through rate for a banner ad.
- Data Science: A/B Testing?
Q12: What is p-value?
- When you perform a hypothesis test in statistics, a p-value helps you determine the strength of your results. The p-value is the smallest significance level at which you could reject the null hypothesis. The lower the p-value, the stronger the evidence against the null hypothesis.
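- A minimal sketch of obtaining a p-value from a one-sample t-test; the sample values and the hypothesised mean of 5.0 are illustrative assumptions:
```python
# Sketch: p-value from a one-sample t-test.
# Null hypothesis: the population mean equals 5.0.
import numpy as np
from scipy import stats

sample = np.array([5.2, 5.4, 4.9, 5.6, 5.3, 5.5, 5.1, 5.7])
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```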
- Data Science: p-value
Q13: What do you understand by statistical power of sensitivity and how do you calculate it?
- Sensitivity (also called recall or the true positive rate) is commonly used to validate the performance of a classifier (Logistic Regression, SVM, Random Forest, etc.). It is the proportion of actual positives that are correctly identified: Sensitivity = TP / (TP + FN).
-
- Data Science: statistical power of sensitivity
Q14: What are the differences between over-fitting and under-fitting?
- In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on general untrained data.
- In overfitting, a statistical model describes random error or noise instead of the underlying relationship.
Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted, has poor predictive performance, as it overreacts to minor fluctuations in the training data.
- Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data.
Such a model too would have poor predictive performance.
- Data Science: Differences between over-fitting and under-fitting?
Q15: How to combat Overfitting and Underfitting?
To combat overfitting:
1. Add noise
2. Feature selection
3. Increase training set
4. L2 (ridge) or L1 (lasso) regularization; L1 can drive some weights exactly to zero (performing feature selection), while L2 only shrinks them toward zero
5. Use cross-validation techniques, such as k folds cross-validation
6. Boosting and bagging
7. Dropout technique
8. Perform early stopping
9. Reduce model complexity (e.g., remove inner layers or units in a neural network)
To combat underfitting:
1. Add features
2. Increase time of training
- Data Science: combat Overfitting and Underfitting
Q16: What is regularization? Why is it useful?
- Regularization is the process of adding a tuning parameter (penalty term) to a model to induce smoothness and prevent overfitting. This is most often done by adding a penalty on the weight vector: the L1 norm (Lasso, proportional to Σ|w|) or the squared L2 norm (Ridge, proportional to Σw²), scaled by a constant. The model is then trained to minimize the loss function calculated on the regularized objective.
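- A minimal sketch of L1 and L2 regularization with scikit-learn; the synthetic data and the alpha values are illustrative assumptions (in practice alpha is tuned, e.g. by cross-validation):
```python
# Sketch: Ridge (L2) and Lasso (L1) regularization in scikit-learn.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # can drive some weights exactly to zero
print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```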
- Data Science: Regularization
Q17: What Is the Law of Large Numbers?
- It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample means, the sample variance and the sample standard deviation converge to what they are trying to estimate. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
- Data Science: Law of Large Numbers?
Q18: What Are Confounding Variables?
- In statistics, a confounder is a variable that influences both the dependent variable and independent variable.
- If you are researching whether a lack of exercise leads to weight gain:
- weight gain = dependent variable
- lack of exercise = independent variable
- A confounding variable here would be any other variable that affects both of these variables, such as the age of the subject.
- Data Science: Confounding Variables
Q19: What is Survivorship Bias?
- It is the logical error of focusing on the aspects that survived some process and casually overlooking those that did not, because of their lack of prominence. This can lead to wrong conclusions in numerous ways. For example, during a recession you look only at the businesses that survived and note that they are performing poorly; however, they performed better than the rest, which failed and were therefore removed from the time series.
- Data Science: Survivorship Bias
Q20: Differentiate between univariate, bivariate and multivariate analysis.
- Univariate analyses are descriptive statistical analysis techniques that involve only one variable at a given point in time. For example, a pie chart of sales by territory involves only one variable, so the analysis can be referred to as univariate analysis.
- Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot. For example, analyzing the volume of sales together with spending can be considered an example of bivariate analysis.
- Multivariate analysis deals with the study of more than two variables to understand the effect of variables on the responses.
- Data Science: univariate, bivariate and multivariate analysis
-
Q21: What’s the difference between SAS, R, And Python Programming?
- SAS is one of the most popular analytics tools used by some of the biggest companies in the world. It has great statistical functions and graphical user interface. However, it is too pricey to be eagerly adopted by smaller enterprises or individuals.
- R, on the other hand, is a robust tool for statistical computation, graphical representation, and reporting. The best part about R is that it is an Open Source tool. As such, both academia and the research community use it generously and update it with the latest features for everybody to use.
- In comparison, Python is a powerful open-source programming language. It’s intuitive to learn and works well with most other tools and technologies. Python has a myriad of libraries and community created modules. Its functions include statistical operation, model building and many more. The best characteristic of Python is that it is a general-purpose programming language so it is not limited in any way.
- Data Science:
Q22: What is an example of a dataset with a non-Gaussian distribution?
- A Gaussian distribution is also known as ‘Normal distribution’ or ‘The Bell Curve’. For a distribution to be non-Gaussian, it shouldn’t follow the normal distribution. One of the main characteristics of the normal distribution is that it is symmetric around the mean, the median and the mode, which all fall on one point. Therefore, all we have to do is to select a distribution which is not symmetrical, and we will have our counterexample.
- One of the popular non-Gaussian instances is the distribution of household income in the USA. The distribution is strongly right-skewed, so the median (the 50th-percentile line) does not coincide with the mean. This pattern of inequality persists and has even deepened in the United States, which is why household income in the US is one of the most commonly quoted non-Gaussian distributions.
- Data Science: What is an example of a dataset with a non-Gaussian distribution
-
Q23: Explain Star Schema
- It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
- Data Science: Explain Star Schema
Q24: What is Cluster Sampling?
- Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
- For example, a researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters depending on his research through simple or systematic random sampling.
- Data Science: Cluster Sampling
Q25: What is Systematic Sampling?
- Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame at a regular interval (every kth element). The list is traversed in a circular manner, so once you reach the end of the list, you continue from the top again. Systematic sampling is an equal-probability method.
- Data Science: What is Systematic Sampling?
Q26: What are Eigenvectors and Eigenvalues?
- Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
- Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
- Data Science: What are Eigenvectors and Eigenvalues?
Q27: Give examples where a false positive is more important than a false negative.
- False Positives are the cases where you wrongly classified a non-event as an event a.k.a Type I error
- False Negatives are the cases where you wrongly classify events as non-events, a.k.a Type II error.
- Example 1: In the medical field, assume you have to give chemotherapy to patients. Assume a patient comes to that hospital and he is tested positive for cancer, based on the lab prediction but he actually doesn’t have cancer. This is a case of false positive. Here it is of utmost danger to start chemotherapy on this patient when he actually does not have cancer. In the absence of cancerous cell, chemotherapy will do certain damage to his normal healthy cells and might lead to severe diseases, even cancer.
- Example 2: Let’s say an e-commerce company decided to give $1000 Gift voucher to the customers whom they assume to purchase at least $10,000 worth of items. They send free voucher mail directly to 100 customers without any minimum purchase condition because they assume to make at least 20% profit on sold items above $10,000. Now the issue is if we send the $1000 gift vouchers to customers who have not actually purchased anything but are marked as having made $10,000 worth of purchases; every such false positive costs the company money.
- Data Science: Examples where a false positive is more important than a false negative
Q28: Give examples where both false positives and false negatives are equally important.
- In the Banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.
- Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.
- Data Science: Examples where both false positive and false negatives are equally important?
-
Q29: What is cross-validation?
- Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.
- Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
- It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
- The general procedure is as follows:
- 1. Shuffle the dataset randomly.
- 2. Split the dataset into k groups
- 3. For each unique group:
a. Take the group as a hold out or test data set
b. Take the remaining groups as a training data set
c. Fit a model on the training set and evaluate it on the test set
d. Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores
- Scikit-Learn also provides a stratified k-fold variant, in which the splits preserve the class proportions so that each fold contains a representative sample of every class; plain k-fold does not guarantee this, which can be a problem with a very imbalanced dataset.
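- A minimal sketch of the procedure with scikit-learn's KFold and StratifiedKFold; the synthetic, imbalanced dataset and the logistic-regression model are illustrative assumptions:
```python
# Sketch: k-fold vs. stratified k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

print("KFold          :", cross_val_score(model, X, y, cv=kf).mean())
print("StratifiedKFold:", cross_val_score(model, X, y, cv=skf).mean())
```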
- Data Science: What is cross-validation?
-
Q30: What is Machine Learning?
- Machine learning is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. You select a model to train and then manually perform feature extraction. Used to devise complex models and algorithms that lend themselves to a prediction which in commercial use is known as predictive analytics.
- Data Science: What is Machine Learning?
-
Q71: Why do we need one-hot encoding?
- One-hot encoding converts categorical values into binary indicator columns, which makes our training data usable by models that require numeric input without implying an artificial ordering between categories. It is also commonly used for categorical output labels, since it provides more nuanced predictions than single labels.
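- A minimal sketch using pandas; the `city` column and its values are illustrative assumptions:
```python
# Sketch: one-hot encoding a categorical column.
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris", "Nairobi"]})
one_hot = pd.get_dummies(df, columns=["city"])
print(one_hot)
# Each level of `city` becomes its own 0/1 indicator column, so no artificial
# ordering is implied between the categories.
```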
- Data Science: Why do we need one-hot encoding?
Q32: What is supervised machine learning?
- In supervised machine learning we provide labelled data: for example, predicting stock market prices from historical data, or classifying emails into spam and non-spam. In unsupervised learning the data is unlabelled: for example, clustering customers into segments without predefined categories.
- Data Science: What is supervised machine learning?
Q33: What is regression? Which models can you use to solve a regression problem?
- We use regression analysis when we are dealing with a continuous target, for example predicting stock prices at a certain point in time. Models that can solve a regression problem include linear regression, ridge and lasso regression, decision trees and random forests, gradient boosting, support vector regression, and neural networks.
- Data Science: What is regression? Which models can you use to solve a regression problem?
Q34: What is linear regression? When do we use it?
- Linear Regression is a supervised Machine Learning algorithm. It is used to find the linear relationship between the dependent and the independent variables for predictive analysis.
Linear regression assumes that the relationship between the features and the target vector is approximately linear. That is, the effect of the features on the target vector is constant.
- Data Science: What is linear regression? When do we use regression?
Q35: What’s the normal distribution? Why do we care about it?
- Data is usually distributed in different ways with a bias to the left or to the right or it can all be jumbled up. However, there are chances that data is distributed around a central value without any bias to the left or right and reaches normal distribution in the form of a bell-shaped curve.
- Data Science: What’s the normal distribution? Why do we care about it?
-
Q36: How do we check if a variable follows the normal distribution?
- The random variables are distributed in the form of a symmetrical, bell-shaped curve. Properties of Normal Distribution are as follows:
- Unimodal (Only one mode)
- Symmetrical (left and right halves are mirror images)
- Bell-shaped (maximum height (mode) at the mean)
- Mean, Mode, and Median are all located in the center
- Asymptotic
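- In practice, a quick check combines a visual inspection (histogram or Q-Q plot) with a statistical normality test. A minimal sketch with SciPy; the generated data are an illustrative assumption:
```python
# Sketch: checking whether a variable looks normally distributed.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(loc=10, scale=2, size=500)

print("Shapiro-Wilk p-value   :", stats.shapiro(x).pvalue)
print("D'Agostino K^2 p-value :", stats.normaltest(x).pvalue)
# A large p-value means there is no evidence against normality;
# a Q-Q plot (e.g., stats.probplot) gives a complementary visual check.
```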
- Data Science: How do we check if a variable follows the normal distribution?
Q37: What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?
-
- Data Science:What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?
Q38: What are the methods for solving linear regression do you know?
- The first approach is through the lens of minimizing loss. A common practice in machine learning is to choose a loss function that defines how well a model with a given set of parameters estimates the observed data. The most common loss function for linear regression is squared error loss.
The second approach is through the lens of maximizing the likelihood. Another common practice in machine learning is to model the target as a random variable whose distribution depends on one or more parameters, and then find the parameters that maximize its likelihood.
- Data Science: What are the methods for solving linear regression do you know?
Q39: What is gradient descent? How does it work?
- The gradient measures how much the error changes as each weight changes; it is the vector of partial derivatives of the loss function with respect to the weights w.
- Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function.
- Data Science: What is gradient descent? How does it work?
-
Q40: What is the normal equation?
- Normal equations are equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); normal equations allow one to estimate the parameters of a multiple linear regression.
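- A minimal sketch of solving the normal equations with NumPy; the synthetic data and true coefficients are illustrative assumptions:
```python
# Sketch: ordinary least squares via the normal equations
#   (X^T X) beta = X^T y   =>   beta = (X^T X)^{-1} X^T y
import numpy as np

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # more stable than an explicit inverse
print(beta_hat)  # close to [1, 2, -3]
```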
- Data Science: What is the normal equation?
Q41: What is SGD - stochastic gradient descent? What’s the difference with the usual gradient descent?
- In stochastic gradient descent, you'll evaluate only 1 training sample for the set of parameters before updating them. This is akin to taking small, quick steps toward the solution.
In standard gradient descent, you'll evaluate all training samples for each set of parameters.
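- A minimal sketch of SGD for linear regression with squared-error loss; the synthetic data, learning rate, and epoch count are illustrative assumptions:
```python
# Sketch: stochastic gradient descent, updating weights after each single sample.
import numpy as np

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + 1 feature
y = X @ np.array([0.5, 2.0]) + rng.normal(scale=0.1, size=200)

w = np.zeros(2)
lr = 0.01
for epoch in range(50):
    for i in rng.permutation(len(y)):   # visit samples in random order
        error = X[i] @ w - y[i]         # prediction error on one sample
        w -= lr * error * X[i]          # gradient of 0.5 * error^2 w.r.t. w
print(w)  # approaches [0.5, 2.0]
```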
- Data Science:What is SGD - stochastic gradient descent? What’s the difference with the usual gradient descent?
Q42: Which metrics for evaluating regression models do you know?
- A very naive way of evaluating a model is by considering the R-squared value alone. Suppose I get an R-squared of 95%: is that good enough? Here are ways to evaluate your regression model:
- Mean/Median of prediction
- Standard Deviation of prediction
- Range of prediction
- Coefficient of Determination (R2)
- Relative Standard Deviation/Coefficient of Variation (RSD)
- Relative Squared Error (RSE)
- Mean Absolute Error (MAE)
- Relative Absolute Error (RAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error on Prediction (RMSE/RMSEP)
- Normalized Root Mean Squared Error (Norm RMSEP)
- Relative Root Mean Squared Error (RRMSEP)
- Data Science: Which metrics for evaluating regression models do you know?
Q43: What are MSE and RMSE?
- MSE (mean squared error) is the average of the squared differences between predicted and actual values; RMSE (root mean squared error) is its square root, which puts the error back in the same units as the target. RMSE is a popular formula to measure the error rate of a regression model; however, it can only be compared between models whose errors are measured in the same units. Unlike RMSE, the relative squared error (RSE) can be compared between models whose errors are measured in different units.
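- A minimal sketch of computing both metrics; the true and predicted values are illustrative assumptions:
```python
# Sketch: MSE and RMSE for a set of predictions.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 6.5])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as the target variable
print(mse, rmse)
```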
- Data Science: What are MSE and RMSE?
Q44: What is overfitting?
- Overfitting is a situation that occurs when a model learns the training set too well, taking up random fluctuations in the training data as concepts. These impact the model’s ability to generalize and don’t apply to new data.
- A model may fit the training data almost perfectly (close to 100 percent accuracy, or only a slight loss), yet produce errors and low efficiency on the test data. This condition is known as overfitting.
- There are multiple ways of avoiding overfitting, such as:
- Regularization. It involves a cost term for the features involved with the objective function
- Making a simpler model. With fewer variables and parameters, the variance can be reduced
- Cross-validation methods like k-folds can also be used
- If some model parameters are likely to cause overfitting, techniques for regularization like LASSO can be used that penalize these parameters
- Data Science: What is overfitting?
Q45: How to validate your models?
-
- Data Science: How do you validate your models?
Q46: Why do we need to split our data into three parts: train, validation, and test?
- A training set to fit the parameters i.e. weights. A Validation set:
• part of the training set
• for parameter selection
• to avoid overfitting
- A Test set:
• for testing or evaluating the performance of a trained machine learning model, i.e. evaluating the predictive power and generalization.
- Data Science: Why do we need to split our data into three parts: train, validation, and test?
Q47: Can you explain how cross-validation works?
- Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.
- Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
- It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
- Data Science: Can you explain how cross-validation works?
Q48: What is K-fold cross-validation?
- A dataset is partitioned into k groups, where each group is given the opportunity of being used as a held out test set leaving the remaining groups as the training set. The k-fold cross-validation method specifically lends itself to use in the evaluation of predictive models that are repeatedly trained on one subset of the data and evaluated on a second held-out subset of the data.
- Data Science: What is K-fold cross-validation?
Q49: How do we choose K in K-fold cross-validation? What’s your favourite K?
- Common choices are k=5 or k=10, which balance the bias and variance of the performance estimate against computational cost; k=10 is a widely used default. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
- Data Science: How do we choose K in K-fold cross-validation? What’s your favourite K?
Q50: What happens to our linear regression model if we have three columns in our data: x, y, z - and z is a sum of x and y?
- If z is exactly the sum of x and y, the three columns are perfectly collinear (perfect multicollinearity). The matrix X^T X becomes singular, so the ordinary least squares solution is not unique: the individual coefficients cannot be identified, and implementations will either fail or return unstable, essentially arbitrary estimates.
- Data Science: What happens to our linear regression model if we have three columns in our data: x, y, z - and z is a sum of x and y?
Q51: What happens to our linear regression model if the column z in the data is a sum of columns x and y and some random noise?
- With z = x + y + noise, the collinearity is no longer perfect but still very strong. The model can usually be fit, but the coefficient estimates become unstable with large standard errors: small changes in the data can produce large changes in the individual weights, even though the overall predictions may remain reasonable.
- Data Science: What happens to our linear regression model if the column z in the data is a sum of columns x and y and some random noise?
Q52:What is regularization? Why do we need it?
- Regularization is the process of adding a tuning parameter (penalty term) to a model to induce smoothness and prevent overfitting. This is most often done by adding a penalty on the weight vector: the L1 norm (Lasso, proportional to Σ|w|) or the squared L2 norm (Ridge, proportional to Σw²), scaled by a constant. The model is then trained to minimize the loss function calculated on the regularized objective.
- Data Science: What is regularization? Why do we need it?
Q53: Which regularization techniques do you know?
- L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net (a combination of the two) for linear models; for neural networks, techniques such as dropout, early stopping, and data augmentation also act as regularization.
- Data Science: Which regularization techniques do you know?
Q54: What is classification? Which models would you use to solve a classification problem?
- Classification is used to produce discrete results: it assigns data to specific categories, for example classifying emails into spam and non-spam. Models that can solve a classification problem include logistic regression, decision trees and random forests, gradient boosting, support vector machines, naive Bayes, k-nearest neighbours, and neural networks.
- Classification produces discrete values and dataset to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points.
You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (ex: If you wanted to know whether a name was male or female rather than just how correlated they were with male and female names.)
- Data Science: What is classification? Which models would you use to solve a classification problem?
Q55: What is logistic regression? When do we need to use it?
- Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
- Data Science: What is logistic regression? When do we need to use it?
Q56: Is logistic regression a linear model? Why?
- Logistic regression is considered a generalized linear model because the outcome depends on a linear combination (a weighted sum) of the inputs and parameters; the output does not depend on products or quotients of the parameters. This linear combination is passed through the sigmoid (logistic) link function, and the resulting model is used for binary classification.
- Data Science: Is logistic regression a linear model? Why?
Q57: What is sigmoid? What does it do?
- The sigmoid function is a mathematical function having a characteristic "S"-shaped curve, which transforms values into the range between 0 and 1. The sigmoid function is also called the sigmoidal curve or logistic function. It is one of the most widely used non-linear activation functions.
- Sigmoid, ReLU, Tanh, and Softmax are examples of activation functions.
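- A minimal sketch of the function itself, sigma(z) = 1 / (1 + e^(-z)):
```python
# Sketch: the sigmoid (logistic) function.
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]
```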
- Data Science:What is sigmoid? What does it do?
Q58: How do we evaluate classification models?
- Common evaluation tools for classification models include accuracy, precision, recall, F1-score, the confusion matrix, log loss, and the ROC curve. AUC, the area under the ROC curve, is a common single-number performance metric for evaluating binary classification models.
- Data Science: How do we evaluate classification models?
Q59: What is accuracy?
- Accuracy is the number of correctly predicted data points out of all the data points. More formally, it is defined as the number of true positives and true negatives divided by the number of true positives, true negatives, false positives, and false negatives.
- Data Science: What is accuracy?
Q60: Is accuracy always a good metric?
- Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally: Accuracy = (number of correct predictions) / (total number of predictions).
- Model accuracy is a way of assessing the performance of a model, but certainly not the only way. It is not always a good metric: with highly imbalanced classes, a model can achieve high accuracy simply by always predicting the majority class while being useless for the minority class.
- Data Science: Is accuracy always a good metric?
Q61: What is the confusion table? What are the cells in this table?
- A confusion matrix is used to check the performance of a classification model on a set of test data for which the true values are known. Most performance measures such as precision, recall are calculated from the confusion matrix.
- Here are the four quadrants in a confusion matrix: True Positive (TP) is an outcome where the model correctly predicts the positive class. True Negative (TN) is an outcome where the model correctly predicts the negative class. False Positive (FP) is an outcome where the model incorrectly predicts the positive class. False Negative (FN) is an outcome where the model incorrectly predicts the negative class.
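- A minimal sketch of reading those four cells from a scikit-learn confusion matrix; the labels and predictions are illustrative assumptions:
```python
# Sketch: building a confusion matrix and reading off its four cells.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# sklearn returns [[TN, FP], [FN, TP]] for binary labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```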
- Data Science: What is the confusion table? What are the cells in this table?
Q62: What is precision, recall, and F1-score?
- Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned.
- A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0. The F1 score gives equal weight to both measures and is a specific example of the general Fβ metric where β can be adjusted to give more weight to either recall or precision.
- The F-score, also called the F1-score, is a measure of a model's accuracy on a dataset. It combines the precision and recall of the model and is defined as their harmonic mean: F1 = 2 * precision * recall / (precision + recall).
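- A minimal sketch computing the three metrics with scikit-learn, reusing the same illustrative labels as in the confusion-matrix sketch above:
```python
# Sketch: precision, recall, and F1 for illustrative predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # harmonic mean: 2 * p * r / (p + r)
print(p, r, f1)
```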
- Data Science: What is precision, recall, and F1-score?
Q63: What is Precision-recall trade-off
- The idea behind the precision-recall trade-off is that changing the threshold for deciding whether a class is positive or negative tilts the scales: raising the threshold typically increases precision and decreases recall, while lowering it does the opposite.
- Data Science: What is Precision-recall trade-off
Q64: What is the ROC curve? When to use it?
- The ROC (receiver operating characteristic) curve is a performance plot for binary classifiers showing the True Positive Rate (y-axis) against the False Positive Rate (x-axis) across classification thresholds.
- Data Science: What is the ROC curve? When to use it?
Q65: What is AUC (AU ROC)? When to use it?
- AUC is the area under the ROC curve, and it's a common performance metric for evaluating binary classification models.
- It's equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
- AUROC is robust to class imbalance, unlike raw accuracy.
For example, if you want to detect a type of cancer that's prevalent in only 1% of the population, you can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free.
- Data Science: What is AUC (AU ROC)? When to use it?
Q66: How to interpret the AU ROC score?
- An AUROC of 0.5 corresponds to random guessing (a coin flip), i.e. a useless model.
- An AUROC less than 0.7 is sub-optimal performance.
- An AUROC of 0.70 – 0.80 is good performance.
- An AUROC greater than 0.8 is excellent performance.
- Data Science: How to interpret the AU ROC score?
Q67: What is the PR (precision-recall) curve?
- A precision-recall curve (or PR Curve) is a plot of the precision (y-axis) and the recall (x-axis) for different probability thresholds. PR Curve: Plot of Recall (x) vs Precision (y)
- Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds
- Data Science: What is the PR (precision-recall) curve?
Q68: What is the area under the PR curve? Is it a useful metric?
- AUC-PR stands for area under the (precision-recall) curve. Generally, the higher the AUC-PR score, the better a classifier performs for the given task. One way to calculate AUC-PR is to find the AP, or average precision.
- Data Science: What is the area under the PR curve? Is it a useful metric?
Q69: In which cases AU PR is better than AU ROC?
- If one method is better in AU-ROC but worse in AU-PR, it tends to do better on recall but worse on precision, so you would prefer it when high recall matters most; if a method is better in AU-PR but worse in AU-ROC, the reverse holds. In general, AU-PR is the more informative metric when the classes are heavily imbalanced and the positive (minority) class is the main focus.
- Data Science: In which cases AU PR is better than AU ROC?
Q70: What do we do with categorical variables?
- Categorical variables can hide and mask lots of interesting information in a data set, so it is crucial to learn the methods for dealing with them; otherwise you may miss out on finding the most important variables in a model.
- Categorical variables can be used to represent different types of qualitative data. For example: Ordinal data - represents outcomes for which the order of the groups is relevant. Nominal data - represent outcomes for which the order of groups does not matter.
- Data Science: What do we do with categorical variables?
Q72: What kind of regularization techniques are applicable to linear models?
-
- Data Science: What kind of regularization techniques are applicable to linear models?
Q73: How does L2 regularization look like in a linear model?
-
- Data Science: How does L2 regularization look like in a linear model?
Q74:How do we select the right regularization parameters?
-
- Data Science:How do we select the right regularization parameters?
Q75:What’s the effect of L2 regularization on the weights of a linear model?
-
- Data Science: What’s the effect of L2 regularization on the weights of a linear model?
Q76: How L1 regularization looks like in a linear model?
-
- Data Science:How L1 regularization looks like in a linear model?
Q77: What’s the difference between L2 and L1 regularization?
-
- Data Science: What’s the difference between L2 and L1 regularization?
Q78: Can we have both L1 and L2 regularization components in a linear model?
-
- Data Science: Can we have both L1 and L2 regularization components in a linear model?
Q79: What’s the interpretation of the bias term in linear models?
-
- Data Science: What’s the interpretation of the bias term in linear models?
Q80: How do we interpret weights in linear models?
-
- Data Science: How do we interpret weights in linear models?
Q81:
-
- Data Science:
Q82: When do we need to perform feature normalization for linear models? When it’s okay not to do it?
-
- Data Science: When do we need to perform feature normalization for linear models? When it’s okay not to do it?
Q83: What is feature selection? Why do we need it?
-
- Data Science: What is feature selection? Why do we need it?
Q84: Is feature selection important for linear models?
-
- Data Science: Is feature selection important for linear models?
Q85: Which feature selection techniques do you know?
-
- Data Science: Which feature selection techniques do you know?
Q86:Can we use L1 regularization for feature selection?
-
- Data Science: Can we use L1 regularization for feature selection?
Q87: Can we use L2 regularization for feature selection?
-
- Data Science: Can we use L2 regularization for feature selection?
Q88: What are the decision trees?
- Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
- Data Science: What are the decision trees?
Q89: How do we train decision trees?
-
- Data Science: How do we train decision trees?
Q90: What are the main parameters of the decision tree model?
- The main parameters of a decision tree model include the maximum depth of the tree (the depth is the maximum distance between the root and any leaf), the minimum number of samples required to split a node or to form a leaf, the maximum number of leaves, and the splitting criterion (e.g., Gini impurity or entropy for classification).
- Data Science: What are the main parameters of the decision tree model?
Q91: How do we handle categorical variables in decision trees?
- To deal with categorical variables that have more than two levels, the solution is one-hot encoding. This takes every level of the category (e.g., Dutch, German, Belgian, and other), and turns it into a variable with two levels (yes/no).
- Data Science: How do we handle categorical variables in decision trees?
Q92: What are the benefits of a single decision tree compared to more complex models?
-
- Data Science: What are the benefits of a single decision tree compared to more complex models?
Q93: How can we know which features are more important for the decision tree model?
- Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.
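- A minimal sketch of reading impurity-based importances from a fitted scikit-learn tree; the synthetic dataset and max_depth are illustrative assumptions:
```python
# Sketch: impurity-based feature importances of a decision tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
for i, imp in enumerate(tree.feature_importances_):
    print(f"feature {i}: importance={imp:.3f}")
```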
- Data Science: How can we know which features are more important for the decision tree model?
Q94: What is random forest?
- Random forest is a supervised machine learning algorithm that is used widely in classification and regression problems. It builds decision trees on different samples and takes their majority vote for classification and their average in the case of regression. It generally performs well on classification problems.
- Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction.
- Data Science: What is random forest?
Q95: Why do we need randomization in random forest?
- Randomization (bootstrap sampling of the training data and random feature selection at each split) decorrelates the individual trees; averaging many decorrelated trees reduces the variance of the ensemble without substantially increasing its bias.
- Data Science: Why do we need randomization in random forest?
Q96: What are the main parameters of the random forest model?
-
- Data Science: What are the main parameters of the random forest model?
Q97: How do we select the depth of the trees in random forest?
-
- Data Science: How do we select the depth of the trees in random forest?
Q98: How do we know how many trees we need in random forest?
-
- Data Science: How do we know how many trees we need in random forest?
Q99: Is it easy to parallelize training of random forest? How can we do it?
-
- Data Science: Is it easy to parallelize training of random forest? How can we do it?
Q100: What are the potential problems with many large trees?
-
- Data Science: What are the potential problems with many large trees?
Q101: What if instead of finding the best split, we randomly select a few splits and just select the best from them. Will it work?
-
- Data Science: What if instead of finding the best split, we randomly select a few splits and just select the best from them. Will it work?
Q102: R has several packages for solving a particular problem. How do you decide which one is best to use?
- R has extensive documentation online. There is usually a comprehensive guide for the use of popular packages in R, including the analysis of concrete data sets. These can be useful to find out which approach is best suited to solve the problem at hand.
- Just like with any other script language, it is the responsibility of the data scientist to choose the best approach to solve the problem at hand. The choice usually depends on the problem itself or the specific nature of the data (i.e., size of the data set, the type of values and so on).
- Something to consider is the tradeoff between how much work the package is saving you, and how much of the functionality you are sacrificing.
- It bears also mentioning that because packages come with limitations, as well as benefits, if you are working in a team and sharing your code, it might be wise to assimilate to a shared package culture.
- Data Science: R has several packages for solving a particular problem. How do you decide which one is best to use?
Q103: What are interpolation and extrapolation?
- Interpolation and extrapolation are two very similar concepts. They both refer to predicting or determining new values based on some sample information.
- There is one subtle difference, though. Say the range of values we’ve got is in the interval (a, b). If the values we are predicting are inside the interval (a, b), we are talking about interpolation (inter = between). If the values we are predicting are outside the interval (a, b), we are talking about extrapolation (extra = outside).
- Here’s one example. Imagine you’ve got the number sequence: 2, 4, _, 8, 10, 12. What is the number in the blank spot? It is obviously 6. By solving this problem, you interpolated the value.
- Now, with this knowledge, you know the sequence is 2, 4, 6, 8, 10, 12. What is the next value in line? 14, right? Well, we have extrapolated the next number in the sequence.
- Whenever we are doing predictive modeling you will be trying to predict values – that’s no surprise. Interpolated values are generally considered reliable, while extrapolated ones – less reliable or sometimes invalid. For instance, in the sequence from above: 2, 4, 6, 8, 10, 12, you may want to extrapolate a number before 2. Normally, you’d go for ‘0’. However, the natural domain of your problem may be positive numbers. In that case, 0 would be an inadmissible answer.
- In fact, often we are faced with issues where extrapolation may not be permitted because the pattern doesn’t hold outside the observed range, or the domain of the event is the observed domain. It is extremely rare to find cases where interpolation is problematic.
- Data Science: What are interpolation and extrapolation?
Q104: What is the difference between population and sample in data?
- A population is the collection of all items of interest to our study and is usually denoted with an uppercase N. The numbers we’ve obtained when using a population are called parameters.
- A sample is a subset of the population and is denoted with a lowercase n, and the numbers we’ve obtained when working with a sample are called statistics.
- That’s more or less what you are expected to say. Further, you can spend some time exploring the peculiarities of observing a population. Conversely, it is likely that you’ll be asked to dig deeper into why in statistics we work with samples and what types of samples are there.
- In general, samples are much more efficient and much less expensive to work with. With the proper statistical tests, 30 sample observations may be enough for you to take a data-driven decision.
- Finally, samples have two properties: randomness and representativeness. A sample can be one of those, both, or neither. To conduct statistical tests, whose results you can use later on, your sample needs to be both random and representative.
-
- Data Science: What is the difference between population and sample in data?
Q105: What are the steps in making a decision tree?
- First, a decision tree is a flow-chart diagram. It is extremely easy to read, understand and apply to many different problems. There are 4 steps that are important when building a decision tree.
1- Start the tree. In other words, find the starting state – maybe a question or idea, depending on your context.
2- Add branches. Once you have a question or an idea, it branches out into 1, 2, or many different branches.
3- Add the leaves. Each branch ends with a leaf. The leaf is the state which you will reach once you have followed a branch.
4- Repeat 2 and 3. We then repeat steps 2 and 3, where the starting points are the leaves, until we finish off the tree. In other words, every question and possible outcome should be included.
- Depending on the context you may be expected to add additional steps like: complete the tree, terminate a branch, verify with your team, code it, deploy it, etc.
- However, these 4 steps are the main ones in creating a decision tree.
- Data Science: What are the steps in making a decision tree?