CDSP Exam Set 1 (New) Quiz

1. Which of the following best describes the first step in a data science project?

A: Selecting the appropriate machine learning algorithm
B: Identifying the project scope, objectives, and stakeholder requirements
C: Performing exploratory data analysis
D: Creating a predictive model

2. What is the purpose of a Proof of Concept (POC) in a data science project?

A: To demonstrate the full functionality of a final solution
B: To show the feasibility of the proposed solution
C: To create a minimum viable product (MVP) ready for deployment
D: To scale a data science model for production use

3. Which of the following scenarios would best classify as a regression problem in data science?

A: Predicting whether a customer will purchase a product
B: Grouping customers into similar segments based on behavior
C: Estimating the future sales revenue for a product line
D: Identifying the most frequent customer complaint categories

4. What is the most effective way to avoid scope creep in a data science project?

A: Adding new features based on stakeholder feedback
B: Constantly revising the project scope to meet evolving needs
C: Clearly defining objectives and deliverables from the outset
D: Ensuring team members work on multiple tasks simultaneously

5. Which of the following data privacy regulations mandates strict guidelines for handling personal data in the European Union?

A: HIPAA
B: CCPA
C: PCI DSS
D: GDPR

6. Which of the following best defines a minimum viable product (MVP) in the context of data science?

A: A fully developed product ready for large-scale deployment
B: A basic version of a product with just enough features for early adopters to use
C: A test version of the product created for internal testing
D: A proof of concept designed to evaluate a complex machine learning model

7. What is a common challenge when identifying stakeholder requirements for a data science project?

A: Ensuring that the data is in a clean and usable format
B: Translating business objectives into measurable data science goals
C: Deciding on the type of machine learning model to use
D: Developing a working prototype within a limited timeframe

8. Which of the following methods is used to remove duplicate rows from a pandas DataFrame in Python?

A: df.remove_duplicates()
B: df.drop_duplicates()
C: df.delete_duplicates()
D: df.filter_duplicates()
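
As a minimal sketch of row de-duplication in pandas, the snippet below builds a small DataFrame with one repeated row and removes it. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: the second and third rows are exact duplicates.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Oslo", "Lima", "Lima", "Pune"],
})

# drop_duplicates() keeps the first occurrence of each repeated row by default.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Passing subset=["customer_id"] would compare only that column when deciding which rows count as duplicates.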

9. When loading a dataset from a CSV file into a pandas DataFrame in Python, which of the following commands is used?

A: pd.load_csv('file.csv')
B: pd.read_csv('file.csv')
C: pd.open_csv('file.csv')
D: pd.import_csv('file.csv')

10. Which of the following describes an effective way to handle missing data in a dataset?

A: Remove all rows with missing values
B: Impute missing values with the median of the column
C: Leave the missing values as they are
D: Remove the entire dataset

11. What is the most important consideration when extracting data from multiple sources for analysis?

A: Ensuring all data is in the same file format
B: Making sure the sources are from different time periods
C: Ensuring there is a common key for joining datasets
D: Loading the data into a database before cleaning it

12. Which Python library is commonly used to interact with SQL databases to extract data for analysis?

A: NumPy
B: Matplotlib
C: Pandas
D: SQLAlchemy

13. When merging two datasets in Python using pandas, which method should be used to perform a left join?

A: pd.merge(df1, df2, how='inner')
B: pd.merge(df1, df2, how='outer')
C: pd.merge(df1, df2, how='left')
D: pd.merge(df1, df2, how='right')
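
The join types listed above differ in which rows survive the merge. The sketch below uses two small, made-up tables (customers and orders are hypothetical names) to show what how='left' does: every row of the left table is kept, and unmatched rows get missing values in the right table's columns.

```python
import pandas as pd

# Hypothetical tables sharing the key column "customer_id".
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Chi"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20.0, 35.0, 15.0]})

# A left join keeps all customers, including customer 2, who has no orders.
merged = pd.merge(customers, orders, on="customer_id", how="left")
print(merged)
```

Customer 1 appears twice in the result because two orders match, and customer 2 appears once with a missing amount.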

14. In the context of data extraction, which of the following is the primary role of an API (Application Programming Interface)?

A: Visualizing extracted data
B: Automating the data cleaning process
C: Allowing access to data from external services or systems
D: Normalizing data across different formats

15. Which of the following techniques is most suitable for handling missing values in categorical data?

A: Dropping rows with missing values
B: Imputing with the column's mode
C: Imputing with the column's median
D: Filling with random values

16. What is the main advantage of using cloud storage solutions like AWS S3 or Azure Data Lake for data extraction and loading?

A: They automatically clean the data
B: They allow for scalable and flexible data storage solutions
C: They remove the need for data transformation
D: They ensure data security without further configuration

17. Which of the following commands can be used to load data from an SQL database into a pandas DataFrame in Python?

A: pd.read_sql(query, connection)
B: pd.load_sql(query, connection)
C: pd.import_sql(query, connection)
D: pd.open_sql(query, connection)

18. Which of the following techniques is used to handle imbalanced datasets in machine learning?

A: Cross-validation
B: Random over-sampling of the minority class
C: One-hot encoding
D: Feature selection
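
Random over-sampling can be sketched with the standard library alone. The helper below, random_oversample, is a hypothetical function written for this illustration (it is not a library API); it duplicates randomly chosen minority-class rows until every class is as large as the biggest one.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Candidate rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

# Made-up data: four samples of class 0 and one sample of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Over-sampling is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.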

19. In the context of data transformation, what is one purpose of feature scaling?

A: To reduce the number of features in the dataset
B: To ensure all features contribute equally to the model
C: To remove outliers from the dataset
D: To normalize categorical variables

20. Which Python library is primarily used for scraping web data to be used in data science projects?

A: pandas
B: matplotlib
C: BeautifulSoup
D: SQLAlchemy

21. When dealing with a dataset that has a significant number of outliers, which of the following is an effective way to mitigate their impact?

A: Normalizing the dataset
B: Imputing outliers with the median value
C: Using robust statistics or transformations such as log transformations
D: Dropping all rows with outliers

22. What is the primary goal of exploratory data analysis (EDA)?

A: To create predictive models based on the data
B: To gain insights into the underlying structure and patterns of the data
C: To clean and transform the data before loading it into a database
D: To validate the results of machine learning algorithms

23. Which of the following is used to visualize the distribution of a single numerical variable in a dataset?

A: Scatter plot
B: Histogram
C: Line chart
D: Bar chart

24. Which of the following is a measure of the central tendency of a dataset?

A: Standard deviation
B: Variance
C: Mean
D: Interquartile range

25. Which of the following is an effective way to detect outliers in a dataset?

A: Using a histogram to visualize the data distribution
B: Normalizing the data
C: Applying one-hot encoding to categorical variables
D: Creating a line chart of the data

26. What is the purpose of feature engineering in exploratory data analysis?

A: To perform data cleaning and transformation
B: To create new features based on existing data to improve model performance
C: To visualize the relationship between features and target variables
D: To split the dataset into training and testing sets

27. Which of the following methods is used to assess correlations between numerical features in a dataset?

A: Principal Component Analysis (PCA)
B: Pearson correlation coefficient
C: Clustering
D: Normalization

28. In Python, which pandas function can be used to generate summary statistics for a DataFrame?

A: df.describe()
B: df.info()
C: df.stats()
D: df.summary()

29. What is the effect of normalizing data during EDA?

A: It reduces the dimensionality of the dataset
B: It transforms all numerical features to a fixed range
C: It removes missing values from the dataset
D: It converts categorical variables into numerical format

30. When exploring a dataset, what is the purpose of generating a scatter plot?

A: To display the relationship between two numerical variables
B: To show the distribution of a single variable
C: To visualize categorical data
D: To summarize the dataset

31. Which of the following indicates a strong positive linear relationship between two variables?

A: Correlation coefficient of 0.9
B: Correlation coefficient of 0
C: Correlation coefficient of -0.9
D: Correlation coefficient of -1
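
To make the sign and size of a correlation coefficient concrete, here is a small sketch that computes the Pearson coefficient by hand. The function name pearson_r and the sample values are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: one series rises with hours, the other falls.
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]
score_down = [10, 8, 6, 4, 2]

r_up = pearson_r(hours, score_up)      # perfectly increasing line
r_down = pearson_r(hours, score_down)  # perfectly decreasing line
print(r_up, r_down)
```

Values near +1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.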

32. What is the purpose of using dimensionality reduction techniques like Principal Component Analysis (PCA) during EDA?

A: To reduce the number of features while preserving most of the variance in the data
B: To create new features from existing variables
C: To visualize the distribution of a single variable
D: To clean missing values from the dataset

33. In exploratory data analysis, which plot would be most appropriate to visualize the distribution of outliers?

A: Box plot
B: Bar chart
C: Line chart
D: Pie chart
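
A box plot conventionally marks points beyond 1.5 times the interquartile range (IQR) from the quartiles. The sketch below applies that rule numerically to made-up values; the 1.5 multiplier is the common convention, and the quartile method is the default of statistics.quantiles, so other tools may give slightly different fences.

```python
import statistics

# Made-up measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 14, 100]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```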

34. Which of the following preprocessing steps is recommended before performing EDA on a dataset?

A: Removing all rows with missing values
B: Imputing missing values using an appropriate method
C: Standardizing all numerical features
D: Performing dimensionality reduction

35. Which of the following techniques is used to scale data such that the mean is zero and the standard deviation is one?

A: Normalization
B: Standardization
C: One-hot encoding
D: Min-max scaling
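
The difference between the two rescaling approaches named in the options can be seen in a few lines. This sketch uses made-up values and the population standard deviation; libraries may use a slightly different estimator.

```python
import statistics

# Made-up feature values.
values = [2.0, 4.0, 6.0, 8.0]

# Standardization (z-scores): subtract the mean, divide by the standard deviation.
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standardized = [(v - mean) / std for v in values]

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

print(standardized)
print(min_max)
```

After standardization the values have mean 0 and standard deviation 1; after min-max scaling they lie in the range 0 to 1.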

36. In Python, which library is commonly used for creating data visualizations such as histograms, scatter plots, and bar charts?

A: NumPy
B: SciPy
C: Matplotlib
D: Scikit-learn

37. Which method is used in Python to calculate the correlation matrix of a pandas DataFrame?

A: df.corr()
B: df.cov()
C: df.describe()
D: df.summary()

38. In a correlation matrix, what does a value close to 0 indicate?

A: A strong positive relationship between two variables
B: A weak or no linear relationship between two variables
C: A strong negative relationship between two variables
D: A moderate positive relationship between two variables

39. When analyzing categorical data, which visualization is typically used to display the frequency distribution of different categories?

A: Box plot
B: Scatter plot
C: Bar chart
D: Histogram

40. Which of the following is an example of feature engineering in EDA?

A: Filling missing values with the median
B: Creating a new feature by extracting the month from a date field
C: Removing duplicate rows from the dataset
D: Scaling numerical features to a range of 0 to 1
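
Deriving a new column from an existing one is the essence of feature engineering. As a small sketch, the snippet below extracts the month from a list of dates; the variable names and dates are hypothetical.

```python
from datetime import date

# Made-up order dates.
order_dates = [date(2024, 1, 15), date(2024, 3, 2), date(2024, 12, 31)]

# New feature: the calendar month (1-12) of each order.
order_months = [d.month for d in order_dates]
print(order_months)
```

With a pandas datetime column the same idea is usually written as df["order_date"].dt.month, where order_date is an assumed column name.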

41. Which statistical measure is most sensitive to outliers in a dataset?

A: Mean
B: Median
C: Mode
D: Range
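
How strongly a single extreme value moves a summary statistic can be checked directly. The sketch below compares the mean and the median before and after adding one outlier to made-up data.

```python
import statistics

# Made-up salaries (in thousands), then the same list with one extreme value.
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [500]

mean_shift = statistics.fmean(with_outlier) - statistics.fmean(salaries)
median_shift = statistics.median(with_outlier) - statistics.median(salaries)

print(mean_shift, median_shift)
```

The mean moves by tens of units while the median moves by one, which is why the median is described as a robust statistic.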

42. Which of the following tasks should be performed before training a machine learning model?

A: Splitting the dataset into training and testing sets
B: Calculating performance metrics like accuracy and F1 score
C: Deploying the model into production
D: Tuning hyperparameters of the model

43. Which technique is commonly used to tune hyperparameters in a machine learning model?

A: Cross-validation
B: Feature selection
C: One-hot encoding
D: Dimensionality reduction

44. You have trained a logistic regression model and obtained the following confusion matrix for a binary classification problem:

A: 75%
B: 80%
C: 85%
D: 70%

45. Which machine learning model is most suitable for predicting a continuous numerical variable, such as house prices?

A: Logistic regression
B: Decision tree classifier
C: Linear regression
D: k-means clustering

46. Which of the following techniques is effective in preventing overfitting in machine learning models?

A: Adding more layers to a neural network
B: Reducing the size of the training dataset
C: Using regularization techniques like Lasso or Ridge
D: Increasing the number of features in the dataset

47. A model is trained to predict whether a patient has a disease. You are given the following confusion matrix: Actual Positive Actual Negative Predicted Positive 70 Predicted Negative 10 Calculate the precision of the model for the positive class.

A: 77.8%
B: 85.3%
C: 87.5%
D: 90%

48. Which of the following is an appropriate evaluation metric for a regression model?

A: Accuracy
B: F1 score
C: Mean Squared Error (MSE)
D: Precision

49. What is the purpose of using the train-test split method in machine learning?

A: To improve the model's performance on the training data
B: To reduce the dimensionality of the dataset
C: To evaluate the model's performance on unseen data
D: To tune hyperparameters of the model

50. Which machine learning algorithm is suitable for solving a binary classification problem?

A: Linear regression
B: Logistic regression
C: k-means clustering
D: Principal Component Analysis (PCA)

51. Which of the following is used to compare the performance of different machine learning models on a dataset?

A: Cross-validation
B: Regularization
C: Feature engineering
D: Normalization

52. Which algorithm is most commonly used for clustering data points into groups based on their similarities?

A: Logistic regression
B: Decision tree
C: k-means
D: Linear regression

53. Which of the following algorithms is most suitable for a multiclass classification problem?

A: Support Vector Machine (SVM)
B: k-means clustering
C: Principal Component Analysis (PCA)
D: Decision tree

54. Which metric is used to evaluate a model's ability to generalize to unseen data?

A: Cross-entropy loss
B: Training accuracy
C: Validation accuracy
D: Precision

55. A regression model has a Mean Squared Error (MSE) of 25. What is the Root Mean Squared Error (RMSE) for this model?

A: 4
B: 5
C: 10
D: 25
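
RMSE is the square root of MSE, which puts the error back in the units of the target variable. The sketch below shows the relationship on made-up numbers that are not taken from any question in this set; the function names are chosen for illustration.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse_from_mse(mse_value):
    """RMSE is simply the square root of MSE."""
    return math.sqrt(mse_value)

# Made-up targets and predictions.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

example_mse = mse(actual, predicted)        # (1 + 0 + 4) / 3
example_rmse = rmse_from_mse(example_mse)
print(example_mse, example_rmse)
print(rmse_from_mse(49.0))                  # an MSE of 49 gives an RMSE of 7
```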

56. Which of the following methods is used to handle imbalanced datasets during model training?

A: Reducing the number of features
B: Using SMOTE (Synthetic Minority Over-sampling Technique)
C: Cross-validation
D: One-hot encoding

57. A model predicts the sales of a product and gives the following results for actual and predicted values:

A: 5
B: 10
C: 15
D: 20

58. Which type of regularization penalizes the sum of the absolute values of the coefficients in a model?

A: L1 Regularization (Lasso)
B: L2 Regularization (Ridge)
C: Dropout
D: Elastic Net
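
The two penalty terms differ only in how they measure coefficient size. As a rough sketch with made-up coefficients and an assumed strength parameter alpha, the functions below compute each penalty; in practice the intercept is usually left out of the penalty.

```python
def l1_penalty(coefs, alpha):
    """L1 (Lasso) penalty: alpha times the sum of absolute coefficient values."""
    return alpha * sum(abs(c) for c in coefs)

def l2_penalty(coefs, alpha):
    """L2 (Ridge) penalty: alpha times the sum of squared coefficient values."""
    return alpha * sum(c * c for c in coefs)

# Made-up model coefficients and regularization strength.
coefs = [0.5, -2.0, 0.0, 3.0]
alpha = 0.1

l1 = l1_penalty(coefs, alpha)  # 0.1 * (0.5 + 2.0 + 0.0 + 3.0)
l2 = l2_penalty(coefs, alpha)  # 0.1 * (0.25 + 4.0 + 0.0 + 9.0)
print(l1, l2)
```

Elastic Net adds a weighted combination of both terms to the loss.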

59. Which of the following techniques reduces the dimensionality of data while preserving variance?

A: Normalization
B: Standardization
C: Principal Component Analysis (PCA)
D: Cross-validation

60. What is the key benefit of using ensemble methods like Random Forest in model building?

A: They improve the interpretability of models
B: They reduce the need for feature selection
C: They combine multiple models to improve prediction accuracy
D: They reduce the size of the training dataset

61. Which of the following is a key reason to use a test set after training a machine learning model?

A: To ensure the model fits the training data well
B: To fine-tune the model’s hyperparameters
C: To evaluate how well the model generalizes to unseen data
D: To reduce the dimensionality of the dataset

62. You are given a regression model with the following predictions and actual values for a test set:

A: 12.5
B: 25
C: 50
D: 100

63. Which of the following metrics is commonly used to evaluate the performance of a classification model on the test set?

A: Mean Squared Error (MSE)
B: Accuracy
C: Adjusted R-squared
D: Root Mean Squared Error (RMSE)

64. A binary classifier is tested on a dataset, and the following confusion matrix is generated:

A: 0.80
B: 0.85
C: 0.88
D: 0.75

65. Why is cross-validation an important step when testing machine learning models?

A: To ensure the model is not overfitting to the test set
B: To tune the model's hyperparameters
C: To select the most relevant features for the model
D: To visualize the distribution of predictions

66. What is the primary goal of model operationalization in data science?

A: To deploy the machine learning model into production for use in real-world applications
B: To fine-tune the model's hyperparameters before testing
C: To perform exploratory data analysis on the training data
D: To validate the model on the training data

67. Which of the following is a common way to deploy a machine learning model in production environments?

A: Creating an API for the model
B: Using one-hot encoding to prepare the data
C: Scaling the features using normalization
D: Visualizing the results using a bar chart

68. Which of the following best describes model monitoring in a production environment?

A: Ensuring the model fits the training data well
B: Regularly checking the model's performance to detect model drift
C: Updating the training dataset every day
D: Performing exploratory data analysis on the test data

69. What is the primary purpose of versioning machine learning models during deployment?

A: To manage different model versions and track changes made to the model
B: To improve the accuracy of the model predictions
C: To ensure that all features are standardized before deployment
D: To retrain the model on new data

70. What is a key challenge when operationalizing machine learning models in production?

A: Ensuring that the model’s predictions are interpretable for end users
B: Increasing the complexity of the model architecture
C: Reducing the amount of data used in training
D: Ensuring that hyperparameters are fixed

71. Which of the following tools is commonly used to deploy machine learning models in a cloud environment?

A: TensorFlow
B: AWS SageMaker
C: Pandas
D: Matplotlib

72. Which of the following is the most important factor when presenting data science results to non-technical stakeholders?

A: Using complex technical terms to explain the analysis
B: Focusing on key insights and recommendations based on the data
C: Showing the entire dataset used in the analysis
D: Explaining the mathematical models in detail

73. Which type of visualization is most effective for showing the relationship between two numerical variables?

A: Bar chart
B: Line chart
C: Scatter plot
D: Pie chart

74. When presenting model performance to a business audience, which of the following metrics is typically the most important?

A: Root Mean Squared Error (RMSE)
B: Recall
C: Accuracy
D: ROC-AUC score

75. Which of the following is a key principle of data storytelling when communicating results?

A: Including as much data as possible for transparency
B: Structuring the presentation around a clear narrative or message
C: Using only text-based explanations to avoid misinterpretation
D: Presenting findings without interpretation, allowing the audience to draw their own conclusions

76. Given the following confusion matrix, calculate the accuracy of the model:

A: 75%
B: 80%
C: 70%
D: 85%

77. Based on the confusion matrix provided below, calculate the precision of the model for the positive class:

A: 75%
B: 83.3%
C: 85.0%
D: 88.2%

78. For the following confusion matrix, calculate the recall for the positive class:

A: 60%
B: 66.7%
C: 70%
D: 75%

79. Using the following confusion matrix, calculate the F1 score for the positive class:

A: 70%
B: 72.6%
C: 75%
D: 80%

80. A binary classification model has the following confusion matrix:

A: Precision: 88.9%, Recall: 80%
B: Precision: 88.9%, Recall: 90%
C: Precision: 80%, Recall: 88.9%
D: Precision: 90%, Recall: 80%

81. You are given the following confusion matrix:

A: 20%
B: 25%
C: 33.3%
D: 16.7%

82. A model predicts a total of 100 samples, with the following breakdown:

A: 75%
B: 85.7%
C: 90%
D: 66.7%

83. You are given the following confusion matrix for a binary classifier:

A: 30%
B: 50%
C: 20%
D: 40%

84. In a binary classification task, a model gives the following predicted probabilities for class 1:

A: 3
B: 2
C: 4
D: 5

85. In a model performance evaluation, what does the ROC-AUC score represent?

A: The probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance
B: The model's accuracy on the training set
C: The F1 score of the model on the test set
D: The sensitivity of the model
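
The ranking interpretation of ROC-AUC can be computed directly by comparing every positive score with every negative score. The sketch below does this for made-up scores; pairwise_auc is a hypothetical helper, and the pairwise loop is only practical for small samples.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive is scored higher.

    Ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up predicted probabilities for positive and negative instances.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]

auc = pairwise_auc(pos_scores, neg_scores)
print(auc)  # 8 of the 9 pairs are ranked correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.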

86. A model's predicted class probabilities for a binary classification task are as follows:

A: 1
B: 2
C: 3
D: 4

87. Which of the following statements best describes sensitivity (recall) in a binary classification model?

A: The percentage of true negatives correctly identified
B: The percentage of true positives correctly identified
C: The percentage of false negatives incorrectly identified
D: The ratio of true positives to false positives
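
The classification metrics used throughout this set all derive from the four cells of a confusion matrix. The sketch below computes them from made-up counts, which are not taken from any question here; the variable names tp, fp, fn, and tn are the usual abbreviations for true/false positives and negatives.

```python
# Made-up confusion-matrix counts for illustration only.
tp, fp, fn, tn = 45, 5, 15, 35

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Recall is also called sensitivity; the corresponding measure for the negative class, tn / (tn + fp), is called specificity.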

88. In a machine learning pipeline, why is it important to evaluate a model on unseen test data?

A: To improve the performance of the model on the training data
B: To assess how well the model generalizes to new, unseen data
C: To reduce the size of the dataset for faster computation
D: To tune the hyperparameters of the model

89. In the context of model performance, what does the precision-recall curve highlight?

A: The trade-off between true positives and false positives
B: The trade-off between precision and recall at different thresholds
C: The overall accuracy of the model
D: The model's performance on the test set

90. Which of the following is most useful for assessing the performance of a model on imbalanced datasets?

A: Accuracy
B: Confusion matrix
C: Precision-Recall AUC score
D: Mean Squared Error