1. What does ETL stand for in the context of data science?
2. What is the purpose of data validation in the ETL process?
3. Which of the following describes structured data?
4. What is a dataset?
5. What is an example of a first-party data source?
6. What is a key advantage of public APIs in data collection?
7. Which of the following is a third-party data source?
8. What is one benefit of using generated data in data science?
9. What is the purpose of data transformation in the ETL process?
10. What does deduplication in data science refer to?
11. What is a benefit of using word embedding techniques in data science?
12. Which of the following is an example of a quantitative feature in a dataset?
13. What does the "range" of a quantitative dataset represent?
14. How does continuous data differ from discrete data?
15. Why is data parsing important in the ETL process?
16. What is the key purpose of feature scaling in data preparation?
17. What type of error is commonly corrected during the data-cleaning phase?
18. Which tool is most commonly used to visualize data for non-practitioners?
19. Why is deduplication important in data science?
20. What is the main challenge of loading large volumes of data into databases?