CDSP 7 Quiz

1. What is the primary goal of k-means clustering?

A: To classify labeled data
B: To find similar groups of data points in an unsupervised manner
C: To increase the number of clusters
D: To minimize dimensionality

2. In k-means clustering, what is a centroid?

A: A random point outside the cluster
B: A data point at the edge of a cluster
C: The center of a cluster
D: The most distant point from other clusters

3. What is the key challenge in k-means clustering?

A: Determining the correct number of clusters
B: Selecting the correct algorithm
C: Minimizing dimensionality
D: Maximizing outliers

4. What is a common technique used to determine the optimal number of clusters in k-means?

A: Silhouette analysis
B: Regression analysis
C: Principal component analysis
D: Outlier detection

5. What is the elbow point in k-means clustering?

A: The point where the number of clusters increases indefinitely
B: The point beyond which adding more clusters no longer reduces the within-cluster sum of squares significantly
C: The point where the algorithm terminates
D: The point where the dataset is divided into equal clusters

6. What is the global cost function in k-means clustering used for?

A: To measure the total squared distance between each data point and its assigned centroid, which the algorithm tries to minimize
B: To maximize the distance between clusters
C: To determine the number of iterations needed
D: To eliminate outliers

7. What is a disadvantage of k-means clustering?

A: It always converges to a global optimum
B: It requires the number of clusters to be specified in advance
C: It cannot handle numerical data
D: It can only be applied to supervised learning problems

8. What is the purpose of silhouette analysis?

A: To determine the accuracy of supervised models
B: To evaluate the compactness and separation of clusters
C: To reduce the number of features in a dataset
D: To determine the learning rate of a model

9. What is hierarchical agglomerative clustering (HAC)?

A: A method that clusters data points based on density
B: A bottom-up clustering method where each data point starts as its own cluster
C: A top-down clustering method that splits data into clusters
D: A method used for classification

10. When should hierarchical clustering be used over k-means?

A: When the data is well separated and does not overlap
B: When there are no outliers in the dataset
C: When there is a need for supervised learning
D: When the data points are perfectly linear

11. What is the role of a dendrogram in hierarchical clustering?

A: To classify data into predefined labels
B: To visually represent the merging of clusters
C: To predict new data points
D: To determine the number of features

12. What is the purpose of DBSCAN (Density-Based Spatial Clustering of Applications with Noise)?

A: To find clusters of arbitrary shape and detect noise
B: To create equally sized clusters
C: To classify data based on predefined labels
D: To split data into hierarchical levels

13. What is a key advantage of DBSCAN over k-means?

A: DBSCAN requires fewer iterations to converge
B: DBSCAN can find clusters of arbitrary shape and handle noise
C: DBSCAN is faster for large datasets
D: DBSCAN requires no hyperparameter tuning

14. What does the epsilon (ϵ) parameter represent in DBSCAN?

A: The number of clusters
B: The maximum distance between two points for them to be considered neighbors
C: The minimum number of iterations
D: The size of the dataset

15. What is the main difference between DBSCAN and k-means clustering?

A: DBSCAN requires labeled data, while k-means does not
B: DBSCAN can detect outliers, while k-means cannot
C: k-means is used for classification, while DBSCAN is not
D: DBSCAN requires more clusters than k-means

16. What does a silhouette coefficient close to 1 indicate?

A: The example is well-clustered and far from neighboring clusters
B: The example is poorly clustered
C: The example is near the decision boundary
D: The clustering model is overfitting

17. What does the within-cluster sum of squares (WCSS) measure?

A: The compactness of the clusters
B: The number of clusters in a dataset
C: The distance between clusters
D: The number of outliers in a dataset

18. What is a primary metric used to evaluate density-based clustering models like DBSCAN?

A: Silhouette score
B: Within-cluster sum of squares
C: Reachability plot
D: Decision boundary

19. What is the primary advantage of hierarchical clustering over k-means clustering?

A: It can find clusters of arbitrary shapes
B: It does not require specifying the number of clusters in advance
C: It is more efficient for large datasets
D: It is faster to converge

20. What does the between-cluster sum of squares (BCSS) measure?

A: The distance between data points and their centroids
B: The separation between different clusters
C: The number of clusters in a dataset
D: The overall compactness of a dataset