Why do we split data into training and test sets?
B
To avoid overfitting by evaluating model performance on unseen data
Analysis & Theory
Splitting the data allows us to train on one part and test on another to evaluate how the model performs on new data.
What is a common split ratio for training and testing datasets?
Analysis & Theory
An 80/20 split is commonly used, meaning 80% of data is used for training and 20% for testing.
Which scikit-learn function is used to split a dataset into train and test sets?
Analysis & Theory
`train_test_split()` is the standard function in `sklearn.model_selection` for splitting data.
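A minimal sketch of the common 80/20 split with `train_test_split()` (the toy `X` and `y` arrays here are made up for illustration):

```python
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 samples, 1 feature) and binary labels.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# test_size=0.2 holds out 20% of the rows for evaluation (the 80/20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 8 2
```

`train_test_split()` shuffles the rows before splitting by default, so the held-out 20% is a random sample rather than the last rows of the dataset.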
What does the `random_state` parameter do in `train_test_split()`?
A
Changes the size of the dataset
B
Makes the split reproducible by setting a seed for random number generation
D
Splits only numeric features
Analysis & Theory
`random_state` ensures the same split happens every time for reproducibility.
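A quick sketch demonstrating the reproducibility claim: the same `random_state` yields an identical split on every call.

```python
from sklearn.model_selection import train_test_split

X = list(range(100))

# Same random_state -> identical split, call after call.
a_train, a_test = train_test_split(X, test_size=0.25, random_state=0)
b_train, b_test = train_test_split(X, test_size=0.25, random_state=0)
print(a_test == b_test)  # True

# A different random_state will (almost certainly) shuffle differently.
c_train, c_test = train_test_split(X, test_size=0.25, random_state=1)
```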
What could happen if you evaluate your model only on the training data?
A
You get an accurate generalization
B
You may overestimate its performance
C
You will detect overfitting easily
Analysis & Theory
Testing on training data often gives inflated accuracy that doesn't reflect real-world performance: an overfit model can look perfect on examples it has memorized, so the evaluation hides overfitting rather than revealing it.
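This gap is easy to demonstrate. In the sketch below (using a synthetic dataset from `make_classification`), an unconstrained decision tree scores perfectly on its own training data while scoring noticeably lower on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unpruned tree can memorize the training set completely...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_train, y_train))  # 1.0 on data it has seen
print(model.score(X_test, y_test))    # lower on unseen data
```

Judging this model by its training accuracy alone would overestimate how well it generalizes.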
What is the purpose of a validation set?
A
To replace the test set
B
To visualize predictions
C
To tune hyperparameters and avoid overfitting
Analysis & Theory
A validation set is used during training to tune hyperparameters and evaluate performance before final testing.
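One common way to obtain a validation set is to split twice, a sketch assuming a 60/20/20 train/validation/test layout (the ratios are illustrative, not prescribed):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First carve off the final test set (20% of the data)...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into train (60% of total)
# and validation (20% of total): 0.25 of the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Hyperparameters are tuned against `X_val`; `X_test` stays untouched until the final evaluation.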
How do you ensure balanced class distribution when splitting data?
B
Use stratified sampling with `stratify` parameter in `train_test_split()`
D
Convert classes to numbers
Analysis & Theory
`stratify` ensures class proportions remain the same in both training and test sets.
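A sketch with a deliberately imbalanced toy dataset (90 samples of class 0, 10 of class 1) showing that `stratify=y` preserves the 9:1 ratio in both splits:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Imbalanced labels: 90 of class 0, 10 of class 1.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both splits keep the 9:1 class ratio.
print(Counter(y_train))  # Counter({0: 72, 1: 8})
print(Counter(y_test))   # Counter({0: 18, 1: 2})
```

Without `stratify`, a purely random split could leave the test set with very few (or zero) minority-class samples.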
What is the main risk of using the test set during model training?
C
Overfitting to test data (data leakage)
Analysis & Theory
Using test data during training causes data leakage: the test score becomes overly optimistic because the model has already seen those examples, so it no longer measures performance on truly unseen data.
What happens if you don’t split the data and train on the whole dataset?
A
You’ll get better accuracy
B
You won’t be able to evaluate generalization
C
The model trains faster
D
It prevents overfitting
Analysis & Theory
Without a test set, you can’t measure how well the model will perform on new, unseen data.
What is an alternative to the train/test split when data is limited?
Analysis & Theory
Cross-validation trains and validates on different folds of the data so that every sample is used for both, giving a more reliable performance estimate when data is too limited for a single train/test split.
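A minimal k-fold sketch using `cross_val_score` on the built-in iris dataset: with `cv=5`, the data is split into 5 folds and each fold serves once as the validation set.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # five per-fold accuracies
print(scores.mean())  # average accuracy across folds
```

Averaging across folds smooths out the luck of any single split, which is why cross-validation is preferred when the dataset is small.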