- Should oversampling be done before or after train test split?
- When should we use oversampling?
- Can we oversample test data?
- Should we apply smote on test data?
Should oversampling be done before or after train test split?
Always split into test and train sets BEFORE trying oversampling techniques! Oversampling before splitting the data can allow the exact same observations to be present in both the test and train sets. This is a form of data leakage: the model is evaluated on duplicates of rows it was trained on, which inflates test-set performance.
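As a minimal sketch of the safe order (assuming scikit-learn and the imbalanced-learn package, with an illustrative synthetic dataset standing in for your own X and y):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Illustrative imbalanced dataset (roughly 90% / 10% class split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# 1. Split FIRST, stratifying so both sets keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Oversample ONLY the training portion; the test set is never touched,
#    so no duplicated observation can leak across the split.
X_train_res, y_train_res = RandomOverSampler(random_state=42).fit_resample(
    X_train, y_train
)
```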
When should we use oversampling?
When one class is an underrepresented minority in the data sample, oversampling techniques may be used to duplicate those observations so that the positive class is more evenly represented during training. Oversampling is used when the amount of data collected for the minority class is insufficient.
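For instance (a small sketch, assuming imbalanced-learn; the dataset and class counts are illustrative), random oversampling simply duplicates minority-class rows until the classes are balanced:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Illustrative sample where class 1 is the underrepresented minority.
# In practice this would be the training split only (see above).
X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)
print(Counter(y))        # e.g. Counter({0: 475, 1: 25})

# Duplicate minority observations until both classes have equal counts.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))    # e.g. Counter({0: 475, 1: 475})
```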
Can we oversample test data?
Oversample the train data and NOT the test or validation data. Real-world data is usually imbalanced in the same way as your training sample, so the test set should keep its original distribution; evaluating on an oversampled test set would give unrealistically optimistic metrics. Even if you don't know whether future data will be balanced, oversample only the train data.
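One way to make this hard to get wrong (a sketch assuming imbalanced-learn's Pipeline, which applies samplers only during fit) is to put the oversampler inside a pipeline, so that cross-validation resamples each training fold while every held-out fold keeps its original, imbalanced distribution:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Illustrative imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# The imblearn Pipeline calls the sampler only when fitting, so each
# cross-validation training fold is oversampled while the held-out fold
# is scored with its original class balance.
model = Pipeline([
    ("oversample", RandomOverSampler(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X, y, scoring="f1", cv=5)
print(scores.mean())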
Should we apply smote on test data?
No. SMOTE does not take neighboring examples from other classes into account when generating synthetic examples, which can increase class overlap and noise; this is especially bad on high-dimensional datasets. More importantly, the test data should stay untouched so that evaluation reflects the real class distribution, so you should definitely not apply SMOTE to the test data.
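A sketch of the safe pattern (assuming imbalanced-learn's SMOTE and scikit-learn; the dataset and classifier are illustrative): resample only the training split, then report metrics on the raw, untouched test split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Synthesize new minority examples from the TRAINING split only;
# k_neighbors controls how many minority neighbors SMOTE interpolates between.
X_train_sm, y_train_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(
    X_train, y_train
)

clf = RandomForestClassifier(random_state=42).fit(X_train_sm, y_train_sm)

# Evaluate on the untouched, still-imbalanced test set so the metrics
# reflect the distribution the model will actually see.
print(classification_report(y_test, clf.predict(X_test)))
```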