What is oversampling in Python?
Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.
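As a minimal sketch of both ideas, the snippet below uses only the standard library. The helper names `random_oversample` and `random_undersample`, the toy records, and the fixed seed are assumptions made for illustration and do not come from any particular library.

```python
import random

def random_oversample(minority, target_size, seed=0):
    """Add randomly chosen minority examples (with replacement) until target_size is reached."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size, seed=0):
    """Keep a random subset of the majority class (without replacement), dropping the rest."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

# Hypothetical toy data: 8 majority records and 2 minority records.
majority = [("majority", i) for i in range(8)]
minority = [("minority", i) for i in range(2)]

oversampled_minority = random_oversample(minority, len(majority))
undersampled_majority = random_undersample(majority, len(minority))

print(len(oversampled_minority), len(undersampled_majority))
```

Oversampling grows the minority class to the size of the majority class, while undersampling shrinks the majority class to the size of the minority class. Either way the two classes end up the same size.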
How do you oversample data?
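To oversample with synthetic examples (the approach used by SMOTE), take a sample from the minority class and consider its k nearest neighbors in feature space. To create a synthetic data point, take the vector between one of those k neighbors and the current data point, multiply this vector by a random number x between 0 and 1, and add the result to the current data point. The new point therefore lies on the line segment between the sample and the chosen neighbor.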
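The steps above can be sketched in plain Python as follows. This is an illustration rather than a reference implementation: the function name `synthetic_point`, the toy points, and the choice to search for neighbors only among the other minority examples are assumptions.

```python
import math
import random

def synthetic_point(sample, minority, k=3, rng=None):
    """Create one synthetic point between `sample` and one of its k nearest neighbors."""
    rng = rng or random.Random(0)
    # Find the k nearest neighbors of the sample (Euclidean distance), excluding the sample itself.
    others = [p for p in minority if p is not sample]
    neighbors = sorted(others, key=lambda p: math.dist(sample, p))[:k]
    # Pick one neighbor at random and a random number x in [0, 1).
    neighbor = rng.choice(neighbors)
    x = rng.random()
    # Scale the vector (neighbor - sample) by x and add it to the sample.
    return tuple(s + x * (n - s) for s, n in zip(sample, neighbor))

# Hypothetical minority-class points in a 2-D feature space.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_point = synthetic_point(minority[0], minority, k=2)
print(new_point)
```

Repeating this for many minority samples produces new points that are not exact copies, unlike random oversampling.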
How do you upsample data in Python?
You can upsample a dataset by copying records from the minority class. One way to do so is the resample() function from the sklearn.utils module. The first argument passed to resample() is the minority class (in a spam-detection dataset, for example, the spam records). Setting replace=True samples with replacement, and n_samples sets how many records to return.
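A small sketch of this with `resample()` might look like the following. The `ham` and `spam` arrays are made-up stand-ins for the majority and minority records, and `random_state=42` is an arbitrary seed chosen for reproducibility.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical toy data: 10 majority ("ham") rows and 3 minority ("spam") rows.
ham = np.arange(10).reshape(-1, 1)
spam = np.array([[100], [101], [102]])

# Sample the minority rows with replacement until there are as many as majority rows.
spam_upsampled = resample(
    spam,
    replace=True,        # allow the same row to be drawn more than once
    n_samples=len(ham),  # match the majority class size
    random_state=42,     # arbitrary seed for reproducibility
)

# Combine both classes into one balanced dataset.
balanced = np.vstack([ham, spam_upsampled])
print(spam_upsampled.shape, balanced.shape)
```

Upsampling is normally applied to the training split only, so that copies of the same record do not end up in both the training and test sets.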