Fixing a kNN Accuracy Drop with Proper Feature Scaling
The author's kNN model accuracy dropped from 0.89 to 0.61 after adding new features with vastly different scales. The issue was that the kNN distance calculation was dominated by the feature with the largest range. Applying StandardScaler incorrectly led to data leakage, so the author shares the right way to scale features before training and deploying the model.
Why it matters
Proper feature scaling is a critical step in machine learning model development, especially for distance-based algorithms like kNN. Failing to scale correctly can severely impact model performance in production.
Key Points
- kNN models are sensitive to differences in feature scale
- Applying StandardScaler at the wrong time can cause data leakage
- The correct approach is to fit the scaler on the training data only, then use it to transform both the train and test sets
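The first point is easy to see numerically. In the sketch below (hypothetical values, not the author's data), two samples that are close in a small-range feature but far apart in a large-range one end up with a Euclidean distance determined almost entirely by the large-range feature:

```python
import numpy as np

# Two points: close in the small-range feature (0-1 scale),
# far apart in the large-range feature (0-50,000 scale).
a = np.array([0.2, 10_000.0])
b = np.array([0.9, 45_000.0])

# Squared contribution of each feature to the Euclidean distance.
contrib = (a - b) ** 2
share = contrib / contrib.sum()
print(share)  # the large-range feature contributes essentially 100%
```

Here the small-range feature's contribution to the distance is on the order of 10⁻¹⁰, so it is effectively invisible to the model.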
Details
The author's kNN classifier was performing well until two new features with values ranging from 0 to 50,000 were added, while the existing features had much smaller ranges. Because kNN classifies by distance, the large-range features dominated the distance calculations, and accuracy dropped from 0.89 to 0.61. The author initially applied StandardScaler, but discovered that when and how the scaling is done are critical. Fitting the scaler on the full dataset before splitting leaks test-set statistics into the preprocessing, while fitting separate scalers on the training and test sets applies inconsistent transforms to the two splits. The correct approach is to fit the scaler on the training data only, then transform both the training and test sets with that same fitted scaler. This ensures the test set remains truly unseen data and the evaluation stays honest.
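The fit-on-train-only workflow can be sketched with scikit-learn as follows. The data here is synthetic (the post does not include the author's dataset): one informative small-range feature and one irrelevant large-range feature, which reproduces the scale-domination problem:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the author's data: the label depends only on
# the small-range feature; the large-range feature is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 500), rng.uniform(0, 50_000, 500)])
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training split only, then apply the SAME fitted
# scaler to both splits -- no test-set statistics enter preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_scaled = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

print("unscaled accuracy:", knn_raw.score(X_test, y_test))
print("scaled accuracy:  ", knn_scaled.score(X_test_s, y_test))
```

On this synthetic data the unscaled model hovers near chance because the noise feature dominates the distances, while the scaled model recovers most of the signal. For real projects, wrapping the scaler and classifier in a `sklearn.pipeline.Pipeline` applies the same fit-on-train-only discipline automatically, including inside cross-validation.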