Achieving Top 8% on Kaggle with a Ridge-XGBoost N-gram Pipeline
The article describes a machine learning pipeline that achieved a top 8% ranking on the Kaggle Playground Series S6E3 customer churn prediction challenge. The key insights were treating categorical features as text to generate n-gram interactions, using nested target encoding, and combining models in a two-stage Ridge-XGBoost ensemble.
Why it matters
This approach demonstrates how creative feature engineering and ensemble modeling can unlock high performance on complex, categorical-heavy datasets that challenge standard machine learning techniques.
Key Points
- Treated categorical features as text and generated n-gram interactions to capture feature combinations
- Used nested target encoding to avoid data leakage
- Engineered service bundle counts and digit features for continuous columns
- Employed a two-stage ensemble with a regularized Ridge model followed by XGBoost
Details
The Kaggle Playground Series S6E3 dataset had 594,000 rows of heavily categorical data, where the signal was buried in combinations of features rather than in individual columns. The author's starting point was a single LightGBM model, but cracking the top 10% required a more unconventional approach. The breakthrough came from treating the categorical columns like text and generating bigrams and trigrams across high-impact features, which captured interaction patterns a standard feature matrix would miss. The author also used nested target encoding, service bundle analysis, and digit features to enrich the input data. Finally, a two-stage ensemble was employed: a regularized Ridge model served as the first stage to provide a stable, low-variance signal, followed by an XGBoost model trained on the original features plus the Ridge model's out-of-fold predictions.