

The goal of ncaa-march-madness-2020 is to store the notebooks for this Kaggle competition; see the GitBook.

We publish a package with some internal helper functions. Install it with:

pip install ncaa-march-madness-2020

How to use

All notebooks live in the `analysis` directory and keep their data files in the `input`, `output` and `data` directories.

fs::dir_tree("analysis", recurse = TRUE, regexp = "ipynb")
#> analysis
#> +-- baseline.ipynb
#> +-- evaluate-features.ipynb
#> +-- gbdt_lr.ipynb
#> +-- gbdt_lr_CV.ipynb
#> +-- id2vec.ipynb
#> +-- linear-base-learner.ipynb
#> +-- march-madness-2020-ncaam-simple-lightgbm-on-kfold.ipynb
#> +-- Obtain_Answer.ipynb
#> +-- outliers.ipynb
#> +-- params_tuning.ipynb
#> +-- paris-madness.ipynb
#> +-- pkg_test.ipynb
#> \-- target-encoding.ipynb
fs::dir_tree(recurse = TRUE, regexp = "input|output|data")
#> .
#> +-- data
#> |   +-- feature_importances.csv
#> |   +-- id2vec.npy
#> |   +-- NCAA2020_Kenpom.csv
#> |   +-- outlier_df.csv
#> |   +-- submission_True.csv
#> |   +-- team_strength_embedding.csv
#> |   +-- Tourney_Reuslt.csv
#> |   \-- Tourney_Reuslt_inputs.csv
#> +-- input
#> |   +-- google-cloud-ncaa-march-madness-2020-division-1-mens-tournament
#> |   |   +-- MDataFiles_Stage1
#> |   |   |   +-- Cities.csv
#> |   |   |   +-- Conferences.csv
#> |   |   |   +-- MConferenceTourneyGames.csv
#> |   |   |   +-- MGameCities.csv
#> |   |   |   +-- MMasseyOrdinals.csv
#> |   |   |   +-- MNCAATourneyCompactResults.csv
#> |   |   |   +-- MNCAATourneyDetailedResults.csv
#> |   |   |   +-- MNCAATourneySeedRoundSlots.csv
#> |   |   |   +-- MNCAATourneySeeds.csv
#> |   |   |   +-- MNCAATourneySlots.csv
#> |   |   |   +-- MRegularSeasonCompactResults.csv
#> |   |   |   +-- MRegularSeasonDetailedResults.csv
#> |   |   |   +-- MSeasons.csv
#> |   |   |   +-- MSecondaryTourneyCompactResults.csv
#> |   |   |   +-- MSecondaryTourneyTeams.csv
#> |   |   |   +-- MTeamCoaches.csv
#> |   |   |   +-- MTeamConferences.csv
#> |   |   |   +-- MTeams.csv
#> |   |   |   \-- MTeamSpellings.csv
#> |   |   +-- MEvents2015.csv
#> |   |   +-- MEvents2016.csv
#> |   |   +-- MEvents2017.csv
#> |   |   +-- MEvents2018.csv
#> |   |   +-- MEvents2019.csv
#> |   |   +-- MPlayers.csv
#> |   |   \-- MSampleSubmissionStage1_2020.csv
#> |   \--
#> +-- large_data
#> \-- output
#>     \-- paris-submission.csv
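
Since the notebooks themselves are Python, the notebook listing above can also be reproduced without R. The following is a minimal sketch using only the standard library; the helper name `list_files` is hypothetical and is not part of the published package.

```python
from pathlib import Path


def list_files(root, pattern="*.ipynb"):
    """Return sorted, POSIX-style paths (relative to root) of files matching pattern.

    Hypothetical helper for illustration; not part of ncaa-march-madness-2020.
    """
    root = Path(root)
    if not root.is_dir():
        # A missing directory yields an empty listing instead of an error.
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(pattern)
        if p.is_file()
    )


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for name in list_files("analysis"):
        print(name)
```

Calling `list_files("data", "*.csv")` would list the CSV files under `data` in the same way.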

Download Data


kaggle competitions download -c google-cloud-ncaa-march-madness-2020-division-1-mens-tournament -p input
mkdir input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament
unzip input/ -d input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament
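
Once the archive is extracted, a quick check that the files are readable can look like the sketch below. It uses only the standard library; the directory and file names are taken from the tree listing above, the helper `read_rows` is hypothetical, and no column names are assumed.

```python
import csv
from pathlib import Path

# Directory name taken from the tree listing above.
STAGE1_DIR = (
    Path("input")
    / "google-cloud-ncaa-march-madness-2020-division-1-mens-tournament"
    / "MDataFiles_Stage1"
)


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row.

    Hypothetical helper for illustration; not part of ncaa-march-madness-2020.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    teams_path = STAGE1_DIR / "MTeams.csv"
    if teams_path.exists():
        rows = read_rows(teams_path)
        print(len(rows), "rows; columns:", list(rows[0]) if rows else [])
    else:
        print("Missing", teams_path, "- download the data first")
```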


  1. Do feature engineering on goal and spots with distance (Nandakumar 2020).
  2. We skip multicollinearity detection on the features because we chose XGBoost, which handles this problem itself.
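
As a rough illustration of the distance idea in item 1, the sketch below computes the Euclidean distance from an event location to the basket and attaches it as a feature. Everything specific here is an assumption made for illustration: the `x`/`y` keys, the 0 to 100 court grid, the basket position, and the helper names are hypothetical and are not taken from the notebooks or the competition data.

```python
import math

# Assumed basket position on a hypothetical 0-100 by 0-100 court grid.
BASKET = (7.0, 50.0)


def shot_distance(x, y, basket=BASKET):
    """Euclidean distance from the point (x, y) to the basket."""
    return math.hypot(x - basket[0], y - basket[1])


def add_distance(events, basket=BASKET):
    """Return copies of the event dicts with a 'distance' key added.

    Each event is assumed to carry numeric 'x' and 'y' keys (hypothetical names).
    """
    return [
        {**event, "distance": shot_distance(event["x"], event["y"], basket)}
        for event in events
    ]


if __name__ == "__main__":
    sample = [{"x": 10.0, "y": 54.0}, {"x": 7.0, "y": 50.0}]
    for row in add_distance(sample):
        print(row)
```

The original event dicts are left unmodified, so the raw coordinates stay available for other features.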

**Code of Conduct**

Please note that the `ncaa-march-madness-2020` project is released with a Contributor Code of Conduct.
By contributing to this project, you agree to abide by its terms.


Apache License (\>= 2.0) © Jiaxiang Li; Jiatao Li; Zhipeng Liang; Yue Pan


Nandakumar, Namita. 2020. “R + Tidyverse in Sports.” RStudio Conference 2020.