Winning the UCSD Data Mining Contest 2009

Author: Jan Hendrik Hosang (homepage)

Contents

  1. General Methodology
  2. The Easy Set
    1. Data Distribution
    2. Preprocessing
    3. Classification
    4. Combination
  3. The Hard Set
    1. Data Distribution
    2. Preprocessing
    3. Classification
    4. Combination
  4. So what's your best model anyway?
    1. Issues
    2. Exploiting the Tell

General Methodology

To test new methods, select features, and tune parameters without overfitting, I used five-fold cross-validation. To avoid overfitting on the cross-validation and on the feedback from the ranking page, it would usually be advisable to keep a holdout set as well, but because the feedback was given on the complete data set (the same one used to determine the final standings), this was unnecessary.
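As a minimal sketch of this loop (assuming numpy arrays, reading the contest metric as lift at 20%, and using scikit-learn's RandomForestClassifier as a stand-in for my own forests):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def lift_at_20(y_true, scores):
    # lift at 20%: positive rate among the top-scored 20% divided by the overall positive rate
    top = np.argsort(-scores)[: int(0.2 * len(y_true))]
    return y_true[top].mean() / y_true.mean()

def cv_lift(X, y, **forest_params):
    # average lift over five stratified folds
    lifts = []
    for fit_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True).split(X, y):
        model = RandomForestClassifier(**forest_params).fit(X[fit_idx], y[fit_idx])
        lifts.append(lift_at_20(y[val_idx], model.predict_proba(X[val_idx])[:, 1]))
    return np.mean(lifts)
```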

The Easy Set

Data Distribution

A glance at the data showed that several features were not homogeneously distributed throughout the data set. Graphically, the plots of accumulated features usually showed a straight line, but for some features that line changed its slope in two places. For example, the hour feature usually increases monotonically, wrapping around at 24 hours; for the last five thousand examples of the training set, however, this regularity is gone.
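A quick way to reproduce such a plot; the file name and the column name 'hour' are assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd

train = pd.read_csv("easy_train.csv")           # file name assumed
plt.plot(train["hour"].cumsum().values)         # a straight line means a homogeneous distribution
plt.xlabel("row index")
plt.ylabel("cumulative sum of hour")
plt.show()
```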

The label is distributed differently as well. The first ninety thousand lines of the training set carry approximately the same number of positive instances as the last five thousand lines, which means p(c=1) = 1.1% for the former and p(c=1) = 23.4% for the latter. To put it more drastically, 50% of all positive instances sit in 5% of the data.

Preprocessing

The most important feature marks the part of the data set where the hour feature is flipping, since the prior probability of positive instances differs massively between the two regions (see 2.1 Data Distribution).

I created new features by building the usual equi-depth and equi-width histograms and by replacing feature values by their frequency. I also replaced some features by their binary representation, meaning that every value of a feature is turned into a new feature, which is 1 when the original feature takes that value and 0 otherwise.
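A rough pandas sketch of these derived features; the column names are placeholders, not the contest's actual names:

```python
import pandas as pd

def add_derived_features(df, col, n_bins=10):
    out = df.copy()
    out[col + "_eqwidth"] = pd.cut(df[col], bins=n_bins, labels=False)                    # equi-width histogram
    out[col + "_eqdepth"] = pd.qcut(df[col], q=n_bins, labels=False, duplicates="drop")   # equi-depth histogram
    out[col + "_freq"] = df[col].map(df[col].value_counts())                              # value -> frequency
    out = out.join(pd.get_dummies(df[col], prefix=col))    # binary representation (sensible for low-cardinality features)
    return out
```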

I grouped the top-level domains by their fraud rate, which is supervised preprocessing and therefore dangerous, but on the other hand rather coarse.
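One possible sketch of this grouping, assuming hypothetical 'tld' and 'label' columns: compute the fraud rate per top-level domain on the training labels only and bucket the rates into a few coarse groups.

```python
import pandas as pd

def tld_fraud_group(train, test, n_groups=4):
    rates = train.groupby("tld")["label"].mean()                          # fraud rate per TLD (training labels only)
    groups = pd.qcut(rates, q=n_groups, labels=False, duplicates="drop")  # coarse buckets of similar fraud rate
    return (train["tld"].map(groups).fillna(-1),
            test["tld"].map(groups).fillna(-1))                           # unseen TLDs fall into their own group
```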

For feature selection I used a randomised hill-climbing algorithm, which is close to Boosting Feature Selection [bfs]. First I created hundreds of small data sets with randomly drawn features and evaluated their quality by training random forests and calculating the lift. Then I defined the quality of a feature to be the average lift of the data sets it is contained in. With this new information about the quality of features I repeated the experiment, but this time I favoured good features, drawing them with a higher probability than poor ones.
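The following sketch captures the idea, reusing the cv_lift helper from the General Methodology sketch above; the numbers of rounds, sets, and features per set are illustrative, not the values I actually used.

```python
import numpy as np

def randomised_feature_search(X, y, n_rounds=2, n_sets=200, set_size=15):
    n_features = X.shape[1]
    weights = np.full(n_features, 1.0 / n_features)          # start with a uniform draw
    for _ in range(n_rounds):
        sums = np.zeros(n_features)
        counts = np.zeros(n_features)
        for _ in range(n_sets):
            # draw a small feature subset, favouring features with high weight
            subset = np.random.choice(n_features, size=min(set_size, n_features),
                                      replace=False, p=weights)
            score = cv_lift(X[:, subset], y, n_estimators=100)
            sums[subset] += score
            counts[subset] += 1
        # quality of a feature = average lift of the data sets it appeared in
        quality = sums / np.maximum(counts, 1)
        if (counts == 0).any():
            quality[counts == 0] = quality[counts > 0].mean()   # unseen features get average credit
        weights = quality / quality.sum()
    return weights    # higher weight = feature appeared in better-performing sets
```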

Another technique I used involved an implementation of Landwehr's Logistic Model Trees [lmt]. I used each binary feature as the target and all other features as inputs for training an LMT. Since the actual labels are discarded during this process, it is perfectly valid to do this on the complete data set. After training, I classified every instance and recorded the leaf it ended up in. Since the nodes of the tree are enumerated breadth-first, this yields a lexicographical order with respect to the split points defined by the inner nodes of the tree. The generated features can be interpreted as a clustering with respect to the features used as targets (one clustering per binary feature).
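A simplified sketch of the idea, substituting a plain decision tree for the LMT (so the leaf numbering follows scikit-learn's node ids rather than the breadth-first enumeration described above); binary_cols holds the column indices of the binary features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_cluster_features(X_all, binary_cols, max_depth=5):
    # X_all may be the stacked train+test matrix, since the real labels are never used
    new_features = {}
    for col in binary_cols:
        mask = np.ones(X_all.shape[1], dtype=bool)
        mask[col] = False                                  # predict the binary feature from all others
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X_all[:, mask], X_all[:, col])
        new_features[col] = tree.apply(X_all[:, mask])     # leaf index of every instance = cluster id
    return new_features
```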

Classification

Mainly I made use of my own implementation of Random Forests in Python [ert], to be more flexible with parametrisation and with combining them with other techniques. For example, I used internal five-fold cross-validation on the training data to obtain reasonable estimates of model performance even though I was ultimately testing on data with unknown labels. I also implemented feature selection (see 3.2 Preprocessing), bagging, several boosting algorithms, and different tree-growing strategies.
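A sketch of the internal cross-validation, again with scikit-learn's forest standing in for my own implementation: it yields out-of-fold probabilities for the training set and an averaged prediction for the unlabelled test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def out_of_fold_predictions(X_train, y_train, X_test, **forest_params):
    oof = np.zeros(len(X_train))        # out-of-fold probabilities for the training set
    test_pred = np.zeros(len(X_test))   # averaged probabilities for the test set
    folds = StratifiedKFold(n_splits=5, shuffle=True)
    for fit_idx, val_idx in folds.split(X_train, y_train):
        model = RandomForestClassifier(**forest_params)
        model.fit(X_train[fit_idx], y_train[fit_idx])
        oof[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / folds.n_splits
    return oof, test_pred
```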

Combination

Having reasonable probabilities (no testing on training data) for both the training and the test set enabled me to perform ensemble selection on the training set. I tried several approaches: simple summation of probabilities, linear regression, and using other classifiers to combine the probabilities of different forests.

The latter is also known as stacking, and I did that both with and without the features used previously. The classifiers I used were Log-Linear Models and Support Vector Machines with linear kernels.

As mentioned before (see 2.1 Data Distribution), there is a region in the data with a much higher prior probability of positive instances than in the rest of the data. The relative size of that region differs between the training and the testing set: approximately 5% of the training data lies in the high-prior region, but in the testing set it is about 11%. I wanted the stacking classifiers to train in the same environment they would be tested in, so that they would put more weight on experts that are good at classifying the small region with its high density of positive instances. So I oversampled the region in the training set by duplicating instances until it accounted for 11% of the data as well.
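A sketch of this oversampling step, assuming a boolean mask 'flipping' that marks the high-prior region and a matrix P_oof of out-of-fold forest probabilities (one column per forest); LogisticRegression stands in here for the log-linear stacker.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_oversampled_stacker(P_oof, y, flipping, target_share=0.11):
    n_total = len(y)
    n_flip = int(flipping.sum())
    # extra flipping-region rows needed so that the region makes up target_share of the data
    n_extra = int((target_share * n_total - n_flip) / (1.0 - target_share))
    extra = np.random.choice(np.flatnonzero(flipping), size=max(n_extra, 0), replace=True)
    idx = np.concatenate([np.arange(n_total), extra])
    return LogisticRegression().fit(P_oof[idx], y[idx])
```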

The Hard Set

Data Distribution

I did not find any apparent irregularities like those in the easy data set. But while looking at the session IDs (or whatever the long numbers were) and wondering whether they really were session IDs, I also looked at reused email addresses and found that they are more likely to be used for fraud. In fact, email addresses occurring in both the training and the testing set were almost surely used by fraudsters. Are regular customers fraudsters? Before jumping to any conclusions, let me mention that there are too few of them to profit from this very much.

Preprocessing

Basically I took the same preprocessing steps as for the easy set (see 2.2 Preprocessing). Since we were given complete email addresses, I added the frequency of the name before the "@" in the email address. Unfortunately I could not find a region with a flipping hour feature, but on the upside the regular hour feature enabled me to create a feature specifying the day of the week.
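A pandas sketch of these hard-set features; the column names ('email', 'day') and the day-of-week derivation are assumptions.

```python
import pandas as pd

def add_email_features(train, test):
    for df in (train, test):
        df["email_name"] = df["email"].str.split("@").str[0]   # part before the "@"
    name_counts = pd.concat([train["email_name"], test["email_name"]]).value_counts()
    shared = set(train["email"]) & set(test["email"])           # addresses seen in both sets
    for df in (train, test):
        df["email_name_freq"] = df["email_name"].map(name_counts)
        df["email_in_both"] = df["email"].isin(shared).astype(int)
        df["weekday"] = df["day"] % 7    # only meaningful if 'day' is a running day counter
    return train, test
```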

For feature selection I implemented a randomised hill-climbing method into my Random Forests. This method is different from the one used on the easy data set: I randomly deleted or added a feature, trained models with internal cross-validation, and evaluated the lift to decide whether to keep the change. The difference is that this method is not applied before the actual training of models, but during training. (I also used it for one easy submission: the last one in fig. 1.)
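In sketch form (written outside the forest rather than inside it, and reusing cv_lift from the General Methodology sketch), the hill climbing looks roughly like this:

```python
import random

def hill_climb_features(X, y, n_steps=50):
    n_features = X.shape[1]
    current = set(range(n_features))                     # start from the full feature set
    best = cv_lift(X[:, sorted(current)], y, n_estimators=100)
    for _ in range(n_steps):
        candidate = set(current)
        f = random.randrange(n_features)
        if f in candidate and len(candidate) > 1:
            candidate.discard(f)                         # try removing a feature
        else:
            candidate.add(f)                             # or adding it back
        lift = cv_lift(X[:, sorted(candidate)], y, n_estimators=100)
        if lift > best:                                  # keep the change only if it helps
            current, best = candidate, lift
    return sorted(current)
```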

Classification

See 2.3 Classification. Additionally, I used Artificial Neural Networks and Log-Linear Models.

Combination

See 2.4 Combination.

So what's your best model anyway?

method | lift
5 forests of 1,000 trees (with bagging, different leaf sizes, different numbers of entropy tests), combination: linear regression on oversampled holdout | 4.617
same without oversampling | 4.611
same, but with a simple sum instead of linear regression | 4.528
same, but combined with log-linear models on oversampled holdout | 4.573
3 forests of 1,000 trees (with bagging, leaf size 5, different numbers of entropy tests), combination: sum | 4.624
4 forests of 1,000 trees (with/without bagging, with/without hill-climbing feature selection, leaf size 5, 50 entropy tests, only entropy-optimal split points), stacking: SVM | 4.656
combination | 4.719

Fig. 1: Submissions on the easy data set and their lift, in chronological order.

None, really. I had some well-working methods, but no single one of them was as good as my best submissions.

Issues

On both the easy and the hard set, an improved, tuned five-fold cross-validation lift often led to a worse submission. I do not know whether this curious fact was caused by my Random Forests or inherent in the problem, but it made it impossible to do the fine tuning on the training data alone. So I decided to analyse the submissions and the lift they obtained on the testing data. Fortunately the leaderboard shows the score on the complete data set used for the final standings, so I did not have to watch out for overfitting on some small evaluation set. In fact, this is exactly what I wanted to do.

Exploiting the Tell

easy set | region | # instances | # positives | p(c=1)
train | normal | 90,000 | 999 | 1.1%
train | flipping | 4,682 | 1,095 | 23.4%
train | total | 94,682 | 2,094 | 2.2%
test (forecast) | normal | 32,000 | 355 | 1.1%
test (forecast) | flipping | 4,019 | 940 | 23.4%
test (forecast) | total | 36,019 | 1,295 | 3.6%

Fig. 2: Distribution of the target on the easy training set and a forecast for the easy testing set.

The leaderboard shows the lift on the complete data set used for the calculation of the final results. From it, it is possible to obtain the prior probability of positive instances in the top 20% of a submission s: p_s,top(c=1) = lift_0.2(s) · p(c=1).

However, the prior probability of positive instances in the easy data set is probably different in train and test, due to the strange distribution of the target throughout the easy data set (see 2.1 Data Distribution). So I assumed the prior probability to be the same in the flipping-hour regions of train and test, and likewise the same in the regions where the hour feature is not flipping. This way I was able to predict a prior probability for the easy testing set (see fig. 2).
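The forecast in fig. 2 is then a short calculation:

```python
p_normal = 999 / 90000        # training prior in the normal region
p_flip = 1095 / 4682          # training prior in the flipping region
n_normal, n_flip = 32000, 4019
positives = n_normal * p_normal + n_flip * p_flip
print(round(positives), round(positives / (n_normal + n_flip), 3))   # -> 1295 0.036
```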

With this in place, I was able to determine how many positive instances were captured by the top 20% of a submission. Submitting the second 20% of a submission explicitly showed that the top 40% of a submission covers 98% of the positive instances.

[Figure: a submission sorted by predicted probability, split into top 20%, top2 20%, and flop 60%, from high to low probability.]

Fig. 3: A sorted submission. The top 20% of instances are used for the lift calculation. The next 20% carry almost all remaining positive instances. The rest contains almost no positive instances.

For every submission I assumed that all positive instances were captured by its top 40%. Given the lift of the submission, I calculated for each instance the probability of it being positive: p_s(c=1 | x ∈ top_s) = lift_0.2(s) · p(c=1), p_s(c=1 | x ∈ top2_s) = (1 − lift_0.2(s)) · p(c=1), and p_s(c=1 | x ∈ flop_s) = 0.

I added up the probabilities of the submissions whose lift I knew and used the sum to build a new submission. It was beneficial to use submissions that differed strongly in which instances made up their top 20%, so that the combination is not biased towards one particular method.
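A minimal sketch of this combination, under the reading that the top 20% of a submission carries a fraction 0.2 · lift_0.2(s) of the positives and the next 20% carries the remainder, spread uniformly; what matters for the summation is mostly the relative ordering of the three levels.

```python
import numpy as np

def submission_to_probs(scores, lift, p_prior):
    n = len(scores)
    order = np.argsort(-scores)                                    # best-ranked instances first
    probs = np.zeros(n)
    probs[order[: int(0.2 * n)]] = lift * p_prior                  # top 20%
    probs[order[int(0.2 * n): int(0.4 * n)]] = (1 - 0.2 * lift) * p_prior / 0.2   # next 20%
    return probs                                                   # flop 60% stays at zero

def combine(submissions, p_prior):
    # submissions: list of (score_vector, leaderboard_lift) pairs
    return sum(submission_to_probs(s, l, p_prior) for s, l in submissions)
```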