Using Predictive Power Score to Pinpoint Non-linear Correlations

In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data.

  • The score should be asymmetric, because I want to detect every unusual relationship between two variables, including those that only hold in one direction.
  • The score should be 0 if there is no relationship and 1 if there is a perfect relationship.
  • The score should help answer the question "Are there correlations between the columns?" the way a correlation matrix does, so that you can then make a scatter plot of two columns to check whether the relationship is really strong.
  • As the icing on the cake, the score should handle both categorical and numerical columns by default.
!pip3 install ppscore

Calculating the PPS

First of all, there is no single way to calculate the Predictive Power Score. In fact, there are many possible ways to calculate a score that meets the requirements mentioned above. So let's think of the PPS as a framework for a family of scores. Say we have two columns and we want to calculate the PPS of X predicting Y. In this case, we treat Y as our target variable and X as our (only) feature.
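To make the framework concrete, here is a minimal sketch of one member of this family for a numeric target. It is not the ppscore implementation: the binned-median predictor is a hypothetical stand-in for a real model, and the function names, the 75/25 split and the number of bins are all assumptions chosen for illustration. The idea is only to compare the error of a model that sees X against a naive baseline that ignores it.

```python
import random
import statistics


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_binned_median(x_train, y_train, n_bins=20):
    """Hypothetical stand-in model: predict the median of y within each bin of x."""
    lo, hi = min(x_train), max(x_train)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for a constant feature

    def bin_of(value):
        # Clamp so that unseen values fall into the first or last bin.
        return min(max(int((value - lo) / width), 0), n_bins - 1)

    buckets = {}
    for xv, yv in zip(x_train, y_train):
        buckets.setdefault(bin_of(xv), []).append(yv)
    medians = {b: statistics.median(values) for b, values in buckets.items()}
    fallback = statistics.median(y_train)  # used for bins with no training rows
    return lambda value: medians.get(bin_of(value), fallback)


def pps_like_score(x, y, train_fraction=0.75):
    """Score in [0, 1]: how much better the model is than a naive baseline."""
    cut = int(len(x) * train_fraction)
    x_train, x_test = x[:cut], x[cut:]
    y_train, y_test = y[:cut], y[cut:]

    # Naive baseline: always predict the training median of the target.
    naive = statistics.median(y_train)
    baseline_mae = mean_absolute_error(y_test, [naive] * len(y_test))

    # Model that only sees the single feature.
    predict = fit_binned_median(x_train, y_train)
    model_mae = mean_absolute_error(y_test, [predict(v) for v in x_test])

    if baseline_mae == 0 or model_mae >= baseline_mae:
        return 0.0  # the feature does not help at all
    return 1.0 - model_mae / baseline_mae


# Quadratic toy data: y is the square of x plus uniform noise.
rng = random.Random(0)
x = [rng.uniform(-2, 2) for _ in range(4000)]
y = [xv * xv + rng.uniform(-0.5, 0.5) for xv in x]

score_xy = pps_like_score(x, y)  # x predicting y
score_yx = pps_like_score(y, x)  # y predicting x
print(f"x -> y: {score_xy:.2f}")
print(f"y -> x: {score_yx:.2f}")
```

Swapping the arguments is all it takes to score the opposite direction, which is what makes a score built this way asymmetric.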

  • When the target is categorical, we can use a decision tree classifier and calculate the weighted F1 score.
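For readers who have not met the weighted F1 before, here is a small pure-Python sketch of the metric: the F1 of each class, averaged with weights proportional to how often the class appears in the true labels. The labels and predictions are made-up toy values, and the majority-class baseline is only an assumption about what a naive comparison could look like.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Toy labels, invented for illustration only.
y_true = ["a", "a", "a", "b"]
model_pred = ["a", "a", "b", "b"]      # predictions of some hypothetical model
majority_pred = ["a"] * len(y_true)    # naive baseline: always the most common class

model_f1 = weighted_f1(y_true, model_pred)
baseline_f1 = weighted_f1(y_true, majority_pred)
print(round(model_f1, 3), round(baseline_f1, 3))
```

A feature is only interesting when the model's weighted F1 clearly beats that of the naive baseline.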

PPS vs. Correlation

To get a better idea of the PPS and how it differs from correlation, let's compare the two on the same data, looking at both directions: x predicting y and y predicting x.

import pandas as pd
import numpy as np
import ppscore as pps
df = pd.DataFrame()
df
df["x"] = np.random.uniform(-2, 2, 10000)
df.head()
df["error"] = np.random.uniform(-0.5, 0.5, 10000)
df.head()
df["y"] = df["x"] * df["x"] + df["error"]
df.head()
df["x"].corr(df["y"])
-0.0115046561021449
df.corr()
pps.score(df, "x", "y")
{'x': 'x',
'y': 'y',
'ppscore': 0.675090383548477,
'case': 'regression',
'is_valid_score': True,
'metric': 'mean absolute error',
'baseline_score': 1.025540102508908,
'model_score': 0.33320784136182485,
'model': DecisionTreeRegressor()}
pps.score(df, "y", "x")
{'x': 'y',
'y': 'x',
'ppscore': 0,
'case': 'regression',
'is_valid_score': True,
'metric': 'mean absolute error',
'baseline_score': 1.0083196087945172,
'model_score': 1.1336852173737795,
'model': DecisionTreeRegressor()}
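The two outputs above are consistent with a simple normalization: one minus the ratio of the model's mean absolute error to the baseline's, floored at 0 when the model is no better than the baseline. That the library computes it in exactly this way is an assumption here, and the function name is hypothetical, but the arithmetic can be checked against the reported numbers:

```python
def normalized_mae_score(model_mae, baseline_mae):
    """0 if the model is no better than the baseline, else 1 - model / baseline."""
    if model_mae >= baseline_mae:
        return 0.0
    return 1.0 - model_mae / baseline_mae


# model_score and baseline_score copied from the two outputs above.
x_to_y = normalized_mae_score(0.33320784136182485, 1.025540102508908)
y_to_x = normalized_mae_score(1.1336852173737795, 1.0083196087945172)

print(round(x_to_y, 4))  # 0.6751
print(y_to_x)            # 0.0
```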
pps.predictors(df, "y")
pps.predictors(df, "x")
pps.matrix(df)

Analyzing & visualizing results

The example above illustrates two things: non-linear effects and asymmetry. It uses a typical quadratic relationship: the feature x is a uniform variable ranging from -2 to 2, and the target y is the square of x plus some error. In this case, x predicts y very well because there is a clear non-linear, quadratic relationship; this is how we generated the data, after all. However, this is not true in the other direction, from y to x. For example, if y is 4, it is impossible to predict whether x was approximately 2 or -2. That is why the PPS of x predicting y is about 0.68 while the PPS of y predicting x is 0, even though the correlation between the two columns is close to zero.
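The near-zero correlation is not an accident of the sample. As a short derivation, write the data-generating process as Y = X² + ε, where X is uniform on [-2, 2] and ε is an independent error with mean zero (the symbol ε is introduced here for the "error" column). Because X is symmetric around zero, both E[X] and E[X³] vanish:

```latex
\begin{aligned}
\operatorname{Cov}(X, Y)
  &= \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] \\
  &= \mathbb{E}[X^3] + \mathbb{E}[X]\,\mathbb{E}[\varepsilon] - \mathbb{E}[X]\,\mathbb{E}[Y] \\
  &= 0
\end{aligned}
```

So the population Pearson correlation is exactly 0, and the small value measured on the 10,000 rows is sampling noise.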

import seaborn as sns
predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
matrix_df
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

Example with Categorical Features

Comparing the correlation matrix with the PPS matrix of the Titanic data set will give you some new ideas.


Disclosure

PPS clearly has some advantages over correlation in finding predictive patterns in the data. However, once patterns are found, correlation is still a great way to communicate the linear relationships found. Therefore, you can use the PPS matrix as an alternative to the correlation matrix to detect and understand linear or non-linear patterns in your data.

Limitations

  • The calculation is slower than the correlation (matrix).
  • The score cannot be interpreted as easily as the correlation because it does not tell you anything about the type of relationship that was found. Therefore, the PPS is better at finding patterns but the correlation is better at communicating the linear relationships found.
  • You cannot compare the scores of different target variables in a strictly mathematical way because they are calculated using different evaluation metrics. Scores are still valuable in the real world, but this must be kept in mind.
  • The components used under the hood, such as the decision tree models, have limitations of their own.

Conclusions

  • In addition to your usual feature selection mechanism, you can use the PPS to find good predictors for your target column.
  • You can also remove features that only add random noise. Those features sometimes still score high on feature importance metrics.
  • You can remove features that can be predicted by other features, because they do not add new information.
  • You can identify pairs of mutually predictive features in the PPS matrix; this includes strongly correlated features but will also detect non-linear relationships.
  • Detect leakage: Use the PPS matrix to detect leakage between variables — even if the leakage is mediated by other variables.
  • Data normalization: Find entity structures in the data by interpreting the PPS matrix as a directed graph. This can be surprising when the data contain latent structures that were previously unknown. For example: the TicketID in the Titanic data set is often a flag for a
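The matrix-based checks in the list above can be sketched on plain records that use the same 'x', 'y' and 'ppscore' keys shown in the outputs earlier. Everything else here is hypothetical: the column names, the toy scores, the helper names and the thresholds are invented for illustration and are not recommendations.

```python
def weak_predictors(rows, target, threshold=0.05):
    """Features whose PPS for `target` is at or below `threshold`."""
    return sorted(
        r["x"] for r in rows
        if r["y"] == target and r["x"] != target and r["ppscore"] <= threshold
    )


def mutually_predictive_pairs(rows, threshold=0.8):
    """Column pairs that predict each other well in both directions."""
    score = {(r["x"], r["y"]): r["ppscore"] for r in rows}
    pairs = set()
    for (a, b), s in score.items():
        # a < b keeps each unordered pair only once.
        if a < b and s >= threshold and score.get((b, a), 0.0) >= threshold:
            pairs.add((a, b))
    return sorted(pairs)


# Made-up scores for four hypothetical columns.
rows = [
    {"x": "a", "y": "target", "ppscore": 0.60},
    {"x": "b", "y": "target", "ppscore": 0.55},
    {"x": "noise", "y": "target", "ppscore": 0.00},
    {"x": "a", "y": "b", "ppscore": 0.90},
    {"x": "b", "y": "a", "ppscore": 0.85},
    {"x": "target", "y": "a", "ppscore": 0.10},
    {"x": "noise", "y": "a", "ppscore": 0.00},
]

print(weak_predictors(rows, "target"))   # ['noise']
print(mutually_predictive_pairs(rows))   # [('a', 'b')]
```

With real data, records of this shape could be obtained from the matrix DataFrame, for example with pandas' to_dict("records"). Candidates flagged this way are worth a closer look before anything is dropped.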

Data Scientist. ML Engineer. Co-founder at DataSource.ai, Linkedin https://www.linkedin.com/in/danielmorales1/, Twitter https://twitter.com/daniel_moralesp