
Memory Error & NIV Dictionary Query #19

Open · khaashif opened this issue Apr 24, 2019 · 3 comments

@khaashif

Hi,

I was hoping you guys could please help me!

My dataset is around 270k rows with around 60 variables (a 60 MB file). When I try to call NIV, I run into a memory error. To work around this, I've run NIV successfully on a sample of at most ~180k rows, then used that sample's NIV dictionary to select all variables with an NIV higher than, for example, 0.03. I then select those variables from my full 270k dataset and build my model on that.
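In code, the workflow looks roughly like this (a minimal sketch: the column names are placeholders, and `niv_dict` is a stand-in for however your pylift version actually exposes the per-variable NIV values):

```python
import pandas as pd
from pylift import TransformedOutcome

df = pd.read_csv('data.csv')  # ~270k rows, ~60 variables

# 1. Run NIV on a sample small enough to fit in memory.
sample = df.sample(n=180_000, random_state=42)
up = TransformedOutcome(sample, col_treatment='Treatment', col_outcome='Outcome')
up.NIV()

# 2. Keep the variables whose NIV clears the threshold.
#    `niv_dict` is hypothetical; substitute the actual attribute name.
selected = [feat for feat, niv in up.niv_dict.items() if niv > 0.03]

# 3. Build the model on the full dataset, restricted to those variables.
up_full = TransformedOutcome(df[selected + ['Treatment', 'Outcome']],
                             col_treatment='Treatment', col_outcome='Outcome')
up_full.randomized_search()
up_full.fit(**up_full.rand_search_.best_params_)
up_full.plot(plot_type='cgains')
```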

However, my cumulative gains plot always ends up with a negative correlation, with the cgains line plotted below the random-selection line.

My guess is that this problem comes from one of two things:

  • Using NIV computed on a sample of the dataset has biased the NIV calculation, so it surfaces variables that aren't actually good predictors of uplift.
  • OR, my dataset is rubbish.

Possible solutions/questions I have:

  • How can I tackle the memory issue described above? (This would then let me select variables based on NIV calculated on the entire 270k dataset; one generic mitigation is sketched after this list.)
  • The NIV values in the dictionary don't seem to match the bar plot of NIV (assuming it's a bar plot with error bars and not a box plot). So my question is: what is the value recorded in the dictionary vs. the values/bars plotted on the NIV graph?
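For the memory point, one generic way to shrink the frame before calling NIV (plain pandas, nothing pylift-specific) is to downcast the numeric columns to the smallest dtypes that hold their values:

```python
import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include='float').columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    for col in out.select_dtypes(include='integer').columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    return out

df = downcast(pd.read_csv('data.csv'))
print(df.memory_usage(deep=True).sum() / 1e6, 'MB')  # often several times smaller
```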

Are there any other solutions you would recommend? Apologies, I am fairly new to machine learning in general, so I'm still learning a lot!

Many thanks for your help in advance; it's much appreciated!

Khaashif

@rsyi
Contributor

rsyi commented May 19, 2019

I'd have to defer to the expertise of @WTFrost on the memory requirements of NIV, as he built the module. I'll look into what the numbers mean and get back to you.

Because the NIV and NWOE modules are really just crude dimensional cuts for EDA, I'd actually recommend not using them for feature selection. To be honest, though, we never really came up with a great way of selecting features. I'd often just look at the feature importances from xgboost after building a first model with all the features (using both "gain" and "weight"), and pick the most important of these from each category, roughly as sketched below.
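For concreteness, a minimal sketch of that approach. The transformed-outcome target and the column names here are illustrative, not pylift internals; if you fit through pylift, you could pull the underlying booster from its fitted model instead:

```python
import pandas as pd
from xgboost import XGBRegressor

features = [c for c in df.columns if c not in ('Treatment', 'Outcome')]

# Transformed outcome: z = y * (T - p) / (p * (1 - p)), where p = P(T = 1).
p = df['Treatment'].mean()
z = df['Outcome'] * (df['Treatment'] - p) / (p * (1 - p))

model = XGBRegressor(n_estimators=200).fit(df[features], z)
booster = model.get_booster()

# Rank features by both importance types and compare the top of each list.
gain = pd.Series(booster.get_score(importance_type='gain')).sort_values(ascending=False)
weight = pd.Series(booster.get_score(importance_type='weight')).sort_values(ascending=False)
print(gain.head(10), weight.head(10), sep='\n\n')
```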

If any uplift exists, you can usually find it by building an outcome model on the treatment group. I'd try that first and make sure you can get a positive cumulative gains curve when ranking by this outcome prediction. If you can't, then it probably is a problem with your dataset (your features may just not be predictive enough). Outcome models built this way should be able to find all the "persuadables", but you'll end up mixing some "sure things" in with them. As long as the population contains some "lost causes" or "sleeping dogs", such a model should at least be able to pick those out. These models are also a lot more stable to build than uplift models; you just risk overspending if you use them for targeting. You can use the UpliftEval class to evaluate the performance of such a model, as in the sketch below.
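A sketch of that baseline. UpliftEval's positional arguments (treatment, outcome, prediction) and the 'cgains' plot type are assumed from the pylift docs; check them against your installed version:

```python
from xgboost import XGBClassifier
from pylift.eval import UpliftEval

features = [c for c in df.columns if c not in ('Treatment', 'Outcome')]
treated = df[df['Treatment'] == 1]

# Plain outcome model, trained on the treatment group only.
clf = XGBClassifier(n_estimators=200).fit(treated[features], treated['Outcome'])

# Score the whole population and check the cumulative gains curve.
scores = clf.predict_proba(df[features])[:, 1]
upev = UpliftEval(df['Treatment'], df['Outcome'], scores)
upev.plot(plot_type='cgains')  # should sit above the random line if any signal exists
```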

@rsyi
Contributor

rsyi commented May 19, 2019

Ah! There was a bug in the NIV() routine causing the incorrect numbers. Sorry about that. Should be fixed now.

I'll look into the memory issue now.

@narmin-a

Hi! Great package, thanks for open-sourcing it.
The model predicts well on small datasets, but I'm running into a memory error with the up.randomized_search() command on larger datasets (>200K rows, 10 features). Any suggestions on how to solve it, or why it's happening?
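One thing worth trying, assuming up.randomized_search() forwards keyword arguments to sklearn's RandomizedSearchCV (worth confirming in your version): fewer parallel workers and folds mean fewer in-memory copies of the training data.

```python
# Hypothetical mitigation: cap parallelism and folds to cut peak memory.
# Assumes these kwargs are passed through to sklearn's RandomizedSearchCV.
up.randomized_search(n_iter=20, cv=3, n_jobs=1)
```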
