Tuesday, February 14, 2017

How to solve Scikit-learn Deprecation Warning on cross_validation

When using the Scikit-learn library and trying out various examples found over the web, have you come across a DeprecationWarning for the cross_validation module?

The DeprecationWarning on cross_validation

This most commonly happens when the code you're trying to run uses the train_test_split() function, a handy function for quickly splitting a main dataset into training and test sets. The full warning message looks something like this:

 C:\Users\Thimira\Anaconda3\envs\tensorflow12\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
   "This module will be removed in 0.20.", DeprecationWarning)

So, how do you solve this?

The first thing to note is that it's a 'deprecation warning'. It means that the cross_validation module has been deprecated: it is scheduled for removal in a future release, and you are advised against using it. As the message mentions, the module will be removed in Scikit-learn v0.20. So, technically, you can ignore this warning and keep using the module until 0.20 is released and you upgrade to it.
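If you do decide to live with the old module for a while, one way to keep the console clean is Python's standard warnings module. The following is only a minimal sketch, not something Scikit-learn itself recommends: call_quietly() and legacy_function() are hypothetical names made up for this illustration, and legacy_function() merely stands in for any code that raises a DeprecationWarning.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Call func with DeprecationWarning silenced, then restore the old filters."""
    with warnings.catch_warnings():
        # The "ignore" filter only applies inside this with-block.
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return func(*args, **kwargs)


def legacy_function():
    """Hypothetical stand-in for deprecated code that still works."""
    warnings.warn("This module will be removed in 0.20.", DeprecationWarning)
    return "still works"


result = call_quietly(legacy_function)
```

Since the cross_validation warning is raised when the module is imported, the same with-block would have to wrap the import statement itself. Treat this as a stopgap only: it hides the message, but the module will still disappear in 0.20.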

The message also tells you what to use instead: the model_selection module. If you look at the documentation for the model_selection module, you can see that it also has a train_test_split() function, which takes the same parameters as the one in the cross_validation module.

So, all you basically need to do is change this:

 from sklearn.cross_validation import train_test_split  
 from sklearn import datasets  
 import numpy as np  
 ...  
 ...  
   
 (trainData, testData, trainLabels, testLabels) = train_test_split(  
   data / 255.0, dataset.target.astype("int"), test_size=0.33)  
 ...

into this:

 from sklearn.model_selection import train_test_split  
 from sklearn import datasets  
 import numpy as np  
 ...  
 ...  
   
 (trainData, testData, trainLabels, testLabels) = train_test_split(  
   data / 255.0, dataset.target.astype("int"), test_size=0.33)  
 ...

Notice that only the import has changed (from sklearn.model_selection import train_test_split); the usage of the function hasn't changed.
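If the same script has to run on machines with different Scikit-learn versions, the import can be made to fall back from the new module to the old one. Here is a rough sketch of that idea using the standard importlib module; import_first() is a hypothetical helper written for this post, not part of Scikit-learn.

```python
import importlib


def import_first(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This module is not available here; try the next candidate.
            continue
    raise ImportError("none of these modules could be imported: %r" % (module_names,))
```

Assuming Scikit-learn is installed, you would call it as import_first(["sklearn.model_selection", "sklearn.cross_validation"]) and take train_test_split from the returned module. Because the new module is tried first, the deprecated one is only touched on versions older than 0.18, where model_selection does not exist yet.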

Related links:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html