scikit-learn: machine learning in Python scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation

One Hot Encoding for Machine learning

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to prediction accuracy, compared to using the original matrix itself as training data. How does this performance increase happen? Can someone give an explanation on this.

Source: (StackOverflow)

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working in a sentiment analysis problem the data looks like this:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

So my data is unbalanced since 1190 instances are labeled with 5. For the classification Im using scikit's SVC. The problem is I do not know how to balance my data in the right way in order to compute accurately the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:


    wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
    wclf.fit(X, y)
    weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
print 'Precision:', precision_score(y_test, weighted_prediction,
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)


auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)

print 'F1 score:', f1_score(y_test, auto_weighted_prediction,

print 'Recall:', recall_score(y_test, auto_weighted_prediction,

print 'Precision:', precision_score(y_test, auto_weighted_prediction,

print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)

print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)


clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)

from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".

However, Im getting warnings like this:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The defaultweightedaveraging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value foraverage, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1". How can I deal correctly with my unbalanced data in order to compute in the right way classifier's metrics?

Source: (StackOverflow)

how to extract the decision rules from scikit-learn decision-tree?

Can I extract the underlying decision-rules (or 'decision paths') from a trained tree in a decision tree - as a textual list ?
something like: "if A>0.4 then if B<0.2 then if C>0.8 then class='X' etc...
If anyone knows of a simple way to do so, it will be very helpful.

Source: (StackOverflow)

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, which attributes/dates contribute to the result to what extent. Therefore I am just using the *feature_importances_*, which works well for me.

However, I would like to know how they are getting calculated and which measure/algorithm is used.

Unfortunately I could not find any documentation on this topic.

Regards and thanks in advance, Ingmar

Source: (StackOverflow)

How to get most informative features for scikit-learn classifiers?

The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0

My question is if something similar is implemented for the classifiers in scikit-learn. I searched the documentation, but couldn't find anything the like.

If there is no such function yet, does somebody know a workaround how to get to those values?

Thanks alot!

Source: (StackOverflow)

scikit's GridSearch and Python in general are not freeing memory

I made some weird observations that my GridSearches keep failing after a couple of hours and I initially couldn't figure out why. I monitored the memory usage then over time and saw that it it started with a few gigabytes (~6 Gb) and kept increasing until it crashed the node when it reached the max. 128 Gb the hardware can take. I was experimenting with random forests for classification of a large number of text documents. For simplicity -- to figure out what's going on -- I went back to naive Bayes.

The versions I am using are

  • Python 3.4.2
  • scikit-learn 0.15.2

I found some related discussion on the scikit-issue list on GitHub about this topic: https://github.com/scikit-learn/scikit-learn/issues/565 and https://github.com/scikit-learn/scikit-learn/pull/770

And it sounds like it was already successfully addressed!

So, the relevant code that I am using is

grid_search = GridSearchCV(pipeline, 
                           n_jobs=1, # 
                           refit=False)  # tried both True and False

grid_search.fit(X_train, y_train)  
print('Best score: {0}'.format(grid_search.best_score_))  
print('Best parameters set:') 

Just out of curiosity, I later decided to do the grid search the quick & dirty way via nested for loop

for p1 in parameterset1:
    for p2 in parameterset2:
            pipeline = Pipeline([
                        ('vec', CountVectorizer(
                         ('tfidf', TfidfTransformer(
                         ('clf', MultinomialNB())])

            scores = cross_validation.cross_val_score(

           params_dict[i][1] = '%s,%0.4f,%0.4f' % (params_dict[i][1], scores.mean(), scores.std())
           sys.stdout.write(params_dict[i][1] + '\n')

So far so good. The grid search runs and writes the results to stdout. However, after some time it exceeds the memory cap of 128 Gb again. Same problem as with the GridSearch in scikit. After some experimentation, I finally found out that

len(gc.get_objects()) # particularly this part!

in the for loop solves the problem and the memory usage stays constantly at 6.5 Gb over the run time of ~10 hours.

Eventually, I got it to work with the above fix, however, I am curious to hear your ideas about what might be causing this issue and your tips & suggestions!

Source: (StackOverflow)

fastest SVM implementation usable in python

I'm building some predictive models in python and have been using scikits learn's SVM implementation. It's been really great, easy to use, and relatively fast. Unfortunately, I'm beginning to become constrained by my runtime. I run a rbf SVM on a full dataset of about 4 - 5000 with 650 features. Each run takes about a minute. But with a 5 fold cross validation + grid search (using a coarse to fine search), it's getting a bit unfeasible for my task at hand. So generally, do people have any recommendations in terms of the fastest SVM implementation that can be used in python? That, or any ways to speed up my modeling?

I've heard of LIBSVM's GPU implementation, which seems like it could work. I don't know of any other GPU SVM implementations usable in python, but would definitely be open to others. Also, does using the GPU significantly increase runtime?

I've also heard that there are ways of approximating the rbf SVM by using a linear SVM + feature map in scikits. Not sure what people think about this approach. Again, anyone using this approach, is it a significant increase in runtime?

All ideas for increasing the speed of program is most welcome. Thanks!

Source: (StackOverflow)

How to convert a pandas DataFrame subset of columns AND rows into a numpy array?

I'm wondering if there is a simpler, memory efficient way to select a subset of rows and columns from a pandas DataFrame.

For instance, given this dataframe:

df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
print df

          a         b         c         d         e
0  0.945686  0.000710  0.909158  0.892892  0.326670
1  0.919359  0.667057  0.462478  0.008204  0.473096
2  0.976163  0.621712  0.208423  0.980471  0.048334
3  0.459039  0.788318  0.309892  0.100539  0.753992

I want only those rows in which the value for column 'c' is greater than 0.5, but I only need columns 'b' and 'e' for those rows.

This is the method that I've come up with - perhaps there is a better "pandas" way?

locs = [df.columns.get_loc(_) for _ in ['a', 'd']]
print df[df.c > 0.5][locs]

          a         d
0  0.945686  0.892892

My final goal is to convert the result to a numpy array to pass into an sklearn regression algorithm, so I will use the code above like this:

training_set = array(df[df.c > 0.5][locs])

... and that peeves me since I end up with a huge array copy in memory. Perhaps there's a better way for that too?

Source: (StackOverflow)

use scikit-learn to classify into multiple categories

Im trying to use on of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms i tried just returns one match.

For example I have a piece of text "Theaters in New York compared to those in London" And I have trained the algorithm to pick a place for every text snippet i feed it.

In the above example I would want it to return New York and London, but it only returns New York.

Is it possible to use Scikit-learn to return multiple results? Or even return the label with the next highest probability?

Thanks for your help


I tried using OneVsRestClassifier but I still only get one option back per piece of text. Below is the sample code i am using

y_train = ('New York','London')

train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()

base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

Source: (StackOverflow)

sklearn agglomerative clustering linkage matrix

I'm trying to draw a complete-link scipy.cluster.hierarchy.dendrogram, and I found that scipy.cluster.hierarchy.linkage is slower than sklearn.AgglomerativeClustering.

However, sklearn.AgglomerativeClustering doesn't return the distance between clusters and the number of original observations, which scipy.cluster.hierarchy.dendrogram needs. Is there a way to take them?

Source: (StackOverflow)

Run an OLS regression with Pandas Data Frame

I have a pandas data frame and I would like to able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df) but when I look at the examples from algorithm libraries like scikit-learn it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

Source: (StackOverflow)

How to get SVMs to play nicely with missing data in scikit-learn?

I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by NA). I load the data in with genfromtxt with dtype='f8' and go about training my classifier.

The classification is fine on RandomForestClassifier and GradientBoostingClassifier objects, but using SVC from sklearn.svm causes the following error:

    probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
    X = self._validate_for_predict(X)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
    X = atleast2d_or_csr(X, dtype=np.float64, order="C")
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
    raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity

What gives? How can I make the SVM play nicely with the missing data? Keeping in mind that the missing data works fine for random forests and other classifiers..

Source: (StackOverflow)

Distinguishing overfitting vs good prediction

These are questions on how to calculate & reduce overfitting in machine learning. I think many new to machine learning will have the same questions, so I tried to be clear with my examples and questions in hope that answers here can help others.

I have a very small sample of texts and I'm trying to predict values associated with them. I've used sklearn to calculate tf-idf, and insert those into a regression model for prediction. This gives me 26 samples with 6323 features - not a lot.. I know:

>> count_vectorizer = CountVectorizer(min_n=1, max_n=1)
>> term_freq = count_vectorizer.fit_transform(texts)
>> transformer = TfidfTransformer()
>> X = transformer.fit_transform(term_freq) 
>> print X.shape

(26, 6323)

Inserting those 26 samples of 6323 features (X) and associated scores (y), into a LinearRegression model, gives good predictions. These are obtained using leave-one-out cross validation, from cross_validation.LeaveOneOut(X.shape[0], indices=True) :

using ngrams (n=1):
     human  machine  points-off  %error
      8.67    8.27    0.40       1.98
      8.00    7.33    0.67       3.34
      ...     ...     ...        ...
      5.00    6.61    1.61       8.06
      9.00    7.50    1.50       7.50
mean: 7.59    7.64    1.29       6.47
std : 1.94    0.56    1.38       6.91

Pretty good! Using ngrams (n=300) instead of unigrams (n=1), similar results occur, which is obviously not right. No 300-words occur in any of the texts, so the prediction should fail, but it doesn't:

using ngrams (n=300):
      human  machine  points-off  %error
       8.67    7.55    1.12       5.60
       8.00    7.57    0.43       2.13
       ...     ...     ...        ...
mean:  7.59    7.59    1.52       7.59
std :  1.94    0.08    1.32       6.61

Question 1: This might mean that the prediction model is overfitting the data. I only know this because I chose an extreme value for the ngrams (n=300) which I KNOW can't produce good results. But if I didn't have this knowledge, how would you normally tell that the model is over-fitting? In other words, if a reasonable measure (n=1) were used, how would you know that the good prediction was a result of being overfit vs. the model just working well?

Question 2: What is the best way of preventing over-fitting (in this situation) to be sure that the prediction results are good or not?

Question 3: If LeaveOneOut cross validation is used, how can the model possibly over-fit with good results? Over-fitting means the prediction accuracy will suffer - so why doesn't it suffer on the prediction for the text being left out? The only reason I can think of: in a tf-idf sparse matrix of mainly 0s, there is strong overlap between texts because so many terms are 0s - the regression then thinks the texts correlate highly.

Please answer any of the questions even if you don't know them all. Thanks!

Source: (StackOverflow)

How is Elastic Net used?

This is a beginner question on regularization with regression. Most information about Elastic Net and Lasso Regression online replicates the information from Wikipedia or the original 2005 paper by Zou and Hastie (Regularization and variable selection via the elastic net).

Resource for simple theory? Is there a simple and easy explanation somewhere about what it does, when and why reguarization is neccessary, and how to use it - for those who are not statistically inclined? I understand that the original paper is the ideal source if you can understand it, but is there somewhere that more simply the problem and solution?

How to use in sklearn? Is there a step by step example showing why elastic net is chosen (over ridge, lasso, or just simple OLS) and how the parameters are calculated? Many of the examples on sklearn just include alpha and rho parameters directly into the prediction model, for example:

from sklearn.linear_model import ElasticNet
alpha = 0.1
enet = ElasticNet(alpha=alpha, rho=0.7)
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)

However, they don't explain how these were calculated. How do you calculate the parameters for the lasso or net?

Source: (StackOverflow)

Mixing categorial and continuous data in Naive Bayes classifier using scikit-learn

I'm using scikit-learn in Python to develop a classification algorithm to predict gender of a certain customers. Amongst others I want to use the Naive Bayes classifier but my problem is that I have a mix of categorial data (ex: "Registered online", "Accepts email notifications" etc) and continuous data (ex: "Age", "Length of membership" etc). I haven't used scikit much before but I suppose that that Gaussian Naive Bayes is suitable for continuous data and that Bernouilli Naive Bayes can be used for categorial data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

Source: (StackOverflow)