scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation

I have been experimenting with scikit-learn, making decision trees with both entropy and gini splitting criteria, and exploring the differences.

My question is: how can I "open the hood" and find out exactly which attributes the trees are splitting on at each level, along with their associated information values, so I can see where the two criteria make different choices?

So far, I have explored the 9 methods outlined in the documentation. They don't appear to allow access to this information. But surely this information is accessible? I'm envisioning a list or dict that has entries for node and gain.

Thanks for your help and my apologies if I've missed something completely obvious.
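A minimal sketch of the kind of access in question, reading the per-node split information from the fitted `tree_` attribute (the iris toy dataset is used purely for illustration, and the attribute names assume a reasonably recent scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(iris.data, iris.target)

tree = clf.tree_  # the low-level Tree object behind the estimator

# For every internal node: the feature index it splits on, the split
# threshold, and the node impurity (entropy here, gini for the other criterion).
for node in range(tree.node_count):
    if tree.children_left[node] != tree.children_right[node]:  # internal node
        print("node %d: feature %d, threshold %.3f, impurity %.3f"
              % (node, tree.feature[node], tree.threshold[node],
                 tree.impurity[node]))
```

Comparing these arrays for an entropy-fitted and a gini-fitted tree shows exactly where the two criteria diverge.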

Source: (StackOverflow)

I'm using the current stable version 0.13 of scikit-learn. I'm applying a linear support vector classifier to some data using the class `sklearn.svm.LinearSVC`.

In the chapter about preprocessing in scikit-learn's documentation, I've read the following:

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

**Question 1:** Is standardization useful for SVMs in general, also for those with a linear kernel function as in my case?

**Question 2:** As far as I understand, I have to compute the mean and standard deviation on the training data and apply this same transformation to the test data using the class `sklearn.preprocessing.StandardScaler`. However, what I don't understand is whether I have to transform the training data as well or just the test data prior to feeding it to the SVM classifier.

That is, do I have to do this:

```
scaler = StandardScaler()
scaler.fit(X_train) # only compute mean and std here
X_test = scaler.transform(X_test) # perform standardization by centering and scaling
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)
```

Or do I have to do this:

```
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # compute mean, std and transform training data as well
X_test = scaler.transform(X_test) # same as above
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)
```

In short, do I have to use `scaler.fit(X_train)` or `scaler.fit_transform(X_train)` on the training data in order to get reasonable results with `LinearSVC`?
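For what it's worth, a small self-contained sketch of the second pattern (learn the statistics on the training data and transform both sets with them; the random data here is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy training data
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))    # toy test data

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std on train AND transform it
X_test_s = scaler.transform(X_test)        # reuse the *training* statistics on test

# The training data is now exactly centered and unit-variance; the test data
# is only approximately so, because it was scaled with the train statistics.
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```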

Source: (StackOverflow)

I am using `sklearn.svm.SVC` from scikit-learn to do binary classification. I am using its `predict_proba()` method to get probability estimates. Can anyone tell me how `predict_proba()` calculates the probability internally?
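For context, the documented mechanism is Platt scaling: the underlying libsvm fits a sigmoid to the decision values via an internal cross-validation, which is why `probability=True` must be set at construction time. A minimal illustration (iris is just a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True triggers the extra cross-validated Platt-scaling fit;
# without it, predict_proba raises an error.
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])  # one probability per class, rows sum to 1
print(proba)
```

Note that because of the internal cross-validation, these probabilities need not be perfectly consistent with `predict`.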

Source: (StackOverflow)

I am getting the following error while trying to import from sklearn:

```
>>> from sklearn import svm
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
from sklearn import svm
File "C:\Python27\lib\site-packages\sklearn\__init__.py", line 16, in <module>
from . import check_build
ImportError: cannot import name check_build
```

I am using Python 2.7 with the scipy-0.12.0b1 and numpy-1.6.0 superpacks and scikit-learn 0.11 on a Windows 7 machine.

I have checked several answers for this issue but none of them gives a way out of this error.

Source: (StackOverflow)

I know that the computation in scikit-learn is based on NumPy so everything is a matrix or array.

How does this package handle mixed data (numerical and nominal values)?

For example, a product could have the attributes 'color' and 'price', where color is nominal and price is numerical. I notice there is a transformer called `DictVectorizer` to encode the nominal data. For example, two products are:

```
products = [{'color':'black','price':10}, {'color':'green','price':5}]
```

And the result from `DictVectorizer` could be:

```
[[1,0,10],
[0,1,5]]
```

If there are lots of different values for the attribute 'color', the matrix would be very sparse, and long feature vectors will degrade the performance of some algorithms, such as decision trees.

**Is there any way to use the nominal value without the need to create dummy codes?**
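As a concrete check of the dummy-coding behaviour described above, a minimal `DictVectorizer` sketch with the two products:

```python
from sklearn.feature_extraction import DictVectorizer

products = [{'color': 'black', 'price': 10},
            {'color': 'green', 'price': 5}]

# String-valued attributes are one-hot encoded; numeric ones pass through.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(products)

print(vec.feature_names_)  # ['color=black', 'color=green', 'price']
print(X)
```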

Source: (StackOverflow)

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have **both** categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!
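One commonly suggested workaround, sketched here under the naive-independence assumption with made-up random data, is to fit a `GaussianNB` on the continuous columns and a `BernoulliNB` on the categorical ones, then combine their class probabilities:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.RandomState(0)
X_cont = rng.normal(size=(100, 2))         # toy stand-ins for age, membership length
X_cat = rng.randint(0, 2, size=(100, 2))   # toy stand-ins for the binary attributes
y = rng.randint(0, 2, size=100)

g = GaussianNB().fit(X_cont, y)
b = BernoulliNB().fit(X_cat, y)

# Under the naive-independence assumption, the joint class probability is
# proportional to the product of the two models' posteriors divided by one
# copy of the class prior (otherwise the prior is counted twice).
prior = g.class_prior_
joint = g.predict_proba(X_cont) * b.predict_proba(X_cat) / prior
joint /= joint.sum(axis=1, keepdims=True)  # renormalize rows
pred = joint.argmax(axis=1)
```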

Source: (StackOverflow)

This is a beginner question on regularization with regression. Most information about Elastic Net and Lasso Regression online replicates the information from Wikipedia or the original 2005 paper by Zou and Hastie (Regularization and variable selection via the elastic net).

*Resource for simple theory?* Is there a simple and easy explanation somewhere about what it does, when and why regularization is necessary, and how to use it, for those who are not statistically inclined? I understand that the original paper is the ideal source if you can understand it, but is there somewhere that explains the problem and solution more simply?

*How to use in sklearn?* Is there a step-by-step example showing why elastic net is chosen (over ridge, lasso, or just simple OLS) and how the parameters are calculated? Many of the examples on sklearn just include alpha and rho parameters directly in the prediction model, for example:

```
from sklearn.linear_model import ElasticNet
alpha = 0.1
enet = ElasticNet(alpha=alpha, rho=0.7)
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
```

However, they don't explain how these were calculated. How do you calculate the parameters for the lasso or elastic net?
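For choosing the parameters by cross-validation rather than hard-coding them, scikit-learn provides `ElasticNetCV` (note that `rho` is called `l1_ratio` in newer releases). A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# ElasticNetCV searches a grid of alpha values (and the given l1_ratio
# candidates) by cross-validation and keeps the best pair.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9], cv=5).fit(X, y)
print(enet.alpha_, enet.l1_ratio_)  # the selected parameters
```

`LassoCV` and `RidgeCV` play the same role for the other two regularizers.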

Source: (StackOverflow)

I'm trying to obtain the most informative features from a textual corpus. From this well-answered question I know that this task could be done as follows:

```
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print classlabel, feat, coef
```

Then:

```
most_informative_feature_for_class(tfidf_vect, clf, 5)
```

For this classifier:

```
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)
```

The problem is the output of `most_informative_feature_for_class`:

```
5 a_base_de_bien bastante (0, 2451) -0.210683496368
(0, 3533) -0.173621065386
(0, 8034) -0.135543062425
(0, 10346) -0.173621065386
(0, 15231) -0.154148294738
(0, 18261) -0.158890483047
(0, 21083) -0.297476572586
(0, 434) -0.0596263855375
(0, 446) -0.0753492277856
(0, 769) -0.0753492277856
(0, 1118) -0.0753492277856
(0, 1439) -0.0753492277856
(0, 1605) -0.0753492277856
(0, 1755) -0.0637950312345
(0, 3504) -0.0753492277856
(0, 3511) -0.115802483001
(0, 4382) -0.0668983049212
(0, 5247) -0.315713152154
(0, 5396) -0.0753492277856
(0, 5753) -0.0716096348446
(0, 6507) -0.130661516772
(0, 7978) -0.0753492277856
(0, 8296) -0.144739048504
(0, 8740) -0.0753492277856
(0, 8906) -0.0753492277856
: :
(0, 23282) 0.418623443832
(0, 4100) 0.385906085143
(0, 15735) 0.207958503155
(0, 16620) 0.385906085143
(0, 19974) 0.0936828782325
(0, 20304) 0.385906085143
(0, 21721) 0.385906085143
(0, 22308) 0.301270427482
(0, 14903) 0.314164150621
(0, 16904) 0.0653764031957
(0, 20805) 0.0597723455204
(0, 21878) 0.403750815828
(0, 22582) 0.0226150073272
(0, 6532) 0.525138162099
(0, 6670) 0.525138162099
(0, 10341) 0.525138162099
(0, 13627) 0.278332617058
(0, 1600) 0.326774799211
(0, 2074) 0.310556919237
(0, 5262) 0.176400451433
(0, 6373) 0.290124806858
(0, 8593) 0.290124806858
(0, 12002) 0.282832270298
(0, 15008) 0.290124806858
(0, 19207) 0.326774799211
```

It is not returning the labels or the words. Why is this happening, and how can I print the words and the labels? Is this happening because I am using pandas to read the data? Another thing I tried is the following, from this question:

```
def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

print_top10(tfidf_vect, clf, y)
```

But I get this traceback:

```
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
    print_top10(tfidf_vect,clf,5)
  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
    for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable
```

Any idea how to solve this, in order to get the features with the highest coefficient values?
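A hedged sketch of one way to make this work: densify the coefficient row before sorting, since `coef_` can be a sparse matrix when the training data was sparse, and zipping a sparse row with names does not iterate over scalars. The corpus and helper below are made up purely for illustration:

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

docs = ["good film great acting", "bad film poor plot",
        "great plot good story", "poor acting bad story"]
labels = [1, 0, 1, 0]

vect = TfidfVectorizer()
X = vect.fit_transform(docs)
clf = SVC(kernel="linear").fit(X, labels)

def top_features(vectorizer, classifier, row=0, n=5):
    coefs = classifier.coef_[row]
    if issparse(coefs):           # coef_ is sparse when X was sparse
        coefs = coefs.toarray()
    coefs = np.ravel(coefs)
    # recover feature names from the vocabulary (stable across versions)
    names = np.array(sorted(vectorizer.vocabulary_,
                            key=vectorizer.vocabulary_.get))
    top = np.argsort(coefs)[-n:]  # indices of the n largest coefficients
    return list(zip(names[top], coefs[top]))

print(top_features(vect, clf))
```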

Source: (StackOverflow)

These are questions on how to calculate & reduce overfitting in machine learning. I think many new to machine learning will have the same questions, so I tried to be clear with my examples and questions in hope that answers here can help others.

I have a very small sample of texts and I'm trying to predict values associated with them. I've used sklearn to calculate tf-idf and insert those values into a regression model for prediction. This gives me 26 samples with 6323 features; not a lot, I know:

```
>> count_vectorizer = CountVectorizer(min_n=1, max_n=1)
>> term_freq = count_vectorizer.fit_transform(texts)
>> transformer = TfidfTransformer()
>> X = transformer.fit_transform(term_freq)
>> print X.shape
(26, 6323)
```

Inserting those 26 samples of 6323 features (X) and associated scores (y) into a `LinearRegression` model gives good predictions. These are obtained using leave-one-out cross validation, from `cross_validation.LeaveOneOut(X.shape[0], indices=True)`:

```
using ngrams (n=1):
human machine points-off %error
8.67 8.27 0.40 1.98
8.00 7.33 0.67 3.34
... ... ... ...
5.00 6.61 1.61 8.06
9.00 7.50 1.50 7.50
mean: 7.59 7.64 1.29 6.47
std : 1.94 0.56 1.38 6.91
```

Pretty good! Using ngrams (n=300) instead of unigrams (n=1), similar results occur, which is obviously not right. No 300-word sequences occur in any of the texts, so the prediction should fail, but it doesn't:

```
using ngrams (n=300):
human machine points-off %error
8.67 7.55 1.12 5.60
8.00 7.57 0.43 2.13
... ... ... ...
mean: 7.59 7.59 1.52 7.59
std : 1.94 0.08 1.32 6.61
```

*Question 1:* Does this mean that the prediction model is over-fitting?

*Question 2:* What is the best way of preventing over-fitting (in this situation), to be sure that the prediction results are good or not?

*Question 3:* If `LeaveOneOut` cross validation is used, how can the model possibly over-fit with good results? Over-fitting means the prediction accuracy will suffer, so why doesn't it suffer on the prediction for the text being left out? The only reason I can think of: in a tf-idf sparse matrix of mainly 0s, there is strong overlap between texts because so many terms are 0s, so the regression then thinks the texts correlate highly.

Please answer any of the questions even if you don't know them all. Thanks!
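One sanity check that can expose a spurious fit, sketched here on synthetic data using the current `model_selection` names: repeat the leave-one-out evaluation with shuffled labels. If the error barely worsens, the apparently good results are not coming from real signal:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(26, 5))                     # toy stand-in: 26 samples
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=26)

loo = LeaveOneOut()
# Mean absolute error with the real labels...
real = -cross_val_score(LinearRegression(), X, y,
                        cv=loo, scoring="neg_mean_absolute_error").mean()
# ...and with the labels randomly permuted (no signal left by construction).
shuf = -cross_val_score(LinearRegression(), X, rng.permutation(y),
                        cv=loo, scoring="neg_mean_absolute_error").mean()
print(real, shuf)
```

A real signal should give a noticeably lower error than the shuffled baseline; comparable errors suggest the model is fitting noise.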

Source: (StackOverflow)

I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by `NA`). I load the data in with `genfromtxt` with `dtype='f8'` and go about training my classifier.

The classification is fine on `RandomForestClassifier` and `GradientBoostingClassifier` objects, but using `SVC` from `sklearn.svm` causes the following error:

```
  probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
    X = self._validate_for_predict(X)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
    X = atleast2d_or_csr(X, dtype=np.float64, order="C")
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
    assert_all_finite(X)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
    raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity
```

What gives? How can I make the SVM play nicely with the missing data, keeping in mind that the missing data works fine for random forests and other classifiers?
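One simple workaround (a sketch, not the only option; scikit-learn also ships an imputer transformer in recent versions) is to mean-impute the missing entries with NumPy before fitting the SVM:

```python
import numpy as np

# Toy matrix with missing entries: genfromtxt turns "NA" into nan with dtype='f8'.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each nan with the mean of its column, so SVC sees only finite values.
col_mean = np.nanmean(X, axis=0)
idx = np.where(np.isnan(X))
X[idx] = np.take(col_mean, idx[1])
print(X)
```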

Source: (StackOverflow)

I have a `pandas` data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

```
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
```

Ideally, I would have something like `ols(A ~ B + C, data=df)`, but when I look at the examples from algorithm libraries like `scikit-learn`, they appear to feed the model a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?
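For what it's worth, a minimal sketch of the row-per-sample convention with the toy frame above: the feature columns can be selected straight from the DataFrame, with no manual list-of-lists reshaping:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# scikit-learn accepts the DataFrame (or its .values array) directly:
# each row is one sample, each selected column one feature.
model = LinearRegression().fit(df[["B", "C"]], df["A"])
print(model.coef_, model.intercept_)
```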

Source: (StackOverflow)

I'm trying to draw a complete-link `scipy.cluster.hierarchy.dendrogram`, and I found that `scipy.cluster.hierarchy.linkage` is slower than `sklearn.AgglomerativeClustering`.

However, `sklearn.AgglomerativeClustering` doesn't return the distances between clusters or the number of original observations, which `scipy.cluster.hierarchy.dendrogram` needs. Is there a way to obtain them?
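One possible workaround (sketched on random data) is to compute the linkage with scipy directly, since `linkage` returns exactly the (n-1) x 4 merge table that `dendrogram` consumes: the two merged cluster ids, their distance, and the observation count:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.RandomState(0)
X = rng.normal(size=(10, 3))  # toy observations

# Complete-link clustering; Z has one row per merge:
# [cluster_id_1, cluster_id_2, distance, n_observations_in_new_cluster]
Z = linkage(X, method='complete')
print(Z.shape)

# dendrogram(Z) would then plot it directly (needs matplotlib).
```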

Source: (StackOverflow)

I was bummed out to see that scikit-learn does not support Python 3...Is there a comparable package anyone can recommend for Python 3?

Source: (StackOverflow)

I have a classification task with a time series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out which attributes/dates contribute to the result, and to what extent. Therefore I am just using `feature_importances_`, which works well for me.

However, I would like to know how the importances are calculated and which measure/algorithm is used.

Unfortunately I could not find any documentation on this topic.

Regards and thanks in advance, Ingmar

Source: (StackOverflow)

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried returns just one match.

For example, I have a piece of text "Theaters in New York compared to those in London", and I have trained the algorithm to pick a place for every text snippet I feed it.

In the above example I would want it to return New York and London, but it only returns New York.

Is it possible to use Scikit-learn to return multiple results? Or even return the label with the next highest probability?

Thanks for your help

---Update

I tried using OneVsRestClassifier but I still only get one option back per piece of text. Below is the sample code I am using:

```
y_train = ('New York','London')
train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')
X_vectorized = count.transform(train_set).todense()
smatrix2 = count.transform(test_set).todense()
base_clf = MultinomialNB(alpha=1)
clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred
```

Result: ['New York' 'London' 'London']
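A hedged reworking of the sample, using `ngram_range` (which replaced `WordNGramAnalyzer` in later releases) and `MultiLabelBinarizer` (which may not exist in the oldest versions): the key change is giving each training sample a *list* of labels, so the one-vs-rest scheme can predict several at once:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

train_set = ["new york nyc big apple", "london uk great britain"]
y_train = [["New York"], ["London"]]  # each sample gets a *list* of labels

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train)        # binary indicator matrix, one column per label

vect = CountVectorizer(ngram_range=(1, 2))
X = vect.fit_transform(train_set)

clf = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, Y)

test_set = ["nice day in nyc", "london town",
            "hello welcome to the big apple. enjoy it here and london too"]
pred = clf.predict(vect.transform(test_set))
print(mlb.inverse_transform(pred))    # may return several labels per text
```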

Source: (StackOverflow)