scikit-learn: machine learning in Python scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation

scikit-learn: machine learning in Python scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation

I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) Dictionary Learning on big text corpora. My input data naturally do not fit in the memory (this is why i want to use an online algorithm) so i am looking for an implementation that can iterate over a file rather than loading everything in memory. Is it possible to do this with sklearn ? are there alternatives ?

Thanks register

Source: (StackOverflow)

Using Python 2.7 with scikit-learn 0.14 package. It runs well on some examples from the user guild expect the Linear Models.

```
Traceback (most recent call last):
File "E:\P\plot_ols.py", line 28, in <module>
from sklearn import datasets, linear_model
File "C:\Python27\lib\site-packages\sklearn\linear_model\__init__.py", line 12, in <module>
from .base import LinearRegression
File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 29, in <module>
from ..utils.sparsefuncs import mean_variance_axis0, inplace_column_scale
ImportError: cannot import name inplace_column_scale
```

Thank you~

Source: (StackOverflow)

I'm getting the following error when performing recursive feature selection with cross-validation:

```
Traceback (most recent call last):
File "/Users/.../srl/main.py", line 32, in <module>
argident_sys.train_classifier()
File "/Users/.../srl/identification.py", line 194, in train_classifier
feat_selector.fit(train_argcands_feats,train_argcands_target)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 298, in fit
ranking_ = rfe.fit(X[train], y[train]).ranking_
TypeError: only integer arrays with one element can be converted to an index
```

The code that generates the error is the following:

```
def train_classifier(self):
# Get the argument candidates
argcands = self.get_argcands(self.reader)
# Extract the necessary features from the argument candidates
train_argcands_feats = []
train_argcands_target = []
for argcand in argcands:
train_argcands_feats.append(self.extract_features(argcand))
if argcand["info"]["label"] == "NULL":
train_argcands_target.append("NULL")
else:
train_argcands_target.append("ARG")
# Transform the features to the format required by the classifier
self.feat_vectorizer = DictVectorizer()
train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
# Transform the target labels to the format required by the classifier
self.target_names = list(set(train_argcands_target))
train_argcands_target = [self.target_names.index(target) for target in train_argcands_target]
## Train the appropriate supervised model
# Recursive Feature Elimination
self.classifier = LogisticRegression()
feat_selector = RFECV(estimator=self.classifier, step=1, cv=StratifiedKFold(train_argcands_target, 10))
feat_selector.fit(train_argcands_feats,train_argcands_target)
print feat_selector.n_features_
print feat_selector.support_
print feat_selector.ranking_
print feat_selector.cv_scores_
return
```

I know I should also perform GridSearch for the parameters of the LogisticRegression classifier, but I don't think that's the source of the error (or is it?).

I should mention that I'm testing with around 50 features, and almost all of them are categoric (that's why I use the DictVectorizer to transform them appropriately).

Any help or guidance you could give me is more than welcome. Thanks!

**EDIT**

Here's some training data examples:

```
train_argcands_feats = [{'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'dado', 'head': u'dado', 'head_postag': u'N'}, {'head_lemma': u'postura', 'head': u'postura', 'head_postag': u'N'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'de', 'head': u'de', 'head_postag': u'PRP'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'muito', 'head': u'Muitas', 'head_postag': u'PRON-DET'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}, {'head_lemma': u'com', 'head': u'com', 'head_postag': u'PRP'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}]
train_argcands_target = ['NULL', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'NULL', 'NULL']
```

Source: (StackOverflow)

How do I save a trained **Naive Bayes classifier** to **disk** and use to **predict** data?

I have the following sample program from Scikits learn website:

```
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()
```

Source: (StackOverflow)

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Source: (StackOverflow)

I'm using the current stable version 0.13 of scikit-learn. I'm applying a linear support vector classifier to some data using the class `sklearn.svm.LinearSVC`

.

In the chapter about preprocessing in scikit-learn's documentation, I've read the following:

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger that others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

**Question 1:** Is standardization useful for SVMs in general, also for those with a linear kernel function as in my case?

**Question 2:** As far as I understand, I have to compute the mean and standard deviation on the training data and apply this same transformation on the test data using the class `sklearn.preprocessing.StandardScaler`

. However, what I don't understand is whether I have to transform the training data as well or just the test data prior to feeding it to the SVM classifier.

That is, do I have to do this:

```
scaler = StandardScaler()
scaler.fit(X_train) # only compute mean and std here
X_test = scaler.transform(X_test) # perform standardization by centering and scaling
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)
```

Or do I have to do this:

```
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # compute mean, std and transform training data as well
X_test = scaler.transform(X_test) # same as above
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)
```

In short, do I have to use `scaler.fit(X_train)`

or `scaler.fit_transform(X_train)`

on the training data in order to get reasonable results with `LinearSVC`

?

Source: (StackOverflow)

I have been exploring scikit-learn, making decision trees with both entropy and gini splitting criteria, and exploring the differences.

My question, is how can I "open the hood" and find out exactly which attributes the trees are splitting on at each level, along with their associated information values, so I can see where the two criterion make different choices?

So far, I have explored the 9 methods outlined in the documentation. They don't appear to allow access to this information. But surely this information is accessible? I'm envisioning a list or dict that has entries for node and gain.

Thanks for your help and my apologies if I've missed something completely obvious.

Source: (StackOverflow)

I know that the computation in scikit-learn is based on NumPy so everything is a matrix or array.

How does this package handle mixed data (numerical and nominal values)?

For example, a product could have the attribute 'color' and 'price', where color is nominal and price is numerical. I notice there is a model called 'DictVectorizer' to numerate the nominal data. For example, two products are:

```
products = [{'color':'black','price':10}, {'color':'green','price':5}]
```

And the result from 'DictVectorizer' could be:

```
[[1,0,10],
[0,1,5]]
```

If there are lots of different values for the attribute 'color', the matrix would be very sparse. And long features will degrade the performance of some algorithms, such as decision trees.

**Is there any way to use the nominal value without the need to create dummy codes?**

Source: (StackOverflow)

I am getting the following error while trying to import from sklearn:

```
>>> from sklearn import svm
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
from sklearn import svm
File "C:\Python27\lib\site-packages\sklearn\__init__.py", line 16, in <module>
from . import check_build
ImportError: cannot import name check_build
```

I am using python 2.7, scipy-0.12.0b1 superpack, numpy-1.6.0 superpack, scikit-learn-0.11 I have a windows 7 machine

I have checked several answers for this issue but none of them gives a way out of this error.

Source: (StackOverflow)

Im triying to obtain the most informative features from a textual corpus. From this well answered question I know that this task could be done as follows:

```
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
labelid = list(classifier.classes_).index(classlabel)
feature_names = vectorizer.get_feature_names()
topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
for coef, feat in topn:
print classlabel, feat, coef
```

Then:

```
most_informative_feature_for_class(tfidf_vect, clf, 5)
```

For this classfier:

```
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)
```

The problem is the output of `most_informative_feature_for_class`

:

```
5 a_base_de_bien bastante (0, 2451) -0.210683496368
(0, 3533) -0.173621065386
(0, 8034) -0.135543062425
(0, 10346) -0.173621065386
(0, 15231) -0.154148294738
(0, 18261) -0.158890483047
(0, 21083) -0.297476572586
(0, 434) -0.0596263855375
(0, 446) -0.0753492277856
(0, 769) -0.0753492277856
(0, 1118) -0.0753492277856
(0, 1439) -0.0753492277856
(0, 1605) -0.0753492277856
(0, 1755) -0.0637950312345
(0, 3504) -0.0753492277856
(0, 3511) -0.115802483001
(0, 4382) -0.0668983049212
(0, 5247) -0.315713152154
(0, 5396) -0.0753492277856
(0, 5753) -0.0716096348446
(0, 6507) -0.130661516772
(0, 7978) -0.0753492277856
(0, 8296) -0.144739048504
(0, 8740) -0.0753492277856
(0, 8906) -0.0753492277856
: :
(0, 23282) 0.418623443832
(0, 4100) 0.385906085143
(0, 15735) 0.207958503155
(0, 16620) 0.385906085143
(0, 19974) 0.0936828782325
(0, 20304) 0.385906085143
(0, 21721) 0.385906085143
(0, 22308) 0.301270427482
(0, 14903) 0.314164150621
(0, 16904) 0.0653764031957
(0, 20805) 0.0597723455204
(0, 21878) 0.403750815828
(0, 22582) 0.0226150073272
(0, 6532) 0.525138162099
(0, 6670) 0.525138162099
(0, 10341) 0.525138162099
(0, 13627) 0.278332617058
(0, 1600) 0.326774799211
(0, 2074) 0.310556919237
(0, 5262) 0.176400451433
(0, 6373) 0.290124806858
(0, 8593) 0.290124806858
(0, 12002) 0.282832270298
(0, 15008) 0.290124806858
(0, 19207) 0.326774799211
```

It is not returning the label nor the words. Why this is happening and how can I print the words and the labels?. Do you guys this is happening since I am using pandas to read the data?. Another thing I tried is the following, form this question:

```
def print_top10(vectorizer, clf, class_labels):
"""Prints features with the highest coefficient values, per class"""
feature_names = vectorizer.get_feature_names()
for i, class_label in enumerate(class_labels):
top10 = np.argsort(clf.coef_[i])[-10:]
print("%s: %s" % (class_label,
" ".join(feature_names[j] for j in top10)))
print_top10(tfidf_vect,clf,y)
```

But I get this traceback:

Traceback (most recent call last):

```
File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
print_top10(tfidf_vect,clf,5)
File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable
```

Any idea of how to solve this, in order to get the features with the highest coefficient values?.

Source: (StackOverflow)

I'm using scikit-learn in Python to develop a classification algorithm to predict gender of a certain customers. Amongst others I want to use the Naive Bayes classifier but my problem is that I have a mix of categorial data (ex: "Registered online", "Accepts email notifications" etc) and continuous data (ex: "Age", "Length of membership" etc). I haven't used scikit much before but I suppose that that Gaussian Naive Bayes is suitable for continuous data and that Bernouilli Naive Bayes can be used for categorial data. However, since I want to have **both** categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

Source: (StackOverflow)

This is a beginner question on regularization with regression. Most information about Elastic Net and Lasso Regression online replicates the information from Wikipedia or the original 2005 paper by Zou and Hastie (Regularization and variable selection via the elastic net).

* Resource for simple theory?* Is there a simple and easy explanation somewhere about what it does, when and why reguarization is neccessary, and how to use it - for those who are not statistically inclined? I understand that the original paper is the ideal source if you can understand it, but is there somewhere that more simply the problem and solution?

* How to use in sklearn?* Is there a step by step example showing why elastic net is chosen (over ridge, lasso, or just simple OLS) and how the parameters are calculated? Many of the examples on sklearn just include alpha and rho parameters directly into the prediction model, for example:

```
from sklearn.linear_model import ElasticNet
alpha = 0.1
enet = ElasticNet(alpha=alpha, rho=0.7)
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
```

However, they don't explain how these were calculated. How do you calculate the parameters for the lasso or net?

Source: (StackOverflow)

These are questions on how to calculate & reduce overfitting in machine learning. I think many new to machine learning will have the same questions, so I tried to be clear with my examples and questions in hope that answers here can help others.

I have a very small sample of texts and I'm trying to predict values associated with them. I've used sklearn to calculate tf-idf, and insert those into a regression model for prediction. This gives me 26 samples with 6323 features - not a lot.. I know:

```
>> count_vectorizer = CountVectorizer(min_n=1, max_n=1)
>> term_freq = count_vectorizer.fit_transform(texts)
>> transformer = TfidfTransformer()
>> X = transformer.fit_transform(term_freq)
>> print X.shape
(26, 6323)
```

Inserting those 26 samples of 6323 features (X) and associated scores (y), into a `LinearRegression`

model, gives good predictions. These are obtained using leave-one-out cross validation, from `cross_validation.LeaveOneOut(X.shape[0], indices=True)`

:

```
using ngrams (n=1):
human machine points-off %error
8.67 8.27 0.40 1.98
8.00 7.33 0.67 3.34
... ... ... ...
5.00 6.61 1.61 8.06
9.00 7.50 1.50 7.50
mean: 7.59 7.64 1.29 6.47
std : 1.94 0.56 1.38 6.91
```

Pretty good! Using ngrams (n=300) instead of unigrams (n=1), similar results occur, which is obviously not right. No 300-words occur in any of the texts, so the prediction should fail, but it doesn't:

```
using ngrams (n=300):
human machine points-off %error
8.67 7.55 1.12 5.60
8.00 7.57 0.43 2.13
... ... ... ...
mean: 7.59 7.59 1.52 7.59
std : 1.94 0.08 1.32 6.61
```

* Question 1:* This might mean that the prediction model is

* Question 2:* What is the best way of preventing over-fitting (in this situation) to be sure that the prediction results are good or not?

* Question 3:* If

`LeaveOneOut`

cross validation is used, how can the model possibly over-fit with good results? Over-fitting means the prediction accuracy will suffer - so why doesn't it suffer on the prediction for the text being left out? The only reason I can think of: in a tf-idf sparse matrix of mainly 0s, there is strong overlap between texts because so many terms are 0s - the regression then thinks the texts correlate highly.Please answer any of the questions even if you don't know them all. Thanks!

Source: (StackOverflow)

I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by `NA`

). I load the data in with `genfromtxt`

with `dtype='f8'`

and go about training my classifier.

The classification is fine on `RandomForestClassifier`

and `GradientBoostingClassifier`

objects, but using `SVC`

from `sklearn.svm`

causes the following error:

```
probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
X = self._validate_for_predict(X)
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
X = atleast2d_or_csr(X, dtype=np.float64, order="C")
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
assert_all_finite(X)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity
```

What gives? How can I make the SVM play nicely with the missing data? Keeping in mind that the missing data works fine for random forests and other classifiers..

Source: (StackOverflow)

I have a `pandas`

data frame and I would like to able to predict the values of column A from the values in columns B and C. Here is a toy example:

```
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
```

Ideally, I would have something like `ols(A ~ B + C, data = df)`

but when I look at the examples from algorithm libraries like `scikit-learn`

it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

Source: (StackOverflow)