
scikit-learn

scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation


How is Elastic Net used?

This is a beginner question on regularization with regression. Most information about Elastic Net and Lasso Regression online replicates the information from Wikipedia or the original 2005 paper by Zou and Hastie (Regularization and variable selection via the elastic net).

Resource for simple theory? Is there a simple and easy explanation somewhere about what it does, when and why regularization is necessary, and how to use it - for those who are not statistically inclined? I understand that the original paper is the ideal source if you can understand it, but is there somewhere that explains the problem and solution more simply?

How to use in sklearn? Is there a step-by-step example showing why elastic net is chosen (over ridge, lasso, or just plain OLS) and how the parameters are calculated? Many of the sklearn examples just pass the alpha and rho parameters directly into the prediction model, for example:

from sklearn.linear_model import ElasticNet
alpha = 0.1
enet = ElasticNet(alpha=alpha, rho=0.7)
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)

However, they don't explain how these values were calculated. How do you calculate the parameters for the lasso or elastic net?
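
For concreteness, a hedged sketch of one way these parameters can be chosen by cross-validation rather than hard-coded, assuming a release where ElasticNetCV is available (in later versions rho is renamed l1_ratio); this is an illustration, not a worked example from the docs:

# Hedged sketch: let cross-validation choose alpha and the L1/L2 mix.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic regression data just to make the example self-contained.
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

# Try a few candidate mixing values; a grid of alphas is generated automatically.
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 1.0], cv=5)
enet_cv.fit(X, y)

print(enet_cv.alpha_)     # alpha selected by cross-validation
print(enet_cv.l1_ratio_)  # mixing parameter (the old rho) selected by cross-validation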


Source: (StackOverflow)

Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications", etc.) and continuous data (e.g. "Age", "Length of membership", etc.). I haven't used scikit-learn much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!
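
For what it's worth, a hedged sketch of the kind of combination meant here: fit GaussianNB on the continuous columns and BernoulliNB on the binary ones, then multiply the per-class probabilities and divide out the prior that would otherwise be counted twice. This is only an illustration under the naive independence assumption, not a built-in scikit-learn estimator, and the data below is synthetic:

# Hedged sketch: combine GaussianNB (continuous features) with BernoulliNB
# (binary features) by multiplying their class-conditional probabilities.
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

rng = np.random.RandomState(0)
X_cont = rng.randn(100, 2)                # e.g. age, length of membership
X_bin = rng.randint(0, 2, size=(100, 2))  # e.g. registered online, accepts email
y = rng.randint(0, 2, size=100)           # gender labels

gnb = GaussianNB().fit(X_cont, y)
bnb = BernoulliNB().fit(X_bin, y)

# P(y | x) is proportional to P(y) * P(x_cont | y) * P(x_bin | y).
# Each predict_proba already contains the prior P(y), so divide it out once.
proba = gnb.predict_proba(X_cont) * bnb.predict_proba(X_bin) / gnb.class_prior_
proba /= proba.sum(axis=1, keepdims=True)   # renormalize each row
y_pred = gnb.classes_[np.argmax(proba, axis=1)]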


Source: (StackOverflow)

In scikit-learn, how do I deal with data mixed with numerical and nominal values?

I know that the computation in scikit-learn is based on NumPy so everything is a matrix or array.

How does this package handle mixed data (numerical and nominal values)?

For example, a product could have the attributes 'color' and 'price', where color is nominal and price is numerical. I notice there is a transformer called 'DictVectorizer' that can turn the nominal data into numbers. For example, two products are:

products = [{'color':'black','price':10}, {'color':'green','price':5}]

And the result from 'DictVectorizer' could be:

[[1,0,10],
 [0,1,5]]
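
A minimal runnable version of that example (with sparse=False so the result prints as a dense array; by default DictVectorizer returns a sparse matrix):

# DictVectorizer one-hot encodes the nominal 'color' attribute and passes
# the numerical 'price' attribute through unchanged.
from sklearn.feature_extraction import DictVectorizer

products = [{'color': 'black', 'price': 10},
            {'color': 'green', 'price': 5}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(products)

print(vec.get_feature_names())  # ['color=black', 'color=green', 'price']
print(X)                        # [[  1.   0.  10.]
                                #  [  0.   1.   5.]]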

If there are lots of different values for the attribute 'color', the matrix would be very sparse, and long feature vectors will degrade the performance of some algorithms, such as decision trees.

Is there any way to use the nominal value without the need to create dummy codes?


Source: (StackOverflow)

How do I find which attributes my tree splits on, when using scikit-learn?

I have been exploring scikit-learn, making decision trees with both entropy and gini splitting criteria, and exploring the differences.

My question is: how can I "open the hood" and find out exactly which attributes the trees are splitting on at each level, along with their associated information values, so I can see where the two criteria make different choices?

So far, I have explored the 9 methods outlined in the documentation. They don't appear to allow access to this information. But surely this information is accessible? I'm envisioning a list or dict that has entries for node and gain.
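
A hedged sketch of the kind of inspection being asked about, assuming the fitted estimator exposes its structure through the tree_ attribute (feature, threshold and impurity arrays); attribute names may differ between versions:

# Walk the fitted tree and print the split feature, threshold and impurity
# of every node (leaves are marked by children_left == -1).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(criterion='entropy').fit(iris.data, iris.target)

tree = clf.tree_
for node in range(tree.node_count):
    if tree.children_left[node] == -1:   # leaf node
        print("node %d: leaf, impurity=%.3f" % (node, tree.impurity[node]))
    else:
        print("node %d: split on %s <= %.3f, impurity=%.3f" % (
            node,
            iris.feature_names[tree.feature[node]],
            tree.threshold[node],
            tree.impurity[node]))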

Thanks for your help and my apologies if I've missed something completely obvious.


Source: (StackOverflow)

How to apply standardization to SVMs in scikit-learn?

I'm using the current stable version 0.13 of scikit-learn. I'm applying a linear support vector classifier to some data using the class sklearn.svm.LinearSVC.

In the chapter about preprocessing in scikit-learn's documentation, I've read the following:

Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Question 1: Is standardization useful for SVMs in general, including those with a linear kernel function, as in my case?

Question 2: As far as I understand, I have to compute the mean and standard deviation on the training data and apply this same transformation on the test data using the class sklearn.preprocessing.StandardScaler. However, what I don't understand is whether I have to transform the training data as well or just the test data prior to feeding it to the SVM classifier.

That is, do I have to do this:

scaler = StandardScaler()
scaler.fit(X_train)                # only compute mean and std here
X_test = scaler.transform(X_test)  # perform standardization by centering and scaling

clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)

Or do I have to do this:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # compute mean, std and transform training data as well
X_test = scaler.transform(X_test)  # same as above

clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)

In short, do I have to use scaler.fit(X_train) or scaler.fit_transform(X_train) on the training data in order to get reasonable results with LinearSVC?


Source: (StackOverflow)

How does sklearn.svm.SVC's predict_proba() function work internally?

I am using sklearn.svm.SVC from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() calculates the probability internally?


Source: (StackOverflow)

Best Machine Learning package for Python 3x?

I was bummed out to see that scikit-learn does not support Python 3... Is there a comparable package anyone can recommend for Python 3?


Source: (StackOverflow)

Python scikit-learn: exporting trained classifier

I am using a DBN (deep belief network) from nolearn based on scikit-learn.

I have already built a network which can classify my data very well, and now I am interested in exporting the model for deployment, but I don't know how (I am training the DBN every time I want to predict something). In MATLAB I would just export the weight matrix and import it on another machine.

Does someone know how to export the model/the weight matrix to be imported without needing to train the whole model again?


Source: (StackOverflow)

How to normalize a NumPy array?

I would like to normalize a NumPy array to norm 1. I am looking for an equivalent version of this function:

import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm

Is there something like that in sklearn or NumPy? This function handles the situation where v is the zero vector.


Source: (StackOverflow)

Fitting a scikits.learn.hmm.GaussianHMM to variable length training sequences

I'd like to fit a scikits.learn.hmm.GaussianHMM to training sequences of different lengths. The fit method, however, prevents using sequences of different lengths by doing

obs = np.asanyarray(obs)

which only works on a list of equally shaped arrays. Does anyone have a hint on how to proceed?


Source: (StackOverflow)

Is it possible to apply online algorithms to big data files with sklearn?

I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) Dictionary Learning to big text corpora. My input data naturally does not fit in memory (this is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory. Is it possible to do this with sklearn? Are there alternatives?
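
For concreteness, a hedged sketch of the kind of streaming workflow meant here, assuming MiniBatchDictionaryLearning.partial_fit and HashingVectorizer are available; 'corpus.txt' is a hypothetical file with one document per line, and this is a guess at the workflow rather than a recipe from the docs:

# Stream a large text file in mini-batches, hash each batch into fixed-size
# feature vectors, and update the dictionary incrementally.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import MiniBatchDictionaryLearning

vectorizer = HashingVectorizer(n_features=2 ** 12)
learner = MiniBatchDictionaryLearning(n_components=50)

def iter_minibatches(path, batch_size=1000):
    """Yield lists of raw documents, batch_size lines at a time."""
    batch = []
    for line in open(path):
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

for docs in iter_minibatches('corpus.txt'):
    X = vectorizer.transform(docs)     # sparse matrix, one row per document
    learner.partial_fit(X.toarray())   # partial_fit may require dense input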

Thanks!


Source: (StackOverflow)

ImportError: cannot import name inplace_column_scale

Using Python 2.7 with the scikit-learn 0.14 package. It runs well on some examples from the user guide, except for the Linear Models:

Traceback (most recent call last):
  File "E:\P\plot_ols.py", line 28, in <module>
    from sklearn import datasets, linear_model
  File "C:\Python27\lib\site-packages\sklearn\linear_model\__init__.py", line 12, in <module>
    from .base import LinearRegression
  File "C:\Python27\lib\site-packages\sklearn\linear_model\base.py", line 29, in <module>
    from ..utils.sparsefuncs import mean_variance_axis0, inplace_column_scale
ImportError: cannot import name inplace_column_scale

Thank you~


Source: (StackOverflow)

TypeError: only integer arrays with one element can be converted to an index

I'm getting the following error when performing recursive feature selection with cross-validation:

Traceback (most recent call last):
  File "/Users/.../srl/main.py", line 32, in <module>
    argident_sys.train_classifier()
  File "/Users/.../srl/identification.py", line 194, in train_classifier
    feat_selector.fit(train_argcands_feats,train_argcands_target)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 298, in fit
    ranking_ = rfe.fit(X[train], y[train]).ranking_
TypeError: only integer arrays with one element can be converted to an index

The code that generates the error is the following:

def train_classifier(self):

    # Get the argument candidates
    argcands = self.get_argcands(self.reader)

    # Extract the necessary features from the argument candidates
    train_argcands_feats = []
    train_argcands_target = []

    for argcand in argcands:
        train_argcands_feats.append(self.extract_features(argcand))
        if argcand["info"]["label"] == "NULL":
            train_argcands_target.append("NULL")
        else:
            train_argcands_target.append("ARG")

    # Transform the features to the format required by the classifier
    self.feat_vectorizer = DictVectorizer()
    train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)

    # Transform the target labels to the format required by the classifier
    self.target_names = list(set(train_argcands_target))
    train_argcands_target = [self.target_names.index(target) for target in train_argcands_target]

    ## Train the appropriate supervised model      

    # Recursive Feature Elimination
    self.classifier = LogisticRegression()
    feat_selector = RFECV(estimator=self.classifier, step=1, cv=StratifiedKFold(train_argcands_target, 10))

    feat_selector.fit(train_argcands_feats,train_argcands_target)

    print feat_selector.n_features_
    print feat_selector.support_
    print feat_selector.ranking_
    print feat_selector.cv_scores_

    return

I know I should also perform GridSearch for the parameters of the LogisticRegression classifier, but I don't think that's the source of the error (or is it?).

I should mention that I'm testing with around 50 features, and almost all of them are categorical (that's why I use the DictVectorizer to transform them appropriately).

Any help or guidance you could give me is more than welcome. Thanks!

EDIT

Here's some training data examples:

train_argcands_feats = [{'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'dado', 'head': u'dado', 'head_postag': u'N'}, {'head_lemma': u'postura', 'head': u'postura', 'head_postag': u'N'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'de', 'head': u'de', 'head_postag': u'PRP'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'muito', 'head': u'Muitas', 'head_postag': u'PRON-DET'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}, {'head_lemma': u'com', 'head': u'com', 'head_postag': u'PRP'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}]

train_argcands_target = ['NULL', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'NULL', 'NULL']

Source: (StackOverflow)

Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data?

I have the following sample program from the scikit-learn website:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()

Source: (StackOverflow)

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?


Source: (StackOverflow)