Hi all,
I'm fairly new to scikit-learn, but have been using a predictive model for a while now that would benefit from scikit-learn's estimator API. However, I could use some advice on how best to implement this.
Briefly, the model is a combination of dimension reduction and nearest neighbors, but the dimension reduction step (canonical correspondence analysis - CCA) relies on two matrices to create the synthetic feature scores for the candidates in the nearest neighbor step. The two matrices are a "species" matrix (spp) and an "environmental" matrix (env) which are used to create orthogonal CCA axes that are linear combinations of the environmental features.
In reading through the documentation on creating new estimators, it seems that every estimator should provide a fit(X, y) method, so somehow my X parameter needs to carry both the spp and env matrices together. I got a lot of good inspiration from this post on Stack Overflow:
https://stackoverflow.com/questions/45966500/use-sklearn-gridsearchcv-on-cu…
and can mostly understand how the OP implemented this, basically by creating a DataHandler class that packs together the two matrices, such that the call to fit would look like:
estimator.fit(DataHandler(spp, env), y)
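For reference, here is roughly what I have in mind (a minimal sketch; the class and method names are placeholders, and I haven't verified this against GridSearchCV's indexing requirements):

```python
import numpy as np
from sklearn.base import BaseEstimator


class DataHandler:
    """Pack the species and environment matrices into a single X object."""

    def __init__(self, spp, env):
        self.spp = np.asarray(spp)
        self.env = np.asarray(env)

    def __len__(self):
        # Row count, so CV splitters can generate train/test indices
        return len(self.spp)

    def __getitem__(self, idx):
        # Row-wise indexing, so a split slices both matrices together
        return DataHandler(self.spp[idx], self.env[idx])


class CCAKNeighbors(BaseEstimator):
    """Placeholder estimator whose fit unpacks the two matrices from X."""

    def fit(self, X, y):
        spp, env = X.spp, X.env
        # ... run CCA on (spp, env) here, then fit nearest neighbors
        #     on the synthetic axis scores ...
        return self
```

so the call would be `CCAKNeighbors().fit(DataHandler(spp, env), y)`.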
I'm wondering if this is the best way to handle the design or if I'm not fully understanding how I could use a Pipeline to accomplish the same goal. Thanks for any guidance - boilerplate sample code would be most appreciated!
matt
I am trying to use cross_validate. I had an initial hiccup due to a pickling issue but was able to get past that. Still, I am not able to get cross_validate to work.
Git Link:
https://github.com/Neetu162/DeepLearningResearch/blob/76675a79a4922b8bd0d72…
Error:
Average recall value is: 0.9453125
creating the loaded model
calling the cross_validate method
/home/osboxes/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/wrappers/scikit_learn.py", line 223, in fit
    return super(KerasClassifier, self).fit(x, y, **kwargs)
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/wrappers/scikit_learn.py", line 155, in fit
    **self.filter_sk_params(self.build_fn.__call__))
  File "/home/osboxes/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 800, in __call__
    'The first argument to `Layer.call` must always be passed.')
ValueError: The first argument to `Layer.call` must always be passed.
  FitFailedWarning)
--
Thanks & Regards,
Neetu
Hello!
I have a DataFrame with a column of text, and I would like to vectorize the
text using CountVectorizer. However, the text includes missing values, and
so I would like to impute a constant value (for any missing values) before
vectorizing.
My initial thought was to create a Pipeline of SimpleImputer (with
strategy='constant') and CountVectorizer. However, SimpleImputer outputs a
2D array and CountVectorizer requires 1D input.
The only solution I have found is to insert a transformer into the Pipeline
that reshapes the output of SimpleImputer from 2D to 1D before it is passed
to CountVectorizer. (You can find my code at the bottom of this message.)
My question: Is there a more elegant solution to this problem than what I'm
currently doing?
Notes:
- I realize that the missing values could be filled in pandas. However, I
would like to accomplish all preprocessing in scikit-learn so that the same
preprocessing can be applied via Pipeline to out-of-sample data.
- I recall seeing a GitHub issue in which Andy proposed that
CountVectorizer should allow 2D input as long as the second dimension is 1
(in other words: a single column of data). This modification to
CountVectorizer would be a great long-term solution to my problem. However,
I'm looking for a solution that would work in the current version of
scikit-learn.
Thank you so much for any feedback or ideas!
Kevin
== START OF CODE EXAMPLE ==
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# Fill missing values with a constant (defaults to 'missing_value' for strings)
imp = SimpleImputer(strategy='constant')

# SimpleImputer outputs a 2D array; flatten it to 1D for CountVectorizer
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape': -1})

vect = CountVectorizer()
pipe = make_pipeline(imp, one_dim, vect)
pipe.fit_transform(df[['text']]).toarray()
== END OF CODE EXAMPLE ==
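One alternative I've considered is a small custom transformer that imputes directly on 1D text, which avoids the reshape entirely. This is just a sketch: `TextImputer` and its `fill_value` parameter are names I made up, not anything in scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline


class TextImputer(BaseEstimator, TransformerMixin):
    """Fill missing values in a 1D sequence of strings (hypothetical helper)."""

    def __init__(self, fill_value='missing'):
        self.fill_value = fill_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Accepts a pandas Series or any 1D array-like of strings,
        # and stays 1D so CountVectorizer can consume the output directly
        return pd.Series(X).fillna(self.fill_value).to_numpy()


df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
pipe = make_pipeline(TextImputer(), CountVectorizer())
result = pipe.fit_transform(df['text']).toarray()
```

The trade-off is that the pipeline is then fed the column as a Series (df['text']) rather than a one-column DataFrame.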
--
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool