Bugfix: sklearn 1.1 and SliceDataset#858
Conversation
The error message when using `set_params` with an invalid key has been altered to contain quote marks. The tests now check for the message both with and without the quote marks.
Using `SliceDataset` could result in an error because of this line: `self.classes_inferred_ = np.unique(to_numpy(y))`. This would fail when `y` comes from a `SliceDataset`. However, the error was masked because `GridSearchCV` swallows errors by default; therefore, we now run these tests with `error_score='raise'` to surface the bug. Regarding the bugfix, it is a bit clumsy because we need to make `to_numpy` work with `SliceDataset`, but we cannot check `isinstance(X, SliceDataset)` because we don't want to import the helper class. Therefore, we now check this indirectly by looking at attributes. Furthermore, `SliceDataset` now also works with torch tensors as `X` and `y`, not only numpy arrays. Tests have been extended to cover this.
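For illustration, the attribute-based check might look roughly like the following sketch. The probed attribute names and the helper names are placeholders, not necessarily what skorch actually uses:

```python
import numpy as np

def _looks_like_slice_dataset(X):
    # Duck-typing check: SliceDataset itself is not imported, so we probe
    # for attributes a SliceDataset-like object carries. The attribute
    # names used here are illustrative placeholders.
    return (
        hasattr(X, "dataset")
        and hasattr(X, "__getitem__")
        and hasattr(X, "__len__")
    )

def to_numpy_sketch(X):
    # Simplified sketch: materialize a SliceDataset-like object item by
    # item, otherwise fall back to a plain numpy conversion.
    if _looks_like_slice_dataset(X):
        return np.array([X[i] for i in range(len(X))])
    return np.asarray(X)
```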
thomasjpfan
left a comment
I'm okay with the workaround with SliceDataset.
```python
if np.isscalar(X[0]):
    return np.array([X[i] for i in range(len(X))])
return np.array([to_numpy(X[i]) for i in range(len(X))])
```
Nit:
```diff
-if np.isscalar(X[0]):
-    return np.array([X[i] for i in range(len(X))])
-return np.array([to_numpy(X[i]) for i in range(len(X))])
+Xs = [X[i] for i in range(len(X))]
+if np.isscalar(Xs[0]):
+    return np.array(Xs)
+return np.array([to_numpy(x) for x in Xs])
```
Re sklearn version: I was getting ahead of myself a bit :)
Re this suggestion: Could you explain why it is better? (Also, there is no `out` variable; is it referring to `X` or `Xs`?) It seems that with your suggestion, we iterate over `X` a second time when we are not dealing with scalars; isn't that potentially more expensive?
Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.
> Also, there is no `out` variable, is it referring to `X` or `Xs`?
Yup, I meant to type Xs.
I wanted to guard against the second `X[0]` access, because `SliceDataset.transform` can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating over `X` a second time. I'm okay with what you have now.
> Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.
We can have `to_numpy` support scalars and return `np.asarray(scalar)` if the input is a scalar?
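That proposal might look like the following sketch (hypothetical, not skorch's actual code): scalars are handled up front, so the caller never needs to inspect `X[0]` first.

```python
import numpy as np

def to_numpy_scalar_aware(X):
    # Hypothetical variant of to_numpy: a scalar input is converted
    # directly, anything else is materialized item by item.
    if np.isscalar(X):
        return np.asarray(X)
    # The price: the scalar check now runs once per item instead of once
    # per container, which is the performance concern raised in the reply.
    return np.array([to_numpy_scalar_aware(X[i]) for i in range(len(X))])
```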
Side note: A slightly nicer solution would be to define `SliceDataset.__array__`, so it knows how to turn itself into an ndarray.
> I wanted to guard against the second `X[0]` access, because `SliceDataset.transform` can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating `X` a second time.
Yeah, it's a difficult call. In the best case, a numpy conversion is very cheap because pytorch and numpy share a memory layout. But in the worst case, as you mentioned, we pay for one additional `dataset.__getitem__` call. I think the more cautious solution is to prevent the worst case, i.e. avoid an additional `__getitem__` but iterate through the list once more.
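The cost asymmetry can be seen within numpy itself: converting an existing array is zero-copy, while converting item by item allocates a fresh buffer. (A CPU torch tensor behaves analogously, since its `.numpy()` view shares memory; the torch case is omitted here to keep the sketch dependency-free.)

```python
import numpy as np

a = np.arange(5)

# Best case: no copy, b views the same underlying buffer as a.
b = np.asarray(a)
assert np.shares_memory(a, b)

# Worst case: per-item Python iteration plus a newly allocated buffer.
c = np.array([float(x) for x in a])
assert not np.shares_memory(a, c)
```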
> We can have `to_numpy` support scalars and return `np.asarray(scalar)` if the input is a scalar?
I think this could be a dangerous solution performance-wise, as we would be checking each item of the list. As is, we only check the first item and then assume that the subsequent items have the same type.
> Side note: A slightly nicer solution would be to define `SliceDataset.__array__`, so it knows how to turn itself into an ndarray.
I like that; I moved the conversion code into `SliceDataset.__array__`. Of course, performance is the same either way, but it looks cleaner.
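As an aside, the `__array__` protocol can be illustrated with a minimal stand-in class (this is a sketch, not skorch's `SliceDataset`): once the method is defined, `np.asarray` delegates the conversion to the object itself.

```python
import numpy as np

class SliceLike:
    """Minimal stand-in illustrating __array__; not skorch's SliceDataset."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def __array__(self, dtype=None, copy=None):
        # numpy calls this when the object is passed to np.array/np.asarray,
        # so the conversion logic lives on the class itself. The copy
        # keyword is accepted for compatibility with numpy 2.x.
        arr = np.array([self[i] for i in range(len(self))])
        return arr if dtype is None else arr.astype(dtype)

out = np.asarray(SliceLike([1, 2, 3]))
```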
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
…ev/skorch into bugfix/sklearn-1.11-and-sliceds
Sklearn 1.1 issue
The error message when using `set_params` with an invalid key has been altered to contain quote marks. The tests now check for the message both with and without the quote marks.
SliceDataset issue
Using `SliceDataset` could result in an error because of this line (`skorch/classifier.py`, line 119 in 090ece0): `self.classes_inferred_ = np.unique(to_numpy(y))`. This would fail when `y` comes from a `SliceDataset`. However, the error was masked because `GridSearchCV` swallows errors by default now. Therefore, we now run these tests with `error_score='raise'` to surface the bug.
Implementation
Regarding the bugfix, it is a bit clumsy because we need to make `to_numpy` work with `SliceDataset`, but we cannot check `isinstance(X, SliceDataset)` because we don't want to import the helper class. Therefore, we now check this indirectly by looking at attributes.
Furthermore, `SliceDataset` now also works with torch tensors as `X` and `y`, not only numpy arrays. Tests have been extended to cover this.