
Add Example for Skorch DataLoader#1105

Merged
BenjaminBossan merged 11 commits into skorch-dev:master from ParagEkbote:Example-for-DataLoader
Jun 13, 2025
Conversation

Contributor

@ParagEkbote ParagEkbote commented May 29, 2025

Refs #82

In this example, we use torch's IterableDataset class to build a synthetic streaming dataset. Since train_split cannot be used with streaming datasets, we use a custom callback for validation. I believe I have named the notebook a bit incorrectly and would appreciate feedback on it.

Could you please review the changes?

cc: @githubnemo, @BenjaminBossan
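As a rough, hypothetical sketch of the setup described above (class and parameter names are illustrative, not the notebook's actual code): a synthetic stream built on torch's IterableDataset, which has no length or random access, so skorch's default train_split cannot carve out a validation set from it.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class SyntheticStream(IterableDataset):
    """Yield a fixed number of random (X, y) pairs, simulating a stream."""
    def __init__(self, length=1000, seed=42):
        self.length = length
        self.seed = seed

    def __iter__(self):
        # Re-seed per iteration so each pass over the stream is reproducible.
        rng = torch.Generator().manual_seed(self.seed)
        for _ in range(self.length):
            X = torch.randn(20, generator=rng)
            y = torch.randint(0, 2, (1,), generator=rng).item()
            yield X, y

# An IterableDataset has no __len__/__getitem__, so skorch's default
# train_split cannot slice it; validation has to happen elsewhere,
# e.g. in a custom callback that iterates a separate held-out stream.
loader = DataLoader(SyntheticStream(length=8), batch_size=4)
batches = list(loader)
```

With a dataset like this, the net would be constructed with `train_split=None` and validation done in a custom callback, which is the approach this PR's notebook takes.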

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@ParagEkbote ParagEkbote marked this pull request as ready for review May 31, 2025 14:47
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Thanks for this PR to add an example notebook for streaming data. It is clear and precise, nicely done. I have a few small suggestions for improvement, please check. Also, please run the notebook and check it in with the output cells.

Comment thread notebooks/Streaming_Dataset.ipynb Outdated
Comment thread notebooks/Streaming_Dataset.ipynb Outdated
Comment thread notebooks/Streaming_Dataset.ipynb Outdated
    def __iter__(self):
        for _ in range(self.length):
            X = torch.randn(20, generator=self.rng)
            y = torch.randint(0, 2, (1,), generator=self.rng).item()
Collaborator


Just a proposal: When y is not completely random, the net can actually learn something and improve the loss.

Suggested change:

-            y = torch.randint(0, 2, (1,), generator=self.rng).item()
+            y = (X.sum() > 10).sum()
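For illustration, here is a minimal standalone version of that idea: derive y from X so the model has a learnable signal, instead of an independent random label. The threshold of 0 below is my own illustrative choice to keep the classes roughly balanced; it is not taken from the suggestion above.

```python
import torch

rng = torch.Generator().manual_seed(0)

def sample(rng):
    X = torch.randn(20, generator=rng)
    # Label derived from X: a learnable signal, unlike an independent
    # torch.randint label. Threshold 0 keeps classes roughly balanced,
    # since the sum of 20 standard normals is symmetric around 0.
    y = int(X.sum() > 0)
    return X, y

pairs = [sample(rng) for _ in range(1000)]
positives = sum(y for _, y in pairs)
```

With labels tied to the inputs like this, the training loss can actually decrease, which is the point of the suggestion.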
Contributor Author


Hmm, I think that adding a rule-based approach for a single variable is a bit unfavorable, since it ties the label to a specific variable (X).

Do you think that adding controlled noise to synthetic data with deterministic logic could be useful?

A proposed example could be:

class StreamingDataset(IterableDataset):
    def __init__(self, length=1000, seed=42, noise_prob=0.1, threshold=3.0):
        self.length = length
        self.rng = torch.Generator().manual_seed(seed)
        self.noise_prob = noise_prob
        self.threshold = threshold
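For reference, a hedged sketch of how the proposed class might continue; the exact use of noise_prob and threshold in __iter__ below is my reading of the proposal, not code from the PR.

```python
import torch
from torch.utils.data import IterableDataset

class StreamingDataset(IterableDataset):
    def __init__(self, length=1000, seed=42, noise_prob=0.1, threshold=3.0):
        self.length = length
        self.rng = torch.Generator().manual_seed(seed)
        self.noise_prob = noise_prob
        self.threshold = threshold

    def __iter__(self):
        for _ in range(self.length):
            X = torch.randn(20, generator=self.rng)
            # Deterministic rule: the label depends on X via the threshold...
            y = int(X.sum() > self.threshold)
            # ...plus controlled noise: flip the label with prob noise_prob.
            if torch.rand(1, generator=self.rng).item() < self.noise_prob:
                y = 1 - y
            yield X, y

samples = list(StreamingDataset(length=100, seed=0))
```

Because the generator is seeded in __init__, two instances constructed with the same seed produce identical streams, while the noise keeps the mapping from X to y imperfect.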
Collaborator


Sorry, I don't understand the reply, could you please elaborate? My suggestion was simply to make y correlate with X. That way, when we train the model, we can see the loss improving. When y is completely random, the loss is not improving. For the purpose of this notebook, one could say it doesn't matter, but I can imagine some viewers being confused by the stagnating loss, perhaps assuming there is an error, hence my suggestion.

Contributor Author


What I meant was to add controlled noise/entropy to a synthetic dataset instead of having a correlation that could be seen as less realistic. But I agree that the stagnating loss could be seen as an error. I'll update the variables.

Comment thread notebooks/Streaming_Dataset.ipynb Outdated
ParagEkbote and others added 2 commits June 5, 2025 21:08
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@ParagEkbote
Contributor Author

Could you please review the changes?

cc: @BenjaminBossan

Collaborator

@BenjaminBossan BenjaminBossan left a comment


Thanks for the updates. I just left 2 small comments, please check, the rest looks good.

Comment thread notebooks/Streaming_Dataset.ipynb Outdated
Comment thread notebooks/Streaming_Dataset.ipynb Outdated
ParagEkbote and others added 2 commits June 12, 2025 17:46
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan
Collaborator

@ParagEkbote Great, thanks for the updates. Could you please run the notebook and check it in with the cell outputs, then we should be good to merge.

@ParagEkbote
Contributor Author

I have been able to execute the notebook completely, and the final output is shown correctly as well:

[screenshot: final notebook output]

Could you please review?

cc: @BenjaminBossan

Collaborator

@BenjaminBossan BenjaminBossan left a comment


Thanks for contributing this notebook, all looks good.

@BenjaminBossan BenjaminBossan merged commit b40d905 into skorch-dev:master Jun 13, 2025
16 checks passed
@ParagEkbote ParagEkbote deleted the Example-for-DataLoader branch June 13, 2025 14:11
githubnemo pushed a commit that referenced this pull request Aug 8, 2025
# Version 1.2.0

This is a smaller release, most changes concern examples and development and thus don't affect users of skorch.

## Changed

- Loading of skorch nets using pickle: When unpickling a skorch net, you may come across a PyTorch warning that goes: "FutureWarning: You are using torch.load with weights_only=False [...]"; to avoid this warning, pickle the net again and use the new pickle file (#1092)

## Added

- Add Contributing Guidelines for skorch. (#1097)
- Add an example of hyper-parameter optimization using [Optuna](https://optuna.org/) [here](https://github.com/skorch-dev/skorch/tree/master/examples/optuna) (#1098)
- Add Example for Streaming Dataset (#1105)
- Add pyproject.toml to Improve CI/CD and Tooling (#1108)

Thanks @raphaelrubrice, @omahs, and @ParagEkbote for their contributions.

**Full Changelog**: v1.1.0...v1.2.0

Release commit specific:

* Bump version to 1.2.0
* Update CHANGES.md
* Remove workarounds that have been fixed in sklearn (only affects tests)
