Docs: integrate new tutorial from Jacob and modernize deprecated cache_field_names usage #6510
Conversation
Walkthrough

Added three Torch-related recipe pages and updated the recipes index; expanded two notebooks with setup, downloads, and runnable MNIST training examples; refactored the example utility module to introduce SimpleGetItem and MnistGetItem and removed the old cache_field_names parameter in the dataloader factory functions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant create_dataloaders
    participant SimpleGetItem
    participant to_torch
    participant DataLoader
    User->>create_dataloaders: dataset, get_item (maybe SimpleGetItem)
    create_dataloaders->>create_dataloaders: determine use_vectorize
    alt use_vectorize == true
        create_dataloaders->>to_torch: to_torch(..., vectorize=True)
        to_torch->>SimpleGetItem: use wrapper to produce cached-field dicts
        SimpleGetItem-->>to_torch: cached dicts for each sample
        to_torch->>DataLoader: build dataset preloaded with fields
    else use_vectorize == false
        create_dataloaders->>to_torch: to_torch(..., vectorize=False)
        to_torch->>DataLoader: build standard dataset (model samples)
    end
    DataLoader-->>User: DataLoader ready
```
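For readers skimming the walkthrough, here is a rough Python sketch of the flow the diagram describes. It is not the tutorial's verbatim code: the exact signature of `create_dataloaders` in the tutorial's `utils.py` may differ, and the `batch_size` parameter and the `from utils import SimpleGetItem` import are assumptions.

```python
# Illustrative sketch only, under the assumptions stated above
from torch.utils.data import DataLoader

from utils import SimpleGetItem  # the tutorial's downloaded helper module (assumed import)


def create_dataloaders(dataset, get_item, batch_size=16):
    # Vectorized caching is enabled only for the SimpleGetItem wrapper;
    # other GetItem subclasses fall back to lazy, per-sample loading
    use_vectorize = isinstance(get_item, SimpleGetItem)

    # to_torch() builds a FiftyOneTorchDataset; the positional/keyword form
    # shown here is assumed from the review summary
    torch_dataset = dataset.to_torch(get_item, vectorize=use_vectorize)

    return DataLoader(torch_dataset, batch_size=batch_size)
```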
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (1)
392-396: Replace placeholder path with a valid default.

The hardcoded path `/path/to/save/weights` will cause a runtime error when the training script attempts to save model weights. Users following this tutorial will encounter a failure. Apply this diff to use a temporary directory:

```diff
+import tempfile
+import os
+
 device = 'cuda' if torch.cuda.is_available() else 'cpu'
-path_to_save_weights = '/path/to/save/weights'
+path_to_save_weights = os.path.join(tempfile.gettempdir(), 'mnist_weights')
+os.makedirs(path_to_save_weights, exist_ok=True)
 mnist_training.main(subset, 10, 10, device, path_to_save_weights)
```

Alternatively, add a markdown cell above explaining that users must modify this path to a valid directory on their system.
🧹 Nitpick comments (5)
docs/source/recipes/torch-dataset-examples/utils.py (1)
167-204: Approve the modernization, but document the vectorization logic.

The migration from `cache_field_names` to the `vectorize` parameter with automatic detection (lines 187-188) is well-executed. The design choice to enable vectorization only for `SimpleGetItem` instances makes sense for the tutorial context, as `SimpleGetItem` is specifically designed for cached field optimization.

However, consider documenting this behavior in the docstring or as a comment, since users with custom `GetItem` subclasses won't get vectorization unless they inherit from `SimpleGetItem` or explicitly modify this logic. Consider adding a clarifying comment:

```diff
+    # Use vectorize only for SimpleGetItem (cached fields)
+    # Custom GetItem subclasses will use lazy loading by default
     use_vectorize = isinstance(get_item, SimpleGetItem)
```

docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (1)
85-86: Consider documenting the security implications of downloading remote code.

While the hardcoded HTTPS URL to the official Voxel51 CDN is safe, users should be aware they're downloading and executing Python code from a remote source. Consider adding a brief note about this, or alternatively, provide the `utils.py` content inline or as part of the repository.

This addresses the static analysis hint about URL schemes, though the specific concern (custom schemes like `file:`) doesn't apply here.

docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (3)
82-82: Verify usage of the PIL.Image import.

The `PIL.Image` import doesn't appear to be used in this notebook. Consider removing it unless it's required by the downloaded external scripts.
94-111: Add integrity verification for downloaded scripts.

Downloading and importing Python scripts without integrity verification poses security and reliability risks. While the URLs point to the official Voxel51 CDN, consider adding checksum validation to ensure the downloaded files haven't been tampered with or corrupted.

Example approach:

```diff
+import hashlib
+
 url = "https://cdn.voxel51.com/tutorials_torch_dataset_examples/notebook_simple_training_example/mnist_training.py"
-urllib.request.urlretrieve(url, "mnist_training.py")
+filename, _ = urllib.request.urlretrieve(url, "mnist_training.py")
+
+# Verify checksum (replace with actual SHA256 hash of the file)
+expected_hash = "abc123..."
+with open(filename, 'rb') as f:
+    file_hash = hashlib.sha256(f.read()).hexdigest()
+    if file_hash != expected_hash:
+        raise ValueError(f"File integrity check failed for {filename}")
```

Alternatively, consider hosting these scripts in the repository and importing them directly to eliminate the download step entirely.
406-408: Remove empty markdown cell.

The empty markdown cell at the end of the notebook serves no purpose and can be removed for cleaner structure.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- docs/source/recipes/index.rst (2 hunks)
- docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (4 hunks)
- docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (13 hunks)
- docs/source/recipes/torch-dataset-examples/utils.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-24T16:42:23.947Z
Learnt from: swheaton
PR: voxel51/fiftyone#6161
File: plugins/operators/model_evaluation/__init__.py:1059-1060
Timestamp: 2025-07-24T16:42:23.947Z
Learning: In FiftyOne, there are two distinct caching mechanisms that serve different purposes: (1) singleton cache (`fo.config.singleton_cache`) controls whether Dataset/Sample/Frame objects are treated as singletons for object identity management, and (2) server-side TTL cache (`fosu.cache_dataset`) provides performance optimization in app/server contexts. These mechanisms are independent - server-side caching should continue regardless of singleton cache settings.
Applied to files:
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
🧬 Code graph analysis (1)
docs/source/recipes/torch-dataset-examples/utils.py (2)
fiftyone/utils/torch.py (2)
- FiftyOneTorchDataset (1704-1936)
- GetItem (258-326)

fiftyone/core/collections.py (1)

- to_torch (9987-10025)
🪛 Ruff (0.14.2)
docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb
19-19: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
23-23: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
11-11: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (12)
docs/source/recipes/torch-dataset-examples/utils.py (4)
2-2: LGTM! Import aligns with the new GetItem-based design.

The addition of the `GetItem` import is necessary for the new `SimpleGetItem` and `MnistGetItem` classes and aligns with the FiftyOne API.
13-34: LGTM! Clean wrapper design for vectorized caching.

The `SimpleGetItem` class effectively wraps user functions with field name specifications, enabling the vectorized caching path in `to_torch()`. The identity field mapping (where keys equal values) is appropriate for this use case.
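As a hedged illustration of the wrapper this comment describes (not the actual contents of the tutorial's `utils.py`; the constructor arguments and the `GetItem` hooks used here, a `field_mapping` property and a dict-taking `__call__`, are assumptions):

```python
from fiftyone.utils.torch import GetItem


class SimpleGetItem(GetItem):
    """Hypothetical sketch: wraps a plain function plus the field names it reads."""

    def __init__(self, get_item_fn, field_names):
        self._fn = get_item_fn
        self._field_names = list(field_names)
        super().__init__()

    @property
    def field_mapping(self):
        # Identity mapping: each cached field keeps its own name
        return {name: name for name in self._field_names}

    def __call__(self, d):
        # With vectorize=True, `d` is a dict of cached field values
        return self._fn(d)
```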
137-164: LGTM! Well-structured migration from function to GetItem class.

The `MnistGetItem` class properly extends `GetItem` with explicit field mappings and correctly handles the MNIST label format. The label extraction at line 159 (`int(sample["label"][0])`) appropriately parses the `"<number> - <number name>"` format by taking the first character, which works correctly for single-digit MNIST classes.
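A hypothetical sketch of such a class is below; the field paths in `field_mapping`, the tensor conversion, and the exact `GetItem` hooks are assumptions rather than the tutorial's verbatim code.

```python
import numpy as np
import torch
from PIL import Image
from fiftyone.utils.torch import GetItem


class MnistGetItem(GetItem):
    @property
    def field_mapping(self):
        # Assumed mapping; the tutorial's exact field paths may differ
        return {"id": "id", "filepath": "filepath", "label": "ground_truth.label"}

    def __call__(self, sample):
        # Load the grayscale digit and scale pixel values to [0, 1]
        image = np.asarray(Image.open(sample["filepath"]).convert("L"), dtype=np.float32) / 255.0
        # Labels look like "<digit> - <digit name>" (e.g. "7 - seven"),
        # so the first character is the integer class for single digits
        label = int(sample["label"][0])
        return {
            "id": sample["id"],
            "image": torch.from_numpy(image).unsqueeze(0),  # shape (1, 28, 28)
            "label": torch.tensor(label),
        }
```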
231-270: LGTM! Consistent DDP implementation.

The `create_dataloaders_ddp` function mirrors the changes in `create_dataloaders` with the same vectorization logic and updated API. The implementation correctly integrates with PyTorch's `DistributedSampler` for distributed training.

docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (5)
137-140: Excellent explanation of the caching mechanism.

The description clearly explains how `vectorize=True` with the `SimpleGetItem` wrapper enables field preloading, and importantly notes that the helper must be imported from a proper Python module for PyTorch's multiprocessing to work correctly. This is valuable context for users.
239-242: Good example of the `SimpleGetItem` pattern.

The code correctly demonstrates wrapping a get_item function with `SimpleGetItem` and passing it to `to_torch()` with `vectorize=True`. Ensure that `get_item_identity` and `fields_of_interest` are defined in earlier cells (not visible in this diff).
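A minimal sketch of the pattern this cell demonstrates, assuming `dataset` is the loaded MNIST collection and that the `SimpleGetItem` constructor takes a function plus field names (the field values listed here are illustrative, not the notebook's exact definitions):

```python
from utils import SimpleGetItem  # downloaded helper module, importable so worker processes can pickle it

# Assumed stand-ins for the earlier notebook cells
fields_of_interest = ["id", "filepath", "ground_truth.label"]

def get_item_identity(d):
    # With vectorize=True, `d` is a plain dict of the cached fields above
    return d

get_item = SimpleGetItem(get_item_identity, fields_of_interest)
torch_dataset = dataset.to_torch(get_item, vectorize=True)
```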
274-276: Important clarification about the input signature change.

This correctly explains a critical behavioral difference: when using `vectorize=True` with cached fields, the `get_item` function receives a dict (with keys corresponding to cached field names) instead of a FiftyOne Sample object. This is essential information for users adapting their code.
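To make the difference concrete, a hedged before/after sketch (the field names are illustrative):

```python
# Lazy loading (vectorize=False): get_item receives a fiftyone.Sample
def get_item_from_sample(sample):
    return {"filepath": sample.filepath, "label": sample.ground_truth.label}


# Cached fields (vectorize=True with SimpleGetItem): get_item receives a plain
# dict whose keys are the cached field names declared in the wrapper
def get_item_from_dict(d):
    return {"filepath": d["filepath"], "label": d["label"]}
```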
322-324: Consistent usage of the `SimpleGetItem` wrapper.
393-396: Complete example showing end-to-end integration.This final example demonstrates using the
SimpleGetItemwrapper with a function from the utilities module and creating a dataloader, which is helpful for showing the complete workflow.docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (1)
3-15: LGTM! Clear and comprehensive introduction.The introduction effectively sets expectations by outlining the specific topics covered in the tutorial.
docs/source/recipes/index.rst (2)
52-57: Verify description aligns with updated API.The PR summary states that
cache_field_namesis deprecated and replaced with theSimpleGetItemhelper class. However, this recipe description still refers to "using thecache_field_namesargument".Please verify that the description accurately reflects the content of the linked notebook. If the notebook teaches the new
SimpleGetItemapproach (as implied by the PR summary), consider updating the description:.. customcarditem:: :header: Speeding up with cached fields - :description: Improve training performance by preloading specific fields into memory using the `cache_field_names` argument with `FiftyOneTorchDataset`. + :description: Improve training performance by preloading specific fields into memory using the `SimpleGetItem` helper with vectorized data loading in `FiftyOneTorchDataset`. :link: torch-dataset-examples/the_cache_field_names_argument.htmlAlternatively, if the notebook covers both the deprecated and new approaches, clarify that in the description.
160-162: The original review comment is incorrect.

The file `fiftyone_torch_dataloader.ipynb` exists at `docs/source/recipes/fiftyone_torch_dataloader.ipynb` and is correctly referenced using a relative path in line 160 of the index.rst file. Since both the index.rst and the notebook are in the same `docs/source/recipes/` directory, the reference `<fiftyone_torch_dataloader.ipynb>` is appropriate and consistent with how lines 161-162 reference files in the `torch-dataset-examples/` subdirectory. There is no path inconsistency, and the file is not missing from the PR.

Likely an incorrect or invalid review comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (2)
7-14: Consider mentioning that this replaces the deprecated `cache_field_names` parameter.

The title and description effectively explain the modern approach using `SimpleGetItem` with `vectorize=True`. However, since the PR objectives indicate this modernizes deprecated `cache_field_names` usage, adding a brief note (e.g., "This modern approach replaces the deprecated `cache_field_names` parameter") would help users transitioning from older documentation understand the context and migration path.
137-140: Consider simplifying terminology for clarity.

Line 137 mentions "a `GetItem` object," but this notebook only imports and uses `SimpleGetItem`. While `GetItem` may be a base class or protocol, readers might find it clearer to say "a get-item wrapper object like `SimpleGetItem`" or simply reference `SimpleGetItem` directly, since that's what they'll actually use in their code.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- docs/source/recipes/index.rst (2 hunks)
- docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (13 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-08-12T17:41:41.645Z
Learnt from: swheaton
PR: voxel51/fiftyone#4651
File: fiftyone/__public__.py:116-116
Timestamp: 2024-08-12T17:41:41.645Z
Learning: In `fiftyone/__public__.py`, imports are intended to make names available to users at the top module level, even if they are not used directly in the file. Avoid suggesting the removal of unused imports in this file.
Applied to files:
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
📚 Learning: 2025-07-24T16:42:23.947Z
Learnt from: swheaton
PR: voxel51/fiftyone#6161
File: plugins/operators/model_evaluation/__init__.py:1059-1060
Timestamp: 2025-07-24T16:42:23.947Z
Learning: In FiftyOne, there are two distinct caching mechanisms that serve different purposes: (1) singleton cache (`fo.config.singleton_cache`) controls whether Dataset/Sample/Frame objects are treated as singletons for object identity management, and (2) server-side TTL cache (`fosu.cache_dataset`) provides performance optimization in app/server contexts. These mechanisms are independent - server-side caching should continue regardless of singleton cache settings.
Applied to files:
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
🪛 Ruff (0.14.2)
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
11-11: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (5)
112-112: LGTM!

The import of `SimpleGetItem` from the downloaded `utils.py` is correct and necessary for the tutorial examples.
241-242: LGTM!

The pattern of wrapping the get_item function with `SimpleGetItem` and passing `vectorize=True` to `to_torch()` is correctly demonstrated. The inline comments clearly explain what's happening.
274-276: LGTM!

This explanation is crucial for users to understand the behavior change when using `vectorize=True`. The clarification that the `get_item` function receives a dict of cached fields rather than a FiftyOne Sample object is clear and well-stated.
323-324: LGTM!

The consistent application of the `SimpleGetItem` wrapper pattern throughout the tutorial reinforces learning and demonstrates best practices.
394-396: LGTM!

The final example effectively demonstrates how to use helper functions from the `utils` module with the `SimpleGetItem` wrapper pattern, maintaining consistency with the previous examples.

docs/source/recipes/index.rst (1)
38-57: Past feedback incorporated and new recipes well-structured.

The description on line 40 has been updated to "Learn how to efficiently load and work with FiftyOne datasets in PyTorch using FiftyOneTorchDataset," which now properly aligns with the header "Data Loading with Torch Datasets" and addresses the previous review's concern about mismatched descriptions. All three new recipe cards are well-structured with clear headers, aligned descriptions, consistent tags, and CDN images. The toctree entries (lines 160-162) correctly reference the corresponding notebook files, with the first notebook at the recipes root level and the subsequent two in the `torch-dataset-examples/` subdirectory, consistent with the card link paths.

Also applies to: 160-162
What changes are proposed in this pull request?
This PR integrates a new tutorial developed by @jacobsela and applies fixes to older tutorials in our documentation.
Summary of improvements:
- Replaced the deprecated `cache_field_names` parameter with a new `SimpleGetItem` helper class that wraps user functions and field names, leveraging `vectorize=True` with FiftyOne's current API to preload specified fields into memory for faster data loading during training.
- Refactored the `mnist_get_item` function into a proper `MnistGetItem` class with an explicit `field_mapping`, which aligns with the latest `to_torch()` API requirements to correctly extract and pass the needed fields (`id`, `filepath`, `label`) from dataset samples.

How is this patch tested? If it is not, please explain why.

- Verified the tutorials build with `make docs` and render correctly in the local documentation site.
- Verified the `SimpleGetItem` and `MnistGetItem` classes function as expected when running example scripts.

https://docs.dev.voxel51.com/recipes/fiftyone_torch_dataloader.html
https://docs.dev.voxel51.com/recipes/torch-dataset-examples/simple_training_example.html
https://docs.dev.voxel51.com/recipes/torch-dataset-examples/the_cache_field_names_argument.html
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
Description:
Integrated a new tutorial and modernized tutorial examples by replacing the deprecated `cache_field_names` parameter with the new `SimpleGetItem` wrapper class, and updated the `mnist_get_item` implementation to use the `MnistGetItem` class compatible with the current API.

What areas of FiftyOne does this PR affect?
- `fiftyone` Python library changes