Conversation


@AdonaiVera AdonaiVera commented Oct 30, 2025

What changes are proposed in this pull request?

This PR integrates a new tutorial developed by @jacobsela and applies fixes to older tutorials in our documentation.

Summary of improvements:

  1. Replaced the deprecated cache_field_names parameter with a new SimpleGetItem helper class that wraps a user function and its field names; combined with vectorize=True in FiftyOne's current API, this preloads the specified fields into memory for faster data loading during training.
  2. Converted the mnist_get_item function into a proper MnistGetItem class with an explicit field_mapping, aligning with the latest to_torch() API requirements so that the needed fields (id, filepath, label) are correctly extracted from dataset samples.
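The wrapper pattern described in point 1 can be sketched roughly as follows. This is illustrative only, not FiftyOne's actual implementation: the `GetItem` base class below is a minimal stand-in for `fiftyone.utils.torch.GetItem`, and the attribute names (`field_mapping`, the identity mapping) are assumptions based on the PR description.

```python
# Minimal stand-in for fiftyone.utils.torch.GetItem, used only to
# illustrate the wrapper pattern; the real base class lives in FiftyOne
class GetItem:
    def __init__(self, field_mapping=None):
        self.field_mapping = field_mapping or {}

    def __call__(self, d):
        raise NotImplementedError


class SimpleGetItem(GetItem):
    """Wraps a plain get-item function plus the field names it needs."""

    def __init__(self, get_item_fn, field_names):
        # Identity mapping: each requested field maps to itself
        super().__init__(field_mapping={f: f for f in field_names})
        self.get_item_fn = get_item_fn

    def __call__(self, d):
        # With vectorize=True, d is a dict of the cached fields
        return self.get_item_fn(d)


# Hypothetical user function operating on the cached-field dict
def get_item_identity(d):
    return d


get_item = SimpleGetItem(get_item_identity, ["id", "filepath", "label"])
print(get_item({"id": "abc", "filepath": "/tmp/img.png", "label": "3 - three"}))
```

A callable like this can then be handed to `to_torch()` along with `vectorize=True`, as the updated tutorials demonstrate.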

How is this patch tested? If it is not, please explain why.

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • Yes. Give a description of this change to be included in the release notes for FiftyOne users.

Description:
Integrated a new tutorial and modernized the existing tutorial examples: the deprecated cache_field_names parameter is replaced with the new SimpleGetItem wrapper class, and the mnist_get_item implementation is updated to an MnistGetItem class compatible with the current API.

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • Documentation
    • Added three new recipe guides: Data Loading with Torch Datasets, Training on MNIST with Torch, and Speeding up with Cached Fields.
    • Included runnable notebooks with step-by-step MNIST training, setup and script retrieval, training runs, visualizations, and narrative guidance.
    • Added an example-driven recipe demonstrating how to preload specific fields for faster data loading and batch visualization.
@AdonaiVera AdonaiVera requested a review from a team as a code owner October 30, 2025 18:36

coderabbitai bot commented Oct 30, 2025

Walkthrough

Added three Torch-related recipe pages and updated the recipes index; expanded two notebooks with setup, downloads, and runnable MNIST training examples; refactored the example utility module to introduce SimpleGetItem and MnistGetItem and removed the old cache_field_names parameter in dataloader factory functions.

Changes

  • Recipe Index Updates — docs/source/recipes/index.rst
    Added three recipe cards and updated the toctree to reference new notebook pages for Torch data loading, MNIST training, and cached-field optimization.
  • Notebook: MNIST training — docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb
    Added step-by-step notebook content: imports, external script downloads (mnist_training.py, utils.py), dataset setup, training invocation, app launch and results display, and markdown restructuring.
  • Notebook: Cached fields / get-item changes — docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
    Replaced use of cache_field_names with a pattern using SimpleGetItem + vectorize=True; updated narrative and examples to demonstrate preloading specified fields and the altered get-item input shape (dicts of cached fields).
  • Utility module refactor — docs/source/recipes/torch-dataset-examples/utils.py
    Added the GetItem import, introduced SimpleGetItem and MnistGetItem (with an mnist_get_item instance), removed the cache_field_names parameter from create_dataloaders / create_dataloaders_ddp, and switched to conditional vectorize usage when calling to_torch.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant create_dataloaders
    participant SimpleGetItem
    participant to_torch
    participant DataLoader

    User->>create_dataloaders: dataset, get_item (maybe SimpleGetItem)
    create_dataloaders->>create_dataloaders: determine use_vectorize
    alt use_vectorize == true
        create_dataloaders->>to_torch: to_torch(..., vectorize=True)
        to_torch->>SimpleGetItem: use wrapper to produce cached-field dicts
        SimpleGetItem-->>to_torch: cached dicts for each sample
        to_torch->>DataLoader: build dataset preloaded with fields
    else use_vectorize == false
        create_dataloaders->>to_torch: to_torch(..., vectorize=False)
        to_torch->>DataLoader: build standard dataset (model samples)
    end
    DataLoader-->>User: DataLoader ready

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas to review:
    • utils.py — new classes (SimpleGetItem, MnistGetItem) and changed function signatures (ensure backwards compatibility and correct behavior with to_torch vectorize flag).
    • Notebooks — external script downloads (paths/URLs) and runnable cells that invoke the refactored utilities.
    • Index TOC entries — verify paths and metadata are correct.

Poem

🐰 I hopped through docs with a helpful twitch,
New recipes added, each shiny and rich,
SimpleGetItem guards cached field delight,
MNIST trains in the soft document light,
Hooray — small hops making workflows quick! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check — ✅ Passed: The title "Docs: integrate new tutorial from Jacob and modernize deprecated cache_field_names usage" clearly summarizes the changeset's two main objectives: integrating new recipe tutorials into the documentation index and replacing the deprecated cache_field_names parameter with the new SimpleGetItem helper class. It is concise, descriptive, and gives sufficient context for someone scanning the repository history.
  • Description Check — ✅ Passed: The description follows the required template and includes all essential sections. "What changes are proposed" details the new SimpleGetItem and MnistGetItem classes and their purpose; "How is this patch tested" lists multiple validation approaches (local build verification, class functionality testing, backward compatibility checks) with dev documentation links; and the Release Notes section is fully completed, confirming the user-facing change (marked with [x]) and identifying Documentation as the affected area.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch docs/recipe-improvements-torch



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (1)

392-396: Replace placeholder path with a valid default.

The hardcoded path /path/to/save/weights will cause a runtime error when the training script attempts to save model weights. Users following this tutorial will encounter a failure.

Apply this diff to use a temporary directory:

+import tempfile
+import os
+
 device = 'cuda' if torch.cuda.is_available() else 'cpu'
-path_to_save_weights = '/path/to/save/weights'
+path_to_save_weights = os.path.join(tempfile.gettempdir(), 'mnist_weights')
+os.makedirs(path_to_save_weights, exist_ok=True)
 mnist_training.main(subset, 10, 10, device, path_to_save_weights)

Alternatively, add a markdown cell above explaining that users must modify this path to a valid directory on their system.

🧹 Nitpick comments (5)
docs/source/recipes/torch-dataset-examples/utils.py (1)

167-204: Approve the modernization, but document the vectorization logic.

The migration from cache_field_names to the vectorize parameter with automatic detection (lines 187-188) is well-executed. The design choice to enable vectorization only for SimpleGetItem instances makes sense for the tutorial context, as SimpleGetItem is specifically designed for cached field optimization.

However, consider documenting this behavior in the docstring or as a comment, since users with custom GetItem subclasses won't get vectorization unless they inherit from SimpleGetItem or explicitly modify this logic.

Consider adding a clarifying comment:

+    # Use vectorize only for SimpleGetItem (cached fields)
+    # Custom GetItem subclasses will use lazy loading by default
     use_vectorize = isinstance(get_item, SimpleGetItem)
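The detection logic this comment documents amounts to a simple isinstance check. A minimal, self-contained sketch follows; the classes here are stand-ins for those in the tutorial's utils.py, and the function name is hypothetical:

```python
# Stand-ins for the classes defined in the tutorial's utils.py
class SimpleGetItem:
    pass


class CustomGetItem:
    pass


def choose_vectorize(get_item):
    # Vectorize only for SimpleGetItem (cached fields); any other
    # get-item object falls back to lazy, per-sample loading
    return isinstance(get_item, SimpleGetItem)


print(choose_vectorize(SimpleGetItem()))  # True
print(choose_vectorize(CustomGetItem()))  # False
```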
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (1)

85-86: Consider documenting the security implications of downloading remote code.

While the hardcoded HTTPS URL to the official Voxel51 CDN is safe, users should be aware they're downloading and executing Python code from a remote source. Consider adding a brief note about this, or alternatively, provide the utils.py content inline or as part of the repository.

This addresses the static analysis hint about URL schemes, though the specific concern (custom schemes like file:) doesn't apply here.

docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (3)

82-82: Verify usage of PIL.Image import.

The PIL.Image import doesn't appear to be used in this notebook. Consider removing it unless it's required by the downloaded external scripts.


94-111: Add integrity verification for downloaded scripts.

Downloading and importing Python scripts without integrity verification poses security and reliability risks. While the URLs point to the official Voxel51 CDN, consider adding checksum validation to ensure the downloaded files haven't been tampered with or corrupted.

Example approach:

+import hashlib
+
 url = "https://cdn.voxel51.com/tutorials_torch_dataset_examples/notebook_simple_training_example/mnist_training.py"
-urllib.request.urlretrieve(url, "mnist_training.py")
+filename, _ = urllib.request.urlretrieve(url, "mnist_training.py")
+
+# Verify checksum (replace with actual SHA256 hash of the file)
+expected_hash = "abc123..."  
+with open(filename, 'rb') as f:
+    file_hash = hashlib.sha256(f.read()).hexdigest()
+    if file_hash != expected_hash:
+        raise ValueError(f"File integrity check failed for {filename}")

Alternatively, consider hosting these scripts in the repository and importing them directly to eliminate the download step entirely.


406-408: Remove empty markdown cell.

The empty markdown cell at the end of the notebook serves no purpose and can be removed for cleaner structure.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b280134 and c77a1f1.

📒 Files selected for processing (4)
  • docs/source/recipes/index.rst (2 hunks)
  • docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (4 hunks)
  • docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (13 hunks)
  • docs/source/recipes/torch-dataset-examples/utils.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-24T16:42:23.947Z
Learnt from: swheaton
PR: voxel51/fiftyone#6161
File: plugins/operators/model_evaluation/__init__.py:1059-1060
Timestamp: 2025-07-24T16:42:23.947Z
Learning: In FiftyOne, there are two distinct caching mechanisms that serve different purposes: (1) singleton cache (`fo.config.singleton_cache`) controls whether Dataset/Sample/Frame objects are treated as singletons for object identity management, and (2) server-side TTL cache (`fosu.cache_dataset`) provides performance optimization in app/server contexts. These mechanisms are independent - server-side caching should continue regardless of singleton cache settings.

Applied to files:

  • docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
🧬 Code graph analysis (1)
docs/source/recipes/torch-dataset-examples/utils.py (2)
fiftyone/utils/torch.py (2)
  • FiftyOneTorchDataset (1704-1936)
  • GetItem (258-326)
fiftyone/core/collections.py (1)
  • to_torch (9987-10025)
🪛 Ruff (0.14.2)
docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb

19-19: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


23-23: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb

11-11: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (12)
docs/source/recipes/torch-dataset-examples/utils.py (4)

2-2: LGTM! Import aligns with the new GetItem-based design.

The addition of GetItem import is necessary for the new SimpleGetItem and MnistGetItem classes and aligns with the FiftyOne API.


13-34: LGTM! Clean wrapper design for vectorized caching.

The SimpleGetItem class effectively wraps user functions with field name specifications, enabling the vectorized caching path in to_torch(). The identity field mapping (where keys equal values) is appropriate for this use case.


137-164: LGTM! Well-structured migration from function to GetItem class.

The MnistGetItem class properly extends GetItem with explicit field mappings and correctly handles the MNIST label format. The label extraction at line 159 (int(sample["label"][0])) appropriately parses the "<number> - <number name>" format by taking the first character, which works correctly for single-digit MNIST classes.
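The parsing step described here can be reproduced in isolation. The label format ("<digit> - <name>", e.g. "3 - three") is taken from the review comment; the helper name is hypothetical:

```python
def parse_mnist_label(label: str) -> int:
    # MNIST ground-truth labels look like "3 - three"; for single-digit
    # classes the first character is the digit itself
    return int(label[0])


print(parse_mnist_label("7 - seven"))  # 7
```

Note that this relies on MNIST having exactly ten single-digit classes; a multi-digit label scheme would need `label.split(" - ")[0]` instead.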


231-270: LGTM! Consistent DDP implementation.

The create_dataloaders_ddp function mirrors the changes in create_dataloaders with the same vectorization logic and updated API. The implementation correctly integrates with PyTorch's DistributedSampler for distributed training.

docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (5)

137-140: Excellent explanation of the caching mechanism.

The description clearly explains how vectorize=True with the SimpleGetItem wrapper enables field preloading, and importantly notes that the helper must be imported from a proper Python module for PyTorch's multiprocessing to work correctly. This is valuable context for users.


239-242: Good example of the SimpleGetItem pattern.

The code correctly demonstrates wrapping a get_item function with SimpleGetItem and passing it to to_torch() with vectorize=True. Ensure that get_item_identity and fields_of_interest are defined in earlier cells (not visible in this diff).


274-276: Important clarification about the input signature change.

This correctly explains a critical behavioral difference: when using vectorize=True with cached fields, the get_item function receives a dict (with keys corresponding to cached field names) instead of a FiftyOne Sample object. This is essential information for users adapting their code.
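The behavioral difference can be illustrated with two hypothetical get-item callables; the stand-in sample object and field names are assumptions for the sketch:

```python
# Lazy path (vectorize=False): get_item receives a FiftyOne Sample and
# uses attribute access (a simple stand-in object is used here)
class FakeSample:
    def __init__(self, filepath, label):
        self.filepath = filepath
        self.label = label


def get_item_lazy(sample):
    return sample.filepath, sample.label


# Cached path (vectorize=True): get_item receives a plain dict keyed by
# the cached field names
def get_item_cached(d):
    return d["filepath"], d["label"]


print(get_item_lazy(FakeSample("/tmp/0.png", "0 - zero")))
print(get_item_cached({"filepath": "/tmp/0.png", "label": "0 - zero"}))
```

Both calls yield the same tuple, which is the point: only the input container changes, so migrating a get-item function means swapping attribute access for key access.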


322-324: Consistent usage of SimpleGetItem wrapper.

The example correctly demonstrates the same pattern with a different get_item function. The comment helpfully clarifies the purpose of the wrapper.


393-396: Complete example showing end-to-end integration.

This final example demonstrates using the SimpleGetItem wrapper with a function from the utilities module and creating a dataloader, which is helpful for showing the complete workflow.

docs/source/recipes/torch-dataset-examples/simple_training_example.ipynb (1)

3-15: LGTM! Clear and comprehensive introduction.

The introduction effectively sets expectations by outlining the specific topics covered in the tutorial.

docs/source/recipes/index.rst (2)

52-57: Verify description aligns with updated API.

The PR summary states that cache_field_names is deprecated and replaced with the SimpleGetItem helper class. However, this recipe description still refers to "using the cache_field_names argument".

Please verify that the description accurately reflects the content of the linked notebook. If the notebook teaches the new SimpleGetItem approach (as implied by the PR summary), consider updating the description:

 .. customcarditem::
     :header: Speeding up with cached fields
-    :description: Improve training performance by preloading specific fields into memory using the `cache_field_names` argument with `FiftyOneTorchDataset`.
+    :description: Improve training performance by preloading specific fields into memory using the `SimpleGetItem` helper with vectorized data loading in `FiftyOneTorchDataset`.
     :link: torch-dataset-examples/the_cache_field_names_argument.html

Alternatively, if the notebook covers both the deprecated and new approaches, clarify that in the description.


160-162: The original review comment is incorrect.

The file fiftyone_torch_dataloader.ipynb exists at docs/source/recipes/fiftyone_torch_dataloader.ipynb and is correctly referenced using a relative path in line 160 of the index.rst file. Since both the index.rst and the notebook are in the same docs/source/recipes/ directory, the reference <fiftyone_torch_dataloader.ipynb> is appropriate and consistent with how lines 161-162 reference files in the torch-dataset-examples/ subdirectory. There is no path inconsistency, and the file is not missing from the PR.

Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (2)

7-14: Consider mentioning that this replaces the deprecated cache_field_names parameter.

The title and description effectively explain the modern approach using SimpleGetItem with vectorize=True. However, since the PR objectives indicate this modernizes deprecated cache_field_names usage, adding a brief note (e.g., "This modern approach replaces the deprecated cache_field_names parameter") would help users transitioning from older documentation understand the context and migration path.


137-140: Consider simplifying terminology for clarity.

Line 137 mentions "a GetItem object," but this notebook only imports and uses SimpleGetItem. While GetItem may be a base class or protocol, readers might find it clearer to say "a get-item wrapper object like SimpleGetItem" or simply reference SimpleGetItem directly, since that's what they'll actually use in their code.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c77a1f1 and b256f79.

📒 Files selected for processing (2)
  • docs/source/recipes/index.rst (2 hunks)
  • docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (13 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-08-12T17:41:41.645Z
Learnt from: swheaton
PR: voxel51/fiftyone#4651
File: fiftyone/__public__.py:116-116
Timestamp: 2024-08-12T17:41:41.645Z
Learning: In `fiftyone/__public__.py`, imports are intended to make names available to users at the top module level, even if they are not used directly in the file. Avoid suggesting the removal of unused imports in this file.

Applied to files:

  • docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
📚 Learning: 2025-07-24T16:42:23.947Z
Learnt from: swheaton
PR: voxel51/fiftyone#6161
File: plugins/operators/model_evaluation/__init__.py:1059-1060
Timestamp: 2025-07-24T16:42:23.947Z
Learning: In FiftyOne, there are two distinct caching mechanisms that serve different purposes: (1) singleton cache (`fo.config.singleton_cache`) controls whether Dataset/Sample/Frame objects are treated as singletons for object identity management, and (2) server-side TTL cache (`fosu.cache_dataset`) provides performance optimization in app/server contexts. These mechanisms are independent - server-side caching should continue regardless of singleton cache settings.

Applied to files:

  • docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb
🪛 Ruff (0.14.2)
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb

11-11: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (6)
docs/source/recipes/torch-dataset-examples/the_cache_field_names_argument.ipynb (5)

112-112: LGTM!

The import of SimpleGetItem from the downloaded utils.py is correct and necessary for the tutorial examples.


241-242: LGTM!

The pattern of wrapping the get_item function with SimpleGetItem and passing vectorize=True to to_torch() is correctly demonstrated. The inline comments clearly explain what's happening.


274-276: LGTM!

This explanation is crucial for users to understand the behavior change when using vectorize=True. The clarification that the get_item function receives a dict of cached fields rather than a FiftyOne Sample object is clear and well-stated.


323-324: LGTM!

The consistent application of the SimpleGetItem wrapper pattern throughout the tutorial reinforces learning and demonstrates best practices.


394-396: LGTM!

The final example effectively demonstrates how to use helper functions from the utils module with the SimpleGetItem wrapper pattern, maintaining consistency with the previous examples.

docs/source/recipes/index.rst (1)

38-57: Past feedback incorporated and new recipes well-structured.

The description on line 40 has been updated to "Learn how to efficiently load and work with FiftyOne datasets in PyTorch using FiftyOneTorchDataset," which now properly aligns with the header "Data Loading with Torch Datasets" and addresses the previous review's concern about mismatched descriptions. All three new recipe cards are well-structured with clear headers, aligned descriptions, consistent tags, and CDN images. The toctree entries (lines 160–162) correctly reference the corresponding notebook files, with the first notebook at the recipes root level and the subsequent two in the torch-dataset-examples/ subdirectory—consistent with the card link paths.

Also applies to: 160-162

@AdonaiVera AdonaiVera added the documentation Documentation work label Oct 31, 2025