
@misrasaurabh1

@misrasaurabh1 misrasaurabh1 commented Nov 13, 2025

📄 137% (1.37x) speedup for fix_nan_category in deepnote_toolkit/ocelots/pandas/utils.py

⏱️ Runtime : 834 milliseconds → 352 milliseconds (best of 10 runs)
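As a quick check on the headline figure, the percentages in this report appear consistent with computing speedup as (original − optimized) / optimized. The helper below is a hypothetical sketch of that arithmetic, not Codeflash's actual implementation; the function name is an assumption.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup in percent; positive means faster, negative means slower.

    Assumed formula, inferred from the figures quoted in this report.
    """
    return (original - optimized) / optimized * 100.0


# Overall runtime quoted above: 834 ms -> 352 ms
print(round(speedup_percent(834, 352)))        # 137

# Empty-DataFrame regression test: 1.54 us -> 42.4 us
print(round(speedup_percent(1.54, 42.4), 1))   # -96.4
```

Under this reading, a 137% speedup means the optimized run takes roughly 1/2.37 of the original time.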

📝 Explanation and details

The optimized version achieves a 137% speedup by eliminating unnecessary work through two key optimizations:

What was optimized:

  1. Pre-filtered categorical detection: Instead of checking column.dtype.name == "category" for every column in the loop, the optimization identifies all categorical columns upfront using enumerate(df.dtypes) and stores their indices.
  2. Early exit for non-categorical DataFrames: Added a guard clause that returns immediately if no categorical columns exist, avoiding any loop overhead.
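The two steps above can be sketched roughly as follows. This is a reconstruction from the snippets quoted in the review comments, not the actual contents of `utils.py`; the names `categorical_column_indices` and `fix_nan_category_sketch` are hypothetical, and returning the input DataFrame is an assumption based on how the tests use the result.

```python
import pandas as pd


def categorical_column_indices(df: pd.DataFrame) -> list[int]:
    """Positions of categorical columns, found by scanning df.dtypes once."""
    return [i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"]


def fix_nan_category_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the optimized fix_nan_category."""
    indices = categorical_column_indices(df)
    if not indices:
        # Early exit: no categorical columns, so no per-column iloc access.
        return df
    for i in indices:
        column = df.iloc[:, i]
        # Positional assignment, as quoted in the review thread.
        df.iloc[:, i] = column.cat.add_categories("nan")
    return df


mixed = pd.DataFrame({
    "A": pd.Series(["a", "b"], dtype="category"),
    "B": [1, 2],
    "C": pd.Series(["x", "y"], dtype="category"),
})
plain = pd.DataFrame({"A": [1, 2], "B": [4.0, 5.0]})

print(categorical_column_indices(mixed))          # [0, 2]
print(fix_nan_category_sketch(plain) is plain)    # True
```

The usage lines exercise only the index collection and the early exit. How the positional assignment behaves when the categories change may depend on the pandas version, so that path is best verified against the version in use.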

Why this is faster:

  • Reduced dtype access overhead: The original code called df.iloc[:, i] (expensive pandas indexing) for every column, then checked its dtype. The optimization accesses df.dtypes once, which is much faster than repeated iloc calls.
  • Eliminated wasted iterations: For DataFrames with few/no categorical columns, the original code still iterates through all columns. The optimization skips non-categorical columns entirely and exits early when possible.
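To see the difference between the two scanning strategies in isolation, a small `timeit` comparison along these lines may help. It is an illustrative sketch: the `iloc` loop is reconstructed from the description above rather than copied from the original source, the function names are hypothetical, and the column count and repetition count are arbitrary.

```python
import timeit

import pandas as pd

# 200 integer columns, no categoricals (sizes chosen arbitrarily).
df = pd.DataFrame({f"I{i}": [1, 2] for i in range(200)})


def scan_with_iloc(frame: pd.DataFrame) -> list[int]:
    # Pattern described for the original code: build a Series per column
    # via iloc, then inspect its dtype.
    return [
        i for i in range(len(frame.columns))
        if frame.iloc[:, i].dtype.name == "category"
    ]


def scan_with_dtypes(frame: pd.DataFrame) -> list[int]:
    # Pattern used by the optimization: read df.dtypes once.
    return [i for i, dtype in enumerate(frame.dtypes) if dtype.name == "category"]


t_iloc = timeit.timeit(lambda: scan_with_iloc(df), number=20)
t_dtypes = timeit.timeit(lambda: scan_with_dtypes(df), number=20)
print(f"iloc scan: {t_iloc:.4f}s, dtypes scan: {t_dtypes:.4f}s")
```

Both functions return the same indices; only the cost of reaching each dtype differs. Absolute timings will vary by machine and pandas version.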

Performance characteristics from tests:

  • Large DataFrames with mixed types: Shows significant gains (16-22% faster) when many columns exist but only some are categorical
  • No categorical columns: Dramatic improvement (33-58% faster) due to early exit
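  • Small DataFrames: Slight overhead (typically 5-16% slower when categorical columns are present) due to upfront processing; this is negligible in absolute terms (tens of microseconds). The empty-DataFrame tests are the outlier at 96.4% slower (about 1.5μs -> 42-45μs).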

The line profiler confirms this: the original spent 66.8% of time on df.iloc access across all columns, while the optimized version only accesses iloc for the pre-identified categorical columns, reducing this bottleneck substantially.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 31 Passed |
| ⏪ Replay Tests | 86 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pandas as pd
# imports
import pytest
from deepnote_toolkit.ocelots.pandas.utils import fix_nan_category

# unit tests

# ----------------------------- #
# Basic Test Cases
# ----------------------------- #

def test_single_categorical_column_adds_nan_category():
    # Basic: Single categorical column, no 'nan' category present
    df = pd.DataFrame({'A': pd.Series(['a', 'b', 'c'], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 302μs -> 336μs (10.3% slower)


def test_multiple_columns_mixed_types():
    # Basic: Multiple columns, some categorical, some not
    df = pd.DataFrame({
        'A': pd.Series(['a', 'b', 'c'], dtype='category'),
        'B': [1, 2, 3],
        'C': pd.Series(['x', 'y', 'z'], dtype='category')
    })
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 427μs -> 448μs (4.67% slower)
    # Non-categorical column should not have cat accessor
    with pytest.raises(AttributeError):
        _ = result['B'].cat

def test_no_categorical_columns():
    # Basic: DataFrame with no categorical columns
    df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 103μs -> 77.0μs (33.9% faster)

def test_empty_dataframe():
    # Basic: Empty DataFrame
    df = pd.DataFrame()
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 1.54μs -> 42.4μs (96.4% slower)

# ----------------------------- #
# Edge Test Cases
# ----------------------------- #

def test_column_with_nan_values():
    # Edge: Categorical column with actual np.nan values
    df = pd.DataFrame({'A': pd.Series(['a', None, 'b'], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 291μs -> 333μs (12.8% slower)

def test_column_with_duplicate_column_names():
    # Edge: DataFrame with duplicate column names
    df = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['X', 'X'])
    df.iloc[:, 1] = pd.Series(['a', 'b'], dtype='category')
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 81.9μs -> 55.4μs (47.9% faster)

def test_all_categorical_columns():
    # Edge: All columns are categorical
    df = pd.DataFrame({
        'A': pd.Series(['foo', 'bar'], dtype='category'),
        'B': pd.Series(['x', 'y'], dtype='category')
    })
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 372μs -> 427μs (12.8% slower)


def test_column_with_integer_categories():
    # Edge: Categorical column with integer categories
    df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 302μs -> 355μs (14.8% slower)

def test_column_with_boolean_categories():
    # Edge: Categorical column with boolean categories
    df = pd.DataFrame({'A': pd.Series([True, False], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 286μs -> 330μs (13.4% slower)

def test_column_with_only_nan_values():
    # Edge: Categorical column with only np.nan values
    df = pd.DataFrame({'A': pd.Series([None, None], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 282μs -> 329μs (14.3% slower)

def test_column_with_empty_category():
    # Edge: Categorical column with no categories (empty)
    df = pd.DataFrame({'A': pd.Series([], dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 286μs -> 323μs (11.6% slower)
    # No rows, so nothing to check for values

def test_column_with_object_dtype():
    # Edge: Object dtype column that looks like categorical but isn't
    df = pd.DataFrame({'A': ['a', 'b', 'c']})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 66.7μs -> 70.5μs (5.46% slower)
    with pytest.raises(AttributeError):
        _ = result['A'].cat

# ----------------------------- #
# Large Scale Test Cases
# ----------------------------- #

def test_large_dataframe_many_rows():
    # Large: DataFrame with 1000 rows, single categorical column
    data = ['a', 'b', 'c', None] * 250  # 1000 values
    df = pd.DataFrame({'A': pd.Series(data, dtype='category')})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 281μs -> 336μs (16.3% slower)

def test_large_dataframe_many_columns():
    # Large: DataFrame with 500 categorical columns and 500 int columns
    data = {f'C{i}': pd.Series(['x', 'y'], dtype='category') for i in range(500)}
    data.update({f'I{i}': [1, 2] for i in range(500)})
    df = pd.DataFrame(data)
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 85.4ms -> 73.4ms (16.4% faster)
    # All categorical columns should have 'nan' in categories
    for i in range(500):
        pass
    # Integer columns should not have cat accessor
    for i in range(500):
        with pytest.raises(AttributeError):
            _ = result[f'I{i}'].cat

def test_large_dataframe_duplicate_column_names():
    # Large: DataFrame with 1000 columns, all named 'A', alternating categorical and int
    columns = ['A'] * 1000
    data = []
    for i in range(1000):
        if i % 2 == 0:
            data.append(pd.Series(['foo', 'bar'], dtype='category'))
        else:
            data.append([1, 2])
    # A dict keyed by 'A' would collapse to a single column, so build the frame
    # with unique integer keys first and rename the columns afterwards.
    df = pd.DataFrame({i: data[i] for i in range(1000)})
    df.columns = ['A'] * 1000  # force duplicate column names
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 89.3ms -> 73.4ms (21.6% faster)
    # All even-indexed columns should have 'nan' in categories
    for i in range(0, 1000, 2):
        pass
    # All odd-indexed columns should not have cat accessor
    for i in range(1, 1000, 2):
        with pytest.raises(AttributeError):
            _ = result.iloc[:, i].cat

def test_large_dataframe_all_empty_categorical():
    # Large: 1000 columns, all empty categorical
    df = pd.DataFrame({f'C{i}': pd.Series([], dtype='category') for i in range(1000)})
    codeflash_output = fix_nan_category(df.copy()); result = codeflash_output # 152ms -> 150ms (0.730% faster)
    for i in range(1000):
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pandas as pd
# imports
import pytest
from deepnote_toolkit.ocelots.pandas.utils import fix_nan_category

# unit tests

# -----------------------------
# 1. Basic Test Cases
# -----------------------------

def test_basic_single_categorical_column():
    # Basic: Single categorical column, no NaNs
    df = pd.DataFrame({"A": pd.Series(["x", "y"], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 309μs -> 342μs (9.79% slower)

def test_basic_multiple_categorical_columns():
    # Basic: Multiple categorical columns
    df = pd.DataFrame({
        "A": pd.Series(["a", "b"], dtype="category"),
        "B": pd.Series(["c", "d"], dtype="category")
    })
    codeflash_output = fix_nan_category(df); result = codeflash_output # 378μs -> 426μs (11.2% slower)

def test_basic_mixed_types():
    # Basic: Categorical and non-categorical columns
    df = pd.DataFrame({
        "A": pd.Series(["a", "b"], dtype="category"),
        "B": [1, 2],
        "C": ["x", "y"]
    })
    codeflash_output = fix_nan_category(df); result = codeflash_output # 295μs -> 262μs (12.6% faster)

# -----------------------------
# 2. Edge Test Cases
# -----------------------------


def test_edge_empty_dataframe():
    # Edge: Empty dataframe
    df = pd.DataFrame()
    codeflash_output = fix_nan_category(df); result = codeflash_output # 1.62μs -> 44.8μs (96.4% slower)

def test_edge_no_categorical_columns():
    # Edge: No categorical columns
    df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 114μs -> 72.6μs (57.6% faster)

def test_edge_all_nan_column():
    # Edge: Categorical column with all values as NaN
    df = pd.DataFrame({"A": pd.Series([None, None], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 297μs -> 340μs (12.5% slower)

def test_edge_duplicate_column_names():
    # Edge: DataFrame with duplicate column names
    df = pd.DataFrame(
        [
            ["a", "b"],
            ["c", "d"]
        ],
        columns=["X", "X"]
    )
    df.iloc[:, 0] = pd.Series(df.iloc[:, 0], dtype="category")
    df.iloc[:, 1] = pd.Series(df.iloc[:, 1], dtype="category")
    codeflash_output = fix_nan_category(df); result = codeflash_output # 76.7μs -> 55.1μs (39.0% faster)

def test_edge_empty_categorical_column():
    # Edge: Categorical column with no rows
    df = pd.DataFrame({"A": pd.Series([], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 290μs -> 326μs (11.0% slower)

def test_edge_non_string_category():
    # Edge: Categorical column with non-string categories
    df = pd.DataFrame({"A": pd.Series([1, 2], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 289μs -> 332μs (13.0% slower)

def test_edge_nan_values_in_categorical():
    # Edge: Categorical column with actual NaN values
    df = pd.DataFrame({"A": pd.Series(["a", None], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 281μs -> 332μs (15.2% slower)

# -----------------------------
# 3. Large Scale Test Cases
# -----------------------------

def test_large_scale_many_rows():
    # Large scale: Categorical column with many rows
    values = ["cat", "dog", "mouse"] * 333 + ["cat"]  # 1000 elements
    df = pd.DataFrame({"A": pd.Series(values, dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 276μs -> 326μs (15.2% slower)

def test_large_scale_many_columns():
    # Large scale: Many categorical columns
    data = {f"col{i}": pd.Series(["x", "y"], dtype="category") for i in range(50)}
    df = pd.DataFrame(data)
    codeflash_output = fix_nan_category(df); result = codeflash_output # 7.27ms -> 7.26ms (0.022% faster)
    for col in df.columns:
        pass


def test_large_scale_all_empty_categorical():
    # Large scale: All columns are empty categorical columns
    df = pd.DataFrame({f"col{i}": pd.Series([], dtype="category") for i in range(20)})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 3.05ms -> 3.04ms (0.026% faster)
    for col in df.columns:
        pass

# -----------------------------
# Mutation Testing: Ensure failures if function mutated
# -----------------------------
def test_mutation_fail_if_not_add_nan():
    # If function does not add "nan" category, this test should fail
    df = pd.DataFrame({"A": pd.Series(["x", "y"], dtype="category")})
    codeflash_output = fix_nan_category(df); result = codeflash_output # 291μs -> 337μs (13.8% slower)

def test_mutation_fail_if_non_categorical_modified():
    # If function modifies non-categorical columns, this test should fail
    df = pd.DataFrame({"A": [1, 2]})
    original = df.copy()
    codeflash_output = fix_nan_category(df); result = codeflash_output # 73.5μs -> 74.3μs (1.08% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
⏪ Replay Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| test_pytest_testsunittest_dataframe_browser_py_testsunittest_config_py_testsunittest_runtime_executor_py___replay_test_0.py::test_deepnote_toolkit_ocelots_pandas_utils_fix_nan_category | 245ms | 18.9ms | 1203%✅ |
| test_pytest_testsunittest_xdg_paths_py_testsunittest_jinjasql_utils_py_testsunittest_url_utils_py_testsun__replay_test_0.py::test_deepnote_toolkit_ocelots_pandas_utils_fix_nan_category | 244ms | 18.0ms | 1261%✅ |

To edit these changes git checkout codeflash/optimize-fix_nan_category-mhmv7xt8 and push.


Summary by CodeRabbit

  • Bug Fixes
    • Improved handling and performance when adding a placeholder category for missing values in categorical columns: the process now skips work when no categorical columns are present and otherwise touches only the categorical columns.
@misrasaurabh1 misrasaurabh1 requested a review from a team as a code owner November 13, 2025 00:34
@coderabbitai
Contributor

coderabbitai bot commented Nov 13, 2025

📝 Walkthrough


The fix_nan_category function in deepnote_toolkit/ocelots/pandas/utils.py now first collects the indices of categorical columns, returns immediately if none are found, and otherwise loops over only those columns, calling add_categories("nan") on each. The change replaces per-column iloc dtype checks with a single scan of df.dtypes and avoids work when there are no categorical columns; functional behavior for datasets with categorical columns remains the same.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Caller
participant fix_nan_category
participant DataFrame
Note over fix_nan_category,DataFrame: Original flow (per-column checks)
Caller->>fix_nan_category: call
fix_nan_category->>DataFrame: iterate columns
DataFrame-->>fix_nan_category: column + dtype
alt is categorical?
fix_nan_category->>DataFrame: add_categories("nan") for column
DataFrame-->>fix_nan_category: updated column
else not categorical
DataFrame-->>fix_nan_category: skip
end
fix_nan_category-->>Caller: return

mermaid
sequenceDiagram
participant Caller
participant fix_nan_category
participant DataFrame
Note over fix_nan_category,DataFrame: New flow (collect indices then bulk update)
Caller->>fix_nan_category: call
fix_nan_category->>DataFrame: scan columns -> collect categorical indices
DataFrame-->>fix_nan_category: list of categorical indices
alt no categorical indices
fix_nan_category-->>Caller: return early
else has categorical indices
fix_nan_category->>DataFrame: add_categories("nan") for identified columns (bulk)
DataFrame-->>fix_nan_category: updated columns
fix_nan_category-->>Caller: return
end

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title clearly describes the main optimization: a 137% performance improvement to fix_nan_category function, directly matching the PR's core objective. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9994c8f and 2552607.

📒 Files selected for processing (1)
  • deepnote_toolkit/ocelots/pandas/utils.py (1 hunks)
🔇 Additional comments (1)
deepnote_toolkit/ocelots/pandas/utils.py (1)

24-29: Solid optimization: pre-filter and early exit.

Pre-collecting categorical indices and returning early when none exist avoids wasted iteration. Correct approach.

Comment on lines +32 to +34
for i in categorical_indices:
    column = df.iloc[:, i]
    df.iloc[:, i] = column.cat.add_categories("nan")

🧹 Nitpick | 🔵 Trivial

Minor: simplify the assignment.

Lines 33-34 can be combined into one statement without the intermediate variable.

     for i in categorical_indices:
-        column = df.iloc[:, i]
-        df.iloc[:, i] = column.cat.add_categories("nan")
+        df.iloc[:, i] = df.iloc[:, i].cat.add_categories("nan")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-for i in categorical_indices:
-    column = df.iloc[:, i]
-    df.iloc[:, i] = column.cat.add_categories("nan")
+for i in categorical_indices:
+    df.iloc[:, i] = df.iloc[:, i].cat.add_categories("nan")
🤖 Prompt for AI Agents
In deepnote_toolkit/ocelots/pandas/utils.py around lines 32 to 34, the code uses
an intermediate variable 'column' to add a category; simplify by replacing the
two statements with a single assignment that updates the DataFrame column in
place, e.g. assign the result of df.iloc[:, i].cat.add_categories("nan")
directly back to df.iloc[:, i].

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
deepnote_toolkit/ocelots/pandas/utils.py (1)

45-47: Intermediate variable still present.

Past review suggested combining lines 46-47 into a single statement. The intermediate variable remains.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2552607 and b045cd1.

📒 Files selected for processing (1)
  • deepnote_toolkit/ocelots/pandas/utils.py (1 hunks)
🔇 Additional comments (2)
deepnote_toolkit/ocelots/pandas/utils.py (2)

41-42: Early exit is correct.

Avoids wasted iterations when no categorical columns exist. Validated by benchmark improvements.


36-48: Optimization logic is sound, but manual verification needed.

Pre-collection of categorical indices and early exit avoid repeated dtype checks. However, edge case testing couldn't run in this environment (pandas unavailable). Manually verify behavior with empty DataFrames and non-categorical inputs to confirm the early exit and iloc assignment work correctly across your pandas version.
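A minimal way to run that manual check locally might look like the sketch below. It covers only the index-collection step that decides whether the early exit fires; `categorical_column_indices` is a hypothetical helper mirroring the list comprehension quoted in this review, and the three sample frames are illustrative.

```python
import pandas as pd


def categorical_column_indices(df: pd.DataFrame) -> list[int]:
    # Mirrors the comprehension under review; the helper name is hypothetical.
    return [i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"]


empty = pd.DataFrame()
plain = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
# Duplicate column names: positional indices stay unambiguous.
dup = pd.concat(
    [
        pd.Series(["a", "b"], dtype="category", name="X"),
        pd.Series([1, 2], name="X"),
    ],
    axis=1,
)

for label, frame in [("empty", empty), ("plain", plain), ("duplicate names", dup)]:
    indices = categorical_column_indices(frame)
    outcome = "early exit" if not indices else "loop runs"
    print(f"{label}: {indices} -> {outcome}")
```

The `iloc` assignment itself is not exercised here; its behavior when the category set changes is worth confirming separately on the target pandas version.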

Comment on lines +38 to +40
categorical_indices = [
    i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"
]

🧹 Nitpick | 🔵 Trivial

Consider using a type check for dtype detection.

dtype.name == "category" works, but isinstance(dtype, pd.CategoricalDtype) is more idiomatic and robust. (pd.api.types.is_categorical_dtype is deprecated since pandas 2.1 in favor of this isinstance check.)

     categorical_indices = [
-        i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"
+        i for i, dtype in enumerate(df.dtypes) if isinstance(dtype, pd.CategoricalDtype)
     ]
🤖 Prompt for AI Agents
In deepnote_toolkit/ocelots/pandas/utils.py around lines 38 to 40, replace the
dtype name string comparison with a type check for categorical dtypes: use
isinstance(dtype, pd.CategoricalDtype) when filtering df.dtypes so detection is
more idiomatic and robust; update the list comprehension accordingly and ensure
pandas is imported as pd in the module.
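For comparison, the name-based check and a type-based check can be put side by side as in the sketch below. It uses `isinstance(dtype, pd.CategoricalDtype)` for the type-based variant, since `pd.api.types.is_categorical_dtype` is deprecated in pandas 2.1 and later; the sample DataFrame and variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series(["a", "b"], dtype="category"),
    "B": [1, 2],
})

# Name-based check, as in the PR.
by_name = [i for i, dtype in enumerate(df.dtypes) if dtype.name == "category"]

# Type-based check against the dtype class.
by_type = [
    i for i, dtype in enumerate(df.dtypes)
    if isinstance(dtype, pd.CategoricalDtype)
]

print(by_name, by_type)   # [0] [0]
```

For ordinary pandas categorical columns the two checks agree, so the choice is mainly about readability and resilience to custom dtypes that happen to share a name.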
