
I'm struggling with the right combo of piso functions to perform the following analysis:

Let's say I have a length of road with mile markers as such:

0---1---2---3---4---5

And let's say I have a record of inspections like so:

(mile 0 to 5 inspected 1/1/2025)

(mile 2 to 5 inspected 1/8/2025)

(mile 0 to 3 inspected 1/15/2025)

(mile 0 to 2 inspected 1/22/2025)

The table/dataframe "all_df" representing these inspections would look like:

import pandas as pd

all_df = pd.DataFrame(
    data=(
        (0, 5, pd.Timestamp("2025-1-1")),
        (2, 5, pd.Timestamp("2025-1-8")),
        (0, 3, pd.Timestamp("2025-1-15")),
        (0, 2, pd.Timestamp("2025-1-22")),
    ),
    columns = ["From_Mi", "To_Mi", "Date"]
)

The desired dataframe "recent_df" showing only most recent inspections would look like:

| From_Mi | To_Mi | Last Date |
| ------- | ----- | --------- |
| 0       | 2     | 1/22/2025 |
| 2       | 3     | 1/15/2025 |
| 3       | 5     | 1/8/2025  |

Would this be some operation involving .split() and .intersection()?

Any help is appreciated!

3 Answers

0

I wasn't able to work out how to use a more vectorised approach with piso and IntervalArrays, but you could use something like the following:

import piso

# ensure sorted, so the last matching row is the most recent inspection
all_df = all_df.sort_values("Date", ascending=True)
# interval array
arr = pd.arrays.IntervalArray.from_arrays(all_df.From_Mi, all_df.To_Mi)
# unique intervals
unique_intervals = piso.split(arr, set(arr.left).union(arr.right)).unique()

# dates with interval index
dates = all_df.set_index(pd.IntervalIndex(arr)).Date
# create dataframe with left and right intervals and last date where interval overlaps
output = pd.DataFrame(
    [
        {
            "From_Mi": ui.left,
            "To_Mi": ui.right,
            "Date": dates[dates.index.overlaps(ui)].iloc[-1]
        }
        for ui in unique_intervals
    ]
)
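
For comparison, here is a rough loop-free sketch of the same split-then-pick-latest idea using plain pandas and NumPy instead of piso. It is not from the answer above: the names points, segments, pairs, covered, latest and run are made up for illustration, and how="cross" assumes pandas 1.2 or later.

```python
import numpy as np
import pandas as pd

all_df = pd.DataFrame(
    data=(
        (0, 5, pd.Timestamp("2025-1-1")),
        (2, 5, pd.Timestamp("2025-1-8")),
        (0, 3, pd.Timestamp("2025-1-15")),
        (0, 2, pd.Timestamp("2025-1-22")),
    ),
    columns=["From_Mi", "To_Mi", "Date"],
)

# every distinct endpoint is a breakpoint; consecutive breakpoints form segments
points = np.unique(all_df[["From_Mi", "To_Mi"]].to_numpy())
segments = pd.DataFrame({"From_Mi": points[:-1], "To_Mi": points[1:]})

# pair every segment with every inspection, keep pairs where the inspection spans the segment
pairs = segments.merge(all_df, how="cross", suffixes=("", "_insp"))
covered = pairs[
    (pairs["From_Mi_insp"] <= pairs["From_Mi"])
    & (pairs["To_Mi_insp"] >= pairs["To_Mi"])
]

# most recent inspection per segment (never-inspected segments drop out here)
latest = covered.groupby(["From_Mi", "To_Mi"], as_index=False)["Date"].max()

# merge neighbouring segments that touch and share the same date
run = (
    (latest["Date"] != latest["Date"].shift())
    | (latest["From_Mi"] != latest["To_Mi"].shift())
).cumsum()
recent_df = (
    latest.groupby(run)
    .agg(From_Mi=("From_Mi", "first"), To_Mi=("To_Mi", "last"), Last_Date=("Date", "first"))
    .reset_index(drop=True)
)
print(recent_df)
```

The cross join grows with (number of segments) x (number of inspections), so this is only a reasonable trade-off for modest table sizes.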

0

Maybe you can convert each From_Mi/To_Mi pair to a range(), explode it to one row per mile, and use .groupby to get the last inspection date for each mile (this needs integer mile markers), e.g.:

# sort if necessary:
# all_df = all_df.sort_values(by='Date')

all_df['mile'] = all_df.apply(lambda row: range(row['From_Mi'], row['To_Mi']), axis=1)
all_df = all_df.explode('mile')

out = (all_df
    .groupby('mile')['Date']
    .last()
    .reset_index()
)

out = (out
    .groupby((out['Date'] != out['Date'].shift()).cumsum())
    .agg({'Date': 'first', 'mile':'first'})
    .reset_index(drop=True)
    .rename({'mile':'From_Mi'}, axis=1)
)

out['To_Mi'] = out['From_Mi'].shift(-1).fillna(all_df['To_Mi'].max()).astype(int)

print(out)

Prints:

        Date  From_Mi  To_Mi
0 2025-01-22        0      2
1 2025-01-15        2      3
2 2025-01-08        3      5
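
The second groupby relies on the "compare with shift, then cumsum" idiom to label runs of consecutive equal dates. A small standalone illustration, using the per-mile last dates that the sample data should produce for miles 0 to 4 (the names dates and run_id are just for this sketch):

```python
import pandas as pd

# last inspection date for miles 0, 1, 2, 3, 4 in the sample data
dates = pd.Series(
    pd.to_datetime(
        ["2025-01-22", "2025-01-22", "2025-01-15", "2025-01-08", "2025-01-08"]
    ),
    name="Date",
)

# True wherever the date differs from the previous row; cumsum turns that into a run label
run_id = (dates != dates.shift()).cumsum()
print(run_id.tolist())  # [1, 1, 2, 3, 3]
```

Grouping by run_id and taking the first mile of each run gives the From_Mi values 0, 2 and 3.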


0

When I think of intervals associated with values, I tend to think of step functions. An approach with piso may exist, but it'd be easier to do with staircase. If you have piso installed then you also have staircase, because piso is built upon it.

The values of the step function need to be numbers though, not dates, so this means converting the dates into a number - time passed since some (arbitrary) start date or "epoch":

import staircase as sc

epoch = pd.Timestamp("2024-1-1")
series_stairs = all_df.apply(
    lambda row: sc.Stairs(
        start=row.From_Mi,
        end=row.To_Mi,
        value=(row.Date-epoch)/pd.Timedelta("1D"),
    ),
    axis=1,
)

series_stairs is then a pandas Series of Stairs objects. Each Stairs object is an abstraction of a step function:

0    <staircase.Stairs, id=4673826608>
1    <staircase.Stairs, id=4697665760>
2    <staircase.Stairs, id=4703875344>
3    <staircase.Stairs, id=4703876688>
dtype: object

and you can plot them:

import matplotlib.pyplot as plt

_, ax = plt.subplots(figsize=(5,2))
for s in series_stairs:
    s.plot(ax=ax)

[Figure: plots of step functions]

If we take the maximum of these step functions, the result is a step function which gives the latest time at which each point on the road was inspected. That holds for any point on the road, not only integer points, so this extends to fractional values for From_Mi and To_Mi if you wish.

max_stairs = sc.max(series_stairs)
max_stairs.plot()

[Figure: max step function]

You can then convert this step function back to dataframe format, remove any 0-valued intervals (these correspond to any interval on the road which has never been inspected according to the data) and convert the step function values back to datetimes:

# .copy() avoids a possible SettingWithCopyWarning when adding the column below
recent_df = max_stairs.to_frame().query("value != 0").copy()
recent_df["Last-Date"] = recent_df["value"].apply(
    lambda n: epoch + pd.Timedelta(days=n)
)

recent_df then looks like this:

start end  value   Last-Date
    0   2  387.0  2025-01-22
    2   3  380.0  2025-01-15
    3   5  373.0  2025-01-08

and you can clean it up to your liking.
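
One possible cleanup, as a sketch only: the frame below is typed in by hand from the output shown above so that it runs without staircase, and the name tidy is made up. It redoes the days-since-epoch conversion in a vectorised way and renames the columns to match the question.

```python
import pandas as pd

epoch = pd.Timestamp("2024-1-1")

# hand-built stand-in for the start/end/value columns shown above
recent_df = pd.DataFrame(
    {"start": [0, 2, 3], "end": [2, 3, 5], "value": [387.0, 380.0, 373.0]}
)

# days since epoch -> timestamps, without a Python-level lambda
recent_df["Last Date"] = epoch + pd.to_timedelta(recent_df["value"], unit="D")

# drop the helper column and use the question's column names
tidy = (
    recent_df.drop(columns="value")
    .rename(columns={"start": "From_Mi", "end": "To_Mi"})
    .astype({"From_Mi": int, "To_Mi": int})
)
print(tidy)
```

The integer cast only makes sense when the mile markers are whole numbers; leave it out for fractional From_Mi/To_Mi values.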

Disclaimer: I am the author of piso and staircase.
