Comparison of Pandas lookup times

Question

After experimenting with timing various types of lookups on a Pandas (0.17.1) DataFrame I am left with a few questions.

Here is the set up...

import pandas as pd
import numpy as np
import itertools

letters = [chr(x) for x in range(ord('a'), ord('z'))]
letter_combinations = [''.join(x) for x in itertools.combinations(letters, 3)]

df1 = pd.DataFrame({
        'value': np.random.normal(size=(1000000)), 
        'letter': np.random.choice(letter_combinations, 1000000)
    })
df2 = df1.sort_values('letter')
df3 = df1.set_index('letter')
df4 = df3.sort_index()

So df1 looks something like this...

print(df1.head(5))


>>>
  letter     value
0    bdh  0.253778
1    cem -1.915726
2    mru -0.434007
3    lnw -1.286693
4    fjv  0.245523

Here is the code to test differences in lookup performance...

print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df1[df1.letter == 'ben']
%timeit df1[df1.letter == 'amy']
%timeit df1[df1.letter == 'abe']

print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df2[df2.letter == 'ben']
%timeit df2[df2.letter == 'amy']
%timeit df2[df2.letter == 'abe']

print('~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df3.loc['ben']
%timeit df3.loc['amy']
%timeit df3.loc['abe']

print('~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df4.loc['ben']
%timeit df4.loc['amy']
%timeit df4.loc['abe']

And the results...

~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 193 ms per loop
~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached 
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 41 ms per loop
10 loops, best of 3: 40.9 ms per loop
~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 1621.00 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 242 µs per loop
1000 loops, best of 3: 243 µs per loop

Questions...

It's pretty clear why the lookup on the sorted index is so much faster, binary search to get O(log(n)) performance vs O(n) for a full array scan. But, why is the lookup on the sorted non-indexed df2 column SLOWER than the lookup on the unsorted non-indexed column df1?
What is up with the The slowest run took x times longer than the fastest. This could mean that an intermediate result is being cached. Surely, the results aren't being cached. Is it because the created index is lazy and isn't actually reindexed until needed? That would explain why it is only on the first call to .loc[].
Why isn't an index sorted by default? The fixed cost of the sort can be too much?

3. Why isn't an index sorted by default? - imagine that you've set a custom index and it got sorted, not asking you ... — MaxU - stand with Ukraine
– MaxU - stand with Ukraine, Commented Jul 7, 2016 at 20:13
1. - the memory footprint of df2 is approx. 50% larger compared to df1 — MaxU - stand with Ukraine
– MaxU - stand with Ukraine, Commented Jul 7, 2016 at 20:16
@MaxU The reason I ask is because other tools (DBs and R's data.table) sorts by the index by default — Bruce Pucci
– Bruce Pucci, Commented Jul 7, 2016 at 20:21

unutbu · Accepted Answer · 2016-07-08 11:32:39Z

The disparity in these %timeit results

In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop

In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop

also shows up in the pure NumPy equality comparisons:

In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop

In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop

Under the hood, Pandas' df1['letter'] == 'ben' calls a Cython function which loops through the values of the underlying NumPy array, df1['letter'].values. It is essentially doing the same thing as df1['letter'].values == 'ben' but with different handling of NaNs.

Moreover, notice that simply accessing the items in df1['letter'] in sequential order can be done more quickly than doing the same for df2['letter']:

In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop

In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop

The difference in times within each of these three sets of %timeit tests are roughly the same. I think that is because they all share the same cause.

Since the letter column holds strings, the NumPy arrays df1['letter'].values and df2['letter'].values have dtype object and therefore they hold pointers to the memory location of the arbitrary Python objects (in this case strings).

Consider the memory location of the strings stored in the DataFrames, df1 and df2. In CPython the id returns the memory location of the object:

memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
                       'df2': list(map(id, df2['letter'])), })

               df1              df2
0  140226328244040  140226299303840
1  140226328243088  140226308389048
2  140226328243872  140226317328936
3  140226328243760  140226230086600
4  140226328243368  140226285885624

The strings in df1 (after the first dozen or so) tend to appear sequentially in memory, while sorting causes the strings in df2 (taken in order) to be scattered in memory:

In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]: 
         df1         df2
0        NaN         NaN
1     -952.0   9085208.0
2      784.0   8939888.0
3     -112.0 -87242336.0
4     -392.0  55799024.0
5     -392.0   5436736.0
6      952.0  22687184.0
7       56.0 -26436984.0
8     -448.0  24264592.0
9      -56.0  -4092072.0
10    -168.0 -10421232.0
11 -363584.0   5512088.0
12      56.0 -17433416.0
13      56.0  40042552.0
14      56.0 -18859440.0
15      56.0 -76535224.0
16      56.0  94092360.0
17      56.0  -4189368.0
18      56.0     73840.0
19      56.0  -5807616.0
20      56.0  -9211680.0
21      56.0  20571736.0
22      56.0 -27142288.0
23      56.0   5615112.0
24      56.0  -5616568.0
25      56.0   5743152.0
26      56.0 -73057432.0
27      56.0  -4988200.0
28      56.0  85630584.0
29      56.0  -4706136.0

Most of the strings in df1 are 56 bytes apart:

In [14]: 
In [16]: diffs['df1'].value_counts()
Out[16]: 
 56.0           986109
 120.0           13671
-524168.0          215
-56.0                1
-12664712.0          1
 41136.0             1
-231731080.0         1
Name: df1, dtype: int64

In [20]: len(diffs['df1'].value_counts())
Out[20]: 7

In contrast the strings in df2 are scattered all over the place:

In [17]: diffs['df2'].value_counts().head()
Out[17]: 
-56.0     46
 56.0     44
 168.0    39
-112.0    37
-392.0    35
Name: df2, dtype: int64

In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764

When these objects (strings) are located sequentially in memory, their values can be retrieved more quickly. This is why the equality comparisons performed by df1['letter'].values == 'ben' can be done faster than those in df2['letter'].values == 'ben'. The lookup time is smaller.

This memory accessing issue also explains why there is no disparity in the %timeit results for the value column.

In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop

In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop

df1['value'] and df2['value'] are NumPy arrays of dtype float64. Unlike object arrays, their values are packed together contiguously in memory. Sorting df1 with df2 = df1.sort_values('letter') causes the values in df2['value'] to be reordered, but since the values are copied into a new NumPy array, the values are located sequentially in memory. So accessing the values in df2['value'] can be done just as quickly as those in df1['value'].

chrisb · Accepted Answer · 2016-07-08 13:04:52Z

4

(1) pandas currently has no knowledge of the sortedness of a column.
If you want to take advantage of sorted data, you could use df2.letter.searchsorted See @unutbu's answer for an explanation of what's actually causing the difference in time here.

(2) The hash table that sits underneath the index is lazily created, then cached.

edited Jul 8, 2016 at 13:04

answered Jul 7, 2016 at 20:49

chrisb

52.7k8 gold badges73 silver badges70 bronze badges

2 Comments

Alex Riley Over a year ago

I don't quite see how the Index type is responsible for the time difference here. OP is using 0.17.1 which doesn't feature RangeIndex (both indexes will be of integer type) and new indexes have to be created in both cases. Boolean indexing just boils down to a take from the underlying arrays and index type doesn't seem to matter here.

chrisb Over a year ago

You're right - I thought the index creation time might have been the issue, but it's not.

Collectives™ on Stack Overflow

Comparison of Pandas lookup times

2 Answers 2

Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Linked

Related