
This Python 3.12.7 script with NumPy 2.2.4 and Numba 0.61.2:

import numpy as np, timeit as ti, numba as nb


def f0(a):
  p0 = a[:-2]
  p1 = a[1:-1]
  p2 = a[2:]
  return (p0 < p1) & (p1 > p2)


def f1(a):
  p0 = a[:-4]
  p1 = a[1:-3]
  p2 = a[2:-2]
  p3 = a[3:-1]
  p4 = a[4:]
  return ((p0 < p1) & (p1 == p2) | (p1 < p2)) & ((p2 > p3) | (p2 == p3) & (p3 > p4))


@nb.njit(fastmath=True)
def g0(a):
  r = np.zeros_like(a, dtype=np.bool)

  for i in range(1, a.size-1):
    r[i] = (a[i-1] < a[i]) & (a[i+1] < a[i])

  return r[1:-1]


@nb.njit(fastmath=True)
def g1(a):
  r = np.zeros_like(a, dtype=np.bool)

  for i in range(2, a.size-2):
    r[i] = ((a[i-1] == a[i]) & (a[i-2] < a[i-1]) | (a[i-1] < a[i])) & \
           ((a[i+1] == a[i]) & (a[i+2] < a[i+1]) | (a[i+1] < a[i]))

  return r[2:-2]


a = np.random.randint(0, 256, (500, 500)).astype(np.uint8)
b = a.ravel()
print(f'Minimum, median and maximum execution time in us:')

for fun in ('f0(b)', 'f1(b)', 'g0(b)', 'g1(b)'):  
  t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=999))
  print(f'{fun:20}  {np.amin(t):8,.3f}  {np.median(t):8,.3f}  {np.amax(t):8,.3f}')  

produces these timings on an AMD Ryzen 7 3800X PC running Ubuntu 22.04.5:

Minimum, median and maximum execution time in us:
f0(b)                   32.261    33.483    95.640
f1(b)                  118.974   120.737   129.424
g0(b)                   11.081    11.281    19.327
g1(b)                  723.319   744.419   794.042

For the simple array expression, i.e. f0 vs g0, Numba is about 3x faster than NumPy, but for the more complex expression, i.e. f1 vs g1, Numba becomes 5x slower. This is very surprising to me. What is the reason? Can these run times be improved?

  • In my experience, Numba can fail to vectorize when you manipulate the loop index or use complicated start/stop conditions. Keep the loop bounds as simple as possible, ideally just range(len(a)), and restrict array access to a[i]. In short, I believe you’ll get more stable behavior if you simply replace the last line of your f1 function with an explicit loop. Commented Jun 19 at 4:33
  • This is just my rule of thumb. Hopefully someone will explain the reason behind this. Commented Jun 19 at 4:34
  • @ken "replace the last line of your f1 function with an explicit loop" - f1 is already 5x faster than g1. I need g1 improvement. Commented Jun 19 at 4:41
  • 2
    Perhaps I was not clear enough. Copy f1 as g1b. Replace the last line of g1b (taken from f1) with an explicit loop. Add the njit decorator to g1b. Comparing the performance of f1, g1, and g1b, g1b will be the fastest. Commented Jun 19 at 5:18

1 Answer


@ken had the right idea to combine the pattern in f1 of setting up the slices before the loop with the loop logic of g1 by using a single index over all slices.

@nb.njit(fastmath=True)
def g2(a):
  r = np.zeros_like(a, dtype=np.bool_)
  rp2 = r[2:-2]
  am2 = a     # formerly a[i-2]
  am1 = a[1:] # formerly a[i-1]
  a0 = a[2:]  # formerly a[i]
  ap1 = a[3:] # formerly a[i+1]
  ap2 = a[4:] # formerly a[i+2]

  for i in range(a.size-4):
    rp2[i] = ((am1[i] == a0[i]) & (am2[i] < am1[i]) | (am1[i] < a0[i])) & \
             ((ap1[i] == a0[i]) & (ap2[i] < ap1[i]) | (ap1[i] < a0[i]))

  return rp2

Results on my system:

Minimum, median and maximum execution time in us:
f0(b)                   18.685    19.928   240.951
f1(b)                   68.579    70.974   620.332
g0(b)                    6.813     7.053    10.750
g1(b)                  400.350   411.972  1,153.973
g2(b)                   10.700    11.221    28.834

Side note: My numpy complained about np.bool so I replaced it with np.bool_


1 Comment

Wow! 48x speedup just by avoiding simple index arithmetic! Looking at the generated assembly, g2 uses XMM registers, but g1 fails to vectorize.
