Conversation
Without this we free the entire `TopNOperator` and then pitch the reference to it. It'd be best if these two things were atomically locked together, but so long as everything we free is small-ish we can live with keeping the reference for a bit. We reserve some room to maneuver here. But if the thing we're freeing is quite large this ain't great. We think we have room to allocate but we don't. With this we pitch the reference to each row as we go so they can be GCed immediately. This keeps the untracked-but-hard-referenced state to a single row at a time.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
| * the circuit breaker itself. With this we're still racing, but we're only racing a | ||
| * single row at a time. And single rows can only be so large. And we have enough slop |
There was a problem hiding this comment.
we're only racing a single row at a time
WDYM there? Would it be enough too swap the close() and "= null" statements here?
There was a problem hiding this comment.
No, the row is still referenced on the current thread. It's a shame, but I think it's somewhat inherent in having a close clear the breaker. If we swapped it so we had a ramBytesUsed() method on Row then we could do what you are saying. But that doesn't seem worth it at the moment because we'd have to rewrite a bunch of tracking bits.
| }))); | ||
| } | ||
| try { | ||
| OperatorTestCase.runDriver(drivers); |
There was a problem hiding this comment.
Shouldn't this fail if the breaker doesn't trip? I suppose an OOM should happen otherwise, but I would fail() just to be sure the test is right, and remains right (E.g. No optimizations breaking)
There was a problem hiding this comment.
These are tricky. If, say, this runs in a single thread and each Driver finishes before the others wake up, then it won't OOM.
I would try and take more "personal" control of the Drivers and run a couple of single-loop-iterations so I could be sure.
There was a problem hiding this comment.
I ran it in a loop for many hours and it kept failing. So I'll update it.
There was a problem hiding this comment.
Worst case, it fails in CI, and we revert it.
A similar test that expected a CB 100% was muted for aggs, one I added for reduce phase, and I'm considering having it lenient like this one was initially
|
Hi @nik9000, I've created a changelog YAML for you. |
…l_topn_free_faster
| */ | ||
| () -> { | ||
| for (int i = 0; i < getHeapArray().length; i++) { | ||
| Row row = ((Row) getHeapArray()[i]); |
There was a problem hiding this comment.
Nit: redundant parentheses.
| int approxSizePerFilledRow = RamUsageEstimator.NUM_BYTES_OBJECT_REF * 5 + repeats * Integer.BYTES; | ||
| int topCount = (int) ((double) limit.getBytes() / approxSizePerFilledRow * usedByEachThread) - 1; |
There was a problem hiding this comment.
Extract 5 to a constant (assuming it's the "same" 5 as the one in parameters).
There was a problem hiding this comment.
It's not. I'll push more comments.
| blockFactory, | ||
| breaker, | ||
| topCount, | ||
| IntStream.range(0, repeats).mapToObj(i -> ElementType.INT).toList(), |
There was a problem hiding this comment.
Use Collections.nCopies instead (here and below).
| import static org.hamcrest.Matchers.equalTo; | ||
| import static org.hamcrest.Matchers.greaterThan; | ||
|
|
||
| /** |
There was a problem hiding this comment.
While I get the setup of this test, I'm not sure what is it actually trying to test... what are we trying to verify here? That we properly break the circuits?
There was a problem hiding this comment.
That we don't oom. I'll push a bigger comment explaining. But without the change that I made to the close code the test does OOM.
There was a problem hiding this comment.
Not 100% of the time though. Maybe, 10%. Which is why I've got it running 20 times.
| } | ||
|
|
||
| public void testFromHeapDump1() { | ||
| TopNOperator.Row row = new TopNOperator.Row(new NoopCircuitBreaker(CircuitBreaker.REQUEST), sortOrders(5), 56, 24); |
There was a problem hiding this comment.
The relationship between 5, 56, and 24 and 304L isn't obvious to me (same thing for the numbers in the method below). Are they just magic numbers, do they come from the same heap dump or the 304L, or is it something else?
There was a problem hiding this comment.
They come from the heap dump. They are actual measured results. I'll leave a comment.
| } | ||
|
|
||
| public void testFromHeapDump2() { | ||
| TopNOperator.Row row = new TopNOperator.Row(new NoopCircuitBreaker(CircuitBreaker.REQUEST), sortOrders(1), 1160, 1153096); |
|
@GalLalouche, I pushed a patch for your comments. Could you have another look? |
Without this we free the entire `TopNOperator` and then pitch the reference to it. It'd be best if these two things were atomically locked together, but so long as everything we free is small-ish we can live with keeping the reference for a bit. We reserve some room to maneuver here. But if the thing we're freeing is quite large this ain't great. We think we have room to allocate but we don't. With this we pitch the reference to each row as we go so they can be GCed immediately. This keeps the untracked-but-hard-referenced state to a single row at a time.
💚 Backport successful
|
Without this we free the entire `TopNOperator` and then pitch the reference to it. It'd be best if these two things were atomically locked together, but so long as everything we free is small-ish we can live with keeping the reference for a bit. We reserve some room to maneuver here. But if the thing we're freeing is quite large this ain't great. We think we have room to allocate but we don't. With this we pitch the reference to each row as we go so they can be GCed immediately. This keeps the untracked-but-hard-referenced state to a single row at a time.
Without this we free the entire
TopNOperatorand then pitch the reference to it. It'd be best if these two things were atomically locked together, but so long as everything we free is small-ish we can live with keeping the reference for a bit. We reserve some room to maneuver here. But if the thing we're freeing is quite large this ain't great. We think we have room to allocate but we don't.With this we pitch the reference to each row as we go so they can be GCed immediately. This keeps the untracked-but-hard-referenced state to a single row at a time.