Replies: 4 comments
-
Hi, thanks for posting this. I will have a deep read of it and reply here; I will also check your changes to see what can be merged directly into the upstream codebase. Feel free to open a PR from your side.
-
Hi, I went through it. First, thank you for publishing the good stress test tool; I will look into reusing it to do additional testing for the next major as well.

In terms of your general findings, I agree with them. OrientDB, as of the current stable version, struggles on first boot and in the node-restart use-case, especially when under heavy load, and the reason for it is exactly the one that you mentioned and tried to tackle. Yes, some of these issues cannot be fixed without reviewing the module structures and the general lifecycle. It is somewhat comforting that you independently arrived at these conclusions, as they match mine.

So in the short term, if you are happy with it, I will pull some of your changes directly into the main repo so everybody can enjoy the fixes, starting from yours.

Port for sure (the following changes seem easy and trouble-free):

Try to port, with the needed double checks and testing, the following:

And keep out:

I should be able to cherry-pick these changes directly without the need for a PR, obviously keeping the right attribution; let me know if you are against it or want to keep something out.

For a longer-term view on these issues: in the next major, the distributed module is going through a complete review/redesign exactly on the parts where you found issues (it is happening right now). If everything goes all right, distributed will no longer be a server plugin, and its whole lifecycle will be handled by OrientDBDistributed, which also has control over database state and session opening, which should make everything more coherent.

For the Lucene-related inconsistencies, it all depends on how the Lucene indexes are integrated into the transaction lifecycle and the backup/restore functionality. Again, this does not have a short-term solution, but for the next major some enabling work has already been done to allow a better integration; help and/or sponsorship on getting that work done is really welcome.

History (for whoever has more time to read and wants to understand why things are like they are): OrientDB was born as a standalone embedded database for Java applications. When the first network layer appeared, it was integrated at the storage layer and was not aware of the content of what are now documents. After that, when the distributed module was introduced, it was also introduced at the same storage layer. For all the 1.x and 2.x series of releases, the storage was the only interface that had "concurrency handling" and was shared across sessions, which back then were instances of ODatabaseDocumentTx; each instance would fetch the storage instance from the
-
Hi, I just released 3.2.43, including most of the commits I mentioned before that I would include. Regards
-
Thanks @tglman - we've rebased on 3.2.43 now. Appreciate the merge.
-
Our team has finally got some space in the calendar to upgrade to OrientDB 3.2, and ported our functionality patches across to 3.2.
The distributed code has changed a bunch in the interim though, so we repeated the stress testing we performed on 3.1 to see if the same issues exist and then patched them where required.
tl;dr: in distributed mode, 3.2 has pretty much the same problems as 3.1, and they cause significant issues under even modest stress conditions.
Essentially there are no new issues here that we haven't already reported or submitted patches for in 3.2, so we opted to post the overall results here rather than raise a series of disconnected issues that might not be fixable in the existing architecture. That said, we're happy to support that if there are distinct issues that can be acted on. It might also be more productive for committers to run the stress test tool themselves and recreate issues, which will give them detailed database logging and resulting databases that can be examined.
Stress testing scenarios
We used the orientdb-stress tool we developed to stress test OrientDB 3.2 in various scenarios.
As an aside, orientdb-stress has had a few improvements as a result of this work, and is easier to run against different releases now, so replicating these results should be straightforward for anyone interested.
The scenarios we ran were:
- `basic-startup` - simply starts a cluster and checks it starts up correctly.
- `random-restart` - restarts random nodes at intervals.
- `rolling-restart` - performs a rolling restart of all nodes (one node immediately after the other).

These were initially run with no client workload (e.g. simply creating a test database and opening it), and then with read and read/write workloads.
A vanilla OrientDB 3.2.42 fails the most basic of these tests pretty much immediately, so we then proceeded to replicate the stability patches we maintain for 3.1 and iterated on testing to find other issues.
The remaining sections summarise the testing and findings from that work, and the resulting patches can be found in our 3.2 HA stability branch.
Summary of problems
Most of the issues we observed essentially fall into two categories:
- startup and restart ordering issues, where the distributed plugin or database state is not yet ready when requests arrive
- lifecycle races, where database resources are accessed while lifecycle operations (primarily backup/sync to/from a remote node) are in progress
While we've been able to patch most of the most prevalent startup issues, we've come to the conclusion that the second category of issues cannot be solved in OrientDB without a restructuring of the subsystems into proper modules with defined lifecycles and interaction points.
These issues manifest in errors with varying severities:
We don't see a lot of the data corruption issues in 3.2 (aside from Lucene index corruption), but some of the other types of errors are very easy to reproduce, and at present we would not consider it safe to restart a production OrientDB 3.2 node under client load.
Splitting of OrientDBDistributed behaviour
The `OrientDBDistributed` implementation is always enabled in OrientDB, as the server startup logic attempts to load it first and always succeeds since the distributed plugin was moved in-tree.
As a result, the distributed implementation needs to know whether to operate in embedded or distributed mode. Detecting this incorrectly causes various codepaths (notably the various document database and session wrappers) to be created with the wrong (embedded) versions, creating issues for later requests.
=> `OrientDBDistributed` needs to reliably detect when the distributed plugin is present and enabled.
Checking whether the plugin is set and enabled is insufficient for this, as there's a race between distributed plugin startup and client requests.
We solve this by separating the loading and registration/startup of plugins, allowing an enabled distributed plugin to be detected prior to the server starting.
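A minimal sketch of that idea, with hypothetical names rather than the actual OrientDB classes: plugins are first registered in a load phase (so their presence and enabled state can be queried) and only started later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (none of these types are the real OrientDB classes):
// split plugin handling into a "load/register" phase and a "start" phase so the
// server can detect an enabled distributed plugin before it starts serving requests.
interface ServerPlugin { void startup(); }
interface DistributedPlugin extends ServerPlugin {}

final class PluginManager {
  private final List<ServerPlugin> loaded = new ArrayList<>();

  // Phase 1: register the plugin only; no plugin logic runs yet.
  void load(ServerPlugin plugin) {
    loaded.add(plugin);
  }

  // Because loading is separated from startup, this answer is stable before startup begins.
  boolean isDistributedEnabled() {
    return loaded.stream().anyMatch(p -> p instanceof DistributedPlugin);
  }

  // Phase 2: start plugins, after embedded vs. distributed wiring has been decided.
  void startAll() {
    loaded.forEach(ServerPlugin::startup);
  }
}
```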
Access to distributed plugin state required by database operations
Various database operations (notably create and open) require access to distributed server state. Attempting to access this too early results in NPEs on the distributed state fields.
=> `OrientDBDistributed` needs to know when the distributed plugin startup is complete, and block access to distributed server state until it is. `getPlugin` is inadequate for this - the plugin field may not be set yet (due to startup races), or `OServer` may be in `startup` and not yet `activate`, at which point it does not await the startup latch.
There are various solutions to this, but we chose to fix it by signalling distributed plugin startup completion by setting the plugin field, and using a latch to await that startup in `OrientDBDistributed`.
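A minimal sketch of that shape (hypothetical names, not the actual `OrientDBDistributed` code): the plugin field is only set once distributed startup completes, and callers that need distributed state wait on a latch released at the same moment.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the latch-based approach described above.
final class DistributedState {
  private final CountDownLatch started = new CountDownLatch(1);
  private volatile Object distributedPlugin; // set only when startup is complete

  // Called by the distributed plugin at the very end of its startup sequence.
  void onDistributedStartupComplete(Object plugin) {
    this.distributedPlugin = plugin;
    started.countDown();
  }

  // Called by database open/create paths that need distributed server state.
  Object awaitDistributedPlugin() throws InterruptedException {
    started.await(); // blocks until startup has signalled completion
    return distributedPlugin;
  }
}
```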
Distributed database access before online state reached
We saw other errors that appear related to accessing a distributed database (via client or distributed requests) before the database reached the online state.
We solved these by blocking (polling) the various open methods on the database open state (similar to the current 3.2 implementation).
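For illustration, a minimal sketch of that kind of polling wait (a hypothetical helper, not the actual implementation):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: block an open/create until the database reports an
// online state, polling with a timeout, roughly mirroring the behaviour described above.
final class OnlineWait {
  static void awaitOnline(Supplier<Boolean> isOnline, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!isOnline.get()) {
      if (System.nanoTime() > deadline) {
        throw new IllegalStateException("database did not reach the online state in time");
      }
      Thread.sleep(100); // simple poll; a real implementation might use notifications instead
    }
  }
}
```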
As an aside, the fallback to checking if a database `exists` in 3.2.42 appears unreliable - local databases are registered on load, so this looks like it would always succeed - so we removed that particular behaviour?
Lifecycle races
We observe various errors that are due to database resources being accessed while database lifecycle changes are in progress, primarily backup/sync to/from a remote node.
e.g. NPEs while performing database directory backup, due to files being removed while the backup directory is being traversed.
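To make that failure mode concrete, here is a standalone sketch (not OrientDB code) of the race: a directory walk that reads files into a backup will fail if another thread removes files between the listing and the read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Standalone illustration of the lifecycle race described above: listing a
// database directory and then reading each file fails if a concurrent
// sync/cleanup task deletes files mid-traversal.
final class BackupRaceDemo {
  static void backup(Path databaseDir) throws IOException {
    try (Stream<Path> files = Files.walk(databaseDir)) {
      files.filter(Files::isRegularFile).forEach(file -> {
        try {
          byte[] content = Files.readAllBytes(file); // throws NoSuchFileException if the file vanished
          // ... write 'content' into the backup archive ...
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }
}
```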
NPEs during distributed transaction commits:
We've explored various options to preclude these types of errors, but the affected access paths are simply too numerous and convoluted, and we have concluded there's no sensible way to do so without re-architecting the subsystems to have proper boundaries, lifecycles and locking mechanisms.
Lost updates
We have observed multiple (i.e. enough to be common, but not the majority) scenario executions with read/write workloads that fail with lost updates - i.e. a write was acknowledged in the REST API, but could not be found in a subsequent query.
Informally we've observed these after a node startup, e.g. shortly after the `HA STATUS` API indicates that the database server is fully online.
At present we don't have a root cause or fix for these issues.
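For context, the check that flags a lost update is just read-after-write verification over the REST API. A minimal sketch of that check (endpoint paths are simplified assumptions, authentication is omitted, and this is not the actual orientdb-stress code):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a read-after-write check: send a write over the REST API and, if
// it is acknowledged, expect a follow-up query to see the written marker.
final class LostUpdateCheck {
  private final HttpClient http = HttpClient.newHttpClient();
  private final String baseUrl; // e.g. "http://localhost:2480" (assumed)

  LostUpdateCheck(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  /** Returns true if the acknowledged write's marker is visible to the verify query. */
  boolean writeIsVisible(String db, String insertSql, String verifySql, String marker) throws Exception {
    HttpResponse<String> write = post(db, insertSql);
    if (write.statusCode() / 100 != 2) {
      throw new IllegalStateException("write was not acknowledged: " + write.statusCode());
    }
    HttpResponse<String> read = post(db, verifySql);
    // An acknowledged write whose marker the query cannot find is a lost update.
    return read.body().contains(marker);
  }

  private HttpResponse<String> post(String db, String sql) throws Exception {
    HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/command/" + db + "/sql"))
        .POST(HttpRequest.BodyPublishers.ofString(sql))
        .build();
    return http.send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```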
Corrupted Lucene Indexes
We intermittently observe corrupted Lucene indexes after restarting servers under a write workload.
Specifically, the Lucene index directory has multiple `segments_N` files, and the one loaded by `IndexWriter` has checksum mismatches with the files in the index directory.
We've previously reported these to the Lucene developers, but they indicated that this scenario is only possible if the database system is manually managing segments files - we suspect there's actually a Lucene issue here where a merge is being interrupted by a system shutdown and the `IndexWriter` isn't able to politely recover (roll forward or back) from that state.
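For anyone trying to reproduce this, Lucene's own CheckIndex can be used to inspect a suspect index directory offline. A minimal sketch (the path is a placeholder for the affected OrientDB Lucene index directory, and this assumes the Lucene version on the server's classpath):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Offline integrity check of a suspect Lucene index directory.
public final class VerifyLuceneIndex {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point this at the Lucene index directory of the affected database.
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/databases/mydb/luceneIndexes/MyIndex"));
         CheckIndex check = new CheckIndex(dir)) {
      CheckIndex.Status status = check.checkIndex();
      System.out.println(status.clean ? "index is clean" : "index is corrupted");
    }
  }
}
```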
Rolling Restart is Unsafe
The `rolling-restart` scenario is particularly troublesome for OrientDB, and exposes a number of areas that can wedge the server and/or cluster.
This isn't a generally useful scenario for us in production, but it's a genuine stress test of the infrastructure to see if it recovers.
Node connection on distributed plugin startup
The distributed plugin forces a connect to all active (according to Hazelcast) nodes during startup, and fails if one of those nodes has gone down in the interim.
This leaves that server in an inconsistent state, and in a "passive" state according to Hazelcast, which breaks distributed operations until the node is restarted.
We have in the past patched this to defer the initial connect, but have not replicated that in 3.2 yet.
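For reference, that earlier patch amounted to "don't fail plugin startup on an unreachable member; retry it later". A rough sketch of the idea (hypothetical names, not the actual plugin code):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of deferring the initial node connections: instead of
// failing plugin startup when a member listed by Hazelcast is unreachable,
// schedule a background retry for that member and carry on.
final class DeferredNodeConnector {
  private final ScheduledExecutorService retries = Executors.newSingleThreadScheduledExecutor();

  void connectAll(List<String> members, NodeClient client) {
    for (String member : members) {
      tryConnect(member, client);
    }
  }

  private void tryConnect(String member, NodeClient client) {
    if (!client.connect(member)) {
      // Do not fail startup; retry this member later instead.
      retries.schedule(() -> tryConnect(member, client), 5, TimeUnit.SECONDS);
    }
  }

  interface NodeClient { boolean connect(String member); }
}
```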
Database storage opens failing
Database storage operations seem to break during database open: