Replies: 4 comments
-
Hi, thanks for posting this. I will have a deep read of it and reply here; I will also check your changes to see what can be merged directly into the upstream codebase. Feel free to open a PR from your side.
-
Hi, I went through it. First, thank you for publishing the good stress test tool; I will look into reusing it to do additional testing for the next major as well.

In terms of your general findings, I agree with them. OrientDB, as of the current stable version, struggles on first boot and in the node-restart use-case, especially when under heavy load, and the reason for it is exactly the one that you mentioned and tried to tackle. Yes, some of these issues cannot be fixed without reviewing the module structures and the general lifecycle. It is somewhat comforting that you independently arrived at these conclusions, as they match mine.

So in the short term, if you are happy with it, I will pull some of your changes directly into the main repo so everybody can enjoy the fixes, starting from yours.

Port for sure (the following changes seem easy and trouble-free):

Try to port, with the needed double checks and testing, the following:

And keep out:

I should be able to cherry-pick these changes directly without the need for a PR, obviously keeping the right attribution; let me know if you are against it or want to keep something out.

For a longer-term view on these issues: in the next major, the distributed module is going through a complete review/redesign exactly on the parts where you found issues (it is happening right now). If everything goes all right, distributed will no longer be a server plugin, and its whole lifecycle will be handled by OrientDBDistributed, which also has control over database state and session opening, which should make everything more coherent.

For the Lucene-related inconsistencies, it all depends on how the Lucene indexes are integrated into the transaction lifecycle and the backup/restore functionality. Again, this does not have a short-term solution, but for the next major some enabling work has already been done to allow a better integration; help and/or sponsorship on getting that work done is really welcome.

History (for whoever has more time to read and wants to understand why things are like they are): OrientDB was born as a standalone embedded database for Java applications. When the first network layer appeared, it was integrated at the storage layer and was not aware of the content of what are now documents. After that, when the distributed module was introduced, it was also introduced at the same storage layer. For all the 1.x and 2.x series of releases, the storage was the only interface that had "concurrency handling" and was shared across sessions, which back then were instances of ODatabaseDocumentTx; each instance would fetch the storage instance from the
-
Hi, I just released 3.2.43, including most of the commits I mentioned before that I would include. Regards
-
Thanks @tglman - we've rebased on 3.2.43 now. Appreciate the merge.
-
Our team has finally got some space in the calendar to upgrade to OrientDB 3.2, and ported our functionality patches across to 3.2.
The distributed code has changed a bunch in the interim though, so we repeated the stress testing we performed on 3.1 to see if the same issues exist and then patched them where required.
tl;dr: in distributed mode, 3.2 has pretty much the same problems as 3.1, and they cause significant issues under even modest stress conditions.
Essentially there are no new issues here that we haven't already reported or submitted patches for in 3.2, so we opted to post the overall results here rather than raise a series of disconnected issues that might not be fixable in the existing architecture. That said, we're happy to support that if there are distinct issues that can be acted on. It might also be more productive for committers to run the stress test tool themselves and recreate issues, which will give them detailed database logging and resulting databases that can be examined.
Stress testing scenarios
We used the orientdb-stress tool we developed to stress test OrientDB 3.2 in various scenarios.
As an aside, orientdb-stress has had a few improvements as a result of this work, and is easier to run against different releases now, so replicating these results should be straightforward for anyone interested.
The scenarios we ran were:
- `basic-startup` - simply starts a cluster and checks it starts up correctly.
- `random-restart` - restarts random nodes at intervals.
- `rolling-restart` - performs a rolling restart of all nodes (one node immediately after the other).

These were initially run with no client workload (e.g. simply creating a test database and opening it), and then with read and read/write workloads.
A vanilla OrientDB 3.2.42 fails the most basic of these tests pretty much immediately, so we then proceeded to replicate the stability patches we maintain for 3.1 and iterated on testing to find other issues.
The remaining sections summarise the testing and findings from that work, and the resulting patches can be found in our 3.2 HA stability branch.
Summary of problems
Most of the issues we observed essentially fall into two categories:
- startup and restart ordering issues, where the distributed plugin or database state is not yet ready when requests arrive
- lifecycle races, where database resources are accessed while lifecycle operations (primarily backup/sync to/from a remote node) are in progress
While we've been able to patch most of the most prevalent startup issues, we've come to the conclusion that the second category of issues cannot be solved in OrientDB without a restructuring of the subsystems into proper modules with defined lifecycles and interaction points.
These issues manifest in errors with varying severities:
We don't see a lot of the data corruption issues in 3.2 (aside from Lucene index corruption), but some of the other types of errors are very easy to reproduce, and at present we would not consider it safe to restart a production OrientDB 3.2 node under client load.
Splitting of OrientDBDistributed behaviour
The `OrientDBDistributed` implementation is always enabled in OrientDB, as the server startup logic attempts to load it first and always succeeds since the distributed plugin was moved in-tree.
As a result, the distributed implementation needs to know whether to operate in embedded or distributed mode. Detecting this incorrectly causes various codepaths (notably the various document database and session wrappers) to be created with the wrong (embedded) versions, creating issues for later requests.
=> `OrientDBDistributed` needs to reliably detect when the distributed plugin is present and enabled.
Checking whether the plugin is set and enabled is insufficient for this, as there's a race between distributed plugin startup and client requests.
We solve this by separating the loading and registration/startup of plugins, allowing an enabled distributed plugin to be detected prior to the server starting.
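A minimal sketch of that idea, with hypothetical names rather than the actual OrientDB classes: plugins are first registered in a load phase (so their presence and enabled state can be queried) and only started later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (none of these types are the real OrientDB classes):
// split plugin handling into a "load/register" phase and a "start" phase so the
// server can detect an enabled distributed plugin before it starts serving requests.
interface ServerPlugin { void startup(); }
interface DistributedPlugin extends ServerPlugin {}

final class PluginManager {
  private final List<ServerPlugin> loaded = new ArrayList<>();

  // Phase 1: register the plugin only; no plugin logic runs yet.
  void load(ServerPlugin plugin) {
    loaded.add(plugin);
  }

  // Because loading is separated from startup, this answer is stable before startup begins.
  boolean isDistributedEnabled() {
    return loaded.stream().anyMatch(p -> p instanceof DistributedPlugin);
  }

  // Phase 2: start plugins, after embedded vs. distributed wiring has been decided.
  void startAll() {
    loaded.forEach(ServerPlugin::startup);
  }
}
```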
Access to distributed plugin state required by database operations
Various database operations (notably create and open) require access to distributed server state. Attempting to access this too early results in NPEs on the distributed state fields.
=> `OrientDBDistributed` needs to know when the distributed plugin startup is complete, and block access to distributed server state until it is. `getPlugin` is inadequate for this - the plugin field may not be set yet (due to startup races), or `OServer` may be in `startup` and not yet `activate`, at which point it does not await the startup latch.
There are various solutions to this, but we chose to fix it by signalling distributed plugin startup completion by setting the plugin field, and using a latch to await that startup in `OrientDBDistributed`.
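A minimal sketch of that shape (hypothetical names, not the actual `OrientDBDistributed` code): the plugin field is only set once distributed startup completes, and callers that need distributed state wait on a latch released at the same moment.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the latch-based approach described above.
final class DistributedState {
  private final CountDownLatch started = new CountDownLatch(1);
  private volatile Object distributedPlugin; // set only when startup is complete

  // Called by the distributed plugin at the very end of its startup sequence.
  void onDistributedStartupComplete(Object plugin) {
    this.distributedPlugin = plugin;
    started.countDown();
  }

  // Called by database open/create paths that need distributed server state.
  Object awaitDistributedPlugin() throws InterruptedException {
    started.await(); // blocks until startup has signalled completion
    return distributedPlugin;
  }
}
```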
Distributed database access before online state reached
We saw other errors that appear related to accessing a distributed database (via client or distributed requests) before the database reached the online state.
We solved these by blocking (polling) the various open methods on the database open state (similar to the current 3.2 implementation).
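For illustration, a minimal sketch of that kind of polling wait (a hypothetical helper, not the actual implementation):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: block an open/create until the database reports an
// online state, polling with a timeout, roughly mirroring the behaviour described above.
final class OnlineWait {
  static void awaitOnline(Supplier<Boolean> isOnline, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!isOnline.get()) {
      if (System.nanoTime() > deadline) {
        throw new IllegalStateException("database did not reach the online state in time");
      }
      Thread.sleep(100); // simple poll; a real implementation might use notifications instead
    }
  }
}
```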
As an aside, the fallback to checking if a database `exists` in 3.2.42 appears unreliable - local databases are registered on load, so this looks like it would always succeed - so we removed that particular behaviour?
Lifecycle races
We observe various errors that are due to database resources being accessed while database lifecycle changes are in progress, primarily backup/sync to/from a remote node.
e.g. NPEs while performing database directory backup, due to files being removed while the backup directory is being traversed.
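To make that failure mode concrete, here is a standalone sketch (not OrientDB code) of the race: a directory walk that reads files into a backup will fail if another thread removes files between the listing and the read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Standalone illustration of the lifecycle race described above: listing a
// database directory and then reading each file fails if a concurrent
// sync/cleanup task deletes files mid-traversal.
final class BackupRaceDemo {
  static void backup(Path databaseDir) throws IOException {
    try (Stream<Path> files = Files.walk(databaseDir)) {
      files.filter(Files::isRegularFile).forEach(file -> {
        try {
          byte[] content = Files.readAllBytes(file); // throws NoSuchFileException if the file vanished
          // ... write 'content' into the backup archive ...
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }
}
```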
NPEs during distributed transaction commits:
We've explored various options to preclude these types of errors, but the affected access paths are simply too numerous and convoluted, and we have concluded there's no sensible way to do so without re-architecting the subsystems to have proper boundaries, lifecycles and locking mechanisms.
Lost updates
We have observed multiple (i.e. enough to be common, but not the majority) scenario executions with read/write workloads that fail with lost updates - i.e. a write was acknowledged in the REST API, but could not be found in a subsequent query.
Informally we've observed these after a node startup, e.g. shortly after the `HA STATUS` API indicates that the database server is fully online.
At present we don't have a root cause or fix for these issues.
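For context, the check that flags a lost update is just read-after-write verification over the REST API. A minimal sketch of that check (endpoint paths are simplified assumptions, authentication is omitted, and this is not the actual orientdb-stress code):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a read-after-write check: send a write over the REST API and, if
// it is acknowledged, expect a follow-up query to see the written marker.
final class LostUpdateCheck {
  private final HttpClient http = HttpClient.newHttpClient();
  private final String baseUrl; // e.g. "http://localhost:2480" (assumed)

  LostUpdateCheck(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  /** Returns true if the acknowledged write's marker is visible to the verify query. */
  boolean writeIsVisible(String db, String insertSql, String verifySql, String marker) throws Exception {
    HttpResponse<String> write = post(db, insertSql);
    if (write.statusCode() / 100 != 2) {
      throw new IllegalStateException("write was not acknowledged: " + write.statusCode());
    }
    HttpResponse<String> read = post(db, verifySql);
    // An acknowledged write whose marker the query cannot find is a lost update.
    return read.body().contains(marker);
  }

  private HttpResponse<String> post(String db, String sql) throws Exception {
    HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/command/" + db + "/sql"))
        .POST(HttpRequest.BodyPublishers.ofString(sql))
        .build();
    return http.send(request, HttpResponse.BodyHandlers.ofString());
  }
}
```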
Corrupted Lucene Indexes
We intermittently observe corrupted Lucene indexes after restarting servers under a write workload.
Specifically, the Lucene index directory has multiple `segments_N` files, and the one loaded by `IndexWriter` has checksum mismatches with the files in the index directory.
We've previously reported these to the Lucene developers, but they indicated that this scenario is only possible if the database system is manually managing segments files - we suspect there's actually a Lucene issue here where a merge is being interrupted by a system shutdown and the `IndexWriter` isn't able to politely recover (roll forward or back) from that state.
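For anyone trying to reproduce this, Lucene's own CheckIndex can be used to inspect a suspect index directory offline. A minimal sketch (the path is a placeholder for the affected OrientDB Lucene index directory, and this assumes the Lucene version on the server's classpath):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Offline integrity check of a suspect Lucene index directory.
public final class VerifyLuceneIndex {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point this at the Lucene index directory of the affected database.
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/databases/mydb/luceneIndexes/MyIndex"));
         CheckIndex check = new CheckIndex(dir)) {
      CheckIndex.Status status = check.checkIndex();
      System.out.println(status.clean ? "index is clean" : "index is corrupted");
    }
  }
}
```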
Rolling Restart is Unsafe
The `rolling-restart` scenario is particularly troublesome for OrientDB, and exposes a number of areas that can wedge the server and/or cluster.
This isn't a generally useful scenario for us in production, but it's a genuine stress test of the infrastructure to see if it recovers.
Node connection on distributed plugin startup
The distributed plugin forces a connect to all active (according to Hazelcast) nodes during startup, and fails if one of those nodes has gone down in the interim.
This leaves that server in an inconsistent state, and in a "passive" state according to Hazelcast, which breaks distributed operations until the node is restarted.
We have in the past patched this to defer the initial connect, but have not replicated that in 3.2 yet.
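For reference, that earlier patch amounted to "don't fail plugin startup on an unreachable member; retry it later". A rough sketch of the idea (hypothetical names, not the actual plugin code):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of deferring the initial node connections: instead of
// failing plugin startup when a member listed by Hazelcast is unreachable,
// schedule a background retry for that member and carry on.
final class DeferredNodeConnector {
  private final ScheduledExecutorService retries = Executors.newSingleThreadScheduledExecutor();

  void connectAll(List<String> members, NodeClient client) {
    for (String member : members) {
      tryConnect(member, client);
    }
  }

  private void tryConnect(String member, NodeClient client) {
    if (!client.connect(member)) {
      // Do not fail startup; retry this member later instead.
      retries.schedule(() -> tryConnect(member, client), 5, TimeUnit.SECONDS);
    }
  }

  interface NodeClient { boolean connect(String member); }
}
```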
Database storage opens failing
Database storage operations seem to break during database open: