Hi all,
I deployed a multi-site cluster to sync objects from an old cluster to a
brand-new one. The setup seems to be working, since I can see data
syncing. However, when I check the cluster health, it shows the warning
"2 daemons have recently crashed".
I fetched the crash info with 'sudo ceph crash info $id':
{
"os_version_id": "7",
"utsname_release": "3.10.0-957.27.2.el7.x86_64",
"os_name": "CentOS Linux",
"entity_name": "client.rgw.ceph-node7",
"timestamp": "2020-05-09 15:17:59.482502Z",
"process_name": "radosgw",
"utsname_machine": "x86_64",
"utsname_sysname": "Linux",
"os_version": "7 (Core)",
"os_id": "centos",
"utsname_version": "#1 SMP Mon Jul 29 17:46:05 UTC 2019",
"backtrace": [
"(()+0xf5f0) [0x7f32b1bdf5f0]",
"(RGWCoroutine::set_sleeping(bool)+0xc) [0x555eeb1351ac]",
"(RGWOmapAppend::flush_pending()+0x2d) [0x555eeb13acad]",
"(RGWOmapAppend::finish()+0x10) [0x555eeb13acd0]",
"(RGWDataSyncShardCR::stop_spawned_services()+0x2b) [0x555eeb0a185b]",
"(RGWDataSyncShardCR::incremental_sync()+0x72a) [0x555eeb0a9baa]",
"(RGWDataSyncShardCR::operate()+0x9d) [0x555eeb0ab33d]",
"(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x60) [0x555eeb136520]",
"(RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x236) [0x555eeb137196]",
"(RGWCoroutinesManager::run(RGWCoroutine*)+0x78) [0x555eeb138098]",
"(RGWRemoteDataLog::run_sync(int)+0x1cf) [0x555eeb08851f]",
"(RGWDataSyncProcessorThread::process()+0x46) [0x555eeb1e71a6]",
"(RGWRadosThread::Worker::entry()+0x115) [0x555eeb1b6195]",
"(()+0x7e65) [0x7f32b1bd7e65]",
"(clone()+0x6d) [0x7f32b10e188d]"
],
"utsname_hostname": "ceph-node7",
"crash_id": "2020-05-09_15:17:59.482502Z_b80d7bee-faa0-4d2f-9d86-a1b3f4d4802e",
"ceph_version": "14.2.8"
}
AND
{
"os_version_id": "7",
"utsname_release": "3.10.0-957.27.2.el7.x86_64",
"os_name": "CentOS Linux",
"entity_name": "client.rgw.ceph-node7",
"timestamp": "2020-05-10 16:23:13.375063Z",
"process_name": "radosgw",
"utsname_machine": "x86_64",
"utsname_sysname": "Linux",
"os_version": "7 (Core)",
"os_id": "centos",
"utsname_version": "#1 SMP Mon Jul 29 17:46:05 UTC 2019",
"backtrace": [
"(()+0xf5f0) [0x7f409f42e5f0]",
"(RGWCoroutine::set_sleeping(bool)+0xc) [0x55e3f45e01ac]",
"(RGWOmapAppend::flush_pending()+0x2d) [0x55e3f45e5cad]",
"(RGWOmapAppend::finish()+0x10) [0x55e3f45e5cd0]",
"(RGWDataSyncShardCR::stop_spawned_services()+0x2b) [0x55e3f454c85b]",
"(RGWDataSyncShardCR::incremental_sync()+0x72a) [0x55e3f4554baa]",
"(RGWDataSyncShardCR::operate()+0x9d) [0x55e3f455633d]",
"(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x60) [0x55e3f45e1520]",
"(RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x236) [0x55e3f45e2196]",
"(RGWCoroutinesManager::run(RGWCoroutine*)+0x78) [0x55e3f45e3098]",
"(RGWRemoteDataLog::run_sync(int)+0x1cf) [0x55e3f453351f]",
"(RGWDataSyncProcessorThread::process()+0x46) [0x55e3f46921a6]",
"(RGWRadosThread::Worker::entry()+0x115) [0x55e3f4661195]",
"(()+0x7e65) [0x7f409f426e65]",
"(clone()+0x6d) [0x7f409e93088d]"
],
"utsname_hostname": "ceph-node7",
"crash_id": "2020-05-10_16:23:13.375063Z_9e70a0c0-929e-445f-b4cd-8d29e909fe2f",
"ceph_version": "14.2.8"
}
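To confirm the two crashes share the same signature, I stripped the runtime addresses (which differ per process due to ASLR) from both backtraces and compared the remaining frames; they match exactly. A minimal sketch of that check (the crash reports below are abbreviated to two frames each from the full output above):

```python
import json
import re

# Abbreviated crash reports as returned by `ceph crash info <id>`
# (only the fields used here; full reports are shown above).
crash_a = json.loads("""{
  "crash_id": "2020-05-09_15:17:59.482502Z_b80d7bee-faa0-4d2f-9d86-a1b3f4d4802e",
  "backtrace": [
    "(RGWCoroutine::set_sleeping(bool)+0xc) [0x555eeb1351ac]",
    "(RGWOmapAppend::flush_pending()+0x2d) [0x555eeb13acad]"
  ]
}""")
crash_b = json.loads("""{
  "crash_id": "2020-05-10_16:23:13.375063Z_9e70a0c0-929e-445f-b4cd-8d29e909fe2f",
  "backtrace": [
    "(RGWCoroutine::set_sleeping(bool)+0xc) [0x55e3f45e01ac]",
    "(RGWOmapAppend::flush_pending()+0x2d) [0x55e3f45e5cad]"
  ]
}""")

def signature(crash):
    # Drop the per-process addresses so ASLR does not hide
    # that the frames themselves are identical.
    return [re.sub(r"\s*\[0x[0-9a-f]+\]", "", f) for f in crash["backtrace"]]

print(signature(crash_a) == signature(crash_b))  # True: same crash signature
```

So both crashes are the same failure in the RGW data-sync coroutine path, not two unrelated problems.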
So I fetched and checked the file "ceph-client.rgw.ceph-node7.log".
It contains a huge number of errors like:
-732> 2020-05-09 23:17:53.476 7f328b7ff700 0 RGW-SYNC:data:sync:shard[98]:entry[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]:bucket[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]:inc_sync[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]: ERROR: lease is not taken, abort
AND
-723> 2020-05-09 23:17:56.388 7f328b7ff700 5 RGW-SYNC:data:sync:shard[88]:entry[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:13]:bucket[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:13]: incremental sync on bucket failed, retcode=-125
AND
-215> 2020-05-09 23:17:58.809 7f328b7ff700 5 RGW-SYNC:data:sync:shard[10]:entry[pf2-harbor-swift:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4608.101:113]:bucket[pf2-harbor-swift:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4608.101:113]: full sync on bucket failed, retcode=-125
AND
2020-05-09 23:18:24.048 7f4085867700 1 robust_notify: If at first you don't succeed: (110) Connection timed out
2020-05-09 23:18:24.048 7f4083863700 0 ERROR: failed to distribute cache for shubei.rgw.log:datalog.sync-status.shard.f70a5eb9-d88d-42fd-ab4e-d300e97094de.5
2020-05-09 23:28:49.181 7f407e859700 1 heartbeat_map reset_timeout 'RGWAsyncRadosProcessor::m_tp thread 0x7f407e859700' had timed out after 600
2020-05-10 03:12:01.905 7f409708a700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror
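For reference, the numeric codes in those log lines are standard Linux errno values (my own decoding, not from the Ceph docs): 110 is ETIMEDOUT and retcode=-125 is -ECANCELED, which suggests the sync operations were cancelled (consistent with the "lease is not taken, abort" messages) rather than failing on actual I/O:

```python
import errno

# The log lines above contain two errno values:
#   "(110) Connection timed out"  -> ETIMEDOUT
#   "retcode=-125"                -> -ECANCELED (operation cancelled)
print(errno.errorcode[110])  # ETIMEDOUT
print(errno.errorcode[125])  # ECANCELED
```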
And finally it crashed. I'm not sure where the problem lies. Could the
crashes have been caused by the network?
Thanks