Hi all,
I deployed a multi-site cluster to sync objects from an old cluster to a
brand-new one. The setup seems to be working, since I can see data
syncing. However, when I check the cluster health, it shows the warning
"2 daemons have recently crashed".
I fetched the crash info with 'sudo ceph crash info $id':
{
"os_version_id": "7",
"utsname_release": "3.10.0-957.27.2.el7.x86_64",
"os_name": "CentOS Linux",
"entity_name": "client.rgw.ceph-node7",
"timestamp": "2020-05-09 15:17:59.482502Z",
"process_name": "radosgw",
"utsname_machine": "x86_64",
"utsname_sysname": "Linux",
"os_version": "7 (Core)",
"os_id": "centos",
"utsname_version": "#1 SMP Mon Jul 29 17:46:05 UTC 2019",
"backtrace": [
"(()+0xf5f0) [0x7f32b1bdf5f0]",
"(RGWCoroutine::set_sleeping(bool)+0xc) [0x555eeb1351ac]",
"(RGWOmapAppend::flush_pending()+0x2d) [0x555eeb13acad]",
"(RGWOmapAppend::finish()+0x10) [0x555eeb13acd0]",
"(RGWDataSyncShardCR::stop_spawned_services()+0x2b) [0x555eeb0a185b]",
"(RGWDataSyncShardCR::incremental_sync()+0x72a) [0x555eeb0a9baa]",
"(RGWDataSyncShardCR::operate()+0x9d) [0x555eeb0ab33d]",
"(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x60) [0x555eeb136520]",
"(RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x236) [0x555eeb137196]",
"(RGWCoroutinesManager::run(RGWCoroutine*)+0x78) [0x555eeb138098]",
"(RGWRemoteDataLog::run_sync(int)+0x1cf) [0x555eeb08851f]",
"(RGWDataSyncProcessorThread::process()+0x46) [0x555eeb1e71a6]",
"(RGWRadosThread::Worker::entry()+0x115) [0x555eeb1b6195]",
"(()+0x7e65) [0x7f32b1bd7e65]",
"(clone()+0x6d) [0x7f32b10e188d]"
],
"utsname_hostname": "ceph-node7",
"crash_id": "2020-05-09_15:17:59.482502Z_b80d7bee-faa0-4d2f-9d86-a1b3f4d4802e",
"ceph_version": "14.2.8"
}
AND
{
"os_version_id": "7",
"utsname_release": "3.10.0-957.27.2.el7.x86_64",
"os_name": "CentOS Linux",
"entity_name": "client.rgw.ceph-node7",
"timestamp": "2020-05-10 16:23:13.375063Z",
"process_name": "radosgw",
"utsname_machine": "x86_64",
"utsname_sysname": "Linux",
"os_version": "7 (Core)",
"os_id": "centos",
"utsname_version": "#1 SMP Mon Jul 29 17:46:05 UTC 2019",
"backtrace": [
"(()+0xf5f0) [0x7f409f42e5f0]",
"(RGWCoroutine::set_sleeping(bool)+0xc) [0x55e3f45e01ac]",
"(RGWOmapAppend::flush_pending()+0x2d) [0x55e3f45e5cad]",
"(RGWOmapAppend::finish()+0x10) [0x55e3f45e5cd0]",
"(RGWDataSyncShardCR::stop_spawned_services()+0x2b) [0x55e3f454c85b]",
"(RGWDataSyncShardCR::incremental_sync()+0x72a) [0x55e3f4554baa]",
"(RGWDataSyncShardCR::operate()+0x9d) [0x55e3f455633d]",
"(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x60) [0x55e3f45e1520]",
"(RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x236) [0x55e3f45e2196]",
"(RGWCoroutinesManager::run(RGWCoroutine*)+0x78) [0x55e3f45e3098]",
"(RGWRemoteDataLog::run_sync(int)+0x1cf) [0x55e3f453351f]",
"(RGWDataSyncProcessorThread::process()+0x46) [0x55e3f46921a6]",
"(RGWRadosThread::Worker::entry()+0x115) [0x55e3f4661195]",
"(()+0x7e65) [0x7f409f426e65]",
"(clone()+0x6d) [0x7f409e93088d]"
],
"utsname_hostname": "ceph-node7",
"crash_id": "2020-05-10_16:23:13.375063Z_9e70a0c0-929e-445f-b4cd-8d29e909fe2f",
"ceph_version": "14.2.8"
}
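To confirm the two crashes share the same signature, I stripped the runtime addresses (which differ per process due to ASLR) from both backtraces and compared the remaining frames; they match exactly. A minimal sketch of that check (the crash reports below are abbreviated to two frames each from the full output above):

```python
import json
import re

# Abbreviated crash reports as returned by `ceph crash info <id>`
# (only the fields used here; full reports are shown above).
crash_a = json.loads("""{
  "crash_id": "2020-05-09_15:17:59.482502Z_b80d7bee-faa0-4d2f-9d86-a1b3f4d4802e",
  "backtrace": [
    "(RGWCoroutine::set_sleeping(bool)+0xc) [0x555eeb1351ac]",
    "(RGWOmapAppend::flush_pending()+0x2d) [0x555eeb13acad]"
  ]
}""")
crash_b = json.loads("""{
  "crash_id": "2020-05-10_16:23:13.375063Z_9e70a0c0-929e-445f-b4cd-8d29e909fe2f",
  "backtrace": [
    "(RGWCoroutine::set_sleeping(bool)+0xc) [0x55e3f45e01ac]",
    "(RGWOmapAppend::flush_pending()+0x2d) [0x55e3f45e5cad]"
  ]
}""")

def signature(crash):
    # Drop the per-process addresses so ASLR does not hide
    # that the frames themselves are identical.
    return [re.sub(r"\s*\[0x[0-9a-f]+\]", "", f) for f in crash["backtrace"]]

print(signature(crash_a) == signature(crash_b))  # True: same crash signature
```

So both crashes are the same failure in the RGW data-sync coroutine path, not two unrelated problems.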
So I fetched and checked the file "ceph-client.rgw.ceph-node7.log".
It contains a huge number of errors like:
-732> 2020-05-09 23:17:53.476 7f328b7ff700 0 RGW-SYNC:data:sync:shard[98]:entry[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]:bucket[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]:inc_sync[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]: ERROR: lease is not taken, abort
AND
-723> 2020-05-09 23:17:56.388 7f328b7ff700 5 RGW-SYNC:data:sync:shard[88]:entry[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:13]:bucket[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:13]: incremental sync on bucket failed, retcode=-125
AND
-215> 2020-05-09 23:17:58.809 7f328b7ff700 5 RGW-SYNC:data:sync:shard[10]:entry[pf2-harbor-swift:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4608.101:113]:bucket[pf2-harbor-swift:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4608.101:113]: full sync on bucket failed, retcode=-125
AND
2020-05-09 23:18:24.048 7f4085867700 1 robust_notify: If at first you don't succeed: (110) Connection timed out
2020-05-09 23:18:24.048 7f4083863700 0 ERROR: failed to distribute cache for shubei.rgw.log:datalog.sync-status.shard.f70a5eb9-d88d-42fd-ab4e-d300e97094de.5
2020-05-09 23:28:49.181 7f407e859700 1 heartbeat_map reset_timeout 'RGWAsyncRadosProcessor::m_tp thread 0x7f407e859700' had timed out after 600
2020-05-10 03:12:01.905 7f409708a700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror
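For reference, the numeric codes in those log lines are standard Linux errno values (my own decoding, not from the Ceph docs): 110 is ETIMEDOUT and retcode=-125 is -ECANCELED, which suggests the sync operations were cancelled (consistent with the "lease is not taken, abort" messages) rather than failing on actual I/O:

```python
import errno

# The log lines above contain two errno values:
#   "(110) Connection timed out"  -> ETIMEDOUT
#   "retcode=-125"                -> -ECANCELED (operation cancelled)
print(errno.errorcode[110])  # ETIMEDOUT
print(errno.errorcode[125])  # ECANCELED
```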
And finally it crashed. I'm not sure where the problem lies. Could the
crashes have been caused by the network?
Thanks