Hi,
Currently running Mimic 13.2.5.
We had reports this morning of timeouts and failures with PUT and GET
requests to our Ceph RGW cluster. I found these messages in the RGW
log:
RGWReshardLock::lock failed to acquire lock on
bucket_name:bucket_instance ret=-16
NOTICE: resharding operation on bucket index detected, blocking
block_while_resharding ERROR: bucket is still resharding, please retry
These were preceded by many of the following messages, which I think are normal/expected:
check_bucket_shards: resharding needed: stats.num_objects=6415879
shard max_objects=6400000
Our RGW cluster sits behind haproxy which notified me approx 90
seconds after the first 'resharding needed' message that no backends
were available. It appears this dynamic reshard process caused the
RGWs to lock up for a period of time. Roughly 2 minutes later the
reshard error messages stop and operation returns to normal.
Looking back through previous RGW logs, I see a similar event from
about a week ago, on the same bucket. We have several buckets with
shard counts exceeding 1k (this one only has 128), and much larger
object counts, so clearly this isn't the first time dynamic sharding
has been invoked on this cluster.
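For reference, reshard state can be inspected with something along these lines (the bucket name is a placeholder):
$ radosgw-admin reshard list                          # reshard jobs queued or in progress
$ radosgw-admin reshard status --bucket=<bucket>      # per-shard status for the affected bucket
$ radosgw-admin bucket limit check                    # buckets approaching the objects-per-shard limit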
Has anyone seen this? I expect it will come up again, and can turn up
debugging if that'll help. Thanks for any assistance!
Josh
On Thu, 27 Feb 2020 at 06:27, Anthony D'Atri <aad(a)dreamsnake.net> wrote:
> If the heap stats reported by telling the OSD `heap stats` is large, telling each `heap release` may be useful. I suspect a TCMALLOC shortcoming.
osd.158 tcmalloc heap stats:------------------------------------------------
MALLOC: 5722761448 ( 5457.7 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 311621760 ( 297.2 MiB) Bytes in central cache freelist
MALLOC: + 26242992 ( 25.0 MiB) Bytes in transfer cache freelist
MALLOC: + 62721768 ( 59.8 MiB) Bytes in thread cache freelists
MALLOC: + 113340608 ( 108.1 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 6236688576 ( 5947.8 MiB) Actual memory used (physical + swap)
MALLOC: + 21415870464 (20423.8 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 27652559040 (26371.5 MiB) Virtual address space used
MALLOC:
MALLOC: 394518 Spans in use
MALLOC: 37 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
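For reference, stats like the above can be pulled (and a release requested) via the tell interface, e.g.:
$ ceph tell osd.158 heap stats      # print the tcmalloc summary shown above
$ ceph tell osd.158 heap release    # ask tcmalloc to return free pages to the OS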
After upgrading one of our clusters from Luminous 12.2.12 to Nautilus 14.2.6, I am seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). The way we found this was that Prometheus was unable to report certain pieces of data, specifically OSD usage and OSD apply and commit latency, which are similar to issues people were having with previous versions of Nautilus.
Bryan Stillwell previously reported this on a separate cluster (14.2.5) we have here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VW3GNVJGOOW…
That issue was resolved with the upgrade to 14.2.6.
We are seeing a similar issue on this other cluster, with a couple of differences:
This cluster has 1900+ OSDs in it; the previous one had 300+.
The top user is libceph-common, instead of mmap:
4.86% libceph-common.so.0 [.] EventCenter::create_time_event
2.78% [kernel] [k] nmi
2.64% libstdc++.so.6.0.19 [.] __dynamic_cast
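For anyone who wants to reproduce the measurement, this is roughly how it can be captured (generic commands, not our exact invocation):
$ top -H -p $(pidof ceph-mgr)       # per-thread view; shows the single thread pinned at 100%
$ perf top -p $(pidof ceph-mgr)     # sample where the mgr process is spending its CPU time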
On all our other clusters that have been upgraded to 14.2.6 we are not experiencing this issue; the next largest has 800+ OSDs.
We feel this is related to the size of the cluster, similar to the previous report.
Is anyone else experiencing this, and/or can anyone provide some direction on how to go about resolving it?
Thanks,
Joe
Hi Mohamed,
> On Jan 22, 2020, at 10:05 AM, mohamed zayan <mohamed.zayan19(a)gmail.com> wrote:
>
> Currently I have a cluster of 2 nodes on two raspberrypi 3 devices.
> pi1 is admin/mon/mgr/osd
>
> pi2 is osd
>
> I am currently trying to run radosgw on pi2. I have failed multiple times
>
> /var/lib/ceph/radosgw# /usr/bin/radosgw -f --cluster ceph --name
> client.rgw.pi2 --setuser ceph --setgroup ceph
> Thread::try_create(): pthread_create failed with error 11
> /tmp/release/Raspbian/WORKDIR/ceph-12.2.9-38-gaeeb23362d/src/common/Thread.cc:
> In function 'void Thread::create(const char *, size_t)' thread
> 71114000 time 2020-01-22 14:58:13.793803
> /tmp/release/Raspbian/WORKDIR/ceph-12.2.9-38-gaeeb23362d/src/common/Thread.cc:
> 152: FAILED assert(ret == 0)
I'm not too familiar with the Raspberry Pi platform or Raspbian, but there are some clues in the output you provided, specifically this text: "Thread::try_create(): pthread_create failed with error 11".
I assume these low-valued error codes are consistent between Fedora and Raspbian, so code 11 means:
#define EAGAIN 11 /* Try again */
I found the following:
https://stackoverflow.com/questions/47078106/pthread-create-fails-with-eaga…
So that seems to leave 3 possibilities:
1. RGW is asking for more resources than your platform is able to provide.
a. There are limits as to the number of threads.
b. There may be a memory limit, since each thread needs to maintain its own stack.
2. There is a bug in RGW where completed threads are not thread_join’d by their parents.
a. This seems unlikely as this appears to happen during start-up before threads are likely done with their work.
3. There is a kernel bug, as Florian Weimer identifies in the linked answer.
I suspect it’s #1, so you might want to try reducing the number of threads rgw uses by lowering the value of rgw_thread_pool_size in your configuration.
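For example, something along these lines in ceph.conf on pi2 (64 is an arbitrary lower value for illustration, not a tested recommendation):

[client.rgw.pi2]
    rgw_thread_pool_size = 64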
Eric
--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA
I have a test one-node ceph cluster with 4 osds, with the assumption that a second node would be added just before production.
Linux 4.19.0-6-amd64 - debian 10 - ceph version 12.2.11
Unfortunately, the system drive failed before that happened.
I recovered the system from a full backup.
Since no changes were made to the cluster configuration after that backup, I hoped that it would work.
For reasons I can't understand, for the first few seconds after boot the ceph status was OK (134 active+clean, 2 active+clean+scrubbing+deep), but a minute later the status changed to:
# ceph status
  cluster:
    id:     e02f2885-946b-46c8-91d5-146dd724ecaf
    health: HEALTH_WARN
            1 filesystem is degraded
            2 osds down
            1 slice (2 osds) down
            Reduced data availability: 136 pgs inactive, 15 pgs peering

  services:
    mon: 1 daemons, quorum rbd0
    mgr: rbd0(active)
    mds: fs-1/1/1 up {0=rbd0=up:replay}
    osd: 5 osds: 1 up, 3 in

  data:
    pools:   2 pools, 136 pgs
    objects: 118.53k objects, 429GiB
    usage:   7.15TiB used, 3.77TiB / 10.9TiB avail
    pgs:     88.971% pgs unknown
             11.029% pgs not active
             121 unknown
             15 peering
# ceph osd dump
epoch 1983
fsid e02f2885-946b-46c8-91d5-146dd724ecaf
created 2019-08-16 15:14:07.783009
modified 2020-02-29 13:55:39.212461
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 27
full_ratio 0.97
backfillfull_ratio 0.94
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release luminous
pool 1 'fs_data' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 1595 flags hashpspool stripe_width 0 application cephfs
pool 2 'fs_meta' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 last_change 1595 flags hashpspool stripe_width 0 application cephfs
max_osd 5
osd.0 down out weight 0 up_from 1970 up_thru 1973 down_at 1975 last_clean_interval [1949,1963) 192.168.101.111:6806/440 192.168.101.111:6807/440 192.168.101.111:6808/440 192.168.101.111:6809/440 autoout,exists 78eaeb63-47c9-4962-b8ff-46607921f4f6
osd.1 down in weight 1 up_from 1970 up_thru 1970 down_at 1975 last_clean_interval [1952,1963) 192.168.101.111:6801/439 192.168.101.111:6810/439 192.168.101.111:6811/439 192.168.101.111:6812/439 exists c4c4c85d-f537-4199-823b-b7ab01c78f03
osd.2 down in weight 1 up_from 1969 up_thru 1975 down_at 1976 last_clean_interval [1946,1963) 192.168.101.111:6802/441 192.168.101.111:6803/441 192.168.101.111:6804/441 192.168.101.111:6805/441 exists bd66a9c3-bfa4-4352-816e-2e4cd86389f3
osd.3 down out weight 0 up_from 1617 up_thru 1619 down_at 1631 last_clean_interval [1602,1610) 192.168.101.111:6805/933 192.168.101.111:6806/933 192.168.101.111:6807/933 192.168.101.111:6808/933 exists f247115b-c6d5-49b1-9b0e-e799c50be379
osd.4 up in weight 1 up_from 1973 up_thru 1973 down_at 1972 last_clean_interval [1956,1963) 192.168.101.111:6813/442 192.168.101.111:6814/442 192.168.101.111:6815/442 192.168.101.111:6816/442 exists,up c208221e-1228-4247-a742-0c16ce01d38f
blacklist 192.168.101.111:6800/2636437603 expires 2020-03-01 13:26:01.809132
"ceph pg query" of any PG didn't response.
I can't find any errors in journalctl or in /var/log/ceph/*
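In case it matters, these are the kinds of checks I've been trying (osd.0 as an example):
$ ceph health detail                        # which PGs are inactive/peering and why
$ ceph osd tree                             # CRUSH view of which OSDs are up/down
$ systemctl status ceph-osd@0               # whether the daemon is actually running
$ journalctl -u ceph-osd@0 --since "1 hour ago"   # recent daemon log output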
I wonder why only osd.4 is up, what "autoout" means, why 15 pgs are peering, where to look for more detailed information, and whether there is a way to restore the data.
Please help me understand what happened and how to restore the data if possible.
Hi all,
I'm running a Ceph Mimic cluster 13.2.6 and we use the ceph-balancer
in upmap mode. This cluster is fairly old, and pre-Mimic we used to set
osd reweights to balance the standard deviation of the cluster. Since
moving to Mimic about 9 months ago I enabled the ceph-balancer in
upmap mode and let it do its thing, but I did not think about setting
the previously modified reweights back to 1.00000 (not sure if this is
fine or whether resetting them would have been best practice).
Does the ceph-balancer in upmap mode manage the osd reweights
dynamically? Just wondering if I need to proactively go back and set
all non-1.00000 reweights back to 1.00000.
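If resetting them is the right thing to do, I assume it would just be something like this (the osd id is an example):
$ ceph osd df tree            # the REWEIGHT column shows any values still below 1.00000
$ ceph osd reweight 12 1.0    # set osd.12's reweight back to 1.00000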
Thanks all, I hope that makes sense!
All;
We just started really fiddling with CephFS on our production cluster (Nautilus - 14.2.5 / 14.2.6), and I have a question...
Is there a command / set of commands that transitions a standby-replay MDS server to the active role, while swapping the active MDS to standby-replay, or even just standby?
I'm looking for a way to seamlessly, and without downtime, prepare the active MDS to go offline (reboot) as part of planned/periodic maintenance.
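The closest I've come up with so far is something like the following, but I'm not sure it's the intended (or seamless) approach, hence the question (rank 0 as an example):
$ ceph fs status     # shows which daemon is active and which is standby-replay
$ ceph mds fail 0    # fail rank 0 so the standby-replay daemon takes over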
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
I do not have a large ceph cluster, only 4 nodes plus a mon/mgr with 48
OSDs. I have one data pool and one metadata pool with a total of about
140TB of usable storage. I have maybe 30 or so clients. The rest of my
systems connect via a host that is a ceph client and then reshares
through samba and nfs-ganesha. I'm not using rgw anywhere. I'm running
the latest stable release of nautilus (14.2.7) and have had it in
production since August 2019. All ceph nodes and the smb/nfs host are
running centos7 with latest patches. Other clients are a mix of debian
and ubuntu.
For the last several weeks, I have been getting the warning "Large omap
object found" off and on. I've been resolving it by gradually increasing
the value of osd_deep_scrub_large_omap_object_key_threshold and then
running a deep scrub on the affected pg. I have now increased this
threshold to 1000000 and am wondering if I should keep doing this or if
there is another problem that needs to be addressed.
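Concretely, each round amounts to something like this (the PG id comes from the warning, 2.26 in the most recent case):
$ ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 1000000
$ ceph pg deep-scrub 2.26    # re-scrub the PG that reported the large omap object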
The affected pg has been different most times, but they are all on the
same osd and with the same mds object. Here's an excerpt from my current
set of logs to show what I'm seeing:
# zgrep -i "large omap object found" /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log:2020-02-27 06:02:01.761641 osd.40 (osd.40) 1578 :
cluster [WRN] Large omap object found. Object:
2:654134d2:::mds0_openfiles.0:head PG: 2.4b2c82a6 (2.26) Key count:
1048576 Size (bytes): 46403355
/var/log/ceph/ceph.log:2020-02-27 16:18:00.328869 osd.40 (osd.40) 1585 :
cluster [WRN] Large omap object found. Object:
2:654134d2:::mds0_openfiles.0:head PG: 2.4b2c82a6 (2.26) Key count:
1048559 Size (bytes): 46407183
/var/log/ceph/ceph.log-20200227.gz:2020-02-26 19:56:24.972431 osd.40
(osd.40) 1450 : cluster [WRN] Large omap object found. Object:
2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count:
939236 Size (bytes): 40179994
/var/log/ceph/ceph.log-20200227.gz:2020-02-26 21:14:16.497161 osd.40
(osd.40) 1460 : cluster [WRN] Large omap object found. Object:
2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count:
939232 Size (bytes): 40179796
/var/log/ceph/ceph.log-20200227.gz:2020-02-26 21:15:06.399267 osd.40
(osd.40) 1464 : cluster [WRN] Large omap object found. Object:
2:c9647462:::mds0_openfiles.1:head PG: 2.462e2693 (2.13) Key count:
939231 Size (bytes): 40179756
Unfortunately, older logs have already been rotated out, but if memory
serves correctly, they had similar messages. As you can see, the key
count continues to increase. Last week, I bumped the threshold to 750000
to clear the warning. Before that, I had bumped to 500000. It looks to
me like something isn't getting cleaned up like it's supposed to. I
haven't been using ceph long enough to figure out what that might be.
Do I continue to bump the key threshold and not worry about the
warnings, or is there something going on that needs to be corrected? At
what point is the threshold too high? If the problem is due to a
specific client not closing files, is it possible to identify that
client and attempt to reset it?
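If identifying the client is possible, I'm guessing it would be something along these lines, though I haven't tried it yet (mds name and session id are placeholders):
$ ceph daemon mds.<name> session ls            # on the active MDS host; num_caps shows how many caps each client holds
$ ceph tell mds.<name> client evict id=<id>    # evict a misbehaving session as a last resort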
Any advice is welcome. I'm happy to provide additional data if needed.
Thanks.
Seth
--
Seth Galitzer
Systems Coordinator
Computer Science Department
Kansas State University
http://www.cs.ksu.edu/~sgsax
sgsax(a)ksu.edu
785-532-7790
Hi all,
I'm looking for the best way to merge/remap existing host buckets into one.
I'm running a Ceph Nautilus cluster used as a Cinder backend with 2
pools, "volume-service" and "volume-recherche", each with dedicated OSDs:
host cccephnd00x-service {
        id -2           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 7.275
        item osd.6 weight 7.275
}
host cccephnd00x-recherche {
        id -3           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item osd.11 weight 7.266
        item osd.17 weight 7.266
        item osd.22 weight 7.266
        item osd.27 weight 7.266
}
...
root service {
        id -26          # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item cccephnd001-service weight 14.550
        ...
        item cccephnd006-service weight 14.550
}
root recherche {
        id -27          # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item cccephnd001-recherche weight 29.064
        ...
        item cccephnd006-recherche weight 29.064
}
rule HAService {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take service
        step chooseleaf firstn 0 type host
        step emit
}
rule Recherche {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take recherche
        step chooseleaf firstn 0 type host
        step emit
}

$ ceph osd pool get volumes-service crush_rule
crush_rule: HAService
$ ceph osd pool get volumes-recherche crush_rule
crush_rule: Recherche
The pool "volume-service" used to work with SSD cache tiering but we
decided to stop using it.
So I would like to keep these 2 pools but merge the host buckets
cccephnd00X-service and cccephnd00X-recherche into a single
cccephnd00X-cinder for better performance (more OSDs assigned to each pool).
In theory, migrate to this kind of crushmap:
host cccephnd00x-cinder {
        id -22
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 7.275
        item osd.6 weight 7.275
        item osd.11 weight 7.266
        item osd.17 weight 7.266
        item osd.22 weight 7.266
        item osd.27 weight 7.266
}
...
root cinder {
        id -23
        alg straw2
        hash 0  # rjenkins1
        item cccephnd001-cinder weight 29.064
        ...
        item cccephnd006-cinder weight 29.064
}
rule Cinder {
        id 24
        type replicated
        min_size 1
        max_size 10
        step take cinder
        step chooseleaf firstn 0 type host
        step emit
}

$ ceph osd pool get volumes-service crush_rule
crush_rule: Cinder
$ ceph osd pool get volumes-recherche crush_rule
crush_rule: Cinder
In practice, I'm thinking about creating the new host buckets/root/rule,
then changing the crush_rule used by the 2 pools to the new one, and then
deleting the old ones.
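Concretely, I imagine something like this (an untested sketch, showing one host and one OSD; names and weights follow the example above):
$ ceph osd crush add-bucket cinder root                                # new root
$ ceph osd crush add-bucket cccephnd001-cinder host                    # new host bucket
$ ceph osd crush move cccephnd001-cinder root=cinder                   # put the host under the new root
$ ceph osd crush set osd.0 7.275 root=cinder host=cccephnd001-cinder   # move each OSD into its new host
$ ceph osd crush rule create-replicated Cinder cinder host             # new rule using the new root
$ ceph osd pool set volumes-service crush_rule Cinder
$ ceph osd pool set volumes-recherche crush_rule Cinder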
Do you think that can be done easily (and without losing existing data)?
I guess there will be a lot of rebalancing activity, but I don't have much choice.
Or do you have any other suggestions?
Cheers,
Adrien