Dear friends,
Running 14.2.11, we have one particularly large bucket with a very
strange distribution of objects among the shards. The bucket has 512
shards, and most shards have ~75k entries, but shard 0 has 1.75M
entries:
# rados -p default.rgw.buckets.index listomapkeys
.dir.61c59385-085d-4caa-9070-63a3868dccb6.272652427.1.0 | wc -l
1752085
# rados -p default.rgw.buckets.index listomapkeys
.dir.61c59385-085d-4caa-9070-63a3868dccb6.272652427.1.1 | wc -l
78388
# rados -p default.rgw.buckets.index listomapkeys
.dir.61c59385-085d-4caa-9070-63a3868dccb6.272652427.1.2 | wc -l
78764
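(For reference, all 512 shards can be checked with a quick loop over the shard suffix, using the same bucket marker as above:)
for i in $(seq 0 511); do
    n=$(rados -p default.rgw.buckets.index listomapkeys \
        .dir.61c59385-085d-4caa-9070-63a3868dccb6.272652427.1.$i | wc -l)
    echo "shard $i: $n"
done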
We had resharded this bucket (manually) from 32 up to 512 shards just
before upgrading from 12.2.12 to 14.2.11 a couple weeks ago.
Any idea why shard .0 ends up with such a disproportionate number of entries?
Should we manually reshard this bucket again?
Thanks!
Dan
Good day, cephers!
We've recently upgraded our cluster from 14.2.8 to the 14.2.10 release, also
performing a full system package upgrade (Ubuntu 18.04 LTS).
After that, performance dropped significantly; the main symptom is that the
journal SSDs now show no merges, huge queues, and increased latency.
There are a few screenshots in the attachments. They are for an SSD that holds
block.db/block.wal for 3 spinning OSDs, and it looks the same for
all our SSD block.db/wal devices across all nodes.
Any ideas what may cause this? Maybe I've missed something important in the
release notes?
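For anyone without the attachments: the same counters can be watched with iostat (device names are just examples, and column names vary a bit between sysstat versions):
# wrqm/s (write merges), avgqu-sz (queue depth) and w_await (latency)
# are the columns that changed for us after the upgrade
iostat -x 5 sdb sdc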
Dear Cephers,
we are currently mounting CephFS with relatime, using the FUSE client (version 13.2.6):
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,relatime,user_id=0,group_id=0,allow_other)
For the first time, I wanted to use atime to identify old unused data. My expectation with "relatime" was that the access time stamp would be updated less often, for example,
only if the last file access was >24 hours ago. However, that does not seem to be the case:
----------------------------------------------
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
$ cat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root > /dev/null
$ sync
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
----------------------------------------------
I also tried this via an nfs-ganesha mount, and via a ceph-fuse mount with admin caps,
but atime never changes.
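In case anyone wants to reproduce this quickly, the whole check condenses to the following (the path is just an example file from our storage):
F=/cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
stat -c 'atime before: %x' "$F"
cat "$F" > /dev/null
sync
stat -c 'atime after:  %x' "$F"
# with relatime semantics I would have expected the second value to change,
# at least once the stored atime is more than 24 hours in the past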
Is atime really never updated with CephFS, or is this configurable?
Something as coarse as "update at maximum once per day only" would be perfectly fine for the use case.
Cheers,
Oliver
Ever since we jumped from 14.2.9 to 14.2.12 (and beyond), a lot of the ceph commands just hang. The mgr daemon also occasionally stops responding to our Prometheus scrapes; a daemon restart wakes it back up. I have nothing pointing to these being related, but it feels that way.
I also tried to get device health monitoring with SMART up and running around that upgrade time. It never seemed to be able to pull in and report on the health across the drives. I did see the OSD process firing off smartctl on occasion, though, so it was trying to do something. Again, I have nothing pointing to this being related, but it feels like it may be.
Some commands that currently hang:
ceph osd pool autoscale-status
ceph balancer *
ceph iostat (oddly, this spit out a line of all 0 stats once and then hung)
ceph fs status
toggling ceph device monitoring on or off and a lot of the device health stuff too
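A quick way to see which of these block, without tying up a shell, is to wrap each one in timeout, e.g.:
for c in "osd pool autoscale-status" "balancer status" "fs status"; do
    echo "== ceph $c"
    timeout 30 ceph $c || echo "   (no answer within 30s)"
done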
Mgr logs on disk show flavors of this:
2020-11-24 13:05:07.883 7f19e2c40700 0 log_channel(audit) log [DBG] : from='mon.0 -' entity='mon.' cmd=[{,",p,r,e,f,i,x,",:, ,",o,s,d, ,p,e,r,f,",,, ,",f,o,r,m,a,t,",:, ,",j,s,o,n,",}]: dispatch
2020-11-24 13:05:07.895 7f19e2c40700 0 log_channel(audit) log [DBG] : from='mon.0 -' entity='mon.' cmd=[{,",p,r,e,f,i,x,",:, ,",o,s,d, ,p,o,o,l, ,s,t,a,t,s,",,, ,",f,o,r,m,a,t,",:, ,",j,s,o,n,",}]: dispatch
2020-11-24 13:05:08.567 7f19e1c3e700 0 log_channel(cluster) log [DBG] : pgmap v587: 17149 pgs: 1 active+remapped+backfill_wait, 2 active+clean+scrubbing, 55 active+clean+scrubbing+deep, 9 active+remapped+backfilling, 17082 active+clean; 2.1 PiB data, 3.5 PiB used, 2.9 PiB / 6.4 PiB avail; 108 MiB/s rd, 53 MiB/s wr, 1.20k op/s; 7525420/9900121381 objects misplaced (0.076%); 99 MiB/s, 40 objects/s recovering
ceph status:
  cluster:
    id:     971a5242-f00d-421e-9bf4-5a716fcc843a
    health: HEALTH_WARN
            1 nearfull osd(s)
            1 pool(s) nearfull

  services:
    mon: 3 daemons, quorum ceph-mon-01,ceph-mon-03,ceph-mon-02 (age 4h)
    mgr: ceph-mon-01(active, since 97s), standbys: ceph-mon-03, ceph-mon-02
    mds: cephfs:1 {0=ceph-mds-02=up:active} 3 up:standby
    osd: 843 osds: 843 up (since 13d), 843 in (since 2w); 10 remapped pgs
    rgw: 1 daemon active (ceph-rgw-01)

  task status:
    scrub status:
        mds.ceph-mds-02: idle

  data:
    pools:   16 pools, 17149 pgs
    objects: 1.61G objects, 2.1 PiB
    usage:   3.5 PiB used, 2.9 PiB / 6.4 PiB avail
    pgs:     6482000/9900825469 objects misplaced (0.065%)
             17080 active+clean
             54    active+clean+scrubbing+deep
             9     active+remapped+backfilling
             5     active+clean+scrubbing
             1     active+remapped+backfill_wait

  io:
    client:   877 MiB/s rd, 1.8 GiB/s wr, 1.91k op/s rd, 3.33k op/s wr
    recovery: 136 MiB/s, 55 objects/s
ceph config dump:
WHO MASK LEVEL OPTION VALUE RO
global advanced cluster_network 192.168.42.0/24 *
global advanced mon_max_pg_per_osd 400
global advanced mon_pg_warn_max_object_skew -1.000000
global dev mon_warn_on_pool_pg_num_not_power_of_two false
global advanced osd_max_backfills 2
global advanced osd_max_scrubs 4
global advanced osd_scrub_during_recovery false
global advanced public_network 1xx.xx.171.0/24 10.16.171.0/24 *
mon advanced mon_allow_pool_delete true
mgr advanced mgr/balancer/mode none
mgr advanced mgr/devicehealth/enable_monitoring false
osd advanced bluestore_compression_mode passive
osd advanced osd_deep_scrub_large_omap_object_key_threshold 2000000
osd advanced osd_op_queue_cut_off high *
osd advanced osd_scrub_load_threshold 5.000000
mds advanced mds_beacon_grace 300.000000
mds basic mds_cache_memory_limit 16384000000
mds advanced mds_log_max_segments 256
client advanced rbd_default_features 5
client.libvirt advanced admin_socket /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok *
client.libvirt basic log_file /var/log/ceph/qemu-guest-$pid.log *
/etc/ceph/ceph.conf is the stub file with fsid and the mons listed.
Yes, I have a drive that just started to tickle the nearfull warning limit. That's what pulled me back into "I should fix this" mode. I'm manually adjusting the weight on that one for the time being (see below), along with slowly lowering pg_num on an oversized pool. The cluster still has this issue when in HEALTH_OK.
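Concretely, the stopgap looks roughly like this (osd id, weight and pool name are placeholders):
# nudge data off the nearfull OSD
ceph osd reweight <osd-id> 0.90
# walk pg_num down on the oversized pool in small steps
ceph osd pool set <pool> pg_num <target>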
I'm free to do a lot of debugging and poking around even though this is our production cluster. The only service I refuse to play around with is the MDS. That one bites back. Does anyone have more ideas on where to look to try and figure out what's going on?
--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec(a)rit.edu
------------------------
Hi all,
I’m new to ceph. I recently deployed a ceph cluster with cephadm. Now I want to add a single new OSD daemon with a db device on SSD. But I can’t find any documentation about this.
I have tried:
1. Using web dashboard. This requires at least one filter to proceed (type, vendor, model or size). But I just want to select the block device manually.
2. Using ‘ceph orch apply osd -i spec.yml’. This also appears to be filter based (my best guess at a path-based spec is below this list).
3. Using ‘ceph orch daemon add osd host:device’. It seems I cannot specify my SSD db device this way.
4. On the target host, running ‘cephadm shell’ and then ceph-volume prepare and activate. But ceph-volume doesn't seem to be able to create the systemd service outside the container the way ‘ceph orch’ does.
5. On the target host, running ‘cephadm ceph-volume’, but it requires a JSON config file and I can’t figure out what that is.
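For option 2, my best guess from the drive group docs is that the spec file can also take explicit device paths instead of filters; something like the following is what I'd try, though I'm not sure it's the intended way (host and device names below are placeholders):
cat > osd-spec.yml <<'EOF'
service_type: osd
service_id: single_osd_with_db
placement:
  hosts:
    - my-host
data_devices:
  paths:
    - /dev/sdb
db_devices:
  paths:
    - /dev/sdc
EOF
ceph orch apply osd -i osd-spec.yml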
Any help is appreciated. Thanks.
Hi,
I did some searching on replacing an OSD and found several different
procedures, probably for different releases.
Is there a recommended process to replace an OSD with Octopus?
Two cases here:
1) replace HDD whose WAL and DB are on a SSD.
1-1) failed disk is replaced by the same model.
1-2) working disk is replaced by bigger one.
2) replace the SSD holding WAL and DB for multiple HDDs.
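What I've pieced together so far for case 1-1, assuming cephadm/ceph orch is managing the OSDs, is roughly the following, but I don't know whether it is the recommended way:
# mark the failed OSD for replacement so its id is preserved
ceph orch osd rm <osd-id> --replace
# after swapping the drive, clear leftover LVM metadata if needed
ceph orch device zap <host> /dev/sdX --force
# the orchestrator should then redeploy an OSD with the same id;
# how the existing WAL/DB LV on the SSD is reused is exactly what I'm unsure about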
Thanks!
Tony
Hi,
I'm still evaluating Ceph 15.2.5 in a lab, so the problem is not really hurting me, but I want to understand it and hopefully fix it; it's good practice. To test the resilience of the cluster I try to break it in all kinds of ways. Today I powered off (clean shutdown) one OSD node and powered it back on. Last time I tried this there was no problem getting it back online: after a few minutes the cluster health was back to OK. This time it stayed degraded indefinitely. I checked and noticed that the service osd.0 on that node was failing. Searching around, I found people recommending to simply delete the OSD and re-create it. I tried that and still can't get the OSD back in service.
First I removed the osd:
[root@gedasvl02 ~]# ceph osd out 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
osd.0 is already out.
[root@gedasvl02 ~]# ceph auth del 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Error EINVAL: bad entity name
[root@gedasvl02 ~]# ceph auth del osd.0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
updated
[root@gedasvl02 ~]# ceph osd rm 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
removed osd.0
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43658 root default
-7 0.21829 host gedaopl01
2 ssd 0.21829 osd.2 up 1.00000 1.00000
-3 0 host gedaopl02
-5 0.21829 host gedaopl03
3 ssd 0.21829 osd.3 up 1.00000 1.00000
Looks OK, it's gone...
Then I zapped it:
[root@gedasvl02 ~]# ceph orch device zap gedaopl02 /dev/sdb --force
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
INFO:cephadm:/usr/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
INFO:cephadm:/usr/bin/podman:stderr --> Zapping: /dev/sdb
INFO:cephadm:/usr/bin/podman:stderr --> Zapping lvm member /dev/sdb. lv_path is /dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323 bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.0314447 s, 333 MB/s
INFO:cephadm:/usr/bin/podman:stderr stderr:
INFO:cephadm:/usr/bin/podman:stderr --> Only 1 LV left in VG, will proceed to destroy volume group ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/sbin/vgremove -v -f ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing ceph--3bf1bb28--0858--4464--a848--d7f56319b40a-osd--block--3a79800d--2a19--45d8--a850--82c6a8113323 (253:0)
INFO:cephadm:/usr/bin/podman:stderr stderr: Archiving volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" metadata (seqno 5).
INFO:cephadm:/usr/bin/podman:stderr stderr: Releasing logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323"
INFO:cephadm:/usr/bin/podman:stderr stderr: Creating volume group backup "/etc/lvm/backup/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" (seqno 6).
INFO:cephadm:/usr/bin/podman:stderr stdout: Logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323" successfully removed
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing physical volume "/dev/sdb" from volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a"
INFO:cephadm:/usr/bin/podman:stderr stdout: Volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" successfully removed
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0355641 s, 295 MB/s
INFO:cephadm:/usr/bin/podman:stderr --> Zapping successful for: <Raw Device: /dev/sdb>
And re-added it:
[root@gedasvl02 ~]# ceph orch daemon add osd gedaopl02:/dev/sdb
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Created osd(s) 0 on host 'gedaopl02'
But the osd is still out...
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43658 root default
-7 0.21829 host gedaopl01
2 ssd 0.21829 osd.2 up 1.00000 1.00000
-3 0 host gedaopl02
-5 0.21829 host gedaopl03
3 ssd 0.21829 osd.3 up 1.00000 1.00000
0 0 osd.0 down 0 1.00000
Looking at the cluster log in the web UI, I see the following error:
Failed to apply osd.dashboard-admin-1606745745154 spec DriveGroupSpec(name=dashboard-admin-1606745745154->placement=PlacementSpec(host_pattern='*'), service_id='dashboard-admin-1606745745154', service_type='osd', data_devices=DeviceSelection(size='223.6GB', rotational=False, all=False), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False): No filters applied
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2108, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2005, in _apply_service
    self.osd_service.create_from_spec(cast(DriveGroupSpec, spec))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 43, in create_from_spec
    ret = create_from_spec_one(self.prepare_drivegroup(drive_group))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 127, in prepare_drivegroup
    drive_selection = DriveSelection(drive_group, inventory_for_host)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 32, in __init__
    self._data = self.assign_devices(self.spec.data_devices)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in assign_devices
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in <genexpr>
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/matchers.py", line 410, in compare
    raise Exception("No filters applied")
Exception: No filters applied
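The spec in that error (osd.dashboard-admin-1606745745154) looks like something the dashboard created earlier. I'm wondering whether I should just list and remove that stale service spec, e.g.:
# show the OSD specs the orchestrator keeps trying to apply
ceph orch ls osd --export
# remove the stale dashboard-created spec (name taken from the error above)
ceph orch rm osd.dashboard-admin-1606745745154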
I also have a "pgs undersized" warning; maybe that is causing trouble as well?
[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
  cluster:
    id:     d0920c36-2368-11eb-a5de-005056b703af
    health: HEALTH_WARN
            Degraded data redundancy: 13142/39426 objects degraded (33.333%), 176 pgs degraded, 225 pgs undersized

  services:
    mon: 1 daemons, quorum gedasvl02 (age 2w)
    mgr: gedasvl02.vqswxg(active, since 2w), standbys: gedaopl02.yrwzqh
    mds: cephfs:1 {0=cephfs.gedaopl01.zjuhem=up:active} 1 up:standby
    osd: 3 osds: 2 up (since 4d), 2 in (since 94m)

  task status:
    scrub status:
        mds.cephfs.gedaopl01.zjuhem: idle

  data:
    pools:   7 pools, 225 pgs
    objects: 13.14k objects, 77 GiB
    usage:   148 GiB used, 299 GiB / 447 GiB avail
    pgs:     13142/39426 objects degraded (33.333%)
             176 active+undersized+degraded
             49  active+undersized

  io:
    client: 0 B/s rd, 6.1 KiB/s wr, 0 op/s rd, 0 op/s wr
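The next thing I plan to check is whether cephadm actually started a container for osd.0 on that node, e.g.:
ceph orch ps gedaopl02
# and, directly on the host:
cephadm logs --name osd.0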
Best Regards,
Oliver
We are seeing very high osd_pglog usage in mempools for ceph osds. For
example...
"mempool": {
"bloom_filter_bytes": 0,
"bloom_filter_items": 0,
"bluestore_alloc_bytes": 41857200,
"bluestore_alloc_items": 523215,
"bluestore_cache_data_bytes": 50876416,
"bluestore_cache_data_items": 1326,
"bluestore_cache_onode_bytes": 6814080,
"bluestore_cache_onode_items": 13104,
"bluestore_cache_other_bytes": 57793850,
"bluestore_cache_other_items": 2599669,
"bluestore_fsck_bytes": 0,
"bluestore_fsck_items": 0,
"bluestore_txc_bytes": 29904,
"bluestore_txc_items": 42,
"bluestore_writing_deferred_bytes": 733191,
"bluestore_writing_deferred_items": 96,
"bluestore_writing_bytes": 0,
"bluestore_writing_items": 0,
"bluefs_bytes": 101400,
"bluefs_items": 1885,
"buffer_anon_bytes": 21505818,
"buffer_anon_items": 14949,
"buffer_meta_bytes": 1161512,
"buffer_meta_items": 13199,
"osd_bytes": 1962920,
"osd_items": 167,
"osd_mapbl_bytes": 825079,
"osd_mapbl_items": 17,
"osd_pglog_bytes": 14099381936,
"osd_pglog_items": 134285429,
"osdmap_bytes": 734616,
"osdmap_items": 26508,
"osdmap_mapping_bytes": 0,
"osdmap_mapping_items": 0,
"pgmap_bytes": 0,
"pgmap_items": 0,
"mds_co_bytes": 0,
"mds_co_items": 0,
"unittest_1_bytes": 0,
"unittest_1_items": 0,
"unittest_2_bytes": 0,
"unittest_2_items": 0
},
That is roughly 14 GB consumed by pg logs. The cluster has 106 OSDs and 2432
placement groups.
The per-PG log counts are much lower than the 134285429 items reported above.
Top counts are...
1486 1.41c
883 7.3
834 7.f
683 7.13
669 7.a
623 7.5
565 7.8
560 7.1c
546 7.16
544 7.19
Summing the counts across all placement groups gives 21594 pg log entries.
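For reference, per-PG log counts like these can be pulled from pg dump, roughly as follows (the JSON layout may differ slightly between releases):
ceph pg dump -f json 2>/dev/null \
  | jq -r '.pg_map.pg_stats[] | "\(.log_size) \(.pgid)"' \
  | sort -rn | head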
Overall the performance of the cluster is poor, OSD memory usage is high
(20-30G resident), and with a moderate workload we are seeing iowait on OSD
hosts. The memory allocated to caches appears to be low, I believe because
osd_pglog is taking most of the available memory.
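For completeness, the knobs that, as far as I understand, bound pg log length and OSD memory can be read back with:
ceph config get osd osd_min_pg_log_entries
ceph config get osd osd_max_pg_log_entries
ceph config get osd osd_memory_target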
Regards,
Rob