Hi, Jason:
As discussed last time, after setting conf_rbd_qos_bps_limit, the speed of discard operations is also limited,
which can make operations such as mkfs.xfs very slow. We can add the -K option to work around this,
but we cannot guarantee that no other operation or application ever calls the discard interface.
Could the discard path be exempted from the conf_rbd_qos_bps_limit option, and if so, would that introduce any other risk?
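For reference, the workaround mentioned above looks roughly like this (the device path is illustrative):
  mkfs.xfs -K /dev/rbd0    # -K skips the initial discard pass, so mkfs is not throttled by the QoS limit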
Thanks.
Best Regards,
LiuGangbiao
I have a 74GB VM with 34466MB of free space. But after I run fstrim, 'rbd
du' still shows 60GB used.
When I fill the 34GB of free space with an image, delete it, and run
fstrim again, 'rbd du' still shows 59GB used.
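For reference, these are roughly the commands involved (pool and image names are illustrative):
  fstrim -v /           # inside the VM, discards unused blocks of the root filesystem
  rbd du rbd/vm-disk    # on a Ceph client, shows provisioned vs. actually used size of the image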
Is this normal? Or should I be able to get it to ~30GB used?
Dear cephers,
I see "Long heartbeat ping times on back interface seen" in ceph status and ceph health detail says that I should "Use ceph daemon mgr.# dump_osd_network for more information". I tries, but it seems this command was removed during upgrade from mimic 13.2.8 to 13.2.10:
[root@ceph-01 ~]# ceph daemon mgr.ceph-01 dump_osd_network
no valid command found; 10 closest matches:
log flush
log dump
git_version
get_command_descriptions
kick_stale_sessions
help
config unset <var>
config show
dump_mempools
dump_cache
admin_socket: invalid command
Has this been replaced by some other command?
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi all,
I have the opposite problem to the one discussed in "slow down keys/s in recovery": I need to increase the number of objects in flight during rebalance. All remapped PGs are already in state backfilling, but it looks like no more than 8 objects/sec are transferred per PG at a time. The pool sits on high-performance SSDs and could easily handle 100 or more simultaneous object transfers per second. Is there any way to increase the number of transfers/sec or simultaneous transfers? Increasing the options osd_max_backfills and osd_recovery_max_active has no effect.
Background: the pool in question (con-fs2-meta2) is the default data pool of a CephFS file system and exclusively stores the kind of metadata that goes into this pool. Storage consumption is reported as 0, but the number of objects is huge:
NAME ID USED %USED MAX AVAIL OBJECTS
con-fs2-meta1 12 216 MiB 0.02 933 GiB 13311115
con-fs2-meta2 13 0 B 0 933 GiB 118389897
con-fs2-data 14 698 TiB 72.15 270 TiB 286826739
Unfortunately, there were no recommendations on dimensioning PG numbers for this pool, so I used the same PG count for con-fs2-meta1 and con-fs2-meta2. In hindsight, this was potentially a bad idea: the meta2 pool should have a much higher PG count or a much more aggressive recovery policy.
I now need to rebalance PGs on meta2 and it is going way too slow compared with the performance of the SSDs it is located on. In a way, I would like to keep the PG count where it is, but increase the recovery rate for this pool by a factor of 10. Please let me know what options I have.
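For completeness, this is roughly how I raised the two options mentioned above at runtime (the values are only examples):
  ceph tell 'osd.*' injectargs '--osd_max_backfills 8 --osd_recovery_max_active 16'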
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I have made some tests with creating OSDs and I have found out that there
are big issues with the ceph-volume functionality.
1) If using dmcrypt and separate data and db block devices, ceph-volume
creates crypto devices/PVs/VGs/LVs for both devices. This might seem
normal, until one considers the possibility that a single SSD will back
multiple OSDs; the whole concept of being able to resize bluestore DB
partitions becomes very complicated this way. I could instead have the
full SSD encrypted, create a PV/VG on top, and provide the LVs myself
without an extra layer of encryption.
2) If I try to circumvent this limitation by asking ceph-volume to use
already encrypted devices for both the data and the bluestore db, the OSDs
are not auto-scanned at startup, ceph-volume simple scan complains that
they are not real devices, and nothing starts.
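For reference, this is roughly what I tried for case 2), with the LVs carved out of dm-crypt devices that I had opened myself (all names are illustrative):
  ceph-volume lvm prepare --bluestore --data ceph-block/osd0-data --block.db ceph-db/osd0-db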
Am I missing something?
Good day, cephers!
We've recently upgraded our cluster from the 14.2.8 to the 14.2.10 release, also
performing a full system package upgrade (Ubuntu 18.04 LTS).
After that, performance dropped significantly, the main reason being that
the journal SSDs now show no merges, huge queues, and increased latency.
There are a few screenshots in the attachments. They are for an SSD journal that
holds block.db/block.wal for 3 spinning OSDs, and it looks like this for
all our SSD block.db/wal devices across all nodes.
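For context, the merges, queue sizes, and latencies in the screenshots correspond to what something like this shows for the journal SSD (device name illustrative):
  iostat -x 1 /dev/sdX    # extended stats include request merges, average queue size, and per-request latency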
Any ideas what may cause that? Maybe I've missed something important in
the release notes?
Hi all,
When my cluster gets into a recovery state (adding a new node), I see a huge
read throughput on its disks and it affects latency! The disks are SSDs and
they don't have a separate WAL/DB.
I'm using Nautilus 14.2.14 and bluefs_buffered_io is false by default. When
this read throughput hits my disks, latency becomes very high. After I
turned on bluefs_buffered_io, another huge throughput of around 1.2GB/s came in
and it again affected my latency, but much less than the previous one!
(Graphs are attached; bluefs_buffered_io was turned on with ceph tell
injectargs at 13:41, and I had also restarted the OSD at 13:16 because things
didn't get better at that moment.)
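For reference, the runtime change was made roughly like this (the exact command is illustrative):
  ceph tell 'osd.*' injectargs '--bluefs_buffered_io=true'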
I have four questions:
1. What are these reads? I see the recovery speed is 20MB/s and the client IO on that
OSD is 10MB/s, so what is this high throughput for?!
2. How can I control this throughput? My disks cannot sustain this much!
3. I see a common issue here, https://tracker.ceph.com/issues/36482, that I
think is similar to my case. It discusses read_ahead; should I change the
read_ahead_kb setting of my disks to support this type of request? I'm using
the default value in Ubuntu (128); see the check after this list.
4. Is there any tuning that would let me turn bluefs_buffered_io off again?
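The readahead value from question 3 can be checked and changed per disk like this (device name and new value are illustrative):
  cat /sys/block/sda/queue/read_ahead_kb          # currently 128
  echo 4096 > /sys/block/sda/queue/read_ahead_kb  # test with a larger readahead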
Configs I used for recovery:
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
osd recovery priority = 1
osd recovery sleep ssd = 0.2
My OSD memory target is around 6GB.
Thanks.
Hi,
I recently attempted to run the 'rgw-orphan-list' tool against our cluster (octopus 15.2.7) to identify any orphans and noticed that the 'radosgw-admin bucket radoslist' command appeared to be stuck in a loop.
I saw in the 'radosgw-admin-XXXXXX.intermediate' output file the same sequence of objects looping repeatedly, and the command would not progress.
There is a tracker item that appears to be for this issue (https://tracker.ceph.com/issues/47074), but I believe it is still unverified whether this is a bug or not.
I ran the command 'radosgw-admin bucket radoslist --bucket=<bucketname> --debug-rgw=20 > radoslist-out.txt 2> radoslist-err.txt' in order to identify what was happening as per the tracker, but I would appreciate some help in analysing the logs.
In the radoslist-err.txt file, I see entries similar to the following, which just repeat until I cancel the command:
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=345930 stripe_ofs=345930 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=360383 stripe_ofs=360383 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=302704 stripe_ofs=302704 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=381794 stripe_ofs=381794 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=334269 stripe_ofs=334269 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=378182 stripe_ofs=378182 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=325041 stripe_ofs=325041 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=295118 stripe_ofs=295118 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=315823 stripe_ofs=315823 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=346112 stripe_ofs=346112 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=303128 stripe_ofs=303128 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=364011 stripe_ofs=364011 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): result: ofs=331137 stripe_ofs=331137 part_ofs=0 rule->part_size=0
2020-12-04T12:37:22.227+0100 7fc405f79080 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
Just before these entries begin, I see the following:
2020-12-04T12:37:22.203+0100 7fc405f79080 20 RGWRados::Bucket::List::list_objects_ordered INFO end of outer loop, truncated=1, count=4, attempt=9
2020-12-04T12:37:22.203+0100 7fc405f79080 20 RGWRadosList::do_incomplete_multipart processing incomplete multipart entry RGWMultipartUploadEntry{ obj.key="_multipart_/halon_2_settings_2020-08-28.json.2~TD-WUPWhOicUKkz4NTxZcxuEl3b_IqC.meta" mp=RGWMPObj:{ prefix="/halon_2_settings_2020-08-28.json.2~TD-WUPWhOicUKkz4NTxZcxuEl3b_IqC", meta="/halon_2_settings_2020-08-28.json.2~TD-WUPWhOicUKkz4NTxZcxuEl3b_IqC.meta" } }
2020-12-04T12:37:22.203+0100 7fc405f79080 20 RGWRadosList::do_incomplete_multipart processing incomplete multipart entry RGWMultipartUploadEntry{ obj.key="_multipart_gl-events_0_2020-08-28.json.2~OQWHwayxwHwUMLteZWQZWRnsdgT5Cl2.meta" mp=RGWMPObj:{ prefix="gl-events_0_2020-08-28.json.2~OQWHwayxwHwUMLteZWQZWRnsdgT5Cl2", meta="gl-events_0_2020-08-28.json.2~OQWHwayxwHwUMLteZWQZWRnsdgT5Cl2.meta" } }
2020-12-04T12:37:22.203+0100 7fc405f79080 20 RGWRadosList::do_incomplete_multipart processing incomplete multipart entry RGWMultipartUploadEntry{ obj.key="_multipart_gl-events_0_2020-08-28.json.2~zz-6rqcuHGmSPYTY2IRueMP5HIFOL6t.meta" mp=RGWMPObj:{ prefix="gl-events_0_2020-08-28.json.2~zz-6rqcuHGmSPYTY2IRueMP5HIFOL6t", meta="gl-events_0_2020-08-28.json.2~zz-6rqcuHGmSPYTY2IRueMP5HIFOL6t.meta" } }
2020-12-04T12:37:22.203+0100 7fc405f79080 20 RGWRadosList::do_incomplete_multipart processing incomplete multipart entry RGWMultipartUploadEntry{ obj.key="_multipart_gl-events_1_2020-08-28.json.2~-WCk4AJE1k7or-KN6qTePxfIvHL1NLf.meta" mp=RGWMPObj:{ prefix="gl-events_1_2020-08-28.json.2~-WCk4AJE1k7or-KN6qTePxfIvHL1NLf", meta="gl-events_1_2020-08-28.json.2~-WCk4AJE1k7or-KN6qTePxfIvHL1NLf.meta" } }
Is anyone able to interpret these logs and explain why the command appears to be looping over the same objects?
Thanks,
James.
Dear Cephers,
we are currently mounting CephFS with relatime, using the FUSE client (version 13.2.6):
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,relatime,user_id=0,group_id=0,allow_other)
For the first time, I wanted to use atime to identify old unused data. My expectation with "relatime" was that the access time stamp would be updated less often, for example,
only if the last file access was >24 hours ago. However, that does not seem to be the case:
----------------------------------------------
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
$ cat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root > /dev/null
$ sync
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
----------------------------------------------
I also tried this via an nfs-ganesha mount, and via a ceph-fuse mount with admin caps,
but atime never changes.
Is atime really never updated with CephFS, or is this configurable?
Something as coarse as "update at maximum once per day only" would be perfectly fine for the use case.
Cheers,
Oliver