Hi,
Other than getting all objects of the pool and filtering by image ID,
is there any easier way to get the number of allocated objects for
an RBD image?
What I really want to know is the actual usage of an image.
An allocated object could be used only partially, but that's fine,
it doesn't need to be 100% accurate. Taking the object count times
the object size should be sufficient.
"rbd export" exports the actually used data, but exporting the whole
image just to get the actual usage seems like overkill. That brings up
another question: is there any way to know the export size before
running the export?
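For reference, this is roughly the by-hand approach I'm trying to avoid (just
a sketch; pool and image names are placeholders, and it relies on the
block_name_prefix and object_size fields reported by "rbd info"):

POOL=rbd
IMAGE=myimage   # placeholder names
# Each RBD data object is named <block_name_prefix>.<offset>, so counting
# objects with that prefix gives the number of allocated objects.
PREFIX=$(rbd info "$POOL/$IMAGE" --format json | jq -r .block_name_prefix)
OBJ_SIZE=$(rbd info "$POOL/$IMAGE" --format json | jq -r .object_size)
COUNT=$(rados -p "$POOL" ls | grep -c "^$PREFIX")
echo "approx. usage: $(( COUNT * OBJ_SIZE / 1024 / 1024 )) MiB"

I'm aware that "rbd du" may already report the used size directly (and quickly
if the fast-diff feature is enabled), which might be close to what I'm after.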
Thanks!
Tony
Hi Eugen
Please find the details below
root@meghdootctr1:/var/log/ceph# ceph -s
  cluster:
    id:     c59da971-57d1-43bd-b2b7-865d392412a5
    health: HEALTH_WARN
            nodeep-scrub flag(s) set
            544 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum meghdootctr1,meghdootctr2,meghdootctr3 (age 5d)
    mgr: meghdootctr1(active, since 5d), standbys: meghdootctr2, meghdootctr3
    mds: 3 up:standby
    osd: 36 osds: 36 up (since 34h), 36 in (since 34h)
         flags nodeep-scrub

  data:
    pools:   2 pools, 544 pgs
    objects: 10.14M objects, 39 TiB
    usage:   116 TiB used, 63 TiB / 179 TiB avail
    pgs:     544 active+clean

  io:
    client: 24 MiB/s rd, 16 MiB/s wr, 2.02k op/s rd, 907 op/s wr
Ceph Versions:
root@meghdootctr1:/var/log/ceph# ceph --version
ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus
(stable)
Ceph df -h
https://pastebin.com/1ffucyJg
Ceph OSD performance dump
https://pastebin.com/1R6YQksE
Ceph tell osd.XX bench (out of 36 OSDs, only 8 give a high IOPS value of 250+;
of those, 4 OSDs are from the HP 3PAR and 4 from the Dell EMC. We use only
4 OSDs from the HP 3PAR and they have worked fine without any latency or IOPS
issues from the beginning, but the remaining 32 OSDs are from the Dell EMC,
of which only 4 perform much better than the other 28.)
https://pastebin.com/CixaQmBi
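For reference, this is roughly how I collected the per-OSD bench numbers
(a sketch; the exact JSON field names from "ceph tell osd.N bench" may differ
slightly between releases):

for osd in $(ceph osd ls); do
    host=$(ceph osd metadata "$osd" -f json | jq -r .hostname)
    mbps=$(ceph tell "osd.$osd" bench -f json | jq -r '.bytes_per_sec / 1048576')
    printf "osd.%-3s %-15s %8.1f MB/s\n" "$osd" "$host" "$mbps"
done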
Please help me to identify whether the issue is with the Dell EMC storage,
with Ceph configuration parameter tuning, or with overload in the cloud setup.
On November 1, 2023 at 9:48 PM Eugen Block <eblock(a)nde.ag> wrote:
> Hi,
>
> for starters please add more cluster details like 'ceph status', 'ceph
> versions', 'ceph osd df tree'. Increasing the network to 10G was the right
> thing to do; you don't get far with 1G under real cluster load. How are
> the OSDs configured (HDD only, SSD only or HDD with rocksdb on SSD)?
> How is the disk utilization?
>
> Regards,
> Eugen
>
> Zitat von prabhav(a)cdac.in:
>
> > We have a production setup of 36 OSDs (SAS disks) totalling 180 TB,
> > allocated to a single Ceph cluster with 3 monitors and 3 managers.
> > There were 830 volumes and VMs created in OpenStack with Ceph as the
> > backend. On Sep 21, users reported slowness in accessing the VMs.
> > Analysing the logs led us to problems with the SAS disks, network
> > congestion and the Ceph configuration (all default values were used).
> > We upgraded the network from 1 Gbps to 10 Gbps for both the public
> > and cluster networks. There was no change.
> > The Ceph benchmark showed that 28 OSDs out of 36 reported very low
> > IOPS of 30 to 50, while the remaining ones showed 300+ IOPS.
> > We gradually started reducing the load on the Ceph cluster and the
> > volume count is now 650. The slow operations have gradually reduced,
> > but I am aware that this is not a solution.
> > The Ceph configuration was updated, increasing the
> > osd_journal_size to 10 GB and setting
> > osd_max_backfills = 1
> > osd_recovery_max_active = 1
> > osd_recovery_op_priority = 1
> > bluestore_cache_trim_max_skip_pinned=10000
> >
> > After one month, we now face another issue: the mgr daemon stopped
> > on all 3 quorum nodes and 16 OSDs went down. From the ceph-mon and
> > ceph-mgr logs I could not determine the reason. Please guide me, as
> > this is a production setup.
Thanks & Regards,
Ms V A Prabha / श्रीमती प्रभा वी ए
Joint Director / संयुक्त निदेशक
Centre for Development of Advanced Computing(C-DAC) / प्रगत संगणन विकास
केन्द्र(सी-डैक)
Tidel Park”, 8th Floor, “D” Block, (North &South) / “टाइडल पार्क”,8वीं मंजिल,
“डी” ब्लॉक, (उत्तर और दक्षिण)
No.4, Rajiv Gandhi Salai / नं.4, राजीव गांधी सलाई
Taramani / तारामणि
Chennai / चेन्नई – 600113
Ph.No.:044-22542226/27
Fax No.: 044-22542294
Hi,
I'm facing a rather new issue with our Ceph cluster: from time to time
ceph-mgr on one of the two mgr nodes gets oom-killed after consuming over
100 GB RAM:
[Nov21 15:02] tp_osd_tp invoked oom-killer:
gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ +0.000010] oom_kill_process.cold+0xb/0x10
[ +0.000002] [ pid ] uid tgid total_vm rss pgtables_bytes
swapents oom_score_adj name
[ +0.000008]
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=504d37b566d9fd442d45904a00584b4f61c93c5d49dc59eb1c948b3d1c096907,mems_allowed=0-1,global_oom,task_memcg=/docker/3826be8f9115479117ddb8b721ca57585b2bdd58a27c7ed7b38e8d83eb795957,task=ceph-mgr,pid=3941610,uid=167
[ +0.000697] Out of memory: Killed process 3941610 (ceph-mgr)
total-vm:146986656kB, anon-rss:125340436kB, file-rss:0kB, shmem-rss:0kB,
UID:167 pgtables:260356kB oom_score_adj:0
[ +6.509769] oom_reaper: reaped process 3941610 (ceph-mgr), now
anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The cluster is stable and operating normally; there is nothing unusual going
on before, during or after the kill, so it's unclear what causes the mgr to
balloon, use up all the RAM and get killed. The systemd logs aren't very
helpful: they just show normal mgr operations until the daemon fails to
allocate memory and gets killed: https://pastebin.com/MLyw9iVi
The mgr experienced this issue several times in the last 2 months, and the
events don't appear to correlate with any other events in the cluster
because basically nothing else happened at around those times. How can I
investigate this and figure out what's causing the mgr to consume all
memory and get killed?
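In case it's useful, the only idea I've had so far is to sample the mgr's
memory periodically so the growth can be correlated with module activity.
A rough sketch (the log path is arbitrary, and whether "heap stats" is
available depends on the build):

#!/bin/bash
# Sample ceph-mgr RSS once a minute; optionally also dump tcmalloc heap stats.
MGR=$(ceph mgr dump -f json | jq -r .active_name)
while true; do
    ts=$(date -Is)
    rss=$(ps -C ceph-mgr -o rss= | awk '{s+=$1} END {print s}')
    echo "$ts rss_kb=$rss" >> /var/log/ceph-mgr-rss.log
    # heap stats only work if the daemon uses tcmalloc
    ceph tell "mgr.$MGR" heap stats >> /var/log/ceph-mgr-rss.log 2>&1 || true
    sleep 60
done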
I would very much appreciate any advice!
Best regards,
Zakhar
Hi,
As I've read and thought a lot about this migration, and as it is a bigger project, I was wondering if anyone has done it already and might share some notes or playbooks, because in everything I read there were some parts missing or unclear to me.
I have a few different approaches in mind, so maybe you have some suggestions or hints.
a) Upgrade Nautilus on CentOS 7, with the few missing features like the dashboard and Prometheus. After that, migrate one node after another to Ubuntu 20.04 with Octopus and then upgrade Ceph to the recent stable version.
b) Migrate one node after another to Ubuntu 18.04 with Nautilus, then upgrade to Octopus and after that to Ubuntu 20.04.
or
c) Upgrade one node after another to Ubuntu 20.04 with Octopus and join it to the cluster until all nodes are upgraded.
As a test I tried c) with a mon node, but adding it to the cluster fails with some failed state, still probing for the other mons. (I don't have the right log at hand right now.)
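When I retry it, my plan is to look at the probing mon roughly like this (just a sketch; I'd run the first command on the new mon host):

# what the new mon thinks it is probing / which monmap it has
ceph daemon mon.$(hostname -s) mon_status
# the monmap addresses (v1/v2) the new mon needs to reach on the existing cluster
ceph mon dump

but maybe someone already knows whether mixing an Octopus mon into a Nautilus quorum can work at all.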
So my questions are:
a) What would be the best (most stable) migration path, and
b) is it in general possible to add a new Octopus mon (not an upgraded one) to a Nautilus cluster where the other mons are still on Nautilus?
I hope my thoughts and questions are understandable :)
Thanks for any hint and suggestion. Best, Götz
Hi folks,
I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely cause of why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals not scrubbed; see code) to find out how (deep-) scrub times are distributed. Output below.
What I expected is along the lines that HDD-OSDs try to scrub every 1-3 days, while they try to deep-scrub every 7-14 days. In other words, OSDs that have been deep-scrubbed within the last 7 days would *never* be in scrubbing+deep state. However, what I see is completely different. There seems to be no distinction between scrub- and deep-scrub start times. This is really unexpected as nobody would try to deep-scrub HDDs every day. Weekly to bi-weekly is normal, specifically for large drives.
Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do that?
The observed behavior is very unusual for RAID systems (if it's not a bug in the report script). With this behavior it's not surprising that people complain about "not deep-scrubbed in time" messages and too high deep-scrub IO load, when such a large percentage of OSDs is needlessly deep-scrubbed again after only 1-6 days.
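For reference, the knobs I'm aware of look roughly like this (values are only examples, not recommendations; whether they give an effective minimum deep-scrub interval is exactly my question):

# earliest / latest start of regular scrubs
ceph config set osd osd_scrub_min_interval 86400
ceph config set osd osd_scrub_max_interval 604800
# target deep-scrub period
ceph config set osd osd_deep_scrub_interval 1209600
# chance that a scheduled scrub is promoted to a deep scrub
ceph config set osd osd_deep_scrub_randomize_ratio 0.0

What seems to be missing is something like osd_deep_scrub_min_interval to prevent a PG that was deep-scrubbed a few days ago from being deep-scrubbed again.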
Sample output:
# scrub-report
dumped pgs
Scrub report:
4121 PGs not scrubbed since 1 intervals (6h)
3831 PGs not scrubbed since 2 intervals (6h)
4012 PGs not scrubbed since 3 intervals (6h)
3986 PGs not scrubbed since 4 intervals (6h)
2998 PGs not scrubbed since 5 intervals (6h)
1488 PGs not scrubbed since 6 intervals (6h)
909 PGs not scrubbed since 7 intervals (6h)
771 PGs not scrubbed since 8 intervals (6h)
582 PGs not scrubbed since 9 intervals (6h) 2 scrubbing
431 PGs not scrubbed since 10 intervals (6h)
333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
265 PGs not scrubbed since 12 intervals (6h)
195 PGs not scrubbed since 13 intervals (6h)
116 PGs not scrubbed since 14 intervals (6h)
78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
72 PGs not scrubbed since 16 intervals (6h)
37 PGs not scrubbed since 17 intervals (6h)
5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
33 PGs not scrubbed since 20 intervals (6h)
23 PGs not scrubbed since 21 intervals (6h)
16 PGs not scrubbed since 22 intervals (6h)
12 PGs not scrubbed since 23 intervals (6h)
8 PGs not scrubbed since 24 intervals (6h)
2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
6 PGs not scrubbed since 28 intervals (6h)
2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
1 PGs not scrubbed since 32 intervals (6h) 19.133f*
1 PGs not scrubbed since 33 intervals (6h) 19.1103*
3 PGs not scrubbed since 36 intervals (6h) 19.19cc* 19.12f4* 19.248*
1 PGs not scrubbed since 39 intervals (6h) 19.1984*
1 PGs not scrubbed since 41 intervals (6h) 14.449*
1 PGs not scrubbed since 44 intervals (6h) 19.179f*
Deep-scrub report:
3723 PGs not deep-scrubbed since 1 intervals (24h)
4621 PGs not deep-scrubbed since 2 intervals (24h) 8 scrubbing+deep
3588 PGs not deep-scrubbed since 3 intervals (24h) 8 scrubbing+deep
2929 PGs not deep-scrubbed since 4 intervals (24h) 3 scrubbing+deep
1705 PGs not deep-scrubbed since 5 intervals (24h) 4 scrubbing+deep
1904 PGs not deep-scrubbed since 6 intervals (24h) 5 scrubbing+deep
1540 PGs not deep-scrubbed since 7 intervals (24h) 7 scrubbing+deep
1304 PGs not deep-scrubbed since 8 intervals (24h) 7 scrubbing+deep
923 PGs not deep-scrubbed since 9 intervals (24h) 5 scrubbing+deep
557 PGs not deep-scrubbed since 10 intervals (24h) 7 scrubbing+deep
501 PGs not deep-scrubbed since 11 intervals (24h) 2 scrubbing+deep
363 PGs not deep-scrubbed since 12 intervals (24h) 2 scrubbing+deep
377 PGs not deep-scrubbed since 13 intervals (24h) 1 scrubbing+deep
383 PGs not deep-scrubbed since 14 intervals (24h) 2 scrubbing+deep
252 PGs not deep-scrubbed since 15 intervals (24h) 2 scrubbing+deep
116 PGs not deep-scrubbed since 16 intervals (24h) 5 scrubbing+deep
47 PGs not deep-scrubbed since 17 intervals (24h) 2 scrubbing+deep
10 PGs not deep-scrubbed since 18 intervals (24h)
2 PGs not deep-scrubbed since 19 intervals (24h) 19.1c6c* 19.a01*
1 PGs not deep-scrubbed since 20 intervals (24h) 14.1ed*
2 PGs not deep-scrubbed since 21 intervals (24h) 19.1322* 19.10f6*
1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*
PGs marked with a * are on busy OSDs and not eligible for scrubbing.
The script (pasted here because attaching doesn't work):
# cat bin/scrub-report
#!/bin/bash
# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.
ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""
T0="$(date +%s)"
scrub_info="$(jq --arg T0 "$T0" -rc '.pg_stats[] | [
.pgid,
(.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*6)|ceil),
(.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*24)|ceil),
.state,
(.acting | join(" "))
] | @tsv
' /root/.cache/ceph/pgs_dump.json)"
# less <<<"$scrub_info"
# 1 2 3 4 5..NF
# pg_id scrub-ints deep-scrub-ints status acting[]
awk <<<"$scrub_info" '{
for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
if($4 == "active+clean") {
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
} else if($4 ~ /scrubbing\+deep/) {
deep_scrubbing[$3]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else if($4 ~ /scrubbing/) {
scrubbing[$2]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else {
unclean[$2]++
unclean_d[$3]++
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
for(i=5; i<=NF; ++i) osd[$i]="busy"
}
}
END {
print "Scrub report:"
for(si=1; si<=si_mx; ++si) {
if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
if(unclean[si]) printf(" %d unclean", unclean[si])
if(pg_sn[si]<=5) {
split(pg_sn_ids[si], pgs)
for(pg in pgs) {
# reset per PG, otherwise one busy PG marks all following PGs with a *
osds_busy=0
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
if(!osds_busy) printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "Deep-scrub report:"
for(dsi=1; dsi<=dsi_mx; ++dsi) {
if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
if(pg_dsn[dsi]<=5) {
split(pg_dsn_ids[dsi], pgs)
for(pg in pgs) {
# reset per PG, otherwise one busy PG marks all following PGs with a *
osds_busy=0
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
if(!osds_busy) printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'
Don't forget the last "'" when copy-pasting.
Thanks for any pointers.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello.
We are testing whether we can run our service on Ceph storage by uploading and storing more than 40 billion files.
So I'd like to check the points below:
1) Maximum number of RADOS Gateway objects that can be stored in one cluster using the bucket index
2) Maximum number of RADOS Gateway objects that can be stored in one bucket
We have looked at the limits on the number of RADOS Gateway objects mentioned in the existing documents, but the number seems to be theoretically unlimited.
If you have operated at this kind of object count in actual services or products, we would appreciate it if you could share your experience.
Below are related documents and related settings values.
> Related documents
- https://documentation.suse.com/ses/5.5/html/ses-all/cha-ceph-gw.html
- https://www.ibm.com/docs/en/storage-ceph/6?topic=resharding-limitations-buc…
- https://docs.ceph.com/en/latest/dev/radosgw/bucket_index/
> Related config
- rgw_dynamic_resharding: true
- rgw_max_objs_per_shard: 100000
- rgw_max_dynamic_shards: 65521
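As a rough back-of-the-envelope with these values (my own arithmetic, not a documented limit):

# objects one bucket can hold before dynamic resharding stops adding shards
echo $(( 65521 * 100000 ))   # = 6552100000, i.e. about 6.55 billion

so 40 billion objects would in any case have to be spread over multiple buckets, with shards growing past 100k objects each if a single bucket went beyond that point.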
Hi Dan,
thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer grained way of controlling scrubbing than just doubling or tripling it right away.
Some more info: these 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used and I already had to increase the warning interval to 2 weeks. With all the warning grace parameters this means that we manage to deep scrub everything about every month. I need to plan for 75% utilisation, and a 3-month period is a bit far on the risky side.
Our data is to a large percentage cold data. Client reads will not do the check for us; we need to combat bit-rot proactively.
The reasons I'm interested in parameters that initiate more scrubs, while also converting more scrubs into deep scrubs, are that:
1) Scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing"; I usually only see "deep scrubbing".
2) I suspect the low deep-scrub count is due to a low number of deep scrubs being scheduled and not due to conflicting per-OSD deep-scrub reservations. With the OSD count we have and the distribution over 12 servers, I would expect a peak of at least 50% of OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than actually are scheduled.
3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD, it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up. For now I would just like (deep) scrub scheduling to look a bit harder and schedule more eligible PGs per time unit.
If we can get deep scrubbing up to an average of 42 PGs completing per hour while keeping osd_max_scrubs=1 to maintain the current IO impact, we should be able to complete a full deep scrub with 75% full OSDs in about 30 days. This is the current tail-time at 25% utilisation. I believe a deep scrub of a PG in these pools currently takes 2-3 hours; it's just a gut feeling from some repair and deep-scrub commands, I would need to check the logs for more precise info.
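(For transparency, the 42 PGs/hour figure is my own rough estimate along these lines, using the PG counts of the two data pools and assuming deep-scrub effort scales with utilisation:

# 2024 + 8192 = 10216 PGs; a 30-day cycle at the current 25% utilisation needs
# 10216 / (30*24) ~ 14 PG deep scrubs per hour; at 75% utilisation there is
# roughly 3x the data per PG, so the rate must roughly triple:
echo "scale=1; 3 * (2024 + 8192) / (30 * 24)" | bc   # ~ 42.5

so it is not a precise number, just the right order of magnitude.)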
Increasing osd_max_scrubs would then be a further and not the only option to push for more deep scrubbing. My expectation would be that values of 2-3 are fine due to the increasingly higher percentage of cold data for which no interference with client IO will happen.
Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dvanders(a)gmail.com>
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs
Hi Frank,
What is your current osd_max_scrubs, and why don't you want to increase it?
With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11
OSDs, so at a minimum it could take 3-4x the amount of time to scrub
the data than if those were replicated pools.
If you want the scrub to complete in time, you need to increase the
amount of scrub slots accordingly.
On the other hand, IMHO the 1-week deadline for deep scrubs is often
much too ambitious for large clusters -- increasing the scrub
intervals is one solution, or I find it simpler to increase
mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio
until you find a ratio that works for your cluster.
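(For example, something along these lines, with values you'd tune for your
cluster; if I recall correctly the defaults are 0.5 and 0.75:

ceph config set mon mon_warn_pg_not_scrubbed_ratio 1.0
ceph config set mon mon_warn_pg_not_deep_scrubbed_ratio 1.0

This only delays the HEALTH_WARN, it doesn't change how often scrubs
actually run.)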
Of course, all of this can impact detection of bit-rot, which anyway
can be covered by client reads if most data is accessed periodically.
But if the cluster is mostly idle or objects are generally not read,
then it would be preferable to increase the osd_max_scrubs slots.
Cheers, Dan
On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder <frans(a)dtu.dk> wrote:
>
> Hi all,
>
> we are using 16T and 18T spinning drives as OSDs and I'm observing that they are not scrubbed as often as I would like. It looks like too few scrubs are scheduled for these large OSDs. My estimate is as follows: we have 852 spinning OSDs backing an 8+2 pool with 2024 PGs and an 8+3 pool with 8192 PGs. On average I see something like 10 PGs of pool 1 and 12 PGs of pool 2 (deep) scrubbing. This amounts to only 232 out of 852 OSDs scrubbing and seems to be due to a conservative rate of (deep) scrubs being scheduled. The PGs (deep) scrub fairly quickly.
>
> I would like to increase gently the number of scrubs scheduled for these drives and *not* the number of scrubs per OSD. I'm looking at parameters like:
>
> osd_scrub_backoff_ratio
> osd_deep_scrub_randomize_ratio
>
> I'm wondering if lowering osd_scrub_backoff_ratio to 0.5 and, maybe, increasing osd_deep_scrub_randomize_ratio to 0.2 would have the desired effect? Are there other parameters to look at that allow gradual changes in the number of scrubs going on?
>
> Thanks a lot for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
Details of this release are summarized here:
https://tracker.ceph.com/issues/63443#note-1
Seeking approvals/reviews for:
smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
rados - Neha, Radek, Travis, Ernesto, Adam King
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade/quincy-x (reef) - Laura PTL
powercycle - Brad
perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
Please reply to this email with approval and/or trackers of known
issues/PRs to address them.
TIA
YuriW
Dear fellow cephers,
today we observed a somewhat worrisome inconsistency on our ceph fs. A file created on one host showed up as 0 length on all other hosts:
[user1@host1 h2lib]$ ls -lh
total 37M
-rw-rw---- 1 user1 user1 12K Nov 1 11:59 dll_wrapper.py
[user2@host2 h2lib]# ls -l
total 34
-rw-rw----. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py
[user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test
[user1@host1 h2lib]$ ls -l
total 37199
-rw-rw---- 1 user1 user1 11641 Nov 1 11:59 dll_wrapper.py
-rw-rw---- 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test
[user2@host2 h2lib]# ls -l
total 45
-rw-rw----. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py
-rw-rw----. 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test
Executing a sync on all these hosts did not help. However, deleting the problematic file and replacing it with a copy seemed to work around the issue. We saw this with ceph kclients of different versions, so it seems to be on the MDS side.
How can this happen and how dangerous is it?
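In case it helps with diagnosis, my plan is to check whether the backing RADOS objects actually contain the data, roughly like this (a sketch; the path and data pool are placeholders, and it assumes the file's first data object is named <inode-hex>.00000000 in whichever data pool the file's layout points to):

# inode number of the suspect file, in hex
INO=$(stat -c %i /path/to/dll_wrapper.py)
HEXINO=$(printf '%x' "$INO")
# check size/mtime of the first backing object directly in RADOS
rados -p con-fs2-data2 stat "${HEXINO}.00000000"
# and compare with what the MDS has cached for that inode (on the MDS host):
# ceph daemon mds.<name> dump inode $INO

If the RADOS object has the full size, I would suspect stale metadata/caps on the MDS side rather than lost data.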
ceph fs status (showing ceph version):
# ceph fs status
con-fs2 - 1662 clients
=======
RANK STATE MDS ACTIVITY DNS INOS
0 active ceph-15 Reqs: 14 /s 2307k 2278k
1 active ceph-11 Reqs: 159 /s 4208k 4203k
2 active ceph-17 Reqs: 3 /s 4533k 4501k
3 active ceph-24 Reqs: 3 /s 4593k 4300k
4 active ceph-14 Reqs: 1 /s 4228k 4226k
5 active ceph-13 Reqs: 5 /s 1994k 1782k
6 active ceph-16 Reqs: 8 /s 5022k 4841k
7 active ceph-23 Reqs: 9 /s 4140k 4116k
POOL TYPE USED AVAIL
con-fs2-meta1 metadata 2177G 7085G
con-fs2-meta2 data 0 7085G
con-fs2-data data 1242T 4233T
con-fs2-data-ec-ssd data 706G 22.1T
con-fs2-data2 data 3409T 3848T
STANDBY MDS
ceph-10
ceph-08
ceph-09
ceph-12
MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
There is no health issue:
# ceph status
  cluster:
    id:     abc
    health: HEALTH_WARN
            3 pgs not deep-scrubbed in time

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 9w)
    mgr: ceph-25(active, since 7w), standbys: ceph-26, ceph-01, ceph-03, ceph-02
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1279 up (since 2d), 1279 in (since 5d)

  task status:

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.20G objects, 3.9 PiB
    usage:   4.9 PiB used, 8.2 PiB / 13 PiB avail
    pgs:     25039 active+clean
             26    active+clean+scrubbing+deep

  io:
    client: 799 MiB/s rd, 55 MiB/s wr, 3.12k op/s rd, 1.82k op/s wr
The inconsistency seems undiagnosed; I couldn't find anything interesting in the cluster log. What should I look for and where?
I moved the folder to another location for diagnosis. Unfortunately, I no longer have two clients showing different numbers; I now see a 0 length everywhere for the moved folder. I'm pretty sure, though, that the file is still non-zero length.
Thanks for any pointers.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14