Hi all,
I just created a Ceph cluster to use CephFS. When I create the CephFS
pools and the filesystem, I get the error below:
# ceph osd pool create cephfs_data 128
pool 'cephfs_data' created
# ceph osd pool create cephfs_metadata 128
pool 'cephfs_metadata' created
# ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 6 and data pool 5
# ceph -s
  cluster:
    id:     1c27def45-f0f9-494d-sfke-eb4323432fd
    health: HEALTH_ERR
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds

  services:
    mon: 2 daemons, quorum ceph-mon01,ceph-mon02
    mgr: ceph-adm01(active)
    mds: cephfs-0/0/1 up
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   12 GiB used, 588 GiB / 600 GiB avail
    pgs:     256 active+clean
But when I check max_mds for the filesystem, it says 1:
# ceph fs get cephfs | grep max_mds
max_mds 1
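For context, the status line "mds: cephfs-0/0/1 up" means 0 MDS daemons are up out of the 1 that max_mds asks for, i.e. no MDS daemon is running at all, and the filesystem stays offline until one starts. A rough sketch of manually deploying one (the hostname ceph-mon01, the default cluster name "ceph", and a package-based install are assumptions):
# on the intended MDS host
mkdir -p /var/lib/ceph/mds/ceph-ceph-mon01
ceph auth get-or-create mds.ceph-mon01 mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-ceph-mon01/keyring
chown -R ceph:ceph /var/lib/ceph/mds/ceph-ceph-mon01
systemctl enable --now ceph-mds@ceph-mon01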
Does anyone know what I am missing here? Any input is much appreciated.
Regards,
Ram
Ceph-explorer..
I have some questions for those who’ve experienced this issue.
1. It seems like those reporting this issue are seeing it strictly after upgrading to Octopus. From what version did each of these sites upgrade to Octopus? From Nautilus? Mimic? Luminous?
2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If so, please describe.
3. Is anyone making copies of the affected objects (to the same or to a different bucket) prior to seeing the issue? If so, does the destination bucket have lifecycle rules? And are those copies ever removed?
4. Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'? That would allow us to see a request from an RGW to either remove a tail object or decrement its reference counter (and when its counter reaches 0 it will be deleted).
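For anyone willing to try, a hedged sketch of enabling that (the "client.rgw" section is an assumption; it may need to be the full client.rgw.<instance> name, and the setting can also go into ceph.conf followed by a daemon restart):
ceph config set client.rgw debug_ms 1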
Thanks,
Eric
> On Nov 12, 2020, at 4:54 PM, huxiaoyu(a)horebdata.cn wrote:
>
> Looks like this is a very dangerous bug for data safety. Hopefully the bug will be quickly identified and fixed.
>
> best regards,
>
> Samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: Janek Bevendorff
> Date: 2020-11-12 18:17
> To: huxiaoyu(a)horebdata.cn; EDH - Manuel Rios; Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
> I have never seen this on Luminous. I recently upgraded to Octopus, and the issue started occurring only a few weeks later.
>
> On 12/11/2020 16:37, huxiaoyu(a)horebdata.cn wrote:
> Which Ceph versions are affected by this RGW bug/issue? Luminous, Mimic, Octopus, or the latest?
>
> any idea?
>
> samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: EDH - Manuel Rios
> Date: 2020-11-12 14:27
> To: Janek Bevendorff; Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
> This same error caused us to wipe a full cluster of 300 TB... It is likely related to some RADOS index/database bug, not to S3.
>
> As Janek explained, it is a major issue, because the error happens silently, and you only detect it via S3, when you go to delete/purge an S3 bucket and it returns NoSuchKey. The error is not related to S3 logic.
>
> Hopefully this time the devs can take enough time to find and resolve the issue. The error happens with low EC profiles, and even with 3x replication in some cases.
>
> Regards
>
>
>
> -----Original Message-----
> From: Janek Bevendorff <janek.bevendorff(a)uni-weimar.de>
> Sent: Thursday, 12 November 2020 14:06
> To: Rafael Lopez <rafael.lopez(a)monash.edu>
> CC: Robin H. Johnson <robbat2(a)gentoo.org>; ceph-users <ceph-users(a)ceph.io>
> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
>
> Here is a bug report concerning (probably) this exact issue:
> https://tracker.ceph.com/issues/47866
>
> I left a comment describing the situation and my (limited) experiences with it.
>
>
> On 11/11/2020 10:04, Janek Bevendorff wrote:
>>
>> Yeah, that seems to be it. There are 239 objects prefixed
>> .8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, none of the
>> multiparts from the other file are to be found, and the head object
>> is 0 bytes.
>>
>> I checked another multipart object with an end pointer of 11.
>> Surprisingly, it had way more than 11 parts (39 to be precise) named
>> .1, .1_1, .1_2, .1_3, etc. Not sure how Ceph identifies those, but I
>> could find them in the dump at least.
>>
>> I have no idea why the objects disappeared. I ran a Spark job over all
>> buckets, read 1 byte of every object, and recorded errors. Of the 78
>> buckets, two are missing objects: one bucket is missing one object,
>> the other is missing 15. So, luckily, the incidence is still quite
>> low, but the problem seems to be expanding slowly.
>>
>>
>> On 10/11/2020 23:46, Rafael Lopez wrote:
>>> Hi Janek,
>>>
>>> What you said sounds right - an S3 single part obj won't have an S3
>>> multipart string as part of the prefix. S3 multipart string looks
>>> like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".
>>>
>>> From memory, single part S3 objects that don't fit in a single rados
>>> object are assigned a random prefix that has nothing to do with
>>> the object name, and the rados tail/data objects (not the head
>>> object) have that prefix.
>>> As per your working example, the prefix for that would be
>>> '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow"
>>> objects with names containing that prefix, and if you add up the
>>> sizes it should be the size of your S3 object.
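>>>
>>> Something like this should list them (a sketch; the pool name
>>> default.rgw.buckets.data is the default data pool and may differ on
>>> your cluster):
>>>
>>> rados -p default.rgw.buckets.data ls | grep 8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh
>>> rados -p default.rgw.buckets.data stat '<one of the matches>'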
>>>
>>> You should look at working and non working examples of both single
>>> and multipart S3 objects, as they are probably all a bit different
>>> when you look in rados.
>>>
>>> I agree it is a serious issue, because once objects are no longer in
>>> rados, they cannot be recovered. If it was a case that there was a
>>> link broken or rados objects renamed, then we could work to
>>> recover...but as far as I can tell, it looks like stuff is just
>>> vanishing from rados. The only explanation I can think of is some
>>> (rgw or rados) background process is incorrectly doing something with
>>> these objects (eg. renaming/deleting). I had thought perhaps it was a
>>> bug with the rgw garbage collector..but that is pure speculation.
>>>
>>> Once you can articulate the problem, I'd recommend logging a bug
>>> tracker upstream.
>>>
>>>
>>> On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff
>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>
>>> Here's something else I noticed: when I stat objects that work
>>> via radosgw-admin, the stat info contains a "begin_iter" JSON
>>> object with RADOS key info like this
>>>
>>>
>>>     "key": {
>>>         "name": "29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz",
>>>         "instance": "",
>>>         "ns": ""
>>>     }
>>>
>>>
>>> and then "end_iter" with key info like this:
>>>
>>>
>>>     "key": {
>>>         "name": ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
>>>         "instance": "",
>>>         "ns": "shadow"
>>>     }
>>>
>>> However, when I check the broken 0-byte object, the "begin_iter"
>>> and "end_iter" keys look like this:
>>>
>>>
>>>     "key": {
>>>         "name": "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1",
>>>         "instance": "",
>>>         "ns": "multipart"
>>>     }
>>>
>>> [...]
>>>
>>>
>>>     "key": {
>>>         "name": "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19",
>>>         "instance": "",
>>>         "ns": "multipart"
>>>     }
>>>
>>> So, it's the full name plus a suffix and the namespace is
>>> multipart, not shadow (or empty). This in itself may just be an
>>> artefact of whether the object was uploaded in one go or as a
>>> multipart object, but the second difference is that I cannot find
>>> any of the multipart objects in my pool's object name dump. I
>>> can, however, find the shadow RADOS object of the intact S3 object.
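>>>
>>> (For reference, the stat output above comes from something like the
>>> following, with bucket and key as placeholders:
>>> radosgw-admin object stat --bucket=<bucket> --object=<key>)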
>>>
>>>
>>>
>>>
>>> --
>>> *Rafael Lopez*
>>> Devops Systems Engineer
>>> Monash University eResearch Centre
>>>
>>> T: +61 3 9905 9118
>>> E: rafael.lopez(a)monash.edu
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hello,
Reading Karan's blog post from last year about benchmarking the insertion of billions of objects into Ceph via S3 / RGW[0], it reads:
> we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test. As represented in chart 5, the object creation rate was found to be notably reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB (default) to 18KB. As such, for objects larger than bluestore_min_alloc_size_hdd, the default value seems to be optimal; smaller objects require further investigation if you intend to reduce the bluestore_min_alloc_size_hdd parameter.
There is also a 2018 mail thread on this topic with the same conclusion, although it used RADOS directly rather than RGW[3]. I read the RGW data layout page in the documentation[1] and concluded that, by default, every object inserted via S3 / RGW will indeed allocate at least 64 KiB. A pull request from last year[2] seems to confirm this and also suggests that lowering bluestore_min_alloc_size_hdd has adverse side effects.
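For reference, a hedged way to check what a given OSD is actually using (osd.0 and access to its admin socket are assumptions; the value is persisted when the OSD is created, so changing the config only affects newly created OSDs):
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
As a worked example of the overhead: a 4 KiB object on HDD OSDs with the 64 KiB default still allocates 64 KiB per replica, i.e. 192 KiB in a 3x pool, a 16x inflation over the 12 KiB that replication alone would need.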
That being said, I'm curious to know whether people have developed strategies to cope with this overhead. Someone mentioned packing objects together client-side to make them larger, but maybe there are simpler ways to achieve the same?
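One simple variant of the packing idea, sketched here with s3cmd and a placeholder bucket (any S3 client would do; the trade-off is that reading a single member means fetching and unpacking the whole archive):
tar cf batch-000001.tar smallfiles-000001/
s3cmd put batch-000001.tar s3://mybucket/batch-000001.tar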
Cheers
[0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
[1] https://docs.ceph.com/en/latest/radosgw/layout/
[2] https://github.com/ceph/ceph/pull/32809
[3] https://www.spinics.net/lists/ceph-users/msg45755.html
--
Loïc Dachary, Artisan Logiciel Libre
Hi,
I have set up a Ceph cluster with cephadm, using the Docker backend.
I want to move /var/lib/docker to a separate device to get better
performance and reduce load on the OS device. I tried to do that by
stopping Docker, copying the contents of /var/lib/docker to the new
device, and mounting the new device at /var/lib/docker.
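That is, roughly the following, with the device name and copy tool as placeholders:
systemctl stop docker
mount /dev/sdX /mnt/newdisk
cp -a /var/lib/docker/. /mnt/newdisk/
umount /mnt/newdisk
mount /dev/sdX /var/lib/docker   # plus a matching /etc/fstab entry
systemctl start docker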
The other containers started and continue to run as expected, but the
Ceph containers seem to be broken, and I am not able to get them back
into a working state.
I have tried removing the host with `ceph orch host rm itcnchn-bb4067`
and re-adding it, but with no effect. The strange thing is that 2 of
the 4 containers come up as expected:
ceph orch ps itcnchn-bb4067
NAME                                  HOST            STATUS         REFRESHED  AGE  VERSION    IMAGE NAME               IMAGE ID      CONTAINER ID
crash.itcnchn-bb4067                  itcnchn-bb4067  running (18h)  10m ago    4w   15.2.7     docker.io/ceph/ceph:v15  2bc420ddb175  2af28c4571cf
mds.cephfs.itcnchn-bb4067.qzoshl      itcnchn-bb4067  error          10m ago    4w   <unknown>  docker.io/ceph/ceph:v15  <unknown>     <unknown>
mon.itcnchn-bb4067                    itcnchn-bb4067  error          10m ago    18h  <unknown>  docker.io/ceph/ceph:v15  <unknown>     <unknown>
rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc  itcnchn-bb4067  running (18h)  10m ago    4w   15.2.7     docker.io/ceph/ceph:v15  2bc420ddb175  00d000aec32b
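One thing I am wondering about is whether forcing cephadm to recreate the broken containers would help; a hedged sketch, using the daemon names from the output above (I have not verified that this fixes it):
ceph orch daemon redeploy mon.itcnchn-bb4067
ceph orch daemon redeploy mds.cephfs.itcnchn-bb4067.qzoshl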
The Docker logs from the active manager do not say much about what is
wrong:
debug 2021-01-05T09:57:52.537+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring mds.cephfs.itcnchn-bb4067.qzoshl (unknown last config time)...
debug 2021-01-05T09:57:52.541+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon mds.cephfs.itcnchn-bb4067.qzoshl on itcnchn-bb4067
debug 2021-01-05T09:57:52.973+0000 7fdb64e88700 0 log_channel(cluster) log [DBG] : pgmap v347: 241 pgs: 241 active+clean; 18 GiB data, 50 GiB used, 52 TiB / 52 TiB avail; 18 KiB/s rd, 78 KiB/s wr, 24 op/s
debug 2021-01-05T09:57:53.085+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring mon.itcnchn-bb4067 (unknown last config time)...
debug 2021-01-05T09:57:53.085+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon mon.itcnchn-bb4067 on itcnchn-bb4067
debug 2021-01-05T09:57:53.625+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc (unknown last config time)...
debug 2021-01-05T09:57:53.629+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc on itcnchn-bb4067
debug 2021-01-05T09:57:54.141+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring crash.itcnchn-bb4067 (unknown last config time)...
debug 2021-01-05T09:57:54.141+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon crash.itcnchn-bb4067 on itcnchn-bb4067
- Karsten
Has anybody run into a 'stuck' OSD service specification? I've tried
to delete it, but it's stuck in 'deleting' state, and has been for
quite some time (even prior to upgrade, on 15.2.x). This is on 16.2.3:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
osd.osd_spec 504/525 <deleting> 12m label:osd
root@ceph01:/# ceph orch rm osd.osd_spec
Removed service osd.osd_spec
From the active monitor:
debug 2021-05-06T23:14:48.909+0000 7f17d310b700 0 log_channel(cephadm) log [INF] : Remove service osd.osd_spec
Yet in `ceph orch ls` it's still there, same as above. Running --export on it:
root@ceph01:/# ceph orch ls osd.osd_spec --export
service_type: osd
service_id: osd_spec
service_name: osd.osd_spec
placement: {}
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
We've tried --force, as well, with no luck.
To be clear, the --export output, even prior to the delete, looks
nothing like the actual service specification we're using, even after
I re-apply it, so something seems 'bugged'. Here's the OSD
specification we're applying:
service_type: osd
service_id: osd_spec
placement:
  label: "osd"
data_devices:
  rotational: 1
db_devices:
  rotational: 0
db_slots: 12
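(For completeness, we re-apply it with something like the following; the filename is a placeholder for wherever the spec above is stored:
ceph orch apply -i osd_spec.yaml
)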
I would appreciate any insight into how to clear this up (without
removing the actual OSDs); we just want to apply the updated service
specification. We used to use host placement rules and are switching
to label-based placement.
Thanks,
David
Dear cephers,
I have a strange problem. An OSD went down and recovery finished. For some reason, I have a slow ops warning for the failed OSD stuck in the system:
health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
The OSD is auto-out:
| 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |
It is probably a warning dating back to just before the failure. How can I clear it?
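If the daemon is running again, a hedged way to check whether any ops are actually still in flight, via its admin socket on the OSD host:
ceph daemon osd.580 dump_ops_in_flight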
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
I keep getting scrub errors in my index pool and log pool that I always have to repair:
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 2 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 20.19 is active+clean+inconsistent, acting [39,41,37]
Why is this? I have no clue at all; there is no log entry, nothing ☹
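For reference, the repair I run each time, plus a hedged way to list which objects the scrub flagged (pg 20.19 taken from the output above):
rados list-inconsistent-obj 20.19 --format=json-pretty
ceph pg repair 20.19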
Hi,
We've done our fair share of Ceph cluster upgrades since Hammer, and
have not seen many problems with them. I'm now at the point where I
have to upgrade a rather large cluster running Luminous, and I would
like to hear from other users whether they have experienced issues I
should expect, so that I can anticipate them beforehand.
As said, the cluster is running Luminous (12.2.13) and has the following
services active:
  services:
    mon: 3 daemons, quorum osdnode01,osdnode02,osdnode04
    mgr: osdnode01(active), standbys: osdnode02, osdnode03
    mds: pmrb-3/3/3 up {0=osdnode06=up:active,1=osdnode08=up:active,2=osdnode07=up:active}, 1 up:standby
    osd: 116 osds: 116 up, 116 in
    rgw: 3 daemons active
Of the OSDs, 11 are SSDs and 105 are HDDs. The capacity of the cluster
is 1.01 PiB. We have 2 active crush rules on 18 pools. All pools have
a size of 3, and there is a total of 5760 PGs.
{
    "rule_id": 1,
    "rule_name": "hdd-data",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -10,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
},
{
    "rule_id": 2,
    "rule_name": "ssd-data",
    "ruleset": 2,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -21,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
rbd -> crush_rule: hdd-data
.rgw.root -> crush_rule: hdd-data
default.rgw.control -> crush_rule: hdd-data
default.rgw.data.root -> crush_rule: ssd-data
default.rgw.gc -> crush_rule: ssd-data
default.rgw.log -> crush_rule: ssd-data
default.rgw.users.uid -> crush_rule: hdd-data
default.rgw.usage -> crush_rule: ssd-data
default.rgw.users.email -> crush_rule: hdd-data
default.rgw.users.keys -> crush_rule: hdd-data
default.rgw.meta -> crush_rule: hdd-data
default.rgw.buckets.index -> crush_rule: ssd-data
default.rgw.buckets.data -> crush_rule: hdd-data
default.rgw.users.swift -> crush_rule: hdd-data
default.rgw.buckets.non-ec -> crush_rule: ssd-data
DB0475 -> crush_rule: hdd-data
cephfs_pmrb_data -> crush_rule: hdd-data
cephfs_pmrb_metadata -> crush_rule: ssd-data
All but four clients are running Luminous; the four are running Jewel
(and need upgrading before proceeding with this upgrade).
So, normally, I would 'just' upgrade all Ceph packages on the monitor
nodes and restart the mons and then the mgrs. After that, I would
upgrade all Ceph packages on the OSD nodes and restart all the OSDs,
and then, after that, the MDSes and RGWs. Restarting the OSDs will
probably take a while.
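Spelled out, the restart sequence I have in mind looks roughly like this (a sketch assuming systemd; the noout flag and the require-osd-release/msgr2 steps are taken from the Nautilus release notes):
# on each mon node, after upgrading the packages:
systemctl restart ceph-mon.target
# on each mgr node:
systemctl restart ceph-mgr.target
# before restarting OSDs, to avoid rebalancing during the restarts:
ceph osd set noout
# on each OSD node, after upgrading the packages:
systemctl restart ceph-osd.target
# once all OSDs are running Nautilus:
ceph osd unset noout
ceph osd require-osd-release nautilus
ceph mon enable-msgr2
# finally, restart the MDS and RGW daemons:
systemctl restart ceph-mds.target ceph-radosgw.target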
If anyone has a hint on what I should expect to cause some extra load or
waiting time, that would be great.
Obviously, we have read
https://ceph.com/releases/v14-2-0-nautilus-released/ , but I'm looking
for real-world experiences.
Thanks!
--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | info(a)tuxis.nl
Hi Reed,
To add to this command by Weiwen:
On 28.05.21 13:03, 胡 玮文 wrote:
> Have you tried just starting multiple rsync processes simultaneously to transfer different directories? Distributed systems like Ceph often benefit from more parallelism.
When I migrated from XFS on iSCSI (legacy system, no Ceph) to CephFS a
few months ago, I used msrsync [1] and was quite happy with the speed.
For your use case, I would start with -p 12 but might experiment with up
to -p 24 (as you only have 6C/12T in your CPU). With many small files,
you also might want to increase -s from the default 1000.
Note that msrsync does not work with the --delete rsync flag. As I was
syncing a live system, I ended up with this workflow:
- Initial sync with msrsync (something like ./msrsync -p 12 --progress
--stats --rsync "-aS --numeric-ids" ...)
- Second sync with msrsync (to sync changes during the first sync)
- Take old storage off-line for users / read-only
- Final rsync with --delete (i.e. rsync -aS --numeric-ids --delete ...)
- Mount cephfs at location of old storage, adjust /etc/exports with fsid
entries where necessary, turn system back on-line / read-write
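Spelled out with placeholder paths (/old_storage and /mnt/cephfs are assumptions):
# first and second pass, while the source is still live:
./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" /old_storage/ /mnt/cephfs/
# final pass, after the old storage is read-only:
rsync -aS --numeric-ids --delete /old_storage/ /mnt/cephfs/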
Cheers
Sebastian
[1] https://github.com/jbd/msrsync