Hi all,
For a few weeks our Nautilus cluster had been struggling with severe performance issues. When an OSD went down, rebalancing was really slow: long periods with no data transfer at all (neither client nor rebalancing traffic) alternating with phases of rebalancing traffic only. However, client traffic was almost stalled for the whole period until all objects were in place again (VMs were frozen). PGs were stuck in peering or inactive for long times. Sometimes we had to restart the ceph-mon to get the whole process moving again.
The issues started all of a sudden; we don't remember making any changes to the configuration.
The whole cluster had been updated from Mimic to Nautilus (14.2.3) in September, while the issue appeared only a few weeks ago. Updating to 14.2.5 did not resolve the issue back then.
Looking through mailing lists I found the following message: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/028035.html
So I ran "ceph osd require-osd-release nautilus" and all of a sudden the problems where gone! I do not recall executing that command right after the upgrade because the documentation states "Complete the upgrade by disallowing pre-Nautilus OSDs and enabling all new Nautilus-only functionality.". As by that point in time all OSDs, MONs and MGRs were successfully updated there was no reason to believe this command would be necessary.
Therefore I have two questions:
1. What exactly does the command do besides preventing old OSDs from joining?
2. What could have been the issue with the cluster and how did this command fix it?
If it is really that important to run the command, the docs should state this more clearly.
I appreciate any insight on this topic.
Thanks,
Georg
Hi,
happy new year to you!
I'm running a multinode cluster with 3 MGR nodes.
The issue I'm facing now is that "ceph balancer <argument>" runs for
minutes or, in the worst case, hangs.
I have documented the runtime of the following executions:
root@ld3955:~# date && time ceph balancer status
Mon Dec 23 10:06:12 CET 2019
{
"active": true,
"plans": [],
"mode": "upmap"
}
real 1m45,045s
user 0m0,315s
sys 0m0,026s
root@ld3955:~# date && time ceph balancer status
Tue Jan 7 08:11:24 CET 2020
^CInterrupted
Traceback (most recent call last):
File "/usr/bin/ceph", line 1263, in <module>
retval = main()
File "/usr/bin/ceph", line 1194, in main
verbose)
File "/usr/bin/ceph", line 619, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 593, in do_command
return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment
real 102m44,084s
user 0m2,404s
sys 0m1,065s
root@ld3955:~# date && time ceph balancer off
Tue Jan 7 09:57:36 CET 2020
real 1m45,371s
user 0m0,358s
sys 0m0,013s
root@ld3955:~# date && time ceph balancer on
Tue Jan 7 14:57:03 CET 2020
real 0m0,452s
user 0m0,284s
sys 0m0,020s
root@ld3955:~# date && time ceph balancer status
Tue Jan 7 14:57:11 CET 2020
{
"active": true,
"plans": [],
"mode": "upmap"
}
real 1m52,902s
user 0m0,301s
sys 0m0,042s
root@ld3955:~# date && time ceph balancer off
Wed Jan 8 08:49:26 CET 2020
^CInterrupted
Traceback (most recent call last):
File "/usr/bin/ceph", line 1263, in <module>
retval = main()
File "/usr/bin/ceph", line 1194, in main
verbose)
File "/usr/bin/ceph", line 619, in new_style_command
ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
File "/usr/bin/ceph", line 593, in do_command
return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment
real 14m29,097s
user 0m0,579s
sys 0m0,157s
In correlation with this finding I have identified that the active MGR node
is using more than 100% CPU, to be precise 108-120%.
To work around this issue I must stop the active MGR service and
wait until another node becomes active.
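A possibly lighter-weight way to force that failover (a sketch, assuming the active MGR instance is named after its host, e.g. ld3955) would be:
# fail the active mgr over to a standby
ceph mgr fail ld3955
# check which mgr is active now
ceph -s | grep mgr
I have not verified whether that avoids the hang, though.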
What's the issue with the MGR service here?
Should I open a bug report?
Regards
Happy New Year Ceph Community!
I'm in the process of figuring out RBD mirroring with Ceph and having a really tough time with it. I'm trying to set up just one-way mirroring right now on some test systems (bare-metal servers, all Debian 9). The first cluster has 3 nodes and the second cluster has 2 nodes (I'm not worried about a well-performing setup, just the functionality of RBD mirroring right now). The purpose is to have a passive failover Ceph cluster in a separate DC. Mirroring seems like the best solution, but if we can't get it working, we'll end up resorting to a scheduled rsync, which is less than ideal. I've followed several guides, read through a lot of documentation, and nothing has worked for me thus far. If anyone can offer some troubleshooting help or insight into what I might have missed in this setup, I'd greatly appreciate it! I also don't fully understand the relationship between images and pools and how you're supposed to configure statically sized images for a pool that holds a variable amount of data, but that's a question for afterwards, I think :)
Once RBD mirroring is set up, the mirror test image status shows as down+unknown:
On ceph1-dc2:
rbd --cluster dc1ceph mirror pool status fs_data --verbose
health: WARNING
images: 1 total
1 unknown
mirror_test:
global_id: c335017c-9b8f-49ee-9bc1-888789537c47
state: down+unknown
description: status not found
last_update:
Here are the commands I run using ceph-deploy on both clusters to get everything up and running (run from a deploy directory on the first node of each cluster). The clusters are created at the same time, and rbd setup commands are only run after the clusters are up and healthy, and the fs_data pool is created.
-----------------------------------------------------------
Cluster 1 (dc1ceph):
ceph-deploy new ceph1-dc1 ceph2-dc1 ceph3-dc1
sed -i '$ s,.*,public_network = *.*.*.0/24\n,g' ceph.conf
ceph-deploy install ceph1-dc1 ceph2-dc1 ceph3-dc1 --release luminous
ceph-deploy mon create-initial
ceph-deploy admin ceph1-dc1 ceph2-dc1 ceph3-dc1
ceph-deploy mgr create ceph1-dc1 ceph2-dc1 ceph3-dc1
for x in b c d e f g h i j k; do ceph-deploy osd create --data /dev/sd${x}1 ceph1-dc1 ; done
for x in b c d e f g h i j k; do ceph-deploy osd create --data /dev/sd${x}1 ceph2-dc1 ; done
for x in b c d e f g h i j k; do ceph-deploy osd create --data /dev/sd${x}1 ceph3-dc1 ; done
ceph-deploy mds create ceph1-dc1 ceph2-dc1 ceph3-dc1
ceph-deploy rgw create ceph1-dc1 ceph2-dc1 ceph3-dc1
for f in 1 2 ; do scp ceph.client.admin.keyring ceph$f-dc2:/etc/ceph/dc1ceph.client.admin.keyring ; done
for f in 1 2 ; do scp ceph.conf ceph$f-dc2:/etc/ceph/dc1ceph.conf ; done
for f in 1 2 ; do ssh ceph$f-dc2 "chown ceph.ceph /etc/ceph/dc1ceph*" ; done
ceph osd pool create fs_data 512 512 replicated
rbd --cluster ceph mirror pool enable fs_data image
rbd --cluster dc2ceph mirror pool enable fs_data image
rbd --cluster ceph mirror pool peer add fs_data client.admin@dc2ceph
(generated id: b5e347b3-0515-4142-bc49-921a07636865)
rbd create fs_data/mirror_test --size=1G
rbd feature enable fs_data/mirror_test journaling
rbd mirror image enable fs_data/mirror_test
chown ceph.ceph ceph.client.admin.keyring
Cluster 2 (dc2ceph):
ceph-deploy new ceph1-dc2 ceph2-dc2
sed -i '$ s,.*,public_network = *.*.*.0/24\n,g' ceph.conf
ceph-deploy install ceph1-dc2 ceph2-dc2 --release luminous
ceph-deploy mon create-initial
ceph-deploy admin ceph1-dc2 ceph2-dc2
ceph-deploy mgr create ceph1-dc2 ceph2-dc2
for x in b c d e f g h i j k; do ceph-deploy osd create --data /dev/sd${x}1 ceph1-dc2 ; done
for x in b c d e f g h i j k; do ceph-deploy osd create --data /dev/sd${x}1 ceph2-dc2 ; done
ceph-deploy mds create ceph1-dc2 ceph2-dc2
ceph-deploy rgw create ceph1-dc2 ceph2-dc2
apt install rbd-mirror
for f in 1 2 3 ; do scp ceph.conf ceph$f-dc1:/etc/ceph/dc2ceph.conf ; done
for f in 1 2 3 ; do scp ceph.client.admin.keyring ceph$f-dc1:/etc/ceph/dc2ceph.client.admin.keyring ; done
for f in 1 2 3 ; do ssh ceph$f-dc1 "chown ceph.ceph /etc/ceph/dc2ceph*" ; done
ceph osd pool create fs_data 512 512 replicated
rbd --cluster ceph mirror pool peer add fs_data client.admin@dc1ceph
(generated id: e486c401-e24d-49bc-9800-759760822282)
systemctl enable ceph-rbd-mirror@admin
systemctl start ceph-rbd-mirror@admin
rbd --cluster dc1ceph mirror pool status fs_data --verbose
Cluster 1:
ls /etc/ceph:
ceph.client.admin.keyring
ceph.conf
dc2ceph.client.admin.keyring
dc2ceph.conf
rbdmap
tmpG36OYs
cat /etc/ceph/ceph.conf:
[global]
fsid = 8fede407-50e1-4487-8356-3dc98b30c500
mon_initial_members = ceph1-dc1, ceph2-dc1, ceph3-dc1
mon_host = *.*.*.1,*.*.*.27,*.*.*.41
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = *.*.*.0/24
cat /etc/ceph/dc2ceph.conf
[global]
fsid = 813ff410-02dc-47bd-b678-38add38495bb
mon_initial_members = ceph1-dc2, ceph2-dc2
mon_host = *.*.*.56,*.*.*.0
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = *.*.*.0/24
Cluster 2:
ls /etc/ceph:
ceph.client.admin.keyring
ceph.conf
dc1ceph.client.admin.keyring
dc1ceph.conf
rbdmap
tmp_yxkPs
cat /etc/ceph/ceph.conf
[global]
fsid = 813ff410-02dc-47bd-b678-38add38495bb
mon_initial_members = ceph1-dc2, ceph2-dc2
mon_host = *.*.*.56,*.*.*.70
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = *.*.*.0/24
cat /etc/ceph/dc1ceph.conf
[global]
fsid = 8fede407-50e1-4487-8356-3dc98b30c500
mon_initial_members = ceph1-dc1, ceph2-dc1, ceph3-dc1
mon_host = *.*.*.1,*.*.*.27,*.*.*.41
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = *.*.*.0/24
RBD Mirror daemon status:
ceph-rbd-mirror@admin.service - Ceph rbd mirror daemon
Loaded: loaded (/lib/systemd/system/ceph-rbd-mirror@.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2020-01-06 16:21:44 EST; 3s ago
Process: 910178 ExecStart=/usr/bin/rbd-mirror -f --cluster ${CLUSTER} --id admin --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
Main PID: 910178 (code=exited, status=0/SUCCESS)
Jan 06 16:21:44 ceph1-dc2 systemd[1]: Started Ceph rbd mirror daemon.
Jan 06 16:21:44 ceph1-dc2 rbd-mirror[910178]: 2020-01-06 16:21:44.462916 7f76ecf88780 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Jan 06 16:21:44 ceph1-dc2 rbd-mirror[910178]: 2020-01-06 16:21:44.462949 7f76ecf88780 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication
Jan 06 16:21:44 ceph1-dc2 rbd-mirror[910178]: failed to initialize: (2) No such file or directory2020-01-06 16:21:44.463874 7f76ecf88780 -1 rbd::mirror::Mirror: 0x558d3ce6ce20 init: error connecting to local cluster
-------------------------------------------
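From the log above it looks like rbd-mirror is being started against the local cluster (CLUSTER defaulting to "ceph") and cannot find the local admin keyring at the default path. The checks that seem relevant on ceph1-dc2 (a sketch, with paths assumed from the error message) would be:
# the unit runs with --cluster ceph, so it looks for the local cluster's keyring here
ls -l /etc/ceph/ceph.client.admin.keyring
# it must also be readable by the 'ceph' user the daemon drops privileges to
sudo -u ceph test -r /etc/ceph/ceph.client.admin.keyring && echo readable || echo NOT readable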
I also tried running the ExecStart command manually, substituting in different values for the parameters, and just never got it to work. If more info is needed, please don’t hesitate to ask. Thanks in advance!
-Miguel
Hi,
I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded pool with 5 MONs, 5 MDSs and a max_mds setting
of two. My Ceph cluster runs Nautilus; the client is Mimic and
uses the kernel module to mount the FS.
The index of filenames to copy is about 23GB and I am using 16 parallel
rsync processes over a 10G link to copy the files over to Ceph. This
works perfectly for a while, but then the MDSs start reporting oversized
caches (between 20 and 50GB, sometimes more) and an inode count between
1 and 4 million. The inode count in particular seems quite high to me.
Each rsync job has 25k files to work with, so if all 16 processes open
all their files at the same time, I should not exceed 400k. Even if I
double this number to account for the client's page cache, I should get
nowhere near that number of inodes (a sync flush takes about 1 second).
Then after a few hours, my MDSs start failing with messages like this:
-21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map
is_healthy 'MDSRank' had timed out after 15
-20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping
beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
heartbeat is not healthy!
The standby nodes try to take over, but take forever to become active
and eventually fail as well.
During my research, I found this related topic:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015959.html,
and I tried everything suggested there, from increasing to lowering my cache
size, the number of segments, etc. I also played around with the number
of active MDSs; two appears to work best, whereas one cannot keep
up with the load and three seems to be the worst of all choices.
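For reference, these are the kinds of knobs I have been adjusting (values are only examples, and the filesystem name 'cephfs' is assumed):
# cap the MDS cache memory (Nautilus-style config; 8 GiB here)
ceph config set mds mds_cache_memory_limit 8589934592
# change the number of active MDS daemons
ceph fs set cephfs max_mds 2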
Do you have any ideas how I can improve the stability of my MDS daemons
so they can handle the load properly? A single 10G link is a toy and we could
hit the cluster with many more requests per second, but it is already
buckling under 16 rsync processes.
Thanks
Happy new year to all!
Over the holidays I suffered a disk failure, but I also hit an
'inconsistent pg' error, and I would like to understand what happened.
Ceph 12.2.12, filestore.
Starting on 27/12 I got the classic disk errors:
Dec 27 20:52:21 capitanmarvel kernel: [345907.286795] ata1.00: exception Emask 0x0 SAct 0xfe00000 SErr 0x0 action 0x0
Dec 27 20:52:21 capitanmarvel kernel: [345907.286849] ata1.00: irq_stat 0x40000008
Dec 27 20:52:21 capitanmarvel kernel: [345907.286880] ata1.00: failed command: READ FPDMA QUEUED
Dec 27 20:52:21 capitanmarvel kernel: [345907.286920] ata1.00: cmd 60/00:a8:20:87:3b/04:00:00:00:00/40 tag 21 ncq dma 524288 in
Dec 27 20:52:21 capitanmarvel kernel: [345907.286920] res 41/40:00:46:8a:3b/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Dec 27 20:52:21 capitanmarvel kernel: [345907.287018] ata1.00: status: { DRDY ERR }
Dec 27 20:52:21 capitanmarvel kernel: [345907.287046] ata1.00: error: { UNC }
Dec 27 20:52:21 capitanmarvel kernel: [345907.288676] ata1.00: configured for UDMA/133
Dec 27 20:52:21 capitanmarvel kernel: [345907.288698] sd 1:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 27 20:52:21 capitanmarvel kernel: [345907.288702] sd 1:0:0:0: [sdc] tag#21 Sense Key : Medium Error [current]
Dec 27 20:52:21 capitanmarvel kernel: [345907.288705] sd 1:0:0:0: [sdc] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
Dec 27 20:52:21 capitanmarvel kernel: [345907.288708] sd 1:0:0:0: [sdc] tag#21 CDB: Read(10) 28 00 00 3b 87 20 00 04 00 00
Dec 27 20:52:21 capitanmarvel kernel: [345907.288711] print_req_error: I/O error, dev sdc, sector 3902022
but also:
Dec 27 20:52:24 capitanmarvel ceph-osd[3852]: 2019-12-27 20:52:24.714716 7f821fbfd700 -1 log_channel(cluster) log [ERR] : 4.9b missing primary copy of 4:d97871c4:::rbd_data.142b816b8b4567.0000000000012ae1:head, will try copies on 8,14
The OSD flip-flopped for some days. At the first scrub, I got:
cluster:
id: 8794c124-c2ec-4e81-8631-742992159bd6
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent
services:
mon: 5 daemons, quorum blackpanther,capitanmarvel,4,2,3
mgr: hulk(active), standbys: blackpanther, deadpool, thor, capitanmarvel
osd: 12 osds: 12 up, 12 in
data:
pools: 3 pools, 768 pgs
objects: 671.04k objects, 2.54TiB
usage: 7.62TiB used, 9.66TiB / 17.3TiB avail
pgs: 766 active+clean
1 active+clean+inconsistent
1 active+clean+scrubbing+deep
Finally the OSD died, and so I got (after automatic remapping):
cluster:
id: 8794c124-c2ec-4e81-8631-742992159bd6
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent
services:
mon: 5 daemons, quorum blackpanther,capitanmarvel,4,2,3
mgr: hulk(active), standbys: blackpanther, deadpool, thor, capitanmarvel
osd: 12 osds: 11 up, 11 in
data:
pools: 3 pools, 768 pgs
objects: 674.26k objects, 2.55TiB
usage: 7.65TiB used, 8.71TiB / 16.4TiB avail
pgs: 767 active+clean
1 active+clean+inconsistent
To fix the issue I tried reading the docs (looking for
'OSD_SCRUB_ERRORS') and found:
https://docs.ceph.com/docs/doc-12.2.0-major-changes/rados/operations/health…
but the link within it is empty:
https://docs.ceph.com/docs/doc-12.2.0-major-changes/rados/operations/pg-rep…
After fiddling a bit with Google, I found:
https://ceph.io/geen-categorie/ceph-manually-repair-object/
which allowed me to fix the issue easily with 'ceph pg repair'.
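For the record, the repair boils down to something like this (a sketch, with the pg id taken from the log above):
# identify the inconsistent pg
ceph health detail | grep inconsistent
# inspect which objects/shards are affected
rados list-inconsistent-obj 4.9b --format=json-pretty
# ask ceph to repair it
ceph pg repair 4.9b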
Two questions:
1) Is the missing 'pg-repair' page a documentation bug? Is there
something I can do about it?
2) What exactly happened?
- Why, if the OSD could no longer access the data, were the objects
not automatically relocated to another OSD? Doesn't this violate the crushmap?
- Why, when the failing OSD went out, was the inconsistent PG not
fixed automatically? I have 3 copies; were the other 2 copies not
coherent? But if so, how was Ceph able to fix them?
Sorry... and thanks. ;)
--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797
Please file a tracker ticket with the symptoms and examples. Please attach your
OSDMap (ceph osd getmap > osdmap.bin).
Note that https://github.com/ceph/ceph/pull/31956 has the Nautilus
version of improved upmap code. It also changes osdmaptool to match the
mgr behavior, so that one can observe the behavior of the upmap balancer
offline.
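For example, something along these lines should let you replay the balancer's decisions offline (a rough sketch; exact osdmaptool flags may vary by version):
# grab the current osdmap
ceph osd getmap -o osdmap.bin
# compute the upmaps the mgr balancer would generate for a given pool
osdmaptool osdmap.bin --upmap upmaps.sh --upmap-pool <pool> --upmap-max 100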
Thanks
David
On 12/8/19 11:04 AM, Philippe D'Anjou wrote:
> It's only getting worse after raising PGs now.
>
> Anything between:
> 96 hdd 9.09470 1.00000 9.1 TiB 4.9 TiB 4.9 TiB 97 KiB 13 GiB 4.2 TiB 53.62 0.76 54 up
>
> and
>
> 89 hdd 9.09470 1.00000 9.1 TiB 8.1 TiB 8.1 TiB 88 KiB 21 GiB 1001 GiB 89.25 1.27 87 up
>
> How is that possible? I don't know how much more proof I need to
> present that there's a bug.
Hi,
in this blog post <https://ceph.io/community/the-first-telemetry-results-are-in/>
I found the following statement:
"So, in our ideal world so far (assuming equal size OSDs), every OSD now
has the same number of PGs assigned."
My issue is that across all pools the number of PGs per OSD is not equal,
and I conclude that this is causing very unbalanced data placement.
As a matter of fact, the usage of the 1.6TB HDDs backing the specific pool
"hdb_backup" spans a range starting with
osd.228 size: 1.6 usage: 52.61 reweight: 1.00000
and ending with
osd.145 size: 1.6 usage: 81.11 reweight: 1.00000
This heavily impacts the amount of data that can be stored in the cluster.
The Ceph balancer is enabled, but it is not solving this issue.
root@ld3955:~# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
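One thing that may matter here: as far as I understand, the upmap mode only works when all clients are Luminous or newer, so a quick sanity check (a sketch) would be:
# verify the required client compat level in the osdmap
ceph osd dump | grep min_compat_client
# if it is not at least luminous:
ceph osd set-require-min-compat-client luminous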
Therefore I would like to ask for suggestions on how to address this
unbalanced data distribution.
I have attached pastebin for
- ceph osd df sorted by usage <https://pastebin.com/QLQHjA9g>
- ceph osd df tree <https://pastebin.com/SvhP2hp5>
My cluster has multiple crush roots representing different disk types.
In addition I have defined multiple pools, one pool for each disk type:
hdd, ssd, nvme.
THX
On Sun, Dec 29, 2019 at 9:21 PM zhengyin(a)cmss.chinamobile.com
<zhengyin(a)cmss.chinamobile.com> wrote:
>
> Hello dillaman:
>
> Problem: I create a clone from a parent image and then create a snapshot on the clone. When I use the command "rbd export-diff <pool>/clone@snap <file> --whole-object", it does not include the parent image data in the diff. But when I use the command without "--whole-object", it works fine.
>
> steps:
> 1、rbd create volumes/test1 -s 1G
>
> 2、write data to volumes/test1 (offset=0, len=8388608)
>
> 3、rbd snap create volumes/test1@snap
>
> 4、rbd snap protect volumes/test1@snap
>
> 5、rbd clone volumes/test1@snap volumes/clone1
>
> 6、write data to volumes/clone1 (offset=16777216, len=4194304)
>
> 7、rbd snap create volumes/clone1@snap1
>
> 8、rbd export-diff volumes/clone1@snap1 /root/diff1 --whole-object
> It only diffs data [16777216L, 20971520L]
>
> 9、rbd export-diff volumes/clone1@snap1 /root/diff2
> It is OK and diffs data [0L, 8388608L], [16777216L, 20971520L]
>
> If you can confirm that this is a bug, or if you fix it, please let me know. Thank you very much.
Is this the same as this ticket [1] that was recently fixed in master
and is pending backport?
>
> ________________________________
> zhengyin(a)cmss.chinamobile.com
[1] https://tracker.ceph.com/issues/42248
--
Jason
In an exploration of trying to speed up the long tail of backfills resulting
from marking a failing OSD out, I began looking at my PGs to see if I could
tune some settings, and noticed the following:
Scenario: on a 12.2.12 cluster, I am alerted of an inconsistent PG and of
SMART failures on that OSD. I inspect the PG and notice it has a read_error
from the SMART-failing OSD.
Steps I take: set the primary affinity of the failing OSD to 0 (the thought
process being that I don't want a failing drive to be responsible for
backfilling data), wait for peering to complete, then mark the OSD out. At
this point backfill begins.
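For reference, the commands for that sequence are roughly the following (osd 123 is just a placeholder id):
# stop the failing drive from acting as primary for its PGs
ceph osd primary-affinity osd.123 0
# after peering settles, mark it out to trigger backfill
ceph osd out 123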
90% of the PGs complete backfill very quickly. Towards the tail end of the
backfill I have 20 PGs or so in backfill_wait and 1 backfilling (presumably
because osd_max_backfills = 1).
I do a `ceph pg ls backfill_wait` and notice that for 100% of these tail-end
PGs, all OSDs in the up_set are different from those in the acting_set, and
the acting_primary is the OSD that was set with primary affinity 0 and
marked out.
My questions are the following:
- Upon learning a disk has failed SMART and has an inconsistent PG, I want
to prevent its potentially corrupt data from being replicated out to other
OSDs, even for PGs which may not have been discovered to be inconsistent
yet, so I set its primary affinity to 0. At this step shouldn't the
acting_primary be another OSD from the acting_set, and shouldn't backfill be
copied out of a different OSD?
- Should I additionally be marking the OSD as down? That would cause the
PGs to go degraded until backfill finishes, but it would presumably finish
faster, as more OSDs would become the acting_primary and I wouldn't be
throttled by osd_max_backfills. My thought here is that it's best to avoid
degraded PGs, as I do not want to drop below min_size.
I recognize some of these things may be different in Nautilus, but I am
waiting on the 14.2.6 release as I am aware of some bugs I do not want to
contend with. Thanks.
Respectfully,
*Wes Dillingham*
wes(a)wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>