Hello,
After upgrading our Ceph cluster to Octopus a few days ago, we are seeing VMs
crash with the error below. We are using Ceph with OpenStack (Rocky).
Everything runs Ubuntu 18.04 with kernel 5.3. We see these crashes on
busy VMs. The cluster was upgraded from Nautilus.
kernel: [430751.176904] fn-radosclient[3905]: segfault at da0801 ip 00007fe78e076686 sp 00007fe7697f9470 error 4 in librbd.so.1.12.0[7fe78de73000+5cb000]
Apr 6 03:26:00 compute6 kernel: [430751.176922] Code: 00 64 48 8b 04 25 28 00 00 00 48 89 44 24 18 31 c0 48 85 db 0f 84 fa 00 00 00 80 bf 38 01 00 00 00 48 89 fd 0f 84 ea 00 00 00 <83> bb 20 3f 00 00 ff 0f 84 dd 00 00 00 48 8b 83 18 3f 00 00 48 8d
Apr 6 03:26:11 compute6 libvirtd[1671]: 2020-04-06 03:26:11.955+0000: 1671: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor
Hi there,
I have a fairly simple Ceph multisite configuration with two Ceph clusters
in two different datacenters in the same city.
The rgws have this config for ssl:
rgw_frontends = civetweb port=7480+443s
ssl_certificate=/opt/ssl/ceph-bundle.pem
The certificate is a real issued certificate, not self-signed.
I configured the multisite with the guide from
https://docs.ceph.com/docs/nautilus/radosgw/multisite/
More or less ok so far, some learning curve but that's ok
I can access and upload to buckets at both endpoints with an S3 client
using HTTPS - https://ceph01cs1.domain.com and
https://ceph01cs2.domain.com - all good.
Now the problem seems to be when my zones in the zonegroup use https
endpoints, e.g.
{
    "id": "4c6774fb-01eb-41fe-a74a-c2693f8e69fc",
    "name": "eu",
    "api_name": "eu",
    "is_master": "true",
    "endpoints": [
        "https://ceph01cs1.domain.com:443"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "0c203df2-6f31-4ad1-a899-91f85bf34c4e",
    "zones": [
        {
            "id": "0c203df2-6f31-4ad1-a899-91f85bf34c4e",
            "name": "ceph01cs1",
            "endpoints": [
                "https://ceph01cs1.domain.com:443"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "fec1fec8-a3c1-454d-8ed2-2c1da45f9c33",
            "name": "ceph01cs2",
            "endpoints": [
                "https://ceph01cs2.domain.com:443"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "08921dd5-1523-41b6-908f-2f58aa38c969"
}
Metadata syncs OK - buckets and users get created, but data doesn't - and
the period can be committed and appears on both clusters.
I can also curl between the two clusters over 443.
However, data sync gets stuck on 'init':
          realm 08921dd5-1523-41b6-908f-2f58aa38c969 (world)
      zonegroup 4c6774fb-01eb-41fe-a74a-c2693f8e69fc (eu)
           zone 0c203df2-6f31-4ad1-a899-91f85bf34c4e (ceph01cs2)
  metadata sync no sync (zone is master)
      data sync source: fec1fec8-a3c1-454d-8ed2-2c1da45f9c33 (ceph01cs1)
                        init
                        full sync: 128/128 shards
                        full sync: 0 buckets to sync
                        incremental sync: 0/128 shards
                        data is behind on 128 shards
                        behind shards:
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]
I find errors like:
2020-03-31 20:27:11.372 7f60c84e1700 0 RGW-SYNC:data:sync: ERROR: failed to init sync, retcode=-16
2020-03-31 20:27:29.548 7f60c84e1700 0 RGW-SYNC:data:sync:init_data_sync_status: ERROR: failed to read remote data log shards
2020-03-31 20:29:48.499 7f60c94e3700 0 RGW-SYNC:meta: ERROR: failed to fetch all metadata keys
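In case it helps, the commands I know of for digging further (standard
radosgw-admin commands, listed here only as a sketch; the source zone name
is from my setup above) are:
radosgw-admin sync error list
radosgw-admin data sync status --source-zone=ceph01cs1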
If I change the endpoints in the zonegroup to plain http, e.g.
http://ceph01cs1.domain.com:7480 and http://ceph01cs2.domain.com:7480
then sync starts!
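For reference, the way I switch the endpoints is the usual zonegroup
round-trip (sketched below; the file name is just an example):
radosgw-admin zonegroup get > zonegroup.json
(edit the "endpoints" entries in zonegroup.json, https <-> http)
radosgw-admin zonegroup set --infile zonegroup.json
radosgw-admin period update --commit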
So my question (I couldn't find any examples of people using HTTPS
endpoints for sync): are HTTPS endpoints supported with multisite? And why
would metadata sync work over HTTPS but not data sync?
Many thanks
Richard
Hi,
I want to try using object storage with Java.
Is it possible to set up OSDs with "only" directories as the data destination
(using cephadm), instead of whole disks? I have read through much of the
documentation but couldn't find how to do it (if it's possible at all).
Thanks
Michael
Hello,
I am seeing some commands running on CephFS mounts get stuck in uninterruptible sleep, at which point I can only terminate them by rebooting the client. Has anyone experienced anything similar and found a way to safeguard against this?
My mount is using the ceph kernel driver, with the following config in fstab: 10.225.44.236,10.225.44.237,10.225.44.238:6789:/albacore/system/deploy on /opt/dcl/deploy type ceph (rw,noatime,name=albacore,secret=<hidden>,acl,wsize=32768,rsize=32768,_netdev)
The vast majority of commands complete successfully on the mounted filesystem, but on one occasion a "chmod -R +r *" command hung indefinitely (despite having run successfully numerous times before). Attempts to terminate the process with `kill` fail. Repeated attempts to run the same command also get blocked in the same state. `ps` shows the processes are stuck in uninterruptible sleep:
[root@svr01 albacore] ~> ps -Al | grep chmod
4 D 0 18657 18656 0 80 0 - 26998 rwsem_ pts/2 00:00:00 chmod
4 D 0 21835 1 0 80 0 - 26998 rwsem_ ? 00:00:00 chmod
Ceph seems to be unaware of the hung processes. There are no slow ops / ops in flight, either in the dump_ops_in_flight output on the server or under /sys/kernel/debug/ceph/ on the client. Similarly, there are no logs in dmesg for the command / process. Ceph health reports no MDS issues, and there's nothing in my MDS logs from when the processes hung.
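The only other things I can think of checking are the kernel stack of a blocked task and the kernel client's in-flight requests, roughly like this (assuming root access and that debugfs is mounted; 18657 is one of the PIDs above):
cat /proc/18657/stack
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/osdc
but as mentioned, the debugfs files show nothing outstanding.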
The only method I've found of clearing the processes is to reboot my client.
Has anyone got experience with this? Are there ceph mount options that would guard against this?
Some details of the current setup:
• ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
• We're using the ceph kernel driver, kernel: 5.5.7-1.el7.elrepo.x86_64
• The client server has 38 separate directories mounted, all from the same CephFS filesystem.
• All 38 directories are mounted with the same config by three separate clients.
• Mount config (in fstab): 10.225.44.236,10.225.44.237,10.225.44.238:6789:/albacore/system/deploy on /opt/dcl/deploy type ceph (rw,noatime,name=albacore,secret=<hidden>,acl,wsize=32768,rsize=32768,_netdev)
Kind regards,
Dave
Hi:
I use `rbd map` to map a Ceph block device on the local host, but I found that I cannot control the device name.
This is a problem: when the name changes, the filesystem on the device, or a database using it, may break.
Can I control the device name?
For example:
[root@gate2 ~]# rbd showmapped
id pool namespace image snap device
0 testpool test_img - /dev/rbd0
Can I map the device and directly give the device name, like this:
rbd map testpool/test_img /dev/rbd0
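(As far as I understand, the rbd udev rules also create a stable per-image
symlink, something like this for my image above:
/dev/rbd/testpool/test_img -> /dev/rbd0
which could be used instead of the raw name, but I would still like to know
whether the device name itself can be chosen.)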
sz_cuitao(a)163.com
Hi all,
I have a Ceph cluster with ~70 OSDs of different sizes running on Mimic.
I'm using ceph-deploy for managing the cluster.
I have to remove some smaller drives and replace them with bigger drives.
From your experience, are the "removing an OSD" guidelines in the Mimic docs
accurate? I know that there were some changes from older versions and I want
to avoid any confusion.
I'm talking about the following procedure (a rough sketch follows below):
- take the OSD out of the cluster with "ceph osd out <osd_no>"
- stop the OSD daemon if it's still running
- purge the OSD from the cluster map, running the following from the
ceph-deploy host (or one of the mons?):
ceph osd purge {id} --yes-i-really-mean-it
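As a rough end-to-end sketch of what I have in mind (assuming I don't need to
reuse the OSD IDs; <osd_no> is the numeric ID):
ceph osd out <osd_no>
(wait for rebalancing to finish, e.g. watch "ceph -s" and check
"ceph osd safe-to-destroy <osd_no>")
systemctl stop ceph-osd@<osd_no>    (on the OSD host)
ceph osd purge <osd_no> --yes-i-really-mean-it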
I don't have any specific entries for these OSDs in ceph.conf so I guess
I shouldn't change any conf file.
Thank you.
This is the eighth update to the Ceph Nautilus release series. This release
fixes issues across a range of subsystems. We recommend that all users upgrade
to this release. Please note the following important changes in this
release; as always the full changelog is posted at:
https://ceph.io/releases/v14-2-8-nautilus-released
Notable Changes
---------------
* The default value of `bluestore_min_alloc_size_ssd` has been changed
to 4K to improve performance across all workloads.
* The following OSD memory config options related to bluestore cache autotuning can now
be configured during runtime:
- osd_memory_base (default: 768 MB)
- osd_memory_cache_min (default: 128 MB)
- osd_memory_expected_fragmentation (default: 0.15)
- osd_memory_target (default: 4 GB)
The above options can be set with::
ceph config set osd <option> <value>
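For example, setting a 6 GiB memory target on all OSDs at runtime (the value
is in bytes and is shown purely as an illustration)::
ceph config set osd osd_memory_target 6442450944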
* The MGR now accepts `profile rbd` and `profile rbd-read-only` user caps.
These caps can be used to provide users access to MGR-based RBD functionality
such as `rbd perf image iostat` and `rbd perf image iotop`.
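For example, a monitoring user could be created with these caps as follows
(the client name and pool are illustrative)::
ceph auth get-or-create client.rbd-monitor mon 'profile rbd' mgr 'profile rbd' osd 'profile rbd pool=rbd'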
* The configuration value `osd_calc_pg_upmaps_max_stddev` used for upmap
balancing has been removed. Instead use the mgr balancer config
`upmap_max_deviation` which now is an integer number of PGs of deviation
from the target PGs per OSD. This can be set with a command like
`ceph config set mgr mgr/balancer/upmap_max_deviation 2`. The default
`upmap_max_deviation` is 1. There are situations where crush rules
would not allow a pool to ever have completely balanced PGs. For example, if
crush requires 1 replica on each of 3 racks, but there are fewer OSDs in 1 of
the racks. In those cases, the configuration value can be increased.
* RGW: a mismatch between the bucket notification documentation and the actual
message format was fixed. This means that any endpoints receiving bucket
notifications will now receive the same notifications inside a JSON array
named 'Records'. Note that this does not affect pulling bucket notifications
from a subscription in a 'pubsub' zone, as these are already wrapped inside
that array.
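Illustratively, the payload delivered to an endpoint is now wrapped like this
(the inner event fields are elided)::
{
    "Records": [
        { ... one event object per notification ... }
    ]
}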
* CephFS: multiple active MDS forward scrub is now rejected. Scrub currently
only is permitted on a file system with a single rank. Reduce the ranks to one
via `ceph fs set <fs_name> max_mds 1`.
* Ceph now refuses to create a file system with a default EC data pool. For
further explanation, see:
https://docs.ceph.com/docs/nautilus/cephfs/createfs/#creating-pools
* Ceph will now issue a health warning if a RADOS pool has a `pg_num`
value that is not a power of two. This can be fixed by adjusting
the pool to a nearby power of two::
ceph osd pool set <pool-name> pg_num <new-pg-num>
Alternatively, the warning can be silenced with::
ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.8.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 2d095e947a02261ce61424021bb43bd3022d35cb
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
GF: Felix Imendörffer HRB 21284 (AG Nürnberg)