Hi all,
I'm a rookie with Ceph and want to ask two questions. The first is about the maturity of CephFS, Ceph's file system, and whether it is recommended for production environments. The second is about the maturity of Ceph's erasure coding and whether it can be used in production. Are these two topics covered in the official documentation? I may have missed where they are.
Thank you!
Fred Fan
Hello everyone,
we have updated one of our clusters from 15.2.9 to 15.2.10 and cannot
access the dashboard any more.
The dashboard is behind an nginx reverse proxy, which proxy_passes
https://some.admin.hostname/ceph (note the /ceph path!) to the actual
mgr daemon in the (not so) public network of the ceph cluster.
In 15.2.9, <base href="https://some.admin.hostname/ceph"> was set
via JS.
This has apparently changed in 15.2.10 -- see PR
https://github.com/ceph/ceph/pull/39372 -- leaving a plain <base
href="/"> in the HTML header.
Browsers (tested with Firefox and Chromium on Debian Buster) now try to
load further JS for dashboard functionality from
https://some.admin.hostname/ *without* the /ceph path, which obviously
fails.
Is there any way to configure the path in the <base href=""> tag, e.g.
some "ceph dashboard" cli option?
The ceph cluster runs non-containerized on Ubuntu 20.04, using the
packages from download.ceph.com, and was deployed using ceph-ansible.
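In case it is relevant: the dashboard documentation describes a url_prefix
setting for serving the dashboard under a sub-path behind a reverse proxy.
I don't know whether it also controls the <base href> that 15.2.10 now
emits, but for reference this is the setting I mean:
```
# hedged sketch: url_prefix is the documented knob for a sub-path setup;
# "ceph" matches our /ceph location, and the module restart may not be needed
ceph config set mgr mgr/dashboard/url_prefix ceph
ceph mgr module disable dashboard
ceph mgr module enable dashboard
```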
Best,
Christoph
--
Dr. Christoph Brüning
Universität Würzburg
HPC & DataManagement @ ct.qmat & RZUW
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499
Hi all!
We run a 1.5 PB cluster with 12 hosts and 192 OSDs (a mix of NVMe and HDD) and need to improve our failure domain by altering the CRUSH rules and moving racks into pods, which would imply a lot of data movement.
I wonder what the preferred order of operations would be when making such changes to the CRUSH map and pools. Will data movement be minimized by moving all racks into pods at once and then changing the pool replication rules, or is it better to first move the racks into pods one by one and then change the pool replication rules from rack to pod? Either way, I guess it's good practice to set 'norebalance' before moving hosts and unset it to start the actual data movement?
Right now we have the following setup:
root -> rack2 -> ups1 + node51 + node57 + switch21
root -> rack3 -> ups2 + node52 + node58 + switch22
root -> rack4 -> ups3 + node53 + node59 + switch23
root -> rack5 -> ups4 + node54 + node60 -- switch 21 ^^
root -> rack6 -> ups5 + node55 + node61 -- switch 22 ^^
root -> rack7 -> ups6 + node56 + node62 -- switch 23 ^^
Note that racks 5-7 are connected to the same ToR switches as racks 2-4. The cluster and frontend networks are in different VXLANs connected with dual 40GbE. The failure domain for our 3x replicated pools is currently rack, and after adding hosts 57-62 we realized that if one of the switches reboots or fails, replicated PGs located only on those 4 hosts will become unavailable and force pools offline. I guess the best way to fix this would instead be to organize the racks into pods like this:
root -> pod1 -> rack2 -> ups1 + node51 + node57
root -> pod1 -> rack5 -> ups4 + node54 + node60 -> switch21
root -> pod2 -> rack3 -> ups2 + node52 + node58
root -> pod2 -> rack6 -> ups5 + node55 + node61 -> switch 22
root -> pod3 -> rack4 -> ups3 + node53 + node59
root -> pod3 -> rack7 -> ups6 + node56 + node62 -> switch 23
The reason for this arrangement is that we plan to place the pods in different buildings in the future. We're running Nautilus 14.2.16 and are about to upgrade to Octopus. Should we upgrade to Octopus before making the CRUSH changes?
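Concretely, the sequence I had in mind is something like the following
(untested sketch; it assumes our root bucket is named 'default' and that
the stock CRUSH type 'pod' is still present in the map):
```
# untested sketch -- assumes root bucket 'default' and CRUSH type 'pod'
ceph osd set norebalance

# create the pod buckets and hang them under the root
ceph osd crush add-bucket pod1 pod
ceph osd crush move pod1 root=default
# ...repeat for pod2 and pod3

# move the racks into their pods, per the layout above
ceph osd crush move rack2 pod=pod1
ceph osd crush move rack5 pod=pod1
# ...repeat for the remaining racks

# new replicated rule with pod as the failure domain, then point the pools at it
ceph osd crush rule create-replicated replicated_pod default pod
ceph osd pool set <pool> crush_rule replicated_pod

ceph osd unset norebalance
```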
Any thoughts or insight on how to achieve this with minimal data movement and risk of cluster downtime would be welcome!
--thomas
--
Thomas Hukkelberg
thomas(a)hovedkvarteret.no
Hi Patrick,
Any updates? Looking forward to your reply :D
On Thu, Dec 17, 2020 at 11:39 AM Patrick Donnelly <pdonnell(a)redhat.com> wrote:
>
> On Wed, Dec 16, 2020 at 5:46 PM Alex Taylor <alexu4993(a)gmail.com> wrote:
> >
> > Hi Cephers,
> >
> > I'm using VSCode remote development with a Docker server. It worked OK
> > but fails to start the debugger after /root is mounted by ceph-fuse. The
> > log shows that the binary passes the access X_OK check but cannot
> > actually be executed; see:
> >
> > ```
> > strace_log: access("/root/.vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7",
> > X_OK) = 0
> >
> > root@develop:~# ls -alh
> > .vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7
> > -rw-r--r-- 1 root root 978 Dec 10 13:06
> > .vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7
> > ```
> >
> > I also tested the access syscall on ext4, xfs and even the cephfs kernel
> > client; all of them return -EACCES, which is expected (the extension
> > will then explicitly call chmod +x).
> >
> > After some digging in the code, I found it is probably caused by
> > https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L5549-L5550.
> > So here come two questions:
> > 1. Is this a bug or is there any concern I missed?
>
> I tried reproducing it with the master branch and could not. It might
> be due to an older fuse/ceph. I suggest you upgrade!
>
I tried master (332a188d9b3c4eb5c5ad2720b7299913c5a772ee) as well,
and the issue still exists. My test program is:
```
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int r;
    const char path[] = "test";

    r = access(path, F_OK);   /* does the file exist? */
    printf("file exists: %d\n", r);

    r = access(path, X_OK);   /* is it executable? */
    printf("file executable: %d\n", r);

    return 0;
}
```
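I compile and run it like this (the source file name is just what I
happened to save the program as):
```
# build the test program above; access_test.c is just my local file name
gcc -o a.out access_test.c
./a.out
```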
And the test result:
```
# local filesystem: ext4
root@f626800a6e85:~# ls -l test
-rw-r--r-- 1 root root 6 Dec 19 06:13 test
root@f626800a6e85:~# ./a.out
file exists: 0
file executable: -1
root@f626800a6e85:~# findmnt -t fuse.ceph-fuse
TARGET SOURCE FSTYPE OPTIONS
/root/mnt ceph-fuse fuse.ceph-fuse
rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other
root@f626800a6e85:~# cd mnt
# ceph-fuse
root@f626800a6e85:~/mnt# ls -l test
-rw-r--r-- 1 root root 6 Dec 19 06:10 test
root@f626800a6e85:~/mnt# ./a.out
file exists: 0
file executable: 0
root@f626800a6e85:~/mnt# ./test
bash: ./test: Permission denied
```
Again, ceph-fuse says the file `test` is executable, but in fact it
can't be executed.
The kernel version I'm testing on is:
```
root@f626800a6e85:~/mnt# uname -ar
Linux f626800a6e85 4.9.0-7-amd64 #1 SMP Debian 4.9.110-1 (2018-07-05)
x86_64 GNU/Linux
```
Please try the program above and make sure you're running it as the root
user, thank you. And if it still fails to reproduce, please let me know
your kernel version.
> > 2. It works again with fuse_default_permissions=true, any drawbacks if
> > this option is set?
>
> Correctness (ironically, for you) and performance.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
We just upgraded our cluster from Luminous to Nautilus, and after a few
days one of our MDS servers is logging:
2021-03-28 18:06:32.304 7f57c37ff700 5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 16
2021-03-28 18:06:32.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s
2021-03-28 18:06:32.308 7f57c8809700 5 mds.beacon.sun-gcs01-mds02
received beacon reply up:standby seq 16 rtt 0.00400001
2021-03-28 18:06:36.308 7f57c37ff700 5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 17
2021-03-28 18:06:36.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s
2021-03-28 18:06:36.308 7f57c8809700 5 mds.beacon.sun-gcs01-mds02
received beacon reply up:standby seq 17 rtt 0
2021-03-28 18:06:37.788 7f57c900a700 0 auth: could not find secret_id=34586
2021-03-28 18:06:37.788 7f57c900a700 0 cephx: verify_authorizer could
not get service secret for service mds secret_id=34586
2021-03-28 18:06:37.788 7f57c6004700 5 mds.sun-gcs01-mds02
ms_handle_reset on v2:10.65.101.13:46566/0
2021-03-28 18:06:40.308 7f57c37ff700 5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 18
2021-03-28 18:06:40.308 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s
2021-03-28 18:06:40.308 7f57c8809700 5 mds.beacon.sun-gcs01-mds02
received beacon reply up:standby seq 18 rtt 0
2021-03-28 18:06:44.304 7f57c37ff700 5 mds.beacon.sun-gcs01-mds02
Sending beacon up:standby seq 19
2021-03-28 18:06:44.304 7f57c37ff700 20 mds.beacon.sun-gcs01-mds02
sender thread waiting interval 4s
I've tried removing the /var/lib/ceph/mds/ directory and fetching the
key again. I've removed the key and generated a new one, and I've checked
the clocks between all the nodes. From what I can tell, everything is
good.
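For reference, the checks above boil down to roughly the following
(the MDS name is taken from the log; the exact commands are just a sketch):
```
# sketch of the checks described above; mds name is from the log
ceph auth get mds.sun-gcs01-mds02   # compare against the keyring under /var/lib/ceph/mds/
ceph time-sync-status               # clock skew between the mons
ceph status                         # overall health / quorum
```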
We did have an issue where the monitor cluster fell over and would not
boot. We reduced the monitors to a single monitor, disabled cephx,
pulled it off the network and restarted the service a few times, which
allowed it to come up. We then expanded back to three mons and
re-enabled cephx, and everything had been good until this. No other
services seem to be affected, and the MDS even appears to work okay
despite these messages. We would still like to figure out how to
resolve it.
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Hello,
following up on my mail from 2020 [0]: it seems that OSDs sometimes have
"multiple classes" assigned:
[15:47:15] server6.place6:/var/lib/ceph/osd/ceph-4# ceph osd crush rm-device-class osd.4
done removing class of osd(s): 4
[15:47:17] server6.place6:/var/lib/ceph/osd/ceph-4# ceph osd crush rm-device-class osd.4
osd.4 belongs to no class,
[15:47:20] server6.place6:/var/lib/ceph/osd/ceph-4# ceph osd crush set-device-class xruk osd.4
set osd(s) 4 to class 'xruk'
[15:47:45] server6.place6:/var/lib/ceph/osd/ceph-4# ceph osd crush set-device-class xruk osd.4
osd.4 already set to class xruk. set-device-class item id 4 name 'osd.4' device_class 'xruk': no change.
[15:47:47] server6.place6:/var/lib/ceph/osd/ceph-4# /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph
2021-03-22 15:48:02.773 7fe2f81e4d80 -1 osd.4 94608 log_to_monitors {default=true}
2021-03-22 15:48:02.777 7fe2f81e4d80 -1 osd.4 94608 mon_cmd_maybe_osd_create fail: 'osd.4 has already bound to class 'xruk', can not reset class to 'hdd'; use 'ceph osd crush rm-device-class <id>' to remove old class first': (16) Device or resource busy
[15:48:02] server6.place6:/var/lib/ceph/osd/ceph-4#
[15:48:02] server6.place6:/var/lib/ceph/osd/ceph-4#
We are running ceph 14.2.9.
As written before, it also seems that the affected OSD is peering with
OSDs from the wrong class (hdd). Does anyone have a hint on how to fix
this?
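For reference, this is how the class assignment can be cross-checked from
the monitors' side (sketch; 'xruk' is our device class from the transcript,
and the last command is only a guess at what is relevant here):
```
# sketch -- 'xruk' is our custom class from the transcript above
ceph osd crush class ls
ceph osd crush class ls-osd xruk
ceph osd crush tree --show-shadow
# guess: this option makes the OSD try to (re)set its class to 'hdd' at startup
ceph daemon osd.4 config get osd_class_update_on_start
```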
Best regards,
Nico
[0]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/SFJVJI5XUD7…
--
Sustainable and modern Infrastructures by ungleich.ch
Hello,
I run a Ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we
lost two disks, so two OSDs (67, 90) are down. The two disks are on two
different hosts. A third OSD on a third host reports slow ops. Ceph is
repairing at the moment.
Affected pools include, for example, these:
pool 35 'pxa-rbd' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 256 pgp_num 256 last_change 192082 lfor
0/27841/27845 flags hashpspool,selfmanaged_snaps stripe_width 0
pg_num_min 128 target_size_ratio 0.0001 application rbd
pool 36 'pxa-ec' erasure size 6 min_size 5 crush_rule 7 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 192177 lfor 0/172580/172578
flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
pg_num_min 512 target_size_ratio 0.15 application rbd
At the moment the Proxmox cluster using storage from the separate Ceph
cluster hangs. The pools with data are erasure coded with the following
profile:
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
What I do not understand is why access from the virtualization side
seems to block. Could the min_size of the pools be causing this
behaviour? How can I find out whether this is the case, or what else is
causing the blocking behaviour I see?
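What I plan to check next is roughly the following (command sketch; the
pool name is from above, the PG id would come from the health output):
```
# sketch of the next checks; pool name from above, pg id from 'ceph health detail'
ceph osd pool get pxa-ec min_size   # the pool dump above shows 5, i.e. k+1
ceph health detail
ceph pg ls incomplete
ceph pg <pgid> query                # why is the pg incomplete / which OSDs is it waiting for?
```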
This is the current status:
health: HEALTH_WARN
Reduced data availability: 1 pg inactive, 1 pg incomplete
Degraded data redundancy: 42384/130014984 objects degraded
(0.033%), 4 pgs degraded, 5 pgs undersized
15 daemons have recently crashed
150 slow ops, oldest one blocked for 15901 sec, daemons
[osd.60,osd.67] have slow ops.
services:
mon: 3 daemons, quorum ceph2,ceph5,ceph8 (age 4h)
mgr: ceph2(active, since 7w), standbys: ceph5, ceph8, ceph-admin
mds: cephfsrz:1 {0=ceph6=up:active} 2 up:standby
osd: 144 osds: 142 up (since 4h), 142 in (since 5h); 6 remapped pgs
task status:
scrub status:
mds.ceph6: idle
data:
pools: 15 pools, 2632 pgs
objects: 21.70M objects, 80 TiB
usage: 139 TiB used, 378 TiB / 517 TiB avail
pgs: 0.038% pgs not active
42384/130014984 objects degraded (0.033%)
2623 active+clean
3 active+undersized+degraded+remapped+backfilling
3 active+clean+scrubbing+deep
1 active+undersized+degraded+remapped+backfill_wait
1 active+undersized+remapped+backfill_wait
1 remapped+incomplete
io:
client: 2.2 MiB/s rd, 3.6 MiB/s wr, 8 op/s rd, 179 op/s wr
recovery: 51 MiB/s, 12 objects/s
Thanks a lot
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
1001312