Hi,
for test purposes, I have set up two 100 GB OSDs, one
holding the data pool and the other the metadata pool for CephFS.
I am running 14.2.6-1-gffd69200ad-1 with packages from
https://mirror.croit.io/debian-nautilus
I am then running a program that creates a lot of 1 MiB files by calling
fopen()
fwrite()
fclose()
for each of them. Error codes are checked.
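For reference, the same kind of test can be done from the shell; this is just a
sketch, with the mount point /mnt/cephfs assumed:

    # Write one 1 MiB random file; conv=fsync makes dd call fsync() and return
    # a non-zero exit code if the flush to the OSDs fails, which a plain
    # buffered write does not necessarily report.
    dd if=/dev/urandom of=/mnt/cephfs/testfile bs=1M count=1 conv=fsync \
        && sha1sum /mnt/cephfs/testfile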
This works successfully for ~100 GB of data, and then, strangely, keeps
succeeding for several hundred GB more... ??
All written files show a size of 1 MiB with 'ls', and thus should contain the
data written. However, on inspection, the files written after the first
~100 GiB are full of just zeros (hexdump -C).
To test this further, I used the standard tool 'cp' to copy a few
random-content files into the full CephFS filesystem. cp reported no
complaints, and right after the copy operations the content was visible with
hexdump -C. However, after forcing the data out of the client cache by reading
other, earlier-created files, hexdump -C shows all-zero content for the files
copied with 'cp'. Data that was there is suddenly gone...?
I am new to ceph. Is there an option I have missed to avoid this behaviour?
(I could not find one in
https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
Is this behaviour related to
https://docs.ceph.com/docs/mimic/cephfs/full/
?
(That page states 'sometime after a write call has already returned 0'. But if
write returns 0, then no data has been written, so the user program would not
assume any kind of success.)
Best regards,
Håkan
I would like to (in this order)
- set the data pool for the root "/" of a ceph-fs to a custom value, say "P" (not the initial data pool used in fs new)
- create a sub-directory of "/", for example "/a"
- mount the sub-directory "/a" with a client key with access restricted to "/a"
The client will not be able to see the dir layout attribute set on "/", since "/" itself is not mounted.
Will the data of this client still go to pool "P"? That is, does "/a" inherit the dir layout transparently to the client when following the steps above?
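For concreteness, the steps I have in mind would look roughly like this (fs name
"cephfs", client name "client.a", monitor name and mount points are just
placeholders):

    # admin node: make pool P available to the fs and set it as the layout of "/"
    ceph fs add_data_pool cephfs P
    setfattr -n ceph.dir.layout.pool -v P /mnt/cephfs

    # create the sub-directory and a key restricted to it
    mkdir /mnt/cephfs/a
    ceph fs authorize cephfs client.a /a rw

    # client: mount only /a with that restricted key
    mount -t ceph mon1:/a /mnt/a -o name=a,secretfile=/etc/ceph/client.a.secret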
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Turns out it is probably orphans.
We are running ceph luminous : 12.2.12
The orphans find has been stuck in the "iterate_bucket_index" stage on shard "0" for 2 days now.
Is anyone else facing this issue?
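For reference, the scan in question is the stock orphans search, started with
something like the following (pool name and job id below are placeholders):

    radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphan-scan-1
    # known jobs can be listed with
    radosgw-admin orphans list-jobs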
Regards,
From: ceph-users <ceph-users-bounces@lists.ceph.com>
Sent: 21 January 2020 10:10
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Understand ceph df details
Hi everyone,
I'm trying to understand the difference between the output of this command:
ceph df details
and the result I get when I run this script:
total_bytes=0
while read user; do
    echo $user
    bytes=$(radosgw-admin user stats --uid=${user} | grep total_bytes_rounded | tr -dc "0-9")
    if [ ! -z ${bytes} ]; then
        total_bytes=$((total_bytes + bytes))
        pretty_bytes=$(echo "scale=2; $bytes / 1000^4" | bc)
        echo "  ($bytes B) $pretty_bytes TiB"
    fi
    pretty_total_bytes=$(echo "scale=2; $total_bytes / 1000^4" | bc)
done <<< "$(radosgw-admin user list | jq -r .[])"
echo ""
echo "Total : ($total_bytes B) $pretty_total_bytes TiB"
When I run ceph df detail I get this line for the data pool:
default.rgw.buckets.data 70 N/A N/A 226TiB 89.23 27.2TiB 61676992 61.68M 2.05GiB 726MiB 677TiB
And when I use my script I don't get the same result:
Total : (207579728699392 B) 207.57 TiB
That means there are roughly 20 TiB somewhere that I can't find and, most of all, can't explain.
Does anyone have an explanation ?
FYI:
[root@ceph_monitor01 ~]# radosgw-admin gc list -include-all | grep oid | wc -l
23
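Another cross-check that might be relevant would be to sum the sizes on the
bucket side instead of the user side, e.g. (the jq path assumes the usual
Luminous bucket stats layout):

    # sum size_kb_actual over all buckets and convert to bytes
    radosgw-admin bucket stats \
        | jq '[.[].usage["rgw.main"].size_kb_actual // 0] | add * 1024'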
We have a small Ceph cluster built from components that were phased out
from compute applications. The current cluster consists of i7-860 nodes
with 6 disks (5 TB, 7200 RPM) each, 8 nodes in total, for 48 OSDs.
A compute cluster will be discontinued, which will make Ryzen 5-1600
hardware available (8 nodes with 16GB RAM each) with which to replace
the CPUs of the current setup.
How could we best distribute the OSDs (keeping the existing disks for
storage) across the Ryzen systems to get a good performance improvement?
Unfortunately the interconnect is still only 1 Gb/s, so it is expected to be
a limiting factor. Would it make sense to create fewer, bigger nodes, e.g.
6 nodes with 8 disks each, or an even more condensed layout?
We would like to move the Luminous cluster to Nautilus/BlueStore and can
get SSDs for each of the nodes, as that appears to be essential for
performance. Can we actually benefit from improvements in the OSDs if the
network is so limited? Would bonding of network interfaces be a
workaround until we can get a network upgrade, or are we overestimating
the power of the upgraded OSD nodes?
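For concreteness, the kind of bonding we are thinking about is plain LACP over
two 1 Gb/s ports, roughly as below with iproute2 (interface names and address
are placeholders, and the switch must support 802.3ad):

    ip link add bond0 type bond mode 802.3ad miimon 100
    ip link set eno1 down && ip link set eno1 master bond0
    ip link set eno2 down && ip link set eno2 master bond0
    ip link set bond0 up
    ip addr add 192.168.10.11/24 dev bond0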
What strategy would you suggest with these resources?
Any comments and suggestions would be highly welcome :)
Thanks in advance
Philipp
I have a query about https://docs.ceph.com/docs/master/cephfs/createfs/:
"The data pool used to create the file system is the "default" data pool and the location for storing all inode backtrace information, used for hard link management and disaster recovery. For this reason, all inodes created in CephFS have at least one object in the default data pool."
This does not match my experience (Nautilus servers, Nautilus FUSE client or CentOS 7 kernel client). I have a CephFS with a replicated top-level pool and a directory set to use erasure coding with setfattr, though I also did the same test using the subvolume commands with the same result. "ceph df detail" shows no objects used in the top-level pool, as shown in https://gist.github.com/pcass-epcc/af24081cf014a66809e801f33bcb535b (also displayed in-line below).
It would be useful if indeed clients didn't have to write to the top-level pool, since that would mean we could give different clients permission only to pool-associated subdirectories without giving everyone write access to a pool with data structures shared between all users of the filesystem.
[root@hdr-admon01 ec]# ceph df detail; ceph fs ls; ceph fs status
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 3.3 PiB 3.3 PiB 32 TiB 32 TiB 0.95
nvme 2.9 TiB 2.9 TiB 504 MiB 2.5 GiB 0.08
TOTAL 3.3 PiB 3.3 PiB 32 TiB 32 TiB 0.95
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
cephfs.fs1.metadata 5 162 MiB 63 324 MiB 0.01 1.4 TiB N/A N/A 63 0 B 0 B
cephfs.fs1-replicated.data 6 0 B 0 0 B 0 1.0 PiB N/A N/A 0 0 B 0 B
cephfs.fs1-ec.data 7 8.0 GiB 2.05k 11 GiB 0 2.4 PiB N/A N/A 2.05k 0 B 0 B
name: fs1, metadata pool: cephfs.fs1.metadata, data pools: [cephfs.fs1-replicated.data cephfs.fs1-ec.data ]
fs1 - 4 clients
===
+------+--------+------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+------------+---------------+-------+-------+
| 0 | active | hdr-meta02 | Reqs: 0 /s | 29 | 16 |
+------+--------+------------+---------------+-------+-------+
+----------------------------+----------+-------+-------+
| Pool | type | used | avail |
+----------------------------+----------+-------+-------+
| cephfs.fs1.metadata | metadata | 324M | 1414G |
| cephfs.fs1-replicated.data | data | 0 | 1063T |
| cephfs.fs1-ec.data | data | 11.4G | 2505T |
+----------------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| hdr-meta01 |
+-------------+
MDS version: ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
[root@hdr-admon01 ec]# ll /test-fs/ec/
total 12582912
-rw-r--r--. 1 root root 4294967296 Jan 27 22:26 new-file
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file2
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file-same-inode-as-newfile2
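A direct way to double-check for backtrace objects (pool names taken from the
output above; CephFS stores one 0-byte object named <inode-in-hex>.00000000 per
file) would be something like:

    # list objects in the default (replicated) data pool -- expected to hold the backtraces
    rados -p cephfs.fs1-replicated.data ls
    # for comparison, the EC pool does hold the file data objects
    rados -p cephfs.fs1-ec.data ls | head
    # the backtrace itself, if present, lives in the "parent" xattr of such an object
    rados -p cephfs.fs1-replicated.data getxattr <object-name> parent > /tmp/parent.bin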
Regards,
Phil
_________________________________________
Philip Cass
HPC Systems Specialist - Senior Systems Administrator
EPCC
Advanced Computing Facility
Bush Estate
Penicuik
Tel: +44 (0)131 4457815
Email: p.cass@epcc.ed.ac.uk
_________________________________________
Hi,
I have a CephFS in production based on 2 pools (data + metadata).
The data pool is erasure-coded with this profile:
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8
The metadata pool is replicated with size 3.
The crush rules are as follows:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "ec_data",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 5,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
When we installed it, everything was in the same room, but now we have
split our cluster (6 servers, soon 8) across 2 rooms. So we updated the
crush map by adding a room layer (with ceph osd crush add-bucket room1
room etc.) and moved all our servers to the correct place in the tree
(ceph osd crush move server1 room=room1 etc...).
Now we would like to change the rules to set the failure domain to room
instead of host (to be sure that in case of a disaster in one of the rooms
we will still have a copy in the other).
What is the best strategy for doing this?
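For the replicated pool I imagine something like the following (rule and pool
names are only examples); what I am less sure about is the EC pool, since with
k=3, m=2 there are five chunks to place but only two rooms:

    # create a replicated rule with room as the failure domain
    ceph osd crush rule create-replicated replicated_room default room
    # point the (replicated) metadata pool at the new rule
    ceph osd pool set cephfs_metadata crush_rule replicated_room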
F.
Hello All,
I have a HW-RAID-based 240 TB data pool with about 200 million files for
users in a scientific institution. Data sizes range from tiny parameter
files for scientific calculations and experiments to huge images of
brain scans. There are group directories, home directories, and Windows
roaming profile directories, organized in ZFS pools on Solaris operating
systems and exported via NFS and Samba to Linux, macOS, and Windows clients.
I would like to switch to CephFS because of the flexibility and
expandability but I cannot find any recommendations for which storage
backend would be suitable for all the functionality we have.
Since I like ZFS features such as immediate snapshots of very large data
pools, quotas for each file system within hierarchical data trees, and
dynamic expandability by simply adding new disks or disk images without
manual resizing, would it be a good idea to create RBD images, map them
onto the file servers and create zpools on the mapped images? I know that
ZFS works best with raw disks, but maybe an RBD image is close enough to a
raw disk?
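To make that concrete, what I have in mind is roughly the following (pool name,
image name and size are placeholders):

    # create and map an RBD image, then build a zpool on top of it
    rbd create rbd/zfs-image01 --size 10T
    rbd map rbd/zfs-image01          # maps to e.g. /dev/rbd0
    zpool create tank /dev/rbd0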
Or would CephFS be the way to go? Can there be multiple CephFS pools, for
example one for the group data folders and one for the users' home directory
folders, or do I have to have everything in one single file space?
Maybe someone can share his or her field experience?
Thank you very much.
Best regards
Willi
Hello,
in my cluster one OSD after the other dies; eventually I recognized that it
is simply an "abort" in the daemon, probably caused by:
2020-01-31 15:54:42.535930 7faf8f716700 -1 log_channel(cluster) log [ERR] : trim_object Snap 29c44 not in clones
Close to this message I get a stack trace:
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
1: /usr/bin/ceph-osd() [0xb35f7d]
2: (()+0x11390) [0x7f0fec74b390]
3: (gsignal()+0x38) [0x7f0feab43428]
4: (abort()+0x16a) [0x7f0feab4502a]
5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f0feb48684d]
6: (()+0x8d6b6) [0x7f0feb4846b6]
7: (()+0x8d701) [0x7f0feb484701]
8: (()+0x8d919) [0x7f0feb484919]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27e) [0xc3776e]
10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x10dd) [0x868cfd]
11: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0x80) [0x8690e0]
12: (Context::complete(int)+0x9) [0x6c8799]
13: (void ReplicatedBackend::sub_op_modify_reply<MOSDRepOpReply, 113>(std::tr1::shared_ptr<OpRequest>)+0x21b) [0xa5ae0b]
14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x15b) [0xa53edb]
15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1cb) [0x84c78b]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ef) [0x6966ff]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x4e4) [0x696e14]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x71e) [0xc264fe]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc29950]
20: (()+0x76ba) [0x7f0fec7416ba]
21: (clone()+0x6d) [0x7f0feac1541d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Yes, I know it's still Hammer; I want to upgrade soon, but I want to
resolve that issue first. If I lose that PG, I don't worry.
So: what is the best approach? Can I use something like
ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid> ?
I assume 29c44 is my object, but what is the clone id?
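From what I have pieced together so far, the invocation would look roughly like
this (OSD path, pgid and the object spec are placeholders, and the OSD would of
course be stopped first):

    # list the objects in the affected PG to get the JSON object spec
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid <pgid> --op list
    # then drop the stale clone metadata for that snap id
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        '<object-json-from-list>' remove-clone-metadata <cloneid>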
Best regards,
derjohn
This is the seventh update to the Ceph Nautilus release series. This is
a hotfix release primarily fixing a couple of security issues. We
recommend that all users upgrade to this release.
Notable Changes
---------------
* CVE-2020-1699: Fixed a path traversal flaw in Ceph dashboard that could
  allow for potential information disclosure (Ernesto Puerta)
* CVE-2020-1700: Fixed a flaw in RGW beast frontend that could lead to
  denial of service from an unauthenticated client (Or Friedmann)
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hi All,
Long story short, we're doing disaster recovery on a CephFS cluster and are at a point where we have 8 PGs stuck incomplete. Just before the disaster, I had increased pg_num on two of the pools, and they had not finished increasing pgp_num yet. I've since forced pgp_num to the current values.
So far I've tried mark_unfound_lost, but the PGs don't report any unfound objects, and I've tried force-create-pg, but that has no effect, except on one of the PGs, which went to creating+incomplete. During the disaster recovery I had to re-create several OSDs (due to unreadable superblocks), and now one of the new OSDs, as well as one of the existing OSDs, won't start. The log from the startup of osd.29 is here: https://pastebin.com/PX9AAj8m, which seems to indicate that it won't start because it's supposed to have copies of the incomplete placement groups.
ceph pg 5.38 query (one of the incomplete ones) gives: https://pastebin.com/Jf4GnZTc
I have hunted around in the OSDs listed for all the placement groups for any sign of a PG that I could mark as complete with ceph-objectstore-tool, but can't find any. I don't care about the data in the PGs, but I can't abandon the filesystem.
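For reference, the kind of thing I was hunting for looks roughly like this (OSD
data path is a placeholder, pg 5.38 as above, OSD stopped first):

    # export a copy of the pg from an OSD that still has (part of) it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 \
        --pgid 5.38 --op export --file /tmp/pg5.38.export
    # or, on an OSD that holds a usable copy, mark the pg complete there
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 \
        --pgid 5.38 --op mark-complete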
Any help would be greatly appreciated.
-TJ Ragan