I have a query about https://docs.ceph.com/docs/master/cephfs/createfs/:
"The data pool used to create the file system is the "default" data pool and the location for storing all inode backtrace information, used for hard link management and disaster recovery. For this reason, all inodes created in CephFS have at least one object in the default data pool."
This does not match my experience (nautilus servers, nautilus FUSE client or CentOS 7 kernel client). I have a CephFS with a replicated top-level pool and a directory set to use erasure coding with setfattr, though I also ran the same test using the subvolume commands with the same result. "ceph df detail" shows no objects used in the top-level pool, as shown in https://gist.github.com/pcass-epcc/af24081cf014a66809e801f33bcb535b (also displayed in-line below).
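For reference, the EC directory layout was set with something along these lines (pool and path as in the output below):

    setfattr -n ceph.dir.layout.pool -v cephfs.fs1-ec.data /test-fs/ec/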
It would be useful if clients indeed didn't have to write to the top-level pool, since that would mean we could give different clients permission only to their pool-associated subdirectories without giving everyone write access to a pool whose data structures are shared between all users of the filesystem.
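The kind of per-client restriction I have in mind would be roughly the following (client name and MDS path are illustrative; the pool is the EC pool from above), i.e. OSD caps limited to the EC data pool only, with no access to the default data pool:

    ceph auth get-or-create client.ecuser \
        mon 'allow r' \
        mds 'allow rw path=/ec' \
        osd 'allow rw pool=cephfs.fs1-ec.data'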
[root@hdr-admon01 ec]# ceph df detail; ceph fs ls; ceph fs status
RAW STORAGE:
    CLASS    SIZE        AVAIL       USED       RAW USED    %RAW USED
    hdd      3.3 PiB     3.3 PiB     32 TiB     32 TiB      0.95
    nvme     2.9 TiB     2.9 TiB     504 MiB    2.5 GiB     0.08
    TOTAL    3.3 PiB     3.3 PiB     32 TiB     32 TiB      0.95

POOLS:
    POOL                          ID    STORED     OBJECTS    USED       %USED    MAX AVAIL    QUOTA OBJECTS    QUOTA BYTES    DIRTY    USED COMPR    UNDER COMPR
    cephfs.fs1.metadata           5     162 MiB    63         324 MiB    0.01     1.4 TiB      N/A              N/A            63       0 B           0 B
    cephfs.fs1-replicated.data    6     0 B        0          0 B        0        1.0 PiB      N/A              N/A            0        0 B           0 B
    cephfs.fs1-ec.data            7     8.0 GiB    2.05k      11 GiB     0        2.4 PiB      N/A              N/A            2.05k    0 B           0 B
name: fs1, metadata pool: cephfs.fs1.metadata, data pools: [cephfs.fs1-replicated.data cephfs.fs1-ec.data ]
fs1 - 4 clients
===
+------+--------+------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+------------+---------------+-------+-------+
| 0 | active | hdr-meta02 | Reqs: 0 /s | 29 | 16 |
+------+--------+------------+---------------+-------+-------+
+----------------------------+----------+-------+-------+
| Pool | type | used | avail |
+----------------------------+----------+-------+-------+
| cephfs.fs1.metadata | metadata | 324M | 1414G |
| cephfs.fs1-replicated.data | data | 0 | 1063T |
| cephfs.fs1-ec.data | data | 11.4G | 2505T |
+----------------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| hdr-meta01 |
+-------------+
MDS version: ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
[root@hdr-admon01 ec]# ll /test-fs/ec/
total 12582912
-rw-r--r--. 1 root root 4294967296 Jan 27 22:26 new-file
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file2
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file-same-inode-as-newfile2
Regards,
Phil
_________________________________________
Philip Cass
HPC Systems Specialist - Senior Systems Administrator
EPCC
Advanced Computing Facility
Bush Estate
Penicuik
Tel: +44 (0)131 4457815
Email: p.cass@epcc.ed.ac.uk
_________________________________________
Hi,
I have a cephfs in production based on 2 pools (data+metadata).
Data is erasure-coded with the following profile:
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8
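For reference, such a profile and data pool would typically be created with something along these lines (profile name, pool name and PG count are illustrative, not necessarily what we used):

    ceph osd erasure-code-profile set ec_32 k=3 m=2 plugin=jerasure \
        technique=reed_sol_van crush-failure-domain=host crush-root=default
    ceph osd pool create cephfs_data 256 256 erasure ec_32
    # overwrites must be enabled when an EC pool is used directly as a CephFS data pool
    ceph osd pool set cephfs_data allow_ec_overwrites true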
Metadata is in a replicated pool with size 3 (three copies).
The CRUSH rules are as follows:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "ec_data",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 5,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
When we installed it, everything was in the same room, but now we have split our cluster (6 servers, soon to be 8) across 2 rooms. Thus we updated the crushmap by adding a room layer (with ceph osd crush add-bucket room1 room etc.) and moved all our servers to the correct place in the tree (ceph osd crush move server1 room=room1 etc.).
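Concretely, the commands we ran were along these lines (bucket and server names are just examples):

    ceph osd crush add-bucket room1 room
    ceph osd crush add-bucket room2 room
    ceph osd crush move room1 root=default
    ceph osd crush move room2 root=default
    ceph osd crush move server1 room=room1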
Now we would like to change the rules to set the failure domain to room instead of host (to be sure that, in case of a disaster in one of the rooms, we will still have a copy in the other).
What is the best strategy to do this?
F.
Hello All,
I have a HW-RAID-based 240 TB data pool with about 200 million files for users in a scientific institution. Data sizes range from tiny parameter files for scientific calculations and experiments to huge images of brain scans. There are group directories, home directories, and Windows roaming profile directories, organized in ZFS pools on Solaris operating systems and exported via NFS and Samba to Linux, macOS, and Windows clients.
I would like to switch to CephFS because of its flexibility and expandability, but I cannot find any recommendations for which storage backend would be suitable for all the functionality we have.
Since I like ZFS features such as immediate snapshots of very large data pools, quotas for each file system within hierarchical data trees, and dynamic expandability by simply adding new disks or disk images without manual resizing, would it be a good idea to create RBD images, map them onto the file servers, and create zpools on the mapped images? I know that ZFS works best with raw disks, but maybe an RBD image is close enough to a raw disk?
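To make the idea concrete, I am thinking of something along these lines (pool, image and zpool names are just placeholders):

    rbd create rbdpool/zfs-image --size 10T    # create an RBD image
    rbd map rbdpool/zfs-image                  # maps to e.g. /dev/rbd0
    zpool create tank /dev/rbd0                # build a zpool on the mapped device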
Or would CephFS be the way to go? Can there be multiple CephFS pools, for example for the group data folders and for the users' home directory folders, or do I have to keep everything in one single file space?
Maybe someone can share his or her field experience?
Thank you very much.
Best regards
Willi
Hello,
in my cluster one OSD after another dies; I eventually recognized that it is simply an "abort" in the daemon, probably caused by:
2020-01-31 15:54:42.535930 7faf8f716700 -1 log_channel(cluster) log [ERR] : trim_object Snap 29c44 not in clones
Close to this message I get a stack trace:
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
1: /usr/bin/ceph-osd() [0xb35f7d]
2: (()+0x11390) [0x7f0fec74b390]
3: (gsignal()+0x38) [0x7f0feab43428]
4: (abort()+0x16a) [0x7f0feab4502a]
5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f0feb48684d]
6: (()+0x8d6b6) [0x7f0feb4846b6]
7: (()+0x8d701) [0x7f0feb484701]
8: (()+0x8d919) [0x7f0feb484919]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27e) [0xc3776e]
10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x10dd) [0x868cfd]
11: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0x80) [0x8690e0]
12: (Context::complete(int)+0x9) [0x6c8799]
13: (void ReplicatedBackend::sub_op_modify_reply<MOSDRepOpReply, 113>(std::tr1::shared_ptr<OpRequest>)+0x21b) [0xa5ae0b]
14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x15b) [0xa53edb]
15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1cb) [0x84c78b]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ef) [0x6966ff]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x4e4) [0x696e14]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x71e) [0xc264fe]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc29950]
20: (()+0x76ba) [0x7f0fec7416ba]
21: (clone()+0x6d) [0x7f0feac1541d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Yes, I know it's still hammer; I want to upgrade soon, but I want to resolve that issue first. If I lose that PG, I don't mind.
So: what is the best approach? Can I use something like
ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid> ? I assume 29c44 is my object, but what's the clone id?
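To be explicit, the invocation I have in mind would look roughly like this (OSD id, pgid and object name are placeholders, and I'm not sure this is the exact syntax on hammer):

    # run against the stopped OSD; all values below are placeholders
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
        --pgid <pgid> '<object>' remove-clone-metadata <cloneid>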
Best regards,
derjohn