Hi all,
I am testing my "multifs auth caps" PR[1] with teuthology. The issue
is that just to discover mistakes in my patch for files like
kernel_mount.py[2], mount.py[3], and fuse_mount.py[4], I need to push
the branch to ceph-ci, wait several hours for the build to complete,
trigger the teuthology tests, wait again for them to run, and then
repeat all these steps until every issue in my patch is fixed.
Since [2][3][4] are Python programs, they don't play any role in the
build process AFAIK. If that's absolutely true, is there a way to
circumvent the "waiting for build process to complete" part and
trigger the tests directly using the binaries from a previous build?
This would save several hours for me and also take the boredom out of
testing these changes. If previous builds are wiped out when I update my
copy of the PR branch on ceph-ci, I can maintain two branches on ceph-ci:
one for builds and the other for Python changes.
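If something along these lines is possible, I imagine the scheduling would
look roughly like this (just a sketch - I am not sure these teuthology-suite
options combine this way, and the branch names are placeholders):
$ teuthology-suite --machine-type smithi --suite fs \
      --ceph wip-djf-15070-builds \
      --suite-repo https://github.com/rishabh-d-dave/ceph \
      --suite-branch wip-djf-15070-qa
i.e. the binaries would come from the branch that has already been built,
while the qa/ directory (and therefore [2][3][4]) would be pulled from the
branch that carries only the Python changes.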
I did run my tests with vstart_runner.py locally to reduce the number
of round trips to pulpito.ceph.com, but the changes in [2][3][4] don't
get tested with vstart_runner.py since vstart_runner.py uses its own
classes for handling CephFS mounts.
Thanks,
- Rishabh
[1] https://github.com/ceph/ceph/pull/32581
[2] https://github.com/rishabh-d-dave/ceph/blob/wip-djf-15070/qa/tasks/cephfs/k…
[3] https://github.com/rishabh-d-dave/ceph/blob/wip-djf-15070/qa/tasks/cephfs/m…
[4] https://github.com/rishabh-d-dave/ceph/blob/wip-djf-15070/qa/tasks/cephfs/f…
Hello Sir/Madam,
We are facing a serious problem with our Proxmox cluster running Ceph. I have already submitted a ticket to Proxmox, but they said that the only option is to try to recover the mon DB. We would like to know if you have any suggestions for our situation.
So far the only option that I see would be to try to recover the mondb from an OSD. But this action is usually a last resort. Since I don't know the outcome, the cluster could then very well be dead and thus all data lost.
https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-m…
---
> > I would like to set the nodown on the cluster to see if the OSDs are kept in the cluster.
> > The OSDs are joining the cluster but are set as down shortly after.
> >
> ok . Please go ahead.
Sadly this didn't have any effect either.
But I think I found a clue to what might be going on.
# ceph-osd.0.log
2020-03-24 21:22:06.462100 7fb33aab0e00 10 osd.0 0 read_superblock sb(e8e81549-91e5-4370-b091-9500f406a2b2 osd.0 0bb2b9bb-9a70-4d6f-8d4e-3fc5049d63d6 e14334 [13578,14334] lci=[0,14334])
# ceph-mon.cccs01.log
2020-03-24 21:26:48.038345 7f7ef791a700 10 mon.cccs01(a)0(leader).osd e14299 e14299: 48 total, 13 up, 35 in
2020-03-24 21:26:48.038351 7f7ef791a700 5 mon.cccs01(a)0(leader).osd e14299 can_mark_out current in_ratio 0.729167 < min 0.75, will not mark osds out
2020-03-24 21:26:48.038360 7f7ef791a700 10 mon.cccs01(a)0(leader).osd e14299 tick NOOUT flag set, not checking down osds
2020-03-24 21:26:48.038364 7f7ef791a700 10 mon.cccs01(a)0(leader).osd e14299 min_last_epoch_clean 0
# ceph-mon.cccs06.log
2020-03-22 22:26:57.056939 7f3c1993a700 1 mon.cccs06(a)5(peon).osd e14333 e14333: 48 total, 48 up, 48 in
2020-03-22 22:27:04.113054 7f3c1993a700 0 mon.cccs06@5(peon) e31 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-03-22 22:27:04.113086 7f3c1993a700 0 log_channel(audit) log [DBG] : from='client.? 10.1.14.8:0/4265796352' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-03-22 22:27:09.752027 7f3c1993a700 1 mon.cccs06(a)5(peon).osd e14334 e14334: 48 total, 48 up, 48 in
...
2020-03-23 10:42:51.891722 7ff1d9079700 0 mon.cccs06(a)2(synchronizing).osd e14269 crush map has features 288514051259236352, adjusting msgr requires
2020-03-23 10:42:51.891729 7ff1d9079700 0 mon.cccs06(a)2(synchronizing).osd e14269 crush map has features 288514051259236352, adjusting msgr requires
2020-03-23 10:42:51.891730 7ff1d9079700 0 mon.cccs06(a)2(synchronizing).osd e14269 crush map has features 1009089991638532096, adjusting msgr requires
2020-03-23 10:42:51.891732 7ff1d9079700 0 mon.cccs06(a)2(synchronizing).osd e14269 crush map has features 288514051259236352, adjusting msgr requires
It seems that the OSDs have an epoch of e14334 but the MONs seem to have e14269 for the OSDs. I could only find e14334 on ceph-mon.cccs06.
cccs06 was the last MON standing (that node also reset). But when the cluster came back, the MONs with the old epoch came up first and joined.
The syslog shows that these were the last log entries written, so the nodes reset shortly after. This would fit with cccs06 being the last MON alive.
# cccs01
Mar 22 22:16:09 cccs01 pmxcfs[2502]: [dcdb] notice: leader is 1/2502
Mar 22 22:16:09 cccs01 pmxcfs[2502]: [dcdb] notice: synced members: 1/2502, 5/2219
# cccs02
Mar 22 22:15:57 cccs02 pmxcfs[2514]: [dcdb] notice: we (3/2514) left the process group
Mar 22 22:15:57 cccs02 pmxcfs[2514]: [dcdb] crit: leaving CPG group
# cccs06
Mar 22 22:31:16 cccs06 pmxcfs[2662]: [status] no
Mar 22 22:34:01 cccs06 systemd-modules-load[773]: Inserted module 'iscsi_tcp'
Mar 22 22:34:01 cccs06 systemd-modules-load[773]: Inserted module 'ib_iser'
There must have been some issue prior to the reset, as I found the error messages below. They could explain why no newer epoch was written anymore.
# cccs01
2020-03-22 22:22:15.661060 7fc09c85c100 -1 rocksdb: IO error: /var/lib/ceph/mon/ceph-cccs01/store.db/LOCK: Permission denied
2020-03-22 22:22:15.661067 7fc09c85c100 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-cccs01': (22) Invalid argument
# cccs02
2020-03-22 22:31:11.209524 7fd034786100 -1 rocksdb: IO error: /var/lib/ceph/mon/ceph-cccs02/store.db/LOCK: Permission denied
2020-03-22 22:31:11.209541 7fd034786100 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-cccs02': (22) Invalid argument
# cccs06
no such entries.
From my point of view, the remaining question is: how do we get the newer epoch from the OSDs into the MON DB?
I have no answer to this yet.
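Presumably it would go via the "recovery using OSDs" steps from the troubleshooting page linked above, roughly something like this (untested on our cluster; the store path and keyring location below are placeholders):
# on each OSD node, with the OSDs stopped, scrape the cluster maps out of every OSD
# (the collected store has to be carried along / merged across all OSD hosts)
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd \
        --op update-mon-db --mon-store-path /root/mon-store
done
# then rebuild a monitor store from the collected maps and the keyring
ceph-monstore-tool /root/mon-store rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
But since I do not know the outcome, I would appreciate hearing from anyone who has been through this on a production cluster before we attempt it.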
Best Regards,
Parker Lau
For the benefit of our new folks and for posterity:
Many of our QA tests for CephFS are located in qa/tasks/cephfs/*.
These get run in teuthology with various cluster configurations.
Everyone will need to be able to develop these tests locally, without
waiting for teuthology, so you can rapidly find errors in your test
cases and development builds.
To do this, you need to use the qa/tasks/vstart_runner.py script. This
allows you to use a vstart cluster to execute your tests by providing
the necessary frameworks the tests expect.
On a development box*, build ceph. If you're just testing CephFS, you
can usually get away with a smaller build without rbd/rgw:
./do_cmake.sh -DWITH_PYTHON3:BOOL=ON -DWITH_BABELTRACE=OFF
-DWITH_MANPAGE=OFF -DWITH_RBD=OFF -DWITH_RADOSGW=OFF && time (cd build
&& make -j24 CMAKE_BUILD_TYPE=Debug -k)
Next, build teuthology:
git clone https://github.com/ceph/teuthology.git && cd teuthology &&
virtualenv ./venv && source venv/bin/activate && pip install --upgrade
pip && pip install -r requirements.txt && python setup.py develop
Next, start a vstart cluster:
cd ceph/build && env MDS=3 ../src/vstart.sh -d -b -l -n --without-dashboard
Finally, run vstart_runner:
python2 ../qa/tasks/vstart_runner.py --interactive
tasks.cephfs.test_snapshots.TestSnapshots
^ That's an example test. The format mirrors the path of the test
module, qa/tasks/cephfs/test_snapshots.py. The final part is the
class we're testing, TestSnapshots. This invocation of
vstart_runner.py will run every test in TestSnapshots, i.e. every
method beginning with "test_". If you want to run a specific test,
you could do:
python2 ../qa/tasks/vstart_runner.py --interactive
tasks.cephfs.test_snapshots.TestSnapshots.test_snapclient_cache
Please give the above a try sometime soon so you know how to do it and
we can resolve any problems. This is an important skill to have for
developing CephFS.
* Hopefully you're using one of the beefy development boxes that make
compiling Ceph fast. I recommend one of the senta boxes like
senta03.front.sepia.ceph.com.
--
Patrick Donnelly
Hello,
I was unsuccessful in mounting CephFS with the kernel driver on Ubuntu
(it was senta02) today. Here's the command I was using to mount -
$ sudo mount -t ceph 172.21.9.32:40886:/ /mnt/kcephfs1 -o
name=admin,secret=AQCpa8ld1ahOIhAACPhh2qncfv0LkuI6+kUsEA==
mount: mount 172.21.9.32:40886:/ on /mnt/kcephfs1 failed: Connection timed out
I got the following message in dmesg logs -
[328388.882391] libceph: mon0 172.21.9.32:40886 feature set mismatch,
my 107b84a842aca < server's 40107b84a842aca, missing 400000000000000
[328388.894612] libceph: mon0 172.21.9.32:40886 missing required
protocol features
Port 40886 spoke msgr v1, so AFAIS the command looks fine. I tried
40885 too but I got "Connection timed out" on stdout and "socket
closed (con state CONNECTING)" in the dmesg logs as usual. I also tried
running the cluster on loopback/localhost and used 127.0.0.1:40917:/, but
even that was unsuccessful.
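In case it helps narrow this down, these are the commands I intend to use
to check what the cluster side advertises (only a guess that they will show
the bit relevant to the 400000000000000 in dmesg; I haven't correlated them
yet):
$ ./bin/ceph features          # feature bits per mon/osd/client group
$ ./bin/ceph mon feature ls    # persistent/optional monmap features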
To make sure that I am not missing anything, I tried the same thing on
Fedora 29 and the mount was successful. I've attached logs containing the
mount commands, dmesg logs and the keyring for the cluster. I was using this
branch to build and run the Ceph cluster -
https://github.com/rishabh-d-dave/ceph/tree/add-test-for-acls.
Thanks,
- Rishabh
Hi,
I tried mounting CephFS with the kernel driver on a vstart cluster on the
master branch (latest commit SHA:
d33c281b6437523a66d7802a39514f1ae74ec8e7) without a secret key, but I
was unsuccessful. Following is a copy of the stdout from my attempts to
mount. The first mount failed because I picked the wrong port.
However, the second attempt (with the key) was successful and the third
(without the key) wasn't.
build$ sudo mount -t ceph 192.168.0.218:40112:/ /mnt/kcephfs -o
name=admin,secret=AQDrjqpdy0fGKhAATIRQrdPhXB/uIi+86xuijQ==
^C
build$ sudo mount -t ceph 192.168.0.218:40113:/ /mnt/kcephfs -o
name=admin,secret=AQDrjqpdy0fGKhAATIRQrdPhXB/uIi+86xuijQ==
build$ sudo umount /mnt/kcephfs/
build$ sudo mount -t ceph 192.168.0.218:40113:/ /mnt/kcephfs -o name=admin
mount: /mnt/kcephfs: wrong fs type, bad option, bad superblock on
192.168.0.218:40113:/, missing codepage or helper program, or other
error.
build$ dmesg | tail
[ 806.561086] libceph: mon0 192.168.0.218:40112 socket closed (con
state CONNECTING)
[ 810.770148] libceph: mon0 192.168.0.218:40113 session established
[ 810.772603] libceph: client4275 fsid 54b1853a-1a08-482d-baf4-644eec15e830
[ 822.452439] libceph: no secret set (for auth_x protocol)
[ 822.452443] libceph: error -22 on auth protocol 2 init
build$
Just to make sure, I tried a fourth time with the key -
build$ mount -t ceph 192.168.0.218:40113:/ /mnt/kcephfs -o
name=admin,secret=AQDrjqpdy0fGKhAATIRQrdPhXB/uIi+86xuijQ==
build$ mount | grep kcephfs
192.168.0.218:40113:/ on /mnt/kcephfs type ceph
(rw,relatime,name=admin,secret=<hidden>,acl)
Thinking that the mount.ceph helper might be looking for the file
`ceph.client.admin.keyring`, I copied the admin keyring into a file and
placed it in build/ as well as in /etc/ceph. However, that too didn't
help. I've copied the shell output for the mount commands and the contents
of the keyring files here, in case that helps -
https://paste.fedoraproject.org/paste/YbFY235S3DaEryje9HDPAw.
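The next thing I plan to try is passing the key via a file instead of
relying on keyring lookup (only a guess that this will behave differently;
the path below is arbitrary and the file would contain just the base64 key):
$ sudo mount -t ceph 192.168.0.218:40113:/ /mnt/kcephfs -o name=admin,secretfile=/etc/ceph/admin.secret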
Thanks,
- Rishabh
Hi,
Running the ACL tests from xfstests-dev against a kernel client works
fine. I use the following local.config -
export FSTYP=ceph
export TEST_DEV=10.5.50.196:40535:/test
export TEST_DIR=/mnt/test
#export SCRATCH_DEV=10.215.99.205:40336:/scratch
#export SCRATCH_MNT=/mnt/scratch
and I get following output -
FSTYP -- ceph
PLATFORM -- Linux/x86_64 p50 5.0.7-200.fc29.x86_64
generic/099 2s ... 3s
Ran: generic/099
Passed all 1 tests
I can't get the tests running when the client is FUSE mounted. I've
tried setting FSTYP to 'ceph', 'ceph-fuse' and 'fuse', but none of
them worked. The error and local.config in all these cases are copied
here - https://paste.fedoraproject.org/paste/Ol2UtjEs7GgHFTXXzMea~w.
I am mounting /mnt/test with ceph-fuse using the following command -
sudo ./bin/ceph-fuse --client_mountpoint=/test /mnt/test
Thanks,
- Rishabh
Hi all,
I am working on a ceph-ansible playbook[1] that removes an MDS from an
already deployed Ceph cluster. Going through the documentation and the
ceph-ansible codebase, I found 3 ways to stop an MDS -
* ceph mds fail <mds-name> && rm -rf /var/lib/ceph/mds/ceph-{id} [2]
* systemctl stop ceph-mds@$HOSTNAME
* ceph tell mds.x exit
How do these 3 ways compare to each other? I ran these commands on
ceph-ansible deployed cluster and all 3 had the very same effect. Is
any one of these better than the rest?
What about "ceph mds rm" and "ceph mds rmfailed"? The first time I was
looking for various ways to stop an MDS, I tried "ceph mds fail
<mds-name> && ceph mds rm <global-id>" and it did not work since "ceph
mds rm" requires an MDS to inactive[3]. Is there a way to render an
MDS inactive? I couldn't find one.
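The closest thing I came across is shrinking the number of active ranks,
though I am not sure whether that is what the docs mean by "inactive" (the
fs name below is a placeholder):
$ ./bin/ceph fs set <fs-name> max_mds 1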
I also tried "ceph mds fail <mds-name> && ceph mds rmfailed
<mds-rank>" but this did not stop MDS. It only changed MDS's state to
'standby" -
(teuth-venv) $ ./bin/ceph fs dump | grep -A 1 standby_count_wanted 2> /dev/null
dumped fsmap epoch 4
standby_count_wanted 0
4232: [v2:192.168.0.217:6826/2113356090,v1:192.168.0.217:6827/2113356090]
'a' mds.0.3 up:active seq 4
(teuth-venv) $ ./bin/ceph mds fail a 2> /dev/null && ./bin/ceph mds
rmfailed --yes-i-really-mean-it 0 2> /dev/null && ./bin/ceph fs dump |
grep -A 3 Standby 2> /dev/null
dumped fsmap epoch 6
Standby daemons:
4286: [v2:192.168.0.217:6826/401505106,v1:192.168.0.217:6827/401505106]
'a' mds.-1.0 up:standby seq 1
(teuth-venv) $
Also, I find the usage of "remove" in this doc[2] ambiguous -- it can
mean removing MDS from cluster by changing MDS's state to standby or
it can mean killing/stopping it altogether. Reading [2] my impression
was that it meant killing/stopping it but "remove" is also used to
describe "ceph mds rm" and "ceph mds rmfailed" commands. Of these, at
least "ceph mds rmfailed" does not stop the MDS. If I am not the only
one to find this ambiguous, I'll go ahead and change the docs
accordingly.
- Rishabh
[1] https://github.com/ceph/ceph-ansible/pull/4083
[2] http://docs.ceph.com/docs/master/cephfs/add-remove-mds/
[3] http://docs.ceph.com/docs/master/man/8/ceph/
Hi all,
AFAIK, running tests with vstart_runner.py makes it mandatory that the
CWD be <ceph-repo-root>/build. But, apparently,
test_cephfs_shell.py[1] attempts to issue CephFS shell commands
directly from the CWD[1], which cannot work IMO. Is this a bug or am I
missing something? Am I supposed to configure my environment before
running the tests from test_cephfs_shell.py?
I tried running a couple of tests from test_cephfs_shell.py in the
same way we try to run a test from any other suite locally, but that didn't
work. The command I used is -
$ python2 ../qa/tasks/vstart_runner.py --interactive --create
tasks.cephfs.test_cephfs_shell.TestCephFSShell.test_mkdir
Following is the traceback for the command above -
File "/home/rishabh/repos/ceph/review/qa/tasks/cephfs/test_cephfs_shell.py",
line 45, in test_mkdir
o = self._cephfs_shell("mkdir d1")
File "/home/rishabh/repos/ceph/review/qa/tasks/cephfs/test_cephfs_shell.py",
line 29, in _cephfs_shell
stdin=stdin)
File "../qa/tasks/vstart_runner.py", line 324, in run
env=env)
File "/usr/lib64/python2.7/subprocess.py", line 394, in __init__
errread, errwrite)
File "/usr/lib64/python2.7/subprocess.py", line 1047, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
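My current guess (which could well be wrong) is that the subprocess fails
simply because the cephfs-shell executable isn't on the PATH when running
from build/. If that is the case, something like the following might be
enough, assuming cephfs-shell still lives under src/tools/cephfs (I haven't
verified this):
$ PATH=$PWD/../src/tools/cephfs:$PATH python2 ../qa/tasks/vstart_runner.py --interactive --create tasks.cephfs.test_cephfs_shell.TestCephFSShell.test_mkdir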
If I am not missing anything, this is surely a bug.
[1] https://github.com/ceph/ceph/blob/master/qa/tasks/cephfs/test_cephfs_shell.…
Hey Jason,
We are working towards bringing `top`-like functionality to CephFS
for displaying various client (and MDS) metrics. Since RBD has
something similar in the form of `perf image io*` via the rbd CLI, we
would like to understand some finer details of its
implementation and describe how CephFS is going about its `fs top`
functionality.
IIUC, the `rbd_support` manager module requests object perf counters
from the OSDs and extracts image names from the returned list of
hot objects. I'm guessing it's done this way since there is no
RBD-related active daemon to forward metrics data to the manager? OTOH,
`rbd-mirror` does make use of
`MgrClient::service_daemon_update_status()` to forward mirror daemon
status, which seems to be ok for anything that's not too bulky.
For forwarding CephFS-related metrics to the Ceph Manager, sticking
blobs of metrics data into the daemon status doesn't look clean (although it
might work). Therefore, for CephFS, the `MMgrReport` message type is
expanded to include metrics data as part of its report update process,
as per:
https://github.com/ceph/ceph/pull/26004/commits/a75570c0e73ef67bbca8f73a974…
... and a callback function is provided to `MgrClient` (invoked
periodically) to fill in the appropriate metrics data in its report. This
works well and is similar to how the OSD reports PG stats to the Ceph Manager.
I guess changes of this nature were not required by RBD, as it can get
the required data by querying the OSDs (and were other approaches
considered regarding the same)?
Thanks,
-Venky