April 2020 - ceph-users - lists.ceph.io

by Ron Gage

Hi everyone! I have been working for the past week or so trying to get ceph-iscsi to work - Octopus release. Even just getting a single node working would be a major victory in this battle but so far, victory has proven elusive. My setup: a pair of Dell Optiplex 7010 desktops, each with 16 gig of memory and 1 boot drive (USB 3) and 3 SATA drives (500 Gb SSHD drives). No RAID controllers anywhere. Yes, I know that 3 nodes is the recommended minimum number for a production system - this isn't production (this is just seeing if the darned thing will even work). I am using Centos 8.1.1911 for the OS (4.18.0 kernel) with a basic or minimal installation (no X-Window). Single Gigabit ethernet per node. I have 2 MON and 2 Mgr installed and working, and I have a total of 6 OSDs working. I created the RBD pool (named "rbd" per the published instructions), creating it initially with 256 PGs (autoscale decided that 32 was a better choice - whatever). The cluster is green and all 6 OSDs are green (up and in). All deployment is via cephadm and all containers are running via podman. Here is where things start to fall apart. I was able to find RPM packages for targetcli and python-rtslib (called python3-rtslib) but was not able to find tcmu-runner nor ceph-iscsi packages. OK, no big deal. Time to head over to the manual install guide. I was able to build tcmu-runner, install it and apparently it is running (systemctl says it is active) so that appears to be OK. The problem is getting rbd-target-gw and rbd-target-api to work. They appear to build OK and of course, I am able to get them registered with systemd. They universally fail when trying to run them (systemctl start rbd-target-gw or systemctl start rbd-target-api). Both report failure. Looking in journalctl -xe shows no hints at all regarding why they failed (only that they did). Looking in /var/log/rbd-target-api/ show nothing at all (no files). Likewise in /var/log/rbd-target-gw/ (no files). HELP!! Now, some possibly germane questions: 1) are any other Ceph services required for ceph-iscsi to work like RADOSgw? 2) since there are no apparent packages available for ceph-iscsi, can anything be inferred to the production-readiness of the subsystem? 3) are there any known errata or missing steps in the instructions for getting ceph-iscsi to work? Thanks! Ron Gage

3 years, 12 months

2
3
0 0

How to debug ssh: ceph orch host add ceph01 10.10.1.1

by Ml Ml

Hello List, i did: root@ceph01:~# ceph cephadm set-ssh-config -i /tmp/ssh_conf root@ceph01:~# cat /tmp/ssh_conf Host * User root StrictHostKeyChecking no UserKnownHostsFile /dev/null root@ceph01:~# ceph config-key set mgr/cephadm/ssh_identity_key -i /root/.ssh/id_rsa set mgr/cephadm/ssh_identity_key root@ceph01:~# ceph config-key set mgr/cephadm/ssh_identity_pub -i /root/.ssh/id_rsa.pub set mgr/cephadm/ssh_identity_pub But i get: root@ceph01:~# ceph orch host add ceph01 10.10.1.1 Error ENOENT: Failed to connect to ceph01 (10.10.1.1). Check that the host is reachable and accepts connections using the cephadm SSH key root@ceph01:~# ceph config-key get mgr/cephadm/ssh_identity_key => this shows my private key How can i debug this? root@ceph01:~# ssh 10.10.1.1 or root@ceph01:~# ssh ceph01 work without a prompt or key error. I am using 15.2.0. Thanks, Michael

3 years, 12 months

3
3
0 0

CDS Pacific: Dashboard planning summary

by Lenz Grimmer

Hi all, a few weeks ago, a number of virtual Ceph Developer Summit meetings took place as a replacement for the in-person summit that was planned as part of Cephalocon in Seoul: https://pad.ceph.com/p/cds-pacific The Ceph Dashboard team also participated in these and held three video conference meetings to lay out our plans for the Pacific release. For details, please take a look at our notes at this Etherpad: https://pad.ceph.com/p/ceph-dashboard-pacific-priorities We tried to identify a few "themes", outlining individual tasks which we keep track of in the tracker.ceph.com bug tracker. The tracker issues should be used for discussing and defining the tasks at hand. A key theme for the upcoming Ceph Pacific release is the intention to further deepen and enhance the integration and support with cephadm and the orchestrator. For Ceph octopus, we tried focusing on the most common day-2 operation which is OSD management, but going forward we would like to also support the deployment and management of all other Ceph-related services that can be rolled out via cephadm and the orchestrator. In a hopefully not so distant future, we would like to be able to use the dashboard as a kind of "graphical installer", that guides the user through the entire installation deployment process of a Ceph cluster from scratch (well, almost: starting from an initial Mon+Mgr deployment). Another key theme is closing feature gaps: the various services of a Ceph cluster like RBD or RGW are constantly evolving and getting new features, so we always are trying to catch up with the latest developments there. We're also looking into enhancing our monitoring/alerting support and integration with Grafana and Prometheus. Last but not least, we always try to enhance and improve existing functionality and work on better usability and user experience. This also includes bigger refactoring work or updating key components that the dashboard depends on. As always, we would like the dashboard to be an application that Ceph administrators like and actually *want* to use to perform their jobs, so we are very keen on getting your feedback here! If there is anything you are missing or if you find any part of the dashboard to be confusing or not helpful, we'd like to know about it! Please get in touch with us to share your impressions and ideas. The best way to do this is to join the #ceph-dashboard IRC channel on OFTC or by filing a bug report via the tracker: https://tracker.ceph.com/projects/mgr/issues/new Thank you, Lenz -- SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)

3 years, 12 months

1
0
0 0

Upgrading to Octopus

by Simon Sutter

Hello everybody In octopus there are some interesting looking features, so I tried to upgrading my Centos 7 test nodes, according to: https://docs.ceph.com/docs/master/releases/octopus/ Everything went fine and the cluster is healthy. To test out the new dashboard functions, I tried to install it, but there are missing dependencies: yum install ceph-mgr-dashboard.noarch ..... --> Finished Dependency Resolution Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch) Requires: python3-routes Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch) Requires: python3-jwt Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch) Requires: python3-cherrypy Installing them with pip3 does of course make no difference, because those are yum dependencies. Does anyone know a workaround? Do I have to upgrade to Centos 8 for this to work? Thanks in advance, Simon

3 years, 12 months

5
12
0 0

Re: RGW and the orphans

by Katarzyna Myrek

Hi @Eric Ivancich my cluster has some history and trash gathered over the years. Most (terabytes) is from https://tracker.ceph.com/issues/43756. I was able to reproduce the problem on my LAB and it is for sure connected with https://tracker.ceph.com/issues/43756. When you are on a version older than 14.2.8 you would need to apply lifecycle policy which tries to abort interrupted multiparts older than x days. And when the bucket index is sharded then the broken/un-cancellable MPs are born. To test it I can use s3cmd. My LAB cluster was upgraded to 14.2.8 to make sure the new version does not do cleanup automagically. Here is my procedure (I truncated my personal data): *s3cmd --access_key= --secret_key= --host= --host-bucket= multipart s3://kate-mp-issue* *s3://kate-mp-issue/Initiated Path Id2020-04-06T07:48:55.323Z s3://kate-mp-issue/bottest_20200406T074855.img 2~-9SKkHzGKXYX_zNdHNs_S8RY9hWjISS* *s3cmd --access_key= --secret_key= --host= --host-bucket= abortmp s3://kate-mp-issue/bottest_20200406T074855.img 2~-9SKkHzGKXYX_zNdHNs_S8RY9hWjISS* *ERROR: S3 error: 404 (NoSuchUpload)* *RGW logs:* 2020-04-29 07:24:23.126 7fb21b819700 1 ====== starting new request req=0x7fb21b8128d0 ===== 2020-04-29 07:24:23.126 7fb21b819700 1 ====== req done req=0x7fb21b8128d0 op status=0 http_status=200 latency=0s ====== 2020-04-29 07:24:23.126 7fb21b819700 1 civetweb: 0x381a000: IP - - [29/Apr/2020:07:24:22 +0000] "GET /kate-mp-issue/?location HTTP/1.1" 200 275 - - 2020-04-29 07:24:23.202 7fb21b819700 1 ====== starting new request req=0x7fb21b8128d0 ===== 2020-04-29 07:24:23.202 7fb21b819700 1 ====== req done req=0x7fb21b8128d0 op status=-2009 *http_status=404* latency=0s ====== 2020-04-29 07:24:23.202 7fb21b819700 1 civetweb: 0x381a000: Ip - - [29/Apr/2020:07:24:22 +0000] "DELETE /kate-mp-issue/bottest_20200406T074855.img?uploadId=2~-9SKkHzGKXYX_zNdHNs_S8RY9hWjISS HTTP/1.1" *404* 439 - - So basically I need to remove those ghost entries from the list of interrupted multiparts and clean up objects which are left overs. As far as I understand I would need to go over every object in the pool with `rados ls`, then compare the output with `radosgw-admin bi list (done for every bucket)` and with new command `radosgw-admin radoslist` and remove objects which are on 1 but not on 2 and 3. Plus clean up the interrupted multipart list. Is that correct? @EDH - Manuel Rios <mriosfer(a)easydatahost.com> Is that your method also? I really need to clean up the terbaytes of the leftovers, because my prod cluster is getting full. And now buying anything is not an option (harsh times due to pandemy). Kind regards / Pozdrawiam, Katarzyna Myrek Kind regards / Pozdrawiam, Katarzyna Myrek wt., 28 kwi 2020 o 19:45 EDH - Manuel Rios <mriosfer(a)easydatahost.com> napisał(a): > Im prettty sure that you got the same issue than we already reported : > > https://tracker.ceph.com/issues/43756 > > Garbage and garbage stored into our OSD without be able to cleanup wasting > a lot of space. > > As you can see its solved in the new versions but... the last versión > didnt have any "scrub" or similar system to fix the garbage generated in > the past versions. > > As result , even big companies got their RGW plattform with tons of TB > wasted. > > Eric, Is there a way to ask you to develop(RGW Team) a system to clean our > rgw clusters like rgw bucket scrub? > > I talked with Cbodley and he explained how to it manually but the process > is extremely complex. > > We already calculated that at least a 25% of our rgw cluster is garbage > (100TB), and our options right now: > > - Deploy a new cluster a move rgw Users one by one with their buckets with > an external copy, hopping in the last nautilus version this not happen > again (Not usefull option and not transparent) > - Buy disk and disk waiting for a solution as External Tool (No sure to > continue this way) > - Hire external developers with knowleage of ceph and create a private > tool for that. (Developers with Ceph Core/Rgw knowleage will be no easy to > find) > > Here: ceph version 14.2.8 > > > > -----Mensaje original----- > De: Eric Ivancich <ivancich(a)redhat.com> > Enviado el: martes, 28 de abril de 2020 18:39 > Para: Katarzyna Myrek <katarzyna(a)myrek.pl> > CC: ceph-users(a)ceph.io > Asunto: [ceph-users] Re: RGW and the orphans > > Hi Katarzyna, > > Incomplete multipart uploads are not considered orphans. > > With respect to the 404s…. Which version of ceph are you running? What > tooling are you using to list and cancel? Can you provide a console > transcript of the listing and cancelling? > > Thanks, > > Eric > > -- > J. Eric Ivancich > he / him / his > Red Hat Storage > Ann Arbor, Michigan, USA > > > On Apr 28, 2020, at 2:57 AM, Katarzyna Myrek <katarzyna(a)myrek.pl> wrote: > > > > Hi all > > > > I am afraid that there is even more thrash available - running > > rgw-orphan-list does not find everything. Like I still have broken > > multiparts -> when I do s3cmd multipart I get a list of > > "pending/interrupted multiparts". When I try to cancel such multipart > > I get 404. > > > > Does anyone have a method for cleanup of such things? Or even a list > > of tasks which should be run regularly on clusters with rgw ? > > > > > > Kind regards / Pozdrawiam, > > Katarzyna Myrek > > > > > > wt., 21 kwi 2020 o 09:57 Janne Johansson <icepic.dz(a)gmail.com> > napisał(a): > >> > >> Den tis 21 apr. 2020 kl 07:29 skrev Eric Ivancich <ivancich(a)redhat.com > >: > >>> > >>> Please be certain to read the associated docs in both: > >>> > >>> doc/radosgw/orphans.rst > >>> doc/man/8/rgw-orphan-list.rst > >>> > >>> so you understand the limitations and potential pitfalls. Generally > this tool will be a precursor to a large delete job, so understanding > what’s going on is important. > >>> I look forward to your report! And please feel free to post additional > questions in this forum. > >>> > >> > >> Where are those? > >> https://github.com/ceph/ceph/tree/master/doc/man/8 > >> https://github.com/ceph/ceph/tree/master/doc/radosgw > >> don't seem to contain them in master. Nor in nautilus branch or octopus. > >> > >> This whole issue feels weird, rgw (or its users) produces dead > >> fragments of mulitparts, orphans and whatnot that needs cleaning up > sooner or later and the info we get is that the old cleaner isn't meant to > be used, it hasn't worked for a long while, there is no fixed version, > perhaps there is a script somewhere with caveats. This (slightly > frustrated) issue is of course on top of "bi trim" > >> "bilog trim" > >> "mdlog trim" > >> "usage trim" > >> > >> "datalog trim" > >> > >> "sync error trim" > >> > >> "gc process" > >> > >> "reshard stale-instances rm" > >> > >> > >> > >> that we rgw admins are supposed to know when to run, how often, what > their quirks are and so on. > >> > >> > >> 'Docs' for rgw means "datalog trim" --help says "trims the datalog", > and the long version on the web would be "this operation trims the datalog" > or something that doesn't add anything more. > >> > >> > >> > >> > >> -- > >> > >> "Grumpy cat was an optimist" > >> > > > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > email to ceph-users-leave(a)ceph.io >

3 years, 12 months

1
0
0 0

rados buckets copy

by Andrei Mikhailovsky

Hello, I have a problem with radosgw service where the actual disk usage (ceph df shows 28TB usage) is way more than reported by the radosgw-admin bucket stats (9TB usage). I have tried to get to the end of the problem, but no one seems to be able to help. As a last resort I will attempt to copy the buckets, rename them and remove the old buckets. What is the best way of doing this (probably on a high level) so that the copy process doesn't carry on the wasted space to the new buckets? Cheers Andrei

3 years, 12 months

1
0
0 0

Nautilus upgrade causes spike in MDS latency

by Josh Haft

Hi, I upgraded from 13.2.5 to 14.2.6 last week and am now seeing significantly higher latency on various MDS operations. For example, the 2min rate of ceph_mds_server_req_create_latency_sum / ceph_mds_server_req_create_latency_count for an 8hr window last Monday prior to the upgrade was an average of 2ms. Today, however the same stat shows 869ms. Other operations including open, readdir, rmdir, etc. are also taking significantly longer. Here's a partial example of an op from dump_ops_in_flight: { "description": "client_request(client.342513090:334359409 create #...)", "initiated_at": "2020-04-13 15:30:15.707637", "age": 0.19583208099999999, "duration": 0.19767626299999999, "type_data": { "flag_point": "submit entry: journal_and_reply", "reqid": "client.342513090:334359409", "op_type": "client_request", "client_info": { "client": "client.342513090", "tid": 334359409 }, "events": [ { "time": "2020-04-13 15:30:15.707637", "event": "initiated" }, { "time": "2020-04-13 15:30:15.707637", "event": "header_read" }, { "time": "2020-04-13 15:30:15.707638", "event": "throttled" }, { "time": "2020-04-13 15:30:15.707640", "event": "all_read" }, { "time": "2020-04-13 15:30:15.781935", "event": "dispatched" }, { "time": "2020-04-13 15:30:15.785086", "event": "acquired locks" }, { "time": "2020-04-13 15:30:15.785507", "event": "early_replied" }, { "time": "2020-04-13 15:30:15.785508", "event": "submit entry: journal_and_reply" } ] } } This along with every other 'create' op I've seen has a 50ms+ delay between all_read and dispatched events - what is happening during this time? I'm not sure what I'm looking for the in the MDS debug logs. We have a mix of clients from 12.2.x through 14.2.8; my plan was to upgrade those pre-Nautilus clients this week. There is only a single MDS rank with 1 backup. Other functions of this cluster - RBDs and RGW - do not appear impacted so this looks limited to the MDS. I did not observe this behavior after upgrading a dev cluster last month. Has anyone seen anything similar? Thanks for any assistance! Josh

3 years, 12 months

3
6
0 0

How to recover files from cephfs data pool

by Edison Shadabi

Hi Ceph folks, I am relatively new to Ceph cluster and I hope I can quickly receive some help here. I would like to recover files from cephfs data pool. Someone wrote that inode linkage and file names are stored in omap data of objects in metadata pool. I cant find any information about the structure of omap data of the objects in metadata pool to help me write for example a script to retrieve filenames and the related objects So I can use for example “rados get” to retrieve those files. Are there any working script that would traverse through all the metadata pool to find out file names corresponded to objects in data pool? /Ed

3 years, 12 months

1
0
0 0

Bucket sync across available DCs

by Szabo, Istvan (Agoda)

Hi, is there a way to synchronize a specific bucket by Ceph across the available datacenters? I've just found multi site setup but that one sync the complete cluster, which is equal to failover solution. For me just 1 bucket. Thank you ________________________________ This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you have received it by mistake please let us know by reply email and delete it from your system. It is prohibited to copy this message or disclose its content to anyone. Any confidentiality or privilege is not waived or lost by any mistaken delivery or unauthorized disclosure of the message. All messages sent to and from Agoda may be monitored to ensure compliance with company policies, to protect the company's interests and to remove potential malware. Electronic messages may be intercepted, amended, lost or deleted, or contain viruses.

3 years, 12 months

2
1
0 0

Bucket dynamically resharded to 65521 shards - resharding manually won't work

by gl＠productsup.com

Hello, running Ceph Nautilus 14.2.4, we encountered this documented dynamic resharding issue: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-November/037531.ht… We disabled dynamic resharding in the configuration, and attempted to reshard to 1 shard: # radosgw-admin reshard add --bucket files --num-shards 1 --yes-i-really-mean-it However, it achieved nothing, and the bucket is now stuck in resharding status. It is impossible to clear the resharding flag (I have tried the bucket check --fix operation with no avail) # radosgw-admin reshard cancel --bucket=files 2020-04-28 11:47:18.721 7fd213b969c0 -1 ERROR: failed to remove entry from reshard log, oid=reshard.0000000000 tenant= bucket=files # radosgw-admin bucket reshard --bucket files --num-shards 1 ERROR: the bucket is currently undergoing resharding and cannot be added to the reshard list at this time

3 years, 12 months

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users April 2020