On 4/29/20 2:11 AM, Simone Lazzaris wrote:
On Tuesday, 28 April 2020 at 18:41:27 CEST, Mike Christie wrote:
Could you send me:
1. The /var/log/messages for the initiator when you do IO and see those
lock messages.
On the initiator (XenServer 7.1 which is based on CentOS AFAIK) the
/var/log/messages is empty.
I (sporadically) see:
Apr 29 09:00:36 xs-n1 systemd[1]: Starting Multipath Count Service...
Apr 29 09:00:36 xs-n1 systemd[1]: Started Multipath Count Service.
Apr 29 09:00:36 xs-n1 systemd[1]: Started Session 146 of user root.
Apr 29 09:00:36 xs-n1 systemd[1]: Starting Session 146 of user root.
Apr 29 09:00:40 xs-n1 multipathd: dm-3: remove map (uevent)
Apr 29 09:00:40 xs-n1 multipathd: dm-3: devmap not registered, can't remove
Apr 29 09:00:40 xs-n1 multipathd: dm-3: remove map (uevent)
Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen] mpathalert=>xapi [label="PBD.get_all_records"];
Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen] mpathalert=>xapi [label="host.get_uuid"];
Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen] mpathalert=>xapi [label="host.get_name_label"];
[the host.get_uuid / host.get_name_label pair repeats five more times]
Apr 29 09:00:40 xs-n1 mpathalert: [debug|xs-n1|2 ||mscgen] mpathalert=>xapi [label="host.get_all_records"];
2. The output of
From one of the gateways:
# gwcli ls
Attached (gwcli.txt)
From the initiator node, send the output of:
# iscsiadm -m session -P 3
Attached (iscsi-session.txt)
# multipath -ll
36001405d7480e5f84b94ab19ebeebd6c dm-0 LIO-ORG ,TCMU device
size=3.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=50 status=active
| `- 2:0:0:0 sdc 8:32 active ready running
`-+- policy='queue-length 0' prio=10 status=enabled
`- 3:0:0:0 sdb 8:16 active ready running
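As an aside for readers interpreting output like the above: with ALUA, the path group with status=active and the higher priority (prio=50 here) is the active/optimized (AO) one, and its member disk (sdc in this output) is served by the owning gateway. A minimal sketch of pulling the AO disk out of a captured copy of that output (quoting simplified for the example):

```shell
# Captured `multipath -ll` output from above, simplified into a shell string
# (the original quoting around features/hwhandler is dropped for readability).
mp='36001405d7480e5f84b94ab19ebeebd6c dm-0 LIO-ORG ,TCMU device
size=3.0T features=1 queue_if_no_path hwhandler=1 alua wp=rw
|-+- policy=queue-length 0 prio=50 status=active
| `- 2:0:0:0 sdc 8:32 active ready running
`-+- policy=queue-length 0 prio=10 status=enabled
  `- 3:0:0:0 sdb 8:16 active ready running'
# The line after "status=active" names the AO path member; extract its sdX name.
printf '%s\n' "$mp" | grep -A1 'status=active' | grep -oE 'sd[a-z]+'
```

This prints sdc for the output above, which is the disk whose gateway should show up as the LUN owner in "gwcli ls".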
3. version info:
# uname -a
On the Initiator:
Linux xs-n1 4.4.0+2 #1 SMP Thu Jun 15 16:38:02 UTC 2017 x86_64 x86_64
x86_64 GNU/Linux
On the Target:
Linux iscsi1 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Thu Apr 9 13:49:54 UTC
2020 x86_64 x86_64 x86_64 GNU/Linux
If you are using rpm, run:
# rpm -q ceph-iscsi
# rpm -q tcmu-runner
# rpm -q python-rtslib
No, I've installed them from source on the target.
What version of tcmu-runner did you use? Was it one of the 1.4 or 1.5
releases or from the github master branch?
There was a bug in the older 1.4 release: due to a Linux kernel change on
the initiator side, the behavior for an error code we used went from
retrying for up to 5 minutes to retrying only 5 times. Those 5 retries
could be exhausted in under a second, which would produce the issue you
are seeing.
To map that to an iscsi gateway you can do the following.
If sdb is the AO one, then run
iscsiadm -m session -P 3
There you can see the sdXYZ name to iscsi session mapping. The iscsi
session/connection's target IP address from that command should match
the gateway listed as the "owner" of the LUN in the "gwcli ls" output.
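The mapping step above can be sketched over a captured dump. The session text below is a made-up excerpt (the IQN and addresses are hypothetical, not taken from this thread); the filter just pairs each portal IP with the sdX device attached through that session:

```shell
# Hypothetical excerpt of `iscsiadm -m session -P 3` output for two sessions
# to the same target, one per gateway.
sample='Target: iqn.2003-01.com.redhat.iscsi-gw:ceph-igw
    Current Portal: 192.168.1.11:3260,1
            Attached scsi disk sdc          State: running
Target: iqn.2003-01.com.redhat.iscsi-gw:ceph-igw
    Current Portal: 192.168.1.12:3260,2
            Attached scsi disk sdb          State: running'
# Each "Current Portal" line that precedes an "Attached scsi disk" line ties
# an sdX device to the gateway IP its session is logged in to.
printf '%s\n' "$sample" | grep -E 'Current Portal|Attached scsi disk'
```

Whichever portal IP pairs with the AO disk should belong to the gateway that "gwcli ls" reports as the LUN owner.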
I see... thanks for the hint.
I've done a test: I unmapped all the drives, then mapped the first
gateway (iscsi1) on all the nodes, waited, then mapped the second
gateway, to be sure that all the nodes would see the first node as the
active/master.
Now things seem a little better in "normal" VM use: I only see the
"Cannot send after transport endpoint shutdown." messages on the
secondary target node.
I do see some hopping between the nodes when importing a disk drive, but
at this point I'm starting to suspect some strange activity from the Xen
infrastructure in that circumstance.
--
*Simone Lazzaris*
*Qcom S.p.A. a socio unico*
simone.lazzaris(a)qcom.it <mailto:simone.lazzaris@qcom.it> |
www.qcom.it <https://www.qcom.it>
*LinkedIn* <https://www.linkedin.com/company/qcom-spa> | *Facebook*
<http://www.facebook.com/qcomspa>