On Fri, Feb 14, 2020 at 13:25, Mike Christie <mchristi(a)redhat.com>
wrote:
On 02/13/2020 08:52 PM, Gesiel Galvão Bernardes
wrote:
Hi
On Sun, Feb 9, 2020 at 18:27, Mike Christie <mchristi(a)redhat.com> wrote:
On 02/08/2020 11:34 PM, Gesiel Galvão Bernardes wrote:
Hi,
On Thu, Feb 6, 2020 at 18:56, Mike Christie
<mchristi(a)redhat.com> wrote:
On 02/05/2020 07:03 AM, Gesiel Galvão Bernardes wrote:
> On Sun, Feb 2, 2020 at 00:37, Gesiel Galvão Bernardes
> <gesiel.bernardes(a)gmail.com> wrote:
> >
> > Hi,
> >
> > Just now it was possible to continue this. Below is the
> > information required. Thanks in advance.
>
>
> Hey, sorry for the late reply. I just got back from PTO.
>
> >
> > esxcli storage nmp device list -d naa.6001405ba48e0b99e4c418ca13506c8e
> > naa.6001405ba48e0b99e4c418ca13506c8e
> >    Device Display Name: LIO-ORG iSCSI Disk (naa.6001405ba48e0b99e4c418ca13506c8e)
> >    Storage Array Type: VMW_SATP_ALUA
> >    Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=on; {TPG_id=1,TPG_state=ANO}}
> >    Path Selection Policy: VMW_PSP_MRU
> >    Path Selection Policy Device Config: Current Path=vmhba68:C0:T0:L0
> >    Path Selection Policy Device Custom Config:
> >    Working Paths: vmhba68:C0:T0:L0
> >    Is USB: false
> > ........
> > Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0xa. Act:FAILOVER
Are you sure you are using tcmu-runner 1.4? Is that the actual daemon
version running? Did you by any chance install the 1.4 rpm, but you/it
did not restart the daemon? The error code above is returned in 1.3 and
earlier.
You are probably hitting a combo of 2 issues. We had only listed ESX 6.5
in the docs you probably saw, and in 6.7 the value of
action_OnRetryErrors defaulted to on instead of off. You should set this
back to off.
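On ESXi 6.7 this can be changed per device from the ESXi shell. The
commands below are an illustrative sketch run on the ESXi host (the
device id is the one from the output above; verify the
disable_action_OnRetryErrors option name against VMware's documentation
for your build):

```shell
# Disable action_OnRetryErrors for the ceph-iscsi backed device:
esxcli storage nmp satp generic deviceconfig set \
    -c disable_action_OnRetryErrors \
    -d naa.6001405ba48e0b99e4c418ca13506c8e

# Confirm the change took effect (look for action_OnRetryErrors=off
# in the Storage Array Type Device Config line):
esxcli storage nmp device list -d naa.6001405ba48e0b99e4c418ca13506c8e
```

Note the setting is per device, so it has to be applied for each
ceph-iscsi LUN on each ESXi host.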
You should also upgrade to the current version of tcmu-runner, 1.5.x. It
should fix the issue you are hitting, so non-IO commands like INQUIRY,
RTPG, etc. are executed while failing over/back, so you would not hit
the problem where path initialization and path testing IO is failed,
causing the path to be marked as failed.
I updated tcmu-runner to 1.5.2 and changed action_OnRetryErrors to off,
but the problem continues 😭
Attached is vmkernel.log.
When you stopped the iscsi gw at around 2020-02-09T01:51:25.820Z, how
many paths did your device have? Did:

esxcli storage nmp path list -d your_device

report only one path? Did

esxcli iscsi session connection list

show an iscsi connection to each gw?
Hmmm, I believe the problem may be here. I verified that I was listing
only one GW for each path. So I ran a "rescan HBA" in VMware on both
ESX hosts; now one of them lists the 3 gateways (I added one more), but
the other ESX host, with the same configuration, continues to list only
one gateway. See the different outputs:
[root@tcnvh7:~] esxcli iscsi session connection list
vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000001,0
Adapter: vmhba68
Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
ISID: 00023d000001
CID: 0
DataDigest: NONE
HeaderDigest: NONE
IFMarker: false
IFMarkerInterval: 0
MaxRecvDataSegmentLength: 131072
MaxTransmitDataSegmentLength: 262144
OFMarker: false
OFMarkerInterval: 0
ConnectionAddress: 192.168.201.1
RemoteAddress: 192.168.201.1
LocalAddress: 192.168.201.107
SessionCreateTime: 01/19/20 00:11:25
ConnectionCreateTime: 01/19/20 00:11:25
ConnectionStartTime: 02/13/20 23:03:10
State: logged_in
vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000002,0
Adapter: vmhba68
Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
ISID: 00023d000002
CID: 0
DataDigest: NONE
HeaderDigest: NONE
IFMarker: false
IFMarkerInterval: 0
MaxRecvDataSegmentLength: 131072
MaxTransmitDataSegmentLength: 262144
OFMarker: false
OFMarkerInterval: 0
ConnectionAddress: 192.168.201.2
RemoteAddress: 192.168.201.2
LocalAddress: 192.168.201.107
SessionCreateTime: 02/13/20 23:09:16
ConnectionCreateTime: 02/13/20 23:09:16
ConnectionStartTime: 02/13/20 23:09:16
State: logged_in
vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000003,0
Adapter: vmhba68
Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
ISID: 00023d000003
CID: 0
DataDigest: NONE
HeaderDigest: NONE
IFMarker: false
IFMarkerInterval: 0
MaxRecvDataSegmentLength: 131072
MaxTransmitDataSegmentLength: 262144
OFMarker: false
OFMarkerInterval: 0
ConnectionAddress: 192.168.201.3
RemoteAddress: 192.168.201.3
LocalAddress: 192.168.201.107
SessionCreateTime: 02/13/20 23:09:16
ConnectionCreateTime: 02/13/20 23:09:16
ConnectionStartTime: 02/13/20 23:09:16
State: logged_in
=====
[root@tcnvh8:~] esxcli iscsi session connection list
vmhba68,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,00023d000001,0
Adapter: vmhba68
Target: iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
ISID: 00023d000001
CID: 0
DataDigest: NONE
HeaderDigest: NONE
IFMarker: false
IFMarkerInterval: 0
MaxRecvDataSegmentLength: 131072
MaxTransmitDataSegmentLength: 262144
OFMarker: false
OFMarkerInterval: 0
ConnectionAddress: 192.168.201.1
RemoteAddress: 192.168.201.1
LocalAddress: 192.168.201.108
SessionCreateTime: 01/12/20 02:53:53
ConnectionCreateTime: 01/12/20 02:53:53
ConnectionStartTime: 02/13/20 23:06:40
State: logged_in
Is that the problem? Any ideas on how to proceed from here?
Yes. Normally you would have the connection already created, and when
one path/gateway goes down, the multipath layer will switch to another
path. When the path/gateway comes back up, the initiator side's iscsi
layer will reconnect automatically and the multipath layer will re-setup
the path structure, so it can fail back if it's a higher priority path,
or fail over later if other paths go down.
Something happened with the automatic path connection process on that
node. We know it works for the one gateway you brought up/down. For the
other gateways I would check:

1. Check that all target portals are being discovered. In the GUI screen
where you entered the discovery address, you should also see a list of
all target portals that were found in the static section. Do you only
see 1 portal?
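Roughly the same check can be done from the ESXi shell. These are
illustrative commands run on the ESXi host; with three gateways you
would expect static entries for 192.168.201.1/.2/.3:

```shell
# List the configured sendtargets (dynamic discovery) addresses:
esxcli iscsi adapter discovery sendtarget list

# List the static target portals that discovery found -- each gateway
# portal should appear here:
esxcli iscsi adapter discovery statictarget list
```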
Yes, all targets are listed in the "Targets" section. But in "Paths",
just
See here:
https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.storage.do…
2. If you see all the portals, then when you hit the rescan HBA button,
do you see any errors on the target side in /var/log/messages? Maybe
something about CHAP/login/auth errors?
When I hit rescan HBA, I received these messages in the logs (below). It
seems to complain about the username/password, but if that is the case,
the password is the same for all gateways, so why does one connect and
another not?
ceph-iscsi2 - /var/log/messages
Feb 16 07:31:21 ceph-iscsi2 kernel: Security negotiation failed.
Feb 16 07:31:21 ceph-iscsi2 kernel: iSCSI Login negotiation failed.
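One way to narrow down a login negotiation failure like this is to
compare what the gateways actually have configured and to watch the
target-side log while the ESXi host rescans. A sketch, run on one of the
ceph-iscsi gateway nodes (gwcli ships with ceph-iscsi; the paths shown
are its standard tree layout):

```shell
# Show the configured targets, gateways, and host (initiator) entries,
# including their auth settings, so CHAP config can be compared:
gwcli ls /iscsi-targets

# While hitting "rescan HBA" on the ESXi side, watch for the exact
# login/negotiation error as it happens:
tail -f /var/log/messages | grep -i -E 'login|negotiation|chap'
```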
tcnvh8 - /var/log/vmkernel.log
2020-02-16T07:25:44.718Z cpu13:2097316)qfle3:
qfle3_queue_remove_filter:2063: [vmnic0] QueueOps.queueRemoveFilter
2020-02-16T07:25:44.720Z cpu13:2097316)qfle3:
qfle3_remove_queue_filter:2012: [vmnic0] NetQ removed RX filter: queue:1
mac: 00:50:56:6c:24:9d filter id:0
2020-02-16T07:25:44.721Z cpu13:2097316)qfle3:
qfle3_queue_apply_filter:1923: [vmnic0] QueueOps.queueApplyFilter 0
2020-02-16T07:25:44.721Z cpu13:2097316)WARNING: qfle3:
qfle3_apply_queue_mac_filter:1734: [vmnic0] NetQ could not add RX filter,
no filters for queue 0
2020-02-16T07:25:44.721Z cpu13:2097316)qfle3: qfle3_queue_quiesce:1061:
[vmnic0] QueueOps.queueQuiesce
2020-02-16T07:25:44.721Z cpu13:2097316)qfle3: qfle3_queue_quiesce:1069:
[vmnic0] RxQ, QueueIDVal:1
2020-02-16T07:25:44.721Z cpu13:2097316)qfle3: qfle3_rx_queue_stop:1558:
[vmnic0] qfle3_rx_queue_stop, QueueIDVal:1
2020-02-16T07:25:44.721Z cpu13:2097316)qfle3: qfle3_rq_stop:740: [vmnic0]
qfle3_rq_stop 1
2020-02-16T07:25:44.721Z cpu13:2097316)qfle3: qfle3_rq_stop:811: [vmnic0]
Stopping queue 0
2020-02-16T07:25:44.724Z cpu13:2097316)qfle3: qfle3_rq_stop:831: [vmnic0]
disable netpoll for q_index 1
2020-02-16T07:25:44.724Z cpu13:2097316)qfle3: qfle3_rq_stop:842: [vmnic0]
Disabling interrupt on vector # 3
2020-02-16T07:25:44.724Z cpu13:2097316)qfle3: qfle3_rq_stop:867: [vmnic0]
active Rx queue Count 1
2020-02-16T07:25:44.724Z cpu13:2097316)qfle3: qfle3_queue_free:690:
[vmnic0] QueueOps.queueFree
2020-02-16T07:25:44.724Z cpu13:2097316)qfle3: qfle3_queue_free:697:
[vmnic0] RxQ, QueueIDVal:1
2020-02-16T07:25:44.724Z cpu13:2097316)qfle3: qfle3_rq_free:618: [vmnic0]
Loop through 1 RSS queues 1
2020-02-16T07:25:44.724Z cpu13:2097316)qfle3: qfle3_cmd_remove_q:19191:
[vmnic0] Releasing Q idx 1
2020-02-16T07:25:49.833Z cpu37:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2170: socket 0x430fd072fc00 network resource pool
netsched.pools.persist.iscsi associated
2020-02-16T07:25:49.833Z cpu37:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2198: socket 0x430fd072fc00 network tracker id
974086138 tracker.iSCSI.192.168.201.2 associated
2020-02-16T07:25:49.833Z cpu37:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2170: socket 0x430fd097ae20 network resource pool
netsched.pools.persist.iscsi associated
2020-02-16T07:25:49.833Z cpu37:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2198: socket 0x430fd097ae20 network tracker id
974086138 tracker.iSCSI.192.168.201.3 associated
2020-02-16T07:25:50.341Z cpu20:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:517: vmhba68:CH:1 T:0 CN:0: Failed to receive
data: Connection closed by peer
2020-02-16T07:25:50.341Z cpu20:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:525: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:50.341Z cpu20:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:526: Conn [CID: 0 L: 192.168.201.108:39679 R:
192.168.201.2:3260]
2020-02-16T07:25:50.341Z cpu20:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1235: vmhba68:CH:1 T:0 CN:0: Connection rx
notifying failure: Failed to Receive. State=Bound
2020-02-16T07:25:50.341Z cpu20:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1236: Sess [ISID: TARGET: (null) TPGT: 0
TSIH: 0]
2020-02-16T07:25:50.341Z cpu20:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1237: Conn [CID: 0 L: 192.168.201.108:39679 R:
192.168.201.2:3260]
2020-02-16T07:25:50.341Z cpu20:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:517: vmhba68:CH:2 T:0 CN:0: Failed to receive
data: Connection closed by peer
2020-02-16T07:25:50.341Z cpu20:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:525: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:50.341Z cpu20:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:526: Conn [CID: 0 L: 192.168.201.108:45598 R:
192.168.201.3:3260]
2020-02-16T07:25:50.341Z cpu20:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1235: vmhba68:CH:2 T:0 CN:0: Connection rx
notifying failure: Failed to Receive. State=Bound
2020-02-16T07:25:50.342Z cpu20:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1236: Sess [ISID: TARGET: (null) TPGT: 0
TSIH: 0]
2020-02-16T07:25:50.342Z cpu20:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1237: Conn [CID: 0 L: 192.168.201.108:45598 R:
192.168.201.3:3260]
2020-02-16T07:25:50.595Z cpu37:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:699: vmhba68:CH:1 T:0 CN:0: iSCSI connection is
being marked "OFFLINE" (Event:2)
2020-02-16T07:25:50.595Z cpu37:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:700: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:50.595Z cpu37:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:701: Conn [CID: 0 L: 192.168.201.108:39679 R:
192.168.201.2:3260]
2020-02-16T07:25:50.600Z cpu37:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:699: vmhba68:CH:2 T:0 CN:0: iSCSI connection is
being marked "OFFLINE" (Event:2)
2020-02-16T07:25:50.600Z cpu37:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:700: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:50.600Z cpu37:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:701: Conn [CID: 0 L: 192.168.201.108:45598 R:
192.168.201.3:3260]
2020-02-16T07:25:51.204Z cpu0:2837132)J6: 2651: 'Storage_Ceph_pool1':
Exiting async journal replay manager world
2020-02-16T07:25:55.360Z cpu38:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2170: socket 0x430fd06ef2a0 network resource pool
netsched.pools.persist.iscsi associated
2020-02-16T07:25:55.360Z cpu38:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2198: socket 0x430fd06ef2a0 network tracker id
974086138 tracker.iSCSI.192.168.201.2 associated
2020-02-16T07:25:55.361Z cpu38:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2170: socket 0x430fd0938540 network resource pool
netsched.pools.persist.iscsi associated
2020-02-16T07:25:55.361Z cpu38:2098255)iscsi_vmk:
iscsivmk_ConnNetRegister:2198: socket 0x430fd0938540 network tracker id
974086138 tracker.iSCSI.192.168.201.3 associated
2020-02-16T07:25:55.868Z cpu21:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:517: vmhba68:CH:1 T:0 CN:0: Failed to receive
data: Connection closed by peer
2020-02-16T07:25:55.868Z cpu21:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:525: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:55.868Z cpu21:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:526: Conn [CID: 0 L: 192.168.201.108:17752 R:
192.168.201.2:3260]
2020-02-16T07:25:55.868Z cpu21:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1235: vmhba68:CH:1 T:0 CN:0: Connection rx
notifying failure: Failed to Receive. State=Bound
2020-02-16T07:25:55.868Z cpu21:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1236: Sess [ISID: TARGET: (null) TPGT: 0
TSIH: 0]
2020-02-16T07:25:55.868Z cpu21:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1237: Conn [CID: 0 L: 192.168.201.108:17752 R:
192.168.201.2:3260]
2020-02-16T07:25:55.868Z cpu21:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:517: vmhba68:CH:2 T:0 CN:0: Failed to receive
data: Connection closed by peer
2020-02-16T07:25:55.868Z cpu21:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:525: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:55.868Z cpu21:2098243)WARNING: iscsi_vmk:
iscsivmk_ConnReceiveAtomic:526: Conn [CID: 0 L: 192.168.201.108:15616 R:
192.168.201.3:3260]
2020-02-16T07:25:55.868Z cpu21:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1235: vmhba68:CH:2 T:0 CN:0: Connection rx
notifying failure: Failed to Receive. State=Bound
2020-02-16T07:25:55.868Z cpu21:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1236: Sess [ISID: TARGET: (null) TPGT: 0
TSIH: 0]
2020-02-16T07:25:55.868Z cpu21:2098243)iscsi_vmk:
iscsivmk_ConnRxNotifyFailure:1237: Conn [CID: 0 L: 192.168.201.108:15616 R:
192.168.201.3:3260]
2020-02-16T07:25:56.121Z cpu38:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:699: vmhba68:CH:1 T:0 CN:0: iSCSI connection is
being marked "OFFLINE" (Event:2)
2020-02-16T07:25:56.121Z cpu38:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:700: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:56.121Z cpu38:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:701: Conn [CID: 0 L: 192.168.201.108:17752 R:
192.168.201.2:3260]
2020-02-16T07:25:56.127Z cpu38:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:699: vmhba68:CH:2 T:0 CN:0: iSCSI connection is
being marked "OFFLINE" (Event:2)
2020-02-16T07:25:56.127Z cpu38:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:700: Sess [ISID: TARGET: (null) TPGT: 0 TSIH: 0]
2020-02-16T07:25:56.127Z cpu38:2098255)WARNING: iscsi_vmk:
iscsivmk_StopConnection:701: Conn [CID: 0 L: 192.168.201.108:15616 R:
192.168.201.3:3260]
2020-02-16T07:30:49.971Z cpu7:2098243)NMP: nmp_ThrottleLogForDevice:3788:
Cmd 0x28 (0x45a296a54040, 0) to dev "naa.6001405ba48e0b99e4c418ca13506c8e"
on path "vmhba68:C0:T0:L0" Failed: H:0x0 D:0x8 P:0x0 Invalid sense data:
0x0 0x0 0x0. Act:NONE
What about in /var/log/vmkernel.log on the initiator side? Any iscsi
errors?