NFS tuning over 100 Gbps Infiniband

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

NFS tuning over 100 Gbps Infiniband

Post by alpha754293 » 2019/08/09 05:46:32

So, I'm currently using CentOS 7.6.1810, installed on my four-node micro compute cluster, and all of the nodes have a Mellanox ConnectX-4 100 Gbps 4x EDR InfiniBand adapter.

I've installed the 'Infiniband Support' software group from the installation media, OpenSM is running, IPs have been assigned, and everything is looking good.

Here is how the host is set up:

Code: Select all

$ cat /etc/exports
/home/user/cluster *(rw,sync,no_root_squash,no_all_squash,no_subtree_check)
Here is how the client is set up:

Code: Select all

$ cat /etc/fstab
...
aes1:/home/user/cluster /home/user/cluster nfs defaults 0 0
$ cat /proc/mounts
...
aes1:/home/user/cluster /home/user/cluster nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.1.2,local_lock=none,addr=10.0.1.1 0 0
The NFS share physically resides on an Intel 545 Series 1 TB SATA 6 Gbps SSD, but I'm only able to get around 300 MB/s max.

The IB NIC is currently running in datagram mode.

From my other post, I now know how to change it to connected mode, but I haven't done that yet because the system is currently busy finishing up an analysis for me. (Connected mode would also let me take the MTU up from 2044 to 4096, 9216, or something like that; I'll have to play around with it.)
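
For reference, here's a rough sketch of how I'm planning to switch to connected mode and raise the MTU once the run finishes (the interface name ib0 and the ifcfg path are assumptions for my setup):

Code: Select all

# switch the IPoIB interface to connected mode at runtime (not persistent)
echo connected > /sys/class/net/ib0/mode
# connected mode allows a much larger MTU than datagram mode's 2044/4092
ip link set ib0 mtu 65520
# to make it persistent, set these in /etc/sysconfig/network-scripts/ifcfg-ib0:
#   CONNECTED_MODE=yes
#   MTU=65520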

But besides that, I was wondering if there might be other tuning parameters that I can set (e.g. TCP_PAYLOAD_SIZE?) in order to improve the NFS transfer performance.

When the system finishes its current analysis, I'll re-run dd to generate some data for everybody to review. For now, I'd appreciate general recommendations on common, high-level NFS tuning parameters that I could employ.

I tried looking up TCP_PAYLOAD_SIZE again, for example, but I didn't find it in any of the NFS tuning guides online, so I didn't know what else I could try.

I did read that some people recommend asynchronous operation, but I don't think I can do that because the NFS share is used to centrally store the results from my analyses, so synchronous writes give me some "peace of mind", even though the system is connected to a 3 kW UPS.
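
For comparison, this is roughly what the async variant of my export line would look like; I'm not planning to use it, since async acknowledges writes before they reach stable storage and a server crash could lose data:

Code: Select all

# /etc/exports -- async variant of the export above (not what I'm running)
/home/user/cluster *(rw,async,no_root_squash,no_all_squash,no_subtree_check)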

Thank you.

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: NFS tuning over 100 Gbps Infiniband

Post by chemal » 2019/08/09 15:06:15

NFS over RDMA instead of IPoIB is what you should be using if you want to optimize for performance.

https://access.redhat.com/documentation ... g#nfs-rdma
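
Roughly, the server side comes down to loading the server-side RDMA transport and adding an RDMA listener on port 20049 while nfsd is running; a minimal sketch (your paths will differ):

Code: Select all

# on the NFS server, with nfs-server already running
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist
cat /proc/fs/nfsd/portlist
# on the client
mount -t nfs -o rdma,port=20049 aes1:/home/user/cluster /home/user/cluster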

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: NFS tuning over 100 Gbps Infiniband

Post by alpha754293 » 2019/08/09 15:31:19

chemal wrote:
2019/08/09 15:06:15
NFS over RDMA instead of IPoIB is what you should be using if you want to optimize for performance.

https://access.redhat.com/documentation ... g#nfs-rdma
Is NFSoRDMA implemented in the 'Infiniband Support' software package group?

I know that the Mellanox MLNX Linux OFED v2 driver has REMOVED NFSoRDMA support, so if you could confirm whether it works here, that would be greatly appreciated.

I have confirmation of the removal directly from Mellanox via their forums.

Thank you.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: NFS tuning over 100 Gbps Infiniband

Post by TrevorH » 2019/08/09 15:49:57

Code: Select all

# cat /etc/sysctl.d/tengb.conf
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432
net.core.netdev_max_backlog = 30000
Those are the changes I make to achieve full 10Gbps connection speeds on my machines.
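
They're picked up at boot; to apply them immediately without rebooting, something like this should do:

Code: Select all

# reload just that file, or use "sysctl --system" for everything under /etc/sysctl.d
sysctl -p /etc/sysctl.d/tengb.conf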

chemal
Posts: 776
Joined: 2013/12/08 19:44:49

Re: NFS tuning over 100 Gbps Infiniband

Post by chemal » 2019/08/09 15:51:59

RH doesn't ship OFED, especially not Mellanox OFED. I've given you the link to RH's documentation for RHEL 7. The necessary packages should already be installed and you should have a file /etc/rdma/rdma.conf. It's from rdma-core.
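
If I remember the file correctly, the relevant switches in /etc/rdma/rdma.conf look roughly like this (check your own copy, the exact wording may differ):

Code: Select all

# /etc/rdma/rdma.conf (excerpt)
# Load NFSoRDMA client transport module
XPRTRDMA_LOAD=yes
# Load NFSoRDMA server transport module
SVCRDMA_LOAD=yes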

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: NFS tuning over 100 Gbps Infiniband

Post by alpha754293 » 2019/08/09 16:41:44

chemal wrote:
2019/08/09 15:51:59
RH doesn't ship OFED, especially not Mellanox OFED. I've given you the link to RH's documentation for RHEL 7. The necessary packages should already be installed and you should have a file /etc/rdma/rdma.conf. It's from rdma-core.
Thank you.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: NFS tuning over 100 Gbps Infiniband

Post by alpha754293 » 2019/08/10 05:17:42

So, I've now configured my systems to run in connected mode on the InfiniBand adapters (rather than datagram mode) and also exported the NFS share over RDMA.

(The Red Hat Storage Administration Guide actually misses something: per Mellanox, I needed to add the mount option "rdma,port=20049", which the guide doesn't mention. If you mount with the defaults, the share mounts over TCP instead of RDMA.)
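
For anyone following along, the corresponding client-side /etc/fstab entry ended up looking roughly like this (a sketch, using my paths from above):

Code: Select all

# client /etc/fstab -- mount the export over RDMA instead of TCP
aes1:/home/user/cluster /home/user/cluster nfs rdma,port=20049 0 0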

Despite that, though, writes from my slave node to the head node (which hosts the NFS share on a 6 Gbps SATA SSD) are only running at about 0.79 Gbps.

Not really sure why though.

Here are the mount details and the command that I am using:

Code: Select all

# mount | grep cluster
aes1:/home/user/cluster on /home/user/cluster type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=10.0.1.3,local_lock=none,addr=10.0.1.1)
# cd /home/user/cluster
# time -p dd if=/dev/zero of=10Gfile bs=1024k count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 101.131 s, 106 MB/s
real 101.13
user 0.01
sys 8.86
Is there a way for me to test the NFS over RDMA transfer speeds with tmpfs/ramfs mounts?

(My SATA SSDs aren't fast enough to show whether NFS over RDMA is truly working to its full potential, and I was wondering how I might test it to confirm that it is.)
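
Something like this is what I had in mind, if it makes sense: export a tmpfs scratch area from the head node and rerun the same dd against it (the /mnt/ramtest path and the 16g size are just placeholders):

Code: Select all

# on the head node (aes1): back a test export with RAM instead of the SSD
mkdir -p /mnt/ramtest
mount -t tmpfs -o size=16g tmpfs /mnt/ramtest
# fsid= is needed because tmpfs has no on-disk UUID for the NFS server to use
echo '/mnt/ramtest *(rw,sync,no_root_squash,fsid=1)' >> /etc/exports
exportfs -ra
# on the compute node: mount it over RDMA and rerun the dd test
mkdir -p /mnt/ramtest
mount -t nfs -o rdma,port=20049 aes1:/mnt/ramtest /mnt/ramtest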

Thank you.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: NFS tuning over 100 Gbps Infiniband

Post by TrevorH » 2019/08/10 11:25:38

# time -p dd if=/dev/zero of=10Gfile bs=1024k count=10240
That isn't really measuring much at all. Try it again but add oflag=direct or conv=fdatasync to your dd command (and optionally status=progress) to get more accurate results.
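
i.e. something along these lines:

Code: Select all

# bypass the page cache so dd reports the real write speed over NFS
time -p dd if=/dev/zero of=10Gfile bs=1024k count=10240 oflag=direct
# or flush all data to the server before dd reports the elapsed time
time -p dd if=/dev/zero of=10Gfile bs=1024k count=10240 conv=fdatasync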

That would appear to be capped at gigabit speeds, though I don't know whether that's a coincidence or a real thing.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: NFS tuning over 100 Gbps Infiniband

Post by alpha754293 » 2019/08/10 12:20:19

TrevorH wrote:
2019/08/10 11:25:38
# time -p dd if=/dev/zero of=10Gfile bs=1024k count=10240
That isn't really measuring much at all. Try it again but add oflag=direct or conv=fdatasync to your dd command (and optionally status=progress) to get more accurate results.

That would appear to be capped at gigabit speeds, though I don't know whether that's a coincidence or a real thing.
Thank you.

Yeah, it would appear so, but the hostname of the NFS server should point to the InfiniBand interface, and the two networks use separate address ranges (10.x.y.z for IB vs. 192.a.b.c for GbE), so I don't think it could be using the GbE interface.

(I also set their hostnames to be different.)

I'm not sure how to check and confirm that it isn't using the GbE interface (for whatever reason). System Monitor doesn't show any network activity during the test, which suggests it isn't using the GbE interface, but if there's a way to verify that, it would be greatly appreciated.
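
What I can think of is to watch the per-interface byte counters before and after the dd run, something like this (the interface name em1 and the device name mlx5_0 are just what I expect on my nodes and may differ):

Code: Select all

# per-interface byte counters; compare before and after the dd test
ip -s link show em1
ip -s link show ib0
# InfiniBand port traffic counter, read directly from sysfs
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
# shows which transport each NFS mount is actually using
nfsstat -m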

Thank you.

alpha754293
Posts: 69
Joined: 2019/07/29 16:15:14

Re: NFS tuning over 100 Gbps Infiniband

Post by alpha754293 » 2019/08/10 17:40:09

alpha754293 wrote:
2019/08/10 12:20:19
(I also set their hostnames to be different.)
Sorry, a point of clarification: by that I mean that the entries in /etc/hosts are different.

The system itself still has only one hostname.
