Friday, October 9, 2020

How to install and configure xCAT (for beginners)

Are you fed up with installing servers manually?

Are you tired of managing your cluster with DVDs and USB sticks?

Are you worried that the configuration of some servers in your HPC cluster might be inconsistent?

To find answers to these questions and to make your life easier, follow this guide and enjoy being a real sysadmin!

 Here comes…

xCAT tutorial for beginners…

 

Table of contents

xCAT introduction

Downloading and installing xCAT

Configuring xCAT

Node Deployment (Installation)

Thursday, October 8, 2020

Introduction

Title: How to install and configure xCAT - Tutorial

OS: CentOS Linux release 7.8.2003 (Core)

Kernel Version: 3.10.0-1127.13.1.el7.x86_64

Software: xCAT

Software Version: 2.13.7

User Account: Use ‘sudo’ (recommended)

Reboot Required: No

Requirements for your ‘proof of concept’ (POC) environment:

Servers: 1 x management node & 1 x compute node (actual hardware or VMs, e.g. using VirtualBox)

Cluster Network: 10.0.0.0/24 (Netmask: 255.255.255.0)

Hostname      IP Address    Netmask          Role

omaster       10.0.0.11     255.255.255.0    xCAT management server

compute001    10.0.0.101    255.255.255.0    Compute node

 

Note: For actual hardware (servers) connect them over ethernet (ideally to a Gigabit switch)

1        What is xCAT?

xCAT is a cluster deployment and management tool which can be used to easily manage different types of servers with different operating systems (OS). It’s a very powerful tool for managing HPC clusters. For more information check this out:

              https://xcat-docs.readthedocs.io/en/latest/overview/index.html



Wednesday, October 7, 2020

xCAT Installation

1.1         Downloading xCAT

 mkdir -p /tools/software/sources/xcat

export XCAT_SOURCE_DIR=/tools/software/sources/xcat

cd ${XCAT_SOURCE_DIR}

wget https://hpc.lenovo.com/downloads/17e/xcat-2.13.7.lenovo2_confluent-1.7.2_lenovo-confluent-0.7.1-el7.tar.bz2

tar xfj xcat-2.13.7.lenovo2_confluent-1.7.2_lenovo-confluent-0.7.1-el7.tar.bz2

 

1.2         Installing xCAT

Follow this procedure to install xCAT:

cd ${XCAT_SOURCE_DIR}/lenovo-hpc-el7

./mklocalrepo.sh

yum -y install xCAT

Note: Make sure the OS repositories are accessible by YUM (either online repos or local)

Verify xCAT installation.

source /etc/profile.d/xcat.sh

tabdump site

If the installation succeeded, the above command returns output similar to the following.

#key,value,comments,disable

"domain","cluster.local",,

"blademaxp","64",,

"fsptimeout","0",,

"installdir","/install",,

"ipmimaxp","64",,

"ipmiretries","3",,

"ipmitimeout","2",,

"consoleondemand","no",,

"master","10.0.0.11",,

"forwarders","10.125.30.25,10.125.30.51,10.98.48.39",,

"nameservers","10.0.0.11",,

"maxssh","8",,

"ppcmaxp","64",,

"ppcretry","3",,

"ppctimeout","0",,

"powerinterval","0",,

"syspowerinterval","0",,

"sharedtftp","1",,

"SNsyncfiledir","/var/xcat/syncfiles",,

"nodesyncfiledir","/var/xcat/node/syncfiles",,

"tftpdir","/tftpboot",,

"xcatdport","3001",,

"xcatiport","3002",,

"xcatconfdir","/etc/xcat",,

"timezone","Europe/London",,

"useNmapfromMN","no",,

"enableASMI","no",,

"db2installloc","/mntdb2",,

"databaseloc","/var/lib",,

"sshbetweennodes","ALLGROUPS",,

"dnshandler","ddns",,

"vsftp","n",,

"cleanupxcatpost","no",,

"dhcplease","43200",,

"auditnosyslog","0",,

"xcatsslversion","TLSv1",,

"auditskipcmds","ALL",,

  


Tuesday, October 6, 2020

xCAT Configuration

xCAT configuration is saved in a database which consists of many tables (the command ‘tabdump’ can show a list of all tables). The default DB xCAT uses is ‘sqlite’. This can be changed to MySQL (MariaDB), PostgreSQL or others.

Different tables in the xCAT DB serve different purposes. A handful of common tables need to be configured for all kinds of hardware, while some tables are only used in specific environments. For example, the ‘ppc’ table is only used with PowerPC based hardware and the ‘prodkey’ table is only used when installing MS Windows.
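
Since ‘tabdump’ prints every table as quoted CSV, single values can be pulled out with standard text tools instead of reading the whole dump. Here is a minimal sketch of that idea. ‘site_value’ is a hypothetical helper name (it is not an xCAT command), and it assumes the key/value layout of the ‘site’ table shown earlier.

```shell
#!/bin/sh
# site_value KEY
# Hypothetical helper (not part of xCAT): reads tabdump-style CSV on stdin
# and prints the value stored for KEY.
site_value() {
    awk -F'","' -v key="$1" '
        /^#/ { next }                      # skip the header line
        {
            k = $1; sub(/^"/, "", k)       # strip the leading quote from the key
            v = $2; sub(/".*$/, "", v)     # drop the closing quote and trailing fields
            if (k == key) print v
        }'
}

# Demo on a few lines copied from the site table output.
sample='#key,value,comments,disable
"domain","cluster.local",,
"master","10.0.0.11",,
"forwarders","10.125.30.25,10.125.30.51,10.98.48.39",,'

printf '%s\n' "$sample" | site_value master
printf '%s\n' "$sample" | site_value forwarders
```

On the management node the same function could be fed from the real table, for example ‘tabdump site | site_value master’.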

It’s time to configure common xCAT tables. Let’s do it.

tabedit site

Make sure you edit these lines. If these lines are not present then add new ones.

"dhcpinterfaces","10.0.0.11|enp0s3",,

"dnsinterfaces","10.0.0.11|enp0s3",,

"master","10.0.0.11",,

"nameservers","10.0.0.11",,

"domain","cluster.local",,

Note: You need to use the network interface which is configured with the IP ‘10.0.0.11’. In my case it is ‘enp0s3’.
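
If you are not sure which interface carries that address, the output of ‘ip -o -4 addr show’ can be filtered for it. The sketch below shows one way to do that. ‘iface_for_ip’ is a hypothetical helper name and the sample output is made up for illustration, so the interface names on your machine will differ.

```shell
#!/bin/sh
# iface_for_ip IP
# Hypothetical helper: reads "ip -o -4 addr show" output on stdin and prints
# the name of the interface that carries IP.
iface_for_ip() {
    awk -v ip="$1" '{
        split($4, cidr, "/")               # $4 looks like 10.0.0.11/24
        if (cidr[1] == ip) print $2        # $2 is the interface name
    }'
}

# Demo on made-up sample output (interface names differ per machine).
sample='1: lo    inet 127.0.0.1/8 scope host lo
2: enp0s3    inet 10.0.0.11/24 brd 10.0.0.255 scope global enp0s3
3: enp0s8    inet 10.0.3.15/24 brd 10.0.3.255 scope global dynamic enp0s8'

printf '%s\n' "$sample" | iface_for_ip 10.0.0.11
```

On the management node itself you would run ‘ip -o -4 addr show | iface_for_ip 10.0.0.11’ and use the printed name in ‘dhcpinterfaces’ and ‘dnsinterfaces’.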

tabedit networks

The table should look like this:

#netname,net,mask,mgtifname,gateway,dhcpserver,tftpserver,nameservers,ntpservers,logservers,dynamicrange,staticrange,staticrangeincrement,nodehostname,ddnsdomain,vlanid,domain,mtu,comments,disable

"cluster","10.0.0.0","255.255.255.0","enp0s3","<xcatmaster>",,"<xcatmaster>",,,,"10.0.0.151 10.0.0.175",,,,,,,"1500",,

It’s always a good idea to change ‘netname’ to a name which is easy to understand. Also add a ‘dynamicrange’ for DHCP. Other fields are normally pre-filled.

Useful note: This is just a POC environment, so we don’t need IPMI or other networks such as 10G or IPoIB (InfiniBand). In the real world these networks are normally required.
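
It is generally wise to keep the static node addresses out of the DHCP dynamic range so that the two never hand out the same IP. This is a general DHCP precaution rather than something xCAT enforces for you, as far as I know. A small sketch for checking it, using the values from this tutorial (‘ip_to_int’ and ‘in_range’ are hypothetical helper names):

```shell
#!/bin/sh
# ip_to_int IP: print a dotted IPv4 address as a single integer.
ip_to_int() {
    echo "$1" | awk -F. '{ printf "%.0f\n", $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 }'
}

# in_range IP START END: succeed if IP lies between START and END (inclusive).
in_range() {
    ip=$(ip_to_int "$1")
    lo=$(ip_to_int "$2")
    hi=$(ip_to_int "$3")
    [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}

# Values from this tutorial: the static address of compute001 and the
# dynamicrange set in the networks table.
if in_range 10.0.0.101 10.0.0.151 10.0.0.175; then
    echo "compute001 collides with the dynamic range"
else
    echo "compute001 is outside the dynamic range"
fi
```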

 tabedit hosts

Add all nodes with their IP addresses. This will help us in generating the ‘/etc/hosts’ file.

  #node,ip,hostnames,otherinterfaces,comments,disable

"omaster","10.0.0.11",,,,

"compute001","10.0.0.101",,,,

tabedit noderes

#node,servicenode,netboot,tftpserver,tftpdir,nfsserver,monserver,nfsdir,installnic,primarynic,discoverynics,cmdinterface,xcatmaster,current_osimage,next_osimage,nimserver,routenames,nameservers,proxydhcp,syslog,comments,disable

"compute",,"xnba",,,,,,,,,,,,,,,,,,,

You can use ‘pxe’ instead of ‘xnba’ but ‘xnba’ is recommended as this supports more options than ‘pxe’

tabedit passwd

#key,username,password,cryptmethod,authdomain,comments,disable

"system","root","$6$9FdrtZiCetX4cV2G$JJlywIQZByWwVmiaGPAW0ZSChOTU0VIa5MybqMwwj8fr9.Fg9BGrcu1Nq/PpyVBt8r4shXPxzSwi5BkdtYwZq1",,,,

The password can be in plain text or encrypted. An encrypted password can be copied from the ‘/etc/shadow’ file.
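
Copying the hash by hand is error-prone because it is long and full of ‘$’ signs. The sketch below shows one way to extract it with awk. ‘shadow_hash’ is a hypothetical helper name, and the sample entry is a made-up placeholder, not a real hash.

```shell
#!/bin/sh
# shadow_hash USER
# Hypothetical helper: reads /etc/shadow-style lines on stdin and prints the
# password hash (second colon-separated field) for USER.
shadow_hash() {
    awk -F: -v user="$1" '$1 == user { print $2 }'
}

# Demo on a made-up entry; the hash below is a placeholder, not a real one.
# Single quotes keep the shell from expanding the $ signs.
sample='root:$6$examplesalt$examplehash:18540:0:99999:7:::
bin:*:18353:0:99999:7:::'

printf '%s\n' "$sample" | shadow_hash root
```

On the management node the input would come from ‘/etc/shadow’, which only root can read, for example ‘sudo cat /etc/shadow | shadow_hash root’.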

tabedit chain

#node,currstate,currchain,chain,ondiscover,comments,disable

"compute",,,,"nodediscover",,

This was the minimal table configuration. To deploy nodes, an OS ISO needs to be copied to the omaster node. Run this command to do so:

copycds CentOS-7-x86_64-Everything-2003.iso

This will copy the OS into the /install directory. The ISO doesn’t have to be the ‘Everything’ edition; a normal ISO is good enough for installation.

Here you go. This is the minimum configuration we need.


Monday, October 5, 2020

xCAT Node Deployment

                                                                                                                         

1.1       Adding nodes into xCAT DB

Follow these steps to add ‘compute001’ into the xCAT DB.

nodeadd omaster groups=mgmt

nodeadd compute001 groups=compute,node

 

makehosts

makedhcp -n

makedhcp -a

makedns -n

makedns -a

 

Once done, check that DNS is working correctly.

host compute001

Expected output:

compute001.cluster.local has address 10.0.0.101

Hmmm, is your name resolution not working? Check if your ‘/etc/resolv.conf’ file is correct.

It should look similar to this:

search cluster.local

nameserver 10.0.0.11

nameserver 10.0.3.2

nameserver 10.125.30.25

The bottom two lines can be different as they reflect your own environment.
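
The order matters here because the resolver tries the nameservers in the order they are listed, so the xCAT server should come first. A quick sketch for checking that (‘first_nameserver’ is a hypothetical helper name, and the sample is the file content shown above):

```shell
#!/bin/sh
# first_nameserver
# Hypothetical helper: reads resolv.conf content on stdin and prints the
# first nameserver entry, which is the one the resolver tries first.
first_nameserver() {
    awk '$1 == "nameserver" && !seen { print $2; seen = 1 }'
}

sample='search cluster.local
nameserver 10.0.0.11
nameserver 10.0.3.2
nameserver 10.125.30.25'

if [ "$(printf '%s\n' "$sample" | first_nameserver)" = "10.0.0.11" ]; then
    echo "xCAT DNS is queried first"
else
    echo "xCAT DNS is not the first nameserver"
fi
```

Against the real file this becomes ‘first_nameserver < /etc/resolv.conf’.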

 

1.2       Node Discovery

It’s time to have real fun. We need to add a new node and then deploy it with CentOS 7.

There are three ways you can add node MAC addresses into xCAT.

1)      Add MAC addresses manually

2)      Discovering nodes using the ‘Sequential Node Discovery’ method

3)      Auto discovery

 

1)      Add MAC addresses manually

The easiest method to add nodes in xCAT is to add their MAC addresses into the ‘mac’ table.

If you are using a VM then it’s pretty easy to get the MAC address of the node (in my case ‘compute001’ has the MAC address ‘08:00:27:E9:BB:DB’).
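
A mistyped MAC address is an easy way to end up with a node that never gets a DHCP answer, so it can be worth checking the format before entering it. Here is a minimal sketch that accepts the colon-separated form used in this tutorial. ‘valid_mac’ is a hypothetical helper name, not an xCAT command.

```shell
#!/bin/sh
# valid_mac MAC: succeed if MAC is six colon-separated pairs of hex digits.
valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# The first value is the one from this tutorial; the other two are
# deliberately broken examples.
for mac in 08:00:27:E9:BB:DB 08:00:27:E9:BB 08-00-27-E9-BB-DB; do
    if valid_mac "$mac"; then
        echo "$mac looks valid"
    else
        echo "$mac is malformed"
    fi
done
```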

To add the MAC follow this:

tabedit mac

The table should look like this:

#node,interface,mac,comments,disable

"compute001",,"08:00:27:E9:BB:DB",,

2)      Discovering nodes using the ‘Sequential Node Discovery’ method

Follow these steps

·        Keep ‘compute001’ powered off

·        Type ‘nodediscoverstart noderange=compute001’ on the omaster node

·        Turn ‘compute001’ on and wait

·        If you have access to its console, keep watching it. The node should get an IP from DHCP and xCAT should start discovering it.

·        Upon success, ‘tabdump mac’ should show that its MAC address has been added automatically.

·        Type ‘nodediscoverls’ to see what xCAT has discovered and type ‘nodediscoverstop’ to stop the process.
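
Instead of re-running ‘tabdump mac’ by hand while waiting for the node, the check can be polled in a loop. The sketch below shows a generic retry helper. ‘wait_for’ and ‘mac_known’ are hypothetical names, and the 10 second interval and 30 attempts in the comment are arbitrary choices, not xCAT recommendations.

```shell
#!/bin/sh
# wait_for TRIES DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to TRIES times, sleeping DELAY seconds
# between attempts, and succeeds as soon as COMMAND succeeds.
wait_for() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Succeeds once the mac table mentions compute001 (needs xCAT's tabdump).
mac_known() {
    tabdump mac 2>/dev/null | grep -q '"compute001"'
}

# On the management node one might poll every 10 seconds for up to 5 minutes:
#   wait_for 30 10 mac_known && echo "compute001 discovered"

# Harmless demo that works anywhere: the command succeeds on the first try.
wait_for 3 0 true && echo "condition met"
```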

 

3)      Auto discovery

Auto discovery process will be covered in ‘Advanced Tutorial’. Watch this space…

   

1.3            xCAT OS images

xCAT requires an OS image to deploy a system. An OS image is just information in the xCAT DB that helps xCAT create a kickstart file (for Red Hat, CentOS etc.) for the installation.

When the ‘copycds’ command is executed, it automatically creates some images by default, but it’s always good to create a custom one for our purpose.

You can list all OS images using this command:

lsdef -t osimage

Use these lines to create an OS image named ‘compute’:

mkdef -t osimage compute imagetype=linux osarch=x86_64 \

osname=Linux osvers=centos7.8 \

otherpkgdir=/install/post/otherpkgs/centos7.8/x86_64 \

pkgdir=/install/centos7.8/x86_64 \

pkglist=/install/templates/compute/pkglist \

profile=compute provmethod=install \

template=/install/templates/compute/tmpl \

partitionfile=/install/templates/compute/partfile \

otherpkglist=/install/templates/compute/otherpkgs.pkglist \

synclists=/install/templates/compute/synclist

 

Now it’s time to create the files referenced in the image definition above.

Luckily the xCAT installation comes with most of these files, so it’s just a matter of copying them to our custom directory ‘/install/templates/compute’.

Let’s get it done…

mkdir -p /install/templates/compute

rsync -avp /opt/xcat/share/xcat/install/centos/compute.centos7.tmpl /install/templates/compute/tmpl

rsync -avp /opt/xcat/share/xcat/install/centos/compute.centos7.pkglist /install/templates/compute/pkglist

touch /install/templates/compute/otherpkgs.pkglist

touch /install/templates/compute/synclist

 

The function of the ‘otherpkgs.pkglist’ and ‘synclist’ files will be covered in the advanced tutorial. For now the existence of these files is sufficient for xCAT to install ‘compute001’.

The file ‘/install/templates/compute/partfile’ is used by xCAT to create custom disk partitioning for nodes. This file can be modified according to your requirements and the disk sizes.

 

part /boot/efi --size=100 --fstype=efi

part /boot --fstype=ext4 --size=500

part pv.1 --grow --size=1

volgroup system --pesize=4096 pv.1

logvol swap --name=swap --vgname=system --size=2048

logvol / --fstype=ext4 --name=root --vgname=system --grow --size=4096
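
Before moving on, it may be worth confirming that every file named in the image definition really exists, since a missing one is easy to overlook. A minimal sketch of such a check (‘missing_files’ is a hypothetical helper name; the file names are the ones used in the ‘mkdef’ command earlier):

```shell
#!/bin/sh
# missing_files DIR NAME...
# Hypothetical helper: prints every NAME that does not exist as a regular
# file in DIR and fails if at least one is missing.
missing_files() {
    dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name"
            status=1
        fi
    done
    return "$status"
}

# File names taken from the mkdef command for the 'compute' image.
if missing_files /install/templates/compute tmpl pkglist partfile otherpkgs.pkglist synclist; then
    echo "all osimage files are present"
fi
```

On a machine without the ‘/install/templates/compute’ directory this simply lists all five files as missing.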

 

1.4            Nodes Installation

nodeset compute001 osimage=compute

Reset ‘compute001’ and make sure it boots over the network to receive DHCP and TFTP information from the xCAT server (omaster).

Monitor the screen of the node and make sure it completes the installation. It should automatically boot up and you should be able to SSH into it without any password.

Wow! You have managed to install a compute node using xCAT. That’s an achievement!

For further reading and understanding please check out my ‘xCAT Tutorial for advanced users’ (coming soon…)

 


 

Friday, August 28, 2020

How to install Mellanox OFED IB Drivers in CentOS Linux 7 (unattended)

Title: Unattended Mellanox Infiniband OFED Installation - Tutorial

OS: CentOS Linux release 7.8.2003 (Core)

Kernel Version: 3.10.0-1127.13.1.el7.x86_64

Compiler: gcc v4.8.5

Software: Mellanox OFED

Software Version: 5.0-2.1.8.0

User Account: Use ‘sudo’ (recommended)

Reboot Required: Yes (recommended)

·        Create a directory ‘/tools/apps/sources/mellanox/ofed/5.0-2.1.8.0’ (Please create directory according to your requirements)

o   mkdir -p /tools/apps/sources/mellanox/ofed/5.0-2.1.8.0

o   export MLNX_ROOT_DIR=/tools/apps/sources/mellanox/ofed/5.0-2.1.8.0 

·        Download the latest Mellanox OFED .tgz file for your OS distribution and architecture from Mellanox website (https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) and place it into the directory created in the previous step. In my case the downloaded file is ‘MLNX_OFED_LINUX-5.0-2.1.8.0-rhel7.8-x86_64.tgz’.

·        Navigate into the directory and extract the tarball.

o   cd ${MLNX_ROOT_DIR}

o   tar xvfz MLNX_OFED_LINUX-5.0-2.1.8.0-rhel7.8-x86_64.tgz 

·        Navigate into ‘MLNX_OFED_LINUX-5.0-2.1.8.0-rhel7.8-x86_64’ directory

o   export MLNX_SOURCE=MLNX_OFED_LINUX-5.0-2.1.8.0-rhel7.8-x86_64

o   cd ${MLNX_ROOT_DIR}/${MLNX_SOURCE} 

(If your kernel version is not listed in ‘.supported_kernels’ file then you will have to add support for your kernel. To accomplish this, please follow the steps under ‘Troubleshooting’ section below)

·        The Mellanox package comes with some sample configuration files which can be used to install Mellanox OFED unattended. The samples can be found in the ‘${MLNX_ROOT_DIR}/${MLNX_SOURCE}/docs/conf’ directory, namely ofed-basic.conf, ofed-hpc.conf and ofed-all.conf.

·        Install the necessary packages required for Mellanox OFED installations

o   yum install tcl tk 

·        To install Mellanox OFED stack run the following script with the necessary flags for unattended installation. You can use configuration file according to your requirements.

o   ./mlnxofedinstall -c docs/conf/ofed-hpc.conf --force

·        After the above command, the installation will be completed. Restart the system so that it may load the device drivers on startup.
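
The kernel check mentioned in the steps above (whether your kernel appears in ‘.supported_kernels’) can be scripted before running the installer. This is only a sketch: ‘kernel_supported’ is a hypothetical helper name, and it assumes the file lists one kernel version per line.

```shell
#!/bin/sh
# kernel_supported FILE [KERNEL]
# Hypothetical helper: succeeds if KERNEL (default: the running kernel)
# appears as a whole line in FILE. Assumes one kernel version per line.
kernel_supported() {
    file=$1
    kernel=${2:-$(uname -r)}
    [ -f "$file" ] && grep -Fxq "$kernel" "$file"
}

# Run this from inside the extracted MLNX_OFED_LINUX directory.
if kernel_supported .supported_kernels; then
    echo "running kernel is listed"
else
    echo "running kernel is not listed (or .supported_kernels was not found here)"
fi
```

If the kernel is not listed, the ‘Troubleshooting’ section below describes how to add support for it.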

How to start infiniband related services?

The installation provides two main services:

·        openibd: Infiniband related drivers

·        opensmd: Subnet manager

An instance of a ‘Subnet Manager’ is required on an infiniband fabric (network) for machines to communicate. This service can either be started on Infiniband switches (subject to having the ‘Subnet Manager’ support on the switches) or on servers connected to the same infiniband fabric.

The ‘openibd’ service needs to be running on all servers, whereas the ‘opensmd’ service only needs to be started on a single server/switch or on a few core servers/switches.

·        Start and enable ‘openibd’ service

o   systemctl start openibd

o   systemctl enable openibd 

·        Start and enable ‘opensmd’ service

o   systemctl start opensmd

o   systemctl enable opensmd

 Troubleshooting:

Adding Additional Kernel Support

Sometimes a downloaded .tgz file doesn’t come with the RPMs which support the latest kernel. To add support for a different kernel follow these steps:

o   Navigate into the Mellanox OFED source directory and run these commands

      •   cd MLNX_OFED_LINUX-5.0-2.1.8.0-rhel7.8-x86_64
      •   yum install python-devel
      •   ./mlnx_add_kernel_support.sh -m . --make-tgz

The above command will re-compile Mellanox OFED stack and will create a ‘tgz’ file in ‘/tmp’ directory. This will be the tarball which can be used to install the OFED stack on machines with a specific kernel.

 
