With the release of Openfiler 1.1, there is now preliminary support for creating a high-availability replicated storage solution that can function at both a LAN and a WAN level. The replication is one-to-one, not one-to-many: the primary live server copies each individual data block written to local disk to a remote secondary server over a LAN or WAN link. The Open Source Distributed Replicated Block Device (DRBD) software is used to achieve this. The other component used to achieve high availability is Heartbeat, from the Linux High Availability project. Heartbeat monitors the uptime and availability of the individual nodes that make up the cluster and the resources they provide. In the context of Openfiler NAS, these are the file serving resources such as httpd, smb and NFS which, along with the service IP address, can be migrated from the primary active server to the secondary passive server in the event of a hardware failure in the primary.
Unlike shared storage clusters that make use of expensive external fibre channel or SCSI RAID arrays, DRBD provides a very low-cost alternative that confers the added advantage of not having to locate the primary and secondary server in the same rack, building or even country. All replication is done using standard TCP/IP networking. In most setups, however, the primary and secondary server will be located within relatively close proximity to one another, e.g. a college campus or large office building within a single network domain. This guide is geared toward such a scenario, where failover of the service network IP address is not constrained by complex routing issues.
All clustering and high-availability configuration is done at the command line using text configuration files. The administrator is expected to be fairly confident with the idiosyncrasies of Linux system administration.
In a replicated cluster deployment scenario, there are three types of participants.
- Client node/machine: this is a workstation or server on the network accessing services on the Openfiler NAS via one of the supported protocols
- Primary node/machine: this is the live Openfiler NAS server that is normally the one providing operational access to data.
- Secondary node/machine: this is the passive Openfiler NAS server that is deployed as standby to provide services to client machines in the event of a failure of the primary node.
The hardware and software components required for deploying a highly-available Openfiler solution are outlined below.
The 1.1 release of Openfiler is the first release with full cluster and replication support. Previous versions and releases do not have cluster support and there are no plans to backport support to older releases.
The Open Source Distributed Replicated Block Device (DRBD) software is used to provide a mechanism of copying individual disk data blocks from a live server to a secondary server usually located at a remote site for disaster-recovery purposes. DRBD is the equivalent of RAID 1 except that the data blocks are sent over a network to disks located in a remote server. DRBD can work over VPN links for added security if a dedicated secure link is not available between the primary active server and secondary passive server. DRBD is *not* a filesystem and does not support simultaneous access to disk blocks by the primary and secondary nodes.
DRBD works on the premise that only one server or node will be accessing the disk blocks at any given point in time. The primary node, the node with read and write access to the DRBD block device, is the source, and the secondary node is the target. Any data written to the source is sent over a dedicated network link to the target and written to the local disk on the target before being written to the local disk on the source and the write acknowledged to the application performing the write operation. Data reads, however, are done locally from the primary node. In the event of a failure of the primary node, even one as catastrophic as total destruction by fire, the data is still safe and available to be served out from the secondary node.
DRBD is an abstraction layer that sits between the local disks and the upper-level volume managers and filesystems. It is a form of virtualisation in which the local disks are no longer accessed directly by the filesystem or volume manager. Instead, operations are intercepted by the DRBD layer and, depending on the operation, either one of three possible replication protocols is used to redirect it or the operation is passed straight through to the local disk. If the operation is a read request, the DRBD layer accesses the local disk, reads the required blocks and sends the data back up to the requesting application. If the operation is a write request, one of the following protocols is used to redirect the write:
- Protocol A: with this protocol, the data is immediately committed to local disk and the write operation is acknowledged as complete; meanwhile the data is sent out over the network to be written to the remote disk.
- Protocol B: with this protocol, the data is committed to local disk and sent out over the network to the remote / secondary node, which acknowledges receipt of the data; the write operation is then acknowledged as complete while the remote node commits the write to its disk.
- Protocol C: this is the most stringent of the protocols and the default, recommended and only commercially supported protocol for Openfiler. With this protocol, the write operation is not acknowledged until it has been committed to the remote disk (target) first and then the local disk (source).
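As a purely illustrative fragment (the resource name is hypothetical), the protocol is selected per resource in /etc/drbd.conf; a full example configuration appears later in this guide:

resource example_resource {
    protocol C;    # A = asynchronous, B = remote has received, C = remote has committed to disk
    # ... remaining options (device, disk, addresses) as in the full /etc/drbd.conf shown later
}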
To gain a deeper understanding of DRBD and its modus operandi, please visit www.drbd.org.
Heartbeat is an Open Source application that provides a mechanism for monitoring resources in a cluster and manages the process of automatically failing over these resources between nodes participating in the cluster. It is a “cluster manager” for all intents and purposes. A resource is any service, application or system component that needs to be kept continuously available to the clients accessing it, regardless of the state, healthy or otherwise, of the host/node providing the resource. Examples of resources that can be managed by heartbeat include file services such as smb/nfs/ftp, filesystems such as ext3, and IP addresses. The aptly named heartbeat sends a network ping between nodes in the cluster at predetermined intervals to ensure that the nodes are “alive” and in a serviceable state. Should a ping not return within the expected timeframe, and subsequent pings fail up to a set threshold, the unresponsive node is assumed to be “dead” and a prescribed chain of events is initiated to acquire the resources that were hitherto being provided by the now failed node.
In an Openfiler replicated cluster, heartbeat is used to manage all the file services, the LVM volumes and filesystems as well as the state of the replicated block device, DRBD. In the event of a failure of the primary node providing services to clients on the network, heartbeat will take care of ensuring service and resource bringup on the secondary node and can even be configured to automatically switch services and resources back to the original primary node when that node has been brought back to a serviceable state. Heartbeat ensures that resource bringup upon a failover instance is coherent and as transparent as possible to the client machine. In many cases, the client will even be oblivious to the fact that services have been failed over. At most, during a failover operation, the client machine will experience a short blip where access to data is suspended.
To learn more about Heartbeat, please visit www.linux-ha.org.
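To make the notion of a resource more concrete, the following is a hedged, purely illustrative example of a heartbeat v1 haresources entry (hypothetical node name and resource values; in an Openfiler cluster the /etc/ha.d/haresources file is generated automatically, as described later in this guide):

# preferred-node  resource1  resource2  ...   (parameters separated by "::")
drbd1.example.com IPaddr::10.1.3.65/24 drbddisk:: Filesystem::/dev/drbd0::/cluster_metadata::ext3 smb nfs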
Two identical server machines are required to act as primary and secondary nodes in the cluster. While it is possible to have systems with different specifications perform these roles, it is not recommended practice. At the very least the disk and network device configurations should be identical. It is possible to take advantage of network device bonding to increase the network bandwidth between the nodes in the cluster and thereby increase overall write performance. If it can be achieved, a dedicated network link between the two nodes, bypassing network hubs and switches, is preferred. A minimum of two network interfaces is required for a cluster setup. Better performance and resiliency will be achieved, however, with five network interfaces: two interfaces, bonded, for the replication link between the primary and secondary nodes; two more, again bonded, for service provision; and the fifth dedicated to heartbeat pings.
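As a hedged illustration of the bonding idea (interface names, addresses and bonding mode are assumptions and will vary with your hardware and base OS), a replication bond could be brought up manually along these lines:

# Load the bonding driver (round-robin mode, link monitoring every 100ms)
modprobe bonding mode=balance-rr miimon=100
# Give the bond the replication IP and enslave the two replication NICs
ifconfig bond0 10.1.5.1 netmask 255.255.255.0 up
ifenslave bond0 eth1 eth2

For a permanent configuration the bond would normally be defined in the base OS's network configuration files rather than brought up by hand.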
This section deals with configuring two Openfiler NAS appliances to participate in a highly available cluster deployment.
In this section it is assumed that there are two identically specified nodes participating in the cluster and that each node has only two network interfaces. It is also assumed that the cluster nodes are on the same logical network. A final assumption is that the system disk (where Openfiler is installed) and the data disks (for storing data) are physically separate. Please note that the steps outlined below work just as well for a single physical data disk as for a hardware or software RAID device. The sequence of processes is as follows:
- Configure hardware and networking
- Install cluster software
- Configure secure link between cluster nodes
- Configure DRBD
- Configure heartbeat
- Start DRBD
- Create LVM Volume Group
- Prepare systems for cluster operation
- Start cluster
To configure the hardware and network, perform the following steps;
- Place the cluster nodes in their final physical locations, power them up, and decide which is to be the primary node
- Connect one network interface (the service interface) on each node respectively to the service network (network switch/hub)
- Connect the second network interface (replication interface) on the primary node to the second network interface on the secondary node using a crossover ethernet cable
- Assign a routeable IP address to the respective service interfaces on the nodes
- Assign a private IP address to the respective replication interfaces on the nodes
- Verify that both network interfaces are active and working on both nodes, i.e. the service IPs can be pinged from another host on the network and the replication IPs can be pinged from the primary to the secondary and vice versa (a sketch of these checks follows this list)
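A hedged sketch of the verification step above; the 10.1.5.x replication addresses match the drbd.conf shown later in this guide, while the service IPs here are placeholders to be replaced with your own:

# From another host on the network: check the service IPs of both nodes
ping -c 3 10.1.3.61
ping -c 3 10.1.3.62
# From the primary node: check the secondary node's replication IP (and vice versa)
ping -c 3 10.1.5.2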
To install the cluster software, perform the following steps;
- As the root user, install the heartbeat and drbd RPM packages located on the Openfiler distribution CD in the /EXTRAS/CLUSTERING/[drbd | heartbeat] directories. This step must be performed on both nodes (a hedged install sketch follows this list).
- Edit /etc/modules.conf and add aliases for the DRBD kernel module as follows:
alias block-major-147 drbd
alias block-major-43 drbd
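A hedged sketch of the package installation step, assuming the distribution CD is mounted at /mnt/cdrom (the mount point is an assumption, and the exact RPM file names will vary by release):

# Run on both nodes
mount /dev/cdrom /mnt/cdrom
rpm -ivh /mnt/cdrom/EXTRAS/CLUSTERING/drbd/*.rpm
rpm -ivh /mnt/cdrom/EXTRAS/CLUSTERING/heartbeat/*.rpm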
To configure a secure link between the two cluster nodes, perform the following steps;
- Log into the primary node as the root superuser either at a local terminal or via ssh.
- Create an ssh key for the root account (hitting the enter key at all prompts) as follows:
[root@midgard ~]# ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
13:7a:c6:08:9e:eg:2e:85:ab:6f:2f:61:43:f3:7a:42 root@midgard.example.com
- Copy the /root/.ssh/id_dsa.pub file to the secondary node and append its contents to the /root/.ssh/authorized_keys2 file on the secondary node (a sketch of this step follows this list)
- Repeat steps 1 – 3 on the secondary node, with the target being the primary node
- Verify that the keys have been set up correctly on both nodes by logging into the root account via ssh from the primary node to the secondary and vice versa. If a challenge is presented for a password, return to step 1 and repeat on both nodes.
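A hedged sketch of the key-copying step; replace "secondary" with the secondary node's hostname or replication IP (you will be prompted for the root password this one time):

# Run on the primary node
scp /root/.ssh/id_dsa.pub root@secondary:/tmp/id_dsa.pub
ssh root@secondary "cat /tmp/id_dsa.pub >> /root/.ssh/authorized_keys2 && rm /tmp/id_dsa.pub"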
At least two DRBD block devices are required. One for the cluster metadata and one for application data. The following steps must be performed identically on both nodes. To configure DRBD, please proceed as follows:
- Using fdisk or parted create a partition of at least 300MB for the cluster metadata volume on the data disk.
- Using fdisk or parted create a partition of your desired size to be used for the application data (a parted sketch of these partitioning steps follows this list).
- Edit the /etc/drbd.conf file (the file must be identical on both nodes) using the following as a guideline:
global {
    minor-count 9;
}

resource cluster_metadata {
    protocol C;
    # what should be done in case the cluster starts up in
    # degraded mode, but knows it has inconsistent data.
    incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;    # 2 minutes.
    }
    disk {
        on-io-error panic;
    }
    syncer {
        rate 10M;
        group 1;
        al-extents 257;
    }
    on drbd1.example.com {
        device /dev/drbd0;
        disk /dev/hda5;
        address 10.1.5.1:7788;
        meta-disk internal;
    }
    on drbd2.example.com {
        device /dev/drbd0;
        disk /dev/hda5;
        address 10.1.5.2:7788;
        meta-disk internal;
    }
}

resource vg0_drbd {
    protocol C;
    incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
    startup {
        wfc-timeout 0;           ## Infinite!
        degr-wfc-timeout 120;    ## 2 minutes.
    }
    disk {
        on-io-error panic;
    }
    syncer {
        rate 10M;
        group 1;                 # sync concurrently with r0
    }
    on drbd1.example.com {
        device /dev/drbd1;
        disk /dev/hda6;
        address 10.1.5.1:7789;
        meta-disk internal;
    }
    on drbd2.example.com {
        device /dev/drbd1;
        disk /dev/hda6;
        address 10.1.5.2:7789;
        meta-disk internal;
    }
}
- Create the cluster metadata mount point;
mkdir /cluster_metadata
- Initialise DRBD devices
drbdadm adjust cluster_metadata
drbdadm adjust vg0_drbd
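For reference, a hedged parted sketch of the partitioning in steps 1 and 2 above. The device name, partition type and offsets are placeholders; fdisk works equally well, and the resulting partitions must match the disk entries in /etc/drbd.conf:

# Example only -- /dev/hda and the offsets (in MB) are placeholders
parted /dev/hda mkpart logical 0 300        # ~300MB for cluster_metadata (becomes e.g. /dev/hda5)
parted /dev/hda mkpart logical 300 20300    # application data partition (becomes e.g. /dev/hda6)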
Heartbeat makes use of three configuration files. The /etc/ha.d/ha.cf and /etc/ha.d/authkeys files are configured manually by the administrator, while the /etc/ha.d/haresources file is managed by Openfiler. The steps to configure heartbeat must be performed identically on both nodes. To configure heartbeat, proceed as follows;
- Edit /etc/ha.d/authkeys
auth 2
2 crc
- Edit /etc/ha.d/ha.cf
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
bcast eth1
keepalive 5
warntime 10
deadtime 120
initdead 120
udpport 694
auto_failback off
node drbd1.example.com
node drbd2.example.com
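One additional note on the /etc/ha.d/authkeys file above: heartbeat expects this file to be readable by root only and will typically refuse to start otherwise, so on both nodes:

chmod 600 /etc/ha.d/authkeys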
To start the DRBD service, proceed as follows;
- As root, on both nodes, run the following command;
[root]# service drbd start
- Verify that the DRBD service is up and running on both nodes, and that the block devices are consistent and in secondary state;
[root]# service drbd status
drbd driver loaded OK; device status:
version: 0.7.6 (api:77/proto:74)
SVN Revision: 1663 build by root@aztec.corp.xinit.com, 2005-03-08 13:59:43
 0: cs:Connected st:Secondary/Secondary ld:Consistent
    ns:179636 nr:14532 dw:194168 dr:4351 al:1 bm:107 lo:0 pe:0 ua:0 ap:0
 1: cs:Connected st:Secondary/Secondary ld:Consistent
    ns:159224 nr:19515 dw:178739 dr:2097 al:12 bm:291 lo:0 pe:0 ua:0 ap:0
- On the primary node (this step must be performed on the primary node only), change the state of the DRBD block devices to primary;
[root]# drbdadm primary all
- Create a filesystem on the cluster_metadata device;
[root]# mkfs.ext3 /dev/drbd0
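As an additional, hedged sanity check after the drbdadm primary step above, the device state can also be inspected directly; the st: field should now read Primary/Secondary on the primary node:

[root]# cat /proc/drbd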
To create the volume group for Openfiler data partitions, proceed as follows;
- Edit /etc/lvmfilter (on both nodes)
/dev/hda6
- Create the LVM physical volume (only on the primary node)
[root]# pvcreate /dev/drbd1
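The guide stops at pvcreate; a volume group then needs to exist on the replicated device for Openfiler to carve volumes out of. A hedged sketch, assuming the group is named vg0_drbd to match the LVM::vg0_drbd heartbeat resource configured later (the name is an assumption):

# On the primary node only
[root]# vgcreate vg0_drbd /dev/drbd1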
The following steps should be performed in final preparation for cluster operation;
- Mount the cluster_metadata device on the primary node
[root]# mount /dev/drbd0 /cluster_metadata
- Move the /opt/openfiler directory to /opt/openfiler.local (on both nodes)
- Copy the contents of /opt/openfiler.local to /cluster_metadata/opt/openfiler (on primary only)
- Create a symbolic link from /cluster_metadata/opt/openfiler to /opt/openfiler (on both nodes)
- Delete the /cluster_metadata/opt/openfiler/etc/rsync.xml file (on primary only)
- Create a symbolic link from /cluster_metadata/opt/openfiler/etc/rsync.xml to /opt/openfiler.local/etc/rsync.xml (on both nodes)
- Edit /opt/openfiler.local/etc/rsync.xml (on both nodes)
<?xml version="1.0" ?>
<rsync>
    <remote hostname="10.1.3.61"/>    ## IP address of peer node.
    <item path="/etc/ha.d/haresources"/>
    <item path="/etc/ha.d/ha.cf"/>
    <item path="/etc/ldap.conf"/>
    <item path="/etc/openldap/ldap.conf"/>
    <item path="/etc/ldap.secret"/>
    <item path="/etc/nsswitch.conf"/>
    <item path="/etc/krb5.conf"/>
</rsync>
- Move /etc/samba to /cluster_metadata/etc/samba (on primary only)
- Delete /etc/samba (on secondary only)
- Create a symbolic link from /cluster_metadata/etc/samba to /etc/samba (on both nodes)
- Move /var/cache/samba to /cluster_metadata/var/cache/samba (on primary only)
- Delete /var/cache/samba (on secondary only)
- Create a symbolic link from /cluster_metadata/var/cache/samba to /var/cache/samba (on both nodes)
- Move /var/lib/nfs to /cluster_metadata/var/lib/nfs (on primary only)
- Delete /var/lib/nfs (on secondary only)
- Create a symbolic link from /cluster_metadata/var/lib/nfs to /var/lib/nfs (on both nodes)
- Move /etc/exports to /cluster_metadata/etc/exports (on primary only)
- Delete /etc/exports (on secondary only)
- Create a symbolic link from /cluster_metadata/etc/exports to /etc/exports (on both nodes)
- Move /etc/httpd/conf.d/openfiler-shares.conf to /cluster_metadata/etc/httpd/conf.d/openfiler-shares.conf (on primary only)
- Delete /etc/httpd/conf.d/openfiler-shares.conf (on secondary only)
- Create a symbolic link from /cluster_metadata/etc/httpd/conf.d/openfiler-shares.conf to /etc/httpd/conf.d/openfiler-shares.conf (on both nodes)
- Edit /cluster_metadata/opt/openfiler/etc/cluster.xml
<?xml version="1.0" ?>
<cluster>
    <clustering state="on" />
    <nodename value="cluster" />
    <resource value="MailTo::admin@example.com::ClusterFailover"/>
    <resource value="IPaddr::10.1.3.65/24" />
    <resource value="drbddisk::"/>
    <resource value="LVM::vg0_drbd"/>
    <resource value="Filesystem::/dev/drbd0::/cluster_metadata::ext3::defaults"/>
    <resource value="MakeMounts"/>
</cluster>
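All of the move / delete / symlink steps in the list above follow the same pattern; a hedged sketch for the /etc/samba case (it assumes /cluster_metadata is mounted on the primary node and that the target directory is created first):

# On the primary node:
mkdir -p /cluster_metadata/etc
mv /etc/samba /cluster_metadata/etc/samba
ln -s /cluster_metadata/etc/samba /etc/samba

# On the secondary node:
rm -rf /etc/samba
ln -s /cluster_metadata/etc/samba /etc/samba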
This is the final step in the process;
- Start the heartbeat service on both nodes;
[root]# service heartbeat start
- Verify that heartbeat services have started up on both nodes
- Access the Openfiler interface using the IP address/hostname assigned for the cluster
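A hedged sketch of the verification in the second step above (the log path follows the ha.cf shown earlier):

[root]# service heartbeat status
[root]# tail -f /var/log/ha-log     # watch for resource acquisition messages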
Troubleshooting
================================
If the system does not boot properly, edit the GRUB boot entry and remove .snmp from it.
===================================
Installing Openfiler onto a USB Device
http://www.openfiler.com/community/forums/viewtopic.php?pid=4265#p4265
It is possible, using the following steps:
– disable your internal HDs; it's not strictly needed, but it is very handy to make sure you don't mess them up. I had GRUB take out the internal HD of my laptop 😉
– boot from the latest Openfiler iso (2.2 respin 2)
– at the boot prompt, for text mode, type: expert text
for graphical mode, type: expert
(This way anaconda will let you choose your USB stick to install to)
– Install as usual up and until the reboot message.
– reboot
– be surprised: it does boot, but ends in a kernel panic. Don't panic 😉
– the initrd doesn't know about USB storage, so at the end of the boot process the ramdisk switches to a black hole instead of the filesystem on disk.
– the short version of the fix is at the following URL:
http://simonf.com/usb/
The slightly longer and edited version follows, with my comments between ():
mount /dev/sda3 /mnt/sysimage
mount /dev/sda1 /mnt/sysimage/boot
chroot /mnt/sysimage
(your /dev/sdaX might vary depending on your installation choices!)
(I haven’t tried the mkinitrd stuff but the alternative does work.)
cp /boot/initrd-2.X.X.img /tmp/initrd.gz
gunzip /tmp/initrd.gz
mkdir /tmp/a
cd /tmp/a
cpio -i < /tmp/initrd
vi init
(find the line with ‘insmod /lib/sd_mod.ko’)
( insert the following beneath it)
insmod /lib/sr_mod.ko
insmod /lib/ehci-hcd.ko
insmod /lib/uhci-hcd.ko
sleep 5
insmod /lib/usb-storage.ko
sleep 8
(save)
cd /lib/modules/`uname -r`/kernel/drivers
cp usb/storage/usb-storage.ko /tmp/a/lib
cp usb/host/ehci-hcd.ko /tmp/a/lib
cp usb/host/uhci-hcd.ko /tmp/a/lib
cd /tmp
find . | cpio -c -o | gzip -9 > /boot/usbinitrd.img
(you could also replace your original initrd)
(edit /boot/grub/menu.lst to have a new boot option with your custom usbinitrd.img)
(this needs to be done for every kernel update!!! but could be automated with a script)
(To be honest I would like to see this appear in the official rPath distro since it is so simple to get it to work, once you know what you need to look for ;-))
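As the comment above notes, this rebuild could be scripted for each kernel update. A rough, untested sketch under the same assumptions as the manual steps above (kernel version, module paths and the output file name are assumptions; adjust before use):

#!/bin/bash
# Hypothetical helper: rebuild a USB-bootable initrd after a kernel update.
set -e

KVER=${1:-$(uname -r)}
SRC=/boot/initrd-${KVER}.img
OUT=/boot/usbinitrd-${KVER}.img          # point a menu.lst entry at this file
WORK=$(mktemp -d)

# Unpack the existing initrd into a working directory
cp "$SRC" "$WORK/initrd.gz"
gunzip "$WORK/initrd.gz"
mkdir "$WORK/root"
( cd "$WORK/root" && cpio -i < ../initrd )

# Copy the USB driver modules into the ramdisk
MODDIR=/lib/modules/${KVER}/kernel/drivers
cp "$MODDIR/usb/storage/usb-storage.ko" "$WORK/root/lib"
cp "$MODDIR/usb/host/ehci-hcd.ko"       "$WORK/root/lib"
cp "$MODDIR/usb/host/uhci-hcd.ko"       "$WORK/root/lib"

# Insert the insmod/sleep lines after the sd_mod.ko line in the init script
sed -i '/insmod \/lib\/sd_mod.ko/a\
insmod /lib/sr_mod.ko\
insmod /lib/ehci-hcd.ko\
insmod /lib/uhci-hcd.ko\
sleep 5\
insmod /lib/usb-storage.ko\
sleep 8' "$WORK/root/init"

# Repack the ramdisk
( cd "$WORK/root" && find . | cpio -c -o | gzip -9 > "$OUT" )
rm -rf "$WORK"
echo "Wrote $OUT"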
Further digging around with Google revealed that you can add a few things to make life either easier for yourself or your USB stick’s life.
For example move /tmp, /var/lock, /var/log, /var/run, /var/tmp to tmpfs
Further, edit /etc/fstab to include the noatime flag on your /boot and / mountpoints. To preserve history, it is also possible, through two init scripts, to copy those /var directories to tmpfs on startup and back to the USB stick on shutdown.
I'll include my fstab and preinit script here, and later my preshutdown script.
LABEL=/ / ext3 defaults,noatime 1 1
LABEL=/boot /boot ext3 defaults,noatime 1 2
/dev/devpts /dev/pts devpts gid=5,mode=620 0 0
/dev/shm /dev/shm tmpfs defaults 0 0
/dev/proc /proc proc defaults 0 0
/dev/sys /sys sysfs defaults 0 0
# LABEL=SWAP-sda3 swap swap defaults 0 0
tmpfs /tmp tmpfs defaults,noatime 0 0
tmpfs /var/lock tmpfs defaults,noatime 0 0
tmpfs /var/log tmpfs defaults,noatime 0 0
tmpfs /var/run tmpfs defaults,noatime 0 0
tmpfs /var/tmp tmpfs defaults,noatime 0 0
Further, this is my preinit script:
#!/bin/bash
echo "creating tmpfs in /var/log"
mount -n -t tmpfs tmpfs /var/log
echo "copy /var/log"
cp -a /var/log_persistent/* /var/log
mount -n -t tmpfs tmpfs /var/lock
echo "fixing /var/lock structure"
cp -a /var/lock_persistent/* /var/lock
mount -n -t tmpfs tmpfs /var/run
echo "fixing /var/run structure"
cp -a /var/run_persistent/* /var/run
echo "done, resume normal operations...."
I made it executable and created a link to it:
chmod 755 /sbin/preinit
cd /etc/rc3.d
ln -s /sbin/preinit S00preinit
If you want it in other runlevels as well then add it in rc1.d and rc2.d etc..
I made the following directories in /var:
/var/run_persistent
/var/log_persistent
/var/lock_persistent
and booted without the tmpfs entries in fstab, copied the files from /var/log to /var/log_persistent, and then activated tmpfs in fstab, after which it worked.
What I still need to do is add a script to copy everything back on shutdown and/or add a cronjob copying everything back every hour or so.
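A hedged, untested sketch of what such a preshutdown script might look like; it simply mirrors the preinit script above in reverse and would be linked as a shutdown script in the relevant runlevels or run from cron:

#!/bin/bash
# Copy the tmpfs-backed directories back to their persistent copies on the stick
echo "saving /var/log, /var/lock and /var/run back to persistent storage"
cp -a /var/log/*  /var/log_persistent/
cp -a /var/lock/* /var/lock_persistent/
cp -a /var/run/*  /var/run_persistent/
echo "done"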
Hopefully it is clear what needs to be done, if not let me know through the forum.
Joop
This is what got it working for me, EXCEPT that the only problem part of the post above was:
cd /tmp
find . | cpio -c -o | gzip -9 > /boot/usbinitrd.img
Should be:
cd /tmp/a
find . | cpio -c -o | gzip -9 > /boot/usbinitrd.img
I just got mine booting on a 2GB U-drive with the 64-bit version of Openfiler 2.2.
Also, I manually partitioned in the anaconda installer:
75 MB – /boot *forced primary ext2
100 MB – swap
rest – / ext3