Sysadmin Guide

“Everyone has a test environment… not everyone is lucky enough to have a separate production environment.”

Application Cluster

The main application cluster consists of 3 physical machines managed by Proxmox. With Proxmox it’s easy to spin up a new VPS from a template, move a VPS from one physical machine in the cluster to another, power down / reboot machines, and perform other common sysadmin operations.

Proxmox’s web admin interface

You can access the Proxmox admin interface using one of the following URLs.

You have to be on the VPN in order to access them. Access to the web admin interface is restricted by the following constraints:

  • username and password pair

  • OTP

Database Clusters

We have highly redundant PostgreSQL, PGPool-II and MariaDB database clusters, which use PostgreSQL-XL, PGPool-II and MariaDB Galera respectively.

You can read more about the database clusters (and how to fix them in case of problems) in their documentation section: DBCL Cluster

Resizing Filesystems

From time to time disk space may run low on a server, so you’ll have to resize the disk ONLINE (i.e. without shutting down the server). Don’t worry, it’s easy if you follow the guide below. Please note that the guide only works if you created the VM from one of the provided templates (slackware or devuan). You can still manage if you created the VM manually, but the resize section below might not apply.

To start, log in to one of the Proxmox administration consoles, then go to the VM whose disk you want to resize. Click on the Hardware link and you’ll be presented with the VM’s virtual hardware. One of the lines will read something like this:

Hard Disk (scsiX) VM-Pool_1_vm:vm_xxx-disk-x,size=XXXG

Click on it, then on the Resize Disk button above. Enter the number of GB you want the disk to GROW by. Normally you’ll want to resize the root filesystem (the templates by default do not create separate filesystems for /var, /home, etc., so you’ll have just /boot, / and the swap).
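If you prefer the shell over the web GUI, the same grow can be done from the Proxmox node itself with qm. This is only a sketch: the VM ID 100 and the +10G increment are placeholders, and scsi0 is the disk you actually want to grow.

qm resize 100 scsi0 +10G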

Now log in to the server as root. At this point, for example, if you resized the scsi0 disk, you can run:

fdisk /dev/sda

/dev/sdb corresponds to scsi1, and so on. You will see the prompt:

Command (m for help):

Type p at this command prompt and hit return.

You will see something like this:

Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): p
Disk /dev/sda: 32 GiB, 34359738368 bytes, 67108864 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x72e3860a

Device     Boot    Start      End  Sectors  Size Id Type
/dev/sda1           2048  1640447  1638400  800M 83 Linux
/dev/sda2        1640448 10029055  8388608    4G 82 Linux swap
/dev/sda3       10029056 67108863 57079808 27.2G 83 Linux

If we resized the /dev/sda disk, remember that the ROOT filesystem, if you are using one of the provided templates, is always on the LAST partition of the disk. Now you can delete the last partition (/dev/sda3). Don’t worry, you will not lose data: you are only rewriting the partition table entry, and you will immediately recreate the partition starting at the same first sector.

Command (m for help): d
Partition number (1-3, default 3): 3

Partition 3 has been deleted.

Then immediately re-create it, accepting the defaults (same first sector, last sector at the new end of the disk):

Command (m for help): n
Partition type
   p   primary (2 primary, 0 extended, 2 free)
   e   extended (container for logical partitions)
Select (default p): p
Partition number (3,4, default 3):
First sector (10029056-67108863, default 10029056):
Last sector, +sectors or +size{K,M,G,T,P} (10029056-67108863, default 67108863):

Created a new partition 3 of type 'Linux' and of size 27.2 GiB.

Finally, write your changes to disk. Since the disk is in use, you will see a warning:

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Re-reading the partition table failed.: Device or resource busy

The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8).

Now you have to tell the kernel you’ve changed something. In this case, the disk is /dev/sda, so:

partx -u /dev/sda

Now the kernel knows the partition table has changed. Finally, we can resize the filesystem:

resize2fs /dev/sda3

Wait a moment, and you’ll see the filesystem has grown to fill the new partition.
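To double-check that both the partition and the filesystem now have the expected size, you can compare the two views. These are standard util-linux/coreutils tools, nothing template-specific:

lsblk /dev/sda        # partition sizes as seen by the kernel
df -h /               # size of the mounted root filesystem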

Backup

The servers are backed up daily using Bacula.

You can access its web interface here: https://backup-admin while in the VPN.

  • User is bacula

  • Password - ask the BOFH

Putting a server on backup

To put a server on backup, start by cloning a slackware-template or a devuan-template on the AppCluster. After you’ve done all the necessary configuration, you can edit the following file on the new server:

/etc/bacula/bacula-fd.conf
And change the relevant configuration elements:
  • Change every occurrence of dbcl03 to the hostname of the new server

  • If you want, change FDAddress to the internal server address (this is firewalled anyway)

  • Change the Password for the Director (Name = storage-dir). A useful command to generate a random password is:

openssl rand -base64 33

Remember that this same password must also be configured on the Bacula server. Once you have finished the configuration, restart bacula-fd with the command appropriate to the distribution you cloned (the first command is for Slackware clones, the second for Devuan ones):

/etc/rc.d/rc.bacula restart
service bacula-fd restart
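To confirm the file daemon actually came up and is listening on its default port (9102, the same FDPort used in the Client block below), a quick check on the new server could be the following (assuming iproute2’s ss is available; netstat -ltnp gives the same information):

ss -ltnp | grep 9102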

Now you can configure the Bacula server. Go to storage.webmonks.net and open the file /etc/bacula/bacula-dir.conf.

Here is a snippet of what you should configure in the job section (search for the first “Job {” string):

Job {
   Name = "servername-backup"
   JobDefs = "Backup filesystem full no TS"
   Client = servername
}

And, in the clients section (search for the first “Client {” string):

Client {
   Name = servername
   Address = servername.localmonks.net
   FDPort = 9102
   Catalog = monk-catalog
   Password = "<the-password-you-configured-on-server>"
   File Retention = 15 days
   Job Retention = 1 month
   AutoPrune = yes
}

After you’ve finished, you can launch the command:

bconsole

And type:

reload

Be careful, because if there are errors, the bconsole will HANG. If you’re cautious, you could first check for errors by typing:

bacula-dir -Ttc /etc/bacula/bacula-dir.conf

And then, if it’s all OK, you can reload. After you’ve reloaded Bacula, you can try running your shiny new job in bconsole by typing:

run

And follow the menu.

Restoring a server from backup

Let’s hope you never need this. But if you do, here’s what to do.

Please note that if you are FULL-RESTORING a server, you first have to reinstall a base system from the same template as the server you’re restoring. You can check which template it was on the appcl cluster console, or you can recognize it from the backup itself by walking through the restore operation below without actually running it. I use the hostname file to tell them apart: if the backup contains /etc/HOSTNAME, it’s a Slackware 14.2 system; otherwise it’s a Devuan system.

Log in to the “storage” server and run the command:

bconsole

At the bconsole prompt, type:

restore

You will be presented with this menu:

To select the JobIds, you have the following choices:

 1: List last 20 Jobs run
 2: List Jobs where a given File is saved
 3: Enter list of comma separated JobIds to select
 4: Enter SQL list command
 5: Select the most recent backup for a client
 6: Select backup for a client before a specified time
 7: Enter a list of files to restore
 8: Enter a list of files to restore before a specified time
 9: Find the JobIds of the most recent backup for a client
10: Find the JobIds for a backup for a client before a specified time
11: Enter a list of directories to restore for found JobIds
12: Select full restore to a specified Job date
13: Cancel

There are several ways to restore data from a backup; let’s follow the easiest one for now. Choose option 5. You will then see the list of available clients; each one is a server you have put on backup. Choose the server you want to restore from. If all went well, you will be presented with a pseudo-shell where you can navigate using cd and ls, and use the special command add to add directories or files to the list of files you want to restore.

You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.

cwd is: /
$ ls
bin/
boot
dev
etc/
home
initrd.img
lib/
lib64/
media/
misc
net
opt/
root/
run
sbin/
srv
usr/
var/
vmlinuz

Choose which files you want to restore (or type add all to choose them all), and then type done.

You will be presented with this menu:

1 file selected to be restored.

Using Catalog "monk-catalog"
Run Restore job
JobName:         Restore backup files
Bootstrap:       /var/lib/bacula/working/storage-dir.restore.1.bsr
Where:           /tmp/bacula-restores
Replace:         Always
FileSet:         <type-of-fileset (see below)>
Backup Client:   <name-of-the-server>
Restore Client:  <name-of-the-server>
Storage:         raid5storage
When:            2018-09-12 11:26:27
Catalog:         monk-catalog
Priority:        10
Plugin Options:  *None*
OK to run? (yes/mod/no):

As you can see, the files will be restored into the /tmp/bacula-restores directory on the client. You can of course change it to / on the server, but be careful: the restore runs as the “bacula” user on the remote server, so it will not be able to write to the / filesystem. I suggest you leave it as is, then log in to the server and do something like:

cd /tmp/bacula-restores
tar -pcvf - * | tar -C / -pxvf -

Then you can remove the source directory. Reboot your new system and enjoy your restored server.
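As an alternative to the tar pipe above, if rsync happens to be installed on the restored server, an equivalent copy that preserves permissions and can safely be re-run would be the following (-H, -A and -X keep hard links, ACLs and extended attributes; drop them if the filesystem doesn’t use those):

rsync -aHAX /tmp/bacula-restores/ /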

Restoring DB backups

We also have a daily backup of the databases. The restore process is very similar to the one above, with a slight difference: when choosing files to restore, you’ll see database names instead. By selecting them, you’re actually choosing which databases to restore.

In order to restore a database, after you have chosen the client to restore from, you will be presented with something like this:

The defined FileSet resources are:
   1: Full FS no tablespaces
   2: PostgreSQL Logical Backup
   3: PostgreSQL tablespace

The “Logical” backup is what you need. Then choose the database file(s) to restore; when you run the job, each selected database will be cleared and restored from the backup copy.

In our Great Plans (tm) we are preparing a pg_barman solution to have a PITR (point-in-time recovery) backup for Postgres. Stay tuned for instructions.

Physical Machines

Our physical servers are provided by OVH.

DNS Management

We manage all of our domains with Amazon Route 53, which has some cool features.

We also have an internal DNS service running on ns1.localmonks.net (an alias for storage.localmonks.net) and on ns2.localmonks.net, which is its slave. Should you install a new server, as a BOFH you should configure the /etc/named/zones/* files on ns1.localmonks.net. The most commonly edited files are localmonks.net and 0.18.172.in-addr.arpa: the first is used for forward resolution, the second for reverse lookup. Please remember to bump the zone’s “serial” number for every edit you make, and then run:

named-checkzone <zone-name> <zone-file-you-changed>

If it’s all OK, you can run:

rndc reload <zone-name>

This way, the changes are also propagated to the slave server.
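As a concrete example, after adding a record for a new host to the localmonks.net zone file and bumping its serial, the check-and-reload sequence would look like this (the file name below assumes the zone file under /etc/named/zones is simply named after the zone; adjust if your layout differs):

named-checkzone localmonks.net /etc/named/zones/localmonks.net
rndc reload localmonks.net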

Wordpress/Sites/NodeJS Cluster

A brief note: all the services below run as the user “www-data”, apart from NodeJS which runs as the user “node”. It is done this way because of group permissions. If you are a NodeJS administrator, you can become the node user with the usual command sudo -u node -i; if you are a www administrator, you can become www-data with sudo -u www-data -i. Log in with your usual personal account first, then become the application user with sudo.

Wordpress

On wordpress-be-prod01 and wordpress-be-prod02 we also have a wordpress cluster! You can migrate your site here. Ask @agostikkio (the wordpress admin) to integrate your site!

The WP cluster runs on Apache, and its configuration is in:

/etc/apache2/sites-enabled/default-ssl.conf.

The Wordpress cluster runs from /var/www/wordpress. You can log in to the Wordpress admin console by going to

https://wp.monksoftware.it/wp-admin

and logging in with administrator credentials. If you do not have them, ask a BOFH or @agostikkio to create an account for you. You can restart the HTTPS service (Apache) with

service apache2 restart

Please remember that /var/www/wordpress hosts multiple sites! You should not modify files therein for any reason, otherwise you could break the Wordpress multi-site installation. If you need to import new sites, follow the guide below (by @agostikkio):

  1. Export the content of the current site with the native Wordpress exporter (Strumenti -> Esporta -> Tutti i contenuti, i.e. Tools -> Export -> All content), obtaining an .xml file.

  2. Dump current website’s .sql data.

  3. Create a new site from the backoffice network section (I miei siti -> Gestione network -> Siti, i.e. My Sites -> Network Admin -> Sites), choosing a proper new fourth-level domain name. Please note that you can also use second-level domains.

  4. Check which new tables (prefix, e.g. ‘wp_8’) this new site will create and search/replace this prefix in the previously dumped file (ask a BOFH if you don’t know how to do it; a sed sketch follows the warnings below).

  5. Give the .sql file to the BOFH, who will update the database.

  6. After the database update, import the previously exported .xml data into the new network website, using the native Wordpress importer (Strumenti -> Importa, i.e. Tools -> Import).

  7. Check for differences between the current site and the network site once plugins are installed, data is imported, and so on.

  8. Ask the BOFH to create the virtual host on the NGINX frontends, so the traffic will be properly directed to the Wordpress cluster. Use the existing files in /etc/nginx/sites-available as a template.

WARNINGS:

  • New plugins have to be installed from the plugins network section (I miei siti -> Gestione network -> Plugin, i.e. My Sites -> Network Admin -> Plugins) and then activated in the new website’s dashboard.

  • The exporter doesn’t export all data (e.g.: no menus, some plugins’ functions).
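For step 4, a minimal sed sketch of the prefix replacement, assuming the dump is in site.sql and the new site’s prefix is wp_8_ (both placeholders; the original standalone site is assumed to use the default wp_ prefix):

sed -i 's/`wp_/`wp_8_/g' site.sql

This only rewrites backtick-quoted table names; serialized options inside the data may still reference the old prefix, which is why the step says to ask the BOFH when in doubt.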

Static/PHP sites

The Wordpress cluster also runs all the static/basic sites in /var/www/sites, and those are exposed using NGINX Unit. The configuration for NGINX Unit lives in:

/etc/unit/unit.config

You can configure as many applications as you like; Unit will bring them all up. Create a configuration like this:

{
    "listeners": {
        "<server-ip>:<service-port>": {
            "application": "<application-name>"
        }
    },
    "applications": {
        "<application-name>": {
            "type": "php",
            "processes": {
                "max": 200,
                "spare": 10
            },
            "environment": {
                "UMASK": "0002"
            },
            "user": "www-data",
            "group": "nogroup",
            "root": "/var/www/homedir",
            "script": "index.php"
        }
    }
}

And then run

/etc/init.d/unit loadconfig <path-to-file>

Now you’ll see that there is a new service listening on “<service-port>”. You can proceed by configuring your virtual host on the NGINX frontends (nginx-fe01, nginx-fe02, nginx-fe03), which send traffic to the Unit cluster.
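A quick way to check that the new application actually answers before touching the frontends is to hit it directly from the Unit host (the IP and port are the placeholders from the config above):

curl -I http://<server-ip>:<service-port>/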

NodeJS

We’re also running NodeJS servers on this cluster! Become the “node” user, if you are in the “nodeadmins” group, and you’ll be able to check the services by typing:

pm2 status

If you are a NodeJS developer, you’ll find it easy to work here. Please remember that deploying directly on the server using pm2 is forbidden.
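A couple of other read-only pm2 commands that are handy when checking a service (the application name is a placeholder; these only inspect, they don’t deploy anything):

pm2 logs <application-name> --lines 100
pm2 describe <application-name>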

Please note that the wordpress/sites/nodejs cluster runs on a GlusterFS clustered filesystem, so any change you make on one server is instantly reflected on all the other servers. There’s no NFS single point of failure, and the servers can run independently of each other.

Owncloud auto-share

If a user happens to have a roaming profile (i.e. a home directory hosted on our storage server, which gives them the same home on whatever server they ssh into), they can also publish files (and only files) to their Owncloud directory. Just tell them to create a “.owncloud_tunnel” directory under their own home dir. Every hour, at minute 16, all files contained therein are moved to the Owncloud share; you’ll know it worked because the files disappear from their home.
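In practice, on any server where the roaming home is mounted, the user just does something like this (report.pdf is a placeholder file name):

mkdir -p ~/.owncloud_tunnel
cp ~/report.pdf ~/.owncloud_tunnel/   # picked up at minute 16 of the next hour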

Logging

All logging for the clustered servers ends up on the storage server, located at storage.localmonks.net and only reachable via VPN. The logs are gathered via syslog from all the servers concerned, where applicable. These are the services currently logged on storage:

Service             Location
Pushcamp            /var/log/pushcampa
Odoo                /var/log/odoo
Wind-proxy          /var/log/windproxy
XMPP                /var/log/xmpp
Redis Cluster       /var/log/redis
Galera Cluster      /var/log/dbcl
Postgres-XL         /var/log/dbcl
Cassandra Cluster   /var/log/cassandra
Riak KV             /var/log/riak

Logging for Docker Swarm services (such as the buddies, the router, etc.) is done on the Docker Swarm servers, and the logs are fetched using the command

docker service logs <servicename>

You can list the available services with the command

docker service ls
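If you want to follow a service live or limit the output, the usual flags apply (the service name is a placeholder):

docker service logs --tail 100 -f <servicename>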