=====================
PGPool-II SSD Cluster
=====================

This document describes the PgPool-II cluster dedicated to Pushcamp production.

Acronyms and document referrals
===============================

.. glossary::
   :sorted:

   PGPool-II
      `PGPool`_

   PGBench
      `PGBench`_

   PostgreSQL
      `PostgreSQL`_

.. _PostgreSQL: https://www.postgresql.org/
.. _PGPool: https://www.pgpool.net
.. _PGBench: https://www.postgresql.org/docs/12/pgbench.html
.. _Network Information Services: https://en.wikipedia.org/wiki/Network_Information_Service
.. _Network Filesystem: https://en.wikipedia.org/wiki/Network_File_System

Architectural Design
====================

.. image:: pgpool.png

The new PgPool-II cluster is a high-speed, redundant, highly available PgPool
cluster based on PostgreSQL 12. The PgPool-II architecture can be configured and
deployed in several topologies: for the Monksoftware installation we opted for a
Streaming Replication cluster that also performs query load balancing, in a
watchdog-enabled configuration that uses heartbeat signals to check that all
nodes are up, running and replicating correctly.

In the following document we walk through the main cluster features and
characteristics, starting with the base "raw" PostgreSQL configuration and the
methods used to configure the backends.

PostgreSQL hardware and tuning
==============================

We are currently using a master-slave-slave cluster with three nodes: one node
is always the primary/master, while the other two are hot standbys, i.e. they
accept read-only queries and can be used to balance SELECTs and offload the
primary.

The hardware is the OVH Infra-4 (https://www.ovh.it/server_dedicati/infra/tariffe/),
which sports 2x Xeon Silver with 24 cores/48 threads at 3.2 GHz, plus additional
configuration options: 192 GB of RAM and three NVMe SSDs that power a RAID-5
array on each of the servers. We tried to have RAID10 configured, but the
infrastructural limit for this server is three NVMe disks. In a future revision
we could migrate to RAID10 and almost double the speed.

RAID5/Filesystem configuration
------------------------------

A study has been conducted over several papers regarding the correct stripe size
to use for a RAID5 setup that must back a PostgreSQL database. The sources are:

- Chunks: The Hidden Key to RAID Performance
- RAID Performance

The chunk size used for the RAID is 64 KB.

The partition table for the RAID5 disks is then prepared so that it is aligned
to the stripe boundary. We use a GPT partition table, which takes care of that
without esoteric cylinder alignment.

To match the RAID5 chunk size, the ext4 filesystem has been configured with the
corresponding stride and stripe-width values. We used this simple calculator
(https://busybox.net/~aldot/mkfs_stride.html), which derives the recommended
settings for an ext4 filesystem on RAID devices from the filesystem block size
(we chose 4096 bytes), the number of disks, the RAID level and so on. The
result, for an ext4 filesystem, is:

.. code-block:: sh

    stride=16
    stripe-width=32
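For reference, this is roughly how the filesystem step above translates into
commands. The array device name below is a placeholder and the commands are a
sketch only, not a record of what was run on the production servers:

.. code-block:: sh

    # Sketch: ext4 aligned to a 64 KB chunk, 3-disk RAID5 (4096-byte blocks,
    # stride = 64K / 4K = 16, stripe-width = stride * (3 - 1) = 32).
    # /dev/md0 is a placeholder device name; double-check it before running
    # anything destructive.
    mkfs.ext4 -b 4096 -E stride=16,stripe-width=32 /dev/md0

    # Verify the resulting geometry:
    tune2fs -l /dev/md0 | grep -i -E 'stride|stripe'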
Then, we applied this whole RAID5 -> partition table -> ext4 configuration to a
*customized* PostgreSQL 12 build. We built PostgreSQL from source, changing the
default configuration parameters for the block and WAL segment sizes so that
they are aligned to the filesystem/RAID stripes and there is absolutely no
overhead from overlapping disk stripes. This is the configuration we used to
build the PostgreSQL server:

.. code-block:: sh

    --with-blocksize=4 --with-wal-blocksize=4 --with-segsize=1

All the benchmarks agree on the fact that a stripe size of 64k is the best
compromise between performance and latency, especially for an SSD-backed
database.

Other OS tunings
----------------

We ran some hdparm tests and discovered that by setting the readahead to 4096
sectors on each of the three NVMe disks that back the RAID5 array, performance
increased threefold, so:

.. code-block:: sh

    blockdev --setra 4096 /dev/nvme0n1
    blockdev --setra 4096 /dev/nvme1n1
    blockdev --setra 4096 /dev/nvme2n1

Finally, we also set the following sysctl parameters and kernel boot flags.

For less I/O:

.. code-block:: sh

    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10
    vm.swappiness = 0

For more shared memory:

.. code-block:: sh

    kernel.shmmax = 8420048896
    kernel.shmall = 2055676
    kernel.sem = 250 32000 32 128

Set the I/O elevator to noop and enable the multi-queue block layer for high
database performance:

.. code-block:: sh

    elevator=noop scsi_mod.use_blk_mq=1

PostgreSQL configuration
------------------------

The configuration of the PostgreSQL_ servers has been tuned according to the
postgresqltuner script output and the PGTune configuration calculator for a
mixed-type application.

The main problem was calculating the correct maximum number of concurrent
requests for the database. There is much literature on the number of concurrent
connections a PostgreSQL database should accept, but every paper agrees on one
thing: *it is much better to queue up excess connections than to work them all
at once*. There is a practical limit on the number of concurrent backends for a
database, commonly estimated with this formula::

    num_connections = (thread_count * 2) + effective_spindle_count

where effective_spindle_count is 1 for each hard drive involved (each drive can
serve one request at a time, of course). With our current configuration, that is
48 threads * 2 + 3 disks, so max_connections should be about 100: much less than
what we see in the usual database configurations.

But SSDs come to the rescue. There is almost no documentation about the
effective spindle count to use for SSD disks because, well, they can do much
more than one operation at a time, and nowadays the above formula is deemed
"obsolete" for SSD-backed databases. So we ran hdparm tests with O_DIRECT reads
and found that the NVMe disks perform somewhere between 5 and 13 times better
than mechanical disks (depending also on whether we use cached or buffered
reads). Empirically, we can therefore assume the spindle count can be multiplied
by 5-13, also because reads and writes perform almost the same on SSD disks.

We ended up with 600 concurrent connections, eventually raising them to 900 when
we live-tuned the parameters, given the sheer amount of "get_news" Pushcamp
endpoint traffic.
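As a quick sanity check of the numbers above, this is just the formula evaluated
for our hardware; the production values of 600 and then 900 come from the NVMe
measurements and the live tuning described above, not from this formula:

.. code-block:: sh

    # Classic sizing formula: (thread_count * 2) + effective_spindle_count
    echo $(( 48 * 2 + 3 ))   # -> 99, i.e. the "about 100" baseline for spinning disks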
PGPool-II
=========

PGPool_ is clustering middleware that lets us build a system out of multiple
PostgreSQL servers. Have a look again at the architecture at the top of this
document. It can be configured in several different ways and offers quite a few
configuration styles and customizations. We opted for Streaming Replication (SR
from now on), a watchdog-enabled (WD from now on) heartbeat cluster, Query Load
Balancing (QLB from now on) and a queued connection pool.

We start by explaining why we did not configure the in-memory query cache nor
the connection pooling.

Why no query cache
------------------

We ran several benchmarks (results are at the bottom of the document) against
the cluster, starting with the classic PgBench_ benchmark. We noticed no
difference in pgbench results against a query-cache-enabled cluster, but we
experienced a noticeable performance decrease while running the standard
Pushcamp test suite against a query-cache-enabled cluster. This is probably due
to the very high rate of changes the Postgres database receives every second
from each of the clients, which makes the query cache almost useless: it only
adds overhead to the cluster, as the result of each query varies from one moment
to the next.

Why no connection pooling
-------------------------

Connection pooling in PGPool_ works as follows (have a look at `this`_):

.. _this: https://www.pgpool.net/mediawiki/index.php/Relationship_between_max_pool,_num_init_children,_and_max_connections

A child process opens a new backend connection only if the requested
[user, database] pair is not already in its cache. So if the application uses
only one user to connect to only one database, say [pguser1, pgdb1], each child
keeps reusing its first connection and never opens a second one; as a result
PgPool opens no more than one backend connection per child (100 in that example)
even though max_pool is set to 3. As we are here to serve just Pushcamp, we will
have exactly one user and one database.

Cluster setup and resilience
----------------------------

To avoid split-brain situations, where two nodes have to elect a master/primary
and get nowhere because each of them votes for itself, we used a three-node
cluster, with one primary and two replicas (or one master and two slaves, if you
prefer). We used an OVH vRack-enabled configuration to let the nodes talk to
each other. One node is in Roubaix, one in Gravelines and one in Strasbourg.

There is no "fixed" master in the cluster. Each time the cluster starts up, a
new master is elected and becomes the primary node. The primary node is the only
node in the cluster that can receive writes: the other two nodes are connected
to it via SR and receive the updates almost in real time. The advantage of
having two slaves is that PGPool can redirect SELECT queries to all of the
nodes, with a load-balancing technique that spreads the traffic among them. We
configured the Roubaix node with a slightly higher weight in the LB
configuration because of its geographic position relative to the Monksoftware
server farm.

The resilience of the cluster is guaranteed by PGPool-II: the heartbeat (HB) and
watchdog (WD) checks run periodically to guarantee that all nodes are up,
running and in good shape. Should a node detach from the network, return a bad
state when queried, or run into a disk problem or the like, it is quarantined by
PGPool-II and removed from the cluster. If the removed node is the primary node,
a new election starts immediately to decide the new primary.

The whole cluster is accessed through one single virtual IP address, which is
held by the current WD primary node; this may or may not be the PostgreSQL SR
primary. If the node that died is also the current WD holder, the virtual IP is
taken over by the node that won the election.
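To make the load-balancing weights above concrete, the relevant pgpool.conf
entries look roughly like the following sketch. The hostnames, backend port and
weights are taken from the node status output shown later in this document; all
the other backend parameters are omitted, so do not copy this verbatim:

.. code-block:: sh

    # Sketch of the backend/weight section of pgpool.conf
    backend_hostname0 = 'postgresprod01'
    backend_port0     = 5433
    backend_weight0   = 0.5      # Roubaix: higher share of the SELECT traffic

    backend_hostname1 = 'postgresprod02'
    backend_port1     = 5433
    backend_weight1   = 0.25

    backend_hostname2 = 'postgresprod03'
    backend_port2     = 5433
    backend_weight2   = 0.25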
Automatic failover and recovery
-------------------------------

We configured the cluster, and wrote a set of scripts, to handle the automatic
failover and recovery of the cluster nodes.

If a slave/replica goes down there is no problem, but if a master dies then,
when the election of the new master ends, the remaining slave must connect to it
and resume replication. This situation is handled by failover_command and
follow_master_command, which can be configured in PGPool-II. Those two
configuration parameters point to their corresponding scripts, which live in the
/etc/pgpool2/scripts directory on each node of the cluster. The scripts take
care of:

- promoting the elected node to master;
- dropping its old replication slot from PostgreSQL;
- launching a pg_basebackup on the remaining slave, based on the current master
  and the current WAL position;
- attaching the slave to the new master. SR resumes and the cluster is resilient
  again.

For the recovery, we prepared several commands that use the above scripts to
automatically recover a failed node, re-attaching it to a running master. This
can also be used, after PGPool-II has been configured, to attach a new slave
node to the cluster. Look at the Procedures chapter for a list of useful
commands to monitor the cluster status and launch a node recovery.

Slots and connection queuing
----------------------------

As already said, we opted for a no-pool connection handler. But we may have to
let more clients connect than the current connection limit allows, piling up the
connections and letting them wait until slots free up.

We must not let in more than (max_connections_on_postgres - X) connections at
the same time, where X is a fixed amount we leave free for the WAL
sender/receiver processes used for replication, administrative tasks and
superuser connections. We fixed X at 30 connections, so the total number of
concurrent connections for each cluster node is 870.

On top of that, we set the listen_backlog_multiplier parameter to 10. This lets
each of the PGPool children queue up to 10 connections, for a total of 8700
available connection sockets. This should not be needed, because queuing should
be handled by the Sequelize/Knex ORM used to connect to the database, given that
we prepared a pool distribution that does not exceed the 870-connection limit.

This is the formula used to calculate num_init_children::

    max_pool * num_init_children <= (max_connections - superuser_reserved_connections)

In our case, given that we do not pool, max_pool is 1.

Cluster benchmarking
====================

To benchmark the cluster we used, in addition to a basic PgBench_ test (a sketch
of its invocation follows the scenario list), a JMeter test suite that exercises
each of the available endpoints, with several scenarios:

- Failure under load: we load the Pushcamp backend with requests, then force a
  DB node failure (by rebooting). We expect the cluster to keep working and to
  recover automatically from the failure.
- Recovery under load: after the database node fails, we keep loading the
  Pushcamp backend, then we try a node recovery with our automatic recovery
  tools. We expect no issues.
- SetUserIds dequeuing performance: we load setUserIds while we are concurrently
  loading a TRM targeting campaign.
- Initialize: we send LOTS of initialize requests, which go directly to the
  database. This is a performance test.
- Load at connection limit: we lower the PgPool-II available clients to 10-20,
  then we load test past the PgPool connection limit. This test checks whether
  the clients can still get in.
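The baseline PgBench_ run mentioned above was pointed at the cluster virtual IP.
The invocation was along these lines; the scale factor, client/thread counts,
user and database names below are placeholders, not the exact figures used for
the documented runs:

.. code-block:: sh

    # Illustrative only: initialize a pgbench schema, then run a timed test
    # through the PgPool virtual IP (postgres-prod). Numbers and names are
    # placeholders.
    pgbench -i -s 1000 -h postgres-prod -U pguser1 pgdb1
    pgbench -c 200 -j 8 -T 600 -h postgres-prod -U pguser1 pgdb1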
Results:

Failure under load
------------------

+---------+---------+-------+------------+
| Samples | Average | Error | Throughput |
+=========+=========+=======+============+
| 132400  | 2324    | 1.34% | 127.7/sec  |
+---------+---------+-------+------------+

Result: OK. We forced a primary node failure. The cluster quarantined the failed
node, logged some errors during the new election, then recovered successfully
with almost no downtime.

Recovery under load
-------------------

+---------+---------+-------+------------+
| Samples | Average | Error | Throughput |
+=========+=========+=======+============+
| 132400  | 2125    | 0.1%  | 527.7/sec  |
+---------+---------+-------+------------+

Result: OK. The cluster node recovered as a slave, attaching to the newly
elected master, while the traffic kept flowing without a hitch.

SetUserIds performance
----------------------

+---------+---------+-------+------------+
| Samples | Average | Error | Throughput |
+=========+=========+=======+============+
| 3307856 | 108703  | 0.01% | 1014.3/sec |
+---------+---------+-------+------------+

Result: OK. The workers dequeued at 1/3 of the incoming speed while the requests
were coming in.

Initialize performance
----------------------

+---------+---------+-------+------------+
| Samples | Average | Error | Throughput |
+=========+=========+=======+============+
| 6351598 | 308703  | 0.00% | 320.2/sec  |
+---------+---------+-------+------------+

Result: OK.

Load at connection limit
------------------------

+---------+---------+-------+------------+
| Samples | Average | Error | Throughput |
+=========+=========+=======+============+
| 6716310 | 2802292 | 0.41% | 32.1/sec   |
+---------+---------+-------+------------+

Result: OK. The lower throughput is due to the PGPool-II num_init_children being
set to 10.

Pushcamp connection pool configuration
======================================

To plan how to distribute the available 870 connections among the three Pushcamp
backend nodes, we fetched five days' worth of logs from the Nginx frontends,
merged them, and extracted the mean and max TPS for each of the available
endpoints. Then we matched this information with the database time of a single
query.
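The extraction itself was plain log crunching. A minimal sketch of the idea,
assuming the merged access logs are in Nginx's default "combined" format (the
file name and field positions are assumptions):

.. code-block:: sh

    # Per-endpoint max and mean requests-per-second from a merged Nginx access
    # log: $4 is the "[day/month/year:HH:MM:SS" timestamp, $7 the request path.
    # The mean is taken over the seconds that actually saw traffic.
    awk '{ split($7, u, "?"); print substr($4, 2, 20), u[1] }' access.log.merged \
      | sort | uniq -c \
      | awk '{ sum[$3] += $1; secs[$3]++; if ($1 > max[$3]) max[$3] = $1 }
             END { for (e in sum)
                     printf "%-35s max=%-6d mean=%.1f\n", e, max[e], sum[e] / secs[e] }'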
This was the result of the analysis:

+----------------------------------+---------+----------+----------------+
| Endpoint                         | Max TPS | Mean TPS | Query time     |
+==================================+=========+==========+================+
| /get_all_news                    | 4333    | 220      | ~7ms           |
+----------------------------------+---------+----------+----------------+
| /get_commercial_and_inbound_news | 51      | 8        | ~14 ms/cmpgn   |
+----------------------------------+---------+----------+----------------+
| /details                         | 79      | 6        | ~6ms           |
+----------------------------------+---------+----------+----------------+
| /config                          | 161     | 50       | - (redis-only) |
+----------------------------------+---------+----------+----------------+
| /receipt                         | 50      | 20       | ~4ms           |
+----------------------------------+---------+----------+----------------+
| /set_filters                     | 22      | 8        | ~4ms           |
+----------------------------------+---------+----------+----------------+
| /set_user_ids                    | 17      | 8        | ~4ms           |
+----------------------------------+---------+----------+----------------+
| /info                            | 77      | 7        | ~4ms           |
+----------------------------------+---------+----------+----------------+
| /initialize                      | 53      | 5        | ~1ms           |
+----------------------------------+---------+----------+----------------+

Given that the /receipt, /set_filters, /set_user_ids and /info endpoints all end
up in Redis, and are then dequeued by the workers, we distributed the endpoints
as follows:

+----------------------------------+----------------------------------------+--------------------------+
| API Endpoint                     | Receiving backend(s)                   | API estimated pool slots |
+==================================+========================================+==========================+
| /get_all_news                    | pushcamp-be-prod03                     | 250                      |
+----------------------------------+----------------------------------------+--------------------------+
| /get_commercial_and_inbound_news | pushcamp-be-prod03                     | 250 (shared with above)  |
+----------------------------------+----------------------------------------+--------------------------+
| /details                         | pushcamp-be-prod01, pushcamp-be-prod03 | 20                       |
+----------------------------------+----------------------------------------+--------------------------+
| /config                          | all                                    | (no db involved)         |
+----------------------------------+----------------------------------------+--------------------------+
| /receipt                         | pushcamp-be-prod01, pushcamp-be-prod02 | 30                       |
+----------------------------------+----------------------------------------+--------------------------+
| /set_filters                     | pushcamp-be-prod01, pushcamp-be-prod02 | 20                       |
+----------------------------------+----------------------------------------+--------------------------+
| /set_user_ids                    | pushcamp-be-prod01, pushcamp-be-prod02 | 15                       |
+----------------------------------+----------------------------------------+--------------------------+
| /info                            | pushcamp-be-prod01, pushcamp-be-prod02 | 15                       |
+----------------------------------+----------------------------------------+--------------------------+
| /initialize                      | pushcamp-be-prod01, pushcamp-be-prod02 | 20                       |
+----------------------------------+----------------------------------------+--------------------------+

About the other processes involved:

+---------------------+--------------------+-----------------------+
| Process             | Running backend(s) | Connection pool slots |
+=====================+====================+=======================+
| Dequeuers (Workers) | pushcamp-be-prod02 | 200                   |
+---------------------+--------------------+-----------------------+
| Pusher              | pushcamp-be-prod01 | 10                    |
+---------------------+--------------------+-----------------------+
| CRM Targeting       | pushcamp-be-prod01 | 5                     |
+---------------------+--------------------+-----------------------+
| CRM Exports         | pushcamp-be-prod01 | 5                     |
+---------------------+--------------------+-----------------------+
| CRM Uploads         | pushcamp-be-prod01 | 5                     |
+---------------------+--------------------+-----------------------+
| Web                 | all                | 5                     |
+---------------------+--------------------+-----------------------+

And here is the total amount of connection pool slots allocated per API server:

+--------------------+-------------+
| API Backend        | Total slots |
+====================+=============+
| pushcamp-be-prod01 | 120         |
+--------------------+-------------+
| pushcamp-be-prod02 | 70          |
+--------------------+-------------+
| pushcamp-be-prod03 | 300         |
+--------------------+-------------+
| Total              | 490         |
+--------------------+-------------+

Finally, this is the number of workers that dequeue Redis, broken down by worker
type:

+---------------------------+--------+
| Worker                    | Amount |
+===========================+========+
| receipts                  | 60     |
+---------------------------+--------+
| offerActivation           | 10     |
+---------------------------+--------+
| saveTrmTargetStatus       | 80     |
+---------------------------+--------+
| saveTrmReceipt            | 30     |
+---------------------------+--------+
| saveTrmActivationReceipt  | 10     |
+---------------------------+--------+
| setFilters                | 25     |
+---------------------------+--------+
| setUserIds                | 25     |
+---------------------------+--------+
| deviceInfo                | 30     |
+---------------------------+--------+
| deleteDevice              | 1      |
+---------------------------+--------+
| trmTarget                 | 30     |
+---------------------------+--------+
| createCollateralCampaign  | 1      |
+---------------------------+--------+
| configData                | 1      |
+---------------------------+--------+
| Total                     | 303    |
+---------------------------+--------+

PGPool-II Administration procedures
===================================

In this last chapter we describe the operating procedures we have set up to show
the cluster status, recover the nodes, and so on. The whole cluster is backed up
daily, and a PITR solution based on Barman is in the works. Everything is also
monitored with Zabbix alarms; at the bottom of this document we list the various
alarm types.

The PGPool-II backends are running on:

+----------------+----------------------------+
| Server         | Location                   |
+================+============================+
| postgresprod01 | Roubaix (RBX7) - France    |
+----------------+----------------------------+
| postgresprod02 | Strasbourg (SBG3) - France |
+----------------+----------------------------+
| postgresprod03 | Gravelines (GRA2) - France |
+----------------+----------------------------+

Login is only permitted to administrative or operation accounts. You can log in
to any server and launch one of the following administrative commands as the
'postgres' user.

Show cluster nodes
------------------

Type:

.. code-block:: sh

    node_show_all
to get a list of the current nodes. This is an example output:

+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| node_id | hostname       | port | status | lb_weight | role    | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  |
+=========+================+======+========+===========+=========+============+===================+===================+===================+========================+=====================+
| 0       | postgresprod01 | 5433 | up     | 0.500000  | standby | 308113326  | true              | 0                 | streaming         | async                  | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| 1       | postgresprod02 | 5433 | up     | 0.250000  | standby | 104541395  | false             | 0                 | streaming         | async                  | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| 2       | postgresprod03 | 5433 | up     | 0.250000  | primary | 131545180  | false             | 0                 |                   |                        | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+

The above shows the status of a healthy cluster: every node is up, and the two
standby (slave) nodes have an active streaming asynchronous replication state
with no replication delay. The node with load_balance_node set to true is the
one that served the node_show_all query we just made.

If one of the nodes is "down", or quarantined, the first thing to check is
whether the Postgres server is running.
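A couple of quick, illustrative checks for that (5433 is the backend port used
throughout this document; the service name may differ on your installation):

.. code-block:: sh

    # On the affected backend node:
    /etc/init.d/postgresql status   # or: systemctl status postgresql
    pg_isready -p 5433              # "accepting connections" means the server is up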
If the backend node is running and its streaming replication is attached, like
below:

+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| node_id | hostname       | port | status | lb_weight | role    | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  |
+=========+================+======+========+===========+=========+============+===================+===================+===================+========================+=====================+
| 0       | postgresprod01 | 5433 | down   | 0.500000  | standby | 308113326  | true              | 0                 | streaming         | async                  | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| 1       | postgresprod02 | 5433 | up     | 0.250000  | standby | 104541395  | false             | 0                 | streaming         | async                  | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| 2       | postgresprod03 | 5433 | up     | 0.250000  | primary | 131545180  | false             | 0                 |                   |                        | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+

chances are that you can simply reattach the node to the cluster with the
following command (X is 0 in the case above):

.. code-block:: sh

    node_attach X

If this does not seem to work (e.g. the output is not what you expect), bring
the node down and recover it as described below. First, stop the backend:

.. code-block:: sh

    /etc/init.d/postgresql stop

Or, as the postgres user:

.. code-block:: sh

    pg_ctl stop -m immediate

When the node is really down, you can proceed with the node recovery.

Node recovery
-------------

To recover a node that is down, has for some reason been quarantined/removed
from the cluster, or has no SR active, the recovery must be done using the
PGPool-II tools. From one node that is up, preferably the current master, run:

.. code-block:: sh

    node_recovery X

Where X is the id you find in the node_show_all output above. It will take a
long time, because the dead node must take a full base backup from the master
(to ensure that the base backup is good), then receive the last WAL segments,
resync the replication and finally attach itself again to the PGPool-II cluster.
*BE PATIENT* and do *NOT* Ctrl-C the process. You can follow the node recovery
in the PGPool cluster and Postgres logs (more on that below). Finally, you will
read something like:

.. code-block:: sh

    Node recovery complete

and then you can use node_show_all to confirm that the node is back in the
cluster.

Watchdog information and cluster split-brain recovery
------------------------------------------------------

Note that a split-brain situation is unlikely to happen, because we have three
nodes: the third node acts as an arbiter and prevents the other two from arguing
over who should be the master, which would otherwise leave the cluster with
nodes that keep stepping down from master and re-entering the election. However,
if a watchdog process dies in the middle of a node failover, some edge cases may
present themselves, and the following procedures will help with recovery.
Always stay calm: the cluster is resilient and will keep serving requests the
whole time you are managing it, as long as you do not take down the virtual IP
address.

By typing the command:

.. code-block:: sh

    watchdog_info

you will be presented with the watchdog process list and the watchdog cluster
status for PgPool-II. The WD is what keeps your cluster alive, so it is
important to have it always running. The output could be something like the
following:

.. code-block:: sh

    3 YES postgresprod01:5432 Linux postgresprod01 postgresprod01

    postgresprod01:5432 Linux postgresprod01 postgresprod01 5432 9000 4 MASTER
    Not_Set postgresprod02 5432 9000 0 DEAD
    postgresprod03:5432 Linux postgresprod03 postgresprod03 5432 9000 7 STANDBY

In the above output one of the watchdog processes is dead, so we have to restart
the PGPool-II process on that node. Log in to it, remove the stale pidfile in
/var/run/postgresql/pgpool.pid, then start PGPool-II, as the root user, with:

.. code-block:: sh

    /etc/init.d/pgpool2 start

Check the results in the aggregated logs. A good output should show:

.. code-block:: sh

    3 YES postgresprod01:5432 Linux postgresprod01 postgresprod01

    postgresprod01:5432 Linux postgresprod01 postgresprod01 5432 9000 4 MASTER
    postgresprod02:5432 Linux postgresprod02 postgresprod02 5432 9000 7 STANDBY
    postgresprod03:5432 Linux postgresprod03 postgresprod03 5432 9000 7 STANDBY

Carefully check the PGPool-II logs for cluster master escalation problems
(split-brain and such). We must always have exactly one master.

If the PGPool-II process does not rejoin the cluster for some reason, it could
be an ARP cache problem (i.e. the virtual IP was switched over to a new master,
but the ARP cache has not been cleared: you can check with arp -a, comparing the
MAC address with that of the current PGPool-II master holding the master address
172.16.0.30/postgres-prod). To resolve the problem, launch on ALL nodes, as
root:

.. code-block:: sh

    ip -s -s neigh flush all

And then, on ALL nodes:

.. code-block:: sh

    /etc/pgpool2/scripts/arping.sh postgres-prod

Then restart the node.

Finally, if all else fails and you keep seeing messages like the following:

.. code-block:: sh

    LOG: I am the coordinator but "postgresprod02:5432 Linux postgresprod02" is also announcing as a coordinator
    DETAIL: trying to figure out the best contender for the master/coordinator node
    LOG: remote node:"postgresprod02:5432 Linux postgresprod02" should step down from master because it is not escalated
    LOG: We are in split brain, and I am the best candidate for master/coordinator
    DETAIL: asking the remote node "postgresprod02:5432 Linux postgresprod02" to step down
    LOG: "postgresprod01:5432 Linux postgresprod01" is the coordinator as per our record but "postgresprod02:5432 Linux postgresprod02" is also announcing as a coordinator
    DETAIL: cluster is in the split-brain

just run on *ALL* three servers the command:

.. code-block:: sh

    /etc/init.d/pgpool2 force-reload

and everything will get back to normal.

*As a final check*, don't forget to log in on all the Pushcamp nodes and ping
the *postgres-prod* address. If it cannot be pinged, you have to flush the ARP
table cache:

.. code-block:: sh

    ip -s -s neigh flush all

on all the Pushcamp client nodes that did not refresh their ARP table, and then
restart the clients there.
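A related, illustrative check is to find out which PGPool-II node is actually
holding the virtual IP at any given moment (the address is the
172.16.0.30/postgres-prod mentioned above):

.. code-block:: sh

    # Run on each of the three PGPool-II nodes: exactly one should match.
    ip addr show | grep -F 172.16.0.30

    # From a client node, the name should resolve and respond, and the MAC seen
    # in the ARP cache should belong to that same node.
    ping -c 3 postgres-prod
    arp -a | grep -i postgres-prod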
Multiple standby nodes and node promoting
-----------------------------------------

It may happen that, after a massive and prolonged network failure in which all
the backend nodes go down at once, you end up with a bad situation like the
following when you list the nodes:

+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| node_id | hostname       | port | status | lb_weight | role    | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  |
+=========+================+======+========+===========+=========+============+===================+===================+===================+========================+=====================+
| 0       | postgresprod01 | 5433 | up     | 0.500000  | standby |            | true              | 0                 |                   |                        | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| 1       | postgresprod02 | 5433 | up     | 0.250000  | standby |            | false             | 0                 |                   |                        | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+
| 2       | postgresprod03 | 5433 | up     | 0.250000  | standby |            | false             | 0                 |                   |                        | 2020-04-15 18:21:08 |
+---------+----------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------+

This is a bad situation, but it is also recoverable. All the nodes are standby
and there is no master. You need to promote one of the PGPool-II nodes to be the
new master and align the PostgreSQL backend accordingly. To avoid any loss of
data, you should remember which node was the last active primary. If you don't,
please check carefully after the cluster has been recovered and, if necessary,
use Barman to launch a point-in-time recovery to the last available point.

Type, as the postgres user, on the "old" primary:

.. code-block:: sh

    node_promote X

Where X is the node id in the table above. The command should return
immediately. Then, on the corresponding backend node, type:

.. code-block:: sh

    pg_ctl promote

The node should promote itself to master for SR, and the cluster should begin to
assemble again. You should now have one primary, with the two standby nodes
still out of luck. Now log in to the two remaining nodes, shut down their
PostgreSQL backends as already explained above, and check via node_show_all that
they show up as "down". You should have one master node with the "up" status,
and two slave nodes, down, with SR not running. Now launch:

.. code-block:: sh

    node_recovery Y

for each of the ids corresponding to the slave nodes to recover. It will take a
long time but, eventually, the cluster will be up again.
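Once both recoveries have finished, a quick, illustrative way to double-check
that streaming replication is really back is to query pg_stat_replication on the
new primary (run as the postgres user; 5433 is the backend port used in this
document):

.. code-block:: sh

    psql -p 5433 -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
    # Expect two rows, one per standby, with state = 'streaming'.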
Logging and Zabbix alarms
=========================

Logging is centralized on the storage server, which gathers all the logs from
all the backend nodes, both PostgreSQL and PGPool-II:

- PGPool-II logs are in /var/log/pgpool/pgpool-prod
- PostgreSQL logs are in /var/log/pgpool/postgres-prod
- Barman logs are in /var/log/pgpool/barman-prod

The following is the list of the current alarms, with the explanation of each of
them:

+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| Alarm string                                 | Meaning                          | Solution                                                                                |
+=============================================+==================================+=========================================================================================+
| Pgpool-II backend node is down on           | The node is not attached.        | Reattach the node.                                                                      |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| PostgreSQL service is not running on        | The PostgreSQL service is down.  | Recover the node or reattach the node.                                                  |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| PostgreSQL Transaction IDs are exhausted on | Transaction IDs (XIDs) exhausted.| Check that autovacuum is running on all servers. Restart the nodes, and recover them.   |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| PostgreSQL split-brain is occurring on      | Split-brain occurring.           | Recover as explained in the Watchdog section above.                                     |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| PGPool-II split-brain is occurring on       | Split-brain occurring.           | Recover as explained in the Watchdog section above.                                     |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| PGPool-II service is not running on         | PGPool-II service is not running.| Restart the PGPool-II process as explained above.                                       |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| PGPool-II delegate IP not exists on         | The VIP is not on any server!    | You probably have all standbys and no master. Recover as explained.                     |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| All PostgreSQL servers are dead on          | The PostgreSQL service is down.  | Recover the node.                                                                       |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
| All PGPool-II servers are dead on           | PGPool-II service is not running.| Restart the PGPool-II process as explained above.                                       |
+---------------------------------------------+----------------------------------+-----------------------------------------------------------------------------------------+
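When any of the alarms above fires, a good first step is to follow the
aggregated logs; the exact file layout under the directories listed above is an
assumption here:

.. code-block:: sh

    tail -F /var/log/pgpool/pgpool-prod/* /var/log/pgpool/postgres-prod/*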