Some time ago I had to install and setup a PostgreSQL cluster for a very critical application, an Warehouse Management System (WMS) developed internally for an ecommerce startup in which I worked.
I would like to share the steps I followed back in those days to setup the proper configuration, I have to say that I was a lucky guy because the Dev Lead who managed the Dev Team in this case he had already worked with PostgreSQL in the past, so our meetings to define Dev Team requirements were pretty easy.
From our meetings, I had to understand what were the specific requirements needed for such a specific peace of software as a WMS is, in a regular situation this software is provided by an external Warehouse company, but in this case the relationship with this startup was so good with Warehouse company that they reached an agreement to include a very custom WMS inside the Warehouse provider infrastructure, which it’s not very common situation.
Apart from that the requirements for the PostgreSQL cluster were pretty clear:
- Not expecting a high load we only needed two nodes, one master node and one replica in standby mode
- Regarding replication strategy we decided to use an Asynchronous streaming replication
Setting up the MASTER
Create archive directory:
db01:/etc/postgresql/10/main# mkdir /opt/postgresql/10/main/archive db01:/etc/postgresql/10/main# chown postgres:postgres /opt/postgresql/10/main/archive
Edit postgresql.conf configuration file:
... listen_addresses = '*' # what IP address(es) to listen on; wal_level = hot_standby # minimal, archive, hot_standby, or logical archive_mode = on # enables archiving; off, on, or always archive_command = 'cp %p /opt/postgresql/10/main/archive/%f' max_wal_senders = 5 # max number of walsender processes wal_keep_segments = 32 # in logfile segments, 16MB each; 0 disables hot_standby = on # "on" allows queries during recovery ...
Setting up the REPLICA
The configuration file postgresql.conf was the same as on the master node in our case. It was needed to create an archive directory:
db02:/etc/postgresql/10/main# mkdir /opt/postgresql/10/main/archive db02:/etc/postgresql/10/main# chown postgres:postgres /opt/postgresql/10/main/archive
It was also required to create a recovery.conf file with this content in our DATA DIRECTORY:
# This tells the slave to keep pulling WALs from master. standby_mode = on # This is how to connect to the master. primary_conninfo = 'host=10.240.0.XX user=repl password=YYYYY' trigger_file = '/opt/postgresql/10/main/archive/failover_db_trigger' restore_command = 'cp /opt/postgresql/10/main/archive/%f %p' archive_cleanup_command = 'pg_archivecleanup /opt/postgresql/10/main/archive %r'
As an additional setup I included an immutability check for this configuration file:
db02:/opt/postgresql/10/main# chattr +i recovery.conf
To verify that everything was working from replication perspective we made some tests:
postgres=# create database foo; CREATE DATABASE postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+----------+----------+-------------+-------------+----------------------- foo | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres + | | | | | postgres=CTc/postgres template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres + | | | | | postgres=CTc/postgres (4 rows)
postgres=# create database foo; ERROR: cannot execute CREATE DATABASE in a read-only transaction To verify replication status: # sudo -u postgres psql -c "select * from pg_stat_replication";
If you face some issues as we did something similar to this:
2018-05-22 09:34:54.839 UTC  LOG: record with incorrect prev-link 0/70013E8 at 0/7001420 cp: cannot stat '/opt/postgresql/10/main/archive/000000010000000000000007': No such file or directory 2018-05-22 09:34:59.844 UTC  LOG: record with incorrect prev-link 0/70013E8 at 0/7001420 cp: cannot stat '/opt/postgresql/10/main/archive/000000010000000000000007': No such file or directory 2018-05-22 09:35:04.851 UTC  LOG: record with incorrect prev-link 0/70013E8 at 0/7001420
We followed these steps to fix the issue (note 22960 port was a custom port where SSH server was running):
0. Stop postgresql on standby standby# systemctl stop postgresql 1. invoke pg_start_backup() on master master# sudo -u postgres psql -c "SELECT pg_start_backup('`date +'%Y-%m-%d %H:%M'`')" 2. remove data files on standby standby# /bin/rm -rf /opt/postgresql/10/main/* 3. copy data files master# rsync -e 'ssh -p 22960' -av --delete /opt/postgresql/10/main pro-wms-db-g02:/opt/postgresql/10/ --exclude=pg_xlog --exclude=postmaster.pid --exclude=server.crt --exclude=server.key 4. invoke pg_stop_backup() on master master# sudo -u postgres psql -c "SELECT pg_stop_backup()" 5. start postgresql on standby standby# systemctl start postgresql
“It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you’ll do things differently”
— Warren Buffett