Postgres and other musings

Announcing TPC-C.js; a Lightweight Implementation of TPC-C

I am glad to announce the beta release of TPC-C.js, which implements one of the most popular database benchmarks, TPC-C. It’s not a coincidence that today is also the 22nd anniversary of the TPC-C benchmark.

It currently supports Postgres database, but can be easily extended to test other database systems.

You might ask “Why another TPC-C implementation when we already have so many of them?”“

Short answer: This one is very light on system resources, so you can

  1. Run the benchmark strictly adhering to the specification, and
  2. Invest more in database hardware, rather than client hardware.

Long answer: It’s covered in the Motivation section of TPC-C.js, which I’ll quote here:

Motivation

The TPC-C benchmark drivers currently available to us, like TPCC-UVa, DBT2, HammerDB, BenchmarkSQL, etc., all run one process (or thread) per simulated client. Because the TPC-C benchmark specification limits the max tpmC metric (transactions per minute of benchmark-C) from any single client to be 1.286 tpmC, this means that to get a result of, say, 1 million tpmC we have to run about 833,000 clients. Even for a decent number as low as 100,000 tpmC, one has to run 83,000 clients.

Given that running a process/thread, even on modern operating systems, is a bit expensive, it requires a big upfront investment in hardware to run the thousands of clients required for driving a decent tpmC number. For example, the current TPC-C record holder had to run 6.8 million clients to achieve 8.55 million tpmC, and they used 16 high-end servers to run these clients, which cost them about $ 220,000 (plus $ 550,000 in client-side software).

So, to avoid those high costs, these existing open-source implementations of TPC-C compromise on the one of the core requirements of the TPC-C benchmark: keying and thinking times. These implementations resort to just hammering the SUT (system under test) with a constant barrage of transactions from a few clients (ranging from 10-50).

So you can see that even though a decent modern database (running on a single machine) can serve a few hundred clients simultaneously, it ends up serving very few (10-50) clients. I strongly believe that this way the database is not being tested to its full capacity; at least not as the TPC-C specification intended.

The web-servers of yesteryears also suffer from the same problem; using one process for each client request prohibits them from scaling, because the underlying operating system cannot run thousands of processes efficiently. The web-servers solved this problem (known as c10k problem) by using event-driven architecture which is capable of handling thousands of clients using a single process, and with minimal effort on the operating system’s part.

So this implementation of TPC-C uses a similar architecture and uses NodeJS, the event-driven architecture, to run thousands of clients against a database.

Postgres Hibernator: Reduce Planned Database Down Times

TL;DR: Reduce planned database down times by about 97%, by using Postgres Hibernator.

DBAs are often faced with the task of performing some maintenance on their database server(s) which requires shutting down the database. The maintenance may involve anything from a database minor-version upgrade, to a hardware upgrade. One ugly side-effect of restarting the database server/service is that all the data currently in database server’s memory will be all lost, which was painstakingly fetched from disk and put there in response to application queries over time. And this data will have to be rebuilt as applications start querying database again. The query response times will be very high until all the “hot” data is fetched from disk and put back in memory again.

People employ a few tricks to get around this ugly truth, which range from running a select * from app_table;, to dd if=table_file ..., to using specialized utilities like pgfincore to prefetch data files into OS cache. Wouldn’t it be ideal if the database itself could save and restore its memory contents across restarts!

The Postgres Hibernator extension for Postgres performs the automatic save and restore of database buffers, integrated with database shutdown and startup, hence reducing the durations of database maintenance windows, in effect increasing the uptime of your applications.

Postgres Hibernator automatically saves the list of shared buffers to the disk on database shutdown, and automatically restores the buffers on database startup. This acts pretty much like your Operating System’s hibernate feature, except, instead of saving the contents of the memory to disk, Postgres Hibernator saves just a list of block identifiers. And it uses that list after startup to restore the blocks from data directory into Postgres’ shared buffers.

As explained in my earlier post, this extension is a set-it-and-forget-it solution, so, to get the benefits of this extension there’s not much a DBA has to do, except install it.

Ideal database installations that would benefit from this extension would be the ones with a high cache-hit ratio. With Postgres Hibernator enabled, your database would start cranking pre-maintenance TPS (Transactions Per Second) within first couple of minutes after a restart.

As can be seen in the chart below, the database ramp-up time drops dramatically when Postgres Hibernator is enabled. The sooner the database TPS can reach the steady state, the faster your applications can start performing at full throttle.

The ramp-up time is even shorter if you wait for the Postgres Hibernator processes to end, before starting your applications.

Sample Runs

Postgres Hibernator Comparison Postgres Hibernator Comparison Bars

As is quite evident, waiting for Postgres Hibernator to finish loading the data blocks before starting the application yeilds a 97% impprovement in database ramp-up time (2300 seconds to get to 122k TPS without Postgres Hibernator vs. 70 seconds).

Details

Please note that this is not a real benchmark, just something I developed to showcase this extension at its sweet spot.

The full source of this mini benchmark is available with the source code of the Postgres Hibernator, at its Git repo.

1
2
3
4
5
6
7
8
Hardware: MacBook Pro 9,1
OS Distribution: Ubuntu 12.04 Desktop
OS Kernel: Linux 3.11.0-19-generic
RAM: 8 GB
Physical CPU: 1
CPU Count: 4
Core Count: 8
pgbench scale: 260 (~ 4 GB database)

Before every test run, except the last (‘DB-only restart; No Hibernator’), the Linux OS caches are dropped to simulate an OS restart.

In ‘First Run’, the Postgres Hibernator is enabled, but since this is the first ever run of the database, Postgres Hibernator doesn’t kick in until shutdown, to save the buffer list.

In ‘Hibernator w/ App’, the application (pgbench) is started right after database restart. The Postgres Hibernator is restoring the data blocks to shared buffers while the application is also querying the database.

In the ‘App after Hibernator’ case, the application is started after the Postgres Hibernator has finished reading database blocks. This took 70 seconds for reading the ~4 GB database.

In ‘DB-only restart; No Hibernator` run, the OS caches are not dropped, but just the database service is restarted. This simulates database minor version upgrades, etc.

It’s interesting to monitor the bi column in vmstat 10 output, while these tests are running. In ‘First Run’ and ‘DB-only restart’ cases this column’s values stayed between 2000 and 5000 until all data was in shared buffers, and then it dropped to zero (meaning that all data has been read from disk into shared buffers). In ‘Hibernator w/ app’ case, this column’s value ranges from 15,000 to 65,000, with an average around 25,000. it demonstrates that Postgres Hibernator’s Block Reader process is aggressively reading the blocks from data directory into the shared buffers, but apparently not fast enough because the applicaion’s queries are causing random reads from disk, which interfere with the sequential scans that Postgres Hibernator is trying to perform.

And finally, in ‘App after Hibernator’ case, this column consistently shows values between 60,000 and 65,000, implying that in absence of simultaneous application load, the Block Reader can read data into shared buffers much faster.

Introducing Postgres Hibernator

As it must have been obvious from my last post that I wasn’t really pleased by the amount of work needed to implement hibernation of Postgres shared-buffers, so I set out to implement a seamless Postgres hibernation solution.

A couple of hours ago I published the Postgres/EDB extension I had been working on for last 10 days or so, in my spare time. Following are the contents of the README file from the extension.

Postgres Hibernator

This Postgres extension is a set-it-and-forget-it solution to save and restore the Postgres shared-buffers contents, across Postgres server restarts.

For some details on the internals of this extension, also see the proposal email to Postgres hackers’ mailing list.

Why

When a database server is shut down, for any reason (say, to apply patches, for scheduled maintenance, etc.), the active data-set that is cached in memory by the database server is lost. Upon starting up the server again, the database server’s cache is empty, and hence almost all application queries respond slowly because the server has to fetch the relevant data from the disk. It takes quite a while for the server to bring the cache back to similar state as before the server shutdown.

The duration for which the server is building up caches, and trying to reach its optimal cache performance is called ramp-up time.

This extension is aimed at reducing the ramp-up time of Postgres servers.

How

Compile and install the extension (of course, you’d need Postgres installation or source code):

$ make -C pg_hibernate/ install

Then.

  1. Add pg_hibernate to the shared_preload_libraries variable in postgresql.conf file.
  2. Restart the Postgres server.
  3. You are done.

How it works

This extension uses the Background Worker infrastructure of Postgres, which was introduced in Postgres 9.3. When the server starts, this extension registers background workers; one for saving the buffers (called Buffer Saver) when the server shuts down, and one for each database in the cluster (called Block Readers) for restoring the buffers saved during previous shutdown.

When the Postgres server is being stopped/shut down, the Buffer Saver scans the shared-buffers of Postgres, and stores the unique block identifiers of each cached block to the disk (with some optimizatins). This information is saved under the $PGDATA/pg_hibernate/ directory. For each of the database whose blocks are resident in shared buffers, one file is created; for eg.: $PGDATA/pg_hibernate/2.postgres.save.

During the next startup sequence, the Block Reader threads are registerd, one for each file present under $PGDATA/pg_hibernate/ directory. When the Postgres server has reached stable state (that is, it’s ready for database connections), these Block Reader processes are launched. The Block Reader process reads the save-files looking for block-ids to restore. It then connects to the respective database, and requests Postgres to fetch the blocks into shared-buffers.

Caveats

  • It saves the buffer information only when Postgres server is shutdown in normal mode.
  • It doesn’t save/restore the filesystem/kernel’s disk cache.

Hibernating and Restoring Postgres Buffer Cache

With the introduction of pg_prewarm extension in Postgres, it has become very easy to save and restore the contents of Postgres server’s buffer cache across a server restart. The tables, indexes and other on-disk data structures (like TOAST data, Visibility Map, etc.) are cached in shared-buffers before they can be read or modified by user queries. Hence, by saving the shared-buffers before a server restart, and restoring those buffers after the restart, you can expect to reduce the ramp-up time, and hence expect peak performance almost right off the bat.

Following is a brain-dead way of implementing hibernation, and I am sure it can be optimized to reduce the number of calls to pg_prewarm, since it allows the caller to specify a range of blocks to be loaded, and I’m asking it to load just one block per call.

We’re going to use pg_buffercache to extract the list of buffers currently loaded in buffer cache, and after a server restart, pg_prewarm will be used to load those buffers back in.

Declare environment variables to be used by scripts

HIBERNATE_DESTINATION=$HOME/pg_hibernate/

export PGUSER=postgres
PSQL="psql"
PSQL_TEMPL_DB="$PSQL -d template1"
PSQL_PG_DB="$PSQL -d postgres"

PSQL_TEMPL_NOFLUFF="$PSQL_TEMPL_DB -A -t"
PSQL_PG_NOFLUFF="$PSQL_PG_DB -A -t"

Prepare the databases

This is a one time operation. We avoid installing anything into template0 database, since it is a read-only database. But we do consciously install the extension into template1 database; this is so that any new databases created after this point will get this extension pre-installed.

Install extensions

for db in $($PSQL_TEMPL_NOFLUFF -c 'select datname from pg_database where datname <> $$template0$$'); do
  echo Installing pg_prewarm in $db
  $PSQL_TEMPL_DB -c 'create extension if not exists pg_prewarm;'
done

echo Installing pg_buffercache extension in postgres database
$PSQL_PG_DB -c 'create extension if not exists pg_buffercache;'

Save buffer information

We are actually generating a psql script, that can be later fed to psql, as is.

mkdir -p $HIBERNATE_DESTINATION

for db in $($PSQL_TEMPL_NOFLUFF -c 'select datname from pg_database where datname <> $$template0$$'); do
  $PSQL_PG_NOFLUFF -c 'select    $q$select pg_prewarm((    select    oid
                                                         from      pg_class
                                                         where     pg_relation_filenode(oid) = $q$ || relfilenode || $q$)::regclass,
                                                     $$buffer$$, $q$
                                                     || case relforknumber
                                                        when 0 then $q$$$main$$$q$
                                                        when 1 then $q$$$fsm$$$q$
                                                        when 2 then $q$$$vm$$$q$
                                                        when 3 then $q$$$init$$$q$
                                                        end || $q$, $q$
                                                     || relblocknumber || $q$, $q$
                                                     || relblocknumber || $q$);$q$
                                    from     pg_buffercache
                                    where    reldatabase = (select    oid
                                                            from      pg_database
                                                            where     datname = $$'${db}'$$)
                                    order by relfilenode, relforknumber, relblocknumber;' > $HIBERNATE_DESTINATION/${db}.save
done

Restore the buffers

Ideally, this would be performed after a server restart, but there’s no harm in doing it without the restart either (except the performance implications, if doing this eveicts buffers currently in use by other connections ;). Note that if some table/index’s underlying file storage has been renamed by the server since we extracted the buffers list, some of these calls will return with ERROR.

We connect to each database, and simply feed the script generated earlier, into psql.

for db in $($PSQL_TEMPL_NOFLUFF -c 'select datname from pg_database where datname <> $$template0$$'); do
  if [ ! -e $HIBERNATE_DESTINATION/${db}.save ]; then
    continue
  fi
  $PSQL -d $db -f $HIBERNATE_DESTINATION/${db}.save
done

At this point, the Postgres shared-buffers should contain all the buffers that were present when we extracted the buffer list from pg_buffercache.

PS: When reviewing the pg_prewarm code, I did not think through the user-experience aspect of this extension. But after it was committed, the more I thought about how it’s going to be used, the less appealing this solution became. As is evident from above, the administrator needs the help of two extensions, and then a storage location where to store the list of buffers. Ideally, as a DBA, I would like to see a feature which doesn’t require me to muck around with catalogs etc. I have a design for such an extension, and may start coding it some time soon.

Understanding Postgres Parameter Context

Postgres’ parameters have an associated context, which determines when that parameter can be changed.

You can see the context of every parameter using the following query. A sample output is also shown.

select name, context from pg_settings order by category;

              name               |  context   
---------------------------------+------------
 autovacuum_freeze_max_age       | postmaster
 autovacuum_max_workers          | postmaster
 autovacuum_vacuum_threshold     | sighup
 ...
 IntervalStyle                   | user
 ...
 server_encoding                 | internal
 ...
 lc_messages                     | superuser
 ...
 local_preload_libraries         | backend
 ...

The possible values of context are:

  • internal (called PGC_INTERNAL in source code)
  • postmaster (PGC_POSTMASTER)
  • sighup (PGC_SIGHUP)
  • backend (PGC_BACKEND)
  • superuser (PGC_SUSET)
  • user (PGC_USERSET)

The above list is in order of when a parameter can be set; if a parameter can be changed in a certain context, then it can be changed at any of the earlier contexts as well.

internal

The internal parameters cannot be changed; these are usually compile-time constants. If you want to change any of these, you’ll have to change it in Postgres source code and compile a new set of Postgres executables.

postmaster

The postmaster parameters can be set at Postgres startup, or during source code compilation. (Postmaster is the parent process of all the Postgres processes, hence the context’s name).

These parameters can be set in the postgresql.conf file or on the command-line when starting the Postgres server.

sighup

The sighup parameters can be changed while the server is running, at Postgres startup, or during code compilation.

To change such a parameter, you can change it in the postgresql.conf file and send a SIGHUP signal to the Postmaster process. An easy way to send the SIGHUP signal to the Postmaster process is to use pg_ctl or your distribution’s Postgres-init script, like so:

pg_clt -D $PGDATA reload

OR

sudo service postgresql-9.3 reload

backend

The backend parameters can be changed/set while making a new connection to Postgres, and never after that (and these can be changed by SIGHUP, at Postgres startup, or during code compilation).

Usually the applications set these parameters while making the initial connection.

An example is the local_preload_libraries parameter. Say, if you want to try a plugin for just one session, then you can initiate a psql session, with that plugin loaded for the connection, like so:

PGOPTIONS="-c local_preload_libraries=my_plugin" psql

The above method of changing parameters is possible for any application that uses libpq library to connect to Postrges (for eg. pgAdmin), since the PGOPTIONS environment variable is recognized and honored by libpq. Other applications/libraries may have their own methods to allow changing parameters during connection initiation.

superuser

To change a superuser parameter, one needs to have superuser privileges in Postgres. These parameters can be changed while a session is in progress, during a backend startup, using SIGHUP, at Postgres startup, or during code compilation.

Note that normal users cannot change these parameters.

user

The parameters with user context can be changed by any user, at any time, to affect the current session they are connected to. Needless to say that since this is the last context in the list, a parameter that is marked as user context, can be changed using any of the methods shown for the other contexts.

The SET command can be used to change a user context parameter’s value, for eg.:

SET work_mem = '32 MB';

Using context in queries

Although, as explained above, there is a certain order in the values of context, there is no built-in way for one to see this order, and exploit that knowledge using queries.

Say, if one wants to see a list of all parameters that cannot be changed by a normal user, there’s no straightforward way to do it. To that end, I create the following enum type and use it in queries to extract that information easily:

create type guc_context as enum (
    'internal',
    'postmaster',
    'sighup',
    'backend',
    'superuser',
    'user');

select name as cannot_be_changed_by_user, context
from pg_settings
where context::guc_context < 'user';

Other useful information that can now be easily extracted using this enum:

select name as parameter,
    context_enum > 'internal' as can_be_changed,
    context_enum = 'postmaster' as change_requires_restart,
    context_enum >= 'sighup' as can_be_changed_by_reload
from (select name, context::guc_context as context_enum
    from pg_settings) as v;

            parameter            | can_be_changed | change_requires_restart | can_be_changed_by_reload 
---------------------------------+----------------+-------------------------+--------------------------
 allow_system_table_mods         | t              | t                       | f
 application_name                | t              | f                       | t
 archive_command                 | t              | f                       | t
 archive_mode                    | t              | t                       | f
 archive_timeout                 | t              | f                       | t
 array_nulls                     | t              | f                       | t
 authentication_timeout          | t              | f                       | t
 ...