The Vanishing Thread and PostgreSQL TCP Connection Parameters

Mostafa Abdelraouf · Published in tech-at-instacart · Apr 15, 2022

Like any good story, this one starts in the logs. Our platform engineering team noticed that we occasionally saw very long-running requests that ran well past our SLA. They seemed to ignore any application timeouts, timing out around 15 minutes instead of 5, 10, or 15 seconds. We thought it might be a logging fluke, but after digging further into the logs it seemed that the Puma threads that raised these delayed timeouts had stopped processing any work and were blocked on something.

Log volume from a single thread on a busy web service. You can see the 15-minute gap
Logs from the 15-minute request. The 15-minute gap is also visible here

What was particularly interesting here was that all 15-minute requests were blocked on what appeared to be a 15-minute database query. Now, this is impossible because we have a tight statement_timeout on Postgres on the scale of seconds (not minutes!). What’s more, we log slow queries using Postgres’s log_min_duration_statement setting; a query that took 15 minutes would certainly have shown up in that log, but it didn’t.

Our general database access layout looks something like this

  Application → AWS Network Load Balancer → Pgbouncer → Database

If it is not the database, maybe the query was blocked on Pgbouncer? That is also highly unlikely because we have a very short query_wait_timeout configured for our Pgbouncers (from the Pgbouncer docs, query_wait_timeout is the maximum time queries are allowed to spend waiting for execution; if the query is not assigned to a server during that time, the client is disconnected), so a query can’t spend more than a few seconds on Pgbouncer before being sent to the database or rejected with a query_wait_timeout error.

The Phantom 15-minute database query

If neither the Database nor Pgbouncer could have possibly seen this query, the query must have been lost somewhere between the application and the NLB or the NLB and Pgbouncer.

Examining network metrics between the application and the Network Load Balancer revealed a spike in TCP retransmits right around the time when the request got stuck. When we went back and examined older 15-minute requests, we noticed they were always preceded by a TCP retransmission spike roughly 15 minutes earlier. We also noticed that such retransmission blips are not uncommon.

Retransmission spike responsible for the 15-minute gap described above
A more recent but much more severe retransmission spike/storm

Okay, so we have a smoking gun, but how does the increase in TCP retransmissions relate to the 15-minute requests?

By default, the Linux kernel will retransmit unacknowledged packets up to 15 times with exponential backoff; after that, it gives up and notifies the application that the connection is lost. The default settings yield a retransmission timeout that can stretch to roughly 15 minutes, which lines up very well with the timeouts we were seeing.
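
For reference, this retransmission limit is exposed through procfs and can be checked on any Linux host. A minimal sketch in Ruby (equivalent to reading the file with cat):

# net.ipv4.tcp_retries2 controls how many times the kernel retransmits
# unacknowledged data before giving up; the Linux default is 15.
puts File.read("/proc/sys/net/ipv4/tcp_retries2").strip # => "15"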

We also saw this issue with background workers that don’t have application-level timeouts. On these workers, the error we saw was

PQconsumeInput() could not receive data from server: Connection timed out

Searching for this error message in the PostgreSQL source code reveals that the suffix Connection timed out comes directly from the OS, which lends more credibility to the TCP retransmissions hypothesis.

Using tcpdump and iptables, we reproduced this 15-minute wait locally by opening a connection, blocking traffic to port 5432, and sending a query to the database.

“ss -o” output showing the TCP retransmission timer (labeled “on”). It also shows the time until the next retransmission attempt (1min36sec) and the number of retransmissions so far (9)
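
A rough sketch of that kind of reproduction (not our exact steps), assuming a local Postgres on port 5432, the pg gem, and root privileges for iptables:

require "pg"

conn = PG.connect(host: "127.0.0.1", port: 5432, dbname: "postgres")
conn.exec("SELECT 1") # connection is established and healthy

# Silently drop outgoing packets to Postgres; retransmissions now go unanswered.
system("iptables -A OUTPUT -p tcp --dport 5432 -j DROP")

started = Time.now
begin
  conn.exec("SELECT 1") # hangs until the OS gives up retransmitting (~15 minutes by default)
rescue PG::Error => e
  puts "Failed after #{(Time.now - started).round}s: #{e.message}"
ensure
  system("iptables -D OUTPUT -p tcp --dport 5432 -j DROP")
end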

How should we fix this?

It seems that we want to override the number of retransmission attempts the OS makes by default. Under Linux, this is controlled by the system-wide setting tcp_retries2, but changing it would affect all outgoing traffic, which is too disruptive. Luckily, libpq-based clients allow the developer to set TCP_USER_TIMEOUT on individual connections; this timeout caps how long retransmissions are attempted before the connection is killed. This is exactly what we want, but there is a catch: setting TCP_USER_TIMEOUT overrides a set of other TCP settings such as keepalives and tcp_syn_retries, so we have to pick the value carefully.

There is an excellent article by Marek Majkowski that describes this interaction quite nicely. For sake of completeness, we will also cover this interaction, how each parameter can be set, and how these parameters work.

So, what value should we use for TCP_USER_TIMEOUT? Well, it depends.
In our use case, we want to drop unresponsive connections early while giving struggling connections enough of a window to retry. Naturally, this depends on network conditions, so it requires tuning by trial and error while keeping in mind that TCP_USER_TIMEOUT interacts with other TCP parameters.

So, does changing TCP_USER_TIMEOUT actually help?

We started with aggressive values for TCP_USER_TIMEOUT and keepalives, bumped them up until we stopped seeing Connection timed out errors, and added a small buffer on top. The values we landed on for our application PG Connection objects are

keepalives: 1            # Interpreted as a boolean
keepalives_idle: 2       # seconds
keepalives_interval: 3   # seconds
keepalives_count: 3      # a count
tcp_user_timeout: 9000   # In milliseconds

With these parameters, keepalives will kill an idle unresponsive connection around the 11-second mark (keepalives_idle of 2 seconds plus 3 probes at 3-second intervals), and will kill a non-idle but unresponsive connection once data has gone unacknowledged for 9 seconds, which works out to about 6 retransmissions.
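
For completeness, this is roughly how those values translate when opening a connection directly with ruby-pg (a sketch; the host and database name are placeholders, and all keys are standard libpq parameters):

require "pg"

conn = PG.connect(
  host: "db-host",
  dbname: "app",
  keepalives: 1,
  keepalives_idle: 2,      # seconds
  keepalives_interval: 3,  # seconds
  keepalives_count: 3,
  tcp_user_timeout: 9000   # milliseconds, requires libpq 12+
)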

So did it work? Yes! We actually started seeing Connection timed out errors around the 13-second mark, which is much better than 15 minutes!

This is also the final piece of evidence that the gap we saw was really due to TCP retransmissions.

Connection timed out TCP error after 13.1 seconds instead of 15 minutes!

Needless to say, there are no universally “good” values for these parameters. The parameters should be tuned for each network/application/use case.

The sections below dig deeper into how to set TCP parameters and how they work; they are meant to serve as a reference rather than a casual read. Feel free to read only the parts that interest you.

Postgres TCP Connection Parameters

Before we start, I want to point out that some of the examples in this article use Ruby/Rails but the information here can be applied to other languages/frameworks.

Most Postgres client libraries are based on libpq, the C API to PostgreSQL. libpq allows the developer to set many connection parameters; below are the parameters relevant to this discussion.

keepalives

Controls whether client-side TCP keepalives are used. Default is 1

keepalives_idle

Controls the number of seconds of inactivity after which TCP should send a keepalive message to the server.

Default is <not_set> which falls back to using system default (controlled by /proc/sys/net/ipv4/tcp_keepalive_time or net.ipv4.tcp_keepalive_time through sysctl)

keepalives_count

Controls the number of TCP keepalives that can be lost before the client’s connection to the server is considered dead.

Default is <not_set> which falls back to using system default (controlled by /proc/sys/net/ipv4/tcp_keepalive_probes or net.ipv4.tcp_keepalive_probes through sysctl)

keepalives_interval

Controls the number of seconds after which a TCP keepalive message that is not acknowledged by the server should be retransmitted.

Default is <not_set> which falls back to using system default (controlled by /proc/sys/net/ipv4/tcp_keepalive_intvl or net.ipv4.tcp_keepalive_intvl through sysctl)

tcp_user_timeout

Controls the number of milliseconds that transmitted data may remain unacknowledged before a connection is forcibly closed.

When set, this timeout will take precedence over the system’s tcp_syn_retries, tcp_keepalive_probes, and tcp_retries2 (and even the keepalives_count configured on the connection)

For example, if tcp_user_timeout is set to 2500:

  • TCP will keep resending SYN packets for up to 2500ms (instead of resending tcp_syn_retries times)

  • TCP will mark the connection dead when 2500ms elapse since the last ACK (instead of sending tcp_keepalive_probes packets)¹

  • TCP will attempt to retransmit unACK’d packets for up to 2500ms (instead of retransmitting tcp_retries2 times with exponential backoff)

Read more at https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

Default is 0 which means it is disabled.

____

[1] The connection will be closed at the time of the next keepalive probe, provided that probe is not the first one; if it is, the connection will be closed at the time of the second probe.
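
Under the hood, libpq’s tcp_user_timeout maps onto the TCP_USER_TIMEOUT socket option. A bare-bones sketch of setting it directly on a socket (assuming Linux; most Ruby builds expose the constant, and the raw Linux option number is 18):

require "socket"

sock = Socket.new(:INET, :STREAM)
opt = Socket.const_defined?(:TCP_USER_TIMEOUT) ? Socket::TCP_USER_TIMEOUT : 18
# Transmitted data may stay unacknowledged for at most 2500ms before the kernel
# closes the connection, regardless of how many retransmissions that takes.
sock.setsockopt(Socket::IPPROTO_TCP, opt, 2500)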

How to set TCP connection parameters

Please note that the parameter values in the following examples are for demonstration purposes only and are not at all a recommendation.

Application

Connection string

You can specify the connection parameters in the query string.

postgres://username:password@host:port/database?connect_timeout=2&keepalives=1&keepalives_idle=1&keepalives_count=1&tcp_user_timeout=1000

This is very convenient because it will work with any libpq-based client regardless of the language.
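
For example, with ruby-pg the URI above can be passed straight to the driver (a sketch, with placeholder credentials and demonstration values):

require "pg"

conn = PG.connect(
  "postgres://username:password@host:5432/database" +
  "?connect_timeout=2&keepalives=1&keepalives_idle=1&keepalives_count=1&tcp_user_timeout=1000"
)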

Ruby On Rails

In database.yml

default: &default
  adapter: postgresql
  pool: 4
  connect_timeout: 2
  keepalives: 1
  keepalives_idle: 1
  keepalives_interval: 1
  tcp_user_timeout: 1000

To see what happens when an application loses a connection to the server due to a TCP timeout, see this section.

Pgbouncer

In the ini file used by pgbouncer

[pgbouncer]
tcp_keepalive=1
tcp_keepcnt=1
tcp_keepidle=1
tcp_keepintvl=1
tcp_user_timeout=1000

Notes on Pgbouncer

  • These settings will apply to both client and server connections.
  • When a keepalive timeout or tcp_user_timeout is triggered, Pgbouncer will emit a log message that looks like this
# For server connection loss
closing because: server conn crashed? (age=2s)
# For client connection loss
closing because: client unexpected eof (age=2s)
  • If Pgbouncer closes a client connection that had a query running on the server or a connection that was idle in an unfinished transaction, it will also close the corresponding server connection with the error message closing because: unclean server or closing because: client disconnect while server was not ready
  • A high volume of server disconnections and reconnections can be detrimental to database performance. In practice, we have seen that significant database connection thrashing manifests as increased system CPU utilization (as opposed to user CPU utilization which indicates query load).

Postgres

In Postgres config file, set the following parameters

# Updating these settings does not require a restart
tcp_keepalives_idle=1
tcp_keepalives_interval=1
tcp_keepalives_count=1
tcp_user_timeout=1000 # Only in Postgres 12+

You can also set these parameters as command-line arguments

postgres -c tcp_keepalives_idle=1 \
-c tcp_keepalives_interval=1 \
-c tcp_keepalives_count=1 \
-c tcp_user_timeout=1000
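
Since these settings do not require a restart, they can also be applied to a running server with ALTER SYSTEM followed by a configuration reload. A sketch using ruby-pg (assumes superuser privileges; the values are for demonstration only):

require "pg"

conn = PG.connect(dbname: "postgres") # ALTER SYSTEM requires superuser privileges
%w[tcp_keepalives_idle tcp_keepalives_interval tcp_keepalives_count].each do |setting|
  conn.exec("ALTER SYSTEM SET #{setting} = 1")
end
conn.exec("SELECT pg_reload_conf()")          # re-read the configuration
puts conn.exec("SHOW tcp_keepalives_idle")[0] # => {"tcp_keepalives_idle"=>"1"}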

When a client connection is killed due to a TCP timeout, PostgreSQL will emit log messages that look like this

LOG:  could not receive data from client: Connection timed out
LOG:  could not send data to client: Connection timed out
# Notice the 'Connection timed out' suffix

Note that these settings apply to incoming connections; they don’t apply to connections initiated by logical replication or foreign data wrappers (FDW).

Postgres Logical Replication

Logical replication connections do not use the tcp_* settings on the server; instead, they are configured on the SUBSCRIPTION object.

CREATE SUBSCRIPTION log_replication
CONNECTION '
host=primary_host
port=5432
dbname=primary
user=main_user
connect_timeout=2
keepalives=1
keepalives_idle=1
keepalives_interval=1
keepalives_count=1
tcp_user_timeout=1000'
PUBLICATION target_publication;

To update connection settings on an existing SUBSCRIPTION

ALTER SUBSCRIPTION log_replication 
CONNECTION '
connect_timeout=2
keepalives=1
keepalives_idle=1
keepalives_interval=1
keepalives_count=1
tcp_user_timeout=1000';

Postgres Foreign Data Wrappers (FDW)

FDW connections do not use the tcp_* settings on the server; instead, they are configured on the SERVER object.

CREATE SERVER IF NOT EXISTS foreign_server 
FOREIGN DATA WRAPPER
postgres_fdw OPTIONS (
dbname 'foreign_db',
host 'foreign_host',
connect_timeout '2',
keepalives '1',
keepalives_idle '1',
keepalives_interval '1',
keepalives_count '1',
tcp_user_timeout '1000'
);

To update settings for an existing SERVER object

ALTER SERVER foreign_server OPTIONS (
SET connect_timeout '2',
SET keepalives '1',
SET keepalives_idle '1',
SET keepalives_interval '1',
SET keepalives_count '1',
SET tcp_user_timeout '1000'
);

Notes regarding FDW

  • If TCP timeouts are triggered, FDW will report an error that looks like this
ERROR:  could not receive data from server: Connection timed out 
# CONTEXT: remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
# Again, notice the 'Connection timed out' suffix

What happens on the application side when a connection is killed due to a TCP timeout (keepalive or tcp_user_timeout)

A libpq client application might report this error as

PQconsumeInput() could not receive data from server: Connection timed out

As stated before, the suffix after the colon in this error message comes straight from the OS; Connection timed out points at TCP timeouts of some sort. No route to host is another example of an error raised by the operating system; it generally indicates ARP resolution failure.

We can use the time until the error was raised by the OS to figure out which timeout occurred.

Assuming the default Linux settings, a Connection timed out triggered after ~15 minutes is likely a retransmission timeout since that is what the default tcp_retries2 value of 15 would yield.

Retransmission (tcp_retries2) timeout

A Connection timed out after about 130 seconds is likely a timeout for failing to get SYN packets ACK’d by the remote host: the default tcp_syn_retries value of 6 consumes roughly 130 seconds (an initial 1-second timeout doubled on each retry, 1 + 2 + 4 + 8 + 16 + 32 + 64 ≈ 127 seconds).

tcp_syn_retries timeout. Note that it is triggered during a connect operation, which is to be expected because that’s when we send SYN packets

Keepalives are somewhat different because the connection expires while it is idle. If the application was actively polling the connection, it will immediately know when the connection is closed and will raise something like PQconsumeInput() could not receive data from server: Connection timed out.

If the application was not polling the connection and instead let it go idle, and the connection then failed the keepalive checks, the application will only learn that the connection died when it tries to use it again. This will manifest as

PG::UnableToSend: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Application attempting to use a connection that was closed, likely because it failed TCP keepalive checks
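
One way an application can cope with this (a sketch, not our production code; whether a retry is safe depends on the statement and on whether a transaction was in progress) is to rescue the error, reset the connection, and retry once:

begin
  conn.exec("SELECT 2")
rescue PG::UnableToSend, PG::ConnectionBad => e
  warn "connection lost (#{e.message.strip}); reconnecting"
  conn.reset            # re-establishes the underlying libpq connection
  conn.exec("SELECT 2") # any transaction that was in progress is gone for good
end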

Examples

In this section, we show step-by-step examples of how keepalive values interact with TCP_USER_TIMEOUT.

Idle connections

No TCP_USER_TIMEOUT

keepalives=1
keepalives_idle=10
keepalives_count=3
keepalives_interval=5
tcp_user_timeout=0
00:00:00 Local connection went idle
00:00:00 Remote connection dies
00:00:10 TCP wakes up
first keepalive probe is transmitted
00:00:15 TCP wakes up
second keepalive probe is transmitted
00:00:20 TCP wakes up
third keepalive probe is transmitted
00:00:25 TCP wakes up
connection closed

Small TCP_USER_TIMEOUT relative to keepalive interval and count

keepalives=1
keepalives_idle=10
keepalives_count=3
keepalives_interval=5
tcp_user_timeout=4500
00:00:00 Local connection went idle (Last ACK)
00:00:00 Remote connection dies
00:00:10 TCP wakes up
Time since last ACK is 10000ms, which is greater than 4500ms,
but TCP_USER_TIMEOUT only springs
into action after the first probe is sent
so it doesn't kill the connection
first keepalive probe is transmitted
00:00:15 TCP wakes up
Time since last ACK is 15000ms so
TCP_USER_TIMEOUT kills the connection
second and third probe are never sent

Moderately large TCP_USER_TIMEOUT relative to keepalive interval and count

keepalives=1
keepalives_idle=10
keepalives_count=3
keepalives_interval=5
tcp_user_timeout=16000
00:00:00 Local connection goes idle (Last ACK)
00:00:00 Remote connection dies
00:00:10 TCP wakes up
first keepalive probe is transmitted
Time since last ACK is 10000ms which is < 16000ms
00:00:15 TCP wakes up
second keepalive probe is transmitted
Time since last ACK is 15000ms which is < 16000ms
00:00:20 TCP wakes up
Time since last ACK is 20000ms which is > 16000ms
TCP_USER_TIMEOUT kills the connection
third probe is not transmitted

Very large TCP_USER_TIMEOUT relative to keepalive interval and count

keepalives=1
keepalives_idle=10
keepalives_count=3
keepalives_interval=5
tcp_user_timeout=37000
00:00:00 Local connection goes idle (Last ACK)
00:00:00 Remote connection dies
00:00:10 TCP wakes up
first keepalive probe is transmitted
Time since last ACK is 10000ms which is < 37000ms
00:00:15 TCP wakes up
Time since last ACK is 15000ms which is < 37000ms
second keepalive probe is transmitted
00:00:20 TCP wakes up
Time since last ACK is 20000ms which is < 37000ms
third keepalive probe is transmitted
00:00:25 TCP wakes up
Time since last ACK is 25000ms which is < 37000ms
fourth keepalive probe is transmitted
# despite keepalives_count being 3
00:00:30 TCP wakes up
Time since last ACK is 30000ms which is < 37000ms
fifth keepalive probe is transmitted
# despite keepalives_count being 3
00:00:35 TCP wakes up
Time since last ACK is 35000ms which is < 37000ms
sixth keepalive probe is transmitted
# despite keepalives_count being 3
00:00:40 TCP wakes up
Time since last ACK is 40000ms which is > 37000ms
TCP_USER_TIMEOUT kills the connection

Notes

  • As explained in this excellent article by Marek Majkowski, combining tcp_user_timeout with the keepalive_* settings can be confusing; the sketch at the end of these notes estimates the approximate effective keepalive timeout and the effective keepalives_count for different settings.
  • The application will not notice the connection has died unless it is actively polling the connection.
# Example case 1
conn.async_exec("SELECT pg_sleep(50);")
# In the above code, the application is not sending
# any data through the connection after sending the query
# so the keepalive idle timer kicks in.
# However, because libpq is polling the socket for updates
# it will notice the connection loss
# as soon as the OS kills the connection
# Example case 2
conn.async_exec("SELECT 1") # call succeeds
# ... application lets the connection go idle ....
# ... Keepalive timer kicks in ...
# ... Connection fails keepalive checks ...
# ... OS kills the connection ...
conn.async_exec("SELECT 2")
# Error raised because the socket is already closed
  • To inspect keepalive timers, run ss -o | grep postgres on the host. This command will show the open sockets on the Postgres port and their TCP timers, including keepalive timers.
ss showing the keepalive timer. It indicates the time until the next probe transmission (1sec429ms) and the number of retransmissions so far (2)
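
Below is a small calculator (a sketch that mirrors the timelines in the examples above, not kernel source) for estimating when an idle, unresponsive connection would be closed for a given combination of settings:

# Inputs are in seconds, except tcp_user_timeout, which is in milliseconds,
# mirroring the units of the libpq parameters.
def idle_connection_death(idle:, interval:, count:, tcp_user_timeout: 0)
  if tcp_user_timeout.zero?
    # Plain keepalives: the connection dies one interval after the last probe.
    return { probes: count, dies_after_s: idle + interval * count }
  end

  timeout_s = tcp_user_timeout / 1000.0
  probes = 0
  elapsed = idle # the first keepalive wake-up happens after the idle period
  loop do
    # TCP_USER_TIMEOUT is only checked once at least one probe has been sent.
    break if probes >= 1 && elapsed >= timeout_s
    probes += 1
    elapsed += interval
  end
  { probes: probes, dies_after_s: elapsed }
end

p idle_connection_death(idle: 10, interval: 5, count: 3)                           # 3 probes, dies at 25s
p idle_connection_death(idle: 10, interval: 5, count: 3, tcp_user_timeout: 4_500)  # 1 probe,  dies at 15s
p idle_connection_death(idle: 10, interval: 5, count: 3, tcp_user_timeout: 16_000) # 2 probes, dies at 20s
p idle_connection_death(idle: 10, interval: 5, count: 3, tcp_user_timeout: 37_000) # 6 probes, dies at 40s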

Active connections

In the following examples, the remote host stops ACK’ing packets. The local connection keeps retransmitting, so it never goes idle and the keepalive settings never come into play; only tcp_user_timeout and the system’s tcp_retries2 matter here.

No TCP_USER_TIMEOUT is set

tcp_user_timeout=0
# Exponential backoff has some randomness to it,
# for simplicity, we will assume the maximum possible backoff every time
00:00:00.000 Remote connection stops ACK'ing packets
00:00:00.200 Retransmit 1 (after 200ms)
00:00:00.600 Retransmit 2 (after 400ms)
00:00:01.400 Retransmit 3 (after 800ms)
00:00:03.000 Retransmit 4 (after 1600ms)
00:00:06.200 Retransmit 5 (after 3200ms)
00:00:12.600 Retransmit 6 (after 6400ms)
00:00:25.400 Retransmit 7 (after 12800ms)
00:00:51.000 Retransmit 8 (after 25,600ms)
00:01:42.200 Retransmit 9 (after 51,200ms)
00:03:24.600 Retransmit 10 (after 102,400ms)
00:05:24.600 Retransmit 11 (after 120,000ms) # TCP_RTO_MAX hit, no more backoff
00:07:24.600 Retransmit 12 (after 120,000ms)
00:09:24.600 Retransmit 13 (after 120,000ms)
00:11:24.600 Retransmit 14 (after 120,000ms)
00:13:24.600 Retransmit 15 (after 120,000ms)
00:15:24.600 TCP wakes up, tcp_retries2 exhausted. (after 120,000ms)
Kill connection and notify the application

When TCP_USER_TIMEOUT is set

tcp_user_timeout=1200
# Exponential backoff has some randomness to it,
# for simplicity, we will assume the maximum possible backoff every time
00:00:00.000 Remote connection stops ACK'ing packets
00:00:00.200 Retransmit 1 (after 200ms)
00:00:00.600 Retransmit 2 (after 400ms)
00:00:01.400 TCP wakes up
Packet not ACK'd for 1400ms
which is greater than TCP_USER_TIMEOUT
Kill the connection

Like keepalive timers, retransmission timers can be inspected using ss -o

ss showing the TCP retransmission timer (labeled “on”). It also shows the time until the next retransmission attempt (1min36sec) and the number of retransmissions so far (9)

Conclusion

Improperly set TCP parameters can be responsible for mysterious wait times in the application, and relying on the OS defaults may not be appropriate for all applications. The values chosen for these parameters should be tailored to the application’s needs and validated in the deployment environment.

Killing connections aggressively using these settings is not without cost (even if the application reconnects). There is no way to recover a transaction if the client loses its connection to the database. Also, high connection churn is detrimental to database performance; it can happen even when pooling software (such as Pgbouncer) sits between the database and the application, because closing a connection to Pgbouncer that has a running query or is inside a transaction prompts Pgbouncer to close the corresponding server connection, which causes churn.

There are still a few open questions

  • Why does ruby-pg sometimes ignore Ruby interrupts (e.g. Timeouts) while waiting for the TCP layer?
  • What is the root cause for these occasional spikes in retransmissions? Is it network congestion or something on the network load balancer?

But these are rabbit holes for another day.

We’re hiring

If problems like this are interesting to you, consider joining us!
