Reverse proxy - excessive number of upstream connections

I have Caddy configured as a reverse proxy for Grafana and InfluxDB. Grafana talks to InfluxDB locally to make queries, and a Python InfluxDB client writes data to InfluxDB through the Caddy proxy every few seconds. I noticed that at regular intervals the server (a 1GB Vultr instance) stopped responding, and found out that it runs out of memory; once the OOM killer kills a process, things are OK for a while, and then it repeats.
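
A quick way to confirm the OOM kills is to grep the kernel log for them (a generic check, nothing specific to this setup):

dmesg | grep -i -E 'out of memory|killed process'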

After further investigation, it appears that Caddy is creating new connections to InfluxDB and never closing them. After a little over an hour, Caddy already has over 4000 local connections, and the count never goes down (yes, they are ESTABLISHED).

root@grafana:/etc/caddy/conf.d# ss -p -o -nt '( dport = :8086 )' -o | wc -l
1536
root@grafana:/etc/caddy/conf.d# ss -p -o -nt '( dport = :8086 )' -o | wc -l
1947
root@grafana:/etc/caddy/conf.d# ss -p -o -nt '( dport = :8086 )' -o | wc -l
3982
root@grafana:/etc/caddy/conf.d# ss -p -o -nt '( dport = :8086 )' -o | wc -l
4732
root@grafana:/etc/caddy/conf.d# ss -p -o -nt -o state established '( dport = :8086 )' | head
Recv-Q Send-Q Local Address:Port               Peer Address:Port
0      0               ::1:53508                       ::1:8086                users:(("caddy",pid=8020,fd=3667)) timer:(keepalive,23sec,0)
0      0               ::1:54428                       ::1:8086                users:(("caddy",pid=8020,fd=4128)) timer:(keepalive,29sec,0)
0      0               ::1:49504                       ::1:8086                users:(("caddy",pid=8020,fd=1665)) timer:(keepalive,13sec,0)
0      0               ::1:56404                       ::1:8086                users:(("caddy",pid=8020,fd=5115)) timer:(keepalive,5.016ms,0)
0      0               ::1:49572                       ::1:8086                users:(("caddy",pid=8020,fd=1699)) timer:(keepalive,13sec,0)
0      0               ::1:53422                       ::1:8086                users:(("caddy",pid=8020,fd=3624)) timer:(keepalive,19sec,0)
0      0               ::1:57902                       ::1:8086                users:(("caddy",pid=8020,fd=5864)) timer:(keepalive,920ms,0)
0      0               ::1:48444                       ::1:8086                users:(("caddy",pid=8020,fd=1135)) timer:(keepalive,13sec,0)
0      0               ::1:53818                       ::1:8086                users:(("caddy",pid=8020,fd=3823)) timer:(keepalive,2.968ms,0)

Port 8086 is where InfluxDB is listening. If I watch the timer, it counts down from roughly 30 seconds, which would match the keepalive setting the reverse proxy code is using (if I understand correctly).
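
(If anyone wants to watch those timers count down themselves, something like this works:)

watch -n 1 "ss -o -nt state established '( dport = :8086 )' | head"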

There are also persistent connections open to Grafana, even though I closed my browser’s Grafana tab over an hour ago, so I wouldn’t expect anything to be happening between Caddy and Grafana.

root@grafana:/etc/caddy/conf.d# ss -p -o -nt '( dport = :3000 )' -o | wc -l
57

The InfluxDB client looks like it’s correctly using a single HTTP keep-alive session to Caddy. There’s usually just one connection, but sometimes two.

root@grafana:/etc/caddy/conf.d# ss -p -o -ant '( sport = :443 )'
State      Recv-Q Send-Q              Local Address:Port     		Peer Address:Port
LISTEN     0      128                 :::443                 		:::* 			            users:(("caddy",pid=8020,fd=904))
ESTAB      0      0                   ::ffff:NNN.NNN.NNN.NNN:443    ::ffff:MM.MM.MM.MM:38998 	users:(("caddy",pid=8020,fd=3))

Here is some memory usage over time. I’m assuming the massive number of open sockets is the culprit, as both InfluxDB and Caddy show memory usage increasing over time.

root@grafana:/tmp# vmstat                                                       
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 722396  20176 116712    0    0 21653    20   35  109  0  5 93  2  0

root@grafana:/tmp# vmstat                                                       
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 680008  22840 122060    0    0 21594    20   35  109  0  5 93  2  0

root@grafana:/tmp# vmstat                                                       
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 423664  35892 160592    0    0 21129    20   36  110  0  5 93  2  0

root@grafana:/etc/caddy/conf.d# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 273876  38828 147328    0    0 20800    20   36  113  0  5 93  2  0

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 7472 influxdb   20   0  647M  282M 18412 S  0.0 28.4  0:32.12 influxd -config /etc/influxdb/influxdb.conf
 8020 www-data   20   0  187M  182M 10504 S  0.0 18.3  0:14.62 caddy -log stdout -agree=true -conf=/etc/caddy/Caddyfile -root=/var/tmp
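
A crude way to track the per-process growth is to sample RSS in a loop (assumes procps ps, which accepts a comma-separated list for -C):

while true; do date; ps -C influxd,caddy -o rss=,comm=; sleep 60; done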

I tried setting the proxy max_conns setting to small numbers (10 and 15), to no avail, although I did notice occasional 502 errors on the client. Prior to using Caddy I ran this same setup in hosted containers with HAProxy as the frontend; the container memory limits were no more than 128MB, and I never had an issue there. Are there other config settings that can limit the proxy connections? I also found this post, where the solution was to set an idle timeout of 5 minutes. However, that post was from 2016, and the Caddy docs state this is now the default, so I’m not sure why all these connections are being kept open.

Here’s my configuration:

Running with:
/usr/local/bin/caddy -log stdout -agree=true -conf=/etc/caddy/Caddyfile -root=/var/tmp

Caddy version 0.10.14, Debian 9 x86_64

Caddyfile:
import conf.d/*

Grafana config, conf.d/grafana (adding max_conns made no difference):

https://grafana.foo.bar {
  proxy / localhost:3000 {
    transparent
    websocket
    max_conns 15
  }

  gzip

  log / stdout "[{when}] - {remote} -> {status} for {method} {host}{path}"
  errors stderr

  import ../ssl.conf
}

InfluxDB config, conf.d/influxdb (adding max_conns made no difference):

https://influxdb.foo.bar {                                            
  proxy / localhost:8086 {                                                
    transparent                                                           
    max_conns 10                                                          
  }                                                                       
                                                                          
  basicauth user pass
                                                                          
  log / stdout "[{when}] - {remote} -> {status} for {method} {host}{path}"
  errors stderr                                                           
                                                                          
  import ../ssl.conf                                                      
}

I’ve disabled keepalive in the influxdb proxy config for now, which avoids the problem.
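
For reference, the workaround in conf.d/influxdb looks like this (if I’m reading the proxy docs right, keepalive 0 disables keep-alive to the upstream entirely, so every request opens and closes its own connection):

https://influxdb.foo.bar {
  proxy / localhost:8086 {
    transparent
    keepalive 0
  }
  # basicauth, log, errors and the ssl import are unchanged
}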

I also tried limiting the number of connections on the InfluxDB side to 10 (Configuring InfluxDB OSS | InfluxDB OSS 1.5 Documentation). This did work, but the HTTP client started seeing 502 statuses, which tells me Caddy is attempting to create too many connections rather than re-using or closing the older ones.
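
That InfluxDB-side limit is the max-connection-limit setting in the [http] section of influxdb.conf (if I recall the setting name right; 10 is the value I tried):

[http]
  max-connection-limit = 10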

Looks like I misunderstood the proxy settings. I should have set keepalive to a small non-zero number (e.g. 10), not max_conns.
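
So the proxy block should probably look more like this instead (as I understand it now, keepalive caps the number of idle connections Caddy keeps open to the upstream, while max_conns limits concurrent requests per backend):

https://influxdb.foo.bar {
  proxy / localhost:8086 {
    transparent
    keepalive 10
  }
  # rest of the site block as before
}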
