
HTTP/2 support in Knative

PR #2539 introduced the basic ability to use a Knative Service with HTTP/2. There have been numerous discussions on how to "properly" support HTTP/2 (and other stream-based protocols) in Knative. This document focuses on the different aspects of HTTP/2 only and how we could implement it to the benefit of our users. Some of this might also be applicable to other protocols such as WebSockets or gRPC.

Why HTTP/2?

The official spec outlines the key differences to HTTP/1.x as follows: HTTP/2

  1. is binary, instead of textual
  2. is fully multiplexed, instead of ordered and blocking
  3. can therefore use one connection for parallelism
  4. uses header compression to reduce overhead
  5. allows servers to “push” responses proactively into client caches

According to the spec, single-connection parallelism is further superior:

In the past, browsers have used multiple TCP connections to issue parallel requests. However, there are limits to this; if too many connections are used, it’s both counter-productive (TCP congestion control is effectively negated, leading to congestion events that hurt performance and the network), and it’s fundamentally unfair (because browsers are taking more than their share of network resources). At the same time, the large number of requests means a lot of duplicated data “on the wire”. Both of these factors means that HTTP/1.1 requests have a lot of overhead associated with them; if too many requests are made, it hurts performance.

For the purpose of this document, I'll divide these properties into two buckets:

  1. Wire-protocol properties: These include the binary nature of the protocol itself (1) and the compression of the headers (4).
  2. Connection properties: These include the multiplex (2/3) and server "push" (5) properties.

What do we need to do to take advantage of these properties?

Taking advantage of the wire-protocol properties is relatively simple. Given that our routing layers and applications correctly support HTTP/2 end-to-end, we get these for "free". This should already be done and tested with #2539.

Supporting the connection properties though is a different beast. Single-connection parallelism in particular could break autoscaling and routing in Knative. I therefore propose different "modes" of HTTP/2 support, which let the user provide some additional information so Knative can decide how to properly handle incoming HTTP/2 connections.

HTTP/2 end-to-end

To support server "push" and fully take advantage of HTTP/2's parallelism properties, we need to support HTTP/2 end-to-end. That means we need to route a connection to the user application as-is. Since we want to take advantage of per-connection multiplexing, we need to allow a parallelism greater than one per connection. This in turn means we have no opportunity to reroute one of these requests once a pod becomes overloaded. Once a connection is routed to a pod, it sticks, and all requests sent over it go to that pod, no matter what.

If containerConcurrency is set to 0 (allowing infinite parallelism), this is not really an issue: we have no defined limit on how many concurrent requests we can handle, so we don't need to enforce one. Vertical scalability could become crucial on this path, as one connection could potentially overload a pod and we cannot reroute individual requests on that connection to relieve the pod.

If containerConcurrency is set to > 0 (allowing only a set amount of parallelism), things get a little more tricky. The HTTP/2 spec defines a SETTINGS_MAX_CONCURRENT_STREAMS setting to control the maximum number of active concurrent streams on one connection. As long as we allow only one HTTP/2 connection per pod, this should work well to indicate the maximum allowed concurrency per client connection. If we stick to one connection per pod, autoscaling would naturally scale to one pod per connection and leave all request/stream-based concurrency to the pod itself. The SETTINGS_MAX_CONCURRENT_STREAMS value would be sent by the queue-proxy.
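As a sketch of how the queue-proxy could derive the advertised value (the helper name and the unlimited fallback are assumptions, not Knative code; in Go this value would ultimately be wired into something like golang.org/x/net/http2's Server.MaxConcurrentStreams):

```go
package main

import "fmt"

// settingsMaxConcurrentStreams maps Knative's containerConcurrency to
// the SETTINGS_MAX_CONCURRENT_STREAMS value the queue-proxy would
// advertise on each HTTP/2 connection. containerConcurrency == 0 means
// "infinite" in Knative, but the HTTP/2 setting has no unlimited value,
// so we fall back to an arbitrary large cap (an assumption here).
func settingsMaxConcurrentStreams(containerConcurrency int) uint32 {
	const unlimitedFallback = 1 << 20 // hypothetical; any large cap works
	if containerConcurrency <= 0 {
		return unlimitedFallback
	}
	return uint32(containerConcurrency)
}

func main() {
	fmt.Println(settingsMaxConcurrentStreams(10)) // 10
	fmt.Println(settingsMaxConcurrentStreams(0))  // 1048576
}
```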

If a single connection from a single client does not saturate a pod though, we are left with unused capacity. Once we allow multiple HTTP/2 connections per pod, we'll have to deal with sizing each of them properly relative to each other. Autoscaling in this case also needs to account for total active streams across pods.

HTTP/2 to gateway

An alternative approach is to support HTTP/2 only until we reach the gateway of a service. The gateway then demultiplexes the connections and sends HTTP/1.1 requests to the application pods themselves. Scaling granularity is not an issue in this case, and we need no changes to the user pods at all (including the queue-proxy). This approach takes advantage of HTTP/2's reduced overhead until the user's requests reach the gateway. The "last mile" then has the usual overhead of HTTP/1.1, which is hopefully less crucial in an in-cluster network than for user requests coming from somewhere on the planet (although multi-region HA services could potentially see the same overhead).
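A rough sketch of what the gateway-side downgrade could look like in Go, using the standard library's reverse proxy (the function name and backend URL are hypothetical; disabling ForceAttemptHTTP2 keeps the backend leg on plain HTTP/1.1, while each incoming HTTP/2 stream becomes its own backend request):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newDowngradingProxy returns a reverse proxy that accepts whatever
// protocol the frontend speaks (HTTP/2 at the gateway) and forwards
// each stream as a separate HTTP/1.1 request to the backend.
func newDowngradingProxy(backend string) (*httputil.ReverseProxy, error) {
	u, err := url.Parse(backend)
	if err != nil {
		return nil, err
	}
	p := httputil.NewSingleHostReverseProxy(u)
	// A plain *http.Transport without the HTTP/2 upgrade: the
	// "last mile" to the pod stays ordinary HTTP/1.1.
	p.Transport = &http.Transport{ForceAttemptHTTP2: false}
	return p, nil
}

func main() {
	p, err := newDowngradingProxy("http://user-container:8080")
	fmt.Println(err == nil, p != nil)
}
```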


Given the different cases laid out above, I feel it's hard to infer which kind of HTTP/2 support a user wants for her application. If anything, we can try to infer decent defaults based on the containerConcurrency setting. We should always allow overriding this default, though.

Based on the above, I see at least 3 modes:

  1. Manual: We route HTTP/2 through to the application and it needs to handle everything accordingly itself (status quo).
  2. Convert: Converts to HTTP/1.1 at the gateway. Routing and loadbalancing logic stays intact.
  3. Single: Allows only a single HTTP/2 connection per pod, which is properly sized in concurrency for the allowed containerConcurrency. Trying to resize connections etc. seems error-prone and "hard to guess right" to me. We could maybe implement a Multiple mode later?
Asked Oct 24 '21 08:10 by markusthoemmes

9 Answers:

Doing some reading (thanks @cmluciano), Envoy calls this 'connection pooling' and will merge all streams into one connection to a backend. So if we did a GOAWAY, we'd effectively be restarting all traffic to that app.

Answered Jan 17 '19 at 23:42 by greghaynes

Doing a bit more digging and reading, the shutdown mechanism you describe @greghaynes seems to be implemented in the Shutdown logic of an http2 server. At least connections already behave this way.

Answered Jan 18 '19 at 07:39 by markusthoemmes

I propose we allow unlimited connections and control streams with a semaphore. If we cross a concurrency threshold, we start booting connections with GOAWAY, starting with the smallest stream count, until we go below the threshold again. Setting max concurrent streams on every connection ensures we can always handle at least 1.

Something like this:

breaker := NewBreaker(queueDepth, maxConcurrency, initialCapacity)
for {
    c := acceptConnection()
    go func() {
        for {
            s := c.AcceptStream()
            ok := breaker.Maybe(func() {
                handleStream(s)
            })
            if !ok {
                // Over capacity: send GOAWAY on this connection
                // (or boot the connection with the lowest stream
                // count, then retry the breaker).
                c.Send(GOAWAY)
                return
            }
        }
    }()
}
This would allow us to autoscale without prior knowledge of how many streams will be multiplexed within a connection. We could have 1000 connections and 10 concurrent streams per pod. Or we could have 1 connection and 10 concurrent streams.
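A minimal sketch of a semaphore-based breaker along these lines (an illustration, not Knative's actual Breaker; queueDepth and initialCapacity are omitted for brevity):

```go
package main

import "fmt"

// Breaker limits concurrent work with a channel-based semaphore.
// Maybe runs f if capacity is available and reports whether it ran.
type Breaker struct {
	sem chan struct{}
}

func NewBreaker(capacity int) *Breaker {
	return &Breaker{sem: make(chan struct{}, capacity)}
}

func (b *Breaker) Maybe(f func()) bool {
	select {
	case b.sem <- struct{}{}: // acquire a slot
		defer func() { <-b.sem }() // release on return
		f()
		return true
	default: // over capacity: caller should send GOAWAY
		return false
	}
}

func main() {
	b := NewBreaker(1)
	fmt.Println(b.Maybe(func() {})) // true: capacity available
	b.sem <- struct{}{}             // simulate a stream holding the only slot
	fmt.Println(b.Maybe(func() {})) // false: would exceed concurrency
}
```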

Answered Jan 17 '19 at 21:38 by josephburnett

I agree with @josephburnett that streams are really what we care about here WRT max-concurrency, and connections aren't terribly interesting to a user. Something I am not clear on is whether our upstream LB (envoy/etc) will use multiple connections to a single backend or whether the streams get merged into one (as is sort of the point with http/2). If they get merged, this makes the GOAWAY design a bit problematic.

Unless there's some compelling reason I'm missing, I think the "http/2 gateway / convert" is an antifeature: if a user's app can only speak http/1, then we should be exposing http/1 on the frontend as well. I could see some desire for converting the other way, making even http/1 frontend traffic convert to http/2 on the backend.

Answered Jan 17 '19 at 23:29 by greghaynes

@josephburnett wrote:

If we cross a concurrency threshold, we start booting connections with GOAWAY...

How will this affect the streams which are rejected? Will GOAWAY surface as an error to the client (and perhaps require the application to provide retry logic) or will some component transparently redirect the rejected stream(s) to a new (or at least a different) connection?

Answered Jan 17 '19 at 23:36 by glyn

Solution 2 is the current behavior without #2539. Istio upgrades all connections to HTTP/2 under the hood and downgrades to HTTP/1 where appropriate. It's not viable for gRPC; see this chart in #2539.

IMO the best solution would be to do what @greghaynes suggests and take advantage of Envoy's connection pooling. We can do this by setting the ConnectionPoolSetting in Istio.

Answered Jan 18 '19 at 00:01 by tanzeeb

I'm noticing we should be explicit about our LB setup here, partly due to the connection pooling mentioned above, but also because there are a few ways to solve this. Here's what I'm thinking:

We use an HTTP/2 ingress/gateway (as opposed to raw TCP), which we assume may connection-pool to its backends. For autoscaling purposes we measure streams as our unit of concurrency (as @josephburnett mentioned). We then apply backpressure at the stream level in queue-proxy via HTTP 5xx responses. Additionally, during shutdown we drain our connections so the LB can rebalance: queue-proxy can send a GOAWAY on each connection and stop accepting new connections based on a pre-stop hook.

My thinking here is that by removing connections entirely from the equation (rather than using raw TCP through the LB), a single user connection can outlive an application instance (the LB holds the connection open regardless of backend) and can outscale a single application instance.

Sorry to just be restating the obvious if this is what you were envisioning already @markusthoemmes - just trying to get some clarity :)

Answered Jan 18 '19 at 00:02 by greghaynes

Oh, that Envoy behavior is interesting indeed. Does that also mean that Envoy can send requests coming in on the same connection to different backends? If that is the case, we might have no problem at all w.r.t. autoscaling etc. and can just apply the "usual" backpressure as @greghaynes mentioned. Essentially we're then entering the "Single" case I described above.

The whole writeup is based on the assumption that the relationship of a connection coming in from the user and one going to one of the potential backends is a 1:1 relationship.

If we can get Envoy to correctly respect the SETTINGS_MAX_CONCURRENT_STREAMS on a connection, and if it does in fact only create one HTTP/2 connection to the backend, maybe this is also a great solution for our overloading issues, and we can reduce the number of 503s being sent that way. In this case, upgrading connections from HTTP/1 to HTTP/2 on the backend indeed sounds very interesting.


If a GOAWAY frame is received or if the connection reaches the maximum stream limit, the connection pool will create a new connection and drain the existing one.

That kind of sounds as if Envoy will create a new connection to the same backend if the stream limit is reached? It'd be great if we could verify this.

Exciting, thanks for all the input. Lots of knowledge gaps on my end to fill!

Answered Jan 18 '19 at 06:42 by markusthoemmes

@greghaynes sounds right.

@markusthoemmes another aspect of the design is how the Activator participates. It must count streams for HTTP2 as well as connections for HTTP1.

Answered Jan 18 '19 at 15:41 by josephburnett