I worked at Twitch from 2014 to 2018. I was never on the video team, but here are some details that have changed.
Video:
- Everything has been migrated from RTMP to HLS; RTMP couldn't scale across several consumer platforms
- This added massive delay (~30+s) early on; the video team has been crushing it, getting this back down to the current sub-2s
- Flash is dead (now HTML5)
- F5 storms ("flash crowds") are still a concern teams design around; e.g. 900k people hitting F5 after a stream blips offline due to the venue's connection
- afaik Usher is still alive and well, in much better health today
- Most teams are on AWS now; video was the holdout for a while because they needed specialized GPUs.
EDIT: "This isn't quite right; it has more to do with the tight coupling of the video system with the network (eg, all the peering stuff described in the article)" -spenczar5
- Realtime transcoding is a really interesting architecture nowadays (I am not qualified to explain it)
Web:
- No more Ruby on Rails because no good way was found to scale it organizationally; almost everything is now Go microservices back + React front
- No more Twice
- The data layer was split up per team; some teams use PostgreSQL, some DynamoDB, etc.
- Of course many more than 2 software teams now :P
- Chat went through a major scaling overhaul during/after Twitch Plays Pokemon. John Rizzo has a great talk about it here: https://www.twitch.tv/videos/92636123?t=03h13m46s
Twitch was a great place to spend 5 years at. Would do again.
Hi glacials :) Small correction from someone at Twitch today:
> video was the holdout for a while because they needed specialized GPUs
This isn't quite right; it has more to do with the tight coupling of the video system with the network (eg, all the peering stuff described in the article).
Yeah, and the problem is that it often does work to fix issues. It's the web equivalent of "have you tried turning it off and turning it back on again?"...
> - No more Ruby on Rails because no good way was found to scale it organizationally; almost everything is now Go microservices back + React front
Ugh, I just... I keep trying to pretend I don't need to learn Go, but every highly scalable system I read about that's recently been written about seems to be using it. Maybe I just need to stay away from systems that need to scale? Heh...
Technically speaking you can build scalable systems using anything you want. But if you need to hire a couple hundred developers, you're better off going with Java 7 or Go than Ruby, Lisp, or Perl. The dumber and more uniform the better.
Personally, I think it’s hugely worth learning. Aside from some eschewed de facto behaviors, Go is very easy to pick up and learn the entirety of in a week or two, because the language itself is really not that large. So I’d argue the time investment is a good one for what you get.
Still, you definitely do not need Go to scale systems. People scale Everything, perhaps most impressively PHP applications.
Go isn't the only language that scales, it just happens to be popular amongst the scripting language crowd as a next step. You're by no means limited in your choice. You could do Java, C#, Rust...
Before golang was a thing, there were highly scalable systems that handled way more traffic than anything written in golang today. Those systems were (and are) written in languages like C++ and Java and C#.
You're just seeing golang in articles because of hype.
I'm curious: do services like Twitch specify a specific desired codec/bitrate that doesn't get transcoded? Transcoding seems like a lot of effort for a lower-quality end result.
If I were streaming, I would want to avoid transcoding as much as possible. Since we're talking about live broadcasting, there is a unique ability for the streamer to choose the format they upload.
In the RTMP days, the highest quality setting in the viewer was always a straight pass-through from the broadcaster, and the reduced versions were transcoded in the data center to fit down lower-bandwidth last-mile pipes.
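To make that pattern concrete: pass the source rendition through untouched, and spawn one transcode job per reduced rendition. A rough sketch of the idea in Go, driving ffmpeg with placeholder URLs and a made-up bitrate ladder (not Twitch's actual pipeline; real ingest systems supervise and restart these jobs):

    // Rough sketch: pass the "source" rendition through untouched and spawn
    // one transcode job per reduced rendition. URLs and ladder are placeholders.
    package main

    import (
        "log"
        "os/exec"
        "sync"
    )

    type rendition struct {
        name  string // output stream name
        scale string // target resolution, e.g. "1280:720"
        rate  string // target video bitrate
    }

    func main() {
        src := "rtmp://ingest.example.com/live/source" // placeholder ingest URL
        out := "rtmp://edge.example.com/live/"         // placeholder output base

        var wg sync.WaitGroup

        // "Source" quality: remux without re-encoding (straight pass-through).
        start(&wg, "ffmpeg", "-i", src, "-c", "copy", "-f", "flv", out+"source")

        // Reduced renditions: one transcode job each.
        for _, r := range []rendition{
            {"720p", "1280:720", "2500k"},
            {"480p", "854:480", "1200k"},
            {"360p", "640:360", "700k"},
        } {
            start(&wg, "ffmpeg", "-i", src,
                "-vf", "scale="+r.scale,
                "-c:v", "libx264", "-b:v", r.rate, "-preset", "veryfast",
                "-c:a", "aac", "-b:a", "128k",
                "-f", "flv", out+r.name)
        }
        wg.Wait()
    }

    // start runs one external job in its own goroutine and logs when it exits.
    func start(wg *sync.WaitGroup, name string, args ...string) {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if err := exec.Command(name, args...).Run(); err != nil {
                log.Printf("%s exited: %v", name, err)
            }
        }()
    }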
Excuse the simple question: When I hear "microservices", I think serverless backend. Is that right, or are they different? If they're the same, how do you stream video with serverless? (Seems like streaming, websockets, etc... shouldn't be possible in a serverless environment...)
"Microservice" describes the size and scope of each deployment artifact. It answers the question "is the whole system just one big ball, is it broken up, how broken up is it?" It doesn't describe how it is deployed.
"Serverless" describes how a deployment artifact is deployed and runs. Generally it refers to a class of technologies in multiple domains whereby intricate knowledge of the underlying host is abstracted behind a cleaner API, with things like scaling, security, patching, etc handled by an infrastructure provider. While the term rose in prominence alongside "functions as a service", which is certainly a technology that generally qualifies as serverless, there are many serverless products out there: AWS Fargate for running containers, DynamoDB for a database, S3 for object storage, all of these are "serverless". A good signal is: if I can SSH into it, its not serverless.
A microservice can certainly be deployed serverless (ECS/Fargate or Google Cloud Run comes to mind). A microservice can even refer to one or more logically related functions-as-a-service; the term more-so speaks to how the engineering teams organize their business domain into the code and how the APIs speak to each other, rather than the exact underlying technologies.
Microservices are about splitting code into different servers instead of a monolithic codebase. You end up with different servers (probably virtualized) for each domain of the application.
Like, instead of having the video decoding and the analytics code in the same monolith attached to the same DB, you deploy a different server for each one, generally with a new DB for each. When the services need to talk to each other, they do it via the network (REST, gRPC, etc.).
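For a concrete picture, each such service is often just a small HTTP (or gRPC) server that owns its own storage and calls its neighbors over the network. A minimal sketch in Go, with made-up service names and endpoints:

    // Minimal sketch of one microservice calling another over plain HTTP.
    // Service names, ports, and endpoints are invented for illustration.
    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // This "analytics" service owns its own storage and exposes a narrow API.
        http.HandleFunc("/views/", func(w http.ResponseWriter, r *http.Request) {
            channel := r.URL.Path[len("/views/"):]

            // Call a separate "video" service over the network instead of
            // sharing a database or a codebase with it.
            resp, err := http.Get("http://video.internal:8080/streams/" + channel)
            if err != nil {
                http.Error(w, "video service unavailable", http.StatusBadGateway)
                return
            }
            defer resp.Body.Close()

            var stream struct {
                Live    bool `json:"live"`
                Viewers int  `json:"viewers"`
            }
            if err := json.NewDecoder(resp.Body).Decode(&stream); err != nil {
                http.Error(w, "bad upstream response", http.StatusBadGateway)
                return
            }
            fmt.Fprintf(w, `{"channel":%q,"live":%t,"viewers":%d}`, channel, stream.Live, stream.Viewers)
        })
        log.Fatal(http.ListenAndServe(":9090", nil))
    }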
They're different. Microservices are still stateful applications that run 24/7. They are just really small in scope.
e.g. the Friends feature on Twitch is one microservice, running in its own autoscaling group, with internal APIs used by other microservices like Whispers.
My team follows microservice patterns, and has deployed services that utilise websockets on both serverless (Azure Functions and Lambda) and regular hosted services (k8s, EC2, Azure App Service, etc.). Nothing stopping you there. On the streaming video side we did an app that used Azure Media Services + Azure Functions... works well enough.
Not necessarily a good idea, but one 'feature' of microservices is the ability to pick different stacks, languages and delivery methods on an individual service level.
I work at Twitch. Let me put it this way: the team I'm on (VOD) has ~8 backend engineers, and we're in charge of something like two dozen services.
We literally have services that run entirely on AWS Lambda functions and nothing else (see the sketch below).
This is a pretty big difference from teams I've worked on in the past that had 8 engineers all working on a single service.
"Microservices" is more of a philosophy than anything.
No. Elemental is more of a high end encoding system for quality. Twitch is more about bulk cheap transcodes of good quality. Think about it. MLB has maybe 18 concurrent events. Twitch is running minimum in the 10k range.
No, we never had Elementals. In the early days there was no way we could afford them. In the later days I don't think we would have wanted them, as we needed to scale so many transcode jobs that it was easier to have a large farm of dumb machines to organise jobs across.
There may have been an Elemental machine at one point that was used for testing/playing, but I really don't think so, and I know there wasn't one between 2010 and 2017.
"F5 storms" are easy to handle. Intercept all keypress combinations for refresh and do what you want with it client side. (spread it out over time, use a high-performance endpoint to check if live or a combination)
Most people doesn't use the refresh button in the browser, so only a small amount of traffic will be uncontrolled.
Do you have any data to support that? I personally don't have an F5 key on my keyboard (it requires pressing a modifier), so I pretty much always click the reload button to fix a stream blip. The impression I get from reading Twitch chat is that most people are using mobile. I doubt they have a keyboard plugged in and press F5 to refresh.
That said, you certainly don't need your video streaming servers to handle those hundred-thousand refresh requests.
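Put differently, the refresh traffic can be pointed at something tiny and cacheable rather than at the video edges. A hand-wavy sketch of the "high-performance endpoint to check if live" idea from above, with a short in-memory cache (all names and numbers invented):

    // Sketch of a cheap "is the channel live?" endpoint so an F5 storm hits a
    // tiny cached lookup instead of the video servers. Purely illustrative.
    package main

    import (
        "fmt"
        "log"
        "net/http"
        "sync"
        "time"
    )

    type cacheEntry struct {
        live    bool
        fetched time.Time
    }

    type liveCache struct {
        mu      sync.Mutex
        entries map[string]cacheEntry
    }

    func (c *liveCache) isLive(channel string) bool {
        c.mu.Lock()
        defer c.mu.Unlock()
        e, ok := c.entries[channel]
        if ok && time.Since(e.fetched) < 2*time.Second {
            return e.live // serve the cached answer to the other 899,999 refreshers
        }
        live := lookupUpstream(channel) // one upstream hit per channel per ~2s
        c.entries[channel] = cacheEntry{live: live, fetched: time.Now()}
        return live
    }

    // lookupUpstream stands in for whatever actually knows stream state.
    func lookupUpstream(channel string) bool { return false }

    func main() {
        cache := &liveCache{entries: map[string]cacheEntry{}}
        http.HandleFunc("/live/", func(w http.ResponseWriter, r *http.Request) {
            channel := r.URL.Path[len("/live/"):]
            w.Header().Set("Cache-Control", "max-age=2") // let CDNs absorb most of it
            fmt.Fprintf(w, `{"channel":%q,"live":%t}`, channel, cache.isLive(channel))
        })
        log.Fatal(http.ListenAndServe(":8081", nil))
    }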
I can barely follow along with this, it's very technical. I can't imagine how Kyle Vogt acquired the necessary knowledge to make this work. Example:
> The point of having multiple datacenters is not for redundancy, it's to be as close as possible to all the major peering exchanges. They picked the best locations in the country so they would have access to the largest number of peers.
This is the kind of thing where I would have to hire some kind of network engineering expert, and he just figured this stuff out and made it work? I can't fathom other people's intelligence sometimes.
He leveraged the YCombinator network to absorb a lot of information quickly. For example, I taught him basic networking (routers, switches, multicast/anycast, AS numbers, etc). I shared my 10 years of knowledge with him in a single two-hour session because he's a genius, and then he ran from there, vastly exceeding my knowledge. I was there because Emmett asked Steve and Steve asked me to go over, and I was happy to help. I'm sure I wasn't the only one.
Like the other sibling comment, I would be very much interested in a talk like this: teaching networking with real-world examples and explaining 2-3 large-scale architectures w.r.t. networking. Maybe a long video (or small series) and a follow-up on Twitch for Q&A. Would even pay for this.
> I can’t imagine how Kyle Vogt acquired the necessary knowledge to make this work.
I've worked on projects with Kyle, and he often goes into bulldozer mode. It is no surprise to me that Kyle could "learn" all he needed in order to get something like this set up (or at least learn enough to orchestrate a small group in constructing it). Kyle is, by all means, a "force of nature" as YC tends to define it.
The downside to Kyle's optimism is that he often has very little concern for the humanity of others. He can set up decent optics around his actions and decisions in the wake of what many might consider failures, but he has consistently abused those who try to give him good-faith constructive feedback and often brought co-workers to tears. This is all well-documented at least through the past 4-5 years. (Kyle does actually explicitly ask for "direct" feedback, btw. He's just only capable of handling the feedback on a periodic, weekly or monthly basis.)
A key lesson of this article (and of glacials' post above) is what can be achieved very quickly if technical debt is of minor concern. Kyle's key strength is in building a proof of concept that supports rapid iteration. This point appears to be something the Justin.tv / Twitch teams did very well.
A second lesson is in getting alignment among diverse engineers. Think about how the team might have debated the architecture presented. Think about how some of the choices might rub people the wrong way.
Finally, Kyle is a unique character in several ways but is not alone in possessing a transient "bulldozer" mentality. If you see yourself having the same pattern of behavior, get help before others get hurt. There are a variety of mitigations that can help, but they need explicit participation.
> I can’t imagine how Kyle Vogt acquired the necessary knowledge to make this work.
By this point in history, it wasn’t just him anymore and we’d done a few rounds of improvements already out of necessity. As I recall, he got us up and running at PAIX based mostly on research, but most of the other data centers were built out by a network engineer(1) we hired away from YouTube.
While he was working on the network engineering and keeping the original system afloat, I did a lot of the software work for the system described here.
(1) Name withheld out of courtesy
Don't be so hard on yourself; it's pretty common to read blog posts like this and come away with the idea that a super smart person took one look at the lay of the land and leapt directly from problem -> solution in one neat step. What you don't see is the people they talked to about the problem, their back-and-forth spitballing ideas, the various googling to see if there's a standard approach ... and most importantly you don't see any failed attempts.
Bear in mind that I don't think this is some deliberate attempt to appear superhuman; I think it's just accidental.
I'm sure if you tried building your own livestreaming or VOD service, you'd come up with similar solutions and insights. Peering problems are fairly obvious - put up a gigabit server in Germany and try livestreaming high bitrate video to a highspeed connection in San Francisco (or vice-versa), and watch as you run into problems despite having more than enough theoretical bandwidth.
When your users start to complain, you tend to develop the domain knowledge necessary to solve their problems pretty quickly.
Pretty exceptional indeed. Also impressive that he was able to grow from founder-stage tech to that scale, since they're largely different problems.
Especially back in 2010. I feel like I'd have a much better shot of being able to figure out that scale these days than a decade ago. (If I spent my free time studying and not watching Age of Empires 2 on Justin.tv/Twitch).
If that was actually the case, why the f did they have a boatload of gear in 200 Paul? There's almost no peering exchange there whatsoever (until SFMIX about 3 years ago). Can think of a lot better connected places in the Bay.
It’s one of the reasons we moved out of there. Moving day was an ... interesting experience: lots of planning to minimize downtime, and everything that was actually planned went relatively well. Unfortunately, what we thought was a 90% plan turned out to be more like 50%. Several people pulled all-nighters on that one.
At the time PAIX had a reverse-billing setup: the more data you transferred, the cheaper your connection charge was; we managed to get all the way into the cheapest billing tier within the first billing cycle which was basically unheard-of at the time.
Building this was really fun, and I’m very proud of what kd5bjo, Emmett, and many others did to help turn Justin.tv/Twitch into what it is today. We found a way to make what was fundamentally an unprofitable business (if you relied on CDNs) work by relentlessly focusing on reducing cost to the absolute bare minimum through good technology choices and innovating when necessary. Justin.tv would have died otherwise.
I was the primary architect for Usher and the server-side of the video system described here. I’m happy to answer any questions, assuming I still remember the answers.
I don’t get your reference, but we chose the name Usher because it’s the software equivalent of the person at the theater who looks at your ticket and shows you where your seat is— it doesn’t actually handle any of the video data, it just knows where you should go to find it.
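A toy version of that idea, just to illustrate the shape of it (the real Usher's selection logic is obviously far more sophisticated, and all hostnames here are invented):

    // Toy illustration of the "usher" idea: look at the request, pick a video
    // edge, and send the viewer there. The usher never touches video bytes;
    // it only answers "where do I go?". Selection logic and hosts are made up.
    package main

    import (
        "hash/fnv"
        "log"
        "net/http"
    )

    // In reality this would come from live cluster state (load, peering,
    // geography); here it's a static list for the sketch.
    var edges = []string{
        "video-edge-1.example.net",
        "video-edge-2.example.net",
        "video-edge-3.example.net",
    }

    func pickEdge(channel string) string {
        h := fnv.New32a()
        h.Write([]byte(channel))
        return edges[int(h.Sum32())%len(edges)]
    }

    func main() {
        http.HandleFunc("/watch/", func(w http.ResponseWriter, r *http.Request) {
            channel := r.URL.Path[len("/watch/"):]
            edge := pickEdge(channel)
            http.Redirect(w, r, "https://"+edge+"/hls/"+channel+"/index.m3u8", http.StatusFound)
        })
        log.Fatal(http.ListenAndServe(":8082", nil))
    }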
Slightly off topic but this is my favorite justin.tv video, and also shows something of the website and how early in the days of live streaming we were back then: https://www.youtube.com/watch?v=BqgEm8XWXu8
"Live video can't be made by pushing video faster, it takes a completely differently architecture."
Funnily enough, that's pretty much how HLS, the modern live video standard, works - it's essentially a series of tiny video clips loaded & played one right after the other, distributed through the same CDN as normal video files.
Thanks to HLS, live video is actually much worse than it was 10 years ago with RTMP in terms of latency. There have been some recent efforts to get it back down, although they're generally not standardised, hard to scale (e.g. WebRTC), and/or a bit awkward.
I don't think anyone really follows Apple's spec for various technical reasons, though. Most do some sort of chunked-transfer encoding, along with pre-signaling segments in playlists, as outlined by the Periscope folks here: https://medium.com/@periscopecode/introducing-lhls-media-str...
None follow Apple's spec because it's a month old. It's not for technical reasons. Like it or not (TBH I'm on the fence about it), it WILL be the standard.
I would have assumed it's because Apple only announced this 4 weeks ago, and the only clients that support it are beta software.
kind of? i think this is more about live replication infrastructure than the video carrier itself. with youtube, you can set up some CDNs, but with livestreaming you need to be continually ingesting and spitting out content at the same time to lots of places in the world at once
The video carrier can have a lot of effect on the replication properties, though. HLS is essentially a playlist of video URLs that the client fetches and stitches together, as well as refreshing the playlist to get the names of new chunks. Without an extremely specialized web server, each chunk needs to be complete and published before adding it to the playlist, which puts a lower bound of roughly one chunk duration on the overall latency.
RTMP, on the other hand, maintains a live socket between the server and client, and the server can forward each packet as it becomes available.
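For anyone who hasn't poked at HLS: the "stitching" really is just re-fetching a text playlist and downloading whichever segment URLs are new. A bare-bones polling loop to illustrate (placeholder URL, no real playback, and real players also handle target durations, discontinuities, etc.):

    // Bare-bones illustration of how an HLS client follows a live stream:
    // refetch the media playlist, fetch any segments we haven't seen, repeat.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        playlist := "https://example.com/live/channel/index.m3u8" // placeholder URL
        seen := map[string]bool{}

        for {
            resp, err := http.Get(playlist)
            if err != nil {
                time.Sleep(time.Second)
                continue
            }
            body, _ := io.ReadAll(resp.Body)
            resp.Body.Close()

            // Lines not starting with '#' are segment URIs; #EXTINF tags carry durations.
            for _, line := range strings.Split(string(body), "\n") {
                line = strings.TrimSpace(line)
                if line == "" || strings.HasPrefix(line, "#") || seen[line] {
                    continue
                }
                seen[line] = true
                fmt.Println("would fetch and play segment:", line)
            }
            time.Sleep(2 * time.Second) // roughly one segment duration between refreshes
        }
    }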
> They also don't have chats with the 100,000 people watching a channel. What they do is assign people into rooms of 200 people each so you can have a meaningful interaction in a smaller group. This also helps with scaling. I thought this was a pretty clever strategy.
Just curious if this is still a thing. I've watched an unhealthy amount of Twitch (not all with chat open) and never noticed this.
I'm sure it's not. Often, the streamer is reading the chat onscreen while streaming, and it's always identical to the one I'm seeing, except perhaps delayed by some seconds.
Maybe Twitch creates an illusion by showing the streamer only a subset of people that is shared among all other groups. This allows everyone to see what the streamer sees and to communicate within their group.