After months of beta testing, Fortnite has officially entered Early Access. The latest title from Epic Games has arrived just before the end of the summer game drought, drawing high demand from gamers looking for a co-operative title to play.
On this front, it appears that day-one demand was underestimated. The login servers have been struggling to let in new players, while those in-game have faced crashes and lag.
Fortnite's social media channels and forum have been hit by hundreds of reports of issues. To explain the situation, Epic Games has been diligently posting updates on its forum. The updates posted so far are reproduced below:
Update #1 – 6:50pm EST
Sorry for not being able to provide an ETA.
I can provide a bit of back story for those interested though. At around 4 AM ET we added extra shards to our profile database during our “planned” downtime to handle increased load. Balancing shards takes around 24 hours so we decided to add a few more shards at 2 PM ET given anticipated future load / increasing player counts. This resulted in us running into an open file handle limitation (default is 1024) which brought everything down.
We incorrectly identified the root cause and tried a speculative fix of rolling back an unrelated change to our backend, but the deploy of that change failed. In the meantime we identified the actual root cause and raised the file handle limit and rolled back to the previously working instance. This deploy failed as well.
The root cause of the deploy failure is not clear and we are working on trying to resolve / debug this. The goal is to spin up a single instance of our backend and then rapidly scale it up in parallel. After that we will enable our “waiting room” (avoids a land rush taking system down if everyone tries to get in at once) and let everyone in again.
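For readers unfamiliar with the "waiting room" idea, the following is a minimal sketch of rate-limited admission, not Epic's actual service. The class name `WaitingRoom`, the `admits_per_tick` parameter, and the in-memory queue are all assumptions made for illustration.

```python
from collections import deque


class WaitingRoom:
    """Hypothetical queue that lets in a fixed number of players per tick."""

    def __init__(self, admits_per_tick):
        self.admits_per_tick = admits_per_tick
        self.queue = deque()

    def join(self, player_id):
        """Add a player to the back of the line and return their position."""
        self.queue.append(player_id)
        return len(self.queue)

    def tick(self):
        """Admit up to admits_per_tick players from the front of the line."""
        admitted = []
        while self.queue and len(admitted) < self.admits_per_tick:
            admitted.append(self.queue.popleft())
        return admitted


room = WaitingRoom(admits_per_tick=2)
for player in ["a", "b", "c", "d", "e"]:
    room.join(player)
first_wave = room.tick()   # the two players who joined first
second_wave = room.tick()  # the next two
```

Capping admissions per tick is what prevents the "land rush": the backend sees a bounded number of new sessions per interval no matter how many players are waiting.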
Update #2 – 7:17pm EST
We had an explicit / planned downtime from 3 AM ET to 6:30 AM ET this morning to make major changes to our configuration, scaled up XMPP (which we use for matchmaking and global chat), added shards to our profile DB, made sure it can be backed up, and deployed a new version of our dedicated servers.
Our unscheduled downtime started at 4:36 PM ET and is still ongoing as of 7:09 PM. No ETA as we haven’t found the root cause of our backend service deploy failing. We can spin them up, but they fail to connect to rest of infrastructure and then fail our health checks due to it.
Update #3 – 7:57pm EST
Current status is that we can successfully deploy our backend services and they are running with internal connections fine, but health checks are failing the moment we allow traffic to them.
Update #4 – 8:36pm EST
Current status is that we can successfully deploy our backend services and they are running with internal connections fine, but health checks are failing shortly after we allow traffic to them.
We have a “waiting room” service to prevent land rush scenarios, but certain backend calls are made before the client respects the queue.
Right now we are cutting off all traffic to backend via ELB (Elastic Load Balancer), scaling up backend to full capacity, and allow (waiting room) traffic and see what breaks <TM>.
If that works we will allow players in at a slow rate. If it fails the next step is to implement an orthogonal system that reduces traffic to our backend. First by restricting connections to just Epic HQ and then see what fails. If successful we would allow more traffic piecewise — e.g. last digit of IP etc.
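The piecewise approach mentioned in this update, admitting traffic by the last digit of the IP address, could look roughly like the sketch below. It is an illustration only: the function names are hypothetical, it handles IPv4 only, and it assumes "last digit" means the final decimal digit of the last octet.

```python
import ipaddress


def last_digit(ip):
    """Return the final decimal digit of an IPv4 address's last octet."""
    return ipaddress.IPv4Address(ip).packed[-1] % 10


def is_admitted(ip, allowed_digits):
    """Admit a client only if its last digit is in the current rollout set."""
    return last_digit(ip) in allowed_digits


# Hypothetical rollout stages: each stage widens the set of admitted digits.
stages = [{0}, {0, 1, 2}, set(range(10))]
admitted_per_stage = [is_admitted("10.0.0.57", stage) for stage in stages]
```

Widening the set one digit at a time opens the gates in rough tenths, assuming client addresses are spread fairly evenly across final digits.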
Update #5 – 9:13pm EST
Our backend is back to being able to handle the load of current traffic (there is a lot of traffic even though you can’t play :-/).
We are running 6 shards for our profile DB. Each shard has a replication set of 3 machines. One of the shards had the primary machine fail / lock up. Switching roles in that set fixed the issue. We also had a grey failure on another replication set and killing the instance and spinning up another one took care of that.
We are now looking into some long running queries on the nodes and are making sure the DB is ready before allowing more traffic.
Update #6 – 9:41pm EST
Not related to servers going online again, but we now know what triggered us running out of file handles (aside from the limit being the default value which was crazy low).
“internal.db.name” failed but looked alive. This caused errors in our backend, which in turn caused us to run out of open file handles, which then started the outage.
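As background on the file handle limit discussed in the updates, the sketch below shows how a process on a Unix-like system can inspect and raise its own soft limit on open files using Python's standard `resource` module. It is not what Epic ran; the helper names and the target value are assumptions, and production services more commonly set this limit in the service manager or system configuration.

```python
import resource


def clamp_soft_limit(target, hard):
    """Return the soft limit to request, never exceeding the hard limit."""
    if hard == resource.RLIM_INFINITY:
        return target
    return min(target, hard)


def raise_open_file_limit(target):
    """Raise the soft open-file limit toward target; return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = clamp_soft_limit(target, hard)
    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


# Inspect the current limits; a soft limit of 1024 is a common default.
current_soft, current_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `raise_open_file_limit(65536)` at startup would request a higher soft limit, capped at whatever hard limit the operating system allows for the process.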
Fortnite is a co-op sandbox survival game in which players band together to scavenge for items while defending themselves against hordes of zombie-like creatures. Its style of play has been described by Epic Games founder Tim Sweeney as “Minecraft meets Left 4 Dead”.
Stay tuned for updates on this launch situation.