How to scale online services for millions of users without losing vital data

bet365's system for placing bets during sporting events can handle up to two million customers at once.

 Image: Shutterstock

The need to scale online services to keep pace with growing demand is a familiar challenge for modern businesses.

The online gambling portal bet365 can handle up to two million bets at any one time, and serving ever-increasing numbers of customers meant the UK-based firm was devoting much of its time to reworking systems based on the Java programming language.

"Because we keep on growing at such a rate we had to constantly re-engineer a lot of our core systems, not to excite our customers and build new products that bring revenue, just to handle the load," said Dan Macklin, head of R&D at bet365.

The company, which processed more than £26.5bn in bets during the last financial year, needed to solve what Macklin calls the "scale-innovation dilemma".

"All your team is constantly shifting the same software for scale, so you're not able to do any innovation any more, and when you do do innovation you worry 'Can it scale?'."

To better spread the load, bet365 chose to build key systems in the programming language Erlang. To keep data consistent when it is distributed across many different machines, the firm settled on the NoSQL data store Riak.

Erlang is a programming language created by the Swedish telecoms firm Ericsson almost three decades ago to help build telephony applications. The language was designed to support large-scale routing of telephone calls and handle faults without collapsing, making it a good choice for building similar systems today.

"We chose Erlang for simplicity, reliability and scalability. If you look at what was happening with telephone switches in the 1990s there are some parallels that can be drawn with modern web systems today."

One such system was bet365's Push messaging, which shows punters the changing odds when they place bets during a sporting event, a practice known as In-Play betting. To work, the system needs to deliver up-to-date information to up to two million customers at once.

"The aim is to get the betting information - the price changes and score changes - out to all of those customers as fast as possible with minimal latency," Macklin said.

Erlang helps bet365 relay up-to-date odds by splitting the work that needs to be carried out into millions of tasks that can be executed in parallel. On a technical level, Erlang can handle hundreds of thousands of isolated, lightweight processes on each 32-core server. These processes execute concurrently, and if one of them goes wrong it is automatically restarted without affecting its neighbours. As Macklin puts it, "with Erlang you just let it crash".
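The "let it crash" idea can be illustrated with a minimal sketch. It is written in Python rather than Erlang, uses threads as a loose stand-in for Erlang's lightweight processes, and is not bet365's code: the `supervise` helper, the worker functions and the customer names are all hypothetical. A failing worker is simply started again from scratch while its neighbours carry on.

```python
import threading

def supervise(name, task, results, crashes, max_restarts=3):
    """Run task; if it raises, start it again from scratch (hypothetical helper)."""
    for attempt in range(max_restarts + 1):
        try:
            results[name] = task(attempt)
            return
        except Exception:
            # Record the crash, then loop round to restart the task.
            crashes[name] = attempt + 1
    results[name] = None  # gave up after max_restarts restarts

def steady_worker(attempt):
    return "odds sent"

def flaky_worker(attempt):
    # Fails on its first run only, to simulate a transient fault.
    if attempt == 0:
        raise RuntimeError("simulated crash")
    return "odds sent after restart"

results, crashes = {}, {}
workers = {
    "customer-1": steady_worker,
    "customer-2": flaky_worker,
    "customer-3": steady_worker,
}
threads = [
    threading.Thread(target=supervise, args=(name, task, results, crashes))
    for name, task in workers.items()
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Only `customer-2` records a crash, and its restart has no effect on the other two workers. Erlang provides this pattern as part of its standard tooling, whereas here it has to be written by hand.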

"Rather than having big site outages when there's an issue, we end up with much more fine-grained manageable problems that we can deal with."

It is an example of what is called an embarrassingly parallel problem, where the work that needs to be done can be split into many tasks that can be carried out in tandem. Not only that, but each task is independent of the others and does not have to wait for any other task to complete.
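As a rough illustration of an embarrassingly parallel workload, the Python sketch below prepares an odds update for each customer as a separate task. The function names, the customer IDs and the odds value are assumptions made for the example, and a thread pool stands in for Erlang's processes.

```python
from concurrent.futures import ThreadPoolExecutor

def format_update(customer_id, odds):
    """Build the message for one customer; needs nothing from any other task."""
    return f"customer {customer_id}: new odds {odds}"

def broadcast(customer_ids, odds, max_workers=8):
    """Prepare every customer's update independently, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input IDs.
        return list(pool.map(lambda cid: format_update(cid, odds), customer_ids))

messages = broadcast(range(1, 6), "5/2")
```

Because no task reads or writes another task's data, adding more workers or more machines requires no extra coordination.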

Erlang's technical prowess at handling these kinds of tasks is why Facebook's WhatsApp uses Erlang to handle the tens of billions of messages sent by the service each day.

"Erlang is a very small language with reliability and scalability built into it as a core foundation," Macklin said.

"We've found we can run things much more in parallel, use more of the CPU in the box and, because the concurrency semantics are via message passing, it vastly simplifies the software we're writing."

The compact, modular code enabled by Erlang has resulted in a "massive reduction" in the size of applications compared to Java, which in turn has allowed bet365 to "massively reduce testing".
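The message-passing style Macklin describes can be sketched in a few lines. This is a Python approximation with hypothetical names, not Erlang and not bet365's code: one worker owns the score and changes it only in response to messages arriving in its mailbox, so no locks are needed around the data.

```python
import queue
import threading

def score_keeper(mailbox, replies):
    """Own the score privately and change it only in response to messages."""
    score = {"home": 0, "away": 0}
    while True:
        message = mailbox.get()
        if message == "stop":
            break
        if message[0] == "goal":
            score[message[1]] += 1
        elif message[0] == "get":
            replies.put(dict(score))  # reply with a copy, never the live state

mailbox, replies = queue.Queue(), queue.Queue()
worker = threading.Thread(target=score_keeper, args=(mailbox, replies))
worker.start()

for team in ["home", "away", "home"]:
    mailbox.put(("goal", team))
mailbox.put(("get",))
final = replies.get()  # blocks until the worker has handled the messages above

mailbox.put("stop")
worker.join()
```

Messages are handled one at a time in the order they arrive, which is what removes the need for explicit synchronisation.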

The ease with which Erlang handles embarrassingly parallel problems also led to it being selected to build a new system that handles bet365's Cash Out offering, which allows people to claim money from a bet before the event they wagered on has concluded.

"That's a very challenging problem because for odds updates we've got to calculate all those odds changes for every open bet that we have in the system."

However, Erlang isn't a silver bullet. While it chews through embarrassingly parallel problems with ease, other languages are better suited to tasks such as number crunching or building a user interface. About 10 percent of bet365's 300 developers now use Erlang, and Macklin says that while its unfamiliar syntax was initially off-putting to developers coming from Java, those who made the switch find it to be an "intellectually stimulating" language.

Protecting vital data

Beyond finding a way to build applications that could easily scale with demand, bet365 needed to make its data equally flexible.

bet365 uses Microsoft SQL Server as its main database, but where data needs to be manipulated en masse, the firm is increasingly using the NoSQL data store Riak.

"With regards to SQL, without sharding, we took things as high as they could go and we found that with Riak we're able to build simple systems that run at the scale we need. It's a big investment to decide to shard your SQL infrastructure because it costs a lot of money and adds a lot of complexity."

bet365 uses Riak to underpin its Cash Out system and in some of its transaction processing. While many different NoSQL databases offer scalability, Macklin said bet365 chose Riak because of the guarantees it provides over data accuracy.

"The hardest part is balancing scalability with correctness," he said.

"In various scenarios you want to make sure you don't lose half of your data."

Macklin says the approach many competing NoSQL data stores take to resolving data access problems can result in such a loss. These problems can arise when multiple machines attempt to update the same data at the same time or when database servers become inaccessible.

"When the network heals itself and rejoins again it will go 'Oh, what was the last thing that was written? I'll write that' and you've lost a whole load of data."

Riak, in contrast, offers features that allow it to be set up to automatically update the system with the correct data in the event of such issues. In most instances, it is able to successfully resolve these snags, according to Macklin.
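The difference between the two approaches can be shown with a small Python sketch. It does not use Riak's API, and the article does not say which Riak features bet365 relies on. The replica layout, the function names and the set-union merge rule are assumptions chosen to illustrate the general idea of merging conflicting copies rather than discarding one.

```python
def last_write_wins(replica_a, replica_b):
    """Keep only the replica with the newest timestamp; the other is discarded."""
    newest = max(replica_a, replica_b, key=lambda replica: replica["timestamp"])
    return newest["bets"]

def merge_siblings(replica_a, replica_b):
    """Keep every bet recorded by either replica (a set union)."""
    return replica_a["bets"] | replica_b["bets"]

# Two replicas that accepted different writes while the network was split.
replica_a = {"timestamp": 100, "bets": {"bet-1", "bet-2"}}
replica_b = {"timestamp": 101, "bets": {"bet-1", "bet-3"}}

lww = last_write_wins(replica_a, replica_b)
merged = merge_siblings(replica_a, replica_b)
```

Under last-write-wins, `bet-2` disappears because it was written only to the older replica, while the merge keeps all three bets. A union is only suitable for data that grows, so other kinds of data would need a different merge rule.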

"You get the benefit of writing a synchronisation-free architecture that throws data into a NoSQL database and in the event of concurrency or failure, for most of the use cases that we have, will heal itself automatically.

"It's not for absolutely everything but I think Riak's main selling point is that it enables you to run at scale and they do deliver a degree of correctness as well."

bet365's Erlang-based systems can handle about five times as many users as their predecessors, and with the Olympic Games just over a year away, it won't be long before the firm gets the chance to see how they cope under seriously heavy load.

"In a business like ours we get massive peaks and we want the user experience at the top of that peak to be really good."
