Know what to optimize
Since optimizing for high traffic takes valuable time, it is crucial to actually measure the system’s capacity, so we can focus on real bottlenecks instead of optimizing prematurely. As Donald Knuth said:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
So the first thing we have to know is whether we should be optimizing at all, and if so, what.
What to measure
Measuring a complex social network system is a similarly complex task, but we can use some simple yet useful methods. First of all, we’ll assume that the highest cost of a page load is the dynamic generation of the main page. This seems plausible, since all the business logic is concentrated there, surrounded mostly by static content. Serving static files demands far fewer server resources than preparing dynamic pages, so we’ll focus on measuring the dynamic part only.
Duration of the request
The most common and simple method to check web page performance is to refresh it and see how long it takes. The problem is that even if you find out during development that your page refreshes really fast, it doesn’t mean you can sleep peacefully. Actually, it doesn’t mean much at all. You can say you don’t have an immediate problem, but your bottleneck may still be hidden just below the surface.
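If you want a number instead of a feeling, a few lines of script will do. A minimal sketch in Python, assuming the requests library is installed and using a hypothetical URL:

    # time_single_request.py - crude single-request timing
    import time
    import requests

    URL = "http://example.com/"  # hypothetical URL, replace with your page

    start = time.perf_counter()
    response = requests.get(URL)
    elapsed = time.perf_counter() - start

    print(f"status: {response.status_code}, size: {len(response.content)} bytes")
    print(f"wall-clock time: {elapsed:.3f} s")

A single run like this varies a lot between attempts, so treat the result as a rough indication only.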
Simulating multiple users
The next logical step is to ask: how will it perform when multiple users visit it at the same time? This is where the fun begins. There are a few tools that make it easy to simulate 50, 100 or more simultaneous users coming to the site. Probably the most commonly known is Apache Benchmark (ab), distributed with the popular Apache server.
Below is the plotted data of an example benchmark. It shows the ordered response times of 1024 requests in total, made by 128 concurrent clients. You can think of it as 128 guys pressing the refresh button 8 times in a row (of course we don’t use real people ;-) ). The blue line is the total time the client needed to receive the full response from the server.
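A run like the one above can be reproduced with ab; a sketch, again against a hypothetical URL:

    # 1024 requests in total, 128 at a time; -g dumps per-request
    # timings (the ttime column is the total time) in gnuplot format
    ab -n 1024 -c 128 -g results.tsv http://example.com/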
Determining page throughput
Very well, we managed to get some concrete data, but our basic question, the maximum number of users we can handle, remains unanswered. The most straightforward approach is to keep increasing the number of concurrent users to see when the server reaches its top limit.
But how do we see this limit? Is it when the server becomes totally unresponsive? Or maybe when it outright crashes? (quite common for Apache, BTW) It seems the working limit should be somewhat lower.
We’ll focus on determining the average throughput of the server, defined as the number of requests the server is able to complete every second under constant load. Imagine an example. Let’s say it takes the server 2 s to deliver a file to a user (it’s not that straightforward in reality, but let’s assume it for the sake of the example). Let’s also assume the server can handle at most 10 concurrent user downloads (other users get queued). What is the server throughput here? We can complete 10 requests in 2 s, so the average is 5 requests per second. What does it really mean? It means that if we get more than 5 new users per second for a longer period of time, we’re in trouble. Very short bursts are not a concern, as the queue will empty again. But if the overload goes on, response times will become longer and longer, and eventually users will start receiving time-out errors.
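The arithmetic generalizes: throughput equals concurrency divided by the time one request takes. A few lines making the example explicit, with the numbers taken from above:

    # throughput.py - throughput of the queueing example above
    concurrent_slots = 10  # downloads the server handles at once
    service_time_s = 2.0   # seconds to complete one download

    throughput = concurrent_slots / service_time_s  # requests per second
    print(f"throughput: {throughput:.1f} req/s")    # -> throughput: 5.0 req/s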
How does it look in practice? Let’s see below:
This is a real benchmark result used to determine server throughput for a particular resource. As we can see, gradually increasing the number of concurrent clients results in reaching some top requests-per-second value, and after that the only thing that grows is the response time. In this case we can safely assume that our throughput is just below 2000 requests per second. We can now use it as a precise and, more importantly, meaningful estimate of our social network’s capacity.
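Such a scan can be automated by calling ab repeatedly with growing concurrency and collecting the throughput it reports. A rough sketch in Python, assuming ab is on the PATH and, again, a hypothetical URL (the "Requests per second" line is part of ab’s standard summary):

    # throughput_scan.py - run ab at increasing concurrency levels
    import re
    import subprocess

    URL = "http://example.com/"  # hypothetical URL, replace with your page

    for concurrency in (16, 32, 64, 128, 256):
        result = subprocess.run(
            ["ab", "-n", str(concurrency * 8), "-c", str(concurrency), URL],
            capture_output=True, text=True, check=True,
        )
        # ab prints e.g. "Requests per second:    123.45 [#/sec] (mean)"
        match = re.search(r"Requests per second:\s+([\d.]+)", result.stdout)
        print(f"c={concurrency:4d}  {match.group(1) if match else '?'} req/s")

When the printed value stops growing between consecutive levels, you have found the plateau from the plot above.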
After we have looked around the application and measured the performance of its key components, it’s time to improve the elements that really need it. But now we’re not guessing, since we have hard evidence of where our problem is. Now is the time to narrow the search and move on to code profiling (in the case of backend scripts).
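What that looks like depends on the stack; if the backend happens to be Python, the standard library profiler is enough to start with. A sketch, where handle_request is a hypothetical stand-in for the suspect code path:

    # profile_view.py - profile one suspect code path with the stdlib profiler
    import cProfile
    import pstats

    def handle_request():
        # hypothetical stand-in for the script under suspicion
        return sum(i * i for i in range(100_000))

    profiler = cProfile.Profile()
    profiler.enable()
    handle_request()
    profiler.disable()

    # show the 10 most expensive calls by cumulative time
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)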
Using the described approach allowed us to reveal bottlenecks that were invisible to simple tests. In one case, the database connection took only 3% of the loading time for a single user on a page. But after we started tests with concurrent users, we got an appallingly low throughput value. It helped us figure out that under load our database was growing to 35% of the run-time. A very simple optimization in the right place improved our throughput 3 times! Much more cost-effective than making a swarm of random tweaks.
This focused approach let us spend more time on developing a more complex and well-targeted optimization solution. We saved a tremendous amount of server resources and got a similarly better user experience. You can read about our high-load solution in another article. To sum up: take the time to pinpoint the problem, and save even more time by skipping unnecessary optimizations.