Saturday, January 19, 2008

How To Provision Data Storage Capacity For Content Caching

A technician recently called me for help in estimating the storage capacity needed for Internet caching. He is part of a team that will be responsible for setting up the Internet gateway for an ISP in a small Middle-East country. He had very few numbers available for the necessary derivation, so he could only phrase the question very simply: the ISP will potentially be serving 40,000 users; what Data Storage Capacity should be provisioned for content caching?

Being a part of Team SafeSquid, which builds a Content Filtering Proxy, we often get this kind of query, but with one major difference. Mostly the query is phrased like this: we have an Internet Pipe of X Mbps; what is the recommended Data Storage Capacity for the most efficient caching?

Reasonable advice for such a query can be derived if we allow a few assumptions and focus on some simple facts.

1. Only content that has been fetched over HTTP can be cached.
2. The maximum rate at which content can be fetched depends upon the Internet Pipe.
3. A lot of HTTP traffic is un-cacheable, for example streaming audio / video, pages that display the results of SQL-driven queries (including search engine queries), and even HTML content in Web Mail.
4. The most important content that gets cached is HTML pages, embedded images, style-sheets, JavaScript, and other files that you would have to download and execute on the local desktop, or view with another viewing application, like PDF and (some) Flash files.
5. A simple request to view a web-page with a normal browser automatically triggers downloads of a variety of content like cookies, images and other embedded objects. These are required by the browser to display the page as per the page-design. The components that constitute the web-page may NOT all be sourced from the web-site that served the requested web-page.
6. Modern Internet browsers provide user-manageable caching that is quite similar in principle to the caching done by caching proxies, so not every object is necessarily requested again. But these browsers depend upon the availability of local storage on the client systems, which is usually not over a few hundred Mbytes. And in any case, these local caches are not shareable between different users.
7. Internet resources could have varying levels of utilization depending upon the time of day, resulting in peak and off-peak hours.

Therefore, if we have an Internet Pipe of 10 Mbps, the maximum data we can transfer (data-throughput) is:

10 Mbps x 60 seconds = 600 Mbits of data in a minute
600 x 60 = 36000 Mbits of data in an hour

Now suppose the enterprise uses a bandwidth manager to reserve QoS for each pre-defined application (or protocol). Generally applications like SMTP and VPN are given the lion's share, almost 50%, and the remainder gets shared between HTTP / HTTPS and others. But I know of quite a few customers who would invest in pipes meant exclusively for SMTP and/or VPN, and a separate (cheaper) Internet connection for HTTP / HTTPS. If the enterprise has chosen to host its web-server within its own business premises, then the entire distribution changes completely.

Even if the enterprise does not use a bandwidth manager, resulting in "first come, first served", we could still be guided by an estimated proportioning of traffic on the basis of applications or protocols.

So to build our algorithm, it might be practical to coin a term, HTTP_Share, such that:

HTTP_Share = x% of Internet Pipe.
Now, HTTP_Share would signify the maximum data that would get transferred as HTTP traffic. Therefore, further to our earlier derivation of 36000 Mbits of data throughput per hour, if we factor in the HTTP_Share:

HTTP_Traffic = x% of data-throughput

Now, if x = 35 (35% of overall data transfer was for HTTP):

HTTP_Traffic / hour = ( 0.35 x 36000 ) Mbits / hour = 12600 Mbits / hour

Now presume the enterprise has off-peak hours and peak hours of Internet usage, such that 40% of the day (approximately 9.6 hours) is peak hours, while 60% of the day is off-peak. Peak hours are the daytime periods when we would witness TOTAL UTILIZATION of our Internet pipe. If we suppose that the utilization ratio is about 25%, i.e. the load level during off-peak hours is about 25% of that during peak hours, then we may further estimate, on the basis of the above derivation:

HTTP_Traffic / day = ( ( 12600 x 0.4 ) + ( 12600 x 0.6 x 0.25 ) ) x 24
HTTP_Traffic / day = ( ( 0.4 x 1 ) + ( 0.6 x 0.25 ) ) x 12600 x 24 = 166320 Mbits

This is a rather simplistic model. Something more realistic would require proper hourly stepping that gives a proper distribution pattern over the day.

Now we deal with the toughest and most debatable part! What would be the ratio of cacheable_content in the HTTP_Traffic? Based on my experience at various customer premises, I prefer to assume 30%. That would mean 166320 x 0.3 = 49896 Mbits of content could be cached per day.

Standard practice is to store content for at least 72 hours (store-age). That means we would need storage of at least 3 x 49896 = 149688 Mbits. The conventional 8 bits = 1 byte conversion tells me that we need storage of at least 18711 MBytes.

Another interesting picture that should be visible during peak hours is that the HTTP_Traffic, when considered as data downloaded by the proxy server, should be less than the data sent to the clients, and the difference would be the caching efficiency.
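The pipe-based arithmetic above can be sketched in Python as follows. This is only an illustration under the article's assumptions: the function name and parameter names are hypothetical, and the defaults are simply the example figures used above (10 Mbps pipe, 35% HTTP_Share, 40% peak hours, off-peak load at 25% of peak, 30% cacheable content, 72-hour store-age). With those defaults it works out to 49896 Mbits (about 6237 MBytes) of cacheable content per day, which the sketch then multiplies by the store-age in days.

```python
def pipe_based_cache_estimate(
    pipe_mbps=10.0,             # Internet Pipe, in Mbps
    http_share=0.35,            # HTTP_Share: fraction of the pipe used by HTTP
    peak_fraction=0.40,         # fraction of the day that is peak hours
    off_peak_utilisation=0.25,  # off-peak load relative to peak load
    cacheable_content=0.30,     # assumed cacheable fraction of HTTP traffic
    store_age_days=3,           # store-age: 72 hours
):
    """Estimate cache storage from the size of the Internet Pipe (all sizes in Mbits)."""
    throughput_per_hour = pipe_mbps * 3600
    http_per_hour = http_share * throughput_per_hour

    # Weight of an average hour: peak hours count fully,
    # off-peak hours count at the reduced utilisation.
    day_weight = peak_fraction + (1.0 - peak_fraction) * off_peak_utilisation
    http_per_day = day_weight * http_per_hour * 24

    cacheable_per_day = cacheable_content * http_per_day
    storage_mbits = store_age_days * cacheable_per_day

    return {
        "http_per_hour_mbits": http_per_hour,
        "http_per_day_mbits": http_per_day,
        "cacheable_per_day_mbits": cacheable_per_day,
        "storage_mbits": storage_mbits,
        "storage_mbytes": storage_mbits / 8,  # 8 bits = 1 byte
    }


if __name__ == "__main__":
    for name, value in pipe_based_cache_estimate().items():
        print(f"{name}: {value:,.0f}")
```

Replacing the single peak / off-peak weighting with a list of 24 hourly utilisation values would give the hourly stepping mentioned above; the rest of the calculation would stay the same.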
That would signify that cached content was used to serve the requests made by the clients. In the overall discussion, we have not considered the performance degradation caused by factors such as network latency.

The above methodology, however, still doesn't answer the original question, because in the original question the Internet Pipe was not defined. So I was quite skeptical that such a calculation could ever be performed, because it was the number of users (clients) that was defined, rather than the Internet_Pipe that my known approach starts from. My arguments and insistence were based on the fact that the content that can be cached will be an assumable fraction of the downloaded HTTP content, and the maximum content that can be downloaded will depend on the Internet_Pipe, whether you have one user or a million users.

Tushar Dave from Reliance Infocomm helped me complete the puzzle with an interesting algorithm that turned out to be the missing piece of the overall jigsaw!

Suppose the ISP serves its customers with 256 Kbps connections; then for 40,000 users it apparently needs about 10 Gbps of Internet Pipe. But that is generally never true (in fact, for 40,000 users an ISP would actually commission an Internet Pipe of less than 1 Gbps in most cases!). The ISP is never going to receive a request from every user concurrently, at every moment, because each user spends part of the time viewing content that has already been fetched. This period is known as the OFF-time. An ISP can safely expect at least 50% OFF-time. OFF-time can even go above 75% if the ISP serves mostly home users and small businesses, where the Internet connection is not shared between multiple users. Secondly, most such user accounts are governed by a bandwidth cap; for example, a user can choose an account that allows a download of a few GB.
In the above derivations we estimated the HTTP_Traffic / day from the Internet Pipe; now we instead simply need to derive HTTP_Traffic / day from the expected HTTP_Traffic per month. So the estimate of overall data throughput can still be derived without knowing the Internet Pipe, and the above derivation remains valid!

So let's see if we can do some calculations now (empirical, of course!):

connections = 40,000
user_connection = 256 Kbps
HTTP_Share = 35%
ON_time = 50%
peak_hours = 60%
off_peak_utilisation = 25%
cacheable_content = 35%
store_age = 3 days

PEAK_HTTP_LOAD (in Kbps) = connections x user_connection x HTTP_Share = 3584000
NORMAL_HTTP_LOAD (in Kbps) = PEAK_HTTP_LOAD x ON_time = 1792000
HTTP_Traffic / hour (in Kbits) = NORMAL_HTTP_LOAD x 3600 = 6451200000
Cache_Increment / hour (in Kbits) = cacheable_content x ( HTTP_Traffic / hour ) = 2257920000
Total_Cache_Increment / day (in Kbits) = 24 x ( ( ( 1 - peak_hours ) x off_peak_utilisation ) + peak_hours ) x ( Cache_Increment / hour ) = 24 x 0.7 x 2257920000 = 37933056000
Required Storage Capacity ( in Kbits ) = store_age x ( Total_Cache_Increment / day ) = 113799168000
Required Storage Capacity ( in Mbits ) = 111132000
Required Storage Capacity ( in Gbits ) = 108527.34375

Considering 8 bits = 1 byte, it looks like we need a little over 13,500 GB of storage (about 13566 GB).

However, I would requisition a storage capacity that can accommodate a possible increase in downloaded content of 35% (cacheable_content), sustained over at least 3 store_age cycles, i.e. 13566 x 1.35^3 = approximately 33377 GB.

The above derivation is subject to a lot of assumptions. But it should allow deriving by ratio adjustments quite easily. For example, if the connections went up by 20%, then we would need 20% more storage! More importantly, it allows anybody to differ with my assumptions and yet approximate the storage required.

Looks so simple now. Thank you, Tushar.

Manish Kochar is the founder CEO of Office Efficiencies India Private Limited. Under his guidance, OEIPL has developed a number of security products like CxProtect, an anti-virus solution for Linux-based email servers, and SafeSquid, which is a Linux-based Content Filtering Internet Proxy.
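As a footnote to the user-based estimate in the article, here is a minimal Python sketch of the same chain of multiplications. The function name, parameter names and dictionary keys are hypothetical; the defaults are the article's example inputs (40,000 connections at 256 Kbps, 35% HTTP_Share, 50% ON_time, 60% peak hours, off-peak load at 25% of peak, 35% cacheable content, 3-day store-age). The sketch applies the daily weighting literally, as 24 x ( peak_hours + ( 1 - peak_hours ) x off_peak_utilisation ) times the hourly cache increment, and converts units in steps of 1024 as the article does.

```python
def user_based_cache_estimate(
    connections=40_000,         # number of subscriber connections
    user_connection_kbps=256,   # speed of each connection, in Kbps
    http_share=0.35,            # HTTP_Share
    on_time=0.50,               # ON_time = 1 - OFF-time
    peak_hours=0.60,            # fraction of the day that is peak hours
    off_peak_utilisation=0.25,  # off-peak load relative to peak load
    cacheable_content=0.35,     # assumed cacheable fraction of HTTP traffic
    store_age_days=3,           # store_age
):
    """Estimate cache storage from the number of users instead of the pipe size."""
    peak_http_load = connections * user_connection_kbps * http_share  # Kbps
    normal_http_load = peak_http_load * on_time                       # Kbps
    http_per_hour = normal_http_load * 3600                           # Kbits
    cache_increment_per_hour = cacheable_content * http_per_hour      # Kbits

    # Peak hours count fully, off-peak hours at the reduced utilisation.
    day_weight = peak_hours + (1.0 - peak_hours) * off_peak_utilisation
    cache_increment_per_day = 24 * day_weight * cache_increment_per_hour

    storage_kbits = store_age_days * cache_increment_per_day
    storage_gbits = storage_kbits / 1024 / 1024
    return {
        "peak_http_load_kbps": peak_http_load,
        "normal_http_load_kbps": normal_http_load,
        "http_per_hour_kbits": http_per_hour,
        "cache_increment_per_hour_kbits": cache_increment_per_hour,
        "cache_increment_per_day_kbits": cache_increment_per_day,
        "storage_kbits": storage_kbits,
        "storage_gbytes": storage_gbits / 8,  # 8 bits = 1 byte
    }


if __name__ == "__main__":
    estimate = user_based_cache_estimate()
    headroom = 1.35 ** 3  # the article's growth allowance over 3 store_age cycles
    print(f"storage: {estimate['storage_gbytes']:,.0f} GB")
    print(f"with headroom: {estimate['storage_gbytes'] * headroom:,.0f} GB")
```

Because every step is a multiplication, scaling one input scales the result in the same proportion; for instance, calling the function with connections=48_000 (20% more) raises the storage figure by 20%, which makes it easy to substitute your own assumptions.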
