Web developers: 10 ways to deal with intermittent connections

This post is about web applications designed for online-only usage that, for reasons beyond your control, will occasionally go offline or appear to non-techy end users to have connection problems. Even though we expect it, connectivity is not guaranteed. The good news: there are many things you can control to improve the usability of your sites and the perception of their uptime.

The Internet is inherently unreliable: it goes up and down, and speeds up and slows down, all the time. It’s even less reliable on the mobile web than on a dedicated Ethernet or WiFi connection. Failures can happen within the app, on the Internet connection, and even at the web server or CDN, and when they happen they frustrate users and eventually turn them into unhappy customers. The challenge for you as web developers and IT managers: it’s often hard for the people managing websites to get a good look at the end-user experience because it can be so hard to duplicate.

In general, users blame their “internet connection,” which is shorthand for “it’s the cellphone provider’s fault” or “it’s the DSL or cable company’s fault.” Most people don’t know or care where the problem actually is; they just want it fixed. A common reflex when there is a problem is to simply reload the entire page. In some cases a full page reload isn’t possible, or it’s painful, such as on more complex sites where a full reload means walking back through several steps to get to the final page or view.

So here are a few suggestions to help you, as a web developer, minimize occasional disruptions and keep users as happy as possible. Some of these have been repeated many times elsewhere, but they are well worth seeing again:

Performance. Make your web pages as lightweight as possible. Pages that load faster will ‘appear’ more responsive even if you aren’t chasing millisecond response times. Most of you have had this drilled into your head over and over: aim for fewer and smaller files, use CDNs, move JavaScript library loading to the bottom of your HTML pages, inline small images, and so on. There are many articles on the web about improving performance; search for ‘website performance’ to find out more. Steve Souders, for example, has an excellent website and has written books on the subject.

Caching. Consider page cache settings carefully. The subject of cache headers, such as ETag, Expires, and Last-Modified, is often overlooked and usually misunderstood. Cached content cuts down on the total number of HTTP requests when someone loads your web page. Static content, or content that doesn’t change much, usually warrants longer cache times than content that changes frequently. Even though there are many articles on the web about caching, doing it well can be tricky, and it can be well worth hiring an expert to figure out an optimal configuration quickly. In my case, I spent several months experimenting, subjecting my blog readers to unnecessary page lag and a variety of other problems, until I finally broke down and hired an expert.
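
As a rough illustration, here is a minimal Node.js sketch of the idea: long cache lifetimes for static assets, short ones for pages that change. The /static/ path and the durations are placeholders to tune, not recommendations.

    // Minimal sketch: long cache lifetime for static assets, short
    // lifetime for frequently changing pages.
    const http = require('http');

    http.createServer((req, res) => {
      if (req.url.startsWith('/static/')) {
        // Static content: safe to cache for a long time (one year here).
        res.setHeader('Cache-Control', 'public, max-age=31536000');
      } else {
        // Frequently changing content: short lifetime, revalidate often.
        res.setHeader('Cache-Control', 'public, max-age=60, must-revalidate');
      }
      res.end('...page content...');
    }).listen(8080);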

HTTP requests that block. Be aware of any HTTP operations that block the loading or use of your pages. If you have to use a blocking HTTP request, set a timeout in the client request, such as 20 seconds, and display some sort of loading icon. A good web designer can help walk you through the UI experience. Most modern web servers have server-side timeouts that are longer than most people are willing to wait.
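
To make this concrete, here is a hedged browser-side sketch using fetch with an AbortController timeout. The showLoadingIcon() and hideLoadingIcon() helpers are hypothetical stand-ins for your own UI code.

    // Abort the request after 20 seconds and show a loading indicator
    // while it's in flight. showLoadingIcon()/hideLoadingIcon() are
    // hypothetical UI helpers you would write yourself.
    async function loadWithTimeout(url, timeoutMs = 20000) {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeoutMs);
      showLoadingIcon();
      try {
        const response = await fetch(url, { signal: controller.signal });
        return await response.text();
      } finally {
        clearTimeout(timer);
        hideLoadingIcon();
      }
    }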

Auto-retry. Alternatively, consider a significantly shorter HTTP timeout and retry the connection several times before failing and notifying the user that the app couldn’t connect. These days a single failed request doesn’t necessarily mean the website is down. But very, very few websites employ this pattern, so most people reflexively keep hitting reload whenever there are loading problems. Reloading an entire page puts far more load on your servers than having the app quietly and quickly retry a specific item in the background.
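
Here is one possible shape for those quiet background retries, building on the loadWithTimeout() sketch above. The attempt counts, delays, and the /api/data URL are examples you would tune for your own app.

    // Quietly retry in the background with a short per-attempt timeout
    // before surfacing an error to the user.
    async function loadWithRetry(url, attempts = 3, delayMs = 2000) {
      for (let i = 0; i < attempts; i++) {
        try {
          return await loadWithTimeout(url, 5000); // shorter timeout per try
        } catch (err) {
          if (i === attempts - 1) throw err;       // out of retries
          await new Promise(r => setTimeout(r, delayMs));
        }
      }
    }

    loadWithRetry('/api/data')
      .catch(() => alert('Sorry, we could not reach the server.'));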

More efficient database polling. Long-running database queries can give the impression that the connection is broken. If you have a requirement to poll a server-side database for changes, consider implementing a server-based process that simply returns a JSON Boolean such as {"changes": false} when there is nothing new. In comparison, most server-side database requests run an entire, potentially complex SQL query on every internet request just to tell you nothing changed. From a server-resource standpoint, it’s significantly less overhead to return a simple JSON Boolean and let a long-running server-side process do all the heavy lifting on a regular timer cycle.
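
A minimal sketch of this pattern in Node.js might look like the following, where runExpensiveChangeQuery() is a hypothetical stand-in for your real database work and the 30-second cycle is just an example.

    // A long-running timer does the expensive SQL work on its own cycle;
    // the /changes endpoint just returns the cached Boolean.
    const http = require('http');
    let hasChanges = false;

    setInterval(async () => {
      hasChanges = await runExpensiveChangeQuery(); // heavy lifting here
    }, 30000); // pick a cycle that fits your data

    http.createServer((req, res) => {
      if (req.url === '/changes') {
        res.setHeader('Content-Type', 'application/json');
        res.end(JSON.stringify({ changes: hasChanges }));
      } else {
        res.statusCode = 404;
        res.end();
      }
    }).listen(8080);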

Fail gracefully.  Don’t hang an entire page because the app failed to load a JavaScript library, some other content threw a 404 error, or a database request failed. Don’t do it. I know this seems obvious, but I see it all the time in my daily web surfing; most major companies seem to be guilty of it for activities such as viewing billing pages. See my suggestions above for handling HTTP requests. Let the end user know through some sort of pop-up that a connection has failed or timed out. Native mobile applications have built-in mechanisms for doing this, and granted, they can auto-detect when the Internet connection goes down, but I still believe regular web apps should mimic the behavior when possible.
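
For example, here is one hedged way to keep a missing script from taking the page down with it; notifyUser() is a hypothetical helper for whatever pop-up or banner you prefer.

    // Load a library dynamically so a 404 can't hang the whole page.
    function loadScript(src) {
      const script = document.createElement('script');
      script.src = src;
      script.onerror = () =>
        notifyUser('Part of this page failed to load. Please try again.');
      document.body.appendChild(script);
    }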

ApplicationCache. Consider storing some pages and resources for when a connection goes down by using the HTML5 ApplicationCache interface. This lets you go beyond the typical caching mechanisms, using patterns that can be easier to understand and control than the somewhat black-box and variable nature of header settings.
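
A minimal sketch of listening to ApplicationCache events, assuming your page opts in with an <html manifest="offline.appcache"> attribute (the manifest filename is a placeholder):

    var appCache = window.applicationCache;

    appCache.addEventListener('updateready', function () {
      // A newer set of cached files has finished downloading; swap it in.
      appCache.swapCache();
    });

    appCache.addEventListener('error', function () {
      // Fires when the manifest can't be fetched, which often means offline.
      console.log('AppCache error: possibly offline.');
    });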

Feedback. The ability to email web administrators directly has fallen out of favor over the last five years or so. I suggest bringing it back in a big way, along with clearly posted links. Sometimes the best way to learn that something is down or slow is to hear it directly and immediately from a customer. Yeah, sure, you’ll get some spam email, but if it means keeping customers happy, there are both automated and manual ways to deal with it that work. I can speak personally on this topic: my blog has received over 40,000 spam attempts, of which I’ve personally deleted over 3,000, and I’m a team of one. Some techy sites do provide a “Performance” section in their forums, which is fine as long as employees actually monitor it often. The problem with forums is notification of new posts, which, of course, usually happens via email anyway.

Uptime Monitors. Use uptime monitors placed at different spots around your country, or around the world if you are using a worldwide CDN. Some providers can do this for you, but you should ask questions. The most common scenario I’ve seen is that the uptime monitor lives in the same server farm as the web server. That’s okay, but it doesn’t cover connectivity problems outside your firewall. Uptime monitors should not just ping a website; they should also attempt to load and parse actual content, send a warning email or text message if the content throws an error, and warn if a connection takes too long. There are many reasons why you may think your website is up when it’s not. For example, a CDN node could be down, a CDN server could have the wrong permissions, a major Internet router could be down, or your support folks could be viewing pages on your web server through an internal pathway that is no longer visible to the outside world. These types of monitors don’t cost much to operate, and they can significantly boost customer service ratings and help keep customers happy.
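
Here is a rough Node.js sketch of such a content-aware check. sendAlert() is a hypothetical hook for your email or SMS service, and the URL, marker text, and thresholds are placeholders.

    // Load the page, verify a known marker string, and time the request.
    const https = require('https');

    function checkSite(url, expectedText, maxMs) {
      const start = Date.now();
      https.get(url, (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => {
          const elapsed = Date.now() - start;
          if (res.statusCode !== 200 || !body.includes(expectedText)) {
            sendAlert(url + ' returned bad content (status ' + res.statusCode + ')');
          } else if (elapsed > maxMs) {
            sendAlert(url + ' is slow: ' + elapsed + ' ms');
          }
        });
      }).on('error', (err) => sendAlert(url + ' unreachable: ' + err.message));
    }

    setInterval(() => checkSite('https://example.com/', 'Welcome', 5000), 60000);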

Browser Support. Last but not least, and probably the touchiest subject, is browser support. My recommendation: if you don’t support a particular browser, give the end user a message saying that some functionality may not work properly. We’ve all been to sites on our tablets or phones where pop-ups didn’t work right or things didn’t display properly. Non-tech-savvy end users can easily misunderstand these problems, since they understandably give the appearance that something is broken. If a pop-up didn’t work, it may appear that a sale did not complete, for example. It’s very easy these days to use libraries for browser detection, and browser detection should always be part of a web app deployment plan.
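
Rather than invent a particular detection library’s API, here is a bare-bones feature-detection sketch of the same idea: check for what you depend on and warn instead of failing silently. showWarningBanner() is a hypothetical helper.

    if (!window.localStorage || !window.JSON || !document.addEventListener) {
      showWarningBanner(
        'Your browser is not fully supported; some features may not work.');
    }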

Resources

HTTP Caching Protocols (W3C)

What is a CDN?

Beginners Guide to ApplicationCache

Browser support – Caniuse.com

The Art of Internet Connectivity

Everyone’s internet connectivity experience is unique, and it can vary from minute to minute. Most internet users can sense slowdowns, and everyone can identify when a connection fails. Web developers absolutely rely on a working connection to build web pages, so when our internet connection goes down, our productivity comes to a halt.

I’ve lost count of the number of times I’ve reported to various tech support organizations that I couldn’t reach a particular website or web service, only to be told by the tech: “I was able to reach it just fine.” This happened again today when I called my DSL provider to report that our internet service had gone down completely and then come back degraded to 1/10 of what we were paying for (~1.12 Mbps on a 12 Mbps service). They told me that the line was stable, though I’m not really sure what stable means. The speed then gradually climbed back to normal over the next hour and a half. This has happened about a half dozen times over the last three months.

As a web developer, you load web pages up to several hundred times per day. I almost always have monitoring tools hooked up that report the exact time to download a page and its associated elements, so I have a good idea of when the internet is performing well and when it isn’t. Because of this, I’ve become sensitized to small, millisecond-level changes in download times.

I also gained extensive knowledge of internet connections when working on high availability systems with up to five-nines uptime. We deployed systems that monitored web traffic all over the U.S. 24×7. I was amazed to see that internet traffic was very much like our roadways. Sometimes traffic is moving fast, other times it’s slow in spots, and sometimes it’s completely stopped or even re-routed.

In many cases, a modem (or router; I’m using the terms interchangeably) had simply locked up. This is quite common, as these devices often run a small Linux-based operating system that can occasionally flake out. And I can say with certainty that in the cases where my DSL modem/wireless router hadn’t died and there was still no internet connection, 9 times out of 10 the problem was upstream with the carrier.

Guidelines. Here are some guidelines to help you narrow down where the problem might be:

– Check the modem connectivity lights. If a modem is connected to the internet, the connectivity light will usually be a steady or flickering green. A red light, or no connectivity light at all, almost always means no connection. It should be a matter of reflex to simply restart the modem and see if that fixes the problem.

– If the internet connectivity light doesn’t come back after restarting the modem, then call tech support.

– On rare occasions (about 1 in 10), restarting the server as well as the modem restored connectivity.

– Still no service? Go get a cup of coffee, then come back and recheck.

– Or, if the internet connection light is green, try clearing the browser cache and reloading. Sometimes old versions of pages can stick in the cache.

– Can you load any other websites? If you can, then your particular server or service is most likely down (see the sketch at the end of this list).

– Can you ping the server (for servers that respond to ping)? This determines whether the server has basic connectivity.

– Can you run a tracert (or traceroute)? This lets you look at the connectivity between you and the remote server.

– Document the problems so you have a record for future reference.

– If you need continuous monitoring with alert thresholds, then evaluate continuous monitoring tools such as Paessler.

– If you know how to get the basic troubleshooting out of the way, or if you’ve already done it, then insist on escalation when you call tech support. You need to get back to coding as fast as possible.
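
As promised above, here is a small Node.js sketch that automates the “can you load any other websites?” check: if well-known sites respond but yours doesn’t, the problem is most likely your server rather than your connection. The host names are placeholders.

    const https = require('https');

    function tryHost(host) {
      return new Promise((resolve) => {
        https.get('https://' + host + '/', (res) => {
          res.resume();   // drain the response; we only care that it answered
          resolve(true);
        }).on('error', () => resolve(false));
      });
    }

    (async () => {
      const others = await Promise.all([tryHost('example.com'), tryHost('example.org')]);
      const mine = await tryHost('mysite.example'); // put your own site here
      if (!others.some(Boolean)) console.log('Your connection looks down.');
      else if (!mine) console.log('Connection is fine; your site looks down.');
      else console.log('Everything is reachable.');
    })();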

10 hard-earned hints on how to increase your website uptime

Here are some tips on how to keep your website up and running from a hardware perspective. These are the things that are often overlooked or underfunded and that bring a website down. I spend all day working with software, but hardware is often a mystery, especially to software developers.

Twitter’s recent outage brought to light an important fact about any website: it can crash. Of course, Twitter’s outage affected perhaps a hundred million people globally, but the point is that your website can go down as well.

Let’s take a look at how uptime is typically calculated on a monthly basis: 30 days × 24 hours = 720 hours. If the website is down for 8 hours during that period, that leaves 712 hours of uptime, and 712 divided by 720 works out to an uptime of 98.888%.

Now generally that doesn’t sound too bad, and most outages are intentional, due to maintenance, and therefore occur in the middle of the night or at some other time when web traffic is expected to be lowest. You start noticing outages when they occur during normal business hours, because that’s when your employees’ or visitors’ productivity takes a hit.

Oftentimes, typical businesses place very little focus on the following. It’s my contention that those of us who aren’t Twitter can easily reach near-perfect uptime using these guidelines. Even if you only have a handful of customers who hit your website a few times a week, an outage they experience will reflect negatively on you, even if you are contractually covered. With a little work, you can significantly reduce downtime.

  1. Use real-time monitoring tools or build your own. At a minimum you need to monitor: CPU, available memory, available hard disk space, current inbound bandwidth usage, and the time to complete an HTTP request to at least your home page. Assign staff members who will get emails and text messages if the monitor detects a problem, and expect them to respond within a certain time period, even on holidays and weekends.
  2. If you use a system monitor, host it somewhere other than where your server farm is located. This seems obvious, but I’ve seen the monitor go down with the server because they were on the same machine or at the same site.
  3. Always, always, always take an image backup of the previous version of your website machine. Yes, I mean an image of the entire machine, not just a copy of the website directory. It’s easy to do, it doesn’t take long, and these days a snapshot image doesn’t take much storage space. With modern software, you should be able to do this with a couple of mouse clicks.
  4. If you don’t use the cloud, have a cheapo spare server machine that can be swapped in for your more expensive, super-ultra-redundant machine. You can buy a decent quad-core, 16GB RAM server for roughly $400–$500 as an insurance policy.
  5. Have a cheapo spare even if you use virtual machines. I’ve personally seen two instances where all virtual instances shut down completely when the master server blew out its memory. Computer memory is like the engine in your car: when it dies, your forward progress comes to a complete halt. I can tell you with certainty that most servers you and I use don’t have redundant memory capabilities.
  6. If your primary server is down for some unknown reason, you have several immediate choices: restart it and hope it comes back up, spin up a copy in the cloud, or direct traffic to your backup server. All of these are fine strategies as long as your site isn’t under a denial of service (DoS) attack or getting overwhelming amounts of standard, non-vengeful traffic, which you can find out using the monitoring tools mentioned in #1. Here’s one possible approach; whatever approach you use, test it and document it:

    – Restart the primary server. If it restarts okay, then you’re golden.
    – If the primary crashes, then plug in a hot spare or spin up the standby copy.
    – If the backup copy crashes, then fall back to the rollback snapshot copy you made.
    – If the rollback snapshot copy crashes, then you have no choice but to start a detailed investigation of log files and other forensics.

  7. Do your investigation and forensics after you’ve restored service. This should probably be rule number one! Your website visitors will appreciate it. I’ve been in several situations over the years where the focus was on figuring out why a system went down while the users languished with no service. Do your best to get things running again, then worry about what happened.
  8. I’m going to mention ISP redundancy because it is definitely overlooked. If your server is hosted at a major hosting provider, the provider often contracts with multiple internet backbone providers, the primary carriers of all internet traffic. The hosting facility will have multiple trunks from various primary carriers coming into the facility on different sides of the building, in case one trunk is accidentally cut or has an outage. Check with your hosting provider for options on this.
  9. Universally, everyone will tell you to cache your static content on a CDN. I agree, although to me this is more of a performance issue than an uptime issue, which is why it’s down here at number 9. If you have huge amounts of static content, a CDN could conceivably keep your server from crashing, but for most of us it’s a performance boost.
  10. If you are getting overwhelming amounts of non-vengeful traffic, one tactic I have used to great success is to serve a simple HTML file to every so many visitors saying that we were experiencing high traffic volumes (a sketch of this idea follows the list). The simple file took very little overhead to serve and produced an immediate and significant reduction in the load on the server. Naturally, when the event was over, we took the message down.
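
Here is the promised sketch of that pressure-valve idea in Node.js. underHeavyLoad() and serveFullPage() are hypothetical stand-ins for your own load check and normal request handling, and the 1-in-5 ratio is just an example.

    // Under heavy load, serve a tiny static notice to every Nth visitor
    // instead of the full page.
    const http = require('http');
    let count = 0;
    const NOTICE = '<html><body>We are experiencing high traffic volumes. ' +
                   'Please try again shortly.</body></html>';

    http.createServer((req, res) => {
      count++;
      if (underHeavyLoad() && count % 5 === 0) {
        res.end(NOTICE);              // very cheap to serve
      } else {
        serveFullPage(req, res);      // normal request handling
      }
    }).listen(8080);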

I’m not going to address Denial of Service attacks, since I’m not a security expert; there are plenty of papers on the web and experts you can call. I can say I have averted limited DoS attacks in the past by filtering IP addresses and working with an experienced network service provider. But if you are under a widespread DoS attack, get expert help immediately.

[Edited 6/23 – added a tenth hint! Fixed minor typos.]