Archive for January 21st, 2005

21st Century resilience

This article was part three of a series intended to develop a theme for a presentation at the Enterprise Networks Conference 2004, an event which included the CMA Plenary keynote session. See www.enterprisenetworks.co.uk for more details.

What do we mean by resilience? One dictionary definition is “able to quickly return to a previous good condition”, giving a rubber ball as an example. For people, we generally mean bouncing back after some hard knocks, whether physical or emotional. For Telecoms systems, it is perseverance in the face of adversity, the Pony Express rider ensuring the mail gets through whilst dodging bandits and arrows (or fires in cold-war underground tunnels!)

Phone systems are real-time in software terms- the most important thing they have to do is notice events happening and act on them in a timely fashion. These events are generally quite low level, e.g. a user has pushed a button on a featurephone, a DTMF digit has been detected on a trunk line, an E1 interface has received a signalling message, or the technician has pressed a key on the system terminal. (In the 1970s, it was lower still- the processor was often keeping track of every dialled digit, validating it for speed, make/break ratio, inter-digit pause etc.)

What often isn’t appreciated is that in many systems letting go of a button on a featurephone is also an event- how else could it be possible to buzz-buzz your Secretary?

There is also a corresponding set of “non-events” to be handled, i.e. timeouts where something should have happened but didn’t. An example of this is a user being given dial tone but not doing anything about it. Often the hardware will react to non-events from the software as well, using what is sometimes called a watchdog timer, to capture the situation where St. Vitus gets involved.
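To make the idea of a non-event concrete, here is a minimal sketch of a timer queue run against a simulated clock. It is not taken from any real phone system: the class name `TimerQueue`, the extension numbers and the 15 second dial-tone limit are all assumptions chosen for illustration.

```python
import heapq


class TimerQueue:
    """Holds timed notes ordered by expiry, using a simulated clock in seconds."""

    def __init__(self):
        self._heap = []          # entries are (expiry_time, timer_id, note)
        self._cancelled = set()  # ids of timers whose event did arrive in time
        self._next_id = 0

    def start(self, now, delay, note):
        """Leave a note to be read at now + delay; returns an id for cancelling."""
        self._next_id += 1
        heapq.heappush(self._heap, (now + delay, self._next_id, note))
        return self._next_id

    def cancel(self, timer_id):
        """The awaited event happened, so the note is no longer wanted."""
        self._cancelled.add(timer_id)

    def expired(self, now):
        """Return the notes that have fallen due and were never cancelled."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, timer_id, note = heapq.heappop(self._heap)
            if timer_id in self._cancelled:
                self._cancelled.discard(timer_id)
            else:
                fired.append(note)
        return fired


timers = TimerQueue()
# Two hypothetical extensions go off-hook and are given dial tone.
t201 = timers.start(now=0, delay=15, note="ext 201: no digits after dial tone")
t202 = timers.start(now=2, delay=15, note="ext 202: no digits after dial tone")
timers.cancel(t202)               # ext 202 dialled a digit in time
early = timers.expired(now=10)    # nothing has fallen due yet
late = timers.expired(now=20)     # only ext 201 is reported as a non-event
```

A hardware watchdog works on the same principle turned around: the software has to keep cancelling and restarting the timer, and the hardware acts if it ever expires.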

The software architecture will normally revolve around something called a work scheduler- this is the main engine that allocates tasks according to their importance. Giving a user dial tone may involve several iterations around the loop- firstly recognising the event at a low level & ensuring it is placed into the correct signalling buffer, then recognising what the transition actually means, i.e. a user appears to be initiating a call and it isn’t a priority user so put it into the correct call processing buffer, then eventually allocating resources & providing dial tone when it reaches the head of the queue.
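The trip around the loop described above might be sketched roughly as follows. This is a toy model rather than any vendor's design: the three priority levels, the class `WorkScheduler` and the task names are assumptions made for the example.

```python
import heapq
from itertools import count

# Lower number = more urgent. These three levels are illustrative assumptions.
SIGNALLING, PRIORITY_CALL, NORMAL_CALL = 0, 1, 2


class WorkScheduler:
    """Runs the most important waiting task first, oldest first within a level."""

    def __init__(self):
        self._queue = []
        self._order = count()  # tie-breaker: first come, first served
        self.log = []

    def post(self, priority, task, *args):
        heapq.heappush(self._queue, (priority, next(self._order), task, args))

    def run(self):
        while self._queue:
            _, _, task, args = heapq.heappop(self._queue)
            task(self, *args)


def off_hook_detected(sched, ext, priority_user):
    """Low-level pass: note the event, then queue the call at the right level."""
    sched.log.append(f"signalling: {ext} off-hook")
    level = PRIORITY_CALL if priority_user else NORMAL_CALL
    sched.post(level, give_dial_tone, ext)


def give_dial_tone(sched, ext):
    """Later pass: the call has reached the head of its queue."""
    sched.log.append(f"call processing: dial tone to {ext}")


sched = WorkScheduler()
sched.post(SIGNALLING, off_hook_detected, "201", False)  # ordinary user
sched.post(SIGNALLING, off_hook_detected, "202", True)   # priority user
sched.run()
```

In this sketch both off-hook events are recognised before any dial tone is given, and the priority user on 202 is served ahead of 201 even though 201 lifted the handset first.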

So our system is constantly working very hard looking for something to do & leaving imaginary timed notes for itself to check up on things that don’t happen when they should. So far, this isn’t that different to what any computer system would be up to, be it an enormous supercomputer at the Met Office or a humble Gameboy in your child’s bedroom. The difference comes in the resilience- being able to keep going despite all odds.

Hardware resilience comes in the form of robustness- duplicate or triplicate processors that work in co-operation or hot standby to ensure call processing continues. Mirrored memory, dual interfaces, backplanes & power ensure that most likely hardware failures result in minimal interruption and automatic recovery.

Software resilience comes in the form of recognising hardware faults and recovering from them. It also means recognising unexpected software outcomes and recovering from them, as well as producing an audit trail so that the situation can be duplicated & a solution found. Providing tools and alerts for the maintainers to ensure optimal service delivery varies from product to product & is generally much more simplistic for a Business system than for a Public Exchange.

In the phone system, unexpected events are hopefully worked around, provided that the scenario has been recognised by the designers. As the only software that runs on the system is proprietary there is a high expectation that outside of beta trials, the systems will run stably. Or, more accurately, they appear to do so- most systems run housekeeping routines that “tidy up” after the call processing routines and close scrutiny indicates that the garden is not totally rosy. Phone systems have generally evolved from older systems where memory capacity, processor throughput and intra-system bottlenecks have resulted in constraints. Stress any one of these and even the best known systems start to behave oddly during the busy hour. (This is inevitable as it is never possible to test for absolutely every eventuality in a complex system.)

Contrast this with Servers. High end Servers are available that have redundant power supplies, robust disk arrays, hot swappable cards etc. Whilst they are not totally duplicated internally in their architecture (don’t be fooled by the number of CPUs) they can be clustered together in order to make them more powerful and resilient. Unfortunately, however, it is all too easy to load any old software onto them.

Are they suitable for call processing? They can be, provided that call processing is the primary function, or indeed the only function in a Windows environment. It is also a good idea to have automatic routines to enforce the occasional out of hours reboot in order to minimise the impact of memory leaks. Similarly, call processing needs to be executed as a “service” so that it starts again automatically and the device needs to be tuned for optimal behaviour, something not particularly suited to the ICT policies of using a preferred Server build and load.

Another thing about clustering- it comes in more than one flavour and may require intervention to recover from fail-over. The most useful approach is where one Server can be taken off-line for patching or upgrade whilst the other does the work & then vice versa, preferably seamlessly. This doesn’t entirely exist in the real world, although high availability is possible, even if zero downtime isn’t. Also, be prepared for occasional intervention whilst the typical Server response to an unexpected event continues to be the “blue screen of death”.

Of course, there are benefits. Servers are streets ahead of legacy phone systems with Ethernet and network connectivity. Woe betide anyone who naively put their Option 11 onto the network without an intervening router, as the system would spend more time watching what was going on out beyond the RJ45 than attending to low priority call processing tasks. Servers enable applications to communicate with each other efficiently and effectively, whether across the LAN or the WAN.

Something that does need to be taken seriously on Servers is managing threats- without the latest patch fixes they are vulnerable to attack, as are the intervening network devices, increasingly so if they are from Cisco.

We communicate most effectively using all of our senses, i.e. sight, hearing, touch, smell and taste, along with an awareness of acceleration and gravitational force through our balancing mechanisms. ICT mainly concentrates on sight and hearing (once you put aside oddities like Sensurround, Smellovision and Theme Park rides).

Voice communication has an immediacy & convenience factor that has made the humble telephone omnipresent in the developed world and the penetration of the mobile telephone even more remarkable. The convenience factor is a double-edged sword, of course, as it is frequently inconvenient to receive calls and often downright intrusive. The unidirectional equivalent of the phone is of course Broadcast Radio, although the Tannoy system also fits the analogy.

Text communication has progressed through Telegraph, Telex, Fax and Email, with an exponential increase in content potential and effectiveness. Fax added imaging, the first “rich text format”. Email gave the ability to attach data, graphics, pictures, sounds and programs. However, it remains a unidirectional, non-real time channel, in that whilst it is possible to have email conversations, they are by their nature asynchronous. SMS and MMS are mobility approaches to un-tethered rich text channels and Instant Messaging is an interesting hybrid of immediacy that can be more effective in contacting someone than persistent phoning.

Combining sight and sound, we have Video Conferencing which is poised to become commonplace on most desktops, along with Video messaging, Video broadcast (multicast) and on-demand playback (unicast).

In Telecommunications there are also elements of communication not immediately obvious to the user, e.g. data flow from PCs to the Network, Clients to Servers, Applications to other Applications, housekeeping for connection management and billing.

This is definitely a fractured landscape for the poor old user, further complicated by Chinese walls between channels and devices for home and business use. (That sound you heard was the author giving himself a quick slap for jumping on the buzzword bandwagon).

So how can our new unified systems possibly hang together and work well? Let us take a look at the disparate communications channels that we might possibly want to bring together, along with the transport mechanisms, interfaces and deliverables. This diagram has had several iterations and adding further detail or subtlety has been abandoned for now. It is an exercise for the reader to join all the possible entries and associations together and not make it look a complete rat’s nest!

There are a number of challenges here. As the main theme is resilience, then let us consider what a UCI based system needs to do in a fairly straight-forward set-up, say a single-site office. It needs to communicate with all of the existing communications systems in order to be aware of what is going on. It may need to communicate with the back-office applications where there are associations (or hooks) into functionality. It needs to communicate to all of the users in order to serve their needs & be aware of their current availability set. It will have to be scalable for multi-site and nomadic use.

The important bit is that it has to be able to maximise the user experience in the event of any of these complex interactions failing or behaving inconsistently. Getting everything onto an IP backbone has to make it easier and given time, the other systems will be developed to be aware of and support UCI.

Next article- 21st Century presence, what is it and how can it be tamed?

21st Century Bugs

(This article originally appeared in CMA Newsline March 2004)

This article is part two of a series intended to develop a theme for a presentation at the Enterprise Networks Conference 2004, an event which includes the CMA Plenary keynote session. See www.enterprisenetworks.co.uk for more details.

In part one, I suggested that the biggest obstacle to 21st Century Communications was the poor quality of Software Engineering, a heresy likely to get me drummed out of the British Computer Society. I have visions of the dishonourable discharge where the pony-tailed Technical Design Authority Prime removes the pencils from my breast pocket & ceremonially snaps them, breaks my keyboard over his knee, guillotines my CDs, removes the ball from my mouse, takes a hammer to my cathode ray tube & plunges my SecurID dongle into my Dilbert Coffee mug!

How can I substantiate that sweeping statement? From a couple of decades in the Telecoms industry, on the fringes of (& sometimes in the thick of) the programmers. Something I discovered very quickly is that most programmers love programming but have very little interest in the product they are developing, especially if it is for business process or something not directly software related, such as switching phone calls in a business communications system. A dream job for a programmer is probably either developing software tools, or developing computer games, depending on inclination.

My first stint as a professional programmer involved three months of on-the-job training over in Canada in the mid-1980s. I had the advantage over many of the newbies (& lots of the oldbies as well) in that I had worked on the product for a number of years & knew what it could do. Understanding how it ticked under the hood gave me considerable insight into why it worked the way it did.

After exposure to the various aspects of the process, we were all given a number of simple bugs to investigate, fix & report back on. In a complex multi-user system a peer review is normally part of the process- you have to convince others that you recognise the problem and your solution is appropriate before being allowed to re-integrate into the main software product.

Five of my six bugs were straightforward enough, but one of them was a real humdinger. The feature was something called dial-tone detection, where an outgoing Trunk line is checked for dial tone before sending the digits. (This was before the days of ISDN and supervision signalling could not always be relied upon in some countries, ground start/earth calling PBX trunk lines being uncommon outside UK & America.) The high level design showed that there was a sophisticated algorithm (arcane programmer-speak for a recursive mathematical routine) that would carefully track overall Exchange response in order to isolate faulty Trunks and dial-tone detectors whilst allowing sufficient time to react to dial tone speed changes. There was a robust validation of the algorithm using data familiar to those of us who have had to use grade of service calculations derived from Erlang tables.

The documentation was spot-on and the Author was the most respected senior pointy-head in the place (who had recently moved on to greater things within the Company). Why didn’t it work?

The answer came to me in a flash of insight a couple of days later whilst setting up breakpoints & traps, the software equivalent of an oscilloscope. The reason it didn’t work properly was that the algorithm had a major shortfall- it was designed based on precision numeric calculations to several decimal places but the system itself used integer maths- 42.99999997 was still 42.
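The effect is easy to reproduce. The sketch below is emphatically not the original algorithm, which is not given here; it is a hypothetical smoothing routine, with an assumed weighting of one eighth and arbitrary units, that shows how a design which is sound on paper can stall once the fractions are thrown away.

```python
# The example from the text: an integer simply drops the fraction.
truncated = int(42.99999997)  # 42

SMOOTHING = 8  # hypothetical weighting: move 1/8 of the way towards each sample


def smooth_float(estimate, sample):
    """What the paper design assumes: fractions are carried forward."""
    return estimate + (sample - estimate) / SMOOTHING


def smooth_int(estimate, sample):
    """What an integer-only system does: the fraction is lost on every pass."""
    return estimate + (sample - estimate) // SMOOTHING


float_estimate, int_estimate = 40.0, 40
for _ in range(20):  # the measured response has moved from 40 to 47
    float_estimate = smooth_float(float_estimate, 47)
    int_estimate = smooth_int(int_estimate, 47)

print(float_estimate, int_estimate)
```

The floating point estimate creeps up towards 47, whereas the integer estimate never leaves 40, because a difference of 7 divided by 8 is always rounded down to nothing. That is one way a flat spot can appear on a response curve.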

Once this was grasped, it showed up several flat-spots on the response curve where certain scenarios could result in calls hunting between Trunks rather than waiting the appropriate time. The fix was reasonably straight-forward- a couple of lines of compensatory code to smooth out the response hiccups.

What was considerably more enlightening, however, was getting the solution through code review. The experienced programmers had trouble accepting that such a mistake was possible by one so senior, such is the hubris of the profession. Whilst I was able to eventually persuade them, it took considerable escalation through the ranks as there was also a certain level of denial- if it was so fundamentally flawed, it should have shown up during testing, regression, field trial etc. The reality had also been that the programmer had designed some very complex code as an intellectual exercise then someone else had implemented it without too much thought.

Whilst this was just one (particularly memorable) bug from two decades ago, have things got better? To an extent. On the positive side, languages have evolved considerably, along with the tools and processes used to create and manage them. The downside, however, is that the proprietary platforms and languages have largely given way to generic tools. This is a good thing in many ways but it means that problems with functionality cannot always be identified and sorted out in-house, especially when the tool source code cannot be examined (the Windows vs. Linux argument).

A particular problem that hasn’t changed much is something known as scope creep- where the definition of what software is meant to actually do evolves during the course of development, often compromising the original design. There are many factors that cause this, starting with the business not actually knowing what they want at the outset, the analysts interpreting the requirements to fit a solution, the developers creating specifications that are open to interpretation and the business then changing priorities & functionality once they get their hands on the beta software.

There is always pressure- pressure to deliver on time, pressure to deliver what is wanted (rather than what is asked for), pressure to maximise quality and pressure to minimise expenditure. These pressures are perfectly normal for any type of project but when it comes to complex systems where the project managers can’t actually wander round site and see what the chaps in hard hats are up to, then it is often a voyage into the unknown.

Another area that causes bugs is interpretation of standards. When Telecommunications was very standards-based, the specs were thorough, well researched and definitive. The CCITT would publish the manuals on a four-year cycle and the colour of the cover determined the vintage. Suppliers would talk about Q.921 Blue Book & everyone knew what they meant. The standards bodies would adopt each other’s standards and everything in the garden was more-or-less rosy. However, as time went on, the pace of change blew this out of the water. An example of this is ISDN in Europe, where the early introductions were Country specific with standards such as DASS and 1TR6 before giving way to Euro-ISDN which still has a number of flavours.

Nowadays, whilst there are still numerous standards for all sorts of things, the Internet (bizarrely enough) doesn’t actually have any in the accepted sense of the word. Instead it has RFCs, or “request for comments”, maintained by the Internet Engineering Task Force. RFCs cover things such as routing, protocols and even accepted procedures for quoting text in email replies, something Microsoft chose to ignore when they created Outlook Express.

So, equipped with some programmers, some processes and some standards, all with varying levels of dodginess, our tame software house wants to write the ultimate killer application, the one that unifies the fractured communications landscape & makes them a shed-load of money in the process.

How well will they manage that? It depends on their background and to some extent whether the emphasis in ICT is on IT or CT. Developers of real-time carrier-class systems with expectations of 99.999% availability have different approaches to developers of billing systems requiring 99.999% accuracy.
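It is worth doing the sums on what the same five nines mean in each camp. The short calculation below assumes a 365-day year and, for the billing case, a nominal million records; the function name `shortfall` is made up for the example.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def shortfall(percent):
    """Share of the time (or of the records) that is allowed to go wrong."""
    return 1 - Fraction(percent) / 100


five_nines = shortfall("99.999")  # exactly 1/100000

# Availability: permitted downtime, a little over five minutes a year.
downtime_minutes = float(MINUTES_PER_YEAR * five_nines)

# Accuracy: permitted errors, ten in every million billing records.
wrong_bills_per_million = float(1_000_000 * five_nines)

print(downtime_minutes, wrong_bills_per_million)
```

Exact fractions are used rather than floating point so that the answer is not itself a victim of rounding. The figure is the same but the targets are very different: one is about never stopping, the other about never being wrong.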

The next article in the series looks at resilience, i.e. ensuring the IP equivalent of dial tone for all communications services.