Human Factors in Managing IT Security Systems

Sunday, June 26, 2005

This paper grew out of other areas that I've been working on lately, namely aviation medicine, human factors, and the causes of aviation accidents attributed to human factors. Many of the concepts of aviation safety can also be applied to IT security matters, and this paper outlines some initial thoughts on the matter.

While aviation is extremely safe (well, safer than travelling in a car), it should be noted that in aviation there is little room for mistakes. Gravity is very unforgiving. It is a similar situation with IT security - if an administrator makes a simple mistake, there is a horde of hackers, crackers, spammers, and automated malware which will dish out punishment for the mistake with very little regard for the poor administrator.

Human Factors

All computer systems are managed, operated, and used by people. Contrary to what popular science fiction may have you believe, computers do not have a mind of their own; they are carefully designed, managed, and used by real people. This paper explores how human error has led to major problems in managing IT security systems, and how most problems are increasingly caused by human factors rather than any technical or environmental failure.

Here's a theory: In any reasonably well organised IT operation, most security failures will be caused by the operations personnel.

To anyone in management, or the casual user in the street, this may defy belief. But it comes about because it's quite easy to implement a security system which is sufficient to "do the job", but very hard to maintain it.

Systems with a single administrative domain are particularly vulnerable. These are the systems where management has been fully centralised into a single tool or administration interface.

Let's look at a typical example. Company xyz has an Internet firewall consisting of two redundant nodes in a High Availability (H/A) configuration, both on uninterruptible power supplies, in separate racks, etc. However, they are managed as a single unit, i.e. a single security policy configuration is loaded onto both systems. Here, the hardware and infrastructure are extremely reliable - they are very unlikely to fail. But there is a single point of administration: if an administrator makes a bad configuration and loads it onto the firewall cluster, then the whole environment fails. The environment has a weak link: the people who operate it can (and most likely will) make mistakes which will have a severe impact.
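To make the single point of administration concrete, here is a minimal sketch of the opposite approach: loading a new configuration onto one node at a time and checking it before touching the next. This is an illustration only, not something Company xyz did. The names staged_rollout, apply_config and health_check are hypothetical; a real firewall product would supply its own way of loading a policy and testing that traffic still flows.

```python
def staged_rollout(nodes, apply_config, health_check, new_config, old_config):
    """Apply new_config to one node at a time.

    apply_config(node, config) and health_check(node) are hypothetical
    callbacks supplied by the caller. On the first failed health check,
    every node changed so far is returned to old_config and the rollout
    stops, so the remaining nodes never see the bad configuration.
    """
    updated = []
    for node in nodes:
        apply_config(node, new_config)
        updated.append(node)
        if not health_check(node):
            # Roll back only the nodes we have actually changed.
            for done in updated:
                apply_config(done, old_config)
            return False
    return True


if __name__ == "__main__":
    # In-memory stand-in for a two-node cluster (assumed, for demonstration).
    cluster = {"fw1": "policy-v1", "fw2": "policy-v1"}

    def apply_config(node, config):
        cluster[node] = config

    def health_check(node):
        # Pretend that "policy-broken" blocks all traffic.
        return cluster[node] != "policy-broken"

    ok = staged_rollout(["fw1", "fw2"], apply_config, health_check,
                        "policy-broken", "policy-v1")
    print(ok, cluster)
```

The point of the sketch is that the administrator's mistake still happens, but it reaches one node instead of the whole cluster.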

I have had first-hand experience of such a situation. A large company in New Zealand had a pair of load balancers to manage traffic volumes. They were in a dual-redundant configuration to prevent failures. However, they had a single administrative domain and automatically shared and replicated a common configuration. During one period, a misconfiguration stopped all traffic passing through these devices, and it wasn't caused by any attacker.

In another example, a networking company decided against using DNS for host-name to IP address mapping and chose to use local hosts files on every system. To keep everything up to date, they implemented a distribution system where the hosts file would be copied from a single internal system onto all other systems in the network. As you can probably guess, one day the hosts file on the central server got corrupted, and was diligently copied to all of the other systems. With corrupt hosts files, the systems could no longer communicate on the network, and could not even receive a valid hosts file again. It was a number of days before order was restored.
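A simple sanity check on the central copy before it is distributed would catch this kind of corruption. The following is a minimal sketch of such a check, not what the company in question ran. The function name validate_hosts and the min_entries threshold are assumptions made for illustration; the only library used is Python's standard ipaddress module.

```python
import ipaddress


def validate_hosts(text, min_entries=1):
    """Return a list of problems found in hosts-file text.

    An empty list means the file looks sane enough to distribute.
    min_entries is an assumed safety threshold: a file that suddenly
    contains far fewer entries than expected is treated as suspect.
    """
    problems = []
    entries = 0
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            problems.append(
                f"line {lineno}: expected an address and at least one name")
            continue
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            problems.append(
                f"line {lineno}: invalid IP address {fields[0]!r}")
            continue
        entries += 1
    if entries < min_entries:
        problems.append(
            f"only {entries} valid entries, expected at least {min_entries}")
    return problems


if __name__ == "__main__":
    sample = "127.0.0.1 localhost\n10.0.0.5 fileserver fs  # main store\n"
    print(validate_hosts(sample, min_entries=2))
```

A distribution script would refuse to copy the file whenever the returned list is non-empty, leaving the last good copy in place on every system.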

Personally, I wish I had a dollar for each time a systems administrator has accidentally deleted the entire contents of a file server or database, and then realised that there is no backup or that it will take an immense amount of time to recover from the backups.

System administrators make mistakes. And the mistakes are made for a number of reasons:

  • Ignorance - Ignorance is the lack of education. It is very common in the IT industry for people to try to "fix" critical systems of which they have insufficient knowledge. This can be a very dangerous area - and managers should be aware of the importance of educating staff to perform the tasks required of them. Ignorance can often lead to stress.
  • Negligence - Negligence is the failure to perform a task which should be well known and/or obvious. It is common amongst IT people to skip practices (such as change control procedures, performing backups, etc.) when the workload becomes high - also known as load shedding. Such actions can lead to further problems. Negligence is best handled by management educating staff and managing their workload effectively.
  • Stress - Stress is very common in the IT industry, particularly amongst those working in a Helpdesk situation, or in a technical support role. These people constantly receive nothing but the worst of calls: clients with problems that need to be understood and fixed quickly. Stress is caused by many factors, mostly from the desire to perform to perceived expectations. Stress often leads to fatigue and health problems.
  • Fatigue - Fatigue is the desire to rest. It often occurs when stress levels are high, or the person has not been resting enough. Mistakes through fatigue are common in the IT industry: frequently, changes to critical systems must be made at night when they can be made without disrupting important users. These jobs place a lot of strain on technical people, and can lead to fatigue - often to the point that the probability of making a mistake can outweigh the importance of the work being performed. Fatigue can build up over a period of days and weeks and often leads to serious health problems.

In Security Operations, stress can be a major contributing factor. Many operations teams I have come across have a purely reactive role within the organisation. While they are responsible for correcting problems, they often don't have the authority to remove the root cause of the problem. One large organisation had a virus circulate throughout its internal network, most likely introduced from a laptop which had been taken home and connected directly to the Internet. While the security team spent days cleaning up the mess, they were powerless to prevent users and managers from taking home their laptops and connecting them to the Internet.

In his book "Practical Unix and Internet Security," Professor Gene Spafford of Purdue University spells out Spaf's first principle of security administration:
"If you have responsibility for security but have no authority to set rules or punish violators, your own role in the organization is to take the blame when something big goes wrong."

This is a perfect description of almost every IT security operations team that I've met. It is very difficult to work efficiently in any situation where you are frequently faced with problems which keep recurring and over which you have no control. And it leads to immense levels of stress in people.

For instance, let's look at viruses and worms which commonly attack Windows-based PCs. These things have been in existence for many years now, they are well understood (save for new subtle variations), and there are many tools and products available to defend against them. However, time and again companies are infected, and in some cases totally overwhelmed, by viruses. There is no excuse for this happening - save for the basic human factor of negligence.