SSH Key Management

Last week I moved toward using a Lubuntu host as a management workstation in various scenarios. I need the following tools for this to be effective:

  • SSH Client (for network devices, especially Cisco switches, routers, and firewalls)
  • Web Browser (for various reasons, especially ESXi, iDRACs, and Collaboration / Ticketing software)
  • Remote Desktop Client (for pesky Windows systems)

Along the way I learned that Cisco has implemented SSH key authentication in newer versions of IOS. The OpenSSH client (included with Lubuntu by default) can be configured on a per-host basis, allowing simple operation as long as the operator knows the hostnames of the managed devices. This leads to the question of how to properly manage an SSH key infrastructure. This is an age-old problem with established best practices that can be found in this NIST document:

https://csrc.nist.gov/publications/detail/nistir/7966/final

Within this document is a reference to another NIST document that I need to keep on my reading list, detailing the currently recommended key lengths and approved algorithms:

https://csrc.nist.gov/publications/detail/sp/800-131a/rev-2/final
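
For the per-host configuration mentioned above, here is a minimal sketch of what I have in mind for ~/.ssh/config. The hostnames, username, and key path are hypothetical examples, and the commented-out options are only needed for older IOS images that still require legacy algorithms:

    # ~/.ssh/config -- per-host settings for managed network devices
    # Hostnames, username, and key path are hypothetical examples.
    Host sw-core-01 rtr-edge-01
        User netadmin
        IdentityFile ~/.ssh/id_rsa_netops
        IdentitiesOnly yes
        # Older IOS images may only offer legacy algorithms; uncomment if needed.
        # KexAlgorithms +diffie-hellman-group14-sha1
        # HostKeyAlgorithms +ssh-rsa

Generating the key itself is a single command (something like ssh-keygen -t rsa -b 3072), with the length and algorithm chosen per the SP 800-131A guidance above.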

Segmentation of Services

I’ve read and heard recommendations of network segmentation for years, and have made my living segmenting networks from time to time. Recently I’ve realized that segmenting network services is just a subset of segmenting the other services that support the system.

We’ve watched for a while as a centralized Active Directory service is compromised and then used to propagate malware throughout the domain. The same bad actors sometimes target backup (or recovery) systems; if a backup system provides recovery services for a large or critical group of systems, then it itself becomes a significant point of failure.

In industrial control systems, one of the unacknowledged strengths has been the distributed (or segmented) nature of the systems, as they have traditionally been deployed without taking ‘advantage’ of centralized corporate Information Technology (IT) services. It seems that there is now a push to provide centralized management of Operational Technology (OT) systems for all kinds of useful things: inventory management, vulnerability management, malicious code protection (anti-virus, whitelisting, or intrusion detection), log management, performance management, and so on. In many cases these services aren’t implemented well, or at all, in the OT environments, so the centralized provision of these services seems like an easy improvement. The local facilities are enticed into participation because the cost is often borne by the greater corporation and not directly by the local budgets.

To what degree should these services be segmented across the enterprise? It doesn’t make sense to have a standalone solution for each service at every one of hundreds of sites, but it also doesn’t make sense to have a single solution for all services across the entire enterprise. The challenge is to find the balance between the cost to maintain (unacceptable with many small site solutions) and the cost of compromise (unacceptable with a single enterprise solution).

One approach is to evaluate the business by functional blocks, from an OT perspective. A business segment can be defined as the group of facilities that are required to provide a product or service to the customer; the impact of a failure could be limited to a product, service, or group of similar products or services. Another approach is to define groups of facilities and to plan for the failure of any one group of facilities. The other groups should be able to increase capacity or contribute inventory to cover the failure.

Network segmentation tends to happen at two levels: segmentation of the OT environments from the IT environments, and segmentation of the OT networks within a facility. This provides a series of network bulkheads, as in a ship, where the failure of one compartment does not sink the entire vessel. What is the necessary balance to achieve the same risk-reduction benefit for other services?
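
To make the bulkhead idea concrete, here is a sketch of what the rules on an IT/OT boundary firewall might look like, written in pf syntax since pfSense comes up later in this post. The interface names, subnets, and allowed services are hypothetical; a real ruleset would be driven by the site’s documented conduits:

    # Hypothetical pf rules on an IT/OT boundary firewall.
    it_if  = "em0"                 # interface facing the corporate IT network
    ot_if  = "em1"                 # interface facing the plant OT network
    # Default: nothing crosses the boundary unless explicitly allowed.
    block log all
    # Allow only the specific management services the OT segment actually needs.
    pass in on $it_if proto tcp from 10.1.10.0/24 to 10.20.0.5 port 443   # patch server
    pass in on $it_if proto tcp from 10.1.10.0/24 to 10.20.0.6 port 22    # jump host
    # OT initiates the data flows toward IT (historian, log collection).
    pass in on $ot_if proto tcp from 10.20.0.0/24 to 10.1.20.7 port 5514  # syslog over TCP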

Supervision, again

This week I realized that I’m seeing a problem repeat itself across time and organizations. Today I realized that there is a broader application of the problem, and I’m beginning to think that there may be a solution to it. I say Supervision, again because I believe that automation without supervision is doomed to fail. In the end, supervision is a major part of the solution to this problem as well.

I have watched as companies deploy computerized control systems using traditional project methodologies, and the local maintenance personnel or users take responsibility for the administration of these systems once the projects end. The degree and quality of administration varies with the people who do it and the amount of time they are willing to spend on this new and additional responsibility. Most of the time this includes physical inventory, backups, and maybe anti-virus software, but only rarely does it include patch management, recovery exercises, or performance monitoring. The company makes a capital investment (the project), reaps the benefits of the new automated controls (return on investment), and dodges ongoing maintenance costs by hiding them in the existing maintenance infrastructure or failing to perform them entirely.

More recently, a central authority from elsewhere offers to provide administrative services to the system in the name of compliance or security. They have a project (another capital investment) to install centralized tools such as backup software, anti-virus software, patch management systems, performance monitoring systems, and sometimes intrusion detection or log management systems. A consequence of the centralization is that all of these systems are administered by the central authority, and the local maintenance personnel cannot easily supervise the effectiveness of the tools. In some cases the tools are not visible to the locals, in others the locals are not provided meaningful training on how to reach or operate them, and in yet others the locals are encouraged to spend their time elsewhere and to simply trust the centralized authority to handle the responsibility of administering the control systems.

In my long career I’ve seen what happens when the centralized authority fails to meet its responsibility, for whatever reason. A switch dies and the locals discover that a spare was never provisioned. A computer fails and we discover that the database wasn’t backed up by the software. A virus infects a computer and is reported to the centralized console, but nobody takes action on the alert in time to prevent damage. Even when the central authority does perform some of the tasks, it rarely helps to solve the problems that arise; and now the locals are trying to fix a reported problem that they do not understand, using skills they haven’t been trained in, with time that is borrowed from their full-time jobs.

In a similar way, I am aware that some Owners hire Infrastructure as a Service (IaaS) providers and Managed Security Service Providers (MSSPs) to provide administrative services, and later find that they were billed for services that weren’t performed completely or were inadequate for their needs. The Owner is responsible for their assets from beginning to end and can only delegate authority to others to help manage that responsibility.

In the end, the Owner is responsible for administering their cyber assets. When the Owner delegates authority to others, either internally or externally, they are still responsible for ensuring that the responsibility is met. If the Owner is not technically capable of supervising their delegates, then they can delegate that supervisory authority to a third party and still meet the responsibility.

Local maintenance personnel end up in a curious place; when the systems fail, they are personally responsible for repairing them. I have rarely seen a CEO or senior manager at the plant on a midnight shift, a weekend, or a holiday while the repairs are ongoing. Perhaps management pays a price in money, but that is significantly different than paying the price in one’s personal time. This harkens back to discussions about why a personally written Thank You card can mean more than an electronically transferred check.

Part of the solution is for the Owner to establish a Program to administer computerized control systems, to staff that Program adequately to meet the requirements, and to provide projects that enable everyone to manage the program and any repair events. These projects may provide tools or training. Part of the solution is for the Owner to supervise the Program, through a combination of audits and exercises, to ensure that it is effective.

Syslog Collection

Lately I’ve been thinking about ways to collect syslog data from network devices (especially Cisco and pfSense). Traditionally I’ve used commercial solutions, but I would really like a simple open source solution to offer my clients early in a network consulting engagement. This post documents the search, and hopefully ends with a Preferred Solution.

A long time ago I discovered a tool called Logstash (https://en.wikipedia.org/wiki/Elasticsearch), which sounds like it was made for the job. I may always love the logo! It turns out that Logstash is a log-parsing engine that feeds its data into an Elasticsearch database, with Kibana for presentation. It looks like setting this up requires multiple servers and is far bigger and more complex than I really want.

Next up is a tool called syslog-ng (https://www.syslog-ng.com/). It looks like a powerful tool, but the open source version includes only command line support. Most of my clients are more comfortable with graphical user interfaces (aka GUIs), so I’d like to expose syslog data to them through a GUI even if I manage it using the command line. Searching for an accompanying tool led me to an interesting option: Logzilla (https://www.syslog-ng.com/community/b/blog/posts/web-interfaces-for-your-syslog-server-an-overview/). According to the post, “Logzilla focuses on logs from Cisco devices” – a perfect fit!
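
As a note to myself before moving on, a minimal syslog-ng setup is only a few lines. This is just a sketch that listens for UDP syslog and writes one file per sending device; the path, port, and file layout are assumptions rather than a tested configuration:

    # Hypothetical fragment for /etc/syslog-ng/conf.d/network.conf
    source s_net {
        # Listen for syslog from the network devices on the standard UDP port.
        network(transport("udp") port(514));
    };
    destination d_per_host {
        # One log file per sending device, e.g. /var/log/network/sw-core-01.log
        file("/var/log/network/${HOST}.log" create_dirs(yes));
    };
    log { source(s_net); destination(d_per_host); };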

Logzilla is another open source solution (https://www.logzilla.net/). They’re marketing it as Network Event Orchestration (NEO). The NEO engine can be used at no cost for fewer than one million events per day. It runs in a Docker container and has the following system requirements:

  • Docker version 18+
  • 8 CPU cores
  • 8 GB RAM
  • 1000 disk IOPS

Logzilla makes a VM image available: https://logzilla.sh/LogZilla-NEO.ova. It is a sizeable download, but makes for an easy install.

What if I want to run NEO on a FreeBSD system? Then I’d need to install Docker on FreeBSD (which is broken today), as detailed here: https://wiki.freebsd.org/Docker. Then it looks like I would need to install the application using the following command: curl -fsSL https://logzilla.sh | bash.
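
Whichever collector I end up with, pointing the network devices at it is the easy part. Here is a sketch of the Cisco IOS side; the collector address and source interface are placeholders:

    ! Hypothetical IOS configuration sending syslog to a collector at 10.0.0.50
    logging host 10.0.0.50
    logging trap informational
    logging source-interface Vlan10
    service timestamps log datetime msec localtime show-timezone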

Firewall Logs

A firewall’s primary function is to control traffic. Troubleshooting authorized connections or verifying configuration changes are necessary functions that support that primary function. While we can use a firewall to detect and notify administrators of suspicious activity, there is a balance between collecting log information to support administrative functions and conserving resources (processor, bandwidth, and usability of data). As we learn what normal traffic looks like, we should tailor the log information to reflect that normal, keeping in mind that sometimes an adversary will be able to cloak their activity within that normal traffic. An owner would be wise to install an Intrusion Detection System and even a Network Monitoring System in order to better detect adversarial activity. In cases where an owner chooses not to install more capable detection layers, I think we are well served by maintaining a robust ability to detect abnormal activity using log information. With log information alone, we can only effectively detect abnormal activity by reducing the noise floor generated by normal activity, so that a casual network administrator can recognize the abnormal activity.
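
On pfSense (or any pf-based firewall), the per-rule log keyword is the lever for tuning that noise floor. A sketch of the idea, with hypothetical interface names and addresses:

    # Hypothetical pf rules illustrating a reduced noise floor.
    lan_if  = "em1"                    # placeholder internal interface
    lan_net = "192.168.10.0/24"        # placeholder internal subnet
    # Default: block and log anything that is not explicitly allowed.
    block log all
    # Known-good, high-volume traffic passes without the log keyword,
    # so it does not drown out the interesting events.
    pass in on $lan_if proto tcp from $lan_net to any port { 80 443 }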

In the case where an owner puts forth minimal effort to control traffic and correct configuration errors, we will have much less sensitivity to abnormal traffic. This is a consequence of that owner’s decisions and can only be avoided by giving up on detection entirely.