Few would dispute that the footprint of today’s data center at most enterprises, including financial institutions, is changing. While some still maintain a large, enclosed computer or server room, most are opting for a hybrid approach, relocating much, if not all, of their technology to one or more cloud service providers.
Whether on-site or not, you, the enterprise, still must manage and oversee your information technology. And that requires a cross-department group of skilled team members working with a common set of goals and objectives.
Tackling operations in the data center
In prior CUSO Magazine articles, we discussed managing a data center from a business continuity perspective using a risk-based approach to address the confidentiality, integrity, and availability of systems and data on your network. But what does that look like on any given day?
In 2021, the FFIEC* published an IT Examination Handbook titled “Architecture, Infrastructure, and Operations,” replacing the 2004 “Operations Management” guide. While two-thirds of the handbook addresses Architecture (Design) and Infrastructure (Build and Maintain), the section on Operational Management is well worth the read.
In this article (part one), we will provide an overview of Operations in the data center. In future articles, we will discuss Architecture (part two) and Infrastructure (part three). For now, our focus is on managing the data center and enterprise IT.
Breaking down operations
In the IT handbook, Operations is defined as “the performance of activities comprising methods, principles, processes, and services that support the business function.” Again, whether your data center is the size of a warehouse or more like a walk-in closet, there are activities you need to be aware of to maintain the IT your staff depends on to build, deliver, and support the products and services you provide.
With the ongoing digital transformation, both internal and member-facing, the reliability and sustainability of your data center and network have never been more important. Failure to maintain them may expose the enterprise to various risks, including credit, liquidity, operational, compliance, and reputational risk.
These operational activities are often referred to as “back-office” in part because they are traditionally performed behind the scenes, out of sight of member-facing functions. Operational activities can be thought of as the “nerve center” of the enterprise, encompassing the day-to-day processing and support functions, service delivery and management, and control processes to support internal staff and the overall mission of the enterprise.
The FFIEC IT handbook breaks these down into four categories: Operational Controls, IT Systems Controls, Service and Support, and Ongoing Monitoring and Evaluation.
Accomplishing these management and support activities requires collaboration from multiple departments. Depending on the structure of your organization, these may include Facilities, Networking (NOC), Security (SOC), Software Development, Customer Service, Finance, Compliance/Auditing, Human Resources/Training, Operations, and Procurement/Logistics, to name a few.
Let us now define what we mean by “data center.” The IT handbook describes it as “the location where the entity houses and maintains its processing, data, storage, and communications systems and equipment.” Data centers may be on-premises, at a third-party location, co-located, or operated in the cloud. In parts two and three of this article, we will discuss the types and design of the data center in more detail.
Management should designate responsibilities for implementing appropriate security and environmental controls. Whether centralized or decentralized, data center responsibilities also should include training staff to operate and maintain the enterprise’s equipment and systems, deploying appropriate connectivity, and managing incidents and events. As part of its responsibilities, operating center personnel often provide support for disaster recovery and business continuity testing for the operating center and business lines. Operational support may include switching processing from production to alternate sites and systems.
A summary of the four categories for supporting the data center is provided below:
Operational controls
Operational controls are the guardrails that allow internal staff to operate in a safe and effective manner aligned with the risk appetite of the enterprise. They comprise three main categories: security, user access management, and personnel controls.
Security:
- Physical, logical, and environmental controls (locks, lighting, fire suppression, etc.)
- Perimeter protection devices with defined internal security zones
- Internal/external communication systems
Personnel controls:
- People and processes supporting the enterprise’s mission and functions
- Hiring and retention practices
- Maintaining appropriate skills and knowledge for job performance
User access management:
- User authentication controls (identity)
- User access controls (permissions)
- Clearly defined roles, responsibilities, and expectations
- Dual control and segregation of duties
- Rotation of duties
- Activity monitoring
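As a rough illustration of how dual control and segregation of duties might be checked in practice, the sketch below scans role assignments for users who hold two roles that should be kept apart. The role names, the conflict pairs, and the `sod_violations` helper are all hypothetical and are not taken from the FFIEC handbook; a real review would draw on your own access management system.

```python
# Hypothetical pairs of duties that one person should not hold together.
CONFLICTING_ROLES = [
    ("request_change", "approve_change"),
    ("create_wire", "release_wire"),
]


def sod_violations(user_roles, conflicts=CONFLICTING_ROLES):
    """Return (user, role_a, role_b) for every user holding a conflicting pair."""
    violations = []
    for user, roles in sorted(user_roles.items()):
        for role_a, role_b in conflicts:
            if role_a in roles and role_b in roles:
                violations.append((user, role_a, role_b))
    return violations


# Illustrative assignments only.
assignments = {
    "alice": {"request_change"},
    "bob": {"request_change", "approve_change"},
    "carol": {"create_wire"},
}

for user, role_a, role_b in sod_violations(assignments):
    print(f"{user} holds both {role_a} and {role_b}")
```

A report like this is only a starting point; exceptions and compensating controls still need a human decision.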
Operational controls are at the core of the data center, regardless of size and complexity. Basic security controls start at this layer. Human error is often the source of poor performance and lost productivity, and sometimes of a disruptive event. An enterprise’s staff is an ongoing target of social engineering, given that human interaction is the initial step in most data breaches today.
Managing operational controls means being intentional and detail-oriented. Here is where the culture of the enterprise can have a significant impact.
IT systems controls
This category is designed and implemented to protect the confidentiality, integrity, and availability of IT resources required to support enterprise business functions.
Maintenance: Regular systems and equipment maintenance including environmental systems (HVAC, UPS, generator, etc.).
Secure configuration and change management: Throughout the entire lifecycle of all systems and equipment to maintain the desired security posture.
Vulnerability and patch management: Established policies and procedures for timely discovery, evaluation, and remediation of vulnerabilities on all systems and devices.
Backup and replication:
- Established strategy of redundancy and data archiving that meets predetermined recovery time and recovery point objectives (RTO/RPO).
- Off-site storage of data as a safeguard in the event of a disaster.
- Increased protection of backup systems for potential response and recovery from cyberattacks (esp. ransomware).
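To make the recovery objectives concrete, here is a minimal sketch of how a backup schedule and a recovery test might be checked against them. It assumes the common reading of the terms: the recovery point objective (RPO) caps how much recent data may be lost, and the recovery time objective (RTO) caps how long restoration may take. The function names and the timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta


def rpo_met(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the data written since the last backup stays within the RPO."""
    return failure_time - last_backup <= rpo


def rto_met(outage_start: datetime, restored_at: datetime, rto: timedelta) -> bool:
    """True if service was restored within the RTO."""
    return restored_at - outage_start <= rto


# Illustrative values: a backup taken at 06:00 and a failure at 12:00.
last_backup = datetime(2024, 1, 10, 6, 0)
failure_time = datetime(2024, 1, 10, 12, 0)
restored_at = datetime(2024, 1, 10, 14, 30)

print(rpo_met(last_backup, failure_time, timedelta(hours=4)))  # six hours of data at risk
print(rto_met(failure_time, restored_at, timedelta(hours=4)))  # restored in 2.5 hours
```

In this example the RPO check fails and the RTO check passes, which would suggest backing up more often rather than restoring faster.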
Job scheduling: Coordination and orchestration of daily, weekly, and monthly job activities (both automated and manual) to ensure consistency, reliability, and sustainability of operations.
Capacity management: Strategic planning and management of system capacities (CPU, RAM, I/O, data storage, bandwidth, etc.) to ensure required performance metrics are met.
Log management:
- System and device log aggregation and analysis for historical and predictive purposes to detect the occurrence of jobs, anomalies, usage, and activity.
- Controls for restricting access and protecting the integrity of logs are critical for effective detection and response to a cyberattack.
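One way to protect log integrity is to make tampering detectable. The sketch below uses a simple hash chain, a technique chosen here for illustration and not prescribed by the handbook: each entry's digest depends on the previous digest, so altering any earlier entry breaks every later link. The function names and sample entries are hypothetical.

```python
import hashlib

GENESIS = "0" * 64  # fixed starting value for the chain


def chain_logs(entries):
    """Pair each log entry with a SHA-256 digest that covers the previous digest."""
    chained = []
    prev = GENESIS
    for entry in entries:
        digest = hashlib.sha256((prev + entry).encode("utf-8")).hexdigest()
        chained.append((entry, digest))
        prev = digest
    return chained


def verify_chain(chained):
    """Recompute every digest; return False at the first mismatch."""
    prev = GENESIS
    for entry, digest in chained:
        expected = hashlib.sha256((prev + entry).encode("utf-8")).hexdigest()
        if digest != expected:
            return False
        prev = digest
    return True


records = chain_logs(["login admin", "config change fw01", "logout admin"])
print(verify_chain(records))  # untouched chain verifies

tampered = list(records)
tampered[1] = ("config change removed", records[1][1])
print(verify_chain(tampered))  # altered entry is detected
```

A chain like this only helps if the most recent digest is kept somewhere an attacker cannot rewrite, such as a separate, access-restricted system.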
Retention and disposal of data:
- Policies, standards, and procedures must be established for the proper retention and disposal of all data media.
- An audit process must be performed on a regular basis to ensure compliance.
IT systems controls are prominent in most IT regulatory standards, guidelines, and best practices, and are at the heart of IT and cybersecurity certification programs. A weakness in any one area can have a direct impact on detecting and responding to an attack. It is also the area that is perhaps most distributed across multiple departments and teams. Cross-team communication and planning are essential to consistency and ongoing improvement.
Service and support
Service and support are designed to equip, enable, and support the enterprise’s human resources in the execution of processes and procedures required during business function activities.
Oversight and management of an enterprise’s activities and resources to ensure compliance with service level agreements (SLAs), operational level agreements (OLAs), and contractual provisions.
Established key performance indicators are used to measure success and detect discrepancies.
Internal IT support for staff performing activities that either directly or indirectly support the enterprise’s line of business.
Support workflow process for tracking reported issues through remediation.
Event, incident, and problem management
- Problem/issue escalation process for timely response through to resolution.
- Root cause and post-mortem gap analysis reporting to prevent recurrence and enable continuous improvement.
Service and support is the most tangible example of where IT touches and interacts with business processes across the organization. The success of service management is normally gauged by the overall quality of service and the proactive identification and resolution of problems. Service management activities should be designed with an emphasis on preventing issues and ensuring continuous reliability and resilience where possible.
Ongoing monitoring and evaluation
This category is meant to maintain an established baseline level of performance across the enterprise.
- Monitoring and reporting the effectiveness of established controls to senior management and other stakeholders.
- Compare service level benchmarks with actual service performance to identify areas for improvement.
- Regular audit and review of policies, plans, and procedures to ensure alignment with established goals and objectives.
- Perform a risk-based controls self-assessment as part of the ERM process.
- Communicate a strategy and invest in training and resources for continuous improvement.
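Comparing service level benchmarks with actual performance can be as simple as a side-by-side check. The following is a minimal sketch under stated assumptions: every metric is a percentage where higher is better, and the metric names, targets, and measured values are invented for illustration.

```python
def sla_gaps(targets, actuals):
    """Return {metric: (target, actual)} for metrics below target or not measured."""
    gaps = {}
    for metric, target in targets.items():
        actual = actuals.get(metric)  # None means no measurement was reported
        if actual is None or actual < target:
            gaps[metric] = (target, actual)
    return gaps


# Hypothetical benchmarks and one month of measured results.
targets = {"uptime_pct": 99.9, "tickets_resolved_in_sla_pct": 95.0}
actuals = {"uptime_pct": 99.95, "tickets_resolved_in_sla_pct": 91.2}

for metric, (target, actual) in sla_gaps(targets, actuals).items():
    print(f"{metric}: target {target}, actual {actual}")
```

Treating a missing measurement as a gap is a deliberate choice in this sketch, since a metric nobody reports cannot be shown to meet its benchmark.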
Ongoing monitoring and evaluation are the bookends to successful data center operations. Senior management sets the desired performance targets and risk appetite and funds the programs created to carry them out. Measuring performance and compliance to identify where corrections are necessary is an ongoing process, best performed by an independent source. With data centers now more integrated with third and fourth parties, getting accurate measurements and discovering root causes becomes more difficult and time-consuming. This introduces yet another area of control and a topic for a future article: vendor management and due diligence.
Summary and emphasis on continuous improvement
Continuous improvement is an ongoing effort to advance the enterprise’s business objectives. The reliance on technology involves continuous innovation and the need for updated and more advanced systems. Improvements in data center operations can be gradual and incremental or implemented all at once when conditions warrant. That said, continuous improvement relies heavily on management and personnel working together, with a common vision and a process in place to recommend and implement changes.
In a nutshell, that’s a view into what it takes to run a data center today from an operations perspective. Watch for future articles where we will discuss the Architecture (design) and Infrastructure (build and maintain) perspectives. Also please be sure to review prior articles on Business Continuity and Resilience at the data center from CUSO Magazine.
If you have any questions about data center operations at your organization, I can be reached at my contact information below.
*Source: FFIEC IT Examination Handbook, “Architecture, Infrastructure, and Operations,” June 2021.