Today, with IT systems and automation constituting the lifeline of all major business and government operations, putting in place a Business Continuity Planning (BCP) strategy with a focus on Disaster Recovery Management (DRM) is not merely advisable, it is crucial. The important aspects to address are the criticality of DRM, the importance and availability of experts, costs, and strategies to deliver adequate service levels post disaster.
The telecommunications infrastructure for Disaster Recovery (DR) requires appropriate products and processes, people with knowledge and expertise, and high availability of the main data centre. “Transferring operations to a Disaster Recovery Centre (DRC) is a business decision and not a technology decision,” says H. Krishnamurthy, Principal Research Scientist, Indian Institute of Science, Bangalore. “The important starting point is good business impact analysis (BIA), a driver for performance impact analysis (PIA) which leads to making disaster recovery and backup plans for technology changeover optimal and cost effective.” DRM needs to ensure that there is no data loss and that there is service level continuity with an alternate site in place as well as comprehensive IT audits. A DRC with synchronised replication ensures that almost all the data and transactions can be recovered. Data protection and transaction integrity are of paramount importance and key to driving the business.
According to Chandra Sekhar Pulamarasetti, Founder and CEO of Sanovi, a provider of IT disaster recovery and service continuity management solutions, “natural disasters have a high impact but occur infrequently. Virus attacks, application failures, human errors, power outages and cyber issues account for 93 per cent of IT failures.”
Chandra explains how the DR plan integrates with the BCP. The first aspect is the recovery point and recovery time in case of an outage, that is, how much data an organisation can tolerate losing and how soon it needs to restore service levels. The second is application dependencies, where the DR plan takes inputs from the BCP. DR deployment involves setting up a DR data centre that matches the primary production data centre, and working from both sites together. Solution providers face three key issues: visibility in terms of how much data can be lost and how quickly the recovery process can be done; high downtime as DR process activities are manual and time-consuming; and deployment and operational costs. “While today DR is in place for about 15 per cent of an organisation’s applications, the figure is expected to go up to 35 per cent in the next three to four years,” indicates Chandra.
A recent report by Gartner, an IT research and advisory firm, says that IT DRM continues to be labour-intensive, which is a significant barrier to scaling DR programmes. While DR is manageable for up to about five applications, some organisations are looking at 50 applications, or even up to 200. Multiple heterogeneous databases require qualified people in each of those environments, so manual operation processes come back to expert dependency. The report asks whether your expert will be available on time when an outage occurs. However, emerging fail-over automation tools have the potential to broaden availability to end-to-end IT service.
An increasing number of organisations are implementing IT DRM modernisation, which offers a service infrastructure that is equally effective for individual applications, and a move from just recovery to service continuity. DRM requires a holistic approach: a Service Level Agreement (SLA) that defines the minimum performance criteria, commitment to recovery time frames, regular and frequent drills, a central framework to ensure that heterogeneous technologies are covered, assurance that every backup process is covered, best practices for easier maintainability, and regular monitoring of SLAs. This approach reduces business outages, increases operational efficiencies and ensures transparency. When there is an outage, reporting to all stakeholders is important to optimise the DR process.
According to Sangeeta Chugh, Director, Department of Telecommunications (DoT), the new National Telecom Policy places emphasis on disaster management (DM), and the department's recommendations for the 12th Five Year Plan include a budget of Rs 500 million for the telecom sector. The DoT is working on a DM programme in coordination with the National Disaster Management Authority (NDMA) and telecom service providers, which includes alerting citizens in case of a disaster.
Bombay Stock Exchange (BSE) does regular drills and has been conducting live intra-day trading twice per quarter, informs its GM-IT, Dilip Oak. The exchange has a smaller window than banks but the criticality and the volumes are huge with more than 15,000 orders per second. To meet this challenge, the BSE has good tools in place to replicate the data from the primary site to the DR site, and the latency, barring the first few minutes, is less than a few milliseconds. Thanks to the continuously running operations at the DR site, their near site has almost zero data loss. Mahindra & Mahindra Financial Services, which is into insurance, rural housing, personal loans and other products, has the same experience. Their accounts, HR, legal, IT and all other components are available across the company 24×7 from their centralised storage, with frequent DR drills in place depending on business demand and compliance.
ABM Knowledgeware works closely with state governments and urban local bodies, offering applications implementation and support. According to Govind Chauhan, Vice President of the company, IT DR is much more mature in the financial sector and governmental mission-mode projects. The IT ministry has come up with some standards and guidelines but these are moving very slowly in most state governments and the lower structure of the government bodies. “DM and DR are discussed more either as compliance or as an after-thought,” says Chauhan.
G B Bhuyan, Deputy General Manager-IT, Bank of Baroda, which has a presence in nearly 24 countries, talks about the challenges of shifting over to a disaster site and the bigger problem of shifting back to the primary site. “In a big bank like ours with state-of-the-art IT, when we did a DC migration we used the DR site very effectively for a smooth transition.” He credits the extensive planning, implementation and cooperation of the team for moving the replicated applications from the DC to the DR without any disruption in service. Since 2005, when Bank of Baroda opted for technology-enabled business transformation, the volumes have become massive. Yet, when their DC collapsed in 2008 it took less than three hours to shift to DR and business was not affected. The bank will soon have a “near centre site” to ensure almost zero data loss.
Sudhir Rao, CTO of HP Technology Services, HP India, points to the importance of risk, agility, service quality and cost management. “Here, IT managers and CIOs are very happy with their DR but feel BCP is inadequate. In several countries in Europe, there is little pessimism about BCP.” With good networks, could administrators handle the DR and the DC from some other location? While this is possible 90 per cent of the time, says Rao, a number of times when DR happens there could be uncertainties that could only be handled at the site, such as applying the redo-log differently, the repercussions, understanding the reality and in-depth analysis.
HP has introduced enterprise cloud services in Bangalore that can intertwine with a customer’s BCP. This “recovery to service” is being done very effectively in some parts of the world. Cloud services offer a host of benefits including DR. Perhaps not for very large applications, but for the 30 to 40 per cent of applications in an enterprise that are not mission critical, cloud would work well. If a few hours are available as the recovery time objective, “managed cloud” could work as an alternative way to provide DR instead of investing in infrastructure and other costs. But this would depend on the requirements of a particular business. Krishnamurthy comments that unless cloud offers performance assurance, high availability, architectural security and cost savings as benefits, only allied applications and services, and not mission critical applications, will move into cloud.
It is important to note that BCP-DR requires long-term investment commitment, to install, constantly upgrade, and invest in storage, security and more. Failures at the primary site can also result from overshooting a threshold or from manual processing errors – situations where one cannot move to a DR site. There are solutions available which need to be customised to a business’s environment. It is crucial to look into every aspect of the BCP – from the security guard to the man operating the generator to the back office processes. Investing only in technology will not solve problems in cases of disaster.
“From being a predominantly software company, over the last three to four years we have realised the need to leverage the external environment, that is, the Internet, and offer large scale cloud-based infrastructure on a fee basis,” says Srikanth Karnakota of Microsoft Corporation (India). The company has a data centre in the US that is the size of sixteen football fields manned by just ten people. The entire service management is spread over 170 countries, and run out of Hyderabad with DR centres in Chicago, Ireland, Singapore and Hong Kong. Bing, created as a mission critical server, runs off 700,000 physical servers to offer instant access through the Internet browser.
Dnyanesh Nerurkar, EVP, National Securities Depository Ltd, attributes its data centre’s high availability to having all the servers in a loop, which means that there is no single point of failure – if a machine goes down, another takes over the operations. They also have an identical DR site. Their RPO is one minute and RTO around 90 minutes. Every quarter, they shift to the DR site, operate from there for at least a week and then shift back to the primary site. When their average CPU utilisation exceeds 40 per cent, the infrastructure is upgraded. And 90 per cent of important business transactions are served within one second.
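To make the RPO and RTO figures above concrete, here is a minimal Python sketch of how a drill might be checked against such objectives. The one-minute RPO and 90-minute RTO come from the paragraph above; the `DrillResult` structure, the `meets_objectives` function and the drill timestamps are hypothetical, invented purely for illustration, and do not describe any organisation's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during one DR drill (hypothetical structure)."""
    last_replicated: datetime   # last transaction confirmed at the DR site
    outage_start: datetime      # moment the primary site went down
    service_restored: datetime  # moment the DR site began serving users


def meets_objectives(drill: DrillResult, rpo: timedelta, rto: timedelta) -> dict:
    """Compare the observed data-loss window and downtime with RPO and RTO."""
    # Data written after the last replication and before the outage is lost.
    data_loss_window = drill.outage_start - drill.last_replicated
    # Downtime runs from the outage until service resumes at the DR site.
    downtime = drill.service_restored - drill.outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Objectives quoted in the article: RPO of one minute, RTO of about 90 minutes.
RPO = timedelta(minutes=1)
RTO = timedelta(minutes=90)

# Made-up drill timings, for illustration only.
drill = DrillResult(
    last_replicated=datetime(2024, 1, 15, 10, 59, 30),
    outage_start=datetime(2024, 1, 15, 11, 0, 0),
    service_restored=datetime(2024, 1, 15, 12, 15, 0),
)
report = meets_objectives(drill, RPO, RTO)
print(report)
```

In this made-up drill the data-loss window is 30 seconds and the downtime is 75 minutes, so both objectives are met. A real check would read these timestamps from replication and monitoring logs rather than hard-coding them.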
Clearly, transferring operations to a DRC is a business decision and not a technology decision. Data and transaction integrity are of paramount importance, key to driving the business, and cannot be compromised. Indian corporates, banking and financial service players, and governments alike understand and stress the need to safeguard data and keep business continuity planning right upfront on the drawing board.
BCP – Key to Uninterrupted Business Operations
- Ministry of Communications and IT needs to speed up finalisation of guidelines and standards for disaster recovery for all state governments and ensure compliance.
- DR and BCP require not only long-term investment commitment but also investment in people, training and skill upgradation.
- It is critical for each organisation to understand its business imperatives and then take a business decision for a suitable DR solution. One DR solution does not fit all – appropriate care must be taken to customise.
- Investing purely in IT is not enough. Organisations must put processes in place that are backed up with strict compliance.
- DR drills should be made compulsory and periodic, at intervals of no more than a quarter. They should also be carried out at peak load rather than off-peak. This will ensure effective recovery to service continuity should a disaster strike.
- DR automation is required to meet stringent RPO/RTOs and to eliminate human error. Organisations that use automation find it beneficial and consistently meet their DR drill goals.
- As data loss tolerances get lower and lower, zero data loss can be achieved with a three-site DR solution.
- DR SLA visibility is important for managing reliable DR programmes. DR health monitoring tools are required to find out if the DR solution is meeting its desired objectives and goals.
- There is a need to ensure that the application software architecture in both the DC and DR sites is identical at all times. This way, in times of disaster, the recovery will be immediate and there will be no outages in service delivery.
- If possible, the DR centre should be located in a different seismic zone.
- The applications should not only be tested for their functionality and features, they should also be tested for performance, fault tolerance and security.
- The disaster recovery team should consist of people from business operations, administration and facilities. There should also be a checklist for each individual, team and department.
- There should be separate DCs for national level networks such as the National Knowledge Network (NKN), State Wide Area Networks (SWAN) and district level networks. This is important, as the applications running at each of these three levels would be different.