It’s been reported in a few places that yesterday Barclays (UK bank) suffered an issue with a “disc array” (presumably they mean disk array) that took out their ATM and online banking systems.  See the comments here and here.

Allegedly, Barclays now use USP-V arrays as their back-end storage devices, so presumably HDS USP-Vs were involved in yesterday’s problems.  Systems seem to have been down for a number of hours before normal service was resumed.

The first thing to say is that “stuff” happens.  Hardware fails – arrays fail and it’s the same for all vendors.  No vendor can ever claim that their hardware doesn’t fail once in a while.  We all know that RAID is not infallible; in fact, it isn’t even necessary to have a hardware failure to experience a service outage, as many problems are caused by human error.

What surprises me with this story is the time Barclays appeared to take to recover from the original incident.  If a storage array is supporting a number of critical applications including online banking and ATMs, then surely a high degree of resilience has been built in that caters for more than just simple hardware failures?  Surely the data and servers supporting ATMs and the web are replicated (in real time) with automated clustered failover or similar technology?

We shouldn’t be focusing here on the technology that failed.  We should be focusing on the process, design and support of the environment that wasn’t able to manage the hardware failure and “re-route” around the problem.  
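
To make the “re-route” idea concrete, here is a minimal sketch in Python of health-check based failover between a primary array and its replica. It says nothing about how Barclays or HDS actually implement this; the names `StorageTarget`, `choose_target`, `primary-array` and `replica-array` are all hypothetical, and the health probes are simulated with lambdas.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class StorageTarget:
    """A hypothetical storage endpoint, e.g. a primary array or its replica."""
    name: str
    is_healthy: Callable[[], bool]  # health probe, assumed to be supplied by the caller


def choose_target(targets: Sequence[StorageTarget]) -> Optional[StorageTarget]:
    """Return the first healthy target in priority order, or None if all are down."""
    for target in targets:
        try:
            if target.is_healthy():
                return target
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            continue
    return None


# Simulated scenario: the primary array has failed, the replica is still up.
primary = StorageTarget("primary-array", is_healthy=lambda: False)
replica = StorageTarget("replica-array", is_healthy=lambda: True)

active = choose_target([primary, replica])
print(active.name if active else "no healthy target")
```

The selection logic is the easy part. The hard part, and the point of this post, is the process around it: keeping the replica consistent, testing the failover regularly, and knowing who makes the call when the automation doesn’t.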

One other thought.  I wonder if this problem would have been avoided with a bit of Hitachi HAM?

8 Responses to Enterprise Computing: Barclays Bank Services Down Due to Storage Array Problems

  1. Chris, I agree with you. The most important thing is not the hardware, which may or may not fail, but designing the right process and testing it whenever necessary!
    ciao,
    Enrico

  2. Rob says:

    “We shouldn’t be focusing here on the technology that failed. We should be focusing on the process, design and support of the environment that wasn’t able to manage the hardware failure and “re-route” around the problem.”

    Right. Along those lines, did the architects miss a single
    point of failure for a very important design?

    A design doc surely shows it to be a single point of failure.
    Or: “We didn’t know it to be a single point of failure!”
    “Who knew what, and when? Who signed off on it?”

    Either way, someone is called on the carpet somewhere. Not good.

  3. Locutus says:

    As is often the case, DR (not to be confused with HA) is looked upon as an IT expense with low ROI. Therefore, the DR infrastructure is built not so much as a usable environment but rather as a box-ticking exercise to satisfy an audit requirement.
    You can preach the merits of the golden copy.
    You can preach the importance of like-for-like storage footprints at the source and target to preserve the performance of the applications post DR.
    You can preach the necessity of maintaining consistency to guarantee a recoverable copy in the event of a rolling disaster.
    You can preach ad nauseam.

    In the end, it comes down to money vs. risk, and oftentimes customers err on the side of money because no one expects to fall into the .0001% availability hole in enterprise storage. They hope they’ll retire before the perfect storm hits.

    I can’t say that is the case at Barclays. But it has always been a sore point with me when I discuss HA/DR with my customers.

  4. Gar says:

    Yep, from reading the article on Barclays, I agree with you Chris. I’ve witnessed outages caused by H/W but exacerbated by failings in the support system, a lack of true understanding of operating in a DR mode, and poor design. It would be interesting to know if they had executed a full Disaster Recovery test recently.
    Regards,
    Gar.

  5. It is also a reminder that no matter how resilient a single array is designed to be, the frame and the datacenter it is housed in are in themselves single points of failure. The only way to mitigate that is to have a well thought out business continuity strategy that includes sound processes and technology such as mirroring of arrays.

    http://storage-sense.blogspot.com

  6. It is also a reminder that no matter how resilient you design an array to be, the actual physical frame, the firmware in the array, and the datacenter the array is housed in are themselves single points of failure.

    What is needed is a well thought out Business Continuity strategy that includes sound processes, personnel management, and useful technology such as array mirroring.

    http://storage-sense.blogspot.com

  7. [...] storage consultant Chris M. Evans, who has worked with HDS products and customers, came to the vendor’s defense. He pointed the [...]

  8. Chris Evans says:

    Sounds like everyone commenting is of a similar mind – don’t blame the technology per se, look at the process. Thanks for all the comments.

    Chris
