From schiano at uci.edu Fri Sep 8 10:23:08 2017
From: schiano at uci.edu (Allen Schiano)
Date: Fri, 8 Sep 2017 10:23:08 -0700
Subject: [HPC-Users] Recent HPC Downtime
Message-ID: <9b7378f5-da52-6cc0-d37e-2c8189b9a403@uci.edu>

HPC Users,

On the weekend of August 26 and 27 we brought down the HPC cluster and related systems to perform critically needed maintenance and upgrades to the physical infrastructure and to the operating system software and systems. We projected a two-day outage, but the system was not *mostly* operational until the following Monday. Some major issues continued until Tuesday.

We made major changes to the operating system of the cluster, the distributed file system that most of you use, and the physical placement and connectivity of the system.

As of today, we still have some nagging issues to resolve, but mostly we are determining what additional work we would still like to do on parts of the system that have grown 'organically' over the years. The addition of new staff to the group, highly qualified and experienced though they are, has also shown us that we need to modernize and better document several key components of the system.

At this point, we do not expect any more downtime for the system until possibly the winter break, when we traditionally take advantage of lower user demand to make some needed upgrades.

The one effect of this event on users is that we will be slower in meeting your upgrade needs for the next two weeks or so. We will continue to log your requests so that we can get to them as soon as possible.

We are also adding a few more nodes that we promised users, especially the new 128-core AMD nodes that have recently come to market, and we are expanding the available HPC storage space.

Thank you for bearing with us through the effects of this needed work, and we apologize that it took longer than expected. The system is better now than before, and we have a larger number of experienced hands operating it and backing each other up.

Please send me any concerns you might have about the HPC service or the recent outage. We appreciate your many years of trust in our efforts to bring you a powerful and reliable research service.

Dr. Allen V. R. Schiano
Interim Director, Research Cyber Infrastructure Center
schiano at uci.edu
949 824-2829