- 00:00Okay, so welcome to our final presentation about the
- 00:06mainframe, data center trends, and the future.
- 00:10The title of this presentation is "Beyond the Mainframe - Data Center Trends".
- 00:15First we want to look back and identify some trends. If
- 00:19you look at this old classification, you see the SISD, SIMD,
- 00:24MISD, and MIMD systems classification by Flynn from 1966. Basically,
- 00:31we distinguish between uniprocessors, the classical normal computers,
- 00:36and vector and array processors, which are
- 00:40the older parallel computers. But today we have the entire world of multiprocessor and distributed systems,
- 00:46and we have pipelining as a concept that we see in all
- 00:50the instruction sets today as well. So basically
- 00:54we can notice that all the systems that we
- 00:57see today are either SIMD or MIMD systems.
- 01:01And if you look at MIMD systems, then you see similarities to standard
- 01:07multicore and manycore systems. Today you may have shared memory
- 01:11with close coupling,
- 01:13you may have shared disks with RAID systems, or you may have
- 01:19so-called shared-nothing systems. So most systems today are
- 01:23still closely coupled, but there is a clear trend
- 01:26towards distributed systems with looser
- 01:31coupling. The mainframe belongs in the category of closely coupled systems.
- 01:37There's another aspect, and this is often called the NUMA
- 01:42problem, the non-uniform memory access problem: systems
- 01:47are built out of components, and while for a long time the hardware kept up the
- 01:52impression that the components are all
- 01:56ordered and accessed in a quite regular fashion,
- 02:01this abstraction is no longer maintained by today's hardware.
- 02:05So we have NUMA characteristics, which means that
- 02:09some parts of the memory are
- 02:12accessible more quickly from a processor than other parts of the memory
- 02:18that are closer to other processors.
- 02:21So memory and processing units are getting co-located,
- 02:25and this has implications for the speed of execution if you do not organize your
- 02:33programs correctly. One more
- 02:36look into the history and into the textbook, basically:
- 02:42the Sequent Symmetry, which was a classical example of a system
- 02:46that is well balanced, that is symmetric in the sense that all
- 02:51the processors have the same access path to memory. You have a bus infrastructure
- 02:57and so forth. So a well-balanced system,
- 03:00a classical MIMD computer.
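The NUMA effect described a moment ago can be illustrated with a toy cost model. This is only a sketch; the latency figures are made-up assumptions, not measurements of any real machine:

```python
# Toy NUMA cost model: the average memory access latency depends on how much
# of a thread's data lives on its own node. The latencies below (nanoseconds)
# are illustrative assumptions, not measurements.

LOCAL_NS = 80    # assumed latency to the local node's memory
REMOTE_NS = 140  # assumed latency to a remote node's memory

def avg_access_latency(local_fraction: float) -> float:
    """Average latency when `local_fraction` of accesses hit local memory."""
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS

# A program whose data placement ignores NUMA (half of its accesses remote)
# pays a noticeable penalty compared to one that co-locates data and threads.
careless = avg_access_latency(0.5)   # 110.0 ns on average
tuned = avg_access_latency(0.95)     # 83.0 ns on average
print(f"careless placement: {careless:.1f} ns, tuned placement: {tuned:.1f} ns")
```

Even this crude model shows why operating systems and runtimes try to keep a thread's data on its own node.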
- 03:03and then we see another example that became quite
- 03:08famous, which is the Intel Paragon. It was not terribly successful
- 03:12in the commercial sense; however, it was a milestone in the sense that it
- 03:16consisted of many nodes, and each node was running its own operating system
- 03:21instance. In that case it was the Mach operating system. But
- 03:26this is an indication that systems are getting closer to clusters
- 03:31or distributed systems; in today's world they are
- 03:36closer to clusters than to classical
- 03:40SIMD or MIMD systems. And there's actually an entire trend,
- 03:43which is called rack-scale computing, which indicates that future
- 03:47architectures are going to be built in that fashion.
- 03:53So we are talking about scaling, and scaling happens in scale-
- 03:58out scenarios, basically addressing more than one computer, addressing an entire
- 04:03rack. But scaling also happens in scale-up scenarios, where computers are getting
- 04:09more compact, more dense. If you look at standard architectures,
- 04:14multi-core and multi-threading are key features, and we
- 04:18look at standard operating systems with virtualization, trustworthiness and security, and clustering
- 04:24as key attributes. It turns out these systems
- 04:31need new programming models; we look at software architecture,
- 04:36we look at services. So
- 04:39there's a trend that architecture principles that have been
- 04:44exercised in the mainframe world for a long time arrive at standard
- 04:49architectures, at commodity servers, today, and need to be addressed
- 04:53by the software as well.
- 04:55And then there is one other problem that is
- 04:59well known to the mainframe people, but also largely ignored
- 05:03by the mainframe people, because these systems have one objective,
- 05:07and that is zero downtime.
- 05:10If you look at other optimization criteria, then energy consumption comes
- 05:15into play. What we see here is a screenshot from
- 05:20the on-board administration console of a blade system, where we are able to
- 05:26put power caps on the
- 05:31power units,
- 05:34restricting the energy consumption of the entire computer. So
- 05:37this is actually an open research question: how should you structure your programs,
- 05:41how should you place your workload, so that the energy consumption
- 05:47is minimized?
- 05:50And along that line comes the term green IT, where we look at
- 05:55consolidation, and we try to increase the efficiency of program execution.
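Consolidation can be sketched as a bin-packing problem: pack workloads onto as few servers as possible so that idle machines can be powered off. The greedy first-fit-decreasing heuristic below is one simple approach; the workload demands are made-up example numbers (CPU demand in percent of one server):

```python
# Consolidation sketch: place workloads (CPU demand in percent of one server)
# onto as few servers as possible, so unused servers can be powered off.
# First-fit-decreasing is a simple greedy heuristic; the demands are examples.

def consolidate(demands, capacity=100):
    """Return a list of servers, each holding a list of workload demands."""
    servers = []
    for d in sorted(demands, reverse=True):  # biggest workloads first
        for s in servers:
            if sum(s) + d <= capacity:       # fits on an already-active server
                s.append(d)
                break
        else:
            servers.append([d])              # otherwise power on a new server
    return servers

workloads = [60, 50, 40, 30, 20]
placement = consolidate(workloads)
print(f"{len(placement)} active servers instead of {len(workloads)}")
```

Here five workloads fit onto two fully used servers, so three machines could be switched off, which is exactly the efficiency gain consolidation aims for.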
- 06:00And you can only optimize what you can measure. So the initial question is:
- 06:05how would you measure the energy consumption of your programs? If
- 06:08you look at your programming language, then you
- 06:12will figure out that there's no such keyword as power consumption or energy
- 06:17consumption in the language. So it has to be somewhere in between.
- 06:21It's a non-functional property, but it needs to be added to
- 06:24systems in the future.
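Since no language has an energy keyword, a first rough approach is to estimate it "in between", as assumed average power draw times measured runtime. The 95 W figure below is a made-up placeholder; real measurements would come from hardware facilities such as Intel's RAPL counters:

```python
import time

# Rough energy estimation "in between" the language and the hardware:
# energy = assumed average power draw x measured runtime.
# ASSUMED_POWER_WATTS is a made-up placeholder for a loaded server CPU;
# real numbers would come from hardware counters (e.g. Intel RAPL).

ASSUMED_POWER_WATTS = 95.0

def estimate_energy_joules(func, *args):
    """Run func and return (result, rough energy estimate in joules)."""
    start = time.perf_counter()
    result = func(*args)
    runtime_s = time.perf_counter() - start
    return result, runtime_s * ASSUMED_POWER_WATTS

result, joules = estimate_energy_joules(sum, range(1_000_000))
print(f"result={result}, estimated energy ~{joules:.3f} J")
```

The point of the sketch is the shape of the interface, a non-functional property attached to program execution from outside the language, not the accuracy of the constant.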
- 06:27Next aspect: going from zero downtime, which the mainframe stands for, to
- 06:33more commoditized server systems, we still talk about dependability.
- 06:41And dependability means that we have a certain trustworthiness
- 06:45that the computer system
- 06:48works reliably, and the question always is how we can deal with unexpected events,
- 06:54and these events impact the cost and performance of a
- 07:00system. We maybe need to replicate
- 07:02units, we need to take measures for fault prevention and fault removal.
- 07:10We need to look at the source code and
- 07:13put effort into software engineering, but it is also
- 07:18a question of system architecture.
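Why replication of units pays off for dependability can be seen in a one-line calculation: assuming independent failures, n replicas are all down only when every one of them fails. The 99% availability figure is purely illustrative:

```python
# Availability of a replicated unit, assuming independent failures:
# the replicated service is unavailable only if ALL n replicas fail.
# The 99% single-unit availability is an illustrative assumption.

def replicated_availability(a: float, n: int) -> float:
    """Availability of n independent replicas, each available with prob. a."""
    return 1.0 - (1.0 - a) ** n

for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each -> {replicated_availability(0.99, n):.6f}")
```

Each added replica buys roughly two more "nines", which is why replication is the standard architectural answer to unexpected events, at the price of extra hardware.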
- 07:22And there are examples like the left-hand side, where we have
- 07:26to have a solution; we look at mission-critical systems, and
- 07:31when it comes to human life, no price is too high. This is the
- 07:33place for the mainframe, and it will still
- 07:36be the place for the mainframe for a long, long time. But
- 07:40then we also have large-scale clusters and distributed systems,
- 07:43and these manycore servers
- 07:45we want to optimize for both performance and cost, but also,
- 07:50at least to some extent, for dependability. So the question
- 07:54is basically: what is the unit of failure?
- 08:00Another aspect, if you look at commodity servers, is the
- 08:03introduction of accelerators, of additional functional units
- 08:07that are no longer CPUs but maybe GPUs or FPGAs, that bring additional
- 08:13compute capacity to the table but somehow need to be
- 08:18added into the infrastructure. And this can be done via
- 08:22interfaces like the PCI Express bus, for instance, or it can be done
- 08:27in a cache-coherent manner. So the seamless integration of accelerators
- 08:33is another trend that we see in the data center.
- 08:37And the mainframe is actually not a very good example for integrating
- 08:41accelerators,
- 08:44if it comes to additional functional units. When it comes to
- 08:49enriching the instruction set and building additional functionality onto a processor,
- 08:54then the mainframe is the prime example, for implementing Java
- 08:58application processors, for implementing crypto units, and for having
- 09:02extensions to the instruction set. So the message here is: while the
- 09:05mainframe tries to integrate everything into a single box,
- 09:09the trend in commodity servers is rather that we keep adding
- 09:14additional functional units.
- 09:17And this goes quite a way. What we see here is a screenshot
- 09:20of the structure of a Silicon Graphics system that we actually
- 09:24have in the building, where we see that, besides the QuickPath Interconnect, which is
- 09:31the prime link,
- 09:34structures of four
- 09:37sockets, basically blades, are connected to a denser interconnection network with a
- 09:47directory-based protocol in place,
- 09:51putting third-party interconnect
- 09:54technologies in place. So rack-scale computing: a future trend in the data center.
- 10:00And besides the mainframe
- 10:05there's another system that is worth mentioning, which is IBM Power.
- 10:10It is also a little exotic in comparison to the ubiquitous Intel architecture;
- 10:15however, it is a system that inherits many attributes from
- 10:18the mainframe in terms of high availability, maintainability, and serviceability.
- 10:25And these are also great scale-up systems, like the one depicted here
- 10:30on the screen.
- 10:34The Power architecture is interesting because it has a special approach towards integrating
- 10:39accelerators with the computer. As I mentioned, nothing gets
- 10:44close to the mainframe, where we just keep integrating
- 10:49additional capabilities into the instruction set. But with Power there is CAPI, and CAPI is the
- 10:54Coherent Accelerator Processor Interface. The very idea is that we
- 10:58have a cache-coherent integration of accelerator units like GPUs or FPGAs
- 11:03with the processor.
- 11:05And then you figure that problems occur
- 11:10in software more and more,
- 11:13and that we have additional fault classes, and that we have less error containment
- 11:18when it comes to software.
- 11:20If you look from an HPC, High-Performance Computing, perspective,
- 11:24it might even be as bad that an entire system, an entire computation,
- 11:29breaks down if a single node breaks.
- 11:33In more throughput-oriented workloads,
- 11:37we oftentimes have a little better error containment, but typically,
- 11:43if a system fails, quite a significant amount of the system
- 11:47is going to fail. So the traditional hardware models need an update.
- 11:53We have memory with increased density and higher data rates.
- 11:57We use many cores instead of a simple
- 12:01monolithic processor.
- 12:04We have the interconnect, which actually might be
- 12:07a source of contention and a source of problems, which is
- 12:11crucial when it comes to fault isolation.
- 12:14If you look at reactive fault tolerance,
- 12:17that might become inappropriate. This is also an area where we can learn
- 12:21from the mainframe operating systems: if you look, for instance, at z/OS,
- 12:25you have predictive
- 12:28fault tolerance measures.
- 12:31Also, systems do not really scale very well.
- 12:36We have additional software layers, like the virtualization
- 12:38layer, which has been there for a long time in the mainframe world
- 12:42but is relatively new to the table in the classical Intel-based data center.
- 12:48And we have relatively weak tool sets when it comes to reliability research.
- 12:57Here at the Hasso Plattner Institut, we have the FutureSOC Lab, which
- 13:00is a collaboration with industry partners
- 13:04for looking at next-generation
- 13:07x86 and x64 hardware. It
- 13:11hosts active research work where we want to understand new fault classes,
- 13:15where we want to do prediction, and where we want to look at
- 13:20proactive virtual machine migration.
- 13:23This is actually a lab that hosts many of these research projects,
- 13:28not only from our institute; it's open to the entire world, basically.
- 13:36And I just want to mention a few details from one project, which looks at
- 13:42failure prediction on several layers. On the CPU level, we want
- 13:46to look at online hardware failure prediction, which means performance counters.
- 13:51These are available in all the major architectures.
- 13:55You can also look at performance counters when it comes to memory,
- 13:58and this is actually an interesting incident here, where we see that
- 14:03a memory module failure was predicted, a
- 14:08couple of years ago already,
- 14:10basically twenty minutes
- 14:13before the actual failure happened. And talking about problem size:
- 14:20this unit was
- 14:23sixteen gigabytes of memory, so it's not just a few memory
- 14:28cells or a few bits, but a huge amount of memory that fails.
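The basic shape of such online failure prediction can be sketched in a few lines: watch a stream of correctable-error counts, as exposed through performance counters, and raise a warning once the rate in a sliding window exceeds a threshold. This is a generic illustration, not the project's actual predictor; window size and threshold are made-up tuning parameters:

```python
from collections import deque

# Sketch of online failure prediction from hardware counters: correctable
# errors often climb before a hard failure, so a rising rate in a sliding
# window is treated as an early warning. Window and threshold are made up.

class ErrorRatePredictor:
    def __init__(self, window: int = 5, threshold: int = 10):
        self.samples = deque(maxlen=window)  # recent per-interval error counts
        self.threshold = threshold

    def observe(self, correctable_errors: int) -> bool:
        """Record one sample; return True if a failure is predicted."""
        self.samples.append(correctable_errors)
        return sum(self.samples) > self.threshold

predictor = ErrorRatePredictor()
stream = [0, 0, 1, 0, 2, 5, 9]  # error counts climbing before a failure
alerts = [predictor.observe(n) for n in stream]
print(alerts)  # the final sample trips the alarm before a hard failure
```

Real systems use much richer statistics, but the principle is the same: predict, then act minutes before the module actually dies.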
- 14:33I mentioned z/OS as an example where we have Predictive Failure Analysis
- 14:38and Runtime Diagnostics built into the operating system.
- 14:42So operating systems need to be extended, and
- 14:46I mentioned Power, where we have AIX with a couple of these features,
- 14:49but we are also seeing these features showing up in
- 14:53the Linux operating system.
- 14:56And then there's an entire new field that receives much attention
- 15:00today, which is machine learning and data science, and
- 15:05the vision is that we could do fault prediction by applying machine
- 15:10learning to all the layers: to the hardware, where we have
- 15:13failure predictors, to the virtual machine monitor, to the operating system,
- 15:17but also to the application. That
- 15:20way we can factor in domain-specific knowledge. It would be pluggable,
- 15:24so that it's adaptable to different use cases and so forth.
- 15:27And the vision then is
- 15:29building a system health indicator as the outcome, as the result,
- 15:35of a multi-level failure prediction, and this system health indicator
- 15:39would trigger a virtual machine migration, so that we can migrate
- 15:44away from a server before problems manifest in the application.
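The multi-level idea can be sketched as follows: each layer contributes a health score, a weighted combination yields the system health indicator, and dropping below a threshold triggers migration. Weights, threshold, and scores are illustrative assumptions, not values from the project described here:

```python
# Sketch of a multi-level system health indicator: hardware, VMM, OS, and
# application each report a health score in [0, 1]; a weighted combination
# yields one indicator, and a low value triggers VM migration.
# All weights, thresholds, and scores are illustrative assumptions.

LAYER_WEIGHTS = {"hardware": 0.4, "vmm": 0.2, "os": 0.2, "application": 0.2}
MIGRATION_THRESHOLD = 0.7

def health_indicator(scores: dict) -> float:
    """Weighted combination of per-layer health scores."""
    return sum(LAYER_WEIGHTS[layer] * scores[layer] for layer in LAYER_WEIGHTS)

def should_migrate(scores: dict) -> bool:
    """Trigger migration before problems manifest in the application."""
    return health_indicator(scores) < MIGRATION_THRESHOLD

healthy = {"hardware": 0.95, "vmm": 1.0, "os": 0.9, "application": 1.0}
degraded = {"hardware": 0.3, "vmm": 0.9, "os": 0.8, "application": 0.9}
print(should_migrate(healthy), should_migrate(degraded))
```

The pluggability mentioned above would correspond to swapping the per-layer predictors that produce these scores, so domain-specific knowledge can be factored in per use case.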
- 15:55And with that,
- 15:59we can draw a couple of conclusions. We see that there are
- 16:02advances in server technology.
- 16:04We have an ever-growing number of CPU cores; systems are getting
- 16:08more and more dense. We have tremendous amounts of memory today,
- 16:14and there's non-volatile memory showing up on the horizon,
- 16:18which actually might bring new problems to the table, because
- 16:22the typical approach of solving a problem by just rebooting and redoing the thing
- 16:28will no longer work if you have non-volatile memory.
- 16:32Also, we notice that reliability becomes a key quality attribute.
- 16:36This is something we take away from the mainframe. However, with the
- 16:40high level of integration and the shrinking structure sizes, like 22 nanometers
- 16:44or 14 nanometers, or even 7 nanometers in the future,
- 16:49we will see multi-bit errors instead of just a single bit flip.
- 16:53So we need new measures for
- 16:56fault tolerance and error correction.
- 16:59And, as I suggested, fault prediction and dynamic reconfiguration, where
- 17:03we see examples in all the major systems,
- 17:06may provide new solutions. As for isolation against fault propagation:
- 17:12we know the LPAR as the ultimate means of isolation on the mainframe; in
- 17:17commodity servers this would correspond to, or would translate into, blades. So you just
- 17:21place your workload across several nodes.
- 17:24And then there are also new programming models showing up, like
- 17:28the microservice architecture, which became famous with Netflix
- 17:31and the so-called Chaos Monkey, which,
- 17:34while the system is operating, just
- 17:37kills service instances to show, and to see, and to be sure, that the system
- 17:42has the desired self-healing capabilities.
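The Chaos Monkey idea fits in a few lines: randomly kill instances of running services and rely on a supervisor to heal them, proving the self-healing property under fire. This is a toy sketch; service names, replica counts, and the healing rule are made up:

```python
import random

# Chaos Monkey sketch: while the system operates, randomly kill service
# instances; a supervisor restores the minimum replica count, demonstrating
# self-healing. Service names and replica counts are made-up examples.

def chaos_round(instances: dict, rng: random.Random) -> str:
    """Kill one random instance, then let the 'supervisor' heal the service."""
    victim = rng.choice(sorted(instances))
    instances[victim] -= 1            # the monkey strikes
    if instances[victim] < 2:         # self-healing: restore minimum replicas
        instances[victim] = 2
    return victim

services = {"frontend": 3, "checkout": 2, "catalog": 4}
rng = random.Random(42)               # seeded so the run is reproducible
for _ in range(10):
    chaos_round(services, rng)
print(services)                       # every service still has >= 2 replicas
```

The design point is that failures are injected deliberately and continuously, so a missing self-healing path shows up in testing rather than in production.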
- 17:45So in the end we see again that computer architecture drives
- 17:50changes in system software on several levels. It
- 17:54starts on the left, where we see that the servers have evolved
- 17:57to new form factors and higher density. We see advances in the
- 18:01operating system, like virtualization, trustworthiness and security, and clustering.
- 18:06We see new problems coming up with virtualization, by the way,
- 18:09like the so-called blue pill attack. How can you basically tell
- 18:14whether somebody has placed a virtualization layer between you and your hardware
- 18:18and makes changes to the operation of the hardware?
- 18:23We also see, and that started with cloud computing,
- 18:26that people don't want to buy big systems anymore; they want to
- 18:31follow the pay-as-you-go scheme. So we need to provide additional hardware
- 18:37and have a model for pricing,
- 18:42so that the customer pays only for what is actually being used.
- 18:47And then I mentioned already the GPUs and FPGAs.
- 18:51So we see an entire world of hybrid computing, where there is
- 18:54more than just the CPU as a processing model, and OpenCL is
- 18:59one of the new programming models, CUDA is another, and there's
- 19:04more to come. I also mentioned already CAPI and OpenCAPI,
- 19:08which got introduced with Power. The idea is that everything gets integrated
- 19:14in a homogeneous fashion again by implementing cache-coherent interfaces.
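Whether offloading to such an accelerator pays off is essentially a break-even calculation: the compute speedup must outweigh the cost of moving data over an interface like PCI Express. The sketch below uses illustrative numbers (16 GB/s is roughly a PCIe 3.0 x16 link; the speedup is assumed), which is also why coherent interfaces like CAPI matter, since they reduce exactly this transfer overhead:

```python
# Hybrid-computing decision sketch: offloading to a GPU/FPGA pays off only
# when the compute speedup outweighs the data transfer cost over the link.
# Link bandwidth and speedup are illustrative assumptions.

def offload_pays_off(cpu_time_s: float, speedup: float,
                     data_bytes: int, link_bytes_per_s: float = 16e9) -> bool:
    """Compare CPU time against accelerator time plus transfer time."""
    transfer_s = 2 * data_bytes / link_bytes_per_s   # data there and back
    accel_s = cpu_time_s / speedup + transfer_s
    return accel_s < cpu_time_s

# Large kernel: 2 s on the CPU, assumed 10x faster on the accelerator,
# moving 1 GiB of data: the transfer is cheap relative to the work saved.
print(offload_pays_off(2.0, 10.0, 2**30))
# Tiny kernel: 1 ms of CPU work cannot amortize moving 1 GiB of data.
print(offload_pays_off(0.001, 10.0, 2**30))
```

A cache-coherent attach shrinks (or removes) the explicit copy term, which moves the break-even point toward much smaller kernels.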
- 19:20So, an entire world of changes. Some of these changes are visible in the
- 19:26mainframe today already, but some of them are also driven by
- 19:33cost optimization goals, and they will not be visible
- 19:39in the mainframe that much, but more so in the commodity data center today.
- 19:45In the end, the mainframe still defines the gold standard for the data center.
- 19:51Thank you for your attention. That concludes the course.