- 00:00Okay, so welcome to our final presentation about the
- 00:06mainframe, data center trends, and the future.
- 00:10The title of this presentation is "Beyond the Mainframe - Data Center Trends".
- 00:15First we want to look back and identify some trends. If
- 00:19you look at this old classification, you see the SISD, SIMD,
- 00:24MISD, and MIMD systems classification by Flynn from 1966. Basically,
- 00:31we distinguish between uniprocessors, the classical normal computers,
- 00:36and vector and array processors, which are
- 00:40the older parallel computers. But today we have the entire world of multiprocessor and distributed systems,
- 00:46and we have pipelining as a concept that we see in all
- 00:50the instruction sets today as well. So basically
- 00:54we can notice that all the systems that we
- 00:57see today are either SIMD or MIMD systems.
- 01:01And if you look at MIMD systems, then you see similarities to standard
- 01:07multicore and manycore systems. Today you may have shared memory
- 01:11with close coupling,
- 01:13you may have shared disks with RAID systems, or you may have
- 01:19so-called shared-nothing systems. So most systems today are
- 01:23still closely coupled, but there is a clear trend
- 01:26towards distributed systems with looser
- 01:31coupling. The mainframe belongs in the category of closely coupled systems.
- 01:37There's another aspect, and this is often called the NUMA
- 01:42problem, the non-uniform memory access problem: systems
- 01:47are built out of components, and while for a long time the hardware kept up the
- 01:52impression that the components are all
- 01:56ordered and accessed in a quite regular fashion,
- 02:01this abstraction is no longer maintained by today's hardware.
- 02:05So we have NUMA characteristics, which means that
- 02:09some parts of the memory are
- 02:12accessible more quickly from a processor than other parts of the memory
- 02:18that are closer to other processors.
- 02:21So memory and processing units are getting co-located,
- 02:25and this has implications for the speed of execution if you do not organize your
- 02:33programs correctly. One more
- 02:36look into the history and into the textbook, basically:
- 02:42the Sequent Symmetry, which was a classical example of a system
- 02:46that is well balanced, that is symmetric in the sense that all
- 02:51the processors have the same access path to memory. You have a bus infrastructure
- 02:57and so forth. So a well-balanced system,
- 03:00a classical MIMD computer.
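The NUMA effect described a moment ago can be illustrated with a toy cost model. This is only a sketch; the latency figures are made-up assumptions, not measurements of any real machine:

```python
# Toy NUMA cost model: the average memory access latency depends on how much
# of a thread's data lives on its own node. The latencies below (nanoseconds)
# are illustrative assumptions, not measurements.

LOCAL_NS = 80    # assumed latency to the local node's memory
REMOTE_NS = 140  # assumed latency to a remote node's memory

def avg_access_latency(local_fraction: float) -> float:
    """Average latency when `local_fraction` of accesses hit local memory."""
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS

# A program whose data placement ignores NUMA (half of its accesses remote)
# pays a noticeable penalty compared to one that co-locates data and threads.
careless = avg_access_latency(0.5)   # 110.0 ns on average
tuned = avg_access_latency(0.95)     # 83.0 ns on average
print(f"careless placement: {careless:.1f} ns, tuned placement: {tuned:.1f} ns")
```

Even this crude model shows why operating systems and runtimes try to keep a thread's data on its own node.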
- 03:03and then we see another example that became quite
- 03:08famous, which is the Intel Paragon. It was not terribly successful
- 03:12in the commercial sense; however, it was a milestone in the sense that it
- 03:16consisted of many nodes, and each node was running its own operating system
- 03:21instance. In that case it was the Mach operating system. But
- 03:26this is an indication that systems are getting closer to clusters
- 03:31or distributed systems; in today's world they are
- 03:36closer to clusters than to classical
- 03:40SIMD or MIMD systems. And there's actually an entire trend,
- 03:43which is called rack-scale computing, which indicates that future
- 03:47architectures are going to be built in that fashion.
- 03:53So we are talking about scaling, and scaling happens in scale-
- 03:58out scenarios, basically addressing more than one computer, addressing an entire
- 04:03rack. But scaling also happens in scale-up scenarios, where computers are getting
- 04:09more compact, more dense. If you look at standard architectures,
- 04:14multi-core and multi-threading are key features, and we
- 04:18look at standard operating systems with virtualization, trustworthiness and security, and clustering
- 04:24as key attributes. It turns out these systems
- 04:31need new programming models; we look at software architecture,
- 04:36we look at services. So
- 04:39there's a trend that architecture principles that have been
- 04:44exercised in the mainframe world for a long time arrive at standard
- 04:49architectures, at commodity servers, today, and need to be addressed
- 04:53by the software as well.
- 04:55And then there is one other problem that is
- 04:59well known to the mainframe people, but also largely ignored
- 05:03by the mainframe people, because these systems have one objective,
- 05:07and that is zero downtime.
- 05:10If you look at other optimization criteria, then energy consumption comes
- 05:15into play. What we see here is a screenshot from
- 05:20the on-board administration console of a blade system, where we are able to
- 05:26put power caps on the
- 05:31power units,
- 05:34restricting the energy consumption of the entire computer. So
- 05:37this is actually an open research question: how should you structure your programs,
- 05:41how should you place your workload, so that the energy consumption
- 05:47is minimized?
- 05:50And along that line comes the term green IT, where we look at
- 05:55consolidation, and we try to increase the efficiency of program execution.
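Consolidation can be sketched as a bin-packing problem: pack workloads onto as few servers as possible so that idle machines can be powered off. The greedy first-fit-decreasing heuristic below is one simple approach; the workload demands are made-up example numbers (CPU demand in percent of one server):

```python
# Consolidation sketch: place workloads (CPU demand in percent of one server)
# onto as few servers as possible, so unused servers can be powered off.
# First-fit-decreasing is a simple greedy heuristic; the demands are examples.

def consolidate(demands, capacity=100):
    """Return a list of servers, each holding a list of workload demands."""
    servers = []
    for d in sorted(demands, reverse=True):  # biggest workloads first
        for s in servers:
            if sum(s) + d <= capacity:       # fits on an already-active server
                s.append(d)
                break
        else:
            servers.append([d])              # otherwise power on a new server
    return servers

workloads = [60, 50, 40, 30, 20]
placement = consolidate(workloads)
print(f"{len(placement)} active servers instead of {len(workloads)}")
```

Here five workloads fit onto two fully used servers, so three machines could be switched off, which is exactly the efficiency gain consolidation aims for.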
- 06:00And you can only optimize what you can measure. So the initial question is:
- 06:05how would you measure the energy consumption of your programs? If
- 06:08you look at your programming language, then you
- 06:12will figure out that there's no such keyword as power consumption or energy
- 06:17consumption in the language. So it has to be somewhere in between.
- 06:21It's a non-functional property, but it needs to be added to
- 06:24systems in the future.
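Since no language has an energy keyword, a first rough approach is to estimate it "in between", as assumed average power draw times measured runtime. The 95 W figure below is a made-up placeholder; real measurements would come from hardware facilities such as Intel's RAPL counters:

```python
import time

# Rough energy estimation "in between" the language and the hardware:
# energy = assumed average power draw x measured runtime.
# ASSUMED_POWER_WATTS is a made-up placeholder for a loaded server CPU;
# real numbers would come from hardware counters (e.g. Intel RAPL).

ASSUMED_POWER_WATTS = 95.0

def estimate_energy_joules(func, *args):
    """Run func and return (result, rough energy estimate in joules)."""
    start = time.perf_counter()
    result = func(*args)
    runtime_s = time.perf_counter() - start
    return result, runtime_s * ASSUMED_POWER_WATTS

result, joules = estimate_energy_joules(sum, range(1_000_000))
print(f"result={result}, estimated energy ~{joules:.3f} J")
```

The point of the sketch is the shape of the interface, a non-functional property attached to program execution from outside the language, not the accuracy of the constant.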
- 06:27Next aspect: going from zero downtime, which the mainframe stands for, to
- 06:33more commoditized server systems, we still talk about dependability.
- 06:41And dependability means that we have a certain trustworthiness
- 06:45that the computer system
- 06:48works reliably, and the question always is how we can deal with unexpected events,
- 06:54and these events impact the cost and performance of a
- 07:00system. We maybe need to replicate
- 07:02units, we need to take measures for fault prevention and fault removal.
- 07:10We need to look at the source code and
- 07:13put effort into software engineering, but it is also
- 07:18a question of system architecture.
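Why replication of units pays off for dependability can be seen in a one-line calculation: assuming independent failures, n replicas are all down only when every one of them fails. The 99% availability figure is purely illustrative:

```python
# Availability of a replicated unit, assuming independent failures:
# the replicated service is unavailable only if ALL n replicas fail.
# The 99% single-unit availability is an illustrative assumption.

def replicated_availability(a: float, n: int) -> float:
    """Availability of n independent replicas, each available with prob. a."""
    return 1.0 - (1.0 - a) ** n

for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each -> {replicated_availability(0.99, n):.6f}")
```

Each added replica buys roughly two more "nines", which is why replication is the standard architectural answer to unexpected events, at the price of extra hardware.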
- 07:22And there are examples like the left-hand side, where we have
- 07:26to have a solution; we look at mission-critical systems, and
- 07:31when it comes to human life, no price is too high. This is the
- 07:33place for the mainframe, and it will still
- 07:36be the place for the mainframe for a long, long time. But
- 07:40then we also have large-scale clusters and distributed systems,
- 07:43and these manycore servers
- 07:45we want to optimize for both performance and cost, but also,
- 07:50at least to some extent, for dependability. So the question
- 07:54is basically: what is the unit of failure?
- 08:00Another aspect, if you look at commodity servers, is the
- 08:03introduction of accelerators, of additional functional units
- 08:07that are no longer CPUs but maybe GPUs or FPGAs, that bring additional
- 08:13compute capacity to the table but somehow need to be
- 08:18added into the infrastructure. And this can be done via
- 08:22interfaces like the PCI Express bus, for instance, or it can be done
- 08:27in a cache-coherent manner. So the seamless integration of accelerators
- 08:33is another trend that we see in the data center.
- 08:37And the mainframe is actually not a very good example for integrating
- 08:41accelerators,
- 08:44if it comes to additional functional units. When it comes to
- 08:49enriching the instruction set and building additional functionality onto a processor,
- 08:54then the mainframe is the prime example, for implementing Java
- 08:58application processors, for implementing crypto units, and for having
- 09:02extensions to the instruction set. So the message here is: while the
- 09:05mainframe tries to integrate everything into a single box,
- 09:09the trend in commodity servers is rather that we keep adding
- 09:14additional functional units.
- 09:17And this goes quite a way. What we see here is a screenshot
- 09:20of the structure of a Silicon Graphics system that we actually
- 09:24have in the building, where we see that, besides the QuickPath Interconnect, which is
- 09:31the prime link,
- 09:34structures of four
- 09:37sockets, basically blades, are connected to a denser interconnection network with a
- 09:47directory-based protocol in place,
- 09:51putting third-party interconnect
- 09:54technologies in place. So rack-scale computing: a future trend in the data center.
- 10:00And besides the mainframe
- 10:05there's another system that is worth mentioning, which is IBM Power.
- 10:10It is also a little exotic in comparison to the ubiquitous Intel architecture;
- 10:15however, it is a system that inherits many attributes from
- 10:18the mainframe in terms of high availability, maintainability, and serviceability.
- 10:25And these are also great scale-up systems, like the one depicted here
- 10:30on the screen.
- 10:34The Power architecture is interesting because it has a special approach towards integrating
- 10:39accelerators with the computer. As I mentioned, nothing gets
- 10:44close to the mainframe, where we just keep integrating
- 10:49additional capabilities into the instruction set. But with Power there is CAPI, and CAPI is the
- 10:54Coherent Accelerator Processor Interface. The very idea is that we
- 10:58have a cache-coherent integration of accelerator units like GPUs or FPGAs
- 11:03with the processor.
- 11:05And then you figure that problems occur
- 11:10in software more and more,
- 11:13and that we have additional fault classes, and that we have less error containment
- 11:18when it comes to software.
- 11:20If you look from an HPC, High-Performance Computing, perspective,
- 11:24it might even be as bad that an entire system, an entire computation,
- 11:29breaks down if a single node breaks.
- 11:33In more throughput-oriented workloads,
- 11:37we oftentimes have a little better error containment, but typically,
- 11:43if a system fails, quite a significant amount of the system
- 11:47is going to fail. So the traditional hardware models need an update.
- 11:53We have memory with increased density and higher data rates.
- 11:57We use many cores instead of a simple
- 12:01monolithic processor.
- 12:04We have the interconnect, which actually might be
- 12:07a source of contention and a source of problems, which is
- 12:11crucial when it comes to fault isolation.
- 12:14If you look at reactive fault tolerance,
- 12:17that might become inappropriate. This is also an area where we can learn
- 12:21from the mainframe operating systems: if you look, for instance, at z/OS,
- 12:25you have predictive
- 12:28fault tolerance measures.
- 12:31Also, systems do not really scale very well.
- 12:36We have additional software layers, like the virtualization
- 12:38layer, which has been there for a long time in the mainframe world
- 12:42but is relatively new to the table in the classical Intel-based data center.
- 12:48And we have relatively weak tool sets when it comes to reliability research.
- 12:57Here at the Hasso Plattner Institut, we have the FutureSOC Lab, which
- 13:00is a collaboration with industry partners
- 13:04for looking at next-generation
- 13:07x86 and x64 hardware. It
- 13:11hosts active research work where we want to understand new fault classes,
- 13:15where we want to do prediction, and where we want to look at
- 13:20proactive virtual machine migration.
- 13:23This is actually a lab that hosts many of these research projects,
- 13:28not only from our institute; it's open to the entire world, basically.
- 13:36And I just want to mention a few details from one project, which looks at
- 13:42failure prediction on several layers. On the CPU level, we want
- 13:46to look at online hardware failure prediction, which means performance counters.
- 13:51These are available in all the major architectures.
- 13:55You can also look at performance counters when it comes to memory,
- 13:58and this is actually an interesting incident here, where we see that
- 14:03a memory module failure was predicted, a
- 14:08couple of years ago already,
- 14:10basically twenty minutes
- 14:13before the actual failure happened. And talking about problem size:
- 14:20this unit was
- 14:23sixteen gigabytes of memory, so it's not just a few memory
- 14:28cells or a few bits, but a huge amount of memory that fails.
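The basic shape of such online failure prediction can be sketched in a few lines: watch a stream of correctable-error counts, as exposed through performance counters, and raise a warning once the rate in a sliding window exceeds a threshold. This is a generic illustration, not the project's actual predictor; window size and threshold are made-up tuning parameters:

```python
from collections import deque

# Sketch of online failure prediction from hardware counters: correctable
# errors often climb before a hard failure, so a rising rate in a sliding
# window is treated as an early warning. Window and threshold are made up.

class ErrorRatePredictor:
    def __init__(self, window: int = 5, threshold: int = 10):
        self.samples = deque(maxlen=window)  # recent per-interval error counts
        self.threshold = threshold

    def observe(self, correctable_errors: int) -> bool:
        """Record one sample; return True if a failure is predicted."""
        self.samples.append(correctable_errors)
        return sum(self.samples) > self.threshold

predictor = ErrorRatePredictor()
stream = [0, 0, 1, 0, 2, 5, 9]  # error counts climbing before a failure
alerts = [predictor.observe(n) for n in stream]
print(alerts)  # the final sample trips the alarm before a hard failure
```

Real systems use much richer statistics, but the principle is the same: predict, then act minutes before the module actually dies.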
- 14:33I mentioned z/OS as an example where we have Predictive Failure Analysis
- 14:38and Runtime Diagnostics built into the operating system.
- 14:42So operating systems need to be extended, and
- 14:46I mentioned Power, where we have AIX with a couple of these features,
- 14:49but we are also seeing these features showing up in
- 14:53the Linux operating system.
- 14:56And then there's an entire new field that receives much attention
- 15:00today, which is machine learning and data science, and
- 15:05the vision is that we could do fault prediction by applying machine
- 15:10learning to all the layers: to the hardware, where we have
- 15:13failure predictors, to the virtual machine monitor, to the operating system,
- 15:17but also to the application. That
- 15:20way we can factor in domain-specific knowledge. It would be pluggable,
- 15:24so that it's adaptable to different use cases and so forth.
- 15:27And the vision then is
- 15:29building a system health indicator as the outcome, as the result,
- 15:35of a multi-level failure prediction, and this system health indicator
- 15:39would trigger a virtual machine migration, so that we can migrate
- 15:44away from a server before problems manifest in the application.
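The multi-level idea can be sketched as follows: each layer contributes a health score, a weighted combination yields the system health indicator, and dropping below a threshold triggers migration. Weights, threshold, and scores are illustrative assumptions, not values from the project described here:

```python
# Sketch of a multi-level system health indicator: hardware, VMM, OS, and
# application each report a health score in [0, 1]; a weighted combination
# yields one indicator, and a low value triggers VM migration.
# All weights, thresholds, and scores are illustrative assumptions.

LAYER_WEIGHTS = {"hardware": 0.4, "vmm": 0.2, "os": 0.2, "application": 0.2}
MIGRATION_THRESHOLD = 0.7

def health_indicator(scores: dict) -> float:
    """Weighted combination of per-layer health scores."""
    return sum(LAYER_WEIGHTS[layer] * scores[layer] for layer in LAYER_WEIGHTS)

def should_migrate(scores: dict) -> bool:
    """Trigger migration before problems manifest in the application."""
    return health_indicator(scores) < MIGRATION_THRESHOLD

healthy = {"hardware": 0.95, "vmm": 1.0, "os": 0.9, "application": 1.0}
degraded = {"hardware": 0.3, "vmm": 0.9, "os": 0.8, "application": 0.9}
print(should_migrate(healthy), should_migrate(degraded))
```

The pluggability mentioned above would correspond to swapping the per-layer predictors that produce these scores, so domain-specific knowledge can be factored in per use case.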
- 15:55And with that,
- 15:59we can draw a couple of conclusions. We see that there are
- 16:02advances in server technology.
- 16:04We have an ever-growing number of CPU cores; systems are getting
- 16:08more and more dense. We have tremendous amounts of memory today,
- 16:14and there's non-volatile memory showing up on the horizon,
- 16:18which actually might bring new problems to the table, because
- 16:22the typical approach of solving a problem by just rebooting and redoing the thing
- 16:28will no longer work if you have non-volatile memory.
- 16:32Also, we notice that reliability becomes a key quality attribute.
- 16:36This is something we take away from the mainframe. However, with the
- 16:40high level of integration and the shrinking structure sizes, like 22 nanometers
- 16:44or 14 nanometers, or even 7 nanometers in the future,
- 16:49we will see multi-bit errors instead of just a single bit flip.
- 16:53So we need new measures for
- 16:56fault tolerance and error correction.
- 16:59And, as I suggested, fault prediction and dynamic reconfiguration, where
- 17:03we see examples in all the major systems,
- 17:06may provide new solutions. As for isolation against fault propagation:
- 17:12we know the LPAR as the ultimate means of isolation on the mainframe; in
- 17:17commodity servers this would correspond to, or would translate into, blades. So you just
- 17:21place your workload across several nodes.
- 17:24And then there are also new programming models showing up, like
- 17:28the microservice architecture, which became famous with Netflix
- 17:31and the so-called Chaos Monkey, which,
- 17:34while the system is operating, just
- 17:37kills service instances to show, and to see, and to be sure, that the system
- 17:42has the desired self-healing capabilities.
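The Chaos Monkey idea fits in a few lines: randomly kill instances of running services and rely on a supervisor to heal them, proving the self-healing property under fire. This is a toy sketch; service names, replica counts, and the healing rule are made up:

```python
import random

# Chaos Monkey sketch: while the system operates, randomly kill service
# instances; a supervisor restores the minimum replica count, demonstrating
# self-healing. Service names and replica counts are made-up examples.

def chaos_round(instances: dict, rng: random.Random) -> str:
    """Kill one random instance, then let the 'supervisor' heal the service."""
    victim = rng.choice(sorted(instances))
    instances[victim] -= 1            # the monkey strikes
    if instances[victim] < 2:         # self-healing: restore minimum replicas
        instances[victim] = 2
    return victim

services = {"frontend": 3, "checkout": 2, "catalog": 4}
rng = random.Random(42)               # seeded so the run is reproducible
for _ in range(10):
    chaos_round(services, rng)
print(services)                       # every service still has >= 2 replicas
```

The design point is that failures are injected deliberately and continuously, so a missing self-healing path shows up in testing rather than in production.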
- 17:45So in the end we see again that computer architecture drives
- 17:50changes in system software on several levels. It
- 17:54starts on the left, where we see that the servers have evolved
- 17:57to new form factors and higher density. We see advances in the
- 18:01operating system, like virtualization, trustworthiness and security, and clustering.
- 18:06We see new problems coming up with virtualization, by the way,
- 18:09like the so-called blue pill attack. How can you basically tell
- 18:14whether somebody has placed a virtualization layer between you and your hardware
- 18:18and makes changes to the operation of the hardware?
- 18:23We also see, and that started with cloud computing,
- 18:26that people don't want to buy big systems anymore; they want to
- 18:31follow the pay-as-you-go scheme. So we need to provide additional hardware
- 18:37and have a model for pricing,
- 18:42so that the customer pays only for what is actually being used.
- 18:47And then I mentioned already the GPUs and FPGAs.
- 18:51So we see an entire world of hybrid computing, where there is
- 18:54more than just the CPU as a processing model, and OpenCL is
- 18:59one of the new programming models, CUDA is another, and there's
- 19:04more to come. I also mentioned already CAPI and OpenCAPI,
- 19:08which got introduced with Power. The idea is that everything gets integrated
- 19:14in a homogeneous fashion again by implementing cache-coherent interfaces.
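Whether offloading to such an accelerator pays off is essentially a break-even calculation: the compute speedup must outweigh the cost of moving data over an interface like PCI Express. The sketch below uses illustrative numbers (16 GB/s is roughly a PCIe 3.0 x16 link; the speedup is assumed), which is also why coherent interfaces like CAPI matter, since they reduce exactly this transfer overhead:

```python
# Hybrid-computing decision sketch: offloading to a GPU/FPGA pays off only
# when the compute speedup outweighs the data transfer cost over the link.
# Link bandwidth and speedup are illustrative assumptions.

def offload_pays_off(cpu_time_s: float, speedup: float,
                     data_bytes: int, link_bytes_per_s: float = 16e9) -> bool:
    """Compare CPU time against accelerator time plus transfer time."""
    transfer_s = 2 * data_bytes / link_bytes_per_s   # data there and back
    accel_s = cpu_time_s / speedup + transfer_s
    return accel_s < cpu_time_s

# Large kernel: 2 s on the CPU, assumed 10x faster on the accelerator,
# moving 1 GiB of data: the transfer is cheap relative to the work saved.
print(offload_pays_off(2.0, 10.0, 2**30))
# Tiny kernel: 1 ms of CPU work cannot amortize moving 1 GiB of data.
print(offload_pays_off(0.001, 10.0, 2**30))
```

A cache-coherent attach shrinks (or removes) the explicit copy term, which moves the break-even point toward much smaller kernels.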
- 19:20So, an entire world of changes. Some of these changes are visible in the
- 19:26mainframe today already, but some of them are also driven by
- 19:33cost optimization goals, and they will not be visible
- 19:39in the mainframe that much, but more so in the commodity data center today.
- 19:45In the end, the mainframe still defines the gold standard for the data center.
- 19:51Thank you for your attention. That concludes the course.