This video belongs to the openHPI course Future of Computing - IBM POWER9 and beyond. Do you want to see more?
- 00:01Welcome to this part
- 00:03of the lecture series.
- 00:05My name is Arni Ingimundarson and I want to
- 00:07introduce you to the POWER processor
- 00:09microarchitecture and give you a
- 00:12few glimpses of the history of
- 00:15the POWER processor.
- 00:19A few words about me.
- 00:21As I said, my name is Arni Ingimundarson and I studied
- 00:23electrical engineering at the Technical University
- 00:26of Darmstadt, specializing
- 00:28in computer systems and formal verification.
- 00:33After 11 years at Texas Instruments, developing
- 00:36ultra-low-power and secure
- 00:38microcontrollers, I have been
- 00:40with IBM since 2015, developing
- 00:43arithmetic logic units and working with
- 00:46the SAP HANA on POWER team,
- 00:50porting and supporting HANA on the POWER
- 00:52architecture.
- 00:58I want to split my
- 01:02subject into four parts.
- 01:04In this video, I will give you an overview of the
- 01:06DLX microarchitecture.
- 01:09In the second video, I will show
- 01:12you the POWER4 and POWER5
- 01:15microarchitectures.
- 01:17In the third video, I will give you an overview of the
- 01:19POWER7 architecture, including
- 01:23touching on the subject of symmetric multiprocessing,
- 01:26and in the last lecture, I will give you an overview of
- 01:28POWER8 and POWER9.
- 01:35The DLX processor architecture
- 01:37was designed by Hennessy and
- 01:40Patterson.
- 01:42Its architecture is used mainly in the academic
- 01:45world for teaching.
- 01:47It is a simple 32-bit RISC
- 01:50architecture based on the MIPS architecture.
- 01:55It has a 32-bit fixed-width
- 01:58instruction set.
- 02:00It has three instruction types, and
- 02:03the instructions are processed by
- 02:06the processor in a five-stage pipeline.
- 02:15The DLX architecture
- 02:20has three instruction
- 02:22types, and they show
- 02:25a significant and typical
- 02:29way in which RISC instruction
- 02:32sets are usually defined.
- 02:33We see a homogeneous setup, with the
- 02:36opcode always in the same position
- 02:39for all three types.
- 02:41We also see that the bits
- 02:44defining the source register are in
- 02:47the same position for every type, and then we have
- 02:50varying fields for the rest.
- 02:52The three types begin
- 02:56with the I-type instruction, which is used
- 02:58for load and store instructions, conditional branches
- 03:01and jump-to-register instructions.
- 03:04And here we see that we
- 03:07have a field defining the source
- 03:09register and the destination register, which
- 03:12is unused, for example, for jump instructions.
- 03:17The R-type instruction
- 03:20format is used for all instructions
- 03:23with register-to-register operations.
- 03:26We have an opcode,
- 03:29we have two source registers
- 03:32defined, and we have one destination register.
- 03:36The function field at the end
- 03:39further defines
- 03:42the opcode, for example,
- 03:46the specific type of arithmetic logic
- 03:48unit operation.
- 03:50And the third and last instruction type
- 03:54is the J-type instruction, which is used for
- 03:57jump and jump-and-link instructions.
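The three formats described above can be sketched as simple bit-field extraction. This is a minimal sketch in Python; the field widths follow the standard DLX definition (6-bit opcode, 5-bit register fields, 16-bit I-type immediate, 11-bit R-type function, 26-bit J-type offset), and the helper names are illustrative.

```python
# Decoding the three DLX instruction formats from a 32-bit word.
# Note that the opcode and the first source register sit in the
# same bit positions for every format.

def decode_i_type(word):
    """I-type: opcode | rs1 | rd | 16-bit immediate."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs1":    (word >> 21) & 0x1F,
        "rd":     (word >> 16) & 0x1F,
        "imm16":  word & 0xFFFF,
    }

def decode_r_type(word):
    """R-type: opcode | rs1 | rs2 | rd | 11-bit function."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs1":    (word >> 21) & 0x1F,
        "rs2":    (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "func":   word & 0x7FF,
    }

def decode_j_type(word):
    """J-type: opcode | 26-bit offset."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "offset": word & 0x03FF_FFFF,
    }
```

Because the opcode field is always in the same place, a decoder can read it first and then select which of the three field layouts applies.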
- 04:05The difference between CISC and
- 04:09RISC architectures
- 04:12lies mainly in how the instruction words are
- 04:15defined.
- 04:16On the RISC side, we usually have a simpler set of
- 04:19instructions, and we have specific
- 04:21instructions for memory access.
- 04:26While on the CISC side, you can
- 04:30often have special
- 04:33operands of an instruction referencing
- 04:36memory directly.
- 04:38One advantage of the RISC-type instruction set
- 04:42is its fixed
- 04:46bit width, which means that
- 04:48with every window of
- 04:51the instruction fetch, a fixed
- 04:54number of instructions is fetched,
- 04:57while on the CISC side you have variable-length
- 05:00instructions.
- 05:02So you have a little bit more effort
- 05:05on the decoder side to decode each instruction.
- 05:10Today, most modern processors,
- 05:13at least internally, use RISC-like
- 05:17micro-instructions inside, which are not visible to
- 05:19the outside.
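The fetch-window advantage can be illustrated with a small sketch, assuming a 16-byte fetch window; the length rule for the variable-length case is made up for illustration.

```python
# Contrasting instruction fetch for fixed-width vs. variable-length
# encodings. With DLX's fixed 4-byte instructions, a fetch window can be
# sliced into instructions without decoding anything; with variable-length
# (CISC-style) encodings, an instruction's length must be decoded before
# the next instruction's start is known.

def slice_fixed(window: bytes, width: int = 4):
    # Fixed width: instruction boundaries are known in advance.
    return [window[i:i + width] for i in range(0, len(window), width)]

def slice_variable(window: bytes, length_of):
    # Variable length: must inspect each opcode byte to find the next start.
    out, i = [], 0
    while i < len(window):
        n = length_of(window[i])
        out.append(window[i:i + n])
        i += n
    return out

window = bytes(range(16))           # one 16-byte fetch window
print(len(slice_fixed(window)))     # always 4 instructions per window
print(len(slice_variable(window, lambda b: 1 + (b % 3))))
```

The fixed-width count is the same for every window, while the variable-length count depends on the window's contents, which is exactly the extra decoder effort mentioned above.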
- 05:26The five-stage pipeline of the DLX
- 05:29microprocessor, or microarchitecture,
- 05:32is a
- 05:35good example of a pipeline for a RISC
- 05:38microcontroller or microprocessor.
- 05:43We start with an instruction fetch
- 05:45where the instructions are loaded from the memory.
- 05:50When the instruction word has been
- 05:53fetched, we need to decode it to
- 05:56decide what to do.
- 05:57In the third stage, we execute the instruction.
- 06:02And in the fourth stage, we
- 06:05execute memory accesses,
- 06:08which do not apply to all instructions.
- 06:12And in the fifth cycle, the write-back cycle,
- 06:16the results of the operation are written back into
- 06:18the register file.
- 06:21And this applies both to memory as well
- 06:24as register-to-register operations: the
- 06:27results are written in the write-back cycle.
- 06:33Now, as you can see,
- 06:36you have five cycles executing
- 06:39one instruction, and
- 06:42since each stage is usually
- 06:44a separate logic function,
- 06:47we can have
- 06:51what is called pipelining of instructions.
- 06:53So as soon as the first instruction has been fetched
- 06:58and is in the decode cycle, we
- 07:01can, in parallel to the decode
- 07:04cycle of the first instruction,
- 07:05fetch the next instruction.
- 07:10And this can be extended
- 07:13to all five stages,
- 07:17so that we can have five instructions
- 07:20in flight and in
- 07:23parallel.
- 07:25Whether pipelining is implemented depends
- 07:28on the implementation of the architecture:
- 07:31there are implementations available that do not
- 07:34parallelize or pipeline
- 07:37instructions, but there are quite
- 07:40a few that do.
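When pipelining is implemented, the gain follows from a simple cycle count: with k stages and n instructions, an ideal stall-free pipeline finishes in k + (n - 1) cycles instead of k * n. A minimal sketch:

```python
# Cycle counts for an ideal five-stage pipeline with no stalls: without
# pipelining, each instruction occupies all stages before the next one
# starts; with pipelining, a new instruction completes every cycle once
# the pipe has filled.

def cycles_unpipelined(n_instructions: int, stages: int = 5) -> int:
    # Each instruction takes `stages` cycles, back to back.
    return n_instructions * stages

def cycles_pipelined(n_instructions: int, stages: int = 5) -> int:
    # The first instruction takes `stages` cycles; each further
    # instruction finishes one cycle after the previous one.
    return stages + (n_instructions - 1)

print(cycles_unpipelined(5))  # -> 25
print(cycles_pipelined(5))    # -> 9, with five instructions in flight
```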
- 07:47At this point, I want
- 07:50to show you in a little bit more detail how
- 07:55such a pipeline is
- 07:58implemented in the hardware at a high level,
- 08:01and in this diagram I'm
- 08:04showing the five
- 08:07pipeline stages.
- 08:10These symbols here
- 08:13signify registers, or latches,
- 08:16in the hardware, which
- 08:20store the data
- 08:24and output constant data over the whole clock
- 08:26cycle. And they are updated
- 08:30with every clock cycle.
- 08:32So what happens in these systems
- 08:36is that for the instruction
- 08:39fetch we have, for example, the
- 08:42instruction address in this register,
- 08:45and the instruction address is then used to index
- 08:49into the memory.
- 08:52Given this address, the memory will
- 08:56output the corresponding
- 08:59word stored at that address.
- 09:02And with the next rising
- 09:04edge of the clock, over the next clock cycle, we
- 09:06store the result in this register here.
- 09:10And in parallel to that, the
- 09:13instruction address is being incremented
- 09:17by the size of one
- 09:20instruction word.
- 09:22And in the DLX case, we have
- 09:26a 32-bit instruction, that is, four bytes per
- 09:28instruction.
- 09:31In the second
- 09:33stage, the instruction decode stage,
- 09:36we have two operations
- 09:40happening in parallel:
- 09:41we are decoding the instruction, what
- 09:44we need to do, as well
- 09:47as reading all the registers
- 09:50referenced by that instruction word.
- 09:55In the third cycle,
- 09:57we take the data that has
- 10:00been stored in this register,
- 10:03or latch, and
- 10:06we select the correct operands
- 10:10depending on how the instruction was defined,
- 10:13and we run them through the ALU, the arithmetic
- 10:15logic unit, which performs additions,
- 10:18subtractions and so on, depending on which
- 10:21instruction types, or instruction functions,
- 10:24have been implemented.
- 10:29In the fourth cycle, we have a memory access
- 10:32cycle, so for
- 10:35load instructions, we are only
- 10:38specifying the address here, which will give us
- 10:40the data that was stored at that
- 10:43address. For
- 10:46store instructions, we have both the address as
- 10:49well as the data to be stored,
- 10:52and the last cycle shows us
- 10:55that we have a path back to the register
- 10:58file, where the updated
- 11:01results are stored.
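The five stages just walked through can be sketched as a toy, single-instruction pass through the datapath. The machine state, instruction tuples, and stage helpers below are illustrative, not the real DLX encoding.

```python
# A toy walk of one instruction through the five DLX stages on a made-up
# machine state (register file + word-addressed memory).

regs = [0] * 32        # register file
mem = {0: 7, 4: 35}    # memory: address -> stored word
pc = 0                 # instruction address register

def fetch(program, pc):
    # IF: read the instruction word and increment the PC by one
    # instruction word (4 bytes for 32-bit instructions)
    return program[pc // 4], pc + 4

def decode(instr):
    # ID: decode the fields and read the referenced register in parallel
    op, rd, rs, imm = instr
    return op, rd, regs[rs], imm

def execute(op, x, y):
    # EX: the ALU computes the result or the effective memory address
    return x + y

def memory_access(op, value):
    # MEM: only load/store instructions touch memory in this stage
    return mem[value] if op == "lw" else value

def write_back(rd, value):
    # WB: the result is written back into the register file
    regs[rd] = value

program = [("lw", 1, 0, 4)]          # r1 <- mem[r0 + 4]
instr, pc = fetch(program, pc)
op, rd, x, y = decode(instr)
write_back(rd, memory_access(op, execute(op, x, y)))
print(regs[1])                       # -> 35, loaded from address 0 + 4
```

Note that the load exercises all five stages, while a register-to-register operation would simply pass its ALU result through the memory stage unchanged.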
- 11:05And just as a reminder,
- 11:08looking at how the pipeline works, these
- 11:11five stages can be pipelined.
- 11:13And we see everything working in parallel.
- 11:19I would like to point out one
- 11:22minor thing here, which is actually
- 11:25quite typical for high-level block
- 11:28diagrams of
- 11:32computer architectures, and many other things.
- 11:35Usually, they show the bare minimum
- 11:38of what you want to show, which implies
- 11:41that there are a lot of other things that are
- 11:42hidden.
- 11:44And one thing that I want to point out that is
- 11:47hidden is that
- 11:51the branch decision,
- 11:53or branch execution, happens also
- 11:56in the execution cycle.
- 11:59And if we have a pipelined architecture,
- 12:04let's assume
- 12:07that the second instruction
- 12:10is a branch
- 12:13instruction.
- 12:15We see that there are two more instructions that are
- 12:18being fetched, or worked on,
- 12:21until this second instruction is
- 12:25in the execution phase, where we actually have an
- 12:27updated address, or instruction
- 12:30pointer, and
- 12:32this path is missing from this diagram.
- 12:39And
- 12:44that is a quite critical point when
- 12:46reading block diagrams of microarchitectures:
- 12:49there are so many things that are implied
- 12:52in the drawings that we need to think about.
- 12:56And so please
- 12:59keep that in mind when reading block diagrams: even
- 13:02if things seem hard to understand,
- 13:05that is often because something is missing from
- 13:07the diagram.
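The hidden branch path just discussed implies a concrete cost: if the branch outcome is only known in the execute stage, the instructions fetched behind a taken branch are on the wrong path and must be discarded. A minimal sketch of that penalty, with illustrative stage numbering and no branch prediction assumed:

```python
# If the branch outcome is only known in the execute stage (stage 3 of
# the five: IF, ID, EX, MEM, WB), the two instructions fetched behind a
# taken branch are on the wrong path and must be flushed.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
BRANCH_RESOLVED_IN = STAGES.index("EX") + 1   # stage 3

def taken_branch_penalty(resolve_stage: int = BRANCH_RESOLVED_IN) -> int:
    # One wrong-path instruction enters the pipeline in every cycle
    # between fetching the branch and resolving it.
    return resolve_stage - 1

print(taken_branch_penalty())  # -> 2 instructions flushed per taken branch
```

This is why deeper pipelines pay more for each taken branch, and why later designs add branch prediction to hide this cost.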
- 13:12So,
- 13:17before we finish this
- 13:20video, I want to give you a
- 13:23short overview of the history of the
- 13:26POWER processors, that is, of the releases
- 13:29of the different versions.
- 13:33The POWER4 processor, which we will talk about
- 13:35in the next video,
- 13:37is the consolidation of a longer
- 13:40history within IBM, with collaborations with
- 13:42other companies,
- 13:46which resulted in the POWER4
- 13:50chips. And
- 13:53IBM has been steadily developing
- 13:56and investing in the POWER architecture until we
- 13:59arrived at POWER9,
- 14:01released in 2017.
- 14:08And as I said, we will touch upon
- 14:10part of those in the next videos.
- 14:17I have here a short list
- 14:20of material for further
- 14:22reading, which
- 14:25I can highly recommend.
- 14:29And with that, I
- 14:32want to thank you for your attention this time and
- 14:34hope to see you in the next video.