- 00:04This is a nice quote from Thomas Neumann, who is
- 00:06a professor at TU Munich
- 00:10and he basically says that you can optimize as much on the
- 00:15storage engine, on the execution engine as you want - usually you
- 00:18get maybe 60% improvement there, as Intel has shown,
- 00:21but optimizing the query
- 00:25optimizer often gives us performance improvements
- 00:29by a factor of ten or even more, as we will see later.
- 00:32This basically says once you have your optimized
- 00:35storage engine or your execution engine, it's often worth looking
- 00:39into the query optimizer before you further optimize your next
- 00:43SIMD generation, because there is a really huge potential
- 00:47for performance improvements in the query optimizer.
- 00:52I hope you all know that SQL is declarative so
- 00:57you don't tell the database how the query is executed,
- 01:01which operator follows which
- 01:05operator, but you just say what you want to have, what you want
- 01:08to receive from the database. What the query optimizer
- 01:12now does is, it takes this first plan and builds
- 01:17logically equivalent plans,
- 01:20many, many of these, often hundreds or thousands of plans
- 01:23and then tries to find the plan with the
- 01:26lowest expected costs. What these costs can be you can see
- 01:30here on the bottom. This might be
- 01:34algorithmic costs, for example the complexity of your
- 01:37sort operation, this might be logical costs
- 01:40where you try to estimate the output size of each operator
- 01:44or physical costs, for example IO bandwidth or
- 01:47cache misses. The interesting part is that
- 01:51previously, disk-resident databases like DB2
- 01:55usually optimized for the physical
- 01:58part, because there was one single large bottleneck,
- 02:01which was IO block accesses,
- 02:03and now, with increasingly more data in memory,
- 02:08the trend moves more towards the logical costs,
- 02:11so that most databases now concentrate on logical costs.
- 02:14That means you want to minimize the output size
- 02:17of each operator, which also has the advantage that you need
- 02:20to move less data through memory. What you usually do is
- 02:24you execute all the filters first
- 02:27or, more generally, you execute operations first where you expect
- 02:31that the output size will be minimal, and only later you do
- 02:35the joins or you do some sorting and so on.
- 02:37We try to rearrange operators in order to minimize the expected
- 02:41output size, the logical costs. That is basically the main
- 02:44concept of query optimization.
- 02:48Query optimization usually works in two steps,
- 02:52they are kind of related, depending on the system,
- 02:55but usually there are two steps. The first step
- 02:57is semantic query transformations and simple
- 03:00heuristics. Then the second step is a little bit more
- 03:03complex where you have cost models to estimate
- 03:07for example, join costs
- 03:10and try to re-order operators. Let us look at some examples
- 03:14for that, the first part is about query
- 03:17reformulation. We try to exploit semantic query
- 03:22transformations, that means we transform the query
- 03:26into a new query which is logically equivalent
- 03:29and hopefully we find a reformulation that
- 03:32is easier to process, that has lower costs.
- 03:35A pretty simple example is shown in the middle of
- 03:39the slide: we have a selection on A
- 03:42and select all tuples that have a value A
- 03:46less than ten and bigger than ten. Obviously,
- 03:50the optimizer can see here that we don't have to
- 03:53access the data at all, we don't have to scan column A twice.
- 03:56We can just return an empty result set.
- 04:00The example on the bottom shows
- 04:04another simplification or reformulation
- 04:07of a query where we simplify the selection on A.
- 04:10I think it's rather obvious if we select all the
- 04:15tuples where A is smaller than ten we don't have to check if
- 04:18they are smaller than twenty because that's a given.
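As a rough illustration of this kind of semantic simplification, here is a small Python sketch, not taken from any real optimizer, that intersects conjunctive range predicates on a single attribute; the helper name simplify_ranges is made up for this example.

```python
# Minimal sketch: simplify a conjunction of range predicates on one attribute
# by intersecting their intervals.
# "a < 10 AND a > 10" collapses to an empty interval (no scan needed),
# "a < 10 AND a < 20" collapses to "a < 10".

def simplify_ranges(predicates):
    """predicates: list of (op, value) with op in {'<', '>'} on one attribute."""
    lower, upper = float('-inf'), float('inf')
    for op, value in predicates:
        if op == '>':
            lower = max(lower, value)
        else:  # '<'
            upper = min(upper, value)
    if lower >= upper:
        return None                      # contradiction: return an empty result set
    simplified = []
    if lower != float('-inf'):
        simplified.append(('>', lower))
    if upper != float('inf'):
        simplified.append(('<', upper))
    return simplified

print(simplify_ranges([('<', 10), ('>', 10)]))   # None  -> empty result
print(simplify_ranges([('<', 10), ('<', 20)]))   # [('<', 10)]
```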
- 04:23Another standard optimization
- 04:27is the so-called predicate push-down that you can see on the
- 04:30top there, where we push down our selection on the attribute
- 04:33B into our sub query, which is a natural join between
- 04:38T1 and T2. We push the selection down into
- 04:41the sub query, and on the bottom you
- 04:45see a simple reformulation
- 04:49or simplification of an equation. These are
- 04:53bread and butter operations of
- 04:56a query optimizer. This is
- 04:58something that every query optimizer does and needs to do
- 05:01in order to optimize your queries.
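To make the push-down idea concrete, here is a toy, purely illustrative sketch in Python: a plan is represented as nested tuples, and a selection that only touches one join input is moved below the join. Table names, attribute names, and the push_down helper are all invented for this example.

```python
# Logical plan before:  select_B( join(T1, T2) )
# Logical plan after:   join( T1, select_B(T2) )

def push_down(plan, table_attributes):
    op = plan[0]
    if op == 'select' and plan[2][0] == 'join':
        _, predicate_attr, (_, left, right) = plan
        if predicate_attr in table_attributes[right]:
            return ('join', left, ('select', predicate_attr, right))
        if predicate_attr in table_attributes[left]:
            return ('join', ('select', predicate_attr, left), right)
    return plan  # nothing to push down

attrs = {'T1': {'a', 'c'}, 'T2': {'b', 'c'}}
before = ('select', 'b', ('join', 'T1', 'T2'))
print(push_down(before, attrs))   # ('join', 'T1', ('select', 'b', 'T2'))
```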
- 05:05Some heuristics that are a little bit more tricky
- 05:10to do are shown here on the top.
- 05:14We have, for example, as we have said already,
- 05:18the strategy that we always execute the filters
- 05:22first. We execute the filters
- 05:25ordered by restrictiveness, so we want to execute the
- 05:29filters first that we consider the most restrictive;
- 05:32afterwards we might do the joins or other
- 05:36operators, but usually you always want to execute the filters
- 05:40first. Another example we have just shown is the predicate
- 05:44push-down, which almost always makes sense,
- 05:48and join re-ordering which we will talk about
- 05:52later in a second. But
- 05:55the important thing here is to recognize that these
- 05:58are heuristics so they don't have to be true
- 06:01but they are a good starting point for every optimizer because
- 06:04they usually hold. You can easily
- 06:08have cases where they don't, but for the vast
- 06:10majority of cases optimizing using these rules will
- 06:15be fine. But imagine an example where
- 06:18we have the world population table and
- 06:20you join this with all the presidents of the United States.
- 06:23This is a highly restrictive join, right? Maybe you have
- 06:27five or ten living presidents,
- 06:29so this join result will be really small. But if you
- 06:33instead first select, say, all the male persons,
- 06:37well, this is not a very restrictive selection,
- 06:39so this is a case where it might be better to execute the join
- 06:42before the selection.
- 06:46But again, usually you do rather well
- 06:49using these heuristics.
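Putting rough numbers on the counterexample above makes the point visible; the figures below are approximate guesses for illustration only.

```python
# A gender filter keeps roughly half of the 8 billion tuples, while a join
# with a handful of presidents keeps only a few tuples. Here the join, not
# the filter, is the restrictive operation, so "filters first" misleads.

WORLD_POPULATION = 8_000_000_000
PRESIDENTS = 5                                 # assumed tiny presidents table

after_gender_filter = WORLD_POPULATION * 0.5   # ~4 billion tuples survive
after_president_join = PRESIDENTS              # at most a few tuples survive

print(after_gender_filter, after_president_join)
```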
- 06:54Here we have an example where we
- 06:56again show you the logical query plan
- 06:59and how we can reformulate the query plan into a logically
- 07:03equivalent plan where we estimate the costs to be lower.
- 07:07What we do here is just what we have discussed
- 07:10before, we do the selections first so the selections here are
- 07:14on state = Hessen and the birth year
- 07:17larger than 2010, and before we join we now do
- 07:22both selections, on the locations table and the world population
- 07:25table. We join both tables afterwards
- 07:29and hopefully we have lower estimated costs.
- 07:34We have also shown the input sizes here:
- 07:37before optimizing the query, we are joining
- 07:40eight billion tuples with one hundred tuples from the
- 07:45actors table. This is just a huge join: if you
- 07:50have seen the join examples yesterday and their run times,
- 07:54this would probably be in the
- 07:57minutes or hours, a join over eight
- 08:00billion tuples, un-optimized. But if we compare the plan
- 08:06now to the right side where we re-order
- 08:09the operators, you can see that we have now a sequential scan
- 08:12in the beginning on eight million tuples.
- 08:14We have seen how fast sequential scans are on dictionary encoded
- 08:18columns with SIMD and so on so this is a comparatively
- 08:23simple and fast operation
- 08:26and we can reduce the input size
- 08:29for a much more expensive operation - the join here.
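As a back-of-the-envelope comparison of the two plans, the sketch below uses the sum of intermediate result sizes as the logical cost; the selectivities and the join selectivity are invented for illustration and are not taken from the slides.

```python
WORLD_POPULATION = 8_000_000_000
LOCATIONS = 100
SEL_STATE = 0.1                    # assumed selectivity of state = 'Hessen'
SEL_BIRTHYEAR = 0.01               # assumed selectivity of birth year > 2010
JOIN_SELECTIVITY = 1 / LOCATIONS   # assumed: one matching location per person

# Plan 1: join first, filter afterwards
join_out_1 = WORLD_POPULATION * LOCATIONS * JOIN_SELECTIVITY
filter_out_1 = join_out_1 * SEL_STATE * SEL_BIRTHYEAR
cost_plan_1 = join_out_1 + filter_out_1

# Plan 2: push the selections below the join
filtered_pop = WORLD_POPULATION * SEL_BIRTHYEAR
filtered_loc = LOCATIONS * SEL_STATE
join_out_2 = filtered_pop * filtered_loc * JOIN_SELECTIVITY
cost_plan_2 = filtered_pop + filtered_loc + join_out_2

print(f"plan 1 (join first):   {cost_plan_1:,.0f} intermediate tuples")
print(f"plan 2 (filter first): {cost_plan_2:,.0f} intermediate tuples")
```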
- 08:34The next step that you usually do is
- 08:37you build your physical query plan. Once you have
- 08:40the query plan that we consider the best plan we have
- 08:43found, or the best plan available, we translate
- 08:46this plan with logical operators into the
- 08:49actual database operators that our database
- 08:52implementation offers us. These might be different
- 08:54joins, for example here we have a hash join and a sort merge join
- 08:57so now we try to find the best possible way to do
- 09:00our join and build a physical query plan. This plan really
- 09:04includes physical operators.
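A much simplified sketch of how such a physical operator choice could be made is shown below; the decision rule and the cost terms are assumptions for illustration, not how Hyrise or any other system actually decides.

```python
def pick_join_operator(left_rows, right_rows, left_sorted, right_sorted):
    if left_sorted and right_sorted:
        # merging two already sorted inputs is roughly linear in their sizes
        return 'sort-merge join', left_rows + right_rows
    # otherwise: build a hash table on the smaller input, probe with the larger
    build, probe = min(left_rows, right_rows), max(left_rows, right_rows)
    return 'hash join', build + probe

print(pick_join_operator(8_000_000, 100, left_sorted=False, right_sorted=False))
print(pick_join_operator(8_000_000, 100, left_sorted=True, right_sorted=True))
```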
- 09:11What we need to have in place for all these
- 09:15estimations, for all these cost estimations are
- 09:18good statistics. Every database relies on
- 09:21the statistics that we have in place, for example, to
- 09:26estimate how expensive my filter
- 09:29on the column first name is. Such statistics
- 09:36include, for example the number of distinct values.
- 09:39This is usually a given for our database, for Hyrise or SAP HANA
- 09:42because every column is dictionary encoded so we have this
- 09:46number already. Another statistic might be the
- 09:49presence or absence of indices,
- 09:52it often makes sense to use indices, for example in cases where
- 09:55a filter is really restrictive, so we want to know if we have
- 09:58these indices. It is also nice to have
- 10:01the value distribution, we
- 10:03come back to that later, top-n values and so on,
- 10:06which helps you to estimate the costs of an operator
- 10:09without executing it first obviously. What we can
- 10:13see here is an example for the world
- 10:17population table and how it could look when implemented: we have the
- 10:21data in the table obviously and then we have this
- 10:24metadata. This metadata includes, first,
- 10:28all the attributes and the types of the attributes,
- 10:32this is what you always need obviously for
- 10:35your table, but then you also have the index columns
- 10:37and the statistics.
- 10:41These statistics include, for example,
- 10:43the min and max, here, in this example it's the birth year so we
- 10:48have a minimum of 1900
- 10:51and the maximum of 2017.
- 10:54It might include distinct counts, so how many distinct values
- 10:57do we have in that particular column.
- 11:00As we have said, usually that's a given if we
- 11:03have dictionary encoding and then we have for
- 11:06example histograms and they are really important
- 11:10for a couple of operators, for example,
- 11:13selections or join estimations.
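One possible, purely hypothetical layout for such per-column statistics, loosely following the slide, could look like the Python dictionary below; the field names and values are illustrative and not Hyrise's or SAP HANA's actual format.

```python
world_population_metadata = {
    "attributes": {"first_name": "string", "country": "string", "birth_year": "int"},
    "indexed_columns": ["country"],
    "statistics": {
        "birth_year": {
            "min": 1900,
            "max": 2017,
            "distinct_count": 118,   # essentially free with dictionary encoding
            "histogram": None,       # could hold an equi-width or equi-depth histogram
        },
    },
}
```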
- 11:15Imagine that you have a query: give me all the
- 11:19people having the first name Martin.
- 11:24You could basically,
- 11:27before executing the query, when you
- 11:29want to estimate the cost, look into the histogram
- 11:32and you might see that it is a rather uniform distribution
- 11:36with a distinct count of maybe 1,000.
- 11:38Then I would expect my result to be one thousandth of
- 11:42the overall table if I select on Martin.
- 11:45But if you now select all the people who are
- 11:47from a particular country, this distribution is
- 11:52absolutely not uniform, so this is highly skewed.
- 11:54There is a huge difference if you have a filter on the country
- 11:57Germany which is a result set of maybe 80
- 12:01million, I guess, and then you have a selection on the U.S. which
- 12:04is four times the size, and then you have a selection on China
- 12:07which I think might be pretty much a quarter
- 12:10of the general population of the world. Basically there is a huge skew in
- 12:13that data and you want to be able to recognize that skew
- 12:17in order to estimate the cost of your operators.
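The sketch below shows, under simplified assumptions, how such statistics could feed a selectivity estimate: with no histogram we fall back to the uniform assumption (table size divided by distinct count), while a histogram lets us recognize skew, for example for a filter on country. All numbers and the estimated_rows helper are invented for illustration.

```python
def estimated_rows(table_rows, distinct_count, histogram=None, value=None):
    if histogram and value in histogram:
        return histogram[value]           # frequency taken from the statistics
    return table_rows / distinct_count    # uniform-distribution fallback

WORLD_POPULATION = 8_000_000_000

# first name: assumed ~1,000 distinct values, roughly uniform
print(estimated_rows(WORLD_POPULATION, 1_000))           # ~8,000,000 expected rows

# country: highly skewed, so a histogram matters
country_histogram = {'Germany': 80_000_000, 'China': 1_400_000_000}
print(estimated_rows(WORLD_POPULATION, 200, country_histogram, 'China'))
print(estimated_rows(WORLD_POPULATION, 200, country_histogram, 'Germany'))
```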
- 12:23So basically,
- 12:26as soon as you have these statistics and you
- 12:29know how to use them, what you want to do is join reordering,
- 12:32because that is one of the most
- 12:35expensive parts of the database: joining, the order
- 12:38of joins and the order in which they are executed.
- 12:41We can use some information here again that
- 12:44the database provides us, for example if we know about
- 12:47a foreign key relationship we can
- 12:51use that information to improve our cost estimation
- 12:55for a join between two tables.
- 12:57We can check if we have previously selected
- 13:01on a uniformly distributed attribute or maybe on a value
- 13:05that is among the top five values.
- 13:07We can use this information
- 13:10to better predict how large or how expensive our join will be.
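Two textbook cardinality estimates that such information enables are sketched below; both are standard estimates rather than the formulas of any specific system, and the numbers are made up.

```python
def fk_join_estimate(referencing_rows, referenced_selectivity=1.0):
    # equi-join along a foreign key: every referencing tuple matches exactly one
    # referenced tuple, scaled by any selection already applied to the referenced side
    return referencing_rows * referenced_selectivity

def equi_join_estimate(rows_r, rows_s, distinct_r, distinct_s):
    # classic fallback estimate for a general equi-join:
    # |R| * |S| / max(distinct(R.a), distinct(S.b))
    return rows_r * rows_s / max(distinct_r, distinct_s)

print(fk_join_estimate(8_000_000_000, referenced_selectivity=0.05))
print(equi_join_estimate(8_000_000_000, 200, distinct_r=200, distinct_s=200))
```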
- 13:18Just a simple example of why we want to do this join re-ordering:
- 13:23if you join these
- 13:25three relations here, there is the so-called
- 13:29join associativity, which means that you can re-order these joins
- 13:34in any form that you want, at least if it's an inner join.
- 13:38Instead of joining relations one and two first
- 13:41and then joining it with relation three, you could also
- 13:44switch the order and join two with three
- 13:49first for example, because these joins might be faster to do.
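A toy calculation can show why the order matters even though the result is the same; the cardinalities and join selectivities below are made up for illustration, and the logical cost is again just the sum of intermediate result sizes.

```python
R1, R2, R3 = 1_000_000, 1_000_000, 1_000
sel_12, sel_23 = 1e-6, 1e-6            # assumed join selectivities

def join_output(outer, inner, selectivity):
    return outer * inner * selectivity

# Order A: (R1 join R2) join R3
out_12 = join_output(R1, R2, sel_12)
cost_a = out_12 + join_output(out_12, R3, sel_23)

# Order B: R1 join (R2 join R3)
out_23 = join_output(R2, R3, sel_23)
cost_b = out_23 + join_output(R1, out_23, sel_12)

print(f"(R1 join R2) join R3: {cost_a:,.0f} intermediate tuples")
print(f"R1 join (R2 join R3): {cost_b:,.0f} intermediate tuples")
```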
- 13:55What have we learned about query optimization?
- 13:58It becomes increasingly important: if you look at OLTP
- 14:03systems, what you usually do there in 90%
- 14:06of the cases is access via the primary key,
- 14:10you access your primary key index, because there usually is one,
- 14:14and then that's pretty much done. But now, with all the analytical
- 14:17systems and mixed workloads where you execute
- 14:20transactional queries and complex analytical queries,
- 14:23the query optimization or the optimizer becomes increasingly
- 14:26important.
- 14:29However, as the challenging tasks that we have
- 14:32shown on the last slides indicate, pretty much all databases are
- 14:35still not very accurate in estimating the costs,
- 14:38and optimizing the queries is a computationally really expensive
- 14:43operation. Usually
- 14:46no database will give you the perfect plan; they just stop at
- 14:49some point, because there are just too many alternatives,
- 14:52and try to give you the best estimate, the best plan they
- 14:56can find in a certain amount of time.
About this video
In this lecture we continue the series about query processing and have a closer look at the optimizations conducted by the database's query optimizer. This includes the selection of suitable execution orders, composed in so-called query plans.