- 00:04This is a nice quote from Thomas Neumann, who is
- 00:06a professor at TU Munich
- 00:10and he basically says that you can optimize as much on the
- 00:15storage engine, on the execution engine as you want - usually you
- 00:18get maybe 60% improvement there, as Intel has shown,
- 00:21but optimizing the query
- 00:25optimizer often gives us performance improvements
- 00:29by a factor of ten or even more, as we will see later.
- 00:32This basically says once you have your optimized
- 00:35storage engine or your execution engine, it's often worth looking
- 00:39into the query optimizer before you further optimize your next
- 00:43SIMD generation, because there is a really huge potential
- 00:47for performance improvements in the query optimizer.
- 00:52I hope you all know that SQL is declarative so
- 00:57you don't tell the database how the query is executed,
- 01:01which operator follows which
- 01:05operator, but you just say what you want to have, what you want
- 01:08to receive from the database. What the query optimizer
- 01:12now does is, it takes this first plan and builds
- 01:17logically equivalent plans,
- 01:20many, many of these, often hundreds or thousands of plans
- 01:23and then tries to find the plan with the
- 01:26lowest expected costs. What these costs can be you can see
- 01:30here on the bottom. This might be
- 01:34algorithmic costs, for example the complexity of your
- 01:37sort operation, this might be logical costs
- 01:40where you try to estimate the output size of each operator
- 01:44or physical costs, for example IO bandwidth or
- 01:47cache misses. The interesting part is that
- 01:51previously, disk-resident databases like DB2
- 01:55usually optimized for the physical
- 01:58part, because there was one single large bottleneck,
- 02:01which was IO block accesses,
- 02:03and now, with increasingly more data in memory,
- 02:08the trend moves more towards the logical costs,
- 02:11so that most databases now concentrate on logical costs.
- 02:14That means you want to minimize the output size
- 02:17of each operator, which also has the advantage that you need
- 02:20to move less data through memory. What you usually do is
- 02:24you execute all the filters first
- 02:27or, more generally, you execute operations first where you expect
- 02:31that the output size will be minimal, and only later you do
- 02:35the joins or you do some sorting and so on.
- 02:37We try to rearrange operators in order to minimize the expected
- 02:41output size, the logical costs. That is basically the main
- 02:44concept of query optimization.
- 02:48Query optimization usually works in two steps,
- 02:52they are kind of related, depending on the system,
- 02:55but usually there are two steps. The first step
- 02:57is semantic query transformations and simple
- 03:00heuristics. Then the second step is a little bit more
- 03:03complex where you have cost models to estimate
- 03:07for example, join costs
- 03:10and try to re-order operators. Let us look at some examples
- 03:14for that, the first part is about query
- 03:17reformulation. We try to exploit semantic query
- 03:22transformations, that means we transform the query
- 03:26into a new query which is logically equivalent
- 03:29and hopefully we find a reformulation that
- 03:32is easier to process, that has lower costs.
- 03:35A pretty simple example is shown in the middle of
- 03:39the slide: we have a selection on A
- 03:42and select all tuples that have a value A
- 03:46less than ten and bigger than ten. Obviously,
- 03:50the optimizer can see here that we don't have to
- 03:53access the data at all, we don't have to scan column A twice.
- 03:56We can just return an empty result set.
- 04:00The example on the bottom shows
- 04:04another simplification or reformulation
- 04:07of a query where we simplify the selection on A.
- 04:10I think it's rather obvious if we select all the
- 04:15tuples where A is smaller than ten we don't have to check if
- 04:18they are smaller than twenty because that's a given.
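As a rough illustration of this kind of semantic simplification, here is a small Python sketch, not taken from any real optimizer, that intersects conjunctive range predicates on a single attribute; the helper name simplify_ranges is made up for this example.

```python
# Minimal sketch: simplify a conjunction of range predicates on one attribute
# by intersecting their intervals.
# "a < 10 AND a > 10" collapses to an empty interval (no scan needed),
# "a < 10 AND a < 20" collapses to "a < 10".

def simplify_ranges(predicates):
    """predicates: list of (op, value) with op in {'<', '>'} on one attribute."""
    lower, upper = float('-inf'), float('inf')
    for op, value in predicates:
        if op == '>':
            lower = max(lower, value)
        else:  # '<'
            upper = min(upper, value)
    if lower >= upper:
        return None                      # contradiction: return an empty result set
    simplified = []
    if lower != float('-inf'):
        simplified.append(('>', lower))
    if upper != float('inf'):
        simplified.append(('<', upper))
    return simplified

print(simplify_ranges([('<', 10), ('>', 10)]))   # None  -> empty result
print(simplify_ranges([('<', 10), ('<', 20)]))   # [('<', 10)]
```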
- 04:23Another standard optimization
- 04:27is the so-called predicate push-down that you can see on the
- 04:30top there, where we push down our selection on the attribute
- 04:33B into our sub query, which is a natural join between
- 04:38T1 and T2. We push the selection down into
- 04:41the sub query, and on the bottom you
- 04:45see a simple reformulation
- 04:49or simplification of an equation. These are
- 04:53bread and butter operations of
- 04:56a query optimizer. This is
- 04:58something that every query optimizer does and needs to do
- 05:01in order to optimize your queries.
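To make the push-down idea concrete, here is a toy, purely illustrative sketch in Python: a plan is represented as nested tuples, and a selection that only touches one join input is moved below the join. Table names, attribute names, and the push_down helper are all invented for this example.

```python
# Logical plan before:  select_B( join(T1, T2) )
# Logical plan after:   join( T1, select_B(T2) )

def push_down(plan, table_attributes):
    op = plan[0]
    if op == 'select' and plan[2][0] == 'join':
        _, predicate_attr, (_, left, right) = plan
        if predicate_attr in table_attributes[right]:
            return ('join', left, ('select', predicate_attr, right))
        if predicate_attr in table_attributes[left]:
            return ('join', ('select', predicate_attr, left), right)
    return plan  # nothing to push down

attrs = {'T1': {'a', 'c'}, 'T2': {'b', 'c'}}
before = ('select', 'b', ('join', 'T1', 'T2'))
print(push_down(before, attrs))   # ('join', 'T1', ('select', 'b', 'T2'))
```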
- 05:05Some heuristics that are a little bit more tricky
- 05:10to do are shown here on the top.
- 05:14We have, for example, as we have said already,
- 05:18the strategy that we always execute the filters
- 05:22first. We execute the filters
- 05:25ordered by restrictiveness, so we want to execute the
- 05:29filters first that we consider the most restrictive;
- 05:32afterwards we might do the joins or other
- 05:36operators, but usually you always want to execute the filters
- 05:40first. Another example we have just shown is the predicate
- 05:44push-down, which almost always makes sense,
- 05:48and join re-ordering which we will talk about
- 05:52later in a second. But
- 05:55the important thing here is to recognize that these
- 05:58are heuristics so they don't have to be true
- 06:01but they are a good starting point for every optimizer because
- 06:04they usually hold. You can easily
- 06:08have cases where they don't, but for the vast
- 06:10majority of cases optimizing using these rules will
- 06:15be fine. But imagine an example where
- 06:18we have the world population table and
- 06:20you join this with all the presidents of the United States.
- 06:23This is a highly restrictive join, right? Maybe you have
- 06:27five or ten living presidents,
- 06:29so this join result will be really small. But if you
- 06:33instead first select, say, all the male persons,
- 06:37well, this is not a very restrictive selection,
- 06:39so this is a case where it might be better to execute the join
- 06:42before the selection.
- 06:46But again, usually you do rather well
- 06:49using these heuristics.
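Putting rough numbers on the counterexample above makes the point visible; the figures below are approximate guesses for illustration only.

```python
# A gender filter keeps roughly half of the 8 billion tuples, while a join
# with a handful of presidents keeps only a few tuples. Here the join, not
# the filter, is the restrictive operation, so "filters first" misleads.

WORLD_POPULATION = 8_000_000_000
PRESIDENTS = 5                                 # assumed tiny presidents table

after_gender_filter = WORLD_POPULATION * 0.5   # ~4 billion tuples survive
after_president_join = PRESIDENTS              # at most a few tuples survive

print(after_gender_filter, after_president_join)
```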
- 06:54Here we have an example where we
- 06:56again show you the logical query plan
- 06:59and how we can reformulate the query plan into a logically
- 07:03equivalent plan where we estimate the costs to be lower.
- 07:07What we do here is just what we have discussed
- 07:10before, we do the selections first so the selections here are
- 07:14on state = Hessen and the birth year
- 07:17larger than 2010, and before we join we now do
- 07:22both selections, on the locations table and the world population
- 07:25table. We join both tables afterwards
- 07:29and hopefully we have lower estimated costs.
- 07:34We have also shown the input sizes here:
- 07:37before optimizing the query, we are joining
- 07:40eight billion tuples with one hundred tuples from the
- 07:45actors table. This is just a huge join: if you
- 07:50have seen the join examples yesterday and their run times,
- 07:54this would probably be in the
- 07:57minutes or hours, a join over eight
- 08:00billion tuples, un-optimized. But if we compare the plan
- 08:06now to the right side where we re-order
- 08:09the operators, you can see that we have now a sequential scan
- 08:12in the beginning on eight million tuples.
- 08:14We have seen how fast sequential scans are on dictionary encoded
- 08:18columns with SIMD and so on so this is a comparatively
- 08:23simple and fast operation
- 08:26and we can reduce the input size
- 08:29for a much more expensive operation - the join here.
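As a back-of-the-envelope comparison of the two plans, the sketch below uses the sum of intermediate result sizes as the logical cost; the selectivities and the join selectivity are invented for illustration and are not taken from the slides.

```python
WORLD_POPULATION = 8_000_000_000
LOCATIONS = 100
SEL_STATE = 0.1                    # assumed selectivity of state = 'Hessen'
SEL_BIRTHYEAR = 0.01               # assumed selectivity of birth year > 2010
JOIN_SELECTIVITY = 1 / LOCATIONS   # assumed: one matching location per person

# Plan 1: join first, filter afterwards
join_out_1 = WORLD_POPULATION * LOCATIONS * JOIN_SELECTIVITY
filter_out_1 = join_out_1 * SEL_STATE * SEL_BIRTHYEAR
cost_plan_1 = join_out_1 + filter_out_1

# Plan 2: push the selections below the join
filtered_pop = WORLD_POPULATION * SEL_BIRTHYEAR
filtered_loc = LOCATIONS * SEL_STATE
join_out_2 = filtered_pop * filtered_loc * JOIN_SELECTIVITY
cost_plan_2 = filtered_pop + filtered_loc + join_out_2

print(f"plan 1 (join first):   {cost_plan_1:,.0f} intermediate tuples")
print(f"plan 2 (filter first): {cost_plan_2:,.0f} intermediate tuples")
```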
- 08:34The next step that you usually do is
- 08:37you build your physical query plan. Once you have
- 08:40the query plan that we consider the best plan we have
- 08:43found, or the best plan available, we translate
- 08:46this plan with logical operators into the
- 08:49actual database operators that our database
- 08:52implementation offers us. These might be different
- 08:54joins, for example here we have a hash join and a sort merge join
- 08:57so now we try to find the best possible way to do
- 09:00our join and build a physical query plan. This plan really
- 09:04includes physical operators.
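A much simplified sketch of how such a physical operator choice could be made is shown below; the decision rule and the cost terms are assumptions for illustration, not how Hyrise or any other system actually decides.

```python
def pick_join_operator(left_rows, right_rows, left_sorted, right_sorted):
    if left_sorted and right_sorted:
        # merging two already sorted inputs is roughly linear in their sizes
        return 'sort-merge join', left_rows + right_rows
    # otherwise: build a hash table on the smaller input, probe with the larger
    build, probe = min(left_rows, right_rows), max(left_rows, right_rows)
    return 'hash join', build + probe

print(pick_join_operator(8_000_000, 100, left_sorted=False, right_sorted=False))
print(pick_join_operator(8_000_000, 100, left_sorted=True, right_sorted=True))
```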
- 09:11What we need to have in place for all these
- 09:15estimations, for all these cost estimations are
- 09:18good statistics. Every database relies on
- 09:21the statistics that we have in place, for example, to
- 09:26estimate how expensive my filter
- 09:29on the column first name is. Such statistics
- 09:36include, for example the number of distinct values.
- 09:39This is usually a given for our database, for Hyrise or SAP HANA
- 09:42because every column is dictionary encoded so we have this
- 09:46number already. Another statistic might be the
- 09:49presence or absence of indices,
- 09:52it often makes sense to use indices, for example in cases where
- 09:55a filter is really restrictive, so we want to know if we have
- 09:58these indices. It is also nice to have
- 10:01the value distribution, we
- 10:03come back to that later, top-n values and so on,
- 10:06which helps you to estimate the costs of an operator
- 10:09without executing it first obviously. What we can
- 10:13see here is an example for the world
- 10:17population table and how it could look when implemented: we have the
- 10:21data in the table obviously and then we have this
- 10:24metadata. This metadata includes, first,
- 10:28all the attributes and the types of the attributes,
- 10:32this is what you always need obviously for
- 10:35your table, but then you also have the index columns
- 10:37and the statistics.
- 10:41These statistics include, for example,
- 10:43the min and max, here, in this example it's the birth year so we
- 10:48have a minimum of 1900
- 10:51and the maximum of 2017.
- 10:54It might include distinct counts, so how many distinct values
- 10:57do we have in that particular column.
- 11:00As we have said, usually that's a given if we
- 11:03have dictionary encoding and then we have for
- 11:06example histograms and they are really important
- 11:10for a couple of operators, for example,
- 11:13selections or join estimations.
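One possible, purely hypothetical layout for such per-column statistics, loosely following the slide, could look like the Python dictionary below; the field names and values are illustrative and not Hyrise's or SAP HANA's actual format.

```python
world_population_metadata = {
    "attributes": {"first_name": "string", "country": "string", "birth_year": "int"},
    "indexed_columns": ["country"],
    "statistics": {
        "birth_year": {
            "min": 1900,
            "max": 2017,
            "distinct_count": 118,   # essentially free with dictionary encoding
            "histogram": None,       # could hold an equi-width or equi-depth histogram
        },
    },
}
```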
- 11:15Imagine that you have a query: give me all the
- 11:19people having the first name Martin.
- 11:24You could basically,
- 11:27before executing the query, when you
- 11:29want to estimate the cost, look into the histogram
- 11:32and you might see that it is a rather uniform distribution
- 11:36with a distinct count of maybe 1,000.
- 11:38Then I would expect my result to be one thousandth of
- 11:42the overall table if I select on Martin.
- 11:45But if you now select all the people who are
- 11:47from a particular country, this distribution is
- 11:52absolutely not uniform, so this is highly skewed.
- 11:54There is a huge difference if you have a filter on the country
- 11:57Germany which is a result set of maybe 80
- 12:01million, I guess, and then you have a selection on the U.S. which
- 12:04is four times the size, and then you have a selection on China
- 12:07which I think might be pretty much a quarter
- 12:10of the general population of the world. Basically there is a huge skew in
- 12:13that data and you want to be able to recognize that skew
- 12:17in order to estimate the cost of your operators.
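The sketch below shows, under simplified assumptions, how such statistics could feed a selectivity estimate: with no histogram we fall back to the uniform assumption (table size divided by distinct count), while a histogram lets us recognize skew, for example for a filter on country. All numbers and the estimated_rows helper are invented for illustration.

```python
def estimated_rows(table_rows, distinct_count, histogram=None, value=None):
    if histogram and value in histogram:
        return histogram[value]           # frequency taken from the statistics
    return table_rows / distinct_count    # uniform-distribution fallback

WORLD_POPULATION = 8_000_000_000

# first name: assumed ~1,000 distinct values, roughly uniform
print(estimated_rows(WORLD_POPULATION, 1_000))           # ~8,000,000 expected rows

# country: highly skewed, so a histogram matters
country_histogram = {'Germany': 80_000_000, 'China': 1_400_000_000}
print(estimated_rows(WORLD_POPULATION, 200, country_histogram, 'China'))
print(estimated_rows(WORLD_POPULATION, 200, country_histogram, 'Germany'))
```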
- 12:23So basically,
- 12:26as soon as you have these statistics and you
- 12:29know how to use them, what you want to do is join reordering,
- 12:32because that is one of the most
- 12:35expensive parts of the database: joining, the order
- 12:38of joins and the order in which they are executed.
- 12:41We can use some information here again that
- 12:44the database provides us, for example if we know about
- 12:47a foreign key relationship we can
- 12:51use that information to improve our cost estimation
- 12:55for a join between two tables.
- 12:57We can check if we have previously selected
- 13:01on a uniformly distributed attribute or maybe on a value
- 13:05that is among the top five values.
- 13:07We can use this information
- 13:10to better predict how large or how expensive our join will be.
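Two textbook cardinality estimates that such information enables are sketched below; both are standard estimates rather than the formulas of any specific system, and the numbers are made up.

```python
def fk_join_estimate(referencing_rows, referenced_selectivity=1.0):
    # equi-join along a foreign key: every referencing tuple matches exactly one
    # referenced tuple, scaled by any selection already applied to the referenced side
    return referencing_rows * referenced_selectivity

def equi_join_estimate(rows_r, rows_s, distinct_r, distinct_s):
    # classic fallback estimate for a general equi-join:
    # |R| * |S| / max(distinct(R.a), distinct(S.b))
    return rows_r * rows_s / max(distinct_r, distinct_s)

print(fk_join_estimate(8_000_000_000, referenced_selectivity=0.05))
print(equi_join_estimate(8_000_000_000, 200, distinct_r=200, distinct_s=200))
```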
- 13:18Just a simple example of why we want to do this join re-ordering:
- 13:23if you join these
- 13:25three relations here, there is the so-called
- 13:29join associativity, which means that you can re-order these joins
- 13:34in any form that you want, at least if it's an inner join.
- 13:38Instead of joining relations one and two first
- 13:41and then joining it with relation three, you could also
- 13:44switch the order and join two with three
- 13:49first for example, because these joins might be faster to do.
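A toy calculation can show why the order matters even though the result is the same; the cardinalities and join selectivities below are made up for illustration, and the logical cost is again just the sum of intermediate result sizes.

```python
R1, R2, R3 = 1_000_000, 1_000_000, 1_000
sel_12, sel_23 = 1e-6, 1e-6            # assumed join selectivities

def join_output(outer, inner, selectivity):
    return outer * inner * selectivity

# Order A: (R1 join R2) join R3
out_12 = join_output(R1, R2, sel_12)
cost_a = out_12 + join_output(out_12, R3, sel_23)

# Order B: R1 join (R2 join R3)
out_23 = join_output(R2, R3, sel_23)
cost_b = out_23 + join_output(R1, out_23, sel_12)

print(f"(R1 join R2) join R3: {cost_a:,.0f} intermediate tuples")
print(f"R1 join (R2 join R3): {cost_b:,.0f} intermediate tuples")
```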
- 13:55What have we learned about query optimization?
- 13:58It becomes increasingly important: if you look at OLTP
- 14:03systems, what you usually do there in 90%
- 14:06of the cases is access via the primary key,
- 14:10you access your primary key index, because there usually is one,
- 14:14and then that's pretty much done. But now, with all the analytical
- 14:17systems and mixed workloads where you execute
- 14:20transactional queries and complex analytical queries,
- 14:23the query optimization or the optimizer becomes increasingly
- 14:26important.
- 14:29However, as the challenging tasks that we have
- 14:32shown on the last slides indicate, pretty much all databases are
- 14:35still not very accurate in estimating the costs,
- 14:38and optimizing the queries is a computationally really expensive
- 14:43operation. Usually
- 14:46no database will give you the perfect plan; they just stop at
- 14:49some point, because there are just too many alternatives,
- 14:52and try to give you the best estimate, the best plan they
- 14:56can find in a certain amount of time.
About this video
In this lecture we continue the series about query processing and have a closer look at the optimizations conducted by the database's query optimizer. This includes the selection of suitable execution orders, composed in so-called query plans.