This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.
- 00:00Now that we have our data in our program, we can
- 00:04simply analyze and visualize this data.
- 00:08That simply helps us understand what we are doing and what we are
- 00:12processing the data for. Even though we will apply a machine learning model in the end,
- 00:16it is nevertheless important that we understand the data.
- 00:21And with our data we can get started right away, because we have the longitude and the
- 00:26latitude, which means it is ultimately geodata that can be visualized very nicely.
- 00:33We can simply visualize latitude and longitude in a so-called scatter plot.
- 00:39That is, the data is plotted with longitude on the x-axis and
- 00:46latitude on the y-axis.
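The scatter plot described here can be sketched roughly as follows. This is a minimal illustration and not the course's own notebook: the data frame name `housing`, the column names `longitude` and `latitude`, and the five sample rows are assumptions made for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tiny made-up sample standing in for the course's housing data frame
# (assumed column names: longitude, latitude).
housing = pd.DataFrame({
    "longitude": [-122.2, -121.9, -118.3, -117.1, -119.8],
    "latitude": [37.8, 37.3, 34.1, 32.7, 36.7],
})

fig, ax = plt.subplots()
# One point per residential area: longitude on x, latitude on y.
ax.scatter(housing["longitude"], housing["latitude"], alpha=0.4)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

In a notebook the figure is shown automatically; in a script, `plt.show()` would display it.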
- 00:48And we can already see roughly how the residential areas are distributed.
- 00:54But that doesn't really help us much yet.
- 00:56That is why we enrich the visualization step by step.
- 01:01A good next step is simply to add to these individual points
- 01:05information such as the population, i.e. the population density,
- 01:10or the house value.
- 01:12This is also relatively easy.
- 01:14We can visualize the data and now see in our visualization not only where these
- 01:20residential areas are, but also how much they cost and how many
- 01:25people live in each neighborhood.
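One possible way to add population and house value to the same scatter plot is to map population to marker size and house value to color. The sketch below assumes the column names `population` and `median_house_value` and uses made-up sample rows; the scaling factor for the marker size and the color map are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample rows; column names are assumptions for this sketch.
housing = pd.DataFrame({
    "longitude": [-122.2, -121.9, -118.3, -117.1, -119.8],
    "latitude": [37.8, 37.3, 34.1, 32.7, 36.7],
    "population": [1200, 3400, 5600, 2100, 800],
    "median_house_value": [350000, 280000, 420000, 390000, 90000],
})

fig, ax = plt.subplots()
points = ax.scatter(
    housing["longitude"],
    housing["latitude"],
    s=housing["population"] / 100,    # marker area grows with population
    c=housing["median_house_value"],  # color encodes the house value
    cmap="jet",                       # blue = low values, red = high values
    alpha=0.4,
)
fig.colorbar(points, ax=ax, label="Median house value")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```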
- 01:27And here you can see quite well that along this edge there are apparently more expensive
- 01:34homes than up here.
- 01:37That may already help us quite a lot.
- 01:40However, we can take another step and then visualize the data even better.
- 01:44Now, since we know this is geospatial data from California, we can
- 01:49find a picture of California that matches these coordinates
- 01:54and place it underneath the points.
- 01:56And once you have this picture, we can visualize that very easily.
- 02:04I have a little script prepared for this, which we import here.
- 02:09And if we visualize it now, we see exactly where on the
- 02:15map these residential areas are.
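The script used in the video is not shown in the transcript. A rough sketch of the general idea, drawing a picture underneath the points with matplotlib's `imshow` and its `extent` argument, might look like this. The placeholder image, the geographic bounds in `extent`, and the sample coordinates are all assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for the map picture. In practice it would be loaded from a
# file, e.g. with plt.imread(...) on a California map image (hypothetical).
california_img = np.ones((10, 10, 3))

# Assumed geographic bounds of the picture:
# (longitude min, longitude max, latitude min, latitude max).
extent = (-124.6, -113.8, 32.4, 42.1)

# Made-up sample coordinates of residential areas.
longitude = [-122.2, -121.9, -118.3, -117.1, -119.8]
latitude = [37.8, 37.3, 34.1, 32.7, 36.7]

fig, ax = plt.subplots()
# Draw the picture first so the points end up on top of it.
ax.imshow(california_img, extent=extent, alpha=0.5)
ax.scatter(longitude, latitude, alpha=0.6)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

The points only line up with the map if `extent` matches the area the picture actually covers, so these bounds have to be checked against the image that is used.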
- 02:18Now we can gradually start to understand better why the
- 02:23neighborhoods down here along the edge are more expensive than those at the top.
- 02:26It is probably simply because they are closer to the sea, and homes by the sea are more expensive.
- 02:33This could be a first good insight
- 02:38into our data: the closer residential areas are to the sea, the more expensive they are.
- 02:45What we have also just seen, if I go one step back,
- 02:50is that expensive neighborhoods seem to be grouped together more often, i.e. they are close to each other.
- 02:56That, too, is simply a piece of information.
- 03:00So now we have two pieces of information that we were able to derive from our visualization.
- 03:05Now that we know this, we can go even deeper into our analysis and
- 03:11understand our data better. What we can do now:
- 03:14when we look at our data and know which attributes we have, we may
- 03:18come across ocean proximity, which translates as proximity to the sea.
- 03:25We look at these values and see that our data frame contains categorical values
- 03:31such as less than an hour to the sea, or inland.
- 03:39So we can now understand how these houses are categorized
- 03:44and can probably work with that quite well.
- 03:46But we also see right away that the value ISLAND, i.e. island,
- 03:53occurs only five times, and we should start thinking about that right away.
- 03:59What we can easily do now is group by these values
- 04:05and compute the median value.
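Counting the categories and computing the median per group can be sketched with pandas as follows. The column names `ocean_proximity` and `median_house_value`, the category labels, and the sample values are assumptions for this example and are not taken from the course data.

```python
import pandas as pd

# Made-up sample; category labels and column names are assumptions.
housing = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "<1H OCEAN", "<1H OCEAN",
                        "NEAR OCEAN", "NEAR BAY", "ISLAND"],
    "median_house_value": [100000, 120000, 240000, 260000,
                           280000, 270000, 400000],
})

# How often does each category occur? Rare categories show up here.
counts = housing["ocean_proximity"].value_counts()

# Median house value per category.
median_by_group = (
    housing.groupby("ocean_proximity")["median_house_value"].median()
)

print(counts)
print(median_by_group)
```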
- 04:07And there we can see something very exciting that we have also just seen
- 04:10in our data.
- 04:12Namely, when we look at the median price on this grouped data,
- 04:17we see that prices inland are significantly lower
- 04:24than near the sea, i.e. less than an hour away,
- 04:27or close to a bay, or, as a further category, right next to the sea.
- 04:31We would have to look into the data again to see how near the sea
- 04:34and under an hour differ.
- 04:36Maybe near the sea means something like five or ten minutes to the sea.
- 04:40But we can already see fairly well that there is a difference.
- 04:44The value ISLAND has the highest median price, but again,
- 04:50caution: it is based on only five data points.
- 04:53Whether this is particularly meaningful is something we have to question.
- 04:58And what we can do very nicely now, and I'd like to take a moment for that,
- 05:03is to look at this visualization in stages, that is,
- 05:08on a data set filtered by ocean proximity.
- 05:12For that, we simply write a small helper function that filters our data frame accordingly,
- 05:17so that we can rebuild our visualization incrementally.
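The helper function itself is not shown in the transcript. One way such a filter could look is sketched below; the function name `filter_by_proximity`, the column name `ocean_proximity`, and the category labels are hypothetical.

```python
import pandas as pd

def filter_by_proximity(df, categories):
    """Return only the rows whose ocean_proximity is in `categories`."""
    return df[df["ocean_proximity"].isin(categories)]

# Made-up sample; category labels are assumptions.
housing = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "<1H OCEAN",
                        "<1H OCEAN", "NEAR OCEAN"],
    "median_house_value": [100000, 120000, 240000, 260000, 280000],
})

# Build the picture up in stages: first inland only, then add one category.
stage_1 = filter_by_proximity(housing, ["INLAND"])
stage_2 = filter_by_proximity(housing, ["INLAND", "<1H OCEAN"])
```

Each stage can then be passed to the same plotting code, so the visualization grows category by category.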
- 05:23We start by displaying only the entries that have the category INLAND.
- 05:28When we do that, we see that there are actually almost only blue dots.
- 05:35And blue on our color scale simply means low values, i.e. cheap houses.
- 05:43We can do the same for less than one hour to the ocean.
- 05:46Here we see some more blue, but also more reddish colors,
- 05:54which in turn mean higher prices.
- 05:57And the same for near ocean.
- 06:00Here too we see that it seems to be getting more and more red.
- 06:05In other words, our prices keep getting higher.
- 06:11We can use this visualization to understand our data better and better, and also
- 06:17to play with it and see step by step how such visualizations can be built.
- 06:23And it again supports the idea that the closer a house is to the sea,
- 06:29the more expensive it tends to be.
- 06:32Let's come back to the two categories that are still outstanding. One is near bay.
- 06:38Here we see that there is apparently one specific bay where all these houses
- 06:43cluster. That is, this seems to be the bay referred to in our data.
- 06:49And, as just announced, we still have ISLAND as an outlier.
- 06:55When we plot that data, we need to look very carefully
- 07:01to find where the points are, and we see that this seems to be where our data is.
- 07:08Now we really have to ask whether that is enough to make any statement about a house
- 07:12that is on an island.
- 07:14Because apparently the prices there are very different.
- 07:17That means we may already have a first indication that we may
- 07:21not want to use this data for training, because it might distort our training.
- 07:26Perhaps our solution becomes a hybrid solution:
- 07:28for neighborhoods where we have enough data, we confidently use our model, but for data
- 07:36such as houses on an island, we would rather ask a person again.
- 07:42So we can start thinking right here about how we want to deal with this.
- 07:47For now, we simply decide that for ISLAND,
- 07:54i.e. islands, where we really only have five values, we just remove them from our data.
- 07:59We don't know enough about them and don't want to use them for our training.
- 08:04That is one possible decision.
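Removing the island rows could look like the following sketch. The label `ISLAND`, the column names, and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up sample; the label "ISLAND" and the column names are assumptions.
housing = pd.DataFrame({
    "ocean_proximity": ["INLAND", "ISLAND", "NEAR BAY", "ISLAND", "NEAR OCEAN"],
    "median_house_value": [110000, 400000, 270000, 450000, 280000],
})

# Keep every row except the island ones and renumber the index.
housing_no_island = (
    housing[housing["ocean_proximity"] != "ISLAND"].reset_index(drop=True)
)
```

The original `housing` frame stays unchanged, so the removed rows remain available if the decision is revisited later.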
- 08:10What we now take away from what we have learned is that, in theory, one could
- 08:15already make a forecast based purely on
- 08:19proximity to the sea, which has nothing to do with machine learning.
- 08:23So we are not at our model yet, but it is something you could use anyway.
- 08:28And we just want to try that by taking a constant value per group.
- 08:33For example, as soon as a house is inland,
- 08:35we just take the median price of inland houses.
- 08:38We can simply apply that to our data.
- 08:43And if you scroll a little to the right, we can compare the two and see that we can
- 08:48still be far off, for example for an inland house.
- 08:52Here we have one that is significantly more expensive than our median.
- 08:56That is, we would have made a very bad forecast here.
- 08:59Nonetheless, the point is to play around with the data, to form hypotheses,
- 09:05and perhaps to build a baseline, a foundation against which you can then compare
- 09:09and keep getting better.
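A group-median baseline of this kind can be sketched in a few lines of pandas. The column names, the new columns `baseline_prediction` and `abs_error`, and the sample values are assumptions; the mean absolute error is used here only as one simple way to put a number on how far off the baseline is.

```python
import pandas as pd

# Made-up sample; column names and labels are assumptions.
housing = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND",
                        "<1H OCEAN", "<1H OCEAN"],
    "median_house_value": [100000, 120000, 450000, 240000, 260000],
})

# Baseline: every house is predicted with the median of its own category.
housing["baseline_prediction"] = (
    housing.groupby("ocean_proximity")["median_house_value"]
    .transform("median")
)

# Compare prediction and actual value row by row.
housing["abs_error"] = (
    housing["median_house_value"] - housing["baseline_prediction"]
).abs()

mean_abs_error = housing["abs_error"].mean()
print(housing)
print("Mean absolute error of the baseline:", mean_abs_error)
```

The third inland house in the sample is far above its group median, which mirrors the bad forecast mentioned in the video. A later model can be judged by whether it beats this error.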
- 09:14To note this again: creating a kind of baseline like that
- 09:19can't really hurt.
- 09:20Because then you already have something to build on.
- 09:26What we want to look at now, before we go to the AI training, is the
- 09:30correlation analysis. So far we have looked at our categorical data, the proximity to the sea.
- 09:35Now we want to look at what it looks like for numeric data.
- 09:39For this you can calculate the so-called Pearson correlation coefficient.
- 09:44This is one of several possible coefficients, and with it you try to understand the following:
- 09:50if we have a target variable Y, for example,
- 09:55and we have other variables X, and X increases or decreases by a certain value,
- 10:02how does this affect our target variable Y?
- 10:05And the higher this correlation coefficient is in absolute value, the stronger the relationship.
- 10:11What you have to watch out for: this only captures a linear relationship.
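The limitation to linear relationships can be illustrated with a small made-up example: a perfectly linear dependence gives a Pearson coefficient of 1, while a perfectly quadratic dependence on symmetric inputs gives a coefficient of about 0, although the second variable is fully determined by the first. The column names are invented for this sketch.

```python
import pandas as pd

# Made-up data: one linear and one quadratic function of x.
df = pd.DataFrame({"x": [-2, -1, 0, 1, 2]})
df["linear"] = 3 * df["x"] + 1
df["quadratic"] = df["x"] ** 2

# Pearson correlation of every column with x.
corr_with_x = df.corr(method="pearson")["x"]
print(corr_with_x)
```

On a data frame with only numeric columns, the same `corr()` call followed by selecting the target column gives the correlation of every attribute with the target variable.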
- 10:16We look at it anyway, and we can see very quickly, for example,
- 10:22and it also makes sense, that the more people in a residential area earn, the more expensive
- 10:27these neighborhoods are on average.
- 10:29Because then you are probably simply in a more expensive residential area.
- 10:35We can also visualize that.
- 10:39You can see there the median income in thousands on the x-axis
- 10:45and our target variable on the y-axis.
- 10:47And again, you can certainly see a trend.
- 10:49But you also have to recognize that there is a large scatter,
- 10:54which means income does not give us a perfect explanation of our data.
- 10:59What we see here again, and what we find again and again in our data
- 11:04when we plot our target variable, is that up here at 500,000 there are a lot of values.
- 11:09This comes from the cap that we already identified earlier.
- 11:12So if we really want to put this project into practice, we have to think about
- 11:18how to deal with such values.
- 11:22To note this here again:
- 11:25the more people earn, the higher the price can be.
- 11:29But above all, an upper limit appears to have been set at 500,000 euros.
- 11:35In this case we choose to keep the data and simply
- 11:40take it as it is, but in a practical project we might have to consider whether this
- 11:44should be handled differently.
- 11:46Now we can still consider whether, from the attributes we have, we can
- 11:52build additional attributes that may explain a correlation.
- 11:55For example, we could divide total bedrooms by total rooms
- 12:00to get the share of bedrooms.
- 12:02Perhaps houses that have proportionally more bedrooms
- 12:07are more expensive homes.
- 12:09Or people per household.
- 12:13So we can apply operations to our data to derive additional attributes,
- 12:17test them, and see whether we can work with them.
- 12:20We can calculate that on our data in a very simple way and then again
- 12:27display the correlation vector.
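Deriving the two attributes mentioned here and recomputing the correlations with the target could look like this sketch. The column names (`total_rooms`, `total_bedrooms`, `population`, `households`, `median_house_value`), the new names `bedrooms_ratio` and `people_per_household`, and all sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up numeric sample; column names are assumptions.
housing = pd.DataFrame({
    "total_rooms": [1000, 2000, 1500, 3000],
    "total_bedrooms": [200, 500, 300, 900],
    "population": [800, 1500, 900, 2400],
    "households": [400, 500, 300, 600],
    "median_house_value": [300000, 200000, 260000, 120000],
})

# Derived attributes: share of bedrooms and people per household.
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_household"] = housing["population"] / housing["households"]

# Correlation of every attribute with the target, strongest positive first.
corr_with_target = (
    housing.corr()["median_house_value"].sort_values(ascending=False)
)
print(corr_with_target)
```

A strongly negative value at the bottom of this list is just as informative as a strongly positive one at the top, since the absolute value is what counts.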
- 12:31And here we see, for example, that the bedroom ratio apparently has a relatively good
- 12:38correlation value that we can at least use and try out.
- 12:42Maybe our AI model works well with it.
- 12:45So here too, simply try things with the data, play around a little, and see
- 12:50what else could provide good information for our model.
- 12:55And with that we have analyzed our data quite nicely, have a fairly good
- 13:00picture of our data, and can now, with relatively good confidence,
- 13:05build our first AI model in the next video.
About this video
- On GitHub, we have compiled and prepared all materials for the practical units for you.