2.3 First Look into the Data

This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.

Time effort: approx. 13 minutes

00:00Welcome to the unit "First look into the data" for the movie recommendation system application in week 2.
00:07As a data scientist, it is common to simply dive into the data to get a feeling
00:12for the data:
00:15what properties we have and how many observations are available.
00:19That is what we want to do together in this unit.
00:24First, of course, we have to download the dataset.
00:28For this week we have selected the following dataset: The Movies Dataset, a dataset with metadata about 45,000
00:36movies and over 26 million ratings from over 270,000 users.
00:43The dataset comes from the data science community platform Kaggle, where there are datasets, code examples, and
00:51so-called data science challenges.
00:55To download the dataset, we can proceed either manually or via the API directly in code.
01:04For both options, here is a quick guide.
01:07Note, however, that you need to be registered with Kaggle so that you can enter your credentials, or
01:15your access token, when you run the code.
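
The API route mentioned above might look roughly like the following sketch. It assumes the official `kaggle` Python package is installed and that an access token is available. The dataset identifier `rounakbanik/the-movies-dataset` and the target folder `data` are assumptions that are not stated in the video, and the helper name `download_movies_dataset` is hypothetical.

```python
from pathlib import Path

# Assumed Kaggle dataset identifier; it is not stated in the video.
DATASET_SLUG = "rounakbanik/the-movies-dataset"


def download_movies_dataset(target_dir: str = "data") -> Path:
    """Download and unzip the dataset into target_dir via the Kaggle API."""
    # Imported lazily because importing the kaggle package already
    # looks for credentials and fails if none are configured.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    # Reads ~/.kaggle/kaggle.json or KAGGLE_USERNAME / KAGGLE_KEY
    api.authenticate()

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    api.dataset_download_files(DATASET_SLUG, path=str(target), unzip=True)
    return target
```

Calling `download_movies_dataset()` needs network access and valid credentials; the manual download from the website leads to the same files.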
01:23First of all, let's get an overview of all files.
01:28For our movie recommendation system we are mainly interested in two files:
01:34the movies metadata CSV and the ratings CSV.
01:39CSV stands for comma-separated values, a file format that is widely used in machine learning and data science.
01:48To get a feeling for the other data, we switch to the directory with the
01:55files and list all files.
02:00That means we change into the respective directory here and print all files in this directory.
02:09Here we see the movies metadata CSV and the ratings CSV.
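
Listing the files of a directory could be sketched as follows. The helper `list_data_files` is hypothetical, and the file names `movies_metadata.csv` and `ratings.csv` are assumed spellings of the two files named in the video. The demo creates empty stand-in files in a temporary directory so that it runs without the real download.

```python
import os
import tempfile


def list_data_files(directory: str) -> list[str]:
    """Return the sorted names of all regular files in the directory."""
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


# Demo with empty stand-in files instead of the real dataset
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("ratings.csv", "movies_metadata.csv"):
        open(os.path.join(demo_dir, name), "w").close()
    print(list_data_files(demo_dir))
```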
02:15Let's start with the movie metadata.
02:18We want to use it for content-based recommendation.
02:25So the first step, of course, is to take this data and load it into main memory.
02:33We do this with the following code and then have the whole dataset in a pandas data frame,
02:41which is simply displayed in tabular form.
02:46So let's look at what properties this file has and what features and dimensions we have.
02:55For this we can use the command .head() to output the top elements of this data frame.
03:01That is what we want to do.
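
A minimal sketch of this loading step, assuming pandas. To stay runnable without the download, it reads a few made-up rows from an in-memory string; the column names are assumptions modeled on the features described in the video, and the values are invented.

```python
import io

import pandas as pd

# Made-up stand-in rows; the real file is far larger and has more columns.
SAMPLE_CSV = """adult,budget,original_language,title
False,1000,en,Movie A
False,0,fr,Movie B
False,2500,en,Movie C
"""

# With the real data this would be something like
# pd.read_csv("movies_metadata.csv") (file name assumed).
movies = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(movies.head(2))  # first two rows in tabular form
print(movies.shape)    # (number of rows, number of columns)
```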
03:04For example, you can see that for each movie it is specified whether it is an adult film, whether the film belongs to a series
03:13and, if so, which one, what the budget of the film is, which genres have been assigned to the movie, what the original language of the
03:23film, or the originally spoken language in the film, is, what the title is, and many other features.
03:30Since we have over ten features here, we want to make the first overview a little easier
03:38and use an existing library to give us a short overview, or a short report, of all features.
03:48This will now take a few seconds.
03:50But then we get an overview of all features and properties in this dataset.
04:07In this report we get a
04:12general overview of all statistics and all properties of the various dimensions.
04:18In total we have over 21 different variables, approx. 45,000 observations and, for example, 67,000
04:29missing cells in this table.
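
The video does not name the reporting library. As a rough stand-in, the three headline numbers of such a report (variables, observations, missing cells) can be computed with plain pandas. The helper `quick_overview` is hypothetical and the demo frame is made up.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Summarize a data frame: variables, observations, missing cells."""
    return {
        "variables": df.shape[1],
        "observations": df.shape[0],
        "missing_cells": int(df.isna().sum().sum()),
    }


# Tiny made-up frame with two missing cells
demo = pd.DataFrame({
    "title": ["Movie A", "Movie B", None],
    "budget": [1000, None, 0],
})
print(quick_overview(demo))
```

Note that a budget of 0 is not counted as missing here; it is an ordinary value as far as pandas is concerned.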
04:33As we scroll down further, we get individual statistics for each variable.
04:39For example, whether a movie is part of a film series, such as the James Bond film series.
04:48Or, for example, what the budget is.
04:50And what we see here is that for a lot of films the budget is specified as 0.
04:57We can also see here very quickly what the dominant genres are.
05:03And here we see that there are many dramas; the second most common category is comedy.
05:13We also see that a large part of the films has English as the original language.
05:21Next come French and Italian.
05:28Another interesting feature is, for example, the production company: Metro-Goldwyn-Mayer, Warner Bros.
05:42or Paramount Pictures are very well-known production companies.
05:46The last interesting feature we have is the average rating.
05:55Here we see how many ratings there are, or what the average rating is.
06:00Here we can also have individual statistics shown for exactly this variable.
06:08For example, what the standard deviation is, what the mean and the median are, and how skewed the distribution is.
06:16And we can also look at the histogram,
06:19that is, the frequency of all ratings in this dataset.
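
The statistics named above can also be computed directly with pandas, as in this sketch. The values are made up, and the column name `vote_average` is an assumption.

```python
import pandas as pd

# Made-up average ratings standing in for the real column
vote_average = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="vote_average")

stats = {
    "mean": vote_average.mean(),
    "median": vote_average.median(),
    "std": vote_average.std(),    # sample standard deviation (ddof=1)
    "skew": vote_average.skew(),  # 0 for a symmetric distribution
}
print(stats)

# Frequency of each value, i.e. the data behind a histogram
print(vote_average.value_counts().sort_index())
```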
06:27You are welcome to explore this report further.
06:31There are a few more exciting things to discover.
06:36Right. We now have a rough overview of all features.
06:42Now, of course, we have to choose which ones to use for our content-based recommendation.
06:49Potentially, the following features would be candidates for our content-based recommendation.
06:56However, we will mainly focus on the so-called overview.
07:01In German, the overview means something like a short description of the film.
07:06Of course we could also build a content-based recommendation on the actors, the production company, the
07:12original language or the genre.
07:14However, we will leave these out in our application.
07:19The short description is available in the overview column.
07:23We will now store these in a list and look at an example of what that might look like.
07:32In position 10 of our list you find the film GoldenEye, a film from the James Bond series.
07:39And to output the information, or brief description, of this film, we must access the list at
07:48index 9.
07:49Index 9 because lists in Python, as in many other programming languages, start at index 0.
07:58Thus, the tenth position has index 9.
08:02The title of the film is GoldenEye, and the description gives a short overview of the movie GoldenEye.
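
The zero-based indexing described above, as a small sketch. The list contents are made-up placeholders; with pandas, such a list might be built with something like `movies["overview"].tolist()`, where the frame and column names are assumptions.

```python
# Made-up placeholders; in the video the list holds the real overview texts
overviews = [f"Short description of movie {n}" for n in range(1, 11)]

tenth = overviews[9]   # index 9 is the tenth element
print(tenth)           # Short description of movie 10
print(len(overviews))  # 10, so valid indices run from 0 to 9
```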
08:12To get a rough overview of possible topics and of the words that occur frequently in this dataset,
08:19we are going to create a word cloud,
08:23which contains all the words in these short descriptions, scaled by how often they occur.
08:30That is, words displayed in large type occur very often.
08:34Here we see that the words "her", "of" and very many other words like "is", "to" or "that" occur often.
08:45These are not really descriptive of the content,
08:48since they are mainly filler words that occur very often. And this is exactly a topic that we will also
08:55look at in unit 2.4 on preprocessing: for example, how can I preprocess text
09:03so that I continue with only the relevant words?
09:08We will take a brief step ahead here and do a stop word removal, that is, remove those words that do not really
09:17contribute to the meaning, such as "is", "that", "her", so-called filler words.
09:23This may now take a few seconds, since from each short description we remove the words that are
09:32contained in a stop word list for the English language.
09:38The hope here, of course, is that we then see more meaningful words, which describe the content a little better.
09:48And that is indeed the case.
09:50We often find the words life, young, women, love and family.
09:56This is quite consistent with the observation we made before, namely that the genres drama and comedy occur a lot.
10:06So here we get a rough feeling for what topics occur
10:11in these movies.
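
The effect of stop word removal on word frequencies can be sketched with the standard library alone. The stop word list here is a small hand-picked assumption, whereas the video uses a ready-made list for English. The two descriptions are made up, and `word_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Small hand-picked stop word list, an assumption for this sketch
STOP_WORDS = {"a", "the", "is", "to", "that", "of", "her", "his", "and", "in"}


def word_counts(texts, stop_words=frozenset()):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


descriptions = [
    "A young woman has to choose between love and her family.",
    "The story of a family and the life that is ahead of her.",
]

# Without filtering, filler words dominate the top of the ranking
print(word_counts(descriptions).most_common(3))
# With filtering, content words such as "family" move to the top
print(word_counts(descriptions, STOP_WORDS).most_common(3))
```

A word cloud then only has to map these counts to font sizes.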
10:17As a second type of recommendation, we also look at the collaborative filtering method.
10:22In this kind of recommendation system, we look only at the users and their ratings
10:29and how those two play together.
10:32Of course, we must read in the file here as well.
10:38That means we also read the ratings CSV into a data frame.
10:44Here too, we use the command .head() to output the first few elements of this data frame.
10:53Here we see that there are significantly fewer features.
10:56There is only a user ID, a movie ID, the rating as a number, and a timestamp of when this rating was given.
11:07If we now want to know how many users there are and how many movies were actually rated, we can
11:13do the following: we look at the number of unique values.
11:21The method name stands for "number of unique values".
11:24And here we find that there are about 45,000 unique movies and about 270,000 unique users.
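
Counting unique users and movies might look like this with pandas, assuming the method meant in the video is `nunique`. The ratings are made up, and the column names `userId`, `movieId`, `rating` and `timestamp` are assumptions.

```python
import pandas as pd

# Made-up ratings; the column names are assumed
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 10],
    "rating": [4.0, 3.5, 5.0, 0.5, 4.0],
    "timestamp": [1000, 1010, 1020, 1030, 1040],
})

print(ratings.head())
print(ratings["userId"].nunique())   # number of distinct users
print(ratings["movieId"].nunique())  # number of distinct rated movies
```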
11:36Since the rating options are handled differently on each rating platform,
11:42for example whether you can give half stars, or what the range of ratings is,
11:47let's take a closer look.
11:52We see here that this rating platform, or this dataset, allows ratings from 0.5
12:00to 5.0 in steps of 0.5.
12:06With that in mind, let's look at the histogram showing us how often the individual
12:13ratings were given in this dataset.
12:17Here we see that the rating 0.5, for example, was given very rarely, whereas the ratings 4.0,
12:26as well as 5.0 or 3.0, were given significantly more frequently.
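
The set of possible rating values and their frequencies can be inspected as in the following sketch, again with made-up numbers rather than the real ratings.

```python
import pandas as pd

# Made-up rating values standing in for the real rating column
ratings = pd.Series([4.0, 3.0, 5.0, 0.5, 4.0, 3.0, 4.0], name="rating")

# Which rating values occur at all?
rating_values = sorted(ratings.unique())
print(rating_values)

# How often was each value given? This is the data behind the histogram.
frequencies = ratings.value_counts().sort_index()
print(frequencies)
```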
12:32That was the first look into the data.
12:36We hope that this has given you at least a rough feeling for the data and the data quality.
12:42In the next unit, we will devote ourselves to the additional topic of data preprocessing, so that in the unit after that
12:50we can already start with the recommender systems
12:55for movies. Have fun.

About this video

On GitHub we have collected and prepared all materials for the practical units for you.
Erratum: from approx. 09:08 min, it should read "Worte" (words) instead of "Woche" (week).