This video belongs to the openHPI course Künstliche Intelligenz und Maschinelles Lernen in der Praxis.
- 00:00 In this video we want to look at how to analyze texts
- 00:04 and, for example, distinguish positive from negative comments.
- 00:09 This is typically called sentiment analysis, and in this video we want to apply it
- 00:15 to movie reviews. That means we have a relatively large number of movie reviews
- 00:20 that are already labeled for us,
- 00:22 so for which we already know whether they are positive or negative reviews.
- 00:33 And we want to do that with AI models now, in fact with several models: we take this data,
- 00:34 train the models on it, and are then able to make this prediction automatically
- 00:39 for new reviews. And we start, as usual, by importing our various libraries.
- 00:46 One note right up front, so you are not surprised:
- 00:49 the cells that would typically run live in the video have already been executed in advance,
- 00:55 simply because some of the models have rather long runtimes and we did not want
- 01:00 to stretch the video artificially.
- 01:01 That is why everything has already been executed.
- 01:04 What we import here now is quite typical.
- 01:07 You know the pandas library very well by now;
- 01:10 we will use it again, just like NumPy, and for the different models
- 01:14 then scikit-learn.
- 01:16 And then some more libraries. For example collections, which is a
- 01:22 standard Python library that helps us with a few functions quite nicely.
- 01:26 Then things like WordCloud or Seaborn for a couple of data visualizations. NLTK and spaCy,
- 01:33 which are standard language processing libraries, and Torch. Torch is a library
- 01:40 with which you can build artificial neural networks very nicely.
- 01:43 And since we actually want to build our own,
- 01:45 we use it here.
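The transcript does not show the import cell itself, but based on the libraries mentioned, it might look roughly like this; treat it as a sketch rather than the notebook's literal code:

```python
import collections

import numpy as np
import pandas as pd
import seaborn as sns
import nltk
import spacy
import torch
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
```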
- 01:47 For these language libraries, you typically need quite a lot of additional data,
- 01:52 which we therefore often still have to download.
- 01:55 That is why this cell produces a relatively long output.
- 01:57 But that is not so important now;
- 01:57 we can simply ignore it for the moment.
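The exact download calls are not shown in the video; the specific resources below are assumptions based on typical NLP setups:

```python
import nltk

# Common NLTK resources; the long output mentioned above comes from calls like these.
nltk.download("stopwords")
nltk.download("punkt")

# spaCy models are usually downloaded once on the command line, for example:
#   python -m spacy download en_core_web_sm
```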
- 02:01 And then, as usual, we first need to set up our file paths.
- 02:08 This means telling the program where the data is located.
- 02:11 But again, there is nothing spectacular here.
- 02:13 These are very simple paths, and we just run the cell.
- 02:18 It is not particularly interesting now.
- 02:22 It becomes more interesting when we read the data in and simply look
- 02:25 at what is in there.
- 02:27 And unlike, for example, the house price project, we do not have a whole lot of columns here.
- 02:32 We have two: first the text column, which contains our movie reviews
- 02:37 in free-text format, and the label column, which simply indicates whether
- 02:42 it was a positive or a negative comment,
- 02:44 where one stands for positive and zero for negative.
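Reading such a file with pandas might look like this; the file name is an assumption for illustration:

```python
import pandas as pd

# Hypothetical file name; the real path comes from the file paths set up above.
df = pd.read_csv("movie_reviews.csv")
df.head()  # two columns: "text" (the review) and "label" (1 = positive, 0 = negative)
```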
- 02:49 And now, if we take a look at how these labels are distributed, that is, how many
- 02:55 positive and how many negative comments we have, which is a very interesting
- 03:00 piece of information, then we see very quickly
- 03:03 that they are distributed equally, 50/50. That is a good piece of information for us,
- 03:09 because it means we have a balanced data set.
- 03:12 And balanced data is typically easier to train on than strongly imbalanced data.
- 03:19 This is written down here again for you, if you want to look it up in the future.
- 03:24 As a rule of thumb: the more imbalanced the data, that is, the more data
- 03:31 you have from one class in contrast to
- 03:35 another class, the more complex training typically becomes, until you eventually end up in a scenario like the one in outlier detection.
- 03:42 But with 50/50, we actually have a very, very good case here,
- 03:46 a nice distribution for us.
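A quick way to check this distribution, assuming the DataFrame from above:

```python
# Relative frequency of each class; a balanced set shows roughly 0.5 / 0.5
df["label"].value_counts(normalize=True)
```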
- 03:48 And now that we know that, we can simply take a look at the data itself,
- 03:53 just to get a feeling for it.
- 03:56 And if we do that, if we simply look at the very first review,
- 04:00 we see very quickly
- 04:02 that this is not just a single sentence, but quite a lot of text.
- 04:05 So these can be longer movie reviews,
- 04:09 which will be an important piece of information for the processing later on.
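For example, along these lines:

```python
# Show the very first review and its length in characters
print(df["text"].iloc[0])
print(len(df["text"].iloc[0]))
```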
- 04:13 And if you take a closer look,
- 04:18 if you read through the text, you can see that a couple of interesting characters occur.
- 04:22 To understand this, you just have to know that these movie reviews come from a website, and websites
- 04:27 are represented with HTML, and there are certain tags,
- 04:31 so tokens that are used in HTML, for example for line breaks or separators.
- 04:41 Things like that are encoded with these tags. But we do not want these
- 04:46 tags in our data,
- 04:48 so we have to consider whether we can somehow get rid of them.
- 04:52 It is exactly the same with what you see underneath,
- 04:56 namely that there is escaping, for example an escaped apostrophe, which again goes back to
- 05:04 the way the text is displayed on the Internet, but we want to remove it from our data.
- 05:09 That is, we already know that we have to work on this data somehow.
- 05:13 So here again, there are some things
- 05:19 that we notice in the data.
- 05:22 And what we can do now is build a pipeline where we say:
- 05:27 on our data we perform operations to get it into the format
- 05:31 that we want to have at the end. So still free-text format, but represented differently.
- 05:36 For example everything in lowercase, or we want to remove this encoding.
- 05:40 And down here, this is a
- 05:44 little trick for when you work with web data:
- 05:47 Beautiful Soup.
- 05:49 That is, we give this HTML data to Beautiful Soup, which again is simply a library,
- 05:54 and can then just extract the plain text.
- 05:58 That way, for example, the BR tags disappear.
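A minimal sketch of such a cleaning step, assuming the DataFrame and column names from above:

```python
from bs4 import BeautifulSoup

def clean_text(text: str) -> str:
    # Everything in lowercase
    text = text.lower()
    # Let Beautiful Soup parse the HTML and keep only the plain text,
    # which removes tags such as <br />
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    # Remove the escaped apostrophes left over from the website encoding
    text = text.replace("\\'", "'")
    return text

df["text"] = df["text"].apply(clean_text)
```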
- 06:01 We execute this and can now look at the same text again.
- 06:07 And we notice directly, well,
- 06:08 we have everything in lowercase now,
- 06:10 we do not have those BR tags anymore, and the encoding with the backslashes is gone as well.
- 06:16 That means we have already adapted our data, and it is now much more
- 06:20 in a free-text format that you can work with well.
- 06:25 It may well be that we still have to perform further operations,
- 06:30 so we might have to take a closer look at the data.
- 06:33 But for the presentation it is enough for now that we perform these operations.
- 06:39 And now is actually again the right time to split our data
- 06:45 into training and test data.
- 06:46 In this project we actually make the more elaborate split
- 06:52 into training data, validation data and test data.
- 06:55 That means we use the first split to train our models.
- 06:59 But as always, we are not just training one model. We will train several later
- 07:04 and find out which model is best.
- 07:08 And that is what you typically use the validation data split for.
- 07:11 You use it so you can say: on the validation data, we now have
- 07:17 the best model.
- 07:18 But then you still want a clean statement, because the model has in some way
- 07:24 seen the validation data. And that is when you use the test data, to get a very clean
- 07:28 statement about how well our best model is actually performing.
- 07:32 But we will see that in a later video, when we work with these
- 07:36 different splits, before we then eventually build this AI pipeline.
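A common way to get the three splits is to apply scikit-learn's train_test_split twice; the split sizes below are assumptions:

```python
from sklearn.model_selection import train_test_split

# First split off a held-out test set, then split the remainder
# into training and validation data (60/20/20 here, as an example).
train_val_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
train_df, val_df = train_test_split(
    train_val_df, test_size=0.25, random_state=42, stratify=train_val_df["label"]
)
```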
- 07:45 Let us just take a quick look at how our data is distributed,
- 07:50 and at which visualizations you can use to get a better
- 07:54 insight into the data.
- 07:56 And we make a little copy of our training data, just so we do not
- 08:01 break anything there, and then here we have an operation
- 08:08 with which we can extend the analysis of our texts.
- 08:11 That means we find out: how long are our texts?
- 08:14 Because it is a very nice way for us to look at how our text lengths are distributed.
- 08:19 So, do we typically have short texts, long texts, or maybe even really
- 08:25 very, very long texts with many thousands of characters? And you can simply calculate that right here.
- 08:32 And then we visualize this with Seaborn,
- 08:33 which is a visualization library.
- 08:37 And then I will zoom in a little bit here, so you can see it a little bigger.
- 08:42 Then we see these are now two distributions, namely 0, the negative,
- 08:47 so the negative reviews are the blue ones, and the orange ones are the positive ones.
- 08:52 We see there is not much difference in text length depending on whether it is a positive
- 08:57 or a negative review.
- 09:00 But we can see quite well that a large portion of the data is about 500 to
- 09:08 almost 2,000 characters long.
- 09:12 So we get a sense of how long our movie reviews are,
- 09:16 and we can work with that really well.
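Computing the lengths and plotting the two distributions might look like this, assuming the training split from above:

```python
import seaborn as sns

# Work on a copy so the training data itself stays untouched
train_plot = train_df.copy()
train_plot["text_length"] = train_plot["text"].str.len()

# One distribution per label: 0 = negative (blue), 1 = positive (orange)
sns.histplot(data=train_plot, x="text_length", hue="label")
```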
- 09:19 Now we may want to look at which words are
- 09:26 especially common in our data, to get a feeling for our text corpus.
- 09:31 And what helps with that is simply
- 09:34 to combine all the texts,
- 09:35 so that we have one long, long text.
- 09:38 And on top of that we can do some very nice analyses.
- 09:40 For example, what we can do very well is simply show ourselves what the 10 or 15
- 09:47 most common words are, and use that to get a feeling for the data again.
- 09:51 And for that we have a little helper function that does exactly this for us.
- 09:57 I will not go into too much detail, but what it actually does is
- 10:01 go over the words and count how often they occur, so that we can say
- 10:06 which are the most common ones when we sort by that.
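The helper is not shown in detail, but its core idea can be sketched with collections.Counter:

```python
from collections import Counter

# Combine all reviews into one long text and count the words in it
all_text = " ".join(train_df["text"])
word_counts = Counter(all_text.split())
word_counts.most_common(15)
```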
- 10:08 And here too, you can zoom in a little more precisely and notice relatively quickly
- 10:15 that Zipf's law applies here,
- 10:18 that is, a law which ultimately says that the frequency of the most common words,
- 10:25 for example now the or a,
- 10:31 is inversely proportional to their rank.
- 10:33 It simply means that there are certain words,
- 10:35 like the, which occurs most often;
- 10:37 it occurs about twice as often as a, and it continues like this with the other words.
- 10:42 What that ultimately means is that we have a small portion of words that accounts
- 10:48 for most of the word occurrences.
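Written as a formula (a standard textbook formulation, added here for reference), Zipf's law says that the frequency $f$ of the word with rank $r$ behaves roughly like

$$ f(r) \propto \frac{1}{r}, $$

so the second most frequent word occurs about half as often as the most frequent one, the third about a third as often, and so on.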
- 10:51 This means that words like the or a simply occur extremely often
- 10:56 in our texts.
- 10:59 And that may not be surprising for us, because these are just typical
- 11:04 filler words. However, they can mean difficulties for the model,
- 11:07 because these are again tokens that the model has to process.
- 11:11 So we can start thinking now: maybe we do not want to keep these words in the data at all.
- 11:15 And because removing these stop words is a relatively typical operation,
- 11:22 there are many ready-made functionalities you can use for it,
- 11:27 including in NLTK or spaCy.
- 11:28 And so what we are doing here right now is simply removing exactly those terms
- 11:34 from our text corpus that are very common, because as a rule they
- 11:40 deliver no great added value for our text but are really just filler words.
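A minimal sketch of stop word removal with NLTK, assuming the stopwords resource was downloaded earlier:

```python
from nltk.corpus import stopwords

# English stop word list from NLTK (requires nltk.download("stopwords"))
stop_words = set(stopwords.words("english"))

def remove_stopwords(text: str) -> str:
    # Keep only the words that are not in the stop word list
    return " ".join(word for word in text.split() if word not in stop_words)

train_df["text"] = train_df["text"].apply(remove_stopwords)
```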
- 11:43 If we look at this again, we now see quite different terms,
- 11:47 and from our distribution we can perhaps now recognize
- 11:52 that this is actually a movie data set, which we could not tell
- 11:55 from the previous visualization, because now the terms
- 11:58 movie or film, like and good occur especially often.
- 12:03 So we can already derive more and more what we are actually talking about here.
- 12:07 Now, there are several ways to visualize this; alternatively we can
- 12:14 also look at a word cloud.
- 12:16 A word cloud ultimately arranges the words in space and, depending on
- 12:21 how often they occur, displays them correspondingly larger. We can apply that here too
- 12:27 and see that the terms we just saw, so movie and film, stand out especially again,
- 12:32 and we simply
- 12:33 get an overview of which words we have in our data.
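Generating such a word cloud from the combined text could look like this, reusing all_text from above:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# More frequent words are rendered correspondingly larger
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```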
- 12:37 But one note right away: these word clouds are often quite nice to look at, but in and of themselves
- 12:44 they do not contain more information than,
- 12:47 for example, the frequency plot we had earlier.
- 12:49 It is maybe even more difficult to identify which words are most common.
- 12:54 So whether movie or film is more common,
- 12:56 you often cannot tell well from the size at all.
- 12:59 And it is partly simply difficult to read;
- 13:01 in other words, you sometimes have to read sideways.
- 13:03 The advantage is that you can fit relatively many words
- 13:08 into a small space.
- 13:10 Depending on that, you have to decide which visualization you want to use; word clouds
- 13:16 are very popular because they do look pretty.
- 13:19 We can now also easily split the whole thing into positive and negative comments,
- 13:25 run the analyses on each, and thus separate our corpus.
- 13:30 And if we just take a look at this for the positive corpus, we see
- 13:35 that, for example, terms like good or great occur, which we would simply
- 13:40 expect to find in a positive movie comment.
- 13:43 And we can do the same thing on the negative movie comments, which we see here.
- 13:50 For example, there is bad and a lot of other terms.
- 13:54 And now we could go further and look at which terms are particularly common,
- 13:59 or which occur only in
- 14:01 one of the two kinds of movie comments. And there we notice,
- 14:08 for example, that there are words like good that occur in both positive and negative reviews,
- 14:14 which may confuse us at first.
- 14:15 And the explanation behind this is that simple words, depending on the context
- 14:21 they are in within the sentence, can take on a completely different meaning.
- 14:24 This can happen simply through a negation.
- 14:25 So, "This film was really good" is clearly positive.
- 14:31 In contrast, "In my opinion this movie was not good at all" is
- 14:35 a bad rating, and both contain good.
- 14:38 That is, a completely simple keyword search
- 14:40 would perhaps be too little at this point.
- 14:43 We simply need to know that,
- 14:45 and then we can counteract it pretty well.
- 14:50 That is it for now with the analysis of the words in our data set.
- 14:56 We could analyze a lot more here, for example sentence-level statistics
- 14:59 or which words separate the classes.
- 15:03 But that is enough for us right now.
- 15:05 Next time, we will look at how we
- 15:08 build exciting AI models.
About this video
- On GitHub, we have compiled all materials for the practical units and prepared them for you.