Stack Abuse: Text Classification with Python and Scikit-Learn

Text classification is one of the most important tasks in natural language processing. It is the process of classifying text strings or documents into different categories, depending upon the contents of the strings. Text classification has a variety of applications, such as detecting user sentiment from a tweet, classifying an email as spam or ham, classifying blog posts into different categories, automatically tagging customer queries, and so on.

In this article, we will see a real-world example of text classification. We will train a machine learning model capable of predicting whether a given movie review is positive or negative. This is a classic example of sentiment analysis, where people's sentiments towards a particular entity are classified into different categories.

Now that we have downloaded the data, it is time to see some action. In this section, we will perform a series of steps required to predict sentiments from reviews of different movies. These steps can be used for any text classification task. We will use Python's scikit-learn library to train a text classification model.

Once the dataset has been imported, the next step is to preprocess the text. Text may contain numbers, special characters, and unwanted spaces. Depending upon the problem we face, we may or may not need to remove these special characters and numbers from the text. However, for the sake of explanation, we will remove all the special characters, numbers, and unwanted spaces from our text. Execute the following script to preprocess the data:
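The preprocessing script is not shown above; a minimal sketch along these lines, using only the standard `re` module (the sample document in `documents` is a made-up stand-in for the movie review dataset):

```python
import re

# Hypothetical sample input; in the article, documents come from the movie review dataset
documents = ["David's movie was great!! It scored a 9/10 :)"]

processed = []
for doc in documents:
    text = re.sub(r'\W', ' ', doc)               # replace special characters with spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop stray single characters
    text = re.sub(r'\d+', ' ', text)             # remove numbers
    text = re.sub(r'\s+', ' ', text).strip()     # collapse multiple spaces
    processed.append(text.lower())

print(processed[0])  # david movie was great it scored
```

Note the order of the substitutions: replacing special characters first is what produces the stray single characters (like the "s" from "David's") that the second pattern then removes.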

Next, we remove all the single characters. For instance, when we remove the punctuation mark from "david's" and replace it with a space, we get "david" and a single character "s", which has no meaning. To remove such single characters we use the regular expression \s+[a-zA-Z]\s+.

Next, we perform lemmatization. In lemmatization, we reduce a word to its dictionary root form. For instance, "cats" is converted into "cat". Lemmatization is done in order to avoid creating features that are semantically similar but syntactically different. For instance, we don't want two different features named "cats" and "cat", which are semantically similar; therefore we perform lemmatization.
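As a toy illustration of the idea (a real pipeline would use a proper lemmatizer such as NLTK's WordNetLemmatizer; the lookup table here is a hypothetical stand-in):

```python
# Hypothetical mini lemma table standing in for a real lemmatizer
LEMMAS = {"cats": "cat", "ran": "run", "better": "good"}

def lemmatize(token):
    """Return the dictionary root form of a token, or the token itself."""
    return LEMMAS.get(token, token)

tokens = "the cats ran".split()
print([lemmatize(t) for t in tokens])  # ['the', 'cat', 'run']
```

The key point is that "cats" and "cat" collapse into a single feature rather than two.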

When you convert words to numbers using the bag of words approach, all the unique words across all the documents are converted into features. The documents can contain tens of thousands of unique words, but words that have a very low frequency of occurrence are usually not a good parameter for classifying documents. Therefore we set the max_features parameter to 1500, which means we only use the 1500 most frequently occurring words as features.

Similarly, the max_df parameter is set to 0.7, where the fraction corresponds to a percentage. Here 0.7 means that we include only those words that occur in a maximum of 70% of all the documents. Words that occur in almost every document are usually not suitable for classification because they do not provide any unique information about the document.
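A sketch of how these parameters could be passed to scikit-learn's CountVectorizer (the three-document corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "the" and "was" occur in every document (100% > 70%)
corpus = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]

# max_features keeps only the most frequent terms overall;
# max_df=0.7 drops words appearing in more than 70% of the documents
vectorizer = CountVectorizer(max_features=1500, max_df=0.7)
X = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # ['acting', 'great', 'movie', 'terrible']
```

Here "the" and "was" appear in all three documents, so max_df=0.7 excludes them from the vocabulary.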

The bag of words approach works fine for converting text to numbers. However, it has one drawback: it assigns a score to a word based only on its occurrence in a particular document. It doesn't take into account the fact that the word might also have a high frequency of occurrence in other documents. TF-IDF resolves this issue by scaling down the scores of words that appear frequently across many documents.

In the script above, our machine learning model did not take much time to execute. One of the reasons for the quick training time is the fact that we had a relatively small training set. We had 2,000 documents, of which we used 80% (1,600) for training. However, in real-world scenarios there can be millions of documents, and in such cases it can take hours or even days (on slower machines) to train the algorithms. Therefore, it is recommended to save the model once it is trained.
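One common way to persist a trained scikit-learn model is Python's pickle module; a sketch (the file name and the tiny training set are arbitrary stand-ins for the real model and features):

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set standing in for the real feature matrix
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression().fit(X, y)

# Save the trained model to disk...
with open("text_classifier.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and later load it back without retraining
with open("text_classifier.pkl", "rb") as f:
    loaded = pickle.load(f)

# The reloaded model makes the same predictions as the original
print(loaded.predict([[2.5]]))
```

For large models, `joblib.dump`/`joblib.load` is an alternative often recommended for scikit-learn estimators.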