A simple example
Let's see how this works in practice. Suppose we are building a classifier that decides whether a given text is about sports or not. Our training set has 5 sentences:

Text                         | Category
-----------------------------|-----------
A great game                 | Sports
The election was over        | Not sports
Very clean match             | Sports
A clean but forgettable game | Sports
It was a close election      | Not sports

Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence "A very close game" is Sports, and the probability that it isn't. Written mathematically, what we want is P(Sports | a very close game): the probability that the category of the sentence is Sports given that the sentence is "a very close game".

But how do we calculate these probabilities?

Feature Engineering

When creating a machine learning model, the first thing we need to decide is what to use as features. For example, if we were classifying people's health, the features could be height, weight, gender, and so on. We would exclude things that are useless to the model, such as a person's name or favorite color.

In this case, however, we don't even have numeric features. We only have text. We need to somehow convert this text into numbers we can calculate with.

So what do we do? We usually use word frequencies. That is, we ignore word order and sentence construction, and treat every document as a bag of words. Our features will be the counts of these words. Even though it may seem too simplistic an approach, it works surprisingly well.
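As a sketch of the bag-of-words idea, here is how a sentence can be turned into word-count features in Python (the helper name `bag_of_words` is illustrative, not from any library):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; word order and grammar are ignored.
    return Counter(text.lower().split())

print(bag_of_words("A clean but forgettable game"))
# Counter({'a': 1, 'clean': 1, 'but': 1, 'forgettable': 1, 'game': 1})
```

Each distinct word becomes a feature, and its count is the feature value.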

Bayes' Theorem

Bayes' theorem is useful when working with conditional probabilities (as we are here), because it gives us a way to reverse them: P(A|B) = P(B|A) × P(A) / P(B). In our case we have P(Sports | a very close game), so using this theorem we can reverse the conditional probability:

P(Sports | a very close game) = P(a very close game | Sports) × P(Sports) / P(a very close game)

Since for our classifier we are just trying to find out which category has the bigger probability, we can discard the divisor, which is the same for both categories, and simply compare P(a very close game | Sports) × P(Sports) with P(a very close game | Not sports) × P(Not sports).

This is better, because these are probabilities we could actually calculate: just count how many times the sentence "A very close game" appears in the Sports category of our training set, divide by the total, and we obtain P(a very close game | Sports).

There is a problem, though: "A very close game" doesn't appear anywhere in our training set, so this probability is zero. Unless every sentence we want to classify appears in our training set, the model won't be very useful.

Being Naive

Here comes the naive part: we assume that every word in a sentence is independent of the other words. This means we no longer look at entire sentences, but at individual words. We write P(a very close game) as: P(a very close game) = P(a) × P(very) × P(close) × P(game). This assumption is very strong but enormously useful: it is what makes the model work well with little data, or with data that may be mislabeled. The next step is to apply it to what we had before:

P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)
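Under the independence assumption, the likelihood of a sentence is just the product of per-word probabilities. A minimal sketch, where `word_probs` is a hypothetical dictionary of P(word | category) values:

```python
def sentence_likelihood(sentence, word_probs):
    # Naive independence: multiply P(word | category) over each word.
    p = 1.0
    for word in sentence.lower().split():
        p *= word_probs.get(word, 0.0)
    return p

# Toy numbers, for illustration only:
probs = {"a": 0.1, "very": 0.05, "close": 0.02, "game": 0.1}
print(sentence_likelihood("A very close game", probs))  # ≈ 1e-05
```

Note that any word with probability zero wipes out the whole product, which is exactly the problem addressed below.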

Now all of these individual words actually do appear several times in our training set, so we can calculate them!

Calculating Probabilities

Calculating a probability is really just counting in our training set.

First, we calculate the a priori probability of each category: for a given sentence in our training set, P(Sports) is 3/5 and P(Not sports) is 2/5. Then, to calculate P(game | Sports), we count how many times the word "game" appears in Sports samples (2) and divide by the total number of words in Sports (11). Therefore, P(game | Sports) = 2/11.
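These counts are easy to reproduce in code. A sketch over our 5-sentence training set, using exact fractions so the results match the hand calculation (the helper names are illustrative):

```python
from fractions import Fraction

train = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

def prior(category):
    # Fraction of training sentences labeled with this category.
    return Fraction(sum(1 for _, c in train if c == category), len(train))

def word_prob(word, category):
    # Occurrences of `word` among all words of this category's sentences,
    # divided by the total number of words in the category.
    words = [w for t, c in train if c == category for w in t.lower().split()]
    return Fraction(words.count(word), len(words))

print(prior("Sports"))              # 3/5
print(word_prob("game", "Sports"))  # 2/11
```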

However, we run into a problem: "close" doesn't appear in any Sports sample! That means P(close | Sports) = 0. This is rather inconvenient, since we are going to multiply it with the other probabilities, so in the end P(a | Sports) × P(very | Sports) × 0 × P(game | Sports) equals 0. A zero product gives us no information at all, so we have to find a way around it.

How? By using something called Laplace smoothing: we add 1 to every count, so it is never zero. To balance this out, we add the number of possible words to the divisor, so the result can never be greater than 1. In our case, the possible words are ["a", "great", "very", "over", "it", "but", "game", "election", "close", "clean", "the", "was", "forgettable", "match"].

Since the number of possible words is 14, applying Laplace smoothing we get, for example, P(game | Sports) = (2 + 1) / (11 + 14). The full results are:

Word  | P(word | Sports)    | P(word | Not sports)
------|---------------------|---------------------
a     | (2 + 1) / (11 + 14) | (1 + 1) / (9 + 14)
very  | (1 + 1) / (11 + 14) | (0 + 1) / (9 + 14)
close | (0 + 1) / (11 + 14) | (1 + 1) / (9 + 14)
game  | (2 + 1) / (11 + 14) | (0 + 1) / (9 + 14)
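A minimal sketch of Laplace smoothing over the Sports word list (the function name `smoothed_prob` is illustrative):

```python
from fractions import Fraction

# All words appearing in Sports sentences of the training set.
sports_words = "a great game very clean match a clean but forgettable game".split()
vocab_size = 14  # distinct words across the whole training set

def smoothed_prob(word, words, vocab_size):
    # Laplace smoothing: add 1 to the count, add |vocabulary| to the divisor.
    return Fraction(words.count(word) + 1, len(words) + vocab_size)

print(smoothed_prob("game", sports_words, vocab_size))   # 3/25
print(smoothed_prob("close", sports_words, vocab_size))  # 1/25
```

Even "close", which never occurs in a Sports sentence, now gets a small nonzero probability.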

Now we just multiply all the probabilities and see which is bigger:

P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports) × P(Sports) = 2.76 × 10⁻⁵
P(a | Not sports) × P(very | Not sports) × P(close | Not sports) × P(game | Not sports) × P(Not sports) = 0.572 × 10⁻⁵

Excellent! Our classifier gives "A very close game" the Sports category.
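Putting all the pieces together, here is a compact sketch of the whole classifier, using exact fractions so the two scores match the hand calculation above:

```python
from fractions import Fraction

train = [
    ("A great game", "Sports"),
    ("The election was over", "Not sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not sports"),
]

# Vocabulary: every distinct word across the training set (14 words here).
vocab = {w for text, _ in train for w in text.lower().split()}

def score(sentence, category):
    # Prior P(category) times the product of Laplace-smoothed P(word | category).
    texts = [t.lower().split() for t, c in train if c == category]
    words = [w for t in texts for w in t]
    p = Fraction(len(texts), len(train))  # prior probability
    for word in sentence.lower().split():
        p *= Fraction(words.count(word) + 1, len(words) + len(vocab))
    return p

s = score("A very close game", "Sports")
n = score("A very close game", "Not sports")
print(float(s), float(n))                    # ≈ 2.76e-05 vs ≈ 0.572e-05
print("Sports" if s > n else "Not sports")   # Sports
```

In practice the products of many small probabilities underflow floating point, so real implementations sum log-probabilities instead; fractions keep this toy example exact.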
There are many things that can be done to improve this basic model. These techniques can make Naive Bayes competitive with more advanced methods:

* Removing stopwords. These are common words that don't really add anything to the classification, such as "a", "the", "was", and so on. With stopwords removed, "The election was over" would become "election over", and "a very close game" would become "very close game".

* Lemmatizing words. This means grouping together different inflections of the same word, so that "election", "elections", "elected", and so on are counted as more appearances of the same word.

* Using n-grams. Instead of counting single words, we could count sequences of words, like "clean match" and "close election".

* Using TF-IDF. Instead of just counting raw frequency, we could use a more sophisticated weighting that discounts words appearing in most of the texts.
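As an illustration of the n-gram idea above, bigrams (sequences of two words) can be extracted with a couple of lines (a sketch, not a full pipeline):

```python
def bigrams(text):
    # Pair each word with its successor, e.g.
    # "very clean match" -> [("very", "clean"), ("clean", "match")]
    words = text.lower().split()
    return list(zip(words, words[1:]))

print(bigrams("It was a close election"))
# [('it', 'was'), ('was', 'a'), ('a', 'close'), ('close', 'election')]
```

Each bigram then becomes a feature and is counted exactly like a single word.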