Machine Learning - Preprocessing Structured Data - One Hot Encoding
Oct 10, 2024
0:00
In this video we are going to discuss one hot encoding
0:04
So what is one hot encoding? With one hot encoding we encode the categorical
0:13
variables in a form that we can feed to machine learning algorithms
0:21
for better prediction-oriented applications. So one hot encoding is a process by which
0:29
categorical variables are converted into a form that can be provided to
0:35
machine learning algorithms to do a better job in prediction. The categorical
0:42
variable assignment can be done using scikit-learn's LabelEncoder, but the
0:48
problem is that it can lead a model to assume that the higher the categorical value, in the
0:54
numerical sense, the better the category. One hot encoding instead performs a
0:59
binarization of the category and includes it as a feature to train the model.
1:06
Suppose you have a flower feature which can take the values daffodil, lily and rose.
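A minimal sketch of such a flower feature, assuming four made-up sample rows that are not from the video. It compares integer labels, similar in spirit to what a label encoder produces, with one binary column per flower. The names `flowers`, `categories`, `label_of` and the `is_` prefixes are illustrative choices only.

```python
# Hypothetical sample rows for the flower feature (not taken from the video).
flowers = ["rose", "daffodil", "lily", "rose"]

# Fixed, sorted category list so the column order is reproducible.
categories = sorted(set(flowers))  # ['daffodil', 'lily', 'rose']

# Integer labels: compact, but they suggest an ordering (rose > lily > daffodil).
label_of = {name: index for index, name in enumerate(categories)}
integer_labels = [label_of[name] for name in flowers]

# One hot encoding: one binary column per category, no implied ordering.
one_hot_rows = [
    {"is_" + category: int(name == category) for category in categories}
    for name in flowers
]

print(integer_labels)
for row in one_hot_rows:
    print(row)
```

Each row has exactly one column set to 1, so no flower looks numerically larger than another.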
1:14
One hot encoding converts the flower feature to three features, that is "is daffodil", "is lily" and
1:24
"is rose", which are all binary in nature. So if this idea is not yet
1:29
clear, let us go through this example. Here we have feature number
1:34
0, and it takes only the values 0 and 1. In the case of
1:40
feature number 0 we can write it this way: for feature number 0
1:44
we have two columns, and each column holds a binary
1:49
value, that is 0 or 1. One column is for
1:54
feature number 0 with the value 0, and the other is for feature number 0 with the value 1.
1:59
If that was not clear, let me repeat. Here the value is 0, so I am putting 1 here in the
2:05
0 column, and here it is 1, so I shall be putting 1 here in the 1 column. Here
2:10
it is 0, so the 1 is there; here it is 1, so the 1 is here. So this particular
2:15
feature can have two values, 0 and 1: for the 0 I am using this column and for the 1 I am using that column. In this way this particular categorical value
2:26
can be encoded in the one hot encoding format. Let us go to the next
2:32
one. This is our feature number 1. This particular column is filled with
2:37
3 values, that is 0, 1 and 2, so that is why we have made 3 columns here. So
2:42
0 means I shall be putting 1 here. For 1, out of 0, 1, 2, I shall be putting 1 here.
2:48
For 2, out of 0, 1, 2, I shall be putting 1 here, and for 0 I shall be putting 1 here back again.
2:57
For this feature number 2 we have the values 3, 0, 1 and 2.
3:01
So how many values do we have here? How many categorical values? 4.
3:05
So that is why we have 4 columns here: the column for 0, for 1, for 2 and for 3.
3:11
Here it is 3, so the 1 will be here and the rest will have the value 0.
3:15
Here it is 0, so the 1 will be here; here it is 1, so the 1 will be here; here it is 2, so the 1 will be in this particular column.
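The three features just walked through can be sketched together in plain Python. This assumes the values are read top to bottom in the order spoken (feature 0: 0, 1, 0, 1; feature 1: 0, 1, 2, 0; feature 2: 3, 0, 1, 2). The helper `one_hot_encode` is a hypothetical name for illustration, not a library function.

```python
# Rows are samples; columns are feature 0, feature 1 and feature 2,
# assuming the values are read top to bottom as spoken in the walkthrough.
samples = [
    [0, 0, 3],
    [1, 1, 0],
    [0, 2, 1],
    [1, 0, 2],
]


def one_hot_encode(rows):
    """One hot encode integer-valued categorical columns (hypothetical helper)."""
    columns = list(zip(*rows))
    # Sorted distinct values per feature: [0, 1], [0, 1, 2], [0, 1, 2, 3].
    categories = [sorted(set(column)) for column in columns]
    encoded = []
    for row in rows:
        encoded_row = []
        for value, feature_values in zip(row, categories):
            # One binary column per distinct value of this feature.
            encoded_row.extend(int(value == candidate) for candidate in feature_values)
        encoded.append(encoded_row)
    return categories, encoded


categories, encoded = one_hot_encode(samples)
for row in encoded:
    print(row)
```

Each output row has 2 + 3 + 4 = 9 binary columns, with exactly one 1 inside each feature's group of columns.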
3:23
So the encoded values are binary now. In this way we have discussed what one hot encoding is,
3:30
and now let us go for one practical example, a practical implementation of one hot encoding, for your better understanding.
3:36
So here is the implementation for you. In this video we are explaining the one hot encoding technique using Python.
3:45
So what is that? Here we have some sample code. Let me explain it line by line.
3:51
From numpy import argmax: we will be using this one. Here we have one data string, and that is "welcome".
3:59
So what is the length? The length is seven characters. Okay. So now print data: if I print it, obviously I will get "welcome" as the output. So that is my print data. Okay. Next, I am defining one alphabet.
4:15
This alphabet is nothing but a string which contains all 26
4:19
letters, plus one blank space as well. Then we define a mapping of
4:25
characters to integers, so we are going for character to integer. What is that?
4:30
We are creating one dictionary where the character c is the key and i is
4:35
the position of that character: for i and c in enumerate alphabet. The alphabet contains this particular string where a to z is written and one blank space is there, because my data
4:48
may also have some blank spaces, for example "welcome to our place". In that case we will have
4:54
blank spaces in the string which is going to get encoded. So character to integer, we have
5:00
done this one, and then integer to character is doing just the reverse. Here we are writing i
5:05
and c, so the position is the key and the respective character is the value in the
5:09
respective dictionary, for i, c in enumerate of our alphabet. So let me
5:14
print them, character to integer and integer to character, to see what
5:20
their current content is. If I go for execution, you can find that for
5:23
character to integer we have this dictionary where a is the
5:27
zeroth character, b is character number 1, c is character number 2, and
5:31
z is character number 25, because we have started from zero.
5:35
So with 26 letters, z has the value 25 in its key-value pair, and here the blank space has the value 26.
5:45
On the other hand, in the case of integer to character, at place 0 we have a, at place 1 we have b, at place 2 we have c, and so on.
5:55
So at place 26 we will have our blank space. So I have printed character to integer and integer to character for better understanding.
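The two dictionaries described here can be sketched as follows. The variable names `alphabet`, `char_to_int` and `int_to_char` are assumptions based on how the code is read aloud; the exact code on screen is not part of this transcript.

```python
# Alphabet: 26 lowercase letters plus one blank space (27 symbols in total).
alphabet = "abcdefghijklmnopqrstuvwxyz "

# Character -> position, and the reverse mapping position -> character.
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

print(char_to_int)
print(int_to_char)
```

Because enumeration starts at zero, `char_to_int["z"]` is 25 and the blank space maps to 26.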
6:05
Next, integer encoded: character to integer of char, for char in data. Here my data is "welcome". So for char in data, that is for each and every character, I am retrieving the respective value for that character and putting it in this integer encoded list. If I print this integer encoded variable, you can find the values for "welcome", W-E-L-C-O-M-E, coming
6:35
like this. Why is it 22, and why is it 4? Just come here to w: if you come to
6:42
w, see, for w it is 22. And why is it 4? Because in w, e, l the next one is
6:50
e here, and e has the value 4. So accordingly the integers have been encoded.
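A short sketch of this integer encoding step, again assuming the variable names `data`, `char_to_int` and `integer_encoded`, which are inferred from the narration:

```python
data = "welcome"
alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_int = {c: i for i, c in enumerate(alphabet)}

# Look up the position of every character of the input string.
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)  # [22, 4, 11, 2, 14, 12, 4]
```

The first entry is 22 for w, and both occurrences of e map to 4.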
6:56
Next, we are going for one hot encoded. We are creating one list, an empty list, here. Then for value in integer encoded,
7:09
letter is a list of zeros, 0 for underscore in range of the length of the alphabet. Then we set
7:17
letter at index value equal to 1, and call one hot encoded dot append with the respective
7:22
letter. So what is happening in this particular case? If we
7:27
print this one hot encoded, you can find that we have "welcome" here. The first
7:31
character was w, so here are my 27 binary bits, and the 1 is at the
7:38
w place, followed by x, y, z and the blank space. So this w is not a, not b, not c, and so on, until
7:46
we have the 1 where w is, then not x, not y, not z, and not a blank space.
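The loop described above can be sketched like this, assuming the 27-symbol alphabet (26 letters plus the blank space) and the integer codes for "welcome". The names `onehot_encoded` and `letter` follow the narration and may differ from the code on screen.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz "
integer_encoded = [22, 4, 11, 2, 14, 12, 4]  # integer codes for "welcome"

onehot_encoded = []
for value in integer_encoded:
    # Start with all zeros, one slot per symbol of the alphabet.
    letter = [0 for _ in range(len(alphabet))]
    # Switch on the single slot that corresponds to this character.
    letter[value] = 1
    onehot_encoded.append(letter)

for vector in onehot_encoded:
    print(vector)
```

The first vector has its only 1 at index 22 (w), followed by four zeros for x, y, z and the blank space.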
7:52
The next one was our e from "welcome". So for e it is
7:57
not a, not b, not c, not d, but e, and the rest of the characters are not present. So in this way,
8:05
for w, e, l, c, o, m, e, I am finding the one hot encoded form. So I am getting
8:15
this one, that is the one hot encoded form. I have just shown you the outputs. Now
8:19
we are going for inverting the encoding: that means from that very 0-1
8:24
binary vector I want to get back the character. So now one hot encoded at index 0, that is the vector for the first character:
8:32
for the first vector you are getting the argmax, and then with integer to character I am
8:39
converting it. So here the position will be our w position,
8:43
and then with integer to character I will be converting that one, and then I will be printing
8:48
this inverted value. So I will get the output w here. You can find that I am getting
8:52
this output as w because I am considering only the first vector here. So here
8:57
you can see how to convert this "welcome" to this particular encoded form, and then
9:03
from there how to get the respective characters back. I think you got this idea. So
9:08
this is my total code; the total code is on the screen. You can also type
9:13
the same, you can do experiments, and you can add temporary prints of the
9:18
variables for better understanding, as I did here. Thanks for watching.
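For anyone typing along, the inversion step can be sketched end to end as below. This is a sketch under assumptions, not the exact code from the screen: the video imports `argmax` from numpy, while here a small plain-Python helper with the same name stands in for it so that the example needs no third-party package. Decoding the whole string at the end is an extra illustration; the video only decodes the first vector.

```python
def argmax(vector):
    """Index of the largest entry; plain-Python stand-in for numpy's argmax."""
    return max(range(len(vector)), key=vector.__getitem__)


data = "welcome"
alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

# Encode: one 27-slot binary vector per character.
onehot_encoded = []
for char in data:
    letter = [0] * len(alphabet)
    letter[char_to_int[char]] = 1
    onehot_encoded.append(letter)

# Invert only the first vector, as in the video.
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)  # w

# Extra illustration: invert every vector to recover the full string.
decoded = "".join(int_to_char[argmax(vector)] for vector in onehot_encoded)
print(decoded)  # welcome
```

Since each vector contains a single 1, the position of the maximum is exactly the integer code of the original character.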