Spark Word Count Example | Scala | Spark
Oct 24, 2024
0:00
In this video we explain the Spark word count example. Here we will use one
0:06
data file that contains many different lines of text, and we will count
0:12
how many times each word occurs; that is known as the word count example.
0:18
So let us go into some more detail: how to implement the word count problem in Spark.
0:26
Using the Java MapReduce program, we have already seen how to count the
0:29
frequency of words in one or more text files. For this example, we are going to
0:36
count words on the same file that was selected for our MapReduce program and used
0:41
in the earlier MapReduce example. The file is stored on HDFS, so at first
0:49
we should start Hadoop before accessing the HDFS files. Let us go for a practical
0:55
demonstration to show how this word count problem can be written and executed. My Hadoop system is on and running, so now here you can see that
1:07
we have the Hadoop root, that is, the HDFS root. Under it we have one
1:12
folder named HadoopMyFiles, and under this folder we have one
1:17
file named sample_file.txt. Let me show you the content of the file: we shall
1:23
press Ctrl+Alt+T to open a terminal, and then we shall go for hdfs
1:29
dfs -cat on /HadoopMyFiles, where the file name is sample_file.txt.
1:48
So I am going to view the content of the file.
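A minimal sketch of that command, assuming the folder and file are spelled /HadoopMyFiles and sample_file.txt as spoken in the narration:

    hdfs dfs -cat /HadoopMyFiles/sample_file.txt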
1:53
The file content is shown on the screen. Now we shall open our Spark shell, and in the Spark
1:59
shell we shall execute, not a full program, but a series of statements; we will be writing
2:05
some lines one by one to perform the word count on this
2:11
sample_file.txt. That is the purpose of the demonstration we are going to give you right now.
2:17
So let me go for the initiation of the Spark shell. To initialize the Spark shell we shall
2:23
go for the spark-shell command; once we initialize it, the Scala prompt
2:29
will be coming. At first we shall create one RDD which will read all of this sample_file.txt content. The Scala prompt has come, so I shall go for val sampleFile equal to
2:53
sc.textFile, where sc stands for SparkContext, and then we shall give the path, which should be
3:01
enclosed within double quotes: hdfs://localhost:9000, where 9000 is the port number, the folder
3:14
is HadoopMyFiles, and the file name is
3:23
sample_file.txt. This is the total path with the file name; it opens the text file stored in HDFS.
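A sketch of the statement, assuming the same names; sc is the SparkContext that the spark-shell session creates automatically:

    val sampleFile = sc.textFile("hdfs://localhost:9000/HadoopMyFiles/sample_file.txt")

This gives an RDD[String] with one element per line of the file.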
3:34
Now, to see the content of the text file as an array, we
3:40
shall go for sampleFile.collect. When we run
3:47
collect, we can see the content of the text file as an array; you can find that the content is getting
3:55
shown. You see, the content is coming in the form of an array.
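In the shell this is simply:

    sampleFile.collect   // returns the lines of the file as an Array[String]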
4:01
Now we shall split this content, so as to split out all the words, which are
4:08
separated by blank spaces. For this we shall go for val wCount
4:14
(I could give it another name as well, no issues), then sampleFile.flatMap, with a capital M, not
4:26
a capital F, and then line => line.split with a space as the delimiter,
4:37
where I am enclosing the space within double quotes, and now closing the brackets.
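A sketch of the statement as dictated:

    val wCount = sampleFile.flatMap(line => line.split(" "))

flatMap flattens the per-line arrays of words into a single RDD of words.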
4:44
To see the contents inside wCount, with all the words now separated in the array, we shall go
4:50
for wCount.collect. You see, in the array all the words have been separated; you can find
5:00
the output here, as I have marked.
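In the shell this is:

    wCount.collect   // one array element per word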
5:07
Those are the contents we saw with wCount.collect. Now we shall put a 1
5:13
after each word in the wCount RDD. How do we do this? We shall go for val mapOutput (I am just writing the name as mapOutput), then wCount.map, and then w => (w, 1), so that each
5:36
and every word will have a 1 after it. I am pressing Enter. To see what
5:44
value we will be getting, a key-value pair type of thing, let me show you that
5:48
as well: mapOutput.collect will show us the key-value
5:57
pairs. You see, it is a key-value pair: the key is the word and the value is
6:01
1. For each and every word, we have treated that word as a key
6:06
and the value here is 1.
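A sketch of the two statements, following the narration:

    val mapOutput = wCount.map(w => (w, 1))
    mapOutput.collect   // pairs of the form (word, 1)

Each word becomes the key of a tuple, with the constant 1 as its value.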
6:12
Now we shall call the reduceByKey method. So let me go for that: val reduce
6:22
Output, I shall make this one, say, reduceOutput, then mapOutput.reduceByKey, so it is
6:34
reduceByKey, and here we shall go for underscore plus underscore, that is, _ + _.
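A sketch of the statement:

    val reduceOutput = mapOutput.reduceByKey(_ + _)

Here _ + _ is shorthand for (a, b) => a + b; reduceByKey adds up all the 1s that share the same word key.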
6:44
Now, what is the final output? We have called the reducer as well, so what is the final
6:49
output? Initially we had the map output; now we are having this
6:53
reduce output. To get this one I shall go for reduceOutput
6:59
dot collect, and you can find that it is coming out
7:06
like this. Here each and every key is present: when a key is unique, not
7:13
having any further occurrences, it has a frequency of one, so the count is 1; but when
7:18
a particular key has got repeated multiple times, the respective counts are
7:22
coming. In this way you are getting the word counts.
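For illustration only, the call plus a hypothetical result (the actual pairs depend on the file's contents):

    reduceOutput.collect   // e.g. Array((spark,3), (hadoop,1), (word,2))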
7:27
Now let me save this output onto some file in HDFS. How do we do that one? reduce
7:34
Output.saveAsTextFile: there is a method of this name,
7:43
so we are going forward and passing it one parameter. What is the parameter? It is a path: hdfs
7:49
://localhost:9000, which we wrote earlier as well. Now let me decide some path which will be created; let it be /SparkOutput/WCSpark. Before going for that, let me show you that
8:17
there is no folder called SparkOutput yet; under that folder, obviously, the
8:24
WCSpark folder will be created, but there is no folder called SparkOutput so far. So now if I execute this one,
8:34
that is, reduceOutput.saveAsTextFile, we are giving the total path; the output files
8:40
will get created there automatically, and the path should
8:45
be enclosed within double quotes. I am just pressing Enter.
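A sketch of the call, assuming the output path is spelled /SparkOutput/WCSpark as spoken:

    reduceOutput.saveAsTextFile("hdfs://localhost:9000/SparkOutput/WCSpark")

Spark creates the WCSpark directory itself and writes one part file per partition, plus an empty _SUCCESS marker.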
8:53
Now let me show you that the corresponding SparkOutput folder has got created; under this we are having WCSpark, and under that we are having this _SUCCESS file and part-
9:01
00000. This is the file which is actually containing the output, so let
9:06
me show you the output as well. How to show it? I shall go for the terminal
9:13
once again. So this is the terminal I am having; let me
9:19
come out of the Spark shell, so I shall go for exit. Coming out, I have got the
9:23
dollar prompt back again; I clear the screen. So let me see: I shall go for hdfs dfs -cat to see the content here, so -cat
9:36
/SparkOutput, so there is the first folder, under which we are having
9:44
the next folder, which is WCSpark, as you can find here,
9:52
and then we are having this part hyphen five zeros, that is, part-00000. That is the content you are going to get.
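That is, assuming the path created above:

    hdfs dfs -cat /SparkOutput/WCSpark/part-00000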
10:03
So this is the content; you can find that the content has been written
10:09
onto this part-00000 file, so the content has got written here. Instead of writing
10:15
this full file name, we can also put the respective
10:19
wildcard characters; that will also work for us and will produce the same output, as sketched below.
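For instance, with a wildcard (the exact pattern here is my assumption; any glob that matches the part files works):

    hdfs dfs -cat /SparkOutput/WCSpark/part-*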
10:27
In this demonstration we have given you an idea of
10:32
the different steps that should be followed to execute the word count
10:36
problem on a text file in our Spark shell. Thanks for watching this video.
#Computer Education
#Programming