MapReduce and Design Patterns - Simple Random Sampling Example
Oct 18, 2024
0:00
In this video we shall be discussing a simple random sampling example.
0:05
So it is a kind of filtering, but actual filtering will not be there: we are not matching
0:11
a criterion against the set of records. Instead, here we shall be generating random numbers and,
0:17
depending upon the random numbers, we will be selecting a subset of the full data set.
0:24
So let us go for more discussion on it. Here we will reduce the total data set
0:29
into a smaller set, where each record is selected with equal probability.
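The idea described so far can be sketched in plain Java, outside Hadoop. This is only a minimal illustration, not the code from the video: the class name `SimpleRandomSampling`, the method `sample`, and the fixed seed are hypothetical, and the probability `p` is assumed to be a fraction between 0 and 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SimpleRandomSampling {
    // Keeps each record independently with probability p (assumed to lie in [0, 1]).
    // No criterion on the record content is checked; only the random draw decides.
    static List<String> sample(List<String> records, double p, Random random) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (random.nextDouble() < p) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in data; the video uses rows of posts.xml instead.
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            records.add("row-" + i);
        }
        List<String> subset = sample(records, 0.1, new Random(42L));
        System.out.println(subset.size() + " of " + records.size() + " records kept");
    }
}
```

Because every record faces the same independent test, the size of the subset is not fixed in advance; it only comes close to `p` times the number of records on average.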
0:36
So from the huge data set we'll be selecting a subset. It is not completely a filtering job, because here you are not giving any criterion such that,
0:46
if the criterion matches, the record should stay, and otherwise the record should
0:50
get dismissed. It is not like that. Without matching any condition, a random number will be generated here to select records
0:59
randomly from posts.xml. This XML we have used in the earlier
1:04
examples also, so we'll be using the same XML here. The job
1:09
picks random records with a given percentage. So let us go for one
1:14
practical demonstration and implementation of this concept. Here we are going to
1:20
implement one problem, that is our filtering pattern example, falling under the
1:25
filtering design pattern, having the input file posts.xml. This posts.xml has a size of 108.93
1:36
MB, so it is a big file, and it resides under the folder /input/post on the name node.
1:44
So here we will be providing a percent value and, depending upon that percent, the respective
1:51
records will be selected from this posts.xml. This posts.xml has thousands of
1:57
rows and, as mentioned, a size of 108.93 MB. So let me show you some of the rows
2:01
for your understanding. We are opening the editor to show you the
2:06
posts.xml content. You can find here that the whole content has been
2:13
enclosed within the posts tag. Here we have shown only three rows, but
2:18
multiple attributes are there in each and every row, and thousands of rows are
2:22
there in posts. This is one of the rows. It has attributes like ID, post type ID, accepted answer ID, creation date, then score, then view count, body,
2:36
then we come to owner user ID, then last editor user ID, last edit date,
2:44
last activity date, title will be there, then tags, answer count, comment count,
2:52
favorite count and community owned date. So these are the multiple
2:57
attributes under the row. We're having only one class here, that is
3:02
SimpleSamplingMRTask.java. So the respective class name is this, and here it is a
3:08
map-only job, so we'll be having one inner class. This inner class name is
3:13
SimpleSampleMapper, which extends the Mapper class. It has two
3:18
member variables: one is a Random object, random, and the other is a Double object for the percent. We are overriding one method here, that is setup.
3:26
This setup method takes the percentage value as a string and then converts it to a double value.
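The setup step just described, reading the value as a string and converting it to a double, can be sketched without Hadoop as follows. This is a hedged illustration only: a plain `Map<String, String>` stands in for the job configuration, and the class name `PercentSetupSketch`, the helper `readPercent`, and the key `"percent"` are assumptions, not names confirmed by the video.

```java
import java.util.HashMap;
import java.util.Map;

class PercentSetupSketch {
    // Hypothetical helper mirroring the setup step: fetch the value as a string,
    // then convert it to a double. The key "percent" is an assumption.
    static double readPercent(Map<String, String> config) {
        String raw = config.get("percent");
        if (raw == null) {
            throw new IllegalArgumentException("percent is not set in the configuration");
        }
        return Double.parseDouble(raw.trim());
    }

    public static void main(String[] args) {
        // The map below only stands in for a real job configuration.
        Map<String, String> config = new HashMap<>();
        config.put("percent", "0.1");
        double percent = readPercent(config);
        System.out.println("percent = " + percent);
    }
}
```

Doing the conversion once in setup, rather than on every call to the map method, means the string is parsed a single time per mapper instead of once per row.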
3:35
In the mapper, in the map method which we are overriding here,
3:44
we use this percent: if random.nextDouble()
3:51
is less than percent, then only there is a context.write with NullWritable.get() as the key. So
3:57
whenever random.nextDouble() is less than percent, where the percent value we will be
4:03
providing here is 0.1, then only will the record
4:10
be written onto the context object. We're having the respective main method, which is checking
4:19
how many arguments we are passing here. If the length is three,
4:23
then it's okay; otherwise it will go for System.exit(2). So we require to pass the
4:28
percent, the input path and then the output path. These three arguments, which are to be passed, will be
4:34
accessed using args[0], args[1] and args[2]. So config.set with percent and args[0]: the percent, whatever you'll be
4:42
passing, will be kept in percent. During the runtime we'll be passing
4:47
this percent. Here we have created one job instance; the name of the job is simple sampling MapReduce task. So we have defined the job, and then we are having the setJarByClass. And here,
5:02
you see, we're having this setup method where we're reading this
5:08
percent. Now we are going for the setJarByClass, so here
5:16
SimpleSamplingMRTask.class will be set. For the file input format and for the file output format, we are supposed to read arguments 2 and 3, that is, args[1] and args[2].
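The argument handling described for the main method (exactly three arguments, otherwise exit code 2) can be sketched in plain Java as below. This is a minimal sketch under assumptions: the class `JobArgsSketch` and the method `exitCodeFor` are hypothetical, and the usage message is invented for illustration. A real driver would pass the returned code to `System.exit` and then go on to configure the job.

```java
class JobArgsSketch {
    // Returns 0 when the three expected arguments are present and 2 otherwise,
    // mirroring the System.exit(2) path described for the main method.
    static int exitCodeFor(String[] args) {
        if (args.length != 3) {
            System.err.println("Usage: <percent> <input path> <output path>");
            return 2;
        }
        String percent = args[0];    // first argument: the percent value
        String inputPath = args[1];  // second argument: the input path
        String outputPath = args[2]; // third argument: the output path
        System.out.println("percent=" + percent
                + " input=" + inputPath
                + " output=" + outputPath);
        return 0;
    }

    public static void main(String[] args) {
        // Demo values only; the paths are the ones mentioned in the video.
        String[] demo = {"0.1", "/input/post", "/output"};
        System.out.println("exit code: " + exitCodeFor(demo));
        System.out.println("exit code: " + exitCodeFor(new String[] {"0.1"}));
    }
}
```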
5:29
So setOutputPath and addInputPath. We're having only the mapper class.
5:33
The class name is SimpleSampleMapper, so that's why we're using that class name here
5:38
in setMapperClass. But it is a map-only job, so no reducer task is there.
5:44
No reducer will work. The output key type has been set to NullWritable, and the output
5:52
value will be of the type Text. We're having this code which returns the
5:56
completion status, whether the job has completed successfully or not, and that value
6:00
will be returned. So before going for execution we should create the jar file
6:05
against this particular project. So we go to the project and then
6:11
create the jar through export. Here we are supposed to select JAR file, and then select the respective path, the
6:19
respective project and the file name against which the jar file will be created.
6:24
Already we have created the jar file, so we are just skipping this step. So
6:30
let me come to the execution now. We require to execute this command. The
6:37
command is hadoop, then jar, and then the jar path: map reduce design
6:43
pattern is the folder, then jar files is a folder, and then the filtering pattern jar is the
6:49
respective jar file name. Then we shall be going for filtering
6:54
pattern, that is the package name, and SimpleSamplingMRTask, that is the class
7:00
name here, so package name dot class name. Then 0.1: we are passing this one
7:04
as the percent, so this value will be coming to this percent. The input folder is
7:10
/input/post and the output folder is /output. Now we are going to execute the command. Here we have given the percent value as 0.1, so only those records which come within this percent of 0.1 will
7:30
get registered onto the part files under the /output folder on the
7:37
name node. So let me see the current content. We'll be going for
7:41
hadoop dfsadmin. We're writing this one because you can find that this particular name node is in safe mode, so let me make it come
7:55
out of safe mode. We are executing the command once again because the
8:00
previous one could not generate the outputs. So again we're executing. See, here
8:06
we'll be getting all the outputs, and you see the bytes written is 96406.
8:13
Let me come to the output folder. Here it will be creating
8:18
multiple part files, I hope. See, we're coming to this output folder, and you can
8:24
find here that it has created seven part files.
8:30
Let me see the content. I want to see those records which
8:35
have been obtained from posts.xml, those records whose random number
8:40
was less than 0.1. So let me go for the printing: hdfs
8:47
dfs -cat, and then we shall go for the path and then part*. So this is the
8:59
total content of all the seven part files. We can find that these are nothing
9:03
but whatever we had in the respective XML file. So those contents have come
9:10
which have satisfied the respective percent value, that is 0.1.
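As a rough guide to how much output to expect, and why it changes from run to run, the selection can be modeled as independent trials. Assuming $N$ input rows and a selection probability $p$ (both symbols are introduced here for illustration; the video does not give the row count), the number of selected rows $K$ satisfies:

```latex
K \sim \mathrm{Binomial}(N, p), \qquad
\mathbb{E}[K] = N p, \qquad
\mathrm{Var}(K) = N p \, (1 - p)
```

So two runs over the same posts.xml with the same percent will usually select different rows and slightly different counts, while staying close to $Np$ on average.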
9:16
So these are the respective records which have been selected from the main XML file; we have actually filtered them out. This is the XML file content, and these are the selected posts.
9:34
The respective rows have got selected which have matched the percent condition. We're now deleting these part files and
9:44
the output folder. I hope the concept is now getting clear. Thanks for watching.