0:00
So, my name is Buck, I'm an applied data scientist here at Microsoft
0:08
What does that mean? That means I take the data science stuff, the machine learning and big data and all that, and advanced analytics, and I have to put it to use
0:17
I can't just talk about all these cool formulas and stuff. We actually go out and use them on customer projects
0:24
So, what we're going to be talking about today, and you can cut over to the screen if you want to there, Simon
0:28
is the SQL Server Big Data Clusters. And we'll just stay in this view for right now
0:34
I'm sort of watching the chat, the comments here. And then if there's more, thank you, Chashim
0:40
Appreciate that. And we'll walk down. I teach college and I do other things
0:45
So I'm used to watching the comments and stuff and I'll address those as I can
0:50
Okay, so we're gonna be talking about secure analytics at scale. So what we mean by that is companies want a place
0:57
where they can keep large amounts of structured and unstructured data. And sometimes they want it in the cloud, and sometimes they want it on-premises
1:06
and sometimes they want both. And sometimes they want it in a cloud that Microsoft doesn't run
1:11
Can we do that? Yes. So I'm a very simple man. I'm slow at things
1:18
It takes me a while to learn stuff. So one of my strengths is taking something complicated and explaining it simply
1:25
And the reason why that's a strength of mine is because I'm a simple person to begin with
1:30
So I have to explain it simply because that's the way I understand it
1:35
So we will do that. So feel free to pop your questions in there. We'll go through. We'll keep a watch on time and I'll make sure
1:43
Thank you, Ali. We'll make sure that we show some examples to all this stuff
1:47
By the way, I'm going to give you a resource to get all this information. So don't panic. Just watch. Just enjoy yourself. Ask questions, whatever you'd like to do
1:55
Grab yourself a beverage. Get ready for that tasty meal you're about to have at your break there
2:00
OK, so let's talk a little bit about big data. What in the world does big data mean
2:06
All right. When I was younger, big data for me was a megabyte
2:11
That was literally big data for me. A megabyte was huge. Nobody had that. Thousands of dollars to get a megabyte drive and so on. Obviously, that's not big data now. And in fact, there is no sort of number for big data. Big data to me, when I describe it to people, the way I put it is big data is any data that you can't compute in the time that you have with the technology that you have
2:41
So let's assume you've got a Commodore 64, a megabyte is big data
2:47
If you've got a huge cluster, maybe a terabyte is not big data. It's not a size
2:51
It's what you want to do with it and how you want to do it. And of course, we think about where does this data sort of come from, right
2:57
Well, it's pretty easy to think about how this happens. We have a place to put it and we've been putting it there longer
3:04
That's the simple explanation, right? You got that drawer in your house
3:08
You've seen this, right? It's right next to the refrigerator. When you first moved into your apartment or your house, that drawer was empty
3:14
And it's the one right next to the fridge to the right or the left. And you started putting things in it
3:20
And you've been there a while in your house or your apartment. And now you can't even open that drawer
3:24
You don't know what's in there. That's big data. You've been in that house longer and you've been shoving things in that drawer because it was easy
3:32
And that's big data. Now, because of this, we need to do something with it
3:38
And you think about like large retailers, like some of the big box stores and so on
3:43
They've been collecting a lot of data. Every time you put your credit card in, every time you buy something, somebody's tracking that data that you're putting in there
3:52
So the next question becomes, if we've got big data, what do we do with it
3:58
So I've got a bunch of data. So what? How do I leverage this? Well, I've broken this down by sector, by like industry sector here
4:06
And keep in mind, these have changed over time. But this is, I think it's about nine months ago, I did this research
4:14
The number one user of big data at the moment is retail. Second was finance, then healthcare and so on
4:19
This may have changed slightly. But the number one use case used to be in-store analytics
4:25
And now it is demand prediction. So every time you go to get something and you go to buy it
4:32
They need to predict when they need to make, ship, and stock the next one
4:37
Robbie says, my understanding is nodes need 24 disks and 64 gigs of RAM
4:41
or something in that realm. Do we need these kinds of beefy settings for just a POC or learning BDC
4:48
No, I'll show you a place where you can put it in one VM. It's what I do. But we'll talk about that
4:52
By the way, we'll get to the architecture. We'll talk about that a bit. But it's big data. All right. So it's big data. That's how that works
5:00
You need a big system to process big data. But we have a learning environment for you much smaller
5:04
In fact, I'm running one right now on my system, just just underneath my desk here
5:11
I'm not even hitting something outside. But at any rate, we've got these use cases
5:16
These use cases are usually predictive or even prescriptive analytics. And those require a lot of data. Now, whenever you're doing data algorithms, you need to process
5:30
this data because you show it good examples and bad examples and the algorithm develops a bias
5:36
towards one or the other and can then do predictions. So you need a lot of it to do this
5:41
stuff. Hence the reason for big data. So here are some of the use cases. I think you'll see one that's
5:46
kind of interesting in finance. It used to be fraud detection. They were trying to detect fraud
5:52
Now they're trying to prevent cyber attacks. So it's kind of interesting that that one flipped and
5:57
then you can see the rest of these blockchain computation and customer retention and so on
6:02
Okay, so that's the use case. Let's stop here. We've got big data. We got sources of big data
6:07
all over the place. And then we have a reason to do big data. You're welcome, Robbie
6:14
So we've got a reason to use big data. So, so far, big data, a reason to use it. What do we do
6:21
Well, when you have a lot of data, you can't scale up anymore. You have to scale out. So I'm a simple man. So I built a little cartoon for you. I'm not a good cartoonist either, but I did the best I could with my incredible PowerPoint skills here
6:36
All right. You've probably been to the grocery store, right? You went to a store and bought something. Right
6:41
And so some people get in line, and then you get in line after all these other people. Right
6:47
And here's the deal. There's a register here. Yeah. That's got a computation on it
6:53
And that computation is really, really easy. It is take this item, record the price and add it to the price of the next item that comes
7:02
N plus one. That's it. Just take the N that you had, add the next item, and then do that
7:08
again and again until end of list, until EOL. Very simple algorithm, right? And so the register's
7:15
punching that in there, and you're standing there. You've got like two items, and the person in front
7:19
of you has like 50, and they're doing everything with change, loose change, and you're like, oh my
7:23
gosh, what's happening here? This is what's happening inside your computer. There are only four things in a computer: CPU, memory, disk, and network. Those four things are busy doing the computation
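That N-plus-one loop is about the simplest algorithm there is. Here's what it might look like sketched in Python (the function name and prices are made up for illustration):

```python
# A sketch of the register's algorithm: take the running total,
# add the next item's price, repeat until end of list (EOL).
def ring_up(cart):
    total = 0.0
    for price in cart:       # one item at a time
        total += price       # N plus one
    return total

print(ring_up([2.50, 1.25, 3.00]))  # 6.75
```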
7:37
So if we think about this, what happens? Can we get somebody to come to register three or register
7:43
three? So yeah. Hi, Joshua. Come to register three. So register three opens up. Once register
7:51
three opens up, you take your cart and you push it to the other line. And here's what's interesting
7:57
Stay with me here. Both of these registers have exactly the same algorithm, don't they? They both
8:02
do things the same way. They use the same computation. But each of these grocery carts
8:10
has different items in them, right? You've got your toilet paper and your bananas and your
8:14
your soft drinks and somebody else has something else, right? So that's data. So the data is
8:21
different and it has moved over the same algorithm, but someplace else. They have distributed
8:29
the processing. That's what's happening here. Now, at the end of the day, you got to put all
8:34
this back together because the manager in the back, she wants to know the complete price of
8:39
all the registers. So that's easy. Now register A and register B just send their totals off and that
8:47
is calculated. And there you go. That's distributed processing. Okay. So all of you who are listening
8:53
and paying attention and not checking your email or surfing the web, you are now allowed to put on
8:59
your resume that you are a Hadoop expert. So you are now a Hadoop expert. Put that down on your
9:05
resume and I'll explain why you can do that. This right here, these grocery carts that you see on
9:12
the screen here, that is HDFS, the Hadoop Distributed File System. The data has been spread out and this
9:21
particular piece in green here knows where that is. The map in MapReduce is the mapping of the same
9:28
algorithm across the data. And the reduce is this part right here. And the yarn, this is the person
9:38
who said, hey, Bob, come to line three. Bob, open up a register on line three. That's yarn. Okay
9:45
number three, you can take your break. That's exactly how this works. So if anybody asks you
9:49
if you're a Hadoop expert, you say, yes, I attended a class by Buck Woody and I'm a Hadoop
9:54
expert. All right. But this is slow. This is batch oriented. Yeah. At the end of the day
9:59
we're waiting to calculate all this stuff. We want something much faster. So my good friends
10:04
at Berkeley, California, invented something called Spark. Now, Spark still uses HDFS and it still
10:11
uses lots of computers. They call them nodes. If you want to make more money in computing
10:16
you take something that already exists and you name it something else that sounds cooler
10:21
And then you can charge more for that. So we have these nodes, which are just computers
10:26
We have the distributed storage. But what Spark does is it reads this stuff into memory and does some networking tricks
10:33
And it's much, much faster. And it calls these things resilient distributed data sets or data frames
10:39
So it brings them up into memory, and they just sit there. Then you lay a library over the top of them
10:45
And these libraries are quite interesting. They simply convert that RDD into something else
10:53
So, for instance, if I want to use Spark SQL, which we'll do in just a moment if I've got time
10:58
I think we'll have time to do that. I turn it into a dataset
11:01
That's what Spark SQL knows how to work with, a dataset or a data frame. That's how it knows to work with it
11:08
If I want to work with it in R or something else, I turn it into a data object
11:13
Josie says, not enough of my software engineering MBA. I felt so comfortable to say that
11:17
Nicely done. You're an expert. Now, you put that down. Josie, you put that down
11:22
Buck Woody said you're a Hadoop expert. All right. So it just lays libraries over the top of this
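To tie the grocery store back together: same algorithm, different data, spread out and then combined. Here's a toy version in plain Python (a stand-in for the idea, not real Hadoop or Spark; all names are made up):

```python
# Each "register" runs the same ring-up algorithm on its own cart (the map),
# and the manager adds the register totals together (the reduce).
def ring_up(cart):
    total = 0.0
    for price in cart:
        total += price
    return total

carts = [
    [2.50, 1.25, 3.00],   # register A's cart
    [4.00, 0.75],         # register B's cart
]
register_totals = [ring_up(cart) for cart in carts]  # map: same algorithm, different data
grand_total = sum(register_totals)                   # reduce: combine the totals
print(grand_total)  # 11.5
```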
11:27
All right, let's stop for a minute and let's talk about where we are because I'm a simple man. We have large sets of data
11:34
We've got something to do with that data. And we now have a way over here to process that data
11:42
And now we have a way to process that data fast. Now we have a problem because of these nodes. See that? The registers, there are only so many. And we need to figure out a way. Do we buy computers and do we stand them up in racks? And then when we're done with them, do we call the computer company and say, all done, don't need it today. You can pick them up, take them home. Then you say, I need 30 more in five minutes. You can't do that, right
12:09
So what we did with our next part of the problem is we need to virtualize our computer
12:16
Now, going back to the mainframe days when I was around, we didn't actually have computers when I was little
12:21
We just had, you know, we just yelled ones and zeros at each other across the room
12:25
But when they invented mainframes, this was the way things were done to begin with
12:29
And here's what this does. We have a physical computer and the physical computer has the big four, CPU, networking, disk and memory
12:36
It's running an operating system, some sort of operating system, a quality one like Microsoft Windows or something
12:42
And then on top of that, we have this technology called a hypervisor, and it does two things
12:48
The first thing it does is it grabs a little bit of the computer's CPU, disk, network, and memory
12:55
It grabs some of those away and then it presents what's called a BIOS, a basic input output system
13:02
which is the thing that happens when you turn your computer on and all that stuff is flashing across
13:07
That's your BIOS that bootstraps the computer. And the second thing it does is it looks at the hard drive and says, is there an operating system lying around
13:18
And so that's the second thing a hypervisor does is it fakes a hard drive
13:23
It calls this a virtual hard disk. The combination of this fake CPU disk network and memory and the combination of this fake hard drive allows you now to install another operating system, even one that's not on the main computer
13:40
And it carves up the CPU disk networking and memory. You have another computer
13:45
It's called a virtual computer. We used to call them virtual PCs. Now they call them VMs for virtual machines
13:50
All right. So now I can run whatever I want to in there. And it's very isolated
13:54
Remember our Spark thing? Now I can have lots and lots and lots of nodes running stuff and I can have Python, some binaries, maybe some code, whatever
14:04
And they're all independent. Here is the problem. The problem I have with this is I don't need all the stuff that an operating system brings
14:15
I don't need a mouse or keyboard or video, or as much of the protection that it provides
14:20
Now I got to patch all these operating systems. Now I've got to secure all these operating systems
14:26
Now I've got to care and feed all of these things. They're like puppies
14:30
You get a puppy and it comes home. Who takes care of it? You do. Not the kids
14:34
You're the one that takes care of it. So now you have tons of these things laying around and they're not light
14:39
Just the operating system itself needs a ton of memory and CPU and disk and network
14:44
And then you run your stuff on top of that. So this is no bueno. We don't like this very much
14:49
Not only that, it's the same stuff and it's just ring fenced a little bit. So what do we do
14:54
We move on to the next way to virtualize, something called containers. Now, if you know containers already, great, good for you. If not, pay close attention. And after this, you will be able to add to your resume
15:05
not only Hadoop, but also containers. This is the best-spent hour
15:11
you're ever gonna have. All right, so once again, we have our physical computer, yeah
15:16
We got our operating system. And for reasons which are not important right now
15:20
this is usually Linux. What? A Microsoft guy saying that we're going to run Linux
15:26
Yes. And there's a reason for this. And it's not important right now. Just take my word for it
15:31
All right. So we've got this operating system. We've got the big four, CPU, disk, network, and memory
15:35
We're not going to reshare them. Why do that? I already have those
15:38
Why carve them up? So I'm not going to. And I don't need another hard drive
15:43
I have a hard drive. So I don't need to do that either. So what I'm going to do is I'm just going to create something called a container runtime
15:50
You might have seen some of these. There's a lot of them out there. The main one is called Docker
15:54
You may have heard of this thing. Docker is something that we'll use. So after we get the container runtime, what does it do
16:02
All right. It takes a thing called a manifest, which is a text file
16:08
That's all it is, just a text file. We used to have text files in my day
16:11
And then we went to INI files for a while. And then the whole world was written in XML
16:18
And now the whole world is written in JSON. So apparently we just keep renaming text files into something else
16:25
So it takes a JSON file, JavaScript Object Notation, which, despite the name, has nothing to do with Java
16:31
But whatever, I digress. It actually puts all this stuff in there
16:36
But anyway, whatever it is, JSON. And it actually has a list of what you want
16:43
It's a lot like SQL, select star from table. You don't know where a table is
16:47
You don't know where the hard drive is. You don't care. Select from the table. In this case, I say, I want some stuff running. That's all you're going to say. You say, I want Python running and I want some code and I want this and I want that. And you lay down that file. So the container runtime can then ingest that file and make a binary image of what you said you wanted
17:11
In other words, it says Python, huh? Let me go check. And it grabs a Python and it puts that there
17:16
And it says, what else you want? Some code? I'll go grab that and it puts that there
17:20
And it says, what else you want? Oh, a database server? Let me go grab that and put that there
17:24
Now you have a binary image. When you turn that on, you have something called a container
17:31
And that's how that works. Now, a lot of people will use the word container for all three of those artifacts, but they are different
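In the spirit of "it's just a text file," here's a made-up manifest sketched as JSON from Python. This shape is illustrative only; real Docker builds use a Dockerfile with its own syntax, not this JSON layout:

```python
import json

# Hypothetical wish-list in the spirit of the talk: "I want Python running,
# and some code, and this and that." Real Docker uses a Dockerfile, not this
# JSON shape -- this just shows the declarative idea.
manifest = {
    "base": "python:3.7",            # "I want Python running"
    "copy": ["app.py"],              # "...and I want some code"
    "run": ["pip install requests"], # "...and this and that"
}
text = json.dumps(manifest, indent=2)  # the plain text file you lay down
print(text)
```

The container runtime's job is then exactly what the talk describes: read that wish-list and assemble a binary image from it.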
17:37
Now, as I said, you might have Python, binaries, and code. And that's all you have. And this very special little ring fence around this keeps this completely isolated
17:47
So, yeah, it's running Python and your binaries and code, but nothing else knows about it
17:51
It looks kind of like a virtual computer, but it's not a whole virtual computer
17:55
So it's really small and you can do it over and over and over
17:59
Here's what's cool about it. Let's say this container that I'm pointing at here on the left has Python 3.5, let's say, that you're running an app with
18:07
But you've got another app you want to run. So you're going to run Python 3.7 over here on the right in this container
18:14
You can do that and it will pull down the image and build it and so on. Now, let's say the middle container also uses Python 3.5
18:23
The container runtime is smart enough, and it uses the Linux operating system well enough, that it will actually share that
18:30
It's a layering effect. So it doesn't actually reload it again into memory
18:35
saves a lot of space because the binaries are identical. Now, in this case, the binaries and
18:42
code are different between containers A and B, but the Python is not
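You could model that layering effect in a few lines of Python (a toy model of the idea, nothing to do with Docker's real layer implementation; layer names are made up):

```python
# Toy model of container image layers: an identical layer is stored once,
# no matter how many containers use it. Not how Docker is really built.
containers = {
    "A": ["python3.5", "binaries-a", "code-a"],
    "B": ["python3.5", "binaries-b", "code-b"],   # shares the Python layer with A
    "C": ["python3.7", "binaries-c", "code-c"],
}
loaded = set()
for name, layers in containers.items():
    for layer in layers:
        loaded.add(layer)   # a shared layer only lands in memory once

print(len(loaded))  # 8 layers stored, not 9 -- python3.5 is shared
```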
18:48
Everybody got that? Okay, good. All right, looks like we're tracking here. Okay, all right
18:52
This is the way containers work. You may now put that down on your resume
18:57
Now, I told you earlier, if something's easy to do, then we do a lot of it
19:03
That's why there's so many people in the world. And I'll let you think about that on your own time
19:07
But that's the way that works. So now we've got sprawl. Let's stop a minute
19:12
We need to go back a little bit. Here's what we got. We got a lot of data
19:16
We got something we should be doing with the data. We've got a way to process a lot of data that's spread out
19:23
And we've got a way to do that at speed. And now we've got a way to make that pretty lightweight
19:29
But we've got a lot of them. Maybe like 10, maybe like 100, maybe like 20,000 of these things running around
19:37
That's just chaos, right? So we don't want that. So we invented yet another technology
19:44
And this is for the orchestration of all those containers. Stick with me here
19:52
Yes, Ravi, your resume is looking good. You're going to get to put one more thing on it
19:57
I know, get ready, right? All this learning. And we've only been at this, what, 30 minutes
20:01
It's amazing. All right. So here we go. This thing you've probably heard is called Kubernetes
20:07
And guess what it's made up of? Yes. That's right. Just JSON files. Yet again, JSON files
20:15
What's the deal with the JSON files and developers? So you write out, I'd like these containers
20:22
which have JSON files of their own. Keep them together and keep them running. I want three or
20:28
four of these things going all the time and put it in a nice big network and put the storage
20:32
somewhere else and handle security and handle all the network and just do that for me. And I want to
20:38
describe that in a file and you run it. And that's Kubernetes. And here's the way that works
20:43
We've got our nodes, right? We remember this. This is our physical computers because at some point
20:48
it's physical computers, right? At some point there's this and we've got storage. Obviously
20:52
we need to have disk drives somewhere and they're separated from all this. They're not on those
20:57
computers. They could be, but that's not best practice. We install this Kubernetes thing
21:02
It's just an API and it gets installed on every one of the nodes and it forms this web across all
21:10
of these. Then you talk to that Kubernetes web, if you will, and it uses something called a pod
21:18
P-O-D, because remember, we have to keep changing the names of things to make more money
21:24
So we've called this a pod. What's in a pod? This is how we talk
21:27
We write it in the JSON file. Hey, guys, I need to be in here. And we click right here and we put containers in it
21:35
One or more, or a lot more if you want. And you can group them however you want
21:40
But they stay together in a pod, even if it's just one. And you can stand up as many pods as you want
21:47
And you say to me, Buck, how do I know where to put these? I mean, how do you know whether that's on node one or node two or not
21:54
You don't. That's the point. You tell Kubernetes, here's your servers, figure it out
22:02
This is amazing. Just figure it out. These can move around. I don't care where you put them
22:06
I don't even want to know. Doesn't matter. So it spreads out and makes this virtual group of computers with its own network address
22:15
So you hit the Kubernetes cluster this way, and then it goes inside and has its own network
22:20
to talk to everything else. Yes, keep changing the names. Now, wait a minute. We got one problem here, though
22:25
If things can just move around, what happens to the storage? There's got to be something. They fixed this. What they have is something called a claim. A claim
22:35
And the way the claim works is, it's just a name. It's just a string. And it connects to something called a volume
22:41
You could think of it like a hard drive or a LUN or a space or a C drive, whatever you
22:46
want to call it. It's just a pointer, which becomes kind of interesting
22:51
It's kind of like a little software wire, if you want to think of it like that. And here's what's cool. Let's assume our pod, the one, two, three, four, five, our fifth pod there has a bad day. Something breaks and it goes away. And then it wakes back up. It will automatically, because it woke back up, because Kubernetes told it to wake back up, it will pop out and say, that's my volume. Remount it and keep going
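A pod and its claim really are described in a file. Here's a minimal pod manifest in JSON form, built from Python (Kubernetes accepts JSON as well as YAML; the pod and claim names here are hypothetical):

```python
import json

# A minimal Kubernetes pod spec with a persistent volume claim.
# The claim is just a string that points at a volume, so if the pod
# has a bad day and wakes back up, it remounts the same storage.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "pod-five"},
    "spec": {
        "containers": [{
            "name": "worker",
            "image": "python:3.7",
            # mount the volume wherever we say
            "volumeMounts": [{"name": "data", "mountPath": "/var/data"}],
        }],
        # the little software wire: claim name -> volume
        "volumes": [{
            "name": "data",
            "persistentVolumeClaim": {"claimName": "pod-five-claim"},
        }],
    },
}
print(json.dumps(pod, indent=2))
```

You hand a file like this to Kubernetes and say: here's your servers, figure it out.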
23:19
Wow. We have big data. We got a reason for big data. We got a way to process big data
23:23
We got a way to scale big data. And now we got a way to control the scale. Buck Woody wants to tell you how we use all this at Microsoft
23:35
Here's what we do. Ta-da. SQL Server big data clusters. Notice the nodes are gone
23:41
Nodes are gone, right? Took them out because they're not important anymore. I don't care. Doesn't matter
23:45
I told it you run on these five computers. Go run on the five computers
23:49
Not my problem. You figure it out. So now I'm only going to work with Kubernetes, my pods, the containers in them, and my volumes
23:57
That's it. That's all I'm going to work with. So that's all I care about now. I type all that up and off you go
24:02
Here we go. First thing we know is that our administrator, she can use the commands to talk to just regular old Kubernetes
24:09
Let's stop here for a minute because I said Kubernetes runs on Linux or whatever. Kubernetes runs lots of places
24:14
Like you can run it as a service. Amazon has one. Microsoft has one. Google has one. You can stand up your own. All of that
24:22
Right. It does all this for you. So we don't care where it's running
22:27
We don't care where it's running. You could be running this on Amazon or in your office or in our office
24:31
Doesn't matter. You pick where you go. All right. Here we go. The first thing that we lay down when she begins to install SQL Server big data clusters is something we call a control service
24:41
Now, I'm illustrating these as pods and nodes, the blue and the gray there
24:47
But they can be, you describe where you want these, how big and how many and so on
24:51
I'm just going to do this just to keep us a little isolated. So this picture, when it's done, won't necessarily be accurate, but it could be, depending on what you want
24:59
All right, here we go. So we've got our control service. It wakes up. And the first thing it needs to do is create a little database for itself called a config store
25:06
So it's got a little pod in a container. Inside the container is a little database that's keeping track of everything
25:15
No harm, no foul. Here we go. Next part. We need a way to know if all this stuff is working and how well it's working
25:22
So we need logs and we need a performance tuner of some sort, a performance view of some sort
25:28
And there are two open source ones that are really good. Grafana is a great tool that lets you look at performance counters
25:35
It will collect them from anything. You can have SQL Server and Linux and Python and whatever you want
25:42
It'll go reach out. It'll monitor them, and then it'll pull it in. And I'll show you how it does that in a minute
25:47
All right. Once they're stood up, the control service has a little proxy that it makes
25:53
And now the administrator, she can look at the proxy, and the proxy can look at Grafana and Kibana
25:58
Kibana is a logging thing. And she can look at all that. It's all on a web page
26:02
It's all inside a web page. So she's just hitting this thing over the web, right
26:07
All right. Now, once those guys are healthy, we need to start installing SQL Server
26:11
So we do. We install SQL Server. We call this the master instance
26:16
Now, you hit that the way you hit any SQL Server. It's got a TCP/IP address
26:21
It's got a port, and you talk to it in your apps like it's SQL Server
26:25
You can use Management Studio, Azure Data Studio. You can use whatever you want
26:29
It'll work. All right. Once that's running, you may want to talk to data that is not inside your network
26:37
We do something with this called PolyBase. And PolyBase lets you mount other kinds of data like it's a table in SQL Server
26:45
You don't move it. You don't copy it. You simply, it's like a view
26:49
You can just point to it. That's called PolyBase. And I mean, we can talk to Teradata, Oracle, SQL Server, of course, DB2, HDFS, another HDFS, and so on
27:00
Question from Thomas: can an existing data lake be part of a big data cluster, so data doesn't have to move?
27:05
Absolutely. Not only that, you can use the big data cluster as your data lake, and the data doesn't
27:10
have to move. So you could certainly point, or do both, where you have data you need in here
27:15
and you have data you need out there. If I've got time, I'll show you guys an example. Pretty cool
27:19
stuff. All right. We have another thing. What if you want to keep a bunch of relational data, like
27:24
terabytes or petabytes of relational data, like BI-type stuff. So you want not only that external
27:30
data that is non-ordered, but you want to keep ordered, regular RDBMS data. We have something
27:37
called a data pool. And inside that data pool, you talk to, again, look at the line from the
27:43
application over there. You only talk to SQL Server. It handles all this other stuff for you
27:48
I'll show you. It's a thing of beauty. All right. Next, we talked about that whole Spark thing
27:52
So we've got a couple of pods in there and some containers in that pod, and we give you Spark
27:58
and more SQL Server that talks directly to HDFS. So once again, you talk to SQL Server
28:06
and it will talk to Spark for you. You don't have to learn MapReduce
28:11
You don't gotta go learn Python. None of that. We'll do all that for you
28:16
So that's the way this looks. But wait, there's more. What if you create a machine learning model
28:20
and you wanna score it or something? You wanna get a response back, but you wanna stay in this security boundary
28:25
We can do that with something we call an application pool. So you can deploy your models with Python or whatever else you want to serve
28:34
In here, there's a little proxy. So now your users can hit that
28:38
By the way, your users can also use the Spark directly if they want
28:43
They don't have to. All right, good. All right, that's as complicated as that one gets, I think
28:49
Yeah. You know, I'm going to bail out a slide world here for a minute
28:53
And I'm going to show you guys. Let's go take a look at this stuff really working. Let's go do that
28:57
Hopefully you can see that okay. This is Azure Data Studio. It's an open source tool
29:03
It runs on Linux and Windows and Mac and whatever. I'm hitting a regular server here
29:08
So here we go. I've got a big data cluster over here and you can see I've got databases and all that
29:13
But you can also see I've got an HDFS location and that goes back
29:18
Let me pop back over to Slide World a minute. So that is right here
29:23
See this right here? That right there is this right here. HDFS
29:28
There we go. It's got directories in it and all that. This is huge. Now, by the way, this is that single node I was telling you about
29:33
You can run one of these very small. But obviously, this is meant to be really, really big
29:38
All right, here we go. So first of all, when I hit this, it's just SQL servers
29:43
Now, here's what's kind of interesting. You're looking at a notebook here. If you've never seen a Jupyter notebook, Jupyter notebook is nothing but a fancy web page
29:49
It can run code and it can have text. That's it. Don't get excited. It's not that big a deal
29:54
Here's some text, right? And here's some code, right? And the code is dependent on the kernel
29:59
whatever that kernel is. That's what you run in here. All right. So I'm running SQL in here. Now, you won't normally see
30:04
that in Jupyter Notebooks. We wrote that. So we actually contributed back to that. All right
30:08
So here we go. You probably know what this is. SELECT @@VERSION, sp_configure
30:13
DB_NAME, where you're at, all that kind of stuff. I run the cell and it does the work. So
30:19
let's see what it came back with. This is SQL Server. I'm spreading this out. I have my screen
30:25
really big so you guys can see it. It's developer edition. Come on, man. It's developer edition
30:31
64-bit, a DAW, on Linux. So let me be clear here. Hold on. Microsoft SQL Server running on Linux
30:44
Interesting. Yeah, we wrote so that you could have SQL running on Linux, and it does, and it works
30:49
amazing. And now it can run inside a container because that runs on Linux. And that container
30:56
can run on a pod and the pod can run on a Kubernetes cluster and the cluster can run on a
31:02
node. Ta-da! Now we're part of that big data world. We can do this. Oops. We can do this
31:11
because we can run in Linux. And so we wrote to do that. All right, here we go. All right
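For reference, a cell like the one being described is ordinary T-SQL; this is a sketch of what it likely contains (the exact cell contents aren't shown here):

```sql
-- Engine version, edition, and platform (this is where "on Linux" shows up)
SELECT @@VERSION;

-- Which database am I in, and which server am I on?
SELECT DB_NAME() AS current_database, @@SERVERNAME AS server_name;

-- Instance-level configuration options
EXEC sp_configure;
```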
31:16
So I just did some stuff. And really the only thing you're going to see different between this, you might see it's kind of funny. Let me put this slide down so I can see your questions
31:24
if you have one. All right. It's kind of funny. Look at the path for master database files and
31:29
stuff. Isn't that wild? /var/opt, right? That's all Linux with the forward slashes and all that
31:33
They do that that way. And I can select data. I can come down here and I can just select data
31:39
By the way, I got the Wide World Importers backup file from my Windows SQL Server 2016
31:48
Express Edition on my laptop, and I backed it up and I put it in SQL Server 2019 on Linux, the backup file
31:58
And I said, restore that. And it restored it and it upgraded it. Hold on a minute. It actually
32:07
upgraded it automatically from the old version to the new one. So that's kind of cool. And I can
32:13
back up from here and go restore it in Windows. No problem. Works great. All right. Now, once I've
32:18
ingested some data, it's just SQL. It's just SQL. So now I can just start querying it, querying
32:23
Wide World Importers and so on. So here we go. All right. Now let's go to the next step. That's
32:28
pretty cool. But let me go to the next step here. What if we want to talk to data that isn't in SQL
32:33
server? In fact, what I'd like to do is I'd like to go down to my partner customers here and I want
32:40
to get that file, this comma separated value file. Keep in mind, there may be millions of them
32:45
Remember the shopping carts? There may be millions of shopping carts in here, but they all have the same structure. Right
32:51
And I want to talk to that, but I want to talk to it with SQL, with SQL, like it's a table
32:57
Now, you normally have to write all kind of code to make that true. In this case, you don't have to do that
33:01
All right. Here's what I do. First of all, I tell it I need a file format for CSVs
33:05
I called it CSV file. I'm very clever that way. And I told it, guys, this is Linux, right? So I told it this is Linux. Now, this looks just like what you would do with BCP, only we're not going to import the data
33:16
We're just going to point to it. Interesting. So first, I just made a comma-separated value file format
33:22
Now I said, create an external data source. So I need kind of a connection over to that thing
33:28
And this right here, by the way, is an HDFS endpoint. That's just the HDFS server, like this thing, this whole thing right here, this
33:40
gray box right there. All right, so that's what that is. All right, once I've done that, it says okay. By
33:45
the way, I don't have any security here because it's inside my big data cluster. This could
33:51
have been over on Amazon; I'd have had to provide names and passwords or certificates or whatever, and
33:56
there's a way to do that. This could have been Oracle, this could have been DB2, this could have
34:00
been Teradata. This could have been SQL Server that's living somewhere else. Now I get a table
34:06
in here. I literally get a table in here called, let me see if I can find it real quick. I'm not
34:12
going to spin on this. Let's see if I can find it. I don't remember what schema I put it under and all
34:17
that. So I basically created an external table now, and this is just the definition. This is
34:24
just the definition of, where is that? Partner customer. It is literally the definition. Let me
34:31
show you this file real quick. View preview. Let me show you this. So it's got three columns, boom
34:36
boom, and boom. That's what that's got. No, I don't want to save it. And here we go. Customer
34:42
source, customer name, email address. That's what's in that text file. And here's the trick
34:47
Here's where it is. Here's the directory, not the name of the file. Don't care about what the name
34:53
of the file is. That's HDFS's problem. So it's going to walk across all these files. I am querying
34:59
a directory. I am not moving the data. All right. So I do that. And now I can select top 10
35:06
and I can come out of that. Out of that, I've aliased it with HDFS and here it comes. And those
35:13
weird little characters, they're in the file, right? So I am doing a select statement against
35:19
HDFS data. No MapReduce, none of that. Isn't that cool? And by the way, I can now join them. I want
35:26
to know the company over there. They sent me a list of names and addresses and all that. And I
35:32
want to pair that up with data I have inside my SQL server. And so I just do a join. I just go get
35:41
the HDFS person right there and I go grab the local file name right out of my customer source
35:47
and I join them up and there we go. These people exist in both databases
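Pieced together, the sequence described above — file format, data source, external table, then the join — might look roughly like this. The object names, columns, HDFS path, and local table are illustrative, not the exact ones from the demo:

```sql
-- 1. Describe the shape of the CSV files
CREATE EXTERNAL FILE FORMAT csv_file
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

-- 2. Point at the cluster's HDFS endpoint (no credentials needed inside the cluster)
CREATE EXTERNAL DATA SOURCE SqlStoragePool
WITH (LOCATION = 'sqlhdfs://controller-svc/default');

-- 3. Define a table over a *directory* of CSVs -- the data is not imported
CREATE EXTERNAL TABLE partner_customers (
    CustomerSource NVARCHAR(250),
    CustomerName   NVARCHAR(250),
    EmailAddress   NVARCHAR(250)
)
WITH (
    DATA_SOURCE = SqlStoragePool,
    LOCATION = '/partner_customers',   -- the directory, not a file name
    FILE_FORMAT = csv_file
);

-- 4. Join the HDFS-backed table against a local relational table
SELECT TOP 10 hdfs.CustomerName, hdfs.EmailAddress
FROM partner_customers AS hdfs
JOIN dbo.Customers AS local_customers
  ON hdfs.CustomerName = local_customers.CustomerName;
```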
35:52
Ta-da! I just joined data from HDFS and SQL and I just did it with a pointer
35:57
and I didn't move the data. All right, now what if I want to store a bunch of data from HDFS
36:02
but in a relational table? I can do that too. I do it exactly the same way
36:06
except this time I make a data pool instead of the HDFS. I also use the HDFS
36:11
to go find the web click streams. I think that's in the web logs
36:15
Here we are, web clickstreams. I go get that and I take it out of here, and I am moving it this time, and I save it with an INSERT INTO. Listen, this doesn't sound scary. It does it right here: INSERT INTO, just a regular old INSERT INTO
36:29
Now I did move the data because I wanted it to be relational and I could have done all kind of sequely things
36:36
In fact, this is kind of ETL with a query. It's kind of weird
36:41
I could have, like, had an Oracle table and an HDFS file and a bucket over in Amazon
36:47
I could have queried them all, done some joins, uppercase this, lowercase that, and then do an insert into after that
36:55
Boom. Bob's your uncle. I've got all that data ETL right into my system
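The "ETL with a query" move described here is the same external-table pattern pointed at the cluster's data pool, followed by a plain INSERT...SELECT. A sketch, with illustrative table and column names:

```sql
-- Data source for the cluster's data pool (distributed SQL Server instances)
CREATE EXTERNAL DATA SOURCE SqlDataPool
WITH (LOCATION = 'sqldatapool://controller-svc/default');

-- External table backed by the data pool
CREATE EXTERNAL TABLE web_clickstreams_pool (
    wcs_click_date_sk BIGINT,
    wcs_user_sk       BIGINT,
    wcs_web_page_sk   BIGINT
)
WITH (
    DATA_SOURCE = SqlDataPool,
    DISTRIBUTION = ROUND_ROBIN
);

-- This time the data *is* moved: a regular old INSERT INTO from the
-- HDFS-backed external table, with any SQL transforms along the way
INSERT INTO web_clickstreams_pool
SELECT wcs_click_date_sk, wcs_user_sk, wcs_web_page_sk
FROM web_clickstreams_hdfs;   -- external table over the HDFS web logs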
36:59
That's all I had to do. Amazing stuff. Now, what if I want to use that cool Spark stuff
37:05
So now what I'm going to do is I'm going to change my kernel to PySpark, which is the Python that Spark uses
37:11
I'm going to attach it to that same computer, to that same cluster, same node
37:16
Can SQL server on Linux use permissions defined with Active Directory? It can
37:22
It can. I can't show you that right now. It's complicated. It's complicated. Linux is a mess
37:27
But, yes, we can do that. Okay. Spark for ETL. I've got a lot of data
37:32
Tons of data. Millions and millions and millions of rows. And I want to do the ETL not in SQL or anything else
37:37
I want to do it in Spark because, hey, it's what Spark does. So all I do is start up a Spark thing. All I do is a Spark call, which looks like this. There's the file format. Here's the blah. Save it as a table. That's what it does. Save it as a table. Then it writes down a ton of stuff. And now I can say, run some Spark over that table I just made. And this is like limit 10, top 10. That's what that is. Right? And then show it to me. Look at that
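A sketch of the PySpark cell being described, assuming the notebook's PySpark kernel provides the usual `spark` session; the file path and table name are illustrative:

```python
# Read the CSVs out of HDFS with Spark (path is illustrative)
df = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("/clickstream_data/web_clickstreams.csv")

# Persist the result as a Spark table -- "save it as a table"
df.write.mode("overwrite").saveAsTable("web_clickstreams")

# Now run SQL over the table just made; LIMIT 10 is Spark's TOP 10
spark.sql("SELECT * FROM web_clickstreams LIMIT 10").show()
```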
38:09
So now I can use my SQL magic, my Foo, my SQL Foo
38:13
I can use that in here on Spark. Isn't that awesome? And so now I've got a file there
38:19
And by the way, I could take that file and then look at it with SQL. I could do both
38:24
My data science, she's working away. My DBA, he's working away. They can be working on the same data at the same time
38:31
Just amazing stuff. All right, here we go. Machine learning. We're coming into the home stretch
38:36
What if we want to put all this together? Long story short, what I did was I made a Python Spark model
38:42
I went out and got a ton of data. I did all kind of data science-ery things to do everything
38:47
I looked at the data. I did data ingestion. I did feature engineering
38:52
Yeah, whatever. All that stuff is really cool. Don't tell anybody that I told you this because this is part of our secret decoder ring that we use as data scientists
39:02
This is all I really do. I added a comment here to make it look like I was doing more because it looked just three lines and I didn't look right
39:10
So I added a line to make it look like I was doing something important. And all I'm doing here is I'm using one of the most common models you can use, which basically fits a line
39:20
That's what it does. And it says, well, it was this way yesterday. So, you know, it was cold yesterday, but it's going to be cold today
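The "fits a line" model is ordinary least-squares linear regression. Stripped of Spark, the math is small enough to sketch in plain Python; the battery-voltage numbers below are made up for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Hypothetical battery readings: hours elapsed vs. volts measured
hours = [0, 1, 2, 3, 4]
volts = [12.6, 12.4, 12.2, 12.0, 11.8]
model = fit_line(hours, volts)

# Extrapolate: "it was this way yesterday, so it'll be this way today" --
# when the line crosses a failure threshold, schedule the maintenance
print(predict(model, 5))   # extrapolated voltage at hour 5, about 11.6 V
```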
39:26
That's all it did. Once it's done, I can make a prediction. Now, here's what's cool. I can actually save all that out. I can do predictions. This one's doing predictive maintenance. It's telling me when a battery on a component is going to fail. Literally, when I predict this is about to go bad in, like, three minutes, you'd better go fix it. Right? And it sends me a thing. Here's what's cool. This is for a refrigeration unit on a truck
39:48
So what I can do is save that model once I've trained it with all that big, big, big data
39:53
Now it's tiny because I don't have the data anymore. I've just got the model. It's just tiny
39:58
And I pop that out and you can run it on something called SQL Edge
40:03
And SQL Edge is SQL Server running on a tiny, tiny little, like, Pi box, a little tiny guy, a chip, if you will
40:12
Now I can embed that in the truck. The truck's monitoring the system, refrigeration system all the time, 24-7
40:19
And it says, hey, dude, tomorrow at 3 o'clock, your battery's going to die
40:23
I've already ordered it for you. It's waiting for you at the next truck stop
40:28
And they've been notified to set up an appointment to put it in for you. And so the truck driver pulls over, gets the battery, keeps going
40:36
No more replacing batteries every month when you don't need to. No more spoiled goods, so the ice cream's not damaged when you do
40:43
And Joe asked the best question, which is, do you have a website or a blog with all this crap on it
40:49
I mean, my gosh, it's a lot to keep in mind. And it is. So I'm going to pop through because there's a bunch more stuff here, folks
40:55
This is where I show you kind of how the guts all work and all that. You don't care. It's not important
40:59
This talks about the Active Directory stuff that you asked about and it shows how all that works
41:04
And I show an example of how financial tech does their work
41:08
They do a fraud detection thing. I've got all this somewhere else, so I'm not going to spend time on it
41:14
I'm going to get to the answer to your question. Ta-da! Joe asks and ye shall receive
41:20
If you want to find all the documentation, which has everything I've gone over, just go to aka.ms forward slash BDC
41:27
But if you want some in-depth training, hands-on, that goes step by step and does everything I did and reproduces all of it
41:35
And you learn along the way with free training on Kubernetes, free training on containers, free training on machine learning and HDFS, everything you can imagine
41:44
Go to aka.ms forward slash SQL workshops. And you'll find this is a neat little secret, not just big data cluster stuff there, but all the stuff from Bob Ward and Anna Hoffman
41:56
You'll find there as well as yours truly. And then if you just want the code examples, various code examples, we've got those there at aka.ms forward slash BDC samples
42:09
Well, Simon, I'm out of breath, man. We're on time and on budget
42:16
I mean, it's time for me to eat lunch. I haven't had lunch, so I got to go get a sandwich or something
42:20
But this is the resources there that you've got, folks. Simon, if you want to bring that back
42:26
Okay, everybody take one of those screenshots. Hold on a minute before I take a screenshot. Simon, make a funny face