CTP project and CISC4900 Presentation, Signalyze
Jan 1, 2025
Our repository: https://github.com/WillSnakeTaka/Signalyze For more info please visit: https://www.mysteppe.com/category/music-releases/ Spotify: https://open.spotify.com/intl-de/artist/58kAMATJwufWYn7yp0hIk4?si=V023A9ISR2qaNRwezn_rPg SoundCloud: https://soundcloud.com/gav-khugzhem-taka SuperLink: https://ffm.bio/w5q94qq
Video Transcript
0:06
Hi, welcome to our presentation for CISC 4900. Today our topic is Signalyze. It's a
0:13
project we built throughout the semester. Our teammates are Marco Chen,
0:20
Rigoberto Pun, and me. In our project overview, Rigoberto
0:27
handled text-to-speech, and Marco Chen handled the gesture and hand landmark
0:33
detection sections. So here is a recap of our presentation
0:40
night, and this is [Music]
0:50
us. All right, now let's start with the
0:55
presentation. First we go through planning. [Music]
1:02
Rigoberto worked on the text-to-speech documentation and recommendations for the project, Hugging Face, and
1:09
Streamlit testing for text-to-speech, plus project suggestions. Marco
1:15
worked on OpenCV sign language training, data collection, local testing, and the implementation on Streamlit. I worked
1:22
on smaller projects, such as H5 training, testing different platforms
1:28
for usability, and putting the project together: integrating text-to-speech,
1:33
OpenCV, and also exploring librosa for future audio
1:40
augmentation. Today's presentation goes like this: first I will introduce voice processing
1:48
testing. This is the most important part of my learning, where I learned everything from audio processing
1:56
for music and also how to train an H5 file on different datasets. The
2:02
second stage is testing librosa and its uses in music analysis,
2:07
which led to my understanding of voice processing, which later helped
2:13
me integrate text-to-speech for the integration of the sign
2:18
language and text-to-speech, and to understand basically how text-to-speech and audio handling
2:25
work. The third step introduces image processing, then video processing, then a talk about platforms
2:31
and failures. In the last step we present our project and a
2:37
recap of our presentation night with my teammates. So now let me introduce my work
2:45
on the project. For my part, my contribution was planning. Rigoberto
2:51
and Marco formatted documentation listing what we needed to use for this
2:57
project, and I planned and made smaller projects to make each step work; then we put the pieces together,
3:04
similar to how you build a puzzle. The first thing is
3:10
understanding the basic steps; once we understand the basic concepts, we can find different resources and start
3:16
coding and testing. The testing platforms included Unity, Flask, Hugging
3:21
Face, Streamlit, Amazon's cloud, Render, Heroku, and various
3:29
other platforms for implementation. Experimenting with librosa is currently in progress. Getting these
3:34
things to work tonight is actually a small personal
3:40
achievement. The problem statement: sign language is a crucial communication
3:45
tool for the Deaf and hard-of-hearing communities, but many people are
3:50
not fluent in it, so bridging this gap through technology can enhance
3:56
inclusivity and accessibility. That's why we came up with the idea of
4:01
using OpenCV to recognize hand signs. Our solution: first we
4:08
build a system that captures sign gestures using a camera, then recognizes the gestures through machine
4:14
learning models, then converts the gestures to text and text-to-speech in real
4:20
time. The technology we use is OpenCV, which does real-time video
4:26
processing. MediaPipe was also really helpful, introduced by
4:31
Husam, with its hand tracking and gesture recognition system. TensorFlow and Keras handle deep
4:39
learning and model development, and finally we use Python libraries for
4:44
text-to-speech, NumPy, and Flask. My first step was to start with
4:52
learning something small. Handling waveforms is the most important part, because we have to go from one-dimensional
4:58
to three-dimensional data. Voice is the basic part, then images, using CNN formats and different
5:06
formats, and after voice and image we go through video, and how video and voice can
5:13
integrate together. For me, music study is my specialty in this
5:18
small project, so I used music study and voice handling, inspired by Husam's
5:23
repository for an emotion recognition system, to start this project. First, let's go through Husam's
5:31
really amazing repository. Husam is our teaching assistant, who has been
5:37
really helpful, and he's so great. They did something using machine
5:43
learning and librosa to recognize different speech and generate generalized emotions, then used these
5:50
emotions to detect other voices using the machine learning model. What I
5:55
learned from this really amazing model: they inspect and explore data, select and
6:02
engineer features, build and train models, and evaluate models. Basically they import
6:09
different libraries and use machine learning to learn things, and they use pydub's AudioSegment to represent
6:16
audio that can be manipulated using Python code.
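As a minimal sketch of that pydub AudioSegment idea (the file names here are hypothetical, and pydub needs FFmpeg installed to read MP3s):

```python
from pydub import AudioSegment

# Load an MP3 as a Python object that can be sliced and transformed.
clip = AudioSegment.from_mp3("my_track.mp3")  # hypothetical file name
intro = clip[:10_000]                         # slicing is in milliseconds
quieter = intro - 6                           # reduce volume by 6 dB
quieter.export("intro.wav", format="wav")     # write the segment out as WAV
```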
6:24
The first step is to inspect and explore the data. They train with really great data, like from Kaggle, but for me,
6:30
I wanted to learn something about my own music, so I did my own style of training, which I will show you later. So,
6:37
importing the dataset is another issue, because there are always formatting problems, and then comes preparing the data and
6:45
checking its readiness. This part basically deals with data labeling,
6:51
data cleaning, and data preparation, but for my part I don't
6:56
really need those things, because I labeled my own data: I know what my emotion was, so I put the files into different folders and
7:02
just checked them later. Then comes preparing the data to create the dataset. This is another
7:10
step: after you label different things, for example female, male, and neutral emotion, and all those,
7:17
you replace the raw labels with different kinds of typed values,
7:22
for example replacing the string "female" with a numeric label, that type of thing. But in my case I'm working with small data, my own data,
7:29
so it doesn't apply; their project is more ambitious, while our project is a little bit smaller. Then they
7:36
use data visualization. Currently I don't have this feature yet, but
7:42
my own data only deals with six different prototypes, which I will show you
7:47
later. Yeah, so basically they use librosa and a machine learning
7:54
system to map the different types of waves, and then it learns
8:00
different kinds of features, for example the frequency and also the seconds, so they can study these
8:08
types of features to recognize different patterns they haven't seen before, and use them to predict things.
8:14
This is a really great system. As for these spectrogram visuals, I had already seen
8:19
them during Professor Cohen's class on electronic music, so
8:25
this is also another very interesting thing for learning about different sounds.
8:31
Yeah, so basically this is the idea, and it's from Husam's repo, which I studied extensively. Now let's go
8:38
back to the learning process. Learning FFmpeg is another
8:44
formatting thing: it can convert between MP3 and WAV, because WAV is bigger, or not so much bigger
8:51
as more precise, while MP3 is smaller. I also learned so many things with
8:57
Professor Cohen about timbre and tempo, and how electronic signals pass through
9:03
the other stages. I forget the name of the software where you can paste a track in and it reflects
9:09
different pictures. Also, wait, sorry: labeling my
9:16
own data is another process. I labeled my own data, and I'm going to show you my basic data labeling setup.
9:24
For my small project, I have this music data. I picked
9:30
different pieces I composed during the past three years, because I know those styles are unique to me, so I labeled them with
9:37
different labels and put each music style in its own folder to learn from. That's how I labeled my own data.
9:45
Then we're supposed to generate a CSV format to record the data points;
9:50
for example, software is used to extract the tempo, the chroma, and those spectrogram-like
9:57
features. After this process we can generate an H5 machine learning
10:03
model and use this model to predict other things.
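To make that extraction step concrete, here is a minimal librosa sketch for one track (the file name is hypothetical; the row counts match the shapes described later in the talk):

```python
import librosa

# Load one labeled track at librosa's default 22.05 kHz sample rate.
y, sr = librosa.load("traditional/track01.wav")

# The three feature families discussed in the talk:
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)       # scalar tempo estimate (BPM)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # shape (12, n_frames)

print(tempo, mfcc.shape, chroma.shape)
```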
10:09
Now, back to the topic: what is the FFmpeg format? FFmpeg is a
10:14
powerful open-source tool for handling multimedia data. It can convert audio and
10:20
video formats, and resample, re-encode, and normalize audio files. Basically, for my case, it's
10:27
about converting between MP3 and WAV, because the WAV format has better quality and it's
10:34
easier to extract the music information from.
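A minimal sketch of that conversion step, calling FFmpeg from Python (it assumes ffmpeg is on the PATH; resampling to 22.05 kHz mono is an assumption chosen to match librosa's default load rate):

```python
import subprocess

# Convert MP3 to WAV: resample to 22.05 kHz and mix down to mono.
subprocess.run(
    ["ffmpeg", "-i", "input.mp3", "-ar", "22050", "-ac", "1", "output.wav"],
    check=True,  # raise an error if FFmpeg fails
)
```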
10:41
Now, about handling unusual audio. When we're dealing with data, sorry,
10:48
when we're dealing with machine learning and data labeling, there will always be conflicts, and
10:54
they result in failures: files not being read into the CSV, or not being recorded
11:00
correctly. My plan was to force-read all files, and there's a process I learned
11:06
called a fallback mechanism. A feature set has the tempo, MFCCs, and chroma,
11:12
which are basically different ways of recording the waveform. If a file fails and has unexpected dimensions,
11:19
dimensions being, to the best of my knowledge, array shapes, then, for example, if tempo has only one axis and
11:26
another feature has two, the shapes will not match, so I have to force the data into shape. My
11:33
solution is that for any problematic file, the output is forced to consistent dimensions.
11:39
That's the first issue. And if something, for example electronic music, or some of my music with
11:46
really weird tones, is not recognized, then I put zeros in as an alternative placeholder. The zero is
11:54
still valid, because when the model reads through the data it will still recognize the pattern and use it
12:00
to predict other patterns, with no skipping. Sometimes these files get skipped when things aren't working, so we have to
12:07
write code saying that the issue must be logged and processing must continue for every
12:12
file, to make sure every file has something recorded; even if a value isn't there, it puts a zero there
12:18
instead of skipping the whole file and messing up the whole
12:23
dataset. The next stage is debugging. Debugging the failures usually requires a uniform shape
12:30
for all features: basically, tempo, MFCC, and chroma are reshaped to
12:37
fixed dimensions, for example tempo has 1 row, MFCC has 13 rows, and chroma
12:42
has 12 rows. I'll skip past this part, since it's more of a personal discussion.
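Here is a minimal sketch of that fallback mechanism as described: every file yields a block with fixed rows (1 tempo + 13 MFCC + 12 chroma = 26) and a fixed width, and failures are logged and zero-filled rather than skipped. The N_FRAMES constant and the helper names are assumptions, not the talk's exact code:

```python
import numpy as np
import librosa

N_FRAMES = 5000  # fixed frame width so every file yields the same shape (an assumption)

def extract_fixed(path):
    """Return a (26, N_FRAMES) feature block; fall back to zeros instead of skipping."""
    def fit(a):
        # Pad or truncate a (rows, n_frames) feature to the fixed width.
        out = np.zeros((a.shape[0], N_FRAMES))
        width = min(a.shape[1], N_FRAMES)
        out[:, :width] = a[:, :width]
        return out

    try:
        y, sr = librosa.load(path)
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        tempo_row = np.full((1, N_FRAMES), float(tempo))          # 1 row
        mfcc = fit(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13))   # 13 rows
        chroma = fit(librosa.feature.chroma_stft(y=y, sr=sr))     # 12 rows
        return np.vstack([tempo_row, mfcc, chroma])               # 1 + 13 + 12 = 26 rows
    except Exception as exc:
        print(f"fallback for {path}: {exc}")                      # log the issue, never skip
        return np.zeros((26, N_FRAMES))                           # valid zero placeholder
```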
12:49
This is my backlog. The biggest issue I hit was this dimension thing, because, for example,
12:56
when the pipeline read one of my pieces, the Professor Zerio
13:01
reference, an experimental thing, they didn't have the same dimensions: the array at index zero had two
13:08
dimensions and the other had one, so that's why I went through this process to force everything through, and
13:14
eventually it worked. For example, it says the tempo shape is (1,
13:19
5000), the MFCC shape is (13, 5000), and the chroma shape likewise. All of these
13:25
give you numbers, and these numbers were supposed to be converted into a CSV,
13:31
but the CSV failed. I'm going to explain why. There are file types: .npy is a binary file specific to NumPy and
13:38
optimized for storing arrays and metadata, for example these large arrays. This is why I'm going to skip this
13:45
part; you can read it [Music] later. So, analyzing why my CSV file
13:52
didn't work: it didn't work because our features are multi-dimensional, for example (26, 5000), which
13:58
CSV can't handle by itself, because CSV is essentially flat,
14:04
and flattening those arrays into a CSV would lose the structural integrity of the data and complicate processing
14:11
and model training. That's why we use the NumPy .npy structure instead of that
14:16
structure, and Marco also came up with other structures, which was really great for his final hand
14:23
recognition system with MediaPipe.
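A short sketch of the .npy round trip that motivated this choice (the file names are hypothetical, and extract_fixed is the helper sketched earlier):

```python
import numpy as np

features = extract_fixed("blues/track02.wav")  # a (26, 5000) block, per the earlier sketch
np.save("track02.npy", features)               # binary .npy preserves dtype and shape

loaded = np.load("track02.npy")
assert loaded.shape == (26, 5000)              # the structure survives the round trip
# A CSV would force features.reshape(-1): one long row, row/column meaning lost.
```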
14:29
Now let's go to the next micro step: we've produced our first .npy file. Just imagine
14:37
it's similar to a CSV: it lists all the stats, like which
14:43
form this is, what this point is, and what number it is, and we can now train an H5 model to
14:48
recognize those patterns. Now we plan the training of the
14:54
models. The training first loads the dataset from the .npy file
15:00
and separates out the features and labels them, then we split them into training and testing sides, and after
15:07
these steps we're able to produce the H5 file. How it works: basically the model
15:12
learns patterns from the input features, and those features capture key characteristics of my music style, such as
15:19
rhythm, tone, and timbre, all compiled from the chroma and related
15:26
packages. These are my six classes; I call them traditional sentiment, crab
15:31
class, canon, innovative existence, BL, and the reference, which is the experimental one. Each
15:36
has its own ideas that don't exist in any other style. So after learning
15:42
my own patterns, it produces six styles labeled 0 to
15:47
5. Now I make it into an H5 file, so now we're going through the training
15:53
process, using neural network mapping to capture the patterns.
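The training plan just described (load the .npy dataset, separate features and labels, split into training and testing sets, train, save the H5 file) could look roughly like this in Keras; the file names, layer sizes, and epoch count are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# One flattened feature block per track, plus style labels 0-5 (hypothetical files).
X = np.load("features.npy")   # e.g. shape (n_tracks, 26 * 5000)
y = np.load("labels.npy")     # e.g. shape (n_tracks,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(6, activation="softmax"),  # six style classes, labels 0-5
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=30, validation_data=(X_test, y_test))

model.save("style_model.h5")  # the H5 file discussed above
```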
15:59
[Music] So now we have this H5 file, built using my
16:04
"CSV," which was really the .npy file, the two-dimensional data, and it gives you an H5 file. This H5 file is the most
16:11
important part of any model training, because usually other people train tens of thousands of files into an
16:19
H5 file, so people can borrow that model to train their own model. I
16:24
learned how to do these things from scratch because I really like Husam's
16:29
approach: back then we didn't have Hugging Face, we didn't have all that, we had to build things ourselves. This is why I
16:35
really like this, so I started to train things by myself.
16:41
Not every H5 ends up a success: there were only two that worked, one is this one,
16:47
and the other one was the hand sign model, which I'll show you really quickly. Now let's test these things on Streamlit.
16:55
So here is my project; let me show you. [Music]
17:12
See, now it shows you the six styles, and then you drag in a random file,
17:18
either MP3 or WAV, and it analyzes the weights of the six styles integrating
17:24
with each other. For example, let me put my own music there,
17:30
put something in, and run it as usual.
17:37
See, now it's analyzing based on my H5 [Music]
17:46
file, and what it produces is a prediction: we have, say, 7.4% of
17:52
this style. It uses these parts to recognize how the six styles integrate with each
17:58
other. It recognized that this piece was closest to this style over there, which
18:04
was 32%. And this is how it ended: Innovative Existence at 36%,
18:10
so it learned how different music styles relate and correlate to each other based on these patterns.
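A rough sketch of what such a Streamlit page can look like (placeholder style names, file paths, and the reuse of the earlier extract_fixed helper and H5 model are all assumptions):

```python
import streamlit as st
from tensorflow import keras

STYLES = [f"style_{i}" for i in range(6)]        # placeholders for the six class names
model = keras.models.load_model("style_model.h5")

uploaded = st.file_uploader("Drop an MP3 or WAV", type=["mp3", "wav"])
if uploaded is not None:
    with open("tmp_input", "wb") as f:           # stage the upload for feature extraction
        f.write(uploaded.read())
    feats = extract_fixed("tmp_input").reshape(1, -1)
    probs = model.predict(feats)[0]
    for name, p in zip(STYLES, probs):
        st.write(f"{name}: {p:.1%}")             # e.g. "style_3: 36.0%"
```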
18:16
This is my personal achievement on this small thing. Sorry, I forgot how
18:25
to stop it; yeah, let me stop here.
18:31
Now let's get back to the main project, section two. Section two is image processing and MediaPipe for video
18:37
processing: exploring MediaPipe, hand tracking, and analyzing different
18:43
data using our training method. The method we use is data preparation, data training, and real-time
18:49
integration. As for what failed: the Unity build. I will not show it here,
18:54
because the ONNX file is even more complicated. I only had two H5 successes by myself; usually people borrow
19:01
from other libraries, but other people's libraries could be bad, so you do it yourself. My teammates handled theirs
19:08
really well, but for now I'm just going to talk about my contribution to this small micro project.
19:16
The main steps in the process are cleaning the data and its uses; my teammates did this type of work and
19:22
handled it really well, and the text-to-speech was done
19:28
really well too. We have different formats, and I'm using the Mac-native one. So now we
19:34
have a backup plan: my backup plan was to use Unity to develop a number recognition system, so we
19:41
start from basic numbers, to hands, to video, to audio, and integrate all of
19:49
them together. So now, what does our first prototype do? Our first prototype's
19:55
input is captured using MediaPipe, with the finger points, and then we learn
20:01
these finger point patterns, as sketched below.
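A minimal sketch of that input step with OpenCV and MediaPipe Hands (the webcam index and the downstream classifier hook are assumptions):

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)  # default webcam

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        landmarks = result.multi_hand_landmarks[0].landmark
        points = [(p.x, p.y, p.z) for p in landmarks]  # 21 finger/palm points
        # ...feed `points` to the gesture classifier here...
    cv2.imshow("hands", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```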
20:09
So, what's the potential of this basic prototype? It can, of course, help with learning sign language, helping people with disabilities; and for music, sign language can help us
20:14
learn conducting comparisons; for gaming, we can use it as a gaming hand controller; and for new music art projects,
20:22
some people have already come up with these things. You can also just play for fun, like the game I developed, which we'll show you
20:31
later. H5 versus ONNX: they are two different formats, one is H5, one is ONNX.
20:38
The H5 hierarchical data format is used with TensorFlow and Keras, while ONNX was
20:44
mostly for the Unity Barracuda things, but that didn't work, so I'm mostly using H5 in this
20:51
case. The solutions and steps focus on cleaning data, testing real-time
20:56
integration, and refining the text-to-speech; then you try Unity, and finally you end up with Flask. Flask was also Husam's
21:03
idea; Husam is really great, and we learned so many good things from
21:12
him. The Unity side, the Barracuda thing, was something I failed at; I want to try it more, because Unity was better than other
21:19
platforms. This is the Barracuda testing. And what is H5? As I showed you before, H5
21:25
is called the hierarchical data format. These things work by reading through different formats, for example CSV and
21:33
also the .npy files like I showed you before, the two-dimensional things, as I call
21:38
them. The use case: it's commonly used with TensorFlow and Keras, and it can store different neural
21:45
network weights and training figures. The ONNX format was better for
21:50
Barracuda, and also for moving models between PyTorch, TensorFlow, and different
21:56
platforms. So how do we convert between them? We created code converting between H5 and ONNX; with optimization, these
22:03
things are also good for GPU and mobile use. There are also other
22:10
formats. How do we convert between the different formats? We convert from H5 to ONNX using pip-installed tools in the style of
22:18
these snippets: we copy-paste code along the lines of
22:23
torch.onnx.export(model, input_tensor, "model.onnx"), and then we create a converter
22:30
to convert these things into the other format. But this always failed for me, I don't know why; it's something to keep exploring in the future.
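The snippet quoted above matches PyTorch's exporter; for a Keras .h5 model like ours, one common route is tf2onnx. A hedged sketch, not necessarily the converter the project actually used:

```python
import tf2onnx
from tensorflow import keras

# Load the trained Keras model and export it to ONNX in one call.
model = keras.models.load_model("style_model.h5")  # hypothetical file name
onnx_model, _ = tf2onnx.convert.from_keras(model, output_path="model.onnx")
```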
22:35
Flask was the best idea, and I learned so much
22:41
with Flask; I'll show it to you really quickly later. As I said, I'm going to
22:48
tell you how I set up this Flask app. Basically, it was about deploying a
22:53
Flask app using different platforms. My solution was to use Render, and my
23:01
teammates also have their Streamlit and Hugging Face setups.
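A minimal sketch of such a Flask app (the endpoint name, temp-file handling, and reuse of the earlier model and extract_fixed helper are assumptions; a host like Render runs it as an ordinary web service):

```python
from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("style_model.h5")  # loaded once at startup

@app.route("/predict", methods=["POST"])           # hypothetical endpoint
def predict():
    request.files["file"].save("tmp_input")        # uploaded MP3/WAV
    feats = extract_fixed("tmp_input").reshape(1, -1)
    return jsonify(model.predict(feats)[0].tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)             # bind to 0.0.0.0 for hosts like Render
```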
23:07
Now let's test it on the different platforms. Let me close one of them first.
23:14
[Music]
23:36
Yeah, first let me show you this
23:41
basic testing [Music] system. First, we test whether this hand
23:48
sign detection is working or not. [Music]
24:15
Wait, wait, this thing is supposed to be like this,
24:21
[Music]
24:31
or let me show you another one. [Music]
24:51
Sorry. So, for this project:
25:01
this project generates a different sign each round, and it plays a
25:08
guessing game with you. For example, this hand is the "math" sign, and
25:15
this is the "fool," and this is the "math" again, so it plays something similar to
25:21
rock-paper-scissors with you, and it randomly generates something. I also trained these signs
25:27
one by one, using MediaPipe to recognize the different hand landmarks, and I made 150 different
25:35
pictures to learn these signs from. So this is my unique design, similar to
25:41
a model pattern recognition system. Now let me show you. Sorry, my computer is a little slow, so this
25:47
will be really unstable. For example, I'm signing "math"; it says you chose "mice," you
25:54
win. Give it a second; it will say "mice" at 0.75, and then it randomly generates
26:01
something to see if you win or not. You win. Now let's do another hand; you can barely even move your hand like this. You chose the "fool";
26:08
no, I chose the "fool," computer wins. Give it a little bit of time: you chose the "fool" at 100%. This one is easy because
26:16
it's just a peace sign. And now we have the cat. Let's see, the cat is the hard one:
26:21
you have to curve the fingers. No, you chose the cat, computer wins. See, you chose the
26:30
cat, it's a tie; you chose the cat, you win. Yeah, this is basically what
26:37
happens. I'll stop it, just in case it's draining my computer, and it was
26:43
generating there. So now, for the last part, I would
26:48
like to show our final project at the presentation night. Thankfully,
26:53
my teammates Rigoberto and Marco Chen and I were all working on these things. For the
27:00
last part, I would like to show a recap of Marco Chen's work. So here is Marco
27:07
[Music]
27:12
Chen. Sorry, now let's introduce our project.
27:18
You can say something. So, actually, our project is related to computer
27:26
vision and data science. Basically, we use MediaPipe,
27:33
which is an open-source Python library related to
27:40
computer vision recognition. It provides a model that we can use with
27:49
our own dataset, training it to recognize different labels. Based on
27:56
those labels, we put JPG or MPG pictures into each label's folder and then feed them
28:04
into the model to train it. After we train it, we can also decide how
28:10
precise we want it to be, and we can fine-tune the model in the future to work on
28:16
smaller signs. For this project, what we did is train it on the dataset
28:22
of ASL signs, A to Z, to recognize all of those hand shapes; a training sketch follows below.
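That folder-per-label workflow matches MediaPipe Model Maker's gesture recognizer; a hedged sketch of that route (the dataset path and export directory are hypothetical, and this may not be the exact script used):

```python
from mediapipe_model_maker import gesture_recognizer

# One sub-folder per label (e.g. dataset/A, dataset/B, ...) holding JPG/PNG frames.
data = gesture_recognizer.Dataset.from_folder(
    dirname="dataset",
    hparams=gesture_recognizer.HandDataPreprocessingParams(),
)
train_data, rest = data.split(0.8)
validation_data, test_data = rest.split(0.5)

options = gesture_recognizer.GestureRecognizerOptions(
    hparams=gesture_recognizer.HParams(export_dir="exported_model")
)
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options,
)
model.export_model()  # writes a .task bundle for on-device inference
```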
28:28
So now I'm going to show a demo
28:34
here on my laptop; I'm going to start the live camera feed. So basically,
28:43
oh, that's perfect. So we have a
28:49
sign here. First of all, if we put
28:55
our hand up, there are landmarks detected on the fingers; it has 21 points
29:02
marked on your hand to recognize the hand shape. Yeah, definitely you could do it
29:07
with something like face shapes too, but for now I can try a hand sign to let the model recognize my sign
29:15
language. So I do the A, and it gives the decided letter and also the confidence,
29:24
how precisely the model recognized it. Okay, then let's try two different
29:30
signs now. Okay.
29:37
[Music]
29:49
Okay, this is an S. No, this is S, S like
29:55
this. Yeah, that's E. Okay, this is A; uh-huh, this is A. Okay, and this is B: if
30:04
you close the fingers, close the fingers, that will
30:10
be... okay. And then, if you make your fingers cross, that
30:15
[Music] will be... and you can try to be...
30:22
And then three fingers is W. Okay, that's great.
30:29
The letter display only keeps a record when the sign changes;
30:34
if your sign stays the same, it won't register, so try to keep it moving a little.
30:42
[Music] Okay, thank you. That'll be all. Thank you
30:47
for watching our video. We hope you enjoyed it; have a nice day!
#Enterprise Technology
#Intelligent Personal Assistants