Title | Child, Daniel_MCS_2023 |
Alternative Title | Comparing Classifiers vs Clustering Algorithms on an Audio Gait Dataset |
Creator | Child, Daniel |
Collection Name | Master of Computer Science |
Description | This Master of Computer Science thesis explores the use of machine learning to identify individuals based on their gait.
Abstract | Researchers have explored different methods for the identification of persons by the sound generated when they walk (i.e., gait). Using machine learning, they can train a system to identify individuals. In buildings where only authorized personnel are allowed, a system like this could provide an additional layer of authentication. In this thesis, audio was captured by participants walking past a series of microphones to record their gait. The Fast Fourier Transform algorithm was used to process the data into a form usable by machine learning algorithms. A comparison is made between classification and clustering algorithms, and several machine learning algorithms are tested in each category. Three methods for feature elimination and their impact on performance are also examined. Lastly, tests are conducted to see how including a backpack can alter the results, which tests the ability of the algorithms to generalize under normal changes in human behavior.
Subject | Algorithms; Computer programming; Machine learning; Computer science |
Keywords | gait; machine learning; classification; cluster; unsupervised; python; raspberry pi; Fourier; fast fourier transform |
Digital Publisher | Stewart Library, Weber State University, Ogden, Utah, United States of America |
Date | 2023 |
Medium | Thesis |
Type | Text |
Access Extent | 49-page PDF; 2.6 MB
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records: Master of Computer Science. Stewart Library, Weber State University |
OCR Text | COMPARING CLASSIFIERS VS CLUSTERING ALGORITHMS ON AN AUDIO GAIT DATASET

By Daniel Child

A thesis submitted to the faculty of the MSCS Graduate Program of Weber State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Computer Science. Graduation December 16, 2022, Ogden, Utah.

Approved: Committee Chair Robert Ball, Ph.D. (Feb 22, 2023); Committee Member Kyle Feuz, Ph.D. (Feb 24, 2023); Committee Member Matthew Paulson, Ph.D. (Feb 27, 2023).

ACKNOWLEDGMENTS

I would like to thank my committee chair, Dr. Robert Ball, and my committee members, Dr. Kyle Feuz and Dr. Matthew Paulson, for their guidance and support throughout the course of this research. I would also like to thank my parents, Nathan Cummings, and most importantly my wife for their support.

ABSTRACT

Researchers have explored different methods for the identification of persons by the sound generated when they walk (i.e., gait). Using machine learning, they can train a system to identify individuals. In buildings where only authorized personnel are allowed, a system like this could provide an additional layer of authentication. In this thesis, audio was captured by participants walking past a series of microphones to record their gait. The Fast Fourier Transform algorithm was used to process the data into a form usable by machine learning algorithms. A comparison is made between classification and clustering algorithms, and several machine learning algorithms are tested in each category. Three methods for feature elimination and their impact on performance are also examined. Lastly, tests are conducted to see how including a backpack can alter the results, which tests the ability of the algorithms to generalize under normal changes in human behavior.

TABLE OF CONTENTS

1 Introduction
  1.1 Biometrics
  1.2 Machine Learning
    1.2.1 Classifiers
    1.2.2 Unsupervised Clusters
  1.3 Summary
  1.4 Research Question and Hypothesis
  1.5 Blunders
2 Related Work
  2.1 Biometrics
  2.2 Gait Analysis
  2.3 Machine Learning
    2.3.1 Measuring Cluster Success
  2.4 Summary
3 Experiment Design
  3.1 Hardware
  3.2 Participant Privacy
  3.3 Audio Collection
4 Data Preprocessing
  4.1 Audio Clipping
  4.2 Ingesting Audio Files
  4.3 Feature Extraction
    4.3.1 Principal Component Analysis (PCA)
    4.3.2 T-distributed Stochastic Neighbor Embedding (t-SNE)
    4.3.3 Linear Discriminant Analysis (LDA)
  4.4 Results
5 Classifiers
  5.1 Testing with PCA
  5.2 Testing with LDA
  5.3 Scalability
  5.4 Backpack Impact
6 Clustering
  6.1 Measuring Cluster Success
  6.2 Selecting a k value
  6.3 Results
  6.4 Scalability
  6.5 Backpack Impact
  6.6 PCA
7 Conclusion
  7.1 Conclusion
  7.2 Preprocessing
  7.3 Classifiers
  7.4 Unsupervised Clustering
  7.5 Scalability
  7.6 Future Work

LIST OF TABLES

1.1 Both methods listed were able to perform much better with smaller datasets. Faces is a facial recognition system trained on State Department visa pictures.
5.1 Best result for each machine learning algorithm using PCA
5.2 Accuracies of ML algorithms with LDA
5.3 Each classifier was trained on participants not wearing a backpack (left column), then exposed to the audio with backpacks.
5.4 Test results when the algorithms were trained on backpack audio and tested with non-backpack audio

LIST OF FIGURES

3.1 Matein backpack used in study
3.2 Hallway used for thesis
3.3 Pathway of participants
3.4 Actual system layout
4.1 Clipping audio using Audacity
4.2 Audio file with bell interruption
4.3 Dataset graphed after using PCA with 2 components
4.4 Accuracy of Logistic Regression vs PCA %
4.5 Effects of different amounts of PCA on 5 algorithms
4.6 Dataset after running it through t-SNE with 2 features
4.7 Different amounts of t-SNE vs gait dataset
4.8 Dataset after running it through LDA with 2 features
4.9 Accuracies of machine learning algorithms vs LDA
5.1 8 algorithms vs different amounts of PCA
5.2 8 algorithms vs different amounts of LDA
5.3 6 algorithms and their accuracies vs number of participants included in the dataset
5.4 Algorithm accuracy vs number of participants using LDA
5.5 8 algorithms' accuracies vs number of participants (PCA)
6.1 Hierarchical dendrogram on our dataset without feature elimination
6.2 Hierarchical dendrogram on our dataset after LDA was used
6.3 Each clustering algorithm compared by their v-measure (k=8)
6.4 Cluster completeness score on unseen data
6.5 Clusters produced by the MeanShift algorithm
6.6 Clusters produced by the GaussianMixture algorithm
6.7 Clusters produced by the Affinity Propagation algorithm
6.8 V-measure score vs number of participants
6.9 V-measure score with unseen backpack data
6.10 Clusters produced by the MeanShift algorithm
6.11 Clusters produced by the GaussianMixture algorithm
6.12 Clusters produced by the Affinity Propagation algorithm
6.13 Clustering algorithms with different amounts of PCA

CHAPTER 1
Introduction

The objective of this thesis is to investigate a new way to identify individuals for a security system. Using unique traits of the human body for identification is known as biometrics. The trait we want to analyze for this study is gait. Gait is the pattern of how a person walks; it includes the cycle of movement your legs go through as you move.

1.1 Biometrics

The book Handbook of Biometrics defines biometrics as "the science of establishing the identity of an individual based on the physical, chemical or behavioral attributes of the person." That is, biometric systems are designed to identify someone based on characteristics that make each human unique. An important aspect of a biometric system is that it is fully automated or semi-automated. The benefit of using a biometric system is that it is hard to deceive. For example, if I were to try to get into someone's phone using their fingerprint scanner, I would have to replicate their finger closely enough that the system thinks it is the finger of the owner. Many personal devices require a biometric scan or a password, but often they do not require both.
Once the attacker can replicate the owner's finger, they would have full access to all the data on the device without knowing the password. The problem is that the owner of the device cannot get a new finger to use with the phone, so once the fingerprint scan is compromised it cannot be changed [1].

Another example of a biometric system is facial recognition. A facial recognition system takes a picture of your face and looks for unique characteristics, then compares it against a database of other faces to see if there is a match. Other examples of biometrics include voice recognition, where the sound of your voice is analyzed, and iris recognition, where your eye is scanned. Each of these biometric systems has its own advantages and disadvantages that make it desirable for security.

In an article from the National Institute of Standards and Technology [2], several studies are listed where biometric systems were constructed and their accuracies were tested. Table 1.1 was constructed from the data in the article. This table shows us that biometric systems struggle to scale. When a system can scale, it means that we can continue to add more users to the system without a loss in performance. In this example, a single fingerprint system had 500 people in it and had an accuracy of 96%. Once they got up to 100,000 users, the accuracy dropped to 86%. If a system does not scale well, the accuracy drops, and accuracy is crucial for security. For a biometric system to be utilized in a facility, such as a bank or school, the ability to scale is a must.

Method       Test Size   Probability of Identification
Fingerprint  500         96%
Fingerprint  100,000     86%
Faces        500         86%
Faces        10,000      77%

Table 1.1: Both methods listed were able to perform much better with smaller datasets. Faces is a facial recognition system trained on State Department visa pictures.

1.2 Machine Learning

For this thesis I will use sound for gait recognition.
To recognize gait in an audio sample I will use two different machine learning techniques: classification and clustering. This thesis will compare the performance of these two techniques on a dataset of audio samples of the participants walking in a hallway.

1.2.1 Classifiers

Classifier algorithms are part of the supervised learning family of machine learning. When given samples to identify, a classification algorithm will attempt to determine which class each sample belongs to. For example, if you were using machine learning to determine whether a picture contains a cat or a dog, you would give the algorithm a picture and it would try to classify the picture as one or the other. In chapter 2, examples are given where classifying algorithms were used to identify participants based on their gait.

1.2.2 Unsupervised Clusters

Unsupervised clustering algorithms use different methods to group similar data samples together. These algorithms are not trained with labels the way classification algorithms are; they seek to group the samples together based on hidden patterns in the data. The hope here is that these algorithms will be able to identify the unique pattern in each person's gait and cluster their samples together. For example, if you ran a clustering algorithm over several books based on writing style, the hope would be that the books would cluster together by author or genre [3][4][5].

1.3 Summary

The objective of this thesis is to investigate the use of an audio gait recognition system as biometric authentication for a security system. There already exists research that uses gait to identify people by using classification algorithms [6]. In this thesis, I will compare the performance of classification and unsupervised clustering algorithms. The hope is that clustering algorithms will be able to help find the patterns in the audio and filter out extra noise.
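The distinction drawn in section 1.2 can be illustrated with scikit-learn, the package used later in this thesis. The toy blob dataset, the SVC classifier, and KMeans below are illustrative choices of mine, not the thesis's own data or algorithms:

```python
# Contrast between the two technique families from section 1.2, on a
# synthetic dataset (a stand-in for the audio features, not the thesis data).
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, v_measure_score

X, y = make_blobs(n_samples=200, centers=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification: trained WITH labels, then predicts a class per sample.
clf = SVC().fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))

# Clustering: labels are never shown to the algorithm; groups are
# recovered from structure in the data alone.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
vm = v_measure_score(y, clusters)  # true labels used only for evaluation
```

The key difference is visible in the calls: `fit` receives `y_train` for the classifier but never for the clusterer.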
While there are already several methods to lock a system using biometrics, such as fingerprint scans or facial recognition, this system could be used to identify people as they approach the restricted area. This, in addition to other biometric tools, could help truly identify who is trying to gain access.

1.4 Research Question and Hypothesis

My research question is the following: Will unsupervised clustering algorithms be able to find the patterns in someone's gait better than classification algorithms? My hypothesis was the following: Clustering algorithms will be able to cluster participants better than classification algorithms can classify each participant.

1.5 Blunders

When preparing data for machine learning, it is important to process your data properly so that your results are not skewed. First you need to split your data into two sections: a training set and a test set. Then the training set is run through a normalizer, a scaler, and a feature extraction algorithm. Once these algorithms have had a chance to learn the training set and process the data, the test set is run through the same algorithms, with the settings learned from the training set applied to the test set. This is done because it replicates how a system would be set up in the real world.

The first time the tests in the following chapters were conducted, I was able to achieve accuracies in the 80%-95% range. To get those results, the normalizer, scaler, and feature extraction algorithms were being run before the data was split into the two datasets. Once this was discovered, corrections were made so that it would be a more realistic test. The new results are discussed in the following chapters.

The rest of my thesis is as follows: chapter 2 discusses related works and their influence on this study. In chapter 3 the experiment design is discussed, including how the audio was recorded and stored. The way each audio file is processed, normalized, and smoothed is discussed in chapter 4.
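The corrected ordering described in section 1.5 (split first, fit the preprocessing steps on the training split only) can be sketched with scikit-learn. The random matrix below is a hypothetical stand-in for the real FFT features, and PCA stands in for the feature extractor:

```python
# Split FIRST, then fit the scaler and feature extractor on the training
# set only, and merely APPLY the learned transforms to the test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # hypothetical stand-in feature matrix
y = rng.integers(0, 24, size=100)     # hypothetical participant ids

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)            # learns mean/std from train only
pca = PCA(n_components=10).fit(scaler.transform(X_train))

X_train_t = pca.transform(scaler.transform(X_train))
X_test_t = pca.transform(scaler.transform(X_test))  # no refitting on test data
```

Fitting the scaler or PCA on the full dataset before splitting, as in the first round of tests, leaks information about the test samples into the training pipeline and inflates accuracy.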
The results from the classification algorithms are discussed in chapter 5 and the clustering algorithms in chapter 6. To see my code used throughout this thesis, visit my gitlab repo.

CHAPTER 2
Related Work

This chapter discusses some of the books and papers that form the foundation of this thesis. Three main topics are discussed: biometrics, gait analysis, and machine learning. The machine learning section includes a discussion of how to measure the success of the unsupervised clustering algorithms that I use in chapter 6.

2.1 Biometrics

When it comes to biometric systems, there are some important traits they must have to be effective. In an article by A. K. Jain, A. Ross, et al., three functionalities of a biometric system are listed: Verification, Identification, and Screening. Verification means having a way to know that a person is who they say they are. Identification means taking the data used in the verification step and checking our database to see if the person is in it. Screening, if the person is found in the database, means checking whether they have permission to access whatever they are going through this process for [7].

An article by Anil Jain and Ajay Kumar says that for a biometric system to use a human trait, that trait must have both distinctiveness and permanence. For this system to be useful, everyone's gait must be different enough to differentiate one person from another. This means that regardless of what each person is wearing, the pattern of sound generated needs to be able to differentiate the person. Permanence means that the trait will not change; in this scenario, it is important that whatever biometric features we are analyzing will not be different the next time we examine the same features [8].

Biometrics can have other uses besides security authentication.
One example is using biometrics in a car so that the car can automatically adjust comfort settings and make sure that the driver is medically safe to drive [9]. Another is using data acquired from biometric sensors for marketing and advertisements. Advertisers already track people through their searches and purchase history; with biometrics, they will be able to track what you are doing when you are out in public [10].

In an article by Bruce Schneier, he mentions some dangers of using biometrics as a security system. While he lists many concerning points, the one we will be exploring is how the system responds to data it has not encountered before. The problem is that many biometric systems freeze or give false responses when they encounter a person who is not already in the database. When a system like this is established, it is important to decide whether we would rather have it err on the side of a false positive or a false negative. For this reason, this thesis is designed to use backpacks that are not introduced in the beginning to see if they can throw off the system [11].

Another concern for biometrics is that its reliability varies between groups of people, for example by ethnicity and gender. Many facial recognition systems have been trained with more data from one race than another, making their reliability fall when they try to identify individuals of other races. Another study found that diabetes influenced iris scans, meaning that people with diabetes had a higher chance of getting errors in their scan [12].

2.2 Gait Analysis

Research into gait analysis has grown in recent years. The way a person walks can give away information about their body. This is because a person's movement relies on a complicated system involving the nerves, muscles, bones, joints, and brain. With several variables about each person's body at play, each person has their own unique gait.
Since everyone's gait is unique, it allows us to tell one participant from another [13]. The medical field has seen many benefits from this research, including aiding orthopedic surgeons in choosing the best treatment methods [14] and diagnosing early-onset Parkinson's disease. When a person develops Parkinson's disease, one of the main things that changes is the way they walk, which means that their gait is different. Many people who suffer from Parkinson's have characteristics similar to one another; this allowed researchers to build systems that could identify these characteristics to diagnose patients with the disease [15][16][17].

There are two main papers that are the most relevant to this thesis. The first is by P. Bours and A. Evensen. They had 12 participants each walk 10 meters 20 times. Each participant wore the same type of shoe, and then any other pair they had. This was to test two things: first, whether the machine learning could identify the participants even if they wore different shoes; second, whether participants would be mistaken for each other if everyone wore the same shoe. They were able to achieve an accuracy of two thirds. In a biometric authentication system, the accuracy should be higher. They also covered the microphones with a thin sheet of fabric to reduce the noise caused by airflow when the participants walked [6].

J. T. Geiger, M. Kneißl, B. W. Schuller, and G. Rigoll conducted a similar experiment where they had participants walk down a hallway with microphones. The difference between their paper and the one previously mentioned was that they had participants walk normally, then again with shoe covers on, then a third time with a backpack on. This was to test a more natural environment where there will be other noises that cannot be reasonably controlled.
This group was also able to achieve an accuracy of two thirds when participants wore the same shoes in the test as in the training models [18].

Although their study is not as relevant to this thesis, Tsuji et al. also used microphones to record and process people's gait. Their study investigated the possibility of a system being able to identify how many people were near the microphone. They had one microphone placed in the center of 1-3 participants, who were instructed to walk in place around the microphone. They also conducted a test to see if their algorithm could identify the participants in the audio recording, but it did not work [19].

In another paper, M. Balazia and K. N. Plataniotis were successful in detecting participants based on their gait; however, they used cameras rather than microphones to capture movement. Once they recorded the participants' movements, their system rendered a skeleton of the person and used that to figure out the way they walked. Of the many methods they tested, their best was able to achieve an accuracy of 92% [20].

These articles show that it is possible for machine learning to recognize patterns in people's gait. Many of them suggest that it can be accurate enough to serve as a means of authentication. Lastly, they demonstrate that audio recordings of people walking can give us enough data to identify someone's gait.

2.3 Machine Learning

A study by Dimosthenis Ioannidis, Dimitrios Tzovaras, et al. was conducted to see which of four machine learning algorithms worked best on an existing biometric database. Of the four algorithms (Gaussian Mixture Models, Artificial Neural Network, Fuzzy Expert System, and Support Vector Machines), the Support Vector Machine (SVM) performed the best. Because of their findings, this thesis will also use a few different types of Support Vector Machines in its tests [21][22].

In a study by M. Z. Rodriguez, C. H. Comin, et al., many different clustering algorithms are tested against each other.
They found that, overall, each algorithm performed better when the datasets had many features: "When a large number of features was considered, such as in the case of the datasets containing 200 features, large gains in performance were observed for all methods." [23]. However, their tests were written in R, while the tests I conduct are written in Python using the Scikit-learn [24] package. Scikit-learn does not implement many of the clustering algorithms used in their paper, so I will use as many as are implemented in the package.

2.3.1 Measuring Cluster Success

Since clustering algorithms do not aim to classify each participant, accuracy is a little harder to measure. One way to measure the performance of a clustering algorithm is with the homogeneity score. This method measures an important requirement: a cluster should contain only samples belonging to a single class. In other words, if each cluster produced by a clustering algorithm contained only one class, the score would be 100%. The problem with the homogeneity score is that it does not show whether one class has been spread across multiple clusters. One class could be split into two separate clusters, but if each cluster is homogeneous the score will still be 100% [25].

Another method for measuring clustering algorithms is the completeness score. This method measures whether all the samples belonging to the same class are in the same cluster, which fixes the problem that the homogeneity score overlooks. To have 100% completeness, each entire class should fit in one single cluster. However, a cluster could contain multiple classes as long as none of those classes have samples in other clusters; such a clustering would have a high completeness score but a low homogeneity score [26].

The method I will rely on is called the Normalized Mutual Information (NMI) score.
This method uses both the completeness score and the homogeneity score to measure the performance of a clustering algorithm:

NMI = (2 * h * c) / (h + c)    (2.1)

To get 100% with this method, each cluster must contain only one class and each class can only be part of one cluster. This solves the problems that both previously mentioned scoring methods have [27][28].

2.4 Summary

Biometric systems are great tools in several applications such as authentication or health diagnoses. It is important that biometric systems are trained on many scenarios so that they do not fail to work with less represented groups of people. Gait is one feature biometric systems use to find unique characteristics in people. To analyze gait, machine learning can be used to find patterns unique to a person or circumstance. The machine learning technique I am focusing on is clustering, and to measure the success of the clustering algorithms I will use the Normalized Mutual Information score.

CHAPTER 3
Experiment Design

Participants were friends and family who happened to be at a community picnic. Ages ranged from the youngest, who was 18 years old, to the oldest, who was 90 years old, with the majority of the participants in their 20s. In total I was able to get 24 people to participate in this study.

Each participant was told to walk up and down a hallway for two minutes. Once the time was up, the participant was told to stop no matter where they were in the hallway. The participants were then asked to walk the same hallway for an additional two minutes, but this time wearing a backpack. Each participant wore the same backpack to reduce variables in the experiment. The backpack weighs 1.7-2 lbs by itself, and an additional 2 lbs of notebooks, pens, and papers were added. Half of the participants wore the backpack the first time they walked and the other half wore it the second time; doing this helped reduce confounding variables. A picture of the backpack can be seen in figure 3.1.
The hallway was approximately 30 feet long and carpeted; it was chosen because it had little echo. While participants were waiting for their turn, they were told to wait in an area away from this hallway to reduce interference. The hallway that was used for the study can be seen in figure 3.2; note that the picture was taken during setup, before the microphones were finished being configured. Each microphone was set up to be as identical as possible to prevent extra variables.

3.1 Hardware

The audio was recorded on a series of Raspberry Pi 4 devices along the hallway the participants were walking in. Raspberry Pis are single-board computers with approximately the length and width of a credit card. In figure 3.3, each diamond represents the location of a Raspberry Pi 4 and microphone. The arrow indicates the path the participants followed for two minutes. Each Raspberry Pi was connected to a Yeti Blue microphone and a wireless network so they could receive commands from a laptop controlled by me. The laptop ran a python script that connected to each Raspberry Pi and ran the following command:

shell.run(["arecord", "-f", "S16_LE", "-c2", "-r44100", "--duration=120", FILENAME])

The command "arecord" is a terminal command that comes pre-installed on many Linux distributions, including Raspberry Pi OS, which is what I was using. When called, it records using whichever microphone is selected. The flags used in the command set the recording quality to the same standard quality that CDs have, and the duration flag sets how long to record (in seconds). See the full code in the repo.

Figure 3.1: Matein backpack used in study

3.2 Participant Privacy

To protect the participants in this study, names were not associated with the audio files collected. To keep the audio files disassociated from the participants, the python script assigned each participant a 4-digit id.
Each audio file name follows this format:

p<participant id>-b<true|false>-pi<1|2|3|4>.wav

The section that follows "-b" marks whether the participant was wearing the backpack in the recording. The number after "-pi" documents which Raspberry Pi recorded that audio file.

3.3 Audio Collection

On the day of recording the audio samples, it was discovered by the last participant that one of the Raspberry Pi devices was muted. While I was hoping for the design found in figure 3.3, we ended up with data from only three Raspberry Pis because one of them was muted. The final setup is shown in figure 3.4.

Figure 3.2: Hallway used for thesis

Figure 3.3: Pathway of participants

With 24 participants each walking twice and three recordings per walk, I ended up with 144 audio files. Each audio file was two minutes long. While I did my best to keep the hallway used for testing quiet, there were still many instances where extra noise was captured by the microphones.

Figure 3.4: Actual system layout

One benefit of having three microphones recording the participants is that I could still use the audio from the other microphones that did not pick up the noise. This discussion is continued in chapter 4.

CHAPTER 4
Data Preprocessing

4.1 Audio Clipping

Figure 4.1: Clipping audio using Audacity

Each of the 144 audio files was broken down further by clipping the audio after each time the participant walked past the microphone. On average, each of the original audio files was clipped into 10 additional files. To clip the audio into many smaller files, I used Audacity. Figure 4.1 shows how a typical audio file was cut up. Each label represents the end of the audio to its left and the start of the new audio to its right, with the cut placed where the participant would turn around.
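The naming scheme above can be generated and parsed with a small helper. This is a sketch, not the thesis code; the function names and the zero-padding of the 4-digit id are my own assumptions:

```python
import re

def make_filename(participant_id: int, backpack: bool, pi: int) -> str:
    """Build a recording filename in the p<id>-b<true|false>-pi<n>.wav format."""
    return f"p{participant_id:04d}-b{str(backpack).lower()}-pi{pi}.wav"

def parse_filename(name: str):
    """Recover (participant_id, backpack, pi) from a recording filename."""
    m = re.fullmatch(r"p(\d{4})-b(true|false)-pi([1-4])\.wav", name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    return int(m.group(1)), m.group(2) == "true", int(m.group(3))
```

A round-trip such as parse_filename(make_filename(17, True, 2)) recovers the original three fields.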
Extra noise picked up by the microphones was not removed from the audio clips unless it was loud enough that the sound of the participant walking past could not be heard. If the extra noise was too loud, that clip was deleted. Some of the sounds picked up by the microphones that were not removed include: a water fountain cooling system, motorcycles from the road outside, people talking who were oblivious to the study, etc. Figure 4.2 is an example of an audio clip that would have been removed. On the right side the sound appears to be louder than the rest of the audio file. That was caused by someone ringing a bell while I was recording. Everything from the start of the bell to the end of the clip was removed because the participant who was walking could not be heard.

Figure 4.2: Audio file with bell interruption

4.2 Ingesting Audio Files

Each file is read into a Python array. Each number a in the array is then smoothed using the following formula:

a / 2^(8·2−1)   (4.1)

Equation 4.1 comes from a tutorial on processing audio files [29]. It takes each input number a and puts it into the range [-1, 1), which converts the raw sample values into signal amplitudes. This whole process is called smoothing; it allows us to remove noise from the data before it is processed. Once the audio file has been smoothed, it is run through a Discrete Fourier Transform [30]. The Discrete Fourier Transform (DFT) is a technique used in signal processing that takes audio and turns it into something machine learning models can use, in the form of an array [31]. The Fourier transform function used allowed me to specify the length of the array I wanted returned [32]. For this study I chose 882,000 because of the example I was following online. I then tested some lower numbers and accuracy dropped; I tested larger numbers and accuracy did not improve. This led me to believe that 882,000 was the sweet spot.
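The normalization in equation 4.1 and the fixed-length transform can be sketched as follows. The input buffer here is a hypothetical stand-in for a real clip (the actual clips are not reproduced), and the use of NumPy's FFT is my assumption about the kind of function the thesis describes:

```python
import numpy as np

# Hypothetical stand-in for one 16-bit clip (two seconds at 44.1 kHz).
rate = 44100
rng = np.random.default_rng(0)
raw = rng.integers(-2**15, 2**15, size=rate * 2, dtype=np.int16)

# Equation 4.1: dividing by 2^(8*2-1) = 32768 maps 16-bit samples into [-1, 1).
amplitudes = raw / 2**15

# A fixed-length DFT: np.fft.fft zero-pads (or truncates) to n points, so
# every clip yields a feature vector of the same length regardless of duration.
features = np.abs(np.fft.fft(amplitudes, n=882000))
```

The fixed n is what lets clips of slightly different lengths all become rows of the same width in the feature matrix.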
For this thesis I only used one of the two channels that are returned, which gives an array of size 441,000. Once the audio file is ingested and smoothed, it is inserted into a Pandas DataFrame, and the participant's number is added to a Pandas Series object [33]. Once each of the 144 audio files was added to these Pandas objects, they were stored in a Python pickle. A pickle is a way to write temporary Python variables to the hard drive for later use; this way I can run the tests without processing each audio file every time [34].

4.3 Feature Extraction

My dataset contains 441,000 features (think of features as columns of a spreadsheet). There are two problems with having a lot of features: one, we run the risk of overfitting the data, which reduces accuracy, and two, it negatively impacts speed and performance. To fix this problem, feature extraction (or feature reduction) is used. Feature extraction preserves data and finds patterns while also reducing the number of features used. These techniques can help improve each machine learning algorithm while also reducing the resources needed. The following subsections compare three feature extraction algorithms on the dataset. Each of the algorithms was preceded by scikit-learn's [24] StandardScaler() object [35].

4.3.1 Principal Component Analysis (PCA)

PCA linearly transforms the dataset to reduce the number of features. This algorithm is unsupervised, meaning that it assumes each sample is unrelated and does not take their classification into account [3][5][36]. When we set the number of dimensions to 2 and graph it, we can see how the inputs are placed together in one clump (figure 4.3). This is problematic when we try to run clustering algorithms against this dataset.

Figure 4.3: Dataset graphed after using PCA with 2 components
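The scale-then-reduce pipeline described above can be sketched as follows. The feature matrix here is a small random stand-in (the real dataset has 441,000 features per clip and is not public):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the gait feature matrix: 144 clips, 1000 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(144, 1000))

# Scale first, as in the thesis, then project down to 2 components for plotting.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```

The 2-component projection is what figure 4.3 plots; the same pipeline with a larger n_components feeds the classifiers.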
The point of clustering algorithms is that they try to identify clusters in the dataset; by doing this they can assume that data clustered together is related. When the data sits in one cluster, these clustering algorithms either clump everything together or find arbitrary clusters in the dataset. However, this is not as much of a problem for classification algorithms. The results against classification algorithms are found below.

Figure 4.4 shows the results from the initial PCA test to see how many features PCA should use on our dataset. This test started with 10% of the original features and incremented by 10 until 100%. The results from this test show that the best accuracy was at 50%. The problem with this test is that it only uses one machine learning algorithm, Logistic Regression, which was chosen at random. Figure 4.5 contains the results of the second test, which includes more algorithms for comparison.

Figure 4.4: Accuracy of Logistic Regression vs PCA %

Figure 4.5 shows us that any amount of PCA is better for the machine learning results than no PCA at all. It also shows that if PCA is turned up all the way to 100%, accuracy is negatively impacted. We can see this by looking at the left and right ends of the graph, where accuracy is lowest for each algorithm. While each algorithm performed best in different places, the majority did best when PCA was around 20% or 90%. AdaBoost is the exception here: it never got an accuracy score above 12%. Logistic Regression was able to score around 78% accuracy, and Random Forest 70%, when PCA was at 20%. This means that for a real-world test using these two algorithms, we would want PCA to keep 20% of the original number of features to give the best results. From this we can conclude that PCA positively affects classification algorithms on this dataset.

Figure 4.5: Effects of different amounts of PCA on 5 algorithms
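The sweep over PCA percentages can be sketched as follows. This is a minimal sketch using scikit-learn's digits dataset as a stand-in for the gait data; the loop mirrors the shape of the test, not the thesis code itself:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # stand-in dataset (64 features)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Keep 10%, 20%, ..., 100% of the original feature count and record accuracy.
scores = {}
for pct in range(10, 101, 10):
    n = max(1, X.shape[1] * pct // 100)
    pca = PCA(n_components=n).fit(X_tr)
    clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_tr), y_tr)
    scores[pct] = clf.score(pca.transform(X_te), y_te)
```

Plotting scores against pct reproduces the shape of the experiment behind figure 4.4.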
4.3.2 T-distributed Stochastic Neighbor Embedding (t-SNE)

The difference between t-SNE and PCA is that t-SNE does not use a linear dimensionality reduction technique. t-SNE focuses on preserving the neighborhood of each point while it reduces the features. That means it tries to keep the relationships between points similar to the original dataset [37][38]. The t-SNE script used here generates data that is better for clustering algorithms compared to the PCA cluster from figure 4.3. While the data is still in one cluster, figure 4.6 shows that the data points are spread over a wider area. If you look closely at the graph, you can see that some regions have a higher density of points; this is better for clustering algorithms because they look for groups.

I wanted to see how various amounts of t-SNE affect machine learning with this dataset, so I measured the accuracies of the algorithms while changing how many features t-SNE returns. The particular implementation of t-SNE used is the Barnes-Hut method, which only allows us to pick an integer between 1 and 3. The best response from this test was SVC (RBF), which was able to score an accuracy a little over 5.5%. While some of the algorithms scored best with 2 features, the majority did best with 3. Given that the performance was so low in this test, I decided not to use t-SNE for any more tests. The results can be seen in figure 4.7.

Figure 4.6: Dataset after running it through t-SNE with 2 features

Figure 4.7: Different amounts of t-SNE vs gait dataset

4.3.3 Linear Discriminant Analysis (LDA)

LDA is like PCA in that it linearly transforms the data. The biggest difference is that LDA is a supervised technique; this means that LDA can take the classes into account when finding the best way to separate the examples [39][40]. LDA gives us something most similar to what I was hoping to see. For the best results we would want to see about 24 different clusters, one for each participant.
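Because LDA is supervised, it is fit with the labels, and scikit-learn caps its output at one fewer feature than the number of classes. A minimal sketch on a random stand-in dataset (the 240×100 matrix is hypothetical, not the gait data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical stand-in: 240 clips from 24 participants, 100 features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(240, 100))
y = np.repeat(np.arange(24), 10)  # participant labels

# LDA is supervised: fit() receives the labels, and the number of output
# features is capped at n_classes - 1 (here 24 - 1 = 23).
lda = LinearDiscriminantAnalysis(n_components=23)
X_lda = lda.fit_transform(X, y)
```

With 24 participants this cap is exactly the 23-feature ceiling used in the LDA experiments.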
However, figure 4.8 shows that our dataset has been clumped into about four clusters. When more than two features are used it is possible that there are more distinct clusters. Similar to the tests conducted with PCA and t-SNE, I wanted to see the best possible score that could be achieved with LDA. Different numbers of features were used and compared against the accuracies of the machine learning models. LDA lets us choose any number of features between 1 and the number of classes (in this case participants) minus 1. Since there were 24 participants, we can use up to 23 features. Figure 4.9 shows that the more features included, the better the results. Something else noteworthy is that each algorithm follows a consistent path, whereas with PCA the paths varied more between algorithms.

Figure 4.8: Dataset after running it through LDA with 2 features

Figure 4.9: Accuracies of machine learning algorithms vs LDA

4.4 Results

The manual review of each of these feature extraction algorithms and the testing done in section 4.3 show that no single method gets the best result out of every algorithm. PCA was able to achieve higher scores in some areas, while LDA was a lot more predictable and easier to implement. Since both have their strengths, they will be used in the coming chapters 5 and 6. t-SNE was the worst performer of the three tested and does not yield any benefits, so it will not be used in the coming chapters. Each of these algorithms can produce more than two features, but I can only graph two of them; it is possible that with many features the clustering algorithms will be able to find ways of clustering each participant's samples into one cluster.

CHAPTER 5
Classifiers

In this section, classification algorithms are tested to see how they perform against the gait dataset.
Both PCA and LDA are tested to see which can provide the best possible accuracy, and section 5.3 tests how well classification algorithms are able to scale to large datasets. Finally, section 5.4 discusses the impact the addition of backpacks had on this dataset.

5.1 Testing with PCA

In figure 4.5, several algorithms were used to see how many features should be used to maximize our results, and we will investigate this further here. The same experiment was conducted again, but this time more algorithms were explored. While scikit-learn's [24] implementation of the AdaBoost algorithm uses a technique to support multiple classes, or in this case participants (referred to as SAMME [41]), the results from figure 4.5 show that it was never able to give us any results better than 12%. Because of that, AdaBoost was removed from the following experiment. Some of the algorithms were added simply out of curiosity; for example, scikit-learn supports different versions of SVM, seen in figure 5.1 as SVC, and I wanted to see which one worked best. The Multi-Layer Perceptron (MLP) was added because, after learning about it in my machine learning class, I figured it would be a great candidate to add to this thesis. In short, this test was conducted because I wanted to see how more of these untested algorithms would perform.

This second attempt, shown in figure 5.1, demonstrates a path that each algorithm follows. The worst results came from both ends of the line graph: on the right side PCA is keeping just as many features as were put in, while on the left side so many features have been removed that the algorithms do not have enough data to make accurate predictions. From this, we can conclude that feature extraction is important. Depending on the algorithm, the best results come when 20% or 90% of the original number of features are extracted with PCA. The majority of these algorithms performed best around 20%.
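Comparing a roster of classifiers under one split can be sketched as follows. This uses the digits dataset as a stand-in and a subset of the algorithms named above; it illustrates the comparison pattern, not the thesis code:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# A subset of the classifier roster from this chapter.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SGDC": SGDClassifier(random_state=0),
    "SVC (Linear)": SVC(kernel="linear"),
    "SVC (RBF)": SVC(kernel="rbf"),
    "SVC (Sigmoid)": SVC(kernel="sigmoid"),
}

X, y = load_digits(return_X_y=True)  # stand-in dataset
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit every model on the same split and collect test accuracy.
accuracy = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Keeping the split fixed across models is what makes the resulting accuracy table comparable row to row.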
Between 20% and 90%, many of the algorithms drop in accuracy before improving again as the percentage continues toward the other side. Table 5.1 shows each algorithm's best score and the percentage of PCA that was used to yield that score. The best performer was the Logistic Regression algorithm with an accuracy of 78.2%. To achieve that score PCA was set to 20%, meaning 20% of the original 441,000 features were kept.

Figure 5.1: 8 algorithms vs different amounts of PCA

Classifier            Best PCA %   Accuracy %
Decision Tree         10           47.4
Logistic Regression   20           78.2
MLP                   10           60
Random Forest         20           69.7
SGDC                  30           54.2
SVC (Linear)          90           68.1
SVC (RBF)             20           68
SVC (Sigmoid)         30           67.3

Table 5.1: Best result for each machine learning algorithm using PCA

5.2 Testing with LDA

Given the results found in the feature extraction section 4.3, a test like the PCA experiment was conducted using LDA. LDA gives us a much smaller window for selecting the number of features; this test starts with 2 features and works up to 22. The test was run 10 times, and the averages of each algorithm for each amount of LDA were taken to produce the chart. Figure 5.2 shows that the more features are used, the better the results. The exception is the Decision Tree algorithm, which flattens out after 6 features. If you look at the best score from each algorithm in figure 5.1 and compare it to the results found in table 5.2, you will see that each algorithm had a drop in accuracy when LDA was used. In the PCA test the best score was Logistic Regression with 78%; once I switched to LDA the best was still Logistic Regression, but with an accuracy of only 69%. Another thing we learn from this table is that comparing the first column with the second shows that using more LDA features yields a better outcome.
Figure 5.2: 8 algorithms vs different amounts of LDA

Classifier            LDA with 10 features   LDA with 22 features
Decision Tree         37.0                   50.3
Logistic Regression   52.3                   69.3
MLP                   54.5                   68.5
Random Forest         55.7                   64.6
SGDC                  35.1                   44.0
SVC (Linear)          59.9                   65.5
SVC (RBF)             51.6                   62.6
SVC (Sigmoid)         46.8                   56.4

Table 5.2: Accuracies of ML algorithms with LDA

This means that LDA has a positive impact on the results; however, the results are not as good as with PCA.

5.3 Scalability

As discussed in chapter 1, if a biometric tool is to be successful in the real world it needs to recognize lots of people. For this test I only have 24 participants, so the test is to see how the results differ between different numbers of users. If there is a downward trend with just the 24 participants, that means the system will not be able to support a large group. Each time I conducted this test the results were different. This is because some of the participants are very distinct while others sound similar, making them harder to predict. To overcome this obstacle, the experiment was run ten times and each algorithm's results were added to a 2-dimensional array. After the algorithms were run, an average was taken for each column. The outcome can be seen in figure 5.3. Just like we saw in table 1.1 [2], the ability of many of the algorithms to identify the participants dropped as more participants were added to the system.

Figure 5.3: 6 algorithms and their accuracies vs number of participants included in the dataset

Most of the algorithms had a slight downward trend, which suggests that these algorithms won't be able to scale. To get a cleaner picture this test would need to be run many more times than I have time for.

Figure 5.4: Algorithm accuracy vs number of participants using LDA

To test what effect LDA has on the scalability of these algorithms I ran the test again. This test was the same as the first attempt but with an extra step.
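The scalability loop described above — subsample k identities, train, score, repeat, and average — can be sketched as follows. The digits classes stand in for participants, and the loop structure is my reconstruction of the described procedure, not the thesis code:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # digit classes play the role of participants
rng = np.random.default_rng(0)
runs = 5

# For each "participant" count k, average accuracy over several random subsets.
mean_acc = {}
for k in range(3, 11):
    scores = []
    for _ in range(runs):
        chosen = rng.choice(np.unique(y), size=k, replace=False)
        mask = np.isin(y, chosen)
        Xtr, Xte, ytr, yte = train_test_split(X[mask], y[mask], random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        scores.append(clf.score(Xte, yte))
    mean_acc[k] = float(np.mean(scores))
```

Plotting mean_acc against k gives a curve of the same kind as figure 5.3; a downward slope suggests trouble scaling to more identities.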
Once the participants were randomly chosen, they were scaled and processed using LDA. In figure 5.4, all the algorithms except AdaBoost had a flat trend. This means that they were not affected by the number of participants. Since a lot of the features have been removed, it would take a lot more data before the effects could be seen.

Figure 5.5: 8 algorithms' accuracies vs number of participants (PCA)

The last figure, 5.5, is the same test but with PCA instead of LDA, to see if we notice any scaling issues with this method. AdaBoost is the most obvious, since it has a steep drop near the beginning, but all the algorithms have a slight downward trend. Logistic Regression almost reached 80% when it only had 6 participants. This shows that PCA will not help much in preventing scaling issues. Since most of the algorithms did best with 20% of features, 20% was chosen for this quick test.

5.4 Backpack Impact

Classifier            Accuracy without backpacks %   Accuracy with backpacks %
Decision Tree         29.4                           18.6
Logistic Regression   48.4                           29.2
MLP                   48.6                           30.4
Random Forest         46.0                           29.6
SGDC                  27.6                           19.8
SVC (Linear)          48.8                           34.6
SVC (RBF)             40.4                           27.0
SVC (Sigmoid)         35.9                           26.1

Table 5.3: Each classifier was trained on participants not wearing a backpack (left column), then exposed to the audio with backpacks.

The goal of this test is to see how much of the algorithm's decision is based on the participant's gait versus the unique sound their shoes made. Since everyone wore the same backpack, the sound it made would be similar between trials. This test also shows the impact that small changes, such as wearing a new pair of shoes, could have on a system like this. The test was done by separating the audio samples where the participants were not wearing backpacks from the samples where they were. Each algorithm in table 5.3 was trained on 80% of the audio without backpacks.
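The train-on-one-condition, test-on-another protocol can be sketched as follows. The "backpack" condition here is simulated by adding heavy noise to a stand-in dataset; this illustrates the evaluation pattern only, not the thesis data or its actual noise characteristics:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Hypothetical stand-in for the backpack condition: a heavily noised copy
# of every sample, simulating audio the model never trained on.
rng = np.random.default_rng(0)
X_shifted = X + rng.normal(scale=8.0, size=X.shape)

# Train on 80% of the clean ("no backpack") samples, hold out 20% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)

acc_matched = clf.score(X_te, y_te)    # same condition as training
acc_shifted = clf.score(X_shifted, y)  # unseen condition
```

The gap between acc_matched and acc_shifted is the same kind of drop that the two accuracy columns of table 5.3 report.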
Using the remaining 20% of backpack-free audio samples, the accuracy of each algorithm was measured; these results are found in the second column. The algorithms that had been trained on the audio without backpacks were then used to predict the participants in the audio samples containing backpacks. The accuracy results for the backpack audio are in the third column. Table 5.3 shows how the algorithms struggled to identify the participants when they had not been trained on the sound of a backpack. If you take the average from table 5.1, you get 60.4%. Comparing that to the average of the middle column from table 5.3, which is 40.6%, the accuracy fell about 20 percentage points. This difference is due to two main reasons. The first is that the algorithms had less audio to train on, because 20% of the backpack-free audio was reserved for testing. The second is that when you expose an algorithm to data it has not trained on, it will not be able to classify that data properly. In this test, the algorithms were not trained with the backpack audio, so they struggled to identify the participants correctly.

To make sure that the accuracy did not fall just because there was less data to train on, I ran the same test except this time the machine learning algorithms were trained on the backpack data. They were then tested on the reserved 20% of the backpack data and on all of the audio samples that did not have backpacks. The results found in table 5.4 show that the scores are a lot higher in both columns, which means that the data was impacted by the backpack and not just by having less data to train with.
Classifier            Accuracy with backpack %   Accuracy without backpack %
Decision Tree         40.4                       20.1
Logistic Regression   63.6                       30.5
MLP                   61.38                      28.4
Random Forest         59.3                       30.0
SGDC                  34.4                       21.69
SVC (Linear)          64.7                       31.54
SVC (RBF)             57.1                       27.6
SVC (Sigmoid)         53.8                       25.9

Table 5.4: Test results when the algorithms trained on backpack audio and were tested with non-backpack audio

CHAPTER 6
Clustering

6.1 Measuring Cluster Success

In section 2.3.1, three different methods are mentioned for measuring the success of a clustering algorithm: the completeness score, the homogeneity score, and the V-measure score. For this section we will rely on the V-measure score because it is the most thorough and incorporates the first two methods.

6.2 Selecting a k value

Many machine learning algorithms have hyperparameters. Hyperparameters are settings that need to be decided before the algorithms are used. Many clustering algorithms have a hyperparameter, k, that tells the algorithm how many clusters to generate. The method chosen to find the best value of k was generating a dendrogram chart [30]. This chart shows how each audio file relates to the others: the higher the legs go to connect two audio files, the more distant their relation to each other. Figure 6.1 shows two distinct sections distinguished by two different colors. Within the green section you can see two more large clusters that could be considered their own sections. Because of this chart, I felt that 5 clusters would be appropriate for the dataset. Since I had found that LDA was the best feature extraction algorithm to apply before feeding the data into a clustering algorithm, I thought it would be good to see how it affects the dendrogram. Figure 6.2 shows how LDA changes the clustering on the dendrogram. After reviewing this second chart, I decided that approximately 8 distinct clusters could be found. This resulted in a final decision to use 8 as our k value instead of 5.
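A dendrogram of the kind used to pick k can be built from a hierarchical linkage. The sketch below uses SciPy (an assumption about tooling — the thesis does not name its dendrogram library) on a small random stand-in matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical stand-in feature matrix: 40 clips with 10 reduced features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))

# Ward linkage builds the merge tree the dendrogram is drawn from.
Z = linkage(X, method="ward")

# dendrogram() is normally rendered with matplotlib; no_plot=True just
# returns the layout so the tree structure can be inspected directly.
tree = dendrogram(Z, no_plot=True)
```

Reading the height at which large branches merge is the visual judgment that led to choosing k = 8 here.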
6.3 Results

Now that I had a way to measure the performance of each algorithm (see section 2.3.1) and had found a k value I wanted to use (based on the dendrograms in section 6.2), it was time to run the clustering tests. All nine of scikit-learn's [24] clustering algorithms were used for this part of the study. The algorithms that needed a k hyperparameter were passed an 8; that includes the following algorithms: KMeans, Birch, Agglomerative, Spectral, and Gaussian. The other four algorithms did not have a k parameter, so changing k did not affect their outcome. The outcome can be found in figure 6.3.

Figure 6.1: Hierarchical dendrogram on our dataset without feature elimination

Figure 6.2: Hierarchical dendrogram on our dataset after LDA was used

Figure 6.3: Each clustering algorithm compared by their v-measure (k=8)

Figure 6.3 was generated by taking the data after it had been smoothed, transformed using the Fourier transform technique, and feature extracted using LDA as discussed in chapter 4. The data was then split into a training portion that contained 80% of the data and a test portion that contained the other 20%. Each algorithm was trained on the training data and then tested with the test set. For this part of the test, both backpack and non-backpack audio were mixed into the training and test sets. The samples containing backpacks and those that do not are separated for the test discussed in section 6.5. When a clustering algorithm's predict function is called, it does not return a class label; instead it returns the number of a cluster. I take the outputs of each algorithm and the original labels and pass both lists into the v-measure function to get these results. Given these outcomes, the best clustering algorithm for this situation is Affinity Propagation, since it was almost able to reach 60%.
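Scoring cluster numbers against true labels with the v-measure can be sketched as follows. The blob data is a hypothetical stand-in where five well-separated blobs play the role of participants:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Stand-in dataset: 5 well-separated blobs playing the role of participants.
X, labels_true = make_blobs(n_samples=200, centers=5, cluster_std=0.5,
                            random_state=0)

# KMeans returns cluster numbers, not class labels; v_measure_score compares
# the assignment against the true labels without needing any label mapping.
pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
score = v_measure_score(labels_true, pred)
```

Because the v-measure is invariant to how clusters are numbered, no matching step between cluster ids and participant ids is needed.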
Since Affinity Propagation received a score of 60%, that means it was able to keep each cluster unique to one participant, and each participant assigned to only one cluster, a little over half of the time. Since it did not receive a score of 100%, some of the participants were assigned to multiple clusters and not every cluster had only one participant in it. But this does mean that this algorithm is good at consistently assigning each participant to the same clusters. If you compare figure 6.3 to figure 6.4, you will notice that some of the algorithms that scored poorly in the first did well in the second. One example is DBSCAN: in figure 6.3 it doesn't even show up, but in figure 6.4 it is the best scoring. Since the v-measure score takes both completeness and homogeneity into account, and we can see that its completeness score is almost 100%, we can infer that it scored a zero in homogeneity. This means that DBSCAN is great at putting the samples from each participant into the same cluster, but it struggles to keep each cluster unique to a participant. Affinity Propagation and OPTICS, by contrast, can both put each sample in the correct cluster and keep each cluster to one participant.

Figure 6.4: Cluster completeness score on unseen data

Figures 6.5, 6.6, and 6.7 show how each algorithm chose to cluster the dataset. MeanShift, which scored low, lumped the majority of the examples into one cluster. The GaussianMixture model produced what I imagined the top-performing clustering algorithm would look like given the original unclustered image (see figure 4.3); however, it produced average results. The top-performing algorithm from figure 6.4 is the AffinityPropagation algorithm, shown in figure 6.7. While it does not cluster the way we would expect, its method was able to give us the best v-measure score.

Figure 6.5: Clusters produced by the MeanShift algorithm
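The DBSCAN behavior described above — perfect completeness with zero homogeneity — is exactly what happens when a clusterer dumps everything into one cluster. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Degenerate clustering like the DBSCAN case above: every sample in one cluster.
labels_true = [0, 0, 1, 1, 2, 2]
one_cluster = [0, 0, 0, 0, 0, 0]

c = completeness_score(labels_true, one_cluster)  # each class stays whole
h = homogeneity_score(labels_true, one_cluster)   # clusters are fully mixed
v = v_measure_score(labels_true, one_cluster)     # harmonic mean of h and c
```

Completeness comes out perfect, homogeneity zero, and the v-measure zero — which is why a single-cluster solution tops figure 6.4 while vanishing from figure 6.3.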
Something to keep in mind is that these clustering algorithms are using 10 features, and we are picking 2 of them to graph. While the way Affinity Propagation clustered seems random from our two-dimensional view, it is using eight more features that we cannot see. These other features could be what the algorithms are using to define their clusters.

Figure 6.6: Clusters produced by the GaussianMixture algorithm

Figure 6.7: Clusters produced by the Affinity Propagation algorithm

To recap: overall, each algorithm was good at keeping the number of participants assigned to a cluster low, but not all of them were able to keep each participant in their own cluster. While figure 6.6 shows what I would expect a clustering algorithm to produce, because all of the dots near each other are similar colors, figure 6.7 shows that there is a better way. AffinityPropagation tells us there is other data not being graphed that yields higher v-measure scores.

6.4 Scalability

To test the scalability of these clustering algorithms, a test was conducted where only a fraction of the participants were included. The v-measure score was measured, and then another participant was included. The test followed these steps until every participant was included. To get an accurate representation, the test was run five times and the results from each iteration were averaged to produce figure 6.8.

Figure 6.8: V-measure score vs number of participants

Many of the algorithms, including AffinityPropagation, GaussianMixture, and KMeans, had relatively flat trends, which means they were unaffected by the number of participants. Spectral, MeanShift, and DBSCAN did have a downward trend, which means they will continue to become less accurate as more participants are added to the dataset.
6.5 Backpack Impact

Figure 6.9: V-measure score with unseen backpack data

In the previous section the backpack data was mixed in with samples without backpacks. In this section I conduct a different test to measure each algorithm's ability to filter out extra noise that it was not trained on. This test is like the one conducted in section 5.4. For this test the algorithms were trained only on audio samples that did not include backpacks; I then had each algorithm assign each sample containing a backpack to a cluster. The hope was that each algorithm would continue to place each participant in the same cluster that their non-backpack samples were assigned to. Figure 6.9 shows the results of this test. While the bars look relatively the same, note that the scale on the y-axis has changed. Every algorithm, including AffinityPropagation, dropped in its v-measure score. This means that the clustering algorithms' ability to consistently assign each participant to their own unique cluster has dropped, and that for a system like this to be effective the algorithm will need to be trained on a large variety of data. It is important to note that out of all of the algorithms, Affinity Propagation was the least affected by the new noise in the data and still had the best score.

6.6 PCA

Back in section 1.5 I discussed how these tests were originally run one specific way before it was realized that they all contained a flaw that made the results look better than they actually were. Back then it was determined that LDA was the best algorithm for feature extraction, so everything in this chapter up to this point has used LDA. In this section some of these tests are run using PCA to see how different the results are.
Figure 6.10: Clusters produced by the MeanShift algorithm

Figure 6.11: Clusters produced by the GaussianMixture algorithm

Figure 6.12: Clusters produced by the Affinity Propagation algorithm

Figures 6.10, 6.11, and 6.12 are the same clustering techniques that were plotted in section 6.3; however, these were run after the data was modified by the PCA algorithm. I am not immediately able to determine whether these results are better, given that the charts only show two of the hundreds of features in the dataset. Since there is no percentage of PCA that works best for all algorithms, I tested each algorithm with different PCA percentages to find the best combinations. Figure 6.13 shows the results of this test. It shows that no matter what percentage we use, the results will not be as good as the results from the LDA tests. The best result from this test is a 47.2% v-measure score, while the best LDA result is just over 60%.

Figure 6.13: Clustering algorithms with different amounts of PCA

CHAPTER 7
Conclusion

7.1 Conclusion

Presented in this paper are two different families of machine learning algorithms and their ability to identify individuals using audio samples of their gait. Audio was recorded in a hallway with a combination of three Raspberry Pis and microphones. Each participant walked back and forth for a total of four minutes. The audio was then broken down into each pass by a microphone, yielding 144 audio clips. The clips were then smoothed, transformed into arrays, and stored for testing. Review of the results between classification and clustering algorithms shows that each has different strengths. Classification algorithms yielded the highest scores when trained on both backpack and backpack-free data. The clustering algorithms did not give scores as good, but when they were tested on unseen backpack data their v-measure scores did not drop as much.
7.2 Preprocessing

Three different methods were tested to reduce the number of features in the dataset: PCA, LDA, and t-SNE. Each classification algorithm reacted differently to different amounts of PCA; however, PCA made a positive impact whenever the percentage of features kept was between 20% and 90%. While PCA was able to yield higher scores, there was no single amount that worked best for every algorithm. LDA was the opposite: it was much more consistent in finding a number of features that worked for each algorithm, but it did not yield the best results.

7.3 Classifiers

The best result was achieved with a dataset comprised of all 24 participants and a training set including backpacks. The data was scaled with StandardScaler and transformed using either LDA or PCA. In table 5.1, the Logistic Regression algorithm was able to reach a score of 78.2%. The classifiers dropped in performance when they were trained only on the audio without backpacks and then scored on audio samples containing backpacks. There the best result was achieved by SVC (Linear), with an accuracy of 34.6%. While the backpacks did not generate a lot of noise, the drop in accuracy means that classification algorithms are not able to filter out new noise. For classifiers to be successful at audio gait recognition they need to be trained on a wide range of scenarios.

7.4 Unsupervised Clustering

Using the v-measure scoring method, the clustering algorithm able to achieve the highest score was Affinity Propagation, with around 60%. The worst was DBSCAN with 0%. When the algorithms were trained on data without backpacks and then scored using data samples with backpacks, the Affinity Propagation algorithm's score remained around 60%. While the score is not favorable, that is a major improvement over the classification algorithms, which had a much larger drop in accuracy once backpacks were introduced.
This means that the unsupervised clustering algorithms were able to filter out extra noise better than the classification algorithms. As stated in the beginning, my hypothesis was that unsupervised clustering algorithms would perform better than classification algorithms. Overall, classification algorithms performed best when the data was smoothed and the features were extracted using LDA. Clustering algorithms were not able to outperform the classification algorithms when both were given the chance to train on the backpack data. When the backpack data was excluded from the training set, however, the clustering algorithms were able to hold a higher score. This means that clustering algorithms can adapt better to noise they have not been trained on.

7.5 Scalability
The first round of scalability testing presented in chapter 5 suggested that classification algorithms would struggle as more participants were added: not only would there be an increased number of participants to identify, but each audio clip contained so many features that the algorithms began to slow down and their accuracy dropped. Once LDA was added for feature elimination, each of the algorithms showed only a slight downward trend, meaning there is little difference between 5 and 24 participants. To test the limits of scale, more data would be required. Due to a lack of time, the unsupervised clustering algorithms were not tested on their ability to scale without LDA. The tests from chapter 6 used LDA to reduce the number of features, which reduced the time needed for each clustering algorithm to run. Most of the clustering algorithms reacted similarly to the classification algorithms, with a relatively flat but slightly dropping trend. This also suggests that clustering algorithms can support a large number of participants, though it is still unknown how many they are capable of handling.
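The chapter 6 arrangement, where LDA shrinks the feature space before any clustering runs, can be sketched as follows. This is a hedged illustration: the synthetic blobs stand in for the real gait features, and Agglomerative Clustering stands in for the algorithms actually tested. Note that LDA is fit with the participant labels, which is what makes the reduction so aggressive (at most n_classes - 1 output dimensions) and the subsequent clustering fast.

```python
# Hypothetical sketch of the chapter 6 setup: fit LDA on labelled
# passes to shrink the feature space, then cluster the reduced data.
# Synthetic data stands in for the thesis dataset.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import v_measure_score

X, y = make_blobs(n_samples=480, centers=24, n_features=200,
                  random_state=0)

# LDA keeps at most n_classes - 1 components: 200 features -> 23.
lda = LinearDiscriminantAnalysis().fit(X, y)
X_small = lda.transform(X)

labels = AgglomerativeClustering(n_clusters=24).fit_predict(X_small)
print(f"features {X.shape[1]} -> {X_small.shape[1]}, "
      f"v-measure {v_measure_score(y, labels):.3f}")
```

The runtime saving comes directly from the dimensionality drop: most clustering algorithms scale with the number of features per sample, so clustering 23-dimensional points is far cheaper than clustering the raw spectra.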
7.6 Future Work
A few areas of this system need to be improved before it can be used as an additional layer of authentication in a secured area. One area is the ability to recognize multiple people walking together. This could be an extension of the work of Tsuji et al., whose system was able to find the number of people walking past their microphone [19]. It was not able to identify unique individuals, but it could tell whether 1-3 people were walking. Another area of research would be to find ways to overcome different types of shoes and environmental variables. Something as simple as wearing a backpack was able to throw off the system; advancements in this research could help overcome these challenges. Another area that needs to be researched is a way to combine the results of these two methods into one system. If that can be achieved, it would be possible to get the accuracy of the classification algorithms with the noise filtering of the clustering algorithms. A system that included both of those features and overcame the problems of the previous paragraph would be much closer to a real-world application. Many biometric systems, like those used on smartphones, only test whether the person trying to gain access is the one person that has permission to the system. It is currently unknown how well this system would distinguish one individual from anyone else, and this should be tested.

References
[1] A. K. Jain, P. Flynn, and A. A. Ross, Handbook of biometrics. Springer Science & Business Media, 2007.
[2] C. L. Wilson, "Biometric accuracy standards," 3 2003. [Online]. Available: https://csrc.nist.gov/CSRC/media/Events/ISPAB-MARCH-2003-MEETING/documents/March2003-Biometric-Accuracy-Standards.pdf
[3] D. Xu and Y. Tian, "A comprehensive survey of clustering algorithms," Annals of Data Science, vol. 2, no. 2, pp. 165–193, 2015.
[4] A. K. Jain and R. C. Dubes, Algorithms for clustering data. Prentice-Hall, Inc., 1988.
[5] R. Xu and D.
Wunsch, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
[6] P. Bours and A. Evensen, "The Shakespeare experiment: Preliminary results for the recognition of a person based on the sound of walking," in 2017 International Carnahan Conference on Security Technology (ICCST), 2017, pp. 1–6.
[7] A. Jain, A. Ross, and S. Pankanti, "Biometrics: a tool for information security," IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.
[8] A. K. Jain and A. Kumar, "Biometrics of next generation: An overview," Second Generation Biometrics, vol. 12, no. 1, pp. 2–3, 2010.
[9] J. de J. Lozoya-Santos, V. Sepúlveda-Arróniz, J. C. Tudon-Martinez, and R. A. Ramirez-Mendoza, "Survey on biometry for cognitive automotive systems," Cognitive Systems Research, vol. 55, pp. 175–191, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1389041717301948
[10] R. A. Ramírez-Mendoza, J. d. J. Lozoya-Santos, R. Zavala-Yoé, L. M. Alonso-Valerdi, R. Morales-Menendez, B. Carrión, P. P. Cruz, and H. G. Gonzalez-Hernandez, Biometry: Technology, Trends and Applications. CRC Press, 2022.
[11] B. Schneier, "The uses and abuses of biometrics," Communications of the ACM, vol. 42, no. 8, pp. 136–136, 1999.
[12] M. Azimi, "Investigation into the reliability of contactless biometric systems," Ph.D. dissertation, Warsaw University of Technology, 2020.
[13] M. W. Whittle, Gait analysis: an introduction. Butterworth-Heinemann, 2014.
[14] H. G. Chambers and D. H. Sutherland, "A practical guide to gait analysis," JAAOS - Journal of the American Academy of Orthopaedic Surgeons, vol. 10, no. 3, pp. 222–231, 2002.
[15] S. Del Din, M. Elshehabi, B. Galna, M. A. Hobert, E. Warmerdam, U. Suenkel, K. Brockmann, F. Metzger, C. Hansen, D. Berg et al., "Gait analysis with wearables predicts conversion to Parkinson disease," Annals of Neurology, vol. 86, no. 3, pp. 357–367, 2019.
[16] S.
Shetty and Y. Rao, "SVM based machine learning approach to identify Parkinson's disease using gait analysis," in 2016 International Conference on Inventive Computation Technologies (ICICT), vol. 2. IEEE, 2016, pp. 1–5.
[17] A. Balakrishnan, J. Medikonda, P. K. Namboothiri et al., "Role of wearable sensors with machine learning approaches in gait analysis for Parkinson's disease assessment: a review," Engineered Science, vol. 19, pp. 5–19, 2022.
[18] J. T. Geiger, M. Kneißl, B. W. Schuller, and G. Rigoll, "Acoustic gait-based person identification using hidden Markov models," in Proceedings of the 2014 Workshop on Mapping Personality Traits Challenge and Workshop, ser. MAPTRAITS '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 25–30. [Online]. Available: https://doi.org/10.1145/2668024.2668027
[19] K. Tsuji, R. Takao, M. Yamada, K. Harada, and Y. Kamiya, "Multiple person detection by footsteps sounds using gmrs," IEEE Sensors Journal, vol. 21, no. 5, pp. 6543–6554, 2021.
[20] M. Balazia and K. Plataniotis, "Human gait recognition from motion capture data in signature poses," IET Biometrics, vol. 6, 11 2016.
[21] D. Tzovaras, A. Bekiaris, and Y. Damousis, "Human monitoring and authentication using biodynamic indicators and behavioural analysis," 2008. [Online]. Available: http://www.humabio-eu.org/
[22] I. G. Damousis and S. Argyropoulos, "Four machine learning algorithms for biometrics fusion: A comparative study," Appl. Comp. Intell. Soft Comput., vol. 2012, Jan 2012. [Online]. Available: https://doi.org/10.1155/2012/242401
[23] M. Z. Rodriguez, C. H. Comin, D. Casanova, O. M. Bruno, D. R. Amancio, L. d. F. Costa, and F. A. Rodrigues, "Clustering algorithms: A comparative approach," PLoS One, vol. 14, no. 1, p. e0210236, 2019.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E.
Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[25] P. Hansen and B. Jaumard, "Cluster analysis and mathematical programming," Mathematical Programming, vol. 79, no. 1, pp. 191–215, 1997.
[26] G. Bonaccorso, Mastering Machine Learning Algorithms: Expert Techniques to Implement Popular Machine Learning Algorithms and Fine-Tune Your Models. Birmingham: Packt Publishing, Limited, 2018.
[27] A. F. McDaid, D. Greene, and N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," arXiv preprint arXiv:1110.2515, 2011.
[28] M. Meilă, "Comparing clusterings—an information based distance," Journal of Multivariate Analysis, vol. 98, no. 5, pp. 873–895, 2007.
[29] yshk and imbr, "Python scipy fft wav files," Apr 2014. [Online]. Available: https://stackoverflow.com/a/23378284
[30] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python," Nature Methods, vol. 17, pp. 261–272, 2020.
[31] H. J. Nussbaumer, "The fast Fourier transform," in Fast Fourier Transform and Convolution Algorithms. Springer, 1981, pp. 80–111.
[32] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E.
Oliphant, "Array programming with NumPy," Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020. [Online]. Available: https://doi.org/10.1038/s41586-020-2649-2
[33] The pandas development team, "pandas-dev/pandas: Pandas," Feb. 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3509134
[34] G. Van Rossum, The Python Library Reference, release 3.8.2. Python Software Foundation, 2020.
[35] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh, Feature Extraction: Foundations and Applications. Springer, 2008, vol. 207.
[36] R. Bro and A. K. Smilde, "Principal component analysis," Analytical Methods, vol. 6, no. 9, pp. 2812–2831, 2014.
[37] M. C. Cieslak, A. M. Castelfranco, V. Roncalli, P. H. Lenz, and D. K. Hartline, "t-distributed stochastic neighbor embedding (t-SNE): A tool for eco-physiological transcriptomic analysis," Marine Genomics, vol. 51, p. 100723, 2020.
[38] E. Taskesen and M. J. Reinders, "2D representation of transcriptomes by t-SNE exposes relatedness between human tissues," PLoS One, vol. 11, no. 2, p. e0149853, 2016.
[39] A. J. Izenman, "Linear discriminant analysis," in Modern Multivariate Statistical Techniques. Springer, 2013, pp. 237–280.
[40] P. Xanthopoulos, P. M. Pardalos, and T. B. Trafalis, "Linear discriminant analysis," in Robust Data Mining. Springer, 2013, pp. 27–33.
[41] J. Zhu, S. Rosset, H. Zou, and T.
Hastie, "Adaboost-samme."
Format | application/pdf |
ARK | ark:/87278/s6mbp3mx |
Setname | wsu_smt |
ID | 96897 |
Reference URL | https://digital.weber.edu/ark:/87278/s6mbp3mx |