
Reducing Neural Networks for Mobile Devices

(2020)

Abstract

Machine learning (ML) has countless areas of application, among them understanding and interpreting the environment, image recognition for automated image organization, and text-oriented tasks such as auto-completion or translation. These tasks are also desirable on mobile devices, whose resources, such as memory and computing power, are limited. For real-time applications, computation must take place on the device itself in order to guarantee low latency, independence from connectivity, and privacy. To address the increasing demand for ML on mobile devices, we present reduction methods and various implementation ideas for the Microsoft HoloLens as an exceptionally complex example device. We give an overview of widespread reduction approaches, including parameter pruning, quantization, knowledge distillation, dynamic capacity networks, weight sharing, and weight factorization. We tried to use TensorFlow (Lite) directly on the HoloLens, which failed due to its 32-bit architecture, and TensorFlow.js as a browser application, which worked in principle but was not effectively usable due to the poor performance of the Edge browser. Finally, using framework software such as ONNX and Windows ML, we created a test environment in order to compare the selected methods for effectiveness. With this we succeeded in pruning the LeNet-5 model, trained on the MNIST dataset, by a factor of 3. To shed more light on the various types of quantization offered by the TensorFlow Lite format and how they interact with pruned networks, we measured all possible combinations on a server for more stable performance results. Unfortunately, the desired effect of the pruning failed to appear.
What stands out instead is the significantly smaller file size, by a factor of 3.5, of quantized models, especially given that the accuracy of the reduced model is only 0.01% lower than that of the original model.
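The two reduction techniques the abstract centers on, magnitude-based parameter pruning and 8-bit weight quantization, can be illustrated with a minimal NumPy sketch. This is an illustrative assumption, not code from the paper: the function names and the simple affine quantization scheme are hypothetical stand-ins for what TensorFlow Lite does internally.

```python
# Hypothetical sketch of magnitude pruning and affine 8-bit quantization.
# Not the paper's implementation; function names and scheme are assumptions.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

def quantize_uint8(weights: np.ndarray):
    """Affine quantization: map the float range [min, max] onto [0, 255]."""
    lo, hi = float(weights.min()), float(weights.max())
    scale = (hi - lo) / 255.0 or 1.0  # avoid zero scale for constant arrays
    q = np.round((weights - lo) / scale).astype(np.uint8)
    return q, scale, lo  # dequantize with q * scale + lo
```

Quantization alone shrinks 32-bit float weights to a quarter of their size, roughly matching the factor of 3.5 reported above once metadata overhead is accounted for; pruning only saves space if the zeroed weights are stored in a sparse format.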
