Abstract: There is a growing trend to deploy deep neural networks at the edge for high-accuracy, real-time data mining and user interaction. Applications such as speech recognition and language understanding often apply a deep neural network to encode an input sequence and then use a decoder to generate the output sequence. A promising technique to accelerate these applications on resource-constrained devices is network pruning, which compresses the size of the deep neural network without severe drop in inference accuracy. However, we observe that althou...
(read more)
Topics: 
Artificial intelligence
Natural language processing