TY - GEN
T1 - BOOST
T2 - 42nd IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2023
AU - Guo, Chuliang
AU - Lou, Binglei
AU - Liu, Xueyuan
AU - Boland, David
AU - Leong, Philip H.W.
AU - Zhuo, Cheng
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Adapting CNNs to changing problems is challenging on resource-limited edge devices due to intensive computations, high precision requirements, large storage needs, and high bandwidth. This paper presents BOOST, a novel block minifloat (BM)-based parallel CNN training accelerator on memory- and computation-constrained FPGAs for transfer learning (TL). By updating a small number of layers online, BOOST enables adaptation to changing problems. Our approach utilizes a unified 8-bit BM datatype (bm(2,5) ), i.e., with a sign bit, 2 exponent bits, and 5 mantissa bits, and proposes unified Conv and dilated Conv blocks that support non-unit stride and enable task-level parallelism during back-propagation to minimize latency. For ResNet20 and VGG-like training on CIFAR-10 and SVHN datasets, BOOST achieves near 32-bit floating point accuracy, reducing latency by 21%-43% and BRAM usage by 63%-66% compared to back-propagation training without TL. Notably, BOOST outperforms the prior SOTA works to achieve perbatch throughput of 131 and 209 GOPs for ResNet20 and VGG-like respectively.
AB - Adapting CNNs to changing problems is challenging on resource-limited edge devices due to intensive computations, high precision requirements, large storage needs, and high bandwidth. This paper presents BOOST, a novel block minifloat (BM)-based parallel CNN training accelerator on memory- and computation-constrained FPGAs for transfer learning (TL). By updating a small number of layers online, BOOST enables adaptation to changing problems. Our approach utilizes a unified 8-bit BM datatype (bm(2,5) ), i.e., with a sign bit, 2 exponent bits, and 5 mantissa bits, and proposes unified Conv and dilated Conv blocks that support non-unit stride and enable task-level parallelism during back-propagation to minimize latency. For ResNet20 and VGG-like training on CIFAR-10 and SVHN datasets, BOOST achieves near 32-bit floating point accuracy, reducing latency by 21%-43% and BRAM usage by 63%-66% compared to back-propagation training without TL. Notably, BOOST outperforms the prior SOTA works to achieve perbatch throughput of 131 and 209 GOPs for ResNet20 and VGG-like respectively.
UR - https://www.scopus.com/pages/publications/85181403356
U2 - 10.1109/ICCAD57390.2023.10323638
DO - 10.1109/ICCAD57390.2023.10323638
M3 - 会议稿件
AN - SCOPUS:85181403356
T3 - IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD
BT - 2023 42nd IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2023 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 28 October 2023 through 2 November 2023
ER -