## Reconfigurable hardware architecture for faster descriptor extraction in SURF

## Y. Kim<sup>✉</sup> and H. Jung

Speeded up robust features (SURF) is considered to be the most efficient feature extraction algorithm, and it has been implemented in powerful hardware for real-time operation because its computation is data-intensive and highly complex. In particular, the computational load of the descriptor extraction procedure is very significant, so the overall performance of SURF can be improved by speeding up the descriptor extraction step with additional parallel hardware accelerators. However, simply adding hardware accelerators is burdensome because it significantly increases area and power consumption. Therefore, a reconfigurable hardware architecture is proposed that achieves the maximum performance of the descriptor extraction step by making the best use of the existing accelerators, without adding any new ones. Experimental results show that the proposed architecture improves the performance of the descriptor extraction step by 24.77–47.45% with negligible area overhead when compared with existing hardware implementations of the SURF algorithm.

*Introduction:* Image feature extraction is a crucial step for recognising and tracking objects in computer vision. Among existing feature extraction algorithms, speeded up robust features (SURF) [1] is considered to be the most efficient because it offers properties such as scale and rotation invariance and robustness to illumination changes. The SURF algorithm has been implemented in powerful hardware [2, 3] for real-time operation in various computer vision systems because its computation is data-intensive and highly complex. In particular, the computational load of the descriptor extraction step is very significant, so the overall performance of SURF can be improved by speeding up this step with additional parallel hardware accelerators. However, simply adding hardware accelerators is burdensome because it significantly increases area and power consumption. Therefore, in this Letter, we propose a reconfigurable hardware architecture that achieves the maximum performance of the descriptor extraction step by making the best use of the existing accelerators, without adding any new ones.



**Fig. 1** *SURF analysis and hardware architecture of descriptor extraction*

*a* Computational load analysis of SURF

*b* Two-level parallel hardware architecture of descriptor extraction

*SURF analysis and hardware architecture of descriptor extraction:* The primary goal of SURF is to search for correspondences between two images of the same scene or object. First, each input image is individually processed by two procedures of the SURF algorithm: interest point detection and descriptor extraction. Then, the descriptors extracted from the two images are compared by the matching procedure to find the correspondences. The SURF algorithm therefore consists of three main procedures, and their computational load analysis is shown in Fig. 1*a*. This result was obtained with Intel® VTune™ Amplifier [4], which profiled the SURF algorithm in OpenCV [5] with the image 'tile0' [6] on an Intel i7-6700 processor operating at 3.4 GHz with 16 GB of memory. Fig. 1*a* demonstrates that the descriptor extraction procedure dominates the entire computational load. This indicates that the overall performance of SURF can be improved by speeding up this procedure with two-level parallelism implemented in hardware, as shown in Fig. 1*b*. At the first level, the two input images are processed concurrently by two separate descriptor extraction units. At the second level, each descriptor extraction unit speeds up the extraction step with its inner parallel hardware accelerator modules and their input/output memories, which store the interest points and the descriptors, respectively.

*Critical consideration for speed up of descriptor extraction:* The two separate descriptor extraction units finish their work at almost the same time when their inputs, the interest points detected on the two images, are the same or similar in number. In that case, there is no way to further accelerate the descriptor extraction step except by adding more hardware accelerator modules to both extraction units. However, simply adding hardware accelerator modules is burdensome because the many memories and arithmetic components in the accelerator modules may cause significant area and power consumption. In contrast, as shown in Fig. 2, if the numbers of interest points on the two images differ, one of the two extraction units finishes its work earlier and leaves its hardware accelerator modules idle. In this case, there is substantial room for further acceleration of the descriptor extraction step by utilising the idle accelerator modules without adding any new ones.
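The idle capacity described above can be estimated with a minimal sketch. It assumes an idealised model that is not part of the Letter: every interest point costs the same time `t_point` on any accelerator module, and each extraction unit has two modules. The function name `base_schedule` and its parameters are hypothetical; the interest point counts are those of 'tile0' and 'tile1' in Table 2.

```python
def base_schedule(n_a, n_b, accels_per_unit=2, t_point=1.0):
    """Idealised timing of the base architecture (no sharing between units).

    Assumption: each interest point takes the same time on any accelerator.
    Returns the overall completion time and the accelerator-time left idle.
    """
    finish_a = n_a * t_point / accels_per_unit  # unit processing image#A
    finish_b = n_b * t_point / accels_per_unit  # unit processing image#B
    total = max(finish_a, finish_b)             # bounded by the busier unit
    # The unit that finishes early leaves all of its modules idle until the end.
    idle = accels_per_unit * abs(finish_a - finish_b)
    return total, idle


total, idle = base_schedule(4796, 2420)
idle_fraction = idle / (2 * 2 * total)  # share of all accelerator-time unused
print(f"total = {total}, idle = {idle}, idle fraction = {idle_fraction:.2%}")
```

Under these assumptions, the idle accelerator-time is the capacity that a reconfigurable interconnection could in principle reclaim.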



**Fig. 2** *Descriptor extraction example with two different numbers of interest points on two images*

*Reconfigurable hardware architecture for faster descriptor extraction:* To make the most of the idle accelerator modules for speeding up the ongoing descriptor extraction step, we propose the reconfigurable hardware architecture shown in Fig. 3, which illustrates how the two idle accelerator modules of Fig. 2 can be used for the ongoing descriptor extraction from 'Image#B'. The first thing to notice in Fig. 3 is that each input/output memory is divided into two parts, each half the size of the original memory in Fig. 2.



**Fig. 3** *Further acceleration of descriptor extraction from 'Image#B' on reconfigurable hardware architecture*

This division supports concurrent processing by all the accelerator modules in both extraction units: one half of the memories in one extraction unit is coupled with the accelerator modules in the other unit when those modules are idle. Such coupling between the memories and the accelerator modules is accomplished by reconfigurable interconnection through the multiplexers and switch logic shown in Fig. 3. In this manner, the proposed reconfigurable architecture achieves the maximum performance of the descriptor extraction procedure by making the best use of the existing accelerator modules, leaving none of them idle.
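The effect of this coupling can be illustrated with a small software simulation. This is a sketch under stated assumptions, not the authors' RT-level design: every interest point takes one time step, reconfiguration costs nothing, and the half-memory coupling is abstracted as an idle module pulling work from the other unit's pending interest points. The function `extraction_steps` and the counts 12 and 4 are hypothetical.

```python
import heapq


def extraction_steps(n_a, n_b, accels_per_unit=2, share_idle=True):
    """Time steps needed to extract descriptors for both images.

    Assumptions: one time step per interest point, zero reconfiguration
    cost, and an idle module may take points from the other unit's memory
    only when share_idle is True (False models the base architecture).
    """
    remaining = {"A": n_a, "B": n_b}  # interest points still to process
    homes = ["A"] * accels_per_unit + ["B"] * accels_per_unit
    # Heap of (time at which the module becomes free, module id, home unit).
    free_at = [(0, idx, home) for idx, home in enumerate(homes)]
    heapq.heapify(free_at)
    makespan = 0
    while remaining["A"] or remaining["B"]:
        t, idx, home = heapq.heappop(free_at)
        other = "B" if home == "A" else "A"
        if remaining[home]:
            source = home            # normal operation on the unit's own image
        elif share_idle and remaining[other]:
            source = other           # reconfigured: help the other unit
        else:
            continue                 # module stays idle for the rest of the run
        remaining[source] -= 1
        makespan = t + 1
        heapq.heappush(free_at, (t + 1, idx, home))
    return makespan


base_steps = extraction_steps(12, 4, share_idle=False)
shared_steps = extraction_steps(12, 4, share_idle=True)
print(base_steps, shared_steps)
```

In this model the base configuration is bounded by the unit with more interest points, whereas the shared configuration spreads the leftover points across all four modules once the lighter unit has finished.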

*Experiments and results:* For quantitative evaluation, we designed a base architecture as in Fig. 2 and the proposed reconfigurable architecture as in Fig. 3 at the register-transfer (RT) level using Verilog, with the same functionality and accuracy as the SURF algorithm in OpenCV [5]. Each descriptor extraction unit has two accelerator modules and 64 kB/2 MB input/output memories, sized to process an image with a resolution of 640 × 480 px. The RT-level architectures were synthesised targeting a Xilinx Virtex-7 XC7V2000T FPGA, and Table 1 compares the synthesis results of the proposed and base architectures. The area cost of the proposed architecture, including the reconfiguration unit, multiplexers, and switch logic, increased by only 0.27% in slice look-up tables (LUTs) and 0.09% in slice registers compared with the base architecture. In addition, the critical path delay of the proposed architecture is the same as that of the base architecture, 3.99 ns. This comparison indicates that the proposed reconfigurable architecture incurs negligible area overhead and no degradation in critical path delay.

**Table 1:** Area and critical path delay comparison

| Metric              | Resource        | Quantity         | Base architecture | Proposed architecture |
|---------------------|-----------------|------------------|-------------------|-----------------------|
| Area                | Slice LUTs      | Number           | 480,560           | 481,482               |
|                     |                 | Increased\* (%)  | –                 | 0.27                  |
|                     | Slice registers | Number           | 410,890           | 411,246               |
|                     |                 | Increased\* (%)  | –                 | 0.09                  |
| Critical path delay |                 | Time (ns)        | 3.99              | 3.99                  |
|                     |                 | Increased\* (%)  | –                 | 0                     |

\*Increased (%): rate of increase compared with the base architecture, (proposed/base − 1) × 100.

To demonstrate the performance improvement of the proposed architecture, we evaluated its performance while widening the gap between the numbers of interest points on the two images, shown as the six cases of unequal interest point ratio in Table 2. In all six of these cases, the base architecture as in [2, 3] shows the same performance because its performance is always bounded by the descriptor extraction unit that processes more interest points. In contrast, the proposed architecture improves performance by 24.77–47.45% as the gap widens: the wider the gap between the numbers of interest points, the more the descriptor extraction that processes more interest points is accelerated by utilising the idle accelerator modules.

**Table 2:** Performance comparison

| Input image#A | No. of interest points (A) | Input image#B | No. of interest points (B) | Ratio<sup>b</sup> | Descriptor extraction time, base [2, 3] (s) | Descriptor extraction time, proposed (s) | Time-saving<sup>a</sup> (%) |
|---------------|----------------------------|---------------|----------------------------|-------------------|---------------------------------------------|------------------------------------------|-----------------------------|
| tile0 [6]     | 4796                       | tile0 [6]     | 4796                       | 1:1.00            | 8.70                                        | 8.70                                     | 0                           |
| tile0 [6]     | 4796                       | tile1 [6]     | 2420                       | 1:0.50            | 8.70                                        | 6.55                                     | 24.77                       |
| tile0 [6]     | 4796                       | tile2 [6]     | 1594                       | 1:0.33            | 8.70                                        | 5.80                                     | 33.36                       |
| tile0 [6]     | 4796                       | tile3 [6]     | 1204                       | 1:0.25            | 8.70                                        | 5.44                                     | 37.44                       |
| tile0 [6]     | 4796                       | tile4 [6]     | 958                        | 1:0.19            | 8.70                                        | 5.22                                     | 40.00                       |
| tile0 [6]     | 4796                       | tile5 [6]     | 503                        | 1:0.10            | 8.70                                        | 4.81                                     | 44.71                       |
| tile0 [6]     | 4796                       | tile6 [6]     | 241                        | 1:0.05            | 8.70                                        | 4.57                                     | 47.45                       |

<sup>a</sup>Time-saving (%): $(1 - \text{proposed/base}) \times 100$.

<sup>b</sup>Ratio: Ratio of number of interest points from image#A to number of interest points from image#B.
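The trend in Table 2 can be compared with a simple back-of-envelope model. This is a sketch under assumptions that the Letter does not state: every interest point costs the same time, both units have the same number $k$ of accelerator modules, and reconfiguration is free. The base time is then proportional to $\max(n_A, n_B)/k$ and the proposed time to $(n_A + n_B)/(2k)$, which gives an ideal time-saving of $(1 - r)/2 \times 100$, where $r$ is the smaller interest point count divided by the larger one. The function name `ideal_time_saving` is hypothetical; the counts and reported savings are copied from Table 2.

```python
# (image#B name, interest points on image#B, reported time-saving in %)
TABLE2 = [
    ("tile0", 4796, 0.0),
    ("tile1", 2420, 24.77),
    ("tile2", 1594, 33.36),
    ("tile3", 1204, 37.44),
    ("tile4", 958, 40.00),
    ("tile5", 503, 44.71),
    ("tile6", 241, 47.45),
]
N_IMAGE_A = 4796  # interest points on image#A ('tile0') in every row


def ideal_time_saving(n_a, n_b):
    """Ideal time-saving (%) when idle modules are fully reused.

    Assumption: equal cost per interest point and zero reconfiguration
    overhead, so the saving reduces to (1 - r) / 2 with r = min / max.
    """
    r = min(n_a, n_b) / max(n_a, n_b)
    return (1 - r) / 2 * 100


for name, n_b, reported in TABLE2:
    model = ideal_time_saving(N_IMAGE_A, n_b)
    print(f"{name}: model {model:.2f}%, reported {reported:.2f}%")
```

Under these assumptions the model stays within 0.1 percentage point of every reported value, which suggests that the measured gains are close to the limit set by the idle capacity; it also implies that the saving cannot exceed 50% with two equally sized units.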

*Conclusion:* In this Letter, we propose a reconfigurable hardware architecture for faster descriptor extraction in the SURF algorithm. Experimental results show that the proposed architecture improves the performance of the descriptor extraction step by 24.77–47.45%.

*Acknowledgment:* This research was supported by the Sookmyung Women's University Research Grant no. 1-1403-0047.

© The Institution of Engineering and Technology 2018 Submitted: *16 August 2017* E-first: *12 January 2018* doi: 10.1049/el.2017.3133

One or more of the Figures in this Letter are available in colour online.

Y. Kim and H. Jung (Department of Computer Science, Sookmyung Women's University, Seoul, Republic of Korea)

✉ E-mail: ykim@sookmyung.ac.kr

## References

- 1 Bay, H., Ess, A., Tuytelaars, T., et al.: 'Speeded-up robust features (SURF)', *Comput. Vis. Image Underst.*, 2008, **110**, (3), pp. 346–359, doi:10.1016/j.cviu.2007.09.014
- 2 Bouris, D., Nikitakis, A., and Papaefstathiou, I.: 'Fast and efficient FPGA-based feature detection employing the SURF algorithm'. Int. Symp. Field-Programmable Custom Computing Machines, Charlotte, NC, USA, May 2010, pp. 3–10, doi:10.1109/FCCM.2010.11
- 3 Lee, S.S., Jang, S.J., Kim, J., et al.: 'Memory-efficient SURF architecture for ASIC implementation', *Electron. Lett.*, 2014, **50**, (15), pp. 1058–1059, doi:10.1049/el.2013.4102
- 4 Intel: 'Get free download & trial'. Available at http://software.intel.com/intel-vtune-amplifier-xe, accessed May 2016
- 5 OpenCV: '2.4.9'. Available at http://opencv.org/releases.html, accessed May 2016
- 6 Embedded Systems Lab., Sookmyung Women's Univ.: 'Benchmarks'. Available at http://esl.sookmyung.ac.kr/surf\_benchmarks.html, accessed April 2017