Elimination of the Background of Electron Microscope Images by Using FPGA

Ádám Fazekas, Hiroshi Daimon, Hiroyuki Matsuda, and László Tóth

Abstract

The purpose of our development is to design an FPGA based hardware acceleration system that can be used for analyzing photoemission electron microscope (PEEM) images or improving their quality. Even though a usual PEEM has an energy filter unit, which is able to eliminate certain disturbing signals, post-processing computation can also be useful to improve the image quality. Here we propose an FPGA based hardware acceleration system for the computation of a certain image background component. It has uniquely designed hardware modules that perform the computations in parallel, resulting in less calculation time. The system shown here is a prototype which was used only for testing and experimental purposes.

Keywords: photoelectron spectra, Shirley background, field-programmable gate array, hardware acceleration

1 Introduction

Due to technological advancement there is an increasing demand for observing processes which take place in micro- and nano-scale ranges. Photoemission electron microscopy (PEEM), among other techniques, provides a solution for this purpose. It gives photoelectron spectra from individual small areas and has a wide range of applications in many branches of science, such as physical, chemical and biological research, nanotechnology, semiconductor design and manufacturing. However, the images that it produces can contain certain disturbing signals which completely obscure the useful information; for example, a photoelectron peak contains a large background originating from higher energy peaks. There are several ways to eliminate this background component, and one of them is to apply post-processing computations. This method can be a computationally intensive task because the background should be calculated for each pixel, so it is appropriate to use hardware acceleration to reduce the execution time.

Field-programmable gate arrays (FPGA) [2] provide an excellent opportunity for the design of such systems. The FPGA architecture offers massively parallel capabilities and uniquely customizable hardware components designed to carry out the given task in the most efficient way. In this paper we propose a prototype of a hardware acceleration system for computing a background component on an FPGA platform. The system was implemented and tested on an Altera DE2 Development and Education FPGA board [1]. The design performs the calculations in parallel with specialized hardware components; therefore, it takes less time to complete than it would take with an ordinary computer.

2 Physical background

2.1 About the Photoelectron Emission Microscope

This work is related to a new display-type ellipsoidal mesh analyzer (DELMA) [6, 14], with a new type of $1\pi$ sr wide acceptance angle spherical aberration corrected electrostatic lens (WAAEL) [5, 9, 7, 8, 15]. This special photoemission electron microscope can be used for simultaneous angular and energy distribution measurements, electron spectroscopy and spectrography, diffraction and holographic measurements. Furthermore, due to the extremely large acceptance angle it can be used for stereo photoemission electron microscopy (Stereo-PEEM) to obtain three-dimensional atomic and electronic structures of microscopic materials.

2.2 The examined background component

There are several solutions for the correction of chromatic and spherical aberration; one distinctive approach, for example, operates by applying a time dependent electric field [11]. The objective lens that was applied in our case corrects the spherical aberration only by using a quasi-ellipsoidal shape mesh lens inside [5]. One of the advantages of this type of lens is that the sample area is field free. Furthermore, it has no pass energy limit; therefore, it can be applied in wide energy ranges. However, due to the wide acceptance angle it requires careful design and construction. The quality of the measured images can be improved if we can distinguish the background and then subtract it from the image. This can be achieved by taking images at many pass-energies, where each pixel of the image behaves in a way that is known, but differently from the background components (Fig. 1) [15]. In this paper we deal with the elimination of the Shirley-background component [12] by using an FPGA processor. As a first step we have examined this method for image processing purposes and tested it on a low cost FPGA device.

Figure 1: The spectral image sequence (bottom) and the corresponding intensity distribution for the calculations of the background subtracted images [15, 6]. Curve 'a' is the original intensity distribution along the energy axis (E) at given (x,y) coordinates on the images, where an elastic peak and a plasmon-loss peak are seen. Curve 'b' is the calculated Shirley-background. Curve 'c' is the intensity after the subtraction of the background.

3 The applied Hardware

3.1 A brief description of Field Programmable Gate Arrays

Field programmable gate arrays (FPGA) are integrated circuits that do not have specific functionality; therefore, one must program the device to make it able to perform the required task. The main components of an FPGA are the logic blocks that contain logic elements, the programmable interconnect and the input/output ports. In addition, almost every FPGA has further special components like embedded memory or multiplier circuits. By configuring or reconfiguring the logic elements and the programmable interconnect between them, the device is given the functionality to perform the desired task. This flexibility allows the designer to implement various hardware designs using a hardware description language like Verilog or VHDL. The benefits of FPGAs are not only the flexibility but also the parallel processing capabilities. The implemented hardware modules can run in parallel, independently of each other, which greatly improves the system efficiency. For these reasons FPGAs often provide higher performance than ordinary processors and digital signal processing devices [2].

3.2 The applied FPGA device

The background computing system was implemented on an Altera DE2 Development and Education FPGA board [1]. This device has a Cyclone II FPGA processor, some basic hardware peripherals such as external memories, and several input/output ports like RS-232. Although this device was not developed for high performance computations, it was perfectly suitable for testing and experimental purposes for the prototype system. For the design we used the Altera Quartus II web edition development software, and implemented the design in the Verilog hardware description language.

4 Implementation

4.1 Algorithm

To determine the background component (Fig. 2) we used the iterative Shirley method [12] and modified the algorithm structure for a feasible hardware design. The method was implemented in both hardware and software platforms for comparison reasons.

![Figure 2: The Shirley background and parameters for its computation.](image)

Our design divides the Shirley algorithm into two phases. The first phase computes the area between the points of the background \((S_i)\) and the points of the spectrum (data) with rectangle approximation. Here \(E_1\) and \(E_2\) are the two energy indexes, set by the user, between which the background computation takes place, and \(\Delta E\) is the energy difference (step size) between two consecutive points in the spectrum. Here is the pseudocode for the area computation \((A_{\text{max}} = A_1 + A_2)\):

Algorithm 1 Area computation

1: \( A_{\text{max}} := 0 \)
2: for \( k := E_2 \) downto \( E_1 \) do
3: \( A_{\text{max}} := A_{\text{max}} + (\text{data}(k) - S_i(k)) \cdot \Delta E \)
4: \( A_2(k) := A_{\text{max}} \)
5: end for

This part also computes the value of \( A_2 \) for every point because it is produced during the computation of \( A_{\text{max}} \). The second phase computes the points of the Shirley background for the next iteration, where \( I_1 \) and \( I_2 \) are the intensities at \( E_1 \) and \( E_2 \) energies. Pseudocode for computing the Shirley background in the ‘i’-th iteration:

Algorithm 2 Shirley computation

1: for \( j := E_1 \) to \( E_2 \) do
2: \( S_i(j) := I_2 + (I_1 - I_2) \cdot (A_2(j)/A_{\text{max}}) \)
3: end for

This separation is important because it allowed us to define unique arithmetical circuits for each part, which made the computation more efficient. In the following, the implementation is explained in more detail.
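The two phases can be illustrated with a minimal software sketch. This is not the hardware design itself; the function names `area_phase` and `shirley_phase`, the sample spectrum, and the choice of a flat starting background equal to \(I_2\) are assumptions made only for this illustration.

```python
def area_phase(data, background, e1, e2, delta_e):
    """Phase 1 (Algorithm 1): accumulate the area from E2 down to E1."""
    a2 = [0.0] * len(data)
    a_max = 0.0
    for k in range(e2, e1 - 1, -1):
        a_max += (data[k] - background[k]) * delta_e
        a2[k] = a_max  # running area between E2 and point k
    return a_max, a2


def shirley_phase(a_max, a2, i1, i2, e1, e2, background):
    """Phase 2 (Algorithm 2): background points for the next iteration."""
    nxt = list(background)
    for j in range(e1, e2 + 1):
        nxt[j] = i2 + (i1 - i2) * (a2[j] / a_max)
    return nxt


# Hypothetical six-point spectrum with a single peak.
data = [4.0, 5.0, 9.0, 5.0, 2.0, 2.0]
e1, e2, delta_e = 0, 5, 1.0
i1, i2 = data[e1], data[e2]
s0 = [i2] * len(data)  # assumed flat starting background

a_max, a2 = area_phase(data, s0, e1, e2, delta_e)
s1 = shirley_phase(a_max, a2, i1, i2, e1, e2, s0)
```

After one pass the background equals \(I_1\) at \(E_1\) and \(I_2\) at \(E_2\), and the stored \(A_2\) values are reused by the second phase without being recomputed.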

4.2 System architecture

The background computing system consists of two main components: an FPGA based computer unit and a Java application running on a personal computer. The Java application provides the measured data and the computation parameters for the FPGA through serial communication using the RS-232 communication standard. This application also provides the user interface of the system, where the computation parameters such as \( E_1 \), \( E_2 \) and the step size \( \Delta E \) can be set. The FPGA stores the incoming data in the external memory. After the transfer, the central controller unit starts serving the background computer hardware modules, which operate in parallel. The computed background intensities are stored in the external memory too. When all the background intensities have been calculated, the central controller unit sends the results back to the Java application through the serial port. Figure 3 shows the schematic diagram of the system. There are subsystems for the different tasks, such as the memory controller for the memory operations, the I/O controller for communication, and the Shirley modules for computing the Shirley background. Each subsystem is controlled by the system controller, which manages the serving strategy for the Shirley modules.
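Section 4.3 states that data is transferred between the PC and the FPGA as IEEE-754 single precision values. The sketch below shows how one spectrum could be turned into such a byte stream and back. The function names and the little-endian byte order are assumptions for illustration only; the paper does not describe the actual framing used on the serial link.

```python
import struct


def encode_spectrum(values):
    """Pack a spectrum as consecutive IEEE-754 single precision values.

    Little-endian byte order is an assumption of this sketch.
    """
    return struct.pack("<%df" % len(values), *values)


def decode_spectrum(payload):
    """Unpack a byte stream produced by encode_spectrum."""
    count = len(payload) // 4  # four bytes per single precision value
    return list(struct.unpack("<%df" % count, payload))


payload = encode_spectrum([1.0, 2.5, -0.125])
restored = decode_spectrum(payload)
```

Each point occupies four bytes, so the payload size grows linearly with the spectrum length.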

4.3 Shirley background computation with hardware modules

The values of the Shirley background are determined by specialized hardware modules. Each Shirley module has its own memory and arithmetical modules, so the modules can operate independently in parallel. The modules implement the iterative Shirley method for the background calculation. The current system operates with the IEEE-754 standard single precision floating point and the Q32.16 fixed point number formats [3]. We used real numbers to preserve the computation precision; however, for image processing purposes alone even integer arithmetic would be sufficient. Integer arithmetic would give a much shorter running time and a simpler circuit, but the results would not reflect the measured data precisely. The primary data format used for storing data in the memory and for transferring it between the PC and the FPGA system is IEEE-754 standard single precision floating point. The arithmetical modules are optimized with a pipeline technique [10] that results in increased throughput and reduced running time.

Each Shirley module consists of three major components: the arithmetical, the memory controller and the Shirley controller units. The memory controller unit performs the read and write operations on the dedicated memory, which contains the data required for the computation of one spectrum. The controller unit manages the computation procedure, which is implemented as a finite state machine [4].

The iterative Shirley method was divided into two parts, as mentioned earlier. The area between the measured data and the flat background intensities is computed at the beginning of the process, and the new approximations of the background intensities are calculated afterwards. The area processor unit (Fig. 4) determines the summarized area, $A_{\text{max}}$, in the first step of each iteration, while the $A_2$ values are saved for every point in a dedicated memory, so they can simply be read out when they are needed for the calculation. This part uses Q32.16 fixed point arithmetic for fast summation; thus, conversion is required at the ingress and egress of this circuit. After this, the Shirley processor unit (Fig. 5) determines the intensities of the points of the next background approximation. The Shirley processor unit uses the IEEE-754 standard single precision floating point number format, because the division can be implemented more efficiently in this format than in the fixed point case. The reason for using different number formats is that the summations consume much more time and resources with floating point than with fixed point numbers, even if we consider the time for conversion. Conversely, division in the fixed point format has similar disadvantages compared to the floating point case. The whole iteration forms one pipeline circuit; therefore, the computation of the background intensities is performed rapidly.
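The conversion at the ingress and egress of the summation can be sketched in software as follows. The sketch assumes that "Q32.16" denotes a format with 16 fractional bits and that conversion rounds to the nearest representable value; the rounding mode, the helper names and the sample values are assumptions, and Python integers do not model the overflow of a fixed-width hardware accumulator.

```python
import struct

FRAC_BITS = 16            # assumption: Q32.16 has 16 fractional bits
SCALE = 1 << FRAC_BITS


def to_float32(x):
    """Round a Python float to IEEE-754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float_to_fixed(x):
    """Ingress conversion: float to a scaled integer (rounding assumed)."""
    return int(round(x * SCALE))


def fixed_to_float(v):
    """Egress conversion: scaled integer back to single precision."""
    return to_float32(v / SCALE)


def fixed_sum(values):
    """Accumulate with plain integer additions, as a fixed point adder would."""
    acc = 0
    for x in values:
        acc += float_to_fixed(to_float32(x))
    return fixed_to_float(acc)


total = fixed_sum([0.5, 1.25, -0.75])
```

The accumulation itself needs only integer additions, which is the reason the text gives for preferring fixed point in the area processor.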

![Figure 4: The block diagram of the area processor unit.](image)

![Figure 5: The block diagram of the Shirley processor unit.](image)

The iteration process ends when the difference between two consecutive background approximations no longer exceeds a predefined constant value.
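A minimal software sketch of the complete iteration with this stopping rule is given below. The text does not specify how the difference between two approximations is measured, so the largest absolute pointwise change is used here as an assumption; the tolerance `tol`, the iteration cap `max_iter`, the flat starting background and the sample spectrum are likewise hypothetical.

```python
def shirley_background(data, e1, e2, delta_e=1.0, tol=1e-6, max_iter=50):
    """Iterate the two phases until the background stops changing."""
    i1, i2 = data[e1], data[e2]
    background = [i2] * len(data)  # assumed flat starting background
    for iteration in range(1, max_iter + 1):
        # Phase 1: running area from E2 down to E1.
        a_max, a2 = 0.0, {}
        for k in range(e2, e1 - 1, -1):
            a_max += (data[k] - background[k]) * delta_e
            a2[k] = a_max
        if a_max == 0.0:
            return background, iteration  # nothing above the background
        # Phase 2: next background approximation.
        new_background = list(background)
        for j in range(e1, e2 + 1):
            new_background[j] = i2 + (i1 - i2) * (a2[j] / a_max)
        # Stopping rule: largest pointwise change (assumed measure).
        change = max(abs(n - o) for n, o in zip(new_background, background))
        background = new_background
        if change <= tol:
            return background, iteration
    return background, max_iter


spectrum = [4.0, 5.0, 9.0, 5.0, 2.0, 2.0]
bg, iterations = shirley_background(spectrum, 0, 5)
corrected = [d - b for d, b in zip(spectrum, bg)]
```

The last line shows the subtraction step for a single pixel: the corrected intensity is the measured spectrum minus the converged background.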

4.4 The final results

We have designed a hardware accelerated computation system that can distinguish certain components of electron microscope images that originate from different physical processes, such as the disturbing backgrounds. The prototype of the system has been completed, and it is able to determine and subtract certain background components. The magnified test images of a mesh sample (SUS316, #100) were taken by the DELMA at the beam-line BL07LSU of SPring-8 [6, 13] (Fig. 6). The sample in the presented work was irradiated by an electron beam of 1 keV energy and 250 µm diameter at 14° inclination to the sample surface. The system performance was tested on low magnification images whose quality was deliberately degraded by taking the images with fully opened apertures. The magnified (12×) images were intensified and converted into visible light with a microchannel plate (MCP) and phosphor screen combination and recorded as 300 × 300 pixel images by a PCO camera. Figure 6 shows the results of the background removal by the FPGA processor. Significant improvements can be seen at the electron beam illuminated image center after the subtraction of this background component.

In the center of the image notable contrast and signal-to-noise ratio improvement can be seen (Fig. 7). This is important since the center region can be used in higher magnification cases.

Figure 6: The original (left) [6] and background-subtracted images by FPGA processor (right). The arrows indicate the centerlines of the sample region where the intensity curves of Fig. 7 were measured.

Figure 7: The intensity distribution of the images in the horizontal centerlines of the sample region marked by the blue (original) and red (background-subtracted) arrows in Fig. 6.

We have also achieved remarkable results in the relative reduction of running time. Using specialized hardware components, the background computation is done efficiently in parallel. Our present system realizes two parallel computation threads, where the limitation comes only from the properties of the development board used. Theoretically, the number of parallel threads is limited only by the physical resources of the FPGA and the applied serving strategy. For example, with a board carrying a more advanced FPGA, we could implement even more than a thousand parallel Shirley modules. In this case large amounts of measured data must be sent to the device; therefore, the bottleneck of the prototype system is the communication between the PC and the FPGA board. However, this transfer can be omitted by integrating the FPGA based system into the measuring device. The current FPGA prototype system has a significantly shorter computation time than an ordinary computer, even though the applied FPGA has only a 50 MHz clock frequency while the PC runs at 3.6 GHz. Furthermore, the FPGA has much lower power consumption. The following table summarizes the results:

Table 1: Running times of the FPGA system with one Shirley module, with two Shirley modules in parallel, and the PC for a hundred spectra.

<table>
<thead>
<tr>
<th>Spectrum length (number of points)</th>
<th>Running Time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>FPGA</td>
</tr>
<tr>
<td>70</td>
<td>3,660</td>
</tr>
<tr>
<td>175</td>
<td>8,070</td>
</tr>
<tr>
<td>350</td>
<td>15,420</td>
</tr>
<tr>
<td>700</td>
<td>30,102</td>
</tr>
</tbody>
</table>

Table 1 shows the running times of the algorithm on different platforms. Column one gives the length of the spectra. The second column indicates that the computation time was measured for a hundred spectra. The third column shows the running time of the background computation on the FPGA\(^1\) with one Shirley module. In this case there is no parallel computing of different spectra, just the hardware implementation of the previously described algorithm, with a pipelined design. The fourth column shows the running time of the background computations for a hundred spectra with two parallel Shirley modules on the FPGA. In these cases there are no external factors that would affect the length of the computation, so the running times are identical for every run. The fifth column shows the running time of the software implementation of the same algorithm on a personal computer\(^2\). Several factors affect the computation time there, such as the scheduling of the operating system or the resources available at a given moment, so the running time in this case is the average of several computations. The results on the FPGA tend to be better than on the PC (Fig. 8), even though the PC has a much higher clock frequency (50 MHz << 3.6 GHz). The point is to offload the effort from the PC and to embed the background correction system into the measuring device. The performance can be increased further if more Shirley modules can be placed on the FPGA.
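The communication bottleneck noted earlier can be made concrete with a rough estimate of the one-way transfer time for a full image sequence. The image size (300 × 300 pixels), the spectrum length (70 points) and the four-byte single precision format come from the text; the baud rate of 115200 and the ten bits per byte of 8N1 framing are assumptions, since the serial settings of the prototype are not given.

```python
def transfer_seconds(width, height, points,
                     baud=115200,        # assumed serial speed
                     bits_per_byte=10,   # assumed 8N1 framing
                     bytes_per_value=4): # IEEE-754 single precision
    """Estimate the one-way serial transfer time for an image sequence."""
    total_bytes = width * height * points * bytes_per_value
    return total_bytes * bits_per_byte / baud


seconds = transfer_seconds(300, 300, 70)
minutes = seconds / 60.0
```

Under these assumptions the transfer alone takes roughly half an hour, which illustrates why the text proposes integrating the FPGA into the measuring device.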

\(^1\)Altera DE2 board with Cyclone II FPGA clocked with 50 MHz.

Figure 8: The proportions of the difference between the running times of different platforms remain the same as the spectrum length (energy resolution of the image sequence) increases.

5 Conclusions

The prototype of the background computer system provided remarkable results and important experience which will be useful in the design of a new high performance hardware acceleration system. During the development of the prototype we realized that a relevant performance enhancement requires a high-end FPGA platform which has the necessary resources to determine the background values in real time. That device could be used as an embedded unit of the measuring instrument, so that it would no longer be necessary to use communication protocols between the PC and the hardware. The running time could be reduced further in the future if the system performed the computations in more than two threads of Shirley modules. The number of threads is limited mainly by the applied hardware resources and not by the realization of the method; therefore, with a higher performance FPGA, more parallel computation modules could be executed simultaneously.

6 Acknowledgement

We would like to thank the Japan Synchrotron Radiation Research Institute (JASRI) for the opportunity to use the BL07LSU beam-line at the Super Photon Ring - 8 GeV [13] synchrotron facility, and the Altera Company for providing the DE2 development and education board in the framework of the Altera University Program.

References


