Publications | Qi Zhan

2024

REACT: IR-Level Patch Presence Test for Binary

Qi Zhan , Xing Hu , Xin Xia , and Shanping Li

In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering , 2024

Abs PDF Code

Patch presence test is critical in software security to ensure that binary files have been patched for known vulnerabilities. It is challenging due to the semantic gap between the source code and the binary, and the small and subtle nature of patches. In this paper, we propose React, the first patch presence test approach on IR-level. Based on the IR code compiled from the source code and the IR code lifted from the binary, we first extract four types of feature (return value, condition, function call, and memory store) by executing the program symbolically. Then, we refine the features from the source code and rank them. Finally, we match the features to determine the presence of a patch with an SMT solver to check the equivalence of features at the semantic level.To evaluate our approach, we compare it with state-of-the-art approaches, BinXray and PS3, on a dataset containing binaries compiled from different compilers and optimization levels. Our experimental results show that React achieves scores of 0.88, 0.98, and 0.93, in terms of precision, recall, and F1 score, respectively. React outperforms the baselines by 39% and 12% in terms of the F1 score, while the testing speed of our approach is 2x faster than BinXray and 100x faster than PS3. Furthermore, we conduct an ablation study to evaluate the effectiveness of each component in React, which shows that SMT solver and refinement can contribute to 16% and 10% improvement in terms of the F1 score, respectively.
PS3: Precise Patch Presence Test based on Semantic Symbolic Signature

Qi Zhan , Xing Hu , Zhiyang Li , Xin Xia , David Lo , and Shanping Li

In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering , 2024

Abs PDF Code

During software development, vulnerabilities have posed a significant threat to users. Patches are the most effective way to combat vulnerabilities. In a large-scale software system, testing the presence of a security patch in every affected binary is crucial to ensure system security. Identifying whether a binary has been patched for a known vulnerability is challenging, as there may only be small differences between patched and vulnerable versions. Existing approaches mainly focus on detecting patches that are compiled in the same compiler options. However, it is common for developers to compile programs with very different compiler options in different situations, which causes inaccuracy for existing methods. In this paper, we propose a new approach named PS3, referring to precise patch presence test based on semantic-level symbolic signature. PS3 exploits symbolic emulation to extract signatures that are stable under different compiler options. Then PS3 can precisely test the presence of the patch by comparing the signatures between the reference and the target at semantic level.To evaluate the effectiveness of our approach, we constructed a dataset consisting of 3,631 (CVE, binary) pairs of 62 recent CVEs in four C/C++ projects. The experimental results show that PS3 achieves scores of 0.82, 0.97, and 0.89 in terms of precision, recall, and F1 score, respectively. PS3 outperforms the state-of-the-art baselines by improving 33% in terms of F1 score and remains stable in different compiler options.
Survey on Vulnerability Awareness of Open Source Software

Qi Zhan , Shengyi Pan , Xing Hu , Lingfeng Bao , and Xin Xia

软件学报, 2024

Abs PDF

As the modern software scale expands, software vulnerabilities bring a great threat to the security and reliability of computer systems and software, causing huge damage to people’s production and life. In recent years, as open source software (OSS) is widely used, the vulnerability issues of OSS have received much attention. Vulnerability awareness techniques can effectively help OSS users to identify vulnerabilities at the early stage for timely defense. Different from the vulnerability detection techniques for traditional software, the transparency and cooperativity of OSS vulnerabilities bring great challenges to vulnerability awareness. Therefore, various techniques are proposed by scholars and developers to perceive potential vulnerabilities and risks in OSS from the code and open source community, so as to find OSS vulnerabilities as early as possible and reduce the losses caused by the vulnerabilities. To boost the development of OSS vulnerability awareness techniques, this study conducts a systematic literature review of existing research works. The study selects 45 high-level papers on open source vulnerability awareness techniques, including code-based, open source community discussion-based, and patch-based vulnerability awareness techniques. The results of these papers are systematically summarized. Especially, this study proposes the category of techniques based on the OSS vulnerability life cycle for the first time according to the most recent publications, which supplements and improves the existing taxonomy of vulnerability awareness techniques. Finally, the study discusses the challenges in the field and predicts future research direction.

2022

C4: Contrastive Cross-Language Code Clone Detection

Chenning Tao* , Qi Zhan* , Xing Hu , and Xin Xia

In 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC) , 2022

Abs PDF Code

During software development, developers introduce code clones by reusing existing code to improve programming productivity. Considering the detrimental effects on software maintenance and evolution, many techniques are proposed to detect code clones. Existing approaches are mainly used to detect clones written in the same programming language. However, it is common to develop programs with the same functionality but in different programming languages to support various platforms. In this paper, we propose a new approach named C4, referring to Contrastive Cross-language Code Clone detection model. It can detect cross-language clones with learned representations effectively. C4 exploits the pre-trained model CodeBERT to convert programs in different languages into high-dimensional vector representations. In addition, we fine tune the C4 model through a constrastive learning objective that can effectively recognize clone pairs and non-clone pairs. To evaluate the effectiveness of our approach, we conduct extensive experiments on the dataset proposed by CLCDSA. Experimental results show that C4 achieves scores of 0.94, 0.90, and 0.92 in terms of precision, recall and F-measure and substantially outperforms the state-of-the-art baselines.