
Code stylometry


Code stylometry (also known as program authorship attribution or source code authorship analysis) is the application of stylometry to computer code to attribute authorship to anonymous binary or source code. It often involves breaking down and examining the distinctive patterns and characteristics of the programming code and then comparing them to computer code whose authorship is known.[1] Unlike software forensics, code stylometry attributes authorship for purposes other than intellectual property infringement, including plagiarism detection, copyright investigation, and authorship verification.[2]

History

In 1989, researchers Paul Oman and Curtis Cook identified the authorship of 18 different Pascal programs written by six authors by using “markers” based on typographic characteristics.[3]

In 1998, researchers Stephen MacDonell, Andrew Gray, and Philip Sallis developed a dictionary-based author attribution system called IDENTIFIED (Integrated Dictionary-based Extraction of Non-language-dependent Token Information for Forensic Identification, Examination, and Discrimination) that determined the authorship of source code in computer programs written in C++. The researchers noted that authorship can be identified using degrees of flexibility in the writing style of the source code, such as:[4]

  • The way the algorithm in the source code solves the given problem
  • The way the source code is laid out (spacing, indentation, bordering characteristics, standard headings, etc.)
  • The way the algorithm is implemented in the source code

The IDENTIFIED system attributed authorship by first merging all the relevant files to produce a single source code file and then subjecting it to a metrics analysis by counting the number of occurrences for each metric. In addition, the system was language-independent due to its ability to create new dictionary files and meta-dictionaries.[4]
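The sketch below illustrates, in Python, what dictionary-based metric counting over merged source files can look like in the spirit of the description above. It is not the original IDENTIFIED implementation; the metric names and regular expressions are invented here purely for demonstration.

```python
# Illustrative sketch only: a dictionary-based metric counter in the spirit of
# IDENTIFIED, not the original system. Metric names and patterns are hypothetical.
import re

# Hypothetical "dictionary file": each metric maps to a regular expression.
METRIC_DICTIONARY = {
    "brace_on_own_line": re.compile(r"^\s*\{\s*$", re.MULTILINE),
    "tab_indentation": re.compile(r"^\t+", re.MULTILINE),
    "inline_comment": re.compile(r"//[^\n]*"),
    "camel_case_identifier": re.compile(r"\b[a-z]+[A-Z]\w*\b"),
}

def merge_sources(paths):
    """Merge all relevant files into a single source code string."""
    parts = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as handle:
            parts.append(handle.read())
    return "\n".join(parts)

def count_metrics(source):
    """Count the number of occurrences of each dictionary metric in the merged source."""
    return {name: len(pattern.findall(source))
            for name, pattern in METRIC_DICTIONARY.items()}
```

Swapping in a different dictionary of patterns is what would make such a counter language-independent, mirroring the role of the dictionary files and meta-dictionaries described above.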

In 1999, a team of researchers led by Stephen MacDonell tested the performance of three different program authorship discrimination techniques on 351 programs written in C++ by 7 different authors. The researchers compared the effectiveness of a feed-forward neural network (FFNN) trained with a back-propagation algorithm, multiple discriminant analysis (MDA), and case-based reasoning (CBR). At the end of the experiment, both the neural network and the MDA had an accuracy rate of 81.1%, while the CBR reached an accuracy of 88.0%.[5]

In 2005, researchers from the Laboratory of Information and Communication Systems Security at Aegean University introduced a language-independent method of program authorship attribution in which they used byte-level n-grams to classify a program to an author. This technique scanned the files and then created a table of the different n-grams found in the source code and the number of times each appeared. In addition, the system could operate with limited numbers of training examples from each author, although the more source code programs that were available for each author, the more reliable the attribution. In an experiment testing their approach, the researchers found that classification using n-grams reached an accuracy rate of up to 100%, although the rate declined drastically if the profile size exceeded 500 and the n-gram size was 3 or less.[3]
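A minimal Python sketch of byte-level n-gram profiling may make the table-building step concrete. It follows the general idea described above (frequency tables of byte n-grams, truncated to a fixed profile size) rather than the researchers' exact implementation, and the similarity measure used here (profile intersection) is a simplifying assumption.

```python
# Sketch of byte-level n-gram author profiling; parameters n and profile_size
# correspond to the n-gram size and profile size discussed above.
from collections import Counter

def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Build a table of byte-level n-grams and the number of times each appears."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def author_profile(samples: list[bytes], n: int = 3, profile_size: int = 500) -> set:
    """Combine an author's known programs and keep the most frequent n-grams."""
    counts = Counter()
    for sample in samples:
        counts.update(byte_ngrams(sample, n))
    return {gram for gram, _ in counts.most_common(profile_size)}

def attribute(unknown: bytes, profiles: dict[str, set], n: int = 3) -> str:
    """Assign the unknown program to the author whose profile overlaps it most."""
    grams = set(byte_ngrams(unknown, n))
    return max(profiles, key=lambda author: len(grams & profiles[author]))
```

Because the profiles are built directly from raw bytes rather than parsed tokens, the approach does not depend on the programming language of the samples, which is the language-independence property noted above.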

In 2011, researchers from the University of Wisconsin created a program authorship attribution system that identified a programmer based on the binary code of a program instead of the source code. The researchers utilized machine learning and training code to determine which characteristics of the code would be helpful in describing the programming style. In an experiment testing the approach on a set of programs written by 10 different authors, the system achieved an accuracy rate of 81%. When tested using a set of programs written by almost 200 different authors, the system performed with an accuracy rate of 51%.[6]

In 2015, a team of postdoctoral researchers from Princeton University, Drexel University, the University of Maryland, and the University of Goettingen, as well as researchers from the U.S. Army Research Laboratory, developed a program authorship attribution system that could determine the author of a program from a sample pool of programs written by 1,600 coders with 94 percent accuracy. The methodology consisted of four steps:[7]

  1. Disassembly – The program is disassembled to obtain information on its characteristics.
  2. Decompilation – The program is converted into a variant of C-like pseudocode through decompilation to obtain abstract syntax trees.
  3. Dimensionality reduction – The most relevant and useful features for author identification are selected.
  4. Classification – A random-forest classifier attributes the authorship of the program.

This approach analyzed various characteristics of the code, such as blank space, the use of tabs and spaces, and the names of variables, and then used a method of evaluation called syntax tree analysis, which translated the sample code into tree-like diagrams that displayed the structural decisions involved in writing the code. The design of these diagrams prioritized the order of the commands and the depth of the functions nested in the code.[8]
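As a rough illustration of the dimensionality reduction and classification steps, the sketch below assumes the stylistic features (whitespace layout, tab versus space usage, identifier naming, nesting depth, and so on) have already been extracted into numeric vectors. It uses scikit-learn's feature selection and random forest as stand-ins and is not the researchers' actual pipeline.

```python
# Simplified stand-in for steps 3 and 4 above: select the most informative
# stylistic features, then attribute authorship with a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

def train_attributor(feature_vectors, author_labels, n_features: int = 50):
    """Fit a feature-selection + random-forest pipeline on known-author samples."""
    model = make_pipeline(
        SelectKBest(f_classif, k=n_features),                       # dimensionality reduction
        RandomForestClassifier(n_estimators=300, random_state=0),   # classification
    )
    model.fit(feature_vectors, author_labels)
    return model

# Usage (hypothetical variable names):
#   model = train_attributor(X_known, y_known)
#   predicted_author = model.predict(X_unknown)
```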

The 2014 Sony Pictures hacking attack

U.S. intelligence officials were able to determine that the 2014 cyber attack on Sony Pictures was sponsored by North Korea after evaluating the software, techniques, and network sources. The attribution was made after cybersecurity experts noticed similarities between the code used in the attack and the malicious software known as Shamoon, which was used in the 2013 attacks against South Korean banks and broadcasting companies by North Korea.[9]

References

  1. ^ Claburn, Thomas (March 16, 2018). "FYI: AI tools can unmask anonymous coders from their binary executables". The Register. Retrieved August 2, 2018.
  2. ^ De-anonymizing Programmers via Code Stylometry. August 12, 2015. ISBN 9781939133113. Retrieved August 2, 2018.
  3. ^ a b Frantzeskou, Georgia; Stamatatos, Efstathios; Gritzalis, Stefanos (October 2005). "Supporting the Cybercrime Investigation Process: Effective Discrimination of Source Code Authors Based on Byte-Level Information". E-business and Telecommunication Networks. Communications in Computer and Information Science. Vol. 3. pp. 283–290. doi:10.1007/978-3-540-75993-5_14. ISBN 978-3-540-75992-8 – via ResearchGate.
  4. ^ a b Gray, Andrew; MacDonell, Stephen; Sallis, Philip (January 1998). "IDENTIFIED (Integrated Dictionary-based Extraction of Non-language-dependent Token Information for Forensic Identification, Examination, and Discrimination): A dictionary-based system for extracting source code metrics for software forensics". Proceedings. 1998 International Conference Software Engineering: Education and Practice (Cat. No.98EX220). pp. 252–259. doi:10.1109/SEEP.1998.707658. hdl:10292/3472. ISBN 978-0-8186-8828-7. S2CID 53463447 – via ResearchGate.
  5. ^ MacDonell, Stephen; Gray, Andrew; MacLennan, Grant; Sallis, Philip (February 1999). "Software forensics for discriminating between program authors using case-based reasoning, feedforward neural networks and multiple discriminant analysis". Neural Information Processing. 1. ISSN 1177-455X – via ResearchGate.
  6. ^ Rosenblum, Nathan; Zhu, Xiaojin; Miller, Barton (September 2011). "Who wrote this code? Identifying the authors of program binaries". Proceedings of the 16th European Conference on Research in Computer Security. ESORICS'11: 172–189. ISBN 978-3-642-23821-5 – via ACM Digital Library.
  7. ^ Brayboy, Joyce (January 15, 2016). "Malicious coders will lose anonymity as identity-finding research matures". U.S. Army. Retrieved August 2, 2018.
  8. ^ Greenstadt, Rachel (February 27, 2015). "Dusting for Cyber Fingerprints: Coding Style Identifies Anonymous Programmers". Forensic Magazine. Retrieved August 2, 2018.
  9. ^ Brunnstrom, David; Finkle, Jim (December 18, 2014). "U.S. considers 'proportional' response to Sony hacking attack". Reuters. Retrieved August 2, 2018.