Steganalysis and Machine Learning: A European answer

Steganography is a mechanism for secretly encoding information within any means of transmission. Its use has been known since ancient Greece, and it was defined in glossaries towards the end of the fifteenth century. Both the encoded information and the medium of transmission are secret; that is, known only to the parties who intend to communicate covertly. Steganography is therefore an ideal tool for creating secret communication channels that can be used in sophisticated espionage scenarios, computer crime, and data breaches in the public and private sectors.

Steganography differs from cryptography, in which the encoded information and the medium of transmission are generally known (for example, the HTTPS protocol used by this site). In this case, the encoding mechanism makes the extraction of information extremely difficult without the knowledge of additional data, known as encryption/decryption keys. These keys are known only to the parties authorized to communicate (for example, your browser and our web server).

The process of analyzing steganography is known as steganalysis. In its simplest form, this process aims to detect the presence of steganography in one or more transmission media; in a further stage, it may also extract the hidden message.
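To make "hidden message" and "extraction" concrete, here is a minimal sketch of one classic technique, least-significant-bit (LSB) embedding, together with its extractor. It is an illustration only: the function names and the 16-bit length prefix are assumptions made for this example, and none of the tools discussed in this post is claimed to work this way.

```python
def embed_lsb(cover: bytes, message: bytes) -> bytes:
    """Hide `message` in the least significant bits of `cover`, one bit per byte."""
    # Illustrative choice: a 16-bit big-endian length prefix tells the
    # extractor where the message ends.
    payload = len(message).to_bytes(2, "big") + message
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover too small for message")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the lowest bit
    return bytes(stego)


def _bits_to_bytes(bits) -> bytes:
    """Pack a list of bits (most significant first) into bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def extract_lsb(stego: bytes) -> bytes:
    """Recover a message written by `embed_lsb`."""
    bits = [b & 1 for b in stego]
    length = int.from_bytes(_bits_to_bytes(bits[:16]), "big")
    return _bits_to_bytes(bits[16:16 + 8 * length])


cover = bytes(range(256)) * 2            # stand-in for raw pixel or audio samples
stego = embed_lsb(cover, b"meet at noon")
recovered = extract_lsb(stego)
```

Each carrier byte changes by at most 1, which is why this kind of embedding is hard to spot by eye and why steganalysis tends to rely on statistics or tool-specific artifacts instead.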

The effectiveness of steganalysis techniques depends strongly on the degree of sophistication and customization of the steganography techniques used by an opponent.

The simplest case is an opponent with little or no knowledge of steganography, who simply uses tools already implemented and made available by others (off-the-shelf tools). In computer security, such an opponent is often called a script kiddie.

In the digital field, there are many pieces of software that implement steganography, and most of them combine it with cryptographic techniques. Examples of open-source software using both techniques are shown in Table 1.

Table 1: Examples of open-source steganography tools
Tool name	Means of transmission
Xiao Steganography¹	Images, audio
Image Steganography²	Images
Steghide³	Images, audio
SSuite Picsel⁴	Images
Stego Magic⁵	Images, audio, video, executables
OpenPuff⁶	Images, audio, video, Flash

Of course, off-the-shelf tools are also available to those who intend to perform steganalysis. While implementing steganography, each piece of software typically leaves (more or less implicitly) characteristic artifacts in the manipulated files, which can be studied to build signatures (fingerprinting). These signatures can be used in the steganalysis phase to identify not only the presence of steganography but also the specific tool used, and to successfully extract the hidden contents⁷,⁸. Most steganalysis systems employ this approach⁹.
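As a rough illustration of signature-based detection, the sketch below scans a file's bytes for known patterns and reports which tool they point to. The tool names and byte patterns are hypothetical placeholders invented for this example; they are not real artifacts of any software listed in Table 1.

```python
# Hypothetical fingerprints: placeholder names and byte patterns,
# NOT real artifacts left by any actual steganography tool.
SIGNATURES = {
    "ExampleTool-A": b"\x00STEGO-A\x00",
    "ExampleTool-B": b"\xde\xad\xbe\xef\x42",
}


def identify_tools(data: bytes, signatures=SIGNATURES):
    """Return the sorted names of all tools whose signature occurs in `data`."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


clean_file = b"\x89PNG" + bytes(100)
tainted_file = clean_file + b"\xde\xad\xbe\xef\x42"

clean_hits = identify_tools(clean_file)      # no signature present
tainted_hits = identify_tools(tainted_file)  # matches the second placeholder
```

Real fingerprints are usually more subtle than a fixed byte string (for instance, fixed embedding positions or characteristic header fields), but the matching logic follows the same idea.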

It is easy to see that we are in a vicious circle (an "arms race") that points to increasing sophistication in the techniques and tools used both by those who use steganography and by those who seek to unmask it and reveal its hidden contents. Of the two sides, the first generally has the advantage, since it can change the means of transmission and/or the encoding of information at any time to escape detection.

For example, an opponent could modify the implementation of a steganography tool to escape fingerprinting, or even implement entirely new steganographic techniques. This of course has a cost (we are no longer dealing with script kiddies), but the cost can be reasonable given sufficient motivation (e.g. the strategic or economic benefits sought by a cyber-espionage organization).

This situation is well known in the field of computer security—it is generally much easier to attack computer systems than to defend them. Malware instances constantly appear in "polymorphic" variants precisely to evade the detection mechanisms put in place by defenders (e.g. antimalware signatures).

In this scenario, machine learning may represent a sophisticated weapon at the service of those who intend to unmask steganography. Through machine learning techniques it is in fact possible to automatically develop a steganalysis model, starting from a set of samples with and/or without steganography.

Figure 1: Machine learning techniques

Most of the proposed approaches use so-called two-class supervised learning (steganography present/absent), which requires samples both with and without steganography in order to determine the statistical differences between them automatically. This method is particularly useful for detecting variants of known steganographic techniques (e.g. implemented in new software) for which no signatures exist.
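The two-class idea can be sketched with a deliberately tiny learner. Everything below is an assumption made for illustration: the data is synthetic, the single feature (how often neighboring samples differ in their lowest bit) is hand-picked, and the nearest-centroid classifier stands in for the far richer models used in practice.

```python
import random


def lsb_transition_rate(samples):
    """Feature: fraction of neighboring samples whose least significant bits differ."""
    flips = sum((a ^ b) & 1 for a, b in zip(samples, samples[1:]))
    return flips / (len(samples) - 1)


def make_cover(rng, length=512, run=8):
    """Synthetic 'clean' signal: piecewise-constant runs, so neighbors usually match."""
    out = []
    while len(out) < length:
        out.extend([rng.randrange(256)] * run)
    return out[:length]


def make_stego(rng, cover):
    """Toy embedding: overwrite every least significant bit with a random message bit."""
    return [(s & 0xFE) | rng.getrandbits(1) for s in cover]


def build_dataset(rng, n):
    """Return features and labels (0 = clean, 1 = stego) for n cover/stego pairs."""
    features, labels = [], []
    for _ in range(n):
        cover = make_cover(rng)
        features += [lsb_transition_rate(cover),
                     lsb_transition_rate(make_stego(rng, cover))]
        labels += [0, 1]
    return features, labels


def train_centroids(features, labels):
    """Learn the mean feature value of each class from labeled samples."""
    centroids = {}
    for label in set(labels):
        values = [f for f, y in zip(features, labels) if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def predict(centroids, feature):
    """Assign the class whose learned centroid is closest to the feature."""
    return min(centroids, key=lambda label: abs(feature - centroids[label]))


rng = random.Random(0)
train_x, train_y = build_dataset(rng, 50)
test_x, test_y = build_dataset(rng, 50)
centroids = train_centroids(train_x, train_y)
accuracy = sum(predict(centroids, f) == y for f, y in zip(test_x, test_y)) / len(test_y)
```

The model never sees a signature: it only learns, from labeled examples, that embedded samples have a noisier lowest bit than clean ones. That is also its weakness, since an embedding that preserves this statistic would go unnoticed.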

Examples of various algorithms based on supervised learning for the detection of steganography in images have been implemented in an open-source library called Aletheia¹⁰.

Signatures and supervised learning can provide good accuracy when it comes to detecting known steganography techniques and their variants, but they are subject to evasion by totally new techniques, for example those with a statistical profile significantly different from that observed in the samples used in the learning phase.

For this reason, other studies¹¹,¹² have instead proposed unsupervised, anomaly-based learning techniques. This approach employs only samples in which steganography is absent, in order to build a normal profile automatically. Anomalies ("outliers"), that is, deviations from this profile, can therefore be used to detect totally unknown steganographic techniques. To offer good accuracy, however, this approach must focus on features whose deviation from the norm is a reliable indication of steganography. Consider, for example, a comparison between the size specified in a file's header and the file's actual size.
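The header-size example can be turned into a small anomaly detector. The sketch below is built on assumptions: it uses a simplified BMP-style layout (the `BM` magic followed by a little-endian 32-bit declared size, as in the real BMP file header, but without the remaining header fields), synthetic clean samples with a few bytes of harmless padding, and a plain z-score threshold as the "normal profile". The helper names are hypothetical.

```python
import statistics
import struct


def make_bmp_like(payload: bytes, trailing: bytes = b"") -> bytes:
    """Hypothetical demo helper: 'BM' + declared size (uint32, little-endian) + payload.
    `trailing` bytes are appended after the declared end; not a viewable image."""
    declared = 6 + len(payload)
    return b"BM" + struct.pack("<I", declared) + payload + trailing


def size_mismatch(blob: bytes) -> int:
    """Feature: actual length minus the size declared in the header."""
    if len(blob) < 6 or blob[:2] != b"BM":
        raise ValueError("not a BMP-style blob")
    (declared,) = struct.unpack("<I", blob[2:6])
    return len(blob) - declared


def fit_profile(clean_blobs):
    """Normal profile: mean and standard deviation of the feature on clean samples only."""
    values = [size_mismatch(b) for b in clean_blobs]
    return statistics.mean(values), statistics.pstdev(values)


def is_anomalous(blob: bytes, profile, z_threshold: float = 3.0) -> bool:
    """Flag a sample whose feature lies more than `z_threshold` deviations from the mean."""
    mean, std = profile
    std = max(std, 1e-9)  # avoid division by zero when all clean samples agree
    return abs(size_mismatch(blob) - mean) / std > z_threshold


# Synthetic clean samples: assume a few bytes of harmless trailing padding are normal.
clean = [make_bmp_like(bytes(64), bytes(pad)) for pad in (0, 0, 2, 0, 1, 0, 3, 0, 0, 1)]
profile = fit_profile(clean)

# Suspicious sample: a large block of data appended after the declared end of file.
suspicious = make_bmp_like(bytes(64), b"hidden payload" * 20)
flagged = is_anomalous(suspicious, profile)
```

No stego sample is needed to build the profile, so the detector can flag a technique it has never seen, provided that technique disturbs the chosen feature. An embedding that leaves the file size untouched would pass this particular check.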

Since each steganalysis technique has its own strengths, a combination of signatures, supervised learning and unsupervised learning is often useful¹². This is exactly one of the objectives of the SIMARGL project, funded by the European Commission under Grant Agreement n° 833042.

The project, with a total budget of 6 million euros, aims to create advanced steganalysis systems applied to the detection of (stego)malware, malicious software increasingly used by cybercriminals and nation states in espionage operations. In this project, prominent international actors such as Airbus, Siveco, Thales, Orange Cert and FernUniversität (project coordinator) work alongside other international partners from 7 countries (Netzfactor, ITTI, Warsaw University, IIR, RoEduNet, Stichting CUIng Foundation, Pluribus One, Numera and CNR also participate in the consortium). The partners will bring to bear artificial intelligence, sophisticated products that are already available, and machine learning algorithms currently being improved, in order to propose an integrated solution capable of handling different scenarios and acting at different levels: from monitoring network traffic to detecting obfuscated bits within images.

The challenge of the SIMARGL project has just begun and will provide concrete answers to the problem of stegomalware in the next two years: the project will end in April 2022.

It is important to emphasize that machine learning (and, more generally, artificial intelligence) is, like many other technologies, neutral. Specifically, it is dual-use¹³ and is not inherently a force for good. In principle, machine learning can also be used to develop more sophisticated, polymorphic, data-driven steganographic techniques.

Let's get ready, because this scenario could represent the future of cyber threats (and perhaps a piece of the future is already present).

This blog post is published by Igino Corona and Matteo Mauri from Pluribus One.

Footnotes:

¹ Xiao Steganography, https://www.softpedia.com/get/Security/Encrypting/Xiao-Steganography.shtml

² Image Steganography, https://archive.codeplex.com/?p=imagesteganography

³ Steghide, http://steghide.sourceforge.net/download.php

⁴ SSuite Picsel, https://www.ssuitesoft.com/ssuitepicselsecurity.htm

⁵ Stego Magic, https://www.gohacking.com/hide-data-in-image-audio-video-files-steganography/

⁶ OpenPuff, https://embeddedsw.net/OpenPuff_Steganography_Home.html

⁷ Pengjie Cao, Xiaolei He, Xianfeng Zhao, Jimin Zhang, Approaches to obtaining fingerprints of steganography tools which embed message in fixed positions, Forensic Science International: Reports, Volume 1, 2019, 100019, ISSN 2665-9107, https://doi.org/10.1016/j.fsir.2019.100019

⁸ Chen Gong, Jinghong Zhang, Yunzhao Yang, Xiaowei Yi, Xianfeng Zhao, Yi Ma, Detecting fingerprints of audio steganography software, Forensic Science International: Reports, Volume 2, 2020, 100075, ISSN 2665-9107, https://doi.org/10.1016/j.fsir.2020.100075

⁹ Gary C. Kessler, An Overview of Steganography for the Computer Forensics Examiner, https://www.garykessler.net/library/fsc_stego.html

¹⁰ Aletheia, https://github.com/daniellerch/aletheia

¹¹ Jacob T. Jackson, Gregg H. Gunsch, Roger L. Claypoole, Jr., Gary B. Lamont, Blind Steganography Detection Using a Computational Immune System: A Work in Progress, International Journal of Digital Evidence, Winter 2003, Issue 1, Volume 4

¹² Brent T. McBride, Gilbert L. Peterson, Steven C. Gustafson, A new blind method for detecting novel steganography, Digital Investigation, Volume 2, Issue 1, 2005, Pages 50-70, ISSN 1742-2876, https://doi.org/10.1016/j.diin.2005.01.003

¹³ Fabio Roli, Matteo Mauri, Artificial Intelligence: past, present and future. Part II - The Good, the Bad and the Ugly, AI & Cybersecurity insights: Pluribus One Blog, September 2019, https://www.pluribus-one.it/company/blog/81-artificial-intelligence/76-good-bad-ugly-in-ai