Real Data Corpus

The Real Data Corpus (RDC) is a collection of disk images extracted from secondary storage devices that were acquired from second-hand markets around the world. In total, the RDC currently consists of roughly 58 TiB of data contained in 3,128 disk images from 29 countries. A variety of devices are represented, including magnetic media and solid-state storage from laptops, desktops, mobile phones, USB memory sticks, and other media. The dataset is hosted in the HPC infrastructure at the Naval Postgraduate School, as well as in AWS GovCloud.

Potential Uses

The Real Data Corpus is a one-of-a-kind scientific resource for:

  • Developing and validating forensic and data recovery tools.
  • Training students in forensics and data recovery.
  • Developing and validating document translation software.
  • Exploring and characterizing real-world computing practices, configuration choices, and option settings.
  • Studying the storage allocation strategies of file systems under real-world conditions.

The RDC has been cited in over 60 articles. See our current list here.

Current Contents

The following countries are represented:

Data By Country

Country Name          # Images  Size (readable)  Size (bytes)
United Arab Emirates  89        8.03 TiB         8,825,258,946,193
Austria               44        1.52 TiB         1,674,638,465,645
Bosnia                7         574.73 MiB       602,644,365
Bangladesh            59        1.56 TiB         1,710,737,896,119
Bahamas               34        1.38 TiB         1,521,076,921,944
Canada                54        893.59 GiB       959,485,134,238
Switzerland           2         1.61 GiB         1,727,236,374
China                 746       561.15 GiB       602,527,863,126
Czech Republic        24        1.38 TiB         1,521,964,790,057
Germany               41        636.61 GiB       683,551,846,923
Egypt                 7         53.75 GiB        57,710,165,396
Ghana                 21        622.91 GiB       668,842,288,279
Greece                7         3.04 GiB         3,267,501,589
Hong Kong             8         145.92 GiB       156,677,292,656
Hungary               22        510.46 GiB       548,097,899,391
Israel                300       7.5 TiB          8,246,871,569,750
India                 669       10.99 TiB        12,078,854,291,887
Japan                 4         29.58 GiB        31,760,575,283
Morocco               11        108.54 GiB       116,547,412,932
Mexico                171       403.45 GiB       433,196,045,674
Malaysia              78        1.86 TiB         2,043,906,920,751
Panama                17        204.38 GiB       219,454,669,389
Pakistan              88        3.44 TiB         3,784,807,218,108
Palestine             139       1.07 TiB         1,174,201,653,174
Serbia                24        818.9 GiB        879,290,824,361
Singapore             238       5.9 TiB          6,491,155,492,690
Thailand              188       6.99 TiB         7,681,881,741,459
Turkey                10        484.83 GiB       520,583,203,000
United Kingdom        26        850.58 GiB       913,307,005,195
All                   3128      57.74 TiB        63,487,653,727,660
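
The readable sizes in the table use binary units (MiB, GiB, TiB), where each step is a factor of 1,024. As a minimal sketch of how the Size (readable) column relates to the Size (bytes) column, the conversion could look like the following. The function name `readable_size` is hypothetical and is not part of any RDC tooling; the rounding to two decimals with trailing zeros dropped is an assumption inferred from the table values.

```python
def readable_size(num_bytes: int) -> str:
    """Format a byte count in binary (IEC) units.

    Hypothetical helper: rounds to two decimals and drops trailing
    zeros, which appears to match the table (an assumption).
    """
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            text = f"{value:.2f}".rstrip("0").rstrip(".")
            return f"{text} {unit}"
        value /= 1024


print(readable_size(602_644_365))         # 574.73 MiB (Bosnia row)
print(readable_size(8_246_871_569_750))   # 7.5 TiB (Israel row)
print(readable_size(63_487_653_727_660))  # 57.74 TiB (All row)
```

Because the units are binary, 57.74 TiB corresponds to about 63.5 × 10^12 bytes rather than 57.74 × 10^12.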

Access and Availability

Please contact us if you would like access to the Real Data Corpus. In general, due to privacy concerns, we do not release copies of the data to private individuals. However, depending on the requirements of the project, we may be able to offer access through one of two methods:

  1. Mediated Access. Researchers submit source code, build instructions, and detailed instructions for running their experiment. We return sanitized results. This is the most expedient option in cases where the desired experiment does not involve human subjects research.
  2. Direct Access. Researchers create virtual machines on AWS GovCloud, and these machines are granted access to the dataset. Because this method may involve direct contact with sensitive data, it involves additional review.

Please be aware that due to limited staff we cannot always accommodate all requests. Efforts are underway to develop infrastructure that will allow us to meet a wider range of research requirements without unduly increasing privacy risks.

IRB Required for Research

The National Research Act[2] (NRA) of 1974 and the Common Rule[3] govern all federally funded research in the United States that is performed with human beings as experimental subjects. Because portions of the Real Data Corpus were funded by the US Government, this legal framework must be followed in research involving the Real Data Corpus. The Common Rule creates a four-part test that determines whether a proposed activity must be reviewed by an Institutional Review Board (IRB). Specifically, IRB approval is required if:

  1. The activity constitutes scientific “research,” a term that the Common Rule broadly defines as “a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge.”[4]
  2. The research is federally funded.[5]
  3. The research involves human subjects, which the Common Rule defines as “a living individual about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information.”[6]
  4. The research is not “exempt” under the regulations.[7] The Common Rule exempts research involving “existing data, documents, [and] records…” provided that the data set is either “publicly available” or that the subjects “cannot be identified, directly or through identifiers linked to the subjects” (§46.101(b)(4)).
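
The four-part test above is a conjunction: review is required only when all four conditions hold. The sketch below is one way to express that logic for illustration only. The names `ProposedActivity` and `requires_irb_review` are hypothetical, and the determination for any real project is made by an IRB, not by code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedActivity:
    """Hypothetical record of the four Common Rule questions."""
    is_research: bool              # part 1: systematic investigation
    is_federally_funded: bool      # part 2: federal funding
    involves_human_subjects: bool  # part 3: living, identifiable individuals
    is_exempt: bool                # part 4: exempt under the regulations


def requires_irb_review(activity: ProposedActivity) -> bool:
    """Return True only if all four parts of the test are satisfied."""
    return (
        activity.is_research
        and activity.is_federally_funded
        and activity.involves_human_subjects
        and not activity.is_exempt
    )


# Illustrative case: federally funded research on non-public data
# in which individuals can be identified, so no exemption applies.
rdc_study = ProposedActivity(
    is_research=True,
    is_federally_funded=True,
    involves_human_subjects=True,
    is_exempt=False,
)
print(requires_irb_review(rdc_study))  # True
```

The sketch models only the Common Rule test itself. Institutional policy may still require an application to the IRB when a researcher believes an exemption applies.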

Research involving the Real Data Corpus is not exempt under the Common Rule because the RDC is not publicly available and, in many cases, it is possible to identify individuals whose data are in the collection. Furthermore, the majority of the subjects included in the Real Data Corpus have not provided consent to have their data used for research. Several mitigating factors allow the use of this data: the data was lawfully obtained; research involving this data is “minimal risk” (provided that the data is properly protected and personally identifiable information inside the RDC is kept confidential); there is substantial public benefit in using the RDC for research into computer forensics and computer security; and there is no practical alternative to using this data.

Even if research involving the RDC were exempt, most US universities do not allow experimenters to make their own determination of exemption. Instead, these institutions require that the experimenter submit an application for exempt research to the IRB. To date, no IRB has blocked the approval of research that involves the RDC.

To submit an application to an IRB, all experimenters who will make use of the human subject data must take the appropriate human subjects training prescribed by their institution. Most institutions prohibit students from filing applications directly and instead require that an application be filed by a researcher or professor who can be considered a “principal investigator” for external funding.

As a result, any proposed use of the RDC in research requires that an IRB application be filed with the host institution and with the Naval Postgraduate School. Copies of the application and of the approvals from both the host institution and NPS must be provided before access is granted. The application must clearly state:

  • The proposed research that is to be done.
  • Why it is necessary to use the RDC; why simulated or realistic data cannot be used as an alternative.
  • What measures will be used to protect the data in the RDC.
  • What measures will be used to prevent the publication of personally identifiable information in any research products.

Please provide us with your IRB application prior to submitting it to your IRB! We can review the application and let you know whether it is consistent with the IRB approval that we have already obtained, or whether we will need to apply for additional IRB approval. Sample applications are available upon request.

Contact Information

For more information, or if you're interested in access to the Real Data Corpus, please contact:

Brittany Ramsey - Research Associate

blramsey@nps.edu  (831) 656-2014
