File formats you can crawl

Fully supported formats

You can extract metadata and content from:

  • HTML/XML and derived formats (.html, .htm, .xml)

  • XML Schema Definition (.xsd)

  • Plain text (.txt, .csv, .ini, …)

  • OLE2 compound documents in the MS Office 97 family (.doc, .xls, .ppt)

  • OOXML documents in the MS Office 2007+ family (.docx, .xlsx, .pptx)

  • OpenDocument documents (OpenOffice/LibreOffice)

  • iWork documents from Numbers, Pages, and Keynote (.numbers, .pages, .key)

  • PDF (.pdf)

  • Email (.msg, .eml, .pst, .mbox)

  • Ebooks (.ibooks, .epub, …)

  • Rich text format (.rtf)

  • RSS/Atom/IPTC ANPA News Wire feed formats

  • Help files (.chm)

  • Source code in Java, .NET, Python, C, C++, Groovy (and associated files)


Compressed formats

You can decompress and look inside the archive and compression formats supported by Apache Commons Compress. The supported formats are:

  • application/zip (.zip)

  • application/gzip (.gz)

  • application/x-tar (.tar)

  • application/x-7z-compressed (.7z)

  • application/x-rar-compressed (.rar)

  • application/x-bzip (.bz)

  • application/x-bzip2 (.bz2)

We process all of the archive and container types above unless we detect that a file is corrupt or was created with malicious intent. For example, we do not unpack a “zip bomb”: a tiny file that abuses the ZIP specification to decompress into many terabytes of highly repetitive data.
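
To illustrate the kind of check that can catch a zip bomb before unpacking, here is a minimal Python sketch. It is not the platform's actual detection logic. The function name and both limits are assumptions chosen for illustration, and it only inspects the sizes declared in the ZIP directory.

```python
import io
import zipfile

# Hypothetical limits, chosen for illustration only.
MAX_TOTAL_UNCOMPRESSED = 1 * 1024**3  # 1 GiB across all entries
MAX_RATIO = 100                       # uncompressed bytes per compressed byte


def looks_like_zip_bomb(data: bytes) -> bool:
    """Return True if the ZIP's declared sizes exceed the illustrative limits."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        total_uncompressed = sum(info.file_size for info in infos)
        total_compressed = sum(info.compress_size for info in infos)

    if total_uncompressed > MAX_TOTAL_UNCOMPRESSED:
        return True
    # An archive holding only empty entries has nothing to expand.
    if total_compressed == 0:
        return False
    return total_uncompressed / total_compressed > MAX_RATIO
```

Declared sizes can be forged, so a sturdier guard would also count bytes while streaming each entry and stop once a budget is exceeded. Nested archives would need the same check applied at every level.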

Any failure to read the contents of an archive is treated as a content extraction failure and reported.
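
As a rough sketch of that behaviour, the Python snippet below turns any read error into a reported failure instead of letting it propagate. The `ExtractionResult` record and the `list_archive_entries` function are hypothetical names used only for illustration; the platform's real report format is not described here.

```python
import io
import zipfile
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    """Hypothetical result record; the field names are illustrative."""
    ok: bool
    entries: list = field(default_factory=list)
    error: str = ""


def list_archive_entries(data: bytes) -> ExtractionResult:
    """List a ZIP's entries, reporting any read problem as a failure."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() reads every entry and returns the name of the
            # first one with a bad CRC or header, or None if all are fine.
            bad = zf.testzip()
            if bad is not None:
                return ExtractionResult(ok=False, error=f"corrupt entry: {bad}")
            return ExtractionResult(ok=True, entries=zf.namelist())
    except Exception as exc:  # deliberately broad: any failure is reported
        return ExtractionResult(ok=False, error=f"unreadable archive: {exc}")
```

A caller would then check `result.ok` and log `result.error` alongside the file's path when extraction fails.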


Partially supported formats

For other file formats, the platform extracts as much metadata as possible but cannot capture the content itself. Some examples include:

  • CAD formats (.dwg, .step)

  • TrueType fonts (.ttf)

  • Executables (x86/x64 on Windows, Linux, BSD)

  • Java archives (JAR, WAR)

  • Scientific formats (HDF, NetCDF, MATLAB, GDAL, GRIB)

  • Audio (.mp3, .midi, .flac, .ogg)

  • Video (.mp4, .flv, .qt, .3gpp, …)

  • Images (.jpeg, .png, .tiff, .gif, …)

  • PKCS #7 signed messages
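
The three support levels described on this page can be summarised as a simple lookup. The Python sketch below is illustrative only: the function name, the level labels, and the choice of extensions (a subset of those listed above) are assumptions. A real crawler would more likely detect the format from the file's content than from its extension.

```python
from pathlib import PurePath

# Illustrative subsets of the extensions listed on this page; not exhaustive.
FULL = {".html", ".htm", ".xml", ".xsd", ".txt", ".csv", ".doc", ".docx",
        ".pdf", ".eml", ".epub", ".rtf", ".chm"}
ARCHIVE = {".zip", ".tar", ".rar"}
METADATA_ONLY = {".dwg", ".step", ".ttf", ".mp3", ".flac", ".ogg",
                 ".mp4", ".flv", ".png", ".gif"}


def support_level(filename: str) -> str:
    """Map a file name to a hypothetical support-level label."""
    ext = PurePath(filename).suffix.lower()
    if ext in FULL:
        return "full"            # metadata and content
    if ext in ARCHIVE:
        return "archive"         # unpacked, then members are processed
    if ext in METADATA_ONLY:
        return "metadata-only"   # metadata, but no content
    return "unknown"
```

For example, `support_level("drawing.dwg")` yields the metadata-only label, matching the CAD entry in the list above.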