File formats you can crawl
Fully supported formats
You can extract metadata and content from:
HTML/XML and derived formats (.html, .htm, .xml)
XML Schema Definition (.xsd)
Plain text (.txt, .csv, .ini, …)
OLE2 compound documents in the MS Office 97 family (.doc, .xls, .ppt)
OOXML documents in the MS Office 2007+ family (.docx, .xlsx, .pptx)
OpenDocument documents (OpenOffice/LibreOffice)
iWork documents from Numbers, Pages, and Keynote (.numbers, .pages, .key)
PDF (.pdf)
Email (.msg, .eml, .pst, .mbox)
Ebooks (.ibooks, .epub, …)
Rich text format (.rtf)
RSS/Atom/IPTC ANPA News Wire feed formats
Help files (.chm)
Source code in Java, .NET, Python, C, C++, Groovy (and associated files)
Compressed formats
You can decompress and look inside compressed formats supported by Apache Commons Compress. The supported archive formats are:
application/zip (.zip)
application/gzip (.gz)
application/x-tar (.tar)
application/x-7z-compressed (.7z)
application/x-rar-compressed (.rar)
application/x-bzip (.bz)
application/x-bzip2 (.bz2)
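As an illustration of how the archive types listed above can be told apart by content rather than by file extension, here is a minimal Python sketch that matches well-known leading signatures ("magic bytes"). It is not the platform's detection logic, which is not described here; the function name `sniff_archive_type` is hypothetical, and the older bzip format is left out.

```python
from typing import Optional

# Well-known leading signatures for some of the archive formats listed above.
_SIGNATURES = [
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
    (b"BZh", "application/x-bzip2"),
    (b"7z\xbc\xaf\x27\x1c", "application/x-7z-compressed"),
    (b"Rar!\x1a\x07", "application/x-rar-compressed"),
]


def sniff_archive_type(header: bytes) -> Optional[str]:
    """Return a MIME type for the first bytes of a file, or None if unrecognised.

    Hypothetical helper: pass at least the first 512 bytes so that tar can be
    detected as well.
    """
    for magic, mime in _SIGNATURES:
        if header.startswith(magic):
            return mime
    # POSIX tar has no leading signature; the "ustar" marker sits at offset 257.
    if header[257:262] == b"ustar":
        return "application/x-tar"
    return None
```

A caller would read the first 512 bytes of a file and pass them in; a result of `None` only means that none of these signatures matched.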
We process all of the above archive and container types unless we detect that a file is corrupt or was created with malicious intent. For example, we do not unpack a “zip bomb”: a tiny file that abuses the ZIP specification to decompress to many terabytes of highly repetitive data.
Any failure to read the contents of an archive is treated as a content extraction failure and is reported.
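The kind of check described above can be sketched in Python with the standard `zipfile` module. This is only an illustration under stated assumptions, not the platform's implementation: the thresholds `MAX_TOTAL_BYTES` and `MAX_RATIO`, the function name `inspect_zip`, and the three result strings are all hypothetical.

```python
import io
import zipfile

# Illustrative limits only; the platform's real thresholds are not documented here.
MAX_TOTAL_BYTES = 1024 ** 3  # refuse archives that declare more than 1 GiB of output
MAX_RATIO = 100              # refuse entries that expand more than 100x


def inspect_zip(data: bytes) -> str:
    """Classify a ZIP archive as "ok", "rejected", or "extraction_failure".

    Only the sizes declared in the central directory are read; nothing is
    decompressed.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            total = 0
            for info in archive.infolist():
                total += info.file_size
                if total > MAX_TOTAL_BYTES:
                    return "rejected"
                if info.compress_size > 0 and info.file_size / info.compress_size > MAX_RATIO:
                    return "rejected"
    except zipfile.BadZipFile:
        # An unreadable archive is reported rather than silently skipped.
        return "extraction_failure"
    return "ok"
```

Declared sizes can be forged, so a real extractor would probably also cap the number of bytes written while streaming each entry and limit how deeply nested archives are followed.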
Partially supported formats
For other file formats, the platform extracts as much metadata as possible but cannot capture the content itself. Some examples include:
CAD formats (.dwg, .step)
TrueType fonts (.ttf)
Executables (x86/x64 on Windows, Linux, BSD)
Java JAR/WAR archives
Scientific formats such as HDF, NetCDF, MATLAB, GDAL, GRIB
Audio (.mp3, .midi, .flac, .ogg)
Video (.mp4, .flv, .qt, .3gpp, …)
Images (.jpeg, .png, .tiff, .gif, …)
PKCS #7 signed messages
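To make the difference between fully and partially supported formats concrete, the sketch below maps a file name to a support tier using only the extensions named on this page. It is a simplified illustration: the platform may well identify formats by content rather than by extension, the names `FULL`, `METADATA_ONLY`, and `support_tier` are hypothetical, archive formats are left out, and formats listed without an extension (or behind a "…") are not covered.

```python
from pathlib import PurePath

# Extensions taken from the "Fully supported formats" list on this page.
FULL = {
    ".html", ".htm", ".xml", ".xsd", ".txt", ".csv", ".ini",
    ".doc", ".xls", ".ppt", ".docx", ".xlsx", ".pptx",
    ".numbers", ".pages", ".key", ".pdf",
    ".msg", ".eml", ".pst", ".mbox", ".ibooks", ".epub", ".rtf", ".chm",
}

# Extensions taken from the "Partially supported formats" list on this page.
METADATA_ONLY = {
    ".dwg", ".step", ".ttf",
    ".mp3", ".midi", ".flac", ".ogg",
    ".mp4", ".flv", ".qt", ".3gpp",
    ".jpeg", ".png", ".tiff", ".gif",
}


def support_tier(filename: str) -> str:
    """Return "full", "metadata-only", or "unknown" for a file name."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in FULL:
        return "full"
    if suffix in METADATA_ONLY:
        return "metadata-only"
    return "unknown"
```

For example, `support_tier("Report.PDF")` gives `"full"`, while `support_tier("drawing.dwg")` gives `"metadata-only"`.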