Open Discover® SDK for .NET - A document file format identification and content extraction API

Identifying file formats quickly and reliably from internal binary signatures (rather than from unreliable file extensions)
Extracting text from supported file formats and optionally identifying languages present in the extracted text (DOC, XLS, PPT, DOCX, XLSX, PPTX, ONENOTE, MSG, EML, EMLX, DXL, and many more)
Extracting metadata from supported file formats (over 1,325 known metadata fields in total)
Extracting embedded items/attachments from supported document formats
Extracting archive container items (7ZIP, ZIP, RAR, TAR, etc.)
Extracting mail store container email objects (PST, OST, OST2013, Outlook for Mac OLM, MBOX, etc.)
Automatically detecting and extracting sensitive personally identifiable information (PII) such as social security numbers, credit card numbers, bank account/routing numbers, IBAN accounts, investment accounts, maiden names, phone numbers, addresses, IP addresses, cryptocurrency addresses, email addresses, and more
Detecting and extracting entities related to medical, health care, and insurance records (and more)
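
To illustrate the first item above, here is a minimal C# sketch of signature-based identification. It does not use the Open Discover SDK and is not its implementation; the class name `SignatureSniffer`, its methods, and the format labels are hypothetical, and the table holds only a handful of widely documented leading byte signatures.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not part of the Open Discover SDK.
public static class SignatureSniffer
{
    // A few well-known leading byte signatures ("magic numbers").
    private static readonly (byte[] Magic, string Format)[] Signatures =
    {
        (new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }, "OLE2 compound file (DOC, XLS, PPT, MSG)"),
        (new byte[] { 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C }, "7ZIP archive"),
        (new byte[] { 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07 }, "RAR archive"),
        (new byte[] { 0x50, 0x4B, 0x03, 0x04 }, "ZIP container (ZIP, DOCX, XLSX, PPTX)"),
        (new byte[] { 0x21, 0x42, 0x44, 0x4E }, "Outlook PST/OST mail store"),
        (new byte[] { 0x25, 0x50, 0x44, 0x46 }, "PDF"),
    };

    // Returns the label of the first signature that the header starts with,
    // or "unknown" when nothing in the table matches.
    public static string Identify(byte[] header)
    {
        foreach (var (magic, format) in Signatures)
        {
            if (StartsWith(header, magic))
            {
                return format;
            }
        }
        return "unknown";
    }

    // Reads up to the first 8 bytes of a file and identifies them.
    // The file name and extension are never consulted.
    public static string IdentifyFile(string path)
    {
        var buffer = new byte[8];
        using (var stream = File.OpenRead(path))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            Array.Resize(ref buffer, read); // keep only the bytes actually read
        }
        return Identify(buffer);
    }

    private static bool StartsWith(byte[] data, byte[] magic)
    {
        if (data.Length < magic.Length)
        {
            return false;
        }
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, a PDF renamed to `report.txt` is still reported as "PDF", because only the leading bytes are examined. A leading signature alone cannot tell DOCX, XLSX, and PPTX apart from a plain ZIP archive, since all of them are ZIP containers; distinguishing them requires inspecting the container contents.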

Full text search using Lucene.NET
Machine learning using extracted text and metadata
Text analytics and document concept clustering
Information governance
Website crawling/full-text website search
Enterprise search and content management
IT departments: identifying, metadata-scanning, and de-duplicating documents on file servers
eDiscovery applications
And more...
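
As a rough illustration of the file server de-duplication use case above, the following C# sketch groups files by a SHA-256 hash of their contents. It relies only on the .NET base class library, not on the Open Discover SDK; the class name `DuplicateFinder` and its methods are hypothetical, and exact byte-for-byte hashing is just one possible notion of "duplicate".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; not part of the Open Discover SDK.
public static class DuplicateFinder
{
    // Computes a hex-style SHA-256 digest of a file, streaming its contents
    // so that large files are not loaded into memory at once.
    public static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream));
        }
    }

    // Returns groups of paths under 'root' whose contents are byte-for-byte
    // identical. Files with unique content are omitted from the result.
    public static List<List<string>> FindDuplicates(string root)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .GroupBy(HashFile)
            .Where(group => group.Count() > 1)
            .Select(group => group.ToList())
            .ToList();
    }
}
```

Each returned group lists paths that share identical content, so one copy can be kept and the rest reviewed for removal. Files that differ by even a single byte (for example, the same document saved twice with different embedded timestamps) land in different groups under this approach.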