What is Open Discover™ SDK?
It is a .NET toolkit that developers use to identify and classify document file formats and to extract content such as text, metadata, attributes (e.g., ‘WorksheetHasHiddenColumns’), embedded objects, and attachments. The languages present in the extracted text (English, Chinese, etc.) are identified automatically. The SDK also calculates MD5/SHA1 hashes for all documents, plus additional, more sophisticated hashes for email and office document types. These hashes are useful for de-duplication, that is, removing duplicate documents from a document set.
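To illustrate how content hashes support de-duplication, here is a minimal sketch that uses only the .NET base class library. It does not call the SDK: the class HashDeduplicator and its methods are hypothetical names for this example, and the documents are passed in as in-memory byte arrays for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; not part of the Open Discover SDK API.
public static class HashDeduplicator
{
    // Returns the SHA-1 digest of the stream's remaining content as a hex string.
    public static string Sha1Hex(Stream stream)
    {
        using var sha1 = SHA1.Create();
        return Convert.ToHexString(sha1.ComputeHash(stream));
    }

    // Keeps the name of the first document seen for each distinct hash
    // and skips every later document whose content hashes to the same value.
    public static List<string> KeepFirstOfEachHash(
        IEnumerable<(string Name, byte[] Content)> documents)
    {
        var seenHashes = new HashSet<string>();
        var uniqueNames = new List<string>();
        foreach (var (name, content) in documents)
        {
            using var stream = new MemoryStream(content);
            // HashSet.Add returns false when the hash was already present.
            if (seenHashes.Add(Sha1Hex(stream)))
            {
                uniqueNames.Add(name);
            }
        }
        return uniqueNames;
    }
}
```

A plain content hash only catches byte-identical files. Two copies of the same email saved at different times can differ at the byte level, which is presumably why format-aware hashes for email and office documents are useful in addition to MD5/SHA1.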
Before indexing a document for full-text search or running machine learning classification on a set of documents, you first need the document's text and metadata, along with those of its attachments. The Open Discover™ SDK is a useful companion toolset for machine learning, full-text indexing, file storage document identification/classification, ECM systems, eDiscovery, and more.
How many document file formats can Open Discover™ SDK identify?
The SDK can identify 1,300+ document file formats. Except in a few cases (described below), the SDK does not rely on file extensions to identify file formats. Instead, it uses binary or other unique internal signatures of the document.
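The general idea of signature-based identification can be shown with a small sketch. This is not the SDK's implementation and it covers only three well-known signatures; the class SignatureSniffer and its Identify method are hypothetical names chosen for this example.

```csharp
using System;
using System.IO;

// Hypothetical illustration of signature ("magic number") sniffing.
// Not the Open Discover SDK API, which identifies far more formats.
public static class SignatureSniffer
{
    private static readonly (string Format, byte[] Magic)[] Signatures =
    {
        ("PDF", new byte[] { 0x25, 0x50, 0x44, 0x46 }),            // "%PDF"
        ("ZIP container", new byte[] { 0x50, 0x4B, 0x03, 0x04 }),  // "PK" 0x03 0x04
        ("OLE2 compound file",
            new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }),
    };

    // Reads up to 8 header bytes from a seekable stream, restores the
    // stream position, and returns the first matching format name.
    public static string Identify(Stream stream)
    {
        var header = new byte[8];
        long start = stream.Position;
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream reached
            read += n;
        }
        stream.Position = start;

        foreach (var (format, magic) in Signatures)
        {
            if (read >= magic.Length &&
                header.AsSpan(0, magic.Length).SequenceEqual(magic))
            {
                return format;
            }
        }
        return "Unknown";
    }
}
```

A header match like this only establishes the container. A ZIP signature, for example, is shared by plain archives and by ZIP-based office formats, so a real identifier has to inspect the content further before naming the exact format.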
Identifying a document in C#/.NET code is as simple as:
- where method argument ‘_stream’ is an open .NET Stream object (e.g., FileStream or MemoryStream)
- where method argument ‘filename’ is the filename or full path of the file, including its extension (if it has one).
Passing the filename as an argument is optional but strongly recommended. Some file formats, such as encrypted Microsoft Office 2007-2016 documents, share the same file format and internal signatures and cannot be identified with 100% certainty until the internal package hosting the real document is decrypted. In these and a few other special cases, the file extension is used in conjunction with the internal signatures to identify the document.
In the example code snippet above, the returned result ‘docIdResult’ is an IdResult object (see class diagram below) that specifies the identified file format (property name "ID", e.g., Id.OutlookMessage, Id.Excel2007Encrypted), the classification of the document format (property name "Classification", e.g., IdClassification.Email, IdClassification.Spreadsheet, IdClassification.WordProcessing), the MIME type if known, the character set encoding for text-based document formats, a text description of the file format, and more.
How many document file formats can Open Discover™ SDK extract content from?
The SDK can extract content from 600+ document file formats (counted by document file format ID), and the number is growing. For document types that aren't supported, a fast and accurate binary-to-text extractor is provided that extracts any useful text present in the binary in UTF-8, UTF-16, and code page 1252 encodings.
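As a rough illustration of what binary-to-text extraction involves, the sketch below collects runs of printable ASCII bytes from a byte array. It is a deliberately simplified stand-in, not the SDK's extractor: it ignores UTF-16 and the non-ASCII ranges of UTF-8 and code page 1252, and the class PrintableRunExtractor and the minLength parameter are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical, simplified illustration; not the Open Discover SDK extractor.
public static class PrintableRunExtractor
{
    // Returns every run of at least minLength consecutive printable
    // ASCII bytes (0x20 through 0x7E) found in the data, in order.
    public static List<string> Extract(byte[] data, int minLength = 4)
    {
        var runs = new List<string>();
        var current = new StringBuilder();
        foreach (byte b in data)
        {
            if (b >= 0x20 && b <= 0x7E)
            {
                current.Append((char)b);
                continue;
            }
            // A non-printable byte ends the current run.
            if (current.Length >= minLength)
            {
                runs.Add(current.ToString());
            }
            current.Clear();
        }
        // Flush a run that extends to the end of the data.
        if (current.Length >= minLength)
        {
            runs.Add(current.ToString());
        }
        return runs;
    }
}
```

The minimum run length is the main tuning knob in this approach: short thresholds let random byte sequences through as noise, while long thresholds drop genuine short strings.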
To extract content from a document in C#/.NET code, we call the content extractor factory, which uses the identified document format to return an appropriate extraction interface for that particular format:
- where method argument ‘_stream’ is an open Stream object (FileStream or MemoryStream) to the document;
- where argument ‘_docIdResult’ is the document identification result returned in the earlier code snippet;
- where ‘filename’ is the filename or full path of the file, including its extension (if it has one);
- where ‘_contentConfig’ is a ContentExtractionSettings object that has setting options for what is extracted (e.g., only extract metadata, or to extract text, metadata, and attachments/embedded objects) and options for hashing, language identification of extracted text, etc.
In the example code snippet above, the returned ‘docContentResult’ result object is a ContentExtractorResult object from which the user can get the appropriate interface to extract content for the document’s particular file format. Archives, mail stores, and office documents have their own distinct extraction interface types.
The code snippet below shows how to use the ContentExtractorType.Document content extractor. In this example, if the document is encrypted with a password and the SDK supports decrypting that document type, a dialog prompting for a valid password is displayed:
It is that simple. The returned '_docContent' object (a DocumentContent class object) in the above code contains the extracted text, the languages present in the extracted text, document attributes, metadata, embedded objects, and attachments, all retrieved in one method call. The class diagram of the returned DocumentContent class looks like this:
The example C# projects distributed with the SDK show how to use all the content extractor types. They also show how to use the DocumentIdentifier class to identify the files in a directory in parallel, and they include several examples of using the PlatformWorker class to process batches of documents as a task and to process archives and mail stores as their own tasks.
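The parallel pattern mentioned above can be sketched with the standard .NET Task Parallel Library. This is not taken from the SDK's example projects: the class ParallelIdentifier is hypothetical, and the identify delegate is a placeholder for whatever per-file identification call the application uses.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical illustration; not from the Open Discover SDK example projects.
public static class ParallelIdentifier
{
    // Runs the supplied identify delegate on each path in parallel and
    // returns a thread-safe map from path to identification result.
    public static IDictionary<string, string> IdentifyAll(
        IEnumerable<string> paths, Func<string, string> identify)
    {
        var results = new ConcurrentDictionary<string, string>();
        Parallel.ForEach(paths, path =>
        {
            // ConcurrentDictionary allows safe writes from multiple threads.
            results[path] = identify(path);
        });
        return results;
    }
}
```

For a directory tree, the paths could come from Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories). The delegate must itself be safe to call from several threads at once, for example by opening a separate stream per file.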
See the 'API Reference' for detailed descriptions of all SDK API classes.
Can Open Discover™ SDK decrypt password protected documents?
The SDK can decrypt common formats such as Microsoft Office 97-2003, Microsoft Office 2007-2016, OpenDocument formats (OpenOffice and LibreOffice), ZIP, 7Z, RAR, and PDF by cycling through a user-supplied list of known passwords.
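Cycling through a list of known passwords amounts to trying each candidate until one is accepted. The sketch below shows that loop in isolation; it is not the SDK's API. The class PasswordCycler is hypothetical, and the tryDecrypt delegate is a placeholder for the real decryption attempt.

```csharp
using System;
using System.Collections.Generic;

#nullable enable

// Hypothetical illustration; not part of the Open Discover SDK API.
public static class PasswordCycler
{
    // Tries each candidate in order and returns the first password that
    // tryDecrypt accepts, or null when no candidate works.
    public static string? FindPassword(
        IEnumerable<string> candidates, Func<string, bool> tryDecrypt)
    {
        foreach (string candidate in candidates)
        {
            if (tryDecrypt(candidate))
            {
                return candidate; // stop at the first success
            }
        }
        return null;
    }
}
```

A null result corresponds to the case where none of the supplied passwords is valid, so the document stays encrypted and its content cannot be extracted.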
The SDK identifies many encrypted document formats. Knowing if a document is encrypted is useful for many reasons, such as:
- To obtain document passwords from key people who are leaving a company
- To verify, for IT security, that employees are encrypting their documents and following security guidelines
- To identify why content extraction failed on a particular document, e.g., when no valid password was supplied for an encrypted document or archive.
Who is the ideal user of Open Discover™ SDK?
The Open Discover™ SDK is an ideal technology for processing unstructured content in business applications such as:
- Corporate information governance
- Full-text search using the SDK with open-source Lucene.NET
- Text analytics/document concept clustering
- Enterprise search and content management
- Big Data, machine learning, AI, etc.
- Website crawling/full-text search