COMING SOON … Going live in January 2019; until then, this website is under development. If you want to test drive Open Discover SDK before then, contact us at: sales@dotfurther.com

Open Discover .NET SDK

dotfurther's Open Discover SDK gives .NET developers the software tools they need for very high-performance document processing in full-text search, data analytics, eDiscovery, and Big Data applications. This SDK allows you to:

  • Identify 1,300+ document file formats using internal binary signatures
  • Extract text, metadata, attributes, embedded objects, and attachments from hundreds of file formats, including archives (.7z, .zip, .rar, .tar, etc.), mail stores (Outlook PST/OST, MBOX, etc.), and encrypted archives, PDFs, Microsoft Office documents, and Open Office documents.
  • Identify the various languages present in extracted text.
  • De-duplicate copies of the same email (whether in .msg or MIME format) or office document using sophisticated hashing. The SDK automatically calculates binary MD5/SHA-1 hashes for all document types. In addition, email (EDRM standard hashes) and Microsoft Office documents get extra content-based hashes for better de-duplication. All of these hashes are calculated automatically by the SDK's content extractors.
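
As a rough illustration of two of the ideas above (signature-based format identification and binary-hash de-duplication), here is a minimal Python sketch. It is not the Open Discover SDK API, which is .NET; the function names `identify_format`, `binary_hashes`, and `deduplicate` and the four-entry signature table are assumptions made for this example only. It covers only whole-file binary hashes, not the content-based email and Office hashes described above.

```python
import hashlib
from typing import Dict, List, Tuple

# Tiny illustrative table of leading "magic" bytes. A real identifier covers
# far more formats and looks deeper than the first few bytes (for example,
# modern Office files are ZIP packages and would match "zip" here).
SIGNATURES: List[Tuple[bytes, str]] = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"Rar!\x1a\x07", "rar"),
]


def identify_format(data: bytes) -> str:
    """Return a format label based on the file's leading bytes."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def binary_hashes(data: bytes) -> Tuple[str, str]:
    """Return the (MD5, SHA-1) hex digests of the raw bytes."""
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()


def deduplicate(documents: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group document names by the SHA-1 digest of their bytes."""
    groups: Dict[str, List[str]] = {}
    for name, data in documents.items():
        groups.setdefault(hashlib.sha1(data).hexdigest(), []).append(name)
    return groups
```

Any group returned by `deduplicate` that holds more than one name is a set of byte-identical copies. Binary hashes alone miss copies of the same email stored in different formats, which is why content-based hashes are useful for email and Office documents.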

Furthermore, Open Discover SDK comes with PlatformWorker, a .NET class that makes it easy to:

  • Process hundreds to thousands of documents in sets (tasks), in a highly parallel way, to achieve very high performance when processing large document collections. By "process" we mean identify the document's file format; extract its text, metadata, embedded items, and attachments; hash the document; de-NIST it; and identify the languages present in the extracted text.
  • Recursively process documents completely, i.e., process the input document and its embedded items/attachments (and their embedded items, if any, recursively).
  • Process large archives (.zip, .7z, .rar, etc.) and mail stores (.pst/.ost/.mbox, etc.) recursively as a single task, or break very large archives and mail stores into partitions (subsets) that can be processed across multiple servers, virtual machines, or desktop PCs by multiple PlatformWorker instances for very efficient distributed processing.
  • Choose among multiple processing modes (depths):
    • TextAndMetadata: Full processing of all input documents and their embedded objects/attachments (recursively).
    • MetadataContainerItemsFirstLevel: Only metadata is extracted, both for all supported input files and for the child items of containers. If a container contains a child container, only the child container's metadata is extracted; the child container's items are ignored. This processing mode is well suited to eDiscovery Early Case Assessment (ECA), where the goal is to de-duplicate and reduce data collections.
    • MetadataNoContainerItems: Only metadata is extracted for all input items. For input items that are containers, only the container's own metadata is extracted and its child items are ignored; however, a count of the contained child items is stored in the container's output metadata.
  • Create your own platform as a service (PaaS) for full-text search, analytics, etc., using the PlatformWorker as the distributed processing engine.
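
To make the three processing depths and the idea of partitioning more concrete, here is a small Python model of the behavior described above. It is a sketch under stated assumptions, not PlatformWorker code: the `Item` class, the `visit` and `partition` functions, and the `Mode` enum are hypothetical, and only the mode names themselves come from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Mode(Enum):
    TEXT_AND_METADATA = "TextAndMetadata"
    METADATA_FIRST_LEVEL = "MetadataContainerItemsFirstLevel"
    METADATA_NO_ITEMS = "MetadataNoContainerItems"


@dataclass
class Item:
    """Hypothetical stand-in for a document and its child items."""
    name: str
    children: List["Item"] = field(default_factory=list)


def visit(item: Item, mode: Mode, depth: int = 0) -> List[str]:
    """Return the names of the items that would be touched for `item`."""
    visited = [item.name]
    if mode is Mode.METADATA_NO_ITEMS:
        return visited  # child items are ignored (only counted)
    if mode is Mode.METADATA_FIRST_LEVEL and depth >= 1:
        return visited  # a child container's own items are ignored
    for child in item.children:
        visited.extend(visit(child, mode, depth + 1))
    return visited


def partition(items: list, size: int) -> List[list]:
    """Split a container's item list into subsets of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

For an archive that holds a document and a nested archive, `visit` reaches every item in `TEXT_AND_METADATA` mode, stops at the nested archive in `METADATA_FIRST_LEVEL` mode, and touches only the top-level archive in `METADATA_NO_ITEMS` mode. Each subset returned by `partition` could then be handed to a separate worker.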

Download Free Trial

Test drive Open Discover .NET SDK + PlatformWorker.