This guide contains instructions on how to set up and then monitor the NetDocuments OCR service. It also details some anticipated frequently asked questions about the service.
If you have any feedback, questions, or issues, submit a request via our support site.
Table of Contents
- Creating Your NetDocuments OCR Service
- Monitoring and Modifying the Service
- NetDocuments OCR FAQ
- What NetDocuments OCR Does
- NetDocuments OCR Two Modes
- Exportable Reports & Regular Summary Email Notifications
Many documents in your NetDocuments repository will be image documents. These include image formats like TIFF and JPEG, but also PDF documents whose content is like a photograph of the page, e.g. scanned documents.
There is no text information in the document that the user can search for, just millions of dots on a page of various colors and shades that represent an image of the document.
There is no simple way a person can determine if a PDF document is text-searchable. It can only be done by opening documents and trial and error searching or selecting for text.
That means that if a user searches for documents containing a particular word or phrase, the document will not be found. A user wishing to review a 1000-page document can open it in a PDF application and select Find to search for text, but no text will be found until the user waits 10-15 minutes for the document to be OCRd.
Neither of these scenarios allows the NetDocuments user to efficiently find and use their documents, which is one of the many reasons you have implemented NetDocuments in your organization.
What NetDocuments OCR Does
NetDocuments OCR will:
- Using OCR technology, create and apply a text layer to non-text-searchable PDF documents
- Achieve character recognition accuracy. See https://www.abbyy.com/en-au/ocr-sdk/key-features/ocr/ for more information
- Convert image documents (BMP, JPEG, PNG, and TIFF) to text-searchable PDF documents that retain all their original image content
- Analyze MS Outlook emails (MSG) containing attachments that are non-text-searchable PDF or image documents, and convert the attachments to searchable PDF-format documents. Emails that are themselves attached to an email, along with their attachments, are analyzed and processed in the same way
- Optionally apply image compression to the OCRd document to reduce its size by up to 50% to enable faster viewing and downloading of documents
- Analyze PDF documents to determine if they contain text or if the quantity of text characters found is less than a specific number of characters per page – only those specific pages in a PDF requiring OCRing will be processed
- Process your entire existing NetDocuments repository as well as instantly check and process any new documents you save
- Ensure that any annotations in PDF documents such as comments, handwriting, notes, and stamps remain as annotations for future editing
NetDocuments OCR Two Modes
NetDocuments OCR has two modes for processing documents. When you sign up to the NetDocuments OCR product via your NetDocuments account manager, it will be configured with:
- Only the Active Monitoring service, or
- Both the Backlog Processing service and the Active Monitoring service
Backlog Processing Service
The Backlog Processing service is designed to crawl through your existing NetDocuments repository, searching for documents saved over many years. It checks all documents that could potentially be processed and separates those that meet the processing requirements from those that do not require processing.
The documents that require OCRing (and optionally, compressing) are processed, and once each document is successfully processed, it is then saved back into NetDocuments.
At this point, the NetDocuments indexing engine will automatically index that document so text in the document can be searched using the full-text search features of NetDocuments.
The document will either be stored as a New Version or, as a future enhancement, replace the existing document version.
When is Backlog Required?
If you have been a NetDocuments customer for:
- 6 months or less, you must purchase the Backlog Processing service with NetDocuments OCR
- More than 6 months, you have the choice to purchase or not purchase the Backlog Processing Service
The only caveat to this rule is if you purchased the NetDocuments Ingestion Service from NetDocuments and your documents have already been ingested/added to your NetDocuments repository PRIOR to switching on NetDocuments OCR. In this case, you have the choice to purchase or not purchase the Backlog Processing service.
However, note that if you do not purchase the Backlog Processing service, then none of your existing documents stored in NetDocuments at the moment you switch on the NetDocuments OCR process will be OCRd or compressed.
Active Monitoring Service
The Active Monitoring service is designed to watch for any newly saved documents in NetDocuments. This service will test every minute for new documents that have been saved into NetDocuments, assess them for processing, OCR and, optionally, compress them as required.
All customers who sign up for the NetDocuments OCR process will have at a minimum this Active Monitoring service. The same assessment, processing and saving steps are performed as in the Backlog Processing service.
Exportable Reports & Regular Summary Email Notifications
From version 2.10 onwards, the last step in configuring the NetDocuments OCR service is to set up email notifications. NetDocuments OCR can notify you about the latest weekly or daily processing reports that are available for download from the NetDocuments OCR Portal. Only summary information is sent in the email notifications themselves. See Set email notification preferences for how you can opt to receive regular notifications about any of the following report types:
- A weekly Completed Documents Report
This includes all documents that have been completed, and those that have exceptioned. This report can be quite large as it reflects all documents completed in the last 7 days.
- A daily Deferred (Save) Documents Report
This lists documents that could not be saved as they were checked out at the time of save and shows the number of save attempts made for each document.
- A weekly Exceptions Report
This is a more targeted report only showing documents that have exceptioned.
These reports are collated on a weekly (7 days) basis for Completed and Exceptions, and daily for Deferred Save documents.
Note: Customers whose service was pre-configured with a version of NetDocuments OCR prior to version 2.10, and who now want to receive these notifications, must enable which email notifications they want to get. See Enable Notifications for Processing Reports for how to do this. The downloadable reports are created automatically so no action is required by you to receive them except enablement of the above email notification types.
- Notification for a Completed Documents report
- Notification for an Exception Documents report
- Notification for a Deferred Documents report
Each notification contains a Download Report button to access the NetDocuments OCR Portal, and summarily provides the following information:
- Schedule either Weekly or Daily
- Reporting Period, eg July 08, 2019 to July 14, 2019 UTC
- Number of Documents referenced in the report for that period, eg 3,400
Below are examples of email notifications for each type of report.
See Download a Weekly or Daily Processing Report for how to access a report.
Notification for a Completed Documents report
Notification for an Exception Documents report
Notification for a Deferred Documents report
Creating Your NetDocuments OCR Service
Create the NetDocuments OCR User Account
Before initially logging onto the NetDocuments OCR module, it is essential that you create the special NetDocuments OCR user.
This User is used to mark the modifications made to each processed document and to allow Administrators and users to identify those documents modified by the NetDocuments OCR process.
- Create the special NetDocuments OCR user in the Create User Account dialog box as follows:
- First Name: NetDocuments (not case sensitive)
- Last Name: OCR (not case sensitive)
- Email Address: Your choice, but must be unique (leave or amend the pre-populated address, as previously entered)
- Username: Your choice, but must be unique
- Select OK.
- Configure this user to be a Repository Administrator and a Cabinet administrator in each cabinet requiring OCR services.
Log into NetDocuments OCR
- Log into NetDocuments as the above special NetDocuments OCR user.
- Go to NetDocuments Administration > NetDocuments OCR.
In NetDocuments, go to Repository Administration > OCR Dashboard.
Note: If the link does not appear in the above Repository menu, you are not a licensed client of the NetDocuments OCR module. In that case, contact the NetDocuments sales team.
The Get Started page appears.
- Select Sign In to launch the Welcome to NetDocuments OCR page.
- Select your NetDocuments repository region, then select Next.
The Setup New Service wizard appears.
Note: If you have already configured your NetDocuments OCR service, the NetDocuments OCR Dashboard opens directly after a successful login. The dashboard is detailed in Monitoring and Modifying the Service.
Configure Your OCR Service
- Define Dates for Backlog Service
- Select Cabinets
- Choose File Formats to OCR
- Define OCR Character Setting
- Set Languages
- Opt for Compression
- Set the Saving Preference
- Set Email Notification Preferences
Define Dates for Backlog Service
Backlog is the first tab that displays in the Wizard if you are licensed to use the Backlog service in NetDocuments OCR. If you have only licensed the Active Monitoring service (and not the Backlog service), you are taken directly to the Cabinets tab in Step 3 below.
- Specify the Last Modified Date you want to use for processing your document backlog by selecting:
- Before - Automatically defaults to Today’s Date (and does not allow you to enter another date).
This option sets up the Backlog Processing service so that it searches for all documents with a modified date prior to and including the ‘Before’ date. The service determines whether they need to be OCRd, to ensure that all historical documents become text-searchable. The Backlog Processing service always works in reverse chronological order, processing the most recently modified or saved documents first, then working backward through older documents a day at a time.
NetDocuments OCR searches up to 26 hours ahead of today’s date to ensure it will assess any document saved with tomorrow’s date in a timezone ahead of the location in which this date is set.
Tip: Use the Before option if you want to initially test the Backlog Processing service before using it on your entire NetDocuments repository. Then after the process has started running, you can pause its progress in the Administrator Dashboard, review the first few documents, then resume its progress.
- Between - Means you must use the Date Pickers to select a relevant range of dates.
This option sets up the Backlog Processing service so that instead of searching your entire NetDocuments repository it only searches for and assesses documents that were modified and saved within a specific date range. For example, if you want to process only the previous 5 years of documents, then NetDocuments OCR will search for documents with a last modified date starting from and including the first date and no older than the second date. So, if today’s date is 30 September 2018, you would select dates Between: 30 September 2018 and 1 October 2013.
Tip: Use the Between option if you prefer to ignore very old documents.
- Select Next.
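The date rules above can be sketched in a few lines of Python. This is only an illustration of the behavior described in this guide, not the service's actual implementation. The function names are hypothetical, and the guide does not say whether the 26-hour look-ahead also applies to the Between option, so it is applied here to the Before option only.

```python
from datetime import date, datetime, timedelta
from typing import Iterator

# From the guide: the search reaches up to 26 hours ahead of today's date.
LOOKAHEAD = timedelta(hours=26)


def before_upper_bound(now: datetime) -> datetime:
    """Latest last-modified timestamp considered by the Before option (assumed)."""
    return now + LOOKAHEAD


def days_newest_first(newest: date, oldest: date) -> Iterator[date]:
    """Yield each day in the range in reverse chronological order."""
    day = newest
    while day >= oldest:
        yield day
        day -= timedelta(days=1)


def in_between_range(modified: datetime, newest: date, oldest: date) -> bool:
    """True if the last-modified date falls inside the inclusive Between range."""
    return oldest <= modified.date() <= newest


if __name__ == "__main__":
    # The guide's example: Between 30 September 2018 and 1 October 2013.
    newest, oldest = date(2018, 9, 30), date(2013, 10, 1)
    print(in_between_range(datetime(2016, 3, 14, 9, 30), newest, oldest))
    print(before_upper_bound(datetime(2018, 9, 30, 12, 0)))
```

Working newest-first means that the documents users are most likely to need become searchable earliest.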
Select Cabinets
- In the Cabinets tab, select the cabinet(s) you want to be processed.
Most firms only have one cabinet but if you have more, select more than one at your discretion. There is no limit.
Some firms have a cabinet just used for testing systems and functions. You would most likely not want to select the Sandbox or testing cabinet.
- Select Next.
Choose File Formats to OCR
- In the File Formats tab, select which file formats to process.
NetDocuments OCR does not search and process all file formats, as many, such as Word and Excel documents, are generally text-searchable already. Other document types, such as PDF, image formats such as TIFF, and MSG email files with PDF or image attachments, often do not have searchable text and therefore should be processed.
Any file formats that you do not check will be ignored for processing. For Outlook emails, NetDocuments OCR drills down into MSG files that also contain MSG files as attachments and seeks out the relevant file formats you have selected.
TIFF, BMP, PNG, and JPEG files are converted to PDF format before OCRing and optionally, compressing. Outlook email attachments that are in the list of supported file formats shown on the screen are converted to PDF if not in that format already, and the original attachment replaced with the PDF document after OCRing and compressing. Any file formats not marked for processing attached to an email (or unsupported email attachments such as TXT, DOCX, etc), will remain attached to the email with their contents unchanged.
- Select Next.
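As a rough illustration of how nested emails are handled, the following Python sketch walks an email, and any emails attached to it, and lists the attachments that would be picked up for OCR. The dictionary layout, the function name, and the file names are hypothetical and are used only for this example. The sketch also assumes that attached emails are always inspected once the outer email has been selected.

```python
from typing import List, Set

# Image formats the guide says are converted to PDF before OCRing.
IMAGE_FORMATS = {"TIFF", "BMP", "PNG", "JPEG"}


def attachments_to_process(email: dict, selected: Set[str]) -> List[str]:
    """Return names of attachments that match the selected formats.

    Emails attached to the email are searched recursively, as the guide
    describes. Unselected or unsupported formats are left untouched.
    """
    found: List[str] = []
    for attachment in email.get("attachments", []):
        fmt = attachment["format"].upper()
        if fmt == "MSG":
            # Assumption: nested emails are always inspected.
            found.extend(attachments_to_process(attachment, selected))
        elif fmt in selected and (fmt == "PDF" or fmt in IMAGE_FORMATS):
            found.append(attachment["name"])
    return found


if __name__ == "__main__":
    outer = {
        "name": "outer.msg",
        "format": "MSG",
        "attachments": [
            {"name": "scan.tiff", "format": "TIFF"},
            {"name": "notes.docx", "format": "DOCX"},
            {
                "name": "inner.msg",
                "format": "MSG",
                "attachments": [{"name": "contract.pdf", "format": "PDF"}],
            },
        ],
    }
    print(attachments_to_process(outer, {"PDF", "TIFF", "MSG"}))
```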
Define OCR Character Setting
- In the OCR tab, set the number of characters in a page that define if a page has already been OCRd or not.
Note: The character count includes visible rendered text and hidden text layers from OCR processes.
If NetDocuments OCR can detect the page contains headers, footers and/or watermarks, these are excluded from the character count. If the page still contains visible rendered text and/or hidden text less than or equal to the amount specified, NetDocuments OCR will still assess the page to determine if it should be OCRd.
NetDocuments OCR does not re-process a document that has already been OCRd if that OCR process placed more than the specified number of text characters on a page. You can set any number between 0 and a maximum of 200 characters for this setting. However, the 120-character setting is recommended to cater for pages with small amounts of visible rendered text like page numbers.
Note: Any text not in a specific PDF header and/or footer is counted as visible rendered text. Text at the bottom of a page in a PDF is not automatically a footer. It is only considered a footer if it is specified according to the PDF specification.
If a page of a document contains a combination of visible and invisible text equalling or greater than the number of characters you specify (eg. 120), then the page will not be processed.
- Select Next.
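The character threshold can be pictured with a short Python sketch. It is an illustration only: the function names are hypothetical, and the counts are assumed to already exclude any headers, footers, and watermarks that were detected. A page whose count exactly equals the threshold is treated here as not needing OCR, following the statement that a page with a count equalling or greater than the setting is not processed.

```python
from typing import List, Tuple

MAX_THRESHOLD = 200          # maximum value allowed by the setting
RECOMMENDED_THRESHOLD = 120  # value recommended in this guide


def page_needs_ocr(visible_chars: int, hidden_chars: int,
                   threshold: int = RECOMMENDED_THRESHOLD) -> bool:
    """True if the page has fewer characters than the configured threshold."""
    if not 0 <= threshold <= MAX_THRESHOLD:
        raise ValueError("threshold must be between 0 and 200")
    return (visible_chars + hidden_chars) < threshold


def pages_to_ocr(page_counts: List[Tuple[int, int]],
                 threshold: int = RECOMMENDED_THRESHOLD) -> List[int]:
    """Return 1-based page numbers that would be sent for OCR."""
    return [
        number
        for number, (visible, hidden) in enumerate(page_counts, start=1)
        if page_needs_ocr(visible, hidden, threshold)
    ]


if __name__ == "__main__":
    # (visible characters, hidden characters) per page of a three-page PDF
    print(pages_to_ocr([(0, 0), (900, 0), (3, 50)]))
```

Only the pages below the threshold are returned, which mirrors the page-by-page processing described earlier in this guide.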
Set Languages
- In the Languages tab, select one or more languages you want to be considered when OCRing your documents.
NetDocuments OCR supports detection and OCRing of multiple languages. As a NetDocuments repository can contain documents in more than one language, you can specify up to 16 languages you’d expect to find in your documents. You don’t need to specify all 16 if your documents are all in one or a few languages.
Important: Select those languages that you think are common in your set of documents. The purpose of selecting the languages is to allow the OCR process to do additional spell checking to increase the accuracy of interpreting the text. You must select at least one language, and up to 16. Selecting more languages will have little effect on the speed of processing but will improve its accuracy.
Note: Do not select languages you know are irrelevant or highly unlikely to be in your document set – as doing so places an unnecessary burden on processing and may reduce the effectiveness of the spell-checking.
Handwriting in any language is not OCRd but remains in its original graphical format.
- Select Next.
Opt for Compression
- In the Compression tab, choose whether you want to compress documents.
If you opt to have your OCRd documents compressed, a reduction in file sizes by up to 50% can be achieved, which results in faster viewing and downloads of documents for your users.
The exact compression achieved depends on the existing resolution of images and whether any previous compression was applied.
The compression method used is Mixed Raster Content (MRC). In simplest terms, a typical scanned document may be large because of its background, which may account for 90% of its file size. MRC compression dramatically reduces the size of the background image without reducing the quality of the image representing text on the page below 300 dpi.
Compression will not occur on pages that contain visible rendered text. Therefore, pages that contain an image and rendered text will not be compressed, only OCRd.
- Select Next.
Set the Saving Preference
- In the Save tab, select the method that should be followed to save your completed document.
In this version of the service, only the Save as New version option is available. This creates a new version of the document in NetDocuments with the version number incremented and displayed in the Versions tab as created by the NetDocuments OCR user.
(The option to Replace the Original Source Document will be available in future versions of NetDocuments OCR.)
- Select the Retain Locked Status check box if you store documents in NetDocuments and sometimes use the Locked status indicator on documents and want to retain this setting on the new version of the documents saved by NetDocuments OCR.
- Select Next.
Set Email Notification Preferences
- In the Notification tab, select any or all of the following options:
- Notify me when Completed Documents weekly report is available
- Notify me when Deferred (Save) Documents daily report is available
- Notify me when Exceptions weekly report is available
- Under Email Address, enter the address to which the above report notifications about the performance of the NetDocuments OCR process should be sent.
We recommend using a group email account for your organization, or another address that does not belong to a ‘natural person’, rather than the address of a specific person in your organization. This address does not have to be the one associated with the special ‘NetDocuments OCR’ user.
Note: Existing customers whose service is pre-configured must enable these email notifications, as this functionality was not available in NetDocuments OCR before version 2.10. See Enable Notifications for Processing Reports for details.
- Select Next.
Monitoring and Modifying the Service
- NetDocuments OCR Dashboard
- Download a Weekly or Daily Processing Report
- Enable Notifications for Processing Reports
- Generate a NetDocuments Activity Log Report
The NetDocuments OCR service does not require constant monitoring by a user. It is an automatic background task. However, from time to time an Administrator may want to check on progress and/or understand and react to certain situations that arise. NetDocuments OCR provides several ways Administrators can get access to the information they need via its NetDocuments OCR Dashboard.
NetDocuments OCR Dashboard
The NetDocuments OCR Dashboard is a web page accessible from any browser compatible with NetDocuments authentication. It displays the progress of the Active Monitoring and Backlog Processing services.
Access the Dashboard
- In NetDocuments, go to Repository Administration, and select the NetDocuments OCR Dashboard link.
Note: If you have not yet configured your NetDocuments OCR service, this link takes you to the New Service Wizard instead of the dashboard. In that case, see Creating Your NetDocuments OCR Service.
The Get Started page appears.
- Select Sign In to launch the Welcome to NetDocuments OCR page.
- Select your NetDocuments repository region, and then select Next.
The NetDocuments OCR Dashboard opens.
The NetDocuments OCR Dashboard is split into two sections:
- Active Monitoring Report – Shows the progress of the Active Monitoring process, showing volumes of documents searched, assessed and processed. This part of the dashboard always displays as every client is licensed for Active Monitoring.
- Backlog Report - Shows the progress of processing your entire backlog of documents. This console only displays if you have licensed the NetDocuments OCR Backlog process from NetDocuments.
1 - Indicates that the service is running and allows you to pause/resume it.
2 - Provides the total number of supported documents found for analysis and processing.
3 - Shows the percentage of total documents found that have completed processing.
The following information in this dashboard relates to both the Backlog and Active Monitoring processes.
|Documents found||The total number of documents found of configured file formats, eg PDF, image files and MSG files. This number does not include unsupported file types, such as Word and Excel, that by definition would not require OCRing.|
The number of documents at any stage of processing (but not yet completed), including those:
The number of documents unable to be saved after being OCRd due to being checked out by a user just before the Save process.
NetDocuments OCR makes up to 10 attempts to resubmit any document that could not be saved after it is OCRd. Documents unable to be saved within the 10 attempts become Exceptions and are reported to the Administrator. (The original document in NetDocuments remains completely untouched.)
|Updated||The number of documents that have been successfully saved to NetDocuments after being OCRd (and optionally compressed).|
|Update not required||
The number of documents where any of the following conditions occur:
|Not Supported||If a document has a password, is an XFA document, or is digitally signed, then it will be marked as not supported.|
|Exceptions||The number of documents accessible for inspection and OCR processing but which fail due to being corrupted, their file content does not match the file extension, an unknown error or access being denied. These exception types are not re-attempted.|
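The deferred-save behavior described in the table can be sketched as a simple retry loop. This Python example is an assumption-laden illustration rather than the service's real logic: the function name and the `try_save` callback are hypothetical, and the delay between attempts, which the guide does not specify, is left out.

```python
from typing import Callable, Tuple

MAX_SAVE_ATTEMPTS = 10  # from the guide: up to 10 attempts to save


def save_with_retries(try_save: Callable[[], bool],
                      max_attempts: int = MAX_SAVE_ATTEMPTS) -> Tuple[str, int]:
    """Attempt a save until it succeeds or the attempts run out.

    `try_save` returns False while the document is checked out.
    Returns the resulting dashboard state and the number of attempts made.
    """
    for attempt in range(1, max_attempts + 1):
        if try_save():
            return "Updated", attempt
    # All attempts failed: the document is reported as an exception and
    # the original in the repository is left untouched.
    return "Exceptions", max_attempts


if __name__ == "__main__":
    # Simulate a document that is checked in again before the third attempt.
    outcomes = iter([False, False, True])
    print(save_with_retries(lambda: next(outcomes)))
```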
Access the Administrator Guide via Dashboard
You can access this NetDocuments OCR Administrator Guide in two ways:
- Via the ? Help icon at the top right of the NetDocuments OCR’s Get Started page (accessed when you log in to initially configure the service.)
- Via the ? Help icon at the top right of the NetDocuments OCR Dashboard web page (whenever you need to consult the Dashboard).
Download a Weekly or Daily Processing Report
- Log into NetDocuments OCR, then go to Reports > Exported Reports.
Note: You must use your NetDocuments OCR user credentials, not your personal credentials.
Available reports are listed in the window, dating back to the 2.10 deployment date.
Historical reports will be available over the coming months.
- Click the Download button against the report you wish to download.
- Once downloaded, open the CSV report in Excel and then:
▪ Filter or sort by ‘state’ such as ‘Completed’ or ‘Exceptions’
▪ Drill down on ‘reason codes’ for unsupported documents, or documents that have exceptioned.
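If you prefer a script to Excel, the same filtering can be approximated with Python's standard csv module. The column names (`docId`, `state`, `reason code`) and the sample rows below are assumptions made for illustration. Check the header row of your downloaded report and adjust the names to match.

```python
import csv
import io
from collections import Counter
from typing import Dict, List

# Hypothetical sample data; a real report may use different columns and values.
SAMPLE = """docId,state,reason code
4811-0001,Completed,
4811-0002,Exceptions,Corrupted
4811-0003,Not Supported,Password protected
4811-0004,Exceptions,Corrupted
"""


def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse report text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def filter_rows(rows: List[Dict[str, str]], column: str, value: str) -> List[Dict[str, str]]:
    """Keep only rows whose column equals the given value."""
    return [row for row in rows if row[column] == value]


def count_by(rows: List[Dict[str, str]], column: str) -> Counter:
    """Count rows per distinct value of a column."""
    return Counter(row[column] for row in rows)


if __name__ == "__main__":
    rows = load_rows(SAMPLE)
    print(count_by(rows, "state"))
    exceptions = filter_rows(rows, "state", "Exceptions")
    print(count_by(exceptions, "reason code"))
```

To run this against a downloaded report, read the file's contents into a string and pass it to `load_rows` in place of `SAMPLE`.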
Enable Notifications for Processing Reports
As this feature was disabled by default in earlier versions, customers whose service was pre-configured with a version of NetDocuments OCR prior to version 2.10 can choose to enable email notifications for each report type. The steps below explain how.
- Log into NetDocuments OCR, then go to Service Configuration.
- In the Notification Configuration section, as required, enable any or all of the following options:
▪ Notify me when Completed Documents weekly report is available
▪ Notify me when Deferred (Save) Documents daily report is available
▪ Notify me when Exceptions weekly report is available
Generate a NetDocuments Activity Log Report
The NetDocuments OCR Dashboard provides information on the number of documents processed and the weekly Digest Emails provide details of the documents that have been processed, including OCRs and successful saves.
If you want a report for a specific day or date range with Doc IDs of the affected documents, then do as follows:
- In NetDocuments, go to Admin > Request Activity Logs > Export and produce a date-based report to XML.
- Optionally, load the report into Excel – it is recommended you only select a limited date range to report on if you have a very large repository.
- Apply a filter for the Save as new version activity (see the Name column below) and filter by the NetDocuments OCR special user account.
The filtered list shows you the documents processed and saved by NetDocuments OCR in the defined timeframe. The docId column provides you with the unique NetDocuments reference.