This guide contains instructions on how to set up and then monitor the NetDocuments OCR service. It also details some anticipated frequently asked questions about the service.
If you have any feedback, questions, or issues, submit a request via our support site.
Many documents in your NetDocuments repository will be image documents. These include image formats such as TIFF and JPEG, but also PDF documents whose content is effectively a photograph of the page, e.g. scanned documents.
There is no text information in the document that the user can search for, just millions of dots on a page of various colors and shades that represent an image of the document.
There is no simple way for a person to determine whether a PDF document is text-searchable. The only way is to open the document and try searching for or selecting text.
This means that if a user searches for documents containing a particular word or phrase, the document will not be found. A user wishing to review a 1000-page document can open it in their PDF application and select Find to search for text, but no text will be found until the user waits 10-15 minutes for the document to be OCRd.
Neither of these scenarios allows the NetDocuments user to efficiently find and use their documents, which is one of the many reasons you have implemented NetDocuments in your organization.
What NetDocuments OCR Does
NetDocuments OCR will:
- Using OCR technology, create and apply a text layer to non-text-searchable PDF documents
- Achieve character recognition accuracy. See https://www.abbyy.com/en-au/ocr-sdk/key-features/ocr/ for more information
- Convert image documents (BMP, JPEG, PNG, and TIFF) to text-searchable PDF documents that retain all their original image content
- Analyze MS Outlook emails (MSG) containing attachments that are non-text-searchable PDF or image documents and convert the attachments to searchable PDF-format documents. Emails that are themselves attached to an email, and their attachments, are analyzed and processed in the same way
- Optionally apply image compression to the OCRd document to reduce its size by up to 50% to enable faster viewing and downloading of documents
- Analyze PDF documents to determine if they contain text or if the quantity of text characters found is less than a specific number of characters per page – only those specific pages in a PDF requiring OCRing will be processed
- Process your entire existing NetDocuments repository as well as instantly check and process any new documents you save
- Ensure that any annotations in PDF documents such as comments, handwriting, notes, and stamps remain as annotations for future editing
NetDocuments OCR Two Modes
NetDocuments OCR has two modes for processing documents. When you sign up to the NetDocuments OCR product via your NetDocuments account manager, it will be configured with:
- Only the Active Monitoring service, or
- Both the Backlog Processing service and the Active Monitoring service
Backlog Processing Service
The Backlog Processing service is designed to crawl through your existing NetDocuments repository, searching for documents saved over many years. It checks all documents that could potentially be processed and separates those that meet the processing requirements from those that do not require processing.
The documents that require OCRing (and optionally, compressing) are processed, and once each document is successfully processed, it is then saved back into NetDocuments.
At this point, the NetDocuments indexing engine will automatically index that document so text in the document can be searched using the full-text search features of NetDocuments.
The document will either be stored as a New Version or, as a future enhancement, replace the existing document version.
When is Backlog Required?
If you have been a NetDocuments customer for:
- 6 months or less, you must purchase the Backlog Processing service with NetDocuments OCR
- More than 6 months, you have the choice to purchase or not purchase the Backlog Processing Service
The only caveat to this rule is if you purchased the NetDocuments Ingestion Service from NetDocuments and your documents have already been ingested/added to your NetDocuments repository PRIOR to switching on NetDocuments OCR. In this case, you have the choice to purchase or not purchase the Backlog Processing service.
However, note that if you do not purchase the Backlog Processing service, then none of your existing documents stored in NetDocuments at the moment you switch on the NetDocuments OCR process will be OCRd or compressed.
Active Monitoring Service
The Active Monitoring service is designed to watch for any newly saved documents in NetDocuments. This service will test every minute for new documents that have been saved into NetDocuments, assess them for processing, OCR and, optionally, compress them as required.
All customers who sign up for the NetDocuments OCR process will have at a minimum this Active Monitoring service. The same assessment, processing and saving steps are performed as in the Backlog Processing service.
Creating Your NetDocuments OCR Service
Create the NetDocuments OCR User Account
Before initially logging onto the NetDocuments OCR module, it is essential that you create the special NetDocuments OCR user.
This User is used to mark the modifications made to each processed document and to allow Administrators and users to identify those documents modified by the NetDocuments OCR process.
- Create the special NetDocuments OCR user in the Create User Account dialog box as follows:
- First Name: NetDocuments (not case sensitive)
- Last Name: OCR (not case sensitive)
- Email Address: Your choice, but must be unique (leave or amend the pre-populated address, as previously entered)
- Username: Your choice, but must be unique
- Select OK.
- Configure this user to be a Repository Administrator and a Cabinet administrator in each cabinet requiring OCR services.
Log into NetDocuments OCR
- Log into NetDocuments as the above special NetDocuments OCR user.
- Go to NetDocuments Administration > NetDocuments OCR.
In NetDocuments, go to Repository Administration > NetDocuments OCR Dashboard.
Note: If the link does not appear in the above Repository menu, you are not a licensed client of the NetDocuments OCR module. In that case, contact the NetDocuments sales team.
The Get Started page appears.
- Select Sign In to launch the Welcome to NetDocuments OCR.
- Select your NetDocuments repository region, then select Next.
The Setup New Service wizard appears.
Note: If you have already configured your NetDocuments OCR service, the NetDocuments OCR Dashboard opens directly after a successful login. The dashboard is detailed in Monitoring the Service.
Configure Your OCR Service
- Define Dates for Backlog Service
- Select Cabinets
- Choose File Formats to OCR
- Define OCR Character Setting
- Set Languages
- Opt for Compression
- Set the Saving Preference
- Set the Address for Notifications
Define Dates for Backlog Service
Backlog is the first tab that displays in the Wizard if you are licensed to use the Backlog service in NetDocuments OCR. If you have only licensed the Active Monitoring service (and not the Backlog service), you are taken directly to the Cabinets tab in Step 3 below.
- Specify the Last Modified Date you want to use for processing your document backlog by selecting:
- Before - Automatically defaults to Today’s Date (and does not allow you to enter another date).
This option sets up the Backlog Processing service so that it searches for all documents with a modified date prior to and including the ‘Before’ date. The service determines whether they need to be OCRd so that all historical documents become text-searchable. The Backlog Processing service always works in reverse chronological order, processing the most recently modified/saved documents first, then working backward through older documents, a day at a time.
NetDocuments OCR searches up to 26 hours ahead of today’s date to ensure it will assess any document saved with tomorrow’s date in a timezone ahead of the location in which this date is set.
Tip: Use the Before option if you want to initially test the Backlog Processing service before using it on your entire NetDocuments repository. Then after the process has started running, you can pause its progress in the Administrator Dashboard, review the first few documents, then resume its progress.
- Between - Means you must use the Date Pickers to select a relevant range of dates.
This option sets up the Backlog Processing service so that instead of searching your entire NetDocuments repository it only searches for and assesses documents that were modified and saved within a specific date range. For example, if you want to process only the previous 5 years of documents, then NetDocuments OCR will search for documents with a last modified date starting from and including the first date and no older than the second date. So, if today’s date is 30 September 2018, you would select dates Between: 30 September 2018 and 1 October 2013.
Tip: Use the Between option if you prefer to ignore very old documents.
- Select Next.
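The date handling described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this guide, not the service's actual implementation; the function names are hypothetical, and it assumes both ends of the Between range are inclusive.

```python
from datetime import date, datetime, timedelta

# Look-ahead described in this guide for the Before option.
LOOKAHEAD = timedelta(hours=26)


def before_cutoff(now: datetime) -> datetime:
    """Upper bound for the Before option: the current time plus 26 hours."""
    return now + LOOKAHEAD


def in_between_range(modified: date, newest: date, oldest: date) -> bool:
    """True if a last-modified date falls inside the (assumed inclusive) Between range."""
    return oldest <= modified <= newest


def backlog_order(modified_dates: list[date]) -> list[date]:
    """The backlog is worked in reverse chronological order: newest first."""
    return sorted(modified_dates, reverse=True)
```

With the example dates from the Between option, a document last modified on 1 June 2015 is inside the range of 30 September 2018 to 1 October 2013, while one last modified on 30 September 2013 is not.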
Select Cabinets
- In the Cabinets tab, select the cabinet(s) you want to be processed.
Most firms only have one cabinet but if you have more, select more than one at your discretion. There is no limit.
Some firms have a cabinet just used for testing systems and functions. You would most likely not want to select the Sandbox or testing cabinet.
- Select Next.
Choose File Formats to OCR
- In the File Formats tab, select which file formats to process.
NetDocuments OCR does not search and process all file formats, as many, such as Word and Excel documents, are generally text-searchable already. Other document types, such as PDF, image formats such as TIFF, and MSG email files that have PDF or image attachments, often do not have searchable text and therefore should be processed.
Any file formats that you do not check will be ignored for processing. For Outlook emails, NetDocuments OCR drills down into MSG files that also contain MSG files as attachments and seeks out the relevant file formats you have selected.
TIFF, BMP, PNG, and JPEG files are converted to PDF format before OCRing and optionally, compressing. Outlook email attachments that are in the list of supported file formats shown on the screen are converted to PDF if not in that format already, and the original attachment replaced with the PDF document after OCRing and compressing. Any file formats not marked for processing attached to an email (or unsupported email attachments such as TXT, DOCX, etc), will remain attached to the email with their contents unchanged.
- Select Next.
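The routing described in this step can be summarized with a small Python sketch. It is a simplified illustration under stated assumptions: the function name `plan_for`, the returned labels, and the exact extension spellings (for example `.tif` and `.jpg`) are hypothetical and are not taken from the product.

```python
from pathlib import Path

# Image formats that the guide says are converted to PDF before OCRing.
# The extension spellings are assumptions made for this example.
IMAGE_FORMATS = {".tif", ".tiff", ".bmp", ".png", ".jpg", ".jpeg"}


def plan_for(filename: str, selected: set[str]) -> str:
    """Return a rough processing plan for one file, given the checked formats."""
    ext = Path(filename).suffix.lower()
    if ext not in selected:
        return "ignore"  # unchecked or unsupported formats are left alone
    if ext in IMAGE_FORMATS:
        return "convert to PDF, then OCR"
    if ext == ".pdf":
        return "OCR pages that need it"
    if ext == ".msg":
        return "inspect attachments"
    return "ignore"
```

For example, with only `.pdf`, `.tiff` and `.msg` selected, `plan_for("scan.TIFF", ...)` returns the conversion plan, while a `.docx` or an unselected `.png` is ignored.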
Define OCR Character Setting
- In the OCR tab, set the number of characters in a page that define if a page has already been OCRd or not.
Note: The character count includes visible rendered text and hidden text layers from OCR processes.
If NetDocuments OCR can detect the page contains headers, footers and/or watermarks, these are excluded from the character count. If the page still contains visible rendered text and/or hidden text less than or equal to the amount specified, NetDocuments OCR will still assess the page to determine if it should be OCRd.
NetDocuments OCR does not re-process a document that has already been OCRd if that OCR process placed more than the specified number of text characters on a page. You can set any number between 0 and a maximum of 200 characters for this setting. However, the 120-character setting is recommended to cater for pages with small amounts of visible rendered text like page numbers.
Note: Any text not in a specific PDF header and/or footer is counted as visible rendered text. Text at the bottom of a page in a PDF is not automatically a footer. It is only considered a footer if it is specified according to the PDF specification.
If a page of a document contains a combination of visible and invisible text equal to or greater than the number of characters you specify (e.g. 120), then the page will not be processed.
- Select Next.
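The per-page decision in this step can be sketched as follows. This is a minimal illustration, not the product's logic: the function names are hypothetical, the character counts are assumed to already exclude headers, footers and watermarks, and a page whose count exactly equals the setting is assumed to be assessed, because the guide is not fully explicit about that boundary case.

```python
def page_needs_ocr(visible_chars: int, hidden_chars: int, threshold: int = 120) -> bool:
    """Decide whether one page should be assessed for OCR.

    Counts are assumed to exclude headers, footers and watermarks.
    Treating a count equal to the threshold as "needs OCR" is an assumption.
    """
    if not 0 <= threshold <= 200:
        raise ValueError("threshold must be between 0 and 200")
    return visible_chars + hidden_chars <= threshold


def pages_to_ocr(page_counts: list[tuple[int, int]], threshold: int = 120) -> list[int]:
    """Return the 1-based numbers of the pages that would be processed."""
    return [
        number
        for number, (visible, hidden) in enumerate(page_counts, start=1)
        if page_needs_ocr(visible, hidden, threshold)
    ]
```

A three-page PDF with counts of `(0, 0)`, `(500, 0)` and `(4, 0)` would have only pages 1 and 3 processed, which matches the page-by-page behavior described earlier in this guide.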
Set Languages
- In the Languages tab, select one or more languages you want to be considered when OCRing your documents.
NetDocuments OCR supports detection and OCRing of multiple languages. As a NetDocuments repository can contain documents in more than one language, you can specify up to 16 languages you’d expect to find in your documents. You don’t need to specify all 16 if your documents are all in one or a few languages.
Important: Select those languages that you think are common in your set of documents. The purpose of selecting the languages is to allow the OCR process to do additional spell checking to increase the accuracy of interpreting the text. You must select at least one language, and up to 16. Selecting more languages will have little effect on the speed of processing but will improve its accuracy.
Note: Do not select languages you know are irrelevant or highly unlikely to be in your document set – as doing so places an unnecessary burden on processing and may reduce the effectiveness of the spell-checking.
Handwriting in any language is not OCRd but remains in its original graphical format.
- Select Next.
Opt for Compression
- In the Compression tab, choose whether you want to compress documents.
If you opt to have your OCRd documents compressed, a reduction in file sizes by up to 50% can be achieved, which results in faster viewing and downloads of documents for your users.
The exact compression achieved depends on the existing resolution of images and whether any previous compression was applied.
The compression method used is Mixed Raster Content (MRC). In simplest terms, a typical scanned document may be large because of its background, which may account for 90% of its file size. MRC compression dramatically reduces the size of the background image without reducing the quality of the image representing text on the page below 300 dpi.
Compression will not occur on pages that contain visible rendered text. Therefore, pages that contain an image and rendered text will not be compressed, only OCRd.
- Select Next.
Set the Saving Preference
- In the Save tab, select the method that should be followed to save your completed document.
In this version of the service, only the Save as New version option is available. This creates a new version of the document in NetDocuments with the version number incremented and displayed in the Versions tab as created by the NetDocuments OCR user.
(The option to Replace the Original Source Document will be available in future versions of NetDocuments OCR.)
- Select the Retain Locked Status check box if you store documents in NetDocuments and sometimes use the Locked status indicator on documents and want to retain this setting on the new version of the documents saved by NetDocuments OCR.
- Select Next.
Set the Address for Notifications
- In the Notification tab, define the email address that should receive notifications about the performance of the NetDocuments OCR process.
Ideally, this should be an administration group email account for your organization rather than the personal address of one specific person.
The email address does NOT have to be the one associated with the special NetDocuments OCR user.
This allows NetDocuments OCR, in the near future, to issue proactive reports (digest emails) to this email address on a weekly basis. See Digest Email Notifications for more details.
- Select Next.
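As a recap of the wizard, the settings and the limits stated in this guide can be collected in one place. The sketch below is purely illustrative: the class and field names are hypothetical, the wizard does not expose such an object, and the "at least one cabinet" and "address contains @" checks are assumptions added for the example.

```python
from dataclasses import dataclass


@dataclass
class OcrServiceConfig:
    """Hypothetical container mirroring the wizard tabs."""

    cabinets: list[str]
    file_formats: list[str]
    languages: list[str]
    notify_email: str
    char_threshold: int = 120      # recommended setting in this guide
    compress: bool = False
    retain_locked: bool = False

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the settings look valid."""
        found = []
        if not self.cabinets:
            found.append("select at least one cabinet")  # assumption
        if not 0 <= self.char_threshold <= 200:
            found.append("character setting must be between 0 and 200")
        if not 1 <= len(self.languages) <= 16:
            found.append("select between 1 and 16 languages")
        if "@" not in self.notify_email:
            found.append("notification address does not look like an email")  # assumption
        return found
```

Calling `problems()` on a configuration with one cabinet, one language, the default character setting and a group mailbox returns an empty list.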
Monitoring the Service
- NetDocuments OCR Dashboard
- Access the Administrator Guide via Dashboard
- Digest Email Notifications
- Generate a NetDocuments Activity Log Report
The NetDocuments OCR service does not require constant monitoring by a user. It is an automatic background task. However, from time to time an Administrator may want to check on progress and/or understand and react to certain situations that arise. NetDocuments OCR provides several ways Administrators can get access to the information they need via its NetDocuments OCR Dashboard.
NetDocuments OCR Dashboard
The NetDocuments OCR Dashboard is a web page accessible from any browser compatible with NetDocuments authentication. It displays the progress of the Active Monitoring and Backlog Processing services.
Access the Dashboard
- In NetDocuments, go to Repository Administration, and select the NetDocuments OCR Dashboard link.
Note: If you have not yet configured your NetDocuments OCR service, this link takes you to the New Service Wizard, and not directly to the dashboard. In that case, see Creating Your NetDocuments OCR Service.
The Get Started page appears.
- Select Sign In to launch the Welcome to NetDocuments OCR.
- Select your NetDocuments repository region, and then select Next.
The NetDocuments OCR Dashboard opens.
The NetDocuments OCR Dashboard is split into two sections:
- Active Monitoring Report – Shows the progress of the Active Monitoring process, showing volumes of documents searched, assessed and processed. This part of the dashboard always displays as every client is licensed for Active Monitoring.
- Backlog Report - Shows the progress of processing your entire backlog of documents. This console only displays if you have licensed the NetDocuments OCR Backlog process from NetDocuments.
1 - Indicates that the service is running and allows you to pause/resume it.
2 - Provides the total number of supported documents found for analysis and processing.
3 - Shows the percentage of total documents found that have completed processing.
The following information in this dashboard relates to both the Backlog and Active Monitoring processes.
|Documents found||The total number of documents found of configured file formats, eg PDF, image files and MSG files. This number does not include unsupported file types, such as Word and Excel, that by definition would not require OCRing.|
The number of documents at any stage of processing (but not yet completed), including those:
The number of documents unable to be saved after being OCRd due to being checked out by a user just before the Save process.
NetDocuments OCR makes up to 10 attempts to resubmit any document that could not be saved after it is OCRd. Documents that cannot be saved within the 10 attempts become Exceptions and are reported to the Administrator. (The original document in NetDocuments remains completely untouched.)
|Updated||The number of documents that have been successfully saved to NetDocuments after being OCRd (and optionally compressed).|
|Update not required||
The number of documents where any of the following conditions occur:
|Not Supported||If a document has a password, is an XFA document, or is digitally signed, then it will be marked as not supported.|
|Exceptions||The number of documents that were accessible for inspection and OCR processing but failed because the file is corrupted, the file content does not match the file extension, access was denied, or an unknown error occurred. These exception types are not re-attempted.|
Access the Administrator Guide via Dashboard
You can access this NetDocuments OCR 2.5 Administrator Guide in two ways:
- Via the ? Help icon at the top right of the NetDocuments OCR’s Get Started page (accessed when you log in to initially configure the service.)
- Via the ? Help icon at the top right of the NetDocuments OCR Dashboard web page (whenever you need to consult the Dashboard).
Digest Email Notifications
In the near future, NetDocuments OCR will provide proactive reports (called digest emails) on a weekly basis to the notification email address (see Set the Address for Notifications).
The emails provide CSV-formatted files covering:
- Updated Documents: Those OCRd and saved back into NetDocuments (DocID and version provided)
- Not Supported Documents: Those that required OCRing but were not supported for processing, e.g. because they were password-protected PDF documents (with DocID, version and reason code)
- Exceptions: Documents that failed processing
These reports will provide specific details of the document ID and version number, date/time of processing, and results or errors found.
Generate a NetDocuments Activity Log Report
The NetDocuments OCR Dashboard provides information on the number of documents processed and the weekly Digest Emails provide details of the documents that have been processed, including OCRs and successful saves.
If you want a report for a specific day or date range with Doc IDs of the affected documents, then do as follows:
- In NetDocuments, go to Admin > Request Activity Logs > Export and produce a date-based report to XML.
- Optionally, load the report into Excel – it is recommended you only select a limited date range to report on if you have a very large repository.
- Apply a filter for the Save as new version activity (see the Name column below) and filter by the NetDocuments OCR special user account.
The filtered list shows you the documents processed and saved by NetDocuments OCR in the defined timeframe. The docId column provides you with the unique NetDocuments reference.
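If you prefer to filter the exported report with a script instead of Excel, the same filter can be sketched in Python. This guide does not describe the layout of the XML export, so the element names (`entry`, `docId`, `name`, `user`) and the sample data below are placeholders invented for this example; adjust them to match the file you actually export.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for an exported activity log.
# The real export layout may differ; these element names are hypothetical.
SAMPLE = """<log>
  <entry><docId>1111-2222-3333</docId><name>Save as new version</name><user>NetDocuments OCR</user></entry>
  <entry><docId>4444-5555-6666</docId><name>View</name><user>Jane Doe</user></entry>
  <entry><docId>7777-8888-9999</docId><name>Save as new version</name><user>Jane Doe</user></entry>
</log>"""


def ocr_saved_doc_ids(xml_text: str,
                      user: str = "NetDocuments OCR",
                      activity: str = "Save as new version") -> list[str]:
    """Return the docId of every entry matching both the activity and the user."""
    root = ET.fromstring(xml_text)
    ids = []
    for entry in root.iter("entry"):
        if entry.findtext("name") == activity and entry.findtext("user") == user:
            ids.append(entry.findtext("docId"))
    return ids
```

Run against the sample, only the first entry matches both the activity and the special user account.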
Frequently Asked Questions
- Why doesn't NetDocuments OCR process all my documents almost instantly?
- Why does NetDocuments OCR take the same amount of time to process different volumes of documents?
- How long will NetDocuments OCR take to process my document backlog?
- How quickly will a new document be OCRd?
- How do I track what documents have been updated?
- What could cause documents not to be OCRd?
- How do I track documents that failed to OCR?
- What happens if a document is currently checked out and being used?
- What happens if I edit a document while it is being OCRd?
- I have annotations in my PDF – are they retained?
- Do I lose my graphics and pictures?
- Why do I need to configure NetDocuments OCR with a different user account?
1. NetDocuments OCR uses the cloud’s enormous power to OCR and compress documents – why doesn't it process all my documents almost instantly?
OCRing and compressing documents is a complex and processor-intensive function, taking around 1 to 5 seconds per page to OCR (depending on image quality, language, and other factors). Each document is OCRd on a single processor core in the Azure cloud infrastructure, so a bigger document will take longer.
NetDocuments OCR is designed to focus on the largest document throughput (the highest possible number of documents to be processed in a given time period) rather than processing a single document in the fastest possible time (but causing overall document efficiency to be reduced).
The NetDocuments OCR Backlog process will search for, assess, OCR/compress and, if necessary, save every document you have stored in NetDocuments. If NetDocuments tries to do this in just a few hours or days, it could mean impacting the performance of your system while you are opening and saving your current documents. To avoid this impact on NetDocuments system performance, the processing of the backlog of older documents is typically spread over a 6-month period (depending on the number of documents you have stored) but may occur much more quickly than that depending on your document volume.
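The per-page figure quoted in this answer gives a rough feel for how long a single document might take. The sketch below simply multiplies the page count by the 1 to 5 seconds per page mentioned above; the function name is hypothetical, and it ignores queueing, assessment, compression and saving time.

```python
def ocr_seconds_range(pages: int, low: float = 1.0, high: float = 5.0) -> tuple[float, float]:
    """Rough lower and upper bound, in seconds, for OCRing one document on one core."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return pages * low, pages * high
```

For a 1000-page document this gives 1000 to 5000 seconds, or roughly 17 to 83 minutes of OCR time.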
2. I am a new 10-user NetDocuments customer with one year’s volume of saved documents – why will it take NetDocuments OCR the same amount of time to process my document backlog as it will a large 1,000 user firm with millions of documents?
NetDocuments OCR carefully averages its processing power on a per-user basis. So, a 1000-user firm processes documents at a rate 100 times that of a 10-user firm based on NetDocuments estimates of the average number of documents stored by all users of NetDocuments globally. This ensures that per user all firms have equal access to the system. If you have less than the global average number of documents per user stored in NetDocuments, your backlog process might complete more quickly than others.
3. How long will NetDocuments OCR take to process my document backlog?
NetDocuments OCR takes approximately 6 months to OCR and compress your entire backlog of documents. This is regardless of how large or small your organization is. NetDocuments OCR allocates processing power based on the number of users you are licensed for – the more users you have, the more processing power is allocated to you. (A 100-user firm is allocated 10 times the processing power of a 10-user firm).
This processing timeframe is designed so the process of interrogating, assessing, OCRing and re-saving documents is completed to avoid impacting the performance of your day-to-day work. NetDocuments OCR also prioritizes any new documents you save with its Active Monitoring process so that new documents always have priority over backlog documents.
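The proportional allocation described in answers 2 and 3 comes down to a simple ratio of licensed users. The helper below is a hypothetical illustration of that arithmetic only; it says nothing about how the service actually schedules work.

```python
def relative_power(users: int, baseline_users: int) -> float:
    """Processing power of one firm relative to another, proportional to licensed users."""
    if users <= 0 or baseline_users <= 0:
        raise ValueError("user counts must be positive")
    return users / baseline_users
```

This reproduces the figures quoted above: a 1000-user firm gets 100 times the allocation of a 10-user firm, and a 100-user firm gets 10 times.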
4. How quickly will a new document be OCRd once I have saved it into NetDocuments?
NetDocuments OCR checks every minute for any new documents that have been saved and will immediately assess them for processing if required. (Many document types such as Word and Excel documents need no processing.) Documents requiring OCRing will be prioritized for processing ahead of Backlog Process document queues. This is referred to as Active Monitoring. NetDocuments OCR aims to process documents within a few minutes to 1 hour, but that depends on several factors including the number of pages in the document, the number of documents that you have saved into NetDocuments in a very short amount of time, and how many other users in your firm are also saving documents at the same time. So, if you manually upload hundreds of very large documents at one time, processing may take a little longer. However, prepare to be surprised – the power of the NetDocuments OCR cloud means processing can happen quickly even for large document volumes.
5. How do I track what documents have been updated by NetDocuments OCR?
NetDocuments OCR makes minimal changes to a document, including keeping the modified date and the modified user, when saving the OCRd document as a New Version. However, there are a few simple things a user can do to determine whether the document has been processed. First, open the PDF and search for a word that appears in it; the word should now be found. Second, check the document properties in the PDF application; the PDF producer should be shown as contentCrawler cloud.
The NetDocuments OCR administrator portal provides information on the number of documents processed. A report that provides the Doc IDs of those affected documents is available in NetDocuments. In NetDocuments, go to Admin > Request Activity Logs. Export a date-based report to XML. This can be loaded into Excel, where you can filter for the Save as New Version activity and by the specific user account used for NetDocuments OCR. This will show you the documents processed and saved by NetDocuments OCR in this timeframe.
6. What could cause documents not to be OCRd?
Whilst the vast majority of documents process correctly, you will most likely find some documents that were not processed. No harm is done to these documents; the original document remains in NetDocuments without alteration. The reasons for a document failing to process may include:
- Document content does not match the specified document type (extension is .pdf but it is not a PDF)
- The document is unreadable or corrupted
Also, a PDF document is classified as unsupported for processing if it:
- Is password-protected
- Has a digital certificate (as modifying this document would invalidate the certificate)
7. How do I track documents that failed to OCR?
On a weekly basis, NetDocuments OCR sends a digest report to the administrator email address you specify in the NetDocuments OCR wizard. This report provides the Doc IDs and Version numbers of those documents that failed to OCR, and a reason for each failure. It also reports on documents that were assessed for processing but were determined as not requiring to be OCRd.
8. What happens if a document is currently checked out and being used?
NetDocuments OCR automatically assesses a document when it is saved into NetDocuments to determine if it is required to be OCRd. In some cases, immediately after a document is saved into NetDocuments, a user checks out the document for further editing. If NetDocuments OCR detects a document is checked out, it will not be taken for processing at that time.
9. What happens if I edit a document while it is being OCRd?
It can occasionally occur that a document is being OCRd while a user has the document checked out for editing. This should not cause any issue to users, as NetDocuments OCR does not prevent any document from being edited at any time. After NetDocuments OCR has processed a document and is ready to save that document back into NetDocuments, it first checks that the document has not already been updated by a user since a copy of that document was obtained for OCRing.
If that document has been modified, that specific OCR task will be abandoned, and the official version of the document will be re-queued to be OCRd and compressed again.
If the document in NetDocuments has not been changed but is still checked out, NetDocuments OCR will retry saving the document leaving increasingly long periods of time between each attempt and finally giving up on that document after the tenth attempt. If after 10 attempts the document is still checked out, the attempts will end, and the document will be flagged as unable to be saved due to the document being checked out.
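The save behavior described in this answer can be sketched as a small decision routine. This is an illustration under stated assumptions, not the service's code: the names are hypothetical, the doubling of the wait between attempts is an assumption (the guide only says the waits get increasingly long), and no real waiting is performed.

```python
from enum import Enum
from typing import Callable

MAX_ATTEMPTS = 10  # number of save attempts stated in this guide


class Outcome(Enum):
    SAVED = "saved as new version"
    REQUEUED = "abandoned and re-queued"
    CHECKED_OUT = "unable to save (checked out)"


def retry_delays(base_seconds: float = 60.0, attempts: int = MAX_ATTEMPTS) -> list[float]:
    """Waits between attempts; doubling each time is an assumption for this sketch."""
    return [base_seconds * 2 ** i for i in range(attempts - 1)]


def try_save(modified_since_copy: bool,
             checked_out_at_attempt: Callable[[int], bool]) -> Outcome:
    """Walk through the decision: re-queue if edited, otherwise retry while checked out."""
    if modified_since_copy:
        return Outcome.REQUEUED
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if not checked_out_at_attempt(attempt):
            return Outcome.SAVED
    return Outcome.CHECKED_OUT
```

Here `checked_out_at_attempt` stands in for a check against NetDocuments: a document that is checked in by the fourth attempt is saved, while one that stays checked out for all ten attempts is flagged.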
10. I have annotations in my PDF – are they retained?
It is quite common to annotate a PDF with comments, highlighting, freehand drawings, etc. NetDocuments OCR has the unique ability to OCR and compress documents while ensuring that all these annotations remain as they were, in fully editable format, so you can continue to edit and add further annotations.
11. My documents contain lots of graphics and pictures without text – what happens if OCRing finds no text – do I lose my graphics?
The process of OCRing will not impact in any way graphics contained in your document. NetDocuments OCR attempts to find any words in graphics and overlay an invisible layer of text in the same location as the graphic version of the text. However, any original graphics remain completely unchanged and fully visible to the user. So, if your page or document contains only photographs that have no characters in them, no OCR text will be added, and no changes will be made to that page or document.
12. Why am I asked to configure NetDocuments OCR with a user account different from existing admin accounts?
This makes it clearer within NetDocuments what documents have been reviewed and processed by NetDocuments OCR. This is useful for reporting and audit purposes.