Backscanning 1000s of files - Handling prioritisation

  • Last Post 14 July 2022
Cam Titley posted this 14 July 2022

I wanted to share a solution I have for a situation where I need to OCR and store many thousands of multi-page PDF documents into SharePoint with the stored location being defined by subfolders based upon variables captured in the OCR process.

I welcome any discussion around better methods to handle this issue. I should point out that I am a beginner with this system, so I may have overlooked something completely obvious!

My problem was with the way the workflow process handles a watch folder containing multiple files. Specifically, the 'first workflow' locks out any other workflows from polling until it has completed however many files appeared in the watch folder.

My staff will scan a PDF to the watch folder with a multi-function centre, but if I have dropped 1000 files into the watch folder from my backlog archive, the user's PDF will not be collected and processed for days (potentially).

 

My solution is to use a simple external PowerShell script on the server that hosts ScanShare to ‘drip feed’ historical files into the watch folder.

The script checks the contents of the ScanShare watch folder. If there are no files in it, the script grabs just 1 file from the pool of 20000 waiting elsewhere and moves it in.

If a user scans a new deal in the meantime that will be added as a second file in the drop location and be dealt with sooner.

The drip feeder will not add anything else into the watch folder until it sees that the watch folder is empty again.

This ensures the server always has something to work on, but prioritises any user-submitted PDFs over the history!

 

The PowerShell (.ps1) code below is set to run every minute via the Windows Task Scheduler.

# DripFeeder
# This script will ensure the destination watch folder always has at least 1 file if available in the backlog
# Author: Cam Titley 14/07/2022

$backlog_folder = "C:\ScanShare Working Folder\Sort Wait\"
$watch_folder = "C:\ScanShare Working Folder\Sort Drop\"

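# Stop early if the watch folder is missing, rather than counting a folder that does not exist
if (-not (Test-Path -Path $watch_folder)) {
    "Watch folder not found: $watch_folder"
    return
}

# First file found anywhere under the backlog folder ($null when the backlog is empty)
$nextfile = Get-ChildItem -Path $backlog_folder -Force -Recurse -File | Select-Object -First 1 | ForEach-Object { $_.FullName }
$watch_folder_count = ( Get-ChildItem $watch_folder | Measure-Object ).Count

"Watchfolder Count: $watch_folder_count"
"Next File: $nextfile"

# Feed one backlog file only when the watch folder is empty and the backlog still has a file to give.
# -LiteralPath stops characters such as [ ] in a file name being treated as wildcards.
if ($watch_folder_count -lt 1 -and $nextfile) {
    Move-Item -LiteralPath $nextfile -Destination $watch_folder
}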
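
For anyone wanting to reproduce the one-minute schedule without clicking through the Task Scheduler UI, the following is a minimal sketch using the built-in ScheduledTasks cmdlets. The script path and task name are assumptions for illustration and are not from the original setup. It needs an elevated PowerShell session, and some older Windows versions may also need an explicit -RepetitionDuration on the trigger.

```powershell
# Sketch only: the script path and task name below are hypothetical.
$script_path = "C:\Scripts\DripFeeder.ps1"   # assumed location of the DripFeeder script
$task_name   = "ScanShare DripFeeder"        # assumed task name

# Run powershell.exe against the script without loading a user profile
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -File `"$script_path`""

# Start now and repeat every minute
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 1)

# Register under the SYSTEM account so the task runs whether or not anyone is logged on
Register-ScheduledTask -TaskName $task_name -Action $action -Trigger $trigger -User "SYSTEM" -RunLevel Highest
```

If the server's execution policy blocks unsigned scripts, the task will register but the script will not run, so it is worth checking the task's last run result after the first minute.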

luigi.zurolo posted this 14 July 2022

Hi Cam,

The single document processing queue is standard and by design. When a workflow triggers, the processing engine checks all the documents available in that capture source at that moment, and the queue is released only after those documents have been processed (whether they fail or succeed). Other workflows that trigger in the meantime are queued and start as soon as the current documents have finished.

Multiple processing lines are currently achieved with the Load Balancing module, which deploys additional processing services on different machines and dispatches documents to them to spread the load.

Currently, multi-document processing is not possible on a single server. For heavy workloads, the ideal approach would be to configure the system to fit the customer scenario, for example with different timings or with conditions that move documents into different workflows (so that they trigger at different times).

Thank you for sharing your design configuration with us!
