How to use PDFTEXT variable before storing

  • Last Post 22 December 2021
Matthias posted this 06 December 2021

Hi there


I thought I knew the product quite well, but this one is challenging me:


In my workflow, which starts from a TIFF image, I'd like to extract a date, which I can do with a script. Unfortunately, the script requires a searchable PDF file, because it uses the PDFTEXT variable as input to extract the date. In the end, the date should be used as part of the file name.


However, the searchable PDF is created in the very last step, so I cannot place the script module before the renaming. Does this mean that I have to create the PDF temporarily and use a second hot folder workflow to run the script module for renaming?
I am looking for a way to do it all in one workflow, but I cannot see how to get there.

Unfortunately, the regex I use in the script does not work in the Smart OCR module. Perhaps it is too long?

(((0?[1-9]|[12]\d|3[01])([\-\.\/])\s?((0?[1-9]|1[0-2])([\-\.\/])\s?(19|2\d)?\d{2}))|(((0?[1-9]|[12]\d|3[01])([\-\.\/])\s?((Jan|January)|(Feb|February)|(Mar|March)|(Apr|April)|Mai|(Jun|June)|(Jul|July)|(Aug|August)|(Sep|September)|(Oct|October)|(Nov|November)|(Dec|December))([\-\.\/])?\s?)(19|2\d)?\d{2}))
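
To check whether the pattern itself is at fault, it can be tried outside the product first. The following is a minimal sketch in Python, assuming Python's `re` engine behaves closely enough to the one used by the script module (the thread does not say which engine the product uses). The pattern is copied verbatim from the post, including "Mai" as written; `find_date` and the sample strings are hypothetical.

```python
import re

# Pattern from the post, copied verbatim and split across lines for readability.
DATE_RE = re.compile(
    r"(((0?[1-9]|[12]\d|3[01])([\-\.\/])\s?((0?[1-9]|1[0-2])([\-\.\/])\s?(19|2\d)?\d{2}))"
    r"|(((0?[1-9]|[12]\d|3[01])([\-\.\/])\s?((Jan|January)|(Feb|February)|(Mar|March)"
    r"|(Apr|April)|Mai|(Jun|June)|(Jul|July)|(Aug|August)|(Sep|September)|(Oct|October)"
    r"|(Nov|November)|(Dec|December))([\-\.\/])?\s?)(19|2\d)?\d{2}))"
)


def find_date(text):
    """Return the first date-like substring in text, or None if there is none."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


# Hypothetical sample lines standing in for OCR output.
for sample in ["Invoice date: 06.12.2021", "Issued 22. December 2021", "no date here"]:
    print(repr(sample), "->", find_date(sample))
```

If the pattern matches here but not in Smart OCR, the difference is more likely in how the module evaluates groups than in the length of the expression.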

Any hint on how to do this in a single workflow would be much appreciated.

Best regards, Matthias

 

luca.scarpati posted this 07 December 2021

Hi Matthias,

 

I think there is a bit of confusion about the use of the variables; maybe you misunderstood what each of them means:

  • %PDFTEXT%: holds the text of a searchable PDF used as input (so in this case the PDF document already contains the OCR text)
  • %OCRTEXT%: holds the text produced by the OCR process during the workflow (we think this is the one you need, because your input is a TIFF, correct?)

So you can do it all in one workflow:

  1. Capture
  2. Script that tries to get your data from %OCRTEXT%, with PDF (OCR enabled) as the output profile, or the Smart OCR module instead of the Script
  3. WFS connector with %YOUR_VARIABLE% inside the filename
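
The logic of steps 2 and 3 can be sketched as follows. This is only an illustration in Python, not the product's scripting API: `ocr_text` stands in for the content of %OCRTEXT%, the returned string stands in for the value placed in the file name, and `SIMPLE_DATE_RE`, `build_filename`, and the `_scan.pdf` suffix are hypothetical. The pattern is a simplified stand-in for the longer one in the question (numeric day-month-year dates with a four-digit year only).

```python
import re
from datetime import date

# Simplified stand-in for the pattern in the question: numeric dates only,
# day-month-year order, four-digit year.
SIMPLE_DATE_RE = re.compile(r"\b(\d{1,2})[-./]\s?(\d{1,2})[-./]\s?(\d{4})\b")


def build_filename(ocr_text, fallback="undated"):
    """Return a PDF file name built from the first valid date found in ocr_text."""
    for match in SIMPLE_DATE_RE.finditer(ocr_text):
        day, month, year = (int(part) for part in match.groups())
        try:
            stamp = date(year, month, day).isoformat()  # YYYY-MM-DD, safe in file names
        except ValueError:
            continue  # looks like a date but is not a real one, e.g. 31.02.2021
        return f"{stamp}_scan.pdf"
    return f"{fallback}_scan.pdf"


print(build_filename("Invoice date: 06.12.2021"))
```

Writing the date as YYYY-MM-DD avoids "/" and "." separators in the file name and makes the files sort chronologically.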

 

If you need specific help for your end-user scenario, please contact our support; they will certainly be able to give you specific suggestions and hints.

 

Best regards,

Luca

 

Matthias posted this 22 December 2021

Hi Luca

Thanks for your prompt reply! 

You are right about the variables. I mixed them up in my post because I had been testing with different source files and variables when it did not work right away.

Currently, with the setup Capture (TIFF) -> Script -> WFS (searchable PDF), the variable %OCRTEXT% is not filled. I will have to double-check this again next year.
For now, I use SmartOCR to extract the date directly and convert the output to a standardized format in the script trigger that follows. I just realized that SmartOCR seems to use the last capture group, so I have to insert "?:" into every group except the first to make them non-capturing.
Thanks for nudging me in the right direction.
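
For anyone who wants to reproduce the "?:" change, here is a minimal sketch in Python, assuming Python's `re` engine treats groups the same way as the engine behind SmartOCR (not confirmed in this thread). The pattern is the one from my first post; the blanket replacement below is a shortcut that is only safe because this pattern contains no escaped parentheses, no parentheses inside character classes, and no existing "(?" constructs.

```python
import re

# Pattern from the first post, copied verbatim and split across lines.
ORIGINAL = (
    r"(((0?[1-9]|[12]\d|3[01])([\-\.\/])\s?((0?[1-9]|1[0-2])([\-\.\/])\s?(19|2\d)?\d{2}))"
    r"|(((0?[1-9]|[12]\d|3[01])([\-\.\/])\s?((Jan|January)|(Feb|February)|(Mar|March)"
    r"|(Apr|April)|Mai|(Jun|June)|(Jul|July)|(Aug|August)|(Sep|September)|(Oct|October)"
    r"|(Nov|November)|(Dec|December))([\-\.\/])?\s?)(19|2\d)?\d{2}))"
)

# Keep the first "(" as the only capturing group; every later "(" becomes "(?:".
NON_CAPTURING = ORIGINAL[0] + ORIGINAL[1:].replace("(", "(?:")

print("capturing groups before:", re.compile(ORIGINAL).groups)
print("capturing groups after: ", re.compile(NON_CAPTURING).groups)

# With a single capturing group, "first group" and "last group" are the same,
# so it no longer matters which one the consumer of the match picks.
match = re.search(NON_CAPTURING, "Issued 22. December 2021")
print(match.group(1))
```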

By the way, is there a way to specify in SmartOCR which value should be taken when multiple values match? It looks like SmartOCR takes the last hit. Or is it possible to specify a page number in SmartOCR?
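
To illustrate what I mean by choosing among several hits, here is a rough sketch in Python. It does not show a SmartOCR feature; it only shows the selection I would otherwise have to do in a script. `SIMPLE_DATE_RE` is a simplified stand-in for my pattern, and `pick_match` and the sample text are hypothetical.

```python
import re

# Simplified stand-in pattern: numeric dates with a four-digit year only.
SIMPLE_DATE_RE = re.compile(r"\b\d{1,2}[-./]\s?\d{1,2}[-./]\s?\d{4}\b")


def pick_match(text, index=0):
    """Return the hit at position index (0 = first, -1 = last), or None."""
    hits = SIMPLE_DATE_RE.findall(text)  # no groups, so this lists whole matches
    try:
        return hits[index]
    except IndexError:
        return None


text = "Order 01.12.2021, delivered 06.12.2021, due 22.12.2021"
print(pick_match(text))      # first hit
print(pick_match(text, -1))  # last hit
```

If the text were available per page, the same function could be applied to the text of a single page to get the effect of a page number setting.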

Best regards and Merry Christmas!!
Matthias
