pdf - Two input file types at the same time in GNU parallel?


Is it possible to have two input file types at the same time using one instance of GNU parallel?

This long command:

find . -name \*.pdf |
  parallel -j 4 --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' &&
time find . -name \*.pgm |
  parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'

a)

  • creates a folder for each PDF it reads (first input file type)
  • converts the PDF with Ghostscript into PGM images
  • places them in the respective folder
  • then uses tesseract to perform OCR on each PGM (second input file type)
  • afterwards saves the text files in each respective folder
  • and finally deletes the PGM image files.

However, the above command consists of two commands combined with &&, which splits the above routine into two separate parts. The result would be:

b)

  1. first convert all PDFs to PGM image files (which eat a lot of disk space!)
  2. only then start the OCR and the subsequent purge of the unneeded PGM image files.

This is undesired, as it will eat up disk space before the second part of the command can execute!

Is it possible to combine both commands into one, so that parallel goes through the whole process of a) for the first 4 PDFs (as parallel runs 4 jobs at the same time with -j 4) before going on to the next 4 PDF files?

However, it seems that the minimal example below is not possible with parallel:

parallel -j 4 --progress --eta 'mkdir -p {.} && gs -sDEVICE=pgmraw -r300 -o {.}/{.}-%03d.pgm {} && tesseract {} {.} -l deu_frak && rm {.}.pgm' ::: *.pdf *.pgm

Note the two input file extensions ::: *.pdf *.pgm at the end.

What can I do to make parallel follow routine a)?

Edit:

This is the entire code I have tried, as proposed by Ole Tange:

generate_pgm() {
  pdf="$1"
  find . -name \*.pdf | parallel 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' ::: *.pdf
}
export -f generate_pgm

ocr() {
  pgm="$1"
  find . -name \*.pgm | parallel 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
  rm "$pgm"
}
export -f ocr

time parallel -j 4 --progress --eta 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf

Unfortunately, this has been unsuccessful: the script behaves the same as the original script. It creates the folders for the PDFs and starts converting PDFs to PGM while starting OCR on the first PGM images, instead of going through the whole process for each group of 4 PDFs before starting on the next four.

I see two solutions:

generate_pgm() {
  pdf="$1"
  # gs stuff
}
export -f generate_pgm

ocr() {
  pgm="$1"
  # tesseract stuff
  rm "$pgm"
}
export -f ocr

parallel 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf

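This will process a file completely before going on to the next.

It will, however, run n^2 processes (n = number of cores), because each of the n outer jobs starts its own inner parallel with n jobs. To avoid that, use --load: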
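As a rough illustration of what the two placeholder functions might contain, here is a minimal sketch. The gs and tesseract options are a shortened subset of those in the question; the directory layout (one folder per PDF, named after the PDF) and the process_pdf helper are assumptions, not part of the original answer. Note that process_pdf runs the OCR for one PDF serially instead of using a nested parallel, which is a simpler variant of the same idea.

```shell
# Render one PDF into PGM pages inside a folder named after the PDF.
# Assumed layout: ./scan.pdf -> ./scan/scan-001.pgm, ./scan/scan-002.pgm, ...
generate_pgm() {
  pdf="$1"
  dir="${pdf%.pdf}"
  base="$(basename "$dir")"
  mkdir -p "$dir" &&
    gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 \
       -o "$dir/$base-%03d.pgm" "$pdf"
}

# OCR one PGM page; delete the image only if tesseract succeeded.
# tesseract appends .txt to the output base name it is given.
ocr() {
  pgm="$1"
  tesseract "$pgm" "${pgm%.pgm}" -l deu_frak && rm "$pgm"
}

# Hypothetical helper: the whole routine a) for a single PDF.
process_pdf() {
  pdf="$1"
  generate_pgm "$pdf" || return 1
  for pgm in "${pdf%.pdf}"/*.pgm; do
    [ -e "$pgm" ] || continue   # glob matched nothing
    ocr "$pgm"
  done
}
```

In bash, after `export -f generate_pgm ocr process_pdf`, something like `parallel -j 4 process_pdf ::: *.pdf` should then keep at most four PDFs' worth of PGM files on disk at any time.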

parallel 'generate_pgm {}; parallel --load 100% --argsep ,,, ocr ,,, pgm/*.pgm'  ::: *pdf 

This way there should be only 1 active process per CPU core.

If you want to convert one PDF at a time:
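The n^2 figure is just the product of the outer and inner job counts. A tiny sketch of that arithmetic, using a hypothetical helper name:

```shell
# Upper bound on simultaneous inner jobs when one parallel starts another:
# every outer job slot runs its own inner parallel with its own job slots.
nested_job_limit() {
  outer_jobs=$1
  inner_jobs=$2
  echo $((outer_jobs * inner_jobs))
}
```

For example, `nested_job_limit 4 4` prints 16, which is what a 4-core machine could reach with both instances left at their default of one job per core.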

parallel -j1 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm'  ::: *pdf 

Another solution is to use GNU parallel as a dir processor https://www.gnu.org/software/parallel/man.html#example:-gnu-parallel-as-dir-processor:

nice parallel generate_pgm ::: *pdf &
inotifywait -qmre moved_to -e close_write --format %w%f pgm_output_dir |
  parallel ocr

This way the PGM generation is done in parallel with the OCR. The risk here is that if the PGM generation is faster than the OCR, it will still fill the disk.
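One way to reduce that risk might be to pause rendering while free space is low. The following is only a sketch under assumptions: the helper names, the polling interval and any threshold are hypothetical, and it relies on the POSIX `df -Pk` output format, where the fourth column of the second line is the available space in KiB.

```shell
# Free space, in KiB, on the file system that holds directory $1.
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Block until at least $2 KiB are free under $1, polling every $3 seconds
# (default 10).
wait_for_space() {
  dir=$1
  min_kib=$2
  interval=${3:-10}
  while [ "$(free_kib "$dir")" -lt "$min_kib" ]; do
    sleep "$interval"
  done
}
```

Calling, say, `wait_for_space pgm_output_dir 5000000` at the top of generate_pgm would hold back new PDFs until the OCR side has deleted enough PGM files; PDFs that are already being rendered are not interrupted.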

