Two input file types at the same time in GNU parallel?
Is it possible to have two input file types at the same time using one instance of GNU parallel?
This long command:
find . -name \*.pdf | parallel -j 4 --progress --eta 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' && time find . -name \*.pgm | parallel -j 4 --progress --eta 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
a)
- creates a folder for each PDF it reads (first input file type)
- converts the PDF to PGM images with Ghostscript and moves them into the respective folder
- then uses Tesseract to perform OCR on each PGM (second input file type)
- saves the resulting text files in each respective folder
- and finally deletes the PGM image files.
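For reference, the command relies on GNU parallel's replacement strings: {} is the current input line and {.} is the input line with its extension removed. A tiny demo (scan.pdf is just a made-up file name):

parallel echo {} {.} ::: scan.pdf
# prints: scan.pdf scan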
However, the above command actually consists of two commands combined with &&, which splits the routine into two separate parts. The result is:
b)
- all PDFs are converted to PGM image files first (which eat up a lot of disk space!)
- only then does the OCR start, followed by the purge of the no-longer-needed PGM image files.
This is undesirable, because all that disk space is consumed before the second part of the command even executes!
Is it possible to combine both commands into one, so that parallel goes through the whole process of a) for the first 4 PDFs (since parallel runs 4 jobs at the same time with -j 4) before moving on to the next 4 PDF files?
However, it seems the minimal example below is not possible with parallel:
parallel -j 4 --progress --eta 'mkdir -p {.} && gs -sDEVICE=pgmraw -r300 -o {.}/{.}-%03d.pgm {} && tesseract {} {.} -l deu_frak && rm {.}.pgm' ::: *.pdf *.pgm
Note the two input file extensions, ::: *.pdf *.pgm, at the end.
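If I understand the syntax right, everything after a single ::: is treated as one flat argument list, so *.pdf *.pgm just feeds every matching file to the same command template, one per job; it does not define two separate, linked input types. A quick check (with made-up file names):

parallel -k echo {} ::: a.pdf b.pdf a.pgm
# a.pdf
# b.pdf
# a.pgm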
What can I do to make parallel follow routine a)?
Edit:
This is the entire code I have tried, as proposed by Ole Tange:
generate_pgm() {
    pdf="$1"
    find . -name \*.pdf | parallel 'mkdir -p {.} && gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}' ::: *.pdf
}
export -f generate_pgm

ocr() {
    pgm="$1"
    find . -name \*.pgm | parallel 'tesseract {} {.} -l deu_frak && rm {.}.pgm'
    rm "$pgm"
}
export -f ocr

time parallel -j 4 --progress --eta 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
Unfortunately, this has been unsuccessful; the script behaves the same as my original one. It creates the folders for all PDFs and starts converting the PDFs to PGM while already starting OCR on the first PGM images, instead of going through the whole process for each batch of 4 PDFs before starting on the next four.
I see two solutions:
generate_pgm() {
    pdf="$1"
    # gs stuff
}
export -f generate_pgm

ocr() {
    pgm="$1"
    # tesseract stuff
    rm "$pgm"
}
export -f ocr

parallel 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
This will process one file completely before going on to the next.
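As a rough, untested sketch, the stubs could be filled in with the gs and tesseract calls from the question (the Ghostscript flag list is shortened here, and the inner glob uses {.}/*.pgm to match the per-PDF folder the question's gs command writes into, rather than the pgm/*.pgm placeholder above):

generate_pgm() {
    pdf="$1"
    dir="${pdf%.pdf}"                      # one output folder per PDF, as in the question
    mkdir -p "$dir"
    gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 \
       -o "$dir/${dir}-%03d.pgm" "$pdf"    # full flag set omitted for brevity
}
export -f generate_pgm

ocr() {
    pgm="$1"
    tesseract "$pgm" "${pgm%.pgm}" -l deu_frak   # text file lands next to the image
    rm "$pgm"
}
export -f ocr

# finish one PDF (convert, then OCR all of its pages) before starting the next
parallel 'generate_pgm {}; parallel --argsep ,,, ocr ,,, {.}/*.pgm' ::: *.pdf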
It will, however, run up to n^2 processes (n = number of cores). To avoid this, use --load:
parallel 'generate_pgm {}; parallel --load 100% --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
This way there should be only one active process per CPU core.
If you want to convert only one PDF at a time:
parallel -j1 'generate_pgm {}; parallel --argsep ,,, ocr ,,, pgm/*.pgm' ::: *pdf
Another solution is to use GNU parallel as a dir processor (https://www.gnu.org/software/parallel/man.html#example:-gnu-parallel-as-dir-processor):
nice parallel generate_pgm ::: *pdf &
inotifywait -qmre moved_to -e close_write --format %w%f pgm_output_dir | parallel ocr
This way the PGM generation is done in parallel. The risk here is that if the PGM generation is faster than the OCR, it can still fill the disk.
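One way to lessen that risk would be to give the PGM generation fewer job slots than the OCR, for example with -j 25% (a quarter of the CPU cores), so conversion cannot run too far ahead of the OCR. A rough sketch, not a tested setup:

# convert with only a quarter of the cores; OCR keeps the default of one job per core
nice parallel -j 25% generate_pgm ::: *pdf &
inotifywait -qmre moved_to -e close_write --format %w%f pgm_output_dir | parallel ocr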