extract

tool to extract text from major document formats

View the Project on GitHub miyako/extract

tool to extract text from major document formats (namespace: extract)

supported formats

| format | parser | remarks |
| --- | --- | --- |
| xls | xls-parser | cells comma delimited |
| xlsx | opc-parser | cells comma delimited |
| pdf | pdfium-parser | |
| msg | olecf-parser | subject+body |
| eml | gmime-parser | subject+body |
| rtf | rtf-parser | |
| txt | txt-parser | |
| html | tidy-parser | |
| ppt | olecf-parser | |
| pptx | opc-parser | |
| doc | olecf-parser | with control characters |
| docx | opc-parser | without header/footer |
| jpg | ocrs-parser | only latin alphabet |
| gif | ocrs-parser | only latin alphabet |
| ico | ocrs-parser | only latin alphabet |
| bmp | ocrs-parser | only latin alphabet |
| webp | ocrs-parser | only latin alphabet |
| pnm | ocrs-parser | only latin alphabet |
| png | ocrs-parser | only latin alphabet |

acknowledgements

usage

instantiate the class, passing a file extension as the parameter.

var $extract : cs.extract.extract
$extract:=cs.extract.extract.new(".docx")

use cs.extract.formats to get the list of supported formats.

$extensions:=cs.extract.formats.new().extensions
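the two calls above can be combined to guard against unsupported files. this is a minimal sketch, not part of the component: it assumes that .extensions is a collection of text values written with a leading dot (like the ".docx" passed to the constructor), and the file path is hypothetical.

```4d
// sketch: only instantiate the extractor when the extension is listed
// assumption: .extensions is a collection of text values such as ".docx"

var $file : 4D.File
$file:=File("/RESOURCES/sample.docx")  // hypothetical path

var $extensions : Collection
$extensions:=cs.extract.formats.new().extensions

var $extract : cs.extract.extract
If ($extensions.includes($file.extension))
	$extract:=cs.extract.extract.new($file.extension)
End if 
```

if the list turns out to use a different spelling (for example without the dot), adjust the comparison accordingly.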

there are 2 ways to invoke .getText(): synchronous and asynchronous.

synchronous: pass a single parameter and receive a collection of results in return.

$texts:=$extract.getText({file: $file})

you can pass a single object or a collection of objects in a single call.
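as an illustration of the collection form, the sketch below gathers every .docx file in a folder and passes them in one synchronous call. the folder location is hypothetical, and the shape of each element in the returned collection is not shown here because it is not described above.

```4d
// sketch: extract text from several .docx files in a single call
var $extract : cs.extract.extract
$extract:=cs.extract.extract.new(".docx")

var $folder : 4D.Folder
$folder:=Folder(fk desktop folder).folder("documents")  // hypothetical location

// build one {file: ...} object per matching file
var $inputs : Collection
$inputs:=[]

var $file : 4D.File
For each ($file; $folder.files(fk ignore invisible))
	If ($file.extension=".docx")
		$inputs.push({file: $file})
	End if 
End for each 

// pass the whole collection instead of a single object
var $texts : Collection
$texts:=$extract.getText($inputs)
```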

asynchronous: pass a formula as the second parameter. the call returns an empty collection immediately; the extracted text is delivered to the formula.

the formula should have the following signature:

#DECLARE($worker : 4D.SystemWorker; $params : Object)

var $text : Text
$text:=$worker.response

[!TIP] whatever value you pass in data is returned in $params.context

$extract.getText({file: $file.getContent(); data: $file}; Formula(onResponse))

// onResponse
#DECLARE($worker : 4D.SystemWorker; $params : Object)

var $text : Text
$text:=$worker.response

// the value passed in data comes back as the context
var $file : 4D.File
$file:=$params.context

use this to match input against output.
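a possible use of this pairing is sketched below: several files are queued asynchronously, and the callback saves each result next to the file it came from. the folder location and the .txt naming scheme are assumptions made for the example; onResponse is the project method named in the call above.

```4d
// caller: queue every .docx file, carrying the 4D.File along in data
var $extract : cs.extract.extract
$extract:=cs.extract.extract.new(".docx")

var $file : 4D.File
For each ($file; Folder(fk desktop folder).folder("documents").files(fk ignore invisible))
	If ($file.extension=".docx")
		$extract.getText({file: $file.getContent(); data: $file}; Formula(onResponse))
	End if 
End for each 
```

```4d
// onResponse: called once per input
#DECLARE($worker : 4D.SystemWorker; $params : Object)

// the 4D.File passed in data identifies which input this response belongs to
var $source : 4D.File
$source:=$params.context

// save the text beside the original, e.g. report.docx -> report.txt
$source.parent.file($source.name+".txt").setText($worker.response)
```

because each response carries its own context, the callback does not depend on the order in which the workers finish.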

| property | type | description |
| --- | --- | --- |
| file | 4D.File, 4D.Blob, Text | input |