tool to extract text from major document formats (namespace: extract)
format | parser | remarks |
---|---|---|
xls | xls-parser | cells comma delimited |
xlsx | opc-parser | cells comma delimited |
pdf | pdfium-parser | |
msg | olecf-parser | subject+body |
eml | gmime-parser | subject+body |
rtf | rtf-parser | |
txt | txt-parser | |
html | tidy-parser | |
ppt | olecf-parser | |
pptx | opc-parser | |
doc | olecf-parser | with control characters |
docx | opc-parser | without header/footer |
jpg | ocrs-parser | only latin alphabet |
gif | ocrs-parser | only latin alphabet |
ico | ocrs-parser | only latin alphabet |
bmp | ocrs-parser | only latin alphabet |
webp | ocrs-parser | only latin alphabet |
pnm | ocrs-parser | only latin alphabet |
png | ocrs-parser | only latin alphabet |
instantiate the class, passing a file extension as the parameter.
var $extract : cs.extract.extract
$extract:=cs.extract.extract.new(".docx")
use cs.extract.formats to get the list of supported formats.
$extensions:=cs.extract.formats.new().extensions
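as a minimal sketch, the list can be used to reject unsupported files before instantiating the class. it assumes (this is not confirmed above) that the entries of extensions carry a leading dot, as the constructor argument does; the file path is hypothetical.

```4d
var $extensions : Collection
var $extract : cs.extract.extract
var $file : 4D.File

$file:=File("/RESOURCES/sample.docx")  // hypothetical path
$extensions:=cs.extract.formats.new().extensions

// assumption: entries look like ".docx", the same shape as 4D.File.extension
If ($extensions.indexOf($file.extension)#-1)
	$extract:=cs.extract.extract.new($file.extension)
Else 
	ALERT("unsupported format: "+$file.extension)
End if 
```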
there are 2 ways to invoke .getText(): synchronous and asynchronous.
synchronous: pass a single parameter and receive a collection of results in return.
$texts:=$extract.getText({file: $file})
you can pass a single object or a collection of objects in a single call.
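a sketch of the collection form, assuming two hypothetical .docx files in the Resources folder. the shape of each entry in the returned collection is not described here, so the example only counts the results.

```4d
var $extract : cs.extract.extract
var $texts : Collection
var $fileA; $fileB : 4D.File

$extract:=cs.extract.extract.new(".docx")

// hypothetical paths
$fileA:=File("/RESOURCES/a.docx")
$fileB:=File("/RESOURCES/b.docx")

// one call, several inputs: each object uses the same "file" property
$texts:=$extract.getText([{file: $fileA}; {file: $fileB}])

ALERT(String($texts.length)+" result(s) returned")
```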
asynchronous: pass a formula as the second parameter. the call returns an empty collection immediately; the extracted text is delivered to the formula instead.
the formula should have the following signature:
#DECLARE($worker : 4D.SystemWorker; $params : Object)
var $text : Text
$text:=$worker.response
[!TIP] whatever value you pass in data is returned in context.
$extract.getText({file: $file.getContent(); data: $file}; Formula(onResponse))
#DECLARE($worker : 4D.SystemWorker; $params : Object)
var $text : Text
$text:=$worker.response
$file:=$params.context
use this to match input against output.
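the matching can be sketched as follows, with one request queued per file and the callback writing each result next to its source. this is an illustration, not the component's prescribed pattern: it assumes a single instance can serve several pending requests, and the folder, the .pdf filter, and the .txt output naming are hypothetical.

```4d
// caller: queue one asynchronous request per .pdf file
var $extract : cs.extract.extract
var $file : 4D.File

$extract:=cs.extract.extract.new(".pdf")

For each ($file; Folder(fk documents folder).files())
	If ($file.extension=".pdf")
		// data carries the input file so the callback can identify it
		$extract.getText({file: $file; data: $file}; Formula(onResponse))
	End if 
End for each 

// project method onResponse (a separate method in practice)
#DECLARE($worker : 4D.SystemWorker; $params : Object)
var $source : 4D.File

$source:=$params.context  // the value passed in data
If ($source#Null)
	// write "<name>.txt" beside the original file
	$source.parent.file($source.name+".txt").setText($worker.response)
End if 
```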
property | type | description |
---|---|---|
file | 4D.File, 4D.Blob, or Text | input |