Tool to extract text from major document formats (namespace: extract)
| format | parser | remarks |
|---|---|---|
| xls | xls-parser | cells comma delimited |
| xlsx | opc-parser | cells comma delimited |
| pdf | pdfium-parser | |
| msg | olecf-parser | subject+body |
| eml | gmime-parser | subject+body |
| rtf | rtf-parser | |
| txt | txt-parser | |
| html | tidy-parser | |
| ppt | olecf-parser | |
| pptx | opc-parser | |
| doc | olecf-parser | with control characters |
| docx | opc-parser | without header/footer |
| jpg | ocrs-parser | only latin alphabet |
| gif | ocrs-parser | only latin alphabet |
| ico | ocrs-parser | only latin alphabet |
| bmp | ocrs-parser | only latin alphabet |
| webp | ocrs-parser | only latin alphabet |
| pnm | ocrs-parser | only latin alphabet |
| png | ocrs-parser | only latin alphabet |
Instantiate the class, passing a file extension as the parameter.
var $extract : cs.extract.extract
$extract:=cs.extract.extract.new(".docx")
Use `cs.extract.formats` to get the list of supported formats.
$extensions:=cs.extract.formats.new().extensions
There are two ways to invoke `.getText()`: synchronous and asynchronous.
Synchronous: pass a single parameter and receive a collection of results in return.
$texts:=$extract.getText({file: $file})
You can pass a single object or a collection of objects in a single call.
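As a minimal sketch of the batch form, the snippet below collects every `.docx` file in a folder and passes them in one synchronous call. The choice of the documents folder is purely illustrative, and the assumption that the returned collection follows the order of the inputs is not confirmed by this document.

```4d
var $extract : cs.extract.extract
var $inputs; $texts : Collection
var $file : 4D.File

$extract:=cs.extract.extract.new(".docx")

// Build one input object per matching file (folder is a hypothetical example)
$inputs:=[]
For each ($file; Folder(fk documents folder).files(fk ignore invisible))
	If ($file.extension=".docx")
		$inputs.push({file: $file})
	End if
End for each

// Single synchronous call for the whole batch
// Assumption: results come back in the same order as $inputs
$texts:=$extract.getText($inputs)
```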
Asynchronous: pass a formula as the second parameter. The call returns an empty collection immediately, and each result is delivered to the formula.
The formula should have the following signature:
#DECLARE($worker : 4D.SystemWorker; $params : Object)
var $text : Text
$text:=$worker.response
> [!TIP]
> Whatever value you pass in `data` is returned in `context`.
$extract.getText({file: $file.getContent(); data: $file}; Formula(onResponse))
#DECLARE($worker : 4D.SystemWorker; $params : Object)
var $text : Text
$text:=$worker.response
$file:=$params.context
Use `$params.context` to match each response to its input.
| property | type | description |
|---|---|---|
| file | 4D.File, 4D.Blob, or Text | input |
| json | Boolean | default: false |