Template extraction
These endpoints extract data from a given document based on a given template.
Sync Extraction
/ocrsvc/v{ver}/TemplateExtraction is called when the user wants to extract data from image file formats such as .jpg, .jpeg, or .png.
The table shows the required values:
Component | Description |
---|---|
API Method | /ocrsvc/v{ver}/TemplateExtraction |
Input |
|
Output | The response is of JSON type |
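As a rough illustration of how a client might address this endpoint, the sketch below builds the request URL and checks whether a file has one of the image extensions named above. The base URL, the version value, and both helper names are hypothetical. The request body and authentication are not described in this section, so they are left out, and the service may accept extensions beyond the three listed.

```python
from pathlib import PurePosixPath

# Extensions named in this section for the synchronous endpoint.
# Assumption: the list is illustrative, not necessarily exhaustive.
SYNC_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def sync_extraction_url(base_url: str, ver: str) -> str:
    """Build the TemplateExtraction URL for a given API version."""
    return f"{base_url.rstrip('/')}/ocrsvc/v{ver}/TemplateExtraction"


def is_sync_file(filename: str) -> bool:
    """Return True if the file extension is one of the listed image formats."""
    return PurePosixPath(filename).suffix.lower() in SYNC_EXTENSIONS


# "https://ocr.example.com" and version "1" are placeholders.
print(sync_extraction_url("https://ocr.example.com/", "1"))
print(is_sync_file("invoice.PNG"), is_sync_file("invoice.pdf"))
```

A check like this lets a client decide up front whether a file can go to the synchronous endpoint before any upload is attempted.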
Input Template
The template file is used to retrieve entities, tables, and key-value pairs using different techniques, such as regular expressions and region-based approaches. Below are some commonly used extractor types:
- Regular Expression Extractor (REExtractor): Extracts specific entities from text using predefined regular expression patterns. This is useful for structured data formats like dates, phone numbers, and invoice numbers.
- Region of Interest Extractor (ROIExtractor): Identifies and extracts entities based on predefined regions within a document. This method is useful when the position of the data is consistent across documents.
- Table Extraction using Regular Expressions (TableREExtractor): Extracts tabular data using pattern-based rules. This approach is effective for structured tables where patterns can be defined to locate rows and columns.
- Table Extraction using Region of Interest (TableROIExtractor): Retrieves tabular data by defining specific regions in a document. This is beneficial for documents where tables have a fixed layout.
- Key-Value Pair Extraction in Tables (KeyValueTableREExtractor): Extracts key-value pairs in a tabular format using regular expressions. This is particularly useful for structured forms where key-value relationships are consistently formatted.
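To make the difference between pattern-based and region-based extraction concrete, here is a minimal conceptual sketch. It is not the service's implementation and does not reflect the actual template file format, which this section does not show. The `Word` type, both function names, and the sample data are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Word(NamedTuple):
    """One OCR word with its bounding box (hypothetical structure)."""
    text: str
    left: float
    top: float
    width: float
    height: float


def regex_extract(text: str, pattern: str) -> Optional[str]:
    """Pattern-based extraction: return the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # Prefer the first capture group when the pattern defines one.
    return match.group(1) if match.groups() else match.group(0)


def roi_extract(words: list[Word], region: tuple[float, float, float, float]) -> str:
    """Region-based extraction: join words whose boxes lie fully inside the region."""
    left, top, width, height = region
    inside = [
        w.text
        for w in words
        if w.left >= left
        and w.top >= top
        and w.left + w.width <= left + width
        and w.top + w.height <= top + height
    ]
    return " ".join(inside)


# Made-up OCR output for illustration only.
text = "Invoice No: INV-2024-001 Date: 12/03/2024"
words = [
    Word("Total", 10, 100, 40, 12),
    Word("42.00", 60, 100, 40, 12),
    Word("Footer", 10, 500, 50, 12),
]

print(regex_extract(text, r"Invoice No:\s*(\S+)"))
print(roi_extract(words, (0, 90, 200, 30)))
```

The regular-expression approach works wherever the text follows a recognizable pattern, while the region approach depends on the data appearing at the same position in every document.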
Response output
{
  "Fields": [
    {
      "FieldName": "Name of the extracted entity",
      "OCR_text": "Extracted text from OCR",
      "Confidence": "Confidence score of extraction",
      "PageNo": "Page number where the entity was found"
    }
  ],
  "Tables": [
    {
      "TableRows": [
        [
          {
            "ColumnName": "Column name of the table",
            "OCR_text": "Extracted text from OCR",
            "Geometry": [Left, Top, Width, Height]
          }
        ]
      ],
      "PageNo": "Page number where the table was found"
    }
  ],
  "_metadata": {
    "TotalFields": "Total number of extracted fields",
    "TotalTables": "Total number of extracted tables",
    "Confidence": "Overall confidence score",
    "TaskID": "Unique identifier for the OCR processing job",
    "OcrProvider": "Name of the OCR service provider used for extraction",
    "TenantID": "Identifier for the tenant or client using the service",
    "NumberOfPages": "Total number of pages in the document"
  }
}
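A client typically needs the extracted values in a simpler shape than the raw response. The sketch below reads the `Fields` and `Tables` keys described above. The helper names and the sample values are made up for illustration, and the value types (numeric confidence and page numbers) are assumptions, since the structure above gives descriptions rather than concrete values.

```python
def fields_by_name(response: dict) -> dict:
    """Map each FieldName to its OCR_text."""
    return {f["FieldName"]: f["OCR_text"] for f in response.get("Fields", [])}


def table_rows(response: dict) -> list:
    """Flatten every table into a list of {ColumnName: OCR_text} rows."""
    rows = []
    for table in response.get("Tables", []):
        for row in table.get("TableRows", []):
            rows.append({cell["ColumnName"]: cell["OCR_text"] for cell in row})
    return rows


# Illustrative response; every value here is invented.
sample_response = {
    "Fields": [
        {"FieldName": "InvoiceNumber", "OCR_text": "INV-001",
         "Confidence": 0.97, "PageNo": 1}
    ],
    "Tables": [
        {
            "TableRows": [
                [
                    {"ColumnName": "Item", "OCR_text": "Widget",
                     "Geometry": [10, 200, 80, 12]},
                    {"ColumnName": "Qty", "OCR_text": "2",
                     "Geometry": [120, 200, 20, 12]},
                ]
            ],
            "PageNo": 1,
        }
    ],
    "_metadata": {"TotalFields": 1, "TotalTables": 1},
}

print(fields_by_name(sample_response))
print(table_rows(sample_response))
```

Note that if two fields share a `FieldName`, this mapping keeps only the last one; a real client may want to collect values into lists instead.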
Async Extraction
/ocrsvc/v{ver}/AysncTemplateExtraction is called when the user wants to extract data from file formats such as .pdf.
The table shows the required values:
Component | Description |
---|---|
API Method | /ocrsvc/v{ver}/AysncTemplateExtraction |
Input |
|
Output | Task ID is generated. |
To retrieve the result, the user needs to submit the generated Task ID to /ocrsvc/v{ver}/GetJobResult. This API returns the job result of any Async API for the given Task ID.
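The submit-then-fetch flow can be sketched as a polling loop. This section does not say how GetJobResult signals a job that is still running, nor how the Task ID is passed, so the sketch takes both as caller-supplied functions: `fetch` performs the actual GetJobResult request and `is_done` decides whether the returned payload is final. All names, the polling interval, and the `Status` key in the demo are hypothetical.

```python
import time
from typing import Any, Callable


def job_result_url(base_url: str, ver: str) -> str:
    """Build the GetJobResult URL for a given API version."""
    return f"{base_url.rstrip('/')}/ocrsvc/v{ver}/GetJobResult"


def wait_for_result(
    task_id: str,
    fetch: Callable[[str], Any],
    is_done: Callable[[Any], bool],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(task_id) until is_done(result) is true or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(task_id)
        if is_done(result):
            return result
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"No result for task {task_id} after {max_attempts} attempts")


# Demo with a fake fetch: two pending replies, then a finished payload.
# The "Status" key is invented; the real pending response is not documented here.
replies = iter([{"Status": "pending"}, {"Status": "pending"}, {"Fields": [], "Tables": []}])
result = wait_for_result(
    "task-123",
    fetch=lambda task_id: next(replies),
    is_done=lambda payload: "Fields" in payload,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)
```

Injecting `fetch` and `is_done` keeps the loop independent of any HTTP library and of the undocumented status format, so only those two functions need to change once the real request details are known.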
Response output
{
  "Fields": [
    {
      "FieldName": "Name of the extracted entity",
      "OCR_text": "Extracted text from OCR",
      "Confidence": "Confidence score of extraction",
      "PageNo": "Page number where the entity was found"
    }
  ],
  "Tables": [
    {
      "TableRows": [
        [
          {
            "ColumnName": "Column name of the table",
            "OCR_text": "Extracted text from OCR",
            "Geometry": [Left, Top, Width, Height]
          }
        ]
      ],
      "PageNo": "Page number where the table was found"
    }
  ],
  "_metadata": {
    "TotalFields": "Total number of extracted fields",
    "TotalTables": "Total number of extracted tables",
    "Confidence": "Overall confidence score",
    "TaskID": "Unique identifier for the OCR processing job",
    "OcrProvider": "Name of the OCR service provider used for extraction",
    "TenantID": "Identifier for the tenant or client using the service",
    "NumberOfPages": "Total number of pages in the document"
  }
}