Splitter

You can use a splitter in the Data flow to split an input document into multiple output documents. Documents can be split based on JSON Path, XPath, output document size, and number of lines in the output document. You can also set documents as passthrough (not split).

Split by Path

These are the supported document types:

  • BOD
  • JSON Conventional

This split type is useful when the input document contains multiple instances of the same object. You define the path to the repeatable object element. Each output document contains one object together with the entire part of the document before (header) and after (footer) the object.

For example, a Purchase Order with multiple Purchase Order Lines is received, and in the next step an API must be called once per line. A splitter can split the Purchase Order into multiple Purchase Orders, each containing all headers and footers and only one Purchase Order Line. The API is then called once for each created Purchase Order.

A sibling of the parent element that does not contain any instance of the element defined by the path is not included in any output document. See Example 2 - Excluding parent element.

To define a required splitting element, you must use XPath for BOD documents or JSON Path for JSON Conventional documents.
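The split-by-path behavior described above can be sketched in Python. This is only an illustration, not the product's implementation: the function name split_by_path is hypothetical, a plain list of keys stands in for a real JSON Path expression, and the sample Purchase Order structure is invented for the example.

```python
import copy

def split_by_path(document, path):
    """Return one copy of `document` per item in the list found at `path`.

    `path` is a list of dictionary keys (a stand-in for a JSON Path).
    Everything outside the repeating element (header and footer) is kept
    in every output document.
    """
    # Walk to the parent of the repeating element in the input.
    parent = document
    for key in path[:-1]:
        parent = parent[key]
    items = parent[path[-1]]

    outputs = []
    for item in items:
        out = copy.deepcopy(document)
        target = out
        for key in path[:-1]:
            target = target[key]
        # Keep exactly one instance of the repeating element.
        target[path[-1]] = [copy.deepcopy(item)]
        outputs.append(out)
    return outputs

# Hypothetical input: one Purchase Order with three lines.
order = {
    "PurchaseOrder": {
        "Header": {"OrderID": "PO-1"},
        "Lines": [{"LineNumber": 1}, {"LineNumber": 2}, {"LineNumber": 3}],
        "Footer": {"LineCount": 3},
    }
}

parts = split_by_path(order, ["PurchaseOrder", "Lines"])
for part in parts:
    print(part["PurchaseOrder"]["Header"], part["PurchaseOrder"]["Lines"])
```

Each element of parts is a complete Purchase Order with the same header and footer and a single line, so the next step can call the API once per element.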

Split by Size

These are the supported document types:

  • DSV
  • JSON Newline-delimited
  • ANY

This split type is useful when each output document must always be smaller than a certain size threshold. When the maximum size is defined, the output is split before it reaches the maximum size.

For example, a large DSV document must be sent to an AWS Kinesis stream that has a 1 MB size limit.

The size limit for the output message must be in the range 0.1 to 5.0 MB. When the “Use compressed size” option is selected, the output size is calculated based on the size of the deflated output document.

Note: Deflate compression can result in varying compression ratios. Therefore, when the compressed size is used, the size of the output document is not as close to the defined size as it is when the uncompressed size limit is used. In particular, the first output message in the split can be significantly smaller.
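To illustrate why compressed sizes are harder to predict, the following Python sketch measures the deflated size of two payloads with the standard zlib module. It is an assumption that a raw deflate stream at the default-like level 6 resembles what the splitter measures; the function name deflated_size and both sample payloads are invented for the example.

```python
import zlib

def deflated_size(data: bytes) -> int:
    """Size in bytes of `data` after raw deflate compression."""
    # wbits=-15 requests a raw deflate stream without zlib or gzip framing.
    # Whether the product uses raw deflate is an assumption.
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    return len(compressor.compress(data) + compressor.flush())

# Two hypothetical DSV payloads: one highly repetitive, one more varied.
repetitive = b"a,b,c\n" * 1000
varied = "".join(f"{i},{(i * 7919) % 10007}\n" for i in range(1000)).encode("utf-8")

for name, payload in (("repetitive", repetitive), ("varied", varied)):
    ratio = deflated_size(payload) / len(payload)
    print(f"{name}: {len(payload)} bytes raw, compression ratio {ratio:.3f}")
```

Because the ratio depends on the content, the same compressed size limit can correspond to very different amounts of uncompressed data.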

For DSV documents you can select the Document has header lines check box.

Documents are split at the end of a line. The supported line delimiters are \r (Carriage Return), \n (Line Feed), and the combination \r\n. For the ANY document type, you can select the Split at end of line check box. If this check box is cleared, then the split happens when the size threshold is reached, regardless of line boundaries.

When a single line is larger than the size limit, an error message (Confirm.BOD) is generated and processing of this message is stopped.
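The line-based size split can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the product's algorithm: sizes are measured as uncompressed UTF-8 bytes, header lines count toward the size of every output, an output may be at most max_bytes, and a ValueError stands in for the Confirm.BOD error. The names split_by_size and LINE_RE are hypothetical.

```python
import re

# One line including its delimiter: \r\n, \r, or \n.
# The second alternative matches a final line without a delimiter.
LINE_RE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")

def split_by_size(text, max_bytes, header_lines=0):
    """Split `text` at line ends so that no output exceeds `max_bytes`."""
    lines = LINE_RE.findall(text)
    header, data = lines[:header_lines], lines[header_lines:]
    header_size = sum(len(h.encode("utf-8")) for h in header)

    chunks, current, current_size = [], [], header_size
    for line in data:
        size = len(line.encode("utf-8"))
        if header_size + size > max_bytes:
            # Stand-in for the error message described above.
            raise ValueError("a single line exceeds the size limit")
        if current and current_size + size > max_bytes:
            # Close the current output before the limit is exceeded.
            chunks.append("".join(header + current))
            current, current_size = [], header_size
        current.append(line)
        current_size += size
    if current:
        chunks.append("".join(header + current))
    return chunks

sample = "id,name\n1,a\n2,b\n3,c\n"
for chunk in split_by_size(sample, max_bytes=16, header_lines=1):
    print(repr(chunk))
```

With a 16-byte limit and one header line, the sample is split into two outputs, and each output starts with the header line.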

Split by Lines

These are the supported document types:

  • DSV
  • JSON Newline-delimited
  • ANY

This split type is useful when output documents are limited to a defined number of lines. For example, a JSON Newline-delimited document must be sent to an API that can handle only a single line at a time.

The number of lines in the output message must be in the range 1 to 99,999,999.

For DSV documents you can select the Document has header lines check box. Header lines are counted on top of the defined limit. For example, if the “number of lines” is set to 50 and there are 3 header lines, then the output document has 53 lines in total.

Documents are split at the end of a line. The supported line delimiters are \r (Carriage Return), \n (Line Feed), and the combination \r\n.
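The line counting described above, including header lines being added on top of the limit, can be sketched in Python. The names split_by_lines and LINE_RE and the generated sample document are assumptions made for this example only.

```python
import re

# One line including its delimiter: \r\n, \r, or \n.
# The second alternative matches a final line without a delimiter.
LINE_RE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")

def split_by_lines(text, max_lines, header_lines=0):
    """Split `text` into outputs of at most `max_lines` data lines each.

    Header lines are copied to the start of every output and are not
    counted against `max_lines`.
    """
    lines = LINE_RE.findall(text)
    header, data = lines[:header_lines], lines[header_lines:]
    return [
        "".join(header + data[start:start + max_lines])
        for start in range(0, len(data), max_lines)
    ]

# Hypothetical DSV document: 3 header lines followed by 120 data lines.
document = "".join(f"header {n}\n" for n in range(3))
document += "".join(f"{n},value\n" for n in range(120))

outputs = split_by_lines(document, max_lines=50, header_lines=3)
print([len(LINE_RE.findall(out)) for out in outputs])
```

With a limit of 50 and 3 header lines, the first two outputs contain 53 lines each, and the last output contains the remaining 20 data lines plus the 3 header lines.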

Passthrough

All document types are supported.

This split type is useful when the input document must be sent to the next activity without any further actions. The document is unchanged. Therefore, all headers, including messageID, instance, and batch headers, are preserved.

Specific logic for DSV document types

When the DSV enclosing characters configuration is defined in Data Catalog, the content inside the enclosing characters is kept as is, without any modifications.

When you select the Document has header lines option, the number of header lines in the document must be defined in Data Catalog. If it is not defined, then the document is expected to contain one header line. If the input document contains fewer lines than the defined number of header lines, then an error message (Confirm.BOD) is generated.

The header configuration in Data Catalog:

  • Number of Header Lines - The number of header lines that exist in the data object. This many lines are copied to the beginning of every output message.

  • Column Header Line - The line number that contains the column headers for the object. This value is irrelevant for the splitter function.

The message content is not validated. If the document has no header line, the first line is still treated as a header.

Output message header logic

These rules do not apply to the passthrough split type:

  • The Message ID of each output document is the same as the Message ID of the input document, extended with a sequence number, for example :1 or :2.

  • If the defined path expression is not found in an input document, then no splitting is done. The input document is passed to the output as is, and the Message ID is extended with :0.

  • For BOD, JSON Conventional, and ANY document types: if an input document is split into at least two output documents, then the instances message header is deleted.

  • For JSON Newline-delimited and DSV document types: the instance header equals the number of data lines, not counting headers, in the output document.

  • All batch headers from the input document are cleared and replaced with batch headers from the splitter. When the output is a single message, no new batch headers are populated.

  • All other message headers are copied from the input document to all output documents.
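The Message ID rules above can be sketched in Python. The function name output_message_ids and the sample ID are hypothetical, and it is an assumption that sequence numbers start at 1 and increase by one per output document; the source only gives :1 and :2 as examples.

```python
def output_message_ids(input_message_id, output_count, path_found=True):
    """Return the Message IDs for the outputs of one split.

    Assumption: sequence numbers start at 1 and increase by one.
    When the path expression is not found, the single unsplit output
    gets the suffix :0.
    """
    if not path_found:
        return [f"{input_message_id}:0"]
    return [f"{input_message_id}:{n}" for n in range(1, output_count + 1)]

# Hypothetical input Message ID.
print(output_message_ids("msg-42", 3))
print(output_message_ids("msg-42", 1, path_found=False))
```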