Interface RewriteDataFiles

    • Field Detail

      • PARTIAL_PROGRESS_ENABLED

        static final java.lang.String PARTIAL_PROGRESS_ENABLED
        Enable committing groups of files (see max-file-group-size-bytes) prior to the entire rewrite completing. This will produce additional commits but allow for progress even if some groups fail to commit. This setting will not change the correctness of the rewrite operation as file groups can be compacted independently.

        The default is false, which produces a single commit when the entire job has completed.

        See Also:
        Constant Field Values
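        As a minimal sketch of how a caller might read this option with its documented default of false: the key string "partial-progress.enabled" and the class name PartialProgressOption are assumptions made for illustration, since this section only names the constant.

```java
import java.util.Map;

class PartialProgressOption {
    // Assumed option key; the section above only names the constant, not its value.
    static final String PARTIAL_PROGRESS_ENABLED = "partial-progress.enabled";
    // Documented default: a single commit once the entire job has completed.
    static final boolean PARTIAL_PROGRESS_ENABLED_DEFAULT = false;

    // Returns the configured value, or the default when the option is absent.
    static boolean partialProgressEnabled(Map<String, String> options) {
        String value = options.get(PARTIAL_PROGRESS_ENABLED);
        return value == null ? PARTIAL_PROGRESS_ENABLED_DEFAULT : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        System.out.println(partialProgressEnabled(Map.of()));
        System.out.println(partialProgressEnabled(Map.of(PARTIAL_PROGRESS_ENABLED, "true")));
    }
}
```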
      • PARTIAL_PROGRESS_ENABLED_DEFAULT

        static final boolean PARTIAL_PROGRESS_ENABLED_DEFAULT
        See Also:
        Constant Field Values
      • PARTIAL_PROGRESS_MAX_COMMITS

        static final java.lang.String PARTIAL_PROGRESS_MAX_COMMITS
        The maximum number of Iceberg commits that this rewrite is allowed to produce if partial progress is enabled. This setting has no effect if partial progress is disabled.
        See Also:
        Constant Field Values
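        One way to stay within a commit limit is to batch file groups so that each commit covers several groups. The sketch below assumes ceiling division is used to pick the batch size; the class and method names are hypothetical and this is not necessarily how any given implementation batches its commits.

```java
class CommitBatching {
    // Smallest number of groups per commit that keeps the commit count within maxCommits.
    static int groupsPerCommit(int totalGroups, int maxCommits) {
        if (totalGroups < 1 || maxCommits < 1) {
            throw new IllegalArgumentException("totalGroups and maxCommits must be positive");
        }
        return (totalGroups + maxCommits - 1) / maxCommits; // ceiling division
    }

    // Number of commits that result from batching with the size chosen above.
    static int commitsProduced(int totalGroups, int maxCommits) {
        int perCommit = groupsPerCommit(totalGroups, maxCommits);
        return (totalGroups + perCommit - 1) / perCommit;
    }

    public static void main(String[] args) {
        // 25 groups with at most 10 commits: 3 groups per commit, 9 commits in total.
        System.out.println(groupsPerCommit(25, 10));
        System.out.println(commitsProduced(25, 10));
    }
}
```

        With fewer groups than the limit, each group simply gets its own commit.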
      • PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT

        static final int PARTIAL_PROGRESS_MAX_COMMITS_DEFAULT
        See Also:
        Constant Field Values
      • MAX_FILE_GROUP_SIZE_BYTES

        static final java.lang.String MAX_FILE_GROUP_SIZE_BYTES
        The entire rewrite operation is broken down into pieces, first by partition and then, within each partition, by size. These sub-units of the rewrite are referred to as file groups. The largest amount of data that should be compacted in a single group is controlled by MAX_FILE_GROUP_SIZE_BYTES. This helps break down the rewriting of very large partitions, which may not be rewritable otherwise due to the resource constraints of the cluster. For example, a sort-based rewrite may not scale to terabyte-sized partitions; those partitions need to be worked on in small subsections to avoid exhausting resources.

        When grouping files, the underlying rewrite strategy will use this value to limit the files that are included in a single file group. A group will be processed by a single framework "action". For example, in Spark this means that each group would be rewritten in its own Spark action. A group will never contain files for multiple output partitions.

        See Also:
        Constant Field Values
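        The size-based grouping described above can be illustrated with a simple greedy pass over the files of a single partition. This is a sketch under assumptions: the actual grouping algorithm is chosen by the rewrite strategy and is not specified here, and FileGroupPlanner and planGroups are hypothetical names. Calling it once per partition keeps every group within one partition.

```java
import java.util.ArrayList;
import java.util.List;

class FileGroupPlanner {
    // Splits the file sizes of ONE partition into groups whose total stays within
    // maxGroupSizeBytes. A file larger than the limit ends up in a group of its own.
    static List<List<Long>> planGroups(List<Long> fileSizes, long maxGroupSizeBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0L;
        for (long size : fileSizes) {
            // Close the current group when adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxGroupSizeBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0L;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Prints [[40, 30], [50, 20]]
        System.out.println(planGroups(List.of(40L, 30L, 50L, 20L), 100L));
    }
}
```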
      • MAX_FILE_GROUP_SIZE_BYTES_DEFAULT

        static final long MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
        See Also:
        Constant Field Values
      • MAX_CONCURRENT_FILE_GROUP_REWRITES

        static final java.lang.String MAX_CONCURRENT_FILE_GROUP_REWRITES
        The maximum number of file groups to be simultaneously rewritten by the rewrite strategy. The structure and contents of each group are determined by the rewrite strategy. Each file group will be rewritten independently and asynchronously.
        See Also:
        Constant Field Values
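        A bounded thread pool is one common way to cap how many independent tasks run at once. The sketch below is illustrative only: it assumes a fixed-size pool and uses a short sleep as a stand-in for the rewrite work, and it reports the peak number of tasks observed running together. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ConcurrentGroupRewrites {
    // Runs one placeholder task per file group and returns the peak number of
    // tasks that were observed running at the same time.
    static int peakConcurrency(int groupCount, int maxConcurrentRewrites) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentRewrites);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < groupCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(10); // stand-in for rewriting one file group
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            // Wait for every group to finish before reporting.
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // With a limit of 2, the reported peak never exceeds 2.
        System.out.println(peakConcurrency(6, 2));
    }
}
```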
      • MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT

        static final int MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT
        See Also:
        Constant Field Values
      • TARGET_FILE_SIZE_BYTES

        static final java.lang.String TARGET_FILE_SIZE_BYTES
        The output file size that this rewrite strategy will attempt to generate when rewriting files. By default this will use the "write.target-file-size-bytes" value in the table properties of the table being updated.
        See Also:
        Constant Field Values
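        The fallback described above can be sketched as a simple lookup order: the rewrite option first, then the table property. The option key "target-file-size-bytes" and the final 512 MB fallback are assumptions for illustration; only the "write.target-file-size-bytes" table property is named in this section.

```java
import java.util.Map;

class TargetFileSize {
    // Assumed option key; the section above only names the constant.
    static final String TARGET_FILE_SIZE_BYTES = "target-file-size-bytes";
    // Table property named in the description above.
    static final String TABLE_PROPERTY = "write.target-file-size-bytes";
    // Assumed fallback when neither value is set (512 MB).
    static final long ASSUMED_TABLE_DEFAULT = 512L * 1024 * 1024;

    // Rewrite option wins; otherwise fall back to the table property, then the assumed default.
    static long resolve(Map<String, String> options, Map<String, String> tableProperties) {
        String fromOptions = options.get(TARGET_FILE_SIZE_BYTES);
        if (fromOptions != null) {
            return Long.parseLong(fromOptions);
        }
        String fromTable = tableProperties.get(TABLE_PROPERTY);
        if (fromTable != null) {
            return Long.parseLong(fromTable);
        }
        return ASSUMED_TABLE_DEFAULT;
    }

    public static void main(String[] args) {
        Map<String, String> tableProperties = Map.of(TABLE_PROPERTY, "134217728");
        System.out.println(resolve(Map.of(), tableProperties));
        System.out.println(resolve(Map.of(TARGET_FILE_SIZE_BYTES, "1000"), tableProperties));
    }
}
```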
    • Method Detail

      • binPack

        default RewriteDataFiles binPack()
        Choose BINPACK as the strategy for this rewrite operation.
        Returns:
        this for method chaining
      • filter

        RewriteDataFiles filter​(Expression expression)
        A user-provided filter for determining which files will be considered by the rewrite strategy. This will be used in addition to whatever rules the rewrite strategy generates. For example, this could be used to restrict the rewrite to a specific partition.
        Parameters:
        expression - An Iceberg expression used to determine which files will be considered for rewriting
        Returns:
        this for method chaining
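        Because both methods return this, calls can be chained. The sketch below models only the shape of that pattern and is not the Iceberg API: RewritePlanSketch is a hypothetical stand-in, and a java.util.function.Predicate over file paths takes the place of an Iceberg Expression so the example stays self-contained.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for a chainable rewrite configuration.
class RewritePlanSketch {
    private String strategy = "UNSET";
    private Predicate<String> filter = path -> true;

    // Selects the bin-packing strategy and returns this for chaining.
    RewritePlanSketch binPack() {
        this.strategy = "BINPACK";
        return this;
    }

    // Adds a user filter on top of any existing one and returns this for chaining.
    RewritePlanSketch filter(Predicate<String> userFilter) {
        this.filter = this.filter.and(userFilter);
        return this;
    }

    String strategy() {
        return strategy;
    }

    // Files that would be considered for rewriting under the current filter.
    List<String> candidates(List<String> paths) {
        return paths.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "data/date=2021-01-01/a.parquet",
            "data/date=2021-01-02/b.parquet");
        RewritePlanSketch plan = new RewritePlanSketch()
            .binPack()
            .filter(path -> path.contains("date=2021-01-01"));
        // Only the file in the selected partition remains a candidate.
        System.out.println(plan.strategy() + " " + plan.candidates(files));
    }
}
```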