Class RewriteManifestsSparkAction

java.lang.Object
org.apache.iceberg.spark.actions.RewriteManifestsSparkAction
All Implemented Interfaces:
Action<RewriteManifests,RewriteManifests.Result>, RewriteManifests, SnapshotUpdate<RewriteManifests,RewriteManifests.Result>

public class RewriteManifestsSparkAction extends Object implements RewriteManifests
An action that rewrites manifests in a distributed manner and co-locates metadata for partitions.

By default, this action rewrites all manifests for the current partition spec and writes the result to the metadata folder. The behavior can be modified by passing a custom predicate to rewriteIf(Predicate) and a custom spec ID to specId(int). A custom location for staged manifests can also be configured via stagingLocation(String). The provided staging location will be ignored if snapshot ID inheritance is enabled. In such cases, the manifests are always written to the metadata folder and committed without staging.
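For illustration, a minimal usage sketch via the SparkActions entry point (the size threshold and staging path are placeholders, and `table` is assumed to be an already-loaded Table):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteManifests;
import org.apache.iceberg.spark.actions.SparkActions;

// Assumes `table` is an already-loaded Iceberg Table.
RewriteManifests.Result result =
    SparkActions.get()
        .rewriteManifests(table)
        // Only rewrite manifests below an illustrative size threshold.
        .rewriteIf(manifest -> manifest.length() < 10 * 1024 * 1024)
        // Optional staging location; ignored when snapshot ID
        // inheritance is enabled (see above).
        .stagingLocation("s3://bucket/staging")
        .execute();
```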

  • Field Details

  • Method Details

    • self

      protected RewriteManifestsSparkAction self()
    • specId

      public RewriteManifestsSparkAction specId(int specId)
      Description copied from interface: RewriteManifests
      Rewrites manifests for a given spec id.

      If not set, defaults to the table's default spec ID.

      Specified by:
      specId in interface RewriteManifests
      Parameters:
      specId - a spec id
      Returns:
      this for method chaining
    • rewriteIf

      public RewriteManifestsSparkAction rewriteIf(Predicate<ManifestFile> newPredicate)
      Description copied from interface: RewriteManifests
      Rewrites only manifests that match the given predicate.

      If not set, all manifests will be rewritten.

      Specified by:
      rewriteIf in interface RewriteManifests
      Parameters:
      newPredicate - a predicate
      Returns:
      this for method chaining
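As a sketch, a predicate might target only manifests below a size threshold (the 8 MB cutoff and the `action` variable are illustrative):

```java
import java.util.function.Predicate;
import org.apache.iceberg.ManifestFile;

// Rewrite only small manifests; larger ones are left as-is.
Predicate<ManifestFile> smallManifests =
    manifest -> manifest.length() < 8 * 1024 * 1024;

action.rewriteIf(smallManifests);
```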
    • stagingLocation

      public RewriteManifestsSparkAction stagingLocation(String newStagingLocation)
      Description copied from interface: RewriteManifests
      Passes a location where the staged manifests should be written.

      If not set, defaults to the table's metadata location.

      Specified by:
      stagingLocation in interface RewriteManifests
      Parameters:
      newStagingLocation - a staging location
      Returns:
      this for method chaining
    • execute

      public RewriteManifests.Result execute()
      Description copied from interface: Action
      Executes this action.
      Specified by:
      execute in interface Action<RewriteManifests,RewriteManifests.Result>
      Returns:
      the result of this action
    • sortBy

      public RewriteManifestsSparkAction sortBy(List<String> partitionFields)
      Description copied from interface: RewriteManifests
Rewrites manifests in a given order, based on partition field names.

      Supply an optional set of partition field names to sort the rewritten manifests by. Choosing a frequently queried partition field can reduce planning time by skipping unnecessary manifests.

For example, given a table PARTITIONED BY (a, b, c, d), one may wish to rewrite and sort manifests by ('d', 'b') only, based on known query patterns. Rewriting manifests in this way will yield a manifest_list whose manifest_files point to data files containing common 'd' then 'b' partition values.

      If not set, manifests will be rewritten in the order of the transforms in the table's partition spec.

      Specified by:
      sortBy in interface RewriteManifests
      Parameters:
partitionFields - exact transformed column names used for partitioning, not the raw column names that partitions are derived from; e.g. supply 'data_bucket' and not 'data' for a bucket(N, data) partition definition
      Returns:
      this for method chaining
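Continuing the PARTITIONED BY (a, b, c, d) example above, the call might look like the following sketch ('d' and 'b' are identity transforms, so the raw names are also the transformed names; `table` is an already-loaded Table):

```java
import java.util.Arrays;
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

// Sort rewritten manifests by 'd' first, then 'b'.
SparkActions.get()
    .rewriteManifests(table)
    .sortBy(Arrays.asList("d", "b"))
    .execute();
```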
    • snapshotProperty

      public RewriteManifestsSparkAction snapshotProperty(String property, String value)
    • commit

      protected void commit(SnapshotUpdate<?> update)
    • commitSummary

      protected Map<String,String> commitSummary()
    • spark

      protected org.apache.spark.sql.SparkSession spark()
    • sparkContext

      protected org.apache.spark.api.java.JavaSparkContext sparkContext()
    • option

      public RewriteManifestsSparkAction option(String name, String value)
    • options

      public RewriteManifestsSparkAction options(Map<String,String> newOptions)
    • options

      protected Map<String,String> options()
    • withJobGroupInfo

      protected <T> T withJobGroupInfo(JobGroupInfo info, Supplier<T> supplier)
    • newJobGroupInfo

      protected JobGroupInfo newJobGroupInfo(String groupId, String desc)
    • newStaticTable

      protected Table newStaticTable(TableMetadata metadata, FileIO io)
    • newStaticTable

      protected Table newStaticTable(String metadataFileLocation, FileIO io)
    • contentFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table)
    • contentFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds)
    • manifestDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table)
    • manifestDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, Set<Long> snapshotIds)
    • manifestListDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table)
    • manifestListDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, Set<Long> snapshotIds)
    • statisticsFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, Set<Long> snapshotIds)
    • otherMetadataFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
    • allReachableOtherMetadataFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
    • loadMetadataTable

      protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
    • deleteFiles

      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files)
      Deletes files and keeps track of how many files were removed for each file type.
      Parameters:
      executorService - an executor service to use for parallel deletes
deleteFunc - a function that deletes a file given its path
      files - an iterator of Spark rows of the structure (path: String, type: String)
      Returns:
      stats on which files were deleted
    • deleteFiles

      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)