Class ExpireSnapshotsSparkAction

  • All Implemented Interfaces:
    Action<ExpireSnapshots,​ExpireSnapshots.Result>, ExpireSnapshots

    public class ExpireSnapshotsSparkAction
    extends java.lang.Object
    implements ExpireSnapshots
    An action that performs the same operation as ExpireSnapshots but uses Spark to determine the delta in files between the pre- and post-expiration table metadata. All restrictions of ExpireSnapshots also apply to this action.

    This action first leverages ExpireSnapshots to expire snapshots and then uses metadata tables to find files that can be safely deleted. This is done by anti-joining two Datasets that contain all manifest and content files before and after the expiration. The snapshot expiration will be fully committed before any deletes are issued.

    This operation performs a shuffle, so its parallelism can be controlled through 'spark.sql.shuffle.partitions'.

    Deletes are still performed locally after retrieving the results from the Spark executors.
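
The anti-join described above can be illustrated without Spark. The following is a minimal sketch in plain Java, not the action's actual implementation: it assumes the file listings before and after expiration are available as simple lists of paths, and the class and method names (AntiJoinSketch, filesToDelete) are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; the real action anti-joins two Spark Datasets.
class AntiJoinSketch {
    // Returns the paths that were reachable before expiration but are not
    // reachable afterwards, i.e. a left anti-join of "before" against "after".
    static Set<String> filesToDelete(List<String> before, List<String> after) {
        Set<String> result = new LinkedHashSet<>(before); // keeps input order
        result.removeAll(new HashSet<>(after));
        return result;
    }

    public static void main(String[] args) {
        List<String> before = List.of("m1.avro", "d1.parquet", "d2.parquet");
        List<String> after = List.of("m1.avro", "d2.parquet");
        System.out.println(filesToDelete(before, after)); // prints [d1.parquet]
    }
}
```

Only files that appear in the pre-expiration listing and nowhere in the post-expiration listing are candidates for deletion, which is why the expiration must be committed before any delete is issued.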

    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER  
      protected static org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER  
      protected static java.lang.String FILE_PATH  
      protected static java.lang.String LAST_MODIFIED  
      protected static java.lang.String MANIFEST  
      protected static java.lang.String MANIFEST_LIST  
      protected static java.lang.String OTHERS  
      protected static java.lang.String STATISTICS_FILES  
      static java.lang.String STREAM_RESULTS  
      static boolean STREAM_RESULTS_DEFAULT  
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS​(Table table)  
      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS​(Table table)  
      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS​(Table table, java.util.Set<java.lang.Long> snapshotIds)  
      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles​(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files)
      Deletes files and keeps track of how many files were removed for each file type.
      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles​(SupportsBulkOperations io, java.util.Iterator<FileInfo> files)  
      ExpireSnapshotsSparkAction deleteWith​(java.util.function.Consumer<java.lang.String> newDeleteFunc)
      Passes an alternative delete implementation that will be used for manifests, data and delete files.
      ExpireSnapshots.Result execute()
      Executes this action.
      ExpireSnapshotsSparkAction executeDeleteWith​(java.util.concurrent.ExecutorService executorService)
      Passes an alternative executor service that will be used for file removal.
      org.apache.spark.sql.Dataset<FileInfo> expireFiles()
      Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.
      ExpireSnapshotsSparkAction expireOlderThan​(long timestampMillis)
      Expires all snapshots older than the given timestamp.
      ExpireSnapshotsSparkAction expireSnapshotId​(long snapshotId)
      Expires a specific Snapshot identified by id.
      protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(Table table, MetadataTableType type)  
      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS​(Table table)  
      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS​(Table table, java.util.Set<java.lang.Long> snapshotIds)  
      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS​(Table table)  
      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS​(Table table, java.util.Set<java.lang.Long> snapshotIds)  
      protected JobGroupInfo newJobGroupInfo​(java.lang.String groupId, java.lang.String desc)  
      protected Table newStaticTable​(TableMetadata metadata, FileIO io)  
      ThisT option​(java.lang.String name, java.lang.String value)  
      protected java.util.Map<java.lang.String,​java.lang.String> options()  
      ThisT options​(java.util.Map<java.lang.String,​java.lang.String> newOptions)  
      protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS​(Table table)  
      ExpireSnapshotsSparkAction retainLast​(int numSnapshots)
      Retains the most recent ancestors of the current snapshot.
      protected ExpireSnapshotsSparkAction self()  
      protected org.apache.spark.sql.SparkSession spark()  
      protected org.apache.spark.api.java.JavaSparkContext sparkContext()  
      protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS​(Table table, java.util.Set<java.lang.Long> snapshotIds)  
      protected <T> T withJobGroupInfo​(JobGroupInfo info, java.util.function.Supplier<T> supplier)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • STREAM_RESULTS_DEFAULT

        public static final boolean STREAM_RESULTS_DEFAULT
        See Also:
        Constant Field Values
      • STATISTICS_FILES

        protected static final java.lang.String STATISTICS_FILES
        See Also:
        Constant Field Values
      • COMMA_SPLITTER

        protected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER
      • COMMA_JOINER

        protected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER
    • Method Detail

      • retainLast

        public ExpireSnapshotsSparkAction retainLast​(int numSnapshots)
        Description copied from interface: ExpireSnapshots
        Retains the most recent ancestors of the current snapshot.

        If a snapshot would be expired because it is older than the expiration timestamp, but is one of the numSnapshots most recent ancestors of the current state, it will be retained. This will not prevent snapshots explicitly identified by id from expiring.

        Identical to ExpireSnapshots.retainLast(int)

        Specified by:
        retainLast in interface ExpireSnapshots
        Parameters:
        numSnapshots - the number of snapshots to retain
        Returns:
        this for method chaining
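
The interaction between the expiration timestamp, retainLast, and explicitly identified snapshot ids can be modelled as follows. This is a simplified sketch based only on the description above; it ignores any other protections the library may apply, and the names RetentionSketch, Snap, and expiredIds are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of the retention rule; not Iceberg's implementation.
class RetentionSketch {
    // Hypothetical stand-in for a snapshot: an id and a commit timestamp.
    static final class Snap {
        final long id;
        final long timestampMillis;

        Snap(long id, long timestampMillis) {
            this.id = id;
            this.timestampMillis = timestampMillis;
        }
    }

    // ancestorsNewestFirst: the lineage of the current snapshot, newest first.
    static List<Long> expiredIds(List<Snap> ancestorsNewestFirst, long olderThanMillis,
                                 int numSnapshots, Set<Long> explicitIds) {
        List<Long> expired = new ArrayList<>();
        for (int i = 0; i < ancestorsNewestFirst.size(); i++) {
            Snap snap = ancestorsNewestFirst.get(i);
            boolean explicit = explicitIds.contains(snap.id);
            boolean tooOld = snap.timestampMillis < olderThanMillis;
            boolean retained = i < numSnapshots; // among the most recent ancestors
            // An explicit id expires regardless of retainLast.
            if (explicit || (tooOld && !retained)) {
                expired.add(snap.id);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        List<Snap> lineage = List.of(new Snap(5, 500), new Snap(4, 400),
                new Snap(3, 300), new Snap(2, 200), new Snap(1, 100));
        // Snapshot 4 is within the last two ancestors but is named explicitly.
        System.out.println(expiredIds(lineage, 450L, 2, Set.of(4L))); // prints [4, 3, 2, 1]
    }
}
```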
      • deleteWith

        public ExpireSnapshotsSparkAction deleteWith​(java.util.function.Consumer<java.lang.String> newDeleteFunc)
        Description copied from interface: ExpireSnapshots
        Passes an alternative delete implementation that will be used for manifests, data and delete files.

        Manifest files that are no longer used by valid snapshots will be deleted. Content files that were marked as logically deleted by snapshots that are expired will be deleted as well.

        If this method is not called, unnecessary manifests and content files will still be deleted.

        Identical to ExpireSnapshots.deleteWith(Consumer)

        Specified by:
        deleteWith in interface ExpireSnapshots
        Parameters:
        newDeleteFunc - a function that will be called to delete manifests and data files
        Returns:
        this for method chaining
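
One possible use of an alternative delete implementation is a dry run that records paths instead of removing them. The sketch below assumes only the Consumer<String> contract shown in the signature above; the class name RecordingDelete is hypothetical, and the collection is thread-safe on the assumption that the function may be invoked from an executor service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Hypothetical delete function that records paths rather than deleting them.
final class RecordingDelete implements Consumer<String> {
    private final Queue<String> paths = new ConcurrentLinkedQueue<>();

    @Override
    public void accept(String path) {
        paths.add(path); // no file system access; only remember the path
    }

    // Returns a snapshot copy of the recorded paths in insertion order.
    List<String> recorded() {
        return new ArrayList<>(paths);
    }

    public static void main(String[] args) {
        RecordingDelete recorder = new RecordingDelete();
        // With Iceberg on the classpath, the recorder would be passed as:
        //   action.deleteWith(recorder)
        recorder.accept("s3://bucket/table/data/d1.parquet");
        System.out.println(recorder.recorded().size()); // prints 1
    }
}
```

Note that the snapshot expiration itself is still committed in such a dry run; only the physical file removal is replaced.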
      • expireFiles

        public org.apache.spark.sql.Dataset<FileInfo> expireFiles()
        Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.

        This does not delete data files. To delete data files, run execute().

        This may be called before or after execute() to return the expired files.

        Returns:
        a Dataset of files that are no longer referenced by the table
      • spark

        protected org.apache.spark.sql.SparkSession spark()
      • sparkContext

        protected org.apache.spark.api.java.JavaSparkContext sparkContext()
      • option

        public ThisT option​(java.lang.String name,
                            java.lang.String value)
      • options

        public ThisT options​(java.util.Map<java.lang.String,​java.lang.String> newOptions)
      • options

        protected java.util.Map<java.lang.String,​java.lang.String> options()
      • withJobGroupInfo

        protected <T> T withJobGroupInfo​(JobGroupInfo info,
                                         java.util.function.Supplier<T> supplier)
      • newJobGroupInfo

        protected JobGroupInfo newJobGroupInfo​(java.lang.String groupId,
                                               java.lang.String desc)
      • contentFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS​(Table table)
      • contentFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS​(Table table,
                                                                       java.util.Set<java.lang.Long> snapshotIds)
      • manifestDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestDS​(Table table)
      • manifestDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestDS​(Table table,
                                                                    java.util.Set<java.lang.Long> snapshotIds)
      • manifestListDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS​(Table table)
      • manifestListDS

        protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS​(Table table,
                                                                        java.util.Set<java.lang.Long> snapshotIds)
      • statisticsFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS​(Table table,
                                                                          java.util.Set<java.lang.Long> snapshotIds)
      • otherMetadataFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS​(Table table)
      • allReachableOtherMetadataFileDS

        protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS​(Table table)
      • loadMetadataTable

        protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable​(Table table,
                                                                                           MetadataTableType type)
      • deleteFiles

        protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles​(java.util.concurrent.ExecutorService executorService,
                                                                                             java.util.function.Consumer<java.lang.String> deleteFunc,
                                                                                             java.util.Iterator<FileInfo> files)
        Deletes files and keeps track of how many files were removed for each file type.
        Parameters:
        executorService - an executor service to use for parallel deletes
        deleteFunc - a function that deletes the file at the given path
        files - an iterator of FileInfo entries, each carrying a file path and a file type
        Returns:
        stats on which files were deleted
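
The general pattern of deleting in parallel while counting removals per file type can be sketched in plain Java. This is an illustration under stated assumptions, not the method's implementation: FileEntry is a hypothetical stand-in for FileInfo, the type labels are made up, the returned map stands in for DeleteSummary, and a failing delete propagates as an exception here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class ParallelDeleteSketch {
    // Hypothetical stand-in for FileInfo: a path plus a file-type label.
    static final class FileEntry {
        final String path;
        final String type;

        FileEntry(String path, String type) {
            this.path = path;
            this.type = type;
        }
    }

    // Runs deleteFunc for every file on the pool and returns counts per type.
    static Map<String, Long> deleteAll(ExecutorService pool, Consumer<String> deleteFunc,
                                       Iterator<FileEntry> files) {
        Map<String, AtomicLong> counts = new ConcurrentHashMap<>();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        while (files.hasNext()) {
            FileEntry file = files.next();
            pending.add(CompletableFuture.runAsync(() -> {
                deleteFunc.accept(file.path);
                // Count only after the delete function returned normally.
                counts.computeIfAbsent(file.type, t -> new AtomicLong()).incrementAndGet();
            }, pool));
        }
        for (CompletableFuture<Void> task : pending) {
            task.join(); // wait for completion; rethrows a failure unchecked
        }
        Map<String, Long> summary = new TreeMap<>();
        counts.forEach((type, count) -> summary.put(type, count.get()));
        return summary;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<FileEntry> files = List.of(new FileEntry("d1.parquet", "data"),
                new FileEntry("d2.parquet", "data"), new FileEntry("m1.avro", "manifest"));
        System.out.println(deleteAll(pool, path -> { }, files.iterator())); // prints {data=2, manifest=1}
        pool.shutdown();
    }
}
```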
      • deleteFiles

        protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles​(SupportsBulkOperations io,
                                                                                             java.util.Iterator<FileInfo> files)