Class RewriteManifestsSparkAction

java.lang.Object
org.apache.iceberg.spark.actions.RewriteManifestsSparkAction
All Implemented Interfaces:
Action<RewriteManifests,RewriteManifests.Result>, RewriteManifests, SnapshotUpdate<RewriteManifests,RewriteManifests.Result>

public class RewriteManifestsSparkAction extends Object implements RewriteManifests
An action that rewrites manifests in a distributed manner and co-locates metadata for partitions.

By default, this action rewrites all manifests for the current partition spec and writes the result to the metadata folder. The behavior can be modified by passing a custom predicate to rewriteIf(Predicate) and a custom spec ID to specId(int). A custom location for staged manifests can also be configured via stagingLocation(String). The provided staging location will be ignored if snapshot ID inheritance is enabled. In such cases, the manifests are always written to the metadata folder and committed without staging.
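For illustration, a minimal usage sketch via the SparkActions entry point (the size threshold and staging path are placeholders, and `table` is assumed to be an already-loaded Table):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteManifests;
import org.apache.iceberg.spark.actions.SparkActions;

// Assumes `table` is an already-loaded Iceberg Table.
RewriteManifests.Result result =
    SparkActions.get()
        .rewriteManifests(table)
        // Only rewrite manifests below an illustrative size threshold.
        .rewriteIf(manifest -> manifest.length() < 10 * 1024 * 1024)
        // Optional staging location; ignored when snapshot ID
        // inheritance is enabled (see above).
        .stagingLocation("s3://bucket/staging")
        .execute();
```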

  • Field Details

  • Method Details

    • self

      protected RewriteManifestsSparkAction self()
    • specId

      public RewriteManifestsSparkAction specId(int specId)
      Description copied from interface: RewriteManifests
      Rewrites manifests for a given spec id.

      If not set, defaults to the table's default spec ID.

      Specified by:
      specId in interface RewriteManifests
      Parameters:
      specId - a spec id
      Returns:
      this for method chaining
    • rewriteIf

      public RewriteManifestsSparkAction rewriteIf(Predicate<ManifestFile> newPredicate)
      Description copied from interface: RewriteManifests
      Rewrites only manifests that match the given predicate.

      If not set, all manifests will be rewritten.

      Specified by:
      rewriteIf in interface RewriteManifests
      Parameters:
      newPredicate - a predicate
      Returns:
      this for method chaining
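As a sketch, a predicate might target only manifests below a size threshold (the 8 MB cutoff and the `action` variable are illustrative):

```java
import java.util.function.Predicate;
import org.apache.iceberg.ManifestFile;

// Rewrite only small manifests; larger ones are left as-is.
Predicate<ManifestFile> smallManifests =
    manifest -> manifest.length() < 8 * 1024 * 1024;

action.rewriteIf(smallManifests);
```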
    • stagingLocation

      public RewriteManifestsSparkAction stagingLocation(String newStagingLocation)
      Description copied from interface: RewriteManifests
      Passes a location where the staged manifests should be written.

      If not set, defaults to the table's metadata location.

      Specified by:
      stagingLocation in interface RewriteManifests
      Parameters:
      newStagingLocation - a staging location
      Returns:
      this for method chaining
    • execute

      public RewriteManifests.Result execute()
      Description copied from interface: Action
      Executes this action.
      Specified by:
      execute in interface Action<RewriteManifests,RewriteManifests.Result>
      Returns:
      the result of this action
    • sortBy

      public RewriteManifestsSparkAction sortBy(List<String> partitionFields)
      Description copied from interface: RewriteManifests
Rewrites manifests in a given order, based on partition field names.

      Supply an optional set of partition field names to sort the rewritten manifests by. Choosing a frequently queried partition field can reduce planning time by skipping unnecessary manifests.

For example, given a table PARTITIONED BY (a, b, c, d), one may wish to rewrite and sort manifests by ('d', 'b') only, based on known query patterns. Rewriting manifests in this way will yield a manifest_list whose manifest_files point to data files containing common 'd' then 'b' partition values.

      If not set, manifests will be rewritten in the order of the transforms in the table's partition spec.

      Specified by:
      sortBy in interface RewriteManifests
      Parameters:
partitionFields - exact transformed column names used for partitioning, not the raw column names that partitions are derived from; e.g. supply 'data_bucket' and not 'data' for a bucket(N, data) partition definition
      Returns:
      this for method chaining
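Continuing the PARTITIONED BY (a, b, c, d) example above, the call might look like the following sketch ('d' and 'b' are identity transforms, so the raw names are also the transformed names; `table` is an already-loaded Table):

```java
import java.util.Arrays;
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

// Sort rewritten manifests by 'd' first, then 'b'.
SparkActions.get()
    .rewriteManifests(table)
    .sortBy(Arrays.asList("d", "b"))
    .execute();
```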
    • snapshotProperty

      public RewriteManifestsSparkAction snapshotProperty(String property, String value)
    • commit

      protected void commit(SnapshotUpdate<?> update)
    • commitSummary

      protected Map<String,String> commitSummary()
    • spark

      protected org.apache.spark.sql.SparkSession spark()
    • sparkContext

      protected org.apache.spark.api.java.JavaSparkContext sparkContext()
    • option

      public RewriteManifestsSparkAction option(String name, String value)
    • options

      public RewriteManifestsSparkAction options(Map<String,String> newOptions)
    • options

      protected Map<String,String> options()
    • withJobGroupInfo

      protected <T> T withJobGroupInfo(JobGroupInfo info, Supplier<T> supplier)
    • newJobGroupInfo

      protected JobGroupInfo newJobGroupInfo(String groupId, String desc)
    • newStaticTable

      protected Table newStaticTable(TableMetadata metadata, FileIO io)
    • newStaticTable

      protected Table newStaticTable(String metadataFileLocation, FileIO io)
    • contentFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table)
    • contentFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, Set<Long> snapshotIds)
    • manifestDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table)
    • manifestDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, Set<Long> snapshotIds)
    • manifestListDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table)
    • manifestListDS

      protected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, Set<Long> snapshotIds)
    • statisticsFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, Set<Long> snapshotIds)
    • otherMetadataFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table)
    • allReachableOtherMetadataFileDS

      protected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table)
    • loadMetadataTable

      protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type)
    • deleteFiles

      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(ExecutorService executorService, Consumer<String> deleteFunc, Iterator<FileInfo> files)
      Deletes files and keeps track of how many files were removed for each file type.
      Parameters:
      executorService - an executor service to use for parallel deletes
deleteFunc - a function that deletes a file given its path
      files - an iterator of Spark rows of the structure (path: String, type: String)
      Returns:
      stats on which files were deleted
    • deleteFiles

      protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, Iterator<FileInfo> files)