Class DeleteOrphanFilesSparkAction
- java.lang.Object
- 
- org.apache.iceberg.spark.actions.DeleteOrphanFilesSparkAction
 
- 
- All Implemented Interfaces:
- Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>,- DeleteOrphanFiles
 
 public class DeleteOrphanFilesSparkAction extends java.lang.Object implements DeleteOrphanFiles An action that removes orphan metadata, data and delete files by listing a given location and comparing the actual files in that location with content and metadata files referenced by all valid snapshots. The location must be accessible for listing via the HadoopFileSystem.By default, this action cleans up the table location returned by Table.location()and removes unreachable files that are older than 3 days usingTable.io(). The behavior can be modified by passing a custom location tolocationand a custom timestamp toolderThan(long). For example, someone might point this action to the data folder to clean up only orphan data files.Configure an alternative delete method using deleteWith(Consumer).For full control of the set of files being evaluated, use the compareToFileList(Dataset)argument. This skips the directory listing - any files in the dataset provided which are not found in table metadata will be deleted, using the sameTable.location()andolderThan(long)filtering as above.Note: It is dangerous to call this action with a short retention interval as it might corrupt the state of the table if another operation is writing at the same time. 
- 
- 
Nested Class SummaryNested Classes Modifier and Type Class Description static classDeleteOrphanFilesSparkAction.FileURI- 
Nested classes/interfaces inherited from interface org.apache.iceberg.actions.DeleteOrphanFilesDeleteOrphanFiles.PrefixMismatchMode, DeleteOrphanFiles.Result
 
- 
 - 
Field SummaryFields Modifier and Type Field Description protected static org.apache.iceberg.relocated.com.google.common.base.JoinerCOMMA_JOINERprotected static org.apache.iceberg.relocated.com.google.common.base.SplitterCOMMA_SPLITTERprotected static java.lang.StringFILE_PATHprotected static java.lang.StringLAST_MODIFIEDprotected static java.lang.StringMANIFESTprotected static java.lang.StringMANIFEST_LISTprotected static java.lang.StringOTHERSprotected static java.lang.StringSTATISTICS_FILES
 - 
Method SummaryAll Methods Instance Methods Concrete Methods Modifier and Type Method Description protected org.apache.spark.sql.Dataset<FileInfo>allReachableOtherMetadataFileDS(Table table)DeleteOrphanFilesSparkActioncompareToFileList(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files)protected org.apache.spark.sql.Dataset<FileInfo>contentFileDS(Table table)protected org.apache.spark.sql.Dataset<FileInfo>contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummarydeleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files)Deletes files and keeps track of how many files were removed for each file type.protected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummarydeleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files)DeleteOrphanFilesSparkActiondeleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc)Passes an alternative delete implementation that will be used for orphan files.DeleteOrphanFilesSparkActionequalAuthorities(java.util.Map<java.lang.String,java.lang.String> newEqualAuthorities)Passes authorities that should be considered equal.DeleteOrphanFilesSparkActionequalSchemes(java.util.Map<java.lang.String,java.lang.String> newEqualSchemes)Passes schemes that should be considered equal.DeleteOrphanFiles.Resultexecute()Executes this action.DeleteOrphanFilesSparkActionexecuteDeleteWith(java.util.concurrent.ExecutorService executorService)Passes an alternative executor service that will be used for removing orphaned files.protected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>loadMetadataTable(Table table, MetadataTableType type)DeleteOrphanFilesSparkActionlocation(java.lang.String newLocation)Passes a location which should be scanned for orphan files.protected org.apache.spark.sql.Dataset<FileInfo>manifestDS(Table table)protected org.apache.spark.sql.Dataset<FileInfo>manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds)protected org.apache.spark.sql.Dataset<FileInfo>manifestListDS(Table table)protected org.apache.spark.sql.Dataset<FileInfo>manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds)protected JobGroupInfonewJobGroupInfo(java.lang.String groupId, java.lang.String desc)protected TablenewStaticTable(TableMetadata metadata, FileIO io)DeleteOrphanFilesSparkActionolderThan(long newOlderThanTimestamp)Removes orphan files only if they are older than the given timestamp.ThisToption(java.lang.String name, java.lang.String value)protected java.util.Map<java.lang.String,java.lang.String>options()ThisToptions(java.util.Map<java.lang.String,java.lang.String> newOptions)protected org.apache.spark.sql.Dataset<FileInfo>otherMetadataFileDS(Table table)DeleteOrphanFilesSparkActionprefixMismatchMode(DeleteOrphanFiles.PrefixMismatchMode newPrefixMismatchMode)Passes a prefix mismatch mode that determines how this action should handle situations when the metadata references files that match listed/provided files except for authority/scheme.protected DeleteOrphanFilesSparkActionself()protected org.apache.spark.sql.SparkSessionspark()protected org.apache.spark.api.java.JavaSparkContextsparkContext()protected org.apache.spark.sql.Dataset<FileInfo>statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds)protected <T> TwithJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier)
 
- 
- 
- 
Field Detail- 
MANIFESTprotected static final java.lang.String MANIFEST - See Also:
- Constant Field Values
 
 - 
MANIFEST_LISTprotected static final java.lang.String MANIFEST_LIST - See Also:
- Constant Field Values
 
 - 
STATISTICS_FILESprotected static final java.lang.String STATISTICS_FILES - See Also:
- Constant Field Values
 
 - 
OTHERSprotected static final java.lang.String OTHERS - See Also:
- Constant Field Values
 
 - 
FILE_PATHprotected static final java.lang.String FILE_PATH - See Also:
- Constant Field Values
 
 - 
LAST_MODIFIEDprotected static final java.lang.String LAST_MODIFIED - See Also:
- Constant Field Values
 
 - 
COMMA_SPLITTERprotected static final org.apache.iceberg.relocated.com.google.common.base.Splitter COMMA_SPLITTER 
 - 
COMMA_JOINERprotected static final org.apache.iceberg.relocated.com.google.common.base.Joiner COMMA_JOINER 
 
- 
 - 
Method Detail- 
selfprotected DeleteOrphanFilesSparkAction self() 
 - 
executeDeleteWithpublic DeleteOrphanFilesSparkAction executeDeleteWith(java.util.concurrent.ExecutorService executorService) Description copied from interface:DeleteOrphanFilesPasses an alternative executor service that will be used for removing orphaned files. This service will only be used if a custom delete function is provided byDeleteOrphanFiles.deleteWith(Consumer)or if the FileIO does notsupport bulk deletes. Otherwise, parallelism should be controlled by the IO specificdeleteFilesmethod.If this method is not called and bulk deletes are not supported, orphaned manifests and data files will still be deleted in the current thread. - Specified by:
- executeDeleteWithin interface- DeleteOrphanFiles
- Parameters:
- executorService- the service to use
- Returns:
- this for method chaining
 
 - 
prefixMismatchModepublic DeleteOrphanFilesSparkAction prefixMismatchMode(DeleteOrphanFiles.PrefixMismatchMode newPrefixMismatchMode) Description copied from interface:DeleteOrphanFilesPasses a prefix mismatch mode that determines how this action should handle situations when the metadata references files that match listed/provided files except for authority/scheme.Possible values are "ERROR", "IGNORE", "DELETE". The default mismatch mode is "ERROR", which means an exception is thrown whenever there is a mismatch in authority/scheme. It's the recommended mismatch mode and should be changed only in some rare circumstances. If there is a mismatch, use DeleteOrphanFiles.equalSchemes(Map)andDeleteOrphanFiles.equalAuthorities(Map)to resolve conflicts by providing equivalent schemes and authorities. If it is impossible to determine whether the conflicting authorities/schemes are equal, set the prefix mismatch mode to "IGNORE" to skip files with mismatches. If you have manually inspected all conflicting authorities/schemes, provided equivalent schemes/authorities and are absolutely confident the remaining ones are different, set the prefix mismatch mode to "DELETE" to consider files with mismatches as orphan. It will be impossible to recover files after deletion, so the "DELETE" prefix mismatch mode must be used with extreme caution.- Specified by:
- prefixMismatchModein interface- DeleteOrphanFiles
- Parameters:
- newPrefixMismatchMode- mode for handling prefix mismatches
- Returns:
- this for method chaining
 
 - 
equalSchemespublic DeleteOrphanFilesSparkAction equalSchemes(java.util.Map<java.lang.String,java.lang.String> newEqualSchemes) Description copied from interface:DeleteOrphanFilesPasses schemes that should be considered equal.The key may include a comma-separated list of schemes. For instance, Map("s3a,s3,s3n", "s3"). - Specified by:
- equalSchemesin interface- DeleteOrphanFiles
- Parameters:
- newEqualSchemes- list of equal schemes
- Returns:
- this for method chaining
 
 - 
equalAuthoritiespublic DeleteOrphanFilesSparkAction equalAuthorities(java.util.Map<java.lang.String,java.lang.String> newEqualAuthorities) Description copied from interface:DeleteOrphanFilesPasses authorities that should be considered equal.The key may include a comma-separate list of authorities. For instance, Map("s1name,s2name", "servicename"). - Specified by:
- equalAuthoritiesin interface- DeleteOrphanFiles
- Parameters:
- newEqualAuthorities- list of equal authorities
- Returns:
- this for method chaining
 
 - 
locationpublic DeleteOrphanFilesSparkAction location(java.lang.String newLocation) Description copied from interface:DeleteOrphanFilesPasses a location which should be scanned for orphan files.If not set, the root table location will be scanned potentially removing both orphan data and metadata files. - Specified by:
- locationin interface- DeleteOrphanFiles
- Parameters:
- newLocation- the location where to look for orphan files
- Returns:
- this for method chaining
 
 - 
olderThanpublic DeleteOrphanFilesSparkAction olderThan(long newOlderThanTimestamp) Description copied from interface:DeleteOrphanFilesRemoves orphan files only if they are older than the given timestamp.This is a safety measure to avoid removing files that are being added to the table. For example, there may be a concurrent operation adding new files while this action searches for orphan files. New files may not be referenced by the metadata yet but they are not orphan. If not set, defaults to a timestamp 3 days ago. - Specified by:
- olderThanin interface- DeleteOrphanFiles
- Parameters:
- newOlderThanTimestamp- a long timestamp, as returned by- System.currentTimeMillis()
- Returns:
- this for method chaining
 
 - 
deleteWithpublic DeleteOrphanFilesSparkAction deleteWith(java.util.function.Consumer<java.lang.String> newDeleteFunc) Description copied from interface:DeleteOrphanFilesPasses an alternative delete implementation that will be used for orphan files.This method allows users to customize the delete function. For example, one may set a custom delete func and collect all orphan files into a set instead of physically removing them. If not set, defaults to using the table's ioimplementation.- Specified by:
- deleteWithin interface- DeleteOrphanFiles
- Parameters:
- newDeleteFunc- a function that will be called to delete files
- Returns:
- this for method chaining
 
 - 
compareToFileListpublic DeleteOrphanFilesSparkAction compareToFileList(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> files) 
 - 
executepublic DeleteOrphanFiles.Result execute() Description copied from interface:ActionExecutes this action.- Specified by:
- executein interface- Action<DeleteOrphanFiles,DeleteOrphanFiles.Result>
- Returns:
- the result of this action
 
 - 
sparkprotected org.apache.spark.sql.SparkSession spark() 
 - 
sparkContextprotected org.apache.spark.api.java.JavaSparkContext sparkContext() 
 - 
optionpublic ThisT option(java.lang.String name, java.lang.String value)
 - 
optionspublic ThisT options(java.util.Map<java.lang.String,java.lang.String> newOptions) 
 - 
optionsprotected java.util.Map<java.lang.String,java.lang.String> options() 
 - 
withJobGroupInfoprotected <T> T withJobGroupInfo(JobGroupInfo info, java.util.function.Supplier<T> supplier) 
 - 
newJobGroupInfoprotected JobGroupInfo newJobGroupInfo(java.lang.String groupId, java.lang.String desc) 
 - 
newStaticTableprotected Table newStaticTable(TableMetadata metadata, FileIO io) 
 - 
contentFileDSprotected org.apache.spark.sql.Dataset<FileInfo> contentFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds) 
 - 
manifestDSprotected org.apache.spark.sql.Dataset<FileInfo> manifestDS(Table table, java.util.Set<java.lang.Long> snapshotIds) 
 - 
manifestListDSprotected org.apache.spark.sql.Dataset<FileInfo> manifestListDS(Table table, java.util.Set<java.lang.Long> snapshotIds) 
 - 
statisticsFileDSprotected org.apache.spark.sql.Dataset<FileInfo> statisticsFileDS(Table table, java.util.Set<java.lang.Long> snapshotIds) 
 - 
otherMetadataFileDSprotected org.apache.spark.sql.Dataset<FileInfo> otherMetadataFileDS(Table table) 
 - 
allReachableOtherMetadataFileDSprotected org.apache.spark.sql.Dataset<FileInfo> allReachableOtherMetadataFileDS(Table table) 
 - 
loadMetadataTableprotected org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> loadMetadataTable(Table table, MetadataTableType type) 
 - 
deleteFilesprotected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(java.util.concurrent.ExecutorService executorService, java.util.function.Consumer<java.lang.String> deleteFunc, java.util.Iterator<FileInfo> files)Deletes files and keeps track of how many files were removed for each file type.- Parameters:
- executorService- an executor service to use for parallel deletes
- deleteFunc- a delete func
- files- an iterator of Spark rows of the structure (path: String, type: String)
- Returns:
- stats on which files were deleted
 
 - 
deleteFilesprotected org.apache.iceberg.spark.actions.BaseSparkAction.DeleteSummary deleteFiles(SupportsBulkOperations io, java.util.Iterator<FileInfo> files) 
 
- 
 
-