Class FileSystemWalker

java.lang.Object
org.apache.iceberg.util.FileSystemWalker

public class FileSystemWalker extends Object
Utility class for recursively traversing file systems and identifying hidden paths. Its methods list files recursively and skip hidden paths according to the supplied criteria.
  • Method Details

    • listDirRecursivelyWithFileIO

      public static void listDirRecursivelyWithFileIO(SupportsPrefixOperations io, String dir, Map<Integer,PartitionSpec> specs, Predicate<FileInfo> filter, Consumer<String> fileConsumer)
      Recursively lists the files under the specified directory that satisfy the given filter. Uses FileSystemWalker.PartitionAwareHiddenPathFilter to filter out hidden paths.
      Parameters:
      io - FileIO implementation that supports prefix operations
      dir - Base directory to start recursive listing
      specs - Map of partition specs for this table. Used to prevent partition directories from being filtered as hidden paths.
      filter - File filter condition; only files that satisfy it are collected
      fileConsumer - Consumer to accept matching file locations
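      The role of specs can be illustrated with a small, self-contained sketch. It assumes that names starting with "_" or "." count as hidden and that partition directories are named field=value; neither convention is stated on this page. The class HiddenPathFilterSketch and the method visiblePathFilter are hypothetical and do not call the Iceberg API.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class HiddenPathFilterSketch {

  // Assumption: a name is "hidden" when it starts with '_' or '.',
  // unless it is a partition directory of the form "<fieldName>=<value>".
  // The returned predicate yields true for names that should be kept.
  public static Predicate<String> visiblePathFilter(Set<String> partitionFieldNames) {
    return name -> {
      boolean looksHidden = name.startsWith("_") || name.startsWith(".");
      if (!looksHidden) {
        return true;
      }
      // A partition field whose name starts with '_' must not be skipped.
      for (String field : partitionFieldNames) {
        if (name.startsWith(field + "=")) {
          return true;
        }
      }
      return false;
    };
  }

  public static void main(String[] args) {
    // "_region" stands in for a partition field taken from the table's specs.
    Predicate<String> keep = visiblePathFilter(Set.of("_region"));
    Consumer<String> fileConsumer = name -> System.out.println("kept: " + name);

    for (String name : List.of("data", "_region=eu", "_tmp", ".staging")) {
      if (keep.test(name)) {
        fileConsumer.accept(name);
      }
    }
  }
}
```

      In this sketch "_region=eu" is kept because "_region" is a known partition field, while "_tmp" and ".staging" are skipped.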
    • listDirRecursivelyWithHadoop

      public static void listDirRecursivelyWithHadoop(String dir, Map<Integer,PartitionSpec> specs, Predicate<org.apache.hadoop.fs.FileStatus> filter, org.apache.hadoop.conf.Configuration conf, int maxDepth, int maxDirectSubDirs, Consumer<String> directoryConsumer, Consumer<String> fileConsumer)
      Recursively traverses the specified directory using the Hadoop FileSystem API and collects the file paths that satisfy the filter.

      This method limits both the recursion depth and the number of subdirectories it descends into:

      • When the maximum recursion depth is reached, traversal stops and the current directory is passed to directoryConsumer as pending.
      • When the number of direct subdirectories exceeds maxDirectSubDirs, traversal stops and those subdirectories are passed to directoryConsumer as pending.
      Parameters:
      dir - The starting directory path to traverse
      specs - Map of partition specs for this table. Used to prevent partition directories from being filtered as hidden paths.
      filter - File filter condition; only files that satisfy it are collected
      conf - Hadoop's configuration used to load the FileSystem
      maxDepth - Maximum recursion depth limit
      maxDirectSubDirs - Maximum number of direct subdirectories a directory may contain for them to be traversed immediately
      directoryConsumer - Consumer that receives the paths of directories left unprocessed
      fileConsumer - Consumer that receives the paths of files that satisfy the filter
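      The depth and subdirectory limits described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: an in-memory Map stands in for the Hadoop FileSystem, hidden-path filtering is left out, and the class BoundedWalkSketch and its walk method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class BoundedWalkSketch {

  // Hypothetical stand-in for a file system: each directory path maps to the
  // full paths of its direct children. A child is a directory if it is a key.
  public static void walk(
      Map<String, List<String>> tree,
      String dir,
      Predicate<String> filter,
      int maxDepth,
      int maxDirectSubDirs,
      Consumer<String> directoryConsumer,
      Consumer<String> fileConsumer) {
    // Depth budget exhausted: hand the directory back instead of listing it.
    if (maxDepth <= 0) {
      directoryConsumer.accept(dir);
      return;
    }

    List<String> subDirs = new ArrayList<>();
    for (String child : tree.getOrDefault(dir, List.of())) {
      if (tree.containsKey(child)) {
        subDirs.add(child);
      } else if (filter.test(child)) {
        fileConsumer.accept(child);
      }
    }

    // Too many direct subdirectories: defer all of them to the caller.
    if (subDirs.size() > maxDirectSubDirs) {
      subDirs.forEach(directoryConsumer);
      return;
    }

    for (String subDir : subDirs) {
      walk(tree, subDir, filter, maxDepth - 1, maxDirectSubDirs,
          directoryConsumer, fileConsumer);
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> tree = Map.of(
        "/t", List.of("/t/a", "/t/f1"),
        "/t/a", List.of("/t/a/b", "/t/a/f2"),
        "/t/a/b", List.of("/t/a/b/f3"));

    List<String> pending = new ArrayList<>();
    List<String> files = new ArrayList<>();
    walk(tree, "/t", path -> true, 2, 10, pending::add, files::add);

    // prints: files=[/t/f1, /t/a/f2] pending=[/t/a/b]
    System.out.println("files=" + files + " pending=" + pending);
  }
}
```

      With maxDepth set to 2, the directory /t/a/b is reported as pending rather than listed, so a caller could process it later, for example in a separate task.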