Class ColumnarBatchUtil

java.lang.Object
org.apache.iceberg.spark.data.vectorized.ColumnarBatchUtil

public class ColumnarBatchUtil extends Object
  • Method Summary

    Modifier and Type
    Method
    Description
    static boolean[]
    buildIsDeleted(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize)
    Builds a boolean array to indicate if a row is deleted or not.
    static Pair<int[],Integer>
    buildRowIdMapping(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize)
    Builds a row ID mapping inside a batch to skip deleted rows.
    static org.apache.spark.sql.vectorized.ColumnVector[]
    removeExtraColumns(DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, org.apache.spark.sql.vectorized.ColumnVector[] columnVectors)
    Removes extra column vectors added for processing equality delete filters that are not part of the final query output.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • buildRowIdMapping

      public static Pair<int[],Integer> buildRowIdMapping(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize)
      Builds a row ID mapping inside a batch to skip deleted rows.
       Initial state
       Data values: [v0, v1, v2, v3, v4, v5, v6, v7]
       Row ID mapping: [0, 1, 2, 3, 4, 5, 6, 7]
      
       Apply position deletes
       Position deletes: 2, 6
       Row ID mapping: [0, 1, 3, 4, 5, 7, -, -] (6 live records)
      
       Apply equality deletes
       Equality deletes: v1, v2, v3
       Row ID mapping: [0, 4, 5, 7, -, -, -, -] (4 live records)
       
      Parameters:
      columnVectors - the array of column vectors for the batch
      deletes - the delete filter containing delete information
      rowStartPosInBatch - the starting position of the row in the batch
      batchSize - the size of the batch
      Returns:
      the mapping array and the number of live rows, or null if nothing is deleted
    • buildIsDeleted

      public static boolean[] buildIsDeleted(org.apache.spark.sql.vectorized.ColumnVector[] columnVectors, DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, long rowStartPosInBatch, int batchSize)
      Builds a boolean array to indicate if a row is deleted or not.
       Initial state
       Data values: [v0, v1, v2, v3, v4, v5, v6, v7]
       Is deleted array: [F, F, F, F, F, F, F, F]
      
       Apply position deletes
       Position deletes: 2, 6
       Is deleted array: [F, F, T, F, F, F, T, F] (6 live records)
      
       Apply equality deletes
       Equality deletes: v1, v2, v3
       Is deleted array: [F, T, T, T, F, F, T, F] (4 live records)
       
      Parameters:
      columnVectors - the array of column vectors for the batch.
      deletes - the delete filter containing information about which rows should be deleted.
      rowStartPosInBatch - the starting position of the row in the batch, used to calculate the absolute position of the rows in the context of the entire dataset.
      batchSize - the number of rows in the current batch.
      Returns:
      an array of boolean values to indicate if a row is deleted or not
    • removeExtraColumns

      public static org.apache.spark.sql.vectorized.ColumnVector[] removeExtraColumns(DeleteFilter<org.apache.spark.sql.catalyst.InternalRow> deletes, org.apache.spark.sql.vectorized.ColumnVector[] columnVectors)
      Removes extra column vectors added for processing equality delete filters that are not part of the final query output.

      During query execution, additional columns may be included in the schema to evaluate equality delete filters. For example, if the table schema contains columns C1, C2, C3, C4, and C5, and the query is 'SELECT C5 FROM table'. While equality delete filters are applied on C3 and C4, the processing schema includes C5, C3, and C4. These extra columns (C3 and C4) are needed to identify rows to delete but are not included in the final result.

      This method removes the extra column vectors from the end of column vectors array, ensuring only the expected column vectors remain.

      Parameters:
      deletes - the delete filter containing delete information.
      columnVectors - the array of column vectors representing query result data
      Returns:
      a new column vectors array with extra column vectors removed, or the original column vectors array if no extra column vectors are found