[SPARK-43123][SQL] Internal field metadata should not be leaked to catalogs #40776
Conversation
```scala
def removeInternalMetadata(schema: StructType): StructType = {
  StructType(schema.map { field =>
    val newMetadata = new MetadataBuilder().withMetadata(field.metadata)
      .remove(METADATA_COL_ATTR_KEY)
```
Nit: shall we define an array for all the internal metadata keys outside the method?
E.g.
```scala
val internalMetaData = Seq(
  METADATA_COL_ATTR_KEY,
  QUALIFIED_ACCESS_ONLY,
  ...
)
```
Nice catch
Thanks for the review, merging to master!
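For reference, a minimal sketch of what the suggested refactor could look like, pieced together from the fragments quoted in this review; the key list and surrounding scope are assumptions, not the merged code, and the key constants are assumed to be in scope:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StructType}

// Assumed consolidated key list; the merged change may contain more entries.
val INTERNAL_METADATA_KEYS = Seq(
  METADATA_COL_ATTR_KEY,
  QUALIFIED_ACCESS_ONLY,
  AUTO_GENERATED_ALIAS
)

def removeInternalMetadata(schema: StructType): StructType = {
  StructType(schema.map { field =>
    // Strip every internal key from this field's metadata in one pass.
    val builder = new MetadataBuilder().withMetadata(field.metadata)
    INTERNAL_METADATA_KEYS.foreach(builder.remove)
    field.copy(metadata = builder.build())
  })
}
```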
```scala
val AUTO_GENERATED_ALIAS = "__autoGeneratedAlias"

val INTERNAL_METADATA_KEYS = Seq(
  AUTO_GENERATED_ALIAS,
```
Are these metadata keys only used in top-level columns?
I believe so, after checking the code that generates them. For example, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L171-L175
[SPARK-43123](apache/spark#40776) fixed an issue where Spark could leak internal metadata, which caused Delta to store Spark's internal metadata in its table schema. Spark's internal metadata may trigger special behaviors. For example, if a column's metadata contains `__metadata_col`, the column cannot be selected by star: if we leak `__metadata_col` into any column of a Delta table, that column will no longer be returned by `SELECT *`.

Although [SPARK-43123](apache/spark#40776) fixes the issue in new Spark versions, internal metadata may already have been persisted in some Delta tables. To make these Delta tables readable again, this PR adds an extra step that cleans up internal metadata before returning the table schema to Spark.

GitOrigin-RevId: 60eb4046d55e955379c98e409993b33e753c5256
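A minimal sketch of such a cleanup step, with a hypothetical helper name; `__metadata_col` and `__autoGeneratedAlias` are quoted elsewhere on this page, while the remaining key string is an assumption and the real list is longer:

```scala
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StructType}

// Keys mentioned in this discussion; a real implementation would cover
// Spark's full internal key list.
val sparkInternalKeys =
  Seq("__metadata_col", "__qualified_access_only", "__autoGeneratedAlias")

// Hypothetical helper: scrub internal keys from a schema that was persisted
// with them, before handing it back to Spark.
def sanitizeTableSchema(schemaJson: String): StructType = {
  val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
  StructType(schema.map { field =>
    val builder = new MetadataBuilder().withMetadata(field.metadata)
    sparkInternalKeys.foreach(builder.remove)
    field.copy(metadata = builder.build())
  })
}
```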
What changes were proposed in this pull request?
In Spark, we have defined some internal field metadata to help with query resolution and compilation. For example, quite a few field metadata entries are related to metadata columns.
However, when we create tables, this internal field metadata can leak into the catalog. This PR updates the CTAS/RTAS commands to remove the internal field metadata before creating tables. The CREATE/REPLACE TABLE command is fine, as users cannot generate this internal field metadata via the type string.
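To illustrate the failure mode (the column name here is made up), a field whose metadata carries the `__metadata_col` marker is excluded from star expansion, so a table that persisted this key would silently drop the column from `SELECT *`:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// A data column that accidentally carries the internal marker. If CTAS
// persisted this metadata, SELECT * on the resulting table would no longer
// return event_id.
val leaked = StructField(
  "event_id",
  StringType,
  nullable = true,
  metadata = new MetadataBuilder().putBoolean("__metadata_col", true).build())

// Stripping the marker, as this PR does for CTAS/RTAS, makes the field an
// ordinary data column again.
val cleaned = leaked.copy(metadata = new MetadataBuilder()
  .withMetadata(leaked.metadata)
  .remove("__metadata_col")
  .build())
```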
Why are the changes needed?
To avoid potential issues, such as mistakenly treating a data column as a metadata column.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
A new test.