[SPARK-43123][SQL] Internal field metadata should not be leaked to catalogs #40776
Conversation
```scala
def removeInternalMetadata(schema: StructType): StructType = {
  StructType(schema.map { field =>
    val newMetadata = new MetadataBuilder().withMetadata(field.metadata)
      .remove(METADATA_COL_ATTR_KEY)
```
Nit: shall we define an array for all the internal metadata keys outside the method?
E.g.
```scala
val internalMetaData = Seq(
  METADATA_COL_ATTR_KEY,
  QUALIFIED_ACCESS_ONLY,
  ...
)
```
Nice catch
Thanks for the review, merging to master!
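For reference, a minimal sketch of what the suggested refactor could look like, pieced together from the fragments quoted in this review; the key list and surrounding scope are assumptions, not the merged code, and the key constants are assumed to be in scope:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StructType}

// Assumed consolidated key list; the merged change may contain more entries.
val INTERNAL_METADATA_KEYS = Seq(
  METADATA_COL_ATTR_KEY,
  QUALIFIED_ACCESS_ONLY,
  AUTO_GENERATED_ALIAS
)

def removeInternalMetadata(schema: StructType): StructType = {
  StructType(schema.map { field =>
    // Strip every internal key from this field's metadata in one pass.
    val builder = new MetadataBuilder().withMetadata(field.metadata)
    INTERNAL_METADATA_KEYS.foreach(builder.remove)
    field.copy(metadata = builder.build())
  })
}
```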
```scala
val AUTO_GENERATED_ALIAS = "__autoGeneratedAlias"

val INTERNAL_METADATA_KEYS = Seq(
  AUTO_GENERATED_ALIAS,
```
Are these metadata keys only used in top-level columns?
I believe so, after checking the code that generates them. For example, https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L171-L175
[SPARK-43123](apache/spark#40776) fixed an issue where Spark could leak internal metadata, which caused Delta to store Spark's internal metadata in its table schema. Spark's internal metadata may trigger special behaviors. For example, if a column's metadata contains `__metadata_col`, the column cannot be selected by star: if we leak `__metadata_col` into any column of a Delta table, that column will no longer be returned by `SELECT *`.

Although [SPARK-43123](apache/spark#40776) fixes the issue in new Spark versions, internal metadata may already have been persisted in some Delta tables. To make these Delta tables readable again, this PR adds an extra step that cleans up internal metadata before returning the table schema to Spark.

GitOrigin-RevId: 60eb4046d55e955379c98e409993b33e753c5256
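A minimal sketch of such a cleanup step, with a hypothetical helper name; `__metadata_col` and `__autoGeneratedAlias` are quoted elsewhere on this page, while the remaining key string is an assumption and the real list is longer:

```scala
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StructType}

// Keys mentioned in this discussion; a real implementation would cover
// Spark's full internal key list.
val sparkInternalKeys =
  Seq("__metadata_col", "__qualified_access_only", "__autoGeneratedAlias")

// Hypothetical helper: scrub internal keys from a schema that was persisted
// with them, before handing it back to Spark.
def sanitizeTableSchema(schemaJson: String): StructType = {
  val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
  StructType(schema.map { field =>
    val builder = new MetadataBuilder().withMetadata(field.metadata)
    sparkInternalKeys.foreach(builder.remove)
    field.copy(metadata = builder.build())
  })
}
```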
What changes were proposed in this pull request?
In Spark, we have defined some internal field metadata to help with query resolution and compilation. For example, quite a few field metadata entries are related to metadata columns.
However, when we create tables, this internal field metadata can leak into the catalog. This PR updates the CTAS/RTAS commands to remove the internal field metadata before creating tables. The CREATE/REPLACE TABLE command is fine, as users cannot generate this internal field metadata via the type string.
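To illustrate the failure mode (the column name here is made up), a field whose metadata carries the `__metadata_col` marker is excluded from star expansion, so a table that persisted this key would silently drop the column from `SELECT *`:

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// A data column that accidentally carries the internal marker. If CTAS
// persisted this metadata, SELECT * on the resulting table would no longer
// return event_id.
val leaked = StructField(
  "event_id",
  StringType,
  nullable = true,
  metadata = new MetadataBuilder().putBoolean("__metadata_col", true).build())

// Stripping the marker, as this PR does for CTAS/RTAS, makes the field an
// ordinary data column again.
val cleaned = leaked.copy(metadata = new MetadataBuilder()
  .withMetadata(leaked.metadata)
  .remove("__metadata_col")
  .build())
```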
Why are the changes needed?
To avoid potential issues, such as mistakenly treating a data column as a metadata column.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
A new test.