Enable streaming / CDF streaming reads on column mapping enabled tables with fewer limitations #1358

Closed
wants to merge 2 commits into from

Contributor

@jackierwzhang jackierwzhang commented Aug 26, 2022

Description

Resolves #1357

Streaming uses the latest schema to read historical data batches, and column mapping schema changes (e.g. rename or drop column) can cause the latest schema to diverge from the schema those batches were written with. For that reason, we previously blocked streaming reads on column mapping tables entirely, as a temporary measure.

As a close follow-up, this PR enables at least the following use cases:

Reading from a column mapping table that has had no rename or drop column operations.
Upgrading a table to column mapping.
Existing compatible schema change operations such as ADD COLUMN.
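
To make the distinction between these cases and the still-blocked ones concrete, here is a minimal sketch in plain Python. It is not the code in this PR or Delta's actual check; the `Schema` model (stable field id mapped to logical name) and the helper names `is_read_compatible` and `can_stream` are assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical model: a schema maps each column's stable field id
# to its current logical name. Not Delta's real schema representation.
Schema = Dict[int, str]


def is_read_compatible(old: Schema, new: Schema) -> bool:
    """Return True if every column of `old` still exists in `new` under the same name."""
    for field_id, name in old.items():
        if field_id not in new:
            # The column was dropped.
            return False
        if new[field_id] != name:
            # The column was renamed.
            return False
    # Columns that exist only in `new` (ADD COLUMN) are fine.
    return True


def can_stream(schema_history: List[Schema]) -> bool:
    """Allow the read only if the latest schema is compatible with every earlier one."""
    latest = schema_history[-1]
    return all(is_read_compatible(schema, latest) for schema in schema_history)


v0 = {1: "id", 2: "name"}
v1 = {1: "id", 2: "name", 3: "email"}       # ADD COLUMN email
v2 = {1: "id", 2: "full_name", 3: "email"}  # RENAME name -> full_name
v3 = {1: "id", 3: "email"}                  # DROP COLUMN name

print(can_stream([v0, v1]))      # add only
print(can_stream([v0, v1, v2]))  # includes a rename
print(can_stream([v1, v3]))      # includes a drop
```

In this model, upgrading a table to column mapping leaves the logical names untouched, so it counts as compatible in the same way as a table with no schema changes at all.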

How was this patch tested?

New unit tests.

Does this PR introduce any user-facing changes?

No

@tdas tdas requested a review from scottsand-db August 26, 2022 18:06
Collaborator

@scottsand-db scottsand-db left a comment

Looks great! Thanks for doing this. Comments, code, and tests are all very clean and thorough. Left a few minor questions/comments.

@vkorukanti vkorukanti added this to the 2.2.0 milestone Dec 5, 2022
tnk-dev commented May 3, 2023

Hey Delta folks,

The only way to rename or drop columns in Delta without rewriting everything is to enable columnMapping, which collides with Change Data Feed and results in:
com.databricks.sql.transaction.tahoe.DeltaColumnMappingUnsupportedSchemaIncompatibleException: Change Data Feed (CDF) read is not supported on tables with column mapping schema changes (e.g. rename or drop)...
This can only be solved with spark.databricks.delta.changeDataFeed.unsafeBatchReadOnIncompatibleSchemaChanges.enabled. As a company we are kinda scared of enabling this option and are not sure at all about its ramifications. Plus, we don't really like the 2-character partitioning convention that the columnMapping option puts in place of the usual partition directory names in our data lake (S3).

So my question is:
What is the best practice here, according to Delta, for the pain points mentioned above?

@dennyglee

4 participants