
Replace use of the csv library by pandas built-ins for data ingestion #1534

Merged · 14 commits · Feb 26, 2021

Conversation

rayanht
Contributor

@rayanht rayanht commented Dec 16, 2020

Addresses #1489.

Summary of changes:

  • Extracted the header validation logic into a helper function validate_csv_fields to avoid duplicating code between read_and_validate_csv and read_and_validate_redline
  • Replaced calls to csv.DictReader with pandas.read_csv
  • Inverted the try...except in read_and_validate_csv, since pandas.read_csv catches parsing errors during the initial read.
  • Updated the read_and_validate_redline docstring to list the RuntimeError that might be raised during validation.

Unimplemented suggestions:

  • The consecutive dictionary assignments in read_and_validate_redline:

    row_to_yield = {}
    row_to_yield["message"] = summary
    row_to_yield["timestamp"] = timestamp
    row_to_yield["datetime"] = dt_iso_format
    row_to_yield["timestamp_desc"] = timestamp_desc
    row_to_yield["alert"] = alert  # Extra field
    row_to_yield["tag"] = tags  # Extra field

Could simply become a single literal:

    row_to_yield = {
        "message": summary,
        "timestamp": timestamp,
        "datetime": dt_iso_format,
        "timestamp_desc": timestamp_desc,
        "alert": alert,  # Extra field
        "tag": tags,  # Extra field
    }
  • There might be a way to get rid of the last remaining use of the csv library (csv.register_dialect('redlineDialect', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)) by using pandas built-ins. Needs investigation.

@google-cla google-cla bot added the cla: yes label Dec 16, 2020
@rayanht
Contributor Author

rayanht commented Dec 16, 2020

For suggestion no. 2, it seems all the dialect parameters can be passed directly to pandas.read_csv as keyword arguments, without having to create a csv.Dialect object. The only issue with that approach is the quoting argument, which takes either an int or a csv.QUOTE_* constant, meaning we either have to use a magic number there or keep the csv dependency.

[Screenshot: excerpt from the pandas documentation of read_csv]
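A sketch of what suggestion no. 2 might look like: the redline dialect parameters handed straight to pandas.read_csv as keyword arguments. The sample data is made up; keeping the csv import only for the QUOTE_ALL constant avoids the magic number (csv.QUOTE_ALL happens to equal 1):

```python
import csv
import io

import pandas as pd

# Hypothetical redline-style input: every field quoted, space after commas.
data = io.StringIO('"Timestamp", "Summary"\n"1614297600", "An event"\n')

# Equivalent to registering 'redlineDialect' and using it with csv.DictReader:
df = pd.read_csv(
    data,
    delimiter=',',
    quoting=csv.QUOTE_ALL,   # same int (1) we would otherwise hard-code
    skipinitialspace=True,   # strips the space after each comma
)
```

This drops the module-level csv.register_dialect call, though the csv import survives for the constant.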

@berggren berggren requested a review from kiddinn December 16, 2020 15:59
@berggren
Contributor

Assigning this review to @kiddinn

Contributor

@kiddinn kiddinn left a comment

A few comments, but thank you for the contribution.

Contributor

@kiddinn kiddinn left a comment

answer to one comment

@kiddinn
Contributor

kiddinn commented Dec 21, 2020

let me know when you are ready for another round of review

@rayanht rayanht requested a review from kiddinn December 22, 2020 15:52
Contributor

@kiddinn kiddinn left a comment

few more comments

@kiddinn
Contributor

kiddinn commented Jan 10, 2021

any updates?

@rayanht
Contributor Author

rayanht commented Jan 12, 2021

any updates?

Hi, sorry I've been super busy with some uni coursework 😅 Will implement the suggested changes today.

Contributor

@kiddinn kiddinn left a comment

let me know when you're ready for another round of review

Contributor

@kiddinn kiddinn left a comment

.

@kiddinn
Contributor

kiddinn commented Feb 5, 2021

what's the status of this?

There is also a conflict that must be resolved when syncing the branch.

@rayanht
Contributor Author

rayanht commented Feb 5, 2021

@kiddinn I've just pushed the change that catches ValueErrors raised by pandas.to_datetime. The behaviour is that the whole chunk gets skipped if we encounter a malformed date, and a warning log is emitted to inform the user. The log is pretty generic right now; should we maybe state in the log that the skipped chunk is the one spanning from row n to row m?
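The behaviour described above could be sketched roughly as follows. The function name, logger setup, and chunk-to-row arithmetic are illustrative assumptions, not the actual Timesketch code:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def convert_chunks(reader, chunksize=10000):
    """Yield chunks with parsed datetimes, skipping malformed chunks.

    A ValueError from pandas.to_datetime skips the whole chunk and logs
    a warning naming the row span, as suggested in the review.
    """
    for index, chunk in enumerate(reader):
        try:
            chunk['datetime'] = pd.to_datetime(chunk['datetime'])
        except ValueError:
            start = index * chunksize
            logger.warning(
                'Skipping chunk (rows %d-%d): malformed datetime value',
                start, start + len(chunk) - 1)
            continue
        yield chunk
```

Computing the row span from the chunk index and chunksize gives the user the "rows n to m" detail the reviewer asks for below.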

@kiddinn
Contributor

kiddinn commented Feb 5, 2021

@rayanht yes, we should include which rows were skipped; as much as we can provide, we should. Also include what the issue was. We need to give the user as many details as we can here.

Contributor

@kiddinn kiddinn left a comment

I made minor stylistic changes to the PR, otherwise LGTM

@kiddinn
Contributor

kiddinn commented Feb 5, 2021

I might wait with merging this PR until Monday; we will need to do some more testing, since this is a pretty significant change.

Thank you for the PR though, much appreciated. As I said, it will most likely get merged on Monday after brief testing.

@kiddinn kiddinn linked an issue Feb 9, 2021 that may be closed by this pull request
@kiddinn
Contributor

kiddinn commented Feb 18, 2021

We will merge this PR as soon as we make a release, which should happen any time now.

Since this is a large change in how we ingest files, we don't want to do that at the same time as we are changing how we index data. So we opted to wait until after the release to merge this in, giving us some more time to test and tweak the ingestion to make sure it works.

Successfully merging this pull request may close these issues.

Improve CSV import worker