
Replace use of the csv library by pandas built-ins for data ingestion #1534

Merged · 14 commits · Feb 26, 2021

Conversation

rayanht
Contributor

@rayanht rayanht commented Dec 16, 2020

Addresses #1489.

Summary of changes:

  • Extracted the header validation logic into a helper function validate_csv_fields to avoid duplicating code between read_and_validate_csv and read_and_validate_redline
  • Replaced calls to csv.DictReader with pandas.read_csv
  • Inverted the try...except in read_and_validate_csv, since pandas.read_csv catches parsing errors during the initial read.
  • Updated the read_and_validate_redline docstring to list the RuntimeError that might be raised during validation.

Unimplemented suggestions:

  • The consecutive dictionary assignments in read_and_validate_redline:

    row_to_yield = {}
    row_to_yield["message"] = summary
    row_to_yield["timestamp"] = timestamp
    row_to_yield["datetime"] = dt_iso_format
    row_to_yield["timestamp_desc"] = timestamp_desc
    row_to_yield["alert"] = alert  # Extra field
    row_to_yield["tag"] = tags  # Extra field

Could simply become a single literal:

    row_to_yield = {
        "message": summary,
        "timestamp": timestamp,
        "datetime": dt_iso_format,
        "timestamp_desc": timestamp_desc,
        "alert": alert,  # Extra field
        "tag": tags,  # Extra field
    }
  • There might be a way to get rid of the last remaining use of the csv library (csv.register_dialect('redlineDialect', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)) by using pandas built-ins. Needs investigation.

@google-cla google-cla bot added the cla: yes label Dec 16, 2020
@rayanht
Contributor Author

rayanht commented Dec 16, 2020

For suggestion no. 2, it seems all the dialect parameters can be passed directly to pandas.read_csv as keyword arguments, without having to create a csv.Dialect object. The only issue with that approach is the quoting argument, which takes either an int or a csv.QUOTE_* constant, meaning we either have to use a magic number there or keep the csv dependency.

[Screenshot: excerpt from the pandas documentation of read_csv]
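A sketch of what suggestion no. 2 might look like: the redline dialect parameters handed straight to pandas.read_csv as keyword arguments. The sample data is made up; keeping the csv import only for the QUOTE_ALL constant avoids the magic number (csv.QUOTE_ALL happens to equal 1):

```python
import csv
import io

import pandas as pd

# Hypothetical redline-style input: every field quoted, space after commas.
data = io.StringIO('"Timestamp", "Summary"\n"1614297600", "An event"\n')

# Equivalent to registering 'redlineDialect' and using it with csv.DictReader:
df = pd.read_csv(
    data,
    delimiter=',',
    quoting=csv.QUOTE_ALL,   # same int (1) we would otherwise hard-code
    skipinitialspace=True,   # strips the space after each comma
)
```

This drops the module-level csv.register_dialect call, though the csv import survives for the constant.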

@berggren berggren requested a review from kiddinn December 16, 2020 15:59
@berggren
Contributor

Assigning this review to @kiddinn

Contributor

@kiddinn kiddinn left a comment

A few comments, but thank you for the contribution.

Contributor

@kiddinn kiddinn left a comment

answer to one comment

@kiddinn
Contributor

kiddinn commented Dec 21, 2020

let me know when you are ready for another round of review

@rayanht rayanht requested a review from kiddinn December 22, 2020 15:52
Contributor

@kiddinn kiddinn left a comment

few more comments

@kiddinn
Contributor

kiddinn commented Jan 10, 2021

any updates?

@rayanht
Contributor Author

rayanht commented Jan 12, 2021

any updates?

Hi, sorry I've been super busy with some uni coursework 😅 Will implement the suggested changes today.

Contributor

@kiddinn kiddinn left a comment

let me know when you're ready for another round of review

Contributor

@kiddinn kiddinn left a comment

.

@kiddinn
Contributor

kiddinn commented Feb 5, 2021

what's the status of this?

There is also a conflict that must be resolved when syncing the branch.

@rayanht
Contributor Author

rayanht commented Feb 5, 2021

@kiddinn I've just pushed the change that catches ValueErrors raised by pandas.to_datetime. The behaviour is that the whole chunk gets skipped if we encounter a malformed date, and a warning log is emitted to inform the user. The log is pretty generic right now; should we maybe state in the log that the skipped chunk is the one spanning from row n to row m?
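The behaviour described above could be sketched roughly as follows. The function name, logger setup, and chunk-to-row arithmetic are illustrative assumptions, not the actual Timesketch code:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def convert_chunks(reader, chunksize=10000):
    """Yield chunks with parsed datetimes, skipping malformed chunks.

    A ValueError from pandas.to_datetime skips the whole chunk and logs
    a warning naming the row span, as suggested in the review.
    """
    for index, chunk in enumerate(reader):
        try:
            chunk['datetime'] = pd.to_datetime(chunk['datetime'])
        except ValueError:
            start = index * chunksize
            logger.warning(
                'Skipping chunk (rows %d-%d): malformed datetime value',
                start, start + len(chunk) - 1)
            continue
        yield chunk
```

Computing the row span from the chunk index and chunksize gives the user the "rows n to m" detail the reviewer asks for below.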

@kiddinn
Contributor

kiddinn commented Feb 5, 2021

@rayanht yes, we should include which rows were skipped; as much as we can provide, we should. Also include what the issue was. We need to give the user as many details as we can here.

Contributor

@kiddinn kiddinn left a comment

I made minor stylistic changes to the PR, otherwise LGTM

@kiddinn
Contributor

kiddinn commented Feb 5, 2021

I might wait with merging this PR until Monday; we will need to do some more testing, since this is a pretty significant change.

Thank you for the PR though, much appreciated. As I said, it will most likely get merged on Monday after brief testing.

@kiddinn kiddinn linked an issue Feb 9, 2021 that may be closed by this pull request
@kiddinn
Contributor

kiddinn commented Feb 18, 2021

We will merge this PR as soon as we make a release, which should happen any time now.

Since this is a large change in how we ingest files, we don't want to do that at the same time as we are changing how we index data. So we opted to wait until after the release to merge this in, giving us some more time to test and tweak the ingestion to make sure it works.

Successfully merging this pull request may close these issues.

Improve CSV import worker