-
Notifications
You must be signed in to change notification settings - Fork 602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace use of the csv library by pandas built-ins for data ingestion #1534
Replace use of the csv library by pandas built-ins for data ingestion #1534
Conversation
For suggestion no.2 it seems like all the dialect parameters can be passed directly to
|
Assigning this review to @kiddinn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments,b ut thank you for the contribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
answer to one comment
let me know when you are ready for another round of review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few more comments
any updates? |
Hi, sorry I've been super busy with some uni coursework 😅 Will implement the suggested changes today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
letme know when ready for another round of review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
what's the status of this? There is also a conflict that must be resolved when syncing the branch.... |
@kiddinn I've just pushed the change that catches ValueErrors raised by |
@rayanht yes, we should include what rows were skipped, as much as we can provide we should do that. Also what the issue was, we need to provide the user as much details as we can here. |
making minor adjustments, style guide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made minor stylistic changes to the PR, otherwise LGTM
I might wait with the merging of this PR until Monday, we will need to do some more testing, since this is a pretty significant change. Thank you though for the PR, much appreciated. And will as I said most likely get merged on Monday after brief testing. |
We will merge this PR in as soon as we make a release, that should happen any time now. Since this is a large change in how we ingest files, we don't want to do that at the same time as we are changing how we index data. So we opted to wait until after the release to merge this in, giving us some more time to test and tweak the ingestion to make sure it works. |
Addresses #1489.
Summary of changes:
validate_csv_fields
to avoid duplicate code betweenread_and_validate_csv
andread_and_validate_redline
csv.DictReader
bypandas.read_csv
try... except
inread_and_validate_csv
sincepandas.read_csv
will catch parsing errors during the initial read.read_and_validate_redline
docstring to list the RuntimeError that might be raised during validation.Unimplemented suggestions:
read_and_validate_redline
can be replaced by a literal:Could simply become:
csv.register_dialect( 'redlineDialect', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
) by using pandas built-ins. Need to investigate