Skip to content
This repository was archived by the owner on May 28, 2024. It is now read-only.

198 add val data to training revised #208

Merged
merged 3 commits into from
Mar 8, 2023

Conversation

jsadler2
Copy link
Collaborator

@jsadler2 jsadler2 commented Mar 6, 2023

** This PR supercedes #202. Updated version of that after our discussion on our 3/1 meeting. **

I moved the old validation data into the training set and I moved the old "test" set into the validation set.

This also starts the training on the water year (#204) and excludes any provisional data (#135).

⚠️ This PR will constitute the final results for the paper since we would no longer have a test set.

closes #198
closes #204
closes #159
closes #166

@jsadler2 jsadler2 requested review from lekoenig and galengorski March 6, 2023 17:55
Copy link
Collaborator

@lekoenig lekoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jsadler2! I have a couple pretty minor questions just to confirm some of the assignments. I feel comfortable merging this PR whenever you are ready.

@@ -1,2 +1,2 @@
n_unique_siteids,n_unique_latlon,n_all_WQP_sites,n_all_nwis_daily_sites,n_all_nwis_inst_sites,n_unique_nwis_sites,n_sites_intersect_WQPNWIS,n_obsdays_total,n_obsdays_nwis,n_obsdays_discrete
1987,1936,1980,24,10,27,20,92403,56868,35535
1987,1936,1980,24,10,27,20,91961,56426,35535
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the WQP sites/records are omitted in #207 as part of #205

@@ -1,2 +1 @@
tar_name,hash
p2_wqp_data_subset,56f5edeb7d42b18b
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I am pretty sure this file gets omitted altogether in #207 as part of #205

@@ -212,6 +211,7 @@ p2a_targets_list <- list(
# we need these to make the prepped data file, so force a dependency of this
# target on p2a_well_obs_data.
p2a_well_obs_data
p2a_well_obs_data_zarr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to include, although I'm curious why you added p2a_well_obs_data_zarr in addition to the dependency on p2a_well_obs_data. Was there a change that didn't get picked up by p2a_well_obs_data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why, but the Zarr file was, for some reason, not being rebuilt as expected before I added it here.

I'm 94% sure this is reflecting a lack of understanding on my part, but I'm thinking it's not worth the time at this point to try to really figure out this behavior.

@@ -124,8 +120,7 @@ rule make_predictions:
def filter_predictions(all_preds_file, partition, out_file):
df_preds = pd.read_feather(all_preds_file)
all_sites = df_preds.site_id.unique()
trn_sites = all_sites[(~np.isin(all_sites, config["validation_sites"])) &
(~np.isin(all_sites, config["test_sites"]))]
trn_sites = all_sites[~np.isin(all_sites, config["validation_sites"])]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind confirming what this line in the .smk file is doing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsadler2 confirmed that the ~ refers to the inverse so what this line is doing is subsetting all_sites for the sites that are not within the validation_sites vector from the config file.

- "01475530"
- "01475548"
train_start_date: '1980-10-01'
train_end_date: '2015-10-01'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like from p2a_site_splits that records on 2015-10-01 are treated as validation data, is that right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@jsadler2 jsadler2 merged commit 0b238d5 into USGS-R:main Mar 8, 2023
@jsadler2 jsadler2 deleted the 198-add-val-data-to-training-REVISED branch March 8, 2023 16:45
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants