This repository was archived by the owner on May 28, 2024. It is now read-only.

Use provisional NWIS data? #135

Closed
lekoenig opened this issue Jun 16, 2022 · 6 comments · Fixed by #206
@lekoenig
Collaborator

From @lekoenig in #134:

I've set earliest_date and latest_date as 10/1/1979 and 12/31/2021, respectively. I chose this latest_date because I was under the impression that data records are meant to be approved within 120 days; however, I noticed there's still some provisional data starting in August 2021. We may ultimately just opt to omit provisional data to avoid what we saw in #78, where considerable chunks of data had been removed from NWIS in between data pulls for several of our sites.

Should we omit provisional data from the model input files?
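If we do decide to omit provisional data, a minimal sketch of the filter could look like the following. It assumes the approval codes follow the pattern that shows up in the inventories later in this thread ("A", "A e", "P", "P ***", "P Ssn", NA). The helper name `drop_provisional()` and the toy values are hypothetical, and treating NA codes as not approved is an assumption rather than a settled decision.

```r
# Toy records; the column names mirror the daily-data output in this thread,
# but the values are invented for illustration.
daily <- data.frame(
  site_no  = c("01473500", "01473500", "01475548", "01481500"),
  Date     = as.Date(c("2021-10-06", "2021-10-07", "2021-08-27", "2021-12-13")),
  Value    = c(9.1, 8.7, 7.9, 11.2),
  Value_cd = c("A", "P", "P ***", "A e"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose code starts with "A" (approved).
# This drops "P", "P ***", "P Ssn", and (by assumption) NA codes.
drop_provisional <- function(df, cd_col = "Value_cd") {
  approved <- !is.na(df[[cd_col]]) & grepl("^A", df[[cd_col]])
  df[approved, , drop = FALSE]
}

approved_only <- drop_provisional(daily)
```

Matching on the leading "A" rather than on an exact code keeps qualified approved values such as "A e".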

@lekoenig
Collaborator Author

lekoenig commented Nov 2, 2022

There are 304 unique days in our dataset with DO records marked as "provisional":

01473500: 55 days (10/7/2021 - 11/30/2021)
01475530: 87 days (9/7/2021 - 12/2/2021)
01475548: 98 days (8/27/2021 - 12/2/2021)
01480617: 45 days (10/18/2021 - 12/1/2021) now tagged as approved on NWIS
01481500: 19 days (12/13/2021 - 12/31/2021) now tagged as approved on NWIS

As of 11/2/2022, the time series underlying 64 of those records have now been approved but 240 are still listed as "provisional." Since it's been nearly 12 months since most of those records were collected, I've contacted the PA WSC requesting more information.

@lekoenig
Collaborator Author

lekoenig commented Mar 1, 2023

I've been checking in on these data periodically; here's an update as of 3/1/2023:

01473500: 55 days (10/7/2021 - 11/30/2021)
01475530: 87 days (9/7/2021 - 12/2/2021) now tagged as approved on NWIS
01475548: 98 days (8/27/2021 - 12/2/2021) now tagged as approved on NWIS
01480617: 45 days (10/18/2021 - 12/1/2021) now tagged as approved on NWIS
01481500: 19 days (12/13/2021 - 12/31/2021) now tagged as approved on NWIS

library(dplyr) # provides %>%, group_by(), and summarize() used below

# 1) 01473500
dat_01473500 <- dataRetrieval::readNWISuv(siteNumber = "01473500", parameterCd = "00300", startDate = "2021-10-07", endDate = "2021-11-30")
unique(dat_01473500$X_00300_00000_cd)
#> [1] "A" "P"
dat_01473500 %>% group_by(X_00300_00000_cd) %>% summarize(n = n())
#> # A tibble: 2 x 2
#>   X_00300_00000_cd     n
#>   <chr>            <int>
#> 1 A                   44
#> 2 P                 5232

# 2) 01475530
dat_01475530 <- dataRetrieval::readNWISuv(siteNumber = "01475530", parameterCd = "00300", startDate = "2021-09-05", endDate = "2021-12-05")
unique(dat_01475530$X_00300_00000_cd)
#> [1] "A"

# 3) 01475548
dat_01475548 <- dataRetrieval::readNWISuv(siteNumber = "01475548", parameterCd = "00300", startDate = "2021-08-25", endDate = "2021-12-31")
unique(dat_01475548$X_00300_00000_cd)
#> [1] "A"

# 4) 01480617
dat_01480617 <- dataRetrieval::readNWISuv(siteNumber = "01480617", parameterCd = "00300", startDate = "2021-10-05", endDate = "2021-12-05")
unique(dat_01480617$X_00300_00000_cd)
#> [1] "A"

# 5) 01481500
dat_01481500 <- dataRetrieval::readNWISuv(siteNumber = "01481500", parameterCd = "00300", startDate = "2021-12-05", endDate = "2021-12-31")
unique(dat_01481500$X_00300_00000_cd)
#> [1] "A"

So it looks like 01473500 is the only site whose data still haven't been approved.
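The same check can be done in one pass if the per-site results are stacked into a single data frame. The sketch below uses a small invented data frame as a stand-in for that combined output (it does not call NWIS); only the column names `site_no` and `X_00300_00000_cd` come from the output above.

```r
# Toy stand-in for the combined results of the five queries above;
# the rows are invented for illustration.
uv <- data.frame(
  site_no          = c("01473500", "01473500", "01473500", "01475530", "01475548"),
  X_00300_00000_cd = c("A", "P", "P", "A", "A"),
  stringsAsFactors = FALSE
)

# Cross-tabulate approval codes by site
code_counts <- table(uv$site_no, uv$X_00300_00000_cd)

# Sites with at least one provisional record (any code starting with "P")
has_p <- grepl("^P", uv$X_00300_00000_cd)
sites_with_provisional <- unique(uv$site_no[has_p])
```

With real data, `sites_with_provisional` would list every site that still needs follow-up, without having to inspect each site separately.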

do_snippet

@lekoenig lekoenig self-assigned this Mar 1, 2023
@lekoenig
Collaborator Author

lekoenig commented Mar 1, 2023

Inventory provisional data after re-pulling the data on 3/1/2023 (using latest_date = '2021-12-31'):

library(tidyverse)
targets::tar_load(p1_daily_data)
p1_daily_data %>% group_by(Value_cd) %>% summarize(n = n())
#> # A tibble: 4 x 2
#>   Value_cd     n
#>   <chr>    <int>
#> 1 A        55552
#> 2 A e          1
#> 3 P           55
#> 4 NA         146

# Value_Max_cd and Value_Min_cd have the same output as below, so only showing Value_cd (mean)
filter(p1_daily_data, Value_cd == "P") %>% group_by(site_no) %>% summarize(n = n())
#> # A tibble: 1 x 2
#>   site_no      n
#>   <chr>    <int>
#> 1 01473500    55

# now check instantaneous data
targets::tar_load(p1_inst_data)
p1_inst_data %>% group_by(Value_Inst_cd) %>% summarize(n = n())
#> # A tibble: 6 x 2
#>   Value_Inst_cd       n
#>   <chr>           <int>
#> 1 A             1863536
#> 2 A e                 1
#> 3 P                5271
#> 4 P ***               9
#> 5 P Ssn               2
#> 6 NA                  2

# 01473500 is the biggest source of provisional codes
filter(p1_inst_data, Value_Inst_cd %in% c("P", "P ***", "P Ssn")) %>% 
   group_by(site_no, Value_Inst_cd) %>% 
   summarize(n = n(), .groups = 'drop')
#> # A tibble: 5 x 3
#>   site_no  Value_Inst_cd     n
#>   <chr>    <chr>         <int>
#> 1 01473500 P              5270
#> 2 01473500 P Ssn             1
#> 3 01475548 P ***             9
#> 4 01481000 P                 1
#> 5 01481000 P Ssn             1

filter(p1_inst_data, site_no == "01473500", Value_Inst_cd %in% c("P", "P ***", "P Ssn")) %>% 
   mutate(date = as.Date(dateTime)) %>%
   pull(date) %>% 
   range()
#> [1] "2021-10-07" "2021-12-01"

# How many observation-days do we have now across mean/min/max (in other words, how 
# many days have at least one non-NA value for mean-DO, max-DO, or min-DO)?
targets::tar_load(p2_daily_combined)
dim(p2_daily_combined)
#> [1] 56830    10
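For reference, the "at least one non-NA value" rule described above can be sketched as follows. The column names `do_mean`, `do_min`, and `do_max` are hypothetical, since the actual columns of p2_daily_combined aren't shown in this thread, and the values are invented.

```r
# Toy site-days; column names and values are assumptions for illustration.
combined <- data.frame(
  site_no = c("01473500", "01473500", "01475548"),
  date    = as.Date(c("2021-10-06", "2021-10-07", "2021-08-27")),
  do_mean = c(8.1, NA, NA),
  do_min  = c(7.0, NA, 6.5),
  do_max  = c(9.2, NA, NA),
  stringsAsFactors = FALSE
)

do_cols <- c("do_mean", "do_min", "do_max")

# A site-day counts as an observation-day if any of the three metrics is non-NA
has_obs    <- rowSums(!is.na(combined[, do_cols])) > 0
n_obs_days <- sum(has_obs)
```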

@lekoenig
Collaborator Author

lekoenig commented Mar 1, 2023

Inventory provisional data after re-pulling the data on 3/1/2023 (updating NWIS pull date to match proposed validation time period latest_date = '2022-10-01'):

library(tidyverse)
targets::tar_load(p1_daily_data)
p1_daily_data %>% group_by(Value_cd) %>% summarize(n = n())
#> # A tibble: 5 x 2
#>   Value_cd     n
#>   <chr>    <int>
#> 1 A        55943
#> 2 A e          1
#> 3 P         1005
#> 4 P ***        2
#> 5 NA         146

# Value_Max_cd and Value_Min_cd have the same output as below, so only showing Value_cd (mean)
filter(p1_daily_data, Value_cd == "P") %>% group_by(site_no) %>% summarize(n = n())
#> # A tibble: 6 x 2
#>   site_no      n
#>   <chr>    <int>
#> 1 01473500   270
#> 2 01475530   204
#> 3 01480617    58
#> 4 01480870   214
#> 5 01481000   214
#> 6 01481500    45

# Here are the date ranges for those provisional daily data
filter(p1_daily_data, Value_Min_cd == "P") %>% 
  group_by(site_no) %>% 
  summarize(min_date = min(Date), max_date = max(Date))
#> # A tibble: 6 x 3
#>   site_no  min_date   max_date  
#>   <chr>    <date>     <date>    
#> 1 01473500 2021-10-07 2022-10-01
#> 2 01475530 2022-03-09 2022-10-01
#> 3 01480617 2022-08-05 2022-10-01
#> 4 01480870 2022-02-26 2022-10-01
#> 5 01481000 2022-03-02 2022-10-01
#> 6 01481500 2022-08-18 2022-10-01

# now check instantaneous data
targets::tar_load(p1_inst_data)
p1_inst_data %>% group_by(Value_Inst_cd) %>% summarize(n = n())
#> # A tibble: 6 x 2
#>   Value_Inst_cd       n
#>   <chr>           <int>
#> 1 A             1901164
#> 2 A e                 1
#> 3 P              116403
#> 4 P ***              94
#> 5 P Ssn               5
#> 6 NA                  2

# some sites have a decent amount of provisional data
filter(p1_inst_data, Value_Inst_cd %in% c("P", "P ***", "P Ssn")) %>% 
        group_by(site_no, Value_Inst_cd) %>% 
        summarize(n = n(), .groups = 'drop')
#> # A tibble: 13 x 3
#>    site_no  Value_Inst_cd     n
#>    <chr>    <chr>         <int>
#>  1 01473500 P             25947
#>  2 01473500 P Ssn             2
#>  3 01475530 P             19742
#>  4 01475530 P Ssn             1
#>  5 01475548 P             19521
#>  6 01475548 P ***             9
#>  7 01475548 P Ssn             1
#>  8 01480617 P              5536
#>  9 01480870 P             20819
#> 10 01480870 P ***            85
#> 11 01481000 P             20561
#> 12 01481000 P Ssn             1
#> 13 01481500 P              4277

# when does the provisional data start and stop, counting any provisional code?
# (this query runs on p1_daily_data, so the ranges are for the daily values)
filter(p1_daily_data, Value_Min_cd %in% c("P", "P ***", "P Ssn")) %>% 
     group_by(site_no) %>% 
     summarize(min_date = min(Date), max_date = max(Date))
#> # A tibble: 6 x 3
#>   site_no  min_date   max_date  
#>   <chr>    <date>     <date>    
#> 1 01473500 2021-10-07 2022-10-01
#> 2 01475530 2022-03-09 2022-10-01
#> 3 01480617 2022-08-05 2022-10-01
#> 4 01480870 2022-02-26 2022-10-01
#> 5 01481000 2022-03-02 2022-10-01
#> 6 01481500 2022-08-18 2022-10-01

# How many observation-days do we have now across mean/min/max (in other words, how 
# many days have at least one non-NA value for mean-DO, max-DO, or min-DO)?
targets::tar_load(p2_daily_combined)
dim(p2_daily_combined)
#> [1] 58379    10

@jsadler2
Collaborator

jsadler2 commented Mar 1, 2023

Thanks for checking on all of that, @lekoenig. Good to know the extent of the provisional data. That said, I'm not sure it will affect us too much. In the latest runs, I set the end of the test set to 2021-10-01. We could theoretically go through 2022-10-01, but then we'd need to update the gridmet data. The way things are set up, we have to go in 365-day increments, so we can't end in summer 2022, which is where our current gridmet data ends. I don't know that adding one more year to our test set is really going to give us much anyway, when we already have 6 years of data.
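To make the 365-day constraint concrete, here is a small date-arithmetic sketch. The `gridmet_end` value is a placeholder assumption: the exact date isn't given here, only that the gridmet data ends in summer 2022.

```r
test_end <- as.Date("2021-10-01")

# The next allowable end date is one 365-day increment later
next_end <- test_end + 365

# Placeholder only: the real gridmet end date is not stated in this thread
gridmet_end <- as.Date("2022-07-31")

# Extending the test set is only possible if gridmet covers the new end date
can_extend <- next_end <= gridmet_end
```

Under that assumption `next_end` lands on 2022-10-01, past the end of the gridmet record, so the test set can't be extended without updating gridmet first.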

Sorry for not looking at those dates earlier! I guess knowing that we were stopping at 2021-10-01 could have saved you the time to do the analysis.

@lekoenig
Collaborator Author

lekoenig commented Mar 2, 2023

Oh, OK thanks for confirming those dates! I hadn't realized until yesterday that the NWIS latest_date differed from the test_end_date.

You're right, if we stick with 2021-10-01 we avoid the data that is still tagged as provisional. Before yesterday, I hadn't re-pulled the DO data since 6/11/2022. In the meantime, it looks like some sites took down ~45-90 days' worth of data from NWIS, including the sites mentioned within this issue. I guess some data was probably determined to be bad during the records approval process.

diff_daily_data

diff_inst_data
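A comparison like this between two pulls can be sketched roughly as below. It is not the code behind the figures above; the data frames `old_pull` and `new_pull` and their contents are invented, and matching on site plus date is an assumption about what identifies a record.

```r
# Toy versions of an earlier and a later data pull; values are invented.
old_pull <- data.frame(
  site_no = c("01473500", "01473500", "01475548"),
  Date    = as.Date(c("2021-10-05", "2021-10-06", "2021-08-27")),
  stringsAsFactors = FALSE
)
new_pull <- data.frame(
  site_no = "01473500",
  Date    = as.Date("2021-10-05"),
  stringsAsFactors = FALSE
)

# Build a site-date key so records can be matched across pulls
key <- function(df) paste(df$site_no, df$Date)

# Site-days present in the earlier pull but missing from the later one
removed <- old_pull[!key(old_pull) %in% key(new_pull), , drop = FALSE]

# Number of removed days per site
removed_per_site <- table(removed$site_no)
```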

@lekoenig lekoenig linked a pull request Mar 2, 2023 that will close this issue