Bug #702
closed
Enter basic PLANTS: Evaluate implementation of PLANTS data-c
Added by Michael Lee about 22 years ago. Updated almost 20 years ago.
0%
Description
RP evaluates ML's testing of the PLANTS implementation and generally makes sure
the PLANTS data are working as they should.
Related issues
Updated by Michael Lee almost 22 years ago
- Bug 576 has been marked as a duplicate of this bug.
Updated by Michael Lee over 21 years ago
Depends on Gabe finishing the loading of plants on a machine that allows ODBC
access to the database.
Updated by Michael Lee over 21 years ago
John really doesn't want to know about this anymore.
Updated by Michael Lee over 21 years ago
The plants on beta look pretty good. The Usage.Name_ID issue that I saw on
communities doesn't seem to be an issue here (anymore).
The final issue to work out is the correlation table. Right now, the synonym is
getting dumped into the plantUsage.acceptedSynonym field, which seems logical,
except that this field is not part of the logical model. It is dangerous, too,
as there could be many synonyms for one plantConcept that is "not accepted".
The real home for this information is the commCorrelation table, with
convergence = "equal" or "undetermined" (?). I have "equal" in VegBranch
(you could look at a vegbank_module to see how this works, too).
Bob, if you want to poke around here some more, that'd be great. For now I'll
pass the bug to Gabe. Gabe, please pass it back to Bob and me for more
evaluation once correlation is being populated.
Here's my rough notes of looking through the plant taxonomy:
-----------------------------------------------------------------------
concept: Carya buckleyi (ref=USDA) OK
status: Carya buckleyi is not accepted (correct), level correct, parent correct
usage : 3 systems: correct names, correct name_ID's good
correlation: need to add this: This concept (since it is not accepted) links
via status to concept_ID = 2844 (Carya texana Buckl.)
(This is currently only getting dumped into usage.acceptedsynonym, which is a
denormalized field at best, one that we can't rely on b/c there could be 2 or 10
synonyms)
concept: Carya glabra var. hirsuta
--looks good
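
A minimal Python sketch of why correlation rows are safer than a single
acceptedSynonym field. The class and field names here are hypothetical and do
not reflect the actual VegBank schema. Only concept_ID 2844 (Carya texana
Buckl.) comes from the notes above; every other ID is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correlation:
    """One row of a hypothetical correlation table."""
    concept_id: int             # the "not accepted" concept
    correlated_concept_id: int  # the concept it links to
    convergence: str            # e.g. "equal" or "undetermined"


def correlations_for(concept_id, synonym_ids, convergence="undetermined"):
    """Build one correlation row per synonym, however many there are."""
    return [Correlation(concept_id, s, convergence) for s in synonym_ids]


# 1001 is a made-up ID standing in for Carya buckleyi; 2844 is from the notes.
rows = correlations_for(1001, [2844])

# A made-up concept with three synonyms: a single acceptedSynonym field could
# hold only one of these, while the table simply gets three rows.
rows += correlations_for(1002, [2844, 2845, 2846])
```

With one row per link, a concept with 2 or 10 synonyms needs no special
handling.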
Updated by Michael Lee over 21 years ago
- Bug 694 has been marked as a duplicate of this bug.
Updated by Michael Lee over 21 years ago
- Bug 703 has been marked as a duplicate of this bug.
Updated by Michael Lee over 21 years ago
As I understand things, the following error is of medium-high severity and
priority. It must be fixed before release.
There is a non-intuitive error in the loading of the plants on beta. I hit the
same problem when loading in VegBranch. The issue is really USDA's
fault, but we still have to solve it. Here's my attempt to explain it:
When a (Genus species) has all its (Genus species var. variety)'s made
synonyms of other things, the nominal variety (variety name = species name)
disappears from the USDA list. We need to add these names back in, and
generally assume that the nominal variety is a synonym of the species that
remains. For example, Carya ovata, from the USDA website:
-------------------
Symbol   Scientific Name                                            Common Name
CACA38   Carya carolinae-septentrionalis (Ashe) Engl. & Graebn.     southern shagbark hickory
CAOVA      Carya ovata (P. Mill.) K. Koch var. australis (Ashe) Little
CAOVC      Carya ovata (P. Mill.) K. Koch var. carolinae-septentrionalis (Ashe) Reveal
CAOV2    Carya ovata (P. Mill.) K. Koch                             shagbark hickory
CAOVF      Carya ovata (P. Mill.) K. Koch var. fraxinifolia Sarg.
CAOVN      Carya ovata (P. Mill.) K. Koch var. nuttallii Sarg.
CAOVP      Carya ovata (P. Mill.) K. Koch var. pubescens Sarg.
-----------------------------------
This example has the first 2 indented Caryas as synonyms of "Carya
carolinae-septentrionalis" and the last 3 indented Caryas as synonyms of "Carya
ovata". Here, the missing piece is "Carya ovata var. ovata" which should also
be added to the list of synonyms for "Carya ovata".
There are some problems with us reconstructing this. First, we don't know the
USDA code, so I guess we just don't populate that system. Second, we don't
necessarily know the Scientific Name with Author, though I think it's pretty
safe to just use the (Genus species Auth) with "var. [species]" at the end,
here: "Carya ovata (P. Mill.) K. Koch var. ovata".
Third, we have to worry about ssp. (subspecies) as well as var. (varieties) of
the species. In the above case, all the var.'s could have been ssp.'s, in which
case we would have added "Carya ovata (P. Mill.) K. Koch ssp. ovata", but this
is a fictitious example. There are cases where a species has both varieties and
subspecies. Then, I guess we'd have to add both the nominal subspecies and the
nominal variety.
NOTE THAT NOMINALS ARE GENERALLY INCLUDED IN THE USDA LIST IF THERE ARE
VARIETIES OR SUBSPECIES THAT ARE "ACCEPTED" THAT IS, THEY HAVE NO SYNONYM IN THE
4TH COLUMN OF THE .CSV FILE THAT WE DOWNLOAD.
Bob had some rules for generating this nominal taxon below the species
level, so if he can find these, or Gabe can find John's old copy, that would be
great.
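
A rough Python sketch of the reconstruction described in this comment. It is
not the actual rule set (which is still to be located), and the function name
and input shapes are assumptions made for illustration. For each species it
adds a nominal var. and/or ssp. only when that rank is already in use for the
species and the nominal itself is missing from the list. No USDA code is
produced for the new names.

```python
RANKS = ("var.", "ssp.")


def missing_nominals(species, infraspecific):
    """Return nominal infraspecific names absent from the list.

    species: dict mapping "Genus epithet" -> scientific name with author.
    infraspecific: iterable of (species_key, rank, epithet) tuples already
        present on the downloaded list.
    """
    present = set(infraspecific)

    # Which ranks (var., ssp.) does each species actually use?
    ranks_used = {}
    for sp, rank, _epithet in present:
        ranks_used.setdefault(sp, set()).add(rank)

    out = []
    for sp, name_with_author in species.items():
        epithet = sp.split()[1]  # the nominal repeats the species epithet
        for rank in RANKS:
            used = rank in ranks_used.get(sp, set())
            if used and (sp, rank, epithet) not in present:
                out.append(f"{name_with_author} {rank} {epithet}")
    return out


species = {"Carya ovata": "Carya ovata (P. Mill.) K. Koch"}
infraspecific = [
    ("Carya ovata", "var.", "australis"),
    ("Carya ovata", "var.", "fraxinifolia"),
    ("Carya ovata", "var.", "nuttallii"),
]
new_names = missing_nominals(species, infraspecific)
```

For the Carya ovata example, `new_names` holds the single name
"Carya ovata (P. Mill.) K. Koch var. ovata". A species with both varieties and
subspecies would get both nominals.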
Updated by Gabriel Farrell over 21 years ago
Updated by Michael Lee over 21 years ago
(blocks bug 699, doesn't depend on it.)
Nominal varieties can be recreated by a SAS program I adapted from earlier
taxonomic endeavours. The new list has been sent to Gabe and Bob. I assumed
that this was going to be a 1.0 release issue, but we still have to deal with
the minor detail of what to do about usage.namestatus for these. The loader
probably assumes that names are standard, but these new names are probably
non-standard. But the file is in the same format as the USDA download, so we
should be able to use the same loader, provided we can work out the
usage.namestatus issue.
This is a sizable chunk of taxa: 9093 new ones from 82113 original taxa on the
downloaded list.
If there are other years that are being loaded (I am certain that there are at
least 2 years), please tell me where to get the USDA download and I can run it
through my SAS program and generate the missing taxa for that year, too.
I pass this bug to Gabe for implementation of missing taxa for release 1.0
unless Bob says no, that this is 1.1 issue.
Updated by Robert Peet over 21 years ago
“There are some problems with us reconstructing this. First, we don't know
the USDA code, so I guess we just don't populate that system. Second, we
don't necessarily know the Scientific Name with Author, though I think it's
pretty safe to just use the (Genus species Auth) with "var. [species]" at the
end, here: "Carya ovata (P. Mill.) K. Koch var. ovata".”
RKP1. We do not know the usda code, if there ever was one, so this remains
blank.
RKP2. there is no problem with the author. Just give the correct author for
species and omit author after the nominal var or ssp name.
“Third, we have to worry about ssp. (subspecies) as well as var. (varieties)
of the species. In the above case, all the var.'s could have been ssp.'s in
which case we would have added "Carya ovata (P. Mill.) K. Koch ssp. ovata",
but this is a fictitious example. There are cases where a species has
varieties and subspecies. Then, I guess we'd have to add both the nominal
subspecies and nominal variety.”
RKP3. Yes, in some cases we will need both nominals.
“we have the minor detail of what to do about usage.namestatus for these to
deal with. The loader probably assumes that names are standard, but these new
names are probably non-standard. But the file is of the same format as the
usda download, so we should be able to use the same loader, provided we can
work out the usage.namestatus issue.”
RKP4. no problem with usage.namestatus being standard, provided that the
plantStatus:plantConceptStatus = Not Accepted by USDA
RKP5. by the way, there is an error in the online interactive ERD in that
plantUsage links to plantConcept rather than plantStatus
“This is a sizable chunk of taxa: 9093 new ones from 82113 original taxa on
the downloaded list.”
RKP6. This is shockingly large.
“If there are other years that are being loaded (I am certain that there are
at least 2 years), please tell me where to get the USDA download and I can run
it through my SAS program and generate the missing taxa for that year, too.”
RKP7. Yes, we are loading 1996 and 2002. Gabe has both files and I had
assumed both were loaded. Both need to be loaded and the mappings between
them checked for correctness. I will make my current two files available as
http://www.bio.unc.edu/faculty/peet/plantlst2002 and
http://www.bio.unc.edu/faculty/peet/plantlst1996 at least temporarily. Note
that there is a newer version of Plants2003 waiting to be downloaded and if
time were not an issue I would suggest loading the new usda list rather than
the 2002 list.
Updated by Michael Lee over 21 years ago
All Bob's comments are OK with me in the above comment, I'll just comment on a
few myself.
RKP4. no problem with usage.namestatus being standard, provided that the
plantStatus:plantConceptStatus = Not Accepted by USDA
---plantUsage.plantNameStatus = "STANDARD" then.
RKP5. by the way, there is an error in the online interactive ERD in that
plantUsage links to plantConcept rather than plantStatus
---I did not have this problem:
http://vegbank.nceas.ucsb.edu/vegbank/design/erd/vegbank_erd.pdf links OK from
both Concept and both Status tables.
RKP7. Yes, we are loading 1996 and 2002. Gabe has both files and I had
assumed both were loaded. Both need to be loaded and the mappings between
them checked for correctness. I will make my current two files available as
http://www.bio.unc.edu/faculty/peet/plantlst2002 and
http://www.bio.unc.edu/faculty/peet/plantlst1996 at least temporarily. Note
that there is a newer version of Plants2003 waiting to be downloaded and if
time were not an issue I would suggest loading the new usda list rather than
the 2002 list.
---Thank you for making these available, I'll get on to recreating nominals with
those datasets, too.
***Gabe, where do we stand on loading BOTH YEARS OF USDA PLANTS DATA? If
they are not loaded, we might be able to use our "preloading schema" (if it
worked) to load plants as I load them on VegBranch. If that saves you some
time, then great. We could talk about whether this is a good plan.
Bob, I have instructions from you on loading the PLANTS data and want to confirm
that this is indeed still the way we will do it. I paste the instructions here
from a Word doc you sent John some time ago: [my comments in brackets]
-------------------------------------------------------------------
Here are the basic rules and ideas. I could spell them out as above, but I know
it is unlikely you will follow the exact recipe, so this may be better for you.
1. We should use Plants version 3.5 for 2002 (on beta). The effective date for
2002 = June 10, 2002. The reference will be item 2 in the list of references at
the top of this document.
2. Each concept that was “accepted” in 1996 will be assumed to be “not accepted”
effective as of 10 June 2002.
3. We need to enter in plantName all new names (scientific, common, codes)
including new spellings (many of which are trivial little changes like
diacritical marks or changing the hybrid indicator from x to a cross mark).
4. A new plantConcept entry should be created for all records in plants 2002,
and a plantStatus record is needed for each reporting it as accepted or not
accepted, starting June 10.
5. All correlations will be reported as convergence = “undetermined”.
[ MTL - that is correlations between 1996 and 2002 or also for synonyms
completely in one year? - I assume for everything, which is something new to me]
6. Create a correlation table entry for all concepts that are based on taxa
listed as not accepted in 2002, with a pointer to the accepted concept.
7. Create a correlation table entry for all concepts accepted in 1996 and which
have their USDA codes still associated with an accepted taxon in 2002. Table
should point to the accepted taxon with the same code.
8. The roughly 20,000 spelling inconsistencies in scientific names between 1996
and 2002 should be identified through consistency in USDA code and thereby cause
us no problems, other than bulking up the database. However, we should add new
nonstandard name usages for the 2002 data indicating the old but no longer
standard spellings.
[unless Gabe has done this already, this step may have to wait, as it is
complicated]
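
A minimal Python sketch of rules 7 and 8 above: matching the two years by USDA
code. It is only an illustration under assumed input shapes, not the loader
itself. The function name, the dict layout, and the sample codes and names are
all invented; only convergence = "undetermined" (rule 5) comes from the
instructions.

```python
def correlate_by_code(list_1996, list_2002):
    """Match two yearly lists by USDA code.

    Each list is a dict: USDA code -> (scientific_name, accepted flag).
    Returns (correlations, spelling_changes).
    """
    correlations = []
    spelling_changes = []
    for code, (name_96, accepted_96) in list_1996.items():
        if code not in list_2002:
            continue  # code dropped in 2002: nothing to match by code
        name_02, accepted_02 = list_2002[code]
        # Rule 7: accepted in 1996 and the code is still on an accepted taxon.
        if accepted_96 and accepted_02:
            correlations.append((code, name_96, name_02, "undetermined"))
        # Rule 8: same code, different spelling.
        if name_96 != name_02:
            spelling_changes.append((code, name_96, name_02))
    return correlations, spelling_changes


# Invented sample data, for illustration only.
plants_1996 = {
    "AAAA": ("Aaa bbb x ccc", True),
    "DDDD": ("Ddd eee", True),
    "FFFF": ("Fff ggg", True),
}
plants_2002 = {
    "AAAA": ("Aaa bbb × ccc", True),   # hybrid indicator changed from x to ×
    "DDDD": ("Ddd eee", False),        # no longer accepted in 2002
}
correlations, spelling_changes = correlate_by_code(plants_1996, plants_2002)
```

Here only code AAAA yields a correlation, and it is also the only spelling
change. The old spelling could then be added as a nonstandard usage for the
2002 data, as rule 8 suggests.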
Updated by Michael Lee over 21 years ago
The USDA plant list files have MS-Latin-I characters embedded in them, which
leads to misinterpretation of some foreign characters: üéäåèÅö
I have fixed the files and they are now on tekka at:
http://tekka.nceas.ucsb.edu/~lee/plantlst1996_charFix.txt.zip
http://tekka.nceas.ucsb.edu/~lee/plantlst2002_charFix.txt.zip
Some of these errors persist in the USDA dataset on the web, e.g.
http://plants.usda.gov/cgi_bin/plant_profile.cgi?symbol=HAMAT2
This may be a headache for us when we try to display these same characters on
the web, as they have to be written with special codes in HTML.
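
A small Python sketch of the two steps involved: re-encoding the file contents
and producing HTML-safe output. It assumes that "MS-Latin-I" means the Windows
code page 1252 (`cp1252` in Python). That is a guess, and this is not the
procedure actually used to fix the files. The function names are hypothetical.

```python
import html


def to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes assumed to be Windows-1252 as UTF-8.

    Raises UnicodeDecodeError if a byte is undefined in cp1252.
    """
    return raw.decode("cp1252").encode("utf-8")


def html_safe(text: str) -> str:
    """Escape &, <, > and quotes, then write every non-ASCII character
    as a numeric character reference such as &#252;."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "üéäåèÅö"              # the characters listed above
raw = sample.encode("cp1252")   # how they would sit in the file (assumed)

# Reading the same bytes with the wrong code page garbles them.
garbled = raw.decode("cp437")

fixed = to_utf8(raw).decode("utf-8")
web = html_safe(fixed)
```

Numeric character references keep the page readable whatever encoding the
browser assumes, at the cost of less legible HTML source.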
Updated by Michael Lee over 21 years ago
Enter the 2002 USDA data for release 1.0, which looks good, and enter 1996 and
correlate it with 2002 for 1.1
Updated by Michael Lee about 21 years ago
Basic plants are in VegBank and loaded properly.
Updated by Michael Lee almost 20 years ago
Changed the component from one that is to be deleted to "misc" so that bugs
don't get deleted with the component. Sorry for all the email.