Earthster Developer Blog


Thursday 22 July 2010

Reference data for chemicals and other substances

The Earthster RefData project is developing reference data relevant to Life Cycle Analysis (LCA) and publishing it as linked open data (LOD). A key component of this reference data is a set of lists of elementary flows, that is, flows of substances or energy to or from the environment that are relevant to LCA.

I have been looking for reference URLs that I can use to identify these substances. Ideally, each URL could be dereferenced to retrieve RDF data about the substance it identifies. That data doesn't seem to exist at the moment.
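To make "dereferenced to retrieve RDF data" concrete, here is a minimal sketch of the kind of request a client would send. It only builds the request and does not contact any server. The URL is a made-up placeholder, not a real substance identifier, and the choice of RDF media types is my assumption about what a publisher might serve.

```python
from urllib.request import Request

# Media types a linked-data client typically asks for (assumed preference order).
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9"


def rdf_request(substance_url: str) -> Request:
    """Build a GET request that asks the server for RDF instead of HTML."""
    return Request(substance_url, headers={"Accept": RDF_ACCEPT})


# Hypothetical URL, used purely for illustration.
req = rdf_request("http://example.org/substance/aniline")
```

Passing `req` to `urllib.request.urlopen` would perform the actual lookup, provided a server at that address supports content negotiation.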

There are a number of schemes for identifying chemicals. The most commonly used seems to be CAS numbers, which are issued by the Chemical Abstracts Service, a division of the American Chemical Society. The society operates the CAS Registry, a database of information about chemicals identified by these numbers, but this is not freely available.
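Although the registry itself is closed, the format of a CAS number can be checked locally, because the last digit is a checksum over the others. The sketch below is my own helper, not part of any CAS tooling. It weights the digits before the check digit by their position counted from the right, sums them, and compares the total modulo 10 with the check digit. It only confirms that a number is well formed, not that it is assigned to a substance.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight each digit by its position counted from the right, starting at 1.
    total = sum(pos * int(d) for pos, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

For aniline's CAS number, 62-53-3, the weighted sum is 3×1 + 5×2 + 2×3 + 6×4 = 43, and 43 mod 10 gives the check digit 3.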

The National Center for Biotechnology Information (NCBI) maintains PubChem, a freely available database of information about chemicals. I even found a reference to an RDF translation of the database, but the link is broken. Unfortunately, PubChem does not use CAS numbers as the identifier for a chemical. A search for aniline, for example, yields one page that does not contain a reference to aniline's CAS number and another page that does. These two pages have different substance IDs (SIDs) but share a compound ID. In fact, there are a lot of pages with the same compound ID and different SIDs. I wonder if one of them is the reference page. More work is needed to understand the structure and content of PubChem to see whether it is a potential source of LOD reference data for chemicals.
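The many-SIDs-to-one-compound-ID relationship can be pictured with a small grouping sketch. The ID values below are invented placeholders, not real PubChem identifiers, and the `(sid, cid)` tuple layout is an assumption of mine rather than PubChem's data format.

```python
from collections import defaultdict


def group_sids_by_cid(records):
    """Map each compound ID to the list of substance IDs that point at it."""
    grouped = defaultdict(list)
    for sid, cid in records:
        grouped[cid].append(sid)
    return dict(grouped)


# Placeholder (sid, cid) pairs: three substance pages share one compound.
records = [(101, 1), (102, 1), (103, 1), (201, 2)]
by_compound = group_sids_by_cid(records)
```

If the compound ID turns out to be the stable key, it is the more natural candidate for a reference URL than any single substance page.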

Wikipedia contains quite a rich set of information about chemicals in its infoboxes, and DBpedia is extracting some of that information. The DBpedia infobox ontology defines a property for CAS numbers; however, my recent investigations indicate that extraction of this property can be patchy: in some cases it is missing from the RDF even though it is present in the Wikipedia entry. No doubt that will improve, and we might even be able to help with that.
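One way to spot-check the extraction is to ask DBpedia for the CAS number of a single resource. The sketch below only builds the SPARQL query text and does not send it anywhere. The property URI is my assumption about how the ontology names the CAS number property and should be verified against the current ontology before use.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
# Assumed property URI; check it against the DBpedia ontology before relying on it.
CAS_PROPERTY = "http://dbpedia.org/ontology/casNumber"


def cas_query(resource_name: str) -> str:
    """Build a SPARQL query asking for the CAS number of one DBpedia resource."""
    subject = DBPEDIA_RESOURCE + resource_name
    return f"SELECT ?cas WHERE {{ <{subject}> <{CAS_PROPERTY}> ?cas . }}"


query = cas_query("Aniline")
```

An empty result for a chemical whose Wikipedia infobox does show a CAS number would be an instance of the patchy extraction described above.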

So far, I haven't found LOD reference data for chemicals that I can use. Still looking ...

1 comment:

  1. The single process GWP lists from the Ecoinvent database used in SimaPro don't have CAS numbers for all the gases listed. Since you seem to have been looking for unique identifiers, I'm wondering what you use as a unique identifier when CAS numbers are missing in LCIA reference data. The chemical formula?
