cdn.media.ccc.de/.../wikidatacon2019-1144-eng-Cheminformatics_to_improve_Wikidata_on_chemical_compounds_hd.mp4

  • 0:06 - 0:10
    (chairman) So, let's welcome Egon,
  • 0:10 - 0:13
    who will describe how he is trying
  • 0:13 - 0:18
    to improve the coverage
    of chemical compounds in Wikidata.
  • 0:18 - 0:20
    (Egon) Yeah, thank you.
  • 0:21 - 0:22
    So, let's see...
  • 0:23 - 0:25
    Oh, this is not right.
  • 0:26 - 0:27
    Sorry.
  • 0:32 - 0:33
    They put the wrong slide deck.
  • 0:40 - 0:43
    (person 1) The one was better than--
    (person 2) [inaudible]
  • 0:43 - 0:44
    (person 1) (laughs)
  • 0:49 - 0:50
    How about this?
  • 0:53 - 0:55
    This one actually says WikidataCon.
  • 0:57 - 0:58
    Slightly different slides.
  • 1:00 - 1:06
    Okay, so yeah, coverage and correctness,
  • 1:06 - 1:08
    accuracy, quality, if you like.
  • 1:09 - 1:12
    And the other thing here
  • 1:12 - 1:15
    is what makes it different
    from some of the other things
  • 1:15 - 1:17
    that we've seen at the WikidataCon
  • 1:17 - 1:20
    is how I do this quality,
  • 1:20 - 1:22
    and the coverage, actually.
  • 1:22 - 1:25
    So I'm actually taking advantage here
    of my background,
  • 1:25 - 1:27
    which is in cheminformatics,
  • 1:27 - 1:30
    which is something
    that we use in our research.
  • 1:30 - 1:33
    And cheminformatics
  • 1:33 - 1:35
    is a way of understanding
    the chemical structures,
  • 1:35 - 1:37
    as we will see in a moment.
  • 1:37 - 1:40
    We can do things that we cannot do
    with the regular toolsets
  • 1:40 - 1:44
    that we have, like Shape Expressions,
    quality constraints and so on.
  • 1:46 - 1:49
    Now, one of the interesting
    things of chemistry
  • 1:49 - 1:52
    is that chemical structures
  • 1:53 - 1:55
    are sometimes the same,
    sometimes not the same,
  • 1:55 - 1:58
    depending on what you want to know.
  • 1:58 - 2:01
    And this slide reflects that a bit,
  • 2:01 - 2:05
    and what we see here
    is biologically the same compounds
  • 2:05 - 2:08
    but chemically two different compounds.
  • 2:08 - 2:12
    But at biological levels
    with the [inaudible],
  • 2:12 - 2:14
    they are in equilibrium,
  • 2:14 - 2:16
    and you will not be able
    to really distinguish between them,
  • 2:17 - 2:20
    unless you're looking
    for a particular type of biology
  • 2:20 - 2:22
    like reaction mechanisms.
  • 2:23 - 2:27
    Another interesting thing
    about Wikidata and Wikipedia
  • 2:27 - 2:32
    is that we have things
    like long-chain fatty acids,
  • 2:32 - 2:35
    chemical concepts
    which are not a specific compound
  • 2:35 - 2:37
    but actually a class of compounds.
  • 2:37 - 2:43
    Now, this class can be based
    on similar features in the molecules,
  • 2:43 - 2:46
    so like in the case
    of the long-chain fatty acids--
  • 2:46 - 2:52
    they all have a long fatty chain
    and an acid group.
  • 2:52 - 2:54
    In other cases, there are the classes
  • 2:54 - 2:57
    based more
    on the biological functionality,
  • 2:57 - 3:00
    like a certain type of inhibitor,
    like an ACE inhibitor.
  • 3:02 - 3:06
    And this introduces
    a lot of interesting things,
  • 3:06 - 3:10
    partly because of this close link
    with Wikipedias.
  • 3:10 - 3:13
    And one of the things that we see
  • 3:13 - 3:18
    is that Wikipedia may have a chembox
    for a particular compound
  • 3:18 - 3:20
    but actually be about a compound class,
  • 3:20 - 3:25
    resulting in a slightly different concept
  • 3:25 - 3:27
    of what the two things
    actually mean
  • 3:27 - 3:30
    and the sitelink being more complicated.
  • 3:33 - 3:39
    We need this
    for understanding the biology.
  • 3:39 - 3:43
    So the research in our group
    is understanding the living cell.
  • 3:43 - 3:47
    The systems biology here
    describes the biological processes,
  • 3:47 - 3:51
    and we have a pathway database
    for that--WikiPathways,
  • 3:51 - 3:54
    and if we look at the chemistry in there
    of the small molecules,
  • 3:56 - 3:59
    the chemistry is sometimes
    described in a lot of detail,
  • 3:59 - 4:01
    sometimes in less detail,
  • 4:01 - 4:04
    pretty much like this Wikipedia-Wikidata
    link that we just had,
  • 4:04 - 4:08
    resulting in basically links
    to a lot of different databases
  • 4:08 - 4:10
    with slightly different focuses.
  • 4:11 - 4:15
    Some databases like LIPID MAPS
    and the Human Metabolome Database,
  • 4:15 - 4:18
    they are very much focused on the biology,
  • 4:18 - 4:21
    whereas a database like ChEBI,
  • 4:21 - 4:25
    that's very much focused
    on the chemical entities.
  • 4:26 - 4:28
    So we try to bridge that,
  • 4:28 - 4:33
    and that, two or three years ago, gave us
    a very interesting insight
  • 4:33 - 4:36
    that if you look at the lines here
  • 4:36 - 4:40
    where in blue we have the total number
    of the small molecules
  • 4:40 - 4:42
    we have in these Pathways
  • 4:42 - 4:45
    and the numbers in red
    that we can match to that,
  • 4:45 - 4:48
    there is this gap, and this gap
    is complicated chemistry.
  • 4:49 - 4:52
    Also, poignantly, things
    missing in Wikidata.
  • 4:53 - 4:55
    So therefore the need
    for data completeness
  • 4:55 - 4:58
    and the data quality.
  • 5:00 - 5:02
    And here we have an example.
  • 5:02 - 5:06
    This is actually
    a curation report of yesterday,
  • 5:06 - 5:08
    and these are still things
    that we have in Pathways
  • 5:08 - 5:14
    but that we do not know
    what the equivalent thing is in Wikidata.
  • 5:14 - 5:18
    And one of the things here
    that I'm picking out here
  • 5:18 - 5:19
    is strigolactone.
  • 5:19 - 5:21
    And this is a class of compounds.
  • 5:22 - 5:27
    So we have that in one of our Pathways,
    this particular Pathway over there.
  • 5:29 - 5:32
    So you start matching this
    to Wikidata and Wikipedia,
  • 5:32 - 5:35
    and there is actually,
    for this compound class, a Wikipedia page
  • 5:36 - 5:39
    with these six structures--
  • 5:39 - 5:42
    images, name, no links, nothing.
  • 5:42 - 5:43
    Nothing in Wikipedia,
  • 5:43 - 5:46
    just this information,
    not machine-readable.
  • 5:48 - 5:53
    So based on the name,
    I can actually find three out of the six
  • 5:53 - 5:55
    of these compounds in Wikidata,
  • 5:55 - 5:57
    not linked, not classified.
  • 5:57 - 6:00
    So if we look at the class
    of strigolactones,
  • 6:00 - 6:03
    of which these six are examples,
  • 6:03 - 6:05
    Wikidata did not give us anything.
  • 6:05 - 6:08
    So that's the kind of curation
    that I'm interested in.
  • 6:09 - 6:13
    On the right here--
    that page is actually pretty much empty,
  • 6:13 - 6:17
    but it's exactly what Scholia is showing
    for this class of chemical compounds.
  • 6:18 - 6:23
    So Scholia is one of the tools
    that I've been using to do this curation.
  • 6:26 - 6:31
    This missing classification
    is a bit of information
  • 6:31 - 6:32
    missing in Wikidata,
  • 6:32 - 6:35
    but we can add this classification,
  • 6:35 - 6:37
    and we can retrieve that
    from some sources.
  • 6:37 - 6:39
    We will see with LIPID MAPS later,
  • 6:39 - 6:41
    we can automate
    adding these missing links,
  • 6:42 - 6:44
    if we understand the chemistry.
  • 6:46 - 6:51
    So this diagram over here--
    we have fatty acid over there again
  • 6:51 - 6:54
    and the long-chain fatty acid over here
  • 6:54 - 6:57
    that we saw on one
    of the previous slides--
  • 6:57 - 7:01
    very long-chain fatty acids
    and a number of other fatty acids.
  • 7:01 - 7:04
    This kind of information helps us see
  • 7:04 - 7:09
    a [inaudible] of the chemistry
    in Wikidata.
  • 7:10 - 7:14
    Scholia can visualize the 2D structure,
  • 7:14 - 7:16
    and this thing is actually
    automatically generated
  • 7:16 - 7:19
    from the chemical structure in Wikidata
  • 7:19 - 7:22
    created on the fly
    as Scalable Vector Graphics.
  • 7:23 - 7:24
    (coughing)
  • 7:25 - 7:26
    Sorry.
  • 7:26 - 7:28
    With the stereochemistry annotation there
  • 7:28 - 7:33
    to help the chemist see
    the completeness of the data
  • 7:33 - 7:36
    because also the stereochemistry
    might be missing.
  • 7:36 - 7:40
    We also get an overview on Scholia
    of related compounds
  • 7:40 - 7:42
    based on the InChIKey,
  • 7:42 - 7:46
    where the first block basically indicates
    how the atoms are connected
  • 7:46 - 7:50
    and the second block
    indicates things like stereochemistry
  • 7:50 - 7:54
    and things like
    which isotopes are in there,
  • 7:54 - 7:57
    for example C11 instead of C12
  • 7:57 - 8:02
    or C13 instead of C12.
  • 8:03 - 8:06
    The last number,
    or the last letter, over here
  • 8:06 - 8:09
    actually indicates the charge,
  • 8:09 - 8:14
    so that's the example that we saw earlier
    between the citric acid and the citrate,
  • 8:14 - 8:17
    or was it the acetic acid and the acetate?
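
The InChIKey layout described above can be sketched in Python. This is a minimal illustration, not code from Scholia or Bioclipse: the helper names `split_inchikey` and `same_skeleton` are made up for this example, and the two keys are the standard InChIKeys of acetic acid and acetate.

```python
import re

# Standard InChIKey layout: 14 letters, "-", 8 letters + flag + version letter, "-", 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into named parts (illustrative helper, not a library API)."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, other_layers, flag, version, charge_flag = match.groups()
    return {
        "connectivity": connectivity,   # first block: how the atoms are connected
        "other_layers": other_layers,   # second block: stereochemistry, isotopes, ...
        "standard": flag == "S",        # "S" marks a standard InChIKey
        "version": version,
        "charge_flag": charge_flag,     # last letter: "N" is neutral, "M" is one proton removed
    }


def same_skeleton(key_a, key_b):
    """True when two keys share the first block, i.e. the same atom connectivity."""
    return split_inchikey(key_a)["connectivity"] == split_inchikey(key_b)["connectivity"]


acetic_acid = "QTBSBXVTEAMEQO-UHFFFAOYSA-N"
acetate = "QTBSBXVTEAMEQO-UHFFFAOYSA-M"
print(same_skeleton(acetic_acid, acetate))
print(split_inchikey(acetic_acid)["charge_flag"], split_inchikey(acetate)["charge_flag"])
```

Grouping items by the first block alone is one way to list "related compounds" that differ only in stereochemistry, isotopes or charge.
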
  • 8:20 - 8:23
    By putting in a bit of domain knowledge,
  • 8:23 - 8:28
    we can do a lot more... making sense
    of what we have in Wikidata.
  • 8:29 - 8:33
    A bit more about Scholia:
  • 8:33 - 8:36
    it is about data completeness,
    with the physical and chemical properties,
  • 8:36 - 8:40
    the literature; those are all things
    that we want to have access to.
  • 8:40 - 8:45
    But it only works if we can find
    the right chemical in Wikidata.
  • 8:49 - 8:52
    We started using Wikidata
    in a number of our projects,
  • 8:52 - 8:55
    so WikiPathways was one of them.
  • 8:55 - 8:56
    This is another project
  • 8:56 - 9:00
    in the area of the nanosafety,
    risk assessment,
  • 9:00 - 9:03
    where they use OECD testing guidelines
  • 9:03 - 9:08
    and using Wikidata here
    to make an overview of the experiments.
  • 9:08 - 9:12
    And this means that we can now
    actually start annotating articles
  • 9:12 - 9:14
    where these protocols have been used.
  • 9:16 - 9:20
    And in this way, we get a better insight
    into the quality of literature as well.
  • 9:21 - 9:24
    We get to see which DDTs
  • 9:24 - 9:27
    are well tested, established
    experimental methods
  • 9:27 - 9:32
    and an indication of how good the data is
    that came out of that.
  • 9:35 - 9:38
    Another example--this is nanomaterials,
  • 9:38 - 9:40
    specific nanomaterials,
  • 9:41 - 9:43
    where there is a unique code--
    we've added that--
  • 9:43 - 9:45
    with the same purpose of being able
  • 9:45 - 9:48
    to track down literature
    about these nanomaterials.
  • 9:49 - 9:52
    But, again, we need exact descriptions.
  • 9:53 - 9:56
    Now, this is
    the LIPID MAPS classification,
  • 9:56 - 9:58
    and here we see an interesting thing,
  • 9:58 - 10:02
    and this has shown up
    in some of the presentations,
  • 10:03 - 10:04
    elsewhere as well.
  • 10:04 - 10:08
    This idea that some of the things
    that we have in Wikidata
  • 10:08 - 10:10
    do not always match the sources.
  • 10:10 - 10:12
    So different ontological models,
  • 10:12 - 10:16
    different ideas
    of what a particular thing means.
  • 10:16 - 10:21
    And so, if we look at the LIPID MAPS,
    we have a lipid in the middle
  • 10:21 - 10:23
    and then a number of classes,
  • 10:23 - 10:25
    and many of these are in Wikidata.
  • 10:25 - 10:31
    But here, around actually
    fatty acids or fatty acyls,
  • 10:31 - 10:33
    that's where there is a mismatch
  • 10:33 - 10:37
    causing something that should be
    actually purely hierarchical,
  • 10:37 - 10:41
    actually it started to show
    some loops over there,
  • 10:41 - 10:46
    the mismatch of two representations
    of lipid chemistry.
  • 10:49 - 10:52
    Now, the goal of this work
    is not so much to reconcile this
  • 10:52 - 10:53
    but to visualize it
  • 10:53 - 10:56
    so that we can understand
    what is going on
  • 10:56 - 10:59
    and correct things
    that are actually clearly wrong.
  • 11:06 - 11:10
    The interesting thing about LIPID MAPS
    is actually that the classification
  • 11:10 - 11:13
    is indicated in the external identifier.
  • 11:13 - 11:15
    So one of the things
    that we've been using
  • 11:15 - 11:19
    is these external numbers
    to make this automatic classification
  • 11:19 - 11:22
    because everything
    that starts with an LMFA05
  • 11:23 - 11:26
    is actually a fatty alcohol.
  • 11:26 - 11:28
    So I can translate that
    into Quickstatements,
  • 11:28 - 11:30
    push that into Quickstatements
  • 11:30 - 11:33
    and get that annotated in Wikidata.
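
A minimal Python sketch of the prefix-based classification just described. The mapping, the class item id and the use of P31 ("instance of") are assumptions for illustration only: `Q_FATTY_ALCOHOL_PLACEHOLDER` is not a real Wikidata item, and the actual scripts from the talk are Bioclipse scripts, not this code. The output uses the tab-separated QuickStatements command format.

```python
# Hypothetical mapping from LIPID MAPS identifier prefix to a Wikidata class item.
# The QID below is a placeholder and must be replaced by the real class item.
PREFIX_TO_CLASS = {
    "LMFA05": "Q_FATTY_ALCOHOL_PLACEHOLDER",
}


def classification_statements(items):
    """Turn (item QID, LIPID MAPS ID) pairs into QuickStatements lines.

    Each line is "item <TAB> P31 <TAB> class"; P31 is an assumed choice here.
    """
    lines = []
    for qid, lipidmaps_id in items:
        for prefix, class_qid in PREFIX_TO_CLASS.items():
            if lipidmaps_id.startswith(prefix):
                lines.append(f"{qid}\tP31\t{class_qid}")
    return lines


# Made-up item ids and LIPID MAPS ids, for illustration only.
example = [("Q_ITEM_A", "LMFA05000001"), ("Q_ITEM_B", "LMST01010001")]
for line in classification_statements(example):
    print(line)
```

Printing the statements instead of writing to Wikidata directly keeps a manual review step before anything is submitted.
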
  • 11:40 - 11:44
    This slide is just reflecting
    the advantage for LIPID MAPS here;
  • 11:44 - 11:46
    we've been collaborating with them
  • 11:46 - 11:50
    because they get a lot of data
    out of Wikidata as well,
  • 11:50 - 11:55
    which we can cross-reference,
    which we can compare if it's correct.
  • 11:55 - 11:58
    LIPID MAPS is a quite curated database
  • 11:58 - 12:02
    but, like everyone, they are actually having trouble
  • 12:02 - 12:05
    with access to literature,
  • 12:05 - 12:08
    the demand of literature
    and filtering the literature,
  • 12:08 - 12:10
    getting to the right articles.
  • 12:13 - 12:15
    Shape Expressions is probably
    something that you've seen.
  • 12:15 - 12:19
    We have a few of them for chemistry now.
  • 12:19 - 12:21
    This is the example for racemic mixture.
  • 12:21 - 12:26
    In a case of racemic mixture,
    you want to have two parts in there.
  • 12:26 - 12:27
    It's a mixture,
  • 12:27 - 12:31
    so at least two chemical entities
    need to be in there.
  • 12:31 - 12:36
    Moreover, each of the [inaudible] parts
    has to be a chemical compound.
  • 12:36 - 12:41
    This is another way
    we can curate the content.
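
The two constraints named for a racemic mixture (at least two parts, each part a chemical compound) can be expressed as a plain Python check. This is only a sketch of the rule, not the Shape Expression itself; the function name, the data layout and the item ids are hypothetical.

```python
def check_racemic_mixture(parts, types_by_item):
    """Return a list of problems for one racemic-mixture item.

    parts: ids of the items listed as parts of the mixture.
    types_by_item: dict mapping an item id to the set of its type labels.
    """
    problems = []
    if len(parts) < 2:
        problems.append("a mixture needs at least two parts")
    for part in parts:
        if "chemical compound" not in types_by_item.get(part, set()):
            problems.append(f"{part} is not typed as a chemical compound")
    return problems


# Made-up item ids, for illustration only.
types = {"Q_R_FORM": {"chemical compound"}, "Q_S_FORM": {"chemical compound"}}
print(check_racemic_mixture(["Q_R_FORM", "Q_S_FORM"], types))
print(check_racemic_mixture(["Q_R_FORM"], types))
```
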
  • 12:42 - 12:44
    There have to be more of them.
  • 12:44 - 12:48
    We have quite a few
    different concepts in Wikidata
  • 12:48 - 12:50
    like groups of co-compounds.
  • 12:51 - 12:55
    There is a class
    of structurally similar compounds, etc.
  • 12:57 - 13:01
    If you run a query like this,
    in this case for the other
  • 13:01 - 13:04
    schema that we have,
    for chemical elements,
  • 13:04 - 13:07
    you can do the same thing--
    you can run it on a single item
  • 13:07 - 13:10
    or you can run that on everything
    that is a chemical element.
  • 13:10 - 13:14
    This is something
    that I can very much recommend
  • 13:14 - 13:17
    having a look at
    if you have not done so already.
  • 13:18 - 13:22
    Now, if we go to the automation of things,
  • 13:22 - 13:26
    here I'm using a tool called Bioclipse.
  • 13:26 - 13:29
    This is something that we worked on
    some ten years ago.
  • 13:29 - 13:32
    It's a platform
    for chemistry and biology,
  • 13:32 - 13:35
    or cheminformatics and bioinformatics,
  • 13:35 - 13:37
    aimed at automating things,
  • 13:40 - 13:41
    including visualizations and so on.
  • 13:41 - 13:45
    But I've taken that now
    and developed a number of scripts
  • 13:45 - 13:47
    that I can actually run
    on the command line,
  • 13:47 - 13:51
    which makes it easier to automate things,
    as we will see in a moment,
  • 13:51 - 13:52
    and doing all sorts of things,
  • 13:52 - 13:56
    for example, the classification
    according to the LIPID MAPS identifiers.
  • 13:56 - 14:02
    The scripts are all available
    from the GitHub repository here.
  • 14:03 - 14:06
    And typically, I have them
    create Quickstatements
  • 14:06 - 14:09
    because that gets me
    an additional check step
  • 14:09 - 14:14
    after I've created the Quickstatements,
    to see what the data actually looks like.
  • 14:16 - 14:19
    Annotation of main subjects.
  • 14:19 - 14:23
    This one is my script too,
    starting from SMILES
  • 14:23 - 14:27
    to actually add chemical compounds
    that are not in Wikidata yet,
  • 14:27 - 14:29
    which happens a lot.
  • 14:30 - 14:34
    So three or four weeks ago
    I added something like 500 compounds
  • 14:34 - 14:36
    which our project was looking into
  • 14:36 - 14:42
    because these are
    volatile compounds in oils.
  • 14:45 - 14:47
    This script adds the compounds.
  • 14:47 - 14:51
    They will later on add the annotation
    of which species that compound comes from
  • 14:51 - 14:53
    and what the properties are.
  • 14:56 - 14:59
    Bioclipse itself is based
    on the Chemistry Development Kit
  • 15:00 - 15:02
    and a few other libraries.
  • 15:02 - 15:05
    This allows me to do the chemistry.
  • 15:05 - 15:08
    And this is a very well-validated toolkit.
  • 15:08 - 15:12
    The SMILES part has been done
    by John Mayfield.
  • 15:12 - 15:17
    I have done a lot of validation
    against other tools.
  • 15:18 - 15:21
    And the quality
    is actually really high now,
  • 15:21 - 15:26
    comparable to, or in some cases even better
    than, commercial cheminformatics tools.
  • 15:26 - 15:29
    It has given me a lot of reassurance
  • 15:29 - 15:35
    that the quality checking that we do
    with this tool on Wikidata
  • 15:35 - 15:37
    is giving interesting results.
  • 15:38 - 15:40
    This is the Quickstatements.
  • 15:40 - 15:43
    Quickstatements
    is Magnus' work, of course.
  • 15:44 - 15:48
    What happens is we take the SMILES,
    it calculates the InChI
  • 15:48 - 15:52
    and the InChIKey, it even looks up,
    based on the InChIKey,
  • 15:52 - 15:54
    if there is a PubChem identifier,
  • 15:54 - 15:59
    then uses the InChIKey,
    the PubChem identifier,
  • 15:59 - 16:01
    to see if this compound
    is already in Wikidata.
  • 16:01 - 16:04
    And only if it's not already there,
  • 16:04 - 16:07
    then it will actually create
    a CREATE statement.
  • 16:09 - 16:13
    A bit of automatic classification
    here is an option.
  • 16:13 - 16:15
    So if I'm adding a class of compounds,
  • 16:15 - 16:18
    I can automatically indicate
    that these are all...
  • 16:18 - 16:20
    this type of compound,
  • 16:20 - 16:23
    and I can also indicate, if needed,
  • 16:23 - 16:25
    if there is a particular article
    where I got this information
  • 16:25 - 16:28
    from, automatically adding references.
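
The workflow above (skip compounds already in Wikidata, otherwise emit a CREATE statement with an optional class and reference) might look roughly like this in Python. It is a sketch under assumptions, not the speaker's Bioclipse script: the InChIKey is taken as an input that a cheminformatics toolkit has already computed from the SMILES, `known_inchikeys` stands in for the lookup against Wikidata, and the function name and the source item id are hypothetical. P233 (canonical SMILES), P235 (InChIKey), P31 (instance of) and the S248 ("stated in") source prefix follow the QuickStatements tab-separated format.

```python
def new_compound_statements(smiles, inchikey, known_inchikeys,
                            class_qid=None, source_qid=None):
    """Return QuickStatements lines for a new compound, or [] if it already exists.

    inchikey is assumed to have been computed from the SMILES by a toolkit.
    known_inchikeys is a set of InChIKeys already present in Wikidata.
    """
    if inchikey in known_inchikeys:
        return []  # already in Wikidata: no CREATE statement
    # Optional reference, appended to every statement as "stated in" (S248).
    reference = f"\tS248\t{source_qid}" if source_qid else ""
    lines = [
        "CREATE",
        f'LAST\tP233\t"{smiles}"{reference}',
        f'LAST\tP235\t"{inchikey}"{reference}',
    ]
    if class_qid:
        lines.append(f"LAST\tP31\t{class_qid}{reference}")
    return lines


statements = new_compound_statements(
    "CC(=O)O",                        # SMILES of acetic acid
    "QTBSBXVTEAMEQO-UHFFFAOYSA-N",    # its standard InChIKey
    known_inchikeys=set(),            # pretend nothing is in Wikidata yet
    class_qid="Q11173",               # chemical compound
    source_qid="Q_ARTICLE_PLACEHOLDER",  # placeholder, not a real item
)
print("\n".join(statements))
```
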
  • 16:30 - 16:33
    Well, this is what
    the Quickstatements output looks like
  • 16:33 - 16:36
    for the annotation of main subjects.
  • 16:36 - 16:38
    You've probably seen that as well.
  • 16:41 - 16:43
    A newer thing that I started doing
  • 16:43 - 16:46
    is actually doing reasoning
    on the data in Wikidata.
  • 16:46 - 16:51
    So if I have the SMILES, then I can check
    the molecular formula, for example.
  • 16:51 - 16:53
    I can check the InChIKey.
  • 16:55 - 17:00
    At some point, what we are going to do
    is calculate physicochemical properties
  • 17:00 - 17:04
    and see if that matches
    what is in Wikidata.
  • 17:05 - 17:07
    This will highlight typos
  • 17:08 - 17:10
    or wrong units, for example.
  • 17:11 - 17:15
    At this moment...
    so this is a run of this morning.
  • 17:15 - 17:18
    What we see here
    is two tests actually failing,
  • 17:18 - 17:20
    and this is an example of it.
  • 17:20 - 17:23
    Here, the InChIKey
    that is computed from the isomeric SMILES
  • 17:23 - 17:27
    is different from the InChIKey
    given in the entry.
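
A rough Python sketch of this kind of consistency test. It assumes the InChIKey has already been recomputed from the isomeric SMILES by a cheminformatics toolkit and only does the comparison; the function name, the item ids and the second pair of keys are made up for illustration.

```python
def inchikey_mismatches(records):
    """Report items whose stored InChIKey differs from the recomputed one.

    records: iterable of (item id, stored InChIKey, recomputed InChIKey).
    Returns (item id, which part differs) pairs.
    """
    report = []
    for qid, stored, computed in records:
        if stored == computed:
            continue
        stored_blocks = stored.split("-")
        computed_blocks = computed.split("-")
        if stored_blocks[0] != computed_blocks[0]:
            kind = "connectivity"
        elif stored_blocks[1] != computed_blocks[1]:
            kind = "stereochemistry or isotopes"
        else:
            kind = "charge"
        report.append((qid, kind))
    return report


records = [
    # Acetic acid stored, acetate recomputed: only the last letter differs.
    ("Q_ITEM_A", "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "QTBSBXVTEAMEQO-UHFFFAOYSA-M"),
    # Made-up keys that differ in the second block.
    ("Q_ITEM_B", "AAAAAAAAAAAAAA-BBBBBBBBSA-N", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    # Identical keys: not reported.
    ("Q_ITEM_C", "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "QTBSBXVTEAMEQO-UHFFFAOYSA-N"),
]
for qid, kind in inchikey_mismatches(records):
    print(qid, kind)
```

Reporting which block differs gives the curator a first hint about whether the conflict is about the skeleton, the stereochemistry or the charge state.
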
  • 17:28 - 17:32
    This can result from data
    being pulled in from different resources.
  • 17:33 - 17:36
    So these are entries, about 300 of them,
  • 17:36 - 17:39
    in the 160,000 chemicals
    that we have in Wikidata.
  • 17:39 - 17:42
    So it's a very small amount, really,
  • 17:42 - 17:46
    where there is information,
    and someone needs to look at it.
  • 17:47 - 17:51
    Now, these are all organic compounds
  • 17:51 - 17:54
    and also quite a few inorganic compounds
  • 17:54 - 17:57
    where these things just work less well.
  • 17:58 - 18:01
    But I found in the other test
    that is failing
  • 18:01 - 18:04
    immediately a couple of things
    that are very clearly wrong.
  • 18:09 - 18:12
    PubChem is a huge database.
  • 18:12 - 18:14
    They do validation as well.
  • 18:14 - 18:17
    We are in the process
    of submitting Wikidata there,
  • 18:17 - 18:19
    which I'm really happy about.
  • 18:19 - 18:23
    It's in the last validation step
    at this moment.
  • 18:23 - 18:26
    And this will also mean that PubChem,
  • 18:26 - 18:29
    which has something
    like 100 million compounds
  • 18:29 - 18:31
    will actually link back to Wikidata.
  • 18:31 - 18:35
    It already does this, but via Wikipedia.
  • 18:35 - 18:37
    (laughing)
  • 18:37 - 18:38
    Do you recognize it?
  • 18:38 - 18:43
    With the aforementioned issues there
    of concept mismatches.
  • 18:44 - 18:46
    So this will give us a second thing.
  • 18:46 - 18:50
    And there, also,
    using the same Bioclipse scripts
  • 18:50 - 18:52
    or similar Bioclipse scripts,
  • 18:52 - 18:53
    we get validation reports,
  • 18:53 - 18:57
    again indicating things
    that chemists should look at.
  • 19:00 - 19:02
    That basically wraps it up.
  • 19:02 - 19:05
    This is still a work in progress,
    the article is in preparation.
  • 19:05 - 19:08
    I've been working
    with Finn here in Scholia
  • 19:08 - 19:11
    to support this validation.
  • 19:11 - 19:16
    We're writing up the full work,
    but for now you can look up this poster.
  • 19:16 - 19:19
    The slides are on the program
    of this session,
  • 19:19 - 19:22
    so you can look at the slides
    and look at the details.
  • 19:23 - 19:25
    And a quick acknowledgment:
  • 19:25 - 19:28
    some of this work has been done
    with support from a number of grants that I received.
  • 19:28 - 19:30
    And thank you very much.
  • 19:30 - 19:32
    (applause)
  • 19:36 - 19:38
    (chairman) Are there any questions?
  • 19:41 - 19:43
    (person 3) Thank you so much for this.
  • 19:43 - 19:44
    I am [inaudible],
  • 19:44 - 19:47
    and so far, I've been reading articles
  • 19:47 - 19:50
    on the [inaudible] Wikipedia
    on different compounds.
  • 19:50 - 19:54
    I have a little bit more than 70 articles
    with different compounds--
  • 19:54 - 19:55
    just things I come across.
  • 19:56 - 19:58
    And my question to you is
  • 19:58 - 20:02
    if I want to move my chemistry activity
    from Wikipedia to Wikidata,
  • 20:02 - 20:05
    how can I help
    in a way that is very friendly
  • 20:05 - 20:10
    to somebody who is a beginner
    in that field on Wikidata?
  • 20:12 - 20:16
    (Egon) So, if that compound is in Wikipedia and...
  • 20:16 - 20:18
    Sometimes there is
    actually a Wikidata page.
  • 20:18 - 20:20
    I occasionally run into this as well,
  • 20:20 - 20:22
    in the last couple of months
    not so much anymore
  • 20:22 - 20:24
    but this morning, actually.
  • 20:26 - 20:27
    And what I typically do then
  • 20:27 - 20:31
    is I take the SMILES
    from [inaudible] infobox
  • 20:31 - 20:32
    from that compound
  • 20:32 - 20:37
    or use PubChem to look up the SMILES,
    check if the information is complete,
  • 20:37 - 20:39
    particularly the stereochemistry,
  • 20:39 - 20:43
    and then I use that
    create-Wikidata-item script
  • 20:43 - 20:46
    to create Quickstatements
    for that compound.
  • 20:48 - 20:50
    If there already is a Wikidata item,
  • 20:50 - 20:56
    I basically just update these scripts,
  • 20:56 - 21:00
    but rather than say "CREATE" and "LAST,"
    I replace the LAST with the Q-code
  • 21:00 - 21:02
    that that item already has.
  • 21:02 - 21:05
    And then it complements
    or it adds this information
  • 21:05 - 21:07
    based on the information we had.
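
The replacement described here can be sketched as a small Python function. The function name and the example item id are hypothetical; the input is assumed to be tab-separated QuickStatements lines generated for a new item.

```python
def retarget_statements(lines, qid):
    """Rewrite statements for a new item so they apply to an existing item.

    Drops the CREATE line and replaces LAST in the first column with qid.
    """
    rewritten = []
    for line in lines:
        if line.strip() == "CREATE":
            continue  # the item already exists, nothing to create
        fields = line.split("\t")
        if fields[0] == "LAST":
            fields[0] = qid
        rewritten.append("\t".join(fields))
    return rewritten


generated = ["CREATE", 'LAST\tP235\t"QTBSBXVTEAMEQO-UHFFFAOYSA-N"']
# "Q_EXISTING_ITEM" is a placeholder for the Q-code the item already has.
print("\n".join(retarget_statements(generated, "Q_EXISTING_ITEM")))
```
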
  • 21:08 - 21:10
    This is [manuable],
  • 21:10 - 21:14
    so you can copy-paste
    a number of SMILES, put it in a file,
  • 21:14 - 21:16
    and take that.
  • 21:18 - 21:22
    Extracting that information to Wikidata
    is not something I've automated yet,
  • 21:22 - 21:25
    but this helps me...
    it's a pretty fast process.
  • 21:26 - 21:28
    I can show you later
    how to use that software.
  • 21:31 - 21:32
    (chairman) Are there other questions?
  • 21:34 - 21:35
    So, I have one.
  • 21:35 - 21:40
    Do you make an effort
    to, in fact, make this more visible
  • 21:40 - 21:42
    in this bioinformatics community
  • 21:42 - 21:47
    so that they can start using
    this structured data?
  • 21:48 - 21:49
    (Egon) Yeah, I'm actively doing that.
  • 21:49 - 21:52
    So what I did not mention
    in this presentation so much,
  • 21:52 - 21:58
    but we saw that in...
    I'd have somewhere to start here--
  • 21:58 - 22:01
    this is an overview
    of different databases.
  • 22:01 - 22:05
    A similar plot, which actually
    I do not have on this slide deck
  • 22:05 - 22:09
    is the number of different identifiers
    that chemical compounds have,
  • 22:09 - 22:11
    and I've been working
    with a number of databases,
  • 22:11 - 22:16
    like MassBank,
    the Environmental Protection Agency's
  • 22:17 - 22:19
    CompTox Dashboard.
  • 22:20 - 22:22
    I've added links to the BDB database.
  • 22:22 - 22:24
    So I'm working with a number of projects
  • 22:24 - 22:27
    for pulling in additional information,
  • 22:28 - 22:30
    identifiers, or links out
    to other databases.
  • 22:31 - 22:33
    Regarding outreach, yes,
  • 22:33 - 22:37
    so that wrong slide deck
    that I was showing at the start,
  • 22:37 - 22:39
    there was actually
    a presentation two weeks ago
  • 22:39 - 22:42
    at an Open Science Meeting
    around chemistry.
  • 22:43 - 22:46
    I'm very much pushing this and...
  • 22:47 - 22:49
    I see a big future here.
  • 22:49 - 22:51
    There's a lot of interest.
  • 22:51 - 22:55
    And making people aware
    of the CC0 license,
  • 22:55 - 22:58
    that's typically the larger problem.
  • 22:58 - 23:03
    So we have to pull in
    the information carefully.
  • 23:05 - 23:07
    (chairman) Other questions?
  • 23:09 - 23:10
    - Okay.
    - Thank you very much.
  • 23:10 - 23:12
    (chairman) Can we thank the speaker.
  • 23:12 - 23:15
    (applause)
Video Language:
English
Duration:
23:21
