cdn.media.ccc.de/.../wikidatacon2019-1144-eng-Cheminformatics_to_improve_Wikidata_on_chemical_compounds_hd.mp4

0:06 - 0:10

(chairman) So, let's welcome Egon,
0:10 - 0:13

who will describe how he is trying
0:13 - 0:18

to improve the coverage
of chemical compounds in Wikidata.
0:18 - 0:20

(Egon) Yeah, thank you.
0:21 - 0:22

So, let's see...
0:23 - 0:25

Oh, this is not right.
0:26 - 0:27

Sorry.
0:32 - 0:33

They put the wrong slide deck.
0:40 - 0:43

(person 1) The one was better than--
(person 2) [inaudible]
0:43 - 0:44

(person 1) (laughs)
0:49 - 0:50

How about this?
0:53 - 0:55

This one actually says WikidataCon.
0:57 - 0:58

Slightly different slides.
1:00 - 1:06

Okay, so yeah, coverage and correctness,
1:06 - 1:08

accuracy, quality, if you like.
1:09 - 1:12

And the other thing here
1:12 - 1:15

is what makes it different
from some of the other things
1:15 - 1:17

that we've seen at the WikidataCon
1:17 - 1:20

is how I do this quality,
1:20 - 1:22

and the coverage, actually.
1:22 - 1:25

So I'm actually taking advantage here
of my background,
1:25 - 1:27

which is in cheminformatics,
1:27 - 1:30

which is something
that we use in our research.
1:30 - 1:33

And cheminformatics
1:33 - 1:35

is a way of understanding
the chemical structures,
1:35 - 1:37

which we will see in a moment.
1:37 - 1:40

We can do things that we cannot do
with the regular toolsets
1:40 - 1:44

that we have, like Shape Expressions,
quality constraints and the like.
1:46 - 1:49

Now, one of the interesting
things of chemistry
1:49 - 1:52

is that chemical structures
1:53 - 1:55

are sometimes the same,
sometimes not the same,
1:55 - 1:58

depending on what you want to know.
1:58 - 2:01

And this slide reflects that a bit,
2:01 - 2:05

and what we see here
is biologically the same compounds
2:05 - 2:08

but chemically two different compounds.
2:08 - 2:12

But at biological levels
with the [inaudible],
2:12 - 2:14

they are in equilibrium,
2:14 - 2:16

and you will not be able
to really distinguish between them,
2:17 - 2:20

unless you're looking
for a particular type of biology
2:20 - 2:22

like reaction mechanisms.
2:23 - 2:27

Another interesting thing
about Wikidata and Wikipedia
2:27 - 2:32

is that we have things
like long-chain fatty acids,
2:32 - 2:35

chemical concepts
which are not a specific compound
2:35 - 2:37

but actually a class of compounds.
2:37 - 2:43

Now, this class can be based
on similar features in the molecules,
2:43 - 2:46

so like in the case
of the long-chain fatty acids--
2:46 - 2:52

they all have a long fatty chain
and an acid group.
2:52 - 2:54

In other cases, there are the classes
2:54 - 2:57

based more
on the biological functionality,
2:57 - 3:00

like a certain type of inhibitor,
like an ACE inhibitor.
3:02 - 3:06

And this introduces
a lot of interesting things,
3:06 - 3:10

partly because of this close link
with Wikipedias.
3:10 - 3:13

And one of the things that we see
3:13 - 3:18

is that Wikipedia may have a chembox
for a particular compound
3:18 - 3:20

but actually be about a compound class,
3:20 - 3:25

resulting in a slightly different concept
3:25 - 3:27

of what the two things
are actually meaning
3:27 - 3:30

and the sitelink being more complicated.
3:33 - 3:39

We need this
for understanding the biology.
3:39 - 3:43

So the research in our group
is understanding the living cell.
3:43 - 3:47

The systems biology here
described in the biological process
3:47 - 3:51

is we have a pathway database
for that--WikiPathways,
3:51 - 3:54

and if we look at the chemistry in there
of the small molecules,
3:56 - 3:59

the chemistry is sometimes
described in a lot of details,
3:59 - 4:01

sometimes in less detail,
4:01 - 4:04

pretty much like this Wikipedia
Wikidata link that we just had
4:04 - 4:08

resulting in basically links
to a lot of different databases
4:08 - 4:10

with slightly different focuses.
4:11 - 4:15

Some databases like LIPID MAPS
and the Human Metabolome Database,
4:15 - 4:18

they are very much focused on the biology,
4:18 - 4:21

whereas a database like ChEBI,
4:21 - 4:25

that's very much focused
on the chemical entities.
4:26 - 4:28

So we try to bridge that,
4:28 - 4:33

and that, two or three years ago, gave us
a very interesting insight
4:33 - 4:36

that if you look at the lines here
4:36 - 4:40

where in blue we have the total number
of the small molecules
4:40 - 4:42

we have in these Pathways
4:42 - 4:45

and the numbers in red
that we can match to that,
4:45 - 4:48

there is this gap, and this gap
is complicated chemistry.
4:49 - 4:52

Also, poignantly, things
missing in Wikidata.
4:53 - 4:55

So therefore the need
for data completeness
4:55 - 4:58

and the data quality.
5:00 - 5:02

And here we have an example.
5:02 - 5:06

This is actually
a curation report of yesterday,
5:06 - 5:08

and these are still things
that we have in Pathways
5:08 - 5:14

but that we do not know
what the equivalent thing is in Wikidata.
5:14 - 5:18

And one of the things here
that I'm picking out here
5:18 - 5:19

is strigolactone.
5:19 - 5:21

And this is a class of compounds.
5:22 - 5:27

So we have that in one of our Pathways,
this particular Pathway over there.
5:29 - 5:32

So you start matching this
to Wikidata and Wikipedia,
5:32 - 5:35

and there is actually just
this compound Wikipedia page
5:36 - 5:39

with these six structures--
5:39 - 5:42

images, name, no links, nothing.
5:42 - 5:43

Nothing in Wikipedia,
5:43 - 5:46

just this information,
not machine-readable.
5:48 - 5:53

So based on the name,
I can actually find three out of the six
5:53 - 5:55

of these compounds in Wikidata,
5:55 - 5:57

not linked, not classified.
5:57 - 6:00

So if we look at the class
of strigolactones,
6:00 - 6:03

of which these six are examples,
6:03 - 6:05

Wikidata did not give us anything.
6:05 - 6:08

So that's the kind of curation
that I'm interested in.
6:09 - 6:13

On the right here--
that page is actually pretty much empty,
6:13 - 6:17

but it's exactly what Scholia is showing
for this class of chemical compounds.
6:18 - 6:23

So Scholia is one of the tools
that I've been using to do this curation.
6:26 - 6:31

This missing classification
is a bit of information
6:31 - 6:32

missing in Wikidata,
6:32 - 6:35

but we can add this classification,
6:35 - 6:37

and we can retrieve that
from some sources.
6:37 - 6:39

We will see with LIPID MAPS later,
6:39 - 6:41

we can automate
adding these missing links,
6:42 - 6:44

if we understand the chemistry.
6:46 - 6:51

So this diagram over here--
we have fatty acid over there again
6:51 - 6:54

and the long-chain fatty acid over here
6:54 - 6:57

that we saw on one
of the previous slides--
6:57 - 7:01

very long-chain fatty acids
and a number of other fatty acids.
7:01 - 7:04

This kind of information helps us see
7:04 - 7:09

a [inaudible] of the chemistry
in Wikidata.
7:10 - 7:14

Scholia can visualize the 2D structure,
7:14 - 7:16

and this thing is actually
automatically generated
7:16 - 7:19

from the chemical structure in Wikidata
7:19 - 7:22

created on the fly
as Scalable Vector Graphics.
7:23 - 7:24

(coughing)
7:25 - 7:26

Sorry.
7:26 - 7:28

With the stereochemistry annotation there
7:28 - 7:33

to help the chemist see
the completeness of the data
7:33 - 7:36

because also the stereochemistry
might be missing.
7:36 - 7:40

We also get an overview on Scholia
of related compounds
7:40 - 7:42

based on the InChIKey,
7:42 - 7:46

where the first block basically indicates
how the atoms are connected
7:46 - 7:50

and the second column
indicates things like stereochemistry
7:50 - 7:54

and things like
which isotopes are in there,
7:54 - 7:57

for example C11 instead of C12
7:57 - 8:02

or C13 instead of C12.
8:03 - 8:06

The last number
or the last letter over here
8:06 - 8:09

actually indicates the charge,
8:09 - 8:14

so that's the example that we saw earlier
between the citric acid and the citrate,
8:14 - 8:17

or was it the acetic acid and the acetate?
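
The InChIKey layout described here can be sketched in a few lines of Python. This is only an illustration of the three parts of a standard InChIKey, not how Scholia itself does the grouping; the function names are made up for the example. The second block also carries a couple of flag characters besides the stereochemistry and isotope hash.

```python
from collections import defaultdict
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    connectivity: str  # first block, 14 letters: how the atoms are connected
    layers: str        # second block: stereochemistry, isotopes, plus flag characters
    protonation: str   # final letter: "N" for neutral, other letters otherwise


def split_inchikey(key: str) -> InChIKeyParts:
    """Split a standard InChIKey of the form XXXXXXXXXXXXXX-YYYYYYYYYY-Z."""
    blocks = key.split("-")
    if [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not a standard InChIKey: {key!r}")
    return InChIKeyParts(*blocks)


def group_by_connectivity(keys):
    """Group InChIKeys that share the same first block (hypothetical helper)."""
    groups = defaultdict(list)
    for key in keys:
        groups[split_inchikey(key).connectivity].append(key)
    return dict(groups)


keys = [
    "QTBSBXVTEAMEQO-UHFFFAOYSA-N",  # acetic acid
    "QTBSBXVTEAMEQO-UHFFFAOYSA-M",  # acetate
]
related = group_by_connectivity(keys)
```

Both keys land in the same group because only the final letter differs, which is the acid versus anion case from the earlier slide.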
8:20 - 8:23

By putting in a bit of domain knowledge,
8:23 - 8:28

we can do a lot more... making sense
of what we have in Wikidata.
8:29 - 8:33

A bit more about Scholia
8:33 - 8:36

is that about data completeness
with the physical and chemical properties,
8:36 - 8:40

the literature, those are whole things
that we want to have access to.
8:40 - 8:45

But it only works if we can find
the right chemical in Wikidata.
8:49 - 8:52

We started using Wikidata
in a number of our projects,
8:52 - 8:55

so WikiPathways was one of them.
8:55 - 8:56

This is another project
8:56 - 9:00

in the area of the nanosafety,
risk assessment,
9:00 - 9:03

where they use OECD testing guidelines
9:03 - 9:08

and using Wikidata here
to make an overview of the experiments.
9:08 - 9:12

And this means that we can now
actually start annotating articles
9:12 - 9:14

where these protocols have been used.
9:16 - 9:20

And in this way, we get a better insight
in the quality of literature as well.
9:21 - 9:24

We get to see which DDTs
9:24 - 9:27

are well tested, established
experimental methods
9:27 - 9:32

and an indication of how good the data is
that came out of that.
9:35 - 9:38

Another example--this is nanomaterials,
9:38 - 9:40

specific nanomaterials,
9:41 - 9:43

where there is a unique code--
we've added that--
9:43 - 9:45

with the same purpose of being able
9:45 - 9:48

to track down literature
about these nanomaterials.
9:49 - 9:52

But, again, we need exact descriptions.
9:53 - 9:56

Now, this is
the LIPID MAPS classification,
9:56 - 9:58

and here we see an interesting thing,
9:58 - 10:02

and this has shown up
in some of the presentations,
10:03 - 10:04

elsewhere as well.
10:04 - 10:08

This idea that some of the things
that we have in Wikidata
10:08 - 10:10

is not always matching the sources.
10:10 - 10:12

So different ontological models,
10:12 - 10:16

different ideas
of what a particular thing means.
10:16 - 10:21

And so, if we look at the LIPID MAPS,
we have a lipid in the middle
10:21 - 10:23

and then a number of classes,
10:23 - 10:25

and many of these are in Wikidata.
10:25 - 10:31

But here, around actually
fatty acids or fatty acyls,
10:31 - 10:33

that's where there is a mismatch
10:33 - 10:37

causing something that should be
actually purely hierarchical,
10:37 - 10:41

actually it started to show
some loops over there,
10:41 - 10:46

the mismatch of two representations
of a lipid chemistry.
10:49 - 10:52

Now, the goal of this work
is not so much to reconcile this
10:52 - 10:53

but to visualize it
10:53 - 10:56

so that we can understand
what is going on
10:56 - 10:59

and correct things
that are actually clearly wrong.
11:06 - 11:10

The interesting thing about LIPID MAPS
is actually that the classification
11:10 - 11:13

is indicated in the external identifier.
11:13 - 11:15

So one of the things
that we've been using
11:15 - 11:19

is these external numbers
to make this automatic classification
11:19 - 11:22

because everything
that starts with an LMFA05
11:23 - 11:26

is actually a fatty alcohol.
11:26 - 11:28

So I can translate that
into Quickstatements,
11:28 - 11:30

push that into Quickstatements
11:30 - 11:33

and get that annotated in Wikidata.
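
A rough sketch of that prefix-based classification, under several assumptions: the only prefix meaning taken from the talk is LMFA05 for fatty alcohols; the item IDs, the example LIPID MAPS identifiers and the class QID are placeholders, not real Wikidata items; and the choice of P279 (subclass of) as the linking property is an assumption, not necessarily what the actual scripts use.

```python
# Prefix -> class item. Only the meaning of LMFA05 comes from the talk.
# The QID is a placeholder: replace it with the real "fatty alcohol" item.
PREFIX_TO_CLASS = {"LMFA05": "Q_FATTY_ALCOHOL_PLACEHOLDER"}


def classify(items, prefix_to_class=PREFIX_TO_CLASS, prop="P279"):
    """Return tab-separated QuickStatements rows for (qid, lipid_maps_id) pairs.

    Items whose identifier matches no known prefix are skipped.
    """
    rows = []
    for qid, lm_id in items:
        for prefix, class_qid in prefix_to_class.items():
            if lm_id.startswith(prefix):
                rows.append(f"{qid}\t{prop}\t{class_qid}")
                break
    return rows


# Hypothetical input: item IDs and LIPID MAPS IDs invented for illustration.
items = [
    ("Q_EXAMPLE_1", "LMFA05000001"),
    ("Q_EXAMPLE_2", "LMFA01010001"),
]
rows = classify(items)
```

Writing the rows to a file first, rather than editing Wikidata directly, keeps a manual check step before anything is submitted.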
11:40 - 11:44

This slide is just reflecting
the advantage for LIPID MAPS here
11:44 - 11:46

we've been collaborating with them
11:46 - 11:50

because they get a lot of data
out of Wikidata as well,
11:50 - 11:55

which we can cross-reference,
which we can compare if it's correct.
11:55 - 11:58

LIPID MAPS is a quite curated database
11:58 - 12:02

but like everyone actually having trouble
12:02 - 12:05

with access to literature,
12:05 - 12:08

the demand of literature
and filtering the literature,
12:08 - 12:10

getting to the right articles.
12:13 - 12:15

Shape Expressions is probably
something that you've seen.
12:15 - 12:19

We have a few of them for chemistry now.
12:19 - 12:21

This is the example for racemic mixture.
12:21 - 12:26

In a case of racemic mixture,
you want to have two parts in there.
12:26 - 12:27

It's a mixture,
12:27 - 12:31

so at least two chemical entities
need to be in there.
12:31 - 12:36

Moreover, each of the [inaudible] parts
has to be a chemical compound.
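
The two constraints just described can be mimicked in plain Python. This is not the Shape Expression itself, only an analogue of its logic; the dictionary layout for an item is invented for the example. Q11173 is the Wikidata item for "chemical compound".

```python
CHEMICAL_COMPOUND = "Q11173"  # Wikidata item for "chemical compound"


def check_racemic_mixture(item):
    """Return a list of problems; an empty list means both checks pass.

    `item` is a hypothetical dict: {"id": ..., "has_part": [{"id": ..., "instance_of": [...]}]}
    """
    problems = []
    parts = item.get("has_part", [])
    # A mixture needs at least two chemical entities.
    if len(parts) < 2:
        problems.append(f"{item['id']}: fewer than two parts")
    # Each part has to be typed as a chemical compound.
    for part in parts:
        if CHEMICAL_COMPOUND not in part.get("instance_of", []):
            problems.append(f"{part['id']}: not a chemical compound")
    return problems


# Hypothetical items, for illustration only.
good = {"id": "Q_MIX", "has_part": [
    {"id": "Q_R", "instance_of": ["Q11173"]},
    {"id": "Q_S", "instance_of": ["Q11173"]},
]}
bad = {"id": "Q_BAD", "has_part": [{"id": "Q_ONLY", "instance_of": []}]}
```

Here `check_racemic_mixture(good)` returns an empty list, while `bad` fails both checks.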
12:36 - 12:41

This is another level of a way
we can curate the content.
12:42 - 12:44

There have to be more of them.
12:44 - 12:48

We have quite a few
different concepts in Wikidata
12:48 - 12:50

like groups of co-compounds.
12:51 - 12:55

There is a class that is
of structurally similar compounds, etc.
12:57 - 13:01

If you run a query like this,
this case for the other one,
13:01 - 13:04

other schema that we have
for chemical elements,
13:04 - 13:07

you can do the same thing--
you can run it on a single item
13:07 - 13:10

or you can run that on everything
that is a chemical element.
13:10 - 13:14

This is something
that I can very much recommend
13:14 - 13:17

having a look at
if you have not done so already.
13:18 - 13:22

Now, if we go to the automation of things,
13:22 - 13:26

here I'm using a tool called Bioclipse.
13:26 - 13:29

This is something that we worked on
some time ten years ago.
13:29 - 13:32

It's a platform
for chemistry and biology,
13:32 - 13:35

or cheminformatics and bioinformatics,
13:35 - 13:37

aimed at automating things,
13:40 - 13:41

including visualizations and the like.
13:41 - 13:45

But I've taken that now
and developed a number of scripts
13:45 - 13:47

that I can actually run
on the command line,
13:47 - 13:51

which makes it easier to automate things,
as we will see in a moment,
13:51 - 13:52

and doing all sort of things,
13:52 - 13:56

for example, the classification
according to the LIPID MAPS identifiers,
13:56 - 14:02

those scripts are all available
from the GitHub repository here.
14:03 - 14:06

And typically, I have them
create Quickstatements
14:06 - 14:09

because that gets me
an additional check step
14:09 - 14:14

after I've created the Quickstatements
to see what the data actually looks like.
14:16 - 14:19

Annotation of main subjects.
14:19 - 14:23

This one is my script too,
starting from SMILES
14:23 - 14:27

to actually add chemical compounds
that are not in Wikidata yet,
14:27 - 14:29

which happens a lot.
14:30 - 14:34

So three or four weeks ago
I added something like 500 compounds
14:34 - 14:36

which our project was looking into
14:36 - 14:42

because these are
volatile compounds in oils.
14:45 - 14:47

This script adds the compounds.
14:47 - 14:51

They will later on add the annotation
of which species that compound comes from
14:51 - 14:53

and what the properties are.
14:56 - 14:59

Bioclipse itself is based
on the Chemistry Development Kit
15:00 - 15:02

and a few other libraries.
15:02 - 15:05

This allows me to do the chemistry.
15:05 - 15:08

And this is a very well-validated toolkit.
15:08 - 15:12

The SMILES part has been done
by John Mayfield.
15:12 - 15:17

I have done a lot of validation
against other tools.
15:18 - 15:21

And the quality
is actually really high now,
15:21 - 15:26

comparable to, or in some cases even better
than, commercial cheminformatics tools.
15:26 - 15:29

It has given me a lot of reassurance
15:29 - 15:35

that the quality checking that we do
with this tool on Wikidata
15:35 - 15:37

is giving interesting results.
15:38 - 15:40

This is the Quickstatements.
15:40 - 15:43

Quickstatements
is Magnus' work, of course.
15:44 - 15:48

What happens if we take the SMILES,
it calculates the InChI,
15:48 - 15:52

and the InChIKey, it even looks up
based on the InChIKey,
15:52 - 15:54

if there is a PubChem identifier
15:54 - 15:59

that uses the InChIKey,
the PubChem identifier,
15:59 - 16:01

to see if this compound
already is in Wikidata.
16:01 - 16:04

And only if it's not already there,
16:04 - 16:07

then it will actually create
a CREATE statement.
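
The "only create if it is not there yet" step might look roughly like this. It is a sketch, not the Bioclipse script: it assumes the InChIKey has already been computed from the SMILES by a cheminformatics toolkit, and it replaces the Wikidata lookup with an in-memory set of known InChIKeys. The function name and dictionary keys are made up; the property IDs (P31, P233, P235, P662) and Q11173 are real Wikidata identifiers.

```python
CHEMICAL_COMPOUND = "Q11173"  # Wikidata item for "chemical compound"


def create_statements(compound, known_inchikeys):
    """Return QuickStatements (V1) lines, or [] if the InChIKey is already known."""
    inchikey = compound["inchikey"]
    if inchikey in known_inchikeys:
        return []  # already present according to our lookup: emit nothing
    label = compound["label"]
    smiles = compound["smiles"]
    lines = [
        "CREATE",
        f"LAST\tP31\t{CHEMICAL_COMPOUND}",  # instance of: chemical compound
        f'LAST\tLen\t"{label}"',            # English label
        f'LAST\tP233\t"{smiles}"',          # SMILES
        f'LAST\tP235\t"{inchikey}"',        # InChIKey
    ]
    cid = compound.get("pubchem_cid")
    if cid:
        lines.append(f'LAST\tP662\t"{cid}"')  # PubChem CID
    return lines


acetic_acid = {
    "label": "acetic acid",
    "smiles": "CC(=O)O",
    "inchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
    "pubchem_cid": "176",
}
new_lines = create_statements(acetic_acid, known_inchikeys=set())
```

Passing a set that already contains the InChIKey yields an empty list, so nothing would be created for that compound.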
16:09 - 16:13

A bit of automatic classification
here is an option.
16:13 - 16:15

So if I'm adding a class of compounds,
16:15 - 16:18

I can automatically indicate
that these are all...
16:18 - 16:20

this type of compounds,
16:20 - 16:23

and I can also indicate, if needed,
16:23 - 16:25

if there is a particular article
where I got this information
16:25 - 16:28

from automatically adding references.
16:30 - 16:33

Well, this is what
the Quickstatements output looks like
16:33 - 16:36

for the annotation of main subjects.
16:36 - 16:38

You've probably seen that as well.
16:41 - 16:43

A newer thing that I started doing
16:43 - 16:46

is actually doing reasoning
on the data in Wikidata.
16:46 - 16:51

So if I have the SMILES, then I can check
the molecular formula, for example.
16:51 - 16:53

I can check the InChIKey.
16:55 - 17:00

At some point, what we are going to do
is calculate physicochemical properties
17:00 - 17:04

and see if that matches
what is in Wikidata.
17:05 - 17:07

This will highlight typos
17:08 - 17:10

or wrong units, for example.
17:11 - 17:15

At this moment...
so this is a run of this morning.
17:15 - 17:18

What we see here
is two tests actually failing,
17:18 - 17:20

and this is an example of it.
17:20 - 17:23

Here the InChIKey
that is computed from the isomeric SMILES
17:23 - 17:27

is different from the InChIKey
given in the entry.
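
A minimal sketch of such a consistency test, assuming the InChIKey has already been recomputed from the isomeric SMILES elsewhere. It only compares the stored and recomputed keys and reports which part differs, which hints at the kind of error; the function name and labels are made up for the example.

```python
def compare_inchikeys(stored, computed):
    """Return None if the keys agree, else a label for the first part that differs."""
    if stored == computed:
        return None
    s, c = stored.split("-"), computed.split("-")
    if len(s) != 3 or len(c) != 3:
        return "malformed"
    if s[0] != c[0]:
        return "connectivity"   # different atoms or bonds
    if s[1] != c[1]:
        return "second block"   # e.g. stereochemistry or isotopes differ
    return "protonation"        # only the final letter differs


# The stored key has no stereochemistry (alanine), while the key
# recomputed from an isomeric SMILES does (L-alanine).
mismatch = compare_inchikeys(
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",
    "QNAYBMKLOCPYGJ-REOHSEJISA-N",
)
```

In this example `mismatch` is `"second block"`: the atoms are connected the same way, but the stereochemistry in the entry and in the SMILES do not agree.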
17:28 - 17:32

This can result from data
being pulled in from different resources.
17:33 - 17:36

So these are entries, about 300 of them,
17:36 - 17:39

in the 160,000 chemicals
that we have in Wikidata.
17:39 - 17:42

So it's a very small amount, really,
17:42 - 17:46

where there is information,
and someone needs to look at it.
17:47 - 17:51

Now, these are all organic compounds
17:51 - 17:54

and also quite a few inorganic compounds
17:54 - 17:57

where these things just work less well.
17:58 - 18:01

But I found in the other test
that is failing
18:01 - 18:04

immediately a couple of things
that are very clearly wrong.
18:09 - 18:12

PubChem is a huge database.
18:12 - 18:14

They do validation as well.
18:14 - 18:17

We are in the process
of submitting Wikidata there,
18:17 - 18:19

which I'm really happy about.
18:19 - 18:23

It's in the last validation step
at this moment.
18:23 - 18:26

And this will also mean that PubChem,
18:26 - 18:29

which has something
like 100 million compounds
18:29 - 18:31

will actually link back to Wikidata.
18:31 - 18:35

It already does this, but via Wikipedia.
18:35 - 18:37

(laughing)
18:37 - 18:38

Do you recognize it?
18:38 - 18:43

With the aforementioned issues there
of concept mismatches.
18:44 - 18:46

So this will give us a second thing.
18:46 - 18:50

And there, also,
using the same Bioclipse scripts
18:50 - 18:52

or similar Bioclipse scripts,
18:52 - 18:53

we get validation reports,
18:53 - 18:57

again indicating things
that chemists should look at.
19:00 - 19:02

That basically wraps it up.
19:02 - 19:05

This is still a work in progress,
the article is in preparation.
19:05 - 19:08

I've been working
with Finn here in Scholia
19:08 - 19:11

to support this validation.
19:11 - 19:16

We're writing up the full work,
but for now you can look up this poster.
19:16 - 19:19

The slides are on the program
of this session,
19:19 - 19:22

so you can look at the slides
and look at the details.
19:23 - 19:25

And a quick acknowledgment:
19:25 - 19:28

some of this work has been done
by a number of grants that I received.
19:28 - 19:30

And thank you very much.
19:30 - 19:32

(applause)
19:36 - 19:38

(chairman) Are there any questions?
19:41 - 19:43

(person 3) Thank you so much for this.
19:43 - 19:44

I am [inaudible],
19:44 - 19:47

and so far, I've been reading articles
19:47 - 19:50

on the [inaudible] Wikipedia
on different compounds.
19:50 - 19:54

I have a little bit more than 70 articles
with different compounds--
19:54 - 19:55

just things I come across.
19:56 - 19:58

And my question to you is
19:58 - 20:02

if I want to move my chemistry activity
from Wikipedia to Wikidata,
20:02 - 20:05

how can I help
in a way that is very friendly
20:05 - 20:10

to somebody who is a beginner
in that field on Wikidata?
20:12 - 20:16

So, if that compound is in Wikipedia and...
20:16 - 20:18

Sometimes there is
actually a Wikidata page.
20:18 - 20:20

I occasionally run into this as well,
20:20 - 20:22

in the last couple of months
not so much anymore
20:22 - 20:24

but this morning, actually.
20:26 - 20:27

And what I typically do then
20:27 - 20:31

is I take the SMILES
from [inaudible] infobox
20:31 - 20:32

from that compound
20:32 - 20:37

or use PubChem to look up the SMILES,
check if the information is complete,
20:37 - 20:39

particularly the stereochemistry,
20:39 - 20:43

and then I use that
"create Wikidata item" script
20:43 - 20:46

to create Quickstatements
for that compound.
20:48 - 20:50

If there already is a Wikidata item,
20:50 - 20:56

I basically just update these scripts,
20:56 - 21:00

but rather than say "CREATE", "LAST",
I replace the LAST with the Q-code
21:00 - 21:02

that that item already has.
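
That manual step can be sketched as a small rewrite of the generated commands: drop the CREATE line and put the existing Q-code wherever LAST appears. The function name is made up; Q4115189 is the Wikidata Sandbox item, used here only as a stand-in for the real item.

```python
def retarget(commands, qid):
    """Rewrite QuickStatements (V1) lines so they update an existing item.

    Drops CREATE lines and replaces a leading LAST with the given Q-code.
    """
    out = []
    for line in commands:
        fields = line.split("\t")
        if fields[0] == "CREATE":
            continue  # the item already exists, so nothing to create
        if fields[0] == "LAST":
            fields[0] = qid
        out.append("\t".join(fields))
    return out


generated = [
    "CREATE",
    "LAST\tP31\tQ11173",
    'LAST\tP235\t"QTBSBXVTEAMEQO-UHFFFAOYSA-N"',
]
updated = retarget(generated, "Q4115189")  # sandbox item as a placeholder
```

Statements that the item already has are then simply complemented by the ones it was missing.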
21:02 - 21:05

And then it complements
or it adds this information
21:05 - 21:07

based on the information we had.
21:08 - 21:10

This is [manuable],
21:10 - 21:14

so you can copy-paste
a number of SMILES, put it in a file,
21:14 - 21:16

and take that.
21:18 - 21:22

Extracting that information to Wikidata
is not something I've automated yet,
21:22 - 21:25

but this helps me...
it's a pretty fast process.
21:26 - 21:28

I can show you later
how to use that software.
21:31 - 21:32

(chairman) Are there other questions?
21:34 - 21:35

So, I have one.
21:35 - 21:40

Do you make an effort
to, in fact, make this more visible
21:40 - 21:42

in this bioinformatics community
21:42 - 21:47

so that they can start using
this structured data?
21:48 - 21:49

Yeah, I'm actively doing that.
21:49 - 21:52

So what I did not mention
in this presentation so much,
21:52 - 21:58

but we saw that in...
I'd have somewhere to start here--
21:58 - 22:01

this is an overview
of different databases.
22:01 - 22:05

A similar plot, which actually
I do not have on this slide deck
22:05 - 22:09

is the number of different identifiers
that chemical compounds have,
22:09 - 22:11

and I've been working
with a number of databases,
22:11 - 22:16

like MassBank,
the Environmental Protection Agency's
22:17 - 22:19

CompTox Dashboard.
22:20 - 22:22

I've added links to the BDB database.
22:22 - 22:24

So I'm working with a number of projects
22:24 - 22:27

for pulling in additional information,
22:28 - 22:30

identifiers or links out
to other databases.
22:31 - 22:33

Regarding outreach, yes,
22:33 - 22:37

so that wrong slide deck
that I was showing at the start,
22:37 - 22:39

there was actually
a presentation two weeks ago
22:39 - 22:42

at an Open Science Meeting
around chemistry.
22:43 - 22:46

I'm very much pushing this and...
22:47 - 22:49

I see a big future here.
22:49 - 22:51

There's a lot of interest.
22:51 - 22:55

And making people aware
of the CC0 license,
22:55 - 22:58

that's typically the larger problem.
22:58 - 23:03

So we have to pull in
the information carefully.
23:05 - 23:07

(chairman) Other questions?
23:09 - 23:10

- Okay.
- Thank you very much.
23:10 - 23:12

(chairman) Can we thank the speaker.
23:12 - 23:15

(applause)

Title: cdn.media.ccc.de/.../wikidatacon2019-1144-eng-Cheminformatics_to_improve_Wikidata_on_chemical_compounds_hd.mp4
Video Language: English
Duration: 23:21

	Bar Sch edited English subtitles for cdn.media.ccc.de/.../wikidatacon2019-1144-eng-Cheminformatics_to_improve_Wikidata_on_chemical_compounds_hd.mp4
	C3Subtitles edited English subtitles for cdn.media.ccc.de/.../wikidatacon2019-1144-eng-Cheminformatics_to_improve_Wikidata_on_chemical_compounds_hd.mp4