Algorithms for Automated Discovery of Mutated Pathways in Cancer – Ben Raphael

November 10, 2019

Ben Raphael:
Okay. Thank you. Thanks to the organizers for the opportunity to present our work.

So, we heard this morning that one of the challenges facing TCGA and other cancer genomics
projects is distinguishing functional driver mutations from random passenger mutations
that are also measured in the genome when we do high-throughput genome sequencing. And,
of course, functional mutation is really a biological phenomenon and ultimately must be
determined by experiment, but we can prioritize these experiments by looking for recurrent
mutations, mutations that are present in more of the patients than we would expect by some
model of chance, or, perhaps, zoom out a little bit and look at genes that are mutated more
than we would expect by chance.

And so, for the purposes of this talk, we’ll sort of look at mutations at the level of
genes, and we’ll assume that we’ve sort of carefully annotated the genes in our experiment,
so that we have a mutation matrix where for each patient and each gene I’ve indicated
whether or not there’s a somatic mutation that’s present.

And so, then the question is, are there genes that are mutated more than we expect by
chance? And so the standard way to do this is with a single-gene test where we look at each
column of the matrix and we ask, you know, do we see more ones in that column than we would
expect? And then, since we’re performing all of these independent tests, we then have to
go do some multiple hypothesis correction to correct for the fact that we did thousands
of statistical tests.

So, this was the approach, simplified, that was used in the first two TCGA papers, and
here are the results from those two papers with the list of significantly mutated genes,
and as was described also this morning, these lists are pretty short. We don’t find that
many significantly mutated genes at a level of statistical confidence that we’re happy
with, and there’s various reasons for this.
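The single-gene test described above can be sketched in a few lines. This is only an illustration, not the model used in the TCGA papers: it assumes one flat background mutation probability per gene (the hypothetical `background_rate`), uses a one-sided binomial tail for each column of a toy mutation matrix, and applies a Benjamini-Hochberg correction as one common choice of multiple hypothesis correction. The gene names and data are made up.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def single_gene_test(matrix, genes, background_rate):
    """Test each column (gene) of a patients-by-genes 0/1 matrix."""
    n_patients = len(matrix)
    counts = [sum(row[j] for row in matrix) for j in range(len(genes))]
    pvals = [binomial_tail(k, n_patients, background_rate) for k in counts]
    qvals = benjamini_hochberg(pvals)
    return {g: (k, p, q) for g, k, p, q in zip(genes, counts, pvals, qvals)}


# Toy mutation matrix: rows are patients, columns are genes (made-up data).
genes = ["GENE_A", "GENE_B", "GENE_C"]
matrix = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
results = single_gene_test(matrix, genes, background_rate=0.1)
for gene, (count, p, q) in results.items():
    print(f"{gene}: mutated in {count} patients, p={p:.2g}, q={q:.2g}")
```

In practice the background rate would depend on things like gene length and sequence context, which is part of why the choice of statistical model matters so much here.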
Our statistical model may not be good, and certainly that’s part of the reason, and our
passenger mutation rate may not be quite right. The data itself, of course, has false positives
and false negatives in it. Maybe we don’t have enough samples. I mean, 91 samples is
a pretty small number to try to find recurrently mutated genes, especially given lots of the
mutational heterogeneity that’s present in tumors.

But there’s also a biological reason, and that’s that genes don’t act on their own, but
of course, act together in pathways or networks, and so cancer is also sometimes called
a disease of pathways. And this was also appreciated in the original TCGA paper: in addition
to finding single, significantly mutated genes, there were tests of pathways that were
done, and the standard approach here is to look at known pathways. So this figure shows
a network, and within that network of genes, individual pathways were extracted, and you
can then do a variant of the single-gene test, just looking simultaneously at multiple
columns in this matrix. And again, ask the question, is now this pathway, this group of
genes, mutated more than we would expect by chance?

And here we see the p53 pathway. This is sort of a schematic, but in the data, the p53
pathway was mutated in 87 percent of the patients, which was much higher than any single
gene in that pathway.

Okay, but of course, this approach has some limitations. We’re only looking at the pathways
that we know. We’re ignoring somehow the connections between the genes, the topology. We’re
just viewing these as sort of columns of the matrix, and, moreover, this idea that pathways
are their own, discrete units is somewhat of a simplification. This figure even shows that,
you know, the pathways themselves are interconnected.
This is sometimes called cross-talk.

So, what we’ve been doing is we’ve been asking the question: we have a lot of sequenced
genomes, and we’re getting more and more. So, could we start to develop methods where we
can look at combinations of mutated genes that are somehow less biased by prior knowledge
of pathways? So, in going from, on the left of the picture, known pathways, could we instead
look at all combinations of genes, use no prior information? Of course, as we reduce the
amount of prior knowledge of which combinations of genes we’re going to look at, we increase
the number of hypotheses that we have to test. So, for example, if we wanted to test all
possible groups of fewer than six genes, that’s 10 to the 22nd hypotheses. We would need
a lot of samples in order to obtain any statistical significance.

So, we might look for some intermediates.
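To get a feel for how quickly the number of hypotheses grows, here is a small sketch that counts gene sets up to a given size. The gene count `N_GENES = 20_000` is an assumption for illustration (the talk does not say how many genes were considered), and the exact exponent depends on that number and on the size cutoff.

```python
from math import comb


def count_gene_sets(n_genes, max_size):
    """Number of non-empty gene sets containing at most max_size genes."""
    return sum(comb(n_genes, k) for k in range(1, max_size + 1))


N_GENES = 20_000  # assumed number of genes, for illustration only

for max_size in range(1, 7):
    total = count_gene_sets(N_GENES, max_size)
    # len(str(total)) - 1 is the power of ten just below the count.
    print(f"sets of up to {max_size} genes: roughly 10^{len(str(total)) - 1}")
```

Under this assumed gene count, the total reaches the order of 10 to the 22nd once sets of six genes are included, which is far more hypotheses than a few hundred samples can support.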
So, maybe we’d want to somehow restrict our groups of genes to those that are in an
interaction network, maybe a network constructed from a superposition of pathways. Even here,
if we try to exhaustively test every part of this interaction network, maybe we look
for subnetworks, we don’t really reduce the number of hypotheses that much, okay?

So, what we’ve been doing is developing algorithms that are sort of in between, at
different points of the spectrum, and I’m going to tell you very quickly about two of them.
The first is called HotNet, which uses the interaction network and tries to pull out
subnetworks of the interaction network that are mutated more than we expect, and the other
is called Dendrix, which gets closer to this idea of all combinations of genes. And with
both of these algorithms, you know, we compute P values in a robust manner, and I won’t
really get to describe the statistics in this talk.

So, HotNet. The approach here is that we have a predefined interaction network. This could
be some high-quality network where we, you know, take the textbook diagrams of pathways
and superimpose them. It could be some noisy thing that includes whatever [unintelligible]
you want. You take your mutation matrix, and the model is to find connected subnetworks
that are mutated in more patients than you would expect.

So, in doing so, now that we’ve moved to the
network, there’s really two considerations. It’s not just the frequency of mutations that’s
going to determine which subnetworks we pull out. It’s also the topology of the network. So,
to illustrate this briefly, on the left you see that we might, for example, have two genes
that are mutated at moderate frequency that are connected in the network via a single
path. And that’s somehow more surprising to us. That’s more of a clustering of mutations
than if we had the same two genes at the same frequencies but connected through
some gene of very high degree, a gene that was connected to many others. And this problem
of having these different topologies actually comes up a lot in looking at cancer genes,
because many cancer genes have very high degree in these interaction networks. So, we need
to account for both mutation frequency and topology.

And the model we use for this is we actually think of mutations as sources of heat on the
graph. So, what we do is we heat up each gene, which is a node in the network, in proportion
to its mutation frequency, and then we let that heat diffuse over the edges of the network.
And what this does is encode both the mutation frequency and the topology of the network
in a single model, and so now what we have is a distribution of heat on the graph, and
we can then break up the graph and find significantly hot subnetworks. And “significantly
hot” requires a somewhat subtle statistical test, which, again, I won’t describe, and I
refer you to the paper for more details there, but we can get robust P values and FDRs
for testing this hypothesis of significantly hot.

So, we worked with an ovarian analysis group
to apply HotNet to the ovarian data, and this was published earlier this year as part of
the paper. And running HotNet on the whole-exome mutation data and copy number data
together, we found 27 subnetworks of the HPRD network, a network that contained
37,000 interactions: 27 subnetworks with at least seven genes, with a reasonably good
P value. And, so, here’s a picture of them that sort of shows you the subnetworks, each
in a different color. Some of them are connected to each other. Some of them are sort of
more isolated in the network.

And so, what do you do with such a picture? Well, the first thing you do is you go see if
you’ve found anything that was already known. And, so, one thing that fell out immediately
when we looked at intersections with known pathways was that one of our subnetworks
overlaps significantly with the Notch signaling pathway, and so here was the picture of
Notch that appeared in the paper. And, what you can see is that each of the genes in this
pathway is not mutated at very high frequency.
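The way heat diffusion combines mutation frequency with network topology can be illustrated with a toy example. This is a minimal sketch, not the HotNet implementation: it assumes a plain explicit-Euler simulation of heat flowing along edges, and the graphs, node names, heat values, and the `time` and `dt` parameters are all made up for illustration.

```python
def diffuse_heat(edges, initial_heat, time=1.0, dt=0.01):
    """Simulate heat flow on an undirected graph with explicit Euler steps."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    heat = {node: float(initial_heat.get(node, 0.0)) for node in neighbors}
    for _ in range(round(time / dt)):
        # Along every edge, heat moves from the hotter node to the cooler one.
        flow = {
            node: sum(heat[nb] - heat[node] for nb in neighbors[node])
            for node in heat
        }
        heat = {node: heat[node] + dt * flow[node] for node in heat}
    return heat


# Two moderately mutated genes, A and B, joined through one intermediate gene.
path_edges = [("A", "M"), ("M", "B")]
# The same two genes joined through a hub that also touches ten other genes.
hub_edges = [("A", "HUB"), ("HUB", "B")] + [("HUB", f"X{i}") for i in range(10)]

start = {"A": 1.0, "B": 1.0}  # heat in proportion to a made-up mutation frequency
path_heat = diffuse_heat(path_edges, start)
hub_heat = diffuse_heat(hub_edges, start)

print(f"heat on the intermediate gene, simple path: {path_heat['M']:.3f}")
print(f"heat on the intermediate gene, hub:         {hub_heat['HUB']:.3f}")
```

In the simple path, the intermediate gene ends up much hotter than the hub does, because the hub passes most of its heat on to its many other neighbors. That is the sense in which the same two mutation frequencies are more surprising on a single path than around a high-degree gene.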
I guess NOTCH3 is mutated at moderate frequency. The others are mutated at fairly
low frequency. So, it’s both the frequency and the interactions that are driving this
prediction.

In total, 12 of the 27 overlapped either a known pathway or protein complex, and others
are sort of novel predictions. Some look interesting. They’re all published in the paper,
and I refer you to the appropriate supplement.

So, having looked at the interaction network, we decided to, you know, go be a little
bolder and see if we could just get rid of the interaction network entirely because, you
know, interaction networks are noisy, and sometimes when I give a talk, people complain
about them. So, I said, well, let’s get rid of it all, get rid of the whole thing. Well,
I said that there’s too many hypotheses to test if we wanted to look at all combinations
of genes. So, what we do is we impose some constraints on the sets of genes we’re going
to consider, and these constraints are driven by a couple of assumptions that are sort of
supported in the literature, and one we’ve heard about already a few times in this
workshop. So, under the assumption that driver mutations are relatively rare compared to
the passenger mutations, if there’s a pathway that, you know, has to be mutated in a
patient in order for that patient to have cancer, then there’s probably only one driver
mutation that’s necessary. Now, here we have to be careful what we mean by pathways. Some
pathways are large and, you know, hundreds of genes.
We mean something maybe more targeted, okay, so a pathway as shown here, and there’s, you
know, various evidence of this, and what this imposes then is a mutual exclusivity between
the mutations. And, so, the black bars here indicate mutually exclusive mutations, and
the red indicate co-occurring mutations, and you can see across this pathway, there’s lots
of exclusivity, and very few patients that have more than one mutation in this pathway.

The second assumption is that if the pathway, if the set of genes, is important, many of
the patients will have a mutation in that pathway, so the pathway should have high coverage.
There should be, you know, lots of patients with a driver mutation in that pathway.

So, with these two assumptions, we then introduce an algorithm. We call it de novo driver
exclusivity, or Dendrix for short. So, just directly from the mutation matrix, we try to
find sets of genes, columns (actually, they should be rows; I’ve transposed the matrix to
match the figure, so rows of the matrix). Here, it’s a contiguous set of rows, but that
doesn’t have to be the case, and we find sets that meet these two properties. This turns
out to be a computationally difficult problem, so we have a couple of algorithms for doing
this, and we have some theoretical results that show that they perform well, and we’ve
more recently, since the publication at the bottom of the slide, extended this with an
alternative scoring metric.

So, I’m going to show you just a quick couple
of examples of running these algorithms on some new data sets. The first is AML, and
so if we run Dendrix, we get several approximately exclusive sets, each with reasonably
good statistical support, and with HotNet, we get a few subnetworks. Before I show you a
few of these, I was instructed to say that this is unvalidated mutation data, so anything
here is preliminary and subject to change, all right?

So, running the mutual exclusivity, we get two sets, the two top-scoring sets. Actually,
both have six genes in common. Four of those six are fusion genes, and this is a highly
exclusive set. These fusion genes are mostly subtype specific, so, in some sense, what
we’re doing is picking out the subtypes, and then extending that set of six in two
different directions are the two other exclusive sets.
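The two properties these sets are selected for, high coverage and mutual exclusivity, can be combined into a single score. The sketch below uses a coverage-minus-overlap weight, which is my understanding of the scoring in the published Dendrix work; the talk itself does not spell out the formula, so treat it as an assumption. The exhaustive search is swapped in only because the toy example is tiny; it is not one of the algorithms mentioned in the talk. Gene names and patient IDs are made up.

```python
from itertools import combinations


def coverage(mutations, gene_set):
    """Patients that have at least one mutation in the gene set."""
    covered = set()
    for gene in gene_set:
        covered |= mutations[gene]
    return covered


def weight(mutations, gene_set):
    """Coverage minus overlap: 2 * |covered patients| - sum of per-gene counts.

    A perfectly exclusive set scores exactly its coverage; every extra
    mutation in an already covered patient costs one point.
    """
    per_gene_total = sum(len(mutations[gene]) for gene in gene_set)
    return 2 * len(coverage(mutations, gene_set)) - per_gene_total


def best_gene_set(mutations, size):
    """Exhaustive search over all gene sets of a given size (toy scale only)."""
    return max(
        combinations(sorted(mutations), size),
        key=lambda gene_set: weight(mutations, gene_set),
    )


# Made-up data: gene -> set of patient IDs carrying a mutation in that gene.
mutations = {
    "G1": {1, 2, 3},
    "G2": {4, 5},
    "G3": {6, 7, 8},
    "G4": {1, 2, 4, 6},  # frequently mutated, but co-occurs with the others
}

best = best_gene_set(mutations, 3)
print(best, "weight:", weight(mutations, best))
```

Here the most frequently mutated gene, `G4`, is left out of the best set because its mutations co-occur with the others, which is the kind of trade-off between coverage and exclusivity described in the talk.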
I’ve sort of shown you the schematic of exclusivity, and you can see that they’re mostly
exclusive. Some overlap. And together, the set of six, the blue on the bottom, covers about
25 percent of the patients. When you extend them, you get coverage up to 75 percent of the
patients.

Now, looking at these two sets, there’s a question of why are they separated. And so,
if you look across these two sets, what you see is that there’s lots of co-occurrence
between them, so many patients that have a mutation in more than one gene in the set.
Moreover, the co-occurrence I’m showing here on the bottom, the dark red now, is
co-occurrence across the two sets; the light pink is co-occurrence within the set. And so,
what you can see is that there’s a lot more co-occurrence across the sets than within, as
there should be, because these sets are exclusive, but these co-occurrences are actually
spread across multiple genes. So, what we’re seeing is that, because we haven’t done
pairwise analysis but looked at larger sets, we can actually find these exclusivity and
co-occurrence relationships that are not just pairwise, but are more complicated across
multiple genes.

For HotNet, here’s a view of five subnetworks.
Here’s three of them that were enriched for either known complexes, the cohesin complex,
the Polycomb complex, or the KEGG AML pathway. Well, that’s a nice screensaver. Now that’s
back. Okay. And, again, you know, the P values are, you know, not huge, but, you know, not
particularly small either, but we’re digging into this large interaction network with a
lot of noise, so pulling out these complexes is something we’re pretty happy with.

Finally, we’ve done a quick run on the breast cancer data set: again, several exclusive
sets, all with pretty good P values, and several subnetworks. Some of these appeared on
a poster earlier today, which is probably still hanging up. And I’ll just show you two of
the subnetworks. The first one on the bottom, you probably can’t read the gene names there,
but the first two rows there are p53 and PIK3CA, which are fairly exclusive, but, really,
they’re driving the set because they have very high mutation frequencies. So, if you remove
p53 and PIK3CA, then the set on the bottom actually contains a really nice, exclusive set,
genes at moderate frequencies, including GATA3 and CTCF, and even I can’t read the slide
from here. So, again, they’re on a poster, and you’ll see them.

So, that’s the summary, trying to take a view
of the data that’s sort of less biased by known biology and see if the data can just
lead us to the interesting sets directly. We have two algorithms that do that, and what
we’re working on is to bring in more data types (methylation is under way, gene expression
will come) and then to do a little more pre- and post-processing of what we put into the
algorithm and what comes out. We’ve been very naïve about it. We just throw in the mutation
data itself. We don’t filter by subtype. We don’t post-process, so all these things are
sort of add-ons that we hope will get us even more power.

So, the acknowledgements: my colleagues Fabio Vandin and Eli Upfal at Brown worked together
with me to develop these algorithms. Hsin-Ta Wu has done some of the analysis. The Genome
Institute at Wash U for the AML data, and Andy Mungall and others at BC Cancer Agency
for the fusion gene data from the RNA-Seq, and the funding agencies. Thanks.

[applause]

Male Speaker:
Questions?

Male Speaker:
Have you seen any exceptions to your exclusivity assumption?

Ben Raphael:
I mean, sure. I’m not sure what an exception would mean. What do you mean by an exception?
I mean, there’s lots of gene sets that are not exclusive. There are gene sets
that are exclusive, but when you look on the network, they’re not interacting in any way
that we can see. So, that, in a sense, violates the idea that they’re within a pathway, at
least a known pathway.

Male Speaker:
It’s probably a very naïve question, but there must be things driving p53. I mean,
mutations in p53. The driver mutations, in many cases, are going to be before that, right?
So, how do you pick out those from this analysis? Maybe I missed the whole thing.

Ben Raphael:
I mean, there’s no temporal information here, right? We just get the, you know, we get the
mutations from the patients when they were sequenced, so we can’t distinguish things
that happened before p53 or after. Is that...

Male Speaker:
Yeah, I mean, because, probably, right, if you’ve got p53 in 80 percent...

Ben Raphael:
Sure, yeah.

Male Speaker:
...and there are a lot of other things that are upstream that are doing that, so...

Ben Raphael:
Yeah, and...

Male Speaker:
...and that’s what they want, isn’t it?

Ben Raphael:
Right, and that’s why, you know, on that last example on breast, you know, p53 is a driver
mutation. We know it, so we pull it out of the data, and then we can start to get these
more subtle signals, and I think doing that in a more intelligent way will, you know,
allow us to sort of pull out some of the things that, you know, are obscured by the
high-frequency genes.

Male Speaker:
Next question.

Male Speaker:
Yeah, so, you know, good talk. So, you know, [unintelligible] what part you use HPRD with
primary concern [spelled phonetically] protein-protein interaction? But, we know a lot how
interaction of genetic [spelled phonetically] rather than protein-protein interaction. So,
[unintelligible] do you ever try to combine a different type of interaction, do some kind
of analysis, rather than only use protein-protein interaction?

Ben Raphael:
We have not. We’ve used various protein-protein interaction networks. We’ve used KEGG as
a network. We’ve used iREF, which is sort of a mish-mash of a few, with some curation. We
haven’t used a genetic interaction network. If there’s a good one for human, we’d love to
try it, but we haven’t found a good source for that.

Male Speaker:
Then how much do your results depend on the protein-protein network? We know which is
[unintelligible] how that rarity affect your result?

Ben Raphael:
We’ve assessed this by, you know, running it on these different networks, and the results
change a little bit, but they didn’t change dramatically. You know, some networks
seemingly give sort of nicer results than others, at least in terms of the known biology,
but, yes, I mean, the network we’re taking has information.

Male Speaker:
Other questions? Okay. Thank you.
