
Percent Maximum Difference

PMD is a relative distance metric for compositional Poisson problems

Check out our manuscript!

PMD

Unique properties:

  • Robust to differences in sampling depth

    • between subjects, within a dataset

    • between subjects, between datasets

  • Robust to # of observed features between datasets/analyses

  • Linear quantification regardless of:

    • Composition changes within a dataset

    • Number of features between datasets

How can I use it?

You can install either the python package or the R package.


When & why to use PMD

Any time you have Poisson sampled data, and want to answer the question:

How similar are my observations, based on the composition of the features?

Examples:

  • Subject-subject similarity based on:

    • Single-cell-omic cluster abundance

    • Flow-cytometry gated cell-type abundance

    • Microbiome clade abundance

  • Customer-customer similarity based on:

    • Any count-data, such as:

      • Counts of product classes viewed

  • Many, many more: any kind of Poisson sampling
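To make the "composition of the features" idea concrete, here's a tiny toy example (hypothetical numbers, not from the manuscript): two subjects whose raw counts differ tenfold in sampling depth, but whose compositions are identical. A good metric should call these maximally similar.

```python
import numpy as np

# Two hypothetical subjects profiled at very different depths.
# Each entry is the count of one feature (e.g. a cell type).
subject_a = np.array([100, 300, 600])     # 1,000 cells observed
subject_b = np.array([1000, 3000, 6000])  # 10,000 cells observed

# Raw counts look wildly different...
print(subject_b - subject_a)

# ...but the compositions (proportions within each subject) are identical.
comp_a = subject_a / subject_a.sum()
comp_b = subject_b / subject_b.sum()
print(comp_a)  # [0.1 0.3 0.6]
print(comp_b)  # [0.1 0.3 0.6]
```

Depth robustness means a metric should score these two subjects as identical, despite the 10x difference in total observations.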

A (very) brief history of prior attempts:

In 1946, Harald Cramér proposed a solution to this question, building off of the chi-square statistic, in his book "Mathematical Methods of Statistics", creating Cramér's V, which was later bias-corrected.
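For reference, the bias-corrected Cramér's V (Bergsma's correction) can be computed from a contingency table like this. This is a generic sketch using scipy, included just so you can see what's being compared against; it is not part of PMD.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V (Bergsma 2013) for a contingency table."""
    table = np.asarray(table, dtype=float)
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, c = table.shape
    phi2 = chi2 / n
    # Bias-correct phi^2 and the effective table dimensions
    phi2_corr = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    c_corr = c - (c - 1) ** 2 / (n - 1)
    return np.sqrt(phi2_corr / min(r_corr - 1, c_corr - 1))

# Perfectly proportional columns -> V = 0 (independence)
print(cramers_v_corrected([[10, 10], [20, 20]]))  # 0.0
# Perfectly separated columns -> V near 1
print(cramers_v_corrected([[30, 0], [0, 30]]))
```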

 

What's wrong with other methods?

I realized we had a problem right away when I tried to use a chi-square test of independence & the bias-corrected Cramér's V.

So I spent an afternoon playing around with equations & found one that actually seemed to work!

Check out our manuscript for all the nitty-gritty mathy details!

But to prove it, I had to write up a full-fledged benchmark!

Here's the setup for simulation mode #1:

  • The contingency table setup:

    • Subjects in columns

    • "cell-types" in rows

    • Filled in to quantify the number of times a given cell type was observed in a given subject

  • For simplicity here, we have 2 subjects

    • although PMD is compatible with any scale!

  • We have simulated cell-types (rows) & we vary the number of shared vs non-shared cell-types (colors)

  • We also vary the # of cells observed, so that depth differs between subjects (dashed vs. solid lines)
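The setup above can be sketched as follows. This is my minimal reconstruction of the idea, not the benchmark's actual code; the function name and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_two_subjects(n_shared, n_unique, depth_a=10_000, depth_b=10_000):
    """Simulate Poisson-sampled cell-type counts for two subjects.

    `n_shared` cell types appear in both subjects; each subject also
    gets `n_unique` cell types of its own. Returns a contingency
    table with cell types in rows and the two subjects in columns.
    """
    n_types = n_shared + 2 * n_unique
    rate_a = np.zeros(n_types)
    rate_b = np.zeros(n_types)
    # Shared cell types: present in both subjects
    rate_a[:n_shared] = 1.0
    rate_b[:n_shared] = 1.0
    # Subject-specific cell types
    rate_a[n_shared:n_shared + n_unique] = 1.0
    rate_b[n_shared + n_unique:] = 1.0
    # Poisson sampling at each subject's own depth
    counts_a = rng.poisson(depth_a * rate_a / rate_a.sum())
    counts_b = rng.poisson(depth_b * rate_b / rate_b.sum())
    return np.column_stack([counts_a, counts_b])

# Vary the shared vs. non-shared cell types, and make the depths differ:
table = simulate_two_subjects(n_shared=8, n_unique=2,
                              depth_a=10_000, depth_b=2_000)
print(table.shape)  # (12, 2)
```

Sweeping `n_shared` against `n_unique` (the colors) while toggling equal vs. unequal depths (the dashed/solid lines) reproduces the benchmark's axes.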

Here are the results!

The correct answer for the top row of graphs is parallel, evenly spaced colored lines running straight across, with solid & dashed lines overlapping within each color; the bottom row should be all perfect slope = 1 lines.

[Figure: benchmark results for simulation mode #1 (pmd_characterization_merged1.png)]

 

PMD is the only solution that linearly quantifies the relative overlap of columns based on the composition of the rows.

 

But you can also hold the number of features constant, and vary the relative abundance among clusters that all appear in both observations, rather than presence/absence of shared features. So I wrote a 2nd benchmark for this simulation mode:

Here are the results!

The correct answer here looks exactly the same as before, because it shouldn't matter if your samples differ based on presence/absence of shared clusters, or changing the relative abundance of clusters. More similar should mean more similar.

[Figure: benchmark results for simulation mode #2 (pmd_characterization_merged1.png)]

PMD is AGAIN the only solution that linearly quantifies relative overlap of columns based on the composition of the rows.

But notice also that in many cases, the shapes of the curves differ depending on the simulation mode (sim 1 vs. sim 2). In many real-world cases, we'll actually see a simultaneous mixture of these effects, meaning that the overall output can't be interpreted, because the two effects manifest in different patterns.

 

PMD is the only metric that is completely linear, robust to differences in number of observations, robust to differences in the number of observed features, robust to both modes of similarity/difference, and is completely & immediately interpretable.

A PMD use-case

Right now, I recommend using the python package for performance + capabilities

That being said, the python package is under active development, and the R package is better documented and has nicer tutorials.

1-million PBMCs: 12 healthy donors & 12 Type-1 Diabetes (T1D) donors

In our anti-correlation-based feature selection paper, we analyzed 1-million PBMCs in healthy and T1D donors.

[Figure: spring embeddings of all subjects (all_springs.png); note that the embedding distances are not actually reflective of real distances!]

When we examined subject-subject similarity using reverse PMD (rPMD; it's just 1 − PMD), we found that the T1D subjects all looked very similar to each other!

[Figure: subject-subject rPMD, control vs. T1D (rPMD_ctrl_T1D.png)]

We can see this on the side-by-side spring embeddings as well:

[Figure: side-by-side spring embeddings by disease status (disease_springs.png)]

Now here's where we can start to get really fancy with PMD =). In the python implementation, I've added functionality to pull out the PMD standardized residuals.

 

This converts your input count-matrix into normally distributed over-/under-abundance scores, which you can then run boilerplate statistics on, like a typical linear model.

 

(This solves the depth issue, but not the compositional issue - I'm still working on that)
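The exact residuals the package computes are described in the manuscript; as a mental model, the classic adjusted standardized (Pearson) residuals of a contingency table do the same kind of job, and can be sketched like this (a generic version, not the package's code):

```python
import numpy as np

def standardized_residuals(counts):
    """Adjusted standardized (Pearson) residuals of a count matrix.

    Under independence these are approximately N(0, 1), turning raw
    counts into per-cell over-/under-abundance scores that account
    for differing column depths. (Generic textbook version; see the
    PMD manuscript for the exact residuals the package uses.)
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row @ col / n
    # Adjusted residuals divide by the full standard error,
    # not just sqrt(expected) as the raw Pearson residual does
    se = np.sqrt(expected * (1 - row / n) * (1 - col / n))
    return (counts - expected) / se

# A perfectly proportional table -> residuals are all ~0
print(standardized_residuals([[10, 20], [30, 60]]))
# An imbalanced table -> large positive/negative scores
print(standardized_residuals([[30, 5], [5, 30]]))
```

Because the scores are approximately standard-normal, an over-abundant cluster in one subject group shows up as a large positive residual that you can feed directly into downstream tests.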

Here's what the PMD standardized residuals & statistics look like!

[Figure: PMD standardized residuals & significance (standardized_resids_and_significance.png)]

What's that super-over-abundant cluster in T1D? It turned out to be a subset of classical monocytes. Let's compare them to a non-differentially abundant subset of classical monocytes:

[Figure: comparison of monocyte subsets (additional_pbmc_plots_monocytes_V2.png)]

It turns out that they're in a strange hemi-activated state! They have high expression of Fos-B and NAMPT (a known immune modulator), and also high HIF1a, which can be induced by various stimuli, even in normoxic conditions! They also show slightly lower JAK2, and the migration-associated RIPOR2!

Note that these discoveries were really only made possible by PMD. Typical approaches to performing statistics on count data like this give wildly inflated significance values, because they don't account for the sampling noise or the differential depth between subjects, which PMD's standardized residuals do =) This has been shown time and time again.
