Percent Maximum Difference
PMD is a relative distance metric for compositional Poisson problems
Check out our manuscript!
Unique properties:

- Robust to differences in sampling depth
  - between subjects, within a dataset
  - between subjects, between datasets
- Robust to the # of observed features between datasets/analyses
- Linear quantification regardless of:
  - Composition changes within a dataset
  - Number of features between datasets

How can I use it?
You can install the Python package, or the R package.
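The install commands below are my best guess at the package names; check the manuscript or the project repos for the exact, current commands:

```shell
# Python package (hypothetical name -- confirm against the project repo):
pip install percent_max_diff

# R package (hypothetical install path -- confirm against the project repo):
R -e 'devtools::install_github("scottyler89/pmd")'
```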
When & why to use PMD
Any time you have Poisson-sampled data and want to answer the question:
How similar are my observations, based on the composition of the features?
Examples:

- Subject-subject similarity based on:
  - Single-cell omic cluster abundance
  - Flow-cytometry-gated cell type abundance
  - Microbiome clade abundance
- Customer-customer similarity based on:
  - Any count data, such as:
    - Counts of product classes viewed
- Many, many more: any kind of Poisson sampling
A (very) brief history of prior attempts:
In 1946, Harald Cramér proposed a solution to this question in his book "Mathematical Methods of Statistics," building off of the chi-square statistic to create Cramér's V, which was later bias-corrected.
What's wrong with other methods?
I realized we had a problem right away when I tried to use a chi-square test of independence & the bias-corrected Cramér's V.
So I spent an afternoon playing around with equations & found one that actually seemed to work!
Check out our manuscript for all the nitty-gritty mathy details!
But to prove it, I had to write up a full-fledged benchmark!
Here's the setup for simulation mode #1:

- The contingency table setup:
  - Subjects in columns
  - "Cell types" in rows
  - Filled in to quantify the number of times a given cell type was observed in a given subject
- For simplicity here, we have 2 subjects
  - although PMD is compatible with any scale!
- We have simulated cell types (rows) & we vary the number of shared vs non-shared cell types (colors)
- We also vary the # of cells observed per subject (dashed vs solid lines)
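The setup above can be sketched in a few lines of NumPy. This is an illustrative toy version, not the benchmark code from the manuscript; the function name and parameters are my own:

```python
import numpy as np

def simulate_two_subjects(n_shared, n_private_each, depth_a, depth_b, rng):
    """Toy version of simulation mode #1: a contingency table with cell
    types in rows and 2 subjects in columns. `n_shared` cell types are
    present in both subjects; each subject also has `n_private_each` cell
    types of its own. Counts are Poisson-sampled at each subject's depth."""
    n_rows = n_shared + 2 * n_private_each
    props_a = np.zeros(n_rows)
    props_b = np.zeros(n_rows)
    # subject A: the shared types plus the first private block
    props_a[:n_shared + n_private_each] = 1.0
    # subject B: the shared types plus the second private block
    props_b[:n_shared] = 1.0
    props_b[n_shared + n_private_each:] = 1.0
    props_a /= props_a.sum()
    props_b /= props_b.sum()
    counts_a = rng.poisson(props_a * depth_a)  # Poisson sampling at depth A
    counts_b = rng.poisson(props_b * depth_b)  # Poisson sampling at depth B
    return np.column_stack([counts_a, counts_b])

rng = np.random.default_rng(0)
# equal depths (solid lines) vs. a 10x depth difference (dashed lines)
table_equal = simulate_two_subjects(5, 5, 10_000, 10_000, rng)
table_uneven = simulate_two_subjects(5, 5, 10_000, 1_000, rng)
print(table_equal.shape)  # (15, 2): 5 shared + 5 private per subject
```

Sweeping `n_shared` and the depth ratio reproduces the axes of the benchmark plots.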
Here are the results!
The correct answer for the top row of graphs is parallel, evenly spaced colored lines running straight across, with solid & dashed lines overlapping within each color; the bottom row should be all perfect slope-1 lines.
PMD is the only solution that linearly quantifies the relative overlap of columns based on the composition of the rows.
But you can also hold the number of features constant and vary the relative abundance among clusters that all appear in both observations, rather than the presence/absence of shared features. So I wrote a 2nd benchmark for this simulation mode:
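A toy sketch of this second simulation mode, in the same spirit as before (again my own illustrative function, not the manuscript's benchmark code):

```python
import numpy as np

def simulate_abundance_shift(n_types, shift, depth_a, depth_b, rng):
    """Toy version of simulation mode #2: both subjects contain all cell
    types, but subject B's composition is pulled toward a skewed profile.
    shift=0 gives identical compositions; shift=1 gives maximally different
    relative abundances (while keeping every type present in both)."""
    props_a = np.full(n_types, 1.0 / n_types)        # uniform composition
    skewed = np.arange(1, n_types + 1, dtype=float)  # a fixed skewed profile
    skewed /= skewed.sum()
    props_b = (1.0 - shift) * props_a + shift * skewed  # interpolate compositions
    counts_a = rng.poisson(props_a * depth_a)
    counts_b = rng.poisson(props_b * depth_b)
    return np.column_stack([counts_a, counts_b])

rng = np.random.default_rng(1)
table = simulate_abundance_shift(10, shift=0.5, depth_a=10_000, depth_b=10_000, rng=rng)
```

Sweeping `shift` from 0 to 1 traces out the x-axis of this benchmark, with no presence/absence differences involved.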
Here are the results!
The correct answer here looks exactly the same as before, because it shouldn't matter whether your samples differ by presence/absence of shared clusters or by changing relative abundance of clusters. More similar should mean more similar.
PMD is AGAIN the only solution that linearly quantifies the relative overlap of columns based on the composition of the rows.
But notice also that in many cases, the shapes of the curves differ depending on the simulation mode (sim 1 vs sim 2). In many real-world cases, we'll actually see a simultaneous mixture of these effects, meaning that the overall output can't be interpreted, because these effects manifest in different patterns.
PMD is the only metric that is completely linear, robust to differences in number of observations, robust to differences in the number of observed features, robust to both modes of similarity/difference, and is completely & immediately interpretable.
A PMD use case
Right now, I recommend using the Python package for performance + capabilities.
That being said, the Python package is under active development, while the R package is better documented and has nicer tutorials.
In our anti-correlation-based feature selection paper, we analyzed 1 million PBMCs from 24 donors: 12 healthy and 12 with Type 1 Diabetes (T1D).
(Figure annotation: the embedding's distances are not actually reflective of real distances!)
When we examined subject-subject similarity using reverse PMD (rPMD: it's just 1 - PMD), we found that the T1D subjects all looked very similar to each other!
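The rPMD conversion is a one-liner; here's a tiny sketch on made-up pairwise PMD values (not real data):

```python
import numpy as np

# PMD is a distance in [0, 1]; reverse PMD (rPMD) flips it into a similarity:
# identical compositions give rPMD = 1, fully disjoint compositions give 0.
pmd_matrix = np.array([[0.0,   0.25, 0.875],
                       [0.25,  0.0,  0.5  ],
                       [0.875, 0.5,  0.0  ]])  # toy pairwise PMD values
rpmd_matrix = 1.0 - pmd_matrix
print(rpmd_matrix[0, 1])  # 0.75
```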
We can see this on the sidebyside spring embeddings as well:
Now here's where we can start to get really fancy with PMD =). In the Python implementation, I've added functionality that lets you pull out the PMD standardized residuals.
This converts your input count matrix into a normally distributed over/under-abundance score, which you can then use to run boilerplate stats, like a typical linear model.
(This solves the depth issue, but not the compositional issue; I'm still working on that.)
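For intuition, here is the classic Pearson standardized-residual computation for a contingency table, which matches the description above; the package's exact implementation may differ, and the table values here are made up:

```python
import numpy as np

def pearson_residuals(counts):
    """Standardized (Pearson) residuals of a contingency table:
    (observed - expected) / sqrt(expected), where expected assumes
    independence of rows and columns. Under Poisson sampling these are
    approximately standard-normal, which is what makes them usable as
    over/under-abundance scores in downstream linear models."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / total
    return (counts - expected) / np.sqrt(expected)

# clusters in rows, subjects in columns; subject 2 was sampled more deeply
table = np.array([[100, 210],
                  [ 50, 100],
                  [ 25, 240]])  # cluster 3 is over-abundant in subject 2
resid = pearson_residuals(table)
```

Because the expected counts scale with each column's total, the residuals are comparable across subjects with very different sampling depths.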
Here's what the PMD standardized residuals & statistics look like!
What's that super-over-abundant cluster in T1D? It turned out to be a subset of classical monocytes. Let's compare them to a non-differentially-abundant subset of classical monocytes:
It turns out that they're in a strange hemi-activated state! They have high expression of FosB and NAMPT (a known immune modulator), and also high HIF1a, which can be induced by various stimuli, even in normoxic conditions! They also have slightly lower JAK2 and the migration-associated RIPOR2!
Note that these discoveries were really only made possible by PMD. Typical approaches to performing statistics on count data like this give wildly inflated significance values, because they don't account for the sampling noise or the differential depth between subjects, which PMD's standardized residuals do =) This has been shown time and time again.