I have a @statistics question that I'd like some help with. I've got an actual problem related to environmental science, but I'm going to frame it in terms of the fediverse, for various reasons. So, if you feel like asking, "Why do you want to know this?!?" please realize that it's an example question.
@statistics So, let's say in my example question, I want to find the servers on the fediverse that have the highest rate of communication with other servers. To do this, I'm going to take a list of known servers, and then get samples from the public feeds of those servers. I'm going to count the unique domains of the addressees of all the replies in the public feeds.
@statistics so, on server1.example, if a@server1.example replied to b@server2.example and c@server3.example, and d@server1.example replied to e@server3.example and f@server4.example, we have 3 unique domains replied to (server2.example, server3.example, and server4.example).
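A minimal sketch of that counting step, assuming the replies have already been extracted as (author, addressee) handle pairs. The names `domain_of` and `replied_to_domains` are made up for illustration, and skipping replies to the author's own server is an assumption based on the goal of measuring communication with other servers.

```python
from collections import defaultdict


def domain_of(handle):
    """Return the domain part of a handle such as 'a@server1.example'."""
    return handle.rsplit("@", 1)[1].lower()


def replied_to_domains(replies):
    """Map each home server to the set of other domains its accounts replied to.

    `replies` is an iterable of (author_handle, addressee_handle) pairs.
    Replies to the author's own server are skipped (an assumption).
    """
    domains = defaultdict(set)
    for author, addressee in replies:
        home = domain_of(author)
        target = domain_of(addressee)
        if target != home:
            domains[home].add(target)
    return domains


# The example from the post above.
replies = [
    ("a@server1.example", "b@server2.example"),
    ("a@server1.example", "c@server3.example"),
    ("d@server1.example", "e@server3.example"),
    ("d@server1.example", "f@server4.example"),
]
counts = {server: len(found) for server, found in replied_to_domains(replies).items()}
```

Here `counts["server1.example"]` is 3, matching the hand count.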
@statistics so, when I do this analysis, I find that the servers with the most other domains replied to are also the servers with the most accounts on them. Number of accounts on the server is a confounding variable, here; I'm not finding out about cultural norms in connectedness.
@statistics So, in this fictional example, I first try dividing the number of replied-to domains by the number of total posts. This seems like it would be OK, but now I'm favouring very small, inactive servers instead. If a server has only one post, with 2 or 3 domains replied to, it's got a very high rate of domains per post.
@statistics The best I've been able to do in this situation is set a threshold value that I consider statistically significant -- say, 100 posts/day. So, I don't get a distorted view from those very small servers. This is providing satisfactory results, but I still have questions.
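A rough sketch of the rate-plus-threshold approach, assuming each server's sample has already been reduced to a (unique domains, posts) pair. The function name `domains_per_post` and the `min_posts` parameter are hypothetical, and the cutoff here applies to the number of posts in the sample rather than to posts per day.

```python
def domains_per_post(stats, min_posts=100):
    """Rank servers by unique replied-to domains per post.

    `stats` maps server name -> (n_domains, n_posts). Servers with fewer
    than `min_posts` posts in the sample are dropped before ranking.
    """
    rates = {
        server: n_domains / n_posts
        for server, (n_domains, n_posts) in stats.items()
        if n_posts >= min_posts
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers: the one-post server would top the list without the cutoff.
stats = {
    "tiny.example": (3, 1),
    "medium.example": (40, 200),
    "large.example": (300, 5000),
}
ranking = domains_per_post(stats, min_posts=100)
```

With the cutoff, `tiny.example` is excluded and `medium.example` ranks first.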
@statistics First, if I'm trying to get a measure of diversity in my different sample sets, where the number of observations per sample set can vary wildly, is this an OK way to do it? And second, is there a way to determine what this threshold of statistical significance is?
@evan I think the first step is to figure out what quantity you want to normalize by for the comparison you want, for example the per capita numbers used during pandemic reporting. Something along the lines of domains per average message per user.
Ok so quick question (don't know how well it translates from example.)
You're sampling every post, not a normalized list with an identical number of replies?
For instance
Post A has 8 replies on tiny server
Post B has 1 reply on large server
is different info than two 8 reply posts, no?
@mrcopilot @statistics unrelated to real question.
@evan @statistics Back of the napkin, not even sure it relates to the hypothetical...
@evan @statistics I love these sorts of graph analysis questions, and I'm going to do some more reading, but my immediate thought is that you might have more success dividing by the log of total posts.
Also, using the number of accounts rather than posts sounds like a better approach intuitively. I think the log of the # of accounts may be interesting.
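To make the log idea concrete, here is a small sketch. The function name `log_normalized` is made up, the natural log is an arbitrary choice of base, and `size` can be either the number of posts or the number of accounts. Sizes below 2 are rejected because log(1) is 0.

```python
import math


def log_normalized(n_domains, size):
    """Unique replied-to domains divided by the natural log of server size.

    `size` may be a post count or an account count; it must be at least 2
    so that the logarithm is positive.
    """
    if size < 2:
        raise ValueError("size must be at least 2")
    return n_domains / math.log(size)


# Made-up numbers: a tiny server and a large one.
tiny = log_normalized(3, 2)
large = log_normalized(300, 100_000)
```

With these numbers the plain per-post rate favours the tiny server (1.5 against 0.003), while the log-normalized score favours the large one. Whether that is the right amount of correction is a judgment call.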
@evan Another thought is that this might be similar to "assortativity" [0].
Here, we measure the tendency of nodes to associate with other nodes that are similar in some way.
Here, similarity can be 0 if on different servers and 1 if on the same server.
@evan The Wiki article on assortativity isn't the clearest IMHO, here's a paper that might be useful to you https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.89.208701
I think the examples in the paper are pretty analogous to what you're describing (e.g. whether physicists or biologists are more likely to write papers with the same coauthors). r is basically a Pearson correlation, so -1 is most disassortative and 1 is most assortative. 0 is neither.
@evan n is the number of nodes so here you can see that while the graph of biology coauthors is massive, we can still capture that it has similar assortativity to the much smaller graph of mathematics coauthors. Same is true of the Film actors and Company directors graphs.
@evan It's a bit... ugly... but you can calculate r pretty easily just knowing the degree of the nodes in your network and the number of nodes.
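A hedged sketch of one way to compute r for an undirected graph, as the Pearson correlation between the degrees at the two ends of each edge. The function name `degree_assortativity` is made up, and this version takes the edge list as input, since the degrees alone do not say which nodes are connected to which.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of the degrees at either end of each edge.

    `edges` is a list of (u, v) pairs describing an undirected graph.
    Each edge is counted in both directions so the result is symmetric.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    xs, ys = [], []
    for u, v in edges:
        xs.extend([degree[u], degree[v]])
        ys.extend([degree[v], degree[u]])

    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so one mean suffices
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("r is undefined when every edge end has the same degree")
    return cov / var


# A star graph: one hub connected to three leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r_star = degree_assortativity(star)
```

For the star graph this gives r = -1, the most disassortative case: the high-degree hub only ever connects to degree-1 leaves.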
@jszym @evan @statistics
I concur.
The per capita number of distinct servers would remove the confounding effect of large server userbases.
Ultimately it depends on what "highest rate of communication with other servers" means in your example, and what phenomena you want to highlight.
Is the presence of few users with huge followings a relevant variable or a confounder? How about many users with few friends all in different servers?