CoSocial Community Cooperative @coop

**Evan Prodromou** @evan · Dec 22, 2023

I have a @statistics question that I'd like some help with. I've got an actual problem related to environmental science, but I'm going to frame it in terms of the fediverse, for various reasons. So, if you feel like asking, "Why do you want to know this?!?" please realize that it's an example question.

Evan Prodromou @evan@cosocial.ca

@statistics So, let's say in my example question, I want to find the servers on the fediverse that have the highest rate of communication with other servers. To do this, I'm going to take a list of known servers, and then get samples from the public feeds of those servers. I'm going to count the unique domains of the addressees of all the replies in the public feeds.

Dec 22, 2023, 06:17 PM · Phanpy

2 boosts · 1 favorite

**Evan Prodromou** @evan · Dec 22, 2023

@statistics so, on server1.example, if a@server1.example replied to b@server2.example and c@server3.example, and d@server1.example replied to e@server3.example and f@server4.example, we have 3 unique domains replied to (server2.example, server3.example, and server4.example).
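
A minimal sketch of the counting described in this post, in Python. The function names and the `(author, addressee)` pair format are assumptions made for illustration, not part of the original question. Excluding replies addressed to the home server is also an assumption, based on the stated goal of measuring communication with *other* servers.

```python
def domain_of(handle: str) -> str:
    """Return the domain part of a handle such as 'a@server1.example'."""
    return handle.rsplit("@", 1)[1]


def unique_reply_domains(home_domain, replies):
    """Collect the distinct remote domains replied to from one server.

    replies: iterable of (author, addressee) handle pairs taken from the
    public feed of home_domain (hypothetical input format).
    """
    return {
        domain_of(addressee)
        for _author, addressee in replies
        if domain_of(addressee) != home_domain  # assumption: skip local replies
    }


# The example from the post above.
replies = [
    ("a@server1.example", "b@server2.example"),
    ("a@server1.example", "c@server3.example"),
    ("d@server1.example", "e@server3.example"),
    ("d@server1.example", "f@server4.example"),
]
domains = unique_reply_domains("server1.example", replies)
print(len(domains), sorted(domains))
# 3 ['server2.example', 'server3.example', 'server4.example']
```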

**Evan Prodromou** @evan · Dec 22, 2023

@statistics so, when I do this analysis, I find that the servers with the most other domains replied to are also the servers with the most accounts on them. Number of accounts on the server is a confounding variable, here; I'm not finding out about cultural norms in connectedness.

**Evan Prodromou** @evan · Dec 22, 2023

@statistics So, in this fictional example, I first try dividing the number of replied-to domains by the number of total posts. This seems like it would be OK, but now I'm favouring very small, inactive servers instead. If a server has only one post, with 2 or 3 domains replied to, it's got a very high rate of domains per post.

**Evan Prodromou** @evan · Dec 22, 2023

@statistics The best I've been able to do in this situation is set a threshold value that I consider statistically significant -- say, 100 posts/day. So, I don't get a distorted view from those very small servers. This is providing satisfactory results, but I still have questions.
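
To make the small-server effect and the threshold workaround concrete, here is a rough sketch. The server names and counts are invented for illustration, `ServerSample` and `rank_by_domains_per_post` are hypothetical names, and the sample is assumed to cover one day so that `min_posts=100` corresponds to the 100 posts/day cut-off mentioned above.

```python
from typing import NamedTuple


class ServerSample(NamedTuple):
    domain: str
    posts: int          # posts seen in the one-day sample
    reply_domains: int  # unique other domains replied to


def rank_by_domains_per_post(samples, min_posts=100):
    """Rank servers by domains per post, dropping servers below min_posts."""
    eligible = [s for s in samples if s.posts >= min_posts]
    return sorted(eligible, key=lambda s: s.reply_domains / s.posts, reverse=True)


# Invented numbers, only to show the effect.
samples = [
    ServerSample("tiny.example", 1, 3),          # 3.0 domains per post
    ServerSample("mid.example", 400, 120),       # 0.3 domains per post
    ServerSample("big.example", 50000, 2500),    # 0.05 domains per post
]

unfiltered = rank_by_domains_per_post(samples, min_posts=1)
filtered = rank_by_domains_per_post(samples)

print([s.domain for s in unfiltered])  # the one-post server comes first
print([s.domain for s in filtered])    # the one-post server is excluded
```

The cut-off removes the distortion from the one-post server, but the value 100 is still chosen by hand, which is the open question in the next post.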

**Evan Prodromou** @evan · Dec 22, 2023 *

@statistics first, if I'm trying to get a measure of diversity in my different sample sets, where the number of observations per sample set can be wildly varying, is this an OK way to do it? and second, is there a way to determine what this threshold of statistical significance is?

**Spoofer3** @Spoofer3@infosec.exchange · Dec 22, 2023

@evan I think you should figure out what quantity you want to normalize to for the comparison you want, for example the per capita number used during pandemic reporting. Something along the lines of domains per average message per user.

**MrCopilot** @mrcopilot@mstdn.social · Dec 22, 2023

@evan @statistics

Ok so quick question (don't know how well it translates from example.)

You're sampling every post, not a normalized list with an identical number of replies?

For instance

Post A has 8 replies on tiny server
Post B has 1 reply on large server

is different info than two 8 reply posts, no?

**Evan Prodromou** @evan · Dec 22, 2023

@mrcopilot @statistics unrelated to real question.

**MrCopilot** @mrcopilot@mstdn.social · Dec 22, 2023

@evan @statistics Back of the napkin, not even sure it relates to the hypothetical...

**Joseph Szymborski** @jszym · Dec 22, 2023

@evan @statistics I love these sorts of graph analysis questions, and I'm going to do some more reading, but my immediate thought is that you might have more success dividing by the log of total posts.

Also, using the number of accounts rather than posts sounds like a better approach intuitively. I think the log of the # of accounts may be interesting.
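
A quick way to try the log suggestion is sketched below. `log_normalized_rate` is a hypothetical helper, the natural log is an arbitrary choice of base, and the numbers are invented. `size` can be either the post count or the account count.

```python
import math


def log_normalized_rate(reply_domains: int, size: int) -> float:
    """Unique replied-to domains divided by log(size).

    size is the number of posts or accounts; it must be at least 2,
    because log(1) is 0 and the ratio would be undefined.
    """
    if size < 2:
        raise ValueError("size must be at least 2 so that log(size) > 0")
    return reply_domains / math.log(size)


# Invented numbers: (unique reply domains, total posts).
tiny = log_normalized_rate(3, 2)
mid = log_normalized_rate(120, 400)
big = log_normalized_rate(2500, 50000)

print(round(tiny, 2), round(mid, 2), round(big, 2))
```

With these made-up inputs the ordering is big, mid, tiny, the reverse of a plain per-post rate, so a log divisor corrects for size much more gently than dividing by the raw count. Whether that is the right amount of correction would need checking against real data.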

**Joseph Szymborski** @jszym · Dec 22, 2023

@evan Another thought is that this might be similar to "assortativity" [0].
Here, we measure the tendency of nodes to associate with other nodes that are similar in some way.

Here, similarity can be 0 if on different servers and 1 if on the same server.

[0] https://en.wikipedia.org/wiki/Assortativity

**Joseph Szymborski** @jszym · Dec 22, 2023 *

@evan The Wiki article on assortativity isn't the clearest IMHO, here's a paper that might be useful to you https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.89.208701

I think the examples in the paper are pretty analogous to what you're describing (e.g. whether Physicists or Biologists are more likely to write papers with the same coauthors). r is basically a Pearson correlation, so -1 is most disassortative and 1 is most assortative. 0 is neither.

Image: a table from the linked paper with three columns, "Network", "n", and "r".

**Joseph Szymborski** @jszym · Dec 22, 2023

@evan n is the number of nodes so here you can see that while the graph of biology coauthors is massive, we can still capture that it has similar assortativity to the much smaller graph of mathematics coauthors. Same is true of the Film actors and Company directors graphs.

**Joseph Szymborski** @jszym · Dec 22, 2023

@evan It's a bit... ugly... but you can calculate r pretty easily just knowing the degree of the nodes in your network and the number of nodes.
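
As a rough illustration of that calculation, the sketch below computes degree assortativity as the Pearson correlation between the degrees at the two ends of every edge. It assumes an undirected simple graph given as an edge list (so it needs the edges, not only the degrees), and `degree_assortativity` is a hypothetical name rather than code from the paper. Using the plain degree instead of the degree minus one gives the same value, because a Pearson correlation is unchanged by shifting both variables by a constant.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of the degrees at either end of each undirected edge."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Count every edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n  # xs and ys share the same mean and variance by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("all edge endpoints have the same degree; r is undefined")
    return cov / var


# A star: one hub connected to three leaves, so high degree always meets low degree.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
print(degree_assortativity(star))  # -1.0
```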

**pscrapy** @pscrapy@mastodon.uno · Dec 22, 2023

@jszym @evan @statistics
I concur.
The per capita number of distinct servers would remove the confounding effect of large server userbases.
Ultimately it depends on what "highest rate of communication with other servers" means in your example, and what phenomena you want to highlight.

Is the presence of few users with huge followings a relevant variable or a confounder? How about many users with few friends all in different servers?
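
One way to look at both the per capita figure and the "few very connected users" question is sketched below. The function name, the `(author, addressee_domain)` input format, and the data are all assumptions for illustration. Comparing the server-wide per capita value with the median per account is just one possible way to see whether a handful of accounts drive the total.

```python
from collections import defaultdict
from statistics import median


def per_capita_summary(replies, accounts):
    """Summarise distinct replied-to domains per account on one server.

    replies: (author, addressee_domain) pairs; accounts: all local account names.
    """
    by_author = defaultdict(set)
    for author, domain in replies:
        by_author[author].add(domain)

    all_domains = set().union(*by_author.values())
    # Accounts that never replied count as zero domains.
    per_user = [len(by_author.get(a, set())) for a in accounts]
    return {
        "domains_per_account": len(all_domains) / len(accounts),
        "median_domains_per_account": median(per_user),
    }


# Invented data: one very connected account, one occasional replier, two silent ones.
accounts = ["a", "b", "c", "d"]
replies = [("a", f"s{i}.example") for i in range(1, 7)] + [("b", "s1.example")]

summary = per_capita_summary(replies, accounts)
print(summary)
# {'domains_per_account': 1.5, 'median_domains_per_account': 0.5}
```

Here the per capita value (1.5) is three times the median (0.5), which hints that a single account accounts for most of the server's reach.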
