Dynamic arXiv Collaboration Graphs

Research Area: Clustering of Static and Temporal Graphs
Status: In progress  

Since 1992, the arXiv.org e-Print archive has been a popular repository of scientific e-prints, stored in several categories alongside timestamped metadata. Our Collecting Spider (see below) can be used to collect data from the archive. The Scheduler (see below) extracts networks of collaboration between scientists based on coauthorship. For each e-print, it adds equally weighted clique edges among the contributors such that each author gains a total edge weight of 1.0 per e-print contributed to. It lets e-prints time out after two years and removes disconnected authors.
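The weighting rule can be made concrete with a small example. The following is a minimal sketch of one reading of the description above, not the Scheduler's actual code: an e-print with k distinct authors contributes a clique in which every edge has weight 1/(k-1), so that each author's k-1 incident edges sum to 1.0. The function names clique_edges and author_totals are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def clique_edges(authors):
    """Return {(a, b): weight} for the clique induced by one e-print.

    Assumption: each of the k distinct authors has k - 1 incident clique
    edges, so an equal weight of 1 / (k - 1) gives every author a total
    of 1.0. A single-author e-print contributes no edges.
    """
    unique = sorted(set(authors))
    k = len(unique)
    if k < 2:
        return {}
    weight = 1.0 / (k - 1)
    return {pair: weight for pair in combinations(unique, 2)}


def author_totals(edges):
    """Sum the weight of all edges incident to each author."""
    totals = defaultdict(float)
    for (a, b), weight in edges.items():
        totals[a] += weight
        totals[b] += weight
    return dict(totals)


edges = clique_edges(["Ada", "Bob", "Cy"])
print(edges)                 # three edges, one per author pair
print(author_totals(edges))  # per-author totals
```

With three authors, each of the three edges gets weight 0.5, and every author ends up with a total of 1.0. Under this reading, a single-author e-print adds nothing to the graph, which fits the removal of disconnected authors.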


The Collecting Spider

The arXiv API provides an interface to the arXiv database, returning query results in the format of an Atom XML feed. Our arXiv spider is a simple Python program which automates querying the database and parsing the results.
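To illustrate the kind of work the spider automates, here is a minimal sketch of building a query URL and parsing an Atom feed with the Python standard library. It is not taken from arxivspider.zip. The helper names build_query_url and parse_feed are hypothetical, and SAMPLE_FEED is a made-up, heavily trimmed feed used so that the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"       # Atom XML namespace
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10):
    """Build an arXiv API query URL, e.g. for search_query='cat:cs.DS'."""
    params = {"search_query": search_query,
              "start": start,
              "max_results": max_results}
    return API_URL + "?" + urlencode(params)


def parse_feed(xml_text):
    """Extract id, timestamp and author names from each Atom entry."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall(ATOM + "entry"):
        records.append({
            "id": entry.findtext(ATOM + "id"),
            "published": entry.findtext(ATOM + "published"),
            "authors": [a.findtext(ATOM + "name")
                        for a in entry.findall(ATOM + "author")],
        })
    return records


# Made-up sample data for illustration only; not a real e-print.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2010-01-01T00:00:00Z</published>
    <title>Placeholder title</title>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
  </entry>
</feed>"""

print(build_query_url("cat:cs.DS", max_results=5))
print(parse_feed(SAMPLE_FEED))
```

In a real run, the URL would be fetched (for example with urllib.request.urlopen) and the response body passed to parse_feed. The timestamp and author list per entry are the fields a coauthorship extraction needs.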

Download: arxivspider.zip

License: GPL


The Scheduler

Our tools for extracting a stream of graph events from arXiv dumps are available here. Note that outputting intermediate graphs (which can be enabled in the source code) requires the yFiles graph libraries; otherwise, the corresponding functions need to be removed from the source files.
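As a rough illustration of what a stream of graph events could look like, the sketch below turns timestamped e-prints into node and edge events, with e-prints timing out after two years and disconnected authors being removed. This is an assumption-laden sketch, not the Scheduler's implementation: the event names (add_node, set_weight, remove_edge, remove_node), the function event_stream, and the choice of 730 days for "two years" are all hypothetical.

```python
import heapq
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations

TIMEOUT = timedelta(days=730)  # assumption: "two years" taken as 730 days


def clique(authors):
    """Equal-weight clique edges; each author's edges sum to 1.0."""
    unique = sorted(set(authors))
    if len(unique) < 2:
        return {}
    w = 1.0 / (len(unique) - 1)
    return {pair: w for pair in combinations(unique, 2)}


def event_stream(eprints):
    """Yield (time, kind, payload) events for (timestamp, authors) pairs."""
    weight = defaultdict(float)  # edge -> current total weight
    degree = defaultdict(int)    # author -> number of incident edges
    expiring = []                # min-heap of (expiry_time, seq, edges)

    def expire(now):
        # Undo the contribution of every e-print that has timed out.
        while expiring and expiring[0][0] <= now:
            t, _, edges = heapq.heappop(expiring)
            for (a, b), w in edges.items():
                weight[(a, b)] -= w
                if weight[(a, b)] > 1e-12:
                    yield (t, "set_weight", (a, b, weight[(a, b)]))
                    continue
                del weight[(a, b)]
                yield (t, "remove_edge", (a, b))
                for v in (a, b):
                    degree[v] -= 1
                    if degree[v] == 0:  # author is now disconnected
                        del degree[v]
                        yield (t, "remove_node", v)

    ordered = sorted(eprints, key=lambda e: e[0])
    for seq, (t, authors) in enumerate(ordered):
        yield from expire(t)
        edges = clique(authors)
        for (a, b), w in edges.items():
            if (a, b) not in weight:  # new edge: update endpoint degrees
                for v in (a, b):
                    if degree[v] == 0:
                        yield (t, "add_node", v)
                    degree[v] += 1
            weight[(a, b)] += w
            yield (t, "set_weight", (a, b, weight[(a, b)]))
        if edges:
            heapq.heappush(expiring, (t + TIMEOUT, seq, edges))


t0 = datetime(2000, 1, 1)
for event in event_stream([(t0, ["A", "B"]),
                           (t0 + timedelta(days=1000), ["C", "D"])]):
    print(event)
```

In this sketch, expirations are only processed when a later e-print arrives, so e-prints still active at the end of the input never produce removal events. A min-heap keyed by expiry time keeps the timeout handling simple when the input is processed in chronological order.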

Download: scheduler.zip

License: GPL

