Visualization and Analysis of Computer Science Publication Data (VACS)
In computer science, an estimated hundreds of thousands to millions of scientific publications appear every year. Even though the content itself is usually behind a paywall, some metadata are freely accessible, such as the title, abstract and keywords of the publication, the name of the conference, and the names and universities of the authors. A well-known publication series for computer science is the “Lecture Notes in Computer Science” by Springer-Verlag. Since 1973, more than 12 000 books containing over 360 000 papers have been published in this series.
As part of the IMI master project in the summer semester 2020, we visualized and analyzed the metadata of these publications.
First, we had to collect the metadata of each book’s publications from the Springer website. With the help of the Python web scraping framework scrapy, we were able to access the relevant metadata and save it to our MySQL database. Along the way we had to trim and format some of the data so that it could be processed easily later on.
For example, we had to convert download counts that were abbreviated with a letter for the order of magnitude (k, m or b) into plain numbers.
    if 'k' in number:
        value = float(number.replace('k', '')) * 1000
    elif 'm' in number:
        value = float(number.replace('m', '')) * 1000000
    elif 'b' in number:
        value = float(number.replace('b', '')) * 1000000000
    else:
        value = int(number)
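A slightly more defensive variant of this conversion could be wrapped into a small helper. This is only a minimal sketch, not the project's actual code: the function name `parse_count`, the handling of upper-case letters and the treatment of commas as thousands separators are assumptions, and `Decimal` is used merely to avoid floating-point rounding.

```python
from decimal import Decimal

# Multipliers for the abbreviations handled in the snippet above.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '1.2k' into an integer."""
    # Assumption: ',' only ever appears as a thousands separator.
    cleaned = text.strip().lower().replace(",", "")
    if cleaned and cleaned[-1] in SUFFIXES:
        # Multiply the numeric part by the factor of the trailing letter.
        return int(Decimal(cleaned[:-1]) * SUFFIXES[cleaned[-1]])
    return int(cleaned)


print(parse_count("1.2k"))  # 1200
print(parse_count("3m"))    # 3000000
print(parse_count("987"))   # 987
```

Checking only the last character avoids misreading a letter that appears elsewhere in the string, and every branch returns an integer.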
Another important step was to identify and remove duplicates within the different data sets, such as keywords and universities, because Springer does not impose any restrictions on how these values are spelled. Identifying duplicates was essential to make sure that our analysis results were valid. Duplicates not only increase processing times, they also split relations that belong together, which breaks queries that need to follow the data in either direction.
For the keywords we used the fuzzywuzzy Python package. It compares strings, optionally after splitting them into tokens, based on the Levenshtein distance and returns a similarity score between 0 and 100.
    from fuzzywuzzy import fuzz

    singleRatio = fuzz.ratio(cachedKeyword, scrapedKeyword)
    # the score must be above 90 (out of 100)
    if singleRatio > 90:
        # keywords are identified as equal
        pass
This way all of the following spellings
- Artificial Intelligence
- Artificial intelligence
- Artificial inteligence
- artificial intelligence - AI
can be matched with the term “artificial intelligence”.
Using identical and fuzzy matches, the more than 1 600 000 keywords could be reduced by 80% to around 320 000.
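The deduplication loop can be sketched roughly as follows. To keep the example self-contained, it swaps in `difflib.SequenceMatcher` from the standard library instead of fuzzywuzzy, so individual scores may differ slightly from the ones we obtained. The function names, the lower-casing step and the "first spelling wins" rule are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Similarity score from 0 to 100 (stand-in for fuzzywuzzy's ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def canonicalize(keywords, threshold=90):
    """Map every keyword to the first sufficiently similar spelling seen."""
    canonical = []  # normalized spellings accepted so far
    mapping = {}    # original keyword -> canonical spelling
    for keyword in keywords:
        # Assumption: case and repeated whitespace carry no meaning.
        normalized = " ".join(keyword.lower().split())
        match = next(
            (c for c in canonical if similarity(c, normalized) > threshold),
            None,
        )
        if match is None:
            canonical.append(normalized)
            match = normalized
        mapping[keyword] = match
    return mapping


mapping = canonicalize([
    "Artificial Intelligence",
    "Artificial intelligence",
    "Artificial inteligence",
    "Machine learning",
])
print(sorted(set(mapping.values())))
# ['artificial intelligence', 'machine learning']
```

Comparing every new keyword against all canonical spellings is quadratic in the worst case, so a real run over more than a million keywords would need caching or blocking on top of this.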
To find duplicates among the universities and institutions we used the Google Maps API, which identifies each location with a unique place ID. If two addresses (name, city and country) resolve to the same place ID, they are treated as equal. Overall, the initial 640 000 locations could be reduced by 70% to around 190 000.
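The grouping step itself is independent of the API call and can be sketched as below. The `resolve` callable is a hypothetical stand-in for the Google Maps lookup, and the place IDs in the demo are made up; only the idea of merging addresses that share a place ID comes from our process.

```python
from collections import defaultdict


def group_by_place_id(addresses, resolve):
    """Group (name, city, country) tuples that resolve to the same place ID.

    `resolve` is a hypothetical callable that returns a place ID string,
    or None if the address could not be identified.
    """
    groups = defaultdict(list)
    unresolved = []
    for address in addresses:
        place_id = resolve(address)
        if place_id is None:
            # Keep unknown addresses separate instead of merging them blindly.
            unresolved.append(address)
        else:
            groups[place_id].append(address)
    return dict(groups), unresolved


# Demo with a fake lookup table; the IDs are invented for illustration.
fake_ids = {
    ("HTW Berlin", "Berlin", "Germany"): "place-1",
    ("HTW Berlin", "Berlin", "Deutschland"): "place-1",
    ("TU Wien", "Wien", "Österreich"): "place-2",
}
groups, unresolved = group_by_place_id(
    list(fake_ids) + [("Unknown Institute", "Nowhere", "Nowhere")],
    fake_ids.get,
)
print(len(groups), len(unresolved))  # 2 1
```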
Besides these rather complex processes, we also had to take care of multilingual data, as authors often use native-language variants of city and country names. For the data analysis, however, we needed to be able to group data by geographical location. Therefore all city and country names were translated to English using the Python library countryinfo as well as the Nominatim API and the Google Maps API. We also used the Nominatim API to get geo information for the different locations in order to create visualizations like heat, dot and arc maps.
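    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="####")
    address = '%s, %s, %s' % (name, city, country)
    location = geolocator.geocode(address, timeout=10, exactly_one=True)
    # geocode() returns None if no match was found
    if location is not None:
        lat = location.latitude
        lon = location.longitude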
To increase the significance of top rankings, like “The top 10 most publishing universities”, we tried to group the data sets of the universities as well as possible. For example,
- HTW Berlin, Campus Wilhelminenhof and
- HTW Berlin, Treskowallee
would count as one “HTW Berlin” with different locations. The main difficulty here was to avoid false positive matches, which is why the grouping is only partially based on fuzzy matches.
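One conservative, non-fuzzy rule for this kind of grouping is sketched below. It assumes that the campus or street follows the institution name after a comma, as in the two entries above; this rule and the function name `group_campuses` are illustrative assumptions and not a description of our complete grouping logic.

```python
from collections import defaultdict


def group_campuses(institutions):
    """Group entries by the text before the first comma.

    Assumption: everything after the first comma names a campus or site.
    """
    grouped = defaultdict(set)
    for entry in institutions:
        base, _, campus = entry.partition(",")
        # Entries without a comma get a placeholder location.
        grouped[base.strip()].add(campus.strip() or "(no campus given)")
    return {name: sorted(sites) for name, sites in grouped.items()}


print(group_campuses([
    "HTW Berlin, Campus Wilhelminenhof",
    "HTW Berlin, Treskowallee",
]))
# {'HTW Berlin': ['Campus Wilhelminenhof', 'Treskowallee']}
```

Because only exact matches on the base name are merged, this rule cannot produce the false positives that a purely fuzzy comparison of institution names might.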
One of the resulting visualizations, for example, shows the three most popular keywords of each country.
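A ranking such as the three most popular keywords per country boils down to counting keyword occurrences within each country. The following is a minimal sketch of that aggregation, assuming the cleaned data is available as (country, keyword) pairs; the function name and the sample records are made up for illustration.

```python
from collections import Counter, defaultdict


def top_keywords_per_country(records, n=3):
    """Return the n most frequent keywords for every country."""
    counts = defaultdict(Counter)
    for country, keyword in records:
        counts[country][keyword] += 1
    return {
        country: [keyword for keyword, _ in counter.most_common(n)]
        for country, counter in counts.items()
    }


# Invented sample data: four keywords with 4, 3, 2 and 1 occurrences.
records = (
    [("Germany", "machine learning")] * 4
    + [("Germany", "security")] * 3
    + [("Germany", "robotics")] * 2
    + [("Germany", "logic")]
)
print(top_keywords_per_country(records))
# {'Germany': ['machine learning', 'security', 'robotics']}
```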
Have a look at our website to see all visualizations and further information on our process and limitations.