Datasets accompanying the paper “Virtual Research Environments Ethnography: a Preliminary Study”, a systematic mapping study of the literature on Science Gateways, Virtual Research Environments, and Virtual Laboratories.
For legal reasons we cannot share the original datasets obtained by querying the databases, since they include copyrighted data. We can, however, share the two datasets derived from the query results and the two topic modelling datasets.
The dataset “main_dataset.csv” consists of the merged query results from ACM Digital Library, IEEEXplore, ScienceDirect, Scopus, and SpringerLink databases. It is structured into six columns: (i) doi; (ii) title; (iii) content_type; (iv) publication_year; (v) keyword_search; (vi) DB.
The ‘doi’, ‘title’, and ‘publication_year’ labels are self-describing, and are used for the DOIs, titles, and publication years (in the yyyy format) respectively.
The ‘content_type’ label refers to the different, normalised types of resources: (a) Article; (b) Book; (c) Book Chapter; (d) Chapter; (e) Chapter ReferenceWorkEntry; (f) Conference Paper; (g) Conference Review; (h) Early Access Articles; (i) Editorial; (j) Erratum; (k) Letter; (l) Magazines; (m) Masters Thesis; (n) Note; (o) Ph.D. Thesis; (p) Retracted; (q) Review; (r) Short Survey; (s) Standards. Types (c) and (d) refer to the same kind of entry (the two names are used by different databases), while (e) is used in the Springer database mainly for encyclopaedic entries.
The ‘keyword_search’ label is used for identifying the keyword group used for formulating the query: (a) science gateway | scientific gateway; (b) virtual laboratory | Vlab; or (c) virtual research environment.
The ‘DB’ label indicates which of the five databases selected for our study each entry comes from: (a) ACM; (b) IEEE; (c) ScienceDirect; (d) scopus; and (e) Springer, identifying the ACM Digital Library, IEEEXplore, ScienceDirect, Scopus, and SpringerLink respectively.
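As an illustration of the six-column layout described above, the following minimal sketch reads rows in that layout and counts entries per source database and per keyword group. It assumes that the CSV header uses the column labels exactly as listed and that the file is comma-delimited; the sample rows (DOIs, titles) are invented for illustration and do not come from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the layout described for "main_dataset.csv".
# Assumption: the header row matches the six column labels exactly.
SAMPLE = """\
doi,title,content_type,publication_year,keyword_search,DB
10.1000/demo.1,Hypothetical gateway paper,Article,2015,science gateway | scientific gateway,scopus
10.1000/demo.1,Hypothetical gateway paper,Article,2015,science gateway | scientific gateway,IEEE
10.1000/demo.2,Hypothetical virtual lab paper,Conference Paper,2012,virtual laboratory | Vlab,ACM
10.1000/demo.3,Hypothetical VRE chapter,Book Chapter,2009,virtual research environment,Springer
"""


def count_by(rows, column):
    """Count how many rows carry each distinct value of the given column."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
per_db = count_by(rows, "DB")
per_group = count_by(rows, "keyword_search")

for name, total in sorted(per_db.items()):
    print(f"{name}: {total}")
```

To run the same counts on the real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("main_dataset.csv", newline="", encoding="utf-8")`; the encoding is an assumption.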
The dataset “filtered_dataset.csv” consists of the deduplicated and filtered entries (journal articles and conference papers from 2010 onward, with a DOI assigned) from the “main_dataset.csv” we used as the final dataset for answering our research questions. It is structured into ten columns: (i) doi; (ii) title; (iii) venue; (iv) publication_year; (v) content_type; (vi) abstract; (vii) keywords; (viii) science gateway | scientific gateway; (ix) virtual laboratory | Vlab; and (x) virtual research environment.
As in the previous dataset, the ‘doi’, ‘title’, and ‘publication_year’ labels are self-describing, and are used for the DOIs, titles, and publication years (in the yyyy format) respectively.
The ‘venue’ label is used for indicating the conference or the journal the entries refer to. The values derive from the original query results.
The ‘abstract’ and ‘keywords’ labels are used for the abstracts and the keywords associated with the entries. Most values are derived from the original query results; we filled in the missing ones by querying OpenAIRE.
The ‘science gateway | scientific gateway’, ‘virtual laboratory | Vlab’ and ‘virtual research environment’ labels indicate the connection between the entries and the keyword groups used for denoting them. The values are binary (1 if the entry is associated with the keyword group, 0 if it is not).
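The relationship between the two datasets can be sketched as follows. This is not the procedure we actually ran, only a minimal illustration under stated assumptions: that “journal articles and conference papers” correspond to the content types Article and Conference Paper, that duplicates are detected by comparing DOIs case-insensitively, and that a binary column is set to 1 whenever an entry appears in the main dataset under that keyword group. The function name, the sample rows, and the omission of the venue, abstract, and keywords columns are all choices made for this example.

```python
KEYWORD_GROUPS = [
    "science gateway | scientific gateway",
    "virtual laboratory | Vlab",
    "virtual research environment",
]
# Assumption: these two content types stand for journal articles and conference papers.
KEPT_TYPES = {"Article", "Conference Paper"}


def filter_and_deduplicate(rows, min_year=2010):
    """Keep eligible rows, merge duplicates by DOI, and derive the binary group columns."""
    merged = {}
    for row in rows:
        doi = row["doi"].strip().lower()  # DOIs are case-insensitive
        if not doi:
            continue  # entries without a DOI are dropped
        if row["content_type"] not in KEPT_TYPES:
            continue
        if int(row["publication_year"]) < min_year:
            continue
        entry = merged.setdefault(doi, {
            "doi": doi,
            "title": row["title"],
            "publication_year": row["publication_year"],
            "content_type": row["content_type"],
            **{group: 0 for group in KEYWORD_GROUPS},
        })
        if row["keyword_search"] in KEYWORD_GROUPS:
            entry[row["keyword_search"]] = 1
    return list(merged.values())


def make_row(doi, content_type, year, group):
    """Build a hypothetical main-dataset row (the title is a placeholder)."""
    return {"doi": doi, "title": "Hypothetical title", "content_type": content_type,
            "publication_year": year, "keyword_search": group}


sample_rows = [
    make_row("10.1000/x1", "Article", "2012", KEYWORD_GROUPS[0]),
    make_row("10.1000/X1", "Article", "2012", KEYWORD_GROUPS[2]),  # same DOI, other group
    make_row("10.1000/x2", "Book Chapter", "2014", KEYWORD_GROUPS[1]),  # wrong type
    make_row("10.1000/x3", "Conference Paper", "2008", KEYWORD_GROUPS[1]),  # too early
    make_row("", "Article", "2016", KEYWORD_GROUPS[0]),  # no DOI
    make_row("10.1000/x4", "Conference Paper", "2010", KEYWORD_GROUPS[1]),
]
filtered = filter_and_deduplicate(sample_rows)
```

In this sample, the two rows sharing a DOI collapse into one entry flagged for both of its keyword groups, and three rows are dropped for failing a filter.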
The datasets “sg_vlab_vre_topics_datasets.csv” and “sgvlabvre_topics_dataset.csv” contain the results of the two topic modelling analyses. The former holds the three sets of topics from the first analysis (corpus divided into three datasets), and the latter holds the single set of topics from the second analysis (corpus as a whole). They share the same structure: (i) Topic; (ii) #studies; (iii) Representative word; (iv) Representative word weight.
The ‘Topic’ label is used for the topic denomination and the values consist of an alphanumeric string indicating the dataset and the progressive topic number: (a) SG, for the scientific gateway dataset; (b) VRE, for the virtual research environment dataset; (c) VLAB, for the virtual laboratory dataset; and (d) A, for the corpus as a whole.
The ‘#studies’ label indicates the number of studies contributing to each topic.
The ‘Representative word’ and ‘Representative word weight’ labels are used for denoting the keywords describing each topic and their weights respectively.
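The ‘Topic’ values described above can be split into their dataset prefix and topic number with a short helper. The exact label format (for example “SG1”, or a variant with a separator) is not specified here, so the pattern below is an assumption that tolerates an optional space, hyphen, or underscore between the prefix and the number; the function name and the sample labels are hypothetical.

```python
import re

# Prefixes as described for the 'Topic' column.
DATASET_BY_PREFIX = {
    "SG": "scientific gateway",
    "VRE": "virtual research environment",
    "VLAB": "virtual laboratory",
    "A": "corpus as a whole",
}

# Assumption: a known prefix, an optional separator, then the topic number.
TOPIC_PATTERN = re.compile(r"^(SG|VRE|VLAB|A)[\s_-]*(\d+)$", re.IGNORECASE)


def parse_topic(label):
    """Return (dataset name, topic number) for a 'Topic' value."""
    match = TOPIC_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognised topic label: {label!r}")
    prefix = match.group(1).upper()
    return DATASET_BY_PREFIX[prefix], int(match.group(2))


for label in ["SG1", "VLAB-12", "A3"]:
    print(label, "->", parse_topic(label))
```

Labels that do not match the assumed pattern raise a `ValueError` rather than being silently assigned to a dataset.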