NETWORK ANALYSIS OF THE STACK OVERFLOW TAGS

In this paper we analyze the Stack Overflow tags using several criteria from network science. One advantage of network analysis is that complex sets of connections can be made clear. We first extracted the data from the dataset and then applied the network concepts of node degree distribution and node importance (centrality measures). We also provide a brief demonstration of how graph networks and tools can be used to analyze semi-structured text such as tags.


INTRODUCTION
Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. The Stack Overflow dataset consists of Posts, Post Links, Tags, Users, Votes, Badges and Comments. We analyze this data for a specific period, 2018 to 2019.
Question tags are an important part of every question submitted to Stack Overflow, because they allow answering users to monitor questions relevant to their field of expertise and to answer submitted questions promptly. Only users holding an "advanced" status in Stack Overflow are allowed to create new tags; the rest of the community is limited to using the existing tags for their questions. Tagging submitted questions is mandatory (i.e., there is a minimum of one tag per question), and each question can contain up to five tags.
We carry out an analysis of the Stack Overflow tags viewed as a network, or graph. Specifically, we aim to gain some insight into the user communities by representing tags and their co-occurrences as a graph, where the graph nodes are the tags themselves, and an edge between two nodes exists if the corresponding tags are found together in the same Stack Overflow question.
The resulting network edges are weighted: the more questions in which two tags co-occur, the higher the weight of the edge between them. For example, if the tag java is found to co-exist with the tag android in 1000 questions, the weight of the edge between the nodes java and android in our graph will be 1000.
This paper is organized as follows. Section 2 briefly describes the data acquisition process and presents some summary statistics of the data set. Section 3 presents the construction of our first graph based on the whole data set, along with some limited analysis. Section 4 describes the meaningful reduction of our raw data set. Section 5 builds the adjacency matrix. Section 6 deals with various measures of node degree distribution. Section 7 deals with various measures of node importance (centrality). Section 8 presents the conclusion and discussion.
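The weighted graph construction described in the introduction can be sketched in Python as follows. This is a minimal illustration only, not the pipeline used in this paper; the function name cooccurrence_edges and the toy questions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(questions):
    """Count, for every unordered tag pair, the questions that contain both tags."""
    weights = Counter()
    for tags in questions:
        # Sort and de-duplicate so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical toy data, not taken from the Stack Overflow dataset.
questions = [
    ["java", "android"],
    ["java", "android", "sqlite"],
    ["python", "pandas"],
]
edges = cooccurrence_edges(questions)
print(edges[("android", "java")])  # weight of the java-android edge
```

Each key is an alphabetically ordered tag pair and each value is the edge weight, so a question with five tags contributes to ten edges.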

DATA ACQUISITION AND SUMMARY STATISTICS
For data set extraction, we first created a project on Google Cloud, then connected this project to the Google datasets, and finally used the Google BigQuery tools to extract the data. Figure 1 shows the steps.

Dataset Overview
The Stack Overflow dataset consists of the following files, which are treated as tables in our database design (Figure 2). The details of the raw data parsing are given in Table 1. The results are:
• A list of the used tags, ordered by decreasing frequency.
• The relative frequency of each tag, in percent: tag_freq$rel <- tag_freq$freq / sum(tag_freq$freq) * 100
The summary statistics of the raw data are shown in Table 1.
According to Table 2, the 10 most frequent tags account for about 21% of all tag occurrences in the raw data. We can similarly explore the other end of the list, that is, how many tags appear very infrequently in the data set:
length(which(tag_freq$freq < 100))   Sum: 38,726
length(which(tag_freq$freq < 10))   Sum: 13,871
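For readers who do not use R, the same two computations (relative frequency in percent, and the number of tags below a frequency threshold) can be sketched in Python. The counts below are hypothetical and serve only to show the arithmetic; the function names are our own.

```python
def relative_frequencies(freq):
    """Return each tag's share of all tag occurrences, in percent."""
    total = sum(freq.values())
    return {tag: 100.0 * count / total for tag, count in freq.items()}

def count_below(freq, threshold):
    """Number of tags that occur fewer than `threshold` times."""
    return sum(1 for count in freq.values() if count < threshold)

# Hypothetical counts for illustration only.
tag_freq = {"javascript": 500, "python": 300, "rare-tag": 150, "very-rare-tag": 50}
rel = relative_frequencies(tag_freq)
rare = count_below(tag_freq, 100)
```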

REDUCED GRAPH
Table 2 presented a full view of the Stack Overflow dataset. To achieve more accurate results, we reduce this data: we keep only the data from 2018 and 2019 and add a further condition on tags, retaining only tags that occur more than 500 times. Table 3 shows the dataset statistics after the reduction. Figures 4 and 5 show charts of the top tags and their frequencies for 2018-2019.
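The reduction step can be sketched as a date filter followed by a frequency filter. The sketch assumes each question is available as a (creation date, tag list) pair, which is our own simplification; the threshold is lowered to 1 so that the hypothetical toy data produces a visible result.

```python
from collections import Counter
from datetime import date

def reduce_tags(questions, start, end, min_count):
    """Keep tags used more than `min_count` times in questions dated within [start, end]."""
    counts = Counter(
        tag
        for created, tags in questions
        if start <= created <= end
        for tag in tags
    )
    return {tag: n for tag, n in counts.items() if n > min_count}

# Hypothetical toy data; the paper uses a threshold of 500, not 1.
questions = [
    (date(2018, 3, 1), ["python", "pandas"]),
    (date(2019, 6, 1), ["python"]),
    (date(2017, 1, 1), ["python", "perl"]),
]
kept = reduce_tags(questions, date(2018, 1, 1), date(2019, 12, 31), min_count=1)
```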

3.2 BUILDING THE ADJACENCY MATRIX
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLIV-4/W3-2020, 2020 5th International Conference on Smart City Applications, 7-8 October 2020, Virtual Safranbolu, Turkey (online)

Co-occurring Tags
We created a large table of tag co-occurrences; from it, we can find the tags that usually appear together through this query:
SELECT * FROM `seraphic-lock-259921.StackOverflow.posts_questions_partition3` WHERE tag1 = 'javascript' ORDER BY percent DESC LIMIT 20
Figure 6. Pie charts showing the co-occurrences of the 'JavaScript' tag with other tags
These groupings show that 'JavaScript' is related to 'html', 'jQuery', etc.
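The same lookup can be sketched in Python. We assume here that the percent column is the share of questions carrying the first tag that also carry the second one; the definition of that column is not given in the text, so this is an assumption, and the toy questions are hypothetical.

```python
from collections import Counter

def cooccurring_tags(questions, tag, limit=20):
    """List (other_tag, percent) pairs, highest percent first.

    percent = share of the questions tagged `tag` that also carry other_tag
    (an assumed definition, see the lead-in).
    """
    with_tag = [set(tags) for tags in questions if tag in tags]
    counts = Counter(other for tags in with_tag for other in tags if other != tag)
    return [(other, 100.0 * n / len(with_tag)) for other, n in counts.most_common(limit)]

# Hypothetical toy data.
questions = [
    ["javascript", "html"],
    ["javascript", "jquery", "html"],
    ["javascript", "jquery"],
    ["javascript", "html", "css"],
    ["python"],
]
top = cooccurring_tags(questions, "javascript")
```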

Building the Adjacency Matrix
We used BigQuery ML to build the adjacency matrix. BigQuery ML enables users to create and execute machine-learning models in BigQuery using standard SQL queries.
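Independently of the tool used, the adjacency matrix itself can be illustrated with a few lines of Python. This is a sketch under the assumption that the edge weights are available as a dictionary keyed by tag pair; it does not reproduce the BigQuery ML workflow. The java-android weight follows the example from the introduction, and the second edge is hypothetical.

```python
def adjacency_matrix(edge_weights):
    """Build a symmetric weighted adjacency matrix from {(tag_a, tag_b): weight}."""
    tags = sorted({t for pair in edge_weights for t in pair})
    index = {t: i for i, t in enumerate(tags)}
    matrix = [[0] * len(tags) for _ in tags]
    for (a, b), w in edge_weights.items():
        matrix[index[a]][index[b]] = w
        matrix[index[b]][index[a]] = w  # undirected graph: mirror the entry
    return tags, matrix

edge_weights = {("android", "java"): 1000, ("java", "sqlite"): 40}
tags, matrix = adjacency_matrix(edge_weights)
```

Row i and column j of the matrix hold the number of questions in which tags[i] and tags[j] appear together; a zero means that the two tags never co-occur.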

Clusters and Connected Components
The first step is to check whether the network is globally connected. Our graph is not globally connected, because some of its nodes are isolated; Table 6 lists some of these isolated nodes. Figure 7 shows the graph nodes with their edges.
Table 6. Some isolated tags
Figure 7. Full network graph as nodes and edges
We grouped our nodes and their edges into five groups (see Figure 8). Our reduced graph contains 47,972 tags, but we took only the 500 tags with the highest frequency to visualize the graph. Figure 8 shows this grouping of the tags together with some statistical information, which we explain in more detail in the following sections.
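The connectivity check can be sketched with a depth-first traversal. The sketch assumes the graph is stored as a dictionary that maps each tag to the set of its neighbours; the tags shown are hypothetical and are not the isolated tags found in our data.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph {node: set(neighbours)}."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(component)
    return components

# Hypothetical toy graph with one isolated tag.
adj = {"java": {"android"}, "android": {"java"}, "cobol": set()}
components = connected_components(adj)
is_connected = len(components) == 1
isolated = [node for node, nbs in adj.items() if not nbs]
```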

NODE DEGREES
The degree of a node is simply the number of its direct neighbours or, equivalently, the number of links in which the node is involved. Figure 9 shows some of our tag degrees and the average degree of the graph after grouping it into five groups.
Figure 9. Average node degree for network graph after grouping
Table 7. Top 20 Tags degrees
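As a small illustration of the definition, node degrees and the average degree follow directly from a neighbour-set representation of the graph. The toy graph is hypothetical.

```python
def degrees(adj):
    """Degree of each node: the number of its direct neighbours."""
    return {node: len(nbs) for node, nbs in adj.items()}

# Hypothetical toy graph, stored as {node: set(neighbours)}.
adj = {
    "javascript": {"html", "css", "jquery"},
    "html": {"javascript", "css"},
    "css": {"javascript", "html"},
    "jquery": {"javascript"},
}
deg = degrees(adj)
average_degree = sum(deg.values()) / len(deg)
```

The degrees sum to twice the number of edges, because every edge is counted once at each of its two endpoints.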

PATH LENGTH
In a network graph, the length of a path is the number of edges that the path contains, and the average shortest path length is the sum of the shortest path lengths between all pairs of nodes, normalized by n*(n-1), where n is the number of nodes in the graph. Figure 10 shows the average path length for the tag groups. We can see that the average path length for all groups is between 2 and 3 hops.
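The definition above can be sketched with one breadth-first search per node. The sketch treats the graph as unweighted and assumes it is connected (otherwise unreachable pairs would be silently skipped); the toy path graph is hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distance (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def average_shortest_path_length(adj):
    """Sum of distances over all ordered node pairs, divided by n*(n-1)."""
    n = len(adj)
    total = sum(d for s in adj for d in bfs_distances(adj, s).values())
    return total / (n * (n - 1))

# Hypothetical path graph a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
avg_len = average_shortest_path_length(adj)
```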

DEGREE DISTRIBUTION
The degree distribution is one of the most important characteristics for understanding complex networks. Figure 11 shows the degree distribution of the whole graph, and in Figure 12 we can see that the degree sequence of our graph groups follows a power-law distribution. In addition, the graph shows that the network is in the scale-free regime. The degree sequence of the undirected network is extracted from the adjacency matrix of the real network.
Figure 11. Degree distribution for graph before grouping.
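A degree distribution and a rough check for power-law behaviour can be sketched as follows. The least-squares fit on the log-log scale is a simple illustrative estimate and not the procedure used for the figures in this paper; the degree sequence is synthetic and constructed so that P(k) is proportional to k^-2.

```python
import math
from collections import Counter

def degree_distribution(degree_sequence):
    """Return {k: fraction of nodes with degree k}."""
    counts = Counter(degree_sequence)
    n = len(degree_sequence)
    return {k: c / n for k, c in sorted(counts.items())}

def loglog_slope(distribution):
    """Least-squares slope of log P(k) against log k, ignoring k = 0."""
    points = [(math.log(k), math.log(p)) for k, p in distribution.items() if k > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

# Synthetic degree sequence: 16 nodes of degree 1, 4 of degree 2, 1 of degree 4.
degree_sequence = [1] * 16 + [2] * 4 + [4]
dist = degree_distribution(degree_sequence)
slope = loglog_slope(dist)  # close to -2 for this synthetic sequence
```

A roughly straight line on the log-log plot, with a negative slope, is what a power-law degree distribution looks like; the slope estimates the negative of the exponent.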

DIAMETER OF A NETWORK
The diameter is the shortest distance between the two most distant nodes in the network. Equivalently, it is the longest of the shortest path lengths between any pair of vertices. Figure 14 shows that the diameter of all groups is equal to 4.
Figure 14. Network Diameter for network graph after grouping
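Following this definition, the diameter can be sketched as the largest eccentricity over all nodes, where the eccentricity of a node is its largest shortest-path distance to any other node. The sketch assumes a connected, unweighted graph; the toy path graph is hypothetical.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest shortest-path distance from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return max(dist.values())

def diameter(adj):
    """Longest shortest path between any pair of nodes."""
    return max(eccentricity(adj, node) for node in adj)

# Hypothetical path graph a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
diam = diameter(adj)
```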

NETWORK DENSITY
Network density describes the portion of the potential connections in a network that are actual connections. A "potential connection" is a connection that could exist between two nodes, regardless of whether or not it actually does. Figure 15 shows the network density for the groups of our network graph.
Figure 15. Network Density for network graph after grouping
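For an undirected graph with n nodes there are n(n-1)/2 potential connections, so the density reduces to a one-line formula. The sketch below uses hypothetical node and edge counts.

```python
def density(n_nodes, n_edges):
    """Fraction of the n(n-1)/2 possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0  # no potential connections exist
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

sparse = density(4, 4)    # 4 of the 6 possible edges
complete = density(3, 3)  # a triangle: every possible edge is present
```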

MODULARITY
Modularity is one measure of the structure of networks or graphs. It was designed to measure the strength of the division of a network into modules (also called groups, clusters or communities). Figure 16 shows the modularity of our network graph groups.
Figure 16. Network Modularity for network graph after grouping
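One common formulation of modularity for an unweighted, undirected graph is Q = sum over communities c of [L_c / m - (d_c / 2m)^2], where m is the number of edges, L_c the number of edges inside c, and d_c the sum of the degrees of the nodes in c. The sketch below evaluates this formula for a given partition; it is an illustration and not necessarily the variant computed by the tool behind Figure 16. The toy graph is hypothetical.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = {}    # L_c: edges with both endpoints in community c
    degree_sum = {}  # d_c: sum of node degrees in community c
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Hypothetical graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]
community_of = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
q = modularity(edges, community_of)
```

For this toy graph the two triangles form a good partition, and Q evaluates to 5/14, roughly 0.357; putting all nodes in one community gives Q = 0.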

BETWEENNESS CENTRALITY
Betweenness centrality measures the extent to which a node lies on the shortest paths between other pairs of nodes. In Figure 17, we examine the top 20 tags with the highest betweenness centrality.
Table 8. Top 20 tags with the highest betweenness centrality.
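Betweenness centrality can be computed with Brandes' algorithm, sketched below for an unweighted, undirected graph. The values are unnormalized (each unordered pair of nodes counts once), which may differ from the normalization used for Table 8. The toy star graph is hypothetical.

```python
from collections import deque

def betweenness_centrality(adj):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}  # dependency of s on each node
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical star graph: "hub" connects three otherwise unconnected tags.
adj = {"hub": {"a", "b", "d"}, "a": {"hub"}, "b": {"hub"}, "d": {"hub"}}
bc = betweenness_centrality(adj)
```

In the star graph, the hub lies on the single shortest path of each of the three leaf pairs, so its score is 3, while every leaf scores 0.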

CLOSENESS CENTRALITY
The closeness centrality of a node is a measure of how 'close' the node is to all the other nodes in the network. Figure 18 presents the top 20 tags with the highest closeness centrality.
Table 9. Top 20 tags with the highest closeness centrality
Figure 18. Closeness Centrality
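A common way to quantify this is (n - 1) divided by the sum of the shortest-path distances from the node to all other nodes. The sketch below uses this definition for a connected, unweighted graph; the exact normalization behind Table 9 is not stated in the text, so this choice is an assumption. The toy graph is hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distance (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def closeness_centrality(adj):
    """(n - 1) / sum of distances to all other nodes; 0 for isolated nodes."""
    n = len(adj)
    result = {}
    for v in adj:
        total = sum(bfs_distances(adj, v).values())
        result[v] = (n - 1) / total if total > 0 else 0.0
    return result

# Hypothetical path graph a - b - c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
closeness = closeness_centrality(adj)
```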

EIGENVECTOR-BASED CENTRALITY
Eigenvector-based centrality measures express the importance of a node based on the importance of its neighbours. In Figure 19, we examine the top 20 tags with the highest eigenvector centrality.
Table 10. Top 20 tags with the highest eigenvector centrality
Figure 19. Eigenvector based Centrality
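Eigenvector centrality can be approximated by power iteration, sketched below for a connected, unweighted graph. The iteration multiplies by A + I rather than A, a standard shift that keeps the leading eigenvector unchanged while avoiding oscillation on bipartite graphs; the parameter defaults and the toy graph are our own choices.

```python
import math

def eigenvector_centrality(adj, max_iter=1000, tol=1e-10):
    """Power iteration on (A + I); returns scores normalized to unit Euclidean length."""
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        # New score = own score plus the scores of all neighbours.
        new = {v: x[v] + sum(x[nb] for nb in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {v: val / norm for v, val in new.items()}
        if sum(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x

# Hypothetical graph: triangle a-b-c with a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
scores = eigenvector_centrality(adj)
```

In this toy graph, a scores highest because it is linked to every other node, and d scores lowest because its only neighbour is a.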

CONCLUSION AND FUTURE WORK
In this paper, we discussed network analysis of the Stack Overflow tags. We first looked at some statistics and then applied different centrality measures and concepts from network science to the analysis of Stack Overflow, motivated by the increasing presence of graph representations in modern data analysis. We carried out an analysis of the Stack Overflow tags viewed as a network, or graph.
We gained some insight into the user communities by representing tags and their co-occurrences as a graph, where the graph nodes are the tags themselves, and an edge between two nodes exists if the corresponding tags are found together in the same Stack Overflow question.
The resulting network edges are weighted; for example, if the tag java is found to co-exist with the tag android in 1000 questions, the weight of the edge between the nodes java and android in our graph will be 1000. In addition, we focused in this paper on studying the connections between tags in order to identify clustered tags and isolated tags. We hope that we have presented a convincing case for the graph representation.
As future work, we plan to extend our experimentation to cover tags from more real-world datasets and to experiment with a larger number of tags.