--- title: "Visualizing JSON schema" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Visualizing JSON schema} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = T, comment = "#>") knitr::opts_chunk$set(fig.width = 7, fig.height = 5) options(tibble.print_min = 4L, tibble.print_max = 4L) ``` JSON is a very simple data standard that, through nested data structures, can represent incredibly complex datasets. In some cases, a set of JSON data closely corresponds to a table in a SQL database. However, more commonly a JSON document more closely maps to an entire SQL database. Understanding the structure of your JSON data is critical before you begin analyzing the data. In this vignette, we use `tidyjson` to inspect the structure of JSON data and then create various visualizations to help understand a complex JSON dataset. ## JSON Definition For a refresher on JSON, see the [JSON specification](https://www.json.org/), which is a very concise summary of how JSON is formatted. In essence, there are three types of JSON data structures. Per the specification, an object is a name/value pair, like `'{"string": "value"}'`: ![A picture showing a JSON object's structure](`r system.file("json_object.gif", package = "tidyjson")`) An array is an ordered list, like `'[1, 2, 3]'`: ![A picture showing a JSON array's structure](`r system.file("json_array.gif", package = "tidyjson")`) A value is a string, number, logical or NULL scalar: ![A picture showing a JSON value's structure](`r system.file("json_value.gif", package = "tidyjson")`) What is particularly interesting about JSON is the following: * Documents can omit objects, and so no single document is representative of the schema of a collection * Objects and arrays can be deeply nested, it is not uncommon to have an object with arrays of objects with arrays of objects of values * The system generating JSON can change over time, and so older documents can have different implicit schemas from newer documents All of this makes visualizing JSON data structures very useful when embarking on an analysis. ## Load required libraries Before we start, let's load `tidyjson` along with other data manipulation and visualization libraries, and set a seed so we get consistent results. ```{r, message = FALSE} library(jsonlite) library(dplyr) library(purrr) library(magrittr) library(forcats) library(ggplot2) library(igraph) library(RColorBrewer) library(wordcloud) library(viridis) library(listviewer) library(tidyjson) set.seed(1) ``` ## Companies Data Let's work with the `companies` dataset included in the `tidyjson` package, originating at 'jsonstudio.' It is a `r class(companies)` vector of `r length(companies) %>% format(big.mark = ',')` JSON strings, each describing a startup company. We can start by finding out how complex each record is by using `json_complexity`: ```{r} co_length <- companies %>% json_complexity ``` Then we can visualize the distribution of company documents by complexity on a log-scale: ```{r} co_length %>% ggplot(aes(complexity)) + geom_density() + scale_x_log10() + annotation_logticks(side = 'b') ``` It appears that some companies have unlisted length less than 10, while others are in the hundreds or even thousands. The median is `r median(co_length$complexity)`. Let's pick an example that is particularly small to start with: ```{r} co_examp_index <- which(co_length$complexity == 20L)[1] co_examp <- companies[co_examp_index] co_examp ``` Even for such a small example it's hard to understand the structure from the raw JSON. We can instead use `listviewer::jsonedit` to view it: ```{r} co_examp %>% jsonedit(mode = "code") ``` ## Working with many companies This is great for understanding a single JSON document. But many of the objects are empty arrays, and so give us very little insight into the structure of the collection as a whole. To start working with the entire collection, let's use the `json_structure` function in tidyjson which gives us a `data.frame` where each row corresponds to an object, array or scalar in the JSON document. ```{r} co_struct <- companies %>% sample(5) %>% json_structure print(co_struct) ``` We can then aggregate all of the object names across the entire collection, excluding `null` values to count the number of documents with meaningful data under each object name. ```{r} co_names <- co_struct %>% filter(type != "null" & !is.na(name)) %>% group_by(level, name, type) %>% summarize(ndoc = n_distinct(document.id)) co_names ``` We can get a quick overview of the most common names using a `wordcloud`. ```{r} co_names %$% wordcloud(name, ndoc, scale = c(1.5, .1), min.freq = 100) ``` Alternatively, we can visualize all the names in ggplot2. ```{r, fig.height = 9} co_names %>% ungroup %>% group_by(type) %>% arrange(desc(ndoc), level) %>% mutate(rank = 1:n()) %>% ggplot(aes(1, rank)) + geom_text(aes(label = name, color = ndoc)) + scale_y_reverse() + facet_grid(. ~ type) + theme_void() + theme(legend.position = "bottom") + scale_color_viridis(direction = -1) ``` This shows there are many comon strings and arrays, and many uncommon objects, strings and numbers in the documents. ## Visualizing as Graphs To really understand the structure of a document, we need to visaulize it as a graph. Let's define a `plot_json_graph` function to use `igraph` to visualize our JSON documents ```{r} # Plots an igraph visualization of a JSON document # # @param .x a JSON string or tbl_json object # @param legend add a type color legend automatically # @param vertex.size the size of the vertices # @param edge.color the color for the edges # @param edge.width the width of the edge lines # @param show.labels should object names be shown # @param plot should the plot be rendered? # @param ... further arguments to igraph::plot.igraph plot_json_graph <- function(.x, legend = TRUE, vertex.size = 6, edge.color = 'grey70', edge.width = .5, show.labels = TRUE, plot = TRUE, ...) { if (!is.tbl_json(.x)) .x <- as.tbl_json(.x) if (nrow(.x) != 1) stop("nrow(.x) not equal to 1") structure <- .x %>% json_structure type_colors <- RColorBrewer::brewer.pal(6, "Accent") graph_edges <- structure %>% filter(!is.na(parent.id)) %>% select(parent.id, child.id) graph_vertices <- structure %>% transmute(child.id, vertex.color = type_colors[as.integer(type)], vertex.label = name) if (!show.labels) graph_vertices$vertex.label <- rep(NA_character_, nrow(graph_vertices)) g <- igraph::graph_from_data_frame(graph_edges, vertices = graph_vertices, directed = FALSE) if (plot) { op <- par(mar = c(0, 0, 0, 0)) plt <- igraph::plot.igraph(g, vertex.color = igraph::V(g)$vertex.color, vertex.size = vertex.size, vertex.label = igraph::V(g)$vertex.label, vertex.frame.color = NA, layout = layout_with_kk, edge.color = edge.color, edge.width = edge.width, ...) if (legend) legend(x = -1.3, y = -.6, levels(structure$type), pch = 21, col= "white", pt.bg = type_colors, pt.cex = 2, cex = .8, bty = "n", ncol = 1) par(op) } invisible(g) } ``` Let's use the `plot_json_graph` function to look at our simple example: ```{r} '{"object" : {"name": 1}, "array" : ["a", "b"], "string" : "value", "number" : 1, "logical": true, "null" : null}' %>% plot_json_graph ``` `plot_json_graph` produces a graph where every node in the JSON document appears as a vertex, and parent nodes are connected to children through edges. The color is coded to indicate what type each node is, and vertices are labeled based on the name (for objects) used to identify the node. Now let's look at a real company example: ```{r} co_examp %>% plot_json_graph ``` For this company, almost all of the data is in the top level object. Now, let's create a function that plots a panel of these graphs: ```{r, fig.height = 8} plot_json_graph_panel <- function(json, nrow, ncol, ...) { # Set up grid op <- par(mfrow = c(nrow, ncol)) indices <- seq_along(json) %>% keep(`<=`, nrow * ncol) for (i in indices) { plot_json_graph(json[[i]], ...) if ("names" %in% names(attributes(json))) title(names(json)[i], col.main = 'red') } par(op) invisible(NULL) } ``` And let's look at several simultaneously: ```{r, fig.height = 8} plot_json_graph_panel(companies %>% sample(5), 7, 6, legend = FALSE, show.labels = FALSE, vertex.size = 4) ``` Clearly there is a huge amount of variety in the JSON documents! Let's look at the most complex example: ```{r} most_complex <- companies[which(co_length$complexity == max(co_length$complexity))] most_complex_name <- most_complex %>% spread_values(name = jstring(name)) %>% extract2("name") ``` The most complex company is `r most_complex_name`! Let's try to plot it: ```{r} plot_json_graph(most_complex, show.labels = FALSE, vertex.size = 2) ``` That is just too big. There are many arrays of complex objects that are repetitive in structure. Instead, we can simplify the structure by using `json_schema`. ```{r} most_complex %>% json_schema %>% jsonedit(mode = "code") ``` We can visualize this as a graph, and get more meaningful coloring of the terminal nodes by instructing `json_schema` to use `type = "value"`. ```{r} most_complex %>% json_schema(type = "value") %>% plot_json_graph ``` This is overwhelmed by top level scalar objects. We can visualize the more complex objects only ```{r} most_complex %>% gather_object %>% json_types %>% json_complexity %>% filter(type %in% c('array', 'object') & complexity >= 15) %>% split(.$name) %>% map(json_schema, type = "value") %>% plot_json_graph_panel(3, 3, legend = FALSE) ``` ## Working with funding data Now let's use this insight to structure funding and geo data for a visualization. First, let's get funding round data: ```{r} rounds <- companies %>% enter_object(funding_rounds) %>% gather_array %>% spread_values( round = jstring(round_code), currency = jstring(raised_currency_code), raised = jnumber(raised_amount) ) rounds %>% head ``` Now, let's get geo data: ```{r} geos <- companies %>% enter_object(offices) %>% gather_array %>% spread_values( country = jstring(country_code), state = jstring(state_code), description = jstring(description) ) geos %>% head ``` Now, let's join together the data for the US companies, assuming that the first office in the array is the headquarters (seems reasonable from a quick visual inspection of `geos`). ```{r} hqs <- geos %>% filter(array.index == 1) %>% filter(country == "USA") %>% select(document.id, state) rounds_usd <- rounds %>% filter(currency == "USD") %>% filter(!is.na(raised)) %>% select(document.id, round, raised) rounds_by_geo <- inner_join(rounds_usd, hqs, by = "document.id") %>% as_tibble() ``` Now we can visualize the results ```{r} round_prep <- rounds_by_geo %>% dplyr::filter(!is.na(state)) %>% mutate( round = round %>% forcats::fct_collapse( "angel" = c("seed", "angel"), "d-f" = c("d", "e", "f"), "other" = c("grant", "partial", "post_ipo_equity", "private_equity", "debt_round", "unattributed") ) %>% forcats::fct_relevel("angel", "a", "b", "c", "d-f", "other") ) %>% mutate( state = state %>% forcats::fct_lump(2) ) g <- ggplot(round_prep, aes(state, raised, fill = state)) + geom_violin() + scale_y_log10() + annotation_logticks(side = 'l') + facet_grid(. ~ round) + theme(legend.position = "bottom") + labs(x = "", y = "Amount Raised (USD)") g ``` Which shows a few interesting things: * Round sizes don't increase dramatically from b through f * Variance is much higher in angel and a rounds (and other) * NY and other US states have similar distributions with the exception of d-f rounds, where CA seems higher than NY