JSON is a very simple data standard that, through nested data structures, can represent incredibly complex datasets. In some cases, a set of JSON data closely corresponds to a table in a SQL database. However, more commonly a JSON document more closely maps to an entire SQL database.
Understanding the structure of your JSON data is critical before you
begin analyzing the data. In this vignette, we use tidyjson
to inspect the structure of JSON data and then create various
visualizations to help understand a complex JSON dataset.
For a refresher on JSON, see the JSON specification, which is a very concise summary of how JSON is formatted. In essence, there are three types of JSON data structures.
Per the specification, an object is a name/value pair, like
'{"string": "value"}'
:
An array is an ordered list, like '[1, 2, 3]'
:
A value is a string, number, logical or NULL scalar:
What is particularly interesting about JSON is the following:
All of this makes visualizing JSON data structures very useful when embarking on an analysis.
Before we start, let’s load tidyjson
along with other
data manipulation and visualization libraries, and set a seed so we get
consistent results.
Let’s work with the companies
dataset included in the
tidyjson
package, originating at ‘jsonstudio.’ It is a
character vector of 1,000 JSON strings, each describing a startup
company.
We can start by finding out how complex each record is by using
json_complexity
:
Then we can visualize the distribution of company documents by complexity on a log-scale:
co_length %>%
ggplot(aes(complexity)) +
geom_density() +
scale_x_log10() +
annotation_logticks(side = 'b')
It appears that some companies have unlisted length less than 10, while others are in the hundreds or even thousands. The median is 59.
Let’s pick an example that is particularly small to start with:
co_examp_index <- which(co_length$complexity == 20L)[1]
co_examp <- companies[co_examp_index]
co_examp
#> [1] "{ \"_id\" : { \"$oid\" : \"52cdef7f4bab8bd67529c0f8\" }, \"name\" : \"The Brand Saloon\", \"permalink\" : \"the-brand-saloon\", \"crunchbase_url\" : \"http://www.crunchbase.com/company/the-brand-saloon\", \"homepage_url\" : \"http://thebrandsaloon.com\", \"blog_url\" : \"\", \"blog_feed_url\" : \"\", \"twitter_username\" : \"Thebrandsaloon\", \"category_code\" : \"advertising\", \"number_of_employees\" : null, \"founded_year\" : null, \"founded_month\" : null, \"founded_day\" : null, \"deadpooled_year\" : null, \"deadpooled_month\" : null, \"deadpooled_day\" : null, \"deadpooled_url\" : null, \"tag_list\" : \"search-engine-optimization-company-mumbai, seo-agency, seo-services, digital-marketing, digital-marketing-agency-mumbai\", \"alias_list\" : \"\", \"email_address\" : \"[email protected]\", \"phone_number\" : \"91-22-32632494\", \"description\" : \"Internet Marketing Services\", \"created_at\" : \"Fri Jun 19 22:40:15 UTC 2009\", \"updated_at\" : \"Tue Dec 17 13:19:20 UTC 2013\", \"overview\" : \"<p>Grooming Your Brands</p>\\n\\n<p>The Brand Saloon Inc. Offers High Quality Internet Advertising Services. TBS Specialized In Services like Search Engine Optimization(SEO) And Email Marketing Services. TBS Also Offers many Different Advertising Packages To Promote Your Website And Your Business On Internet. TBS Wants To Take Customer Satisfaction To The Next Level In The Field Of Internet Marketing Services.</p>\", \"image\" : null, \"products\" : [], \"relationships\" : [], \"competitions\" : [], \"providerships\" : [], \"total_money_raised\" : \"$0\", \"funding_rounds\" : [], \"investments\" : [], \"acquisition\" : null, \"acquisitions\" : [], \"offices\" : [], \"milestones\" : [], \"ipo\" : null, \"video_embeds\" : [], \"screenshots\" : [], \"external_links\" : [ { \"external_url\" : \"http://fiveonline.in/\", \"title\" : \"Five Online Web Solutions\" } ], \"partners\" : [] }"
Even for such a small example it’s hard to understand the structure
from the raw JSON. We can instead use listviewer::jsonedit
to view it:
This is great for understanding a single JSON document. But many of the objects are empty arrays, and so give us very little insight into the structure of the collection as a whole.
To start working with the entire collection, let’s use the
json_structure
function in tidyjson which gives us a
data.frame
where each row corresponds to an object, array
or scalar in the JSON document.
co_struct <- companies %>% sample(5) %>% json_structure
print(co_struct)
#> # A tbl_json: 1,192 x 10 tibble with a "JSON" attribute
#> ..JSON document.id parent.id level index child.id seq name type length
#> <chr> <int> <chr> <int> <int> <chr> <list> <chr> <fct> <int>
#> 1 "{\"_id\… 1 <NA> 0 1 1 <list> <NA> obje… 42
#> 2 "{\"_id\… 2 <NA> 0 1 1 <list> <NA> obje… 42
#> 3 "{\"_id\… 3 <NA> 0 1 1 <list> <NA> obje… 42
#> 4 "{\"_id\… 4 <NA> 0 1 1 <list> <NA> obje… 42
#> # ℹ 1,188 more rows
We can then aggregate all of the object names across the entire
collection, excluding null
values to count the number of
documents with meaningful data under each object name.
co_names <- co_struct %>%
filter(type != "null" & !is.na(name)) %>%
group_by(level, name, type) %>%
summarize(ndoc = n_distinct(document.id))
#> `summarise()` has grouped output by 'level', 'name'. You can override using the
#> `.groups` argument.
co_names
#> # A tibble: 106 × 4
#> # Groups: level, name [106]
#> level name type ndoc
#> <int> <chr> <fct> <int>
#> 1 1 _id object 5
#> 2 1 acquisition object 1
#> 3 1 acquisitions array 5
#> 4 1 alias_list string 2
#> # ℹ 102 more rows
We can get a quick overview of the most common names using a
wordcloud
.
co_names %$% wordcloud(name, ndoc, scale = c(1.5, .1), min.freq = 100)
#> Warning in wordcloud(name, ndoc, scale = c(1.5, 0.1), min.freq = 100):
#> source_description could not be fit on page. It will not be plotted.
#> Warning in wordcloud(name, ndoc, scale = c(1.5, 0.1), min.freq = 100):
#> acquisitions could not be fit on page. It will not be plotted.
#> Warning in wordcloud(name, ndoc, scale = c(1.5, 0.1), min.freq = 100):
#> relationships could not be fit on page. It will not be plotted.
#> Warning in wordcloud(name, ndoc, scale = c(1.5, 0.1), min.freq = 100):
#> video_embeds could not be fit on page. It will not be plotted.
#> Warning in wordcloud(name, ndoc, scale = c(1.5, 0.1), min.freq = 100):
#> screenshots could not be fit on page. It will not be plotted.
Alternatively, we can visualize all the names in ggplot2.
co_names %>%
ungroup %>%
group_by(type) %>%
arrange(desc(ndoc), level) %>%
mutate(rank = 1:n()) %>%
ggplot(aes(1, rank)) +
geom_text(aes(label = name, color = ndoc)) +
scale_y_reverse() +
facet_grid(. ~ type) +
theme_void() +
theme(legend.position = "bottom") +
scale_color_viridis(direction = -1)
This shows there are many comon strings and arrays, and many uncommon objects, strings and numbers in the documents.
To really understand the structure of a document, we need to
visaulize it as a graph. Let’s define a plot_json_graph
function to use igraph
to visualize our JSON documents
# Plots an igraph visualization of a JSON document
#
# @param .x a JSON string or tbl_json object
# @param legend add a type color legend automatically
# @param vertex.size the size of the vertices
# @param edge.color the color for the edges
# @param edge.width the width of the edge lines
# @param show.labels should object names be shown
# @param plot should the plot be rendered?
# @param ... further arguments to igraph::plot.igraph
plot_json_graph <- function(.x, legend = TRUE, vertex.size = 6,
edge.color = 'grey70', edge.width = .5,
show.labels = TRUE, plot = TRUE,
...) {
if (!is.tbl_json(.x)) .x <- as.tbl_json(.x)
if (nrow(.x) != 1) stop("nrow(.x) not equal to 1")
structure <- .x %>% json_structure
type_colors <- RColorBrewer::brewer.pal(6, "Accent")
graph_edges <- structure %>%
filter(!is.na(parent.id)) %>%
select(parent.id, child.id)
graph_vertices <- structure %>%
transmute(child.id,
vertex.color = type_colors[as.integer(type)],
vertex.label = name)
if (!show.labels)
graph_vertices$vertex.label <- rep(NA_character_, nrow(graph_vertices))
g <- igraph::graph_from_data_frame(graph_edges, vertices = graph_vertices,
directed = FALSE)
if (plot) {
op <- par(mar = c(0, 0, 0, 0))
plt <- igraph::plot.igraph(g,
vertex.color = igraph::V(g)$vertex.color,
vertex.size = vertex.size,
vertex.label = igraph::V(g)$vertex.label,
vertex.frame.color = NA,
layout = layout_with_kk,
edge.color = edge.color,
edge.width = edge.width,
...)
if (legend)
legend(x = -1.3, y = -.6, levels(structure$type), pch = 21,
col= "white", pt.bg = type_colors,
pt.cex = 2, cex = .8, bty = "n", ncol = 1)
par(op)
}
invisible(g)
}
Let’s use the plot_json_graph
function to look at our
simple example:
'{"object" : {"name": 1},
"array" : ["a", "b"],
"string" : "value",
"number" : 1,
"logical": true,
"null" : null}' %>%
plot_json_graph
plot_json_graph
produces a graph where every node in the
JSON document appears as a vertex, and parent nodes are connected to
children through edges. The color is coded to indicate what type each
node is, and vertices are labeled based on the name (for objects) used
to identify the node.
Now let’s look at a real company example:
For this company, almost all of the data is in the top level object.
Now, let’s create a function that plots a panel of these graphs:
plot_json_graph_panel <- function(json, nrow, ncol, ...) {
# Set up grid
op <- par(mfrow = c(nrow, ncol))
indices <- seq_along(json) %>% keep(`<=`, nrow * ncol)
for (i in indices) {
plot_json_graph(json[[i]], ...)
if ("names" %in% names(attributes(json)))
title(names(json)[i], col.main = 'red')
}
par(op)
invisible(NULL)
}
And let’s look at several simultaneously:
plot_json_graph_panel(companies %>% sample(5), 7, 6, legend = FALSE, show.labels = FALSE,
vertex.size = 4)
Clearly there is a huge amount of variety in the JSON documents!
Let’s look at the most complex example:
most_complex <- companies[which(co_length$complexity == max(co_length$complexity))]
most_complex_name <- most_complex %>%
spread_values(name = jstring(name)) %>%
extract2("name")
The most complex company is Groupon! Let’s try to plot it:
That is just too big. There are many arrays of complex objects that
are repetitive in structure. Instead, we can simplify the structure by
using json_schema
.
We can visualize this as a graph, and get more meaningful coloring of
the terminal nodes by instructing json_schema
to use
type = "value"
.
This is overwhelmed by top level scalar objects. We can visualize the more complex objects only
Now let’s use this insight to structure funding and geo data for a visualization.
First, let’s get funding round data:
rounds <- companies %>%
enter_object(funding_rounds) %>%
gather_array %>%
spread_values(
round = jstring(round_code),
currency = jstring(raised_currency_code),
raised = jnumber(raised_amount)
)
rounds %>% head
#> # A tbl_json: 6 x 6 tibble with a "JSON" attribute
#> ..JSON document.id array.index round currency raised
#> <chr> <int> <int> <chr> <chr> <dbl>
#> 1 "{\"id\":6252,\"rou..." 1 1 seed USD 2000000
#> 2 "{\"id\":15229,\"ro..." 1 2 partial USD 200000
#> 3 "{\"id\":22565,\"ro..." 1 3 debt_round USD 2606500
#> 4 "{\"id\":34222,\"ro..." 1 4 b USD 1500000
#> # ℹ 2 more rows
Now, let’s get geo data:
geos <- companies %>%
enter_object(offices) %>%
gather_array %>%
spread_values(
country = jstring(country_code),
state = jstring(state_code),
description = jstring(description)
)
geos %>% head
#> # A tbl_json: 6 x 6 tibble with a "JSON" attribute
#> ..JSON document.id array.index country state description
#> <chr> <int> <int> <chr> <chr> <chr>
#> 1 "{\"description\":..." 1 1 USA MA "OutSmart Power …
#> 2 "{\"description\":..." 2 1 CAN <NA> ""
#> 3 "{\"description\":..." 4 1 PRT <NA> "Headquarters"
#> 4 "{\"description\":..." 5 1 USA MI "Lycera"
#> # ℹ 2 more rows
Now, let’s join together the data for the US companies, assuming that
the first office in the array is the headquarters (seems reasonable from
a quick visual inspection of geos
).
hqs <- geos %>%
filter(array.index == 1) %>%
filter(country == "USA") %>%
select(document.id, state)
rounds_usd <- rounds %>%
filter(currency == "USD") %>%
filter(!is.na(raised)) %>%
select(document.id, round, raised)
rounds_by_geo <- inner_join(rounds_usd, hqs, by = "document.id") %>% as_tibble()
Now we can visualize the results
round_prep <- rounds_by_geo %>%
dplyr::filter(!is.na(state)) %>%
mutate(
round = round %>% forcats::fct_collapse(
"angel" = c("seed", "angel"),
"d-f" = c("d", "e", "f"),
"other" = c("grant", "partial", "post_ipo_equity", "private_equity",
"debt_round", "unattributed")
) %>% forcats::fct_relevel("angel", "a", "b", "c", "d-f", "other")
) %>%
mutate(
state = state %>% forcats::fct_lump(2)
)
g <- ggplot(round_prep, aes(state, raised, fill = state)) +
geom_violin() +
scale_y_log10() +
annotation_logticks(side = 'l') +
facet_grid(. ~ round) +
theme(legend.position = "bottom") +
labs(x = "", y = "Amount Raised (USD)")
g
Which shows a few interesting things: