Schema Inference for Massive JSON Datasets

Proceedings of the Conference on Extending Database Technology (EDBT), pages 222-233. (2017)
DOI: 10.5441/002/edbt.2017.21

Abstract

In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible.

In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision, and conciseness of inferred schemas, and scalability.
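
Illustration

The abstract does not spell out the type language or the algorithm, so the following is only a minimal sketch of the general idea, not the paper's method: infer a structural type for each JSON value (a map step), then merge the types pairwise (a reduce step), recording union types and optional fields where records disagree. The names `infer`, `fuse`, `fuse_records`, `show`, and `infer_schema`, the type names (`Num`, `Str`, ...), and the output notation are all assumptions made for this example. Plain `functools.reduce` stands in for the Spark implementation the abstract mentions.

```python
import json
from functools import reduce

# Hypothetical basic type names; the paper's own type language may differ.
BASIC = {type(None): "Null", bool: "Bool", int: "Num", float: "Num", str: "Str"}

def infer(value):
    """Map step: structural type of one parsed JSON value.

    A type is a dict from kind name to payload; several keys mean a union,
    and the empty dict is the empty type (element type of []).
    """
    if isinstance(value, dict):
        # Each field maps to (field type, optional flag).
        return {"Record": {k: (infer(v), False) for k, v in value.items()}}
    if isinstance(value, list):
        # Collapse all element types into one fused element type.
        return {"Array": reduce(fuse, map(infer, value), {})}
    return {BASIC[type(value)]: None}

def fuse(a, b):
    """Reduce step: merge two types into one that covers both."""
    out = {}
    for kind in a.keys() | b.keys():
        if kind in a and kind in b:
            if kind == "Array":
                out[kind] = fuse(a[kind], b[kind])
            elif kind == "Record":
                out[kind] = fuse_records(a[kind], b[kind])
            else:
                out[kind] = None
        else:
            out[kind] = a[kind] if kind in a else b[kind]
    return out

def fuse_records(r1, r2):
    """Merge two record types; a field missing on one side becomes optional."""
    out = {}
    for name in r1.keys() | r2.keys():
        if name in r1 and name in r2:
            (t1, o1), (t2, o2) = r1[name], r2[name]
            out[name] = (fuse(t1, t2), o1 or o2)
        else:
            t, _ = r1[name] if name in r1 else r2[name]
            out[name] = (t, True)
    return out

def show(t):
    """Render a type as text: '+' for unions, '?' for optional fields."""
    if not t:
        return "Empty"
    parts = []
    for kind in sorted(t):
        if kind == "Array":
            parts.append("[" + show(t[kind]) + "]")
        elif kind == "Record":
            fields = ", ".join(
                f"{name}{'?' if opt else ''}: {show(ft)}"
                for name, (ft, opt) in sorted(t[kind].items()))
            parts.append("{" + fields + "}")
        else:
            parts.append(kind)
    return " + ".join(parts)

def infer_schema(lines):
    """Infer one schema from an iterable of JSON documents (one per line)."""
    return reduce(fuse, (infer(json.loads(line)) for line in lines))

lines = ['{"id": 1, "name": "a"}', '{"id": "2", "tags": ["x", 3]}']
print(show(infer_schema(lines)))
# prints: {id: Num + Str, name?: Str, tags?: [Num + Str]}
```

In this sketch `fuse` is associative and commutative, so partial schemas can be merged in any order. That property is what makes a reduce over partitions of a large collection possible in a distributed engine.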
