An Efficient NoSQL-Based Storage Schema for Large-Scale Time Series Data

Ruizhe Ma, Weiwei Zhou, Zongmin Ma

Source Title: Journal of Database Management (JDM) 35(1)

DOI: 10.4018/JDM.339915

Abstract

In the Internet of Things (IoT), most data from connected devices change over time and are sampled at intervals; such data are called time-series data. It is challenging to design a time-series storage model that can ingest massive time-series data in a short time and still support querying and analyzing the persisted data over long periods. This paper constructs RHTSDB (Redis-HBase Time Series Database), a storage model based on Redis and HBase. RHTSDB uses the in-memory database Redis (Remote Dictionary Server) to cache massive time-series data, providing efficient data storage and query functions, and uses HBase for long-term storage so that the data are persisted. The paper designs a hot/cold separation mechanism for time-series data, in which infrequently accessed cold data are stored in HBase and frequently accessed and most recent data are stored in Redis. Experiments show that RHTSDB has clear advantages over Apache IoTDB and HBase in data ingestion and query efficiency.

Introduction

With the development of the Internet of Things (IoT) (Eom & Lee, 2017), the amount of time-series data has grown explosively. Time-series data refers to a sequence of data points collected at fixed time intervals (Lee & Chung, 2014). Each data point is associated with a timestamp that indicates when it was generated. Typically, the data collected by a sensor over a particular period can be expressed as a time series [(t₁, v₁), (t₂, v₂), ..., (tₙ, vₙ)], where vᵢ is the value collected at time tᵢ (Di Martino et al., 2019). Complete time-series data can include not only the collection time and collected value but also information describing the source of the value, such as the names of the collection subject and the collection index. Real-world use cases generate large amounts of measurement data from millions or billions of distinct sources. For example, Slack collects measurement data from 4 billion unique sources per day at 12 million samples per second, generating up to 12 TB of compressed data daily. It is therefore essential to manage and process large amounts of time-series data efficiently. Unfortunately, many off-the-shelf systems cannot scale to support these workloads, which leads to a haphazard patchwork of fragile customized solutions (Solleza, Crotty, Karumuri, Tatbul & Zdonik, 2022). For this reason, diverse time series databases have been proposed to ensure efficient ingestion and to save as much storage space as possible. Because time-series data in applications are generally massive, and the redundant source description information they carry is itself enormous, storing and querying massive time-series data efficiently is challenging.
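The representation described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the class names `Source` and `TimeSeries`, the field names `subject` and `index`, and the sample values are all hypothetical, chosen only to show how a series of (t, v) pairs relates to its source description and why repeating that description on every point is redundant.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Source:
    """Source description shared by every point of one series (hypothetical fields)."""
    subject: str  # name of the collection subject, e.g. a device
    index: str    # name of the collection index, e.g. a measured quantity


@dataclass
class TimeSeries:
    """One series: a single source description plus ordered (t_i, v_i) pairs."""
    source: Source
    points: List[Tuple[int, float]] = field(default_factory=list)

    def append(self, t: int, v: float) -> None:
        # Keep timestamps strictly increasing so the list stays a valid series.
        if self.points and t <= self.points[-1][0]:
            raise ValueError("timestamps must be strictly increasing")
        self.points.append((t, v))

    def flat_rows(self) -> List[Tuple[str, str, int, float]]:
        # Denormalized view: the source description is repeated on every row,
        # which is the kind of redundancy the paragraph above refers to.
        return [(self.source.subject, self.source.index, t, v)
                for t, v in self.points]


# Illustrative data only; the values are made up for this sketch.
series = TimeSeries(Source(subject="device-1", index="temperature"))
for t, v in [(1000, 21.5), (2000, 21.7), (3000, 21.6)]:
    series.append(t, v)

rows = series.flat_rows()
```

In this sketch the grouped form stores the source description once per series, while `flat_rows()` repeats it for each point; with millions of sources and high sampling rates, that repetition is one reason storage layout matters.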

We identify two major categories of time series databases, which this paper calls native time series databases and common time series databases. Native time series databases are storage systems developed specifically for time-series data management according to the structural and usage characteristics of such data, for example InfluxDB¹, FluteDB (Li et al., 2018), and Apache IoTDB (Wang et al., 2020, 2023). This category of time series databases can efficiently reduce storage overhead and query delay. However, many other functions and operations are essential for time-series data management and processing, such as flexible aggregation, data retention, and multidimensional range queries. Native time series databases do not fully support time-series data analysis, whereas mature database systems are good at handling relationships between data but provide many operations and guarantees that time series do not need, which adds inefficiency and unnecessary complexity (Shafer, Sambasivan, Rowe, & Ganger, 2013). Common time series databases are storage systems that directly apply general-purpose databases to storing and processing time-series data. Depending on the type of database applied, we further identify two categories of common time series databases. The first uses relational databases as the back end (e.g., Rhea et al., 2017). In recent years, NoSQL (Not only SQL) databases have attracted increasing attention from both academia and industry (Hu & Dessloch, 2015); they offer flexible data representation models and horizontal hardware scalability, so that Big Data can be processed in real time (Bajaj & Bick, 2020). The second category of common time series databases uses NoSQL databases to process time-series data (Di Martino et al., 2019).
