Jun 16, 2015

Conquering With CLOBs

Introduction

Today’s web-based applications can freely pass around large amounts of data formatted in XML and JSON. Businesses often want to hang on to and analyze even the most obscure pieces of information, either for historical (audit) purposes, or to attempt to mine the data for competitive advantage. Storing XML or JSON in databases can be problematic. Fortunately, modern databases rise to the challenge and include data types that allow us to store large blocks of text outside of the tablespace while still providing near seamless access to that data. The CLOB is how Oracle stores large blocks of character data.

What is a CLOB?

A character large object is a data type in a database management system capable of storing very large amounts of text (string) data. CLOBs are often stored in an out-of-table location apart from other table data. The majority of database systems support this data type; some are named text, memo, or long character. In Oracle, a CLOB can store an array of character data up to 4GB in size.

Why Not Just Use varchar?

In Oracle, the varchar and varchar2 data types are identical and are the typical type used for storing string data. The varchar type has a maximum size of 4000 bytes, which is usually enough for most purposes. varchar data is stored within the table along with whatever columns that are defined in the table. The problem is, when we need to store JSON or XML data, 4000 bytes gets used up very quickly. Due to the manner in which the database allocates storage for data, large arrays of character data can cause performance and optimization problems. Thus, the CLOB type was invented to solve the problem. If you are using another DBMS, SQL Server for example, the varchar types have a larger maximum size and thus you may be able to get away with using them for storing JSON and/or XML data. However, the risk with using any fixed size character field for marked up text is that someone along the line will insert some data that will exceed the stated storage capacity. If you have tight control over that data, varchar might work, otherwise 4GB gives you a lot of buffer in your design.

How to create a table with a CLOB column (Oracle)

Creating a table with a CLOB column is simple:

The above SQL create statement creates a new table that includes an id column, a creation date timestamp column to record when a row is inserted, and a CLOB column named “data” in which we will store some JSON data.

Using CLOBs

Everything in software design involves trade-offs and CLOBs are no different. For the price of being able to store large amounts of out-of-table character data, interacting with that data becomes slightly more difficult. As the desire to store JSON and XML data in tables has become more prevalent, newer versions of popular databases are eliminating the overhead and the hassle of dealing with the CLOB type. My notes and observations come from working with Oracle 11g (v11.2.0.2.0). It is important to mention that in Release 12, Oracle has upped the ante even further by adding native support for XML and JSON data, including constraining CLOB columns to a particular type (a check to ensure the data is valid JSON):

That type of constraint is not available in version 11g. Version 12 also includes the capability of writing select queries using a dotted notation directly on data contained within the JSON structure, for example:

Working with CLOB Data

Inserting bulk data is seamless, depending on which tool or framework you are using.

Here is an example insert statement:

There is a ‘to_clob()’ function you can wrap around the string literal if your development tool requires it; IntelliJ does not. Note that there is a hard limit on the length of the actual string in the SQL statement itself, regardless of what data type your column is. This gist example generates the “string too long” error.

There is a work-around for inserting large JSON or XML structures from your SQL statements.

Deleting data stored in CLOB form requires no special syntax. Depending on the DBMS and the amount of data being removed, it may be prudent for the DBA to reclaim the LOB space.

Writing SQL to Interact With a CLOB

Querying a CLOB requires different syntax:

Updates that completely replace your CLOB data are trivial:

What is more interesting than a straight update of the CLOB data is that you can edit parts of your XML or JSON in a batch fashion:

That query could easily be extended to update many rows simultaneously, perhaps to add or repair previously saved XML.

Hibernate and CLOBs

Hibernate makes working with CLOB data completely seamless. A CLOB column is simply represented as a String type. In Java entity classes, a CLOB looks exactly the same as varchar data. I have pushed a simple example to Github. 

Outside of Grails, few developers seem to be generating their tables using Hibernate, probably because DBAs want to be able to review and run SQL in higher environments such as QA and PROD, since many companies do not give developers full access to those databases. For those developers fortunate enough to have free rein, adding the annotation ‘@Lob’ to the intended property of your entity, will ensure that Hibernate uses the CLOB datatype instead of varchar.

Again, if you are not using Hibernate to create your tables for you, the @Lob annotation is spurious.

The BLOB

The BLOB datatype is worth a quick mention: A BLOB is a Binary Large Object, and the name is exactly what it implies. Instead of character data, BLOBs contain binary data, for example images, PDFs; anything really, that is large in size and other than text. Like CLOBs, BLOBs are stored out of table and have special considerations around using them.

A Word About NoSQL

I will be criticized if I fail to mention NoSQL solutions because they were created specifically to deal with storing JSON. Hadoop, Couchbase, and others represent groundbreaking technology for storing XML and JSON in simple map structures while avoiding the overhead of a full DBMS. These tools are excellent, but I have not yet been able to make use of them. Company data is either extremely valuable, sensitive (e.g. PHI or PCI), or both, and in these days of major embarrassing data breaches, all data is being more conservatively looked after than ever before. Due to the high value placed on data, it is difficult sell for anyone in a leadership position (not to mention a developer) to make the case for implementing a NoSQL solution. Even where it is possible to do so, individuals must first answer many difficult questions relating to architecture, security, monitoring, redundancy, backups, purchasing and allocation of adequate storage, and training of administrative staff. Companies heavily invested in databases such as Oracle are not often interested in procuring a competing form of data storage. No matter how speedy or technically sensible a NoSQL solution may be, it may not be feasible to implement it. I fully expect that NoSQL will achieve wider adoption in the future as Oracle and other large database players (“trusted brands”) are now producing their own NoSQL products. Perhaps in the near future NoSQL capabilities will be considered common infrastructure for developers to solve problems with.

Conclusion

The CLOB datatype is a specialized and very useful type available in modern database systems. Using CLOBs when needed will keep your database running efficiently and thereby keep you out of trouble with your DBA. Future widespread availability of NoSQL solutions will no doubt allow for even more thoughtful system architectures to be built. In the meantime, CLOBs can fill the gap.

About the Author

Object Partners profile.
Leave a Reply

Your email address will not be published.

Related Blog Posts
Natively Compiled Java on Google App Engine
Google App Engine is a platform-as-a-service product that is marketed as a way to get your applications into the cloud without necessarily knowing all of the infrastructure bits and pieces to do so. Google App […]
Building Better Data Visualization Experiences: Part 2 of 2
If you don't have a Ph.D. in data science, the raw data might be difficult to comprehend. This is where data visualization comes in.
Unleashing Feature Flags onto Kafka Consumers
Feature flags are a tool to strategically enable or disable functionality at runtime. They are often used to drive different user experiences but can also be useful in real-time data systems. In this post, we’ll […]
A security model for developers
Software security is more important than ever, but developing secure applications is more confusing than ever. TLS, mTLS, RBAC, SAML, OAUTH, OWASP, GDPR, SASL, RSA, JWT, cookie, attack vector, DDoS, firewall, VPN, security groups, exploit, […]