Introduction
The evolution of the modern Relational
Database Management System has been shaped by the database itself, but the
problem remains the “distance” between the decision maker and the data. The
complexity arises when the decision maker requires information from the data in
the form of answers to queries. Now it is about the toolkit that allows for the
searching, indexing and querying of vast amounts of disparate data sources that
exist within the enterprise; enter the data warehouse: a read only set of data
and query results all shaped up and fit for mining. But for today’s management
the time taken to run the warehousing routine and subsequently mine it is too
long! They, management, want answers to unstructured questions and general
grammatical queries in milliseconds utilising all of the available warehoused
information and up to the moment transaction data.
Is it that senior managers, or managers in
general, lack the basic information retrieval skills necessary to seek and
utilise the toolkit resources available to them within the enterprise data
structure?
The evolution of databases
The information environment
Most
commonly identified as “the information environment is the aggregate of
individuals, organizations, and systems that collect, process, disseminate or
act on information.” In today’s world this leads to a
more complex information environment. There has been an exponential explosion
in data and data capture, as the internet of things starts to gain a foothold.
The growth of “things” has led to an enormous growth in sparse data in which
schema less data structures are become the vogue allowing the storage of data
in the database without prior knowledge of its definition. There is a downside
to schema less data model in that: at some point it is necessary to derive or
imply a schema.
Bringing together the disparate data stores
and systems management processes within the enterprise into a homogenous whole
is the goal of the information systems manager and is primarily considered as
the enterprise information environment. One of the primary reasons for
designing and developing a database is for its use as an integration tool.
Ask the right questions of your data!
The alternatives to SQL
![]() |
| Figure 1 the types of noSQL databases |
In a nutshell if you are serious about your
enterprise information system then there is not a viable alternative to SQL
RDBMs. It’s super-fast, reliable, scalable, resilient and cost effective. noSQL
is simply muddying the waters for the media savvy. To start with it is not just
one alternative technology, but they share a scale out philosophy and are
predominantly open source which makes choosing one of them a difficult choice
as vendor lock in becomes critical to the success of its implementation within
the enterprise. So thematically there are four database types within the noSQL
camp each having a competitive advantage in the data environment. However,
within each theme there are a plethora of alternates as there are over 250
different offerings: all of them different, all of them proprietary in nature;
and, all of them demand more work, effort, resources, time and money than
originally planned for and advertised. Object database management systems still
exist in an extremely although in an extremely thin niche.
SQL relational databases do not have
scalability issues and problems, Facebook recently extended its SQL systems to
cover databases in the petabyte range of size. noSQL is cluster friendly so
scales horizontally efficiently, which is on a per unit basis more cost
attractive than up scaling an SQL server. A point which truly expresses the
lack of thought within the noSQL camp is that the very “popular” description noSQL
came from a meeting hashtag on twitter. noSQL lacks a standard language SQL has
been ANSI almost since the beginning. Because of the factional nature of the noSQL
market there are a severe and stark lack of skills (Cruz, 2012) .
·
RDBMS == ACID
·
noSQL == BASE
For the foreseeable future SQL is going to
continue to attract the majority of investment, with noSQL utilised and applied
in specific project arenas that aid and abet in the construction and management
of the data warehouse as well as being implemented in the OLAP and BI processes.
The data warehouse
![]() |
Everything ploughed into the data warehouse
is a solid philosophy. A data warehouse is not a technology but an
architecture. The advantage is gained in recognising the strategic value of a
data warehouse. Having identified the need the architectural requirements mean
that the organisation benefits from simply planning and designing for a data
warehouse.
![]() |
| Figure 3 Traditional process leading towards a Business Intelligence system |
Star schema
This is often considered as the simplest of
the data warehouse schemas, as such is utilised pervasively. A single fact
table is linked to multiple dimension tables radially: by use of a primary and
foreign key relationship. The fact table is third normal form, whereas the
dimension tables are de-normalised.
Snowflake schema
This schema is again often considered as
more complex. As with the star schema the fact table is third normal form, but
this time round the dimension tables are also third normal form. Although the
schema saves on redundant data and space within the database, there is an
increase in the number of joins required to build the dimension tables.
Fact constellation
This schema requires the deconstruction of
the star schema into to multiple fact tables with many shared dimensional
links. Fact constellations are difficult to design due to the aggregation of variants.
The language of relationships is changing
and evolving with trends.
![]() |
| Figure 4 Enterprise Information Systems Architecture |
Data modelling
![]() |
| Figure 5 the redundancy of schema less design |
Feature
|
Conceptual
|
Logical
|
Physical
|
Entity Names
|
✓
|
✓
|
|
Entity Relationships
|
✓
|
✓
|
|
Attributes
|
✓
|
||
Primary Keys
|
✓
|
✓
|
|
Foreign Keys
|
✓
|
✓
|
|
Table Names
|
✓
|
||
Column Names
|
✓
|
||
Column Data Types
|
✓
|
Table 1 Data Modelling model types and their features
Methodologies and techniques utilised by
the data modelling process provide a: standard, consistent and predictable: aspect
with which to administer data as a viable resource within the enterprise. By
managing data as a resource we are able to ensure close-fitting synergies for
the integration of the enterprises information systems as a complete
all-inclusive entity. It is especially advantageous in the functional design
and development of databases and the data warehouse, the data repositories of
the enterprise IS.
System design
There is an enterprise requirement for the
organisational structure for storing data. Lists carry redundant data in
multiple themes or concepts. These lists are broken apart by theme and concept
to provide relational tables. Third normal form relational databases are more
complex requiring effort, and as such are hard work; not an attractive feature
for the masses, who demand things are easy and effortless.
SQL is an international standard language,
recognised and implemented by the biggest names in the software industry, the
trendy new noSQL brigade have not got a common standard, each of the more than 250
noSQL databases requires that you learn their own proprietary query language.
There is a great deal of form without
function. It seems today that the ethos and very fundamentals of design are
wrong.
Normalisation. When is de-normalisation appropriate?
From relation to tables and entities that have been derived by data modelling.
![]() |
| Figure 6 Transitioning from a data model to a database |
noSQL is vulnerable to modification
anomalies.
There comes a point when the computational
expense of multiple queries with complex joins means a denormalisation of the
data is necessary.
Denormalisation exposes problems that are
not in normal form databases, in the end it is all for nothing.
Costs of denormalisation
|
Benefits of denormalisation
|
·
Modification anomalies
·
Redundant data
·
More storage required
|
·
Simplicity
·
Query performance improved
|
Table 2 Cost benefit analysis of denormalisation
Data mining
For this section we will be recreating the
first chapter of the book Mining the Social Web, 2nd Edition
In chapter 1: Mining Twitter: Exploring
Trending Topics, Discovering What People Are Talking About, and More: we will
work through a number of exercises. These cover a number of distinct problems
when utilising the programming language Python to interface with and
interrogate the Twitter API.
![]() |
| Figure 7 Data Mining Algorithms |
The twitter exercise was tedious and
complicated, in addition to being completely uninteresting as a scientific
exercise. From the exercise we learned that the Python programming language is
the province of esoteric code writers: “anyone can hack a piece of code” has
been said, and that is what programming in Python is like. The language
construct is minimal by design, it lacks IDE support and interfaces, and it is
predominantly made up of third party libraries and is so incredibly whitespace
unfriendly.
The exercise takes us through the
authentication and creation of a Twitter object which is then integrated in the
iPython interface. First just a print to screen and then some formatted output
utilising some imported library functions. It starts to get interesting towards
the end of the exercises, where matplotlib is imported and some of the results
are graphed and charted. Python is a powerful programming language and some of
the libraries available make it data analytics system. It would be and has
tremendous value and applications at the Extraction and Transaction phases in
the preparation and implementation of the data warehouse.
The introduction of the Pandas library
really enhances the performance of Python, it makes the slicing and dicing of
data sets remarkably straight forwards. The problem is the subject: Twitter:
Understanding that a tweet is 140 characters and the data set for a tweet is
approximately 8K bytes, the amount of augmented data and meta data is enormous,
but this is where it apparently falls down is the analysis model. Yes we can
analyse what’s trending, what devices, where the tweets come from, how many
times it’s been retweeted and a plethora of other analytics: but what’s the
point? What is twitter all about? What contribution does it actually make? How
is it benefitting anything? Put simply it makes no valid scientific contribution
it is merely for the chattering non-cognisant. Celebrity status does not make
us wise!
Conclusion
Although it is all change, everything still
remains the same. It is a question of the big picture; the bigger the picture
the more often the same patterns will remerge. In the beginning of database
history there was no SQL, then SQL got invented, adopted and grew to dominance
in the database industry; now comes no SQL in a resurgent attempt to gain
market share, as it claims: a viable alternative to SQL.
It is the 80:20 rule. 80% of the database
market will remain SQL and the rest comprising of the remaining 20% will battle
for dominance among the market and data niches. As SQL scales into the petabyte
range, it will not be long before the database sizes supported by SQL database
systems reaches and exceeds the Exabyte range.
The information environment is going to
continue to evolve in complexity, challenging the systems and database
designers of now and tomorrow with difficult and multifaceted data sources,
operational systems and reporting mechanisms.
Bibliography
Alexios Delis, N. R. (1993).
Performance Comparison of Three Modern DBMS Architectures. IEEE
TRANSACTIONS ON SOFTWARE ENGINEERING, 19(2). Retrieved 07 31, 2014
Ali, A. (2013, 06 24). Comparing Data
Warehouse Design Methodologies for Microsoft SQL Server. Retrieved 08 07,
2014, from MSSQLTips.com: http://www.mssqltips.com/sqlservertip/2976/comparing-data-warehouse-design-methodologies-for-microsoft-sql-server/
Ambler, S. (2014, 07 03). Data
Modeling 101. Retrieved 07 03, 2014, from Agile Data:
http://www.agiledata.org/essays/dataModeling101.html
Babcock, C. (2010, 09 10). What's So
Great About NoSQL? InformationWeek(1279), 26-31. Retrieved 07 31,
2014, from
http://search.proquest.com.voyager.chester.ac.uk/docview/1416192825/abstract?accountid=14620
Cruz, X. (2012, June 4). The Database
Battle: MongoDB vs MySQL. Retrieved 08 30, 2014, from CloudTimes:
http://cloudtimes.org/2012/06/04/the-database-battle-mongodb-vs-mysql/
Delis, A., & Roussopoulos, N. (1991,
09). Modern Client-Server DBMS Architectures. SIGMOD RECORD, 52-61.
Retrieved 08 01, 2014
Evans, P. (2011, 01 11). Business
Intelligence - SQL or NoSQL - Which is better. Retrieved 07 28, 2014,
from Database Journal:
http://www.databasejournal.com/sqletc/article.php/3919371/Business-Intelligence---SQL-or-NoSQL---Which-is-better.htm
George, S. (2012, April 01). Inmon vs.
Kimball: Which approach is suitable for your data warehouse? Retrieved 08
07, 2014, from TechTarget.in:
http://searchbusinessintelligence.techtarget.in/tip/Inmon-vs-Kimball-Which-approach-is-suitable-for-your-data-warehouse
Haadi, M. (2010, October 118). The
Evolution of Database. Retrieved September 03, 2014, from All About
Databases: http://mhaadi.wordpress.com/2010/10/18/the-evolution-of-database/
Inmon, W. H. (2002). Building the Data
Warehouse (3rd ed.). (R. Elliott, Ed.) New York, USA: Robert Ipsen.
Retrieved September 16, 2014
Kajeepeta, S. (2012, 05 14). NoSQL
Everywhere? Nope. InformationWeek(1333), 31-32. Retrieved 07 31, 2014,
from
http://search.proquest.com.voyager.chester.ac.uk/docview/1021724923?pq-origsite=summon
Kavis Technology Consulting. (2010, 03
01). NoSQL vs. RDBMS: Apples and Oranges? Retrieved 07 28, 2014, from
Kavis Technology Consulting:
http://www.kavistechnology.com/blog/nosql-vs-rdbms-apples-and-oranges/
Loshin, D. (2013, 09 04). An
Introduction to NoSQL Data Management for Big Data. Retrieved 08 07,
2014, from DataInformed:
http://data-informed.com/introduction-nosql-data-management-big-data/
M.Thenmozhi, K. (2014, 06). AN
ONTOLOGICAL APPROACH TO HANDLE MULTIDIMENSIONAL SCHEMA EVOLUTION FOR DATA
WAREHOUSE. International Journal of Database Management Systems, 6(3),
33-52. doi:10.5121/ijdms.2014.6303
Microsoft Corporation. (2014, 07 03). Data
Warehouse Design Considerations. Retrieved 07 03, 2014, from SQL Server:
http://technet.microsoft.com/en-us/library/aa902672(v=sql.80).aspx
Stonebraker, M. (2010, 04). SQL Databases
v. NoSQL Databases. communications of the acm, 53(4).
doi:10.1145/1721654.1721659
Vasco Santos, R. S. (2014, April).
TOWARDS A LOW COST ETL SYSTEM. International Journal of Database
Management Systems ( IJDMS ), 6(2), 67-79. doi:10.5121/ijdms.2014.6205
Veronika Abramova, J. B. (2014, June).
EXPERIMENTAL EVALUATION OF NOSQL DATABASES. International Journal of
Database Management Systems, 6(3), 16. doi:10.5121/ijdms.2014.6301







