Highlights from the 2018 Gartner Data Analytics Summit — Sydney
Data is at the core of modern technology. A successful data program means access to intelligence-driven cyber security, data-governance-based risk management, business-relevant insights, resilient operations, just-in-time supply chain management… the list goes on.
Here are the main ideas I took back to work after two days of presentations and meeting people from industry at the event.
Now is the time to do data
Data-driven business isn’t a new concept; it’s been around for at least 10 years. But the core enablers of data-driven anything have had varying levels of capability over the years. In 2018, though, they all seem to have levelled up.
- Cloud infrastructure — provision and run your data workloads at scale in a matter of minutes for a fraction of the cost. Welcome to 2018! Racking and stacking is now optional (and frowned upon)
- Accessible machine learning and AI technology — running off-the-shelf machine learning models for basic analytics, backed by algorithm research and open-source packages, has become pretty standard practice. The quality of analytics has increased substantially, and the technical expertise required to consume it has dropped just as much. Thank you, Open Source.
- Availability of data — the ‘log everything’ mentality has finally, sort of, caught up with the world. We’re now mostly logging things and putting them into some mutated version of the data lake we set out to build. Nevertheless, we have data. It may not be perfectly engineered data, but it’s good enough to squeeze insights out of.
A culture of data enablement
A culture of data enablement is one where people are empowered by actionable insights from enterprise data to drive better business outcomes. Analytics-based decision making, effective storytelling using relevant metrics, and distributed responsibility for generating and consuming the right data are some indicators that the culture is in place.
Fostering this kind of culture includes looking at a self-service model for enterprise data analytics, delivered via a centralised, documented and supported platform for those who need it. Anyone who needs the data is allowed to use it. Governance and data access control sit in the background to ensure information security, but the self-service nature of the model is critical to creating the culture.
Platform and technology
Oh, not this can of worms again…
Consider the following
To have an effective data platform, the following technology considerations are absolutely essential and non-negotiable in 2018.
- Elastic and scalable
Thinking about servers and racks is worth no one’s time, unless you work at AWS, Google Cloud or Azure. Honestly, get with the program. The platform must grow and shrink based on our usage of it. Think of it as fabrics of storage, compute and networking, not actual devices. The sooner this happens, the closer you’ll be to your data dream.
- Consume, don’t manage
Stressed twice for importance. Don’t spend time doing things a python script can do for you.
- Minutes, not months
Test analytics, visualisations, new data sources and machine learning models in minutes. Get feedback, fine-tune, iterate and improve by the end of the day. No one should ever need to wait days, weeks or months to see results; that’s not how creativity and innovation work. By the time your enterprise change management has gone through its process, provisioned disk space and come back to the data scientist with results, they’ve already left the building and moved on to their next job.
- Secure, without restricting
Goes without saying. Security is the management of business risk. When people can’t do their jobs because of ‘security’, there may not be a business left to secure. Assess and manage risk, and get it signed off. Be responsible, without stopping people from doing their jobs.
Also decide on the services
- Data Lake or Warehouse?
The data lake concept is an iteration on the data warehouse one.
A data warehouse is a traditionally relational database (NoSQL implementations are now common too) used to store large amounts of historical, structured and pre-formatted data to run analytics on for business insight. Warehouses were known to be slow to ingest, and would service about 60% of common business use cases. Adding new use cases requires time and effort, so they don’t come with analytical agility.
Data lakes are simply large stores of unstructured raw datasets that are parsed, formatted and joined at query time. This means analytics are decoupled from the ingestion process. Anyone with a use case can query the data in any shape or form they want. This concept lays the groundwork for the ‘self-service’ model in a data-enabled workforce.
Research the different technology architectures and build an in-depth understanding of your strategic business needs before picking one.
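The schema-on-read idea behind the lake can be shown in a few lines. This is a minimal sketch with made-up event data: the raw JSON lines sit untouched, and structure is imposed only when a query runs:

```python
import json

# Raw, unparsed event lines, as they might land in a lake (illustrative data).
RAW_EVENTS = """\
{"user": "alice", "action": "login", "ms": 120}
{"user": "bob", "action": "login", "ms": 340}
{"user": "alice", "action": "purchase", "amount": 9.99}
"""

def query(raw: str, action: str) -> list[dict]:
    """Parse and filter raw event lines at query time (schema-on-read)."""
    rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
    return [r for r in rows if r.get("action") == action]
```

A warehouse would have forced those events into a fixed schema on the way in; here each consumer decides what shape the data takes, per query.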
- Effective metadata discovery & management
From previous failures, we as an industry now realise the importance of knowing what we’re throwing into data lakes. Proper metadata management and cataloguing around data sources, raw/calculated fields and correlation points across data sources is critical to properly utilising a data lake. Doing this one relatively simple thing will save your platform from imminent doom. Do not skip this step.
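A catalog doesn’t have to be fancy to be useful. Here’s one possible minimal shape for it (field names and sources are invented for illustration): each dataset gets an entry recording its owner, fields and the keys it can be joined on, so correlation points are discoverable instead of tribal knowledge:

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    """A catalog entry describing one dataset in the lake."""
    name: str
    owner: str
    description: str
    fields: dict[str, str]                               # field name -> type/meaning
    join_keys: list[str] = field(default_factory=list)   # correlation points

CATALOG: dict[str, DataSource] = {}

def register(source: DataSource) -> None:
    CATALOG[source.name] = source

def discover(join_key: str) -> list[str]:
    """Find every registered source that can be correlated on a given key."""
    return [s.name for s in CATALOG.values() if join_key in s.join_keys]
```

In practice you’d want a proper catalog tool, but even this much answers “what do we have, who owns it, and what joins to what”.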
- Providing good data interfaces and pipelines
Give your data consumers easy-to-access, authenticated and access-controlled APIs to query and extract data from the platform. Also consider building mechanisms (or pipelines) for data flow/forwarding with some orchestration and transformation capabilities. This lets your consumers integrate data sources into other systems efficiently, without the temptation to duplicate data sources. It’s important to make the interfaces accessible and easy to use.
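Stripped of transport details, an authenticated, access-controlled read path can be sketched like this. Tokens, dataset names and records below are all hypothetical; the point is only that authorisation happens in the interface, not in each consumer:

```python
# token -> set of datasets that token may read (illustrative values)
TOKENS = {"token-analytics": {"sales"}, "token-ops": {"sales", "logs"}}

DATASETS = {
    "sales": [{"region": "APAC", "total": 1200}],
    "logs": [{"host": "web-1", "status": 500}],
}

def query_api(token: str, dataset: str) -> list[dict]:
    """Authenticated, access-controlled read path into the platform."""
    allowed = TOKENS.get(token)
    if allowed is None:
        raise PermissionError("unknown token")
    if dataset not in allowed:
        raise PermissionError(f"token not authorised for {dataset}")
    return DATASETS[dataset]
```

A real implementation would sit behind HTTPS with proper identity management, but the shape is the same: authenticate, authorise, then serve.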
- Engineering, resilience and monitoring
Data quality is everything. ‘Garbage in, garbage out’ is more relevant than ever, because business decisions, and sometimes people’s livelihoods, depend on analytics. Good data engineering, cleaning and validation, with optional enrichment of sources, is essential when taking on board a new source. If this isn’t part of your process, don’t expect any returns from your expensive new platform.
Once your sources are configured, set up automated monitoring and alerting to ensure your ingestion pipelines aren’t broken. Perform sanity checks on the data content, and check event volumes in the lake against the source systems. Monitor and remediate broken pipelines, and push out an outage notification to the consumers of the data so the analytics can be adjusted.
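The volume check in particular is cheap to automate. One possible shape, with an assumed 5% drift tolerance: compare per-feed event counts reported by the source systems against what actually landed in the lake, and raise an alert for anything outside tolerance:

```python
def check_volumes(source_counts: dict[str, int],
                  lake_counts: dict[str, int],
                  tolerance: float = 0.05) -> list[str]:
    """Return alert messages for feeds whose lake volume drifts from the source."""
    alerts = []
    for feed, expected in source_counts.items():
        if expected == 0:
            continue  # nothing emitted upstream; no baseline to compare
        got = lake_counts.get(feed, 0)
        drift = abs(expected - got) / expected
        if drift > tolerance:
            alerts.append(f"{feed}: expected ~{expected}, got {got} ({drift:.0%} drift)")
    return alerts
```

Wire the returned messages into whatever alerting channel the consumers already watch, so a broken pipeline becomes an outage notification rather than a silently wrong dashboard.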
There’s a strong appetite for effective data platforms. There’s no one-size-fits-all, so it’s a hard product to buy off the shelf. The question is how far we are from the end goal, and from realising the true potential of the data-driven world.