- Simplifies access to ERP data
- Reduces data replication
- Enables data sharing (eliminating point-to-point integration)
- Drives the data archiving strategy
- Rationalizes data sources
- Provides a central set of business rules
- Provides common reporting and analytics tools
- Offers a unified location to apply Data Governance and Security
The architecture flows data up from the source systems into the data lake (~1,000 tables). Data movement is a combination of direct file imports, batch ingestion with Sqoop, and incremental ingestion. Incremental ingestion was originally done with CDC and partition swaps, but was eventually replaced with Attunity Replicate to simplify maintenance and speed up processing.
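The high-watermark pattern behind incremental ingestion (which tools like Attunity Replicate automate) can be sketched roughly as follows. This is only an illustration: the `orders` table, its columns, and the in-memory sqlite3 database standing in for the source system are all made up.

```python
import sqlite3

# Hypothetical stand-in for a source ERP table; a real pipeline would read
# from the ERP database and land the rows in the data lake instead.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2020-01-01"), (2, "2020-01-02"), (3, "2020-01-03")])

def incremental_pull(conn, watermark):
    """Fetch only rows changed since the last successful load."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at", (watermark,)).fetchall()
    # Advance the watermark to the newest change seen this run.
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

rows, wm = incremental_pull(src, "2020-01-01")
print(len(rows), wm)  # → 2 2020-01-03 (only rows newer than the watermark)
```

Each run persists the returned watermark and passes it to the next run, so only changed rows move, which is what makes incremental loads cheaper than the full Sqoop batch pulls.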
Governance and security are applied at ingestion: sensitive data is either tokenized or not allowed into the cluster. The data is then processed through the ETL layer, which consists mostly of Hive processes. The logical model is the semantic layer provided by the reporting tools.
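Tokenizing sensitive fields before they land in the cluster can be approximated with keyed hashing. This is a minimal sketch only, not the production approach: real deployments typically use a vaulted tokenization service, and the key, field names, and sample record below are invented.

```python
import hmac
import hashlib

# Hypothetical key for illustration; a real key lives in a secrets manager.
SECRET_KEY = b"example-only-key"

def tokenize(value: str) -> str:
    """Replace a sensitive value with a deterministic, irreversible token.

    Deterministic so joins across tables still line up; keyed and one-way
    so the raw value never enters the cluster.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

record = {"customer_id": "C-1001", "ssn": "123-45-6789", "amount": 42.50}
safe = {k: (tokenize(v) if k == "ssn" else v) for k, v in record.items()}
print(safe["ssn"] != record["ssn"])            # → True: raw SSN never stored
print(tokenize("123-45-6789") == safe["ssn"])  # → True: deterministic for joins
```

The design trade-off is deterministic tokens (joinable, but vulnerable to dictionary attacks if the key leaks) versus random tokens with a lookup vault (safer, but requiring the vault for every join).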
There are several areas to improve:
- Data validations are executed with manual scripts after the fact
- Data governance is limited and needs review as more users are given access to the data
- There is no managed data catalog at this time
- There are very few curated and maintained data sets
- Self-service data preparation tools are being evaluated
- Capabilities are limited to a small number of use cases
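The manual after-the-fact validation scripts could be folded into the pipeline as automated post-load checks that fail the job on error. A minimal sketch of the idea, with made-up rules and plain Python dicts standing in for rows read back from a Hive table:

```python
# Each check returns (rule_name, passed, detail); a real pipeline would run
# these right after each load and abort downstream steps on any failure.
def check_row_count(rows, minimum):
    return ("row_count", len(rows) >= minimum, f"{len(rows)} rows")

def check_not_null(rows, column):
    nulls = sum(1 for r in rows if r.get(column) is None)
    return (f"not_null:{column}", nulls == 0, f"{nulls} nulls")

def run_checks(rows, checks):
    results = [check(rows) for check in checks]
    failures = [r for r in results if not r[1]]
    return results, failures

# Hypothetical post-load snapshot of a table.
rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
results, failures = run_checks(rows, [
    lambda r: check_row_count(r, minimum=1),
    lambda r: check_not_null(r, "amount"),
])
for name, passed, detail in results:
    print(name, "PASS" if passed else "FAIL", detail)
# → row_count PASS 2 rows
# → not_null:amount FAIL 1 nulls
```

Wiring checks like these into the load itself is also a natural first step toward the managed data catalog and curated data sets noted above, since each check doubles as machine-readable documentation of what a table is expected to contain.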