Sqoop Atlas Bridge

Sqoop Model

The default Sqoop modelling is available in org.apache.atlas.sqoop.model.SqoopDataModelGenerator. It defines the following types:

sqoop_operation_type(EnumType) - values [IMPORT, EXPORT, EVAL]
sqoop_dbstore_usage(EnumType) - values [TABLE, QUERY, PROCEDURE, OTHER]
sqoop_process(ClassType) - super types [Process] - attributes [name, operation, dbStore, hiveTable, commandlineOpts, startTime, endTime, userName]
sqoop_dbdatastore(ClassType) - super types [DataSet] - attributes [name, dbStoreType, storeUse, storeUri, source, description, ownerName]

The entities are created and de-duped using unique qualified name. They provide namespace and can be used for querying as well: sqoop_process - attribute name - sqoop-dbStoreType-storeUri-endTime sqoop_dbdatastore - attribute name - dbStoreType-connectorUrl-source

Sqoop Hook

Sqoop added a SqoopJobDataPublisher that publishes data to Atlas after completion of import Job. Today, only hiveImport is supported in sqoopHook. This is used to add entities in Atlas using the model defined in org.apache.atlas.sqoop.model.SqoopDataModelGenerator. Follow these instructions in your sqoop set-up to add sqoop hook for Atlas in <sqoop-conf>/sqoop-site.xml:

  • Sqoop Job publisher class. Currently only one publishing class is supported
sqoop.job.data.publish.class org.apache.atlas.sqoop.hook.SqoopHook
  • Atlas cluster name
atlas.cluster.name
  • Copy <atlas-conf>/atlas-application.properties to to the sqoop conf directory <sqoop-conf>/
  • Link <atlas-home>/hook/sqoop/*.jar in sqoop lib

Refer Configuration for notification related configurations

Limitations

  • Only the following sqoop operations are captured by sqoop hook currently - hiveImport