Data Processing and Analytics Jobs

The following topics describe job attributes that work with data processing platforms and services:

AWS Athena Job

AWS Athena enables you to process, analyze, and store your data in the cloud.

The following table describes the AWS Athena job attributes.

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to AWS Athena.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.
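
The naming rules above can be checked before a job is defined. The following is a minimal sketch in Python; the function name is hypothetical and is not part of Control-M, and it assumes that "blank spaces" means any whitespace character.

```python
def is_valid_connection_profile_name(name: str) -> bool:
    """Return True if name is 1-30 characters long and contains no whitespace."""
    if not 1 <= len(name) <= 30:
        return False
    # Assumption: "blank spaces" is read as any whitespace character.
    return not any(ch.isspace() for ch in name)


print(is_valid_connection_profile_name("ATHENA_PROD"))  # True
print(is_valid_connection_profile_name("athena prod"))  # False
```

Because names are case sensitive, ATHENA_PROD and athena_prod would refer to two different connection profiles.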

Athena Client Request Token

Defines a unique ID (idempotency token), which guarantees that the job executes only once.

Default: aws-athena-client-request-token-%%ORDERID-%%TIME

DB Catalog Name

Defines the name of the group of databases (catalog) that the query references.

Database Name

Defines the name of the database that the query references.

Action

Determines which of the following queries executes:

  • Query: Executes the query that you enter in the Query attribute.

  • Run Prepared Query: Executes a predefined query that is stored in the AWS Athena platform.

  • Query and Create Table: Executes the query that you enter in the Query attribute and saves the results to a new table.

  • Unload: Executes the query that you enter in the Query attribute and saves the results to a file in an Amazon S3 bucket.

Query

Defines the SQL-based query that executes.

Prepared Query Name

Defines the name of the predefined query that is stored in the AWS Athena platform.

Table Name

Defines the name of the table that is created, which is populated by the results of a query in AWS Athena.

Unload File Type

Determines the file format that the query results are saved in, as follows:

  • JSON

  • CSV

  • ORC

  • Parquet

  • Avro

  • Text File

Output Location

Defines the AWS S3 bucket path where the file is saved.

Format: s3://<path>

AWS Athena automatically generates a filename that incorporates the Query Execution ID, which is a unique ID applied to each query that is executed.
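
An Output Location value can be checked against the s3://<path> format before it is used. This is a sketch only; the helper name and the bucket name in the example are hypothetical.

```python
from urllib.parse import urlparse


def is_s3_path(location: str) -> bool:
    """Return True if location has the form s3://<bucket>[/<prefix>]."""
    parsed = urlparse(location)
    # netloc holds the bucket name, which must not be empty.
    return parsed.scheme == "s3" and bool(parsed.netloc)


print(is_s3_path("s3://my-bucket/athena/results/"))  # True
print(is_s3_path("my-bucket/athena/results/"))       # False
```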

Workgroup

Defines the workgroup for this job.

Workgroups can consist of users, teams, applications, or workloads, and they can set limits on the data that each query or group processes.

Add Configurations

Determines whether to add additional job definitions.

  • Yes

  • No

S3 ACL Option

Defines the Amazon S3 canned access control list (ACL), which is a predefined set of grantees and permissions assigned to your stored query results.

BUCKET_OWNER_FULL_CONTROL is the only canned ACL that is currently supported in AWS Athena. This setting gives you and the bucket owner full control of the query results.

Encryption Options

Determines one of the following ways to encrypt the query results:

  • SSE_S3: Encrypts the data in Amazon S3 with Server-Side Encryption (SSE) and Amazon S3-managed encryption keys.

  • SSE_KMS: Encrypts the data in Amazon S3 with SSE and the AWS Key Management Service (KMS), which enables you to manage the encryption keys.

  • CSE_KMS: Encrypts the data with Client-Side Encryption (CSE), using a key that is managed in AWS KMS, before the data is stored in Amazon S3.

KMS Key

(SSE_KMS and CSE_KMS Only) Defines the Amazon Resource Name (ARN) of the KMS key.

An ARN is a standardized AWS resource address.

arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst
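
An ARN consists of six colon-separated fields: the literal arn, the partition, the service, the region, the account ID, and the resource. The following sketch splits the example above into those fields; the function name is hypothetical.

```python
def parse_arn(arn: str) -> dict:
    """Split an ARN of the form arn:partition:service:region:account-id:resource."""
    # Split at most 5 times so that colons inside the resource part are kept.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    keys = ("prefix", "partition", "service", "region", "account_id", "resource")
    return dict(zip(keys, parts))


fields = parse_arn(
    "arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst"
)
print(fields["service"], fields["region"])  # kms us-west-2
```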

Bucket Owner

Defines the AWS account ID of the Amazon S3 bucket owner.

Show JSON Output

Determines whether to show the full JSON API response in the job output.

Status Polling Frequency

Determines the number of seconds to wait before checking the status of the job.

Default: 10

Tolerance

Determines the number of times to check the job status before ending Not OK.

Default: 2
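
The way Status Polling Frequency and Tolerance work together can be pictured as a polling loop. The following sketch is illustrative only and is not the Control-M implementation. It assumes that Tolerance counts consecutive failed status checks, and get_status is a hypothetical callable that returns None when a status check fails.

```python
import time
from typing import Callable, Optional


def wait_for_job(
    get_status: Callable[[], Optional[str]],
    polling_frequency: float = 10,
    tolerance: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until the job reaches a final state.

    get_status returns "RUNNING", "SUCCEEDED", or "FAILED",
    or None when the status check itself fails (assumption).
    """
    failed_checks = 0
    while True:
        sleep(polling_frequency)      # wait before each status check
        status = get_status()
        if status is None:
            failed_checks += 1
            if failed_checks >= tolerance:
                return "Not OK"       # tolerance exhausted
            continue
        failed_checks = 0             # a successful check resets the count
        if status == "SUCCEEDED":
            return "OK"
        if status == "FAILED":
            return "Not OK"


# Simulated run: one failed check is tolerated, then the job succeeds.
responses = iter(["RUNNING", None, "RUNNING", "SUCCEEDED"])
print(wait_for_job(lambda: next(responses), sleep=lambda seconds: None))  # OK
```

The sleep parameter is injected so that the loop can be exercised without real waiting.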

AWS Data Pipeline Job

AWS Data Pipeline is a cloud-based extract, transform, load (ETL) service that enables you to automate the transfer, processing, and storage of your data.

The following table describes the AWS Data Pipeline job attributes.

Attribute

Action

Description

Connection Profile

N/A

Determines the authorization credentials that are used to connect Control-M to AWS Data Pipeline.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

Action

N/A

Determines one of the following AWS Data Pipeline actions:

  • Trigger Pipeline: Executes an existing AWS Data Pipeline.

  • Create Pipeline: Creates a new AWS Data Pipeline.

Pipeline Name

Create Pipeline

Defines the name of the new AWS Data Pipeline.

Pipeline Unique ID

Create Pipeline

Defines the unique ID (idempotency key) that guarantees the pipeline is created only once. After successful execution, this ID cannot be used again.

Valid Characters: Any alphanumeric characters.

Parameters

Create Pipeline

Defines the parameter objects, in JSON format, which define the variables for your AWS Data Pipeline.

For more information about the available parameter objects, see the descriptions of the PutPipelineDefinition and GetPipelineDefinition actions in the AWS Data Pipeline API Reference.

"parameterObjects": [ {
   "attributes": [
      {
         "key":"description",
         "stringValue":"S3outputfolder"
      }
    ],
   "id": "myS3OutputLoc"
}
],
"parameterValues": [ {
   "id":"myShellCmd",
   "stringValue":"grep -rc \"GET\" ${IN_DIR}/* > ${OUT_DIR}/out.txt"
}
],
"pipelineObjects": [ {
   "fields": [
      {
         "key":"input",
         "refValue":"S3InputLocation"
      },
      {
         "key":"stage",
         "stringValue":"true"
      }
   ],
   "id":"ShellCommandActivityObj",
   "name":"ShellCommandActivityObj"
}
]
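
The example above is a list of JSON object members without enclosing braces. One way to check such a fragment before you paste it into the Parameters attribute is to wrap it in braces and parse it. This is a sketch only; the function name and the shortened sample values are hypothetical.

```python
import json


def parse_pipeline_parameters(fragment: str) -> dict:
    """Parse a fragment of JSON object members, such as '"parameterObjects": [...]'."""
    # The fragment has no enclosing braces, so add them before parsing.
    return json.loads("{" + fragment + "}")


fragment = '''
"parameterObjects": [ {
   "attributes": [ { "key": "description", "stringValue": "S3outputfolder" } ],
   "id": "myS3OutputLoc"
} ],
"parameterValues": [ { "id": "myShellCmd", "stringValue": "echo hello" } ]
'''
definition = parse_pipeline_parameters(fragment)
print(sorted(definition))  # ['parameterObjects', 'parameterValues']
```

A json.JSONDecodeError from this check usually points to a missing comma or an unbalanced bracket.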

Trigger Created Pipeline

Create Pipeline

Determines whether to execute, or trigger, the newly created AWS Data Pipeline.

Pipeline ID

Trigger Pipeline

Determines which pipeline to execute, or trigger.

Status Polling Frequency

All actions

Determines the number of seconds to wait before checking the status of the Data Pipeline job.

Default: 20

Failure Tolerance

All actions

Determines the number of times to check the job status before ending Not OK.

Default: 2

AWS EMR Job

AWS EMR is a managed cluster platform that enables you to execute big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.

The following table describes AWS EMR job attributes.

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to AWS EMR.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

Cluster ID

Defines the name of the AWS EMR cluster to connect to the Notebook.

In the EMR API, this field is called the Execution Engine ID.

Notebook ID

Determines the ID of the Notebook that executes the script.

In the EMR API, this field is called the Editor ID.

Relative Path

Defines the full path and name of the script file in the Notebook.

Notebook Execution Name

Defines the job execution name.

Service Role

Defines the service role to connect to the Notebook.

Use Advanced JSON Format

Enables you to provide Notebook execution information through JSON code.

The JSON Body parameter replaces the values of the following parameters: Cluster ID, Notebook ID, Relative Path, Notebook Execution Name, and Service Role.

JSON Body

Defines Notebook execution settings, in JSON format. For a description of the syntax of this JSON, see the description of StartNotebookExecution in the Amazon EMR API Reference.

{
"EditorId": "e-DJJ0HFJKU71I9DWX8GJAOH734",
"RelativePath": "ShowWaitingAndRunningClustersTest2.ipynb",
"NotebookExecutionName":"Tests",
"ExecutionEngine": {
   "Id": "j-AR2G6DPQSGUB"
},
"ServiceRole": "EMR_Notebooks_DefaultRole"
}
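
The JSON Body carries the same information as the individual attributes, under the EMR API field names noted above (Cluster ID is the Execution Engine ID and Notebook ID is the Editor ID). The following sketch builds the body from those values; the function name is hypothetical, and the sample values are taken from the example above.

```python
import json


def build_notebook_execution_body(
    cluster_id: str,
    notebook_id: str,
    relative_path: str,
    execution_name: str,
    service_role: str,
) -> str:
    """Map the job attributes to the field names used in the JSON Body."""
    body = {
        "EditorId": notebook_id,                 # Notebook ID
        "RelativePath": relative_path,           # Relative Path
        "NotebookExecutionName": execution_name, # Notebook Execution Name
        "ExecutionEngine": {"Id": cluster_id},   # Cluster ID
        "ServiceRole": service_role,             # Service Role
    }
    return json.dumps(body, indent=2)


print(
    build_notebook_execution_body(
        cluster_id="j-AR2G6DPQSGUB",
        notebook_id="e-DJJ0HFJKU71I9DWX8GJAOH734",
        relative_path="ShowWaitingAndRunningClustersTest2.ipynb",
        execution_name="Tests",
        service_role="EMR_Notebooks_DefaultRole",
    )
)
```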

Azure Databricks Job

Azure Databricks is a cloud-based data analytics platform that enables you to process and analyze large workloads of data.

The following table describes the Azure Databricks job type attributes.

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to Azure Databricks.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

  • Variable Name: %%AZURE-ACCOUNT

Databricks Job ID

Determines the ID of the Azure Databricks job that is created in a Databricks workspace.

Parameters

Defines task parameters to override when the job executes, according to the Databricks convention. Your list of parameters must begin with the name of the parameter type.

"notebook_params":{"param1":"val1", "param2":"val2"}
"jar_params": ["param1", "param2"]

For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided in the Azure Databricks documentation.
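
A Parameters value in this form can be generated rather than typed by hand. The following sketch formats a parameter type name followed by its values; the function name is hypothetical, and it does not check that the type name is one that Databricks accepts.

```python
import json


def format_databricks_parameters(param_type: str, values) -> str:
    """Return '"<param_type>": <values as JSON>' for the Parameters attribute."""
    return f"{json.dumps(param_type)}: {json.dumps(values)}"


print(format_databricks_parameters("notebook_params", {"param1": "val1", "param2": "val2"}))
# "notebook_params": {"param1": "val1", "param2": "val2"}
print(format_databricks_parameters("jar_params", ["param1", "param2"]))
# "jar_params": ["param1", "param2"]
```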

For no parameters, type the following code:

"params": {}

Idempotency Token

(Optional) Defines a token to use to re-execute job executions that timed out in Databricks.

Values:

  • Control-M-Idem_%%ORDERID: With this token, upon re-execution, Control-M invokes the monitoring of the existing job execution in Databricks. Default.

  • Any other value: Replaces the Control-M idempotency token. When you re-execute a job using a different token, Databricks creates a new job execution with a new unique run ID.

Status Polling Frequency

(Optional) Determines the number of seconds to wait before checking the status of the job.

Default: 30

Azure HDInsight Job

Azure HDInsight enables you to execute an Apache Spark batch job and perform big data analytics.

The following table describes Azure HDInsight job parameters:

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to Azure HDInsight.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

Parameters

Determines which parameters are passed to the Apache Spark Application during job execution, in JSON format (name:value pairs).

This JSON must include the file and className elements.
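
The requirement for the file and className elements can be verified before the job is submitted. This is a minimal sketch; the function name, the file path, and the class name are hypothetical.

```python
import json

REQUIRED_KEYS = ("file", "className")


def missing_spark_batch_keys(parameters_json: str) -> list:
    """Return the required elements that are absent from the Parameters JSON."""
    parameters = json.loads(parameters_json)
    return [key for key in REQUIRED_KEYS if key not in parameters]


# Hypothetical values, for illustration only.
print(missing_spark_batch_keys('{"file": "/example/jars/app.jar", "className": "com.example.App"}'))  # []
print(missing_spark_batch_keys('{"file": "/example/jars/app.jar"}'))  # ['className']
```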

Status Polling Interval

Determines the number of seconds to wait before the Apache Spark batch job is verified.

Default: 10 seconds

Bring job logs to output

Determines whether logs from Apache Spark appear in the job output.

Azure Synapse Job

Azure Synapse Analytics enables you to perform data integration and big data analytics.

The following table describes Azure Synapse job parameters:

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to Azure Synapse.

Pipeline Name

Defines the name of a pipeline that you defined in your Azure Synapse workspace.

Parameters

Defines pipeline parameters, in JSON format, to override when the job executes.

 {"param1":"value1", "param2":"value2"}

For no parameters, type {}.

Status Polling Interval

(Optional) Defines the number of seconds to wait before checking the status of the job.

Default: 20 seconds

Databricks Job

Databricks enables you to integrate jobs created in the Databricks environment with your existing Control-M workflows.

The following table describes the Databricks job type attributes:

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to Databricks.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

Databricks Job ID

Determines the ID of the Databricks job that is created in a Databricks workspace.

Parameters

Defines task parameters, in JSON format, to override when the job executes, according to the Databricks convention. Your list of parameters must begin with the name of the parameter type.

"notebook_params":{"param1":"val1", "param2":"val2"}
"jar_params": ["param1", "param2"]

For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided through the Azure Databricks documentation.

For no parameters, type the following code:

"params": {}

Idempotency Token

(Optional) Defines a token to use to re-execute job executions that timed out in Databricks.

Values:

  • Control-M-Idem_%%ORDERID: With this token, upon re-execution, Control-M invokes the monitoring of the existing job execution in Databricks. Default.

  • Any other value: Replaces the Control-M idempotency token. When you re-execute a job using a different token, Databricks creates a new job execution with a new unique run ID.
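
The effect of reusing or changing the token can be illustrated with a small in-memory model. This is not the Databricks API; FakeRunService is a hypothetical stand-in that only mimics the behavior described above, and the token values are invented for the example.

```python
import itertools


class FakeRunService:
    """In-memory stand-in for a service that honors idempotency tokens."""

    def __init__(self):
        self._runs_by_token = {}
        self._next_run_id = itertools.count(1)

    def run_now(self, job_id: int, idempotency_token: str) -> int:
        # A known token returns the existing run instead of starting a new one.
        if idempotency_token not in self._runs_by_token:
            self._runs_by_token[idempotency_token] = next(self._next_run_id)
        return self._runs_by_token[idempotency_token]


service = FakeRunService()
first = service.run_now(42, "Control-M-Idem_0001a")
retry = service.run_now(42, "Control-M-Idem_0001a")  # same token: same run
other = service.run_now(42, "custom-token")          # new token: new run
print(first == retry, first == other)  # True False
```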

Status Polling Frequency

(Optional) Determines the number of seconds to wait before checking the status of the job.

Default: 30

DBT Job

DBT (Data Build Tool) is a cloud-based computing platform that enables you to develop, test, schedule, document, and analyze data models.

The following table describes the DBT job type attributes.

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to DBT.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

DBT Job ID

Defines the ID of the preexisting job in the DBT platform that you want to execute.

Run Comment

Defines a free-text description of the job.

Override Job Commands

Determines whether to override the predefined DBT job commands.

Define Commands

Defines the new DBT job commands.

dbt test

dbt run

Status Polling Frequency

Determines the number of seconds to wait before checking the status of the job.

Default: 10

Failure Tolerance

Determines the number of times to check the job status before ending Not OK.

Default: 2

GCP BigQuery Job

Google Cloud Platform (GCP) BigQuery is a cloud-computing platform that enables you to process, analyze, and store your data.

The following table describes the GCP BigQuery job type attributes.

Attribute

Action

Description

Connection Profile

N/A

Determines the authorization credentials that are used to connect Control-M to GCP BigQuery.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

Project Name

All actions

Determines the project that the job uses.

Dataset Name

  • Query

  • Extract

  • Routine

Determines the database that the job uses.

Action

N/A

Determines one of the following GCP BigQuery actions to perform:

  • Query: Executes one or more SQL statements that are supported by GCP BigQuery.

  • Copy: Creates a copy of an existing table.

  • Load: Loads source data into an existing table.

  • Extract: Exports data from an existing table into Google Cloud Storage.

  • Routine: Executes a stored procedure, table function, or previously defined function.

Run Select Query and Copy to Table

Query

(Optional) Determines whether to copy the results of a SELECT statement into a new table.

Table Name

  • Query

  • Extract

Defines the new table name.

SQL Statement

Query

Defines one or more SQL statements supported by GCP BigQuery.

Rule: It must be written in a single line, with character strings separated by one space only.
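
A statement that is written across several lines can be collapsed to meet this rule. The following sketch is one simple way to do it; the function name and the sample query are hypothetical. Note that it also collapses whitespace inside quoted string literals, so it is not suitable for statements where such whitespace matters.

```python
def to_single_line(sql: str) -> str:
    """Collapse all runs of whitespace, including newlines, to single spaces."""
    return " ".join(sql.split())


sql = """
SELECT name,
       team
FROM   dataset.people
WHERE  team = @IFteam
"""
print(to_single_line(sql))
# SELECT name, team FROM dataset.people WHERE team = @IFteam
```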

Query Parameters

Query

Defines the query parameters, in JSON format, which enable you to control the presentation of the data.

{
    "name": "IFteam",
    "parameterType": {
        "type": "STRING"
    },
    "parameterValue": {
        "value": "BMC"
    }
}

Copy Operation Type

Copy

Determines one of the following copy operations:

  • Clone: Creates a copy of a base table that has write access.

  • Snapshot: Creates a read-only copy of a base table.

  • Copy: Creates a copy of a snapshot.

  • Restore: Creates a writable table from a snapshot.

Source Table Properties

Copy

Defines the properties of the table, in JSON format, that is cloned, backed up, or copied.

You can copy or back up one or more tables at a time.

[
    {
        "datasetID": "Test1",
        "projectID": "SomeProj1",
        "tableID": "IFteam1"
    },
    {
        "datasetID": "Test2",
        "projectID": "SomeProj2",
        "tableID": "IFteam2"
    }
]

Destination Table Properties

  • Copy

  • Load

Defines the properties of a new table, in JSON format.

{
  "datasetID": "Test3",
  "projectID": "SomeProj3",
  "tableID": "IFteam3"
}

Destination/Source Bucket URIs

  • Load

  • Extract

Defines the source or destination data URI for the table that you are loading or extracting.

You can load or extract multiple tables.

You must use commas to distinguish elements from each other.

"gs://source1_site1/source1.json"

Show Load Options

Load

Determines whether to add more fields to a table that you are loading.

Load Options

Load

Defines additional fields, in JSON format, for the table that you are loading.

"schema": {
    "fields": [
        {
            "name": "name1",
            "type": "STRING1"
        },
        {
            "name": "name2",
            "type": "STRING2"
        },
        {
            "name": "name3",
            "type": "STRING3"
        }
    ]
}

Extract As

Extract

Determines one of the following file formats to export the data to:

  • CSV

  • JSON

Routine

Routine

Defines a routine and the values that it must execute.

Call new_r('value1')

Job Timeout

All actions

Determines the maximum number of milliseconds to execute the GCP BigQuery job.

Default: 30000 milliseconds (30 seconds)

Connection Timeout

All actions

Determines the number of seconds to wait before the job ends Not OK.

Default: 10

Status Polling Frequency

All actions

Determines the number of seconds to wait before checking the status of the job.

Default: 5

GCP Dataflow Job

Google Cloud Platform (GCP) Dataflow enables you to perform cloud-based data processing for batch and real-time data streaming applications.

The following table describes the GCP Dataflow job type attributes.

Parameter

Description

Connection profile

Determines the authorization credentials that are used to connect Control-M to GCP Dataflow.

Project ID

Defines the project ID for your Google Cloud project.

Location

Defines the Google Compute Engine region to create the job.

Template Type

Defines one of the following types of GCP Dataflow templates:

  • Classic Template: Developers execute the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to job request), and saves the template file in Cloud Storage.

  • Flex Template: Developers package the pipeline into a Docker image and then use the Google Cloud CLI to build and save the Flex Template spec file in Cloud Storage.

Template Location (gs://)

Defines the path for temporary files. This must be a valid Google Cloud Storage URL that begins with gs://.

The pipeline option tempLocation is used as the default value, if it has been set.

Parameters (JSON Format)

Defines input parameters, in JSON format, to be passed on to job execution.

You must include the jobName and parameters elements.

{
    "jobName": "wordcount",
    "parameters": {
        "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
        "output": "gs://controlmbucket/counts"
    }
}

Verification Poll Interval (in seconds)

(Optional) Defines the number of seconds to wait before checking the status of the job.

Default: 10

Log Level

Determines one of the following levels of details to retrieve from the GCP logs in the case of job failure:

  • TRACE

  • DEBUG

  • INFO

  • WARN

  • ERROR

GCP Dataproc Job

Google Cloud Platform (GCP) Dataproc enables you to perform cloud-based big data processing and machine learning.

The following table describes the GCP Dataproc job type attributes.

Parameter

Description

Connection profile

Determines the authorization credentials that are used to connect Control-M to GCP Dataproc.

Project ID

Defines the project ID for your Google Cloud project.

Account Region

Defines the Google Compute Engine region to create the job.

Dataproc task type

Defines one of the following Dataproc task types to execute:

  • Workflow Template: Reusable workflow configuration that defines a graph of jobs with information on where to execute those jobs.

  • Job: A single Dataproc job.

Workflow Template

(For a Workflow Template task type) Defines the ID of a Workflow Template.

Parameters (JSON Format)

(For a Job task type) Defines input parameters to be passed on to job execution, in JSON format.

You retrieve this JSON content from the GCP Dataproc UI, using the EQUIVALENT REST option in job settings.

Verification Poll Interval (in seconds)

(Optional) Defines the number of seconds to wait before checking the status of the job.

Default: 20

Tolerance

Determines the number of times to check the job status before ending Not OK.

Default: 2

Hadoop Job

The Hadoop job connects to the Hadoop framework, which enables you to split up and process large data sets on clusters of commodity servers. You can expand your enterprise business workflows to include tasks that execute in your Big Data Hadoop cluster in Control-M with the different Hadoop-supported tools, including Pig, Hive, HDFS File Watcher, Map Reduce Jobs, and Sqoop.

The following table describes the Hadoop job type attributes.

Attribute

Description

Connection Profile

Determines the authorization credentials that are used to connect Control-M to Hadoop.

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

Variable Name: %%HDP-ACCOUNT

Execution Type

Determines the execution type for the Hadoop job. The attributes for each execution type are described in the sections that follow.

Variable Name: %%HDP-EXEC_TYPE

Pre Commands

Defines the Pre commands performed before job execution (not for HDFS Commands jobs and Oozie Extractor jobs), and the argument for each command.

Fail the job if the command fails

Determines whether the entire job fails if any of the Pre commands fail (not for HDFS Commands jobs and Oozie Extractor jobs).

Post Commands

Defines the Post commands performed after job execution (not for HDFS Commands jobs and Oozie Extractor jobs), and the argument for each command.

Fail the job if the command fails

Determines whether the entire job fails if any of the Post commands fail (not for HDFS Commands jobs and Oozie Extractor jobs).

DistCp Job Attributes

The following table describes the DistCp job attributes.

Attribute

Description

Target Path

Defines the absolute destination path.

Variable Name: %%HDP-DISTCP_TARGET_PATH

Source Path

Defines the source paths.

Variable Name: %%HDP-DISTCP_SOURCE_PATH-Nxxx_ARG

Command Line Options

Defines the sets of attributes and values that are added to the command line.

Variable Names:

  • Name: %%HDP-DISTCP_OPTION-Nxxx-NAME

  • Value: %%HDP-DISTCP_OPTION-Nxxx-VAL

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.

Distributed Shell Job Attributes

The following table describes the Distributed Shell job attributes.

Attribute

Description

Shell Type

Determines what the Distributed Shell job executes, as follows:

  • Command: Executes a shell command entry as defined by Command.

  • Script File: Executes a script file as defined by Command, Script Full Path, and Shell Script Arguments.

Variable Name: %%HDP-SHELL_TYPE

Command

Defines the shell command entry to execute for the job execution.

Variable Name: %%HDP-SHELL_COMMAND

Script Full Path

Defines the full path to the script file which is executed. The script file is located in the HDFS.

Variable Name: %%HDP-SHELL_SCRIPT_FULL_PATH

Shell Script Arguments

Defines the shell script arguments.

Variable Name: %%HDP-SHELL-Nxxx-ARG

More Options

Opens more attributes.

Files/Archives

Defines the full path to the file or archive to upload as a dependency to the HDFS working directory.

Variable Names:

  • Type: %%HDP-SHELL_FILE_DEP-Nxxx-TYPE

  • Path: %%HDP-SHELL_FILE_DEP-Nxxx-PATH

Options

Defines the additional option (Name and Value) to set when executing the job.

Variable Names:

  • Name: %%HDP-SHELL_OPTION-Nxxx-NAME

  • Value: %%HDP-SHELL_OPTION-Nxxx-VAL

Environment Variables

Defines the environment variables for the shell script/command.

Variable Name: %%HDP-SHELL_ENV_VARIABLE-Nxxx-ARG

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.

HDFS Commands Job Attributes

The following table describes the HDFS Commands job attributes.

Attribute

Description

Command

Defines the command for the argument to be performed with job execution.

Variable Name: %%HDP-HDFS_CMD_ACTION-Nxxx-CMD

Arguments

Defines the argument used by the command.

Variable Name: %%HDP-HDFS_CMD_ACTION-Nxxx-ARG
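
Each command and its argument share the same Nxxx index in their variable names. The following sketch generates the paired names for a list of commands. It assumes that xxx is a zero-padded, three-digit index that starts at 001, which this section does not state, and the function name and the sample commands are hypothetical.

```python
def hdfs_command_variables(commands):
    """Map (command, argument) pairs to paired variable names."""
    variables = {}
    for index, (command, argument) in enumerate(commands, start=1):
        # Assumption: Nxxx is N followed by a zero-padded three-digit index.
        slot = f"N{index:03d}"
        variables[f"%%HDP-HDFS_CMD_ACTION-{slot}-CMD"] = command
        variables[f"%%HDP-HDFS_CMD_ACTION-{slot}-ARG"] = argument
    return variables


for name, value in hdfs_command_variables([("-mkdir", "/tmp/out"), ("-rm", "/tmp/old")]).items():
    print(name, "=", value)
```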

HDFS File Watcher Job Attributes

The following table describes the HDFS File Watcher job attributes.

Attribute

Description

File name full path

Defines the full path of the file being watched.

Variable Name: %%HDP-HDFS_FILE_PATH

Min detected size

Determines the minimum file size in bytes to meet the criteria and finish the job as OK. If the file arrives, but the size is not met, the job continues to watch the file.

Variable Name: %%HDP-MIN_DETECTED_SIZE

Max time to wait

Determines the maximum number of minutes to wait for the file to meet the watching criteria. If criteria are not met (file did not arrive, or minimum size was not reached) the job fails after this maximum number of minutes.

Variable Name: %%HDP-MAX_WAIT_TIME
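
The watching criteria described by Min detected size and Max time to wait can be summarized as a single decision. The following is a sketch of that logic only, not the Control-M implementation; the function name and the result labels are hypothetical.

```python
from typing import Optional


def evaluate_watch(
    file_size: Optional[int],
    min_detected_size: int,
    elapsed_minutes: float,
    max_wait_minutes: float,
) -> str:
    """Return "OK", "FAILED", or "WAITING" for one check of the watched file.

    file_size is the size in bytes, or None if the file has not arrived.
    """
    if file_size is not None and file_size >= min_detected_size:
        return "OK"        # file arrived and reached the minimum size
    if elapsed_minutes >= max_wait_minutes:
        return "FAILED"    # criteria not met within the maximum wait time
    return "WAITING"       # keep watching


print(evaluate_watch(None, 1024, 5, 60))   # WAITING
print(evaluate_watch(512, 1024, 30, 60))   # WAITING
print(evaluate_watch(2048, 1024, 30, 60))  # OK
print(evaluate_watch(512, 1024, 60, 60))   # FAILED
```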

File Name Variable

Defines the variable name that is used in succeeding jobs.

Variable Name: %%HDP-FW_DETECTED_FILE_NAME_VAR

Impala Job Attributes

The following table describes the Impala job attributes.

Attribute

Description

Source

Determines the source type to execute the queries, as follows:

  • Query File: Executes a query file as defined by Query File Full Path.

  • Open Query: Executes an open query command as defined by Query.

Variable Name: %%HDP-IMPALA_QUERY_SOURCE

Query File Full Path

Defines the location of the file used to execute the queries.

Variable Name: %%HDP-IMPALA_QUERY_FILE_PATH

Query

Defines the query command used to execute the queries.

Variable Name: %%HDP-IMPALA_OPEN_QUERY

Command Line Options

Defines the sets of attributes and values that are added to the command line.

Variable Name: %%HDP-HDP-IMPALA_CMD_OPTION-Nxxx-ARG

Hive Job Attributes

The following table describes the Hive job attributes.

Attribute

Description

Full path to Hive script

Defines the full path to the Hive script on the Hadoop host.

Variable Name: %%HDP-HIVE_SCRIPT_NAME

Script Parameters

Defines the list of parameters for the script.

Variable Names:

  • Name: %%HDP-HIVE_SCRIPT_PARAM_Nxxx-NAME

  • Value: %%HDP-HIVE_SCRIPT_PARAM-Nxxx-VAL

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.

Java-Map-Reduce Job Attributes

The following table describes the Java Map-Reduce job attributes.

Attribute

Description

Full path to Jar

Defines the full path to the jar containing the Map Reduce Java program on the Hadoop host.

Variable Name: %%HDP-JAVA_JAR_NAME

Main Class

Defines the class that is included in the jar containing a main function and the map reduce implementation.

Variable Name: %%HDP-JAVA_MAIN_CLASS

Arguments

Defines the argument used by the command.

Variable Name: %%HDP-JAVA_Nxxx_ARG

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.

Oozie Job Attributes

The following table describes the Oozie job attributes.

Attribute

Description

Job Properties File

Defines the job properties file path.

Variable Name: %%HDP-OOZIE_JOB_PROPERTIES_FILE

Job Properties (Add/Overwrite)

Defines the Oozie job properties.

A set of properties consists of the following:

  • Key: Defines a key name associated with each property.

    Variable Name: %%HDP-OOZIE_PROPERTY-Nxxx-KEY

  • Value: Defines a value associated with each property.

    Variable Name: %%HDP-OOZIE_PROPERTY-Nxxx-VAL

You can add new properties or override property values defined in the Job Properties File.
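
The add-or-override behavior amounts to a merge in which the job-level properties take precedence over the Job Properties File. The following sketch shows that merge; the function name and the property values are hypothetical.

```python
def effective_properties(file_properties: dict, job_properties: dict) -> dict:
    """Merge properties; job-level values override values from the file."""
    merged = dict(file_properties)
    merged.update(job_properties)
    return merged


from_file = {"nameNode": "hdfs://namenode:8020", "queueName": "default"}
from_job = {"queueName": "etl", "oozie.use.system.libpath": "true"}
print(effective_properties(from_file, from_job))
# {'nameNode': 'hdfs://namenode:8020', 'queueName': 'etl', 'oozie.use.system.libpath': 'true'}
```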

Rerun from point of failure

Determines whether to re-execute an Oozie job from the point of its failure.

Pig Job Attributes

The following table describes the Pig job attributes.

Attribute

Description

Full Path to Pig Program

Defines the full path to the Pig program on the Hadoop host.

Variable Name: %%HDP-PIG_PROG_NAME

Pig Program Parameters

Defines the list of program parameters.

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.

Properties

Defines a list of properties (Name and Value) to be executed with the job.

These properties override the Hadoop defaults.

Archives

Defines the location of the Hadoop archives.

Files

Defines the location of the Hadoop files.

Spark Job Attributes

The following table describes the Spark job attributes.

Attribute

Description

Program Type

Determines the Spark program type, as follows:

  • Python Script: As defined by Full Path to Script.

  • Java / Scala Application: As defined by Application Jar File and Full Path to Script.

Variable Name: %%HDP-SPARK_PROG_TYPE

Full Path to Script

Defines the full path to the python script to execute.

Variable Name: %%HDP-SPARK_FULL_PATH_TO_PYTHON_SCRIPT

Application Jar File

Defines the path to the jar including your application and all the dependencies.

Variable Name: %%HDP-SPARK_APP_JAR_FULL_PATH

Main Class to Run

Defines the main class of the application.

Variable Name: %%HDP-SPARK_MAIN_CLASS_TO_RUN

Application Arguments

Defines the arguments that are added at the end of the Spark command line, either after the main class for Java / Scala applications or after the script for Python scripts.

Variable Name: %%HDP-SPARK_Nxxx_ARG

Command Line Options

Defines the sets of attributes and values that are added to the command line.

Variable Names:

  • Name: %%HDP-SPARK_OPTION-Nxxx-NAME

  • Value: %%HDP-SPARK_OPTION-Nxxx-VAL

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.
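
The way these attributes combine can be pictured as a spark-submit command line. The exact command that Control-M issues is not shown in this section, so the following sketch assumes the standard spark-submit layout (options, then the application, then the application arguments). The function name, the program type labels, and the sample paths are hypothetical.

```python
def build_spark_command(program_type, program_path, main_class=None,
                        options=None, arguments=None):
    """Assemble a spark-submit argument list from the job attributes."""
    command = ["spark-submit"]
    # Command Line Options: name/value pairs placed before the application.
    for name, value in (options or {}).items():
        command += [name, value]
    if program_type == "java_scala":
        if not main_class:
            raise ValueError("a Java / Scala application needs a main class")
        command += ["--class", main_class]
    # Full Path to Script (Python) or Application Jar File (Java / Scala).
    command.append(program_path)
    # Application Arguments go at the end of the command line.
    command += list(arguments or [])
    return command


print(build_spark_command("python", "/apps/wordcount.py",
                          options={"--master": "yarn"}, arguments=["/data/in"]))
# ['spark-submit', '--master', 'yarn', '/apps/wordcount.py', '/data/in']
```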

Sqoop Job Attributes

The following table describes the Sqoop job attributes.

Attribute

Description

Command Editor

Defines any valid Sqoop command necessary for job execution. Sqoop can only be used for job execution if defined in Sqoop connection attributes.

Variable Name: %%HDP-SQOOP_COMMAND

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.

Properties

Defines a list of properties (Name and Value) to be executed with the job.

These properties override the Hadoop defaults.

Archives

Defines the location of the Hadoop archives.

Files

Defines the location of the Hadoop files.

Streaming Job Attributes

The following table describes the Streaming job attributes.

Attribute

Description

Input Path

Defines the input file for the Mapper step.

Variable Name: %%HDP-INPUT_PATH

Output Path

Defines the HDFS output path for the Reducer step.

Variable Name: %%HDP-OUTPUT_PATH

Mapper Command

Defines the command that executes as a mapper.

Variable Name: %%HDP-MAPPER_COMMAND

Reducer Command

Defines the command that executes as a reducer.

Variable Name: %%HDP-REDUCER_COMMAND

Streaming Options

Defines the sets of attributes (Name and Value) that are added to the end of the Streaming command line.

Variable Names:

  • Name: %%HDP-STREAMING_PARAM-Nxxx-NAME

  • Value: %%HDP-STREAMING_PARAM-Nxxx-VAL

Generic Options

Defines the sets of attributes (Name and Value) that are added to the Streaming command line.

Variable Names:

  • Name: %%HDP-GENERIC_PARAM-Nxxx-NAME

  • Value: %%HDP-GENERIC_PARAM-Nxxx-VAL

Append Yarn aggregated logs to output

Determines whether to add Yarn aggregated logs to the job output.
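
The Streaming attributes correspond to the parts of a Hadoop Streaming command line, where generic options precede the streaming options. The following sketch assembles such a command. It is an illustration under that assumption, not the command that Control-M issues; the function name, the jar path, and the sample values are hypothetical.

```python
def build_streaming_command(streaming_jar, input_path, output_path, mapper, reducer,
                            generic_options=None, streaming_options=None):
    """Assemble a Hadoop Streaming argument list from the job attributes."""
    command = ["hadoop", "jar", streaming_jar]
    # Generic Options come first on the Streaming command line.
    for name, value in (generic_options or {}).items():
        command += [name, value]
    command += ["-input", input_path, "-output", output_path,
                "-mapper", mapper, "-reducer", reducer]
    # Streaming Options are added to the end of the command line.
    for name, value in (streaming_options or {}).items():
        command += [name, value]
    return command


print(build_streaming_command(
    "/opt/hadoop/hadoop-streaming.jar", "/data/in", "/data/out", "cat", "wc",
    generic_options={"-D": "mapreduce.job.name=wordcount"},
    streaming_options={"-numReduceTasks": "2"},
))
```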

Tajo Job Attributes

The following table describes the Tajo job attributes.

Attribute

Description

Command Source

Determines the source of the Tajo command, as follows:

  • Input File: Executes the Tajo command from an input file as defined by the Full File Path.

    Variable Name: %%HDP-TAJO_INPUT_FILE

  • Open Query: Executes an open query as the Tajo command, as defined by Open Query.

    Variable Name: %%HDP-TAJO_OPEN_QUERY

Full File Path

Defines the file path of the input file that executes the Tajo command.

Open Query

Defines the query.

Variable Name: %%HDP-TAJO_OPEN_QUERY

Snowflake Job

Snowflake is a cloud-computing platform that enables you to process, analyze, and store your data.

The following table describes the Snowflake job type attributes.

Attribute

Action

Description

Connection Profile

N/A

Determines one of the following types of authorization credentials, which are used to connect Control-M to Snowflake:

  • Snowflake

  • Snowflake IdP

Rules:

  • Characters: 1−30

  • Case Sensitive: Yes

  • Invalid Characters: Blank spaces.

Database

N/A

Determines the database that the job uses.

Schema

N/A

Determines the schema that the job uses.

A schema is an organizational model that describes the layout and definition of fields and tables, and their relationships to each other, in a database.

Action

N/A

Determines one of the following Snowflake actions to perform:

  • SQL Statement: Executes any number of Snowflake-supported SQL statements, such as queries, calling or creating procedures, database maintenance tasks, and creating and editing tables.

  • Copy from Query: Copies a queried database and schema into an existing or new file in cloud storage.

  • Copy from Table: Copies data from an existing table into a file in cloud storage.

  • Create Table and Query: Creates a table, populated by a query, in the specified database and schema.

  • Copy into Table: Copies data from a cloud storage location into an existing table in Snowflake.

  • Start or Pause Snowpipe: Starts or pauses an existing Snowpipe.

  • Stored Procedure: Calls an existing procedure and its arguments.

  • Snowpipe Load Status: Monitors the status of a Snowpipe for a set period of time.

Snowflake SQL Statement

SQL Statement

Determines one or more Snowflake-supported SQL commands.

Rule: Must be written in a single line, with strings separated by one space only.
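A statement that was written across several lines can be normalized to satisfy the single-line rule before it is entered in this attribute. The following Python sketch is one possible approach. The helper name `to_single_line` is hypothetical and is not part of Control-M or Snowflake. Note that it collapses every run of whitespace, including whitespace inside quoted string literals, so review statements that contain such literals.

```python
def to_single_line(sql: str) -> str:
    """Collapse all runs of whitespace (including newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empties.
    return " ".join(sql.split())


# Sample statement for illustration only.
statement = """
SELECT id,
       name
FROM   customers
WHERE  region = 'EMEA';
"""
single_line = to_single_line(statement)
print(single_line)
```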

Statement Timeout

All Actions

Determines the maximum number of seconds to execute the job in Snowflake.

Show More Options

All Actions

Determines whether the following job-defining attributes are displayed:

  • Parameters

  • Role

  • Bindings

  • Warehouse

Parameters

All Actions

Defines Snowflake-provided parameters, in JSON format, that let you control how data is presented.

{
  "param1":"value1",
  "param2":"value2"
}
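If the parameters are produced by a script, generating the JSON text with a JSON library avoids missing braces or commas. The following Python sketch reuses the placeholder names from the example above. It only shows how to produce well-formed JSON and does not imply which Snowflake parameters a job needs.

```python
import json

# Placeholder names and values taken from the example above;
# substitute the Snowflake parameters that your job requires.
parameters = {"param1": "value1", "param2": "value2"}

# json.dumps guarantees balanced braces, quoting, and separators.
parameters_json = json.dumps(parameters, indent=2)
print(parameters_json)
```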

Role

All Actions

Determines the Snowflake role used for this Snowflake job.

A role is an entity that can be assigned privileges on secure objects. You can be assigned one or more roles from a limited selection.

Bindings

All Actions

Defines the values, in JSON format, to bind to the variables used in the Snowflake job.

For more information on bindings, see the Snowflake documentation.

The following JSON script defines two binding variables:

"1":
    {
      "type": "FIXED",
      "value": "123"
    },
"2":
    {
      "type": "TEXT",
      "value": "String"
    }
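The binding entries are keyed by position ("1", "2", and so on), and each entry carries a type and a value. The following Python sketch builds such a structure from an ordered list of pairs. The helper name `build_bindings` is hypothetical. The sketch prints a complete JSON object with enclosing braces, and whether the attribute expects those braces is an assumption to verify in your environment.

```python
import json


def build_bindings(values):
    """Map ordered (type, value) pairs to 1-based positional binding entries."""
    return {
        str(position): {"type": snowflake_type, "value": str(value)}
        for position, (snowflake_type, value) in enumerate(values, start=1)
    }


# Same two bindings as in the example above.
bindings = build_bindings([("FIXED", 123), ("TEXT", "String")])
bindings_json = json.dumps(bindings, indent=2)
print(bindings_json)
```

Values are converted to strings because the example above passes the numeric value 123 as the string "123".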

Warehouse

All Actions

Determines the warehouse used in the Snowflake job.

A warehouse is a cluster of virtual machines that processes a Snowflake job.

Show Output

All Actions

Determines whether to show a full JSON response in the log output.

Status Polling Frequency

All Actions

Determines the number of seconds to wait before checking the status of the job.

Default: 20
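The effect of a polling frequency can be pictured as a loop that checks the status, waits the configured number of seconds, and checks again. The following Python sketch is a generic illustration and does not show how Control-M is implemented. The function `wait_for_completion`, the status names, and the `timeout_seconds` parameter are all hypothetical, and the job is simulated so that the example runs without any connection.

```python
import time


def wait_for_completion(get_status, polling_seconds=20, timeout_seconds=None,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal state or time runs out."""
    started = clock()
    while True:
        status = get_status()
        if status in ("SUCCEEDED", "FAILED"):  # hypothetical terminal states
            return status
        if timeout_seconds is not None and clock() - started >= timeout_seconds:
            return "TIMED_OUT"
        # Wait the configured interval before the next status check.
        sleep(polling_seconds)


# Simulated job: reports RUNNING twice, then SUCCEEDED.
responses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
waits = []  # records each requested wait instead of really sleeping
result = wait_for_completion(lambda: next(responses),
                             polling_seconds=20, sleep=waits.append)
print(result, waits)
```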

Query to Location

Copy from Query

Defines the cloud storage location.

Query Input

Copy from Query

Defines the query used for copying the data.

Storage Integration

  • Copy from Query

  • Copy from Table

  • Copy into Table

Defines the storage integration object, which stores an Identity and Access Management (IAM) entity and an optional set of blocked cloud storage locations.

Overwrite

  • Copy from Query

  • Copy from Table

Determines whether to overwrite an existing file in the cloud storage, as follows:

  • Yes

  • No

File Format

  • Copy from Query

  • Copy from Table

Determines one of the following file formats for the saved file:

  • JSON

  • CSV

Copy Destination

Copy from Table

Defines where the JSON or CSV file is saved.

You can save to Amazon Web Services, Google Cloud Platform, or Microsoft Azure.

Example: s3://<bucket name>/

From Table

Copy from Table

Defines the name of the copied table.

Create Table Name

Create Table and Query

Defines the name of the new or existing table that is populated by the query results.

Query

Create Table and Query

Defines the query whose results populate the table.

Snowpipe Name

  • Start or Pause Snowpipe

  • Snowpipe Load Status

Defines the name of the Snowpipe.

A Snowpipe loads data from files when they are ready, or staged.

Table Name

Copy into Table

Defines the name of the table that the data is copied into.

From Location

Copy into Table

Defines the cloud storage location from where the data is copied, in CSV or JSON format.

Example: s3://location-path/FileName.csv
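A location string of this form can be checked before it is entered in the attribute. The following Python sketch splits a location into its parts with the standard library and tests whether the file suffix is CSV or JSON. The helper name `describe_location` is hypothetical, and treating the file suffix as an indicator of the file format is an assumption made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats named in the attribute description above.
SUPPORTED_SUFFIXES = {".csv", ".json"}


def describe_location(location: str) -> dict:
    """Split a cloud storage location into scheme, container, and key."""
    parsed = urlparse(location)
    suffix = PurePosixPath(parsed.path).suffix.lower()
    return {
        "scheme": parsed.scheme,
        "container": parsed.netloc,
        "key": parsed.path.lstrip("/"),
        "supported_suffix": suffix in SUPPORTED_SUFFIXES,
    }


info = describe_location("s3://location-path/FileName.csv")
print(info)
```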

Start or Pause Snowpipe

Start or Pause Snowpipe

Determines whether to start or pause the Snowpipe, as follows:

  • Start Snowpipe

  • Pause Snowpipe

Stored Procedure Name

Stored Procedure

Defines the name of the stored procedure.

Procedure Argument

Stored Procedure

Defines the value of the argument in the stored procedure.

Table Name

Snowpipe Load Status

Defines the table that is monitored when loaded by the Snowpipe.

Stage Location

Snowpipe Load Status

Defines the cloud storage location.

A stage is a pointer that indicates where data is stored, or staged.

Example: s3://CloudStorageLocation/

Days Back

Snowpipe Load Status

Determines the number of days to monitor the Snowpipe load status.

Status File Cloud Location Path

Snowpipe Load Status

Defines the cloud storage location where a CSV file log is created.

The CSV file log details the load status for each Snowpipe.

Storage Integration

Snowpipe Load Status

Defines the Snowflake configuration for the cloud storage location that is defined in the previous attribute (Status File Cloud Location Path).

Example: S3_INT