Logging, Monitoring, and Debugging
Overview
There are a variety of ways in which you can log, monitor, and debug the execution of platform application services, using the following tools and APIs:
- Logging application services (Log forwarder and Elasticsearch)
- Checking service status
- Kubernetes tools
- Event logs
- Cluster support logs
- API error information
For further troubleshooting assistance, visit Iguazio Support.
Logging Application Services
The platform has a default tenant-wide log-forwarder application service (log-forwarder) for forwarding application-service logs.
The logs are forwarded to an instance of the Elasticsearch open-source search and analytics engine by using the open-source Filebeat log-shipper utility.
The log-forwarder service is disabled by default. To enable it, edit the log-forwarder service from the dashboard's Services page and configure the URL of the Elasticsearch instance to which the logs should be forwarded.
Typically, the log-forwarder service should be configured to work with your own remote off-cluster instance of Elasticsearch.
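Once the service is enabled and configured, you can verify that logs are reaching Elasticsearch by querying it directly. The following is a minimal sketch; the endpoint URL is a placeholder, and it assumes that Filebeat writes to its default filebeat-* index pattern (adjust both to your deployment):
ES_URL="https://elasticsearch.example.com:9200"
# List the Filebeat indices created from the forwarded logs.
curl -sS "${ES_URL}/_cat/indices/filebeat-*?v"
# Fetch the five most recent forwarded log entries.
curl -sS "${ES_URL}/filebeat-*/_search?size=5&sort=@timestamp:desc&pretty"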
Note the following regarding the configured Elasticsearch URL:
- The default transfer protocol, which is used when the URL doesn't begin with "http://" or "https://", is HTTPS.
- The default port, which is used when the URL doesn't end with ":<port number>", is port 80 for HTTP and port 443 for HTTPS.
Checking Service Status
On the dashboard's Services page, you can check the current status of each application service.
- Press Inspect to see the detailed status of a service. You can also download the status details as a .txt file from the popup.
Kubernetes Tools
You can use the Kubernetes kubectl CLI to monitor and debug the platform's application services, which run as Kubernetes pods:
- Use the get pods command to display information about the cluster's pods:
  kubectl get pods
- Use the logs command to view the logs for a specific pod; replace POD with the name of one of the pods returned by the get command:
  kubectl logs POD
- Use the top pod command to view pod resource metrics and monitor resource consumption; replace [POD] with the name of one of the pods returned by the get command, or omit it to display metrics for all pods:
  kubectl top pod [POD]
To run these kubectl commands, you must be assigned an appropriate platform service account:
- The get pods and logs commands require the "Log Reader" service account or higher.
- The top pod command requires the "Service Admin" service account.
For more information about the kubectl CLI, see the Kubernetes documentation.
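The following is a minimal sketch of these commands, run against the default-tenant namespace in which the platform's application services typically run (the namespace name and the pod-selection helper are assumptions; adjust them to your environment):
kubectl -n default-tenant get pods              # list the application-service pods
POD=$(kubectl -n default-tenant get pods -o name | head -n 1)
kubectl -n default-tenant logs "${POD}"         # view the logs of a single pod
kubectl -n default-tenant top pod               # resource metrics for all pods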
IGZTOP - Performance Reporting Tool
igztop is a small tool that displays useful information about pods in the default-tenant namespace.
Running igztop
Usage:
igztop pods [--cpu] [--filter=<KEY>=<VALUE>] [--label=<KEY>=<VALUE>] [--columns=<KEY>] [--no-borders] [--no-pager]
igztop nodes [--no-pager]
igztop update
igztop (-h | --help)
igztop --version
Options:
-h --help Show this help message.
-v --version Show the igztop version.
-c --cpu Sort the table by CPU usage, rather than by memory usage (default).
-f --filter=<KEY>=<VALUE> A filtering key-value pair, based on column names, e.g. 'node=k8s-node1', 'name=presto', 'owner=admin'.
-l --label=<KEY>=<VALUE> Filter pods by label, e.g. '-l app=v3iod'
-o --columns=<KEY> Show additional columns. Can be one or a combination of "projects", "gpu", and "resources", e.g. '--columns projects,gpu'. Partial names are supported, e.g. '-o proj'
--no-pager Print the output table to the terminal without paging
Examples
The default output includes the name, memory usage, CPU usage, and node name of each running pod, sorted by memory usage. To sort by CPU usage instead, pass the --cpu or -c flag.
Pods that aren't currently using resources do not appear in the table.
Information about Pods
$ igztop pods
Kubernetes Pods
┃ Name ┃ CPU ┃ Memory ┃ Node ┃
│ jupyter-edmond-847d5bb947-25fjz │ 9m │ 2937Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ v3iod-6mpwh │ 4m │ 2802Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ v3iod-9cgnp │ 4m │ 2800Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ jupyter-amit-59bf47fd7b-vb45f │ 10m │ 2255Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-salesh-75b5b7db68-qttqp │ 9m │ 2237Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ v3io-webapi-96w5x │ 7m │ 1867Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ v3io-webapi-2t9lm │ 6m │ 1864Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ docker-registry-778548878-q7tzb │ 1m │ 1738Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ spark-master-d5d47bbb-s7jqp │ 7m │ 940Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ spark-worker-86cc4d5d9c-4lkdl │ 8m │ 867Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ jupyter-shapira-57db64967c-vkrm4 │ 8m │ 718Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-nirs-5589c8d984-h7tzl │ 6m │ 630Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ mlrun-db-67f46884dd-c298c │ 12m │ 621Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ mlrun-api-chief-6f64c75447-xz58h │ 51m │ 585Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-models-shapira-model-monitoring-stream-5fd48d8696-cjl85 │ 3m │ 508Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ mysql-kf-699c4c75bb-rpxpg │ 3m │ 468Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ mlrun-api-worker-5f9c87bc94-qwkfb │ 6m │ 430Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-streaming-test1-shapira-extract-688cc5b858-vbk6q │ 3m │ 308Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-fraud-demo-edmond-transactions-ingest-f8854c48f-f9hjf │ 2m │ 302Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-streaming-test-shapira-extract-8ff9ddc8c-2gq9z │ 3m │ 301Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-3720-7c9c959759-2vwsl │ 6m │ 276Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-models-shapira-serving-model-6b56c8f87c-v486c │ 1m │ 212Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-models-shapira-serving-function-65cdc595b4-pwhc8 │ 1m │ 202Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-serving-steps-shapira-test-steps-7d9b86d5df-cpdpj │ 1m │ 161Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-serving-steps-shapira-test-steps-archive-59cbff6666w8988 │ 1m │ 159Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-itay-68c5b5b87f-wkkwf │ 9m │ 99Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-dashboard-54df54887c-5vsww │ 2m │ 83Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ ml-pipeline-visualizationserver-685c68cdbd-mj2cw │ 4m │ 83Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ monitoring-prometheus-server-d6588dd47-cvfhq │ 3m │ 69Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ provazio-controller-6b8456dff9-v7zmn │ 7m │ 64Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ grafana-75d598cc96-txrw7 │ 2m │ 58Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ metadata-writer-f59d94448-444g4 │ 1m │ 56Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ ml-pipeline-ui-56b9997fc7-nr9vc │ 4m │ 41Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-streaming-test1-shapira-nuclio-df-69446f7b88-n6kld │ 1m │ 31Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-test-func-d894d4b84-tfd5h │ 1m │ 31Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-test-599845c8ff-lp6bd │ 1m │ 31Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-streaming-test-shapira-nuclio-df-6568fbfcb4-kn54c │ 1m │ 30Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ framesd-6c97f79585-9zhr2 │ 1m │ 29Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-controller-6c9c966b56-6hktm │ 1m │ 25Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ ml-pipeline-858b55c5b-g95cl │ 4m │ 21Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ workflow-controller-564ff94cd4-g4d6w │ 1m │ 20Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ nuclio-scaler-b67469c78-4m2df │ 1m │ 17Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ metrics-server-exporter-77fb887958-mxs4b │ 24m │ 14Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ mpi-operator-7f68c8556f-bjnsn │ 3m │ 13Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ ml-pipeline-viewer-crd-7f886bbf5f-n6vtv │ 1m │ 13Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ nuclio-dlx-6c8c74d497-mh4gh │ 1m │ 13Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ mlrun-ui-6768f98785-nb6v8 │ 0m │ 13Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ metadata-envoy-deployment-6c975596-7qwks │ 4m │ 12Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ authenticator-74cf9cd5f9-hbln6 │ 1m │ 12Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ ml-pipeline-persistenceagent-65fd97c56b-g7c8k │ 1m │ 11Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ ml-pipeline-scheduledworkflow-7cf8c6cd4c-4q66r │ 1m │ 11Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ spark-operator-6dbc5d9566-4m679 │ 1m │ 10Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ keycloak-oauth2-proxy-redis-master-0 │ 16m │ 9Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ v3iod-locator-65c5c44957-6nx6v │ 0m │ 8Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ keycloak-oauth2-proxy-54cb56f465-qp5rw │ 1m │ 6Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ oauth2-proxy-7b68c8d99d-5fv6v │ 1m │ 4Mi │ ip-172-31-0-193.us-east-2.compute.internal │
│ metadata-grpc-deployment-68b6995c89-nq6ns │ 1m │ 3Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ Sum │ 272m │ 27128Mi │ │
└─────────────────────────────────────────────────────────────────┴──────┴─────────┴────────────────────────────────────────────┘
Results can be filtered to match substrings of any column:
$ igztop -f name=jupy
Kubernetes Pods
┃ Name ┃ CPU ┃ Memory ┃ Node ┃
│ jupyter-edmond-847d5bb947-25fjz │ 9m │ 2935Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-amit-59bf47fd7b-vb45f │ 9m │ 2256Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-salesh-75b5b7db68-qttqp │ 7m │ 2232Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-shapira-57db64967c-vkrm4 │ 8m │ 721Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-nirs-5589c8d984-h7tzl │ 6m │ 629Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-3720-7c9c959759-2vwsl │ 7m │ 277Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ jupyter-itay-68c5b5b87f-wkkwf │ 51m │ 134Mi │ ip-172-31-0-55.us-east-2.compute.internal │
│ Sum │ 97m │ 9184Mi │ │
The --columns or -o option can be used to display additional information. The available column keys are projects, gpu, and resources (partial strings are supported). For example, to display the pods that are using GPUs:
$ igztop -o gpu
Kubernetes Pods
┃ Name ┃ CPU ┃ Memory ┃ Node ┃ GPU ┃ GPU % ┃
│ jupyter-58c5bf598f-z86qm │ 10m │ 2936Mi │ ip-172-31-0-86.us-east-2.compute.internal │ 1/1 │ 66% │
│ Sum │ 10m │ 2936Mi │ │ │ │
They can also be used in combination with other options. The example below expands the table with information about function and job pods that belong to MLRun projects, and then filters the list by a specific project:
$ igztop -o proj -f project=models-shapira
┃ Name ┃ CPU ┃ Memory ┃ Node ┃ Project ┃ Owner ┃ MLRun Job ┃ MLRun Function ┃ MLRun Job Type ┃ Nuclio Function ┃
│ nuclio-models-shapira-model-monitoring-stream-5fd48d8696-cjl85 │ 3m │ 508Mi │ ip-172-31-0-55.us-east-2.compute.internal │ models-shapira │ shapira │ │ │ serving │ models-shapira-model-monitoring-stream │
│ nuclio-models-shapira-serving-model-6b56c8f87c-v486c │ 1m │ 212Mi │ ip-172-31-0-193.us-east-2.compute.internal │ models-shapira │ shapira │ │ │ serving │ models-shapira-serving-model │
│ nuclio-models-shapira-serving-function-65cdc595b4-pwhc8 │ 1m │ 202Mi │ ip-172-31-0-193.us-east-2.compute.internal │ models-shapira │ shapira │ │ │ remote │ models-shapira-serving-function │
│ Sum │ 5m │ 922Mi │ │ │ │ │ │ │ │
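Pods can also be filtered by Kubernetes label with the -l/--label option; for example, to list only the v3iod daemon pods (assuming, as in the option's description above, that they carry the app=v3iod label):
$ igztop -l app=v3iod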
Information about Nodes
$ igztop nodes
Kubernetes Nodes
┃ Name ┃ Status ┃ IP Address ┃ Node Group ┃ Instance Type ┃ CPU ┃ Memory ┃
│ ip-172-31-0-193.us-east-2.compute.internal │ Ready │ 172.31.0.193 │ initial │ m5.4xlarge │ 2.63% │ 22.97% │
│ ip-172-31-0-55.us-east-2.compute.internal │ Ready │ 172.31.0.55 │ initial │ m5.4xlarge │ 4.10% │ 42.70% │
Event Logs
The Events page of the dashboard displays platform event logs:
- The Event Log tab displays system event logs.
- The Alerts tab displays system alerts.
- The Audit tab displays a subset of the system events for audit purposes: security events (such as a failed login) and user actions (such as creation and deletion of a container).
You can specify the email of a user with the IT Admin management policy to receive email notifications of events: press the Settings icon and type the user name in the notification settings.
Events in the Event Log Tab
Event class | Event kind | Event description |
---|---|---|
System | System.Cluster.Offline | Cluster 'cluster_name' moved to offline mode | |
System | System.Cluster.Shutdown | Cluster 'cluster_name' shutdown | |
System | System.Cluster.Shutdown.Aborted | Cluster 'cluster_name' shutdown aborted | |
System | System.Cluster.Online | Cluster 'cluster_name' moved to online mode | |
System | System.Cluster.Maintenance | Cluster 'cluster_name' moved to maintenance mode | |
System | System.Cluster.OnlineMaintenance | Cluster 'cluster_name' moved to online maintenance mode | |
System | System.Cluster.Degraded | Cluster 'cluster_name' is running in degraded mode | |
System | System.Cluster.Failback | Cluster 'cluster_name' moved to failback mode | |
System | System.Cluster.DataAccessType.ReadOnly | Successfully changed cluster 'cluster_name' data access type to read only | |
System | System.Cluster.DataAccessType.ReadWrite | Successfully changed cluster 'cluster_name' data access type to read/write | |
System | System.Cluster.DataAccessType.ContainerSpecific | Successfully changed data access type of data containers | |
System | System.Node.Down | Node 'node_name' is down | |
System | System.Node.Offline | Node 'node_name' is offline | |
System | System.Node.Online | Node 'node_name' is online | |
System | System.Node.Initialization | Node 'node_name' is in initialization state | |
Software | Software.ArtifactGathering.Job.Started | Artifact gathering job started on node 'node_name' | |
Software | Software.ArtifactGathering.Job.Succeeded | Artifact gathering completed successfully on node 'node_name' | |
Software | Software.ArtifactGathering.Job.Failed | Artifact gathering failed on node 'node_name' | |
Software | Software.ArtifactBundle.Upload.Succeeded | System logs were uploaded to 'upload_paths' successfully | |
Software | Software.ArtifactBundle.Upload.Failed | Logs collection could not be uploaded to 'upload_paths' | |
Software | Software.IDP.Synchronization.Started | IDP synchronization with 'IDP server' has been started. | |
Software | Software.IDP.Synchronization.Completed | IDP synchronization with 'IDP server' has been completed. | |
Software | Software.IDP.Synchronization.Periodic.Failed | IDP synchronization with 'IDP server' failed to complete periodic update. | |
Software | Software.IDP.Synchronization.Failed | IDP synchronization with 'IDP server' failed | |
Hardware | Hardware.UPS.NoAcPower | UPS 'upsId' connected to Node 'nodeName' lost AC power | |
Hardware | Hardware.UPS.LowBattery | UPS 'upsId' connected to Node 'nodeName' battery is low | |
Hardware | Hardware.UPS.PermanentFailure | UPS 'upsId' connected to Node 'nodeName' in failed state | |
Hardware | Hardware.UPS.AcPowerRestored | UPS 'upsId' connected to Node 'nodeName' AC power restored | |
Hardware | Hardware.UPS.Reachable | UPS 'upsId' connected to Node 'nodeName' is reachable | y |
Hardware | Hardware.UPS.Unreachable | UPS 'upsId' connected to Node 'nodeName' is unreachable | |
Hardware | Hardware.Network.Interface.Up | Network interface to 'interfaceName' on node 'nodeName' - link regained | |
Hardware | Hardware.Network.Interface.Down | Network interface to 'interfaceName' on node 'nodeName' - link disconnected | |
Hardware | Hardware.temperature.high | Drive on node 'nodeName' temperature is above normal. Temperature is 'temp'. | |
Capacity | Capacity.StoragePool.UsedSpace.High | Space on storage pool 'pool_name' has reached current% of the total pool size. | |
Capacity | Capacity.StoragePoolDevice.UsedSpace.High | Space on storage pool device 'storage_pool_device_name' on storage device 'storage_device_name' has reached current% of the total size. | |
Capacity | Capacity.Tenant.UsedSpace.High | Space on tenant | |
Alert | Alert.Test.External | Test description | |
Software | Software.Cluster.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully | |
Software | Software.Cluster.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure | |
Software | Software.Events.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully | |
Software | Software.Events.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure | |
Software | Software.AppServices.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully | |
Software | Software.AppServices.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure | |
Software | Software.ArtifactVersionManifest.Reconfiguration.Completed | Reconfiguration on cluster 'cluster_name' completed successfully | |
Software | Software.ArtifactVersionManifest.Reconfiguration.Failed | Reconfiguration on cluster 'cluster_name' failure | |
System | System.DataContainer.Normal | DataContainer 'data_container_id' is running in normal mode. | |
System | System.DataContainer.Degraded | DataContainer 'data_container_id' is running in degraded mode. | |
System | System.DataContainer.Mapping.GenerationFailed | Failed to generate container mapping for DataContainer 'data_container_id' | |
System | System.DataContainer.Mapping.DistributionFailed | Failed to distribute container mapping for DataContainer 'data_container_id' | |
System | System.DataContainer.Resync.Complete | Resync completed on container 'data_container_id' | |
System | System.DataContainer.DataAccessType.ReadOnly | Data container 'data_container_id' is running in read only mode | |
System | System.DataContainer.DataAccessType.ReadWrite | Data container 'data_container_id' is running in read/write mode | |
System | System.DataContainer.DataAccessType.Update.Failed | Failed to set data access type for data container 'data_container_id' | |
System | System.Failover.Completed | Failover completed successfully | |
System | System.Failover.Failed | Failover failed | |
Software | Software.Email.Sending.Failed | Sending email failed due to 'reason' | |
Capacity | Capacity.StoragePool.UsableCapacity.CalculationFailed | Failed to calculate usable capacity of storage pool | |
Hardware | Hardware.Disks.DiskFailed | Storage device 'device_name' on node 'node_name' has failed | |
System | System.AppCluster.Initialization.Succeeded | App cluster 'name' was initialized successfully | |
System | System.AppCluster.Initialization.Failed | Failed to initialize app cluster 'name' | |
System | System.AppCluster.Services.Deployment.Succeeded | Default app services manifest for tenant 'tenant_name' was deployed successfully | |
System | System.AppCluster.Services.Deployment.Failed | Failed to deploy default app services manifest for tenant 'tenant_name' | |
System | System.Tenancy.Tenant.Creation.Succeeded | Tenant 'tenant_name' was successfully created | |
System | System.Tenancy.Tenant.Creation.Failed | Failed to create tenant | |
System | System.Tenancy.Tenant.Deletion.Succeeded | Tenant 'tenant_name' was successfully deleted | |
System | System.Tenancy.Tenant.Deletion.Failed | Failed to delete tenant 'tenant_name' | |
System | System.AppCluster.Tenant.Creation.Succeeded | Tenant 'tenant_name' was successfully created on app cluster 'app_cluster' | |
System | System.AppCluster.Tenant.Creation.Failed | Failed to create tenant on app cluster | |
System | System.AppCluster.Tenant.Deletion.Succeeded | Tenant 'tenant_name' was successfully deleted from app cluster 'app_cluster' | |
System | System.AppCluster.Tenant.Deletion.Failed | Failed to delete tenant 'tenant_name' from app cluster | |
Capacity | Capacity.StorageDevice.OutOfSpace | Space on storage device under 'service_id' on node 'node_id' is depleted | |
System | System.AppCluster.Tenant.Update.Succeeded | App services for tenant 'tenant_name' were successfully updated | |
System | System.AppCluster.Tenant.Update.Failed | Failed to update app services for tenant 'tenant_name' | |
System | System.AppNode.Created | App node record 'name' was created successfully | |
System | System.AppNode.Online | App node 'name' is online | |
System | System.AppNode.Unstable | App node 'name' is unstable | |
System | System.AppNode.Down | App node 'name' is down | |
System | System.AppNode.Deleted | App node 'name' was successfully deleted | |
System | System.AppNode.Offline | App node 'name' is offline | |
System | System.AppNode.NotReady | App node 'name' is not ready | |
System | System.AppNode.Preemptible.NotReady | Preemptible app node 'name' is not ready | |
System | System.AppNode.ScalingUp | App node 'name' is scaling up | |
System | System.AppNode.ScalingDown | App node 'name' is scaling down | |
System | System.AppNode.OutOfDisk | App node 'name' is out of disk space | |
System | System.AppNode.MemoryPressure | App node 'name' is low on memory | |
System | System.AppNode.DiskPressure | App node 'name' is low on disk space | |
System | System.AppNode.PIDPressure | App node 'name' has too many processes | |
System | System.AppNode.NetworkUnavailable | App node 'name' has network connectivity problem | |
System | System.AppCluster.Shutdown.Failed | App cluster shutdown failed | |
System | System.AppCluster.Online | App cluster 'name' is online | |
System | System.AppCluster.Unstable | App cluster 'name' is unstable | |
System | System.AppCluster.Down | App cluster 'name' is down | |
System | System.AppCluster.Offline | App cluster 'name' is offline | |
System | System.AppCluster.Degraded | App cluster 'name' is degraded | |
System | System.AppService.Online | App service 'name' is online | |
System | System.AppService.Offline | App service 'name' is down | |
System | System.CoreAppService.Online | App service 'name' is online (Core services: v3iod, webapi, framesd, nuclio, docker_registry, pipelines, mlrun) | |
System | System.CoreAppService.Offline | App service 'name' is down | |
Background Process | Task.Container.ImportS3.Started | S3 container 'container_id' import started. | |
Background Process | Task.Container.ImportS3.Failed | S3 container 'container_id' import failed. | |
Background Process | Task.Container.ImportS3.Completed | S3 container 'container_id' import completed successfully. | |
Security | Security.User.Login.Succeeded | user 'username' successfully logged into the system | |
Security | Security.User.Login.Failed | user 'username' failed logging into the system | |
Security | security.Session.Verification.Failed | Failed to verify session for user 'username', session id 'session_id' |
Events in the Audit Tab
Event class | Event kind | Event description |
---|---|---|
UserAction | UserAction.Container.Created | container 'container_id' created on cluster 'cluster_name' |
UserAction | UserAction.Container.Deleted | container 'container_id' deleted on cluster 'cluster_name' |
UserAction | UserAction.Container.Updated | container 'container_id' updated on cluster 'cluster_name' |
UserAction | UserAction.Container.Creation.Failed | container 'container_id' on cluster 'cluster_name' could not be created |
UserAction | UserAction.Container.Update.Failed | container 'container_id' on cluster 'cluster_name' could not be updated |
UserAction | UserAction.Container.Deletion.Failed | container 'container_id' on cluster 'cluster_name' could not be deleted |
UserAction | UserAction.User.Created | user 'username' created on cluster 'cluster_name' |
UserAction | UserAction.User.Creation.Failed | user 'username' on cluster 'cluster_name' could not be created |
UserAction | UserAction.UserGroup.Created | User group 'group' created on cluster 'cluster_name' |
UserAction | UserAction.UserGroup.Deletion.Failed | User group 'group' on cluster 'cluster_name' could not be deleted |
UserAction | UserAction.User.Deleted | user 'username' deleted on cluster 'cluster_name' |
UserAction | UserAction.User.Deletion.Failed | user 'username' on cluster 'cluster_name' could not be deleted |
UserAction | UserAction.User.Updated | user 'username' updated on cluster 'cluster_name' |
UserAction | UserAction.User.Update.Failed | user 'username' on cluster 'cluster_name' could not be updated |
UserAction | UserAction.UserGroup.Updated | User group 'group name' updated on cluster 'cluster_name' |
UserAction | UserAction.UserGroup.Update.Failed | User group 'group name' on cluster 'cluster_name' could not be updated |
UserAction | UserAction.UserGroup.Creation.Failed | User group 'group name' on cluster 'cluster_name' could not be created |
UserAction | UserAction.UserGroup.Deleted | User group 'group name' deleted on cluster 'cluster_name' |
UserAction | UserAction.DataAccessPolicy.Applied | Data access policy for container 'name' on cluster 'cluster' applied |
UserAction | UserAction.Tenant.Creation.FailedPasswordEmail | Sending password creation email on tenant creation failed |
UserAction | UserAction.User.Creation.FailedPasswordEmail | Sending password creation email on user creation failed |
UserAction | UserAction.Services.Deployment.Succeeded | App services for tenant 'tenant_name' were deployed successfully |
UserAction | UserAction.Services.Deployment.Failed | Failed to deploy app services for tenant 'tenant_name' |
UserAction | UserAction.Project.Created | Project 'name' was created successfully |
UserAction | UserAction.Project.Creation.Failed | Project 'name' creation failed |
UserAction | UserAction.Project.Updated | Project 'name' updated successfully |
UserAction | UserAction.Project.Update.Failed | Project 'name' update failed |
UserAction | UserAction.Project.Deleted | Project 'name' deleted successfully |
UserAction | UserAction.Project.Deletion.Failed | Project 'name' deletion failed |
UserAction | UserAction.Project.Owner.Updated | Owner in project 'name' was changed from %s to %s |
UserAction | UserAction.Project.User.Role.Updated | Role for user 'username' in project 'name' was updated from 'old_role' to 'new_role' |
UserAction | UserAction.Project.UserGroup.Role.Updated | Role for user group 'group name' in project 'project_name' was updated from 'old_role' to 'new_role' |
UserAction | UserAction.Project.User.Added | User 'username' was added to project 'name' as 'role_name' |
UserAction | UserAction.Project.UserGroup.Added | User group 'group name' was added to project 'name' as 'role_name' |
UserAction | UserAction.Project.User.Removed | User 'username' was removed from project 'name' |
UserAction | UserAction.Project.UserGroup.Removed | User group 'group name' was removed from project 'name' |
UserAction | UserAction.Network.Created | Network 'name' created on cluster 'cluster_name' |
UserAction | UserAction.Network.Creation.Failed | Network 'name' on cluster 'cluster_name' could not be created |
UserAction | UserAction.Network.Updated | Network 'name' updated on cluster 'cluster_name' |
UserAction | UserAction.Network.Update.Failed | Network 'name' on cluster 'cluster_name' could not be updated |
UserAction | UserAction.Network.Deleted | Network 'name' deleted on cluster 'cluster_name' |
UserAction | UserAction.Network.Deletion.Failed | Network 'name' on cluster 'cluster_name' could not be deleted |
UserAction | UserAction.StoragePool.Created | storage pool 'name' created on cluster 'cluster_name' |
UserAction | UserAction.StoragePool.Creation.Failed | storage pool 'name' on cluster 'cluster_name' could not be created |
UserAction | UserAction.Cluster.Updated | Cluster 'cluster_name' updated |
UserAction | UserAction.Cluster.Update.Failed | Cluster 'cluster_name' could not be updated |
UserAction | UserAction.Cluster.Deleted | Cluster 'cluster_name' deleted |
UserAction | UserAction.Cluster.Deletion.Failed | cluster 'cluster_name' could not be deleted |
UserAction | UserAction.Cluster.Shutdown | Cluster 'cluster_name' is down per user request 'username' |
Cluster Support Logs
Users with the IT Admin management policy can collect and download support logs for the platform clusters from the dashboard. Log collection is triggered for a data cluster, but the logs are collected from both the data and application cluster nodes.
You can trigger collection of cluster support logs from the dashboard in one of two ways (note that you cannot run multiple collection jobs concurrently):
- On the Clusters page, open the action menu for a data cluster in the clusters table (Type = "Data"), and then select the Collect Logs menu option.
- On the Clusters page, display the Support Logs tab for a specific data cluster, either by selecting the Support logs option from the cluster's action menu or by selecting the data cluster and then selecting the Support Logs tab; then select Collect Logs from the action toolbar. Optionally, select filter criteria in the Select a filter dialog and press Collect Logs again.
  Filters reflect both the log source and the log level. The non-full options produce more concise logs; the full versions provide complete logs, which might be requested by Customer Support. The context filter is usually used by Customer Support, which supplies the context string if required.
You can view the status of all collection jobs and download archive files of the collected logs from the data cluster's Support Logs tab.
API Error Information
The platform APIs return error codes and error and warning messages to help you debug problems with your application. See, for example, the Error Information section of the Data-Service Web-API General Structure reference documentation.
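For example, the following sketch captures both the HTTP status code and the response body of a web-API request so that any returned error information can be inspected. The endpoint URL, container name, and credentials below are hypothetical placeholders, not actual platform values:
WEBAPI_URL="https://webapi.default-tenant.example.com:8443"   # hypothetical web-API endpoint
CONTAINER="mycontainer"                                       # hypothetical container name
# Send the request, save the response body, and capture the HTTP status code.
http_status=$(curl -sS -o /tmp/response.json -w '%{http_code}' \
  --user "myuser:mypassword" \
  "${WEBAPI_URL}/${CONTAINER}/")
echo "HTTP status: ${http_status}"
# On an error, the response body contains the error information returned by the API.
cat /tmp/response.json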