Debugging and Troubleshooting Gestalt Logging

Here is a general process for debugging an environment that doesn't seem to be responding properly.

1. Is there a problem with the ES cluster?

You can query the Elasticsearch REST API to confirm that there are any indices in it (a quick scripted version of these checks follows the list).

  • kubectl port-forward es-client-pod-name 9200:9200
    • this binds the pod's port 9200 to your local machine for easy debugging
  • http localhost:9200/_cluster/health
    • if you see anything other than "green" here, you have a problem.
  • http localhost:9200/_cat/indices
    • this lists any indices that ES knows about. If there are none, you probably haven't configured your log aggregator properly; start there.
  • http localhost:9200/_cat/nodes
    • this lists the nodes that the cluster knows about. If they're not all there, you have a problem; try restarting the problematic ones. WARNING: do not restart a healthy node, because with nodes missing the replication is already broken and you will lose data.
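A minimal sketch of the same checks in one pass, assuming the port-forward above is running and that curl and jq are installed locally:

```bash
# check overall cluster status, then list indices and nodes
status=$(curl -s localhost:9200/_cluster/health | jq -r .status)
echo "cluster status: $status"
[ "$status" = "green" ] || echo "WARNING: status is not green -- inspect the nodes/indices output below"
curl -s localhost:9200/_cat/indices
curl -s localhost:9200/_cat/nodes
```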

2. Are there logs in the index in ES?

  • after mapping the ports for ES using port-forward
  • http "localhost:9200/<index_name>/_search?pretty=true&q=*:*"
    • this shows "all" records for this index; however, the result set is limited in size by default. You can use the &size= parameter to control how many results you want.
    • this also frees you up to use something like jq to find the results you're interested in (see the sketch after this list)
    • there are helper scripts in the payloads directory for querying both Kubernetes and DCOS: pod.sh & container.sh
  • if there are no records, or you don't see the records for the containers, then you probably have an issue with your aggregator
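For example, a minimal sketch of pulling a batch of records and filtering them with jq, assuming the port-forward from step 1 is still active and that `<index_name>` is one of the indices it listed:

```bash
# fetch up to 100 documents from the index and print just their source payloads
http "localhost:9200/<index_name>/_search?q=*:*&size=100" \
  | jq '.hits.hits[]._source'
```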

3. Can the logging provider see the ES cluster?

  • kubectl port-forward logging-provider-pod-name 9000:9000
  • http localhost:9000/cluster/stats
    • this should display some quick stats about the ES cluster. If you see them, great: your provider is speaking with ES. If not, you have a communication problem. Check the values of the ES_SERVICE_HOST and ES_SERVICE_PORT variables; are they visible to this container?
    • a good check here is to exec into that container and then issue a curl command against the ES cluster using those coordinates (a combined check is sketched after this list):
    • kubectl exec -ti logging-provider-pod /bin/bash
    • curl $ES_SERVICE_HOST:$ES_SERVICE_PORT/_cat/indices
      • any HTTP connection errors here indicate a connectivity issue that has to be resolved in the usual network-debugging ways.
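A minimal sketch of that check, run from inside the logging-provider container, assuming the pod image has a shell and curl available:

```bash
# confirm the ES coordinates are set and that ES answers on them
echo "ES at ${ES_SERVICE_HOST:-<unset>}:${ES_SERVICE_PORT:-<unset>}"
curl -sf "${ES_SERVICE_HOST}:${ES_SERVICE_PORT}/_cluster/health" \
  || echo "cannot reach ES from this container -- check service discovery and network policy"
```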

4. Is the CaaS provider linked to the logging provider?

  • http https://<Gestalt Meta URL>/root/providers/<caas_provider_id>

    • look for the linked_providers entry; it should look like this (a jq one-liner for extracting it follows the example)

    json "linked_providers": [ { "id": "045e0398-80bc-4022-b2d0-23c5677582ef", "name": "log-provider", "type": "Gestalt::Configuration::Provider::Logging", "typeId": "e1782fef-4b7c-4f75-b8b8-6e6e2ecd82b2" } ] * if you need to add the provider, do so via the Gestalt UI.

5. Does the logging provider have a SERVICE_VHOST_0 entry? And is it reachable?

  • this is the variable that the UI uses to contact the provider. It needs to be reachable FROM THE BROWSER, so in most cases it has to be plumbed for external access using an ingress, an ELB, DNS, or some combination of these (a quick reachability check is sketched below).
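A minimal sketch of that reachability check, run from your workstation rather than from inside the cluster; the hostname is a placeholder for whatever SERVICE_VHOST_0 is set to, and /cluster/stats is assumed here to be exposed through the vhost the same way it is on port 9000 in step 3:

```bash
# any successful HTTP response proves the vhost resolves and routes externally
curl -sv "https://<value-of-SERVICE_VHOST_0>/cluster/stats" -o /dev/null
```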

6. Everything is configured but the UI still says : "The container hasn't logged anything so far..."

  • Try to adjust the "Time" entry on the UI in the top right to an appropriate timeframe "Last Day" is usually pretty good.
  • If it's still not displaying, then you probably have too many indices in ES for the query to terminate in time.
  • If you've configured the S3 repository on the logging provider you can do this : http localhost:9000/clean
  • try configuring a periodic lambda that does this once a day and you won't run into this in the future.
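A minimal sketch of what that periodic lambda could run, assuming it can reach the logging provider service inside the cluster on port 9000 as in step 3; the service hostname is a placeholder:

```bash
# call the provider's /clean endpoint once per run (schedule this daily)
curl -sf "http://<logging-provider-service-host>:9000/clean" \
  && echo "old log indices cleaned" \
  || echo "clean failed -- check connectivity to the logging provider"
```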