Overview
Authentication is the process of verifying the identity of a user or service. There are two places where Authentication occurs inside DataHub:
- DataHub frontend service when a user attempts to log in to the DataHub application.
- DataHub backend service when making API requests to DataHub.
In this document, we'll take a closer look at both.
Authentication in the Frontend
Authentication of normal users of DataHub takes place in two phases.
At login time, authentication is performed by either DataHub itself (via username / password entry) or a third-party Identity Provider. Once the identity of the user has been established, and credentials validated, a persistent session token is generated for the user and stored in a browser-side session cookie.
DataHub provides three mechanisms for authentication at login time:
- Native Authentication which uses username and password combinations natively stored and managed by DataHub, with users invited via an invite link.
- Single Sign-On with OpenID Connect to delegate authentication responsibility to third party systems like Okta or Google/Azure Authentication. This is the recommended approach for production systems.
- JaaS Authentication for simple deployments where authenticated users are part of some known list or invited as a Native DataHub User.
In subsequent requests, the session token is used to represent the authenticated identity of the user, and is validated by DataHub's backend service (discussed below). Eventually, the session token expires (after 24 hours by default), at which point the end user is required to log in again.
DataHub also supports Guest users, who can access the system without an explicit login. Guest authentication is disabled in the default configuration. When Guest access is enabled, accessing DataHub through a configurable URL path logs the user in as an existing user that is designated as the guest. The privileges of guest users are controlled by adjusting the privileges of that designated guest user.
Authentication in the Backend (Metadata Service)
When a user makes a request for data within DataHub, the request is authenticated by DataHub's Backend (Metadata Service) via a JSON Web Token. This applies both to requests originating from the DataHub application and to programmatic calls to DataHub APIs. There are three types of tokens that are important:
- Session Tokens: Generated for users of the DataHub web application. By default, they are valid for 24 hours.
These tokens are encoded and stored inside browser-side session cookies. The duration a session token is valid for is configurable via the MAX_SESSION_TOKEN_AGE environment variable on the datahub-frontend deployment. Additionally, the AUTH_SESSION_TTL_HOURS environment variable configures the expiration time of the actor cookie in the user's browser; when that cookie expires, the user is also prompted to log in again. The difference between the two is that the actor cookie expiration only affects the browser session, so the session token can still be used programmatically afterwards. Once the session token itself expires, it can no longer be used programmatically either, because it is created as a JWT with an expiration claim.
- Personal Access Tokens: These are tokens generated via the DataHub settings panel useful for interacting with DataHub APIs. They can be used to automate processes like enriching documentation, ownership, tags, and more on DataHub. Learn more about Personal Access Tokens here.
- OAuth Provider Tokens: JWTs issued by external OAuth2/OIDC providers (like Okta, Auth0, Azure AD) can be used for service-to-service authentication. This enables seamless integration with existing OAuth infrastructure and is ideal for automated services and applications. Learn more about OAuth Provider authentication here.
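Because these tokens are JWTs that carry an expiration claim, a client can check whether a token it holds has already lapsed before sending it. The following is a minimal sketch using only the Python standard library. It assumes the token uses the standard three-segment JWT layout with a numeric `exp` claim; the helper names `jwt_claims` and `is_expired` are illustrative and not part of DataHub. It only inspects the payload and does not verify the signature, so it is no substitute for the validation the Metadata Service performs.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """Return True if the token has an `exp` claim at or before `now`."""
    exp = jwt_claims(token).get("exp")
    if exp is None:
        # Assumption: a token without `exp` is treated as non-expiring here.
        return False
    current = time.time() if now is None else now
    return current >= exp
```

A script that reuses a session token or Personal Access Token could call `is_expired(token)` first and prompt for a fresh token instead of waiting for the request to be rejected.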
To learn more about DataHub's backend authentication, check out Introducing Metadata Service Authentication.
Credentials must be provided as Bearer Tokens inside of the Authorization header in any request made to DataHub's API layer.
Authorization: Bearer <your-token>
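As a rough illustration of the header above, the sketch below builds an authenticated request with Python's standard library. The base URL `http://localhost:8080`, the `/api/graphql` path, and the helper name `build_graphql_request` are assumptions made for the example; substitute the address and endpoint of your own deployment.

```python
import json
import urllib.request


def build_graphql_request(token: str, query: str,
                          base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a Bearer token.

    The base URL and the /api/graphql path are assumptions for this sketch.
    """
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/graphql",
        data=body,
        headers={
            # The token goes in the Authorization header as a Bearer credential.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_graphql_request(token, "{ __typename }"))`, where the query is only a placeholder.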
As with the frontend, the backend can optionally enable Guest authentication. If Guest authentication is enabled, all API calls made to the backend without an Authorization header are treated as coming from the guest user, and the privileges associated with the designated guest user apply to those requests.
Note that in DataHub local quickstarts, Authentication at the backend layer is disabled for convenience. This leaves the backend
vulnerable to unauthenticated requests and should not be used in production. To enable
backend (token-based) authentication, simply set the METADATA_SERVICE_AUTH_ENABLED=true environment variable
for the datahub-gms container or pod.
References
For a quick video on the topic of users and groups within DataHub, have a look at DataHub Basics — Users, Groups, & Authentication 101