Skip to main content

Extract Text

Working with text data is a common task, but it can be challenging to find and select the necessary parts of data from the text. This is especially true when dealing with large datasets or unstructured text. Extracting specific information, such as email addresses, phone numbers, or dates, requires efficient and accurate methods.

In this article, we have collected various approaches to extract parts of text by pattern, by position, or using AI-driven techniques. These methods can help streamline the process and improve accuracy.


Extract email address

Email addresses can be extracted from text as patterns because their structure is known: xxx@xxx.xx. This regular format allows for the use of regular expressions (regex) to identify and extract email addresses from a body of text.

The common approach to extracting text by pattern is using function regex_extract. Regular expressions provide a powerful way to search for patterns in text.

To extract an email from any text column, you can use the following expression:

regex_extract(`column`, "([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)")

This regex pattern matches the typical structure of email addresses, capturing sequences of alphanumeric characters, dots, underscores, and hyphens followed by an '@' symbol and a domain.

If the text contains an email at any position, the email address will be returned.

Extract email address


Extract digits

To extract numerical values (digits) from a text, regular expressions can be applied as well. Regular expressions are highly effective for identifying and extracting sequences of digits.

To extract numerical values from any text column, you can use the following regex pattern:

regex_extract(`column`, "\d+")

Extract digits


Extract data from JSON string

Extracting information from strings represented as JSON is relatively easy. The key function used with JSON is json_path.

Let's check a few different cases.

Assume we have the following JSON string in the billing_address column that represents the object:

{
"zip": "33422",
"city": "Branne",
"name": "Alexander",
"phone": "093 762 3994",
"country": "France",
"address1": "23 Rue Emmanuel Roy",
"last_name": "Example"
}

We need the following expressions to extract different pieces of information:

json_path(`billing_address`, "zip") // Extract ZIP => 33422
json_path(`billing_address`, "phone") // Extract Telephone => 093 762 3994
json_path(`billing_address`, "name") + " " + json_path(`billing_address`, "last_name") // Full Name => Alexander Example

In the case of an array:

[
{"code": "product-a", "price": "17.5", "title": "Product A"},
{"code": "product-b", "price": "62.3", "title": "Product B"}
]

You can use the following expressions:

json_path(`shipping_lines`, "0.code") // product-a
json_path(`shipping_lines`, "#.code") // ["product-a", "product-b"]

Also, it is sometimes useful to use that function in combination with searching for some value:

[
{"option": "size", "value": "XS"},
{"option": "color", "value": "Blue"},
{"option": "qty", "value": 1}
]
json_path(`options`, '#(name="color").value') // Blue

These examples demonstrate how to use the json_path function to extract specific information from JSON strings, whether they represent objects or arrays.

json_path allows you to extract or find any information in the JSON string. If you can't find an example applicable to your case, just contact our support team, and we will help you.