Problem 16.2 Levenshtein distance variant 4

#1

This variant asks: Given a string A and a regular expression r, what is the string in the language of r that is closest to A?
I just want to verify that I understand the question correctly:
Given regex “abcd” and string “aefbcgd”, the solution should return “bc”.
Is that correct? Thanks!

1 Like

#2

Hi @lgtout ,

The regular expression we discussed here shall contains “*” or “.” or “?” as those will create a set of strings. Then the problem here would become among those strings produced by the regular expression, which one is the closest one.

0 Likes

#3

Ah, I see. That’s quite different from what I thought. Ok, thanks @tsunghsienlee.

0 Likes

#4

@tsunghsienlee

If I apply your explanation, then this variant is a very trivial change to problem 16.2. All that’s necessary is to apply the regex and iterate through the resulting strings, applying the Levenshtein function (from the 16.2 solution) to each and finding the one with the least Levenshtein distance.

Am I missing something?

None of the other variants I’ve seen in the book add so little difficulty to the original problem as this one. Which makes me think I must be missing something in my interpretation of the problem statement.

Thanks!

0 Likes

#5

Hi @lgtout,

I am not sure that is the right way as there are almost infinite strings when the regex contains those metacharacters (i.e., * and .). So the difficulty here is how to solve that problem without enumerating all possible combinations of strings.

0 Likes

#6

Thanks, @tsunghsienlee.
I’m still having trouble understanding this problem. Could you give me a concrete example? Thanks.

0 Likes

#7

Like A*b and AAb. One is regex and the other is pure string.

0 Likes

#8

Thanks. Are these the inputs to the solution function? And what would be the result in this case?

0 Likes

#9

Yes, and the output is the distance, which shall be an int.

0 Likes

#10

Thanks.

Is the result in this case AAb, since this has a distance of 0?

Can the regex contain more than one of “.”, “*”, “+”?

How about combinations such as “.*”, “.+”?

Are “.*?+” the only metacharacters allowed?

2 Likes

#11

@tsunghsienlee I think this problem, as worded in the book, would benefit from clearer wording in future editions. In particular, I think the phrase “…in the language of the regular expression r …” is at best ambiguous, at worst a misdirection.

0 Likes

#12

There will be a overhaul of variants in the future. Because we think variants is a good way for readers keep practicing what they want. However, we still have some urgent things we would like to focus on now so we decide to keep this what we have now.

0 Likes

#13

Personally, I love having the variants. And I hope they stay in future editions. But perhaps because most people don’t do the variants, issues with them may have a tendency of not being caught as early as issues with the “main” problems.

0 Likes

#14

I love variants in the way it provides a way for readers to practice similar problems with the knowledge they just learn. However, lacking solution due to space is a problem here. Therefore we would like move it to problems with test data by using our newly built testing framework. However, it would take time for us to prepare those test data and rewrite those variants. Furthermore, we might need to consolidate those again.

1 Like

#15

Given this formulation, here is my stab at the problem.

The core idea of the below algorithms is to calculate the levenshtein distance, while taking into account the special cases of the regex expression.

*, +, ? and . are supported for the regex expression, without support for expression blocks

There are two versions below, one using recursion with memoization and the other using dynamic programming

Recursion with memoization algorithm
def regex_dist(regex: str, target: str):
    def regex_dist_aux(r_i, t_i):
        if (r_i == -1 and t_i == -1):
            return 0
        if (r_i == -1):
            return t_i + 1
        if (t_i == -1):
            i, counter = r_i, 0 
            while i >= 0:
                char = regex[i]
                i -= 2 if (char == '?' or char == '*' or char == '+') else 1
                counter += 1 if (not (char == '?' or char == '*')) else 0
            return counter
        
        if (memo[r_i][t_i] is not None):
            return memo[r_i][t_i]
        
        # Regex special cases
        if (regex[r_i] == '.'):
            memo[r_i][t_i] = regex_dist_aux(r_i - 1, t_i - 1)
            return memo[r_i][t_i]
        
        if (regex[r_i] == '+' or regex[r_i] == '*' or regex[r_i] == '?'):
            if (regex[r_i-1] == target[t_i]):
                if (regex[r_i] == '?'):
                    memo[r_i][t_i] = regex_dist_aux(r_i - 2, t_i - 1)
                else:
                    memo[r_i][t_i] = min(regex_dist_aux(r_i - 2, t_i - 1), regex_dist_aux(r_i, t_i - 1))
            else:
                additional_cost = 1 if (regex[r_i] == '+') else 0
                memo[r_i][t_i] = min(regex_dist_aux(r_i - 2, t_i - 1) + 1,
                           regex_dist_aux(r_i, t_i - 1) + 1,
                           regex_dist_aux(r_i - 2, t_i) + additional_cost)
            return memo[r_i][t_i]

        # Other characters
        if (regex[r_i] == target[t_i]):
            memo[r_i][t_i] = regex_dist_aux(r_i - 1, t_i - 1)
        else:
            memo[r_i][t_i] = min(regex_dist_aux(r_i - 1, t_i - 1) + 1,
                   regex_dist_aux(r_i, t_i - 1) + 1,
                   regex_dist_aux(r_i - 1, t_i) + 1)
        return memo[r_i][t_i]
        
        
    memo = [[None] * (len(target) + 1) for _ in range(len(regex) + 1)]
    return regex_dist_aux(len(regex) - 1, len(target) -1)

Complete gist of recursion with memoization algorithm here: https://gist.github.com/lopespm/53a215d0b2b0518b52b6bb6687bdaff6

Dynamic programming algorithm
def regex_dist(regex: str, target: str):

    current = [None] * (len(regex) + 1)
    
    for t_i in range(len(target) + 1):
        prev = [item for item in current]
        for r_i in range(len(regex) + 1):
            if (r_i == 0 and t_i == 0):
                current[r_i] = 0
            elif (r_i == 0):
                current[r_i] = t_i
            elif (t_i == 0):
                i, counter = r_i - 1, 0 
                while i >= 0:
                    char = regex[i]
                    i -= 2 if (char == '?' or char == '*' or char == '+') else 1
                    counter += 1 if (not (char == '?' or char == '*')) else 0
                current[r_i] = counter
                
            # Regex special cases
            elif (regex[r_i-1] == '.'):
                current[r_i] = prev[r_i - 1]

            elif (regex[r_i-1] == '+' or regex[r_i-1] == '*' or regex[r_i-1] == '?'):
                if (regex[r_i-1-1] == target[t_i-1]):
                    if (regex[r_i-1] == '?'):
                        current[r_i] = prev[r_i - 2]
                    else:
                        current[r_i] = min(prev[r_i - 2], prev[r_i])
                else:
                    additional_cost = 1 if (regex[r_i-1] == '+') else 0
                    current[r_i] = min(prev[r_i - 2] + 1,
                               prev[r_i] + 1,
                               current[r_i - 2] + additional_cost)

            # Other characters
            elif (regex[r_i-1] == target[t_i-1]):
                current[r_i] = prev[r_i - 1]
            else:
                current[r_i] = min(prev[r_i - 1] + 1,
                       prev[r_i] + 1,
                       current[r_i - 1] + 1)
    
    return current[-1]

Complete gist of dynamic programming algorithm here: https://gist.github.com/lopespm/2362a77e7bd230a4622a43709c195826

0 Likes

#16

I think the problem can be better explained by its brute force solution.

  1. Given a regex r and a string s.
  2. Find all possible strings which result in a match with r.
  3. Find the string with minimum distance from the original string s.
0 Likes