Class: Unisec::Rugrep
- Inherits: Object
  - Object
  - Unisec::Rugrep
- Defined in: lib/unisec/rugrep.rb
Overview
Ruby grep: Ruby regular expression search for Unicode code point names
Constant Summary
- UCD_DERIVEDNAME = File.join(__dir__, '../../data/DerivedName.txt')
  UCD Derived names file location
Class Method Summary
- .regrep(regexp) ⇒ Array<Hash>
  Search code points by (Ruby) regexp.
- .regrep_display(regexp) ⇒ Object
  Display a CLI-friendly output listing all code points corresponding to a regular expression.
- .regrep_display_slow(regexp) ⇒ Object
  Display a CLI-friendly output listing all code points corresponding to a regular expression.
- .regrep_slow(regexp) ⇒ Array<Hash>
  Search code points by (Ruby) regexp.
- .ucd_derivedname_version ⇒ String
  Returns the version of Unicode used in UCD local file (data/DerivedName.txt).
Class Method Details
.regrep(regexp) ⇒ Array<Hash>
Search code points by (Ruby) regexp
# File 'lib/unisec/rugrep.rb', line 32
def self.regrep(regexp)
  out = []
  file = File.new(UCD_DERIVEDNAME)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point as integer and the name
    cp_int, name = line.split(';')
    cp_int = cp_int.chomp.to_i(16)
    name.lstrip!
    next unless /#{regexp}/i.match?(name) # compiling regexp once is surprisingly not faster

    out << {
      char: TwitterCldr::Utils::CodePoints.to_string([cp_int]),
      codepoint: cp_int,
      name: name
    }
  end
  out
end
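The parsing and filtering steps of `regrep` can be tried in isolation. The following is a minimal sketch, not the library's code: the sample text is hand-written in the `code point ; name` layout the method parses, `search_names` is a hypothetical helper, the pattern is assumed to be passed as a String, and `Array#pack('U')` stands in for `TwitterCldr::Utils::CodePoints.to_string`.

```ruby
# Hand-written sample in the layout regrep parses (not the real data file).
SAMPLE = <<~UCD
  # DerivedName-15.1.0.txt

  0041          ; LATIN CAPITAL LETTER A
  00E9          ; LATIN SMALL LETTER E WITH ACUTE
  1F600         ; GRINNING FACE
UCD

# Hypothetical helper mirroring the filtering logic of regrep.
# `source` is assumed to be a String holding the regexp source.
def search_names(text, source)
  pattern = Regexp.new(source, Regexp::IGNORECASE)
  text.each_line(chomp: true).filter_map do |line|
    # Skip empty lines and comments
    next if line.empty? || line.start_with?('#')

    # Split "hex code point ; name" and trim the padding around both fields
    cp_hex, name = line.split(';', 2)
    name = name.strip
    next unless pattern.match?(name)

    cp_int = cp_hex.strip.to_i(16)
    # pack('U') is a stand-in for TwitterCldr's code point to string conversion
    { char: [cp_int].pack('U'), codepoint: cp_int, name: name }
  end
end

results = search_names(SAMPLE, 'letter [ae]')
results.each { |r| puts format('%<cp>04X %<name>s', cp: r[:codepoint], name: r[:name]) }
```

Unlike the documented method, this sketch compiles the pattern once before the loop; the source comment notes that doing so was not measurably faster there.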
.regrep_display(regexp) ⇒ Object
Display a CLI-friendly output listing all code points corresponding to a regular expression.
# File 'lib/unisec/rugrep.rb', line 64
def self.regrep_display(regexp)
  codepoints = regrep(regexp)
  codepoints.each do |cp|
    puts "#{Properties.deccp2stdhexcp(cp[:codepoint]).ljust(7)} #{cp[:char].ljust(4)} #{cp[:name]}"
  end
  nil
end
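The column layout produced by the `ljust` calls can be illustrated without the rest of the library. This is a sketch under assumptions: `std_hex` is a hypothetical stand-in for `Properties.deccp2stdhexcp`, assumed here to render the `U+XXXX` notation, and `display_row` is an illustrative helper, not part of Unisec.

```ruby
# Assumption: stand-in for Properties.deccp2stdhexcp, guessed to render "U+XXXX".
def std_hex(codepoint)
  format('U+%04X', codepoint)
end

# Build one output row: hex column padded to 7, char column padded to 4, then the name.
def display_row(cp)
  "#{std_hex(cp[:codepoint]).ljust(7)} #{cp[:char].ljust(4)} #{cp[:name]}"
end

row = display_row({ char: 'A', codepoint: 0x41, name: 'LATIN CAPITAL LETTER A' })
puts row
```

`ljust` only pads shorter strings, so a hex value longer than 7 characters would shift the columns of that row to the right rather than being truncated.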
.regrep_display_slow(regexp) ⇒ Object
Display a CLI-friendly output listing all code points corresponding to a regular expression.
# File 'lib/unisec/rugrep.rb', line 118
def self.regrep_display_slow(regexp)
  codepoints = regrep_slow(regexp)
  codepoints.each do |cp|
    puts "#{Properties.deccp2stdhexcp(cp[:codepoint]).ljust(7)} #{cp[:char].ljust(4)} #{cp[:name]}"
  end
  nil
end
.regrep_slow(regexp) ⇒ Array<Hash>
⚠ This method is very time-consuming (about 1 min) and unoptimized: it executes one regexp match per code point.
Search code points by (Ruby) regexp
# File 'lib/unisec/rugrep.rb', line 94
def self.regrep_slow(regexp)
  out = []
  TwitterCldr::Shared::CodePoint.each do |cp|
    next unless /#{regexp}/oi.match?(cp.name) # compiling regexp once is surprisingly not faster

    out << {
      char: TwitterCldr::Utils::CodePoints.to_string([cp.code_point]),
      codepoint: cp.code_point,
      name: cp.name
    }
  end
  out
end
.ucd_derivedname_version ⇒ String
Returns the version of Unicode used in UCD local file (data/DerivedName.txt)
# File 'lib/unisec/rugrep.rb', line 76
def self.ucd_derivedname_version
  first_line = File.open(UCD_DERIVEDNAME, &:readline)
  first_line.match(/-(\d+\.\d+\.\d+)\.txt/).captures.first
end
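The version extraction can be checked on a single string. In this sketch the header line is an assumption about what the first line of the data file looks like, and the version number is illustrative only.

```ruby
# Assumed shape of the first line of data/DerivedName.txt (illustrative version number).
first_line = "# DerivedName-15.1.0.txt\n"

# Same regexp as the method: capture "major.minor.patch" between "-" and ".txt".
version = first_line.match(/-(\d+\.\d+\.\d+)\.txt/).captures.first
puts version

# match returns nil when the line has no version, so a guarded variant
# can use safe navigation instead of raising NoMethodError.
missing = 'no version here'.match(/-(\d+\.\d+\.\d+)\.txt/)&.captures&.first
```

The documented method does not guard against a non-matching first line; the safe-navigation variant is shown only as one possible way to handle that case.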