Usually we use a tool like Boost Spirit Qi to retrieve information from a source file. However, in some situations (such as building a syntax highlighter) that is not enough, and we also need metadata about the information. This short article describes a convenient way to additionally obtain the position of each data point.

The Boost Spirit repository offers the iter_pos parser (boost::spirit::repository::qi::iter_pos), which provides access to the underlying iterator. The iterator itself doesn't provide the position directly, but we can calculate it. To do this, we need to know the starting point of the input.
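The calculation itself needs nothing from Spirit. Here is a minimal sketch using only the standard library; the helper name offset_from is hypothetical and not part of Boost:

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper: number of elements between the start of the input
// and the current iterator. Assumes `current` is reachable from `start`.
template <typename Iterator>
std::size_t offset_from(Iterator start, Iterator current) {
  return static_cast<std::size_t>(std::distance(start, current));
}
```

For the input "Hello world!", an iterator pointing at the 'w' yields offset_from(input.begin(), it) == 6. With random-access iterators such as std::string::const_iterator this is a constant-time subtraction; with forward-only iterators std::distance has to walk the range.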

We could store the starting point next to the parser, but that would make the parser harder to use, since we would need to set it each time before parsing. A better solution is to embed this data in a parser component, so that the parser itself can set the reference point. We provide a small encapsulation:

template<typename Iterator>
struct CurrentPos {
  CurrentPos() {
    save_start_pos = qi::omit[boost::spirit::repository::qi::iter_pos[
            phx::bind(&CurrentPos::setStartPos, this, qi::_1)]];
    current_pos = boost::spirit::repository::qi::iter_pos[
            qi::_val = phx::bind(&CurrentPos::getCurrentPos, this, qi::_1)];
  }

  qi::rule<Iterator> save_start_pos;
  qi::rule<Iterator, std::size_t()> current_pos;

private:
  void setStartPos(const Iterator &iterator) {
    start_pos_ = iterator;
  }

  std::size_t getCurrentPos(const Iterator &iterator) {
    return std::distance(start_pos_, iterator);
  }

  Iterator start_pos_;
};

This component provides two parsing rules:

  • save_start_pos: Must be invoked at the beginning of parsing to store the reference position. It does not expose an attribute.
  • current_pos: Invoked each time we want the current position of the parser, which it exposes as a std::size_t attribute.

This method works well for plain ASCII source code. However, as soon as you need to process Unicode files and expect additional symbols, the position will be wrong. Some Unicode symbols take more than one char (byte) in UTF-8 encoding, and the underlying iterator doesn't know about that, so std::distance returns a byte offset rather than a character position. This article describes how to fix that.
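As a small illustration of the mismatch, independent of any particular fix, the sketch below counts code points instead of bytes. It assumes the input is valid UTF-8, and the function name utf8_code_points is hypothetical:

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in [first, last), assuming valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted,
// so every code point is counted exactly once, at its leading byte.
template <typename Iterator>
std::size_t utf8_code_points(Iterator first, Iterator last) {
  std::size_t count = 0;
  for (; first != last; ++first) {
    const unsigned char byte = static_cast<unsigned char>(*first);
    if ((byte & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}
```

For the string "ä b" encoded as UTF-8 (ä is the two bytes 0xC3 0xA4), std::distance from the start to the 'b' gives 3, while utf8_code_points over the same range gives 2. Note that this counts code points, not user-perceived characters, so combining sequences would still be counted as several units.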

Usage

To demonstrate this component in action, we want to parse words and store the position next to each of them. For simplicity, a word is any run of non-whitespace characters. The data type is just a tuple in our case:

typedef std::tuple<std::size_t, std::string> word_t;

This leads to the following two rules:

qi::rule<iterator_type, std::string(), qi::space_type> string = 
    qi::lexeme[+(qi::char_ - qi::space)];

qi::rule<iterator_type, word_t(), qi::space_type> word = 
    current_pos.current_pos >> string;

Of course, it is essential to invoke save_start_pos at the beginning of parsing, so we simply add it to the start rule:

qi::rule<iterator_type, std::vector<word_t>(), qi::space_type> start = 
    current_pos.save_start_pos >> *(word);

Complete example

Here is the full example to play with:

#define BOOST_SPIRIT_USE_PHOENIX_V3
#define BOOST_SPIRIT_UNICODE

#include <boost/fusion/adapted/std_tuple.hpp>

#include <boost/spirit/include/phoenix.hpp>
namespace phx = boost::phoenix;

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/repository/include/qi_iter_pos.hpp>
namespace qi = boost::spirit::qi;

#include <iostream>
#include <iterator>  // std::distance
#include <string>
#include <tuple>
#include <vector>    // std::vector<word_t>

//======================================================================
template<typename Iterator>
struct CurrentPos {
  CurrentPos() {
    save_start_pos = qi::omit[boost::spirit::repository::qi::iter_pos[
            phx::bind(&CurrentPos::setStartPos, this, qi::_1)]];
    current_pos = boost::spirit::repository::qi::iter_pos[
            qi::_val = phx::bind(&CurrentPos::getCurrentPos, this, qi::_1)];
  }

  qi::rule<Iterator> save_start_pos;
  qi::rule<Iterator, std::size_t()> current_pos;

private:
  void setStartPos(const Iterator &iterator) {
    start_pos_ = iterator;
  }

  std::size_t getCurrentPos(const Iterator &iterator) {
    return std::distance(start_pos_, iterator);
  }

  Iterator start_pos_;
};

//======================================================================
int main() {
  std::string input("Hello world!");

  typedef std::string::const_iterator iterator_type;

  iterator_type first(input.begin()), last(input.end());

  typedef std::tuple<std::size_t, std::string> word_t;

  CurrentPos<iterator_type> current_pos;

  qi::rule<iterator_type, std::string(), qi::space_type> string = 
      qi::lexeme[+(qi::char_ - qi::space)];

  qi::rule<iterator_type, word_t(), qi::space_type> word = 
      current_pos.current_pos >> string;

  qi::rule<iterator_type, std::vector<word_t>(), qi::space_type> start = 
      current_pos.save_start_pos >> *(word);

  std::vector<word_t> data;
  bool result = qi::phrase_parse(first, last, start, qi::space, data);
  if (result) {
    result = first == last;
  }

  if (result) {
    for (const auto &e : data) {
      std::cout << "Position: " << std::get<0>(e) << std::endl 
                << "Word:     " << std::get<1>(e) << std::endl;
    }
  } else {
    std::cout << "Failure" << std::endl;
  }
}

Output:

Position: 0
Word:     Hello
Position: 6
Word:     world!